English 中文(简体)
mahout lucene document clustering howto?
原标题:

I m reading that i can create mahout vectors from a lucene index that can be used to apply the mahout clustering algorithms. http://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text

I would like to apply K-means clustering algorithm in the documents in my Lucene index, but it is not clear how can i apply this algorithm (or hierarchical clustering) to extract meaningful clusters with these documents.

In this page http://cwiki.apache.org/confluence/display/MAHOUT/k-Means says that the algorithm accepts two input directories: one for the data points and one for the initial clusters. My data points are the documents? How can i "declare" that these are my documents (or their vectors) , simply take them and do the clustering?

sorry in advance for my poor grammar

Thank you

最佳回答

If you have vectors, you can run KMeansDriver. Here is the help for the same.

Usage:
 [--input <input> --clusters <clusters> --output <output> --distance <distance>
--convergence <convergence> --max <max> --numReduce <numReduce> --k <k>
--vectorClass <vectorClass> --overwrite --help]
Options
  --input (-i) input                The Path for input Vectors. Must be a
                                    SequenceFile of Writable, Vector
  --clusters (-c) clusters          The input centroids, as Vectors.  Must be a
                                    SequenceFile of Writable, Cluster/Canopy.
                                    If k is also specified, then a random set
                                    of vectors will be selected and written out
                                    to this path first
  --output (-o) output              The Path to put the output in
  --distance (-m) distance          The Distance Measure to use.  Default is
                                    SquaredEuclidean
  --convergence (-d) convergence    The threshold below which the clusters are
                                    considered to be converged.  Default is 0.5
  --max (-x) max                    The maximum number of iterations to
                                    perform.  Default is 20
  --numReduce (-r) numReduce        The number of reduce tasks
  --k (-k) k                        The k in k-Means.  If specified, then a
                                    random selection of k Vectors will be
                                    chosen as the Centroid and written to the
                                    clusters output path.
  --vectorClass (-v) vectorClass    The Vector implementation class name.
                                    Default is SparseVector.class
  --overwrite (-w)                  If set, overwrite the output directory
  --help (-h)                       Print out help

Update: Get the result directory from HDFS to local fs. Then use ClusterDumper utility to get the cluster and list of documents in that cluster.

问题回答

A pretty good howto is here: integrating apache mahout with apache lucene

@ maiky You can read more about reading the output and using clusterdump utility in this page -> https://cwiki.apache.org/confluence/display/MAHOUT/Cluster+Dumper





相关问题
adding an index to sql server

I have a query that gets run often. its a dynmaic sql query because the sort by changes. SELECT userID, ROW_NUMBER(OVER created) as rownumber from users where divisionID = @divisionID and ...

Linq to SQL nvarchar problem

I have discovered a huge performance problem in Linq to SQL. When selecting from a table using strings, the parameters passed to sql server are always nvarchar, even when the sql table is a varchar. ...

TableView oval button for Index/counts

Can someone help me create an index/count button for a UITableView, like this one? iTunes http://img.skitch.com/20091107-nwyci84114dxg76wshqwgtauwn.preview.jpg Is there an Apple example, or other ...

Move or copy an entity to another kind

Is there a way to move an entity to another kind in appengine. Say you have a kind defines, and you want to keep a record of deleted entities of that kind. But you want to separate the storage of ...

热门标签