Interpreting output from mahout clusterdumper

Question

I ran a clustering test on crawled pages (more than 25 thousand documents, a personal data set). I then ran clusterdump:

$MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-1/ --output clusteranalyze.txt

After running the cluster dumper, the output contains 25 "VL-xxxxx{}" elements:

    VL-24130{n=1312 c=[0:0.017, 10:0.007, 11:0.005, 14:0.017, 31:0.016, 35:0.006, 41:0.010, 43:0.008, 52:0.005, 59:0.010, 68:0.037, 72:0.056, 87:0.028, ... ] r=[0:0.442, 10:0.271, 11:0.198, 14:0.369, 31:0.421, ... ]}
    ...
    VL-24868{n=311 c=[0:0.042, 11:0.016, 17:0.046, 72:0.014, 96:0.044, 118:0.015, 135:0.016, 195:0.017, 318:0.040, 319:0.037, 320:0.036, 330:0.030, ...] r=[0:0.740, 11:0.287, 17:0.576, 72:0.239, 96:0.549, 118:0.273, ...]}

How do I interpret this output?

In short: I am looking for document identifiers belonging to a particular cluster.

What is the meaning of:

VL-x?
n=y
c=[z:z', ...]
r=[z'':z''', ...]

Does 0: 0.017 mean that “0” is the identifier of the document that belongs to this cluster?

I have already read on the Mahout wiki pages what CL, n, c and r mean. But can someone please explain them to me better, or point me to a resource where they are explained in more detail?

Sorry if these are stupid questions, but I am new to Apache Mahout and am using it as part of my course on clustering.

hadoop cluster-analysis mahout k-means

lucif Apr 27 '11 at 13:52

4 answers

user1167371 · Answer 1 · 2012-01-24T15:53:57+0000

By default, kmeans clustering uses a WeightedVector, which does not include the data point name. So you will want to create the sequence files yourself using NamedVector. There is a one-to-one correspondence between the number of seq files and the map tasks, so if your map capacity is 12, split your data into 12 parts when creating the seqfiles. Creating a NamedVector:
```
// Wrap a sparse vector of the given cardinality and attach the data point's name (arrField[0])
vector = new NamedVector(new SequentialAccessSparseVector(Cardinality), arrField[0]);
```
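
The splitting step mentioned above (one seq file per map task) could be sketched in plain Java as follows. This is only an illustration under assumptions: `InputSplitter` and `split` are hypothetical names, the records are stand-in document ids, and writing each slice out as its own sequence file of NamedVector values is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Mahout): splits records into a fixed number of parts. */
public class InputSplitter {

    /** Distributes the records round-robin over `parts` lists of roughly equal size. */
    public static <T> List<List<T>> split(List<T> records, int parts) {
        if (parts <= 0) {
            throw new IllegalArgumentException("parts must be positive");
        }
        List<List<T>> slices = new ArrayList<>();
        for (int i = 0; i < parts; i++) {
            slices.add(new ArrayList<>());
        }
        for (int i = 0; i < records.size(); i++) {
            slices.get(i % parts).add(records.get(i));
        }
        return slices;
    }

    public static void main(String[] args) {
        // Stand-in data: 25 made-up document ids, split for 12 map tasks.
        List<String> docIds = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            docIds.add("doc-" + i);
        }
        List<List<String>> slices = split(docIds, 12);
        for (int i = 0; i < slices.size(); i++) {
            System.out.println("part " + i + ": " + slices.get(i));
        }
    }
}
```

Each resulting slice would then be written to its own sequence file, so that every map task receives one file.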

Basically, you need to download the clustered points from your HDFS system and write your own code to output the results. Here is the code that I wrote to output the cluster membership of each point.

    import java.io.*;
    import java.util.HashMap;
    import java.util.Set;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.mahout.clustering.WeightedVectorWritable;
    import org.apache.mahout.math.NamedVector;

    public class ClusterOutput {

        /**
         * @param args args[0]: local folder with the downloaded clustered points (part-m files),
         *             args[1]: output file for point-to-cluster membership,
         *             args[2]: output file for the number of points per cluster
         */
        public static void main(String[] args) {
            try {
                BufferedWriter bw;
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.get(conf);
                File pointsFolder = new File(args[0]);
                File[] files = pointsFolder.listFiles();
                bw = new BufferedWriter(new FileWriter(new File(args[1])));
                // Maps cluster id -> number of points assigned to it
                HashMap<String, Integer> clusterIds = new HashMap<String, Integer>(5000);
                for (File file : files) {
                    // Only the part-m files hold clustered points
                    if (file.getName().indexOf("part-m") < 0)
                        continue;
                    SequenceFile.Reader reader =
                            new SequenceFile.Reader(fs, new Path(file.getAbsolutePath()), conf);
                    IntWritable key = new IntWritable();
                    WeightedVectorWritable value = new WeightedVectorWritable();
                    while (reader.next(key, value)) {
                        // The key is the cluster id; the vector name is the data point name
                        NamedVector vector = (NamedVector) value.getVector();
                        String vectorName = vector.getName();
                        bw.write(vectorName + "\t" + key.toString() + "\n");
                        if (clusterIds.containsKey(key.toString())) {
                            clusterIds.put(key.toString(), clusterIds.get(key.toString()) + 1);
                        } else {
                            clusterIds.put(key.toString(), 1);
                        }
                    }
                    bw.flush();
                    reader.close();
                }
                bw.flush();
                bw.close();

                // Write the per-cluster point counts
                bw = new BufferedWriter(new FileWriter(new File(args[2])));
                Set<String> keys = clusterIds.keySet();
                for (String key : keys) {
                    bw.write(key + " " + clusterIds.get(key) + "\n");
                }
                bw.flush();
                bw.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

Carlos Andres Castro · Answer 2 · 2015-03-18T20:02:53+0000

To complete the answer:

VL-x: the cluster identifier
n = y: the number of elements in the cluster
c = [z, ...]: the centroid of the cluster, where the z values are the weights of the different dimensions
r = [z, ...]: the radius of the cluster

More details here: https://mahout.apache.org/users/clustering/cluster-dumper.html
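
To make these fields concrete, here is a minimal sketch that parses one line of the text output shown in the question. It assumes the sparse `index:weight` form from the question (no dictionary was passed to clusterdump, so entries are numeric); `ClusterDumpLine` is a hypothetical helper, not part of Mahout. Read this way, the `0` in `0:0.017` is the index of a vector dimension and `0.017` is the centroid's weight in that dimension, so it is not a document identifier.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Mahout): parses one text line printed by clusterdump. */
public class ClusterDumpLine {

    // Matches e.g. "VL-24130{n=1312 c=[0:0.017, 10:0.007] r=[0:0.442]}"
    private static final Pattern LINE = Pattern.compile(
            "\\s*((?:VL|CL)-\\d+)\\s*\\{\\s*n=(\\d+)\\s+c=\\[(.*?)\\]\\s*r=\\[(.*?)\\]\\s*\\}\\s*");

    public final String clusterId;               // VL-x (or CL-x)
    public final int n;                          // number of points in the cluster
    public final Map<Integer, Double> centroid;  // dimension index -> centroid weight
    public final Map<Integer, Double> radius;    // dimension index -> radius component

    private ClusterDumpLine(String clusterId, int n,
                            Map<Integer, Double> centroid, Map<Integer, Double> radius) {
        this.clusterId = clusterId;
        this.n = n;
        this.centroid = centroid;
        this.radius = radius;
    }

    public static ClusterDumpLine parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a clusterdump line: " + line);
        }
        return new ClusterDumpLine(m.group(1), Integer.parseInt(m.group(2)),
                parseSparse(m.group(3)), parseSparse(m.group(4)));
    }

    // Parses "0:0.017, 10:0.007, ..." into index -> value; entries without ':' are skipped.
    private static Map<Integer, Double> parseSparse(String body) {
        Map<Integer, Double> out = new LinkedHashMap<>();
        for (String entry : body.split(",")) {
            String[] kv = entry.trim().split(":");
            if (kv.length != 2) {
                continue; // e.g. the "..." elision used in the question
            }
            out.put(Integer.parseInt(kv[0].trim()), Double.parseDouble(kv[1].trim()));
        }
        return out;
    }

    public static void main(String[] args) {
        ClusterDumpLine c = parse(
                "VL-24130{n=1312 c=[0:0.017, 10:0.007, 11:0.005] r=[0:0.442, 10:0.271, 11:0.198]}");
        System.out.println(c.clusterId + " has " + c.n
                + " points; centroid weight in dimension 10 = " + c.centroid.get(10));
    }
}
```

Note that nothing in such a line lists the documents themselves; the cluster membership of individual points has to come from the clustered points output, as described in the first answer.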

Sean Owen · Answer 3 · 2011-04-28T06:10:47+0000

I think you need to read the source code, which you can download from http://mahout.apache.org. VL-24130 is just the cluster identifier of a converged cluster.

Jugal · Answer 4 · 2013-02-04T12:24:50+0000

You can use mahout clusterdump: https://cwiki.apache.org/MAHOUT/cluster-dumper.html
