Interpreting output from Mahout clusterdumper

I ran a clustering test on crawled pages (more than 25 thousand documents, a personal data set). I then ran clusterdump:

$MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-1/ --output clusteranalyze.txt 

After running the cluster dumper, the output shows 25 "VL-xxxxx{}" elements:

 VL-24130{n=1312 c=[0:0.017, 10:0.007, 11:0.005, 14:0.017, 31:0.016, 35:0.006, 41:0.010, 43:0.008, 52:0.005, 59:0.010, 68:0.037, 72:0.056, 87:0.028, ... ] r=[0:0.442, 10:0.271, 11:0.198, 14:0.369, 31:0.421, ... ]}
 ...
 VL-24868{n=311 c=[0:0.042, 11:0.016, 17:0.046, 72:0.014, 96:0.044, 118:0.015, 135:0.016, 195:0.017, 318:0.040, 319:0.037, 320:0.036, 330:0.030, ...] r=[0:0.740, 11:0.287, 17:0.576, 72:0.239, 96:0.549, 118:0.273, ...]}

How do I interpret this output?

In short: I am looking for document identifiers belonging to a particular cluster.

What is the meaning of:

  • VL-x?
  • n=y c=[z:z', ...]
  • r=[z'':z''', ...]

Does 0:0.017 mean that "0" is the identifier of a document that belongs to this cluster?

I have already read on the Mahout wiki pages what CL, n, c and r mean. But can someone please explain them to me better, or point to a resource where this is explained in more detail?

Sorry if these are stupid questions, but I am new to Apache Mahout and am using it as part of my course on clustering.

+4
4 answers
  • By default, k-means clustering uses a WeightedVector, which does not include the data point name, so you need to create the sequence file yourself using NamedVector. There is a one-to-one correspondence between the number of sequence files and the number of map tasks, so if your map capacity is 12, split your data into 12 pieces when creating the NamedVector sequence files:

     vector = new NamedVector(new SequentialAccessSparseVector(Cardinality),arrField[0]); 
  • Basically, you need to download the clustered points from HDFS and write your own code to output the results. Here is the code I wrote to output the cluster membership of each point.

     import java.io.BufferedWriter;
     import java.io.File;
     import java.io.FileWriter;
     import java.io.IOException;
     import java.util.HashMap;
     import java.util.Set;

     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.fs.FileSystem;
     import org.apache.hadoop.fs.Path;
     import org.apache.hadoop.io.IntWritable;
     import org.apache.hadoop.io.SequenceFile;
     import org.apache.mahout.clustering.WeightedVectorWritable;
     import org.apache.mahout.math.NamedVector;

     public class ClusterOutput {

         /**
          * @param args args[0]: local folder holding the downloaded part-m files,
          *             args[1]: output file for "vector name TAB cluster id" lines,
          *             args[2]: output file for "cluster id, number of points" lines
          */
         public static void main(String[] args) {
             try {
                 Configuration conf = new Configuration();
                 FileSystem fs = FileSystem.get(conf);
                 File pointsFolder = new File(args[0]);
                 File[] files = pointsFolder.listFiles();
                 BufferedWriter bw = new BufferedWriter(new FileWriter(new File(args[1])));
                 HashMap<String, Integer> clusterIds = new HashMap<String, Integer>(5000);

                 for (File file : files) {
                     // Only the part-m files are read; everything else is skipped.
                     if (file.getName().indexOf("part-m") < 0)
                         continue;
                     SequenceFile.Reader reader =
                             new SequenceFile.Reader(fs, new Path(file.getAbsolutePath()), conf);
                     IntWritable key = new IntWritable();
                     WeightedVectorWritable value = new WeightedVectorWritable();
                     while (reader.next(key, value)) {
                         // The key is the cluster id; the cast requires NamedVector input.
                         NamedVector vector = (NamedVector) value.getVector();
                         String vectorName = vector.getName();
                         bw.write(vectorName + "\t" + key.toString() + "\n");
                         // Count the points per cluster.
                         if (clusterIds.containsKey(key.toString())) {
                             clusterIds.put(key.toString(), clusterIds.get(key.toString()) + 1);
                         } else {
                             clusterIds.put(key.toString(), 1);
                         }
                     }
                     bw.flush();
                     reader.close();
                 }
                 bw.flush();
                 bw.close();

                 // Write the per-cluster point counts to the second output file.
                 bw = new BufferedWriter(new FileWriter(new File(args[2])));
                 Set<String> keys = clusterIds.keySet();
                 for (String key : keys) {
                     bw.write(key + " " + clusterIds.get(key) + "\n");
                 }
                 bw.flush();
                 bw.close();
             } catch (IOException e) {
                 e.printStackTrace();
             }
         }
     }
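
Once the first output file exists (one "vector name, tab, cluster id" line per point), getting the document identifiers of a particular cluster is plain text processing. Here is a minimal sketch of that step in standard Java, with no Mahout dependency. The class name `ClusterMembership`, the method `groupByCluster` and the sample document names are made up for illustration; the only thing taken from the code above is the tab-separated line layout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: groups "name<TAB>clusterId" lines by cluster id.
class ClusterMembership {

    static Map<String, List<String>> groupByCluster(List<String> lines) {
        Map<String, List<String>> members = new TreeMap<>();
        for (String line : lines) {
            // Split on the last tab so that names containing tabs stay intact.
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue; // skip blank or malformed lines
            }
            String name = line.substring(0, tab);
            String clusterId = line.substring(tab + 1).trim();
            members.computeIfAbsent(clusterId, k -> new ArrayList<>()).add(name);
        }
        return members;
    }

    public static void main(String[] args) {
        // Made-up sample lines in the layout written by the code above.
        List<String> sample = Arrays.asList(
                "doc-a\t24130",
                "doc-b\t24868",
                "doc-c\t24130");
        // prints {24130=[doc-a, doc-c], 24868=[doc-b]}
        System.out.println(groupByCluster(sample));
    }
}
```

For a real run you would read the lines from the file passed as args[1] above, for example with `java.nio.file.Files.readAllLines`, instead of the hard-coded sample.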
+4

To complete the answer:

  • VL-x: the cluster identifier
  • n=y: the number of elements in the cluster
  • c=[z, ...]: the centroid of the cluster, where the z's are the weights of the different dimensions
  • r=[z, ...]: the radius of the cluster

More details here: https://mahout.apache.org/users/clustering/cluster-dumper.html
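
To make the fields concrete: in the output shown in the question, each entry such as 0:0.017 inside c=[...] or r=[...] is a "dimension index:value" pair of a sparse vector, so the 0 is the index of a feature (a dimension of the document vectors), not a document identifier. The sketch below pulls one entry apart in plain Java under that assumption. The class `ClusterDumpEntry` and its regular expression are hypothetical and written only against the text layout shown in the question; they are not part of Mahout.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: parses one "VL-x{n=y c=[...] r=[...]}" entry.
class ClusterDumpEntry {
    final String id;                      // cluster identifier, e.g. "VL-24130"
    final int n;                          // number of points in the cluster
    final Map<Integer, Double> centroid;  // dimension index -> centroid weight
    final Map<Integer, Double> radius;    // dimension index -> radius value

    // Assumed layout, taken from the output in the question.
    private static final Pattern ENTRY = Pattern.compile(
            "([A-Za-z]+-\\d+)\\{n=(\\d+)\\s+c=\\[([^\\]]*)\\]\\s+r=\\[([^\\]]*)\\]\\s*\\}");

    ClusterDumpEntry(String id, int n,
                     Map<Integer, Double> centroid, Map<Integer, Double> radius) {
        this.id = id;
        this.n = n;
        this.centroid = centroid;
        this.radius = radius;
    }

    static ClusterDumpEntry parse(String text) {
        Matcher m = ENTRY.matcher(text.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a cluster entry: " + text);
        }
        return new ClusterDumpEntry(m.group(1), Integer.parseInt(m.group(2)),
                parseSparse(m.group(3)), parseSparse(m.group(4)));
    }

    // Parses "0:0.017, 10:0.007, ..." into index -> value pairs.
    static Map<Integer, Double> parseSparse(String body) {
        Map<Integer, Double> result = new LinkedHashMap<>();
        for (String part : body.split(",")) {
            String p = part.trim();
            int colon = p.indexOf(':');
            if (colon < 0) {
                continue; // skips empty parts and elisions such as "..."
            }
            result.put(Integer.parseInt(p.substring(0, colon).trim()),
                    Double.parseDouble(p.substring(colon + 1).trim()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Shortened copy of the first entry from the question.
        ClusterDumpEntry e = parse(
                "VL-24130{n=1312 c=[0:0.017, 10:0.007] r=[0:0.442, 10:0.271]}");
        System.out.println(e.id + " has " + e.n + " points; centroid weight on dimension 0 is "
                + e.centroid.get(0));
    }
}
```

Note that nothing in such an entry lists the member documents; it only summarizes the cluster, which is why the per-point output has to be read separately.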

+1

I think you need to read the source code, which you can download from http://mahout.apache.org. VL-24130 is just the cluster identifier of a converged cluster.

0
