How to save the data entry ID in Mahout k-means clustering

I use Mahout to run k-means clustering, and I have a problem identifying the data entries in the clustering results. For example, I have 100 data records:

id      data
0       0.1 0.2 0.3 0.4
1       0.2 0.3 0.4 0.5
...     ...
100     0.2 0.4 0.4 0.5

After clustering, I need to get the IDs back from the cluster results to see which points belong to which cluster, but there seems to be no way to maintain the IDs.

In the official Mahout example of clustering the synthetic control data, only the data is fed into Mahout, without IDs, for example:

28.7812 34.4632 31.3381 31.2834 28.9207 ...
...
24.8923 25.741  27.5532 32.8217 27.8789 ...

and the cluster results only contain the cluster identifier and the point values:

VL-539{n=38 c=[29.950, 30.459, ...
   Weight:  Point:
   1.0: [28.974, 29.026, 31.404, 27.894, 35.985...
   2.0: [24.214, 33.150, 31.521, 31.986, 29.064

but there are no point identifiers. Does anyone have an idea how to keep the point identifiers when doing Mahout clustering? Many thanks!


To do that, you can use NamedVectors.

As you know, before running any clustering on your data, you have to vectorize it.

This means that you have to transform your data into Mahout vectors, because that is the kind of data the clustering algorithms work with.

The vectorization process will depend on the nature of your data, i.e. vectorizing text is not the same as vectorizing numerical values.

Your data seems to be easy to vectorize, since each row only has an ID and 4 numerical values.

You could write a Hadoop Job that takes your input data, for example as a CSV file, and outputs a SequenceFile with your data already vectorized.

Then you apply the Mahout clustering algorithms to this input, and the ID (the vector name) of each point will be kept in the clustering results.

An example job to vectorize the data could be implemented with the following classes:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.math.VectorWritable;

public class DenseVectorizationDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.printf("Usage: %s [generic options] <input> <output>\n", getClass().getSimpleName());
            ToolRunner.printGenericCommandUsage(System.err);
            return -1;
        }
        Job job = new Job(getConf(), "Create Dense Vectors from CSV input");
        job.setJarByClass(DenseVectorizationDriver.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(DenseVectorizationMapper.class);
        job.setReducerClass(DenseVectorizationReducer.class);

        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(VectorWritable.class);

        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    // Standard ToolRunner entry point so the job can be launched with "hadoop jar".
    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new DenseVectorizationDriver(), args));
    }
}


import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class DenseVectorizationMapper extends Mapper<LongWritable, Text, LongWritable, VectorWritable> {
/*
 * This mapper takes its input from a CSV file whose fields are separated by TAB and emits
 * the same key it receives (useless in this case) and a NamedVector as value.
 * The "name" of the NamedVector is the ID of each row.
 */
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        String line = value.toString();
        String[] lineParts = line.split("\t", -1);
        String id = lineParts[0];

        // you should do some checks here to assure that this piece of data is correct

        // field 0 is the ID; fields 1..n-1 hold the numeric values
        Vector vector = new DenseVector(lineParts.length - 1);
        for (int i = 1; i < lineParts.length; i++) {
            vector.set(i - 1, Double.parseDouble(lineParts[i]));
        }

        vector = new NamedVector(vector, id);

        context.write(key, new VectorWritable(vector));
    }
}


import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.mahout.math.VectorWritable;

public class DenseVectorizationReducer extends Reducer<LongWritable, VectorWritable, LongWritable, VectorWritable> {
/*
 * This reducer simply writes the output without doing any computation.
 * Maybe it would be better to define this Hadoop job without a reduce phase.
 */
    @Override
    public void reduce(LongWritable key, Iterable<VectorWritable> values, Context context) throws IOException, InterruptedException {

        VectorWritable writeValue = values.iterator().next();
        context.write(key, writeValue);
    }
}
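
If you want to verify that the generated SequenceFile really keeps the IDs, here is a minimal sketch (not part of the original answer; the class name is illustrative) that reads the job output back and prints each vector's name:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.VectorWritable;

public class VectorizationOutputCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]); // e.g. <output>/part-r-00000
        SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(conf), path, conf);
        LongWritable key = new LongWritable();
        VectorWritable value = new VectorWritable();
        while (reader.next(key, value)) {
            // the job above always writes NamedVectors, so the cast is safe here
            NamedVector nv = (NamedVector) value.get();
            System.out.println("id=" + nv.getName() + " -> " + nv.getDelegate());
        }
        reader.close();
    }
}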

I know this thread is a bit old, but for the record: if you don't strictly need Mahout (i.e. the Hadoop machinery), Apache-commons-math also contains a K-means implementation. Alternatively, there is this small library: http://code.google.com/p/noolabsimplecluster/ Just remember to normalize the data (linearly) to the interval [0..1], otherwise any clustering algorithm will produce garbage!
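
For completeness, here is a minimal sketch of the commons-math route, assuming commons-math 3.2+ (where the org.apache.commons.math3.ml.clustering package exists); the IdentifiedPoint wrapper is a hypothetical helper, not part of the library:

import java.util.ArrayList;
import java.util.List;

import org.apache.commons.math3.ml.clustering.CentroidCluster;
import org.apache.commons.math3.ml.clustering.Clusterable;
import org.apache.commons.math3.ml.clustering.KMeansPlusPlusClusterer;

// A point that carries its own ID through clustering.
class IdentifiedPoint implements Clusterable {
    private final String id;
    private final double[] values;
    IdentifiedPoint(String id, double[] values) { this.id = id; this.values = values; }
    public double[] getPoint() { return values; }
    public String getId() { return id; }
}

public class CommonsMathKMeansExample {
    public static void main(String[] args) {
        List<IdentifiedPoint> points = new ArrayList<IdentifiedPoint>();
        points.add(new IdentifiedPoint("0", new double[]{0.1, 0.2, 0.3, 0.4}));
        points.add(new IdentifiedPoint("1", new double[]{0.2, 0.3, 0.4, 0.5}));
        // ... add the remaining rows, normalized to [0..1] if needed

        // k = 2 clusters, at most 100 iterations
        KMeansPlusPlusClusterer<IdentifiedPoint> clusterer =
                new KMeansPlusPlusClusterer<IdentifiedPoint>(2, 100);
        List<CentroidCluster<IdentifiedPoint>> clusters = clusterer.cluster(points);

        for (int c = 0; c < clusters.size(); c++) {
            for (IdentifiedPoint p : clusters.get(c).getPoints()) {
                System.out.println("cluster " + c + " <- id " + p.getId());
            }
        }
    }
}

Because you pass your own objects in and get the same objects back, the IDs survive clustering with no extra bookkeeping.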


The clusteredPoints directory created by kmeans contains this mapping. Note that you should have used the -cl option to get this data.
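
For example, here is a minimal sketch of reading that mapping back, assuming Mahout 0.7+ (where the keys in clusteredPoints are IntWritable cluster IDs and the values are WeightedVectorWritable points; in older releases that class lives in org.apache.mahout.clustering rather than org.apache.mahout.clustering.classify):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.clustering.classify.WeightedVectorWritable;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.Vector;

public class ClusteredPointsReader {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]); // e.g. <output>/clusteredPoints/part-m-00000
        SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(conf), path, conf);
        IntWritable clusterId = new IntWritable();
        WeightedVectorWritable point = new WeightedVectorWritable();
        while (reader.next(clusterId, point)) {
            Vector v = point.getVector();
            // if the input was built from NamedVectors, the ID survives here
            String id = (v instanceof NamedVector) ? ((NamedVector) v).getName() : "(unnamed)";
            System.out.println("point " + id + " -> cluster " + clusterId.get());
        }
        reader.close();
    }
}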

