How to vectorize a text file in Mahout?

I have a text file with a tag and tweets.

positive,I love this car
negative,I hate this book
positive,Good product.

I need to convert each row to a vector. If I use the seq2sparse command, the whole document is converted to a single vector, but I need one vector per line, not one for the whole document. For example:

key: positive value: vectorvalue (tweet)

How can this be achieved in Mahout?


This is what I tried:

  // Split "label,tweet" into the label and the tweet text.
  StringTokenizer str = new StringTokenizer(line, ",");
  String label = str.nextToken();
  while (str.hasMoreTokens()) {
    tweetline = str.nextToken();
    System.out.println("Tweetline" + tweetline);
    // Collect the individual words of the tweet as features.
    StringTokenizer words = new StringTokenizer(tweetline, " ");
    while (words.hasMoreTokens()) {
      featureList.add(words.nextToken());
    }
  }
  // Note: the cardinality depends on the tweet length, so vector size varies per line.
  Vector unclassifiedInstanceVector = new RandomAccessSparseVector(tweetline.split(" ").length);
  FeatureVectorEncoder vectorEncoder = new AdaptiveWordValueEncoder(label);
  vectorEncoder.setProbes(1);
  System.out.println("Feature List: " + featureList);
  for (Object feature : featureList) {
    vectorEncoder.addToVector((String) feature, unclassifiedInstanceVector);
  }
  // Emit one (label, vector) pair per input line.
  context.write(new Text("/" + label), new VectorWritable(unclassifiedInstanceVector));

Thank you in advance

1 answer

You can write the vectors to an HDFS path of your application yourself using SequenceFile.Writer:

  Configuration conf = HBaseConfiguration.create();
  FileSystem fs = FileSystem.get(conf);
  String newPath = "/foo/mahouttest/part-r-00000";
  Path newPathFile = new Path(newPath);
  Text key = new Text();
  VectorWritable value = new VectorWritable();
  SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, newPathFile, key.getClass(), value.getClass());
  .....
  // Append one (label, vector) record per input line.
  key.set("c/" + label);
  value.set(unclassifiedInstanceVector);
  writer.append(key, value);
  // Close the writer once all records have been appended.
  writer.close();
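To make the per-line idea concrete without any Mahout or Hadoop dependency, here is a minimal plain-Java sketch. It splits each "label,tweet" line at the first comma and hashes the words into a vector of fixed size, so every line yields a vector of the same length. The class name TweetVectorizer, the method names, and the CARDINALITY value are hypothetical and chosen only for illustration, and String.hashCode is a simplified stand-in for the hashing that Mahout's encoders perform; it is not the Mahout API.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

/** Plain-Java sketch: one hashed word-count vector per "label,tweet" line. */
class TweetVectorizer {
    // Fixed size shared by every vector (assumed value; tune it for your vocabulary).
    static final int CARDINALITY = 1000;

    /** Splits "label,tweet text" at the first comma only, so commas inside the tweet survive. */
    static Map.Entry<String, String> parseLine(String line) {
        int comma = line.indexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("expected 'label,text': " + line);
        }
        String label = line.substring(0, comma).trim();
        String tweet = line.substring(comma + 1).trim();
        return new SimpleEntry<>(label, tweet);
    }

    /** Hashes each whitespace-separated word into a fixed-size count vector. */
    static double[] vectorize(String tweet) {
        double[] vector = new double[CARDINALITY];
        for (String word : tweet.toLowerCase().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // skip the empty token produced by blank input
            }
            // floorMod keeps the index non-negative even for negative hash codes.
            int index = Math.floorMod(word.hashCode(), CARDINALITY);
            vector[index] += 1.0;
        }
        return vector;
    }
}
```

For each input line you would call TweetVectorizer.parseLine(line), pass the value to TweetVectorizer.vectorize, and append the label together with the resulting vector as one record. In a Mahout job the analogous choice would presumably be to give every RandomAccessSparseVector the same fixed cardinality rather than one derived from the tweet length, so that vectors from different lines are comparable.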