Need help implementing this algorithm with Hadoop MapReduce

I have an algorithm that goes through a large dataset, reads some text files, and searches for specific terms in the lines of those files. I have it implemented in Java, but I didn't want to post the code so that it doesn't look like I'm searching for somebody to implement it for me, but it's true, I really need a lot of help!!! This wasn't planned for my project, but the dataset turned out to be huge, so the teacher told me I have to do it this way.

EDIT (I didn't specify this in the previous version of the question): The dataset I have is on a Hadoop cluster, and I should make a MapReduce implementation of it.

I read about MapReduce and thought that I would do the standard implementation first, and then it would be easier/more natural to redo it with MapReduce. But it didn't happen, since the algorithm is quite dumb and nothing special, and map and reduce... I can't wrap my mind around it.

So, here is a short pseudo-code of my algorithm

LIST termList   (there is a method that creates this list from a lucene index)
FOLDER topFolder

INPUT topFolder
IF it is folder and not empty
    list files (there are 30 sub folders inside)
    FOR EACH sub folder
        GET file "CheckedFile.txt"
        analyze(CheckedFile)
    ENDFOR
END IF

Method ANALYZE(CheckedFile)
    read CheckedFile
    WHILE CheckedFile has next line
        GET line
        FOR (loops through termList)
            GET third word from line
            IF third word = term from list
                append whole line to string buffer
            ENDIF
        ENDFOR
    END WHILE
    OUTPUT string buffer to file

In addition, as you can see, every time "analyze" is called, a new file has to be created, and I understand that MapReduce has difficulty writing to many different outputs???

I understand the MapReduce intuition, and my example seems like a perfect fit for MapReduce, but when it comes to actually doing it, obviously I don't know enough, and I'm STUCK!

Please, help.

java mapreduce hadoop
2 answers

You can simply use an empty reducer, and partition your job so that a single mapper runs per file. Each mapper will then create its own output file in the output folder.
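
A minimal sketch of what that could look like, assuming the newer org.apache.hadoop.mapreduce API; TermFilterJob, TermFilterMapper and the term.list configuration property are invented names, and shipping the terms as a comma-separated configuration string is just one convenient option:

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TermFilterJob {

    public static class TermFilterMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

        private final Set<String> terms = new HashSet<String>();

        @Override
        protected void setup(Context context) {
            //The term list is passed in via the job configuration;
            //"term.list" is an invented property name
            String termList = context.getConfiguration().get("term.list", "");
            terms.addAll(Arrays.asList(termList.split(",")));
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] tokens = value.toString().split(" ");
            //Emit the whole line when its third word is in the term list
            if (tokens.length >= 3 && terms.contains(tokens[2])) {
                context.write(NullWritable.get(), value);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("term.list", "terma,termb"); //your real terms go here
        Job job = Job.getInstance(conf, "term filter");
        job.setJarByClass(TermFilterJob.class);
        job.setMapperClass(TermFilterMapper.class);
        job.setNumReduceTasks(0); //no reducer: each mapper writes its own output file
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With the default TextInputFormat, every file smaller than an HDFS block gets its own mapper, and with zero reduce tasks each mapper writes its matching lines straight to its own part-m-NNNNN file in the output folder.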


Map/Reduce is easy to implement using some of the nice Java 6 concurrency features, especially Future, Callable and ExecutorService.

I created a Callable that will analyse the file the way you specified:

import java.io.File;
import java.io.FileNotFoundException;
import java.util.List;
import java.util.Scanner;
import java.util.concurrent.Callable;

public class FileAnalyser implements Callable<String> {

    private Scanner scanner;
    private List<String> termList;

    public FileAnalyser(String filename, List<String> termList) throws FileNotFoundException {
        this.termList = termList;
        scanner = new Scanner(new File(filename));
    }

    @Override
    public String call() throws Exception {
        StringBuilder buffer = new StringBuilder();
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine();
            String[] tokens = line.split(" ");
            if ((tokens.length >= 3) && (inTermList(tokens[2]))) {
                buffer.append(line).append('\n'); //keep matched lines separated
            }
        }
        scanner.close();
        return buffer.toString();
    }

    private boolean inTermList(String term) {
        return termList.contains(term);
    }
}

We need to create a new Callable for each file we find and submit it to the executor service. The result of the submission is a Future, which we can use later to obtain the result of analysing the file.

import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class Analyser {

    private static final int THREAD_COUNT = 10;

    public static void main(String[] args) throws Exception {
        //All callables will be submitted to this executor service
        //Play around with THREAD_COUNT for optimum performance
        ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);

        //Store all futures in this list so we can refer to them easily
        List<Future<String>> futureList = new ArrayList<Future<String>>();

        //Some random term list, I don't know what you're using.
        List<String> termList = new ArrayList<String>();
        termList.add("terma");
        termList.add("termb");

        //For each file you find, create a new FileAnalyser callable and submit
        //this to the executor service. Add the future to the list
        //so we can check back on the result later
        for (String filename : allFilenames()) {
            try {
                Callable<String> worker = new FileAnalyser(filename, termList);
                Future<String> future = executor.submit(worker);
                futureList.add(future);
            } catch (FileNotFoundException fnfe) {
                //If the file doesn't exist at this point we can probably ignore,
                //but I'll leave that for you to decide.
                System.err.println("Unable to create future for " + filename);
                fnfe.printStackTrace(System.err);
            }
        }

        //You may want to wait at this point until all threads have finished.
        //You could loop through each future until isDone() holds true
        //for each of them.

        //Loop over all finished futures and do something with the result
        //from each
        for (Future<String> current : futureList) {
            String result = current.get();
            //Do something with the result from this future
        }

        //Shut the executor down so the JVM can exit
        executor.shutdown();
    }

    //Stand-in for however you walk topFolder and collect the files
    private static List<String> allFilenames() {
        return new ArrayList<String>();
    }
}

My example here is far from complete, and far from efficient. I haven't considered the sample size; if it's really huge, you could keep looping over the futureList, removing elements that have completed, something similar to:

while (futureList.size() > 0) {
    for (Future<String> current : futureList) {
        if (current.isDone()) {
            String result = current.get();
            //Do something with result
            futureList.remove(current);
            break; //We have modified the list during iteration, best break out of for-loop
        }
    }
}
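
If you don't need to poll each future individually, a simpler way to wait for everything to finish (my own addition, not part of the answer above) is to shut the executor down and block until it terminates:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class WaitForCompletion {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(10);
        //...submit all the FileAnalyser callables here, as above...

        executor.shutdown(); //no new tasks accepted from here on
        //Block until every submitted task has completed;
        //the one-hour timeout is an arbitrary illustration
        executor.awaitTermination(1, TimeUnit.HOURS);
        //Every future is now done, so get() will not block
    }
}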

Alternatively, you could implement a producer-consumer setup, where the producer submits the callables to the executor service and produces the futures, and the consumer takes the results from the futures and discards the futures afterwards.

This would perhaps require the producer and consumer to be threads themselves, and a synchronized list for adding/removing futures.
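
If you would rather not hand-roll that synchronized list, java.util.concurrent already provides the hand-over point in the form of ExecutorCompletionService: the producer side submits callables to it, and it queues each future the moment it completes, so the consumer can take results in finish order. A sketch of that idea, reusing the FileAnalyser class from above, with made-up file names and terms:

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ProducerConsumerAnalyser {

    public static void main(String[] args) throws Exception {
        //Made-up file names and terms, purely for illustration
        List<String> filenames = Arrays.asList("sub1/CheckedFile.txt", "sub2/CheckedFile.txt");
        List<String> termList = Arrays.asList("terma", "termb");

        ExecutorService executor = Executors.newFixedThreadPool(10);
        //The completion service is the synchronized hand-over point:
        //producers submit callables to it, and it queues each future
        //the moment that future completes
        CompletionService<String> completionService =
                new ExecutorCompletionService<String>(executor);

        //Producer side: submit one FileAnalyser (from above) per file
        for (String filename : filenames) {
            completionService.submit(new FileAnalyser(filename, termList));
        }

        //Consumer side: take() blocks until the next result is ready,
        //so results are handled in completion order, not submission order
        for (int i = 0; i < filenames.size(); i++) {
            String result = completionService.take().get();
            //Do something with result
        }

        executor.shutdown();
    }
}

Because take() blocks until the next analysis has finished, the consumer never busy-waits the way the polling loop above does.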

Any questions, please ask.

