Hadoop Map-Reduce OutputFormat for assigning a result to a memory variable (not files)?

(from a Hadoop newbie)

I want to avoid files where possible in my Hadoop learning example. I was already able to read data from a non-file input that generates random numbers (thanks to http://codedemigod.com/blog/?p=120 ).

I want to keep the result in memory so that I can run further (non-MapReduce) business logic on it. Essentially:

 conf.setOutputFormat(InMemoryOutputFormat);
 JobClient.runJob(conf);
 Map result = conf.getJob().getResult(); // ?

The closest thing I have found that does what I want is to save the result in a binary format and read it back with the matching input format. That feels like unnecessary code and unnecessary computation (or perhaps I am misunderstanding how MapReduce is meant to be used?).

1 answer

The problem with this idea is that Hadoop has no concept of "distributed memory". If you want the result "in memory", the next question has to be "in the memory of which machine?" If you really want to access it that way, you will have to write your own output format, and then either use an existing framework for sharing memory across machines or, again, write your own.
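To make the limitation concrete, here is a minimal sketch of what such a custom output format could look like. This assumes the old `org.apache.hadoop.mapred` API (matching the `JobClient`/`conf` code in the question); `InMemoryOutputFormat` and its `RESULTS` map are hypothetical names, and the static map only works with the local job runner, where all tasks share one JVM. On a real cluster each task runs in its own process on its own machine, which is exactly why this approach breaks down:

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.OutputFormat;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;

// Hypothetical sketch: collects job output into a static in-memory map.
// Only meaningful with the local job runner (mapred.job.tracker=local),
// because distributed tasks would each fill their own copy of RESULTS.
public class InMemoryOutputFormat<K, V> implements OutputFormat<K, V> {

    public static final Map<Object, Object> RESULTS =
            new ConcurrentHashMap<Object, Object>();

    public RecordWriter<K, V> getRecordWriter(FileSystem ignored, JobConf job,
            String name, Progressable progress) throws IOException {
        return new RecordWriter<K, V>() {
            public void write(K key, V value) {
                RESULTS.put(key, value); // "commit" straight to memory
            }
            public void close(Reporter reporter) { }
        };
    }

    public void checkOutputSpecs(FileSystem ignored, JobConf job) { }
}
```

After `JobClient.runJob(conf)` returns you would read `InMemoryOutputFormat.RESULTS` directly, but again: the moment the job runs on more than one machine, there is no single map to read.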

My suggestion would be to simply write to HDFS as usual, and then, for the non-MapReduce business logic, start by reading the data back from HDFS via the FileSystem API, i.e.:

 FileSystem fs = new JobClient(conf).getFs();
 Path outputPath = new Path("/foo/bar");
 FSDataInputStream in = fs.open(outputPath);
 // read the data and store it in memory, then clean up
 in.close();
 fs.delete(outputPath, true);

Sure, this does some unnecessary reads and writes to disk, but if your data is small enough to fit in memory, why worry about it? I would be surprised if that were a serious bottleneck.
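As a fuller sketch of that read-back step: a MapReduce job writes a directory of `part-*` files, not a single file, so you typically list the output directory and read each part. The snippet below assumes the job used `TextOutputFormat` (tab-separated key/value lines); the path `/foo/bar` is just the example path from above:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ReadJobOutput {
    // Sketch: after JobClient.runJob(conf) finishes, load the job's
    // TextOutputFormat output from HDFS into an in-memory map.
    public static Map<String, String> load(JobConf conf, Path outputDir)
            throws IOException {
        Map<String, String> result = new HashMap<String, String>();
        FileSystem fs = new JobClient(conf).getFs();
        for (FileStatus status : fs.listStatus(outputDir)) {
            // Reducer outputs are named part-00000, part-00001, ...
            if (!status.getPath().getName().startsWith("part-")) {
                continue;
            }
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(status.getPath())));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    // TextOutputFormat separates key and value with a tab
                    String[] kv = line.split("\t", 2);
                    result.put(kv[0], kv.length > 1 ? kv[1] : "");
                }
            } finally {
                reader.close();
            }
        }
        fs.delete(outputDir, true); // clean up once everything is loaded
        return result;
    }
}
```

From there the rest of the business logic works on a plain `Map<String, String>`, with no further MapReduce involved.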

