In short, I have a client who wants the data contained in a bunch of ASCII text files (aka "input files") to get into Accumulo.
These files are output by various data-feed devices and will be generated continuously on non-Hadoop / non-Accumulo nodes (aka "feed nodes"). The overall data rate across all feeds is expected to be very high.
For simplicity, assume all of the data ends up in one forward index table and one inverted (reverse) index table in Accumulo.
I have already written an Accumulo client module using pyaccumulo that can connect to Accumulo through the Thrift proxy, read and parse input files from the local filesystem (not HDFS), create the appropriate forward and reverse index mutations in code, and use a BatchWriter to write those mutations to the forward and reverse index tables. So far, so good. But there is more.
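For concreteness, here is a minimal sketch of that client; the proxy host, table names, and tab-delimited record format are placeholder assumptions, not my real ones:

```python
# Minimal pyaccumulo client sketch: connect via the Thrift proxy, parse a
# local input file, and write forward and reverse index mutations.
from pyaccumulo import Accumulo, Mutation

conn = Accumulo(host="proxy-host", port=42424, user="ingest", password="secret")

fwd = conn.create_batch_writer("fwd_index")
rev = conn.create_batch_writer("rev_index")

with open("/data/feeds/sample.txt") as f:
    for line in f:
        key, value = line.rstrip("\n").split("\t", 1)  # assumed tab-delimited records

        m = Mutation(key)                 # forward index: key -> value
        m.put(cf="d", cq="", val=value)
        fwd.add_mutation(m)

        m = Mutation(value)               # inverted index: value -> key
        m.put(cf="i", cq=key, val="")
        rev.add_mutation(m)

fwd.close()
rev.close()
```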
From various sources I have learned that there are at least a few standard approaches for high-speed ingest into Accumulo that could apply to my scenario, and I would like advice on which options make the most sense in terms of resource usage and ease of implementation and maintenance. Here are the options:
- BatchWriter clients on the feed nodes: run my Accumulo client on the feed nodes themselves. This option has the disadvantage of sending both forward and reverse index mutations across the network to the cluster, and it requires the Accumulo/Thrift libraries to be available on the feed nodes to support the client. However, it has the advantage of parallelizing the work of parsing the input files and creating the mutations, and it appears to minimize disk I/O on the Hadoop cluster compared to the options below.
- BatchWriter client on the Accumulo master node: scp/sftp the input files from the feed nodes to the Accumulo master node, into some local filesystem directory. Then run my Accumulo client only on the Accumulo master node. The advantage of this option is that it does not send the forward and reverse index mutations from the feed nodes to the Accumulo master, and it does not require the Accumulo/Thrift libraries on the feed nodes. The disadvantage is that the Accumulo master node does all of the work of parsing the input files and creating the mutations, and its local disk becomes a waypoint for the input files.
- MapReduce with AccumuloOutputFormat: scp/sftp the input files from the feed nodes to the Accumulo master node. Then periodically copy them to HDFS (a staging sketch follows this list) and run a MapReduce job that reads and parses the input files from HDFS, creates the mutations, and uses AccumuloOutputFormat to write them. This option has the advantages of #2 above, plus it parallelizes the work of parsing the input files and creating the mutations. The disadvantage is that it continually spins up and tears down MapReduce jobs, with all of the overhead those processes involve. It also has the disadvantage of using two disk waypoints (local and HDFS) with the corresponding disk I/O. It sounds somewhat painful to implement and maintain for a continuous feed.
- MapReduce with AccumuloFileOutputFormat (RFiles): scp/sftp the input files from the feed nodes to the Accumulo master node. Then periodically copy them to HDFS as in #3 and run a MapReduce job that reads and parses the input files from HDFS, creates the mutations, and uses AccumuloFileOutputFormat to write RFiles. Then use the Accumulo shell to bulk-ingest the RFiles. This option has all of the advantages of #3 above, but I don't know whether it has any others (does it? The Accumulo manual says of bulk ingest: "In some cases it may be faster to load data this way instead of ingesting through clients using BatchWriters." Which cases?). It also has all of the disadvantages of #3 above, except that it uses three disk waypoints (local, HDFS x2) instead of two, with the corresponding disk I/O. It sounds painful to implement and maintain for a continuous feed.
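For options #3 and #4, the extra "local staging directory to HDFS" hop would look roughly like the sketch below. The paths and polling interval are assumptions, and cron or a workflow scheduler could replace the loop:

```python
# Sketch of the periodic "local staging dir -> HDFS" step shared by options
# #3 and #4 (paths are assumptions; the hdfs CLI must be on PATH).
import subprocess, shutil, time
from pathlib import Path

LOCAL_STAGE = Path("/data/ingest/staging")   # where scp/sftp drops feed files
LOCAL_DONE  = Path("/data/ingest/done")      # moved here after a successful put
HDFS_INBOX  = "/user/ingest/inbox"           # input dir for the MapReduce job

def stage_to_hdfs():
    LOCAL_DONE.mkdir(parents=True, exist_ok=True)
    for path in sorted(LOCAL_STAGE.glob("*.txt")):
        # Copy the file into the MapReduce job's HDFS input directory.
        subprocess.run(["hdfs", "dfs", "-put", str(path), HDFS_INBOX], check=True)
        shutil.move(str(path), LOCAL_DONE / path.name)

if __name__ == "__main__":
    while True:              # naive polling loop
        stage_to_hdfs()
        time.sleep(60)
```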
Personally, I like option #2 best, as long as the Accumulo master node can handle the parsing load on its own (non-parallel input file parsing). A variant of #2 in which I run my Accumulo client on every Accumulo node, and route the output of different feed nodes to different Accumulo nodes or round-robin it, still has the disadvantage of sending the forward and reverse index mutations across the cluster network, but has the advantage that input file parsing is done in parallel.
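A rough sketch of the feed-node side of that variant, round-robining finished files across the Accumulo nodes (hostnames and paths are assumptions; each Accumulo node would then run the pyaccumulo client from earlier against its local staging directory):

```python
# Sketch of the option #2 variant: a feed node shipping each finished input
# file to a different Accumulo node, round-robin, via scp.
import itertools, subprocess
from pathlib import Path

ACCUMULO_NODES = ["accumulo-node1", "accumulo-node2", "accumulo-node3"]
REMOTE_DIR = "/data/ingest/staging"
node_cycle = itertools.cycle(ACCUMULO_NODES)

def ship(path: Path):
    node = next(node_cycle)
    subprocess.run(["scp", str(path), f"{node}:{REMOTE_DIR}/"], check=True)
```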
What I need to know: Have I missed any viable options? Have I missed any advantages/disadvantages of each option? Are any of the advantages/disadvantages trivial or hugely important given my problem context, especially the network bandwidth / CPU / disk I/O trade-offs? Is MapReduce, with or without RFiles, really worth the trouble compared to a BatchWriter? Does anyone have any "war stories"?
Thanks!
performance hadoop accumulo
jhop