What is the most efficient way to move data from Hive to MongoDB?

Is there an elegant, easy, and fast way to move data from Hive to MongoDB?

+6

3 answers

You can export using the Hadoop-MongoDB connector. Just run the Hive query in the main method of your job driver; its output is then picked up by the Mapper, which inserts the data into MongoDB.

Example:

Here I insert a semicolon-separated text file (id;firstname;lastname) into a MongoDB collection using a simple Hive query:

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import com.mongodb.hadoop.MongoOutputFormat;
import com.mongodb.hadoop.io.BSONWritable;
import com.mongodb.hadoop.util.MongoConfigUtil;

public class HiveToMongo extends Configured implements Tool {

    private static class HiveToMongoMapper extends
            Mapper<LongWritable, Text, IntWritable, BSONWritable> {

        // Hive exports fields separated by ^A (Ctrl-A),
        // see: https://issues.apache.org/jira/browse/HIVE-634
        private static final String HIVE_EXPORT_DELIMETER = '\001' + "";
        private IntWritable k = new IntWritable();
        private BSONWritable v = null;

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] split = value.toString().split(HIVE_EXPORT_DELIMETER);

            k.set(Integer.parseInt(split[0]));
            v = new BSONWritable();
            v.put("firstname", split[1]);
            v.put("lastname", split[2]);
            context.write(k, v);
        }
    }

    public static void main(String[] args) throws Exception {
        try {
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        } catch (ClassNotFoundException e) {
            System.out.println("Unable to load Hive Driver");
            System.exit(1);
        }

        try {
            // Export the Hive table into a plain-text staging directory on HDFS first
            Connection con = DriverManager.getConnection(
                    "jdbc:hive://localhost:10000/default");
            Statement stmt = con.createStatement();
            String sql = "INSERT OVERWRITE DIRECTORY "
                    + "'hdfs://localhost:8020/user/hive/tmp' select * from users";
            stmt.executeQuery(sql);
        } catch (SQLException e) {
            System.exit(1);
        }

        int res = ToolRunner.run(new Configuration(), new HiveToMongo(), args);
        System.exit(res);
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();

        Path inputPath = new Path("/user/hive/tmp");
        String mongoDbPath = "mongodb://127.0.0.1:6900/mongo_users.mycoll";
        MongoConfigUtil.setOutputURI(conf, mongoDbPath);

        /*
         * Add dependencies to the distributed cache via
         * DistributedCache.addFileToClassPath(...):
         *   - mongo-hadoop-core-xxx.jar
         *   - mongo-java-driver-xxx.jar
         *   - hive-jdbc-xxx.jar
         * HadoopUtils is an own utility class
         */
        HadoopUtils.addDependenciesToDistributedCache("/libs/mongodb", conf);
        HadoopUtils.addDependenciesToDistributedCache("/libs/hive", conf);

        Job job = new Job(conf, "HiveToMongo");

        FileInputFormat.setInputPaths(job, inputPath);
        job.setJarByClass(HiveToMongo.class);
        job.setMapperClass(HiveToMongoMapper.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);
        job.setOutputKeyClass(IntWritable.class);    // must match the mapper's output key
        job.setOutputValueClass(BSONWritable.class); // must match the mapper's output value
        job.setNumReduceTasks(0);                    // map-only job

        job.submit();
        System.out.println("Job submitted.");
        return 0;
    }
}
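The HadoopUtils class referenced in the comment is not part of Hadoop. A minimal sketch of what such a helper could look like, assuming the dependency jars have already been copied into the given HDFS directory, might be:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: puts every jar found in an HDFS directory onto the
// job's classpath via the distributed cache.
public class HadoopUtils {

    public static void addDependenciesToDistributedCache(String hdfsDir, Configuration conf)
            throws IOException {
        FileSystem fs = FileSystem.get(conf);
        for (FileStatus status : fs.listStatus(new Path(hdfsDir))) {
            if (!status.isDir() && status.getPath().getName().endsWith(".jar")) {
                DistributedCache.addFileToClassPath(status.getPath(), conf);
            }
        }
    }
}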

One drawback is that a staging area (/user/hive/tmp) is needed to store the intermediate Hive output. Also, as far as I know, the Mongo-Hadoop connector does not support upserts.
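If the staging directory bothers you, one option (my own sketch, not part of the code above) is to wait for the job instead of just submitting it and then delete the directory afterwards, roughly like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

// Sketch: block until the export job is done, then drop the staging directory
// so the intermediate Hive output does not linger on HDFS.
public class StagingCleanup {

    public static int runAndCleanUp(Job job, Configuration conf) throws Exception {
        boolean ok = job.waitForCompletion(true);     // instead of job.submit()
        FileSystem fs = FileSystem.get(conf);
        fs.delete(new Path("/user/hive/tmp"), true);  // recursive delete of the export dir
        return ok ? 0 : 1;
    }
}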

I'm not quite sure, but you could also try to fetch the data from Hive without running the hiveserver that exposes the Thrift service, which would save some overhead. Look at the source code of Hive's org.apache.hadoop.hive.cli.CliDriver#processLine(String line, boolean allowInterrupting) method, which actually executes the query. Then you can hack together something like this:

...
LogUtils.initHiveLog4j();

CliSessionState ss = new CliSessionState(new HiveConf(SessionState.class));
ss.in  = System.in;
ss.out = new PrintStream(System.out, true, "UTF-8");
ss.err = new PrintStream(System.err, true, "UTF-8");
SessionState.start(ss);

Driver qp = new Driver();
processLocalCmd("SELECT * from users", qp, ss); // taken from CliDriver
...
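Alternatively (again just a sketch, and the exact Driver API differs between Hive versions), you can skip CliDriver entirely and talk to the Driver class directly, pulling the result rows yourself the same way processLocalCmd does internally:

import java.util.ArrayList;

import org.apache.hadoop.hive.cli.CliSessionState;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.ql.Driver;
import org.apache.hadoop.hive.ql.session.SessionState;

// Sketch only: run the query through the embedded Driver and fetch the
// results batch by batch.
public class EmbeddedHiveQuery {

    public static void main(String[] args) throws Exception {
        SessionState.start(new CliSessionState(new HiveConf(SessionState.class)));

        Driver qp = new Driver();
        qp.run("SELECT * FROM users");          // compile + execute

        ArrayList<String> rows = new ArrayList<String>();
        while (qp.getResults(rows)) {           // false once all rows are fetched
            for (String row : rows) {
                System.out.println(row);        // columns are '\001'-separated
            }
            rows.clear();
        }
        qp.close();
    }
}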

Side notes:

There is also a hive-mongo connector implementation that you could check out. It is also worth taking a look at the Hive-HBase connector implementation to get some ideas, if you want to implement something similar for MongoDB.
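To illustrate the storage-handler idea with the well-documented Hive-HBase integration: once such a handler exists, the whole copy boils down to a CREATE TABLE plus an INSERT in Hive. A hive-mongo handler would be wired up the same way, just with its own handler class and table properties (the class and property names below are the Hive-HBase ones, shown only as a sketch):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch: with a storage handler the Hive table is backed directly by the
// external store, so no separate MapReduce driver is needed for the copy.
public class StorageHandlerExample {

    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default");
        Statement stmt = con.createStatement();

        // Table whose rows live in HBase instead of HDFS
        stmt.execute("CREATE TABLE users_hbase (id int, firstname string, lastname string) "
                + "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' "
                + "WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,d:firstname,d:lastname')");

        // Copying the data is now just a Hive statement
        stmt.execute("INSERT OVERWRITE TABLE users_hbase SELECT * FROM users");

        con.close();
    }
}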

+2

Have you looked at Sqoop? It should make it very easy to move data between Hadoop and SQL/NoSQL databases. This article also provides an example of using it with Hive.

+1

Take a look at the Hadoop-MongoDB connector project:

http://api.mongodb.org/hadoop/MongoDB%2BHadoop+Connector.html

"This relationship takes the form of allowing both reading MongoDB data in Hadoop (for use in MapReduce jobs, and for other components of the Hadoop ecosystem), as well as for writing Hadoop results to MongoDB."

Not sure if it will work for your use case, but it's worth a look.
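For the reading direction mentioned in the quote, a rough sketch using the connector's MongoInputFormat could look like this (the collection URI, field name, and output path are just placeholders):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.bson.BSONObject;

import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.util.MongoConfigUtil;

// Sketch of the "reading" direction: MongoInputFormat hands each document to
// the mapper as a BSONObject, which is then written out as plain text.
public class MongoToHdfs {

    static class ReadMapper extends Mapper<Object, BSONObject, Text, IntWritable> {
        private final IntWritable one = new IntWritable(1);

        @Override
        protected void map(Object key, BSONObject value, Context context)
                throws IOException, InterruptedException {
            // emit one record per document, keyed by a field of the document
            context.write(new Text(String.valueOf(value.get("lastname"))), one);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        MongoConfigUtil.setInputURI(conf, "mongodb://127.0.0.1:27017/mongo_users.mycoll");

        Job job = new Job(conf, "MongoToHdfs");
        job.setJarByClass(MongoToHdfs.class);
        job.setMapperClass(ReadMapper.class);
        job.setInputFormatClass(MongoInputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(0);
        FileOutputFormat.setOutputPath(job, new Path("/user/hive/mongo_export"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}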

+1


