Pulling data from MySQL into Hadoop

I'm just starting to learn Hadoop, and I'm interested in the following: suppose I have a bunch of large MySQL tables that I want to analyze.

  • It seems that I should dump all the tables into text files in order to bring them into the Hadoop file system - is this correct, or is there some way for Hive, Pig, or some other tool to access the data in MySQL directly? (A Hive sketch follows this list.)
  • If I dump all the production tables into text files, do I need to worry about the performance impact during the dump? (Does that depend on which storage engine the tables use, and if so, what should I do?)
  • Is it better to dump each table into a single file, or to split each table into 64 MB files (or whatever my block size is)?
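
On the first bullet: once the dumped text files are sitting in HDFS, Hive can query them in place through an external table, so the analysis itself does not need direct access to MySQL. A minimal sketch, assuming tab-separated dump files; the table name, columns, and HDFS path are invented for the example:

    # Query the dumped, tab-separated text files in place with a Hive external table
    hive -e "
      CREATE EXTERNAL TABLE orders_ext (
        id        INT,
        customer  STRING,
        amount    DOUBLE
      )
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      STORED AS TEXTFILE
      LOCATION '/user/hadoop/orders';
    "
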
2 answers

Importing data from MySQL is very easy. I recommend the Cloudera Hadoop distribution, which includes a tool called Sqoop that provides a very simple interface for importing data directly from MySQL (other databases are supported as well). Sqoop can be used with mysqldump or with a regular MySQL query (select * ...). With this tool there is no need to split the tables into files manually, and for Hadoop it is much better to have one large file anyway.
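
A minimal Sqoop invocation might look like the sketch below; the connection string, credentials, table name, and HDFS target directory are made-up placeholders:

    # Import one MySQL table into HDFS as delimited text files
    sqoop import \
      --connect jdbc:mysql://dbhost/mydb \
      --username hadoop_user \
      -P \
      --table orders \
      --target-dir /user/hadoop/orders \
      --num-mappers 4

--num-mappers controls how many map tasks read from MySQL in parallel; Sqoop splits the work on the table's primary key unless you point it at another column with --split-by.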

Useful Links:
Sqoop User Guide


2) A full dump of production tables does have an impact, and how bad it is depends on the storage engine: a plain dump locks MyISAM tables for its duration, while InnoDB can be read from a consistent snapshot without blocking writes. If that is a concern, run the dump against a replica or during off-peak hours.
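
As an illustration of a dump that is gentle on an InnoDB production table (a sketch only; the database name, table name, and output path are invented - plain SELECTs on InnoDB read a consistent snapshot without blocking writers):

    # Export one table as tab-separated text with a plain SELECT
    # (--batch prints tab-separated rows, --skip-column-names drops the header)
    mysql --batch --skip-column-names \
      -e "SELECT * FROM orders" mydb > /tmp/orders.tsv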


3) You do not need to split the files yourself: when you load a file into HDFS, HDFS splits it into blocks (64 MB by default) and distributes them across the cluster on its own.
See the Apache HDFS Architecture guide for the details.

re: Wojtek's answer - see the Sqoop link above.
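
To make the block handling concrete, here is a small sketch (paths invented) that copies a dump file into HDFS and then asks HDFS which block size it used:

    # Copy the exported file into HDFS; HDFS splits it into blocks by itself
    hadoop fs -put /tmp/orders.tsv /user/hadoop/orders/orders.tsv

    # Print the block size HDFS applied to the file (e.g. 67108864 bytes = 64 MB)
    hadoop fs -stat %o /user/hadoop/orders/orders.tsv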
