Pulling data from MySQL into Hadoop

I'm just starting to learn Hadoop, and I'm interested in the following: suppose I have a bunch of large MySQL tables that I want to analyze.

  • It seems that I should dump all the tables into text files in order to bring them into the Hadoop file system - is this correct, or is there some way for Hive, Pig, or some other tool to access the data in MySQL directly? (A Hive sketch follows this list.)
  • If I dump all the production tables into text files, do I need to worry about the performance impact during the dump? (Does that depend on which storage engine the tables use, and if so, what should I do?)
  • Is it better to dump each table into a single file, or to split each table into 64 MB files (or whatever my block size is)?
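
On the first bullet: once the dumped text files are sitting in HDFS, Hive can query them in place through an external table, so the analysis itself does not need direct access to MySQL. A minimal sketch, assuming tab-separated dump files; the table name, columns, and HDFS path are invented for the example:

    # Query the dumped, tab-separated text files in place with a Hive external table
    hive -e "
      CREATE EXTERNAL TABLE orders_ext (
        id        INT,
        customer  STRING,
        amount    DOUBLE
      )
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      STORED AS TEXTFILE
      LOCATION '/user/hadoop/orders';
    "
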
2 answers

Importing data from MySQL is very easy. I recommend the Cloudera Hadoop distribution, which includes a tool called Sqoop that provides a very simple interface for importing data directly from MySQL (other databases are supported as well). Sqoop can be used with mysqldump or with a regular MySQL query (select * ...). With this tool there is no need to split the tables into files manually, and for Hadoop it is much better to have one large file anyway.
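
A minimal Sqoop invocation might look like the sketch below; the connection string, credentials, table name, and HDFS target directory are made-up placeholders:

    # Import one MySQL table into HDFS as delimited text files
    sqoop import \
      --connect jdbc:mysql://dbhost/mydb \
      --username hadoop_user \
      -P \
      --table orders \
      --target-dir /user/hadoop/orders \
      --num-mappers 4

--num-mappers controls how many map tasks read from MySQL in parallel; Sqoop splits the work on the table's primary key unless you point it at another column with --split-by.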

Useful Links:
Sqoop User Guide


2) A full dump of production tables does have an impact, and how bad it is depends on the storage engine: a plain dump locks MyISAM tables for its duration, while InnoDB can be read from a consistent snapshot without blocking writes. If that is a concern, run the dump against a replica or during off-peak hours.
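
As an illustration of a dump that is gentle on an InnoDB production table (a sketch only; the database name, table name, and output path are invented - plain SELECTs on InnoDB read a consistent snapshot without blocking writers):

    # Export one table as tab-separated text with a plain SELECT
    # (--batch prints tab-separated rows, --skip-column-names drops the header)
    mysql --batch --skip-column-names \
      -e "SELECT * FROM orders" mydb > /tmp/orders.tsv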


3) You do not need to split the files yourself: when you load a file into HDFS, HDFS splits it into blocks (64 MB by default) and distributes them across the cluster on its own.
See the Apache HDFS Architecture guide for the details.

re: Wojtek's answer - see the Sqoop link above.
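
To make the block handling concrete, here is a small sketch (paths invented) that copies a dump file into HDFS and then asks HDFS which block size it used:

    # Copy the exported file into HDFS; HDFS splits it into blocks by itself
    hadoop fs -put /tmp/orders.tsv /user/hadoop/orders/orders.tsv

    # Print the block size HDFS applied to the file (e.g. 67108864 bytes = 64 MB)
    hadoop fs -stat %o /user/hadoop/orders/orders.tsv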
