Distributing data nodes across multiple data centers

Has anyone tested the performance of DataNodes spread across multiple data centers, especially over low-bandwidth links? I can't find much information on this, and what I have found is either old (circa 2010) or proprietary (DataStax appears to offer something). I know Hadoop supports rack awareness, but as I said, I haven't seen any documentation on setting up a system across multiple data centers.

1 answer

I tried this with a 12-DataNode cluster split in a 2:1 ratio between two data centers about 120 miles apart. The latency between the data centers was ~4 ms over 2 x 1GbE.

2 racks were configured at site A and 1 rack at site B. Each "rack" had 4 machines in it. We were mainly testing site B as the "DR" site. The replication factor was set to 3.
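For readers who want to reproduce a layout like this, here is a minimal sketch of a rack-topology script of the kind Hadoop calls through the `net.topology.script.file.name` property: it receives host names or IPs as arguments and prints one rack path per argument. The IP prefixes, the rack paths, and the idea of encoding the site as the first path component are all assumptions for illustration, not the configuration used in the test above.

```python
#!/usr/bin/env python3
"""Sketch of a Hadoop rack-topology script (all addresses are hypothetical)."""
import sys

# Hypothetical mapping from IP prefix to rack path.
# Two racks at site A, one rack at site B, matching the layout described above.
RACKS = {
    "10.1.1.": "/siteA/rack1",
    "10.1.2.": "/siteA/rack2",
    "10.2.1.": "/siteB/rack1",
}

# Fallback for unknown hosts; kept at the same depth as the other paths.
DEFAULT_RACK = "/default/rack"


def resolve(host):
    """Return the rack path for a host, or the default if no prefix matches."""
    for prefix, rack in RACKS.items():
        if host.startswith(prefix):
            return rack
    return DEFAULT_RACK


def main(argv):
    # Hadoop may pass several hosts per call; print one rack path per host.
    for host in argv:
        print(resolve(host))


if __name__ == "__main__":
    main(sys.argv[1:])
```

Saved as an executable file and referenced from `core-site.xml`, a script like this would be invoked by the NameNode whenever it needs to place a node in the topology.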

In short, it works, but the performance was really, really poor. You definitely need to use compression on your source data, map output, and reduce output to cut down on I/O, and if the links between sites are used for anything else, you will get timeouts when transferring data. TCP windowing effectively limited our transfers to about 4 Mbps instead of the potential 100 Mbps+ on the 1 Gbps line.
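The window limit follows from the usual rule of thumb that a single TCP connection cannot move more than one window of data per round trip. A rough sketch of that arithmetic is below; it assumes the ~4 ms figure is a round-trip time and ignores packet loss and competing traffic, so it is not meant to reproduce the measured 4 Mbps.

```python
def max_throughput_bps(window_bytes, rtt_seconds):
    """Upper bound on single-connection throughput: one window per round trip."""
    return window_bytes * 8 / rtt_seconds


def window_needed_bytes(link_bps, rtt_seconds):
    """Bandwidth-delay product: window size needed to keep the link full."""
    return link_bps * rtt_seconds / 8


RTT = 0.004            # assumed round-trip time in seconds (~4 ms)
LINK = 1_000_000_000   # 1 Gbps link, in bits per second

print("window needed to fill the link (bytes):", window_needed_bytes(LINK, RTT))
print("ceiling with a 64 KiB window (bit/s):", max_throughput_bps(65535, RTT))
```

Under these assumptions, filling a 1 Gbps link at 4 ms takes a window of roughly 500 KB per connection, so small or unscaled windows leave most of the link idle.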

Save yourself the headache and just use distcp jobs to replicate the data.
