Java vs Python on Hadoop

I'm working on a project that uses Hadoop. Hadoop itself is written in Java, but it provides streaming support for Python. Does the choice of one over the other have a significant impact? I'm early enough in the process that I could go either way if there is a significant performance difference.

+50
java python hadoop
Sep 26 '09 at 21:55
3 answers

Java is less dynamic than Python, and more effort has gone into its virtual machine, which makes it faster. Python is also held back by its Global Interpreter Lock, which means it cannot spread the threads of a single process across multiple cores.

Whether this makes any significant difference depends on what you intend to do. I suspect both languages will work for you.

+13
Sep 26 '09 at 22:03

With Python you will likely develop faster, and with Java it will certainly run faster.

Google "benchmarksgame", if you want to see a very accurate speed comparison between all popular languages, but if I remember correctly, you are talking about 3-5 times faster.

However, most work is not CPU-bound these days, so if you feel you would develop better in Python, keep that in mind!




In response to a comment (asking how Java can be faster than Python):

All languages perform differently. Java is the fastest after C and C++ (which can be as fast as Java or up to 5x faster, but seem to be about 2x faster on average). The rest are 2-5x+ slower, with Python being one of the faster of them. I'd guess C# is about as fast as Java, or perhaps faster, but the benchmarks game only had Mono (which was a little slower) because they don't run it on Windows.

Most of these statements are based on the Computer Language Benchmarks Game, which tends to be pretty fair because advocates/experts in each language tune the test written in their particular language to ensure good code.

For example, this page shows all the benchmarks of Java vs C++, and you can see that the speeds range from roughly equal to Java being 3x slower (the first column is between 1 and 3), and Java uses much more memory!

Now this page shows Java vs Python (from Python's perspective). The speeds range from Python being 2x slower than Java to 174x slower; Python generally beats Java on code size and memory usage, though.

Another interesting point: in the benchmarks that allocated a lot of memory, Java actually did significantly better than Python in memory size. I'm pretty sure Java usually loses on memory because of the overhead of the VM, but once that factor drops out, Java is probably more efficient than most (except C, again).

This is Python 3, by the way; the other Python platform tested (just called Python) fared much worse.

If you really want to know how Java manages to be faster, the VM is amazingly intelligent. It compiles to machine language AFTER running the code, so it knows which code paths are most likely and can optimize for them. Memory allocation is an art, and really useful in an OO language. The VM can perform some amazing runtime optimizations that no non-VM language can do, and it can run in amazingly small amounts of memory when pressed; it is one of the languages of choice for embedded devices, along with C/C++.

I worked on a signal analyzer for Agilent (think expensive o-scope) where virtually everything (except sampling) was done in Java. That includes screen drawing, including the trace (AWT), and interaction with the controls.

I'm currently working on a project for all future cable set-top boxes. The guide, along with most other apps, will be written in Java.

Why wouldn't it be faster than Python?

+24
Sep 26 '09 at 23:51

You can write Hadoop mapreduce transformations either as "streaming" jobs or as a "custom jar." If you use streaming, you can write your code in any language you like, including Python or C++. Your code just reads from STDIN and writes to STDOUT. However, in Hadoop versions before 0.21, streaming only passed text, not binary, to your processes. So your files had to be text files, unless you did some funky conversions yourself. But now a patch has been added that lets you use binary formats with Hadoop streaming.
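To make the streaming contract concrete, here is a minimal sketch of a streaming mapper in Python (a word-count mapper; the function names and whitespace tokenization are illustrative, not from the original post). Hadoop streaming feeds one input record per line on STDIN and expects tab-separated key/value pairs on STDOUT:

```python
import sys


def map_words(line):
    """Emit (word, 1) pairs for each whitespace-separated token in a line."""
    for word in line.strip().split():
        yield word, 1


def main():
    # Hadoop streaming passes one input record per line on STDIN;
    # each emitted line is a tab-separated key/value pair on STDOUT.
    for line in sys.stdin:
        for word, count in map_words(line):
            print(f"{word}\t{count}")


if __name__ == "__main__":
    main()
```

Hadoop then sorts the mapper output by key before handing it to the reducer, which is what makes this style so simple to write.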

If you use a "custom jar" (i.e. you wrote your mapreduce code in Java or Scala using the Hadoop libraries), then you have access to functions that let you output and input binary files (serialized in binary format) from your map and reduce processes (and save the results to disk). So future runs will be much faster (depending on how much smaller your binary format is than your text format).

So if your Hadoop job is I/O bound, the "custom jar" approach will be faster (since, as previous posters showed, Java is faster, and reading from disk will also be faster).

But you have to ask yourself how valuable your time is. I find myself much more productive in Python, and writing map-reduce that reads from STDIN and writes to STDOUT is very simple. So I personally recommend the Python route, even if you have to build the binary support yourself. Since Hadoop 0.21 handles non-utf8 byte arrays, and since there is a binary (byte array) alternative for Python ( http://dumbotics.com/2009/02/24/hadoop-1722-and-typed-bytes/ ) showing Python code only about 25% slower than the "custom jar" Java code, I would definitely go the Python route.
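As a sketch of how little code the STDIN/STDOUT style takes, here is a hypothetical word-count reducer in Python (names are illustrative). Because Hadoop streaming delivers the mapper output sorted by key, the reducer only has to sum each run of identical keys:

```python
import sys


def reduce_counts(lines):
    """Sum the counts for each run of identical keys in key-sorted lines."""
    current_key, total = None, 0
    for line in lines:
        # Each line is a tab-separated "key<TAB>count" pair from the mapper.
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current_key:
            if current_key is not None:
                yield current_key, total
            current_key, total = key, 0
        total += int(value)
    if current_key is not None:
        yield current_key, total


def main():
    for key, total in reduce_counts(sys.stdin):
        print(f"{key}\t{total}")


if __name__ == "__main__":
    main()
```

The same pattern works for any streaming reducer: track the current key, aggregate while it repeats, and emit when it changes.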

+14
Jul 14 '11 at 10:15
