How do you process 1 GB of text data?

Task: Process three text files of about 1 GB each and turn them into CSV files. The source files have a custom structure, so regular expressions will be useful.

Problem: There isn't really a problem. I use PHP for it, and that works fine; I don't strictly need to process the files any faster. I'm just curious how you would approach the problem as a whole. In the end, I would like to see simple and convenient solutions that might work faster than PHP.

@felix I'm sure of that. :) Once I've finished the whole project, I will probably post it as a cross-language code challenge.

@mark My approach currently works exactly that way, except that I buffer a few hundred lines to keep the number of file writes low. A well-chosen memory trade-off there probably buys some performance. But I'm sure other approaches can beat PHP by a lot, for example by making full use of the *nix toolkit.

+4
5 answers

Firstly, it probably doesn't matter much which language you use for this, as the job is likely to be I/O-bound. What matters more is that you use an efficient approach/algorithm. In particular, you want to avoid reading the entire file into memory if possible, and avoid concatenating the result into one huge string before writing it to disk.

Instead, use a streaming approach: read a line of input, process it, and write a line of output.
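
To make that concrete, here is a minimal sketch of the streaming pattern in Python (the advice is language-agnostic, so any language would do; the file names, the regular expression, and the log-style line format below are made up for illustration, since the question doesn't show the actual structure):

    import csv
    import re

    # Hypothetical input format: "2021-05-01 12:00:00 | user=alice | action=login"
    # Replace the pattern with whatever the real custom structure looks like.
    LINE_RE = re.compile(r"^(?P<ts>\S+ \S+) \| user=(?P<user>\S+) \| action=(?P<action>\S+)$")

    def convert(src_path, dst_path):
        """Stream src_path line by line and write matching lines to dst_path as CSV."""
        with open(src_path, encoding="utf-8") as src, \
             open(dst_path, "w", newline="", encoding="utf-8") as dst:
            writer = csv.writer(dst)
            writer.writerow(["timestamp", "user", "action"])
            for line in src:  # only one line is held in memory at a time
                m = LINE_RE.match(line.rstrip("\n"))
                if m:
                    writer.writerow([m.group("ts"), m.group("user"), m.group("action")])

    convert("input1.txt", "output1.csv")

The csv writer goes through the underlying (buffered) file object, so nothing like the full result ever sits in memory, which is exactly the point above.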

+6

I would reach for sed.

+1

How would I process the amount of text data you describe? perl -pe 's/regex/magic/g' (or some similar/more complex variation).

It is pretty much ideal for this kind of work, except for the rare cases where you need the absolute maximum performance (where almost any scripting language falls short).

It is widely available, fast, and concise. I have been teaching Perl to several co-workers, and they seem to be in constant awe of the seemingly magical feats it can perform in one or two lines of code. Joking aside, it is quite feasible to do this while staying completely readable (assuming you have a reasonable grasp of the language and no desire to create hell for future maintainers).

+1

Perl is the old text-processing wizard for good reason. But many of Perl's strengths can be found in Python today, in what I think is a more accessible form, so when it comes to parsing text I usually reach for Python (I have parsed multi-GB files with it).

AWK or sed are probably lightning fast as well, but not as easily extensible as Perl and Python. In your particular case you don't need to do much more than parse the input and format the output, but if you ever want to do more, it will be easier if you are already using Perl or Python.

I can't find any argument against Python compared with the alternatives, so that would be my suggestion.
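
To illustrate the "easier to do more later" point, here is how the streaming sketch from the first answer might grow one extra step in Python; the per-user counting and the field names are hypothetical, just to show the shape:

    import csv
    import re
    from collections import Counter

    # Same hypothetical "user=... | action=..." line format as in the earlier sketch.
    LINE_RE = re.compile(r"user=(?P<user>\S+) \| action=(?P<action>\S+)")

    def convert_and_count(src_path, dst_path):
        """Write the CSV as before, but also tally actions per user on the fly."""
        counts = Counter()
        with open(src_path, encoding="utf-8") as src, \
             open(dst_path, "w", newline="", encoding="utf-8") as dst:
            writer = csv.writer(dst)
            for line in src:
                m = LINE_RE.search(line)
                if m:
                    writer.writerow([m.group("user"), m.group("action")])
                    counts[m.group("user")] += 1  # the "do more" part: one extra line
        return counts

Adding that kind of side computation to an AWK or sed one-liner is where those tools start to feel cramped.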

0
