Parsing large text files in real time (Java)

I need to parse a fairly large text file in Java (1.6.x) and am wondering what approach would be considered best practice.

The file is likely to be about 1 MB in size and will consist of thousands of records, one per line, along the lines of:

Entry { property1=value1 property2=value2 ... } 

and so on.

My first instinct is to use regular expressions, but since I have no experience using Java in a production environment, I'm not sure how powerful the java.util.regex classes are.

To clarify a bit: my application will be a web application (JSP) that parses the file in question and displays the various values it retrieves. There is only ever one file that gets parsed (it resides in a third-party directory on the host).

The application will see fairly low use (maybe only a handful of users, a couple of times a day), but when it is used, it is important that the information be extracted as quickly as possible.

Also, are there any precautions to take with loading the file into memory each time it is parsed?

Can someone recommend an approach to take here?

thanks

+7
java regex parsing
9 answers

If it's about 1 MB and literally in the format you specify, then it sounds like you're overengineering things.

Unless your server is a ZX Spectrum or something, just use regular expressions to parse it, stick the data in a hash map (and keep it there), and don't worry about it. It will take a few megabytes of memory, but so what...?

Update: to give you a concrete idea of performance, some measurements I took of String.split() performance (which uses regular expressions) show that on a 2 GHz machine it takes milliseconds to split 10,000 lines of 100 characters each (in other words, about 1 megabyte of data; actually nearer 2 MB in raw byte size, since Java strings are 2 bytes per char). Obviously that's not exactly the operation you'll be performing, but you get the idea: things really aren't that bad...
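For what it's worth, here is a minimal sketch of that approach, assuming one record per line in the Entry { key=value ... } format from the question, with values that contain no whitespace (the class name and pattern are illustrative, not a fixed recipe):

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class EntryParser {

        // One key=value pair; the value stops at whitespace or a closing brace.
        private static final Pattern PAIR = Pattern.compile("(\\w+)=([^\\s}]+)");

        public static List<Map<String, String>> parse(File file) throws IOException {
            List<Map<String, String>> entries = new ArrayList<Map<String, String>>();
            BufferedReader reader = new BufferedReader(new FileReader(file));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    // Each line is assumed to hold one Entry { ... } record.
                    Map<String, String> entry = new HashMap<String, String>();
                    Matcher m = PAIR.matcher(line);
                    while (m.find()) {
                        entry.put(m.group(1), m.group(2));
                    }
                    if (!entry.isEmpty()) {
                        entries.add(entry);
                    }
                }
            } finally {
                reader.close();
            }
            return entries;
        }
    }

Parse once, keep the resulting list in memory, and each web request becomes a simple lookup rather than a reparse.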

+8

If it's going to be a proper grammar, use a parser builder such as the GOLD Parsing System. That lets you specify the format and use an efficient parser to get the tokens you need, and you get error handling almost for free.

+5

I wonder why this isn't in XML; then you could leverage the available XML tools. I'm thinking of SAX in particular, in which case you could easily parse/process it without holding it all in memory.

So can you convert this to XML?

If you can't, and you need a parser, take a look at JavaCC.
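For illustration, if each record were converted to something like <entry property1="value1" property2="value2"/>, a SAX handler could collect the data without building the whole document in memory. A hypothetical sketch (the element name, attribute layout, and file name are assumptions):

    import java.io.File;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class EntrySaxHandler extends DefaultHandler {

        private final List<Map<String, String>> entries =
                new ArrayList<Map<String, String>>();

        public void startElement(String uri, String localName, String qName,
                                 Attributes attributes) {
            // Each <entry> element becomes one map of its attributes.
            if ("entry".equals(qName)) {
                Map<String, String> entry = new HashMap<String, String>();
                for (int i = 0; i < attributes.getLength(); i++) {
                    entry.put(attributes.getQName(i), attributes.getValue(i));
                }
                entries.add(entry);
            }
        }

        public List<Map<String, String>> getEntries() {
            return entries;
        }

        public static void main(String[] args) throws Exception {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            EntrySaxHandler handler = new EntrySaxHandler();
            parser.parse(new File("entries.xml"), handler);
            System.out.println(handler.getEntries().size() + " entries parsed");
        }
    }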

+4

Use the Scanner class and process the file a line at a time. I'm not sure why you mentioned regex. Regex is almost never the right answer to a parsing question, because of the ambiguity and the lack of semantic control over what is happening in which context.
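A minimal sketch of that approach, again assuming the one-record-per-line Entry { key=value ... } format from the question (the file name and the key/value handling are illustrative):

    import java.io.File;
    import java.io.FileNotFoundException;
    import java.util.Scanner;

    public class ScannerDemo {
        public static void main(String[] args) throws FileNotFoundException {
            Scanner scanner = new Scanner(new File("entries.txt"));
            try {
                while (scanner.hasNextLine()) {
                    // Each line holds one record: Entry { k1=v1 k2=v2 ... }
                    String line = scanner.nextLine();
                    for (String token : line.split("\\s+")) {
                        int eq = token.indexOf('=');
                        if (eq > 0) {
                            String key = token.substring(0, eq);
                            String value = token.substring(eq + 1);
                            System.out.println(key + " -> " + value);
                        }
                    }
                }
            } finally {
                scanner.close();
            }
        }
    }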

+3

You could use the ANTLR parser generator to create a parser capable of parsing your files.

+2

Not answering the parsing question as such, but you could parse the files and generate static pages as soon as new files arrive. That way you would have no performance problem... (And I think 1 MB is not a big file, so you can load it into memory, as long as you don't load too many files concurrently...)

+1

This seems like a fairly simple file format, so you may want to consider using a recursive descent parser. Compared with JavaCC and ANTLR, its pros are that you can write a few simple methods, get the data you need, and do not have to learn a parser generator's formalism. Its con is that it may be less efficient. A recursive descent parser is fundamentally more powerful than regular expressions, and if you can come up with a grammar for this type of file, it will serve you for whatever solution you choose.
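To illustrate, a hand-written recursive descent parser for this format can stay very small. This sketch assumes the grammar record := "Entry" "{" pair* "}", with one method per production; names are hypothetical:

    import java.util.HashMap;
    import java.util.Map;

    public class RecordParser {

        private final String[] tokens;
        private int pos;

        public RecordParser(String line) {
            // Whitespace tokenization; each key=value pair is one token.
            this.tokens = line.trim().split("\\s+");
        }

        // record := "Entry" "{" pair* "}"
        public Map<String, String> parseRecord() {
            expect("Entry");
            expect("{");
            Map<String, String> pairs = new HashMap<String, String>();
            while (!"}".equals(peek())) {
                parsePair(pairs);
            }
            expect("}");
            return pairs;
        }

        // pair := key "=" value
        private void parsePair(Map<String, String> pairs) {
            String token = next();
            int eq = token.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Expected key=value, got: " + token);
            }
            pairs.put(token.substring(0, eq), token.substring(eq + 1));
        }

        private String peek() {
            if (pos >= tokens.length) {
                throw new IllegalArgumentException("Unexpected end of record");
            }
            return tokens[pos];
        }

        private String next() {
            String token = peek();
            pos++;
            return token;
        }

        private void expect(String expected) {
            String actual = next();
            if (!expected.equals(actual)) {
                throw new IllegalArgumentException(
                        "Expected '" + expected + "' but found '" + actual + "'");
            }
        }
    }

Malformed input fails with a descriptive exception, which is the error handling the parser-generator answers mention, just written by hand.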

+1

If it's the limitations of Java's regexes you're worried about, don't be. Assuming you are reasonably competent at crafting regular expressions, performance should not be a problem. The feature set is satisfyingly rich too, including my favorite: possessive quantifiers.
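As a small illustration of possessive quantifiers (the pattern is an assumed example, not from the question): a possessive quantifier such as ++ never gives back what it has matched, so a failing match fails fast instead of backtracking.

    import java.util.regex.Pattern;

    public class PossessiveDemo {
        public static void main(String[] args) {
            // "\\w++" consumes word characters possessively: the matcher
            // never backtracks into them, avoiding the pathological
            // backtracking that a plain "\\w+" could allow.
            System.out.println(Pattern.matches("\\w++=\\w++", "property1=value1")); // true
            System.out.println(Pattern.matches("\\w++=\\w++", "property1="));       // false
        }
    }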

+1

Another solution would be to do some form of preprocessing (offline, or as a cron job) that produces a highly optimized data structure, which is then used to serve the many web requests (without having to reparse the file each time).

Looking at the scenario in question, though, this does not appear to be needed.
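A sketch of the lighter-weight variant of that idea: keep the parsed structure in memory and rebuild it only when the file's timestamp changes. EntryParser.parse here stands for a hypothetical parsing routine like the one sketched in the first answer:

    import java.io.File;
    import java.io.IOException;
    import java.util.List;
    import java.util.Map;

    public class CachedEntries {

        private final File file;
        private long lastModified = -1;
        private List<Map<String, String>> entries;

        public CachedEntries(File file) {
            this.file = file;
        }

        // Reparse only when the underlying file has changed on disk.
        public synchronized List<Map<String, String>> get() throws IOException {
            long stamp = file.lastModified();
            if (entries == null || stamp != lastModified) {
                entries = EntryParser.parse(file); // hypothetical parse routine
                lastModified = stamp;
            }
            return entries;
        }
    }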

+1
