Parsing poorly formatted log files?

I work with some log files that are very poorly formatted: the column separator is a character that (often) appears inside a field and is not escaped. For instance:

sam,male,september,brown,blue,i like cats, and i like dogs 

Where:

 name,gender,month,hair,eyes,about 

So, as you can see, about contains the column separator, which means that naively splitting on the separator will not work, because it would break about into two separate columns. Now imagine the about field coming from a chat system ... you can visualize the problems, I'm sure.

So, theoretically, what's the best approach to solving this? I'm not looking for a language-specific implementation, but rather a general pointer in the right direction, or some ideas on how others have solved it ... without doing it manually.

Edit:

I should clarify: my actual logs are in much worse shape. There are fields with separator characters all over the place, and there is no pattern I can find.

+4
7 answers

Here are two ideas you could try:

  • Length / format patterns . I think you could identify some patterns in the individual columns of the file. For example, the values in some columns are typically shorter, while the values in others are typically longer. The values in some columns are usually numbers, or come from a limited set of values (for example, months), or at least often contain some common substring.

    When you can identify these patterns (based on statistics computed from rows that do split correctly), you can build an algorithm that uses them to guess which separators should be ignored (for example, when a column would otherwise be shorter than expected).

  • Grammar rules . Another idea inspired by your example: are the commas that should not act as separators usually followed by certain words (like "and" or "about")? If so, you may be able to use that information to guess which separators should be treated as part of the field.

Finally, if neither of these ad-hoc techniques solves your problem, you can throw some heavier statistics at it. There are machine learning frameworks that can do the heavy statistical lifting for you, but this is still a fairly tricky problem. For example, in .NET you could use Infer.NET from Microsoft Research.
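As a rough illustration of the first idea, here is a minimal Python sketch; the field names, allowed value sets, and scoring weights are assumptions lifted from the question's example row, not from real log data. It enumerates the possible choices of "real" separators and keeps the candidate split that best fits the expected column patterns.

    from itertools import combinations

    # Assumed column constraints, based only on the example row in the question;
    # a real log would need its own value sets and length statistics.
    MONTHS = {"january", "february", "march", "april", "may", "june",
              "july", "august", "september", "october", "november", "december"}
    GENDERS = {"male", "female"}
    NUM_FIELDS = 6

    def score(fields):
        # Higher score = candidate looks more like a well-formed record.
        name, gender, month, hair, eyes, about = (f.strip() for f in fields)
        s = 0
        s += 2 if gender in GENDERS else -2
        s += 2 if month in MONTHS else -2
        # The constrained columns are usually short single words without commas.
        for value in (name, hair, eyes):
            s += 1 if (" " not in value and "," not in value) else -1
        return s

    def best_split(line, n=NUM_FIELDS):
        parts = line.split(",")
        if len(parts) <= n:                  # nothing ambiguous to resolve
            return parts
        candidates = []
        # Choose which n-1 of the commas act as real separators.
        for cut in combinations(range(1, len(parts)), n - 1):
            bounds = (0,) + cut + (len(parts),)
            fields = [",".join(parts[a:b]) for a, b in zip(bounds, bounds[1:])]
            candidates.append((score(fields), fields))
        return max(candidates)[1]

    print(best_split("sam,male,september,brown,blue,i like cats, and i like dogs"))
    # -> ['sam', 'male', 'september', 'brown', 'blue', 'i like cats, and i like dogs']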

+1

If only the last column can contain unescaped commas, then most languages' string-split implementations can limit the number of splits performed, e.g. in Python s.split(',', 5)
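For example, a minimal sketch with the question's sample row (assuming about is always the last field):

    line = "sam,male,september,brown,blue,i like cats, and i like dogs"

    # maxsplit=5: only the first five commas act as delimiters, the rest stay in 'about'.
    name, gender, month, hair, eyes, about = line.split(",", 5)
    print(about)   # -> "i like cats, and i like dogs"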

If you want to hand the file to a CSV (comma-separated values) parser, then I think the best approach is to run a preprocessing pass that performs proper escaping before passing it to the CSV parser.

+4

I suppose you can make certain assumptions about the data types. For example, gender , month , hair and eyes each have a limited range of values, so check against that.

It also seems reasonable that all fields except about (and possibly name ) will not contain a comma, so you can probably split greedily, treating the first 5 or 6 commas as delimiters and everything after them as part of about . Recheck the result if necessary.
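A small sketch of that greedy split plus recheck, in Python (the allowed value sets below are only guesses based on the example columns):

    # Hypothetical value ranges for the constrained columns; in practice,
    # build these from rows that are known to parse cleanly.
    ALLOWED = {
        1: {"male", "female"},                                            # gender
        2: {"january", "february", "march", "april", "may", "june", "july",
            "august", "september", "october", "november", "december"},    # month
    }

    def parse_and_check(line):
        # Greedy split: only the first 5 commas are treated as delimiters;
        # everything after them belongs to 'about'.
        fields = line.split(",", 5)
        ok = len(fields) == 6 and all(
            fields[i].strip().lower() in values for i, values in ALLOWED.items()
        )
        return fields, ok      # ok == False means the row needs a second look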

+3

It would be impossible to parse them perfectly unless escaping is used.

Lee Ryan noted that if only the last column can contain these characters, you have an option.

If this is not the case, are there any columns that are guaranteed never to contain unescaped reserved characters? Also, are there any columns that will always contain only a specific set of values?

If either of these is true, you can identify those fields first and then work outward from there to separate everything else.

I would need to know more about your data to go any further.

+2

One thing I would suggest doing, if possible, is to store something in each parsed record that indicates the assumptions that were made (perhaps keeping the original row), so that if something turns out to be wrong with a record, the correct data can hopefully be reconstructed (by examining it manually, if nothing else).
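For instance, something along these lines (Python; the field names come from the question's example, and the extra keys are just illustrative):

    FIELD_NAMES = ["name", "gender", "month", "hair", "eyes", "about"]

    def parse(line):
        fields = line.rstrip("\n").split(",", 5)
        record = dict(zip(FIELD_NAMES, fields))
        record["raw"] = line                         # keep the original row for later repair
        record["guessed"] = line.count(",") > 5      # True if we had to guess where 'about' starts
        return record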

0

If the 6th column is always the last one and is the only one that can contain unescaped commas, this bit of Perl should do the trick:

    $file = '/path/to/my/log.txt';
    open(LOG, $file);
    @lines = <LOG>;
    foreach $line (@lines) {
        chomp($line);
        if ($line =~ /([A-Za-z0-9_]+)\,([A-Za-z0-9_]+)\,([A-Za-z0-9_]+)\,([A-Za-z0-9_]+)\,([A-Za-z0-9_]+)\,([A-Za-z0-9_\, ]+)/) {
            print "Name: $1\n";
            print "Gender: $2\n";
            print "Month: $3\n";
            print "Color #1: $4\n";
            print "Color #2: $5\n";
            print "Random Text: $6\n";
        }
    }
    close(LOG);
0

Your logs are ambiguous: you cannot be sure which of the many possible interpretations is the right one. Dealing with uncertainty is a job for probability theory. A natural tool here is a probabilistic context-free grammar - there are algorithms for finding the most probable parse. (I have had no occasion to use one myself, although I have tackled simpler problems with this kind of statistical approach. Peter Norvig's spelling-correction article works through one such example in detail.)

For your specific, simplified example: you could enumerate all the possible ways to divide the string into N parts (where you already know what N to expect), compute the probability of each according to some model, and choose the best answer.
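A minimal sketch of that idea in Python, assuming you can estimate a per-column model P(value | column) from the rows that split cleanly (the training data and smoothing here are made up for illustration):

    import math
    from itertools import combinations
    from collections import Counter

    def train(clean_rows, n_cols):
        # Count how often each value appears in each column of the clean rows.
        counts = [Counter() for _ in range(n_cols)]
        for row in clean_rows:
            for col, value in enumerate(row):
                counts[col][value.strip()] += 1
        return counts

    def log_prob(counts, fields):
        # Laplace-smoothed log-probability of a candidate parse under the column model.
        total = 0.0
        for col, value in enumerate(fields):
            c = counts[col]
            total += math.log((c[value.strip()] + 1) / (sum(c.values()) + len(c) + 1))
        return total

    def most_probable_parse(line, counts, n_cols):
        parts = line.split(",")
        best, best_fields = float("-inf"), None
        # Enumerate every way of merging the parts back into n_cols fields.
        for cut in combinations(range(1, len(parts)), n_cols - 1):
            bounds = (0,) + cut + (len(parts),)
            fields = [",".join(parts[a:b]) for a, b in zip(bounds, bounds[1:])]
            lp = log_prob(counts, fields)
            if lp > best:
                best, best_fields = lp, fields
        return best_fields

    # Train on rows that split unambiguously, then parse an ambiguous one.
    clean = [["ann", "female", "june", "black", "green", "hello"],
             ["bob", "male", "september", "brown", "blue", "hi there"]]
    model = train(clean, 6)
    print(most_probable_parse("sam,male,september,brown,blue,i like cats, and i like dogs", model, 6))
    # -> ['sam', 'male', 'september', 'brown', 'blue', 'i like cats, and i like dogs']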

(Another example of dealing with noisily erased distinctions: I had a dataset of tags from half a million Flickr photos. The tags came out of their API with the words run together, without spaces. I computed the most likely word boundaries using word frequencies tabulated from photography web sites, plus code like in this SO answer.)

0
