Java CSV parser with unescaped quotes

I have a CSV file that has some problems with quotes:

"Albanese Confectionery","157137","ALBANESE BULK ASST. MINI WILD FRUIT WORMS 2" 4/5LB",9,90,0,0,0,.53,"21",50137,"3441851137","5 lb",1,4,4,$6.7,$6.7,$26.8 

SuperCSV is choking on these fruit worms (pun intended). I know that 2" should be 2"", but it isn't. LibreOffice actually parses it correctly (which surprises me). I thought about just writing my own parser, but other lines have a comma inside a quoted field:

 "Albanese Confectionery","157230","ALBANESE BULK JET FIGHTERS,ASSORTED 4/5 B",9,90,0,0,0,.53,"21",50230,"3441851230","5 lb",1,4,4,$6.7,$6.7,$26.8 

Does anyone know of a Java library that will handle such crazy things? Or should I try all the available ones? Or am I better off hacking it myself?

+4
3 answers

The correct solution is to find the person who generated the data and beat them over the head with a keyboard until they fix the problem at their end.

Once you have exhausted that route, you can try some of the other CSV parsers on the market. I have used OpenCSV with success in the past.

Even if OpenCSV does not solve the problem out of the box, the code is quite easy to read and is available under the Apache license, so it would be possible to modify the algorithm to work with your wonky data, and that is probably easier than starting from scratch.

+6

I'm surprising even myself here, but I think I'd hack it myself. You only need to read the lines and generate tokens, splitting on quotes and commas as appropriate, so you can adapt the logic to whatever suits you. It is not very difficult. The file seems to be broken, so working through the existing solutions looks like more work.

One point: if LibreOffice already parses it correctly, could you just re-save the file from there and end up with a saner file? However, if you think LibreOffice might only be guessing, just write a tokenizer yourself.
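As a rough sketch of the hand-written tokenizer suggested above: the class and method names below are made up for illustration and belong to no library. It assumes one record per line, and that a quote inside a quoted field only closes the field when the next character is the delimiter or the end of the line. Any other quote is kept as literal text.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of any CSV library. */
final class LenientCsvSplitter {

    /** Splits one line into fields, tolerating unescaped quotes inside quoted fields. */
    static List<String> split(String line, char delimiter, char quote) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        boolean atFieldStart = true;
        int n = line.length();

        for (int i = 0; i < n; i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c != quote) {
                    field.append(c);
                } else if (i + 1 == n || line.charAt(i + 1) == delimiter) {
                    inQuotes = false;        // closing quote: followed by delimiter or end of line
                } else if (line.charAt(i + 1) == quote) {
                    field.append(quote);     // properly escaped "" becomes one literal quote
                    i++;
                } else {
                    field.append(quote);     // stray quote, e.g. the inch mark in 2" 4/5LB
                }
            } else if (c == delimiter) {
                fields.add(field.toString());
                field.setLength(0);
                atFieldStart = true;
            } else if (c == quote && atFieldStart) {
                inQuotes = true;             // opening quote is only recognised at field start
                atFieldStart = false;
            } else {
                field.append(c);
                atFieldStart = false;
            }
        }
        fields.add(field.toString());        // last field has no trailing delimiter
        return fields;
    }

    public static void main(String[] args) {
        String sample = "\"Albanese Confectionery\",\"157137\","
                + "\"ALBANESE BULK ASST. MINI WILD FRUIT WORMS 2\" 4/5LB\",9,90";
        for (String f : split(sample, ',', '"')) {
            System.out.println(f);
        }
    }
}
```

The same guess that makes this work is also its weak spot: an inch mark that happens to sit directly before a comma inside a field would be taken for the closing quote. Leading or trailing whitespace on a line is not trimmed either, so a line that starts with a space is read as beginning with an unquoted field.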

+1

+1 for "choking on these fruit worms" - I almost choked on my coffee :)

If you really cannot get that CSV fixed at the source, then you can simply supply your own tokenizer (Super CSV is very flexible like that!).

You would normally have to write your own implementation of readColumns(), but it is quicker to extend the default Tokenizer and override the readLine() method to intercept the String (and fix the unescaped quotes) before it is tokenized.

I made the assumption that any quote that is not next to a delimiter, or at the beginning / end of a line, should be escaped. This is far from ideal, but it works for the sample input. You can implement this however you like - it was too early in the morning for me to use regex :)

That way, you don't have to modify Super CSV at all (the tokenizer just plugs in), so you still get all the other features such as cell processors and bean mapping.

    package org.supercsv;

    import java.io.IOException;
    import java.io.Reader;

    import org.supercsv.io.Tokenizer;
    import org.supercsv.prefs.CsvPreference;

    public class FruitWormTokenizer extends Tokenizer {

        public FruitWormTokenizer(Reader reader, CsvPreference preferences) {
            super(reader, preferences);
        }

        @Override
        protected String readLine() throws IOException {
            final String line = super.readLine();
            if (line == null) {
                return null;
            }

            final char quote = (char) getPreferences().getQuoteChar();
            final char delimiter = (char) getPreferences().getDelimiterChar();

            // escape all quotes not next to a delimiter (or start/end of line)
            final StringBuilder b = new StringBuilder(line);
            for (int i = b.length() - 1; i >= 0; i--) {
                if (quote == b.charAt(i)) {
                    final boolean validCharBefore = i - 1 < 0
                            || b.charAt(i - 1) == delimiter;
                    final boolean validCharAfter = i + 1 == b.length()
                            || b.charAt(i + 1) == delimiter;
                    if (!(validCharBefore || validCharAfter)) {
                        // escape that quote!
                        b.insert(i, quote);
                    }
                }
            }
            return b.toString();
        }
    }

You can then simply pass this tokenizer to the constructor of your CsvReader.
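To see what the escaping rule in readLine() does to a line without pulling in Super CSV, the same loop can be run on a plain String. This is only a sketch for experimenting: QuoteEscaper and escapeStrayQuotes are hypothetical names, and the delimiter and quote characters are passed in directly instead of being read from CsvPreference.

```java
/** Hypothetical stand-alone copy of the escaping rule, for experimenting only. */
final class QuoteEscaper {

    /** Doubles every quote that is not next to a delimiter or at the start/end of the line. */
    static String escapeStrayQuotes(String line, char delimiter, char quote) {
        StringBuilder b = new StringBuilder(line);
        // Walk backwards so that inserting a character does not shift unvisited indices.
        for (int i = b.length() - 1; i >= 0; i--) {
            if (b.charAt(i) != quote) {
                continue;
            }
            boolean validBefore = i == 0 || b.charAt(i - 1) == delimiter;
            boolean validAfter = i + 1 == b.length() || b.charAt(i + 1) == delimiter;
            if (!validBefore && !validAfter) {
                b.insert(i, quote); // turn " into ""
            }
        }
        return b.toString();
    }

    public static void main(String[] args) {
        String broken = "\"157137\",\"FRUIT WORMS 2\" 4/5LB\",9";
        System.out.println(escapeStrayQuotes(broken, ',', '"'));
    }
}
```

Two limits follow directly from the rule. A line that already contains a correctly escaped "" gets those quotes doubled again, so this should only be applied to input known to be unescaped. A stray quote that sits directly beside a comma inside a field is left alone, because it looks like a legitimate opening or closing quote.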

+1
