Processing "," - "CSV with Univocity

Any idea how I can get the correct lines? Some lines are glued together and I can't figure out how to stop it or why.

    col. 0: Date
    col. 1: Col2
    col. 2: Col3
    col. 3: Col4
    col. 4: Col5
    col. 5: Col6
    col. 6: Col7
    col. 7: Col7
    col. 8: Col8

    col. 0: 2017-05-23
    col. 1: String
    col. 2: lo rem ipsum
    col. 3: dolor sit amet
    col. 4: mcdonalds.com/online.html
    col. 5: null
    col. 6: "","-""-""2017-05-23"
    col. 7: String
    col. 8: lo rem ipsum
    col. 9: dolor sit amet
    col. 10: burgerking.com
    col. 11: https://burgerking.com/
    col. 12: 20
    col. 13: 2
    col. 14: fake

    col. 0: 2017-05-23
    col. 1: String
    col. 2: lo rem ipsum
    col. 3: dolor sit amet
    col. 4: wendys.com
    col. 5: null
    col. 6: "","-""-""2017-05-23"
    col. 7: String
    col. 8: lo rem ipsum
    col. 9: dolor sit amet
    col. 10: buggagump.com
    col. 11: null
    col. 12: "","-""-""2017-05-23"
    col. 13: String
    col. 14: cheese
    col. 15: ad eum
    col. 16: mcdonalds.com/online.html
    col. 17: null
    col. 18: "","-""-""2017-05-23"
    col. 19: String
    col. 20: burger
    col. 21: ludus dissentiet
    col. 22: www.mcdonalds.com
    col. 23: https://www.mcdonalds.com/
    col. 24: 25
    col. 25: 3
    col. 26: fake

    col. 0: 2017-05-23
    col. 1: String
    col. 2: wine
    col. 3: id erat utamur
    col. 4: bubbagump.com
    col. 5: https://buggagump.com/
    col. 6: 25
    col. 7: 3
    col. 8: fake
    done

CSV example (copying and pasting may corrupt the \r\n line endings). Available here: https://www.dropbox.com/s/86klza4qok4ty2s/malformed%20csv%20r%20n%20small.csv?dl=0

 "Date","Col2","Col3","Col4","Col5","Col6","Col7","Col7","Col8" "2017-05-23","String","lo rem ipsum","dolor sit amet","mcdonalds.com/online.html","","-","-","-" "2017-05-23","String","lo rem ipsum","dolor sit amet","burgerking.com","https://burgerking.com/","20","2","fake" "2017-05-23","String","lo rem ipsum","dolor sit amet","wendys.com","","-","-","-" "2017-05-23","String","lo rem ipsum","dolor sit amet","buggagump.com","","-","-","-" "2017-05-23","String","cheese","ad eum","mcdonalds.com/online.html","","-","-","-" "2017-05-23","String","burger","ludus dissentiet","www.mcdonalds.com","https://www.mcdonalds.com/","25","3","fake" "2017-05-23","String","wine","id erat utamur","bubbagump.com","https://buggagump.com/","25","3","fake" 

Building Settings:

    CsvParserSettings settings = new CsvParserSettings();
    settings.setDelimiterDetectionEnabled(true);
    settings.setQuoteDetectionEnabled(true);
    settings.setLineSeparatorDetectionEnabled(false); // all the same using `true`
    settings.getFormat().setLineSeparator("\r\n");

    CsvParser parser = new CsvParser(settings);
    List<String[]> rows;
    rows = parser.parseAll(getReader("testFiles/" + "malformed csv small.csv"));
    for (String[] row : rows) {
        System.out.println("");
        int i = 0;
        for (String element : row) {
            System.out.println("col. " + i++ + ": " + element);
        }
    }
    System.out.println("done");
1 answer

When you test the automatic detection process, I suggest you print the detected format using:

    CsvFormat format = parser.getDetectedFormat();
    System.out.println(format);

This will print:

    CsvFormat:
        Comment character=#
        Field delimiter=,
        Line separator (normalized)=\n
        Line separator sequence=\r\n
        Quote character="
        Quote escape character=-
        Quote escape escape character=null

As you can see, the parser did not detect the correct quote escape character. Although the format detection process is usually very good, it is not guaranteed to always be correct, especially with small test samples. In your example, I don't understand why it chose - as the quote escape character, so I opened an issue to investigate what makes the detection pick it.
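To illustrate why a wrong quote escape can glue rows together, here is a minimal sketch of a quoted-field parser with a configurable escape character. It is a toy model written for this answer, not univocity's actual implementation (which has more handling around unescaped quotes); the class name QuoteEscapeDemo and the two-row sample are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of quoted-field parsing; NOT univocity's real algorithm.
class QuoteEscapeDemo {

    /** Splits the input into rows of fields, using ',' as the delimiter and '\n' as the row end. */
    static List<List<String>> parse(String input, char quote, char escape) {
        List<List<String>> rows = new ArrayList<>();
        List<String> row = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (inQuotes) {
                boolean hasNext = i + 1 < input.length();
                if (c == escape && hasNext && input.charAt(i + 1) == quote) {
                    field.append(quote); // escaped quote: we stay inside the field
                    i++;                 // skip the quote that was escaped
                } else if (c == quote) {
                    inQuotes = false;    // closing quote
                } else {
                    field.append(c);     // delimiters and line breaks are plain content here
                }
            } else if (c == quote) {
                inQuotes = true;         // opening quote
            } else if (c == ',') {
                row.add(field.toString());
                field.setLength(0);
            } else if (c == '\n') {
                row.add(field.toString());
                field.setLength(0);
                rows.add(row);
                row = new ArrayList<>();
            } else if (c != '\r') {
                field.append(c);         // unquoted content; '\r' outside quotes is dropped
            }
        }
        // Flush a last row that was not terminated by a line break.
        if (field.length() > 0 || !row.isEmpty()) {
            row.add(field.toString());
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical two-row sample with a "-" field at the end of the first row.
        String sample = "\"a\",\"-\"\r\n\"b\",\"c\"\r\n";
        System.out.println(parse(sample, '"', '"').size()); // escape is the quote itself
        System.out.println(parse(sample, '"', '-').size()); // escape misdetected as '-'
    }
}
```

In this toy model the first call yields two rows. In the second call the `-"` at the end of the first row is read as an escaped quote, so the field never closes at the line break and everything ends up in a single row.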

What you can do right now as a workaround, if you know that none of your input files will ever use - as the quote escape character, is to detect the format, check what was detected from the input, and then parse the contents, e.g.:

    public List<String[]> parse(File input, CsvFormat format) {
        CsvParserSettings settings = new CsvParserSettings();
        if (format == null) { // no format specified? Let's detect what we are dealing with
            settings.detectFormatAutomatically();
            CsvParser parser = new CsvParser(settings);
            parser.beginParsing(input); // just call beginParsing to kick off the auto-detection process
            format = parser.getDetectedFormat(); // capture the format
            parser.stopParsing(); // stop the parser - no need to read anything yet

            System.out.println(format);
            if (format.getQuoteEscape() == '-') { // got something weird detected? Let's amend it
                format.setQuoteEscape('"');
            }
            return parse(input, format); // now parse with the intended format
        } else {
            settings.setFormat(format); // this parses with the format adjusted earlier
            CsvParser parser = new CsvParser(settings);
            return parser.parseAll(input);
        }
    }

Now just call the parse method:

 List<String[]> rows = parse(new File("/Users/jbax/Downloads/malformed csv rn small.csv"), null); 

And your data will be correctly extracted. Hope this helps!
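If you want to confirm that no rows were glued, one simple check is to compare each row's column count against the header row. The sketch below is plain Java with no univocity calls; the class name RowWidthCheck and the method findRaggedRows are hypothetical helpers, and it assumes the first row is the header and that every record should have the same number of columns.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: flags rows whose width differs from the header row.
class RowWidthCheck {

    /** Returns the indexes of rows whose column count differs from the first row. */
    static List<Integer> findRaggedRows(List<String[]> rows) {
        List<Integer> ragged = new ArrayList<>();
        if (rows.isEmpty()) {
            return ragged;
        }
        int expected = rows.get(0).length; // assumption: the first row is the header
        for (int i = 1; i < rows.size(); i++) {
            if (rows.get(i).length != expected) {
                ragged.add(i);
            }
        }
        return ragged;
    }

    public static void main(String[] args) {
        // Column counts taken from the output in the question: 9, 15, 27 and 9.
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[9]);
        rows.add(new String[15]);
        rows.add(new String[27]);
        rows.add(new String[9]);
        System.out.println(findRaggedRows(rows)); // prints [1, 2]
    }
}
```

You could pass the list returned by the parse method above to findRaggedRows; an empty result means every row has as many columns as the header.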
