Comparing two csv files in Java

Question

We need to compare two CSV files. Let's say the first file has several lines, and the second file can have the same number of lines or more. Most lines stay the same in both files. We are looking for the best approach to diff these two files and read only those lines in the second file that differ from the first file. The application processing the files is written in Java.

What are the best approaches for this?

Note: it would be great if we could tell whether a line was updated, inserted or deleted in the second file.

Requirements:

There will be no duplicate entries.
File 1 and file 2 can have the same number of records, with multiple lines having updated values in file 2 (these are updated records).
Multiple lines can be deleted in file 2 (these are considered deleted records).
Several new lines may be added in file 2 (these are considered new records).
One column can be treated as the primary key of the record; it will not change between the two files.

java csv

Java guy Jun 2 '12 at 17:56

7 answers

beerbajay · Answer 1 · 2012-06-02T18:00:53+0000

One way to do this would be to use the Java Set interface: read each line as a string, add it to a set, then call removeAll() on the first set with the second set as the argument, which keeps only the rows that differ. This, of course, assumes that the files do not have duplicate lines.

    // using Apache Commons IO FileUtils to read in the files (UTF-8 is assumed here)
    HashSet<String> f1 = new HashSet<String>(FileUtils.readLines(new File("file1.csv"), StandardCharsets.UTF_8));
    HashSet<String> f2 = new HashSet<String>(FileUtils.readLines(new File("file2.csv"), StandardCharsets.UTF_8));
    f1.removeAll(f2); // f1 now contains only the lines which are not in f2

Update

OK, so you have a PK field. I'll just assume that you know how to extract it from each line; use openCSV or a regex or whatever you want. Use a HashMap instead of the HashSet above, with the PK as the key and the line as the value.

    HashMap<String, String> f1 = new HashMap<String, String>();
    HashMap<String, String> f2 = new HashMap<String, String>();
    // read f1, f2; use PK field as the key

    List<String> deleted = new ArrayList<String>();
    List<String> updated = new ArrayList<String>();

    for (Map.Entry<String, String> entry : f1.entrySet()) {
        if (!f2.containsKey(entry.getKey())) {
            // key is missing from the second file
            deleted.add(entry.getValue());
        } else {
            if (!f2.get(entry.getKey()).equals(entry.getValue())) {
                // same key, different content (the f1 version of the row is stored)
                updated.add(entry.getValue());
            }
        }
    }

    for (String key : f1.keySet()) {
        f2.remove(key);
    }
    // f2 now contains only "new" rows
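The map-based idea above can be put into a small self-contained form. This is only a sketch under stated assumptions: the class and method names (`CsvKeyedDiff`, `indexByKey`, `diff`) are made up for illustration, the primary key is assumed to be the first comma-separated field, and lines are assumed to contain no quoted commas. Unlike the snippet above, it records the file 2 version of an updated row.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of any library.
class CsvKeyedDiff {

    /** The three kinds of change, each as a list of raw CSV lines. */
    static final class Result {
        final List<String> inserted = new ArrayList<>();
        final List<String> updated = new ArrayList<>();
        final List<String> deleted = new ArrayList<>();
    }

    /** Indexes lines by key. Assumption: the key is the first comma-separated field. */
    static Map<String, String> indexByKey(List<String> lines) {
        Map<String, String> byKey = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String key = line.split(",", 2)[0].trim();
            byKey.put(key, line);
        }
        return byKey;
    }

    /** Classifies the lines of the two files as inserted, updated or deleted. */
    static Result diff(List<String> oldLines, List<String> newLines) {
        Map<String, String> oldByKey = indexByKey(oldLines);
        Map<String, String> newByKey = indexByKey(newLines);
        Result result = new Result();

        for (Map.Entry<String, String> e : newByKey.entrySet()) {
            String oldLine = oldByKey.get(e.getKey());
            if (oldLine == null) {
                result.inserted.add(e.getValue());   // key only in file 2
            } else if (!oldLine.equals(e.getValue())) {
                result.updated.add(e.getValue());    // same key, new content
            }
        }
        for (Map.Entry<String, String> e : oldByKey.entrySet()) {
            if (!newByKey.containsKey(e.getKey())) {
                result.deleted.add(e.getValue());    // key only in file 1
            }
        }
        return result;
    }
}
```

Calling `CsvKeyedDiff.diff(linesOfFile1, linesOfFile2)` returns the three lists in one pass over each map. A LinkedHashMap is used so the results keep file order; a real CSV parser would be needed if fields can contain quoted commas.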

Hassan · Answer 2 · 2012-06-02T18:01:40+0000

Read the entire first file and put it in a List. Then read the second file one line at a time and compare each line with all lines of the first file to see if it is a duplicate. If it is not a duplicate, then it is new information. If you have trouble reading the files, check out http://opencsv.sourceforge.net/, which is a pretty good library for reading CSV files in Java.
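A minimal sketch of this approach follows, with one deliberate change: the first file goes into a HashSet rather than a List, so each lookup is constant time instead of a scan over every line. The class and method names are hypothetical, and the inputs are plain Readers so that files or in-memory strings can both be used.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of any library.
class NewLineScanner {

    /** Returns the lines of the second input that occur nowhere in the first. */
    static List<String> linesOnlyInSecond(Reader first, Reader second) {
        try (BufferedReader a = new BufferedReader(first);
             BufferedReader b = new BufferedReader(second)) {
            // Load the whole first input into memory.
            Set<String> seen = new HashSet<>();
            String line;
            while ((line = a.readLine()) != null) {
                seen.add(line);
            }
            // Stream the second input one line at a time.
            List<String> fresh = new ArrayList<>();
            while ((line = b.readLine()) != null) {
                if (!seen.contains(line)) {
                    fresh.add(line);
                }
            }
            return fresh;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Note that this reports updated and inserted lines alike, and it cannot see deletions; telling those cases apart needs a key column.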

Mark o'connor · Answer 3 · 2012-06-02T23:21:46+0000

Try using the java-diff-utils library

Example

I use Groovy for quick demos of Java libraries:

The following differences are reported between the two sample files:

    $ groovy diff
    [ChangeDelta, position: 0, lines: [1,11,21,31,41,51] to [1,11,99,31,41,51]]
    [DeleteDelta, position: 2, lines: [3,13,23,33,43,53]]
    [InsertDelta, position: 5, lines: [6,16,26,36,46,56]]

file1.csv

    1,11,21,31,41,51
    2,12,22,32,42,52
    3,13,23,33,43,53
    4,14,24,34,44,54
    5,15,25,35,45,55

file2.csv

    1,11,99,31,41,51
    2,12,22,32,42,52
    4,14,24,34,44,54
    5,15,25,35,45,55
    6,16,26,36,46,56

diff.groovy

    //
    // Dependencies
    // ============
    import difflib.*

    @Grapes([
        @Grab(group='com.googlecode.java-diff-utils', module='diffutils', version='1.2.1'),
    ])

    //
    // Main program
    // ============
    def original = new File("file1.csv").readLines()
    def revised  = new File("file2.csv").readLines()

    Patch patch = DiffUtils.diff(original, revised)

    patch.getDeltas().each {
        println it
    }

Update

According to the dbunit FAQ, the performance of this solution can be improved for very large datasets by using a streaming implementation of the ResultSetTableFactory interface. This is enabled in the ANT task as follows:

    ant.dbunit(driver:driver, url:url, userid:user, password:pass) {
        compare(src:"dbunit.xml", format:"flat")
        dbconfig {
            property(name:"datatypeFactory", value:"org.dbunit.ext.h2.H2DataTypeFactory")
            property(name:"resultSetTableFactory", value:"org.dbunit.database.ForwardOnlyResultSetTableFactory")
        }
    }

Abhishek agggarwal · Answer 4 · 2013-04-24T06:37:35+0000

Here is a program that compares / subtracts two CSV files. It uses an ArrayList.

    import java.io.*;
    import java.util.ArrayList;

    /* file1 - file2 = file3 */
    public class CompareCSV {

        public static void main(String[] args) throws FileNotFoundException, IOException {
            String path = "D:\\csv\\";
            String file1 = "file1.csv";
            String file2 = "file2.csv";
            String file3 = "p3lang.csv";

            // Typed lists; with raw ArrayLists the for-each over String below does not compile.
            ArrayList<String> al1 = new ArrayList<String>();
            ArrayList<String> al2 = new ArrayList<String>();

            // Collect every comma-separated value of file1.
            BufferedReader CSVFile1 = new BufferedReader(new FileReader(path + file1));
            String dataRow1 = CSVFile1.readLine();
            while (dataRow1 != null) {
                String[] dataArray1 = dataRow1.split(",");
                for (String item1 : dataArray1) {
                    al1.add(item1);
                }
                dataRow1 = CSVFile1.readLine(); // Read next line of data.
            }
            CSVFile1.close();

            // Collect every comma-separated value of file2.
            BufferedReader CSVFile2 = new BufferedReader(new FileReader(path + file2));
            String dataRow2 = CSVFile2.readLine();
            while (dataRow2 != null) {
                String[] dataArray2 = dataRow2.split(",");
                for (String item2 : dataArray2) {
                    al2.add(item2);
                }
                dataRow2 = CSVFile2.readLine(); // Read next line of data.
            }
            CSVFile2.close();

            // Subtract: remove one occurrence from al1 for each value found in file2.
            for (String bs : al2) {
                al1.remove(bs);
            }

            int size = al1.size();
            System.out.println(size);

            // Write the remaining values to file3, one per line, last value first.
            try {
                FileWriter writer = new FileWriter(path + file3);
                while (size != 0) {
                    size--;
                    writer.append("" + al1.get(size));
                    writer.append('\n');
                }
                writer.flush();
                writer.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

http://p3lang.com/subtract-one-csv-from-another-in-java/

goat · Answer 5 · 2012-06-02T18:22:27+0000

You mentioned detecting "updated" rows. I assume this means that a row has an identity that somehow survives the update. Perhaps a single column or a combination of columns provides the identity. This is a detail that you need to figure out and implement yourself, and it will add more code to your solution.

In any case ... databases generally have good support for working with set data and for loading data from CSV files. All the big-name relational databases have great support, with easy syntax, for loading data from a CSV file into a table. At that point, finding new rows or changed rows between two tables is a matter of very simple SQL queries.

It's clearly not a pure Java solution, but worth mentioning, I think.

Silviu B. · Answer 6 · 2018-12-06T08:04:25+0000

My simple solution is for the case where you want to compare two CSV responses stored in String variables (for example, when you get them through a REST call). In my case, I wanted to stop the check after a threshold of 10 differing lines.

    BufferedReader baseline = new BufferedReader(new StringReader(responseBaseline));
    BufferedReader tested = new BufferedReader(new StringReader(responseTested));
    String lineBaseline = null;
    String lineTested = null;
    boolean linesExist = true;
    boolean foundDiff = false;
    int lineNumber = 0;
    int errorNumber = 0;
    int errorThreshold = 10;
    String message = "";

    while (linesExist) {
        try {
            lineBaseline = baseline.readLine();
            lineTested = tested.readLine();
            lineNumber++;
            if ((lineBaseline != null) && (lineTested != null)) {
                if (!lineTested.equals(lineBaseline)) {
                    foundDiff = true;
                    errorNumber++;
                    if (errorNumber > errorThreshold) {
                        message = message + "\r\n" + "Found more than " + errorThreshold
                                + " lines that were different. Will exit check.";
                        break;
                    }
                    message = message + "\r\n" + "\r\n#Found differences for line number " + lineNumber
                            + "\r\nLine baseline: " + lineBaseline
                            + "\r\nLine tested: " + lineTested;
                }
            } else {
                // stop as soon as either response runs out of lines
                linesExist = false;
            }
        } catch (IOException e) {
            throw new Error("Problems with reading csv files");
        }
    }

    if (foundDiff) {
        throw new Error("Found differences between csv files. " + message);
    }

dharam · Answer 7 · 2012-06-02T18:14:12+0000

What I suggest:

You can read the file, split each line into tokens on the separator character, and trim each token on both sides to take care of extra spaces, then store them in an ordered data structure (such as a LinkedHashSet, a LinkedHashMap, etc., depending on whether you want to catch duplicates in the file, if there are any), and then repeat this for the other file.

Java provides many utility methods for comparing these data structures. :)
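One way to read this suggestion is sketched here, under a few assumptions: the separator is a comma, fields contain no quoted commas, and each normalised row is kept as one string in a LinkedHashSet so file order is preserved. The class and method names are invented for the example.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of any library.
class NormalizedCsvCompare {

    /** Splits a line on commas and trims every token, then joins the tokens again. */
    static String normalize(String line) {
        String[] tokens = line.split(",", -1); // -1 keeps trailing empty fields
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(tokens[i].trim());
        }
        return sb.toString();
    }

    /** Loads the lines into an insertion-ordered set of normalised rows. */
    static Set<String> load(List<String> lines) {
        Set<String> rows = new LinkedHashSet<>();
        for (String line : lines) {
            rows.add(normalize(line));
        }
        return rows;
    }

    /** Rows present in a but not in b, in the order they appear in a. */
    static Set<String> minus(Set<String> a, Set<String> b) {
        Set<String> result = new LinkedHashSet<>(a);
        result.removeAll(b);
        return result;
    }
}
```

With the two files loaded as `one` and `two`, `minus(two, one)` gives the new or changed rows and `minus(one, two)` gives the removed or old rows. Rows that differ only in whitespace around the separators compare as equal.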
