Delete duplicate lines from text file?

Given an input text file, I want duplicate lines to be identified and removed. Please show a simple C# snippet that will do this.

5 answers

This should do the job (and will cope with large files).

Note that it only removes consecutive duplicate lines, i.e.

    a
    b
    b
    c
    b
    d

will end up like

    a
    b
    c
    b
    d

If you do not want duplicates anywhere, you will need to keep a set of the lines you have already seen.

    using System;
    using System.IO;

    class DeDuper
    {
        static void Main(string[] args)
        {
            if (args.Length != 2)
            {
                Console.WriteLine("Usage: DeDuper <input file> <output file>");
                return;
            }
            using (TextReader reader = File.OpenText(args[0]))
            using (TextWriter writer = File.CreateText(args[1]))
            {
                string currentLine;
                string lastLine = null;

                while ((currentLine = reader.ReadLine()) != null)
                {
                    if (currentLine != lastLine)
                    {
                        writer.WriteLine(currentLine);
                        lastLine = currentLine;
                    }
                }
            }
        }
    }

Note that this assumes Encoding.UTF8 and that you want to work with files. It is easy to generalize it as a method, though:

    static void CopyLinesRemovingConsecutiveDupes(TextReader reader, TextWriter writer)
    {
        string currentLine;
        string lastLine = null;

        while ((currentLine = reader.ReadLine()) != null)
        {
            if (currentLine != lastLine)
            {
                writer.WriteLine(currentLine);
                lastLine = currentLine;
            }
        }
    }

(Note that this does not close anything - the caller must do this.)
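For example, a caller might wire it up like this (the file names here are just placeholders):

    using (TextReader reader = File.OpenText("input.txt"))
    using (TextWriter writer = File.CreateText("output.txt"))
    {
        CopyLinesRemovingConsecutiveDupes(reader, writer);
    } // both streams are disposed here, by the caller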

Here is the version that will remove all duplicates, not just sequential ones:

    // requires: using System.Collections.Generic;
    static void CopyLinesRemovingAllDupes(TextReader reader, TextWriter writer)
    {
        string currentLine;
        HashSet<string> previousLines = new HashSet<string>();

        while ((currentLine = reader.ReadLine()) != null)
        {
            // Add returns true if it was actually added,
            // false if it was already there
            if (previousLines.Add(currentLine))
            {
                writer.WriteLine(currentLine);
            }
        }
    }

For small files:

    // requires: using System.IO; and using System.Linq; (for Distinct)
    string[] lines = File.ReadAllLines("filename.txt");
    File.WriteAllLines("filename.txt", lines.Distinct().ToArray());

For a long file (and non-consecutive duplicates), I would copy the file line by line, building a hash-to-position lookup table as I went.

As each line is copied, check its hash against the table; if there is a match, double-check that the earlier line really is the same before skipping it as a duplicate, then move on to the next.

It is only worth it for fairly large files.
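A minimal sketch of that idea, with hypothetical file paths and method names: it keeps only a hash → line-number table in memory (never the lines themselves) and re-reads the input file to confirm a suspected duplicate before skipping it.

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    class HashLookupDeDuper
    {
        // Copies unique lines from inputPath to outputPath while keeping only
        // a hash -> line-number table in memory, not the lines themselves.
        static void CopyUniqueLines(string inputPath, string outputPath)
        {
            // hash code -> input line numbers of distinct lines already written
            var seen = new Dictionary<int, List<int>>();

            using (TextReader reader = File.OpenText(inputPath))
            using (TextWriter writer = File.CreateText(outputPath))
            {
                string line;
                int lineNumber = 0;

                while ((line = reader.ReadLine()) != null)
                {
                    int hash = line.GetHashCode();
                    bool duplicate = false;

                    if (seen.TryGetValue(hash, out List<int> candidates))
                    {
                        // Possible duplicate: re-read each earlier line with the
                        // same hash to check whether it really is the same line
                        // or just a hash collision.
                        foreach (int earlier in candidates)
                        {
                            string earlierLine =
                                File.ReadLines(inputPath).Skip(earlier).First();
                            if (earlierLine == line)
                            {
                                duplicate = true;
                                break;
                            }
                        }
                    }
                    else
                    {
                        seen[hash] = candidates = new List<int>();
                    }

                    if (!duplicate)
                    {
                        candidates.Add(lineNumber);
                        writer.WriteLine(line);
                    }
                    lineNumber++;
                }
            }
        }
    }

Each suspected duplicate costs an extra partial scan of the input file, which is the trade-off for not holding the lines themselves in memory; since distinct lines rarely share a hash code, that path is taken about once per actual duplicate.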


Here is a streaming approach, which should incur less memory overhead than reading all the unique lines into memory.

    // Caution: this keys on GetHashCode() alone, so two different lines that
    // happen to share a hash code would be treated as duplicates.
    var lines = new HashSet<int>();

    using (var sr = new StreamReader(File.OpenRead(@"C:\Temp\in.txt")))
    using (var sw = new StreamWriter(File.Create(@"C:\Temp\out.txt"))) // Create truncates; OpenWrite would leave stale bytes in an existing file
    {
        while (!sr.EndOfStream)
        {
            string line = sr.ReadLine();
            int hc = line.GetHashCode();
            if (lines.Contains(hc))
                continue;

            lines.Add(hc);
            sw.WriteLine(line);
        }
    }

I am new to .NET and wrote something simpler, though it may not be very efficient. Please feel free to share your thoughts.

    using System;
    using System.Collections.Generic;
    using System.IO;

    class Program
    {
        static void Main(string[] args)
        {
            string[] emp_names = File.ReadAllLines("D:\\Employee Names.txt");
            List<string> newemp1 = new List<string>();

            for (int i = 0; i < emp_names.Length; i++)
            {
                newemp1.Add(emp_names[i]); // passing data to newemp1 from emp_names
            }

            for (int i = 0; i < emp_names.Length; i++)
            {
                List<string> temp = new List<string>();
                int duplicate_count = 0;
                for (int j = newemp1.Count - 1; j >= 0; j--)
                {
                    if (emp_names[i] != newemp1[j]) // checking for duplicate records
                        temp.Add(newemp1[j]);
                    else
                    {
                        duplicate_count++;
                        if (duplicate_count == 1) // keep only the first occurrence
                            temp.Add(emp_names[i]);
                    }
                }
                newemp1 = temp;
            }

            string[] newemp = newemp1.ToArray(); // assigning into a string array
            Array.Sort(newemp);
            File.WriteAllLines("D:\\Employee Names.txt", newemp); // now writing the data back to the text file
            Console.ReadLine();
        }
    }
