How to keep the original line order when removing duplicates with uniq (in shell)?

The uniq command only removes adjacent duplicate lines, so to use it you must first sort the file.

But in the file that I have, the order of the lines is important, so how can I keep the original line order and still get rid of duplicate content?

+6
sorting unix file shell duplicates
7 answers

Another awk version:

awk '!_[$0]++' infile 
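
For readers unfamiliar with the idiom, here is a minimal sketch of the same one-liner wrapped in a function, with the array given a descriptive name. The function name dedup_keep_first and the array name seen are only illustrative and are not part of the answer above.

```shell
# Print each distinct line once, in order of first appearance.
dedup_keep_first() {
    # seen[$0]++ evaluates to 0 (false) the first time a line appears,
    # so !seen[$0]++ is true only for the first occurrence of a line,
    # and awk's default action prints it.
    awk '!seen[$0]++' "$@"
}
```

It can be called as dedup_keep_first infile, or at the end of a pipeline. Note that it keeps one array entry per distinct line in memory.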
+10

This awk command keeps the first occurrence of each line. It uses the same algorithm as the other awk answers:

 awk '!($0 in lines) { print $0; lines[$0]; }' 

With this variant, awk only needs to keep the duplicated lines in memory (rather than every line):

 sort file | uniq -d | awk ' FNR == NR { dups[$0] } FNR != NR && (!($0 in dups) || !lines[$0]++) ' - file 
+4

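There is also a "line number, double sort" method: number every line, sort uniquely on the content, sort back by line number, and cut the numbers off again. This relies on sort -u keeping the first line of each run of equal lines, as GNU sort does.

 nl -b a -n ln infile | sort -t "$(printf '\t')" -u -k 2 | sort -k 1n | cut -f 2-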
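
The number / sort / renumber idea above can also be sketched as a function. This variant is an assumption-laden rewrite, not the answer's own command: it numbers lines with awk instead of nl, and it picks the lowest-numbered copy explicitly instead of depending on which copy sort -u keeps. The function name dedup_by_line_number is hypothetical.

```shell
# Remove duplicate lines while keeping first-occurrence order,
# using only line numbering and two sorts.
dedup_by_line_number() {
    tab=$(printf '\t')
    # 1. Prefix every line with its line number and a tab.
    awk '{ print NR "\t" $0 }' "$@" |
        # 2. Group equal contents together, lowest line number first.
        LC_ALL=C sort -t "$tab" -k 2 -k 1,1n |
        # 3. Keep only the first (lowest-numbered) line of each group.
        awk '{ content = substr($0, index($0, "\t") + 1) }
             NR == 1 || content != prev { print }
             { prev = content }' |
        # 4. Restore the original order and drop the numbers.
        LC_ALL=C sort -n |
        cut -f 2-
}
```

Unlike the awk-array approach, nothing here has to hold the whole file in an in-memory table; sort spills to temporary files if the input is large.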
+4

You can run uniq -d on a sorted copy of the file to find the duplicated lines, and then run a script over the original file that does the following:

 if this_line is in duplicate_lines {
     if not i_have_seen[this_line] {
         output this_line
         i_have_seen[this_line] = true
     }
 } else {
     output this_line
 }
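
The pseudocode above could be sketched in shell roughly as follows. The function name dedup_store_dups_only and the variable names are hypothetical, and the temporary file is an implementation choice of this sketch: the duplicate list is loaded with getline in a BEGIN block so that the script still behaves when the file has no duplicates at all.

```shell
# Keep the first occurrence of each line of file "$1", holding only the
# duplicated lines (not every line) in awk's memory.
dedup_store_dups_only() {
    file=$1
    dups=$(mktemp) || return 1
    # duplicate_lines: every line that occurs more than once
    sort "$file" | uniq -d > "$dups"
    awk -v dupfile="$dups" '
        # Load the list of duplicated lines before reading the input.
        BEGIN { while ((getline l < dupfile) > 0) is_dup[l] }
        # Print lines that are never duplicated, and the first copy
        # of the ones that are.
        !($0 in is_dup) || !seen[$0]++
    ' "$file"
    rm -f "$dups"
}
```

Usage would be dedup_store_dups_only infile > outfile.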
+1

Using only uniq and grep:

Create d.sh:

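 #!/bin/sh
 # Keep the first occurrence of each line of "$1"; the result goes to "$1_out".
 sort "$1" | uniq > "${1}_uniq"
 : > "${1}_out"
 while IFS= read -r line; do
     # Emit the line only if it is still in the list of unseen lines
     grep -m1 -Fx -e "$line" "${1}_uniq" >> "${1}_out"
     # Then drop it from that list so later copies are skipped
     grep -v -Fx -e "$line" "${1}_uniq" > "${1}_uniq2"
     mv "${1}_uniq2" "${1}_uniq"
 done < "$1"
 rm "${1}_uniq"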

Example:

 ./d.sh infile 
+1

You could use a crude O(n^2) approach like this (pseudocode):

 file2 = EMPTY_FILE
 for each line in file1:
     if not line in file2:
         file2.append(line)

This is potentially quite slow, especially if it is implemented at the Bash level. But if your files are short enough, it will probably work fine, and it is quick to implement (the "not line in file2" test is then just a negated grep, etc.).
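
A minimal sketch of that quadratic loop in plain sh, assuming the input and output are given as two file names. The function name dedup_quadratic and the variable names src and dst are hypothetical.

```shell
# O(n^2) de-duplication: for each input line, scan the output written so far.
dedup_quadratic() {
    src=$1
    dst=$2
    : > "$dst"    # start with an empty output file
    # The extra -n test keeps a final line that lacks a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        # -F: fixed string, -x: match the whole line, -q: no output
        if ! grep -Fxq -e "$line" "$dst"; then
            printf '%s\n' "$line" >> "$dst"
        fi
    done < "$src"
}
```

Each input line triggers a full scan of the output file, which is where the quadratic cost comes from.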

Otherwise, you could, of course, write a dedicated program that uses a more advanced in-memory data structure to speed it up.

0

 sort file1 | uniq | while IFS= read -r line; do
     grep -n -m1 -Fx -e "$line" file1
 done > out
 sort -n out

Sort the file first and reduce it to its unique lines.

For each unique line, grep for the first match (-m1) and record its line number (-n).

Sort the result numerically (-n) by line number.

You can then strip the line-number prefix with sed or awk.
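
As a sketch of that last step, assuming the lines are in the N:text form that grep -n prints, the prefix can also be removed with cut. The function name strip_line_numbers is hypothetical.

```shell
# Sort "N:text" lines by N, then remove the "N:" prefix.
strip_line_numbers() {
    # cut splits on the first colon only in effect, because -f2- keeps
    # every field from the second one on, colons included.
    sort -n "$@" | cut -d: -f2-
}
```

For example, strip_line_numbers out prints the de-duplicated lines in their original order.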

0