Why is uniq not working on this large file? [bash]

I'm sorry for another noob question, but I can't understand what is going on here. I want to compute the frequency of the words in a file that contains one word per line. The file is really large, so that may be part of the problem (in this example it is about 300,000 lines).

I execute this command:

cat .temp_occ | uniq -c | sort -k1,1nr -k2 > distribution.txt 

and the problem is that the result is slightly wrong: it treats identical words as different ones. For example, these are the first entries:

 306 continua
 278 apertura
 211 eventi
 189 murah
 182 giochi
 167 giochi

with giochi repeated twice, as you can see.

The bottom of the file is even worse; it looks like this:

 1 win
 1 win
 1 win
 1 win
 1 win
 1 win
 1 win
 1 win
 1 win
 1 winchester
 1 wind
 1 wind

and this happens for all the words.

Sorry again for the stupid question, but I'm a complete novice at shell programming. What am I doing wrong?

Many thanks

+6
4 answers

Try sorting first:

 cat .temp_occ | sort| uniq -c | sort -k1,1nr -k2 > distribution.txt 
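
A minimal sketch with made-up input showing why the order matters: uniq -c only counts adjacent duplicates, so sorting first groups the identical lines:

 $ printf 'giochi\nwin\ngiochi\n' | uniq -c
       1 giochi
       1 win
       1 giochi
 $ printf 'giochi\nwin\ngiochi\n' | sort | uniq -c
       2 giochi
       1 win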
+12

Or use "sort -u", which also eliminates duplicates. See here .

+6

File size has nothing to do with what you are seeing. From the uniq(1) man page:

Note: "uniq" does not detect duplicate lines if they are not adjacent. You can sort the input first, or use "sort -u" without "Unique." In addition, comparisons abide by the rules defined by LC_COLLATE.

So running uniq on

 a
 b
 a

will return:

 a
 b
 a
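
Whereas sorting the same input first groups the duplicates, so uniq collapses them:

 $ printf 'a\nb\na\n' | sort | uniq
 a
 b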
+2

Is it possible that some of the words have whitespace after them? If so, you could strip it with something like this (sorting first, as above, so uniq still sees adjacent duplicates):

 cat .temp_occ | tr -d ' ' | sort | uniq -c | sort -k1,1nr -k2 > distribution.txt 
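
A minimal sketch of the symptom, with made-up input: the two win lines below differ only by a trailing space, so uniq -c counts them separately:

 $ printf 'win\nwin \n' | uniq -c
       1 win
       1 win 

Note that tr -d ' ' deletes every space on a line, which is harmless here because the file has one word per line; sed 's/ *$//' would strip only trailing spaces if you want to be more conservative.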
+1

Source: https://habr.com/ru/post/922341/

