Why is uniq not working on this large file? [bash]

I'm sorry for another noob question, but I can't understand what is going on here. I want to compute the frequency of the words in a file that contains one word per line. The file is really large, so that may be part of the problem (in this example it is about 300,000 lines).

I execute this command:

cat .temp_occ | uniq -c | sort -k1,1nr -k2 > distribution.txt 

and the problem is that the result is slightly wrong: it treats identical words as different ones. For example, these are the first entries:

 306 continua
 278 apertura
 211 eventi
 189 murah
 182 giochi
 167 giochi

with giochi repeated twice, as you can see.

The bottom of the file is even worse; it looks like this:

 1 win
 1 win
 1 win
 1 win
 1 win
 1 win
 1 win
 1 win
 1 win
 1 winchester
 1 wind
 1 wind

and this happens for all the words.

Sorry again for the stupid question, but I'm a complete novice at shell programming. What am I doing wrong?

Many thanks

+6
4 answers

Try sorting first:

 cat .temp_occ | sort| uniq -c | sort -k1,1nr -k2 > distribution.txt 
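
A minimal sketch with made-up input showing why the order matters: uniq -c only counts adjacent duplicates, so sorting first groups the identical lines:

 $ printf 'giochi\nwin\ngiochi\n' | uniq -c
       1 giochi
       1 win
       1 giochi
 $ printf 'giochi\nwin\ngiochi\n' | sort | uniq -c
       2 giochi
       1 win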
+12

Or use "sort -u", which also eliminates duplicates. See here .

+6

File size has nothing to do with what you are seeing. From the uniq(1) man page:

Note: "uniq" does not detect duplicate lines if they are not adjacent. You can sort the input first, or use "sort -u" without "Unique." In addition, comparisons abide by the rules defined by LC_COLLATE.

So running uniq on

 a
 b
 a

will return:

 a
 b
 a
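
Whereas sorting the same input first groups the duplicates, so uniq collapses them:

 $ printf 'a\nb\na\n' | sort | uniq
 a
 b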
+2

Is it possible that some of the words have whitespace after them? If so, you could strip it with something like this (sorting first, as above, so uniq still sees adjacent duplicates):

 cat .temp_occ | tr -d ' ' | sort | uniq -c | sort -k1,1nr -k2 > distribution.txt 
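
A minimal sketch of the symptom, with made-up input: the two win lines below differ only by a trailing space, so uniq -c counts them separately:

 $ printf 'win\nwin \n' | uniq -c
       1 win
       1 win 

Note that tr -d ' ' deletes every space on a line, which is harmless here because the file has one word per line; sed 's/ *$//' would strip only trailing spaces if you want to be more conservative.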
+1

Source: https://habr.com/ru/post/922341/

