Getting the number of unique values in a column in bash

I have tab-delimited files with multiple columns. I want to count how often each of the different values in one column occurs, across all files in a folder, and sort the result in descending order of count (highest count first). How can I do this in a Linux command-line environment?

Any common command-line language like awk, perl, python, etc. is fine.

+61
command-line bash frequency
Feb 07 '11
5 answers

To see a frequency count for column two (for example):

awk -F '\t' '{print $2}' * | sort | uniq -c | sort -nr 

fileA.txt

 z    z    a
 a    b    c
 w    d    e

fileB.txt

 t    r    e
 z    d    a
 a    g    c

fileC.txt

 z    r    a
 v    d    c
 a    m    c

Result:

   3 d
   2 r
   1 z
   1 m
   1 g
   1 b
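If you want to vary the column without editing the awk program, a minimal variant of the same pipeline passes the column number in with -v (the shell variable name col here is just illustrative):

 col=2
 awk -F '\t' -v c="$col" '{ print $c }' * | sort | uniq -c | sort -nr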
+96
Feb 07 '11 at 15:36

Here's how to do it in a shell:

 FIELD=2
 cut -f $FIELD * | sort | uniq -c | sort -nr

This is the sort of thing bash is great at.
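If you use this often, one option is to wrap the pipeline in a small shell function; a minimal sketch (the function name colfreq is hypothetical):

 # colfreq COLUMN FILE... : value frequencies for COLUMN, highest count first
 colfreq() {
     local field=$1
     shift
     cut -f "$field" "$@" | sort | uniq -c | sort -nr
 }

 # usage: count column 2 across all .txt files in the current directory
 colfreq 2 *.txt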

+38
Feb 07 '11 at 18:59

The GNU site offers this nice awk script, which prints both the words and their frequencies.

Possible changes:

  • You can pipe the output through sort -nr (and swap word and freq[word]) to see the result in descending order.
  • If you need a specific column, you can omit the for loop and simply write freq[$3]++, replacing 3 with the column number (a sketch combining both changes follows the script below).

Here:

 # wordfreq.awk --- print list of word frequencies
 {
     $0 = tolower($0)    # remove case distinctions
     # remove punctuation
     gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
     for (i = 1; i <= NF; i++)
         freq[$i]++
 }

 END {
     for (word in freq)
         printf "%s\t%d\n", word, freq[word]
 }
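For example, a minimal sketch combining both of the changes above: it counts only column 3 (an arbitrary choice) of tab-delimited input, as in the question, and sorts in descending order of count:

 awk -F '\t' '{ freq[$3]++ }
     END { for (word in freq) printf "%d\t%s\n", freq[word], word }' * | sort -nr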
+7
Feb 07 '11

Perl

This code counts the occurrences of the values in every column and prints a sorted report for each of them:

 # columnvalues.pl
 while (<>) {
     @Fields = split /\s+/;
     for $i ( 0 .. $#Fields ) {
         $result[$i]{$Fields[$i]}++
     };
 }
 for $j ( 0 .. $#result ) {
     print "column $j:\n";
     @values = keys %{$result[$j]};
     @sorted = sort { $result[$j]{$b} <=> $result[$j]{$a} || $a cmp $b } @values;
     for $k ( @sorted ) {
         print " $k $result[$j]{$k}\n"
     }
 }

Save the text as columnvalues.pl
Run it as: perl columnvalues.pl files*

Description

In the top-level while loop:
* Loop over each line of the combined input files
* Split the line into the @Fields array
* For every column, increment the @result array-of-hashes data structure

In the top-level for loop:
* Loop over the @result array
* Print the column number
* Get the values used in that column
* Sort the values by the number of occurrences
* Secondary sort based on the value itself (e.g. b vs g vs m vs z)
* Iterate through the result hash, using the sorted list
* Print each value and its count

Results based on the sample input files provided by @Dennis:

 column 0:
  a 3
  z 3
  t 1
  v 1
  w 1
 column 1:
  d 3
  r 2
  b 1
  g 1
  m 1
  z 1
 column 2:
  c 4
  a 3
  e 2

.csv input

If your input files are .csv, change /\s+/ to /,/

Obfuscation

In an ugliness contest, Perl is particularly well equipped.
This one-liner does the same:

 perl -lane 'for $i (0..$#F){$g[$i]{$F[$i]}++};END{for $j (0..$#g){print "$j:";for $k (sort{$g[$j]{$b}<=>$g[$j]{$a}||$a cmp $b} keys %{$g[$j]}){print " $k $g[$j]{$k}"}}}' files* 
+3
Sep 16 '15 at 22:37

Ruby (1.9+)

 #!/usr/bin/env ruby
 Dir["*"].each do |file|
   h = Hash.new(0)
   open(file).each do |row|
     row.chomp.split("\t").each do |w|
       h[w] += 1
     end
   end
   h.sort { |a, b| b[1] <=> a[1] }.each { |x, y| print "#{x}:#{y}\n" }
 end
+1
Feb 07 '11 at 15:04


