Clojure - counting unique values from vectors in a seq

Being somewhat new to Clojure, I can't figure out how to do something that should be simple. I just don't see it. I have a seq of vectors. Say each vector has two values representing a customer number and an invoice number, and each vector represents the sale of an item. So it looks something like this:

([ 100 2000 ] [ 100 2000 ] [ 101 2001 ] [ 100 2002 ]) 

I want to count the number of unique customers and unique invoices. So for the example, the result should be the vector

 [ 2 3 ] 

In Java or another imperative language, I would iterate over each of the vectors in the seq, add the customer number and invoice number to a set each, then count the number of values in each set and return that. I don't see a functional way to do this.

Thanks for the help.

EDIT: I should have indicated in my original question that the seq has around 10 million vectors, and each vector actually has more than two values. Therefore, I only want to go through the seq once and calculate these unique counts (and some sums as well) in that single pass.

+7
5 answers

In Clojure, you can do this in much the same way - first call distinct to get unique values, and then use count to count the results:

 (def vectors (list [100 2000] [100 2000] [101 2001] [100 2002]))

 (defn count-unique [coll]
   (count (distinct coll)))

 (def result
   [(count-unique (map first vectors))
    (count-unique (map second vectors))])

Please note that here you first build a list of the first and of the second elements of the vectors (by mapping first / second over the vectors) and then process each list separately, thus iterating over the collection twice. If performance matters, you can do the same with explicit iteration (see the loop form or tail recursion) and sets, as in Java. You can also use transients to improve performance further. For a beginner, though, I would recommend the first way with distinct.

UPD. Here's the version with the loop:

 (defn count-unique-vec [coll]
   (loop [coll coll, e1 (transient #{}), e2 (transient #{})]
     (cond
       (empty? coll) [(count (persistent! e1)) (count (persistent! e2))]
       :else (recur (rest coll)
                    (conj! e1 (first (first coll)))
                    (conj! e2 (second (first coll)))))))

 (count-unique-vec vectors) ; => [2 3]

As you can see, there is no need for atoms or anything like that. First, you pass the state on to each subsequent iteration (the recur call). Second, you use transients as temporary mutable collections (read more about transients for details) and thus avoid creating a new object each time.
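As a small illustration of the transient pattern on its own, here is a minimal sketch. The helper name unique-count is made up for this example. It relies on the usual rule for transients: always use the value returned by conj! rather than the original reference, which reduce does automatically.

```clojure
;; Hypothetical helper (not from the answer above): counts distinct items
;; by accumulating them in a transient set.
(defn unique-count [coll]
  ;; reduce feeds the value returned by conj! into the next step,
  ;; so the transient is never reused after being updated.
  (count (persistent! (reduce conj! (transient #{}) coll))))

(unique-count [100 100 101 100]) ; => 2
```

Calling persistent! at the end turns the transient back into an ordinary immutable set.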

UPD2. Here is the version with reduce for the extended question (with price):

 (defn count-with-price
   "Takes input of form ([customer invoice price] [customer invoice price] ...)
    and produces vector of 3 elements, where 1st and 2nd are counts of unique
    customers and invoices and 3rd is total sum of all prices"
   [coll]
   (let [[custs invs total]
         (reduce (fn [[custs invs total] [cust inv price]]
                   [(conj! custs cust) (conj! invs inv) (+ total price)])
                 [(transient #{}) (transient #{}) 0]
                 coll)]
     [(count (persistent! custs)) (count (persistent! invs)) total]))

Here we carry the intermediate results in the [custs invs total] vector, destructuring it, processing it, and packing the results back into a vector at every step. As you can see, implementing such nontrivial logic with reduce is more complicated (both to write and to read) and requires even more code (in the loop version it is enough to add one more binding for the price). Therefore, I agree with @ammaloy that reduce is better for simpler cases, but more complex things call for lower-level constructs such as loop/recur.
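For comparison, this is roughly what the loop variant with a price total might look like. It is only a sketch: the name count-with-price-loop is made up here, and it assumes every row has the shape [customer invoice price].

```clojure
;; Hypothetical loop/recur counterpart of the reduce version:
;; one extra binding (total) carries the running sum of prices.
(defn count-with-price-loop [coll]
  (loop [rows  (seq coll)
         custs (transient #{})
         invs  (transient #{})
         total 0]
    ;; (first rows) is nil once the input is exhausted, ending the loop.
    (if-let [[cust inv price] (first rows)]
      (recur (next rows)
             (conj! custs cust)
             (conj! invs inv)
             (+ total price))
      [(count (persistent! custs)) (count (persistent! invs)) total])))

(count-with-price-loop [[100 2000 10] [100 2000 20] [101 2001 5] [100 2002 15]])
;; => [2 3 50]
```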

+11

As is often the case when consuming a sequence, reduce is a better fit here than loop. You can simply write:

 (map count (reduce (partial map conj) [#{} #{}] txn)) 

Or, if you are really into transients:

 (map (comp count persistent!) (reduce (partial map conj!) (repeatedly 2 #(transient #{})) txn)) 

Both of these solutions go through the input only once, and they take much less code than the loop / recur solution.
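One caveat that may be worth checking for very large inputs: map is lazy, so reducing with (partial map conj) can pile up a long chain of unrealized lazy seqs, which risks a StackOverflowError when the result is finally realized. A possible eager variant uses mapv instead. This is a sketch; the name unique-counts and its n parameter (the number of columns) are made up for the example.

```clojure
;; Hypothetical eager variant: mapv realizes the accumulator at every step,
;; so no chain of nested lazy seqs builds up.
(defn unique-counts [rows n]
  (mapv count
        (reduce (fn [sets row] (mapv conj sets row))
                (vec (repeat n #{}))   ; one empty set per column
                rows)))

(unique-counts [[100 2000] [100 2000] [101 2001] [100 2002]] 2) ; => [2 3]
```

Because mapv stops at the shorter of its inputs, rows with more than n values simply have their extra values ignored.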

+9

Or you can let sets handle the de-duplication for you, since a set can contain at most one of any given value.

 (def vectors '([100 2000] [100 2000] [101 2001] [100 2002]))

 [(count (into #{} (map first vectors)))
  (count (into #{} (map second vectors)))]

+4

Here's a nice way to do this using map and higher-order functions:

 (apply map (comp count set list) [[100 2000] [100 2000] [101 2001] [100 2002]]) ; => (2 3)

+1

Here are some other solutions, in addition to the good ones mentioned above:

(map (comp count distinct vector) [ 100 2000 ] [ 100 2000 ] [ 101 2001 ] [ 100 2002 ])

Another one written with the thread-last macro:

(->> '([100 2000] [100 2000] [101 2001] [100 2002]) (apply map vector) (map distinct) (map count))

Both return (2 3).

0
