Filter / stream reduction for duplicate records

I am trying to filter / reduce a data stream that has multiple duplicate records.

In essence, I'm trying to find a better solution for filtering a dataset than what I implemented. We have data that basically looks something like this:

    Action | Date       | Detail
      15   | 2016-03-15 |
       5   | 2016-03-15 | D1
       5   | 2016-09-25 | D2   <--
       5   | 2016-09-25 | D3   <-- same day, different detail
       4   | 2017-02-08 | D4
       4   | 2017-02-08 | D5
       5   | 2017-03-01 | D6   <--
       5   | 2017-03-05 | D6   <-- different day, same detail; need earliest
       5   | 2017-03-08 | D7
       5   | 2017-03-10 | D8
      ...

I need to extract the details in order to:

  • Only records with action 5 are selected
  • If the detail is the same (for example, D6 appears twice on different days), only the entry with the earliest date is kept

This data is loaded into objects (one instance per "record"). The objects have other fields as well, but they are not relevant to this filtering. The detail is stored as a String, the date as a ZonedDateTime, and the action as an int (actually an enum, but shown here as an int). The objects are held in a List<Entry> in chronological order.
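A minimal Entry class consistent with this description might look like the sketch below (the field names, constructor and getters are assumptions for illustration; the real class has more fields, and the code samples below call the action getter either getType() or getAction()):

    import java.time.ZonedDateTime;

    // Minimal sketch of the Entry class assumed by the code samples below.
    class Entry {
        private final int action;            // an enum in the real code, shown as int here
        private final ZonedDateTime date;
        private final String detail;

        Entry(int action, ZonedDateTime date, String detail) {
            this.action = action;
            this.date = date;
            this.detail = detail;
        }

        int getAction()          { return action; }
        int getType()            { return action; }   // alias used in the question's code
        ZonedDateTime getDate()  { return date; }
        String getDetail()       { return detail; }

        @Override
        public String toString() {
            return "Entry [action=" + action + ", date=" + date + ", detail=" + detail + "]";
        }
    }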

I got it working, but with what I consider a suboptimal implementation:

    List<Entry> entries = getEntries(); // retrieved from a server
    final Set<String> update = new HashSet<>();
    List<Entry> updates = entries.stream()
            .filter(e -> e.getType() == 5)
            .filter(e -> pass(e, update))
            .collect(Collectors.toList());

    private boolean pass(Entry ehe, Set<String> update) {
        final String val = ehe.getDetail();
        if (update.contains(val)) {
            return false;
        }
        update.add(val);
        return true;
    }

The problem is that I had to introduce this pass() method, and the external Set<String> it checks, to do the deduplication. Although this approach works, it feels like the external mutable state should be avoidable.

I tried using groupingBy on the detail and then extracting the earliest entry from each group, roughly as in the sketch below, but the problem is that I lost the date ordering and had to post-process the resulting Map<String, List<Entry>>.
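A rough sketch of that attempt (assuming the usual java.util and java.util.stream imports; not the exact code):

    // Group by detail, then pick the earliest entry out of each group afterwards.
    // The grouping loses the original ordering of the list.
    Map<String, List<Entry>> byDetail = entries.stream()
            .filter(e -> e.getType() == 5)
            .collect(Collectors.groupingBy(Entry::getDetail));

    List<Entry> earliest = byDetail.values().stream()
            .map(group -> group.stream()
                    .min(Comparator.comparing(Entry::getDate))
                    .get())
            .collect(Collectors.toList());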

It seems that some reduction operation (if I am using that term correctly) should be possible here without the pass() method, but I am struggling to come up with a better implementation.

What would be the best approach to remove the .filter(e -> pass(e, update)) step?

Thanks!

java java-8 java-stream
4 answers

Two solutions; the second one in this answer is much faster.

Solution 1

Adapting the answer from Ole VV on another question:

    Collection<Entry> result = entries.stream()
            .filter(e -> e.getAction() == 5)
            .collect(Collectors.groupingBy(Entry::getDetail,
                    Collectors.collectingAndThen(
                            Collectors.minBy(Comparator.comparing(Entry::getDate)),
                            Optional::get)))
            .values();

With your example dataset, you end up with the following (I chose GMT+0 as the time zone):

    Entry [action=5, date=2017-03-01T00:00Z[GMT], detail=D6]
    Entry [action=5, date=2017-03-08T00:00Z[GMT], detail=D7]
    Entry [action=5, date=2017-03-10T00:00Z[GMT], detail=D8]
    Entry [action=5, date=2016-03-15T00:00Z[GMT], detail=D1]
    Entry [action=5, date=2016-09-25T00:00Z[GMT], detail=D2]
    Entry [action=5, date=2016-09-25T00:00Z[GMT], detail=D3]

If you insist on getting a List back:

    List<Entry> result = new ArrayList<>(entries.stream()
            .....
            .values());

If you want to keep your original order, use the 3-parameter groupingBy:

    ...groupingBy(Entry::getDetail, LinkedHashMap::new, Collectors.collectingAndThen(...))
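
Put together, the order-preserving variant would look roughly like this (a sketch, assuming the usual java.util and java.util.stream imports):

    // Sketch: Solution 1 with the original encounter order preserved via LinkedHashMap.
    Collection<Entry> ordered = entries.stream()
            .filter(e -> e.getAction() == 5)
            .collect(Collectors.groupingBy(Entry::getDetail,
                    LinkedHashMap::new,
                    Collectors.collectingAndThen(
                            Collectors.minBy(Comparator.comparing(Entry::getDate)),
                            Optional::get)))
            .values();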

Solution 2

Using toMap, which is easier to read and faster (see holi-java's comment on this answer and the Performance section below):

    List<Entry> col = new ArrayList<>(
            entries.stream()
                    .filter(e -> e.getAction() == 5)
                    .collect(Collectors.toMap(Entry::getDetail,
                            Function.identity(),
                            (a, b) -> a.getDate().compareTo(b.getDate()) >= 0 ? b : a))
                    .values());

where the merge function (a, b) -> a.getDate().compareTo(b.getDate()) >= 0 ? b : a can be replaced by:

 BinaryOperator.minBy(Comparator.comparing(Entry::getDate)) 

If you want to keep your original order with this solution, use the 4-parameter toMap:

    ...toMap(Entry::getDetail, Function.identity(),
            (a, b) -> a.getDate().compareTo(b.getDate()) >= 0 ? b : a,
            LinkedHashMap::new)
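
Combining the minBy merge function with the LinkedHashMap factory, the full order-preserving pipeline would look roughly like this (again a sketch under the same assumptions as above):

    // Sketch: Solution 2 with BinaryOperator.minBy as the merge function and a
    // LinkedHashMap so the original (chronological) encounter order is kept.
    List<Entry> ordered = new ArrayList<>(
            entries.stream()
                    .filter(e -> e.getAction() == 5)
                    .collect(Collectors.toMap(Entry::getDetail,
                            Function.identity(),
                            BinaryOperator.minBy(Comparator.comparing(Entry::getDate)),
                            LinkedHashMap::new))
                    .values());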

Performance

Using test data that I generated to check my solutions, I measured the execution time of both. The first solution takes about 67 ms on average, the second about 2 ms (each was run only 20 times, so do not trust the numbers too much!). If someone wants to do a proper performance comparison, post the results in the comments and I will add them here.
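
For a rough comparison of this kind, a naive timing loop along the lines below would do (a sketch only: runSolution1 and runSolution2 are hypothetical methods wrapping the two pipelines above, and a real benchmark should use JMH to deal with warm-up and JIT effects):

    // Naive timing sketch; runSolution1/runSolution2 are hypothetical wrappers
    // around Solution 1 and Solution 2. Use JMH for trustworthy numbers.
    int runs = 20;
    long total1 = 0, total2 = 0;
    for (int i = 0; i < runs; i++) {
        long start = System.nanoTime();
        runSolution1(entries);
        total1 += System.nanoTime() - start;

        start = System.nanoTime();
        runSolution2(entries);
        total2 += System.nanoTime() - start;
    }
    System.out.println("Solution 1: " + total1 / runs / 1_000_000 + " ms on average");
    System.out.println("Solution 2: " + total2 / runs / 1_000_000 + " ms on average");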


If I understood correctly...

    List<Entry> result = list.stream()
            // filter first, so an entry with a different action and an earlier date
            // cannot win the merge and push out the action-5 entry for the same detail
            .filter(e -> e.getAction() == 5)
            .collect(Collectors.toMap(
                    Entry::getDetail,
                    Function.identity(),
                    (left, right) -> left.getDate().compareTo(right.getDate()) > 0 ? right : left,
                    LinkedHashMap::new))
            .values()
            .stream()
            .collect(Collectors.toList());

The Stream interface provides a distinct() method for this. It removes duplicates based on equals().

So one option would be to implement your Entry's equals* method accordingly; another would be to define a Wrapper class that checks equality based on the relevant criterion (i.e. getDetail()):

    class Wrapper {
        private final Entry entry;

        Wrapper(Entry entry) {
            this.entry = entry;
        }

        Entry getEntry() {
            return this.entry;
        }

        @Override
        public boolean equals(Object o) {
            if (o instanceof Wrapper) {
                return entry.getDetail().equals(((Wrapper) o).getEntry().getDetail());
            }
            return false;
        }

        @Override
        public int hashCode() {
            return entry != null ? entry.getDetail().hashCode() : 0;
        }
    }

And to wrap, deduplicate and unwrap the objects:

    entries.stream()
            .filter(e -> e.getAction() == 5)
            .map(Wrapper::new)
            .distinct()
            .map(Wrapper::getEntry)
            .collect(Collectors.toList());

If the stream is ordered, the first matching entry is always kept. A stream over a List is always ordered, and since your list is in chronological order, the earliest entry wins.

*) I first tried this without a hashCode() implementation, and it fails. The reason is that internally java.util.stream.DistinctOps uses a HashSet to track the elements already seen and checks it with contains(), which relies on the hashCode() method as well as equals(). So implementing equals() alone is not enough.


You can create a LinkedHashMap using groupingBy, which preserves insertion order, unlike a HashMap. You say the list is already in chronological order, so preserving that order is enough. Then just collect the values of this map. For example (with static imports for the Collectors methods):

    List<Entry> selected = objs.stream()
            .filter(e -> e.getType() == 5)
            .collect(groupingBy(Entry::getDetail, LinkedHashMap::new, reducing((a, b) -> a)))
            .values().stream()
            .filter(Optional::isPresent)
            .map(Optional::get)
            .collect(toList());

The reducing step keeps the first of one or more occurrences, which, given the chronological input order, is the earliest one. See the documentation for LinkedHashMap and for the specific groupingBy overload used here.

