Find common names in two files in Java

First of all, I want to clarify that, to my knowledge, this question is different from other questions that have already been posted. Please let me know if this is not the case.

Considering

  • I have a list of ~3000 names.
  • There are ~2500 files, each containing one name per line (the names are taken from that list).
  • Each file contains at most ~3000 names (and therefore at most ~3000 lines, although the average is ~400).

Problem

At run time, I will be given 2 of these files. I need to produce a list of the names that are common to both files.

Preliminary processing

To reduce the time complexity, I performed preprocessing and sorted the names in all the files.

My approach

  • Names in the given list are sorted and indexed from 0 to 2999.
  • In each file, for each name (see the sketch after this list):

    • Compute the group number (name_index / 30).
    • Compute the group value (for each name in the same group, compute 2 ^ (name_index % 30) and add the results together).
    • Create a new file with the same name, in the format "groupNumber blankSpace groupValue".
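Here is a minimal sketch of that encoding step, assuming the sorted master list is already loaded into a map from name to index (the map, the class name, and the method name are mine; only the group size of 30 comes from the description above):

    import java.util.*;

    class GroupEncoder {
        // Names are indexed 0..2999 in nameIndex. Adding 2^(idx % 30) for
        // distinct names in a group is the same as OR-ing bits into a 30-bit mask.
        static Map<Integer, Integer> encode(List<String> namesInFile,
                                            Map<String, Integer> nameIndex) {
            Map<Integer, Integer> groups = new TreeMap<>();
            for (String name : namesInFile) {
                int idx = nameIndex.get(name);
                int group = idx / 30;                      // group number
                int bit = 1 << (idx % 30);                 // 2^(name_index % 30)
                groups.merge(group, bit, (a, b) -> a | b); // accumulate the group value
            }
            return groups;                                 // groupNumber -> groupValue
        }
    }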

Result

Instead of ~3000 names per file (average ~400), each new file will now have at most 100 lines (3000 names / 30 names per group = at most 100 groups). To find the common names, I check for common group numbers, and then use bit manipulation on the group values.
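As an illustration of that lookup step, under the same assumptions as the sketch above (this method would sit in the same hypothetical GroupEncoder class), intersecting two encoded files reduces to AND-ing the group values of matching group numbers:

    // Given two encoded files (groupNumber -> groupValue bitmask),
    // recover the indices of the names present in both.
    static List<Integer> commonIndices(Map<Integer, Integer> file1,
                                       Map<Integer, Integer> file2) {
        List<Integer> result = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : file1.entrySet()) {
            Integer other = file2.get(e.getKey());
            if (other == null) continue;               // no common group
            int common = e.getValue() & other;         // bits set in both files
            for (int bit = 0; bit < 30; bit++) {
                if ((common & (1 << bit)) != 0) {
                    result.add(e.getKey() * 30 + bit); // back to the name index
                }
            }
        }
        return result;
    }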

Expectation

Can anyone suggest a simpler and better solution to this problem? I can do preprocessing and store new files in my application so that minimal processing is required when searching for common names.

Please let me know if I am going in the wrong direction with this problem. Thanks in advance.

Points

With my approach, the total size of the stored files is 258 KB (since I store group numbers and group values), whereas storing one name per line takes 573 KB. These files must be stored on a mobile device, so I need to reduce the size as much as possible. I am also open to data compression, but I do not know how to do it. Please pay attention to this as well.

+4
4 answers

Have you tried the following?

  • Read the names one at a time from list1, adding them to a HashSet.
  • Read the names from list2 one at a time, looking each one up in the hash set built from list1. If a name is in the hash set, it is common to both files.

If you want to preprocess for some extra speed, store the number of names in each list and pick the shorter list as list1.
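A minimal sketch of this idea, assuming the two input files are named file1.txt and file2.txt (hypothetical names), one name per line:

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.*;

    class CommonNames {
        public static void main(String[] args) throws IOException {
            // build a hash set from the first (ideally shorter) list
            Set<String> list1 = new HashSet<>(Files.readAllLines(Paths.get("file1.txt")));
            List<String> common = new ArrayList<>();
            // stream the second list through it and collect the matches
            for (String name : Files.readAllLines(Paths.get("file2.txt"))) {
                if (list1.contains(name)) {
                    common.add(name);
                }
            }
            System.out.println(common);
        }
    }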

+4

Yes! Given the very low memory requirement that you mentioned in your edit, there is one more thing you can do.

Although I still think you can go with the solution the other answers suggest: a HashSet of 3000 Strings will not be too big. A quick estimate with 16-character Strings suggests something below 400 KB of heap memory. Try it, then come back. It is about 25 lines of code for the entire program.


If that solution consumes too much memory, you can do the following (a sketch is given after this list):

  • Sort the names in the files. This is always good.
  • Open both files.
  • Read a line from each file.
    • If line1 < line2, read the next line from file1 and repeat.
    • If line1 > line2, read the next line from file2 and repeat.
    • Otherwise, they are the same; add the name to the results and repeat.
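A rough sketch of that merge-style pass, assuming the two sorted files are named sorted1.txt and sorted2.txt (hypothetical names) with one name per line:

    import java.io.*;
    import java.util.*;

    class SortedIntersect {
        public static void main(String[] args) throws IOException {
            try (BufferedReader r1 = new BufferedReader(new FileReader("sorted1.txt"));
                 BufferedReader r2 = new BufferedReader(new FileReader("sorted2.txt"))) {
                List<String> common = new ArrayList<>();
                String line1 = r1.readLine();
                String line2 = r2.readLine();
                while (line1 != null && line2 != null) {
                    int cmp = line1.compareTo(line2);
                    if (cmp < 0) {
                        line1 = r1.readLine();   // advance the smaller side
                    } else if (cmp > 0) {
                        line2 = r2.readLine();
                    } else {
                        common.add(line1);       // same name in both files
                        line1 = r1.readLine();
                        line2 = r2.readLine();
                    }
                }
                System.out.println(common);
            }
        }
    }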

It uses almost no memory, and it is a good place to use the compareTo() method (if you used it to sort the names, that is) and, I think, a switch.

File size does not affect memory usage at all.


About data compression: there are many tools and algorithms that you could use, try this (see also the related questions), or this.
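If you just need something simple for the stored files, one option on the JVM (my own suggestion, not part of the answer above) is the GZIP streams in java.util.zip; a names file compresses well because the lines repeat a lot of structure. A minimal sketch, assuming hypothetical file names:

    import java.io.*;
    import java.nio.file.*;
    import java.util.zip.GZIPOutputStream;

    class CompressNames {
        public static void main(String[] args) throws IOException {
            byte[] data = Files.readAllBytes(Paths.get("names.txt"));
            // write the same bytes through a GZIP stream to produce names.txt.gz
            try (OutputStream out = new GZIPOutputStream(
                    new FileOutputStream("names.txt.gz"))) {
                out.write(data);
            }
        }
    }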

+2

You are trying to re-implement a set. Do not do that. Use a Set of names, which will automatically take care of duplicate inserts.

You need to read both files; there is no way around that.

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.*;

    class RetainAllExample {
        public static void main(String[] args) throws IOException {
            // file names are placeholders; a Set silently ignores duplicate insertions
            Set<String> names1 = new HashSet<>();
            for (String name : Files.readAllLines(Paths.get("file1.txt"))) {
                names1.add(name.trim());
            }
            Set<String> names2 = new HashSet<>();
            for (String name : Files.readAllLines(Paths.get("file2.txt"))) {
                names2.add(name.trim());
            }
            // with this line, names1 will discard any name not in names2
            names1.retainAll(names2);
            System.out.println(names1);
        }
    }

Assuming you use a HashSet as in this example, you will be comparing string hashes, which greatly improves performance.

If you find that performance is poor, then start looking for faster solutions. Anything else is premature optimization, and if you don't know how fast it needs to be, it is optimization without a goal. Finding the "fastest" solution would require enumerating and exhausting all possible solutions, since some solution you have not yet tested might be faster.

0

I'm not sure if I understand your requirements and situation.

You have about 2,500 files, each of 3,000 words (or 400?). There are many duplicate words that are found in multiple files.

Now someone will ask you which words file-345 and file-765 have in common.

You could create a hashmap in which you store each word together with the list of files in which that word occurs.

If you are given file-345 with its 3,000 words (400?), you look each word up in the hashmap and check whether file-765 is listed for it.
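A hypothetical sketch of that lookup structure (class, method, and parameter names are my own, not from the answer): an inverted index from word to the set of file ids containing it.

    import java.util.*;

    class InvertedIndex {
        private final Map<String, Set<Integer>> wordToFiles = new HashMap<>();

        // index one file: record, for every word, that this file contains it
        void addFile(int fileId, List<String> words) {
            for (String word : words) {
                wordToFiles.computeIfAbsent(word, w -> new HashSet<>()).add(fileId);
            }
        }

        // words of fileA that also occur in fileB
        List<String> commonWords(List<String> wordsOfFileA, int fileB) {
            List<String> common = new ArrayList<>();
            for (String word : wordsOfFileA) {
                Set<Integer> files = wordToFiles.get(word);
                if (files != null && files.contains(fileB)) {
                    common.add(word);
                }
            }
            return common;
        }
    }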

However, 2 * 3000 is not so much. If I create 2 string lists in Scala (which runs on the JVM):

    val r = new scala.util.Random
    val g1 = (1 to 3000).map(x => "" + r.nextInt(10000))
    val g2 = (1 to 3000).map(x => "" + r.nextInt(10000))

and build the intersection

    g1.intersect(g2)

I get the result (678 elements) in almost no time on an 8-year-old laptop.

So how many queries will you have to answer? How often do the input files change? If they change rarely, then reading the two files might be the critical cost.

How many unique words do you have? Perhaps it is not a problem to keep them all in memory.

0

Source: https://habr.com/ru/post/1411632/

