Java: access to an arbitrary ASCII file with state

Is there a better solution (pre-existing, optionally Java 1.6) than writing a streaming file reader class that meets the following criteria?

  • For an ASCII file of arbitrarily large size, where each line ends with \n
  • Each call to some readLine() method reads a random line from the file
  • For the life of the file descriptor, readLine() must never return the same line twice

Update:

  • In the end, all lines must be read.

Context: the file contents are generated by Unix shell commands that list every path under a given directory. There are anywhere from a million to a billion files, which yields millions to a billion lines in the target file. If there is a way to distribute the paths randomly in the file at creation time, that is also an acceptable solution.

+4
4 answers

If the number of files is truly arbitrary, tracking which files have been processed could become a problem in terms of memory usage (or I/O time, if you track them in files rather than in a list or set). Solutions that maintain a growing list of already-selected lines also run into time issues.

I would consider something along the following lines:

  • Create n "bucket" files. n can be chosen based on the number of files and the available system memory. (If n is large, you can work on a subset of the n buckets at a time to limit the number of open file descriptors.)
  • Hash each file name and write it to the corresponding bucket file, which partitions the directory listing by an arbitrary criterion.
  • Read the contents of a bucket file (file names only) and process them as-is (the randomness comes from the hash), or pick rnd(n) and remove entries as you go, which gives a bit more randomness.
  • Alternatively, you can use the random-access idea, removing indexes/offsets from a list as you select them.
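
The bucket steps above could be sketched roughly as follows. This is only an illustration under assumptions: the class name BucketShuffle, the methods partition and drain, and the bucket-N.txt file names are all hypothetical, and each bucket is assumed to be small enough to load into memory once it is drained.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of the bucket idea; names are illustrative only. */
public class BucketShuffle {

    /** Writes every line of input into one of n bucket files chosen by the line's hash. */
    public static List<File> partition(File input, File dir, int n) throws IOException {
        List<File> buckets = new ArrayList<File>();
        List<BufferedWriter> writers = new ArrayList<BufferedWriter>();
        BufferedReader in = new BufferedReader(new FileReader(input));
        try {
            for (int i = 0; i < n; i++) {
                File f = new File(dir, "bucket-" + i + ".txt");
                buckets.add(f);
                writers.add(new BufferedWriter(new FileWriter(f)));
            }
            String line;
            while ((line = in.readLine()) != null) {
                // Mask the sign bit instead of Math.abs, which fails for Integer.MIN_VALUE
                int b = (line.hashCode() & 0x7fffffff) % n;
                BufferedWriter w = writers.get(b);
                w.write(line);
                w.write('\n');
            }
        } finally {
            in.close();
            for (BufferedWriter w : writers) w.close();
        }
        return buckets;
    }

    /** Loads one bucket into memory and shuffles it for extra randomness. */
    public static List<String> drain(File bucket, Random rnd) throws IOException {
        List<String> lines = new ArrayList<String>();
        BufferedReader in = new BufferedReader(new FileReader(bucket));
        try {
            String line;
            while ((line = in.readLine()) != null) lines.add(line);
        } finally {
            in.close();
        }
        Collections.shuffle(lines, rnd);
        return lines;
    }
}
```

Every line lands in exactly one bucket, so draining all buckets once visits each line exactly once. This sketch keeps all n writers open at the same time, so n has to stay below the process's file descriptor limit.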
+1

To avoid reading in the entire file, which may not be possible in your case, you can use RandomAccessFile instead of the standard Java FileInputStream. With RandomAccessFile you can use the seek(long position) method to jump to an arbitrary position in the file and start reading there. The code would look something like this:

    import java.io.RandomAccessFile;
    import java.util.HashMap;
    import java.util.Map;

    // Place inside a method that declares "throws IOException".
    // numberOfRandomSamples must not exceed the number of lines, or the loop never ends.
    RandomAccessFile raf = new RandomAccessFile("path-to-file", "r"); // read-only access is enough
    // Keyed by the line's start offset: distinct lines can share a hashCode(), offsets cannot collide
    Map<Long, String> sampledLines = new HashMap<Long, String>();
    while (sampledLines.size() < numberOfRandomSamples) {
        // seek to a random point in the file
        raf.seek((long) (Math.random() * raf.length()));
        // skip from the random location to the beginning of the next line
        int nextByte = raf.read();
        while (nextByte != -1 && nextByte != '\n')
            nextByte = raf.read();
        // wrap around to the beginning of the file if you reach the end
        if (raf.getFilePointer() >= raf.length())
            raf.seek(0);
        // ensure uniqueness
        long lineStart = raf.getFilePointer();
        if (sampledLines.containsKey(lineStart))
            continue;
        // read the line into a buffer
        StringBuilder lineBuffer = new StringBuilder();
        while ((nextByte = raf.read()) != -1 && nextByte != '\n')
            lineBuffer.append((char) nextByte);
        sampledLines.put(lineStart, lineBuffer.toString());
    }
    raf.close();

When the loop finishes, sampledLines holds your randomly selected lines. You may need to check that you have not skipped past the end of the file, to avoid errors in that case.

EDIT: The code now wraps around to the beginning of the file if it reaches the end. It was a pretty simple check.

EDIT 2: I added a uniqueness check on the sampled lines using a HashMap.

+5

Pre-process the input file and remember the offset of each new line. Use a BitSet to track which lines have been used. If you want to save some memory, remember only the offset of every 16th line; it is still easy to seek into the file and scan sequentially within a block of 16 lines.
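
A minimal sketch of this idea follows, under a few assumptions: the class name SparseLineSampler and its methods are hypothetical, the line count fits in an int (a BitSet is indexed by int), and an already-used pick is resolved by probing forward to the next unused line, which keeps lookups simple but is not perfectly uniform.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.util.Arrays;
import java.util.BitSet;
import java.util.Random;

/** Hypothetical sketch: sparse line-offset index plus a BitSet of used lines. */
public class SparseLineSampler {
    private static final int STEP = 16;           // remember every 16th line start
    private final RandomAccessFile raf;
    private long[] blockOffsets = new long[1024]; // offsets of lines 0, 16, 32, ...
    private int lineCount = 0;
    private final BitSet used = new BitSet();
    private int remaining;
    private final Random rnd = new Random();

    public SparseLineSampler(String path) throws IOException {
        // Pre-processing pass: scan the file once and record sparse line offsets.
        InputStream in = new BufferedInputStream(new FileInputStream(path));
        try {
            long pos = 0;
            int blocks = 0;
            boolean atLineStart = true;
            int b;
            while ((b = in.read()) != -1) {
                if (atLineStart) {
                    if (lineCount % STEP == 0) {
                        if (blocks == blockOffsets.length)
                            blockOffsets = Arrays.copyOf(blockOffsets, blocks * 2);
                        blockOffsets[blocks++] = pos;
                    }
                    lineCount++;
                    atLineStart = false;
                }
                pos++;
                if (b == '\n') atLineStart = true;
            }
        } finally {
            in.close();
        }
        remaining = lineCount;
        raf = new RandomAccessFile(path, "r");
    }

    /** Returns a not-yet-returned line, or null once every line has been read. */
    public String nextLine() throws IOException {
        if (remaining == 0) return null;
        // Random start, then probe forward to the next unused line (wrapping around).
        int line = used.nextClearBit(rnd.nextInt(lineCount));
        if (line >= lineCount) line = used.nextClearBit(0);
        used.set(line);
        remaining--;
        // Seek to the start of the 16-line block, then scan forward to the wanted line.
        raf.seek(blockOffsets[line / STEP]);
        String text = null;
        for (int i = 0; i <= line % STEP; i++) text = raf.readLine();
        return text;
    }

    public void close() throws IOException {
        raf.close();
    }
}
```

For a billion lines, the BitSet needs about 125 MB and the sparse offset array about 500 MB (one 8-byte long per 16 lines), compared with roughly 8 GB for an offset per line.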

+2

Since you can pad the lines to a fixed size, I would do something along these lines. Note that even then there may be a limit on how many entries a List can actually hold.

Drawing a random number each time you want to read a line and adding it to a Set would also work; however, the approach below guarantees that the file is read completely:

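    import java.io.Closeable;
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Iterator;
    import java.util.List;
    import java.util.NoSuchElementException;
    import java.util.Random;

    public class VeryLargeFileReading implements Iterator<String>, Closeable {
        private static final Random RND = new Random();
        // Start offsets of all lines, in shuffled order
        private final List<Long> indices = new ArrayList<Long>();
        private final RandomAccessFile fd;

        // lineSize is the fixed length of every line in bytes, including the trailing '\n'
        public VeryLargeFileReading(String fileName, long lineSize) throws IOException {
            fd = new RandomAccessFile(fileName, "r");
            long nrLines = fd.length() / lineSize;
            for (long i = 0; i < nrLines; i++) indices.add(i * lineSize);
            Collections.shuffle(indices, RND);
        }

        // Iterator methods
        @Override
        public boolean hasNext() {
            return !indices.isEmpty();
        }

        @Override
        public void remove() {
            // Nope
            throw new UnsupportedOperationException();
        }

        @Override
        public String next() {
            if (indices.isEmpty()) throw new NoSuchElementException();
            // Removing from the end of an ArrayList is O(1); remove(0) would shift every element
            final long offset = indices.remove(indices.size() - 1);
            try {
                fd.seek(offset);
                return fd.readLine().trim();
            } catch (IOException e) {
                // Iterator.next() cannot throw checked exceptions
                throw new IllegalStateException(e);
            }
        }

        @Override
        public void close() throws IOException {
            fd.close();
        }
    }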
+2
