Estimating the number of words in a file without reading the full file

I have a program that processes very large files, and I need to show a progress bar while it runs. The program works at the word level: it reads one line at a time, splits it into words, and processes the words one by one. So while the program is running, it knows how many words it has processed so far. If it also knew in advance the total number of words in the file, it could easily calculate the progress.

The problem is that the files I'm dealing with can be very large, so I don't want to read each file twice: once to get the total number of words and a second time to run the actual processing code.
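As a minimal sketch of the progress calculation described above: the helper name progress-fraction is hypothetical and not part of my program. It assumes the total is only an estimate, so the result is capped at 1.0 in case the real count turns out to be higher than estimated.

```clojure
;; Hypothetical helper: fraction of the work done so far, given an
;; estimated total. Capped at 1.0 because the estimate may be too low.
(defn progress-fraction [words-processed estimated-total]
  (if (pos? estimated-total)
    (min 1.0 (/ (double words-processed) estimated-total))
    0.0))

(progress-fraction 250 1000)
```

Capping the value keeps the progress bar from running past 100% when the estimate undershoots.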

So, I'm trying to write code that can estimate the number of words in a file by reading a small part of it. This is what I came up with (in Clojure):

(defn estimated-word-count [file]
  (let [^java.io.File file (as-file file)
        ^java.io.Reader rdr (reader file)
        buffer (char-array 1000)
        chars-read (.read rdr buffer 0 1000)]
    (.close rdr)
    (if (= chars-read -1)
      0
      (* 0.001 (.length file) 
        (-> (String. buffer 0 chars-read) tokenize-line count)))))

This code reads the first 1000 characters of the file, builds a String from them, tokenizes it into words, and counts them. It then estimates the number of words in the whole file by multiplying that count by the length of the file and dividing by 1000.

When I run this code on a file with English text, I get an almost correct word count. But when I run it on a file with Hindi text (encoded in UTF-8), it returns almost double the actual number of words.

I understand that this problem is related to encoding. So is there a way to solve this problem?
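A small sketch that shows where the mismatch comes from. The helper utf8-bytes-per-char is hypothetical and only for illustration. The file length is measured in bytes, while the sample is measured in chars; for ASCII text these are the same, but Devanagari letters take 3 bytes each in UTF-8, so scaling by the file length overestimates.

```clojure
;; Hypothetical helper: average number of UTF-8 bytes per char in a string.
(defn utf8-bytes-per-char [^String s]
  (/ (double (alength (.getBytes s "UTF-8")))
     (.length s)))

;; Pure ASCII: every char is one byte.
(utf8-bytes-per-char "hello world")

;; Devanagari letters are 3 bytes each, the space is 1 byte,
;; so the ratio lies somewhere between 1 and 3.
(utf8-bytes-per-char "नमस्ते दुनिया")
```

The exact ratio for a real file depends on how much of it is spaces, digits, and punctuation, which are single bytes.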

Solution

As suggested by Frank, I determine how many bytes the first 10,000 characters occupy and use that to correct the estimate of the number of words in the file.

;; Ratio of chars to UTF-8 encoded bytes: 1 for pure ASCII, smaller otherwise.
(defn chars-per-byte [^String s]
  (/ (count s) (count (.getBytes s "UTF-8"))))

(defn estimate-file-word-count [file]
  (let [file (as-file file)
        rdr (reader file)
        buffer (char-array 10000)
        chars-read (.read rdr buffer 0 10000)]
    (.close rdr)
    (if (= chars-read -1)
      0
      (let [s (String. buffer 0 chars-read)]
        (* (/ 1.0 chars-read) (.length file) (chars-per-byte s)
          (-> s tokenize-line count))))))
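The arithmetic above reduces to "words in the sample, times file bytes, divided by sample bytes". Here is a string-only sketch of that idea that can be tried at the REPL without a file. The names count-words and estimate-words are hypothetical, and whitespace splitting is an assumption standing in for tokenize-line, which is not shown here.

```clojure
(require '[clojure.string :as str])

;; Assumption: words are separated by whitespace. This stands in for
;; tokenize-line, whose definition is not part of this post.
(defn count-words [^String s]
  (count (remove str/blank? (str/split s #"\s+"))))

;; Scale the sample's word count by the ratio of total bytes to sample bytes.
(defn estimate-words [^String sample total-bytes]
  (let [sample-bytes (alength (.getBytes sample "UTF-8"))]
    (if (zero? sample-bytes)
      0
      (* (count-words sample)
         (/ (double total-bytes) sample-bytes)))))

;; 4 words in 18 bytes, scaled to a 180-byte file.
(estimate-words "one two three four" 180)
```

Because both sides of the ratio are measured in bytes, the same call works for ASCII and for multi-byte text.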

, UTF-8. , 10000 , .


UTF-8 char. , 1000 . , , char .

100 . Clojure , , , , 1000 ?


. , , .

- , , getBytes, , , . , .

, .


/ char /?


? , " 0,1%". AVG_BYTES_PER_WORD .
