How would you get the word count in a given PDF file?

Interview Question

I was asked this question in an interview. The answer does not have to be specific to any programming language, platform, or tool.

The question is formulated as follows:

How would you get the number of occurrences of a given word in a PDF? The answer does not have to be specific to a programming language, platform, or tool. Just let me know how you would do it in a memory- and speed-efficient way.

I am posting this question for the following reasons:

  • To better understand the context: I still do not understand what the interviewer could be looking for by asking this question.
  • To get different opinions: I am inclined to answer such questions based on my skills in one programming language (C#), but there may be other valid approaches.

Thank you for your interest.

+5
3 answers

If I had to write a program for this, I would find a PDF rendering library capable of extracting text from PDF files, such as Xpdf, and then count the words. If this were a one-off task, or something that needed to be automated without production-quality requirements, I would just run the file through pdftotext and then parse the output file using Python: split it into words, put them into a dictionary, and count the number of occurrences of each.
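The extract-then-count approach described above can be sketched roughly as follows. It is written in Java rather than Python to match the other code on this page, and it is only an illustration under stated assumptions: the class and method names (PdfWordCount, extractText, countWords) are made up for this example, pdftotext is assumed to be installed and on the PATH, and a "word" is assumed to be any run of letters or digits, compared case-insensitively (so "don't" counts as "don" and "t").

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class PdfWordCount {

    // Runs the pdftotext command-line tool (assumed to be on the PATH) and
    // returns the extracted text. Passing "-" as the output file asks
    // pdftotext to write the text to standard output.
    static String extractText(Path pdf) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("pdftotext", "-enc", "UTF-8", pdf.toString(), "-")
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        String text = new String(p.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        if (p.waitFor() != 0) {
            throw new IOException("pdftotext failed for " + pdf);
        }
        return text;
    }

    // Splits the text on runs of characters that are neither letters nor
    // digits, lower-cases each word, and counts how often each one occurs.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        // Usage (hypothetical): java PdfWordCount some-file.pdf
        Map<String, Integer> counts = countWords(extractText(Path.of(args[0])));
        counts.forEach((word, n) -> System.out.println(word + "\t" + n));
    }
}
```

This version reads the whole extracted text into memory, which is usually acceptable for a one-off script; for very large documents the text could be processed as a stream instead.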

If I asked this interview question, I would look for a couple of things:

  • : script thingy
  • PDF- .

- PDF, , PDF "". , PDF . . , , , . PDF . (pdftotext ).


+4
+2

I would suggest an open-source solution using Java. First, parse the PDF file and extract all of its text using Tika.

Then I believe the real question is how to find the term frequency (TF) of the words in a text. I will not bother you with definitions, because you can achieve this simply by scanning the extracted text and counting how often each word occurs.

Sample code would look like this:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Scanner;

    public class WordCount {
        public static void main(String[] args) {
            // Assumes the text already extracted from the PDF arrives on standard input.
            Scanner scan = new Scanner(System.in);
            Map<String, Integer> listOfWords = new HashMap<>();

            while (scan.hasNext()) {
                String word = scan.next();
                // First occurrence stores 1; later occurrences add 1 to the stored count.
                // put() and merge() overwrite the old value, so no remove() is needed.
                listOfWords.merge(word, 1, Integer::sum);
            }

            System.out.println(listOfWords);
        }
    }
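If only the occurrences of one given word are needed, as in the question, a full frequency map is not required. The following is a minimal sketch under assumptions that are not in the original answer: the names SingleWordCount and countOccurrences are hypothetical, the text is assumed to have been extracted from the PDF already, and matching is assumed to be case-insensitive on whole words made of letters and digits.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.Scanner;

class SingleWordCount {

    // Streams tokens from the reader and counts those equal to the target
    // word, ignoring case. No map is built and the whole text is never held
    // in memory at once; the Scanner only keeps a small internal buffer.
    static long countOccurrences(Reader text, String target) {
        long count = 0;
        try (Scanner scan = new Scanner(text)) {
            // Treat any run of non-letter, non-digit characters as a separator.
            scan.useDelimiter("[^\\p{L}\\p{N}]+");
            while (scan.hasNext()) {
                if (scan.next().equalsIgnoreCase(target)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // A small in-memory sample stands in for the extracted PDF text.
        Reader sample = new StringReader("The cat saw the other cat.");
        System.out.println(countOccurrences(sample, "cat"));
    }
}
```

Memory use stays roughly constant regardless of document size, and the text is scanned exactly once, so the running time is linear in the length of the extracted text.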
0