How to extract text from djvu format and other e-books (possibly in Python)

I have a collection of e-books in djvu, pdf and chm formats, and I'm looking for a way to search their content for a keyword. I did some research and found a couple of suggestions for parsing PDF content, but there seems to be no way to convert djvu content to text. In any case, does anyone know a way to decode djvu content into text so that I can easily search it?

thanks

+4
3 answers

Assuming djvu files contain OCR-ed text, a quick way on Linux to get this is to use Popen to start djvutxt and capture the output.
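As a minimal sketch of that approach, assuming the djvutxt tool from DjVuLibre is installed and on the PATH: the helper names extract_djvu_text and matching_lines are made up for this illustration, and subprocess.run is used here as the modern convenience wrapper around Popen.

```python
import shutil
import subprocess


def extract_djvu_text(path, djvutxt="djvutxt"):
    """Run djvutxt on a DjVu file and return what it prints as a string."""
    # Fail early with a readable message if the tool is not installed.
    if shutil.which(djvutxt) is None:
        raise RuntimeError("djvutxt not found on PATH (it ships with DjVuLibre)")
    # With no output file argument, djvutxt writes the text to stdout.
    result = subprocess.run([djvutxt, path], capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def matching_lines(text, keyword):
    """Return (line_number, line) pairs containing keyword, ignoring case."""
    needle = keyword.lower()
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if needle in line.lower()
    ]
```

Calling matching_lines(extract_djvu_text("book.djvu"), "keyword") would then list the hits for one file (book.djvu is a placeholder name). If the extracted text is empty after stripping whitespace, the file most likely has no text layer to search.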

The text in a .djvu file is compressed with a DjVu-specific compression algorithm, bzz, for which there is no simple C interface that you could load as a shared object from Python. The existing implementation is written in C++ and built on some framework code.

Shameless self-promotion: I contributed the conversion from OCR-ed .djvu to calibre, which uses djvutxt in this way. However, it falls back to my pure Python decoder implementation (sloooow) if djvutxt is not available, so you can use that code if you cannot use djvutxt.

I have not yet released the Python source separately from calibre, but after downloading and extracting the calibre source:

 curl -L http://status.calibre-ebook.com/dist/src | tar xvJ
 find . | fgrep djvu

The relevant files are djvu_input.py, djvu.py and djvubzzdec.py.

+6

python-djvulibre is a set of Python bindings for DjVuLibre, an open source implementation of DjVu. I haven't tried it, but it looks like it should fit your needs.

+3

Of course, the DjVuLibre SDK will allow access to the text layer - if it exists (not all DjVu files have a text layer, many are just bitmap images).

An alternative solution might be to base your index on IIS technology. CamiNova has a free IFilter that you can use to do this.

http://dev.caminova.jp/beta/djvu-wic/

+1
