How to extract text from djvu format and other e-books (possibly in Python)

Question


I have a collection of e-books in djvu, pdf and chm formats, and I'm looking for a way to search their content for a keyword. I did some research and found a couple of suggestions for parsing PDF content, but there seems to be no way to convert djvu content to text. In any case, does anyone know a way to decode djvu content into text so that I can easily search it?

thanks

+4

python pdf full-text-search djvu

leon Oct 08 '09 at 15:28


3 answers

Anthon · Answer 1 · 2013-03-12T18:28:45+0000

Assuming the djvu files contain OCR-ed text, a quick way to get at it on Linux is to use subprocess.Popen to start djvutxt and capture its output.
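
A minimal sketch of that approach, assuming djvutxt (from the DjVuLibre tools) is on the PATH and writes the text layer to standard output as UTF-8. The function names djvu_to_text and contains_keyword are illustrative only:

```python
import shutil
import subprocess


def djvu_to_text(path, djvutxt="djvutxt"):
    """Return the text layer of a DjVu file, or None if djvutxt is not installed."""
    exe = shutil.which(djvutxt)
    if exe is None:
        return None
    # Start djvutxt and capture what it prints to stdout.
    proc = subprocess.Popen([exe, path], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError(err.decode("utf-8", "replace").strip())
    # Assumption: the output is UTF-8; undecodable bytes are replaced.
    return out.decode("utf-8", "replace")


def contains_keyword(text, keyword):
    """Case-insensitive substring search over the extracted text."""
    return bool(text) and keyword.lower() in text.lower()
```

A file without a text layer would be expected to produce empty output, in which case contains_keyword simply returns False.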

The text in a .djvu file is compressed with bzz, a compression algorithm specific to djvu, for which there is no simple C interface that you could load as a shared object from Python. The existing implementation is in C++ and relies on some framework code.

Shameless self-promotion: I contributed the conversion from OCR-ed .djvu files to calibre, which uses djvutxt in this way. However, it falls back to my (sloooow) pure-Python decoder implementation if djvutxt is not available, so you can use that code if you cannot use djvutxt.
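
The fallback described above could be structured roughly as follows. This is not the calibre code: pure_python_decode is a hypothetical stand-in for a slower pure-Python decoder supplied by the caller, and the UTF-8 decoding of the djvutxt output is an assumption.

```python
import subprocess


def extract_text(path, pure_python_decode, djvutxt="djvutxt"):
    """Try the external djvutxt tool first, then fall back to a Python decoder."""
    try:
        result = subprocess.run([djvutxt, path], capture_output=True, check=True)
    except FileNotFoundError:
        # djvutxt is not installed: use the (hypothetical) slow decoder instead.
        return pure_python_decode(path)
    # Assumption: djvutxt output is UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", "replace")
```

If djvutxt is present but exits with an error, check=True lets subprocess.CalledProcessError propagate instead of silently switching decoders.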

I have not yet released the Python source separately from calibre. But after downloading and extracting the calibre source:

 curl -L http://status.calibre-ebook.com/dist/src | tar xvJ
 find . | fgrep djvu

The relevant files are djvu_input.py, djvu.py and djvubzzdec.py.

Alex Martelli · Answer 2 · 2009-10-08T15:39:16+0000

python-djvulibre is a set of Python bindings to djvulibre, an open source implementation of djvu. I haven't tried it, but it looks like it should fit your needs.

msr · Answer 3 · 2009-12-11T04:29:44+0000

Of course, the DjVuLibre SDK will allow access to the text layer - if it exists (not all DjVu files have a text layer, many are just bitmap images).

An alternative solution might be to base your index on IIS technology. CamiNova has a free IFilter that you can use to do this.

http://dev.caminova.jp/beta/djvu-wic/
