Tesseract v3.03 PDF rendering with searchable text

Question

According to the Tesseract v3.03 release notes, Tesseract can now produce a PDF file with searchable text, but I don't know how to use this feature in my code.
I am currently using tess-two in my Android app, so I'm wondering whether this feature can work on Android.

It would be great if you could give an example that uses the Tesseract API to render a PDF; I will then try to port the missing functions to the tess-two library.
Thanks in advance.

P.S.: I see a pdfrenderer file that can handle PDF output, but I don't know how to use it with the base API.

Update: here is my attempt:

    tesseract::TessResultRenderer* renderer =
        new tesseract::TessPDFRenderer(nat->api.GetDatapath());
    __android_log_print(ANDROID_LOG_ERROR, "Test_tesseract",
                        "data path = %s", nat->api.GetDatapath());

    if (!nat->api.ProcessPages(c_file_name, NULL, 0, renderer)) {
        __android_log_print(ANDROID_LOG_ERROR, "Test_tesseract",
                            "process page failed");
        delete renderer;
        return;
    }

    FILE* fout = fopen(c_pdf_file_name, "wb");
    if (fout == NULL) {
        __android_log_print(ANDROID_LOG_ERROR, "Test_tesseract",
                            "Cannot create output file %s\n", c_pdf_file_name);
        delete renderer;
        return;
    }

    const char* data;
    int dataLength;
    bool boolValue = renderer->GetOutput(&data, &dataLength);
    if (boolValue) {
        fwrite(data, 1, dataLength, fout);
        if (fout != stdout)
            fclose(fout);
        else
            clearerr(fout);
    } else {
        __android_log_print(ANDROID_LOG_ERROR, "Test_tesseract",
                            "Cannot get output file");
    }
    delete renderer;

My code fails in the ProcessPages call. After adding log statements (I have trouble debugging in the NDK), I found that the PDF renderer's BeginDocument always returns false inside the TessBaseAPI::ProcessPages method of baseapi.cpp:

    if (renderer && !renderer->BeginDocument(kUnknownTitle)) {
        success = false;
    }

Am I missing something?

P.S.: I use tess-two, which works with baseapi rather than capi.

android ocr tesseract

R4j Feb 12 '14 at 5:48

1 answer

nguyenq · Answer 1 · 2014-02-13T02:28:37+0000

It is as follows:

    TessResultRenderer renderer = api.TessPDFRendererCreate(dataPath);
    api.TessBaseAPIProcessPages1(handle, image, null, 0, renderer);

    PointerByReference data = new PointerByReference();
    IntByReference dataLength = new IntByReference();
    api.TessResultRendererGetOutput(renderer, data, dataLength);

    // dataLength is an IntByReference, so unwrap it with getValue()
    byte[] bytes = data.getValue().getByteArray(0, dataLength.getValue());
    // then write the bytes array to a file with a .pdf extension

If you have trouble following the code, see the rendering example in this post.
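The last step in the snippet above, writing the byte array to a file with a PDF extension, could be sketched roughly as follows. This is only a minimal illustration using the standard java.nio.file API: the class name PdfBytesWriter and the method writePdf are hypothetical helpers and not part of Tesseract or any wrapper library, and the bytes in main are placeholder data standing in for the renderer output.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper (not part of Tesseract): saves renderer output to disk.
final class PdfBytesWriter {

    // Writes the first `length` bytes of `data` to `target`,
    // creating the file or replacing its contents if it already exists.
    static void writePdf(byte[] data, int length, Path target) throws IOException {
        if (data == null || length < 0 || length > data.length) {
            throw new IllegalArgumentException("invalid PDF buffer");
        }
        Files.write(target, Arrays.copyOf(data, length));
    }

    public static void main(String[] args) throws IOException {
        // Placeholder bytes standing in for the renderer output; not a valid PDF.
        byte[] bytes = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        Path out = Files.createTempFile("ocr-output", ".pdf");
        writePdf(bytes, bytes.length, out);
        System.out.println("wrote " + Files.size(out) + " bytes to " + out);
    }
}
```

In the snippet from the answer, the call would presumably look like writePdf(bytes, dataLength.getValue(), path), with path pointing at the desired .pdf file.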
