Tess4J does not use its own tessdata folder

I am using tess4j, a Java wrapper for Tesseract. I also have the regular Tesseract installed. I'm not quite sure how tess4j is designed to work, but since it ships with a tessdata folder, I assume that is where the language data files should go. However, tess4j only works if the language data files are in the "real" tessdata folder (the one that comes with Tesseract, not the one that comes with tess4j). If I delete that folder, I get this error message:

    Error opening data file C:\Program Files\Tesseract-OCR\tessdata/jpn.traineddata
    Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
    Failed loading language 'jpn'
    Tesseract couldn't load any languages!
    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    #  EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x631259dc, pid=5108, tid=10148
    #
    # JRE version: 7.0_06-b24
    # Java VM: Java HotSpot(TM) Client VM (23.2-b09 mixed mode, sharing windows-x86 )
    # Problematic frame:
    # C  [libtesseract302.dll+0x59dc]  STRING::strdup+0x467c
    #
    # Failed to write core dump. Minidumps are not enabled by default on client versions of Windows
    #
    # An error report file with more information is saved as:
    # D:\School\Programs\OCRTest\v1.0.0\hs_err_pid5108.log
    #
    # If you would like to submit a bug report, please visit:
    #   http://bugreport.sun.com/bugreport/crash.jsp
    # The crash happened outside the Java Virtual Machine in native code.
    # See problematic frame for where to report the bug.
    #

Does this mean that I need to have Tesseract installed in order to use tess4j? Why? Or maybe my tess4j tessdata folder is in the wrong place (it is currently in the same directory as my .java files; the tess4j jars are in a lib folder that I have put on the classpath).

+7
java tesseract
4 answers

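Point your TESSDATA_PREFIX environment variable at the tessdata folder that ships with Tess4J. Note that the error message in the question asks for the parent directory of "tessdata", not the tessdata folder itself.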

Usually you set this variable system-wide when you install the software, but you can also find a solution here: How to set environment variables from Java?

You must do this on the system that runs your application, because the Tesseract DLL depends on this environment variable.
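
A running JVM cannot change its own environment, so one way to apply this from Java is to start the OCR program as a child process with the variable set. The following is only a minimal sketch using `ProcessBuilder` from the standard library. The class name `TessEnv`, the method name, the `tess4j` folder and the `OcrMain` main class are assumptions made for illustration, not part of Tess4J. Following the error message in the question, the value passed is the parent directory of tessdata.

```java
import java.io.File;
import java.nio.file.Path;
import java.nio.file.Paths;

final class TessEnv {
    /**
     * Builds a launcher for a child process whose environment has
     * TESSDATA_PREFIX set to the given directory (the parent of "tessdata").
     */
    static ProcessBuilder withTessdataPrefix(Path parentOfTessdata, String... command) {
        ProcessBuilder pb = new ProcessBuilder(command);
        // Only the child's environment is changed; the current JVM is untouched.
        pb.environment().put("TESSDATA_PREFIX", parentOfTessdata.toAbsolutePath().toString());
        // Let the child share this process's console.
        pb.inheritIO();
        return pb;
    }

    public static void main(String[] args) {
        // Hypothetical layout: ./tess4j/tessdata holds the .traineddata files,
        // and OcrMain is a hypothetical class that performs the OCR.
        String classpath = "lib/*" + File.pathSeparator + ".";
        ProcessBuilder pb = withTessdataPrefix(Paths.get("tess4j"), "java", "-cp", classpath, "OcrMain");
        System.out.println("TESSDATA_PREFIX for child: " + pb.environment().get("TESSDATA_PREFIX"));
        // Calling pb.start() would launch the child with that environment.
    }
}
```

The sketch deliberately stops before `pb.start()`, so it can be run without Tesseract installed to check the value that the child process would see.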

+3

The TESSDATA_PREFIX environment variable, if defined, overrides everything else, including the value set by init or setDatapath; but this may change in the near future, so that the application can specify where its tessdata folder is.

http://code.google.com/p/tesseract-ocr/issues/detail?id=938
https://groups.google.com/forum/#!topic/tesseract-ocr/bkJwI8WmxSw
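
To make that precedence visible while debugging, a small check at startup can report which location would win. This is only a sketch of the rule as described in this answer, not code taken from Tesseract or Tess4J. The class and method names are hypothetical, and treating an empty variable as "not defined" is an assumption.

```java
import java.util.Map;

final class TessdataPrecedence {
    /**
     * Returns the location that takes effect under the precedence described
     * above: a defined TESSDATA_PREFIX wins over the path set in code.
     * Assumption: an empty value is treated the same as an unset variable.
     */
    static String effectiveLocation(Map<String, String> env, String pathSetInCode) {
        String prefix = env.get("TESSDATA_PREFIX");
        if (prefix != null && !prefix.isEmpty()) {
            return prefix;
        }
        return pathSetInCode;
    }

    public static void main(String[] args) {
        // "tessdata" stands in for whatever the application passes to setDatapath.
        String pathSetInCode = "tessdata";
        String prefix = System.getenv("TESSDATA_PREFIX");
        if (prefix != null && !prefix.isEmpty()) {
            System.err.println("Warning: TESSDATA_PREFIX is set and overrides the path set in code");
        }
        System.out.println("Effective location: " + effectiveLocation(System.getenv(), pathSetInCode));
    }
}
```

If the warning appears, unsetting the variable (or pointing it at the intended folder) is the simplest way to get the behaviour you expect.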

+2

You may not have a tessdata folder in your main project folder. This folder holds the data files for all supported Tesseract languages (files with the extensions .traineddata, .bigrams, .fold, .lm, .nn, .params, .size and .word-freq). If you don't have one, follow these steps:

  • Download the tessdata-master folder from github.com/tesseract-ocr/tessdata (from the ZIP download button)
  • Unzip the contents of the tessdata-master.zip file in the main project folder
  • Rename tessdata-master to tessdata
  • Run the java project and check if it works. At least it works for me.
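
Before running the project, the result of the steps above can be verified with a small check that looks for a language file inside the project's tessdata folder. This is a sketch using only the standard library. The class name `TessdataCheck`, the method name and the default language "eng" are assumptions for illustration.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class TessdataCheck {
    /** Returns true if <projectDir>/tessdata/<language>.traineddata is an existing file. */
    static boolean hasLanguageData(Path projectDir, String language) {
        Path file = projectDir.resolve("tessdata").resolve(language + ".traineddata");
        return Files.isRegularFile(file);
    }

    public static void main(String[] args) {
        // Assumes the program is started from the main project folder.
        Path projectDir = Paths.get("").toAbsolutePath();
        // "eng" is only a default for this sketch; pass another code such as "jpn" as an argument.
        String language = args.length > 0 ? args[0] : "eng";
        if (hasLanguageData(projectDir, language)) {
            System.out.println("Found tessdata/" + language + ".traineddata in " + projectDir);
        } else {
            System.err.println("Missing tessdata/" + language + ".traineddata in " + projectDir);
        }
    }
}
```
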
0

For those who use Maven and would rather not rely on global environment variables, this works for me:

    import java.io.File;
    import net.sourceforge.tess4j.Tesseract;
    import net.sourceforge.tess4j.TesseractException;
    import net.sourceforge.tess4j.util.LoadLibs;

    File imageFile = new File("C:\\random.png");
    Tesseract instance = Tesseract.getInstance();

    // In case you don't have your own tessdata, let it also be extracted for you
    File tessDataFolder = LoadLibs.extractTessResources("tessdata");

    // Set the tessdata path
    instance.setDatapath(tessDataFolder.getAbsolutePath());

    try {
        String result = instance.doOCR(imageFile);
        System.out.println(result);
    } catch (TesseractException e) {
        System.err.println(e.getMessage());
    }

Found here; tested with Maven → net.sourceforge.tess4j:tess4j:3.4.1; link 1.4.1 is also used.

0
