Need a good OCR for printed source code, any ideas?

At work, I sometimes have to take a printed source code listing and type it into a text editor by hand. Do not ask why.

Obviously, this takes a lot of time, and extra time is always needed to debug typing errors (oops, missed a "$" sign there).

I decided to try some OCR solutions, for example:

  • Microsoft Document Imaging - Built in OCR
    • Result: skipped all leading spaces, skipped all underscores, incorrectly interpreted many punctuation characters.
    • Conclusion: Slower than manually entering the code.
  • Various OCR Online Applications
    • Result: similar to or worse than Microsoft Document Imaging
    • Conclusion: Slower than manually entering the code.

It seems to me that source code should be very easy to OCR if the font is sans-serif and monospaced.

Do you have a good text recognition solution that works well with source code?

Maybe I just need a better OCR solution (not necessarily for source code)?

+6
ocr
6 answers

There are currently three options for good OCR:

  • ABBYY FineReader and OmniPage. Both are commercial products that are roughly equal in OCR features and results. I can't say much about OmniPage, but FineReader does support reading source code (for example, it has a Java language library).
  • Tesseract, the best open source OCR engine. It is much harder to use, and you will probably need to train it for your language.
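For anyone trying the Tesseract route, here is a minimal sketch of calling its command-line tool from Python. It assumes Tesseract 4 or later is installed and on the PATH (older versions spell the option `-psm`). The helper names `build_tesseract_command` and `ocr_listing` and the file name `listing.png` are made up for illustration. `--psm 6` treats the page as a single uniform block of text, and `preserve_interword_spaces=1` asks Tesseract to keep runs of spaces, which may help with column alignment in a listing but is not guaranteed to restore indentation.

```python
import subprocess


def build_tesseract_command(image_path, lang="eng"):
    """Build the argument list for a Tesseract CLI run that prints to stdout."""
    return [
        "tesseract", image_path, "stdout",
        "-l", lang,                           # recognition language
        "--psm", "6",                         # assume one uniform block of text
        "-c", "preserve_interword_spaces=1",  # keep runs of spaces where possible
    ]


def ocr_listing(image_path, lang="eng"):
    """Run Tesseract on the image and return the recognized text."""
    completed = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise if Tesseract exits with an error
    )
    return completed.stdout
```

Calling `ocr_listing("listing.png")` would return the recognized text as a string, which can then be cleaned up and compiled.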

I rarely do OCR, but I found that spending $150 on commercial software is worth it compared with the time otherwise wasted.

+5

Today, there are two new options (years after the question was asked):

1.)

Windows 10 ships with Microsoft's OCR engine.

It is located in the namespace:

Windows.Media.Ocr.OcrEngine 

https://msdn.microsoft.com/en-us/library/windows/apps/windows.media.ocr

Github also has an example:

https://github.com/Microsoft/Windows-universal-samples/tree/master/Samples/OCR

You need VS2015 to compile this sample. If you want to use an older version of Visual Studio, you must call the engine through traditional COM; in that case, read this article on CodeProject: http://www.codeproject.com/Articles/262151/Visual-Cplusplus-and-WinRT-Metro-Some-fundamentals

The OCR quality is very good. However, if the text is too small, you should enlarge the image first. Additional languages can be downloaded through Windows Update, even for handwriting!


2.)

Another option is to use the OCR library from Office. This is a COM library. It is available in Office 2003, 2007, and Vista, but was removed in Office 2010.

http://www.codeproject.com/Articles/10130/OCR-with-Microsoft-Office

The downside is that each Office installation comes with support for only a few languages. For example, the Spanish edition of Office installs support for Spanish, English, Portuguese, and French. But I noticed that it hardly matters whether you choose Spanish or English as the OCR language when recognizing Spanish text.

If you convert the image to grayscale, you will get better results. The recognition is decent, but it did not satisfy me. It makes about as many errors as Tesseract, although Tesseract needs much more image preprocessing to reach the same results.

+3

Printed text is usually easier for OCR than handwriting, but it all depends on the source image. I usually find that a PNG capture with a reduced number of colors (grayscale is best) and some manual cleanup (removing any image noise from scanning, etc.) works best.
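To make the grayscale step concrete, here is a minimal pure-Python sketch of the arithmetic, assuming the image has already been loaded as rows of (r, g, b) tuples with values from 0 to 255. The function name `to_grayscale` is made up for illustration, and the weights are the common ITU-R BT.601 luma coefficients. In practice an imaging library or the scanner software would do this for you.

```python
def to_grayscale(pixels):
    """Convert rows of (r, g, b) tuples (0-255) to rows of gray levels (0-255).

    Uses integer BT.601 luma weights: 0.299 R + 0.587 G + 0.114 B.
    """
    return [
        [(299 * r + 587 * g + 114 * b) // 1000 for (r, g, b) in row]
        for row in pixels
    ]
```

Each colored pixel collapses to a single brightness value, so the OCR engine only has to separate dark glyphs from a light background instead of dealing with several ink colors.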

Most OCR engines are similar in performance and accuracy. One that can be trained or corrected would be better.

+1

In general, I have found that FineReader gives very good results. Most products offer a trial version, so try as many as you can.

That said, program source code can be tricky:

  • leading spaces: perhaps running the result through a code pretty-printer can help
  • underscores and punctuation: perhaps a good product can be trained for this
+1

OCRopus is also a good open source option. But, like Tesseract, there is a pretty steep learning curve for efficient use and integration.

+1

Try http://www.free-ocr.com/ . I used it to recover source code from a screen capture when my IDE crashed in the middle of an editing session without warning. Obviously, the results depend on the font you use in the editor (I use Courier New 10pt in Delphi). I also tried Google Docs, which runs OCR on an image when it is uploaded. While Google Docs handles scanned documents pretty well, for some reason it fails on Pascal source.

FreeOCR working example. The input image (not available here) gave the following:

 begin
 FileIDToDelete := FolderToClean + 5earchRecord.Name ;
 Inc (TotalFilesFound) ;
 if (DeleteFile (PChar (FileIDToDelete))) then
 begin
 Log5tartupError (FormatEx ('%s file %s deleted', [Annotation, Fi eIDToDelete])) ;
 Inc (TotalFilesDeleted) ;
 end
 else
 begin
 Log5tartupError (FormatEx ('Error deleting %s file %s', [Annotat'on, FileIDToDelete])) ;
 Inc (TotalFilesDeleteErrors) ;
 end ;
 end ;
 FindResult := 5ysUtils.FindNext (5earchRecord) ;
 end ;

So restoring the indentation is the bulk of the work, followed by changing a few 5s to upper-case S. It also got confused by the vertical margin line at around column 80. Fortunately, most errors will be caught by the compiler (with the exception of errors inside quoted strings).
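The 5-for-S confusion is regular enough to script. Below is a minimal sketch, assuming the misread always turns "S" into "5" as in the output above. The helper name `fix_misread_s` is made up for illustration, and the rules are heuristics: they do not look inside or skip quoted strings, and the second rule would wrongly change a legitimate identifier such as `md5sum`.

```python
import re

# A "5" at the start of a word followed by two or more letters cannot begin a
# valid identifier, so it is very likely a misread "S" (e.g. "5earchRecord").
# Requiring two letters leaves numeric literals such as "5e3" alone.
_LEADING_FIVE = re.compile(r"\b5(?=[A-Za-z]{2})")

# A "5" between a letter and a lowercase letter (e.g. "Log5tartupError") is
# probably a misread "S" too. This is only a heuristic.
_INNER_FIVE = re.compile(r"(?<=[A-Za-z])5(?=[a-z])")


def fix_misread_s(line):
    """Replace a "5" that was probably an "S" before OCR."""
    line = _LEADING_FIVE.sub("S", line)
    return _INNER_FIVE.sub("S", line)
```

Running it over each OCR line before compiling leaves the compiler to flag whatever the heuristics miss.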

It's a shame FreeOCR does not have a source code option, where a space is considered significant.
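Since the OCR output arrives without leading whitespace, a crude re-indenter can restore the block structure for begin/end code. This is a minimal sketch under strong assumptions: `reindent` is a hypothetical helper, it only tracks lines that end with `begin` and lines that start with `end`, and it will get the depth wrong for constructs that close with `end` without a `begin` (for example `case`, `try`, or `record` in Delphi).

```python
import re

_OPENS = re.compile(r"\bbegin\s*$", re.IGNORECASE)  # line ends with "begin"
_CLOSES = re.compile(r"^end\b", re.IGNORECASE)      # line starts with "end"


def reindent(lines, indent="  "):
    """Indent lines according to begin/end nesting depth."""
    depth = 0
    result = []
    for raw in lines:
        line = raw.strip()
        if not line:
            result.append("")
            continue
        if _CLOSES.match(line):
            # Dedent before emitting the closing line; never go below zero.
            depth = max(depth - 1, 0)
        result.append(indent * depth + line)
        if _OPENS.search(line):
            # Everything after a "begin" is one level deeper.
            depth += 1
    return result
```

For example, `reindent(["begin", "Inc (TotalFilesFound) ;", "end ;"])` indents the middle line by one level and leaves `begin` and `end ;` at the left margin. A proper code formatter for the language would do a better job if one is available.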

Tip: if your source includes syntax highlighting, make sure you save the image in grayscale before uploading.

+1
