Monday, October 14, 2013

Tesseract wrapper

GOCR really doesn't seem to recognize Finnish or different kind of fonts out of the box. Tesseract seems to have no problems in recognizing the content I feed it so I wanted to configure XSANE so that it would use Tesseract instead. Setting up XSANE to use Tesseract is not straightforward though. XSANE expects input and output files to be defined as options when Tesseract accepts the input file as first argument and the basename of the output as second parameter. By default XSANE is configured to use GOCR.

In order to make XSANE work with Tesseract the easy option was just to make a wrapper script that accepts options in the way XSANE can provide them. For this purpose I created a wrapper script for Tesseract.

Comparisons of options:


[sebastian@localhost tesseract-wrapper]$ tesseract --help
Usage:tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]

pagesegmode values are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
-l lang and/or -psm pagesegmode must occur before anyconfigfile.

Single options:
  -v --version: version info
  --list-langs: list available languages for tesseract engine


[sebastian@localhost tesseract-wrapper]$ gocr --help
 Optical Character Recognition --- gocr 0.49 20100924
 Copyright (C) 2001-2010 Joerg Schulenburg  GPG=1024D/53BDFBE3
 released under the GNU General Public License
 using: gocr [options] pnm_file_name  # use - for stdin
 options (see gocr manual pages for more details):
 -h, --help
 -i name   - input image file (pnm,pgm,pbm,ppm,pcx,...)
 -o name   - output file  (redirection of stdout)
 -e name   - logging file (redirection of stderr)
 -x name   - progress output to fifo (see manual)
 -p name   - database path including final slash (default is ./db/)
 -f fmt    - output format (ISO8859_1 TeX HTML XML UTF8 ASCII)
 -l num    - threshold grey level 0<160<=255 (0 = autodetect)
 -d num    - dust_size (remove small clusters, -1 = autodetect)
 -s num    - spacewidth/dots (0 = autodetect)
 -v num    - verbose (see manual page)
 -c string - list of chars (debugging, see manual)
 -C string - char filter (ex. hexdigits: 0-9A-Fx, only ASCII)
 -m num    - operation modes (bitpattern, see manual)
 -a num    - value of certainty (in percent, 0..100, default=95)
 -u string - output this string for every unrecognized character
 gocr -m 4 text1.pbm                   # do layout analyzis
 gocr -m 130 -p ./database/ text1.pbm  # extend database
 djpeg -pnm -gray text.jpg | gocr -    # use jpeg-file via pipe


Tesseract wrapper

[sebastian@localhost tesseract-wrapper]$ ./tesseract-wrapper --help
./tesseract-wrapper -i inputfile [-o outputfile] [-l lang]
Clone the repo in Github

No comments:

Tip me if you like what you're reading