To make XSANE work with Tesseract, the easiest option was to write a small wrapper script for Tesseract that accepts options in the form XSANE can provide them.
Comparison of options:
Tesseract
[sebastian@localhost tesseract-wrapper]$ tesseract --help
Usage:tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]
pagesegmode values are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
-l lang and/or -psm pagesegmode must occur before anyconfigfile.
Single options:
-v --version: version info
--list-langs: list available languages for tesseract engine
GOCR
[sebastian@localhost tesseract-wrapper]$ gocr --help
Optical Character Recognition --- gocr 0.49 20100924
Copyright (C) 2001-2010 Joerg Schulenburg GPG=1024D/53BDFBE3
released under the GNU General Public License
using: gocr [options] pnm_file_name # use - for stdin
options (see gocr manual pages for more details):
-h, --help
-i name - input image file (pnm,pgm,pbm,ppm,pcx,...)
-o name - output file (redirection of stdout)
-e name - logging file (redirection of stderr)
-x name - progress output to fifo (see manual)
-p name - database path including final slash (default is ./db/)
-f fmt - output format (ISO8859_1 TeX HTML XML UTF8 ASCII)
-l num - threshold grey level 0<160<=255 (0 = autodetect)
-d num - dust_size (remove small clusters, -1 = autodetect)
-s num - spacewidth/dots (0 = autodetect)
-v num - verbose (see manual page)
-c string - list of chars (debugging, see manual)
-C string - char filter (ex. hexdigits: 0-9A-Fx, only ASCII)
-m num - operation modes (bitpattern, see manual)
-a num - value of certainty (in percent, 0..100, default=95)
-u string - output this string for every unrecognized character
examples:
gocr -m 4 text1.pbm # do layout analyzis
gocr -m 130 -p ./database/ text1.pbm # extend database
djpeg -pnm -gray text.jpg | gocr - # use jpeg-file via pipe
webpage: http://jocr.sourceforge.net/
Tesseract wrapper
[sebastian@localhost tesseract-wrapper]$ ./tesseract-wrapper --help
Syntax:
./tesseract-wrapper -i inputfile [-o outputfile] [-l lang]
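To illustrate how such a wrapper can translate the gocr-style -i/-o/-l options into Tesseract's positional syntax, here is a minimal sketch. It is not the script from the repo: the function name tesseract_wrapper, the TESSERACT override variable, and the default language "eng" are assumptions made for this example. It relies on the fact that Tesseract appends ".txt" to the output base, so the text is written to a temporary directory first and then copied to the requested output file, or to stdout when no -o is given.

```shell
#!/bin/sh
# Hypothetical sketch of a gocr-style front end for tesseract.
# TESSERACT may be set to another command (assumption, handy for testing).

usage() {
    echo "Syntax:"
    echo "  $0 -i inputfile [-o outputfile] [-l lang]"
}

# The body runs in a subshell, so variables and "exit" stay local to the call.
tesseract_wrapper() (
    case "$1" in
        --help) usage; exit 0 ;;
    esac

    input="" output="" lang="eng"   # "eng" as default is an assumption
    OPTIND=1
    while getopts "i:o:l:h" opt; do
        case "$opt" in
            i) input=$OPTARG ;;
            o) output=$OPTARG ;;
            l) lang=$OPTARG ;;
            h) usage; exit 0 ;;
            *) usage >&2; exit 2 ;;
        esac
    done
    if [ -z "$input" ]; then
        usage >&2
        exit 2
    fi

    tmpdir=$(mktemp -d) || exit 1
    # tesseract writes "<outputbase>.txt"; -l has to follow both file names.
    # Anything tesseract prints is sent to stderr to keep stdout clean.
    "${TESSERACT:-tesseract}" "$input" "$tmpdir/out" -l "$lang" >&2
    status=$?
    if [ "$status" -eq 0 ]; then
        if [ -n "$output" ]; then
            cat "$tmpdir/out.txt" > "$output"
        else
            cat "$tmpdir/out.txt"
        fi
        status=$?
    fi
    rm -rf "$tmpdir"
    exit "$status"
)

# When saved as an executable script, finish with:
#   tesseract_wrapper "$@"
```

With that final line added and the file saved as ./tesseract-wrapper, a call such as ./tesseract-wrapper -i scan.pnm -o scan.txt -l deu (file names are placeholders) would run tesseract on scan.pnm and leave the recognized text in scan.txt.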
Clone the repo on GitHub