LISTSERV - BLIND-HAMS Archives - LISTSERV.ICORS.ORG

While I don't fully understand the technology involved, I understand and
appreciate the implications.  Could this information be forwarded to tech
support staff at the makers of leading screen reading programs, such as
AI Squared?

Thanks.
Bob Martin

EchoLink Node - 55127
Please visit http://www.wan-leatonks.net.
----- Original Message ----- 
From: "Buddy Brannan" <[log in to unmask]>
To: <[log in to unmask]>
Sent: Friday, September 14, 2007 12:21 PM
Subject: Fwd: [Promotion-technology] Fwd: Announcing PDF2OCR


> This should be of some interest to someone, especially in light of
> recent discussions...
>
> Begin forwarded message:
>
>> From: David Andrews <[log in to unmask]>
>> Date: September 14, 2007 11:01:30 AM EDT
>> To: [log in to unmask], [log in to unmask],
>> [log in to unmask], [log in to unmask], [log in to unmask], nabs-
>> [log in to unmask], [log in to unmask]
>> Subject: [Promotion-technology] Fwd: Announcing PDF2OCR
>> Reply-To: "Committee on the Promotion, Evaluation and Advancement
>> of Technology" <[log in to unmask]>
>>
>>
>>>
>>> Now available at
>>> http://www.EmpowermentZone.com/pdf2ocr.zip
>>>
>>> PDF2OCR 1.0
>>> Released September 14, 2007
>>> Public Domain by Jamal Mazrui
>>>
>>> Following up on a tip from Ken Perry about the open source
>>> Tesseract-OCR project at Google, I have tried to use this OCR engine
>>> to build a free program for producing accessible text from an
>>> image-based PDF.  Such files are created by scanning equipment or
>>> software printer drivers that save only the picture of text, without
>>> the actual characters themselves.  This makes them inaccessible to
>>> most PDF viewing utilities, which extract text but do not perform OCR
>>> on images.
>>>
>>> I could not find an existing Windows solution on the web, but did get
>>> useful ideas from Linux-oriented ones.  What I am calling PDF2OCR
>>> combines Tesseract from
>>> http://code.google.com/p/tesseract-ocr
>>> with the GhostScript interpreter from
>>> http://ghostscript.com
>>>
>>> GhostScript creates a .tif file from the .pdf file of interest, and
>>> then Tesseract creates a .txt file from that.  The current
>>> implementation is a batch file, pdf2ocr.bat, with the following
>>> syntax on the command line:
>>> pdf2ocr SourceRootName
>>> where SourceRootName is the name of a PDF file without the .pdf
>>> extension.  This produces a text file with the same name except for a
>>> .txt extension.  The PDF name can include a directory path, but not
>>> embedded spaces.  For example,
>>> pdf2ocr c:\temp\test
>>> produces
>>> c:\temp\test.txt
>>> When complete, the batch file prints tesseract.log to the screen -- a
>>> file that is recreated for each conversion.
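
The two-stage conversion described above (GhostScript turns the PDF into a
TIFF, then Tesseract turns the TIFF into text) might be sketched in Python
roughly as follows.  This is only an illustration, not the contents of
pdf2ocr.bat, which are not shown in this message.  The function names, the
executable name gswin32c, the tiffg4 output device, and the 300 dpi
resolution are all assumptions.  The functions only build the command lines;
they do not run anything.

```python
def build_commands(source_root, gs_exe="gswin32c", tesseract_exe="tesseract"):
    """Return the (ghostscript, tesseract) command lists for one PDF.

    source_root is the PDF name without its .pdf extension, matching the
    pdf2ocr syntax.  Executable names and options are assumptions made
    for illustration, not taken from pdf2ocr.bat.
    """
    if " " in source_root:
        # The announcement says embedded spaces are not supported.
        raise ValueError("source root must not contain spaces: %r" % source_root)

    pdf_file = source_root + ".pdf"
    tif_file = source_root + ".tif"

    # Stage 1: render the PDF pages to a bilevel TIFF image.
    gs_cmd = [
        gs_exe,
        "-dNOPAUSE", "-dBATCH", "-q",   # run unattended and quietly
        "-sDEVICE=tiffg4",              # assumed device: Group 4 fax TIFF
        "-r300",                        # assumed resolution in dpi
        "-sOutputFile=" + tif_file,
        pdf_file,
    ]

    # Stage 2: Tesseract takes an image and an output base name,
    # and appends .txt to that base name itself.
    tess_cmd = [tesseract_exe, tif_file, source_root]
    return gs_cmd, tess_cmd


def output_text_file(source_root):
    """Name of the text file the conversion is expected to produce."""
    return source_root + ".txt"
```

To actually run the two stages, each list could be passed in turn to
subprocess.run(cmd, check=True), provided both programs are on the path.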
>>>
>>> Installation consists of unzipping the pdf2ocr.zip archive to a
>>> target directory, e.g., to one called
>>> C:\PDF2OCR
>>> This directory contains the executable files, as well as three
>>> subdirectories with support files.  The gsdata subdirectory contains
>>> many files I gathered from an installed GhostScript directory tree.
>>> The tessdata subdirectory contains language support for Tesseract (I
>>> have only distributed English files, but other languages are
>>> available from the Google site).  The misc subdirectory contains
>>> sample files, some source code, and this documentation.
>>>
>>> A sample image-based PDF is named mlk.pdf -- the letter Martin Luther
>>> King, Jr. wrote from the Birmingham Jail.  Another sample is
>>> debate.pdf -- the legal agreement between the Bush and Kerry
>>> campaigns concerning Presidential debates.  Two commercial OCR
>>> programs tested, Kurzweil 1000 and PDF Magic, converted one of these
>>> files well, but not the other at all (a different one for each).
>>> Their results, as well as those of PDF2OCR, are provided in text
>>> files.  Please understand that Tesseract is not the best OCR
>>> available, though it is generally considered the best free OCR at
>>> present.
>>>
>>> In order to run the batch file from any directory, you can add the
>>> PDF2OCR directory to the path of a console session with a command
>>> like the following:
>>> set path=c:\pdf2ocr;%path%
>>> You can add the path for every console session via the Advanced tab
>>> page of the System applet in Control Panel.
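
For anyone driving the batch file from a script rather than from a console,
the effect of the "set path" command above can be sketched in Python.  This
is a minimal illustration; the helper name prepend_to_path is hypothetical,
and the case-insensitive comparison assumes Windows path rules.

```python
def prepend_to_path(path_value, directory, sep=";"):
    """Return path_value with directory placed first, without duplicates.

    Mirrors "set path=c:\\pdf2ocr;%path%", except that an existing entry
    for the same directory is dropped instead of being listed twice.
    """
    entries = [entry for entry in path_value.split(sep) if entry]
    # Windows paths are case-insensitive, so compare in lower case.
    kept = [entry for entry in entries if entry.lower() != directory.lower()]
    return sep.join([directory] + kept)
```

Assigning the result to os.environ["PATH"] changes the path only for the
running Python process and the programs it starts, much as "set path" only
affects the current console session.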
>>>
>>> To easily convert multiple PDFs in a directory, I have also created a
>>> utility called dir2ocr.exe.  Simply pass the directory name to
>>> process as a parameter, e.g.,
>>> dir2ocr c:\temp
>>> If no parameter is passed, the current directory is assumed.  Source
>>> code for this PowerBASIC program that calls pdf2ocr.bat is in the
>>> files dir2ocr.bas and fn.inc, located in the misc subdirectory.
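
The directory step described above can be sketched in Python as follows.
This is not a translation of dir2ocr.bas, which is not shown in this
message; the function name find_pdf_roots is hypothetical, and looking only
at files directly inside the directory (no subdirectories) is an assumption.
Names with embedded spaces are set aside because pdf2ocr is said not to
accept them.

```python
import os


def find_pdf_roots(directory="."):
    """Return (roots, skipped) for the PDF files directly inside directory.

    roots holds each PDF path without its .pdf extension, the form that
    pdf2ocr expects.  skipped holds file names that cannot be passed on
    because the resulting path would contain a space.
    """
    roots, skipped = [], []
    for name in sorted(os.listdir(directory)):
        base, ext = os.path.splitext(name)
        if ext.lower() != ".pdf":
            continue  # not a PDF
        if not os.path.isfile(os.path.join(directory, name)):
            continue  # ignore directories that happen to end in .pdf
        full_root = os.path.join(directory, base)
        if " " in full_root:
            skipped.append(name)
        else:
            roots.append(full_root)
    return roots, skipped
```

Each entry in roots could then be handed to pdf2ocr in turn, with the
default argument covering the case where no directory is given.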
>>>
>>> The PDF2OCR download is large, about 14 megabytes as a compressed
>>> archive.  Other techniques of getting text from a PDF should probably
>>> be tried first.  When other tools do not work or are unavailable,
>>> however, I hope this helps to bridge an accessibility gap.  Feel free
>>> to enhance it in the spirit of open source development!
>>>
>>> Jamal Mazrui
>>> [log in to unmask]
>>
>> David Andrews and white cane Harry.
>>
>>
>> _______________________________________________
>> Promotion-technology mailing list
>> [log in to unmask]
>> http://www.nfbnet.org/mailman/listinfo/promotion-technology
>
> --
> Buddy Brannan, KB5ELV - Erie, PA
> Phone: (814) 746-4502 or 888-75-BUDDY
> Check out some of the best music you've never heard, and claim your
> free trial platinum membership:
> http://www.musicforte.com/trial/bbrannan
> Check out the new Watkins: natural plant-based home care and our 2007
> holiday gift line: http://www.tastyshop.net
> And claim your free mall: Unlimited earning potential just for the
> shopping you already do: http://www.powermall.info
>