LISTSERV - BLIND-HAMS Archives

David W Wood <[log in to unmask]> · Sat, 15 Sep 2007 06:07:46 +0100

Buddy:

Very interesting, and ingenious! 

-----Original Message-----
From: For blind ham radio operators
[mailto:[log in to unmask]] On Behalf Of Buddy Brannan
Sent: Friday, September 14, 2007 5:21 PM
To: [log in to unmask]
Subject: Fwd: [Promotion-technology] Fwd: Announcing PDF2OCR

This should be of some interest to someone, especially in
light of recent discussions...

Begin forwarded message:

> From: David Andrews <[log in to unmask]>
> Date: September 14, 2007 11:01:30 AM EDT
> To: [log in to unmask], [log in to unmask],
> [log in to unmask], [log in to unmask],
> [log in to unmask], nabs-[log in to unmask],
> [log in to unmask]
> Subject: [Promotion-technology] Fwd: Announcing PDF2OCR
> Reply-To: "Committee on the Promotion, Evaluation and Advancement of
> Technology" <[log in to unmask]>
>
>
>>
>> Now available at
>> http://www.EmpowermentZone.com/pdf2ocr.zip
>>
>> PDF2OCR 1.0
>> Released September 14, 2007
>> Public Domain by Jamal Mazrui
>>
>> Following up on a tip from Ken Perry about the open source
>> Tesseract-OCR project at Google, I have tried to use this OCR engine
>> to build a free program for producing accessible text from an
>> image-based PDF.  Such files are created by scanning equipment or
>> software printer drivers that save only the picture of text, without
>> the actual characters themselves.  This makes them inaccessible to
>> most PDF viewing utilities, which extract text but do not perform OCR
>> on images.
>>
>> I could not find an existing Windows solution on the web, but did get
>> useful ideas from Linux-oriented ones.  What I am calling PDF2OCR
>> combines Tesseract from
>> http://code.google.com/p/tesseract-ocr
>> with the GhostScript interpreter from
>> http://ghostscript.com
>>
>> GhostScript creates a .tif file from the .pdf file of interest, and
>> then Tesseract creates a .txt file from that.  The current
>> implementation is a batch file, pdf2ocr.bat, with the following syntax
>> on the command line:
>> pdf2ocr SourceRootName
>> where SourceRootName is the name of a PDF file without the .pdf
>> extension.  This produces a text file with the same name except for a
>> .txt extension.  The PDF name can include a directory path, but not
>> embedded spaces.  For example,
>> pdf2ocr c:\temp\test
>> produces
>> c:\temp\test.txt
>> When complete, the batch file prints tesseract.log to the screen -- a
>> file that is recreated for each conversion.
>>
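
For readers who would rather script this than use a batch file, the
two-step conversion described above might be sketched in Python roughly
as follows. This is not the contents of pdf2ocr.bat, which the message
does not reproduce. The Ghostscript device (tiffg4), the 300 dpi
resolution, and the executable names gswin32c and tesseract are all
assumptions, and the function names are hypothetical.

```python
import subprocess


def build_commands(source_root, gs_exe="gswin32c", tesseract_exe="tesseract"):
    """Return the two command lines that turn <root>.pdf into <root>.txt.

    source_root is the PDF path without its .pdf extension, mirroring
    the SourceRootName argument of the batch file.
    """
    # The batch file is described as not accepting embedded spaces;
    # this sketch keeps the same restriction for parity.
    if " " in source_root:
        raise ValueError("source root must not contain spaces")

    pdf = source_root + ".pdf"
    tif = source_root + ".tif"

    # Step 1: Ghostscript renders the PDF pages to a TIFF image.
    # Device and resolution are assumed; some Tesseract builds may
    # need a different (for example uncompressed) TIFF device.
    gs_cmd = [
        gs_exe, "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=tiffg4", "-r300",
        "-sOutputFile=" + tif, pdf,
    ]

    # Step 2: Tesseract reads the TIFF and writes <source_root>.txt
    # (it appends the .txt extension to the output base name itself).
    ocr_cmd = [tesseract_exe, tif, source_root]
    return gs_cmd, ocr_cmd


def pdf2ocr(source_root):
    """Run both steps in order and return the name of the text file."""
    for cmd in build_commands(source_root):
        subprocess.run(cmd, check=True)
    return source_root + ".txt"
```

Calling pdf2ocr(r"c:\temp\test") would then run both tools in turn and
return c:\temp\test.txt, provided both executables can be found on the
path.
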
>> Installation consists of unzipping the pdf2ocr.zip archive to a
>> target directory, e.g., to one called
>> C:\PDF2OCR
>> This directory contains the executable files, as well as three
>> subdirectories with support files.  The gsdata subdirectory contains
>> many files I gathered from an installed GhostScript directory tree.
>> The tessdata subdirectory contains language support for Tesseract (I
>> have only distributed English files, but other languages are available
>> from the Google site).  The misc subdirectory contains sample files,
>> some source code, and this documentation.
>>
>> A sample image-based PDF is named mlk.pdf -- the letter Martin Luther
>> King, Jr. wrote from the Birmingham Jail.  Another sample is
>> debate.pdf -- the legal agreement between the Bush and Kerry campaigns
>> concerning Presidential debates.  Two commercial OCR programs tested,
>> Kurzweil 1000 and PDF Magic, converted one of these files well, but
>> not the other at all (a different one for each).  Their results, as
>> well as that of PDF2OCR, are provided in text files.  Please
>> understand that Tesseract is not the best OCR available, though it is
>> generally considered the best free OCR at present.
>>
>> In order to run the batch file from any directory, you can add the
>> PDF2OCR directory to the path of a console session with a command
>> like the following:
>> set path=c:\pdf2ocr;%path%
>> You can add the path for every console session via the Advanced tab
>> page of the System applet in Control Panel.
>>
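
When the batch file is launched from a script rather than from a
console, the same path change can be applied to the child process only.
A minimal Python sketch, assuming the archive was unzipped to
c:\pdf2ocr as in the example above (the function name is hypothetical):

```python
import os


def env_with_pdf2ocr(install_dir=r"c:\pdf2ocr", base_env=None):
    """Return a copy of the environment with install_dir first on PATH.

    Mirrors "set path=c:\\pdf2ocr;%path%" without changing the
    environment of the calling process.
    """
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("PATH", "")
    # os.pathsep is ";" on Windows, matching the set command above.
    env["PATH"] = install_dir + os.pathsep + existing if existing else install_dir
    return env
```

The returned dictionary can be passed as the env argument of
subprocess.run when starting pdf2ocr.
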
>> To easily convert multiple PDFs in a directory, I have also created a
>> utility called dir2ocr.exe.  Simply pass the directory name to process
>> as a parameter, e.g.,
>> dir2ocr c:\temp
>> If no parameter is passed, the current directory is assumed.  Source
>> code for this PowerBASIC program that calls pdf2ocr.bat is in the
>> files dir2ocr.bas and fn.inc, located in the misc subdirectory.
>>
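
The directory loop that dir2ocr performs could look something like the
Python sketch below. It is only an illustration of the idea, not a port
of dir2ocr.bas (the PowerBASIC source is not shown in this message),
and it assumes Windows with pdf2ocr.bat reachable on the path. The
function names are hypothetical.

```python
import subprocess
from pathlib import Path


def pdf_roots(directory="."):
    """List every PDF in directory as a path without the .pdf extension."""
    return sorted(
        str(p.with_suffix(""))
        for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def dir2ocr(directory="."):
    """Call pdf2ocr once per PDF; the current directory is the default."""
    for root in pdf_roots(directory):
        # Batch files are run through cmd.exe, so this step is
        # Windows-only. pdf2ocr expects the name without .pdf.
        subprocess.run(["cmd", "/c", "pdf2ocr", root], check=True)
```

Calling dir2ocr(r"c:\temp") would correspond to the command
dir2ocr c:\temp shown above.
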
>> The PDF2OCR download is large, about 14 megabytes as a compressed
>> archive.  Other techniques of getting text from a PDF should probably
>> be tried first.  When other tools do not work or are unavailable,
>> however, I hope this helps to bridge an accessibility gap.  Feel free
>> to enhance it in the spirit of open source development!
>>
>> Jamal Mazrui
>> [log in to unmask]
>
> David Andrews and white cane Harry.
>
>
> _______________________________________________
> Promotion-technology mailing list
> [log in to unmask]
> http://www.nfbnet.org/mailman/listinfo/promotion-technology

--
Buddy Brannan, KB5ELV - Erie, PA
Phone: (814) 746-4502 or 888-75-BUDDY
Check out some of the best music you've never heard, and claim your
free trial platinum membership:
http://www.musicforte.com/trial/bbrannan
Check out the new Watkins: natural plant-based home care and our 2007
holiday gift line: http://www.tastyshop.net
And claim your free mall: Unlimited earning potential just for the
shopping you already do: http://www.powermall.info

BLIND-HAMS@LISTSERV.ICORS.ORG