LISTSERV - BLIND-HAMS Archives

Buddy Brannan <[log in to unmask]> · Fri, 14 Sep 2007 12:21:26 -0400

This should be of some interest to someone, especially in light of  
recent discussions...

Begin forwarded message:

> From: David Andrews <[log in to unmask]>
> Date: September 14, 2007 11:01:30 AM EDT
> To: [log in to unmask], [log in to unmask],  
> [log in to unmask], [log in to unmask], [log in to unmask], nabs- 
> [log in to unmask], [log in to unmask]
> Subject: [Promotion-technology] Fwd: Announcing PDF2OCR
> Reply-To: "Committee on the Promotion, Evaluation and Advancement  
> of Technology" <[log in to unmask]>
>
>
>>
>> Now available at
>> http://www.EmpowermentZone.com/pdf2ocr.zip
>>
>> PDF2OCR 1.0
>> Released September 14, 2007
>> Public Domain by Jamal Mazrui
>>
>> Following up on a tip from Ken Perry about the open source
>> Tesseract-OCR project at Google, I have tried to use this OCR engine
>> to build a free program for producing accessible text from an
>> image-based PDF.  Such files are created by scanning equipment or
>> software printer drivers that save only the picture of text, without
>> the actual characters themselves.  This makes them inaccessible to
>> most PDF viewing utilities, which extract text but do not perform OCR
>> on images.
>>
>> I could not find an existing Windows solution on the web, but did get
>> useful ideas from Linux-oriented ones.  What I am calling PDF2OCR
>> combines Tesseract from
>> http://code.google.com/p/tesseract-ocr
>> with the GhostScript interpreter from
>> http://ghostscript.com
>>
>> GhostScript creates a .tif file from the .pdf file of interest, and
>> then Tesseract creates a .txt file from that.  The current
>> implementation is a batch file, pdf2ocr.bat, with the following
>> syntax on the command line:
>> pdf2ocr SourceRootName
>> where SourceRootName is the name of a PDF file without the .pdf
>> extension.  This produces a text file with the same name except for a
>> .txt extension.  The PDF name can include a directory path, but not
>> embedded spaces.  For example,
>> pdf2ocr c:\temp\test
>> produces
>> c:\temp\test.txt
>> When complete, the batch file prints tesseract.log to the screen -- a
>> file that is recreated for each conversion.
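
For anyone curious how the two-step conversion described above fits together, here is a rough Python sketch. It is not the contents of pdf2ocr.bat, which I have not reproduced. The executable names (gswin32c, tesseract), the tiffg4 output device, and the 300 dpi resolution are my own assumptions, and the device may need changing depending on which TIFF variants a given Tesseract build can read. The function names are made up for illustration.

```python
import subprocess


def build_commands(source_root, gs_exe="gswin32c", tess_exe="tesseract", dpi=300):
    """Build the two command lines for one conversion, without running them.

    source_root is the PDF name without its .pdf extension, for example
    c:\\temp\\test. Executable names, device and resolution are assumptions.
    """
    # The original batch file does not accept embedded spaces, so mirror that.
    if " " in source_root:
        raise ValueError("source root name must not contain spaces")

    pdf_path = source_root + ".pdf"
    tif_path = source_root + ".tif"

    # Step 1: GhostScript renders the PDF pages into a TIFF image file.
    gs_cmd = [
        gs_exe,
        "-q",                # suppress startup messages
        "-dNOPAUSE",         # do not pause between pages
        "-dBATCH",           # exit when the input has been processed
        "-sDEVICE=tiffg4",   # assumed output device (black-and-white TIFF)
        "-r%d" % dpi,        # assumed rendering resolution
        "-sOutputFile=" + tif_path,
        pdf_path,
    ]

    # Step 2: Tesseract reads the TIFF and writes <source_root>.txt.
    tess_cmd = [tess_exe, tif_path, source_root]
    return gs_cmd, tess_cmd


def convert(source_root):
    """Run both steps in order; raises CalledProcessError if either fails."""
    for cmd in build_commands(source_root):
        subprocess.run(cmd, check=True)
    return source_root + ".txt"
```

Calling convert(r"c:\temp\test") would then aim to leave c:\temp\test.txt next to the PDF, provided both programs are installed and on the path.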
>>
>> Installation consists of unzipping the pdf2ocr.zip archive to a target
>> directory, e.g., to one called
>> C:\PDF2OCR
>> This directory contains the executable files, as well as three
>> subdirectories with support files.  The gsdata subdirectory contains
>> many files I gathered from an installed GhostScript directory tree.
>> The tessdata subdirectory contains language support for Tesseract (I
>> have only distributed English files, but other languages are
>> available from the Google site).  The misc subdirectory contains
>> sample files, some source code, and this documentation.
>>
>> A sample image-based PDF is named mlk.pdf -- the letter Martin Luther
>> King, Jr. wrote from the Birmingham Jail.  Another sample is
>> debate.pdf -- the legal agreement between the Bush and Kerry
>> campaigns concerning Presidential debates.  Two commercial OCR
>> programs that I tested, Kurzweil 1000 and PDF Magic, each converted
>> one of these files well, but not the other at all (a different one
>> for each).  Their results, as well as those of PDF2OCR, are provided
>> in text files.  Please understand that Tesseract is not the best OCR
>> available, though it is generally considered the best free OCR at
>> present.
>>
>> In order to run the batch file from any directory, you can add the
>> PDF2OCR directory to the path of a console session with a command
>> like the following:
>> set path=c:\pdf2ocr;%path%
>> You can add the path for every console session via the Advanced tab
>> page of the System applet in Control Panel.
>>
>> To easily convert multiple PDFs in a directory, I have also created a
>> utility called dir2ocr.exe.  Simply pass the directory name to
>> process as a parameter, e.g.,
>> dir2ocr c:\temp
>> If no parameter is passed, the current directory is assumed.  Source
>> code for this PowerBASIC program that calls pdf2ocr.bat is in the
>> files dir2ocr.bas and fn.inc, located in the misc subdirectory.
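
The directory behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not a translation of dir2ocr.bas, which I have not reproduced. The function name pdf_root_names is hypothetical, and matching the .pdf extension case-insensitively is my assumption.

```python
import os


def pdf_root_names(directory=None):
    """List the PDFs in a directory as root names (path without .pdf).

    If no directory is given, the current directory is used, matching the
    behavior described for dir2ocr. Each returned name has the form that
    pdf2ocr expects as its SourceRootName parameter.
    """
    if not directory:
        directory = os.getcwd()

    roots = []
    for name in sorted(os.listdir(directory)):
        root, ext = os.path.splitext(name)
        # Case-insensitive match on the extension is an assumption.
        if ext.lower() == ".pdf":
            roots.append(os.path.join(directory, root))
    return roots
```

Each name returned could then be handed to pdf2ocr in turn, one conversion per PDF.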
>>
>> The PDF2OCR download is large, about 14 megabytes as a compressed
>> archive.  Other techniques of getting text from a PDF should probably
>> be tried first.  When other tools do not work or are unavailable,
>> however, I hope this helps to bridge an accessibility gap.  Feel free
>> to enhance it in the spirit of open source development!
>>
>> Jamal Mazrui
>> [log in to unmask]
>
> David Andrews and white cane Harry.
>
>
> _______________________________________________
> Promotion-technology mailing list
> [log in to unmask]
> http://www.nfbnet.org/mailman/listinfo/promotion-technology

--
Buddy Brannan, KB5ELV - Erie, PA
Phone: (814) 746-4502 or 888-75-BUDDY
Check out some of the best music you've never heard, and claim your
free trial platinum membership: http://www.musicforte.com/trial/bbrannan
Check out the new Watkins: natural plant-based home care and our 2007  
holiday gift line: http://www.tastyshop.net
And claim your free mall: Unlimited earning potential just for the  
shopping you already do: http://www.powermall.info	
