While I don't fully understand the technology involved, I understand and
appreciate the implications. Could this information be forwarded to tech
support staff at the leading screen reading programs such as AI Squared?
Thanks.

Bob Martin
EchoLink Node - 55127
Please visit http://www.wan-leatonks.net.

----- Original Message -----
From: "Buddy Brannan" <[log in to unmask]>
To: <[log in to unmask]>
Sent: Friday, September 14, 2007 12:21 PM
Subject: Fwd: [Promotion-technology] Fwd: Announcing PDF2OCR

> This should be of some interest to someone, especially in light of
> recent discussions...
>
> Begin forwarded message:
>
>> From: David Andrews <[log in to unmask]>
>> Date: September 14, 2007 11:01:30 AM EDT
>> To: [log in to unmask], [log in to unmask], [log in to unmask],
>> [log in to unmask], [log in to unmask], nabs-[log in to unmask],
>> [log in to unmask]
>> Subject: [Promotion-technology] Fwd: Announcing PDF2OCR
>> Reply-To: "Committee on the Promotion, Evaluation and Advancement
>> of Technology" <[log in to unmask]>
>>
>>> Now available at
>>> http://www.EmpowermentZone.com/pdf2ocr.zip
>>>
>>> PDF2OCR 1.0
>>> Released September 14, 2007
>>> Public Domain by Jamal Mazrui
>>>
>>> Following up on a tip from Ken Perry about the open source
>>> Tesseract-OCR project at Google, I have tried to use this OCR engine
>>> to build a free program for producing accessible text from an
>>> image-based PDF. Such files are created by scanning equipment or
>>> software printer drivers that save only the picture of text, without
>>> the actual characters themselves. This makes them inaccessible to
>>> most PDF viewing utilities, which extract text but do not perform OCR
>>> on images.
>>>
>>> I could not find an existing Windows solution on the web, but did get
>>> useful ideas from Linux-oriented ones.
>>> What I am calling PDF2OCR combines Tesseract from
>>> http://code.google.com/p/tesseract-ocr
>>> with the GhostScript interpreter from
>>> http://ghostscript.com
>>>
>>> GhostScript creates a .tif file from the .pdf file of interest, and
>>> then Tesseract creates a .txt file from that. The current
>>> implementation is a batch file, pdf2ocr.bat, with the following
>>> syntax on the command line:
>>> pdf2ocr SourceRootName
>>> where SourceRootName is the name of a PDF file without the .pdf
>>> extension. This produces a text file with the same name except for
>>> a .txt extension. The PDF name can include a directory path, but not
>>> embedded spaces. For example,
>>> pdf2ocr c:\temp\test
>>> produces
>>> c:\temp\test.txt
>>> When complete, the batch file prints tesseract.log to the screen --
>>> a file that is recreated for each conversion.
>>>
>>> Installation consists of unzipping the pdf2ocr.zip archive to a
>>> target directory, e.g., to one called
>>> C:\PDF2OCR
>>> This directory contains the executable files, as well as three
>>> subdirectories with support files. The gsdata subdirectory contains
>>> many files I gathered from an installed GhostScript directory tree.
>>> The tessdata subdirectory contains language support for Tesseract (I
>>> have only distributed English files, but other languages are
>>> available from the Google site). The misc subdirectory contains
>>> sample files, some source code, and this documentation.
>>>
>>> A sample image-based PDF is named mlk.pdf -- the letter Martin Luther
>>> King, Jr. wrote from the Birmingham Jail. Another sample is
>>> debate.pdf -- the legal agreement between the Bush and Kerry
>>> campaigns concerning Presidential debates. Two commercial OCR
>>> programs tested, Kurzweil 1000 and PDF Magic, converted one of these
>>> files well, but not the other at all (a different one for each).
>>> Their results, as well as those of PDF2OCR, are provided in text
>>> files. Please understand that Tesseract is not the best OCR
>>> available, though it is generally considered the best free OCR at
>>> present.
>>>
>>> In order to run the batch file from any directory, you can add the
>>> PDF2OCR directory to the path of a console session with a command
>>> like the following:
>>> set path=c:\pdf2ocr;%path%
>>> You can add the path for every console session via the Advanced tab
>>> page of the System applet in Control Panel.
>>>
>>> To easily convert multiple PDFs in a directory, I have also created a
>>> utility called dir2ocr.exe. Simply pass the directory name to process
>>> as a parameter, e.g.,
>>> dir2ocr c:\temp
>>> If no parameter is passed, the current directory is assumed. Source
>>> code for this PowerBASIC program that calls pdf2ocr.bat is in the
>>> files dir2ocr.bas and fn.inc, located in the misc subdirectory.
>>>
>>> The PDF2OCR download is large, about 14 megabytes as a compressed
>>> archive. Other techniques of getting text from a PDF should probably
>>> be tried first. When other tools do not work or are unavailable,
>>> however, I hope this helps to bridge an accessibility gap. Feel free
>>> to enhance it in the spirit of open source development!
>>>
>>> Jamal Mazrui
>>> [log in to unmask]
>>
>> David Andrews and white cane Harry.
>>
>> _______________________________________________
>> Promotion-technology mailing list
>> [log in to unmask]
>> http://www.nfbnet.org/mailman/listinfo/promotion-technology
>
> --
> Buddy Brannan, KB5ELV - Erie, PA
> Phone: (814) 746-4502 or 888-75-BUDDY
> Check out some of the best music you've never heard, and claim your
> free trial platinum membership:
> http://www.musicforte.com/trial/bbrannan
> Check out the new Watkins: natural plant-based home care and our 2007
> holiday gift line: http://www.tastyshop.net
> And claim your free mall: Unlimited earning potential just for the
> shopping you already do: http://www.powermall.info
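
For anyone curious how the two-step conversion in the announcement might
look when spelled out, here is a rough sketch in Python. It is not the
contents of pdf2ocr.bat or dir2ocr.exe, which are not shown in the
message. The function names are made up for illustration, and the
GhostScript executable name (gswin32c), the tiffg4 output device, and the
300 dpi resolution are assumptions. The only part taken from the
announcement is the overall flow: GhostScript renders the .pdf to a .tif,
Tesseract turns the .tif into a .txt, and a directory helper repeats that
for every PDF it finds.

```python
"""Sketch of a PDF -> TIFF -> text pipeline (not the actual pdf2ocr.bat)."""
import subprocess
from pathlib import Path


def ghostscript_command(root: str, gs_exe: str = "gswin32c") -> list[str]:
    """Build a GhostScript call that renders root.pdf to root.tif."""
    return [
        gs_exe,                      # assumed name of the console GhostScript
        "-dNOPAUSE", "-dBATCH",      # run without prompts, exit when done
        "-dSAFER",
        "-sDEVICE=tiffg4",           # assumed device: black-and-white TIFF
        "-r300",                     # assumed resolution in dpi
        f"-sOutputFile={root}.tif",
        f"{root}.pdf",
    ]


def tesseract_command(root: str, tesseract_exe: str = "tesseract") -> list[str]:
    """Build a Tesseract call; Tesseract adds .txt to the output base name."""
    return [tesseract_exe, f"{root}.tif", root]


def pdf_roots(directory: str = ".") -> list[str]:
    """Return the root name (path without .pdf) of each PDF in a directory."""
    return sorted(str(p.with_suffix("")) for p in Path(directory).glob("*.pdf"))


def convert(root: str) -> None:
    """Run both steps for one file; raises if either program fails."""
    subprocess.run(ghostscript_command(root), check=True)
    subprocess.run(tesseract_command(root), check=True)


def convert_directory(directory: str = ".") -> None:
    """Convert every PDF in a directory, defaulting to the current one."""
    for root in pdf_roots(directory):
        convert(root)
```

Calling convert(r"c:\temp\test") would be the rough equivalent of
"pdf2ocr c:\temp\test", and convert_directory(r"c:\temp") of
"dir2ocr c:\temp", provided both programs are on the path. Because the
arguments are passed as a list instead of a single command line, this
sketch should not share the batch file's restriction on embedded spaces.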