Axel and others,
One of the troubles with OCR is that it doesn't mark up the
logical structure of a document with headings, emphasis (in bold,
underline, italics), tables and lists. Ideally the e-reserve should
have the structure marked up in a standard and accessible way.
HTML is suitable for this. Therefore, I argue that "cleaning up"
should involve putting in these markings, using HTML tags. That further
increases the burden on librarians, and lends support to the argument
that getting the original document source in electronic form is
preferable.
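
To give a rough idea of what "putting in these markings" means, here is
a minimal Python sketch that wraps plain OCR output in basic HTML
structure. The function name ocr_text_to_html and the heuristics (an
all-capitals line is a heading, lines starting with "- " or "* " are
list items) are my own assumptions for illustration, not anything a
real OCR package does. Real documents would still need a person to check
the result, which is exactly the burden I mean.

```python
import html


def ocr_text_to_html(text):
    """Wrap plain OCR output in basic structural HTML (heuristic sketch).

    Assumed conventions, for illustration only:
      - blocks are separated by blank lines
      - a single all-capitals line is treated as a heading
      - a block whose lines all start with "- " or "* " is a list
      - anything else is a paragraph, with broken lines rejoined
    """
    out = []
    for block in text.strip().split("\n\n"):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue
        if all(ln.startswith(("- ", "* ")) for ln in lines):
            out.append("<ul>")
            for ln in lines:
                out.append("  <li>" + html.escape(ln[2:].strip()) + "</li>")
            out.append("</ul>")
        elif len(lines) == 1 and lines[0].isupper():
            out.append("<h2>" + html.escape(lines[0]) + "</h2>")
        else:
            # Rejoin lines that OCR broke at the page margin.
            out.append("<p>" + html.escape(" ".join(lines)) + "</p>")
    return "\n".join(out)


sample = (
    "READING LIST\n\n"
    "- Chapter 1\n- Chapter 2\n\n"
    "Scanned text with\nbroken lines & symbols."
)
result = ocr_text_to_html(sample)
print(result)
```

Note that this recovers only headings, lists and paragraphs. Emphasis
and tables cannot be guessed from plain text, so they would still have
to be marked by hand.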
However, if the original is in PDF, there is still a problem of
accessibility. Librarians should do PDF to HTML conversion for
such documents, I would argue.
BTW, the ultimate goal would be everything in XML, so that semantics
is preserved as well as structure. (See Ron Stewart's email.)
Cheers,
John
--
In message <[log in to unmask]>
Axel via [log in to unmask] writes:
>Jeff and others,
>
>Thanks for your thoughts, Jeff. The problem with cleaning up OCRed
>documents is the additional time factor. Folks who do it told me that it
>takes about 10 times as long to get a document onto e-reserve if they
>not only scan and OCR but also clean it up. This puts an enormous strain
>on the libraries' resources.
>
>Here's my thinking at this point: If OCR technology by itself, without
>additional editing and proofreading, does not provide an acceptable
>product (this would be your position), and since additional cleaning and
>proofreading of all materials placed on e-reserve puts an enormous
>strain on a library's resources (this is my sentiment), we should look
>for a different solution: We need to think about ways of establishing
>e-reserves within the context of a larger system that allows us to get
>articles in their original text-based electronic format and to place
>them on reserve (without any optical scanning involved) in either their
>original format or some converted text-based format. This, of course,
>touches on legal issues, involving among others, copyright law and
>interpretations thereof.
>
>Greetings,
>
>Axel
>
>-----Original Message-----
>From: Senge, Jeff [mailto:[log in to unmask]]
>Sent: Tuesday, April 16, 2002 10:32 AM
>To: [log in to unmask]
>Subject: Re: electronic reserve and image-based pdf files
>
>Axel,
>
>My personal opinion is that e-reserve materials are a tremendous step
>forward in terms of accessibility but they need to be in accessible
>formats. This would mean scanning, running OCR to convert them to text,
>and then editing and proofreading them for format and accuracy. This
>process should produce very clean and useable accessible e-reserve
>documents.
[snip]
--
Access the word, access the world! -- Try our WordAloud software!!
John Nissen, Cloudworld Ltd., Chiswick, London
Tel: +44 (0) 845 458 3944 (local rate in the UK)
Fax: +44 (0) 20 8742 8715
Web: http://www.cloudworld.co.uk