Sleuth OCR software for RISC OS
The version of Sleuth described below and the software supplied on this CD is Sleuth 1. The latest version, Sleuth 3, has many additional features and is considerably more powerful. A 'demo' version of Sleuth 3 is supplied in the SOFTWARE directory complete with the full manual in HTML so you can see what extra facilities it has. The normal price for Sleuth 3 is £49, but as a very special offer to RISC World subscribers we are offering the complete Sleuth 3 package for just £40. This price includes UK post, but please add £1.50 for carriage outside the UK.
Sleuth is © APDL/ProAction 2000. All rights reserved
No part of this product may be reproduced in whole or part by any means without written permission of the publisher. Unauthorised hiring, renting, lending, public performance or broadcasting of this product or its parts is prohibited.
Sleuth is supplied under licence for use on one computer at a time. The user is forbidden to make any copies of the source media where the program is supplied on CD. Please contact APDL for details of site licences.
While every care is taken the publisher cannot be held responsible for any errors in this product, or for the loss of any data or consequential effects from the use of this package.
If you discover a problem with this product please contact APDL at the address below explaining the problem as briefly as possible but including all relevant information, e.g. system configuration, hardware add-ons, version of operating system and the version and Serial Number of your copy. Always remember to quote the version number of the application.
APDL/ProAction edition December 2000
39 Knighton Park Road
London SE26 5RN
Phone: 020 8778 2659
Fax: 020 8488 0487
Sleuth was written by Graham Jones.
All trademarks are acknowledged.
OCR stands for Optical Character Recognition. This is a process whereby text is extracted from a scanned image and converted into a plain text file. OCR has been available for some time on other machine platforms and accuracy has been steadily improving. Sleuth can be used with the majority of scanners available for RISC OS computers.
Installing the program
Sleuth is supplied in an ArcFS archive in the SOFTWARE directory of the RISC World CD. Drag the entire contents of the archive, including the Sleuth program and the examples, to a suitable directory on your hard drive.
Once Sleuth is installed on your hard drive it is ready to use.
To load Sleuth, double-click on the application's icon and it will be installed on the icon bar. If you click Select over the icon bar icon an empty window will be opened. The application now requires a sprite to process.
Loading a sprite
To load a sprite drag the sprite's icon onto the icon bar icon or into the open window. Only one sprite can be loaded at a time. If a further sprite is loaded it will replace the current one. A new sprite cannot be loaded if the package is already converting a sprite. If a sprite file contains more than one sprite only the first will be displayed. Most scanners, especially those using a TWAIN driver (see below), can produce monochrome or greyscale sprites suitable for use with Sleuth. Sleuth can automatically load sprites that have been compressed using !Squash.
Sleuth will accept greyscale sprites and convert them into monochrome sprites. Using the Input and processing preferences dialogue box, the user can determine the levels of grey converted to white in this process. A Pre-sharpening facility is provided that will generally improve the quality of degraded text during conversion.
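The greyscale-to-monochrome step described above can be pictured with a minimal Python sketch. This is not Sleuth's own code: the function name to_monochrome and the white_threshold parameter are hypothetical, grey levels are assumed to run from 0 (black) to 255 (white), and the Pre-sharpening facility is not modelled.

```python
def to_monochrome(grey_rows, white_threshold=128):
    """Convert rows of grey levels (0 = black, 255 = white) to 1-bit pixels.

    Pixels at or above white_threshold become white (0); all others
    become black (1). Both the 0/1 convention and the threshold
    parameter are assumptions made for this sketch.
    """
    return [[0 if level >= white_threshold else 1 for level in row]
            for row in grey_rows]


# A tiny 3 x 4 greyscale "scan" containing a dark vertical stroke.
grey = [
    [250, 240, 30, 245],
    [248, 120, 20, 130],
    [251, 60, 25, 200],
]

mono = to_monochrome(grey, white_threshold=128)
for row in mono:
    # Show black pixels as '#' and white pixels as '.'.
    print("".join("#" if pixel else "." for pixel in row))
```

Raising the threshold turns more of the mid-grey pixels black, which thickens faint text; lowering it has the opposite effect.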
Scanning using TWAIN
Sleuth supports TWAIN to allow direct scanning into the package. To use this facility you must have a copy of TWAIN, an appropriate scanner and Scanner Driver. TWAIN and scanner drivers can be purchased from APDL or from:
TWAIN supports the Select and Acquire options on the Scan submenu. Choosing Select opens a dialogue box that allows you to choose the scanner driver source that you require. Choosing Acquire lets you set up the scanner before scanning. Sleuth will only accept monochrome images, so make sure the BW option is selected. Full information about these options is provided with TWAIN.
The Save option allows you to save the scanned sprite.
Please note that !TWAIN must have been seen by the Filer (that is, the directory containing it must have been opened) before Sleuth can use these options.
Before beginning the OCR process it is important that Sleuth is configured correctly. To open the Options dialogue box, press Menu over the sprite to be converted and move the pointer to the right across the "Options" entry on the menu.
The Reject character is the character used by Sleuth when, in its estimation, there is no equivalent character in the set of characters that it knows, or it is unsure what the correct character is. The Reject character should be set to an infrequently used character.
The Remove end-of-line hyphens option relates to how the output file is saved. Any hyphens at the ends of lines will be removed if this option is selected.
The End-of-line string can be set to any character or combination of characters. A carriage return is represented by the \r escape sequence, a line feed by \n. These can be used separately or in combination, e.g. \n\r. You could enter a space if you want to reformat the text in another package. These characters will be inserted at the end of each line of text converted by Sleuth.
End-of-paragraph string uses the same escape sequences as the End-of-line option. Sleuth will insert these characters at the end of each paragraph.
As these settings are read just before the converted text is saved, it is possible to change them after the image has been converted to text.
Clicking on OK will set the options for the current session.
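As a rough illustration of how the Remove end-of-line hyphens, End-of-line string and End-of-paragraph string options described above might combine when the text is saved, here is a minimal Python sketch. It is not Sleuth's own code. The names decode_escapes and format_output are hypothetical, and the sketch assumes that the last line of a paragraph receives the end-of-paragraph string in place of the end-of-line string, and that a removed hyphen is simply dropped without rejoining the two halves of the word.

```python
def decode_escapes(setting):
    """Turn the two-character sequences \\r and \\n into control characters."""
    return setting.replace("\\r", "\r").replace("\\n", "\n")


def format_output(paragraphs, eol="\\n", eop="\\n\\n", remove_hyphens=True):
    """Join recognised lines into a single string.

    paragraphs is a list of paragraphs, each a list of line strings.
    eol and eop are typed as in the options box, e.g. a backslash
    followed by 'n' for a line feed.
    """
    eol_text = decode_escapes(eol)
    eop_text = decode_escapes(eop)
    pieces = []
    for lines in paragraphs:
        for index, line in enumerate(lines):
            if remove_hyphens and line.endswith("-"):
                line = line[:-1]  # drop the trailing hyphen only
            pieces.append(line)
            # Assumption: the paragraph string replaces the line string
            # on the final line of each paragraph.
            is_last = index == len(lines) - 1
            pieces.append(eop_text if is_last else eol_text)
    return "".join(pieces)


page = [["Optical character recog-", "nition"], ["Second paragraph"]]
text = format_output(page, eol="\\n", eop="\\n\\n")
print(repr(text))
```

Setting eol to a single space instead gives one long line per paragraph, which is the form most word processors expect when they reflow text themselves.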
Icon bar menu
Click Menu over the icon on the icon bar and the following menu will be displayed:
Manipulating the sprite
Before starting to convert a sprite you may need to alter it. The sprite must have black text on a white background and the text has to be upright. Sleuth allows the sprite to be zoomed, rotated or inverted. To rotate or invert the sprite choose the Edit option from the main menu and this submenu will be displayed:
Rotate offers a submenu from which you can choose to rotate the sprite by 90, 180 or 270 degrees. Choosing 90 degrees will rotate the sprite in an anti-clockwise direction; 270 degrees will rotate it in a clockwise direction.
Invert will swap the black and white elements of the sprite.
To zoom the sprite choose Zoom from the main menu and the zoom dialogue box will open.
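The effect of the Rotate and Invert options described above can be sketched in Python on a small bitmap held as a list of rows. This only illustrates the geometry and is not how Sleuth itself is implemented; the function names are hypothetical, and 1 is assumed to mean a black pixel and 0 a white one.

```python
def rotate_90_anticlockwise(rows):
    """Rotate a rectangular bitmap (list of rows) 90 degrees anti-clockwise."""
    # Transposing turns columns into rows; reversing the row order then
    # puts the original right-hand column at the top.
    return [list(column) for column in zip(*rows)][::-1]


def rotate(rows, degrees):
    """Rotate anti-clockwise by 90, 180 or 270 degrees, as on the Rotate submenu."""
    if degrees not in (90, 180, 270):
        raise ValueError("degrees must be 90, 180 or 270")
    for _ in range(degrees // 90):
        rows = rotate_90_anticlockwise(rows)
    return rows


def invert(rows):
    """Swap black (1) and white (0) pixels."""
    return [[1 - pixel for pixel in row] for row in rows]


sprite = [
    [1, 1, 1],
    [1, 0, 0],
]

for label, result in (("90", rotate(sprite, 90)),
                      ("270", rotate(sprite, 270)),
                      ("inverted", invert(sprite))):
    print(label)
    for row in result:
        print("".join("#" if pixel else "." for pixel in row))
```

Rotating by 270 degrees anti-clockwise gives the same result as a single 90 degree clockwise turn, which matches the behaviour of the submenu.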
The OCR process
Sleuth 1 cannot deal with complex page layouts automatically. A complex page layout might consist of text in columns and/or include graphics. Sleuth 1 can only OCR a single block of text. Both Sleuth 2 and Sleuth 3 can cope with complex page layouts without assistance.
To OCR a simple block of text click Menu over the sprite window and choose OCR. A new window will open. This is the text window, where the converted text will appear. Once the OCR process has completed you can save the text by pressing Menu over this new window and clicking on Save.
It may be that you only want to OCR part of the sprite that you have loaded into Sleuth. In this case you can select the part you want with a zone. A zone is a user-defined area from which Sleuth will extract text. To create a zone drag with Select over the sprite window. As soon as you start dragging, a zone rectangle will be drawn with eight handles and the pointer will be positioned on the bottom right-hand corner handle as shown below:
As you continue to drag, the zone will be resized. The window will scroll automatically if necessary. Release Select to finish drawing the zone. The zone can be resized by dragging with Select over one of the handles and moved by dragging with Select inside the zone.
Sleuth 1 supports only one zone; Sleuth 2 and Sleuth 3 both support multiple zones.
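In effect a zone restricts conversion to a rectangle of pixels. The Python sketch below shows that idea only and says nothing about how Sleuth stores zones internally: the function name crop_to_zone and its pixel-coordinate parameters are hypothetical, and the corners are treated as inclusive.

```python
def crop_to_zone(rows, x0, y0, x1, y1):
    """Return the pixels inside the rectangle with corners (x0, y0) and (x1, y1).

    The corners may be given in any order, as they would be if the zone
    were dragged from bottom right to top left, and are clamped to the
    edges of the bitmap.
    """
    if not rows:
        return []
    height, width = len(rows), len(rows[0])
    left, right = sorted((x0, x1))
    top, bottom = sorted((y0, y1))
    left, top = max(left, 0), max(top, 0)
    right, bottom = min(right, width - 1), min(bottom, height - 1)
    return [row[left:right + 1] for row in rows[top:bottom + 1]]


page = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
]

print(crop_to_zone(page, 1, 0, 2, 1))
```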
As text appears in the text window you can click in the window with the mouse and edit the output manually to remove any mistakes. You can move the caret around with the cursor keys and remove incorrect letters with the Delete key, just as you would in any word processor.
Getting the best results from Sleuth
There are many factors which affect the accuracy of the output from Sleuth. We recommend that the following suggestions are followed where possible to get the best results.
Scan at a resolution of at least 300 or 400 dpi. Scan the original document where possible, not a photocopy. Scan a portion first and convert it to ensure that the scan is not too light or too dark. If you are using a hand-held scanner try to make the scan as straight as possible.
Here are some suggestions to help you get the best results:
Important points to remember
These are possible problems that may occur when using Sleuth with explanations and possible solutions.
I'm getting output but it is all punctuation or apparently random characters.
The output is good but the occasional word comes out badly.
The output is good at the beginning of the page, but gradually deteriorates.
The output is good, but the order of the text is wrong.
Although Sleuth has only been trained on the standard fonts, it will recognise other similar fonts without further training. It will recognise the following characters:
Sleuth will convert characters between 9 and 24 point in size. This equates to approximately 1/8th to 1/3rd of an inch depending on the font. The actual limits are between 10 and 80 pixels measured from the top of an upper-case letter, like an 'A', to the bottom of a lower-case letter with a descender, like a 'g'. To convert 24 point text use a 200 dpi scan.
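The relationship between point size, scan resolution and the pixel limits quoted above can be checked with a little arithmetic. The Python sketch below assumes the standard 72 points to the inch and, as a simplification, treats the height measured from the top of an 'A' to the bottom of a 'g' as equal to the nominal point size; in practice this varies with the font. The function names are hypothetical.

```python
POINTS_PER_INCH = 72
MIN_PIXELS, MAX_PIXELS = 10, 80  # limits quoted in the text above


def height_in_pixels(point_size, dpi):
    """Approximate character height in pixels for a given scan resolution."""
    return point_size * dpi / POINTS_PER_INCH


def within_limits(point_size, dpi):
    """True if the estimated height falls inside the quoted pixel limits."""
    return MIN_PIXELS <= height_in_pixels(point_size, dpi) <= MAX_PIXELS


def max_point_size(dpi):
    """Largest point size that stays within the upper pixel limit."""
    return MAX_PIXELS * POINTS_PER_INCH / dpi


for dpi in (200, 300, 400):
    for points in (9, 12, 24):
        pixels = height_in_pixels(points, dpi)
        verdict = "ok" if within_limits(points, dpi) else "outside limits"
        print(f"{points:>2} pt at {dpi} dpi: {pixels:6.1f} pixels  {verdict}")
```

On these assumptions 24 point text is 100 pixels high at 300 dpi, which is over the 80 pixel limit, but about 67 pixels high at 200 dpi, which is why the lower resolution is suggested for large text.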
Sleuth will convert text at speeds of between 80 and 250 words per minute depending on the machine used, the size and quality of the text and the resolution of the sprite.
Sleuth will automatically cope with slightly skewed scans (generally less than 2 degrees of skew, depending on line spacing) and lines of text that are slightly wavy.