OCR help needed - please?

oggbashan

Dying Truth seeker
Joined
Jul 3, 2002
Posts
56,017
I am researching my wife's ancestors and have downloaded death rolls for her maiden name for the period 1891 to 1901 - about 50 sheets.

They are .DJV files, to be viewed with a DjVu viewer, which I have also downloaded so that I can read them on screen.

Printing has been a problem. I can get them to fit on a page of A4 but then I need a magnifying glass. I tried exporting the .DJV file as .bmp and then printing. My printer decided that 32 pages per file was about right!

What I really want to do is to convert them to text with OCR and edit them in Word 97. My first attempts converted the text to white on black, and the OCR program wouldn't recognise it.

In desperation I am considering printing all 50 files as A4 then scanning them before using OCR.

Is there an easier way?

Please?

Og
 
No ideas Og - just didn't want you to think you were being ignored. OCR and scanning seem to have entirely different execution logic to everything else in computing.
 
neonlyte said:
No ideas Og - just didn't want you to think you were being ignored. OCR and scanning seem to have entirely different execution logic to everything else in computing.

Thank you.

I have noticed.

Logic appears to be absent.

Og
 
Working on the KISS principle... have you tried a simple copy and paste into Word or Notepad?
 
TxRad said:
Working on the KISS principle... have you tried a simple copy and paste into Word or Notepad?

Yes. All I get is a page of obscure symbols. There isn't even a code to them. 90% of the symbols are the same one.

Thank you for the suggestion.

Og
 
Ok then.... Try taking a screen shot and then try the old copy and paste....

The viewer they are using reads from code not text.....
 
Colleen Thomas said:
Following tx's advice, can you export it as a generic .txt file?

No. If I could, that would solve the problem. It will only export as a .bmp file with degraded definition.

Og
 
oggbashan said:
No. If I could, that would solve the problem. It will only export as a .bmp file with degraded definition.

Og


I'm so untechnical, but if it exports as a bmp, that's a picture type file, isn't it? If so, maybe one of the shareware viewers, like IrfanView, could let you just print it if you opened it with them?
 
Colleen Thomas said:
I'm so untechnical, but if it exports as a bmp, that's a picture type file, isn't it? If so, maybe one of the shareware viewers, like IrfanView, could let you just print it if you opened it with them?

Thanks.

I can print the files with DjVu but what I want to do is extract some of the information so that I can use a WP on it. OCR should be able to do that but every attempt so far has failed.

I have spent 5 hours on it so far.

I thought I'd attach the file I'd converted to .bmp format to a post in this thread. I tried. The system just sat stationary for 12 minutes and didn't attach the file.

Unless anyone has any bright ideas, I'm back with the print and rescan option using OCR on the scan. For 50 sheets that seems a mad way to do it.

Og
 
oggbashan said:
Thanks.

I can print the files with DjVu but what I want to do is extract some of the information so that I can use a WP on it. OCR should be able to do that but every attempt so far has failed.

I have spent 5 hours on it so far.

I thought I'd attach the file I'd converted to .bmp format to a post in this thread. I tried. The system just sat stationary for 12 minutes and didn't attach the file.

Unless anyone has any bright ideas, I'm back with the print and rescan option using OCR on the scan. For 50 sheets that seems a mad way to do it.

Og


I'm sorry Oggs, you've always been such a great help to me, but I'm out of my depth.

Hopefully someone will log in who knows computers. I'd suggest raphy, but he isn't around anymore :(
 
Colleen Thomas said:
I'm sorry Oggs, you've always been such a great help to me, but I'm out of my depth.

So am I.

I appreciate the attempt.

If I get to waving an axe at the screen I'll take a copy of the files to my friendly neighbourhood computer nerd and let him play with them.

He's only half an hour away by car but I'm not going that way for a week.

Og
 
Og, have you tried printing to a file? You can set up a printer which is a text file on your hard drive.
 
oggbashan said:
I tried exporting the .DJV file as .bmp and then printing. My printer decided that 32 pages per file was about right!

What I really want to do is to convert them to text with OCR and edit them in Word 97. My first attempts converted the text to white on black, and the OCR program wouldn't recognise it.

Step one is to reverse the image to black text on a white background -- IrfanView can do that once you export to BMP format.
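
To show what "reversing" the image means, here is a minimal sketch in Python. It assumes a simplified picture held in memory as rows of 8-bit grayscale values (0 = black, 255 = white); the function name invert_grayscale is made up for illustration, and a real BMP would first have to be loaded with an image viewer or library.

```python
def invert_grayscale(rows):
    """Return a new image with every 8-bit grayscale value flipped.

    `rows` is a list of rows, each row a list of ints in 0..255.
    This in-memory layout is an assumption made for the sketch.
    """
    return [[255 - value for value in row] for row in rows]


# White text (255) on a black background (0), like the failed OCR input.
white_on_black = [
    [0, 255, 0],
    [0, 255, 255],
]

# After inversion the "text" pixels are black on a white background.
black_on_white = invert_grayscale(white_on_black)
print(black_on_white)
```

Running the inversion twice gives back the original picture, so nothing is lost by trying it.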

Second, you need to massage the BMP files down to something that your OCR program can handle. The OCR program I use is OmniPage Limited, and I've found that it does a much better job if I highlight sections of the text to scan and eliminate as much whitespace as possible.

Since you're only interested in certain sections, you should be able to cut those sections you're interested in from the BMP version and paste them to a new image to run through the OCR program.
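
The cut-and-paste step amounts to copying a rectangle of pixels into a new, smaller image. A rough sketch, again assuming the picture is just a list of pixel rows; the function name crop and the coordinates are hypothetical:

```python
def crop(rows, left, top, right, bottom):
    """Copy the rectangle of pixels into a new image.

    Columns run from `left` up to (not including) `right`,
    rows from `top` up to (not including) `bottom`.
    """
    return [row[left:right] for row in rows[top:bottom]]


# A tiny 4x4 "page" where the interesting entry sits in the middle.
page = [
    [255, 255, 255, 255],
    [255,   0,   0, 255],
    [255,   0,   0, 255],
    [255, 255, 255, 255],
]

# Keep only the 2x2 block in the centre and drop the white border.
section = crop(page, left=1, top=1, right=3, bottom=3)
print(section)
```

Trimming the surrounding whitespace this way leaves the OCR program with less blank area to wade through.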

You can also try resizing the image or enhancing its sharpness and contrast to give the OCR program more to work with. OmniPage works best with something scanned at 300 dpi or higher. That converts to about 2500 pixels for the full width of an A4 page or about 1950 pixels for a standard 6.5" line of text. If you mean that your printer is taking 32 pages to print a BMP of each page, you should have more than sufficient resolution for OCR processing, but you need to LOOK at the BMP file with an image editor and/or reduce it down to manageable dimensions. 4 sheets by 8 sheets to print one image generally means you have a HUGE BMP file in terms of pixel dimensions -- or your printer defaults to only 72 DPI.
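
The resolution arithmetic can be checked with a few lines of Python. This is only a sketch: the helper names are made up, A4 is taken as 210 mm x 297 mm, and printer margins are ignored, so real sheet counts may come out higher.

```python
import math

MM_PER_INCH = 25.4
A4_WIDTH_IN = 210 / MM_PER_INCH   # about 8.27 inches
A4_HEIGHT_IN = 297 / MM_PER_INCH  # about 11.69 inches


def pixels_for(inches, dpi):
    """Pixels needed to cover a length in inches at a given resolution."""
    return round(inches * dpi)


def sheets_needed(width_px, height_px, dpi):
    """A4 sheets (across, down) needed to print an image at `dpi`.

    Margins are ignored, which is a simplifying assumption.
    """
    across = math.ceil(width_px / (A4_WIDTH_IN * dpi))
    down = math.ceil(height_px / (A4_HEIGHT_IN * dpi))
    return across, down


print(pixels_for(A4_WIDTH_IN, 300))    # full A4 width at 300 dpi
print(pixels_for(6.5, 300))            # a 6.5" line of text at 300 dpi
print(sheets_needed(2480, 3508, 72))   # a 300 dpi page image printed at 72 dpi
```

An image with enough pixels for one page at 300 dpi spreads across many sheets when the printer treats it as 72 dpi, which would fit the multi-page printouts described above.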


You should be able to cut a portion of the BMP file, paste it to a new image, and attach it here if you save it in JPEG format.
 
Thanks for the advice.

I'm at a different location and the files are on my other hard drive.

I have found that my real problem is that the DjVu files are relatively small and difficult to manipulate given the viewer supplied. I can see them but not capture them.

When exported to .bmp format the files are large but have lost so much definition that they are unreadable.

I'll try again tomorrow morning.

Og
 
Since it is a .bmp, why not try Paint?
You can adjust the size through Attributes. :)
 
kendo1 said:
Since it is a .bmp, why not try Paint?
You can adjust the size through Attributes. :)

Thanks.

I have tried Paint and other programs on the .bmp files.

Too much detail was lost on the conversion from DJV to .bmp.

Og
 
Og, DjVu files are a pain in the rear end. They have superb compression rates but don't play well with any of the standard ways of handling pictures.

I don't have any .DJV files to try it on, but here is a site that claims to be able to convert to and from .DJV, including OCR. Might be worth a try.

http://any2djvu.djvuzone.org/

Best of luck
 