OSI-tems Vol I

bxdanny
Posts: 336
Joined: Thu Apr 16, 2015 2:27 pm
Location: Bronx, NY USA

Post by bxdanny »

OSI-tems was the (mostly) monthly newsletter of the group that became known as Ohio Scientific Users of New York, or OSUNY. The first issue was November 1979, and it continued until at least June 1983. I have most (though not quite all) of the issues, and will be scanning them and posting them here over the next few months.

The first two issues were just five pages and four pages, respectively, but after that they got much larger. Those first two issues are attached below. The fourth page of issue #2 was 14" long, and I can only scan 11" pages, so that page was scanned in two overlapping parts, making the PDF of the issue five pages long.

Note: My scanning software, Acrobat X Standard, always tries to do OCR on any scanned text, but it often doesn't do a very good job. The "recognized" text embedded in the file is simply not usable for program listings.
Attachments
OSI-tems Vol I.zip
(347.18 KiB) Downloaded 417 times
No current OSI hardware
Former programmer for Dwo Quong Fok Lok Sow and Orion Software Associates
Former owner of C1P MF (original version) and C2-8P DF (502-based)
pbirkel
Posts: 34
Joined: Mon Feb 27, 2017 8:06 am

Re: OSI-tems Vol I

Post by pbirkel »

Given the quality of the initial scans, AFAICS your OCR is pretty good. The fundamental problem is that the scan includes a lot of fly-speck noise that is pretty high-contrast and thus is picked up as, typically, periods or commas or occasionally dashes. Playing with preprocessing to increase contrast and/or threshold is unlikely to help at the OCR stage.

It looks as if you scanned the originals as B&W. Suggest that you scan as at least grey-scale, or perhaps color (the better choice). That will give the OCR algorithm more useful information and an opportunity for *it* to eliminate noise -- rather than leaving it up to your scanner.

Thanks for taking on this task. Strongly suggest that you fiddle a bit before proceeding further.

In the end, though, collective experience in the community is that *all* OCR of listings ends up requiring thorough hand-checking. So don't set your expectations to "perfect" :->.

Sometimes you can just submit the result to a syntax checker and let it spit out the obvious errors for investigation, but there will still be non-obvious ones that are syntactically fine but semantic trash :-{.
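As a rough illustration of that kind of mechanical check, here is a minimal Python sketch. It is not a real BASIC syntax checker, and `check_listing` is a hypothetical helper name, not part of any existing tool. It assumes one numbered BASIC statement line per text line and flags only three things: a missing line number, line numbers that do not increase, and unbalanced quotes or parentheses.

```python
import re


def check_listing(text):
    """Return (source_line_index, message) pairs for suspicious OCR'd lines.

    Hypothetical helper: assumes each non-blank line is one numbered
    BASIC line. It catches only obvious damage, not semantic errors.
    """
    problems = []
    previous_number = -1
    for index, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        match = re.match(r"(\d+)\s*(.*)", line)
        if match is None:
            problems.append((index, "no leading line number"))
            continue
        number = int(match.group(1))
        body = match.group(2)
        if number <= previous_number:
            problems.append((index, "line number not increasing"))
        previous_number = number
        # Collapse complete string literals so their contents
        # do not disturb the quote and parenthesis counts.
        code = re.sub(r'"[^"]*"', '""', body)
        if code.count('"') % 2:
            problems.append((index, "unbalanced quote"))
        if code.count("(") != code.count(")"):
            problems.append((index, "unbalanced parentheses"))
    return problems


sample = '10 PRINT "HELLO"\n20 FOR I=1 TO 10\n15 PRINT (I\n3O NEXT I\n'
for index, message in check_listing(sample):
    print(index, message)
```

In the sample, the last line has a letter O where the zero of 30 should be, so it is caught only indirectly, as a line number that fails to increase. A misread that turns one valid digit or variable name into another would pass untouched, which is the "syntactically fine but semantic trash" case.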
bxdanny

Re: OSI-tems Vol I

Post by bxdanny »

I had been thinking that a black-and-white scan would be cleaner, since the pages were printed as black on white, and any changes in color or shades of grey would be extraneous info that wasn't intended to be there. And of course black-and-white scans produce the smallest files. I considered turning OCR off altogether (there's a setting somewhere to do that, but I'd have to find it first). But having it on causes the scans of the bottom half of overlength pages to automatically display in the correct orientation, rather than starting out upside-down. I did consider rescanning the first part of that last page to eliminate the blooming on the lines in the 500 range, but decided not to bother because the same section also appears in the scan of the bottom part.

Actually, I almost didn't mention the existence of the OCR at all. Of course even high-quality OCR would need to be checked visually against the image, but in some cases that might be worth doing. My point was that, for the listing on that last page at least, it wouldn't be. I guess it depends on the font, and the font size. I don't plan to spend much time on the issue.
pbirkel

Re: OSI-tems Vol I

Post by pbirkel »

It's your time, but WRT B&W you're forgetting that your scanning process is arbitrarily aligned to the original printing process. The actual scan is necessarily grey-scale as the scanning raster overlaps to varying degrees with character edges. Selecting B&W means that the scanner is setting an arbitrary threshold internally. You're always better off deferring that decision to a later stage of processing, which also leaves open the opportunity to re-process with better OCR (or other) software.

I strongly urge you to use at least grey-scale scanning if you want to achieve anything like "archival quality". IMO file-size shouldn't be the determinant. Acrobat can be used to re-export a smaller file if that's important.
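To make the threshold point concrete, here is a toy Python sketch. It is a one-dimensional model, not how any particular scanner works. It assumes each pixel's grey level is simply the fraction of that pixel covered by ink; `scan_row` and `binarize` are hypothetical names made up for the illustration.

```python
def scan_row(ink_start, ink_end, n_pixels):
    """Grey level per pixel: fraction of the pixel covered by ink (0.0 to 1.0).

    Toy model: the ink stroke spans [ink_start, ink_end) in pixel units
    and pixel i covers [i, i + 1).
    """
    row = []
    for i in range(n_pixels):
        overlap = min(ink_end, i + 1) - max(ink_start, i)
        row.append(max(0.0, overlap))
    return row


def binarize(row, threshold):
    """Convert grey levels to B&W: 1 (ink) at or above the threshold, else 0."""
    return [1 if level >= threshold else 0 for level in row]


# A stroke 3.2 pixels wide that does not line up with the pixel grid.
grey = scan_row(2.4, 5.6, 8)
for threshold in (0.5, 0.7):
    bw = binarize(grey, threshold)
    print(threshold, bw, "stroke width:", sum(bw))
```

With a threshold of 0.5 the stroke comes out 4 pixels wide; with 0.7 it comes out 2 pixels wide. The true width is 3.2, and only the grey-scale row keeps enough information to choose a different threshold later.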