
Service manual scan post processing


 

I have recently been scanning sections of microfiched service manuals for a couple of group members using a Canon MS-800.
There is a significant tradeoff between file size and readability (especially with circuit diagrams).
To simplify the scanning process I have been acquiring everything at maximum equipment resolution, but this leads to files that might be 200 MB+ per fiche.
These are unwieldy but get the job done.

Does anyone in the group have experience with tools that might be used to post-process these scans to reduce size whilst maintaining small-font fidelity?
Any recommendations?


 

I scan a lot of manuals, convert them to PDF, and put them in my esquemateca (schematics library; link in my sig).

I use Adobe Acrobat Pro (I believe v11), using low compression and ClearScan. The output is great. Remember to set the "full page on screen" option in the document properties before saving.

---8<---Cut here---8<---

- High quality schematics and service manuals FREE, scanned by me
---8<---Cut here---8<---




 

Thanks, Alexandre, I will take a look.
Seems like Acrobat Pro v11 is no longer supported - any experience with their current product?


 

I usually scan the manual pages at 1200 or 600 dpi to PDF form and save them. Then if I want to load them on a reading device or share them, I'll use Adobe to downsample the file(s) to a lower resolution, usually 300 dpi.

DaveD
KC0WJN
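The downsample step DaveD describes can be done with Ghostscript's pdfwrite device; here is a minimal Python sketch of that step (it assumes `gs` is on the PATH, and the file names are placeholders):

```python
# Sketch: re-emit a scanned PDF with its images downsampled to a target
# dpi, as in DaveD's "share at 300 dpi" step. Assumes Ghostscript (`gs`)
# is installed; file names are placeholders.
import subprocess

def gs_downsample_cmd(src_pdf, dst_pdf, dpi=300):
    """Build the Ghostscript command that downsamples colour, grayscale,
    and mono images alike to `dpi` while rewriting the PDF."""
    return [
        "gs", "-dBATCH", "-dNOPAUSE", "-sDEVICE=pdfwrite",
        "-dDownsampleColorImages=true", f"-dColorImageResolution={dpi}",
        "-dDownsampleGrayImages=true", f"-dGrayImageResolution={dpi}",
        "-dDownsampleMonoImages=true", f"-dMonoImageResolution={dpi}",
        "-o", dst_pdf, src_pdf,
    ]

def downsample(src_pdf, dst_pdf, dpi=300):
    subprocess.run(gs_downsample_cmd(src_pdf, dst_pdf, dpi), check=True)
```

Keeping the full-resolution master and only downsampling the copy you share, as described above, means the detail is never thrown away.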




 

Check ebay, etc. to see if someone is selling an older copy.

I've been using Acrobat X (10) for a long time with no problems.

DaveD
KC0WJN




 

I'm hesitant to bring this up because I'm only just barely beginning to understand it and create a workflow, but as an alternative to Adobe, there is a Google Cloud "Vision" API that does OCR of PDF files. According to ChatGPT, it does a better job than the various open source tools would, though I don't know how it compares to Acrobat.

You need a Google cloud or workspace account, and from there you set up a cloud bucket to hold the raw PDFs, and then create an API Key to the Vision API. Then a Python script can call the Google APIs to trigger conversion of the PDF to a text only document. Most of the pain is getting the bucket and API set up with the right permissions and account info.

Believe it or not, I used ChatGPT to walk me through the whole process and even write the Python script! (Which I'm happy to share.)

Google lets you process 1000 pages per month for free, and it's an additional $1.50/1000 pages thereafter. But I found that my Google Workspace account gave me a $300 credit, so I can do a lot of conversion before I have to pay any real money.

Anyway, this may be too far down the rabbit hole, but looks like it would work well for processing large numbers of documents automatically; even at $1.50 per thousand pages, it's pretty inexpensive.

John
----
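To make the pieces concrete, here is a sketch of the request that workflow builds; the bucket URIs are placeholders, and actually submitting it needs the google-cloud-vision package plus an authenticated Google Cloud account, so only the request-building part is shown as plain data:

```python
# Sketch of the Vision API async PDF OCR request John describes.
# gs:// URIs are placeholders; submitting requires google-cloud-vision
# and cloud credentials, so that part is isolated in submit().

def build_ocr_request(gcs_src_uri, gcs_dst_uri, batch_size=20):
    """Build one async_batch_annotate_files request as a plain dict:
    PDF in a bucket in, JSON text results to a bucket out."""
    return {
        "input_config": {
            "gcs_source": {"uri": gcs_src_uri},
            "mime_type": "application/pdf",
        },
        "features": [{"type_": "DOCUMENT_TEXT_DETECTION"}],
        "output_config": {
            "gcs_destination": {"uri": gcs_dst_uri},
            "batch_size": batch_size,  # pages of output per JSON file
        },
    }

def submit(requests):
    # Requires: pip install google-cloud-vision, plus credentials set up
    # (this is the part where the bucket/API permissions pain lives).
    from google.cloud import vision
    client = vision.ImageAnnotatorClient()
    return client.async_batch_annotate_files(requests=requests)
```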



 

We switched from the exorbitantly priced Adobe to PDF-XChange several years ago at work (and I did personally) and could not be happier. Excellent and very functional product at 60-70 USD per user.

Again, highly recommended.

Hal


 

Hi John, sounds interesting.
Would you be interested in running a few sample pages through the process?
Peter


 

This +1.

600dpi, G4 compression. (Not JPEG compression -- Never JPEG)

I've got some terrible Linux scripts that use NETPBM/ImageMagick/Tiff
tools to build PDFs for the few I've ever scanned, but the process
varies greatly for every different document.
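As a rough sketch of that recipe (ImageMagick's `convert` and libtiff's `tiff2pdf` are assumed to be installed; file names and the threshold are placeholders to tune per document):

```python
# Sketch of the 600 dpi + Group 4 recipe: binarize each page to a CCITT
# Group 4 TIFF (lossless, far smaller than JPEG for text, no ringing
# around glyphs), then wrap it as a PDF page.
import subprocess

def g4_tiff_cmd(src_img, dst_tif, threshold="60%"):
    """Build the ImageMagick command: grayscale, binarize, G4-compress."""
    return ["convert", src_img, "-colorspace", "gray",
            "-threshold", threshold, "-compress", "group4", dst_tif]

def tiff_to_pdf_cmd(src_tif, dst_pdf):
    """Build the libtiff command that wraps the TIFF as a PDF."""
    return ["tiff2pdf", "-o", dst_pdf, src_tif]

def page_to_pdf(src_img, dst_pdf, threshold="60%"):
    tif = dst_pdf + ".tif"
    subprocess.run(g4_tiff_cmd(src_img, tif, threshold), check=True)
    subprocess.run(tiff_to_pdf_cmd(tif, dst_pdf), check=True)
```

As noted above, the right threshold varies greatly per document, which is why a one-size-fits-all script is hard.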


 

I just saw John's post. I forgot to add to my earlier post (below) that I also convert all documents that I scan to OCR format as well.

I do not intend for this follow-up post about my omission to be any comment, observation, endorsement or criticism about John's post.

DaveD
KC0WJN





 

If anyone wants to have a go, there are some sample scans here:

Files - A temporary directory for photographs and help relating to emails and posting - 8430A 08340-90021 Service vol 1 section1

They are PDF'd from the scanning software with minimum compression. The software will also produce .tiff files if these are a better place to start.

Peter


 

Why not just upload the uncompressed output and let people compress it in the future? I think if you upload the raw JPEG files they even do that for you.




 

Hi Evan,
I understand where you are coming from; best possible is best possible ... but one manual I have been asked about is 38 fiches long - almost 8 GB.

I am also wondering if current predictive tools might be able to repair parts of the scans where low contrast causes portions of e.g. single letters to drop out.
If the scans can be post-processed so that all of the text information is 100% there, there is less case for storing it at higher resolution. Images are different; storing those at best possible resolution seems wise.
Peter


 

John,

Just so I understand, "Vision" (only) performs the OCR process step of reproducing a printed document in OCR PDF form? Or did I mis-read your post?

DaveD
KC0WJN








 

My workflow on a Linux system ends up with approx 80 kBytes/page, which can be halved by using djvu format. I'm a skinflint, so it only uses free-as-in-beer software... I've had compliments about the quality, but judge for yourself.

The workflow is described for a single page. Normally there will be a multiplicity of files/pages with numeric suffixes; simply use the traditional shell script wildcards to process them all at once.

1) scan at 300dpi to produce a colour jpg file, e.g. "filename.jpg"

2) normally convert filename.jpg to small tiff files using the scancvt script below: "scancvt filename.jpg". That creates two variants of the input, "b-filename.tif" and "g-filename.tif". The g-filename.tif variant is better for grayscale images, but b-filename.tif is good for black and white images and is much smaller. Select the b-* variant unless g-* is necessary.

3) use the standard command "tiff2pdf filename.tif filename.pdf"

4) concatenate all the pdf files into one using "pdfunite filename*.pdf finalManual.pdf"

N.B. Occasionally a jpg image is required, in which case replace (2) and (3) to reduce the size using gimp and its posterising actions to produce "colour.jpg", then convert that file to a pdf using "convert colour.jpg colour.pdf"

If you want to see a manual created like that, see [link]; that's a 180-page file, mostly text and schematics with a small number of "photos". Average page size is around 80 kBytes (pdf), 40 kBytes (djvu).
Ed at BAMA postprocessed the pdf to produce the djvu file, which is of identical quality but half the size. I believe that is achieved by spotting common bits across each page, e.g. a letter "e".

Alternatively see [link], which contains a higher proportion of photos and colour PCB layouts. Still only 100 kBytes/page for the pdf.

The scancvt script is...

#!/bin/bash
#
# Digital Camera + This Software + Printer = A Document Photocopier
#
# Input:  pictures of B&W Text documents taken with a digital camera using
#         flash from about 3 feet away with no dark border around the page.
#
# Output1: b-file.tif (a very small B&W TIF file)
# Output2: g-file.jpg (an alternative grayscale file)
#
# If input is purely black and white, Output1 should be better
# If input is not purely black and white, Output2 may be better
#
# Corey Satten, corey @ , March 2007

do1 () {
   echo starting $1 1>&2
   BASE="${1##*/}"; NAME=${BASE%.[jJ][pP][gG]}; TMP1="t-$BASE"; TMP2="x-$BASE"
   trap 'rm -f "$TMP1" "$TMP2"; exit' 0 1 2 13 15
   CGQ="-colorspace gray -quality"
   CGT="-compress group4 -density 480x480"

   convert $CGQ 99 "$1" -resize 5120x5120 "$TMP2"
   convert $CGQ 99 "$1" -resize 1024x1024 -negate -blur 15,15 -resize 5120x5120 "$TMP1"
   composite $CGQ 99 -compose plus "$TMP2" "$TMP1" "$TMP1"
   convert $CGQ 60 "$TMP1" -normalize -level 50,85% "g-$BASE"
   convert $CGT "$TMP1" -normalize -threshold 85% "b-$NAME.tif"
   rm -f "$TMP1" "$TMP2"
}

# This tries to detect multiprocessors and run 2 conversions in parallel.
# Move CPUS=1 after the test to effectively disable the test.
CPUS=1
if [ -f /proc/cpuinfo ] ;then
    CPUS=`grep ^processor /proc/cpuinfo | wc -l`
    if [ "$CPUS" -lt 2 ] ;then CPUS=1; fi
fi

for i in "$@"; do
   case $#/$CPUS in
    0/*) exit;;                                      # done
    1/*) do1 "$1"; shift;;                           # only one file to do
    */1) do1 "$1"; shift;;                           # only one cpu to use
      *) do1 "$1" & do1 "$2"; wait; shift; shift;;   # process 2 files at once
   esac
done

exit 0


 

Yes, my project is focusing on OCR of existing scanned PDFs that lack it. I'm not doing any new scanning (yet).

The Google tool can read other image formats, not just PDFs, so you could directly OCR a pile of JPGs or TIFFs or whatever. But I'm working through a stack of existing manual scans.

John
----


 

And I should also clarify, that the output of my workflow is a separate text-only file, not a PDF that includes both image and text. I am sure there's a way to combine the text and images into a new PDF, but that's not needed for my project.

John
----



 

If you send me a PDF of reasonable size, I can give it a try and send the text file output. It may take a day or three as I'm in the middle of setting up a separate workstation that will handle this (along with other tasks) and moving the tools off my desktop system.



 

Peter,

I have scanned quite a few manuals. I will take a look at the scans. I find it is always best to ingest the scans as tif (lossless compressed with lzw or something) and use a separate tool to convert to pdf as one of the final steps in processing. As a rule I never use the scanner's scanning software beyond what is absolutely necessary.

Never successfully scanned a fiche, though. Have a Minolta MS-6000 with a broken SCSI interface. The MS-800 looks nice. Have been eyeing a ScanPro 2000/3000 for fiche for years but never found the right auction.

As for OCR I use Tesseract. Free, very good, and easy to use.

A tool like tiff2pdf, as Mr. Gardner pointed out, works well. I think my script uses img2pdf, which is probably similar.

-Michael Bierlein
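For anyone who hasn't used it, the Tesseract step above can be sketched as a command builder; the `tesseract` CLI can emit a searchable PDF directly (this assumes the binary is installed, and the file names are placeholders):

```python
# Sketch of the Tesseract OCR step: with "pdf" as the output config,
# Tesseract writes out_base + ".pdf" containing the page image with an
# invisible, searchable text layer over it. Assumes the `tesseract`
# binary is installed; file names are placeholders.
import subprocess

def tesseract_pdf_cmd(src_img, out_base, lang="eng"):
    """Build the command: OCR one page image into a searchable PDF."""
    return ["tesseract", src_img, out_base, "-l", lang, "pdf"]

def ocr_page(src_img, out_base, lang="eng"):
    subprocess.run(tesseract_pdf_cmd(src_img, out_base, lang), check=True)
```

Running this per page and then concatenating with pdfunite (or img2pdf for the image-only route) matches the TIFF-first workflow described above.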



 

Hi Peter,

I'm using "ImageOptim". I think it's free. It works on JPEG and other formats and has a number of compression algorithms, choosing the one that fits best. You can decide whether or not to accept some loss of information.

Usually, taking the pics from my camera that come in at 3-4 MB (medium size setting on the camera, whatever that means), they are reduced by 90% on average without my seeing any loss of information.

cheers
Martin