The ultimate guide to processing scanned books

TLDR:

All the programs used here are free. The whole process takes about an hour of run time, with occasional checks.

Splitting scanned books

There are two problems with automating the splitting of scanned books in a single pass:

  • Automation is not always accurate
  • Making a scanned book comfortable to read takes more than just splitting pages

For everything related to scanned books, I strongly recommend using ScanTailor (or its fork ScanTailor Advanced). It has features such as:

  • Straighten skewed pages,
  • Select content to reduce the page size,
  • Increase/decrease margins (for taking notes, say),
  • Whiten the result for a better reading experience.

To use it, you must first export the PDF’s pages as images, then recombine the output images into a PDF afterwards. The processed images can be very small (as little as 6% of the original file size) yet excellent in quality.

To complete the task satisfactorily, I recommend using PDF-XChange Viewer for extracting the images and adding an OCR layer, and i2pdf for merging the outputs. In my experience, you can set the JPG quality to the lowest setting without much visible difference, but there is a trade-off between the final output’s size and its image quality.
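One thing to watch when merging is page order: if the image files are not zero-padded, a plain alphabetical sort puts page 10 before page 2. Here is a minimal sketch in Python of a natural sort you could run on the file names first. The file names are hypothetical, and `natural_key` and `order_pages` are helper names made up for this example, not part of any tool mentioned here.

```python
import re

def natural_key(name):
    """Split a name into text and integer chunks so 'page_10' sorts after 'page_2'."""
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r"(\d+)", name)]

def order_pages(names):
    """Return the page image names in reading order."""
    return sorted(names, key=natural_key)

# Hypothetical file names, not the output of any particular program.
pages = ["page_10.tif", "page_2.tif", "page_1.tif"]
print(order_pages(pages))
```

If your images are already zero-padded (page_001, page_002, ...), the default sort order is fine and you can skip this.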

Creating hierarchical bookmarks/table of contents

Use JPdfBookmarks.

Step 1: Prepare the table of contents

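Save the TOC in a .txt file in this format, one bookmark per line, with indentation marking the nesting level and the target page after the slash: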

Chapter 1. The Beginning/23
    Para 1.1 Child of The Beginning/25,FitWidth,96
        Para 1.1.1 Child of Child of The Beginning/26,FitHeight,43
Chapter 2. The Continue/30,TopLeft,120,42
    Para 2.1 Child of The Beginning/32,FitPage

You can OCR the TOC and use regex to fix it.
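As a rough idea of what that regex cleanup can look like, here is a minimal sketch in Python. It assumes each OCR'd line has the shape "section number, title, dot leaders, page number" and that nesting depth can be read off the number of dots in the section number. The four-space indent copies the example above; `TOC_LINE` and `convert_toc` are names made up for this sketch, and your OCR output will likely need a different pattern.

```python
import re

# Assumed OCR line shape: "<section number> <title> <dot leaders> <page>".
TOC_LINE = re.compile(
    r"^\s*(?P<num>\d+(?:\.\d+)*)\.?\s+(?P<title>.+?)[\s.]*(?P<page>\d+)\s*$"
)

def convert_toc(lines, indent="    "):
    """Turn OCR'd TOC lines into indented 'title/page' bookmark lines."""
    out = []
    for line in lines:
        m = TOC_LINE.match(line)
        if m is None:
            continue  # skip lines that do not look like TOC entries
        depth = m.group("num").count(".")  # "1.1.1" has two dots, so depth 2
        out.append(f"{indent * depth}{m.group('num')} {m.group('title')}/{m.group('page')}")
    return out

ocr_lines = [
    "1 The Beginning ........ 23",
    "1.1 Child of The Beginning . . . 25",
    "2. The Continue 30",
]
print("\n".join(convert_toc(ocr_lines)))
```

Lines that do not match are silently dropped here, so compare the result against the printed TOC before loading it. Titles that contain a slash would also need extra care, since the slash separates the title from the page.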

Step 2: Load that TOC

[Screenshot: the Tools menu, with the entries Select Text, Use System Clipboard, Show On Open, Dump, Apply Page Offset and Options.]

Step 3: Prepare for step 4

This sounds dumb, but if you skip it you will get frustrated and have to start over. Expand all bookmarks (Ctrl + E), select all of them, then go to Tools → Apply Page Offset.

[Screenshot: all bookmarks expanded and selected, with the Tools menu open on Apply Page Offset.]

Step 4: Apply page offset

This step should be self-explanatory. Don’t forget to save.
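In case the idea of an offset is not obvious: the page numbers printed in a book's TOC often differ from the PDF's page numbers by a constant, because of the cover and front matter. Here is a minimal sketch in Python of what applying an offset amounts to, assuming the offset is simply added to every bookmark's page number. `shift_pages` is a hypothetical helper working on the text format from step 1, not something the program provides.

```python
import re

# A bookmark line: "<title>/<page>" optionally followed by ",<view>,<params>".
BOOKMARK = re.compile(r"^(?P<head>.*)/(?P<page>\d+)(?P<attrs>(?:,.*)?)$")

def shift_pages(lines, offset):
    """Add `offset` to the page number of every bookmark line."""
    shifted = []
    for line in lines:
        m = BOOKMARK.match(line)
        if m is None:
            shifted.append(line)  # leave unrecognized lines untouched
            continue
        page = int(m.group("page")) + offset
        shifted.append(f"{m.group('head')}/{page}{m.group('attrs')}")
    return shifted

toc = [
    "Chapter 1. The Beginning/23",
    "    Para 1.1 Child of The Beginning/25,FitWidth,96",
]
# Example: printed page 1 sits on PDF page 13, so the offset is 12.
print("\n".join(shift_pages(toc, 12)))
```

The offset is the PDF page number minus the printed page number for any page you can check by hand.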

That’s it. You are done. For more information, read the program’s manual. It also has a command-line mode and runs on Linux and Mac.

If there are non-Roman characters, be sure to use the same encoding when dumping and applying bookmarks.

See also: How to OCR tables of contents?

Other resources

Book scanning – Wikipedia
How to Scan a Book (with Pictures) – wikiHow
Willus.com’s PDF Conversion Tips for e-readers
DIY Book Scanner
Tips for Scanning · scantailor/scantailor Wiki

