Working with scanned books

Splitting scanned books

There are two problems with automating the splitting of scanned books in a single pass:

  • Automation is not always accurate
  • Making a scanned book comfortable to read takes more than just splitting pages

For everything related to scanned books, I highly recommend ScanTailor. Its features include:

  • Deskew pages so that they stand upright,
  • Select the content area to reduce the page size,
  • Increase or decrease the margins (for note-taking, for example),
  • Whiten the background for a better reading experience.

To use it, you must export the PDF pages as images and then recombine the processed images into a PDF. The processed images can be very small (as little as 6% of the original file size) yet excellent in quality.

To complete the task, I recommend PDF-XChange Viewer for extracting the images and adding OCR, and i2pdf for merging the outputs. In my experience, you can set the JPG quality to the lowest setting without much visible difference, but there is still a trade-off between the final file size and image quality. All of these programs are free. The whole process takes around 1 hour running in the background, with occasional checks.

Creating hierarchical bookmarks (table of contents)

Use JPdfBookmarks.

1. Prepare the table of contents in a .txt file in this format:

Chapter 1. The Beginning/23
    Para 1.1 Child of The Beginning/25,FitWidth,96
        Para 1.1.1 Child of Child of The Beginning/26,FitHeight,43
Chapter 2. The Continue/30,TopLeft,120,42
    Para 2.1 Child of The Beginning/32,FitPage

You can OCR the TOC and use regex to fix it.
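
As a rough illustration of the regex cleanup, the sketch below turns OCR'd TOC lines into the bookmark format shown above. It assumes each OCR line looks like "section number, title, dot leaders, page number"; real OCR output will vary, so the pattern is only a starting point. The function name toc_to_bookmarks and the four-space indent are choices made for this example (the indent mirrors the sample above and may need to match the tool's settings).

```python
import re

# Assumed OCR line shape: "<section number> <title> <dot leaders> <page>".
TOC_LINE = re.compile(
    r"""^\s*
        (?P<num>\d+(?:\.\d+)*)\.?   # section number such as 1, 1.1 or 1.1.1
        \s+
        (?P<title>.*?)              # title, matched lazily
        [\s.]*                      # dot leaders and spacing before the page
        (?P<page>\d+)\s*$           # page number at the end of the line
    """,
    re.VERBOSE,
)


def toc_to_bookmarks(lines, indent="    "):
    """Convert OCR'd TOC lines to 'Title/page' lines, indented by depth.

    Returns (converted, unmatched); unmatched lines need manual review.
    """
    converted, unmatched = [], []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        match = TOC_LINE.match(line)
        if match is None:
            unmatched.append(line)
            continue
        # Depth is taken from the numbering: "1.1.1" has two dots, so depth 2.
        depth = match["num"].count(".")
        converted.append(
            f"{indent * depth}{match['num']} {match['title']}/{match['page']}"
        )
    return converted, unmatched
```

Lines that do not fit the assumed shape are returned separately rather than guessed at, so they can be fixed by hand before loading the file.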

2. Load that TOC

[Screenshot: the Tools menu, listing Select Text, Use System Clipboard, Show On Open, Dump, Apply Page Offset, and Options]

3. Expand all bookmarks (Ctrl + E), select all of them, then go to Tools > Apply Page Offset

[Screenshot: the Tools menu with Apply Page Offset, shown over the expanded bookmark list]

4. Enter the page offset, that is, the number of pages at the start of the PDF that come before the page numbered 1 in the TOC
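
If you prefer to correct the page numbers in the .txt file before loading it, the same offset can be applied with a few lines of Python. This is a minimal sketch, not part of the tool: shift_pages is a name made up for this example, and it assumes the offset is defined as (position of a page in the PDF) minus (page number printed in the TOC). For instance, if printed page 1 is the 15th page of the PDF, the offset is 14.

```python
import re

# A bookmark line ends with "/<page>", optionally followed by ",<attributes>".
PAGE_FIELD = re.compile(r"/(\d+)((?:,[^/]*)?)$")


def shift_pages(lines, offset):
    """Add `offset` to the page number of every bookmark line."""
    shifted = []
    for line in lines:
        line = line.rstrip("\n")
        # Only the page field changes; indentation, title and attributes stay.
        shifted.append(
            PAGE_FIELD.sub(
                lambda m: f"/{int(m.group(1)) + offset}{m.group(2)}", line
            )
        )
    return shifted
```

Lines without a trailing page field are passed through unchanged, so it is worth skimming the result before loading it.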

You can read the JPdfBookmarks manual or watch a quick video tutorial. It also has a command-line mode and runs on Linux and Mac.

If there are non-Roman characters, be sure to use the same encoding when dumping and applying bookmarks.
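
One way to keep the encoding consistent is to convert the bookmark file to a single known encoding before loading it. The sketch below assumes you know the encoding the file was saved in and that UTF-8 is the encoding you will tell the tool to use; reencode is a hypothetical helper name for this example.

```python
from pathlib import Path


def reencode(src, dst, src_encoding, dst_encoding="utf-8"):
    """Read a bookmark text file in one encoding and save it in another."""
    # Decoding with the wrong src_encoding raises an error or garbles the
    # titles, so check a few non-ASCII titles in the returned text.
    text = Path(src).read_text(encoding=src_encoding)
    Path(dst).write_text(text, encoding=dst_encoding)
    return text
```

Returning the decoded text makes it easy to print a few titles and confirm they look right before applying the bookmarks.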

See also: How to OCR tables of contents?

Other resources

Willus.com’s PDF Conversion Tips for e-readers
DIY Book Scanner
Tips for Scanning · scantailor/scantailor Wiki