- Extracting images and adding OCR layer: PDF-Xchange Viewer
- Splitting scanned books: ScanTailor/ScanTailor Advanced
- Merging the outputs: i2pdf
- Creating hierarchical bookmarks/table of contents: JPdfBookmarks
All programs are free. The whole process takes about an hour of running time, with occasional checks.
Splitting scanned books
There are two problems with automating the splitting of scanned books in a single pass:
- Automation is not always accurate
- Making a scanned book comfortable to read takes more than just splitting pages. You also want to:
  - Straighten skewed pages,
  - Select the content area to reduce the page size,
  - Increase/decrease the margins (for taking notes, say),
  - Whiten the result for a better reading experience.
ScanTailor works on images, so you must export the PDF pages to images first and recombine the processed images afterwards. The processed images can be very small (as little as 6% of the original file size) yet excellent in quality.
To complete the task satisfactorily, I recommend using PDF-Xchange Viewer for extracting images and adding the OCR layer, and i2pdf for merging the outputs. In my experience, you can set the JPG quality to the lowest setting without much visible difference, but there is still a trade-off between the final output’s size and image quality.
Creating hierarchical bookmarks/table of contents
Step 1: Prepare the table of contents
Save the TOC in a .txt file in this format:
Chapter 1. The Beginning/23
	Para 1.1 Child of The Beginning/25,FitWidth,96
		Para 1.1.1 Child of Child of The Beginning/26,FitHeight,43
Chapter 2. The Continue/30,TopLeft,120,42
	Para 2.1 Child of The Beginning/32,FitPage
You can OCR the TOC and use regex to clean it up.
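As a rough sketch of that regex cleanup, the snippet below turns OCR output into the Title/page form shown above. It assumes each OCR line holds a title, then dot leaders or spaces, then the page number at the end of the line. The function name `ocr_line_to_bookmark` and the sample lines are made up for illustration. Real OCR output will likely need extra rules, and nesting still has to be added by hand.

```python
import re

# Title, then any mix of dot leaders and spaces, then the page number at the end.
TOC_LINE = re.compile(r"^(?P<title>.*?)[\s.]*\s(?P<page>\d+)\s*$")

def ocr_line_to_bookmark(line):
    """Turn 'Title ..... 23' into 'Title/23'; return None if the line does not fit."""
    match = TOC_LINE.match(line)
    if match is None:
        return None
    title = match.group("title").strip()
    if not title:
        return None  # a bare number is not a usable bookmark
    return f"{title}/{match.group('page')}"

if __name__ == "__main__":
    # Hypothetical OCR output, for illustration only.
    ocr_lines = [
        "Chapter 1. The Beginning ........ 23",
        "Para 1.1 Child of The Beginning . . . 25",
        "garbled line",
    ]
    for raw in ocr_lines:
        print(ocr_line_to_bookmark(raw))
```

Lines that return None are the ones to fix by hand. Note that a title ending in a number with no page after it (say, "Chapter 12") would be misread, so skim the result before loading it.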
Step 2: Load that TOC
Step 3: Prepare for step 4
This sounds dumb, but if you skip it you will be frustrated and have to start over. Expand all bookmarks (Ctrl + E), select all of them, then go to Tools → Apply Page Offset.
Step 4: Apply page offset
This step should be self-explanatory. Don’t forget to save.
That’s it. You are done. For more information, you can read the program’s manual. It also has a command-line mode and works on Linux and Mac.
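To make clear what the offset does: the page numbers printed in a book's TOC usually differ from the PDF page positions by a constant, because of the cover and front matter. The sketch below applies such a shift directly to the text file. It is not how the program does it, only an illustration that assumes the Title/page[,view,...] line format from Step 1. The function name `shift_bookmark_pages` is hypothetical.

```python
def shift_bookmark_pages(lines, offset):
    """Add `offset` to the page number of every 'Title/page[,view,...]' line."""
    shifted = []
    for line in lines:
        line = line.rstrip("\n")
        # Split on the LAST slash so titles containing '/' survive,
        # and leading tabs (the nesting) stay attached to the title.
        title, slash, target = line.rpartition("/")
        page, comma, view = target.partition(",")
        if not slash or not page.strip().isdigit():
            shifted.append(line)  # no recognisable page number: leave untouched
            continue
        shifted.append(f"{title}/{int(page) + offset}{comma}{view}")
    return shifted

if __name__ == "__main__":
    toc = [
        "Chapter 1. The Beginning/23",
        "\tPara 1.1 Child of The Beginning/25,FitWidth,96",
    ]
    # Assumed example: printed page 1 sits on PDF page 13, so the offset is 12.
    for entry in shift_bookmark_pages(toc, 12):
        print(entry)
```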
If there are non-Roman characters, be sure to use the same encoding when dumping and applying bookmarks.
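If the TOC file was saved in a different encoding from the one you plan to use, one option is to convert it first. This is a minimal sketch, assuming you know the file's current encoding. The function name `reencode_text_file`, the file names and the `cp1251` source encoding are placeholders for illustration.

```python
from pathlib import Path

def reencode_text_file(src, dst, src_encoding, dst_encoding="utf-8"):
    """Read `src` using `src_encoding` and write the same text to `dst` in `dst_encoding`."""
    text = Path(src).read_text(encoding=src_encoding)
    Path(dst).write_text(text, encoding=dst_encoding)
    return text

if __name__ == "__main__":
    # Hypothetical file names; adjust the source encoding to match your file.
    reencode_text_file("toc_cp1251.txt", "toc_utf8.txt", src_encoding="cp1251")
```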
See also: How to OCR tables of contents?