Splitting scanned books
There are two problems with automating the splitting of scanned books in a single pass:
- Automation is not always accurate
- Making a scanned book comfortable to read takes more than just splitting pages
For everything related to scanned books, I highly recommend ScanTailor. Its features include:
- Deskewing tilted pages,
- Selecting the content area to reduce the page size,
- Increasing or decreasing margins (for note-taking, maybe),
- Whitening the result for a better reading experience.
To use it, you must first export the PDF pages as images, then recombine the processed images into a PDF afterwards. The processed images can be very small (as little as 6% of the original file size) while remaining excellent in quality.
To complete the task satisfactorily, I recommend PDF-XChange Viewer for extracting the images and adding OCR, and i2pdf for merging the outputs. In my experience, you can set the JPG quality to the lowest setting without much visible difference, but there is still a trade-off between the final output’s size and its image quality. All of these programs are free. The whole process takes around 1 hour running in the background, with occasional checks.
Creating hierarchical bookmarks (table of contents)
1. Prepare the table of contents in a .txt file in this format:
Chapter 1. The Beginning/23
	Para 1.1 Child of The Beginning/25,FitWidth,96
		Para 1.1.1 Child of Child of The Beginning/26,FitHeight,43
Chapter 2. The Continue/30,TopLeft,120,42
	Para 2.1 Child of The Beginning/32,FitPage
You can OCR the TOC and use regex to fix it up.
2. Load that TOC
3. Expand all bookmarks (Ctrl + E), select all of them, then go to Tools > Apply Page Offset
4. Enter the offset: the number of pages at the front of the PDF that come before the page the TOC counts as page 1
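For the regex cleanup mentioned in step 1, here is a minimal Python sketch. It assumes each OCRed TOC line looks like "title, dot leader or spaces, page number", that nesting depth can be read from the section numbering (1.1 is one level deep, 1.1.1 two levels), and that one tab per level is the indentation your bookmark tool expects. The function names are hypothetical, and real OCR output will likely need extra rules.

```python
import re

# Assumed OCR line shape: "<title> <dots/spaces> <page number>".
LINE_RE = re.compile(r"^(?P<title>.*?)[\s.·_]*(?P<page>\d+)\s*$")
# Assumed numbering style inside the title: "1", "1.1", "1.1.1", ...
NUMBER_RE = re.compile(r"\d+(?:\.\d+)*")


def clean_toc_line(line):
    """Turn one OCRed TOC line into 'Title/page', or None if no page number is found."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    title = m.group("title").strip()
    if not title:
        return None
    # Depth = number of dots in the first section number (assumption, see lead-in).
    number = NUMBER_RE.search(title)
    depth = number.group(0).count(".") if number else 0
    return "\t" * depth + f"{title}/{m.group('page')}"


def clean_toc(text):
    """Clean a whole OCRed TOC, silently dropping lines that do not parse."""
    cleaned = (clean_toc_line(line) for line in text.splitlines())
    return "\n".join(line for line in cleaned if line is not None)
```

Lines that do not parse are dropped, so compare the result against the scanned TOC before loading it.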
If there are non-Roman characters, be sure to use the same encoding when dumping and applying bookmarks.
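If you would rather apply the page offset to the .txt file yourself, the following Python sketch shows one way to do it while reading and writing with the same explicit encoding. It assumes each line ends in "/page" optionally followed by view options such as ",FitWidth,96", and that UTF-8 is the encoding in use. The function names and the default encoding are illustrative assumptions, not part of any tool.

```python
import re
from pathlib import Path

# "/<page>" at the end of a line, optionally followed by ",FitWidth,96"-style options.
PAGE_RE = re.compile(r"/(\d+)((?:,[^/]*)?)$")


def shift_pages(text, offset):
    """Add `offset` to the page number on every bookmark line, keeping indentation."""
    def bump(m):
        return f"/{int(m.group(1)) + offset}{m.group(2)}"
    return "\n".join(PAGE_RE.sub(bump, line) for line in text.splitlines())


def shift_file(src, dst, offset, encoding="utf-8"):
    """Read and write with the SAME encoding so non-Latin titles survive intact."""
    text = Path(src).read_text(encoding=encoding)
    Path(dst).write_text(shift_pages(text, offset), encoding=encoding)
```

For example, `shift_pages("Chapter 1. The Beginning/23", 20)` yields `Chapter 1. The Beginning/43`. If your tool dumps bookmarks in another encoding, pass that name as `encoding` instead of relying on the default.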
See also: How to OCR tables of contents?