Creating an ePub ebook from a Paperback

Background

Since reading about James Bond creator Ian Fleming’s work entering the public domain in Canada in 2015, I’ve thought about creating public domain ebooks that I could share online. At the time, I picked up a copy of Octopussy and scanned it. I ran it through the OCR software bundled with my scanner, and the results were pretty mediocre. The OCR process was certainly faster than typing, but the output required so much manual review that I abandoned the project. In short order, the Canadian Gutenberg team created excellent James Bond ePubs, and I note now that others are available from Faded Page.

This year, I saw that a number of the AI/LLM providers introduced OCR models. I tried Mistral OCR on my Octopussy scan from 2015, and the results were much better. This was worth trying again.

/create-epub-from-paperback/images/OCRStateOfTheArt2015vs2025.png
Original Octopussy Book vs 2015 OCR vs 2025 OCR

Creating the ebook

Prepare and scan the book

I decided to create an ebook of Charles T. Currelly’s I Brought The Ages Home. My initial plan was just to take photos of each page of the book.

/create-epub-from-paperback/images/IBroughtTheAgesHomeHeader-NonDestructivePhoto.jpg
Non-destructive photos of the book

It was impossible to get the pages flat. Looking at the scans:

  • The text doesn’t follow a straight line
  • The lighting was inconsistent

I decided to take the book apart so the pages could be scanned flat. I had access to an old Fujitsu ScanSnap iX500 scanner with a document feeder, and the book was scanned in a minute.

/create-epub-from-paperback/images/book-taken-apart.jpg
I Brought The Ages Home, disassembled

The results were much better - the lighting was even, and the text was straight.

/create-epub-from-paperback/images/page-294.jpg
Page 271, as scanned by ScanSnap

I exported the scans to a multi-page PDF file using the software bundled with the scanner.

/create-epub-from-paperback/images/pdfoutputfromscanner.jpg
Scanner PDF Output

OCR Process

Here’s the flow I used. I have Python, a Mistral account and API key, the uv package manager, and Pandoc installed.

export MISTRAL_API_KEY='...'

uv run http://tools.simonwillison.net/python/mistral_ocr.py inputpdffile.pdf --html --inline-images > outputhtmlfile.html

pandoc outputhtmlfile.html -o outputepubfile.epub
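
As a rough sketch, the same steps could be wrapped in a small Python script so that one call takes a scanned PDF to an ePub. The function names (`build_commands`, `run_pipeline`) and the choice to name the outputs after the input PDF are illustrative assumptions, not part of the original flow; the commands themselves are the ones shown above.

```python
import os
import subprocess
from pathlib import Path

# URL copied from the command above.
OCR_SCRIPT = "http://tools.simonwillison.net/python/mistral_ocr.py"


def build_commands(pdf: Path, html: Path, epub: Path) -> list[list[str]]:
    """Return the OCR and Pandoc commands as argument lists (hypothetical helper)."""
    ocr = ["uv", "run", OCR_SCRIPT, str(pdf), "--html", "--inline-images"]
    convert = ["pandoc", str(html), "-o", str(epub)]
    return [ocr, convert]


def run_pipeline(pdf: Path) -> Path:
    """OCR a scanned PDF to HTML, then convert the HTML to an ePub."""
    if not os.environ.get("MISTRAL_API_KEY"):
        raise RuntimeError("Set MISTRAL_API_KEY before running the OCR step")
    # Assumption: outputs sit next to the input and share its base name.
    html = pdf.with_suffix(".html")
    epub = pdf.with_suffix(".epub")
    ocr, convert = build_commands(pdf, html, epub)
    # The OCR script writes HTML to stdout, so redirect it into the HTML file.
    with html.open("w", encoding="utf-8") as out:
        subprocess.run(ocr, stdout=out, check=True)
    subprocess.run(convert, check=True)
    return epub
```

Calling `run_pipeline(Path("inputpdffile.pdf"))` would then produce `inputpdffile.html` and `inputpdffile.epub` in the same folder, stopping with an error if either command fails.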

Here I’ve used a script from Simon Willison to run Mistral’s cloud OCR service on the PDF, and this has worked well for me. There are many options that you can run on your own PC; a couple worth considering include:

OCR Clean Up

I have used this process a few times now, and the results have been acceptable. I’ve loaded the ePubs created using this method on my Kindles and Kobos using Calibre and read them cover-to-cover. It’s a much better experience than reading a PDF, but not as good as a commercial book purchased through the Kindle or Kobo stores. Anything beyond straight text (formatting, figures, tables, page numbers) doesn’t look great.

When I digitized Charles Trick Currelly’s I Brought The Ages Home, I wanted to share it, and I wanted any potential readers to have a great reading experience. I used Calibre’s built-in ebook editor to:

  • Set the metadata like title & author
  • Set the cover
  • Remove page numbers, create links
  • Clean up headers
  • Clean up formatting
  • Review figures

Here are a few examples of the types of OCR issues I fixed:

/create-epub-from-paperback/images/ebook-tweaks-1.jpg
Example of inconsistent formatting by the OCR process

/create-epub-from-paperback/images/ebook-tweaks-2.jpg
Example of content from paperback that must be removed from ebook

Download the ebook

You can download the epub here: I Brought The Ages Home by Charles T. Currelly epub

Please share any issues you find with this epub, or let me know if you would like the original scans (they are too large for the hosting service I am using).