Screen Publishing

Kindle Conversion From PDF


Ebook conversion from PDF is possible but problematic. PDFs often carry print-specific elements that aren’t useful in reflowable text, like page numbers and repeating chapter titles. In our most recent conversion, the PDF was fraught with printer’s crop marks, the file name, and the creation date on every page. It’s hard for conversion programs (like Calibre and the converter built into Amazon’s web-based KDP uploader) to recognize and strip away unwanted elements like that. Layout-specific elements within PDFs, like sidebars, call-outs, and anything tabular, can also befuddle converters.

Layout programs like InDesign offer increasingly sophisticated ebook output options that let the user specify reflow-friendly display instructions for elements like these (e.g., don’t include page numbers in the epub output, or place sidebars at the end of the section and style them differently).

I recently converted a book for which I had only the PDF: no InDesign files, and no Word files to port to InDesign.

Even though I knew the Calibre developer discourages conversion from PDF, I took a swing. The program produced a fairly literal conversion (including crop marks) but with lots of strange line breaks and no images (where did they go? I don’t know). I also tried uploading a copy of the PDF to KDP for conversion, and although it impressively removed the crop marks and other repeating page elements, images were again partially left out. I say partially because some textual elements from images, like chart text, were preserved, which was probably an artifact of the Adobe OCR process.

Copying and special-pasting directly from the PDF to Word did a decent job of preserving italics and bold text, but it also preserved unwanted repeating page elements. It included artificial line breaks too, which would interrupt reflowability.
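Those artificial line breaks can be removed with a short script. Here is a minimal Python sketch, assuming the pasted text has been saved as plain text and that a blank line marks a real paragraph break. The function name `unwrap` is my own, not part of any tool mentioned here. It doesn't try to repair words hyphenated across a line break, since a script can't reliably tell those from genuine hyphens.

```python
import re


def unwrap(text: str) -> str:
    """Join hard-wrapped lines into paragraphs.

    Assumption: one or more blank lines separate real paragraphs;
    every other line break is an artifact of the PDF layout.
    """
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Collapse each paragraph's line breaks and runs of spaces into single spaces.
    joined = [re.sub(r"\s+", " ", para).strip() for para in paragraphs]
    return "\n\n".join(joined)


if __name__ == "__main__":
    sample = "Ebook conversion\nfrom PDF is\npossible.\n\nSecond\nparagraph."
    print(unwrap(sample))
```

If the PDF text has no blank lines between paragraphs, this would merge everything into one block, so it's worth checking a few pages by eye first.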

Finally, I tried exporting HTML from the PDF using Adobe Acrobat. This produced a set of files (HTML, CSS, and images) fairly similar in display quality to the Calibre conversion, the KDP conversion, and the Word copy/paste. All of these alternate formats, based on the PDF, would need quite a bit of fine-tuning. I’d need to create style classes, inspect parts of the text with unique formatting, and re-insert images optimized for display in the ebook environment.

I decided editing the HTML directly would give me the most precise control over everything. However, the HTML that Adobe exported was U G L Y. It was full of extraneous styling and of markup for unwanted elements (line breaks, crop marks, etc.). Not to mention it was really hard to read: just one continuous flow of code.

These are the steps I performed to clean up and process the HTML:

1) Run HTML Tidy, to nest elements and identify errors (e.g., unclosed tags)

2) Remove unwanted image elements (crop-marks, mangled charts and illos)

3) Remove unwanted textual elements (page numbers)

4) Convert in-line styling to CSS styles

5) Re-insert images (744px wide, JPGs, 72dpi)

6) Insert links for TOC and endnotes

7) Clean up CSS elements which Kindle strictly regulates (justification, paragraph indents, list-styling, blockquotes)
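Step 4 is the most mechanical of these, so it's a good candidate for scripting. What follows is only a rough Python sketch of the idea, not the process I actually used. It assumes every inline style is written as a double-quoted `style="..."` attribute and that the styled elements don't already have a `class` attribute. The names `extract_styles` and `normalize` and the `s1`, `s2` class names are made up for illustration. Regular expressions are a fragile way to edit HTML, so treat the output as a starting point to review by hand.

```python
import re

# Assumption: inline styles appear as double-quoted style="..." attributes.
STYLE_ATTR = re.compile(r'\sstyle="([^"]*)"')


def normalize(style: str) -> str:
    """Canonicalize a declaration list so equivalent styles compare equal."""
    decls = []
    for decl in style.split(";"):
        if ":" in decl:
            prop, value = decl.split(":", 1)
            decls.append(f"{prop.strip().lower()}: {value.strip()}")
    return "; ".join(sorted(decls))


def extract_styles(html: str, prefix: str = "s"):
    """Replace each inline style with a class; return (html, css)."""
    classes = {}  # normalized style -> generated class name

    def swap(match):
        key = normalize(match.group(1))
        if not key:
            return ""  # drop empty style attributes entirely
        name = classes.setdefault(key, f"{prefix}{len(classes) + 1}")
        return f' class="{name}"'

    cleaned = STYLE_ATTR.sub(swap, html)
    css = "\n".join(f".{name} {{ {key}; }}" for key, name in classes.items())
    return cleaned, css


if __name__ == "__main__":
    page = '<p style="font-style:italic">a</p><p style="text-indent:0">b</p>'
    body, sheet = extract_styles(page)
    print(body)
    print(sheet)
```

The generated class names are meaningless, so I'd rename them (say, to `.epigraph` or `.caption`) once it's clear which text each one applies to.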

A similar process could be done from Word directly. Word has a pretty neat auto-format feature that performs a similar task to HTML Tidy. It’ll remove hard line breaks and assign headings and styles pretty consistently so that you can more easily make changes. Instead of creating CSS styles, you’d be creating custom styles in Word. Instead of doing GREP searches in Text Wrangler, you’d be doing wildcard searches in Word. A Word document is probably easier for most people to read than raw HTML. (For my part, I previewed as I went by viewing the HTML file in Firefox, which let me diagnose problem areas using the Firebug extension.) You could then upload your Word file to KDP for conversion, or you could port it to InDesign for epub export.

I like to think that I avoided a lot of bloat-related troubleshooting by working directly on the HTML, but honestly, Kindle is a lot more forgiving than iBooks, so it’s a little hard to tell. Natasha Fondren presents a good argument for hand-coding ebooks. Ditto.
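The GREP-style searches mentioned above can also be scripted. As an illustration only, here is a Python sketch for stripping page numbers. It assumes each page number was exported as its own paragraph containing nothing but a one- to four-digit number. That's a guess about the markup, so check your own file before trusting the pattern. The names `PAGE_NUMBER` and `strip_page_numbers` are hypothetical.

```python
import re

# Hypothetical pattern: a <p> element whose only content is a 1-4 digit number.
PAGE_NUMBER = re.compile(r"<p\b[^>]*>\s*\d{1,4}\s*</p>\s*", re.IGNORECASE)


def strip_page_numbers(html: str) -> str:
    """Remove paragraphs that contain only a page number."""
    return PAGE_NUMBER.sub("", html)


if __name__ == "__main__":
    sample = '<p>Intro text.</p>\n<p class="x"> 12 </p>\n<p>In 2011 we began.</p>'
    print(strip_page_numbers(sample))
```

Numbers inside running text are left alone, but a legitimate paragraph that consists only of a number (a numbered heading, for instance) would be removed too, so it's safer to review the matches before deleting them.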

I’ll need to work on the iBooks conversion next to test that theory.

Formatting guides/helps for preparing Kindle ebooks:

Kindle Formatting, by Joshua Tallent
website and guide are both super helpful

Natasha Fondren’s blog, Adventures in Writing on the Road
see especially the Kindle Formatting category

Posted by cc on May 27, 2011 at 3:49 pm