EPUB Import: Optimal Workflow

zettelstraum · December 6, 2023, 9:56am

For a while now, I have been trying to find a way to import EPUBs into my vault. My general experience is that working with the full text is more convenient, even though I love the extended PDF capabilities of Obsidian (linking to a section in a document is fantastic!).

While nothing comparable exists for EPUB, I import the books anyway and have experimented with a lot of approaches. I want to describe my best-case scenario (so far) here, since it may be of help to others.

What have I tried before?

Exporting to MD via Calibre
Checking this was a no-brainer and I was thrilled about this option. Using it, however, I quickly found out that the formatting is far from optimal: headers are very often only set in bold, and images are not extracted or integrated.
In essence: this approach needs a LOT of manual work after the conversion. I am sure this could (at least in part) be remedied by a good recipe (Calibre allows for much customization), but I am not capable of using this part of the program.

Conversion via Pandoc
I tried this only with the GUI version, and it hung on me every time. I am sure that Pandoc can do wonders for conversion, but it needs A LOT of manual adjusting with arguments that (at least to me) look very much like code (see the Calibre section to understand how far my knowledge goes in such scenarios).

Conversion via online Converters
There are probably hundreds of those. Mostly, they do not work. I found one that was pretty good, but it was unreliable (sometimes uploading and converting simply did NOT progress), and it also renamed all extracted images to the product's name, which means substantial manual work if you don't want to leave it that way (batch renaming the files and then batch replacing the links is possible, but can lead to trial-and-error phases).

So: What have I found works best so far?

Conversion to HTMLZ via Calibre

This option creates an HTMLZ file inside the book's folder. The file is essentially a ZIP file that can be extracted (with 7-Zip, for example, a great free archive tool). In the extracted folder, there is an HTML file and a subfolder with all extracted images.
Now, I import only the HTML via the Obsidian import function. I very much like the predefined option (the imported file is created inside a new "HTML Import" folder in the root of my vault). Then I check for things I do not like in the imported result, which mostly requires modest manual touch-ups. I use the Linter plugin to automatically convert CAPITALIZED HEADINGS to Normal Ones, as I don't wish to be screamed at when reading. I also rename the imported file and the subfolder with images to the name of the book.
I do a little finishing magic here and there and have a well-ordered MD version of my EPUB inside my vault.
Once I am done, I move the imported file to the right folder and the subfolder alongside it (I do not use one giant image folder but subfolders inside the folder structure).
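
For anyone who prefers scripting the first two steps, here is a minimal Python sketch. It is only an illustration, not part of the workflow above: the function names are made up, `convert_to_htmlz` assumes that Calibre's `ebook-convert` command-line tool is installed and on your PATH, and `extract_htmlz` relies only on the fact that an HTMLZ file is a ZIP archive.

```python
import subprocess
import zipfile
from pathlib import Path


def convert_to_htmlz(epub_path: str, htmlz_path: str) -> None:
    """Convert an EPUB to HTMLZ with Calibre's command-line converter.

    Assumption: `ebook-convert` is on PATH. The output format is
    chosen from the extension of the output file name.
    """
    subprocess.run(["ebook-convert", epub_path, htmlz_path], check=True)


def extract_htmlz(htmlz_path: str, dest_dir: str) -> list[Path]:
    """Unpack an HTMLZ file (a ZIP archive) and return the HTML files found."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(htmlz_path) as archive:
        archive.extractall(dest)
    # Search recursively rather than assuming a fixed file name.
    return sorted(dest.rglob("*.html"))
```

Something like `extract_htmlz("My Book.htmlz", "My Book")` would leave a folder containing the HTML file and the image subfolder, ready for the Obsidian import.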

What is missing with this process?
I do not yet have a good way of automatically dealing with footnotes, which can be a pain with books that have many.
I would very much love to somehow regex the calibre internal HTML links to proper footnotes.
If there is anyone capable and willing to help with that matter, I’d be very grateful indeed.
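
As a starting point, here is a rough Python sketch of such a regex. It rests on an assumption that needs checking against your own imports: that a footnote shows up in the Markdown as a link whose text is only a number, such as `[1](#calibre_link-20)`, and that a line starting with such a link is the note itself. The function name is made up for illustration.

```python
import re

# Assumed shape of an imported footnote link: [12](#calibre_link-345)
NOTE_LINK = re.compile(r"\[(\d+)\]\(#calibre_link-\d+\)")


def convert_footnotes(markdown: str) -> str:
    """Turn numbered calibre links into Markdown footnotes.

    A numbered link at the start of a line is treated as the footnote
    definition; any other numbered link is treated as a reference.
    """
    converted = []
    for line in markdown.splitlines():
        match = NOTE_LINK.match(line)
        if match:
            # Definition: "[1](#calibre_link-7) Text" -> "[^1]: Text"
            rest = line[match.end():].lstrip(" .")
            line = f"[^{match.group(1)}]: {rest}"
        # Reference: "claim.[1](#calibre_link-20)" -> "claim.[^1]"
        line = NOTE_LINK.sub(r"[^\1]", line)
        converted.append(line)
    return "\n".join(converted)
```

One known limitation: if the book restarts footnote numbering in every chapter, the numbers will collide, so the sketch would have to be run per chapter or extended to generate unique labels.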

So, summing up: this is my workflow. It is very fast and pretty efficient, and it provides me with a full copy of the book in my vault, which I find very convenient when working with some of my literature. I hope this is helpful to somebody.
Have fun. And keep rocking!

jpfieber · December 6, 2023, 3:23pm

I went through a similar struggle and found the same result, that Calibre to HTMLZ was the ‘cleanest’ way to get data from an EPUB to MD. Agreed that it would be useful to further clean up and simplify this process!

zettelstraum · December 7, 2023, 7:20pm

Ahh, that’s interesting!

Did you find a way to deal with footnotes?

Also, at times, Calibre creates internal links to headings from a TOC. I find this particularly nasty since I never use a TOC inside the note, but only navigate via the outline.

Do you have an idea of how to batch remove them?

They look like this, each with its own number:

[Chapter 1](#calibre_link-86)
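
One possible way to batch remove them, as a minimal Python sketch: it assumes every unwanted link has exactly the shape shown above (some text followed by a `#calibre_link-` target and a number) and replaces the link with its plain text. The function name is made up for illustration.

```python
import re

# Matches links such as [Chapter 1](#calibre_link-86)
CALIBRE_LINK = re.compile(r"\[([^\]]+)\]\(#calibre_link-\d+\)")


def unwrap_calibre_links(markdown: str) -> str:
    """Replace each calibre internal link with its visible text."""
    return CALIBRE_LINK.sub(r"\1", markdown)
```

Note that this also unwraps footnote links that point to a `#calibre_link-` target, so any footnote handling would have to run first.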

jpfieber · December 7, 2023, 7:39pm

I usually delete the TOC as it’s largely useless. I don’t have a programmatic way of dealing with footnotes, usually just delete them when I see them, but would love a good way of keeping them in a correctly formatted way.

obsidina · December 11, 2023, 3:23pm

You can try this plugin: Epub Importer.