I have a few thousand newspaper issues, spanning more than three decades, from which I need to extract a multi-page table. Each issue lists around 2,000–2,500 ship names in a six-column table that is split into three horizontal sections per page. Each section continues the same table, and the table carries on in the same way across multiple pages.
Is there any plugin for Obsidian (or perhaps Zotero) that can extract these tables and save each section as a Markdown table, or ideally merge all six to eight pages into a single Markdown table with wiki-links?
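To make the target output concrete, here is a minimal sketch of the merge step I have in mind. It assumes each section has already been turned into a list of rows (lists of cell strings). The function names, the header labels in the usage note, and the choice of which column gets the wiki-link are my own placeholders, not something an existing plugin provides.

```python
from typing import Iterable, List

Row = List[str]


def escape_cell(text: str) -> str:
    """Collapse whitespace and escape pipes, which would otherwise end a table cell."""
    return " ".join(text.split()).replace("|", "\\|")


def to_markdown_table(header: Row,
                      sections: Iterable[Iterable[Row]],
                      link_column: int = 0) -> str:
    """Merge rows from several sections (in reading order) into one Markdown table.

    The cell in `link_column` is wrapped as an Obsidian-style wiki-link.
    """
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for section in sections:
        for row in section:
            cells = [escape_cell(c) for c in row]
            # Pad short rows and truncate long ones so every row matches the header width.
            cells = (cells + [""] * len(header))[:len(header)]
            if cells[link_column]:
                cells[link_column] = f"[[{cells[link_column]}]]"
            lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)
```

Calling `to_markdown_table(["Ship", "Master"], [section_1, section_2, section_3])` for every page in order would give one table per issue, with each ship name linking to its own note. The two header labels are only an example; the real table has six columns.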
I’m about to build a local Python script using AI-assisted text recognition (OCR), possibly with Tesseract and OpenCV, and a local LLM for automated text learning. But before I start, I wanted to check here whether any existing tools already support this kind of layout-aware extraction locally. Each file is over 2 GB, so uploading to services like Transkribus is not an option.
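For context, this is roughly the layout-aware step I expect to write myself. It is only a sketch under assumptions: the word boxes would come from an OCR pass over one cropped section (for example the `left`, `top`, `width`, `height` and `text` fields that `pytesseract.image_to_data` reports), and the column divider positions would be found separately, perhaps with OpenCV line detection. The `Word` class, the function names, and the `row_tolerance` value are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Word:
    """One OCR word box in pixel coordinates (hypothetical container)."""
    text: str
    left: int
    top: int
    width: int
    height: int


def column_index(x_center: float, boundaries: Sequence[int]) -> int:
    """Return the column a point falls in, given ascending divider x positions."""
    for i, boundary in enumerate(boundaries):
        if x_center < boundary:
            return i
    return len(boundaries)


def words_to_rows(words: Sequence[Word],
                  boundaries: Sequence[int],
                  row_tolerance: int = 10) -> List[List[str]]:
    """Group word boxes into table rows and columns.

    A new row starts when a word's vertical center is more than
    `row_tolerance` pixels away from the first word of the current row.
    """
    n_cols = len(boundaries) + 1
    rows: List[List[List[Word]]] = []
    row_y = 0.0
    for w in sorted(words, key=lambda w: (w.top + w.height / 2, w.left)):
        y = w.top + w.height / 2
        if not rows or abs(y - row_y) > row_tolerance:
            rows.append([[] for _ in range(n_cols)])
            row_y = y
        col = column_index(w.left + w.width / 2, boundaries)
        rows[-1][col].append(w)
    # Within each cell, restore left-to-right reading order before joining.
    return [
        [" ".join(x.text for x in sorted(cell, key=lambda x: x.left)) for cell in row]
        for row in rows
    ]
```

With five divider positions this yields six cells per row. Each of the three sections on a page would be cropped and processed on its own, and the resulting row lists concatenated in reading order.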
I plan to use the extracted data in a combined genealogy and Norwegian mercantile ship research project that I tinker with in my spare time, using Obsidian, Foam for VS Code, Aeon Timeline, Gramps, and Tulip/Cytoscape/Gephi.
Note: This text was translated and corrected by Copilot for flow and readability, from Norwegian to English.