Python script to scrape articles and add links to topics

Nebucatnetzer · October 13, 2020, 9:47pm

Hi everyone,

inspired by this post: https://forum.obsidian.md/t/clip-an-article-straight-to-your-obsidian-with-one-click/ I decided to write my own little script.
You can find it here: https://github.com/Nebucatnetzer/url2markdown

To be upfront: it is still at a very early stage, but it works well enough to show what I’m trying to achieve.
The main motivation behind this script was that I wanted a way to archive articles after I’ve read them, so that I can reference them again later.
Currently I’m doing this with Wallabag. Wallabag is a great application, but it is a bit clunky and saves the articles into a database instead of files. It does, however, have a mobile application, which is great.

What I’m trying to achieve with this project:

Have an extension in the desktop browser to download the article with one click and extend it with the required topics.
Collect URLs on the go and possibly link to related topics.

So far the script I created does the following:

Download the content from an article’s URL
Convert it to markdown.
If topics are provided, it adds them to the header in the form of Obsidian’s wiki-style links.
It can batch download all articles in a given file and add the related topic links to the header. With this I can add my read articles to a note on my phone, add the related topics and later download them from my computer or find a way to automate the download.
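
The topic-header and batch steps above might look roughly like the sketch below. It is not taken from the url2markdown repository: the function names, the `<url> | topic, topic` line format for the batch file, and the `Source:`/`Topics:` header layout are all assumptions made for illustration. Downloading and the HTML-to-Markdown conversion are left out so that the example runs offline.

```python
def topics_to_links(topics):
    """Turn topic names into Obsidian wiki-style links, e.g. 'Python' -> '[[Python]]'."""
    return [f"[[{topic.strip()}]]" for topic in topics if topic.strip()]


def build_header(url, topics):
    """Build a header to prepend to the converted Markdown (layout is hypothetical)."""
    lines = [f"Source: {url}"]
    links = topics_to_links(topics)
    if links:
        lines.append("Topics: " + " ".join(links))
    # A blank line separates the header from the article body.
    return "\n".join(lines) + "\n\n"


def parse_batch_line(line):
    """Parse one line of the assumed batch format: '<url> | topic one, topic two'."""
    url, _, topic_part = line.partition("|")
    topics = [topic.strip() for topic in topic_part.split(",") if topic.strip()]
    return url.strip(), topics


def parse_batch(text):
    """Parse a whole batch file, skipping empty lines."""
    return [parse_batch_line(line) for line in text.splitlines() if line.strip()]
```

A note written on the phone would then contain lines such as `https://example.com/article | Python, Obsidian`, and each parsed entry could be passed to the download step together with its topics.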

What is missing:

Better scraping; currently there is still too much JavaScript and other clutter inside the Markdown.
An easier way to configure the application. At the moment there is not really a way to configure it; everything is hard-coded.
Download the article’s images and save them to a related folder or similar.
Packaging it for PyPI.
Make it work with the “External Application Button” extension (this is already halfway there).

I’m just leaving this here in case someone wants to test it and provide feedback. Please note that I might not be able to include all ideas and wishes since this is only a little fun project.

Elhoom · October 17, 2024, 1:13pm

Hey, great project!

It sounds like you’re well on your way to building a super useful tool for archiving articles. I like the idea of converting articles to markdown and linking them with Obsidian’s wiki style—perfect for future referencing!

For your next steps, if you’re aiming for better scraping (like removing excess JavaScript), you could look into libraries like BeautifulSoup or Readability to help clean up the content before converting it to markdown. Also, for downloading images, integrating with requests or aiohttp could automate the saving process to a folder.
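
As a rough illustration of the clean-up idea, here is a minimal sketch that drops `<script>`, `<style>` and `<noscript>` blocks before conversion and collects image URLs for a later download step. To keep it dependency-free it uses the standard library's `html.parser` instead of BeautifulSoup or Readability, so it is a stand-in for the same idea rather than what those libraries do internally. The class and function names are made up for this example, and the actual image download (for instance with requests) is not shown.

```python
import html
from html.parser import HTMLParser


class ArticleCleaner(HTMLParser):
    """Rebuild HTML without script/style/noscript blocks and record image URLs.

    Comments and doctype declarations are dropped as a side effect.
    """

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()  # convert_charrefs=True: text arrives unescaped
        self.parts = []      # pieces of the cleaned HTML
        self.images = []     # values of <img src="..."> attributes
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
            return
        if self._skip_depth:
            return
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)
        # Re-emit the tag exactly as it appeared in the source.
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags (<img ... />) have no separate end tag to emit.
        if tag not in self.SKIP:
            self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            if self._skip_depth:
                self._skip_depth -= 1
            return
        if not self._skip_depth:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip_depth:
            # Escape again because the parser handed us unescaped text.
            self.parts.append(html.escape(data, quote=False))


def clean_html(raw_html):
    """Return (cleaned_html, image_urls) for one page."""
    cleaner = ArticleCleaner()
    cleaner.feed(raw_html)
    cleaner.close()  # flush any buffered text
    return "".join(cleaner.parts), cleaner.images
```

With BeautifulSoup installed, the equivalent removal is usually written as a loop over `soup(["script", "style"])` that calls `decompose()` on each tag; the cleaned HTML would then go to whatever Markdown converter the script already uses.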

In terms of packaging, once you’ve refined it a bit more, uploading to PyPI will definitely make it easier for others to install and use. For anyone interested, here’s the repo: https://github.com/Nebucatnetzer/url2markdown

Also, if scraping detection becomes an issue, tools like Multilogin might help you avoid getting blocked while gathering content. You can check it out here.

Looking forward to seeing how this evolves!