How to parse an entire website (html)

tonymynd · May 29, 2022, 10:29pm

Hello everyone:
I have an old website that is pure HTML, and most of its pages link to other pages on the same site. It has about 500 pages.

I want to change every link of the form href=internal-page.html to [[internal-page]]

I tried HTMLAsText (NirSoft) to strip the HTML, but the result is still far from “good enough”.

Any ideas?

I-d-as · May 30, 2022, 6:26am

Cool idea!

I could have used something like this back when I was converting various software documentation sets into personal vaults in Obsidian. I created this help topic back then: Is it easier/better to convert .htm software documentation to .md or save PDFs and link manually?

In the end, I was able to make things work using some search and replace with a little regex. I actually recently explained my process, if you are interested. Not sure it’s actually relevant for you, but here’s the link: Note Composer: links to blocks and headers should be updated when extracting - #10 by I-d-as

Good luck!

tonymynd · May 30, 2022, 8:05am

Thanks so much bro, love your nickname, I-d-as. Will check it out and share. Best and Greatest for All.

I-d-as · May 30, 2022, 10:00am

You’re welcome! Thank you.

As explained in the second link from my earlier reply, you could first back up your vault, then open it in VS Code and use the Edit > Replace in Files command with regex toggled on to search for

href=(.*)\.html

and replace all occurrences with

[[$1]]

Hope this helps. Again, make sure you save a backup first. And I recommend reviewing all the occurrences that will be altered before committing to the change.
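
One caveat on the pattern above: `(.*)` is greedy, so on a line with two links it can swallow everything up to the last `.html`, and it keeps any quotes around the href. If a script is an option, here is a rough Python sketch of the same idea, not a tested converter for your site. It assumes internal links are relative `.html`/`.htm` hrefs inside ordinary `<a>` tags and that anything containing `://` is external. The names `convert_links`, `preview_links`, and `convert_folder` are made up for this example.

```python
import re
from pathlib import Path

# Matches <a ... href=page.html ...>label</a>, quoted or unquoted href.
# Group 1 is the target without its .html/.htm extension, group 2 the label.
ANCHOR = re.compile(
    r'<a\b[^>]*?\bhref\s*=\s*["\']?([^"\'\s>#]+)\.html?(?:#[^"\'\s>]*)?["\']?[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)

def to_wikilink(match):
    """Turn one matched anchor into [[page]] or [[page|label]]."""
    target, label = match.group(1), match.group(2)
    if "://" in target:
        return match.group(0)  # assumption: links with a scheme are external, keep them
    name = target.rsplit("/", 1)[-1]               # drop any folder prefix
    label = re.sub(r"<[^>]+>", "", label).strip()  # drop tags nested in the label
    if not label or label == name:
        return f"[[{name}]]"
    return f"[[{name}|{label}]]"

def convert_links(html):
    """Replace every internal anchor in a string with a wikilink."""
    return ANCHOR.sub(to_wikilink, html)

def preview_links(html):
    """List (old, new) pairs so the changes can be reviewed before writing anything."""
    pairs = []
    for match in ANCHOR.finditer(html):
        new = to_wikilink(match)
        if new != match.group(0):
            pairs.append((match.group(0), new))
    return pairs

def convert_folder(src_dir, out_dir):
    """Write converted copies of all .html files under src_dir into out_dir."""
    src, out = Path(src_dir), Path(out_dir)
    for page in src.rglob("*.html"):
        text = page.read_text(encoding="utf-8", errors="replace")
        dest = out / page.relative_to(src)
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_text(convert_links(text), encoding="utf-8")
```

Calling `convert_folder("site", "site-converted")` (both folder names are placeholders) leaves the originals alone and writes the converted copies next to them, so you can diff the two. It only rewrites the links; stripping the rest of the HTML is a separate step.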

Keep us posted. Good luck!

tonymynd · November 21, 2024, 9:02pm

I did it with help from OpenAI, and it was like a walk in the park. LOL