Publish: let the user provide a custom robots.txt (block crawling, LLM content protection)

For Obsidian Publish, it would be nice to be able to configure a robots.txt file to gain additional protection against LLMs crawling the site.

Use case or problem

When the site is public, there is a risk of LLMs training on or reproducing proprietary content without crediting the source.

Proposed solution

Add an option in settings to generate a robots.txt file that protects the site’s content from LLM crawlers, for example:

# Common Crawl’s crawler; its corpus is widely used as LLM training data
User-agent: CCBot
Disallow: /

# User agent ChatGPT uses when browsing pages on a user’s behalf
User-agent: ChatGPT-User
Disallow: /

Current workaround (optional)

Add a password




Hello! With OpenAI’s announcement that they will now allow us to disallow their crawler from ingesting our websites, has there been any progress on allowing us to apply this block on our Obsidian Publish websites? Thank you!
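
For context, the OpenAI announcement refers to their GPTBot crawler. A minimal sketch of the directives such a block would add, using the user-agent token from OpenAI’s documentation:

# OpenAI’s crawler used to gather training data
User-agent: GPTBot
Disallow: /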


Thank you for filing this feature request. I second this. If supporting full robots.txt editing is not possible, then at least add a switch in the preferences for this.


Just chiming in to support this one: I realize that scrapers can choose to ignore robots.txt (and the disreputable ones absolutely will), but I also want to opt my content out of OpenAI, Microsoft, and related scraping.
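
To illustrate what opting out of those vendors could look like, here is a sketch of a combined robots.txt. The GPTBot, ChatGPT-User, CCBot, and Google-Extended tokens are publicly documented; Microsoft has not published a separate AI-training-only user agent, so the Bingbot entry is left commented out, since enabling it would also remove the site from Bing search results.

# OpenAI: training crawler and on-demand browsing agent
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

# Common Crawl: corpus widely used as LLM training data
User-agent: CCBot
Disallow: /

# Google: opt out of Gemini/Vertex AI training without affecting Google Search
User-agent: Google-Extended
Disallow: /

# Microsoft has no separate AI-training agent; blocking Bingbot
# would also delist the site from Bing search
# User-agent: bingbot
# Disallow: /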