Publish: let the user provide a custom robots.txt (block crawling, LLM content protection)

For Obsidian Publish, it would be nice to be able to configure a robots.txt file to gain additional protection against LLMs crawling the site.

Use case or problem

When the site is public, there is a risk of LLMs training on or reproducing proprietary content without crediting the source.

Proposed solution

Add an option in settings to generate a robots.txt file that protects the site’s content from LLM crawlers, for example:

# Common Crawl’s crawler; its corpus is widely used as LLM training data
User-agent: CCBot
Disallow: /

# User agent ChatGPT uses when browsing pages on a user’s behalf
User-agent: ChatGPT-User
Disallow: /

Current workaround (optional)

Add a password




Hello! With OpenAI’s announcement that they will now allow us to disallow their crawler from ingesting our websites, has there been any progress on allowing us to apply this block on our Obsidian Publish websites? Thank you!
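
For context, the OpenAI announcement refers to their GPTBot crawler. A minimal sketch of the directives such a block would add, using the user-agent token from OpenAI’s documentation:

# OpenAI’s crawler used to gather training data
User-agent: GPTBot
Disallow: /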


Thank you for filing this feature request. I second this. If supporting full robots.txt editing is not possible, then at least add a switch in the preferences for this.


Just chiming in to support this one: I realize that scrapers can choose to ignore robots.txt (and the disreputable ones absolutely will), but I also want to opt my content out of OpenAI, Microsoft, and related scraping.
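
To illustrate what opting out of those vendors could look like, here is a sketch of a combined robots.txt. The GPTBot, ChatGPT-User, CCBot, and Google-Extended tokens are publicly documented; Microsoft has not published a separate AI-training-only user agent, so the Bingbot entry is left commented out, since enabling it would also remove the site from Bing search results.

# OpenAI: training crawler and on-demand browsing agent
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

# Common Crawl: corpus widely used as LLM training data
User-agent: CCBot
Disallow: /

# Google: opt out of Gemini/Vertex AI training without affecting Google Search
User-agent: Google-Extended
Disallow: /

# Microsoft has no separate AI-training agent; blocking Bingbot
# would also delist the site from Bing search
# User-agent: bingbot
# Disallow: /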