Disallow LLM (AI/ML) Scraping Using Google-Extended on Obsidian Publish

Use case or problem

Google has confirmed that it trains its AI models on publicly accessible websites.

It has since provided an opt-out mechanism for this.

Proposed solution

Use the Google-Extended product token in the site’s robots.txt to disallow scraping Publish sites for training data.
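For reference, the entry would look something like the following (Google-Extended is the product token Google documents for opting out of AI training; since Publish users can't edit robots.txt themselves, this would need to be added on Obsidian's side):

```
User-agent: Google-Extended
Disallow: /
```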

I think this should be the default behavior; people can opt in down the line if it suits them.

I recognize that Google may not even respect it, but I would rather not even imply that permission was ever given.

Current workaround (optional)

No known workaround exists other than blocking all web crawlers or putting everything behind a password.
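Blocking all web crawlers would mean a blanket rule like the one below, which also removes the site from ordinary search results, so it is a much blunter instrument than targeting Google-Extended alone:

```
User-agent: *
Disallow: /
```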

I think this is too similar to the related FR you linked. I am going to close this and continue the conversation there.