Data Donation request

It's certainly an interesting idea.

My concern is that anything modelled would be limited by the way ideas have historically been linked to one another. Assumptions must be made about what counts as an appropriate link, which may limit our ability to connect thoughts creatively.

If the goal is to link similar ideas, then maybe a cruder solution like Jaccard Similarity is appropriate.
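For reference, Jaccard similarity over two notes is simple enough to sketch in a few lines of plain Python (the tokeniser here is deliberately naive; a real plugin would want stemming, stopword removal, etc.):

```python
# Jaccard similarity between two notes, treating each note as a set of
# lowercase word tokens: |A ∩ B| / |A ∪ B|.
# 1.0 means identical vocabularies; 0.0 means no words in common.
import re

def tokens(text: str) -> set[str]:
    """Split a note into a set of lowercase word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def jaccard(note_a: str, note_b: str) -> float:
    """Jaccard similarity over the two notes' token sets."""
    a, b = tokens(note_a), tokens(note_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

A plugin could suggest a link whenever the score between two notes exceeds some threshold, with no model training involved at all.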

If the goal is to identify novelty by linking together seemingly incompatible ideas, I feel that requires a leap in machine understanding of what a note is trying to express.

Perhaps you had a different goal in mind?

I tend to agree. I applaud the effort, and I’m interested in seeing where it goes, but there’s a non-trivial number of assumptions going into how any individual structures their thinking. Without some kind of standardization or a Google-level volume of data, I suspect that any models will be a little lacking.

Yet TF-IDF on its own does seem too crude. I think we can do better, e.g., by letting users label and weight concepts and clusters picked up by a TF-IDF model.
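To make the "label and weight concepts" idea concrete, here is a minimal sketch of TF-IDF cosine similarity with an optional per-term user weight, so labelled concepts count more. This is a hypothetical illustration, not tied to any existing plugin API:

```python
# TF-IDF cosine similarity over a small corpus of notes, where a
# user-supplied {term: weight} dict boosts labelled concepts.
import math
from collections import Counter

def tfidf_vectors(notes, user_weights=None):
    """Return one sparse {term: weight} vector per note."""
    user_weights = user_weights or {}
    docs = [note.lower().split() for note in notes]
    n = len(docs)
    # document frequency of each term
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {}
        for term, count in tf.items():
            idf = math.log(n / df[term]) + 1.0   # smoothed IDF
            boost = user_weights.get(term, 1.0)  # user-labelled concept weight
            vec[term] = count * idf * boost
        vectors.append(vec)
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0
```

For example, passing `user_weights={"graph": 5.0}` makes the shared concept "graph" dominate the similarity between two graph-related notes.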

I agree that it would be limited by current data. But I think this would be highly useful in two specific scenarios:

  1. Someone new to PKM systems. It would give you hints (based on what other users have done, of course) about which other notes you may want to link to the current one (it’s up to the user to add the appropriate context).
  2. Someone with a large graph who may have forgotten about some old note. Imagine an older note, in a completely different part of your vault, that is somewhat complementary to your current note. That would be useful to know about!

About the lack of novelty: if we can get enough data (from multiple users, that is), the model would learn to link notes using diverse methodologies, so it could give you suggestions that you would not think of yourself, but the “Obsidian hive mind” would.

About the leap in machine understanding, I don’t feel it’s too far-fetched. There are plenty of works using pre-trained language models (e.g., BERT, T5) that deal pretty well with the link prediction problem (https://www.aclweb.org/anthology/2020.coling-main.153.pdf)


The Google-level volume of data is already there, in large pre-trained language models like BERT, Electra, etc. It has been shown that they need very little data to be fine-tuned to a specific dataset.

Honestly, if a handful of users with hundreds of links each contribute data, we are in a good spot to begin with. =)


I will try to do some proof of concepts over the weekend. (Quite busy with a conference this week)

Wouldn’t it be better for a plugin to have a training mode so that it could learn from that user’s data?

I think this is an awesome idea. Unfortunately many people’s notes contain private information or information that may be proprietary (mine included). I also expect that the way people build their vault (free form and on any possible topic imaginable) will make it difficult to train in a way that can be reused between people’s vaults.

A few slightly different ideas:

I don’t think we should need to link things manually; rather, we should be guided to capture information consistently, e.g., structure, spelling, terms, etc. For me that’s the main function of links: so I can refer to something the same way and therefore easily find it and build upon it. What would be helpful is a way to maintain that consistency by working out similar terms, ideas, word roots, etc. and:

  1. Be able to use this knowledge for search and ranking.

For me one of the weakest aspects, and biggest opportunities, of Obsidian is the search and ranking. It would be huge to be able to have a search that can take plain language understanding into account when ranking results. Doing this well also removes some of the need for manual linking and ultimately you might be able to use this knowledge to build graphs automatically on the fly.

  2. Help a user during authoring to brainstorm, get novel input from their vault, and stay consistent.

Imagine a pane that shared snippets of pre-existing notes, common terms or phrases, etc. that could help a user with recall, knowledge, consistency, etc. This is how our memories work and I see this as being a huge benefit. ML could build an index that would be used for this purpose - essentially a real time background search using capabilities from 1. above.

I would imagine a Wikipedia dataset being a good place to start (far from perfect, because users vary widely in what they want to do). Take a subset, remove the links, then train a model to create links and reward it when the links match the original dataset.
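The training-data construction described above is mostly string processing: strip the wiki-style `[[links]]` out and keep the link targets as the labels the model is rewarded for recovering. A rough sketch:

```python
# Build link-prediction training pairs from wiki-style text: remove the
# [[Target]] / [[Target|alias]] markup and keep (plain_text, targets),
# so a model can be trained to recover the original links.
import re

LINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")

def make_example(raw_note: str):
    """Return (plain_text, targets) for one note."""
    targets = LINK.findall(raw_note)

    def replace(match):
        inner = match.group(0)[2:-2]  # strip the [[ ]]
        # keep the display alias if present, else the target text itself
        return inner.split("|", 1)[1] if "|" in inner else match.group(1)

    plain = LINK.sub(replace, raw_note)
    return plain, targets
```

Running this over a Wikipedia dump (or a donated vault) yields pairs of plain text and the links a human actually made, which is exactly the supervision signal the reward scheme needs.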

For anyone interested, I’ve built a very basic structural analysis plugin:

It shows centrality, similarity, and link prediction measures 🙂 I’ve messaged Arthur about it already, but perhaps someone else finds it useful too!
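For anyone curious what a purely structural link-prediction measure looks like, here is a sketch of the common-neighbours idea using the Adamic–Adar score (the function names are mine, not the plugin's): unlinked notes that share many neighbours are good link candidates.

```python
# Adamic-Adar link prediction on an undirected note graph:
# score(a, b) = sum over common neighbours c of 1 / log(degree(c)).
# Rare shared neighbours count more than hub notes linked to everything.
import math
from itertools import combinations

def suggest_links(graph: dict[str, set[str]]):
    """graph maps note -> set of linked notes (undirected).
    Returns (note_a, note_b, score) for unlinked pairs, best first."""
    scores = []
    for a, b in combinations(sorted(graph), 2):
        if b in graph[a]:
            continue  # already linked, nothing to suggest
        common = graph[a] & graph[b]
        score = sum(1.0 / math.log(len(graph[c]))
                    for c in common if len(graph[c]) > 1)
        if score > 0:
            scores.append((a, b, score))
    return sorted(scores, key=lambda t: -t[2])
```

No note content is needed at all; the suggestions come purely from the shape of the existing link graph.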


@acamara happy to provide you with data 🙂

That would be nice to have. However, using neural nets would make it quite complex unless the user has a decent GPU setup. But setting up an online server where users can upload their data, train a personalised model, and download the trained model could be useful. =)

I agree. That’s why you should “sanitise” your vault before submitting it to anyone.

True. But I imagine a model could perhaps capture how a user’s note-taking style relates to how they link? I don’t know. Need to train a model and see what happens.

YES. 100% YES. Current search is not very good. Even a simple BM25 algorithm (a purely probabilistic approach) should improve it greatly. However, given the small size of the notes, vocabulary mismatch is a big problem (say, if you search for USA, notes with United States will not show up), and neural models excel at this. This is a trivial follow-up from the current proposal. (BTW, this is one of the foci of my PhD research)
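For context, BM25 itself fits in a short function; this compact sketch uses the usual default parameters and a deliberately naive tokeniser. It also makes the vocabulary-mismatch point obvious: a query for "usa" scores zero against a note that only says "united states".

```python
# Compact BM25 ranking: score each note against a query and return note
# indices, best match first. k1 and b are the conventional defaults.
import math
from collections import Counter

K1, B = 1.5, 0.75

def bm25_rank(query: str, notes: list[str]) -> list[int]:
    docs = [n.lower().split() for n in notes]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n          # average note length
    df = Counter(t for d in docs for t in set(d))  # document frequency
    scores = []
    for i, d in enumerate(docs):
        tf = Counter(d)
        s = 0.0
        for t in query.lower().split():
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (K1 + 1) / \
                 (tf[t] + K1 * (1 - B + B * len(d) / avgdl))
        scores.append((s, i))
    return [i for s, i in sorted(scores, reverse=True)]
```

This is the probabilistic baseline; the neural models discussed above would sit on top to bridge the USA / United States vocabulary gap.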

This is also doable and interesting.

That’s quite straightforward, and many people have done it already. Here’s a 2018 paper that did just that: https://shiruipan.github.io/publication/yang-binarized-2018/yang-binarized-2018.pdf

That’s great! Can you Zip your vault and send it to my email? camara.arthur <at> gmail . com
(please, remove any personal notes or notes with confidential/sensitive data)

Hi!

Something that would be interesting to research is:

How do people use links today?

How do people want to use links?

I’ve seen people use links in different ways: some links are argumentative, others Wikipedia-like, etc.

I myself have links for different purposes in my vault.

If you find it useful, I wrote a plugin to try-out algorithms written in python.


Agree on most of these points.

If we could get a large enough sample size, and develop tools that allow users to select different weightings for desired relationships, this could be very useful in overcoming personal biases and uncovering new connections.

I see what you mean about different types of links.

🤔 Maybe we can first try to classify the different types of [[links]]? Then create one model per type of link, and let the user give a weight to each of them.
(I’m just brainstorming here).
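Continuing the brainstorm, the per-type weighting could be as simple as a weighted average of one score per link type. The link types and score functions below are stand-ins; in practice each would be a separately trained predictor:

```python
# Combine one suggestion model per link type into a single score, using
# user-chosen importance weights. Each model is a callable
# score_fn(note_a, note_b) -> float in [0, 1].

def combine_scores(models, user_weights, note_a, note_b):
    """Weighted average of per-link-type scores; types missing from
    user_weights default to weight 1.0."""
    total_w = sum(user_weights.get(t, 1.0) for t in models)
    weighted = sum(
        user_weights.get(t, 1.0) * fn(note_a, note_b)
        for t, fn in models.items()
    )
    return weighted / total_w
```

A user who cares mostly about argumentative links would simply give that type a larger weight, and the combined suggestions shift accordingly.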

@acamara FYI with respect to search ranking this feature request might be worthy of support:
https://forum.obsidian.md/t/sort-search-results-by-relevance-and-what-relevance-is/

I would love to see if we could structure an Obsidian note database into something like this to create a knowledge graph.

Notes could be transformed from unstructured text into hyperstructured data through entity extraction and classification.

This could be a good basis for building a shared model of relationships between ideas.


Figures lifted from AI Powered Search
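As a toy illustration of the entity-extraction step, here is a deliberately naive sketch that treats capitalised phrases as candidate entities and links pairs that co-occur in a sentence. A real pipeline would use a proper NER model; everything here (the regex, the `co-occurs_with` relation) is a placeholder:

```python
# Naive note -> knowledge-graph-triples sketch: capitalised phrases are
# candidate entities; any two entities in the same sentence get a
# generic co-occurrence edge.
import re
from itertools import combinations

ENTITY = re.compile(r"\b(?:[A-Z][a-z]+(?:\s[A-Z][a-z]+)*)\b")

def note_to_triples(note: str):
    triples = []
    for sentence in re.split(r"[.!?]", note):
        # dedupe entities while keeping their order of appearance
        entities = list(dict.fromkeys(ENTITY.findall(sentence)))
        for a, b in combinations(entities, 2):
            triples.append((a, "co-occurs_with", b))
    return triples
```

Swapping the regex for real entity extraction and the generic relation for classified edge types would give the kind of knowledge graph shown in the figures.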


A bit late to the party, but I realised that since you mentioned Google, technically the whole internet is pretty much like a vault, with hyperlinks being the uni-directional links.

Especially given that a huge number of crawlers have already archived these webpages and their hyperlinks, it’s an even better place to start.

Another point I wanted to make: rather than having a direct connection from one note to another, would it make sense to go from Note A to Note B via a label note which contains the reason for linking A to B?

Not only do we quantify the connection between the notes, but we also elaborate on why the connection exists in the first place, making the links richer in nature. This would probably mimic GraphQL, I guess.

Also, AFAIK there are already a bunch of vaults by creators that are open to all and in active use, e.g.:

LorenzDuremdes/Second-Brain: Building my Second Brain using Obsidian.md (github.com)

swyxio/brain: Swyx’s second brain! (github.com)