Data Donation request

Hello there!
My name is Arthur, and I’m a PhD in Computer Science from TU Delft, in The Netherlands, working specifically with Natural Language Processing and Information Retrieval.

I feel that something that would improve Obsidian a LOT (or any other PKM system) is a more robust “Unlinked Mentions” tab. By that, I mean some type of machine learning system for predicting links between notes (this is an active research area called, you guessed it, link prediction, though it mainly focuses on large graphs of entities or friends, for instance).

So, I propose developing a plugin that would, given the current open note, improve upon the current “Unlinked Mention” tab by proposing other notes to be linked to the current one. The user would still have to actively accept the suggestion and add the context, of course.

To do so, the first thing we need is a large enough dataset of linked notes, where experts (i.e., Obsidian users) have a large number of linked notes in context, so that an algorithm (probably a Transformer, a BERT-like model, for those who like that) can process the semantics of the notes and learn whether two notes should be linked or not.
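To make the dataset idea concrete, here is a minimal sketch of how training pairs could be assembled from a donated vault. Everything here is an assumption for illustration (flat folder of `.md` files, the `[[wikilink]]` pattern, the `build_pairs` name), not an actual plugin API: pairs of notes the author actually linked become positive examples, and randomly sampled unlinked pairs become negatives.

```python
import random
import re
from pathlib import Path

# capture the target of [[Note]], [[Note|alias]], or [[Note#heading]]
WIKILINK = re.compile(r"\[\[([^\]|#]+)")

def build_pairs(vault_dir, neg_per_pos=1, seed=0):
    """Return (source, target, label) tuples: 1 = the author linked them,
    0 = a randomly sampled unlinked pair of notes."""
    notes = {p.stem: p.read_text(encoding="utf-8")
             for p in Path(vault_dir).glob("*.md")}
    linked = set()
    for name, text in notes.items():
        for target in WIKILINK.findall(text):
            if target.strip() in notes:
                linked.add((name, target.strip()))
    rng = random.Random(seed)
    names = sorted(notes)
    negatives = set()
    # sample unlinked ordered pairs as negative examples
    while len(negatives) < neg_per_pos * len(linked):
        a, b = rng.choice(names), rng.choice(names)
        if a != b and (a, b) not in linked:
            negatives.add((a, b))
    return [(a, b, 1) for a, b in linked] + [(a, b, 0) for a, b in negatives]
```

Note the pairs are ordered: which note links *to* which matters, which is exactly the directional aspect discussed further down the thread.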

For that, we need data donors: people who already have a reasonably developed vault (or maybe not so developed) and are willing to donate their vault data for this cause.

So, my question is: is anyone willing to do so? Or does anyone know if such a dataset is already available?

Naturally, we will need a few people to help with that. So, feel free to jump in!


Hey @acamara ,

Big fan of the idea of suggesting links between notes.
In fact when first exploring software like Roam and Obsidian I assumed this was already a feature.

The closest I’ve seen to replicating this is a python script that tries to find the similarity between notes based on term frequency × inverse document frequency (TF-IDF) matching.

Would be keen to hear what you think of it.

That’s a nice starting point, even if quite crude!

From what I could get from their code, they compute the Jaccard similarity between the keywords (tags and links) and between the full texts, give an arbitrary weight to each similarity, and use that value to estimate how similar two texts are.
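For reference, a minimal sketch of that kind of weighted Jaccard scoring. The 0.7/0.3 weights are placeholders, just as arbitrary as the ones in the script being discussed, and the whitespace tokenisation is deliberately naive:

```python
def jaccard(a, b):
    """Jaccard similarity between two collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def note_similarity(keywords_a, keywords_b, text_a, text_b,
                    w_keywords=0.7, w_text=0.3):
    """Weighted combination of keyword overlap and full-text word overlap."""
    return (w_keywords * jaccard(keywords_a, keywords_b)
            + w_text * jaccard(text_a.lower().split(), text_b.lower().split()))
```

Note that this score is symmetric by construction, which is precisely its limitation for link prediction.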

What I had in mind is a step forward: not only checking whether two notes are similar, but whether one should be linked to the other (see the difference? One is bi-directional; the other, uni-directional).

This requires more data so a model can learn to predict if one note will be linked to another or not.
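To make the directional point concrete, here is a toy example of an asymmetric feature: whether note A’s title appears in note B’s body is not the same signal as the reverse, so the ordered pair (A, B) gets different features than (B, A). The feature itself is just an illustration, not a claim about what the eventual model would use:

```python
def pair_features(title_a, text_a, title_b, text_b):
    """Order-sensitive features for the directed candidate link A -> B."""
    return {
        # does the source note mention the target's title?
        "a_mentions_b": title_b.lower() in text_a.lower(),
        # the reverse signal, kept separate because linking is directed
        "b_mentions_a": title_a.lower() in text_b.lower(),
    }
```

A symmetric similarity score collapses these two signals into one; a link-prediction model can keep them apart.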

It’s certainly an interesting idea.

My concern would be that anything modelled would be limited by the way ideas have historically been linked to one another. Assumptions need to be made about what is an appropriate link or not, which may limit our ability to creatively connect thoughts.

If the goal is to link similar ideas, then maybe a cruder solution like Jaccard Similarity is appropriate.

If the goal is to identify novelty by linking together seemingly incompatible ideas, I feel that requires a leap in machine understanding of what is trying to be expressed within a note.

Perhaps you had a different goal in mind?

I tend to agree. I applaud the effort, and I’m interested in seeing where it goes, but there’s a non-trivial number of assumptions going into how any individual structures their thinking. Without some kind of standardization or a Google-level volume of data, I suspect that any models will be a little lacking.

Yet TF-IDF does seem too crude. I think we can do better, e.g., by letting users label and weight concepts and clusters picked up by a TF-IDF model.

I agree that it would be limited by current data. But I think this would be highly useful in two specific scenarios:

  1. Someone new to PKM systems. It would give you hints (based on what other users have done in the past, of course) on which other notes you may want to link to the current one (it’s up to the user to add the appropriate context, of course).
  2. Someone with a large graph who may be forgetting about some old note. Imagine there is an older note, in a completely different part of your vault, that is somewhat complementary to your current note. This would be useful to know!

About the lack of novelty: if we can get enough data (from multiple users, that is), the model would learn to link notes using diverse methodologies, so it could give you suggestions that you would not think of yourself, but the “Obsidian hive mind” would.

About the leap in machine understanding, I don’t feel it’s too far-fetched. There are plenty of works using pre-trained language models (i.e., BERT, T5, etc.) that deal pretty well with the link prediction problem (https://www.aclweb.org/anthology/2020.coling-main.153.pdf).


The Google-level volume of data is already there, in large pre-trained language models like BERT, Electra, etc. It has been shown that they need very little data to be fine-tuned to a specific dataset.

Honestly, if a handful of users with hundreds of links contribute data, we are in a good spot to begin with. =)


I will try to do some proofs of concept over the weekend. (Quite busy with a conference this week.)

Wouldn’t it be better for a plugin to have a training mode so that it could learn from that user’s data?

I think this is an awesome idea. Unfortunately many people’s notes contain private information or information that may be proprietary (mine included). I also expect that the way people build their vault (free form and on any possible topic imaginable) will make it difficult to train in a way that can be reused between people’s vaults.

A few slightly different ideas:

I don’t think we should need to link things manually; rather, we should be guided to capture information consistently, e.g. structurally, spelling, terms, etc. For me that’s the main function of links: so I can refer to something the same way and therefore easily find it and build upon it. What would be helpful is a way to maintain that consistency by working out similar terms, ideas, word roots, etc., and:

  1. be able to use this knowledge for search and ranking.

For me one of the weakest aspects, and biggest opportunities, of Obsidian is the search and ranking. It would be huge to be able to have a search that can take plain language understanding into account when ranking results. Doing this well also removes some of the need for manual linking and ultimately you might be able to use this knowledge to build graphs automatically on the fly.

  2. help a user during authoring to brainstorm, get novel input from their vault, and to be consistent.

Imagine a pane that shared snippets of pre-existing notes, common terms or phrases, etc. that could help a user with recall, knowledge, consistency, etc. This is how our memories work and I see this as being a huge benefit. ML could build an index that would be used for this purpose - essentially a real time background search using capabilities from 1. above.

I would imagine a Wikipedia dataset being a good place to start (far from perfect, because users vary widely in what they want to do). Take a subset, remove the links, then train a model to recreate the links, rewarding it when they match the original dataset.
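That masking step is easy to sketch. Assuming `[[Target]]` / `[[Target|label]]` markup, each link is replaced by its surface text, and the held-out targets become the supervision signal the model is rewarded for recovering:

```python
import re

LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")

def mask_links(text):
    """Replace each [[Target]] / [[Target|label]] with its surface text,
    returning (plain_text, targets) so the removed links can serve as labels."""
    targets = []
    def repl(m):
        target, label = m.group(1), m.group(2)
        targets.append(target)
        # keep the visible label if there is one, else the target itself
        return label if label else target
    return LINK.sub(repl, text), targets
```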

For anyone interested, I’ve built a very basic structural analysis plugin:

It shows centrality, similarity, and link prediction measures :slight_smile: I’ve messaged Arthur about it already, but perhaps someone else finds it useful too!


@acamara happy to provide you with data :slight_smile:

That would be nice to have. However, using neural nets would make it quite complex, unless the user has a decent GPU setup. But setting up an online server where users can upload their data, train a personalised model, and download the trained model could be useful. =)

I agree. That’s why you should “sanitise” your vault before submitting it to anyone.

True. But I imagine that maybe a model could capture how the user’s note-taking style relates to how they link? I don’t know. Need to train a model and see what happens.

YES. 100% YES. Current search is not very good. Even a simple BM25 algorithm (a purely probabilistic approach) should improve it greatly. However, given the small size of the notes, vocabulary mismatch is a big problem (say, if you search for USA, notes with United States will not show up), and neural models excel at this. This is a trivial follow-up to the current proposal. (BTW, this is one of the focuses of my PhD research.)
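For the curious, a minimal self-contained Okapi BM25 scorer with the usual k1 = 1.5, b = 0.75 defaults. The whitespace tokenisation is a simplification a real plugin would improve on:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in docs against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    # document frequency of each term across the collection
    df = Counter(t for d in tokenized for t in set(d))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores
```

As the post says, this is purely term-based: a query for “USA” still scores zero against a note that only says “United States”, which is where neural models come in.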

This is also doable and interesting.

That’s quite straightforward, and many people have done this already. Here’s a 2018 paper that did just that: https://shiruipan.github.io/publication/yang-binarized-2018/yang-binarized-2018.pdf

That’s great! Can you Zip your vault and send it to my email? camara.arthur <at> gmail . com
(please, remove any personal notes or notes with confidential/sensitive data)

Hi!

Something that would be interesting to research is:

How do people use links today?

How do people want to use links?

I’ve seen people use links in different ways; some links are argumentative, others Wikipedia-like, etc.

I myself have links for different purposes in my vault.

If you find it useful, I wrote a plugin to try-out algorithms written in python.


Agree on most of these points.

If we could get a large enough sample size and develop tools that let users select different weightings for the desired relationship types, this could be very useful in overcoming personal biases and uncovering new connections.

I see what you mean about different types of links.

:thinking: Maybe we can first try to classify the different types of [[links]]? Then, create one model per type of link, and the user can give a weight for each of them.
(I’m just brainstorming here).

@acamara FYI with respect to search ranking this feature request might be worthy of support:
https://forum.obsidian.md/t/sort-search-results-by-relevance-and-what-relevance-is/