Find similar notes (Python script)

macedotavares · December 3, 2020, 6:07pm

Hey!

I’ve been wanting this for quite some time, and I know that other people have, too. There’s already a plugin idea and a feature request.

Either as core functionality or as a plugin, I believe that a suggested/similar/related notes pane will eventually arrive.

In the meantime, I’m putting together a Python script that lets me do this in the Terminal:

It’s poorly coded, undocumented, uncommented, and less than half-baked. But it’s also on GitHub, so you can try it out as it is, make suggestions, or improve it yourself (create a pull request if you do).

Cheers!

Klaas · December 3, 2020, 6:15pm

Interesting. What is the similarity between notes based on? And what do the %-ages mean?

macedotavares · December 3, 2020, 6:22pm

It calculates the jaccard similarities of both explicit keywords (wiki links + tags) and full text.

The keyword similarity weighs a bit more on the score.

I’ll probably make the titles weigh in too.

0% means no overlap and 100% is an identical match.
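
For anyone curious what that looks like in practice, here is a minimal sketch of the scoring idea, not the code from the repository. The function names and the 0.6 keyword weight are assumptions made for illustration; the thread only says that keywords weigh "a bit more".

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # two empty sets: treat as no overlap
    return len(a & b) / len(a | b)


def combined_score(keywords_a, keywords_b, words_a, words_b, keyword_weight=0.6):
    """Blend keyword and full-text Jaccard scores (0.6 is an assumed weight)."""
    keyword_sim = jaccard(keywords_a, keywords_b)
    text_sim = jaccard(words_a, words_b)
    return keyword_weight * keyword_sim + (1 - keyword_weight) * text_sim


# Two toy notes: one shared tag out of three keywords, two shared words out of four.
score = combined_score(
    {"#ux", "[[Design]]"}, {"#ux", "[[Research]]"},
    "notes about design".split(), "notes about research".split(),
)
print(f"{score:.0%}")
```

In this toy example the keyword overlap is 1/3 and the text overlap is 2/4, so the blended score is 0.6 × 1/3 + 0.4 × 1/2 = 40%.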

Klaas · December 3, 2020, 6:32pm

Is it possible to let the user set the keywords? Or are we then talking about a plug-in type set-up?

Also, letting the title weigh in assumes similar/same titles exist, right? If so, it means that for me, who only uses unique titles, that aspect won’t be used. Right?

macedotavares · December 3, 2020, 6:42pm

The keywords I mentioned are actually all the wiki links and tags found inside the notes. So, if two notes have somewhat dissimilar contents, but share a lot of tags and wiki links, they’ll score higher than those without them.

When I implement the title part, it will behave the same way. It’s not necessary for them to be identical; they only need to share a few words (stop words are always ignored).
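
Since the title part isn't implemented yet, the following is only a sketch of how such a comparison could behave. The stop-word list is a tiny illustrative one and the function names are hypothetical.

```python
import re

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "of", "and", "to", "in", "on", "for", "with"}


def title_words(title):
    """Lowercase the title, split it into words, and drop stop words."""
    return {w for w in re.findall(r"\w+", title.lower()) if w not in STOP_WORDS}


def title_similarity(title_a, title_b):
    """Jaccard similarity of the non-stop words in two titles."""
    words_a, words_b = title_words(title_a), title_words(title_b)
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


similar = title_similarity("The history of design", "Design history and theory")
unrelated = title_similarity("Notes on typography", "Meeting with Ana")
```

Unique titles still contribute here: the first pair shares "history" and "design" and scores 2/3, while the second pair shares nothing and scores 0.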

romanov.maxim · December 3, 2020, 9:33pm

Are you using TFIDF for keyword identification?

macedotavares · December 3, 2020, 10:03pm

Only Jaccard Coefficient for now, but I’d like to experiment with TF-IDF later.

Keywords are just regexed out of [[wikilinks]] and #tags.
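
The actual patterns aren't shown in this thread, so the two regexes below are assumptions: a rough sketch of pulling link targets and tags out of a note's text. The tag pattern is a simplification and skips tags that start with a digit.

```python
import re

# Capture the link target, stopping before an alias (|), heading (#) or the closing ]].
WIKILINK_RE = re.compile(r"\[\[([^\]|#]+)")
# A '#' not preceded by a word character or another '#', followed by a tag name.
TAG_RE = re.compile(r"(?<![\w#])#([A-Za-z_][\w/-]*)")


def extract_keywords(text):
    """Return the set of wikilink targets and #tags found in a note."""
    links = {target.strip().lower() for target in WIKILINK_RE.findall(text)}
    tags = {"#" + tag.lower() for tag in TAG_RE.findall(text)}
    return links | tags


note = "See [[Design Systems|the system]] and [[UX#Research]] #ux #design/typography\n# Heading"
keywords = extract_keywords(note)
```

Markdown headings such as `# Heading` are not picked up as tags because the `#` must be followed directly by a letter or underscore.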

This is my first time working on the subject of text similarity and I’m not exactly good at coding.

raudaschl · February 6, 2021, 4:46pm

@macedotavares I really enjoyed playing with this plugin.
Thank you so much for sharing it.

I’ve placed it inside an Alfred Workflow so now I can call the script on any note within Obsidian.

Really really nice.

I’ve attached the Alfred Script in case anyone else wants to play with it.
You will need to install some libraries from the requirements.txt file before it will work.
And of course, set up the Alfred variables to state where your vault_path and vault_name are.

Screenshot 2021-02-06 at 16.48.43

obsidian-similar-notes.zip (491.3 KB)

macedotavares · February 6, 2021, 5:09pm

Thanks! I’ll definitely take a look at the code and try to learn something

From the placeholder block on the screen recording, I imagine that you’ve also used some parts of the Obsidian Utilities workflow, right?

I’ve included the related notes search there too, but I ran into different kinds of trouble and had to refactor the whole script. I really don’t know how (or if) it’s working for other people.

raudaschl · February 6, 2021, 5:22pm

That’s absolutely right.
I reused the related papers part of that workflow, modified your code and inserted it into the workflow.

Initially I tried to trigger the python script using Nodejs so I could create a plugin, but that was just not working at all.

I want to experiment with other types of similarity scripts.
Do you have any ideas of what would be good to explore?

macedotavares · February 6, 2021, 7:32pm

My first attempt used a Jaccard index algorithm. It was pretty basic, but didn’t need any external libraries.

When I included the feature in Obsidian Utilities, I refactored it to use TF-IDF, but then NLTK and Gensim became necessary. The packaging was a nightmare and the workflow went from a few hundred KB to more than 300 MB.

Finally, I refactored it again, this time to use the core SQLite package and do away with the dependencies.

Honestly, I didn’t notice any significant improvements from Jaccard to TF-IDF, but that may have something to do with the nature of my notes. Very disparate subjects, two languages, etc.

I think there’s a lot of room for improvement:

1. Better performance

My vault has around 3k notes and the workflow takes around 15 seconds to run. All the notes are being processed at runtime and I had to hold back on the preprocessing in order not to make it even worse. I’d like to be able to lemmatize and remove stop words without degrading performance, and I guess that proper caching would do the trick. By proper, I mean a persistent cache with all the lemmatized words and their counts, all the notes neatly vectorized, and regular cache updates.

2. Better preprocessing

Like I said above, the current preprocessing is pretty basic and I believe it degrades the relevance of the matches. I’d like to remove numbers, punctuation, stop words, and then lemmatize the whole thing. Multiple language support would be a plus.

3. Different weights for title and keywords

There was actually a regression from my first standalone python script to the workflow versions: initially, it gave more weight to tags and wikilinks, which I believe makes sense, since they mean an intentional, active relationship between notes. Including titles in this logic might also work.

4. Turning it into a proper plugin

Alfred is great, but not everyone uses it. If an Obsidian plugin were available, that would mean better integration, probably better performance, and a much better UX. Just imagine a “Similar Notes” pane automatically populating like the Backlinks. I don’t know a thing about TypeScript, so I’m not the one making that happen.
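
To make points 1 and 2 a bit more concrete, here is a rough sketch of a persistent cache that stores preprocessed word counts in SQLite and only re-reads notes whose modification time has changed. Everything in it is an assumption made for illustration: the table layout, the function names, the tiny stop-word lists ("pt" is just an example of a second language), and the optional `lemmatize` hook, which is left empty because real lemmatization needs an external library.

```python
import json
import os
import re
import sqlite3
from collections import Counter

# Tiny illustrative stop-word lists; real ones would be much longer.
STOP_WORDS = {
    "en": {"the", "a", "an", "in", "of", "and", "were", "is"},
    "pt": {"o", "a", "os", "as", "de", "e", "são", "mais"},
}


def preprocess(text, language="en", lemmatize=None):
    """Keep letter-only words (drops numbers and punctuation), remove stop words."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    stop = STOP_WORDS.get(language, set())
    tokens = [w for w in words if w not in stop]
    if lemmatize is not None:  # optional hook, e.g. a function from an NLP library
        tokens = [lemmatize(w) for w in tokens]
    return tokens


def open_cache(db_path):
    """Open (or create) the cache database."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS notes ("
        "path TEXT PRIMARY KEY, mtime REAL NOT NULL, counts TEXT NOT NULL)"
    )
    return conn


def refresh_cache(conn, vault_path, language="en"):
    """Re-process only new or modified notes; return how many were updated."""
    cached = dict(conn.execute("SELECT path, mtime FROM notes"))
    seen, updated = set(), 0
    for root, _dirs, files in os.walk(vault_path):
        for name in files:
            if not name.endswith(".md"):
                continue
            path = os.path.join(root, name)
            seen.add(path)
            mtime = os.path.getmtime(path)
            if cached.get(path) == mtime:
                continue  # unchanged since the last run
            with open(path, encoding="utf-8") as fh:
                counts = Counter(preprocess(fh.read(), language))
            conn.execute(
                "INSERT OR REPLACE INTO notes VALUES (?, ?, ?)",
                (path, mtime, json.dumps(counts)),
            )
            updated += 1
    for path in set(cached) - seen:  # forget notes that were deleted or moved
        conn.execute("DELETE FROM notes WHERE path = ?", (path,))
    conn.commit()
    return updated
```

Calling `refresh_cache(open_cache("cache.db"), vault_path)` at the start of each run would mean that the expensive preprocessing only happens for notes edited since the previous run; the similarity step can then read the stored counts instead of the files.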

raudaschl · February 7, 2021, 11:38am

Thanks for the detailed reply.

It’s interesting to hear that you did not notice much difference between Jaccard and TF-IDF.

I would be curious to see if lemmatisation and stop word removal would actually make a big difference.
I might play with that during the week.

I guess, honestly, any similarity functionality would be better than none right now.
I’ve recently found it useful for writing a blog on complexity theory.

Have you given any thought to applications of graph databases to identify similar notes?

macedotavares · February 7, 2021, 11:46am

I’m glad that someone who actually knows about this stuff is taking an interest in this idea. My trade is UX design; python and Alfred are just useful hobbies for me. So, to answer your last question: I barely know what a graph database is.

Please, keep me posted on your findings and progress.

macedotavares · February 8, 2021, 10:56pm

@raudaschl, in fact, I’m moving this to the Plug-in Ideas section. We may get some useful input and traction there. Thanks!

raudaschl · February 10, 2021, 10:42pm

That’s pretty awesome.
Having used this for a few days now, both for personal and work use, I find the value really high.
I’ve discovered so many connections I had forgotten.
I feel the obsidian search is a bit weak for finding relevant notes based on keywords.
This fills part of that gap.

raudaschl · February 26, 2021, 10:41pm

So I have been able to experiment with this on and off the last few weeks.

I’ve made some amendments to the script and here is what I’ve learned.

I added lemmatisation, but it did not seem to make much difference to the relevancy of the results with Jaccard similarity.
However, I did introduce TFxIDF cosine similarity, and that not only improved recall (the number of results) but, when combined with the lemmatisation, made the results far more relevant.
I then blended the TFxIDF score with the keyword similarity score and adjusted the score weightings.
The results I’m seeing are better recall (I now see results for some items where before I was seeing none) and improved relevancy.
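
A minimal sketch of the blending described above, written in plain Python so it runs without extra libraries. It is not the attached similarity.py: the function names, the smoothed IDF variant and the 0.5 weight are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs maps a note name to its token list; returns name -> {term: weight}."""
    n = len(docs)
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
    vectors = {}
    for name, tokens in docs.items():
        term_freq = Counter(tokens)
        vectors[name] = {
            # relative term frequency times a smoothed inverse document frequency
            term: (count / len(tokens)) * (math.log((1 + n) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def blended(tfidf_score, keyword_score, keyword_weight=0.5):
    """Weighted mix of the two scores; the weight is an assumed starting point."""
    return keyword_weight * keyword_score + (1 - keyword_weight) * tfidf_score


docs = {
    "a": ["graph", "theory", "notes"],
    "b": ["graph", "database", "notes"],
    "c": ["cooking", "recipe"],
}
vectors = tfidf_vectors(docs)
text_sim_ab = cosine(vectors["a"], vectors["b"])
text_sim_ac = cosine(vectors["a"], vectors["c"])
```

Notes "a" and "b" share two terms and get a positive score, while "a" and "c" share none and score 0. The keyword score passed to `blended` would come from the wikilink and tag overlap discussed earlier in the thread.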

I’ve attached the code below which allows you to compare the ranking of the two methods side by side.

The new script does require new libraries and is a bit more resource-heavy.

One thing that surprised me is how this can tell me how novel a new note or idea is!
I can see an application for that when ideating, refactoring or reflecting.

I might set up some experiments around entity extraction next to see if this improves the relevancy further.

In terms of productionizing I’m not sure how to move this forward. The Obsidian to Anki Plugin seems to have found a way to blend python and nodejs.
It may be worth looking into that in more detail.

similarity.py.zip (2.4 KB)

macedotavares · March 1, 2021, 4:26pm

Oh, this has got to be good

I’ll take it for a spin! Thanks!

cristian · March 16, 2021, 9:35am

Hi @marcelotavares, @raudaschl !, nice scripts you have there :D.

Last month I played with the library networkx to get some vault metrics and things like Pagerank (GitHub - cristianvasquez/prototype_05). I was happy with it.
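
For readers who haven't met PageRank, the idea can be sketched without any library. This is a plain power iteration, not the networkx call used in the prototype; the `links` dictionary, the function name and the defaults are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each note to the notes it links to; returns note -> rank."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    rank = {node: 1 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1 - damping) / len(nodes) for node in nodes}
        for node in nodes:
            # A note without outgoing links spreads its rank over every note.
            targets = links.get(node) or nodes
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Toy vault: A and B both link to C, C links back to A, D links nowhere.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": []}
rank = pagerank(links)
most_central = max(rank, key=rank.get)
```

In this toy vault the most central note is "C", since two notes link to it and the ranks always sum to 1.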

We could have a ‘python lab’ plugin to make these experiments easier, providing an interface that executes the python scripts we’ve configured and returns results in some usable way. We can use the plugin to experiment further and come up with better scripts.

In my case, I’m trying to figure out how:

Inspect similar notes to the current one.
Propose new links while I’m writing
Find two notes that I should merge into one.

I think python scripts are a flexible way to do it. I also have a feeling that the best scripts will depend on the contents of the vault.

raudaschl · March 17, 2021, 10:17pm

Hey @cristian

I love this idea of a python plugin!
How do we even go about doing such a thing?
I’ve been having issues getting node to even run a python script.

On the points you are trying to figure out I think the above similarity scripts are a good starting point.

Another way this could be achieved is via topic modelling.
I’ve been experimenting with creating models based on bigrams and trigrams within the notes, so each topic is based on keywords that are commonly associated with each other across the entire vault.
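
As a rough illustration of the first step only (collecting bigrams and trigrams that recur across notes, not the topic model itself), something along these lines would work. The function names and the `min_notes` threshold are assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """All runs of n consecutive tokens, joined into phrases."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def common_phrases(notes, min_notes=2):
    """notes maps a note name to its token list.

    Returns the bigrams and trigrams that appear in at least min_notes notes,
    with the number of notes each one appears in.
    """
    note_freq = Counter()
    for tokens in notes.values():
        note_freq.update(set(ngrams(tokens, 2) + ngrams(tokens, 3)))
    return {phrase: count for phrase, count in note_freq.items() if count >= min_notes}


notes = {
    "a": "networked thinking with machine learning".split(),
    "b": "machine learning for notes".split(),
    "c": "networked thinking tools".split(),
}
phrases = common_phrases(notes)
```

Here "machine learning" and "networked thinking" each appear in two notes, so they are the only phrases kept; a topic model would then be built on top of phrases like these.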

It’s based on this script I wrote to identify trends in COVID-19 preprint papers. Good news is it works on any plain text, so it’s great for Obsidian.

The sooner we make python easier to implement in Obsidian the sooner I think we will see really what networked thinking augmented with machine learning can do.

cristian · March 18, 2021, 5:20pm

@raudaschl Let’s do it then!

I executed some python in the past from Obsidian, with this plugin: cristianvasquez/obsidian-snippets-plugin (GitHub)

For the lab, I’m trying to imagine something simple to use and extend.

At the very least, I think a ‘lab’ plugin should let you associate commands with scripts, so you can experiment.

What do you imagine? Are similar things out there?