What is the formal syntax surrounding tags?

datum · February 13, 2021, 9:07pm

Hello Obsidianians, I’m working on a really cool dashboard for my notes (I’ll talk about it in Show and Share when v.1 is done) and I need to nail down precisely what constitutes a tag. I need to process them outside of Obsidian by scanning the notes. Initially I thought that a tag is delimited by white space, such that it’s surrounded by spaces, tabs, or at the beginning of a line. It’s not that simple. Some examples:

abc#Bob

Not a tag.

|#Alice

Not a tag.

\|#Alice

This is recognized as the tag #Alice

#Bob#Alice

This is recognized as both tags #Bob and #Alice

But, if #Bob is no longer a tag:

Bob#Alice

then neither is Alice. This doesn’t make a lot of sense. It seems like a tag can be prefixed by another tag or an escaped pipe (\|), but if it is prefixed by anything else, not only is #Bob not a tag, but neither is #Alice. It’s some kind of weird prefix-chain rule. This must be a parsing nightmare.

So if some fine developer out there can explain the precise lexical rules surrounding (no pun intended) tags, well, some good will come of it.

My experience with the tags leads me to another question. I often copy/paste articles / research papers directly into notes, and there’s an awful lot of potential unintended interpretation. Is there a straightforward way to tell Obsidian to not interpret anything in the note or section of a note? I’m aware of code blocks, but I really don’t want the whole note red (and code blocks are not immune from interpretation).

Thanks for your time and consideration.

datum · February 25, 2021, 7:29pm

In case it helps others, I had to answer this one myself. It looks like the following regular expression will work:

(^|[[:blank:]])(#[a-zA-Z0-9/_-]*[a-zA-Z/_-][a-zA-Z0-9/_-]*)+

This also obeys the restriction that tags cannot be only numbers. The astute reader will notice that this allows for an all-number sub-tag, or empty tag. Yes, it does, because Obsidian recognizes them as tags, so Obsidian will actually accept and tag crazy things like:

#///777/

So I’ve gotta conform, if I expect external results to match internal (Obsidian) results.
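For anyone who would rather test the pattern from a script, here is a minimal Python sketch of the same regex. Two things in it are assumptions rather than part of the original expression: Python's `re` module has no POSIX classes, so `[[:blank:]]` is written as `[ \t]`, and the helper name `find_tags` plus the step that splits a chained run such as #Bob#Alice into separate tags are only illustrative. It mirrors the regex above, not Obsidian's actual parser.

```python
import re

# Python's re module has no POSIX classes, so [[:blank:]] becomes [ \t].
# The capture group holds the run of tags without the leading blank.
TAG_RUN = re.compile(
    r"(?:^|[ \t])((?:#[a-zA-Z0-9/_-]*[a-zA-Z/_-][a-zA-Z0-9/_-]*)+)",
    re.MULTILINE,  # let ^ match at the start of every line, like grep does
)

# Used only to split a chained run like "#Bob#Alice" into its parts.
SINGLE_TAG = re.compile(r"#[a-zA-Z0-9/_-]+")


def find_tags(text):
    """Return the tags the regex accepts, splitting chained runs."""
    tags = []
    for run in TAG_RUN.finditer(text):
        tags.extend(SINGLE_TAG.findall(run.group(1)))
    return tags


if __name__ == "__main__":
    for sample in ["abc#Bob", "|#Alice", "#Bob#Alice", "Bob#Alice", "#///777/"]:
        print(repr(sample), "->", find_tags(sample))
```

Because the leading blank sits outside the capture group, nothing has to be stripped afterwards, and no lookbehind is needed.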

The only hiccup is that the regex will capture leading space, which isn’t really a part of the tag, but that’s easily “strippable.” I wanted to do a Positive Lookbehind but that’s not reliable in a cross-platform or cross-tool context, so it has to be avoided. Also, the sequence

(^|[[:blank:]])

should be simplified to

[[:space:]]

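except that it’s not picking up tags at the beginning of a line. That is expected after all: [[:space:]] has to match an actual character, and grep strips the newline before matching, so nothing precedes a # that starts a line.
Linux man page citation for [[:space:]]: grep → re_format → wctype → isspace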

Edit: Whoops, I forgot to handle the case with the pipe. Rather than delete the post I’ll update it later.

Edit2: Regex not rendering properly due to display interpretation; fixed. I’m addressing the pipe issue in a followup post.

Edit3: Oof. Yet another display rendering issue. Hopefully the last one. This is getting kinda meta.

datum · February 26, 2021, 2:35pm

Identifying tags prefixed by escaped pipes

The previous incarnation of the regex isn’t complete because Obsidian will recognize a tag preceded by an escaped pipe. In other words, this is not a tag:

…|#tag

However this is a tag:

…\|#tag

I’m not sure why this is allowed, but it is what it is.
There comes a point where one realizes that nothing short of a recursive descent parser will solve the problem, but I’m not there yet. Revisiting the goal, it is to be able to identify all tags outside of Obsidian. This is for a dashboard I’m building that will use tags as a filtering constraint. Therefore I don’t need to get this done in a single regex, and it’s not really possible anyway. As a stream filter, though, this will work for me:

(note content) | sed 's/[\]|/ /g' | egrep -o '(^|[[:blank:]])(#[a-zA-Z0-9/_-]*[a-zA-Z/_-][a-zA-Z0-9/_-]*)+' | sed 's/^[[:blank:]]//'

This will output a list of tags, one per line, which is ideal for my purposes.
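The same three-step filter can be sketched in Python, for tools that can't shell out. This is a rough equivalent under a few assumptions, not a drop-in replacement: `[ \t]` stands in for `[[:blank:]]` (Python's `re` has no POSIX classes), every escaped pipe on a line is replaced rather than just the first, and `extract_tag_runs` is a made-up name. Like `egrep -o`, it returns a chained run such as #Bob#Alice as a single item.

```python
import re

# Same pattern as the egrep step; the capture group leaves out the leading blank.
TAG_RUN = re.compile(
    r"(?:^|[ \t])((?:#[a-zA-Z0-9/_-]*[a-zA-Z/_-][a-zA-Z0-9/_-]*)+)",
    re.MULTILINE,
)


def extract_tag_runs(note_text):
    """Mimic the sed | egrep -o | sed filter; one matched run per list item."""
    # Step 1 (first sed): turn each escaped pipe into a space, so that a tag
    # right after it has a blank in front of it.
    unescaped = note_text.replace("\\|", " ")
    # Steps 2 and 3 (egrep -o, second sed): group(1) already excludes the
    # leading blank, so there is nothing left to strip.
    return [match.group(1) for match in TAG_RUN.finditer(unescaped)]


if __name__ == "__main__":
    note = "| cell \\|#tag | other |#nottag |\n#Bob#Alice\n"
    for run in extract_tag_runs(note):
        print(run)
```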