Filter Dataview for YAML category. Dataview acting strange

What I’m trying to do

I have a bunch of MD files, that already have YAML attributes. An examplefile can be seen here. As we can see there are many categories in YAML.

Lets look at “requires” category in YAML.

I want to filter the Dataview for lets say 20. There are many cases to be covered:

  • 20
  • 20, X, X
  • X, 20, X
  • X, X, 20

Potential false-positives: 1820

Things I have tried

The Regex I have prepared is :

(^|\s)20($|\s|,)

I had no success with the queries. I don’t know if regex is supported or I am doing it wrong… Operators I have tried:

  • regexmatch()
  • filter()

The best I could do is the following code:

TABLE file.link AS "EIP", title AS "Title", requires, created as "Filed at", status
FROM "Github imports"
Where (requires=20) OR econtains(split(requires, ", "),"20")
SORT created ASC

Logically it should work… Funny Thing is that the OR econtains(split(requires, ", "),"20") makes the (requires=20) not work any more.

Leaving OR econtains(split(requires, ", "),"20") away, I get filtered results for requires: 20 . As soon as I add the line above, entries with requires: 20 do not appear in the table any more.

Instead I get these cases:

  • 20, X, X
  • X, 20, X
  • X, X, 20

Investigating further I imported the YAML to my Jupyter Notebook.

Within my pandas Dataframe I have three different data types

  • int such as 20
  • float such as NaN
  • str such as 20, 165, 721

I do not know how Dataview interprets Data types, but could that be the root of the problem?

Most likely, for anybody to tell you need to provide some markup of how this is written. But dataview can detect strings, numbers, dates, links, to mention a few. And doing splits on a list wouldn’t work, and contains on strings vs lists can be icky in some cases, so there’s a lot which can go wrong.

So please show some real examples of how you’ve marked up the requires field.

Sure, in the following I give some examples. Under the link, you can see the full MD File. The Frontmatter part I paste in here.

---
eip: 100
title: Change difficulty adjustment to target mean block time including uncles
author: Vitalik Buterin (@vbuterin)
type: Standards Track
category: Core
status: Final
created: 2016-04-28
---
  • Requires Field present, only one numeric Value. Full .md File here. YAML Frontmatter Header:
---
eip: 926
title: Address metadata registry
author: Nick Johnson <[email protected]>
type: Standards Track
category: ERC
status: Stagnant
created: 2018-03-12
requires: 165
---
  • Requires Field present, multiple numeric entries separated by ", ". Note that the value to filter for can be either the first value, the one enclosed by two values, or the last value.Full .md File here. YAML Frontmatter Header:
---
eip: 998
title: Composable Non-Fungible Token
description: Extends a ERC-721 to own other ERC-721 and ERC-20 tokens.
author: Matt Lockyer <[email protected]>, Nick Mudge <[email protected]>, Jordan Schalm <[email protected]>, sebastian echeverry <[email protected]>, Zainan Victor Zhou (@xinbenlv)
discussions-to: https://ethereum-magicians.org/t/erc-998-composable-non-fungible-tokens-cnfts/387
status: Draft
type: Standards Track
category: ERC
created: 2018-07-07
requires: 20, 165, 721
---

Your main problem throughout is related to the type of the requires field. When you do requires: 20 it’s considered a number, which you can’t really use regex or array functions on (without a little trickery).

I agree that logically it should work, but yet again the type of requires plays tricks on us. In addition there seems to be a bug related to how the split() filters out only rows where the type of the first parameters is string. In other words, the number version of requires is removed from the result set, as you’ve also experienced. Bummer…

Test cases showing the various textual notation and resulting types

Copy the text below into a note of its own, and study the end result of it when rendered in live-preview or reading.

## The simple test of various alternatives for markup

```dataview
TABLE WITHOUT ID value[0] as Textual, value[1] as Value, typeof(value[1]) as Type
FLATTEN list(
[ "10", 10],
[ "\"20\"", "20"],
[ "\"30, 31\"", "30, 31"],
[ "[40]", [40] ],
[ "[50, 51]", [50, 51]]) as value
WHERE file.path = this.file.path
```

The first column shows what you would type in the `requires: <textual>`. The second column displays the value as _dataview_ sees it, and the third column displays which type _dataview_ considers this value to be. 

## Strangeness occurs when adding `split()` to the mix

```dataview
TABLE WITHOUT ID value[0] as Textual, value[1] as Value, typeof(value[1]) as Type,
  split(value[1], ",")

FLATTEN list(
[ "10", 10],
[ "\"20\"", "20"],
[ "\"30, 31\"", "30, 31"],
[ "[40]", [40] ],
[ "[50, 51]", [50, 51]]) as value

WHERE file.path = this.file.path
```

The query above _should_ have all the rows, but the fourth column with the `split(value[1], ",")` causes it to filter out _only rows of type string_. I'm close to considering this as a bug of _Dataview_.  Doing either of the following causes this filter effect to occur:

- `split(value[1], ",")` 
- `default(split(value[1], ","), "no split")` 
- `choice(typeof(value[1]) = "string", split(value[1], ","), "no string")`

## Normality(?) restored 

```dataview
TABLE WITHOUT ID value[0] as Textual, value[1] as Value, typeof(value[1]) as Type,
  split(string(value[1]), ",") as SplitValue,
  typeof(split(string(value[1]), ",")) as SplitType
  
FLATTEN list(
[ "10", 10],
[ "\"20\"", "20"],
[ "\"30, 31\"", "30, 31"],
[ "[40]", [40] ],
[ "[50, 51]", [50, 51]]) as value
WHERE file.path = this.file.path
```

Using the default theme (not Minimal which hides the bullet points of lists), this renders as:

In short, ideally you should use either [ 20 ] or [20, 30, 132] to always ensure the array variant, which when would allow you to filter using only contains(requires, 20).

In the “normality restored” query above, I change the split(..., ",") to split(string( ... ), ",") and this actually takes care of most of the interesting cases in our scenario. So this can be utilised to normalise your query (without needing to change the markup) into:

```dataview
TABLE file.link AS "EIP", title AS "Title", requires, created as "Filed at", status
FROM "Github imports"
WHERE contains(split(string(requires), ","), "20")
SORT created ASC
```

Note that in this case you do need to do "20" in the contains(). An alternative would be to contains(number(split(string(requires))), 20) but I’m not sure it’s worth it… :smiley:

Side note on regex’s, you could most likely also do something like regexmatch( string(requires), "(^|\s)20($|\s|,)" )update: regextest( "(^|\s)20($|\s|,)", string(requires)), where before you do the regex match ensures the value is actually a string. I’ve not tested this particular solution, and whether you’re regex is correct, though. This is more of a hint related to how you could do proper regex testing against something which might be a number (or other types) in some cases.

1 Like

Thank you for your efforts :pray: I think I understand now, why Dataview didn’t work as expected.

I am inching closer to the solution but I couldn’t get it to work just yet.

When querying 20 using your code above, it covers all cases but I also get false positives, e.g. when requires is 1820. Using econtains instead fixes this but then an entry such as 15, 18, 20 will not show anymore.

regexmatch gives results for regexmatch( string(requires), "20").
When trying regexmatch(string(requires), "(^|\s)20($|\s|,)" ) I get no results. Maybe this is because of the escape characters. I tried it with "\(^|\\s\)20\($|\\s|,\)" but I also do not get any results for that.

I checked the regex here and it should work fine theoretically.

Try doing econtains in combination with number. In other words, try doing econtains(number(split(string(requires), ",")), 20)

1 Like

Ehh… My bad, I didn’t test that, and the parameters are switched around, and we were using the wrong version. See String Operations, and you’ll see that:

  • regextest(pattern, string) – Match against somewhere within the string
  • regexmatch(pattern, string) – Match against the entire string
  • And that the parameters are in the order: pattern, string… At least for these two

In other words, and confirmed by testing, try doing: regextest( "(^|\s)20($|\s|,)", string(requires)), and you’ll get better results. I’d still use the econtains/number variant, but this would be the regex alternative.

1 Like

Thanks for the help! Both versions work :slightly_smiling_face:

Is there a way I can feed this subset of Dataview selection to the graph view? According to this posts from 2021 this feature does not work.

I know I can use the operator requires: 20 for example but it includes false negatives.

You’ve got to turn the dynamic query into static information for the graph view to pick up on it. This can be done, with some caveats, through the use of templater to store the result of your query.

Would this be a one time operation for you, or would you need to rerun the query? And if you need to rerun it, how often do you think you’ll need to rerun it? (Aka how often do the base data for the query change? )

I do not want to edit the source files since I want to be able to refresh them. Refreshing them would probably not occur that often but it should be possible.

Generally, what I am trying to do is:

1st: see which EIPs are have most requires entries

Tried making that work using GROUP BY number(split(string(requires), ",")). As seen below, it groups for each individual combination of numbers seperated by ", " in requires but not for each individual number.

TABLE length(rows) as Number 
FROM "Github imports"
GROUP BY number(split(string(requires), ","))

Result:

When trying to make it work with the regextest approach, this just works for one value, here requires = 20. It checks each reuqires entry for the occurance of a certain number and counts true and false values. But I would need it to iterate through all values and count the true for each individual reuqires value.

TABLE length(rows) as Number 
FROM "Github imports"
GROUP BY regextest( "(^|\s)20($|\s|,)", string(requires))

Result:

2nd: After a brief overview investigating further using Template note

Then I would select EIPs I am interested in and investigate those further using a new template note.

The template would have a YAML Frontmatter Tag that the EIP number of interest can be typed in or maybe is automatically fetched from somewhere. Then, the Dataview should adjust according the selector we have talked earlier. This should be possible with the solution you gave me above.

Edit: Solution found by using Templater.

Example: If YAML frontmatter is…

---
EIP: 20
---

… then either one of the two following expressions below …

<% tp.frontmatter.EIP %>
<% tp.frontmatter["EIP"] %>

… will get replaced by 20 when pressing Alt + R .

3rd: Visualizing the Dataview subset using Graph View

Then, ideally the local graph view should visualize the selection of this subset to see the relation and allow me to dig deeper

Embedding graph view does not work yet apparently but I’m fine to have it in a window of the workspace and save the workspace for now.

1st – Locating the highest occurence

In order for the individual numbers to be grouped properly you’ll need to FLATTEN any list of numbers. So maybe the following would give the wanted output:

```dataview
TABLE length(rows) as count
FROM "Github imports"
FLATTEN number(split(string(requires), ",")) as aRequire
GROUP BY aRequire
SORT aRequire DESC
```

3rd – Visualizing the subset

Some alternate approaches to this case:

  • Just use the query, which has the serious drawback of no graph view. Not a good alternative, I reckon
  • Have the query in a commented section, copy it into the clipboard, and apply a template which execute the clipboard as a dataview query
  • Automate the process in the previous step

I’m working on the third step, but the middle step is easier to implement. For that to work, create a template (e.g. queryClipboard) in your template folder with the following content:

<%*
const dv = app.plugins.plugins.dataview.api

const query = await tp.system.clipboard()
console.log(query)
if ( query == "" ) 
  window.alert("Copy the query (starting with " +
               "LIST|TASK|TABLE|CALENDAR) into the clipboard. " +
               "Before inserting the template.")
                
const result = await dv.queryMarkdown(query, tp.config.target_file.path);
if ( result.successful ) {
 tR += result.value
} else {
  tR += "~~~~\n" + result.error + "\n~~~~"
}
%>

You can now optionally create a hotkey for this template by doing:

  • Go to _Settings > Templater > Template Hotkeys>, and hit the Add new hotkey for template button, and search up your newly created template
  • Press the plus icon, and the hotkey dialog opens up
  • Hit the plus icon in the new dialog next to the “Templater: Insert …/queryClipboard.md” command, and press the hotkey combination you want for triggering this template

Now given a query like the one below:

```dataview
LIST LIMIT 5
```

Select just the lines between the code block fences, aka LIST LIMIT 5, and copy it to the clipboard using Cmd + C (or Ctrl + C on windows). Position your cursor a few lines down, and hit the designated hotkey, and it’ll insert the result of the query there.

A better setup using this template

To make it even easier to rerun your query I would place the original query in a comment block, so say that you’ve got something like the following:

<!-- The query is usually commented out, since we've persisted the
   result below 
```dataview
... the query of your choice ...
```
-->

- previous
- result
- list

You would of course replace the ... your query ... part with your query, and this is contained within a comment block, so that it doesn’t “duplicate” the output, but this allows for two things. One it’s accessible for selection and copying into the clipboard (and subsequently ready for use by the template). And secondly, if you just remove the <!-- part, you can see the dynamic result of the query (with a little extra garbage before and after).

The process to rerun your query would now be:

  • Make sure the query is commented out
  • Select the inner part of the query, that is don’t include the code fence of the triple backticks, and copy that to the clipboard
  • Select the previous output, and delete or replace that with a space or something like that
  • Hit the designated hotkey for the queryClipboard template (or select it from the insert templater menu), and see your output be refreshed
  • Enjoy the graph view of the current output

WIP - Fully automated persisting of a query

When I get around to it, I’ll make one template to change the syntax around an existing query, so that it’ll end up look something like:

<!-- query-abc begin --
LIST LIMIT 5
<!-- query-abc end-query -->
- previous
- result
- set
<!-- query-abc end-result -->

This markup would then be able to be parsed by another template which will pick up the query between the first two markers, and replace the contents between the last two markers with the result of the executed query.

I’ve got the second part nearly done already, but I need to test and fine tune it a little more before presenting it in #share-showcase . (And I’m a little undecided on whether I should require the query section to be selected before the template is inserted, or if I should apply the template to the entire file )

But this, as said already, is a work in progress, so one day in the future it’ll hopefully come in a post near you! :smiley:

Wow, thank you for your dedicated work! I really appreciate your help :pray:

1st – Locating the highest occurrence

Your code works. Just sorting the highest occurrence descending does not work. I tried a bit but couldn’t make it work. Apparently, Dataview has some problems sorting grouped entries. What do you think? This is my code:

TABLE length(rows) as reference_count
FROM "Github imports"
FLATTEN number(split(string(requires), ",")) as EIP 
GROUP BY EIP
SORT reference_count

… and this is my output:

3rd – Visualizing the subset

Just to understand the workflow, it is about getting the dynamic results displayed by the Dataview query in a static form, such as written text, so the links can be utilized by the graph view, correct?

I couldn’t quite get it to work… Regarding your template for Templater, I copied your code and inserted it into my template. I also set up the hotkey for it.

When I open a new note and copy any Query, for instance, this…

TABLE 
file.link AS "EIP", title AS "Title", requires, created as "Filed at", status 
FROM "Github imports" WHERE 
regextest( "(^|\s)20($|\s|,)", string(requires))
SORT created ASC 

… but also just LIST LIMIT 5 , I get a parsing error:

image

I tried deleting the query variable in the template note and inserting something like this instead:

  • LIST LIMIT 5 or
  • (LIST LIMIT 5) or
  • "LIST LIMIT 5" or
  • ("LIST LIMIT 5")

But it didn’t work. Am I doing something wrong?

Regarding highest occurrence

You can’t use a value declared in the table header in other parts of the query, but you can do it the other way around, so try this variant of the query:

```dataview
TABLE reference_count
FROM "Github imports"
FLATTEN number(split(string(requires), ",")) as EIP 
GROUP BY EIP
FLATTEN length(rows) as reference_count
SORT reference_count
```

On visualizing the subset

Yes, that is correct.

That template you’ve shown an image of isn’t pasted correctly. The line
shift are all mangled up. Try doing Cmd (or Ctrl) + Shift + V, and see if that doesn’t give you proper line endings, and removes the parsing error.

1 Like

Thanks, that solved all my problems. I can now create beautiful and useful images like this:

I’ll mention you in my thesis for helping out :slight_smile:

1 Like

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.