Obsidian, dataview, and memory limit / memory explosion

Hello!
I’m dealing with a memory explosion with Dataview. After using about 4 GB of memory, Obsidian crashes to a blank screen and releases the memory. I’m on Windows 10 64-bit with 32 GB of RAM.

Background

I’m trying to export a CSV of my media records.
I’m using Dataview, Tasks, and Table to CSV Exporter for this. A single media record, stored as a task, might look like this:
- [x] SCP Archives podcast [sub::SCP-049: "The Plague Doctor"],[numeric::5],[publish::2019-04-09], ✅2020-07-17

I keep my records together in a file per media source (in this case, the SCP podcast) or a catch-all podcast page, and these tasks helpfully inherit metadata from the parent file. So far so good.
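A trimmed-down example of what one of these source files looks like (the frontmatter here is simplified to just the fields the query below uses):

---
tags: _media
media-type: podcast
---

- [x] SCP Archives podcast [sub::SCP-049: "The Plague Doctor"],[numeric::5],[publish::2019-04-09], ✅2020-07-17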

For reasons (driving some Google Docs charts) I want to periodically export a CSV, so I have a Dataview table as follows:

TABLE WITHOUT ID Tasks.completion AS date, regexreplace(Tasks.text, "[\[✅].*$", "") AS media, Tasks.numeric AS numeric, Tasks.sub AS sub, Tasks.published AS published, media-type AS type, Tasks.media-type AS unittype, Tasks.priority AS "pri"
FROM #_media
WHERE file.tasks
FLATTEN file.tasks AS Tasks
WHERE Tasks.completion > date(2022-12-31) 
LIMIT 10000

This dataview looks for all #_media-tagged notes and flattens every task within into a table that gives me useful info - especially date, media name, and media type.
Then I can use the Table to CSV Exporter plugin on the result.
This had been going well until I finished ingesting my backlog of media records.
Now when I try to generate the above dataview, the FLATTEN operation causes memory to balloon to 4 GB, and then Obsidian crashes.

I’d love to know if there’s a way to let the process use close to my full 32 GB of memory to brute-force the problem, or a way to make this data transformation use far less memory without me having to reorganise a bunch of data by hand.

Things I have tried

  • lowered the LIMIT - no effect; it probably just filters the final results
  • removed the regexreplace - no discernible difference
  • excluded a number of files - when I excluded a large portion of my media notes, the process peaked at about 3 GB and then completed successfully, after which memory dropped back to about 400 MB
  • checked that I am in fact using the 64-bit version of Obsidian

Any help would be greatly appreciated!

That LIMIT is most likely applied too late, as you’ve suggested - it kicks in at the very end of the query, after all the heavy lifting is already done.

I think the way forward is to limit the tasks to process. Without knowing your complete data set, here are some ideas:

  • limit based on file names, e.g. all files starting with “A”
  • utilise tags to divide the tasks
  • possibly use the modification time on files to sort out stuff - if a task was completed in 2023, the file was modified in 2023, right?
  • introduce folders to your media storage and split the file sets, just to reduce the number of files processed
  • instead of starting with FLATTEN, start with a filter so you only keep processing interesting tasks (see the sketch below)
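
A rough, untested sketch of the first and third ideas combined - just to show the shape of such a pre-filter, with a prefix and date threshold you’d adjust to your own vault:

TABLE file.name, file.mtime
FROM #_media
WHERE startswith(file.name, "A") AND file.mtime >= date(2023-01-01)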

Hope this helps

I think you might be right about the direction to go.
I’ll do some head-scratching about how best to filter more before the FLATTEN.
Since I’m mostly using this system to track my yearly media goals (as far as the export is concerned), maybe I could filter out anything that doesn’t contain “2023” in the body…
I’ll have a tinker in the morning.
Thanks!

I think doing something like the following could help:

FLATTEN filter(file.tasks, (t) => t.completed AND contains(t.text, "2023")) AS tasks23
FLATTEN tasks23 AS Tasks

You should already have lost a good chunk of tasks before you expand the task list. FLATTEN materialises one row per task, so filtering the list first means far fewer rows ever get built.

I had to give it a try… and my word, it is lightning fast now!
This example has helped me understand a lot more about how Dataview queries work.
I owe you a beverage. Thank you!

For posterity, here is a dataview equivalent to my first example, producing the same output, with the suggested optimisation implemented:

TABLE WITHOUT ID tasks23.completion AS date, regexreplace(tasks23.text, "[\[✅].*$", "") AS media, tasks23.numeric AS numeric, tasks23.sub AS sub, tasks23.published AS published, tasks23.media-type AS type, media-type AS unittype, tasks23.pri AS pri
FROM #_media
WHERE contains(file.tasks.text, "✅2023-")
LIMIT 10000
FLATTEN filter(file.tasks, (t) => contains(t.text, "✅2023-")) AS tasks23
LIMIT 10000

Hey bro, I have the same problem as the OP of this post (it didn’t crash at all, but it froze for a long time) with this query:

TABLE tags AS "Matéria", choice(!T.completed, split(T.text, "_")[1], "\-") AS "Data", split(T.text, "_")[0] AS Tipo
FROM ""
FLATTEN file.tasks AS T
WHERE !T.completed AND (
contains(T.text, "2023-01-17") OR
contains(T.text, "2023-01-16") OR
contains(T.text, "2023-01-15") OR
contains(T.text, "2023-01-14") OR
contains(T.text, "2023-01-13") OR
contains(T.text, "2023-01-12") OR
contains(T.text, "2023-01-11")
)
SORT tags DESCENDING

Do you have any tips to optimize it?
Thanks in advance

TABLE T.text as text
FLATTEN filter(file.tasks, (t) => (contains(t.text,"2023-01-11")|contains(t.text,"2023-01-12")|contains(t.text,"2023-01-13")|contains(t.text,"2023-01-14")|contains(t.text,"2023-01-15")|contains(t.text,"2023-01-16")|contains(t.text,"2023-01-17")) ) as T
WHERE !T.completed
LIMIT 100

Here I applied your WHERE filtering logic inside the filter() function instead, chaining together a bunch of contains() calls with the OR operator (“|”). There may be a way to do this more elegantly with a list, but I don’t know how.
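
If Dataview’s two-argument any() behaves the way I think it does, the list version might look something like this - untested, so treat it as a sketch rather than a known-good query:

TABLE T.text AS text
FLATTEN filter(file.tasks, (t) => any(list("2023-01-11", "2023-01-12", "2023-01-13", "2023-01-14", "2023-01-15", "2023-01-16", "2023-01-17"), (d) => contains(t.text, d))) AS T
WHERE !T.completed
LIMIT 100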

If you have the same issue as me, something like this should give you the same speed and memory improvements.

Thanks bro!
But somehow the query you posted didn’t output the same content, although it looks like it does the same stuff.

Hmm, I’m not sure what to suggest without having data to try it on.
But hopefully the example gives you an idea of how to modify your own version to use the filter within the FLATTEN step rather than doing that work in the WHERE - see the sketch below.
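
As a rough, untested guess at your full query with that change applied (note it also moves the !t.completed check into the filter):

TABLE tags AS "Matéria", choice(!T.completed, split(T.text, "_")[1], "\-") AS "Data", split(T.text, "_")[0] AS Tipo
FROM ""
FLATTEN filter(file.tasks, (t) => !t.completed AND (contains(t.text, "2023-01-11") | contains(t.text, "2023-01-12") | contains(t.text, "2023-01-13") | contains(t.text, "2023-01-14") | contains(t.text, "2023-01-15") | contains(t.text, "2023-01-16") | contains(t.text, "2023-01-17"))) AS T
SORT tags DESCENDING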
