I know I'm repeating myself (it must be my third comment on HN about this topic), but this does not match my experience at all.
DuckDB will error out with an out-of-memory exception on very simple DISTINCT ON / GROUP BY queries.
Even with a temporary file, an on-disk database, and without preserving the initial order.
On any version of DuckDB.
Curious about your post, and DuckDB, since I have seen many previous posts here on HN about it... I just took 10 minutes to do some quick tests and experiment with DuckDB for the first time. :-)
While I am a Linux user, I tried this on an available Windows 10 machine (I know... yuck!)
1) Setup and install, no problem. Tracked the duckdb process memory usage with PowerShell, like this:
Get-Process -Name duckdb | Select-Object Name, @{Name="Memory (MB)";Expression={[math]::round($_.WorkingSet64/1MB,2)}}
2) Used the simulated Netflix table dataset available from an S3 bucket, as used in the "FIRST ANALYTICS PROJECT" section of this example blog: https://motherduck.com/blog/duckdb-tutorial-for-beginners/ Installed the aws extension, not the one mentioned in the blog.
Table has 7100 rows as seen like this:
SELECT COUNT(*) AS RowCount FROM netflix;
3) Did 8 different queries using DISTINCT ON / GROUP BY, the one below being one example:
```
SELECT DISTINCT ON (Title)
    Title,
    "As of",
    Rank
FROM netflix
ORDER BY Title, "As of" DESC;
```
I am not seeing any out-of-memory errors or memory leaks in these quick tests. I also tested with the Parquet file from the thread mentioned below by mgt19937. It has 89475 rows. Did some complex DISTINCT ON / GROUP BY queries on it, without seeing either explosive memory use or anything resembling a memory leak.
Do you have a more specific example?
> 7100 rows
For analytic queries, I don't feel that is anywhere near what I've seen at multiple companies who have opted for columnar storage. That would be at most a few seconds of incoming data.
With so few rows, I would not be surprised if you could use standard command line tools to get the same results from a text file of similar size in an acceptable time.
Certainly not. We typically would talk about hundreds of millions to billions of rows.
Just trying to reproduce the use case... Also tried with the Parquet file from the support ticket. Around 90,000 rows...
Quick example:
```
COPY (
    SELECT DISTINCT ON (b, c) *
    FROM READ_PARQUET('input.parquet')
    ORDER BY a DESC
) TO 'output.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD');
```
Where the input file has 25M rows (500 MB in Parquet format) containing 4 columns: a and b are BIGINTs, and c and d are VARCHARs.
On a MacBook Pro M1 with 8GB of RAM (16x the original file size), the query will not finish.
This is a query that could very easily be optimised to use very little space (hash the DISTINCT ON key, and replace already-seen values in place if the value of "a" is larger than the one that already exists).
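For illustration, here is one way that in-place idea could be expressed directly as a hash aggregation, using DuckDB's arg_max to keep the row with the largest "a" per (b, c) group. This is only a sketch based on the column names above; I haven't benchmarked whether the planner actually handles it with less memory:
```
COPY (
    SELECT
        max(a)        AS a,   -- largest "a" seen for this (b, c) group
        b,
        c,
        arg_max(d, a) AS d    -- value of "d" on the row with the largest "a"
    FROM READ_PARQUET('input.parquet')
    GROUP BY b, c
) TO 'output.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD');
```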
With the information in your example... I created a Parquet file with 50 million rows of random data, same data types. The Parquet file is 391 MB on disk (NTFS).
The query will complete, but it takes approx. 3.5 to 4 min and you will need up to 14 GB of memory. (4 cores, Win10, 32 GB RAM.)
You can see the memory usage in MB throughout the query below, sampled at 15-second intervals.
duckdb 321.01 -> Start Query
duckdb 6302.12
duckdb 13918.04
duckdb 10963.74
duckdb 8586.76
duckdb 7613.86
duckdb 6749.53
duckdb 5990.96
duckdb 5293.35
duckdb 4205.53
duckdb 3153.59
duckdb 1482.86
duckdb 386.29 -> End Query
So yes, there are some opportunities for optimization here :-)

Thanks for the benchmarks! :)
Indeed, 14GB seems really high for a 400MB Parquet file, that's a 35x multiple on the base file size.
Of course, the data is compressed on disk, but even the uncompressed data isn't that large so I believe indeed that quite a lot of optimisations are still possible.
It’s also the aggregation operation. If there are many unique groups it can take a lot of memory.
Newer DuckDB versions are able to handle out-of-core operations better. But in general, just because data fits in memory doesn't mean the operation will; and as I said, 8GB is very limited memory, so it will entail spilling to disk.
What is the error message you’re getting? Or is it simply that the query will not finish? (Does it continue to run?)
Parquet files are compressed, and many analytic operations require more memory than the on-disk size. When you don't have enough RAM, DuckDB has to switch to out-of-core mode, which is slower. (It's the classic performance trade-off.)
8GB of RAM is usually not enough to expect performance from analytic operations - I usually have a minimum of 16. My current instance is remote and has 256GB of RAM. I never run out of RAM, and DuckDB never fails and runs super fast.
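For reference, these are the knobs usually involved when you would rather have DuckDB spill to disk than fail; a minimal sketch assuming a recent version, with placeholder values and paths:
```
SET memory_limit = '6GB';                   -- cap how much RAM DuckDB may use
SET temp_directory = '/tmp/duckdb_spill';   -- where out-of-core operators spill to
SET preserve_insertion_order = false;       -- allows more streaming-friendly plans
```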
In general DuckDB is great but I had similar out-of-memory issues specifically when using "distinct on" and "unnest". In such cases I usually delegate some queries to chdb / clickhouse-local instead.
DuckDB has been evolving nicely, I especially like their SQL dialect (things like list(column1 order by column2), columns regex / replace / etc) and trust they'll eventually resolve the memory issues (hopefully).
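For anyone curious, a quick sketch of the dialect niceties mentioned above (table and column names are invented for illustration):
```
SELECT
    customer_id,
    list(order_id ORDER BY order_date) AS ordered_orders,  -- ordered list aggregate
    max(COLUMNS('amount_.*'))                               -- regex over column names
FROM orders
GROUP BY customer_id;
```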
Thanks for repeating yourself; this comment potentially influences my decision-making about DuckDB. Good to know about the negatives too.
At what data volumes does it start erroring out? Are these volumes larger than RAM? Is there a minimal example to reproduce it? Is this ticket related to your issue? https://github.com/duckdb/duckdb/issues/12480
In my experience, recent versions work well for out-of-core computation (I use it frequently). What is the size of the Parquet file that you are using? I can't find the details in your previous comments.
Sounds interesting. Is there an open issue for this? I found https://github.com/duckdb/duckdb/issues/8505 but that specific issue seems to be closed.
Sounds like the very early versions of EF Core which did not translate C# GroupBy into SQL GROUP BY, but instead loaded the /entire/ data set into memory and did the aggregate there.
I've had similar experiences with DuckDB; it feels nicer to use on the surface, but in terms of performance and actual functionality I've had a better experience with clickhouse-local.
Are you using it for simple SQL retrieval or complex analytic queries? They’re both similar for the former use case, but DuckDB — being an analytic engine — supports the latter use case much better.
DuckDB has great ergonomics for moving data between different databases and making copies for local analysis. The one thing that differed in my experience with it from the author’s is how much of the Postgres sql dialect (and extensions) it supports. Attempting to run my Postgres analytics sql code in duckdb errors out on most json operations - to be fair, the DuckDB json functions have cleaner names than jsonb_path_query - also, DuckDB has no support for handling xml, so all xpath calls fail as well.
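For what it's worth, the renaming involved is often mechanical. A rough, hypothetical example (table and column names invented), assuming the json extension is loaded:
```
-- Postgres:  SELECT payload #>> '{customer,name}' FROM events;
-- DuckDB:
SELECT
    payload->>'$.customer.name'                  AS customer_name,
    json_extract(payload, '$.customer.address')  AS address_json
FROM events;
```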
You may know this already but the postgres extension[1] may help:
If I understand it correctly, when you use it, it:
- Pulls the minimal data required (inferred from the query) from postgres into duckdb
- Executes your query using duckdb execution engine
BUT, if your Postgres function is not supported by DuckDB, I think you can use `postgres_execute` [2] to execute the function within Postgres itself.
I'm not sure whether you can, e.g., do a CTE pipeline that starts with postgres_execute and then executes DuckDB SQL in later stages of the pipeline.
[1] https://duckdb.org/docs/extensions/postgres.html#running-sql... [2] https://duckdb.org/docs/extensions/postgres.html#the-postgre...
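For reference, a minimal sketch of what using the extension looks like (connection string, alias, and table names are made up here):
```
INSTALL postgres;
LOAD postgres;

-- Attach a running Postgres database; queries against it run through DuckDB's engine.
ATTACH 'dbname=analytics host=localhost user=analyst' AS pg (TYPE postgres);
SELECT count(*) FROM pg.public.events;

-- Run a statement inside Postgres itself, e.g. for functions DuckDB doesn't support.
CALL postgres_execute('pg', 'CREATE TABLE tmp_results AS SELECT 1 AS x');
```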
Thanks for the suggestion! As I understand, you can only postgres_execute against a running Postgres db. It does work and I’ve used it in my tests, I think I could get around the limitations that I ran into by running a pg instance alongside DuckDB. For now I think I’ll stick with just pg, as I was looking into DuckDB to replace pg in my local analytic workloads: load data from rest apis, dump into a database and use sql in a custom dbt-like pipeline to build the tables for analysis in bi tools. Unfortunately, many endpoints return xml data and much of the sql I’ve already written deals with json, meaning it would have to be adapted to work with DuckDB.
Really, is this what's getting praised? I mean specifically the first point: the whole "just paste the URL into the DB" thing, plus inferring the column names. That looks like the laziest and shakiest basis, and if I ever saw that in production I'd be both stunned and scared.
It is a really useful feature for ad hoc examination of datasets. Not every data analysis is going into a massive, mission-critical production build chain. Quite often you just want to look at the data in a REPL, and IMO that sort of usage could bear to get a little easier.
Inference for column names and data types seems only relevant for CSV? Other formats, such as Parquet, already convey that metadata...
CSV metadata inference in DuckDB is amazing. There is some research in this domain and DuckDB does a great job there, but yes, there might be some really strange CSV files which require manual intervention.
It would be perfectly reasonable IMO to put your metadata elsewhere and still use DuckDB for processing.
What's wrong with the URL in the query?
observablehq.com has built-in support for DuckDB, and I have found it to be very easy to use. Getting windowing, CTEs, and derived columns is great, and being able to just refer to SQL query cells as an array of rows makes things much easier for me than breaking out into JS right away.
Someone wrote an export function, so I can make a select into a table and grab that as csv to use elsewhere.
I wish Simon Willison would adopt DuckDB as he has SQLite, to see what he would create!
I very, very nearly migrated to a full DuckDB solution for customer-facing historical stock data. It would have been magical, and ridiculously, absurdly, ungodly fast. But the cloud costs ended up being close to a managed analytics solution, with significantly more moving parts (on our end). But I think that's just our use case; going forward I'd look at DuckDB as an option for any large-scale dataset.
Using ECS/EKS containers reading from a segmented dataset in EFS is a really solid solution; you can get sub-second performance over 6 billion rows / 10000 columns with proper management and reasonably restrictive queries.
Another option is to just deploy a couple of huge EC2 instances that can fully fit the dataset. Costs here were about the same, but with a little more pain in server management. But the speed, man, it's just unbelievable.
Co-founder and head of produck at MotherDuck here - would love to chat. We're running DuckDB in a serverless fashion, so you're only paying for what you consume.
Feel free to reach out to tino@motherduck.com.
To note, we migrated from Redshift, which had 7-30 second performance. Our current managed solution is something like 1-5 seconds. DuckDB just smashes everything else, at least on our data.
Did you check the cost to run it on Motherduck (https://motherduck.com/)?
What do you mean by segmented dataset in EFS?
Great, I didn't know about fsspec!
Uhm, why would you ever use this instead of SQLite?
DuckDB is optimised for analytical tasks, whereas SQLite isn't.
For those not aware: analytical tasks involve a lot of groupings, aggregations, running sums/averages, sorting, ranking, etc. Columnar databases like DuckDB are built around those workloads, so you can do these tasks much faster.
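As a concrete (made-up) example of the kind of query meant here: daily revenue per region with a running total and a ranking, the bread and butter of an analytic engine:
```
SELECT
    region,
    order_date,
    sum(amount) AS daily_revenue,
    sum(sum(amount)) OVER (
        PARTITION BY region ORDER BY order_date
    ) AS running_revenue,
    rank() OVER (
        PARTITION BY region ORDER BY sum(amount) DESC
    ) AS busiest_day_rank
FROM orders
GROUP BY region, order_date;
```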
Ty! Never got into the analytics side of things
Have a look at this section of the AWS Redshift documentation (also a columnar database) to understand the advantages of these types of systems: https://docs.aws.amazon.com/redshift/latest/dg/c_columnar_st...
Or the advantage of columnar file formats, like ORC or Parquet, for analytical queries. Normally you are only interested in a few columns.
Columnar formats allow for some pretty interesting optimizations with respect to filtering: constraints/checks can be rewritten and pushed down into the scan, several of them evaluated simultaneously, and as little data as necessary pulled in.
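A small sketch of what that means in practice (file name invented): only the referenced columns are read, and the predicate on b can be pushed down into the Parquet scan so that row groups which cannot match are skipped entirely.
```
SELECT a, c
FROM read_parquet('input.parquet')
WHERE b > 1000;
```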