A Dab of DuckDB

Published on: 4/23/2026

I’ve been hearing about DuckDB for a while now but I hadn’t yet had a good excuse to try it. I proposed it as a possible solution during a call with a prospective client recently, so I thought it was a good time to jump in with both feet and gain a better understanding of how it works and where it could be useful. πŸ˜…

Overview

As DuckDB.org states:

DuckDB is a SQL database that runs everywhere: on your laptop, on a server and even in the browser

This description sells the project quite short. It’s accurate, but it omits one of the major differentiators between DuckDB and similar solutions: DuckDB is capable of turning almost any data source into a SQL database. This is a wild concept and it feels magical. With two lines of code, you can turn a directory full of JSON files (or CSV, TSV, etc., etc.) into a functional database:

-- sample JSON: {"jobs_processed": 95384}
CREATE TABLE job_summaries AS SELECT jobs_processed FROM read_json_auto('data/*.json', columns = {jobs_processed: 'INTEGER'}, ignore_errors = true);
SELECT sum(jobs_processed) FROM job_summaries WHERE jobs_processed IS NOT NULL;

-- jobs processed => 1818170796
-- 22,229 files
-- ~6s to create DB and import data
-- ~0.03s to query / sum

Use Cases

As I understand it, DuckDB is well suited to data sets which:

- fit on a single, local SSD
- can be *much* larger than RAM (e.g. 64GB RAM / 1TB data set)
- are read-oriented
- are analytical (i.e. you're asking questions about and operating on the collection instead of looking up individual records)
    - logs
    - DB exports from legacy systems
    - financial data
    - public records
    - sensor data
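To make β€œanalytical” concrete, here’s the kind of question DuckDB is built for, sketched against a hypothetical directory of web server logs (the file layout and the ts/status column names are assumptions, not from a real data set):

```sql
-- Hypothetical CSV logs with a timestamp and an HTTP status column.
SELECT
    date_trunc('day', ts) AS day,
    count(*) AS requests,
    count(*) FILTER (WHERE status >= 500) AS server_errors
FROM read_csv_auto('logs/*.csv')
GROUP BY day
ORDER BY day;
```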

Anything else?

Rapid Prototyping

While it’s not necessarily one of DuckDB’s advertised use cases, it’s extremely useful for rapid prototyping. For instance, if I have a historical set of log files generated by A Widget Machine and I want to prototype a system which makes this data discoverable through a web interface, I can do this in short order using code similar to the sample shared above. The added benefit is that because DuckDB’s flavor of SQL is mostly a superset of ANSI SQL, these initial queries can (probably) be rolled into a more traditional RDBMS if you determine that’s a better fit. Think of an ETL pipeline that moves data from the source, through some sync/transformation mechanism (I like Airbyte), and into Postgres as the final destination.
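As a sketch of that last hop: DuckDB’s postgres extension can attach a running Postgres instance directly, so the β€œfinal destination” step can itself be a couple of lines of DuckDB. The connection string and table names below are placeholders:

```sql
-- Assumes a reachable Postgres instance; connection details are placeholders.
INSTALL postgres;
LOAD postgres;
ATTACH 'dbname=analytics host=localhost user=me' AS pg (TYPE postgres);
CREATE TABLE pg.public.job_summaries AS SELECT * FROM job_summaries;
```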

It’s also superior to β€œdocument storage” solutions because, while the schema is inferred from the contents of the source data, it’s still present and you get all of the DB-level type validation you would expect – if you want it. Common operations, like joins, are also considerably simpler to write in SQL. I’ll touch on this more below, but you can make data imports as strict or as permissive as you need using options like ignore_errors and union_by_name.
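For example, joining across two directories of JSON – something that’s typically painful with document stores – is just a join over two reads. The file layout and field names here are hypothetical:

```sql
-- Hypothetical files: jobs/*.json has {job_id, jobs_processed},
-- machines/*.json has {job_id, machine_name}.
SELECT m.machine_name, sum(j.jobs_processed) AS total
FROM read_json_auto('jobs/*.json') AS j
JOIN read_json_auto('machines/*.json') AS m USING (job_id)
GROUP BY m.machine_name;
```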

Art/Interactive Installations

I think DuckDB is also well suited for use in interactive installations and art projects. I can think of a number of such projects I’ve worked on over the years where DuckDB would have enabled exploration and presentation and precluded the need for traditional DB or ETL infrastructure. SQLite is often used in these instances but, as simple as it is, it still requires a modicum of planning: carving out a schema, devising a migration strategy in case the schema changes, and building a data ingestion pipeline to get data from the source (e.g. a CSV of sensor data). DuckDB papers over all of this and lets you start experimenting with and visualizing your data almost immediately.

Using It

I installed DuckDB using the Linux shell script (after carefully reviewing it!) from the Installation page and did not have any issues. There are options for all common platforms.

You can run SQL directly, work in the CLI/REPL, or use one of the client libraries available for most popular languages.

CLI

duckdb -c "SELECT COUNT(*) FROM 'data/*.json'"

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ count_star() β”‚
β”‚    int64     β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚        22229 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Rust Client

use duckdb::{Connection, Result};

fn main() -> Result<()> {
    let conn = Connection::open_in_memory()?;
    conn.execute_batch(
        "
        CREATE TABLE job_summaries AS SELECT jobs_processed FROM read_json_auto('data/*.json', columns = {jobs_processed: 'INTEGER'}, ignore_errors = true);
        ",
    )?;
    // sum() over an INTEGER column yields a HUGEINT, so cast to BIGINT
    // before reading it out as an i64
    let total_jobs_processed: i64 = conn.query_row(
        "SELECT sum(jobs_processed)::BIGINT FROM job_summaries WHERE jobs_processed IS NOT NULL",
        [],
        |row| row.get(0),
    )?;
    println!("total_jobs_processed: {}", total_jobs_processed);
    Ok(())
}

Caveats

I ran into a few unexpected issues around installing/building DuckDB and libduckdb-sys when first attempting to use the Rust client. A common approach is to enable the β€œbundled” feature when adding the library. From the docs:

Uses a bundled version of DuckDB’s source code and compiles it during build. This is the simplest way to get started and avoids needing DuckDB system libraries.

This worked fine in isolation but I have Vim configured to run :Cargo check on write and this resulted in DuckDB being rebuilt every time I saved. This could probably be worked around in my .vimrc but I instead used the DUCKDB_DOWNLOAD_LIB=1 option to have the build step download the binary once.

This approach also taught me something new about setting environment variables for Cargo, so I didn’t have to modify .bashrc or continually prefix commands. You can create a .cargo/config.toml file in your project and set useful config options, like environment variables for Cargo to use. In this case, I added the following and Cargo began doing the correct thing both from the CLI and Vim.

.cargo/config.toml
...
[env]
DUCKDB_DOWNLOAD_LIB = "1"

Strict vs.Β Permissive Imports

By default, DuckDB is strict about data types, field presence, etc. when importing data using read_json_auto and friends. You can work around these β€œlimitations” if you need to, though. I can absolutely see this being a real concern if you’re attempting to import decades worth of data whose internal schema may have drifted or that are subject to human error. The following options seem most relevant, and I was able to test and verify that they have the intended results in my sample project:

union_by_name

If you are processing multiple files that have different schemas, perhaps because columns have been added or renamed, it might be desirable to unify the columns of different files by name instead. This can be done by providing the union_by_name option.

DuckDB can be clever and forgiving about how it infers the schema when importing data and you may or may not run afoul of issues resulting from schema drift. It comes down to the order in which files are sampled and the sample size DuckDB uses. You can force its hand and surface the class of errors which union_by_name addresses with the following:

{"title": "A", "pages": 12} // 01.json
{"title": "B", "pages": 16, "author": "Smith"} // 02.json
-- without union_by_name
SELECT * FROM read_json_auto('data/*.json', maximum_sample_files=1);
Invalid Input Error:
    JSON transform error in file "data/02.json", in line 1:
        Object {"title":"B","pages":16,"author":"Smith"} has unknown key "author"
Try increasing 'sample_size', reducing 'maximum_depth', specifying 'columns', 'format' or 'records' manually, setting 'ignore_errors' to true, or setting 'union_by_name' to true when reading multiple files with a different structure.


-- with union_by_name
SELECT * FROM read_json_auto('data/*.json', maximum_sample_files=1, union_by_name=true);
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  title  β”‚ pages β”‚ author  β”‚
β”‚ varchar β”‚ int64 β”‚ varchar β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ A       β”‚    12 β”‚ NULL    β”‚
β”‚ B       β”‚    16 β”‚ Smith   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
-- NOTICE: author is set to null in rows whose underlying JSON is missing the author key

The takeaway here is that union_by_name results in a superset of all fields encountered with missing fields set to null by default.

ignore_errors

There are cases where CSV files may have multiple structural errors, and users simply wish to skip these and read the correct data. Reading erroneous CSV files is possible by utilizing the ignore_errors option. With this option set, rows containing data that would otherwise cause the CSV parser to generate an error will be ignored. In our example, we will demonstrate a CAST error, but note that any of the errors described in our Structural Error section would cause the faulty line to be skipped.

This will ignore errors which would ordinarily result from files like:

{"jobs_processed": "nine"}

I think it’s also worth noting that just specifying the column and type will result in null column values instead of errors when you import a file with missing fields. This all depends on the specifics of your use case, but some combination of the above should let you at least start working and experimenting with legacy data sets.
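A minimal sketch of that behavior: declare the column and type up front, and files missing the key import as NULL rather than erroring.

```sql
-- A file without the key, e.g. {"other_field": 1}, yields a NULL row
-- instead of an import error:
SELECT jobs_processed
FROM read_json_auto('data/*.json', columns = {jobs_processed: 'INTEGER'});
```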

Performance

It’s worth calling out a few things around the performance metrics noted above and how you’d use DuckDB in a day-to-day workflow.

The cited figures come from a machine with 64GB of RAM and 12 CPU cores, all of which DuckDB will happily utilize. You can control resource usage at a more granular level using pragmas, if you need to.

For example, to dial in RAM and CPU:

PRAGMA memory_limit = '4GB';
PRAGMA threads = 4;

There are many other options covered in the docs.

Performance can also be greatly improved by converting JSON to Parquet – either before the initial ingest or for posterity. DuckDB also doesn’t only exist in-memory; in fact, I suspect in-memory use is the exception in day-to-day work. If you know your data set is relatively static, you can use a file-backed database (as opposed to in-memory, like the above), which avoids re-importing the data on every run and makes subsequent sessions dramatically faster.
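A sketch of both ideas: convert the JSON once, then work from Parquet and a persistent database file in later sessions (filenames are placeholders):

```sql
-- One-time conversion; run inside `duckdb jobs.duckdb` so the table persists.
CREATE TABLE job_summaries AS
    SELECT * FROM read_json_auto('data/*.json', ignore_errors = true);
COPY job_summaries TO 'job_summaries.parquet' (FORMAT parquet);
-- Later sessions can query the Parquet file directly:
SELECT sum(jobs_processed) FROM 'job_summaries.parquet';
```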

Summary

I’m very excited by DuckDB and I’m not the only one. Before I was able to publish this post, I saw multiple posts about it appear on Hacker News. This post is very superficial and there are many features and use cases which I haven’t touched on. For example, a feature I’d like to look further into is (basic) full text search. I could see this being extremely powerful in use cases surrounding historical data, log entries, manuals, etc.
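For the curious, the fts extension looks roughly like this. I haven’t used it beyond skimming the docs, so treat the table and column names as placeholders:

```sql
INSTALL fts;
LOAD fts;
-- Hypothetical table of log entries with an id and a message column.
PRAGMA create_fts_index('log_entries', 'id', 'message');
-- match_bm25 returns NULL for non-matching rows, so filter on it directly.
SELECT id, message
FROM log_entries
WHERE fts_main_log_entries.match_bm25(id, 'disk failure') IS NOT NULL
ORDER BY fts_main_log_entries.match_bm25(id, 'disk failure') DESC;
```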

As implied above, I have not yet used DuckDB for Real Work and can’t speak to issues like performance at scale, read/write contention, working cross-platform or trying to reuse DuckDB SQL in Postgres. I am going to look for excuses to use it going forward and may write follow up posts as I learn more. Stay tuned!