At Level 12, we often need to make sense of datasets from a variety of sources. One tool in our data analysis toolkit is DuckDB.
DuckDB's speed, versatility, and unified SQL interface allow us to quickly gain insights from diverse datasets.
What is DuckDB?
DuckDB is an embeddable, columnar database system. What does this mean?
- Embeddable: DuckDB can be used as a library in multiple programming languages, including Python.
- Columnar: DuckDB uses a columnar, vectorized query engine, which lets it process analytical queries far more efficiently than a traditional row-oriented relational database management system (RDBMS).
File Type Support
DuckDB can load these file types, among others:
- CSV
- Parquet
- Excel
- JSON
You can load datasets of different formats into DuckDB tables, then combine those disparate datasets in a single query using SQL joins.
This ability to query and join multiple datasets using a single interface, namely SQL, is immensely valuable.
Data Source Support
DuckDB can read from:
- Local filesystem
- HTTP(S) endpoints
- Cloud storage such as S3
- Database servers such as PostgreSQL and MySQL
This allows you to compare and join local data with files in cloud storage without manually downloading them first.
For example, to view a parquet file from S3 using the DuckDB CLI:
SELECT * FROM 's3://my-bucket/file.parquet';
The full power of SQL is available to read and transform the data.
Also, while DuckDB can run entirely in memory, it can persist multiple tables in a single database file for later use.
Diverse Data and DuckDB
In summary, DuckDB is a powerful and flexible tool for working with diverse datasets. Whether you’re working with local files, cloud storage, or live database connections, DuckDB’s unified SQL interface simplifies the process, allowing you to extract insights quickly and efficiently.
Its ability to efficiently process analytical queries, support multiple file formats, and seamlessly integrate with various data sources makes it an excellent choice for data analysis.