---
layout: post
title: "beam-duckdb: a Beam backend for typed SQL over DuckDB"
date: 2026-02-25 17:27:12 +0100
categories: blog
---

The dataHaskell organization is spearheading the development of a slew of tools for data science. However, sometimes, all you need is SQL.
The Haskell ecosystem is home to many database libraries that add a layer of type safety over SQL queries. One such pioneering library is [`beam`](https://github.com/haskell-beam/beam), which provides an embedded domain-specific language offering strong guarantees about SQL queries. A key design principle of `beam` is that it is _database-agnostic_: `beam` adapts its functionality to the specific database, which has traditionally been PostgreSQL, SQLite, or MySQL.

This blog post is about the addition of a new supported database, bringing the power of `beam` over to [DuckDB](https://duckdb.org/).
### Why DuckDB
DuckDB is an in-process database, much like SQLite, but with a focus on analytics rather than transaction processing. Specifically, DuckDB targets online analytical processing (OLAP) workloads.

If you want to drive DuckDB from a Haskell program, you can use [`duckdb-simple`](https://hackage.haskell.org/package/duckdb-simple):

```haskell
{-# LANGUAGE ScopedTypeVariables #-}

import Database.DuckDB.Simple

main :: IO ()
main = do
  [Only (maxScore :: Double)] <- withConnection ":memory:" $ \conn ->
    query_
      conn
      "SELECT MAX(score) \
      \FROM read_parquet('scores.parquet');"

  putStrLn "Max exam score: "
  print maxScore
```

This interface is very similar to the experience of using SQLite via `sqlite-simple`, but here you get access to DuckDB's high performance. However, the query is not checked at compile time; for example, we get a runtime error if the result of the query is not, in fact, a `Double`.

Note here that we're loading data from a Parquet file using `read_parquet`. DuckDB supports loading data from various data sources, including file formats such as Parquet and CSV, and data lake formats such as Apache Iceberg and Delta Lake.
In summary, DuckDB is a great choice of database for data science workloads because:
* Its query engine is optimized for analytical workloads;
* It is SQL-compliant;
* It provides SQL extensions to interact with various data sources important to data science.

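To illustrate that last point, the same `duckdb-simple` pattern shown above works for other data sources too. Here is a minimal sketch using DuckDB's `read_csv` table function; the file name `scores.csv` is hypothetical:

```haskell
{-# LANGUAGE ScopedTypeVariables #-}

import Database.DuckDB.Simple

-- Count the rows of a (hypothetical) CSV file via DuckDB's read_csv.
main :: IO ()
main = do
  [Only (n :: Int)] <- withConnection ":memory:" $ \conn ->
    query_ conn "SELECT COUNT(*) FROM read_csv('scores.csv');"
  putStrLn ("Number of rows: " <> show n)
```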
### Adding structure with beam
Once you have determined a good pipeline for your workload, you might want to add some structure to prevent mistakes. This is where `beam` comes in.
`beam`, at its core, is a database library that generates SQL statements from a type-safe domain-specific language embedded in Haskell. Since DuckDB supports the various standards that make up modern SQL, it was possible to write a `beam` backend that generates statements for DuckDB.

With `beam`, we first describe the schema for our data in Haskell:
```haskell
-- Represents a table of exam scores.
data ExamT f = Exam
  { _examId :: Columnar f Int32,
    _examName :: Columnar f Text,
    _examScore :: Columnar f Double,
    _examDate :: Columnar f Day
  }
  deriving (Generic, Beamable) -- beam table types derive Generic and Beamable

type Exam = ExamT Identity
```

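`Columnar` is a type family with `Columnar Identity a = a`, so an `Exam` value is just an ordinary record. As a sketch, reusing the `Exam` type above (the field values are made up):

```haskell
import Data.Time.Calendar (fromGregorian)

-- With f = Identity, each Columnar field collapses to its plain type.
someExam :: Exam
someExam =
  Exam
    { _examId = 1,
      _examName = "Ada",
      _examScore = 95.5,
      _examDate = fromGregorian 2026 2 25
    }
```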
We can declare that this database sources its data from a Parquet file like so:

```haskell
data ScoresDB f = ScoresDB
  { _scores :: f (DataSourceEntity ExamT)
  }
  deriving (Generic, Database DuckDB)

scoresDb :: DatabaseSettings DuckDB ScoresDB
scoresDb =
  defaultDbSettings
    `withDbModification` (dbModification @_ @DuckDB)
      { _scores =
          dataSource (parquet (NonEmpty.singleton "scores.parquet"))
            <> modifyDataSourceFields
                 tableModification
                   { _examId = "id",
                     _examName = "name",
                     _examScore = "score",
                     _examDate = "exam_date"
                   }
      }
```

Finally, we can query the data source:
```haskell
main = do
  Just maxScore <- withConnection ":memory:"
    $ \conn -> runBeamDuckDB conn
    $ runSelectReturningOne
    $ select
    $ aggregate_
        (max_ . _examScore)
        (allFromDataSource_ (_scores scoresDb))

  putStrLn "Max exam score: "
  print maxScore
```

[Beam makes it possible to construct much more powerful queries](https://haskell-beam.github.io/beam/). For example, what's the whole row associated with the highest score?
```haskell
main = do
  Just rowWithMaxScore <- withConnection ":memory:"
    $ \conn -> runBeamDuckDB conn
    $ runSelectReturningOne
    $ select
    $ do
        -- subquery: SELECT MAX(score) FROM read_parquet(...)
        let findMaxScore =
              aggregate_
                (max_ . _examScore)
                (allFromDataSource_ (_scores scoresDb))

        -- top-level query: SELECT * FROM read_parquet(...) WHERE score = (subquery)
        e <- allFromDataSource_ (_scores scoresDb)
        guard_ (just_ (_examScore e) ==. subquery_ findMaxScore)
        pure e

  putStrLn $ "The max exam score of "
    <> show (_examScore rowWithMaxScore)
    <> " was scored by "
    <> show (_examName rowWithMaxScore)
    <> " on "
    <> show (_examDate rowWithMaxScore)
```

and voilà!
With the added structure of the schema (represented by the type `ExamT`), we get:
* a very smart compiler checking the validity of our queries at compile-time;
* composable queries that can be combined in a way that raw SQL cannot.

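As an example of the second point, a query fragment can be given a name and reused inside larger queries. A sketch in the style of the examples above (the passing threshold of 60 and the date cutoff are made up):

```haskell
import Data.Time.Calendar (fromGregorian)

-- A reusable query fragment: exams with a passing score.
passingExams = do
  e <- allFromDataSource_ (_scores scoresDb)
  guard_ (_examScore e >=. val_ 60)
  pure e

-- Reuse the fragment in a larger query: passing exams taken in 2026 or later.
recentPassingExams = do
  e <- passingExams
  guard_ (_examDate e >=. val_ (fromGregorian 2026 1 1))
  pure e
```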
------------------------------
The `beam-duckdb` backend is very new. Do you want to use this but don't know where to start? Do you wish it supported features that are not currently implemented? Do not hesitate to [raise an issue](https://github.com/haskell-beam/beam/issues/new)!
