|
| 1 | +--- |
| 2 | +layout: post |
| 3 | +title: "beam-duckdb: a Beam backend for typed SQL over DuckDB" |
| 4 | +date: 2026-02-25 17:27:12 +0100 |
| 5 | +categories: blog |
| 6 | +--- |
| 7 | + |
| 8 | +The dataHaskell organization is spearheading the development of a slew of tools for data science. However, sometimes, all you need is SQL. |
| 9 | + |
| 10 | +The Haskell ecosystem is home to many database libraries that add a layer of type safety over SQL queries. One pioneering example is [`beam`](https://github.com/haskell-beam/beam), which provides an embedded domain-specific language with strong static guarantees about the queries it builds. A key design principle of `beam` is that it is _database-agnostic_: it adapts its functionality to the specific database, which has traditionally been PostgreSQL, SQLite, or MySQL. |
| 11 | + |
| 12 | +This blog post is about the addition of a new supported database, bringing the power of `beam` over to [DuckDB](https://duckdb.org/). |
| 13 | + |
| 14 | +### Why DuckDB |
| 15 | + |
| 16 | +DuckDB is an in-process database, much like SQLite, but with a focus on analytics instead of transaction processing. Specifically, DuckDB focuses on online analytical processing (OLAP) workloads. |
| 17 | + |
| 18 | +If you want to drive DuckDB from a Haskell program, you can use [`duckdb-simple`](https://hackage.haskell.org/package/duckdb-simple): |
| 19 | + |
| 20 | +```haskell |
| 21 | +import Database.DuckDB.Simple |
| 22 | + |
| 23 | +main :: IO () |
| 24 | +main = do |
| 25 | +  [Only (maxScore :: Double)] <- withConnection ":memory:" $ \conn -> |
| 26 | +    query_ |
| 27 | +      conn |
| 28 | +      "SELECT MAX(score) \ |
| 29 | +      \FROM read_parquet('scores.parquet');" |
| 30 | + |
| 31 | +  putStrLn "Max exam score: " |
| 32 | +  print maxScore |
| 33 | +``` |
| 34 | + |
| 35 | +This interface closely mirrors what you would get using SQLite via `sqlite-simple`, but here you get access to DuckDB's high performance. However, the query is not checked at compile time; for example, we'll get a runtime error if the result of the query is not, in fact, a `Double`. |
| 36 | + |
| 37 | +Note here that we're loading data from a Parquet file using `read_parquet`. DuckDB supports loading data from various data sources, including file formats such as Parquet and CSV, and data lake formats such as Apache Iceberg and Delta Lake. |
| 38 | + |
| 39 | +In summary, DuckDB is a great choice of database for data science workloads because: |
| 40 | +* Its query engine is optimized for analytical workloads; |
| 41 | +* It is SQL-compliant; |
| 42 | +* It provides SQL extensions to interact with various data sources important to data science. |
| 43 | + |
| 44 | +### Adding structure with beam |
| 45 | + |
| 46 | +Once you have determined a good pipeline for your workload, you might want to add some structure to prevent mistakes. This is where `beam` comes in. |
| 47 | + |
| 48 | +`beam`, at its core, is a database library that generates SQL statements from a type-safe domain-specific language embedded in Haskell. Since DuckDB supports the various standards that make up modern SQL, it was possible to write a `beam` backend that generates statements for DuckDB. |
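
To make the idea of a type-safe embedded DSL concrete, here is a minimal sketch that uses only `base`. It is not how `beam` is implemented: `Expr`, `render`, and `selectWhere` are hypothetical names invented for this illustration, and the toy language covers only a comparison and a conjunction. It shows the general mechanism, where the Haskell type of an expression tracks its SQL type, so ill-typed queries are rejected before any SQL text is generated.

```haskell
{-# LANGUAGE GADTs #-}

module Main where

-- Toy illustration only; none of these names come from beam.
-- The type parameter records the SQL type an expression evaluates to.
data Expr a where
  Column :: String -> Expr a -- column name; the caller asserts its type
  DoubleLit :: Double -> Expr Double
  Gt :: Expr Double -> Expr Double -> Expr Bool
  And :: Expr Bool -> Expr Bool -> Expr Bool

-- Turn a well-typed expression into SQL text.
render :: Expr a -> String
render (Column name) = name
render (DoubleLit d) = show d
render (Gt l r) = "(" <> render l <> " > " <> render r <> ")"
render (And l r) = "(" <> render l <> " AND " <> render r <> ")"

-- The annotation pins this column to Double.
score :: Expr Double
score = Column "score"

-- Only boolean expressions are accepted as a WHERE clause.
selectWhere :: String -> Expr Bool -> String
selectWhere source cond =
  "SELECT * FROM " <> source <> " WHERE " <> render cond

main :: IO ()
main =
  putStrLn (selectWhere "read_parquet('scores.parquet')" (Gt score (DoubleLit 90)))
```

In this sketch, `selectWhere "t" score` does not compile, because `score` is an `Expr Double` rather than an `Expr Bool`. `beam` applies the same principle at a much larger scale, covering tables, joins, aggregates, and backend-specific syntax.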
| 49 | + |
| 50 | +With `beam`, we first describe the schema for our data in Haskell: |
| 51 | + |
| 52 | +```haskell |
| 53 | +-- Represents a table of exam scores. |
| 54 | +data ExamT f = Exam |
| 55 | +  { _examId :: Columnar f Int32, |
| 56 | +    _examName :: Columnar f Text, |
| 57 | +    _examScore :: Columnar f Double, |
| 58 | +    _examDate :: Columnar f Day |
| 59 | +  } |
| 60 | + |
| 61 | +type Exam = ExamT Identity |
| 62 | +``` |
| 63 | + |
| 64 | +We can then declare a database whose data source is a Parquet file, like so: |
| 65 | + |
| 66 | +```haskell |
| 67 | +data ScoresDB f = ScoresDB |
| 68 | +  { _scores :: f (DataSourceEntity ExamT) |
| 69 | +  } |
| 70 | +  deriving (Generic, Database DuckDB) |
| 71 | + |
| 72 | +scoresDb :: DatabaseSettings DuckDB ScoresDB |
| 73 | +scoresDb = |
| 74 | +  defaultDbSettings |
| 75 | +    `withDbModification` (dbModification @_ @DuckDB) |
| 76 | +      { _scores = |
| 77 | +          dataSource (parquet (NonEmpty.singleton "scores.parquet")) |
| 78 | +            <> modifyDataSourceFields |
| 79 | +              tableModification |
| 80 | +                { _examId = "id", |
| 81 | +                  _examName = "name", |
| 82 | +                  _examScore = "score", |
| 83 | +                  _examDate = "exam_date" |
| 84 | +                } |
| 85 | +      } |
| 86 | +``` |
| 87 | + |
| 88 | +Finally, we can query the data source: |
| 89 | + |
| 90 | +```haskell |
| 91 | +main = do |
| 92 | +  Just maxScore <- withConnection ":memory:" |
| 93 | +    $ \conn -> runBeamDuckDB conn |
| 94 | +      $ runSelectReturningOne |
| 95 | +      $ select |
| 96 | +      $ aggregate_ |
| 97 | +          (max_ . _examScore) |
| 98 | +          (allFromDataSource_ (_scores scoresDb)) |
| 99 | + |
| 100 | +  putStrLn "Max exam score: " |
| 101 | +  print maxScore |
| 102 | +``` |
| 103 | + |
| 104 | +[Beam makes it possible to construct much more powerful queries](https://haskell-beam.github.io/beam/). For example, what's the whole row associated with the highest score? |
| 105 | + |
| 106 | +```haskell |
| 107 | +main = do |
| 108 | +  Just rowWithMaxScore <- withConnection ":memory:" |
| 109 | +    $ \conn -> runBeamDuckDB conn |
| 110 | +      $ runSelectReturningOne |
| 111 | +      $ select |
| 112 | +      $ do |
| 113 | +        -- subquery: SELECT MAX(score) FROM read_parquet(...) |
| 114 | +        let findMaxScore = |
| 115 | +              aggregate_ |
| 116 | +                (max_ . _examScore) |
| 117 | +                (allFromDataSource_ (_scores scoresDb)) |
| 118 | + |
| 119 | +        -- top-level query: SELECT * FROM read_parquet(...) WHERE score = subquery |
| 120 | +        e <- allFromDataSource_ (_scores scoresDb) |
| 121 | +        guard_ (just_ (_examScore e) ==. subquery_ findMaxScore) |
| 122 | +        pure e |
| 123 | + |
| 124 | +  putStrLn $ "The max exam score of " |
| 125 | +    <> show (_examScore rowWithMaxScore) |
| 126 | +    <> " was scored by " |
| 127 | +    <> show (_examName rowWithMaxScore) |
| 128 | +    <> " on " |
| 129 | +    <> show (_examDate rowWithMaxScore) |
| 130 | +``` |
| 131 | + |
| 132 | +And voilà! |
| 133 | + |
| 134 | +With the added structure of the schema (represented by the type `ExamT`), we get: |
| 135 | +* a very smart compiler checking the validity of our queries at compile time; |
| 136 | +* composable queries that can be combined in a way that raw SQL cannot. |
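
To illustrate what composability buys us, here is a small self-contained sketch, again using only `base`. It is a toy and not `beam`'s API: `Query`, `from`, `restrict`, `passing`, and `recent` are hypothetical names, the conditions are plain strings (so, unlike `beam`, nothing here is type-checked), and the thresholds are made up. The point is only that once a query is an ordinary Haskell value, a fragment can be defined once and reused through function application.

```haskell
module Main where

import Data.List (intercalate)

-- Toy illustration only; beam's real query type is richer and fully typed.
data Query = Query
  { querySource :: String,
    queryConditions :: [String]
  }

-- Start a query over a data source, with no conditions yet.
from :: String -> Query
from source = Query source []

-- Narrow any query with one more condition, keeping those it already has.
restrict :: String -> Query -> Query
restrict cond q = q {queryConditions = queryConditions q ++ [cond]}

-- Turn the query value into SQL text.
render :: Query -> String
render (Query source conds)
  | null conds = "SELECT * FROM " <> source
  | otherwise = "SELECT * FROM " <> source <> " WHERE " <> intercalate " AND " conds

-- Reusable fragments, defined once (the thresholds are illustrative).
passing, recent :: Query -> Query
passing = restrict "score >= 50"
recent = restrict "exam_date >= DATE '2026-01-01'"

main :: IO ()
main = do
  let exams = from "read_parquet('scores.parquet')"
  putStrLn (render (passing exams))
  putStrLn (render (recent (passing exams)))
```

With raw SQL strings, reusing `passing` inside another query means splicing text by hand. With query values, it is function composition, and in `beam` the compiler additionally checks that the pieces fit together.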
| 137 | + |
| 138 | +------------------------------ |
| 139 | + |
| 140 | +The `beam-duckdb` backend is very new. Do you want to use this but don't know where to start? Do you wish it supported features that are not currently implemented? Do not hesitate to [raise an issue](https://github.com/haskell-beam/beam/issues/new)! |