boroknagyz/impala-1trc

One Trillion Row Challenge on Apache Impala

Introduction

In February 2024, the Coiled team published their One Trillion Row Challenge, a performance test over a dataset of one trillion (1T) rows. The test measures how quickly a system can read and aggregate one trillion records from a synthetic dataset of temperature measurements.

Apache Impala has accepted the challenge.

Data Generation

One way to try Apache Impala on the challenge is to download the original dataset from the S3 bucket s3://coiled-datasets-rp/1trc. For simplicity, we also provide a few SQL statements that generate the data. You can find them in part-1/data-generation.sql.

Run the challenge

You can find the SQL query of the challenge in part-1/query.sql.
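
For orientation, the original challenge asks for the minimum, mean, and maximum measurement per station, sorted by station name. The sketch below shows that aggregation on a tiny in-memory SQLite table, so it can be tried without a cluster. It is not the query from part-1/query.sql. The table name measurements, the column names station and measure, and the sample rows are assumptions made for this illustration.

```python
import sqlite3

# Toy stand-in for the one-trillion-row table. The table name "measurements",
# the column names "station" and "measure", and these rows are assumptions
# for this sketch, not taken from part-1/query.sql.
ROWS = [
    ("Budapest", 11.5),
    ("Budapest", -3.0),
    ("Budapest", 24.5),
    ("Nairobi", 18.0),
    ("Nairobi", 22.0),
]

# Per-station min / mean / max, sorted by station name.
QUERY = """
SELECT station,
       MIN(measure) AS min_measure,
       AVG(measure) AS mean_measure,
       MAX(measure) AS max_measure
FROM measurements
GROUP BY station
ORDER BY station
"""


def run_toy_challenge(rows):
    """Load rows into an in-memory SQLite table and run the aggregation."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE measurements (station TEXT, measure REAL)")
        conn.executemany("INSERT INTO measurements VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for station, lowest, mean, highest in run_toy_challenge(ROWS):
        print(f"{station}: min={lowest:.1f} mean={mean:.1f} max={highest:.1f}")
```

The same SELECT shape should carry over to Impala, but the authoritative statement and the real table definition are the ones in the part-1 directory.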

Run the extended challenge

After running part-1/data-generation.sql, you can run part-2/data-generation.sql to try the challenge on an Iceberg table with many delete records that must be processed at runtime. Data generation for this part needs a larger cluster. Alternatively, you can split the insertion into the partitioned table across multiple INSERT statements.
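
One possible way to split a large insertion is to generate one INSERT ... SELECT statement per range of some integer column, then run the statements one after another. The helper below is only a sketch of that idea. The names target_tbl, source_tbl, and batch_id are hypothetical and do not come from part-2/data-generation.sql, so adapt them to the actual table definitions before use.

```python
def batched_inserts(target, source, column, low, high, step):
    """Return one INSERT ... SELECT statement per half-open range of `column`.

    The ranges [start, end) cover [low, high) without overlap, so running
    all returned statements copies each source row exactly once.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    statements = []
    for start in range(low, high, step):
        end = min(start + step, high)
        statements.append(
            f"INSERT INTO {target} SELECT * FROM {source} "
            f"WHERE {column} >= {start} AND {column} < {end};"
        )
    return statements


if __name__ == "__main__":
    # Hypothetical names: replace them with the tables and column from the scripts.
    for stmt in batched_inserts("target_tbl", "source_tbl", "batch_id", 0, 10, 4):
        print(stmt)
```

Smaller steps mean more statements but less work per statement, which may help when the cluster cannot handle the whole insertion at once.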
