RLI support for Flink streaming #17452
Replies: 8 comments 26 replies
-
cc @vinothchandar for the review.
-
@HuangZhenQiu as well, along with @prashantwason @suryaprasanna.
-
1. Writing structure and missing details.
2. Precise answers to "What exactly happens during a Flink checkpoint? When exactly do we commit the write?" I don't fully follow this, and I am specifically interested in it. The key problem to solve here IMO is: when a Flink checkpoint happens, it can give us a list of data/MT files written, but for some of the data files returned, the corresponding write to RLI may not have propagated. IIUC, in the proposed approach, you are just writing the MT ahead of writing the data files (just like the Flink state backend approach rn), to keep this synchronized?
3. Caching parts of RLI / partitioned RLI. We are not addressing how we scale the cache, and what is expected to be hot. IMO we need a partitioned RLI, and the cache should be such that the "hot" partition paths' RLI entries are what get cached. Missing these details. I agree that we need some caching, but we need to flesh this out more. We can come back to the caching after we align on the overall approach.
4. Consider Secondary Index also in scope. The current design is not very extensible to secondary index updates; I'd like for that to be included in scope. In the case of SI, the main challenge is that the index entries cannot be produced until we actually perform the writing of the data files. The MT is updated in an operator downstream of the one that writes the data files.
-
My preferred high-level approach here is to decouple the Flink checkpoint from the Hudi commit boundary, i.e. every checkpoint we track the files written under
-
Would you please describe how the existing record index will be loaded into Flink state when an upsert Flink job initially starts and writes to a MOR dataset?
-
|
I am still only focusing on the RLI pieces and the write (not caching, compaction, or SI yet).
Are you basically saying instead of
So - this happens in the same operator that writes the data files?
Your diagram says "shuffled by record key", which is different. Can you clarify - is it
I see a basic conflict here.
For anything we propose around RLI writes, I want to understand how we will write 1 log file per RLI file group (shard) for each commit (we cannot have a lot of small files in RLI). I thought you would do something like (still does not work for SI)
When the job then checkpoints, you know what RLI files were written and what data files were written. You commit both respectively into MT, MT files, and DT. Note that the above does not work with positional updates/deletes, since we don't know the position ahead of time. I want to first understand your proposal; I am not very sure this is the direction we should go.
-
No. This design has to cover both RLI and SI in some form; else, I am happy to take a swing at the design. We should not push forward with an RLI-only approach, since that would render all the SI capability unusable, e.g. when reading from Trino or Spark.
-
@danny0405 @geserdugarov @HuangZhenQiu I am sketching an approach here to seed further discussion. Please take this forward. Assumption:
DAG: The main difference is, instead of special-casing the RLI write, we do it after
In terms of the comments on performance or the slowest operator etc., it's understandable that if there are a lot of SIs to be updated, it will take proportionally longer. But this design will still have similar perf if only RLI is enabled. Once we align on this, let's update the top-level discussion description. We can then move on to discussing the caching design for the BucketAssigner op.

-
Background
Flink does not support RLI while Spark does. This causes inconsistency between engines: for tables migrated from Spark to Flink streaming, the index type needs to be switched to either bucket or flink_state, which is an overhead for users in production.
Another reason is multi-partition upserts: currently the only choice is the flink_state index, but flink_state costs a lot of memory and cannot be shared between different workloads.
Goals
Non Goals
[VC: what are these?]
The Design
The high-level ideas:
`FILES` partition is ready)
The Impl
The Write
The RLI Access
In the `BucketAssigner` operator, the RLI index metadata would be utilized as the index backend: the `BucketAssigner` operator will probe the RLI with the incoming record keys to figure out whether a message is an update, insert, or delete. In other words, the RLI index metadata will serve the same role as the `flink_state` index.
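As a rough sketch of this probe-and-classify step (the class and method names below are hypothetical, not the actual Hudi Flink operators; the probe is abstracted as a function returning the existing file-group location for a key, empty when the key was never indexed):

```java
import java.util.Optional;
import java.util.function.Function;

// Illustrative model of a BucketAssigner-style classification: delete
// messages pass through, otherwise an RLI probe decides insert vs. update.
class RecordClassifier {
    enum Kind { INSERT, UPDATE, DELETE }

    Kind classify(String recordKey, boolean isDeleteMsg,
                  Function<String, Optional<String>> rliProbe) {
        if (isDeleteMsg) {
            // Delete message from upstream: no index probe needed here.
            return Kind.DELETE;
        }
        // Probe the RLI (through whatever cache layers sit in front of it).
        return rliProbe.apply(recordKey).isPresent() ? Kind.UPDATE : Kind.INSERT;
    }
}
```

The probe function would be backed by the cache hierarchy described in the next subsection.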
The Cache of RLI Access
We need fast access in streaming to have high throughput (ideally per-record access should be < 10ms), thus a general hotspot cache is needed. We will build an in-memory LRU cache keyed by the active upsert record keys; cache items will be force-evicted based on a configured memory threshold.
We also need an in-memory cache for the index mappings of the current checkpoint, because they are not yet committed to the Hudi table and are thus invisible.
The query will access the in-memory cache first, then the MDT RLI index:
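A minimal sketch of this lookup order, under some stated assumptions: the uncommitted-checkpoint mappings are consulted first, then the LRU cache, and only then the MDT RLI read (one plausible ordering; the design above does not fix it); eviction here is by entry count rather than the memory threshold the design calls for; all names are illustrative, not Hudi classes.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Two-level in-memory lookup in front of the MDT RLI partition.
class RliLookupCache {
    // Bounded LRU cache (accessOrder = true) over record-key -> location.
    private final LinkedHashMap<String, String> lru;
    // Index mappings written in the current checkpoint, not yet committed.
    private final Map<String, String> uncommitted = new HashMap<>();

    RliLookupCache(final int maxEntries) {
        this.lru = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                // Entry-count stand-in for the configured memory threshold.
                return size() > maxEntries;
            }
        };
    }

    void putUncommitted(String recordKey, String location) {
        uncommitted.put(recordKey, location);
    }

    // Called once the Hudi commit makes the checkpoint's index visible.
    void onCommit() {
        lru.putAll(uncommitted);
        uncommitted.clear();
    }

    // Cache-first lookup; mdtLookup stands in for the slow MDT RLI read.
    Optional<String> lookup(String recordKey, Function<String, String> mdtLookup) {
        String loc = uncommitted.get(recordKey);
        if (loc == null) loc = lru.get(recordKey);
        if (loc == null) {
            loc = mdtLookup.apply(recordKey); // slow path: read MDT RLI
            if (loc != null) lru.put(recordKey, loc);
        }
        return Optional.ofNullable(loc);
    }
}
```

Keeping the uncommitted map separate from the LRU also makes rollback on a failed checkpoint a simple `clear()` of that map.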

The Shuffle of RLI Payloads
In the `StreamWrite` operator, the index items are inferred and sent to the `IndexWrite` operator in streaming style. The index records are shuffled by `hash(record_key) % num_rli_shards` (the same hashing algorithm as the MDT `RLI index` partitioner); this is critical to avoid N*M files being written to the MDT partition (N is the RLI partition bucket number, M is the number of data table buckets involved in the current write).
How do we ensure the data record and index record always belong to one commit/checkpoint? The barrier flows together with the records in Flink (see how-does-state-snapshotting-work): when the `StreamWrite` operator receives a record, it emits the corresponding index record in the same `#processElement` call, so we always keep the binding of the two; in other words, no barrier can end up between them.
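The shard routing above can be sketched as follows. `String.hashCode` stands in for the real partitioner's hash (which must match the MDT RLI partitioner exactly, since every engine has to agree on which shard holds a given key); the class name is illustrative.

```java
// Illustrative routing of index records to RLI shards: a stable hash of the
// record key modulo the number of RLI file groups, so each shard receives all
// records for its keys and one log file per shard per commit suffices.
class RliShardRouter {
    private final int numRliShards;

    RliShardRouter(int numRliShards) {
        this.numRliShards = numRliShards;
    }

    // hash(record_key) % num_rli_shards, kept non-negative via floorMod.
    int shardFor(String recordKey) {
        return Math.floorMod(recordKey.hashCode(), numRliShards);
    }
}
```

In Flink terms this would back a `keyBy`/custom partitioner on the index-record stream between `StreamWrite` and `IndexWrite`.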
The RLI Write
In the `IndexWrite` operator, the index items are buffered first and written to the MDT (triggered by the Flink checkpoint); the write status metadata is then sent to the `coordinator`. The metadata sent to the coordinator includes two parts:
The Commit of MDT (including RLI)
When committing to the data table, the MDT is committed first with the partial RLI/SI write metadata list (the MDT RLI/SI partition file handle info); the `RLI` and `SI` partition file handle info is committed altogether with the `FILES` partition.
On Flink checkpoint: each index/data writing task flushes all its records to RLI and data files respectively, so the RLI and data files are always consistent. We commit both, as we do now, from the coordinator into a single Hudi commit.
In order to keep exactly-once semantics for job recovery, the write status metadata also needs to be stored in the `StreamWrite` operator, the `IndexWrite` operator, and the `coordinator`, pretty much the same behavior as the current maintenance of the data table metadata.
The Compaction
In order not to take up too many task slots, we will reuse the current compaction sub-pipeline for scalable execution of the MDT compaction; it is applied automatically when RLI is enabled.

Open Questions
We need to benchmark the read perf of index items in the BucketAssign op, to see if we need to introduce layered cache strategies similar to RocksDB.
Appendix
SI support
Because SI needs to be figured out on the fly after data files are created, we generally need another SI write op to handle the SI writes; the op parallelism is the same as the SI partition bucket number, and the shuffle strategy should be in line with the MDT SI partitioner.
The partial commit metadata of SI is sent to the coordinator for the MDT commit.
This will increase the checkpoint time a lot and risks checkpoint timeouts and backpressure for high-volume workloads.
One solution is to build the SI async, before queries, but that is another story and not in the scope of this design.