Description
While working with the DLT Meta framework, I encountered a limitation with the sequence_by parameter: the framework does not currently appear to support sequencing by multiple columns.
However, according to the official Databricks Spark Declarative Pipelines documentation, sequence_by does support sequencing by multiple columns via struct(). This handles change events that arrive out of order by ordering on one column and breaking ties with additional columns.
I recommend adding support for multiple columns in sequence_by in the DLT Meta framework to align with the official Spark Declarative Pipeline behavior.
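For reference, this is what the underlying Spark Declarative Pipelines Python API already accepts; a minimal sketch of the desired behavior (table and column names here are hypothetical, and this runs only inside a Databricks pipeline):

```python
import dlt
from pyspark.sql.functions import struct

dlt.create_streaming_table("customers")

# Order CDC events by event_timestamp, breaking ties with sequence_id.
dlt.apply_changes(
    target="customers",
    source="customers_cdc",
    keys=["customer_id"],
    sequence_by=struct("event_timestamp", "sequence_id"),
    stored_as_scd_type=1,
)
```

If DLT Meta could pass a list of columns through its onboarding config and build the struct() internally, it would align with this API.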
Reference:
Databricks Spark Declarative Pipeline documentation:
https://learn.microsoft.com/en-in/azure/databricks/ldp/developer/ldp-python-ref-apply-changes
Excerpt from documentation:
sequence_by | str, col() or struct() | Required.
The column names specifying the logical order of CDC events in the source data. Lakeflow Spark Declarative Pipelines uses this sequencing to handle change events that arrive out of order. The specified column must be a sortable data type. You can specify either:
- A string: "sequenceNum"
- A Spark SQL col() function: col("sequenceNum"). Arguments to col() functions cannot include qualifiers. For example, you can use col(userId), but you cannot use col(source.userId).
- A struct() combining multiple columns to break ties: struct("timestamp_col", "id_col"), it will order by the first struct field first, then by the second field if there's a tie, and so on.
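The tie-breaking described in the excerpt is equivalent to ordering events by a tuple of the struct fields; a plain-Python sketch of the semantics (the event data here is made up for illustration):

```python
# Each CDC event carries (timestamp_col, id_col). Sorting by the tuple
# reproduces struct() sequencing: order by the first field, break ties
# with the second, and so on.
events = [
    {"timestamp_col": 2, "id_col": 1, "op": "b"},
    {"timestamp_col": 1, "id_col": 2, "op": "a"},
    {"timestamp_col": 2, "id_col": 3, "op": "c"},  # ties with "b" on timestamp_col
]

ordered = sorted(events, key=lambda e: (e["timestamp_col"], e["id_col"]))
print([e["op"] for e in ordered])  # ['a', 'b', 'c']

# The event with the highest sequence value represents the latest state.
latest = ordered[-1]
print(latest["op"])  # 'c'
```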