
Support for Multiple Columns in sequence_by in DLT Meta Framework #252

@shishupalgeek

Description

While working with the DLT Meta framework, I encountered a limitation with the sequence_by parameter. Currently, it seems that defining multiple columns for sequencing is not supported in the framework.

However, according to the official Databricks Spark Declarative Pipeline documentation, sequence_by supports sequencing by multiple columns using a struct(). This allows handling change events that arrive out of order by ordering first by one column, then breaking ties with additional columns.

I recommend adding support for multiple columns in sequence_by in the DLT Meta framework to align with the official Spark Declarative Pipeline behavior.
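For illustration, one possible shape for this in the onboarding config would be to let the existing `cdc_apply_changes` `sequence_by` field accept a list of column names in addition to a single string, which the framework would translate into a `struct()` internally. The field names below mirror the current onboarding format, but the list form is purely a hypothetical sketch of the proposal:

```json
"cdc_apply_changes": {
  "keys": ["userId"],
  "scd_type": "2",
  "sequence_by": ["event_timestamp", "sequence_num"]
}
```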

Reference:
Databricks Spark Declarative Pipeline documentation:
https://learn.microsoft.com/en-in/azure/databricks/ldp/developer/ldp-python-ref-apply-changes

Excerpt from documentation:

sequence_by | str, col() or struct() | Required.
The column names specifying the logical order of CDC events in the source data. Lakeflow Spark Declarative Pipelines uses this sequencing to handle change events that arrive out of order. The specified column must be a sortable data type. You can specify either:

  • A string: "sequenceNum"
  • A Spark SQL col() function: col("sequenceNum"). Arguments to col() cannot include qualifiers. For example, you can use col("userId"), but you cannot use col("source.userId").
  • A struct() combining multiple columns to break ties: struct("timestamp_col", "id_col"). Ordering is by the first struct field, then by the second field when there is a tie, and so on.
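To make the tie-breaking semantics concrete, here is a plain-Python sketch (not DLT Meta or Spark code; column names are illustrative) of the ordering that sequence_by=struct("ts", "seq") produces:

```python
# Out-of-order CDC events for a single key. Two events share the same
# timestamp, so the second field ("seq") is needed to break the tie.
events = [
    {"userId": 1, "city": "A", "ts": "2024-01-01T10:00:00", "seq": 2},
    {"userId": 1, "city": "B", "ts": "2024-01-01T10:00:00", "seq": 1},
    {"userId": 1, "city": "C", "ts": "2024-01-01T09:00:00", "seq": 5},
]

# Equivalent of sequence_by=struct("ts", "seq"): order by "ts" first,
# then by "seq" when "ts" is equal.
ordered = sorted(events, key=lambda e: (e["ts"], e["seq"]))

# Under APPLY CHANGES semantics, the highest-sequenced event wins.
latest = ordered[-1]
print(latest["city"])  # → A
```

Note that event C has the highest "seq" value overall, but the first struct field ("ts") dominates, so the 10:00 events outrank it; between those two, "seq" breaks the tie in favor of event A.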
