Question
TL;DR: does pointblank always intend to collect data frames during validation?
My team is wondering if there's a way to change pointblank's behavior so that it doesn't always collect the data frame when running validations. We think this might be a major lift, but maybe it's not as bad as we think. Some of our data sources are simply too big to run collect() on, and beyond that, eager collection creates scalability concerns for our pipeline. We're wondering if there's room to modify that behavior over the mid-to-long term; happy to discuss some of our ideas on how, if that helps :)
As always, thanks for your work on this project, we're really enjoying using pointblank!
Background
Our team has been developing a kedro extension to use pointblank in our data pipeline, and it works very nicely with the YAML workflow! We've been scaling pointblank across all of our datasets, and we're starting to run into problems due to collect().
Most of our data is pretty small, but some of it is big (e.g. transaction data). Our pipeline uses a layered approach: earlier layers apply very general transformations to our data, such as converting data types, modeling into 3rd normal form, etc. These layers don't use filters, and we defer writes if the data is large: a layer simply returns a deferred expression, and subsequent layers in the pipeline add new expressions to the chain.
For the big data, as a starting point we'd love to be able to run table-level assertions without relying on collect() - things like schema enforcement would at least help us insulate our later layers. More generally, we're wondering if the validations could be passed to ibis as expressions, giving us as end users more control over when and where data is collected.