Question
TL;DR: does pointblank always intend to collect data frames during validation?
My team is wondering if there's a way to change pointblank's behavior so that it doesn't always collect the data frame when running validations. We think this might be a major lift, but maybe it's not as bad as we think. Some of our data sources are simply too big to run collect() on, and beyond that, eager collection creates scalability concerns for our pipeline. We're wondering if there's room to modify that behavior over the mid-to-long term; happy to discuss some of our ideas on how, if that helps :)
As always, thanks for your work on this project, we're really enjoying using pointblank!
Background
Our team has been developing a kedro extension to use pointblank in our data pipeline, and it works very nicely with the YAML workflow! We've been scaling pointblank across all of our datasets, and we're starting to run into problems due to collect().
Most of our data is pretty small, but some of it is big (e.g. transaction data). Our pipeline uses a layered approach: earlier layers apply very general transformations to our data, such as converting data types, modeling into 3rd normal form, etc. These layers don't use filters, and we defer writes if the data is large: a layer simply returns a deferred expression, and subsequent layers in the pipeline add new expressions to the chain.
For the big data, as a starting point we'd love to be able to run table-level assertions without relying on collect() - things like schema enforcement would at least help us insulate our later layers. More generally, we're wondering if the validations could be passed to ibis as expressions, giving us as end users more control over when and where data is collected.