CategoricalSampler support #119

@selmanozleyen

Description of feature

Hi @ilan-gold,

Now that we have the Sampler API, we can finally support CategoricalSampler. I have an idea of what this is going to look like.

Here are some of the decisions I've made so far. I'd appreciate any comments on them before I start:

  • I don't want the sampler to take the obs dataframe at this level, so I will assume the datasets are sorted and the category boundaries are known. A higher-level API can do the sorting and pass in the boundaries; this way we also avoid dealing with pandas categoricals. The core algorithm doesn't deal with them either, so this makes sense to me.
  • For my project with cellflow I need a config that ensures one of the following:
    • each batch comes from a single group, but one load request can contain different groups (load request 1 can hold a batch from group A and a batch from group B, but no batch mixes groups)
    • or, as a more natural option for the implementation, each load request contains one group only; the option above should still exist
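To make the first two points concrete, here is a minimal sketch of boundary-based sampling in which every batch comes from a single group. The function name `categorical_batches`, its parameters, the size-proportional category weights, and row sampling with replacement are all assumptions for illustration, not the existing Sampler API.

```python
import numpy as np


def categorical_batches(boundaries, batch_size, n_batches, seed=0):
    """Yield (category, row_indices) pairs; each batch stays inside one category.

    `boundaries` are sorted offsets into a dataset that is already sorted by
    category: category k owns rows [boundaries[k], boundaries[k + 1]).
    """
    rng = np.random.default_rng(seed)
    boundaries = np.asarray(boundaries)
    sizes = np.diff(boundaries)
    # Assumption: categories are drawn proportionally to their size.
    probs = sizes / sizes.sum()
    # First sample the sequence of categories, then fill each batch from it.
    for cat in rng.choice(len(sizes), size=n_batches, p=probs):
        start, stop = boundaries[cat], boundaries[cat + 1]
        # Simplification: rows are drawn with replacement inside the category.
        rows = rng.integers(start, stop, size=batch_size)
        yield int(cat), np.sort(rows)


if __name__ == "__main__":
    # Three categories covering rows 0-4, 5-11 and 12-19.
    for cat, rows in categorical_batches([0, 5, 12, 20], batch_size=4, n_batches=3):
        print(cat, rows)
```

Only the boundaries are passed in, so no obs dataframe or pandas categorical is needed at this level.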

Since we first sample the sequence of groups to draw each batch from, like a,c,b,a,..~cat(a,b,c), it is going to be difficult to mix load requests across multiple workers without interprocess communication. I have one approach to tackle this: if each load request needs two batches, we sample pairs of categories per worker, a,c~cat(a,b); that worker then pops the chunk ids from its own chunk share of groups a and b, prepares them, and yields the load request.

But this leads to a signal coming from the sampling scheme, which isn't desired, because within the same load request only some pairs of rows can ever be seen together. So there is an order signal of "when I see this row in this batch, that means I saw that other row either before or after." I couldn't formulate it precisely yet, but it's there. Without interprocess communication, I think this signal can only go away when load request = batch.
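The per-worker approach above can be sketched roughly as follows. Everything here is hypothetical: the name `worker_load_requests`, the strided split of chunk ids across workers, drawing categories uniformly from those that still have chunks, and treating one chunk id as one batch are assumptions made only to keep the example small.

```python
from collections import deque

import numpy as np


def worker_load_requests(chunks_by_cat, worker_id, n_workers,
                         batches_per_request=2, seed=0):
    """Yield load requests as lists of (category, chunk_id) for one worker."""
    rng = np.random.default_rng(seed + worker_id)
    # Assumption: each worker owns a strided share of every category's chunks.
    share = {}
    for cat, chunk_ids in chunks_by_cat.items():
        own = np.array(chunk_ids[worker_id::n_workers])
        rng.shuffle(own)
        share[cat] = deque(own.tolist())
    while True:
        request = []
        for _ in range(batches_per_request):
            # Only categories that still have chunks in this worker's share.
            available = [c for c in share if share[c]]
            if not available:
                break
            cat = available[rng.integers(len(available))]
            request.append((cat, share[cat].popleft()))
        if not request:
            return
        # The final request may hold fewer batches than requested.
        yield request


if __name__ == "__main__":
    chunks = {"a": [0, 1, 2, 3], "b": [4, 5, 6], "c": [7, 8]}
    for worker in range(2):
        print(worker, list(worker_load_requests(chunks, worker, n_workers=2)))
```

The sketch also shows where the unwanted signal comes from: two chunks can only share a load request if they belong to the same worker's share, so some pairs never co-occur.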

Labels: enhancement (New feature or request)
