Batch Open-vocabulary Detection with Grounding Models #18

# About

Add a batch pipeline that takes:
- (a) an image corpus (a folder of images, or a Parquet file of binary images or URIs), and
- (b) one or more text labels,

and returns detection boxes (with scores and optional masks) for each image/label pair, using an open-vocabulary grounding model such as [OWLv2](https://huggingface.co/docs/transformers/en/model_doc/owlv2).
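As a rough illustration of the folder-input side of this pipeline, the sketch below walks a directory and groups image paths into fixed-size batches. The names `iter_image_paths`, `batched`, and `IMAGE_SUFFIXES` are hypothetical, the suffix list is an assumption, and Parquet input and model inference are left out.

```python
from itertools import islice
from pathlib import Path
from typing import Iterable, Iterator, List

# Assumed set of image extensions; adjust to the corpus at hand.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}


def iter_image_paths(root: Path) -> Iterator[Path]:
    """Yield image files under `root` in sorted order so batches are reproducible."""
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            yield path


def batched(items: Iterable[Path], batch_size: int) -> Iterator[List[Path]]:
    """Group `items` into lists of at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Each batch of paths could then be loaded and passed to the grounding model in a single forward pass. Sorting the paths keeps batch composition stable across runs, which makes results easier to compare.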

# Objective
- [ ] **Support open-vocabulary text prompts**
    - [ ] Single label
    - [ ] Multiple labels

- [ ] **Run efficiently on GPU(s) with batch inference**

- [ ] **Emit results in interoperable formats with a stable schema**
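One possible shape for the "stable schema" objective is a flat record per detection, serialized as JSON Lines. This is only a sketch: the `Detection` class, its field names, and the `mask_rle` field are assumptions for discussion, not a schema this issue has settled on.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Detection:
    """One detected box for one (image, label) pair. Field names are hypothetical."""

    image_uri: str
    label: str
    score: float
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    # Optional mask; the encoding (for example run-length) is left open here.
    mask_rle: Optional[str] = None


def to_jsonl(detections: Iterable[Detection]) -> str:
    """Serialize detections as JSON Lines with sorted keys for a stable column order."""
    return "\n".join(json.dumps(asdict(d), sort_keys=True) for d in detections)
```

Keeping one row per box, rather than nesting boxes under each image, should make the output straightforward to load into Parquet or a dataframe later, at the cost of repeating `image_uri` on every row.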

# Example

Single-label Detection
```
- RGB Image
- Text Label: ["Fish"]
```

<img width="648" height="447" alt="Image" src="https://github.com/user-attachments/assets/b9ee4330-d57d-41d7-948a-92daf49d0978" />


Multi-label Detection

```
- RGB Image
- Text Label: ["coffee mug", "plate", "spoon"]
```

<img width="671" height="669" alt="Image" src="https://github.com/user-attachments/assets/b6d3c168-d166-4abd-9f7a-f338be1c49d3" />
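The single-label and multi-label cases above can share one code path if a lone string is treated as a one-element list. The sketch below normalizes the labels and repeats them once per image in a batch, since the OWLv2 processor in `transformers` takes text queries as one list of strings per image. The helper names `normalize_labels` and `queries_for_batch` are hypothetical.

```python
from typing import List, Sequence, Union


def normalize_labels(labels: Union[str, Sequence[str]]) -> List[str]:
    """Accept a single label or a sequence of labels and return a clean list."""
    if isinstance(labels, str):
        labels = [labels]
    cleaned = [label.strip() for label in labels if label.strip()]
    if not cleaned:
        raise ValueError("at least one non-empty label is required")
    # Drop duplicates while preserving the caller's order.
    return list(dict.fromkeys(cleaned))


def queries_for_batch(
    labels: Union[str, Sequence[str]], batch_size: int
) -> List[List[str]]:
    """Build one list of text queries per image in the batch."""
    normalized = normalize_labels(labels)
    return [list(normalized) for _ in range(batch_size)]
```

With this, `normalize_labels("Fish")` and `normalize_labels(["coffee mug", "plate", "spoon"])` both produce plain lists, so downstream code does not need to branch on the number of labels.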
