Skip to content

LLM-based naming of the spatial domains#48

Open
alihamraoui wants to merge 39 commits intomainfrom
do-annotation
Open

LLM-based naming of the spatial domains#48
alihamraoui wants to merge 39 commits intomainfrom
do-annotation

Conversation

@alihamraoui
Copy link
Copy Markdown
Collaborator

This pull request adds LLM-based domain annotation to Novae.

Use case 1: annotate directly with an API key

from novae.utils import annotate_domains

annotate_domains(
    adata,
    api_key=api_key,
    provider="openai",
    model="gpt-4.1",
)

One annotation is generated per domain and added to adata.obs["novae_domains_X_annotation"].

Use case 2: no API key (manual LLM workflow)

from novae.utils import annotate_domains, add_domain_annotation

payload = annotate_domains(
    adata,
    return_prompt=True,
)

With return_prompt=True, Novae does not call any API. It returns the full request payload (messages and output_schema), which can be copied into any LLM manually.

The LLM should return a structured annotation payload matching the schema. You can then pass that output directly to add_domain_annotation:

annotation = {
    "annotation": [
        {"novae_domains": "D1011", "annotation": "tumor epithelium", "score": 0.89}
    ]
}

add_domain_annotation(
    adata,
    annotation=annotation,
)

This adds one annotation per domain to adata.obs["novae_domains_X_annotation"].

Detailed documentation will be added in a follow-up PR.

@quentinblampey quentinblampey changed the title Do annotation LLM-based naming of the spatial domains Apr 24, 2026
Copy link
Copy Markdown
Collaborator

@quentinblampey quentinblampey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alihamraoui, again it looks really great!

I made many comments, but they are all minor: they concern only syntax and variable naming.

Since this function will be very useful, I think we could import it in the main __init__ file, so that we can call novae.name_domains directly instead of novae.utils.name_domains, what do you think?

NB: can you git pull? I resolved some conflicts with the main branch

Comment thread novae/utils/__init__.py Outdated

def __getattr__(name: str) -> Any:
if name == "annotate_domains":
from ._annotate_domains import annotate_domains
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can do a normal import instead, because the openai and anthropic imports are nested within functions of _annotate_domains.py, so they won't be imported anyway

Comment thread novae/utils/_annotate_domains.py Outdated
"- Do NOT skip any domain. "
"- Do NOT add explanations."
"Return only valid JSON matching the provided schema."
).format(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use f-strings instead of .format? I think it's more readable!

Comment thread novae/utils/_annotate_domains.py Outdated
for domain_id in domain_ids:
domain_id_str = str(domain_id)
pct = proportions.get(domain_id_str, 0.0) * 100
lines.append(f"Domain {domain_id}: {pct:.2f}%")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
lines.append(f"Domain {domain_id}: {pct:.2f}%")
lines.append(f"Domain {domain_id}: {pct:.2%}")

You can use the % formatting from the f-strings
You'll also need to remove the * 100 just above, because the :.2% already handles it

Comment thread novae/utils/_annotate_domains.py Outdated
return api_request_func


def _OpenAI_api_request(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use snakecase, which is the standard for python functions, i.e. _openai_api_request

Comment thread novae/utils/_annotate_domains.py Outdated
raise RuntimeError(f"OpenAI API request failed: {e}") from e


def _Anthropic_api_request(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here: _anthropic_api_request

Comment thread novae/_constants.py Outdated
OPENAI_API_KEY: str = "OPENAI_API_KEY"
ANTHROPIC_API_KEY: str = "ANTHROPIC_API_KEY"
DOMAIN_ANNOTATION: str = "annotation"
DOMAIN_ID: str = "novae_domains"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it not possible to use the already existing DOMAINS_PREFIX instead? It's because there is an underscore _ at the end?
EDIT: actually, I think you can remove it and use obs_key instead of DOMAIN_ID

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just forgot to update this part after the last commits.
bien vu!

Comment thread novae/utils/_utils.py Outdated
"""
Convert rank_genes_groups into dict
"""
names = adata.uns["rank_genes_groups"]["names"]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if we did not run scanpy.tl.rank_genes_groups before?

Shouldn't we call sc.tl.rank_genes_groups(adata, obs_key) ourselves?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ll add a check for "rank_genes_groups" and run the function if it’s not found.
Thanks!

Comment thread novae/utils/_annotate_domains.py Outdated
) -> str:
if api_key is None:
warnings.warn(
f"`api_key` was not provided. Trying environment variable `{env_var}`.",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can remove this warning, because it's the intended behavior, no? And we'll get an error anyway if we don't have the right env variable

Comment thread novae/utils/_annotate_domains.py Outdated
seed=seed,
)

domain_ann = {d[Keys.DOMAIN_ID]: d[Keys.DOMAIN_ANNOTATION] for d in result[Keys.DOMAIN_ANNOTATION]}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can directly create a dataframe out of it:

df_naming = pd.DataFrame.from_records(result["annotation"], index="novae_domains")

And then, you can apply it with:

adata.obs[key_added] = adata.obs[obs_key].map(df_naming["annotation"])

Comment thread novae/utils/_annotate_domains.py Outdated
return pd.DataFrame(result[Keys.DOMAIN_ANNOTATION])


def add_domain_annotation(
Copy link
Copy Markdown
Collaborator

@quentinblampey quentinblampey Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% we need this function, because the naming is just a single line adata.obs[key_added] = adata.obs[obs_key].map(df_naming["annotation"])

Perhaps we can just let the user rename it? For instance, we just show in a tutorial how to transform the dict into a dataframe (as shown in my previous comment), and then the user just has to apply the .map?

This way, we would maybe not even need the key_added in annotate_domains, and it would just return a dataframe (without actually running the .map)?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! I agree.

Comment thread novae/utils/_annotate_domains.py Outdated
schema = {
"type": "object",
"properties": {
Keys.DOMAIN_ANNOTATION: {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need this Keys.DOMAIN_ANNOTATION in the schema? I think you can ask to return the array directly, without nesting it into a dict with one key?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added it in a previous version to return additional metadata, "model_version"...
You’re right a JSON array is simpler :))

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The API requires the top-level schema to be a JSON object
I tried switching to response_format = {"type": "json_object"}, but it’s less reliable since the output isn’t schema-validated, can lead to parsing errors, and not all models support it.

I think it’s better to keep the array inside a top-level property.

Comment thread novae/utils/_annotate_domains.py Outdated
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need the .astype(str)?

Comment thread novae/utils/_annotate_domains.py Outdated
pct = proportions.get(domain_id_str, 0.0) * 100
lines.append(f"Domain {domain_id}: {pct:.2f}%")
pct = proportions.get(domain_id_str, 0.0)
lines.append(f"Domain {domain_id}: {pct:.2%}")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can do it in one line using:

lines = [
    f"Domain {domain_id}: {pct:.2%}"
    for domain_id, pct in adata.obs[obs_key].value_counts(normalize=True).items()
]

Comment thread novae/utils/_annotate_domains.py Outdated

key_added = f"{obs_key}_{Keys.DOMAIN_ANNOTATION}" if key_added is None else key_added

gene_marker_dict = utils.markers_as_dict(adata, n_genes)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since markers_as_dict is only used there, maybe the function can be moved into this file as well?

Comment thread novae/utils/_annotate_domains.py Outdated
if "rank_genes_groups" in adata.uns:
names = adata.uns["rank_genes_groups"]["names"][:n_genes]
return {domain: list(names[domain]) for domain in domain_ids}
if "rank_genes_groups" not in adata.uns:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you also need to check that it was grouped by the right obs_key
Else, if the user already ran rank_genes_groups to compare cell-types (and not domains), it will load the DEGs between cell-types instead of domains

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants