LLM-based naming of the spatial domains by alihamraoui · Pull Request #48 · prism-oncology/novae

alihamraoui · 2026-04-23T15:04:58Z

This pull request adds LLM-based domain annotation to Novae.

Use case 1: annotate directly with an API key

from novae.utils import annotate_domains

annotate_domains(
    adata,
    api_key=api_key,
    provider="openai",
    model="gpt-4.1",
)

One annotation is generated per domain and added to adata.obs["novae_domains_X_annotation"].

Use case 2: no API key (manual LLM workflow)

from novae.utils import annotate_domains, add_domain_annotation

payload = annotate_domains(
    adata,
    return_prompt=True,
)

With return_prompt=True, Novae does not call any API. It returns the full request payload (messages and output_schema), which can be copied into any LLM manually.

The LLM should return a structured annotation payload matching the schema. You can then pass that output directly to add_domain_annotation:

annotation = {
    "annotation": [
        {"novae_domains": "D1011", "annotation": "tumor epithelium", "score": 0.89}
    ]
}

add_domain_annotation(
    adata,
    annotation=annotation,
)

This adds one annotation per domain to adata.obs["novae_domains_X_annotation"].

Detailed documentation will be added in a follow-up PR.

quentinblampey

Thanks @alihamraoui, again it looks really great!

I made many comments, but they are all minor: they concern only syntax and variable naming.

Since this function will be very useful, I think we could import it in the main __init__ file, so that we can call novae.name_domains directly instead of novae.utils.name_domains. What do you think?

NB: can you git pull? I resolved some conflicts with the main branch

quentinblampey · 2026-04-24T06:52:39Z

+
+def __getattr__(name: str) -> Any:
+    if name == "annotate_domains":
+        from ._annotate_domains import annotate_domains


You can do a normal import instead, because the openai and anthropic imports are nested within functions of _annotate_domains.py, so they won't be imported anyway
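To illustrate the point above, here is a minimal sketch showing that an import nested inside a function body only runs when the function is called, not when the enclosing module is imported. The module name `some_optional_sdk_that_is_not_installed` and both function names are hypothetical stand-ins, not code from this PR.

```python
def call_provider(prompt: str) -> str:
    # Hypothetical optional dependency standing in for `openai` / `anthropic`.
    # The import statement below runs only when this function is called.
    import some_optional_sdk_that_is_not_installed as sdk

    return sdk.complete(prompt)


def provider_available() -> bool:
    """Return True if calling `call_provider` gets past its nested import."""
    try:
        call_provider("hello")
    except ImportError:
        return False
    return True
```

Defining `call_provider` succeeds even though the dependency is missing, so a module containing it can be imported normally from `__init__`.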

quentinblampey · 2026-04-24T06:53:21Z

+        "- Do NOT skip any domain. "
+        "- Do NOT add explanations."
+        "Return only valid JSON matching the provided schema."
+    ).format(


Could you use f-strings instead of .format? I think it's more readable!

quentinblampey · 2026-04-24T06:55:59Z

+    for domain_id in domain_ids:
+        domain_id_str = str(domain_id)
+        pct = proportions.get(domain_id_str, 0.0) * 100
+        lines.append(f"Domain {domain_id}: {pct:.2f}%")


Suggested change

-lines.append(f"Domain {domain_id}: {pct:.2f}%")
+lines.append(f"Domain {domain_id}: {pct:.2%}")

You can use the % formatting from the f-strings
You'll also need to remove the * 100 just above, because the :.2% already handles it
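A quick sketch of the equivalence, using an illustrative proportion (the value 0.1234 is made up for the example):

```python
pct = 0.1234  # a proportion in [0, 1], not yet multiplied by 100

# Manual scaling followed by fixed-point formatting and a literal percent sign.
old_style = f"Domain D1011: {pct * 100:.2f}%"

# The `%` presentation type multiplies by 100 and appends the percent sign itself.
new_style = f"Domain D1011: {pct:.2%}"
```

Both strings come out as `Domain D1011: 12.34%`, which is why the `* 100` must be dropped once `:.2%` is used.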

quentinblampey · 2026-04-24T06:58:03Z

+    return api_request_func
+
+
+def _OpenAI_api_request(


Please use snake_case, which is the standard for Python functions, i.e. _openai_api_request

quentinblampey · 2026-04-24T06:58:29Z

+        raise RuntimeError(f"OpenAI API request failed: {e}") from e
+
+
+def _Anthropic_api_request(


Same here: _anthropic_api_request

quentinblampey · 2026-04-24T07:08:12Z

+    OPENAI_API_KEY: str = "OPENAI_API_KEY"
+    ANTHROPIC_API_KEY: str = "ANTHROPIC_API_KEY"
+    DOMAIN_ANNOTATION: str = "annotation"
+    DOMAIN_ID: str = "novae_domains"


Is it not possible to use the already existing DOMAINS_PREFIX instead? Is it because there is a trailing underscore _?
EDIT: actually, I think you can remove it and use obs_key instead of DOMAIN_ID

I just forgot to update this part after the last commits.
Well spotted!

quentinblampey · 2026-04-24T07:09:02Z

+    """
+    Convert rank_genes_groups into dict
+    """
+    names = adata.uns["rank_genes_groups"]["names"]


What happens if scanpy.tl.rank_genes_groups was not run beforehand?

Shouldn't we call sc.tl.rank_genes_groups(adata, obs_key) ourselves?

I’ll add a check for "rank_genes_groups" and run the function if it’s not found.
Thanks!

quentinblampey · 2026-04-24T07:14:18Z

+) -> str:
+    if api_key is None:
+        warnings.warn(
+            f"`api_key` was not provided. Trying environment variable `{env_var}`.",


I think you can remove this warning, because it's the intended behavior, no? And we'll get an error anyway if we don't have the right env variable
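A minimal sketch of the behavior described here: silently fall back to the environment variable, and raise only when neither source provides a key. The function name `resolve_api_key` and the error message are assumptions for illustration, not the PR's actual code.

```python
import os
from typing import Optional


def resolve_api_key(api_key: Optional[str], env_var: str) -> str:
    """Return the explicit key if given, else fall back to the environment variable."""
    if api_key is not None:
        return api_key
    try:
        # Intended fallback, so no warning is emitted here.
        return os.environ[env_var]
    except KeyError:
        raise ValueError(
            f"No `api_key` provided and environment variable `{env_var}` is not set."
        ) from None
```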

quentinblampey · 2026-04-24T07:24:16Z

+        seed=seed,
+    )
+
+    domain_ann = {d[Keys.DOMAIN_ID]: d[Keys.DOMAIN_ANNOTATION] for d in result[Keys.DOMAIN_ANNOTATION]}


You can directly create a dataframe out of it:

df_naming = pd.DataFrame.from_records(result["annotation"], index="novae_domains")

And then, you can apply it with:

adata.obs[key_added] = adata.obs[obs_key].map(df_naming["annotation"])
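The two lines above can be tried on toy data as follows. This is only a sketch: a plain DataFrame stands in for `adata.obs`, and the column names `novae_domains_7` and `domain_label`, the second record, and its score are illustrative values that do not come from the PR.

```python
import pandas as pd

# Same shape as the structured LLM output shown in the PR description.
result = {
    "annotation": [
        {"novae_domains": "D1011", "annotation": "tumor epithelium", "score": 0.89},
        {"novae_domains": "D1012", "annotation": "stroma", "score": 0.75},
    ]
}

# Stand-in for `adata.obs`; the column name plays the role of `obs_key`.
obs = pd.DataFrame({"novae_domains_7": ["D1011", "D1012", "D1011"]})

# One row per domain, indexed by the domain ID.
df_naming = pd.DataFrame.from_records(result["annotation"], index="novae_domains")

# Map each cell's domain ID to its label through the index of `df_naming`.
obs["domain_label"] = obs["novae_domains_7"].map(df_naming["annotation"])
```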

quentinblampey · 2026-04-24T07:27:51Z

+    return pd.DataFrame(result[Keys.DOMAIN_ANNOTATION])
+
+
+def add_domain_annotation(


I'm not 100% sure we need this function, because the naming is just a single line: adata.obs[key_added] = adata.obs[obs_key].map(df_naming["annotation"])

Perhaps we can just let the user rename it? For instance, we just show in a tutorial how to transform the dict into a dataframe (as shown in my previous comment), and then the user just has to apply the .map?

This way, we would maybe not even need the key_added in annotate_domains, and it would just return a dataframe (without actually running the .map)?

Awesome! I agree.

quentinblampey · 2026-04-24T07:35:48Z

+    schema = {
+        "type": "object",
+        "properties": {
+            Keys.DOMAIN_ANNOTATION: {


Do you need this Keys.DOMAIN_ANNOTATION in the schema? I think you can ask to return the array directly, without nesting it into a dict with one key?

I added it in a previous version to return additional metadata, such as "model_version".
You're right, a JSON array is simpler :))

The API requires the top-level schema to be a JSON object.
I tried switching to response_format = {"type": "json_object"}, but it's less reliable: the output isn't schema-validated, which can lead to parsing errors, and not all models support it.

I think it’s better to keep the array inside a top-level property.
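For reference, a sketch of what "the array inside a top-level property" can look like. The field names are taken from the example payload in the PR description; the exact schema used in the PR is not shown on this page, so the `required` lists and the helper `unwrap_annotations` are assumptions.

```python
import json

# Illustrative schema: the array of per-domain annotations is nested under a
# single top-level property, so the top-level type stays "object".
schema = {
    "type": "object",
    "properties": {
        "annotation": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "novae_domains": {"type": "string"},
                    "annotation": {"type": "string"},
                    "score": {"type": "number"},
                },
                "required": ["novae_domains", "annotation", "score"],
            },
        }
    },
    "required": ["annotation"],
}


def unwrap_annotations(raw_response: str) -> list:
    """Parse the JSON text returned by the model and return the inner array."""
    return json.loads(raw_response)["annotation"]
```

The caller pays one extra key lookup, in exchange for a response that can be validated against the schema.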

quentinblampey · 2026-04-30T08:08:50Z

Do you need the .astype(str)?

quentinblampey · 2026-04-30T08:09:33Z

-        pct = proportions.get(domain_id_str, 0.0) * 100
-        lines.append(f"Domain {domain_id}: {pct:.2f}%")
+        pct = proportions.get(domain_id_str, 0.0)
+        lines.append(f"Domain {domain_id}: {pct:.2%}")


I think you can do it in one line using:

lines = [ f"Domain {domain_id}: {pct:.2%}" for domain_id, pct in adata.obs[obs_key].value_counts(normalize=True).items() ]
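A runnable version of that one-liner on toy data, assuming a plain DataFrame in place of `adata.obs` and an illustrative column name `novae_domains_7`:

```python
import pandas as pd

# Stand-in for `adata.obs`: three cells in D1011, one in D1012.
obs = pd.DataFrame({"novae_domains_7": ["D1011", "D1011", "D1011", "D1012"]})

# `value_counts(normalize=True)` returns proportions, sorted from most to
# least frequent; `.items()` yields (domain_id, proportion) pairs.
lines = [
    f"Domain {domain_id}: {pct:.2%}"
    for domain_id, pct in obs["novae_domains_7"].value_counts(normalize=True).items()
]
```

One difference from the loop version is that only domains present in the column are listed, so no domain is reported with 0.00%.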

quentinblampey · 2026-04-30T08:11:37Z


-    key_added = f"{obs_key}_{Keys.DOMAIN_ANNOTATION}" if key_added is None else key_added
-
    gene_marker_dict = utils.markers_as_dict(adata, n_genes)


Since markers_as_dict is only used there, maybe the function can be moved into this file as well?

quentinblampey · 2026-04-30T11:48:09Z

-    if "rank_genes_groups" in adata.uns:
-        names = adata.uns["rank_genes_groups"]["names"][:n_genes]
-        return {domain: list(names[domain]) for domain in domain_ids}
+    if "rank_genes_groups" not in adata.uns:


I think you also need to check that it was grouped by the right obs_key
Else, if the user already ran rank_genes_groups to compare cell-types (and not domains), it will load the DEGs between cell-types instead of domains
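One possible shape for that check, as a sketch only: scanpy stores the grouping column of the last `rank_genes_groups` run under `adata.uns["rank_genes_groups"]["params"]["groupby"]`, so comparing it with `obs_key` detects a stale result. The helper name `needs_ranking` is hypothetical, and it takes the `uns` mapping directly so it can be tried without an AnnData object.

```python
def needs_ranking(uns: dict, obs_key: str) -> bool:
    """Return True if `rank_genes_groups` must be (re)computed for `obs_key`."""
    ranking = uns.get("rank_genes_groups")
    if ranking is None:
        # The ranking was never run.
        return True
    # A ranking exists, but it may have been grouped by another column
    # (e.g. cell types instead of domains).
    return ranking.get("params", {}).get("groupby") != obs_key
```

It could then guard the call, e.g. `if needs_ranking(adata.uns, obs_key): sc.tl.rank_genes_groups(adata, obs_key)`.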

alihamraoui added 22 commits April 9, 2026 15:54

Add LLM-based annotation for Novae domains

403b8ae

Add OpenAI annotation function

481c22f

fix api_key docstring

3fecb3e

fix response_format ans schema typing

f1b87f6

update domain_annotation key

48d7b1d

ruff-formating

87da85e

fix lazy import for OpenAI module

77425a6

assert when domain_key is None

6363297

fix domain annotation added key to fit pathway_scores plot function

23c89a2

use obs_key option instead of domain_key

37f8281

fix(plot): respect show=False in pathway_scores heatmap

815f9b6

Add pathway scores to the prompt

8a485f8

return an annotation confidence score

9191d6d

fix pathway scores format

b20372a

add Anthropic as llm provider to use claude as option

990ae13

fix: lazy import for anthropic module

03e23a7

model.annotate_domains returns DataFrame of novae_domains, annotation, annotation_conf_score

98dd910

fix(types): use dict return type in annotate_domains

42bdeeb

fix plt show

4d26002

add return_prompt option

1832127

Add a helper function to easily add domain annotations to AnnData when only the prompt output is returned

dda5706

fix lazy import for annotate_domains

833ab11

alihamraoui requested a review from quentinblampey as a code owner April 23, 2026 15:04

add Cell percentages by domain

f4a8794

quentinblampey changed the title ~~Do annotation~~ LLM-based naming of the spatial domains Apr 24, 2026

Merge branch 'main' into do-annotation

025f322

quentinblampey requested changes Apr 24, 2026

fix pre-commit

add3a7e

quentinblampey reviewed Apr 24, 2026

deprecate add_domain_annotation

13c8493

alihamraoui added 5 commits April 29, 2026 11:55

Add niche naming rule to annotation prompt

cdd9f8a

Normal import of annotate_domains

dba34ed

use the % formatting from the f-strings for domain_cell_percentages

3888fae

Rename domain labeling function from novae.utils.annotate_domains to novae.label_domains

26983a1

Return DataFrame or prompt payload; labels are no longer added to AnnData

c711f40

quentinblampey reviewed Apr 30, 2026

alihamraoui added 6 commits April 30, 2026 12:34

Run sc.tl.rank_genes_groups in label_domains

250fb89

Refactor _format_domain_cell_percentages for readability (thanks Quentin)

f9b57fa

use snakecase

eb7146f

remove warning for api-key env variable

023599d

Keep only gene markers with positive logFC

afd63d1

fix _markers_as_dict

0f63e0c

quentinblampey reviewed Apr 30, 2026

alihamraoui added 2 commits April 30, 2026 14:06

check if rank_genes_groups was grouped by obs_key

614f39b

Remove max_tokens for compatibility with latest OpenAI model

6b8c600

2 participants

quentinblampey left a comment •

edited

Loading

quentinblampey Apr 24, 2026 •

edited

Loading