LLM-based naming of the spatial domains#48
Conversation
…, annotation_conf_score
…n only the prompt output is returned
There was a problem hiding this comment.
Thanks @alihamraoui, again it looks really great!
I made many comments, but they are all minor: they concern only syntax and variable naming.
Since this function will be very useful, I think we could import it in the main __init__ file, so that we can call novae.name_domains directly instead of novae.utils.name_domains, what do you think?
NB: can you git pull? I resolved some conflicts with the main branch
|
|
||
| def __getattr__(name: str) -> Any: | ||
| if name == "annotate_domains": | ||
| from ._annotate_domains import annotate_domains |
There was a problem hiding this comment.
You can do a normal import instead, because the openai and anthropic imports are nested within functions of _annotate_domains.py, so they won't be imported anyway
| "- Do NOT skip any domain. " | ||
| "- Do NOT add explanations." | ||
| "Return only valid JSON matching the provided schema." | ||
| ).format( |
There was a problem hiding this comment.
Could you use f-strings instead of .format? I think it's more readable!
| for domain_id in domain_ids: | ||
| domain_id_str = str(domain_id) | ||
| pct = proportions.get(domain_id_str, 0.0) * 100 | ||
| lines.append(f"Domain {domain_id}: {pct:.2f}%") |
There was a problem hiding this comment.
| lines.append(f"Domain {domain_id}: {pct:.2f}%") | |
| lines.append(f"Domain {domain_id}: {pct:.2%}") |
You can use the % formatting from the f-strings
You'll also need to remove the * 100 just above, because the :.2% already handles it
| return api_request_func | ||
|
|
||
|
|
||
| def _OpenAI_api_request( |
There was a problem hiding this comment.
Please use snakecase, which is the standard for python functions, i.e. _openai_api_request
| raise RuntimeError(f"OpenAI API request failed: {e}") from e | ||
|
|
||
|
|
||
| def _Anthropic_api_request( |
There was a problem hiding this comment.
Same here: _anthropic_api_request
| OPENAI_API_KEY: str = "OPENAI_API_KEY" | ||
| ANTHROPIC_API_KEY: str = "ANTHROPIC_API_KEY" | ||
| DOMAIN_ANNOTATION: str = "annotation" | ||
| DOMAIN_ID: str = "novae_domains" |
There was a problem hiding this comment.
Is it not possible to use the already existing DOMAINS_PREFIX instead? It's because there is an underscore _ at the end?
EDIT: actually, I think you can remove it and use obs_key instead of DOMAIN_ID
There was a problem hiding this comment.
Just forgot to update this part after the last commits.
bien vu!
| """ | ||
| Convert rank_genes_groups into dict | ||
| """ | ||
| names = adata.uns["rank_genes_groups"]["names"] |
There was a problem hiding this comment.
What happens if we did not run scanpy.tl.rank_genes_groups before?
Shouldn't we call sc.tl.rank_genes_groups(adata, obs_key) ourselves?
There was a problem hiding this comment.
I’ll add a check for "rank_genes_groups" and run the function if it’s not found.
Thanks!
| ) -> str: | ||
| if api_key is None: | ||
| warnings.warn( | ||
| f"`api_key` was not provided. Trying environment variable `{env_var}`.", |
There was a problem hiding this comment.
I think you can remove this warning, because it's the intended behavior, no? And we'll get an error anyway if we don't have the right env variable
| seed=seed, | ||
| ) | ||
|
|
||
| domain_ann = {d[Keys.DOMAIN_ID]: d[Keys.DOMAIN_ANNOTATION] for d in result[Keys.DOMAIN_ANNOTATION]} |
There was a problem hiding this comment.
You can directly create a dataframe out of it:
df_naming = pd.DataFrame.from_records(result["annotation"], index="novae_domains")And then, you can apply it with:
adata.obs[key_added] = adata.obs[obs_key].map(df_naming["annotation"])| return pd.DataFrame(result[Keys.DOMAIN_ANNOTATION]) | ||
|
|
||
|
|
||
| def add_domain_annotation( |
There was a problem hiding this comment.
I'm not 100% we need this function, because the naming is just a single line adata.obs[key_added] = adata.obs[obs_key].map(df_naming["annotation"])
Perhaps we can just let the user rename it? For instance, we just show in a tutorial how to transform the dict into a dataframe (as shown in my previous comment), and then the user just has to apply the .map?
This way, we would maybe not even need the key_added in annotate_domains, and it would just return a dataframe (without actually running the .map)?
There was a problem hiding this comment.
Awesome! I agree.
| schema = { | ||
| "type": "object", | ||
| "properties": { | ||
| Keys.DOMAIN_ANNOTATION: { |
There was a problem hiding this comment.
Do you need this Keys.DOMAIN_ANNOTATION in the schema? I think you can ask to return the array directly, without nesting it into a dict with one key?
There was a problem hiding this comment.
I added it in a previous version to return additional metadata, "model_version"...
You’re right a JSON array is simpler :))
There was a problem hiding this comment.
The API requires the top-level schema to be a JSON object
I tried switching to response_format = {"type": "json_object"}, but it’s less reliable since the output isn’t schema-validated, can lead to parsing errors, and not all models support it.
I think it’s better to keep the array inside a top-level property.
…novae.label_domains
There was a problem hiding this comment.
Do you need the .astype(str)?
| pct = proportions.get(domain_id_str, 0.0) * 100 | ||
| lines.append(f"Domain {domain_id}: {pct:.2f}%") | ||
| pct = proportions.get(domain_id_str, 0.0) | ||
| lines.append(f"Domain {domain_id}: {pct:.2%}") |
There was a problem hiding this comment.
I think you can do it in one line using:
lines = [
f"Domain {domain_id}: {pct:.2%}"
for domain_id, pct in adata.obs[obs_key].value_counts(normalize=True).items()
]|
|
||
| key_added = f"{obs_key}_{Keys.DOMAIN_ANNOTATION}" if key_added is None else key_added | ||
|
|
||
| gene_marker_dict = utils.markers_as_dict(adata, n_genes) |
There was a problem hiding this comment.
Since markers_as_dict is only used there, maybe the function can be moved into this file as well?
| if "rank_genes_groups" in adata.uns: | ||
| names = adata.uns["rank_genes_groups"]["names"][:n_genes] | ||
| return {domain: list(names[domain]) for domain in domain_ids} | ||
| if "rank_genes_groups" not in adata.uns: |
There was a problem hiding this comment.
I think you also need to check that it was grouped by the right obs_key
Else, if the user already ran rank_genes_groups to compare cell-types (and not domains), it will load the DEGs between cell-types instead of domains
This pull request adds LLM-based domain annotation to Novae.
Use case 1: annotate directly with an API key
One annotation is generated per domain and added to
adata.obs["novae_domains_X_annotation"].Use case 2: no API key (manual LLM workflow)
With
return_prompt=True, Novae does not call any API. It returns the full request payload (messages and output_schema), which can be copied into any LLM manually.The LLM should return a structured annotation payload matching the schema. You can then pass that output directly to add_domain_annotation:
This adds one annotation per domain to
adata.obs["novae_domains_X_annotation"].Detailed documentation will be added in a follow-up PR.