
chore: naming consistency for training/synthetic/test datasets#337

Open
nina-xu wants to merge 7 commits into main from ninaxu/143-dataset-naming-consistency

Conversation

@nina-xu
Contributor

@nina-xu nina-xu commented Apr 1, 2026

Summary

Standardize dataset naming conventions throughout the codebase, documentation, and quality report for consistency and clarity.

Canonical naming convention

| Dataset | Old names | New name |
| --- | --- | --- |
| Training data | reference, real, training, train, df1, original, df_all | training |
| Synthetic data | output, synthetic, synth, df2 | synthetic |
| Test holdout | test | test (unchanged) |
| Full data before split | df | input |
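The mapping above can be sketched as a small lookup table. This is an illustrative sketch only, not code from the PR; the `CANONICAL_NAMES` dict and `canonicalize` helper are hypothetical names:

```python
# Hypothetical mapping from legacy dataset names to the canonical convention.
CANONICAL_NAMES = {
    "reference": "training", "real": "training", "train": "training",
    "df1": "training", "original": "training", "df_all": "training",
    "output": "synthetic", "synth": "synthetic", "df2": "synthetic",
    "test": "test",
    "df": "input",
}

def canonicalize(name: str) -> str:
    """Return the canonical dataset name; already-canonical names pass through."""
    return CANONICAL_NAMES.get(name, name)
```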

Variable naming conventions

  • Pydantic model fields: short form (training, synthetic)
  • pandas DataFrame variables/params: _df suffix (training_df, synthetic_df, test_df)
  • Embedding DataFrames: _embd_df suffix (training_embd_df, synthetic_embd_df)
  • Parameter ordering: always training, synthetic, test when multiple datasets appear together
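Taken together, the conventions above imply signatures like the following. This is a hedged sketch, not the project's actual API; the function name `evaluate_datasets` is invented for illustration:

```python
# Hypothetical signature illustrating the conventions:
# DataFrame-typed parameters carry the `_df` suffix, and datasets
# always appear in the canonical order: training, synthetic, test.
def evaluate_datasets(training_df, synthetic_df, test_df):
    """Return per-dataset row counts (pandas DataFrames and plain
    sequences both support len(), so lists work for this sketch)."""
    return {
        "training": len(training_df),
        "synthetic": len(synthetic_df),
        "test": len(test_df),
    }
```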

Other fixes for clarity

  • Renamed class EvaluationDataset → EvaluationDatasets; file renamed to evaluation_datasets.py
  • Renamed mode parameter → based_on in column-type query methods (get_tabular_columns, etc.)
  • Fixed some inaccuracies in the evaluation HTML report while I was at it

Pre-Review Checklist

Ensure that the following pass:

  • make format && make check or via prek validation.
  • make test passes locally
  • make test-e2e passes locally
  • make test-ci-container passes locally (recommended)
  • GPU CI status check passes -- comment /sync on this PR to trigger a run (auto-triggers on ready-for-review)

Pre-Merge Checklist

  • New or updated tests for any fix or new behavior
  • Updated documentation for new features and behaviors, including docstrings for API docs.

Testing Plan

  • make test-e2e
  • run a few end-to-end jobs via the SDK/CLI and inspect the new evaluation report, evaluation_report.html
  • run Slurm jobs and compare against a baseline to ensure no implementation change was accidentally introduced

Signed-off-by: Nina Xu <19981858+nina-xu@users.noreply.github.com>
@nina-xu nina-xu changed the title naming consistency for training/synthetic/test datasets chore: naming consistency for training/synthetic/test datasets Apr 1, 2026
nina-xu added 3 commits April 1, 2026 12:47
Signed-off-by: Nina Xu <19981858+nina-xu@users.noreply.github.com>
Signed-off-by: Nina Xu <19981858+nina-xu@users.noreply.github.com>
Signed-off-by: Nina Xu <19981858+nina-xu@users.noreply.github.com>
```diff
     raise ValueError(f"{df_name} is empty!")

-def get_columns_of_type(self, types: set[FieldType], mode="reference") -> list[str]:
+def get_columns_of_type(self, types: set[FieldType], based_on="training") -> list[str]:
```
Contributor Author


Changed the argument name here because mode="training" feels misleading.
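The rename reads naturally in use. Below is a minimal sketch of the idea, not the real EvaluationDatasets model: the class body, constructor, and FieldType values here are assumptions made for illustration:

```python
from enum import Enum

class FieldType(Enum):
    # Simplified stand-in for the project's FieldType enum.
    NUMERIC = "numeric"
    TEXT = "text"

class EvaluationDatasets:
    # Minimal sketch of the renamed model; the internal layout is invented.
    def __init__(self, column_types_by_dataset):
        # e.g. {"training": {"age": FieldType.NUMERIC}, "synthetic": {...}}
        self._column_types = column_types_by_dataset

    def get_columns_of_type(self, types, based_on="training"):
        # `based_on` names the dataset whose schema is consulted; the old
        # `mode` parameter read as if it switched the method's behavior.
        return [col for col, t in self._column_types[based_on].items() if t in types]
```

Call sites then read as "columns of this type, based on the training schema": `ds.get_columns_of_type({FieldType.NUMERIC}, based_on="synthetic")`.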

nina-xu added 3 commits April 2, 2026 06:51
Signed-off-by: Nina Xu <19981858+nina-xu@users.noreply.github.com>
Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com>
Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com>
@nina-xu nina-xu marked this pull request as ready for review April 2, 2026 15:18
@nina-xu nina-xu requested review from a team as code owners April 2, 2026 15:18
Copilot AI review requested due to automatic review settings April 2, 2026 15:18
@nina-xu nina-xu requested a review from alexahaushalter April 2, 2026 15:21
Contributor

Copilot AI left a comment


Pull request overview

This PR standardizes dataset naming across the project (code, tests, docs, and the evaluation HTML report) to consistently use training, synthetic, test, and input (pre-split) terminology.

Changes:

  • Renames evaluation data model EvaluationDataset → EvaluationDatasets and propagates the new training/synthetic naming through evaluation components, reports, and templates.
  • Updates holdout splitting APIs/params and training pipeline variables to use training_df / synthetic_df naming consistently.
  • Refreshes tests and documentation/report copy to reflect the new canonical naming and labels.

Reviewed changes

Copilot reviewed 49 out of 49 changed files in this pull request and generated 3 comments.

Summary per file:

| File | Description |
| --- | --- |
| tests/stub_datasets/licenses.md | Adds SPDX headers for test stub dataset licensing doc. |
| tests/smoke/test_evaluation_cpu.py | Updates report construction args to training/synthetic. |
| tests/sdk/test_process_data.py | Renames builder internals to _training_df / _original_training_df in tests. |
| tests/holdout/test_holdout.py | Updates holdout split call to input_df=.... |
| tests/evaluation/test_render.py | Updates fixtures/kwargs to training/synthetic. |
| tests/evaluation/reports/test_multimodal_report.py | Updates report args to training/synthetic. |
| tests/evaluation/datamodel/test_evaluation_datasets.py | New tests for EvaluationDatasets model behavior. |
| tests/evaluation/datamodel/test_evaluation_dataset.py | Removes obsolete tests for old EvaluationDataset model. |
| tests/evaluation/conftest.py | Renames fixtures and switches to EvaluationDatasets. |
| tests/evaluation/components/test_pii_replay.py | Updates PII replay construction and field names. |
| tests/evaluation/components/test_multi_modal_figures.py | Updates figure trace naming ("Training", "Synthetic"). |
| tests/evaluation/components/test_membership_inference_protection.py | Updates MIA tests to new model/constructor names. |
| tests/evaluation/components/test_deep_structure.py | Updates PCA outputs to training/synthetic naming. |
| tests/evaluation/components/test_correlation.py | Updates correlation outputs to training/synthetic naming. |
| tests/evaluation/components/test_composite_score.py | Updates component construction helper fixture usage. |
| tests/evaluation/components/test_column_distribution.py | Updates column distribution construction and inputs. |
| tests/evaluation/components/test_attribute_inference_protection.py | Updates AIA construction to new model/constructor names. |
| tests/data_processing/test_assembler.py | Updates assembler attribute naming (training_dataset). |
| src/nemo_safe_synthesizer/utils.py | Updates wrapper to pass input_df to holdout module. |
| src/nemo_safe_synthesizer/training/huggingface_backend.py | Renames internal training/test df fields; updates trainer result payload. |
| src/nemo_safe_synthesizer/training/backend.py | Renames df_train → training_df and df_test → test_df in base types. |
| src/nemo_safe_synthesizer/sdk/library_builder.py | Renames internal cached split fields and evaluator arg to training_df. |
| src/nemo_safe_synthesizer/holdout/holdout.py | Renames split fns/params to input_df and returns training_df. |
| src/nemo_safe_synthesizer/evaluation/statistics/stats.py | Renames parameters for memorization/PCA helpers to training/synthetic terminology. |
| src/nemo_safe_synthesizer/evaluation/reports/multimodal/multimodal_report.py | Switches report assembly to EvaluationDatasets and new component constructors. |
| src/nemo_safe_synthesizer/evaluation/evaluator.py | Renames evaluator input from train_df to training_df. |
| src/nemo_safe_synthesizer/evaluation/data_model/evaluation_report.py | Renames report field to evaluation_datasets. |
| src/nemo_safe_synthesizer/evaluation/data_model/evaluation_field.py | Renames field feature/distribution attributes to training/synthetic. |
| src/nemo_safe_synthesizer/evaluation/data_model/evaluation_datasets.py | Renames and updates evaluation data model; renames mode → based_on. |
| src/nemo_safe_synthesizer/evaluation/components/text_structure_similarity.py | Updates component to consume EvaluationDatasets and new naming. |
| src/nemo_safe_synthesizer/evaluation/components/text_semantic_similarity.py | Updates component naming and PCA labeling ("training" vs "train"). |
| src/nemo_safe_synthesizer/evaluation/components/sqs_score.py | Updates field-type lookup to training-based features. |
| src/nemo_safe_synthesizer/evaluation/components/pii_replay.py | Renames PII replay fields to training/synthetic terminology. |
| src/nemo_safe_synthesizer/evaluation/components/multi_modal_figures.py | Updates figure labels and function parameter names to training/synthetic. |
| src/nemo_safe_synthesizer/evaluation/components/membership_inference_protection.py | Renames params and normalization flow to training/synthetic/test naming. |
| src/nemo_safe_synthesizer/evaluation/components/deep_structure.py | Renames PCA outputs and data-prep params to training/synthetic naming. |
| src/nemo_safe_synthesizer/evaluation/components/dataset_statistics.py | Renames dataset stats fields to training/synthetic naming. |
| src/nemo_safe_synthesizer/evaluation/components/correlation.py | Renames correlation matrices and wiring to training/synthetic. |
| src/nemo_safe_synthesizer/evaluation/components/component.py | Renames base factory method to from_evaluation_datasets. |
| src/nemo_safe_synthesizer/evaluation/components/column_distribution.py | Updates plotting/data access to training/synthetic fields. |
| src/nemo_safe_synthesizer/evaluation/components/attribute_inference_protection.py | Renames params and internal variables to training/synthetic. |
| src/nemo_safe_synthesizer/evaluation/assets/text/multi_modal_tooltips.py | Updates report tooltip text to training/synthetic terminology. |
| src/nemo_safe_synthesizer/evaluation/assets/jinja/reports/multi_modal_report.j2 | Updates report navigation and includes to "Training Columns". |
| src/nemo_safe_synthesizer/evaluation/assets/jinja/components/training_columns.j2 | Renames "Reference Columns" section to "Training Data Columns". |
| src/nemo_safe_synthesizer/evaluation/assets/jinja/components/pii_replay.j2 | Updates headings/fields to training/synthetic naming. |
| src/nemo_safe_synthesizer/evaluation/assets/jinja/components/dataset_statistics.j2 | Updates table labels and memorized-lines row wording. |
| src/nemo_safe_synthesizer/data_processing/assembler.py | Renames internal dataset attributes and split variables to training/validation. |
| docs/user-guide/running.md | Updates generation-stage description to "training dataset schema". |
| docs/product-overview/evaluation.md | Updates SQS/DPS copy to training vs synthetic terminology. |
Comments suppressed due to low confidence (1)

src/nemo_safe_synthesizer/evaluation/assets/jinja/components/training_columns.j2:35

  • The "Type" column in the "Training Data Columns" table is currently sourced from synthetic_field_features.type. If this table is intended to describe the training data (as the header + tooltip indicate), this should use training_field_features.type instead; otherwise the label/tooltip should be updated to clarify which dataset’s type is shown.


Collaborator

@kendrickb-nvidia kendrickb-nvidia left a comment


Thanks @nina-xu, it's really good to be consistent on naming.

As a larger refactoring, I'd like to get our ty type checking updated and working more fully in #141 first. That will give us more confidence that the refactoring didn't miss anything or introduce any typos.

My goal is to get #141 ready today. Hopefully I can run a bit of slurm testing over the weekend and we can merge #141 on Monday. Then do one merge conflict resolution with this PR and proceed. If things don't go well on the type checking PR today, then we can swap the order and I'll figure out merge conflicts in the already massive #141. Will check in on Monday on how to proceed.



Development

Successfully merging this pull request may close these issues.

chore: naming consistency for train/output/test datasets

3 participants