
chore: naming consistency for training/synthetic/test datasets#337

Open
nina-xu wants to merge 7 commits into main from ninaxu/143-dataset-naming-consistency

Conversation

@nina-xu
Contributor

@nina-xu nina-xu commented Apr 1, 2026

Summary

Standardize dataset naming conventions throughout the codebase, documentation, and quality report for consistency and clarity.

Canonical naming convention

| Dataset | Old names | New name |
| --- | --- | --- |
| Training data | reference, real, training, train, df1, original, df_all | training |
| Synthetic data | output, synthetic, synth, df2 | synthetic |
| Test holdout | test | test (unchanged) |
| Full data before split | df | input |
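The mapping above can be sketched as a small lookup table. This is an illustrative sketch only, not code from the PR; the `CANONICAL_NAMES` dict and `canonicalize` helper are hypothetical names:

```python
# Hypothetical mapping from legacy dataset names to the canonical convention.
CANONICAL_NAMES = {
    "reference": "training", "real": "training", "train": "training",
    "df1": "training", "original": "training", "df_all": "training",
    "output": "synthetic", "synth": "synthetic", "df2": "synthetic",
    "test": "test",
    "df": "input",
}

def canonicalize(name: str) -> str:
    """Return the canonical dataset name; already-canonical names pass through."""
    return CANONICAL_NAMES.get(name, name)
```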

Variable naming conventions

  • Pydantic model fields: short form (training, synthetic)
  • pandas DataFrame variables/params: _df suffix (training_df, synthetic_df, test_df)
  • Embedding DataFrames: _embd_df suffix (training_embd_df, synthetic_embd_df)
  • Parameter ordering: always training, synthetic, test when multiple datasets appear together
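Taken together, the conventions above imply signatures like the following. This is a hedged sketch, not the project's actual API; the function name `evaluate_datasets` is invented for illustration:

```python
# Hypothetical signature illustrating the conventions:
# DataFrame-typed parameters carry the `_df` suffix, and datasets
# always appear in the canonical order: training, synthetic, test.
def evaluate_datasets(training_df, synthetic_df, test_df):
    """Return per-dataset row counts (pandas DataFrames and plain
    sequences both support len(), so lists work for this sketch)."""
    return {
        "training": len(training_df),
        "synthetic": len(synthetic_df),
        "test": len(test_df),
    }
```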

Other fixes for clarity

  • Renamed class EvaluationDataset → EvaluationDatasets; file renamed to evaluation_datasets.py
  • Renamed mode parameter → based_on in column-type query methods (get_tabular_columns, etc.)
  • Fixed some inaccuracies in the evaluation HTML report while I was at it

Pre-Review Checklist

Ensure that the following pass:

  • make format && make check or via prek validation.
  • make test passes locally
  • make test-e2e passes locally
  • make test-ci-container passes locally (recommended)
  • GPU CI status check passes -- comment /sync on this PR to trigger a run (auto-triggers on ready-for-review)

Pre-Merge Checklist

  • New or updated tests for any fix or new behavior
  • Updated documentation for new features and behaviors, including docstrings for API docs.

Testing Plan

  • make test-e2e
  • run a few end-to-end jobs via the SDK/CLI and inspect the new evaluation report, evaluation_report.html
  • run Slurm jobs and compare against a baseline to ensure no implementation change was accidentally introduced

Signed-off-by: Nina Xu <19981858+nina-xu@users.noreply.github.com>
@nina-xu nina-xu changed the title naming consistency for training/synthetic/test datasets chore: naming consistency for training/synthetic/test datasets Apr 1, 2026
nina-xu added 3 commits April 1, 2026 12:47
Signed-off-by: Nina Xu <19981858+nina-xu@users.noreply.github.com>
Signed-off-by: Nina Xu <19981858+nina-xu@users.noreply.github.com>
Signed-off-by: Nina Xu <19981858+nina-xu@users.noreply.github.com>
```diff
     raise ValueError(f"{df_name} is empty!")

-def get_columns_of_type(self, types: set[FieldType], mode="reference") -> list[str]:
+def get_columns_of_type(self, types: set[FieldType], based_on="training") -> list[str]:
```
Contributor Author


Changed the argument name here because mode="training" feels misleading.
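The rename reads naturally in use. Below is a minimal sketch of the idea, not the real EvaluationDatasets model: the class body, constructor, and FieldType values here are assumptions made for illustration:

```python
from enum import Enum

class FieldType(Enum):
    # Simplified stand-in for the project's FieldType enum.
    NUMERIC = "numeric"
    TEXT = "text"

class EvaluationDatasets:
    # Minimal sketch of the renamed model; the internal layout is invented.
    def __init__(self, column_types_by_dataset):
        # e.g. {"training": {"age": FieldType.NUMERIC}, "synthetic": {...}}
        self._column_types = column_types_by_dataset

    def get_columns_of_type(self, types, based_on="training"):
        # `based_on` names the dataset whose schema is consulted; the old
        # `mode` parameter read as if it switched the method's behavior.
        return [col for col, t in self._column_types[based_on].items() if t in types]
```

Call sites then read as "columns of this type, based on the training schema": `ds.get_columns_of_type({FieldType.NUMERIC}, based_on="synthetic")`.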

nina-xu added 3 commits April 2, 2026 06:51
Signed-off-by: Nina Xu <19981858+nina-xu@users.noreply.github.com>
Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com>
Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com>
@nina-xu nina-xu marked this pull request as ready for review April 2, 2026 15:18
@nina-xu nina-xu requested review from a team as code owners April 2, 2026 15:18
Copilot AI review requested due to automatic review settings April 2, 2026 15:18
@nina-xu nina-xu requested a review from alexahaushalter April 2, 2026 15:21
Contributor

Copilot AI left a comment


Pull request overview

This PR standardizes dataset naming across the project (code, tests, docs, and the evaluation HTML report) to consistently use training, synthetic, test, and input (pre-split) terminology.

Changes:

  • Renames evaluation data model EvaluationDataset → EvaluationDatasets and propagates the new training/synthetic naming through evaluation components, reports, and templates.
  • Updates holdout splitting APIs/params and training pipeline variables to use training_df / synthetic_df naming consistently.
  • Refreshes tests and documentation/report copy to reflect the new canonical naming and labels.

Reviewed changes

Copilot reviewed 49 out of 49 changed files in this pull request and generated 3 comments.

Summary per file:

| File | Description |
| --- | --- |
| tests/stub_datasets/licenses.md | Adds SPDX headers for test stub dataset licensing doc. |
| tests/smoke/test_evaluation_cpu.py | Updates report construction args to training/synthetic. |
| tests/sdk/test_process_data.py | Renames builder internals to _training_df / _original_training_df in tests. |
| tests/holdout/test_holdout.py | Updates holdout split call to input_df=.... |
| tests/evaluation/test_render.py | Updates fixtures/kwargs to training/synthetic. |
| tests/evaluation/reports/test_multimodal_report.py | Updates report args to training/synthetic. |
| tests/evaluation/datamodel/test_evaluation_datasets.py | New tests for EvaluationDatasets model behavior. |
| tests/evaluation/datamodel/test_evaluation_dataset.py | Removes obsolete tests for old EvaluationDataset model. |
| tests/evaluation/conftest.py | Renames fixtures and switches to EvaluationDatasets. |
| tests/evaluation/components/test_pii_replay.py | Updates PII replay construction and field names. |
| tests/evaluation/components/test_multi_modal_figures.py | Updates figure trace naming ("Training", "Synthetic"). |
| tests/evaluation/components/test_membership_inference_protection.py | Updates MIA tests to new model/constructor names. |
| tests/evaluation/components/test_deep_structure.py | Updates PCA outputs to training/synthetic naming. |
| tests/evaluation/components/test_correlation.py | Updates correlation outputs to training/synthetic naming. |
| tests/evaluation/components/test_composite_score.py | Updates component construction helper fixture usage. |
| tests/evaluation/components/test_column_distribution.py | Updates column distribution construction and inputs. |
| tests/evaluation/components/test_attribute_inference_protection.py | Updates AIA construction to new model/constructor names. |
| tests/data_processing/test_assembler.py | Updates assembler attribute naming (training_dataset). |
| src/nemo_safe_synthesizer/utils.py | Updates wrapper to pass input_df to holdout module. |
| src/nemo_safe_synthesizer/training/huggingface_backend.py | Renames internal training/test df fields; updates trainer result payload. |
| src/nemo_safe_synthesizer/training/backend.py | Renames df_train → training_df and df_test → test_df in base types. |
| src/nemo_safe_synthesizer/sdk/library_builder.py | Renames internal cached split fields and evaluator arg to training_df. |
| src/nemo_safe_synthesizer/holdout/holdout.py | Renames split fns/params to input_df and returns training_df. |
| src/nemo_safe_synthesizer/evaluation/statistics/stats.py | Renames parameters for memorization/PCA helpers to training/synthetic terminology. |
| src/nemo_safe_synthesizer/evaluation/reports/multimodal/multimodal_report.py | Switches report assembly to EvaluationDatasets and new component constructors. |
| src/nemo_safe_synthesizer/evaluation/evaluator.py | Renames evaluator input from train_df to training_df. |
| src/nemo_safe_synthesizer/evaluation/data_model/evaluation_report.py | Renames report field to evaluation_datasets. |
| src/nemo_safe_synthesizer/evaluation/data_model/evaluation_field.py | Renames field feature/distribution attributes to training/synthetic. |
| src/nemo_safe_synthesizer/evaluation/data_model/evaluation_datasets.py | Renames and updates evaluation data model; renames mode → based_on. |
| src/nemo_safe_synthesizer/evaluation/components/text_structure_similarity.py | Updates component to consume EvaluationDatasets and new naming. |
| src/nemo_safe_synthesizer/evaluation/components/text_semantic_similarity.py | Updates component naming and PCA labeling ("training" vs "train"). |
| src/nemo_safe_synthesizer/evaluation/components/sqs_score.py | Updates field-type lookup to training-based features. |
| src/nemo_safe_synthesizer/evaluation/components/pii_replay.py | Renames PII replay fields to training/synthetic terminology. |
| src/nemo_safe_synthesizer/evaluation/components/multi_modal_figures.py | Updates figure labels and function parameter names to training/synthetic. |
| src/nemo_safe_synthesizer/evaluation/components/membership_inference_protection.py | Renames params and normalization flow to training/synthetic/test naming. |
| src/nemo_safe_synthesizer/evaluation/components/deep_structure.py | Renames PCA outputs and data-prep params to training/synthetic naming. |
| src/nemo_safe_synthesizer/evaluation/components/dataset_statistics.py | Renames dataset stats fields to training/synthetic naming. |
| src/nemo_safe_synthesizer/evaluation/components/correlation.py | Renames correlation matrices and wiring to training/synthetic. |
| src/nemo_safe_synthesizer/evaluation/components/component.py | Renames base factory method to from_evaluation_datasets. |
| src/nemo_safe_synthesizer/evaluation/components/column_distribution.py | Updates plotting/data access to training/synthetic fields. |
| src/nemo_safe_synthesizer/evaluation/components/attribute_inference_protection.py | Renames params and internal variables to training/synthetic. |
| src/nemo_safe_synthesizer/evaluation/assets/text/multi_modal_tooltips.py | Updates report tooltip text to training/synthetic terminology. |
| src/nemo_safe_synthesizer/evaluation/assets/jinja/reports/multi_modal_report.j2 | Updates report navigation and includes to "Training Columns". |
| src/nemo_safe_synthesizer/evaluation/assets/jinja/components/training_columns.j2 | Renames "Reference Columns" section to "Training Data Columns". |
| src/nemo_safe_synthesizer/evaluation/assets/jinja/components/pii_replay.j2 | Updates headings/fields to training/synthetic naming. |
| src/nemo_safe_synthesizer/evaluation/assets/jinja/components/dataset_statistics.j2 | Updates table labels and memorized-lines row wording. |
| src/nemo_safe_synthesizer/data_processing/assembler.py | Renames internal dataset attributes and split variables to training/validation. |
| docs/user-guide/running.md | Updates generation-stage description to "training dataset schema". |
| docs/product-overview/evaluation.md | Updates SQS/DPS copy to training vs synthetic terminology. |
Comments suppressed due to low confidence (1)

src/nemo_safe_synthesizer/evaluation/assets/jinja/components/training_columns.j2:35

  • The "Type" column in the "Training Data Columns" table is currently sourced from synthetic_field_features.type. If this table is intended to describe the training data (as the header + tooltip indicate), this should use training_field_features.type instead; otherwise the label/tooltip should be updated to clarify which dataset’s type is shown.


Collaborator

@kendrickb-nvidia kendrickb-nvidia left a comment


Thanks @nina-xu, it's really good to be consistent on naming.

As a larger refactoring, I'd like to get our ty type checking updated and working more fully in #141 first. That will give us more confidence that the refactoring didn't miss anything or introduce any typos.

My goal is to get #141 ready today. Hopefully I can run a bit of slurm testing over the weekend and we can merge #141 on Monday. Then do one merge conflict resolution with this PR and proceed. If things don't go well on the type checking PR today, then we can swap the order and I'll figure out merge conflicts in the already massive #141. Will check in on Monday on how to proceed.



Development

Successfully merging this pull request may close these issues.

chore: naming consistency for train/output/test datasets

3 participants