chore: naming consistency for training/synthetic/test datasets #337
Conversation
Signed-off-by: Nina Xu <19981858+nina-xu@users.noreply.github.com>
```diff
 raise ValueError(f"{df_name} is empty!")

-def get_columns_of_type(self, types: set[FieldType], mode="reference") -> list[str]:
+def get_columns_of_type(self, types: set[FieldType], based_on="training") -> list[str]:
```
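For context, a minimal sketch of how the renamed parameter might behave. The class body, the `FieldType` values, and the internal field-metadata layout here are all assumptions for illustration, not the real model:

```python
from enum import Enum

class FieldType(Enum):
    NUMERIC = "numeric"
    CATEGORICAL = "categorical"

class EvaluationDatasets:
    """Toy stand-in for the real model; the metadata layout is hypothetical."""

    def __init__(self, field_types: dict[str, dict[str, FieldType]]):
        # Maps dataset name ("training" / "synthetic") to {column: FieldType}.
        self._field_types = field_types

    def get_columns_of_type(self, types: set[FieldType], based_on="training") -> list[str]:
        # `based_on` selects which dataset's inferred types drive the lookup.
        return [col for col, t in self._field_types[based_on].items() if t in types]

ds = EvaluationDatasets({
    "training": {"age": FieldType.NUMERIC, "city": FieldType.CATEGORICAL},
    "synthetic": {"age": FieldType.NUMERIC, "city": FieldType.CATEGORICAL},
})
print(ds.get_columns_of_type({FieldType.NUMERIC}))  # ['age']
```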
changed the argument name here because `mode="training"` feels misleading
Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com>
Pull request overview
This PR standardizes dataset naming across the project (code, tests, docs, and the evaluation HTML report) to consistently use training, synthetic, test, and input (pre-split) terminology.
Changes:
- Renames the evaluation data model `EvaluationDataset` → `EvaluationDatasets` and propagates the new training/synthetic naming through evaluation components, reports, and templates.
- Updates holdout splitting APIs/params and training pipeline variables to use `training_df`/`synthetic_df` naming consistently.
- Refreshes tests and documentation/report copy to reflect the new canonical naming and labels.
Reviewed changes
Copilot reviewed 49 out of 49 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/stub_datasets/licenses.md | Adds SPDX headers for test stub dataset licensing doc. |
| tests/smoke/test_evaluation_cpu.py | Updates report construction args to training/synthetic. |
| tests/sdk/test_process_data.py | Renames builder internals to _training_df / _original_training_df in tests. |
| tests/holdout/test_holdout.py | Updates holdout split call to input_df=.... |
| tests/evaluation/test_render.py | Updates fixtures/kwargs to training/synthetic. |
| tests/evaluation/reports/test_multimodal_report.py | Updates report args to training/synthetic. |
| tests/evaluation/datamodel/test_evaluation_datasets.py | New tests for EvaluationDatasets model behavior. |
| tests/evaluation/datamodel/test_evaluation_dataset.py | Removes obsolete tests for old EvaluationDataset model. |
| tests/evaluation/conftest.py | Renames fixtures and switches to EvaluationDatasets. |
| tests/evaluation/components/test_pii_replay.py | Updates PII replay construction and field names. |
| tests/evaluation/components/test_multi_modal_figures.py | Updates figure trace naming (“Training”, “Synthetic”). |
| tests/evaluation/components/test_membership_inference_protection.py | Updates MIA tests to new model/constructor names. |
| tests/evaluation/components/test_deep_structure.py | Updates PCA outputs to training/synthetic naming. |
| tests/evaluation/components/test_correlation.py | Updates correlation outputs to training/synthetic naming. |
| tests/evaluation/components/test_composite_score.py | Updates component construction helper fixture usage. |
| tests/evaluation/components/test_column_distribution.py | Updates column distribution construction and inputs. |
| tests/evaluation/components/test_attribute_inference_protection.py | Updates AIA construction to new model/constructor names. |
| tests/data_processing/test_assembler.py | Updates assembler attribute naming (training_dataset). |
| src/nemo_safe_synthesizer/utils.py | Updates wrapper to pass input_df to holdout module. |
| src/nemo_safe_synthesizer/training/huggingface_backend.py | Renames internal training/test df fields; updates trainer result payload. |
| src/nemo_safe_synthesizer/training/backend.py | Renames df_train → training_df and df_test → test_df in base types. |
| src/nemo_safe_synthesizer/sdk/library_builder.py | Renames internal cached split fields and evaluator arg to training_df. |
| src/nemo_safe_synthesizer/holdout/holdout.py | Renames split fns/params to input_df and returns training_df. |
| src/nemo_safe_synthesizer/evaluation/statistics/stats.py | Renames parameters for memorization/PCA helpers to training/synthetic terminology. |
| src/nemo_safe_synthesizer/evaluation/reports/multimodal/multimodal_report.py | Switches report assembly to EvaluationDatasets and new component constructors. |
| src/nemo_safe_synthesizer/evaluation/evaluator.py | Renames evaluator input from train_df to training_df. |
| src/nemo_safe_synthesizer/evaluation/data_model/evaluation_report.py | Renames report field to evaluation_datasets. |
| src/nemo_safe_synthesizer/evaluation/data_model/evaluation_field.py | Renames field feature/distribution attributes to training/synthetic. |
| src/nemo_safe_synthesizer/evaluation/data_model/evaluation_datasets.py | Renames and updates evaluation data model; renames mode → based_on. |
| src/nemo_safe_synthesizer/evaluation/components/text_structure_similarity.py | Updates component to consume EvaluationDatasets and new naming. |
| src/nemo_safe_synthesizer/evaluation/components/text_semantic_similarity.py | Updates component naming and PCA labeling (“training” vs “train”). |
| src/nemo_safe_synthesizer/evaluation/components/sqs_score.py | Updates field-type lookup to training-based features. |
| src/nemo_safe_synthesizer/evaluation/components/pii_replay.py | Renames PII replay fields to training/synthetic terminology. |
| src/nemo_safe_synthesizer/evaluation/components/multi_modal_figures.py | Updates figure labels and function parameter names to training/synthetic. |
| src/nemo_safe_synthesizer/evaluation/components/membership_inference_protection.py | Renames params and normalization flow to training/synthetic/test naming. |
| src/nemo_safe_synthesizer/evaluation/components/deep_structure.py | Renames PCA outputs and data-prep params to training/synthetic naming. |
| src/nemo_safe_synthesizer/evaluation/components/dataset_statistics.py | Renames dataset stats fields to training/synthetic naming. |
| src/nemo_safe_synthesizer/evaluation/components/correlation.py | Renames correlation matrices and wiring to training/synthetic. |
| src/nemo_safe_synthesizer/evaluation/components/component.py | Renames base factory method to from_evaluation_datasets. |
| src/nemo_safe_synthesizer/evaluation/components/column_distribution.py | Updates plotting/data access to training/synthetic fields. |
| src/nemo_safe_synthesizer/evaluation/components/attribute_inference_protection.py | Renames params and internal variables to training/synthetic. |
| src/nemo_safe_synthesizer/evaluation/assets/text/multi_modal_tooltips.py | Updates report tooltip text to training/synthetic terminology. |
| src/nemo_safe_synthesizer/evaluation/assets/jinja/reports/multi_modal_report.j2 | Updates report navigation and includes to “Training Columns”. |
| src/nemo_safe_synthesizer/evaluation/assets/jinja/components/training_columns.j2 | Renames “Reference Columns” section to “Training Data Columns”. |
| src/nemo_safe_synthesizer/evaluation/assets/jinja/components/pii_replay.j2 | Updates headings/fields to training/synthetic naming. |
| src/nemo_safe_synthesizer/evaluation/assets/jinja/components/dataset_statistics.j2 | Updates table labels and memorized-lines row wording. |
| src/nemo_safe_synthesizer/data_processing/assembler.py | Renames internal dataset attributes and split variables to training/validation. |
| docs/user-guide/running.md | Updates generation-stage description to “training dataset schema”. |
| docs/product-overview/evaluation.md | Updates SQS/DPS copy to training vs synthetic terminology. |
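The holdout rename listed for `src/nemo_safe_synthesizer/holdout/holdout.py` above can be sketched as follows. The function name, signature, and defaults are assumptions for illustration; only the `input_df`/`training_df`/`test_df` naming comes from the PR:

```python
import pandas as pd

def split_holdout(input_df: pd.DataFrame, holdout_fraction: float = 0.2, seed: int = 42):
    """Split the pre-split input into training and held-out test frames.

    `input_df` is the canonical name for not-yet-split data; everything
    else about this function is illustrative, not the real API.
    """
    test_df = input_df.sample(frac=holdout_fraction, random_state=seed)
    training_df = input_df.drop(test_df.index)
    return training_df, test_df

input_df = pd.DataFrame({"x": range(10)})
training_df, test_df = split_holdout(input_df)
print(len(training_df), len(test_df))  # 8 2
```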
Comments suppressed due to low confidence (1)
src/nemo_safe_synthesizer/evaluation/assets/jinja/components/training_columns.j2:35
- The "Type" column in the "Training Data Columns" table is currently sourced from `synthetic_field_features.type`. If this table is intended to describe the training data (as the header and tooltip indicate), this should use `training_field_features.type` instead; otherwise the label/tooltip should be updated to clarify which dataset's type is shown.
Two resolved review threads on src/nemo_safe_synthesizer/evaluation/assets/text/multi_modal_tooltips.py.
kendrickb-nvidia
left a comment
Thanks @nina-xu, it's really good to be consistent on naming.
As a larger refactoring, I'd like to get our ty type checking updated and working more fully in #141 first. That will give us more confidence that the refactoring didn't miss anything or introduce any typos.
My goal is to get #141 ready today. Hopefully I can run a bit of slurm testing over the weekend and we can merge #141 on Monday. Then do one merge conflict resolution with this PR and proceed. If things don't go well on the type checking PR today, then we can swap the order and I'll figure out merge conflicts in the already massive #141. Will check in on Monday on how to proceed.
Summary
Standardize dataset naming conventions throughout the codebase, documentation, and quality report for consistency and clarity.
Canonical naming convention
| Legacy names | Canonical name |
|---|---|
| reference, real, training, train, df1, original, df_all | training |
| output, synthetic, synth, df2 | synthetic |
| test | test (unchanged) |
| df | input |

Variable naming conventions
- Dataframes use the canonical names (`training`, `synthetic`) with a `_df` suffix (`training_df`, `synthetic_df`, `test_df`)
- Embedding dataframes use an `_embd_df` suffix (`training_embd_df`, `synthetic_embd_df`)
- Use `training`, `synthetic`, `test` when multiple datasets appear together

Other fixes for clarity
- `EvaluationDataset` → `EvaluationDatasets`; file renamed to `evaluation_datasets.py`
- `mode` parameter → `based_on` in column-type query methods (`get_tabular_columns`, etc.)

Pre-Review Checklist
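The legacy-to-canonical renames in this PR amount to a small mapping. The helper below is hypothetical (no such function exists in the codebase); it only restates the convention for illustration:

```python
# Hypothetical helper restating the PR's canonical naming convention.
CANONICAL = {
    "reference": "training", "real": "training", "train": "training",
    "df1": "training", "original": "training", "df_all": "training",
    "output": "synthetic", "synth": "synthetic", "df2": "synthetic",
    "test": "test",   # unchanged
    "df": "input",    # pre-split data
}

def canonical_name(legacy: str) -> str:
    # Names outside the convention pass through untouched.
    return CANONICAL.get(legacy, legacy)

print(canonical_name("df1"))   # training
print(canonical_name("synth")) # synthetic
```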
Ensure that the following pass:
- `make format && make check` or via prek validation
- `make test` passes locally
- `make test-e2e` passes locally
- `make test-ci-container` passes locally (recommended)
- `/sync` on this PR to trigger a run (auto-triggers on ready-for-review)

Pre-Merge Checklist
Testing Plan
- `make test-e2e`
- `evaluation_report.html`