- Added automatic duplicate rate checking for training and edge case datasets
- Configurable 5% threshold for duplicate warnings (DUPLICATE_RATE_THRESHOLD)
- Shows warnings only when the duplicate rate exceeds the threshold, i.e. when duplicates could harm model performance
- Switched default data generation model to mistralai/mistral-nemo
- Enhanced system prompts for better violation content generation
- Fixed GitHub workflow argument parsing for problem descriptions
This improves data quality monitoring and model performance by alerting users when duplicate training data could cause overfitting or poor generalization.
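The diff below only shows the new setting, not the check itself. A minimal sketch of how such a check might work, assuming duplicates are exact matches after trimming whitespace and lowercasing, and that the warning fires only when the rate is strictly above the threshold. The helper names `duplicate_rate` and `warn_if_duplicates` are hypothetical and not taken from the repository:

```python
import warnings
from typing import Sequence

# Mirrors the value added to text_classifier/settings.py in this commit.
DUPLICATE_RATE_THRESHOLD = 5.0  # percent


def duplicate_rate(texts: Sequence[str]) -> float:
    """Return the percentage of entries that repeat another entry (hypothetical helper)."""
    if not texts:
        return 0.0
    # Assumption: two samples are duplicates if they match after strip() and lower().
    unique = {t.strip().lower() for t in texts}
    return 100.0 * (len(texts) - len(unique)) / len(texts)


def warn_if_duplicates(
    texts: Sequence[str],
    name: str,
    threshold: float = DUPLICATE_RATE_THRESHOLD,
) -> bool:
    """Emit a warning and return True when the duplicate rate is above the threshold."""
    rate = duplicate_rate(texts)
    # Assumption: the comparison is strict, so a rate equal to the threshold stays silent.
    if rate > threshold:
        warnings.warn(
            f"{name}: duplicate rate {rate:.1f}% exceeds the {threshold:.1f}% threshold"
        )
        return True
    return False
```

Calling `warn_if_duplicates(train_texts, "training data")` and `warn_if_duplicates(edge_texts, "edge cases")` after generation would cover both datasets named above. The real implementation may normalize text differently or use near-duplicate detection.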
text_classifier/settings.py
4 additions & 1 deletion
@@ -11,7 +11,7 @@
 DEFAULT_CONFIG_MODEL = (
     "x-ai/grok-3-beta"  # "anthropic/claude-3-opus" # More capable model
 )
-DEFAULT_DATA_GEN_MODEL = "openai/gpt-4o-mini"  # Cheaper/faster for bulk generation
+DEFAULT_DATA_GEN_MODEL = "mistralai/mistral-nemo"  # Less restrictive for content moderation data
 
 # --- Default Paths ---
 DEFAULT_OUTPUT_PATH = "models"  # Changed for differentiation
@@ -36,6 +36,9 @@
 DEFAULT_PROMPT_REFINEMENT_CYCLES = 1  # How many times to refine prompts
 DEFAULT_GENERATE_EDGE_CASES = True
 
+# --- Data Quality Control ---
+DUPLICATE_RATE_THRESHOLD = 5.0  # Percentage threshold for duplicate rate warnings
+
 # --- Prompts ---
 CONFIG_SYSTEM_PROMPT = "You are an expert AI assistant specializing in data generation and configuration for machine learning. Follow instructions precisely and provide output in the requested JSON format."