Description
cmr_processing_level_report.json
generate_processing_level_report.py
Summary
The CMR search API's processing_level and processing_level_id parameters completely fail to return collections with ProcessingLevel.Id = "Not provided" (lowercase 'p') in UMM-C format. 40,121 collections (74.5% of all CMR collections) are invisible to processing level searches.
Environment
- CMR Base URL: https://cmr.earthdata.nasa.gov/search/
- API Endpoint: collections.umm_json
- Date of Analysis: December 16, 2024
- Total Collections in CMR: 53,852
Bug Description
Expected Behavior
Collections with ProcessingLevel.Id = "Not provided" (lowercase 'p') should be searchable using the processing_level_id parameter.
From exhaustive scan of all collections:
- 40,121 collections have ProcessingLevel.Id = "Not provided" (lowercase 'p')
- 161 collections have ProcessingLevel.Id = "Not Provided" (capital 'P')
- These are distinct, non-overlapping sets (verified)
When searching with processing_level_id="Not provided", the API should return all 40,282 collections (40,121 + 161), given that the search is case-insensitive.
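As a quick check of that expectation, the sketch below sums the scan counts for every stored ID that matches the query when case is ignored. The `expected_hits` helper is hypothetical and only illustrates the arithmetic; the counts are copied from the exhaustive scan in this report.

```python
# Counts copied from the exhaustive scan in this report (subset of levels).
scan_counts = {"Not provided": 40121, "Not Provided": 161, "NA": 1763}


def expected_hits(counts, query):
    """Sum the counts of every stored ID equal to `query`, ignoring case."""
    wanted = query.casefold()
    return sum(n for level, n in counts.items() if level.casefold() == wanted)


print(expected_hits(scan_counts, "Not provided"))  # 40282
```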
Actual Behavior
Searching for any case variant ('not provided', 'Not provided', 'NOT PROVIDED') returns only 161 collections: the ones with ProcessingLevel.Id = "Not Provided" (capital 'P').
The 40,121 collections with lowercase "Not provided" are completely missing from the search results.
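A minimal sketch of how a single filter count could be checked by hand, using only the Python standard library. The helper names `build_search_url` and `count_hits` are hypothetical, and reading the total from a top-level `hits` field of the `umm_json` response body is an assumption about the response shape rather than something stated in this report.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL and endpoint as listed in the Environment section.
CMR_COLLECTIONS_URL = "https://cmr.earthdata.nasa.gov/search/collections.umm_json"


def build_search_url(level_id, param="processing_level_id"):
    """Build a collection search URL that filters on one processing level."""
    # page_size=1 keeps the response small; only the total count is needed.
    query = urlencode({param: level_id, "page_size": 1})
    return f"{CMR_COLLECTIONS_URL}?{query}"


def count_hits(level_id, param="processing_level_id"):
    """Return the total reported by CMR (performs a network request)."""
    with urlopen(build_search_url(level_id, param), timeout=30) as resp:
        # Assumption: the response body carries a top-level "hits" field.
        return json.load(resp)["hits"]
```

Calling `count_hits(variant)` for each of 'not provided', 'Not provided' and 'NOT PROVIDED', with either `processing_level_id` or `processing_level` as `param`, would be one way to compare against the 161 reported above.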
Analysis Result (from cmr_processing_level_search_index_report.json):
{
"search_filter_results": {
"Not provided": {
"processing_level_id": 161,
"processing_level": 161,
"expected": 40121
},
"Not Provided": {
"processing_level_id": 161,
"processing_level": 161,
"expected": 161
}
}
}
Impact
- Severity: CRITICAL
- Collections Affected: 40,121 collections (74.5% of all CMR collections)
- User Impact: Users searching for data by processing level miss 99.6% of collections with "Not provided" values. This prevents implementing a fail-closed approach for processing level filtering. For example, when searching for processing_level=1, users cannot reliably include collections with unspecified/unknown processing levels by adding OR processing_level="Not provided" to their query, because 99.6% of those collections are missing from the search index.
- Scope: Only affects "Not provided" (lowercase 'p'); all other 17 processing levels work correctly
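The two percentages quoted above follow directly from the scan counts. A short sketch of the arithmetic (the variable names are illustrative only):

```python
total_collections = 53852
lowercase_count = 40121  # ProcessingLevel.Id == "Not provided"
capital_count = 161      # ProcessingLevel.Id == "Not Provided"

# Share of the whole catalog that the filter cannot find.
share_of_catalog = lowercase_count / total_collections

# Share of all "not provided" collections (either casing) that are missed.
share_missing = lowercase_count / (lowercase_count + capital_count)

print(f"{share_of_catalog:.1%}")  # 74.5%
print(f"{share_missing:.1%}")     # 99.6%
```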
Detailed Analysis
Exhaustive Scan Results (Ground Truth)
From comprehensive analysis of all 53,852 collections:
{
"exhaustive_scan": {
"total_collections": 53852,
"collections_with_levels": 53852,
"unique_levels": 18,
"levels": {
"Not provided": 40121,
"Not Provided": 161,
"NA": 1763,
"3": 4401,
"2": 3476,
"4": 1471,
"1B": 1088,
"1": 705,
"1A": 234,
"0": 221,
"2G": 76,
"2P": 53,
"2B": 34,
"1C": 20,
"2A": 14,
"1T": 11,
"L2": 2,
"Level 3": 1
}
}
}
Search Filter Test Results
Testing all 18 unique processing level values:
| Processing Level | Expected | Search Returns | Status |
|---|---|---|---|
| Not provided | 40,121 | 161 | ❌ BROKEN |
| Not Provided | 161 | 161 | ✅ Works |
| NA | 1,763 | 1,763 | ✅ Works |
| 3 | 4,401 | 4,401 | ✅ Works |
| 2 | 3,476 | 3,476 | ✅ Works |
| 4 | 1,471 | 1,471 | ✅ Works |
| 1B | 1,088 | 1,088 | ✅ Works |
| 1 | 705 | 705 | ✅ Works |
| 1A | 234 | 234 | ✅ Works |
| 0 | 221 | 221 | ✅ Works |
| (8 others) | * | * | ✅ Works |
Result: 17 out of 18 processing levels work perfectly. Only "Not provided" (lowercase 'p') is broken.
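The comparison behind the Status column can be sketched as a simple diff between the scan count and the search count. The `broken_levels` helper is hypothetical and is not taken from the attached script; the sample rows are copied from the table above.

```python
# (expected, returned) pairs copied from the first rows of the table above.
results = {
    "Not provided": (40121, 161),
    "Not Provided": (161, 161),
    "NA": (1763, 1763),
    "3": (4401, 4401),
}


def broken_levels(table):
    """Map each mismatching level to the number of collections it is missing."""
    return {
        level: expected - returned
        for level, (expected, returned) in table.items()
        if expected != returned
    }


print(broken_levels(results))  # {'Not provided': 39960}
```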
Reproduction Steps
Prerequisites
- Python 3.8+
Reproduce the Bug
Download and run the comprehensive analysis script:
# Download the attached script
# Install dependencies
pip install httpx
# Run analysis (takes ~10-15 minutes to scan all 53,852 collections)
python generate_processing_level_report.py
The resulting JSON report has two top-level fields, exhaustive_scan and search_filter_results. exhaustive_scan holds the actual number of collections per processing level in CMR, while search_filter_results holds the number of collections returned when searching with a filter on that processing level.
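For readers who want to see how the exhaustive_scan counts could be derived, here is a minimal sketch of a case-sensitive tally over UMM-C records. It is not the attached script: `tally_levels` is a hypothetical helper, the item shape `{"umm": {"ProcessingLevel": {"Id": ...}}}` is an assumption about the `umm_json` response, and the sample records are made up for illustration.

```python
from collections import Counter


def tally_levels(items):
    """Count collections per exact (case-sensitive) ProcessingLevel.Id."""
    counts = Counter()
    for item in items:
        # Assumption: each item nests the UMM-C record under the "umm" key.
        level = item.get("umm", {}).get("ProcessingLevel", {}).get("Id", "<missing>")
        counts[level] += 1
    return dict(counts)


# Made-up sample records, not real CMR data.
sample = [
    {"umm": {"ProcessingLevel": {"Id": "Not provided"}}},
    {"umm": {"ProcessingLevel": {"Id": "Not Provided"}}},
    {"umm": {"ProcessingLevel": {"Id": "Not provided"}}},
]
print(tally_levels(sample))  # {'Not provided': 2, 'Not Provided': 1}
```

Keeping the tally case-sensitive is what lets the two spellings show up as separate levels in the scan.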
Root Cause Analysis
What Works ✅
- All 17 other processing level values index and search correctly
- "Not Provided" (capital 'P') works perfectly (returns all 161 collections)
- Both processing_level and processing_level_id parameters behave identically
What's Broken ❌
- Collections with ProcessingLevel.Id = "Not provided" (lowercase 'p') are NOT indexed
- These 40,121 collections are completely invisible to processing level searches
- Searching for any case variant only returns the 161 "Not Provided" (capital P) collections
Attachments
Analysis Files
- generate_processing_level_report.py: Complete analysis script that:
  - Performs exhaustive scan of all 53,852 collections
  - Tests search filters for all 18 unique processing levels
  - Generates comprehensive JSON report
  - Identifies discrepancies and problematic levels
- cmr_processing_level_search_index_report.json: Full analysis results including:
  - Exhaustive scan results (ground truth)
  - Search filter test results for each processing level
Conclusion
This is a critical search indexing bug affecting 74.5% of CMR collections. The bug prevents users from discovering 40,121 collections when filtering by processing level, severely impacting data discovery for Earth science research.
The bug is:
- 100% reproducible with provided analysis script
- Well-isolated to lowercase "Not provided" only
Report Generated: December 16, 2025
Analysis Scope: All 53,852 CMR collections
Test Coverage: All 18 unique processing level values
Reproducibility: 100% (verified with comprehensive automated testing)