Bug Report: Processing Level Search Missing 40,121 Collections with "Not provided" #2358

@iamsims

Description

cmr_processing_level_report.json
generate_processing_level_report.py

Summary

The CMR search API's processing_level and processing_level_id parameters completely fail to return collections with ProcessingLevel.Id = "Not provided" (lowercase 'p') in UMM-C format. 40,121 collections (74.5% of all CMR collections) are invisible to processing level searches.

Environment

  • CMR Base URL: https://cmr.earthdata.nasa.gov/search/
  • API Endpoint: collections.umm_json
  • Date of Analysis: December 16, 2024
  • Total Collections in CMR: 53,852

Bug Description

Expected Behavior

Collections with ProcessingLevel.Id = "Not provided" (lowercase 'p') should be searchable using the processing_level_id parameter.

From an exhaustive scan of all collections:

  • 40,121 collections have ProcessingLevel.Id = "Not provided" (lowercase 'p')
  • 161 collections have ProcessingLevel.Id = "Not Provided" (capital 'P')
  • These are distinct, non-overlapping sets (verified)

Because the search is case-insensitive, a query with processing_level_id="Not provided" should return both sets: 40,121 + 161 = 40,282 collections.
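The expected total can be checked by merging scan counts whose Ids differ only in case. This is a minimal sketch, not part of the attached script; `case_insensitive_totals` is a hypothetical helper and the counts are copied from the exhaustive scan in this report.

```python
from collections import defaultdict

# Counts copied from the exhaustive scan in this report (subset of the 18 levels).
scan_levels = {"Not provided": 40121, "Not Provided": 161, "NA": 1763}


def case_insensitive_totals(levels):
    """Sum the counts of all level Ids that are equal when case is ignored."""
    totals = defaultdict(int)
    for level_id, count in levels.items():
        totals[level_id.casefold()] += count
    return dict(totals)


totals = case_insensitive_totals(scan_levels)
print(totals["not provided"])  # 40282
```

A case-insensitive filter on "Not provided" would be expected to match this merged total rather than either count alone.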

Actual Behavior

Searches for every case variant tested ('not provided', 'Not provided', 'NOT PROVIDED') return only 161 collections: the ones with ProcessingLevel.Id = "Not Provided" (capital 'P').

The 40,121 collections with lowercase "Not provided" are completely missing from the search results.
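The case-variant check can be sketched as below. This is an illustration under assumptions, not the attached script: it uses the standard library instead of httpx, reads the top-level `hits` field of the umm_json response, and the helper names (`build_query_url`, `fetch_json`, `hit_counts`) are hypothetical.

```python
import json
import urllib.parse
import urllib.request

CMR_SEARCH_URL = "https://cmr.earthdata.nasa.gov/search/collections.umm_json"


def build_query_url(level_id):
    """Build a search URL filtered on processing_level_id, one item per page."""
    params = {"processing_level_id": level_id, "page_size": 1}
    return CMR_SEARCH_URL + "?" + urllib.parse.urlencode(params)


def fetch_json(url):
    """Fetch a URL and decode its JSON body (needs network access)."""
    with urllib.request.urlopen(url, timeout=60) as response:
        return json.load(response)


def hit_counts(variants, fetch=fetch_json):
    """Return {variant: hits}; `fetch` is injectable so the logic can be tested offline."""
    return {variant: fetch(build_query_url(variant))["hits"] for variant in variants}
```

Calling `hit_counts(["not provided", "Not provided", "NOT PROVIDED"])` with the default fetcher issues live requests against CMR; per the results in this report, each variant would be expected to come back with 161.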

Analysis Result (from cmr_processing_level_search_index_report.json):

{
  "search_filter_results": {
    "Not provided": {
      "processing_level_id": 161,
      "processing_level": 161,
      "expected": 40121
    },
    "Not Provided": {
      "processing_level_id": 161,
      "processing_level": 161,
      "expected": 161
    }
  }
}

Impact

  • Severity: CRITICAL
  • Collections Affected: 40,121 collections (74.5% of all CMR collections)
  • User Impact: Users searching for data by processing level miss 99.6% of collections with "Not provided" values. This prevents implementing a fail-closed approach for processing level filtering. For example, when searching for processing_level=1, users cannot reliably include collections with unspecified/unknown processing levels by adding OR processing_level="Not provided" to their query, because 99.6% of those collections are missing from the search index.
  • Scope: Only affects "Not provided" (lowercase 'p'); all other 17 processing levels work correctly
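The two percentages quoted above follow directly from the scan counts. A short worked check, using only numbers stated in this report:

```python
total_collections = 53852
lowercase_count = 40121  # ProcessingLevel.Id == "Not provided"
capital_count = 161      # ProcessingLevel.Id == "Not Provided"

# Share of all CMR collections that carry the lowercase value.
share_of_cmr = lowercase_count / total_collections

# Share of all "not provided" collections (either casing) that the search misses.
share_missed = lowercase_count / (lowercase_count + capital_count)

print(f"{share_of_cmr:.1%}")  # 74.5%
print(f"{share_missed:.1%}")  # 99.6%
```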

Detailed Analysis

Exhaustive Scan Results (Ground Truth)

From comprehensive analysis of all 53,852 collections:

{
  "exhaustive_scan": {
    "total_collections": 53852,
    "collections_with_levels": 53852,
    "unique_levels": 18,
    "levels": {
      "Not provided": 40121,
      "Not Provided": 161,
      "NA": 1763,
      "3": 4401,
      "2": 3476,
      "4": 1471,
      "1B": 1088,
      "1": 705,
      "1A": 234,
      "0": 221,
      "2G": 76,
      "2P": 53,
      "2B": 34,
      "1C": 20,
      "2A": 14,
      "1T": 11,
      "L2": 2,
      "Level 3": 1
    }
  }
}
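A per-level tally like the one above can be produced by counting `ProcessingLevel.Id` exactly as stored, without normalizing case. The sketch below assumes each item follows the umm_json layout with the UMM-C record under the `umm` key; `tally_processing_levels` and the sample items are hypothetical, and the attached script may do this differently.

```python
from collections import Counter


def tally_processing_levels(items):
    """Count collections per ProcessingLevel.Id, keeping the original casing."""
    counts = Counter()
    for item in items:
        level = item.get("umm", {}).get("ProcessingLevel", {}).get("Id")
        # Track records without an Id separately instead of dropping them.
        counts[level if level is not None else "<missing>"] += 1
    return counts


# Made-up items for illustration only.
sample_items = [
    {"umm": {"ProcessingLevel": {"Id": "Not provided"}}},
    {"umm": {"ProcessingLevel": {"Id": "Not Provided"}}},
    {"umm": {"ProcessingLevel": {"Id": "3"}}},
    {"umm": {}},
]
counts = tally_processing_levels(sample_items)
print(dict(counts))
```

Keeping the casing intact is what lets the scan distinguish "Not provided" from "Not Provided" in the first place.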

Search Filter Test Results

Testing all 18 unique processing level values:

| Processing Level | Expected | Search Returns | Status |
|---|---|---|---|
| Not provided | 40,121 | 161 | ❌ BROKEN |
| Not Provided | 161 | 161 | ✅ Works |
| NA | 1,763 | 1,763 | ✅ Works |
| 3 | 4,401 | 4,401 | ✅ Works |
| 2 | 3,476 | 3,476 | ✅ Works |
| 4 | 1,471 | 1,471 | ✅ Works |
| 1B | 1,088 | 1,088 | ✅ Works |
| 1 | 705 | 705 | ✅ Works |
| 1A | 234 | 234 | ✅ Works |
| 0 | 221 | 221 | ✅ Works |
| (8 others) | * | * | ✅ Works |

Result: 17 out of 18 processing levels work perfectly. Only "Not provided" (lowercase 'p') is broken.

Reproduction Steps

Prerequisites

  • Python 3.8+

Reproduce the Bug

Download and run the comprehensive analysis script:

# Download the attached script
# Install dependencies
pip install httpx

# Run analysis (takes ~10-15 minutes to scan all 53,852 collections)
python generate_processing_level_report.py

The generated JSON report has two top-level fields, exhaustive_scan and search_filter_results. exhaustive_scan holds the actual number of collections per processing level in CMR; search_filter_results holds the number of collections returned when searching with a filter on that processing level.
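Comparing the two fields is enough to list the broken levels. The sketch below assumes the report layout shown in the excerpts in this issue; `find_discrepancies` is a hypothetical helper and the inline report is a two-level excerpt, not the full file.

```python
def find_discrepancies(report):
    """Return {level: {param: (expected, returned)}} wherever search differs from the scan."""
    levels = report["exhaustive_scan"]["levels"]
    results = report["search_filter_results"]
    broken = {}
    for level_id, expected in levels.items():
        found = results.get(level_id, {})
        for param in ("processing_level_id", "processing_level"):
            if found.get(param) != expected:
                broken.setdefault(level_id, {})[param] = (expected, found.get(param))
    return broken


# Two-level excerpt taken from the results quoted in this report.
report = {
    "exhaustive_scan": {"levels": {"Not provided": 40121, "Not Provided": 161}},
    "search_filter_results": {
        "Not provided": {"processing_level_id": 161, "processing_level": 161},
        "Not Provided": {"processing_level_id": 161, "processing_level": 161},
    },
}
broken = find_discrepancies(report)
print(broken)
```

For the full report, the same function can be applied to the dictionary obtained by loading the generated file with `json.load`.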

Root Cause Analysis

What Works ✅

  • All 17 other processing level values index and search correctly
  • "Not Provided" (capital 'P') works perfectly (returns all 161 collections)
  • Both processing_level and processing_level_id parameters behave identically

What's Broken ❌

  • Collections with ProcessingLevel.Id = "Not provided" (lowercase 'p') appear NOT to be indexed for processing level search
  • These 40,121 collections are completely invisible to processing level searches
  • Searching for any case variant only returns the 161 "Not Provided" (capital P) collections

Attachments

Analysis Files

  1. generate_processing_level_report.py: Complete analysis script that:

    • Performs exhaustive scan of all 53,852 collections
    • Tests search filters for all 18 unique processing levels
    • Generates comprehensive JSON report
    • Identifies discrepancies and problematic levels
  2. cmr_processing_level_search_index_report.json: Full analysis results including:

    • Exhaustive scan results (ground truth)
    • Search filter test results for each processing level
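An exhaustive scan needs to page through every collection. One way to sketch it is with CMR's search-after paging, where the `CMR-Search-After` response header is sent back on the next request. Whether the attached script pages this way is an assumption; `fetch_page` and `scan_all` are hypothetical names, and the standard library stands in for httpx.

```python
import json
import urllib.request

SCAN_URL = "https://cmr.earthdata.nasa.gov/search/collections.umm_json?page_size=2000"


def fetch_page(search_after=None):
    """Fetch one page; return (items, next search-after token or None). Needs network."""
    request = urllib.request.Request(SCAN_URL)
    if search_after is not None:
        request.add_header("CMR-Search-After", search_after)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
        return body.get("items", []), response.headers.get("CMR-Search-After")


def scan_all(fetch=fetch_page):
    """Yield every item, following the search-after token until the pages run out."""
    search_after = None
    while True:
        items, search_after = fetch(search_after)
        yield from items
        if not items or search_after is None:
            break
```

The `fetch` argument is injectable so the paging loop can be exercised against canned pages without touching the network.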

Conclusion

This is a critical search indexing bug affecting 74.5% of CMR collections. It prevents users from discovering 40,121 collections when filtering by processing level, severely impacting data discovery for Earth science research.

The bug is:

  • 100% reproducible with provided analysis script
  • Well-isolated to lowercase "Not provided" only

Report Generated: December 16, 2025
Analysis Scope: All 53,852 CMR collections
Test Coverage: All 18 unique processing level values
Reproducibility: 100% (verified with comprehensive automated testing)
