Problem
The process_record function currently tightly couples content extraction and aggregation logic. This makes it difficult to:
- Reuse the extraction logic across different parts of the codebase.
- Isolate and test the extraction logic effectively.
Proposed Improvement
Introduce a separate step for content extraction. This abstraction will:
- Encourage Reusability: By decoupling the logic, the content extraction step can be easily shared across modules or extended by the community.
- Enhance Testability: Since the extraction logic involves mostly pure and idempotent functions, isolating it would simplify testing and debugging.
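As a rough illustration of the testability point, a pure extraction helper can be checked without a Spark context or any input archive. The helper `extract_title` and its regex are hypothetical, chosen only for this sketch, and are not part of the existing codebase:

```python
import re
from typing import Optional

# Hypothetical helper for illustration only; a regex is a simplification
# and not a full HTML parser.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html: str) -> Optional[str]:
    """Return the stripped text of the first <title> element, or None."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


def test_extract_title() -> None:
    html = "<html><head><TITLE> Hello </TITLE></head></html>"
    # Pure function: the same input always gives the same output.
    assert extract_title(html) == "Hello"
    assert extract_title(html) == extract_title(html)
    assert extract_title("<p>no title</p>") is None


test_extract_title()
```

Because the function takes a string and returns a value, the test needs no fixtures beyond a literal.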
Implementation Suggestions
- Extract the content extraction logic into a dedicated function.
- Extract the content aggregation logic into a dedicated function.
- Modify process_record to delegate to the new abstractions.
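The three steps above could be sketched as follows. This is a minimal sketch under stated assumptions: the record is modelled as a plain dict with a "links" list rather than a real WARC record, and the names `extract_content` and `aggregate_content` are hypothetical, not existing functions in the codebase:

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple
from urllib.parse import urlparse

# Assumption: a record is a dict like {"links": [...]}; the real job
# would receive a different record type.
Record = Dict[str, List[str]]


def extract_content(record: Record) -> Iterator[str]:
    """Extraction step (hypothetical): yield the host of each link."""
    for url in record.get("links", []):
        host = urlparse(url).netloc
        if host:
            yield host


def aggregate_content(items: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Aggregation step (hypothetical): count occurrences per item."""
    yield from Counter(items).items()


def process_record(record: Record) -> Iterator[Tuple[str, int]]:
    """Delegate to the two steps instead of mixing them in one body."""
    yield from aggregate_content(extract_content(record))


record = {"links": ["https://a.example/x", "https://a.example/y", "https://b.example/"]}
print(list(process_record(record)))
# prints [('a.example', 2), ('b.example', 1)]
```

With this split, `extract_content` can be reused or tested on its own, while `process_record` keeps its existing role as the entry point.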
This can be implemented at two potential levels:
- At the CCSparkJob Level:
  - Establish a standardized approach to content extraction, making it the principal way of handling such tasks in the codebase.
- At Specific Examples:
  - Implement the abstraction in specific examples like ExtractLinksJob to showcase the idea as a suggestion.
  - This gives contributors the flexibility to adopt or adapt the approach as needed.
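To make the two levels concrete, here is one possible shape. It is a sketch only: `RecordJob` is a stand-in base class written for this example and is not the real CCSparkJob, `LinkCountJob` is not the real ExtractLinksJob, and the hook names `extract` and `aggregate` are assumptions:

```python
import re
from typing import Iterable, Iterator, Tuple


class RecordJob:
    """Stand-in base class (NOT the real CCSparkJob).

    Base-class level: process_record is fixed and delegates to two hooks.
    """

    def extract(self, record: str) -> Iterable[str]:
        raise NotImplementedError

    def aggregate(self, items: Iterable[str]) -> Iterator[Tuple[str, int]]:
        # Default aggregation: emit (item, 1) pairs for a later reduce step.
        for item in items:
            yield item, 1

    def process_record(self, record: str) -> Iterator[Tuple[str, int]]:
        yield from self.aggregate(self.extract(record))


class LinkCountJob(RecordJob):
    """Example level: a single job opts in by overriding only extract()."""

    # Simplified regex, used here for illustration instead of an HTML parser.
    HREF_RE = re.compile(r'href="([^"]+)"')

    def extract(self, record: str) -> Iterable[str]:
        return self.HREF_RE.findall(record)


job = LinkCountJob()
print(list(job.process_record('<a href="/a">A</a> <a href="/b">B</a>')))
# prints [('/a', 1), ('/b', 1)]
```

Putting the hooks on the base class standardizes the pattern for every job, whereas applying it only inside one example job keeps it optional.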