Abstract `process_record` to Separate Content Extraction Step for Reusability and Testing  

#### **Problem**  
The `process_record` function currently tightly couples content extraction and aggregation logic. This makes it difficult to:  
1. Reuse the extraction logic across different parts of the codebase.  
2. Isolate and test the extraction logic effectively.  

#### **Proposed Improvement**  
Introduce a separate step for content extraction. This abstraction will:  
- **Encourage Reusability**: By decoupling the logic, the content extraction step can be easily shared across modules or extended by the community.  
- **Enhance Testability**: Because the extraction logic consists mostly of pure, idempotent functions, isolating it would simplify testing and debugging.  



#### **Implementation Suggestions**  
1. Extract the content **extraction** logic into a dedicated function.  
2. Extract the content **aggregation** logic into a dedicated function.  
3. Modify `process_record` to delegate to the new abstractions.
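
The three steps above could look roughly like the following. This is a minimal sketch, not the current code: `extract_content`, `aggregate_content`, the regex, and the dict-based `Record` are hypothetical stand-ins chosen so the example runs without Spark or a WARC reader.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical record type: a plain dict stands in for whatever
# record object the real job receives.
Record = Dict[str, str]

# Deliberately naive pattern, for illustration only.
HREF_RE = re.compile(r'href="([^"]+)"')


def extract_content(record: Record) -> List[str]:
    """Extraction step: return link targets found in the record body.

    Pure function: same input, same output, no side effects.
    """
    return HREF_RE.findall(record.get("body", ""))


def aggregate_content(links: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Aggregation step: turn extracted links into (link, count) pairs."""
    for link, count in Counter(links).items():
        yield link, count


def process_record(record: Record) -> Iterator[Tuple[str, int]]:
    """Delegate to the two steps instead of interleaving them."""
    yield from aggregate_content(extract_content(record))
```

With this split, the extraction step can be tested on its own with a literal input, for example `extract_content({"body": '<a href="https://a.example/">x</a>'})`, without setting up any job infrastructure.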

This can be implemented at two potential levels:  
1. **At the `CCSparkJob` Level**:  
   - Establish a standardized approach to content extraction, making it the default way to handle such tasks across the codebase.  

2. **At Specific Examples**:  
   - Implement the abstraction in specific examples such as `ExtractLinksJob` to demonstrate the idea as a suggestion.  
   - This leaves contributors free to adopt or adapt the approach as needed.  
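
For the first option, one possible shape is a base class whose `process_record` calls overridable hooks. The sketch below is an assumption for discussion only: `ExtractingJobSketch`, `extract`, `aggregate`, and `WordCountSketchJob` are hypothetical names, and the class is a stand-in rather than the real `CCSparkJob`.

```python
from typing import Any, Iterable, Iterator, Tuple


class ExtractingJobSketch:
    """Hypothetical stand-in for a job base class (not the real CCSparkJob)."""

    def extract(self, record: Any) -> Iterable[Any]:
        """Extraction hook: subclasses return the items found in a record."""
        raise NotImplementedError

    def aggregate(self, items: Iterable[Any]) -> Iterator[Tuple[Any, int]]:
        """Aggregation hook: by default, emit each item with a count of 1."""
        for item in items:
            yield item, 1

    def process_record(self, record: Any) -> Iterator[Tuple[Any, int]]:
        """Fixed pipeline: extraction first, then aggregation."""
        yield from self.aggregate(self.extract(record))


class WordCountSketchJob(ExtractingJobSketch):
    """Example subclass that only supplies the extraction step."""

    def extract(self, record: Any) -> Iterable[str]:
        # Assumes the record is a dict with a "text" field (illustrative only).
        return record["text"].lower().split()
```

A subclass then overrides only `extract` (and `aggregate` if the default does not fit), while the second option would apply the same split inside a single example job without touching the base class.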




