```mermaid
graph LR
BigWig_Data_Access_Layer["BigWig Data Access Layer"]
BigWig_Collection_Manager["BigWig Collection Manager"]
Genomic_Data_Sampler["Genomic Data Sampler"]
Core_Data_Processing_Engine["Core Data Processing Engine"]
Data_Loading_Pipeline_Orchestrator["Data Loading Pipeline Orchestrator"]
Dataset_API_Framework_Integration["Dataset API & Framework Integration"]
BigWig_Collection_Manager -- "uses" --> BigWig_Data_Access_Layer
BigWig_Collection_Manager -- "configures" --> Data_Loading_Pipeline_Orchestrator
Genomic_Data_Sampler -- "feeds queries to" --> Data_Loading_Pipeline_Orchestrator
Data_Loading_Pipeline_Orchestrator -- "orchestrates" --> BigWig_Data_Access_Layer
Data_Loading_Pipeline_Orchestrator -- "orchestrates" --> Core_Data_Processing_Engine
Data_Loading_Pipeline_Orchestrator -- "provides batches to" --> Dataset_API_Framework_Integration
Dataset_API_Framework_Integration -- "initializes" --> BigWig_Collection_Manager
click Genomic_Data_Sampler href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/bigwig-loader/Genomic_Data_Sampler.md" "Details"
click Core_Data_Processing_Engine href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/bigwig-loader/Core_Data_Processing_Engine.md" "Details"
click Data_Loading_Pipeline_Orchestrator href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/bigwig-loader/Data_Loading_Pipeline_Orchestrator.md" "Details"
click Dataset_API_Framework_Integration href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/bigwig-loader/Dataset_API_Framework_Integration.md" "Details"
```
The bigwig-loader library is architected as a high-performance data loading pipeline for genomic deep learning, emphasizing GPU acceleration and modularity. The core design revolves around efficient data access, flexible sampling, and a streamlined processing pipeline that integrates seamlessly with deep learning frameworks.
BigWig Data Access Layer
This component is responsible for the low-level parsing of individual BigWig file headers, managing internal file structures (e.g., BBI, Zoom, R-Tree indices), and providing efficient, direct access to raw compressed data blocks. It encapsulates the logic for reading and interpreting data from a single BigWig file.
Related Classes/Methods:
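As a hedged sketch of what this layer deals with, the fixed 64-byte BBI common header at the start of every BigWig file can be decoded with `struct`. Field names follow the published BigWig/BigBed format (little-endian magic `0x888FFC26` for BigWig); `parse_bbi_header` is an illustrative helper, not the library's actual parser.

```python
import struct

# BBI common header per the BigWig/BigBed format specification:
# magic, version, zoomLevels, chromosomeTreeOffset, fullDataOffset,
# fullIndexOffset, fieldCount, definedFieldCount, autoSqlOffset,
# totalSummaryOffset, uncompressBufSize, reserved  (64 bytes total).
_BBI_HEADER = struct.Struct("<IHHQQQHHQQIQ")
_BIGWIG_MAGIC = 0x888FFC26

def parse_bbi_header(buf: bytes) -> dict:
    """Decode the 64-byte BBI common header (illustrative only)."""
    (magic, version, zoom_levels, chrom_tree_offset, full_data_offset,
     full_index_offset, _field_count, _defined_field_count, _auto_sql_offset,
     total_summary_offset, uncompress_buf_size, _reserved) = _BBI_HEADER.unpack_from(buf)
    if magic != _BIGWIG_MAGIC:
        raise ValueError("not a little-endian BigWig file")
    return {
        "version": version,
        "zoom_levels": zoom_levels,
        "chrom_tree_offset": chrom_tree_offset,
        "full_data_offset": full_data_offset,
        "full_index_offset": full_index_offset,
        "total_summary_offset": total_summary_offset,
        # > 0 means data blocks are zlib-compressed and need this buffer size
        "uncompress_buf_size": uncompress_buf_size,
    }
```

The header's offsets are what make direct access possible: the R-Tree index at `full_index_offset` maps genomic ranges to file offsets of compressed data blocks, so only the blocks overlapping a query need to be read.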
BigWig Collection Manager
Manages a collection of multiple BigWig files. It interprets file paths, maps them to specific values, and establishes a unified global coordinate system across all managed files. This component orchestrates the initialization and management of multiple BigWig Data Access Layer instances.
Related Classes/Methods:
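A minimal sketch of the "unified global coordinate system" idea: lay the chromosomes end to end and record each one's cumulative start offset, so a (chromosome, position) pair maps to a single global integer shared by every managed file. Class and attribute names here are illustrative, not the library's API.

```python
from itertools import accumulate

class BigWigCollectionSketch:
    """Illustrative collection: one value per file path, plus a shared
    global coordinate system built from cumulative chromosome sizes."""

    def __init__(self, path_to_value: dict, chrom_sizes: dict):
        self.paths = list(path_to_value)   # track order follows file order
        self.values = path_to_value        # e.g., a label or scale per file
        names = list(chrom_sizes)
        # cumulative start offset of each chromosome in the global axis
        starts = [0, *accumulate(chrom_sizes[n] for n in names)]
        self.chrom_start = dict(zip(names, starts))

    def to_global(self, chrom: str, pos: int) -> int:
        """Map a per-chromosome position to the shared global coordinate."""
        return self.chrom_start[chrom] + pos
```

With the offsets precomputed, every downstream component (sampler, batch processor) can work with plain integer ranges instead of (chromosome, position) pairs.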
Genomic Data Sampler
Generates random genomic positions or intervals (chromosome, start, end), selects specific BigWig files (tracks) from the managed collection, and samples genomic sequences. It defines the regions of interest for data extraction, forming the basis for subsequent query batches.
Related Classes/Methods:
- `bigwig_loader.sampler.position_sampler` (1:1)
- `bigwig_loader.sampler.track_sampler` (1:1)
- `bigwig_loader.sampler.genome_sampler` (1:1)
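The core of position sampling can be sketched as length-weighted random interval generation: longer chromosomes are proportionally more likely to be chosen, and a fixed-width window is placed uniformly within the chosen chromosome. The function below is a hedged illustration, not the signature of `position_sampler`.

```python
import random

def sample_positions(chrom_sizes: dict, n: int, window: int, seed: int = 0):
    """Illustrative length-weighted interval sampler.

    Returns n (chromosome, start, end) tuples with end - start == window.
    """
    rng = random.Random(seed)
    chroms = list(chrom_sizes)
    weights = [chrom_sizes[c] for c in chroms]  # weight by chromosome length
    out = []
    for _ in range(n):
        c = rng.choices(chroms, weights=weights)[0]
        start = rng.randrange(0, chrom_sizes[c] - window)
        out.append((c, start, start + window))
    return out
```

Track selection works analogously: a subset of file indices is drawn from the managed collection, and the (interval, track) pairs together form the query batch handed to the pipeline.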
Core Data Processing Engine
Contains performance-critical, low-level functions for data manipulation and optimization. This includes efficient decompression of raw data blocks, memory management (especially for GPU-accelerated operations), optimized search algorithms (e.g., searchsorted for interval queries), transformation of interval-based genomic data into dense value arrays, and GPU-accelerated operations leveraging CuPy.
Related Classes/Methods:
- `bigwig_loader.decompressor` (1:1)
- `bigwig_loader.memory_bank` (1:1)
- `bigwig_loader.functional` (1:1)
- `bigwig_loader.searchsorted` (43:83)
- `bigwig_loader.intervals_to_values` (22:150)
- `bigwig_loader.cupy_functions` (1:1)
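The intervals-to-values transform can be sketched on the CPU with NumPy (the library performs the equivalent with CuPy kernels on the GPU). Given sorted, non-overlapping BigWig intervals, `searchsorted` locates the first interval that can overlap the query, and each overlapping interval's value is painted into a dense output array. This is an assumption-laden illustration, not the library's implementation.

```python
import numpy as np

def intervals_to_values(starts, ends, values, query_start, query_end):
    """Fill a dense float32 array for [query_start, query_end) from
    sorted, non-overlapping (start, end, value) intervals; gaps stay 0."""
    out = np.zeros(query_end - query_start, dtype=np.float32)
    # first interval whose end lies beyond the query start
    i = int(np.searchsorted(ends, query_start, side="right"))
    while i < len(starts) and starts[i] < query_end:
        lo = max(starts[i], query_start) - query_start
        hi = min(ends[i], query_end) - query_start
        out[lo:hi] = values[i]
        i += 1
    return out
```

On the GPU, the same pattern is applied to many queries at once, which is why a batched `searchsorted` and careful memory management are performance-critical here.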
Data Loading Pipeline Orchestrator
The core orchestration engine for multi-process/multi-threaded data loading, processing, and batching. It takes sampled queries, retrieves raw data via the BigWig Data Access Layer (coordinated by the BigWig Collection Manager), processes it with the Core Data Processing Engine, and assembles standardized data batches. It manages worker contexts and streams processed batches for consumption.
Related Classes/Methods:
- `bigwig_loader.input_generator` (1:1)
- `bigwig_loader.batch` (1:1)
- `bigwig_loader.batch_processor` (1:1)
- `bigwig_loader.streamed_dataset` (1:1)
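Stripped of workers and streams, the orchestration loop reduces to a generator: consume sampled queries, fetch and process each one, and yield fixed-size batches. `fetch` below stands in for the data-access and processing layers and is hypothetical; the real pipeline additionally overlaps fetching, GPU processing, and delivery across workers.

```python
def batches(queries, fetch, batch_size):
    """Illustrative batching loop: apply fetch() to each sampled query
    and yield lists of batch_size results (final batch may be short)."""
    batch = []
    for q in queries:
        batch.append(fetch(q))       # data access + processing per query
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch                  # flush the trailing partial batch
```

Keeping the orchestrator a streaming generator is what lets the Dataset API layer iterate over batches without ever materializing the full epoch in memory.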
Dataset API & Framework Integration
Provides a high-level, framework-agnostic dataset interface for consuming genomic data. It integrates the Genomic Data Sampler, BigWig Collection Manager, and Data Loading Pipeline Orchestrator to present a unified view of the genomic data. This component also includes specific adapters for seamless integration with deep learning frameworks like PyTorch, converting data to appropriate tensor formats.
Related Classes/Methods:
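The framework-agnostic facade can be sketched as a thin iterable that wires a sampler and a batch pipeline together and applies a per-framework tensor converter at the boundary; for PyTorch the converter would be something like `torch.as_tensor`. All names in this sketch are illustrative, not the library's API.

```python
class GenomicDatasetSketch:
    """Illustrative dataset facade: sampler -> pipeline -> tensors.

    to_tensor is the only framework-specific piece (identity by default);
    a PyTorch adapter would pass torch.as_tensor here.
    """

    def __init__(self, sampler, pipeline, to_tensor=lambda batch: batch):
        self.sampler = sampler      # callable yielding genomic queries
        self.pipeline = pipeline    # queries -> iterable of batches
        self.to_tensor = to_tensor  # framework-specific conversion

    def __iter__(self):
        for batch in self.pipeline(self.sampler()):
            yield self.to_tensor(batch)
```

Pushing the conversion to the very edge of the pipeline keeps everything upstream (sampling, decompression, densification) framework-neutral, so supporting a new framework only requires a new `to_tensor`.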