Comprehensive bioinformatics toolkit for multi-omic analysis. Domain-driven, modular architecture designed for performance and scientific rigor.
- Domain-Driven Design (DDD): Logic partitioned into biological domains (DNA, RNA, Protein, etc.)
- No-Mock Testing: All tests use real implementations; mocks are strictly prohibited
- UV Package Management: Dependencies managed exclusively via uv (never pip)
- AI-Native Documentation: High-fidelity function indices for AI agent navigation
metainformant/
├── src/metainformant/ # Source code (25 domain modules + core)
│ ├── core/ # Shared infrastructure (I/O, config, logging)
│ ├── dna/ # Genomic analysis, alignment, population genetics
│ ├── rna/ # Transcriptomic workflows, Amalgkit integration
│ ├── protein/ # Proteomic analysis, structure modeling
│ ├── gwas/ # Genome-wide association studies
│ ├── epigenome/ # Methylation, ChIP-seq, ATAC-seq
│ ├── networks/ # Biological networks, community detection
│ ├── multiomics/ # Multi-omic data integration
│ ├── singlecell/ # Single-cell RNA-seq analysis
│ ├── visualization/ # 70+ plot types, publication-quality output
│ ├── quality/ # QC metrics, contamination detection
│ ├── ml/ # Machine learning pipelines
│ ├── math/ # Population genetics theory, coalescent
│ ├── information/ # Information theory (entropy, MI)
│ ├── ontology/ # GO analysis, semantic similarity
│ ├── phenotype/ # Trait analysis, curation
│ ├── ecology/ # Community diversity
│ ├── simulation/ # Synthetic data generation
│ ├── life_events/ # Event sequence analysis
│ ├── menu/ # Interactive menu and discovery system
│ ├── longread/ # PacBio/Nanopore long-read analysis
│ ├── metagenomics/ # Microbiome and metagenomic analysis
│ ├── spatial/ # Spatial transcriptomics
│ ├── structural_variants/ # CNV/SV detection and annotation
│ ├── pharmacogenomics/ # Clinical variant analysis
│ └── metabolomics/ # Metabolite identification, MS data, pathway mapping
├── scripts/ # Thin wrapper orchestrators
├── tests/ # Pytest test suite (real implementations only)
├── docs/ # Documentation by domain
├── config/ # YAML configuration templates
├── data/ # Input data (read-mostly)
└── output/ # Program-generated results (ephemeral)
Prioritize local data sources before remote acquisition (NCBI, SRA).
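The local-first rule can be sketched as a small resolver. This is an illustration only: resolve_dataset and the fetch_remote callback are hypothetical names, not part of the metainformant API, and the callback stands in for whatever NCBI or SRA download routine a domain actually uses.

```python
from pathlib import Path
from typing import Callable


def resolve_dataset(
    name: str,
    data_dir: Path,
    fetch_remote: Callable[[str, Path], Path],
) -> Path:
    """Return a local copy of `name`, fetching remotely only if it is absent.

    `fetch_remote` is a hypothetical callback that downloads `name` to the
    given destination path and returns that path.
    """
    local_path = data_dir / name
    if local_path.exists():
        # Local data wins: no network access is attempted.
        return local_path
    # Fall back to remote acquisition and keep the result for later runs.
    data_dir.mkdir(parents=True, exist_ok=True)
    return fetch_remote(name, local_path)
```

Passing the fetcher in as a parameter keeps the network dependency explicit, so callers can see exactly when a remote source may be contacted.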
Scripts in scripts/ are thin wrappers around core methods. Business logic resides in src/.
YAML configs in config/ can be overridden via environment variables with domain prefixes (AK_, GWAS_, DNA_, etc.).
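A minimal sketch of prefix-based overrides follows. It assumes that a variable such as GWAS_THREADS maps to the config key threads; that naming scheme, and the helper apply_env_overrides itself, are assumptions for illustration rather than the project's documented behavior. Overridden values are left as strings.

```python
from __future__ import annotations

import os
from typing import Any, Mapping


def apply_env_overrides(
    config: Mapping[str, Any],
    prefix: str,
    environ: Mapping[str, str] | None = None,
) -> dict[str, Any]:
    """Return a copy of `config` with PREFIX-matching environment values applied.

    Assumed mapping (hypothetical): PREFIX + upper-cased key, e.g. GWAS_THREADS -> threads.
    """
    env = os.environ if environ is None else environ
    merged = dict(config)
    for key, value in env.items():
        if key.startswith(prefix):
            option = key[len(prefix):].lower()
            if option:
                # Environment values are strings; type coercion is left to the caller.
                merged[option] = value
    return merged


# A GWAS_ override replaces the YAML value; variables with other prefixes are ignored.
print(apply_env_overrides({"threads": 4}, "GWAS_", {"GWAS_THREADS": "8", "DNA_THREADS": "2"}))
# → {'threads': '8'}
```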
All program-generated results go to output/. Never create documentation or reports in output/.
- No Mocking: Tests use real implementations with actual file I/O and API calls
- Graceful Skips: When external dependencies are unavailable, skip with clear messages
- Markers: @pytest.mark.network, @pytest.mark.external_tool, @pytest.mark.slow
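The marker and skip conventions above might look like this in a test module. The helper require_tool is a hypothetical name, and samtools is only an illustrative choice of external tool. Custom markers such as external_tool normally need to be registered in the pytest configuration to avoid unknown-marker warnings.

```python
import shutil

import pytest


def require_tool(name: str) -> str:
    """Return the path to `name`, or skip the current test with a clear message."""
    path = shutil.which(name)
    if path is None:
        pytest.skip(f"external tool '{name}' not found on PATH")
    return path


@pytest.mark.external_tool
def test_tool_is_available() -> None:
    # Illustrative tool name; the test is skipped, not failed, when it is missing.
    tool_path = require_tool("samtools")
    assert tool_path
```

Marked tests can then be selected or excluded on the command line with pytest's -m option, for example -m "not external_tool".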
- Python 3.11+ minimum
- Black formatting (120 char lines)
- mypy type checking (strict)
- All functions must have type hints
- Module README.md: User-facing documentation and examples
- docs//: Extended guides and tutorials
Raw Data (FASTQ/VCF) → Preprocessing & QC → Domain Analysis → Multi-Omic Integration → Visualization
↑ ↑ ↑ ↑
data/ directory quality/ module domain modules visualization/
Modules communicate via:
- Standard Python protocols
- Shared data structures (pandas DataFrames, numpy arrays)
- Core infrastructure (metainformant.core) for I/O, config, and logging
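If "standard Python protocols" is read as structural typing via typing.Protocol, cross-module communication could be sketched as below. Analyzer, MeanAnalyzer, and run are hypothetical names, and a plain sequence of floats stands in for the pandas or numpy structures a real module would exchange.

```python
from __future__ import annotations

from typing import Protocol, Sequence, runtime_checkable


@runtime_checkable
class Analyzer(Protocol):
    """Hypothetical contract: anything with a matching analyze() method qualifies."""

    def analyze(self, values: Sequence[float]) -> dict[str, float]: ...


class MeanAnalyzer:
    """Satisfies Analyzer structurally, without inheriting from it."""

    def analyze(self, values: Sequence[float]) -> dict[str, float]:
        return {"mean": sum(values) / len(values)}


def run(analyzer: Analyzer, values: Sequence[float]) -> dict[str, float]:
    # The caller depends only on the protocol, not on a concrete domain module.
    return analyzer.analyze(values)


print(run(MeanAnalyzer(), [1.0, 2.0, 3.0]))
# → {'mean': 2.0}
```

Because the contract is structural, one domain module can consume another's objects without importing its classes, which keeps the domains loosely coupled.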