diff --git a/data/genomics/homo_sapiens/README.md b/data/genomics/homo_sapiens/README.md
index 6389b9e0c..8d7d1e686 100644
--- a/data/genomics/homo_sapiens/README.md
+++ b/data/genomics/homo_sapiens/README.md
@@ -579,6 +579,75 @@ This dataset contains:
 
 This folder contains `AnnotFilterRule.pm` which comes from [The Broad](https://data.broadinstitute.org/Trinity/CTAT_RESOURCE_LIB/AnnotFilterRule.pm) and is used for filtering in `starfusion`.
 
+### Gens test data
+
+These files are used to test the Gens input preprocessing module.
+
+#### Binned coverage
+
+This file was obtained from the following WGS processing pipeline: [SMD WGS pipeline](https://github.com/SMD-Bioinformatics-Lund/nextflow_wgs).
+
+`data/genomics/homo_sapiens/illumina/gatk/hg002_chr20_90000_to_100000.standardizedCR.tsv`
+
+The relevant pipeline commands are shown below. The inputs are an aligned BAM file and an interval file specifying 100 bp bins. A GATK-format panel of normals is also required.
+
+```
+gatk CollectReadCounts \\
+    -I $bam -L $params.COV_INTERVAL_LIST \\
+    --interval-merging-rule OVERLAPPING_ONLY -O ${bam}.hdf5
+
+gatk --java-options "-Xmx30g" DenoiseReadCounts \\
+    -I ${bam}.hdf5 --count-panel-of-normals ${PON[sex]} \\
+    --standardized-copy-ratios ${id}.standardizedCR.tsv \\
+    --denoised-copy-ratios ${id}.denoisedCR.tsv
+```
+
+The output is then filtered to keep the `@` header lines and only the chromosome 20 entries at positions 90,000 to 100,000.
+
+```
+grep -E "^@|^20" hg002.standardizedCR.tsv | awk '/^@/ || ($2 >= 90000 && $2 <= 100000)' > hg002_chr20_90000_to_100000.standardizedCR.tsv
+```
+
+#### SNV calls (gVCF)
+
+This file was obtained from the following WGS processing pipeline: [SMD WGS pipeline](https://github.com/SMD-Bioinformatics-Lund/nextflow_wgs).
+
+`data/genomics/homo_sapiens/illumina/vcf/hg002_chr20_90000_to_100000.dnascope.gvcf.gz`
+
+Output from running DNAscope on the aligned BAM file and then extracting the range 90000 to 100000 from chromosome 20.
+
+DNAscope is a reimplementation of, and a slight improvement on, GATK's HaplotypeCaller.
+
+The reference is masked hg38, and the inputs are base-quality recalibrated.
+
+```
+sentieon driver \\
+    -r ${params.genome_file} \\
+    -q $bqsr \\
+    -i $bam \\
+    --algo DNAscope --emit_mode GVCF ${id}.dnascope.gvcf.gz
+```
+
+```
+zcat hg002.dnascope.gvcf.gz | grep -E "^#|^20" | awk '/^#/ || ($2 >= 90000 && $2 <= 100000)' > hg002_chr20_90000_to_100000.dnascope.gvcf
+bgzip hg002_chr20_90000_to_100000.dnascope.gvcf
+tabix hg002_chr20_90000_to_100000.dnascope.gvcf.gz
+```
+
+#### B-allele frequency sampling locations
+
+`data/genomics/homo_sapiens/illumina/tab/gnomad_hg38_chr20_90000_to_100000.0.05.txt.gz`
+
+Subset of the file https://github.com/SMD-Bioinformatics-Lund/gens/releases/download/v4.3.0/gnomad_hg38.0.05.txt.gz
+
+It is based on gnomAD (v2), from which locations with an ALT allele frequency >= 0.05 are extracted. The file contains only the locations of these calls (col 1: chrom, col 2: position).
+
+Then the target range is extracted:
+
+```
+zcat gnomad_hg38.0.05.txt.gz | awk '$1 == 20 && ($2 >= 90000 && $2 <= 100000)' | gzip > gnomad_hg38_chr20_90000_to_100000.0.05.txt.gz
+```
+
 ### Missing files
 
 1. Single-end reads
diff --git a/data/genomics/homo_sapiens/illumina/gatk/hg002_chr20_90000_to_100000.standardizedCR.tsv b/data/genomics/homo_sapiens/illumina/gatk/hg002_chr20_90000_to_100000.standardizedCR.tsv
new file mode 100644
index 000000000..fe9409feb
--- /dev/null
+++ b/data/genomics/homo_sapiens/illumina/gatk/hg002_chr20_90000_to_100000.standardizedCR.tsv
@@ -0,0 +1,297 @@
+@HD VN:1.6
+@SQ SN:1 LN:248956422
+@SQ SN:2 LN:242193529
+@SQ SN:3 LN:198295559
+@SQ SN:4 LN:190214555
+@SQ SN:5 LN:181538259
+@SQ SN:6 LN:170805979
+@SQ SN:7 LN:159345973
+@SQ SN:8 LN:145138636
+@SQ SN:9 LN:138394717
+@SQ SN:10 LN:133797422
+@SQ SN:11 LN:135086622
+@SQ SN:12 LN:133275309
+@SQ SN:13 LN:114364328
+@SQ SN:14 LN:107043718
+@SQ SN:15 LN:101991189
+@SQ SN:16 LN:90338345
+@SQ SN:17 LN:83257441
+@SQ SN:18 LN:80373285
+@SQ SN:19 LN:58617616
+@SQ SN:20 LN:64444167
+@SQ SN:21 LN:46709983
+@SQ SN:22 LN:50818468
+@SQ SN:X LN:156040895
+@SQ SN:Y LN:57227415
+@SQ SN:M LN:16569
+@SQ SN:1_KI270706v1_random LN:175055
+@SQ SN:1_KI270707v1_random LN:32032
+@SQ SN:1_KI270708v1_random LN:127682
+@SQ SN:1_KI270709v1_random LN:66860
+@SQ SN:1_KI270710v1_random LN:40176
+@SQ SN:1_KI270711v1_random LN:42210
+@SQ SN:1_KI270712v1_random LN:176043
+@SQ SN:1_KI270713v1_random LN:40745
+@SQ SN:1_KI270714v1_random LN:41717
+@SQ SN:2_KI270715v1_random LN:161471
+@SQ SN:2_KI270716v1_random LN:153799
+@SQ SN:3_GL000221v1_random LN:155397
+@SQ SN:4_GL000008v2_random LN:209709
+@SQ SN:5_GL000208v1_random LN:92689
+@SQ SN:9_KI270717v1_random LN:40062
+@SQ SN:9_KI270718v1_random LN:38054
+@SQ SN:9_KI270719v1_random LN:176845
+@SQ SN:9_KI270720v1_random LN:39050
+@SQ SN:11_KI270721v1_random LN:100316
+@SQ SN:14_GL000009v2_random LN:201709
+@SQ SN:14_GL000225v1_random LN:211173
+@SQ SN:14_KI270722v1_random LN:194050
+@SQ SN:14_GL000194v1_random LN:191469
+@SQ SN:14_KI270723v1_random LN:38115
+@SQ SN:14_KI270724v1_random LN:39555
+@SQ SN:14_KI270725v1_random LN:172810
+@SQ SN:14_KI270726v1_random LN:43739
+@SQ SN:15_KI270727v1_random LN:448248
+@SQ SN:16_KI270728v1_random LN:1872759
+@SQ SN:17_GL000205v2_random LN:185591
+@SQ SN:17_KI270729v1_random LN:280839
+@SQ SN:17_KI270730v1_random LN:112551
+@SQ SN:22_KI270731v1_random LN:150754
+@SQ SN:22_KI270732v1_random LN:41543
+@SQ SN:22_KI270733v1_random LN:179772
+@SQ SN:22_KI270734v1_random LN:165050
+@SQ SN:22_KI270735v1_random LN:42811
+@SQ SN:22_KI270736v1_random LN:181920
+@SQ SN:22_KI270737v1_random LN:103838
+@SQ SN:22_KI270738v1_random LN:99375
+@SQ SN:22_KI270739v1_random LN:73985
+@SQ SN:Y_KI270740v1_random LN:37240
+@SQ SN:Un_KI270302v1 LN:2274
+@SQ SN:Un_KI270304v1 LN:2165
+@SQ SN:Un_KI270303v1 LN:1942
+@SQ SN:Un_KI270305v1 LN:1472
+@SQ SN:Un_KI270322v1 LN:21476
+@SQ SN:Un_KI270320v1 LN:4416
+@SQ SN:Un_KI270310v1 LN:1201
+@SQ SN:Un_KI270316v1 LN:1444
+@SQ SN:Un_KI270315v1 LN:2276
+@SQ SN:Un_KI270312v1 LN:998
+@SQ SN:Un_KI270311v1 LN:12399
+@SQ SN:Un_KI270317v1 LN:37690
+@SQ SN:Un_KI270412v1 LN:1179
+@SQ SN:Un_KI270411v1 LN:2646
+@SQ SN:Un_KI270414v1 LN:2489
+@SQ SN:Un_KI270419v1 LN:1029
+@SQ SN:Un_KI270418v1 LN:2145
+@SQ SN:Un_KI270420v1 LN:2321
+@SQ SN:Un_KI270424v1 LN:2140
+@SQ SN:Un_KI270417v1 LN:2043
+@SQ SN:Un_KI270422v1 LN:1445
+@SQ SN:Un_KI270423v1 LN:981
+@SQ SN:Un_KI270425v1 LN:1884
+@SQ SN:Un_KI270429v1 LN:1361
+@SQ SN:Un_KI270442v1 LN:392061
+@SQ SN:Un_KI270466v1 LN:1233
+@SQ SN:Un_KI270465v1 LN:1774
+@SQ SN:Un_KI270467v1 LN:3920
+@SQ SN:Un_KI270435v1 LN:92983
+@SQ SN:Un_KI270438v1 LN:112505
+@SQ SN:Un_KI270468v1 LN:4055
+@SQ SN:Un_KI270510v1 LN:2415
+@SQ SN:Un_KI270509v1 LN:2318
+@SQ SN:Un_KI270518v1 LN:2186
+@SQ SN:Un_KI270508v1 LN:1951
+@SQ SN:Un_KI270516v1 LN:1300
+@SQ SN:Un_KI270512v1 LN:22689
+@SQ SN:Un_KI270519v1 LN:138126
+@SQ SN:Un_KI270522v1 LN:5674
+@SQ SN:Un_KI270511v1 LN:8127
+@SQ SN:Un_KI270515v1 LN:6361
+@SQ SN:Un_KI270507v1 LN:5353
+@SQ SN:Un_KI270517v1 LN:3253
+@SQ SN:Un_KI270529v1 LN:1899
+@SQ SN:Un_KI270528v1 LN:2983
+@SQ SN:Un_KI270530v1 LN:2168
+@SQ SN:Un_KI270539v1 LN:993
+@SQ SN:Un_KI270538v1 LN:91309
+@SQ SN:Un_KI270544v1 LN:1202
+@SQ SN:Un_KI270548v1 LN:1599
+@SQ SN:Un_KI270583v1 LN:1400
+@SQ SN:Un_KI270587v1 LN:2969
+@SQ SN:Un_KI270580v1 LN:1553
+@SQ SN:Un_KI270581v1 LN:7046
+@SQ SN:Un_KI270579v1 LN:31033
+@SQ SN:Un_KI270589v1 LN:44474
+@SQ SN:Un_KI270590v1 LN:4685
+@SQ SN:Un_KI270584v1 LN:4513
+@SQ SN:Un_KI270582v1 LN:6504
+@SQ SN:Un_KI270588v1 LN:6158
+@SQ SN:Un_KI270593v1 LN:3041
+@SQ SN:Un_KI270591v1 LN:5796
+@SQ SN:Un_KI270330v1 LN:1652
+@SQ SN:Un_KI270329v1 LN:1040
+@SQ SN:Un_KI270334v1 LN:1368
+@SQ SN:Un_KI270333v1 LN:2699
+@SQ SN:Un_KI270335v1 LN:1048
+@SQ SN:Un_KI270338v1 LN:1428
+@SQ SN:Un_KI270340v1 LN:1428
+@SQ SN:Un_KI270336v1 LN:1026
+@SQ SN:Un_KI270337v1 LN:1121
+@SQ SN:Un_KI270363v1 LN:1803
+@SQ SN:Un_KI270364v1 LN:2855
+@SQ SN:Un_KI270362v1 LN:3530
+@SQ SN:Un_KI270366v1 LN:8320
+@SQ SN:Un_KI270378v1 LN:1048
+@SQ SN:Un_KI270379v1 LN:1045
+@SQ SN:Un_KI270389v1 LN:1298
+@SQ SN:Un_KI270390v1 LN:2387
+@SQ SN:Un_KI270387v1 LN:1537
+@SQ SN:Un_KI270395v1 LN:1143
+@SQ SN:Un_KI270396v1 LN:1880
+@SQ SN:Un_KI270388v1 LN:1216
+@SQ SN:Un_KI270394v1 LN:970
+@SQ SN:Un_KI270386v1 LN:1788
+@SQ SN:Un_KI270391v1 LN:1484
+@SQ SN:Un_KI270383v1 LN:1750
+@SQ SN:Un_KI270393v1 LN:1308
+@SQ SN:Un_KI270384v1 LN:1658
+@SQ SN:Un_KI270392v1 LN:971
+@SQ SN:Un_KI270381v1 LN:1930
+@SQ SN:Un_KI270385v1 LN:990
+@SQ SN:Un_KI270382v1 LN:4215
+@SQ SN:Un_KI270376v1 LN:1136
+@SQ SN:Un_KI270374v1 LN:2656
+@SQ SN:Un_KI270372v1 LN:1650
+@SQ SN:Un_KI270373v1 LN:1451
+@SQ SN:Un_KI270375v1 LN:2378
+@SQ SN:Un_KI270371v1 LN:2805
+@SQ SN:Un_KI270448v1 LN:7992
+@SQ SN:Un_KI270521v1 LN:7642
+@SQ SN:Un_GL000195v1 LN:182896
+@SQ SN:Un_GL000219v1 LN:179198
+@SQ SN:Un_GL000220v1 LN:161802
+@SQ SN:Un_GL000224v1 LN:179693
+@SQ SN:Un_KI270741v1 LN:157432
+@SQ SN:Un_GL000226v1 LN:15008
+@SQ SN:Un_GL000213v1 LN:164239
+@SQ SN:Un_KI270743v1 LN:210658
+@SQ SN:Un_KI270744v1 LN:168472
+@SQ SN:Un_KI270745v1 LN:41891
+@SQ SN:Un_KI270746v1 LN:66486
+@SQ SN:Un_KI270747v1 LN:198735
+@SQ SN:Un_KI270748v1 LN:93321
+@SQ SN:Un_KI270749v1 LN:158759
+@SQ SN:Un_KI270750v1 LN:148850
+@SQ SN:Un_KI270751v1 LN:150742
+@SQ SN:Un_KI270752v1 LN:27745
+@SQ SN:Un_KI270753v1 LN:62944
+@SQ SN:Un_KI270754v1 LN:40191
+@SQ SN:Un_KI270755v1 LN:36723
+@SQ SN:Un_KI270756v1 LN:79590
+@SQ SN:Un_KI270757v1 LN:71251
+@SQ SN:Un_GL000214v1 LN:137718
+@SQ SN:Un_KI270742v1 LN:186739
+@SQ SN:Un_GL000216v2 LN:176608
+@SQ SN:Un_GL000218v1 LN:161147
+@SQ SN:EBV LN:171823
+@RG ID:GATKCopyNumber SM:hg002
+20 90001 90100 -0.082386
+20 90101 90200 -0.042729
+20 90201 90300 0.246508
+20 90301 90400 -0.133577
+20 90401 90500 -0.248078
+20 90501 90600 -0.377049
+20 90601 90700 -0.030020
+20 90701 90800 -0.205219
+20 90801 90900 0.096751
+20 90901 91000 0.276859
+20 91001 91100 -0.076416
+20 91101 91200 0.435519
+20 91201 91300 -0.260407
+20 91301 91400 0.314984
+20 91401 91500 0.036276
+20 91501 91600 -0.190666
+20 91601 91700 0.278775
+20 91701 91800 0.115213
+20 91801 91900 -0.097977
+20 91901 92000 0.419661
+20 92001 92100 0.391919
+20 92101 92200 0.114558
+20 92201 92300 0.309484
+20 92301 92400 -0.121151
+20 92401 92500 -0.019289
+20 92501 92600 -0.590304
+20 92601 92700 -0.222244
+20 92701 92800 -0.351925
+20 92801 92900 0.225040
+20 92901 93000 0.104344
+20 93001 93100 -0.158730
+20 93101 93200 0.559246
+20 93201 93300 0.407682
+20 93301 93400 -0.083459
+20 93401 93500 0.344357
+20 93501 93600 -0.230986
+20 93601 93700 -0.033631
+20 93701 93800 -0.243347
+20 93801 93900 0.339796
+20 93901 94000 0.219763
+20 94001 94100 0.163647
+20 94101 94200 0.284660
+20 94201 94300 0.077820
+20 94301 94400 -0.108470
+20 94401 94500 0.040518
+20 94501 94600 0.137304
+20 94601 94700 -0.420225
+20 94701 94800 -0.237642
+20 94801 94900 -0.337373
+20 94901 95000 0.484378
+20 95001 95100 -0.351364
+20 95101 95200 -0.540956
+20 95201 95300 0.411661
+20 95301 95400 0.145571
+20 95401 95500 0.400021
+20 95501 95600 -0.149275
+20 95601 95700 -0.042284
+20 95701 95800 0.126260
+20 95801 95900 -0.123011
+20 95901 96000 0.260356
+20 96001 96100 -0.092015
+20 96101 96200 0.316630
+20 96201 96300 -0.262502
+20 96301 96400 -0.098632
+20 96401 96500 -0.436316
+20 96501 96600 0.180802
+20 96601 96700 -0.314259
+20 96701 96800 -0.056099
+20 96801 96900 0.222985
+20 96901 97000 0.256015
+20 97001 97100 0.234977
+20 97101 97200 0.015928
+20 97201 97300 0.542067
+20 97301 97400 0.052211
+20 97401 97500 0.347509
+20 97501 97600 0.139405
+20 97601 97700 -0.402959
+20 97701 97800 0.025147
+20 97801 97900 -0.340951
+20 97901 98000 -0.085121
+20 98001 98100 -0.314344
+20 98101 98200 -0.281636
+20 98201 98300 0.147276
+20 98301 98400 0.229243
+20 98401 98500 -0.196340
+20 98501 98600 0.139677
+20 98601 98700 0.164471
+20 98701 98800 -0.072459
+20 98801 98900 0.230651
+20 98901 99000 -0.079648
+20 99001 99100 0.026914
+20 99101 99200 -0.662400
+20 99201 99300 0.485609
+20 99301 99400 -0.168460
+20 99401 99500 -0.186852
+20 99501 99600 0.164201
+20 99601 99700 -0.247786
+20 99701 99800 0.229623
+20 99801 99900 0.395136
+20 99901 100000 0.323178
diff --git a/data/genomics/homo_sapiens/illumina/tab/gnomad_hg38_chr20_90000_to_100000.0.05.txt.gz b/data/genomics/homo_sapiens/illumina/tab/gnomad_hg38_chr20_90000_to_100000.0.05.txt.gz
new file mode 100644
index 000000000..86b689634
Binary files /dev/null and b/data/genomics/homo_sapiens/illumina/tab/gnomad_hg38_chr20_90000_to_100000.0.05.txt.gz differ
diff --git a/data/genomics/homo_sapiens/illumina/vcf/hg002_chr20_90000_to_100000.dnascope.gvcf.gz b/data/genomics/homo_sapiens/illumina/vcf/hg002_chr20_90000_to_100000.dnascope.gvcf.gz
new file mode 100644
index 000000000..8642e6c86
Binary files /dev/null and b/data/genomics/homo_sapiens/illumina/vcf/hg002_chr20_90000_to_100000.dnascope.gvcf.gz differ
diff --git a/data/genomics/homo_sapiens/illumina/vcf/hg002_chr20_90000_to_100000.dnascope.gvcf.gz.tbi b/data/genomics/homo_sapiens/illumina/vcf/hg002_chr20_90000_to_100000.dnascope.gvcf.gz.tbi
new file mode 100644
index 000000000..5e157dcd2
Binary files /dev/null and b/data/genomics/homo_sapiens/illumina/vcf/hg002_chr20_90000_to_100000.dnascope.gvcf.gz.tbi differ
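
The range extraction that the README hunk describes for the binned coverage file can also be sketched in Python. This is a minimal sketch and not the command used to build the committed test file. The function name `filter_bins` and the sample rows are hypothetical. It assumes that header lines start with `@`, that columns are whitespace-separated with the contig in column 1 and the bin start in column 2, and that header lines should be kept, as they are in the committed `.standardizedCR.tsv`.

```python
import io


def filter_bins(lines, contig="20", start=90000, end=100000):
    """Yield '@' header lines plus bins on `contig` whose start is in [start, end]."""
    for line in lines:
        if line.startswith("@"):
            # SAM-style header lines are passed through unchanged.
            yield line
            continue
        fields = line.split()
        if len(fields) < 4 or fields[0] != contig:
            # Skips the column-name row and bins on other contigs.
            continue
        if start <= int(fields[1]) <= end:
            yield line


# Illustrative input only: the rows below are made up for the example,
# apart from the first chromosome 20 bin that appears in the test file.
example = io.StringIO(
    "@HD\tVN:1.6\n"
    "@RG\tID:GATKCopyNumber\tSM:hg002\n"
    "CONTIG\tSTART\tEND\tLOG2_COPY_RATIO\n"
    "20\t89901\t90000\t0.1\n"
    "20\t90001\t90100\t-0.082386\n"
    "21\t90001\t90100\t0.5\n"
)
kept = list(filter_bins(example))
```

Comparing the bin start as an integer only after the header check avoids relying on how a text-processing tool compares non-numeric header fields with numeric bounds.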