Merged
69 changes: 69 additions & 0 deletions data/genomics/homo_sapiens/README.md
@@ -579,6 +579,75 @@ This dataset contains:

This folder contains `AnnotFilterRule.pm` which comes from [The Broad](https://data.broadinstitute.org/Trinity/CTAT_RESOURCE_LIB/AnnotFilterRule.pm) and is used for filtering in `starfusion`.

### Gens test data

These files are used to test the Gens input preprocessing module.

#### Binned coverage

This file was obtained from the following WGS processing pipeline: [SMD WGS pipeline](https://github.com/SMD-Bioinformatics-Lund/nextflow_wgs).

`data/genomics/homo_sapiens/illumina/gatk/hg002_chr20_90000_to_100000.standardizedCR.tsv`

The relevant pipeline commands are shown below. The inputs are an aligned BAM file and an interval file specifying 100 bp bins. A GATK-format panel of normals is also required.

```
gatk CollectReadCounts \\
-I $bam -L $params.COV_INTERVAL_LIST \\
--interval-merging-rule OVERLAPPING_ONLY -O ${bam}.hdf5

gatk --java-options "-Xmx30g" DenoiseReadCounts \\
-I ${bam}.hdf5 --count-panel-of-normals ${PON[sex]} \\
--standardized-copy-ratios ${id}.standardizedCR.tsv \\
--denoised-copy-ratios ${id}.denoisedCR.tsv
```

The standardized copy-ratio output is then filtered to retain only chromosome 20 entries at positions 90,000 to 100,000.

```
grep -E "^@|^20" hg002.standardizedCR.tsv | awk '/^@/ || ($2 >= 90000 && $2 <= 100000)' > hg002_chr20_90000_to_100000.standardizedCR.tsv
```
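As a sanity check, this kind of range filter can be tried on toy data. The sketch below is illustrative only: the toy rows and values are made up, and the `$1 == "20"` test is used as a stricter stand-in for the `grep` prefix match. The `/^@/ ||` guard keeps the SAM-style header lines, which the committed test file contains.

```shell
# Toy stand-in for hg002.standardizedCR.tsv (all values are made up)
toy=$(mktemp)
printf '@HD\tVN:1.6\n@SQ\tSN:20\tLN:64444167\n' > "$toy"
printf 'CONTIG\tSTART\tEND\tLOG2_COPY_RATIO\n' >> "$toy"
printf '19\t90001\t90100\t0.1\n' >> "$toy"
printf '20\t89901\t90000\t0.2\n' >> "$toy"
printf '20\t90001\t90100\t0.3\n' >> "$toy"
printf '20\t100001\t100100\t0.4\n' >> "$toy"

# Keep "@" header lines, plus chromosome 20 rows whose start lies in the range.
# The column-name row and rows from other contigs are dropped.
subset=$(awk -F'\t' '/^@/ || ($1 == "20" && $2 >= 90000 && $2 <= 100000)' "$toy")
printf '%s\n' "$subset"
rm -f "$toy"
```

Only the two header lines and the single in-range chromosome 20 bin survive. The filter tests the bin start only, so a bin starting at 100001 is excluded even though it is adjacent to the range.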

#### SNV calls (gVCF)

This file was obtained from the following WGS processing pipeline: [SMD WGS pipeline](https://github.com/SMD-Bioinformatics-Lund/nextflow_wgs).

`data/genomics/homo_sapiens/illumina/vcf/hg002_chr20_90000_to_100000.dnascope.gvcf.gz`

Output from running DNAscope on the aligned BAM file and then extracting the range 90,000 to 100,000 from chromosome 20.

DNAscope is a reimplementation of, and a slight improvement on, GATK's HaplotypeCaller.

A masked hg38 genome was used as the reference, and the inputs were base-quality recalibrated.

```
sentieon driver \\
-r ${params.genome_file} \\
-q $bqsr \\
-i $bam \\
--algo DNAscope --emit_mode GVCF ${id}.dnascope.gvcf.gz
```

```
zcat hg002.dnascope.gvcf.gz | grep -E "^#|^20" | awk '/^#/ || ($2 >= 90000 && $2 <= 100000)' > hg002_chr20_90000_to_100000.dnascope.gvcf
bgzip hg002_chr20_90000_to_100000.dnascope.gvcf
tabix hg002_chr20_90000_to_100000.dnascope.gvcf.gz
```
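The effect of a position-based filter on a gVCF can be illustrated with a toy file. This is a sketch under stated assumptions: the records are invented, and the explicit `$1 == "20"` test stands in for the `grep` prefix match. Because only `POS` is tested, a reference block that starts before 90,000 is dropped even if its `END` reaches into the range.

```shell
# Toy gVCF (records are invented for illustration)
toy_vcf=$(mktemp)
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf '20\t89999\t.\tA\t<NON_REF>\t.\t.\tEND=90010\n'
    printf '20\t90011\t.\tC\tT,<NON_REF>\t50\t.\t.\n'
    printf '20\t100000\t.\tG\t<NON_REF>\t.\t.\tEND=100005\n'
    printf '21\t95000\t.\tT\t<NON_REF>\t.\t.\tEND=95010\n'
} > "$toy_vcf"

# Keep all header lines, plus chromosome 20 records whose POS lies in the range
filtered=$(awk -F'\t' '/^#/ || ($1 == "20" && $2 >= 90000 && $2 <= 100000)' "$toy_vcf")
printf '%s\n' "$filtered"

# Count the surviving records (non-header lines)
n_records=$(printf '%s\n' "$filtered" | grep -vc '^#')
echo "records kept: $n_records"
rm -f "$toy_vcf"
```

Both header lines are retained, along with the records at positions 90011 and 100000.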

#### B-allele frequency sampling locations

`data/genomics/homo_sapiens/illumina/tab/gnomad_hg38_chr20_90000_to_100000.0.05.txt.gz`

Subset of the file https://github.com/SMD-Bioinformatics-Lund/gens/releases/download/v4.3.0/gnomad_hg38.0.05.txt.gz

It is based on gnomAD (v2), from which locations with an ALT allele frequency >= 0.05 were extracted. The file contains only the locations of these calls (i.e. column 1: chromosome, column 2: position).

Then the target range is extracted:

```
zcat gnomad_hg38.0.05.txt.gz | awk '$1 == 20 && ($2 >= 90000 && $2 <= 100000)' | gzip > gnomad_hg38_chr20_90000_to_100000.0.05.txt.gz
```
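A toy round trip of the same extraction is sketched below. The file names and positions are hypothetical. `gzip -dc` is used in place of `zcat` purely as a portability choice, since some platforms ship a `zcat` that expects `.Z` files.

```shell
# Toy two-column site list: chromosome, position (values are invented)
work=$(mktemp -d)
printf '2\t95000\n20\t89999\n20\t90000\n20\t95500\n20\t100001\nX\t95000\n' \
    | gzip > "$work/toy_sites.txt.gz"

# Decompress, keep chromosome 20 sites in the inclusive range, recompress
gzip -dc "$work/toy_sites.txt.gz" \
    | awk -F'\t' '$1 == "20" && $2 >= 90000 && $2 <= 100000' \
    | gzip > "$work/toy_sites_chr20.txt.gz"

# Read the subset back to confirm what was kept
kept=$(gzip -dc "$work/toy_sites_chr20.txt.gz")
printf '%s\n' "$kept"
rm -rf "$work"
```

Both range bounds are inclusive, so the site at position 90000 is kept while 89999 and 100001 are dropped.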

### Missing files

1. Single-end reads
@@ -0,0 +1,297 @@
@HD VN:1.6
@SQ SN:1 LN:248956422
@SQ SN:2 LN:242193529
@SQ SN:3 LN:198295559
@SQ SN:4 LN:190214555
@SQ SN:5 LN:181538259
@SQ SN:6 LN:170805979
@SQ SN:7 LN:159345973
@SQ SN:8 LN:145138636
@SQ SN:9 LN:138394717
@SQ SN:10 LN:133797422
@SQ SN:11 LN:135086622
@SQ SN:12 LN:133275309
@SQ SN:13 LN:114364328
@SQ SN:14 LN:107043718
@SQ SN:15 LN:101991189
@SQ SN:16 LN:90338345
@SQ SN:17 LN:83257441
@SQ SN:18 LN:80373285
@SQ SN:19 LN:58617616
@SQ SN:20 LN:64444167
@SQ SN:21 LN:46709983
@SQ SN:22 LN:50818468
@SQ SN:X LN:156040895
@SQ SN:Y LN:57227415
@SQ SN:M LN:16569
@SQ SN:1_KI270706v1_random LN:175055
@SQ SN:1_KI270707v1_random LN:32032
@SQ SN:1_KI270708v1_random LN:127682
@SQ SN:1_KI270709v1_random LN:66860
@SQ SN:1_KI270710v1_random LN:40176
@SQ SN:1_KI270711v1_random LN:42210
@SQ SN:1_KI270712v1_random LN:176043
@SQ SN:1_KI270713v1_random LN:40745
@SQ SN:1_KI270714v1_random LN:41717
@SQ SN:2_KI270715v1_random LN:161471
@SQ SN:2_KI270716v1_random LN:153799
@SQ SN:3_GL000221v1_random LN:155397
@SQ SN:4_GL000008v2_random LN:209709
@SQ SN:5_GL000208v1_random LN:92689
@SQ SN:9_KI270717v1_random LN:40062
@SQ SN:9_KI270718v1_random LN:38054
@SQ SN:9_KI270719v1_random LN:176845
@SQ SN:9_KI270720v1_random LN:39050
@SQ SN:11_KI270721v1_random LN:100316
@SQ SN:14_GL000009v2_random LN:201709
@SQ SN:14_GL000225v1_random LN:211173
@SQ SN:14_KI270722v1_random LN:194050
@SQ SN:14_GL000194v1_random LN:191469
@SQ SN:14_KI270723v1_random LN:38115
@SQ SN:14_KI270724v1_random LN:39555
@SQ SN:14_KI270725v1_random LN:172810
@SQ SN:14_KI270726v1_random LN:43739
@SQ SN:15_KI270727v1_random LN:448248
@SQ SN:16_KI270728v1_random LN:1872759
@SQ SN:17_GL000205v2_random LN:185591
@SQ SN:17_KI270729v1_random LN:280839
@SQ SN:17_KI270730v1_random LN:112551
@SQ SN:22_KI270731v1_random LN:150754
@SQ SN:22_KI270732v1_random LN:41543
@SQ SN:22_KI270733v1_random LN:179772
@SQ SN:22_KI270734v1_random LN:165050
@SQ SN:22_KI270735v1_random LN:42811
@SQ SN:22_KI270736v1_random LN:181920
@SQ SN:22_KI270737v1_random LN:103838
@SQ SN:22_KI270738v1_random LN:99375
@SQ SN:22_KI270739v1_random LN:73985
@SQ SN:Y_KI270740v1_random LN:37240
@SQ SN:Un_KI270302v1 LN:2274
@SQ SN:Un_KI270304v1 LN:2165
@SQ SN:Un_KI270303v1 LN:1942
@SQ SN:Un_KI270305v1 LN:1472
@SQ SN:Un_KI270322v1 LN:21476
@SQ SN:Un_KI270320v1 LN:4416
@SQ SN:Un_KI270310v1 LN:1201
@SQ SN:Un_KI270316v1 LN:1444
@SQ SN:Un_KI270315v1 LN:2276
@SQ SN:Un_KI270312v1 LN:998
@SQ SN:Un_KI270311v1 LN:12399
@SQ SN:Un_KI270317v1 LN:37690
@SQ SN:Un_KI270412v1 LN:1179
@SQ SN:Un_KI270411v1 LN:2646
@SQ SN:Un_KI270414v1 LN:2489
@SQ SN:Un_KI270419v1 LN:1029
@SQ SN:Un_KI270418v1 LN:2145
@SQ SN:Un_KI270420v1 LN:2321
@SQ SN:Un_KI270424v1 LN:2140
@SQ SN:Un_KI270417v1 LN:2043
@SQ SN:Un_KI270422v1 LN:1445
@SQ SN:Un_KI270423v1 LN:981
@SQ SN:Un_KI270425v1 LN:1884
@SQ SN:Un_KI270429v1 LN:1361
@SQ SN:Un_KI270442v1 LN:392061
@SQ SN:Un_KI270466v1 LN:1233
@SQ SN:Un_KI270465v1 LN:1774
@SQ SN:Un_KI270467v1 LN:3920
@SQ SN:Un_KI270435v1 LN:92983
@SQ SN:Un_KI270438v1 LN:112505
@SQ SN:Un_KI270468v1 LN:4055
@SQ SN:Un_KI270510v1 LN:2415
@SQ SN:Un_KI270509v1 LN:2318
@SQ SN:Un_KI270518v1 LN:2186
@SQ SN:Un_KI270508v1 LN:1951
@SQ SN:Un_KI270516v1 LN:1300
@SQ SN:Un_KI270512v1 LN:22689
@SQ SN:Un_KI270519v1 LN:138126
@SQ SN:Un_KI270522v1 LN:5674
@SQ SN:Un_KI270511v1 LN:8127
@SQ SN:Un_KI270515v1 LN:6361
@SQ SN:Un_KI270507v1 LN:5353
@SQ SN:Un_KI270517v1 LN:3253
@SQ SN:Un_KI270529v1 LN:1899
@SQ SN:Un_KI270528v1 LN:2983
@SQ SN:Un_KI270530v1 LN:2168
@SQ SN:Un_KI270539v1 LN:993
@SQ SN:Un_KI270538v1 LN:91309
@SQ SN:Un_KI270544v1 LN:1202
@SQ SN:Un_KI270548v1 LN:1599
@SQ SN:Un_KI270583v1 LN:1400
@SQ SN:Un_KI270587v1 LN:2969
@SQ SN:Un_KI270580v1 LN:1553
@SQ SN:Un_KI270581v1 LN:7046
@SQ SN:Un_KI270579v1 LN:31033
@SQ SN:Un_KI270589v1 LN:44474
@SQ SN:Un_KI270590v1 LN:4685
@SQ SN:Un_KI270584v1 LN:4513
@SQ SN:Un_KI270582v1 LN:6504
@SQ SN:Un_KI270588v1 LN:6158
@SQ SN:Un_KI270593v1 LN:3041
@SQ SN:Un_KI270591v1 LN:5796
@SQ SN:Un_KI270330v1 LN:1652
@SQ SN:Un_KI270329v1 LN:1040
@SQ SN:Un_KI270334v1 LN:1368
@SQ SN:Un_KI270333v1 LN:2699
@SQ SN:Un_KI270335v1 LN:1048
@SQ SN:Un_KI270338v1 LN:1428
@SQ SN:Un_KI270340v1 LN:1428
@SQ SN:Un_KI270336v1 LN:1026
@SQ SN:Un_KI270337v1 LN:1121
@SQ SN:Un_KI270363v1 LN:1803
@SQ SN:Un_KI270364v1 LN:2855
@SQ SN:Un_KI270362v1 LN:3530
@SQ SN:Un_KI270366v1 LN:8320
@SQ SN:Un_KI270378v1 LN:1048
@SQ SN:Un_KI270379v1 LN:1045
@SQ SN:Un_KI270389v1 LN:1298
@SQ SN:Un_KI270390v1 LN:2387
@SQ SN:Un_KI270387v1 LN:1537
@SQ SN:Un_KI270395v1 LN:1143
@SQ SN:Un_KI270396v1 LN:1880
@SQ SN:Un_KI270388v1 LN:1216
@SQ SN:Un_KI270394v1 LN:970
@SQ SN:Un_KI270386v1 LN:1788
@SQ SN:Un_KI270391v1 LN:1484
@SQ SN:Un_KI270383v1 LN:1750
@SQ SN:Un_KI270393v1 LN:1308
@SQ SN:Un_KI270384v1 LN:1658
@SQ SN:Un_KI270392v1 LN:971
@SQ SN:Un_KI270381v1 LN:1930
@SQ SN:Un_KI270385v1 LN:990
@SQ SN:Un_KI270382v1 LN:4215
@SQ SN:Un_KI270376v1 LN:1136
@SQ SN:Un_KI270374v1 LN:2656
@SQ SN:Un_KI270372v1 LN:1650
@SQ SN:Un_KI270373v1 LN:1451
@SQ SN:Un_KI270375v1 LN:2378
@SQ SN:Un_KI270371v1 LN:2805
@SQ SN:Un_KI270448v1 LN:7992
@SQ SN:Un_KI270521v1 LN:7642
@SQ SN:Un_GL000195v1 LN:182896
@SQ SN:Un_GL000219v1 LN:179198
@SQ SN:Un_GL000220v1 LN:161802
@SQ SN:Un_GL000224v1 LN:179693
@SQ SN:Un_KI270741v1 LN:157432
@SQ SN:Un_GL000226v1 LN:15008
@SQ SN:Un_GL000213v1 LN:164239
@SQ SN:Un_KI270743v1 LN:210658
@SQ SN:Un_KI270744v1 LN:168472
@SQ SN:Un_KI270745v1 LN:41891
@SQ SN:Un_KI270746v1 LN:66486
@SQ SN:Un_KI270747v1 LN:198735
@SQ SN:Un_KI270748v1 LN:93321
@SQ SN:Un_KI270749v1 LN:158759
@SQ SN:Un_KI270750v1 LN:148850
@SQ SN:Un_KI270751v1 LN:150742
@SQ SN:Un_KI270752v1 LN:27745
@SQ SN:Un_KI270753v1 LN:62944
@SQ SN:Un_KI270754v1 LN:40191
@SQ SN:Un_KI270755v1 LN:36723
@SQ SN:Un_KI270756v1 LN:79590
@SQ SN:Un_KI270757v1 LN:71251
@SQ SN:Un_GL000214v1 LN:137718
@SQ SN:Un_KI270742v1 LN:186739
@SQ SN:Un_GL000216v2 LN:176608
@SQ SN:Un_GL000218v1 LN:161147
@SQ SN:EBV LN:171823
@RG ID:GATKCopyNumber SM:hg002
20 90001 90100 -0.082386
20 90101 90200 -0.042729
20 90201 90300 0.246508
20 90301 90400 -0.133577
20 90401 90500 -0.248078
20 90501 90600 -0.377049
20 90601 90700 -0.030020
20 90701 90800 -0.205219
20 90801 90900 0.096751
20 90901 91000 0.276859
20 91001 91100 -0.076416
20 91101 91200 0.435519
20 91201 91300 -0.260407
20 91301 91400 0.314984
20 91401 91500 0.036276
20 91501 91600 -0.190666
20 91601 91700 0.278775
20 91701 91800 0.115213
20 91801 91900 -0.097977
20 91901 92000 0.419661
20 92001 92100 0.391919
20 92101 92200 0.114558
20 92201 92300 0.309484
20 92301 92400 -0.121151
20 92401 92500 -0.019289
20 92501 92600 -0.590304
20 92601 92700 -0.222244
20 92701 92800 -0.351925
20 92801 92900 0.225040
20 92901 93000 0.104344
20 93001 93100 -0.158730
20 93101 93200 0.559246
20 93201 93300 0.407682
20 93301 93400 -0.083459
20 93401 93500 0.344357
20 93501 93600 -0.230986
20 93601 93700 -0.033631
20 93701 93800 -0.243347
20 93801 93900 0.339796
20 93901 94000 0.219763
20 94001 94100 0.163647
20 94101 94200 0.284660
20 94201 94300 0.077820
20 94301 94400 -0.108470
20 94401 94500 0.040518
20 94501 94600 0.137304
20 94601 94700 -0.420225
20 94701 94800 -0.237642
20 94801 94900 -0.337373
20 94901 95000 0.484378
20 95001 95100 -0.351364
20 95101 95200 -0.540956
20 95201 95300 0.411661
20 95301 95400 0.145571
20 95401 95500 0.400021
20 95501 95600 -0.149275
20 95601 95700 -0.042284
20 95701 95800 0.126260
20 95801 95900 -0.123011
20 95901 96000 0.260356
20 96001 96100 -0.092015
20 96101 96200 0.316630
20 96201 96300 -0.262502
20 96301 96400 -0.098632
20 96401 96500 -0.436316
20 96501 96600 0.180802
20 96601 96700 -0.314259
20 96701 96800 -0.056099
20 96801 96900 0.222985
20 96901 97000 0.256015
20 97001 97100 0.234977
20 97101 97200 0.015928
20 97201 97300 0.542067
20 97301 97400 0.052211
20 97401 97500 0.347509
20 97501 97600 0.139405
20 97601 97700 -0.402959
20 97701 97800 0.025147
20 97801 97900 -0.340951
20 97901 98000 -0.085121
20 98001 98100 -0.314344
20 98101 98200 -0.281636
20 98201 98300 0.147276
20 98301 98400 0.229243
20 98401 98500 -0.196340
20 98501 98600 0.139677
20 98601 98700 0.164471
20 98701 98800 -0.072459
20 98801 98900 0.230651
20 98901 99000 -0.079648
20 99001 99100 0.026914
20 99101 99200 -0.662400
20 99201 99300 0.485609
20 99301 99400 -0.168460
20 99401 99500 -0.186852
20 99501 99600 0.164201
20 99601 99700 -0.247786
20 99701 99800 0.229623
20 99801 99900 0.395136
20 99901 100000 0.323178
Binary file not shown.
Binary file not shown.
Binary file not shown.