Skip to content

Commit 61efcb3

Browse files
authored
[doc] Migrate docs from wiki (#78)
* [doc] Migrate docs from wiki
1 parent 510ffd5 commit 61efcb3

File tree

6 files changed

+247
-12
lines changed

6 files changed

+247
-12
lines changed

README.md

Lines changed: 2 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -5,15 +5,6 @@
55

66

77
Welcome to DWGSIM.
8+
----
89

9-
Please see the file LICENSE for details.
10-
Please see the file INSTALL for installation instructions;
11-
12-
This software package has limited support since it is not longer in active development.
13-
14-
Please use the following command when cloning this repository:
15-
16-
```git clone --recursive```
17-
18-
Please see the following page for more details:
19-
https://github.com/nh13/DWGSIM/wiki
10+
Documentation can be found in the [docs folder](docs/01_Introduction.md)

docs/01_Introduction.md

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,5 @@
1+
# Introduction
2+
3+
* [Installation](02_Installation.md)
4+
* [Simulating Reads](03_Simulating_Reads.md)
5+
* [Evaluating Mappings](04_Evaluating_Mappings.md)

docs/02_Installation.md

Lines changed: 19 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,19 @@
1+
# Installation
2+
3+
Please use the following command when cloning this repository:
4+
5+
```console
6+
git clone --recursive git@github.com:nh13/DWGSIM.git
7+
```
8+
9+
Next, build the `dwgsim` and `dwgsim_eval` binaries:
10+
11+
```console
12+
make
13+
```
14+
15+
You can run a simple integration test with:
16+
17+
```console
18+
make test
19+
```

docs/03_Simulating_Reads.md

Lines changed: 158 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,158 @@
1+
# Simulating Reads
2+
3+
<!---toc start-->
4+
* [Overview](#overview)
5+
* [Error rates explained](#error-rates-explained)
6+
* [Read names explained](#read-names-explained)
7+
* [Mate pair or paired end modes](#mate-pair-or-paired-end-modes)
8+
* [Output mutations file](#output-mutations-file)
9+
* [Output FASTQ files](#output-fastq-files)
10+
11+
<!---toc end-->
12+
13+
## Overview
14+
15+
The `dwgsim` tool simulates reads from a reference genome FASTA.
16+
It can be used to evaluate both mapping and variant calling.
17+
See the `dwgsim_eval` tool for [Evaluating Mappings](04_Evaluating_Mappings.md).
18+
19+
Use `dwgsim -h` to see the full set of command line options (usage).
20+
21+
An example command is as follows:
22+
23+
```console
24+
dwgsim -N 10000 -1 100 -2 100 -y 0 phix.fasta output
25+
```
26+
27+
This will simulate 10,000 reads (`-N 10000`), which are paired end 2x100bp (`-1 100` for R1 and `-2 100` for R2), with no random reads (`-y 0`),
28+
from the given genome FASTA (`phix.fasta`), producing output with prefix `output`.
29+
30+
The following output will be created:
31+
32+
| Name | Description |
33+
| --- | --- |
34+
| output.bfast.fastq.gz | Interleaved FASTQ containing both read one and read two |
35+
| output.bwa.read1.fastq.gz | FASTQ containing only read one |
36+
| output.bwa.read2.fastq.gz | FASTQ containing only read two |
37+
| output.mutations.vcf | VCF containing simulated mutations |
38+
| output.mutations.txt | TXT in a custom format containing simulated mutations (see [below](#output-mutations-file)) |
39+
40+
41+
Notes:
42+
43+
- the longest supported insertion is 255bp
44+
- The `-H` mode will simulate a haploid genome, whereas the default is to simulate a diploid genome.
45+
46+
## Error rates explained
47+
48+
The `-e` and `-E` options accept a uniform error rate (i.e. `-e 0.01` for 1%), or a uniformly increasing/decreasing error rate (i.e. `-e 0.01-0.1` for an error rate of 1% at the start of the read increasing to 10% at the end of the read).
49+
50+
## Read names explained
51+
52+
Read names are of the form:
53+
54+
```
55+
@<#1>_<#2>_<#3>_<#4>_<#5>_<#6>_<#7>_<#8>:<#9>:<#10>_<#11>:<#12>:<#13>_<#14>
56+
```
57+
58+
| Field | Description |
59+
| --- | --- |
60+
| 1 | contig name (chromsome name) |
61+
| 2 | start read 1 (one-based) |
62+
| 3 | start read 2 (one-based) |
63+
| 4 | strand read 1 (0 - forward, 1 - reverse) |
64+
| 5 | strand read 2 (0 - forward, 1 - reverse) |
65+
| 6 | random read 1 (0 - from the mutated reference, 1 - random) |
66+
| 7 | random read 2 (0 - from the mutated reference, 1 - random) |
67+
| 8 | number of sequencing errors read 1 (color errors for colorspace) |
68+
| 9 | number of SNPs read 1 |
69+
| 10 | number of indels read 1 |
70+
| 11 | number of sequencing errors read 2 (color errors for colorspace) |
71+
| 12 | number of SNPs read 2 |
72+
| 13 | number of indels read 2 |
73+
| 14 | read number (unique within a given contig/chromsome) |
74+
75+
Read 1 and read 2 correspond to the first and second reads from a paired-end/mate-pair read respectively.
76+
77+
## Mate pair or paired end modes
78+
79+
This utility can generate mate pair or paired end reads using the `-S` option.
80+
By default, Illumina (nucleotide) data are paired end, and SOLiD (color space) data are mate pair.
81+
For clarity, lets call the first end sequence E1 and the second end E2.
82+
83+
Paired end reads have the following orientation:
84+
85+
```
86+
5' E1 -----> .... 3'
87+
3' .... <------- E2 5'
88+
```
89+
90+
Above, the start co-ordinate of E1 is less than E2, with E1 and E2 reported on opposite strands.
91+
92+
Mate pair reads have following orientation
93+
94+
```
95+
5' E2 -----> .... E1 -------> 3'
96+
3' .... 5'
97+
```
98+
99+
Above, the start co-ordinate of E1 is greater than E2, with E1 and E2 reported on the same strand.
100+
101+
So for SOLiD mate pair reads, the R3 tag (E2) is listed before the F3 tag (E1).
102+
For SOLiD paired end reads, the F3 tag (E1) is listed before the F5 tag (E2).
103+
104+
## Output mutations file
105+
106+
The locations of introduced mutations are given in a `<prefix>.mutations.txt` text file.
107+
There are file columns:
108+
109+
1. the chromosome/contig name
110+
2. the one-based position
111+
3. the original reference base
112+
4. the new reference base(s)
113+
5. the variant strand(s)
114+
115+
SNPs are represented on one line, and in the case of heterozygous mutations, the new reference base is an [IUPAC](http://www.bioinformatics.org/sms/iupac.html) code.
116+
117+
```
118+
contig4 4 T K 1
119+
```
120+
121+
The above shows a heterozygous mutation at position 4 of contig4 on the first strand, mutating the T base to a heterozygous K (G or T) SNP.
122+
123+
Insertions are represented on one line, where the reference base is missing (indicated by a '-' in the third column).
124+
125+
```
126+
contig5 13 - TAC 3
127+
```
128+
129+
The above shows a homozygous insertion of TAC prior to position 13.
130+
131+
Each base of a deletion is represented on one line, where the new reference base is missing and represented by a '-'.
132+
133+
```
134+
contig6 22 A - 2
135+
```
136+
137+
The above shows a heterozygous deletion of T at position 22 on the second strand.
138+
Multi-base deletions are show on consecutive lines.
139+
140+
```
141+
contig6 22 A - 2
142+
contig6 23 C - 2
143+
```
144+
145+
The above shows a two base homozygous deletion of positions 22 and 23 on the second strand.
146+
147+
## Output FASTQ files
148+
149+
Three FASTQ files are produced, for use with BFAST (interleaved FASTQ) and BWA (one FASTQ per read end).
150+
151+
The FASTQ for BFAST is formatted so that the multi-end reads (paired end or mate pair) occur consecutively in the FASTQ (interleaved), with the read that is 5' of the other listed first.
152+
For paired end reads, this means that E1 is always listed before E2.
153+
For mate pair reads, this means that E2 is always listed before E1.
154+
155+
The FASTQs for BWA are split into two files, the first file for one end, the second file for the other, with the read that is 5' of the other in the first file.
156+
For paired end reads, this means that E1 is in the first file and E2 is in the second file.
157+
For mate pair reads, this means that E2 is in the first file and E1 is in the second file.
158+

docs/04_Evaluating_Mappings.md

Lines changed: 62 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,62 @@
1+
# Evaluating Mappings
2+
3+
<!---toc start-->
4+
* [Overview](#overview)
5+
* [Key Options](#key-options)
6+
* [Output](#output)
7+
8+
<!---toc end-->
9+
10+
## Overview
11+
12+
The `dwgsim_eval` tool evaluates the mappings from reads produced by `dwgsim`.
13+
See the `dwgsim` tool for [Simulating Reads](03_Simulating_Reads.md).
14+
15+
Use `dwgsim_eval -h` to see the full set of command line options (usage).
16+
17+
An example command is as follows:
18+
19+
```console
20+
dwgsim_eval mapped.bam
21+
```
22+
23+
A read is correctly mapped if the mapped start position is within `INT` bases (`-g`) of the simulated read's start position.
24+
Otherwise, it is counted as incorrectly mapped.
25+
Furthermore, random reads created by `dwgsim` are expected to be unmapped (counted as false mappings otherwise).
26+
27+
## Key Options
28+
29+
- Use the `-c` option if the reads are from SOLID
30+
- Use the `-b` option if the alignments are from BWA when mapping SOLiD reads
31+
- Use the `-z` option if the input reads are expected to be single end (not paired end)
32+
- Use the `-s` option if the input is a SAM file (default: BAM)
33+
- Use the `-n` option to specify the number of reads if only a subset of the input reads were used (affects the denominator)
34+
35+
36+
## Output
37+
38+
The `dwgsim_eval` tool outputs a table with the following columns:
39+
40+
| Column | Description |
41+
| --- | --- |
42+
| `thr` | The alignment threshold (see `-a`) |
43+
| `mc` | The number of reads mapped correctly that should be mapped at the threshold |
44+
| `mi` | The number of reads mapped incorrectly that should be mapped be mapped at the threshold |
45+
| `mu` | The number of reads unmapped that should be mapped be mapped at the threshold |
46+
| `um` | The number of reads mapped that should be unmapped be mapped at the threshold |
47+
| `uu` | The number of reads unmapped that should be unmapped be mapped at the threshold |
48+
| `mc + mi + mu + um + uu` | The total number of reads that should be unmapped be mapped at the threshold |
49+
| `mc'` | The number of reads mapped correctly that should be mapped at or greater than the threshold |
50+
| `mi'` | The number of reads mapped incorrectly that should be mapped be mapped at or greater than the threshold |
51+
| `mu'` | The number of reads unmapped that should be mapped be mapped at or greater than the threshold |
52+
| `um'` | The number of reads mapped that should be unmapped be mapped at or greater than the threshold |
53+
| `uu'` | The number of reads unmapped that should be unmapped be mapped at or greater than the threshold |
54+
| `mc' + mi' + mu' + um' + uu'` | The total number of reads that should be unmapped be mapped at or greater than the threshold |
55+
| `mc / (mc + mi + mu)` | Sensitivity at the threshold. I.e. the fraction of reads that should be mapped that are mapped correctly. |
56+
| `mc / (mc + mi)` | Positive predictive value at the threshold. I.e. The fraction of mapped reads that are mapped correctly. |
57+
| `um / (um + uu)` | False discovery rate at the threshold. I.e. The fraction of random reads that are mapped. |
58+
| `mc' / (mc' + mi' + mu')` | Sensitivity at or greater than the threshold. I.e. the fraction of reads that should be mapped that are mapped correctly. |
59+
| `mc' / (mc' + mi')` | Positive predictive value at or greater than the threshold. I.e. The fraction of mapped reads that are mapped correctly. |
60+
| `um' / (um' + uu')` | False discovery rate at or greater than the threshold. I.e. The fraction of random reads that are mapped. |
61+
62+
"At or greater than the threshold" tells us what our sensitivity, PPV, and FDR would be if we filtered based on that threshold.

src/dwgsim_eval.c

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -27,7 +27,7 @@ print_usage(dwgsim_eval_args_t *args)
2727
fprintf(stderr, "\t\t\t\t\t1: by alignment score\n");
2828
fprintf(stderr, "\t\t\t\t\t2: by suboptimal alignment score\n");
2929
fprintf(stderr, "\t\t\t\t\t3: by alignment score - suboptimal alignment score\n");
30-
fprintf(stderr, "\t-b\t\talignments are from BWA (SOLiD only) [%s]\n", __IS_TRUE(args->b));
30+
fprintf(stderr, "\t-b\t\talignments are from BWA (for SOLiD data only) [%s]\n", __IS_TRUE(args->b));
3131
fprintf(stderr, "\t-c\t\tcolor space alignments [%s]\n", __IS_TRUE(args->c));
3232
fprintf(stderr, "\t-d\tINT\tdivide quality/alignment score by this factor [%d]\n", args->d);
3333
fprintf(stderr, "\t-g\t\tgap \"wiggle\" [%d]\n", args->g);

0 commit comments

Comments
 (0)