XP-EHH NaN values and duplicated lines after normalization (selscan v2.1.1 & v3.0) #152

@chencj2599

Hi,

I am running XP-EHH on unphased VCF data and see the same behavior with selscan v2.1.1 and v3.0 (and with the norm program from each release):

  1. Many xpehh values are -nan in the raw output.
  2. After normalization, the .xpehh.out.norm file looks corrupted: one SNP position is repeated many times and normxpehh is also -nan.

Command and log (example)

selscan --xpehh --unphased \
  --vcf ../XP-EHH/pop1/Chr01A.vcf.gz \
  --vcf-ref ../XP-EHH/pop2/Chr01A.vcf.gz \
  --pmap \
  --max-gap 250000 \
  --threads 30 \
  --out Chr01A

Log excerpt:

selscan v2.1.1
Loading 14 haplotypes and 1031943 loci. Skipped 0 loci
Loading 47 haplotypes and 1031943 loci. Skipped 0 loci
...
Starting XP-EHH calculations.
WARNING: Reached chromosome edge before EHH decayed below 0.05. 
--trunc-ok set. Skipping calculation at position 8422 id: .
...
Finished XP-EHH.

(I get similar warnings and output patterns with v3.0.)


Raw XP-EHH output (excerpt)

id      pos     gpos    p1      ihh1    p2      ihh2    xpehh
.       8173    0.008173        0.464286        0       0.882979        0       -nan
.       10802   0.010802        0.464286        0       0.861702        0       -nan
.       10809   0.010809        0.428571        0       0.861702        0       -nan
.       10833   0.010833        0.428571        0       0.882979        0       -nan
.       10834   0.010834        0.107143        0.00028772      0.297872        0.000508399     -0.247235
...

At many SNPs near the chromosome start, ihh1 = ihh2 = 0 and xpehh = -nan.
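
As a sanity check on where the -nan comes from, here is a minimal Python sketch. It assumes the raw score is log10(ihh1 / ihh2); I have not checked the selscan source, and the log10 form is only inferred from the one finite row in the excerpt (pos 10834), which it matches to within rounding. The function name `log10_ratio` is my own. With ihh1 = ihh2 = 0 the ratio is 0/0, which is undefined and comes out as NaN.

```python
import math

def log10_ratio(ihh1: float, ihh2: float) -> float:
    """log10(ihh1 / ihh2), mimicking IEEE-754 results for zero inputs.

    Assumption: this is how the raw xpehh column relates to ihh1 and ihh2;
    it is inferred from the excerpt, not taken from the selscan source.
    """
    if ihh1 == 0 and ihh2 == 0:
        return math.nan       # 0/0 is undefined
    if ihh2 == 0:
        return math.inf       # positive / 0 overflows to +inf
    if ihh1 == 0:
        return -math.inf      # log10(0) is -inf
    return math.log10(ihh1 / ihh2)

# Row at pos 10834: both iHH values are positive, so the score is finite.
score_10834 = log10_ratio(0.00028772, 0.000508399)

# Row at pos 8173: both iHH values are 0, so the score is NaN.
score_8173 = log10_ratio(0.0, 0.0)

print(score_10834, score_8173)
```

If this reading is right, the -nan rows are a direct consequence of both integrated haplotype homozygosity values being zero, and the real question is why ihh1 and ihh2 are 0 at those sites.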


Normalized output (excerpt)

id      pos     gpos    p1      ihh1    p2      ihh2    xpehh   normxpehh   crit
.       8173    0.008173        0.464286        0       0.882979        0       0   -nan    0
.       8173    0.008173        0.464286        0       0.882979        0       0   -nan    0
.       8173    0.008173        0.464286        0       0.882979        0       0   -nan    0
...
(repeated many times with the same position)

So after norm, one position (8173) is duplicated many times, xpehh becomes 0 and normxpehh is -nan for all these rows.
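
To quantify this rather than eyeball it, I used something like the sketch below. It assumes the file is whitespace-delimited with a single header line, as in the excerpts above; `summarize` is a helper name I made up, not part of selscan. It counts rows whose score is not finite and lists positions that appear more than once.

```python
import math
from collections import Counter

def summarize(lines, score_col, pos_col="pos"):
    """Return (row count, non-finite score count, {position: count} for repeats).

    Assumes a whitespace-delimited table whose first non-empty line is the header.
    """
    rows = [ln.split() for ln in lines if ln.strip()]
    header, body = rows[0], rows[1:]
    pos_i = header.index(pos_col)
    score_i = header.index(score_col)
    # float() parses "nan" and "-nan" alike, so both are counted as non-finite.
    non_finite = sum(1 for r in body if not math.isfinite(float(r[score_i])))
    counts = Counter(r[pos_i] for r in body)
    duplicated = {p: n for p, n in counts.items() if n > 1}
    return len(body), non_finite, duplicated

# The three normalized rows shown in the excerpt above.
norm_excerpt = [
    "id pos gpos p1 ihh1 p2 ihh2 xpehh normxpehh crit",
    ". 8173 0.008173 0.464286 0 0.882979 0 0 -nan 0",
    ". 8173 0.008173 0.464286 0 0.882979 0 0 -nan 0",
    ". 8173 0.008173 0.464286 0 0.882979 0 0 -nan 0",
]
n_rows, n_bad, dup_positions = summarize(norm_excerpt, score_col="normxpehh")
print(n_rows, n_bad, dup_positions)
```

The same function can be pointed at the raw output with `score_col="xpehh"` to see what fraction of sites is NaN before normalization.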


Questions

  1. Are xpehh = -nan values near chromosome edges (with ihh1 = ihh2 = 0 and truncation warnings) expected, or do they suggest a problem with my data or parameters (--max-gap, EHH cutoff, MAF, etc.)?

  2. Is it expected that norm behaves like this when the input contains many NaN values, or does this indicate a bug or misuse?

    • Under what conditions would norm output one SNP position many times with normxpehh = -nan?
    • Should I pre-filter rows with xpehh = NaN before running norm?
  3. Are there recommended parameter settings or QC steps (e.g. excluding chromosome ends, adjusting --max-gap or EHH cutoff, extra filtering for XP-EHH on unphased data) to reduce these NaN values and obtain more stable normalized scores?
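
For the pre-filtering idea in question 2, this is roughly what I have in mind: a minimal sketch that keeps the header and only the rows with a finite xpehh. It assumes the whitespace-delimited layout shown in the raw excerpt; `drop_non_finite` is a hypothetical helper of my own, and I do not know whether norm is meant to accept a file filtered this way.

```python
import math

def drop_non_finite(lines, score_col="xpehh"):
    """Keep the header line and every row whose score column is a finite number."""
    kept = []
    score_i = None
    for ln in lines:
        fields = ln.split()
        if not fields:
            continue                      # skip blank lines
        if score_i is None:
            score_i = fields.index(score_col)
            kept.append(ln)               # header is always kept
            continue
        if math.isfinite(float(fields[score_i])):
            kept.append(ln)
    return kept

# Rows copied from the raw XP-EHH excerpt above.
raw_excerpt = [
    "id pos gpos p1 ihh1 p2 ihh2 xpehh",
    ". 8173 0.008173 0.464286 0 0.882979 0 -nan",
    ". 10802 0.010802 0.464286 0 0.861702 0 -nan",
    ". 10809 0.010809 0.428571 0 0.861702 0 -nan",
    ". 10833 0.010833 0.428571 0 0.882979 0 -nan",
    ". 10834 0.010834 0.107143 0.00028772 0.297872 0.000508399 -0.247235",
]
kept = drop_non_finite(raw_excerpt)
print("\n".join(kept))
```

On the excerpt this leaves only the header and the row at position 10834. For a real file, the lines would be read from the .xpehh.out file and the result written to a new file that is then passed to norm.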

I can share a small subset of the VCFs, the map file, and the corresponding .xpehh.out / .xpehh.out.norm files if helpful.

Thank you very much for any guidance.
