Skip to content

Comments

Bug Fix: cut_point not in exon_positions#150

Open
mbowcut2 wants to merge 7 commits intomasterfrom
mckay/amino-acid-cut-point-bug
Open

Bug Fix: cut_point not in exon_positions#150
mbowcut2 wants to merge 7 commits intomasterfrom
mckay/amino-acid-cut-point-bug

Conversation

@mbowcut2
Copy link

@mbowcut2 mbowcut2 commented Feb 13, 2026

Bug Summary

When a sgRNA cut point falls before the start of the exon (cut_point < exon_positions[0]), the amino acid cut point
calculation produces a negative value:

  amino_acid_cut_point = (cut_point - refs[ref_name]['exon_positions'][0] + 1) // 3                                          
  # e.g., (97 - 100 + 1) // 3 = -2 // 3 = -1, or worse cases → -3                                                            

This negative value is then passed to get_amino_acid_row() which does row['ref_positions'].index(-3), and since
ref_positions never contains negative values, it throws ValueError: -3 is not in list.

Changes

  1. CRISPResso2/CRISPRessoCORE.py line 6121 (root cause fix): Clamp amino_acid_cut_point to valid range [0,
    len(coding_seq_amino_acids) - 1]. This handles both the case where the cut is before the exon (clamp to 0) and after the
    exon (clamp to last amino acid).
  2. CRISPResso2/CRISPRessoShared.py line 1540 (defensive fallback): Wrap row['ref_positions'].index(amino_acid_cut_point) in
    a try/except ValueError that falls back to cut_idx = 0, so any other code path that passes an out-of-range value won't crash either.

Colelyman and others added 4 commits February 11, 2026 12:06
* vcf file writing initial concept

* splitting vcf into testable functions, adding tests

* adding unit tests, adding nullcontext to vcf writing file

* fixed first insertion bug

* Mckay/base edit plot (#119)

* Cole/plot fixes (#121)

* fix setting of 99%ile in negative direction (deletions)

* need another break

* When reading CRISPRessoPooled amplicon file, only skip lines if they start with the comment character(#)

* Implement new overwrite_crispresso_options in CRISPRessoPooled to only add non-default commands when propagating

* In CRISPRessoPooled implement multiplexing for sub-runs and create a run_name that is filename-safe and separate from a display name.

* Add arg for display_name

* Change display name name to Display Name

* CRISPRessoWGS fix unpickleable partial error

* Move complete message to after crispresso cup to allow for json parsing of status.json

* Update CRISPRessoWGS to allow for run names

* Messages to users about counting reads in input (sometimes this takes a while for large samples)

* Cast guardrail values as ints because division casts them as floats

* Remove invalid escape of _ when writing to JSON status file

* Fix documentatio for `CRISPRessoShared.check_if_failed` and remove extraneous whitespace

* Point to updated test branch

* Fix Cython bug

* Read version from toml file

* Fix import error

* update version

* Update integration_tests.yml

Remove Native Merge test temporarily

* Pin fastp version

* Run native merge test and point tests to master

---------

Co-authored-by: kclem <[email protected]>
Co-authored-by: Cole Lyman <[email protected]>

* Additional plot fixes (#122)

* fix setting of 99%ile in negative direction (deletions)

* need another break

* When reading CRISPRessoPooled amplicon file, only skip lines if they start with the comment character(#)

* Implement new overwrite_crispresso_options in CRISPRessoPooled to only add non-default commands when propagating

* In CRISPRessoPooled implement multiplexing for sub-runs and create a run_name that is filename-safe and separate from a display name.

* Add arg for display_name

* Change display name name to Display Name

* CRISPRessoWGS fix unpickleable partial error

* Move complete message to after crispresso cup to allow for json parsing of status.json

* Update CRISPRessoWGS to allow for run names

* Messages to users about counting reads in input (sometimes this takes a while for large samples)

* Cast guardrail values as ints because division casts them as floats

* Remove invalid escape of _ when writing to JSON status file

* Fix documentatio for `CRISPRessoShared.check_if_failed` and remove extraneous whitespace

* Point to updated test branch

* Fix Cython bug

* Read version from toml file

* Move printing header info to after console log level setting in main function

* Add import for importlib.metadata to CRISPRessoShared.py

* Standardize intermediate file names for CRISPRessoPooled info and fastq files

* CRISPRessoPooled avoid gzipping nonexistant files

* Remove warning if no config file

* Allow none for custom_colors in CRISPRessoPlot

* read version from toml file

* Make fig_filename_root default to None, in which case the figure is shown interactively - e.g. in a jupyter notebook

* Print tool description after logging level has been set

* update testRelease.sh script

* Don't rerun if the 'verbosity' value has changed.

* CRISPResso core will not rerun if there are changes in the 'debug', 'n_processes', and 'verbosity' arguments

* Verbosity levels <=1 are set to 1, >= 4 are set to 4.

* update for future pd.read_json deprecation

* Fix import error

* update version

* Update integration_tests.yml

Remove Native Merge test temporarily

* Pin fastp version

* Run native merge test and point tests to master

* Point tests to cole/plot_fixes branch

* Point tests back to master

---------

Co-authored-by: kclem <[email protected]>
Co-authored-by: Sam <[email protected]>

* - added Kendall's code
- added args
- added fig 10i to report
- added tests for data prep

* added upsetplot to Dockerfile

* added save_png to plot args

* removed unused args

* remove unused variable

* cole's comments

* removed unused function

* comments

* comments

* updated test_be_df with UNMODIFIED row

* remove comment

* removed extra args
fixed name

* Update help string for `--base_editor_consider_changes_outside_qw`

* Point tests back to master

---------

Co-authored-by: Samuel Nichols <[email protected]>
Co-authored-by: kclem <[email protected]>
Co-authored-by: Cole Lyman <[email protected]>

* further vcf tweaks/work

* vcf dynamic alt map and vcf line generation with unit testing

* updating tests

* Add test case for second element deletion

* Remove unnecessary imports and add whitspace

* Add mini integration tests for alt_map and vcf_line

* Update unit tests to account for bug in find_indels_substitutions

* Add failing test case for insertion and deletion occurring at the same position

The alt map incorrectly separate the deletion and insertion to different
positions.

* Cast dict_keys to be a list

* Fix test case to reflect correct position of deletion

* Add checks to the VCF file

* Add upsetplot as a dependency in setup.py (#130)

* Fix position of deletion in test case

* Fix off by one error for deletions, update tests and fix deletion at start

It turns out that when there is a deletion at the start of the sequence, the
correct way to handle it is to provide the last base after the deletion. Source:
https://bioinformatics.stackexchange.com/questions/2476/how-to-represent-a-deletion-at-position-1-in-a-vcf-file

* Update tests to convince myself that insertion and deletion handling is working

* Remove duplicate writing of allele frequency table

* Refactor `get_allele_row` and make vcf_output not dependent on write_detailed_allele_table

* Extract out construction of df_alleles in unit tests

* Write test to illustrate bug when you have an insertion then a deletion

* Refactor unit tests to use create_df_alleles

* Fix test_build_alt_map_mixed test

* Refactor create_df_alleles to support number of reads

This fixes test_build_alt_map_substitutions

* Fix test_build_alt_map_insertions test

* Allow amplicon name to be passed to create_df_alleles

* Fix test_upsert_edit_del_and_ins

* Fix deletion at start of amplicon

* Fix bug when there is a deletion starting at the second position

This bug only happens when a deletion starts are the second position, before the
fix, it would report that the deletion started at the first position. It is
fixed now, so deletions at the second position are reported correctly.

* Fix test to reflect full deletion

* Add tests for find_indels_substitutions for deletions at the end

* Fix 1bp deletions at the end, and off by one error

This ensures that when a deletion occurs at the end of a read, the entire
deletion is accounted for.

* Fix for representing deletions at the end of a sequence

* Fix bug where deletions that extend to the end of a sequence fail

* Remove qwc_indexes

* Properly implement handling of deletions that start at the beginning of a sequence

* Properly account for multiple delete_start events

* Update test to reflect proper ref_positions and other attributes

* Fix QWC inference across amplicons (#137)

* Mckay/be plot improvements (#136)

* trying to get the figure to fit nicely, increased
element size to 100

* custom figsize to display without cutting off

increased figsize in report template

* Allow messages to be served in CLI reports (#134) (pinellolab#583)

* Fix deletion at second position (#131)

* Fix bug when there is a deletion starting at the second position

This bug only happens when a deletion starts are the second position, before the
fix, it would report that the deletion started at the first position. It is
fixed now, so deletions at the second position are reported correctly.

* Update CRISPRessoCOREResources.c due to change in .pyx

* Add tests for find_indels_substitutions for deletions at the end

* Fix 1bp deletions at the end, and off by one error

This ensures that when a deletion occurs at the end of a read, the entire
deletion is accounted for.

* Update CRISPRessoCOREResources.c to reflect fixes for deletions at the end of alignments

* Add extra asserts to deletion checks

* Point to new test branch

* Reafctor deletion_coordinates to go past the end of the string for deletions at
the end of the sequence

* Point tests to master

* Allow messages to be served in CLI reports

* Point to cole/messages test branch

* Point tests back to master

* point to tests branch

* typo

* testing github actions

* remove test

* point tests to master

---------

Co-authored-by: Cole Lyman <[email protected]>

* Update inferred QWC tests to reflect correct intended behavior

* Fix inferring QWC to match intended behavior

* Add more test cases and fix bug discovered in single bp QWC

* Add even more test cases testing indels outside the QWC

* Point tests to cole/fix-qwc-deletion

* update plotly.js (#138)

* Change order of amplicon inference alignment so that 1st amplicon is the reference

This makes a difference because it changes the values of `s1inds`, and therefore
the value of the inferred quantification window coordinates.

* Point integration tests back to master

* Update CHANGELOG.md

---------

Co-authored-by: mbowcut2 <[email protected]>

* Fix a bug for `--bam_output` when there are unaligned reads (#144)

* Fix BAM output when there are unaligned reads

* Point tests to cole/unaligned-reads

* Update CHANGELOG.md

* Fix typo in tests branch name

* Point tests back to master

* Add VCF testing and verification design document

Documents the strategy for integration testing the --vcf_output feature
using golden file comparison with syn-gen synthetic data.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* Change name of output VCF file

* Point tests to vcf-parameters

* Add writers module refactor design document

Documents plan to move VCF output code from CRISPRessoUtilities.py
to a new writers/ module structure.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* Refactor VCF code into writers/ module

Move CRISPRessoUtilities.py to writers/vcf.py to establish a pattern
for organizing output writers. Update imports in CRISPRessoCORE.py
and relocate tests to test_writers/test_vcf/.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* Remove the FORMAT column from VCF and add the contig length to VCF headers

* Initial VCF simplication writing each edit on its own line

* Further simplification of VCF output code

* Left align deletions for VCF output

* Left-normalize insertions for VCF output

* Update CLAUDE.md with design_docs

* Update Github Actions with Pooled Prime Editing and VCF basic and VCF Prime Edit Basic

* Update integration tests to run in parallel

* Add VCF path to crispresso2_info

* Point the tests back to cole/vcf-parameters

* Always save the conda env cache in Github Actions

* Debug: add set +e and per-target error reporting in integration tests

setup-miniconda@v3 injects 'set -eo pipefail' into ~/.profile, which
is sourced by bash -l. This causes make commands to fail silently.
Adding set +e at script start and explicit per-target error annotations.

* Attempt to fix conda cache in GitHub Actions

* Update CHANGELOG.md and parameter descriptions

* Remove unused nullcontext import

* Remove extra blank lines after amplicon_coordinates validation

* Document that --vcf_output implies --write_detailed_allele_table

* Use max amplicon length for VCF contig headers and warn on collision

When multiple amplicons map to the same chromosome, the contig header
now uses the max amplicon length instead of silently overwriting with
the last one seen. A warning is logged when this occurs.

* Fix VCF POS off-by-one for start-of-amplicon deletions when pos > 1

For deletions starting at position 0 of the amplicon, the VCF POS was
computed as pos-1 instead of pos. The max(1,...) clamp accidentally
masked this when pos=1 (all existing tests). With pos>1, VCF POS
pointed one base before the amplicon.

Fix: when start==0, set left_index=pos directly instead of
pos+start-1.

* Fix multi-insertion bug: use alignment index to extract inserted bases

_edits_from_insertions treated the second element of insertion_coordinates
as an alignment index, but it is actually a reference position. These
coincide for single-insertion reads but diverge when prior insertions
shift alignment indices.

Fix: convert right_anchor_ref_pos to an alignment index via
ref_positions.index(), then extract the size characters before it.

Also:
- Rename misleading variables (right_anchor_ref_pos was actually left
  anchor, aligned_start was actually right anchor ref pos)
- Add comment on defensive ref_len branch explaining it is currently
  unreachable
- Update LEFT_NORMALIZATION.md to correctly describe the
  insertion_coordinates format

* Add clarifying example to help text of --amplicon_coordinates

* Remove planning documents

* Point tests back to master

* Update PR number in CHANGELOG.md

* Cache the conda environment for pytest as well

---------

Co-authored-by: Trevor Martin <[email protected]>
Co-authored-by: mbowcut2 <[email protected]>
Co-authored-by: Samuel Nichols <[email protected]>
Co-authored-by: kclem <[email protected]>
Co-authored-by: Claude Opus 4.5 <[email protected]>
if not in list, set to 0
@mbowcut2
Copy link
Author

another round of changes:
Summary of Changes

CRISPResso2/CRISPRessoCORE.py (Steps 1, 2, 3)

Problem: The amino acid table code (Figure 9a) was outside the sgRNA loop but used stale cut_point, plot_cut_point, and new_sgRNA_intervals variables from the last sgRNA iteration.
When the last sgRNA's cut point was far from the coding region, the amino_acid_cut_point was wildly out of range.

Fix:

  1. Best sgRNA selection — For each coding sequence, finds the sgRNA whose cut point is closest to that coding sequence's exon interval (distance=0 if inside, otherwise distance to
    nearest edge). Falls back to exon midpoint if no guides exist.
  2. Per-coding-seq exon start — Changed the formula from (cut_point - refs[ref_name]['exon_positions'][0] + 1) // 3 to (best_cut_point - exon_start + 1) // 3, using the specific exon
    interval start rather than the merged exon union start.
  3. Removed unused params — Removed sgRNA_intervals, sgRNA_names, sgRNA_mismatches from the plot dict (they were passed through but never used by plot_amino_acid_heatmap). Used
    best_plot_cut_point instead of the stale plot_cut_point.

CRISPResso2/CRISPRessoShared.py (Step 4)

Problem: The except ValueError: cut_idx = 0 fallback was crude — jumping to position 0 when the cut point position is deleted in an allele.

Fix: When amino_acid_cut_point isn't found in ref_positions, find the closest valid position instead of defaulting to 0. Also narrowed the bare except: on gap_incentive[cut_idx] to
except IndexError:.

CRISPResso2_tests expected results (Step 5)

Updated 4 amino acid table files + 2 HTML reports. The amino acid tables revert from the HEAD~1 interim values (which used amino_acid_cut_point=16 from the clamp-to-last-sgRNA
approach) back to values matching the correct sgRNA selection (amino_acid_cut_point=12, using the primary guide that's inside the exon). The HTML changes add new parameters
(amplicon_coordinates, vcf_output) from the VCF output feature.

Verification (Step 6)

Both make params test and make base_editor test pass cleanly.

@mbowcut2 mbowcut2 requested a review from Colelyman February 13, 2026 21:29
Copy link
Member

@Colelyman Colelyman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this PR is on the right track, but needs some refinement.

(And quick clarification in the description, ref_positions does have negative values, but it is when there is an insertion that isn't represented in the original sequence. Not super relevant to the content or design of the PR, this still needs to be fixed.)

mbowcut2 and others added 3 commits February 17, 2026 10:37
* Fix Docker entrypoint

* Attempt to fix Circle CI pip install

* Run each CRISPResso command in the conda environment

* Update `--base_editor_output` parameter name for Circle CI

* Update tests Makefile and batch expected output

* Fix QWC inference across amplicons (#137)

* Mckay/be plot improvements (#136)

* trying to get the figure to fit nicely, increased
element size to 100

* custom figsize to display without cutting off

increased figsize in report template

* Allow messages to be served in CLI reports (#134) (pinellolab#583)

* Fix deletion at second position (#131)

* Fix bug when there is a deletion starting at the second position

This bug only happens when a deletion starts are the second position, before the
fix, it would report that the deletion started at the first position. It is
fixed now, so deletions at the second position are reported correctly.

* Update CRISPRessoCOREResources.c due to change in .pyx

* Add tests for find_indels_substitutions for deletions at the end

* Fix 1bp deletions at the end, and off by one error

This ensures that when a deletion occurs at the end of a read, the entire
deletion is accounted for.

* Update CRISPRessoCOREResources.c to reflect fixes for deletions at the end of alignments

* Add extra asserts to deletion checks

* Point to new test branch

* Reafctor deletion_coordinates to go past the end of the string for deletions at
the end of the sequence

* Point tests to master

* Allow messages to be served in CLI reports

* Point to cole/messages test branch

* Point tests back to master

* point to tests branch

* typo

* testing github actions

* remove test

* point tests to master

---------

Co-authored-by: Cole Lyman <[email protected]>

* Update inferred QWC tests to reflect correct intended behavior

* Fix inferring QWC to match intended behavior

* Add more test cases and fix bug discovered in single bp QWC

* Add even more test cases testing indels outside the QWC

* Point tests to cole/fix-qwc-deletion

* update plotly.js (#138)

* Change order of amplicon inference alignment so that 1st amplicon is the reference

This makes a difference because it changes the values of `s1inds`, and therefore
the value of the inferred quantification window coordinates.

* Point integration tests back to master

* Update CHANGELOG.md

---------

Co-authored-by: mbowcut2 <[email protected]>

* Update setup.py and pyproject.toml to find all modules and packages

* Convert pyproject.toml to Unix file endings

---------

Co-authored-by: mbowcut2 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants