Improvements to saving confidence, quality estimation, and sample usability #900

mshannon-sil · 2025-12-17T19:19:29Z

This PR addresses a number of requested improvements across features related to confidence, quality estimation, and sampling usability:

Adds a script to draw a representative sample from verse usabilities
Adds quality-estimation as a flag for experiment.py and translate.py
Adds labels to quality estimation scores
Fixes the index when saving confidence from .txt files
Saves verse level confidence separate from verse/token confidence file for readability and easier parsing
Ensures quality_estimation and sample_usability use canonical ordering
Saves projected chrf3 side by side with usability in the usability files
Automatically saves verse-level scores in test.py
Changes quality estimation to use the new verse-level test scores rather than diff_predictions
Other minor changes
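
For the canonical-ordering item above, here is a minimal sketch of what sorting references by canonical book order can look like. The `CANON` list is a hypothetical, truncated stand-in and `canonical_key` is an illustrative name; the PR description does not say which book table or helper the scripts actually use.

```python
from typing import Dict, List, Tuple

# Hypothetical, truncated canon for illustration only.
CANON: List[str] = ["GEN", "EXO", "LEV", "MAT", "MRK", "LUK", "JHN"]
BOOK_INDEX: Dict[str, int] = {book: i for i, book in enumerate(CANON)}


def canonical_key(ref: Tuple[str, int, int]) -> Tuple[int, int, int]:
    """Sort key for (book, chapter, verse): canon position first, then numeric chapter and verse."""
    book, chapter, verse = ref
    return (BOOK_INDEX[book], chapter, verse)


refs = [("MRK", 1, 2), ("GEN", 10, 1), ("MAT", 5, 3), ("GEN", 2, 4)]
ordered = sorted(refs, key=canonical_key)
```

Sorting on the canon index rather than the book code keeps MAT ahead of MRK for the right reason, and comparing chapters as integers keeps chapter 2 ahead of chapter 10.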

benjaminking

@benjaminking reviewed 6 files and all commit messages, and made 8 comments.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @ddaspit and @mshannon-sil).

silnlp/common/sample_usability.py line 128 at r1 (raw file):

    )
    parser.add_argument(
        "--random-state",

Maybe rename this "random-seed" for consistency with other scripts?

silnlp/nmt/quality_estimation.py line 44 at r1 (raw file):

@dataclass
class BookScores:

Nice use of inheritance and composition here. It really helps to make the code clearer.

silnlp/nmt/quality_estimation.py line 76 at r1 (raw file):

        book = file_scores[0].vref.book if file_scores else None
        verse_scores += file_scores
        if confidence_file.with_suffix(".chapters.tsv").is_file():

It would probably require some refactoring (maybe a class wrapping confidence_files?), but I'd like it if there were a way to keep the logic about what the various confidence files are called all in one place, so we wouldn't have to update in multiple places if it changes.

silnlp/nmt/quality_estimation.py line 127 at r1 (raw file):

def get_verse_scores(input_file_path: Path, slope: float, intercept: float) -> List[VerseScore]:

If I'm correct, input_file_path is pointing to a confidences file? Changing the name to reflect that might be helpful for understanding what's going on (since there are so many different files involved here)

silnlp/nmt/translate.py line 129 at r1 (raw file):

                        tags,
                    )
                    if save_confidences:

I wonder if we should just always be saving confidence files? Do you see any downsides of that? Now that we're having to pass confidence files around in the code, it would help to simplify things.

silnlp/nmt/experiment.py line 162 at r1 (raw file):

                raise RuntimeError("A Scripture book, file, or file prefix must be specified for translation.")

        # Run quality estimation once after all translations complete

I wonder if we ought to be doing quality estimation in each of the translate functions instead (probably calling a function to do that)? We could just pass along the quality_estimation parameter and then don't have to return the list of confidence files.

silnlp/nmt/test.py line 435 at r1 (raw file):

def write_sentence_bleu(

What was the impetus for getting rid of SentenceBleu?

silnlp/common/translator.py line 148 at r1 (raw file):

    def get_all_sequence_confidence_scores(self) -> List[float]:
        return [
            scs for scs in [t.get_sequence_confidence_score() for t in self._sentence_translations] if scs is not None

Was taking the exponential incorrect here?

mshannon-sil

@mshannon-sil made 7 comments.
Reviewable status: 0 of 8 files reviewed, 7 unresolved discussions (waiting on @benjaminking and @ddaspit).

silnlp/common/sample_usability.py line 128 at r1 (raw file):

Previously, benjaminking (Ben King) wrote…

Maybe rename this "random-seed" for consistency with other scripts?

Done.

silnlp/common/translator.py line 148 at r1 (raw file):

Previously, benjaminking (Ben King) wrote…

Was taking the exponential incorrect here?

No I just changed how get_sequence_confidence_score() works so that it returns the exponentiated version, so I removed the exp() call here to compensate. It seemed more apt since what we've been calling confidence scores is the exponentiated version, and it also makes the code cleaner since we don't have to remember to call exp() every time we get the score.
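
A minimal sketch of the change described here, assuming the stored sequence score is a log-probability. `SentenceTranslation` and its `sequence_log_prob` field are hypothetical stand-ins; only the two method names come from the code under review.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SentenceTranslation:
    # Hypothetical stand-in: the sequence score is assumed to be stored in log space.
    sequence_log_prob: Optional[float]

    def get_sequence_confidence_score(self) -> Optional[float]:
        # exp() is applied once, here, so callers never handle the raw log score.
        if self.sequence_log_prob is None:
            return None
        return math.exp(self.sequence_log_prob)


def get_all_sequence_confidence_scores(translations: List[SentenceTranslation]) -> List[float]:
    # No exp() at the call site any more; just drop the missing scores.
    scores = [t.get_sequence_confidence_score() for t in translations]
    return [s for s in scores if s is not None]


translations = [
    SentenceTranslation(0.0),
    SentenceTranslation(None),
    SentenceTranslation(math.log(0.5)),
]
confidences = get_all_sequence_confidence_scores(translations)
```

Keeping the exponentiation inside the getter means there is a single place to change if the definition of the confidence score is ever revised.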

silnlp/nmt/experiment.py line 162 at r1 (raw file):

Previously, benjaminking (Ben King) wrote…

I wonder if we ought to be doing quality estimation in each of the translate functions instead (probably calling a function to do that)? We could just pass along the quality_estimation parameter and then don't have to return the list of confidence files.

Done.

silnlp/nmt/quality_estimation.py line 76 at r1 (raw file):

Previously, benjaminking (Ben King) wrote…

It would probably require some refactoring (maybe a class wrapping confidence_files?), but I'd like it if there were a way to keep the logic about what the various confidence files are called all in one place, so we wouldn't have to update in multiple places if it changes.

Done.

silnlp/nmt/quality_estimation.py line 127 at r1 (raw file):

Previously, benjaminking (Ben King) wrote…

If I'm correct, input_file_path is pointing to a confidences file? Changing the name to reflect that might be helpful for understanding what's going on (since there are so many different files involved here)

Done.

silnlp/nmt/test.py line 435 at r1 (raw file):

Previously, benjaminking (Ben King) wrote…

What was the impetus for getting rid of SentenceBleu?

I used the sacrebleu version in write_pair_verse_scores, because the comment explaining the rationale for including our own custom implementation says: "Substitute for the sacrebleu version of sentence_bleu, which uses settings that aren't consistent with the values we use for corpus_bleu, and isn't fully parameterized." However, I believe I was able to pass all the values we use for corpus_bleu into the sacrebleu sentence_bleu, or take advantage of default values that already match, so it seems fully parameterized to me. If there's no difference, it's better to just use sacrebleu's version.

Let me double check though. @ddaspit, do you still see a need for our own version of sentence_bleu? If not, I can remove it in the other places it's used, since I haven't fully removed it yet.

silnlp/nmt/translate.py line 129 at r1 (raw file):

Previously, benjaminking (Ben King) wrote…

I wonder if we should just always be saving confidence files? Do you see any downsides of that? Now that we're having to pass confidence files around in the code, it would help to simplify things.

Based on our discussion with other members, it sounds like we're in favor of always saving confidence, especially to avoid accidentally forgetting to set the flag. It would clutter the infer directory up a bit, but the storage impact would be minimal.

However, there's still an issue with confidence related to multiple translations, and I'd like to get this PR merged first to get the current features into master as soon as possible. So I think it's best to hold off on removing the flag until the next PR, which will address the multiple translations issue.

mshannon-sil · 2026-01-21T16:10:47Z

silnlp/nmt/test.py line 435 at r1 (raw file):

Previously, mshannon-sil wrote…

I used the sacrebleu version in write_pair_verse_scores, because the comment explaining the rationale for including our own custom implementation says: "Substitute for the sacrebleu version of sentence_bleu, which uses settings that aren't consistent with the values we use for corpus_bleu, and isn't fully parameterized." However, I believe I was able to pass all the values we use for corpus_bleu into the sacrebleu sentence_bleu, or take advantage of default values that already match, so it seems fully parameterized to me. If there's no difference, it's better to just use sacrebleu's version.

Let me double check though. @ddaspit, do you still see a need for our own version of sentence_bleu? If not, I can remove it in the other places it's used, since I haven't fully removed it yet.

From Damien: "I believe that prior to sacrebleu 2.0 there were a couple of options that we could not set for sentence_bleu (specifically lowercase and tokenizer). That has changed now, so we can probably just use sentence_bleu as-is."

So I'll go ahead and change test.py to always use the sacrebleu version and remove the custom implementation.

mshannon-sil · 2026-01-21T16:53:12Z

silnlp/nmt/test.py line 435 at r1 (raw file):

Previously, mshannon-sil wrote…

From Damien: "I believe that prior to sacrebleu 2.0 there were a couple of options that we could not set for sentence_bleu (specifically lowercase and tokenizer). That has changed now, so we can probably just use sentence_bleu as-is."

So I'll go ahead and change test.py to always use the sacrebleu version and remove the custom implementation.

Just removed the custom implementation. Also, if your question had to do with removing sentencebleu from the list of scorers, that's because verse level scores are now always being calculated, so sentencebleu will be run for the verse scores whenever bleu is requested as a scorer.

benjaminking

Good work. Just a couple small things before we should be able to get this in.

@benjaminking reviewed 8 files and all commit messages, made 7 comments, and resolved 6 discussions.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @ddaspit and @mshannon-sil).

silnlp/nmt/quality_estimation.py line 76 at r1 (raw file):

Previously, mshannon-sil wrote…

Done.

Could you have confidence_file here be an instance of ConfidenceFile so that you can call get_chapter_path() instead of specifying the suffix? It might also be nice if ConfidenceFile had some sort of iterator method that could iterate over books/chapters/verses and their confidence scores, so that other methods like this wouldn't need to know the format of the file.
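
A minimal sketch of the kind of wrapper being asked for, so that file naming lives in one class. The method names `get_path()`, `get_chapter_path()` and `get_book_path()` appear in this thread and the `.chapters.tsv` suffix comes from the snippet under review; the `.books.tsv` suffix and the example file name are assumptions, and the real class in silnlp/common/translator.py may differ.

```python
from pathlib import Path


class ConfidenceFile:
    """Hypothetical sketch: the one place that knows how confidence files are named."""

    def __init__(self, path: Path):
        self._path = path

    def get_path(self) -> Path:
        return self._path

    def get_chapter_path(self) -> Path:
        # ".chapters.tsv" matches the suffix used in the snippet under review.
        return self._path.with_suffix(".chapters.tsv")

    def get_book_path(self) -> Path:
        # ".books.tsv" is an assumed name, chosen by analogy with the chapter file.
        return self._path.with_suffix(".books.tsv")


# Hypothetical file name, for illustration only.
confidence_file = ConfidenceFile(Path("infer/41MAT.SFM.confidences.tsv"))
chapter_path = confidence_file.get_chapter_path()
```

With this in place, a caller would write `confidence_file.get_chapter_path().is_file()` instead of spelling out the suffix itself.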

silnlp/nmt/experiment.py line 223 at r2 (raw file):

    )
    parser.add_argument(
        "--test-data-file",

Maybe we could call this --test-data-scores-file? I was confused about what exactly we were passing around until I got to this description below.

silnlp/common/translator.py line 177 at r2 (raw file):

class ConfidenceFile:

    def __init__(self, path: Path, trg_file_path: Optional[Path] = None):

It looks like you only ever use the from_trg_path method to instantiate this class. Is there a way to get rid of the trg_file_path parameter in the constructor, since it should be predictable from path?

silnlp/nmt/translate.py line 66 at r2 (raw file):

        postprocess_handler: PostprocessHandler = PostprocessHandler(),
        tags: Optional[List[str]] = None,
    ) -> List[Path]:

It doesn't look like this method returns anything.

silnlp/nmt/translate.py line 157 at r2 (raw file):

        test_data_path: Optional[Path] = None,
        tags: Optional[List[str]] = None,
    ) -> List[Path]:

Same issue with this function.

silnlp/nmt/translate.py line 219 at r2 (raw file):

        postprocess_handler: PostprocessHandler = PostprocessHandler(),
        tags: Optional[List[str]] = None,
    ) -> List[Path]:

Same issue here as well.

mshannon-sil · 2026-01-21T23:57:40Z

silnlp/nmt/experiment.py line 223 at r2 (raw file):

Previously, benjaminking (Ben King) wrote…

Maybe we could call this --test-data-scores-file? I was confused about what exactly we were passing around until I got to this description below.

What about --verse-test-scores-file? That would distinguish it from the aggregate/book-level scores-(ckpt).csv file that test.py already outputs.

benjaminking

@benjaminking made 1 comment and resolved 1 discussion.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @ddaspit and @mshannon-sil).

silnlp/nmt/experiment.py line 223 at r2 (raw file):

Previously, mshannon-sil wrote…

What about --verse-test-scores-file? That would distinguish it from the aggregate/book-level scores-(ckpt).csv file that test.py already outputs.

Yes, that sounds good to me.
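
For reference, a minimal argparse sketch of the agreed flag name. The `type`, `default` and help text are assumptions; only the flag name `--verse-test-scores-file` comes from this thread.

```python
import argparse
from pathlib import Path

parser = argparse.ArgumentParser(description="Sketch of the renamed option")
parser.add_argument(
    "--verse-test-scores-file",
    type=Path,
    default=None,
    # Hypothetical help text; the wording in experiment.py may differ.
    help="Path to the verse-level test scores file.",
)

args = parser.parse_args(["--verse-test-scores-file", "scores/verse_scores.tsv"])
# argparse maps dashes in the flag to underscores in the attribute name.
verse_scores_path = args.verse_test_scores_file
```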

mshannon-sil

@mshannon-sil made 5 comments.
Reviewable status: 3 of 8 files reviewed, 5 unresolved discussions (waiting on @benjaminking and @ddaspit).

silnlp/common/translator.py line 177 at r2 (raw file):

Previously, benjaminking (Ben King) wrote…

It looks like you only ever use the from_trg_path method to instantiate this class. Is there a way to get rid of the trg_file_path parameter in the constructor, since it should be predictable from path?

I use the init method to instantiate the class in quality_estimation.py at the end of the main() method. But yeah, I think I did overengineer this, so I simplified it to just using the init method and deriving the trg_draft_file_path (renamed to be more accurate & consistent w/the repo) from path.

silnlp/nmt/quality_estimation.py line 76 at r1 (raw file):

Previously, benjaminking (Ben King) wrote…

Could you have confidence_file here be an instance of ConfidenceFile so that you can call get_chapter_path() instead of specifying the suffix? It might also be nice if ConfidenceFile had some sort of iterator method that could iterate over books/chapters/verses and their confidence scores, so that other methods like this wouldn't need to know the format of the file.

Yeah sorry, looks like I didn't integrate it in quality_estimation.py as fully as I thought. I fixed it to use get_chapter_path() and get_book_path(). I think I'll postpone using an iterator method for another time as I'm not 100% clear on its purpose and I'd like to get these core changes into master soon.

silnlp/nmt/translate.py line 66 at r2 (raw file):

Previously, benjaminking (Ben King) wrote…

It doesn't look like this method returns anything.

Done. This was leftover from when quality estimation was happening outside the method. I also caught a bug when I was looking at this. In the call to glob() at the end that looks for confidence files, I changed it to trg_file_path.stem rather than trg_file_path.name so that it correctly picks up drafts that get their suffix changed in postprocessing/multiple translations.
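
A small sketch of why the stem matters in that glob. The file names and the glob pattern are hypothetical; the actual pattern in translate.py is not shown in this thread.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    infer_dir = Path(tmp)
    trg_file_path = infer_dir / "41MAT.SFM"  # hypothetical draft name

    # Hypothetical confidence files: one for the original draft and one for a
    # postprocessed draft whose suffix was changed.
    for file_name in ["41MAT.SFM.confidences.tsv", "41MAT.usfm.confidences.tsv"]:
        (infer_dir / file_name).touch()

    # .name is "41MAT.SFM", so the pattern only matches the original suffix.
    by_name = sorted(p.name for p in infer_dir.glob(f"{trg_file_path.name}*.confidences.tsv"))
    # .stem is "41MAT", so drafts with a changed suffix are matched as well.
    by_stem = sorted(p.name for p in infer_dir.glob(f"{trg_file_path.stem}*.confidences.tsv"))
```

Here `by_name` finds only the file for the original draft, while `by_stem` finds both.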

silnlp/nmt/translate.py line 157 at r2 (raw file):

Previously, benjaminking (Ben King) wrote…

Same issue with this function.

Done.

silnlp/nmt/translate.py line 219 at r2 (raw file):

Previously, benjaminking (Ben King) wrote…

Same issue here as well.

Done.

benjaminking

Once you get it merged, I can create an issue to track the bit that we postponed.

@benjaminking reviewed 5 files and all commit messages, made 2 comments, and resolved 5 discussions.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @ddaspit).

silnlp/nmt/quality_estimation.py line 76 at r1 (raw file):

Previously, mshannon-sil wrote…

Yeah sorry, looks like I didn't integrate it in quality_estimation.py as fully as I thought. I fixed it to use get_chapter_path() and get_book_path(). I think I'll postpone using an iterator method for another time as I'm not 100% clear on its purpose and I'd like to get these core changes into master soon.

That's ok. Maybe we can create an issue to track this once it's merged. The idea is that we'd like to have all the logic related to reading the confidence files (in whatever format they're in) in one place, ideally encapsulated in a class. That way, if the format ever needs to change, we don't need to track down all the places that read the confidence files and risk missing one of them.

The iterator would be a way to get access to the contents of the confidence files without needing to know the format. You could write something like for chapter, confidence in confidence_file.chapter_score_iterator, and project those scores, with the ConfidenceFile class taking care of all the file IO details.

A different way to approach the problem that would still preserve encapsulation would be to have the ConfidenceFile class create the projected chrF3 scores, e.g. confidence_file.project_chrf3_scores(slope, intercept).
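
Both options could be sketched roughly as follows. This is a hypothetical illustration, not the class in the PR: the tab-separated two-column layout is an assumed format, a text stream stands in for the file on disk, `chapter_score_iterator` is written as a method, and the projection is assumed to be linear because it takes a slope and an intercept.

```python
import csv
import io
from typing import Iterator, TextIO, Tuple


class ConfidenceFile:
    """Hypothetical sketch; the real class reads from a path and may differ."""

    def __init__(self, stream: TextIO):
        # A text stream stands in for the file so the sketch is self-contained.
        self._stream = stream

    def chapter_score_iterator(self) -> Iterator[Tuple[int, float]]:
        # Assumed format: tab-separated, one header row, then (chapter, confidence) rows.
        self._stream.seek(0)
        reader = csv.reader(self._stream, delimiter="\t")
        next(reader, None)  # skip the header row if there is one
        for chapter, confidence in reader:
            yield int(chapter), float(confidence)

    def project_chrf3_scores(self, slope: float, intercept: float) -> Iterator[Tuple[int, float]]:
        # Assumed linear projection: projected chrF3 = slope * confidence + intercept.
        for chapter, confidence in self.chapter_score_iterator():
            yield chapter, slope * confidence + intercept


confidence_file = ConfidenceFile(io.StringIO("Chapter\tConfidence\n1\t0.5\n2\t0.25\n"))
projected = list(confidence_file.project_chrf3_scores(slope=100.0, intercept=10.0))
```

Either way, callers never parse the file themselves, so a format change only touches this class.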

mshannon-sil · 2026-01-22T19:59:54Z

silnlp/nmt/quality_estimation.py line 76 at r1 (raw file):

Previously, benjaminking (Ben King) wrote…

That's ok. Maybe we can create an issue to track this once it's merged. The idea is that we'd like to have all the logic related to reading the confidence files (in whatever format they're in) in one place, ideally encapsulated in a class. That way, if the format ever needs to change, we don't need to track down all the places that read the confidence files and risk missing one of them.

The iterator would be a way to get access to the contents of the confidence files without needing to know the format. You could write something like for chapter, confidence in confidence_file.chapter_score_iterator, and project those scores, with the ConfidenceFile class taking care of all the file IO details.

A different way to approach the problem that would still preserve encapsulation would be to have the ConfidenceFile class create the projected chrF3 scores, e.g. confidence_file.project_chrf3_scores(slope, intercept).

Oh I see what you mean now! Yes let's track that in an issue and make that part of the next PR.

mshannon-sil added 17 commits November 30, 2025 20:55

add sample_usability.py; fix import for quality_estimation.py

506dd7d

clarify help text for usability_verses_file

11a1c15

add project_usability to experiment.py and translate.py

ccb97c4

add labels

f277cd6

clarify diff_predictions argument

e43a22f

further clarity on diff predictions argument

586c659

canonical order for usability sample

9623d76

fix windows incompatibility issue

e93e7ee

improve handling of book argument

078ac59

make sequence label for confidence files w/o scriptures refs use 1-based indexing to reflect row number in file

ea26fc2

windows fix #2

6d6973b

windows fix #3

7c570f6

refine confidence files

71a6918

use canonical book order

1cac343

save projected_chrf3 to usability file

7d4d514

initial implementation of verse level scores in test.py

042dcb6

have quality estimation use the new test file for extracting confidence and chrf3

ef75f30

mshannon-sil requested review from benjaminking and ddaspit December 17, 2025 19:19

mshannon-sil self-assigned this Dec 17, 2025

benjaminking requested changes Dec 19, 2025

adjust thresholds

b8a5529

mshannon-sil added 6 commits January 20, 2026 21:37

remove unnecessary imports

f49606b

use sentence versions of metrics for verse scores

5c855c0

remove non-sentence scorers from verse scores

3416248

call lower() when checking against scorers list

2190897

refine ValueError text

3a1ec76

capitalize confidence in header of verse scores file

3565ca6

mshannon-sil commented Jan 21, 2026

mshannon-sil requested a review from benjaminking January 21, 2026 15:49

mshannon-sil linked an issue Jan 21, 2026 that may be closed by this pull request

scores.tsv file header row does not match file contents after adding m-metrics to silnlp #910

Closed

replace custom sentence_bleu with sacrebleu version

160c7b8

benjaminking requested changes Jan 21, 2026

benjaminking reviewed Jan 22, 2026

mshannon-sil added 5 commits January 22, 2026 16:17

rename to verse_test_scores_file

cb3fe0e

integrate ConfidenceFile class more fully in quality_estimation

d0a92af

fix call to glob; adjust return types

65cc71b

fix path to initiate confidence files; use get_verses_path

7603504

use get_path() not .path

37c270c

mshannon-sil commented Jan 22, 2026

benjaminking approved these changes Jan 22, 2026

mshannon-sil added 6 commits January 23, 2026 01:54

make --verse-test-scores-file truly optional when possible and unify approach across translate.py and experiment.py; fix handling and minimize number of confidence files when post_processing is applied

b04a0f7

fix quality estimation description

8517ab1

add to --verse-test-scores-file description

bcec1ae

Add to experiment.py --quality-estimation description

cd9730d

undo a description change for --quality-estimation

2d332b4

adjust --verse-test-scores-file description again

9d6dbf8

mshannon-sil merged commit 988eacd into master Jan 23, 2026
1 check was pending
