fix: handle punkt_tab fallback in split_sentences for NLTK >= 3.9 by danishashko · Pull Request #1023 · codelucas/newspaper

danishashko · 2026-04-01T14:22:42Z

NLTK 3.9 moved the default punkt tokenizer from tokenizers/punkt/english.pickle to tokenizers/punkt_tab/english. On a fresh install with NLTK >= 3.9, only punkt_tab is downloaded, so calling split_sentences() raises a LookupError even though NLTK itself is properly installed.
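To check which of the two layouts is present on a given machine, something like the following can help. It is only a diagnostic sketch: the helper name installed_punkt_resources and the PUNKT_PATHS table are made up for illustration and are not part of newspaper or NLTK, and the trailing slash on the punkt_tab path is an assumption about how a directory resource is looked up. The lookup function is passed in so the helper can be exercised without NLTK data installed.

```python
from typing import Callable, Dict

# Resource paths taken from the description above. The trailing slash on the
# punkt_tab entry is an assumption: it marks the resource as a directory.
PUNKT_PATHS = {
    "punkt": "tokenizers/punkt/english.pickle",
    "punkt_tab": "tokenizers/punkt_tab/english/",
}


def installed_punkt_resources(find: Callable[[str], object]) -> Dict[str, bool]:
    """Report which punkt layouts `find` can locate.

    `find` is expected to behave like nltk.data.find: return a value when the
    resource exists and raise LookupError when it does not.
    """
    status = {}
    for name, path in PUNKT_PATHS.items():
        try:
            find(path)
            status[name] = True
        except LookupError:
            status[name] = False
    return status


# With NLTK installed, pass the real lookup function:
#   import nltk.data
#   print(installed_punkt_resources(nltk.data.find))
```

A result where punkt is False and punkt_tab is True corresponds to the fresh-install case described above.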

What this changes:

newspaper/nlp.py - split_sentences() now tries the old punkt path first and falls back to punkt_tab if that resource is not found, so it works with both old and new NLTK versions.
download_corpora.py - adds punkt_tab to REQUIRED_CORPORA so it gets downloaded alongside punkt.
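The try-old-then-new behaviour described for split_sentences() can be sketched roughly as below. This is not the code from the commits in this PR; it is a minimal illustration under a few assumptions: first_loadable, load_legacy_punkt, load_punkt_tab and split_sentences_sketch are hypothetical names, and loading the new layout through PunktTokenizer("english") assumes an NLTK version that ships punkt_tab support. NLTK is imported lazily inside the loaders so the fallback logic itself can be tested without it.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def first_loadable(loaders: Iterable[Callable[[], T]]) -> T:
    """Call each loader in order and return the first successful result.

    A loader signals "resource not installed" by raising LookupError, which is
    what NLTK raises for missing data. If every loader fails, the last
    LookupError is re-raised.
    """
    last_error = LookupError("no loaders were supplied")
    for loader in loaders:
        try:
            return loader()
        except LookupError as exc:
            last_error = exc  # remember why this candidate failed
    raise last_error


def load_legacy_punkt():
    # Old layout: pickled tokenizer under tokenizers/punkt/.
    import nltk.data
    return nltk.data.load("tokenizers/punkt/english.pickle")


def load_punkt_tab():
    # New layout: assumes PunktTokenizer reads tokenizers/punkt_tab/<lang>.
    from nltk.tokenize.punkt import PunktTokenizer
    return PunktTokenizer("english")


def split_sentences_sketch(text: str) -> List[str]:
    """Hypothetical stand-in for split_sentences(): old path first, then punkt_tab."""
    tokenizer = first_loadable([load_legacy_punkt, load_punkt_tab])
    return tokenizer.tokenize(text)
```

Only LookupError triggers the fallback here, so unrelated failures (for example a corrupted data file) still surface instead of being silently masked.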

PR #1006 already adds punkt_tab to the download list, but the runtime crash in nlp.py still happens if only punkt_tab is available. This fixes both.

danishashko added 2 commits April 1, 2026 17:22

fix: try punkt fallback to punkt_tab in split_sentences (NLTK >= 3.9)

f50ceaf

fix: add punkt_tab to REQUIRED_CORPORA for NLTK >= 3.9

958ec7d

danishashko wants to merge 2 commits into codelucas:master from danishashko:fix/punkt-tab-fallback
