fix: handle punkt_tab fallback in split_sentences for NLTK >= 3.9 #1023

Open
danishashko wants to merge 2 commits into codelucas:master from danishashko:fix/punkt-tab-fallback

@danishashko

Fixes #1017

NLTK 3.9 moved the default punkt tokenizer from tokenizers/punkt/english.pickle to tokenizers/punkt_tab/english. On a fresh install with NLTK >= 3.9, only punkt_tab is downloaded, so calling split_sentences() raises a LookupError even though NLTK itself is installed correctly.

What this changes:

  1. newspaper/nlp.py - split_sentences() now tries the old path first and falls back to punkt_tab if not found. Works with both old and new NLTK versions.
  2. download_corpora.py - adds punkt_tab to REQUIRED_CORPORA so it gets downloaded alongside punkt.

PR #1006 already adds punkt_tab to the download list, but the runtime crash in nlp.py still happens if only punkt_tab is available. This PR fixes both the download list and the runtime lookup.
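For reviewers, here is a minimal sketch of the try-old-path-then-fall-back pattern described in item 1. It is not the code from this PR: `load_first_available` and its `loader` parameter are hypothetical names, and the loader stands in for whatever NLTK call nlp.py actually uses. Only the two resource paths come from the description above.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

# Resource paths quoted in the PR description, old location first.
PUNKT_PATHS = (
    "tokenizers/punkt/english.pickle",
    "tokenizers/punkt_tab/english",
)


def load_first_available(
    loader: Callable[[str], T],
    paths: Sequence[str] = PUNKT_PATHS,
) -> T:
    """Return loader(path) for the first path that does not raise LookupError.

    `loader` is a hypothetical stand-in for the real resource-loading call.
    """
    last_error = None
    for path in paths:
        try:
            return loader(path)
        except LookupError as err:
            # Resource missing at this location: remember why, try the next one.
            last_error = err
    # Every candidate failed, so surface a single error naming all of them.
    raise LookupError(f"no tokenizer resource found in {list(paths)}") from last_error
```

Trying the old path first keeps behavior unchanged on installs that still have the pickle, and the second path is only consulted when the first lookup fails.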

Development

Successfully merging this pull request may close these issues.

WordTokenizer conflict with nltk >= 3.8.2
