Summary
I'd like to propose adding a HuggingFaceNerRecognizer that uses the HuggingFace Transformers pipeline directly for NER, bypassing spaCy tokenizer alignment issues.
Problem
The current approach, which runs the spaCy tokenizer alongside HuggingFace NER models, has alignment issues for agglutinative languages (Korean, Japanese, Turkish, etc.):
- Particles/postpositions attach to nouns: "김태웅이고" (Kim Taewoong + particle)
- spaCy tokenizer produces "김태웅이고" (includes the particle)
- NER model returns "김태웅" (name only)
- char_span() alignment fails with alignment_mode="strict"; alignment_mode="expand" includes particles, causing downstream issues
Example (Korean)
| Step | Result |
|---|---|
| Input Text | "내 이름은 김태웅이고 전화번호는 010-1234-5678이야" |
| spaCy Token | "김태웅이고" (name + particle) |
| NER Result | "김태웅" (name only, start=6, end=9) |
| char_span(strict) | SKIP (boundary mismatch) |
| char_span(expand) | "김태웅이고" (includes particle → wrong!) |
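The failure above can be reproduced without any models. The sketch below mimics spaCy's Doc.char_span alignment modes with a whitespace tokenizer standing in for spaCy's Korean tokenizer; the function names and simplified behavior are assumptions for illustration, not spaCy's actual code.

```python
def token_spans(text):
    """Whitespace tokenization, standing in for the spaCy tokenizer."""
    spans, pos = [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        spans.append((start, start + len(tok)))
        pos = start + len(tok)
    return spans

def char_span(text, start, end, mode="strict"):
    """Simplified analogue of spaCy's Doc.char_span alignment modes."""
    spans = token_spans(text)
    if mode == "strict":
        # Succeed only when both offsets fall exactly on token boundaries.
        if any(s == start for s, _ in spans) and any(e == end for _, e in spans):
            return text[start:end]
        return None  # boundary mismatch -> entity is silently skipped
    if mode == "expand":
        # Expand to the smallest run of whole tokens covering [start, end).
        covering = [(s, e) for s, e in spans if s < end and e > start]
        return text[covering[0][0]:covering[-1][1]]

text = "내 이름은 김태웅이고 전화번호는 010-1234-5678이야"
# The NER model finds the name at characters 6..9 ("김태웅").
print(char_span(text, 6, 9, "strict"))  # None: entity dropped
print(char_span(text, 6, 9, "expand"))  # "김태웅이고": particle included
```

With strict alignment the entity is dropped entirely; with expand alignment the particle "이고" is pulled into the span, which is exactly the trade-off shown in the table.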
Proposed Solution
Create a HuggingFaceNerRecognizer that:
- Uses HuggingFace Transformers pipeline directly (bypasses spaCy tokenizer)
- Returns NER results without char_span alignment
- Supports any language with a HuggingFace NER model
Key Features
- Language-agnostic: Works with any HuggingFace NER model (English, Korean, Japanese, etc.)
- Direct inference: No spaCy tokenizer dependency for entity boundaries
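A rough sketch of what such a recognizer could look like. All names here (HuggingFaceNerRecognizer, RecognizerResult, the label mapping) are hypothetical stand-ins, and the wrapped pipeline is assumed to behave like transformers.pipeline("ner", aggregation_strategy="simple"), which returns dicts with entity_group/score/start/end keys whose offsets point into the original text:

```python
from dataclasses import dataclass

@dataclass
class RecognizerResult:
    """Minimal stand-in for a Presidio-style recognizer result."""
    entity_type: str
    start: int
    end: int
    score: float

class HuggingFaceNerRecognizer:
    # Assumed mapping from common model labels to analyzer entity types.
    DEFAULT_MAPPING = {"PER": "PERSON", "LOC": "LOCATION", "ORG": "ORGANIZATION"}

    def __init__(self, pipeline, label_mapping=None):
        # `pipeline` is any callable with the HF token-classification
        # pipeline's output shape (entity_group, score, start, end).
        self.pipeline = pipeline
        self.label_mapping = label_mapping or self.DEFAULT_MAPPING

    def analyze(self, text):
        results = []
        for ent in self.pipeline(text):
            entity_type = self.label_mapping.get(ent["entity_group"])
            if entity_type is None:
                continue  # skip labels we don't map
            # start/end come straight from the HF pipeline and index the
            # original text -- used as-is, with no char_span realignment.
            results.append(
                RecognizerResult(entity_type, ent["start"], ent["end"],
                                 float(ent["score"])))
        return results

# Usage with a stub pipeline standing in for a real Korean NER model:
def fake_pipeline(text):
    return [{"entity_group": "PER", "score": 0.99,
             "word": "김태웅", "start": 6, "end": 9}]

text = "내 이름은 김태웅이고 전화번호는 010-1234-5678이야"
result = HuggingFaceNerRecognizer(fake_pipeline).analyze(text)[0]
print(text[result.start:result.end])  # 김태웅 -- name only, particle excluded
```

The real implementation would subclass the analyzer's EntityRecognizer base class and load the pipeline lazily; the point of the sketch is only that the model's own character offsets are trusted directly, so no spaCy tokenization step can shift or drop the entity boundaries.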