Add HuggingFaceNerRecognizer for direct NER model inference #1833

@ultramancode

Summary

I'd like to propose adding a HuggingFaceNerRecognizer that runs the HuggingFace Transformers pipeline directly for NER, bypassing the spaCy tokenizer alignment issues described below.

Problem

The current approach, which aligns HuggingFace NER model output to spaCy tokens, breaks down for agglutinative languages (Korean, Japanese, Turkish, etc.):

  • Particles/postpositions attach directly to nouns: "김태웅이고" is the name "김태웅" (Kim Taewoong) plus a particle
  • The spaCy tokenizer produces the single token "김태웅이고" (particle included)
  • The NER model returns only the name: "김태웅"
  • `char_span()` with `alignment_mode="strict"` fails because the entity boundary falls inside a token
  • `alignment_mode="expand"` snaps to the full token and includes the particle, causing downstream issues
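The boundary mismatch can be reproduced with a small stand-in for spaCy's `char_span` logic. The `char_span` helper below is a simplified re-implementation over plain character offsets, not spaCy's actual code, but it shows why "strict" skips the entity while "expand" over-captures it:

```python
def char_span(token_offsets, start, end, mode="strict"):
    """Map a (start, end) char range onto token boundaries.

    token_offsets: list of (tok_start, tok_end) pairs.
    mode="strict": return None unless both boundaries line up with tokens.
    mode="expand": snap outward to the tokens overlapping the range.
    """
    starts = {s for s, _ in token_offsets}
    ends = {e for _, e in token_offsets}
    if mode == "strict":
        return (start, end) if start in starts and end in ends else None
    exp_start = max((s for s, _ in token_offsets if s <= start), default=start)
    exp_end = min((e for _, e in token_offsets if e >= end), default=end)
    return (exp_start, exp_end)

text = "내 이름은 김태웅이고 전화번호는 010-1234-5678이야"
# Whitespace tokens as spaCy would see them; "김태웅이고" spans chars 6..11
tokens = [(0, 1), (2, 5), (6, 11), (12, 17), (18, 33)]

# NER model says the PERSON entity is chars 6..9 ("김태웅")
print(char_span(tokens, 6, 9, mode="strict"))  # None: boundary mismatch, entity skipped
print(char_span(tokens, 6, 9, mode="expand"))  # (6, 11): particle included, wrong span
```

The strict call returns `None` because char 9 is inside the token "김태웅이고", so the entity is silently dropped; the expand call widens to (6, 11), pulling the particle into the detected name.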

Example (Korean)

| Step | Result |
| --- | --- |
| Input text | "내 이름은 김태웅이고 전화번호는 010-1234-5678이야" |
| spaCy token | "김태웅이고" (name + particle) |
| NER result | "김태웅" (name only, start=6, end=9) |
| `char_span(strict)` | skipped (boundary mismatch) |
| `char_span(expand)` | "김태웅이고" (includes particle → wrong) |

Proposed Solution

Create a HuggingFaceNerRecognizer that:

  1. Uses HuggingFace Transformers pipeline directly (bypasses spaCy tokenizer)
  2. Returns NER results without char_span alignment
  3. Supports any language with a HuggingFace NER model
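A minimal sketch of the idea follows. The class name, `label_mapping`, and the injected `ner_pipeline` are illustrative, not a final API; the pipeline is assumed to be a Transformers token-classification pipeline created with `aggregation_strategy="simple"`, whose output dicts already carry model-native character offsets (`start`/`end`), so no `char_span` alignment is needed:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class NerSpan:
    entity_type: str
    start: int
    end: int
    score: float

class HuggingFaceNerRecognizer:
    def __init__(self, ner_pipeline: Callable, label_mapping: Dict[str, str]):
        # e.g. ner_pipeline = transformers.pipeline(
        #     "token-classification", model=..., aggregation_strategy="simple")
        self.ner_pipeline = ner_pipeline
        self.label_mapping = label_mapping  # model label -> Presidio entity type

    def analyze(self, text: str) -> List[NerSpan]:
        results = []
        for pred in self.ner_pipeline(text):
            entity = self.label_mapping.get(pred["entity_group"])
            if entity is None:
                continue  # model label not mapped to a supported entity
            # Use the model's own character offsets directly
            results.append(
                NerSpan(entity, pred["start"], pred["end"], float(pred["score"]))
            )
        return results

# Stand-in for a real pipeline, returning Transformers-style output dicts
# ("PS" is an illustrative person label from a Korean NER model):
def fake_pipeline(text):
    return [{"entity_group": "PS", "score": 0.99, "word": "김태웅",
             "start": 6, "end": 9}]

rec = HuggingFaceNerRecognizer(fake_pipeline, {"PS": "PERSON"})
spans = rec.analyze("내 이름은 김태웅이고 전화번호는 010-1234-5678이야")
# spans holds a single PERSON span at chars 6..9: the name only, particle excluded
```

In a real Presidio integration this would subclass `EntityRecognizer` and return `RecognizerResult` objects, but the mapping above is the core: trust the model's character offsets instead of re-aligning them to spaCy tokens.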

Key Features

  • Language-agnostic: Works with any HuggingFace NER model (English, Korean, Japanese, etc.)
  • Direct inference: No spaCy tokenizer dependency for entity boundaries
