Architecture

Project Overview

jntajis-python is a Python library for transliterating and encoding/decoding characters among three character sets: JIS X 0208, JIS X 0213, and Unicode. It also supports transliteration via the MJ (Moji Joho) character table and the MJ shrink conversion map.

Directory Layout

jntajis-python/
  setup.py                        # setuptools + Cython extension build
  setup.cfg                       # Package metadata, dependencies, dev extras
  Makefile                        # Data pipeline: download -> parse -> codegen
  src/jntajis/
    __init__.py                   # Public Python API surface (enums + re-exports)
    _jntajis.pyx                  # Cython implementation (core logic)
    _jntajis.h                    # Generated C header (lookup tables)
    _jntajis.pyi                  # Type stubs for the Cython extension
    _jntajis.c                    # Cython-generated C source (not committed normally)
    gen.py                        # Code generator: Excel/JSON -> _jntajis.h
    py.typed                      # PEP 561 marker
    tests/
      test_encoder.py             # Tests for encoding/decoding and IncrementalEncoder
      test_mj_translit.py         # Tests for MJ shrink candidate transliteration
    xlsx_parser/
      __init__.py                 # Re-exports read_xlsx
      parser.py                   # Streaming OpenXML XLSX reader
      xmlutils.py                 # SAX-style XML parser framework (expat-based)
  docs/
    source/
      api.rst                     # Sphinx API documentation
      conf.py                     # Sphinx configuration
      _static/images/             # SVG diagrams
  .github/workflows/
    main.yml                      # CI entry point (PR + push + tag triggers)
    tests.yml                     # Lint (black, flake8, mypy) + test job
    wheels.yml                    # cibuildwheel multi-platform wheel builds

High-Level Architecture

The system has three distinct phases: data pipeline (build-time), native extension (compile-time), and runtime API (user-facing).

1. Data Pipeline (build-time, `Makefile` + `gen.py`)

External data sources are downloaded and processed into a single generated C header file:

[JNTA Excel] ---+
[MJ Excel]   ---+--> gen.py (Jinja2 template) --> _jntajis.h (C lookup tables)
[MJ Shrink JSON]+

JNTA Excel (jissyukutaimap1_0_0.xlsx): NTA shrink conversion map. Downloaded from NTA.
MJ Excel (mji.00601.xlsx): MJ character table. Downloaded from CITPC/IPA.
MJ Shrink JSON (MJShrinkMap.1.2.0.json): MJ shrink conversion map. Downloaded from CITPC/IPA.

gen.py uses a custom xlsx_parser to read the Excel files, processes the data into optimized lookup structures, and renders _jntajis.h via a Jinja2 template. The generated header contains:

tx_mappings[]: 2 * 94 * 94 (17,672) entries, one per JIS X 0213 codepoint (men-ku-ten)
urange_to_jis_mappings[]: Sorted ranges for Unicode-to-JIS binary search
sm_uni_to_jis_mapping(): State machine for multi-codepoint Unicode-to-JIS mapping
urange_to_mj_mappings[]: Sorted ranges for Unicode-to-MJ-mapping-set binary search
mj_shrink_mappings[]: MJ shrink mapping unicode sets indexed by MJ code
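
To make the codegen step concrete, here is a minimal sketch of turning parsed rows into a static C array. It uses plain string formatting instead of the Jinja2 template that gen.py uses, and the `URange` layout, the `example_ranges` name, and the sample entries are hypothetical stand-ins rather than the real header contents.

```python
from typing import NamedTuple, Sequence


class URange(NamedTuple):
    """Hypothetical range entry; not the layout used in _jntajis.h."""
    start: int  # first Unicode codepoint in the range (inclusive)
    end: int    # last Unicode codepoint in the range (inclusive)
    jis: int    # packed JIS code that `start` maps to


def render_c_table(name: str, ranges: Sequence[URange]) -> str:
    """Render the ranges, sorted by start codepoint, as a C array initializer."""
    rows = [
        f"    {{0x{r.start:04X}, 0x{r.end:04X}, {r.jis}}},"
        for r in sorted(ranges)
    ]
    body = "\n".join(rows)
    return f"static const URange {name}[] = {{\n{body}\n}};\n"


# Illustrative entries only: fullwidth A-Z and hiragana.
header = render_c_table(
    "example_ranges",
    [URange(0xFF21, 0xFF3A, 220), URange(0x3041, 0x3093, 282)],
)
print(header)
```

Sorting at generation time is what lets the runtime side use binary search without any setup work.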

2. Native Extension (compile-time, Cython)

_jntajis.pyx is a Cython file compiled into a C extension module. It:

Includes _jntajis.h via cdef extern to access the generated lookup tables
Uses CPython internal APIs (_PyUnicodeWriter, _PyBytesWriter, PyUnicode_READ, etc.) directly for high-performance string construction
Compiles with safety checks disabled (boundscheck=False, wraparound=False, cdivision=True)

The build process is: _jntajis.pyx + _jntajis.h --> Cython --> _jntajis.c --> C compiler --> _jntajis.so.

3. Runtime API

The public API is exposed via __init__.py which re-exports from the Cython extension:

| Symbol | Type | Description |
| --- | --- | --- |
| `jnta_encode()` | function | Unicode -> JIS byte sequence |
| `jnta_decode()` | function | JIS byte sequence -> Unicode |
| `jnta_shrink_translit()` | function | JNTA shrink transliteration (Unicode -> Unicode) |
| `mj_shrink_candidates()` | function | MJ-based shrink transliteration candidates |
| `IncrementalEncoder` | class | Stateful encoder (codec-compatible) |
| `TransliterationError` | exception | Raised on transliteration failure |
| `ConversionMode` | enum | Encoding mode selection |
| `MJShrinkScheme` | enum | Individual MJ shrink scheme identifiers |
| `MJShrinkSchemeCombo` | flag enum | Combinable MJ shrink scheme selectors |

Key Data Structures

JIS Code Representation

JIS codepoints are packed into a uint16_t as: (men - 1) * 94 * 94 + (ku - 1) * 94 + (ten - 1), where men is 1 or 2 (JIS X 0213 plane), ku is 1-94 (row), ten is 1-94 (column).
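
The packing formula can be checked with a small Python sketch. The function names `pack_jis` and `unpack_jis` are illustrative and are not part of the library's API.

```python
from typing import Tuple

KU_TEN = 94            # rows per plane and columns per row
PLANE = KU_TEN * KU_TEN  # codepoints per plane


def pack_jis(men: int, ku: int, ten: int) -> int:
    """Pack a 1-based men-ku-ten triple into a single integer."""
    if not (1 <= men <= 2 and 1 <= ku <= KU_TEN and 1 <= ten <= KU_TEN):
        raise ValueError(f"men-ku-ten out of range: {men}-{ku}-{ten}")
    return (men - 1) * PLANE + (ku - 1) * KU_TEN + (ten - 1)


def unpack_jis(code: int) -> Tuple[int, int, int]:
    """Invert pack_jis, returning a 1-based (men, ku, ten) triple."""
    if not 0 <= code < 2 * PLANE:
        raise ValueError(f"packed code out of range: {code}")
    men0, rest = divmod(code, PLANE)
    ku0, ten0 = divmod(rest, KU_TEN)
    return men0 + 1, ku0 + 1, ten0 + 1


print(pack_jis(1, 16, 1))    # 1410
print(pack_jis(2, 94, 94))   # 17671, the largest packed value
print(unpack_jis(1410))      # (1, 16, 1)
```

The largest packed value, 17671, is well below 65535, which is why a uint16_t is sufficient.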

ShrinkingTransliterationMapping

Each JIS X 0213 position has an entry:

jis: packed men-ku-ten code
us[2]: primary Unicode codepoint(s)
sus[2]: secondary (similar glyph) Unicode codepoint(s)
class_: JIS character class (level 1-4, non-kanji, reserved)
tx_jis[4]/tx_us[4]: transliterated form (JIS and Unicode)

Unicode-to-JIS Reverse Lookup

Uses sorted range tables (URangeToJISMapping) with binary search. Multi-codepoint sequences (e.g. base + combining mark) use a state machine (sm_uni_to_jis_mapping()).
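
A rough Python sketch of the reverse lookup follows. It makes several assumptions that the real tables may not share: each range is taken to map linearly onto consecutive packed JIS codes, the three ranges and the single combining pair are illustrative samples, and a greedy two-codepoint lookahead stands in for the generated state machine.

```python
from bisect import bisect_right
from typing import List, Optional, Sequence

# (start, end, packed JIS code of start); illustrative, sorted by start.
RANGES = [
    (0x3041, 0x3093, 282),  # hiragana
    (0x30A1, 0x30F6, 376),  # katakana
    (0xFF21, 0xFF3A, 220),  # fullwidth A-Z
]
STARTS = [r[0] for r in RANGES]

# Illustrative multi-codepoint entry: KA + combining handakuten.
PAIRS = {(0x304B, 0x309A): 368}


def lookup(cp: int) -> Optional[int]:
    """Binary-search the sorted ranges for a single codepoint."""
    i = bisect_right(STARTS, cp) - 1  # last range starting at or before cp
    if i < 0:
        return None
    start, end, jis = RANGES[i]
    if cp > end:
        return None  # cp falls in a gap between ranges
    return jis + (cp - start)


def to_jis(cps: Sequence[int]) -> List[int]:
    """Map codepoints to packed JIS codes, preferring two-codepoint matches."""
    out: List[int] = []
    i = 0
    while i < len(cps):
        pair = tuple(cps[i:i + 2])
        if pair in PAIRS:
            out.append(PAIRS[pair])
            i += 2
            continue
        jis = lookup(cps[i])
        if jis is None:
            raise ValueError(f"no mapping for U+{cps[i]:04X}")
        out.append(jis)
        i += 1
    return out


print(to_jis([0x304B, 0x309A, 0x3042]))  # [368, 283]
```

Checking for the longer sequence first matters: without it, the base character would be emitted on its own and the combining mark would have no mapping.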

MJ Mapping Structures

MJMapping: Maps an MJ code to Unicode codepoints + IVS (Ideographic Variation Sequence) pairs
MJMappingSet: A set of MJ mappings for a single Unicode codepoint
URangeToMJMappings: Sorted range table for Unicode-to-MJ binary search
MJShrinkMappingUnicodeSet: Per-MJ-code shrink targets, one array per scheme (4 schemes)

Component Interactions

User code
  |
  v
__init__.py  (Python enums + re-exports)
  |
  v
_jntajis.pyx  (Cython: encoding, decoding, transliteration logic)
  |
  v
_jntajis.h  (Generated C: static lookup tables + state machine)

xlsx_parser Sub-package

A lightweight, streaming, read-only XLSX parser. It avoids heavyweight dependencies like openpyxl by:

Opening XLSX as a zip file (zipfile.ZipFile)
Parsing xl/sharedStrings.xml for the shared string table
Parsing xl/worksheets/sheetN.xml incrementally via SAX-style handlers
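
The first two steps can be sketched with the standard library alone. This sketch uses xml.etree.ElementTree for brevity rather than the expat-based framework the sub-package uses, and `read_shared_strings` and `make_sample` are hypothetical helper names.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from typing import List

NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}


def read_shared_strings(xlsx: zipfile.ZipFile) -> List[str]:
    """Return the shared string table as a list indexed by string id."""
    with xlsx.open("xl/sharedStrings.xml") as f:
        root = ET.parse(f).getroot()
    strings = []
    for si in root.findall("m:si", NS):
        # A string is either one <t> or several rich-text runs <r><t>.
        parts = si.findall("m:t", NS) + si.findall("m:r/m:t", NS)
        strings.append("".join(t.text or "" for t in parts))
    return strings


def make_sample() -> bytes:
    """Build a tiny in-memory archive holding only a shared string table."""
    xml = (
        '<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">'
        "<si><t>men</t></si>"
        "<si><r><t>ku</t></r><r><t>ten</t></r></si>"
        "</sst>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("xl/sharedStrings.xml", xml)
    return buf.getvalue()


with zipfile.ZipFile(io.BytesIO(make_sample())) as z:
    strings = read_shared_strings(z)
print(strings)  # ['men', 'kuten']
```

Worksheet cells of type "s" store an index into this list rather than the text itself, so the table has to be loaded before rows can be resolved.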

The XML parsing framework in xmlutils.py provides:

A hierarchical Handlers/HandlersBase abstract pattern where each nesting level of XML is handled by a different handler class
HandlerShim wraps handlers to dynamically switch the active handler as XML nesting changes
read_xml_incremental() enables pull-style iteration over worksheet rows
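
The handler-switching idea can be illustrated with a small expat-based sketch. The class names below (`ElementHandler`, `SwitchingShim`, and so on) are hypothetical and do not mirror the classes in xmlutils.py, and the sketch collects rows eagerly instead of offering pull-style iteration.

```python
import xml.parsers.expat
from typing import Dict, List, Optional, Tuple


class ElementHandler:
    """Base handler: start() may return a child handler for a nested element."""
    def start(self, name: str, attrs: Dict[str, str]) -> Optional["ElementHandler"]:
        return None
    def text(self, data: str) -> None:
        pass
    def end(self, name: str) -> None:
        pass


class CellHandler(ElementHandler):
    def __init__(self, cells: List[str]) -> None:
        self.cells, self.buf = cells, []
    def text(self, data: str) -> None:
        self.buf.append(data)
    def end(self, name: str) -> None:
        self.cells.append("".join(self.buf))


class RowHandler(ElementHandler):
    def __init__(self, rows: List[List[str]]) -> None:
        self.rows, self.cells = rows, []
    def start(self, name, attrs):
        return CellHandler(self.cells) if name == "c" else None
    def end(self, name: str) -> None:
        self.rows.append(self.cells)


class SheetHandler(ElementHandler):
    def __init__(self) -> None:
        self.rows: List[List[str]] = []
    def start(self, name, attrs):
        return RowHandler(self.rows) if name == "row" else None


class SwitchingShim:
    """Keeps a stack of (handler, owns_element) and routes expat callbacks."""
    def __init__(self, root: ElementHandler) -> None:
        self.stack: List[Tuple[ElementHandler, bool]] = [(root, False)]
    def start_element(self, name, attrs):
        current = self.stack[-1][0]
        child = current.start(name, attrs)
        # Switch to the child if one was returned; otherwise stay put.
        self.stack.append((child or current, child is not None))
    def end_element(self, name):
        handler, owns = self.stack.pop()
        if owns:
            handler.end(name)
    def char_data(self, data):
        self.stack[-1][0].text(data)


sheet = SheetHandler()
shim = SwitchingShim(sheet)
parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = shim.start_element
parser.EndElementHandler = shim.end_element
parser.CharacterDataHandler = shim.char_data
parser.Parse(
    "<sheetData><row><c><v>1</v></c><c><v>2</v></c></row>"
    "<row><c><v>3</v></c></row></sheetData>",
    True,
)
print(sheet.rows)  # [['1', '2'], ['3']]
```

Each handler only knows about its own nesting level, so supporting a new element means adding one class rather than extending a single large callback.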

CI/CD

Trigger (main.yml): On PR open, push to main, or version tag push (v*)
Lint & Test (tests.yml): black + flake8 + mypy on Python 3.11
Wheels (wheels.yml): cibuildwheel across Ubuntu, Windows, macOS (11/12/13), excluding PyPy. Only runs on tag push.

Documentation

Sphinx with sphinx_rtd_theme, hosted on Read the Docs. API docs are manually authored in api.rst (not autodoc).
