jntajis-python is a Python library for transliterating and encoding/decoding characters across three Japanese character set standards: JIS X 0208, JIS X 0213, and Unicode. It also supports transliteration via the MJ (Moji Joho) character table and shrink conversion maps.
jntajis-python/
setup.py # setuptools + Cython extension build
setup.cfg # Package metadata, dependencies, dev extras
Makefile # Data pipeline: download -> parse -> codegen
src/jntajis/
__init__.py # Public Python API surface (enums + re-exports)
_jntajis.pyx # Cython implementation (core logic)
_jntajis.h # Generated C header (lookup tables)
_jntajis.pyi # Type stubs for the Cython extension
_jntajis.c # Cython-generated C source (not committed normally)
gen.py # Code generator: Excel/JSON -> _jntajis.h
py.typed # PEP 561 marker
tests/
test_encoder.py # Tests for encoding/decoding and IncrementalEncoder
test_mj_translit.py # Tests for MJ shrink candidate transliteration
xlsx_parser/
__init__.py # Re-exports read_xlsx
parser.py # Streaming OpenXML XLSX reader
xmlutils.py # SAX-style XML parser framework (expat-based)
docs/
source/
api.rst # Sphinx API documentation
conf.py # Sphinx configuration
_static/images/ # SVG diagrams
.github/workflows/
main.yml # CI entry point (PR + push + tag triggers)
tests.yml # Lint (black, flake8, mypy) + test job
wheels.yml # cibuildwheel multi-platform wheel builds
The system has three distinct phases: data pipeline (build-time), native extension (compile-time), and runtime API (user-facing).
External data sources are downloaded and processed into a single generated C header file:
[JNTA Excel] ---+
[MJ Excel] ---+--> gen.py (Jinja2 template) --> _jntajis.h (C lookup tables)
[MJ Shrink JSON]+
- JNTA Excel (
jissyukutaimap1_0_0.xlsx): NTA shrink conversion map. Downloaded from NTA. - MJ Excel (
mji.00601.xlsx): MJ character table. Downloaded from CITPC/IPA. - MJ Shrink JSON (
MJShrinkMap.1.2.0.json): MJ shrink conversion map. Downloaded from CITPC/IPA.
gen.py uses a custom xlsx_parser to read the Excel files, processes the data into optimized lookup structures, and renders _jntajis.h via a Jinja2 template. The generated header contains:
tx_mappings[]: 29494 entries, one per JIS X 0213 codepoint (men-ku-ten)urange_to_jis_mappings[]: Sorted ranges for Unicode-to-JIS binary searchsm_uni_to_jis_mapping(): State machine for multi-codepoint Unicode-to-JIS mappingurange_to_mj_mappings[]: Sorted ranges for Unicode-to-MJ-mapping-set binary searchmj_shrink_mappings[]: MJ shrink mapping unicode sets indexed by MJ code
_jntajis.pyx is a Cython file compiled into a C extension module. It:
- Includes
_jntajis.hviacdef externto access the generated lookup tables - Uses CPython internal APIs (
_PyUnicodeWriter,_PyBytesWriter,PyUnicode_READ, etc.) directly for high-performance string construction - Compiles with safety checks disabled (
boundscheck=False,wraparound=False,cdivision=True)
The build process is: _jntajis.pyx + _jntajis.h --> Cython --> _jntajis.c --> C compiler --> _jntajis.so.
The public API is exposed via __init__.py which re-exports from the Cython extension:
| Symbol | Type | Description |
|---|---|---|
jnta_encode() |
function | Unicode -> JIS byte sequence |
jnta_decode() |
function | JIS byte sequence -> Unicode |
jnta_shrink_translit() |
function | JNTA shrink transliteration (Unicode -> Unicode) |
mj_shrink_candidates() |
function | MJ-based shrink transliteration candidates |
IncrementalEncoder |
class | Stateful encoder (codec-compatible) |
TransliterationError |
exception | Raised on transliteration failure |
ConversionMode |
enum | Encoding mode selection |
MJShrinkScheme |
enum | Individual MJ shrink scheme identifiers |
MJShrinkSchemeCombo |
flag enum | Combinable MJ shrink scheme selectors |
JIS codepoints are packed into a uint16_t as: (men - 1) * 94 * 94 + (ku - 1) * 94 + (ten - 1), where men is 1 or 2 (JIS X 0213 plane), ku is 1-94 (row), ten is 1-94 (column).
Each JIS X 0213 position has an entry:
jis: packed men-ku-ten codeus[2]: primary Unicode codepoint(s)sus[2]: secondary (similar glyph) Unicode codepoint(s)class_: JIS character class (level 1-4, non-kanji, reserved)tx_jis[4]/tx_us[4]: transliterated form (JIS and Unicode)
Uses sorted range tables (URangeToJISMapping) with binary search. Multi-codepoint sequences (e.g. base + combining mark) use a state machine (sm_uni_to_jis_mapping()).
MJMapping: Maps an MJ code to Unicode codepoints + IVS (Ideographic Variation Sequence) pairsMJMappingSet: A set of MJ mappings for a single Unicode codepointURangeToMJMappings: Sorted range table for Unicode-to-MJ binary searchMJShrinkMappingUnicodeSet: Per-MJ-code shrink targets, one array per scheme (4 schemes)
User code
|
v
__init__.py (Python enums + re-exports)
|
v
_jntajis.pyx (Cython: encoding, decoding, transliteration logic)
|
v
_jntajis.h (Generated C: static lookup tables + state machine)
A lightweight, streaming, read-only XLSX parser. It avoids heavyweight dependencies like openpyxl by:
- Opening XLSX as a zip file (
zipfile.ZipFile) - Parsing
xl/sharedStrings.xmlfor the shared string table - Parsing
xl/worksheets/sheetN.xmlincrementally via SAX-style handlers
The XML parsing framework in xmlutils.py provides:
- A hierarchical
Handlers/HandlersBaseabstract pattern where each nesting level of XML is handled by a different handler class HandlerShimwraps handlers to dynamically switch the active handler as XML nesting changesread_xml_incremental()enables pull-style iteration over worksheet rows
- Trigger (
main.yml): On PR open, push to main, or version tag push (v*) - Lint & Test (
tests.yml): black + flake8 + mypy on Python 3.11 - Wheels (
wheels.yml): cibuildwheel across Ubuntu, Windows, macOS (11/12/13), excluding PyPy. Only runs on tag push.
Sphinx with sphinx_rtd_theme, hosted on Read the Docs. API docs are manually authored in api.rst (not autodoc).