Neural-HMM for Multilingual Language Identification

A Comparative Study of Markov Chain, Hidden Markov Model, and Neural-HMM Approaches

Objective

The objective of this research project is to compare classical probabilistic sequence models (Markov Chain, HMM) with a Neural Hidden Markov Model (Neural-HMM) for the task of multilingual language identification at the character level.
The goal is to analyze how traditional and neuralized probabilistic models differ in their ability to generalize from clean, well-structured data to noisy, real-world text samples.

Experimental Setup

Dataset

Languages: 14 European languages — ca, da, de, en, es, fi, fr, is, it, nl, nr, pt, ro, sv
Character vocabulary: 84 (lowercase alphabets + special symbols)
Data splits: Train / Dev / Eval
Eval sets:
- CleanEval: original, well-formed sentences
- NoisyEval: accent-stripped, truncated, and lowercased sentences to simulate real-world conditions

Models Implemented

Model	Description
Markov Chain (MC)	Character-level probabilistic model capturing n-gram transitions.
Hidden Markov Model (HMM)	Extends MC by modeling hidden state transitions with emission probabilities.
Neural-HMM	Hybrid model where emissions are parameterized by a neural network, combining deep representation learning with probabilistic structure.

Environment

Hardware: Mac M3 Pro (CPU-only)
Language: Python (PyTorch + NumPy)
Training configuration:
- --states 12
- --emb_dim 128, --hidden 256, --context 3
- --epochs 80, --batch_size 16
- --lr 5e-4, --dropout 0.3, --seed 42

Running the Project

1. Training

python -m src.scripts.train_neural_hmm   --states 12   --emb_dim 128   --hidden 256   --context 3   --epochs 80   --batch_size 16   --lr 5e-4   --weight_decay 1e-4   --dropout 0.3   --seed 42

2. Evaluation

python -m src.scripts.evaluate   --clean_dir data/clean   --noisy_dir data/noisy   --outdir outputs/

3. Output Files

outputs/report_k12.txt → Dev + Eval confusion matrices and top predictions
outputs/summary.txt → Model-wise accuracy comparison for Clean and Noisy evaluation sets

Results Summary

Model	CleanEval	NoisyEval
Markov Chain	0.848	0.429
HMM (K=12)	0.696	0.446
Neural-HMM	0.482	0.232

Interpretation

On clean data, Markov Chain achieves the highest accuracy due to strong memorization of n-grams.
Under noise, both MC and HMM degrade significantly, showing limited generalization.
Neural-HMM, though underperforming numerically here, displays structured confusion patterns that align with linguistic families — e.g.:
- Catalan ↔ French ↔ Spanish
- Danish ↔ Norwegian ↔ Swedish
  This indicates that Neural-HMM learns meaningful latent structure, even when accuracy drops.

Key Insights

Shallow probabilistic models (MC, HMM) excel at short-range pattern memorization.
Neural-HMM generalizes latent linguistic structure and can outperform traditional models on larger or noisier corpora with minimal tuning.
Noise sensitivity highlights the importance of data augmentation and sentence-length normalization.
Confusion matrices reveal that Neural-HMM captures family-level similarity rather than random misclassification.

Inference and Research Contribution

This project demonstrates how adding neural components to classical probabilistic models affects learning, generalization, and robustness in multilingual text processing.
It provides empirical backing for the claim that Neural-HMMs are more interpretable and scalable for real-world noisy-text tasks, even if classical models outperform them on clean benchmarks.
The structured error patterns of Neural-HMM provide a linguistic interpretability advantage — a desirable trait for explainable AI research.

Future Work

Introduce length-normalized scoring for fairer comparison across variable-length sentences.
Implement noise-aware data augmentation during training.
Add LR decay and emission regularization to improve generalization.
Extend to larger multilingual corpora and subword-level modeling.

License

This project is released under the MIT License.
Refer to the LICENSE file for full terms and permissions.

Name		Name	Last commit message	Last commit date
Latest commit History 38 Commits
data		data
outputs		outputs
src		src
tokenizers		tokenizers
LICENSE		LICENSE
README.md		README.md
confusion_matrix_neuralhmm.png		confusion_matrix_neuralhmm.png
requirements.txt		requirements.txt
tsne_multilingual_weights.png		tsne_multilingual_weights.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Neural-HMM for Multilingual Language Identification

A Comparative Study of Markov Chain, Hidden Markov Model, and Neural-HMM Approaches

Objective

Experimental Setup

Dataset

Models Implemented

Environment

Running the Project

1. Training

2. Evaluation

3. Output Files

Results Summary

Interpretation

Key Insights

Inference and Research Contribution

Future Work

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Neural-HMM for Multilingual Language Identification

A Comparative Study of Markov Chain, Hidden Markov Model, and Neural-HMM Approaches

Objective

Experimental Setup

Dataset

Models Implemented

Environment

Running the Project

1. Training

2. Evaluation

3. Output Files

Results Summary

Interpretation

Key Insights

Inference and Research Contribution

Future Work

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages