A Comparative Study of Markov Chain, Hidden Markov Model, and Neural-HMM Approaches
The objective of this research project is to compare classical probabilistic sequence models (Markov Chain, HMM) with a Neural Hidden Markov Model (Neural-HMM) for the task of multilingual language identification at the character level.
The goal is to analyze how traditional and neuralized probabilistic models differ in their ability to generalize from clean, well-structured data to noisy, real-world text samples.
- Languages: 14 European languages —
ca,da,de,en,es,fi,fr,is,it,nl,nr,pt,ro,sv - Character vocabulary: 84 (lowercase alphabets + special symbols)
- Data splits: Train / Dev / Eval
- Eval sets:
- CleanEval: original, well-formed sentences
- NoisyEval: accent-stripped, truncated, and lowercased sentences to simulate real-world conditions
| Model | Description |
|---|---|
| Markov Chain (MC) | Character-level probabilistic model capturing n-gram transitions. |
| Hidden Markov Model (HMM) | Extends MC by modeling hidden state transitions with emission probabilities. |
| Neural-HMM | Hybrid model where emissions are parameterized by a neural network, combining deep representation learning with probabilistic structure. |
- Hardware: Mac M3 Pro (CPU-only)
- Language: Python (PyTorch + NumPy)
- Training configuration:
--states 12--emb_dim 128,--hidden 256,--context 3--epochs 80,--batch_size 16--lr 5e-4,--dropout 0.3,--seed 42
python -m src.scripts.train_neural_hmm --states 12 --emb_dim 128 --hidden 256 --context 3 --epochs 80 --batch_size 16 --lr 5e-4 --weight_decay 1e-4 --dropout 0.3 --seed 42python -m src.scripts.evaluate --clean_dir data/clean --noisy_dir data/noisy --outdir outputs/outputs/report_k12.txt→ Dev + Eval confusion matrices and top predictionsoutputs/summary.txt→ Model-wise accuracy comparison for Clean and Noisy evaluation sets
| Model | CleanEval | NoisyEval |
|---|---|---|
| Markov Chain | 0.848 | 0.429 |
| HMM (K=12) | 0.696 | 0.446 |
| Neural-HMM | 0.482 | 0.232 |
- On clean data, Markov Chain achieves the highest accuracy due to strong memorization of n-grams.
- Under noise, both MC and HMM degrade significantly, showing limited generalization.
- Neural-HMM, though underperforming numerically here, displays structured confusion patterns that align with linguistic families — e.g.:
- Catalan ↔ French ↔ Spanish
- Danish ↔ Norwegian ↔ Swedish
This indicates that Neural-HMM learns meaningful latent structure, even when accuracy drops.
- Shallow probabilistic models (MC, HMM) excel at short-range pattern memorization.
- Neural-HMM generalizes latent linguistic structure and can outperform traditional models on larger or noisier corpora with minimal tuning.
- Noise sensitivity highlights the importance of data augmentation and sentence-length normalization.
- Confusion matrices reveal that Neural-HMM captures family-level similarity rather than random misclassification.
- This project demonstrates how adding neural components to classical probabilistic models affects learning, generalization, and robustness in multilingual text processing.
- It provides empirical backing for the claim that Neural-HMMs are more interpretable and scalable for real-world noisy-text tasks, even if classical models outperform them on clean benchmarks.
- The structured error patterns of Neural-HMM provide a linguistic interpretability advantage — a desirable trait for explainable AI research.
- Introduce length-normalized scoring for fairer comparison across variable-length sentences.
- Implement noise-aware data augmentation during training.
- Add LR decay and emission regularization to improve generalization.
- Extend to larger multilingual corpora and subword-level modeling.
This project is released under the MIT License.
Refer to the LICENSE file for full terms and permissions.