lingo-iitgn/awesome-code-mixing


Awesome Code-Mixing & Code-Switching

A curated list of papers, datasets, and toolkits for Code-Switching & Code-Mixing in Natural Language Processing in the Era of Large Language Models.


[Figure: Taxonomy of Code-Switched Language Analytics and representative works for each direction]

Survey Papers

Comprehensive reviews of the code-switching research landscape. A great place to start.


1. NLP Tasks

1.1. Traditional Tasks

Core competencies for understanding structure, syntax, and linguistic boundaries.

Language Identification (LID)

Part-of-Speech (POS) Tagging

Named Entity Recognition (NER)

Sentiment & Emotion Analysis

Syntactic Analysis

Machine Translation (MT)


1.2. Emerging and Contemporary Tasks

Tasks focused on understanding and generating fluent, coherent code-mixed text.

Natural Language Inference (NLI)

Intent Classification

Question Answering (QA)

Code-Mixed Text Generation

Cross-lingual Transfer

Text Summarization

Dialogue Generation

Transliteration


1.3. Underexplored and Frontier Tasks

Underexplored research directions where code-switching intersects with reasoning, safety, creativity, and multimodal interaction.

Reasoning & Abstraction

Creative & Code Generation

Conversational & Dialogue systems

Safety & Multimodal

2. Datasets & Resources

Corpora, toolkits, and frameworks to support your research.

Datasets

| Name | Description | Lang Pair | Type/Task | Link |
|---|---|---|---|---|
| AfroCS-xs | High-quality human-validated synthetic data. | 4 African-En | Machine Translation | 🔗 |
| ASCEND | 10.6h spontaneous conversational speech. | Mandarin-En | ASR/Dialogue | 🔗 |
| BanglishRev | 23K Bangla-English reviews for sentiment. | Bengali-En | Sentiment | 🔗 |
| CM-DailyDialog | Synthetic code-mixed version of DailyDialog (Hinglish dialogs). | Hindi-En | Dialogue Generation | 🔗 |
| CSPref | Human preference dataset for evaluating fluency of LLM-generated code-switched text. | Hindi-En, Tamil-En, Malayalam-En | Preference/Evaluation (LLM-generated CS) | 🔗 |
| DravidianCodeMix | ~71K code-mixed YouTube comments from Dravidian languages. | Tamil/Kannada/Malayalam-En | Sentiment & Offensive Detection | 🔗 |
| GupShup | 6.8K+ Hindi-English code-switched conversations with summaries. | Hindi-En | Abstractive Summarization | 🔗 |
| HiACC | Hinglish adult & children code-switched corpus. | Hindi-En | Speech/Text | 🔗 |
| MMS-5 | Multi-scenario multimodal hate speech. | Tamil/Kan-En | MM Hate Speech | 🔗 |
| MultiCoNER | Large-scale benchmark for complex NER. | 11 Langs | NER | 🔗 |
| My Boli | Corpora & Pre-trained Models for Marathi-English. | Marathi-En | NLU | 🔗 |
| RideKE | Over 29K code-switched tweets from Kenyan ride-hailing domain. | English-Swahili-Sheng | Sentiment & Emotion | 🔗 |
| SCC (Saudilang Code-switch Corpus) | LLM-generated (GPT-4) code-switched speech dataset with Arabic dialects. | Arabic dialects-En/MSA | ASR (Code-Switched) | 🔗 |
| SwitchLingua | Massive multi-ethnic code-switching dataset. | 83 Langs | General NLU | 🔗 |
| ToxVidLM | Framework & dataset for toxicity in code-mixed videos. | Mixed | Video Toxicity | 🔗 |

Frameworks & Toolkits


3. Model Training & Adaptation

Techniques for building and adapting models to understand and generate code-mixed language.

Pre-training Approaches

Fine-tuning Approaches

Post-training Approaches


4. Evaluation & Benchmarking

Resources for evaluating model performance on code-switching tasks.

📊 Benchmark Comparison

A comparison of major evaluation suites for Code-Switching, categorized by data origin and evaluation focus.

| Benchmark | Task Scope | Data Origin | Eval Focus | Link |
|---|---|---|---|---|
| CodeMixBench | Multitask (LID, POS, NER, SA, MT + Knowledge/Math Reasoning, Truthfulness) | 🤖 Synthetic (GPT-assisted) | Multilingual Code-Mixing Capabilities (18 Langs) | 🔗 |
| CodeMixBench (Code Gen) | Code Generation (Python) | 🧑‍💻 Human (augmented from BigCodeBench) | Syntax & Executability with Code-Mixed Prompts | 🔗 |
| COMI-LINGUA | LID, Matrix Language ID, POS, NER, MT | 🧑‍💻 Human (expert-annotated) | Multitask NLU & MT in Hindi-English Code-Mixing | 🔗 |
| CroCoSum | Cross-lingual Code-switched Summarization | 🧑‍💻 Human (English-Hindi dialogues) | Summarization Quality in Code-Switched Context | 🔗 |
| CS-Sum | Dialogue Summarization | 🧑‍💻 Human (annotated CS dialogues) | Comprehension & Summarization of CS Dialogues | 🔗 |
| CS3-Bench | Speech-to-Speech QA & Conversation | 🧑‍💻 Human + 🤖 Synthetic | Language Alignment in Mandarin-En CS Speech | 🔗 |
| GLUECoS | QA, NLI, Sentiment, LID, POS, NER | 🧑‍💻 Human | NLU Performance | 🔗 |
| LinCE | LID, NER, POS, Sentiment | 🧑‍💻 Human | Linguistic Accuracy (F1) | 🔗 |
| Lost in the Mix | Reading Comprehension, Knowledge, NLI | 🤖 Synthetic (LLM-generated CS variants) | Deeper Reasoning in Code-Switched Text | 🔗 |
| MEGAVERSE | Multimodal QA + Multitask NLU | ⚡ Hybrid | Factuality & Robustness (83 Langs) | 🔗 |
| PACMAN | POS Tagging | 🤖 Synthetic (parallel generation) | POS Accuracy in Code-Mixed Text (Hindi-En focus) | 🔗 |
| SwitchLingua | Multitask NLU (83 Langs) | 🤖 Hybrid (LLM-synthesized) | Scale & Diversity in Code-Switching | 🔗 |
| X-RiSAWOZ | Multilingual Task-Oriented Dialogue (TOD) | 🧑‍💻 Human (translated + rewritten) | Cross-lingual TOD in Code-Mixed Scenarios (En-Hi, En-Es, En-Fr) | 🔗 |

(Legend: 🧑‍💻 Human = Manually annotated/curated; 🤖 Synthetic = Generated by Large Language Models; ⚡ Hybrid = Mixed sources or Human-filtered Synthetic data.)

Benchmarks

Evaluation Metrics
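
As a quick illustration of what a code-mixing metric measures, here is a minimal sketch of the Code-Mixing Index (CMI) as it is commonly defined in the literature: `100 * (1 - max_lang_tokens / (n - u))`, where `n` is the number of tokens and `u` the number of language-independent tokens. The function name `code_mixing_index`, the neutral tag names, and the example tags are assumptions made for this sketch, not part of any listed toolkit.

```python
from collections import Counter

def code_mixing_index(lang_tags, neutral_tags=("other", "ne", "univ")):
    """Utterance-level Code-Mixing Index (CMI) in the range [0, 100).

    lang_tags: one language tag per token, e.g. ["hi", "en", "other"].
    neutral_tags: tags treated as language-independent (assumed names).
    """
    n = len(lang_tags)
    # Count only tokens that carry a language label.
    counts = Counter(t for t in lang_tags if t not in neutral_tags)
    u = n - sum(counts.values())  # language-independent tokens
    if n == u:
        # No language-tagged tokens at all: define CMI as 0.
        return 0.0
    return 100.0 * (1.0 - max(counts.values()) / (n - u))

# Hypothetical token-level tags for a Hinglish utterance of 7 tokens:
# 3 Hindi, 3 English, 1 punctuation ("other").
tags = ["hi", "en", "en", "hi", "en", "hi", "other"]
print(code_mixing_index(tags))  # 50.0
```

A monolingual utterance scores 0, and the score rises as the dominant language accounts for a smaller share of the language-tagged tokens.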


5. Multi & Cross-Modal Applications

Applying code-switching NLP to speech, vision, and other modalities.

Speech Processing

Vision-Language & Document Processing

Cross-Modal Integration


Workshops & Shared Tasks

A list of academic workshops and community shared tasks dedicated to code-switching.


Contributing

Your contributions are always welcome and make this community resource better!

If you have a paper, dataset, or tool you'd like to add:

  1. Fork the repository.
  2. Add your resource to the relevant section.
  3. Please try to follow the existing format and include a direct link.
  4. Submit a pull request!

About

A curated list of resources dedicated to Code-mixed Natural Language Processing (NLP).
