Awesome Code-Mixing & Code-Switching

A curated list of papers, datasets, and toolkits for Code-Switching & Code-Mixing in Natural Language Processing in the Era of Large Language Models.

Survey Papers

Comprehensive reviews of the code-switching research landscape. A great place to start.

A Survey of Current Datasets for Code-Switching Research - Jose, N., et al. (2020).
A Survey of Code-switched Speech and Language Processing - Sitaram, S., et al. (2020).
A Survey of Code-switching: Linguistic and Social Perspectives for Language Technologies - Doğruöz, A. S., et al. (2021).
The Decades Progress on Code-Switching Research in NLP: A Systematic Survey on Trends and Challenges - Winata, G. I., et al. (2023).
A Survey of Code-switched Arabic NLP: Progress, Challenges, and Future Directions - Hamed, I., et al. (2025).
Code-Switching in End-to-End ASR: A Systematic Literature Review - Smitesh Patil, et al. (2025).
Position Paper
- Building Educational Technologies for Code-Switching: Current Practices, Difficulties and Future Directions - Li Nguyen, et al. (2022).

1. NLP Tasks

1.1. Traditional Tasks

Core competencies for understanding structure, syntax, and linguistic boundaries.

1.2. Emerging and Contemporary Tasks

Tasks focused on generating fluent and coherent code-mixed text.

Natural Language Inference (NLI)

Detecting Entailment in Code-Mixed Hindi-English Conversations - Chakravarthy, S., et al. (2020).
A New Dataset for Natural Language Inference from Code-mixed Conversations - Khanuja, S., et al. (2020).
CoSDA-ML: Multi-Lingual Code-Switching Data Augmentation for Zero-Shot Cross-Lingual NLP - Qin, L., et al. (2020).
The Effectiveness of Intermediate-Task Training for Code-Switched Natural Language Understanding - Prasad, A., et al. (2021).
On Utilizing Constituent Language Resources to Improve Downstream Tasks in Hinglish - Kumar, V., et al. (2022).
Toward the Limitation of Code-Switching in Cross-Lingual Transfer - Feng, Y., et al. (2022).
Aligning Multilingual Embeddings for Improved Code-switched Natural Language Understanding - Fazili, B., et al. (2022).
In-context Mixing (ICM): Code-mixed Prompts for Multilingual LLMs - Shankar, B., et al. (2024).
Using Contextually Aligned Online Reviews to Measure LLMs’ Performance Disparities Across Language Varieties - Tang, Z., et al. (2025).

Intent Classification

IIT Gandhinagar at SemEval-2020 Task 9: Code-Mixed Sentiment Classification Using Candidate Sentence Generation and Selection - Srivastava, V. & Singh, M. (2020).
Multilingual Code-Switching for Zero-Shot Cross-Lingual Intent Prediction and Slot Filling - Krishnan, J., et al. (2021).
Regional language code-switching for natural language understanding and intelligent digital assistants - Rajeshwari, S. & Kallimani, J. S. (2021).
Cost-Performance Optimization for Processing Low-Resource Language Tasks Using Commercial LLMs - Nag, A., et al. (2024).

Question Answering (QA)

Uncovering Code-Mixed Challenges: A Framework for Linguistically Driven Question Generation and Neural based Question Answering - Gupta, D., et al. (2018).
Code-Mixed Question Answering Challenge using Deep Learning Methods - Thara, S., et al. (2020).
MLQA: Evaluating Cross-lingual Extractive Question Answering - Lewis, P., et al. (2020).
The Effectiveness of Intermediate-Task Training for Code-Switched Natural Language Understanding - Prasad, A., et al. (2021).
To Ask LLMs about English Grammaticality, Prompt Them in a Different Language - Behzad, S., et al. (2024).
COMMIT: Code-Mixing English-Centric Large Language Model for Multilingual Instruction Tuning - Lee, J., et al. (2024).
MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks - Ahuja, S., et al. (2024).
Controlling Language Confusion in Multilingual LLMs - Lee, N., et al. (2025).
Qorǵau: Evaluating Safety in Kazakh-Russian Bilingual Contexts - Goloburda, M., et al. (2025).
Code-Switching Curriculum Learning for Multilingual Transfer in LLMs - Yoo, H., et al. (2025).

Code-Mixed Text Generation

A Deep Generative Model for Code Switched Text - Samanta, B., et al. (2019).
A Semi-supervised Approach to Generate the Code-Mixed Text using Pre-trained Encoder and Transfer Learning - Gupta, D., et al. (2020).
Towards Code-Mixed Hinglish Dialogue Generation - Agarwal, V., et al. (2021).
HinGE: A Dataset for Generation and Evaluation of Code-Mixed Hinglish Text - Srivastava, V., et al. (2021).
From Machine Translation to Code-Switching: Generating High-Quality Code-Switched Text - Tarunesh, I., et al. (2021).
PACMAN: PArallel CodeMixed dAta generatioN for POS tagging - Chatterjee, A., et al. (2022).
MulZDG: Multilingual Code-Switching Framework for Zero-shot Dialogue Generation - Liu, Y., et al. (2022).
Proceedings of the 15th International Conference on Natural Language Generation: Generation Challenges - Shaikh, S., et al. (2022).
CoCoa: An Encoder-Decoder Model for Controllable Code-switched Generation - Mondal, S., et al. (2022).
Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages - Yong, Z. X., et al. (2023).
Enhancing Code-mixed Text Generation Using Synthetic Data Filtering in Neural Machine Translation - Sravani, D., et al. (2023).
Code-Switched Text Synthesis in Unseen Language Pairs - Hsu, I.-H., et al. (2023).
Linguistics Theory Meets LLM: Code-Switched Text Generation via Equivalence Constrained Large Language Models - Kuwanto, G., et al. (2024).
Leveraging Large Language Models for Code-Mixed Data Augmentation in Sentiment Analysis - Zeng, L. (2024).
Synthetic Data Generation and Joint Learning for Robust Code-Mixed Translation - Kartik, K., et al. (2024).
LLM-based Code-Switched Text Generation for Grammatical Error Correction - Potter, T., et al. (2024).
Understanding and Mitigating Language Confusion in LLMs - Marchisio, K., et al. (2024).
Pun Generation
- Bridging Laughter Across Languages: Generation of Hindi-English Code-mixed Puns - Asapu, L., et al. (2025).
- Homophonic Pun Generation in Code Mixed Hindi English - Sarrof, Y. R. (2025).

Cross-lingual Transfer

XLP at SemEval-2020 Task 9: Cross-lingual Models with Focal Loss for Sentiment Analysis of Code-Mixing Language - Ma, Y., et al. (2020).
CoSDA-ML: Multi-Lingual Code-Switching Data Augmentation for Zero-Shot Cross-Lingual NLP - Qin, L., et al. (2020).
Multilingual Code-Switching for Zero-Shot Cross-Lingual Intent Prediction and Slot Filling - Krishnan, J., et al. (2021).
Saliency-based Multi-View Mixed Language Training for Zero-shot Cross-lingual Classification - Lai, S., et al. (2021).
Scopa: Soft code-switching and pairwise alignment for zero-shot cross-lingual transfer - Lee, D., et al. (2021).
Toward the Limitation of Code-Switching in Cross-Lingual Transfer - Feng, Y., et al. (2022).
ENTITYCS: Improving Zero-Shot Cross-lingual Transfer with Entity-Centric Code Switching - Whitehouse, C., et al. (2022).
Improving Zero-Shot Cross-Lingual Transfer via Progressive Code-Switching - Li, Z., et al. (2024).
Test-Time Code-Switching for Cross-lingual Aspect Sentiment Triplet Extraction - Sheng, D., et al. (2025).

Text Summarization

GupShup: Summarizing Open-Domain Code-Switched Conversations - Mehnaz, L., et al. (2021).
Multilingual Large Language Models Are Not (Yet) Code-Switchers - Zhang, R., et al. (2023).
CoMix: Guide Transformers to Code-Mix using POS structure and Phonetics - Arora, G., et al. (2023).
Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation? - Hada, R., et al. (2024).
CroCoSum: A Benchmark Dataset for Cross-Lingual Code-Switched Summarization - Zhang, R. & Eickhoff, C. (2024).
Code-Switching Curriculum Learning for Multilingual Transfer in LLMs - Yoo, H., et al. (2025).
An Adapted Few-Shot Prompting Technique Using ChatGPT to Advance Low-Resource Languages Understanding - Sarrof, Y. R., et al. (2025).

Dialogue Generation

Detecting Entailment in Code-Mixed Hindi-English Conversations - Sharanya Chakravarthy, et al. (2020).
A New Dataset for Natural Language Inference from Code-mixed Conversations - Simran Khanuja, et al. (2020).
Do Multilingual Users Prefer Chat-bots that Code-mix? Let's Nudge and Find Out! - Anshul Bawa, et al. (2020).
CoSDA-ML: Multi-Lingual Code-Switching Data Augmentation for Zero-Shot Cross-Lingual NLP - Libo Qin, et al. (2020).
Multilingual Code-Switching for Zero-Shot Cross-Lingual Intent Prediction and Slot Filling - Jitin Krishnan, et al. (2021).
Towards Code-Mixed Hinglish Dialogue Generation - Vibhav Agarwal, et al. (2021).
GupShup: Summarizing Open-Domain Code-Switched Conversations - Laiba Mehnaz, et al. (2021).
Code-switched inspired losses for generic spoken dialog representations - Emile Chapuis, et al. (2021).
MulZDG: Multilingual Code-Switching Framework for Zero-shot Dialogue Generation - Yongkang Liu, et al. (2022).
X-RiSAWOZ: High-Quality End-to-End Multilingual Dialogue Datasets and Few-shot Agents - Mehrad Moradshahi, et al. (2023).
CST5: Data Augmentation for Code-Switched Semantic Parsing - Agarwal, A., et al. (2023).
Does a code-switching dialogue system help users learn conversational fluency in Choctaw? - Jacqueline Brixey, et al. (2025).
Performance Analysis of Effective Retrieval of Kannada Translations in Code-Mixed Sentences using BERT and MPnet - H. P. Rohith, et al. (2025).

Transliteration

Uncovering Code-Mixed Challenges: A Framework for Linguistically Driven Question Generation and Neural Based Question Answering - Gupta, D., et al. (2018).
Towards an Efficient Code-Mixed Grapheme-to-Phoneme Conversion in an Agglutinative Language: A Case Study on To-Korean Transliteration - Won Ik Cho, et al. (2020).
Detecting Entailment in Code-Mixed Hindi-English Conversations - Sharanya Chakravarthy, et al. (2020).
Graph Convolutional Networks with Multi-headed Attention for Code-Mixed Sentiment Analysis - Dowlagar, S. & Mamidi, R. (2021).
Normalization and Back-Transliteration for Code-Switched Data - Parikh, D. & Solorio, T. (2021).
Abusive content detection in transliterated Bengali-English social media corpus - Salim Sazzed (2021).
Transliteration for Low-Resource Code-Switching Texts: Building an Automatic Cyrillic-to-Latin Converter for Tatar - Taguchi, C., Sakai, Y. & Watanabe, T. (2021).
MUCS@MixMT: indicTrans-based Machine Translation for Hinglish Text - Asha Hegde, et al. (2022).
Text Characterization Toolkit (TCT) - Simig, D., et al. (2022).
Adapting Multilingual Models for Code-Mixed Translation - Vavre, A., Gupta, A. & Sarawagi, S. (2022).
Code-Switching and Back-Transliteration Using a Bilingual Model - Daniel Weisberg Mitelman, et al. (2024).
Cost-Performance Optimization for Processing Low-Resource Language Tasks Using Commercial LLMs - Arijit Nag, et al. (2024).
Homophonic Pun Generation in Code Mixed Hindi English - Yash Raj Sarrof (2025).

1.3. Underexplored and Frontier Tasks

Underexplored research directions where code-switching tasks intersect with reasoning, safety, creativity, and multimodal interaction.

Reasoning & Abstraction

Lost in the Mix: Evaluating LLM Understanding of Code-Switched Text - Amr Mohamed et al. (2025).
SentMix-3L: A Bangla-English-Hindi Code-Mixed Dataset for Sentiment Analysis - Md Nishat Raihan et al. (2023).
From Human Judgements to Predictive Models: Unravelling Acceptability in Code-Mixed Sentences - Prashant Kodali et al. (2025).
GupShup: Summarizing Open-Domain Code-Switched Conversations - Laiba Mehnaz et al. (2021).

Creative & Code Generation

CodeMixBench: Evaluating Large Language Models on Code Generation with Code-Mixed Prompts - Zhen Yang et al. (2025).
CoCoa: An Encoder-Decoder Model for Controllable Code-switched Generation - Sneha Mondal et al. (2022).
Can You Translate for Me? Code-Switched Machine Translation with Large Language Models - Jyotsana Khatri et al. (2023).

Conversational & Dialogue Systems

BanglAssist: A Bengali-English Generative AI Chatbot for Code-Switching and Dialect-Handling in Customer Service - Francesco Kruk (2025).
Does a code-switching dialogue system help users learn conversational fluency in Choctaw? - Jacqueline Brixey et al. (2025).
X-RiSAWOZ: High-Quality End-to-End Multilingual Dialogue Datasets and Few-shot Agents - Mehrad Moradshahi et al. (2023).
Development of a code-switched Hindi-Marathi dataset and transformer-based architecture for enhanced speech recognition - P. Hemant et al. (2025).
Harmonizing Code-mixed Conversations: Personality-assisted Code-mixed Response Generation in Dialogues - Kumar, S. & Chakraborty, T. (2024).
Dialogue Language Model with Large-Scale Persona Data Engineering - Mengze Hong et al. (2025).

Safety & Multimodal

Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding - Haneul Yoo et al. (2025).
Tongue-Tied: Breaking LLMs Safety Through New Language Learning - Bibek Upadhayay et al. (2025).
CM_CLIP: Unveiling Code-Mixed Multimodal Learning with Cross-Lingual CLIP Adaptations - Gitanjali Kumari et al. (2024).
ToxVidLM: A Multimodal Framework for Toxicity Detection in Code-Mixed Videos - Krishanu Maity et al. (2024).
Multi-task detection of harmful content in code-mixed meme captions using large language models - Bharath Kancharla et al. (2025).
BanglAssist: A Bengali-English Generative AI Chatbot for Code-Switching and Dialect-Handling in Customer Service - Francesco Kruk (2025).
Qorǵau: Evaluating Safety in Kazakh-Russian Bilingual Contexts - Maiya Goloburda, et al. (2025).

2. Datasets & Resources

Corpora, toolkits, and frameworks to support your research.

Datasets

Name	Description	Lang Pair	Type/Task	Link
AfroCS-xs	High-quality human-validated synthetic data.	4 African-En	Machine Translation	🔗
ASCEND	10.6h spontaneous conversational speech.	Mandarin-En	ASR/Dialogue	🔗
BanglishRev	23K Bangla-English reviews for sentiment.	Bengali-En	Sentiment	🔗
CM-DailyDialog	Synthetic code-mixed version of DailyDialog (Hinglish dialogs).	Hindi-En	Dialogue Generation	🔗
CSPref	Human preference dataset for evaluating fluency of LLM-generated code-switched text.	Hindi-En, Tamil-En, Malayalam-En	Preference/Evaluation (LLM-generated CS)	🔗
DravidianCodeMix	~71K code-mixed YouTube comments from Dravidian languages.	Tamil/Kannada/Malayalam-En	Sentiment & Offensive Detection	🔗
GupShup	6.8K+ Hindi-English code-switched conversations with summaries.	Hindi-En	Abstractive Summarization	🔗
HiACC	Hinglish adult & children code-switched corpus.	Hindi-En	Speech/Text	🔗
MMS-5	Multi-scenario multimodal hate speech.	Tamil/Kannada-En	Multimodal Hate Speech	🔗
MultiCoNER	Large-scale benchmark for complex NER.	11 Langs	NER	🔗
My Boli	Corpora & Pre-trained Models for Marathi-English.	Marathi-En	NLU	🔗
RideKE	Over 29K code-switched tweets from Kenyan ride-hailing domain.	English-Swahili-Sheng	Sentiment & Emotion	🔗
SCC (Saudilang Code-switch Corpus)	LLM-generated (GPT-4) code-switched speech dataset with Arabic dialects.	Arabic dialects-En/MSA	ASR (Code-Switched)	🔗
SwitchLingua	Massive multi-ethnic code-switching dataset.	12 Langs	General NLU	🔗
ToxVidLM	Framework & dataset for toxicity in code-mixed videos.	Mixed	Video Toxicity	🔗

Language Modeling for Code-Mixing: The Role of Linguistic Theory based Synthetic Data - Adithya Pratapa, et al. (2018).
Uncovering Code-Mixed Challenges: A Framework for Linguistically Driven Question Generation and Neural Based Question Answering - Deepak Gupta, et al. (2018).
Dependency Parser for Bengali-English Code-Mixed Data enhanced with a Synthetic Treebank - Upendra Kumar, et al. (2019).
Dependency Parsing for English–Malayalam Code-mixed Text - Sanket Sonu, et al. (2019).
A New Dataset for Natural Language Inference from Code-mixed Conversations - Simran Khanuja, et al. (2020).
Detecting Entailment in Code-Mixed Hindi-English Conversations - Sharanya Chakravarthy, et al. (2020).
GupShup: Summarizing Open-Domain Code-Switched Conversations - Laiba Mehnaz, et al. (2021).
CoMeT: Towards Code-Mixed Translation Using Parallel Monolingual Sentences - Devansh Gautam, et al. (2021).
Exploring Language Identification from Short Multilingual Code-Switched Texts - Pei-Chi Lo, et al. (2022).
A Comparison of Architectures and Pretraining Methods for Contextualized Multilingual Word Embeddings - Milana Karaica, et al. (2022).
Code-MixPro: A Framework for Code-Mixed Data Augmentation via Prompt Tuning - Rohit Kundu, et al. (2023).
OffMix-3L: A Novel Code-Mixed Test Dataset in Bangla-English-Hindi for Offensive Language Identification - Goswami, D., et al. (2023).
My Boli: A Comprehensive Suite of Corpora and Pre-trained Models for Marathi-English Code-Mixing - Joshi, A., et al. (2023).
Sentiment Analysis in Code-Mixed Telugu-English Text with Multi-task Learning - Siva Sai, et al. (2024).
Multilingual Harmful Meme Detection Using Large Language Models - Sanchit Ahuja, et al. (2024).
Aligning Speech to Languages to Enhance Code-switching Speech Recognition - Hexin Liu, et al. (2024).
HiACC: Hinglish adult & children code-switched corpus - Singh, S., et al. (2025).
AfroCS-xs: Creating a Compact, High-Quality, Human-Validated Code-Switched Dataset for African Languages - Olaleye, K., et al. (2025).

Frameworks & Toolkits

CoSSAT: Code-Switched Speech Annotation Tool - Shah, S., et al. (2019).
A Unified Framework for Multilingual and Code-Mixed Visual Question Answering - Deepak Gupta, et al. (2020).
CodemixedNLP: An Extensible and Open NLP Toolkit for Code-Mixing - Jayanthi, S. M., et al. (2021).
GCM: A Toolkit for Generating Synthetic Code-mixed Text - Rizvi, M. S. Z., et al. (2021).
Commentator: A Code-mixed Multilingual Text Annotation Framework - Sheth, R., et al. (2024).
ToxVidLM: A Multimodal Framework for Toxicity Detection in Code-Mixed Videos - Krishanu Maity, et al. (2024).
CHAI for LLMs: Improving Code-Mixed Translation in Large Language Models through Reinforcement Learning with AI Feedback - Wenbo Zhang (2024).

3. Model Training & Adaptation

Techniques for building and adapting models to understand and generate code-mixed language.

Pre-training Approaches

Modeling Code-Switch Languages Using Bilingual Parallel Corpus - Grandee Lee, et al. (2020).
SJ_AJ@DravidianLangTech-EACL2021: Task-Adaptive Pre-Training of Multilingual BERT models for Offensive Language Identification - Sai Muralidhar Jayanthi, et al. (2021).
Switch Point biased Self-Training: Re-purposing Pretrained Models for Code-Switching - Parul Chopra, et al. (2021).
Unsupervised Self-Training for Sentiment Analysis of Code-Switched Data - Akshat Gupta, et al. (2021).
Task-Specific Pre-Training and Cross Lingual Transfer for Code-Switched Data - Akshat Gupta, et al. (2021).
BERTologiCoMix: How does Code-Mixing interact with Multilingual BERT? - Santy, S., et al. (2021).
HingBERT: A Code Mixed Hindi-English Dataset and BERT Language Models - Nayak, R. & Joshi, R. (2022).
L3Cube-HingCorpus and HingBERT: A Code Mixed Hindi-English Dataset and BERT Model for Language Identification - Raviraj Joshi, et al. (2022).
MALM: Mixing Augmented Language Modeling for Zero-Shot Machine Translation - Kshitij Gupta (2022).
Transfer Learning for Code-Mixed Data: Do Pretraining Languages Matter? - Kushal Tatariya, et al. (2023).
Improving Pretraining Techniques for Code-Switched NLP - Richeek Das, et al. (2023).
Exploring Enhanced Code-Switched Noising for Pretraining in Neural Machine Translation - Vivek Iyer, et al. (2023).
Investigating and Scaling up Code-Switching for Multilingual Language Model Pre-Training - Zhijun Wang, et al. (2025).
Breaking the Language Barrier: Can One Language Model Understand All Languages? - Sanchit Ahuja, et al. (2025).

Fine-tuning Approaches

From English to Code-Switching: Transfer Learning with Strong Morphological Clues - Gustavo Aguilar, et al. (2020).
FiSSA at SemEval-2020 Task 9: Fine-tuned for Feelings - Bertelt Braaksma, et al. (2020).
A Semi-supervised Approach to Generate the Code-Mixed Text using Pre-trained Encoder and Transfer Learning - Deepak Gupta, et al. (2020).
A Pre-trained Transformer and CNN model with Joint Language ID and Part-of-Speech Tagging for Code-Mixed Social-Media Text - Suman Dowlagar, et al. (2021).
The Effectiveness of Intermediate-Task Training for Code-Switched Natural Language Understanding - Archiki Prasad, et al. (2021).
Saliency-based Multi-View Mixed Language Training for Zero-shot Cross-lingual Classification - Siyu Lai, et al. (2021).
On Utilizing Constituent Language Resources to Improve Downstream Tasks in Hinglish - Vishwajeet Kumar, et al. (2022).
Adapting Multilingual Models for Code-Mixed Translation - Aditya Vavre, et al. (2022).
PRO-CS: An Instance-Based Prompt Composition Technique for Code-Switched Tasks - Srijan Bansal, et al. (2022).
Progressive Sentiment Analysis for Code-Switched Text Data - Sudhanshu Ranjan, et al. (2022).
ENTITYCS: Improving Zero-Shot Cross-lingual Transfer with Entity-Centric Code Switching - Chenxi Whitehouse, et al. (2022).
CoCoa: An Encoder-Decoder Model for Controllable Code-switched Generation - Sneha Mondal, et al. (2022).
Transfer Learning for Code-Mixed Data: Do Pretraining Languages Matter? - Kushal Tatariya, et al. (2023).
From Translation to Generative LLMs: Classification of Code-Mixed Affective Tasks - Anjali Yadav, et al. (2024).
SetFit: A Robust Approach for Offensive Content Detection in Tamil-English Code-Mixed Conversations Using Sentence Transfer Fine-tuning - Kathiravan Pannerselvam, et al. (2024).
Synthetic Data Generation and Joint Learning for Robust Code-Mixed Translation - Kartik, et al. (2024).
COMMIT: Code-Mixing English-Centric Large Language Model for Multilingual Instruction Tuning - Lee, J., et al. (2024).
Demystifying Instruction Mixing for Fine-tuning Large Language Models - Wang, R., et al. (2024).
CHAI for LLMs: Improving Code-Mixed Translation in LLMs through Reinforcement Learning with AI Feedback - Zhang, W., et al. (2025).
LLMsAgainstHate@NLU of Devanagari Script Languages 2025: Hate Speech Detection and Target Identification in Devanagari Languages via Parameter Efficient Fine-Tuning of LLMs - Rushendra Sidibomma, et al. (2025).
Controlling Language Confusion in Multilingual LLMs - Nahyun Lee, et al. (2025).
Fine-Tuning Cross-Lingual LLMs for POS Tagging in Code-Switched Contexts - Shayaan Absar (2025).
Code-Switching Curriculum Learning for Multilingual Transfer in LLMs - Haneul Yoo, et al. (2025).
MIGRATE: Cross-Lingual Adaptation of Domain-Specific LLMs through Code-Switching and Embedding Transfer - Seongtae Hong, et al. (2025).
Next-Level Cantonese-to-Mandarin Translation: Fine-Tuning and Post-Processing with LLMs - Yuqian Dai, et al. (2025).
Investigating and Scaling up Code-Switching for Multilingual Language Model Pre-Training - Zhijun Wang, et al. (2025).
Beyond Monolingual Limits: Fine-Tuning Monolingual ASR for Yoruba-English Code-Switching - Oreoluwa Babatunde, et al. (2025).
Tongue-Tied: Breaking LLMs Safety Through New Language Learning - Bibek Upadhayay, et al. (2025).
Identifying Aggression and Offensive Language in Code-Mixed Tweets: A Multi-Task Transfer Learning Approach - Bharath Kancharla, et al. (2025).
Multi-task detection of harmful content in code-mixed meme captions using large language models with zero-shot, few-shot, and fine-tuning approaches - Bharath Kancharla, et al. (2025).
Adapting Multilingual Models to Code-Mixed Tasks via Model Merging - Sanchit Ahuja, et al. (2025).

Post-training Approaches

Saliency-based Multi-View Mixed Language Training for Zero-shot Cross-lingual Classification - Siyu Lai, et al. (2021).
Multilingual Code-Switching for Zero-Shot Cross-Lingual Intent Prediction and Slot Filling - Jitin Krishnan, et al. (2021).
PRO-CS: An Instance-Based Prompt Composition Technique for Code-Switched Tasks - Bansal, S., et al. (2022).
ENTITYCS: Improving Zero-Shot Cross-lingual Transfer with Entity-Centric Code Switching - Chenxi Whitehouse, et al. (2022).
MulZDG: Multilingual Code-Switching Framework for Zero-shot Dialogue Generation - Yongkang Liu, et al. (2022).
MALM: Mixing Augmented Language Modeling for Zero-Shot Machine Translation - Kshitij Gupta (2022).
Multilingual Large Language Models Are Not (Yet) Code-Switchers - Ruochen Zhang, et al. (2023).
Transfer Learning for Code-Mixed Data: Do Pretraining Languages Matter? - Kushal Tatariya, et al. (2023).
Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages - Zheng-Xin Yong, et al. (2023).
OffMix-3L: A Novel Code-Mixed Test Dataset in Bangla-English-Hindi for Offensive Language Identification - Dhiman Goswami, et al. (2023).
Leveraging Large Language Models for Code-Mixed Data Augmentation in Sentiment Analysis - Zeng, L. (2024).
In-context Mixing (ICM): Code-mixed Prompts for Multilingual LLMs - Shankar, B., et al. (2024).
From Translation to Generative LLMs: Classification of Code-Mixed Affective Tasks - Anjali Yadav, et al. (2024).
COMI-LINGUA: Expert Annotated Large-Scale Dataset for Multitask NLP in Hindi-English Code-Mixing - Rajvee Sheth, et al. (2025).
DweshVaani: An LLM for Detecting Religious Hate Speech in Code-Mixed Hindi-English - Varad Srivastava (2025).
Multi-task detection of harmful content in code-mixed meme captions using large language models with zero-shot, few-shot, and fine-tuning approaches - Bharath Kancharla, et al. (2025).
An Adapted Few-Shot Prompting Technique Using ChatGPT to Advance Low-Resource Languages Understanding - Yash Raj Sarrof, et al. (2025).

4. Evaluation & Benchmarking

Resources for evaluating model performance on code-switching tasks.

📊 Benchmark Comparison

A comparison of major evaluation suites for Code-Switching, categorized by data origin and evaluation focus.

Benchmark	Task Scope	Data Origin	Eval Focus	Link
CodeMixBench	Multitask (LID, POS, NER, SA, MT + Knowledge/Math Reasoning, Truthfulness)	🤖 Synthetic (GPT-assisted)	Multilingual Code-Mixing Capabilities (18 Langs)	🔗
CodeMixBench (Code Gen)	Code Generation (Python)	🧑‍💻 Human (augmented from BigCodeBench)	Syntax & Executability with Code-Mixed Prompts	🔗
COMI-LINGUA	LID, Matrix Language ID, POS, NER, MT	🧑‍💻 Human (expert-annotated)	Multitask NLU & MT in Hindi-English Code-Mixing	🔗
CroCoSum	Cross-lingual Code-switched Summarization	🧑‍💻 Human (English articles, human-written Chinese summaries)	Summarization Quality in Code-Switched Context	🔗
CS-Sum	Dialogue Summarization	🧑‍💻 Human (annotated CS dialogues)	Comprehension & Summarization of CS Dialogues	🔗
CS3-Bench	Speech-to-Speech QA & Conversation	🧑‍💻 Human + 🤖 Synthetic	Language Alignment in Mandarin-En CS Speech	🔗
GLUECoS	QA, NLI, Sentiment, LID, POS, NER	🧑‍💻 Human	NLU Performance	🔗
LinCE	LID, NER, POS, Sentiment	🧑‍💻 Human	Linguistic Accuracy (F1)	🔗
Lost in the Mix	Reading Comprehension, Knowledge, NLI	🤖 Synthetic (LLM-generated CS variants)	Deeper Reasoning in Code-Switched Text	🔗
MEGAVERSE	Multimodal QA + Multitask NLU	⚡ Hybrid	Factuality & Robustness (83 Langs)	🔗
PACMAN	POS Tagging	🤖 Synthetic (parallel generation)	POS Accuracy in Code-Mixed Text (Hindi-En focus)	🔗
SwitchLingua	Multitask NLU (12 Langs)	⚡ Hybrid (LLM-synthesized)	Scale & Diversity in Code-Switching	🔗
X-RiSAWOZ	Multilingual Task-Oriented Dialogue (TOD)	🧑‍💻 Human (translated + rewritten)	Cross-lingual TOD in Code-Mixed Scenarios (En-Hi, En-Es, En-Fr)	🔗

(Legend: 🧑‍💻 Human = Manually annotated/curated; 🤖 Synthetic = Generated by Large Language Models; ⚡ Hybrid = Mixed sources or Human-filtered Synthetic data.)

Benchmarks

LinCE: A Centralized Benchmark for Linguistic Code-switching Evaluation - Aguilar et al. (2020).
GLUECoS: An Evaluation Benchmark for Code-Switched NLP - Khanuja et al. (2020).
PACMAN: Parallel Code-Mixed Data Generation for POS Tagging - Chatterjee et al. (2022).
MultiCoNER: A Large-scale Multilingual Dataset for Complex NER - Malmasi et al. (2022).
X-RiSAWOZ: High-Quality Multilingual Dialogue Datasets - Moradshahi et al. (2023).
CS-Sum: A Benchmark for Code-Switching Dialogue Summarization and the Limits of Large Language Models - Krishnan et al. (2025).
CroCoSum: Cross-Lingual Code-Switched Summarization Benchmark - Zhang et al. (2024).
MEGAVERSE: Benchmarking LLMs Across Languages and Tasks - Ahuja et al. (2024).
COMI-LINGUA: Hindi-English Code-Mixed Multitask Dataset - Sheth et al. (2025).
CodeMixBench: Code Generation with Code-Mixed Prompts - Sawant (2025).
SwitchLingua: Large-Scale Multilingual Code-Switching Dataset - Xie (2025).

Evaluation Metrics

Bleu: a Method for Automatic Evaluation of Machine Translation - Papineni, K., et al. (2002).
chrF: character n-gram F-score for automatic MT evaluation - Popović, M. (2015).
Code-Mixing in Social Media Text - Amitava Das, et al. (2013).
Comparing the Level of Code-Switching in Corpora - Björn Gambäck, et al. (2016).
Automatic Detection of Code-switching Style from Acoustics - SaiKrishna Rallabandi, et al. (2018).
Detecting de minimis Code-Switching in Historical German Books - Shijia Liu, et al. (2020).
Challenges and Limitations with the Metrics Measuring the Complexity of Code-Mixed Text - Vivek Srivastava, et al. (2021).
SyMCoM - Syntactic Measure of Code Mixing A Study Of English-Hindi Code-Mixing - Prashant Kodali, et al. (2022).
PreCogIIITH at HinglishEval: Leveraging Code-Mixing Metrics & Language Model Embeddings To Estimate Code-Mix Quality - Prashant Kodali, et al. (2022).
Code-Switching Metrics Using Intonation Units - Rebecca Pattichis, et al. (2023).
Minimal Pair-Based Evaluation of Code-Switching - Sterner, I. & Teufel, S. (2025).
PIER: A Novel Metric for Evaluating What Matters in Code-Switching - Ugan, E. Y., et al. (2025).
Code-Mixer Ya Nahi: Novel Approaches to Measuring Multilingual LLMs' Code-Mixing Capabilities - Joshi, R., et al. (2025).
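
Several of the entries above concern measuring how much mixing an utterance contains. As a rough illustration only, the sketch below computes one commonly cited utterance-level formulation of the Code-Mixing Index (CMI), usually attributed to Das and Gambäck: CMI = 100 × (1 − max(w_i) / (n − u)), where n is the token count, u the number of language-independent tokens, and max(w_i) the token count of the dominant language. The function name, the tag strings, and the use of `None` for language-independent tokens are assumptions made for this example; check the listed papers for the exact definitions and later refinements.

```python
from collections import Counter
from typing import Iterable, Optional


def code_mixing_index(lang_tags: Iterable[Optional[str]]) -> float:
    """Utterance-level CMI sketch (hypothetical helper, not from any listed toolkit).

    lang_tags holds one language tag per token; None marks tokens treated as
    language-independent (punctuation, named entities, and so on).
    """
    tags = list(lang_tags)
    n = len(tags)
    # Count tokens per language, ignoring language-independent tokens.
    counts = Counter(tag for tag in tags if tag is not None)
    u = n - sum(counts.values())
    if n == u:
        # No language-tagged tokens at all: defined as 0 in this sketch.
        return 0.0
    dominant = max(counts.values())
    return 100.0 * (1.0 - dominant / (n - u))


if __name__ == "__main__":
    # Two Hindi tokens, two English tokens, one language-independent token.
    print(code_mixing_index(["hi", "hi", "en", "en", None]))
```

A monolingual utterance scores 0 under this formulation, and the score grows as the dominant language accounts for a smaller share of the language-tagged tokens.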

5. Multi & Cross-Modal Applications

Applying code-switching NLP to speech, vision, and other modalities.

Speech Processing

ASR
- Dependency Parsing for English–Malayalam Code-mixed Text - Sanket Sonu, et al. (2019).
- Semi-supervised Acoustic and Language Model Training for English-isiZulu Code-Switched Speech Recognition - Astik Biswas, et al. (2020).
- Improving code-switched ASR with linguistic information - Jie Chi, et al. (2022).
- End-to-End Speech Translation for Code Switched Speech - Orion Weller, et al. (2022).
- Representativeness as a Forgotten Lesson for Multilingual and Code-switched Data Collection and Preparation - A. Seza Doğruöz, et al. (2023).
- New Datasets and Controllable Iterative Data Augmentation Method for Code-switching ASR Error Correction - Zhaohong Wan, et al. (2023).
- Code-Mixed Text Augmentation for Latvian ASR - Martins Kronis, et al. (2024).
- The Impact of Code-switched Synthetic Data Quality is Task Dependent: Insights from MT and ASR - Injy Hamed, et al. (2025).
- Development of a code-switched Hindi-Marathi dataset and transformer-based architecture for enhanced speech recognition using dynamic switching algorithms - Palash Jain, et al. (2025).
- Enhancing ASR Accuracy and Coherence Across Indian Languages with Wav2Vec2 and GPT-2 - R. Geetha Rajakumari, et al. (2025).
- Boosting Code-Switching ASR with Mixture of Experts Enhanced Speech-Conditioned LLM - Yu Xi, et al. (2024).
- Adapting Whisper for Low-Resource Hindi-English Code-Mix Speech - Sakshi Koli, et al. (2025).
Speech Translation
- Towards Developing a Multilingual and Code-Mixed Visual Question Answering System by Knowledge Distillation - Humair Raj Khan, et al. (2021).
- End-to-End Speech Translation for Code Switched Speech - Weller, O., et al. (2022).
- CoVoSwitch: Machine Translation of Synthetic Code-Switched Text Based on Intonation Units - Kang, Y. (2024).
- Boosting Code-Switching ASR with Mixture of Experts Enhanced Speech-Conditioned LLM - Yu Xi, et al. (2024).
- The Impact of Code-switched Synthetic Data Quality is Task Dependent: Insights from MT and ASR - Injy Hamed, et al. (2025).
- Code-Switching and Syntax: A Large–Scale Experiment - Igor Sterner, et al. (2025).
- Development of a code-switched Hindi-Marathi dataset and transformer-based architecture for enhanced speech recognition using dynamic switching algorithms - P. Hemant, et al. (2025).
- Enhancing ASR Accuracy and Coherence Across Indian Languages with Wav2Vec2 and GPT-2 - R. Geetha Rajakumari, et al. (2025).

Vision-Language & Document Processing

A Unified Framework for Multilingual and Code-Mixed Visual Question Answering - Deepak Gupta, et al. (2020).
Towards Developing a Multilingual and Code-Mixed Visual Question Answering System by Knowledge Distillation - Raj Khan, H., et al. (2021).
"To Have the 'Million' Readers Yet": Building a Digitally Enhanced Edition of the Bilingual Irish-English Newspaper - Dereza, O., et al. (2024).
MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks - Sanchit Ahuja, et al. (2024).
ToxVidLM: A Multimodal Framework for Toxicity Detection in Code-Mixed Videos - Krishanu Maity, et al. (2024).
Multi-task detection of harmful content in code-mixed meme captions using large language models with zero-shot, few-shot, and fine-tuning approaches - Bharath Kancharla, et al. (2025).
Enhancing Participatory Development Research in South Asia through LLM Agents System: An Empirically-Grounded Methodological Initiative from Field Evidence in Sri Lankan - Xinjie Zhao, et al. (2025).

Cross-Modal Integration

Code-Switched Language Models Using Neural Based Synthetic Data from Parallel Sentences - Genta Indra Winata, et al. (2019).
Translate and Classify: Improving Sequence Level Classification for English-Hindi Code-Mixed Data - Devansh Gautam, et al. (2021).
Data Augmentation to Address Out of Vocabulary Problem in Low Resource Sinhala English Neural Machine Translation - Aloka Fernando, et al. (2021).
CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command Recognition - Dai, W., et al. (2022).
Typo-Robust Representation Learning for Dense Retrieval - Panuthep Tasawong, et al. (2023).
Advancing Multi-Criteria Chinese Word Segmentation Through Criterion Classification and Denoising - Tzu Hsuan Chou, et al. (2023).
ToxVidLM: A Multimodal Framework for Toxicity Detection in Code-Mixed Videos - Maity, K., et al. (2024).
Machine Translation and Transliteration for Indo-Aryan Languages: A Systematic Review - Sandun Sameera Perera, et al. (2025).

Workshops & Shared Tasks

A list of academic workshops and community shared tasks dedicated to code-switching.

Contributing

Your contributions are always welcome and make this community resource better!

If you have a paper, dataset, or tool you'd like to add:

- Fork the repository.
- Add your resource to the relevant section.
- Follow the existing format and include a direct link.
- Submit a pull request!
