Gender Classification Project - Machine Learning course (COSC 6342)
University of Houston - Spring 2025
- Minh Nguyen
- Mahtab Jeyhani
Given a set of labeled blogs written by males and females, predict the gender of the author of a new blog.
- Sample blog author dataset used in [Mukherjee and Liu, EMNLP 2010], available from: http://www.cs.uic.edu/~liub/FBS/blog-gender-dataset.rar
  A copy is also included at data/raw/blog-gender-dataset.zip
- The extracted file is an .xlsx file; we converted it to CSV format and saved it as gender-classification.csv in data/raw/
- Blog Authorship Corpus from Kaggle is used for supervised contrastive pre-training.
- You can find the report paper here
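Once the converted CSV is in data/raw/, a quick look at the label balance can catch conversion problems before training. The following is a minimal sketch using only the standard library. The column name `gender` and the label clean-up (trimming whitespace, upper-casing) are assumptions for illustration and are not taken from the project code, so adjust them to the actual header of gender-classification.csv.

```python
import csv
from collections import Counter
from pathlib import Path


def count_labels(csv_path, label_column="gender"):
    """Count how often each label appears in one column of a CSV file.

    `label_column` is a hypothetical column name; replace it with the
    real header used in the converted dataset.
    """
    counts = Counter()
    # errors="replace" keeps the check going if the file has stray bytes.
    with open(csv_path, newline="", encoding="utf-8", errors="replace") as handle:
        for row in csv.DictReader(handle):
            # Normalise labels such as " m" or "F " to "M" / "F".
            label = (row.get(label_column) or "").strip().upper()
            if label:
                counts[label] += 1
    return counts


if __name__ == "__main__":
    path = Path("data/raw/gender-classification.csv")
    if path.exists():
        print(count_labels(path))
    else:
        print(f"{path} not found; run this from the project root.")
```

If the counts show unexpected label variants, or far fewer rows than expected, the conversion from .xlsx is worth re-checking.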
- Python 3.12+
- Jupyter Notebook
- Clone the project
- Create a virtual environment
python -m venv venv
- Activate the virtual environment
- On Windows:
venv\Scripts\activate
- On macOS/Linux:
source venv/bin/activate
- Install the required packages
pip install -r requirements.txt
- Download the pre-trained models from our Hugging Face Model Hub:
  - bert_supervised_contrastive_pretrained_final_pca.pth (Pre-trained contrastive model)
  - best_bert_supervised_final_pca.pth (Fine-tuned supervised model)
- Place the downloaded files in the models/ directory.
- Run the main training pipeline (optional: only needed if no trained model is available in models/, or if you want to retrain the model):
NOTE: This step may take a long time, depending on the size of the dataset and the hardware used.
- Run the code in pipeline_final_pca.ipynb. This executes the entire training and evaluation pipeline, including data preprocessing, supervised contrastive learning, supervised fine-tuning, and evaluation.
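To decide whether the optional training step is needed, it can help to check which of the two checkpoint files are already in models/. The following is a minimal sketch, not part of the project code: the helper name `missing_checkpoints` is hypothetical, and it assumes the script is run from the project root.

```python
from pathlib import Path

# File names taken from the download step above.
EXPECTED_CHECKPOINTS = (
    "bert_supervised_contrastive_pretrained_final_pca.pth",
    "best_bert_supervised_final_pca.pth",
)


def missing_checkpoints(models_dir="models"):
    """Return the expected checkpoint files that are absent from models_dir."""
    root = Path(models_dir)
    return [name for name in EXPECTED_CHECKPOINTS if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoints:", ", ".join(missing))
        print("Download them or run pipeline_final_pca.ipynb to train.")
    else:
        print("Both checkpoints found; the training step can be skipped.")
```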
- data/: Contains the dataset and any processed data.
  - raw/: Original dataset files.
  - processed/: Processed dataset files.
- models/: Contains the trained models.
- src/: Source code for data processing, model training, and evaluation.
  - config.py: Configuration file for setting parameters and paths.
  - data_preprocessing.py: Code for loading and processing the dataset.
  - data_augmentation.py: Code for augmenting the dataset.
  - dataset.py: Code for creating custom dataset classes for contrastive learning and supervised fine-tuning.
  - model.py: Code for defining and training the machine learning model.
  - contrastive_learning.py: Code for implementing contrastive learning.
  - supervised_fine_tune.py: Code for fine-tuning the model with supervised learning.
  - evaluation.py: Code for evaluating the model's performance with various metrics.
  - utils.py: Utility functions.
- pipeline_final_pca.ipynb: Jupyter notebook version of the pipeline script for visualization.
- requirements.txt: List of required Python packages.
- README.md: This file, providing an overview of the project.