
gender-classification

Gender Classification Project - Machine Learning course (COSC 6342)

University of Houston - Spring 2025

Team Members

  • Minh Nguyen
  • Mahtab Jeyhani

Project Overview

Given a set of labeled blogs written by males and females, predict the gender of the author of a new blog.

Dataset

  • Sample blog author dataset used in [Mukherjee and Liu, EMNLP 2010], available from http://www.cs.uic.edu/~liub/FBS/blog-gender-dataset.rar, or in data/raw/blog-gender-dataset.zip in this repository.
  • The extracted file is an .xlsx file; we converted it to CSV and saved it as gender-classification.csv in data/raw/.
  • Blog Authorship Corpus from Kaggle is used for supervised contrastive pre-training.
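Once converted, the CSV can be read with the standard library. A minimal sketch of a loader, assuming hypothetical column names "Blog" and "Gender" (the actual schema of gender-classification.csv may differ):

```python
import csv
import io  # used only for the in-memory example below

# Hypothetical loader; "Blog" and "Gender" are assumed column names,
# not confirmed against the real gender-classification.csv schema.
def load_blogs(fileobj):
    """Yield (text, label) pairs, normalizing gender labels to "M"/"F"."""
    for row in csv.DictReader(fileobj):
        label = row["Gender"].strip().upper()[:1]  # "male"/"Female" -> "M"/"F"
        yield row["Blog"].strip(), label
```

For example, `list(load_blogs(open("data/raw/gender-classification.csv")))` would return the full list of (text, label) pairs, assuming those column names.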

Report Paper

  • You can find the report paper here

Requirements

  • Python 3.12+
  • Jupyter Notebook

Installation

  1. Clone the project
  2. Create a virtual environment
    python -m venv venv
  3. Activate the virtual environment
    • On Windows:
      venv\Scripts\activate
    • On macOS/Linux:
      source venv/bin/activate
  4. Install the required packages
    pip install -r requirements.txt

Usage

  1. Download the pre-trained models from our Hugging Face Model Hub:
  • bert_supervised_contrastive_pretrained_final_pca.pth (Pre-trained contrastive model)
  • best_bert_supervised_final_pca.pth (Fine-tuned supervised model)
  • Place the downloaded files in the models/ directory.
  2. Run the main training pipeline script (optional: only needed if no trained model is available in models/, or if you want to retrain the model):

NOTE: This step may take a long time to run, depending on the size of the dataset and the hardware used.

  • Run code in pipeline_final_pca.ipynb to execute the entire training pipeline.

This will execute the entire training and evaluation pipeline, including data preprocessing, supervised contrastive learning, supervised fine-tuning, and evaluation.
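To illustrate the supervised contrastive learning stage, here is a minimal NumPy sketch of a SupCon-style loss (after Khosla et al., 2020). This is illustrative only, not the project's actual implementation in src/contrastive_learning.py:

```python
import numpy as np

# Minimal SupCon-style loss sketch; NOT the code in contrastive_learning.py.
def supcon_loss(z: np.ndarray, labels: np.ndarray, tau: float = 0.1) -> float:
    """Supervised contrastive loss over embeddings z of shape (n, d)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # L2-normalize
    sim = (z @ z.T) / tau                              # scaled cosine similarity
    n = len(labels)
    self_mask = np.eye(n, dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)            # exclude self-pairs
    # Row-wise log-softmax, numerically stabilized by the row max.
    m = sim.max(axis=1, keepdims=True)
    log_prob = sim - (m + np.log(np.exp(sim - m).sum(axis=1, keepdims=True)))
    # Positives share the anchor's label; average their log-probability.
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    per_anchor = np.where(pos, log_prob, 0.0).sum(axis=1) / np.maximum(pos.sum(axis=1), 1)
    return float(-per_anchor[pos.any(axis=1)].mean())
```

Intuitively, the loss is small when same-gender embeddings cluster together, which is what the pre-training stage optimizes for before fine-tuning.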

Project Structure

  • data/: Contains the dataset and any processed data.
    • raw/: Original dataset files.
    • processed/: Processed dataset files.
  • models/: Contains the trained models.
  • src/: Source code for data processing, model training, and evaluation.
    • config.py: Configuration file for setting parameters and paths.
    • data_preprocessing.py: Code for loading and processing the dataset.
    • data_augmentation.py: Code for augmenting the dataset.
    • dataset.py: Code for creating custom dataset classes for contrastive learning and supervised fine-tuning.
    • model.py: Code for defining and training the machine learning model.
    • contrastive_learning.py: Code for implementing contrastive learning.
    • supervised_fine_tune.py: Code for fine-tuning the model with supervised learning.
    • evaluation.py: Code for evaluating the model's performance with various metrics.
    • utils.py: Utility functions.
  • pipeline_final_pca.ipynb: Jupyter notebook version of the pipeline script for visualization.
  • requirements.txt: List of required Python packages.
  • README.md: This file, providing an overview of the project.
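As an illustration of the kind of metrics evaluation.py reports for this binary task, here is a small self-contained sketch (hypothetical helper, not the project's actual evaluation code; the positive label "F" is an arbitrary choice):

```python
# Hypothetical metrics helper mirroring common binary-classification
# reporting; not the implementation in src/evaluation.py.
def binary_metrics(y_true, y_pred, positive="F"):
    """Return accuracy, precision, recall, and F1 for one positive class."""
    tp = sum(t == positive == p for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1}
```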
