
gender-classification

Gender Classification Project - Machine Learning course (COSC 6342)

University of Houston - Spring 2025

Team Members

  • Minh Nguyen
  • Mahtab Jeyhani

Project Overview

Given a set of labeled blogs written by males and females, predict the gender of the author of a new blog.

Dataset

  • Sample blog author dataset used in [Mukherjee and Liu, EMNLP 2010], available from http://www.cs.uic.edu/~liub/FBS/blog-gender-dataset.rar, or in data/raw/blog-gender-dataset.zip in this repository.
  • The extracted file is an .xlsx file; we converted it to CSV and saved it as gender-classification.csv in data/raw/.
  • Blog Authorship Corpus from Kaggle is used for supervised contrastive pre-training.
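Once converted, the CSV can be read with the standard library. A minimal sketch of a loader, assuming hypothetical column names "Blog" and "Gender" (the actual schema of gender-classification.csv may differ):

```python
import csv
import io  # used only for the in-memory example below

# Hypothetical loader; "Blog" and "Gender" are assumed column names,
# not confirmed against the real gender-classification.csv schema.
def load_blogs(fileobj):
    """Yield (text, label) pairs, normalizing gender labels to "M"/"F"."""
    for row in csv.DictReader(fileobj):
        label = row["Gender"].strip().upper()[:1]  # "male"/"Female" -> "M"/"F"
        yield row["Blog"].strip(), label
```

For example, `list(load_blogs(open("data/raw/gender-classification.csv")))` would return the full list of (text, label) pairs, assuming those column names.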

Report Paper

  • You can find the report paper here

Requirements

  • Python 3.12+
  • Jupyter Notebook

Installation

  1. Clone the project
  2. Create a virtual environment
    python -m venv venv
  3. Activate the virtual environment
    • On Windows:
      venv\Scripts\activate
    • On macOS/Linux:
      source venv/bin/activate
  4. Install the required packages
    pip install -r requirements.txt

Usage

  1. Download the pre-trained models from our Hugging Face Model Hub:
  • bert_supervised_contrastive_pretrained_final_pca.pth (Pre-trained contrastive model)
  • best_bert_supervised_final_pca.pth (Fine-tuned supervised model)
  • Place the downloaded files in the models/ directory.
  2. Run the main training pipeline script (optional: only needed if no trained model is available in models/, or if you want to retrain the model):

NOTE: This step may take a long time to run, depending on the size of the dataset and the hardware used.

  • Run code in pipeline_final_pca.ipynb to execute the entire training pipeline.

This will execute the entire training and evaluation pipeline, including data preprocessing, supervised contrastive learning, supervised fine-tuning, and evaluation.
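To illustrate the supervised contrastive learning stage, here is a minimal NumPy sketch of a SupCon-style loss (after Khosla et al., 2020). This is illustrative only, not the project's actual implementation in src/contrastive_learning.py:

```python
import numpy as np

# Minimal SupCon-style loss sketch; NOT the code in contrastive_learning.py.
def supcon_loss(z: np.ndarray, labels: np.ndarray, tau: float = 0.1) -> float:
    """Supervised contrastive loss over embeddings z of shape (n, d)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # L2-normalize
    sim = (z @ z.T) / tau                              # scaled cosine similarity
    n = len(labels)
    self_mask = np.eye(n, dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)            # exclude self-pairs
    # Row-wise log-softmax, numerically stabilized by the row max.
    m = sim.max(axis=1, keepdims=True)
    log_prob = sim - (m + np.log(np.exp(sim - m).sum(axis=1, keepdims=True)))
    # Positives share the anchor's label; average their log-probability.
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    per_anchor = np.where(pos, log_prob, 0.0).sum(axis=1) / np.maximum(pos.sum(axis=1), 1)
    return float(-per_anchor[pos.any(axis=1)].mean())
```

Intuitively, the loss is small when same-gender embeddings cluster together, which is what the pre-training stage optimizes for before fine-tuning.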

Project Structure

  • data/: Contains the dataset and any processed data.
    • raw/: Original dataset files.
    • processed/: Processed dataset files.
  • models/: Contains the trained models.
  • src/: Source code for data processing, model training, and evaluation.
    • config.py: Configuration file for setting parameters and paths.
    • data_preprocessing.py: Code for loading and processing the dataset.
    • data_augmentation.py: Code for augmenting the dataset.
    • dataset.py: Code for creating custom dataset classes for contrastive learning and supervised fine-tuning.
    • model.py: Code for defining and training the machine learning model.
    • contrastive_learning.py: Code for implementing contrastive learning.
    • supervised_fine_tune.py: Code for fine-tuning the model with supervised learning.
    • evaluation.py: Code for evaluating the model's performance with various metrics.
    • utils.py: Utility functions.
  • pipeline_final_pca.ipynb: Jupyter notebook version of the pipeline script for visualization.
  • requirements.txt: List of required Python packages.
  • README.md: This file, providing an overview of the project.
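As an illustration of the kind of metrics evaluation.py reports for this binary task, here is a small self-contained sketch (hypothetical helper, not the project's actual evaluation code; the positive label "F" is an arbitrary choice):

```python
# Hypothetical metrics helper mirroring common binary-classification
# reporting; not the implementation in src/evaluation.py.
def binary_metrics(y_true, y_pred, positive="F"):
    """Return accuracy, precision, recall, and F1 for one positive class."""
    tp = sum(t == positive == p for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1}
```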
