epfl-ada/ada-2025-project-datasentinels


Subreddit conflicts: the alliances, the rivalries, and the machinations behind both

A DataSentinels investigation.

Quickstart

# clone project
git clone git@github.com:epfl-ada/ada-2025-project-datasentinels.git
cd ada-2025-project-datasentinels

# create virtual environment
python -m venv venv
source venv/bin/activate

# install requirements
pip install -r pip_requirements.txt

Project Structure

├── graphs                      <- Interactive graphs
│
├── data                        <- Project data files should go here (due to size constraints, they have been placed in https://drive.google.com/drive/folders/1dvRtim53A-2JOBN3tsHX9QZkzXu-B8td?usp=drive_link)
├── src                         <- Source code
│   ├── data                            <- Data directory
│   ├── models                          <- Model directory
│   ├── utils                           <- Utility directory
│
├── results.ipynb               <- Notebook with final results
│
├── .gitignore                  <- List of files ignored by git
├── pip_requirements.txt        <- File for installing python dependencies
└── README.md

Abstract

Our project studies how conflicts and alliances form and spread between online communities on Reddit. We aim to understand whether disputes between subreddits create “chain reactions”, where one attack leads to others, and whether relationships follow patterns like “the enemy of my enemy is my friend”. Using the Reddit Hyperlink Network dataset, which records positive and negative links between subreddits over time, we built a dynamic network to observe how communities attack or support each other. We then looked for patterns of conflict spreading (domino effects) and balanced or unbalanced relationships among groups of three subreddits. We also used an additional dataset, the Reddit Embedding Dataset, to compare how similar subreddits behave toward each other. Overall, the project aims to reveal how online hostility spreads and how alliances emerge across Reddit communities.

Research Questions

  1. How often do subreddit conflicts (negative hyperlinks) trigger chain reactions of further hostilities in the network?
  2. How are the length and depth of conflict cascades distributed, and what do these distributions reveal about conflict dynamics between subreddits?
  3. Which subreddits contribute the most to overall toxicity, and which ones actually take part in larger conflicts? Can toxicity be proxied by other metrics, such as participation in unusual conflict schemes?
  4. What influences the formation of alliances or retaliations between subreddits?

Supplementary Dataset

In addition to the Reddit Hyperlink Network, we used the Reddit Embedding Dataset to make our analysis more complete. This dataset gives each subreddit a 300-dimensional vector, called an embedding, that represents its general theme and the type of users who post there. These embeddings were created using a model similar to word2vec, which learns relationships between subreddits based on user posting patterns. In simple terms, two subreddits with similar embeddings tend to discuss related topics or share overlapping audiences.

By combining this dataset with the Reddit Hyperlink Network, we can link community behavior (attacks and alliances) with community similarity (topics and interests). This lets us test whether subreddits that are thematically close are more likely to cooperate or to compete. For example, two political subreddits might attack each other more often than gaming communities that talk about similar games.

We also used the embeddings to cluster subreddits by similarity and check whether conflicts happen mostly within the same cluster or between different ones. Using PCA, we projected the clustered embeddings onto two dimensions and visualized them to better understand the overall structure of Reddit’s communities. This helped us see whether alliances form among similar groups or whether some communities act as bridges or sources of conflict across unrelated topics.
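The within-cluster versus between-cluster comparison described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the function name `conflict_share_by_cluster`, the `cluster_of` mapping, and the toy labels are all hypothetical, and the clustering step itself is assumed to have been done beforehand.

```python
from collections import Counter

def conflict_share_by_cluster(negative_links, cluster_of):
    """Share of negative links that stay within a cluster vs. cross clusters.

    negative_links: iterable of (source, target) subreddit pairs.
    cluster_of: dict mapping subreddit name -> cluster label (hypothetical input,
    assumed to come from clustering the embeddings).
    """
    counts = Counter()
    for source, target in negative_links:
        # Subreddits without an embedding have no cluster label, so skip them.
        if source not in cluster_of or target not in cluster_of:
            continue
        key = "within" if cluster_of[source] == cluster_of[target] else "between"
        counts[key] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: counts[key] / total for key in ("within", "between")}

# Toy data for illustration only.
cluster_of = {"a": 0, "b": 0, "c": 1, "d": 1}
negative_links = [("a", "b"), ("a", "c"), ("c", "d"), ("b", "d"), ("a", "x")]
shares = conflict_share_by_cluster(negative_links, cluster_of)
print(shares)  # {'within': 0.5, 'between': 0.5}
```

Comparing these shares against the shares for positive links would indicate whether hostility is more concentrated inside clusters than support is.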

Overall, this supplementary dataset adds a valuable “semantic” layer to our project. It let us move beyond just counting links and analyze how the meaning, content, and user overlap between subreddits affect the way they interact over time.

Methods

We began by preparing the datasets. The Reddit Hyperlink Network is divided into two files, one for hyperlinks found in post titles and one for those in post bodies. We merged these files into a single dataset to have a complete view of subreddit interactions. Each row represents one hyperlink, with its source and target subreddits, timestamp, and sentiment label (+1 for positive or neutral, –1 for negative).
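As a rough sketch of the merge step, the two files could be read and combined along these lines. The column names (`SOURCE_SUBREDDIT`, `TARGET_SUBREDDIT`, `TIMESTAMP`, `LINK_SENTIMENT`), the tab separator, and the timestamp format are assumptions about the dataset files, and the helper names are hypothetical; the project's own loading code may differ. In-memory strings stand in for the real files.

```python
import csv
import io

def load_links(tsv_file, origin):
    """Read one hyperlink file and tag each row with where the link was found."""
    rows = []
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        rows.append({
            "source": row["SOURCE_SUBREDDIT"],   # assumed column name
            "target": row["TARGET_SUBREDDIT"],   # assumed column name
            "timestamp": row["TIMESTAMP"],       # assumed "YYYY-MM-DD HH:MM:SS"
            "sentiment": int(row["LINK_SENTIMENT"]),  # +1 or -1
            "origin": origin,
        })
    return rows

def merge_links(title_rows, body_rows):
    """Concatenate both sources and sort chronologically.

    Sorting the timestamp strings works only because the assumed format
    sorts lexicographically in time order.
    """
    return sorted(title_rows + body_rows, key=lambda r: r["timestamp"])

# Toy stand-ins for the title and body files.
header = "SOURCE_SUBREDDIT\tTARGET_SUBREDDIT\tTIMESTAMP\tLINK_SENTIMENT\n"
title_tsv = io.StringIO(header + "a\tb\t2014-01-02 00:00:00\t-1\n")
body_tsv = io.StringIO(header + "b\tc\t2014-01-01 00:00:00\t1\n")

links = merge_links(load_links(title_tsv, "title"), load_links(body_tsv, "body"))
for link in links:
    print(link["timestamp"], link["source"], "->", link["target"], link["sentiment"])
```

Keeping an `origin` field is optional, but it makes it possible to check later whether title links and body links behave differently.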

We then built a temporal signed network that captures link sentiment over time. In this network, each node represents a subreddit, and each edge represents a positive or negative connection at a given time. This enabled us to examine conflicts and alliances within specific time frames, which a more nuanced analysis of temporal proximity requires. Simply put, it allowed us to identify when conflicts start, grow, or fade away.
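One possible way to represent such a network is a time-sorted list of signed links that can be sliced into windows. The class below is a minimal sketch under that assumption; the name `TemporalSignedNetwork`, its `snapshot` method, and the half-open window convention are hypothetical and not taken from the project code.

```python
from bisect import bisect_left
from collections import defaultdict
from datetime import datetime

class TemporalSignedNetwork:
    """Time-ordered signed links between subreddits (illustrative sketch)."""

    def __init__(self, links):
        # links: iterable of (timestamp, source, target, sign), sign in {+1, -1}.
        self.links = sorted(links)
        self.times = [link[0] for link in self.links]

    def snapshot(self, start, end):
        """Count positive and negative links per edge for start <= t < end."""
        lo = bisect_left(self.times, start)
        hi = bisect_left(self.times, end)
        edges = defaultdict(lambda: {"pos": 0, "neg": 0})
        for _, source, target, sign in self.links[lo:hi]:
            edges[(source, target)]["pos" if sign > 0 else "neg"] += 1
        return dict(edges)

# Toy data for illustration only.
network = TemporalSignedNetwork([
    (datetime(2014, 1, 1), "a", "b", -1),
    (datetime(2014, 1, 5), "a", "b", 1),
    (datetime(2014, 2, 1), "b", "c", -1),
])
january = network.snapshot(datetime(2014, 1, 1), datetime(2014, 2, 1))
print(january)  # {('a', 'b'): {'pos': 1, 'neg': 1}}
```

Comparing snapshots of consecutive windows is one way to see a conflict start, grow, or fade.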

Afterwards, we studied conflict cascades. These are sequences where one subreddit attacks another, and that second subreddit later attacks a third (for example, A → B → C). We looked for patterns that show how common it is for conflicts to spread like a chain reaction, and we calculated measures such as how long it takes for such chains to form, how many subsequent subreddits are affected, and how the lengths of these cascades are distributed.
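To make the notion of a cascade concrete, here is a minimal sketch that computes, for every attack, the length of the longest time-respecting chain that starts from it. Two choices are assumptions made for illustration and may not match the notebook: each next attack must fall within a fixed `window` after the previous one, and direct retaliation (B attacking A back) is not counted as spreading. The function name `cascade_depths` is hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta

def cascade_depths(attacks, window):
    """Longest chain A -> B -> C -> ... starting at each attack.

    attacks: list of (timestamp, source, target) negative links.
    Returns (sorted_attacks, depths), where depths[i] counts the attacks in
    the longest chain that starts with sorted_attacks[i].
    """
    attacks = sorted(attacks)
    by_source = defaultdict(list)
    for idx, (_, source, _) in enumerate(attacks):
        by_source[source].append(idx)

    depths = [1] * len(attacks)
    # Walk backwards in time so every later attack already has its final depth.
    for i in range(len(attacks) - 1, -1, -1):
        t, source, target = attacks[i]
        for j in by_source.get(target, []):
            t_next, _, next_target = attacks[j]
            # Assumption: strictly later, within the window, and not a
            # direct retaliation against the original attacker.
            if t < t_next <= t + window and next_target != source:
                depths[i] = max(depths[i], 1 + depths[j])
    return attacks, depths

# Toy data for illustration only.
attacks = [
    (datetime(2014, 1, 1), "a", "b"),
    (datetime(2014, 1, 2), "b", "c"),
    (datetime(2014, 1, 2), "b", "a"),   # retaliation, not counted as spreading
    (datetime(2014, 1, 3), "c", "d"),
    (datetime(2014, 1, 20), "x", "y"),  # isolated attack
]
ordered, depths = cascade_depths(attacks, window=timedelta(days=3))
print(depths)           # [3, 1, 2, 1, 1]
print(Counter(depths))  # distribution of cascade lengths
```

The `Counter` over `depths` gives the kind of length distribution the paragraph above refers to.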

We also examined triads, which are groups of three connected subreddits. We checked whether these triads follow the rules of balance theory. For example, if a subreddit A attacks a subreddit B, which then attacks a third subreddit C, we would expect C to form an "alliance" with A, meaning that overall the links from C to A would be positive.
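In structural balance theory, a triad is usually called balanced when the product of its three edge signs is positive. The sketch below applies that rule to the directed cycle A → B → C → A from the example above. Collapsing each edge to the sign of its net sentiment is an assumption made for illustration, and the helper names are hypothetical.

```python
def edge_sign(net_sentiment, source, target):
    """Sign of the aggregated links from source to target, or None if unknown."""
    total = net_sentiment.get((source, target))
    if total is None or total == 0:
        return None
    return 1 if total > 0 else -1

def is_balanced(net_sentiment, a, b, c):
    """True if the cycle a -> b -> c -> a has a positive sign product.

    Returns None when any of the three edges is missing or neutral.
    """
    signs = [
        edge_sign(net_sentiment, a, b),
        edge_sign(net_sentiment, b, c),
        edge_sign(net_sentiment, c, a),
    ]
    if None in signs:
        return None
    return signs[0] * signs[1] * signs[2] > 0

# Toy data: net sentiment = (positive links) - (negative links) per directed pair.
enemy_of_enemy = {("a", "b"): -3, ("b", "c"): -1, ("c", "a"): 2}
all_hostile = {("a", "b"): -1, ("b", "c"): -1, ("c", "a"): -1}

print(is_balanced(enemy_of_enemy, "a", "b", "c"))  # True
print(is_balanced(all_hostile, "a", "b", "c"))     # False
```

The first toy triad is the "enemy of my enemy is my friend" case; the second, where all three links are hostile, is unbalanced.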

As another metric of hostility, we checked whether certain subreddits are amplifiers of conflict, meaning that they create more conflict after enduring an attack.
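One simple way to quantify amplification, offered here only as a hypothetical sketch, is the average number of attacks a subreddit launches shortly after each attack it receives. The function name `amplification_scores`, the fixed `window`, and the scoring rule are assumptions; the notebook may define amplifiers differently.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def amplification_scores(attacks, window):
    """Mean number of outgoing attacks within `window` after each incoming attack.

    attacks: list of (timestamp, source, target) negative links.
    Only subreddits that were attacked at least once get a score.
    """
    outgoing = defaultdict(list)
    for t, source, _ in attacks:
        outgoing[source].append(t)

    received = defaultdict(int)
    follow_ups = defaultdict(int)
    for t, _, target in attacks:
        received[target] += 1
        follow_ups[target] += sum(
            1 for t_out in outgoing.get(target, []) if t < t_out <= t + window
        )
    return {sub: follow_ups[sub] / received[sub] for sub in received}

# Toy data for illustration only.
attacks = [
    (datetime(2014, 1, 1), "a", "b"),
    (datetime(2014, 1, 2), "b", "c"),
    (datetime(2014, 1, 2), "b", "d"),
    (datetime(2014, 1, 10), "c", "e"),  # too late to count as a reaction
]
scores = amplification_scores(attacks, window=timedelta(days=3))
print(scores)  # {'b': 2.0, 'c': 0.0, 'd': 0.0, 'e': 0.0}
```

Under this rule, a score above 1 means the subreddit sends out more attacks than it receives in the short term, which is one reading of "amplifier".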

Finally, we used the embedding vectors to study how similarity between subreddits affects their relationships. We computed cosine similarity between embeddings to assess whether similar subreddits are more likely to be friends or enemies. This part connects the content dimension (what subreddits talk about) with the interaction dimension (how they treat each other). Combining these analyses helped us understand not only how conflicts spread but also why some communities tend to clash while others stay connected or neutral.
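Cosine similarity between two embeddings u and v is their dot product divided by the product of their norms. The sketch below computes it and then averages it separately over negative and positive links. The helper `mean_similarity_by_sign` and the two-dimensional toy vectors are hypothetical; the real embeddings have 300 dimensions.

```python
import math
from collections import defaultdict

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

def mean_similarity_by_sign(links, embeddings):
    """Average source-target similarity, grouped by link sentiment (+1 / -1)."""
    sims = defaultdict(list)
    for source, target, sign in links:
        # Skip links where either subreddit has no embedding.
        if source in embeddings and target in embeddings:
            sims[sign].append(cosine_similarity(embeddings[source], embeddings[target]))
    return {sign: sum(values) / len(values) for sign, values in sims.items()}

# Toy data for illustration only.
embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}
links = [("a", "b", -1), ("a", "c", 1)]
means = mean_similarity_by_sign(links, embeddings)
print(means)  # {-1: 1.0, 1: 0.0}
```

On real data, a higher mean similarity for negative links than for positive ones would suggest that thematically close subreddits clash more often.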

Results

You can find the answers to all of the research questions in the precompiled notebook. Note that a lot of the plots are interactive and are not shown on GitHub, so we recommend cloning the repo and viewing them locally. Also check out our website (repo) for a more engaging presentation of the data story!

The DataSentinels team: Members and contributions

  • Clément Josso: Website with a focus on creating a UI that closely resembles Reddit's.
  • Faruk Zahiragić: Research Question 4.
  • Mamoun Imghi: Website with a focus on telling an engaging story.
  • Mohamed Bouchnak: Research Questions 1 and 2.
  • Vasilis Gkikas: Research Question 3.

About

ada-2025-project-datasentinels created by GitHub Classroom
