🤗 Welcome to this repo! This is the project for the ADA course (Fall 2025) at EPFL. We explore the Stock Market Dataset and investigate the possible relationship between the stock market and U.S. presidential elections.
📖 For our interactive data story, please visit: https://sjj1017.github.io/ada_penta_data_story/
This project investigates the complex relationship between U.S. presidential elections and stock market performance. It explores whether political factors significantly influence financial markets and how the market may send signals about presidential elections. Our motivation comes from the belief that political uncertainty is a key driver of market volatility, and this project sets out to test that belief. We first present the data profile to show that our project is based on individual stocks (non-ETFs) starting from 1991. We then analyze year-wise and month-wise price and volatility changes during election years and non-election years. Next, we zoom in to examine the sensitivity and political leaning of each stock using statistical methods such as regression. Finally, we focus on certain dramatic events to analyze stock behavior during these short-lived periods using time series forecasting and counterfactual analysis.
In our earlier exploration, we found some interesting relationships between the stock market and presidential elections over the long historical record. Based on this, we have several initial research questions (we may find more during our analysis):
- Is the election period's impact on the stock market statistically significant? Do major events during an election have further influence (for example, the attempted assassination of Donald Trump in Pennsylvania)?
- Which stocks are the most sensitive to election results and polling data?
- Do stocks have their own political inclination?
- Can the different behaviors of stocks be understood at a higher level?
We expand the original dataset and add 3 additional datasets, including (1) industry & sector metadata, (2) election outcomes, and (3) polling data. The table below shows metadata for these datasets.
| Category | Item | Source | Files / Directories | Size |
|---|---|---|---|---|
| Data Expansion | 📈 Stocks | NASDAQ Trader directory + yfinance | dataset/stocks/, dataset/etfs/, dataset/symbols_valid_meta.csv (local only, not on GitHub) | ~3GB |
| Industry / Sector Metadata | 📈 Stocks | yfinance | symbols_valid_meta_with_industry_sector.csv | ~7k–10k rows (U.S. equities + some ETFs) |
| Election Results | 🗳️ Election | Wikipedia | us_presidential_election_1876_2024.csv | 38 rows |
| Polling Data | 🗳️ Election | 538 Data Collection Platform | pres_primary_avgs_1980-2016.csv, presidential_primary_averages_2024.csv | ~460k rows in total |
The original dataset is extended to 2025 in order to cover the two latest election periods. We also set auto_adjust = True to get adjusted data. The extended dataset gives wider coverage of the latest events that we want to study in the future (e.g., the attempted assassination of Donald Trump) and more election periods for statistical analysis. However, we acknowledge that the data contains some runs of identical consecutive values caused by automatic filling, which also appear in the original data.
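The runs of identical consecutive values mentioned above can be located before any analysis. The following is a minimal sketch using only the standard library; the function name `flat_runs`, the `min_length` threshold, and the sample prices are illustrative assumptions, not part of the project code.

```python
from itertools import groupby


def flat_runs(prices, min_length=3):
    """Find runs of identical consecutive values.

    Returns a list of (start_index, run_length, value) tuples for every
    run that is at least `min_length` long. The threshold is an assumed
    default; choose it based on how the data was filled.
    """
    runs = []
    index = 0
    for value, group in groupby(prices):
        length = sum(1 for _ in group)  # size of this run of equal values
        if length >= min_length:
            runs.append((index, length, value))
        index += length
    return runs


# Made-up closing prices: the value 10.5 repeats four times in a row.
closes = [10.0, 10.5, 10.5, 10.5, 10.5, 11.0, 11.2, 11.2, 11.3]
print(flat_runs(closes))
```

Flagged runs could then be dropped or marked before computing returns, since a filled price produces a zero return that is not a real market observation.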
The Industry and Sector data provides sector and industry metadata for each symbol, which helps us test the significance of election impact on stock data across the distribution of sectors. We may also use this dataset to further examine the political inclination and heterogeneity of stocks in the context of a specific event.
The Election Results dataset contains not only the election day of each election year but also the party that won. It works as a time anchor that helps determine the election window or boundary and analyze political inclinations.
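To illustrate how an election day can act as a time anchor, here is a small sketch that builds a window around each election and checks whether a date falls inside one. The 30-day window on each side, the use of calendar days rather than trading days, and the function names are assumptions made for this example.

```python
from datetime import date, timedelta

# Election days of the three most recent U.S. presidential elections.
ELECTION_DAYS = (date(2016, 11, 8), date(2020, 11, 3), date(2024, 11, 5))


def election_window(election_day, days_before=30, days_after=30):
    """Return the (start, end) calendar dates of the window around one election."""
    start = election_day - timedelta(days=days_before)
    end = election_day + timedelta(days=days_after)
    return start, end


def is_in_election_window(day, election_days=ELECTION_DAYS,
                          days_before=30, days_after=30):
    """Return True if `day` lies inside the window of any listed election."""
    for election_day in election_days:
        start, end = election_window(election_day, days_before, days_after)
        if start <= day <= end:
            return True
    return False


print(election_window(date(2020, 11, 3)))
print(is_in_election_window(date(2020, 10, 15)))
print(is_in_election_window(date(2021, 3, 1)))
```

The same check can be used to label each row of a price series as inside or outside an election window before comparing returns or volatility.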
The Polling data, though not used in Milestone 2, potentially provides a more detailed timeline and information about the election period. We expect to use this data in the future to examine events in more detail, not limited to election day. Although there are problems in getting the 2020 data because FiveThirtyEight was shut down, it is still feasible to zoom in on one or several other elections from 1980-2024.
- Descriptive Statistics: we begin by summarizing the key characteristics of the datasets, including sample size, missing values, and variable distributions. Histograms of stock and ETF starting dates illustrate the temporal coverage relative to election periods, while industry distributions show the sectoral composition of listed firms. This descriptive analysis helps assess data completeness, detect anomalies, and ensure the datasets are suitable for subsequent statistical modeling.
- Regression: significance analysis was performed using t-tests and linear OLS regressions. T-tests compared returns and other stock metrics across different electoral outcomes. Regressions estimated sector- and stock-level sensitivities to political factors, reporting coefficients, standard errors, t-values, and corresponding p-values, with explanatory variables including Republican win, margin, election proximity, volatility, and momentum.
- Separation Metric: we define a separation metric based on the difference in mean cumulative abnormal returns (CAR), adjusted by Cohen's d effect size, to measure how differently election outcomes affect an individual stock. It takes into account both statistical significance and the size of the difference, filtering out random noise.
- Time Series Forecasting (ARIMA): we assess election-day impacts via an ARIMA counterfactual. For each stock, we fit an ARIMA model on a pre-event window of (log) returns, using excess returns (stock − market, e.g., vs. SPY) when a benchmark is available. We then generate a post-event multi-step forecast and compute residuals as (actual − forecast). We evaluate whether the mean residual over the post window differs from zero using a one-sample t-test, interpreting significance as evidence of a short-horizon mean shift relative to the ARIMA baseline.
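The two ingredients of the separation metric in the list above, the difference in mean CAR and Cohen's d, can be sketched as follows. This is an illustration under stated assumptions: abnormal returns are taken as stock return minus market return, Cohen's d uses the pooled standard deviation, and all numbers are made up. How the two quantities are combined into a single score is not specified above, so the sketch returns them separately.

```python
from math import sqrt
from statistics import mean, variance


def cumulative_abnormal_return(stock_returns, market_returns):
    """CAR over one event window, assuming abnormal = stock - market."""
    return sum(s - m for s, m in zip(stock_returns, market_returns))


def cohens_d(group_a, group_b):
    """Cohen's d with the pooled sample standard deviation.

    Each group needs at least two observations, and the pooled
    variance must be non-zero.
    """
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def separation_summary(cars_a, cars_b):
    """Return (difference in mean CAR, Cohen's d) for two outcome groups."""
    return mean(cars_a) - mean(cars_b), cohens_d(cars_a, cars_b)


# Hypothetical CARs of one stock, grouped by which party won the election.
republican_cars = [0.04, 0.06, 0.05]
democratic_cars = [-0.01, 0.01, 0.00]

mean_diff, effect_size = separation_summary(republican_cars, democratic_cars)
print(f"mean CAR difference: {mean_diff:.4f}, Cohen's d: {effect_size:.2f}")
```

Reporting the effect size next to the raw difference helps separate stocks whose CARs differ consistently across elections from stocks where a large average gap is driven by one noisy election.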
- Clone the repository: `git clone https://github.com/epfl-ada/ada-2025-project-penta_data.git`, then `cd ada-2025-project-penta_data`
- Install required packages: `pip install -r pip_requirements.txt`
- Download Dataset: the original dataset is about 3GB, so please download the data manually from: https://www.kaggle.com/datasets/jacksoncrow/stock-market-dataset
- Extract and place the files in a `dataset/` directory in the project root:
  - `dataset/`: raw datasets
    - `stocks/`: individual stock CSV files
    - `etfs/`: individual ETF CSV files
    - `symbols_valid_meta.csv`: stock metadata
Navigate to p3_notebook.ipynb and run all cells to reproduce the analysis.
| Team Member | Email | Contributions |
|---|---|---|
| Jiajun Shen | jiajun.shen@epfl.ch | Introduction and Political Leaning Part |
| Yibo Yin | yibo.yin@epfl.ch | Event & volatility definition and Conclusion Part |
| Jinghao Zheng | jinghao.zheng@epfl.ch | General Part |
| Xinxian Ma | xinxian.ma@epfl.ch | Event Analysis and Machine Learning Part |
| Zhiyan Ke | zhiyan.ke@epfl.ch | Political Sensitivity Part |
All team members contributed to discussions, code reviews, website development, and the overall direction of the project.