This project predicts whether the SpaceX Falcon 9 first-stage booster will land successfully, using data science and machine learning techniques. It demonstrates an end-to-end data science pipeline: data collection, data wrangling, exploratory analysis, interactive visualization, dashboard development, and predictive modeling. The project is part of the IBM Applied Data Science Capstone, where the objective is to generate data-driven insights that can help estimate rocket launch costs and support competitive launch bidding strategies.
SpaceX advertises Falcon 9 rocket launches for approximately $62 million, significantly cheaper than other providers that charge $165 million or more. The primary reason for this cost advantage is first-stage booster reusability. Predicting whether the first-stage booster will land successfully is critical because reusable boosters drastically reduce launch costs.
This project aims to:
- Analyze historical Falcon 9 launch data
- Identify factors influencing landing success
- Build machine learning models to predict landing outcomes
- Provide insights into mission success patterns
The project follows a complete data science lifecycle:
Data Collection → Data Wrangling → Exploratory Data Analysis → Interactive Visual Analytics → Dashboard Development → Machine Learning Modeling → Insights & Conclusions
The methodology used throughout the project includes:
- API data collection
- Web scraping
- Data preprocessing and wrangling
- SQL analysis
- Data visualization
- Interactive dashboards
- Classification machine learning models
Raw Data Sources
│
├── SpaceX REST API
├── Wikipedia Web Scraping
│
↓
Data Processing (Pandas)
│
↓
Exploratory Data Analysis
│
├── SQL Analysis
├── Matplotlib / Seaborn Visualizations
│
↓
Interactive Analytics
│
├── Folium Geospatial Map
├── Plotly Dash Dashboard
│
↓
Machine Learning Models
│
├── Logistic Regression
├── SVM
├── Decision Tree
└── KNN
│
↓
Landing Success Prediction
Launch data was collected using the SpaceX REST API.
- Sent a GET request to retrieve historical launch data.
- Converted JSON responses to Pandas DataFrames.
- Selected relevant features from launch records.
- Extracted additional information using IDs for:
  - Rocket
  - Launchpad
  - Payload
  - Booster core
- Filtered dataset to include Falcon 9 launches only.
- Replaced missing payload masses with the column mean.
- Exported cleaned data as CSV.
The final dataset was saved as:
dataset_part_1.csv
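The wrangling steps above can be sketched on a small in-memory sample rather than a live API call. This is only an illustration: the field names (`flight_number`, `rocket_name`, `payload_mass_kg`) and the sample values are hypothetical, and the real API response is more deeply nested.

```python
import pandas as pd

# Hypothetical records standing in for the parsed JSON response.
sample_launches = [
    {"flight_number": 1, "rocket_name": "Falcon 1", "payload_mass_kg": 20.0},
    {"flight_number": 2, "rocket_name": "Falcon 9", "payload_mass_kg": 500.0},
    {"flight_number": 3, "rocket_name": "Falcon 9", "payload_mass_kg": None},
    {"flight_number": 4, "rocket_name": "Falcon 9", "payload_mass_kg": 1500.0},
]


def clean_launches(records):
    """Flatten JSON-like records, keep Falcon 9 rows, fill missing payload mass."""
    df = pd.json_normalize(records)
    falcon9 = df[df["rocket_name"] == "Falcon 9"].copy()
    # Make sure the column is numeric so missing values become NaN.
    falcon9["payload_mass_kg"] = pd.to_numeric(falcon9["payload_mass_kg"])
    # Replace missing payload masses with the mean of the known values.
    mean_mass = falcon9["payload_mass_kg"].mean()
    falcon9["payload_mass_kg"] = falcon9["payload_mass_kg"].fillna(mean_mass)
    return falcon9.reset_index(drop=True)


cleaned = clean_launches(sample_launches)
```

Calling `cleaned.to_csv("dataset_part_1.csv", index=False)` would then write the file named above.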
Additional launch data was collected by scraping the Falcon 9 launch history table on Wikipedia.
Libraries used: requests, BeautifulSoup, pandas
- Sent HTTP request to the Wikipedia page.
- Parsed HTML using BeautifulSoup.
- Extracted the third HTML table containing launch records.
- Extracted column headers and launch data.
- Constructed a dictionary to store extracted values.
- Converted dictionary into a Pandas DataFrame.
- Exported the dataset as:
spacex_web_scraped.csv
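A minimal sketch of the parsing steps above, run against a tiny inline HTML string instead of the live Wikipedia page. The table contents and column names here are placeholders, and the sketch reads the first table because the sample contains only one (the project used the third table on the page).

```python
from bs4 import BeautifulSoup
import pandas as pd

# Tiny stand-in for the downloaded page; the real markup is far richer.
html = """
<table>
  <tr><th>Flight No.</th><th>Launch site</th><th>Payload</th></tr>
  <tr><td>1</td><td>CCAFS</td><td>Dragon Qualification Unit</td></tr>
  <tr><td>2</td><td>CCAFS</td><td>Dragon</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find_all("table")[0]

# Column headers become the dictionary keys.
headers = [th.get_text(strip=True) for th in table.find_all("th")]
launch_dict = {name: [] for name in headers}

# Each data row contributes one value per column.
for row in table.find_all("tr"):
    cells = row.find_all("td")
    if not cells:
        continue  # skip the header row
    for name, cell in zip(headers, cells):
        launch_dict[name].append(cell.get_text(strip=True))

scraped = pd.DataFrame(launch_dict)
```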
Data preprocessing was performed to prepare the dataset for analysis and modeling.
- Loaded the dataset and inspected data types and missing values.
- Analyzed launch sites, orbit types, and mission outcomes.
- Converted mission outcomes into a binary classification variable:
  - Class = 1 → Successful Landing
  - Class = 0 → Unsuccessful Landing
  - Bad outcomes included: False ASDS, False Ocean, False RTLS, None ASDS, None None
- Calculated overall success rate.
- Exported cleaned dataset as:
dataset_part_2.csv
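The labeling step can be illustrated as follows. The set of bad outcomes is taken from the list above; the `Outcome` column name and the four sample rows are assumptions made for the example.

```python
import pandas as pd

# Outcome labels treated as unsuccessful landings, as listed above.
bad_outcomes = {"False ASDS", "False Ocean", "False RTLS", "None ASDS", "None None"}

# Hypothetical sample; the "Outcome" column name is an assumption.
df = pd.DataFrame({"Outcome": ["True ASDS", "None None", "True RTLS", "False Ocean"]})

# Class = 0 for a bad outcome, 1 otherwise.
df["Class"] = [0 if outcome in bad_outcomes else 1 for outcome in df["Outcome"]]

# The mean of a 0/1 column is the overall success rate.
success_rate = df["Class"].mean()
```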
EDA was conducted using:
- Pandas
- Matplotlib
- Seaborn
The objective was to identify relationships between launch characteristics and landing success.
Later launches show higher landing success rates, indicating improvements over time.
- Different launch sites handle varying payload ranges.
- Some sites successfully land boosters with very heavy payloads.
Certain orbits such as LEO and ISS show higher success rates compared to others like GTO.
Success trends differ by orbit type.
- LEO shows improvement with time.
- GTO has mixed results across launches.
Some orbit types successfully carry large payloads while maintaining landing success.
Landing success improved dramatically between 2013 and 2020, reaching close to 100% reliability in later years.
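Comparisons like these usually come down to grouping the binary `Class` column by a launch attribute and taking the mean. The sketch below shows the pattern; the column names are assumptions and the rows are invented purely for illustration, so the resulting numbers are not project results.

```python
import pandas as pd

# Made-up rows for illustration only; not actual launch records.
df = pd.DataFrame({
    "Year": [2013, 2013, 2016, 2016, 2020, 2020],
    "Orbit": ["GTO", "LEO", "GTO", "ISS", "LEO", "ISS"],
    "Class": [0, 0, 0, 1, 1, 1],
})

# Mean of the 0/1 Class column per group gives the success rate.
yearly_rate = df.groupby("Year")["Class"].mean()
orbit_rate = df.groupby("Orbit")["Class"].mean()
```

Either series can be passed to a Matplotlib or Seaborn line or bar plot to visualize the trend.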
SQL queries were used to perform additional analysis on launch data.
- Retrieve unique launch sites
- Identify launch sites starting with CCA
- Calculate total payload mass for NASA CRS missions
- Compute average payload mass for F9 v1.1
- Identify the first successful ground landing date
- Determine boosters with successful drone ship landings
- Count mission success vs failure outcomes
- Identify boosters carrying maximum payload mass
SELECT DISTINCT Launch_Site
FROM SPACEXTABLE;
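A few of the queries listed above can be tried against an in-memory SQLite database. This is a sketch under assumptions: `SPACEXTABLE` and `Launch_Site` come from the query above, while the `Customer` and `Payload_Mass_kg` column names and all inserted rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SPACEXTABLE (Launch_Site TEXT, Customer TEXT, Payload_Mass_kg REAL)"
)
# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO SPACEXTABLE VALUES (?, ?, ?)",
    [
        ("CCAFS LC-40", "NASA (CRS)", 500.0),
        ("CCAFS LC-40", "SES", 3000.0),
        ("KSC LC-39A", "NASA (CRS)", 2500.0),
        ("VAFB SLC-4E", "Iridium", 9600.0),
    ],
)

# Unique launch sites.
unique_sites = [row[0] for row in conn.execute(
    "SELECT DISTINCT Launch_Site FROM SPACEXTABLE ORDER BY Launch_Site"
)]

# Launch sites starting with CCA.
cca_sites = conn.execute(
    "SELECT Launch_Site FROM SPACEXTABLE WHERE Launch_Site LIKE 'CCA%'"
).fetchall()

# Total payload mass for NASA CRS missions.
nasa_total = conn.execute(
    "SELECT SUM(Payload_Mass_kg) FROM SPACEXTABLE WHERE Customer = 'NASA (CRS)'"
).fetchone()[0]

conn.close()
```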
Interactive maps were built using Folium.
Features included:
- Launch site markers
- Clustered launch outcomes
- Color-coded markers:
  - Green → Successful landing
  - Red → Failure
- Proximity analysis to:
  - Coastlines
  - Railways
  - Highways
  - Cities
Distance lines were drawn using Folium PolyLine objects to analyze geographic factors affecting launch site placement.
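The proximity analysis needs a distance between two latitude/longitude points. One common way to compute it is the haversine formula, sketched below. Whether the project used exactly this formula is an assumption, and the coordinates are placeholders rather than surveyed positions.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Placeholder coordinates for a launch site and a nearby coastline point.
launch_site = (28.5623, -80.5774)
coast_point = (28.5630, -80.5680)
distance = haversine_km(*launch_site, *coast_point)
```

The same two coordinate pairs can be passed as the `locations` argument of `folium.PolyLine` to draw the connecting line on the map.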
An interactive dashboard was developed using Plotly Dash.
The dashboard allows selection of:
- All sites
- Individual launch sites
It displays launch success counts per site and filters launches by payload mass. For the filtered launches, it:
- Shows the relationship between payload mass and landing outcome
- Colors points by booster version category
This dashboard enables interactive exploration of launch performance data.
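The data logic behind such a dashboard can be sketched as two plain pandas helpers that a Dash callback could call before building its figures. The function names, column names, and sample rows are all hypothetical, and the Dash layout and callback wiring are omitted.

```python
import pandas as pd

# Made-up rows for illustration only.
launches = pd.DataFrame({
    "Launch Site": ["CCAFS LC-40", "CCAFS LC-40", "KSC LC-39A", "KSC LC-39A"],
    "Payload Mass (kg)": [500.0, 3000.0, 2500.0, 5300.0],
    "class": [0, 1, 1, 1],
})


def success_counts(df, site="ALL"):
    """Successes per site for ALL, or success/failure counts for one site."""
    if site == "ALL":
        return df.groupby("Launch Site")["class"].sum()
    return df.loc[df["Launch Site"] == site, "class"].value_counts()


def filter_launches(df, site, payload_range):
    """Rows within the payload range (inclusive), optionally limited to one site."""
    low, high = payload_range
    mask = df["Payload Mass (kg)"].between(low, high)
    if site != "ALL":
        mask &= df["Launch Site"] == site
    return df[mask]
```

For example, `filter_launches(launches, "ALL", (1000, 4000))` keeps the two launches whose payload mass falls in that range.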
The final stage of the project involved building classification models to predict landing success.
- Target variable: Y = Class
- Feature scaling using: StandardScaler()
- Train/Test split: 80% training, 20% testing
The following algorithms were evaluated:
- Logistic Regression
- Support Vector Machine (SVM)
- Decision Tree
- K-Nearest Neighbors (KNN)
Grid search with 10-fold cross-validation was used:
GridSearchCV(cv=10)
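A minimal sketch of this setup for one of the four models, using synthetic data in place of the launch features. The parameter grid, random seeds, and dataset size are assumptions. The scaler is placed inside a pipeline here so that it is refit on each cross-validation fold, which may differ from how the project's notebook applied `StandardScaler()`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; not the project's launch dataset.
X, Y = make_classification(n_samples=200, n_features=8, random_state=0)

# 80% training, 20% testing.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=2
)

# Scale features, then fit the classifier.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Hypothetical grid over the regularization strength.
param_grid = {"logisticregression__C": [0.01, 0.1, 1]}

# Grid search with 10-fold cross-validation on the training split.
search = GridSearchCV(pipeline, param_grid, cv=10)
search.fit(X_train, Y_train)

# Accuracy of the best estimator on the held-out test split.
test_accuracy = search.score(X_test, Y_test)
```

The same pattern applies to SVM, Decision Tree, and KNN by swapping the estimator and its parameter grid.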
- All models achieved similar accuracy: Test Accuracy ≈ 0.8333 (83%)
- Confusion matrices were used to analyze classification performance.
From exploratory analysis and modeling:
- Falcon 9 landing success rates improved significantly over time.
- Launch site experience contributes to higher success rates.
- Orbit type impacts landing probability.
- Payload mass influences mission outcomes but does not necessarily cause failure.
- Machine learning models trained on historical data predict landing success with about 83% test accuracy.
- Programming Language: Python
- Libraries: Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn, Plotly Dash, Folium, BeautifulSoup, Requests
- Tools: Jupyter Notebook, SQL
IBM-Applied-Data-Science-Capstone
│
├─ Module 1 – Introduction
│ ├─ Data Collection → SpaceX REST API
│ ├─ Data Collection → Web Scraping (Wikipedia)
│ └─ Data Wrangling
│
├─ Module 2 – Exploratory Data Analysis
│ ├─ EDA with SQL
│ └─ EDA with Visualization (Matplotlib / Seaborn)
│
├─ Module 3 – Interactive Visual Analytics
│ ├─ Interactive Map → Folium
│ └─ Interactive Dashboard → Plotly Dash
│
├─ Module 4 – Predictive Analysis (Machine Learning)
│ └─ Landing Success Prediction Models
│ ├─ Logistic Regression
│ ├─ Support Vector Machine (SVM)
│ ├─ Decision Tree
│ └─ K-Nearest Neighbors (KNN)
│
└─ Final Presentation
Potential enhancements include:
- Incorporating additional launch features
- Using advanced ML models (Random Forest, XGBoost)
- Deploying the model as a web application
- Real-time launch data integration
- Expanding dataset with more recent launches