This project demonstrates a complete machine learning workflow for price prediction using:
- Stepwise Regression for feature selection
- Advanced statistical analysis (ANOVA, R² metrics)
- Full model diagnostics
- Interactive visualization integration
Contents:
- What is Data Normalization/Scaling?
- Common Scaling Methods
- Why is Scaling Important in Machine Learning?
- Practical Example
- Code Example (Python)
- Linear Regression: Price Prediction Case Study 📈
- Linear Regression Analysis Report 📊
A preprocessing technique that adjusts numerical values in a dataset to a standardized scale (e.g., [0, 1] or [-1, 1]). This is essential for:
- Reducing outlier influence
- Ensuring stable performance in machine learning algorithms (e.g., neural networks, SVM)
- Enabling fair comparison between variables with different units or magnitudes
1. Min-Max Scaling

   Formula: x_scaled = (x - x_min) / (x_max - x_min)

   Result: Values scaled to the [0, 1] interval.

2. Z-Score Standardization

   Formula: z = (x - μ) / σ, where μ is the mean and σ is the standard deviation.

   Result: Data with a mean of 0 and a standard deviation of 1.

3. Robust Scaling

   Uses the median and interquartile range (IQR) to reduce the impact of outliers.

   Formula: x_scaled = (x - median) / IQR
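A minimal sketch comparing the three methods with scikit-learn's built-in scalers (the sample values are made up and include one deliberate outlier):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Made-up 1-D sample; 100.0 is a deliberate outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

for scaler in (MinMaxScaler(), StandardScaler(), RobustScaler()):
    scaled = scaler.fit_transform(x)
    print(type(scaler).__name__, scaled.ravel().round(2))
```

Note how RobustScaler, built on the median and IQR, keeps the non-outlier points spread out instead of letting the outlier compress them toward one end of the range.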
- Scale-sensitive algorithms: methods like neural networks, SVM, and KNN rely on the distances between data points; unscaled data can hinder model convergence.
- Interpretation: variables with different scales can distort the weights in linear models (e.g., logistic regression).
- Optimization speed: gradients in optimization algorithms converge faster on normalized data.
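A small sketch of the distance distortion mentioned above (the sample points and the feature ranges used to fit the scaler are assumptions, not from the dataset):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up points: (age, salary). Ages differ a lot, salaries less so.
a = np.array([20.0, 5000.0])
b = np.array([80.0, 9000.0])

# Raw Euclidean distance is dominated by the salary axis
raw_dist = np.linalg.norm(a - b)

# After Min-Max scaling (fit on assumed ranges 18-90 and 1000-20000),
# the age difference dominates instead
scaler = MinMaxScaler().fit(np.array([[18.0, 1000.0], [90.0, 20000.0]]))
a_s, b_s = scaler.transform(np.vstack([a, b]))
scaled_dist = np.linalg.norm(a_s - b_s)
print(raw_dist, scaled_dist)
```

This is exactly why distance-based methods such as KNN behave differently before and after scaling.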
Consider a dataset with two features:
- Age: values between 18–90 years
- Salary: values between $1k–$20k

After Min-Max scaling:
- Age 30 transforms to approximately 0.17
- Salary $5k transforms to approximately 0.21
This process ensures both features contribute equally to the model.
```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Sample data: each row is one person, columns are [Age, Salary].
# The first and last rows carry the feature ranges (18-90 and 1000-20000)
# so the scaler learns the correct bounds for each column.
data = np.array([[18, 1000],
                 [30, 5000],
                 [90, 20000]], dtype=float)

scaler = MinMaxScaler()  # scales each column independently to [0, 1]
normalized_data = scaler.fit_transform(data)
print(normalized_data[1])  # Age 30 -> ~0.17, Salary $5k -> ~0.21
```

## Linear Regression: Price Prediction Case Study 📈

Dataset: housing_data.xlsx (included in repository)
Tech Stack: Python 3.9, Jupyter Notebook, scikit-learn, statsmodels
| Variable | Type | Range | Description |
|---|---|---|---|
| area_sqm | float | 40–220 | Living area in square meters |
| bedrooms | int | 1–5 | Number of bedrooms |
| distance_km | float | 0.5–15 | Distance to city center (km) |
| price | float | $50k–$1.2M | Property price in USD |
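The real data ships as housing_data.xlsx; to experiment without the file, a synthetic stand-in with the same schema can be sketched (the price relation and coefficients below are invented, not the dataset's):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200

# Synthetic stand-in matching the documented schema (not the real dataset)
df = pd.DataFrame({
    'area_sqm': rng.uniform(40, 220, n),
    'bedrooms': rng.integers(1, 6, n),       # integers in 1..5
    'distance_km': rng.uniform(0.5, 15, n),
})
# Invented price relation with noise, landing roughly in the $50k-$1.2M range
df['price'] = (50_000 + 4_500 * df['area_sqm'] + 20_000 * df['bedrooms']
               - 8_000 * df['distance_km'] + rng.normal(0, 30_000, n))
print(df.describe().loc[['min', 'max']].round(1))
```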
```python
import pandas as pd
import statsmodels.api as sm

def stepwise_selection(X, y):
    """Automated feature selection using p-values (threshold = 0.05)."""
    included = []
    while True:
        changed = False
        # Forward step: consider adding each excluded feature
        excluded = list(set(X.columns) - set(included))
        pvalues = pd.Series(index=excluded, dtype=float)
        for new_column in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [new_column]])).fit()
            pvalues[new_column] = model.pvalues[new_column]
        best_pval = pvalues.min()
        if best_pval < 0.05:
            best_feature = pvalues.idxmin()
            included.append(best_feature)
            changed = True
        # Backward step: consider removing features with high p-values
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        pvalues = model.pvalues.iloc[1:]  # exclude the intercept
        worst_pval = pvalues.max()
        if worst_pval > 0.05:
            worst_feature = pvalues.idxmax()
            included.remove(worst_feature)
            changed = True
        if not changed:
            break
    return included

# Example usage (assuming X_train and y_train are predefined):
# selected_features = stepwise_selection(X_train, y_train)
```
| Metric | Value | Interpretation |
|---|---|---|
| R² | 0.872 | 87.2% variance explained |
| Adj. R² | 0.865 | Adjusted for feature complexity |
| F-statistic | 124.7 | p-value = 2.3e-16 (Significant) |
| Intercept | 58,200 | Base price without features |
```python
import seaborn as sns

# Correlation heatmap (assuming df holds the loaded housing data)
corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
```

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Assuming X_train, y_train, X_test, y_test, and selected_features are predefined
final_model = LinearRegression()
final_model.fit(X_train[selected_features], y_train)

# Predictions on the test set
y_pred = final_model.predict(X_test[selected_features])

# Performance metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = final_model.score(X_test[selected_features], y_test)
```

```python
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_test, y=y_pred, hue=X_test['bedrooms'])
# Reference line: perfect predictions lie on the diagonal
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.title('Model Performance Visualization')
plt.savefig('results/scatter_plot.png')
plt.show()
```

```shell
pip install -r requirements.txt
```
- From: data/housing_data.xlsx
- Or use this dataset link
```shell
jupyter notebook price_prediction.ipynb
```

Note: Full statistical outputs and diagnostic plots are available in the notebook.
📌 Important Note:
This dataset is a fictitious example created solely for demonstration and educational purposes. There is no external source for this dataset.
For real-world datasets, consider exploring sources such as the UCI Machine Learning Repository or Kaggle.
| Variable | Type | Range | Description |
|---|---|---|---|
| area_sqm | float | 40–220 | Living area in square meters |
| bedrooms | int | 1–5 | Number of bedrooms |
| distance_km | float | 0.5–15 | Distance to city center (km) |
| price | float | $50k–$1.2M | Property price in USD |
### 3. F-Statistic (ANOVA)
| Metric | Value | Critical Value | Interpretation |
|---|---|---|---|
| R² | 0.872 | > 0.7 | Strong explanatory power |
| Adj. R² | 0.865 | > 0.6 | Robust to overfitting |
| F-statistic | 124.7 | 4.89 | p < 0.001 (Significant) |
| Intercept | 58,200 | - | Base property value |
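For comparison, a critical value and p-value like those in the table can be computed from the F distribution with scipy; the degrees of freedom below (3 predictors, 100 observations) are an assumption for illustration, not taken from the report:

```python
from scipy.stats import f

# Hypothetical degrees of freedom: k = 3 predictors, n = 100 observations
dfn, dfd = 3, 100 - 3 - 1
critical_value = f.ppf(0.99, dfn, dfd)   # critical value at the 1% level
p_value = f.sf(124.7, dfn, dfd)          # p-value for the reported F-statistic
print(round(critical_value, 2), p_value)
```

Any F-statistic above the critical value rejects the null hypothesis that all slope coefficients are zero.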
```python
import pandas as pd
import statsmodels.api as sm

def stepwise_selection(X, y, threshold_in=0.05, threshold_out=0.1):
    """Stepwise selection with configurable entry/exit p-value thresholds."""
    included = []
    while True:
        changed = False
        # Forward step
        excluded = list(set(X.columns) - set(included))
        new_pval = pd.Series(index=excluded, dtype=float)
        for new_column in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [new_column]])).fit()
            new_pval[new_column] = model.pvalues[new_column]
        best_pval = new_pval.min()
        if best_pval < threshold_in:
            best_feature = new_pval.idxmin()
            included.append(best_feature)
            changed = True
        # Backward step
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        pvalues = model.pvalues.iloc[1:]  # exclude the intercept
        worst_pval = pvalues.max()
        if worst_pval > threshold_out:
            worst_feature = pvalues.idxmax()
            included.remove(worst_feature)
            changed = True
        if not changed:
            break
    return included
```

🛸๋ My Contacts Hub
────────────── ⊹🔭๋ ──────────────
➣➢➤ Back to Top
Copyright 2026 Mindful-AI-Assistants. Code released under the MIT license.
