Mindful-AI-Assistants/Linear-Regression_Price-Prediction_DataScalingAnalysis








This project demonstrates a complete machine learning workflow for price prediction using:



  • Stepwise Regression for feature selection
  • Advanced statistical analysis (ANOVA, R² metrics)
  • Full model diagnostics
  • Interactive visualization integration



Table of Contents

  1. What is Data Normalization/Scaling?
  2. Common Scaling Methods
  3. Why is this Important in Machine Learning?
  4. Practical Example
  5. Code Example (Python)
  6. Linear Regression: Price Prediction Case Study 📈
  7. Linear Regression Analysis Report 📊



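What is Data Normalization/Scaling?

Data normalization (scaling) is a preprocessing technique that adjusts numerical values in a dataset to a standardized scale (e.g., [0, 1] or [-1, 1]). It matters most for scale-sensitive algorithms, for interpreting model weights, and for optimization speed.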




Common Scaling Methods

1. Min-Max Scaling (Normalization)


$$ X_{\text{norm}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}} $$


  • Result: Values scaled to the [0, 1] interval.
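
A minimal sketch of the min-max formula in NumPy. The function name `min_max_scale` and the sample ages are illustrative assumptions, not part of the project code; the constant-feature guard is also an added assumption, since the formula itself is undefined when the maximum equals the minimum.

```python
import numpy as np

def min_max_scale(x):
    """Scale a 1-D array to [0, 1] using (x - min) / (max - min)."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        # Constant feature: the formula would divide by zero, so return zeros.
        return np.zeros_like(x)
    return (x - x.min()) / span

ages = np.array([18, 30, 54, 90])  # illustrative values
scaled_ages = min_max_scale(ages)
print(scaled_ages.round(3))  # 18 maps to 0.0, 54 to 0.5, 90 to 1.0
```

The smallest value always maps to 0 and the largest to 1, so a single extreme value can compress everything else into a narrow band.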


2. Standardization (Z-Score)


$$ \Huge X_{\text{std}} = \frac{X - \mu}{\sigma} $$


  • Where: $\mu$ is the mean and $\sigma$ is the standard deviation.

  • Result: Data with a mean of 0 and standard deviation of 1.
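
The z-score formula can be sketched directly in NumPy. The helper name `standardize` and the salary values are assumptions made for illustration; the sketch uses the population standard deviation (`ddof=0`), which is also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np

def standardize(x):
    """Return (x - mean) / std using the population standard deviation (ddof=0)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

salaries = np.array([1000.0, 5000.0, 12000.0, 20000.0])  # illustrative values
z = standardize(salaries)

# After standardization the mean is (numerically) 0 and the standard deviation is 1.
print(z.mean(), z.std())
```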


3. Robust Scaling

  • Uses the median and interquartile range (IQR) to reduce the impact of outliers.

  • Formula:


$$ \Huge X_{\text{robust}} = \frac{X - \text{Median}(X)}{\text{IQR}(X)} $$
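
A small sketch of why the median and IQR help with outliers, under the assumption of a toy array whose last entry is an extreme value. The helper `robust_scale` is hypothetical and follows the formula above; min-max scaling is computed alongside it only for comparison.

```python
import numpy as np

def robust_scale(x):
    """Return (x - median) / IQR, where IQR = 75th percentile - 25th percentile."""
    x = np.asarray(x, dtype=float)
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    return (x - median) / (q3 - q1)

values = np.array([10.0, 12.0, 14.0, 16.0, 1000.0])  # last value is an outlier
robust = robust_scale(values)

# Min-max scaling of the same data, for comparison
min_max = (values - values.min()) / (values.max() - values.min())

print(robust)            # the four typical values stay spread between -1 and 0.5
print(min_max.round(3))  # the four typical values are squeezed close to 0
```

The outlier still receives a large scaled value under robust scaling, but it no longer determines how the typical values are spread.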



Why is this Important in Machine Learning?


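  • Scale-sensitive algorithms: Distance-based methods such as SVM and KNN compare data points by distance, so a feature with a large numeric range dominates the result; neural networks trained by gradient descent can also converge slowly or unstably on unscaled inputs.

  • Interpretation: In linear models (e.g., logistic regression), coefficient magnitudes depend on each variable's units, so weights of features on different scales cannot be compared directly.

  • Optimization Speed: Gradient-based optimization algorithms typically converge faster on normalized data.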
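
The distance argument can be illustrated with three made-up people described by age and salary. All values and names below are assumptions for demonstration; the sketch compares Euclidean distances before and after per-column min-max scaling.

```python
import numpy as np

# Columns: [age in years, salary in dollars]; values are illustrative.
people = np.array([
    [25.0, 5000.0],  # person A
    [60.0, 5100.0],  # person B: very different age, similar salary
    [26.0, 9000.0],  # person C: similar age, different salary
])

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Raw distances: the salary column dominates because its numbers are larger.
raw_ab = euclidean(people[0], people[1])
raw_ac = euclidean(people[0], people[2])

# Min-max scale each column to [0, 1], then recompute the distances.
col_min = people.min(axis=0)
col_max = people.max(axis=0)
scaled = (people - col_min) / (col_max - col_min)
scaled_ab = euclidean(scaled[0], scaled[1])
scaled_ac = euclidean(scaled[0], scaled[2])

print(f"raw:    A-B = {raw_ab:.1f}, A-C = {raw_ac:.1f}")
print(f"scaled: A-B = {scaled_ab:.3f}, A-C = {scaled_ac:.3f}")
```

On the raw data, B looks far closer to A than C does, purely because salary is measured in thousands. After scaling, the age gap and the salary gap carry comparable weight.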



Practical Example

For a dataset containing:

  • Age: Values between 18–90 years

  • Salary: Values between $1k–$20k



After applying Min-Max Scaling:

  • Age 30 transforms to approximately 0.17, since (30 − 18) / (90 − 18) ≈ 0.17

  • Salary $5k transforms to approximately 0.21, since (5 − 1) / (20 − 1) ≈ 0.21


Both features now lie in [0, 1], so neither dominates the model purely because of its units.



Code Example (Python) – Data Normalization


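from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Fit the scaler on the feature ranges; columns are [age, salary]
ranges = np.array([[18, 1000], [90, 20000]], dtype=float)
scaler = MinMaxScaler()
scaler.fit(ranges)

# Sample data: Age 30 and Salary $5k, scaled column by column
sample = np.array([[30, 5000]], dtype=float)
normalized_data = scaler.transform(sample)

print(normalized_data.round(2))
# Expected Output: [[0.17 0.21]]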



Linear Regression: Price Prediction Case Study 📈 
 Dataset: housing_data.xlsx (included in repository)
Tech Stack: Python 3.9, Jupyter Notebook, scikit-learn, statsmodels



I. Use Case Implementation & Dataset Description


| Variable | Type | Range | Description |
|---|---|---|---|
| area_sqm | float | 40–220 | Living area in square meters |
| bedrooms | int | 1–5 | Number of bedrooms |
| distance_km | float | 0.5–15 | Distance to city center (km) |
| price | float | $50k–$1.2M | Property price in USD |



II. Methodology (Stepwise Regression)


import pandas as pd
import statsmodels.api as sm

def stepwise_selection(X, y):
    """Automated feature selection using p-values (entry and exit threshold: 0.05)."""
    included = []
    while True:
        changed = False
        # Forward step: consider adding each excluded feature
        excluded = [col for col in X.columns if col not in included]
        pvalues = pd.Series(index=excluded, dtype=float)
        for new_column in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [new_column]])).fit()
            pvalues[new_column] = model.pvalues[new_column]
        # min() of an empty Series is NaN, so nothing is added when no candidates remain
        best_pval = pvalues.min()
        if best_pval < 0.05:
            best_feature = pvalues.idxmin()
            included.append(best_feature)
            changed = True

        # Backward step: consider removing features with high p-value
        if included:  # skip the refit while no feature has been selected
            model = sm.OLS(y, sm.add_constant(X[included])).fit()
            pvalues = model.pvalues.iloc[1:]  # Exclude intercept
            worst_pval = pvalues.max()
            if worst_pval > 0.05:
                worst_feature = pvalues.idxmax()
                included.remove(worst_feature)
                changed = True

        if not changed:
            break
    return included

# Example usage (assuming X_train is a DataFrame and y_train a Series, both predefined):
# selected_features = stepwise_selection(X_train, y_train)



III. Statistical Analysis

Key Metrics Table


| Metric | Value | Interpretation |
|---|---|---|
| R² | 0.872 | 87.2% variance explained |
| Adj. R² | 0.865 | Adjusted for feature complexity |
| F-statistic | 124.7 | p-value = 2.3e-16 (Significant) |
| Intercept | 58,200 | Base price without features |








Correlation Matrix


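import matplotlib.pyplot as plt
import seaborn as sns
# Assuming df is the housing DataFrame (all columns numeric)
corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.show()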



IV. Full Implementation Code

Model Training & Evaluation


from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

# Assuming X_train, y_train, X_test, y_test, and selected_features are predefined
final_model = LinearRegression()
final_model.fit(X_train[selected_features], y_train)

# Predictions on test set
y_pred = final_model.predict(X_test[selected_features])

# Calculate performance metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = final_model.score(X_test[selected_features], y_test)



V. Visualization – Actual vs Predicted Prices


import os
import matplotlib.pyplot as plt
import seaborn as sns

os.makedirs('results', exist_ok=True)  # make sure the output folder exists

plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_test, y=y_pred, hue=X_test['bedrooms'])
# Reference line y = x: points on it are predicted exactly
limits = [y_test.min(), y_test.max()]
plt.plot(limits, limits, 'r--')
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.title('Model Performance Visualization')
plt.savefig('results/scatter_plot.png')
plt.show()



VI. How to Run


1. Install Dependencies:


pip install -r requirements.txt


2. Download Dataset: housing_data.xlsx is included in the repository.



3. Execute Jupyter Notebook:



    jupyter notebook price_prediction.ipynb



Note: Full statistical outputs and diagnostic plots are available in the notebook.



Linear Regression Analysis Report 📊

Dataset Overview

📌 Important Note:


This dataset is a fictitious example created solely for demonstration and educational purposes. There is no external source for this dataset.

For real-world datasets, consider exploring sources such as the UCI Machine Learning Repository or Kaggle.



| Variable | Type | Range | Description |
|---|---|---|---|
| area_sqm | float | 40–220 | Living area in square meters |
| bedrooms | int | 1–5 | Number of bedrooms |
| distance_km | float | 0.5–15 | Distance to city center (km) |
| price | float | $50k–$1.2M | Property price in USD |



Key Formulas


1. Regression Equation

$$ \Huge \hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n $$
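
A worked instance of the regression equation, as a sketch. The intercept 58,200 is the value reported in this README; the three slope coefficients and the example houses are hypothetical numbers chosen only to show the arithmetic.

```python
import numpy as np

beta_0 = 58_200.0  # intercept reported in this README
# Hypothetical slopes for [area_sqm, bedrooms, distance_km], for illustration only
betas = np.array([3_000.0, 10_000.0, -5_000.0])

# One house: 100 sqm, 3 bedrooms, 5 km from the city center
x = np.array([100.0, 3.0, 5.0])
y_hat = beta_0 + np.dot(betas, x)
print(y_hat)  # 58200 + 300000 + 30000 - 25000 = 363200.0

# The same equation for several houses at once: one row per house
X = np.array([
    [100.0, 3.0, 5.0],
    [40.0, 1.0, 15.0],
])
y_hat_all = beta_0 + X @ betas
print(y_hat_all)
```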


2. R-Squared

$$ \Huge R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2} $$
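
The R² formula can be checked by hand on a few numbers. The actual and predicted values below are made up for illustration; the result is compared against scikit-learn's `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([100.0, 150.0, 200.0, 250.0, 300.0])  # illustrative actual values
y_pred = np.array([110.0, 140.0, 210.0, 240.0, 310.0])  # illustrative predictions

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares: 500
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares: 25000
r2_manual = 1 - ss_res / ss_tot                 # 1 - 500 / 25000 = 0.98

print(r2_manual, r2_score(y_true, y_pred))
```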


3. F-Statistic (ANOVA)

$$ \Huge F = \frac{\text{MS}_\text{model}}{\text{MS}_\text{residual}} $$
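
A sketch of the F-statistic computed from its mean squares, assuming an ordinary least squares fit with an intercept, p predictors, and n observations, so that MS_model = SS_model / p and MS_residual = SS_residual / (n − p − 1). The data are synthetic and the coefficients hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 60, 3

# Synthetic predictors and response (hypothetical coefficients, for illustration only)
X = rng.normal(size=(n, p))
y = 2.0 + X @ np.array([1.5, -0.8, 0.3]) + rng.normal(scale=1.0, size=n)

# Least squares fit with an intercept column
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ coef

ss_model = np.sum((y_hat - y.mean()) ** 2)  # variation explained by the model
ss_res = np.sum((y - y_hat) ** 2)           # unexplained (residual) variation

ms_model = ss_model / p
ms_residual = ss_res / (n - p - 1)
f_stat = ms_model / ms_residual
print(f_stat)
```

For a model with an intercept, the same value follows from R² as F = (R² / p) / ((1 − R²) / (n − p − 1)), which links this formula to the R² definition above.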



Statistical


| Metric | Value | Critical Value | Interpretation |
|---|---|---|---|
| R² | 0.872 | > 0.7 | Strong explanatory power |
| Adj. R² | 0.865 | > 0.6 | Close to R², so little sign of overfitting |
| F-statistic | 124.7 | 4.89 | p < 0.001 (Significant) |
| Intercept | 58,200 | - | Base property value |



Stepwise Regression


import pandas as pd
import statsmodels.api as sm

def stepwise_selection(X, y, threshold_in=0.05, threshold_out=0.1):
    """Stepwise selection: add features with p < threshold_in, drop those with p > threshold_out."""
    included = []
    while True:
        changed = False
        # Forward step
        excluded = [col for col in X.columns if col not in included]
        new_pval = pd.Series(index=excluded, dtype=float)
        for new_column in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [new_column]])).fit()
            new_pval[new_column] = model.pvalues[new_column]
        # min() of an empty Series is NaN, so nothing is added when no candidates remain
        best_pval = new_pval.min()
        if best_pval < threshold_in:
            best_feature = new_pval.idxmin()
            included.append(best_feature)
            changed = True

        # Backward step
        if included:  # skip the refit while no feature has been selected
            model = sm.OLS(y, sm.add_constant(X[included])).fit()
            pvalues = model.pvalues.iloc[1:]  # exclude intercept
            worst_pval = pvalues.max()
            if worst_pval > threshold_out:
                worst_feature = pvalues.idxmax()
                included.remove(worst_feature)
                changed = True

        if not changed:
            break
    return included





Copyright 2026 Mindful-AI-Assistants. Code released under the MIT license.
