This project demonstrates a complete machine learning workflow for price prediction using:
- Stepwise Regression for feature selection
- Advanced statistical analysis (ANOVA, R² metrics)
- Full model diagnostics
- Interactive visualization integration
Contents:
- What is Data Normalization/Scaling?
- Common Scaling Methods
- Why is Scaling Important in Machine Learning?
- Practical Example
- Code Example (Python)
- Linear Regression: Price Prediction Case Study 📈
- Linear Regression Analysis Report 📊
A preprocessing technique that adjusts numerical values in a dataset to a standardized scale (e.g., [0, 1] or [-1, 1]). This is essential for:
- Reducing outlier influence
- Ensuring stable performance in machine learning algorithms (e.g., neural networks, SVM)
- Enabling fair comparison between variables with different units or magnitudes
1. Min-Max Scaling

   Formula: x_scaled = (x - x_min) / (x_max - x_min)

   Result: Values scaled to the [0, 1] interval.

2. Z-Score Standardization

   Formula: z = (x - μ) / σ, where μ is the mean and σ is the standard deviation.

   Result: Data with a mean of 0 and a standard deviation of 1.

3. Robust Scaling

   Uses the median and interquartile range (IQR) to reduce the impact of outliers.

   Formula: x_scaled = (x - median) / IQR
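A minimal sketch comparing the three methods with scikit-learn's built-in scalers (the sample values are made up and include one deliberate outlier):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Made-up 1-D sample; 100.0 is a deliberate outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

for scaler in (MinMaxScaler(), StandardScaler(), RobustScaler()):
    scaled = scaler.fit_transform(x)
    print(type(scaler).__name__, scaled.ravel().round(2))
```

Note how RobustScaler, built on the median and IQR, keeps the non-outlier points spread out instead of letting the outlier compress them toward one end of the range.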
- Scale-sensitive algorithms: methods like neural networks, SVM, and KNN rely on the distances between data points; unscaled data can hinder model convergence.
- Interpretation: variables with different scales can distort the weights in linear models (e.g., logistic regression).
- Optimization speed: gradients in optimization algorithms converge faster on normalized data.
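A small sketch of the distance distortion mentioned above (the sample points and the feature ranges used to fit the scaler are assumptions, not from the dataset):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up points: (age, salary). Ages differ a lot, salaries less so.
a = np.array([20.0, 5000.0])
b = np.array([80.0, 9000.0])

# Raw Euclidean distance is dominated by the salary axis
raw_dist = np.linalg.norm(a - b)

# After Min-Max scaling (fit on assumed ranges 18-90 and 1000-20000),
# the age difference dominates instead
scaler = MinMaxScaler().fit(np.array([[18.0, 1000.0], [90.0, 20000.0]]))
a_s, b_s = scaler.transform(np.vstack([a, b]))
scaled_dist = np.linalg.norm(a_s - b_s)
print(raw_dist, scaled_dist)
```

This is exactly why distance-based methods such as KNN behave differently before and after scaling.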
Consider a dataset with two features:
- Age: values between 18–90 years
- Salary: values between $1k–$20k

After Min-Max scaling:
- Age 30 transforms to approximately 0.17
- Salary $5k transforms to approximately 0.21
This process ensures both features contribute equally to the model.
```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Sample data: each row is one person, columns are [Age, Salary].
# The first and last rows carry the feature ranges (18-90 and 1000-20000)
# so the scaler learns the correct bounds for each column.
data = np.array([[18, 1000],
                 [30, 5000],
                 [90, 20000]], dtype=float)

scaler = MinMaxScaler()  # scales each column independently to [0, 1]
normalized_data = scaler.fit_transform(data)
print(normalized_data[1])  # Age 30 -> ~0.17, Salary $5k -> ~0.21
```

## Linear Regression: Price Prediction Case Study 📈

Dataset: housing_data.xlsx (included in repository)
Tech Stack: Python 3.9, Jupyter Notebook, scikit-learn, statsmodels
| Variable | Type | Range | Description |
|---|---|---|---|
| area_sqm | float | 40–220 | Living area in square meters |
| bedrooms | int | 1–5 | Number of bedrooms |
| distance_km | float | 0.5–15 | Distance to city center (km) |
| price | float | $50k–$1.2M | Property price in USD |
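The real data ships as housing_data.xlsx; to experiment without the file, a synthetic stand-in with the same schema can be sketched (the price relation and coefficients below are invented, not the dataset's):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200

# Synthetic stand-in matching the documented schema (not the real dataset)
df = pd.DataFrame({
    'area_sqm': rng.uniform(40, 220, n),
    'bedrooms': rng.integers(1, 6, n),       # integers in 1..5
    'distance_km': rng.uniform(0.5, 15, n),
})
# Invented price relation with noise, landing roughly in the $50k-$1.2M range
df['price'] = (50_000 + 4_500 * df['area_sqm'] + 20_000 * df['bedrooms']
               - 8_000 * df['distance_km'] + rng.normal(0, 30_000, n))
print(df.describe().loc[['min', 'max']].round(1))
```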
```python
import pandas as pd
import statsmodels.api as sm

def stepwise_selection(X, y):
    """Automated feature selection using p-values (threshold = 0.05)."""
    included = []
    while True:
        changed = False
        # Forward step: consider adding each excluded feature
        excluded = list(set(X.columns) - set(included))
        pvalues = pd.Series(index=excluded, dtype=float)
        for new_column in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [new_column]])).fit()
            pvalues[new_column] = model.pvalues[new_column]
        best_pval = pvalues.min()
        if best_pval < 0.05:
            best_feature = pvalues.idxmin()
            included.append(best_feature)
            changed = True
        # Backward step: consider removing features with high p-values
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        pvalues = model.pvalues.iloc[1:]  # exclude the intercept
        worst_pval = pvalues.max()
        if worst_pval > 0.05:
            worst_feature = pvalues.idxmax()
            included.remove(worst_feature)
            changed = True
        if not changed:
            break
    return included

# Example usage (assuming X_train and y_train are predefined):
# selected_features = stepwise_selection(X_train, y_train)
```
| Metric | Value | Interpretation |
|---|---|---|
| R² | 0.872 | 87.2% variance explained |
| Adj. R² | 0.865 | Adjusted for feature complexity |
| F-statistic | 124.7 | p-value = 2.3e-16 (Significant) |
| Intercept | 58,200 | Base price without features |
```python
import seaborn as sns

# Correlation heatmap (assuming df holds the loaded housing data)
corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
```

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Assuming X_train, y_train, X_test, y_test, and selected_features are predefined
final_model = LinearRegression()
final_model.fit(X_train[selected_features], y_train)

# Predictions on the test set
y_pred = final_model.predict(X_test[selected_features])

# Performance metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = final_model.score(X_test[selected_features], y_test)
```

```python
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_test, y=y_pred, hue=X_test['bedrooms'])
# Reference line: perfect predictions lie on the diagonal
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.title('Model Performance Visualization')
plt.savefig('results/scatter_plot.png')
plt.show()
```

```shell
pip install -r requirements.txt
```
- From: data/housing_data.xlsx
- Or use this dataset link
```shell
jupyter notebook price_prediction.ipynb
```

Note: Full statistical outputs and diagnostic plots are available in the notebook.
📌 Important Note:
This dataset is a fictitious example created solely for demonstration and educational purposes. There is no external source for this dataset.
For real-world datasets, consider exploring sources such as the UCI Machine Learning Repository or Kaggle.
| Variable | Type | Range | Description |
|---|---|---|---|
| area_sqm | float | 40–220 | Living area in square meters |
| bedrooms | int | 1–5 | Number of bedrooms |
| distance_km | float | 0.5–15 | Distance to city center (km) |
| price | float | $50k–$1.2M | Property price in USD |
### 3. F-Statistic (ANOVA)
| Metric | Value | Critical Value | Interpretation |
|---|---|---|---|
| R² | 0.872 | > 0.7 | Strong explanatory power |
| Adj. R² | 0.865 | > 0.6 | Robust to overfitting |
| F-statistic | 124.7 | 4.89 | p < 0.001 (Significant) |
| Intercept | 58,200 | - | Base property value |
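For comparison, a critical value and p-value like those in the table can be computed from the F distribution with scipy; the degrees of freedom below (3 predictors, 100 observations) are an assumption for illustration, not taken from the report:

```python
from scipy.stats import f

# Hypothetical degrees of freedom: k = 3 predictors, n = 100 observations
dfn, dfd = 3, 100 - 3 - 1
critical_value = f.ppf(0.99, dfn, dfd)   # critical value at the 1% level
p_value = f.sf(124.7, dfn, dfd)          # p-value for the reported F-statistic
print(round(critical_value, 2), p_value)
```

Any F-statistic above the critical value rejects the null hypothesis that all slope coefficients are zero.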
```python
import pandas as pd
import statsmodels.api as sm

def stepwise_selection(X, y, threshold_in=0.05, threshold_out=0.1):
    """Stepwise selection with configurable entry/exit p-value thresholds."""
    included = []
    while True:
        changed = False
        # Forward step
        excluded = list(set(X.columns) - set(included))
        new_pval = pd.Series(index=excluded, dtype=float)
        for new_column in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [new_column]])).fit()
            new_pval[new_column] = model.pvalues[new_column]
        best_pval = new_pval.min()
        if best_pval < threshold_in:
            best_feature = new_pval.idxmin()
            included.append(best_feature)
            changed = True
        # Backward step
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        pvalues = model.pvalues.iloc[1:]  # exclude the intercept
        worst_pval = pvalues.max()
        if worst_pval > threshold_out:
            worst_feature = pvalues.idxmax()
            included.remove(worst_feature)
            changed = True
        if not changed:
            break
    return included
```

🛸๋ My Contacts Hub
────────────── ⊹🔭๋ ──────────────
➣➢➤ Back to Top
Copyright 2026 Mindful-AI-Assistants. Code released under the MIT license.
