Linear regression is a fundamental statistical and machine learning technique for modeling the relationship between a dependent variable (the target) and one or more independent variables (the features). The goal is to find the best-fitting line (or hyperplane, when there are several features), usually the one that minimizes the sum of squared differences between observed and predicted values, and to use it to predict the target from the features.
Key Concepts
- Simple Linear Regression: Models the relationship between two variables by fitting a linear equation to observed data. The formula is:
\[
y = \beta_0 + \beta_1 x + \epsilon
\]
- \( y \) is the dependent variable.
- \( x \) is the independent variable.
- \( \beta_0 \) is the intercept.
- \( \beta_1 \) is the slope.
- \( \epsilon \) is the error term.
- Multiple Linear Regression: Extends simple linear regression by modeling the relationship between the dependent variable and multiple independent variables. For \( p \) independent variables, the formula is:
\[
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p + \epsilon
\]
- \( y \) is the dependent variable.
- \( x_1, x_2, \ldots, x_p \) are the independent variables.
- \( \beta_0 \) is the intercept.
- \( \beta_1, \beta_2, \ldots, \beta_p \) are the coefficients.
- \( \epsilon \) is the error term.
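The two formulas above can be fitted by least squares with a few lines of NumPy. The following is a minimal sketch, not a definitive implementation: the synthetic data, the true coefficient values, and all variable names are assumptions chosen for illustration.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 2 + 3x plus small noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 3.0 * x + rng.normal(0, 0.5, size=200)

# Simple linear regression: closed-form least-squares estimates
# slope = covariance(x, y) / variance(x), intercept = mean(y) - slope * mean(x)
beta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta_0 = y.mean() - beta_1 * x.mean()

# Multiple linear regression: two features, true model y = 1 + 0.5*x1 - 2*x2
X = rng.normal(size=(200, 2))
y_multi = 1.0 + 0.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, size=200)

# Prepend a column of ones so the first coefficient acts as the intercept
X_design = np.column_stack([np.ones(len(X)), X])
coeffs, *_ = np.linalg.lstsq(X_design, y_multi, rcond=None)

print("Simple regression: intercept, slope =", beta_0, beta_1)
print("Multiple regression: intercept and coefficients =", coeffs)
```

The estimates should land close to the values used to generate the data, with the gap shrinking as the noise decreases or the sample grows.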
Assumptions of Linear Regression
- Linearity: The relationship between the dependent and independent variables should be linear.
- Independence: Observations should be independent of each other.
- Homoscedasticity: The residuals (errors) should have constant variance at every level of the independent variables.
- Normality: The residuals should be approximately normally distributed.
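The assumptions above are usually checked by inspecting the residuals of a fitted model. The sketch below computes one rough numerical indicator per assumption on synthetic data. The data, the variable names, and the choice of indicators are assumptions made for illustration; these are quick heuristics, not formal statistical tests, and residual plots are often more informative.

```python
import numpy as np

# Synthetic data (an assumption for illustration) that satisfies the assumptions
rng = np.random.default_rng(1)
n = 500
x = rng.uniform(0, 10, size=n)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, size=n)

# Fit by least squares and compute residuals
A = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ beta
residuals = y - fitted

# Linearity: residuals should not correlate with a curvature term in x
curvature_corr = np.corrcoef(residuals, (x - x.mean()) ** 2)[0, 1]

# Independence: lag-1 autocorrelation of residuals in observation order
# (only meaningful when the observations have a natural order, such as time)
lag1_corr = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]

# Homoscedasticity: residual variance for low vs. high fitted values
order = np.argsort(fitted)
low, high = residuals[order[: n // 2]], residuals[order[n // 2:]]
variance_ratio = high.var(ddof=1) / low.var(ddof=1)

# Normality: skewness and excess kurtosis are both near 0 for a normal sample
z = (residuals - residuals.mean()) / residuals.std()
skewness = np.mean(z ** 3)
excess_kurtosis = np.mean(z ** 4) - 3.0

print("Curvature correlation:", curvature_corr)
print("Lag-1 autocorrelation:", lag1_corr)
print("Variance ratio (high/low):", variance_ratio)
print("Skewness, excess kurtosis:", skewness, excess_kurtosis)
```

Values near 0 for the correlations, skewness, and excess kurtosis, and a variance ratio near 1, are consistent with the assumptions; large departures suggest a closer look.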
Evaluation Metrics
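- Mean Squared Error (MSE): Measures the average squared difference between observed and predicted values. \[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]
- R-squared (\( R^2 \)): Represents the proportion of variance in the dependent variable that is predictable from the independent variables. \[ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \]
- Adjusted R-squared: A version of \( R^2 \) that penalizes additional predictors, so it increases only when a new predictor improves the fit more than would be expected by chance. With \( n \) observations and \( p \) predictors: \[ R^2_{\text{adj}} = 1 - (1 - R^2) \frac{n - 1}{n - p - 1} \]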
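To make the metrics concrete, here is a small sketch that computes all three directly with NumPy. The helper name `regression_metrics`, its `n_predictors` parameter, and the sample values are hypothetical, introduced only for this illustration. The adjusted R-squared uses the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1), with n observations and p predictors.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_predictors):
    """Return (MSE, R-squared, adjusted R-squared) for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size

    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares

    mse = ss_res / n
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)
    return mse, r2, adj_r2

# Made-up observed and predicted values for a model with one predictor
y_true = [3.0, 5.0, 7.0, 9.0, 11.0]
y_pred = [2.5, 5.5, 7.0, 9.5, 10.5]

mse, r2, adj_r2 = regression_metrics(y_true, y_pred, n_predictors=1)
print(f"MSE: {mse:.4f}")                   # 0.2000
print(f"R-squared: {r2:.4f}")              # 0.9750
print(f"Adjusted R-squared: {adj_r2:.4f}")  # 0.9667
```

Here the residual sum of squares is 1.0 and the total sum of squares is 40.0, which gives the values shown in the comments.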
Implementation Example using Scikit-Learn
Here's how you can implement linear regression in Python using the Scikit-Learn library:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
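# Load dataset (example using the California housing dataset;
# load_boston was removed in scikit-learn 1.2, and the data is downloaded on first use)
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
data = pd.DataFrame(housing.data, columns=housing.feature_names)
# Target: median house value per district, in units of $100,000
data['PRICE'] = housing.target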
# Define features and target variable
X = data.drop('PRICE', axis=1)
y = data['PRICE']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')
This example demonstrates loading a dataset, splitting it into training and testing sets, training a linear regression model, making predictions, and evaluating the model's performance.
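After fitting, the learned intercept and coefficients are available on the model as `intercept_` and `coef_`, which correspond to \( \beta_0 \) and the remaining \( \beta \) terms in the formulas above. The sketch below shows this on synthetic data generated with `make_regression`; the dataset, its parameters, and the variable names are assumptions for illustration and are not the data used in the example above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Synthetic data with known coefficients (an assumption for illustration)
X, y, true_coef = make_regression(
    n_samples=200,
    n_features=3,
    n_informative=3,
    noise=5.0,
    bias=10.0,        # the true intercept
    coef=True,        # also return the coefficients used to generate y
    random_state=0,
)

model = LinearRegression()
model.fit(X, y)

# Compare the estimates with the values that generated the data
print("Estimated intercept:", model.intercept_)
print("Estimated coefficients:", model.coef_)
print("True coefficients:", true_coef)
```

Because the data is generated from a linear model with moderate noise, the estimates should be close to the true intercept of 10 and the true coefficients.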