Linear Regression
Linear Regression — Made Simple
Ever wondered how a bank predicts your loan amount, or how a website guesses your budget? It often starts with this one simple idea.
What Is Linear Regression?
Think about this: the bigger a pizza is, the more it costs. The more hours you study, the better your exam score. These are straight-line relationships — and that is exactly what linear regression finds.
Linear regression is a machine learning method that draws the best possible straight line through your data and uses that line to predict future values.
Real-life example: You want to predict a house's price based on its size. Linear regression looks at thousands of past house sales, draws a line showing how price goes up as size goes up, then uses that line to predict prices for new houses.
The Formula (Don't Be Scared!)
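It's just the straight-line equation you learned in school:
y = mx + b
Here y is the value you want to predict, x is the input, m is the slope, and b is the starting value (where the line crosses the vertical axis).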
The model's job is to find the right slope and starting value so the line fits the data as closely as possible. In ML, slope is called a weight and the starting value is called a bias.
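To make the weight and bias idea concrete, here is a minimal sketch of a prediction written as a plain Python function. The function name `predict` and the numbers are made up for illustration; they do not come from any real data set.

```python
def predict(x, weight, bias):
    """Return the line's output for input x: weight * x + bias."""
    return weight * x + bias


# Hypothetical values: every extra square foot adds $250 to a $20,000 base.
price = predict(1000, weight=250, bias=20000)
print(price)  # 270000
```

Training a model means finding the `weight` and `bias` values that make outputs like this one match the real data as closely as possible.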
How Does the Model Learn?
The model starts with a random guess — a line that fits the data badly. Then it keeps making small adjustments until the line fits as well as possible. Here is how:
1. Use the current line to guess the output for each example in the training data.
2. Compare each guess to the real answer. Add up all the errors. The total is called the cost, and lower is better.
3. Nudge the slope and starting value slightly in the direction that reduces the cost. This step is called gradient descent: the model rolls downhill toward a better answer.
4. Repeat steps 1–3 hundreds of times. Each round, the line gets a little better until it stops improving.
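The loop above can be sketched in a few lines of plain Python. This is only an illustration under some assumptions: the toy data is made up and lies exactly on the line y = 2x + 1, the cost is the mean squared error, the starting guess is zero rather than random so the run is repeatable, and the learning rate and number of rounds are hand-picked for this data.

```python
# Toy data that lies exactly on the line y = 2x + 1 (made up for illustration).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]


def cost(weight, bias):
    """Mean squared error of the line on the toy data."""
    return sum((weight * x + bias - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(learning_rate=0.05, rounds=2000):
    weight, bias = 0.0, 0.0  # starting guess (zero instead of random)
    n = len(xs)
    for _ in range(rounds):
        # Steps 1 and 2: guess with the current line, compare to the real answers.
        errors = [weight * x + bias - y for x, y in zip(xs, ys)]
        # Step 3: slope of the cost for each parameter, then a small downhill nudge.
        grad_weight = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_bias = 2 / n * sum(errors)
        weight -= learning_rate * grad_weight
        bias -= learning_rate * grad_bias
    return weight, bias


weight, bias = train()
print(f"weight={weight:.3f}, bias={bias:.3f}")  # weight=2.000, bias=1.000
```

If the learning rate is set too high, the nudges overshoot and the cost grows instead of shrinking, so it is worth printing `cost(weight, bias)` every few rounds when experimenting.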
The error formula: The model measures its mistakes using Mean Squared Error (MSE) — it squares each mistake (so positive and negative errors don't cancel out) and takes the average. A lower MSE means a better line.
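Here is a small sketch of that calculation in plain Python, using made-up numbers. The helper name `mean_squared_error` is just for this example. Notice that the raw errors (+10, -10 and 0) would cancel to zero if you simply added them, which is why each one is squared first.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between real and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared_errors) / len(squared_errors)


# Made-up values: the predictions miss by +10, -10 and 0.
actual = [100, 200, 300]
predicted = [110, 190, 300]

mse = mean_squared_error(actual, predicted)
print(f"MSE: {mse:.1f}")  # (100 + 100 + 0) / 3, about 66.7
```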
Quick Python Example
```python
from sklearn.linear_model import LinearRegression

# Training data: house sizes (sqft) and prices ($)
sizes = [[600], [800], [1000], [1200], [1500]]
prices = [150000, 200000, 250000, 290000, 370000]

# Train the model
model = LinearRegression()
model.fit(sizes, prices)

# Predict the price of an 1100 sqft house
result = model.predict([[1100]])
print(f"Predicted price: ${result[0]:,.0f}")
```
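If you are curious which line the model found, you can read the weight and bias back from it. The sketch below refits the same toy data so it runs on its own; `coef_` and `intercept_` are the attributes scikit-learn uses for the learned slope and starting value. One thing to keep in mind: scikit-learn's `LinearRegression` solves for the best line directly with least squares rather than running the step-by-step loop described earlier, but the line it finds plays the same role.

```python
from sklearn.linear_model import LinearRegression

# Same toy data as the example above.
sizes = [[600], [800], [1000], [1200], [1500]]
prices = [150000, 200000, 250000, 290000, 370000]

model = LinearRegression().fit(sizes, prices)

weight = model.coef_[0]   # slope: dollars per extra square foot
bias = model.intercept_   # starting value: predicted price at 0 sqft

print(f"weight (slope): {weight:.2f} dollars per sqft")
print(f"bias (starting value): {bias:.2f} dollars")

# A prediction is just weight * size + bias.
manual = weight * 1100 + bias
print(f"Manual prediction for 1100 sqft: ${manual:,.0f}")
```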
How Do We Know If It Worked?
After training, test the model on data it has never seen before. These three numbers tell you how well it performed:
| Metric | What it tells you | Good score |
|---|---|---|
| RMSE | Average prediction error in the same units as your answer (e.g. dollars). Easy to understand. | As low as possible |
| MAE | Similar to RMSE but less affected by very large mistakes. | As low as possible |
| R² | How much of the variation in the answers the model explains. 1 = perfect. 0 = no better than always predicting the average. It can go below 0 on new data if the model does worse than that. | Close to 1 |
Example: If your model predicts house prices and has an RMSE of $12,000, its predictions are typically off by around $12,000, with big misses counting more heavily than small ones. Whether that is acceptable depends on the price range you are working with.
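As a rough sketch, here is how the three numbers could be computed with `sklearn.metrics`. The actual and predicted prices are made up for illustration; in practice they would come from a test set the model has never seen. RMSE is taken as the square root of the mean squared error.

```python
from math import sqrt

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up test-set prices and model predictions (errors: 10k, 10k, 10k, 30k).
actual = [200000, 250000, 300000, 350000]
predicted = [210000, 240000, 310000, 320000]

rmse = sqrt(mean_squared_error(actual, predicted))
mae = mean_absolute_error(actual, predicted)
r2 = r2_score(actual, predicted)

print(f"RMSE: ${rmse:,.0f}")  # RMSE: $17,321
print(f"MAE:  ${mae:,.0f}")   # MAE:  $15,000
print(f"R²:   {r2:.3f}")      # R²:   0.904
```

RMSE comes out higher than MAE here because the single $30,000 miss is squared before averaging, so it counts for more.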
Common Mistakes to Avoid
| Mistake | What goes wrong | Easy fix |
|---|---|---|
| Not scaling your numbers | A feature measured in big numbers, like "salary in dollars," dwarfs a feature like "age in years." Gradient descent then learns slowly, and regularisation penalises the features unevenly. | Scale all features to a similar range before training. |
| Extreme outliers | One or two wild data points can pull the whole line off track. | Check your data and fix or remove obvious errors. |
| The relationship is curved, not straight | If the real pattern is a curve, a straight line will always give bad predictions. | Plot your data first! If it curves, use a different model. |
| Too many features, not enough data | The model memorises the training examples instead of learning the real pattern. | Use fewer features or add regularisation (see below). |
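For the scaling fix in the table above, one common option is scikit-learn's `StandardScaler`, which shifts each feature to an average of 0 and a spread of 1. The sketch below uses made-up salary and age values; other scalers would work too.

```python
from sklearn.preprocessing import StandardScaler

# Made-up rows: [salary in dollars, age in years]
people = [[30000, 25], [52000, 40], [75000, 33], [110000, 58]]

scaler = StandardScaler()
scaled = scaler.fit_transform(people)  # learn each column's mean and spread, then rescale

print(scaled.round(2))

# New data must be scaled with the SAME scaler that was fitted on the training data.
new_person = scaler.transform([[60000, 45]])
print(new_person.round(2))
```

After scaling, both columns sit in a similar range, so neither one dominates just because of its units.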
What Is Regularisation?
Sometimes the model tries too hard and memorises the training data. It gets great scores on training but fails on new data. This is called overfitting. Regularisation is a simple penalty that keeps the model from going overboard.
Ridge (L2): shrinks all weights a little
Great when all your features probably matter. No features get completely removed.
Lasso (L1): sets some weights to zero
Great when you think many features are useless. Lasso automatically removes them.
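To see the difference between the two penalties, here is a small sketch using scikit-learn's `Ridge` and `Lasso`. The data is randomly generated for illustration: only the first of three features affects the answer, and the other two are pure noise. The `alpha` values (the strength of the penalty) are arbitrary picks for this toy data, not recommendations.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Only the first feature matters; the other two are noise (made-up data).
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("Ridge weights:", ridge.coef_.round(3))  # all three stay non-zero
print("Lasso weights:", lasso.coef_.round(3))  # the two noise features drop to 0
```

In a real project, `alpha` is usually chosen by trying several values and keeping the one that scores best on held-out data.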
When Should You Use It?
Good fit
You want to predict a number (not a yes/no). The data looks roughly like a straight line. You need results that are easy to explain.
Not the right tool
You want to predict a category (e.g. spam/not spam). The relationship is clearly curved. You have complex, messy data — try a neural network instead.
Always try this first. Before jumping to complicated models, try linear regression. If it already gives good results, you do not need anything fancier — and simple models are easier to trust, explain, and maintain.
Key Takeaways
- Linear regression finds the best straight line through your data. It learns the line by reducing the average error step by step.
- Always plot your data before you start. If it curves, use a different model.
- Scale your features and check for outliers. These two steps alone fix most common problems before training.
- Always test on data the model has never seen, not just the training data.
Further Reading
- Géron, Aurélien — Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow (O'Reilly, 2022)
- James et al. — An Introduction to Statistical Learning (Springer, 2021) — free PDF online
- Ng, Andrew — Machine Learning Specialization on Coursera — great beginner video course
- scikit-learn docs — sklearn.linear_model
