Kaemon Lovendahl

Posted on May 12 • Originally published at glitchedgoblet.blog

Intro to Machine Learning: A Practical Guide for Curious Coders

#machinelearning #python #tutorial #ai

TL;DR
This post shows you how to take a real‑world dataset, build a decision‑tree and random‑forest regressor in Python, and understand why each step matters.
Ideal for existing devs dabbling in data‑science or anyone who wants a no‑fluff path from “What even is ML?” to “I shipped my first model!”

Why Machine Learning, Anyway?

Machine Learning (ML) lets software spot patterns in data and make predictions on brand‑new inputs.
If you’ve ever:

auto‑completed a sentence
filtered spam
asked a voice assistant to play a song on Spotify

…you’ve benefited from ML.
In 2025 the tooling is friendly enough that any JavaScript web dev can train useful models without a PhD—or even leaving VS Code.

We'll focus on supervised learning (predicting a target value given labeled examples).

Prerequisites

Tool	Why you need it	Install
Python ≥ 3.10	The lingua franca of ML	`brew install python` or https://python.org
pandas	Data wrangling swiss‑army knife	`pip install pandas`
scikit‑learn	Classic ML algorithms & utilities	`pip install scikit-learn`
Jupyter / VS Code Notebooks	Interactive coding & charts	`pip install notebook` or VS Code Python ext
matplotlib	Lightweight plotting	`pip install matplotlib`

Tip: If you’re on Windows, install WSL 2 + Ubuntu and run everything in a Linux shell. Makes life much happier.

1. Meet the Data

Before we even speak the word algorithm, we need to know what we’re working with. Think of Exploratory Data Analysis (EDA) as a code‑review for your dataset: it surfaces shape, size, and quirks so you don’t step on a landmine later. The helpers we call here each pull back a different curtain.

head() gives a sneak peek at the first five rows. It's perfect for checking that everything loaded correctly.
describe() spits out summary stats (count, mean, standard deviation, min/max, and quartiles, including the median) so you can eyeball distributions and spot outliers.
.columns prints every field so you know what you have to work with.
dropna() quickly removes rows containing NaN so your model doesn’t choke on missing values. Skip this step and you’re basically deploying straight to prod without tests.

In this guide, we'll use the California Housing dataset that scikit‑learn can fetch for you (it downloads on first use). It contains 20,640 rows of census block‑group statistics from the 1990 US Census and a target column, MedHouseVal (median house value in units of $100,000).

from sklearn.datasets import fetch_california_housing
import pandas as pd

housing_raw = fetch_california_housing(as_frame=True)
cal_df = housing_raw.frame  # neat: already a pandas DataFrame!

Quick EDA (Exploratory Data Analysis)

cal_df.head()
cal_df.describe(include='all')
cal_df.columns  # view every column name
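Two more checks are worth a line each at this stage: the table's dimensions and a per‑column count of missing values. Here is a minimal sketch on a tiny hand‑made frame (toy_df and its values are made up for illustration, not taken from the housing data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for a freshly loaded dataset
toy_df = pd.DataFrame({
    "MedInc": [8.3, np.nan, 7.2, 5.6],
    "HouseAge": [41, 21, 52, np.nan],
    "MedHouseVal": [4.5, 3.6, 3.5, 3.4],
})

n_rows, n_cols = toy_df.shape             # (rows, columns)
missing_per_column = toy_df.isna().sum()  # NaN count for each column

print(f"{n_rows} rows x {n_cols} columns")
print(missing_per_column)
```

Running the same two lines on cal_df right after loading is a cheap way to confirm how clean the data really is.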

The dataset is very clean (no missing values), but real life is messier. For demonstration purposes, let’s simulate some missing values in the MedInc column (median income). We’ll use NumPy’s random number generator to pick 5 % of rows and set their MedInc to NaN.

# simulate 5% missing entries in 'MedInc' (median income)
import numpy as np
rng = np.random.default_rng(42)
mask = rng.choice(cal_df.index, size=int(0.05 * len(cal_df)), replace=False)
cal_df.loc[mask, 'MedInc'] = np.nan

# Drop rows with missing values – quick‑n‑dirty
clean_df = cal_df.dropna(axis=0)

When to drop vs. impute?
Drop if <5 % of rows are affected and you have plenty of data.
Impute (fill in) if the holes are bigger or systematic. Check out sklearn.impute.SimpleImputer for options. We'll discuss this in more detail in another post.
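To make the "impute" option concrete, here is a minimal sketch of SimpleImputer filling a hole with the column median. The toy column is hypothetical and chosen so the median is easy to check by eye:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy column with one hole in it
toy_df = pd.DataFrame({"MedInc": [2.0, np.nan, 4.0, 8.0]})

imputer = SimpleImputer(strategy="median")           # median of 2, 4, 8 is 4
filled = imputer.fit_transform(toy_df[["MedInc"]])   # returns a 2-D NumPy array

toy_df["MedInc_filled"] = filled[:, 0]
print(toy_df)
```

In a real workflow you would fit the imputer on the training split only and reuse it on validation data, so no information leaks from the rows you evaluate on.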

2. Choose Target & Features

Supervised learning is essentially a fancy mapping exercise: given inputs (features), predict an output (target). Declaring them explicitly forces you to articulate the question you’re asking. The target, MedHouseVal in our case, is the single column we want the model to predict. The features are everything the model is allowed to look at to make that prediction. Choosing a sensible subset reduces noise, speeds training, and keeps your model focused on relevant signals. Too many irrelevant columns = garbage‑in, garbage‑out; omitting an informative feature handicaps accuracy from the outset.

TARGET = 'MedHouseVal'
FEATURES = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population',
            'AveOccup', 'Latitude', 'Longitude']

y = clean_df[TARGET]
X = clean_df[FEATURES]

y is a Series (one‑dimensional).
X is a DataFrame (two‑dimensional) containing only numeric predictors.

3. Split: Train vs. Validation

Machine‑learning models love to memorize. To see whether they’ve actually learned anything, we quarantine a slice of data they never see during training. This is known as the validation set. It’s like an exam after the study session: ace the homework (training data) but bomb the exam (validation), and the model is probably overfitting. Conversely, doing poorly everywhere signals underfitting.
train_test_split shuffles and partitions for us. An 80 / 20 split is a classic choice (scikit‑learn itself falls back to 75 / 25 if you don’t pass test_size), and you can tweak the ratio based on dataset size. The key is that validation data remains untouched until evaluation time.

Never evaluate on the data you trained on: a model that has merely memorized it will look deceptively good. train_test_split keeps us honest.

from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

20% for validation is a common default. For production, add a third split (test) or use k‑fold cross‑validation.
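One way to get that third split is simply to call train_test_split twice. The sketch below uses synthetic stand‑in data (X_all, y_all and the other names are hypothetical) so the resulting sizes are easy to verify:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1,000 rows, 3 features
rng = np.random.default_rng(0)
X_all = rng.normal(size=(1000, 3))
y_all = rng.normal(size=1000)

# First carve off 20% as the final test set...
X_rest, X_test, y_rest, y_test = train_test_split(
    X_all, y_all, test_size=0.2, random_state=0
)
# ...then take 25% of the remaining 80% for validation (= 20% of the total)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(len(X_tr), len(X_va), len(X_test))
```

The test set stays locked away until the very end. You tune against validation and report the final number on test.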

4. Build a Baseline Model

A baseline is your yardstick. It's something quick, simple, and interpretable that sets the minimum bar a fancier model must beat. Decision trees are perfect for this: they train in seconds, are easy to visualize, and yield a concrete metric such as MAE. If later experiments can’t outperform the tree, they’re not worth the added complexity. Always start here! You’ll be surprised how often a straightforward model wins.

What is a Decision Tree in Machine Learning? It is a flowchart-like structure where each internal node tests a feature (or attribute), each branch represents the outcome of that test, and each leaf node holds a prediction (a class label for a classifier, a number for a regressor). The paths from root to leaf act as decision rules. An easy way to understand this is to think of it as a series of questions that lead to a decision. For example, if you were trying to decide whether to go outside based on the weather, you might ask "Is it raining?" If the answer is yes, you might then ask "Is it cold?" and so on. The DecisionTreeRegressor learns these questions for you, using thresholds on numerical data.
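If you want to see those "questions" for yourself, scikit‑learn can print a fitted tree as text. This is a small sketch on hypothetical toy data where the target depends only on MedInc, so the tree should need just one question:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Hypothetical toy data: columns are MedInc and HouseAge;
# the target jumps from 1.0 to 4.0 once MedInc passes 5
X_toy = np.array([
    [1.0, 10], [2.0, 40], [3.0, 20], [4.0, 30],
    [6.0, 15], [7.0, 35], [8.0, 25], [9.0, 5],
])
y_toy = np.array([1.0, 1.0, 1.0, 1.0, 4.0, 4.0, 4.0, 4.0])

toy_tree = DecisionTreeRegressor(max_depth=2, random_state=0)
toy_tree.fit(X_toy, y_toy)

# Print the learned questions as an indented text flowchart
rules = export_text(toy_tree, feature_names=["MedInc", "HouseAge"])
print(rules)

# A new row with MedInc = 8 follows the "high income" branch
prediction = toy_tree.predict([[8.0, 12]])[0]
print(prediction)
```

On real data the printout gets long quickly, which is one reason to cap tree size while you are still exploring.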

from sklearn.tree import DecisionTreeRegressor

baseline = DecisionTreeRegressor(random_state=0)
baseline.fit(X_train, y_train)

Predict & Measure

The MAE (Mean Absolute Error) is the average of the absolute differences between predictions and actual outcomes. It’s a measure of how far off predictions are from actual outcomes, on average.

from sklearn.metrics import mean_absolute_error

val_pred = baseline.predict(X_val)
mae_baseline = mean_absolute_error(y_val, val_pred)
print(f"Baseline MAE: {mae_baseline:,.2f} (×$100 000)")

On my run: Baseline MAE: 0.33 (≈ $33,000 off on average).
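To demystify the metric, here is the same calculation done by hand on three made‑up predictions (hypothetical values, in units of $100,000), next to scikit‑learn's helper:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical values in units of $100,000
y_true = np.array([2.0, 3.0, 5.0])
y_hat = np.array([2.5, 2.0, 5.0])

# Absolute errors are 0.5, 1.0 and 0.0, so their mean is 0.5
mae_manual = np.mean(np.abs(y_true - y_hat))
mae_sklearn = mean_absolute_error(y_true, y_hat)

print(mae_manual, mae_sklearn)
```

Because MAE lives in the same units as the target, a value of 0.5 here reads directly as "off by about $50,000 on average".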

5. Taming Under‑ & Over‑fitting

I've used the terms overfitting and underfitting a few times now, but what do they mean?

Overfitting happens when a model captures random noise in the training data, like memorizing every house’s exact sale price, so it collapses on fresh inputs.

Underfitting is the opposite: the model is too simple to grasp the real signal, yielding high error everywhere.

Hyper‑parameters such as max_leaf_nodes let us dial complexity up or down. Plot error versus model capacity and you’ll see a classic U‑shaped curve. Your goal is the bottom of that valley where validation error is lowest. It's a good idea to test with different values of max_leaf_nodes to see how it affects the model's performance.

A Decision Tree with too many leaves memorises the training set (overfit); with too few leaves, it under‑fits.

from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

candidates = [5, 25, 125, 625, 3125]
for leaves in candidates:
    model = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0)
    model.fit(X_train, y_train)
    pred = model.predict(X_val)
    print(f"{leaves:>5} leaves → MAE {mean_absolute_error(y_val, pred):.3f}")

Typical output:

    5 leaves → MAE 0.722
   25 leaves → MAE 0.465
  125 leaves → MAE 0.336
  625 leaves → MAE 0.343
 3125 leaves → MAE 0.356

The curve bottoms out around 125 leaves. Larger trees score slightly worse on validation, a sign that they are starting to overfit.
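A handy way to spot overfitting is to print training error next to validation error. The sketch below does that on a synthetic noisy sine wave (stand‑in data, not the housing set; all names are hypothetical) and then picks the setting with the lowest validation MAE:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: a noisy sine wave
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(500, 1))
y_syn = np.sin(X_syn[:, 0]) + rng.normal(scale=0.3, size=500)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

results = {}
for leaves in [2, 8, 32, 128, None]:  # None = no cap on the number of leaves
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0)
    tree.fit(X_tr, y_tr)
    train_mae = mean_absolute_error(y_tr, tree.predict(X_tr))
    val_mae = mean_absolute_error(y_va, tree.predict(X_va))
    results[leaves] = (train_mae, val_mae)
    print(f"{str(leaves):>5} leaves: train MAE {train_mae:.3f}, val MAE {val_mae:.3f}")

# Pick the setting with the lowest validation error
best_leaves = min(results, key=lambda k: results[k][1])
print("Best max_leaf_nodes:", best_leaves)
```

The uncapped tree drives training error to essentially zero while its validation error stays well above that. That gap is what overfitting looks like.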

6. Enter the Random Forest

A Random Forest sidesteps the overfitting trap by training many slightly different trees and averaging their predictions. Each tree sees a bootstrap sample of rows and can be limited to a random subset of features at each split (the max_features setting), so the trees’ individual errors tend to average out. The ensemble is robust, usually outperforms a single tree, and demands little hyper‑parameter fussing, making it a production workhorse.

A Random Forest trains many trees on random subsets of rows and columns, then averages their predictions.
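The "averages their predictions" part is literal, and you can check it. This sketch (synthetic data from make_regression, hypothetical variable names) asks every tree in a small forest for its own prediction and compares the hand‑computed mean with forest.predict:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data
X_syn, y_syn = make_regression(
    n_samples=200, n_features=4, noise=10.0, random_state=0
)

forest_demo = RandomForestRegressor(n_estimators=25, random_state=0)
forest_demo.fit(X_syn, y_syn)

# Ask every tree for its own opinion, then average by hand
per_tree = np.array([tree.predict(X_syn) for tree in forest_demo.estimators_])
manual_average = per_tree.mean(axis=0)

print(per_tree.shape)  # one row of predictions per tree
print(np.allclose(manual_average, forest_demo.predict(X_syn)))
```

Each individual tree is still free to overfit its own bootstrap sample. The averaging is what smooths those quirks out.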

from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(
    n_estimators=300,        # number of trees
    max_leaf_nodes=125,     # carry over our best leaf size
    random_state=0
)
forest.fit(X_train, y_train)
print("Forest MAE:", mean_absolute_error(y_val, forest.predict(X_val)))

On my laptop: Forest MAE: 0.26—that’s a solid 20%-ish bump over the single tree.

Sprinkle in min_samples_leaf, max_depth, and feature engineering (log‑scaling skewed columns, polynomial terms, etc.) before reaching for “deeper” algorithms like Gradient Boosting or XGBoost.
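As a taste of the feature‑engineering side, here is a minimal sketch of log‑scaling a skewed column. The Population values are invented for illustration, and log1p is just one reasonable choice of transform:

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed column: most blocks are small, one is huge
toy_df = pd.DataFrame({"Population": [0, 9, 99, 999, 99999]})

# log1p(x) = log(1 + x), which stays defined when x is 0
toy_df["Population_log"] = np.log1p(toy_df["Population"])

print(toy_df)
print("Skew before:", toy_df["Population"].skew())
print("Skew after: ", toy_df["Population_log"].skew())
```

Tree‑based models are fairly indifferent to monotonic rescaling like this, so it tends to matter more once you try linear models or distance‑based ones.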

7. What About Categorical Data?

Numbers aren’t the whole story. Real datasets mix text labels like red, NY, or Saturday. Most algorithms can’t parse strings, so we convert them to numeric vectors, most commonly via one‑hot encoding, where each category becomes its own 0/1 column. ColumnTransformer lets us funnel numeric and categorical pipelines together so preprocessing and model training stay in lock‑step. This guarantees that the same transformations applied during training are also applied at inference time, preventing those dreaded “feature mismatch” errors.
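Here is what one‑hot encoding looks like on its own, before it gets wrapped in a pipeline. The day column is a hypothetical example, and it includes a category at prediction time that the encoder never saw during fitting:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column
train_days = pd.DataFrame({"day": ["Saturday", "Monday", "Saturday"]})
new_days = pd.DataFrame({"day": ["Monday", "Friday"]})  # "Friday" was never seen

encoder = OneHotEncoder(handle_unknown="ignore")
train_encoded = encoder.fit_transform(train_days).toarray()  # sparse -> dense
new_encoded = encoder.transform(new_days).toarray()

print(encoder.categories_)  # learned categories, one array per input column
print(train_encoded)
print(new_encoded)          # the unseen category becomes an all-zero row
```

With handle_unknown='ignore', an unseen label is encoded as all zeros instead of raising an error, which is usually what you want in production.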

Our example was all numbers. Real datasets mix text, enums, dates, & more. Pipeline it!

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

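# NOTE: scikit-learn's copy of the dataset is all-numeric ('ocean_proximity'
# only appears in other versions of the data, such as the Kaggle CSV), so we
# derive a stand-in text column from Latitude purely to demo the pipeline.
demo_df = clean_df.assign(
    region=np.where(clean_df['Latitude'] >= 36, 'north', 'south')
)

cat_features = ['region']  # illustrative stand-in, not a real dataset column
num_features = FEATURES  # defined earlier

preprocess = ColumnTransformer([
    ('num', SimpleImputer(strategy='median'), num_features),
    ('cat', Pipeline([
        ('impute', SimpleImputer(strategy='most_frequent')),
        ('encode', OneHotEncoder(handle_unknown='ignore'))
    ]), cat_features)
])

model = RandomForestRegressor(n_estimators=300, random_state=0)

pipe = Pipeline([
    ('prep', preprocess),
    ('model', model)
])
# join() aligns on the row index, so each training row gets its own label
pipe.fit(X_train.join(demo_df[cat_features]), y_train)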

Why Pipelines Rock

Keep preprocessing & model together—fewer bugs when you pickle / joblib.dump it.
Fit/transform splits handled automatically.
Hyperparameter search (GridSearchCV) becomes a one‑liner.
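To back up that last point, here is a rough sketch of GridSearchCV tuning a step inside a pipeline. It uses synthetic data and a deliberately small, hypothetical pipeline (demo_pipe) so it runs quickly:

```python
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data
X_syn, y_syn = make_regression(
    n_samples=300, n_features=5, noise=15.0, random_state=0
)

demo_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("model", DecisionTreeRegressor(random_state=0)),
])

# "<step name>__<parameter>" reaches into a step of the pipeline
param_grid = {"model__max_leaf_nodes": [5, 25, 125]}

search = GridSearchCV(
    demo_pipe,
    param_grid,
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes scores, hence "neg"
    cv=5,
)
search.fit(X_syn, y_syn)

print("Best params:", search.best_params_)
print("Best CV MAE:", -search.best_score_)
```

Because the imputer lives inside the pipeline, it is re‑fit on each training fold, so the search never peeks at the fold it is scoring.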

8. Next Steps

Looking to level up beyond this intro? Here are some ideas:

Cross‑validation: sklearn.model_selection.cross_val_score for more robust metrics.
Hyper‑parameter tuning: GridSearchCV / RandomizedSearchCV or advanced tools like Optuna.
Feature importance: model.feature_importances_ plus SHAP for interpretability.
Persist your model: import joblib; joblib.dump(pipe, 'california_rf.joblib').
Serve it: FastAPI + Uvicorn → Vercel Edge Functions. (You knew a devops‑y plug was coming 😎.)
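A few of these fit in one short script. The sketch below runs 5‑fold cross‑validation, peeks at feature importances, and round‑trips the model through joblib. It uses synthetic data, and the file name demo_rf.joblib is an arbitrary placeholder:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data
X_syn, y_syn = make_regression(
    n_samples=300, n_features=5, noise=15.0, random_state=0
)
rf = RandomForestRegressor(n_estimators=50, random_state=0)

# 5-fold cross-validation: five train/validate rounds instead of one
scores = cross_val_score(
    rf, X_syn, y_syn, cv=5, scoring="neg_mean_absolute_error"
)
cv_mae = -scores  # flip the sign back to a plain MAE
print("MAE per fold:", np.round(cv_mae, 2))
print("Mean MAE:", cv_mae.mean())

# Fit on all the data, then check which features the forest leaned on
rf.fit(X_syn, y_syn)
print("Importances:", np.round(rf.feature_importances_, 3))

# Persist and reload inside a throwaway directory
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_rf.joblib")
    joblib.dump(rf, model_path)
    reloaded = joblib.load(model_path)

same_predictions = np.allclose(rf.predict(X_syn), reloaded.predict(X_syn))
print("Round-trip OK:", same_predictions)
```

The spread of the per‑fold MAE values gives a feel for how much a single train/validation split could mislead you.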

Quick‑Reference Cheatsheet

Step	Code Snippet
Load data	`fetch_california_housing(as_frame=True)`
Inspect	`df.head()`, `df.describe()`
Clean NA	`df.dropna()` or `SimpleImputer()`
Split	`train_test_split(X, y, test_size=0.2, random_state=0)`
Baseline model	`DecisionTreeRegressor()`
Evaluate	`mean_absolute_error(y_true, y_pred)`
Tune	loop over `max_leaf_nodes`, record MAE
Ensemble	`RandomForestRegressor()`
Pipeline	`ColumnTransformer` + `Pipeline`

Final Thoughts

Machine Learning isn’t magic. It’s statistics wrapped in code.
Start with a dataset, really look at it, build the simplest model that could possibly work, then iterate.
Soon you’ll wonder why you ever thought “ML” was reserved for ivory‑tower researchers.

If you build something rad, drop a comment or ping me on BlueSky. I'm always stoked to see what the community hacks together! I'm personally working on a project using data from mtgjson to predict the value of Magic: The Gathering cards. If you're interested in that, let me know and I can share my progress!

DEV Community