I am relatively new to machine learning, and I believe one of the best ways to build intuition for an algorithm is to write it from scratch before reaching for external libraries.
This regression model I wrote seems to yield reasonable results on the dataset I provided. Each row of the dataset pairs the number of hours a student studied for a test (x) with the score that same student got on the test (y).
I tried to use OOP as much as I could, instead of writing the algorithm procedurally.
Would you mind giving me your opinions and comments on this code? This also matters because I'll be adding it to my portfolio. Are any good practices missing? What would you recommend keeping or removing, whether in a professional setting or in general as a developer?
Univariate linear regression algorithm:
# Linear equation based on: y = m * x + b, which is the same as h = theta1 * x + theta0
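# For reference, the mean-squared-error cost this model minimizes, and the
# gradient descent updates implemented below (alpha is the learning rate):
#   J(b, m) = (1 / (2 * M)) * sum_{i=1..M} ((m * x_i + b) - y_i) ** 2
#   b <- b - alpha * (1 / M) * sum_{i=1..M} ((m * x_i + b) - y_i)
#   m <- m - alpha * (1 / M) * sum_{i=1..M} ((m * x_i + b) - y_i) * x_i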
import numpy as np

DATASET_PATH = "data.csv"  # Placeholder file name so the script runs; point this at the actual CSV.
class LinearRegressionModel:
    """
    Univariate linear regression model.
    """

    def __init__(self, dataset, learning_rate, num_iterations):
        """
        Class constructor.
        """
        self.dataset = np.array(dataset)
        self.b = 0  # Initial guess for the intercept 'b'.
        self.m = 0  # Initial guess for the slope 'm'.
        self.learning_rate = learning_rate
        self.num_iterations = num_iterations
        self.M = len(self.dataset)  # Number of training examples.

    def apply_gradient_descent(self):
        """
        Runs the gradient descent step 'num_iterations' times.
        """
        for _ in range(self.num_iterations):
            self.do_gradient_step()

    def do_gradient_step(self):
        """
        Performs one step of gradient descent, updating 'b' and 'm'.
        """
        b_summation = 0
        m_summation = 0
        # Accumulate the partial derivatives of the cost over all examples.
        for i in range(self.M):
            x_value = self.dataset[i, 0]
            y_value = self.dataset[i, 1]
            b_summation += ((self.m * x_value) + self.b) - y_value
            m_summation += (((self.m * x_value) + self.b) - y_value) * x_value
        # Update 'b' and 'm' together, scaled by the learning rate.
        self.b = self.b - (self.learning_rate * (1 / self.M) * b_summation)
        self.m = self.m - (self.learning_rate * (1 / self.M) * m_summation)

    def compute_error(self):
        """
        Computes the total error based on the linear regression cost
        function (mean squared error), without mutating any state.
        """
        total_error = 0
        for i in range(self.M):
            x_value = self.dataset[i, 0]
            y_value = self.dataset[i, 1]
            total_error += (((self.m * x_value) + self.b) - y_value) ** 2
        return total_error / (2 * self.M)

    def __str__(self):
        return "Results: b: {}, m: {}, Final total error: {}".format(
            round(self.b, 2), round(self.m, 2), round(self.compute_error(), 2))

    def get_prediction_based_on(self, x):
        return round(float((self.m * x) + self.b), 2)  # Cast the NumPy float to a plain Python float.


def main():
    # Loading dataset.
    school_dataset = np.genfromtxt(DATASET_PATH, delimiter=",")
    # Creating the 'LinearRegressionModel' object.
    lr = LinearRegressionModel(school_dataset, 0.0001, 1000)
    # Applying gradient descent.
    lr.apply_gradient_descent()
    # Getting some predictions.
    hours = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
    for hour in hours:
        print("Studied {} hours and got {} points.".format(hour, lr.get_prediction_based_on(hour)))
    # Printing the fitted parameters and the final error.
    print(lr)


if __name__ == "__main__":
    main()
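
As a side note, here is a rough vectorized sketch of the same gradient step using NumPy array operations instead of the Python loop. The standalone function name and signature are my own illustration, not part of the class above; it assumes the same two-column hours/scores layout and reuses the np import from the script:

def gradient_step_vectorized(x, y, b, m, learning_rate):
    # One gradient descent step over the whole column vectors 'x' and 'y'.
    errors = (m * x + b) - y  # Residual of every example at once.
    b_new = b - learning_rate * errors.mean()
    m_new = m - learning_rate * (errors * x).mean()
    return b_new, m_new

# Example usage with the dataset loaded as in main():
# data = np.genfromtxt(DATASET_PATH, delimiter=",")
# b, m = 0.0, 0.0
# for _ in range(1000):
#     b, m = gradient_step_vectorized(data[:, 0], data[:, 1], b, m, 0.0001)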
Dataset snippet:
32.502345269453031,31.70700584656992
53.426804033275019,68.77759598163891
61.530358025636438,62.562382297945803
47.475639634786098,71.546632233567777
59.813207869512318,87.230925133687393
55.142188413943821,78.211518270799232
52.550014442733818,71.300879886850353
45.419730144973755,55.165677145959123
