I'm trying to normalize a matrix by applying (X - means) / variance to each row.

Since I am implementing this with MapReduce, I first calculate the mean and variance of each column, and then map each row with:

   matrix.map(lambda X: (X - means) / variance)

But I want to ignore the first element in each row X, which is my target column containing only 1s and 0s.

How can I do this?

1 Answer

If A is a numpy array of shape (m, n + 1) and you also have arrays mu and s2 of shape (n,) holding the mean and variance of each column except the first one, you can do your normalization as follows:

A[:, 1:] = (A[:, 1:] - mu) / s2

To understand what is going on, you need to understand how broadcasting works. A[:, 1:] has shape (m, n) while mu and s2 have shape (n,), so the latter two have a 1 prepended to their shape to match the number of dimensions of the first. They are treated as (1, n) arrays, and during the arithmetic operations the values in their single row are broadcast to all m rows.
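The broadcasting described above can be checked on a small array. This is only a sketch: the toy values in A are made up for illustration, with column 0 standing in for the 0/1 target.

```python
import numpy as np

# Hypothetical toy data: column 0 is the 0/1 target, the rest are features.
A = np.array([[1.0, 2.0, 10.0],
              [0.0, 4.0, 20.0],
              [1.0, 6.0, 30.0]])

mu = A[:, 1:].mean(axis=0)   # shape (2,)
s2 = A[:, 1:].var(axis=0)    # shape (2,)

# (3, 2) combined with (2,): mu and s2 act as (1, 2) arrays
# whose single row is reused for every row of A[:, 1:].
normalized = (A[:, 1:] - mu) / s2

# The same computation with the extra axis written out explicitly.
explicit = (A[:, 1:] - mu[np.newaxis, :]) / s2[np.newaxis, :]
```

Both expressions give the same (3, 2) result; the explicit np.newaxis form is only there to show what broadcasting does implicitly.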

If you are not already doing so, your mean and variance arrays can be calculated efficiently as

mu = A[:, 1:].mean(axis=0)
s2 = A[:, 1:].var(axis=0)

Both var and std accept a ddof argument, so pass ddof=1 if you want the sample estimate rather than the population one; see the docs.

On a separate note, normalization is usually done by dividing by the standard deviation, not the variance.
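As a sketch of that variant, dividing by the standard deviation instead leaves each feature column with zero mean and unit standard deviation. The array A and the name sd are illustrative, not taken from the question.

```python
import numpy as np

# Hypothetical toy data: column 0 is the 0/1 target, the rest are features.
A = np.array([[1.0, 2.0, 10.0],
              [0.0, 4.0, 20.0],
              [1.0, 6.0, 30.0]])

mu = A[:, 1:].mean(axis=0)
sd = A[:, 1:].std(axis=0)    # population standard deviation (ddof=0)

# Work on a copy so the original data is kept; column 0 is left untouched.
Z = A.copy()
Z[:, 1:] = (A[:, 1:] - mu) / sd
```

Note that A needs a floating-point dtype here: assigning the result into an integer array would truncate the normalized values.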

2 Comments

Thanks. I knew about the mean and var methods, but I think they only suit small datasets. For large datasets, I have to implement them with MapReduce. In that case, I need to map each row so that the returned array is normalized, ignoring the first column.
np.concatenate((X[:1], (X[1:] - mean) / std_var)) is what I want ;)
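For the MapReduce case raised in these comments, one possible sketch follows. The thread does not name the framework, so it is simulated here with Python's built-in map and functools.reduce. The helper names row_stats, merge_stats and normalize_row are hypothetical, and the statistics are accumulated as (count, sum, sum of squares) per column.

```python
from functools import reduce

import numpy as np


def row_stats(row):
    """Map step: emit (count, sum, sum of squares) for the feature columns."""
    x = row[1:]
    return 1, x, x * x


def merge_stats(a, b):
    """Reduce step: add two partial (count, sum, sum of squares) triples."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]


def normalize_row(row, mean, std):
    """Map step: keep the target in position 0, standardize the rest."""
    return np.concatenate((row[:1], (row[1:] - mean) / std))


# Hypothetical toy rows: element 0 is the 0/1 target.
rows = [np.array([1.0, 2.0, 10.0]),
        np.array([0.0, 4.0, 20.0]),
        np.array([1.0, 6.0, 30.0])]

n, total, total_sq = reduce(merge_stats, map(row_stats, rows))
mean = total / n
std = np.sqrt(total_sq / n - mean ** 2)   # population std via E[x^2] - E[x]^2

normalized = [normalize_row(r, mean, std) for r in rows]
```

The slice row[:1] keeps the target as a one-element array, which is what np.concatenate needs. The E[x^2] - E[x]^2 formula is convenient for a single reduce pass but can lose precision when the mean is large compared with the spread, so treat it as a starting point rather than a robust implementation.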
