I've been trying to fully understand the paper "Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent" (available here), but I'm stuck on the linearization part, specifically Section 2.2. The idea is that in the infinite-width limit, the changes in the individual network parameters vanish, which permits studying the first-order Taylor approximation about the initialization:
$$f_t^{\text{lin}}(x) = f_0(x) + \nabla_{\theta} f_0(x)\big|_{\theta=\theta_0}\, \omega_t,$$
where $f_t$ is the network function at time $t$, $\theta$ is the collection of network parameters, $x$ is an input to the network, and $\omega_t \equiv \theta_t - \theta_0$ is the difference between the parameters at time $t$ and at initialization.

The part I don't understand is the next equation (equation (6) in the paper),
$$\dot{\omega}_t = -\eta\, \nabla_\theta f_0(\chi)^T\, \nabla_{f_t^{\text{lin}}(\chi)} L,$$
where $\eta$ is the learning rate, $\chi$ is the entire training set, and $L$ is the loss function. Specifically, I don't understand why we are allowed to use $\nabla_{\theta} f_0(\chi)^T$ here. In the definition of $\omega_t$, only $\theta_t$ depends on time, so $\dot{\omega}_t = \dot{\theta}_t$. Continuous-time gradient descent was previously defined as
$$\dot{\theta}_t = -\eta\, \nabla_{\theta} f_t(\chi)^T\, \nabla_{f_t(\chi)} L,$$
so I assume the idea is to substitute $f_t^{\text{lin}}$ for $f_t$. But then the equation uses $\nabla_{\theta} f_0(\chi)^T$ rather than $\nabla_{\theta} f_t^{\text{lin}}(\chi)^T$, which would mean that $\nabla_{\theta} f_t^{\text{lin}} = \nabla_{\theta} f_0$, and I am unsure why that is the case.
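For concreteness, here is a small JAX sketch of how I'm reading the objects involved. Everything in it (the toy two-layer network `f`, the widths, and the random data standing in for $\chi$) is a placeholder of my own, not anything from the paper; it just compares the Jacobian of the affine model $f_t^{\text{lin}}$ at some later parameters $\theta_t$ with $\nabla_\theta f_0$ numerically.

```python
import jax
import jax.numpy as jnp

def f(theta, x):
    # Toy two-layer network; theta = (W1, b1, W2, b2). Placeholder architecture.
    W1, b1, W2, b2 = theta
    h = jnp.tanh(x @ W1 + b1)
    return h @ W2 + b2  # shape (n_points, 1)

key = jax.random.PRNGKey(0)
kw1, kw2, kx, kp = jax.random.split(key, 4)
theta0 = (jax.random.normal(kw1, (3, 16)), jnp.zeros(16),
          jax.random.normal(kw2, (16, 1)), jnp.zeros(1))
x = jax.random.normal(kx, (5, 3))  # stand-in for the training inputs chi

def f_lin(theta, x):
    # f_lin(theta) = f_0(x) + nabla_theta f_0(x) . (theta - theta_0),
    # computed with a Jacobian-vector product so the Jacobian is never materialized.
    omega = jax.tree_util.tree_map(lambda a, b: a - b, theta, theta0)
    f0, jvp_omega = jax.jvp(lambda t: f(t, x), (theta0,), (omega,))
    return f0 + jvp_omega

# Some "later" parameters theta_t != theta_0.
theta_t = jax.tree_util.tree_map(
    lambda a: a + 0.1 * jax.random.normal(kp, a.shape), theta0)

J_f0 = jax.jacobian(f)(theta0, x)          # nabla_theta f_0(x)
J_lin_t = jax.jacobian(f_lin)(theta_t, x)  # nabla_theta f_t^lin(x)

# Because f_lin is affine in theta, its Jacobian should be the same at every theta_t.
agree = jax.tree_util.tree_map(
    lambda a, b: bool(jnp.allclose(a, b, atol=1e-5)), J_lin_t, J_f0)
print(agree)  # expect True for every parameter leaf
```

So numerically the identity seems consistent with $f_t^{\text{lin}}$ being affine in $\theta$, but I still don't see how the paper justifies dropping the time dependence of the Jacobian in equation (6). I would appreciate any help on this, thanks!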