I've been trying to fully understand the paper "Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent" (available here), but I'm stuck on the linearization part, specifically Section 2.2. The idea is that in the infinite-width limit, the changes in the individual network parameters vanish, which permits studying the first-order Taylor approximation about the initialization:
$$f_t^{\text{lin}}(x) = f_0(x) + \nabla_{\theta} f_0(x)\big|_{\theta=\theta_0}\, \omega_t,$$
where $f_t$ is the network function at time $t$, $\theta$ is the collection of network parameters, $x$ is an input to the network, and $\omega_t \equiv \theta_t - \theta_0$ is the difference between the parameters at time $t$ and at initialization.

The part I don't understand is the next equation (equation (6) in the paper),
$$\dot{\omega}_t = -\eta\, \nabla_\theta f_0(\chi)^T\, \nabla_{f_t^{\text{lin}}(\chi)} L,$$
where $\eta$ is the learning rate, $\chi$ is the entire training set, and $L$ is the loss function. Specifically, I don't understand why we are allowed to use $\nabla_{\theta} f_0(\chi)^T$ here. In the definition of $\omega_t$, only $\theta_t$ depends on time, so $\dot{\omega}_t = \dot{\theta}_t$. Continuous-time gradient descent was previously defined as
$$\dot{\theta}_t = -\eta\, \nabla_{\theta} f_t(\chi)^T\, \nabla_{f_t(\chi)} L,$$
so I assume the idea is to substitute $f_t^{\text{lin}}$ for $f_t$. But then the equation uses $\nabla_{\theta} f_0(\chi)^T$ rather than $\nabla_{\theta} f_t^{\text{lin}}(\chi)^T$, which would mean that $\nabla_{\theta} f_t^{\text{lin}} = \nabla_{\theta} f_0$, and I am unsure why that is the case.
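For concreteness, here is a small JAX sketch of how I'm reading the objects involved. Everything in it (the toy two-layer network `f`, the widths, and the random data standing in for $\chi$) is a placeholder of my own, not anything from the paper; it just compares the Jacobian of the affine model $f_t^{\text{lin}}$ at some later parameters $\theta_t$ with $\nabla_\theta f_0$ numerically.

```python
import jax
import jax.numpy as jnp

def f(theta, x):
    # Toy two-layer network; theta = (W1, b1, W2, b2). Placeholder architecture.
    W1, b1, W2, b2 = theta
    h = jnp.tanh(x @ W1 + b1)
    return h @ W2 + b2  # shape (n_points, 1)

key = jax.random.PRNGKey(0)
kw1, kw2, kx, kp = jax.random.split(key, 4)
theta0 = (jax.random.normal(kw1, (3, 16)), jnp.zeros(16),
          jax.random.normal(kw2, (16, 1)), jnp.zeros(1))
x = jax.random.normal(kx, (5, 3))  # stand-in for the training inputs chi

def f_lin(theta, x):
    # f_lin(theta) = f_0(x) + nabla_theta f_0(x) . (theta - theta_0),
    # computed with a Jacobian-vector product so the Jacobian is never materialized.
    omega = jax.tree_util.tree_map(lambda a, b: a - b, theta, theta0)
    f0, jvp_omega = jax.jvp(lambda t: f(t, x), (theta0,), (omega,))
    return f0 + jvp_omega

# Some "later" parameters theta_t != theta_0.
theta_t = jax.tree_util.tree_map(
    lambda a: a + 0.1 * jax.random.normal(kp, a.shape), theta0)

J_f0 = jax.jacobian(f)(theta0, x)          # nabla_theta f_0(x)
J_lin_t = jax.jacobian(f_lin)(theta_t, x)  # nabla_theta f_t^lin(x)

# Because f_lin is affine in theta, its Jacobian should be the same at every theta_t.
agree = jax.tree_util.tree_map(
    lambda a, b: bool(jnp.allclose(a, b, atol=1e-5)), J_lin_t, J_f0)
print(agree)  # expect True for every parameter leaf
```

So numerically the identity seems consistent with $f_t^{\text{lin}}$ being affine in $\theta$, but I still don't see how the paper justifies dropping the time dependence of the Jacobian in equation (6). I would appreciate any help on this, thanks!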