In my code I have for loop that indexes over a multidimensional numpy array and does some operation using the sub-array that is obtained at each iteration. It looks like this
for sub in Arr:
  #do stuff using sub
Now the stuff that is done using sub is fully vectorized, so it should be efficient. On the other hand this loop iterates about ~10^5  times and is the bottleneck. Do you think I will get an improvement by offloading this part to C. I am somewhat reluctant to do so because the do stuff using sub uses broadcasting, slicing, smart-indexing tricks that would be tedious to write in plain C. I would also welcome thoughts and suggestions about how to deal with broadcasting, slicing, smart-indexing when offloading computation to C.
do stuff, we can find a better solution without a loop.