Edit: OK, I reread your requirements more carefully. Here's what I think is best, but let me preface it with this: Alexandrescu has a great article on performance where he notes that we have made our processors better and more intelligent at the cost of a simple mental model of how they work. It's much harder nowadays to predict the speed of code than it used to be, which makes profiling more important than ever. If you really care about speed, profile! (Link: https://www.facebook.com/notes/facebook-engineering/three-optimization-tips-for-c/10151361643253920)
Given that your x and y locations will never change, only z, you can create a struct as you suggested that stores x, y, z, and misc data. Take all the initial data points and put them in a single std::vector of these structs; call this variable DataPoints. Sort them by x, then y; the z sorting is optional. If you do many operations that change z, maintaining sort order will be expensive, and it's easier to just loop through all the z's and do an if check. If you mostly just scan the data, then keep it sorted by z as well, and pay O(N) to keep it sorted when modifying a z coordinate, where N is the number of points with the same x and y coordinates.
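A minimal sketch of that setup, assuming unsigned integer coordinates and a float payload (the field types and the name `misc` are placeholders for your actual data):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct DataPoint {
    std::uint32_t x, y, z;
    float misc;  // hypothetical per-point payload; substitute your own
};

// Sort by x, then y. Ordering by z within an (x, y) group is optional,
// per the trade-off above.
void sortPoints(std::vector<DataPoint>& pts) {
    std::sort(pts.begin(), pts.end(),
              [](const DataPoint& a, const DataPoint& b) {
                  if (a.x != b.x) return a.x < b.x;
                  return a.y < b.y;
              });
}
```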
Now we build a lookup table: another array, a single contiguous piece of memory, that you index with an x and y coordinate. Each entry is a pair of integers giving the first and last index in DataPoints that have exactly that x and y. There's just one more clever bit: since we have this lookup table, we don't actually need the x and y coordinates stored in DataPoints at all. So we create a new DataPointsReduced; it's the same data in the same order, but with the x data stripped. We could strip the y data too, but that may or may not pay off, as you'll see.
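Continuing the sketch above, here is one way to build that table, assuming x lies in [0, X_MAX) and y in [0, Y_MAX) (X_MAX and Y_MAX are hypothetical names for your grid bounds). One detail worth noting: by walking every cell in sort order, empty cells get a correctly positioned empty range rather than garbage, which the scan below relies on.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Half-open range [first, last) of indices into the point array.
struct IndexRange {
    std::uint32_t first = 0, last = 0;
};

// One pass over the sorted points, visiting every (x, y) cell in sort
// order; cells with no points receive an empty range at the right spot.
std::vector<IndexRange> buildLookup(const std::vector<DataPoint>& pts,
                                    std::uint32_t X_MAX, std::uint32_t Y_MAX) {
    std::vector<IndexRange> table(std::size_t(X_MAX) * Y_MAX);
    std::size_t i = 0;
    for (std::uint32_t x = 0; x < X_MAX; ++x) {
        for (std::uint32_t y = 0; y < Y_MAX; ++y) {
            const std::size_t first = i;
            while (i < pts.size() && pts[i].x == x && pts[i].y == y) ++i;
            table[std::size_t(x) * Y_MAX + y] =
                {std::uint32_t(first), std::uint32_t(i)};
        }
    }
    return table;
}

// Same data in the same order as DataPoints, but with x stripped.
struct DataPointReduced {
    std::uint32_t y, z;
    float misc;
};
```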
We're now basically done. When you loop over x and y, you extract all the indices you need from the lookup table. Sadly, because you scan a range of x and a range of y, the data of interest can't all be contiguous, but it is contiguous for each fixed value of x. The outer loop is over x; for that fixed x, since the secondary sort key is y, the whole range of y's of interest is one contiguous block of memory. So there are only two loops, not three: the inner loop walks a block of memory containing all the z's for the full y range at a fixed x, and inside it you do whatever operations you want. Note that if we strip y too and we still need it, we must return to three loops, or do some kind of lookup to recover y at each step, which can be slower. If you don't need y at all once the range of y's is specified, then strip y from your data as well.
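A sketch of that two-loop scan, continuing the names above. For a fixed x, the block runs from the first index of the (x, y0) cell to the last index of the (x, y1) cell; because empty cells store correctly positioned empty ranges, this is safe even when the boundary cells hold no points:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

template <typename Fn>
void scanRange(const std::vector<DataPointReduced>& pts,
               const std::vector<IndexRange>& table, std::uint32_t Y_MAX,
               std::uint32_t x0, std::uint32_t x1,
               std::uint32_t y0, std::uint32_t y1, Fn&& perPoint) {
    for (std::uint32_t x = x0; x <= x1; ++x) {
        // Two table lookups bound one contiguous block for the whole
        // y range at this x.
        const std::uint32_t first = table[std::size_t(x) * Y_MAX + y0].first;
        const std::uint32_t last  = table[std::size_t(x) * Y_MAX + y1].last;
        for (std::uint32_t i = first; i < last; ++i)
            perPoint(pts[i]);  // check z and do your work here
    }
}
```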
That's basically it. If your ranges of x and y get very large, this lookup table will eventually use a huge amount of space (but note: no more space than your solution; this lookup table can store two 32-bit unsigned integers for every x-y pair, whereas your table would store at least one 64-bit pointer for every x-y pair). You can avoid the lookup table at small cost. Since we've sorted by x and y, you can do a small prestep of binary searches to quickly find the indices you need, and then proceed as before. The cost of this prestep is sublinear in the total amount of data being scanned, so asymptotically it doesn't affect the running time. In this case you obviously can't strip the x or y data, but that shouldn't make a huge difference: if the number of points in each contiguous y block is reasonable, prefetching should hide the extra memory traffic.
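A sketch of that prestep, keeping x and y in the struct since there is no table. Two binary searches per x value in the range bound the same contiguous block the table would have given you:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Start of the contiguous block for fixed x and the y range starting at y0.
std::size_t blockBegin(const std::vector<DataPoint>& pts,
                       std::uint32_t x, std::uint32_t y0) {
    const auto key = std::make_pair(x, y0);
    return std::lower_bound(
               pts.begin(), pts.end(), key,
               [](const DataPoint& p,
                  const std::pair<std::uint32_t, std::uint32_t>& k) {
                   return std::make_pair(p.x, p.y) < k;
               }) -
           pts.begin();
}

// One past the end of the block for fixed x and the y range ending at y1.
std::size_t blockEnd(const std::vector<DataPoint>& pts,
                     std::uint32_t x, std::uint32_t y1) {
    const auto key = std::make_pair(x, y1);
    return std::upper_bound(
               pts.begin(), pts.end(), key,
               [](const std::pair<std::uint32_t, std::uint32_t>& k,
                  const DataPoint& p) {
                   return k < std::make_pair(p.x, p.y);
               }) -
           pts.begin();
}
```

Each call is O(log N) in the total point count, so the whole prestep costs O(|x range| log N), which is where the sublinear claim comes from.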
This setup (with no lookup table) is pretty damn good if the ranges of x and y are huge and your data is correspondingly sparse. Note that this solution has no space requirement proportional to the ranges of x and y, whereas even dynamically allocated arrays or linked lists at every x-y pair imply at least one pointer per x-y pair, which is space proportional to the product of the ranges.
Final thought: this setup has potentially more if-branching, because I basically recommend looping through a contiguous block of memory and checking z as you go, instead of looping over an exactly known range of z. It depends on what your code is doing, but for very simple conditions (e.g. 1-2 lines of code, no side effects) the cost of branching is often virtually nil: the branch predictor handles regular patterns well, and for simple value selection the compiler can often emit branch-free conditional moves instead of a jump.
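For concreteness, a hypothetical usage of the scanRange sketch above, where the z check is exactly this kind of cheap, predictable branch (z0 and z1 are your z bounds; `reduced` and `table` are built as in the earlier sketches):

```cpp
double sumMiscInBox(const std::vector<DataPointReduced>& reduced,
                    const std::vector<IndexRange>& table, std::uint32_t Y_MAX,
                    std::uint32_t x0, std::uint32_t x1,
                    std::uint32_t y0, std::uint32_t y1,
                    std::uint32_t z0, std::uint32_t z1) {
    double total = 0.0;
    scanRange(reduced, table, Y_MAX, x0, x1, y0, y1,
              [&](const DataPointReduced& p) {
                  // Simple, side-effect-free condition: cheap to branch on.
                  if (p.z >= z0 && p.z <= z1) total += p.misc;
              });
    return total;
}
```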
Final final thought: profile! What I wrote here is based on moderate experience, but honestly it's entirely possible that my answer won't work out the way I think it will, for any number of reasons.
Good luck.