Checked out a few papers on Skip Graphs:
All of the insertion/deletion algorithms assume that you have already selected a start node to begin the search, but none of them describe how that start node is selected. In doing some initial exploration, I started thinking in terms of the following:
level1nodechunk1 = count next node node node ....
level1nodechunk2 = count next node node node ....
...
Basically, a level is just a bunch of nodes:
node node node ...
If you randomly select a starting node and the nodes are not contiguous in memory, then you have to navigate your way there one node at a time. So instead, I group the nodes for a particular level into chunks, each storing a count and a next pointer, so you can jump ahead to the next chunk if desired.
So say you generate a random number x between 0 and totalnodecount. Say x = 20 and each level chunk has 5 nodes. Then you would do:
level1nodechunk1 = [count] [next] (5)
level1nodechunk2 = [count] [next] (5 + 5 = 10)
level1nodechunk3 = [count] [next] (10 + 5 = 15)
level1nodechunk4 = [count] [next] node node node node [node 20]
You would jump over chunks 1-3, land in chunk 4, see that it has 5 contiguous nodes, and jump to the fifth one. So it's roughly 4 jumps (three chunk skips plus a final offset) instead of ~20, one per node.
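The chunk-skip idea above can be sketched as follows. This is a minimal illustration with hypothetical names (`Chunk`, `node_at`); it only assumes each chunk stores a count and a pointer to the next chunk, so locating node x costs one hop per chunk plus one final offset, instead of one hop per node.

```python
class Chunk:
    def __init__(self, nodes):
        self.nodes = nodes      # contiguous nodes in this chunk
        self.count = len(nodes)
        self.next = None        # pointer to the next chunk

def node_at(first_chunk, x):
    """Return the x-th node (0-based) by skipping whole chunks."""
    chunk = first_chunk
    while chunk is not None and x >= chunk.count:
        x -= chunk.count        # jump over this whole chunk at once
        chunk = chunk.next
    if chunk is None:
        raise IndexError("index past end of level")
    return chunk.nodes[x]

# Build 4 chunks of 5 nodes each, holding values 0..19.
chunks = [Chunk(list(range(i, i + 5))) for i in range(0, 20, 5)]
for a, b in zip(chunks, chunks[1:]):
    a.next = b

print(node_at(chunks[0], 19))   # the 20th node: 3 chunk skips + 1 offset
```

Within a chunk the nodes are contiguous, so the final step is a direct index rather than a walk.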
But the problem is scale: with 1 billion nodes and chunks of only 5 nodes each, you have 200 million chunks, so ~200 million jumps. Even if you make the chunks 10,000 nodes each, that's still 100,000 jumps to find the start node.
I'm wondering how they do this efficiently.
If all the nodes were contiguous then it would be easy, $O(1)$ lookup by index. If you instead hardcode the chunk count so there are only, say, 20 chunks no matter how many nodes, then you have to size each chunk for the worst case of, let's say, 1 billion nodes. That means each chunk holds 50 million nodes. But if only 10 of those 1 billion slots are currently used, that's a huge amount of wasted reserved memory.
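The tradeoff in the last two paragraphs is just a ratio: the number of chunk hops to reach the far end of the level is roughly total_nodes / chunk_size. A quick sketch (hypothetical `hops` helper, not from any paper):

```python
import math

def hops(total_nodes, chunk_size):
    # Worst-case chunk hops to reach a node by index: one per chunk.
    return math.ceil(total_nodes / chunk_size)

print(hops(1_000_000_000, 5))           # 200,000,000 hops: tiny chunks
print(hops(1_000_000_000, 10_000))      # 100,000 hops: still a lot
print(hops(1_000_000_000, 50_000_000))  # 20 hops, but huge reservations
```

Shrinking hops means growing chunks, and growing chunks means reserving memory that may never be used, which is exactly the tension described above.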
The only other thing I can think of is that the start node is always the actual first node in the lowest level of the skip graph; you navigate to the top level of that node and work your way back down.
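That fixed-entry idea is essentially the classic skip-list search: begin at the entry node's topmost level and descend. Here is a minimal skip-list-style sketch of it, not taken from any skip graph paper; a real skip graph distributes the levels across machines, but the top-down traversal is the same shape.

```python
import random

MAX_LEVEL = 8

class Node:
    def __init__(self, key, height):
        self.key = key
        self.next = [None] * height   # next[i] = successor at level i

head = Node(float("-inf"), MAX_LEVEL)  # the fixed entry node

def insert(key):
    # Coin-flip height, as in a standard skip list.
    height = 1
    while height < MAX_LEVEL and random.random() < 0.5:
        height += 1
    node = Node(key, height)
    cur = head
    for lvl in range(MAX_LEVEL - 1, -1, -1):
        while cur.next[lvl] is not None and cur.next[lvl].key < key:
            cur = cur.next[lvl]
        if lvl < height:
            node.next[lvl] = cur.next[lvl]
            cur.next[lvl] = node

def search(key):
    # Start at the entry node's top level and work back down,
    # exactly as described above.
    cur = head
    for lvl in range(MAX_LEVEL - 1, -1, -1):
        while cur.next[lvl] is not None and cur.next[lvl].key < key:
            cur = cur.next[lvl]
    found = cur.next[0]
    return found is not None and found.key == key

for k in (3, 1, 4, 5, 9, 2, 6):
    insert(k)
print(search(5), search(7))  # True False
```

The downside is the one the next paragraph raises: if every query enters at the same node, that node becomes a hotspot.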
IIRC, the reason to "randomly" select a start node is to prevent hotspots.
Another possible solution is to cycle through the level 0 nodes. Basically:
- Start at the first level 0 node.
- Cache the node it points to as the next query's start node.
- On the next query, start from that cached node, cache its successor, and so on.
But that would give preference to earlier nodes over later ones (i.e. it's not random).
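The cycling scheme above can be sketched in a few lines. The names (`StartPicker`, `next_start`) are hypothetical, and it assumes each level 0 node has a `.next` pointer, with the rotation wrapping back to the head at the end of the level:

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.next = None   # successor on level 0

class StartPicker:
    def __init__(self, head):
        self.head = head     # first node on level 0
        self.cached = head   # start node for the next query

    def next_start(self):
        start = self.cached
        # Cache the successor now so the following query rotates forward;
        # wrap back to the head at the end of the level.
        self.cached = start.next if start.next is not None else self.head
        return start

# Three level 0 nodes: a -> b -> c.
a, b, c = Node("a"), Node("b"), Node("c")
a.next, b.next = b, c
picker = StartPicker(a)
print([picker.next_start().key for _ in range(4)])  # ['a', 'b', 'c', 'a']
```

Note the rotation is deterministic round-robin, so while it spreads load evenly over a full cycle, consecutive queries are perfectly predictable rather than random.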