Checked out a few papers on Skip Graphs:
All of the insertion/deletion algorithms assume that you have already selected a start node to begin the search, but none of them describe how that start node is selected. In doing some initial exploration, I started thinking in terms of the following:
level1nodechunk1 = count next node node node ....
level1nodechunk2 = count next node node node ....
...
Basically, a level is just a bunch of nodes:
node node node ...
If you randomly select a starting node and the nodes are not contiguous in memory, then you have to navigate your way there one node at a time. So instead, I group the nodes for a particular level into chunks, each storing a count and a next pointer, so you can jump ahead to the next chunk if desired.
So say you generate a random number x between 0 and totalnodecount. Say x = 20 and each level chunk has 5 nodes. Then you would do:
level1nodechunk1 = [count] [next] (5)
level1nodechunk2 = [count] [next] (5 + 5 = 10)
level1nodechunk3 = [count] [next] (10 + 5 = 15)
level1nodechunk4 = [count] [next] node node node node [node 20]
You would jump over chunks 1-3, land in chunk 4, see that it has 5 contiguous nodes, and jump to the fifth one. So it's roughly 4 jumps (three chunk skips plus a final offset) instead of ~20, one per node.
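The chunk-skip idea above can be sketched as follows. This is a minimal illustration with hypothetical names (`Chunk`, `node_at`); it only assumes each chunk stores a count and a pointer to the next chunk, so locating node x costs one hop per chunk plus one final offset, instead of one hop per node.

```python
class Chunk:
    def __init__(self, nodes):
        self.nodes = nodes      # contiguous nodes in this chunk
        self.count = len(nodes)
        self.next = None        # pointer to the next chunk

def node_at(first_chunk, x):
    """Return the x-th node (0-based) by skipping whole chunks."""
    chunk = first_chunk
    while chunk is not None and x >= chunk.count:
        x -= chunk.count        # jump over this whole chunk at once
        chunk = chunk.next
    if chunk is None:
        raise IndexError("index past end of level")
    return chunk.nodes[x]

# Build 4 chunks of 5 nodes each, holding values 0..19.
chunks = [Chunk(list(range(i, i + 5))) for i in range(0, 20, 5)]
for a, b in zip(chunks, chunks[1:]):
    a.next = b

print(node_at(chunks[0], 19))   # the 20th node: 3 chunk skips + 1 offset
```

Within a chunk the nodes are contiguous, so the final step is a direct index rather than a walk.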
But the problem is scale: with 1 billion nodes and chunks of only 5 nodes each, you have 200 million chunks, so ~200 million jumps. Even if you make the chunks 10,000 nodes each, that's still 100,000 jumps to find the start node.
I'm wondering how they do this efficiently.
If all the nodes were contiguous then it would be easy, $O(1)$ lookup by index. If you instead hardcode the chunk count so there are only, say, 20 chunks no matter how many nodes, then you have to size each chunk for the worst case of, let's say, 1 billion nodes. That means each chunk holds 50 million nodes. But if only 10 of those 1 billion slots are currently used, that's a huge amount of wasted reserved memory.
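The tradeoff in the last two paragraphs is just a ratio: the number of chunk hops to reach the far end of the level is roughly total_nodes / chunk_size. A quick sketch (hypothetical `hops` helper, not from any paper):

```python
import math

def hops(total_nodes, chunk_size):
    # Worst-case chunk hops to reach a node by index: one per chunk.
    return math.ceil(total_nodes / chunk_size)

print(hops(1_000_000_000, 5))           # 200,000,000 hops: tiny chunks
print(hops(1_000_000_000, 10_000))      # 100,000 hops: still a lot
print(hops(1_000_000_000, 50_000_000))  # 20 hops, but huge reservations
```

Shrinking hops means growing chunks, and growing chunks means reserving memory that may never be used, which is exactly the tension described above.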
The only other thing I can think of is that the start node is always the actual first node in the lowest level of the skip graph; you navigate to the top level of that node and work your way back down.
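That fixed-entry idea is essentially the classic skip-list search: begin at the entry node's topmost level and descend. Here is a minimal skip-list-style sketch of it, not taken from any skip graph paper; a real skip graph distributes the levels across machines, but the top-down traversal is the same shape.

```python
import random

MAX_LEVEL = 8

class Node:
    def __init__(self, key, height):
        self.key = key
        self.next = [None] * height   # next[i] = successor at level i

head = Node(float("-inf"), MAX_LEVEL)  # the fixed entry node

def insert(key):
    # Coin-flip height, as in a standard skip list.
    height = 1
    while height < MAX_LEVEL and random.random() < 0.5:
        height += 1
    node = Node(key, height)
    cur = head
    for lvl in range(MAX_LEVEL - 1, -1, -1):
        while cur.next[lvl] is not None and cur.next[lvl].key < key:
            cur = cur.next[lvl]
        if lvl < height:
            node.next[lvl] = cur.next[lvl]
            cur.next[lvl] = node

def search(key):
    # Start at the entry node's top level and work back down,
    # exactly as described above.
    cur = head
    for lvl in range(MAX_LEVEL - 1, -1, -1):
        while cur.next[lvl] is not None and cur.next[lvl].key < key:
            cur = cur.next[lvl]
    found = cur.next[0]
    return found is not None and found.key == key

for k in (3, 1, 4, 5, 9, 2, 6):
    insert(k)
print(search(5), search(7))  # True False
```

The downside is the one the next paragraph raises: if every query enters at the same node, that node becomes a hotspot.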
IIRC, the reason to "randomly" select a start node is to prevent hotspots.
Another possible solution is to cycle through the level 0 nodes. Basically:
- Start at the first level 0 node.
- Cache the node it points to as the next query's start node.
- On the next query, start from that cached node, cache its successor, and so on.
But that would give preference to earlier nodes over later ones (i.e. it's not random).
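The cycling scheme above can be sketched in a few lines. The names (`StartPicker`, `next_start`) are hypothetical, and it assumes each level 0 node has a `.next` pointer, with the rotation wrapping back to the head at the end of the level:

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.next = None   # successor on level 0

class StartPicker:
    def __init__(self, head):
        self.head = head     # first node on level 0
        self.cached = head   # start node for the next query

    def next_start(self):
        start = self.cached
        # Cache the successor now so the following query rotates forward;
        # wrap back to the head at the end of the level.
        self.cached = start.next if start.next is not None else self.head
        return start

# Three level 0 nodes: a -> b -> c.
a, b, c = Node("a"), Node("b"), Node("c")
a.next, b.next = b, c
picker = StartPicker(a)
print([picker.next_start().key for _ in range(4)])  # ['a', 'b', 'c', 'a']
```

Note the rotation is deterministic round-robin, so while it spreads load evenly over a full cycle, consecutive queries are perfectly predictable rather than random.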