582615: Overlay and P2P Networks, Spring 2013
This spoiler package contains some answers for Exercise Package III.
The CAN topology is presented as a wrapping section of x-dimensional (xD) space,
where x depends on the configuration of the CAN instance. Depending on the amount
of dimensions the topology is either a ring (1D), a torus (2D) or a hypertorus
(xD, x>2). In colloquial language tori are often referred to as doughnuts
because of their resemblance to the bakery product. Each CAN node is responsible for a
zone in the CAN topology. Neighbours in the topology maintain connections to each other.
The shape of a zone also depends on the amount of dimensions. A zone can be a line
segment (1D), a rectangle (2D), a cuboid (3D) or a hypercuboid (xD, x>3).
When a new node joins the CAN network a random coordinate is selected as the point
where the new node gets inserted. The system uses basic routing
to find the node that is responsible for the selected coordinate. The node responsible
for that point divides its zone into two halves, and the new node becomes responsible
for the objects in one of the halves, while the original node remains responsible
for the other half. As a zone is split the two nodes transfer the corresponding
objects to the new node and update the routing tables of the neighbouring nodes
to match the change. Finally, as the two halves of the original zone are next to
each other, both of the two nodes add the other to their routing table.
As depicted by Figure 4 in the CAN article, adding dimensions increases routing
efficiency. However, as more dimensions are added, the benefit of adding more
dimensions decreases. Adding dimensions also increases routing table sizes.
Thus there exists a limit how far performance can be improved by adding
dimensions.
The amount of realities refers to redundant copies of the CAN topology in a
single CAN deployment. Each node may end up responsible for different
objects in different realities. Thus, the amount of realities affects the
amount of redundant copies available on the system.
The routing performance increase gained from adding
realities is pretty similar to that of adding dimensions, as can be seen
by comparing Figure 4 with Figure 5. However, the cost of adding realities
is higher because of the increased replication.
K-bucket size affects the theoretical maximum amount of connections in
the network. With k-bucket size zero there can be no connections.
With huge k-bucket sizes the network has the potential of becoming
fully connected. Thus, increasing the size of k-buckets increases
routing performance, but also increases the routing table size.
Increased routing tables require more memory, and require the host
to maintain a larger amount of connections.
Try experimenting with different parameters in the Kademlia Visualizer at
http://www.cs.helsinki.fi/u/twruottu/kademlia/kademlia.html
Lets consider this question for 5 bit identifiers and hope it generalizes to n bits. With 5 bit identifiers the maximum amount of nodes is 2^5 = 32. With bucket size 0 the routing table does not have space for any nodes.
The nodes are divided to different buckets in a way, where half of the nodes are placed in one of the buckets, half of the remaining nodes are placed in the following bucket and so forth until we run out of nodes. The nodes do not store their own identifier in any bucket. Thus, with the bucket size of 32 / 2 >= 16, all existing nodes fit into the routing table.
The amount of nodes added with each bucket size increase are as follows.
Assignment 9, Spoiler
a.
b.
c.
Assignment 10, Spoiler
a.
b.
increased to
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
...
required additional space
5
4
3
3
2
2
2
2
1
1
1
1
1
1
1
1
0
0
0
...
If we integrate this with a pocket calculator we get the following.
| bucket size | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | ... |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| required space | 0 | 5 | 9 | 12 | 15 | 17 | 19 | 21 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 31 | 31 | 31 | ... |
Now, by using a plotter we get the following chart.
We look at the plot and believe that the storage requirement increases logarithmically. If we would wish someone else to believe us as well, we would continue with a formal proof that verifies our belief. The proof should probably take into account identifier lengths other than 5 bits.
In the analysis above, we estimated the maximum space requirement for the routing table. In practice the nodes are not necessarily aware of all the other nodes that could potentially exist in their routing table. In fact it is not very likely for all the nodes even exist. Even more so because long identifier lengths are often used to make collisions less likely.
The Kademlia paper also describes optimizations where memory for the routing table gets allocated dynamically. The bookkeeping required for the optimization causes some processing overhead, but saves space when the routing table is not full.
Part c was pretty much an open question. To get points one would need to either analyse what the authors claim in the paper, or compare Kademlia to some other DHT, or analyse an existing DHT comparison table with Kademlia in it. Perhaps the comparison table from the lecture slides, or perhaps one from some scientific article that compares different DHTs.
Dynamo needs to be performant. It needs to be able to guarantee lattency and throughtput, not on average, but in the typical case. It needs to support performance contracts where dynamo and the service making use of Dynamo can agree on a sustainable performance level. It needs to support distributing its operation on hundreds of simultaneous storage hosts.
Dynamo also needs to provide some reliability guarantees. It needs to be always writable, and it must be able to guarantee that the information has been stored in durable storage. It needs to be completely decentralized so there is no single point of failure. It needs to provide some weak consistency guarantees, but providing full ACID guarantees is in conflict with the other requirements.
Further more Dynamo has some administrational requirements. It needs to support adding new nodes easily. The Dynamo deployment needs to be configurable depending on the needs of the application. There must be a way of rolling out hardware updates incrementally. The system needs to support heterogenic hardware, since newly deployed hosts may be more powerfull than the existing hosts. The distribution needs to be symmetrical to be easy to maintain. It does not need to be particularly secure, as it is run in an isolated environment.
Dynamo uses vector clocks as version information on stored objects. Conflicting entries can enter the system through system failures or through multiple simultaneous edits. Conflicting edits are stored in the system and revealed to the application as results of read operations. It is the responsibility of the application to solve conflicts. For example, two conflicting shopping carts can be merged by the application. However, removed items may reappear in the shopping cart as a consequence of the merge.
Dynamo has been designed for hunreds of hosts. Thus, the routing tables will be small enough for every node to store them. As the hosts can run multiple virtual hosts, the amount of entries in the routing table will be some small multiple of a thousand at its largest. As the system does not need to scale infinitely it can optimize routing accordingly.
DNS has good sides, but it also has its shortcomings. It is widely deployed and proven to work in practice. However, it can not efficiently manage rapid updates and end-users do not often have write access to DNS records. It is also based on hierarchical namespaces and is therefore not well-suited for flat namespaces.
DHTs can fill in where DNS is lacking, as DHTs can handle frequent updates. DHTs are also typically based on flat namespaces. The important question is whether or not DHTs can be secured to match the security provided by DNS.
DHTs concentrate on routing and seldom provide security features such as access control out of the box. Thus, applications that use DHTs need to either run custom DHT instances in isolated environments or accept the low security level provided by a public DHT.
When DHTs are used for name resolution they need to be isolated to prevent denial of service and spoofing attacks. Access can then be allowed through a wrapper service that makes sure write requests to the system are authenticated. One way to authenticate the user is to only accept authenticated HIP connections. In addition to authenticating the user the wrapper can enforce an access control policy. With a name resolution service it is natural to define a policy that lets everyone modify their address information, but does not let anyone modify address information of other users.
The goal of the experiment was to measure performance of the access control wrapper created for the DHT. For that reason the authors did their measurements against a single node of each system. Thus, the measurement gives us little information about how the two systems would compare in a situation with multiple nodes.
We do not know why a single OpenLookup node is more efficient than a single Bamboo node, but we can guess. Maybe some of Bamboo's distribution techniques cause some overhead. The single node might simulate some topology, distribute objects among virtual copies of itself, build routing tables and forward queries within that virtual topology. Then again, it might not.
Implementation details could also affect performance. Bamboo is implemented in Java, while OpenLookup is implemented in Python. We could try to find some results on how programs written in those two languages typically compare. However, the reason is likely to be more fine grained than that. Truly finding out would require us to do some research. Unfortunately, that is out of the scope of this assignment.