# -*- mode: org -*-
#+STARTUP: indent

6.033 2011 Lecture 13: peer-to-peer systems

* Today: Peer-to-peer systems
  Hard problem: lookup
  Overlay networks

* Layered network model
  End-to-end (e.g., TCP)
  Network (e.g., IP)
  Link (e.g., Ethernet)

* Applications using this stack have been client/server
  Server in a machine room: well maintained, centrally located, perhaps replicated
  Examples: X, DNS, master in MapReduce

* What is wrong with centralized infrastructure?
  Central point of failure
  High management costs if one organization has to host millions of files, conversations, etc.
  Machines owned perhaps by you and me: no obvious central authority.

* Goal: peer-to-peer system (serverless, or every client is a server)
  How do you track nodes and objects in the system?
  How do you find other nodes in the system (efficiently)?
  How should data be split up between nodes?
  How to prevent data from being lost?  How to keep it available?
  How to provide consistency?
  How to provide security?  Anonymity?

* Example: BitTorrent
  Usage: static bulk content (songs and videos, Linux distributions)
  Other peer-to-peer examples: Skype, etc.
** Usage model: cooperative
   User downloads a file from someone using a simple user interface
   While downloading, BitTorrent also serves the file to others
   BitTorrent keeps running for a little while after the download completes
** Goal: get the file out to many users quickly
   Encourage everyone to upload the file
** Challenges
   Tracking which peer has what
   Handling high churn rates
   Keeping download rate proportional to upload rate
** Approach
   Publisher posts a .torrent file on a Web server (e.g., suprnova.org):
     URL of the tracker
     file name, length
     SHA-1 hashes of the data blocks (64-512 KByte)
   Tracker:
     organizes a swarm of peers (who has what block?)
   Seed registers the .torrent URL with the tracker
     Seed must have a complete copy of the file
     Every peer that is online and has a complete copy of the file becomes a seed
   Peer asks the tracker for a list of peers to download from
     Tracker returns a random selection of peers
   Peers contact those peers to learn which parts of the file they have, etc.
   Download from the other peers
** Transport
   Peers pipeline requests on top of TCP:
     divide a block further into 16 KByte subpieces
     keep typically 5 requests in flight (to different peers)
** Which pieces to download?
   Strictly in order?
   Rarest first?
     ensures that every piece is widely available
     also helps the seed bootstrap the swarm rapidly:
       peers won't retrieve the same piece multiple times from the seed
   Random?
     avoids overloading the seed when starting a download
     a peer with no pieces should get some piece as quickly as possible so that it can upload
       don't use rarest first here, because it is likely only one peer has that piece
       --> use random; subpieces can be downloaded in parallel
   Parallel download of the same piece?
     avoids waiting on the slowest peer at the end
   Final algorithm: random for the first piece, then rarest first, parallel for the last piece
** Fairness (see the paper for tomorrow)
** Demo: Transmission with an Ubuntu .torrent
   Use the inspector to look at the grid of pieces and the list of peers; note DHTs
** BitTorrent relies on one central component: the tracker.  Can we get rid of it?
   Scale to a large number of torrents

* Scalable lookup
  Provide an abstract interface to store and find data
  Typical DHT interface:
    put(key, value)
    get(key) -> value
    loose guarantees about keeping data alive
  For BitTorrent trackers:
    announce tracker: put(SHA1(URL), my-ip-address)
    find tracker: get(SHA1(URL)) -> IP address of the tracker
  Some DHT-based trackers exist.
  Many other usages of DHTs.
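To make the tracker-over-DHT idea concrete, here is a minimal sketch in Python. The in-memory FakeDHT class, the announce_tracker/find_tracker helper names, and the example URL are assumptions for illustration; they stand in for a real DHT client that implements put/get as described above.

#+BEGIN_SRC python
import hashlib

class FakeDHT:
    """Stand-in for a real DHT client: an in-memory put/get table."""
    def __init__(self):
        self.table = {}

    def put(self, key, value):
        # Loose semantics: keep a set of values per key, no liveness guarantee.
        self.table.setdefault(key, set()).add(value)

    def get(self, key):
        return self.table.get(key, set())

def torrent_key(torrent_url):
    # Key the entry by the SHA-1 of the .torrent URL, as in the notes.
    return hashlib.sha1(torrent_url.encode()).hexdigest()

def announce_tracker(dht, torrent_url, my_ip_address):
    # announce tracker: put(SHA1(URL), my-ip-address)
    dht.put(torrent_key(torrent_url), my_ip_address)

def find_tracker(dht, torrent_url):
    # find tracker: get(SHA1(URL)) -> IP address(es) of the tracker(s)
    return dht.get(torrent_key(torrent_url))

dht = FakeDHT()
announce_tracker(dht, "http://example.com/ubuntu.torrent", "18.26.4.9:6969")
print(find_tracker(dht, "http://example.com/ubuntu.torrent"))
#+END_SRC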
* Goal: peer-to-peer implementation of a DHT
  An overlay network
  Partition the hash table over n nodes
  Not every node knows about all of the other n nodes
  Route to find the right part of the hash table
  Goals:
    log(n) hops
    guarantees about load balance

* Example: Chord
** ID-space topology
   Ring: all IDs are 160-bit numbers, viewed as arranged in a ring.
   Everyone agrees on how the ring is divided between nodes,
   based only on the ID bits.
** Assignment of key IDs to node IDs?
   A key is stored on the first node whose ID is equal to or greater than the key ID.
   Closeness is defined as the "clockwise distance".
   If node and key IDs are uniform, we get reasonable load balance.
   Node IDs can be assigned, chosen randomly, taken as the SHA-1 hash of the IP address, ...
   Key IDs can be derived from the data, or chosen by the user.
** Routing?
   A query arrives at some node.
   That node needs to forward the query to a node "closer" to the key.
   Simplest system: either you are the "closest", or your neighbor is closer.
   Hand off queries in the clockwise direction until done.
   The only state necessary is the "successor":
     n.find_successor(k):
       if k in (n, successor]:
         return successor
       else:
         return successor.find_successor(k)
** Slow but steady; how can we make this faster?
   This looks like a linked list: O(n) hops.
   Can we make it more like a binary search?
   We need to be able to halve the distance at each step.
** Finger table routing
   Keep track of nodes exponentially further away:
     new state: succ(n + 2^i) for each i
   Many of these entries will be the same in a full system: expect O(log N) distinct ones.
     n.find_successor(k):
       if k in (n, successor]:
         return successor
       else:
         n' = closest_preceding_node(k)
         return n'.find_successor(k)
   Node 8's finger table might look like this (offset: successor of 8 + offset):
     1:  14
     2:  14
     4:  14
     8:  21
     16: 32
     32: 42
   (A runnable sketch of finger-table routing appears after the failure discussion below.)
** There's a complete tree rooted at every node
   It starts at that node's row 0,
   and is threaded through other nodes' row 1, etc.
   Every node acts as a root, so there's no root hotspot.
   This is *better* than simply arranging the nodes in one tree.
** How does a new node acquire correct tables?
   General approach:
     Assume the system starts out with correct routing tables.
     Use those routing tables to help the new node find information.
     Add the new node in a way that maintains correctness.
   The new node issues a lookup for its own key to any existing node.
     This finds the new node's successor.
   Ask that successor for its finger table.
   At this point the new node can forward queries correctly;
   it tweaks its own finger table as necessary.
** Does routing *to* us now work?
   If the new node doesn't do anything, queries will go where they would have gone
   before we joined,
     i.e., to the existing node numerically closest to us.
   So, for correctness, we need to let other nodes know that we are here.
   Each node keeps track of its current predecessor.
   When you join, tell your successor that its predecessor has changed.
   Periodically ask your successor who its predecessor is:
     if that node is closer to you than your current successor, switch to it.
   Is that enough?
** Everyone must also continue to update their finger tables
   Periodically look up the key n + 2^i for each i (to refresh finger i).
** What about concurrent joins?
   E.g., two new nodes with very close IDs might have the same successor,
   e.g., 44 and 46: both may find node 48... a spiky tree!
   Good news: periodic stabilization takes care of this.
** What about node failures?
   Assume nodes fail without warning; this is strictly harder than graceful departure.
   Two issues:
     other nodes' routing tables refer to the dead node;
     the dead node's predecessor has no successor.
   If you try to route via a dead node, detect the timeout and treat the entry as empty,
     i.e., route via a numerically closer entry instead.
   Repair: ask any node on the same row for a copy of its corresponding entry,
     or any node on the rows below; all of these share the right prefix.
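Below is a minimal, runnable sketch of finger-table routing (referenced from the finger-table section above), including the rule of skipping a dead finger and falling back to a closer entry. This is not the real Chord protocol: the 6-bit toy ring, the alive flag, the between helpers, and the global successor_of used to build the example ring are simplifying assumptions for illustration.

#+BEGIN_SRC python
RING = 1 << 6    # a 6-bit toy ring: IDs 0..63 instead of 160-bit IDs

def between(x, a, b):
    """True if x lies in the clockwise half-open interval (a, b]."""
    return a < x <= b if a < b else (x > a or x <= b)

def between_open(x, a, b):
    """True if x lies strictly inside the clockwise interval (a, b)."""
    return a < x < b if a < b else (x > a or x < b)

class Node:
    def __init__(self, ident):
        self.id = ident
        self.alive = True
        self.successor = None
        self.fingers = []            # finger i holds succ(id + 2^i)

    def closest_preceding_node(self, k):
        # Scan fingers from farthest to nearest; a dead finger (timeout)
        # is treated as an empty entry and skipped.
        for f in reversed(self.fingers):
            if f.alive and between_open(f.id, self.id, k):
                return f
        return self.successor

    def find_successor(self, k):
        if between(k, self.id, self.successor.id):
            return self.successor
        return self.closest_preceding_node(k).find_successor(k)

# Build the example ring from the notes (node IDs as in the Chord paper).
ids = sorted([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])
nodes = {i: Node(i) for i in ids}

def successor_of(k):
    # First node whose ID is >= k, wrapping around the ring.
    k = k % RING
    for i in ids:
        if i >= k:
            return nodes[i]
    return nodes[ids[0]]

for n in nodes.values():
    n.successor = successor_of(n.id + 1)
    n.fingers = [successor_of(n.id + (1 << i)) for i in range(6)]

print([f.id for f in nodes[8].fingers])   # node 8's fingers: [14, 14, 14, 21, 32, 42]
print(nodes[8].find_successor(54).id)     # key 54 is stored on node 56
#+END_SRC

Setting nodes[42].alive = False and rerunning the last lookup still reaches node 56, via closer live fingers, which is the timeout-and-fall-back rule described above.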
   For a missing successor:
     the failed node might have been the closest to the key ID!
     We need to know the next-closest node.
     Maintain a _list_ of successors: r successors.
     If you expect really bad luck, maintain O(log N) successors.
     (A small sketch of successor-list repair appears after the summary.)
   We can route around failures; the system is effectively self-correcting.

* Summary
** Peer-to-peer systems
** BitTorrent: peer-to-peer downloads
** Chord
   An overlay network
   O(log N) routing table entries
   O(log N)-hop lookups
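As a companion to the successor-list point in the failure discussion (referenced above), here is a tiny sketch of keeping r successors and routing around a dead one. The r = 3 choice, the Node fields, and the stabilize refresh rule are illustrative assumptions, not the exact Chord maintenance protocol.

#+BEGIN_SRC python
R = 3    # keep r = 3 successors; use O(log N) if you expect really bad luck

class Node:
    def __init__(self, ident):
        self.id = ident
        self.alive = True
        self.successor_list = []     # the next R nodes clockwise on the ring

    def first_live_successor(self):
        # Skip dead entries; the next live one takes over as successor.
        for s in self.successor_list:
            if s.alive:
                return s
        return None                  # all r successors failed

    def stabilize(self):
        # Periodic repair: prepend the live successor, then adopt its list
        # (in real Chord this information comes back over an RPC).
        succ = self.first_live_successor()
        if succ is not None:
            self.successor_list = ([succ] + succ.successor_list)[:R]

# A small fragment of the ring in which node 14 fails without warning.
n8, n14, n21, n32 = Node(8), Node(14), Node(21), Node(32)
n8.successor_list = [n14, n21, n32]
n21.successor_list = [n32]
n14.alive = False
print(n8.first_live_successor().id)        # 21: we can route around the failure
n8.stabilize()
print([n.id for n in n8.successor_list])   # [21, 32]
#+END_SRC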