Torrent distributed system

peer-to-peer file sharing

Introduction:

A distributed system is a system with multiple components located on different machines that communicate and coordinate their actions in order to appear as a single coherent system to the end user. There are four main types of distributed systems: client-server, three-tier, n-tier, and peer-to-peer. Torrent (BitTorrent) is a communication protocol for peer-to-peer (P2P) file sharing, used to distribute data and electronic files over the Internet in a decentralized manner. It is commonly used to transfer large files such as digital video files containing TV shows or video clips, or digital audio files containing songs. The protocol was designed by Bram Cohen in 2001, who released the first implementation in the same year.

It uses a symmetric (tit-for-tat) transfer model in an attempt to reach Pareto efficiency. The protocol employs various mechanisms and algorithms in a continuous effort to ensure that no change in the network arrangement exists that would make every downloader at least as well off and at least one downloader strictly better off. In its original implementation, Torrent bases its operation around the concept of a torrent file, a centralized tracker, and an associated swarm of peers. The centralized tracker provides the different entities with an address list of available peers. Later improvements try to remove the weakness of a single point of failure (the tracker) by implementing a distributed tracker solution. Among Torrent's most prominent future uses are applications that combine Torrent and RSS (Really Simple Syndication), with the powerful ability to transform the Internet into the world's largest TiVo.

There are two key terms in Torrent terminology: seeder and peer. A peer with the complete file is called a seeder. Initially there is only one seeder (the one who uploads the file); peers who download the file completely and continue to share become seeders. A peer is a host that is in the process of downloading the file while uploading its completed pieces at the same time. Such a host is sometimes called a leecher as well; however, the term leecher is also used for a peer that just downloads and decides not to upload.

Figure 1

Use of Torrent

Piracy and illegal distribution is the first thing brought to mind when you hear about peer-to-peer file sharing. Not unreasonably so, since the majority of traffic is illegal exchange of copyrighted material. But the Torrent protocol has features that make it usable for totally legitimate purposes. It is already possible to download Linux distributions using Torrent, which enables much faster downloads than regular FTP or HTTP can provide. As mentioned earlier, it was for this purpose that Bram Cohen presented his protocol. The Opera web browser can also be downloaded using BitTorrent. In addition to increased download speed, this has the benefit of easing the pressure on the servers when a new version is published: when Opera released version 8 of their browser, their servers went down due to massive demand. Legal usage is growing, and many use the protocol to distribute their home-made music or short films, game demos, and other legal content.

Torrent also has potential business usage. Distribution of ISO images, operating systems, large software, and patches can be done at higher speeds using Torrent. How often have you waited too long for the newest security update from Windows to finish downloading? Using the Torrent protocol, the spreading of this kind of software can be much quicker and more satisfactory for the end users. Within an organization, one can also use the protocol to distribute applications and updates more rapidly. Torrent is best suited for new, popular files in which many people have an interest: it is easy for a user to find different parts of the file and download them quickly. This can be called the multiplier effect, and a slightly popular show or movie can become immensely popular within days, or maybe within hours. Old or unpopular files will be difficult to find and there will be few users to download from. This makes the protocol effective when dealing with highly demanded files, while obscure files tend not to be available.

Torrent technology preserves several desirable distributed-system properties: it is fault-tolerant, highly available, recoverable, consistent, scalable, secure, and offers predictable performance.

Fault-Tolerant

The solution should be without any single point of failure, with regard to both operation and content. This would mean that clients should be able to download content without requiring the presence of a central server once a download has been initiated. Content should also be duplicated, so multiple download locations exist for a given data file.

  • When peers appear or disappear at random, the system is not affected significantly as long as there is at least one seeder.
  • One or more trackers should always exist to propagate peer information.
  • If the number of seeders goes to zero, peers keep sharing the portions of the files that they have. This might mean that the file(s) remain incomplete; as soon as a seeder reappears, all peers can catch up and get the whole file(s).

There are implementations in which there is no need for trackers.

Highly Available

The first generation of Torrent systems used a central tracker to enable coordination among peers, resulting in low availability because the tracker was a single point of failure. Two recent trends improve Torrent availability:

(i) use of multiple trackers, and

(ii) use of Distributed Hash Tables (DHTs), both of which also help to balance load better.

One measurement study covered over 20,000 torrents spanning 1,400 trackers and 24,000 DHT nodes over a period of several months. Both trends improve availability, but for different and somewhat unexpected reasons. The findings include:

(i) multiple trackers improve availability, but the improvement largely comes from the choice of a single highly available tracker,

(ii) such improvement is reduced by the presence of correlated failures,

(iii) multiple trackers can significantly reduce the connectivity of the overlay formed by peers,

(iv) the DHT improves information availability, but induces a higher response latency to peer queries.

TRACKER RELIABILITY:

There are two different kinds of trackers: those using the HTTP protocol for communication with the client and those using the UDP protocol. The latter was introduced in order to reduce the load on trackers. HTTP trackers are much more common, and most UDP trackers are associated with an HTTP tracker (they have the same IP address). Availability has been evaluated by probing the trackers periodically (usually every 15 minutes). A single machine in the UMass network performed the task, with at most ten trackers being probed at the same time. The availability of a tracker is the fraction of the time it was available over the total measurement time.

The way to probe a tracker in order to check whether it is working differs according to whether the tracker uses HTTP or UDP. The availability of UDP trackers has been evaluated by trying to establish a UDP handshake as described in the UDP tracker protocol specification. Three UDP packets are sent consecutively, and the tracker is considered unavailable if these attempts fail. HTTP tracker availability has been evaluated by trying to open a TCP connection to the address specified in the announce key. The tracker is considered unavailable if three consecutive attempts to open the connection fail (the time between two consecutive attempts is 100 seconds). This procedure can produce wrong results. For example, some trackers are implemented as modules of Apache web servers, where Torrent requests are identified from the specific URL and forwarded to the tracker module; the measurements suggest that this is quite common. In such cases one would erroneously conclude that the tracker is available even though the tracker module is down, because the web server is working and accepting incoming TCP connections. The problem is not easy to solve, and the study relied on a heuristic to identify such cases; in this way, 16 web servers were identified where the Torrent module had probably been uninstalled.
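
To make the two probing methods concrete, here is a minimal sketch in Python: the UDP branch follows the connect handshake from the UDP tracker protocol specification (BEP 15), while the HTTP branch only attempts a TCP connection, as described above. The tracker address, retry counts, and timeouts are illustrative assumptions, not the study's actual tooling.

```python
import random
import socket
import struct

def probe_http_tracker(host, port, attempts=3, timeout=10):
    """Consider an HTTP tracker available if a TCP connection can be opened."""
    for _ in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            pass
    return False

def probe_udp_tracker(host, port, attempts=3, timeout=10):
    """Try the BEP 15 'connect' handshake: magic protocol id, action = 0."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    for _ in range(attempts):
        tid = random.getrandbits(32)
        # 8-byte magic constant, 4-byte action (0 = connect), 4-byte transaction id
        sock.sendto(struct.pack(">QII", 0x41727101980, 0, tid), (host, port))
        try:
            data, _ = sock.recvfrom(16)
        except socket.timeout:
            continue
        if len(data) < 8:
            continue
        action, rtid = struct.unpack(">II", data[:8])
        if action == 0 and rtid == tid:  # valid connect response
            return True
    return False

# Example (hypothetical tracker address):
# print(probe_http_tracker("tracker.example.org", 6969))
# print(probe_udp_tracker("tracker.example.org", 6969))
```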

For some trackers, availability depends on the length of the measurement interval (a similar effect was observed for the peers of the Overnet network); in particular, it decreases as the measurement interval increases. One hypothesis is that these trackers died, i.e., they permanently stopped operating. A tracker is assumed to have died when it becomes unavailable for at least two days and stays unavailable until the end of the measurement period. It appears that the number of live trackers decreased from 416 to 354 (about 15%) over 58 days, from May 27 to July 24. From this data one can roughly estimate the tracker lifetime to be about 390 days.

MULTI-TRACKER FEATURE:

The multi-tracker feature allows two or more trackers to manage the same swarm. In addition to the mandatory announce section in the torrent file, which specifies the tracker URL, a new section, announce-list, has been introduced. It contains a list of lists of tracker URLs. Trackers in the same list serve a load-balancing purpose: a peer randomly chooses one of them and sends it an announce request, and all the trackers in the same list exchange information about the peers they know. The different lists of trackers are intended for backup purposes: a peer tries to contact a tracker in the first list; if all the announce requests to trackers in the first list fail, it tries to contact a tracker in the second list, and so on. On the next announce, it repeats the procedure in the same order. Trackers in different lists do not share information. There are two common ways to use the multi-tracker feature: only for backup, when the announce-list contains lists with a single tracker each, and only for load balancing, when the announce-list contains a single list with many trackers. About 35% of torrents specify multiple trackers, among which 60% are for backup, 25% for load balancing, and 15% for both backup and load balancing.
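
For illustration, the relevant fragment of a torrent metainfo file might look like this before bencoding, shown here as a Python dictionary (the tracker URLs are hypothetical):

```python
# Hypothetical metainfo fragment before bencoding (URLs are made up).
metainfo = {
    "announce": "http://t1.example.org/announce",  # kept for older clients
    "announce-list": [
        # Tier 1: two trackers used for load balancing; they share peer info
        ["http://t1.example.org/announce", "udp://t2.example.org:6969/announce"],
        # Tier 2: contacted only if every tracker in tier 1 fails
        ["http://backup.example.net/announce"],
    ],
}
```

A client shuffles each tier, announces to tier 1 first, and falls back to tier 2 only when every tracker in tier 1 is unreachable.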

The multi-tracker feature is clearly intended to improve the availability of information about the peers in the swarm. The following aspects quantify the improvement:

A. Correlation among different trackers

B. Availability Improvement

C. Problems related to multitracker: swarm splitting

Recoverable

Recovery: the primary purpose of a cloud storage system is to serve as a universally accessible, available backup of your data. Suppose the user wants to access his file f from a different machine. First, he uses his username to retrieve the torrent files T(E(f)_k) for all k ≤ n. Then he begins leeching, downloading his file parts until all of them are available to him. After this, he assembles them into E(f) and, again, uses the password only locally to decrypt it to f.

The most important mechanism that BitTorrent uses for recoverability is the distributed hash table.

DISTRIBUTED HASH TABLES:

The latest versions of the most popular clients (Azureus, Mainline, BitComet, µTorrent, BitLord and BitSpirit [16]) implement the functionality of a DHT node, so that all peers, independently of the content they are interested in (i.e., of the swarm they are in), can form a single DHT infrastructure. The purpose of the DHT is to store the information needed to contact peers interested in specific content. In common DHT language, the key is the info-hash of the torrent, while the value is the contact information (e.g. the IP address and port of a peer client). Theoretically, the DHT could completely replace the tracker, permitting the operation of trackerless torrents. Although all BT clients could in principle form a single DHT, in reality there are currently two different, incompatible implementations: the Mainline one and the Azureus one. Except for Azureus, all the other clients are compliant with the Mainline DHT specification. The measurement study discussed here focuses on the Mainline DHT.

A. A Brief Overview of DHT Operation

When a user creates a new torrent, the program usually allows him to insert some DHT nodes. The DHT nodes can be specified manually or are just randomly picked from the set of "good" (highly available) DHT nodes in the routing table. These DHT nodes act as bootstrap nodes; they are used to initialize the client's routing table. The routing table is updated from time to time according to the protocol description. There are also other ways to discover DHT bootstrap nodes to initialize the routing table: for example, if a peer is already in a swarm and is connected to another peer, they can exchange DHT-related information. In order to download the content, the Torrent client can send requests to the set of DHT nodes in its routing table closest to the infohash. The contacted nodes will reply with the contact information of peers interested in the content, if they know any, or with the contact information of the DHT nodes in their own routing table closest to the infohash. The timeout for a request is 20 seconds in the Mainline client.
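
The lookup just described is iterative: the client keeps querying the closest nodes it knows until one of them returns peers. Below is a minimal sketch of that loop, assuming a hypothetical send_get_peers helper that performs the get_peers RPC and returns a (peers, closer_nodes) pair; it illustrates the idea, not the Mainline wire protocol.

```python
import heapq

def dht_lookup(infohash: bytes, bootstrap_nodes, send_get_peers, max_rounds=50):
    """Iteratively query the DHT nodes closest to the infohash until peers are found.

    send_get_peers(node, infohash) is assumed to return (peers, closer_nodes);
    either list may be empty, e.g. ([], []) on timeout. Distance is the
    Kademlia XOR metric used by the Mainline DHT.
    """
    def distance(node_id: bytes) -> int:
        return int.from_bytes(node_id, "big") ^ int.from_bytes(infohash, "big")

    # Candidate heap ordered by XOR distance; each entry is (dist, id, address).
    candidates = [(distance(nid), nid, addr) for nid, addr in bootstrap_nodes]
    heapq.heapify(candidates)
    queried = set()

    for _ in range(max_rounds):
        if not candidates:
            break
        _, nid, addr = heapq.heappop(candidates)
        if nid in queried:
            continue
        queried.add(nid)
        peers, closer = send_get_peers((nid, addr), infohash)
        if peers:                      # the first valid peers end the lookup
            return peers
        for cnid, caddr in closer:     # otherwise walk toward the infohash
            if cnid not in queried:
                heapq.heappush(candidates, (distance(cnid), cnid, caddr))
    return []
```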

B. Information availability through the DHT

Similarly to what was done for the trackers, the availability of DHT nodes has been measured. The DHT protocol implements a specific request, called a DHT ping, to check whether a DHT node is available, so the measurements rely on DHT pings. A node is considered unavailable when it does not answer three DHT pings sent consecutively. About 70% of the nodes were always unavailable, while the others showed an availability nearly uniformly distributed between 0% and 100%. The availability of the bootstrap nodes clearly influences the speed of the query process.

In order to investigate the effectiveness of the DHT network, the study customized a Mainline client and conducted experiments on a set of 2,569 torrents (those of SET2 with DHT nodes). For each torrent, the routing table and all files holding information about previously downloaded content are first erased, so the client starts with a clean state. The client then starts contacting the DHT nodes in the torrent file and trying to recover information about peers. In the meantime, all the nodes in the routing table are logged (recall that the routing table is updated frequently). The measurement stops when the client gets the first valid peer, and the next torrent is considered.

Consistent

Piece Selection Policy:

If pieces are not selected in a well-thought-out way, all peers may end up having downloaded the same pieces, while some (the more difficult pieces) may be missing. If the seeder disappears prematurely, all peers may end up with incomplete files. To solve this issue, one or more of the following policies are used:

Random First Piece Selection

  • Initially the peer has no pieces, so it has nothing to upload yet.
  • It is important to get a first complete piece as fast as possible, which means the "rarest first" policy is not the most efficient start: rare pieces tend to download more slowly, because their sub-pieces can be fetched from only one (or maybe a few) other peers, while multiple peers holding the same piece increase the download speed.
  • The policy is therefore to select the first piece at random and download it as soon as possible. Once the first piece is complete, the client switches to "rarest first".

Rarest Piece First

  • Determine the pieces that are rare or not yet found among one's peers and download those first (see the sketch after this list).
  • This ensures that the common pieces are left toward the end of the download.
  • Spreading the seed: rarest first makes sure that only "new" pieces are downloaded from the seed. In the beginning, the seed is a bottleneck, since it is the only one with any piece of the file. A downloader can see which pieces its peers have, and the "rarest first" policy ensures that the pieces fetched from the seed are pieces which have not already been uploaded by others.
  • Increased download speed: the more peers that have a piece, the faster it can be downloaded, as it is possible to fetch sub-pieces from different places. Replicating rare pieces lets them be downloaded faster.
  • Enabling uploading: a rare piece is the most wanted by other peers, and by getting a rare piece, others will be interested in downloading it from you.
  • Most common last: it is sensible to leave the most common pieces to the end of the download. Since many peers have them, the probability of being able to download them later is much larger than for the rare pieces.
  • Preventing a missing rarest piece: when the seed is taken down, it is important that all the different pieces of the file are distributed somewhere among the remaining peers. By replicating the rarest pieces first, we reduce the risk of missing one or more pieces of the file when the seeder leaves.
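
A minimal sketch of rarest-first selection, assuming each connected peer advertises the set of piece indices it holds (the data structures are illustrative, not BitTorrent's wire format):

```python
import random
from collections import Counter

def pick_next_piece(my_pieces, peer_bitfields):
    """Pick the piece held by the fewest peers among those we still need.

    my_pieces: set of piece indices we already have.
    peer_bitfields: one set of piece indices per connected peer.
    """
    availability = Counter()
    for bitfield in peer_bitfields:
        availability.update(bitfield - my_pieces)
    if not availability:
        return None                    # nothing downloadable right now
    rarest = min(availability.values())
    # Break ties randomly so peers don't all converge on the same piece.
    return random.choice([p for p, c in availability.items() if c == rarest])

# Example: we have piece 0; peers hold {0,1,2}, {1,2}, {2,3}.
# Piece 3 is rarest (held by one peer), so it is chosen.
print(pick_next_piece({0}, [{0, 1, 2}, {1, 2}, {2, 3}]))
```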

Endgame mode

  • Sometimes a piece might be downloaded from a peer with a slow transfer rate, which can delay the completion of a download. To prevent this, there is the "endgame mode". Recall the pipelining principle, which ensures that a number of requests (for sub-pieces) are always pending, the number often being set to five. When all the sub-pieces a peer lacks have been requested, those requests are broadcast to all peers. This helps to get the last chunks of the file as quickly as possible. Once a sub-piece arrives, a cancel message is sent, indicating that it has been obtained and the peers can disregard the request (see the sketch below). Some bandwidth is of course wasted by this broadcasting, but in practice not very much, because the endgame period is short.
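
A toy sketch of the endgame step, with hypothetical request/cancel helpers standing in for the wire-level messages:

```python
def enter_endgame(missing_subpieces, connections):
    """Request every still-missing sub-piece from every connected peer at once."""
    for sub in missing_subpieces:
        for conn in connections:
            conn.request(sub)          # hypothetical wire-level request helper

def on_subpiece_received(sub, connections):
    """Once a sub-piece arrives, cancel the now-duplicate requests."""
    for conn in connections:
        conn.cancel(sub)               # hypothetical cancel-message helper
```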

Scalable

Scalability: when new nodes are added or existing nodes are removed, the system should be smart about reallocating resources.

Resource Allocation:

No centralized resource allocation exists in Torrent. Every peer is responsible for maximizing its own download rate. A peer will, naturally, try to download from whomever it can. To decide which peers to upload to, a peer uses a variant of the "tit-for-tat" algorithm. The "tit-for-tat" strategy comes from repeated game theory and is a strategy of cooperation based on reciprocity; the essence is to do unto others as they do unto you (a code sketch follows the list):

1. On the first move cooperate.

2. On each succeeding move do what your opponent did the previous move.

3. Be prepared to forgive after carrying out just one act of retaliation (i.e. have a recovery mechanism to ensure eventual cooperation).
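
Expressed as code, the strategy is tiny (a sketch; the move labels are illustrative):

```python
COOPERATE, DEFECT = "cooperate", "defect"

def tit_for_tat(opponent_history):
    """Cooperate first, then mirror the opponent's previous move.

    Forgiveness is built in: one retaliation per defection, after which a
    cooperating opponent is immediately cooperated with again.
    """
    if not opponent_history:       # rule 1: open with cooperation
        return COOPERATE
    return opponent_history[-1]    # rules 2-3: mirror the last move
```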

  • The Choking Algorithm

Choking is a temporary refusal to upload to another peer; you can still download from them. To cooperate, peers allow uploading, and to not cooperate they "choke" the connection to their peers. The principle is to upload to peers who have uploaded to you recently: "if you scratch my back, I'll scratch yours". The goal is to have several bidirectional connections at any time and achieve Pareto efficiency.

So the big question is how to determine which peers to choke and which to unchoke. A peer always unchokes a fixed number of its peers (the default is four). Deciding which peers to unchoke is determined solely by their current download rates, using a 20-second average. Because of TCP's behavior, it is not desirable to choke and unchoke too rapidly, so the decision is recalculated every ten seconds.

The result is that any peer will upload to the peers which provide it the best download rates. Conversely, if your upload rate is high, more peers will allow you to download from them; you can get a higher download rate by uploading more. This is the most important feature of the Torrent protocol. It discourages a large number of "free riders": peers that only download and do not upload. In order for a peer-to-peer network to be efficient, all peers have to contribute to the network. This restriction is not present in most other peer-to-peer protocols and applications, and is one of the reasons why Torrent has become so popular. A sketch of the choking decision follows.
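
A minimal sketch of that unchoke decision, recalculated every ten seconds (the peer objects and rate bookkeeping are assumptions for illustration):

```python
def recompute_unchokes(peers, slots=4):
    """Unchoke the `slots` peers with the best 20-second download rates.

    `peers` is assumed to be a list of objects with a `rate_20s` attribute
    (average bytes/s received from that peer over the last 20 seconds)
    and choke()/unchoke() methods.
    """
    ranked = sorted(peers, key=lambda p: p.rate_20s, reverse=True)
    for i, peer in enumerate(ranked):
        if i < slots:
            peer.unchoke()
        else:
            peer.choke()
```

The optimistic-unchoke slot described next sits on top of this loop: one extra peer is unchoked regardless of its rate and rotated every 30 seconds.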

  • Optimistic unchoking

Torrent also allows one additional unchoked peer for which the download-rate criterion is not used. This is called optimistic unchoking. The reason for it is to probe whether there are currently unused connections that might be better than the ones in use. The optimistic unchoke is rotated to a different peer every 30 seconds, which is considered enough time for the upload to reach full speed and for the download to start and reach full speed. If this new connection turns out to be better than one of the existing unchoked connections, it replaces that connection. This is quite similar to step one of tit-for-tat.

  • Anti-snubbing

What happens if a peer is suddenly choked by all the peers it was downloading from? It then has to find new peers, but the optimistic unchoking mechanism only checks one unused connection every 30 seconds. To help the download rate recover more rapidly, Torrent introduces "snubbing". If a client hasn't received anything from a particular peer for 60 seconds, it presumes that it has been "snubbed". Following the mentality of tit-for-tat, it retaliates and refuses to upload to that peer (except if it becomes an optimistic unchoke). It then increases the number of optimistic unchokes in order to find new connections more quickly.

  • Upload only

We see that using the choking algorithm implemented in Torrent, we favor peers which are kind to us: if I can download fast from them, they are allowed to download from me. But what if I don't have any downloads? Then it is impossible to know which peers to unchoke using the choking algorithm presented above. Thus, when a download is completed, a new choking algorithm is used, which unchokes the peers with the highest upload rates. This ensures that pieces get uploaded and replicated faster. Peers with good upload rates are also probably not being served by others.

Predictable Performance

  • Speed: while not a large motivating factor, we can seek to improve upon the speed of existing cloud storage services, especially for large files (e.g. movies).

Faster: our speed is proportional to the replication factor and inversely proportional to the downtime. In most cases, we should be able to guarantee faster download speeds for larger files. In addition, we could potentially increase the replication factor for larger files to maintain smaller download times for them, and use Torrent's support of low-latency video streaming for large video files. If we store data across our own devices, whose uptime we can guarantee, we can multiply speeds by the number of devices we have.

  • Popularity: cloud storage in 2013 had 2.4 billion users and was expected to grow to 3.6 billion users by 2018.
  • No space restrictions: because, in the best case, no data besides torrent files needs to be stored on a server, space is bound only by your device capacity and how much space you want to donate.
  • Potentially free: again, because the servers don't store any data, the service can be offered free of cost.
  • Security performance:
  1. Adequate password-verification slowness, i.e. maintaining a 0.4 s time for key generation, independent of the size of the file. This is just fast enough not to be a UX concern for the user, but slow enough to mitigate attacks; 10^5 iterations are chosen to obtain the 0.4 s time.

2. Fast encryption and decryption: we observe a fairly consistent encryption and decryption speed of around 50–60 MBps. This can be sped up further with parallelization. Currently it takes about 15 seconds to encrypt and another 15 seconds to decrypt a full-length, high-quality feature film.

3. Fast splitting and joining: our modular scheme for splitting and joining files runs at about 200 MBps and can split a feature film in about 3.5 s.

  • Network Performance:

The Torrent stack successfully handles one seeder with multiple leechers. The leechers can leech from each other as peers, but the algorithm that makes this work efficiently is incomplete: although it doesn't crash, peers often block each other. Multiple seeders to one leecher has not been implemented. We tested one seeder against one and two leechers across two different machines (not over localhost). Although this speed is highly network dependent and constrained by the implementation, we simulated it to get an idea of transfer time across a fairly high-speed network. In reality, with a complete implementation, the BitTorrent protocol can get speed-ups proportional to the number of seeders. On average, for one seeder and one leecher we achieved a speed of 3 MBps; for one seeder and two leechers this increased to an effective speed of 4.5–5 MBps.

  • Overall performance:

For our current implementation, speed remains more or less constant with varying file size; time is linear in file size, as one might expect. For a 135 MB file, the encryption process takes about 3.5 seconds on the client and 42 seconds on the network, which is a 3 MBps transfer speed overall. This is fairly performant and, more importantly, network bottlenecked.

Secure

Security and privacy: having centralized services leaves you without anonymity and gives organizations access to all your data. Further, typical Web 2.0 cloud storage applications transmit password credentials across the wire and store encrypted, salted passwords on their servers, both of which can be compromised. In fact, the popular cloud storage platform Dropbox recently reported 7 million stolen passwords, and Apple's iCloud leaked nude celebrity pictures.

Very secure: with the use of symmetric encryption with PBKDF2 and AES-256, we ensure that encryption/decryption happens completely on the client. This means that no other service stores your password hashes, preventing password leaks like the recent ones from cloud services such as Dropbox. Because the content is encrypted client-side before being transferred over the wire, it is also not vulnerable to naive MITM attacks, nor to SSL vulnerabilities or SSL-evasion tactics like SSLstrip. A user's torrent files are stored on a centralized server and accessible with only a username, providing another security layer. The password does not travel over the wire and is only used locally for decryption. The use of PBKDF2 and AES-256 means that even if a malicious attacker gets access to a user's torrents and is able to download and assemble the encrypted file, they will not be able to use simple dictionary attacks to decrypt it, provided a strong password is enforced. Although not implemented, we would ideally use Two-Factor Authentication (TFA) as an additional security measure.

1. bcrypt salts: we randomly generate a bcrypt salt (bcrypt is known to be one of the most secure schemes) and use the highest security setting, 31, for this process.

2. PBKDF2: PBKDF2 ensures the secure generation of keys from a password, given a salt. To protect against dictionary and rainbow-table attacks, we intentionally slow the process down to 0.5 s by hashing our hash for 10^5 iterations.

3. AES-256: AES-256 is extremely secure and unbroken to date. It uses 256-bit keys and a salt, and operates on 128-bit blocks. We use the same bcrypt salt that we use for generating our key with PBKDF2.
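
A minimal sketch of this client-side scheme in Python, using the third-party cryptography package. The parameters mirror the ones above (PBKDF2 with 10^5 iterations, a 256-bit key), but the exact construction here (SHA-256 as the PRF, CBC mode, PKCS7 padding) is an assumption for illustration, not necessarily the authors' implementation:

```python
import hashlib
import os
from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def derive_key(password: str, salt: bytes) -> bytes:
    # PBKDF2-HMAC-SHA256, 10**5 iterations: deliberately slow, 32-byte key
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 10**5, dklen=32)

def encrypt(plaintext: bytes, password: str, salt: bytes) -> bytes:
    key = derive_key(password, salt)
    iv = os.urandom(16)                          # fresh IV per file
    padder = padding.PKCS7(128).padder()         # AES block size is 128 bits
    padded = padder.update(plaintext) + padder.finalize()
    enc = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    return iv + enc.update(padded) + enc.finalize()

def decrypt(blob: bytes, password: str, salt: bytes) -> bytes:
    key = derive_key(password, salt)
    dec = Cipher(algorithms.AES(key), modes.CBC(blob[:16])).decryptor()
    padded = dec.update(blob[16:]) + dec.finalize()
    unpadder = padding.PKCS7(128).unpadder()
    return unpadder.update(padded) + unpadder.finalize()
```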

4. Obscure sharding: for example, suppose a file is split into 1000 chunks and we require 10 shards. The first chunk is placed in the first shard, the second in the second, and the eleventh back in the first. This round-robin process adds another layer of obscurity: an attacker cannot obtain much relevant information about a file from a single shard, even if he somehow manages to decrypt it.
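
The round-robin split is simple to sketch (the chunk handling is illustrative):

```python
def shard(chunks, n_shards=10):
    """Deal chunks round-robin into n_shards: chunk i goes to shard i % n_shards."""
    shards = [b""] * n_shards
    for i, chunk in enumerate(chunks):
        shards[i % n_shards] += chunk
    return shards

def unshard(shards, chunk_size, total_chunks):
    """Reassemble by reading chunk i from shard i % n_shards."""
    n = len(shards)
    out = []
    for i in range(total_chunks):
        offset = (i // n) * chunk_size   # position of chunk i within its shard
        out.append(shards[i % n][offset:offset + chunk_size])
    return b"".join(out)
```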

5. Air-gapped password: the password credentials never leave the local machine, which eliminates over-the-wire threats even when SSL is compromised.

6. BitTorrent-based verification: BitTorrent is inherently secure when it comes to file transfer. If a malicious peer tries to feed you bad data, it causes a hash mismatch and the transfer of that piece is rejected.
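
This works because the torrent metainfo carries a SHA-1 hash for every piece, so each received piece can be checked before it is accepted (a minimal sketch):

```python
import hashlib

def verify_piece(piece_data: bytes, expected_sha1: bytes) -> bool:
    """Accept a piece only if its SHA-1 matches the hash from the torrent file."""
    return hashlib.sha1(piece_data).digest() == expected_sha1

# Pieces failing verification are discarded and re-requested from another peer.
```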
