A little more than a year ago I mentioned I was working on a Doujin music aggregator that would also be able to parse torrent files. I'd sorta given up on that project because the site had other issues that required attention and my first EE MSc semester had started, to name a few excuses. But what really kept me from picking this side project up again was that I no longer saw its usefulness - I often ended up discovering that nobody had the rare albums I wanted. DoujinAggregator would've aggregated music from sources like TLMC, some RuTracker and Nyaa torrents and VKontakte releasers - these sources usually contain releases from well-known circles.

Although that project ended up being a waste of time, I learned new things about torrenting. To encode information about the content, BitTorrent clients use a peculiar encoding format known as bencode. A bencoded string can contain four types of values:

  • Integers are encoded by prepending the letter "i" to the decimal representation of the number, followed by the letter "e", like in i42e. The size of the number in bytes is not stored, meaning there's no limit on how big a number can be encoded.
  • Strings are encoded by prefixing them with their length in bytes (excluding any null terminator), followed by a colon, e.g. 8:ayy lmao
  • Lists are encoded by listing their items and enclosing them within the letters 'l' (lowercase L) and 'e'. For example, to encode [2, 23, 'asdf'] you'd need to write li2ei23e4:asdfe.
  • Finally, we can encode dictionaries (aka associative arrays or hashmaps), but with string keys only. Similar to how lists are encoded, dictionaries start with the letter 'd' and end with 'e'. Keys and their values are written between them, each key immediately followed by its value. Example: {"foo": "bar", "numbers": [1,2,3]} would become d3:foo3:bar7:numbersli1ei2ei3eee.
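
The four rules above can be sketched as a tiny encoder. This is only an illustration of the format, not part of my parser: the function name bencodeValue is made up for this example, and sorting dictionary keys as raw strings follows what BEP 3 asks for. An empty PHP array is ambiguous, so this sketch treats it as an empty list.

```php
<?php
// Hypothetical helper for illustration; it is not part of the Bencoder class in this post.
function bencodeValue($value): string {
    if (is_int($value)) {
        // Integers: "i" + decimal digits + "e"
        return 'i' . $value . 'e';
    }
    if (is_string($value)) {
        // Strings: byte length, a colon, then the raw bytes (strlen() counts bytes)
        return strlen($value) . ':' . $value;
    }
    if (is_array($value)) {
        if ($value === array_values($value)) {
            // Sequential keys 0..n-1: encode as a list (an empty array also ends up here)
            $out = 'l';
            foreach ($value as $item) {
                $out .= bencodeValue($item);
            }
            return $out . 'e';
        }
        // Anything else is a dictionary; BEP 3 wants keys sorted as raw strings
        ksort($value, SORT_STRING);
        $out = 'd';
        foreach ($value as $key => $item) {
            $out .= bencodeValue((string) $key) . bencodeValue($item);
        }
        return $out . 'e';
    }
    throw new InvalidArgumentException('Unsupported type: ' . gettype($value));
}

echo bencodeValue(42), "\n";                                         // i42e
echo bencodeValue('ayy lmao'), "\n";                                 // 8:ayy lmao
echo bencodeValue([2, 23, 'asdf']), "\n";                            // li2ei23e4:asdfe
echo bencodeValue(['foo' => 'bar', 'numbers' => [1, 2, 3]]), "\n";   // d3:foo3:bar7:numbersli1ei2ei3eee
```

Feeding it the examples from the list reproduces the same strings, which is a handy sanity check when reading a torrent file in a hex editor.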

A torrent file is basically a huge bencoded dictionary storing all kinds of metadata. To see its contents you need a bencode parser like this one, or you can write one from scratch - which is what I did. I'm specifically showing PHP examples here because **surprise** - my site runs on PHP! That day Composer was down for some reason, so I decided to write my own implementation for shits and giggles as a programming exercise. It was specifically written for decoding torrent files (it's not like bencode has other uses); it can decode arbitrarily nested lists or dictionaries too. It took me four hours to complete, debug and test. It uses regexes to detect value boundaries, which is the reason I think it has shit performance (I've never profiled it though). Every method is kept static on purpose, because there's nothing to encapsulate. I normally leave some blank lines between the code, but here I tried to squeeze it as much as possible so as not to make the post too long. Although the class is called Bencoder it cannot encode shit because I don't need that functionality.

<?php
class Bencoder {
    private const KEY_MAX_LEN = 1e9;
    private const KEY_MAX_LEN_CHARS = 9;
    private const INT_MAX_CHARS = 19; // Based on PHP_INT_MAX
    public static function decodeTorrent($bstr) {
        $pos = 0;
        return self::decodeDictionary($bstr, $pos);
    }
    public static function decodeDictionary($bstr, &$pos = 0) {
        $pos++;
        $res = [];
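        // Stop at the closing 'e', and also at end of input so truncated data cannot read past the string
        while ($pos < strlen($bstr) && $bstr[$pos] !== 'e') {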
            $key = self::parseNext($bstr, $pos);
            if (false === $key) {
                throw new RuntimeException('could not decode key because EOL');
            }
            if ($key === 'pieces') {
                $res['pieces'] = self::parseBinaryPart($bstr, $pos);
            }
            else {
                $value = self::parseNext($bstr, $pos);
                // Strict check: 0, '' and [] are valid decoded values, only false means end of input
                if (false === $value) {
                    throw new RuntimeException('could not decode value because EOL');
                }
                $res[$key] = $value;
            }
        }
        ++$pos;
        return $res;
    }
    public static function decodeList($bstr, &$pos = 0) {
        $pos++;
        $res = [];
        while ($pos < strlen($bstr) && $bstr[$pos] !== 'e') {
            $res[] = self::parseNext($bstr, $pos);
        }
        ++$pos;
        return $res;
    }
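    public static function decodeInt($bstr, &$pos = 0) {
        // Window: "i" + optional "-" + up to INT_MAX_CHARS digits + "e"
        $intSubstr = substr($bstr, $pos, self::INT_MAX_CHARS + 3);
        if (!preg_match('~^i(-?\d+)e~', $intSubstr, $matches)) {
            throw new RuntimeException('Failed to decode integer: ' . $intSubstr);
        }
        $pos += strlen($matches[1]) + 2;
        return intval($matches[1]);
    }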
    public static function decodeString($bstr, &$pos = 0) {
        // Anchored with ^ so the length prefix has to start exactly at $pos
        $stringKeyRegex = '~^(\d{1,' . self::KEY_MAX_LEN_CHARS . '})\:~';
        $sub = substr($bstr, $pos, self::KEY_MAX_LEN_CHARS + 1);
        if (!preg_match($stringKeyRegex, $sub, $matches)) {
            throw new RuntimeException('Failed to decode string: ' . $sub);
        }
        $len = intval($matches[1]);
        // Advance by the prefix as written, so a zero-padded length cannot desync $pos
        $pos += strlen($matches[1]) + 1;
        $retval = substr($bstr, $pos, $len);
        $pos += $len;
        return $retval;
    }
    private static function parseBinaryPart($bstr, &$pos) {
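        $stringKeyRegex = '~^(\d{1,' . self::KEY_MAX_LEN_CHARS . '})\:~';
        $sub = substr($bstr, $pos, self::KEY_MAX_LEN_CHARS + 1);
        if (!preg_match($stringKeyRegex, $sub, $matches)) {
            throw new RuntimeException('Failed to decode pieces: ' . $sub);
        }
        $len = intval($matches[1]);
        $pos += strlen($matches[1]) + 1;
        // Read exactly $len bytes; other keys may follow "pieces" in the info dictionary
        $binarySlice = substr($bstr, $pos, $len);
        $pos += $len;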
        $x = unpack('H*', $binarySlice);
        // One SHA-1 digest is 20 bytes, i.e. 40 hex characters, so print one digest per line
        return chunk_split($x[1], 40);
    }
    private static function parseNext($bstr, &$pos) {
        if ($pos >= strlen($bstr)) {
            return false;
        }
        switch($bstr[$pos]) {
        case 'd': return self::decodeDictionary($bstr, $pos);
        case 'l': return self::decodeList($bstr, $pos);
        case 'i': return self::decodeInt($bstr, $pos);
        default: return self::decodeString($bstr, $pos);
        }
    }
}

The torrent we're going to test this code on is this Fate Grand Order album I picked from NyaaTorrents page 1; I've also been using this one to test my Bencode parser in CI. If I call Bencoder::decodeDictionary() with the contents of this file and print the result I get the following information: fgo_content.txt. I've omitted most of the checksums because they were too long and uninteresting.

To sum up, a torrent file contains the following metadata:

  • A list of tracker announce URLs
  • Basic metadata like creation date, encoding and a comment
  • A list of files and their sizes in the torrent. Files are listed in a flat structure and their paths are represented as an array that contains pieces of the full path. For example, the path /foo/bar/example.flac is represented as ['foo', 'bar', 'example.flac']
  • The name of the torrent
  • The number of bytes in a piece
  • The SHA-1 checksum of every piece, stored as raw binary (20 bytes per piece).
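
To make the file list and piece size a bit more concrete, here is a small sketch that works on an already decoded multi-file torrent. The array below is hand-written with made-up values (it is not the FGO torrent), and summarizeFiles is a hypothetical helper; only the key names (info, files, path, length, piece length) come from BEP 3.

```php
<?php
// Hand-written stand-in for a decoded multi-file torrent; every value here is made up.
$torrent = [
    'announce' => 'http://tracker.example/announce',
    'info' => [
        'name' => 'example-album',
        'piece length' => 262144,
        'files' => [
            ['length' => 300000, 'path' => ['foo', 'bar', 'example.flac']],
            ['length' => 1000, 'path' => ['cover.jpg']],
        ],
    ],
];

// Hypothetical helper: joins the path pieces and derives the expected number of pieces.
function summarizeFiles(array $info): array {
    $paths = [];
    $total = 0;
    foreach ($info['files'] as $file) {
        // ['foo', 'bar', 'example.flac'] becomes foo/bar/example.flac
        $paths[] = implode('/', $file['path']);
        $total += $file['length'];
    }
    // Pieces are cut from the concatenation of all files, so only the total size matters.
    // This is a ceiling division: the last piece may be shorter than the others.
    $pieceCount = intdiv($total + $info['piece length'] - 1, $info['piece length']);
    return ['paths' => $paths, 'totalBytes' => $total, 'pieceCount' => $pieceCount];
}

$summary = summarizeFiles($torrent['info']);
print_r($summary);
```

With these numbers the total is 301000 bytes, which needs 2 pieces of 262144 bytes, so the pieces field of such a torrent would hold 2 × 20 = 40 bytes of checksums.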

Sauce: http://www.bittorrent.org/beps/bep_0003.html