How Usenet Stores Binaries

The first rule of Usenet is that “you do not talk about Usenet”. But let’s ignore that for a minute and talk about Usenet, specifically how it manages and encodes binaries. The fact is that I’ve never seen a protocol abused and bent out of shape to such an extent that I’m not sure it will ever recover. The underlying protocol is the Network News Transfer Protocol (NNTP), originally specified in 1986 by RFC 977 and later superseded by RFC 3977 in 2006.

The original implementations provided a decentralized messaging system where users could post messages (much like emails, if you will) to various newsgroups organized loosely by topic. The newsgroups are maintained by a network of servers (collectively known as Usenet) that synchronize and replicate their content so that people connecting to different servers can read and follow the same discussions. Today that same system is used to carry and transfer terabytes of binary data. We’re going to take a look at how that binary data is packaged and how a 30-year-old protocol is abused to make it happen. Please note that this blog post does not concern itself with anything but the technical functionality of how uploading and downloading binary data works on Usenet.

For those who are unfamiliar with Usenet, from the user’s point of view the basic operation is that of a client/server architecture. A user runs client software, which connects to a server over TCP/SSL. The server has data (so-called messages) that are grouped into newsgroups such as alt.comp.programming or comp.games.development.programming.misc or the alt.binaries.* hierarchy, which is mostly used to carry binary content. The user can then fetch the contents of any such newsgroup, read it and/or take part in the discussion by posting their own message to the group. Any such postings are public to the entire group and readable by anyone. Overall this process is much like using an email client to connect to a server, subscribing to a bunch of mailing lists and having the client periodically interact with the server in order to fetch and display new data to the user.

Glossary

body, article, message = one unit of payload data on the server
overview, header = meta information about a message on the server, such as subject line, date, poster etc.
newsgroup = a categorized “list” of messages, for example alt.binaries.pictures.f1
headers = the (meta) data for a given newsgroup, used to display relevant information to the user
subject line = the subject (title) of a message; has special significance when the message contains binary content that is split across several messages
nzb = what torrent files are to a torrent client, NZB files are to a Usenet client

Binary Data in NNTP Messages

So let’s assume that you have a binary file (foobar.avi) that you want to upload to Usenet. Typically there are two major problems that need to be dealt with.

  1. Message continuation/combination
  2. Message encoding

Message Continuation

Typically the implementations limit the size of individual messages to around 700 KB, give or take a few tens of kilobytes. This means that most binary content doesn’t fit into a single message but must be split into several messages. The problem, however, is that the original protocol has no provisions for such a mechanism. The generally accepted solution is to rely on something called the “subject line convention”. Briefly, the idea is that the subject line of each message contains enough information for the reader to understand that the messages contain parts of a larger piece and should be combined together to recreate the final binary. This information pretty much just identifies two things: the original file name and the part in question. But to make matters worse there isn’t just a single convention being used but in fact several different conventions depending on the poster software. To illustrate, below is a list of some examples:

  • Guys have fun !!! foobar.avi [01/50]
  • Guys have fun !!! [01/60] foobar.avi.001 (15/50)
  • Guys have fun !!! foobar.avi (1/50)
  • [foobar.avi (1/10)]
  • blah blah (foobar.avi)
  • Guys!! heres your file (6/50) foobar.avi
  • Guys!! heres your file (6/50) “foobar.avi”
  • “foobar.avi (00/11) Here’s that special file yEnc

Luckily the convention is normally fixed between a set of files from one single poster, so the poster software uses one and the same convention for all the files in the uploaded message batch. A sketch of how a client might match a couple of these patterns follows below.
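To make the parsing concrete, here’s a minimal sketch (in Python) of matching a couple of the hypothetical patterns above. The pattern list, the parse_subject helper and the example subject line are all made up for illustration; a real client carries a much larger, constantly growing set of expressions.

    import re

    # A couple of hypothetical subject line patterns; real clients carry
    # dozens of these, collected over the years as new poster software appears.
    SUBJECT_PATTERNS = [
        # e.g. 'Guys have fun !!! foobar.avi [01/50]'
        re.compile(r'"?(?P<file>\S+\.\w+)"?\s*[\[\(](?P<part>\d+)\s*/\s*(?P<total>\d+)[\]\)]'),
        # e.g. 'Guys!! heres your file (6/50) "foobar.avi"'
        re.compile(r'[\(\[](?P<part>\d+)\s*/\s*(?P<total>\d+)[\)\]]\s*"?(?P<file>\S+\.\w+)"?'),
    ]

    def parse_subject(subject):
        """Try to extract (file name, part number, total parts) from a subject line."""
        for pattern in SUBJECT_PATTERNS:
            match = pattern.search(subject)
            if match:
                return (match.group("file").strip('"'),
                        int(match.group("part")),
                        int(match.group("total")))
        return None  # unknown convention, time to write yet another pattern

    print(parse_subject('Guys have fun !!! foobar.avi [01/50]'))
    # ('foobar.avi', 1, 50)

Note that even this cheats: file names containing spaces already break the first pattern, which is exactly the kind of special-casing the conventions force on you.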

Message Encoding

The original NNTP was designed to carry only text, and even that was underspecified by today’s standards: there’s no mention of a character set, for example. Furthermore, the specification says that each message body is submitted as a series of lines, each terminated by CRLF, with the body itself terminated by a line containing only a single dot. (Any line of text that begins with a dot must have that dot doubled.) The takeaway is that NNTP software in general is not 8-bit clean and cannot be expected to deal properly with binary data as-is, since binary data often contains, for example, embedded zero bytes. The natural workaround is to use some form of “binary to text” encoding that transforms the binary data into 7-bit compatible ASCII data.

There are two de facto encodings in use: yEnc and UUencode. On paper there are others too, such as BinHex and Base64 (used in conjunction with MIME based messages), but in practice these are really not used. UUencode is mostly used to encode pictures that are typically split over at most ~10 parts, and yEnc is used for pretty much everything else.

The yEnc “spec” has the benefit that it tries to solve the messy problem of the subject line convention by defining a common convention for yEnc encoded binaries. However, as is normal in the Usenet world, the convention has several different realizations and any implementation needs to be ready to handle several common patterns. yEnc also prefixes (and postfixes) each message with a small header and footer that help the decoding end understand where the decoded data should be placed in the final binary. UUencode on the other hand doesn’t properly understand the concept of encoded data being split into several messages, and thus it’s in fact impossible to reliably reconstruct a UUencoded message that spans more than 3 messages. That’s because UUencode doesn’t have any per-part header/footer, only a header for the 1st part and a footer for the last part, which means that the order of any parts in-between these two can’t be reliably figured out.
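For the curious, the core of yEnc decoding itself is tiny: every byte was shifted by 42 on the wire, and a handful of critical characters (NUL, CR, LF, ‘=’) were escaped with ‘=’ plus a further shift of 64. Below is a minimal sketch of decoding one part’s payload, ignoring the =ybegin/=ypart/=yend header lines and the CRC checks a real decoder also has to handle.

    def yenc_decode(encoded_lines):
        """Decode the payload lines of a single yEnc part. Header/trailer lines
        such as =ybegin, =ypart and =yend are assumed to be stripped already."""
        out = bytearray()
        for line in encoded_lines:
            data = line.rstrip(b"\r\n")
            i = 0
            while i < len(data):
                byte = data[i]
                if byte == 0x3D:                   # '=' marks an escaped critical character
                    i += 1
                    byte = (data[i] - 64) % 256    # undo the extra +64 shift
                out.append((byte - 42) % 256)      # undo the basic +42 shift
                i += 1
        return bytes(out)

A complete decoder would also read the begin/end offsets from the =ypart line to know where in the output file these decoded bytes belong.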

So to recap, the job of any application looking to download and assemble binaries from Usenet can be summarized in these steps:

  1. identify all messages comprising a single binary
  2. identify the encoding used
  3. work out the offset of each message’s data fragment within the output binary
  4. do the binary decoding (the reverse of the binary-to-ASCII encoding)
  5. write the data to the file and combine it into the final binary
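Strung together, a very rough sketch of that pipeline (reusing the hypothetical parse_subject and yenc_decode helpers sketched above, assuming every part is present and yEnc encoded, and ignoring error handling) might look like this:

    def assemble_binary(messages, output_path):
        """messages: list of (subject, body_lines) tuples belonging to one file."""
        parts = []
        for subject, body_lines in messages:
            parsed = parse_subject(subject)        # hypothetical helper from above
            if parsed is None:
                raise ValueError("unknown subject convention: %s" % subject)
            filename, part_no, total = parsed
            parts.append((part_no, yenc_decode(body_lines)))

        with open(output_path, "wb") as out:
            for _, data in sorted(parts):          # order the fragments by part number
                out.write(data)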

Full on Brokenness and More Legacy

Since the protocol was originally never intended to carry high volumes of binary data, the software used to transmit it is often simply broken and the data ends up corrupted. The classic way to deal with this has been to add redundancy, and a tool called Parchive (parity archive) was invented for it. The basic premise is simple: you add redundant data and compute checksums, and through those you are able to detect broken data and recover it. Traditionally the set of files is also first archived with the good old WinRAR into a set of .rar files, which are then used to calculate the parity files. In our example the file, foobar.avi, might be split into some number of .rar files (.r00, .r01, .r02 etc.), each being ~50 MB or so. At the NNTP protocol level each one of those rar files is then uploaded as multiple NNTP messages (and the subject line convention is used to identify the messages belonging to each file). Once the uploader has those .rar files, par2 (a Parchive implementation) is used to create the redundancy data as (.par2, .vol-001, etc.) files. Finally the whole set of files is uploaded to a server where it eventually gets synced between Usenet servers.
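Client software usually delegates the repair and extraction steps to the existing command line tools rather than reimplementing them. Here is a hedged sketch, assuming the par2 (par2cmdline) and unrar binaries are installed and the completed download sits in one directory:

    import subprocess
    from pathlib import Path

    def repair_and_extract(download_dir):
        """Verify/repair the download with par2, then unpack the rar set.
        Sketch only: assumes par2 and unrar are on PATH and the set is complete enough."""
        download_dir = Path(download_dir)

        # par2 loads the recovery volumes referenced by the .par2 file it is given.
        for par2_file in sorted(download_dir.glob("*.par2")):
            subprocess.run(["par2", "repair", str(par2_file)],
                           cwd=download_dir, check=True)
            break

        # Extract the first .rar; unrar follows the .r00/.r01/... volumes itself.
        for rar_file in sorted(download_dir.glob("*.rar")):
            subprocess.run(["unrar", "x", "-o+", str(rar_file)],
                           cwd=download_dir, check=True)
            break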

What this means for a client software developer is that you now also need to be able to deal with parity and rar files if you want to be a relevant implementation.

The other, secondary implication is that since a single file is actually a *set* of files, you now might want to identify not only the NNTP messages that are required for a single file but also which files are needed for a single media item/download (in this case the original foobar.avi). Better be handy with those regular expressions. 🙂
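As a taste of that grouping, one purely illustrative approach is to strip the volume suffixes so that all files of a release collapse to the same base name; the patterns below are simplified and only cover the classic .rar/.rNN/.par2 naming:

    import re

    # Simplified, hypothetical suffix patterns for rar volumes and par2 recovery files.
    VOLUME_SUFFIX = re.compile(r"\.(rar|r\d{2,3}|par2|vol\d+\+\d+\.par2)$", re.IGNORECASE)

    def release_key(filename):
        """Map a file of a release to a common base name, e.g. all of
        foobar.rar, foobar.r00, foobar.par2, foobar.vol000+01.par2 -> 'foobar'."""
        return VOLUME_SUFFIX.sub("", filename)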

So How to Retrieve The Messages?

So now that we know how the server stores the data, how do we go about retrieving it and displaying the relevant bits of information to the user?

The traditional way has been to let the user choose a group he/she is interested in and then retrieve the headers from the server. Each header gives us the meta-information for a message, such as the subject line (very relevant), poster, date (age) etc., without having to load the actual data body. This information is typically enough for us to build a local database of items of interest, to track all the various bits of information (such as individual message ids/numbers) and to show the user a list of messages in the group. Once the user identifies the items he’s interested in viewing (downloading), we consult our database to see which actual message bodies we need to download from the server. Once we receive the bodies we then do the stuff mentioned above, i.e. reverse the ASCII armoring, run through the par2 process and finally extract the rar files.
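In protocol terms this is the GROUP/OVER/BODY dance. Here is a rough sketch using Python’s nntplib module (deprecated in recent Python releases, but it shows the shape of the exchange; the server name, credentials, group and header range are made up):

    import nntplib

    # Hypothetical server and credentials.
    server = nntplib.NNTP_SSL("news.example.com", user="user", password="secret")

    # Select a group; this also tells us the available article number range.
    resp, count, first, last, name = server.group("alt.binaries.pictures.f1")

    # Fetch overview (header) data for the newest 1000 articles only --
    # in the big binary groups fetching everything is not realistic.
    resp, overviews = server.over((last - 1000, last))
    for art_num, fields in overviews:
        print(art_num, fields.get("subject"), fields.get("message-id"))

    # Later, once the user picks something, fetch the actual bodies by message-id
    # (here just the last one seen in the loop, for brevity).
    resp, info = server.body(fields["message-id"])
    body_lines = info.lines   # raw (still encoded) payload lines, ready for yEnc decoding

    server.quit()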

Back in the day this was actually the only way to retrieve content from Usenet. And some people still do this, however the sheer volume of data often makes this approach unfeasible for several of the larger groups in the alt.binaries.* hierarchy. For example, as of writing this my current server reports over 17 billion headers for alt.binaries.boneless.

Then came the NZB.

Today NZB files are the equivalent of torrent files in the torrent protocol. Basically there is specialized software that trawls the major newsgroups and builds databases of the data it discovers. Typically the databases are optimized for just retrieving the articles and provide no provisions/data for direct human viewing. The found items are then often exposed on the web as web APIs, RSS feeds or websites such as nzbindex.nl. Several client applications also integrate with specific sites providing such interfaces, allowing convenient search and download in a single software function/package.
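The NZB file itself is just a small XML document listing, per file, the newsgroups it was posted to and the message-ids of all its segments, so a client can skip the header trawling entirely. A minimal sketch of reading one, assuming the commonly used newzbin namespace:

    import xml.etree.ElementTree as ET

    NZB_NS = "{http://www.newzbin.com/DTD/2003/nzb}"

    def read_nzb(path):
        """Yield (subject, groups, [(segment_number, message_id), ...]) per file entry."""
        root = ET.parse(path).getroot()
        for file_elem in root.iter(NZB_NS + "file"):
            groups = [g.text for g in file_elem.iter(NZB_NS + "group")]
            segments = [(int(seg.get("number")), seg.text)
                        for seg in file_elem.iter(NZB_NS + "segment")]
            yield file_elem.get("subject"), groups, sorted(segments)

From there the downloader only needs to fetch those message-ids with BODY commands; no header scanning is required.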

Summary

This was just a quick introduction to the way an old protocol is used to carry terabytes of data every day. A lot of it relies on super ugly non-standard conventions that people came up with in an ad-hoc manner over the years. It’s kinda fascinating to deal with a protocol that has seen so much organic growth. At the same time the lack of formal specifications and standards is a frustrating experience and requires a lot of special cases, regular expression patience and maintenance as new conventions, such as reversing file names, come up.

For those who are algorithmically minded there’s also enough to keep you busy for a while. Efficiently dealing with volumes of data north of billions of items is not trivial. 🙂

 
