Friday’s publication of further classified documents originating from GCHQ reveals more about the scale of internet mass surveillance being conducted in this country.
The documents, in keeping with other classified files released by Edward Snowden, are extremely heavy on jargon, something that can make it time-consuming for even technically fluent readers to understand the documents or make constructive links between them.
I spent some time this morning trying to understand the phrase “target detection identifier” – or TDI. This is a term used fairly liberally in the released documents. What is a TDI, how does GCHQ use them, and why are they important? This is what I learnt.
Firstly, a TDI is essentially a presence event: an event that occurs when a particular individual communicates using the internet. One document describes a TDI as being a definite indicator of presence, that [is] unique and persistent to a user/machine.
Any kind of presence indicator must by definition relate to at least one point in time. TDIs are also unique to a particular individual, so it is possible to think of them crudely as being a kind of LED that illuminates briefly every time a given individual browses a web page, makes a Google search or sends an email (although TDIs are also routinely bulk-saved for future analysis).
The documents show clearly that these TDIs are not created only for suspected terrorists, cybercriminals or others specifically targeted by the security services. As far back as 2009, there were 2500 “de-duplicated” events being generated every second. If we assume that “de-duplicated” means that repeated TDIs associated with the same individual have been removed, that suggests that at the time up to 150,000 individuals per minute were having at least some kind of basic metadata recorded about their internet usage.
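The arithmetic behind that estimate can be checked in a few lines. This is only a sketch of the calculation above: the 2500 figure comes from the document, while the function and constant names are my own, and the result is an upper bound that assumes every de-duplicated event in a given minute belongs to a different individual.

```python
# Figure quoted in the text: de-duplicated events generated per second (2009).
EVENTS_PER_SECOND = 2500
SECONDS_PER_MINUTE = 60


def events_per_minute(per_second: int) -> int:
    """Scale a per-second event rate up to a per-minute rate."""
    return per_second * SECONDS_PER_MINUTE


# Upper bound only: it assumes no individual appears twice within the minute.
upper_bound = events_per_minute(EVENTS_PER_SECOND)
print(f"{upper_bound:,} events per minute")  # 150,000 events per minute
```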
The document is now around six years old, and we can be certain that the breadth and depth of surveillance has only increased since then.
So why are TDIs important to GCHQ? I believe it is likely to be related to the source of much of the data that is available to them: the transatlantic fibre-optic cables that enter the UK mainland near the Cornish town of Bude. (A GCHQ outpost is conveniently located nearby.)
It is known that GCHQ hoovers up the data passing through these cables. Given the huge amount of internet traffic they carry, how can particular chunks of data be associated with individuals?
Two candidates spring to mind. The first is by IP address: the numbers which uniquely identify a particular computer and which will be attached to each packet of data. But IP addresses identify machines, not individuals. Most internet service providers periodically change the IP address provided to customers, and IP addresses are often shared between many users too (in offices, internet cafes and shared Wi-Fi connections, for example).
The second option seems more useful. Cookies are already used by websites to uniquely identify and track users, and are typically included in each and every web page loaded, email or chat message sent, and so on.
Cookies not only conveniently identify individual users, rather than machines, but are concise and always transmitted in a standardized form – attributes that could make them easy to identify, even in a very high-bandwidth stream of raw internet traffic.
Even more conveniently, they are also used by almost every major website.
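To illustrate what "a standardized form" means in practice, here is a minimal sketch of pulling cookies out of a single unencrypted HTTP request. It is not taken from the released documents and says nothing about how GCHQ actually does this; the function name, the sample request and the cookie names are all hypothetical. The only thing it relies on is that a browser sends its cookies on one header line, as `name=value` pairs separated by semicolons.

```python
def cookies_from_request(raw_request: str) -> dict:
    """Return the cookie name/value pairs found in a plaintext HTTP/1.1 request."""
    # Headers end at the first blank line; anything after that is the body.
    head = raw_request.split("\r\n\r\n", 1)[0]
    cookies = {}
    for line in head.split("\r\n")[1:]:  # skip the request line ("GET / HTTP/1.1")
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "cookie":
            # Cookie pairs are separated by ";" and written as name=value.
            for part in value.split(";"):
                key, eq, val = part.strip().partition("=")
                if eq:
                    cookies[key] = val
    return cookies


# Hypothetical request, for illustration only.
SAMPLE_REQUEST = (
    "GET /search?q=example HTTP/1.1\r\n"
    "Host: www.example.com\r\n"
    "Cookie: uid=abc123; lang=en\r\n"
    "\r\n"
)

print(cookies_from_request(SAMPLE_REQUEST))  # {'uid': 'abc123', 'lang': 'en'}
```

Because the header name and the pair syntax are the same for every website, the same few lines work regardless of which site set the cookie.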
Given this, it does not surprise me that while the documents claim that over 70 distinct TDIs have been “discovered”, the majority appearing in demonstration screenshots are cookies used by major internet companies.
It makes a lot of sense for GCHQ to piggy-back on the existing methods of tracking users already routinely employed on the internet.
In conclusion, as well as serving as simple indicators of internet activity on the part of an individual, I suspect that TDIs are also an initial method of giving structure to the data being bulk-intercepted from fibre-optic cables.
Once the incoming stream has been tagged with as many TDIs as possible, it can be more efficiently searched and processed based on individual, network location or application.
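As a rough illustration of why tagging helps, the sketch below groups a handful of made-up records by whichever field is of interest. Everything in it is an assumption on my part: the `Record` fields, the sample values and the `build_index` helper are hypothetical and do not come from the documents. It only shows that once each record carries an identifier, looking things up by individual, network location or application becomes a simple keyed lookup.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One tagged event. All fields are hypothetical, for illustration only."""
    timestamp: int      # seconds since some arbitrary start
    source_ip: str      # network location
    application: str    # e.g. "webmail" or "search"
    identifier: str     # the tag attached to the event, e.g. a cookie value


def build_index(records, field):
    """Group records by the value of one field so they can be looked up directly."""
    index = defaultdict(list)
    for record in records:
        index[getattr(record, field)].append(record)
    return dict(index)


# Made-up sample data; 192.0.2.x is an address range reserved for documentation.
SAMPLE = [
    Record(0, "192.0.2.10", "search", "uid=abc123"),
    Record(5, "192.0.2.10", "webmail", "uid=zzz999"),
    Record(9, "192.0.2.44", "webmail", "uid=abc123"),
]

by_identifier = build_index(SAMPLE, "identifier")
by_ip = build_index(SAMPLE, "source_ip")

# The same identifier turns up behind two different IP addresses...
print([r.source_ip for r in by_identifier["uid=abc123"]])
# ...while a single IP address carries two different identifiers.
print([r.identifier for r in by_ip["192.0.2.10"]])
```

The sample data also shows the earlier point about IP addresses: one address carries two identifiers, and one identifier appears behind two addresses.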