Are Companies Bulk-Scanning E-mails for Spy Agencies?

Published on October 5, 2016

UPDATE: Charlie Savage and Nicole Perlroth have an article in The New York Times resolving quite a few of the questions raised here; I’ve added comments as appropriate.

Reuters dropped a bombshell story Tuesday afternoon, reporting that in 2015 Yahoo agreed to scan all their users’ incoming e-mails on behalf of a U.S. intelligence agency, hunting for a particular “character string” and turning over messages where it found a match to the government. Yet the vagueness of the story—which appears to be based on sources with limited access to the details of the surveillance—leaves a maddening number of unanswered questions. Yahoo has not greatly helped matters with a meticulously worded non-denial, calling the story “misleading” without substantively denying it and asserting that the “scanning described in the article does not exist on our systems.” (Obvious follow-up questions: Did it exist in 2015? Does it now exist on some other systems?) Here’s the core claim from Reuters:

Yahoo Inc last year secretly built a custom software program to search all of its customers’ incoming emails for specific information provided by U.S. intelligence officials, according to people familiar with the matter.

The company complied with a classified U.S. government demand, scanning hundreds of millions of Yahoo Mail accounts at the behest of the National Security Agency or FBI, said two former employees and a third person apprised of the events.

Some immediate questions prompted but unanswered by the piece: What exactly were the e-mails being scanned for? Was this conventional scanning for a “selector” associated with an intelligence target, such as the sender’s e-mail account or originating e-mail server, or something else? Was scanning limited to message headers, or did it extend to the content of messages or file attachments? What was the legal authority used to make this demand? If such scanning does not currently “exist on [Yahoo’s] systems,” is it happening elsewhere in the communications stream, or likely to exist there again in the future? Was Yahoo unusual in receiving such a demand, or have other providers been asked to do something similar?

We can start with the last question: Sam Biddle of The Intercept queried some major providers and got relatively straightforward denials from Google, Facebook, Twitter, and Apple—though it remains possible that this is a byproduct of Reuters having gotten some details of the story wrong. Microsoft said they had “never engaged in the secret scanning of e-mail traffic like what has been reported” but “would not comment on the record as to whether the company has ever received such a request,” which could reflect simple legal caution—intelligence surveillance requests are invariably covered by broad gag orders, and it gets awkward quickly if you deny getting some types but “no comment” others—or could be an indication that the company received a similar demand, but successfully fought it.

Update: Disregard these two paragraphs; per the New York Times report, this was pursuant to a conventional FISA surveillance order, not §702—presumably one the Reuters sources were not privy to, as I’d parenthetically considered. What about the likely legal basis for the “demand”? The story conspicuously uses the word “demand” rather than “court order,” which suggests that we’re looking at a directive under the controversial Section 702 of the FISA Amendments Act of 2008. Yet the article also implies that the demand was not merely a request: The company considered “fighting” it but concluded they would likely lose in court, and felt sufficiently constrained to keep it secret that they kept their own security team in the dark, eventually prompting the resignation of security chief Alex Stamos. So this was a demand with the force of law, which in the absence of a court order implies §702. (It is possible there was an order that the Reuters sources were not privy to, but let’s tentatively take that wording to indicate the demand did not take that form.) Bolstering that conclusion are the comments of Sen. Ron Wyden, who sits on the intelligence committee, referencing §702 in connection with the Yahoo story and insisting that “the executive branch has an obligation to inform the public” (not, conspicuously, “the public and Congress”—perhaps implying this is not news to him) if surveillance under §702 is no longer limited to specific user accounts, but involves scanning content.

Unlike traditional surveillance under the Foreign Intelligence Surveillance Act (FISA), which requires the secretive Foreign Intelligence Surveillance Court (FISC) to issue specific orders authorizing monitoring of specific targets, §702 permits the intelligence community to issue “directives” requiring communications providers like Yahoo to provide the government with messages pertaining to foreign targets selected by the agencies without any specific court approval: The court only approves the general procedures used to select targets. At last count, there were 94,368 such “targets” being monitored under a single blanket authorization.

Thanks to Edward Snowden’s disclosures, the public now knows there are two primary ways the NSA conducts §702 surveillance: the PRISM program involves obtaining content directly from American providers like Yahoo, Microsoft, or Google. The government would build a list of “selectors”—identifiers such as e-mail addresses—linked to their foreign targets, and then “task” collection on those selectors, requiring companies to hand over data associated with a specific user account. Separately, the government also used §702 to conduct what it dubbed “Upstream” surveillance—meaning scanning communications traffic in motion as it flowed over the Internet backbone. Upstream surveillance, too, is targeted at particular “selectors,” but crucially, it is not limited to messages sent to or from the target. Rather, communications are scanned in their entirety, and any match for one of the approved selectors—including, for example, a mention of the target’s e-mail address in the body of an e-mail between two other people—would trigger interception of that communication. This practice of acquiring messages that pertain to the target, but are neither to nor from the target, has been widely referred to as “about collection,” and is one of the more controversial aspects of §702. One limitation, however, was that the National Security Agency would filter the communications using Internet Protocol addresses in an effort to exclude completely domestic communications—e-mails between two people inside the United States—which they are not supposed to collect under §702. At least as of 2009, such filtering was required by the §702 targeting procedures the FISC had approved.

What’s new here, if the Reuters story is at least broadly accurate, is that Yahoo was not just asked to turn over the communications of specific targeted Yahoo users, but was apparently scanning the incoming e-mails of all users and turning the matches over to the government. What might be going on here? Well, the simplest explanation would be that the government—or perhaps more specifically the FBI, which normally has full access to PRISM databases, but not the Upstream take gathered by NSA—decided to replicate the “about collection” they were doing via Upstream within the systems of e-mail providers. Why would they need to do that? One possibility is just that FBI wanted access to the results of “about” searches in the PRISM databases they have access to. The most likely answer, however, is that especially since the Snowden revelations began, more and more companies have been encrypting their traffic by default. That means data that would have been visible to an NSA sniffer sitting on the Internet backbone is now scrambled and unintelligible, making Upstream increasingly useless. Even for NSA, breaking the encryption on traffic wholesale is likely infeasible—but aside from any message content separately encrypted by individual users, that traffic would be readable once it had arrived at Yahoo and been decrypted with the company’s private keys. Yahoo rolled out default HTTPS encryption for Webmail users in 2014.

Nothing at this point is absolutely certain, but I feel at least reasonably confident about this much, and other surveillance experts I’ve spoken to have come to similar conclusions. On the million dollar question of what, exactly, Yahoo was asked to scan for, it’s necessary to get somewhat more speculative. There are several possibilities.

The most innocuous of these would be that the government only asked Yahoo to look at e-mail headers for the purpose of collecting messages from specific foreign targets—specified by their e-mail addresses or originating e-mail servers—who were not Yahoo users. But this seems doubtful for at least three reasons. First, it doesn’t sound like the sort of thing that would be the source of significant internal controversy, or need to be kept secret from the security team. Second, it seems unlikely to require the production of significant new code: E-mail providers routinely filter incoming e-mail from known spammers or cyberattackers. Third, it seems hard to believe the government would be making requests of this kind for the first time in 2015, or that it would be necessary to specifically scan all and only incoming e-mail in realtime for this purpose. Intelligence agencies often learn about a new e-mail address associated with a surveillance target, and would naturally want to obtain older messages to or from that address from cooperative providers.

More probable, then, is the possibility alluded to earlier: That the Yahoo request represented an expansion of Upstream “about collection” rendered increasingly ineffective by pervasive encryption. The scans, on this scenario, would still be looking for selectors like e-mail or IP addresses associated with §702 targets, but would include the contents of e-mail messages, not only headers. One reason to find this legally unsettling is that the government has previously justified “about” collection on the grounds that it is too technically difficult to isolate header information from message contents in realtime as it flows over the Internet backbone. The claim, in other words, was that they don’t want to go poking through the contents of messages, but since Internet packets all look the same from the outside, this is the only way to ensure they’re not missing messages that are to or from the target. The trouble is that while this argument has some plausibility when it comes to backbone searches, it is not really credible when collection is conducted by a service provider like Yahoo: At that point, the various packets making up a message have not only been decrypted, but stitched together and processed as a particular type of communication with a particular format and characteristic structure. I can’t speak in any detail to Yahoo’s setup, but there is no obvious technical reason they would be unable to pull out messages to or from particular users without looking at message content. It’s a more complicated question whether this casts doubt on the legality of such scans, but at that point it would seem to be a matter of wanting to conduct “about” searches, not a technical necessity.
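To make the distinction concrete, here is a minimal sketch, assuming a provider that holds each incoming message as one complete, already-decrypted text. It uses Python's standard `email` library; the selector address, the sample message, and the function names are all hypothetical, and nothing here describes Yahoo's actual systems.

```python
from email.parser import HeaderParser
from email.utils import getaddresses

SELECTOR = "target@example.org"  # hypothetical selector, for illustration only

# Hypothetical message between two third parties that merely mentions the selector.
RAW = """\
From: alice@example.com
To: bob@example.net
Subject: lunch

Have you heard from target@example.org lately?
"""

def to_from_match(raw, selector):
    """Compare the selector against sender/recipient headers only.

    HeaderParser parses just the header block; the body is never compared
    against the selector.
    """
    headers = HeaderParser().parsestr(raw)
    fields = []
    for name in ("From", "To", "Cc"):
        fields.extend(headers.get_all(name, []))
    addresses = {addr.lower() for _, addr in getaddresses(fields)}
    return selector.lower() in addresses

def about_match(raw, selector):
    """Compare the selector against the whole message, body included."""
    return selector.lower() in raw.lower()

print("to/from match:", to_from_match(RAW, SELECTOR))
print("about match:  ", about_match(RAW, SELECTOR))
```

On this sample, the header-only check does not fire while the whole-message check does, which is the gap between collection "to or from" a target and collection "about" one. Once a message is available in structured form, choosing the second function over the first is a design decision rather than something the format forces.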

Yet another possibility is that these scans are related to the expanded use of surveillance authorities for cybersecurity purposes. As Charlie Savage of The New York Times reported last year, NSA and FBI are increasingly turning to these tools to detect cyberattackers. An e-mail scan, then, might target either text or a Web link associated with an ongoing “spear-phishing” attack, a signature associated with a malware payload in a file attachment, or a fragment of a file containing stolen data. This would be extremely useful, of course, when the government knows what a particular cyberattack looks like, or what particular data has been exfiltrated by hackers, but does not know either the identity of the culprit or which particular disposable e-mail accounts their target will be using. But it would also be extremely concerning to the extent that it represents a shift from a familiar and relatively uncontroversial model of surveillance—identify your target, then look specifically at the contents of their communications—to a mass surveillance model where all content is scanned indiscriminately for the purpose of identifying the target.
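As a rough illustration of what signature-based scanning of this kind could look like, the sketch below checks every decoded part of a message, attachments included, against a table of byte patterns. It is an assumption-laden toy: the signature names and values, the demo message, and the functions `scan_message` and `build_demo` are invented for illustration and are not drawn from the Reuters or New York Times reporting.

```python
from email import message_from_bytes
from email.message import EmailMessage

# Hypothetical signatures; real indicators would come from analysis of an attack.
SIGNATURES = {
    "sample-malware": b"\xde\xad\xbe\xef",
    "phishing-link": b"http://login.example.invalid/reset",
}

def scan_message(raw):
    """Return the sorted names of all signatures found in any part of a message."""
    msg = message_from_bytes(raw)
    hits = set()
    for part in msg.walk():
        if part.is_multipart():
            continue  # container parts carry no payload of their own
        # decode=True undoes base64 / quoted-printable transfer encoding
        payload = part.get_payload(decode=True) or b""
        for name, signature in SIGNATURES.items():
            if signature in payload:
                hits.add(name)
    return sorted(hits)

def build_demo():
    """Build a hypothetical message whose attachment contains one signature."""
    msg = EmailMessage()
    msg["From"] = "sender@example.com"
    msg["To"] = "user@example.net"
    msg["Subject"] = "invoice"
    msg.set_content("Please see the attached file.")
    msg.add_attachment(
        b"header" + b"\xde\xad\xbe\xef" + b"trailer",
        maintype="application",
        subtype="octet-stream",
        filename="invoice.bin",
    )
    return msg.as_bytes()

print(scan_message(build_demo()))
```

Note that `scan_message` takes no sender or recipient as input: to find the handful of messages that match, it has to run over the content of every message it is given. That is the shift in model the paragraph above describes.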

Update: Ding ding ding. This appears to be the right answer. A related possibility would be scanning for electronic signatures associated with cryptographic tools used by known adversaries—such as a string of characters indicating that a message is encrypted with the Mujahideen Secrets software used by Al Qaeda. NSA’s targeting procedures for §702 seem to contemplate keying surveillance to such signatures. One of the criteria mentioned for verifying the “foreignness” of a surveillance target is:

Information indicates that Internet Protocol ranges and/or specific electronic identifiers or signatures (e.g., specific types of cryptology or steganography) are used almost exclusively by individuals associated with a foreign power or foreign territory, or are extensively used by individuals associated with a foreign power or foreign territory.

Again, it is not hard to see why intelligence agencies would find such scans useful, but it would be a serious mistake to normalize the bulk scanning of communications content—an indiscriminate Fourth Amendment “search” of people not known to be foreign intelligence targets—even if one is not overly dismayed by a particular application of that approach. Surveillance architectures create their own institutional momentum, and a software tool designed to scan for digital signatures can just as easily scan for words or phrases in messages written by humans—a possibility that becomes far more tempting once the necessary technical infrastructure is in place.

It is hard to choose between these possibilities—or rule out others I may not have considered in detail—with the scant information provided by Reuters. Some features of the story are at least suggestive, however. Why was the e-mail scan described as specifically limited to realtime monitoring (rather than applying to all messages in storage) of incoming messages (but not, apparently, outgoing messages sent by Yahoo users)? One way to read this would be as supporting the cybersecurity hypothesis: If you are monitoring an ongoing attack, or tracking some piece of data that has just been stolen, there’s no need to waste time digging through old messages, and speed may be essential to an effective response.

Alternatively, there’s a possible legal or procedural rationale for the limitation that might make sense if the intent was to replicate Upstream “about” searches. In NSA jargon, there is the “target” of surveillance (the person or entity about or from whom information is sought); the “selector” (the specific term used to filter out the information to be collected); and the “facility” at which surveillance is directed (the physical or virtual communications channel from which the information is obtained). In the simplest type of case, these could all be the same: There is a target known only as the user of a particular e-mail address, which serves as both the “selector” and—when communications are obtained from the provider who hosts that account—the “facility” at which surveillance is directed. But they might also all be different. An individual target might have several associated “selectors” (different e-mail accounts or other digital identifiers), and as the case of Upstream “about collection” shows, the “facility” might be an Internet routing switch rather than a particular repository of stored messages associated with that account. My (possibly incomplete) understanding from discussions with intelligence officials is that a scan of the content of messages sitting in a particular user’s inbox would be considered surveillance “directed at the facility” of the individual user’s account, even if the scan was based on some different selector. And that probably requires that the user whose inbox it is has either been designated as a foreign target or, if they’re a U.S. person, is named in a particularized warrant issued by the FISC. Conceivably, however, intelligence community lawyers have decided the situation is different if the scans are conducted before messages are routed to specific inboxes—at which stage they’re treated analogously to Upstream traffic. Before arriving in the recipient’s inbox, in other words, there might be no “particular, known U.S. person” who could be considered a “target” of the scan, triggering a laxer set of rules constraining searches.

Much of this is speculative, and we’ll doubtless learn more in the days and weeks to come. But anyone who cares about digital privacy should be keeping a close eye on this evolving story.