Desperately needed research is finally becoming possible. The European Parliament has adopted the new Digital Services Act (DSA), and the final text was approved by the Council of the European Union (EU) on Oct. 4. The act overhauls EU law on digital service providers’ legal responsibility for the content their users post and on their obligations for content moderation.

This major set of regulations has many facets, and it could begin to affect users all over the world when it enters into force by the end of the year. Applying all of its rules, however, will be complicated. This article aims to draw Just Security readers’ attention to one exceptionally important advancement the DSA facilitates for the study of disinformation: data access for independent researchers.

Only online platforms themselves can fully see what is happening inside the new digital worlds that have drawn in billions of users. Tech giants have maintained complete control over access to data that could, for example, reveal the quantifiable extent to which lies and propaganda spread on their platforms. We know these destructive messages have been rapidly traveling, ricocheting, and reverberating across digital platforms to propagate a big lie. We have also known for years how this dynamic can yield truth decay within societies.

Legislation to help penetrate this opaque digital sphere has arrived at last. The new DSA provides for legally mandated access to data so that independent social scientists can study the online public square, moving research beyond largely descriptive work toward effects that are “eminently measurable” through methods such as causal inference.

In turn, understanding information flows within the online ecosystem can sharpen and target policy proposals addressing the technical capacities of digital disinformation operations, referred to here as disinfo-ops. Such strategies have the potential to be more effective than the whack-a-mole approaches of content moderation that threaten freedom of expression. Simply put, without rigorous empirical study, we cannot propose fully evidence-based solutions.

The Opaque World of Social Media Platforms

Absolutely nobody knows the full extent of the deliberately distorted, tailored, and micro-targeted information individuals are confronting. We do not know the specifics of social media algorithms, nor the efficacy of the platforms’ content moderation. Nor can we clearly distinguish how much harmful activity is a state-sponsored foreign operation, a domestic campaign, or simply profit-driven algorithms gone dangerously awry.

While social scientists now have access to more data than ever before because of the sheer quantity generated every day, they can access only a small fraction of what actually exists. As a result, researchers either publish findings based on partial data or sign away academic freedom in non-disclosure agreements in order to work for an online platform.

Researchers have devised creative data-collection methods to deal with this opacity. One team created a browser extension that volunteers could install to send the researchers the political ads they saw on Facebook, along with information about why those ads targeted them. Facebook eventually sent a cease-and-desist letter demanding that the team shut down the extension and delete the data it had collected. Others have employed the time-consuming process of “scraping” data through Application Programming Interfaces (APIs) built by the social media companies. Despite valuable outcomes (for example, proving tech companies’ public claims about the reach of misinformation to be false), the data thus obtained can later be expunged, leaving no way to replicate the results.
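To make the fragility of this approach concrete, here is a minimal sketch, in Python, of what API-based collection looks like. The endpoint, field names, and token are hypothetical stand-ins (real platform APIs differ and change often); the point is that the researcher holds only a local snapshot, which cannot be re-fetched if the platform later removes the underlying posts or revokes access.

```python
import json

import requests  # standard third-party HTTP client

# Hypothetical values for illustration only; real platform APIs differ.
API_URL = "https://api.example-platform.com/v1/posts/search"
ACCESS_TOKEN = "YOUR_RESEARCH_TOKEN"


def collect_posts(query: str, max_pages: int = 5) -> list:
    """Page through search results, keeping a local copy of each post."""
    posts, cursor = [], None
    for _ in range(max_pages):
        params = {"q": query}
        if cursor:
            params["cursor"] = cursor
        resp = requests.get(
            API_URL,
            params=params,
            headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
            timeout=30,
        )
        resp.raise_for_status()  # fails outright if access is revoked
        payload = resp.json()
        posts.extend(payload.get("data", []))
        cursor = payload.get("next_cursor")
        if not cursor:
            break
    return posts


if __name__ == "__main__":
    archive = collect_posts("election misinformation")
    # This local file is the only durable record: if the platform later
    # expunges the posts, the collection cannot be reproduced or audited.
    with open("archive.json", "w") as f:
        json.dump(archive, f)
```

A study built on such a snapshot can be neither re-run nor independently verified once the source data disappears, which is precisely the replication problem described above.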

The Cambridge Analytica debacle, in which a data research scientist harvested information from up to 87 million Facebook profiles, raised important questions on this front. That episode and others cast a dark shadow over the use of personal data and privacy. What is more, leaks of tens of thousands of internal Facebook documents showed that the company collects, retains, and studies its own massive troves of data, and that its findings have notably shown negative consequences for society.

Perhaps the starkest demonstration of how dense the fog has become resides in the reports of the United States Senate Select Committee on Intelligence (SSCI) investigating the Russian operation to interfere in the 2016 presidential election. The Committee outsourced analysis of the primary companies involved (Facebook, Twitter, and Alphabet, Google’s parent company) to two teams of researchers. Although the reports offered the first groundbreaking view into the foreign operation in 2018, their analyses were incomplete: “None of the platforms appears to have turned over complete sets of related data to SSCI.”

The SSCI’s 2019 final report on “Russia’s Use of Social Media” found that the government, and even the social media companies themselves, face severe constraints in achieving even a basic understanding of this widely known disinfo-op:

[T]he Committee’s ability to identify Russian activity on social media platforms was limited. As such, the Committee was largely reliant on social media companies to identify Russian activity and share that information with the Committee or with the broader public. Thus, while the Committee findings describe a substantial amount of Russian activity on social media platforms, the full scope of this activity remains unknown to the Committee, the social media companies, and the broader U.S. Government (emphasis added).

In essence, one of the most powerful countries in the world – known for its mass surveillance capabilities – is unable to speak to the size and scale of a digital disinformation operation targeting its own citizens. No reports since have filled this information gap.

Fits and Starts

The primary reason data access has been barred is to protect trade secrets and the privacy of users. Individual privacy is a fundamental human and civil right and must not be dismissed. Nevertheless, experts are exploring ways to crack the nut of data access in a manner that preserves privacy. At bottom, finding an appropriate balance among the different interests at stake is vital, since disinfo-ops can cut deep and jeopardize the legitimacy of democratic elections.

Progress in accessing more data has been slow. In 2018, Twitter began releasing datasets on foreign and domestic state-backed entities whose aim was to influence elections and other civic conversations. This development was welcomed by top researchers. However, because the released content comes from what Twitter “believe[s]” to be “connected to state linked information operations,” researchers remain dependent on the company to select and build the datasets.

That same year, the Social Science One project, incubated at Harvard’s Institute for Quantitative Social Science, was launched with high hopes. Its agreement with Facebook would open access to data for researchers to study “the effects of social media on democracy and elections.” A commission of senior academics was created to act as a third party to manage the industry-academic partnerships the project was meant to facilitate. Outside scholars could petition for datasets through research proposals evaluated by the commission.

To address the genuine privacy concerns, the parties eventually agreed to apply “differential privacy” to the datasets in order to prevent reidentification of individuals in the data. This solution introduced calibrated “noise” to provide “mathematical guarantees of privacy for research subjects, and statistical validity guarantees for researchers seeking social science insights.” However, the founders of the project readily admitted that the noise would render the conclusions drawn “more uncertain” and might “make certain valid research questions impossible.”
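The privacy-accuracy trade-off the founders describe can be made concrete with a minimal sketch. The Python example below (with illustrative numbers of my own choosing, not Social Science One’s actual mechanism) adds Laplace noise to a simple count query; the noise scale is the query’s sensitivity divided by the privacy budget epsilon, so stronger privacy (a smaller epsilon) directly means noisier, more uncertain answers.

```python
import numpy as np


def dp_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Return an epsilon-differentially private version of a count.

    Adding or removing one person changes a count by at most 1 (the
    sensitivity), so Laplace noise with scale sensitivity/epsilon
    satisfies epsilon-differential privacy for this query.
    """
    return true_count + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)


# Illustrative only: suppose 12,000 users in a dataset shared a given URL.
for eps in (1.0, 0.1, 0.01):
    answers = [dp_count(12_000, eps) for _ in range(1_000)]
    print(f"epsilon={eps:>5}: mean={np.mean(answers):9.1f}, "
          f"std={np.std(answers):7.1f}")
```

The same noise that protects individuals widens the error bars on every estimate; at very small epsilon, answers like these become too uncertain to support some otherwise valid research questions.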

After years of delay and pressure, Facebook finally delivered the first datasets in 2020. However, the company admitted in 2021 to handing over incomplete datasets from which roughly half of U.S. Facebook users were excluded. Moreover, some experts have expressed serious concerns with the entire enterprise: “My opinion is that Facebook is working with researchers mainly to gain positive news coverage.”

Facebook has also at times shared specific datasets with the internet company Graphika to explore pages and accounts that the FBI had identified as part of a foreign operation. While allowing some independent investigators to analyze particular data is a positive step, such limited access stunts researchers’ ability to check each other’s work.

Given these shortcomings, various bills before the U.S. Congress aim to force transparency onto social media companies. These proposed laws target paid advertising, public content, algorithmic reporting, and sensitive information. 

Mediated access to a wide range of sensitive platform data for vetted researchers is included in a couple of the legislative measures. In particular, the Platform Transparency and Accountability Act has been informed by law professor Nate Persily, co-founder of Social Science One. He resigned from the project in 2020 and has advocated for imposing a legal obligation on these companies to share data, paired with legal immunity for them when they do so. He argues that such legislation is critical to understanding one of the greatest threats to democracy.

The Digital Services Act in the EU

Platforms have often invoked the European Union’s General Data Protection Regulation (GDPR) to refuse data-sharing with researchers, but the GDPR carves out exceptions for “scientific or historical research purposes.” In deciding whether a requester has a legal basis for accessing particular data, the relevant analysis frequently turns on user consent or a legitimate interest. Either of these paths still requires tech companies (or a third party) to determine the validity of each investigation and method of study. As we have seen, platforms have not demonstrated an ability to manage this responsibility, and even important appeals from inside the companies have been thwarted. (For example, data scientists at Facebook wished to study the prevalence of COVID-19 misinformation on the platform but were denied the resources to do so.) Moreover, Social Science One illustrates some of the problems with having a third party manage data requests.

Article 6 of the GDPR also envisions the sharing of data when it is “necessary for compliance with a legal obligation.” Because the DSA now imposes legal duties on the largest online platforms and search engines – those whose average monthly users number 10 percent or more of EU consumers (45 million people for the time being) – such as Facebook, Twitter, Google, Instagram, and TikTok, the two laws together will expand data-sharing. Researchers will now have an avenue to access the data of companies with enough reach to pose a systemic societal risk and that therefore “bear the highest standard of due diligence obligations.”

The DSA thus “provides a framework for compelling access to data,” as stipulated in Article 40. The European Commission and Member States, as well as independent researchers, can now seek access to certain data:

Upon a reasoned request from the Digital Services Coordinator of establishment, providers of very large online platforms or of very large online search engines shall, within a reasonable period, as specified in the request, provide access to data to vetted researchers […] for the sole purpose of conducting research that contributes to the detection, identification and understanding of systemic risks in the Union […].

The “vetted researchers” eligible to receive such data must (a) be affiliated with a research organization, (b) be independent of commercial interests, (c) disclose the funding of their research, (d) be capable of fulfilling the specific data security and confidentiality requirements, (e) demonstrate that the requested access is necessary and proportionate to their research purposes, (f) ensure that the research is carried out for the purposes laid out in the DSA, and (g) commit to making the research publicly available. Finally, to access data generated and retained by a tech company, such researchers must go through a Digital Services Coordinator designated in each Member State.

Some valid criticisms were lodged against the draft DSA’s legal regime (see here, here, here and here), and changes were made in the course of its passage. For example, there were calls for journalists and NGOs to be eligible for access, because they also fulfill a key watchdog function in society. While no provision was made for the former, the final legislation dropped the requirement of an “academic” affiliation, and access may extend to researchers from “civil society organizations that are conducting scientific research.”

Some drawbacks remain in the final version of the DSA:

  • Researchers cannot gain direct access to data; they must present reasoned requests through a government regulator.
  • Research topics must address “systemic risks” under Article 34(1). One hopes this would include research into disinformation and online-extremism techniques such as cross-platform coordination, attention hacking, or information laundering.
  • There are concerns that the carve-outs for security, trade secrets, and confidential information will be litigated to their limits and significantly reduce access. Trade secrets are given special mention in the legislation.
  • Clear protections for independent data collection via “scraping” methods are not provided.

Mathias Vermeulen has written an excellent piece discussing the DSA at length as it was being debated within the EU. He points out that “establishing this legal basis is only a first step to enable GDPR-compliant access to data,” and advises developing a code of conduct to facilitate data-sharing that conforms to the law. Article 40 of the GDPR encourages the creation of such codes of conduct, and, as he notes, codifying how the GDPR should apply in a “specific, practical and precise manner” has benefits for transparency and for democracy.

Conclusion

There is an important awakening to the gravity of this problem and the need for researched solutions. Social media platforms are curated by algorithms that can be gamed and manipulated into lie machines. Greater access to the data tech companies hold will enable us to quantify the extent of such activity and its impact on our societies.

Keen international lawyers have been advancing projects that integrate legal and empirical research (e.g., Counter-Terror Pro LegEm). Such methods are instructive for other jurists who are aware of the new EU law and wish to capitalize on this important opening. For my part, just before President Trump took office, I argued that Russia’s disinfo-op of 2016 should be considered an act of “coercion” under the international law of non-interference. As we learned more about the extent and nature of the foreign intervention, I developed the argument here, here, and here. This new data access can be harnessed to answer empirical questions about the breadth, depth, and precision of operations and thus to ground a more precise definition of “coercion” – i.e., one that is verifiable and measurable.

There is still much to be fleshed out as the DSA is applied. But we will finally be discussing the implementation of meaningful privacy-preserving data access rather than clamoring for a basic point of entry. It is a welcome development, and this researcher believes that independent study will yield revelations that provoke calls for still more access. As Professor Persily lucidly stated: “Data access is the linchpin for all other social media issues.”

IMAGE: Futuristic server room with light (Getty Images).