Disappearing Data
On Wednesday, November 6, 2024, Professor Maryam Gooyabadi arrived at the Hazel Quantitative Analysis Center and called a meeting with her undergraduate research assistants to deliver an unusual and urgent directive: “Download everything.”
All semester, the students had been working with data related to firearms and gun violence in the United States. Many of the databases—critical to the research and coursework at Wesleyan’s Center for the Study of Guns and Society—were hosted on government websites. But now Gooyabadi feared they were in jeopardy. It was the day after the presidential election, and the incoming administration’s disparagement of research and higher education—combined with its promise of drastic budget cuts—fit a pattern that Gooyabadi had seen before, both in her work as a data scientist studying social phenomena and in her lived experiences. “I grew up in Iran!” she says. “I know what it looks like, what it sounds like. What that [rhetoric] means to me is we are not going to have access to these things.” In a few months’ time, Gooyabadi’s intuition would prove prophetic.
While federally sponsored data is a goldmine for academic experts who, like Gooyabadi, are pursuing specific lines of inquiry, it’s also invisibly crucial to all of us. Data informs economic and environmental policy, public health interventions, and all kinds of laws and regulations. Censored or manipulated data threatens to undermine the science we rely on as a society. That’s why data disappearing from .gov websites this year sent shockwaves through the ranks of the research community. While efforts have begun to recover missing data and preserve what is thought to be at risk, this is a problem whose ultimate scope and impact remain unknown.
“This is absolutely an unprecedented situation,” says James Guerrera-Sapone, Wesleyan’s research data librarian.
Open Data as the Norm
It’s difficult to imagine a sector of society that doesn’t rely on federal data. Families use government housing data when planning to relocate. Farmers use climate data to plant for more successful harvests. The maritime shipping industry—which contributes more than $550 billion annually to the GDP—relies on NOAA (National Oceanic and Atmospheric Administration) oceanographic data to keep cargo ships safe and on schedule. The application of data has led to seat belt requirements, secondhand smoke bans, and the childhood vaccine regimens that have eradicated smallpox and polio.
And since the advent of the internet, open data policies and legislation have been met with bipartisan support. The government’s statutory approach to data is determined by the Foundations for Evidence-Based Policymaking Act, which was signed into law in 2019 by President Trump. It stipulates that the federal government has a responsibility to produce and make public vast amounts of information.
Such data has historically been seen as trustworthy, reliable, and perpetual, according to Guerrera-Sapone. He says, “The research community believed that online government documents and data would continue to be available as they always have been, regardless of who came to power.”
That changed this year. Within days of Donald Trump’s second inauguration on January 20, 2025, his administration issued a barrage of executive orders and memos detailing policy approaches for everything from climate change to diversity, equity, and inclusion initiatives. Then datasets and web pages started disappearing from more than a dozen government sites. On February 2, The New York Times reported that some 8,000 web pages had already been removed, including more than 3,000 from the Centers for Disease Control and Prevention (CDC), 3,000 from the Census Bureau, and an environmental justice mapping tool hosted by the Environmental Protection Agency (EPA). The phenomenon prompted the creation of a dedicated Wikipedia page to track the latest developments.
The Research Community Reacts
Widespread removal of federal data has galvanized a historic mobilization of data scientists, librarians, internet experts, and open-government activists. Numerous organizations and coalitions have sprung up to monitor government web pages, secure data, and help researchers access datasets. One of the most active is the Data Rescue Project, which serves as a clearinghouse for data rescue–related efforts. Another coalition, called Public Environmental Data Partners (PEDP), recreated and relaunched the EPA’s missing Climate and Economic Justice Screening Tool, among other efforts.
Meanwhile, other activists have taken the administration to court. The nonprofit advocacy group Doctors for America sued over pages removed from the CDC website. A temporary restraining order, issued on February 11 by US District Judge John Bates, resulted in the restoration of some of them. Legal challenges have also been filed by the Union of Concerned Scientists and the Sierra Club, among others.
Guerrera-Sapone, a participant in the volunteer efforts of the Data Rescue Project, says that the Wesleyan Library felt compelled to respond to the crisis. “It’s an important part of the Library’s mission to preserve and provide access to information, and it was clear very quickly that this was a threat to that mission, which needed to be addressed,” he says. In February, the Library posted a LibGuide resource page to educate Wesleyan faculty, students, and staff about disappearing data and related rescue efforts. The guide also encourages Wesleyan researchers to use the Library’s local data repository. A prominent message on the page reads: “If you see something, save something.”
Guerrera-Sapone says what scares him is not simply the fact that data is being removed but that it is being manipulated without the alterations being documented. A Lancet article published in July presented an analysis of changes to 232 datasets from the CDC, the Department of Health and Human Services, and the Department of Veterans Affairs, nearly half of which had been found to be “substantially altered.” The vast majority reflected a change in language, the replacement of the word “gender” with “sex.” But the authors note that respondents might answer a question differently depending on the term, so even changing column headers and categories of responses affects the interpretation of the data.
The article also notes that in some cases modifications to datasets were not logged, contrary to customary practice, meaning that researchers trying to replicate a study would not even know that the source data had been altered.
Says Guerrera-Sapone, “This is a huge blow to the integrity of research, which relies on attribution and replication to verify claims and analyses. I worry that this could result in a chilling effect on the usage of open federal data in all research moving forward, and there is no replacement for the breadth of data collection that the government does.”
The Promise of Data vs. the Curation of Ignorance
The work of Gooyabadi and her students highlights the great irony of the new era of data suppression: It is happening at a time when more useful information can be gleaned from data than ever before. As a social scientist with expertise in methodology and data science, Gooyabadi seeks to apply advanced machine learning, computational modeling, and complex system analysis to social phenomena. Her students are learning how to use emerging AI tools to extract and analyze data in ways that were previously impossible or prohibitively time-consuming. Today, with the right methods, we can use data to learn what we most need to know as a society.
One of the key datasets that Gooyabadi’s students downloaded was the K–12 School Shooting Database, located on the website of the Naval Postgraduate School in Monterey, California. Rich in detail about more than 2,000 incidents since 1970 involving firearms on school grounds, it included information about the shooters, their motives, the guns used, whether the shooting was premeditated, the duration of the shooting, and more, as well as the specifics about victims and a rating for the scope and accuracy of the media coverage.
This dataset was central to the research of two students in a project-based course called Visualizing Firearms History. One student was examining potential predictive factors related to the shooters. The other explored characteristics and patterns among the victims. Both projects were to be presented at the undergraduate research conference of the Center for the Study of Guns and Society last April. A week before the conference, Gooyabadi asked the students to take a screenshot of the dataset’s landing page—it would provide a good visual for the slide presentation, as well as proper attribution of their source material. But when they clicked the link for that page, what appeared instead was a white screen with three words at the top: “Content not found.” The dataset had vanished.
Today, Gooyabadi and others at Wesleyan and beyond are working to make that dataset—and around three gigabytes of firearms-related data in total—available online for use by other researchers.
If you asked her a year ago, Gooyabadi would have casually defined data as “a collection of information that’s gathered and made ready for analysis.” But now she feels compelled to share with students a more nuanced definition—one she explicitly spells out on her syllabi: “Data is curated information that reflects particular worldviews and serves particular people.”