Calling out the FOIA metadata problem

Fred Trotter

November 21, 2018

The Freedom of Information Act has allowed the public to free copious amounts of health data from the U.S. government in addition to solving countless transparency issues. At CareSet, which releases DocGraph Datasets, we are especially focused on using FOIA to increase healthcare transparency.

The DocGraph Journal was founded on this principle of transparency – most notably through our Hop Teaming Dataset – which shows how Medicare patients move through the healthcare system. ProPublica similarly opened up information in the Medicare Part D benefit program through a FOIA request, which shows the prescribing patterns of Medicare providers. These two datasets gave researchers and analysts newfound visibility into the entire U.S. healthcare system. Those datasets along with other FOIA datasets that followed greatly impact the way healthcare is delivered today.

We have seen the value generated by FOIA requests for healthcare data. Our journalists at DocGraph wanted to understand the data stream created by HHS FOIA offices (such as CMS, FDA, and the CDC). These offices already publish FOIA logs, where they list the FOIA requests answered by their office. Unfortunately, these logs rarely specify how much data was delivered to requesters. It is nearly impossible to look at these logs and spot when “the next big healthcare dataset” was requested and delivered. Additionally, it’s difficult to understand what type of data is being released by FOIA offices at all.

This is part of a common strategy of studying other FOIA requests to gain insights on what might be valuable to FOIA. Journalists frequently do this to better understand what kind of information might be available inside a government agency. If we see small and interesting data being released from the FOIA office, we might attempt to ‘scale’ that FOIA request into a larger release. If we see large and potentially valuable datasets, we could replicate them by submitting a similar FOIA and asking our friends to do the same. This is an important mechanism for translating data released under FOIA into open datasets. Under what is known as ‘eFOIA’ regulations, once a similar FOIA request is made 3 or more times, the FOIA office must publish the data/documents online.

Most FOIA requests cover documents, but from our perspective, data requests are much more important. Especially when these datasets relate to some aspect of the healthcare system, we want to ensure that this data is widely available. To that end, we submitted a series of FOIA requests that were designed to improve the way the FOIA offices release datasets.

We then submitted FOIA requests to all HHS operational agencies to get more information on their FOIA logs, specifically which fulfilled requests resulted in data rather than documents. We wanted to know how large the released files were, so we asked for the size of each file and, where applicable, the number of lines of text it held (data is usually released as CSV files, which typically contain many newline characters). File size and line count are both descriptive information about a file, also known as metadata.

Our requests were met with immediate resistance by most of the FOIA agents. Of the offices that responded to the request, all but one claimed either that they did not keep records of file size or row count, or that a search had returned no responsive results. CMS claimed that they did the search but found nothing. Many CMS requests shown in the agency’s available FOIA logs are clearly requesting data, so this is a confusing response. Here are some of the answers we received, via email or through official letters (underlines ours):

HRSA: Please note that this office does not keep records relative to the amount of data released to requestors; The FOIA does not require agencies to create new records or to conduct research, analyze data, or answer questions when responding to requests.

SAMHSA: As you probably know, one thing that we are not required to do in response to FOIA requests is create records. The records you are seeking do not exist. Therefore, I cannot provide them to you.

CDC/ATSDR: Our office does not track that information and thus there is not an existing government record that contains such a data compilation, nor is there an automated query that can be constructed to produce this information. Under the Act, agencies are not required to create records or conduct research in order to respond to a FOIA request. The information you seek would require staff to conduct research on all files provided in response to FOIA requests over an extended period of time, compile all data on file size, and create a new record that includes information on each request and the amount of data produced in response to it. Such efforts fall outside what is required by agencies under the Act.

CMS: After a careful search of the Centers for Medicare & Medicaid Services (CMS) files, i.e., a search reasonably calculated to locate records responsive to your request and employing reasonable standards, we were unable to locate any records responsive to your request.

FDA: After various discussions with the group that created and maintains our FOIA database, we have concluded that we do not have the records you are seeking.

The two pieces of metadata we were asking for (file size and file line count) can be queried easily on any modern operating system using simple command-line tools. We were unsuccessful in communicating that any computer file system is also a minimal database, capable of responding to simple queries with tools that are available out of the box.

Almost universally, they reacted by saying that this would be creating new data, or by not responding to our points at all and instead advising us to take it up with the FOIA public liaison.

Essentially, the FOIA offices held that “querying the computer file system in the manner you would a database” was NOT part of the “search” (which they are required to do) but instead was “creating new records” or “research”, which they are not required to do. In effect, their response said: “If the FOIA officer who receives the FOIA request is sufficiently ignorant of how the computer holding a specific record works, it is acceptable to respond that the records do not exist. Because of the officer’s ignorance, they do not exist from the perspective of that officer.”

An example of how the FOIA offices could use their computer’s terminal to find the row count of files within a directory
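
A minimal sketch of such a query follows, assuming a POSIX shell and assuming the released files sit together in one directory. The function name `file_metadata` is hypothetical, chosen here for illustration; the only external tools it relies on are `wc`, `tr`, and `basename`.

```shell
# Hypothetical helper: print "bytes<TAB>lines<TAB>name" for each regular
# file directly inside the directory given as the first argument.
file_metadata() {
    dir="$1"
    for f in "$dir"/*; do
        # Skip subdirectories and an unmatched glob.
        [ -f "$f" ] || continue
        # wc -c reports the size in bytes; wc -l counts newline characters.
        # tr strips the padding some wc implementations add.
        bytes=$(wc -c < "$f" | tr -d ' ')
        lines=$(wc -l < "$f" | tr -d ' ')
        printf '%s\t%s\t%s\n' "$bytes" "$lines" "$(basename "$f")"
    done
}
```

Calling `file_metadata` with the path of a directory of released files prints one tab-separated line per file: size in bytes, line count, and file name. `ls -l` shows the same byte size in its listing; `wc -c` is used here only because its output is easier to parse. Because `wc -l` counts newline characters, a file whose last row lacks a trailing newline reports one fewer line than it has rows.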

The data and computer illiteracy of these agents prevented them from conducting a valid and simple FOIA search because those searches involved the ‘ls’ (list) command and the ‘wc’ (word count) command. Both of these commands are so commonplace that they have Wikipedia articles. The hours we spent trying to debate with these agencies, and the fact that the requests were likely headed to appeals, made this batch of requests unsustainable. We made these FOIA requests as a public service, with the intention of convincing the agencies involved that they should simply release this additional information along with the FOIA logs they already release to the public. Our focus at DocGraph must remain on healthcare-related datasets, which is why we decided to withdraw the remaining requests and make the issue public.

We have found cases where the courts acknowledge the validity of searching databases, and the results of those searches not counting as a “new record”, as noted in Schladetsch v HUD: “The fact that the agency may have to search numerous records to comply with the request and that the net result of complying with the request will be a document the agency did not previously possess is not unusual in FOIA cases, nor does this preclude the applicability of the Act”. The judge also stated in this case that “Because an electronic search of computer databases does not amount to a creation of new records…it follows that the programming necessary to instruct the computer to conduct the search does not involve the creation of a record.”

Judges also recognize that there are limits on the kind of programming that must go into answering a request. They draw the line, for example, when an agency would have to “undertake programming that would assign new or different values to existing data, replace groups of data with median figures or variables, and collapse and band data into newly defined categories.” (Sander v State Bar of California). Nor would an agency need to create a “database listing” or “a new database or reorganize its method of archiving data” (National Security Counselors v Central Intelligence Agency). However, the FOIA requests DocGraph submitted did not ask for any kind of data manipulation, data creation, or restructured database listing. We simply asked for the raw results from a database query.

This FOIA experience has been disheartening. While there is supposedly a strong movement toward making electronic government records more available to the public, as mandated by the eFOIA amendments, there is also an opposing and obtuse force that may hide the contents of government databases from public view, contents that should rightly be available under FOIA.

Even more interesting are the “perverse practical consequences” of this position. There are at least two.

First, we could simply replicate every FOIA request that is listed in the FOIA logs. This would result in us storing those same fulfilled FOIA requests on our own computers, and we would then use the ‘ls’ and ‘wc’ commands ourselves to get the answers that we seek. Finding and copying every FOIA request is an order of magnitude more work than just running a command line query. This encourages us to create a massive amount of busy work for the FOIA office, just so they can avoid working in an intelligent manner. Further, this would result in us understanding what data is coming out of the FOIA offices… but the FOIA offices themselves still would not know.

But then, we did not create the term “perverse practical consequence”. That is something that a DC court mentioned when siding with the CIA in National Security Counselors v Central Intelligence Agency – stating that they did not have to provide database listings to the Plaintiff:

The Court pauses, however, to note the perverse practical consequences of the CIA’s choice to refuse to provide database listings in response to FOIA requests. The FOIA requires agencies to disclose all non-exempt data points that it retains in electronic databases…and thus although the CIA may not be required to produce an index or database listing in response to a FOIA request, it can be required to hand over the contents of entire databases of information to the extent those contents are not exempt from disclosure…Despite the fact that the CIA can continue to escape the production of database listings under the FOIA if it wishes, the CIA may nevertheless find it more efficient to begin producing such database listings upon request because failing to do so may prompt requesters to seek the reams of data underlying such listings instead.

Note that in our case, because we are only asking for information that the FOIA office has provided, all of the information is definitively FOIA available.

The second “perverse practical consequence” of the agencies’ position in this case lies in the implications of their assertion that a “query is not a search” but “a query produces new records”.

Suppose we wanted to make a FOIA request for a paper record contained in a metal filing cabinet in a specific government office. Assuming the file was “non-exempt”, it would eventually be provided under FOIA. Figuring out which office, filing cabinet, and drawer contained the file would be considered a “search” and a central part of the FOIA office’s job. This would all be true even if the filing cabinet had some other drawer containing a file that was exempt because it held private health information about a person (or fell under any of the other FOIA exemptions).

However, say that file was digitized, and then placed on a specific server, on a specific drive, and in a database. And if that database also contained the digitized version of the exempt file, then retrieving only the file in question might require a database query. If a FOIA office can refuse to run such a query because it would result in “creating” a new record… well, this is equivalent to creating a new FOIA exemption – the “all records in a database” exemption.

Even by navigating an operating system in the traditional way, it is easy to find the size of files and open them to get the row count.

That is obviously not the intention of either the letter or spirit of the law as it is written or as it has been interpreted by the courts. In fact, the courts and lawmakers have frequently discussed the equivalency of file cabinets and databases when it comes to non-exempt data being available, as well as how agencies must use programming to adapt searches for databases instead of file cabinets.

When the court in the above-mentioned CIA case opined that searching and sorting a database does not create a new record, it referred to legislative history to the E-FOIA Amendments where it was noted “Computer records found in a database rather than a file cabinet may require the application of codes or some form of programming to retrieve the information”. Further, an Illinois FOIA case, which used Federal case law in its decision making, stated that the “analogy to file cabinets is helpful: the database is akin to a file cabinet, and the data that populates the database is like the files. FOIA permits a proper request for a single file, some of the files, or all of the files”.

Obviously, this result is very frustrating, and we simply do not have the time or money to sue all of these FOIA offices for clearly breaching their duty under the FOIA law. Essentially, they are asking us to pay for their ignorance of how their computer systems operate. Information already stored on their computers is not FOIA available, because they are not fully aware of how their computers work.

Even as we complain about a specific issue, however, we continue to celebrate HHS’s continued commitment to data transparency. Specifically, the potential for HHS agencies to better track the data within their offices is, fortunately, being addressed by HHS itself. Recently, we blogged about the HHS report that detailed the intra-agency data-sharing obstacles at HHS. The Chief Data Officer of HHS, Mona Siddiqui, and her colleagues are working directly with those who handle some of the most important health datasets in the country so that they may be harnessed effectively. We strongly support their efforts and will be watching how these efforts impact the way the FOIA offices handle their data assets as well.

We should also celebrate the one agency that just gave us the data! On a final positive note, we’d like to thank the Administration for Community Living for answering the meta-FOIA request discussed in this post. ACL was the only agency that fulfilled our request, and they did it fast!

Edit: DocGraph is now referred to as CareSet Journal.

Fred Trotter

Fred shapes our software development and data gathering strategies, which doesn't stop him from getting elbow-deep in the code on a regular basis. He is co-author of the first Health IT O’Reilly book Hacking Healthcare, and co-creator of the DIRECT protocol mandated in Meaningful Use. Fred’s technical commentary and data journalism work has been featured in several online and print journals including Wired, Forbes, U.S. News, NPR, Government Health IT, and Modern Healthcare.
