Federal regulations indicate that tax-exempt (nonprofit) organizations may not wildly overpay their employees and officers (“excess benefit transactions”) without incurring tax penalties. The excessiveness of a payment is measured against the fair market value of goods or the “reasonableness” of compensation, namely “the amount that would ordinarily be paid for like services by like enterprises (whether taxable or tax-exempt) under like circumstances.” Payments by tax-exempt organizations are presumed to be reasonable if the terms are approved in advance by an “authorized body” (such as the board or a committee of the board), the organization references data about compensation for comparable services and organizations, and the process used to determine the terms is appropriately documented. The Internal Revenue Service (IRS) can rebut this presumption of reasonableness with sufficient evidence that the payments were excessive.
Download the nonprofit compensation database.
All of that is to say that nonprofit organizations will often find themselves in need of data about what other organizations are paying their employees, particularly their highest-paid employees, such as the executive director or chief executive officer. How are organizations supposed to find this information? One particularly useful source is the publicly available financial reports that tax-exempt organizations have to file with the IRS each year (Form 990), which includes the compensation of the organization’s officers and highest paid employees. Let’s make a compensation database!
The IRS releases Form 990 data in an XML format for organizations that e-filed their returns. According to IRS summary data, 83% of Form 990 submissions and 62% of Form 990-EZ submissions were electronic in 2019, so the XML data cover a sizable proportion of records for tax-exempt organizations. Coverage is likely weaker for organizations with smaller budgets, which are more likely to file 990-EZ. However, coverage will likely improve in the coming years, as the 2019 Taxpayer First Act requires e-filing for tax-exempt organizations for tax years beginning after July 1, 2019.
There are three versions of Form 990: the full form, an abbreviated (“EZ”) form, and a form for private foundations. (There is also a “postcard” form for small organizations, but those forms aren’t currently included in the XML repository.) The forms each put our information of interest in slightly different places. Moreover, the IRS has changed the XML format over time, most notably in 2013. So, we’ll need to standardize the compensation data to make it easier to handle. Jacob Fenton has helpfully compiled a table of the field descriptions. I extracted the fields of interest for the compensation database and labeled each with a target variable name.
Rather than working entirely in memory, we’ll process each file individually, writing out the results to disk, and then load in the compiled data at the end. This will prevent having to constantly reallocate and copy large data tables.
Preparation
To get going, we first load the necessary packages (xml2, httr, and htmltools) and create a couple of helper functions to handle missing values. We’ll read in the mapping of XML elements from the various Form 990 versions to consolidated variable names, and set a startdate, which could be used to limit processing only to recently released records (e.g., to update the database later as the IRS makes more data available).
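The original setup code isn’t reproduced here, but a minimal sketch might look like the following. The mapping file name, its column names (Variable, Element, Group), and the exact behavior of the helper functions are assumptions based on the description above.

```r
library(xml2)
library(httr)
library(htmltools)

# Drop missing or empty elements (assumed behavior of condense())
condense <- function(x) x[!is.na(x) & x != '']

# Replace missing values with empty strings for output (assumed behavior)
naToEmpty <- function(x) { x[is.na(x)] <- ''; x }

# Mapping of XML elements to consolidated variable names
# (file name and column names are assumptions)
vars <- read.csv('990_variable_map.csv', stringsAsFactors = FALSE)

# Could be used to limit processing to recently released records
startdate <- as.Date('2021-01-01')
```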
Bucket listing
The URL in baseurl asks the Amazon S3 API to list the objects in the IRS Form 990 bucket. We specify prefix=2 to exclude the index files (the identifiers for the XML files start with the year in which the form was submitted to the IRS). If we knew that we only wanted returns submitted starting in a given year, we could also include the start-after parameter, such as “&start-after=2019” for forms submitted to the IRS in 2019 or later.
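A sketch of that URL, assuming the public bucket is named irs-form-990 and using the S3 ListObjectsV2 (list-type=2) API:

```r
# ListObjectsV2 request against the public IRS Form 990 bucket;
# prefix=2 skips the index files, whose keys don't start with a year
baseurl <- 'https://irs-form-990.s3.amazonaws.com/?list-type=2&prefix=2'

# Optionally restrict the listing to forms submitted in 2019 or later:
# baseurl <- paste0(baseurl, '&start-after=2019')
```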
Intermediate output files
We’ll create two output files: The “submission records file” will include a record for each Form 990 submission; the “compensation records file” will include a row for each person listed on the Form 990—directors, officers, and highly compensated individuals.
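Something along these lines, with hypothetical file names:

```r
# Open the two intermediate output files; rows are appended as we go
subm.file <- file('submissions.csv', open = 'w')   # one row per Form 990 submission
comp.file <- file('compensation.csv', open = 'w')  # one row per listed individual
```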
Mapping
The mapping file includes the XPath for each XML element of interest in the Element field and the target variable name in the Variable field. Many data elements, such as the organization’s name and annual expenses, have only one value for each Form 990 submission, but the elements related to individuals are repeated for each person listed on the form. The Group field identifies XML elements that are repeated in each individual’s block. For the submission records file, we want only one row per Form 990 submission, so we subset to those XML elements that don’t belong to a group (vars$Group == ''). The rec.elems variable holds the XPaths for the elements of interest; the rec.map variable holds the corresponding consolidated variable names; and rec.vars provides the order of the variables for the output file.
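A sketch of how those objects could be built from the mapping table, under the column-name assumptions above:

```r
# Elements with a single value per submission (not part of a per-individual block)
rec <- vars[vars$Group == '', ]

rec.elems <- rec$Element            # XPaths of the elements of interest
rec.map   <- rec$Variable           # consolidated variable names, aligned with rec.elems
rec.vars  <- unique(rec$Variable)   # column order for the submission records file
```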
Default namespace
The xml2 package names all default namespaces with d1, d2, etc. (see xml_ns()), so we need to add the explicit namespace “d1:” to the XPath elements in rec.elems. The call to gsub() looks for all forward slashes (/) that are not followed by an attribute name (signified by @) and replaces them with “/d1:”. We also prepend the path with an extra slash to indicate that the path does not need to start from the root node (a Return element).
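A sketch of that transformation (the example XPath is made up for illustration):

```r
# Insert "d1:" after every "/" that is not followed by an attribute (@...),
# then prepend an extra "/" so the XPath can match anywhere below the root
rec.elems <- paste0('/', gsub('/(?!@)', '/d1:', rec.elems, perl = TRUE))

# e.g. "/Return/ReturnHeader/Filer/EIN" becomes
#      "//d1:Return/d1:ReturnHeader/d1:Filer/d1:EIN"
```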
Compensation record blocks
The process is similar for the compensation records file, except that we have a mapping for each of the different types of per-individual blocks (comp.groups), so comp.elems and comp.map are lists indexed by the XPath of the block’s parent node.
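Roughly, again assuming the mapping table’s column names and that each Element path can be reduced to a child name relative to its Group parent:

```r
# One mapping per type of per-individual block, indexed by the parent XPath
comp.groups <- setdiff(unique(vars$Group), '')

# Child element names within each block (assuming Element holds the full path,
# so we strip the Group prefix) and the corresponding target variable names
comp.elems <- lapply(comp.groups, function(g)
  sub(paste0('^', g, '/'), '', vars$Element[vars$Group == g]))
comp.map   <- lapply(comp.groups, function(g) vars$Variable[vars$Group == g])
names(comp.elems) <- names(comp.map) <- comp.groups

# Column order for the compensation records file
comp.vars <- unique(unlist(comp.map))

# Note: the parent XPaths in comp.groups need the same "d1:" treatment as
# rec.elems before being passed to xml_find_all()
```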
Extraction
Listing pagination
The Amazon S3 API paginates bucket listings by 1,000 objects, so we’ll have to request a page of results, process those files, and then request the next page of results, etc. The next page of results is indicated by a “continuation token” (NextContinuationToken in the listing results data structure) appended to the bucket listing API call.
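A sketch of the pagination loop; for brevity it reads the listing URL directly with read_xml() and uses xml_ns_strip() to drop the S3 response’s default namespace (the original code may differ):

```r
token <- NULL
repeat {
  # Request one page of (up to 1,000) object keys
  url <- if (is.null(token)) baseurl else
    paste0(baseurl, '&continuation-token=', URLencode(token, reserved = TRUE))
  listing <- xml_ns_strip(read_xml(url))

  keys <- xml_text(xml_find_all(listing, '//Contents/Key'))
  # ... fetch and parse each key here (see below) ...

  # NextContinuationToken is absent on the last page of results
  token <- xml_text(xml_find_first(listing, '//NextContinuationToken'))
  if (is.na(token)) break
}
```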
Error handling
Since there are over 3.5 million files to process, we wouldn’t want the script to die in the middle due to an HTTP error, so we’ll protect the fetching and parsing of each XML file in a tryCatch() block. If there is an error downloading a file, we add the key to err for retrying later.
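For each key, the protected fetch-and-parse might look like this sketch (the per-object URL form is an assumption):

```r
err <- character(0)  # keys that failed, for retrying later

# ... inside the loop over keys ...
doc <- tryCatch({
  res <- GET(paste0('https://irs-form-990.s3.amazonaws.com/', key))
  stop_for_status(res)                 # turn HTTP errors into R errors
  read_xml(content(res, as = 'raw'))   # parse the Form 990 XML
}, error = function(e) {
  err <<- c(err, key)                  # remember the key for a later retry
  NULL
})
if (is.null(doc)) next                 # skip to the next key
```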
Submission record extraction
We call xml_find_first() for each element in rec.elems to extract the submission record data, storing the results (either the data element or NULL if the element wasn’t in the file) in the vals vector. Then, we rename the vals elements with their target variable names (setNames(vals, rec.map)) and remove the empty elements (condense()). We then order the results according to the column order for our output file ((...)[rec.vars]) and replace missing values with empty strings (naToEmpty()) before outputting the submission record.
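Putting those steps together (a sketch; real code would also need to quote or escape fields before writing comma-separated output):

```r
# One value (or NA) per single-valued element of interest
vals <- sapply(rec.elems, function(e) xml_text(xml_find_first(doc, e)))

# Rename to the target variable names, drop empty elements, impose the
# output column order, and blank out anything still missing
vals <- naToEmpty(condense(setNames(vals, rec.map))[rec.vars])

# Append one row to the submission records file
writeLines(paste(vals, collapse = ','), subm.file)
```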
Individuals by block type
The data about individuals (directors, officers, and highly compensated employees) is stuffed into different elements in the various versions of Form 990 (and even in different sections of Form 990-EZ). To extract the data about individuals, we’ll iterate over each block type (comp.groups).
Extract individuals
We start by finding all of the nodes that correspond to block type g (xml_find_all()). We convert each node to a list of individual data elements (as_list(p); note that this is xml2::as_list(), not base::as.list()). This list of individual data elements is actually a list of lists: each child of the individual block could, in principle, have multiple children. However, in the context of this particular XML schema, we know that each child of the individual block will have only a single value. So, we want to collapse the list of lists so that each element of the resulting individuals list vals is a simple named vector of the individual’s compensation elements. To collapse the list of lists, we call getElement() on each of the sublists to extract the first (and only) element, allowing sapply() to combine the results into a simple named vector. At the end of this line, vals is a list of compensation data vectors, one for each individual.
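For a single block type g (one of the parent XPaths in comp.groups, after the d1: namespace treatment described above), this step might look like:

```r
# All blocks of this type in the current document
people <- xml_find_all(doc, g)

# Collapse each person's list-of-lists into a simple named vector: each child
# holds exactly one value, so getElement(..., 1) extracts it and sapply()
# combines the children into a named character vector
vals <- lapply(people, function(p) sapply(as_list(p), getElement, 1))
```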
Element to variable mapping for individuals
Next, we want to rename the data elements for each individual from the XML element names to our target variable names and reorder these variables for output to the compensation records file. For each individual’s named vector (p), we reorder the elements according to our canonical element order for this block (g) from the mapping file (p[ comp.elems[[g]] ]) and rename them with our target variable names (setNames(..., comp.map[[g]])). Then, we reorder the elements again according to our canonical column ordering for the compensation records file ((...)[comp.vars]) and replace missing values with empty strings (naToEmpty()). At the end of this line, vals is still a list of compensation vectors, one for each individual, but the vectors have been reorganized into the standard column order.
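Continuing the sketch:

```r
# Reorder to the canonical element order for block type g, rename to the
# target variable names, impose the output column order, and blank out
# anything missing
vals <- lapply(vals, function(p)
  naToEmpty(setNames(p[comp.elems[[g]]], comp.map[[g]])[comp.vars]))

# Append one row per individual to the compensation records file
writeLines(vapply(vals, paste, character(1), collapse = ','), comp.file)
```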
Clean up
To finish up, we close the two intermediate output files and write out the vector of files that encountered HTTP errors for follow-up later.
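For example (the error-list file name is hypothetical):

```r
close(subm.file)
close(comp.file)

# Keys that hit HTTP errors, for a follow-up pass
writeLines(err, 'retry_keys.txt')
```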
We read the data back in using the data.table package, which is much more efficient than the usual data.frame for very large datasets.
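For instance, with the intermediate file names assumed above:

```r
library(data.table)

# fread() is far faster than read.csv() for files with tens of millions of rows
subm <- fread('submissions.csv', colClasses = 'character')
comp <- fread('compensation.csv', colClasses = 'character')
```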
Amended returns
When an organization files an amended Form 990, the IRS releases both the original and the amended returns. For this database, we only want to keep the most recent (amended) data. So, we’ll sort the dataset by the submission Timestamp and select the most recent record for each organization (EIN) and tax year.
Parse ISO 8601 date
The dates in the XML file are in ISO 8601 format, which requires a little munging before we can parse them with as.POSIXct(). Then we order by the parsed timestamp ts and select the last (i.e., most recent) record (.SD[.N]) for each organization and tax year (by=.(EIN, TaxYear)).
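A sketch of the munging and de-duplication, assuming the timestamps carry a UTC offset such as -05:00 (as.POSIXct()’s %z expects -0500, without the colon):

```r
# Drop the colon from the UTC offset ("-05:00" -> "-0500") so %z can parse it
subm[, ts := as.POSIXct(sub('([+-][0-9]{2}):([0-9]{2})$', '\\1\\2', Timestamp),
                        format = '%Y-%m-%dT%H:%M:%S%z')]

# Keep only the most recent submission for each organization and tax year
subm <- subm[order(ts), .SD[.N], by = .(EIN, TaxYear)]
```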
Finally, we can output the final database. To make handling the data a little easier, particularly for users who only want data from certain years, we’ll output a file for each tax year separately.
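For instance (the output file-name pattern is illustrative):

```r
# One submissions file per tax year
for (yr in sort(unique(subm$TaxYear))) {
  fwrite(subm[TaxYear == yr], paste0('990_submissions_', yr, '.csv'))
}
```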
Filtering compensation records
We filtered the submission records to select the most recent submission explicitly. For the compensation records, we’ll use the retained submission records as our guide, performing an inner join using the X[Y, nomatch=0] idiom (see ?"[.data.table").
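A sketch of that join; here ObjectId is a stand-in for whatever column uniquely identifies a submission in both files (the actual key column isn’t shown in the text):

```r
# X[Y, nomatch = 0] keeps only the rows of X that match a row of Y (an inner join)
comp <- comp[subm[, .(ObjectId)], on = 'ObjectId', nomatch = 0]
```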
Performance
Amazon EC2
Since the IRS XML files are stored in an Amazon S3 bucket in the us-east-1 (Northern Virginia) region, I figured the data transfer would go a lot faster if I processed the data on an Amazon EC2 instance in the same region. (No need to pull the data halfway across the country!) I used an m6g.medium instance, built with the new custom-to-Amazon Arm-based Graviton2 processors.
Throughput
As of early June 2021, the IRS had released 3,534,226 Form 990 records, which included 33,369,586 listed individuals (directors, officers, highly compensated employees), yielding a 2.5 GB dataset. Data processing took 65.4 hours (2.7 days), at a rate of 15 submissions per second. Throughout, the CPU was at about 38% utilization, suggesting that processing was likely network-I/O bound, even given the speed advantages of data transfers within the same data center. I probably could have achieved a 30% speed up by breaking the download and parsing code into two separate threads (so that the download of the next file could continue during parsing of the last), but the extra coding effort probably wouldn’t have been worth it.
Memory management
I originally attempted this process using the XML package, rather than the xml2 package. Both are based on the libxml2 library, but it turns out that the XML package leaks memory like crazy, even using XML::free() to (ostensibly) release internal parser structures. With 3.5 million documents to parse, such rampant memory mismanagement was a nonstarter (my initial run ground to a halt after about 4 hours on a machine with 4 GB of RAM). Processing went much more smoothly with xml2. Memory usage was about 6 times larger at the end of the 2.7-day run than at the beginning, suggesting that there was likely still a leak somewhere, but total usage stayed under 1 GB for the entire process, so the minor leak was workable.