Homepage | IPUMS

U.S. Census and American Community Survey microdata from 1850 to the present. Learn More about the IPUMS USA project

Current Population Survey microdata including basic monthly surveys and supplements from 1962 to the present. Learn More about the IPUMS CPS project

World's largest collection of census microdata covering over 100 countries, contemporary and historical. Learn More about the IPUMS International Project

Health survey data from around the world, including harmonized data collections for DHS, MICS, and PMA. Learn More about the IPUMS Global Health Projects

U.S. Census summary tables and GIS data from 1790 to the present. Learn More about the IPUMS NHGIS Project

Summary tables and GIS data from population, housing, and agricultural censuses around the world. Learn More about the IPUMS IHGIS Project

Historical and contemporary time use data from 1930 to the present. Learn More about the IPUMS Time Use Projects

Historical and contemporary U.S. health survey data from NHIS (1963-present) and MEPS (1996-present). Learn More about the IPUMS Health Surveys Projects

Survey data on the science and engineering workforce in the U.S. from 1993 to the present. Learn More about the IPUMS Higher Ed Project

Does 1 + 2 = 8? Automating QA/QC for Tabular Data

The problem with OCR and numbers

To extract data tables from census reports only available as print documents, IPUMS IHGIS uses optical character recognition (OCR) software to automate the conversion of scanned images into digital representations of letters and numbers. OCR software has made great strides in accuracy for textual information by using dictionaries of known words to interpret uncertain letters. However, dictionaries do not help in distinguishing uncertain numerical digits. While a dictionary can suggest that the third character in “wh_t” should be an ‘a’ and not an ‘o’, there is no simple way to tell whether the third digit in “45_” should be a 3 or an 8. To ensure that IHGIS data are accurate, we must have confidence that each number has been recognized correctly and matches the number in the source document.

To address this gap, we developed an R package that leverages IHGIS structured metadata to identify logical relationships between cell counts and row/column totals and determine where cells don’t add up as expected. Often, a given cell participates in multiple relationships, which allows the package to use patterns among discrepancies to pinpoint and correct errors. The package can automatically identify and correct up to 95% of error cells, depending on the structure of relationships.
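
The idea of using overlapping row and column totals to locate a misread digit can be illustrated with a small sketch. This is not the IHGIS R package or its algorithm; it is a minimal Python illustration under simplifying assumptions: the reported totals were recognized correctly, at most one cell is wrong, and the function name `find_single_cell_error` and the sample table are hypothetical.

```python
from typing import List, Optional, Tuple


def find_single_cell_error(
    cells: List[List[int]],
    row_totals: List[int],
    col_totals: List[int],
) -> Optional[Tuple[int, int, int]]:
    """Return (row, col, corrected_value) if exactly one cell explains
    the discrepancies, otherwise None.

    Assumes the reported totals are correct and at most one cell is wrong.
    """
    # Discrepancy = sum of recognized cells minus the reported total.
    row_diffs = [sum(row) - total for row, total in zip(cells, row_totals)]
    col_diffs = [sum(col) - total for col, total in zip(zip(*cells), col_totals)]

    bad_rows = [i for i, d in enumerate(row_diffs) if d != 0]
    bad_cols = [j for j, d in enumerate(col_diffs) if d != 0]

    # One failing row and one failing column with the same discrepancy
    # point to the cell at their intersection.
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        if row_diffs[i] == col_diffs[j]:
            return i, j, cells[i][j] - row_diffs[i]
    return None


# Hypothetical table in which OCR read "453" as "458" in row 0, column 1.
ocr_cells = [
    [120, 458, 77],
    [64, 210, 31],
    [18, 95, 12],
]
row_totals = [650, 305, 125]  # totals as printed in the source table
col_totals = [202, 758, 120]

result = find_single_cell_error(ocr_cells, row_totals, col_totals)
print(result)  # (0, 1, 453)
```

Here the first row and the second column each exceed their printed totals by 5, so the cell where they intersect is flagged and its value is adjusted by that amount. Real tables involve more kinds of relationships (subtotals, nested categories, errors in the totals themselves), which this sketch does not attempt to handle.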

IPUMS is certified by CoreTrustSeal based on the Core Trustworthy Data Repositories Requirements. Learn more about the certification at coretrustseal.org.

Help Power IPUMS

Data-Intensive Research Conference

Association of Health Care Journalists (AHCJ)

American Society of Health Economists (ASHEcon)

2026 Data-Intensive Research Conference

Jobs

The IPUMS family of Projects
