Information Systems
College of Business Administration
University of Missouri - St. Louis

Data Scrubbing

. Why do data need to be scrubbed?
Data come from multiple sources: internal systems, external providers, and even individual desktops
Data problems:
      Volatility
      Fragmentation across systems
      Complexity and age of applications providing data
      Lax rules about data management
      No clear ownership identified
      No definitions of data quality

. Activities in Data Scrubbing
Identification of misspellings, typographical errors, transpositions, and variations in spelling
     Vikci, Vikki, Vickie, Vicky, Vcki all transformed to Vicki
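
One possible way to collapse such variants is approximate string matching against a list of approved spellings. The sketch below uses Python's standard difflib module; the list CANONICAL_NAMES, the function canonical_name, and the 0.6 cutoff are assumptions made for illustration, not part of any particular scrubbing tool.

```python
import difflib

# Hypothetical list of approved spellings; a real list would come from
# a corporate master file.
CANONICAL_NAMES = ["Vicki"]

def canonical_name(raw, cutoff=0.6):
    """Return the closest approved spelling, or the input unchanged if none is close."""
    candidate = raw.strip().title()
    matches = difflib.get_close_matches(candidate, CANONICAL_NAMES, n=1, cutoff=cutoff)
    return matches[0] if matches else raw

variants = ["Vikci", "Vikki", "Vickie", "Vicky", "Vcki"]
cleaned = [canonical_name(v) for v in variants]
```

A cutoff that is too low would merge names that are genuinely different, so in practice the threshold would need tuning against real data.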

Identification of records not using corporate standards for coding
     Degree programs of Management Information Systems or MIS transformed to IS
     Are 314.354-1624, 314-354-1624 and (314) 354-1624 all the same telephone number?
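
Both examples above amount to mapping many surface forms onto one standard form. A minimal sketch, assuming 10-digit US telephone numbers and a hand-built lookup table (PROGRAM_CODES, standard_program, and normalize_phone are hypothetical names):

```python
import re

# Hypothetical mapping from free-form program names to the corporate code.
PROGRAM_CODES = {
    "management information systems": "IS",
    "mis": "IS",
    "is": "IS",
}

def standard_program(raw):
    """Map a degree program label to its standard code, or None if unknown."""
    return PROGRAM_CODES.get(raw.strip().lower())

def normalize_phone(raw):
    """Reduce a 10-digit US phone number to the form NNN-NNN-NNNN."""
    digits = re.sub(r"\D", "", raw)  # drop every non-digit character
    if len(digits) != 10:
        return None  # leave for human review rather than guess
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

phones = ["314.354-1624", "314-354-1624", "(314) 354-1624"]
normalized = {normalize_phone(p) for p in phones}
```

Under these assumptions the three telephone formats reduce to a single value, which answers the question in the example: they are the same number.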

Identification of poorly documented data

Identification and removal of duplicate records
     Vicki Sauter .... 314-516-6281
     Vicki Sauter .... 314-516-6281
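
Exact duplicates such as these can be removed by keeping only the first record seen for each key. The following is a minimal sketch that assumes each record is a (name, phone) pair; the function name drop_exact_duplicates is hypothetical.

```python
def drop_exact_duplicates(records):
    """Keep the first occurrence of each (name, phone) pair, preserving order."""
    seen = set()
    unique = []
    for name, phone in records:
        # Ignore differences in letter case and in spacing within the name.
        key = (" ".join(name.split()).lower(), phone)
        if key not in seen:
            seen.add(key)
            unique.append((name, phone))
    return unique

records = [
    ("Vicki Sauter", "314-516-6281"),
    ("Vicki Sauter", "314-516-6281"),
]
deduplicated = drop_exact_duplicates(records)
```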

Identification and tagging of similar records for subsequent human review
     Vicki Sauter .... 314-516-6281
     Vicki L. Sauter .... 314-516-6281
     Dr. Sauter .... 314-516-6281
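
Similar records differ from exact duplicates in that software should flag them rather than merge them. One simple heuristic, offered here only as an assumption, is to group records by a shared field such as the telephone number and tag any group that contains more than one distinct name. The function name tag_for_review is hypothetical.

```python
from collections import defaultdict

def tag_for_review(records):
    """Return {phone: [names]} for phone numbers shared by differing names."""
    names_by_phone = defaultdict(set)
    for name, phone in records:
        names_by_phone[phone].add(name)
    # Only groups with more than one distinct name need a human decision.
    return {
        phone: sorted(names)
        for phone, names in names_by_phone.items()
        if len(names) > 1
    }

records = [
    ("Vicki Sauter", "314-516-6281"),
    ("Vicki L. Sauter", "314-516-6281"),
    ("Dr. Sauter", "314-516-6281"),
]
flagged = tag_for_review(records)
```

The final decision stays with a person, since a shared telephone number may also belong to colleagues in one office.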

The removal of spurious and invalid records
     Vicki Sauter .... 314-261-3442

Data validation (for example, using a ZIP code checker to confirm that addresses are correct)
     Is 60651 legitimate for Chicago, IL?
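
A ZIP code checker can be sketched as a lookup against a reference table. The table ZIP_TO_PLACE below is a one-row illustrative sample, not an authoritative source; a real checker would consult a complete postal database. The function name check_zip and its three return values are assumptions.

```python
# Illustrative sample only; replace with an authoritative postal reference.
ZIP_TO_PLACE = {
    "60651": ("Chicago", "IL"),
}

def check_zip(zip_code, city, state):
    """Return 'ok', 'mismatch', or 'unknown' for a ZIP/city/state combination."""
    place = ZIP_TO_PLACE.get(zip_code.strip())
    if place is None:
        return "unknown"  # not in the reference table; cannot be judged
    expected_city, expected_state = place
    supplied = (city.strip().lower(), state.strip().upper())
    if supplied == (expected_city.lower(), expected_state):
        return "ok"
    return "mismatch"

result = check_zip("60651", "Chicago", "IL")
```

Returning "unknown" separately from "mismatch" keeps gaps in the reference table from being reported as address errors.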

The removal of obsolete data
     Vicki Sauter .... 312-261-3442

The comparison and removal of records matching third-party information

Enrichment of data with product attributes, images, and manufacturer specifications

Identification of missing or incomplete data
     Vicki Sauter ....
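
Missing or incomplete data can be found by checking each record against a list of required fields. A minimal sketch, assuming records are dictionaries and that name and phone are the required fields (REQUIRED_FIELDS and missing_fields are hypothetical names):

```python
# Hypothetical list of fields every record is expected to carry.
REQUIRED_FIELDS = ("name", "phone")

def missing_fields(record):
    """List the required fields that are absent, None, or blank in a record."""
    return [
        field
        for field in REQUIRED_FIELDS
        if not (record.get(field) or "").strip()
    ]

incomplete = missing_fields({"name": "Vicki Sauter", "phone": ""})
```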


. A case for data scrubbing.


. Data Scrubbing


. Dylan's BI study notes on data scrubbing


. What is data scrubbing





Page Owner: Professor Sauter (Vicki.Sauter@umsl.edu)
© Vicki L. Sauter. All rights reserved.