Data Scrubbing
Why do data need to be scrubbed?
- Data come from multiple sources: internal, external and even desktop
- Data problems:
  - Volatility
  - Fragmentation across systems
  - Complexity and age of applications providing data
  - Lax rules about data management
  - No clear ownership identified
  - No definitions of data quality
Activities in Data Scrubbing
- Identification of misspellings, typographical errors, transpositions, and spelling variations
  - Vikci, Vikki, Vickie, Vicky, Vcki all transformed to Vicki
- Identification of records not using corporate standards for coding
  - Degree programs of Management Information Systems or MIS transformed to IS
  - Are 314.354-1624, 314-354-1624 and (314) 354-1624 all the same telephone number?
- Identification of poorly documented data
- Identification and removal of duplicate records
  - Vicki Sauter .... 314-516-6281
  - Vicki Sauter .... 314-516-6281
- Identification and tagging of similar records for subsequent human review
  - Vicki Sauter .... 314-516-6281
  - Vicki L. Sauter .... 314-516-6281
  - Dr. Sauter .... 314-516-6281
- Removal of spurious and invalid records
  - Vicki Sauter .... 314-261-3442
- Data validation (for example, using a ZIP code checker to confirm that addresses are correct)
  - Is 60651 legitimate for Chicago, IL?
- Removal of obsolete data
  - Vicki Sauter .... 312-261-3442
- Comparison and removal of records matching third-party information
- Enrichment of data with product attributes, images, and manufacturer specifications
- Identification of missing or incomplete data
  - Vicki Sauter ....
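Several of the activities above can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the record layout (name, phone), the function names `normalize_phone` and `scrub`, and the rule that a valid phone number has exactly ten digits are all hypothetical choices, not part of any particular scrubbing tool. It standardizes phone formats, drops exact duplicates, tags records that share a phone number but differ in name for human review, and sets aside records with a missing or malformed phone number.

```python
import re
from collections import defaultdict

# Hypothetical sample records (name, phone), modeled on the examples above.
records = [
    ("Vicki Sauter", "314-516-6281"),
    ("Vicki Sauter", "314.516-6281"),       # same record, different phone format
    ("Vicki L. Sauter", "(314) 516-6281"),  # similar record
    ("Dr. Sauter", "314-516-6281"),         # similar record
    ("Vicki Sauter", ""),                   # missing phone number
]

def normalize_phone(raw):
    """Strip punctuation; return the 10 digits, or None if there are not exactly 10.

    Assumption: US numbers without a country code.
    """
    digits = re.sub(r"\D", "", raw)
    return digits if len(digits) == 10 else None

def scrub(records):
    """Return (unique, review, incomplete) for a list of (name, phone) records."""
    seen = set()
    unique = []
    incomplete = []
    for name, phone in records:
        number = normalize_phone(phone)
        if number is None:
            # Missing or malformed phone: set aside rather than guess.
            incomplete.append((name, phone))
            continue
        key = (name.strip().lower(), number)
        if key in seen:
            continue  # exact duplicate once formats are standardized
        seen.add(key)
        unique.append((name.strip(), number))

    # Records sharing a phone number under different names are only tagged;
    # deciding whether they are the same person is left to a human reviewer.
    by_phone = defaultdict(list)
    for name, number in unique:
        by_phone[number].append(name)
    review = {n: names for n, names in by_phone.items() if len(names) > 1}
    return unique, review, incomplete

unique, review, incomplete = scrub(records)
```

With the sample data, the two "Vicki Sauter" rows collapse into one, the three name variants sharing a number are grouped for review, and the record without a phone number lands in `incomplete`. Correcting misspelled names or checking ZIP codes against cities would need reference data (a name list, a postal database) that this sketch does not assume.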
A case for data scrubbing.
Data Scrubbing
Dylan's BI study notes on data scrubbing
What is data scrubbing
Page Owner: Professor Sauter
(Vicki.Sauter@umsl.edu)
© Vicki L. Sauter. All rights reserved.