February 13th: Defining Data Quality
Post authored by Lora Leligdon
Welcome to Love Your Data week! Each day this week we will be blogging, tweeting, and sharing practical tips, resources, and stories to help you adopt good data practices. Up first, know your data quality!
Data quality is the degree to which data meets the purposes and requirements of its use. Depending on that use, good-quality data may mean complete, accurate, credible, consistent, or simply “good enough” data.
Things to consider:
What is data quality and how can we distinguish between good and bad data? How are the issues of data quality being addressed in various disciplines?
- Data quality refers to the quality of content (values) in one’s data set. For example, if a data set contains names and addresses of customers, every name and address has to be recorded (the data is complete), correspond to the actual name and address (the data is accurate), and be up to date (the data is current).
- The most common characteristics of data quality include completeness, validity, consistency, timeliness, and accuracy. Additionally, data has to be useful (fit for purpose), documented, and reproducible/verifiable.
- At least four activities impact the quality of data: modeling the world (deciding what to collect and how), collecting or generating data, storage/access, and formatting/transformation.
- Assessing data quality requires disciplinary knowledge and is time-consuming.
- Open data quality issues include: how to measure quality, how to track the lineage of data (provenance), when data is “good enough,” what happens when data is mixed and triangulated (especially high-quality with low-quality data), and how to crowdsource quality control.
- Data quality is the responsibility of both data providers and data curators: providers ensure the quality of their individual data sets, while curators help the community maintain consistency, coverage, and metadata.
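The dimensions above (completeness, currency, and so on) can be measured directly. As a minimal sketch, assuming a hypothetical set of customer records stored as dictionaries with `None` marking missing values, completeness and currency can each be reported as a simple fraction:

```python
from datetime import date

# Hypothetical customer records; None marks a missing value.
records = [
    {"name": "Ada Lovelace", "address": "12 St James Sq", "updated": date(2017, 1, 5)},
    {"name": "Alan Turing", "address": None, "updated": date(2015, 6, 1)},
    {"name": None, "address": "1 Dorm Row", "updated": date(2016, 12, 20)},
]

def completeness(rows, fields):
    """Fraction of required fields that are actually filled in."""
    filled = sum(1 for r in rows for f in fields if r.get(f) is not None)
    return filled / (len(rows) * len(fields))

def currency(rows, cutoff):
    """Fraction of records updated on or after the cutoff date."""
    return sum(1 for r in rows if r["updated"] >= cutoff) / len(rows)

print(round(completeness(records, ["name", "address"]), 2))  # 0.67
print(round(currency(records, date(2016, 1, 1)), 2))         # 0.67
```

Scores like these don’t replace disciplinary judgment about *accuracy*, which requires comparing values against the real world, but they make the mechanical dimensions of quality easy to track over time.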
“Care and Quality are internal and external aspects of the same thing. A person who sees Quality and feels it as he works is a person who cares. A person who cares about what he sees and does is a person who’s bound to have some characteristic of quality.”
― Robert M. Pirsig, Zen and the Art of Motorcycle Maintenance: An Inquiry Into Values
- Bad Data Costs the U.S. $3 Trillion Per Year https://hbr.org/2016/09/bad-data-costs-the-u-s-3-trillion-per-year
- Data Quality and Curation http://datascience.codata.org/articles/abstract/10.2481/dsj.GRDI-011/
- Good data are not enough http://www.nature.com/news/good-data-are-not-enough-1.20906
- Bad data issues guide https://github.com/Quartz/bad-data-guide
- Examples of how not to prepare or provide data http://okfnlabs.org/bad-data/
- Data quality assessment (provides a table of various quality dimensions and their definitions): Pipino, L. L., Lee, Y. W., & Wang, R. Y. (2002). Data quality assessment. Communications of the ACM, 45(4), 211. http://doi.org/10.1145/505248.506010
- Want to learn more? Attend the upcoming Dartmouth workshops “Data Management Planning with the DMPTool” and “Data Cleaning with OpenRefine and R” to learn hands-on approaches to ensuring quality data.
- Use criteria for good data (e.g., completeness, accuracy, fitness for use, documentation) to assess where your data stands.
- Discuss your approaches to data collection and measures you took/could take to ensure integrity and completeness of your data.
- Discuss steps to address missing or incomplete data in the context of your research. Does it matter? How much does missing data affect the validity, reliability, or trustworthiness of your conclusions?
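One concrete starting point for the self-assessment above is a validity check: testing whether recorded values conform to an expected format. A small illustrative sketch (the ZIP-code pattern and sample values are hypothetical) computes the fraction of non-missing values that are well-formed:

```python
import re

# Five digits, with an optional ZIP+4 suffix.
ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")

def validity(values, pattern):
    """Fraction of non-missing values that match the expected format."""
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    return sum(1 for v in present if pattern.match(v)) / len(present)

zips = ["03755", "0375", "03755-1234", None, "3755A"]
print(validity(zips, ZIP_RE))  # 0.5
```

Note that validity is checked only over values that are present; missing values are a completeness problem and are better counted separately, so the two issues aren’t conflated in one score.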
Remember to join our conversation on Twitter (#LYD17 #loveyourdata) or share your insights on Facebook (#LYD17 #loveyourdata). Up tomorrow: Documenting, Describing, and Defining your data.
Our daily blog posts are courtesy of the 2017 LYD Week Planning Committee. Learn more at https://loveyourdata.wordpress.com/lydw-2017/!