Paper: A Practical Guide to Data Quality: Problems, Detection, and Strategy #1103
Hi,
I am going to be the reviewer for this paper. Please do not merge the PR yet.
Hi, I have been assigned to review this. I will do a pass by the weekend.
Inviting reviewers: @anmolsinghalIMC and @anilsil
papers/edson_bomfim/main.md
Outdated
In the modern technological landscape, data is the foundational asset that powers innovation, from scientific research and the training of sophisticated machine learning (ML) models to critical business analytics. A dataset can be defined as a structured collection of individual data points, or instances, that hold information about a set of entities sharing some common characteristics. The utility and reliability of any system built upon this data are inextricably linked to its quality.
This paper treats the quality of data as the degree to which it accurately and faithfully represents the real-world phenomena it purports to describe [@anchoring-data-quality] [@data-linter]. This is a crucial distinction. For example, a common challenge in machine learning is class imbalance, where one class is significantly underrepresented. However, if this imbalance accurately reflects the real world (e.g., fraudulent transactions are rare), the dataset itself is not of poor quality; rather, it is a perfect representation of a difficult problem. The challenge then lies with the modeling technique, not the data's fidelity. Our focus is on errors where the data *fails* to represent the world correctly.
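A minimal sketch of this distinction in Python (the `is_fraud` column name and the 0.2% real-world fraud rate are assumptions for illustration, not values from the paper):

```python
import pandas as pd

# Assumed real-world prevalence of fraud; in practice this would come
# from domain knowledge or published statistics.
REAL_WORLD_FRAUD_RATE = 0.002

# Hypothetical dataset: 2 fraudulent transactions out of 1,000.
df = pd.DataFrame({"is_fraud": [0] * 998 + [1] * 2})
observed_rate = df["is_fraud"].mean()

# Class imbalance alone is not a quality problem: the data is faithful
# if the observed rate tracks the real-world rate.
if abs(observed_rate - REAL_WORLD_FRAUD_RATE) <= 0.5 * REAL_WORLD_FRAUD_RATE:
    print(f"Imbalanced ({observed_rate:.2%}) but faithful to the real world")
else:
    print(f"Observed rate {observed_rate:.2%} diverges from expected "
          f"{REAL_WORLD_FRAUD_RATE:.2%}; possible fidelity problem")
```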
Can be more mathematical by saying this paper seeks to address issues where observations don't match the natural distribution.
That is nice!
Thanks for the feedback!
I'll proceed with the changes.
7. **Non-standardized / Non-Conforming Data / Irregularities:** The use of different formats or units to represent the same information (e.g., "USA", "U.S.A.", "United States"; or measurements in both Celsius and Fahrenheit) [@problems-methods-data-cleasing].
8. **Ambiguous Data / Incomplete Context:** Data that can be interpreted in multiple ways without additional context. This can arise from abbreviations ("J. Stevens" for John or Jack) or homonyms ("Miami" in Florida vs. Ohio) [@problems-methods-data-cleasing] [@ajax].
9. **Embedded / Extraneous Data / Value Items Beyond Attribute Context:** A single field contains multiple pieces of information that should be separate (e.g., a "name" field containing "President John Stevens") or information that belongs in another field (e.g., a zip code in an address line) [@ajax] [@problems-methods-data-cleasing].
10. **Misfielded Values:** Correct data is stored in the wrong field (e.g., the value "Portugal" in a "city" attribute) [@ajax].
How is it different from wrong data?
Hmm... I'm not sure which one you mean exactly, so I'll try to explain all of them.
First:
4. Wrong / Incorrect Data / Invalid Data: the data is wrong in essence. For instance, the temperature was 10 but the dataset shows 20.
- Non-standardized / Non-Conforming Data / Irregularities: the instance may even carry the correct information; it is just outside the expected standard. With the same temperature example, maybe the instance came as 50 (implicitly Fahrenheit when it should be Celsius), or even as the string "ten".
- Ambiguous Data / Incomplete Context: the instance may carry the correct information, but it is not possible to know what it is from the context provided. For example, a temperature field with just "1" could be ambiguous in some situations, but it could also be obvious (say, in a dataset of low-temperature superconductors). So it depends not only on the instance itself, but also on the context it is in.
- Embedded / Extraneous Data / Value Items Beyond Attribute Context: a single field contains information belonging to multiple fields. Each piece of information could even be correct; it is just misplaced. This one is the most similar to number 10.
- Misfielded Values: the instance may hold the correct information, just in the wrong field. The temperature above could be inside the "time" field, for example.
Does that make it clearer? Perhaps we could improve some of them in some way?
Also, there is some intersection between these types; they are not 100% mutually exclusive. For instance, numbers 9 and 10 are quite similar!
And one more thing: maybe there is also a problem with the naming "Wrong / Incorrect Data / Invalid Data". At least to me (I'm not a native English speaker), these words sound like a "catch-all" for many of the other problems. There may be better wording for them, but I couldn't find any. All ears for suggestions : )
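To make these distinctions concrete, here is a minimal sketch of the temperature example (all records and field names are invented for illustration):

```python
# The true reading is 10 °C; each record illustrates one category above.
records = [
    {"temperature_c": 20},               # 4. Wrong data: the value is simply incorrect
    {"temperature_c": 50},               # 7. Non-standardized: Fahrenheit in a Celsius field
    {"temperature_c": "ten"},            # 7. Non-standardized: correct information, wrong format
    {"temperature_c": 1},                # 8. Ambiguous: plausible only in certain contexts
    {"temperature_c": "10 C at 14:00"},  # 9. Embedded: two fields' worth of data in one
    {"time": 10},                        # 10. Misfielded: correct value, wrong field
]

def fahrenheit_to_celsius(f):
    return (f - 32) * 5 / 9

# Categories 7-10 can often be repaired because the correct information
# is still present somewhere; category 4 cannot be recovered from the
# instance alone.
print(fahrenheit_to_celsius(records[1]["temperature_c"]))  # 10.0
```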
I would expect the identified categories to be mutually exclusive. A dataset can have multiple problems, but those problems should describe a set of unique problems that do not overlap; otherwise it is not a good taxonomy.
You are correct, and conceptually they are unique.
But the intersection still exists, both in the "symptom" and in the "treatment" of the problem, not in the concept of the problem itself.
This characteristic is quite common among other useful categorizations in the wild. For instance, the ICD-10 has some very good examples.
Similarity in the cause:
- A18: Tuberculosis of other organs
- G01: Meningitis in bacterial diseases classified elsewhere
And in the symptom (they are very different in name, but the detection and treatment overlap a lot):
- F41.X: Other anxiety disorders
- F33.X: Major depressive disorder, recurrent
Even with this overlap in some aspects of the classification, these categories are very useful in specific contexts. The same applies to this data quality classification.
Even so, do you believe they should be somewhat "merged" together?
A data quality problem, or "dirty data," is any instance where the data fails to correctly represent the real-world entity or event it describes [@taxonomy-of-dity-data] [@problems-methods-data-cleasing] [@survey-of-data-quality-tools]. These problems can be subtle or overt, but all carry the risk of corrupting analysis and undermining model performance. Below is a taxonomy of common data quality problems compiled from the literature.
### Taxonomy of Data Quality Problems
I think the problem of the distribution not matching the real world has not been addressed.
Thanks. However, I'm not sure I understood what, exactly, "addressed" means here.
Do you mean that the paper doesn't say what to do when one finds the "problematic instances"?
There is a very common data problem where you don't observe the natural distribution of the data. This occurs mostly due to data capture mistakes. For example, let's say you try to capture the favorite dishes of people living in New York City, but you only talk to people living in Brooklyn and not in any of the other boroughs. This would lead to a bias in the data.
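A minimal sketch of one way to detect this kind of bias, assuming known reference population shares (all numbers below are invented, standing in for real census figures):

```python
from scipy.stats import chisquare

# Respondent counts per borough for a hypothetical Brooklyn-only survey:
# Bronx, Brooklyn, Manhattan, Queens, Staten Island.
observed = [0, 950, 0, 50, 0]

# Invented population shares standing in for real census proportions.
expected_share = [0.17, 0.31, 0.19, 0.27, 0.06]
expected = [p * sum(observed) for p in expected_share]

# A goodness-of-fit test flags samples whose borough mix diverges from
# the reference distribution, i.e., likely sampling bias.
stat, p_value = chisquare(observed, f_exp=expected)
print(f"chi2={stat:.1f}, p={p_value:.3g}")  # a tiny p-value signals bias
```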
Thanks for the reply. I understood the general problem.
Just not what "addressed" means in your comment "i think the problem (...) has not been addressed".
What do you mean by "addressing", specifically?
"Address" means to think about and begin to deal with. In this context it would mean you mention the problem and then add it as a category
Thanks so much for the clarification and the example, @anmolsinghalIMC, that was super helpful.
You're absolutely right, the paper wasn't covering sampling bias, and I completely agree that it's a critical issue that can undermine a dataset's quality.
My reasoning is that while it's a vital problem, it seems to stem more from the study design or data sourcing strategy, whereas the paper's focus is on finding flaws within the data that has already been collected.
To make this distinction clear and directly address your feedback, I've updated the paper by adding a new point to the "A Note on Excluded Concepts" section. It's called "Representational Validity/Sampling Bias" and explicitly acknowledges the issue while explaining the rationale for keeping it out of the main taxonomy.
This was a fantastic suggestion, and I think it makes the paper's scope much clearer and more robust. Thanks again for the constructive feedback!
We also present @tbl:cat-detection-to-problems, which provides a high-level map for practitioners, showing which categories of methods are effective at identifying specific data quality problems.
```{list-table} Linking Data Problems to Detection Method Categories
There definitely needs to be a section explaining these choices.
Oh sure, thanks for the idea!
I'll work on it right now.
If it's not a problem, please give me a couple of days to send an update with this new section for your review.
Hello @anmolsinghalIMC, I was finally able to send the new section explaining each category.
I appreciate your feedback. Let me know if you see ways to improve it.
Thank you.
Thanks for the feedback @anmolsinghalIMC, I'll work on a new subsection. Also, the CI/CD pipeline is raising an error which I'm not sure is related to the paper (or to something wrong I did). Thanks a lot for taking the time for this review. Greatly appreciated.
Hi, this is Anil; I will be finishing the review today. Thank you for your patience.
anilsil left a comment:
This paper offers a well-structured, comprehensive, and practical framework for understanding and addressing data quality issues. The author is clear about the problem statement and the case study to be presented. In the process, the paper presents a clear taxonomy of 23 data quality problems and 22 detection methods, mapped into different categories, making it highly useful for practitioners. The writing is clear, and the proposed concept of "data testing" aligns well with software engineering best practices.
Please consider adding empirical validation/examples using real-world datasets; including diagrams or a visual summary would also help, especially for Table 3, where various methods are recommended.
@hongsupshin will serve as editor for this paper.
Hello @anilsil, thanks for your valuable feedback.
@edinhodiluviano Hi, my name is Hongsup Shin, and I am the editor of your paper. @anilsil @anmolsinghal It looks like the author has some questions about your comments, so if you can answer those soon, I would really appreciate it!
I have completed the review from my side.
@edinhodiluviano Hi, just a reminder that the Final Author Revision Deadline is 9/4/2025. This means you shouldn't be making any changes after this date. If you want to make changes, please do so before this date. Thank you!
If you are creating this PR in order to submit a draft of your paper, please name your PR with `Paper: <title>`. An editor will then add a `draft` label; this will trigger GitHub Actions to run automated checks on your paper and build a preview. You may then work to resolve failed checks and ensure the preview build looks correct. If you have any questions, please tag the proceedings team in a comment on your PR with `@scipy-conference/2025-proceedings`. See the project README for more information.