Paper: A Practical Guide to Data Quality: Problems, Detection, and Strategy #1103
Hi,
I am going to be the reviewer for this paper. Please do not merge the PR yet.
Hi, I have been assigned to review this. I will do a pass by the weekend.
Inviting reviewers: @anmolsinghalIMC and @anilsil
papers/edson_bomfim/main.md
Outdated
In the modern technological landscape, data is the foundational asset that powers innovation, from scientific research and the training of sophisticated machine learning (ML) models to critical business analytics. A dataset can be defined as a structured collection of individual data points, or instances, that hold information about a set of entities sharing some common characteristics. The utility and reliability of any system built upon this data are inextricably linked to its quality.
This paper treats the quality of data as the degree to which it accurately and faithfully represents the real-world phenomena it purports to describe [@anchoring-data-quality] [@data-linter]. This is a crucial distinction. For example, a common challenge in machine learning is class imbalance, where one class is significantly underrepresented. However, if this imbalance accurately reflects the real world (e.g., fraudulent transactions are rare), the dataset itself is not of poor quality; rather, it is a perfect representation of a difficult problem. The challenge then lies with the modeling technique, not the data's fidelity. Our focus is on errors where the data *fails* to represent the world correctly.
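A minimal sketch of this distinction in Python (the `is_fraud` column name and the 0.2% real-world fraud rate are assumptions for illustration, not values from the paper):

```python
import pandas as pd

# Assumed real-world prevalence of fraud; in practice this would come
# from domain knowledge or published statistics.
REAL_WORLD_FRAUD_RATE = 0.002

# Hypothetical dataset: 2 fraudulent transactions out of 1,000.
df = pd.DataFrame({"is_fraud": [0] * 998 + [1] * 2})
observed_rate = df["is_fraud"].mean()

# Class imbalance alone is not a quality problem: the data is faithful
# if the observed rate tracks the real-world rate.
if abs(observed_rate - REAL_WORLD_FRAUD_RATE) <= 0.5 * REAL_WORLD_FRAUD_RATE:
    print(f"Imbalanced ({observed_rate:.2%}) but faithful to the real world")
else:
    print(f"Observed rate {observed_rate:.2%} diverges from expected "
          f"{REAL_WORLD_FRAUD_RATE:.2%}; possible fidelity problem")
```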
Can be more mathematical by saying this paper seeks to address issues where observations don't match the natural distribution.
That is nice!
Thanks for the feedback!
I'll proceed with the changes.
7. **Non-standardized / Non-Conforming Data / Irregularities:** The use of different formats or units to represent the same information (e.g., "USA", "U.S.A.", "United States"; or measurements in both Celsius and Fahrenheit) [@problems-methods-data-cleasing].
8. **Ambiguous Data / Incomplete Context:** Data that can be interpreted in multiple ways without additional context. This can arise from abbreviations ("J. Stevens" for John or Jack) or homonyms ("Miami" in Florida vs. Ohio) [@problems-methods-data-cleasing] [@ajax].
9. **Embedded / Extraneous Data / Value Items Beyond Attribute Context:** A single field contains multiple pieces of information that should be separate (e.g., a "name" field containing "President John Stevens") or information that belongs in another field (e.g., a zip code in an address line) [@ajax] [@problems-methods-data-cleasing].
10. **Misfielded Values:** Correct data is stored in the wrong field (e.g., the value "Portugal" in a "city" attribute) [@ajax].
How is it different from wrong data?
Hmm... I'm not sure which one you mean exactly, so I'll try to explain all of them.
First:
4. Wrong / Incorrect Data / Invalid Data: the data is wrong in essence. For instance, the temperature was 10 but the dataset shows 20.
- Non-standardized / Non-Conforming Data / Irregularities: the instance may even carry the correct information; it is just outside the expected standard. With the same temperature example, maybe the instance came as 50 (implicitly Fahrenheit when it should be Celsius), or even as the string "ten".
- Ambiguous Data / Incomplete Context: the instance may carry the correct information, but it is not possible to know what it is from the context provided. For example, a temperature field with just "1" could be ambiguous in some situations, but it could also be obvious (say, in a dataset of low-temperature superconductors). So it depends not only on the instance itself, but also on the context it is in.
- Embedded / Extraneous Data / Value Items Beyond Attribute Context: a single field contains information belonging to multiple fields. Each piece of information could even be correct; it is just misplaced. This one is the most similar to number 10.
- Misfielded Values: the instance may hold the correct information, just in the wrong field. The temperature above could be inside the "time" field, for example.
Does that make it clearer? Perhaps we could improve some of them in some way?
Also, there is some intersection between these types; they are not 100% mutually exclusive. For instance, numbers 9 and 10 are quite similar!
And one more thing: maybe there is also a problem with the naming "Wrong / Incorrect Data / Invalid Data". At least to me (I'm not a native English speaker), these words sound like a "catch-all" for many of the other problems. There may be better wording for them, but I couldn't find any. All ears for suggestions : )
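To make these distinctions concrete, here is a minimal sketch of the temperature example (all records and field names are invented for illustration):

```python
# The true reading is 10 °C; each record illustrates one category above.
records = [
    {"temperature_c": 20},               # 4. Wrong data: the value is simply incorrect
    {"temperature_c": 50},               # 7. Non-standardized: Fahrenheit in a Celsius field
    {"temperature_c": "ten"},            # 7. Non-standardized: correct information, wrong format
    {"temperature_c": 1},                # 8. Ambiguous: plausible only in certain contexts
    {"temperature_c": "10 C at 14:00"},  # 9. Embedded: two fields' worth of data in one
    {"time": 10},                        # 10. Misfielded: correct value, wrong field
]

def fahrenheit_to_celsius(f):
    return (f - 32) * 5 / 9

# Categories 7-10 can often be repaired because the correct information
# is still present somewhere; category 4 cannot be recovered from the
# instance alone.
print(fahrenheit_to_celsius(records[1]["temperature_c"]))  # 10.0
```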
I would expect the identified categories to be mutually exclusive. A dataset can have multiple problems, but those problems should describe a set of unique problems that do not overlap; otherwise it is not a good taxonomy.
You are correct, and conceptually they are unique.
But the intersection still exists, both in the "symptom" and in the "treatment" of the problem, not in the concept of the problem itself.
This characteristic is quite common among other useful categorizations in the wild. For instance, the ICD-10 has some very good examples.
Similarity in the cause:
- A18: Tuberculosis of other organs
- G01: Meningitis in bacterial diseases classified elsewhere
And in the symptom (they are very different in name, but the detection and treatment overlap a lot):
- F41.X: Other anxiety disorders
- F33.X: Major depressive disorder, recurrent
Even with this overlap in some aspects of the classification, these categories are very useful in specific contexts. The same applies to this data quality classification.
Even so, do you believe they should be somewhat "merged" together?
A data quality problem, or "dirty data," is any instance where the data fails to correctly represent the real-world entity or event it describes [@taxonomy-of-dity-data] [@problems-methods-data-cleasing] [@survey-of-data-quality-tools]. These problems can be subtle or overt, but all carry the risk of corrupting analysis and undermining model performance. Below is a taxonomy of common data quality problems compiled from the literature.
### Taxonomy of Data Quality Problems
I think the problem of the distribution not matching the real world has not been addressed.
Thanks. However, I'm not sure I understood what, exactly, "addressed" means here.
Do you mean that the paper doesn't say what to do when one finds the "problematic instances"?
There is a very common data problem where you don't observe the natural distribution of the data. This occurs mostly due to data capture mistakes. For example, let's say you try to capture the favorite dishes of people living in New York City, but you only talk to people living in Brooklyn and not in any of the other boroughs. This would lead to a bias in the data.
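A minimal sketch of one way to detect this kind of bias, assuming known reference population shares (all numbers below are invented, standing in for real census figures):

```python
from scipy.stats import chisquare

# Respondent counts per borough for a hypothetical Brooklyn-only survey:
# Bronx, Brooklyn, Manhattan, Queens, Staten Island.
observed = [0, 950, 0, 50, 0]

# Invented population shares standing in for real census proportions.
expected_share = [0.17, 0.31, 0.19, 0.27, 0.06]
expected = [p * sum(observed) for p in expected_share]

# A goodness-of-fit test flags samples whose borough mix diverges from
# the reference distribution, i.e., likely sampling bias.
stat, p_value = chisquare(observed, f_exp=expected)
print(f"chi2={stat:.1f}, p={p_value:.3g}")  # a tiny p-value signals bias
```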
Thanks for the reply. I understood the general problem.
Just not what "addressed" means in your comment "i think the problem (...) has not been addressed".
What do you mean by "addressing", specifically?
"Address" means to think about and begin to deal with. In this context it would mean you mention the problem and then add it as a category
Thanks so much for the clarification and the example, @anmolsinghalIMC, that was super helpful.
You're absolutely right, the paper wasn't covering sampling bias, and I completely agree that it's a critical issue that can undermine a dataset's quality.
My reasoning is that while it's a vital problem, it seems to stem more from the study design or data sourcing strategy, whereas the paper's focus is on finding flaws within the data that has already been collected.
To make this distinction clear and directly address your feedback, I've updated the paper by adding a new point to the "A Note on Excluded Concepts" section. It's called "Representational Validity/Sampling Bias" and explicitly acknowledges the issue while explaining the rationale for keeping it out of the main taxonomy.
This was a fantastic suggestion, and I think it makes the paper's scope much clearer and more robust. Thanks again for the constructive feedback!
We also present @tbl:cat-detection-to-problems, which provides a high-level map for practitioners, showing which categories of methods are effective at identifying specific data quality problems.
```{list-table} Linking Data Problems to Detection Method Categories
There definitely needs to be a section explaining these choices.
Oh sure, thanks for the idea!
I'll work on it right now.
If it's not a problem, please give me a couple of days to send an update with this new section for your review.
Hello @anmolsinghalIMC, I was finally able to send the new section explaining each category.
I appreciate your feedback. Let me know if you see ways to improve it.
Thank you.
Thanks for the feedback @anmolsinghalIMC, I'll work on a new subsection. Also, the CI/CD pipeline is raising an error which I'm not sure is related to the paper (or to something wrong I did). Thanks a lot for taking the time for this review. Greatly appreciated.
Hi, this is Anil; I will be finishing the review today. Thank you for your patience.
anilsil left a comment:
This paper offers a well-structured, comprehensive, and practical framework for understanding and addressing data quality issues. The author is clear about the problem statement and the case study to be presented. In the process, the paper presents a clear taxonomy of 23 data quality problems and 22 detection methods, mapped into different categories, making it highly useful for practitioners. The writing is clear, and the proposed concept of "data testing" aligns well with software engineering best practices.
Please consider adding empirical validation/examples using real-world datasets; including diagrams or a visual summary would also help, especially for Table 3, where various methods are recommended.
@hongsupshin will serve as editor for this paper.
Hello @anilsil, thanks for your valuable feedback.
@edinhodiluviano Hi, my name is Hongsup Shin, and I am the editor of your paper. @anilsil @anmolsinghal It looks like the author has some questions about your comments, so if you can answer those soon, I would really appreciate it!
I have completed the review from my side.
@edinhodiluviano Hi, just a reminder that the Final Author Revision Deadline is 9/4/2025. This means you shouldn't be making any changes after this date. If you want to make changes, please do so before this date. Thank you!
If you are creating this PR in order to submit a draft of your paper, please name your PR with `Paper: <title>`. An editor will then add a `draft` label; this will trigger GitHub Actions to run automated checks on your paper and build a preview. You may then work to resolve failed checks and ensure the preview build looks correct. If you have any questions, please tag the proceedings team in a comment on your PR with `@scipy-conference/2025-proceedings`. See the project README for more information.