Conversation

@edinhodiluviano
Contributor

If you are creating this PR in order to submit a draft of your paper, please name your PR with Paper: <title>. An editor will then add a draft label; this will trigger GitHub Actions to run automated checks on your paper and build a preview. You may then work to resolve failed checks and ensure the preview build looks correct. If you have any questions, please tag the proceedings team in a comment on your PR with @scipy-conference/2025-proceedings.

See the project readme for more information.

@github-actions

github-actions bot commented Jun 13, 2025

Curvenote Preview

Directory Preview Checks Updated (UTC)
papers/edson_bomfim 🔍 Inspect 33 checks passed (10 optional) Oct 14, 2025, 6:07 PM

@rowanc1 rowanc1 added paper This indicates that the PR in question is a paper draft This triggers Curvenote Preview actions labels Jun 13, 2025

@anmolsinghal anmolsinghal left a comment


Hi,

I am going to be the reviewer for this paper. Please do not merge the PR yet.

@anilsil

anilsil commented Jun 18, 2025

Hi, I have been assigned to review this. I will do a pass by the weekend.

@ameyxd
Contributor

ameyxd commented Jun 23, 2025

Inviting reviewers: @[email protected] and @[email protected]


In the modern technological landscape, data is the foundational asset that powers innovation, from scientific research and training sophisticated machine learning (ML) models to driving critical business analytics. A dataset can be defined as a structured collection of individual data points, or instances, that hold information about a set of entities sharing some common characteristics. The utility and reliability of any system built upon this data are inextricably linked to its quality.

This paper treats the quality of data as the degree to which it accurately and faithfully represents the real-world phenomena it purports to describe [@anchoring-data-quality] [@data-linter]. This is a crucial distinction. For example, a common challenge in machine learning is class imbalance, where one class is significantly underrepresented. However, if this imbalance accurately reflects the real world (e.g., fraudulent transactions are rare), the dataset itself is not of poor quality; rather, it is a perfect representation of a difficult problem. The challenge then lies with the modeling technique, not the data's fidelity. Our focus is on errors where the data *fails* to represent the world correctly.
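The fraud example can be made concrete with a small Python sketch (the labels and the assumed real-world base rate below are illustrative inventions, not figures from the paper):

```python
import pandas as pd

# Hypothetical labels where the "fraud" class is rare by construction (1%).
labels = pd.Series(["ok"] * 990 + ["fraud"] * 10)

observed_rate = (labels == "fraud").mean()

# Illustrative assumption: the believed real-world base rate of fraud.
assumed_real_world_rate = 0.01

# Imbalance alone is not a quality defect; a large gap between the
# observed rate and the believed real-world rate would be.
gap = abs(observed_rate - assumed_real_world_rate)
print(f"observed={observed_rate:.3f}, gap={gap:.3f}")
```

Under this view, a dataset with 1% fraud is faithful if fraud really occurs about 1% of the time; the difficulty it creates is a modeling problem, not a data fidelity problem.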


Can be more mathematical by saying this paper seeks to address issues where observations don't match the natural distribution.

Contributor Author


That is nice!
Thanks for the feedback!
I'll proceed with the changes.

7. **Non-standardized / Non-Conforming Data / Irregularities:** The use of different formats or units to represent the same information (e.g., "USA", "U.S.A.", "United States"; or measurements in both Celsius and Fahrenheit) [@problems-methods-data-cleasing].
8. **Ambiguous Data / Incomplete Context:** Data that can be interpreted in multiple ways without additional context. This can arise from abbreviations ("J. Stevens" for John or Jack) or homonyms ("Miami" in Florida vs. Ohio) [@problems-methods-data-cleasing] [@ajax].
9. **Embedded / Extraneous Data / Value Items Beyond Attribute Context:** A single field contains multiple pieces of information that should be separate (e.g., a "name" field containing "President John Stevens") or information that belongs in another field (e.g., a zip code in an address line) [@ajax] [@problems-methods-data-cleasing].
10. **Misfielded Values:** Correct data is stored in the wrong field (e.g., the value "Portugal" in a "city" attribute) [@ajax].
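A minimal pandas sketch of how problems 7 and 10 might be detected (the records, the canonical-spelling map, and the country reference list are all hypothetical):

```python
import pandas as pd

# Hypothetical records exhibiting problem 7 (non-standardized values)
# and problem 10 (a country name stored in the "city" field).
df = pd.DataFrame({
    "country": ["USA", "U.S.A.", "United States"],
    "city": ["Miami", "Portugal", "Chicago"],
})

# Problem 7: collapse known spelling variants onto one canonical form.
canonical = {"USA": "United States", "U.S.A.": "United States"}
df["country"] = df["country"].replace(canonical)

# Problem 10: flag "city" values that appear in a reference list of
# countries, which suggests a misfielded value.
known_countries = {"United States", "Portugal"}
misfielded = df["city"].isin(known_countries)

print(df["country"].unique())  # a single standardized spelling
print(df[misfielded])          # the row whose "city" holds "Portugal"
```

Both checks depend on curated reference data (the variant map and the country list), which is typical of detection methods for these problem types.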


How is this different from wrong data?

Contributor Author


Hmm... I'm not sure which one you mean exactly. I'll try to explain all of them.
First:
4. Wrong / Incorrect Data / Invalid Data: The data is wrong in essence. For instance, a temperature was 10 but the dataset shows 20.

Then:
7. Non-standardized / Non-Conforming Data / Irregularities: the instance may even carry the correct information. It is just outside the expected standard. With the same temperature example, maybe the data instance came as 50 (implicitly Fahrenheit when it should be Celsius), or even as the string "ten".
8. Ambiguous Data / Incomplete Context: the instance may even carry the correct information. It is just not possible to know what it is from the context provided. For example, a temperature field with just "1" could be ambiguous in some situations, but it could also be obvious (say, if we are reading a dataset of low-temperature superconductors). So it depends not only on the instance itself, but also on the context it is in.
9. Embedded / Extraneous Data / Value Items Beyond Attribute Context: a field contains information belonging to multiple fields. Each of these pieces of information could even be correct; they are just misplaced. This one is the most similar to number 10.
10. Misfielded Values: the instance may hold the correct information, just in the wrong field. The temperature above could be inside the "time" field, for example.

Does that make it clearer? Perhaps we could improve some of them in some way?

Also, there is some intersection between these types; they are not 100% mutually exclusive. For instance, numbers 9 and 10 are quite similar!
And even more: maybe there is yet another problem with the naming "Wrong / Incorrect Data / Invalid Data". At least to me (I'm not a native English speaker), these words seem like a "catch-all" for many of the other problems. Maybe there is better wording for them, but I couldn't find any. All ears for suggestions on it : )
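The temperature distinctions above can be sketched in pandas as follows (the readings and the plausibility range are illustrative assumptions, not values from the paper):

```python
import pandas as pd

# Hypothetical temperature readings, assumed to be in degrees Celsius.
temps = pd.Series(["10", "20", "50", "ten"])

# Non-standardized representation: "ten" may carry correct information,
# but it fails numeric coercion and becomes NaN.
numeric = pd.to_numeric(temps, errors="coerce")
non_numeric = temps[numeric.isna()]

# A plausibility range can hint at unit mix-ups: 50 is a suspicious
# Celsius reading here, but an ordinary Fahrenheit one.
suspicious = numeric[(numeric < -30) | (numeric > 45)]

print(non_numeric.tolist())  # ['ten']
print(suspicious.tolist())   # [50.0]
```

Note that both flagged values might still be "correct" information badly represented, which is exactly what separates problems 7 and 8 from wrong data (problem 4).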


I would expect identified categories to be mutually exclusive. A dataset can have multiple problems, but those problems should describe a set of unique problems and not overlap. Otherwise it is not a good taxonomy.

Contributor Author


You are correct, and conceptually they are unique.
But the intersection still exists, both in the "symptom" and in the "treatment" of the problem, not in the concept of the problem itself.

This characteristic is quite common among other useful categorizations in the wild. For instance, the ICD-10 has some very good examples.
Similarity in the cause:

  • A18: Tuberculosis of other organs
  • G01: Meningitis in bacterial diseases classified elsewhere

And in the symptom (they are very different in name, but the detection and treatment overlap a lot):

  • F41.X: Other anxiety disorders
  • F33.X: Major depressive disorder, recurrent

Even with this overlap in some aspects of the classification, they are very useful in specific contexts. The same applies to these data quality classifications.

Even so, do you believe they should be somewhat "merged" together?


A data quality problem, or "dirty data," is any instance where the data fails to correctly represent the real-world entity or event it describes [@taxonomy-of-dity-data] [@problems-methods-data-cleasing] [@survey-of-data-quality-tools]. These problems can be subtle or overt, but all carry the risk of corrupting analysis and undermining model performance. Below is a taxonomy of common data quality problems compiled from the literature.

### Taxonomy of Data Quality Problems


I think the problem of the distribution not matching the real world has not been addressed.

Contributor Author


Thanks. However, I'm not sure I understood what, exactly, "addressed" means here.
Do you mean that the paper doesn't say what to do when one finds the "problematic instances"?


There is a very common data problem where you don't observe the natural distribution of the data. This occurs mostly due to data capture mistakes. An example would be: let's say you try to capture the favorite dishes of people living in New York City, but you only talk to people living in Brooklyn and not in any of the other boroughs. This would lead to a bias in the data.
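The borough example can be checked with a simple distribution comparison. The sketch below uses made-up respondent counts and an assumed population share per borough (not real census figures); in practice one would reach for `scipy.stats.chisquare` rather than computing the statistic by hand:

```python
# Hypothetical survey: respondents per borough (heavily skewed toward
# the first borough) vs. an assumed, made-up population share.
observed = [950, 20, 15, 10, 5]
population_share = [0.31, 0.32, 0.17, 0.14, 0.06]  # sums to 1.0

total = sum(observed)
expected = [share * total for share in population_share]

# Pearson chi-squared statistic: a large value means the sample deviates
# strongly from the population distribution, i.e. likely sampling bias.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(f"chi2 = {chi2:.1f}")  # far above any reasonable critical value
```

A statistic this extreme relative to the chi-squared critical value (about 9.5 at α = 0.05 with 4 degrees of freedom) would signal that the sample does not represent the population.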

Contributor Author


Thanks for the reply. I understood the general problem, just not what "addressed" means in your comment "i think the problem (...) has not been addressed".

What do you mean by this "addressing", specifically?


"Address" means to think about and begin to deal with. In this context it would mean you mention the problem and then add it as a category

Contributor Author


Thanks so much for the clarification and the example, @anmolsinghalIMC, that was super helpful.
You're absolutely right, the paper wasn't covering sampling bias, and I completely agree that it's a critical issue that can undermine a dataset's quality.
My reasoning is that while it's a vital problem, it seems to stem more from the study design or data sourcing strategy, whereas the paper's focus is on finding flaws within the data that has already been collected.
To make this distinction clear and directly address your feedback, I've updated the paper by adding a new point to the "A Note on Excluded Concepts" section. It's called "Representational Validity/Sampling Bias" and explicitly acknowledges the issue while explaining the rationale for keeping it out of the main taxonomy.
This was a fantastic suggestion, and I think it makes the paper's scope much clearer and more robust. Thanks again for the constructive feedback!


We also present @tbl:cat-detection-to-problems, which provides a high-level map for practitioners, showing which categories of methods are effective at identifying specific data quality problems.

```{list-table} Linking Data Problems to Detection Method Categories


There definitely needs to be a section explaining these choices.

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sure, thanks for the idea!
I'll work on it right now.
If it's not a problem, please give me a couple of days to send an update with this new section for your appreciation.

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @anmolsinghalIMC, I was finally able to send the new section explaining each category.
I appreciate your feedback. Let me know if you see ways to improve it.
Thank you.

@edinhodiluviano
Contributor Author

Thanks for the feedback @anmolsinghalIMC
I just sent a new version with the first change (commit here if you want to see just the change).

I'll work on a new subsection of Categories of Detection Methods to provide more details on each of the categories. As soon as it's ready I'll ping you here.

Also, the CI/CD pipeline is raising this error, and I'm not sure whether it is related to the paper (or to something wrong I did).
A bit after saying 🏁 Checks completed, it says Job update failed. Should I worry about it now, or can I wait for the retry when the next update is sent?

Thanks a lot for taking the time for this review. Greatly appreciated.

@anilsil

anilsil commented Jun 29, 2025


Hi, This is Anil, I will be finishing review today. Thank you for your patience.


@anilsil anilsil left a comment


This paper offers a well-structured, comprehensive, and practical framework for understanding and addressing data quality issues. The author is clear about the problem statement and the case study to be presented. In the process, the paper presents a clear taxonomy of 23 data quality problems and 22 detection methods, mapped into different categories, making it highly useful for practitioners. The writing is clear, and the proposed concept of "data testing" aligns well with software engineering best practices.

Please consider adding empirical validation/examples using real-world datasets. Including diagrams or a visual summarization would not be a bad idea either, especially for Table 3, where the various methods are recommended.

@ameyxd
Contributor

ameyxd commented Jul 2, 2025

@hongsupshin will serve as editor for this paper.

@edinhodiluviano
Contributor Author

Hello @anilsil, thanks for your valuable feedback.
It took me some time, but I'm finally happy with a new section containing examples on a real dataset.
The code is simple enough; the interesting parts are the concepts behind it and the reality of datasets in the wild.
I'd be grateful if you could evaluate those changes and share any improvements you can see.
Thanks

@hongsupshin
Contributor

@edinhodiluviano Hi, my name is Hongsup Shin, and I am the editor of your paper. @anilsil @anmolsinghal It looks like the author has some questions about your comments, so if you can answer those soon, I would really appreciate it!

@anmolsinghalIMC

I have completed the review from my side

@hongsupshin
Contributor

@edinhodiluviano Hi, just a reminder that the Final Author Revision Deadline is 9/4/2025. This means you shouldn't be making any changes after this date. If you want to make changes, please do so before this date. Thank you!

@fwkoch fwkoch added approved This triggers Curvenote Submission action and removed draft This triggers Curvenote Preview actions labels Oct 14, 2025
@fwkoch fwkoch merged commit 8e57750 into scipy-conference:2025 Oct 14, 2025
16 checks passed
