Skip to content

fix: Redact SSO PII before deletion#38425

Open
ktyagiapphelix2u wants to merge 12 commits intoopenedx:masterfrom
ktyagiapphelix2u:ktyagi/SSOPII
Open

fix: Redact SSO PII before deletion#38425
ktyagiapphelix2u wants to merge 12 commits intoopenedx:masterfrom
ktyagiapphelix2u:ktyagi/SSOPII

Conversation

@ktyagiapphelix2u
Copy link
Copy Markdown
Contributor

@ktyagiapphelix2u ktyagiapphelix2u commented Apr 23, 2026

Description

Implements automatic PII redaction for UserSocialAuth records before deletion to prevent personally identifiable information from persisting after records are removed.

Jira Ticket

https://2u-internal.atlassian.net/browse/BOMS-514

@ktyagiapphelix2u ktyagiapphelix2u marked this pull request as ready for review April 23, 2026 11:29
@ktyagiapphelix2u ktyagiapphelix2u requested a review from a team as a code owner April 23, 2026 11:29
Comment thread openedx/core/djangoapps/user_api/accounts/tests/test_utils.py Outdated
Comment thread openedx/core/djangoapps/user_api/accounts/tests/test_utils.py Outdated
Comment thread openedx/core/djangoapps/user_api/accounts/utils.py Outdated
Comment thread openedx/core/djangoapps/user_api/accounts/signals.py Outdated
# Unlink LMS social auth accounts
UserSocialAuth.objects.filter(user_id=user.id).delete()
# Redact and unlink LMS social auth accounts
for social_auth in UserSocialAuth.objects.filter(user_id=user.id):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. In the original PR where we fixed redact before delete, didn't we already establish use of django calls that avoided looping?
  2. Can we avoid the separate method and just add some hard-coded values? What are all the fields we need to update, and do you have example values that have PII? Just curious. From current redact_user_social_auth_pii it looks like there are multiple values you are clearing out.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed, loop is gone. Now using a single bulk update() + delete() entirely in SQL via Django ORM.

Two PII fields cleared: uid → redacted_@retired.invalid (per-row to avoid the unique_together constraint), extra_data → {}.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ktyagiapphelix2u: As a reminder, we discussed removing the function redact_user_social_auth_pii and moving it into the delete signal receiver that was calling it. I think that will make it more clear that this is just about redacting before deleting.

"Failed to redact PII for UserSocialAuth before deletion: user_id=%s, provider=%s, error=%s",
instance.user_id,
instance.provider,
str(e)
Copy link
Copy Markdown
Contributor

@robrap robrap May 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to include the exception in the message, because logger.exception will already include the exception details.

@ktyagiapphelix2u
Copy link
Copy Markdown
Contributor Author

@robrap We’re dealing with multiple ways SSO records can get deleted through Django admin, user actions like unlinking accounts, bulk retirement scripts. The challenge is that we don’t control all of these paths, so we can’t reliably add PII redaction directly into each one.

Instead, we’ve set up a two-layer approach.

The first layer is a Django signal that runs automatically right before any SSO record is deleted. This acts as a safety net. No matter how the deletion is triggered whether it’s from admin, user action, the signal ensures sensitive fields like the UID and extra data are redacted. It’s centralized, consistent, so it won’t cause issues if it runs more than once.

The second layer is used only in cases we fully control, like user retirement flows. There, we proactively run a bulk redaction step before deleting records. This is much faster because it uses efficient database operations. When the delete happens afterward, the signal still fires, but it detects that the data is already redacted and simply exits without doing extra work.

Together, these two layers cover both safety and performance. The signal guarantees we never miss redaction, even in code we don’t control, while the explicit bulk step keeps large-scale operations efficient.


try:
update_fields = {}
redacted_uid = f'redacted_{instance.pk}@retired.invalid'
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. We should use something more generic, like redact-before-delete@safe.com (or whatever we've come up with). This is not a retired email in all cases.
  2. Since we are using this email as a flag between the bulk retirement and here, we should be using a constant that both pieces of code make use of, to ensure they stay in sync.
  3. I'd comment the short-circuit code below to mention why we are doing it. Something like:
These fields may have already been redacted as part of a bulk retirement,
so we skip the update if it is already done to reduce query count.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[for future] I wonder if we should have generic code for models with annotated PII that automatically introduce this redaction into the pre_delete signal?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants