
Conversation

@adrianviquez (Contributor) commented Nov 22, 2025

This is the worker task for the TA data export.

The task fetches test run data for the provided integration and uploads a tarfile to the given bucket.

This is the second part of the #563 endpoint and should be merged after #563 to ease PR review.

https://linear.app/getsentry/issue/CCMRG-1901/ta-build-codecov-worker-task-for-data-export


Note

Adds a worker task to stream/export test analytics runs per repo to JSON and upload as chunked tar.gz files to GCS, with a shared archiver utility and tests.

  • Worker:
    • New task tasks/export_test_analytics_data.py: streams Testrun data per repo (compact JSON arrays) and uploads via _Archiver to GCS (tar.gz chunks) with success/fail tracking and Sentry reporting.
    • Registers task in tasks/__init__.py.
  • Shared:
    • New utility shared/storage/data_exporter.py _Archiver: manages chunked tar.gz creation and GCS uploads; supports _add_file and upload_json (a minimal sketch follows after this note).
  • Tests:
    • Add unit tests covering serialization, success/failure/mixed repository processing, and multiple test runs (tasks/tests/unit/test_export_test_analytics_data.py).
  • Dependencies:
    • Add google-cloud-storage to libs/shared/pyproject.toml and root pyproject.toml.

Written by Cursor Bugbot for commit e515b84. This will update automatically on new commits.
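
For reviewers who want the archiver's shape in one place, here is a minimal, self-contained sketch of the chunked tar.gz approach described in the note above. The class and method names (_Archiver, _add_file, upload_json) and the rotate-on-size behavior come from this PR; the constructor signature, the size threshold, and the temp-file handling here are assumptions rather than the exact implementation in data_exporter.py.

```python
import io
import os
import tarfile
import tempfile
from pathlib import Path, PurePosixPath

from google.cloud import storage

ARCHIVE_SIZE_THRESHOLD = 50 * 1024 * 1024  # assumed per-chunk size cutoff


class _Archiver:
    """Collects files into local tar.gz chunks and uploads each chunk to GCS."""

    def __init__(self, bucket: storage.Bucket, prefix: str) -> None:
        self.bucket = bucket
        self.prefix = PurePosixPath(prefix)
        self.counter = 0
        self.entries = 0
        self._next_chunk()

    def _next_chunk(self) -> None:
        # Open a fresh local tar.gz that will become the next GCS object.
        self.counter += 1
        fd, path = tempfile.mkstemp(suffix=".tar.gz")
        os.close(fd)
        self.current_archive = Path(path)
        self.archive = tarfile.open(self.current_archive, "w:gz")

    def _upload(self) -> None:
        # Close the current chunk and push it as <prefix>/<counter>.tar.gz.
        self.archive.close()
        blob = self.bucket.blob(str(self.prefix / f"{self.counter}.tar.gz"))
        blob.upload_from_filename(str(self.current_archive), content_type="application/gzip")
        self.current_archive.unlink(missing_ok=True)

    def _add_file(self, name: str, fileobj) -> None:
        # TarInfo needs the member size up front; measure the stream, then rewind.
        fileobj.seek(0, io.SEEK_END)
        info = tarfile.TarInfo(name=name)
        info.size = fileobj.tell()
        fileobj.seek(0)
        self.archive.addfile(info, fileobj)
        self.entries += 1
        # Rotate once the on-disk chunk crosses the size threshold.
        if self.current_archive.stat().st_size >= ARCHIVE_SIZE_THRESHOLD:
            self._upload()
            self._next_chunk()

    def upload_json(self, name: str, payload: bytes) -> None:
        self._add_file(name, io.BytesIO(payload))

    def __enter__(self) -> "_Archiver":
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        # Flush whatever is left in the final chunk on exit.
        if self.entries:
            self._upload()
```

The size check mirrors the snippet quoted further down in this thread: each _add_file call rotates to a new chunk once the on-disk archive crosses the threshold.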

@sentry bot commented Nov 22, 2025

Codecov Report

❌ Patch coverage is 60.76923% with 51 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.79%. Comparing base (240de92) to head (e515b84).
⚠️ Report is 3 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines | Patch % | Lines
libs/shared/shared/storage/data_exporter.py | 0.00% | 51 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #582      +/-   ##
==========================================
- Coverage   93.87%   93.79%   -0.08%     
==========================================
  Files        1284     1286       +2     
  Lines       46528    46791     +263     
  Branches     1522     1525       +3     
==========================================
+ Hits        43676    43887     +211     
- Misses       2542     2594      +52     
  Partials      310      310              
Flag | Coverage Δ
apiunit | 96.57% <ø> (+0.02%) ⬆️
sharedintegration | 38.64% <0.00%> (-0.15%) ⬇️
sharedunit | 88.44% <0.00%> (-0.33%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@codecov-notifications bot commented Nov 22, 2025

Codecov Report

❌ Patch coverage is 0% with 51 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines | Patch % | Lines
libs/shared/shared/storage/data_exporter.py | 0.00% | 51 Missing ⚠️

❌ Your patch check has failed because the patch coverage (0.00%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

📢 Thoughts on this report? Let us know!

@codecov-eu bot commented Nov 22, 2025

Codecov Report

❌ Patch coverage is 60.76923% with 51 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines | Patch % | Lines
libs/shared/shared/storage/data_exporter.py | 0.00% | 51 Missing ⚠️

📢 Thoughts on this report? Let us know!

@codspeed-hq bot commented Nov 22, 2025

CodSpeed Performance Report

Merging #582 will not alter performance

Comparing worker-task-for-test-analytics-export (e515b84) with main (7383d2c)¹

Summary

✅ 9 untouched

Footnotes

  1. No successful run was found on main (a2f8b4d) during the generation of this report, so 7383d2c was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

Comment on lines +100 to +101
gcs_client = storage.Client(project=gcp_project_id)
bucket = gcs_client.bucket(destination_bucket)

This comment was marked as outdated.

@adrianviquez (Contributor, Author) replied:

Okay for now

    .values(*fields)
)
test_runs_data = [
    serialize_test_run(tr) for tr in list(test_runs_qs)
A reviewer (Contributor) commented:

For repos with a large number of test runs, this seems like it could take up a lot of memory. Thoughts on possibly batching the processing of test runs if there are a lot?

@adrianviquez (Contributor, Author) replied:

Added batching and writing to a temp file to help with memory usage during runtime.
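
For context, a sketch of how that batching plus temp-file approach could look; the helper name and batch size below are assumptions, and serialize_test_run is the serializer from the snippet above. It streams the queryset with Django's .iterator() and appends compact JSON to a temp file rather than materializing the whole list:

```python
import json
import tempfile

BATCH_SIZE = 1000  # assumed batch size


def write_test_runs_to_temp_file(test_runs_qs, serialize_test_run) -> str:
    """Serialize a queryset to a compact JSON array on disk, one row at a time."""
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as out:
        out.write("[")
        first = True
        # .iterator() streams rows from the database cursor instead of caching
        # the entire result set on the queryset.
        for test_run in test_runs_qs.iterator(chunk_size=BATCH_SIZE):
            if not first:
                out.write(",")
            json.dump(serialize_test_run(test_run), out, separators=(",", ":"))
            first = False
        out.write("]")
        return out.name
```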

Comment on lines 174 to 177
with open(json_file_path, "rb") as f:
    archiver._add_file(blob_name, f)

pathlib.Path(json_file_path).unlink()
A reviewer (Contributor) commented:

I think we should have a try here so that .unlink still runs even if we run into a problem uploading the file.

@adrianviquez (Contributor, Author) replied on Dec 2, 2025:

Wrapped the _add_file call in a try/finally so that file cleanup always happens. Also added missing_ok=True so that if the temp file isn't created for some reason, the cleanup itself doesn't raise an exception.
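
A minimal sketch of that cleanup pattern; the wrapper function name is hypothetical, while archiver._add_file, blob_name, and json_file_path come from the snippet above:

```python
import pathlib


def add_json_and_cleanup(archiver, blob_name: str, json_file_path: str) -> None:
    try:
        with open(json_file_path, "rb") as f:
            archiver._add_file(blob_name, f)
    finally:
        # Runs even if the upload raises; missing_ok avoids a second error
        # when the temp file was never created.
        pathlib.Path(json_file_path).unlink(missing_ok=True)
```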

"integration_name": integration_name,
"repositories_succeeded": repositories_succeeded,
"repositories_failed": repositories_failed,
}

Bug: Inconsistent return structure missing successful field

The error return paths include a "successful": False field, but the success return path at lines 204-209 omits the successful key entirely. This creates an inconsistent API contract where callers cannot reliably check result["successful"]; it would raise a KeyError on success. Other similar tasks in the codebase consistently include "successful": True in their success returns.

Additional Locations (2)

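
For illustration, a sketch of the consistent shape being asked for; the helper names and the "reason" field are assumptions, the remaining keys are quoted from the task's return value:

```python
def _success_result(integration_name, repositories_succeeded, repositories_failed):
    return {
        "successful": True,
        "integration_name": integration_name,
        "repositories_succeeded": repositories_succeeded,
        "repositories_failed": repositories_failed,
    }


def _error_result(integration_name, reason):
    return {
        "successful": False,
        "integration_name": integration_name,
        "reason": reason,
    }
```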

Comment on lines +35 to +39
def _upload(self) -> None:
    self.archive.close()
    blob_name = str(self.prefix / f"{self.counter}.tar.gz")
    blob = self.bucket.blob(blob_name)
    blob.upload_from_filename(self.current_archive, content_type="application/gzip")

Bug: Missing error handling for _upload() in _Archiver.__exit__() can crash the task.
Severity: CRITICAL | Confidence: High

🔍 Detailed Analysis

The _Archiver.__exit__() method calls self._upload() at line 70 without any error handling. If _upload() (defined at lines 35-39) fails during the final archive upload (e.g., due to network issues, GCS permissions, or bucket not found), the exception will propagate uncaught. This will crash the task, leading to unexpected server failures and potential data loss for successfully processed repositories.

💡 Suggested Fix

Wrap the self._upload() call within the _Archiver.__exit__() method in a try-except block to catch and handle potential exceptions during GCS upload, preventing task crashes.

🤖 Prompt for AI Agent

Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid.

Location: libs/shared/shared/storage/data_exporter.py#L35-L39

Potential issue: The `_Archiver.__exit__()` method calls `self._upload()` at line 70 without any error handling. If `_upload()` (defined at lines 35-39) fails during the final archive upload (e.g., due to network issues, GCS permissions, or bucket not found), the exception will propagate uncaught. This will crash the task, leading to unexpected server failures and potential data loss for successfully processed repositories.

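
A sketch of the suggested fix, assuming the __exit__/_upload structure quoted above; logging and swallowing the exception is just one option, the task could equally record the failure in its result payload:

```python
import logging

log = logging.getLogger(__name__)


class _Archiver:  # only the exit path is sketched here
    def _upload(self) -> None: ...

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        try:
            self._upload()
        except Exception:
            # Already-uploaded chunks are kept; report the failed final flush
            # instead of crashing the whole export task.
            log.exception("Failed to upload final archive chunk")
```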

self.entries += 1
if self.current_archive.stat().st_size >= ARCHIVE_SIZE_THRESHOLD:
    self._upload()
    self._next_chunk()

Bug: Failed upload leaves archiver in broken state

When the archive size exceeds ARCHIVE_SIZE_THRESHOLD, _upload() is called which closes the archive before uploading. If the upload fails (e.g., network error during upload_from_filename), the exception propagates without calling _next_chunk(), leaving self.archive pointing to a closed tarfile. In the task code, this exception is caught and processing continues, but subsequent _add_file() calls will fail on self.archive.addfile() since the archive is closed. This causes cascading failures for all remaining repositories after a single upload failure.

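
One possible way to address this, as an assumption rather than this PR's actual fix: rotate to a fresh chunk even when the upload fails, so later _add_file calls never hit a closed tarfile, and handle or report the failed chunk separately:

```python
ARCHIVE_SIZE_THRESHOLD = 50 * 1024 * 1024  # assumed cutoff, as above


class _Archiver:  # only the rotation check from _add_file is sketched here
    def _upload(self) -> None: ...

    def _next_chunk(self) -> None: ...

    def _rotate_if_full(self) -> None:
        if self.current_archive.stat().st_size >= ARCHIVE_SIZE_THRESHOLD:
            try:
                self._upload()
            finally:
                # Always open a fresh chunk so a failed upload doesn't leave
                # self.archive pointing at a closed tarfile.
                self._next_chunk()
```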

