Skip to content

order contributors to resolve by unique email for improved efficiency and github API fault tolerance #3780

@MoralCode

Description

@MoralCode

SELECT DISTINCT
commits.cmt_author_name AS NAME,
commits.cmt_commit_hash AS hash,
commits.cmt_author_raw_email AS email_raw,
'not_unresolved' as resolution_status
FROM
commits
WHERE
commits.repo_id = :repo_id
AND (:since_date is NULL OR commits.data_collection_date > :since_date)
AND (NOT EXISTS ( SELECT contributors.cntrb_canonical FROM contributors WHERE contributors.cntrb_canonical = commits.cmt_author_raw_email )
or NOT EXISTS ( SELECT contributors_aliases.alias_email from contributors_aliases where contributors_aliases.alias_email = commits.cmt_author_raw_email)
AND ( commits.cmt_author_name ) IN ( SELECT C.cmt_author_name FROM commits AS C WHERE C.repo_id = :repo_id GROUP BY C.cmt_author_name ))
GROUP BY
commits.cmt_author_name,
commits.cmt_commit_hash,
commits.cmt_author_raw_email
UNION
SELECT DISTINCT
commits.cmt_author_name AS NAME,--commits.cmt_id AS id,
commits.cmt_commit_hash AS hash,
commits.cmt_author_raw_email AS email_raw,
'unresolved' as resolution_status
FROM
commits
WHERE
commits.repo_id = :repo_id
AND (:since_date is NULL OR commits.data_collection_date > :since_date)
AND EXISTS ( SELECT unresolved_commit_emails.email FROM unresolved_commit_emails WHERE unresolved_commit_emails.email = commits.cmt_author_raw_email )
AND ( commits.cmt_author_name ) IN ( SELECT C.cmt_author_name FROM commits AS C WHERE C.repo_id = :repo_id GROUP BY C.cmt_author_name )
GROUP BY
commits.cmt_author_name,
commits.cmt_commit_hash,
commits.cmt_author_raw_email
ORDER BY
hash
""").bindparams(repo_id=repo_id,since_date=last_collected_date)

This query at the start of contributor resolution returns the commit contributors (name, email, commit hash).

There are a few, less impactful problems with this that cause it to make contributor resolution less efficient than it could be.

  1. we are querying from the commits table, which is a misnomer and actually has one row per file per commit (known issue commits table is actually representing commit files #3682 )
  2. the query is sorting by commit hash, meaning that, because of how hashes are generated in git, a contributor with multiple commits to resolve will have multiple rows in this file spread theoretically evenly throughout the results of this query

This means we have to perform extra iterations later and rely on our per-iteration checks to catch things like a contributor that got resolved on the first iteration, and now has several other records that need linking to the same already-resolved contributor ID.

What I would ideally like to see is a query that gives us one record back per unique contributor email address (which means we would need to have a LIST of commit hashes).

this gives us several key benefits:

  1. a single iteration of the contributor processing loop for each contributor (by unique email), meaning less time spent checking and re-checking the same email over and over
  2. the contributor resolution process will have access to multiple commit hashes to check with the github API, better guarding against one off API errors that put records in the unresolved table (i.e. enabling logic like "okay the first commit didnt give us the github username, lest try 2 or 3 more commits before falling back to the less reliable search API for resolution")

in other words, this dramatically reduces overlapping processing. For example, while the current process already links all commits matching the resolved email to the contributor, this new process would prevent a scenario where the first commit from a contributor is seen, the contributor is resolved and linked (including all their commits), yet that contributor is not removed from the iteration loop/still exists in the initial query (above) because they were so heavily duplicated. This leads to future iterations of contributor resolution running for records that already have their cmt_ght_author_id set to a value, straight up wasting time and processing cycles

ref:

def link_commits_to_contributor(logger, facade_helper, contributorQueue):

This "list of commits" behavior can be achieved with a postgres string aggregate function, to comma-separate the hashes in the single existing column. 8Knot already uses this behavior in other parts of its code and it is something Cali is familiar with.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions