order contributors to resolve by unique email for improved efficiency and github API fault tolerance

https://github.com/chaoss/augur/blob/49a008ab97c43472339e400cb316a5323110d78d/augur/tasks/github/facade_github/tasks.py#L210-L246

This query at the start of contributor resolution returns the commit contributors (name, email, commit hash). 

There are a few, less impactful problems with this that cause it to make contributor resolution less efficient than it could be.

1. we are querying from the commits table, which is a misnomer and actually has one row per file per commit (known issue #3682 )
2. the query is sorting by commit hash, meaning that, because of how hashes are generated in git, a contributor with multiple commits to resolve will have multiple rows  in this file **spread theoretically evenly throughout the results of this query** 


This means we have to perform extra iterations later and rely on our per-iteration checks to catch things like a contributor that got resolved on the first iteration, and now has several other records that need linking to the same already-resolved contributor ID.


What I would ideally like to see is a query that gives us one record back per unique contributor email address (which means we would need to have a LIST of commit hashes).

this gives us several key benefits:
1. a single iteration of the contributor processing loop for each contributor (by unique email), meaning less time spent checking and re-checking the same email over and over
2. the contributor resolution process will have access to multiple commit hashes to check with the github API, better guarding against one off API errors that put records in the unresolved table (i.e. enabling logic like "okay the first commit didnt give us the github username, lest try 2 or 3 more commits before falling back to the less reliable search API for resolution")


in other words, this dramatically reduces overlapping processing. For example, while the current process already links all commits matching the resolved email to the contributor, this new process would prevent a scenario where the first commit from a contributor is seen, the contributor is resolved and linked (including all their commits), yet that contributor is not removed from the iteration loop/still exists in the initial query (above) because they were so heavily duplicated. This leads to future iterations of contributor resolution running for records that **already have their `cmt_ght_author_id` set to a value**, straight up wasting time and processing cycles  

ref: https://github.com/chaoss/augur/blob/49a008ab97c43472339e400cb316a5323110d78d/augur/tasks/github/facade_github/tasks.py#L160


This "list of commits" behavior can be achieved with a postgres string aggregate function, to comma-separate the hashes in the single existing column. 8Knot already uses this behavior in other parts of its code and it is something Cali is familiar with.

	SELECT DISTINCT
	commits.cmt_author_name AS NAME,
	commits.cmt_commit_hash AS hash,
	commits.cmt_author_raw_email AS email_raw,
	'not_unresolved' as resolution_status
	FROM
	commits
	WHERE
	commits.repo_id = :repo_id
	AND (:since_date is NULL OR commits.data_collection_date > :since_date)
	AND (NOT EXISTS ( SELECT contributors.cntrb_canonical FROM contributors WHERE contributors.cntrb_canonical = commits.cmt_author_raw_email )
	or NOT EXISTS ( SELECT contributors_aliases.alias_email from contributors_aliases where contributors_aliases.alias_email = commits.cmt_author_raw_email)
	AND ( commits.cmt_author_name ) IN ( SELECT C.cmt_author_name FROM commits AS C WHERE C.repo_id = :repo_id GROUP BY C.cmt_author_name ))
	GROUP BY
	commits.cmt_author_name,
	commits.cmt_commit_hash,
	commits.cmt_author_raw_email
	UNION
	SELECT DISTINCT
	commits.cmt_author_name AS NAME,--commits.cmt_id AS id,
	commits.cmt_commit_hash AS hash,
	commits.cmt_author_raw_email AS email_raw,
	'unresolved' as resolution_status
	FROM
	commits
	WHERE
	commits.repo_id = :repo_id
	AND (:since_date is NULL OR commits.data_collection_date > :since_date)
	AND EXISTS ( SELECT unresolved_commit_emails.email FROM unresolved_commit_emails WHERE unresolved_commit_emails.email = commits.cmt_author_raw_email )
	AND ( commits.cmt_author_name ) IN ( SELECT C.cmt_author_name FROM commits AS C WHERE C.repo_id = :repo_id GROUP BY C.cmt_author_name )
	GROUP BY
	commits.cmt_author_name,
	commits.cmt_commit_hash,
	commits.cmt_author_raw_email
	ORDER BY
	hash
	""").bindparams(repo_id=repo_id,since_date=last_collected_date)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

order contributors to resolve by unique email for improved efficiency and github API fault tolerance #3780

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

order contributors to resolve by unique email for improved efficiency and github API fault tolerance #3780

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions