-
Notifications
You must be signed in to change notification settings - Fork 519
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(circleci-plugin): incremental data collection #7986
Conversation
Hi, @Nickcw6 . Thanks for your contribution. Do you think it is ready for review? |
Hey @klesh - yes it is |
Great work! I’ve noticed that you’re also collecting unfinished jobs. However, let’s revisit this based on some past discussions with other PPMCs and Committers. We concluded that perhaps we don’t need to collect them at all. Most jobs tend to finish relatively quickly compared to our data collection frequency. Plus, the cherry on top: we don’t analyze pending jobs at all! 😂 So, for simplicity’s sake, I propose that we ignore pending jobs during collection. Keep up the good work, and let’s streamline our process! |
…rom both before & after createdAfter of last collection
…FinalizableEntity to collect data based on previous collection
…n collecting unfinished job details
bca1883
to
93f1170
Compare
@klesh Thanks! I have pushed a small refactor commit. Re: unfinished jobs - not sure I necessarily agree in this case. I'm not super familiar with other CI platforms but jobs in CircleCI are often statistically significant & not always that quick; in our case e.g. we have some relatively long running build jobs, and some of our deployment matchers are configured to look for a specific To ensure we still collect these, whilst not collecting unfinished jobs, we'd need to:
100% agree we should be as aligned as possible but IMO this adds more complexity than the current approach of just collecting everything & recollecting any unfinished jobs... Unless I am missing something - open to suggestions :) |
If I understand your point correctly:
Could you clarify a couple of points for me?
|
@klesh CircleCI is functionally layered as such (where the item to the left contains items to the right): Pipelines -> Workflows -> Jobs -> Commands Workflows and jobs (and to a lesser extent pipelines) are what are statistically significant. Commands are typically small blocks of code that make up jobs - there is no way to collect these and little value in doing so anyway (not sure if these are maybe referred to as jobs in other CI tools which is contributing to a misunderstanding here?) For some context & a more concrete example: a pipeline is created upon merge to main. Within this pipeline we often have a generic workflow called As it is this If a workflow is unfinished, then at least some of the jobs it contains will also be unfinished. With the current approach we'd mark both as such and recollect both next time. If we don't collect the unfinished jobs, then we'd need a mechanism to recollect these jobs based on the workflow being unfinished, which in my mind is more complicated than the current implementation. In terms of your questions:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Summary
What does this PR do?
Incremental data collection for the CircleCI plugin.
Haven't swapped the extractor or converter to be incremental as this is enough for now to make a substantial difference in pipeline duration, but something to consider.
Does this close any open issues?
N/A
Screenshots
Full data collection:
Incremental data collection ran afterwards:
Other Information
Had to make some adjustments to the existing
NewStatefulApiCollectorForFinalizableEntity
function, namely:createdAfter
) or valid (created aftercreatedAfter
), extended this to check each item in turn if required to prevent duplicate records.created_at
property - we have to usestarted_at
which may be null, therefore also added checks fortime.IsZero()
. We still collect the data if this is zero.There are some annoying design choices around the jobs collection worth noting, we use the
workflow/{id}/jobs
endpoint to collect all jobs for a workflow on a single endpoint - as mentioned above this lacks a definitivecreated_at
property, so we usestarted_at
which may be null. Couple of ways round this:job/{id}
endpoint.started_at
is null for a selection of statuses we care about (e.g.running
), then we give this a temp timestamp oftime.Now()
IsZero()
and collect these records anywayblocked
orunauthorized
jobs)created_at
of the workflow (created_date
on_tool_circleci_workflows
) and extend_tool_circleci_jobs
with aworkflow_created_date
columnI couldn't get 4) to work nicely in the
GetCreated
function, so have opted for 3), but open to other ideas/suggestions on how this could be implemented.