[BUILD-1003] Rollback from triggerRemoteJob plugin to GWT plugin Created: 26/Jan/23  Updated: 03/Feb/23

Status: Open
Project: Build
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Neutral
Reporter: Maxime Michel Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screenshot 2023-01-26 at 13.16.57.png     PNG File image-2023-02-02-18-24-28-054.png     Text File log.txt    
Issue Links:
relation
is related to BUILD-916 Trigger a Norsu build (SRE Jenkins) a... Closed
Template:
Acceptance criteria:
Empty
Date of First Response:

 Description   

As discussed in today's Testing&QA meeting, we uncovered two issues with the triggerRemotePlugin that don't exist with the Generic Webhook Trigger plugin:

  1. When a job on the SRE Jenkins instance is already running, the plugin offers no option to trigger a job and add it to the queue. Instead, it just quits silently (see log.txt). One possible solution could be to wait for the upstream job to finish, but that's not acceptable because it would slow down on-premises CI, which is already really long.
  2. When a job on the SRE Jenkins instance is triggered by core CI, we would like to add a text description stating this fact. Currently it's really hard to tell which builds come from the other side (see screenshot): one needs to match timestamps and triggering user.


 Comments   
Comment by Rubén Martín Romero [ 26/Jan/23 ]

Regarding the first point, this seems weird to me, since the behavior should be equivalent to the build trigger, and in that case AFAIK the job is queued by default. In any case, I have had a look at the triggerRemotePlugin options and found this one, which could help us here:

/*
Prevent Remote Build Queue
Wait to trigger remote builds until no other builds are running.
mandatory: no
default: false
*/

//Example:
triggerRemoteJob blockBuildUntilComplete: false, job: '<remote_job>', preventRemoteBuildQueue: true, useCrumbCache: true, useJobInfoCache: true

OTOH, can you share a job in Core Jenkins that uses this remote trigger plugin so I can run some tests?

With respect to the second point, we can easily add (and we should) a build description to the triggered job (the one in SRE Jenkins) that checks whether the triggering user is sre and, if so, shows a message like "Remote triggered", plus the timestamp or any other additional info that you consider useful. This is something that we already have in quite a few jobs, and you can use these examples as reference:
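
A minimal sketch of what such a description step could look like in the triggered job's pipeline; it is not taken from the existing jobs. It assumes that the remote trigger authenticates as a Jenkins user whose ID is `sre` (so the build carries a user cause) and that the snippet runs inside a `script` block or a scripted pipeline. The message text is a placeholder.

```groovy
// Runs in the triggered job on SRE Jenkins (inside `script { ... }` or a scripted pipeline).
// Assumption: builds started from Core Jenkins show up as started by the user 'sre'.
def userCauses = currentBuild.getBuildCauses('hudson.model.Cause$UserIdCause')

boolean remoteTriggered = false
for (cause in userCauses) {
    // Each cause is a JSON-like object exposing userId, userName and shortDescription.
    if (cause.userId == 'sre') {
        remoteTriggered = true
    }
}

if (remoteTriggered) {
    // Placeholder text; it is shown next to the build in the job's build history.
    currentBuild.description = 'Remote triggered (Core Jenkins)'
}
```

A description like this only says that the build came from the other Jenkins, not which source job or build number started it.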

Comment by Maxime Michel [ 30/Jan/23 ]

Regarding the first point, this seems weird to me, since the behavior should be equivalent to the build trigger, and in that case AFAIK the job is queued by default.

The difference is that the GWT pings a proxy pipeline in which we call `build`, which is standard Jenkins behavior, so the build gets added to the queue. With triggerRemoteJob, though, there is logic in the plugin that decides not to trigger anything unless certain conditions are met.

In any case, I have had a look at the triggerRemotePlugin options and found this one, which could help us here:

I'm not sure I understand the option above, so some experimenting could indeed be useful (see below). However, I'm afraid that if it waits for the upstream build to finish, that's going to make core CI even longer, which is not acceptable.

OTOH, can you share a job in Core Jenkins that uses this remote trigger plugin so I can run some tests?

You can make the change for one of the relationships we have registered (boms - cloud-webapp) here: https://git.magnolia-cms.com/projects/BUILD/repos/pipeline-templates/browse/vars/magnoliaDefaultPipeline.groovy#237

Then trigger a build here: https://jenkins.magnolia-cms.com/job/build/job/boms/job/master/

With respect to the second point, we can easily add (and we should) a build description to the triggered job (the one in SRE Jenkins) that checks whether the triggering user is sre and, if so, shows a message like "Remote triggered", plus the timestamp or any other additional info that you consider useful.

This is better than the current situation; however, ideally what we would want for optimal debugging purposes would be to know which upstream is the exact culprit (which job & which build number). This is not possible when the only data available is the username of the bot that triggered the job and the date. Here are the pairs that are currently registered:

  • addon/addons-packs/release/6.2 -> nightly/magnolia-nightly/master
  • build/boms/master -> cloud/magnolia-cloud-webapp/master
  • platform/ce/master -> cloud/norsu/main
  • platform/ce/master -> cloud/magnolia-cloud-webapp/master

 

Comment by Rubén Martín Romero [ 02/Feb/23 ]

Thank you for the update mmichel and also for the job provided to test the remote trigger. Very useful!

I have done some tests now in the afternoon, when the activity in SRE Jenkins is much calmer, and the first thing that I have verified is that the jobs triggered from Core Jenkins are correctly queued in the SRE Jenkins, which is actually the standard behavior of Jenkins regardless of the source that is triggering the job. So although you can see this message in https://jenkins.magnolia-cms.com/job/build/job/boms/job/master/156/console:

The remote job is blocked. Build #2,563 is already in progress (ETA: 26 min). 

I have verified that the job is correctly queued on the SRE Jenkins side:

Therefore, we don't even need to consider adding any additional option to the remote trigger call performed from Core Jenkins, since this statement doesn't match the actual behavior:

  1. When a job on the SRE Jenkins instance is already running, the plugin offers no option to trigger a job and add it to the queue. Instead, it just quits silently (see log.txt). One possible solution could be to wait for the upstream job to finish, but that's not acceptable because it would slow down on-premises CI, which is already really long.

Regarding the second point and your last update:

we would want for optimal debugging purposes would be to know which upstream is the exact culprit (which job & which build number)

How were you previously getting all that information using the GWT? Maybe by adding that info as variables to the request and referencing them in the cause set up in the target webhook job (the intermediate one created in SRE Jenkins)?
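
If the GWT setup worked that way, the intermediate job could look roughly like the sketch below. The `GenericTrigger` options are those of the Generic Webhook Trigger plugin; the variable names `upstream_job` and `upstream_build`, the token, and the target job path are hypothetical, and the sender is assumed to POST a JSON body containing those two fields.

```groovy
pipeline {
    agent any
    triggers {
        GenericTrigger(
            // Extract values from the JSON body of the incoming request (JSONPath).
            genericVariables: [
                [key: 'upstream_job', value: '$.upstream_job'],
                [key: 'upstream_build', value: '$.upstream_build']
            ],
            // The resolved variables end up in the build cause of this intermediate job.
            causeString: 'Triggered by $upstream_job #$upstream_build',
            // Hypothetical token; the sender must pass the same value.
            token: 'hypothetical-token'
        )
    }
    stages {
        stage('Trigger target') {
            steps {
                script {
                    // Contributed variables are also available as environment variables.
                    currentBuild.description = "Upstream: ${env.upstream_job} #${env.upstream_build}"
                }
                // Standard build step: the target build is added to the queue.
                build job: 'cloud/magnolia-cloud-webapp/master', wait: false
            }
        }
    }
}
```

In this sketch the source job and build number are recorded on the intermediate build, and the target build would list that intermediate build as its upstream cause.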

Even if that is the case, I still think it is not justified to roll back to the previous solution based on GWT just because of this, since adding a build description to indicate when the build is triggered from remote (Core Jenkins) IMHO should be enough for us... WDYT mmichel about giving this option a chance and seeing how it works?

I am telling you this because we already had an internal discussion (in SRE) about uninstalling the GWT from our Jenkins, since right now we don't have any job using that feature, and this is something that we still have on the table :S

Comment by Maxime Michel [ 03/Feb/23 ]

As far as the plugin actually triggering jobs and queuing them goes, I don't think the 'manual testing on a quiet afternoon' is an accurate representation of what the plugin is doing in production. After all, we have all seen it not happening in front of us during the meeting.

since adding a build description to indicate when the build is triggered from remote (Core Jenkins) IMHO should be enough for us...

I think this was brought up in a Foundation-DevX meeting 10 days ago, and yesterday again during the #testing-qa meeting as a pain point for everybody, that it's not clear enough which pipelines trigger which in general. In the particular case of cross-Jenkins builds, as I have explained above, multiple source jobs from core Jenkins may trigger target jobs on SRE Jenkins, hence a generic description doesn't cut it. If a developer is trying to troubleshoot why a build is failing and the source is upstream core Jenkins, with your solution they wouldn't know which source job it is, let alone which actual build number. Knowing the build number could tie it to the commit that triggered all of it.

PS: it's been a week now… And I can't find the explicit request in the Foundation-DevX notes anymore. So I'm no longer sure whether DevX requested this from me, whether we agreed to do that during the meeting, or whether it was only discussed with mgeljic. Let me unassign myself anyway until this is requested again because I have better things to do than argue about pluginA vs pluginB. To be honest you should keep the complexity of your domain to yourself, as a non-SRE I don't need to know that you want to install or uninstall one additional plugin.

Comment by Rubén Martín Romero [ 03/Feb/23 ]

Hi mmichel, thanks for your update! Regarding this:

we have all seen it not happening in front of us during the meeting.

Please, can you point me to the case in which the job was not correctly queued? That is standard Jenkins behavior that should always work regardless of the load that Jenkins is dealing with, so if this is not working as expected in some cases, we would need to review that issue in order to fix it. In any case, we will also try to keep an eye on this from our side. BTW, and just to clarify, Jenkins only queues one build per job, and therefore if there is a build already queued, Jenkins will never add another build to the queue for that job. Could this match the case that you saw this morning?
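
A note on the queueing rule, stated from general Jenkins behavior rather than verified on these instances: a new request is merged into an already queued build of the same job when both carry the same parameters, while requests with different parameter values are kept as separate queue items. The sketch below assumes a proxy-style `build` call and a hypothetical string parameter `UPSTREAM_BUILD` declared on the target job; neither is part of the current setup.

```groovy
// Assumption: the target job declares a string parameter named UPSTREAM_BUILD.
// Each source build passes a different value, so Jenkins should not fold the
// request into an item that is already waiting in the queue.
build job: 'cloud/magnolia-cloud-webapp/master',
      wait: false,
      parameters: [
          string(name: 'UPSTREAM_BUILD', value: "${env.JOB_NAME}#${env.BUILD_NUMBER}")
      ]
```

The same parameter would also tell the target build which source job and build number requested it.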

Generated at Sun Feb 11 23:47:14 CET 2024 using Jira 9.4.2#940002-sha1:46d1a51de284217efdcb32434eab47a99af2938b.