Feat/event based instance replacement #434

Conversation

mello7tre
Contributor

Issue Type

  • Bugfix Pull Request

Summary

A problem arises if the ASG has a lifecycle hook on terminating instances and
the lifecycle action takes more time than the AutoSpotting schedule.

When an instance is terminating, the lifecycle hook (LH) puts it in the
lifecycle state Terminating:Wait (and later Terminating:Proceed).
But during that time, until the LH completes, the instance status is still running.

So AutoSpotting, when counting running instances, will count that
instance too.

This causes multiple problems:
* a spot instance can be terminated even if it is still needed
* an instance can be terminated even if it is already in a terminating state
* spot instances are launched even if they are not needed

The fix is very simple:
in the "scanInstances" func, do not execute "a.instances.add" if the instance
lifecycle state begins with Terminating.
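
Below is a minimal Go sketch of that guard, assuming the AWS SDK for Go (v1) autoscaling.Instance type; the helper name and the surrounding scanInstances loop are illustrative, not the actual AutoSpotting code:

```go
package autospotting

import (
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// skipTerminating reports whether an ASG instance should be ignored when
// counting running capacity, because its lifecycle state begins with
// "Terminating" (covers Terminating, Terminating:Wait and Terminating:Proceed).
func skipTerminating(i *autoscaling.Instance) bool {
	return strings.HasPrefix(aws.StringValue(i.LifecycleState), "Terminating")
}

// Inside scanInstances the guard would look roughly like:
//
//	if skipTerminating(inst) {
//		continue // never reaches a.instances.add
//	}
```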

Code contribution checklist

  1. [x] I hereby allow the Copyright holder the rights to distribute this piece of
    code under any software license.
  2. The contribution fixes a single existing github issue, and it is linked
    to it.
  3. The code is as simple as possible, readable and follows the idiomatic Go
    guidelines.
  4. All new functionality is covered by automated test cases so the overall
    test coverage doesn't decrease.
  5. No issues are reported when running make full-test.
  6. Functionality not applicable to all users should be configurable.
  7. Configurations should be exposed through Lambda function environment
    variables which are also passed as parameters to the
    CloudFormation
    and
    Terraform
    stacks defined as infrastructure code.
  8. Global configurations set from the infrastructure stack level should also
    support per-group overrides using tags.
  9. Tag names and expected values should be similar to the other existing
    configurations.
  10. Both global and tag-based configuration mechanisms should be tested and
    proven to work using log output from various test runs.
  11. The logs should be kept as clean as possible (use log levels as
    appropriate) and formatted consistently to the existing log output.
  12. The documentation is updated to cover the new behavior, as well as the
    new configuration options for both stack parameters and tag overrides.
  13. A code reviewer reproduced the problem and can confirm the code
    contribution actually resolves it.

Chris and others added 30 commits August 19, 2019 19:16
UserData wrappers for CloudFormation helper scripts when using Beanstalk (LeanerCloud#366)

* Support custom role for cfn-init in Beanstalk UserData

* Wrappers & refactoring

* Docs

* Docs fixes

* More docs fixes

* Docs fixes

* Yet more docs fixes

* Lambda & Kubernetes config

* AutoSpottingElasticBeanstalk managed policy + Rename beanstalk_cfn_init_role to beanstalk_cfn_wrappers

* Update param description

* Kubernetes config fix

* Rename Beanstalk variable + Move test data

* Add missing permission for AutoSpottingBeanstalk role
As instances can be launched concurrently/overlapping, we have
problems related to multiple Lambdas acting on the same ASG.
Created a queue in the CF template.
Began sending messages.
Use it only if the AttachInstances method fails because of a wrong ASG max size.
- Ported to StackSet deploy mode.

- Fix to "ScalingActivityInProgress" if launched ondemand instance is
terminated before going "inservice":

	Created method "waitForInstanceStatus", that wait, until a max
	retry (5), that an instance belonging to an ASG is in the desired
	status (InService).
	(Sleep time is 5*retry)

- Fix to multiple problems caused by lambda concurrency changing ASG
MaxSize:
	Created another Lambda (LambdaManageASG) in the same region of
	the main one; code python3.7, inline in template.
	Lambda concurrency is set to one.
	Is function is to change ASG MaxSize by the ammount specified.

	Used a "try-catch approach":
	if AttachInstances return error code "ValidationError" and
	string "update the AutoScalingGroup sizes" is present in error
	message, means that we need to increase ASG MaxSize.
	So we execute method changeAutoScalingMaxSize that invoke
	LambdaManageASG.

	If invoke return error "ErrCodeTooManyRequestsException", means
	that multiple Main lambdas are executing LambdaManageASG, we
	sleep for a random interval part of seconds and retry.

	Method attachSpotInstance now return an int that represent the
	ammount of change to the ASG MaxSize, so that in
	swapWithGroupMember we can defer the call to
	changeAutoScalingMaxSize to decrease the ASG MaxSize of the
	previously increased ammount.

	We use "waitForInstanceStatus" in attachSpotInstance before
	returning to be sure that spot instance has been attached and
	are InService before beginning to terminate ondemand one.
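
A rough Go sketch of the wait/retry behaviour described above, using the AWS SDK for Go (v1) AutoScaling client; the function name follows the commit message, but the signature and error handling are illustrative:

```go
package autospotting

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// waitForInstanceStatus polls the ASG until the given instance reaches the
// desired lifecycle state (e.g. "InService"), retrying up to maxRetry times
// and sleeping 5*retry seconds between attempts.
func waitForInstanceStatus(svc *autoscaling.AutoScaling, asgName, instanceID, desired string, maxRetry int) error {
	for retry := 1; retry <= maxRetry; retry++ {
		out, err := svc.DescribeAutoScalingGroups(&autoscaling.DescribeAutoScalingGroupsInput{
			AutoScalingGroupNames: []*string{aws.String(asgName)},
		})
		if err == nil && len(out.AutoScalingGroups) > 0 {
			for _, i := range out.AutoScalingGroups[0].Instances {
				if aws.StringValue(i.InstanceId) == instanceID &&
					aws.StringValue(i.LifecycleState) == desired {
					return nil
				}
			}
		}
		time.Sleep(time.Duration(5*retry) * time.Second)
	}
	return fmt.Errorf("instance %s did not reach %s", instanceID, desired)
}
```

The MaxSize handling would sit around AttachInstances, checking the returned error for the "ValidationError" code before invoking LambdaManageASG, as described above.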
- Add logPrefix to better identify Lambda actions in case of concurrent
executions.
	TE = Spot terminate event
	ST = Instance start event
	SC = Schedule event

TE and ST are followed by the instance ID; SC is followed by the activation time.
- fix/improvement of the suspend/resume termination process:
suspend/resume is now handled by LambdaManageASG too.

When I successfully suspend termination, I add a tag
"autospotting_suspend_process_by" whose value is the instanceId of the
event that triggered the main Lambda.
When I try to resume termination, I first check whether the value of the tag
above equals the current one.
If not, it means that another main Lambda, different from the one that suspended
it, is trying to resume the process; in that case I do not resume it.

Considering that LambdaManageASG has a concurrent execution limit of
1, we can have the following cases:

1)
  *) A main Lambda suspends the process.
  *) No other Lambda suspends it.
  *) The main Lambda resumes the process.
  *) Another main Lambda suspends the process.
  *) ...and so on.

2)
  *) A main Lambda suspends the process.
  *) Before the first one resumes the process, another one suspends it
  (thus replacing the tag value).
  *) The first Lambda does not resume the process (the tag values differ).
  *) The second Lambda resumes the process.

In the worst-case scenario of a Lambda dying before resuming the process, it
will be resumed after another one suspends it.
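
A simplified Go sketch of the tag-guarded resume; the tag key comes from the commit message, while the helper shape and the resume callback (e.g. a LambdaManageASG invocation) are placeholders:

```go
package autospotting

// resumeIfOwner resumes the ASG's Terminate process only when the suspend tag
// still carries the instance ID of the event that suspended it, so a Lambda
// never resumes a suspension created by a different invocation.
func resumeIfOwner(asgTags map[string]string, myInstanceID string, resume func() error) error {
	const tagKey = "autospotting_suspend_process_by"
	if asgTags[tagKey] != myInstanceID {
		// Another main Lambda re-suspended the process; leave it suspended.
		return nil
	}
	return resume()
}
```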
- for the rand seed, instead of using time.Now().UnixNano() we build a seed
based on the instanceId that triggered the event.

The seed is built this way:
for every char of the instanceId (starting from the third char) we take its
rune value and add it to the previous sum.
We use that as a temporary seed and get a random number between 0 and 9.

The final seed is the concatenation of the generated random numbers.

This way we have a seed number (int64) that depends on the instanceId and
has the same length.
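
A Go sketch of this seed derivation, assuming a well-formed "i-..." instance ID; the exact production code may differ:

```go
package autospotting

import (
	"math/rand"
	"strconv"
)

// seedFromInstanceID builds a deterministic int64 seed from an instance ID:
// starting at the third character, each rune value is added to a running sum
// that seeds a temporary RNG producing a digit between 0 and 9; the digits
// are concatenated to form the final seed.
func seedFromInstanceID(instanceID string) int64 {
	var digits string
	var sum int64
	for _, r := range instanceID[2:] { // skip the "i-" prefix
		sum += int64(r)
		tmp := rand.New(rand.NewSource(sum))
		digits += strconv.Itoa(tmp.Intn(10))
	}
	// Parse errors (e.g. overflow for very long IDs) are ignored for brevity.
	seed, _ := strconv.ParseInt(digits, 10, 64)
	return seed
}
```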
- no more need to "revert attach/detach order when running on minimum
capacity":
defer changeAutoScalingMaxSize and use the same logic as
swapWithGroupMember.
- Use suspendResumeProcess for the scheduled replaceOnDemandInstanceWithSpot
too.

We use it as a trick to avoid rare cases of concurrency between
scheduled and event Lambdas.
As the Lambda that handles "suspendResumeProcess" has a concurrency limit of
one, scheduled and event Lambdas, if concurrent, will be "time shifted"
by a random value.
This way they will not execute attachSpotInstance at the same time.

In the case of a scheduled Lambda, we add the "S" char to the instanceId used for
the randSeed to avoid resuming a process suspended by an event Lambda.

- Fix in swapWithGroupMember:
defer asg.suspendResumeProcess for the resume so that it is executed
even if swapWithGroupMember returns an error.
Need to invert the test for needReplaceOnDemandInstances and
onDemandInstance == nil.

Currently, if onDemandInstance == nil the method returns and execution stops.
This way the check that terminates a spot instance when there are more of
them than required is never done.

Assume the ASG scales down and terminates the on-demand instances that
AutoSpotting is not terminating [autospotting_min_on_demand_number].
On the next runs onDemandInstance will be nil and the excess spot instances
will not be terminated.
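
A compact sketch of the re-ordered checks, with placeholder types and callbacks standing in for the real instance type and ASG methods:

```go
package autospotting

type instance struct{} // placeholder for the real instance type

// cronEventActionSketch evaluates whether replacement is still needed before
// bailing out on a nil onDemandInstance, so that excess spot capacity is
// still culled when the ASG scaled down on its own.
func cronEventActionSketch(needReplacement bool, onDemandInstance *instance,
	terminateRandomSpotIfEnough, replaceWithSpot func()) {

	if !needReplacement {
		terminateRandomSpotIfEnough() // this check was previously unreachable
		return
	}
	if onDemandInstance == nil {
		return // nothing left to replace right now
	}
	replaceWithSpot()
}
```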
asg.terminateRandomSpotInstanceIfHavingEnough

There is no need to call asg.terminateRandomSpotInstanceIfHavingEnough
in terminateUnneededSpotInstance.
We already execute spotInstance.terminate on the unattached spot
instance.
As terminateInstanceInAutoScalingGroup is also used for terminating spot
instances, changed the comment and logging message.
A problem arises if the ASG has a lifecycle hook on terminating instances and
the lifecycle action takes more time than the AutoSpotting schedule.

When an instance is terminating, the lifecycle hook (LH) puts it in the
lifecycle state Terminating:Wait (and later Terminating:Proceed).
But during that time, until the LH completes, the instance status is still running.

So AutoSpotting, when counting running instances, will count that
instance too.

This causes multiple problems:
* a spot instance can be terminated even if it is still needed
* an instance can be terminated even if it is already in a terminating state
* spot instances are launched even if they are not needed

The fix is very simple:
in the "scanInstances" func, do not execute "a.instances.add" if the instance
lifecycle state begins with Terminating.
@cristim (Member) left a comment

Always love a simple fix for a tricky edge case.

@cristim cristim merged commit da91697 into LeanerCloud:feat/event-based-instance-replacement Aug 12, 2020
cristim added a commit that referenced this pull request Oct 1, 2020
* Implement event-based instance replacement

Context:

This change attempts to replace on-demand instances with spot as soon as they
are launched (triggered by the on-demand instance entering the pending state),
in order to avoid running expensive on-demand capacity in parallel with the
newly launched spot instances. This should cover new groups out of the box, as
well as scaling out or replacement of terminated spot instances.

It has a few nice side effects concerning cost, since on-demand instances are
likely to run for just a few seconds, as well as potentially tackling some
known issues with ElasticBeanstalk(#343, #344) and deletion of CloudFormation
stacks(#332).

It may also trick the ASG into running the start lifecycle hooks(#296), and
will immediately replace any scaled out instances, partially addressing #284.

The current/legacy replacement logic will still kick in when enabling
AutoSpotting for the first time on an existing group, as well as when spot
instances fail to be launched so the group must failover to on-demand capacity
in order to avoid downtime.

Once the legacy replacement logic is done on the current group, the group is
tagged with a configuration tag that enables the event logic. This tag is also
set for all groups newer than an hour, so their instances are automatically
replaced by the event-based replacement logic.

Current status
- changed the regional CloudFormation stack to intercept EC2 instances
  as soon as they enter the 'pending' and 'running' states
- implemented handling code for these events, which launches a replacement spot
  instance for each on-demand instance currently in the pending state and performs
  the usual instance-swapping logic when a spot instance enters the running
  state, without waiting for it to exit its grace period.
- largely kept the initial instance replacement logic as it is (waiting for
  the grace period) to avoid introducing regressions, and also in case the
  events are not properly propagated, such as with misconfigurations.
- it compiles and was executed against real infrastructure and is
  working as expected for the couple of groups I tested it against.
- still missing test coverage for the new code, and some existing tests were
  crashing and had to be disabled for now
- implemented support for handling the AutoScaling lifecycle hooks when
  replacing new on-demand instances in event-based execution mode
- added support for running it locally against custom JSON event data in the
  CloudWatch Events format: https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/EventTypes.html#ec2_event_type

ToDo:
- test against more infrastructure
- iron out any crashes
- write automated tests for the new code and re-enable older test cases so
  that coverage doesn't drop
- port regional CloudFormation stack changes to the main template so
  it can also run from a StackSet
- update the documentation to mention this new behavior

* Further improvements

Only handle the "running" events of new instances because the instance
information wasn't filled while the instance was in pending state.

Also run spot attachment in parallel with detachment of the on-demand
instance in order to avoid race conditions with the group's process that
enforces the desired capacity like when doing multiple instances in parallel
for new groups or multiple instance launches.
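
A minimal sketch of consuming these EC2 instance state-change events with the aws-lambda-go types, filtering for the "running" state as described; the downstream replacement call is only hinted at:

```go
package main

import (
	"context"
	"encoding/json"
	"log"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
)

// ec2StateChange mirrors the "detail" payload of an
// "EC2 Instance State-change Notification" event.
type ec2StateChange struct {
	InstanceID string `json:"instance-id"`
	State      string `json:"state"`
}

func handle(ctx context.Context, ev events.CloudWatchEvent) error {
	var detail ec2StateChange
	if err := json.Unmarshal(ev.Detail, &detail); err != nil {
		return err
	}
	// Pending-state events are ignored because instance details aren't
	// fully populated yet; only "running" triggers the swap logic.
	if detail.State != "running" {
		return nil
	}
	log.Printf("instance %s is running, evaluating for spot replacement", detail.InstanceID)
	// ... on-demand/spot swap would run here ...
	return nil
}

func main() { lambda.Start(handle) }
```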

* More improvements

- Suspend termination processes on ASG while swapping instances
- Handle launch lifecycle hooks of unattached spot instances by
  intercepting the API call failures from CloudTrail

* Fix CloudWatch event rule

* Allow flavor to be customised (#359)

* Fix typo (#361)

* Increase automated testing coverage

- Refactor the cron execution mode to support determining the action and
running the action as independent steps, which makes the code easier to
test.
- switch to Go 1.13
- Suspend the AutoScaling termination process for 5 minutes
- Reset the hourlySavings value between tests
- Improve mocking
- Print log statements during tests when the AUTOSPOTTING_DEBUG variable
is set to "true" and tests are run in verbose mode
- change isProtectedFromTermination to return true in case of API
failures.

* Suspend termination processes for 5 minutes

* Feat/event based instance replacement (#371)

* Allow flavor to be customised (#359)

* fix typo (#360)

* Fix typo (#361)

* UserData wrappers for CloudFormation helper scripts when using Beanstalk (#366)

* Support custom role for cfn-init in Beanstalk UserData

* Wrappers & refactoring

* Docs

* Docs fixes

* More docs fixes

* Docs fixes

* Yet more docs fixes

* Lambda & Kubernetes config

* AutoSpottingElasticBeanstalk managed policy + Rename beanstalk_cfn_init_role to beanstalk_cfn_wrappers

* Update param description

* Kubernetes config fix

* Rename Beanstalk variable + Move test data

* Add missing permission for AutoSpottingBeanstalk role

* Bring Go to 1.13 (#367)

* merge fixes

* Begin - port regional CloudFormation stack changes to the main template so it can also run from a StackSet

* Merge fix

* Fix to template

* Multiple fixes to handle lambda concurrency

As instances can be launched concurrently/overlapping, we have
problems related to multiple Lambdas acting on the same ASG.

* Removed commented code

* Begin working on:

#354 (comment)

* Progress #354

Created Queue in CF template
Begin sending message

* Progress #354

* Progress #354

* Progress #354

* Progress #354

* Progress #354

* * Use inline python lambda to increase ASG

Use it only if the AttachInstances method fails because of a wrong ASG max size.

* * Use inline python lambda to increase ASG

progress

* * Use inline python lambda to increase ASG

progress

* * Improvements on event based instance replacement

- Ported to StackSet deploy mode.

- Fix to "ScalingActivityInProgress" if launched ondemand instance is
terminated before going "inservice":

	Created method "waitForInstanceStatus", that wait, until a max
	retry (5), that an instance belonging to an ASG is in the desired
	status (InService).
	(Sleep time is 5*retry)

- Fix to multiple problems caused by lambda concurrency changing ASG
MaxSize:
	Created another Lambda (LambdaManageASG) in the same region of
	the main one; code python3.7, inline in template.
	Lambda concurrency is set to one.
	Is function is to change ASG MaxSize by the ammount specified.

	Used a "try-catch approach":
	if AttachInstances return error code "ValidationError" and
	string "update the AutoScalingGroup sizes" is present in error
	message, means that we need to increase ASG MaxSize.
	So we execute method changeAutoScalingMaxSize that invoke
	LambdaManageASG.

	If invoke return error "ErrCodeTooManyRequestsException", means
	that multiple Main lambdas are executing LambdaManageASG, we
	sleep for a random interval part of seconds and retry.

	Method attachSpotInstance now return an int that represent the
	ammount of change to the ASG MaxSize, so that in
	swapWithGroupMember we can defer the call to
	changeAutoScalingMaxSize to decrease the ASG MaxSize of the
	previously increased ammount.

	We use "waitForInstanceStatus" in attachSpotInstance before
	returning to be sure that spot instance has been attached and
	are InService before beginning to terminate ondemand one.

* * Improvements on event based instance replacement

- Add logPrefix to better identify Lambda actions in case of concurrent
executions.
	TE = Spot terminate event
	ST = Instance start event
	SC = Schedule event

TE and ST are followed by the instance ID; SC is followed by the activation time.

* * Improvements on event based instance replacement

- fix/improvement of the suspend/resume termination process:
suspend/resume is now handled by LambdaManageASG too.

When I successfully suspend termination, I add a tag
"autospotting_suspend_process_by" whose value is the instanceId of the
event that triggered the main Lambda.
When I try to resume termination, I first check whether the value of the tag
above equals the current one.
If not, it means that another main Lambda, different from the one that suspended
it, is trying to resume the process; in that case I do not resume it.

Considering that LambdaManageASG has a concurrent execution limit of
1, we can have the following cases:

1)
  *) A main Lambda suspends the process.
  *) No other Lambda suspends it.
  *) The main Lambda resumes the process.
  *) Another main Lambda suspends the process.
  *) ...and so on.

2)
  *) A main Lambda suspends the process.
  *) Before the first one resumes the process, another one suspends it
  (thus replacing the tag value).
  *) The first Lambda does not resume the process (the tag values differ).
  *) The second Lambda resumes the process.

In the worst-case scenario of a Lambda dying before resuming the process, it
will be resumed after another one suspends it.

* * Improvements on event based instance replacement

code cosmetic fix

* * Improvements on event based instance replacement

- for the rand seed, instead of using time.Now().UnixNano() we build a seed
based on the instanceId that triggered the event.

The seed is built this way:
for every char of the instanceId (starting from the third char) we take its
rune value and add it to the previous sum.
We use that as a temporary seed and get a random number between 0 and 9.

The final seed is the concatenation of the generated random numbers.

This way we have a seed number (int64) that depends on the instanceId and
has the same length.

* * Improvements on event based instance replacement

- no more need to "revert attach/detach order when running on minimum
capacity":
defer changeAutoScalingMaxSize and use the same logic as
swapWithGroupMember.

* * Improvements on event based instance replacement

- Use suspendResumeProcess for the scheduled replaceOnDemandInstanceWithSpot
too.

We use it as a trick to avoid rare cases of concurrency between
scheduled and event Lambdas.
As the Lambda that handles "suspendResumeProcess" has a concurrency limit of
one, scheduled and event Lambdas, if concurrent, will be "time shifted"
by a random value.
This way they will not execute attachSpotInstance at the same time.

In the case of a scheduled Lambda, we add the "S" char to the instanceId used for
the randSeed to avoid resuming a process suspended by an event Lambda.

- Fix in swapWithGroupMember:
defer asg.suspendResumeProcess for the resume so that it is executed
even if swapWithGroupMember returns an error.

* * gofmt cosmetic changes

* Fix to rand.Intn parameter to include even 9

Co-authored-by: Chris Farrenden <cfarrend@gmail.com>
Co-authored-by: Rubi <14269809+codenoid@users.noreply.github.com>
Co-authored-by: Jawad <jawad.stouli@gmail.com>
Co-authored-by: Gábor Lipták <gliptak@gmail.com>

* Feat/event based instance replacement - Fix to terminateRandomSpotInstance logic in scheduled runs + specify the on-demand price multiplier at the ASG group level (#425)

* fix terminateRandomSpotInstanceIfHavingEnough

need another condition to avoid terminating a valid spot instance when the ASG has minOnDemand > 0

if all ASG instances are in the running state,
  the minimum on-demand count equals the total on-demand instances running, and
  the number of running instances equals the desired capacity,
then there is no need to terminate a spot instance

needs some testing

* fix terminateRandomSpotInstanceIfHavingEnough

fixes

* fix terminateRandomSpotInstanceIfHavingEnough

changed allInstancesRunning to also return the number of running on-demand instances
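
The extra guard boils down to a boolean condition; a hedged Go sketch, with the counts assumed to come from an allInstancesRunning-style helper:

```go
package autospotting

// spotCapacityIsCorrect is true when every instance in the group is running,
// the running on-demand count matches the group's minimum on-demand setting
// and the total matches the desired capacity - in that case no spot instance
// should be terminated.
func spotCapacityIsCorrect(allRunning bool, runningOnDemand, minOnDemand, totalRunning, desiredCapacity int64) bool {
	return allRunning &&
		runningOnDemand == minOnDemand &&
		totalRunning == desiredCapacity
}
```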

* Specify/override the multiplier for the on-demand price at the group
level

As i.price has already been multiplied by the global value, if one is
specified I first need to divide it by that value and then multiply
it by the multiplier specific to the ASG.

We need to do this for both scheduled and event actions.
For events we act in the function belongsToEnabledASG.
For scheduled runs we act in launchSpotReplacement.

Needs deep testing...

* Specify/override the multiplier for the on-demand price at the group
    level - fix

in loadConfOnDemandPriceMultiplier we need to use
a.config.OnDemandPriceMultiplier in place of
a.region.conf.OnDemandPriceMultiplier

this way a.region.conf.OnDemandPriceMultiplier will preserve the
original global value
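
A small Go sketch of the re-basing arithmetic; the function is illustrative, while the real code applies this in belongsToEnabledASG and launchSpotReplacement as noted above:

```go
package autospotting

// adjustedOnDemandPrice removes the already-applied global multiplier from a
// price and applies the per-ASG override instead.
func adjustedOnDemandPrice(price, globalMultiplier, asgMultiplier float64) float64 {
	if globalMultiplier != 0 {
		price /= globalMultiplier
	}
	return price * asgMultiplier
}
```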

* Misc fixes to be able to run go tests

* Misc fixes to be able to run go tests - continue

* Misc fixes to be able to run go tests - end for now

* For Test_autoScalingGroup_terminateRandomSpotInstanceIfHavingEnough
added the condition:
"spot capacity is correct, skip termination"

Co-authored-by: Chris <chris.farrenden@domain.com.au>
Co-authored-by: 0x11 <14269809+codenoid@users.noreply.github.com>
Co-authored-by: Jawad <jawad.stouli@gmail.com>
Co-authored-by: Gábor Lipták <gliptak@gmail.com>

* Feat/event based instance replacement - Merged all "new" commits from master (#433)

* Merge Updating AWS SDK to latest version - [0ee95c8]

* Merge Relicense to OSL-3 - [972bc61]

* Merge Update readme, mention the relicensing to OSL-3 - [9b438dc]

* Merge Fix `make archive` on macOS [c929391]

* Merge Move some logs to debug [fa0c27b]

* Merge Delete old Gopkg files [5504e34]

* Merge Update dependencies [0e14a7b]

* Merge Support spot price buffer percentage of 0 [26ca955]

* Merge Don't segfault when spot instance doesn't belong to ASG [f313c6d]

* Merge Allow specifying GOOS and GOARCH [8d67d6e]

* Merge Use /bin/bash for shell [0eed8cd]

* Merge Ignore terminating spot instances that don't belong to
AutoSpotting (master) and Enable Spot Termination ASG Checking (event) [33a444c]

* Merge Revert Use /bin/bash for shell [6e45440]

* Merge Pass --abort-on-container-exit to docker-compose [8f90fba]

* Merge Rename travisci make targets to ci [2d76a9a]

* Merge Create FUNDING.yml + Update FUNDING.yml [dde6d85,ca81828]

* Merge Use paginated version of DescribeSpotPriceHistory [9a770b3]

* Merge Delete DescribeSecurityGroups mock [3bfd542]

* Merge Move config loading out of main and add tests for it [30a4392]

* Fixes to Merge Move config loading out of main and add tests for it

* Merge Move logs about incompatible instance types to debug [985d675]

* Merge Remove incorrect Makefile conditionals [a01ee26]

* Merge Actually fail the build if gofmt fails [c391d69]

* Merge Add tools to go.mod [8a9a90b]

* Merge No longer enforce the name of the ElasticBeanstalk IAM policy [3313298]

* Merge Added spot premium [ecf31a5]

* Merge Update how bid price is calculated for premium instances [8d13fb0]

* Merge Update dependencies [caf373d]

* Merge Cron timezone [8e04942]

* Merge Spleling [a37aafc]

* Merge Update README.md [0109a0b,fb15aa4,4e03db7]

* Merge Use the larger of min-OD-instances and (min-OD-percent * current)
[25ac31f]

* gofmt changes

* cronEventAction - fix to logic

Need to invert the test for needReplaceOnDemandInstances and
onDemandInstance == nil.

Currently, if onDemandInstance == nil the method returns and execution stops.
This way the check that terminates a spot instance when there are more of
them than required is never done.

Assume the ASG scales down and terminates the on-demand instances that
AutoSpotting is not terminating [autospotting_min_on_demand_number].
On the next runs onDemandInstance will be nil and the excess spot instances
will not be terminated.

Co-authored-by: Chris <chris.farrenden@domain.com.au>
Co-authored-by: 0x11 <14269809+codenoid@users.noreply.github.com>
Co-authored-by: Jawad <jawad.stouli@gmail.com>
Co-authored-by: Gábor Lipták <gliptak@gmail.com>

* Feat/event based instance replacement (#434)

* terminateUnneededSpotInstance - no need to call
asg.terminateRandomSpotInstanceIfHavingEnough

There is no need to call asg.terminateRandomSpotInstanceIfHavingEnough
in terminateUnneededSpotInstance.
We already execute spotInstance.terminate on the unattached spot
instance.

* terminateInstanceInAutoScalingGroup is used for spot too

As terminateInstanceInAutoScalingGroup is also used for terminating spot
instances, changed the comment and logging message.

* Avoid adding instances in the Terminating lifecycle state

A problem arises if the ASG has a lifecycle hook on terminating instances and
the lifecycle action takes more time than the AutoSpotting schedule.

When an instance is terminating, the lifecycle hook (LH) puts it in the
lifecycle state Terminating:Wait (and later Terminating:Proceed).
But during that time, until the LH completes, the instance status is still running.

So AutoSpotting, when counting running instances, will count that
instance too.

This causes multiple problems:
* a spot instance can be terminated even if it is still needed
* an instance can be terminated even if it is already in a terminating state
* spot instances are launched even if they are not needed

The fix is very simple:
in the "scanInstances" func, do not execute "a.instances.add" if the instance
lifecycle state begins with Terminating.

Co-authored-by: Chris <chris.farrenden@domain.com.au>
Co-authored-by: 0x11 <14269809+codenoid@users.noreply.github.com>
Co-authored-by: Jawad <jawad.stouli@gmail.com>
Co-authored-by: Gábor Lipták <gliptak@gmail.com>

* terminateUnneededSpotInstance - fix for the unused "total" variable (#435)

* Add missing import for ioutil

* Fix build and tests

* Fix linter errors reported by `make ci-docker`

* Address a few codeclimate issues

Co-authored-by: Chris <chris.farrenden@domain.com.au>
Co-authored-by: Cristian Magherusan-Stanciu <cmagh@amazon.com>
Co-authored-by: mello7tre <mello+github@ankot.org>
Co-authored-by: Chris Farrenden <cfarrend@gmail.com>
Co-authored-by: Rubi <14269809+codenoid@users.noreply.github.com>
Co-authored-by: Jawad <jawad.stouli@gmail.com>
Co-authored-by: Gábor Lipták <gliptak@gmail.com>