Implement event-based instance replacement #354

Merged (19 commits) on Oct 1, 2020

Conversation

@cristim (Member) commented Jul 16, 2019

Context

This change attempts to replace on-demand instances with spot as soon as they
are launched (triggered by the on-demand instance entering the running state),
in order to avoid running expensive on-demand capacity in parallel with the
newly launched spot instances. This should cover new groups out of the box, as
well as scaling out or replacement of terminated spot instances.

It has a few nice side effects concerning cost, since on-demand instances are
likely to run for just a few seconds, as well as potentially tackling some
known issues with ElasticBeanstalk (#343, #344) and deletion of CloudFormation
stacks (#332).

In addition, it implements support for handling the start lifecycle hooks (#296) by
similarly intercepting the lifecycle hook completion API failure event received
from CloudTrail, and will immediately replace any scaled-out instances, partially
addressing #284.

The current/legacy replacement logic will still kick in when enabling
AutoSpotting for the first time on an existing group, as well as when spot
instances fail to be launched and the group must fail over to on-demand capacity
in order to avoid downtime.

Once the legacy replacement logic is done on the current group, the group is
tagged with a configuration tag that enables the new event-based logic. This tag
is also set for all groups newer than an hour so their instances are automatically
replaced by the event-based replacement logic, and it can be used to manually revert
a group back to the legacy logic in case of problems with the event-based implementation.

Current status

  • changed the regional CloudFormation stack to intercept EC2 instances as
    soon as they enter the 'pending' and 'running' states
  • implemented handling code for these events, which launches a replacement spot
    instance for each on-demand instance currently in the pending state and performs
    the usual instance swapping logic when a spot instance enters the running
    state, without waiting for it to exit its grace period.
  • largely kept the initial instance replacement logic as it is (waiting for
    the grace period) to avoid introducing regressions, and also in case the
    events are not properly propagated, such as in case of misconfiguration.
  • implemented a configuration override that allows users to disable the
    event-based logic on a per-group level by tagging the group.
  • this code was executed against real infrastructure and is working as
    expected for the couple of groups I tested it against.
  • still missing test coverage for the new code, and some existing tests were
    crashing and had to be disabled for now
  • implemented support for handling the AutoScaling lifecycle hooks when
    replacing new on-demand instances in event-based execution mode, based on
    CloudTrail events. As a caveat, the lifecycle hook handling will be simulated
    with a delay of up to 15 minutes, because of the CloudTrail event propagation
    latency.
  • added support for running it locally against custom JSON event data in the
    CloudWatch Event format (see the sketch after this list):
    https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/EventTypes.html#ec2_event_type
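
For reference, a minimal, hedged sketch of what feeding such a custom JSON event to the code locally could look like: it just decodes a sample EC2 instance state-change notification (the documented CloudWatch Events shape) using the aws-lambda-go events package that the code above already uses. The account number and instance ID are made up for illustration.

package main

import (
	"encoding/json"
	"fmt"

	"github.com/aws/aws-lambda-go/events"
)

func main() {
	// Sample EC2 Instance State-change Notification, per the CloudWatch
	// Events documentation linked above; IDs are illustrative only.
	raw := []byte(`{
	  "version": "0",
	  "id": "7bf73129-1428-4cd3-a780-95db273d1602",
	  "detail-type": "EC2 Instance State-change Notification",
	  "source": "aws.ec2",
	  "account": "123456789012",
	  "time": "2019-07-16T21:29:54Z",
	  "region": "us-east-1",
	  "resources": ["arn:aws:ec2:us-east-1:123456789012:instance/i-abcd1111"],
	  "detail": {"instance-id": "i-abcd1111", "state": "running"}
	}`)

	var event events.CloudWatchEvent
	if err := json.Unmarshal(raw, &event); err != nil {
		panic(err)
	}

	// The detail payload carries the instance ID and its new state.
	var detail struct {
		InstanceID string `json:"instance-id"`
		State      string `json:"state"`
	}
	if err := json.Unmarshal(event.Detail, &detail); err != nil {
		panic(err)
	}
	fmt.Println(event.Region, detail.InstanceID, detail.State)
}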

ToDo

  • test against more infrastructure
  • iron out any potential crashes
  • write automated tests for the new code and re-enable older test cases so
    that coverage doesn't drop
  • port regional CloudFormation stack changes to the main template so
    it can also run from a StackSet
  • update the documentation to mention this new behavior

Issue Type

  • Feature Pull Request
  • Bugfix Pull Request
  • Documentation Pull Request

Summary

Code contribution checklist

  1. The contribution fixes a single existing GitHub issue, and it is linked
    to it.
  2. The code is as simple as possible, readable and follows the idiomatic Go
    guidelines.
  3. All new functionality is covered by automated test cases so the overall
    test coverage doesn't decrease.
  4. No issues are reported when running make full-test.
  5. Functionality not applicable to all users should be configurable.
  6. Configurations should be exposed through Lambda function environment
    variables, which are also passed as parameters to the CloudFormation and
    Terraform stacks defined as infrastructure code.
  7. Global configurations set from the infrastructure stack level should also
    support per-group overrides using tags.
  8. Tag names and expected values should be similar to the other existing
    configurations.
  9. Both global and tag-based configuration mechanisms should be tested and
    proven to work using log output from various test runs.
  10. The logs should be kept as clean as possible (use log levels as
    appropriate) and formatted consistently to the existing log output.
  11. The documentation is updated to cover the new behavior, as well as the
    new configuration options for both stack parameters and tag overrides.
  12. A code reviewer reproduced the problem and can confirm the code
    contribution actually resolves it.

// GetPendingReplaceableOnDemandInstance returns a populated Instance data structure
// for an on-demand instance that has just been launched in an AutoScaling group
// managed by AutoSpotting and has just entered the Pending state.
func GetPendingReplaceableOnDemandInstance(event events.CloudWatchEvent, cfg *Config) (*Instance, error) {

Function GetPendingReplaceableOnDemandInstance has 5 return statements (exceeds 4 allowed).

core/instance.go Outdated
@@ -94,7 +94,8 @@ func (is *instanceManager) instances() <-chan *instance {
return retC
}

type instance struct {
// Instance wraps an ec2.Instance and has some additional fields and functions
type Instance struct {

Instance has 27 methods (exceeds 20 allowed). Consider refactoring.

core/instance.go Outdated
@@ -527,7 +584,7 @@ func (i *instance) launchTemplateHasNetworkInterfaces(id, ver *string) (bool, []
return false, nil
}

func (i *instance) createRunInstancesInput(instanceType string, price float64) *ec2.RunInstancesInput {
func (i *Instance) createRunInstancesInput(instanceType string, price float64) *ec2.RunInstancesInput {

Method Instance.createRunInstancesInput has 69 lines of code (exceeds 50 allowed). Consider refactoring.

@cristim cristim force-pushed the feat/event-based-instance-replacement branch 2 times, most recently from aea60bd to 1439509 on July 25, 2019 15:45
@@ -75,7 +81,7 @@ func (a *autoScalingGroup) needReplaceOnDemandInstances() bool {
case DetachTerminationMethod:
randomSpot.terminate()
default:
a.terminateInstanceInAutoScalingGroup(randomSpot.Instance.InstanceId)
a.terminateInstanceInAutoScalingGroup(randomSpot.Instance.InstanceId, wait, false)


Error return value of a.terminateInstanceInAutoScalingGroup is not checked (from errcheck)
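
A simple way to address this class of errcheck findings is to capture and at least log the error; a sketch against the signature shown in the diff above:

if err := a.terminateInstanceInAutoScalingGroup(
	randomSpot.Instance.InstanceId, wait, false); err != nil {
	// hedged sketch: log and carry on, matching the surrounding logging style
	logger.Println(a.region.name, "failed to terminate instance:", err.Error())
}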

return
}

logger.Println(a.region.name, "Found spot instance:", spotInstanceID,
"Attaching it to", a.name)

a.replaceOnDemandInstanceWithSpot(spotInstanceID)
a.replaceOnDemandInstanceWithSpot(nil, spotInstanceID)


Error return value of a.replaceOnDemandInstanceWithSpot is not checked (from errcheck)

if i == nil {
continue
debug.Println("Missing instance data for ", *inst.InstanceId, "scanning it again")
a.region.scanInstance(inst.InstanceId)


Error return value of a.region.scanInstance is not checked (from errcheck)

if attachErr != nil {
logger.Println(a.name, "skipping detaching on-demand due to failure to",
"attach the new spot instance", *spotInst.InstanceId)
return nil
}
} else {
defer a.attachSpotInstance(spotInstanceID)
defer a.attachSpotInstance(spotInstanceID, true)


Error return value of a.attachSpotInstance is not checked (from errcheck)

// temporarily increase AutoScaling group in case it's of static size
if *a.MinSize == *a.MaxSize || *a.DesiredCapacity == *a.MaxSize {
logger.Println(a.name, "Temporarily increasing MaxSize")
a.setAutoScalingMaxSize(*a.MaxSize + 1)


Error return value of a.setAutoScalingMaxSize is not checked (from errcheck)

autospotting.go Outdated
}

func (a *autoScalingGroup) isEnabledForEventBasedInstanceReplacement() bool {
if time.Now().Sub(*a.CreatedTime) < time.Hour {


S1012: should use time.Since instead of time.Now().Sub (from gosimple)
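
The gosimple suggestion is a drop-in simplification, since time.Since(t) is defined as time.Now().Sub(t):

if time.Since(*a.CreatedTime) < time.Hour {
	// the group is less than an hour old
}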

@@ -516,3 +644,12 @@ func (a *autoScalingGroup) alreadyRunningInstanceCount(
logger.Println(a.name, "Found", count, instanceCategory, "instances running on a total of", total)
return count, total
}

func (a *autoScalingGroup) isEnabled() bool {


U1000: func (*autoScalingGroup).isEnabled is unused (from unused)

tests := []struct {
name string
i *instance
wantASG *autoScalingGroup


U1000: field wantASG is unused (from unused)

@@ -65,11 +87,12 @@ func addDefaultFilter(cfg *Config) {
}
}

func disableLogging() {
setupLogging(&Config{LogFile: ioutil.Discard})
func (cfg *Config) disableLogging() {


U1000: func (*Config).disableLogging is unused (from unused)

@cristim (Member, Author) commented Jul 25, 2019

Context:

This change attempts to replace on-demand instances with spot as soon as they
are launched (triggered by the on-demand instance entering the pending state),
in order to avoid running expensive on-demand capacity in parallel with the
newly launched spot instances. This should cover new groups out of the box, as
well as scaling out or replacement of terminated spot instances.

It has a few nice side effects concerning cost, since on-demand instances are
likely to run for just a few seconds, as well as potentially tackling some
known issues with ElasticBeanstalk (#343, #344) and deletion of CloudFormation
stacks (#332).

It may also trick the ASG into running the start lifecycle hooks (#296), and
will immediately replace any scaled-out instances, partially addressing #284.

The current/legacy replacement logic will still kick in when enabling
AutoSpotting for the first time on an existing group, as well as when spot
instances fail to be launched and the group must fail over to on-demand capacity
in order to avoid downtime.

Once the legacy replacement logic is done on the current group, the group is
tagged with a configuration tag that enables the event-based logic. This tag is
also set for all groups newer than an hour so their instances are automatically
replaced by the event-based replacement logic.

Current status
- changed the regional CloudFormation stack to intercept EC2 instances as
  soon as they enter the 'pending' and 'running' states
- implemented handling code for these events, which launches a replacement spot
  instance for each on-demand instance currently in the pending state and performs
  the usual instance swapping logic when a spot instance enters the running
  state, without waiting for it to exit its grace period.
- largely kept the initial instance replacement logic as it is (waiting for
  the grace period) to avoid introducing regressions, and also in case the
  events are not properly propagated, such as in case of misconfiguration.
- it compiles and was executed against real infrastructure and is working as
  expected for the couple of groups I tested it against.
- still missing test coverage for the new code, and some existing tests were
  crashing and had to be disabled for now
- implemented support for handling the AutoScaling lifecycle hooks when
  replacing new on-demand instances in event-based execution mode
- added support for running it locally against custom JSON event data in the
  CloudWatch Event format: https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/EventTypes.html#ec2_event_type

ToDo:
- test against more infrastructure
- iron out any crashes
- write automated tests for the new code and re-enable older test cases so
  that coverage doesn't drop
- port regional CloudFormation stack changes to the main template so
  it can also run from a StackSet
- update the documentation to mention this new behavior
Only handle the "running" events of new instances, because the instance
information wasn't filled in while the instance was in the pending state.

Also run spot attachment in parallel with detachment of the on-demand
instance in order to avoid race conditions with the group's process that
enforces the desired capacity, such as when handling multiple instances in
parallel for new groups or multiple instance launches.
@cristim cristim force-pushed the feat/event-based-instance-replacement branch from 1439509 to cceb686 on July 31, 2019 19:41
@@ -114,19 +120,24 @@ func (a *autoScalingGroup) licensedToRun() (bool, error) {
return true, nil
}

func (a *autoScalingGroup) process() {
var spotInstanceID string
func (a *autoScalingGroup) processCronEvent() {

Method autoScalingGroup.processCronEvent has 8 return statements (exceeds 4 allowed).

core/instance.go Outdated
return false
}

func (i *instance) swapWithGroupMember() (*instance, error) {

Method instance.swapWithGroupMember has 9 return statements (exceeds 4 allowed).

core/main.go Outdated

// EventHandler implements the event handling logic and is the main entrypoint of
// AutoSpotting
func (a *AutoSpotting) EventHandler(event *json.RawMessage) {

Method AutoSpotting.EventHandler has 5 return statements (exceeds 4 allowed).

core/main.go Outdated

}

func (a *AutoSpotting) handleNewInstanceLaunch(regionName string, instanceID string, state string) error {

Method AutoSpotting.handleNewInstanceLaunch has 8 return statements (exceeds 4 allowed).

core/instance.go Outdated
return false
}

func (i *instance) swapWithGroupMember() (*instance, error) {

Method instance.swapWithGroupMember has 56 lines of code (exceeds 50 allowed). Consider refactoring.

core/instance.go Outdated
if err := asg.attachSpotInstance(*i.InstanceId, true); err != nil {
logger.Printf("Spot instance %s couldn't be attached to the group %s, terminating it...",
*i.InstanceId, asg.name)
i.terminate()


Error return value of i.terminate is not checked (from errcheck)

- Suspend termination processes on ASG while swapping instances
- Handle launch lifecycle hooks of unattached spot instances by
  intercepting the API call failures from CloudTrail
strings.HasPrefix(ctEvent.ErrorMessage, "No active Lifecycle Action found with instance ID")
}

func (a *AutoSpotting) handleLifecycleHookEvent(event events.CloudWatchEvent) error {

Method AutoSpotting.handleLifecycleHookEvent has 11 return statements (exceeds 4 allowed).

return false
}

func (i *instance) swapWithGroupMember(asg *autoScalingGroup) (*instance, error) {

Method instance.swapWithGroupMember has 8 return statements (exceeds 4 allowed).

return nil
}

func (a *AutoSpotting) handleNewInstanceLaunch(regionName string, instanceID string, state string) error {

Method AutoSpotting.handleNewInstanceLaunch has 77 lines of code (exceeds 50 allowed). Consider refactoring.

return nil
}

func (a *AutoSpotting) handleNewInstanceLaunch(regionName string, instanceID string, state string) error {

Method AutoSpotting.handleNewInstanceLaunch has 11 return statements (exceeds 4 allowed).

autospotting.go Outdated

func (c *cfgData) parseCommandLineFlags() {
flag.StringVar(&c.AllowedInstanceTypes, "allowed_instance_types", "",
func parseCommandLineFlags() {

Function parseCommandLineFlags has 67 lines of code (exceeds 50 allowed). Consider refactoring.

core/main.go Outdated

} else if eventType == "AWS API Call via CloudTrail" {
log.Println("Triggered by", eventType)
a.handleLifecycleHookEvent(cloudwatchEvent)


Error return value of a.handleLifecycleHookEvent is not checked (from errcheck)

Chris and others added 3 commits September 13, 2019 15:35
- Refactor the cron execution mode to support determining the action and
running the action as independent steps, which makes the code easier to
test.
- switch to Go 1.13
- Suspend the AutoScaling termination process for 5 minutes
- Reset the hourlySavings value between tests
- Improve mocking
- Print log statements during tests when the AUTOSPOTTING_DEBUG variable
is set to "true" and the tests run in verbose mode
- change isProtectedFromTermination to return true in case of API
failures.
asg := tsi.target.asg

asg.enableForInstanceLaunchEventHandling()
asg.terminateRandomSpotInstanceIfHavingEnough(


Error return value of asg.terminateRandomSpotInstanceIfHavingEnough is not checked (from errcheck)

core/action.go Outdated
spotInstance := tusi.target.spotInstance
spotInstanceID := *spotInstance.InstanceId

asg.terminateRandomSpotInstanceIfHavingEnough(total, true)


Error return value of asg.terminateRandomSpotInstanceIfHavingEnough is not checked (from errcheck)

asg.terminateRandomSpotInstanceIfHavingEnough(total, true)
logger.Println("Spot instance", spotInstanceID, "is not need anymore by ASG",
asg.name, "terminating the spot instance.")
spotInstance.terminate()


Error return value of spotInstance.terminate is not checked (from errcheck)

func (ssi swapSpotInstance) run() {
asg := ssi.target.asg
spotInstanceID := *ssi.target.spotInstance.InstanceId
asg.replaceOnDemandInstanceWithSpot(nil, spotInstanceID)


Error return value of asg.replaceOnDemandInstanceWithSpot is not checked (from errcheck)

return
}
logger.Printf("Successfully launched spot instance %s, exiting...", *spotInstanceID)
return


S1023: redundant return statement (from gosimple)


func Test_autoScalingGroup_enableForInstanceLaunchEventHandling(t *testing.T) {
type fields struct {


U1000: type fields is unused (from unused)

@mello7tre (Contributor)

Hi, I've begun porting the regional CloudFormation stack changes to the main template so the new event-based replacement can be deployed using a StackSet too.

As soon as I find some time to test it I will notify you.
(I will try to test not only the StackSet creation but the whole new event-based replacement.)

…tance logic in Schedule run + specify ondemandprice multiplier on an ASG group level (#425)

* Allow flavor to be customised (#359)

* fix typo (#360)

* Fix typo (#361)

* UserData wrappers for CloudFormation helper scripts when using Beanstalk (#366)

* Support custom role for cfn-init in Beanstalk UserData

* Wrappers & refactoring

* Docs

* Docs fixes

* More docs fixes

* Docs fixes

* Yet more docs fixes

* Lambda & Kubernetes config

* AutoSpottingElasticBeanstalk managed policy + Rename beanstalk_cfn_init_role to beanstalk_cfn_wrappers

* Update param description

* Kubernetes config fix

* Rename Beanstalk variable + Move test data

* Add missing permission for AutoSpottingBeanstalk role

* Bring Go to 1.13 (#367)

* merge fixes

* Begin - port regional CloudFormation stack changes to the main template so it can also run from a StackSet

* Merge fix

* Fix to template

* Multiple fixes to handle lambda concurrency

As instances can be launched concurrently/overlapping, we run into
problems related to multiple Lambdas acting on the same ASG

* Removed commented code

* Begin working on:

#354 (comment)

* Progress #354

Created Queue in CF template
Begin sending message

* Progress #354

* Progress #354

* Progress #354

* Progress #354

* Progress #354

* * Use inline python lambda to increase ASG

Use it only if AttachInstances method fails for wrong ASG max size.

* * Use inline python lambda to increase ASG

progress

* * Use inline python lambda to increase ASG

progress

* * Improvements on event based instance replacement

- Ported to StackSet deploy mode.

- Fix to "ScalingActivityInProgress" if launched ondemand instance is
terminated before going "inservice":

	Created method "waitForInstanceStatus", that wait, until a max
	retry (5), that an instance belonging to an ASG is in the desired
	status (InService).
	(Sleep time is 5*retry)

- Fix to multiple problems caused by lambda concurrency changing ASG
MaxSize:
	Created another Lambda (LambdaManageASG) in the same region of
	the main one; code python3.7, inline in template.
	Lambda concurrency is set to one.
	Is function is to change ASG MaxSize by the ammount specified.

	Used a "try-catch approach":
	if AttachInstances return error code "ValidationError" and
	string "update the AutoScalingGroup sizes" is present in error
	message, means that we need to increase ASG MaxSize.
	So we execute method changeAutoScalingMaxSize that invoke
	LambdaManageASG.

	If invoke return error "ErrCodeTooManyRequestsException", means
	that multiple Main lambdas are executing LambdaManageASG, we
	sleep for a random interval part of seconds and retry.

	Method attachSpotInstance now return an int that represent the
	ammount of change to the ASG MaxSize, so that in
	swapWithGroupMember we can defer the call to
	changeAutoScalingMaxSize to decrease the ASG MaxSize of the
	previously increased ammount.

	We use "waitForInstanceStatus" in attachSpotInstance before
	returning to be sure that spot instance has been attached and
	are InService before beginning to terminate ondemand one.
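
The invoke-with-retry piece of this scheme could look roughly like the sketch below. Only the LambdaManageASG name, the throttling error code and the random back-off come from the description above; the payload shape, package layout and client wiring are assumptions.

package core

import (
	"encoding/json"
	"math/rand"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/service/lambda"
)

// changeAutoScalingMaxSize asks the single-concurrency LambdaManageASG
// function to adjust an ASG's MaxSize by delta, retrying with a random
// sub-second back-off while another invocation holds the concurrency slot.
func changeAutoScalingMaxSize(svc *lambda.Lambda, asgName string, delta int64) error {
	payload, err := json.Marshal(map[string]interface{}{
		// assumed payload shape, not taken from the PR
		"asg_name": asgName,
		"delta":    delta,
	})
	if err != nil {
		return err
	}
	for {
		_, err := svc.Invoke(&lambda.InvokeInput{
			FunctionName: aws.String("LambdaManageASG"),
			Payload:      payload,
		})
		if err == nil {
			return nil
		}
		if aerr, ok := err.(awserr.Error); ok &&
			aerr.Code() == lambda.ErrCodeTooManyRequestsException {
			// another Lambda is using the single concurrency slot: back off
			time.Sleep(time.Duration(rand.Intn(900)+100) * time.Millisecond)
			continue
		}
		return err
	}
}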

* * Improvements on event based instance replacement

- Add logPrefix to better identify Lambda actions in case of concurrent
executions.
	TE = Spot terminate event
	ST = Instance start event
	SC = Schedule event

TE and ST are followed by instanceID, SC is followed by activation time.

* * Improvements on event based instance replacement

- fix/improvement on Suspending/Resuming Termination process:
suspend/resume is now handled by LambdaManageASG too.

When we successfully suspend termination, we add a tag
"autospotting_suspend_process_by" whose value equals the instanceId of the
event that triggered the main Lambda.
When we try to resume termination, we first check whether the value of the
tag above equals our own instanceId.
If not, it means that another main Lambda, different from the one that
suspended the process, is trying to resume it; in that case we do not resume it.

Given that LambdaManageASG has a concurrent execution limit of 1, we can
have the following cases:

1)
  *) A main Lambda suspends the process.
  *) No other Lambda suspends it.
  *) The main Lambda resumes the process.
  *) Another main Lambda suspends the process
  *) ... and so on

2)
  *) A main Lambda suspends the process.
  *) Before the first one resumes the process, another one suspends it (thus
  replacing the tag value)
  *) The first Lambda does not resume the process (tag values differ)
  *) The second Lambda resumes the process

In the worst scenario of a Lambda dying before resuming the process, it
will be resumed after another one suspends it.
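
A compact sketch of that resume guard; getTagValue and resumeProcesses are hypothetical stand-ins for the group's tag-lookup and process-resume helpers:

const suspendProcessTag = "autospotting_suspend_process_by"

// maybeResume resumes the suspended termination process only if this
// invocation is the one that suspended it, i.e. the tag still carries our
// triggering instance ID; otherwise another Lambda re-suspended it and the
// resume is left to that one.
func maybeResume(a *autoScalingGroup, triggeringInstanceID string) {
	if v := a.getTagValue(suspendProcessTag); v != nil && *v == triggeringInstanceID {
		a.resumeProcesses() // hypothetical helper
	}
}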

* * Improvements on event based instance replacement

code cosmetic fix

* * Improvements on event based instance replacement

- for the rand seed, instead of using time.Now().UnixNano(), we build a seed
based on the instanceId that triggered the event.

The seed is built this way:
for every char of the instanceId (starting from the third char) we take its
rune value and add it to the previous sum.
We use the running sum as a temporary seed and draw a random number between 0 and 9.

The final seed is the concatenation of the generated random digits.

This way we get a seed number (int64) that depends on the instanceId and
has the same length.
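
Reconstructed as Go under those rules (skip the "i-" prefix, accumulate rune values, draw one digit per character), the scheme might look like the sketch below; the function name is made up:

package core

import (
	"math/rand"
	"strconv"
)

// seedFromInstanceID derives a deterministic rand seed from an instance ID:
// starting at the third character, it accumulates rune values, seeds a
// temporary RNG with the running sum, draws a digit 0-9, and concatenates
// the digits into the final int64 seed.
func seedFromInstanceID(id string) int64 {
	var sum int64
	digits := ""
	for _, r := range id[2:] { // skip the "i-" prefix
		sum += int64(r)
		tmp := rand.New(rand.NewSource(sum))
		digits += strconv.Itoa(tmp.Intn(10))
	}
	seed, _ := strconv.ParseInt(digits, 10, 64)
	return seed
}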

* * Improvements on event based instance replacement

- no more need to "revert attach/detach order when running on minimum
capacity".
Defer changeAutoScalingMaxSize and use the same logic as
swapWithGroupMember.

* * Improvements on event based instance replacement

- Use suspendResumeProcess for the scheduled replaceOnDemandInstanceWithSpot
too.

We use it as a trick to avoid rare cases of concurrency between
scheduled and event Lambdas.
As the Lambda that handles "suspendResumeProcess" has a concurrency limit of
one, scheduled and event Lambdas, if concurrent, will be "time shifted"
by a random value.
This way they will not execute attachSpotInstance at the same time.

In the case of the scheduled Lambda we append the "S" char to the instanceId
used for the randSeed, to avoid it resuming a process suspended by an event
Lambda.

- Fix in swapWithGroupMember:
defer asg.suspendResumeProcess for the resume so that it will be executed
even if swapWithGroupMember returns an error.

* * gofmt cosmetic changes

* Fix to rand.Intn parameter so that 9 is included too

* fix terminateRandomSpotInstanceIfHavingEnough

need another condition to avoid terminating a valid spot instance in case the
ASG has minOnDemand > 0

if all ASG instances are in state running, and
  the minimum on-demand count equals the total on-demand instances running, and
  all instances running equal the desired capacity,
then there is no need to terminate a spot instance

needs some testing

* fix terminateRandomSpotInstanceIfHavingEnough

fixes

* fix terminateRandomSpotInstanceIfHavingEnough

changed allInstancesRunning to also return the number of on-demand instances running

* Specify/override multiplier for the on-demand price on a group
level

As i.price has already been multiplied by the global value, if specified, we
first need to divide it by that same value and then multiply it by the
multiplier specific to the ASG.

We need to do this for both scheduled and event actions.
For events we act in function belongsToEnabledASG.
For schedules we act in launchSpotReplacement.

Needs deep testing...

* Specify/override multiplier for the on-demand price on a group
    level - fix

in loadConfOnDemandPriceMultiplier we need to use
a.config.OnDemandPriceMultiplier in place of
a.region.conf.OnDemandPriceMultiplier

this way a.region.conf.OnDemandPriceMultiplier will preserve the
original global value

* Misc fixes to be able to run go tests

* Misc fixes to be able to run go tests - continue

* Misc fixes to be able to run go tests - end for now

* For Test_autoScalingGroup_terminateRandomSpotInstanceIfHavingEnough
added condition:
"spot capacity is correct, skip termination"

Co-authored-by: Chris <chris.farrenden@domain.com.au>
Co-authored-by: 0x11 <14269809+codenoid@users.noreply.github.com>
Co-authored-by: Jawad <jawad.stouli@gmail.com>
Co-authored-by: Gábor Lipták <gliptak@gmail.com>
} else {
defer a.attachSpotInstance(spotInstanceID)

a.suspendResumeProcess(*spotInst.InstanceId+"S", "suspend")


Error return value of a.suspendResumeProcess is not checked (from errcheck)

defer a.attachSpotInstance(spotInstanceID)

a.suspendResumeProcess(*spotInst.InstanceId+"S", "suspend")
defer a.suspendResumeProcess(*spotInst.InstanceId+"S", "resume")


Error return value of a.suspendResumeProcess is not checked (from errcheck)


increase, attachErr := a.attachSpotInstance(*spotInst.InstanceId, true)
if increase > 0 {
defer a.changeAutoScalingMaxSize(int64(-1*increase), *spotInst.InstanceId)


Error return value of a.changeAutoScalingMaxSize is not checked (from errcheck)

// var waiter sync.WaitGroup
// defer waiter.Wait()
// go asg.temporarilySuspendTerminations(&waiter)
asg.suspendResumeProcess(*i.InstanceId, "suspend")


Error return value of asg.suspendResumeProcess is not checked (from errcheck)

// defer waiter.Wait()
// go asg.temporarilySuspendTerminations(&waiter)
asg.suspendResumeProcess(*i.InstanceId, "suspend")
defer asg.suspendResumeProcess(*i.InstanceId, "resume")


Error return value of asg.suspendResumeProcess is not checked (from errcheck)

core/main.go Outdated
return nil
}

func (a *AutoSpotting) handleNewSpotInstanceLaunch(r region, i *instance) error {


copylocks: handleNewSpotInstanceLaunch passes lock by value: github.com/AutoSpotting/AutoSpotting/core.region contains sync.WaitGroup contains sync.noCopy (from govet)
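
The usual govet fix here is to pass the struct by pointer so the embedded sync.WaitGroup is never copied; a sketch, with the body elided:

// was: func (a *AutoSpotting) handleNewSpotInstanceLaunch(r region, i *instance) error
func (a *AutoSpotting) handleNewSpotInstanceLaunch(r *region, i *instance) error {
	// unchanged body, but r is now a pointer, so the sync.WaitGroup
	// embedded in region is no longer copied on each call
	return nil
}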

return false
}

func (a *autoScalingGroup) temporarilySuspendTerminations(wg *sync.WaitGroup) {


func (*autoScalingGroup).temporarilySuspendTerminations is unused (from unused)

a.resumeTerminationProcess()
}

func (a *autoScalingGroup) isTerminationSuspended() bool {


func (*autoScalingGroup).isTerminationSuspended is unused (from unused)

}
}

func (a *autoScalingGroup) resumeTerminationProcess() {


func (*autoScalingGroup).resumeTerminationProcess is unused (from unused)

}
}
logger.Println(a.name, "Found", count, instanceCategory, "instances running on a total of", total)
return count, total
}

func (a *autoScalingGroup) suspendTerminationProcess() {


func (*autoScalingGroup).suspendTerminationProcess is unused (from unused)

cristim referenced this pull request in traveloka-archive/AutoSpotting May 15, 2020
mello7tre referenced this pull request in mello7tre/AutoSpotting Jul 9, 2020
added condition:
"spot capacity is correct, skip termination"
…master (#433)

* Merge Updating AWS SDK to latest version - [0ee95c8]

* Merge Relicense to OSL-3 - [972bc61]

* Merge Update readme, mention the relicensing to OSL-3 - [9b438dc]

* Merge Fix `make archive` on macOS [c929391]

* Merge Move some logs to debug [fa0c27b]

* Merge Delete old Gopkg files [5504e34]

* Merge Update dependencies [0e14a7b]

* Merge Support spot price buffer percentage of 0 [26ca955]

* Merge Don't segfault when spot instance doesn't belong to ASG [f313c6d]

* Merge Allow specifying GOOS and GOARCH [8d67d6e]

* Merge Use /bin/bash for shell [0eed8cd]

* Merge Ignore terminating spot instances that don't belong to
AutoSpotting (master) and Enable Spot Termination ASG Checking (event) [33a444c]

* Merge Revert Use /bin/bash for shell [6e45440]

* Merge Pass --abort-on-container-exit to docker-compose [8f90fba]

* Merge Rename travisci make targets to ci [2d76a9a]

* Merge Create FUNDING.yml + Update FUNDING.yml [dde6d85,ca81828]

* Merge Use paginated version of DescribeSpotPriceHistory [9a770b3]

* Merge Delete DescribeSecurityGroups mock [3bfd542]

* Merge Move config loading out of main and add tests for it [30a4392]

* Fixes to Merge Move config loading out of main and add tests for it

* Merge Move logs about incompatible instance types to debug [985d675]

* Merge Remove incorrect Makefile conditionals [a01ee26]

* Merge Actually fail the build if gofmt fails [c391d69]

* Merge Add tools to go.mod [8a9a90b]

* Merge No longer enforce the name of the ElasticBeanstalk IAM policy [3313298]

* Merge Added spot premium [ecf31a5]

* Merge Update how bid price is calculated for premium instances [8d13fb0]

* Merge Update dependencies [caf373d]

* Merge Cron timezone [8e04942]

* Merge Spleling [a37aafc]

* Merge Update README.md [0109a0b,fb15aa4,4e03db7]

* Merge Use the larger of min-OD-instances and (min-OD-percent * current)
[25ac31f]

* gofmt changes

* cronEventAction - fix to logic

need to invert the test for needReplaceOnDemandInstances and
onDemandInstance == nil

currently, if onDemandInstance == nil the method returns and execution stops.
This way the check that terminates a spot instance when there are more of
them than required is never done.

Assume that the ASG scales down and terminates the on-demand instances that
autospotting is not terminating [autospotting_min_on_demand_number].
On the next runs onDemandInstance will be nil and the spot instances in
excess will not be terminated.

Co-authored-by: Chris <chris.farrenden@domain.com.au>
Co-authored-by: 0x11 <14269809+codenoid@users.noreply.github.com>
Co-authored-by: Jawad <jawad.stouli@gmail.com>
Co-authored-by: Gábor Lipták <gliptak@gmail.com>

@codeclimate codeclimate bot left a comment


The PR diff size of 7401 lines exceeds the maximum allowed for the inline comments feature.

@mello7tre (Contributor)

I know that this PR and the comments related to it are already a lot, but I found a bug and didn't want to open an issue, as this hasn't yet been merged into the master branch.
So I will explain it here, and once I'm sure that the solution I implemented is fine I will open a specific PR.

The problem is difficult to explain, but it is relevant, so I will try:

Setup

Consider an ASG with Desired = 10 and an AS minOnDemand = 1.
Over time we will arrive at the following state:
9 spot and 1 on-demand.
Now, if the ASG can scale based on load, AutoScaling can terminate an instance and reduce the Desired to 9.
Let's take the case where the terminated one is the only on-demand instance.
We will have 9 spot and 0 on-demand.
On the next run AS (AutoSpotting) will terminate a spot instance so that a new on-demand one is launched, to obey the condition minOnDemand = 1.
But if the ASG has a lifecycle hook triggered on instance termination, and that lifecycle hook takes more time than AS's scheduled frequency, a big issue can arise.

Issue

AS terminates a spot instance (minOnDemand); the instance is not terminated immediately (lifecycle hook), but AutoScaling launches a new instance (on-demand).
On the next scheduled run, AS (cronEventAction):

                if need, total := a.needReplaceOnDemandInstances(); !need {
                        logger.Printf("Not allowed to replace any more of the running OD instances in %s", a.name)
                        return terminateSpotInstance{target{asg: a, totalInstances: total}}
                }

checks whether the new on-demand instance must be replaced, but as onDemandRunning == a.minOnDemand, need = false, so it executes
terminateSpotInstance, which finally executes terminateRandomSpotInstanceIfHavingEnough:

func (a *autoScalingGroup) terminateRandomSpotInstanceIfHavingEnough(totalRunning int64, wait bool) error {

        if totalRunning == 1 {
                logger.Println("Warning: blocking replacement of very last instance - consider raising ASG to >= 2")
                return nil
        }

        if allInstancesAreRunning, onDemandRunning := a.allInstancesRunning(); allInstancesAreRunning {
                if a.instances.count64() == *a.DesiredCapacity && onDemandRunning == a.minOnDemand {
                        logger.Println("Currently Spot running equals to the required number, skipping termination")
                        return nil
                }

                if a.instances.count64() < *a.DesiredCapacity {
                        logger.Println("Not enough capacity in the group")
                        return nil
                }
        }

        randomSpot := a.getAnySpotInstance()
        if randomSpot == nil {
                logger.Println("Couldn't pick a random spot instance")
                return nil
        }

        logger.Println("Terminating randomly-selected spot instance",
                *randomSpot.Instance.InstanceId)

        switch a.config.TerminationMethod {
        case DetachTerminationMethod:
                return randomSpot.terminate()
        default:
                return a.terminateInstanceInAutoScalingGroup(randomSpot.Instance.InstanceId, wait, false)
        }
}

the root problem is that the terminating instance is still in state running!
So a.instances.count64() == *a.DesiredCapacity && onDemandRunning == a.minOnDemand is not true, and AS terminates another spot instance (it can even happen that it tries to terminate the same, already-terminating one!)
The instance begins terminating (lifecycle hook) and AutoScaling launches a new instance, but on the next run the same problem arises, so we can end up in a loop of launching and then terminating instances.
(I have observed all of this in real time.)

Solution/Fix

When executing terminateRandomSpotInstanceIfHavingEnough we need to consider not the instance state but the instance's AutoScaling lifecycle state.
I have added two new properties to the autoScalingGroup struct:

        instancesOnDemandInService map[string]*instance
        instancesSpotInService     map[string]*instance

then created a func populateASGInstancesInService that populates the maps above, keyed by the ID of each InService instance, with the instance itself as the value.
I then changed terminateRandomSpotInstanceIfHavingEnough:

func (a *autoScalingGroup) terminateRandomSpotInstanceIfHavingEnough(wait bool) error {
        onDemandRunning := int64(len(a.instancesOnDemandInService))
        totalRunning := int64(len(a.instancesOnDemandInService) + len(a.instancesSpotInService))

        if totalRunning == 1 {
                logger.Println("Warning: blocking replacement of very last instance - consider raising ASG to >= 2")
                return nil
        }

        if totalRunning < *a.DesiredCapacity {
                logger.Println("Not enough capacity in the group, skipping termination")
                return nil
        }

        if totalRunning == *a.DesiredCapacity && onDemandRunning == a.minOnDemand {
                logger.Println("Currently Spot running equals to the required number, skipping termination")
                return nil
        }
.....
......
......

And finally, to avoid terminating an already-terminating instance, I changed getAnySpotInstance:

func (a *autoScalingGroup) getAnySpotInstance() *instance {
        for _, i := range a.instancesSpotInService {
                return i
        }
        return nil
}

I have an AS version built with this code running, and so far I have found no problems.
I will notify you if I find any.
If all goes well, and if you like, I can open a PR with the changes/fixes.

Regards, Alberto.

@mello7tre (Contributor) commented Aug 8, 2020

Thinking about this again:
as the problem can arise even when terminating an on-demand instance, maybe a better solution is to exclude instances whose ASG lifecycle state begins with Terminating directly in the scanInstances func...

@mello7tre (Contributor)

I made some tests; the right solution is to exclude instances whose ASG lifecycle state begins with Terminating.
I will open a PR
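
A hedged sketch of that filter, assuming the scan works from aws-sdk-go autoscaling.Instance records; the helper name is made up:

package core

import (
	"strings"

	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// inServiceOnly drops ASG instances whose lifecycle state begins with
// "Terminating" (covering Terminating, Terminating:Wait and
// Terminating:Proceed), so an instance that is terminating but still in the
// running EC2 state is never counted toward capacity or picked again for
// termination.
func inServiceOnly(instances []*autoscaling.Instance) []*autoscaling.Instance {
	var kept []*autoscaling.Instance
	for _, i := range instances {
		if i.LifecycleState != nil &&
			strings.HasPrefix(*i.LifecycleState, "Terminating") {
			continue
		}
		kept = append(kept, i)
	}
	return kept
}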

* Allow flavor to be customised (#359)

* fix typo (#360)

* Fix typo (#361)

* UserData wrappers for CloudFormation helper scripts when using Beanstalk (#366)

* Support custom role for cfn-init in Beanstalk UserData

* Wrappers & refactoring

* Docs

* Docs fixes

* More docs fixes

* Docs fixes

* Yet more docs fixes

* Lambda & Kubernetes config

* AutoSpottingElasticBeanstalk managed policy + Rename beanstalk_cfn_init_role to beanstalk_cfn_wrappers

* Update param description

* Kubernetes config fix

* Rename Beanstalk variable + Move test data

* Add missing permission for AutoSpottingBeanstalk role

* Bring Go to 1.13 (#367)

* merge fixes

* Begin - port regional CloudFormation stack changes to the main template so it can also run from a StackSet

* Merge fix

* Fix to template

* Multiple fixes to handle lambda concurrency

As instaces can be launched in concurrency/overlapping we have the
problems related to multiple lambdas acting on the same ASG

* Removed commented code

* Begin working on:

#354 (comment)

* Progress #354

Created Queue in CF template
Begin sending message

* Progress #354

* Progress #354

* Progress #354

* Progress #354

* Progress #354

* * Use inline python lambda to increase ASG

Use it only if AttachInstances method fails for wrong ASG max size.

* * Use inline python lambda to increase ASG

progress

* * Use inline python lambda to increase ASG

progress

* * Improvements on event based instance replacement

- Ported to StackSet deploy mode.

- Fix to "ScalingActivityInProgress" if launched ondemand instance is
terminated before going "inservice":

	Created method "waitForInstanceStatus", that wait, until a max
	retry (5), that an instance belonging to an ASG is in the desired
	status (InService).
	(Sleep time is 5*retry)

- Fix to multiple problems caused by lambda concurrency changing ASG
MaxSize:
	Created another Lambda (LambdaManageASG) in the same region of
	the main one; code python3.7, inline in template.
	Lambda concurrency is set to one.
	Is function is to change ASG MaxSize by the ammount specified.

	Used a "try-catch approach":
	if AttachInstances return error code "ValidationError" and
	string "update the AutoScalingGroup sizes" is present in error
	message, means that we need to increase ASG MaxSize.
	So we execute method changeAutoScalingMaxSize that invoke
	LambdaManageASG.

	If invoke return error "ErrCodeTooManyRequestsException", means
	that multiple Main lambdas are executing LambdaManageASG, we
	sleep for a random interval part of seconds and retry.

	Method attachSpotInstance now return an int that represent the
	ammount of change to the ASG MaxSize, so that in
	swapWithGroupMember we can defer the call to
	changeAutoScalingMaxSize to decrease the ASG MaxSize of the
	previously increased ammount.

	We use "waitForInstanceStatus" in attachSpotInstance before
	returning to be sure that spot instance has been attached and
	are InService before beginning to terminate ondemand one.

* * Improvements on event based instance replacement

- Add logPrefix to better identify lambda actions in case of concurent
executions.
	TE = Spot terminate event
	ST = Instance start event
	SC = Schedule event

TE and ST are followed by instanceID, SC is followed by activation time.

* * Improvements on event based instance replacement

- fix/improvement on Suspending/Resuming Termination process:
suspend/resume is now handled by LambdaManageASG too.

When i successfully suspend termination, i add a tag
"autospotting_suspend_process_by" with value equals to the instanceId
 event that triggered the main lambda.
When i try to resume termination first check if the value of the tag
above equals the current one.
If not, means that another main lambda, different from the one who suspended
it, is trying to resume the process; in that case i do not resume it.

Considered that LambdaManageASG has a concurrent execution limit set to
1 we can have the following cases:

1)
  *) A main lambda suspend the process.
  *) No other lambda suspend it.
  *) Main lambda resume the process.
  *) Another main lambda suspend the process
  *) ....and so on

2)
  *) A main lambda suspend the process.
  *) Before the first one resume the process another one suspend it (so
  replacing the tag value)
  *) First lambda do not resume the process (tag value differ)
  *) Second lambda resume the process

In the wrost scenario of a Lambda dying before resuming the process, it
will be resumed after another one will suspend it.

* * Improvements on event based instance replacement

code cosmetic fix

* * Improvements on event based instance replacement

- for rand seed instead of using time.Now().UnixNano() we build a seed
based on the instanceId that triggered the event.

The seed is build this way:
for every char of instanceId (starting from third char) we get his rune
 "representation" and sum it to previous one.
We use it as a temporary seed and get a random number between 0 and 9.

The final seed is the concatenation of the generated random numbers.

This way we have a seed number (int64) that depend from the instanceId and
of the same lenght.

* * Improvements on event based instance replacement

- no more need to "revert attach/detach order when running on minimum
capacity".
Defer changeAutoScalingMaxSize and use same logic of
swapWithGroupMember.

* * Improvements on event based instance replacement

- Use suspendResumeProcess for schedule replaceOnDemandInstanceWithSpot
too.

We use it as a trick to avoid rare cases of concurrency between
scheduled and event lambdas.
As lambda that handle "suspendResumeProcess" have a concurrency limit of
one, scheduled and event lambdas, if concurrent, will be "time shifted"
by a random value.
This way they will not execute attachSpotInstance at the same time.

In case of scheduled lambda we add the "S" char to the instanceId used for
the randSeed to avoid that it resume process suspended by event lambda.

- Fix in swapWithGroupMember:
defer asg.suspendResumeProcess for resume so that it will be executed
even if function swapWithGroupMember return error.

* * gofmt cosmetic changes

* Fix to rand.Intn parameter to include even 9

* fix terminateRandomSpotInstanceIfHavingEnough

need another condition to avoid terminating valid spot instance in case ASG have minOnDemand > 0

if all ASG instances are in state running and
  Min OnDemand instance equals total ondemand running and
  all instances running equals desired capacity
means that i do not need to terminate a spot instance

need some testing

* fix terminateRandomSpotInstanceIfHavingEnough

fixes

* fix terminateRandomSpotInstanceIfHavingEnough

changed allInstancesRunning to also return the number of running on-demand instances
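A minimal sketch of the resulting guard (the function name and signature are
assumptions):

```go
// spotCapacityIsCorrect reports whether the group already runs exactly the
// required mix, in which case the random spot termination must be skipped.
func spotCapacityIsCorrect(allRunning bool, onDemandRunning, minOnDemand, totalRunning, desired int64) bool {
	return allRunning &&
		onDemandRunning == minOnDemand &&
		totalRunning == desired
}
```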

* Specify/override the multiplier for the on-demand price at the group
level

As i.price has already been multiplied by the global value, if one is
specified we first need to divide by that value and then multiply by the
multiplier specific to the ASG.

We need to do this for both scheduled and event actions.
For events we act in the belongsToEnabledASG function.
For the schedule we act in launchSpotReplacement.

Needs deep testing...
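A hedged sketch of that adjustment (the function name is an assumption):

```go
// applyASGMultiplier undoes the global on-demand price multiplier already
// baked into the price and applies the per-ASG override instead.
func applyASGMultiplier(price, globalMultiplier, asgMultiplier float64) float64 {
	return price / globalMultiplier * asgMultiplier
}
```

For example, with a global multiplier of 1.0 and a per-ASG override of 0.8,
an already-adjusted price of 0.10 becomes 0.08.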

* Specify/override multiplier for the on-demand price on a group
    level - fix

in loadConfOnDemandPriceMultiplier we need to use
a.config.OnDemandPriceMultiplier in place of
a.region.conf.OnDemandPriceMultiplier

this way a.region.conf.OnDemandPriceMultiplier preserves the original
global value

* Misc fixes to be able to run go tests

* Misc fixes to be able to run go tests - continue

* Misc fixes to be able to run go tests - end for now

* For Test_autoScalingGroup_terminateRandomSpotInstanceIfHavingEnough
added the condition:
"spot capacity is correct, skip termination"

* Merge Updating AWS SDK to latest version - [0ee95c8]

* Merge Relicense to OSL-3 - [972bc61]

* Merge Update readme, mention the relicensing to OSL-3 - [9b438dc]

* Merge Fix `make archive` on macOS [c929391]

* Merge Move some logs to debug [fa0c27b]

* Merge Delete old Gopkg files [5504e34]

* Merge Update dependencies [0e14a7b]

* Merge Support spot price buffer percentage of 0 [26ca955]

* Merge Don't segfault when spot instance doesn't belong to ASG [f313c6d]

* Merge Allow specifying GOOS and GOARCH [8d67d6e]

* Merge Use /bin/bash for shell [0eed8cd]

* Merge Ignore terminating spot instances that don't belong to
AutoSpotting (master) and Enable Spot Termination ASG Checking (event) [33a444c]

* Merge Revert Use /bin/bash for shell [6e45440]

* Merge Pass --abort-on-container-exit to docker-compose [8f90fba]

* Merge Rename travisci make targets to ci [2d76a9a]

* Merge Create FUNDING.yml + Update FUNDING.yml [dde6d85,ca81828]

* Merge Use paginated version of DescribeSpotPriceHistory [9a770b3]

* Merge Delete DescribeSecurityGroups mock [3bfd542]

* Merge Move config loading out of main and add tests for it [30a4392]

* Fixes to Merge Move config loading out of main and add tests for it

* Merge Move logs about incompatible instance types to debug [985d675]

* Merge Remove incorrect Makefile conditionals [a01ee26]

* Merge Actually fail the build if gofmt fails [c391d69]

* Merge Add tools to go.mod [8a9a90b]

* Merge No longer enforce the name of the ElasticBeanstalk IAM policy [3313298]

* Merge Added spot premium [ecf31a5]

* Merge Update how bid price is calculated for premium instances [8d13fb0]

* Merge Update dependencies [caf373d]

* Merge Cron timezone [8e04942]

* Merge Spleling [a37aafc]

* Merge Update README.md [0109a0b,fb15aa4,4e03db7]

* Merge Use the larger of min-OD-instances and (min-OD-percent * current)
[25ac31f]

* gofmt changes

* cronEventAction - fix to the logic

we need to invert the checks for needReplaceOnDemandInstances and
onDemandInstance == nil

currently, if onDemandInstance == nil the method returns and execution
stops.
This way the check that terminates a spot instance when there are more of
them than required is never performed.

Assume the ASG scales down and terminates the on-demand instances that
AutoSpotting is not terminating [autospotting_min_on_demand_number].
On the next runs onDemandInstance will be nil and the spot instances in
excess will not be terminated.
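A sketch of the inverted order; the names mirror the commit message, but the
surrounding types are stubs and the exact control flow is an assumption:

```go
type instance struct{}

type autoScalingGroup struct{}

// stubs standing in for the real checks and actions
func (a *autoScalingGroup) needReplaceOnDemandInstances() bool         { return false }
func (a *autoScalingGroup) terminateRandomSpotInstanceIfHavingEnough() {}

func (a *autoScalingGroup) cronEventAction(onDemandInstance *instance) {
	if !a.needReplaceOnDemandInstances() {
		// runs even when onDemandInstance is nil, trimming excess spot capacity
		a.terminateRandomSpotInstanceIfHavingEnough()
		return
	}
	if onDemandInstance == nil {
		return
	}
	// ... proceed with the usual replacement logic ...
}
```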

* terminateUnneededSpotInstance - no need to call
asg.terminateRandomSpotInstanceIfHavingEnough

There is no need to call asg.terminateRandomSpotInstanceIfHavingEnough
in terminateUnneededSpotInstance: we already execute spotInstance.terminate
on the unattached spot instance.

* terminateInstanceInAutoScalingGroup is used for spot too

As terminateInstanceInAutoScalingGroup is also used for terminating spot
instances, the comment and logging message were changed.

* Avoid adding instances in a Terminating lifecycle state

The problem arises when an ASG has a lifecycle hook on terminating
instances and the lifecycle action takes longer than the AutoSpotting
schedule.

When terminating instances, the lifecycle hook (LH) puts the instance in
the Terminating:Wait lifecycle state (and later Terminating:Proceed).
But during that time, until the LH completes, the instance status is still
running.

So AutoSpotting, when counting running instances, will consider that
instance too.

This causes multiple problems:
* a spot instance can be terminated even if still needed
* an instance can be terminated even if it is already in a terminating state
* spot instances are launched even if not needed

The fix is very simple:
in the "scanInstances" func, do not execute "a.instances.add" if the
instance lifecycle state begins with Terminating, as sketched below.
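A minimal sketch of that guard; LifecycleState is the real field on the
aws-sdk-go autoscaling.Instance type, while the helper itself is an
assumption:

```go
import (
	"strings"

	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// shouldTrack filters out instances whose lifecycle state starts with
// "Terminating" (covering Terminating, Terminating:Wait and
// Terminating:Proceed), so they never enter the group's instance count.
func shouldTrack(inst *autoscaling.Instance) bool {
	return inst.LifecycleState == nil ||
		!strings.HasPrefix(*inst.LifecycleState, "Terminating")
}
```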

Co-authored-by: Chris <chris.farrenden@domain.com.au>
Co-authored-by: 0x11 <14269809+codenoid@users.noreply.github.com>
Co-authored-by: Jawad <jawad.stouli@gmail.com>
Co-authored-by: Gábor Lipták <gliptak@gmail.com>
@codeclimate codeclimate bot left a comment

The PR diff size of 7405 lines exceeds the maximum allowed for the inline comments feature.


codeclimate bot commented Sep 21, 2020

Code Climate has analyzed commit 7c92d9d and detected 12 issues on this pull request.

Here's the issue category breakdown:

Category Count
Complexity 11
Bug Risk 1

View more on Code Climate.

@cristim cristim merged commit af0fec8 into master Oct 1, 2020
mello7tre referenced this pull request in mello7tre/AutoSpotting Apr 22, 2021
@cristim cristim deleted the feat/event-based-instance-replacement branch May 26, 2021 11:12