
Loading and writing of statefiles for outputs is not threadsafe #330

sosafe-felix-bell opened this issue Sep 27, 2023 · 0 comments

Checklist

  • Upgrade Terraspace: Are you using the latest version of Terraspace? This allows Terraspace to fix issues fast. There's an Upgrading Guide: https://terraspace.cloud/docs/misc/upgrading/
  • Reproducibility: Are you reporting a bug that others will be able to reproduce, and not asking a question? If you're unsure or want to ask a question, do so on https://community.boltops.com
  • Code sample: Have you put together a code sample to reproduce the issue and made it available? Code samples help speed up fixes dramatically. If it's an easily reproducible issue, then code samples are not needed. If you're unsure, please include a code sample.

My Environment

Software            Version
Operating System    Debian 11
Terraform           1.3.9
Terraspace          latest
Ruby                3.1.4

Expected Behaviour

When running stacks concurrently in a batch with terraspace all up, we expect it to pull all remote states used for outputs in parallel without running into errors.

Current Behavior

When pulling and reading remote states for use with Terraspace outputs, it sometimes runs into an error when two stacks in the same batch depend on the same stack. In this case the error message itself becomes the actual output:
(Output <output> was not found for the <dependend_stack> tfvars file. Either <stack_with_output> stack has not been deployed yet or it does not have this output: <output>. Also, if local backend is being used and has been removed/cleaned, then it will also result zero-byte state.json with the 'terraform state pull' used to download the terraform state and output will not be found.)
This has led to unwanted deletion of resources, because the attributes of affected resources changed from their actual values to the error message. Since this value was never validated, the resources were deleted, and Terraform only failed once the cloud API rejected the value. That left us with the deleted resources.

Step-by-step reproduction instructions

This seems to be a race condition in Ruby's subprocess handling, so we could only reproduce it by running pipelines repeatedly to force the issue to trigger. We tracked it down by searching the Terraspace source code for the error message that showed up as the value and following the call stack. From this we could tell that it had to occur somewhere between terraform (called by Terraspace) pulling the statefile and Terraspace reading it. We quickly suspected that it is caused by two or more of the subprocesses created by the terraspace all command (see: deploy_batch.rb).
When executing a command like terraform state pull > /path/to/file, the shell truncates the target file as soon as the command starts, and terraform only writes the pulled state into it when it finishes. While the command is running, the file is empty (zero bytes) for any other process that reads it, as the small demo below shows.
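The effect is easy to demonstrate outside of Terraspace. The following minimal Ruby sketch (the path and the slow echo command are made up for the demo, this is not Terraspace code) shows that the redirect target is truncated the moment the command starts, long before its output arrives:

require "fileutils"

state_path = "/tmp/truncation_demo/state.json"   # hypothetical path for the demo
FileUtils.mkdir_p(File.dirname(state_path))
File.write(state_path, '{"outputs":{"foo":{"value":"bar"}}}')

# Simulate a slow `terraform state pull > state.json`: the shell opens the
# redirect target with O_TRUNC immediately, but the JSON only arrives later.
writer = Thread.new do
  system("(sleep 2; echo '{\"outputs\":{}}') > #{state_path}")
end

sleep 0.5
puts "size while the command runs: #{File.size(state_path)}"  # => 0 (already truncated)
writer.join
puts "size after the command ends: #{File.size(state_path)}"  # => non-zero again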
To verify this we added log statements with timestamps before and after the command execution in fetcher.rb:

diff --git a/lib/terraspace/terraform/remote_state/fetcher.rb b/lib/terraspace/terraform/remote_state/fetcher.rb
index b3f9ca3..de26b98 100644
--- a/lib/terraspace/terraform/remote_state/fetcher.rb
+++ b/lib/terraspace/terraform/remote_state/fetcher.rb
@@ -72,8 +72,11 @@ module Terraspace::Terraform::RemoteState
 
       FileUtils.mkdir_p(File.dirname(state_path))
       command = "cd #{@child.cache_dir} && terraform state pull > #{state_path}"
-      logger.debug "=> #{command}"
+      time_before = DateTime.now.strftime "%Y-%m-%d %H:%M:%S.%L"
+      logger.debug "[#{time_before}] => #{command}"
       success = system(command)
+      time_after = DateTime.now.strftime "%Y-%m-%d %H:%M:%S.%L"
+      logger.debug "[#{time_after}] => Finished #{command}"
       # Can error if using a old terraform version and the statefile was created with a newer version of terraform
       # IE: Failed to refresh state: state snapshot was created by Terraform v0.13.2, which is newer than current v0.12.29;
       #     upgrade to Terraform v0.13.2 or greater to work with this state

The next time the issue occurred we were able to extract the following log:

Log Output
Running:
    terraspace up ...                            # batch 1
    ...
    terraspace up stack1                         # batch 6
    terraspace up stack2                         # batch 6
...
Batch Run 6:
WARN: Could not find the pid in the logfile log/up/mq.log
WARN: Could not find the pid in the logfile log/up/mq_ec2.log
Created .terraspace-cache/region/account/stacks/stack2/backend.tf
Created .terraspace-cache/region/account/stacks/stack1/backend.tf
...
Layers for stack2:
Created .terraspace-cache/region/account/stacks/stack2/1-project-base.auto.tfvars
Created .terraspace-cache/region/account/stacks/stack2/2-project-account>.auto.tfvars
Downloading tfstate files for dependencies defined in tfvars...
Downloading tfstate for stack: dependency1
[2023-09-26 14:11:47.255] => Started: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency1 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency1/state.json
Layers for stack1:
Created .terraspace-cache/region/account/stacks/stack1/1-project-base.auto.tfvars
Created .terraspace-cache/region/account/stacks/stack1/2-project-account>.auto.tfvars
Downloading tfstate files for dependencies defined in tfvars...
Downloading tfstate for stack: dependency1
[2023-09-26 14:11:47.272] => Started: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency1 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency1/state.json
[2023-09-26 14:11:49.321] => Finished: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency1 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency1/state.json
[2023-09-26 14:11:49.321] => Started reading: /tmp/terraspace/remote_state/stacks/dependency1/state.json
[2023-09-26 14:11:49.325] => Finished reading: /tmp/terraspace/remote_state/stacks/dependency1/state.json
Downloading tfstate for stack: dependency2
[2023-09-26 14:11:49.326] => Started: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency2 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency2/state.json
[2023-09-26 14:11:49.328] => Finished: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency1 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency1/state.json
[2023-09-26 14:11:49.329] => Started reading: /tmp/terraspace/remote_state/stacks/dependency1/state.json
[2023-09-26 14:11:49.331] => Finished reading: /tmp/terraspace/remote_state/stacks/dependency1/state.json
Downloading tfstate for stack: dependency2
[2023-09-26 14:11:49.332] => Started: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency2 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency2/state.json
[2023-09-26 14:11:51.463] => Finished: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency2 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency2/state.json
[2023-09-26 14:11:51.463] => Started reading: /tmp/terraspace/remote_state/stacks/dependency2/state.json
[2023-09-26 14:11:51.465] => Finished reading: /tmp/terraspace/remote_state/stacks/dependency2/state.json
Downloading tfstate for stack: dependency3
[2023-09-26 14:11:51.468] => Started: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency3 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency3/state.json --- Thread1 starts downloading state file for dependency3
[2023-09-26 14:11:51.518] => Finished: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency2 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency2/state.json
[2023-09-26 14:11:51.518] => Started reading: /tmp/terraspace/remote_state/stacks/dependency2/state.json
[2023-09-26 14:11:51.519] => Finished reading: /tmp/terraspace/remote_state/stacks/dependency2/state.json
Downloading tfstate for stack: dependency4
[2023-09-26 14:11:51.520] => Started: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency4 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency4/state.json
[2023-09-26 14:11:53.464] => Finished: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency4 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency4/state.json
[2023-09-26 14:11:53.465] => Started reading: /tmp/terraspace/remote_state/stacks/dependency4/state.json
[2023-09-26 14:11:53.466] => Finished reading: /tmp/terraspace/remote_state/stacks/dependency4/state.json
Downloading tfstate for stack: dependency3
[2023-09-26 14:11:53.469] => Started: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency3 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency3/state.json --- Thread2 starts downloading the same state
[2023-09-26 14:11:53.472] => Finished: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency3 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency3/state.json --- Thread1 finishes writing the state file
[2023-09-26 14:11:53.472] => Started reading: /tmp/terraspace/remote_state/stacks/dependency3/state.json --- A thread reads the state file, but it is currently empty because thread2 is writing to it
[2023-09-26 14:11:53.472] => Finished reading: /tmp/terraspace/remote_state/stacks/dependency3/state.json
DEBUG: (Output output> was not found for the stack1 tfvars file. Either dependency3 stack has not been deployed yet or it does not have this output: output>. Also, if local backend is being used and has been removed/cleaned, then it will also result zero-byte state.json with the 'terraform state pull' used to download the terraform state and output will not be found.)
Downloading tfstate for stack: dependency4
[2023-09-26 14:11:53.473] => Started: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency4 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency4/state.json
[2023-09-26 14:11:55.439] => Finished: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency3 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency3/state.json --- Thread2 finishes writing the state file
[2023-09-26 14:11:55.439] => Started reading: /tmp/terraspace/remote_state/stacks/dependency3/state.json
[2023-09-26 14:11:55.440] => Finished reading: /tmp/terraspace/remote_state/stacks/dependency3/state.json
Downloading tfstate for stack: dependency5
[2023-09-26 14:11:55.441] => Started: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency5 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency5/state.json
[2023-09-26 14:11:55.457] => Finished: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency4 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency4/state.json
[2023-09-26 14:11:55.457] => Started reading: /tmp/terraspace/remote_state/stacks/dependency4/state.json
[2023-09-26 14:11:55.458] => Finished reading: /tmp/terraspace/remote_state/stacks/dependency4/state.json
Downloading tfstate for stack: dependency5
[2023-09-26 14:11:55.460] => Started: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency5 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency5/state.json
[2023-09-26 14:11:57.376] => Finished: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency5 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency5/state.json
[2023-09-26 14:11:57.376] => Started reading: /tmp/terraspace/remote_state/stacks/dependency5/state.json
[2023-09-26 14:11:57.376] => Finished reading: /tmp/terraspace/remote_state/stacks/dependency5/state.json
Created .terraspace-cache/region/account/stacks/stack1/3-base.auto.tfvars
Created .terraspace-cache/region/account/stacks/stack1/4-account>.auto.tfvars
Built in .terraspace-cache/region/account/stacks/stack1
Running: terraspace up stack1 Logs: log/up/stack1.log
[2023-09-26 14:11:57.388] => Finished: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency5 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency5/state.json
[2023-09-26 14:11:57.388] => Started reading: /tmp/terraspace/remote_state/stacks/dependency5/state.json
[2023-09-26 14:11:57.388] => Finished reading: /tmp/terraspace/remote_state/stacks/dependency5/state.json
Created .terraspace-cache/region/account/stacks/stack2/3-base.auto.tfvars
Created .terraspace-cache/region/account/stacks/stack2/4-account>.auto.tfvars
Built in .terraspace-cache/region/account/stacks/stack2
Running: terraspace up stack2 Logs: log/up/stack2.log
Writing TFLint output to /__w/terraform/terraform/log/tflint/stack2.log
Writing TFLint output to /__w/terraform/terraform/log/tflint/stack1.log
WARN: Could not find the pid in the logfile log/up/stack2.log
WARN: Could not find the pid in the logfile log/up/stack1.log
Error running: terraspace up stack1. Fix the error above or check logs for the error.
Error: Error: failed to run script step: command terminated with non-zero exit code: error executing command [sh -e /__w/_temp/bfbeddf0-5c75-11ee-bae9-b572f09d7d5f.sh], exit code 2
Error: Process completed with exit code 1.
Error: Executing the custom container implementation failed. Please contact your self hosted runner administrator.

The annotations after the --- markers in the log indicate which thread starts or finishes which operation.

Code Sample

Solution Suggestion

This could probably be handled with a lock: either a thread-safe variable, checked by all other subprocesses, that signals whether the state.json is available, or a lock file created before any write operation that other subprocesses can also check. If acquiring the lock fails, the program should fall into a retry loop. A rough sketch of the lock-file variant is shown below.
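The sketch below is only an illustration of the lock-file idea, not the actual Terraspace implementation; the method name, the lock-file naming, and the temp-file scheme are made up. It serializes "pull then read" per state file with an advisory flock and only exposes a fully written file via an atomic rename, so readers never observe a truncated state.json:

require "fileutils"

# Hypothetical helper: pull the remote state for one stack under an exclusive
# advisory lock, writing to a temp file and renaming it into place so that
# concurrent readers never see a zero-byte state.json.
def pull_state_with_lock(cache_dir, state_path)
  FileUtils.mkdir_p(File.dirname(state_path))
  lock_path = "#{state_path}.lock"

  File.open(lock_path, File::RDWR | File::CREAT) do |lock|
    lock.flock(File::LOCK_EX)            # other threads/processes block here

    # Another worker may have pulled this state while we waited for the lock.
    return true if File.size?(state_path)

    tmp_path = "#{state_path}.tmp.#{Process.pid}.#{Thread.current.object_id}"
    success = system("cd #{cache_dir} && terraform state pull > #{tmp_path}")

    if success && File.size?(tmp_path)
      File.rename(tmp_path, state_path)  # atomic on the same filesystem
      true
    else
      FileUtils.rm_f(tmp_path)
      false                              # caller can retry or abort the run
    end
  end
end

Because each caller opens its own descriptor on the lock file, the same flock call serializes both threads within one Terraspace process and separate subprocesses.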

It would also be helpful to fail the Terraspace operation when the output could not be loaded, instead of continuing with the error message as the value.
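As an illustration only (the class and method names below are invented and not part of Terraspace's API), the output lookup could raise instead of silently substituting the message, assuming the usual terraform state JSON structure:

# Illustrative only: raise when a requested output is missing instead of
# returning the error message as the tfvars value.
class MissingRemoteOutputError < StandardError; end

def resolve_output(state, output_name)
  outputs = state.fetch("outputs", {})
  unless outputs.key?(output_name)
    raise MissingRemoteOutputError,
          "Output #{output_name} not found in remote state; aborting instead of " \
          "injecting the error text into the tfvars file"
  end
  outputs.dig(output_name, "value")
end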

Other Issues

We found another issue referencing the same problem:
#300

sosafe-felix-bell added the bug label Sep 27, 2023
tongueroo added the help wanted label Sep 27, 2023