
Enhance AWS SDK Instrumentation with Detailed HTTP Error Information #9448

Merged
merged 5 commits into from
Sep 29, 2023

Conversation

@pxaws (Contributor) commented Sep 12, 2023

Reason:

The existing AWS SDK instrumentation generates a span for every SDK call. While this provides a general overview, there are scenarios where deeper insight into the SDK call is desired. Specifically, there is interest in understanding the individual HTTP requests, particularly their error codes and messages. Enhancing the span with this detailed information can be invaluable for debugging and monitoring.

Consider this scenario: we identify a specific SDK span with an unusually long duration and want to understand the cause. At present, the SDK span lacks the information needed for this analysis. However, by recording the error message for each HTTP retry, we can deduce that the extended duration is due to multiple retries, which in turn are a result of backend throttling.

Implementation:

To achieve this, the pull request introduces the following changes:

  • The error code and error message of each failed HTTP request are added as events within the SDK span.
  • The afterTransmission hook, part of the AWS SDK's ExecutionInterceptor, is leveraged to achieve this enhancement.
  • After receiving an HTTP response, the afterTransmission hook is triggered. Within this hook, the HTTP response is inspected, and the error code and message (if any) are extracted.
  • These details are then packaged into an event, timestamped with the current time, and added to the SDK span.
  • We've introduced an experimental flag named experimental-record-individual-http-error, which defaults to false, to limit any potential impact. The described behaviors are only activated when this flag is set to true.
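The recording logic described above can be sketched in stdlib-only Java. Note that `HttpErrorRecorder`, `SpanEvent`, and the attribute key names here are hypothetical stand-ins for the real OpenTelemetry `Span#addEvent` call and the PR's actual attribute keys; only the shape of the logic (inspect the response after transmission, extract the error details, attach a timestamped event) follows the description:

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stdlib-only sketch; SpanEvent stands in for an OpenTelemetry span event.
public class HttpErrorRecorder {

    public static final class SpanEvent {
        public final String name;
        public final Map<String, Object> attributes;
        public final Instant timestamp;

        SpanEvent(String name, Map<String, Object> attributes, Instant timestamp) {
            this.name = name;
            this.attributes = attributes;
            this.timestamp = timestamp;
        }
    }

    // Stand-in for the SDK span the events would be attached to.
    private final List<SpanEvent> events = new ArrayList<>();

    // Called once per HTTP attempt, as the afterTransmission hook would be.
    // Status codes outside 2xx are treated as failures, mirroring
    // SdkHttpResponse#isSuccessful.
    public void afterTransmission(int statusCode, String errorMessage) {
        if (statusCode >= 200 && statusCode < 300) {
            return; // successful attempt: nothing to record
        }
        Map<String, Object> attrs = new LinkedHashMap<>();
        attrs.put("aws.http.error_message", errorMessage == null ? "" : errorMessage);
        attrs.put("aws.http.status_code", statusCode);
        // Package the details into an event timestamped with the current time.
        events.add(new SpanEvent("HTTP request failure", attrs, Instant.now()));
    }

    public List<SpanEvent> events() {
        return events;
    }
}
```

A throttled call that retries three times would produce three such events on the one SDK span, which is exactly the signal described in the motivation above.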

@pxaws pxaws requested a review from a team as a code owner September 12, 2023 23:15
@pxaws pxaws marked this pull request as draft September 12, 2023 23:32
@pxaws pxaws force-pushed the add-event-to-aws-sdk branch 3 times, most recently from b6c69d8 to 0a47a4f Compare September 19, 2023 18:18
@pxaws pxaws force-pushed the add-event-to-aws-sdk branch 7 times, most recently from 7ee8a0d to 36460a6 Compare September 19, 2023 23:00
@pxaws pxaws marked this pull request as ready for review September 20, 2023 00:23
@trask (Member) commented Sep 20, 2023

cc @rapphil @wangzlei @srprash

@@ -49,6 +57,11 @@ final class TracingExecutionInterceptor implements ExecutionInterceptor {
private final Instrumenter<ExecutionAttributes, SdkHttpResponse> requestInstrumenter;
private final Instrumenter<ExecutionAttributes, SdkHttpResponse> consumerInstrumenter;
private final boolean captureExperimentalSpanAttributes;
private static final Logger logger = Logger.getLogger(PluginImplUtil.class.getName());

static final AttributeKey<String> HTTP_ERROR_MSG =
A reviewer (Contributor) commented:
any plans to make this attribute part of the semantic conventions for the aws sdk?

@pxaws (Contributor, Author) replied:

Yes, it is in the long-term plan. Actually, I am not quite sure which part of the semantic conventions I should put it into, because this is an event attribute, not a span attribute (the previous link you gave seems to cover the conventions for span attributes). Any suggestions?

@rapphil (Contributor) commented Sep 20, 2023

by logging the error messages for each HTTP retry, we can deduce that the extended duration might be due to multiple retries, which are a result of backend throttling.

is the http status code not enough for that purpose? 429 is the status code for throttling.

Are there other examples where the error message returned by the backend might be useful?

@pxaws (Contributor, Author) commented Sep 21, 2023

is the http status code not enough for that purpose? 429 is the status code for throttling.

Recording only the error code is not ideal because users then need to go and figure out what the code means. I can't assume that everyone knows 429 means throttling :) In addition, an error code can correspond to multiple types of error cases.

Are there other examples where the error message returned by the backend might be useful?

Yes. I would say all error messages from the backend are useful, because they give more detail about the backend and sometimes indicate a client-side issue. For example, the following one is from the DynamoDB public doc: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Programming.Errors.html#Programming.Errors.MessagesAndCodes

LimitExceededException
Message: Too many operations for a given subscriber.

There are too many concurrent control plane operations. The cumulative number of tables and indexes in the CREATING, DELETING, or UPDATING state cannot exceed 500.

OK to retry? Yes

@rapphil (Contributor) commented Sep 26, 2023

@trask this LGTM and is ready for maintainers to review/merge.

Span span = Span.fromContext(otelContext);
SdkHttpResponse response = context.httpResponse();

if (!response.isSuccessful()) {
A reviewer (Member) commented:
Can response ever be null on the context? Should a null check be added?

@pxaws (Contributor, Author) replied:
Makes sense. I will add a null check on the response.

if (responseBody.isPresent()) {
String errorMsg =
new BufferedReader(
new InputStreamReader(responseBody.get(), Charset.defaultCharset()))
A reviewer (Member) commented:
Is the underlying InputStream still available for use by the SDK after calling responseBody.get()?

@pxaws (Contributor, Author) replied Sep 27, 2023:

Thank you for catching this!

I think there is no guarantee that the InputStream supports mark()/reset() or seeking, so the next time the same InputStream is read, it will not return anything.

To verify, I did a simple experiment. Based on this pipeline stage in the AWS Java SDK: https://github.com/aws/aws-sdk-java-v2/blob/d020d37138eee9d4d74e814086143d26d923fee0/core/sdk-core/src/main/java/software/amazon/awssdk/core/internal/http/pipeline/stages/AfterTransmissionExecutionInterceptorsStage.java#L43-L46, the modifyHttpResponse call happens after the afterTransmission call. So I tested the same InputStream in the modifyHttpResponse hook of ExecutionInterceptor and confirmed it is already empty (after being read out in afterTransmission).

It's a valid concern that the InputStream for the response body becomes useless after we read it out in the afterTransmission hook. Fortunately there is another hook, Optional<InputStream> modifyHttpResponseContent, which allows us to modify the content of a response. We can copy the content out and generate a new InputStream from it. A similar approach is already implemented in the AWS Java SDK: https://github.com/aws/aws-sdk-java-v2/blob/d020d37138eee9d4d74e814086143d26d923fee0/core/sdk-core/src/main/java/software/amazon/awssdk/core/internal/interceptor/HttpChecksumValidationInterceptor.java#L55-L73

I will make the change in the next commit.
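The stream-exhaustion problem discussed here, and the copy-and-replace fix it motivates, can be demonstrated with stdlib Java alone. `readAll` plays the role of the interceptor extracting the error message, and `replacementStream` shows what a modifyHttpResponseContent-style hook would hand back to the SDK; both names are illustrative, not the PR's actual helpers:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class StreamCopyDemo {

    // Read the whole stream, as the interceptor does when extracting the error
    // message. A plain InputStream offers no guaranteed mark()/reset(), so after
    // this call the original stream is exhausted and useless to later readers.
    public static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toString(StandardCharsets.UTF_8.name());
    }

    // The fix: having copied the bytes once, build a fresh InputStream over the
    // same content to hand back to the SDK in place of the exhausted one.
    public static InputStream replacementStream(String body) {
        return new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_8));
    }
}
```

Reading the original stream once leaves it returning -1 (end of stream) on every subsequent read, while the replacement stream yields the full body again, which is what lets both the instrumentation and the SDK's own response handling see the content.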

A reviewer (Member) replied:
Thanks @pxaws

Is there a way to verify in the test that the InputStream is available to the client and has the expected value?

@pxaws (Contributor, Author) replied:

Yes, I modified the test to do that. I defined another execution interceptor, registered with the AWS SDK after the TracingExecutionInterceptor used for SDK instrumentation, and verified from there that the response body contains the expected content. Please see the next commit. Thank you!

@pxaws (Contributor, Author) commented Sep 27, 2023

@kenfinnigan Could you review it again? Thank you!

@kenfinnigan (Member) left a comment:

Thanks for the changes @pxaws, looks good

@trask (Member) left a comment:

Thanks @kenfinnigan @rapphil for the reviews!

@trask trask merged commit 2724a87 into open-telemetry:main Sep 29, 2023
47 checks passed

5 participants