Unable to fetch queue messages #147

Open
heretogo opened this issue Nov 3, 2022 · 8 comments

@heretogo

heretogo commented Nov 3, 2022

Hello. I installed the WPA using the script in hack/install.sh.

I am encountering the following error, which I believe is permissions- or
namespace-related. I am running a v1.23 cluster on Amazon EKS.

E1103 15:50:14.855014       1 sqs.go:406] Unable to fetch no of messages to the queue "queue", Client not found for queue: https://sqs.us-east-1.amazonaws.com/myAccount/myQueueName

The WPA scaler runs in the kube-system namespace, and the WPA and example deployment run in a test namespace called eks-sample-app.
The WPA queueURI was configured manually using kubectl edit.
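
Because the queueURI was edited by hand, one cheap first check is whether it has the expected shape. The sketch below is only a sanity check for typos. It assumes the standard public endpoint form seen in the log line (`https://sqs.<region>.amazonaws.com/<account>/<queue>`), and the helper name `parse_queue_uri` is hypothetical. It does not explain the "Client not found" error by itself.

```python
import re

# Shape of an SQS queue URL as it appears in the log line above.
# Assumption: the standard public endpoint form; other endpoint styles differ.
QUEUE_URL = re.compile(
    r"^https://sqs\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<account>[^/]+)/(?P<queue>[^/]+)$"
)


def parse_queue_uri(uri):
    """Return region, account and queue name, or None if the URI is malformed."""
    match = QUEUE_URL.match(uri.strip())
    return match.groupdict() if match else None


# The anonymized URL from the error message in this issue.
print(parse_queue_uri("https://sqs.us-east-1.amazonaws.com/myAccount/myQueueName"))
```

A `None` result points to a typo introduced during `kubectl edit` (missing scheme, extra path segment, trailing slash) rather than an IAM problem.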

$ k get pods -n kube-system
NAME                                   READY   STATUS    RESTARTS   AGE
workerpodautoscaler-8667d55684-9zs6l   1/1     Running   0          72m

$ k get wpa -n eks-sample-app
NAME          AGE
example-wpa   85m

$ k get deployment -n eks-sample-app
NAME                          READY   UP-TO-DATE   AVAILABLE   AGE
example-deployment            1/1     1            1           7m34s

I have attached the following policy to the cluster service role.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "WPA",
            "Effect": "Allow",
            "Action": [
                "cloudwatch:GetMetricData",
                "sqs:ReceiveMessage",
                "sqs:GetQueueAttributes"
            ],
            "Resource": "*"
        }
    ]
}

Any idea how to debug this further?
I have already looked through the documentation in the repo's README.md.

An unrelated note: what is the context for the WPA Controller section of the
docs? In which context would workerpodautoscaler run be invoked? Is this a standalone
binary?

@alok87
Contributor

alok87 commented Nov 4, 2022

Is it possible to share the complete log?

@heretogo
Author

heretogo commented Nov 4, 2022

Hi @alok87: please see the example included in my issue. The WPA starts emitting "Unable to fetch no of messages" errors as soon as the container starts. There are no other kinds of log messages.

Note that I have anonymized the account number and queue name.

E1104 12:42:26.926463       1 sqs.go:406] Unable to fetch no of messages to the queue "queue", Client not found for queue: https://sqs.us-east-1.amazonaws.com/<account>/queue
.
E1104 12:42:26.926476       1 sqs.go:406] Unable to fetch no of messages to the queue "queue", Client not found for queue: https://sqs.us-east-1.amazonaws.com/<account>/queue
.
E1104 12:42:26.926498       1 sqs.go:406] Unable to fetch no of messages to the queue "queue", Client not found for queue: https://sqs.us-east-1.amazonaws.com/<account>/queue
.
E1104 12:42:26.926513       1 sqs.go:406] Unable to fetch no of messages to the queue "queue", Client not found for queue: https://sqs.us-east-1.amazonaws.com/<account>/queue
.
E1104 12:42:26.926527       1 sqs.go:406] Unable to fetch no of messages to the queue "queue", Client not found for queue: https://sqs.us-east-1.amazonaws.com/<account>/queue
.
E1104 12:42:26.926540       1 sqs.go:406] Unable to fetch no of messages to the queue "queue", Client not found for queue: https://sqs.us-east-1.amazonaws.com/<account>/queue
.
E1104 12:42:26.926552       1 sqs.go:406] Unable to fetch no of messages to the queue "queue", Client not found for queue: https://sqs.us-east-1.amazonaws.com/<account>/queue

@alok87
Contributor

alok87 commented Nov 4, 2022

Does the queue exist in SQS? Is it possible to try an SQS client with the same credentials and see whether data comes back?

I just want to rule out a configuration issue first.
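
One way to run that check is sketched below. It assumes Python with boto3 installed and the same credentials the WPA pod uses; the helper name `fetch_queue_depth` is hypothetical. The client is passed in so the function can also be exercised with a stub.

```python
def fetch_queue_depth(sqs_client, queue_url):
    """Ask SQS for the approximate visible and in-flight message counts."""
    response = sqs_client.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=[
            "ApproximateNumberOfMessages",
            "ApproximateNumberOfMessagesNotVisible",
        ],
    )
    attrs = response["Attributes"]  # SQS returns the counts as strings
    return {
        "visible": int(attrs["ApproximateNumberOfMessages"]),
        "in_flight": int(attrs["ApproximateNumberOfMessagesNotVisible"]),
    }


# Usage with a real client (assumes boto3 and the credentials under test):
#   import boto3
#   client = boto3.client("sqs", region_name="us-east-1")
#   print(fetch_queue_depth(client, "https://sqs.us-east-1.amazonaws.com/<account>/queue"))
```

If this call fails with an access error under the node's role, the problem is in the IAM policy rather than in the WPA configuration.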

@heretogo
Author

I generated temporary credentials manually using AssumeRole. I believe it is working now.

Previously my node's role permissions included the following as per this policy:

"cloudwatch:GetMetricData"
"sqs:GetQueueAttributes"
"sqs:ReceiveMessage"

It was resolved by granting all read permissions on SQS:

"cloudwatch:GetMetricData"
"sqs:GetQueueAttributes"
"sqs:GetQueueUrl"
"sqs:ListDeadLetterSourceQueues"
"sqs:ListQueueTags"
"sqs:ListQueues"
"sqs:ReceiveMessage"

My restarted WPA no longer logs any errors.
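
To make the change explicit, the sketch below diffs the two action lists from this comment and renders the widened policy in the same shape as the one attached originally. The thread does not establish which of the added actions was the one actually required, so treat the result as a list of candidates to narrow down; the helper names are hypothetical.

```python
import json

# Actions in the original policy and in the policy that stopped the errors.
BEFORE = {"cloudwatch:GetMetricData", "sqs:GetQueueAttributes", "sqs:ReceiveMessage"}
AFTER = BEFORE | {
    "sqs:GetQueueUrl",
    "sqs:ListDeadLetterSourceQueues",
    "sqs:ListQueueTags",
    "sqs:ListQueues",
}


def added_actions(before, after):
    """Actions present in the new policy but not in the old one."""
    return sorted(after - before)


def build_policy(actions):
    """Render an IAM policy document with a single Allow statement."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Sid": "WPA", "Effect": "Allow", "Action": sorted(actions), "Resource": "*"}
        ],
    }


print(added_actions(BEFORE, AFTER))
print(json.dumps(build_policy(AFTER), indent=4))
```

Removing the added actions one at a time and restarting the WPA would show which of them are really needed before the docs are updated.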

@alok87
Contributor

alok87 commented Nov 15, 2022

Can we close this?

@alok87
Contributor

alok87 commented Nov 15, 2022

Do you think we should update the policy section of the docs here? https://github.com/practo/k8s-worker-pod-autoscaler#install

@heretogo
Author

I feel like there may be something else missing.

Even though I get no permissions errors, I am unable to trigger a scaling operation on the deployment. Any ideas?

I have 10000+ messages in the queue and only one deployment pod running.
image

k get pods
NAME                                 READY   STATUS    RESTARTS   AGE
example-deployment-795d868d4-8nzfv   1/1     Running   0          7m19s

Does the WPA require some kind of write or tag attributes?

I can submit a PR for the documentation once I confirm this is working.

@alok87
Contributor

alok87 commented Nov 17, 2022

WPA supports verbose logging; maybe try running it with -v=4.

  • Also share the output of the WPA YAML:

k get wpa -o yaml <wpa_object>

  • Check whether the deployment replicas changed with the queue length.
  • Check whether the queue length in AWS shows the 10000+ messages. A screenshot of the SQS metrics posted here would help.
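
While reading the WPA YAML, it can help to estimate how many replicas the queue length should produce. The sketch below is a simplified model, not WPA's actual algorithm. It assumes the spec exposes a per-worker message target and replica bounds (the README's `targetMessagesPerWorker`, `minReplicas` and `maxReplicas` are the assumed counterparts), and the numbers in the example are hypothetical.

```python
import math


def expected_replicas(queue_messages, target_per_worker, min_replicas, max_replicas):
    """Rough replica estimate: backlog divided by per-worker target, then clamped.

    Simplified model for sanity checks only; the real controller may also
    consider in-flight messages, processing time and disruption limits.
    """
    if target_per_worker <= 0:
        raise ValueError("target_per_worker must be positive")
    wanted = math.ceil(queue_messages / target_per_worker)
    return max(min_replicas, min(max_replicas, wanted))


# Hypothetical spec values: 200 messages per worker, between 1 and 20 replicas.
print(expected_replicas(10000, 200, 1, 20))
```

If the estimate is well above 1 while the deployment stays at a single pod, the replica bounds and the queueURI in the WPA object are the first things to compare against the -v=4 logs.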
