
kube2iam never recovers after node failures #78

Closed

rkoval opened this issue Jun 1, 2017 · 7 comments

rkoval commented Jun 1, 2017

I'm relatively new to Kubernetes and kube2iam, but I've been using both for a little over a month now. Earlier today, I attempted to test the fault tolerance of my cluster (created via kops) by terminating nodes and letting the ASG bring them back. My test cluster had one master and two worker nodes, and I terminated two worker nodes at the same time.

However, when kube2iam's Pods were recreated on the newly started nodes, they did not appear to serve the correct metadata from the EC2 metadata service as they had before the outage. There are no errors in the logs for any of the Pods either (only debug and info messages). I can confirm that kube2iam was definitely forwarding credentials properly before the test.

Below is an example config that I was using to verify before/after the test; it now errors consistently:

# simple aws job that will verify kube2iam is working
---
apiVersion: batch/v1
kind: Job
metadata:
  name: aws-cli
  labels:
    name: aws-cli
  namespace: kube-system
spec:
  template:
    metadata:
      annotations:
        # role with s3 get access
        iam.amazonaws.com/role: arn:aws:iam::${account_id}:role/route53-kubernetes-role
    spec:
      restartPolicy: Never
      containers:
        - image: fstab/aws-cli
          command:
            - "/home/aws/aws/env/bin/aws"
            - "s3api"
            - "get-object"
            - "--bucket"
            - "my-bucket-yay"
            - "--key"
            - "feeds/ryan-koval/feed.json"
            - "feed.json"
          name: aws-cli

Below is the config I'm using for kube2iam:

# used to enable IAM access to docker containers within k8s
# https://github.com/jtblin/kube2iam
---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: kube2iam
  labels:
    app: kube2iam
  namespace: kube-system
spec:
  template:
    metadata:
      labels:
        name: kube2iam
    spec:
      hostNetwork: true
      containers:
        - image: jtblin/kube2iam:0.6.1
          name: kube2iam
          args:
            - "--iptables=true"
            - "--host-ip=$(HOST_IP)"
            - "--verbose"
            - "--debug"
          env:
            - name: HOST_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
          ports:
            - containerPort: 8181
              hostPort: 8181
              name: http
          securityContext:
            privileged: true

This is what the debug service was outputting at localhost:8181/debug/store, if it helps:

{
  "namespaceByIP": {
    "": "default",
    "100.96.3.130": "kube-system",
    "100.96.3.5": "kube-system",
    "100.96.3.6": "kube-system",
    "100.96.3.7": "kube-system",
    "100.96.4.2": "kube-system",
    "100.96.4.3": "kube-system",
    "100.96.4.6": "kube-system",
    "100.96.4.8": "kube-system",
    "172.20.36.35": "kube-system",
    "172.20.44.110": "kube-system",
    "172.20.57.20": "kube-system"
  },
  "rolesByIP": {
    "100.96.3.130": "arn:aws:iam::${account_id}:role/route53-kubernetes-role"
  },
  "rolesByNamespace": {}
}

... and the Kubernetes cluster is running on 1.5.2.

This is all the info I can think of for now, but please let me know if there's anything else I can provide to help troubleshoot.


rkoval commented Jun 1, 2017

Also, I'm fairly certain this is not the same as #46, because I had a simple workaround script for that problem that waits for the role to become available in my actual application containers (tested previously, and it consistently worked against 1000 concurrently scheduled jobs):

metadata_url="http://169.254.169.254/latest/meta-data/iam/security-credentials/arn:aws:iam::${account_id}:role/route53-kubernetes-role"
until curl -f --silent "$metadata_url" --output /dev/null; do
  echo "waiting for kube2iam..."
  sleep 3
done


jtblin commented Jun 1, 2017

Looking at the output from debug, did you manually change the account id to ${account_id}, or was it really like this in the output?

{
  "rolesByIP": {
    "100.96.3.130": "arn:aws:iam::${account_id}:role/route53-kubernetes-role"
  }
}


rkoval commented Jun 1, 2017

Yes, sorry. I forgot to mention that ${account_id} was manually removed so as not to expose it in this post. It was not really like that in the output.


jtblin commented Jun 5, 2017

I can confirm that when I kill the container, i.e. sudo docker kill <container-id>, and the kube2iam container is recreated, it processes the roles again and adds them to the store accordingly. After that, subsequent hits to the EC2 metadata API return the credentials correctly.

I am not sure why restarting a node would be any different from adding a new node. Could it be an issue with the iptables rule or something else that did not happen when the node was restarted? How do you run the daemon, i.e. flags etc.?
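For reference, one way to check the iptables side is to look for kube2iam's metadata redirect directly on the worker node. This is a minimal sketch assuming kube2iam's --iptables=true behaviour of installing a DNAT rule for 169.254.169.254 in the nat PREROUTING chain; the exact rule text may differ by version and container interface:

# run on the worker node itself (not inside a pod)
sudo iptables -t nat -S PREROUTING | grep 169.254.169.254
# expect a DNAT rule pointing metadata traffic at <host-ip>:8181;
# if nothing prints after the node is replaced, the redirect was never reinstalled

If the rule is present but credentials still 404, the problem is more likely in kube2iam's pod/role store than in the routing itself.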


rkoval commented Jun 5, 2017

Could it be an issue with the iptables rule or something else that did not happen when the node was restarted?

I'm not an expert with iptables, but my initial thought is that the routing is fine. The metadata service is definitely still reachable from within my containers when the new node comes up; however, it no longer returns the role I annotated via iam.amazonaws.com/role like it did before. Curling the metadata service with the URL from my earlier comment now returns a 404 Not Found instead of a JSON payload with the credentials attached. For what it's worth, I've also tried deleting and recreating the kube2iam DaemonSet both before and after the node comes back, and that hasn't worked either.

My only real guess is that the metadata service, or the state around it, is somehow different when an actual node dies and restarts, and afterwards kube2iam isn't getting properly updated to serve the same roles/permissions once the new node is up.
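For completeness, this is roughly the check I'm describing, run from inside an annotated pod (same role ARN placeholder as above; treat it as a sketch rather than exact output):

# list the role kube2iam serves for this pod; a healthy kube2iam prints the annotated role
curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/
# fetch the credentials payload; prints 200 when healthy, 404 after the node replacement
curl -s -o /dev/null -w "%{http_code}\n" \
  "http://169.254.169.254/latest/meta-data/iam/security-credentials/arn:aws:iam::${account_id}:role/route53-kubernetes-role"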

How do you run the daemon i.e. flags etc.?

You mean the kube2iam DaemonSet? My config is in my original post.

@rgroothuijsen

I've been able to partially replicate this by repeatedly terminating instances and using the above templates, but the error only shows up occasionally. When it does occur, the errors also appear to stop within a few minutes.


rkoval commented Dec 11, 2017

For what it's worth, I haven't tested this since I reported it in early June, because we ended up not using kube2iam. However, newer versions may have fixed the problem or at least made it more self-healing. Feel free to close this if there's confidence that the issue is no longer prevalent.
