I have an EKS cluster with public/private API access on a VPC with public and private subnets. I've set up my ALB in the public subnets on port 80 (internet-facing, target type ip) and installed the AWS Load Balancer Controller following the AWS docs and the 2048 game deployment example. I am using GPU nodes and have also set up the Kubernetes GPU operator. I have a Deployment and Service for a Flask REST API.
After getting everything set up, I expected the EKS node instances I have running to register into my target group, but it's empty and there are no targets at all.
I'm struggling to find an answer as to why this is happening. I've been tweaking my Ingress and Deployment YAML files and thought it might be a selector/label issue, but that doesn't seem to be the case. My Deployment runs a Flask API on port 5000, and the Ingress routes a /health path to the Flask server's /health endpoint.
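To rule out the selector/label theory, these commands show whether the Service's selector actually matches a ready pod (names are the ones from my manifests below); if the Endpoints list is empty, the controller has nothing to register in the target group:

```shell
# If the selector doesn't match a Running pod, this list is empty
# and the ALB controller has no pod IPs to register.
kubectl get endpoints flask-api-app-service -n flask-api-app

# Compare the Service selector against the pod labels directly.
kubectl get svc flask-api-app-service -n flask-api-app -o jsonpath='{.spec.selector}'
kubectl get pods -n flask-api-app --show-labels
```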
Deployment.yaml:
---
apiVersion: v1
kind: Namespace
metadata:
  name: flask-api-app
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flask-api-deployment
  namespace: flask-api-app
  labels:
    app.kubernetes.io/name: flask-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: flask-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
  template:
    metadata:
      labels:
        app.kubernetes.io/name: flask-app
    spec:
      containers:
        - name: flask-app
          image: xxxxxxxxxxxxxxxxxxxxxxx
          imagePullPolicy: Always
          ports:
            - containerPort: 5000
          volumeMounts:
            - name: persistent-storage
              mountPath: /data
      restartPolicy: Always
      volumes:
        - name: persistent-storage
          persistentVolumeClaim:
            claimName: efs-claim
---
apiVersion: v1
kind: Service
metadata:
  name: flask-api-app-service
  namespace: flask-api-app
  labels:
    app.kubernetes.io/name: flask-app
spec:
  type: NodePort
  selector:
    app.kubernetes.io/name: flask-app
  ports:
    - name: http
      port: 80
      targetPort: 5000
      protocol: TCP
ingress.yaml:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  namespace: flask-api-app
  name: flask-ingress-3
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/is-default-class: "true"
  labels:
    app.kubernetes.io/name: flask-app
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /health
            pathType: Prefix
            backend:
              service:
                name: flask-api-app-service
                port:
                  number: 80
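To confirm the controller actually reconciled this Ingress and provisioned an ALB, I can check whether it got an address (the names here are the ones from my ingress.yaml):

```shell
# ADDRESS should show the ALB DNS name once the controller reconciles.
kubectl get ingress flask-ingress-3 -n flask-api-app

# Events at the bottom show any reconcile errors (subnets, IAM, etc.).
kubectl describe ingress flask-ingress-3 -n flask-api-app
```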
This is the Dockerfile I built for the deployment:
# start by pulling the python image
FROM python:3.9
# copy the requirements file into the image
COPY ./requirements.txt /app/requirements.txt
# switch working directory
WORKDIR /app
# install the dependencies and packages in the requirements file
RUN pip install -r requirements.txt
# copy every content from the local file to the image
COPY . /app
# Expose port 5000 for Gunicorn
EXPOSE 5000
# Configure the container to run with Gunicorn
CMD ["gunicorn", "--bind", "0.0.0.0:5000", "main:app"]
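For reference, the app the Dockerfile runs as main:app looks roughly like this (a minimal sketch; the real main.py has more routes, but only the /health route matters for the ALB health check):

```python
# Minimal sketch of main.py -- gunicorn loads this module as "main:app".
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/health")
def health():
    # The ALB health check expects an HTTP 200 from this path.
    return jsonify(status="ok"), 200

if __name__ == "__main__":
    # Local dev only; in the container gunicorn binds 0.0.0.0:5000.
    app.run(host="0.0.0.0", port=5000)
```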
I also ran the command kubectl describe targetgroupbindings -n flask-api-app
and this was the result:
Name:         k8s-flaskapi-flaskapi-c99c751836
Namespace:    flask-api-app
Labels:       ingress.k8s.aws/stack-name=flask-ingress-3
              ingress.k8s.aws/stack-namespace=flask-api-app
Annotations:  <none>
API Version:  elbv2.k8s.aws/v1beta1
Kind:         TargetGroupBinding
Metadata:
  Creation Timestamp:  xxxxxxxxxxxxxxxx
  Finalizers:
    elbv2.k8s.aws/resources
  Generation:        1
  Resource Version:  1802318
  UID:               xxxxxxxxxxxxxxxxxxxxxxxxx
Spec:
  Ip Address Type:  ipv4
  Networking:
    Ingress:
      From:
        Security Group:
          Group ID:  xxxxxxxxxxxxxxxxxxxx
      Ports:
        Port:      5000
        Protocol:  TCP
  Service Ref:
    Name:            flask-api-app-service
    Port:            80
  Target Group ARN:  xxxxxxxxxxxxxxxxxxxxxxxx
  Target Type:       ip
Status:
  Observed Generation:  1
Events:
  Type    Reason                  Age                From                Message
  ----    ------                  ---                ----                -------
  Normal  SuccessfullyReconciled  10m (x2 over 10m)  targetGroupBinding  Successfully reconciled
namespaces:
kubectl get namespaces
NAME STATUS AGE
default Active 8d
flask-api-app Active 40m
gpu-operator Active 7d7h
kube-node-lease Active 8d
kube-public Active 8d
kube-system Active 8d
kubectl get all -n kube-system
NAME READY STATUS RESTARTS AGE
pod/aws-load-balancer-controller-6bf4b948d6-c2h9s 1/1 Running 0 40m
pod/aws-load-balancer-controller-6bf4b948d6-h4sqp 1/1 Running 0 40m
pod/aws-node-25wtp 2/2 Running 0 51m
pod/aws-node-mfgjn 2/2 Running 0 51m
pod/coredns-6c857f58b4-hhq74 1/1 Running 0 50m
pod/coredns-6c857f58b4-mn2k2 1/1 Running 0 65m
pod/efs-csi-controller-bb6f8464b-tjd4j 3/3 Running 0 65m
pod/efs-csi-controller-bb6f8464b-zzrjl 3/3 Running 0 65m
pod/efs-csi-node-6rj6n 3/3 Running 0 51m
pod/efs-csi-node-kdfmh 3/3 Running 0 51m
pod/eks-pod-identity-agent-pbk84 1/1 Running 0 51m
pod/eks-pod-identity-agent-qnh8b 1/1 Running 0 51m
pod/kube-proxy-d59bz 1/1 Running 0 51m
pod/kube-proxy-n4vjr 1/1 Running 0 51m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/aws-load-balancer-webhook-service ClusterIP 172.20.157.47 <none> 443/TCP 40m
service/kube-dns ClusterIP 172.20.0.10 <none> 53/UDP,53/TCP,9153/TCP 8d
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/aws-node 2 2 2 2 2 <none> 8d
daemonset.apps/efs-csi-node 2 2 2 2 2 kubernetes.io/os=linux 6d
daemonset.apps/eks-pod-identity-agent 2 2 2 2 2 <none> 8d
daemonset.apps/kube-proxy 2 2 2 2 2 <none> 8d
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/aws-load-balancer-controller 2/2 2 2 40m
deployment.apps/coredns 2/2 2 2 8d
deployment.apps/efs-csi-controller 2/2 2 2 6d
NAME DESIRED CURRENT READY AGE
replicaset.apps/aws-load-balancer-controller-6bf4b948d6 2 2 2 40m
replicaset.apps/coredns-6556f9967c 0 0 0 8d
replicaset.apps/coredns-6c857f58b4 2 2 2 8d
replicaset.apps/efs-csi-controller-bb6f8464b 2 2 2 6d
kubectl get all -n gpu-operator
NAME READY STATUS RESTARTS AGE
pod/gpu-feature-discovery-nh94s 0/1 Init:0/1 0 51m
pod/gpu-feature-discovery-v8fgf 0/1 Init:0/1 0 51m
pod/gpu-operator-1714659266-node-feature-discovery-gc-67c4bd66t7t7j 1/1 Running 0 66m
pod/gpu-operator-1714659266-node-feature-discovery-master-5598gsztr 1/1 Running 0 66m
pod/gpu-operator-1714659266-node-feature-discovery-worker-229sp 1/1 Running 0 52m
pod/gpu-operator-1714659266-node-feature-discovery-worker-5z6kj 1/1 Running 0 52m
pod/gpu-operator-cc9db7497-l2s89 1/1 Running 0 66m
pod/nvidia-container-toolkit-daemonset-6dt46 0/1 Init:0/1 0 51m
pod/nvidia-container-toolkit-daemonset-mx4w4 0/1 Init:0/1 0 51m
pod/nvidia-dcgm-exporter-6nh2x 0/1 Init:0/1 0 51m
pod/nvidia-dcgm-exporter-96hww 0/1 Init:0/1 0 51m
pod/nvidia-device-plugin-daemonset-jg4d9 0/1 Init:0/1 0 51m
pod/nvidia-device-plugin-daemonset-r524n 0/1 Init:0/1 0 51m
pod/nvidia-driver-daemonset-rfj5c 0/1 ImagePullBackOff 0 52m
pod/nvidia-driver-daemonset-rgpgh 0/1 ImagePullBackOff 0 52m
pod/nvidia-operator-validator-4mkt9 0/1 Init:0/4 0 51m
pod/nvidia-operator-validator-9kj2s 0/1 Init:0/4 0 51m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/gpu-operator ClusterIP 172.20.82.101 <none> 8080/TCP 7d7h
service/nvidia-dcgm-exporter ClusterIP 172.20.248.145 <none> 9400/TCP 7d7h
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/gpu-feature-discovery 2 2 0 2 0 nvidia.com/gpu.deploy.gpu-feature-discovery=true 7d7h
daemonset.apps/gpu-operator-1714659266-node-feature-discovery-worker 2 2 2 2 2 <none> 7d7h
daemonset.apps/nvidia-container-toolkit-daemonset 2 2 0 2 0 nvidia.com/gpu.deploy.container-toolkit=true 7d7h
daemonset.apps/nvidia-dcgm-exporter 2 2 0 2 0 nvidia.com/gpu.deploy.dcgm-exporter=true 7d7h
daemonset.apps/nvidia-device-plugin-daemonset 2 2 0 2 0 nvidia.com/gpu.deploy.device-plugin=true 7d7h
daemonset.apps/nvidia-device-plugin-mps-control-daemon 0 0 0 0 0 nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/mps.capable=true 7d7h
daemonset.apps/nvidia-driver-daemonset 2 2 0 2 0 nvidia.com/gpu.deploy.driver=true 7d7h
daemonset.apps/nvidia-mig-manager 0 0 0 0 0 nvidia.com/gpu.deploy.mig-manager=true 7d7h
daemonset.apps/nvidia-operator-validator 2 2 0 2 0 nvidia.com/gpu.deploy.operator-validator=true 7d7h
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/gpu-operator 1/1 1 1 7d7h
deployment.apps/gpu-operator-1714659266-node-feature-discovery-gc 1/1 1 1 7d7h
deployment.apps/gpu-operator-1714659266-node-feature-discovery-master 1/1 1 1 7d7h
NAME DESIRED CURRENT READY AGE
replicaset.apps/gpu-operator-1714659266-node-feature-discovery-gc-67c4bd6644 1 1 1 7d7h
replicaset.apps/gpu-operator-1714659266-node-feature-discovery-master-559868b8df 1 1 1 7d7h
replicaset.apps/gpu-operator-cc9db7497 1 1 1 7d7h
kubectl get all -n flask-api-app
NAME READY STATUS RESTARTS AGE
pod/flask-api-deployment-59c668dcf8-wzl6p 0/1 Pending 0 44m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/flask-api-app-service NodePort 172.20.201.77 <none> 80:32235/TCP 44m
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/flask-api-deployment 0/1 1 0 44m
NAME DESIRED CURRENT READY AGE
replicaset.apps/flask-api-deployment-59c668dcf8 1 1 0 44m
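The flask pod is stuck in Pending, which on its own would leave an ip-mode target group empty, since only running pod IPs get registered. To see why it's Pending, I've been looking at the pod events and the PVC (names are from my manifests):

```shell
# The Events section usually names the scheduling blocker
# (e.g. unbound PVC, insufficient GPU/CPU, taints).
kubectl describe pod -n flask-api-app -l app.kubernetes.io/name=flask-app

# The EFS PVC referenced by the Deployment must be Bound
# before the pod can schedule.
kubectl get pvc efs-claim -n flask-api-app
```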
Environment:
- OS: Amazon Linux 2 / Ubuntu
- AWS Load Balancer Controller version: v2.7.2
This is the output of kubectl describe deployment -n kube-system aws-load-balancer-controller:
Name:                   aws-load-balancer-controller
Namespace:              kube-system
CreationTimestamp:      Thu, 09 May 2024 17:00:58 -0400
Labels:                 app.kubernetes.io/instance=aws-load-balancer-controller
                        app.kubernetes.io/managed-by=Helm
                        app.kubernetes.io/name=aws-load-balancer-controller
                        app.kubernetes.io/version=v2.7.2
                        helm.sh/chart=aws-load-balancer-controller-1.7.2
Annotations:            deployment.kubernetes.io/revision: 1
                        meta.helm.sh/release-name: aws-load-balancer-controller
                        meta.helm.sh/release-namespace: kube-system
Selector:               app.kubernetes.io/instance=aws-load-balancer-controller,app.kubernetes.io/name=aws-load-balancer-controller
Replicas:               2 desired | 2 updated | 2 total | 0 available | 2 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           app.kubernetes.io/instance=aws-load-balancer-controller
                    app.kubernetes.io/name=aws-load-balancer-controller
  Annotations:      prometheus.io/port: 8080
                    prometheus.io/scrape: true
  Service Account:  aws-load-balancer-controller
  Containers:
   aws-load-balancer-controller:
    Image:       public.ecr.aws/eks/aws-load-balancer-controller:v2.7.2
    Ports:       9443/TCP, 8080/TCP
    Host Ports:  0/TCP, 0/TCP
    Args:
      --cluster-name=Veras-EKS-Test-Cluster
      --ingress-class=alb
    Liveness:     http-get http://:61779/healthz delay=30s timeout=10s period=10s #success=1 #failure=2
    Readiness:    http-get http://:61779/readyz delay=10s timeout=10s period=10s #success=1 #failure=2
    Environment:  <none>
    Mounts:
      /tmp/k8s-webhook-server/serving-certs from cert (ro)
  Volumes:
   cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  aws-load-balancer-tls
    Optional:    false
  Priority Class Name:  system-cluster-critical
Conditions:
  Type         Status  Reason
  ----         ------  ------
  Progressing  True    NewReplicaSetAvailable
  Available    False   MinimumReplicasUnavailable
OldReplicaSets:  <none>
NewReplicaSet:   aws-load-balancer-controller-6bf4b948d6 (2/2 replicas created)
Events:          <none>
- Kubernetes version: 1.29
- Using EKS (yes/no), if so version?: yes, platform version eks.6