
Validation WebHook troubleshooting, how low can you go?


I'm Alex Movergan, DevOps team lead at Altenar. I focus on automation in general and on improving troubleshooting skills within my team. In this article, I'll share a captivating tale that revolves around Kubernetes, validation webhooks, kubespray, and Calico.

Join me on this DevOps journey as we explore a real-world scenario and its practical solution, unraveling the intricacies of troubleshooting in a Kubernetes environment.

Introduction

Every time a new engineer joins our team, they undergo an onboarding process before diving into production. As part of this process, they build a small instance of our Sportsbook. Our Sportsbook runs on Kubernetes, deployed on several VMware clusters. This particular story takes place during the onboarding of one of our new engineers, and a few technological details will help you understand our environment.

Our infrastructure is based on the following technologies:

  • The Kubernetes cluster is deployed by Kubespray on top of VMware.
  • Calico in IPVS mode is used as the CNI.
  • MetalLB is used for internal load balancers.
  • Flux manages Kubernetes infrastructure settings such as network policies, namespaces, and PSS.


The task is to build a Kubernetes cluster using our internal Terraform module, which invokes the VMware provider and Kubespray under the hood. The cluster consists of six workers and one master.

Symptoms

It all began when our new engineer approached me and my senior colleagues for assistance after days of troubleshooting. Initially, the issue seemed simple: the MetalLB IPAddressPool deployment was failing, accompanied by the following error:

IPAddressPool/metallb/ip-address-pool dry-run failed,
reason: InternalError, error: Internal error occurred:
        failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io": failed to call webhook:
        Post "https://metallb-webhook-service.metallb.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s":
        context deadline exceeded

"What is happening here?" you may ask. Well, during installation, MetaLLB creates a ValidatingAdmissionWebhook. This webhook is called by the kube-api server to validate an object before creation. However, it is currently encountering a timeout, leading to the failure. 


Troubleshooting

Let's start with the basics. While calling the same URL from other pods works fine, the kube-api keeps flooding its logs with the error above. This leads us to the conclusion that the MetalLB controller is present and functioning correctly.
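As an illustration, such a check from a throwaway pod can look roughly like this (hitting the service root is enough to prove connectivity; -k skips certificate validation, -m sets a short timeout):

kubectl run webhook-check --rm -it --restart=Never --image=nicolaka/netshoot -- \
  curl -vk -m 5 https://metallb-webhook-service.metallb.svc:443/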

To isolate the issue, we eliminated other components. We scaled the Flux pods down to 0 and removed all network policies from the cluster. The result remained the same: the webhook service was accessible from other pods, but not from kube-api. Therefore, we could conclude that it was not a network policy problem.
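Roughly, the elimination steps looked like this (the flux-system namespace is the usual Flux location; adjust names to your setup):

# Stop Flux so it does not reconcile anything back while we test
kubectl -n flux-system scale deployment --all --replicas=0
# Drop every NetworkPolicy (repeat for each namespace that has policies)
kubectl delete networkpolicy --all -n <namespace>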

To replicate the issue and proceed further, we faced a challenge. The kube-api pod does not have a shell, so we could not exec into it and make the same URL call directly. To overcome this, we added a sidecar container. Since kube-api is not a regular pod but a static pod on the master node, we SSHed into the VM and modified the /etc/kubernetes/manifests/kube-apiserver.yaml file. Specifically, we added netshoot as a sidecar container:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubeadm.kubernetes.io/kube-apiserver.advertise-address.endpoint: 192.168.144.170:6443
  creationTimestamp: null
  labels:
    component: kube-apiserver
    tier: control-plane
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - name: netshoot
    image: nicolaka/netshoot
    command: ["/bin/bash"]
    args: ["-c", "while true; do ping localhost; sleep 60; done"]
  - command:
    - kube-apiserver
  ...
  hostNetwork: true
  priorityClassName: system-node-critical
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  volumes:
  - hostPath:
      path: /etc/ssl/certs

An important aspect to note is the hostNetwork: true attribute. It means the kube-api pod runs in the node's network namespace, i.e. it uses the node's own network stack directly.

To summarize the situation, we encountered the same outcome: the curl command failed with a timeout:

tstxknm000:~# curl https://192.168.145.97:443
curl: (28) Failed to connect to 192.168.145.97 port 443 after 131395 ms: Couldn't connect to server

At this point, we no longer needed to rely solely on kube-api logs; we could reproduce the issue ourselves, which simplifies the troubleshooting process. Moving forward, let's bring out the heavy artillery of network troubleshooting: tcpdump. Personally, I find tcpdump indispensable in network troubleshooting scenarios (and it always comes down to network troubleshooting, doesn't it?). Open tcpdump in one terminal tab and run the curl command in another:

tstxknm000:~# tcpdump -i vxlan.calico port 9443
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on vxlan.calico, link-type EN10MB (Ethernet), snapshot length 262144 bytes
10:38:19.641205 IP 192.168.146.129.59577 > 192.168.146.97.9443: Flags [S], seq 934288524, win 43690, options [mss 65495,sackOK,TS val 177509761 ecr 0,nop,wscale 10], length 0
10:38:20.666577 IP 192.168.146.129.59577 > 192.168.146.97.9443: Flags [S], seq 934288524, win 43690, options [mss 65495,sackOK,TS val 177510787 ecr 0,nop,wscale 10], length 0
10:38:22.714509 IP 192.168.146.129.59577 > 192.168.146.97.9443: Flags [S], seq 934288524, win 43690, options [mss 65495,sackOK,TS val 177512835 ecr 0,nop,wscale 10], length 0
10:39:35.197070 IP 192.168.146.129.30507 > 192.168.146.97.9443: Flags [S], seq 79391636, win 43690, options [mss 65495,sackOK,TS val 177585317 ecr 0,nop,wscale 10], length 0
10:39:36.250518 IP 192.168.146.129.30507 > 192.168.146.97.9443: Flags [S], seq 79391636, win 43690, options [mss 65495,sackOK,TS val 177586371 ecr 0,nop,wscale 10], length 0
10:41:17.163901 IP 192.168.146.129.62428 > 192.168.146.97.9443: Flags [S], seq 2885711851, win 43690, options [mss 65495,sackOK,TS val 177687284 ecr 0,nop,wscale 10], length 0
10:41:18.203494 IP 192.168.146.129.62428 > 192.168.146.97.9443: Flags [S], seq 2885711851, win 43690, options [mss 65495,sackOK,TS val 177688324 ecr 0,nop,wscale 10], length 0

If you paid close attention, you may have noticed that we used curl to connect to 192.168.145.97:443, while tcpdump showed traffic to 192.168.146.97:9443. This distinction arises from the way service traffic is handled in our cluster: IPVS, ipset, and iptables rules translate the service IP:port into a pod IP:port. In this case, 192.168.146.97:9443 is the address and port of the metallb-controller pod.
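For reference, that service-to-pod mapping can be confirmed on the node with something along these lines (a sketch; it assumes ipvsadm is installed and uses the service and namespace names from the error above):

# The virtual server for the service IP and its real-server backends, as seen by IPVS
ipvsadm -Ln | grep -A 2 192.168.145.97
# The same information from the Kubernetes side
kubectl -n metallb get endpoints metallb-webhook-service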

In conclusion, we were sending traffic but not receiving any response.

The next step involves identifying the network segment where the traffic is being lost. To achieve this, we add netshoot to the metallb-controller pod as well, allowing us to verify whether the traffic reaches the pod:

    containers:
    - name: netshoot
      image: nicolaka/netshoot
      command:
        - /bin/bash
      args:
        - '-c'
        - while true; do ping localhost; sleep 60; done
    - name: metallb-controller
      image: docker.io/bitnami/metallb-controller:0.13.7-debian-11-r29

Let's repeat the experiment, this time running tcpdump on both ends of the traffic pipeline:

On kube-api pod:

tstxknm000:~# tcpdump -i vxlan.calico port 9443
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on vxlan.calico, link-type EN10MB (Ethernet), snapshot length 262144 bytes
10:34:27.081229 IP 192.168.146.129.32229 > 192.168.146.97.9443: Flags [S], seq 3654443615, win 43690, options [mss 65495,sackOK,TS val 177277201 ecr 0,nop,wscale 10], length 0
10:34:28.090506 IP 192.168.146.129.32229 > 192.168.146.97.9443: Flags [S], seq 3654443615, win 43690, options [mss 65495,sackOK,TS val 177278211 ecr 0,nop,wscale 10], length 0


On metallb-controller pod:

metallb-controller-667f54487b-24bv5:~# tcpdump -nnn port 9443
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
11:18:50.381973 IP 192.168.146.129.32229 > 192.168.146.97.9443: Flags [S], seq 3654443615, win 43690, options [mss 65495,sackOK,TS val 179940505 ecr 0,nop,wscale 10], length 0
11:18:50.382000 IP 192.168.146.97.9443 > 192.168.146.129.32229: Flags [S.], seq 1413484485, ack 41975616, win 27960, options [mss 1410,sackOK,TS val 3365562262 ecr 179940505,nop,wscale 10], length 0
11:18:51.383183 IP 192.168.146.129.32229 > 192.168.146.97.9443: Flags [S], seq 3654443615, win 43690, options [mss 65495,sackOK,TS val 179941507 ecr 0,nop,wscale 10], length 0
11:18:51.383208 IP 192.168.146.97.9443 > 192.168.146.129.32229: Flags [S.], seq 1413484485, ack 41975616, win 27960, options [mss 1410,sackOK,TS val 3365563263 ecr 179940505,nop,wscale 10], length 0
11:18:52.445353 IP 192.168.146.97.9443 > 192.168.146.129.32229: Flags [S.], seq 1413484485, ack 41975616, win 27960, options [mss 1410,sackOK,TS val 3365564326 ecr 179940505,nop,wscale 10], length 0

This is becoming quite intriguing. The traffic successfully reaches its final destination, and the metallb-controller replies with a SYN-ACK every time. The issue lies in delivering those reply packets back to the kube-api pod.

After spending approximately two hours on trial and error, we decided to take a lunch break. Even on break we couldn't stop discussing the issue, and we came up with a few more tests we could attempt.

Let's try creating a simple pod with netshoot and the hostNetwork option set to true. A few Ctrl+C / Ctrl+V later, and we have another static pod running on the same master node.

Here is the YAML definition of that pod:
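(A minimal sketch dropped into /etc/kubernetes/manifests/ on the master; the pod name here is illustrative, and the image and command mirror the netshoot sidecar above.)

apiVersion: v1
kind: Pod
metadata:
  name: netshoot-hostnet
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: netshoot
    image: nicolaka/netshoot
    command: ["/bin/bash"]
    args: ["-c", "while true; do ping localhost; sleep 60; done"]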

We proceeded with the same steps, and as expected, curl failed with a timeout once again.

To simplify the testing further, we attempted to ping the metallb-controller from our netshoot container.

On netshoot:

tstxknm000:~# ping 192.168.146.97
PING 192.168.146.97 (192.168.146.97) 56(84) bytes of data.
^C
--- 192.168.146.97 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2079ms


On metallb-controller:

metallb-controller-667f54487b-24bv5:~# tcpdump
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:33:05.680242 IP 192.168.146.129 > metallb-controller-667f54487b-24bv5: ICMP echo request, id 10, seq 2060, length 64
12:33:05.680249 IP metallb-controller-667f54487b-24bv5 > 192.168.146.129: ICMP echo reply, id 10, seq 2060, length 64
12:33:05.691276 IP 192.168.146.129 > metallb-controller-667f54487b-24bv5: ICMP echo request, id 10, seq 2061, length 64
12:33:05.691290 IP metallb-controller-667f54487b-24bv5 > 192.168.146.129: ICMP echo reply, id 10, seq 2061, length 64
12:33:05.702445 IP 192.168.146.129 > metallb-controller-667f54487b-24bv5: ICMP echo request, id 10, seq 2062, length 64
12:33:05.702456 IP metallb-controller-667f54487b-24bv5 > 192.168.146.129: ICMP echo reply, id 10, seq 2062, length 64

The results were quite similar to what we observed earlier with curl. If you have reached this point, it is evident that the issue lies on the network side. Without delving into the specific documentation we explored regarding Calico, IPVS, and Kubernetes networking, I must say it has been quite an adventure.

The next idea we had was to ping the kube-api server pod from the metallb-controller:

ping 192.168.146.129
connect: Invalid argument

Wait, what? Excuse me? What on earth is happening here? We attempted the same command multiple times and even manually entered the correct IP address, but the outcome remained unchanged. What could possibly be the issue? Not to mention, we also tried using telnet and curl, and they all failed in the same manner.

Another tool that I find quite handy is strace. So, let's strace telnet and connect it to the kube-api pod IP:

strace telnet 192.168.146.129 8443
...
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
setsockopt(3, SOL_IP, IP_TOS, [16], 4)  = 0
connect(3, {sa_family=AF_INET, sin_port=htons(8443), sin_addr=inet_addr("192.168.146.129")}, 16) = -1 EINVAL (Invalid argument)
write(2, "telnet: connect to address 10.20"..., 60telnet: connect to address 192.168.146.129: Invalid argument
) = 60
close(3)                                = 0
exit_group(1)                           = ?
+++ exited with 1 +++


Upon examination, we noticed that the connect() call returned "-1 EINVAL (Invalid argument)". Intrigued, we dove into the documentation for the connect() system call. How exciting! Funny enough, the first Google result did not describe this failure code; luckily, the second one proved much more informative:

EINVAL
    The address_len argument is not a valid length for the address family;
    or invalid address family in the sockaddr structure.

In conclusion, there seems to be a network configuration issue with the IP addresses. The question remains, though: where exactly does the problem originate?

At this point, we knew we were on the right track. We started with the Calico IPPool configuration, shown below.
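It can be dumped with something like the following, assuming the Calico CRDs are installed (the pool name depends on your installation):

kubectl get ippools.crd.projectcalico.org default-pool -o yaml
# or, if calicoctl is available:
calicoctl get ippool default-pool -o yaml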

apiVersion: crd.projectcalico.org/v1
kind: IPPool
metadata:
  annotations:
  generation: 2
  name: default-pool
  resourceVersion: '4060638'
  uid: fe5070d5-4078-4265-87b1-5e24b964a37e
  selfLink: /apis/crd.projectcalico.org/v1/ippools/default-pool
spec:
  allowedUses:
    - Workload
    - Tunnel
  blockSize: 26
  cidr: 192.168.146.0/24
  ipipMode: Never
  natOutgoing: true
  nodeSelector: all()
  vxlanMode: Always


Then we looked at the Terraform code we used for deployment:

module "k8s_cluster" {

...

  k8s_cluster_name                  = var.cluster_name

  k8s_kubespray_url                 = var.k8s_kubespray_url

  k8s_kubespray_version             = var.k8s_kubespray_version

  k8s_version                       = var.k8s_version

  k8s_pod_cidr                      = var.k8s_pod_cidr

  k8s_service_cidr                  = var.k8s_service_cidr

  k8s_network_node_prefix           = var.k8s_network_node_prefix

  k8s_api_lb_vip                    = var.k8s_api_lb_vip

  k8s_metrics_server_enabled        = var.k8s_metrics_server_enabled

  k8s_vsphere_username              = var.k8s_vsphere_username

  k8s_vsphere_password              = var.k8s_vsphere_password

  k8s_zone_a_major_index            = var.k8s_zone_a_major_index

  k8s_zone_b_major_index            = var.k8s_zone_b_major_index

  k8s_zone_c_major_index            = var.k8s_zone_c_major_index

  container_manager                 = var.container_manager

  etcd_deployment_type              = var.etcd_deployment_type

  action                            = var.action

}

variable "k8s_pod_cidr" {

  description = "Subnet for Kubernetes pod IPs, should be /22 or wider"

  default     = "192.168.146.0/24"

  type        = string

}

variable "k8s_network_node_prefix" {

  description = "subnet allocated per-node for pod IPs. Also read the comments in templates/kubespray_extra.tpl"

  default     = "25"

  type        = string

}


Next, the pod CIDR on the nodes:

kubectl describe node tstxknm000.nix.tech.altenar.net | grep PodCIDR
PodCIDR:                      192.168.146.0/25
PodCIDRs:                     192.168.146.0/25


Ah, we have identified the discrepancy! It appears that Calico has a "blockSize: 26" setting, whereas the "k8s_network_node_prefix" is set to 25.

Therefore, the network configuration is incorrect. I must mention that this is a test cluster created solely for the purpose of exploring cluster building with our toolchain. Hence, all the networks in the cluster are "/24". However, even though it is a test cluster, it was deployed with six workers and one master. Simple network calculations reveal that a "/24" network cannot be divided into seven "/26" subnets, let alone seven "/25" subnets:

All 4 of the possible /26 networks for 192.168.146.*:

Network Address   Usable Host Range                   Broadcast Address
192.168.146.0     192.168.146.1   - 192.168.146.62    192.168.146.63
192.168.146.64    192.168.146.65  - 192.168.146.126   192.168.146.127
192.168.146.128   192.168.146.129 - 192.168.146.190   192.168.146.191
192.168.146.192   192.168.146.193 - 192.168.146.254   192.168.146.255
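The arithmetic is simple: a /24 contains 256 addresses; each /26 block holds 64 of them, so at most 256 / 64 = 4 blocks fit, while a /25 per node would give only 256 / 128 = 2 blocks. Either way, there are not enough blocks for seven nodes.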

Now the question arises: why did this occur? We passed a specific value for that parameter. To uncover the root cause, let's delve into the Kubespray code. This part:

- name: Calico | Set kubespray calico network pool
  set_fact:
    _calico_pool: >
      {
        "kind": "IPPool",
        "apiVersion": "projectcalico.org/v3",
        "metadata": {
          "name": "{{ calico_pool_name }}",
        },
        "spec": {
          "blockSize": {{ calico_pool_blocksize | default(kube_network_node_prefix) }},
          "cidr": "{{ calico_pool_cidr | default(kube_pods_subnet) }}",
          "ipipMode": "{{ calico_ipip_mode }}",
          "vxlanMode": "{{ calico_vxlan_mode }}",
          "natOutgoing": {{ nat_outgoing|default(false) }}
        }
      }


And this part too:

# add default ippool blockSize (defaults kube_network_node_prefix)
calico_pool_blocksize: 26

What we realized is that even though we set

kube_network_node_prefix: ${k8s_network_node_prefix}

in our Terraform module code, this variable was never used by Kubespray for the Calico pool: calico_pool_blocksize is already defined with its own default (26), so the default(kube_network_node_prefix) fallback never applies:

             "blockSize": {{ calico_pool_blocksize | default(kube_network_node_prefix) }},


At this juncture, we concluded our investigations and successfully identified the root cause. With this newfound clarity, our action plan is crystal clear:

  1. Obtain a larger network range. 
  2. Rectify the Terraform module for Kubernetes deployment (a rough sketch follows after this list).
  3. Perform a complete redeployment of the cluster from scratch. 
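As an illustration of step 2, the corrected variables might look roughly like this (a hypothetical sketch; the CIDR and prefix are example values, not our production ones):

variable "k8s_pod_cidr" {
  description = "Subnet for Kubernetes pod IPs, should be /22 or wider"
  default     = "10.233.64.0/18"   # hypothetical: wide enough for many per-node /24 blocks
  type        = string
}

variable "k8s_network_node_prefix" {
  description = "Subnet allocated per node for pod IPs; must leave enough blocks for all nodes"
  default     = "24"
  type        = string
}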


Even though the root cause turned out to be trivial, as it so often does, we decided to document our troubleshooting steps in this article, in the hope that they may prove valuable to others facing similar challenges.

Thank you very much for your attention. I remain available to address any questions or comments you may have.
