I am not entirely sure whether this sub is also intended for questions, but after posting this on Stack Overflow and only getting an AI answer, I thought it was worth a shot to ask here. What follows is a rather long question, but most of it is debugging information included to preempt the obvious follow-up questions.
I have created a small Kubernetes cluster (a control-plane node plus six workers) in an OpenStack project using kubeadm, with flannel as the CNI. This is my first time running a Kubernetes cluster with more than a single node.
I set up the cluster's control plane with the following Ansible tasks file:
# tasks file for kubernetes_master

- name: Install required packages
  apt:
    name:
      - curl
      - gnupg2
      - software-properties-common
      - apt-transport-https
      - ca-certificates
    state: present
    update_cache: yes

- name: Install Docker
  apt:
    name: docker.io
    state: present
    update_cache: yes

- name: Remove Keyrings Directory (if it exists)
  ansible.builtin.shell: rm -rf /etc/apt/keyrings

- name: Remove Existing Kubernetes Directory (if it exists)
  ansible.builtin.shell: sudo rm -rf /etc/apt/sources.list.d/pkgs_k8s_io_core_stable_v1_30_deb.list

- name: Disable swap
  ansible.builtin.command:
    cmd: swapoff -a

#- name: Ensure swap is disabled on boot
#  ansible.builtin.command:
#    cmd: sudo sed -i -e '/\/swap.img\s\+none\s\+swap\s\+sw\s\+0\s\+0/s/^/#/' /etc/fstab

- name: Ensure all swap entries are disabled on boot
  ansible.builtin.command:
    cmd: sudo sed -i -e '/\s\+swap\s\+/s/^/#/' /etc/fstab

- name: Add kernel modules for Containerd
  ansible.builtin.copy:
    dest: /etc/modules-load.d/containerd.conf
    content: |
      overlay
      br_netfilter

- name: Load kernel modules for Containerd
  ansible.builtin.shell:
    cmd: modprobe overlay && modprobe br_netfilter
  become: true

- name: Add kernel parameters for Kubernetes
  ansible.builtin.copy:
    dest: /etc/sysctl.d/kubernetes.conf
    content: |
      net.bridge.bridge-nf-call-ip6tables = 1
      net.bridge.bridge-nf-call-iptables = 1
      net.ipv4.ip_forward = 1

- name: Load kernel parameter changes
  ansible.builtin.command:
    cmd: sudo sysctl --system

- name: Configuring Containerd (building the configuration file)
  ansible.builtin.command:
    cmd: sudo sh -c "containerd config default > /opt/containerd/config.toml"

- name: Configuring Containerd (Setting SystemdCgroup Variable to True)
  ansible.builtin.command:
    cmd: sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /opt/containerd/config.toml

- name: Reload systemd configuration
  ansible.builtin.command:
    cmd: systemctl daemon-reload

- name: Restart containerd service
  ansible.builtin.service:
    name: containerd
    state: restarted

- name: Allow 6443/tcp through firewall
  ansible.builtin.command:
    cmd: sudo ufw allow 6443/tcp

- name: Allow 2379:2380/tcp through firewall
  ansible.builtin.command:
    cmd: sudo ufw allow 2379:2380/tcp

- name: Allow 22/tcp through firewall
  ansible.builtin.command:
    cmd: sudo ufw allow 22/tcp

- name: Allow 8080/tcp through firewall
  ansible.builtin.command:
    cmd: sudo ufw allow 8080/tcp

- name: Allow 10250/tcp through firewall
  ansible.builtin.command:
    cmd: sudo ufw allow 10250/tcp

- name: Allow 10251/tcp through firewall
  ansible.builtin.command:
    cmd: sudo ufw allow 10251/tcp

- name: Allow 10252/tcp through firewall
  ansible.builtin.command:
    cmd: sudo ufw allow 10252/tcp

- name: Allow 10255/tcp through firewall
  ansible.builtin.command:
    cmd: sudo ufw allow 10255/tcp

- name: Allow 5473/tcp through firewall
  ansible.builtin.command:
    cmd: sudo ufw allow 5473/tcp

- name: Enable the firewall
  ansible.builtin.ufw:
    state: enabled

- name: Reload the firewall
  ansible.builtin.command:
    cmd: sudo ufw reload

- name: Prepare keyrings directory and update permissions
  file:
    path: /etc/apt/keyrings
    state: directory
    mode: '0755'

- name: Download Kubernetes GPG key securely
  ansible.builtin.shell: curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.30/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg

- name: Add Kubernetes repository
  ansible.builtin.apt_repository:
    repo: "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.30/deb/ /"
    state: present

- name: Install kubeadm, kubelet, kubectl
  ansible.builtin.apt:
    name:
      - kubelet
      - kubeadm
      - kubectl
    state: present
    update_cache: yes

- name: Hold kubelet, kubeadm, kubectl packages
  ansible.builtin.command:
    cmd: sudo apt-mark hold kubelet kubeadm kubectl

- name: Replace /etc/default/kubelet contents
  ansible.builtin.copy:
    dest: /etc/default/kubelet
    content: 'KUBELET_EXTRA_ARGS="--cgroup-driver=cgroupfs"'

- name: Reload systemd configuration
  ansible.builtin.command:
    cmd: systemctl daemon-reload

- name: Restart kubelet service
  ansible.builtin.service:
    name: kubelet
    state: restarted

- name: Update System-Wide Profile for Kubernetes
  ansible.builtin.copy:
    dest: /etc/profile.d/kubernetes.sh
    content: |
      export KUBECONFIG=/etc/kubernetes/admin.conf
      export ANSIBLE_USER="sysadmin"

# only works if not executing on master
#- name: Reboot the system
#  ansible.builtin.reboot:
#    msg: "Reboot initiated by Ansible for Kubernetes setup"
#    reboot_timeout: 150

- name: Replace Docker daemon.json configuration
  ansible.builtin.copy:
    dest: /etc/docker/daemon.json
    content: |
      {
        "exec-opts": ["native.cgroupdriver=systemd"],
        "log-driver": "json-file",
        "log-opts": {
          "max-size": "100m"
        },
        "storage-driver": "overlay2"
      }

- name: Reload systemd configuration
  ansible.builtin.command:
    cmd: systemctl daemon-reload

- name: Restart Docker service
  ansible.builtin.service:
    name: docker
    state: restarted

- name: Update Kubeadm Environment Variable
  ansible.builtin.command:
    cmd: sudo sed -i -e '/^\[Service\]/a Environment="KUBELET_EXTRA_ARGS=--fail-swap-on=false"' /usr/lib/systemd/system/kubelet.service.d/10-kubeadm.conf

- name: Reload systemd configuration
  ansible.builtin.command:
    cmd: systemctl daemon-reload

- name: Restart kubelet service
  ansible.builtin.service:
    name: kubelet
    state: restarted

- name: Pull kubeadm container images
  ansible.builtin.command:
    cmd: sudo kubeadm config images pull

- name: Initialize Kubernetes control plane
  ansible.builtin.command:
    cmd: kubeadm init --pod-network-cidr=10.244.0.0/16
    creates: /tmp/kubeadm_output
  register: kubeadm_init_output
  become: true
  changed_when: false

- name: Set permissions for Kubernetes Admin
  file:
    path: /etc/kubernetes/admin.conf
    state: file
    mode: '0755'

- name: Store Kubernetes initialization output to file
  copy:
    content: "{{ kubeadm_init_output.stdout }}"
    dest: /tmp/kubeadm_output
  become: true
  delegate_to: localhost

- name: Generate the Join Command
  ansible.builtin.shell: cat /tmp/kubeadm_output | tail -n 2 | sed ':a;N;$!ba;s/\\\n\s*/ /g' > /tmp/join-command
  delegate_to: localhost

- name: Set permissions for the Join Executable
  file:
    path: /tmp/join-command
    state: file
    mode: '0755'
  delegate_to: localhost
I then manually rebooted the node and installed flannel via

kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml

Workers are set up in a similar way (without flannel); the join step they end up running is sketched right below. I omit their full script for now, but I can add it if it seems important.
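For reference, the join itself is essentially just executing the command assembled on the control host (a minimal sketch, assuming /tmp/join-command has been copied to the worker; the exact token and hash come from the kubeadm init output):

# /tmp/join-command holds the last two lines of the kubeadm init output
# merged into a single line, i.e. roughly:
#   kubeadm join 192.168.33.117:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>
sudo bash /tmp/join-command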
I then ran into DNS resolution issues with a Helm chart, which is why I started investigating the network, and I noticed that pods on different nodes are unable to ping each other.
I am unsure how to debug this issue further.
Debug Info
kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-master-0 Ready control-plane 4h38m v1.30.14
k8s-worker-0 Ready <none> 4h35m v1.30.14
k8s-worker-1 Ready <none> 4h35m v1.30.14
k8s-worker-2 Ready <none> 4h35m v1.30.14
k8s-worker-3 Ready <none> 4h35m v1.30.14
k8s-worker-4 Ready <none> 4h35m v1.30.14
k8s-worker-5 Ready <none> 4h34m v1.30.14
kubectl get pods -n kube-flannel -o wide
NAME                    READY   STATUS    RESTARTS   AGE    IP               NODE           NOMINATED NODE   READINESS GATES
kube-flannel-ds-275hx   1/1     Running   0          150m   192.168.33.149   k8s-worker-0   <none>           <none>
kube-flannel-ds-2rplc   1/1     Running   0          150m   192.168.33.38    k8s-worker-5   <none>           <none>
kube-flannel-ds-2w98x   1/1     Running   0          150m   192.168.33.113   k8s-worker-1   <none>           <none>
kube-flannel-ds-g4vb6   1/1     Running   0          150m   192.168.33.167   k8s-worker-4   <none>           <none>
kube-flannel-ds-mpwbz   1/1     Running   0          150m   192.168.33.163   k8s-worker-2   <none>           <none>
kube-flannel-ds-qmbgc   1/1     Running   0          150m   192.168.33.117   k8s-master-0   <none>           <none>
kube-flannel-ds-sgdgs   1/1     Running   0          150m   192.168.33.243   k8s-worker-3   <none>           <none>
ip addr show flannel.1
4: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UNKNOWN group default
link/ether a2:4a:11:1f:84:ef brd ff:ff:ff:ff:ff:ff
inet 10.244.0.0/32 scope global flannel.1
valid_lft forever preferred_lft forever
inet6 fe80::a04a:11ff:fe1f:84ef/64 scope link
valid_lft forever preferred_lft forever
ip route
default via 192.168.33.1 dev ens3 proto dhcp src 192.168.33.117 metric 100
10.244.0.0/24 dev cni0 proto kernel scope link src 10.244.0.1
10.244.1.0/24 via 10.244.1.0 dev flannel.1 onlink
10.244.2.0/24 via 10.244.2.0 dev flannel.1 onlink
10.244.3.0/24 via 10.244.3.0 dev flannel.1 onlink
10.244.4.0/24 via 10.244.4.0 dev flannel.1 onlink
10.244.5.0/24 via 10.244.5.0 dev flannel.1 onlink
10.244.6.0/24 via 10.244.6.0 dev flannel.1 onlink
169.254.169.254 via 192.168.33.3 dev ens3 proto dhcp src 192.168.33.117 metric 100
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
192.168.33.0/24 dev ens3 proto kernel scope link src 192.168.33.117 metric 100
192.168.33.1 dev ens3 proto dhcp scope link src 192.168.33.117 metric 100
192.168.33.2 dev ens3 proto dhcp scope link src 192.168.33.117 metric 100
192.168.33.3 dev ens3 proto dhcp scope link src 192.168.33.117 metric 100
192.168.33.4 dev ens3 proto dhcp scope link src 192.168.33.117 metric 100
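One thing I have not captured here: flannel's VXLAN backend should also have programmed one static forwarding entry per peer node, which (if I am reading the docs right) can be listed with:

bridge fdb show dev flannel.1
# expected: one entry per peer node, roughly like
#   <peer flannel.1 MAC> dst 192.168.33.113 self permanent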
kubectl run -it --rm dnsutils --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default
If you don't see a command prompt, try pressing enter.
Address 1: 10.96.0.10
nslookup: can't resolve 'kubernetes.default'
pod "dnsutils" deleted
pod default/dnsutils terminated (Error)
kubectl get pods -n kube-system -l k8s-app=kube-dns
NAME READY STATUS RESTARTS AGE
coredns-55cb58b774-6vb7p 1/1 Running 1 (4h19m ago) 4h38m
coredns-55cb58b774-wtrz6 1/1 Running 1 (4h19m ago) 4h38m
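To separate "CoreDNS itself is failing" from "the pod cannot reach 10.96.0.10 at all", the DNS server can also be passed to nslookup explicitly (busybox's nslookup accepts an optional server argument; output not captured yet):

kubectl run -it --rm dnsutils --image=busybox:1.28 --restart=Never -- \
  nslookup kubernetes.default.svc.cluster.local 10.96.0.10

Given the ping test below, I would expect this to time out whenever the test pod lands on a node that does not also host a CoreDNS replica.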
Ping Test
ubuntu@k8s-master-0:~$ kubectl run pod1 --image=busybox:1.28 --restart=Never --command -- sleep 3600
pod/pod1 created
ubuntu@k8s-master-0:~$ kubectl run pod2 --image=busybox:1.28 --restart=Never --command -- sleep 3600
pod/pod2 created
ubuntu@k8s-master-0:~$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod1 1/1 Running 0 15m 10.244.5.2 k8s-worker-1 <none> <none>
pod2 1/1 Running 0 15m 10.244.4.2 k8s-worker-3 <none> <none>
ubuntu@k8s-master-0:~$ kubectl exec -it pod1 -- sh
/ # ping 10.244.5.2
PING 10.244.5.2 (10.244.5.2): 56 data bytes
64 bytes from 10.244.5.2: seq=0 ttl=64 time=0.107 ms
64 bytes from 10.244.5.2: seq=1 ttl=64 time=0.091 ms
64 bytes from 10.244.5.2: seq=2 ttl=64 time=0.090 ms
^C
--- 10.244.5.2 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.090/0.096/0.107 ms
/ # ping 10.244.4.2
PING 10.244.4.2 (10.244.4.2): 56 data bytes
^C
--- 10.244.4.2 ping statistics ---
2 packets transmitted, 0 packets received, 100% packet loss
/ # exit
command terminated with exit code 1
If I understand flannel correctly, it is fine that the two pods are in different subnets: the routes shown above send each node's pod subnet (10.244.x.0/24) over flannel.1, which handles the forwarding.
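The routing decision can be confirmed per destination from the master (a sketch; the expected output is from memory, not captured):

ip route get 10.244.4.2
# expected roughly: 10.244.4.2 via 10.244.4.0 dev flannel.1 src 10.244.0.0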
Is the firewall config blocking any ports flannel might use? As far as Google can tell, the VXLAN backend uses UDP port 8472.
That’s a good idea. While I have opened some ports commonly used by Kubernetes, I didn’t think about flannel requiring additional open ports; I will look into this. It’s also mentioned in the troubleshooting guide, but I had only looked there for debugging advice and merely skimmed the other sections: https://github.com/flannel-io/flannel/blob/master/Documentation/troubleshooting.md#firewalls
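Concretely, the plan would be something like this on every node (a sketch, untested; 8472/udp is the VXLAN port named in the linked guide, and the dstport field of the vxlan link confirms what this cluster actually uses):

# confirm the UDP port of the vxlan device:
ip -d link show flannel.1      # expect: ... vxlan id 1 ... dstport 8472 ...

# open it on the master and all workers:
sudo ufw allow 8472/udp
sudo ufw reload

# while repeating the cross-node ping, check that encapsulated packets
# actually leave the node:
sudo tcpdump -ni ens3 udp port 8472

Since the nodes are OpenStack instances, the project's security group may also need a rule allowing UDP 8472 between the nodes; ufw is not the only filter in the path.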