One of the newer features in Kubernetes (1.30 and later) is the Kubelet Checkpoint API. This new API allows users to create a stateful copy of a running container, a functionality which is often used for forensics or for debugging.
In Kubernetes installations where this feature is enabled, a checkpoint can be created by accessing the respective Kubelet API via curl or similar. In the following example I am also using the Kubernetes API /proxy endpoint (the same can also be done on the Node locally via localhost:10250/checkpoint/...):
$ curl -k -X POST --header "Authorization: Bearer $TOKEN" "$KUBERNETES_API_URL/api/v1/nodes/$NODE_NAME/proxy/checkpoint/$NAMESPACE_NAME/$POD_NAME/$CONTAINER_NAME"
{"items":["/var/lib/kubelet/checkpoints/checkpoint-fedora-74d79dd7f4-csrmg_skrenger-container-2024-12-12T12:56:19Z.tar"]}
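If you create checkpoints regularly, the request above can be wrapped in a small helper. This is only a sketch: the function names checkpoint_url and create_checkpoint are made up for illustration, and it assumes the same bearer token in $TOKEN and the same /proxy path as in the example above.

```shell
#!/usr/bin/env bash
# Sketch only: the function names are hypothetical, not part of any tool.

# Build the checkpoint URL behind the API server's node /proxy endpoint.
# Arguments: API URL, node name, namespace, pod name, container name.
checkpoint_url() {
  printf '%s/api/v1/nodes/%s/proxy/checkpoint/%s/%s/%s\n' "$1" "$2" "$3" "$4" "$5"
}

# POST to the Kubelet checkpoint endpoint and print the JSON response.
# Expects a bearer token in $TOKEN. -k skips TLS verification as in the
# example above, --fail makes curl return non-zero on HTTP errors.
create_checkpoint() {
  curl -k --fail -X POST \
    --header "Authorization: Bearer $TOKEN" \
    "$(checkpoint_url "$@")"
}
```

It would be called as create_checkpoint "$KUBERNETES_API_URL" "$NODE_NAME" "$NAMESPACE_NAME" "$POD_NAME" "$CONTAINER_NAME". The path in the response refers to a file on the node itself, not on the machine running curl.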
Well, so I tried installing a new ARM-based OpenShift Container Platform cluster on AWS. To prepare, I created an install-config.yaml file, changed the controlPlane.architecture and compute.architecture fields to “arm64” and then launched the installer. That did not work; the installer still complained about the architecture:
$ ./openshift-install create cluster --dir=.
INFO Credentials loaded from the "default" profile in file "/home/simon/.aws/credentials"
INFO Consuming Install Config from target directory
INFO Creating infrastructure resources...
INFO Waiting up to 20m0s (until 11:07AM) for the Kubernetes API at https://api.skrenger-arm.lab.example.com:6443...
INFO Pulling VM console logs
INFO Pulling debug logs from the bootstrap machine
ERROR Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get "https://api.skrenger-arm.lab.example.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp 3.64.25.143:6443: connect: connection refused
ERROR Bootstrap failed to complete: Get "https://api.skrenger-arm.lab.example.com:6443/version": dial tcp 3.68.144.150:6443: connect: connection refused
ERROR Failed waiting for Kubernetes API. This error usually happens when there is a problem on the bootstrap host that prevents creating a temporary control plane.
ERROR The bootstrap machine failed to download the release image
INFO Pulling quay.io/openshift-release-dev/ocp-release@sha256:9ffb17b909a4fdef5324ba45ec6dd282985dd49d25b933ea401873183ef20bf8...
INFO cfce1ab124f59e93a0f67d7e85283d524ddfd73a27d0535319d69d1dce746488
INFO ERROR: release image arch amd64 does not match host arch arm64
INFO Bootstrap gather logs captured here "/home/simon/Downloads/arm/log-bundle-20221124110737.tar.gz"
In OpenShift Container Platform (OCP) 4, most of the functionality is controlled by Operators. To see the currently installed Operators and also their status, use the following command:
$ oc get clusteroperators
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
authentication 4.6.4 True False False 12m
cloud-credential 4.6.4 True False False 38m
cluster-autoscaler 4.6.4 True False False 32m
config-operator 4.6.4 True False False 33m
console 4.6.4 True False False 21m
csi-snapshot-controller 4.6.4 True False False 27m
dns 4.6.4 True False False 31m
etcd 4.6.4 True False False 32m
image-registry 4.6.4 True False False 25m
ingress 4.6.4 True False False 24m
insights 4.6.4 True False False 33m
kube-apiserver 4.6.4 True False False 30m
kube-controller-manager 4.6.4 True False False 31m
kube-scheduler 4.6.4 True False False 31m
kube-storage-version-migrator 4.6.4 True False False 24m
machine-api 4.6.4 True False False 27m
machine-approver 4.6.4 True False False 32m
machine-config 4.6.4 True False False 32m
marketplace 4.6.4 True False False 32m
monitoring 4.6.4 True False False 23m
network 4.6.4 True False False 33m
node-tuning 4.6.4 True False False 33m
openshift-apiserver 4.6.4 True False False 27m
openshift-controller-manager 4.6.4 True False False 24m
openshift-samples 4.6.4 True False False 26m
operator-lifecycle-manager 4.6.4 True False False 32m
operator-lifecycle-manager-catalog 4.6.4 True False False 32m
operator-lifecycle-manager-packageserver 4.6.4 True False False 27m
service-ca 4.6.4 True False False 33m
storage 4.6.4 True False False 32m
You can find the description of the default Operators in the documentation.
The command above only lists the Red Hat Operators that are installed as part of the cluster. These are all controlled by the ClusterVersionOperator, which is the “Master-Operator” of the cluster controlling all others.
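On a healthy cluster that list is long and uniform, so for a quick check it can help to print only the Operators that need attention. The following is a minimal sketch: the function name unhealthy_operators is made up, and it assumes the column layout shown above (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED, SINCE) with every column populated.

```shell
#!/usr/bin/env bash
# Sketch only: assumes the six-column layout shown above with no empty
# columns, because awk splits on whitespace.

# Print the names of Operators that are not Available or are Degraded.
# Reads `oc get clusteroperators` output on stdin and skips the header.
unhealthy_operators() {
  awk 'NR > 1 && ($3 != "True" || $5 == "True") { print $1 }'
}
```

Use it as oc get clusteroperators | unhealthy_operators. No output means every Operator reports Available=True and Degraded=False.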
If you want to list all Operators that were installed via the Operator Lifecycle Manager (OLM), you can use the following command:
$ oc get subscriptions --all-namespaces
Getting training and exams done in 2020 has been challenging. After earning my RHCE in mid-February, I am now proud to say that I achieved my Red Hat Certified Architect in Infrastructure certification less than 9 months later.
To reach my RHCA, I took the following Red Hat exams. As you can see, it is OpenShift and Ansible all the way down:
- EX180 Red Hat Certified Specialist in Containers and Kubernetes
- EX280 Red Hat Certified Specialist in OpenShift Administration
- EX288 Red Hat Certified Specialist in OpenShift Application Development
- EX407 Red Hat Certified Specialist in Ansible Automation
- EX447 Red Hat Certified Specialist in Ansible Best Practices
Of course, the journey does not end here as there are quite a few interesting topics still to learn!
Tags: Ansible, Certification, EX180, EX280, EX288, EX407, EX447, exams, OpenShift, Red Hat Certified Architect, Red Hat Certified Architect in Infrastructure, RHCA
With OpenShift 4, Red Hat introduced Red Hat Enterprise Linux CoreOS. It is a very minimalist operating system, focused on running container workloads.
This new minimalism comes with some challenges. There are no more RPM packages and most of the tools we know and love are missing! Luckily, there is the Red Hat supplied toolbox container that contains all the necessary tools and is nicely integrated.
To get to the toolbox, first use oc debug node/<nodename>. This will start a privileged container on the node you specify, mount the host file system on /host and drop you into a shell. From there, run chroot /host and then toolbox:
$ oc debug node/worker-0.lab.openshift.krenger.ch
Starting pod/worker-0labopenshiftkrengerch-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# toolbox
Container started successfully. To exit, type 'exit'.
sh-4.2#
Now we are running in the toolbox container on our CoreOS host with all the tools we know at our disposal, for example sosreport:
sh-4.2# sosreport
Running sosreport will generate a sosreport in /host/var/tmp/, which means it will be accessible in /var/tmp/ on the CoreOS host itself.
For OpenShift 4, the upgrade paths are kept in the cincinnati-graph-data repository as YAML files and then exposed via an API.
There is a Red Hat Solution describing how this data can be queried via api.openshift.com and how you can use this data in your automation:
$ curl -sH 'Accept:application/json' 'https://api.openshift.com/api/upgrades_info/v1/graph?channel=fast-4.2&arch=amd64' | jq .
While this data is quite helpful for automation (the Solution also describes helpful queries), it is not very nice to look at the raw data. If you are looking for a graphical presentation of that data, you should check out this wonderful website that is maintained by a Red Hat colleague with hourly generated data: www.ocp-upgrade.net
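For automation, the raw graph can be reduced to the answer that usually matters: which versions can be reached directly from the current one. The sketch below assumes the response has the shape of a "nodes" array (each entry with a "version" field) plus an "edges" array of [from, to] index pairs into "nodes". The function name upgrade_targets is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: assumes the graph JSON consists of "nodes" (objects with a
# "version" field) and "edges" ([from, to] index pairs into "nodes").

# Print the versions directly reachable from the version given as $1.
# Reads the graph JSON on stdin; prints nothing if the version is unknown.
upgrade_targets() {
  jq -r --arg from "$1" '
    (.nodes | map(.version)) as $versions
    | ($versions | index($from)) as $i
    | .edges[]
    | select(.[0] == $i)
    | $versions[.[1]]
  '
}
```

Piping the curl command from above into upgrade_targets 4.2.0 (without the trailing jq .) would list the direct upgrade targets for 4.2.0 in that channel.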
So here is another one from the trenches.
More than once one of our OpenShift Container Platform customers approached us and said something along the lines of: “Help, I cannot see the X-Forwarded-For header in my application, our OpenShift Router is probably configured incorrectly!”.
In such cases, it is often a good idea to check what is really being forwarded to the Pods in the cluster. For this, I typically use my simonkrenger/echoenv container to print the headers received by the application. In many cases, it turns out that the application affected is a Spring Boot application and the header is passed correctly to the Pod itself. But the Spring Boot application does not show the header anyway.
We have observed a behaviour of Spring Boot that leads to the X-Forwarded-For header not being passed to the application, as it is consumed by Spring Boot. In the application.properties of a Spring Boot application, the following setting controls this:
server.use-forward-headers: true
This configuration leads to the header being consumed by Spring Boot and the header not being available in the application. See also the relevant sections in Spring documentation. Good to know.
Kubernetes uses etcd as the persistent store for API data. As etcd is a distributed key-value store, we can also use command line tools to query this store. The examples in this post are for OpenShift 3.x.
Apart from just using get, there is also the possibility to perform the following actions on certain keys:
- put to write to a key. Unless you know what you are doing, don’t touch the Kubernetes data in etcd, as this will manifest in very strange Kubernetes behaviour.
- del to delete a key. This may also break your Kubernetes cluster by introducing inconsistencies.
- watch to keep a watch on an object. This is very helpful to track changes on a certain object.
The get action is probably the most helpful functionality for in-depth API debugging directly within etcd.
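The two read-only actions could be wrapped as follows. This is a sketch under assumptions: the endpoint, the certificate directory and the file names (ca.crt, peer.crt, peer.key) are guesses at a typical setup and need to be adjusted, and the helper names are made up. Only the etcdctl v3 subcommands and flags themselves (get, watch, --prefix, --keys-only, --endpoints, --cacert, --cert, --key) are real.

```shell
#!/usr/bin/env bash
# Sketch only: endpoint and certificate paths are assumptions, adjust them
# to match your cluster.
ETCD_ENDPOINT="${ETCD_ENDPOINT:-https://127.0.0.1:2379}"
ETCD_CERT_DIR="${ETCD_CERT_DIR:-/etc/etcd}"

# Run etcdctl against the v3 API with TLS client certificates.
etcd3() {
  ETCDCTL_API=3 etcdctl \
    --endpoints="$ETCD_ENDPOINT" \
    --cacert="$ETCD_CERT_DIR/ca.crt" \
    --cert="$ETCD_CERT_DIR/peer.crt" \
    --key="$ETCD_CERT_DIR/peer.key" \
    "$@"
}

# List only the key names below a prefix (read-only).
etcd_list_keys() {
  etcd3 get "$1" --prefix --keys-only
}

# Follow changes to a single key until interrupted (read-only).
etcd_watch_key() {
  etcd3 watch "$1"
}
```

For example, etcd_list_keys /kubernetes.io/pods would list the stored Pod keys, assuming that is the prefix your cluster uses. Neither helper writes to etcd, so both are safe to experiment with.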
For editing YAML, be it for OpenShift / Kubernetes or Ansible, having your editor set up right can help to avoid common mistakes. So here is the minimalistic config in my ~/.vimrc to make working with YAML files easier. I am sure there are even more plugins or settings available, but this minimal set of commands works fine for me:
set ts=2
set sts=2
set sw=2
set expandtab
syntax on
filetype indent plugin on
set ruler
Some time ago, I had a curious case of very slow DNS resolution in a container on OpenShift. The symptoms were as follows:
- In the PHP application in the container, DNS resolution was very slow with a 5 second delay before the lookup was resolved
- In the container itself, DNS resolution for curl was very slow, with a 5 second timeout before the lookup was resolved
- However, using dig in the container itself, DNS resolution was instant
- Also, on the worker node, the DNS resolution was instant (using both dig and curl)
TL;DR: Since glibc 2.9, glibc performs IPv4 and IPv6 lookups in parallel. When the IPv6 lookup fails, there is in many cases a 5 second timeout before the result is returned. Make glibc perform the two lookups sequentially by setting “single-request” in “resolv.conf” (available since glibc 2.10), or disable the IPv6 stack completely.
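The difference between curl and dig in the symptoms fits this explanation: dig sends its own DNS queries, while curl typically resolves names through the system resolver. To reproduce the delay without the application, you can time a lookup that goes through glibc's getaddrinfo(), which is what getent ahosts does. This is a minimal sketch: the function name time_lookup is made up, and it assumes GNU date for the %N format.

```shell
#!/usr/bin/env bash
# Sketch only: assumes GNU date (for %N) and a getent binary in the image.

# Time a single getaddrinfo() lookup and print the elapsed milliseconds.
# getent ahosts asks for IPv4 and IPv6 addresses, like most applications.
time_lookup() {
  local host="$1" start end
  start=$(date +%s%N)
  getent ahosts "$host" >/dev/null || echo "lookup failed for $host" >&2
  end=$(date +%s%N)
  echo $(( (end - start) / 1000000 ))
}

# The workaround from the TL;DR is this line in resolv.conf:
#   options single-request
```

If time_lookup reports roughly 5000 ms for an external name while dig answers instantly, you are most likely looking at the same problem.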