MLOps platform on Rancher RKE2 Kubernetes Cluster — Bare Metal environment

November 14, 2022
Kubeflow's installation documents cover environment setup through packaged distributions or public cloud environments. This post covers the prerequisite environment setup and a Kubeflow 1.6.0 installation on a Rancher RKE2 Kubernetes cluster running on a bare-metal server.
Overview:
This post covers the deployment procedure for Kubeflow on a Rancher RKE2 Kubernetes cluster deployed in a bare-metal environment. #RKE2 #Kubeflow
Kubernetes deprecated support for Docker as a container runtime starting with version 1.20, so we decided to use RKE2 as the Kubernetes distribution for its focus on security and its container runtime support (special mention to the Rancher community support).
Note: RKE2 Kubernetes v1.22.15+rke2r1 is supported by the latest Kubeflow release, v1.6.0. The latest RKE2 release is v1.25, but it is not supported by Kubeflow v1.6.0.

Prerequisites:
- Install Ubuntu 20.04 on all 3 nodes (1 server + 2 agents).
- Open the required ports, which depend on the chosen CNI and on whether the node is a server or an agent: https://docs.rke2.io/install/requirements/#networking (see the firewall sketch after this list). Here we set up the Kubernetes platform in an air-gapped environment behind a proxy.
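Since this guide later exposes Kubeflow through a NodePort, here is a minimal firewall sketch for a server node, assuming ufw on Ubuntu and the default Canal CNI (consult the linked requirements page for other CNIs and for the smaller agent-node port set):

sudo ufw allow 9345/tcp          # RKE2 supervisor API (agents register here)
sudo ufw allow 6443/tcp          # Kubernetes API server
sudo ufw allow 10250/tcp         # kubelet metrics
sudo ufw allow 2379:2381/tcp     # etcd client, peer, and metrics ports (server nodes only)
sudo ufw allow 8472/udp          # Canal VXLAN overlay
sudo ufw allow 30000:32767/tcp   # NodePort range (used later for the Istio ingress gateway)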
Following are the steps we’ll go through:
- RKE2 Server setup
- RKE2 Agent setup
- Storage Class setup
- Kustomize setup
- Kubeflow setup
RKE2 Server setup:
Download the RKE2 images, binary, checksum file, and install script for the server setup by executing the following commands:
mkdir /home/user/rke2-artifacts && cd /home/user/rke2-artifacts
wget https://github.com/rancher/rke2/releases/download/v1.22.15%2Brke2r1/rke2-images.linux-amd64.tar.zst
wget https://github.com/rancher/rke2/releases/download/v1.22.15%2Brke2r1/rke2.linux-amd64.tar.gz
wget https://github.com/rancher/rke2/releases/download/v1.22.15%2Brke2r1/sha256sum-amd64.txt
curl -sfL https://get.rke2.io --output install.sh
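The release checksum file is already part of the download, so it is worth verifying the artifacts before installing; --ignore-missing skips checksum entries for files that were not downloaded:

cd /home/user/rke2-artifacts
sha256sum --check --ignore-missing sha256sum-amd64.txt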
If you plan to set up the Kubernetes environment behind a proxy, create an “/etc/default/rke2-server” file:
>> vim /etc/default/rke2-server
HTTP_PROXY="http://<proxy server ip>:<proxy port>"
HTTPS_PROXY="http://<proxy server ip>:<proxy port>"
NO_PROXY="localhost,127.0.0.1,10.43.0.0/16,10.42.0.0/16,.svc,.cluster.local"
CONTAINERD_HTTP_PROXY="http://<proxy server ip>:<proxy port>"
CONTAINERD_HTTPS_PROXY="http://<proxy server ip>:<proxy port>"
CONTAINERD_NO_PROXY="localhost,127.0.0.1,10.43.0.0/16,10.42.0.0/16,.svc,.cluster.local"
Create an RKE2 server config file:
>> mkdir -p /etc/rancher/rke2
>> vim /etc/rancher/rke2/config.yaml
tls-san:
  - <add all server IPs>
  - <add all expected TLS SAN URLs>
  - svc
  - cluster.local
cni:
  - <CNI of your choice, or leave blank for the default Canal>
advertise-address: <server IP, or LB IP in case of HA>
write-kubeconfig-mode: 644
node-label:
  - "type=gpu-node"
Install RKE2 server using the following command:
INSTALL_RKE2_VERSION=v1.22.15+rke2r1 INSTALL_RKE2_ARTIFACT_PATH=/home/user/rke2-artifacts sh install.sh
Enable and start the rke2-server service:
systemctl enable rke2-server.service
systemctl start rke2-server.service
The kubeconfig is located at “/etc/rancher/rke2/rke2.yaml” and the binaries are in “/var/lib/rancher/rke2/bin”.
Execute the below commands to set environment variables to use the kubectl command and interact with the RKE2 cluster:
export PATH=/var/lib/rancher/rke2/bin:$PATH
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
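A quick sanity check that kubectl can now reach the cluster (the node name and age will differ in your environment):

kubectl get nodes
# NAME          STATUS   ROLES                       AGE   VERSION
# rke2-server   Ready    control-plane,etcd,master   2m    v1.22.15+rke2r1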
RKE2 Agent setup:
Download the RKE2 images, binary, checksum file, and install script for the agent setup by executing the following commands:
mkdir /home/user/rke2-artifacts && cd /home/user/rke2-artifacts
wget https://github.com/rancher/rke2/releases/download/v1.22.15%2Brke2r1/rke2-images.linux-amd64.tar.zst
wget https://github.com/rancher/rke2/releases/download/v1.22.15%2Brke2r1/rke2.linux-amd64.tar.gz
wget https://github.com/rancher/rke2/releases/download/v1.22.15%2Brke2r1/sha256sum-amd64.txt
curl -sfL https://get.rke2.io --output install.sh
Install RKE2 agent using the following command:
export CONTAINER_RUNTIME_ENDPOINT=unix:///run/k3s/containerd/containerd.sock
export CONTAINERD_ADDRESS=/run/k3s/containerd/containerd.sock
export INSTALL_RKE2_TYPE="agent"
INSTALL_RKE2_VERSION=v1.22.15+rke2r1 INSTALL_RKE2_ARTIFACT_PATH=/home/user/rke2-artifacts sh install.sh
If you plan to set up the Kubernetes environment behind a proxy, create an “/etc/default/rke2-agent” file:
>> vim /etc/default/rke2-agent
HTTP_PROXY="http://<proxy server ip>:<proxy port>"
HTTPS_PROXY="http://<proxy server ip>:<proxy port>"
NO_PROXY="localhost,127.0.0.1,10.43.0.0/16,10.42.0.0/16,.svc,.cluster.local"
CONTAINERD_HTTP_PROXY="http://<proxy server ip>:<proxy port>"
CONTAINERD_HTTPS_PROXY="http://<proxy server ip>:<proxy port>"
CONTAINERD_NO_PROXY="localhost,127.0.0.1,10.43.0.0/16,10.42.0.0/16,.svc,.cluster.local"
Create an RKE2 agent config file with the rke2-server token to join the cluster:
>> mkdir -p /etc/rancher/rke2/
>> vim /etc/rancher/rke2/config.yaml
token: <copy the token from the rke2-server node at /var/lib/rancher/rke2/server/node-token>
server: https://<rke2-server ip, or LB IP in case of HA>:9345
node-label:
  - "type=gpu-node"
Enable and start the rke2-agent service:
systemctl enable rke2-agent.service
systemctl start rke2-agent.service
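To confirm the agents joined, list the nodes from the server; if an agent does not appear, its logs usually show why (a wrong token or an unreachable port 9345 are common causes):

kubectl get nodes                # run on the server; all 3 nodes should reach Ready
journalctl -u rke2-agent -f      # run on the agent to tail registration logs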
Install Helm on the RKE2 server with the following commands:
curl https://baltocdn.com/helm/signing.asc | gpg --dearmor | sudo tee /usr/share/keyrings/helm.gpg > /dev/null
sudo apt-get install apt-transport-https --yes
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/helm.gpg] https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list
sudo apt-get update
sudo apt-get install helm
Storage class setup:
Execute the below command to set up the “local-path” storage class:
kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/master/deploy/local-path-storage.yaml
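Kubeflow's PVCs bind to the default storage class unless you edit each claim, so it usually helps to mark the newly created class as the default; “local-path” is the class name created by the manifest above:

kubectl patch storageclass local-path -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'
kubectl get storageclass   # local-path should now show "(default)"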
Kustomize setup:
Kustomize 3.2.0 is the version supported by Kubeflow 1.6.0; don't install the latest version. Execute the below command to set up Kustomize 3.2.0:
curl -Lo kustomize https://github.com/kubernetes-sigs/kustomize/releases/download/v3.2.0/kustomize_3.2.0_linux_amd64 && chmod +x kustomize && sudo mv kustomize /usr/local/bin/
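A quick check that the pinned binary is the one on your PATH:

kustomize version   # should report 3.2.0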
Kubeflow setup:
Most of the installation procedures in the Kubeflow installation documents target cloud providers. For bare metal, and for any Kubernetes distro, Kustomize is the better fit.
Clone the Kubeflow manifests repository, or download the manifest files from https://github.com/kubeflow/manifests/tree/v1.6-branch:
git clone git@github.com:kubeflow/manifests.git
Check out the “v1.6-branch”, then generate a password hash using the below command:
python3 -c 'from passlib.hash import bcrypt; import getpass; print(bcrypt.using(rounds=12, ident="2y").hash(getpass.getpass()))'
- Update the generated hash in “common/dex/base/config-map.yaml”.
- Add a storage class in the following files if you plan to use one other than the default (see the snippet after this list):
common/oidc-authservice/base/pvc.yaml
apps/katib/upstream/components/mysql/pvc.yaml
apps/pipeline/upstream/third-party/minio/base/minio-pvc.yaml
apps/pipeline/upstream/third-party/mysql/base/mysql-pv-claim.yaml
- Modify the size of the minio-pvc based on the expected artifact size and available storage.
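For reference, the edit in each of those files is just the spec.storageClassName field; a minimal sketch against common/oidc-authservice/base/pvc.yaml, assuming the “local-path” class created earlier (the claim name and size come from the upstream manifest and differ between files):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: authservice-pvc
spec:
  storageClassName: local-path   # the added line
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi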
Execute the following command to install Kubeflow:
while ! kustomize build example | sudo kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done
After running this, some applies may fail at first (for example, while CRDs and webhooks are still coming up) and the terminal prints “Retrying to apply resources”. If this happens, the loop automatically keeps retrying until everything applies cleanly. Wait until all the pods have a Running status before proceeding.
Check the status of the pods by executing the following commands:
kubectl get pods -n cert-manager
kubectl get pods -n istio-system
kubectl get pods -n auth
kubectl get pods -n knative-eventing
kubectl get pods -n knative-serving
kubectl get pods -n kubeflow
kubectl get pods -n kubeflow-user-example-com
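Instead of re-running the checks above, kubectl wait can block until a namespace's pods report Ready (one namespace shown; repeat per namespace, and fall back to the manual checks if a namespace contains completed job pods, which never become Ready):

kubectl wait --for=condition=Ready pod --all -n kubeflow --timeout=600s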
Patch the Istio ingress gateway service to NodePort:
kubectl patch svc istio-ingressgateway -n istio-system -p '{"spec": {"type": "NodePort"}}'
Note: If you have a LoadBalancer in the environment, set the type to LoadBalancer instead of NodePort.
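To find the assigned port and reach the central dashboard, query the service; “http2” is the port name used by the istio-ingressgateway service:

kubectl get svc istio-ingressgateway -n istio-system -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}'
# then browse to http://<any node ip>:<printed nodeport>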
Now we can access Kubeflow using the default credentials: email “user@example.com” and password “12341234”.
Troubleshooting:
As mentioned earlier, the RKE2 cluster is built on top of containerd, so we can use the crictl command for troubleshooting if necessary. To use crictl, add the following settings on the node:
vim /etc/crictl.yaml
runtime-endpoint: unix:///run/k3s/containerd/containerd.sock
image-endpoint: unix:///run/k3s/containerd/containerd.sock
timeout: 10
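Typical crictl checks once the config is in place (container IDs will differ):

crictl ps                    # list running containers
crictl images                # list pulled images
crictl logs <container-id>   # inspect a specific container's logs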
Author’s Bio:
Shanker JJ is a Senior InfraOps Engineer, part of the AI Engineering team at AI Inside Inc., Japan. He focuses on building production-ready ML Operations infrastructure, ML services, tools, and data pipelines.