MLOps platform on Rancher RKE2 Kubernetes Cluster — Bare Metal environment
November 14, 2022
Kubeflow installation documents cover environment setup through packaged distributions or public cloud environments. This blog covers the prerequisite environment setup and the Kubeflow 1.6.0 installation on a Rancher RKE2 Kubernetes environment running on a bare-metal server.
Overview:
This post covers the deployment of Kubeflow on a Rancher RKE2 Kubernetes cluster running in a bare-metal environment. #RKE2 #Kubeflow
Kubernetes deprecated Docker as a container runtime starting with version 1.20. We therefore chose RKE2 as the Kubernetes distribution, for its focus on security and its containerd-based container runtime (special mention to the Rancher community for their support).
Note: This guide uses RKE2 Kubernetes v1.22.15+rke2r1, which is supported by Kubeflow v1.6.0, the latest Kubeflow release at the time of writing. The latest RKE2 Kubernetes release is v1.25, but Kubeflow v1.6.0 does not support it.
Prerequisites:
- Install Ubuntu 20.04 on all 3 nodes (1 server + 2 agents).
- Open the required ports, which depend on the CNI selected and on whether the node is a server or an agent: https://docs.rke2.io/install/requirements/#networking. In this post we set up the Kubernetes platform in an air-gapped environment behind a proxy.
Following are the steps we’ll go through:
- RKE2 Server setup
- RKE2 Agent setup
- Storage Class setup
- Kustomize setup
- Kubeflow setup
RKE2 Server setup:
Download RKE2 images & manifest source for RKE2 server setup by executing the following commands:
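A minimal sketch of this step, assuming the tarball-based air-gap method from the RKE2 documentation, an amd64 host, and a staging directory of /root/rke2-artifacts (the directory name and the function names are our own choices, not part of RKE2):

```shell
# Sketch only: stage the RKE2 air-gap artifacts for v1.22.15+rke2r1.
# ARTIFACT_DIR and the amd64 architecture are assumptions.
RKE2_VERSION="v1.22.15+rke2r1"
ARTIFACT_DIR="${ARTIFACT_DIR:-/root/rke2-artifacts}"

# Build a release asset URL; '+' must be encoded as %2B in the path.
rke2_release_url() {
  printf 'https://github.com/rancher/rke2/releases/download/%s/%s\n' \
    "$(printf '%s' "$RKE2_VERSION" | sed 's/+/%2B/')" "$1"
}

# Download the image tarball, the RKE2 tarball, the checksums and the install script.
download_rke2_artifacts() (
  mkdir -p "$ARTIFACT_DIR" && cd "$ARTIFACT_DIR" || exit 1
  for f in rke2-images.linux-amd64.tar.zst rke2.linux-amd64.tar.gz sha256sum-amd64.txt; do
    curl -fLO "$(rke2_release_url "$f")" || exit 1
  done
  curl -sfL https://get.rke2.io --output install.sh
)
```

Calling `download_rke2_artifacts` as root fills the staging directory; nothing is downloaded until the function is called.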
If you plan to set up the Kubernetes environment behind a proxy, create the “/etc/default/rke2-server” file.
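A sketch of what this file might contain. The proxy URL is a placeholder, and the NO_PROXY ranges are the private-network defaults suggested in the RKE2 documentation; adjust both to your network. The function name is hypothetical.

```shell
# Sketch: write proxy settings read by the rke2-server systemd unit.
write_rke2_proxy_env() {
  # $1 = target file, $2 = proxy URL (placeholder values, replace with your own)
  cat > "$1" <<EOF
HTTP_PROXY=$2
HTTPS_PROXY=$2
NO_PROXY=127.0.0.0/8,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,.svc,.cluster.local
EOF
}
```

For example: `write_rke2_proxy_env /etc/default/rke2-server http://proxy.example.com:8888`.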
Create an RKE2 server config file.
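RKE2 reads its configuration from /etc/rancher/rke2/config.yaml. The sketch below writes a small example; the two keys shown (`write-kubeconfig-mode` and `tls-san`) are optional RKE2 settings picked for illustration, and the hostname is a placeholder, so the real file may well look different.

```shell
# Sketch: minimal server config. Both keys are optional and chosen as examples.
write_rke2_server_config() {
  # $1 = config directory, $2 = extra hostname or IP for the API server certificate
  mkdir -p "$1" || return 1
  cat > "$1/config.yaml" <<EOF
write-kubeconfig-mode: "0644"
tls-san:
  - "$2"
EOF
}
```

For example: `write_rke2_server_config /etc/rancher/rke2 rke2-server.example.com`.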
Install RKE2 server using the following command:
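A sketch of the install step, assuming the artifacts and install.sh were staged in /root/rke2-artifacts (an assumed path). `INSTALL_RKE2_ARTIFACT_PATH` is the install script's variable for installing from local files.

```shell
# Sketch: run the RKE2 install script against locally staged artifacts (as root).
ARTIFACT_DIR="${ARTIFACT_DIR:-/root/rke2-artifacts}"

install_rke2_server() {
  # With INSTALL_RKE2_ARTIFACT_PATH set, install.sh uses the local tarballs
  # instead of downloading them.
  INSTALL_RKE2_ARTIFACT_PATH="$ARTIFACT_DIR" sh "$ARTIFACT_DIR/install.sh"
}
```

Run `install_rke2_server` once the staging directory is populated.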
Enable and start the rke2-server service:
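A sketch using the systemd unit that the RKE2 installer creates; the function name is our own.

```shell
# Sketch: enable the service at boot and start it now (run as root).
start_rke2_server() {
  systemctl enable rke2-server.service &&
  systemctl start rke2-server.service
}
# To follow the startup logs afterwards: journalctl -u rke2-server -f
```

The first start can take a few minutes while the images are imported.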
Kubeconfig is located in “/etc/rancher/rke2/rke2.yaml” and binary files are in “/var/lib/rancher/rke2/bin”.
Execute the below commands to set environment variables to use the kubectl command and interact with the RKE2 cluster:
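Using the two paths mentioned above, the environment can be set like this (a sketch for the current shell session; add the lines to your shell profile to make them permanent):

```shell
# Point kubectl at the RKE2 kubeconfig and add the bundled binaries to PATH.
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
export PATH="$PATH:/var/lib/rancher/rke2/bin"
# Then verify the cluster with: kubectl get nodes
```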
RKE2 Agent setup:
Download RKE2 images & manifest source for RKE2 agent setup by executing the following commands:
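A compact sketch for the agent nodes, assuming amd64 hosts; the function name and the staging directory argument are hypothetical.

```shell
# Sketch: fetch the RKE2 v1.22.15+rke2r1 artifacts into a staging directory.
fetch_rke2_agent_artifacts() (
  # $1 = staging directory on the agent node
  base="https://github.com/rancher/rke2/releases/download/v1.22.15%2Brke2r1"
  mkdir -p "$1" && cd "$1" || exit 1
  curl -fLO "$base/rke2-images.linux-amd64.tar.zst" &&
  curl -fLO "$base/rke2.linux-amd64.tar.gz" &&
  curl -fLO "$base/sha256sum-amd64.txt" &&
  curl -sfL https://get.rke2.io --output install.sh
)
```

For example: `fetch_rke2_agent_artifacts /root/rke2-artifacts`.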
Install RKE2 agent using the following command:
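A sketch of the agent install, assuming install.sh and the tarballs sit in one local directory. `INSTALL_RKE2_TYPE` and `INSTALL_RKE2_ARTIFACT_PATH` are variables read by the RKE2 install script; the function name is our own.

```shell
# Sketch: install RKE2 in agent mode from local artifacts (run as root).
install_rke2_agent() {
  # $1 = directory holding install.sh and the RKE2 tarballs
  INSTALL_RKE2_TYPE="agent" INSTALL_RKE2_ARTIFACT_PATH="$1" sh "$1/install.sh"
}
```

For example: `install_rke2_agent /root/rke2-artifacts`.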
If you plan to set up the Kubernetes environment behind a proxy, create the “/etc/default/rke2-agent” file.
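A sketch of the agent-side proxy file. The proxy URL is a placeholder and the NO_PROXY ranges are the private-network defaults from the RKE2 documentation; the function name is hypothetical.

```shell
# Sketch: write proxy settings read by the rke2-agent systemd unit.
write_agent_proxy_env() {
  # $1 = proxy URL, $2 = target file (defaults to the path used on agents)
  target="${2:-/etc/default/rke2-agent}"
  printf '%s\n' "HTTP_PROXY=$1" "HTTPS_PROXY=$1" \
    "NO_PROXY=127.0.0.0/8,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,.svc,.cluster.local" > "$target"
}
```

For example: `write_agent_proxy_env http://proxy.example.com:8888`.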
Create the RKE2 config file with the rke2-server token so that the agent can join the cluster:
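A sketch of the agent config. Agents register with the server on port 9345, and the token can be read on the server node from /var/lib/rancher/rke2/server/node-token. The function name and arguments are our own.

```shell
# Sketch: agent config pointing at the RKE2 server's registration port.
write_rke2_agent_config() {
  # $1 = config directory, $2 = server hostname or IP, $3 = join token
  mkdir -p "$1" || return 1
  cat > "$1/config.yaml" <<EOF
server: https://$2:9345
token: $3
EOF
}
```

For example: `write_rke2_agent_config /etc/rancher/rke2 <server-ip> <token>`, with both placeholders replaced by your own values.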
Start rke2-agent services:
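A sketch using the systemd unit created by the installer in agent mode; the function name is our own.

```shell
# Sketch: enable the agent service at boot and start it now (run as root).
start_rke2_agent() {
  systemctl enable rke2-agent.service &&
  systemctl start rke2-agent.service
}
# To follow the logs afterwards: journalctl -u rke2-agent -f
```

Once the agent is up, `kubectl get nodes` on the server should list the new node.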
Install Helm on the RKE2 server with the following command:
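A sketch assuming the installer script published in the Helm repository, which installs the latest Helm 3 release. Behind a proxy, curl needs the proxy variables exported in the shell.

```shell
# Sketch: install Helm 3 via the project's installer script.
install_helm() {
  curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
}
```

After calling `install_helm`, `helm version` confirms the installation.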
Storage class setup:
Execute the command below to set up the “localpath” storage class.
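One way to do this, assuming Rancher's local-path-provisioner is what is meant here. The pinned tag v0.0.22 is our assumption, and the second command marks the resulting `local-path` class as the cluster default.

```shell
# Sketch: deploy local-path-provisioner and make its storage class the default.
# The version tag is an assumption; pick the one that suits your cluster.
setup_local_path_storage() {
  kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/v0.0.22/deploy/local-path-storage.yaml &&
  kubectl patch storageclass local-path \
    -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
}
```

`kubectl get storageclass` should then show `local-path (default)`.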
Kustomize setup:
Kubeflow 1.6.0 supports Kustomize 3.2.0; don’t install the latest version. Execute the command below to set up Kustomize 3.2.0:
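A sketch assuming an amd64 Linux host; it downloads the standalone v3.2.0 binary from the kustomize GitHub release. The function name and the install directory argument are our own.

```shell
# Sketch: install the kustomize 3.2.0 binary into a directory on PATH.
install_kustomize_3_2_0() {
  # $1 = install directory, e.g. /usr/local/bin
  curl -fL -o "$1/kustomize" \
    https://github.com/kubernetes-sigs/kustomize/releases/download/v3.2.0/kustomize_3.2.0_linux_amd64 &&
  chmod +x "$1/kustomize"
}
```

For example: `install_kustomize_3_2_0 /usr/local/bin`, followed by `kustomize version` to check.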
Kubeflow setup:
Most of the installation procedures in the Kubeflow documentation target cloud providers. Installing from the manifests with kustomize works well for bare metal and for any Kubernetes distribution.
Clone or download the Kubeflow manifests from https://github.com/kubeflow/manifests/tree/v1.6-branch
Check out the “v1.6-branch” and generate a password hash using the command below:
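A sketch of both steps. The hash command follows the one given in the Kubeflow manifests README and assumes python3 with the passlib and bcrypt packages installed; the function names are our own.

```shell
# Sketch: fetch the manifests and switch to the v1.6 branch.
clone_kubeflow_manifests() {
  git clone https://github.com/kubeflow/manifests.git &&
  cd manifests &&
  git checkout v1.6-branch
}

# Prompts for a password and prints its bcrypt hash.
# Requires: pip3 install passlib bcrypt
generate_dex_password_hash() {
  python3 -c 'from passlib.hash import bcrypt; import getpass; print(bcrypt.using(rounds=12, ident="2y").hash(getpass.getpass()))'
}
```

Call `clone_kubeflow_manifests` first (it leaves you inside the repository), then `generate_dex_password_hash` to get the hash.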
- Update the generated hash in “common/dex/base/config-map.yaml”.
- Add the storage class in the following files if you plan to use one other than the default.
- Modify the size of minio-pvc based on the expected artifact size and the available storage.
Execute the following command to install kubeflow:
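A sketch based on the single-command install from the Kubeflow manifests README, assuming it is run from the root of the cloned manifests repository with kustomize 3.2.0 and kubectl on PATH. The wrapper function is our own.

```shell
# Sketch: build the example kustomization and apply it, retrying on failure.
install_kubeflow() {
  # Some resources depend on CRDs that are not registered on the first pass,
  # so the apply is repeated until it succeeds.
  while ! kustomize build example | kubectl apply -f -; do
    echo "Retrying to apply resources"
    sleep 10
  done
}
```

Call `install_kubeflow` and leave it running until it returns.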
After running this, the apply sometimes fails partway and the terminal outputs “Retrying to apply resources.” If this happens, it keeps retrying automatically until all the resources are applied and the pods can start. Wait until all the pods have a Running status before proceeding.
Check the status of the pods by executing the following commands:
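A sketch that walks through the namespaces created by the default Kubeflow 1.6 example install; the namespace list is taken from the Kubeflow manifests README and the function name is our own.

```shell
# Sketch: list the pods in each namespace used by the Kubeflow example install.
check_kubeflow_pods() {
  for ns in cert-manager istio-system auth knative-eventing knative-serving \
            kubeflow kubeflow-user-example-com; do
    kubectl get pods -n "$ns"
  done
}
```

Re-run `check_kubeflow_pods` until every pod reports Running.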
Patch the ingress gateway service to type NodePort:
Note: If a LoadBalancer is available in your environment, set the type to LoadBalancer instead of NodePort.
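A sketch of the patch, assuming the default `istio-ingressgateway` service in the `istio-system` namespace from the Kubeflow manifests. The function and its optional argument are our own.

```shell
# Sketch: change the Istio ingress gateway service type.
expose_istio_ingressgateway() {
  # $1 = service type: NodePort (default) or LoadBalancer
  kubectl patch svc istio-ingressgateway -n istio-system \
    -p "{\"spec\":{\"type\":\"${1:-NodePort}\"}}"
}
# To see the assigned port afterwards: kubectl get svc istio-ingressgateway -n istio-system
```

Calling `expose_istio_ingressgateway` with no argument sets the type to NodePort.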
Now we can access Kubeflow using the default credentials: email “user@example.com” and password “12341234”.
Troubleshooting:
As mentioned earlier, the RKE2 cluster is built on top of containerd, so we can use the crictl command for troubleshooting if necessary. To use crictl, apply the following settings on the node.
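A sketch of the settings, assuming the default RKE2 paths: the crictl binary ships in /var/lib/rancher/rke2/bin and RKE2 generates a crictl config under the agent directory.

```shell
# Sketch: point crictl at the containerd instance managed by RKE2 on this node.
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
export PATH="$PATH:/var/lib/rancher/rke2/bin"
# Then, for example: crictl ps    (running containers)
#                    crictl images (images present on the node)
```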
Feel free to reach me on LinkedIn if you have some questions.
Author’s Bio: Shanker JJ is a Senior InfraOps Engineer, part of the AI Engineering team at AI Inside Inc., Japan. He focuses on building production-ready ML Operations Infrastructure, ML services, tools, and data pipelines.