We installed an OpenStack cluster with close to 1000 nodes on Kubernetes. Here’s what we found out
Late last year, we ran a series of tests deploying close to 1000 OpenStack nodes on a pre-installed Kubernetes cluster as a way of finding out what problems you might run into, and fixing them where possible. In all we found several, and though in general we were able to fix them, we thought it would still be good to go over the types of things you need to look for.

Overall, we deployed an OpenStack cluster containing more than 900 nodes using Fuel-CCP on a Kubernetes cluster that had been deployed using Kargo. The Kargo tool is part of the Kubernetes Incubator project and uses the Large Kubernetes Cluster reference architecture as a baseline. As we worked, we documented the issues we found and contributed fixes to both the deployment tool and the reference design document where appropriate. Here's what we found.

The setup

We started with just over 175 bare metal machines, allocating 3 of them to Kubernetes control plane services (API servers, etcd, the Kubernetes scheduler, and so on). Each of the remaining machines hosted 5 virtual machines, and every VM was used as a Kubernetes minion node.

Each bare metal node had the following specifications:

- HP ProLiant DL380 Gen9
- CPU: 2x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
- RAM: 264 GB
- Storage: 3.0 TB on RAID on an HP Smart Array P840 controller; HDD: 12 x HP EH0600JDYTL
- Network: 2x Intel Corporation Ethernet 10G 2P X710

The running OpenStack cluster (as far as Kubernetes is concerned) consists of:

- OpenStack control plane services running in close to 150 pods over 6 nodes
- Close to 4500 pods spread across all of the remaining nodes, at 5 pods per minion node

One major Prometheus problem

During the experiments, we used the Prometheus monitoring tool to verify resource consumption and the load put on the core system, Kubernetes, and OpenStack services.
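As an illustration of how such monitoring can be wired up, here is a minimal Prometheus configuration sketch; the job name, target address, and InfluxDB endpoint below are hypothetical examples, not our actual setup. It also shows remote_write, which streams samples to a persistent backend so that data outlives Prometheus's local retention:

```yaml
# prometheus.yml -- illustrative fragment; targets and endpoints are
# examples, not the cluster's real addresses.
global:
  scrape_interval: 15s

scrape_configs:
  # Track kube-apiserver resource consumption and request latencies
  - job_name: kubernetes-apiservers
    scheme: https
    tls_config:
      insecure_skip_verify: true
    static_configs:
      - targets: ['10.0.0.1:6443']

# Stream samples to persistent storage (here, InfluxDB's Prometheus
# remote-write endpoint) so data survives local retention limits.
remote_write:
  - url: http://influxdb.example.com:8086/api/v1/prom/write?db=prometheus
```

Only the fragment's structure matters here; a real deployment would use Kubernetes service discovery and proper TLS credentials rather than static targets.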
One note of caution when using Prometheus: deleting old data from Prometheus storage will indeed improve the Prometheus API speed, but it will also delete any previous cluster information, making it unavailable for post-run investigation. So make sure to document any observed issue, and its debugging, thoroughly! Thankfully, we had in fact done that documentation, but one thing we've decided to do going forward is to prevent this problem by configuring Prometheus to back up data to one of the persistent time series databases it supports, such as InfluxDB, Cassandra, or OpenTSDB. By default, Prometheus is optimized for use as a real-time monitoring and alerting system, and the official recommendation from the Prometheus development team is to keep monitoring data for only about 15 days so that the tool stays quick and responsive. By setting up the backup, we can store old data for an extended amount of time for post-processing needs.

Problems we experienced in our testing

Huge load on kube-apiserver

Symptoms

Initially, we had a setup with all nodes (including the Kubernetes control plane nodes) running in a virtualized environment, but the load was such that the API servers couldn't function at all, so they were moved to bare metal. Even after we migrated them to hardware nodes, both API servers running in the Kubernetes cluster were utilising up to 2000% of the available CPU (up to 45% of total node compute capacity).

Root cause

All services that do not run on the Kubernetes masters (kubelet and kube-proxy on all minions) access kube-apiserver via a local NGINX proxy. Most of their requests are watch requests, which sit mostly idle after they are initiated (most of their timeouts are defined to be about 5-10 minutes). NGINX, however, was configured to cut idle connections after 3 seconds, which caused all clients to reconnect and, even worse, restart the aborted SSL sessions.
On the server side, this makes kube-apiserver consume up to 2000% of the CPU resources, making other requests very slow.

Solution

Set the proxy_timeout parameter to 10 minutes in the nginx.conf configuration file, which should be more than long enough to prevent connections from being cut before the requests time out by themselves. After this fix was applied, one API server consumed only 100% of CPU (about 2% of total node compute capacity), while the second consumed about 200% (about 4% of total node compute capacity), with an average response time of 200-400 ms.

Upstream issue status: fixed

Make the Kargo deployment tool set proxy_timeout to 10 minutes: issue fixed with a pull request by the Fuel-CCP team.
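For reference, the fix amounts to a one-line change in the local NGINX proxy configuration. The following is a minimal sketch of such a TCP proxy in front of kube-apiserver; the addresses are hypothetical, and the actual Kargo template differs in its details:

```nginx
# nginx.conf fragment -- addresses are illustrative examples.
stream {
    upstream kube_apiserver {
        server 10.0.0.1:6443;
        server 10.0.0.2:6443;
    }
    server {
        listen 127.0.0.1:6443;
        proxy_pass kube_apiserver;
        # Raised from the 3 seconds that was cutting long-lived watch
        # connections; 10 minutes outlasts typical watch timeouts.
        proxy_timeout 10m;
        proxy_connect_timeout 1s;
    }
}
```

The key point is that proxy_timeout must exceed the idle period of the longest-lived watch connections, so clients time out gracefully on their own schedule instead of being forced into reconnect and SSL renegotiation storms.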