When the DFC soft cap is exceeded, cluster performance can degrade, and in the worst case the cluster can fail. This article covers the known causes of exceeding the DFC soft cap: legacy Read Connector or Write Connector tasks, custom connectors, other variable workloads, and Kubernetes-related issues.
Legacy Connector Tasks
Legacy Read Connector or Write Connector tasks can push usage past the DFC soft cap because compute workers scale with the number of tasks running in parallel, and these workers are not accounted for in DFC calculations. Legacy custom read connectors and custom parsers are particularly problematic because they require both a compute worker and a separate pod.
Other Variable Workloads
Thriftserver is the Ascend component that provides JDBC/ODBC interfaces so that data analysts can query data in the Ascend environment with third-party tools. In some environments, thriftserver clusters scale up with usage: the number of thriftserver instances increases as more queries are made.
Query Service Clusters are another Ascend component; they provide query services to clients. In some environments they also scale up with usage, so the number of Query Service Cluster instances increases as more queries are made.
The HPA services in Ascend (services scaled by a Horizontal Pod Autoscaler) include the API, Frontend, and File-based Access (public-datafeed-gateway). Under high usage, these services scale out onto compute nodes, so the number of compute nodes dedicated to them grows with the workload.
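To see how usage translates into extra replicas, here is a minimal sketch of the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler. It assumes these services are scaled by a standard HPA, which this article does not confirm. The function name, the replica bounds, and the metric values are illustrative, and the real controller also applies a tolerance band and stabilization windows that are omitted here.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Approximate the documented HPA rule:
    desired = ceil(current_replicas * current_metric / target_metric),
    clamped to the configured replica bounds (bounds here are hypothetical).
    """
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# 3 replicas averaging 90% CPU against a 60% target
print(desired_replicas(3, 90, 60))  # 5
```

Each additional replica needs room on a compute node, so a sustained usage spike on one of these services can add nodes even when the data workload itself is unchanged.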
Kubernetes-Related Issues
Scale-down failure occurs when persistent pods are not marked as safe to remove. Kubernetes then cannot scale down the node until those pods are removed, which results in unnecessary resource usage and increased costs. Monitor regularly for persistent pods that block scale-down, and remove them.
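One way to look for such pods is to scan a pod listing for the usual scale-down blockers. The sketch below assumes the cluster uses the Kubernetes Cluster Autoscaler, which honors the `cluster-autoscaler.kubernetes.io/safe-to-evict` annotation and by default will not evict pods that have no owning controller. The input is the parsed JSON of `kubectl get pods --all-namespaces -o json`; the function name and the sample pods are hypothetical.

```python
from typing import List

# Annotation read by the Kubernetes Cluster Autoscaler (assumed to be in use).
SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict"


def scale_down_blockers(pod_list: dict) -> List[str]:
    """Return "namespace/name" for pods likely to block node scale-down.

    A pod is flagged if it is explicitly annotated safe-to-evict=false,
    or if it has no owning controller and is not annotated safe-to-evict=true.
    """
    blockers = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        annotations = meta.get("annotations") or {}
        marked_unsafe = annotations.get(SAFE_TO_EVICT) == "false"
        marked_safe = annotations.get(SAFE_TO_EVICT) == "true"
        bare_pod = not meta.get("ownerReferences")
        if marked_unsafe or (bare_pod and not marked_safe):
            blockers.append(f"{meta.get('namespace', 'default')}/{meta.get('name')}")
    return blockers


# Hypothetical sample listing
sample = {"items": [
    {"metadata": {"name": "parser-1", "namespace": "ascend",
                  "annotations": {SAFE_TO_EVICT: "false"},
                  "ownerReferences": [{"kind": "ReplicaSet"}]}},
    {"metadata": {"name": "bare-1", "namespace": "ascend"}},
    {"metadata": {"name": "web-1", "namespace": "ascend",
                  "ownerReferences": [{"kind": "ReplicaSet"}]}},
]}
print(scale_down_blockers(sample))  # ['ascend/parser-1', 'ascend/bare-1']
```

This covers only two of the conditions that can block scale-down; pods with local storage or restrictive disruption budgets would need separate checks.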
Packing failure occurs when Spark pods do not fit cleanly on a node. Part of each node's CPU and memory is reserved for system components, so the allocatable capacity is lower than the nominal size, and Kubernetes can place fewer pods on the node than the nominal numbers suggest. For instance, if each pod requests 4 CPUs, only 3 pods may fit on a nominally 16-CPU node. This leads to underutilized resources and decreased performance.
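The arithmetic behind the 16-CPU example can be sketched as follows. The allocatable figures (15,500 millicores and 60 GiB) and the pod memory request are hypothetical; actual values depend on the node type and on what the cluster reserves for system daemons.

```python
def pods_per_node(alloc_cpu_m: int, alloc_mem_mib: int,
                  pod_cpu_m: int, pod_mem_mib: int) -> int:
    """Number of identical pods that fit on one node.

    CPU is in millicores and memory in MiB, so the math stays in integers.
    The tighter of the two resources decides the count.
    """
    return min(alloc_cpu_m // pod_cpu_m, alloc_mem_mib // pod_mem_mib)


# Nominal 16-CPU node with a hypothetical 15.5 CPUs (15500m) allocatable
fit = pods_per_node(15500, 61440, 4000, 8192)
print(fit)                 # 3
print(15500 - fit * 4000)  # 3500 millicores left stranded on the node

# A slightly smaller request packs cleanly under the same assumptions
print(pods_per_node(15500, 61440, 3800, 8192))  # 4
```

Under these assumed numbers, nearly a quarter of the node's CPU sits idle with 4-CPU requests, which is why request sizes that divide the allocatable capacity evenly pack better.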
Zombie pods are extra pods that remain alive after they should have been garbage collected. For example, custom parser function pods can keep nodes running, possibly pushing usage past the cap. Monitor for zombie pods and run garbage collection regularly to avoid wasted resources and increased costs.
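A simple monitoring heuristic is to flag pods that have been running longer than expected. The sketch below works on the parsed JSON of `kubectl get pods --all-namespaces -o json` and uses the standard `status.phase` and `status.startTime` fields. The function name and the age threshold are hypothetical, and a long-running pod is only a candidate for review, not proof that it is a zombie.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def zombie_candidates(pod_list: dict, now: datetime,
                      max_age: timedelta) -> List[str]:
    """Return "namespace/name" for Running pods older than max_age."""
    candidates = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        status = pod.get("status", {})
        start = status.get("startTime")
        if status.get("phase") != "Running" or not start:
            continue
        # Kubernetes reports timestamps as RFC 3339 in UTC, e.g. 2024-01-01T00:00:00Z
        started = datetime.strptime(start, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        if now - started > max_age:
            candidates.append(f"{meta.get('namespace', 'default')}/{meta.get('name')}")
    return candidates


# Hypothetical sample listing
sample_pods = {"items": [
    {"metadata": {"name": "parser-old", "namespace": "ascend"},
     "status": {"phase": "Running", "startTime": "2024-01-01T00:00:00Z"}},
    {"metadata": {"name": "parser-new", "namespace": "ascend"},
     "status": {"phase": "Running", "startTime": "2024-01-01T23:00:00Z"}},
]}
now = datetime(2024, 1, 2, tzinfo=timezone.utc)
print(zombie_candidates(sample_pods, now, timedelta(hours=6)))  # ['ascend/parser-old']
```

In practice the threshold would be set per workload type, since long-lived services are expected to run for days while custom parser function pods are not.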