Kubernetes Scheduling Policies
Today we take a look at Kubernetes scheduling policies.
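Note
The default scheduling policies may differ from version to version; check the source code of the version you are running for the details.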
1. Categories of Scheduling Policies
pkg/scheduler/framework/plugins/legacy_registry.go
// Used as the default set of predicates if Policy was specified, but predicates was nil.
DefaultPredicates: sets.NewString(
	NoVolumeZoneConflictPred,
	MaxEBSVolumeCountPred,
	MaxGCEPDVolumeCountPred,
	MaxAzureDiskVolumeCountPred,
	MaxCSIVolumeCountPred,
	MatchInterPodAffinityPred,
	NoDiskConflictPred,
	GeneralPred,
	PodToleratesNodeTaintsPred,
	CheckVolumeBindingPred,
	CheckNodeUnschedulablePred,
	EvenPodsSpreadPred,
),

// Used as the default set of predicates if Policy was specified, but priorities was nil.
DefaultPriorities: map[string]int64{
	SelectorSpreadPriority:      1,
	InterPodAffinityPriority:    1,
	LeastRequestedPriority:      1,
	BalancedResourceAllocation:  1,
	NodePreferAvoidPodsPriority: 10000,
	NodeAffinityPriority:        1,
	TaintTolerationPriority:     1,
	ImageLocalityPriority:       1,
	EvenPodsSpreadPriority:      2,
},
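The source code shows which filtering policies (Predicates) and which scoring policies (Priorities) are enabled by default. The number next to each priority is its weight.
Predicates (filtering): select the nodes the Pod can be scheduled onto. A node that is faulty, or whose free resources cannot satisfy the Pod's requests, cannot be a candidate.
Priorities (scoring): rank the nodes that passed filtering and pick the single most suitable one.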
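The two phases described above can be sketched in a few lines of Go. This is a minimal illustration, not the scheduler's actual code: the Node, Predicate and Priority types and the FitsCPU and FreeCPU helpers are hypothetical names invented for this example, and a node is reduced to a schedulable flag plus free CPU.

```go
package main

// Node is a simplified, hypothetical stand-in for a cluster node.
type Node struct {
	Name         string
	Schedulable  bool
	FreeMilliCPU int64
}

// Predicate answers "can the Pod run here at all?" (filtering phase).
type Predicate func(n Node, reqMilliCPU int64) bool

// Priority scores a node that passed filtering; Weight mirrors the
// integer weights in the DefaultPriorities map.
type Priority struct {
	Weight int64
	Score  func(n Node, reqMilliCPU int64) int64
}

// FitsCPU is an example predicate: the node must be schedulable and
// have enough free CPU for the request.
func FitsCPU(n Node, reqMilliCPU int64) bool {
	return n.Schedulable && n.FreeMilliCPU >= reqMilliCPU
}

// FreeCPU is an example scoring function: more CPU left over after
// placing the Pod means a higher score.
func FreeCPU(n Node, reqMilliCPU int64) int64 {
	return n.FreeMilliCPU - reqMilliCPU
}

// Schedule drops every node that fails any predicate, then returns the
// name of the node with the highest weighted score. The boolean is
// false when no node passed filtering.
func Schedule(nodes []Node, reqMilliCPU int64, preds []Predicate, prios []Priority) (string, bool) {
	best, bestScore, found := "", int64(0), false
	for _, n := range nodes {
		feasible := true
		for _, p := range preds {
			if !p(n, reqMilliCPU) {
				feasible = false
				break
			}
		}
		if !feasible {
			continue
		}
		var total int64
		for _, pr := range prios {
			total += pr.Weight * pr.Score(n, reqMilliCPU)
		}
		if !found || total > bestScore {
			best, bestScore, found = n.Name, total, true
		}
	}
	return best, found
}
```

Calling Schedule(nodes, 400, []Predicate{FitsCPU}, []Priority{{Weight: 1, Score: FreeCPU}}) returns the schedulable node with the most spare CPU. A large weight, such as the 10000 given to NodePreferAvoidPodsPriority, lets one priority dominate the sum.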
2. Predicates (Filtering Policies)
2.1 Predicate constant definitions
const (
	// MatchInterPodAffinityPred defines the name of predicate MatchInterPodAffinity.
	MatchInterPodAffinityPred = "MatchInterPodAffinity"
	// CheckVolumeBindingPred defines the name of predicate CheckVolumeBinding.
	CheckVolumeBindingPred = "CheckVolumeBinding"
	// GeneralPred defines the name of predicate GeneralPredicates.
	GeneralPred = "GeneralPredicates"
	// HostNamePred defines the name of predicate HostName.
	HostNamePred = "HostName"
	// PodFitsHostPortsPred defines the name of predicate PodFitsHostPorts.
	PodFitsHostPortsPred = "PodFitsHostPorts"
	// MatchNodeSelectorPred defines the name of predicate MatchNodeSelector.
	MatchNodeSelectorPred = "MatchNodeSelector"
	// PodFitsResourcesPred defines the name of predicate PodFitsResources.
	PodFitsResourcesPred = "PodFitsResources"
	// NoDiskConflictPred defines the name of predicate NoDiskConflict.
	NoDiskConflictPred = "NoDiskConflict"
	// PodToleratesNodeTaintsPred defines the name of predicate PodToleratesNodeTaints.
	PodToleratesNodeTaintsPred = "PodToleratesNodeTaints"
	// CheckNodeUnschedulablePred defines the name of predicate CheckNodeUnschedulablePredicate.
	CheckNodeUnschedulablePred = "CheckNodeUnschedulable"
	// CheckNodeLabelPresencePred defines the name of predicate CheckNodeLabelPresence.
	CheckNodeLabelPresencePred = "CheckNodeLabelPresence"
	// CheckServiceAffinityPred defines the name of predicate checkServiceAffinity.
	CheckServiceAffinityPred = "CheckServiceAffinity"
	// MaxEBSVolumeCountPred defines the name of predicate MaxEBSVolumeCount.
	// DEPRECATED
	// All cloudprovider specific predicates are deprecated in favour of MaxCSIVolumeCountPred.
	MaxEBSVolumeCountPred = "MaxEBSVolumeCount"
	// MaxGCEPDVolumeCountPred defines the name of predicate MaxGCEPDVolumeCount.
	// DEPRECATED
	// All cloudprovider specific predicates are deprecated in favour of MaxCSIVolumeCountPred.
	MaxGCEPDVolumeCountPred = "MaxGCEPDVolumeCount"
	// MaxAzureDiskVolumeCountPred defines the name of predicate MaxAzureDiskVolumeCount.
	// DEPRECATED
	// All cloudprovider specific predicates are deprecated in favour of MaxCSIVolumeCountPred.
	MaxAzureDiskVolumeCountPred = "MaxAzureDiskVolumeCount"
	// MaxCinderVolumeCountPred defines the name of predicate MaxCinderDiskVolumeCount.
	// DEPRECATED
	// All cloudprovider specific predicates are deprecated in favour of MaxCSIVolumeCountPred.
	MaxCinderVolumeCountPred = "MaxCinderVolumeCount"
	// MaxCSIVolumeCountPred defines the predicate that decides how many CSI volumes should be attached.
	MaxCSIVolumeCountPred = "MaxCSIVolumeCountPred"
	// NoVolumeZoneConflictPred defines the name of predicate NoVolumeZoneConflict.
	NoVolumeZoneConflictPred = "NoVolumeZoneConflict"
	// EvenPodsSpreadPred defines the name of predicate EvenPodsSpread.
	EvenPodsSpreadPred = "EvenPodsSpread"
)
2.2 Glossary
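- NoVolumeZoneConflictPred: given the Zone restrictions, checks whether deploying the Pod on this host would cause a volume conflict.
- MaxEBSVolumeCountPred: (deprecated) ensures the number of attached EBS volumes does not exceed the configured maximum.
- MaxGCEPDVolumeCountPred: (deprecated) ensures the number of attached GCE volumes does not exceed the configured maximum.
- MaxAzureDiskVolumeCountPred: (deprecated) ensures the number of attached Azure volumes does not exceed the configured maximum.
- MaxCSIVolumeCountPred: checks whether the number of volumes on the Node would exceed the maximum.
- MatchInterPodAffinityPred: checks whether the Pod and the other Pods satisfy the affinity rules.
- NoDiskConflictPred: checks whether the volumes to be mounted conflict with volumes that already exist.
- GeneralPred: a composite check made up of HostName (the requested host name matches), PodFitsHostPorts (the requested host ports are not already in use), MatchNodeSelector (the node matches the Pod's node selector) and PodFitsResources (the node has enough resources for the Pod's requests).
- PodToleratesNodeTaintsPred: ensures the tolerations defined on the Pod tolerate the taints defined on the Node.
- CheckVolumeBindingPred: checks whether the PVs on the Node can satisfy the Pod's PVCs.
- CheckNodeUnschedulablePred: checks whether the Node is schedulable.
- EvenPodsSpreadPred: checks whether the Node satisfies the topology spread constraints.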
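As an illustration of what one of these predicates checks, here is a simplified sketch of the PodToleratesNodeTaints idea. The Taint and Toleration structs below are cut-down, hypothetical versions of the Kubernetes API types, and the sketch assumes that only NoSchedule and NoExecute taints block scheduling while PreferNoSchedule is left to the scoring phase.

```go
package main

// Taint and Toleration are simplified, hypothetical versions of the
// corresponding Kubernetes API types.
type Taint struct {
	Key, Value, Effect string
}

type Toleration struct {
	Key, Operator, Value, Effect string
}

// tolerates reports whether a single toleration matches a single taint.
// An empty Effect or Key on the toleration acts as a wildcard.
func tolerates(tol Toleration, t Taint) bool {
	if tol.Effect != "" && tol.Effect != t.Effect {
		return false
	}
	if tol.Key != "" && tol.Key != t.Key {
		return false
	}
	switch tol.Operator {
	case "Exists":
		return true
	case "", "Equal": // an empty operator is treated as Equal
		return tol.Value == t.Value
	}
	return false
}

// PodToleratesNodeTaints returns true when every hard taint on the node
// is matched by at least one of the Pod's tolerations.
func PodToleratesNodeTaints(tols []Toleration, taints []Taint) bool {
	for _, t := range taints {
		if t.Effect == "PreferNoSchedule" {
			continue // soft taint: does not filter the node out
		}
		tolerated := false
		for _, tol := range tols {
			if tolerates(tol, t) {
				tolerated = true
				break
			}
		}
		if !tolerated {
			return false
		}
	}
	return true
}
```

A Pod with no tolerations is rejected by a node carrying a dedicated=gpu:NoSchedule taint, while a Pod with a matching Equal toleration, or an Exists toleration for the same key, passes the filter.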
3. Priorities (Scoring Policies)
3.1 Priority constant definitions
const (
	// EqualPriority defines the name of prioritizer function that gives an equal weight of one to all nodes.
	EqualPriority = "EqualPriority"
	// MostRequestedPriority defines the name of prioritizer function that gives used nodes higher priority.
	MostRequestedPriority = "MostRequestedPriority"
	// RequestedToCapacityRatioPriority defines the name of RequestedToCapacityRatioPriority.
	RequestedToCapacityRatioPriority = "RequestedToCapacityRatioPriority"
	// SelectorSpreadPriority defines the name of prioritizer function that spreads pods by minimizing
	// the number of pods (belonging to the same service or replication controller) on the same node.
	SelectorSpreadPriority = "SelectorSpreadPriority"
	// ServiceSpreadingPriority is largely replaced by "SelectorSpreadPriority".
	ServiceSpreadingPriority = "ServiceSpreadingPriority"
	// InterPodAffinityPriority defines the name of prioritizer function that decides which pods should or
	// should not be placed in the same topological domain as some other pods.
	InterPodAffinityPriority = "InterPodAffinityPriority"
	// LeastRequestedPriority defines the name of prioritizer function that prioritize nodes by least
	// requested utilization.
	LeastRequestedPriority = "LeastRequestedPriority"
	// BalancedResourceAllocation defines the name of prioritizer function that prioritizes nodes
	// to help achieve balanced resource usage.
	BalancedResourceAllocation = "BalancedResourceAllocation"
	// NodePreferAvoidPodsPriority defines the name of prioritizer function that priorities nodes according to
	// the node annotation "scheduler.alpha.kubernetes.io/preferAvoidPods".
	NodePreferAvoidPodsPriority = "NodePreferAvoidPodsPriority"
	// NodeAffinityPriority defines the name of prioritizer function that prioritizes nodes which have labels
	// matching NodeAffinity.
	NodeAffinityPriority = "NodeAffinityPriority"
	// TaintTolerationPriority defines the name of prioritizer function that prioritizes nodes that marked
	// with taint which pod can tolerate.
	TaintTolerationPriority = "TaintTolerationPriority"
	// ImageLocalityPriority defines the name of prioritizer function that prioritizes nodes that have images
	// requested by the pod present.
	ImageLocalityPriority = "ImageLocalityPriority"
	// EvenPodsSpreadPriority defines the name of prioritizer function that prioritizes nodes
	// which have pods and labels matching the incoming pod's topologySpreadConstraints.
	EvenPodsSpreadPriority = "EvenPodsSpreadPriority"
)
3.2 Glossary
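- SelectorSpreadPriority: counts the Pods on each Node that belong to the same Service or ReplicaSet as the incoming Pod. The fewer such Pods a Node has, the higher its score.
- InterPodAffinityPriority: Pod affinity scoring. Like NodeAffinityPriority, it supports two kinds of selectors.
- LeastRequestedPriority: computes the CPU and memory the Pods request as a share of the Node's resources; the Node with the smallest share is the best. Score formula: (cpu((capacity - sum(requested)) * 10 / capacity) + memory((capacity - sum(requested)) * 10 / capacity)) / 2
- BalancedResourceAllocation: the Node whose resource usage (CPU, memory) is the most balanced is the best. Score formula: 10 - abs(totalCpu/cpuNodeCapacity - totalMemory/memoryNodeCapacity) * 10
- NodePreferAvoidPodsPriority: scores according to the Node annotation scheduler.alpha.kubernetes.io/preferAvoidPods.
- NodeAffinityPriority: node affinity scoring, with two kinds of selectors: requiredDuringSchedulingIgnoredDuringExecution (the chosen host must satisfy all of the Pod's rules for hosts) and preferredDuringSchedulingIgnoredDuringExecution (the scheduler tries to satisfy all NodeSelector requirements but does not guarantee it).
- TaintTolerationPriority: the scoring counterpart of the PodToleratesNodeTaints predicate. It prefers Nodes that have fewer PreferNoSchedule taints the Pod cannot tolerate.
- ImageLocalityPriority: scores a host by whether it already holds the images the Pod needs. A host with none of the required images scores 0; when images are present, the larger they are, the higher the score.
- EvenPodsSpreadPriority: scores by the number of Pods that match the topology spread constraints.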
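The LeastRequestedPriority and BalancedResourceAllocation formulas above can be worked through in code. The following is a sketch of the formulas as written in this article, on the 0 to 10 scale they use; the function names are invented for the example and the real plugins differ in detail between versions.

```go
package main

import "math"

// leastRequested scores one resource as (capacity - requested) * 10 / capacity,
// using integer division as the formula above does.
func leastRequested(requested, capacity int64) int64 {
	if capacity == 0 || requested > capacity {
		return 0
	}
	return (capacity - requested) * 10 / capacity
}

// LeastRequestedScore averages the per-resource scores for CPU and memory.
func LeastRequestedScore(cpuReq, cpuCap, memReq, memCap int64) int64 {
	return (leastRequested(cpuReq, cpuCap) + leastRequested(memReq, memCap)) / 2
}

// BalancedResourceScore returns 10 - |cpuFraction - memFraction| * 10.
// A node that would be fully used on either resource scores 0.
func BalancedResourceScore(cpuReq, cpuCap, memReq, memCap int64) float64 {
	if cpuCap == 0 || memCap == 0 {
		return 0
	}
	cpuFrac := float64(cpuReq) / float64(cpuCap)
	memFrac := float64(memReq) / float64(memCap)
	if cpuFrac >= 1 || memFrac >= 1 {
		return 0
	}
	return 10 - math.Abs(cpuFrac-memFrac)*10
}
```

For a node with 4000 millicores and 8 GiB where 1000 millicores and 2 GiB are requested, both per-resource scores are 7 (7.5 truncated), so LeastRequestedScore(1000, 4000, 2, 8) is 7. With 1000 of 4000 millicores but 6 of 8 GiB requested, the usage fractions are 0.25 and 0.75, so BalancedResourceScore(1000, 4000, 6, 8) is 5.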
4. Modifying the Default Policies
pkg/scheduler/framework/plugins/registry.go
func NewInTreeRegistry() runtime.Registry {
	fts := plfeature.Features{
		EnablePodAffinityNamespaceSelector: feature.DefaultFeatureGate.Enabled(features.PodAffinityNamespaceSelector),
		EnablePodDisruptionBudget:          feature.DefaultFeatureGate.Enabled(features.PodDisruptionBudget),
		EnablePodOverhead:                  feature.DefaultFeatureGate.Enabled(features.PodOverhead),
		EnableReadWriteOncePod:             feature.DefaultFeatureGate.Enabled(features.ReadWriteOncePod),
		EnableVolumeCapacityPriority:       feature.DefaultFeatureGate.Enabled(features.VolumeCapacityPriority),
		EnableCSIStorageCapacity:           feature.DefaultFeatureGate.Enabled(features.CSIStorageCapacity),
		EnableGenericEphemeralVolume:       feature.DefaultFeatureGate.Enabled(features.GenericEphemeralVolume),
	}
...
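The source code shows that scheduling behavior can be adjusted through Features.
To change it, edit the kube-scheduler startup parameters and add --feature-gates. The available gates are: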
APIListChunking=true|false (BETA - default=true)
APIPriorityAndFairness=true|false (BETA - default=true)
APIResponseCompression=true|false (BETA - default=true)
APIServerIdentity=true|false (ALPHA - default=false)
APIServerTracing=true|false (ALPHA - default=false)
AllAlpha=true|false (ALPHA - default=false)
AllBeta=true|false (BETA - default=false)
AnyVolumeDataSource=true|false (ALPHA - default=false)
AppArmor=true|false (BETA - default=true)
CPUManager=true|false (BETA - default=true)
CPUManagerPolicyOptions=true|false (ALPHA - default=false)
CSIInlineVolume=true|false (BETA - default=true)
CSIMigration=true|false (BETA - default=true)
CSIMigrationAWS=true|false (BETA - default=false)
CSIMigrationAzureDisk=true|false (BETA - default=false)
CSIMigrationAzureFile=true|false (BETA - default=false)
CSIMigrationGCE=true|false (BETA - default=false)
CSIMigrationOpenStack=true|false (BETA - default=true)
CSIMigrationvSphere=true|false (BETA - default=false)
CSIStorageCapacity=true|false (BETA - default=true)
CSIVolumeFSGroupPolicy=true|false (BETA - default=true)
CSIVolumeHealth=true|false (ALPHA - default=false)
CSRDuration=true|false (BETA - default=true)
ConfigurableFSGroupPolicy=true|false (BETA - default=true)
ControllerManagerLeaderMigration=true|false (BETA - default=true)
CustomCPUCFSQuotaPeriod=true|false (ALPHA - default=false)
DaemonSetUpdateSurge=true|false (BETA - default=true)
DefaultPodTopologySpread=true|false (BETA - default=true)
DelegateFSGroupToCSIDriver=true|false (ALPHA - default=false)
DevicePlugins=true|false (BETA - default=true)
DisableAcceleratorUsageMetrics=true|false (BETA - default=true)
DisableCloudProviders=true|false (ALPHA - default=false)
DownwardAPIHugePages=true|false (BETA - default=false)
EfficientWatchResumption=true|false (BETA - default=true)
EndpointSliceTerminatingCondition=true|false (BETA - default=true)
EphemeralContainers=true|false (ALPHA - default=false)
ExpandCSIVolumes=true|false (BETA - default=true)
ExpandInUsePersistentVolumes=true|false (BETA - default=true)
ExpandPersistentVolumes=true|false (BETA - default=true)
ExpandedDNSConfig=true|false (ALPHA - default=false)
ExperimentalHostUserNamespaceDefaulting=true|false (BETA - default=false)
GenericEphemeralVolume=true|false (BETA - default=true)
GracefulNodeShutdown=true|false (BETA - default=true)
HPAContainerMetrics=true|false (ALPHA - default=false)
HPAScaleToZero=true|false (ALPHA - default=false)
IPv6DualStack=true|false (BETA - default=true)
InTreePluginAWSUnregister=true|false (ALPHA - default=false)
InTreePluginAzureDiskUnregister=true|false (ALPHA - default=false)
InTreePluginAzureFileUnregister=true|false (ALPHA - default=false)
InTreePluginGCEUnregister=true|false (ALPHA - default=false)
InTreePluginOpenStackUnregister=true|false (ALPHA - default=false)
InTreePluginvSphereUnregister=true|false (ALPHA - default=false)
IndexedJob=true|false (BETA - default=true)
IngressClassNamespacedParams=true|false (BETA - default=true)
JobTrackingWithFinalizers=true|false (ALPHA - default=false)
KubeletCredentialProviders=true|false (ALPHA - default=false)
KubeletInUserNamespace=true|false (ALPHA - default=false)
KubeletPodResources=true|false (BETA - default=true)
KubeletPodResourcesGetAllocatable=true|false (ALPHA - default=false)
LocalStorageCapacityIsolation=true|false (BETA - default=true)
LocalStorageCapacityIsolationFSQuotaMonitoring=true|false (ALPHA - default=false)
LogarithmicScaleDown=true|false (BETA - default=true)
MemoryManager=true|false (BETA - default=true)
MemoryQoS=true|false (ALPHA - default=false)
MixedProtocolLBService=true|false (ALPHA - default=false)
NetworkPolicyEndPort=true|false (BETA - default=true)
NodeSwap=true|false (ALPHA - default=false)
NonPreemptingPriority=true|false (BETA - default=true)
PodAffinityNamespaceSelector=true|false (BETA - default=true)
PodDeletionCost=true|false (BETA - default=true)
PodOverhead=true|false (BETA - default=true)
PodSecurity=true|false (ALPHA - default=false)
PreferNominatedNode=true|false (BETA - default=true)
ProbeTerminationGracePeriod=true|false (BETA - default=false)
ProcMountType=true|false (ALPHA - default=false)
ProxyTerminatingEndpoints=true|false (ALPHA - default=false)
QOSReserved=true|false (ALPHA - default=false)
ReadWriteOncePod=true|false (ALPHA - default=false)
RemainingItemCount=true|false (BETA - default=true)
RemoveSelfLink=true|false (BETA - default=true)
RotateKubeletServerCertificate=true|false (BETA - default=true)
SeccompDefault=true|false (ALPHA - default=false)
ServiceInternalTrafficPolicy=true|false (BETA - default=true)
ServiceLBNodePortControl=true|false (BETA - default=true)
ServiceLoadBalancerClass=true|false (BETA - default=true)
SizeMemoryBackedVolumes=true|false (BETA - default=true)
StatefulSetMinReadySeconds=true|false (ALPHA - default=false)
StorageVersionAPI=true|false (ALPHA - default=false)
StorageVersionHash=true|false (BETA - default=true)
SuspendJob=true|false (BETA - default=true)
TTLAfterFinished=true|false (BETA - default=true)
TopologyAwareHints=true|false (ALPHA - default=false)
TopologyManager=true|false (BETA - default=true)
VolumeCapacityPriority=true|false (ALPHA - default=false)
WinDSR=true|false (ALPHA - default=false)
WinOverlay=true|false (BETA - default=true)
WindowsHostProcessContainers=true|false (ALPHA - default=false)
For more on the feature-gates parameter, see the kube-scheduler reference.
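The --feature-gates flag takes a comma-separated list of Name=true|false pairs, for example --feature-gates=VolumeCapacityPriority=true,DefaultPodTopologySpread=false. To make the format concrete, here is a small sketch that parses such a value into a map. ParseFeatureGates is a hypothetical helper written for this article; it is not the parser Kubernetes itself uses, and it does not check that a gate name actually exists.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ParseFeatureGates turns "A=true,B=false" into map[A:true B:false].
// It rejects entries without "=" and values that are not booleans.
func ParseFeatureGates(s string) (map[string]bool, error) {
	gates := make(map[string]bool)
	for _, pair := range strings.Split(s, ",") {
		pair = strings.TrimSpace(pair)
		if pair == "" {
			continue // tolerate empty input and trailing commas
		}
		name, value, ok := strings.Cut(pair, "=")
		if !ok {
			return nil, fmt.Errorf("missing '=' in %q", pair)
		}
		name = strings.TrimSpace(name)
		enabled, err := strconv.ParseBool(strings.TrimSpace(value))
		if err != nil {
			return nil, fmt.Errorf("invalid value for gate %q: %w", name, err)
		}
		gates[name] = enabled
	}
	return gates, nil
}
```

Parsing "VolumeCapacityPriority=true,DefaultPodTopologySpread=false" yields a map with the first gate enabled and the second disabled, which corresponds to the values that feature.DefaultFeatureGate.Enabled(...) would report for those two features in NewInTreeRegistry.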
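- Original author: 黄忠德
- Original link: https://huangzhongde.cn/post/Kubernetes/kubernetes_scheduler/
- Copyright: this work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license. For non-commercial reposts, credit the source (author and original link); for commercial reposts, contact the author for permission.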