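Today let's take a look at Kubernetes scheduling policies.

Note

The default scheduling policies can differ from version to version; check the source code of the version you are running for details.

1. Categories of scheduling policies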

pkg/scheduler/framework/plugins/legacy_registry.go

// Used as the default set of predicates if Policy was specified, but predicates was nil.
DefaultPredicates: sets.NewString(
    NoVolumeZoneConflictPred,
    MaxEBSVolumeCountPred,
    MaxGCEPDVolumeCountPred,
    MaxAzureDiskVolumeCountPred,
    MaxCSIVolumeCountPred,
    MatchInterPodAffinityPred,
    NoDiskConflictPred,
    GeneralPred,
    PodToleratesNodeTaintsPred,
    CheckVolumeBindingPred,
    CheckNodeUnschedulablePred,
    EvenPodsSpreadPred,
),

// Used as the default set of predicates if Policy was specified, but priorities was nil.
DefaultPriorities: map[string]int64{
    SelectorSpreadPriority:      1,
    InterPodAffinityPriority:    1,
    LeastRequestedPriority:      1,
    BalancedResourceAllocation:  1,
    NodePreferAvoidPodsPriority: 10000,
    NodeAffinityPriority:        1,
    TaintTolerationPriority:     1,
    ImageLocalityPriority:       1,
    EvenPodsSpreadPriority:      2,
},

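From the source code we can see which predicates (filtering policies) and priorities (scoring policies) are enabled by default. The number attached to each priority is its weight.

Predicates (filtering): select the nodes that the Pod can be scheduled onto. A node that is faulty, or whose available resources do not cover the Pod's requests, cannot be a scheduling target.

Priorities (scoring): rank the nodes that passed filtering and pick the single most suitable one.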
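
The two phases can be sketched in a few lines of Go. This is a minimal illustration, not the scheduler's actual code: the Node, Predicate and Priority types and the three toy functions are hypothetical, and ties are resolved by taking the first node seen.

```go
package main

import "fmt"

// Node is a simplified, hypothetical stand-in for a cluster node.
type Node struct {
	Name         string
	Schedulable  bool
	FreeMilliCPU int64
}

// Predicate reports whether a node may run a pod requesting reqMilliCPU.
type Predicate func(n Node, reqMilliCPU int64) bool

// Priority scores a node (higher is better) and carries a weight.
type Priority struct {
	Weight int64
	Score  func(n Node, reqMilliCPU int64) int64
}

// isSchedulable is a toy predicate in the spirit of CheckNodeUnschedulable.
func isSchedulable(n Node, _ int64) bool { return n.Schedulable }

// fitsCPU is a toy predicate in the spirit of PodFitsResources.
func fitsCPU(n Node, req int64) bool { return n.FreeMilliCPU >= req }

// freeCPUAfter is a toy priority: more CPU left after placement scores higher.
func freeCPUAfter(n Node, req int64) int64 { return (n.FreeMilliCPU - req) / 100 }

// schedule runs the filter phase, then the scoring phase, and returns the
// name of the best node. The boolean is false when no node passes filtering.
func schedule(nodes []Node, req int64, preds []Predicate, prios []Priority) (string, bool) {
	best, bestScore, found := "", int64(0), false
	for _, n := range nodes {
		feasible := true
		for _, p := range preds {
			if !p(n, req) {
				feasible = false
				break
			}
		}
		if !feasible {
			continue
		}
		// Weighted sum of all priority scores for this node.
		var total int64
		for _, pr := range prios {
			total += pr.Weight * pr.Score(n, req)
		}
		if !found || total > bestScore {
			best, bestScore, found = n.Name, total, true
		}
	}
	return best, found
}

func main() {
	nodes := []Node{
		{Name: "a", Schedulable: false, FreeMilliCPU: 4000},
		{Name: "b", Schedulable: true, FreeMilliCPU: 1000},
		{Name: "c", Schedulable: true, FreeMilliCPU: 3000},
	}
	preds := []Predicate{isSchedulable, fitsCPU}
	prios := []Priority{{Weight: 1, Score: freeCPUAfter}}
	name, ok := schedule(nodes, 500, preds, prios)
	fmt.Println(name, ok)
}
```

In this example node a is dropped by filtering because it is unschedulable, and node c beats node b in scoring because it keeps more free CPU after the Pod is placed.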

2. Predicates (filtering policies)

2.1 Predicate constant definitions

const (
    // MatchInterPodAffinityPred defines the name of predicate MatchInterPodAffinity.
    MatchInterPodAffinityPred = "MatchInterPodAffinity"
    // CheckVolumeBindingPred defines the name of predicate CheckVolumeBinding.
    CheckVolumeBindingPred = "CheckVolumeBinding"
    // GeneralPred defines the name of predicate GeneralPredicates.
    GeneralPred = "GeneralPredicates"
    // HostNamePred defines the name of predicate HostName.
    HostNamePred = "HostName"
    // PodFitsHostPortsPred defines the name of predicate PodFitsHostPorts.
    PodFitsHostPortsPred = "PodFitsHostPorts"
    // MatchNodeSelectorPred defines the name of predicate MatchNodeSelector.
    MatchNodeSelectorPred = "MatchNodeSelector"
    // PodFitsResourcesPred defines the name of predicate PodFitsResources.
    PodFitsResourcesPred = "PodFitsResources"
    // NoDiskConflictPred defines the name of predicate NoDiskConflict.
    NoDiskConflictPred = "NoDiskConflict"
    // PodToleratesNodeTaintsPred defines the name of predicate PodToleratesNodeTaints.
    PodToleratesNodeTaintsPred = "PodToleratesNodeTaints"
    // CheckNodeUnschedulablePred defines the name of predicate CheckNodeUnschedulablePredicate.
    CheckNodeUnschedulablePred = "CheckNodeUnschedulable"
    // CheckNodeLabelPresencePred defines the name of predicate CheckNodeLabelPresence.
    CheckNodeLabelPresencePred = "CheckNodeLabelPresence"
    // CheckServiceAffinityPred defines the name of predicate checkServiceAffinity.
    CheckServiceAffinityPred = "CheckServiceAffinity"
    // MaxEBSVolumeCountPred defines the name of predicate MaxEBSVolumeCount.
    // DEPRECATED
    // All cloudprovider specific predicates are deprecated in favour of MaxCSIVolumeCountPred.
    MaxEBSVolumeCountPred = "MaxEBSVolumeCount"
    // MaxGCEPDVolumeCountPred defines the name of predicate MaxGCEPDVolumeCount.
    // DEPRECATED
    // All cloudprovider specific predicates are deprecated in favour of MaxCSIVolumeCountPred.
    MaxGCEPDVolumeCountPred = "MaxGCEPDVolumeCount"
    // MaxAzureDiskVolumeCountPred defines the name of predicate MaxAzureDiskVolumeCount.
    // DEPRECATED
    // All cloudprovider specific predicates are deprecated in favour of MaxCSIVolumeCountPred.
    MaxAzureDiskVolumeCountPred = "MaxAzureDiskVolumeCount"
    // MaxCinderVolumeCountPred defines the name of predicate MaxCinderDiskVolumeCount.
    // DEPRECATED
    // All cloudprovider specific predicates are deprecated in favour of MaxCSIVolumeCountPred.
    MaxCinderVolumeCountPred = "MaxCinderVolumeCount"
    // MaxCSIVolumeCountPred defines the predicate that decides how many CSI volumes should be attached.
    MaxCSIVolumeCountPred = "MaxCSIVolumeCountPred"
    // NoVolumeZoneConflictPred defines the name of predicate NoVolumeZoneConflict.
    NoVolumeZoneConflictPred = "NoVolumeZoneConflict"
    // EvenPodsSpreadPred defines the name of predicate EvenPodsSpread.
    EvenPodsSpreadPred = "EvenPodsSpread"
)

2.2 Glossary

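  • NoVolumeZoneConflictPred: given the Zone restrictions, checks whether deploying the Pod on this host would cause a volume conflict.
  • MaxEBSVolumeCountPred: (deprecated) ensures that the number of attached EBS volumes does not exceed the configured maximum.
  • MaxGCEPDVolumeCountPred: (deprecated) ensures that the number of attached GCE PD volumes does not exceed the configured maximum.
  • MaxAzureDiskVolumeCountPred: (deprecated) ensures that the number of attached Azure Disk volumes does not exceed the configured maximum.
  • MaxCSIVolumeCountPred: checks whether the number of CSI volumes attached to the Node would exceed the maximum.
  • MatchInterPodAffinityPred: checks whether the Pod satisfies the affinity rules with respect to other Pods.
  • NoDiskConflictPred: checks whether the volumes the Pod mounts conflict with volumes already in use.
  • GeneralPred: checks whether the Node basically fits the Pod. It bundles HostName (the Node name requested by the Pod), PodFitsHostPorts (whether the requested host ports are already taken), MatchNodeSelector (whether the Node's labels match the Pod's node selector) and PodFitsResources (whether the Node has enough resources for the Pod's requests).
  • PodToleratesNodeTaintsPred: ensures that the tolerations defined on the Pod tolerate the taints defined on the Node.
  • CheckVolumeBindingPred: checks whether the PVs available on the Node can satisfy the Pod's PVCs.
  • CheckNodeUnschedulablePred: checks whether the Node is schedulable.
  • EvenPodsSpreadPred: checks whether the Node satisfies the Pod's topology spread constraints.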
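
To make the GeneralPred entry concrete, here is a minimal sketch of its four sub-checks. The Pod and Node structs and the generalPredicates function are hypothetical simplifications for illustration, not the Kubernetes API types or the real implementation.

```go
package main

import "fmt"

// Pod is a simplified, hypothetical pod description.
type Pod struct {
	NodeName     string            // node requested by name; empty means any node
	HostPorts    []int             // host ports the pod wants to bind
	NodeSelector map[string]string // labels the node must carry
	MilliCPU     int64             // CPU request
	MemoryMiB    int64             // memory request
}

// Node is a simplified, hypothetical node description.
type Node struct {
	Name                string
	Labels              map[string]string
	UsedHostPorts       map[int]bool
	AllocatableMilliCPU int64
	AllocatableMiB      int64
	RequestedMilliCPU   int64 // sum of requests of pods already on the node
	RequestedMiB        int64
}

// generalPredicates returns the name of the first failing check,
// or an empty string when the node passes all four checks.
func generalPredicates(p Pod, n Node) string {
	if p.NodeName != "" && p.NodeName != n.Name {
		return "HostName"
	}
	for _, port := range p.HostPorts {
		if n.UsedHostPorts[port] {
			return "PodFitsHostPorts"
		}
	}
	for k, want := range p.NodeSelector {
		if got, ok := n.Labels[k]; !ok || got != want {
			return "MatchNodeSelector"
		}
	}
	if n.RequestedMilliCPU+p.MilliCPU > n.AllocatableMilliCPU ||
		n.RequestedMiB+p.MemoryMiB > n.AllocatableMiB {
		return "PodFitsResources"
	}
	return ""
}

func main() {
	node := Node{
		Name:                "node-1",
		Labels:              map[string]string{"disk": "ssd"},
		UsedHostPorts:       map[int]bool{8080: true},
		AllocatableMilliCPU: 4000,
		AllocatableMiB:      8192,
		RequestedMilliCPU:   3000,
		RequestedMiB:        4096,
	}
	ok := Pod{NodeSelector: map[string]string{"disk": "ssd"}, MilliCPU: 500, MemoryMiB: 1024}
	portClash := Pod{HostPorts: []int{8080}, MilliCPU: 100, MemoryMiB: 64}
	tooBig := Pod{MilliCPU: 2000, MemoryMiB: 1024}

	fmt.Printf("%q %q %q\n",
		generalPredicates(ok, node),
		generalPredicates(portClash, node),
		generalPredicates(tooBig, node))
}
```

Returning the name of the failing check, rather than a plain boolean, mirrors the way a scheduler reports why a node was rejected.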

3. Priorities (scoring policies)

3.1 Priority constant definitions

const (
    // EqualPriority defines the name of prioritizer function that gives an equal weight of one to all nodes.
    EqualPriority = "EqualPriority"
    // MostRequestedPriority defines the name of prioritizer function that gives used nodes higher priority.
    MostRequestedPriority = "MostRequestedPriority"
    // RequestedToCapacityRatioPriority defines the name of RequestedToCapacityRatioPriority.
    RequestedToCapacityRatioPriority = "RequestedToCapacityRatioPriority"
    // SelectorSpreadPriority defines the name of prioritizer function that spreads pods by minimizing
    // the number of pods (belonging to the same service or replication controller) on the same node.
    SelectorSpreadPriority = "SelectorSpreadPriority"
    // ServiceSpreadingPriority is largely replaced by "SelectorSpreadPriority".
    ServiceSpreadingPriority = "ServiceSpreadingPriority"
    // InterPodAffinityPriority defines the name of prioritizer function that decides which pods should or
    // should not be placed in the same topological domain as some other pods.
    InterPodAffinityPriority = "InterPodAffinityPriority"
    // LeastRequestedPriority defines the name of prioritizer function that prioritize nodes by least
    // requested utilization.
    LeastRequestedPriority = "LeastRequestedPriority"
    // BalancedResourceAllocation defines the name of prioritizer function that prioritizes nodes
    // to help achieve balanced resource usage.
    BalancedResourceAllocation = "BalancedResourceAllocation"
    // NodePreferAvoidPodsPriority defines the name of prioritizer function that priorities nodes according to
    // the node annotation "scheduler.alpha.kubernetes.io/preferAvoidPods".
    NodePreferAvoidPodsPriority = "NodePreferAvoidPodsPriority"
    // NodeAffinityPriority defines the name of prioritizer function that prioritizes nodes which have labels
    // matching NodeAffinity.
    NodeAffinityPriority = "NodeAffinityPriority"
    // TaintTolerationPriority defines the name of prioritizer function that prioritizes nodes that marked
    // with taint which pod can tolerate.
    TaintTolerationPriority = "TaintTolerationPriority"
    // ImageLocalityPriority defines the name of prioritizer function that prioritizes nodes that have images
    // requested by the pod present.
    ImageLocalityPriority = "ImageLocalityPriority"
    // EvenPodsSpreadPriority defines the name of prioritizer function that prioritizes nodes
    // which have pods and labels matching the incoming pod's topologySpreadConstraints.
    EvenPodsSpreadPriority = "EvenPodsSpreadPriority"
)

3.2 Glossary

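  • SelectorSpreadPriority: counts, per Node, the Pods that belong to the same Service or ReplicaSet as the incoming Pod. The fewer such Pods a Node already runs, the higher its score.
  • InterPodAffinityPriority: Pod affinity policy. Like NodeAffinityPriority, it supports two kinds of selectors: requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution.
  • LeastRequestedPriority: computes the share of the Node's CPU and memory that would be requested once the Pod is placed; the Node with the lowest share is best. Score formula: (cpu((capacity - sum(requested)) * 10 / capacity) + memory((capacity - sum(requested)) * 10 / capacity)) / 2
  • BalancedResourceAllocation: the Node whose resource usage (CPU, memory) is most balanced is best. Score formula: 10 - abs(totalCpu/cpuNodeCapacity - totalMemory/memoryNodeCapacity) * 10
  • NodePreferAvoidPodsPriority: scores Nodes according to the Node annotation scheduler.alpha.kubernetes.io/preferAvoidPods.
  • NodeAffinityPriority: node affinity policy. It supports two kinds of selectors: requiredDuringSchedulingIgnoredDuringExecution (the chosen host must satisfy all of the Pod's rules for the host) and preferredDuringSchedulingIgnoredDuringExecution (the scheduler tries, but does not guarantee, to satisfy all requirements of the NodeSelector).
  • TaintTolerationPriority: similar to PodToleratesNodeTaints among the predicates. It prefers Nodes whose taints the Pod can tolerate: the fewer intolerable taints a Node has, the higher it scores.
  • ImageLocalityPriority: scores Nodes by whether they already hold the images the Pod needs. A Node with none of the required images gets 0; when images are present, the larger they are, the higher the score.
  • EvenPodsSpreadPriority: scores Nodes by the number of Pods that match the incoming Pod's topology spread constraints.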
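
The two score formulas for LeastRequestedPriority and BalancedResourceAllocation can be worked through in Go. This is a sketch under stated assumptions: the function names are hypothetical, the 0 to 10 scale follows the formulas in the list above, "requested" already includes the incoming Pod, and overcommitted or zero-capacity nodes are simply given a score of 0.

```go
package main

import (
	"fmt"
	"math"
)

// leastRequestedScore implements (capacity - requested) * 10 / capacity
// for a single resource, using integer arithmetic.
func leastRequestedScore(requested, capacity int64) int64 {
	if capacity <= 0 || requested > capacity {
		return 0
	}
	return (capacity - requested) * 10 / capacity
}

// leastRequestedPriority averages the CPU and memory scores.
func leastRequestedPriority(cpuReq, cpuCap, memReq, memCap int64) int64 {
	return (leastRequestedScore(cpuReq, cpuCap) + leastRequestedScore(memReq, memCap)) / 2
}

// balancedResourceAllocation implements
// 10 - abs(cpuFraction - memFraction) * 10, truncated to an integer.
func balancedResourceAllocation(cpuReq, cpuCap, memReq, memCap int64) int64 {
	if cpuCap <= 0 || memCap <= 0 {
		return 0
	}
	cpuFraction := float64(cpuReq) / float64(cpuCap)
	memFraction := float64(memReq) / float64(memCap)
	if cpuFraction >= 1 || memFraction >= 1 {
		return 0
	}
	return int64(10 - math.Abs(cpuFraction-memFraction)*10)
}

func main() {
	// 2000m of 4000m CPU requested, 2 GiB of 8 GiB memory requested.
	fmt.Println(leastRequestedPriority(2000, 4000, 2, 8))
	fmt.Println(balancedResourceAllocation(2000, 4000, 2, 8))
}
```

For this node the CPU score is 5 and the memory score is 7 (60 / 8 truncated), so LeastRequestedPriority yields 6. The CPU and memory fractions are 0.5 and 0.25, so BalancedResourceAllocation yields 10 - 2.5 = 7.5, truncated to 7.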

4. Changing the default policies

pkg/scheduler/framework/plugins/registry.go

func NewInTreeRegistry() runtime.Registry {
    fts := plfeature.Features{
        EnablePodAffinityNamespaceSelector: feature.DefaultFeatureGate.Enabled(features.PodAffinityNamespaceSelector),
        EnablePodDisruptionBudget:          feature.DefaultFeatureGate.Enabled(features.PodDisruptionBudget),
        EnablePodOverhead:                  feature.DefaultFeatureGate.Enabled(features.PodOverhead),
        EnableReadWriteOncePod:             feature.DefaultFeatureGate.Enabled(features.ReadWriteOncePod),
        EnableVolumeCapacityPriority:       feature.DefaultFeatureGate.Enabled(features.VolumeCapacityPriority),
        EnableCSIStorageCapacity:           feature.DefaultFeatureGate.Enabled(features.CSIStorageCapacity),
        EnableGenericEphemeralVolume:       feature.DefaultFeatureGate.Enabled(features.GenericEphemeralVolume),
    }
    ...

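From the source we can see that scheduling behavior can be adjusted through Features (feature gates).

To adjust them, change the kube-scheduler startup parameters and add --feature-gates.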

APIListChunking=true|false (BETA - default=true)
APIPriorityAndFairness=true|false (BETA - default=true)
APIResponseCompression=true|false (BETA - default=true)
APIServerIdentity=true|false (ALPHA - default=false)
APIServerTracing=true|false (ALPHA - default=false)
AllAlpha=true|false (ALPHA - default=false)
AllBeta=true|false (BETA - default=false)
AnyVolumeDataSource=true|false (ALPHA - default=false)
AppArmor=true|false (BETA - default=true)
CPUManager=true|false (BETA - default=true)
CPUManagerPolicyOptions=true|false (ALPHA - default=false)
CSIInlineVolume=true|false (BETA - default=true)
CSIMigration=true|false (BETA - default=true)
CSIMigrationAWS=true|false (BETA - default=false)
CSIMigrationAzureDisk=true|false (BETA - default=false)
CSIMigrationAzureFile=true|false (BETA - default=false)
CSIMigrationGCE=true|false (BETA - default=false)
CSIMigrationOpenStack=true|false (BETA - default=true)
CSIMigrationvSphere=true|false (BETA - default=false)
CSIStorageCapacity=true|false (BETA - default=true)
CSIVolumeFSGroupPolicy=true|false (BETA - default=true)
CSIVolumeHealth=true|false (ALPHA - default=false)
CSRDuration=true|false (BETA - default=true)
ConfigurableFSGroupPolicy=true|false (BETA - default=true)
ControllerManagerLeaderMigration=true|false (BETA - default=true)
CustomCPUCFSQuotaPeriod=true|false (ALPHA - default=false)
DaemonSetUpdateSurge=true|false (BETA - default=true)
DefaultPodTopologySpread=true|false (BETA - default=true)
DelegateFSGroupToCSIDriver=true|false (ALPHA - default=false)
DevicePlugins=true|false (BETA - default=true)
DisableAcceleratorUsageMetrics=true|false (BETA - default=true)
DisableCloudProviders=true|false (ALPHA - default=false)
DownwardAPIHugePages=true|false (BETA - default=false)
EfficientWatchResumption=true|false (BETA - default=true)
EndpointSliceTerminatingCondition=true|false (BETA - default=true)
EphemeralContainers=true|false (ALPHA - default=false)
ExpandCSIVolumes=true|false (BETA - default=true)
ExpandInUsePersistentVolumes=true|false (BETA - default=true)
ExpandPersistentVolumes=true|false (BETA - default=true)
ExpandedDNSConfig=true|false (ALPHA - default=false)
ExperimentalHostUserNamespaceDefaulting=true|false (BETA - default=false)
GenericEphemeralVolume=true|false (BETA - default=true)
GracefulNodeShutdown=true|false (BETA - default=true)
HPAContainerMetrics=true|false (ALPHA - default=false)
HPAScaleToZero=true|false (ALPHA - default=false)
IPv6DualStack=true|false (BETA - default=true)
InTreePluginAWSUnregister=true|false (ALPHA - default=false)
InTreePluginAzureDiskUnregister=true|false (ALPHA - default=false)
InTreePluginAzureFileUnregister=true|false (ALPHA - default=false)
InTreePluginGCEUnregister=true|false (ALPHA - default=false)
InTreePluginOpenStackUnregister=true|false (ALPHA - default=false)
InTreePluginvSphereUnregister=true|false (ALPHA - default=false)
IndexedJob=true|false (BETA - default=true)
IngressClassNamespacedParams=true|false (BETA - default=true)
JobTrackingWithFinalizers=true|false (ALPHA - default=false)
KubeletCredentialProviders=true|false (ALPHA - default=false)
KubeletInUserNamespace=true|false (ALPHA - default=false)
KubeletPodResources=true|false (BETA - default=true)
KubeletPodResourcesGetAllocatable=true|false (ALPHA - default=false)
LocalStorageCapacityIsolation=true|false (BETA - default=true)
LocalStorageCapacityIsolationFSQuotaMonitoring=true|false (ALPHA - default=false)
LogarithmicScaleDown=true|false (BETA - default=true)
MemoryManager=true|false (BETA - default=true)
MemoryQoS=true|false (ALPHA - default=false)
MixedProtocolLBService=true|false (ALPHA - default=false)
NetworkPolicyEndPort=true|false (BETA - default=true)
NodeSwap=true|false (ALPHA - default=false)
NonPreemptingPriority=true|false (BETA - default=true)
PodAffinityNamespaceSelector=true|false (BETA - default=true)
PodDeletionCost=true|false (BETA - default=true)
PodOverhead=true|false (BETA - default=true)
PodSecurity=true|false (ALPHA - default=false)
PreferNominatedNode=true|false (BETA - default=true)
ProbeTerminationGracePeriod=true|false (BETA - default=false)
ProcMountType=true|false (ALPHA - default=false)
ProxyTerminatingEndpoints=true|false (ALPHA - default=false)
QOSReserved=true|false (ALPHA - default=false)
ReadWriteOncePod=true|false (ALPHA - default=false)
RemainingItemCount=true|false (BETA - default=true)
RemoveSelfLink=true|false (BETA - default=true)
RotateKubeletServerCertificate=true|false (BETA - default=true)
SeccompDefault=true|false (ALPHA - default=false)
ServiceInternalTrafficPolicy=true|false (BETA - default=true)
ServiceLBNodePortControl=true|false (BETA - default=true)
ServiceLoadBalancerClass=true|false (BETA - default=true)
SizeMemoryBackedVolumes=true|false (BETA - default=true)
StatefulSetMinReadySeconds=true|false (ALPHA - default=false)
StorageVersionAPI=true|false (ALPHA - default=false)
StorageVersionHash=true|false (BETA - default=true)
SuspendJob=true|false (BETA - default=true)
TTLAfterFinished=true|false (BETA - default=true)
TopologyAwareHints=true|false (ALPHA - default=false)
TopologyManager=true|false (BETA - default=true)
VolumeCapacityPriority=true|false (ALPHA - default=false)
WinDSR=true|false (ALPHA - default=false)
WinOverlay=true|false (BETA - default=true)
WindowsHostProcessContainers=true|false (ALPHA - default=false)
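
The value passed to --feature-gates is a comma-separated list of Name=true|false pairs, for example VolumeCapacityPriority=true,DefaultPodTopologySpread=false. The sketch below shows how such a string breaks down into individual switches. parseFeatureGates is a hypothetical helper written for illustration; it is not the parser that kube-scheduler uses and it does not check that the gate names exist.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseFeatureGates splits a --feature-gates style value into a map
// from gate name to its boolean setting.
func parseFeatureGates(value string) (map[string]bool, error) {
	gates := make(map[string]bool)
	for _, pair := range strings.Split(value, ",") {
		pair = strings.TrimSpace(pair)
		if pair == "" {
			continue
		}
		parts := strings.SplitN(pair, "=", 2)
		if len(parts) != 2 {
			return nil, fmt.Errorf("missing '=' in %q", pair)
		}
		enabled, err := strconv.ParseBool(strings.TrimSpace(parts[1]))
		if err != nil {
			return nil, fmt.Errorf("invalid value in %q: %v", pair, err)
		}
		gates[strings.TrimSpace(parts[0])] = enabled
	}
	return gates, nil
}

func main() {
	gates, err := parseFeatureGates("VolumeCapacityPriority=true,DefaultPodTopologySpread=false")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(gates["VolumeCapacityPriority"], gates["DefaultPodTopologySpread"])
}
```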

For more feature-gates parameters, see the kube-scheduler documentation.