166 lines
		
	
	
		
			7.7 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
			
		
		
	
	
			166 lines
		
	
	
		
			7.7 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
| 		   Goals, Design and Implementation of the
 | |
| 		      new ultra-scalable O(1) scheduler
 | |
| 
 | |
| 
 | |
|   This is an edited version of an email Ingo Molnar sent to
 | |
|   lkml on 4 Jan 2002.  It describes the goals, design, and
 | |
|   implementation of Ingo's new ultra-scalable O(1) scheduler.
 | |
|   Last Updated: 18 April 2002.
 | |
| 
 | |
| 
 | |
| Goal
 | |
| ====
 | |
| 
 | |
| The main goal of the new scheduler is to keep all the good things we know
 | |
| and love about the current Linux scheduler:
 | |
| 
 | |
|  - good interactive performance even during high load: if the user
 | |
|    types or clicks then the system must react instantly and must execute
 | |
|    the user tasks smoothly, even during considerable background load.
 | |
| 
 | |
|  - good scheduling/wakeup performance with 1-2 runnable processes.
 | |
| 
 | |
|  - fairness: no process should stay without any timeslice for any
 | |
|    unreasonable amount of time. No process should get an unjustly high
 | |
|    amount of CPU time.
 | |
| 
 | |
|  - priorities: less important tasks can be started with lower priority,
 | |
|    more important tasks with higher priority.
 | |
| 
 | |
|  - SMP efficiency: no CPU should stay idle if there is work to do.
 | |
| 
 | |
|  - SMP affinity: processes which run on one CPU should stay affine to
 | |
|    that CPU. Processes should not bounce between CPUs too frequently.
 | |
| 
 | |
|  - plus additional scheduler features: RT scheduling, CPU binding.
 | |
| 
 | |
| and the goal is also to add a few new things:
 | |
| 
 | |
|  - fully O(1) scheduling. Are you tired of the recalculation loop
 | |
|    blowing the L1 cache away every now and then? Do you think the goodness
 | |
|    loop is taking a bit too long to finish if there are lots of runnable
 | |
|    processes? This new scheduler takes no prisoners: wakeup(), schedule(),
 | |
|    the timer interrupt are all O(1) algorithms. There is no recalculation
 | |
|    loop. There is no goodness loop either.
 | |
| 
 | |
|  - 'perfect' SMP scalability. With the new scheduler there is no 'big'
 | |
|    runqueue_lock anymore - it's all per-CPU runqueues and locks - two
 | |
|    tasks on two separate CPUs can wake up, schedule and context-switch
 | |
|    completely in parallel, without any interlocking. All
 | |
|    scheduling-relevant data is structured for maximum scalability.
 | |
| 
 | |
|  - better SMP affinity. The old scheduler has a particular weakness that
 | |
|    causes the random bouncing of tasks between CPUs if/when higher
 | |
|    priority/interactive tasks, this was observed and reported by many
 | |
|    people. The reason is that the timeslice recalculation loop first needs
 | |
|    every currently running task to consume its timeslice. But when this
 | |
|    happens on eg. an 8-way system, then this property starves an
 | |
|    increasing number of CPUs from executing any process. Once the last
 | |
|    task that has a timeslice left has finished using up that timeslice,
 | |
|    the recalculation loop is triggered and other CPUs can start executing
 | |
|    tasks again - after having idled around for a number of timer ticks.
 | |
|    The more CPUs, the worse this effect.
 | |
| 
 | |
|    Furthermore, this same effect causes the bouncing effect as well:
 | |
|    whenever there is such a 'timeslice squeeze' of the global runqueue,
 | |
|    idle processors start executing tasks which are not affine to that CPU.
 | |
|    (because the affine tasks have finished off their timeslices already.)
 | |
| 
 | |
|    The new scheduler solves this problem by distributing timeslices on a
 | |
|    per-CPU basis, without having any global synchronization or
 | |
|    recalculation.
 | |
| 
 | |
|  - batch scheduling. A significant proportion of computing-intensive tasks
 | |
|    benefit from batch-scheduling, where timeslices are long and processes
 | |
|    are roundrobin scheduled. The new scheduler does such batch-scheduling
 | |
|    of the lowest priority tasks - so nice +19 jobs will get
 | |
|    'batch-scheduled' automatically. With this scheduler, nice +19 jobs are
 | |
|    in essence SCHED_IDLE, from an interactiveness point of view.
 | |
| 
 | |
|  - handle extreme loads more smoothly, without breakdown and scheduling
 | |
|    storms.
 | |
| 
 | |
|  - O(1) RT scheduling. For those RT folks who are paranoid about the
 | |
|    O(nr_running) property of the goodness loop and the recalculation loop.
 | |
| 
 | |
|  - run fork()ed children before the parent. Andrea has pointed out the
 | |
|    advantages of this a few months ago, but patches for this feature
 | |
|    do not work with the old scheduler as well as they should,
 | |
|    because idle processes often steal the new child before the fork()ing
 | |
|    CPU gets to execute it.
 | |
| 
 | |
| 
 | |
| Design
 | |
| ======
 | |
| 
 | |
| The core of the new scheduler contains the following mechanisms:
 | |
| 
 | |
|  - *two* priority-ordered 'priority arrays' per CPU. There is an 'active'
 | |
|    array and an 'expired' array. The active array contains all tasks that
 | |
|    are affine to this CPU and have timeslices left. The expired array
 | |
|    contains all tasks which have used up their timeslices - but this array
 | |
|    is kept sorted as well. The active and expired array is not accessed
 | |
|    directly, it's accessed through two pointers in the per-CPU runqueue
 | |
|    structure. If all active tasks are used up then we 'switch' the two
 | |
|    pointers and from now on the ready-to-go (former-) expired array is the
 | |
|    active array - and the empty active array serves as the new collector
 | |
|    for expired tasks.
 | |
| 
 | |
|  - there is a 64-bit bitmap cache for array indices. Finding the highest
 | |
|    priority task is thus a matter of two x86 BSFL bit-search instructions.
 | |
| 
 | |
| the split-array solution enables us to have an arbitrary number of active
 | |
| and expired tasks, and the recalculation of timeslices can be done
 | |
| immediately when the timeslice expires. Because the arrays are always
 | |
| access through the pointers in the runqueue, switching the two arrays can
 | |
| be done very quickly.
 | |
| 
 | |
| this is a hybride priority-list approach coupled with roundrobin
 | |
| scheduling and the array-switch method of distributing timeslices.
 | |
| 
 | |
|  - there is a per-task 'load estimator'.
 | |
| 
 | |
| one of the toughest things to get right is good interactive feel during
 | |
| heavy system load. While playing with various scheduler variants i found
 | |
| that the best interactive feel is achieved not by 'boosting' interactive
 | |
| tasks, but by 'punishing' tasks that want to use more CPU time than there
 | |
| is available. This method is also much easier to do in an O(1) fashion.
 | |
| 
 | |
| to establish the actual 'load' the task contributes to the system, a
 | |
| complex-looking but pretty accurate method is used: there is a 4-entry
 | |
| 'history' ringbuffer of the task's activities during the last 4 seconds.
 | |
| This ringbuffer is operated without much overhead. The entries tell the
 | |
| scheduler a pretty accurate load-history of the task: has it used up more
 | |
| CPU time or less during the past N seconds. [the size '4' and the interval
 | |
| of 4x 1 seconds was found by lots of experimentation - this part is
 | |
| flexible and can be changed in both directions.]
 | |
| 
 | |
| the penalty a task gets for generating more load than the CPU can handle
 | |
| is a priority decrease - there is a maximum amount to this penalty
 | |
| relative to their static priority, so even fully CPU-bound tasks will
 | |
| observe each other's priorities, and will share the CPU accordingly.
 | |
| 
 | |
| the SMP load-balancer can be extended/switched with additional parallel
 | |
| computing and cache hierarchy concepts: NUMA scheduling, multi-core CPUs
 | |
| can be supported easily by changing the load-balancer. Right now it's
 | |
| tuned for my SMP systems.
 | |
| 
 | |
| i skipped the prev->mm == next->mm advantage - no workload i know of shows
 | |
| any sensitivity to this. It can be added back by sacrificing O(1)
 | |
| schedule() [the current and one-lower priority list can be searched for a
 | |
| that->mm == current->mm condition], but costs a fair number of cycles
 | |
| during a number of important workloads, so i wanted to avoid this as much
 | |
| as possible.
 | |
| 
 | |
| - the SMP idle-task startup code was still racy and the new scheduler
 | |
| triggered this. So i streamlined the idle-setup code a bit. We do not call
 | |
| into schedule() before all processors have started up fully and all idle
 | |
| threads are in place.
 | |
| 
 | |
| - the patch also cleans up a number of aspects of sched.c - moves code
 | |
| into other areas of the kernel where it's appropriate, and simplifies
 | |
| certain code paths and data constructs. As a result, the new scheduler's
 | |
| code is smaller than the old one.
 | |
| 
 | |
| 	Ingo
 |