Unix Internals - California State University, Northridge



Unix Internals

Process Scheduling 5

Unix scheduler

• policy

• implementation

context switch

• hardware execution context of current process is saved in it’s

process control block (PCB)

which is traditionally part of the

u area of the process

• context of the next process is retrieved from it’s PCB,

loaded into the hardware registers

• architecture-specific tasks

o flush data cache after context switch

o flush instruction cache after context switch

o flush address translation cache after context switch

o pipelined architectures (e.g., RISC)

flush instruction pipeline prior to context switch

hardware clock

• fixed time interval interrupt clock tick 10 milliseconds

• major tick n clock ticks n depends upon specific Unix variant

• clock frequency

o number of ticks per second

o constant HZ defined in param.h file

• kernel functions measure time in number of ticks

clock interrupt handler

• highly system-dependent

• priority second only to power-failure interrupt

• updates CPU usage statistics for current process

• performs scheduler-related functions

o priority recomputation

o time-slice expiration handling

• sends SIGXCPU signal to current process

if it has exceeded its CPU usage quota

• updates

o time-of-day clock

o SVR4 variable lbolt – number of ticks elapsed since system last booted

• handles callouts

• wakes system processes when required

o swapper

o pagedaemon

• handles alarms

callouts

• records a function that the kernel will invoke after a specific number of ticks

• int to_ID = timeout(void(*fn)(), caddr_t arg, long delta);

o fn() – kernel function to be invoked

o arg – argument to be passed to the function

o delta – time interval measured in number of ticks

• kernel invokes callout in system context

o function cannot sleep

o function cannot access the process context

• void untimeout( int to_ID);

• periodic tasks

o retransmission of network packets

o scheduler functions

o memory management functions

o monitoring devices to avoid losing interrupts

o polling devices

• normal kernel operation – do not execute at interrupt priority

o clock handler checks if any callouts are due

o sets flag indicating that callout handler must run

o when system returns from base interrupt priority

▪ checks flag

▪ if flag is set, system invokes callout handler

▪ handler invokes each callout that is due to run

o once due, callouts run,

but only after all pending interrupts have been serviced

pending callout list

• checking time

o every CPU tick

o high interrupt priority

o optimize checking time

• insertion time

o low priority

o much less frequently than one per tick

• implementation

o 4.3BSD sort list in order of “time to fire”

▪ stores differential time of expiration for each entry

difference between its time to fire and that of the previous callout

▪ decrements time difference in first entry at each clock

o sort list in order of “time to fire”

▪ stores absolute time of expiration for each entry

▪ compares current absolute time with time in first entry

o timing wheel

▪ fixed-size circular array of callout queues

▪ at each tick, clock interrupt handler advances current time pointer

to the next element in the array,

wrapping around at the end of the array

▪ if any callouts are in the queue, their expiration time is checked

▪ new callouts are inserted into the queue which is N elements away from current queue, where N is the time to fire measured in ticks

▪ timing wheel hashes callouts based on the expiration time

▪ within each queue, callouts may be sorted or unsorted

▪ sorting callouts

• reduces time required to process non-empty queues but

• increases insertion time

alarms

• real time alarm

o measures actual elapsed time

o notifies process via SIGALRM signal

• profiling alarm

o measures amount of time process has been executing

o notifies process via SIGPROF signal

• virtual-time alarm

• monitors time spent in user mode

• notifies process via SIGVALRM signal

BSD Unix

setitimer() system call

• allows process to request any type of alarm

• specify time interval in microseconds

SVRx

alarm() system call

• requests real time alarm

• specify time interval in whole number of seconds

SVR4

hrtsys() system call

• allows time to be specified in microseconds

high resolution of real time alarms does not imply high accuracy

kernel delivers SIGALRM signal to process,

process will not respond to signal until it is next scheduled to run

high-resolution timers used by high priority processes are less likely to have scheduling delays

but if current process is executing in kernel mode and does not reach an preemption point

profiling & virtual alarms

clock interrupt handler charges whole tick to process even though it uses only a portion of it

time measured by these alarms reflect number of clock interrupts that have occurred while process was running

long run – average out good indicator of time used

for any single alarm, significant inaccuracy

scheduler goals

interactive process

reduce average time and variance between user action and application response sufficiently so that users cannot readily detect the delay

batch process

measure of scheduling efficiency is the task completion time in the presence of other activity

real time process -- time critical

predictable scheduling behavior with guaranteed bounds on response times

kernel functions – paging, interrupt handling, process management

execute promptly when required

mix of processes

all application processes must continue to progress

no application should prevent another from progressing unless user has explicitly permitted it

system should always be able to receive and process interactive user input

Traditional Unix Scheduling

SVR3, 4.3BSD

• time-sharing, interactive environments

• multiple users

• several batch & foreground processes simultaneously

• improve response times of interactive users

• ensuring that low-priority, background jobs do not starve

• priority-based scheduling

• each process has a scheduling priority that changes with time

• scheduler selects highest-priority runnable process

• uses preemptive time-slicing to schedule processes of equal priority

• dynamically varies process priority based on their CPU usage patterns

• kernel is nonpreemptible

• running process

o may voluntarily relinquish CPU when blocking on a resource

o can be preempted when it returns to user mode

process priorities

kernel priority numbers 0 – 49

user mode priority numbers 50 – 127

proc structure

p_pri current scheduling priority

p_usrpri user-mode priority

p_cpu measure of recent CPU usage

p_nice user-controllable nice factor

• scheduler uses p_pri to decide which process to schedule

• process in user mode ( p_pri == p_usrpri

• process wakes up after blocking in a system call

• scheduler uses p_pri to store the temporary kernel priority

sleep priority

• terminal input 28

• disk I/O 20

• process wakes up after blocking

o kernel sets p_pri ( sleep priority of the event or resource

o process scheduled ahead of other user processes

• process completes system call, about to return to user mode

• kernel sets scheduling priority p_pri ( p_usrpri

user-mode priority p_usrpri depends upon

• recent CPU usage p_cpu

• user-controllable nice factor p_nice range 0 – 39 , default 20

1 user mode: Δ↑ p_nice ( Δ↓ priority

o superuser: Δ↓ p_nice ( Δ↑ priority

process priority algorithm

• process creation ( p_cpu = 0

• clock tick ( clock handler increments p_cpu for current process (max 127)

• every second ( kernel invokes schedcpu() -- scheduled by callout

o reduces p_cpu of each process by decay factor

o SVR3

• fixed decay factor ½

• undesirable side effect – system load Δ↑ ( Δ↑ priorities

o 4.3BSD computes

• decay = (2 * load_average) / (2 * load_average + 1);

• load_average == average number of runnable processes over last second

• p_usrpri = PUSER + ( p_cpu / 4 ) + ( 2 * p_nice );

• PUSER is the baseline user priority of 50

scheduler implementation

whichgs (global variable – bitmask 32 cells – one bit for each queue)

|0 |0 |0 |1 |0 |1 |0 |… |

qs (32 rows)

|0 – 3 |

|4 -- 7 |

|8 -- 11 |

|12--15 |

|16--19 |

|20--23 |

| |

|… |

|124-127 |

proc structures

runnable processes

context switch - swtch() function

• examines whichgs to find index of the first set bit

• bit identifies scheduler queue containing highest priority runnable process

• removes process from head of queue & switches context to it

• proc structure -- p_addr field points to page table entries of the u area

• process control block is a part of the u area

• VAX-11

o special instructions – 32 bit fields

▪ FFS – find First Set

▪ FFC – Find First Clear

o special instructions – double linked lists

▪ INSQHI – insert elements into doubly linked lists

▪ REMQHI– remove elements from doubly linked lists

o special instructions – process context

▪ LDPCTX – load process context

▪ SVPCTX – save process context

run queue manipulation

• highest priority process always runs

unless current process is executing in kernel mode

• fixed time quantum -- 4.3BSD 100 milliseconds

• every 100 ms, kernel invokes roundrobin() function

to schedule next process from same queue

• if higher priority process were runnable,

preferentially scheduled without waiting for roundrobin() function

• if other runnable processes are on a lower priority queue,

the current process continues to run even though

its quantum has expired

• every second, schedcpu() recomputes the priority of each process

o removes the process from the queue

o recomputed process priority

o inserts process into a (possibly) different run queue

• every four ticks, clock interrupt handler computes

priority of the current process

• context switch

o current process (voluntary context switch)

▪ blocks on a resource

▪ exits

o priority recomputation ( priority of another process > current priority

o current process or interrupt handler wakes up a higher priority process

• voluntary context switch

o kernel directly calls swtch() from sleep() or exit() functions

• involuntary switches

o system is in kernel mode, cannot directly preempt the process

o kernel sets runrun flag

o when process is about to return to user mode,

kernel checks the runrun flag

o if runrun flag is set, kernel transfers control to swtch() routine

limitations

• large number of processes ( inefficient to recompute priorities every second

• no way to guarantee CPU resources to specific group of processes

• no guarantee of response time

• little control over priorities by applications

• nonpreemptive kernel -- priority inversion -- higher-priority process may have to wait significant amount of time for lower-priority process to complete

SVR4 Scheduler

scheduling class defines scheduling policy for set of processes

SVR4 scheduling classes

• time-sharing

• real-time

class-independent routines -- implement common services

• context switching

• run queue manipulation

• preemption

• procedural interface

class-dependent functions – pure virtual functions

• priority recomputation

o real-time class – fixed priorities

o time-sharing class -- varies process priorities dynamically

• inheritance

class independent layer

dqactmap (global variable – bitmask 160 cells – one bit for each queue)

|0 |0 |1 |0 |0 |1 |0 |… |

dispq (160 rows)

|160 |

|159 |

|158 |

|157 |

|156 |

|155 |

| |

|… |

|0 |

proc structures

runnable processes

setfrontdq() – insert process at front of queue,

e.g., process preempted before time quantum had expired

setbackdq() – insert process at back of queue, e.g., newly runnable process

dispdeq() – remove process from queue

dispatch latency

delay between the

time that process becomes runnable

and

time that process actually begins running

SVR4 kernel preemption points

preemption point -- places in kernel code where

• all kernel data structures are in stable state

• kernel is about to embark on a lengthy computation

at preemption point

• kernel macro PREEMPT() checks kprunrun flag

• if kprunrun flag is set,

o real-time process is requesting access to run

o PREEMPT() macro calls preempt() kernel routine to preempt process

• bounds the amount of time that real-time process must wait before being scheduled

• preempt() kernel routine

o invokes CL_PREEMPT operation

to perform class-dependent processing

then

o calls swtch() to initiate context switch

• swtch() function

o calls pswtch() to perform machine-independent part of context switch

o pswtch()

▪ clears runrun and kprunrun flags,

▪ selects highest priority runnable process

▪ removes highest priority runnable process from the dispatch queue

▪ updates dqactmap

▪ sets state of process to SONPROC (running on processor)

▪ updates memory management registers to map the u area

▪ updates virtual address translation registers of the new process

o invokes assembly language code to

▪ manipulate register content

▪ flush translation buffers

▪ etc

interface to scheduling classes

generic interface provides virtual functions that are implemented by each scheduling class

rt_classfuncs

rt_init

sys_classfunc

sys_init

ts_classfunc

ts_init

proc structures proc structures

classfuncs structure

vector of pointers to the functions

that implement the class-dependent interface for any class

global class table

contains one entry for each class

• class name

• pointer to an initialization function

• pointer to the classfuncs vector for that class

p_cid index into global class table class ID

p_clfuncs pointer to the classfuncs vector for the process class

p_clproc pointer to class-dependent private data structure

accessing class-dependent functions

• set of macros are used to

o resolve generic interface function calls and to

o invoke the correct class-dependent function

• #define CL_SLEEP(procp, clproc, … ) \

(*(procp)->p_clfuncs->cl_sleep)(clprocp, … )

scheduling classes

determine the

• policies for priority computation

o range of priorities for the processes

o if and under what conditions the priority can change

o size of the time slice each time a process runs

o whether time slice size will depend upon the priority class

e.g., infinite time quantum for real-time tasks

• scheduling of the processes in the class

entry points of the class-dependent interface

CL_TICK clock interrupt handler initiates call

monitors time slice, recomputes priorities,

handles time quantum expiration, etc.

CL_FORK

CL_FORKRET fork() initiates call

CL_FORK initializes child-specific data structure

CL_FORKRET may set runrun, allowing child process to run before parent

CL_ENTERCLASS

CL_EXITCLASS call is initiated when a process enters or exits

a scheduling class; responsible for allocating & deallocating the class-dependent data structures

CL_SLEEP sleep() initiates call; may recompute process priority

CL_WAKEUP wakeprocs() initiates call; places the process on the

appropriate run queue; may set runrun or kprunrun

scheduling class determines the specific actions to be accomplished by a particular function, e.g., clock interrupt handler calls CL_TICK routine for the class to which the process belongs; the routine determines the exact actions to be performed

Time-Sharing Class

• default class

• dynamic priorities

• round-robin scheduling is used for processes with equal priorities

• static dispatcher parameter table controls process priorities and time slices

• time slice depends upon scheduling priority

• ↓Δ priority ( ↑Δ time slice

|index |globpri |quantum |tqexp |slpret |maxwait |lwait |

|0 |0 |100 |0 |10 |5 |10 |

|1 |1 |100 |0 |11 |5 |11 |

|… |… |… |… |… |… |… |

|15 |15 |80 |7 |25 |5 |25 |

|… |… |… |… |… |… |… |

|40 |40 |20 |30 |50 |5 |50 |

|… |… |… |… |… |… |… |

|59 |59 |10 |49 |59 |5 |59 |

ts_globpri global priority

ts_quantum time quantum

ts_tqexp ts_cpupri value when time quantum expires

before using ts_maxwait seconds

ts_slpret ts_cpupri value when returning to user mode after sleeping

ts_maxwait seconds to wait for time quantum expiry

ts_lwait ts_cpupri value when time quantum does not expire

before using ts_maxwait seconds

• event-driven scheduling ,i.e., process priority changed in response

to specific event related to process

• scheduler

o reduces priority each time it uses its entire time slice

o boosts priority if

▪ process blocks on event or resource

▪ takes a long time to use its time slice

struct tsproc class-dependent data

ts_timeleft time remaining in quantum

ts_cpupri system part of priority

ts_upri user part of priority (nice value), range [ -20, +19 ], default 0

ts_umdpri user mode priority == min { max { ts_cpupri + ts_upri, 0 } 59 }

ts_diswait clock time (seconds) since start of quantum

process returns from sleeping in kernel

• priority is kernel priority determined by sleep condition

• return to user mode ( priority restored from ts_umdpri

• ts_umdpri may be changed by priocntl(), but only a superuser increase it

• ts_cpupri is adjusted according to dispatcher parameter table

dispatcher parameter table access

• current ts_cpupri ( ts_tqexp, to_slpret, ts_lwait ( new ts_cpupri

• ts__umdpri ( ts_globpri, ts_quantum, ts_maxwait

Real-Time Class

• priority range [100 – 159]

• fixed priority

• fixed quantum

• bounded dispatch latency

• bounded response time

• real-time processes are scheduled before any kernel services

process executing in kernel mode; real-time process becomes runnable;

real-time process must wait until current process is about to

return to user mode or until it reaches a kernel preemption point

process explicitly makes a priocntl call to change priority or time quantum

only super processes can enter real-time class;

priocntl – specify priority and time quantum

default parameter table

↓Δ priority ( ↑Δ time slice

struct rtproc

• current time quantum

• time remaining in quantum

• current priority

dispatch latency

time between

• process becomes runnable

• process begins to run

response time

time between

• occurrence of event

• response to event

time required by

• interrupt handler to process the event

• dispatch latency

• real-time process to respond to the event

preemption points

divide kernel algorithms into small bounded work units

real-time process becomes runnable (

rt_wakeup() [ class-dependent wakeup processing routine]

• sets kernel flag kprunrun

current process executing kernel code

• reaches preemption point

• checks flag

• initiates context switch to real-time process

latency bounds guaranteed only when

real-time process is highest-priority runnable process on system

• real-time process is running on system

• higher-priority real-time process becomes available

• higher-priority real-time process is preferentially scheduled

• latency calculation of lower-priority process

is recalculated after process regains CPU

priocntl system call -- single process

• change priority class of process

• set ts_upri for time-sharing process

• reset priority and quantum for real-time process

• obtaining current values of scheduling parameters

priocntlset system call -- set of related process

• change priority class of processes

• set ts_upri for time-sharing processes

• reset priority and quantum for real-time processes

• obtaining current values of scheduling parameters

applies to all processes

• in system

• in process group or session

• in scheduling class

• owned by particular user

• having same parent

Analysis

• allows addition of scheduling classes to system

• time-sharing class changes priorities based on events related to process

• fast and scalable dynamic priority calculation

• event-driven scheduling

favors I/O bound and interactive jobs over CPU-bound jobs

• interactive job with large computations ( system will be unresponsive

• total system load and job mix characteristics determines system response

• system administer can alter system behavior by changing dispatcher table settings and rebuilding kernel; such retuning may be required to keep system efficient and responsive

• adding a scheduling class does not require access to kernel source code

o provide implementation of each class-dependent scheduling function

o initialize classfuncs vector to point to these functions

o provide initialization function to perform setup tasks, e.g., allocating internal data structures

o add entry for this class in the class table in a master configuration file, typically located in the master.d directory of the kernel build directory

o entry contains pointers to the initialization function and classfuncs vector

o rebuild kernel

• priocntl system call restricted to superuser

• no good way for a time-sharing class process to switch to a different class

• no provision for deadline-driven scheduling

• code path between preemption points is too long for many time-critical processes

• hard real-time systems require a fully preemptible kernel

• extremely difficult to tune system for a mixed set of applications

• system load varies constantly hence it is not reasonable to require careful manual tuning each time the load changes

Solaris 2.x Scheduling

multithreaded, symmetric multiprocessing operating system

preemptive kernel

fully preemptive kernel

global kernel data structures are protected by synchronization objects

• mutual exclusion locks (mutexes)

• semaphores

implementation of interrupts using special kernel threads

use the standard synchronization primitives of the kernel

• rarely needs to raise the interrupt level to protect critical regions

• few nonpreemptible code segments

• higher-priority process is scheduled as soon as it is runnable

interrupt threads run at the highest priority

scheduling classes can be dynamically loaded

priorities of the interrupt threads are recomputed to ensure that they remain at the highest possible level

blocked thread can only be restarted on the same processor

multiprocessor support

single dispatch queue for all processors

slelected threads can be restricted to a single, specific processor

processors communicate by sending cross-processor interrupts

per-processor data structure -- scheduling variables

cpu_thread thread currently running on this processor

cpu_dispthread thread last selected to run on this processor

cpu_idle idle thread for this processor

cpu_runrun preemption flag used for time-sharing threads

cpu_kprunrun kernel preemption flag set by real-time threads

cpu_chosen_level priority of thread that is going to preempt

the current thread on this processor

Thread T1, priority 120 Processor P1

Thread T2, priority 130 Processor P2

Thread T3, priority 100 Processor P3

Thread T4, priority 132 Processor P4

Thread T5, priority 135 Processor P5

Thread T6, priority 130

Thread T7, priority 115

• event on processor P1 makes thread T6 runnable

• kernel

o places T6 on dispatch queue

o calls cpu_choose() to find processor with the lowest priority thread

cpu_choose()

▪ marks selected processor for preemption

▪ sets cpu_chosen_level to priority T6 value130

▪ send cross-processor interrupt

• event on processor P2 makes thread T7 runnable

• kernel

o places T7 on dispatch queue

o calls cpu_choose() to find processor with the lowest priority thread

cpu_choose()

▪ compares cpu_chosen_level to priority T7 value115

▪ does not mark selected processor for preemption

• processor P3

o handles the interrupt

o preempts thread T3

hidden scheduling

priority inversion

implementation of priority inheritance

limitations of priority inheritance

turnstiles

analysis

Mach Scheduling

[pic][pic]

-----------------------

P

P

P

P

P

P

P

P

P

P

preemption points

• pathname parsing routine lookuppn()

before beginning to parse

each individual pathname component

• open() system call

before creating a file if it does not exist

• memory subsystem

before freeing process pages

time-sharing

system

real-time

$%4KZtš³äêV r x ‚ ž ¯ Ë ä ú ' ................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download