UNIX Unleashed, System Administrator's Edition
- 19 -
Kernel Basics and Configuration
by Dan Wilson, Bill Pierce, and Bill Wood
You're probably asking yourself, "Why would I want to know about this thing called the UNIX kernel? I can add users, run jobs, print files, perform backups and restores, and even start up and shut down the machine when it needs it. Why do I need to know about, and even change, my system's kernel configuration to do my job as a systems administrator?" The simple answer is "You don't need to know much about the UNIX kernel if you know you'll never have to add any hardware or change or tune your system to perform better."
In all of our collective years of experience as systems administrators, about 26, we have rarely, if ever, experienced a situation where it was possible or desirable to operate an Original Equipment Manufacturer (OEM) configured UNIX system. There are just too many different uses for this type of operating system for it to remain unchanged throughout its lifetime. So, assuming you are one of the fortunate individuals who has the title of system administrator, we'll try to provide you with some useful and general information about this all-powerful UNIX process called the kernel. After that, we'll take you through some sample configurations for the following UNIX operating systems:
System V Release 4 (SVR4)
What Is a Kernel?
Let's start by providing a definition for the term kernel. The UNIX kernel is the software that manages user programs' access to the system's hardware and software resources. These resources include CPU time, memory, disk reads and writes, network connections, and interaction with the terminal or GUI. The kernel makes this all possible by controlling and providing access to memory, processor, input/output devices, disk files, and special services to user programs.
The basic UNIX kernel can be broken into four main subsystems:
- Process Management
- Memory Management
- Filesystem Management
- I/O Management
These subsystems should be viewed as separate entities that work in concert to provide services to a program that enable it to do meaningful work. These management subsystems make it possible for a user to access a database via a Web interface, print a report, or do something as complex as managing a 911 emergency system. At any moment in the system, numerous programs may request services from these subsystems. It is the kernel's responsibility to schedule work and, if the process is authorized, grant access to utilize these subsystems. In short, programs interact with the subsystems via software libraries and the system call interface. Refer to your UNIX reference manuals for descriptions of the system calls and libraries supported by your system. Because each of the subsystems is key to enabling a process to perform a useful function, we will cover the basics of each subsystem. We'll start by looking at how the UNIX kernel comes to life by way of the system initialization process.
System initialization (booting) is the first step toward bringing your system into an operational state. A number of machine-dependent and machine-independent steps are gone through before your system is ready to begin servicing users. At system startup, there is nothing running on the Central Processing Unit (CPU). The kernel is a complex program that must have its binary image loaded at a specific address from some type of storage device, usually a disk drive. The boot disk maintains a small restricted area called the boot sector that contains a boot program that loads and initializes the kernel. You'll find that this is a vendor specific procedure that reflects the architectural hardware differences between the various UNIX vendor platforms. When this step is completed, the CPU must jump to a specific memory address and start executing the code at that location. Once the kernel is loaded, it goes through its own hardware and software initialization.
The operating system, or kernel, runs in a privileged manner known as kernel mode. This mode of operation allows the kernel to run without being interfered with by other programs currently in the system. The microprocessor enforces this line of demarcation between user and kernel level mode. With the kernel operating in its own protected address space, it is guaranteed to maintain the integrity of its own data structures and that of other processes. (That's not to say that a privileged process could not inadvertently cause corruption within the kernel.) These data structures are used by the kernel to manage and control itself and any other programs that may be running in the system. If any of these data structures were allowed to be accidentally or intentionally altered, the system could quickly crash. Now that we have learned what a UNIX kernel is and how it is loaded into the system, we are ready to take a look at the four UNIX subsystems: Process Management, Memory Management, Filesystem Management, and I/O Management.
The Process Management subsystem controls the creation, termination, accounting, and scheduling of processes. It also oversees process state transitions and the switching between privileged and nonprivileged modes of execution. The Process Management subsystem also facilitates and manages the complex task of the creation of child processes.
A simple definition of a process is that it is an executing program. It is an entity that requires system resources, and it has a finite lifetime. It has the capability to create other processes via the system call interface. In short, it is an electronic representation of a user's or programmer's desire to accomplish some useful piece of work. A process may appear to the user as if it is the only job running in the machine. This "sleight of hand" is only an illusion. At any one time a processor is only executing a single process.
A process has a definite structure (see Figure 19.1). The kernel views this string of bits as the process image. This binary image consists of both a user and system address space as well as registers that store the process's data during its execution. The user address space is also known as the user image. This is the code that is written by a programmer and compiled into a ".o" object file. An object file contains machine language code and data in a format that the linker program can use to create an executable program.
The user address space consists of five separate areas: Text, Data, Bss, stack, and user area.
Text Segment The first area of a process is its text segment. This area contains the executable program code for the process. This area is shared by other processes that execute the program. It is therefore fixed and unchangeable and is usually swapped out to disk by the system when memory gets too tight.
Data Area The data area contains both the global and static variables used by the program. For example, a programmer may know in advance that a certain data variable needs to be set to a certain value. In the C programming language, it would look like:
int x = 15;
If you were to look at the data segment when the program was loaded, you would see that the variable x was an integer type with an initial value of 15.
Bss Area The bss area, like the data area, holds information for the program's variables. The difference is that the bss area maintains variables that will have their data values assigned to them during the program's execution. For example, a programmer may know that she needs variables to hold certain data that will be input by a user during the execution of the program.
int a,b,c; // a,b and c are variables that hold integer values.
char *ptr; // ptr is an uninitialized character pointer.
The program code can also make calls to library routines like malloc to obtain a chunk of memory and assign it to a variable like the one declared above.
Stack Area The stack area maintains the process's local variables, parameters used in functions, and values returned by functions. For example, a program may contain code that calls another block of code (possibly written by someone else). The calling block of code passes data to the receiving block of code by way of the stack. The called block of code then processes the data and returns data back to the calling code. The stack plays an important role in allowing a process to work with temporary data.
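The areas described above can be made visible from a small C program. The following is a minimal sketch, not a definitive layout: the helper name show_areas() is hypothetical, and the addresses it prints depend entirely on your compiler and operating system. It also shows that memory returned by malloc comes from the heap, while the pointer that refers to it stays in the bss area.

```c
#include <stdio.h>
#include <stdlib.h>

int x = 15;      /* initialized global: stored in the data area */
int a, b, c;     /* uninitialized globals: bss area, zero-filled at load */
char *ptr;       /* uninitialized pointer: bss area, starts out NULL */

/* Hypothetical helper: prints one address from each area.
 * Returns 0 on success, -1 if malloc fails. */
int show_areas(void)
{
    int local = 1;            /* local variable: lives on the stack */

    ptr = malloc(64);         /* ptr itself is in bss; the 64 bytes are heap */
    if (ptr == NULL)
        return -1;

    printf("data  (x)     %p\n", (void *)&x);
    printf("bss   (a)     %p\n", (void *)&a);
    printf("stack (local) %p\n", (void *)&local);
    printf("heap  (*ptr)  %p\n", (void *)ptr);

    free(ptr);
    ptr = NULL;
    return 0;
}
```

Calling show_areas() from main() on most systems prints four addresses that fall into clearly separate ranges, one per area.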
User Area The user area maintains data that is used by the kernel while the process is running. The user area contains the real and effective user identifiers, real and effective group identifiers, current directory, and a list of open files. Sizes of the text, data, and stack areas, as well as pointers to process data structures, are maintained. Other areas that can be considered part of the process's address space are the heap, private shared libraries data, shared libraries, and shared memory. During initial startup and execution of the program, the kernel allocates the memory and creates the necessary structures to maintain these areas.
The user area is used by the kernel to manage the process. This area maintains the majority of the accounting information for a process. It is part of the process address space and is only used by the kernel while the process is executing (see Figure 19.2). When the process is not executing, its user area may be swapped out to disk by the Memory Manager. In most versions of UNIX, the user area is mapped to a fixed virtual memory address. Under HP-UX 10.X, this virtual address is 0x7FFE6000. When the kernel performs a context switch (starts executing a different process) to a new process, it will always map the process's physical address to this virtual address. Since the kernel already has a pointer fixed to this location in memory, it is a simple matter of referencing the current u pointer to be able to begin managing the newly switched in process. The file /usr/include/sys/user.h contains the user area's structure definition for your version of UNIX.
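A user program cannot read the user area directly, but several of the attributes listed above can be queried through standard system calls. The sketch below is illustrative only: describe_process() is a hypothetical helper, and where the kernel actually stores these values varies between UNIX versions.

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: formats the process's identifiers and current
 * directory into buf. Returns 0 on success, -1 on error or truncation. */
int describe_process(char *buf, size_t len)
{
    char cwd[1024];

    if (getcwd(cwd, sizeof cwd) == NULL)   /* current directory */
        return -1;

    int n = snprintf(buf, len,
                     "pid=%ld uid=%ld euid=%ld gid=%ld egid=%ld cwd=%s",
                     (long)getpid(),
                     (long)getuid(),  (long)geteuid(),   /* real, effective user */
                     (long)getgid(),  (long)getegid(),   /* real, effective group */
                     cwd);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```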
Process Table The process table is another important structure used by the kernel to manage the processes in the system. The process table is an array of process structures that the kernel uses to manage the execution of programs. Each table entry defines a process that the kernel has created. The process table is always resident in the computer's memory. This is because the kernel is repeatedly querying and updating this table as it switches processes in and out of the CPU. For those processes that are not currently executing, their process table structures are being updated by the kernel for scheduling purposes. The process structures for your system are defined in /usr/include/sys/proc.h.
Fork Process The kernel provides each process with the tools to duplicate itself for the purpose of creating a new process. This new entity is termed a child process. The fork() system call is invoked by an existing process (termed the parent process) and creates a replica of the parent process. While a process will have one parent, it can spawn many children. The new child process inherits certain attributes from its parent. The fork() system call documentation for HP-UX 10.0 (fork(2) in HP-UX Reference Release 10.0 Volume 3 (of 4) HP 9000 Series Computers) lists the following as being inherited by the child:
Real, effective, and saved user IDs
Real, effective, and saved group IDs
Supplementary group IDs
Process group ID
Signal handling settings
Profiling on/off status
Command name in the accounting record
All attached shared memory segments
Current working directory
File mode creation mask
File size limit
It is important to note how the child process differs from the parent process in order to see how one tells the difference between the parent and the child. When the kernel creates a child process on behalf of the parent, it gives the child a new process identifier. This unique process ID is returned to the parent by the kernel as the return value of fork(), to be used by the parent's code (of which the child also has a copy at this point) to determine the next step the parent process should follow: either continue on with additional work, wait for the child to finish, or terminate. In the child, fork() returns the value 0 (zero). Since the child is still executing its copy of the parent's program at this point, the code simply checks for a return value of 0 (zero) and continues executing that branch of the code. A negative return value means that fork() failed and no child was created. The following short pseudocode segment should help clarify this concept.
start
    print "I am a process"
    print "I will now make a copy of myself"
    pid = fork()
    if pid is greater than 0
        print "I am the parent"
        exit() or wait()
    else if pid = 0
        print "I am the new child"
        print "I am now ready to start a new program"
        exec("new_program")
    else
        fork() failed
The child process can also make another system call that will replace the child's process image with that of a new one. The system call that will completely overlay the child's text, data, and BSS areas with those of a new program is called exec(). This is how the system is able to execute multiple programs. By using both the fork() and the exec() system calls in conjunction with one another, a single process is able to execute numerous programs that perform any number of tasks that the programmer needs to have done. Except for a few system level processes started at boot time, this is how the kernel goes about executing the numerous jobs your system is required to run to support your organization.
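The fork-then-exec pattern can be written in C roughly as follows. This is a minimal sketch rather than the way any particular shell does it: spawn_and_wait() is a hypothetical helper, and it uses execv() and waitpid(), two of several related calls your system provides. Note that fork() is called once and its return value is saved, because every call to fork() creates another process.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: runs the program at path with the given argument
 * vector and returns its exit status, or -1 on failure. */
int spawn_and_wait(const char *path, char *const argv[])
{
    pid_t pid = fork();              /* one call; the result decides the branch */

    if (pid < 0) {                   /* fork() failed, no child exists */
        perror("fork");
        return -1;
    }

    if (pid == 0) {                  /* child: fork() returned 0 */
        execv(path, argv);           /* overlay this image with the new program */
        perror("execv");             /* reached only if execv() failed */
        _exit(127);
    }

    /* parent: pid holds the child's process ID */
    int status;
    if (waitpid(pid, &status, 0) < 0) {
        perror("waitpid");
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, passing "/bin/sh" with the arguments "sh", "-c", "exit 3" makes the function return 3, the exit status the child shell reported to its parent.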
To see how all this looks running on your system, you can use the ps command to view the child processes the system has created. The ps -ef command will show you that the child's parent process ID column (PPID) matches the parent's process ID column (PID). The simplest way to test this is to log on and, at the shell prompt, issue a UNIX command. By doing this you are telling the shell to spawn off a child process that will execute the command (program) you just gave it and to return control to you once the command has finished executing. Another way to experiment with this is to start a program in what is termed the background. This is done by simply appending an ampersand (&) to the end of your command line statement. This has the effect of telling the system to start this new program, but not to wait for it to finish before giving control back to your current shell process. This way you can use the ps -ef command to view your current shell and background processes.
Sample ps -ef output from a system running AIX 4.2
UID        PID  PPID   C    STIME    TTY  TIME CMD
root         1     0   0   Apr 24      -  2:55 /etc/init
root      2060 17606   0 10:38:30      -  0:02 dtwm
root      2486     1   0   Apr 24      -  0:00 /usr/dt/bin/dtlogin -daemon
root      2750  2486   0   Apr 24      -  3:12 /usr/lpp/X11/bin/X -x xv -D /usr/lib/X11//rgb -T -force :0 -auth /var/dt/A:0-yjc2ya
root      2910     1   0   Apr 24      -  0:00 /usr/sbin/srcmstr
root      3176  2486   0   Apr 25      -  0:00 dtlogin <:0> -daemon
root      3794     1   0   Apr 25      -  0:00 /usr/ns-home/admserv/ns-admin -d /usr/ns-home/admserv .
root      3854  2910   0   Apr 24      -  0:00 /usr/lpp/info/bin/infod
root      4192  6550   0   Apr 24      -  0:00 rpc.ttdbserver 100083 1
root      4364     1   0   Apr 24      -  2:59 /usr/sbin/syncd 60
root      4628     1   0   Apr 24      -  0:00 /usr/lib/errdemon
root      5066     1   0   Apr 24      -  0:03 /usr/sbin/cron
root      5236  2910   0   Apr 24      -  0:00 /usr/sbin/syslogd
root      5526  2910   0   Apr 24      -  0:00 /usr/sbin/biod 6
root      6014  2910   0   Apr 24      -  0:00 sendmail: accepting connections
root      6284  2910   0   Apr 24      -  0:00 /usr/sbin/portmap
root      6550  2910   0   Apr 24      -  0:00 /usr/sbin/inetd
root      6814  2910   0   Apr 24      -  9:04 /usr/sbin/snmpd
root      7080  2910   0   Apr 24      -  0:00 /usr/sbin/dpid2
root      7390     1   0   Apr 24      -  0:00 /usr/sbin/uprintfd
root      7626     1   0   Apr 24      -  0:00 /usr/OV/bin/ntl_reader 0 1 1 1 1000 /usr/OV/log/nettl
root      8140  7626   0   Apr 24      -  0:00 netfmt -CF
root      8410  8662   0   Apr 24      -  0:00 nvsecd -O
root      8662     1   0   Apr 24      -  0:15 ovspmd
root      8926  8662   0   Apr 24      -  0:19 ovwdb -O -n5000 -t
root      9184  8662   0   Apr 24      -  0:04 pmd -Au -At -Mu -Mt -m
root      9442  8662   0   Apr 24      -  0:32 trapgend -f
root      9700  8662   0   Apr 24      -  0:01 mgragentd -f
root      9958  8662   0   Apr 24      -  0:00 nvpagerd
root     10216  8662   0   Apr 24      -  0:00 nvlockd
root     10478  8662   0   Apr 24      -  0:05 trapd
root     10736  8662   0   Apr 24      -  0:04 orsd
root     11004  8662   0   Apr 24      -  0:31 ovtopmd -O -t
root     11254  8662   0   Apr 24      -  0:00 nvcold -O
root     11518  8662   0   Apr 24      -  0:03 ovactiond
root     11520  8662   0   Apr 24      -  0:05 nvcorrd
root     11780  8662   0   Apr 24      -  0:00 actionsvr
root     12038  8662   0   Apr 24      -  0:00 nvserverd
root     12310  8662   0   Apr 24      -  0:04 ovelmd
root     12558  8662   0   Apr 24      -  4:28 netmon -P
root     12816  8662   0   Apr 24      -  0:04 ovesmd
root     13074  8662   0   Apr 24      -  0:00 snmpCollect
root     13442  2910   0   Apr 24      -  0:00 /usr/lib/netsvc/yp/ypbind
root     13738  5526   0   Apr 24      -  0:00 /usr/sbin/biod 6
root     13992  5526   0   Apr 24      -  0:00 /usr/sbin/biod 6
root     14252  5526   0   Apr 24      -  0:00 /usr/sbin/biod 6
root     14510  5526   0   Apr 24      -  0:00 /usr/sbin/biod 6
root     14768  5526   0   Apr 24      -  0:00 /usr/sbin/biod 6
root     15028  2910   0   Apr 24      -  0:00 /usr/sbin/rpc.statd
root     15210  6550   0   Apr 24      -  0:00 rpc.ttdbserver 100083 1
root     15580  2910   0   Apr 24      -  0:00 /usr/sbin/writesrv
root     15816  2910   0   Apr 24      -  0:00 /usr/sbin/rpc.lockd
root     16338  2910   0   Apr 24      -  0:00 /usr/sbin/qdaemon
root     16520  2060   0 13:44:46      -  0:00 /usr/dt/bin/dtexec -open 0 -ttprocid 2.pOtBq 01 17916 1342177279 1 0 0 10.19.12.115 3_101_1 /usr/dt/bin/dtterm
root     16640     1   0   Apr 24   lft0  0:00 /usr/sbin/getty /dev/console
root     17378     1   0   Apr 24      -  0:13 /usr/bin/pmd
root     17606  3176   0 10:38:27      -  0:00 /usr/dt/bin/dtsession
root     17916     1   0 10:38:28      -  0:00 /usr/dt/bin/ttsession -s
root     18168     1   0   Apr 24      -  0:00 /usr/lpp/diagnostics/bin/diagd
nobody   18562 19324   0   Apr 25      -  0:32 ./ns-httpd -d /usr/ns-home/httpd-supp_aix/config
root     18828 22410   0 13:44:47  pts/2  0:00 /bin/ksh
root     19100 21146   0 13:45:38  pts/3  0:00 vi hp.c
nobody   19324     1   0   Apr 25      -  0:00 ./ns-httpd -d /usr/ns-home/httpd-supp_aix/config
root     19576  6550   0 13:43:38      -  0:00 telnetd
nobody   19840 19324   0   Apr 25      -  0:33 ./ns-httpd -d /usr/ns-home/httpd-supp_aix/config
root     19982 17606   0 10:38:32      -  0:03 dtfile
nobody   20356 19324   0   Apr 25      -  0:33 ./ns-httpd -d /usr/ns-home/httpd-supp_aix/config
root     20694 20948   0   Apr 25      -  0:00 /usr/ns-home/admserv/ns-admin -d /usr/ns-home/admserv .
root     20948  3794   0   Apr 25      -  0:01 /usr/ns-home/admserv/ns-admin -d /usr/ns-home/admserv .
root     21146 23192   0 13:45:32  pts/3  0:00 /bin/ksh
nobody   21374 19324   0   Apr 25      -  0:00 ./ns-httpd -d /usr/ns-home/httpd-supp_aix/config
root     21654  2060   0 13:45:31      -  0:00 /usr/dt/bin/dtexec -open 0 -ttprocid 2.pOtBq 01 17916 1342177279 1 0 0 10.19.12.115 3_102_1 /usr/dt/bin/dtterm
root     21882 19576   0 13:43:39  pts/0  0:00 -ksh
root     22038 19982   0 10:38:37      -  0:04 dtfile
root     22410 16520   0 13:44:47      -  0:00 /usr/dt/bin/dtterm
root     22950 21882   8 13:46:06  pts/0  0:00 ps -ef
root     23192 21654   0 13:45:31      -  0:00 /usr/dt/bin/dtterm
root     23438 18828   0 13:45:03  pts/2  0:00 vi aix.c
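The PID and PPID relationship visible in the ps -ef listing can also be checked from inside a program. The following is a small sketch under the assumption that the parent stays alive while the child runs (it does here, because the parent waits); child_sees_parent() is a hypothetical helper name.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the child's parent process ID
 * (what ps shows as PPID) equals the parent's own PID, 0 if not,
 * and -1 if fork() or waitpid() fails. */
int child_sees_parent(void)
{
    pid_t parent = getpid();         /* the PID column for this process */
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0)                    /* child: compare its PPID to the saved PID */
        _exit(getppid() == parent ? 0 : 1);

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;

    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 1 : 0;
}
```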
Process Run States
A process moves between several states during its lifetime, although a process can only be in one state at any one time. Certain events, such as system interrupts, blocking of resources, or software traps, will cause a process to change its run state. The kernel maintains queues in memory to which it assigns a process based upon that process's state. It keeps track of the process by its process ID.
UNIX version System V Release 4 (SVR4) recognizes the following process run states:
- SIDL This is the state right after a process has issued a fork() system call. A process image has yet to be copied into memory.
- SRUN The process is ready to run and is waiting to be executed by the CPU.
- SONPROC The process is currently being executed by the CPU.
- SSLEEP The process is blocking on an event or resource.
- SZOMB The process has terminated and is waiting on either its parent or the init process to allow it to completely exit.
- SXBRK The process has been switched out so that another process can be executed.
- SSTOP The process is stopped.
When a process first starts, the kernel allocates it a slot in the process table and places the process in the SIDL state. Once the process has the resources it needs to run, the kernel places it onto the run queue. The process is now in the SRUN state awaiting its turn in the CPU. Once its turn comes for the process to be switched into the CPU, the kernel will tag it as being in the SONPROC state. In this state, the process will execute in either user or kernel mode. User mode is where the process is executing nonprivileged code from the user's compiled program. Kernel mode is where kernel code is being executed from the kernel's privileged address space via a system call.
At some point the process is switched out of the CPU because it has either been signaled to do so (for instance, the user issues a stop signal--SSTOP state) or the process has exceeded its quota of allowable CPU time and the kernel needs the CPU to do some work for another process. The act of switching the focus of the CPU from one process to another is called a context switch. When this occurs, the process enters what is known as the SXBRK state. If the process still needs to run and is waiting for another system resource, such as disk services, it will enter the SSLEEP state until the resource is available and the kernel wakes the process up and places it on the SRUN queue. When the process has finally completed its work and is ready to terminate, it enters the SZOMB state. We have seen the fundamentals of what states a process can exist in and how it moves through them. Let's now learn how a kernel schedules a process to run.
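The state changes described above can be summarized as a small transition table. The sketch below is a simplified model of this chapter's description, not kernel source: the enum names carry a P_ prefix to mark them as hypothetical, and the transitions out of SSTOP and SXBRK back to SRUN are assumptions, since the text does not spell them out.

```c
/* Process run states as named in the text (P_ prefix is hypothetical). */
enum pstate { P_SIDL, P_SRUN, P_SONPROC, P_SSLEEP, P_SSTOP, P_SXBRK, P_SZOMB };

/* Returns 1 if the simplified model allows moving from one state to
 * another, 0 otherwise. */
int can_move(enum pstate from, enum pstate to)
{
    switch (from) {
    case P_SIDL:                     /* has its resources: onto the run queue */
        return to == P_SRUN;
    case P_SRUN:                     /* its turn comes: switched into the CPU */
        return to == P_SONPROC;
    case P_SONPROC:                  /* stopped, switched out, blocked, or done */
        return to == P_SSTOP || to == P_SXBRK ||
               to == P_SSLEEP || to == P_SZOMB;
    case P_SSLEEP:                   /* resource available: woken up */
        return to == P_SRUN;
    case P_SSTOP:                    /* assumption: resumes via the run queue */
    case P_SXBRK:                    /* assumption: returns to the run queue */
        return to == P_SRUN;
    case P_SZOMB:                    /* terminated: no further transitions */
        return 0;
    }
    return 0;
}
```

In this model a newly forked process cannot jump straight onto the CPU: can_move(P_SIDL, P_SONPROC) is 0, because the process must pass through the run queue first.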
Most modern versions of UNIX (for instance, SVR4 and Solaris 2.x) are classified as preemptive operating systems. They are capable of interrupting an executing process and "freezing" it so that the CPU can service a different process. This obviously has the advantage of fairly allocating the system's resources to all the processes in the system. This is one goal of the many systems architects and programmers who design and write schedulers. The disadvantages are that not all processes are equal and that complex algorithms must be designed and implemented as kernel code in order to maintain the illusion that each user process is running as if it were the only job in the system. The kernel maintains this balance by placing processes in the various priority queues or run queues and apportioning each process's CPU time-slice based on its priority class (Real-Time versus Timeshare).
Universities and UNIX system vendors have conducted extensive studies on how best to design and build an optimal scheduler. Each vendor's flavor of UNIX--4.4BSD, SVR4, HP-UX, Solaris, and AIX, to name a few--attempts to implement this research to provide a scheduler that best balances its customers' needs. The systems administrator must realize that there are limits to the scheduler's ability to service batch, real-time, and interactive users in the same environment. Once the system becomes overloaded, it will become necessary for some jobs to suffer at the expense of others. This is an extremely important issue to both users and systems administrators alike. The reader should refer to Chapter 22, "Systems Performance and Tuning," to gain a better understanding of what he can do to balance and tune his system.
Random access memory (RAM) is a critical component in any computer system. It's the one component that always seems to be in short supply on most systems. Unfortunately, most organizations' budgets don't allow for the purchase of all the memory that their technical staff feel is necessary to support all their projects. Luckily, UNIX allows us to execute all sorts of programs even when there does not appear, at first glance, to be enough physical memory. This comes in very handy when the system is required to support a user community that needs to execute an organization's custom and commercial software to gain access to its data.
Memory chips are high-speed electronic devices that plug directly into your computer. Main memory is also called core memory by some technicians. Ever heard of a core dump? (Writing out main memory to a storage device for post-dump analysis.) Usually it is caused by a program or system crash or failure. An important aspect of memory chips is that they can store data at specific locations called addresses. This makes it quite convenient for another hardware device called the central processing unit (CPU) to access these locations to run your programs. The kernel uses a paging and segmentation arrangement to organize process memory. This is where the memory management subsystem plays a significant role. Memory management can be defined as the efficient managing and sharing of the system's memory resources by the kernel and user processes.
Memory management follows certain rules that manage both physical and virtual memory. Since we already have an idea of what a physical memory chip or card is, we will provide a definition of virtual memory. Virtual memory is where the addressable memory locations that a process can be mapped into are independent of the physical address space of the CPU. Generally speaking, a process can exceed the physical address space/size of main memory and still load and execute.
The systems administrator should be aware that just because she has a fixed amount of physical memory, she should not expect it all to be available to execute user programs. The kernel is always resident in main memory, and depending upon the kernel's configuration (tunables such as kernel tables, daemons, device drivers loaded, and so on), the amount left over can be classified as available memory. It is important for the systems administrator to know how much available memory the system has to work with when supporting her environment. Most systems display memory statistics during boot time. If your kernel is larger than it needs to be to support your environment, consider reconfiguring a smaller kernel to free up resources.
We learned before that a process has a well-defined structure and has certain specific control data structures that the kernel uses to manage the process during its system lifetime. One of the more important data structures that the kernel uses is the virtual address space (vas in HP-UX and as in SVR4). For a more detailed description of the layout of these structures, look at the vas.h or as.h header files under /usr/include on your system.
A virtual address space exists for each process and is used by the process to keep track of the logical segments or regions that point to specific segments of the process's text (code), data, u_area, user and kernel stacks, shared memory, shared libraries, and memory mapped files. Per-process regions protect and maintain the number of pages mapped into the segments. Each segment has a virtual address space segment as well. Multiple programs can share the process's text segment. The data segment holds the process's initialized and uninitialized (BSS) data. These areas can change size as the program executes.
The u_area and kernel stack contain information used by the kernel, and are a fixed size. The user stack is contained in the u_area; however, its size will fluctuate during its execution. Memory mapped files allow programmers to bring files into memory and work with them while in memory. Obviously, there is a limit to the size of the file you can load into memory (check your system documentation). Shared memory segments are usually set up and used by a process to share data with other processes. For example, a programmer may want to be able to pass messages to other programs by writing to a shared memory segment and having the receiving programs attach to that specific shared memory segment and read the message. Shared libraries allow programs to link to commonly used code at runtime. Shared libraries reduce the amount of memory needed by executing programs because only one copy of the code is required to be in memory. Each program will access the code at that memory location when necessary.
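The message-passing use of shared memory described above can be sketched with the System V calls shmget(), shmat(), shmdt(), and shmctl(). This is a minimal illustration under several assumptions: pass_message() and SEG_SIZE are hypothetical names, the writer is a forked child that inherits the attached segment, and the parent simply waits for the child instead of using a proper synchronization mechanism such as a semaphore.

```c
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define SEG_SIZE 128                 /* assumed segment size in bytes */

/* Hypothetical helper: a child writes msg into a shared memory segment
 * and the parent copies it into out. Returns 0 on success, -1 on error. */
int pass_message(const char *msg, char *out, size_t outlen)
{
    int id = shmget(IPC_PRIVATE, SEG_SIZE, IPC_CREAT | 0600);
    if (id < 0)
        return -1;

    char *seg = shmat(id, NULL, 0);  /* attach the segment to this process */
    if (seg == (void *)-1) {
        shmctl(id, IPC_RMID, NULL);
        return -1;
    }

    pid_t pid = fork();              /* the child inherits the attachment */
    if (pid == 0) {
        snprintf(seg, SEG_SIZE, "%s", msg);   /* writer side */
        _exit(0);
    }

    int rc = -1;
    if (pid > 0 && waitpid(pid, NULL, 0) == pid && outlen > 0) {
        snprintf(out, outlen, "%s", seg);     /* reader side */
        rc = 0;
    }

    shmdt(seg);                      /* detach, then remove the segment */
    shmctl(id, IPC_RMID, NULL);
    return rc;
}
```

Unrelated processes would instead agree on a key (for example, one produced by ftok()) so that each can attach to the same segment.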
When a programmer writes and compiles a program, the compiler generates the object file from the source code. The linker program (ld) links the object file with the appropriate libraries and, if necessary, other object files to generate the executable program. The executable program contains virtual addresses that are converted into physical memory addresses when the program is run. This address translation must occur before the CPU can reference the actual code.
When the program starts to run, the kernel sets up its data structures (proc, virtual address space, per-process region) and begins to execute the process in user mode. Eventually, the process will access a page that's not in main memory (for instance, the pages in its working set are not in main memory). This is called a page fault. When this occurs, the kernel puts the process to sleep, switches from user mode to kernel mode, and attempts to load the page that the process was requesting. The kernel searches for the page by locating the per-process region where the virtual address is located. It then goes to the segment's (text, data, or other) per-process region to find the actual region that contains the information necessary to read in the page.
The kernel must now find a free page in which to load the process's requested page. If there are no free pages, the kernel must either page or swap out pages to make room for the new page request. Once there is some free space, the kernel pages in a block of pages from disk. This block contains the requested page plus additional pages that may be used by the process. Finally the kernel establishes the permissions and sets the protections for the newly loaded pages. The kernel wakes the process and switches back to user mode so the process can begin executing using the requested page. Pages are not brought into memory until the process requests them for execution. This is why the system is referred to as a demand paging system.
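A toy simulation can make the fault-then-load cycle concrete. The sketch below is not how any real kernel is written: it assumes a fixed pool of NFRAMES page frames, loads one page per fault rather than a block of pages, and picks the oldest loaded page as the victim (first-in, first-out), which is only one of many possible replacement policies. The function name count_page_faults() is hypothetical.

```c
#include <stddef.h>

#define NFRAMES 3                    /* assumed number of physical page frames */

/* Counts the page faults caused by a sequence of page references.
 * A page is loaded only when it is referenced and not already resident. */
int count_page_faults(const int *refs, size_t n)
{
    int frames[NFRAMES];             /* page number held by each frame */
    size_t used = 0;                 /* frames filled so far */
    size_t next_victim = 0;          /* oldest frame, replaced first (FIFO) */
    int faults = 0;

    for (size_t i = 0; i < n; i++) {
        int resident = 0;
        for (size_t f = 0; f < used; f++) {
            if (frames[f] == refs[i]) {
                resident = 1;        /* page already in memory: no fault */
                break;
            }
        }
        if (resident)
            continue;

        faults++;                    /* page fault: the page must be loaded */
        if (used < NFRAMES) {
            frames[used++] = refs[i];            /* a free frame exists */
        } else {
            frames[next_victim] = refs[i];       /* page out the oldest page */
            next_victim = (next_victim + 1) % NFRAMES;
        }
    }
    return faults;
}
```

With three frames, the reference sequence 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5 produces nine faults, while repeated references to a single page fault only once.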
The memory management unit is a hardware component that handles the translation of virtual address spaces to physical memory addresses. The memory management unit also prevents a process from accessing another process's address space unless it is permitted to do so (protection fault). Memory is thus protected at the page level. The Translation Lookaside Buffer (TLB) is a hardware cache that maintains the most recently used virtual address space to physical address translations. It is controlled by the memory management unit to reduce the number of address translations that occur on the system.
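The translation step can be illustrated in software, although in a real system it is performed by hardware. The sketch below assumes a 4096-byte page, a 16-entry page table, and a 4-entry direct-mapped TLB; all of these sizes, and the names map_page() and translate(), are hypothetical. It omits protection checks and does not invalidate TLB entries when a mapping changes.

```c
#include <stdint.h>

#define PAGE_SIZE   4096u            /* assumed page size in bytes */
#define NUM_PAGES   16u              /* assumed size of the page table */
#define TLB_ENTRIES 4u               /* assumed size of the TLB */

static uint32_t page_table[NUM_PAGES];        /* virtual page -> physical frame */

struct tlb_entry { int valid; uint32_t vpn; uint32_t frame; };
static struct tlb_entry tlb[TLB_ENTRIES];
static unsigned tlb_hits, tlb_misses;

/* Records that virtual page vpn is held in physical frame `frame`. */
void map_page(uint32_t vpn, uint32_t frame)
{
    page_table[vpn % NUM_PAGES] = frame;
}

/* Translates a virtual address to a physical address, consulting the
 * TLB first and falling back to the page table on a miss. */
uint32_t translate(uint32_t vaddr)
{
    uint32_t vpn    = (vaddr / PAGE_SIZE) % NUM_PAGES;  /* virtual page number */
    uint32_t offset = vaddr % PAGE_SIZE;                /* byte within the page */
    struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];

    if (e->valid && e->vpn == vpn) {
        tlb_hits++;                  /* recent translation found in the cache */
    } else {
        tlb_misses++;                /* look it up and remember it */
        e->valid = 1;
        e->vpn   = vpn;
        e->frame = page_table[vpn];
    }
    return e->frame * PAGE_SIZE + offset;
}
```

The offset within the page is carried over unchanged; only the page number is replaced by a frame number.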
Input and Output Management
The simplest definition of input/output is the control of data between hardware devices and software. A systems administrator is concerned with I/O at two separate levels. The first level is concerned with I/O between user address space and kernel address space; the second level is concerned with I/O between kernel address space and physical hardware devices. When data is written to disk, the first level of the I/O subsystem copies the data from user space to kernel space. Data is then passed from the kernel address space to the second level of the I/O subsystem. This is when the physical hardware device activates its own I/O subsystems, which determine the best location for the data on the available disks.
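From a program's point of view, the first level is the write() system call, which copies data from the user's buffer into kernel address space; the second level happens later, when the kernel hands the data to the device. The following sketch shows both steps from user space, with fsync() used to ask the kernel to push the buffered data out to the disk. The helper name save_buffer() is hypothetical, and the sketch does not retry after an interrupted call.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: writes len bytes of data to path.
 * Returns the number of bytes written, or -1 on error. */
ssize_t save_buffer(const char *path, const char *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    size_t done = 0;
    while (done < len) {             /* write() may accept fewer bytes than asked */
        ssize_t n = write(fd, data + done, len - done);  /* user -> kernel copy */
        if (n < 0) {
            close(fd);
            return -1;
        }
        done += (size_t)n;
    }

    if (fsync(fd) < 0) {             /* kernel -> device: flush to the disk */
        close(fd);
        return -1;
    }
    if (close(fd) < 0)
        return -1;

    return (ssize_t)done;
}
```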
The OEM (Original Equipment Manufacturer) UNIX configuration is satisfactory for many work environments, but does not take into consideration the network traffic or the behavior of specific applications on your system. Systems administrators find that they need to reconfigure the system's I/O to meet the expectations of the users and the demands of their applications. You should use the default configuration as a starting point and, as experience is gained with the demands on the system resources, tune the system to achieve peak I/O performance.
UNIX comes with a wide variety of tools that monitor system performance. Learning to use these tools will help you determine whether a performance problem is hardware or software related. Using these tools will help you determine whether a problem is poor user training, application tuning, system maintenance, or system configuration. sar, iostat, and monitor are some of your best basic I/O performance monitoring tools.
The memory subsystem comes into play when programs start requesting more memory than the physical RAM installed on your system. Once this point is reached, UNIX will start I/O processes called paging and swapping. This is when kernel procedures start moving pages of memory out to the paging or swap areas defined on your hard drives. (Swap files in Microsoft Windows on a PC work in much the same way.) All UNIX systems use these procedures to free physical memory for reuse by other programs. The drawback to this is that once paging and swapping have started, system performance decreases rapidly. The system will continue using these techniques until demands for physical RAM drop to the amount that is installed on your system. There are only two states for memory performance on your system: Either you have enough RAM, or you don't and performance drops through the floor.
Memory performance problems are simple to diagnose; either you have enough memory or your system is thrashing. Computer systems start thrashing when more resources are dedicated to moving memory between RAM and the hard drives (paging and swapping) than to running programs. Performance decreases as the CPUs and all subsystems become dedicated to trying to free physical RAM for themselves and other processes.
This summary doesn't do justice, however, to the complexity of memory management nor does it help you to deal with problems as they arise. To provide the background to understand these problems, we need to discuss virtual memory activity in more detail.
We have been discussing two memory processes: paging and swapping. These two processes help UNIX fulfill memory requirements for all processes. UNIX systems employ both paging and swapping to reduce I/O traffic and exercise better control over the system's total aggregate memory. Keep in mind that paging and swapping are temporary measures; they cannot fix the underlying problem of low physical RAM.
Swapping moves entire idle processes to disk for reclamation of memory, and is a normal procedure for the UNIX operating system. When the idle process is called by the system again, it will copy the memory image from the disk swap area back into RAM.
On systems performing paging and swapping, swapping occurs in two separate situations. Swapping is often a part of normal housekeeping. Jobs that sleep for more than 20 seconds are considered idle and may be swapped out at any time. Swapping is also an emergency technique used to combat extreme memory shortages. Remember our definition of thrashing; this is when a system is in trouble. Some system administrators sum this up very well by calling it "desperation swapping."
Paging, on the other hand, moves individual pages (or pieces) of processes to disk and reclaims the freed memory, with most of the process remaining loaded in memory. Paging employs an algorithm to monitor usage of the pages, to leave recently accessed pages in physical memory, and to move idle pages into disk storage. This allows for optimum performance of I/O and reduces the amount of I/O traffic that swapping would normally require.
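To make the "enough RAM or performance drops through the floor" behavior concrete, here is a minimal simulation of a paging policy that keeps recently accessed pages in memory. We assume a strict least-recently-used (LRU) replacement rule and a made-up workload that cycles over five pages; real kernels use cheaper approximations of LRU, so treat this as an illustration of the effect rather than of any particular UNIX implementation.

```python
from collections import OrderedDict


def count_page_faults(reference_string, frames):
    """Count page faults for an LRU policy with a fixed number of frames."""
    resident = OrderedDict()  # pages currently in physical memory
    faults = 0
    for page in reference_string:
        if page in resident:
            resident.move_to_end(page)  # mark as most recently used
        else:
            faults += 1  # page must be brought in from disk
            if len(resident) >= frames:
                resident.popitem(last=False)  # evict least recently used
            resident[page] = True
    return faults


# Hypothetical workload: a process cycling over five pages, 20 times.
workload = [0, 1, 2, 3, 4] * 20

for frames in (5, 4):
    print(frames, "frames:", count_page_faults(workload, frames), "faults")
```

With five frames, only the first touch of each page faults. With four frames, this access pattern evicts each page just before it is needed again, so every reference goes to disk. Removing one frame is the difference between normal operation and thrashing.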
I/O performance management, like all administrative tasks, is a continual process. Generating performance statistics on a routine basis will assist in identifying and correcting potential problems before they have an impact on your system or, worst case, your users. UNIX offers basic system usage statistics packages that will assist you in automatically collecting and examining usage statistics.
You will find the load on the system will increase rapidly as new jobs are submitted and resources are not freed quickly enough. Performance drops as the disks become I/O bound trying to satisfy paging and swapping calls. Memory overload quickly forces a system to become I/O and CPU bound. However, once you identify the problem to be memory, you will find adding RAM to be cheaper than adding another CPU to your system.
Hard Drive I/O
Some simple configuration considerations will help you obtain better I/O performance regardless of your system's usage patterns. The factors to consider are the arrangement of your disks and disk controllers and the speed of the hard drives.
The best policy is to spread the disk workload as evenly as possible across all controllers. If you have a large system with multiple I/O back planes, split your disk drives evenly among the buses. Most disk controllers allow you to daisy chain several disk drives from the same controller channel. For the absolute best performance, spread the disk drives evenly over all controllers. This is particularly important if your system has many users who all need to make large sequential transfers.
Small Computer System Interface (SCSI) devices are those that adhere to the American National Standards Institute (ANSI) standards for connecting intelligent interface peripherals to computers. The SCSI bus is a daisy-chained arrangement originating at a SCSI adapter card that interconnects several SCSI controllers. Each controller interfaces its device to the bus and has a different SCSI address that is set on the controller. This address determines the priority that the SCSI device is given, with the highest address having the highest priority. When you load balance a system, always place more frequently accessed data on the hard drives with the highest SCSI address. Data at the top of the channel takes less access time, and load balancing increases the availability of that data to the system.
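One way to reason about "spreading the workload evenly" is to treat it as a balancing problem: given a rough I/O load figure for each disk, assign the busiest disks first, each to whichever controller is currently least loaded. The sketch below uses that greedy rule. The disk names, controller names, and load figures are hypothetical, and in practice you would take the load numbers from a tool such as sar or iostat.

```python
import heapq


def balance_disks(disk_loads, controllers):
    """Greedily assign disks to controllers, busiest disk first.

    disk_loads maps a disk name to an estimated I/O load (any unit).
    Returns {controller: (total_load, [disk names])}.
    """
    # Heap entries are (total load so far, controller name, assigned disks).
    heap = [(0, name, []) for name in controllers]
    heapq.heapify(heap)
    by_load = sorted(disk_loads.items(), key=lambda item: item[1], reverse=True)
    for disk, load in by_load:
        total, name, disks = heapq.heappop(heap)  # least-loaded controller
        disks.append(disk)
        heapq.heappush(heap, (total + load, name, disks))
    return {name: (total, disks) for total, name, disks in heap}


# Hypothetical transfers-per-second figures for six disks.
loads = {"d0": 120, "d1": 80, "d2": 60, "d3": 40, "d4": 30, "d5": 30}
for controller, (total, disks) in sorted(balance_disks(loads, ["c0", "c1"]).items()):
    print(controller, total, disks)
```

The greedy rule does not guarantee a perfect split, but it keeps any single controller from carrying most of the traffic.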
After deciding the best placement of the controllers and hard drives on your system, you have one last item for increasing system performance. When adding new disks, remember that the seek time of the disk is the single most important indicator of its performance. Different processes will be accessing the disk at the same time as they are accessing different files and reading from different areas at one time.
The seek time of a disk is the measure of time required to move the disk drive's heads from one track to another. Seek time is affected by how far the heads have to move from one track to another. Moving the heads from track to track takes less time than shifting those same drive heads across the entire disk. You will find that seek time is actually a nonlinear measurement, taking into account that the heads have to accelerate, decelerate, and then stabilize in their new position. This is why all disks will typically specify a minimum, average, and maximum seek time. The ratio of time spent seeking between tracks to time spent transferring data is usually at least 10 to 1. The lower the aggregate seek time, the greater your performance gain or improvement.
One problem with allowing for paging and swap files to be added to the hard disks is that some system administrators try to use this feature to add more RAM to a system. It does not work that way. The most you could hope for is to temporarily mask the underlying cause, low physical memory. There is one thing that a systems administrator can do to increase performance, and that is to accurately balance the disk drives.
Don't overlook the obvious upgrade path for I/O performance: tuning. If you understand how your system is configured and how you intend to use it, you will be much less likely to buy equipment you don't need or that won't solve your problem.
Filesystem Management Subsystem
In discussing "Kernel Basics and Configuration," a very important topic, filesystems, must be considered. This discussion deals with the basic structural method of long-term storage of system and user data. Filesystems and the parameters that are used to create them have a direct impact on performance, system resource utilization, and kernel efficiency in dealing with Input/Output (I/O).
There are several important filesystem types that are supported by different operating systems (OS), many of which are no longer used for new implementations. The reasons they are not used vary from being inefficient to just being outdated. However, many operating systems still support their filesystem structure so that compatibility doesn't become an issue for portability.
This support of other filesystem structures plays a large role in allowing companies to move between OS and computer types with little impact to their applications.
The following is a list of filesystem types that are supported by specific operating systems. The list will only cover local, network, and CD-ROM filesystems.
Note: NFS stands for Network File System
Since filesystems are stored on disk, the systems administrator should look at basic disk hardware architecture before proceeding with specifics of filesystems. A disk is physically divided into tracks, sectors, and blocks. A good representation of a sector would be a piece of pie removed from the pie pan. Therefore, as with a pie, a disk is composed of several sectors (see Figure 19.3). Tracks are concentric rings going from the outside perimeter to the center of the disk, with each track becoming smaller as it approaches the center of the disk. Tracks on a disk are concentric, therefore they never touch each other. The area of the track that lies between the edges of the sector is termed a block, and the block is the area where data is stored. Disk devices typically use a block mode accessing scheme when transferring data between the file management subsystem and the I/O subsystem. The block size is usually 512- or 1024-byte fixed-length blocks, depending upon the scheme used by the operating system. A programmer may access files using either block or character device files.
You now have a basic understanding of the terms tracks, sectors, and blocks as they apply to a single platter disk drive. But most disks today are composed of several platters, with each platter having its own read/write head. With this in mind, we have a new term: cylinder (see Figure 19.4). Let's make the assumption that we have a disk drive that has six platters so, logically, it must have six read/write heads. When read/write head 1 is on track 10 of platter 1, then heads 2 through 6 are on track 10 of their respective platters. You now have a cylinder. A cylinder is collectively the same track on each platter of a multi-platter disk.
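The relationship between cylinders, heads, and blocks can be expressed as simple arithmetic. The sketch below converts a cylinder/head/sector position to a linear block number and back, using the six-head drive from the example. The figure of 32 blocks per track is an assumption chosen only for illustration, and we number cylinders and heads from 0 and sectors from 1, which is a common convention but not one the text above prescribes.

```python
HEADS = 6               # six platters, one head each, as in the example
BLOCKS_PER_TRACK = 32   # assumed geometry, for illustration only


def chs_to_block(cylinder, head, sector):
    """Linear block number for a cylinder/head/sector position."""
    if not 0 <= head < HEADS:
        raise ValueError("head out of range")
    if not 1 <= sector <= BLOCKS_PER_TRACK:
        raise ValueError("sector out of range")
    track_index = cylinder * HEADS + head  # tracks are numbered cylinder by cylinder
    return track_index * BLOCKS_PER_TRACK + (sector - 1)


def block_to_chs(block):
    """Inverse of chs_to_block."""
    track_index, sector_index = divmod(block, BLOCKS_PER_TRACK)
    cylinder, head = divmod(track_index, HEADS)
    return cylinder, head, sector_index + 1


first = chs_to_block(10, 0, 1)
last = chs_to_block(10, HEADS - 1, BLOCKS_PER_TRACK)
print("cylinder 10 spans blocks", first, "through", last)
```

Every block in that range can be reached without moving the heads, which is why filesystems try to keep related data within the same cylinder.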
Filesystem Concepts and Format
The term filesystem has two connotations. The first is the complete hierarchical filesystem tree. The second is the collection place on disk device(s) for files. Visualize the filesystem as consisting of a single node at the highest level (ROOT) and all other nodes descending from the root node in a tree-like fashion (see Figure 19.5). The second meaning will be used for this discussion, and Hewlett Packard's High-performance Filesystem will be used for technical reference purposes.
The superblock is the key to maintaining the filesystem. It's an 8 KB block of disk space that maintains the current status of the filesystem. Because of its importance, a copy is maintained in memory and at each cylinder group within the filesystem. The copy in main memory is updated as events transpire. The update daemon is the actual process that calls on the kernel to flush the cached superblocks, modified inodes, and cached data blocks to disk. The superblock maintains the following static and dynamic information about the filesystem. An asterisk will denote dynamically maintained information.
Number of Inodes
Location of free space
Number of cylinder groups
Fragment size and number
Block size and number
Location of superblocks, cylinder groups, inodes, and data blocks
Total number of free data blocks
Total number of free inodes
Filesystem status flag (clean flag)
As you can see from the listed information, the superblock maintains the integrity of the filesystem and all associated pertinent information. To prevent catastrophic events, the OS stores copies of the superblock in cylinder groups. The locations of these alternate superblocks may be found in /etc/sbtab. When system administrators are using fsck -b to recover from an alternate superblock, they will be required to give the location of that alternate block. Again, the only place to find that information is in /etc/sbtab. As a qualification to that statement, there is always an alternate superblock at block sixteen.
Cylinder groups are adjacent groups of cylinders, 16 cylinders by default, that have their own set of inodes and free space mapping. This is done to improve performance and reduce disk latency. Disk latency is the time between when a read is requested from the disk and when the I/O subsystem can transfer the data. Some factors that affect disk latency are rotational speed, seek time, and the interleave factor. Grouping cylinders in this way also places the inodes and their data blocks in closer proximity.
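Of the latency factors listed, rotational speed is the easiest to quantify: on average, the wanted block is half a revolution away from the head. The short calculation below works that out and adds it to an average seek time. The 12 ms seek figure and the spindle speeds are illustrative assumptions, not measurements from any particular drive, and the interleave factor is left out of this simple model.

```python
def average_rotational_latency_ms(rpm):
    """Average wait for a block to rotate under the head: half a revolution."""
    ms_per_revolution = 60.0 / rpm * 1000.0
    return ms_per_revolution / 2.0


def average_access_time_ms(average_seek_ms, rpm):
    """Average seek time plus average rotational latency."""
    return average_seek_ms + average_rotational_latency_ms(rpm)


for rpm in (3600, 5400, 7200):
    print(rpm, "rpm:",
          round(average_rotational_latency_ms(rpm), 2), "ms rotational,",
          round(average_access_time_ms(12.0, rpm), 2), "ms total with a 12 ms seek")
```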
The layout of the cylinder group is:
Cylinder group information
The boot block and the primary superblock will only be there if this is the first cylinder group; otherwise, it may be filled with data.
Inodes are fixed-length entries that vary in their length according to the OS implemented. SVR4 implementation is 128 bytes for a UFS inode and 64 bytes for an S5 inode. The inode maintains all of the pertinent information about the file except for the filename and the data. The information maintained by the inode is as follows:
File permissions or mode
Type of file
Number of hard links
Group associated to the file
Actual file size in bytes
Time/Date file last changed
Time/Date file last accessed
Time/Date last inode modification
Single indirect block pointer
Double indirect block pointer
Triple indirect block pointer
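Most of these inode fields are visible from user space through the stat system call. As a small illustration, the Python sketch below creates a temporary file and prints the fields that correspond to the list above. It assumes a UNIX-like system; the helper name describe_inode is ours, and the block pointers themselves are not exposed through this interface.

```python
import os
import stat
import tempfile
import time


def describe_inode(path):
    """Return the inode fields that stat() exposes for a path."""
    st = os.stat(path)
    if stat.S_ISDIR(st.st_mode):
        kind = "directory"
    elif stat.S_ISREG(st.st_mode):
        kind = "regular file"
    else:
        kind = "other"
    return {
        "inode_number": st.st_ino,
        "mode": stat.filemode(st.st_mode),        # permissions, ls -l style
        "type": kind,
        "hard_links": st.st_nlink,
        "owner_uid": st.st_uid,
        "group_gid": st.st_gid,
        "size_bytes": st.st_size,
        "last_changed": time.ctime(st.st_mtime),   # file data last modified
        "last_accessed": time.ctime(st.st_atime),
        "inode_modified": time.ctime(st.st_ctime),  # inode last modified
    }


with tempfile.TemporaryDirectory() as workdir:
    sample = os.path.join(workdir, "sample.txt")
    with open(sample, "wb") as handle:
        handle.write(b"hello world")
    for field, value in describe_inode(sample).items():
        print(f"{field:15} {value}")
```

Notice that the filename appears nowhere in the result; it lives in the directory entry, which simply maps the name to the inode number.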
There are 15 slots in the inode structure for disk addresses or pointers (see Figure 19.6). Twelve of the slots are for direct block addressing. A direct address can either point to a complete block or to a fragment of that block. The block and fragment sizes we are discussing are configurable parameters that are set at filesystem creation. They cannot be altered unless the filesystem is removed and re-created with the new parameters.
Listing of a typical AIX Root directory using ls -ali, to indicate the inode numbers for each file entry in the directory.
Single indirect addressing (slot 13) points to a block of four-byte pointers that point to data blocks. If the block that is pointed to by the single indirect method is 4 KB in size, it would contain 1024 four-byte pointers, and if it were 8 KB in size, it would contain 2048 four-byte pointers to data blocks. The double indirect block pointer is located in slot 14, and slot 15 maintains the triple indirect block pointer.
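The pointer arithmetic above can be checked with a short calculation. The sketch below derives the number of pointers per indirect block, shows which addressing level covers a given byte offset, and computes the largest file that 12 direct slots plus single, double, and triple indirection could address. It considers pointer geometry only; real filesystems usually impose lower limits (for example, from the width of the file-offset field), so the maximum shown is a theoretical ceiling rather than a documented limit of any system.

```python
POINTER_SIZE = 4    # bytes per block pointer, as described in the text
DIRECT_SLOTS = 12   # slots 1 through 12 of the inode


def pointers_per_block(block_size):
    """How many four-byte pointers fit in one indirect block."""
    return block_size // POINTER_SIZE


def addressing_level(offset, block_size):
    """Which inode slot type reaches the block holding this byte offset."""
    ppb = pointers_per_block(block_size)
    index = offset // block_size
    levels = (("direct", DIRECT_SLOTS),
              ("single indirect", ppb),
              ("double indirect", ppb ** 2),
              ("triple indirect", ppb ** 3))
    for name, capacity in levels:
        if index < capacity:
            return name
        index -= capacity
    raise ValueError("offset is beyond what the inode can address")


def max_addressable_bytes(block_size):
    """Theoretical ceiling from pointer geometry alone."""
    ppb = pointers_per_block(block_size)
    return (DIRECT_SLOTS + ppb + ppb ** 2 + ppb ** 3) * block_size


for size in (4096, 8192):
    print(size, "byte blocks:", pointers_per_block(size), "pointers per block,",
          max_addressable_bytes(size), "bytes addressable")
```

Small files never leave the direct slots, which is why they are the cheapest to read: no extra disk access is needed to find their data blocks.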
In the "Filesystem Concepts and Format" section, the initial discussion covered basic concepts of superblocks, alternate superblocks, cylinder groups, inodes, and direct and indirect addressing of data blocks. Further reading into these subjects is a must for all systems administrators, especially the new and inexperienced.
Kernel Configuration Process
Kernel configuration is a detailed process in which the systems administrator is altering the behavior of the computer. The systems administrator must remember that a change of a single parameter may affect other kernel subsystems, thus exposing the administrator to the "law of unintended consequences."
When Do You Rebuild the Kernel
Kernel components are generally broken into four major groups, and if changes are made to any of these groups, a kernel reconfiguration is required.
Subsystems--These are components that are required for special functionality (ISO9660)
Dump Devices--System memory dumps are placed here when a panic condition exists. Core dumps are usually placed at the end of the swap area.
Configurable Parameters--These are tuning parameters and data structures. There are a significant number, and they may have inter-dependencies, so it is important that you are aware of the impact of each change.
Device Drivers--These handle interfaces to peripherals like modems, printers, disks, tape drives, kernel memory, and other physical devices.
There are two ways to rebuild the kernel:
A. Use the System Administration Manager (SAM)
Step 1--Run SAM and select "Kernel Configuration."
Step 2--Select the desired component and make the appropriate change(s).
Step 3--Now answer the prompts and the kernel will be rebuilt.
Step 4--It will also prompt you for whether you want to reboot the kernel now or later.
Consider the importance of the changes and the availability of the system before you answer this prompt. If you answer "YES" to reboot the system now, the reboot cannot be reversed. The point is to know what you are going to do prior to getting to that prompt.
B. Manual Method
Step 1--Go to the build area of the kernel by typing the command line below.
# cd /stand/build
Step 2--The first step is to create a system file from the current system configuration by typing the command line below.
# /usr/lbin/sysadm/system_prep -s system
This command places the current system configuration in the file named system. There is no requirement that you call it system; it could be any name you desire.
Step 3--Now you must modify the existing parameters and insert unlisted configuration parameters, new subsystems, and device drivers, or alter the dump device. The reason you may not have one of the listed configurable parameters in this file: The previous kernel took the default value.
Step 4--The next step is to create the conf.c file, using the modified system file to create it. Remember, if you did not use system for the existing configuration file, insert your name where we show system. The conf.c file has constants for the tunable parameters. Type the command below to execute the config program.
# /usr/sbin/config -s system
Step 5--Now rebuild the kernel by linking the driver objects to the basic kernel.
# make -f config.mk
Step 6--Save the old system configuration file.
# mv /stand/system /stand/system.prev
Step 7--Save the old kernel.
# mv /stand/vmunix /stand/vmunix.prev
Step 8--Move the new system configuration file into place.
# mv ./system /stand/system
Step 9--Move the new kernel into place.
# mv ./vmunix_test /stand/vmunix
Step 10--You are ready to boot the system to load the new kernel.
# shutdown -r -y 60
Suppose you were going to run Oracle on your Sun system under Solaris 2.5 and you wanted to change max_nprocs to 1000 and set up the following Interprocess Communications configuration for your shared memory and semaphore parameters:
Step 1--As root, enter the commands below. The cp command creates a backup of the current file.
# cd /etc
# cp system system.old
# vi system
Step 2--Add or change the following:
set max_nprocs=1000
set shmsys:shminfo_shmmax=2097152
set shmsys:shminfo_shmmin=1
set shmsys:shminfo_shmmni=100
set shmsys:shminfo_shmseg=32
set semsys:seminfo_semmni=64
set semsys:seminfo_semmns=1600
set semsys:seminfo_semmnu=1250
set semsys:seminfo_semmsl=25
Save and close the file.
Step 3--Reboot your system by entering the following command.
# shutdown -r now
The above kernel parameter and kernel module variables are now set for your system.
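Because a typing error in this file is only discovered at the next reboot, it can be worth reading your edits back before you shut down. The sketch below is a small, unofficial checker of our own, not a Solaris tool: it assumes each setting has the form set [module:]variable=value, that comment lines begin with an asterisk, and that values are plain decimal or 0x-prefixed hexadecimal integers. It parses an inline sample rather than the real file so that it can be run anywhere.

```python
SAMPLE = """\
* Kernel settings for the database server (sample text, not a real file)
set max_nprocs=1000
set shmsys:shminfo_shmmax=2097152
set shmsys:shminfo_shmmin=1
set shmsys:shminfo_shmmni=100
set shmsys:shminfo_shmseg=32
"""


def parse_set_lines(text):
    """Return {(module or None, variable): integer value} for each set line."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("*"):
            continue  # blank line or comment
        if not line.startswith("set "):
            continue  # other directives are ignored in this sketch
        name, separator, value = line[4:].partition("=")
        if not separator:
            raise ValueError(f"missing '=' in: {raw!r}")
        module, _, variable = name.strip().rpartition(":")
        settings[(module or None, variable)] = int(value.strip(), 0)
    return settings


for (module, variable), value in parse_set_lines(SAMPLE).items():
    print(module or "(kernel)", variable, value)
```

A check like this catches misspelled lines and malformed numbers; it cannot tell you whether a value is sensible for your system.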
In this example we will set the tunable NPROC to 500 and then rebuild the kernel to reflect this new value.
Step 1--Log into the system as root and make a backup of /stand/unix to another area.
# cp /stand/unix /old/unix
Step 2--Edit the init.base file to include any changes that you made in the /etc/inittab file that you want to make permanent. A new /etc/inittab file is created when a new kernel is built and put into place.
Step 3--In this step you edit the configuration files in the /etc/conf directory. We will only change /etc/conf/cf.d/stune (although you can change /etc/conf/cf.d/mtune). The stune and mtune files contain the tunable parameters the system uses for its kernel configuration. stune is the system file that you should use when you alter the tunable values for the system. It overrides the values listed in mtune. mtune is the master parameter specification file for the system. It contains the tunable parameters' default, minimum, and maximum values.
The following command line is an example of how you make stune reflect a parameter change.
# /etc/conf/bin/idtune NPROC 500
You can look at stune to see the changes. (stune can be altered by using the vi editor)
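The way stune overrides mtune can be modeled in a few lines. The sketch below assumes the layout described above: each mtune entry carries a name, default, minimum, and maximum, and each stune entry carries a name and a value. The numbers in the sample text are invented for illustration and are not the defaults of any real system, and the range check is our own illustration of why the minimum and maximum values exist.

```python
MTUNE_SAMPLE = """\
* name   default  minimum  maximum   (hypothetical values)
NPROC    200      50       3000
MAXUP    30       15       1000
"""

STUNE_SAMPLE = """\
NPROC    500
"""


def parse_mtune(text):
    """Return {name: (default, minimum, maximum)}."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4 or fields[0].startswith("*"):
            continue  # skip blank lines, comments, and malformed entries
        name, default, minimum, maximum = fields
        table[name] = (int(default), int(minimum), int(maximum))
    return table


def parse_stune(text):
    """Return {name: value} for the administrator's overrides."""
    overrides = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and not fields[0].startswith("*"):
            overrides[fields[0]] = int(fields[1])
    return overrides


def effective_values(mtune, stune):
    """Apply overrides to defaults, rejecting out-of-range values."""
    result = {}
    for name, (default, minimum, maximum) in mtune.items():
        value = stune.get(name, default)
        if not minimum <= value <= maximum:
            raise ValueError(f"{name}={value} is outside {minimum}..{maximum}")
        result[name] = value
    return result


print(effective_values(parse_mtune(MTUNE_SAMPLE), parse_stune(STUNE_SAMPLE)))
```

Parameters that do not appear in stune simply keep their mtune default, which is why stune is normally a short file.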
Step 4--Build the new kernel.
It will take several minutes to complete.
Step 5--Reboot the computer system to enable the new kernel to take effect.
# shutdown -i6 -g0 -y
To see your changes, log back into your system and execute the sysdef command. The system parameters will then be displayed.
Unlike the preceding examples, the AIX operating system requires a special tool to reconfigure the kernel. This tool is the System Management Interface Tool (SMIT), developed by IBM for the AIX operating system. The AIX kernel is modular in the sense that portions of the kernel's subsystems are resident only when required.
The following shows a SMIT session to change the MAX USERS PROCESSES on an AIX 4.2 system. This is demonstrated to the reader by screen prints of an actual kernel configuration session. While using SMIT you can see the command sequences being generated by SMIT by pressing the F6 key. SMIT also makes two interaction logs that are handy for post configuration review. smit.log is an ASCII file that shows all menu selections, commands, and output of a session. smit.script shows just the actual command line codes used during the session.
Step 1--As root, start SMIT with the following command. This will bring up the IBM SMIT GUI screen.
Step 2--Select "System Environments" from the System Management menu with your mouse.
Step 3--Select "Change/Show Characteristics of Operating System" from the System Environment menu with your mouse.
Step 4--Change "Maximum number of PROCESSES allowed per user" to "50" in the "Change/Show Characteristics of Operating System" menu. Do this by selecting the field for "Maximum number of PROCESSES" with your mouse. Then change the current value in the field to "50."
Step 5--After making your change, select the "OK" button to make the new kernel parameters take effect.
Step 6--The System Management Interface Tool will respond with a "Command Status" screen. Verify that there are no errors in it. If there are none, you are done.
If an error is returned it would look like the following screen print.
The point-to-point protocol (PPP) allows you to dial in over a telephone line and run Transmission Control Protocol/Internet Protocol (TCP/IP). This allows you to run your GUI applications that use IP from a system that is not directly connected to a network. Let's look at how to configure PPP into the Linux kernel.
Step 1--Linux source code is usually found in the /usr/src/linux directory. Let's start by changing to this directory by typing the following command.
# cd /usr/src/linux
Step 2--Type the following:
# make config
You will be presented with a series of questions asking if you would like to include or enable specific modules, drivers, and other kernel options in your kernel. For our build, we are concerned that we have a modem and the required networking device driver information configured into our kernel. Make sure you have answered [y] to:
Networking Support (CONFIG_NET)
Network device support (CONFIG_NETDEVICES)
TCP/IP networking (CONFIG_INET)
PPP (point-to-point) support (CONFIG_PPP)
For the most part, you can accept the defaults of most of the questions. It's probably a good idea to go through this step once without changing anything to get a feel for the questions you will need to answer. That way you can set your configuration once and move on to the next step.
After you respond to all the questions, you will see a message telling you that the kernel is configured (it still needs to be built and loaded).
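Before moving on to the build, you may want to confirm that the four options were actually recorded. The sketch below assumes that the configuration step saves your answers in a text file of NAME=value lines, with disabled options written as comments; that is how the .config file in the kernel source directory is laid out on the systems we have seen, but check your own tree. The script parses an inline sample, so the option values shown are illustrative only.

```python
REQUIRED_FOR_PPP = ("CONFIG_NET", "CONFIG_NETDEVICES", "CONFIG_INET", "CONFIG_PPP")

SAMPLE_CONFIG = """\
#
# Sample text only, not a real kernel configuration
#
CONFIG_NET=y
CONFIG_INET=y
CONFIG_NETDEVICES=y
# CONFIG_PPP is not set
"""


def recorded_options(config_text):
    """Return {option name: value} for every NAME=value line."""
    options = {}
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line, comment, or a disabled option
        name, separator, value = line.partition("=")
        if separator:
            options[name.strip()] = value.strip()
    return options


def missing_for_ppp(config_text):
    """List the required options that were not answered with y."""
    options = recorded_options(config_text)
    return [name for name in REQUIRED_FOR_PPP if options.get(name) != "y"]


print("still missing:", missing_for_ppp(SAMPLE_CONFIG))
```

If the list is not empty, rerun the configuration step and answer [y] to the options named before you build.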
Step 3--The next two commands will set all of the source dependencies so that you can compile the kernel and clean up files that the old version leaves behind.
# make dep
# make clean
Step 4--To compile the new kernel issue the make command.
Don't be surprised if this takes several minutes.
Step 5--To see the new kernel, do a long listing of the directory.
# ls -al
You should see vmlinux in the current directory.
Step 6--You now need to make a bootable kernel.
# make boot
To see the compressed bootable kernel image, do a long listing on arch/i386/boot. You will see a file named zImage.
Step 7--The last step is to install the new kernel to the boot drive.
# make zlilo
This command renames the previous kernel (/vmlinuz) to /vmlinuz.old. Your new kernel image zImage is now /vmlinuz. You can now reboot to check your new kernel configuration. During the boot process, you should see messages about the newly configured PPP device driver scroll across as the system loads.
Once everything checks out and you are satisfied with your new Linux kernel, you can continue on with setting up the PPP software.
We began our discussion by defining the UNIX kernel and the four basic subsystems that comprise the operating system. We described how Process Management creates and manages the process and how Memory Management handles multiple processes in the system. We discussed how the I/O subsystem takes advantage of swapping and paging to balance the system's load and the interaction of the I/O subsystem with the file management subsystem.
Next, we covered the steps involved in altering the kernel configuration. We demonstrated in detail the steps involved in configuring:
System V Release 4 (SVR4)
In the authors' opinion, the systems administrator should become familiar with the concepts presented in this chapter. Further in-depth study of the kernel and its four subsystems will make the systems administrator more knowledgeable and effective at systems management.