
Tobi Waves



NordU 2003 Tutorial: Solaris Internals

Wednesday, February 12, 2003 17:00 // Aros Congress Center, Västerås, Sweden


Solaris Internals: Architectural Tips and Tidbits by Richard McDougall richard.mcdougal@eng.sun.com

In this tutorial, Richard highlighted various components of Solaris. Below I have listed the things I found interesting with an emphasis on Solaris 9 features and Solaris 8 tricks.

The Solaris kernel is preemptive; there are only a few non-preemption points in critical code paths. Combined with the threaded kernel, this allows very scalable handling of I/O interrupts.

In the 64-bit transition of Solaris, only long and pointer types changed from 32 to 64 bits; the other types stayed the same size (32-bit -> ILP32, 64-bit -> LP64).
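A quick way to see which data model a given build uses is to ask for the sizes of the C types directly. The following is a minimal sketch using Python's standard ctypes module; the function name data_model is made up for illustration, and the answer depends on how the interpreter itself was built (a 64-bit build on Solaris or Linux should report LP64, a 32-bit build ILP32).

```python
import ctypes


def data_model():
    """Return the sizes (in bytes) of basic C types plus a data-model label."""
    sizes = {
        "int": ctypes.sizeof(ctypes.c_int),
        "long": ctypes.sizeof(ctypes.c_long),
        "pointer": ctypes.sizeof(ctypes.c_void_p),
    }
    if sizes == {"int": 4, "long": 4, "pointer": 4}:
        label = "ILP32"  # int, long and pointer are all 32 bits
    elif sizes == {"int": 4, "long": 8, "pointer": 8}:
        label = "LP64"   # only long and pointer grew to 64 bits
    else:
        label = "other"  # e.g. 64-bit Windows, where long stays at 32 bits
    return sizes, label


if __name__ == "__main__":
    sizes, label = data_model()
    print(label, sizes)
```

In both models int stays at 4 bytes, which is the point of the note above: only long and pointers change.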

Oracle profits from 64-bit by being able to cache all disk blocks and thus access data without system calls. Memory mapping a whole database is not unreasonable at today's memory prices.

64-bit code has bigger pointers and longs, so it has to move more data. This results in a 5% performance loss. If other 64-bit features are used, this loss is not significant.
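To get a feel for the "more data" part, consider a structure built only from longs and pointers. The sketch below uses Python's struct module with fixed standard sizes to model a hypothetical linked-list node (one long value, one pointer to the next node) under both data models. It does not reproduce the 5% figure, which is about whole programs; it only shows that this particular kind of structure doubles in size.

```python
import struct

# Hypothetical linked-list node: one long value plus one pointer to the next node.
# The "<" prefix selects fixed standard sizes, so the result is host independent.
ILP32_NODE = struct.calcsize("<ll")  # long = 4 bytes, pointer modelled as 4 bytes
LP64_NODE = struct.calcsize("<qq")   # long = 8 bytes, pointer modelled as 8 bytes


def footprint(nodes, node_size):
    """Bytes needed to hold `nodes` list nodes of the given size."""
    return nodes * node_size


if __name__ == "__main__":
    print("ILP32 node:", ILP32_NODE, "bytes")
    print("LP64 node: ", LP64_NODE, "bytes")
    print("one million nodes:", footprint(1_000_000, ILP32_NODE),
          "vs", footprint(1_000_000, LP64_NODE), "bytes")
```

Structures dominated by int, char or float fields are unaffected, which is one reason the overall loss stays small.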

Solaris 8 has a new modular, live kernel debugger called mdb which replaces adb and crash. It lets you look at things like the list of open files and other kernel data structures. In Solaris 9, adb and crash are removed, and mdb even understands C structs defined in header files. It uses this information for "nice" memory dumping.

The kstat command shows kernel statistics, e.g. kstat -n system_misc. kstat is written in Perl and uses a Perl module to access the data. This module can easily be used in other programs.

About the Solaris Virtual Memory System

VM is all done on demand: pages get loaded as they are used, but the memory reservation (in swap) happens on exec, fork and brk. The effect is that a lot of swap gets reserved, but it does not actually get used as long as the code does not touch it. Forking Apache reserves the same amount of memory for the forked copy as for the original, but only the parts of the original that get modified are copied (copy on write). The advantage is that Solaris can actually provide all the memory a process was allocated in the first place; the process will not die halfway through an operation because the system promised more memory than was available. (Linux and AIX are less conservative, as they hand out reservations without looking at the available memory.)
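The difference between reserving and using memory can be sketched as a toy accounting model. This is not how the Solaris VM system is implemented; the class and method names are invented, and the page counts are arbitrary. It only illustrates why a conservative system refuses a request up front instead of failing later when pages are touched.

```python
class SwapReservation:
    """Toy model: swap is reserved up front and consumed lazily."""

    def __init__(self, total_pages):
        self.total = total_pages
        self.reserved = 0
        self.used = 0

    def reserve(self, pages):
        """Models exec, fork or heap growth: fail early if the promise cannot be kept."""
        if self.reserved + pages > self.total:
            raise MemoryError("not enough swap to back the reservation")
        self.reserved += pages

    def touch(self, pages):
        """Models first writes to pages: always fits inside an earlier reservation."""
        if self.used + pages > self.reserved:
            raise ValueError("touching more pages than were reserved")
        self.used += pages


if __name__ == "__main__":
    swap = SwapReservation(total_pages=1000)
    swap.reserve(400)   # original process
    swap.reserve(400)   # forked copy reserves the same amount
    swap.touch(50)      # but only modified pages are really consumed
    print(swap.reserved, swap.used)
    try:
        swap.reserve(400)   # a second fork is refused immediately
    except MemoryError as err:
        print("refused:", err)
```

A less conservative system would accept the third reservation and only run into trouble once the processes really touched their pages.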

The loading of programs uses a similar process: the file on disk becomes part of the OS memory (it gets memory mapped) and is then loaded into RAM piece by piece (demand paged) as parts of the program get executed.
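The same mechanism is available to ordinary programs through mmap. Below is a minimal sketch in Python; read_mapped and demo are illustrative names and the scratch file contents are arbitrary. Mapping the file does not by itself read it; the operating system is free to bring in only the pages that are actually accessed.

```python
import mmap
import os
import tempfile


def read_mapped(path, offset, length):
    """Map a file read-only and return `length` bytes starting at `offset`."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file into the address space.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Slicing only needs the pages that cover the requested range.
            return mapped[offset:offset + length]


def demo():
    """Write a small scratch file, then read part of it back through the mapping."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"header--" + b"x" * 8192 + b"--trailer")
        path = tmp.name
    try:
        return read_mapped(path, 0, 6)
    finally:
        os.unlink(path)


if __name__ == "__main__":
    print(demo())
```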

When multiple copies of a program get started (or forked) they will all share the same memory as long as they do not modify it (copy on write).
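The visible effect of copy on write can be shown with a plain fork. This is a sketch for POSIX systems only; the function name is made up. The child overwrites a buffer it inherited from the parent and reports its view through a pipe, while the parent's copy stays untouched. The sketch shows the semantics only: which physical pages are shared at any moment is up to the kernel and is not visible here.

```python
import os


def fork_and_modify():
    """Fork, let the child overwrite inherited data, return (parent_view, child_view)."""
    data = bytearray(b"original")
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: the write below lands in a private copy of the page.
        os.close(read_fd)
        data[:] = b"modified"
        os.write(write_fd, bytes(data))
        os._exit(0)
    # Parent: collect what the child saw, then reap it.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as reader:
        child_view = reader.read()
    os.waitpid(pid, 0)
    return bytes(data), child_view


if __name__ == "__main__":
    parent_view, child_view = fork_and_modify()
    print("parent sees:", parent_view)
    print("child saw:  ", child_view)
```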

How to determine how much memory a process really uses? The text segment of each process is read-only, so it is shared among all instances of the same program or shared library. The data segment is initially shared but gets partly split as the programs modify their statically allocated memory and copy on write happens. With

pmap -x PID

I get a detailed report of what memory the process really uses. The interesting part is the total of private memory (or anon memory in Solaris 9), which shows the amount of memory that is exclusive to the process. Solaris 9 gives us pmap -S, which shows how much memory is reserved in swap.

Memory Access Speed

Solaris 9 U1 has some support for NUMA (non-uniform memory access) architectures. This means that Solaris can handle the fact that different sections of memory have different access times. While Sun's multi-CPU boxes are essentially symmetrical, so all memory has the same access time, the larger variants like Ex800 or F15k are slightly NUMA. Solaris 9 U1 deals with this by building memory latency groups and taking the access time relative to the CPU into account when allocating memory to a process. This happens automatically, but there are also APIs for influencing it when writing programs.

How to determine if there is a memory shortage

Solaris 8 has an all-new paging system which performs much better than previous versions. The vmstat -p 3 command shows a detailed report on the paging activity. High numbers in the Anonymous column indicate a shortage of physical memory, so this is the way to check whether any bad swapping is happening. Make sure all VM tunables are removed from /etc/system when migrating to Solaris 8.

In Solaris 9, the ::memstat command in mdb -k prints a detailed list of memory usage.

CPU Usage Accounting

In Solaris 8, interrupts (e.g. from the network card) are accounted as idle time, while Solaris 9 accounts for them correctly.

With trapstat in Solaris 9 it is possible to see how many interrupts are occurring.

How to determine where the kernel is spending its time

The command lockstat -kIi997 sleep 10 monitors what the kernel is doing for 10 seconds, sampling the kernel threads 997 times a second, and then shows what it found. lockstat sleep 10 reports how many locks of what type occurred where within those 10 seconds. mpstat 1 shows per-processor statistics. In particular, the intr column shows whether the box is spending a lot of time being interrupted. If smtx is high, the CPU is spinning on mutexes while it unsuccessfully tries to acquire a lock.
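The arithmetic behind such a sampling profile is simple: the share of samples that land in a function estimates the share of time spent there. The sketch below assumes that each CPU is sampled at the given rate, and the function names in the example are invented placeholders, not real lockstat output.

```python
def expected_samples(rate_hz, seconds, cpus=1):
    """Number of samples a profile run should collect (assuming per-CPU sampling)."""
    return rate_hz * seconds * cpus


def time_share(sample_counts):
    """Turn per-function sample counts into percentages of all samples."""
    total = sum(sample_counts.values())
    if total == 0:
        return {}
    return {name: 100.0 * count / total for name, count in sample_counts.items()}


if __name__ == "__main__":
    # 997 samples per second for 10 seconds on one CPU
    print(expected_samples(997, 10))
    # Hypothetical counts, for illustration only
    print(time_share({"func_a": 7500, "func_b": 1500, "func_c": 970}))
```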

Kernel level Threading

Up to Solaris 8, the default thread model was based on the idea of a user-level scheduler mapping user-land threads onto kernel threads (LWPs). Thread switching happens on blocked threads only, so this is almost like cooperative multitasking, with the obvious problems. The nice thing is that this can be very fast, as handing off between threads is very lightweight, which could make for very efficient programs. Unfortunately the whole system is complex and quite unreliable; the massive number of thread patches is a witness to this fact.

An alternative threading library sits in /usr/lib/lwp. It turns all threads into kernel threads and scales really well because there is now only one scheduler. All the fairness problems go away. What is really cool about it: a simple

LD_LIBRARY_PATH=/usr/lib/lwp program

will make a threaded application use the kernel threads. Note that this only works really well from S8U7 (kernel jumbo patch from Feb 2002) onwards. In Solaris 9 the new threading lib is the default.

A better top: prstat

Solaris 8 includes something like top called prstat. It has various options to show all sorts of statistics: -m shows per-process microstate accounting, which makes it much easier to see where a process is spending its time. Without options it acts just like top. With -t you get stats on a per-user level.

Tricks with truss

See what is happening in the program besides system calls

truss -d -u a.out,libc program

Find out which system calls take how much time

truss -c program

Filesystem Tuning

Machines which access many files concurrently might profit from setting ufs_ninode and ncsize to higher values. Searching for "system tuning manual" on (docs.sun.com ...) will give more information. This could also be of interest for diskless machines which access many different files concurrently over NFS, as inode cache misses are even more expensive there.

In Solaris 9 there is an option to get checksumming on the device driver level to make sure files do not change on disk.

The kernel parameters ufs_LW and ufs_HW make sure that UFS does not consume too much memory (write throttling). This has a rather negative impact on performance, as their default values are too low. In Solaris 9 they are therefore set to higher values. Richard suggested putting the following in /etc/system on Solaris 8 systems, especially if they have a lot of memory:

set ufs_LW=4194304
set ufs_HW=67188864
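The two byte counts are easier to judge when split into MiB. The small helper below (describe is an illustrative name) does just that for the values as written above. ufs_LW comes out as exactly 4 MiB, while the ufs_HW value is 64 MiB plus 80000 bytes; exactly 64 MiB would be 67108864, and the notes do not say whether the remainder is intentional.

```python
MIB = 1024 * 1024

UFS_LW = 4194304   # low water mark, as given above
UFS_HW = 67188864  # high water mark, as given above


def describe(value):
    """Split a byte count into (whole MiB, leftover bytes)."""
    return divmod(value, MIB)


if __name__ == "__main__":
    for name, value in (("ufs_LW", UFS_LW), ("ufs_HW", UFS_HW)):
        mib, rest = describe(value)
        print(f"{name} = {value} bytes = {mib} MiB + {rest} bytes")
```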

Check (kr.sun.com ...) and (www.princeton.edu ...) for more information.

Another important tunable parameter is maxphys. It sets the maximum chunk of data the filesystem will write to disk in one go. For SCSI this is set to 128k by default, which is way too low. Richard suggests setting it to 2 MByte.

set maxphys=2097152
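A back-of-the-envelope calculation shows why the limit matters. The sketch assumes that every physical transfer carries at most maxphys bytes and ignores any other limits in the driver or disk; the 64 MiB write size is an arbitrary example.

```python
KIB = 1024
MIB = 1024 * KIB


def transfers_needed(total_bytes, maxphys):
    """Number of physical transfers if each one carries at most `maxphys` bytes."""
    return -(-total_bytes // maxphys)  # integer ceiling division


if __name__ == "__main__":
    write_size = 64 * MIB
    print("128k default:", transfers_needed(write_size, 128 * KIB), "transfers")
    print("2 MByte:     ", transfers_needed(write_size, 2 * MIB), "transfers")
```

With the default, a 64 MiB sequential write is cut into 512 pieces; with maxphys at 2097152 it takes 32.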

Check (206.231.101.22 ...)

To learn more ...

For more information, Richard recommends his Solaris Internals book: (www.amazon.com ...) or (www.solarisinternals.com ...)

 

The effects of a full-day tutorial

Wednesday, February 12, 2003 17:09 // Aros Congress Center, Västerås, Sweden

Yesterday Tom Limoncelli said that he thought one-day tutorials were probably the best method for learning new things. Today I did the 'self experiment' and spent the whole day in Richard McDougall's Solaris tutorial. Only minutes after the tutorial finished, I can confirm that I really do feel exhilarated and would like to try out all the things Richard touched upon. But looking more closely at what exactly I learned from today's slew of slides and explanations, all that remains are my notes, a headache and quite a number of areas I would like to investigate. I have not yet learned anything in the sense of having tested and applied in the real world what Richard has been explaining. So even though I would never have "learned" as much as I heard today, I guess I would have profited more from less information and more hands-on training on real-world problems. Or, if hands-on was not possible, then probably from paper exercises where I had to think up solutions which would then have been discussed later on.

The main problem with this more thorough approach is that it would be way less sexy than the information blast method, and people who pay for these tutorials want something out of them. At worst people might complain that, while they had learned how to make their Sun run faster, they had effectively figured it out on their own, and wonder why they had to pay so much for an instructor who only asked them questions.

 
