Wednesday, February 12, 2003 17:00 // Aros Congress Center, Västerås, Sweden

Solaris Internals: Architectural Tips and Tidbits by Richard McDougall richard.mcdougal@eng.sun.com
In this tutorial, Richard highlighted various components of Solaris. Below I have listed the things I found interesting with an emphasis on Solaris 9 features and Solaris 8 tricks.
The Solaris kernel is preemptive. There are only very few non-preemption points in critical code paths. Combined with the threaded kernel this allows for very scalable handling of IO interrupts.
In the 64bit transition of Solaris, only long and pointer were changed from 32 to 64bit; the other types stayed the same size (32bit -> ILP32, 64bit -> LP64).
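To see which of the two data models a given machine uses, the sizes of the native C types can be queried. The following is a minimal sketch using Python's ctypes module; the function names c_type_sizes and data_model are made up for this illustration and are not part of any Solaris tool.

```python
import ctypes


def c_type_sizes():
    """Return the size in bytes of a few native C types on this platform."""
    return {
        "int": ctypes.sizeof(ctypes.c_int),
        "long": ctypes.sizeof(ctypes.c_long),
        "pointer": ctypes.sizeof(ctypes.c_void_p),
    }


def data_model(sizes):
    """Name the data model implied by the int/long/pointer sizes."""
    key = (sizes["int"], sizes["long"], sizes["pointer"])
    return {
        (4, 4, 4): "ILP32",  # int, long and pointer are all 32 bit
        (4, 8, 8): "LP64",   # only long and pointer grew to 64 bit
    }.get(key, "other")


if __name__ == "__main__":
    sizes = c_type_sizes()
    print(sizes, data_model(sizes))
```

In both models int stays at 4 bytes, which is why code that stores pointers or longs in an int is the typical casualty of a 64bit port.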
Oracle profits from 64bit by being able to cache all disk blocks and thus access data without system calls. Memory mapping a whole database is not unreasonable with today's memory prices.
Because 64bit code has bigger pointers and longs, it has to move more data, which results in a 5% performance loss. If other 64bit features are used, this loss of performance is not significant.
Solaris 8 has a new modular, live kernel debugger called mdb which replaces adb and crash. It lets you look at things like the list of open files and other kernel data structures. In Solaris 9 adb and crash are removed, and mdb even understands C structs defined in header files. It uses this information for "nice" memory dumping.
The kstat command is used to show kernel statistics, e.g. kstat -n system_misc. kstat is written in Perl and uses a Perl module to access the data. This module can easily be used in other programs.
About the Solaris Virtual Memory System
VM is all done on demand: pages get loaded as they are used, but the memory reservation (in swap) happens on exec, fork and break. The effect of this is that a lot of swap gets reserved, but it does not actually get used as long as it does not get touched by the code. Forking apache reserves the same amount of memory for the forked copy as for the original, but only the parts of the original that get modified are actually copied (copy on write). The advantage is that Solaris will be able to actually provide all the memory a process got allocated in the first place, and the process will not die halfway through an operation because the system promised more memory than was actually available. (Linux and AIX are less conservative as they hand out reservations without looking at the available memory.)
The loading of programs uses a similar process: the file on disk becomes part of the OS memory (it gets memory mapped) and then gets successively loaded (demand paged) into RAM as parts of the program get executed.
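The same mechanism is available to ordinary programs through mmap. The sketch below is only an illustration: the helper names read_via_mmap and demo and the temporary file are made up, and it assumes the operating system pages the mapping in lazily, which is the usual behaviour but is not something the code itself can prove.

```python
import mmap
import os
import tempfile


def read_via_mmap(path, offset, length):
    """Map a file read-only and return `length` bytes starting at `offset`."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file into the address space.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # Slicing only touches the pages that hold these bytes.
            return m[offset:offset + length]


def demo():
    """Create a small scratch file and fetch six bytes from its middle."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"A" * 8192 + b"needle" + b"B" * 8192)
        return read_via_mmap(path, 8192, 6)
    finally:
        os.remove(path)
```

Calling demo() returns the six bytes in the middle of the file without an explicit read() of the surrounding data.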
When multiple copies of a program get started (or forked) they will all share the same memory as long as they do not modify it (copy on write).
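The visible effect of copy on write can be sketched with a fork on any POSIX system. The function name fork_and_modify is made up for this example, and the code only shows the semantics (the child's write does not reach the parent); the page sharing itself happens inside the kernel and is not observable from here.

```python
import os


def fork_and_modify():
    """Fork, let the child modify a buffer, and report what each side sees."""
    data = bytearray(b"original")
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: this write lands in the child's private copy of the page.
        os.close(read_end)
        data[:] = b"modified"
        os.write(write_end, bytes(data))
        os._exit(0)
    # Parent: collect the child's view, then compare it with our own.
    os.close(write_end)
    child_view = os.read(read_end, 64)
    os.close(read_end)
    os.waitpid(pid, 0)
    return bytes(data), child_view
```

The parent still sees b"original" while the child reports b"modified"; pages neither side writes to can stay shared.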
How to determine how much memory a process really uses? The text segment of each process is read only so all instances of the same program or shared library are shared. The data segment is initially shared but will get split partly as the programs modify their statically allocated memory and copy on write happens. With
pmap -x PID
I get a detailed report of what memory the process really uses. The interesting part is the total of private memory (or anon memory in Solaris 9) which shows the amount of memory which is exclusive to the process. Solaris 9 gives us pmap -S which will show how much memory is reserved in swap.
Memory Access Speed
Solaris 9 U1 has some support for NUMA (non-uniform memory access) architectures. This means that Solaris can handle the fact that different sections of the memory have different access times. While Sun's multi-CPU boxes are essentially symmetrical, so that all memory has the same access time, the larger variants like Ex800 or F15k are slightly NUMA. Solaris 9 U1 deals with this fact by building memory latency groups and taking the access time relative to the CPU into account when allocating memory to a process. This happens automatically, but there are also APIs for influencing this when writing programs.
How to determine if there is a memory shortage
Solaris 8 has an all new paging system which has much better performance than previous versions. The vmstat -p 3 command will show a detailed report on the paging activity. High numbers in the Anonymous column indicate that there is a shortage of physical memory, so this is the way to check if any bad swapping is happening. Make sure all VM tunables are removed from /etc/system when migrating to Solaris 8.
In Solaris 9 the mdb -k command ::memstat gives a detailed list of memory usage.
CPU Usage Accounting
In Solaris 8 interrupts (e.g. from the network card) are accounted as idle time, while Solaris 9 accounts for them correctly.
With trapstat in Solaris 9 it is possible to see how many interrupts are occurring.
How to determine where the kernel is spending its time
The command lockstat -kIi997 sleep 10 will monitor what the kernel is doing for 10 seconds, sampling the kernel threads 997 times a second, and then show you what it found. lockstat sleep 10 will tell how many locks of what type occurred where within the last 10 seconds. mpstat 1 will show some per-processor statistics. The Intr column in particular will show if the box is spending a lot of time being interrupted. If Smtx is high, this indicates that the CPU is spinning on mutexes while it unsuccessfully tries to acquire a lock.
Kernel level Threading
Up to Solaris 8 the default thread model was based on the idea of a user-level scheduler mapping user-land threads onto kernel threads (LWP). Thread switching happens on blocked threads only, so this is almost like cooperative multitasking, with the obvious problems. The nice thing is that this can be very fast, as handing off between threads is very lightweight, and it could create very efficient programs. Unfortunately the whole system is complex and quite unreliable; the massive number of thread patches is a witness to this fact.
An alternative threading library is sitting in /usr/lib/lwp. It turns all threads into kernel threads and scales really well because there is now only one scheduler. All the fairness problems go away. What is really cool about it: a simple
LD_LIBRARY_PATH=/usr/lib/lwp program
will make a threaded application use the kernel threads. Note that this only works really well from S8U7 (kernel jumbo from Feb 2002) onwards. In Solaris 9 the new threading lib is the default.
A better top: prstat
Solaris 8 includes something like top called prstat. It has various options to show all sorts of statistics: -m will show per-process microstate accounting, which makes it much easier to see where a process is spending its time. Without options it acts just like top. With -t you get stats on a per-user level.
Tricks with truss
See what is happening in the program besides system calls
truss -d -u a.out,libc program
Find out which system calls take how much time
truss -c program
Filesystem Tuning
Machines which access many files concurrently might profit from setting ufs_ninode and ncsize to higher values. Searching for "system tuning manual" on (docs.sun.com ...) will give more information. This could also be of interest for machines running diskless which access many different files concurrently over NFS, as inode cache misses are even more expensive there.
In Solaris 9 there is an option to get checksumming on the device driver level to make sure files do not change on disk.
The kernel parameters ufs_LW and ufs_HW make sure that UFS does not consume too much memory (write throttling). This has a rather negative impact on performance, as their default values are too low. In Solaris 9 they are therefore set to higher values. Richard suggested putting the following in /etc/system of Solaris 8 systems, especially if they have a lot of memory:
set ufs_LW=4194304
set ufs_HW=67188864
Check (kr.sun.com ...) and (www.princeton.edu ...) for some information.
Another important tunable parameter is maxphys. It sets the maximal chunk of data the filesystem will write to disk in one go. For SCSI this is set to 128k by default, which is way too low. Richard suggests setting this to 2 MByte.
set maxphys=2097152
Check (206.231.101.22 ...)
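The /etc/system values are plain byte counts, so it helps to convert them back into MByte before copying them. The short sketch below does that arithmetic for the numbers noted above; the helper name split_mib is made up for this example.

```python
MIB = 1024 * 1024
KIB = 1024

# Values as written down in these notes.
TUNABLES = {"ufs_LW": 4194304, "ufs_HW": 67188864, "maxphys": 2097152}


def split_mib(value):
    """Split a byte count into whole MiB and the remaining bytes."""
    return divmod(value, MIB)


if __name__ == "__main__":
    for name, value in TUNABLES.items():
        whole, rest = split_mib(value)
        print(f"{name}: {whole} MiB + {rest} bytes")
    # Ratio of the suggested maxphys to the 128k SCSI default.
    print("maxphys factor:", TUNABLES["maxphys"] // (128 * KIB))
```

ufs_LW comes out as exactly 4 MByte and maxphys as exactly 2 MByte, 16 times the 128k default. The ufs_HW value as noted is 80000 bytes above 64 MByte, so it is worth checking against the tuning manual before using it.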
To learn more ...
For more information, Richard recommends his Solaris Internals book: (www.amazon.com ...) or (www.solarisinternals.com ...)
Wednesday, February 12, 2003 17:09 // Aros Congress Center, Västerås, Sweden
Yesterday Tom Limoncelli said that he thought that one-day tutorials were probably the best method for learning new things. Today I did the 'self experiment' and spent the whole day in Richard McDougall's Solaris tutorial. Only minutes after the tutorial finished I can confirm that I really do feel exhilarated and would like to try out all the things Richard touched upon. But looking more closely at what exactly I learned from today's slew of slides and explanations, all that remains are my notes, a headache and quite a number of areas I would like to investigate. I have not yet learned anything in the sense that I have tested and applied what Richard has been explaining in the real world. So even though I would never have "learned" as much as I heard today, I guess I would have profited more from less information and more hands-on training on real world problems. Or if hands-on was not possible, then probably paper exercises where I had to think up solutions which would then have been discussed later on.
The main problem with this more thorough approach is that it would be way less sexy than the information blast method, and people who pay for these tutorials want something out of them. At worst people might complain that while they had learned how to make their Sun run faster, they had effectively figured it out on their own, and wonder why they had to pay so much for an instructor to only ask them questions.
Thursday, February 13, 2003 08:29 // Aros Congress Center, Västerås, Sweden

by Eija Onnela eija.onnela@turku.fi
Turku is a city in Finland with 173'000 inhabitants, 13'000 city employees, 5000 workstations and 150 man-years spent on IT every year, with 54 people in central IT.
Reasons for looking into OpenSource: a) OpenOffice in Finnish, b) new M$ licensing policy, c) report on usability of OpenOffice and Linux in Turku City http://www.turku.fi/english/administration_economy/it_department.html
Test Setup
Test setups were created for Linux (1 person) and Windows (4 people sponsored by MS)
The goal of the upgrade is to simplify system management while keeping the user experience at a good level.
The Linux software environment is based on SuSE, using Webmin for administration and OpenAFS for home directory access. On the application side they used OpenOffice and Netscape. Installation was done via CD because of the slow network environment. All of this is running off a single Linux server.
The Windows setup was done with RIS through PXE. Office on the software side ... special applications distributed through SMS. 9 servers.
Problems
On the Linux side the problems were mostly due to interoperability with old office documents, the fact that the users are not used to the Linux environment, and many small applications which were not available on Linux.
On the Windows side, problems were mostly due to applications which were OK on Windows NT but did not work on Windows XP.
There is a lot of resistance from the user side and from department admins regarding a switch to Linux. Users do not want to work with a different environment and local admins do not want to change to a centralized solution.
Conclusion
Two TCO analysis projects are still underway to determine the financial implications of the two solutions. A decision has not been reached yet on whether to go for Linux or Windows.
Eija is under quite a lot of pressure currently because all major cities in Finland are waiting on the outcome of the Turku project. And things are not looking all that good for Linux because of missing applications and because users as well as local admins are stalling. Maybe Citrix and terminal server can help.
Thursday, February 13, 2003 10:26 // Aros Congress Center, Västerås, Sweden

by Richard McDougall r@sun.com
Reasons for Big Memory and thus 64bit Solaris
Machines with up to 500 GB of memory are possible. This opens new possibilities, for example keeping huge databases totally in memory and thus eliminating all the read performance problems on the file system level.
UFS in Solaris >= 8
File creation is 10 times faster, file system creation is orders of magnitude faster, directory lookups scale linearly with directory size.
New Tools in Solaris 8
prstat (a better top), mdb (successor to adb and crash), lockstat -k (kernel profiling), kstat (command and perl library for kernel statistics), extended truss (traces library and program calls), new accounting system, cpustat for cache and bus statistics.
Solaris 9 Resource Management
The RM is an infrastructure to automate performance management.
Traditionally machines had to be sized quite big because the workloads were very uneven. With RM it is possible to add workloads at a low priority and thus use all available CPU time without disturbing the main task on the machine.
RM lets you group processes into projects, assign resources to them and also do accounting on them. In /etc/project (or via LDAP/NIS+) you can define process groups by program, user and group and assign resources to each group. With the newtask command a program can be explicitly assigned to a certain project and thus gets access to the respective resources.
Resources are defined in pools, which let you select the number of CPUs and the type of scheduler to be used. The projects are assigned to pools, which then define the resources available to the processes in a project.
The resource constraints facility lets you send signals to programs violating resource limits or also deny them access to resources.
Many of the Solaris performance tools know about the projects concept and can report based on projects instead of processes.
Relevant commands: projects, proj{add,mod,del}, newtask, pooladm, poolcfg, poolbind.
Check (www.sun.com ...)
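To get a feeling for what a project definition looks like, here is a minimal sketch that parses one line of /etc/project. It assumes the colon-separated layout described in the project(4) man page (name, id, comment, user list, group list, attributes); the parser itself, the names Project and parse_project_line, and the sample entry "batch" are made up for this illustration.

```python
from typing import NamedTuple


class Project(NamedTuple):
    name: str
    projid: int
    comment: str
    users: list
    groups: list
    attributes: dict


def parse_project_line(line):
    """Parse one 'name:id:comment:users:groups:attributes' line."""
    name, projid, comment, users, groups, attrs = line.rstrip("\n").split(":", 5)
    attributes = {}
    # Attributes are separated by ';' and written as key=value.
    for item in filter(None, attrs.split(";")):
        key, _, value = item.partition("=")
        attributes[key] = value
    return Project(
        name,
        int(projid),
        comment,
        [u for u in users.split(",") if u],
        [g for g in groups.split(",") if g],
        attributes,
    )


# Hypothetical entry: a low priority project for two users.
SAMPLE = "batch:100:Low priority jobs:alice,bob::project.cpu-shares=(privileged,1,none)"
```

parse_project_line(SAMPLE) yields a project named batch with id 100, the users alice and bob, no groups and a single cpu-shares attribute.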
Thursday, February 13, 2003 11:11 // Aros Congress Center, Västerås, Sweden

by Andrew Cagney
Gdb is the most widely used debugger, only MS is still doing their own thing, most other companies have switched to gdb totally or are at least helping it succeed.
New Tricks
Languages C, C++, Java, Fortran, Scheme, Modula-2
Expression parser understands function expressions written in the language of the program and can evaluate them on the fly.
Remote debugging for debugging embedded systems remotely with gdb server.
Program tracing with trigger points to do on the fly monitoring without stopping the program.
Out in the next few weeks: tui the gdb split screen, curses based text gui.
The next version will know about multiple architectures. This means a single instance of gdb is able to remote debug code on different architectures. The eventual goal is to be able to transparently step into remote procedure calls.
GDB is introducing a new interface called MI (machine interface) to simplify the use of gdb from front end programs. There are very strict criteria on changes to this interface to ensure that front-ends can rely on the stability of the interface.
Handle debugging of optimized code with CFI (gdb 6)
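The point of the machine interface mentioned above is that a front end can parse gdb's answers instead of scraping text meant for humans. As a rough sketch, assuming result records of the form ^done,value="42" as described in the GDB/MI documentation, a front end could pick such a line apart like this. The function name parse_result_record is made up, and the parser only handles flat key="string" results; real MI output also contains token prefixes, tuples and lists, which this sketch does not handle properly.

```python
import re

# key="value" pairs, allowing backslash escapes inside the quoted value.
_RESULT = re.compile(r'(\w[\w-]*)="((?:[^"\\]|\\.)*)"')


def parse_result_record(line):
    """Parse a flat MI result record such as '^done,value="42"'."""
    if not line.startswith("^"):
        raise ValueError("not a result record")
    result_class, _, rest = line[1:].partition(",")
    return result_class, dict(_RESULT.findall(rest))
```

For the line ^done,value="42" this returns the result class done together with a dictionary holding the value, which is much easier to rely on than the layout of the interactive output.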
Old Dog
Gdb 1.x came out in 1986 for SPARC, VAX, Tahoe and GOULD ... Andrew is looking for it.
Not really big new features since 1991, mainly new architectures were added. Almost any cpu ever designed is supported by gdb (and gcc).
Still, the code base is growing exponentially; they are at 1.5M lines now.
A few years back gdb supported 36 architectures. As this is difficult to maintain they have been actively eliminating old code ... they are down to 22 now.
Code Quality Improvements
Select -Werror flags for zero warning tolerance.
GNU Coding standard. ReIndent with GNU indent. Strict ISO C, Eliminate subjectivity. Use GNU indent and don't argue.
GDB specific lint which checks for various common problems. Code which does not pass is not accepted.
Move to opaque objects and avoid globals.
Thursday, February 13, 2003 11:58 // Aros Congress Center, Västerås, Sweden

by Jes Sørensen jes@wildopensource.com
Jes has been working on the Linux kernel for the last 10 years. He specializes in driver development. Currently he is involved with the consultancy Wildopensource.
The Basic Problem
Drivers are needed to talk to hardware.
Documentation lets us write better drivers, as we have to guess less. Not that documentation would generally really describe what the hardware does in reality, but it is a start.
Binary drivers are a problem, as the kernel API changes without regard for binary compatibility with existing drivers.
Open code is generally better because of peer review.
Why are people not releasing specs?
Many companies think that they will lose their competitive edge if the competitors know how to program their HW.
Jes makes the case that this is not true: giving out programming information is not a real IP issue anymore, as today the core of a company's IP is mostly within the chip and not in the interface.
Another problem is that many companies like nVidia for example, do not own all their interfaces, due to cross licensing and patenting issues and are thus not allowed to release source.
Convincing People to release their Specs
Having a driver in the official kernel gets it some automatic maintenance as the code is updated with the kernel, or at least compatibility problems will be discovered more quickly.
Better public acceptance due to a good image within the community and thus better sales.
Free help for debugging the hardware as external driver writers tend to find new problems.
Addressing Execs
Engineers are normally not a problem, they like to share information and help each other out within the limits of the environment they work in.
NDAs are acceptable if they just protect the documentation and not the code written based on the information gained from the docs.
GPL helps as it ensures that the source can not be taken by competitors and included in their closed product.
Flaming and yelling shuts doors. Good behavior helps. Don't use SlashDot.
Petitions might help, but only if the company is interested in this.
What to do when told NO?
Look for alternative vendors, OEMs might be more friendly. E.g. Broadcom's OEMs.
Sometimes new chips are largely based on the previous model and thus the interfaces are similar.
Use reverse engineering, but beware if you live in a non-free country like the US, where the DMCA can send you to jail for years. In the EU, interfaces can currently be legally reverse engineered for the purpose of interoperability if the vendor refuses to give specs. Also be aware that some countries believe they have jurisdiction everywhere.
Reverse Engineering
Take a close look at existing drivers for other OSes.
Snoop drivers' register access.
Use srandom to figure out the correct access sequence (Andrew Tridgell of Samba fame used this to figure out how the Vaio Picture Book camera works).
To avoid licensing issues, get a friend to read the specs and have him tell you how it works.
How to Write a Driver
Do not use a compatibility Layer. Write the driver Linux specific.
Examples
Jes has been using these techniques to write drivers for Alteon, Intel EEPro 100 and Broadcom 570x.
Alteon is no more, with Intel there are now excellent links and they even submit patches. Broadcom was really difficult to work with but due to pressure from their customers they seem to come around.
Thursday, February 13, 2003 15:08 // Aros Congress Center, Västerås, Sweden
I really love speaking in front of an audience. This is why it is so easy to convince me to come to conferences. During the last hour I finally had my own talk here at the NordU conference. I was talking about scalable system management concepts in a large environment, presenting the major tools we have developed at the ISG.EE. There were not all that many people in my talk, but taking into account that there are only slightly more than 100 people at the conference and that there were 3 sessions in parallel plus a vendor exhibition, I am actually quite happy. I think I drew over 30%.
Oh yeah, and I kept to the set time of 45 minutes: I finished my talk 2 minutes before the allotted time, with a break halfway through for questions. Now I just need to find a way to lose that adrenaline to be able to concentrate on other talks again.
Thursday, February 13, 2003 15:24 // Aros Congress Center, Västerås, Sweden

by Bruno Cornec from the HP/Intel Solution Center.
HP up to the management level is now taking Linux seriously. They finance most of the ia64 and wireless work. They employ several key Linux developers for example Jeremy Allison of Samba Fame.
Itanium is HP's future. All operating systems the users require will be provided. This includes Linux, Windows, HP-UX and OpenVMS.
Itanium is a new architecture co-developed by HP and Intel. It includes hardware IA32 emulation. The chip includes the floating point unit from PA-RISC and is thus very fast in this area.
While Itanium is available to whoever wants to buy it from Intel, HP has developed their own high performance chip set for the Itanium 2 which they hope to gain competitive edge from.
HP is not only working on the ia64 architecture but also supporting ports to PA-RISC and Alpha.
HP's David Mosberger is responsible for the Linux ia64 port. His main focus in doing the port is to comply with all the Unix standards for 64 bit as well as keeping the ia64 port close to the ia32 version to ease portability. The ia64 port also includes access to the ia32 hardware emulator.
Several vendors already provide Itanium compatible products: Intel's C Compiler and Oracle, Side Effects Houdini, MSC.Linux, MSC.Nastran, SCI, Quadrics drivers, Myrinet, SSI, Alinka.
HP is supporting external developers in improving the gcc code generation for the ia64 in order to get it on par with Intel's compiler.
HP is working with INRIA on porting MandrakeCluster to the Itanium Platform. (clic.mandrakesoft.com ...)
Tips for porting to the Itanium
Code that already runs on Alpha will just work.
Pointers and Longs are 64bit.
Big-endian is settable for certain programs as required.
Use int32_t, int64_t, u_int8_t
Compile with -Wall and take the warnings seriously.
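The tips above about fixed-width types and byte order boil down to never relying on the native size of long or on the native endianness when data leaves the program. A small sketch of the idea, using Python's struct module as a stand-in for the C types; the helper pack_int32 and the table FIXED_SIZES are made up for this example.

```python
import struct


def pack_int32(value, big_endian):
    """Pack a signed 32-bit integer with an explicit byte order."""
    return struct.pack(">i" if big_endian else "<i", value)


# With '=' struct uses standard sizes, independent of the platform.
FIXED_SIZES = {
    "int32_t": struct.calcsize("=i"),
    "int64_t": struct.calcsize("=q"),
    "u_int8_t": struct.calcsize("=B"),
}

# With '@' struct uses the native C size: 4 on ILP32, 8 on LP64.
NATIVE_LONG = struct.calcsize("@l")
```

The fixed-width entries are 4, 8 and 1 bytes everywhere, while NATIVE_LONG changes with the data model, which is exactly the kind of difference a port to a 64bit platform trips over.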
Thursday, February 13, 2003 16:28 // Aros Congress Center, Västerås, Sweden

by Jens Ole Hald of Hanstholm City
Another city switching away from MS Office: Hanstholm in Denmark. Jens Ole Hald of the city IT department tells us how and why they did it.
A testing group of 15 users evaluated StarOffice for 2 weeks in spring 2002. Since November 2002 there are 300 employees working with StarOffice and OpenOffice.
Most problems were with reading Microsoft formats, but even those were minor and have mostly been fixed in the meantime. Some documents need minor reformatting when opened for the first time, but this is not really a problem. Internal problems with StarOffice were not found.
Users got a 3 hour up-lift course for StarOffice to make them ready for the new tool.
At the moment the workstations are still running on Windows, but they are looking at moving over to Linux.
On the server side they want to stay with Novell. Quote: "You have to know and do a lot to make a Windows or Unix box secure. About as much as you have to do and know in order to make a Novell box insecure."
To ease the transition for the Users, the local admins have produced templates and some custom icons and menus mimicking MS Office.
Reasons for changing
The reason for changing was primarily Microsoft's new, more expensive licensing scheme.
Hanstholm was already (or still) using terminal based programs on IBM mainframes and Unix servers. They were mainly using Word and Excel from the Office suite.
Unix was already deployed on the server in certain areas like Web and Proxy Servers.
Initiating the Transition
In summer all employees were invited to a presentation where the head of the city's administration introduced the new application and also made it clear that the decision to move to OpenOffice had been taken and could not be changed. This set the tone, so that the acceptance of the new program was very good and people were mostly interested in learning how to use the product and not in discussing whether they wanted to use it.
Problems
Users who were very experienced with MS Excel had the most problems with the transition, as things in the OpenOffice spreadsheet work slightly differently. But then again it is probably mostly due to them not really accepting the change yet. They will now get a special 1 week introduction to OpenOffice.
Thursday, February 13, 2003 17:16 // Aros Congress Center, Västerås, Sweden

IBM has setup a special group concerned with improving the performance of Java. Robert F. Berry of IBM tells us of their efforts.
JVM innovation is mainly driven by performance enhancements. It started out on the client side, but today Java is really big on the server side.
Java performance on a specific hardware has developed into a major selling point.
Performance Improvements
In the memory management area, an enhanced fully threaded Mark/Sweep/Compact algorithm was developed which uses system idle time for marking and does incremental compaction.
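For readers not fluent in garbage collection, the basic mark and sweep idea can be sketched in a few lines. This is a toy model and not IBM's threaded, incremental algorithm: the heap is a plain list of object names, the references are a made-up dictionary, and compaction is reduced to keeping the survivors packed together in a new list.

```python
def mark(roots, references):
    """Return the set of objects reachable from `roots`."""
    marked = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj in marked:
            continue
        marked.add(obj)
        # Follow every reference held by this object.
        stack.extend(references.get(obj, ()))
    return marked


def sweep(heap, marked):
    """Keep only marked objects, preserving their order in the heap."""
    return [obj for obj in heap if obj in marked]


def collect(heap, roots, references):
    """Mark from the roots, then sweep; returns the surviving heap."""
    return sweep(heap, mark(roots, references))
```

With a heap of a, b, c and d, where a references b and where c and d only reference each other, collecting from the root a keeps a and b and drops the unreachable cycle.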
IBM's Just-in-Time compiler (JIT) uses an aggressive in-lining technique which gives the JIT much more code to look at and optimize. Object allocations can be improved by static analysis of their locality; such objects can then possibly be allocated on the stack, which also saves synchronization time.
Restarting a JVM is expensive, but from a transaction isolation point of view it is a useful concept. To make this a viable solution, a JVM start and clean mechanism has been developed where several JVMs share part of their environment. The startup time for an additional JVM has been reduced by about an order of magnitude.
Future Work
Footprint Size
Very Large Heaps > 500 GB
Very Large Systems (n-Way Servers)
Object Pooling (e.g. Jakarta Commons)
Improve decimal arithmetic for banking transactions
Improve performance on XML and XSLT workloads for Webservices
Conclusion
I find it rather hard to write a report on a topic I am not really fluent in :)
Content © by Tobias Oetiker