Over the last couple of years I have done a lot of benchmarking and performance evaluation work. When a member of my team comes up with a good idea or drags some new technology into the office, the first remark usually is "How does it perform, and will it scale?"
Today I will be ranting about one of my personal favourites: benchmarking in general and disk/storage subsystem benchmarking in particular. I will discuss some of the tools I have used and give some insight into our methodologies.
When you are benchmarking you can either be general or specific. General benchmarking is synthetic: you expose your storage subsystem to a variety of tests and workloads in order to measure what kind of throughput and latency you can expect. Specific benchmarking closely models the actual behaviour of whatever application you will run off your storage subsystem, and is thus more an act of performance evaluation than of simple benchmarking.
Be careful about what it is you are testing and look very closely at the numbers your tests produce. If they do not seem right, they usually are not. For example, people often end up unknowingly benchmarking the various caching layers (OS, controller, disk, etc.) instead of the raw disk performance, which is not always what you want. Disable as much as you can, rerun your tests to get an idea of the worst-case scenario, and then turn the layers back on one by one; this will give you a good impression of the impact each individual layer has. Reboot between runs to make sure that each run gets a "clean" machine. Make sure that your benchmark is not artificially constrained: look out for CPU starvation, high numbers of context switches, interrupts and such, to ensure that your benchmarking tool can drive the workload all the way. I have seen lots of examples where people accidentally end up benchmarking their benchmark machine instead of the actual target. If you can, test the different filesystems that your OS has to offer; you may be surprised by the difference the filesystem makes to your application performance.
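As a minimal sketch of that approach on a Linux load generator (the device name and knobs below are placeholders, and the details differ per OS and controller):

    # Flush dirty data and drop the OS page cache between runs (Linux-specific)
    sync && echo 3 > /proc/sys/vm/drop_caches

    # Check whether the disk's write cache is currently enabled (SATA example)
    hdparm -W /dev/sdX

    # While the benchmark runs, keep an eye on the load generator itself:
    # "cs" = context switches, "in" = interrupts, "id" = idle CPU
    vmstat 1

If vmstat shows your generator pegged at 0% idle, you are measuring your benchmark machine, not the target.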
And remember, it can be really fun to torture hardware.
Choose your tools wisely and use at least two different ones so you can compare the results. And no, dd is probably not the right tool unless you really want to test single-threaded, sequential I/O behaviour. And keep in mind that capacity problems are not performance problems.
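To illustrate what dd actually measures, here is a typical run (target path and sizes are made up):

    # One thread, one stream, purely sequential: 4 GB written in 1 MB blocks,
    # bypassing the page cache via O_DIRECT (GNU dd)
    dd if=/dev/zero of=/mnt/test/ddfile bs=1M count=4096 oflag=direct

Fine as a sanity check, but it tells you nothing about random or concurrent I/O.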
I normally use Iozone and iostat for general benchmarking. Over the years we have put together a fairly detailed test set that gives us a good overview of what any given storage subsystem can deliver. We iterate a number of times over the same test set, adjusting parameters like the type of I/O operation, block size, threads, processes, target file sizes, synchronous versus asynchronous I/O, and so on. I can only recommend that you visualise your results; it is far easier for humans to understand colours and lines than pages full of numbers. Save your test results: they can serve as baselines and be invaluable in later storage-related problem determination.
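As an example of the kind of iteration I mean (the flags and sizes below are illustrative, not our actual test set):

    # Iozone throughput mode: 8 threads, a 2 GB file per thread, 64 KB records,
    # sequential write/read (-i 0 -i 1) plus random read/write (-i 2),
    # fsync included in the timing (-e), results collected into an Excel report
    iozone -t 8 -s 2g -r 64k -i 0 -i 1 -i 2 -e -R -b results.xls

    # Meanwhile, in a second terminal, per-device latency and utilisation:
    iostat -xnz 1    # Solaris; on Linux, iostat -x 1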
This is where it gets interesting. Specific benchmarking requires intimate knowledge of the application you are evaluating your storage subsystem for. Usually there are other subsystems involved, such as the networking layer, as well as storage protocols like NFS and/or iSCSI. They should all undergo similar testing; if I find the time I will write about them as well. Throw in the inner workings of your OS and you have yourself a party. A full-stack test is the most challenging thing to get right.
There are different ways of obtaining this information, depending on the application. You could Google, read the source code, or do the strace dance to observe the syscalls it generates and the flags it uses. Reading the source code can be overwhelming for larger applications, and strace adds significant overhead, making it hard to subject the application to real-world workloads and still obtain valid results. Google can provide clues but more often than not reveals little substance and much misinformation.
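For the strace dance, something along these lines (the PID is a placeholder):

    # Summarise syscall counts and time spent for a running process
    strace -c -f -p 1234

    # Or watch only the I/O-related calls and the flags they pass
    strace -f -e trace=open,read,write,pread64,pwrite64,fsync -p 1234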
Enter DTrace. DTrace is a dynamic tracing framework invented by a few engineers at Sun Microsystems. It enables you to trace each and every aspect of a running system, including all kernel and userland function calls, and it does so with very little overhead. I am absolutely in love with DTrace and cannot imagine living without it. I use it for almost everything: disks, filesystems, network, storage protocols, thread performance and so on. I could rant for hours about it, but I will reserve that for a future post. DTrace has been ported to FreeBSD, NetBSD, Mac OS X and Linux, so chances are that it is already available to you. If not, no worries. You can always run the application on Solaris for the single purpose of obtaining profiling information with DTrace … lots of people do. Running stuff like VMware or Xen against the Solaris iSCSI or NFS stack can reveal very interesting information that is otherwise hard to come by.
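To give you a taste, two one-liners of the kind I use all the time (the execname "myapp" is a placeholder):

    # Which syscalls does the application issue, and how often?
    dtrace -n 'syscall:::entry /execname == "myapp"/ { @[probefunc] = count(); }'

    # Distribution of write sizes; arg2 of write(2) is the byte count
    dtrace -n 'syscall::write:entry /execname == "myapp"/ { @["bytes"] = quantize(arg2); }'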
I might as well use the chance to promote the new DTrace book that Brendan Gregg and Jim Mauro have been cooking up. It is wonderful and worth every penny.
Now, once you have collected all the I/O-related information you need, what do you do? You need to somehow emulate the application behaviour in order to run your tests. Enter Filebench. Filebench is an extremely powerful and flexible benchmarking tool that includes its own scripting language, which allows you to emulate the actual application. It also comes with a set of predefined micro- and macro-benchmarks that resemble the more typical workloads. And yes, they can also be customised.
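A minimal session with one of the canned workloads looks roughly like this (the target directory is a placeholder; syntax as in the Filebench versions I have used):

    $ filebench
    filebench> load fileserver
    filebench> set $dir=/storage/test
    filebench> run 60

The fileserver workload emulates a mix of creates, writes, reads and deletes across many files; run drives it for the given number of seconds.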
Filebench can also be used to emulate client-server behaviour, which is useful for testing large storage systems and/or protocol overhead. I have used this a couple of times to test and benchmark 10 Gbit iSCSI and NFS behaviour. Having a single master and an arbitrary number of clients driving the workload is awesome …
Filebench is a lot of fun to play with.
The methodologies I laid out at the beginning of this post also apply here, so be careful, think twice before you test, and be critical of the results.
When you are done testing, measuring and tweaking the subsystems that are relevant to your application, you should always do a full load test. No emulation can beat the real thing. The downside is that it can sometimes send you back to the drawing board …