The art of benchmarking and performance evaluations

Over the last couple of years I have done a lot of benchmarking and performance evaluation work. When a member of my team comes up with a good idea or drags some new technology into the office, the first remark usually is: “How does it perform and will it scale?”

Today I will be ranting about one of my personal favourites: benchmarking in general and disk/storage subsystem benchmarking in particular. I will discuss some of the tools I have used and also give some insight into our methodologies.

When you are benchmarking you can either be general or specific. General benchmarking is synthetic, meaning that you expose your storage subsystem to a variety of tests and workloads in order to measure what kind of throughput and latency you can expect. Specific benchmarking closely models the actual behaviour of whatever application you will run off your storage subsystem and is thus more an act of performance evaluation than simple benchmarking.

Be careful about what it is you are testing and look very closely at the numbers that your tests produce. If they do not seem right, they usually are not. For example, people often end up unknowingly benchmarking various caching layers (OS, controller, disk, etc.) instead of the actual raw disk performance, which is not always what you want. Disable as much as you can, rerun your tests to get an idea of the worst-case scenario and then turn them back on one by one; this will give you a good impression of what impact the individual layers have. Reboot between the different runs to make sure that each run gets a “clean” machine to run on. Make sure that your benchmark is not artificially constrained; look out for CPU starvation, high numbers of context switches, interrupts and such to ensure that your benchmarking tool can drive the workload all the way. I have seen lots of examples where people accidentally end up benchmarking their benchmark machine instead of the actual target. If you can, test the different filesystems that your OS has to offer; you may be surprised by the difference the filesystem can make for your application performance.
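To make the “benchmarking your benchmark machine” trap concrete: while a test runs I keep a second terminal open on the load generator. A minimal sketch, assuming a Linux box (the Solaris equivalents are noted in the comments):

    # watch the load generator itself while the benchmark runs
    vmstat 1        # r = run queue, cs = context switches, us/sy/id = CPU split
    iostat -x 5     # per-device utilisation, queue sizes and wait times
    # on Solaris: mpstat 1 and iostat -xnz 5 tell the same story

If the CPU is pegged or the context switches explode on the driving machine, you are measuring the wrong box.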

And remember, it can be really fun to torture hardware.

Choose your tools wisely and use at least two different products in order to compare the results. And no, dd is probably not the right tool unless you really want to test single-threaded, sequential I/O behaviour. And keep in mind that capacity problems are not performance problems.
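For what it is worth, this is about all dd will tell you: the speed of one sequential stream. A sketch assuming GNU dd on Linux, with the target path as a placeholder (oflag=direct/iflag=direct bypass the page cache so you actually hit the disk):

    # single-threaded sequential write, page cache bypassed
    dd if=/dev/zero of=/mnt/test/ddfile bs=1M count=4096 oflag=direct
    # drop caches, then read it back the same way
    echo 3 > /proc/sys/vm/drop_caches
    dd if=/mnt/test/ddfile of=/dev/null bs=1M iflag=direct

Useful as a sanity check, but it says nothing about random or concurrent I/O.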

General benchmarking

I normally use Iozone and iostat for general benchmarking. Over the years we have compiled a fairly detailed test set that provides us with a good overview of what any given storage subsystem can provide. We iterate a number of times over the same test set, adjusting parameters like the type of I/O operation, block size, threads, processes, target file sizes, synchronous and asynchronous I/O and so on. I can only recommend that you visualise your results; it is way easier for humans to understand colours and lines than pages full of numbers. Save your test results; they can serve as baselines and be invaluable in later storage-related problem determination.
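To give an idea of what one iteration looks like, here is a cut-down sketch; the Iozone flags are standard, but the sizes, record lengths and thread counts are placeholders you would sweep over:

    # throughput mode: 4 threads, 8k records, 4g files
    # -i 0/1/2 = write/rewrite, read/reread, random read/write
    # -e includes fsync in the timings, -I uses O_DIRECT to dodge the page cache
    iozone -i 0 -i 1 -i 2 -r 8k -s 4g -t 4 -e -I
    # capture device-level numbers alongside the run
    iostat -x 5 > iostat.log &
    # auto mode with spreadsheet output, handy for graphing
    iozone -a -g 4g -Rb results.xls

The -Rb output is what we feed into our graphs to get those colours and lines.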

Specific benchmarking

This is where it gets interesting. Specific benchmarking requires intimate knowledge of the application that you are evaluating your storage subsystem for. Usually there are other subsystems involved, such as the networking layer, as well as storage protocols like NFS and/or iSCSI. They should all undergo similar testing. If I find the time I will write about them as well. Throw in the inner workings of your OS and you have yourself a party. A full-stack test is the most challenging thing to pull off.

There are different ways of obtaining this information, depending on the application. You could Google, read the source code, or do the strace dance to observe the syscalls it generates and which flags it uses. Reading the source code can be overwhelming for larger applications, and strace adds significant overhead, making it hard to subject the application to real-world workloads and still obtain valid results. Google can provide clues but more often than not reveals little substance and much misinformation.
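If you go the strace route anyway, something like this is usually enough to see the I/O pattern and the open flags; a sketch assuming Linux strace, with the PID as a placeholder:

    # follow forks, timestamp each call, show only the I/O-related syscalls
    strace -f -tt -e trace=open,openat,read,write,fsync,fdatasync -p 1234 2> app.strace
    # or just count syscalls for a cheap overview of the I/O mix
    strace -c -f -p 1234

Just remember that the traced run is slower than the real thing, so treat the timings with suspicion.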

Enter DTrace. DTrace is a dynamic tracing framework that was invented by a few engineers at Sun Microsystems. DTrace enables you to trace each and every aspect of a running system, including all kernel and userland function calls, and it does so with extremely little overhead. I am absolutely in love with DTrace and cannot imagine living without it. I use DTrace for almost everything: disks, filesystems, network, storage protocols, thread performance and so on. I could rant for hours about it but I will reserve that for a future post. DTrace has been ported to FreeBSD, NetBSD, Mac OS X and Linux, so chances are that it is already available to you. If not, no worries. You can always run the applications on Solaris for the single purpose of obtaining profiling information with DTrace … lots of people do. Running stuff like VMware or Xen against the Solaris iSCSI or NFS stack can reveal very interesting information that otherwise is hard to come by.
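As a taste, two classic one-liners built on the standard io and syscall providers (nothing here is specific to my setup):

    # histogram of block I/O sizes, broken down by process name
    dtrace -n 'io:::start { @[execname] = quantize(args[0]->b_bcount); }'
    # which files is the application opening, and from which process?
    dtrace -n 'syscall::open*:entry { printf("%s %s", execname, copyinstr(arg0)); }'

A few lines like these answer questions that would otherwise take hours of source reading or strace archaeology.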

I might as well use the chance to promote the new DTrace book that Brendan Gregg and Jim Mauro have been cooking up. It's wonderful and worth every penny.

Now, once you have collected all the I/O-related information you need, what do you do? You need to somehow emulate the application behaviour in order to run your tests. Enter filebench. Filebench is an extremely powerful and flexible benchmarking tool that includes its own scripting language which allows you to emulate the actual application. It also comes with a set of predefined micro and macro benchmarks which resemble the more typical workloads. And yes, they can also be customised.
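To give a flavour of the language, here is a minimal workload file; a sketch modelled on the stock workloads that ship with filebench (such as varmail.f), with the path, sizes and instance counts as placeholders:

    # sixteen threads reading whole files from a preallocated fileset
    cat > readtest.f <<'EOF'
    set $dir=/mnt/test
    define fileset name=testfiles,path=$dir,size=1m,entries=1000,prealloc
    define process name=reader,instances=1
    {
      thread name=readthread,memsize=10m,instances=16
      {
        flowop openfile name=open1,filesetname=testfiles,fd=1
        flowop readwholefile name=read1,fd=1,iosize=8k
        flowop closefile name=close1,fd=1
      }
    }
    run 60
    EOF
    filebench -f readtest.f

Swap the flowops around and you can model an application's read/write mix, fsync habits and working set surprisingly closely.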

Vdbench is in the same boat as filebench and it may be wise to use both.
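Vdbench is driven by a similar parameter file. A minimal sketch, with the device path, read percentage and run length as placeholders:

    # raw device, 8k transfers, 70% reads, fully random
    cat > paramfile <<'EOF'
    sd=sd1,lun=/dev/rdsk/c0t0d0s0
    wd=wd1,sd=sd1,xfersize=8k,rdpct=70,seekpct=100
    rd=run1,wd=wd1,iorate=max,elapsed=60,interval=5
    EOF
    ./vdbench -f paramfile

Running the same workload through both tools is a cheap way to cross-check your numbers.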

Filebench can be used to emulate client-server behaviour, useful for testing large storage systems and/or protocol overhead. I have used this a couple of times to test and benchmark 10Gbit iSCSI and NFS behaviour. Having a single master and an arbitrary number of clients that drive the workload is awesome …

Filebench is a lot of fun to play with.

The methodologies that I stated at the beginning of this post also apply here, so be careful, think twice before you test and be critical of the results.

When you are done testing, measuring and tweaking the subsystems that are relevant to your application you should always do a full load-test. No emulation can beat the real thing. The bad thing is that it can sometimes send you back to the drawing board …
