Large scale disk-to-disk backups using Bacula

Over the past year I have been deeply involved in the nitty-gritty details of choosing, designing, building, deploying and managing a new backup infrastructure at work. It has been a very educational experience.

Our old backup platform consisted of various tools and technologies, and the resulting spaghetti bowl got more and more difficult and time consuming to manage. So we set out to find a suitable replacement: something that was both manageable and scalable while providing optimal data protection for us and our customers.

As you may have guessed from the title of this post, we chose Bacula. After building a demo setup and having a 2-day onsite proof-of-concept, we closed a deal with Bacula Systems in early 2010.

We did investigate other products (IBM TSM, Symantec NetBackup and CommVault Simpana), but they could not compete with what Bacula Systems could deliver.

Our Bacula infrastructure is somewhat extensive and consists of 9 servers and over 25 disk arrays. All our backups go offsite; the uplink between our primary datacenter and our backup location consists of 3 x 10 Gb fiber links. Backup data is stored on 3 large ZFS storage pools, spanning 375 individual disks.
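I won't go into our exact vdev layout here, but to give a feel for how such a pool is put together, it could be built along these lines (the raidz2 geometry, disk names and pool name below are illustrative only, not our actual configuration):

```sh
# Hypothetical layout -- two 12-disk raidz2 vdevs plus a pair of hot
# spares, using Solaris-style device names
zpool create backup1 \
    raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
           c1t6d0 c1t7d0 c1t8d0 c1t9d0 c1t10d0 c1t11d0 \
    raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 \
           c2t6d0 c2t7d0 c2t8d0 c2t9d0 c2t10d0 c2t11d0 \
    spare c3t0d0 c3t1d0

# One filesystem per Bacula storage daemon, with compression enabled
zfs create -o compression=on backup1/bacula-sd1
```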

Let me cough up some facts:

We are currently backing up close to a four-digit number of machines running a multitude of operating systems (Windows, Linux, Solaris, BSD), plus over 300 MS-SQL instances. In terms of size, a standard 2-week rotation amounts to ~190 TB of data containing ~490,000,000 files.

In daily use we are seeing network-to-disk speeds in excess of 600 MB/s per Bacula storage server, totalling ~2 GB/s of aggregate throughput when running 300 simultaneous jobs, while maintaining an acceptable level of CPU utilisation on our storage servers.
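For the curious: sustaining that many parallel jobs is mostly a matter of raising Bacula's concurrency limits in both the Director and the Storage Daemons. A minimal sketch with hypothetical resource names, paths and values (not our production settings):

```
# bacula-dir.conf -- the Director caps the total number of parallel jobs
Director {
  Name = backup-dir
  QueryFile = "/etc/bacula/query.sql"
  Working Directory = "/var/bacula"
  Pid Directory = "/var/run"
  Password = "dir-secret"
  Messages = Daemon
  Maximum Concurrent Jobs = 300
}

# bacula-sd.conf -- each Storage Daemon also needs a matching limit
Storage {
  Name = zfs-sd1
  Working Directory = "/var/bacula"
  Pid Directory = "/var/run"
  Maximum Concurrent Jobs = 100
}
```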

So far we have run 96,594 jobs. Over 90% of all jobs have run without issue, and of the remaining jobs only 3 (!) failed because of an error in Bacula itself; the rest failed due to operator or client-side errors.

In order to facilitate the backup of MS-SQL instances we wrote our own tool, which Bacula starts before running the file-level backup. This also enabled us to apply compression before storing the databases on disk for Bacula to pick up. The average compression ratio is well over 80%, so it has been worth the effort.
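The hook itself is plumbed in with Bacula's ClientRunBeforeJob directive, which makes the File Daemon run our tool on the SQL server before the file-level backup starts (and fails the job early if the tool exits non-zero). A minimal sketch; the script path, resource names and dump directory are hypothetical, and the dump tool itself is our in-house code:

```
# bacula-dir.conf -- illustrative names only
Job {
  Name = "sqlserver01-mssql"
  Type = Backup
  Client = sqlserver01-fd
  FileSet = "mssql-dump-area"
  Schedule = "Nightly"
  Storage = zfs-sd1
  Pool = TwoWeekRotation
  Messages = Standard
  # Run the dump tool on the client before the file-level backup
  ClientRunBeforeJob = "C:/bacula/scripts/dump-mssql.cmd"
}

FileSet {
  Name = "mssql-dump-area"
  Include {
    Options { signature = MD5 }
    # Directory where the dump tool leaves the compressed database dumps
    File = "D:/bacula-dumps"
  }
}
```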

We spend a mere ~30 minutes a day overseeing all Bacula operations, including checking for and restarting all failed jobs. Integration of Bacula into our control panel, billing system, etc. has been a very straightforward process and consisted mainly of writing a simple XML API against our Bacula Catalog servers.
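To give an idea of just how simple: the Bacula Catalog is a plain SQL database, so serving job status as XML is little more than a query plus serialisation. Below is a minimal Python sketch along those lines, assuming a MySQL catalog; the host name, credentials and function are my own placeholders, but the Job and Client tables are part of Bacula's standard catalog schema:

```python
"""Minimal sketch of an XML job feed built on the Bacula catalog.

The Job and Client tables (JobStatus, JobBytes, JobFiles, ...) are
Bacula's standard schema; host name, credentials and the function
itself are hypothetical placeholders.
"""
import xml.etree.ElementTree as ET

import pymysql  # assumption: the catalog runs on MySQL


def jobs_as_xml(client_name, limit=20):
    """Return the most recent jobs for one client as an XML document."""
    conn = pymysql.connect(host="catalog01", user="bacula_ro",
                           password="secret", database="bacula")
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT Job.JobId, Job.Name, Job.JobStatus, Job.JobBytes,"
                "       Job.JobFiles, Job.StartTime, Job.EndTime"
                "  FROM Job JOIN Client ON Client.ClientId = Job.ClientId"
                " WHERE Client.Name = %s"
                " ORDER BY Job.StartTime DESC LIMIT %s",
                (client_name, limit))
            root = ET.Element("jobs", client=client_name)
            for jobid, name, status, nbytes, nfiles, start, end in cur:
                job = ET.SubElement(root, "job", id=str(jobid))
                ET.SubElement(job, "name").text = name
                # JobStatus is a one-letter code: T = OK, E/f = failed
                ET.SubElement(job, "status").text = status
                ET.SubElement(job, "bytes").text = str(nbytes)
                ET.SubElement(job, "files").text = str(nfiles)
                ET.SubElement(job, "start").text = str(start)
                ET.SubElement(job, "end").text = str(end)
            return ET.tostring(root)
    finally:
        conn.close()
```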

Admittedly, Bacula lacks some of the bells and whistles of its competition, but for us that has not been a problem. It is reliable, scalable and makes overall management a breeze. The advice and support that we have received from Bacula Systems has been more than worth the money we spent.

IBM, Symantec and CommVault should start looking over their shoulders … the bat is coming to get them!

Update

Part II of the series can be found here.
