Large scale disk-to-disk backups using Bacula, Part VI

This is going to be my last post in the series. There are a few loose ends to tie up and some more questions to answer. I’ll also explain some of the missing pieces of our puzzle.

Our Bacula deployment is actually really simple. We are only using the most basic features that Bacula has to offer. It has a ton of advanced functionality, so please bear in mind that we are currently only using the basics. I know some rather small deployments that are far more complex than ours. What sets us apart from most others is sheer volume.

If nothing else, my series of posts should be seen as a testimony to the power, performance and scalability that those basics provide — not as an example of what is possible in terms of functionality.

Some people have questioned my motivation for writing these posts, insinuating that I am in some way affiliated with Bacula Systems or perhaps am a “hired gun”. Well, I am neither. I work for a Bacula Systems customer where I have been chief bat, and my fascination with what Bacula has to offer in comparison with its competition is my sole reason for sharing my experiences.

How many enterprise software vendors do you know where you can drop an email to the lead developer and get an authoritative answer within a few hours? How many times have you stood with a critical problem at hand and battled imbecile first-level support drones? How many truly enterprise-grade backup systems are open source, with all the goodness that comes with it?

We have a lot of support contracts at work, including some with the largest software companies in existence, and very few match the level of skill, understanding, support and general helpfulness that we have gotten from Bacula Systems. The company may be small but they rock when it comes to backup in general and Bacula in particular. Period.

I have also been asked why I did not simply post my paper instead of blogging bits and pieces at a time. Well, the paper was written during work hours so I would need permission from my employer to do so … and since I have resigned my position with the company, that would most likely have gotten me nowhere. The blog posts were all written in my spare time and I have been cautious not to release too much specific information, such as entire config files, test results, graphs, and so on, to avoid possible confrontations.

But let’s get on with the interesting stuff …

Now that we have completed the migration to Bacula and have seen it work very reliably for a couple of months, we are ready to move forward and embrace the more advanced features it has to offer. Our backup storage needs have an annual growth rate of ~25%, so something has to be done if we want to avoid drowning in disk arrays. The new 48-core Dell R815 storage servers that will replace our current storage heads have enough excess horsepower to allow us to enable compression on our ZFS filesystems. Testing has shown that good results can be achieved, but the results vary from host to host. There will most definitely be changes in our schedule since there really is no need to do weekly full backups any longer.
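To give an idea of what that looks like in practice, here is a minimal sketch of how compression can be enabled and evaluated on a ZFS dataset. The pool and dataset names are hypothetical, and gzip-6 is just one possible algorithm; as noted above, the achievable ratio varies from host to host:

    # Enable compression on the dataset that holds the Bacula volumes
    # (gzip-6 trades CPU for better ratios; compression=on picks the default algorithm)
    zfs set compression=gzip-6 tank/bacula

    # Only data written after the property is set gets compressed,
    # so check the achieved ratio after a few backup cycles
    zfs get compression,compressratio tank/bacula

The schedule change follows the same logic: dropping weekly fulls is a plain change to the Schedule resource. A hypothetical sketch of a monthly cycle:

    Schedule {
      Name = "MonthlyCycle"                         # hypothetical name
      Run = Level=Full 1st sun at 23:05             # monthly full instead of weekly
      Run = Level=Differential 2nd-5th sun at 23:05
      Run = Level=Incremental mon-sat at 23:05
    }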

There have been some questions from readers about restore speed. Well, a bare metal restore of a Windows 2003 DC with 50 GB of data takes less than an hour from start to finish. A Windows 2008 DC takes slightly longer. Single file restores take minutes. We have a special 10 Gbit drop which is reserved for restore purposes should the need arise to restore a really fat client. And yes, our storage servers can deliver that speed.
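For reference, a restore like that is kicked off with a single bconsole command. A minimal sketch with a hypothetical client name; the keywords “select current all done” mark every file from the most recent backup without walking the file tree interactively:

    # inside bconsole: restore the latest backup of one client in one go
    restore client=dc01-fd select current all done yes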

Others have asked what we use for monitoring. We use Webacula and our own control panel to do all Bacula monitoring. The main reason for choosing Webacula over Bweb was the simple fact that we have a lot more people in house who know PHP than Perl.

Monitoring is done by our Windows team. The majority of the servers we have are running Windows so it seemed a logical course of action. It is still a mystery to me why Windows VSS can work flawlessly one day and then fail for obscure reasons the next. In practice this approach has worked well. Even though the Windows team only has very limited knowledge of UNIX / Linux, they have no problems adding new clients, modifying configuration files, monitoring jobs and fixing most problems by themselves. Heck, they even prefer bconsole over Webacula for certain tasks.

The only drawback so far is that Windows people after a few years apparently lose the ability to read and understand more than a couple of lines of text, which can be a challenge when the answer to your question is “buried” below a whole 70-line job report.

I have also been asked if we were missing stuff in Bacula and the answer is a resounding yes.

Off the top of my head:

Deduplication

We really, really, really need a deduplication-friendly volume format in Bacula, and preferably one that works well with ZFS. The potential savings in both space and speed (yes, dedup can act as an I/O accelerator) are enormous in installations as large as ours.

Yes, I know about base jobs, and we have some very good reasons to prefer doing dedup directly on our storage servers instead.
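For the curious, this is roughly what dedup on the storage server side looks like; a minimal sketch with a hypothetical pool name. The simulation step is worth running first, since the ZFS dedup table lives in RAM and a poor ratio just burns memory:

    # Simulate deduplication on existing data and print the projected ratio
    zdb -S tank

    # If the projected ratio justifies the RAM cost, enable dedup on the dataset
    zfs set dedup=on tank/bacula

As the volume format stands, though, the per-block headers described in the comments below keep the ratio close to 1, which is exactly why a dedup-friendly format matters.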

SD reload

Our setup requires us to add devices to our storage servers all the time. Currently there is no way to simply reload the storage daemon, and restarting it can be disruptive. We have worked our way around it but it is a hack. A proper implementation would be preferable.
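To illustrate the pain: every new device is another resource in bacula-sd.conf along these lines (the name and path here are hypothetical), and picking it up currently means restarting bacula-sd, interrupting any jobs that happen to be writing:

    Device {
      Name = File-client42                    # hypothetical device name
      Media Type = File
      Archive Device = /tank/bacula/client42  # hypothetical path
      LabelMedia = yes
      Random Access = yes
      AutomaticMount = yes
      RemovableMedia = no
      AlwaysOpen = no
    }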

Increment the FD non-fatal error counter from a runscript

It is often desirable to indicate that some runscript (SQL backups, etc.) has failed without either failing the entire job or resorting to grep’ing for certain output in the job logs. I would love to have the ability to increment the non-fatal FD error counter from a runscript to indicate that an error has occurred, for example by using a predetermined exit code. This would further ease our job management.
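Today the choice is binary, as this hypothetical job fragment shows: either a failing script fails the whole backup, or the failure only surfaces as text in the job log. An exit code that merely incremented the non-fatal error counter would sit exactly between the two:

    Job {
      Name = "sql-backup"        # hypothetical job; other directives omitted
      # ...
      RunScript {
        RunsWhen = Before
        RunsOnClient = yes
        FailJobOnError = no      # yes = fail the entire job; no = grep the log and hope
        Command = "/usr/local/bin/dump_sql.sh"   # hypothetical dump script
      }
    }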

All in all we are more than happy with Bacula and Bacula Systems. We ended up with a very solid and scalable platform that was surprisingly easy to integrate into our existing business tools and outperformed our expectations. But most importantly, it solved the problem at hand to our and our customers’ satisfaction. Saving a six-digit amount of euros was not exactly bad, either :-)

10 Comments

  • Howard Thomson wrote:

    Hi Henrik,
    Could you expand a bit on the de-duplication issues? You mention the Volume format, and I made the point at the Bacula Developer Conference that a new Volume format could be made more de-dup friendly to the F/S on which disk Volumes are stored, particularly for data alignment.

    I am at the planning stages for re-implementing the algorithms used in the ‘bup’ package, providing a disk-only Bacula-compatible SD. De-dup is my primary difficulty with Bacula also, although on a MUCH smaller scale!
    Regards, Howard

  • IIRC Bacula adds several headers containing job-specific information to each and every data block in a volume file. The result of this is that almost no two blocks are the same, thus making the current volume format unusable for block-based deduplication.

    I see why this is necessary for the way Bacula currently operates and I do understand that changing this would be very disruptive. A new and dedup-friendly volume format would probably be easier to implement.

  • Howard Thomson wrote:

    Yes, the current volume format is strictly serial. A dedup-friendly format would provide, at least, for accumulating metadata and tail fragments in a much larger header block, followed by block-aligned full data blocks, rinse and repeat … At least this would be much better for block-level dedup by the f/s on which disk volumes are stored.

  • > The only drawback so far is that Windows people after a few years apparently lose the ability to read and understand more than a couple of lines of text …

    That’s right.

  • Webacula news
    * Version 5.5 coming soon!
    * Bacula and Webacula ACLs (Access Control Lists) implemented.

    In other words, users can now see and manage only their own Jobs, Pools, etc.

  • Sweet! I have been waiting for an excuse to play with ACLs in Bacula ;-)

  • Pasi Kärkkäinen wrote:

    What kind of CPU load do you have on the Director servers when running 100 concurrent jobs?

    What was the reason to go for 3 separate Directors?

    That would be interesting to know :)

  • The CPUs on the directors are less than 10% utilised when running 100 concurrent jobs. We wanted a clear separation of duties, meaning that one director “owned” its own dedicated catalog and storage server, and since we needed 3 storage servers we also needed 3 directors. Furthermore they are divided by function; 2 directors for Windows clients and 1 director for Linux / UNIX clients …

  • Dan wrote:

    Can you please elaborate on your decision to use a one-job-per-volume storage method? Nine months later, how is that working out for you? I’m having difficulty grasping the concept since it’s not the traditional route, but I would be very interested in hearing more about this.

    Excellent writeup. Though I was looking for articles that had more technical value, your business value is just as — well — valuable.

  • Hello Dan,

    Well, the one-job-per-volume-per-client approach was chosen because it was significantly faster in our benchmarks (restoring files and writing backup data to disk).

    Another nice aspect is that you keep your volumes at a manageable level in terms of file size — no fun in poking around in a 3 TB volume file by hand should your director / storage daemon go down when you need it the most, or doing rsyncs to secondary storage.
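    For readers who want to try this, the effect is commonly achieved with per-client pools capped at one job per volume. A hypothetical sketch of such a Pool resource; the names and retention are illustrative, not our production values:

        Pool {
          Name = client42-pool            # hypothetical per-client pool
          Pool Type = Backup
          Maximum Volume Jobs = 1         # one job per volume
          Label Format = "client42-"      # auto-label volumes per client
          Recycle = yes
          AutoPrune = yes
          Volume Retention = 30 days      # illustrative retention
        }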
