Last year, EMC's somewhat controversial acquisition of Data Domain right under the noses of NetApp raised more than a few eyebrows. Considering the reported price of $2.1 billion and EMC's already deduplication-packed portfolio, which consisted of the source-based Avamar, the file-level deduplication/compression of its Celerra filer and its Quantum-integrated dedupe VTLs, some were left scratching their heads as to what exactly was the big deal about Data Domain's target-based deduplication solution. Almost a year on, with Data Domain's DD880 being adopted by an ever-growing customer base, the head-scratching has stopped and close attention is being paid to what is probably the most significant advancement in backup technology of the last decade.
With deduplication currently all the rage, overshadowed possibly only by 'Cloud Computing', its benefits are fast becoming a necessity for backup and storage architects. With most backup software producing copious amounts of duplicate data stored in multiple locations, deduplication offers the ability to eliminate those redundancies and hence use less storage and less bandwidth for backups, shrinking backup windows in the process. Among the source-based and file-level offerings, it is Data Domain's target-based solution, i.e. the big black box, that is clearly taking the lead and producing the big percentages in terms of data reduction. So what exactly is so amazing about the Data Domain solution when, upon an initial glance at the DD880 model for example, all one can see is just a big black box? Even installing one of the Data Domain boxes hardly requires much brainpower beyond the assignment of an IP address and a bit of cabling. As for the GUI, one could easily forget about it: the point of the 'big black box' is that you just leave it there to do its thing, and sure enough it does its thing.
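To make that concrete, here is a minimal sketch of what a deduplicating backup target does under the hood: incoming data is split into chunks, each chunk is fingerprinted with a hash, and only chunks whose fingerprints have not been seen before are actually written to disk. This is a generic, fixed-size-chunk illustration in Python, not Data Domain's actual variable-length, inline segmenting algorithm; the 8KB chunk size and the in-memory store are illustrative assumptions only.

import hashlib
import io

CHUNK_SIZE = 8 * 1024  # illustrative fixed chunk size; real appliances typically use variable-length segments

def deduplicate(stream, chunk_store):
    # Split a backup stream into chunks and store only the chunks not seen before.
    # chunk_store maps fingerprint -> chunk bytes; in a real appliance this would be
    # an on-disk segment store plus a fingerprint index.
    logical_bytes = 0    # what the backup software thinks it wrote
    physical_bytes = 0   # what actually landed on disk
    recipe = []          # ordered fingerprints needed to rebuild the stream
    while True:
        chunk = stream.read(CHUNK_SIZE)
        if not chunk:
            break
        logical_bytes += len(chunk)
        fingerprint = hashlib.sha1(chunk).hexdigest()
        if fingerprint not in chunk_store:   # new data: keep it
            chunk_store[fingerprint] = chunk
            physical_bytes += len(chunk)
        recipe.append(fingerprint)           # duplicate data: just reference it
    return recipe, logical_bytes, physical_bytes

# Backing up the same 1 MB of data twice stores it physically only once.
store = {}
for _ in range(2):
    recipe, logical, physical = deduplicate(io.BytesIO(b"x" * 1024 * 1024), store)
print(len(store), logical, physical)  # 1 unique chunk; the second run stores 0 new bytes

The ratio of logical bytes written to physical bytes stored is exactly the reduction figure discussed below.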
And while the big black box sits there in your data center, the figures start to jump out at you: an average backup environment can see a reduction of up to 20 times. For example, a typical environment whose first full backup of 1TB contains only 250GB of unique physical data will immediately see a fourfold reduction. If such an environment were to take weekly backups with a logical growth rate of 1.4TB per week but a physical growth of only 58GB per week, the reduction would climb to more than 20 times within roughly four months:
Reduction = [First Full + (Weekly Logical Growth x Number of Weeks)] / [Physical Full + (Weekly Physical Growth x Number of Weeks)]
e.g. after 25 weeks:
Reduction = [1TB + (1.4TB x 25)] / [0.250TB + (0.058TB x 25)]
= 36TB / 1.7TB
≈ 21, i.e. roughly 21 times less data is physically stored than is logically backed up
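The same arithmetic, expressed as a short Python sketch so the growth of the ratio over time is easy to see (the figures are simply the ones assumed in the example above):

def reduction_ratio(first_full_tb, weekly_logical_tb, physical_full_tb, weekly_physical_tb, weeks):
    # Logical data backed up divided by physical data actually stored
    logical = first_full_tb + weekly_logical_tb * weeks
    physical = physical_full_tb + weekly_physical_tb * weeks
    return logical / physical

# 1TB first full, 1.4TB/week logical growth, 250GB physical full, 58GB/week physical growth
for week in (4, 17, 25):
    print(week, round(reduction_ratio(1.0, 1.4, 0.250, 0.058, week), 1))
# Prints roughly 13.7, 20.1 and 21.2: the ratio passes 20:1 at around week 17
# (about four months) and sits at ~21:1 after 25 weeks.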
So how does Data Domain come up with such impressive results? Upon closer inspection, despite being considered the 'latest technology', Data Domain's target-based deduplication solution has actually been around since 2003; in other words, these guys have been doing this for years. Now, in 2010, to call the DD880 'cutting edge' would be somewhat misleading when a more suitable term would be 'consistently advancing'. Those consistent advancements stem from the magic of the big black box being built on a CPU-centric architecture, and hence not reliant on adding more disk drives to scale performance. So whenever Intel unveils a new processor, Data Domain follows by incorporating it into the big black box. Consequently, the new DD880's stunning numbers come from its quad-socket, quad-core processor system. With such CPU power the DD880 can handle aggregate throughput of up to 5.4 TB per hour and single-stream throughput of up to 1.2 TB per hour while supporting up to 71 TB of usable capacity, leaving its competitors in its wake. Having adopted such an architecture, Data Domain has pretty much guaranteed the future of its inline deduplication by being able to take advantage of every inevitable advance in Intel's CPUs.
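To put those throughput figures into backup-window terms, here is a rough back-of-envelope sketch; the 50TB volume below is an arbitrary assumption, and real-world rates also depend on the backup software, the network and the data itself:

def backup_window_hours(data_tb, throughput_tb_per_hour):
    # Naive estimate: data volume divided by sustained ingest rate
    return data_tb / throughput_tb_per_hour

protected_tb = 50  # hypothetical amount of data sent to the box in a single run
print(round(backup_window_hours(protected_tb, 5.4), 1))  # ~9.3 hours at the quoted aggregate rate
print(round(backup_window_hours(protected_tb, 1.2), 1))  # ~41.7 hours through a single stream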
Unlike the source-based offerings, Data Domain's target-based solution is controlled by the storage system rather than by a host, and thus takes the files or volumes from disk and simply dumps them onto the disk-based backup target. The result is a more robust and sounder solution for a high change-rate environment, or one with large databases, where RPOs can be met far more easily than with a source-based dedupe solution.
Another conundrum that Data Domain's solution raises is the future of tape-based backups. The cheap, RAID 6-protected 1 TB / 500 GB 7.2k rpm SATA disks used by the DD880, alongside the amount of data eliminated by its deduplication, bring into question the whole cost advantage of backing up to tape. If there is less data to store, and hence fewer disks than tapes required, what argument remains for avoiding the more efficient disk-to-disk backup procedure? Eliminating redundant data by a factor of 20:1 brings the economics of disk backup closer than ever to those of tape. Couple that with the extra costs of tape backups that frequently fail, the tricky recovery procedures tape entails and backup windows that are under ever-increasing scrutiny, and this could well be the beginning of the end for the tape-run guys having to do their regular rounds to the safe.
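As a rough way to see why a 20:1 reduction shifts those economics, here is a sketch comparing effective cost per protected TB. Every price and overhead below is a hypothetical placeholder rather than a vendor figure, so substitute real quotes before drawing any conclusions:

def cost_per_protected_tb(media_cost_per_tb, dedupe_ratio=1.0, protection_overhead=1.0):
    # Raw media cost, inflated by protection overhead (e.g. RAID 6 parity),
    # divided by the logical-to-physical reduction achieved on that media
    return media_cost_per_tb * protection_overhead / dedupe_ratio

# Hypothetical placeholder prices per raw TB
tape = cost_per_protected_tb(media_cost_per_tb=40)  # tape library, no dedupe
disk = cost_per_protected_tb(media_cost_per_tb=100, dedupe_ratio=20, protection_overhead=1.25)  # SATA + RAID 6, 20:1
print(tape, disk)  # 40.0 vs 6.25 per protected TB under these assumptions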
Furthermore, with CIFS, NFS, NDMP and Symantec OpenStorage already supported, word is out that development work is under way to integrate more closely with EMC's other juggernauts, VMware and NetWorker. So while deduplication in its many forms saturates the market and brings major cost savings to backup architectures across the globe, it is Data Domain's CPU-centric, target-based inline solution that has the most promising foundation and future, and currently unsurpassed results. $2.1 billion? Sounds like a bargain.