Data Deduplication

The Challenge

As data stores continue to grow and the need for retaining more and more organizational data for legal reasons increases, IT professionals are working to determine if their current backup strategies can keep up. Tapes – while offering easy transferability to an off-site location – can be extremely costly to store. It also can be very time-consuming to restore data from tapes. Alternatively, the cost of disk has decreased to the point where using disk-to-disk backup is a viable option. For customers using a combination of disk and tape backup solutions, data deduplication can help that cost come down even more, plus save valuable time at every level.

What is Data Deduplication?

Wikipedia defines data deduplication as “a specific form of compression where redundant data is eliminated.” Take the example of a 50 MB PowerPoint presentation emailed to 10 people. If each person stores the presentation in their home directory, we now have 500 MB allocated to storing the same data! If each person then forwards the presentation to one other individual and those people also store the presentation, we have 20 copies, or 1 GB of storage dedicated to a single file! Incremental and differential backups aside, this one file will take up 1 GB of storage for its initial backup.

Data deduplication takes care of this redundancy by recognizing that the data in each of these individual files is the same. It therefore stores one copy of the file and replaces the other copies with pointers to it. Now, instead of using 1 GB of storage, 20 people have used a total of only 50 MB of disk space.
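
To make the arithmetic concrete, here is a minimal sketch of file-level deduplication. The `FileStore` class, the file paths, and the use of SHA-256 are assumptions made for illustration, not a description of any vendor's product, and a 50 KB payload stands in for the 50 MB presentation so the example runs quickly.

```python
import hashlib


class FileStore:
    """Toy file-level deduplicating store (hypothetical, for illustration)."""

    def __init__(self):
        self.blobs = {}     # content hash -> file bytes, stored once
        self.pointers = {}  # file path -> content hash

    def put(self, path, data):
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)  # keep only the first copy
        self.pointers[path] = digest         # every path just points at it

    def get(self, path):
        return self.blobs[self.pointers[path]]

    def logical_size(self):
        # Bytes the files would occupy if every copy were stored in full.
        return sum(len(self.blobs[d]) for d in self.pointers.values())

    def physical_size(self):
        # Bytes actually held by the store.
        return sum(len(b) for b in self.blobs.values())


# 51,200 bytes stand in for the 50 MB presentation.
presentation = bytes(range(256)) * 200
store = FileStore()
for person in range(20):
    store.put(f"/home/user{person}/deck.ppt", presentation)

print(store.logical_size())   # 20 copies' worth of bytes
print(store.physical_size())  # one copy's worth of bytes
```

The ratio between the two numbers is 20 to 1, the same ratio as 1 GB to 50 MB in the example above.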

However, let’s assume that each person makes a change in one slide. Now the data across all the files is not the same. Some data deduplication products are smart enough to work on the subfile level: they locate the blocks of data that are the same, store those one time, and then store the differing blocks separately. Because of the pointers the data deduplication product creates, each person can retrieve their unique version of the file, even though it has been stored in separate blocks.
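
The subfile case can be sketched the same way. In this hedged example, files are cut into fixed-size blocks, each file is recorded as an ordered list of block hashes, and a restore simply follows that list. The 4 KB block size, the `BlockStore` class, and the way each "slide edit" is simulated are all assumptions; real products differ in how they choose block boundaries.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; real products vary


class BlockStore:
    """Toy block-level deduplicating store (hypothetical, for illustration)."""

    def __init__(self):
        self.blocks = {}   # block hash -> block bytes, stored once
        self.recipes = {}  # file name -> ordered list of block hashes

    def put(self, name, data):
        recipe = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # store unseen blocks only
            recipe.append(digest)
        self.recipes[name] = recipe

    def get(self, name):
        # Rebuild the file by following its pointers in order.
        return b"".join(self.blocks[d] for d in self.recipes[name])


# A 10-block "presentation"; each block plays the role of one slide.
original = b"".join(bytes([i]) * BLOCK_SIZE for i in range(10))

store = BlockStore()
versions = {}
for person in range(20):
    edited = bytearray(original)
    start = (person % 10) * BLOCK_SIZE          # the one slide this person edits
    edited[start:start + 8] = f"edit{person:04d}".encode()
    versions[f"user{person}"] = bytes(edited)
    store.put(f"user{person}", bytes(edited))

print(len(store.blocks), "unique blocks stored instead of", 20 * 10)
print(all(store.get(n) == v for n, v in versions.items()))  # every version restores
```

Here 20 edited copies need 30 stored blocks (the 10 original blocks plus one changed block per person) rather than 200.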

How Does It Work?

Deduplication technology works by comparing chunks of data and searching for duplicates. It does this by assigning each chunk an identifier calculated by a cryptographic hash function; chunks with the same identifier are treated as identical, since a collision between two different chunks is extremely unlikely. When a duplicate is found, the duplicate file or block is removed and replaced with a link to the first stored copy. If the file later changes, the changed file or block is written to disk during the next backup.
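
The behavior across successive backups can be illustrated with a small sketch. The `backup` function, the 1 KB chunk size, and the in-memory dictionary used as the hash index are assumptions for this example only.

```python
import hashlib

CHUNK_SIZE = 1024  # assumed chunk size for this sketch


def backup(index, data, chunk_size=CHUNK_SIZE):
    """Store unseen chunks in `index`; return (recipe, chunks_written)."""
    recipe, written = [], 0
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha256(chunk).digest()  # the chunk's identifier
        if digest not in index:                  # new data: write it
            index[digest] = chunk
            written += 1
        recipe.append(digest)                    # duplicate or not, keep a link
    return recipe, written


index = {}
data = bytearray(b"".join(bytes([i]) * CHUNK_SIZE for i in range(8)))

_, first = backup(index, bytes(data))    # initial backup: every chunk is new
_, second = backup(index, bytes(data))   # nothing changed: nothing is written
data[3 * CHUNK_SIZE] ^= 0xFF             # change one byte inside chunk 3
_, third = backup(index, bytes(data))    # only the changed chunk is written

print(first, second, third)  # 8 0 1
```

Only the first backup pays the full storage cost; later backups write just the chunks whose hashes have not been seen before.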

Types of Deduplication Technology

There are two types of data deduplication technology currently in use:

* Post-process deduplication: As the name implies, post-process deduplication runs after the data is sent to the target device. The advantage is that, since the deduplication process can be slow, the backup itself does not have to wait for deduplication to occur. The disadvantage is that it is hard to predict how long the deduplication process will take. Also, since the data needs to be written to the target in full first, more disk space will be required until the process finishes.

* In-line deduplication: With in-line deduplication, the hash calculations are performed on the target device as the data is written. If a duplicate is found, the new block of data is not stored. This method requires less storage on the target, but can be slower because the hash calculations and lookups add time to every write. Performance varies across vendors.
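
The storage trade-off between the two approaches can be sketched by counting how many chunks the target holds at its fullest point. The function names are hypothetical, and the sketch assumes the post-process pass reclaims space in place, so its peak is simply the size of the landing area; it models space only, not the time cost of hashing.

```python
import hashlib


def inline_backup(chunks):
    """Deduplicate while writing: duplicates never reach the target's disk."""
    stored, peak = {}, 0
    for chunk in chunks:
        digest = hashlib.sha256(chunk).digest()
        if digest not in stored:
            stored[digest] = chunk
        peak = max(peak, len(stored))
    return len(stored), peak


def post_process_backup(chunks):
    """Write everything first, then deduplicate in a later pass."""
    landing = list(chunks)   # full, undeduplicated copy lands on the target
    peak = len(landing)      # assumed peak: the whole landing area
    stored = {}
    for chunk in landing:    # the later deduplication pass
        stored.setdefault(hashlib.sha256(chunk).digest(), chunk)
    return len(stored), peak


# Six incoming chunks, only three of them distinct.
incoming = [b"A" * 512, b"B" * 512, b"A" * 512, b"A" * 512, b"C" * 512, b"B" * 512]

print(inline_backup(incoming))        # (3, 3): final and peak chunk counts
print(post_process_backup(incoming))  # (3, 6): same final size, higher peak
```

Both approaches end with the same three chunks on disk; the difference is how much space is needed before they get there.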

What Are the Advantages?

Data deduplication brings a wide variety of benefits to organizations:

* Save on storage space for disk-to-disk backups: According to the Enterprise Strategy Group’s report by Tony Asaro and Heidi Biggar entitled “Data De-duplication and Disk-to-Disk Backup Systems” (July 2007), “Through hands-on testing, ESG has found that data deduplication technologies can provide 10 times, 20 times, 30 times and even great reduction in capacity needed for backup.” Thus, companies can see savings not only in the disk needed for the primary backup, but also in the cost of disk for a secondary site, or in monthly charges for an off-site backup service.

* Save on power and cooling: By decreasing the amount of disk needed, organizations can see a reduction in power and cooling costs.

* Save on space: With less disk needed, organizations also save on the amount of floor/rack space needed to house the backup solution.

* Save on bandwidth: Less data going across the wire means lowered bandwidth costs.

* Decrease time and costs for data restoration: Recovery from disk is typically much faster than recovery from tape, which can be slow and time-consuming. If the tape needed is in off-site storage, more time and costs will be incurred.
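
To put the reduction ratios quoted in the first bullet into capacity terms, here is a short worked calculation. The 30 TB figure is a made-up starting point, not a number from the ESG report.

```python
def capacity_needed(logical_tb, ratio):
    """Disk needed after deduplication at the given reduction ratio."""
    return logical_tb / ratio


logical_tb = 30.0  # hypothetical amount of backup data before deduplication

for ratio in (10, 20, 30):
    needed = capacity_needed(logical_tb, ratio)
    saved = 1 - 1 / ratio
    print(f"{ratio}x reduction -> {needed:.1f} TB stored, {saved:.1%} saved")
```

Under these assumptions the three ratios work out to 3.0 TB, 1.5 TB, and 1.0 TB of disk, or 90.0%, 95.0%, and 96.7% saved. Most of the saving arrives with the first 10x; higher ratios trim the remainder.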

What Backup Vendors Support This Technology?

There are a host of vendors offering this technology, including ExaGrid, EMC Data Domain, and Barracuda Backup (formerly BitLeap, until Barracuda acquired the company last year).

Where Can I Learn More?

Check Data Domain for whitepapers (like the one mentioned in this article) and a deduplication calculator. ESG’s report contains some great information, including questions to ask vendors when selecting a solution.

Conclusion

If you are considering a new backup strategy for your organization, taking a look at what data deduplication can do for you is a must. We feel that development of this technology is just getting started, and can only improve as more products hit the marketplace.