The Storage Management Blog
Getting the best out of your unstructured data
Finding file duplicates can be a chore
Posted on September 9, 2011
Finding and managing duplicate files is often the first thing admins do when they try to clean up storage, so in this post we thought we’d look at how to do it in a little more detail. Finding duplicated files can be a chore. If you haven’t looked for them before, there are usually many thousands on your network. In fact, our experience shows that they can make up more than 30% of your used storage! If you include files stored as email attachments, the figure can be even higher. You can easily produce a list of them with one of the many simple tools available – but this will just confirm the scale of the problem. The difficult part is finding time to do something about it. That requires a more sophisticated approach.
What do you want to achieve?
Previously we blogged about how to archive duplicates – and how to automate this, so you don’t have to spend time manually finding and dealing with file duplicates. But how do you find the interesting ones in the first place – the ones that you want to spend time on? First you need to decide what you are trying to achieve, e.g.
- Clean up as much storage as possible, as quickly as possible
- Find out which users are generating the most duplicates, and tell them about it
- Check where most file duplicates are being stored – and whether whole folders are being duplicated
- Decide whether you are concerned about every type of file, or just the ones you hate, like audio or video files.
How accurate should I be?
Having decided what you are looking for, you need to decide how accurately you want to search. This is a tradeoff between speed and accuracy – unless you choose a tool where the search can be automated to run to a schedule. If you want to clean up across your entire network you need to consider this area carefully. Scaling up a duplicate file search can be a challenge and is highly tool dependent.
A good example of this issue is whether or not to use “checksum” comparison to find duplication. In this method the checksum of each file is calculated and compared with that of every other file. To calculate a checksum, the data in each file must be read and a calculation applied. When files get to larger sizes (e.g. 50 MB or more), calculating a checksum can take minutes, even in the best conditions (when a file is stored on a local filesystem and the PC or server is powerful and not running other tasks). Some tools let you do a “byte by byte” comparison of files, rather than calculating and comparing checksums. This takes even longer and arguably gives very little extra accuracy.
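The checksum method described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular tool works: the function names `file_checksum` and `find_duplicates` are made up for this example, and it uses SHA-256 from the standard library rather than a 32-bit checksum. It also assumes one common shortcut, grouping files by size first, so that a file whose size is unique is never read at all.

```python
import hashlib
import os
from collections import defaultdict


def file_checksum(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Group files under root that share both size and checksum (hypothetical helper)."""
    by_size = defaultdict(list)
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    by_checksum = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate, so skip the read
        for path in paths:
            by_checksum[(size, file_checksum(path))].append(path)

    # Keep only groups that actually contain more than one file
    return [sorted(paths) for paths in by_checksum.values() if len(paths) > 1]
```

Every file that survives the size filter still has to be read in full, which is where the time goes on large files and on network shares.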
So how accurate is a checksum comparison of files? This depends on the algorithm used. With a 32-bit checksum algorithm we estimate that the likelihood of getting an incorrect duplication result on a large network will be once every 20 years or so. If you’re cleaning up your storage (rather than carrying out a forensic investigation, for example) then you’re probably not too bothered by this margin for error. In fact, for most purposes you can probably dispense with checksum calculation altogether and just use metadata like file update date, size and file type to clean up.
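As a rough sketch of the metadata-only approach, the Python below groups files by size, last-modified time and file extension without ever opening them. The helper names `metadata_key` and `likely_duplicates` are invented for this example, and using the extension as a stand-in for “file type” is an assumption.

```python
import os
from collections import defaultdict


def metadata_key(path):
    """Build a comparison key from size, modification time and extension."""
    info = os.stat(path)
    extension = os.path.splitext(path)[1].lower()
    # Truncate the timestamp to whole seconds so sub-second noise is ignored
    return (info.st_size, int(info.st_mtime), extension)


def likely_duplicates(paths):
    """Group paths whose metadata keys match; file contents are never read."""
    groups = defaultdict(list)
    for path in paths:
        groups[metadata_key(path)].append(path)
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

Because only metadata is compared, two different files that happen to share a size, timestamp and extension will be reported together. Treat the output as a list of candidates to review rather than as proof of duplication.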
How do I want to organize results?
You’ve decided what you want to achieve, and how accurate you want to be. Next it’s worth thinking about how to organize the results of your analysis. If you want to find out who’s creating the most duplication, then looking at results by file owner would be useful. If you want to find duplicated directories, then viewing by folder would be best. If you just want to clean up wasted storage, then using file lists would be easiest.
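To illustrate the idea of organizing the same results in different ways, here is a small Python sketch that totals the space taken by redundant copies under whatever grouping you choose. The names `wasted_bytes_by` and `by_folder` are hypothetical, and treating the first file in each duplicate group as the copy to keep is an assumption made for this example.

```python
import os
from collections import defaultdict


def wasted_bytes_by(groups, key):
    """Total the bytes held by redundant copies, bucketed by key(path).

    groups is a list of duplicate groups, each a list of (path, size) pairs.
    The first entry of each group is treated as the copy to keep.
    """
    totals = defaultdict(int)
    for group in groups:
        for path, size in group[1:]:
            totals[key(path)] += size
    # Largest totals first
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def by_folder(path):
    """Grouping key: the folder that contains the file."""
    return os.path.dirname(path)
```

Passing a different `key` function changes the view. For example, `lambda path: os.stat(path).st_uid` would bucket by the numeric owner ID on Unix-like systems, though how ownership is recorded varies by platform.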

An example of file duplicates organized by file

An example of file duplicates organized by creator

An example of file duplication by folder
The examples above come from our favourite storage management tool, SPACEWatch Storage Suite. Its Duplicates Finder lets you choose from many options to filter where – and what – you look for. It then uses database-driven search to give very fast results. What’s more, it will find duplicates across multiple servers, storage appliances – and even email systems.
What to do with the results?
So you’ve got the results you want, organized as you want them. What next? With SPACEWatch you can:
- Generate and email a report – e.g. to the owners of the duplicates
- Carry out a file action like copy, move, delete – or more advanced actions like archiving or sending to compressed folders
- Save the results – e.g. to publish on an intranet site, or use in another report
- Save the search – so you can run the exact same search with one click, or automate it to run to a schedule.
Building your business case
Posted on September 5, 2011
Business cases are used to justify a proposed change. They include reasoning and background for those who will provide the resources to carry out the change. Typically they are a key step in securing the money required to implement a plan – like investing in more storage.
Usually, the easiest part of the business case is describing what you want to do – the “end state” or final infrastructure configuration. This is the “new” stuff, and suppliers are always at hand with white papers and product data sheets to help with the detail. Conversely, the most difficult part is often building the financial justification – what will be improved as a result of this change, or what cost will be reduced? This is difficult because it requires a detailed understanding of your current infrastructure – and no supplier can give you a data sheet covering that!
So what you need for your next storage business case is a painless way to find out what’s out there – and what the impact of a planned change might be. Step up SPACEWatch Storage Suite. Using its Scenario Planning tool, it’s possible to choose all – or part – of your storage, choose a change scenario like “all files unused for 3 months”, then calculate in seconds the impact of that change. What’s more, the change impact is calculated by looking at what the historic impact would have been on your storage, then extrapolating this forward. Add in your own TCO (total cost of ownership) figures and it will provide the financial impact as well – all ready to be pasted into your business case paper.
In the example above we’ve chosen “all files unused for 3 months” and considered what would happen if we applied this across all our storage. SPACEWatch lets us compare this with a “do nothing” scenario, in which users continue using storage the way they always have, to clearly see the benefit of the change. This example is interesting – not only are there clear and substantial savings, but it will take a year or more to get back to the same level of storage use. Storage growth also becomes more linear, indicating that historically the growth in storage taken up by unused files has actually been increasing! Definitely time to act – and there’s the evidence to argue your case.
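For a feel of the arithmetic behind a scenario like “all files unused for 3 months”, here is a minimal Python sketch. It is not SPACEWatch’s method: it only takes a snapshot of the present and does none of the historic extrapolation described above. The function name `unused_file_savings`, the 90-day default and the cost-per-gigabyte input are all assumptions for illustration.

```python
import time

SECONDS_PER_DAY = 86400


def unused_file_savings(files, days_unused=90, cost_per_gb=0.0, now=None):
    """Estimate the space and cost freed by removing files unused for days_unused days.

    files is an iterable of (size_in_bytes, last_access_timestamp) pairs.
    cost_per_gb is your own TCO figure per gigabyte; you must supply it.
    """
    now = time.time() if now is None else now
    cutoff = now - days_unused * SECONDS_PER_DAY
    freed_bytes = sum(size for size, last_access in files if last_access < cutoff)
    freed_gb = freed_bytes / 1024 ** 3
    return freed_bytes, freed_gb * cost_per_gb
```

The `(size, last_access)` pairs could come from `os.stat` (`st_size` and `st_atime`), but bear in mind that some filesystems are configured not to update access times, in which case the last-modified time may be a safer proxy.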
Why not download and try a free trial of SPACEWatch now – it will work with any size of network.
