Deduplication Meets Innovation
This project was designed and built with a partner in my Systems Administration class during my final semester of college. The primary focus was file systems and the data stored in them. Storage availability and free storage space affect everyone at every level of computing. The BASH script designed and outlined in this document was created to make a system administrator's life easier by finding and identifying duplicate files, then moving them to a secondary location in the file system to be deleted (or kept). Convenience was the main goal of this project: the script quarantines all duplicate files in one easily accessible location where they can be sorted and filtered. The system administrator can then go through that secondary location and remove or keep the listed duplicates at their discretion.
Introduction
As system administrators, file systems give us a place to store all of our files and data. This virtual storage space has a dark side, however, and duplicate files love to hide in the shadows. If the storage on a device fills up, it falls to the system administrator to free that storage or upgrade the company to a larger storage plan. When someone runs out of storage, they usually look for the largest items that can be deleted, freeing the most space with the least effort on their end; we have been conditioned to want the largest impact in the shortest amount of time. A script that accomplishes this goal automatically means the larger applications and files that would otherwise be deleted just to make room can remain in place and stay useful.

We started with a homemade approach to writing our own script, but quickly realized that, with such a short time frame, it would be best to modify existing code that is freely available online and documented later in this paper. Once we had the code, we modified it significantly. One of the most important changes was that, instead of deleting duplicate files, the script moves them to a directory that it creates if it does not already exist. This way the system administrator can review all of the duplicate files in one place and decide whether any of them contain information worth salvaging, or keep them for any other reason. Because the directory is recreated every time the script runs, the administrator can delete it at any time without even reviewing it. Overall, our initial tests were promising, and with the time of a full regular semester this script could become something any administrator would be able to use.
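As an illustration of that workflow, the fragment below is a minimal sketch rather than the project's exact script: it hashes every regular file under a target directory, treats files that share a checksum as duplicates, and moves each extra copy into a quarantine directory that is created if it does not already exist. The variable names and default paths are our own placeholders, and it assumes GNU coreutils and bash 4 or later.

```bash
#!/usr/bin/env bash
# Minimal sketch of the workflow described above (not the project's exact script).
# Assumptions: GNU coreutils (md5sum, mv --backup) and bash 4+ associative arrays.
# Assumes QUARANTINE_DIR sits outside TARGET_DIR so moved files are not rescanned.

TARGET_DIR="${1:-.}"                    # directory to scan for duplicates
QUARANTINE_DIR="${2:-/tmp/duplicates}"  # where duplicate copies are moved

# Create the quarantine directory if it does not already exist.
mkdir -p "$QUARANTINE_DIR"

declare -A seen   # maps checksum -> first file seen with that checksum

while IFS= read -r -d '' file; do
    sum=$(md5sum "$file" | awk '{print $1}')
    if [[ -n "${seen[$sum]:-}" ]]; then
        # Duplicate content: keep the first copy, quarantine this one for review.
        # --backup=numbered avoids clobbering quarantined files that share a name.
        mv --backup=numbered "$file" "$QUARANTINE_DIR/"
    else
        seen[$sum]="$file"
    fi
done < <(find "$TARGET_DIR" -type f -print0)
```

The administrator can then inspect the quarantine directory at leisure and delete it outright once satisfied nothing of value was moved.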
Background
In researching the idea of duplication and the need for deduplication, we first had to understand why duplication happens and whether it can actually cause the problems we suspected. Looking into this further, we found this quote from TechChannel: “Information Technology Intelligence Corp.’s (ITIC) 2018 Global Server Hardware, Server OS Reliability Survey (ibm.co/2LCX6gz), which surveyed more than 800 customers worldwide, found that 59 percent of respondents cited human error as the No. 1 cause of unplanned downtime” [3]. Since file duplication is largely a product of human error, and human error can have a severe impact on a company, system administrators need to remain aware of even the smallest issues, because they can lead to far bigger problems in the future.
Further research revealed just how costly duplicate files can be to a large company. One article listed the reasons duplicate data is harmful, and one of its key points was that a company ends up paying to store the same data twice [1]. This illuminated the money a company can waste on storage for duplicates: the larger the company, the more employees who can each waste chunks of storage on duplicate files or data. We know duplication is happening, and we know duplicates cost companies money and storage space, so how do we get rid of them?
We found an article that describes a simple yet powerful way to take care of the duplicate data problem, which the author calls data deduplication. It is introduced as a solution to data duplication caused by human error. According to Tye,
“To counteract the problems outlined, the solution is a process called, unsurprisingly, data deduplication. This is a blend of human insight, data processing and algorithms to help identify potential duplicates based on likelihood scores and common sense to identify where records look like a close match. This process involves analysing your database to identify potential duplicate records, and unravelling these to identify definite duplicates” [2].
This describes the very method we are using and goes further, offering insight into the necessity of sorting duplicate files for easier cleanup. Similarly, Tye states, “Deduplication rules also need to be implemented, based on your own unique data issues, in order to create a bespoke deduplication strategy. The rules should take into account your decisions about how ‘strict’ you want to be with your deduplication, in terms of maintaining the balance between losing valuable customer data and having a clean database” [2]. The idea to create a temporary archive in the “tmp” directory, allowing the system administrator to parse through the duplicates at a time convenient to them, stemmed largely from Tye’s article.
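As a rough illustration of how such a “strictness” rule and the temporary archive could be expressed, the sketch below is purely hypothetical: the variable and function names are ours, not the project script's, and it assumes GNU/Linux tools. It creates a timestamped archive under /tmp for later review and lets the matching rule be toggled between a strict content checksum and a looser name-plus-size comparison.

```bash
#!/usr/bin/env bash
# Illustrative sketch only; names and defaults are ours, not the project script's.
# Assumes GNU/Linux tools (md5sum, GNU stat).

# Timestamped quarantine archive under /tmp, reviewed (or deleted) at the
# administrator's convenience.
ARCHIVE_DIR="/tmp/dedupe_$(date +%Y%m%d_%H%M%S)"
mkdir -p "$ARCHIVE_DIR"

# MATCH_RULE controls how strict the duplicate test is:
#   checksum  - files are duplicates only if their contents hash identically
#   namesize  - looser rule: same base name and same size in bytes
MATCH_RULE="${MATCH_RULE:-checksum}"

# Produce the key used to decide whether two files count as duplicates.
fingerprint() {
    local file="$1"
    if [[ "$MATCH_RULE" == "checksum" ]]; then
        md5sum "$file" | awk '{print $1}'
    else
        printf '%s:%s\n' "$(basename "$file")" "$(stat -c %s "$file")"
    fi
}
```

Choosing the looser rule trades accuracy for speed, which mirrors the balance Tye describes between losing valuable data and keeping a clean system.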