Learning F# (via sample application) — Looking for file duplicates in your filesystem
January 2nd, 2010
Intro: why I wrote it
Time goes by, hard drives grow larger and larger, and it is no longer unusual to have more than 500 GiB of downloaded or copied data. After even a year of working with a PC, you are likely to have too much data to remember where you put it all.
Surely, the right way to handle this is to organize your data properly, so that identical (or similar) files are stored in one shared location, where duplicates are easy to notice.
The lazy way is to run a program to find duplicates for you.
F#
Any programming language would do for this problem; I used F#. This post will only be useful to people learning F#, as I do not provide a user-friendly program (yet), and running the code (luckily) requires programming skills.
A cool feature of F# is F# Interactive, which is great for rapid development and for doing research. Unlike the typical write-it, compile-it, run-it cycle, you can take a write-it and run-it-when-you-want approach.
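To give a feel for that workflow, here is a minimal sketch of something you might paste into F# Interactive. The helper name totalBytes is made up for illustration and is not part of the program described in this post.

```fsharp
// Paste into F# Interactive; when typing at the prompt,
// end each entry with ";;" to have it evaluated.
open System.IO

// Hypothetical helper: total size in bytes of all files under a directory.
let totalBytes (root: string) : int64 =
    Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories)
    |> Seq.sumBy (fun path -> FileInfo(path).Length)

// Try it on any folder you like, then tweak the function and re-send it:
// totalBytes @"C:\Downloads"
```

Nothing is compiled into an executable here; the definition lives in the interactive session and can be redefined at any time.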
Heuristic algorithm to find file duplicates
The straightforward definition: equal files are those with equal content, names and properties.
Let’s relax this requirement: properties, such as the last modification date, do not matter. Neither does the name: you might have renamed that downloaded file half a year ago. Content … matters.
Let’s go further: if the contents of two files are equal, their hash values are equal too. But as reading those 500 GiB of data alone would take a tremendous amount of time, let’s consider only the first N bytes of each file. What we end up with is a list of probable duplicates. (After all, for those candidates we can still perform a full content check.)
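The heuristic above could be sketched in F# roughly as follows. This is an illustration under my own assumptions, not necessarily the code of the actual program: the function names (prefixHash, candidateDuplicates, sameContent) are hypothetical, and SHA-256 is just one possible choice of hash, since nothing above depends on a particular one.

```fsharp
open System
open System.IO
open System.Security.Cryptography

/// Hash of the first `prefixBytes` bytes of a file (or of the whole file if it is shorter).
/// SHA-256 is an assumption; any reasonable hash would do.
let prefixHash (prefixBytes: int) (path: string) : string =
    use stream = File.OpenRead path
    let buffer = Array.zeroCreate<byte> prefixBytes
    // Stream.Read may return fewer bytes than requested, so loop until full or end of file.
    let rec fill offset =
        if offset >= buffer.Length then offset
        else
            let read = stream.Read(buffer, offset, buffer.Length - offset)
            if read = 0 then offset else fill (offset + read)
    let count = fill 0
    use sha = SHA256.Create()
    sha.ComputeHash(buffer, 0, count) |> Convert.ToBase64String

/// Groups of two or more files whose first `prefixBytes` bytes hash to the same value.
/// These are only candidates: files may still differ after the prefix.
let candidateDuplicates (prefixBytes: int) (paths: seq<string>) : string list list =
    paths
    |> Seq.groupBy (prefixHash prefixBytes)
    |> Seq.map (fun (_, group) -> List.ofSeq group)
    |> Seq.filter (fun group -> List.length group > 1)
    |> List.ofSeq

/// Full content check for a candidate pair.
/// Reads both files into memory, so it is only suitable for files that fit in RAM.
let sameContent (a: string) (b: string) : bool =
    FileInfo(a).Length = FileInfo(b).Length
    && File.ReadAllBytes a = File.ReadAllBytes b
```

A call such as candidateDuplicates 4096 (Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories)) would then return the groups of probable duplicates under a folder named root, and sameContent can confirm each pair. Comparing file lengths first is a cheap way to discard candidates that only share a prefix.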
Code
Please criticize and comment (English grammar included). Your feedback is very welcome.