Samefile - finds duplicate files

Name

samefile - find duplicate files

samearchive - find duplicate files, while keeping archives intact

Synopsis

samefile [-a | -A | -At | -L | -Z | -Zt] [-g size] [-l | -r] [-m size] [-S sep] [-0HiqVvx]

samearchive [-a | -A | -At | -L | -Z | -Zt] [-g size] [-l | -r] [-m size] [-S sep] [-0HiqVv] dir1 dir2 [...]

Description

These programs read a list of filenames (one filename per line) from stdin and write the duplicate files to stdout. samearchive is written for the special case where each directory acts as a backup archive. Its output only contains filename pairs that have the same relative path from the archive base. Therefore the output of samearchive is a subset of the output of samefile.

The output consists of six fields: the size in bytes, two filenames (with identical contents), the character = if the two files are on the same device or X otherwise, and the link counts of the two files. The output is sorted in reverse order by size as the primary key, with a secondary key that depends on the sorting option chosen by the user.
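The six-field layout can be read back by another program. The following is a minimal sketch, assuming the default tab separator (or whatever string was passed with -S) and filenames that do not contain that separator. The names DuplicatePair and parse_line are hypothetical and not part of samefile.

```python
from typing import NamedTuple


class DuplicatePair(NamedTuple):
    size: int          # file size in bytes
    first: str         # first filename
    second: str        # second filename (same contents as first)
    same_device: bool  # True for "=", False for "X"
    first_links: int   # link count of the first file
    second_links: int  # link count of the second file


def parse_line(line: str, sep: str = "\t") -> DuplicatePair:
    """Split one six-field output line into a DuplicatePair."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    size, first, second, device, links1, links2 = fields
    if device not in ("=", "X"):
        raise ValueError(f"unexpected device marker: {device!r}")
    return DuplicatePair(int(size), first, second, device == "=",
                         int(links1), int(links2))
```

A filename that contains the separator yields more than six fields and is rejected, which is the situation the -S option is meant to avoid.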

Options

-0: Indicates that the input list of file names is NUL terminated, for example as generated by implementations of find(1) that support the -print0 option. Without this option, the file names are assumed to be newline terminated.
-A: Sort filenames alphabetically. (default)
-At: Sort filenames chronologically using the modification date (oldest first). This option is not available when you have compiled the application with the low memory profile.
-a: Do not sort files with same size alphabetically.
-g size: Compare only files with size greater than size bytes. (Default is 0.)
-H: Print human-friendly statistics when at verbose level 2.
-i: Allow files with the same device/i-node pair to be added to the binary tree. This might be useful if output will be fed into some other program.
-L: Sort filenames in reversed natural order using the number of times the file was hard linked.
-l: Do not report whether duplicate files are hard linked. This option reverses the effects of the -r option.
-m size: Compare only files with size less than or equal to size bytes. Default is 0, which indicates there is no limit.
-q: Keep the information you receive during processing to a minimum. (Verbose level 0)
-r: Report whether duplicate files are hard linked. The separator string followed by the [bracketed] link count is appended to each name pair if they are hard links created with ln(1). This option is incompatible with the -l option. Note that this kind of output has only four fields and will appear unsorted before the actual output of samefile.
-S sep: Use string sep as the output field separator; defaults to a tab character. Useful if filenames contain tab characters and output must be processed by another program, say awk(1).
-V: Print the version information and exit.
-v: Increase the amount of information you receive while running samefile. At level 0 you will just see the error messages. At level 1 you will see warning messages indicating that samefile couldn't do something. At level 2 you will receive information about the stages that samefile enters and some statistics when samefile finishes. Defaults to verbose level 1.
-x: By default the program prints just 1 x n lines for each set of matches, but with this option it prints m x n lines for each set of matches. For example, when using the option -i, if two files match and one is hard linked twice and the other three times, you will get 6 lines instead of just 2 or 3.
-Z: Sort filenames in reversed alphabetical order.
-Zt: Sort filenames in reversed chronological order using the modification date (youngest first). This option is not available when you have compiled the application with the low memory profile.
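The -0 option matters when filenames may contain newlines. As an illustration only, the sketch below builds a NUL-terminated list of regular files, similar in spirit to what find(1) with -print0 would produce. The function name nul_terminated_file_list is hypothetical.

```python
import os


def nul_terminated_file_list(root: str) -> bytes:
    """Return every regular file below root as NUL-terminated byte strings."""
    chunks = []
    for directory, _subdirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(directory, name)
            # Skip symbolic links; keep plain files only.
            if os.path.isfile(path) and not os.path.islink(path):
                chunks.append(os.fsencode(path) + b"\0")
    return b"".join(chunks)
```

The resulting bytes could then be passed on stdin to samefile -0, for instance through the input argument of subprocess.run.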

Internals

These programs use two stages to give optimum performance.

In the first stage, all non-plain files are skipped (directories, devices, FIFOs, sockets, symbolic links), as well as files for which stat(2) fails and files whose size is less than or equal to the -g size or greater than the -m size.
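The first-stage filter can be sketched as follows. This is an illustration under assumptions, not the program's actual code: the function name passes_first_stage is hypothetical, and the use of lstat (so that symbolic links are seen as links rather than followed) is an assumption made to match the list of skipped file types.

```python
import os
import stat


def passes_first_stage(path: str, min_size: int = 0, max_size: int = 0) -> bool:
    """Return True if path would survive the filter described above.

    min_size mirrors -g (keep files strictly larger than it);
    max_size mirrors -m (keep files no larger than it, 0 means no limit).
    """
    try:
        info = os.lstat(path)  # assumption: lstat, so symlinks are not followed
    except OSError:
        return False           # stat failure: skip the file
    if not stat.S_ISREG(info.st_mode):
        return False           # directory, device, FIFO, socket, symlink
    if info.st_size <= min_size:
        return False
    if max_size and info.st_size > max_size:
        return False
    return True
```

With the default of 0 for -g, empty files never pass, since only files larger than 0 bytes are compared.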

When the memory is full, samefile tries to store part of the filenames temporarily in /tmp/samefile/<pid>. When samefile is not able to do this, it raises the minimum size and removes paths from memory accordingly.

In the second stage, the filenames that are hard linked are reported first, assuming option -r was passed to the program. After this, the files are compared and identical files are reported.

For any i-node only one filename will be added, unless -i was requested.

For each two i-nodes that match, n lines will be printed that show the first filename of the first i-node matched against all the filenames of the second i-node. Note however, that because only the first filename per i-node gets into the second stage, the output for a group of duplicate files with different i-node numbers is also minimized.
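The idea of keeping one filename per i-node and then comparing contents can be sketched as below. This is a simplified illustration, not the program's implementation: the function name find_duplicates is hypothetical, grouping by size before comparing is an assumption, and the memory handling, the -r report and the link count fields are left out.

```python
import filecmp
import os
from collections import defaultdict


def find_duplicates(paths):
    """Yield (size, first, second) for files with identical contents."""
    by_size = defaultdict(dict)  # size -> {(device, inode): first path seen}
    for path in paths:
        info = os.stat(path)
        # Keep only one filename per i-node (the behaviour without -i).
        by_size[info.st_size].setdefault((info.st_dev, info.st_ino), path)

    for size in sorted(by_size, reverse=True):  # largest files first
        remaining = sorted(by_size[size].values())
        while remaining:
            first, rest = remaining[0], remaining[1:]
            remaining = []
            for other in rest:
                # shallow=False forces a byte-by-byte comparison.
                if filecmp.cmp(first, other, shallow=False):
                    yield size, first, other
                else:
                    remaining.append(other)  # retry against a later file
```

Files of equal size that do not match the current first file are carried over to the next round, so each group of identical files is reported against its own first member.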

Suppose you have six duplicate files of size 100 in an i-node group consisting of the three i-nodes with numbers 10, 20 and 30 (the term i-node group has nothing to do with file systems - it merely refers to a set of i-nodes addressing files with identical contents):

% ls -i
   10 file1     20 file4     30 file6
   10 file2     20 file5
   10 file3
% ls | samefile
100     file1   file4   =       3       2
100     file1   file6   =       3       1

The sum of the sizes in the first column is the amount of disk space you could gain by making all six files links to a single file, or by removing all but one of the files. To be precise, disk space is allocated in blocks, so you will probably gain two blocks here rather than 200 bytes. Note that it is not enough to just remove file4 and file6: you would gain only 100 bytes, because file5 still exists. The proper way is to use the -i option. The output will look like:

100     file1   file4   =       3       2
100     file1   file5   =       3       2
100     file1   file6   =       3       1

Removing all files listed in the third field will leave only file1 (and its hard links file2 and file3). Making all files hard links to file1 is easy: if the fourth field is "=", do a forced hard link. If you need to know about all combinations of duplicate files, use both the -i and -x options. This produces:
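The forced hard link step can be automated. The sketch below is one possible approach, assuming tab-separated six-field lines such as those produced with -i. The function name relink and the ".samefile-tmp" scratch suffix are hypothetical. It links to a temporary name first and then renames it over the duplicate, so the duplicate name never disappears.

```python
import os


def relink(lines, sep="\t"):
    """Replace each duplicate on the same device with a hard link.

    Returns the names that were left alone because the pair spans devices.
    """
    skipped = []
    for line in lines:
        _size, keep, duplicate, device, _l1, _l2 = line.rstrip("\n").split(sep)
        if device != "=":
            skipped.append(duplicate)  # hard links cannot cross devices
            continue
        if os.path.samefile(keep, duplicate):
            continue                   # already the same i-node
        temporary = duplicate + ".samefile-tmp"  # hypothetical scratch name
        os.link(keep, temporary)          # new name for the kept file
        os.replace(temporary, duplicate)  # atomically swap it into place
    return skipped
```

Try this on a copy of the data first, because the contents of every replaced file are gone once the rename succeeds.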

% ls | samefile -ix
100     file1   file4   =       3       2
100     file1   file5   =       3       2
100     file2   file4   =       3       2
100     file2   file5   =       3       2
100     file3   file4   =       3       2
100     file3   file5   =       3       2
100     file1   file6   =       3       1
100     file2   file6   =       3       1
100     file3   file6   =       3       1
100     file4   file6   =       2       1
100     file5   file6   =       2       1
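The number of lines in the -ix listing follows from pairing every filename of one i-node with every filename of each other i-node. A small sketch, using the i-node numbers and filenames from the ls -i listing above (the function name all_pairs is hypothetical):

```python
from itertools import combinations, product

# Filenames per i-node, taken from the `ls -i` listing above.
names_by_inode = {
    10: ["file1", "file2", "file3"],
    20: ["file4", "file5"],
    30: ["file6"],
}


def all_pairs(groups):
    """Return every cross-i-node filename pair (the m x n case)."""
    pairs = []
    for left, right in combinations(groups.values(), 2):
        pairs.extend(product(left, right))
    return pairs


pairs = all_pairs(names_by_inode)
print(len(pairs))  # 3*2 + 3*1 + 2*1 = 11
```

The count of 11 matches the number of lines in the listing.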

Files

/tmp/samefile/<pid>: When the list is too large to fit into memory, samefile tries to temporarily store the paths on disk by creating files within the directory /tmp/samefile/<pid>.
/tmp/samearchive/<pid>: When the list is too large to fit into memory, samearchive tries to temporarily store the paths on disk by creating files within the directory /tmp/samearchive/<pid>.

Examples

Find all duplicate files in the current working directory:

% ls | samefile -i

Find all duplicate files in my HOME directory and subdirectories and also tell me if there are hard links:

% find $HOME -type f -print | samefile -r

Find all duplicate files in the /usr directory tree that are bigger than 10000 bytes and write the result to /tmp/usr (that one is for the sysadmin folks; you may want to put this command in the background with the ampersand &, because it takes a few minutes):

% find /usr -type f -print | samefile -g 10000 > /tmp/usr

Find all duplicate files within the system archives that live within the current working directory:

% find /path/to/backup/system-* | samearchive system-*
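The rule that samearchive only reports pairs with the same relative path from the archive base can be illustrated as follows. This is a sketch under assumptions, not the program's logic: the function name same_relative_path is hypothetical, and requiring the two files to sit in different archives is an assumption.

```python
import os


def same_relative_path(first, second, archives):
    """True if both files sit at the same path below different archives."""
    def split(path):
        # Find the archive that contains path and the path relative to it.
        for base in archives:
            rel = os.path.relpath(path, base)
            if rel != os.pardir and not rel.startswith(os.pardir + os.sep):
                return base, rel
        return None, None

    base1, rel1 = split(first)
    base2, rel2 = split(second)
    if base1 is None or base2 is None:
        return False
    return base1 != base2 and rel1 == rel2
```

For example, system-1/etc/hosts and system-2/etc/hosts qualify as a pair, while system-1/etc/hosts and system-2/etc/passwd do not.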

Diagnostics

inaccessible: path
This is probably due to a 'permission denied' error on files or directories within the given path for which you have no read permission.

unreadable: path
The file could be opened for reading, yet reading it failed. You should not encounter such warnings, but if you receive more than a few, this may well be due to a failing hard disk.

<file.cpp>:<line> message
You can encounter such errors when you have compiled the port with debugging information. Please report such messages to the author, with relevant information about how to reproduce the bug.

memory full: written amount path to disk
The memory was full and a number of paths were temporarily written to disk.

memory full: changed minimum file size to number
The memory was full and the program couldn't temporarily write paths to disk, so it raised the minimum file size to the given number. Later you can rerun the program with the option -m to check the paths that were skipped as a result.

memory full: aborting... to manny files with the same size
There were too many files with the same size to fit into memory from this point on. Try to split the list up and run the program multiple times.

Notes

Input filenames must not have leading or trailing white space unless the white space is part of the filename.

History

samefile was first written by Jens Schweikhardt in 1996. It was rewritten by Alex de Kruijff in 2009 to improve performance. The rewrite also made the program able to handle memory allocation problems caused by large lists and added some options.

Bugs

The list is not sorted properly when using the option -x. This is not a bug but a feature: proper sorting would consume vast amounts of either memory or time. The sorting options are there just to control the output. For example, use -Zt if you intend to link to the file that was most recently modified; you will find that file on the left.

Author

Alex de Kruijff

SameSame

Squeeze every bit out of your hard disk