Name
samefile - find duplicate files
samearchive - find duplicate files, while keeping archives intact
Synopsis
samefile [-a | -A | -At | -L | -Z | -Zt] [-g size] [-l | -r] [-m size] [-S sep] [-0HiqVvx]
samearchive [-a | -A | -At | -L | -Z | -Zt] [-g size] [-l | -r] [-m size] [-S sep] [-0HiqVv] dir1 dir2 [...]
Description
These programs read a list of filenames (one filename per line) from stdin and write the duplicate files to stdout. samearchive is written for the special case where each directory acts as a backup archive. Its output only contains filename pairs that have the same relative path from the archive base. Therefore the output of samearchive is a subset of the output of samefile.
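The relative-path rule can be pictured as a filter over samefile output. The sketch below is only an illustration of the rule's effect, not how samearchive is implemented: same_relpath is a hypothetical helper, it assumes tab-separated lines whose second and third fields are the two filenames, and it handles exactly two archive directories.

```shell
# Hypothetical helper, not part of samefile or samearchive.
# Usage: samefile-style output on stdin | same_relpath dir1 dir2
same_relpath() {
    awk -F'\t' -v a="$1" -v b="$2" '
        # Return path relative to base, or "" if path is not below base.
        function rel(path, base) {
            if (index(path, base "/") == 1)
                return substr(path, length(base) + 2)
            return ""
        }
        {
            r1 = rel($2, a); r2 = rel($3, b)
            # The pair may be listed in the other archive order.
            if (r1 == "") { r1 = rel($2, b); r2 = rel($3, a) }
            # Keep the line only when both relative paths are identical.
            if (r1 != "" && r1 == r2) print
        }'
}
```

A pair such as system-1/etc/hosts and system-2/etc/hosts passes the filter, while system-1/etc/hosts and system-2/tmp/x does not.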
The output consists of six fields: the size in bytes, two filenames (with identical contents), the character = if the two files are on the same device or X otherwise, and the link counts of the two files. The output is sorted in reverse order by size as the primary key, with a secondary key that depends on the user input.
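Because the fields are separated by a tab by default, the output is easy to post-process with awk(1). As a minimal sketch, the hypothetical helper cross_device_pairs below prints only the pairs marked X in the fourth field. Such pairs cannot be turned into hard links, since hard links cannot span devices. The helper assumes the default separator and filenames without tabs or newlines.

```shell
# Hypothetical helper: read default (tab-separated) samefile output on stdin
# and print the two filenames of every pair that lives on different devices.
cross_device_pairs() {
    awk -F'\t' 'NF == 6 && $4 == "X" { printf "%s\t%s\n", $2, $3 }'
}
```

It would be used as the last stage of a pipeline, for example after "find ... | samefile".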
Options
- -0
- Indicates that the input list of file names is NUL terminated, for example as generated by implementations of find(1) that support the -print0 option. Without this option, the file names are assumed to be newline terminated.
- -A
- Sort filenames alphabetically. (default)
- -At
- Sort filenames chronologically using the modification date (oldest first). This option is not available when you’ve compiled the application with the low memory profile.
- -a
- Do not sort files with same size alphabetically.
- -g size
- Compare only files with size greater than size bytes. (Default is 0.)
- -H
- Print human-friendly statistics when at verbose level 2.
- -i
- Allow files with the same device/i-node pair to be added to the binary tree. This might be useful if output will be fed into some other program.
- -L
- Sort filenames in reversed natural order using the number of times the file was hard linked.
- -l
- Do not report whether duplicate files are hard linked. This option reverses the effects of the -r option.
- -m size
- Compare only files with size less than or equal to size bytes. The default is 0, which indicates there is no limit.
- -q
- Keep the information you receive during processing to a minimum. (Verbose level 0)
- -r
- Report whether duplicate files are hard linked. The separator string followed by the [bracketed] link count is appended to each name pair if they are hard links created with ln(1). This option is incompatible with the -l option. Note that this kind of output has only four fields and will appear unsorted before the actual output of samefile.
- -S sep
- Use string sep as the output field separator; the default is a tab character. Useful if filenames contain tab characters and output must be processed by another program, say awk(1).
- -V
- Print the version information and exit.
- -v
- This option increases the amount of information you receive while running samefile. At level 0 you will just see the error messages. At level 1 you will also see warning messages indicating that samefile couldn’t do something. At level 2 you will also receive information about the stages that samefile enters and some statistics when samefile finishes. Defaults to verbose level 1.
- -x
- By default the program prints just 1 x n lines for each set of matches, but when this option is used the program prints m x n lines for each set of matches. (For example, when using the option -i, if two files match and one is hard linked twice and the other is hard linked three times, then you will get 6 lines instead of just 2 or 3.)
- -Z
- Sort filenames in reversed alphabetical order.
- -Zt
- Sort filenames in reversed chronological order using the modification date (youngest first). This option is not available when you’ve compiled the application with the low memory profile.
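As an illustration of the -S option above: when filenames may contain tabs, a separator that does not occur in any filename keeps the output parseable. The sketch below assumes the output was produced with -S and the same separator string; dup_names is a hypothetical helper that splits each line on that literal string (not a regular expression) and prints the third field.

```shell
# Hypothetical helper: read samefile output produced with "-S SEP" on stdin
# and print the third field (the second filename of each pair).
# Usage: ... | samefile -S '::' | dup_names '::'
dup_names() {
    awk -v sep="$1" '
    {
        # Split the line on the literal separator string.
        n = 0; rest = $0
        while ((i = index(rest, sep)) > 0) {
            field[++n] = substr(rest, 1, i - 1)
            rest = substr(rest, i + length(sep))
        }
        field[++n] = rest
        # Only lines with the full six fields are reported.
        if (n == 6) print field[3]
    }'
}
```

The separator itself must not appear in any filename, otherwise a line splits into more than six fields and is skipped by this sketch.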
Internals
These programs use two stages to give optimum performance.
In the first stage, all non-plain files are skipped (directories, devices, FIFOs, sockets, symbolic links), as well as files for which stat(2) fails and files whose size is less than or equal to the -g size or greater than the -m size.
When the memory is full, samefile will try to store part of the filenames temporarily in /tmp/samefile/<pid>. When samefile is not able to do this, it will raise the minimum size and remove paths from memory accordingly.
In the second stage, the filenames that are hard linked are reported first, assuming option -r was passed to the program. After this, the files are compared and identical files are reported.
For any i-node only one filename will be added (unless -i was requested.)
For each two matching i-nodes, n lines are printed that show the first filename of the first i-node matched against all the filenames of the second i-node. Note however, that because only the first filename per i-node gets into the second stage, the output for a group of duplicate files with different i-node numbers is also minimized.
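The two-stage idea can be sketched in plain shell. This is only an illustration of the general technique (filter and group by size first, compare contents second), not the programs' actual implementation: find_dups_sketch is a hypothetical function, it ignores i-nodes and link counts, keeps no in-memory tree, and prints three fields instead of six.

```shell
# Hypothetical sketch, not part of samefile: read one filename per line on
# stdin and print "size<TAB>first<TAB>duplicate" for files with equal contents.
find_dups_sketch() {
    tab=$(printf '\t')
    reps=$(mktemp) || return 1

    # Stage 1: keep non-empty plain files only and record "size<TAB>name".
    while IFS= read -r name; do
        [ -f "$name" ] && [ ! -L "$name" ] || continue
        size=$(wc -c < "$name") || continue
        [ "$((size))" -gt 0 ] || continue
        printf '%s\t%s\n' "$((size))" "$name"
    done |
    # Largest sizes first, names alphabetically within one size.
    LC_ALL=C sort -t "$tab" -k1,1nr -k2 |
    # Stage 2: compare contents only among files of the same size.
    {
        prev_size=
        while IFS="$tab" read -r size name; do
            if [ "$size" != "$prev_size" ]; then
                : > "$reps"          # new size group: forget old representatives
                prev_size=$size
            fi
            match=
            while IFS= read -r rep; do
                if cmp -s -- "$rep" "$name"; then
                    match=$rep
                    break
                fi
            done < "$reps"
            if [ -n "$match" ]; then
                printf '%s\t%s\t%s\n' "$size" "$match" "$name"
            else
                # First file seen with this content: remember it.
                printf '%s\n' "$name" >> "$reps"
            fi
        done
    }
    rm -f "$reps"
}
```

Grouping by size first means cmp(1) only runs on files that could possibly be identical, which is the point of splitting the work into two stages.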
Suppose you have six duplicate files of size 100 in an i-node group consisting of the three i-nodes with numbers 10, 20 and 30 (the term i-node group is not a file system concept; it merely refers to a set of i-nodes addressing files with identical contents):
% ls -i
10 file1 20 file4 30 file6
10 file2 20 file5
10 file3
% ls | samefile
100 file1 file4 = 3 2
100 file1 file6 = 3 1
The sum of the sizes in the first column is the amount of disk space you could gain by making all 6 files links to only one file or by removing all but one of the files. To be precise, disk space is allocated in blocks, so you will probably gain two blocks here rather than 200 bytes. Note that it is not enough to just remove file4 and file6 (you would gain only 100 bytes because file5 still exists.) The proper way is to use the -i option. The output will look like:
100 file1 file4 = 3 2
100 file1 file5 = 3 2
100 file1 file6 = 3 1
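The sum over the first column mentioned above applies to the default output (without -i), where each line stands for one redundant i-node. A minimal sketch of that arithmetic, assuming tab-separated output; reclaimable_bytes is a hypothetical helper name:

```shell
# Hypothetical helper: add up the first field (size in bytes) of default
# samefile output read from stdin and print the total.
reclaimable_bytes() {
    awk -F'\t' '{ total += $1 } END { printf "%.0f\n", total }'
}
```

For the two lines printed by the plain "ls | samefile" run in this example the total is 200 bytes. As noted, the real gain is counted in blocks.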
Removing all files listed in the third field will leave only file1. Making all files hard links to file1 is easy. If the fourth field is a =, do a forced hard link. If you need to know about all combinations of duplicate files, then you use both the -i and -x options. This produces:
% ls | samefile -ix
100 file1 file4 = 3 2
100 file1 file5 = 3 2
100 file2 file4 = 3 2
100 file2 file5 = 3 2
100 file3 file4 = 3 2
100 file3 file5 = 3 2
100 file1 file6 = 3 1
100 file2 file6 = 3 1
100 file3 file6 = 3 1
100 file4 file6 = 2 1
100 file5 file6 = 2 1
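The forced hard link described above can be sketched as a small loop over "samefile -i" output. This is an illustration under assumptions, not a tool shipped with samefile: relink_duplicates is a hypothetical function, it expects the default tab separator, and it will misparse filenames containing tabs or newlines. It replaces files, so try it on a copy first.

```shell
# Hypothetical helper: read six-field, tab-separated samefile output on stdin
# and replace the file in the third field by a hard link to the second field.
relink_duplicates() {
    tab=$(printf '\t')
    while IFS="$tab" read -r size keep dup dev links_keep links_dup; do
        # Only pairs marked "=" are on the same device and can be hard linked.
        [ "$dev" = "=" ] || continue
        ln -f -- "$keep" "$dup"
    done
}
```

A possible invocation is "ls | samefile -i | relink_duplicates", after which every file named in the third field shares the i-node of the file in the second field.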
Files
- /tmp/samefile/<pid>
- When the list is too large to fit into memory, samefile tries to temporarily store the paths on disk by creating files within the directory /tmp/samefile/<pid>.
- /tmp/samearchive/<pid>
- When the list is too large to fit into memory, samearchive tries to temporarily store the paths on disk by creating files within the directory /tmp/samearchive/<pid>.
Examples
Find all duplicate files in the current working directory:
% ls | samefile -i
Find all duplicate files in my HOME directory and subdirectories and also tell me if there are hard links:
% find $HOME -type f -print | samefile -r
Find all duplicate files in the /usr directory tree that are bigger than 10000 bytes and write the result to /tmp/usr (this one is for the sysadmin folks; you may want to put this command in the background with the ampersand & because it takes a few minutes):
% find /usr -type f -print | samefile -g 10000 > /tmp/usr
Find all duplicate files within the system archives that live within the current working directory:
% find /path/to/backup/system-* | samearchive system-*
Diagnostics
- inaccessible: path
- This is probably due to a ’permission denied’ error on files or directories within the given path for which you have no read permission.
- unreadable: path
- The file could be opened for reading, yet reading it failed. You shouldn’t encounter such warnings, but if you do, and receive more than a few, this could very well be due to a failing hard disk.
- <file.cpp>:<line> message
- You can encounter such errors when you’ve compiled the port with debugging information. Please report such messages to the author, with relevant information about how to reproduce the bug.
- memory full: written amount path to disk
- The memory was full and a number of paths were temporarily written to disk.
- memory full: changed minimum file size to number
- The memory was full and the program couldn’t temporarily write paths to disk, so it raised the minimum file size to the given number. At a later time you could rerun the program using the option -m to check the paths that were, or are going to be, skipped as a result.
- memory full: aborting... to manny files with the same size
- There were just too many files with the same size to fit into memory from this point on. Try to split the list up and then run the program multiple times.
See Also
samearchive-lite(1) sameln(1) samesame(1) find(1) ls(1)
Notes
Input filenames must not have leading or trailing white space unless the white space is part of the filename.
History
samefile was first written by Jens Schweikhardt in 1996. It was rewritten by Alex de Kruijff in 2009 to improve performance. In addition, the program can now handle memory allocation problems caused by large lists and has gained some additional options.
Bugs
The list is not sorted properly when using the option -x. This is not a bug but a feature: proper sorting would consume vast amounts of either memory or time. The sorting options are there just to control the output. (For example, use -Zt if you intend to link with the file that was most recently modified. You will find that file on the left.)
Author
Alex de Kruijff