How to find file duplicates only matching size, not md5

Hey guys, first post here.
I thought I'd share this, since it was quite useful for me. It may not be the most elegant solution, but it worked for me.
So, I have a terabyte (or so) of data, spanning 51,000+ files. Some are larger than 4GiB; many, if not most, are ~4MiB. I have a (relatively) slow 1.6GHz Athlon, which doesn't really like md5-ing huge files, and I didn't need md5, just a size comparison. Unfortunately, none of the duplicate-file-finding tools I tried (fslint, fdupes, dmerge, and dupseek) had an option to skip hashing (md5, sha1, or otherwise), or if they did, I did not find it.
So I decided to use fdupes (it's nice and simple for my needs) and trick it.
I renamed /usr/bin/md5sum as md5sum.old, and created /usr/bin/md5sum with the following content:
#!/bin/bash
# Stand-in for md5sum: hash the du-reported size of the file, not its contents.
du "$1" | awk '{print $1}' | md5sum.old
Yes, my bash skills suck, and there are probably better ways of doing it, but again, it worked in this situation. It takes the du output for the file passed to md5sum, cuts that down to just the size field, and md5s that instead of the file contents. That is much faster, and the resulting hash depends only on the size. Note that plain du reports disk usage in blocks rather than the exact byte count, so files of slightly different lengths can come out as the same size.
So now, you just run (e.g.):
fdupes -r /path/to/files > /path/to/duplicatefilelist
The -r is for recursive, and I'm redirecting it to a file because I know there will be a lot of dupes. If you don't have many, you could skip that.
Then fdupes scans the files and shows a nice progress percentage so you can see how much is done. After it finishes, you'd want to remove the fake /usr/bin/md5sum and restore the original md5sum.
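For comparison, a size-only check can also be sketched without touching /usr/bin/md5sum at all. This is only a rough sketch, not what fdupes does internally: it assumes GNU find (for -printf) and file names without embedded newlines, and same_size_groups is a made-up helper name.

```shell
#!/bin/bash
# Print groups of files under a directory that share the same size in bytes.
# No hashing is done, so "same size" is the only criterion.
# Assumes GNU find (-printf) and file names without embedded newlines.
# same_size_groups is a hypothetical helper name.
same_size_groups() {
    find "$1" -type f -printf '%s\t%p\n' |
        sort -n |
        awk '{
            size = $1
            path = substr($0, length($1) + 2)   # everything after "size<TAB>"
            if (NR > 1 && size == prev_size) {
                if (!in_group) { print prev_path; in_group = 1 }
                print path
            } else {
                if (in_group) print ""          # blank line between groups
                in_group = 0
            }
            prev_size = size
            prev_path = path
        }'
}
```

Usage would look like `same_size_groups /path/to/files > /path/to/duplicatefilelist`. Files in the same group only match on size, so they are candidates to check, not confirmed duplicates.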
I hope this is useful to someone else, and please share if there's a better way of doing this.
Thanks!

SahibBommelig wrote:
You also might want to try rmlint (self-advertisement...) or rdfind, which both are able to find twins much faster than fdupes.
If you really want only to compare the size you'd better go with karol's method.
Ah, I find that md5s aren't that slow after all (yes, I realize that rmlint is smarter than simply md5-ing every file). rmlint is very nice, thank you!
One thing though... can I specify a "preferred" dir (when passing multiple dirs to rmlint), so that it holds the "originals" and the "duplicates" come from the other directories?
Thanks.

Similar Messages

  • How to find files in your iTunes Library & NOT in Your iTunes Music folder AND vice versa.

    Finding Files in iTunes Music folder that are not Listed in iTunes
    Make 2 smart playlists and 1 static playlist as follows:
    1. Make a smart playlist called “All Files” with this rule: “Artist” is not “123456789″ (or any nonsense name that won’t be in your library).
    2. Make a static playlist called “All Live Files”.
    3. Make a smart playlist called “Missing Files” with these rules: Match all of the following rules, Playlist is “All Files”, Playlist is not “All Live Files”
    Perform the following operations:
    4. Select all the files from “All Files” and drag them into “All Live Files”. The dead files marked with ! will not copy over.
    5. “Missing Files” will contain all of your dead files. Select all and delete (Option+Delete). Voila, a nice clean iTunes library.
    I keep these three playlists in their own folder. Whenever I gather more than a couple dead tracks for whatever reason, I delete all the tracks in “All Live Files” and repeat steps 4 and 5.
    Finding Files in iTunes Library That Are Not Present in iTunes Music Folder
    Search on the ‘Net for an Applescript called “Music Folder Files Not Added.” This will be at a site called Doug’s Applescripts and costs a mere $1.99. Install and follow instructions. This script also happens to clean up gremlins such as Podcasts that won’t go away, etc.
    <Edited by Host>

    From the top of the page where the scripts live...
    The general method of use is to download the script to a folder of your choice, e.g. your Desktop, Downloads folder or create a folder at ...\iTunes\Scripts. Select a playlist or highlight some tracks in iTunes and then double-click on the script to execute it. If no specific tracks are selected the script will try to work with all tracks in the current playlist. Some scripts offer a choice of track by track confirmation of changes or fully automatic processing of the selection. Many of the scripts can optionally display a progress bar while running.
    You are strongly advised to backup your entire library, or at the very least the iTunes Library.itl file, before use. Test the behaviour of your chosen script on a small group of files first to make sure it does what you want before applying it to large numbers of files.
    Most builds of Windows will execute *.vbs scripts when you double-click them. If that doesn't happen then you might need to visit the Add/Remove Programs or Programs & Features control panel to enable the Windows Scripting Host. I can track down details if you have issues.
    Backing up and restoring data is an area that is often glossed over. Most people don't try to learn much about it until they've lost something important. (Me too.)
    If your iTunes library was in the usual layout, then copying the whole iTunes folder from the User's Music folder on the old computer to the User's Music folder on the new one will usually work fine. Ideally this is done before iTunes is installed so that there is no empty library to replace, and all settings are picked up from the old library. This post contains more details.
    tt2

  • How to find & delete duplicates?

    How to find & delete duplicates?

    So far as I can tell, the only software solution for finding and removing duplicates is this application by Brattoo. I have yet to try it, but I'm going to purchase it tonight and give it a go. I will return and let you know how it works.
    http://brattoo.com/propaganda/#photos
    Key Features (from their website):
    Detects duplicates
    Detects imported thumbnails
    Detects imported thumbnails of faces
    Detects missing image files
    Detects corrupt JPEGs
    Easily find and annihilate duplicates created internally by Photos or during import.
    Compare images using different algorithms to detect and understand differences.
    Detect duplicates using effective algorithms using electronic checksums like SHA1.
    Detect duplicates by using file specific meta data such as filename, dimensions, filesize, Exif creation date or date of creation.
    Mark duplicates with a description to make them easily found using Photos features like search or smart folders.
    Makes your Photos slimmer and faster.
    Free updates!

  • Regarding find the duplicates using match transformation

    Hi ,
    I want to find duplicates on multiple fields. How can I pass the input to the match transformation? So far I have merged the input fields and passed the result as a single input. Is there any alternative way, without concatenating, to give input to the match transformation and find the duplicates?
    Thanks & Regards,
    Ramana.

    Hi Sarthak,
    Thanks for your response. I am not looking for cross-field duplicates. I want to find duplicates on multiple fields, not across the fields.
    E.g.: take customer data
    CUSTOMER NO  CUSTNAME       STREET           CITY    COUNTRY  POBOX  ZIPCODE  TELEPHONE
    1000         C.E.B. BERLIN  Kolping Str. 15  Berlin  DE                       06894/55501-0
    1001         C.E.B. BERLI   Kolping Str. 15  Berlin  DE                       06894/55501-0
    1002         C.E.B. BERLIN  Kolping Str.     Berlin  DE                       06894/55501-0
    1003         C.E.B. BERLIN  Kolping Str. 15  Berli   DE                       06894/55501-0
    Here all 4 records are potential duplicates: in the second record the customer name is missing its last character N, in the third record the street is missing the street number, and in the fourth record the city is missing its last character n. If we concatenate the fields and match on the result, these are found as duplicates. Is there any way to find duplicates in this type of data without concatenating?
    Thanks in Advance.
    Regards,
    Ramana.

  • How to find files and subdirectories in a directory

    Can anyone tell me how to find files and subdirectories in a directory?

    Here's a code snippet,
    http://javaalmanac.com/egs/java.io/TraverseTree.html

  • How to find the duplicate modifiers against the same SKU

    Hi Gurus,
    How can I find the duplicate modifiers existing against the same SKU?
    Any help or advice would be appreciated.
    Thanks
    Yasar

    here is an example that might be of help.
    SQL> select * from employees;
    YEAR EM NAME       PO
    2001 02 Scott      91
    2001 02 Scott      01
    2001 02 Scott      07
    2001 03 Tom        81
    2001 03 Tom        84
    2001 03 Tom        87
    6 rows selected.
    SQL> select year, empcode, name, position,
      2         row_number() over (partition by year, empcode, name
      3                            order by year, empcode, name, position) as rn
      4    from employees;
    YEAR EM NAME       PO         RN
    2001 02 Scott      01          1
    2001 02 Scott      07          2
    2001 02 Scott      91          3
    2001 03 Tom        81          1
    2001 03 Tom        84          2
    2001 03 Tom        87          3
    6 rows selected.
    SQL> Select year, empcode, name, position
      2    From (Select year, empcode, name, position,
      3                 row_number() over (partition by year, empcode, name
      4                                    order by year, empcode, name, position) as rn
      5            From employees) emp
      6   Where rn = 1;
    YEAR EM NAME       PO
    2001 02 Scott      01
    2001 03 Tom        81
    SQL>

  • HT3231 how to find file after using migration

    how to find files after migration has finished

    If you don't need any files from your old account, open System Preferences > Users and Groups and delete the old user. Otherwise, I recommend migrating the files from your old account to the new one using the Shared folder or a USB drive.

  • Can't find file to sync photos - why not?!?

    I recently re-organised my photos on my PC & changed the file location for the photos to sync to my iPod. I reconnected the iPod & it went through to organise the photos in a new photo cache as I would expect...but then when it went to actually move the pictures onto the iPod, I get told 'can't find file' and so it does not sync
    I can see the pics in the cache, the location of the photos on iTunes is right, but they don't seem to talk to each other & so nothing happens
    anyone seen this before or have any suggestions??
    thanks
    Ian

    ok one more question
    The file syncing issue now seems to be resolved & the iPod connects to the PC without too much argument now
    the only outstanding issue is that I have some videos within my pictures. Previously these existed within the Pictures folder, but I could still play them
    However, now this doesn't happen & in the Pictures folder I just get the first frame of each video, without being able to play it. I don't know why it suddenly won't play them
    They are all in mp4 format. As most are home videos, I wanted to keep them in the Pictures folder rather than move them into the official Videos section
    thanks

  • How do we delete duplicates songs easily and not one at a time in iTunes?

    how do we delete duplicate songs easily and not one by one in iTunes?

    I've written a script called DeDuper which can help remove unwanted duplicates. See this  thread for background.
    tt2

  • Video download from SD card 2 iPad 2 interrupted, file corrupted.  Cannot find file to delete.  Does not appear in Photos, or is not visible using Windows File Manager.  Consumed 2.5G of space!  Can anyone help

    Video download from SD card to iPad 2 interrupted,  corrupted.  Cannot find file to delete.  Does not appear in Photos, or is not visible using Windows File Manager.  Consumed 2.5G of space!  Can anyone help?


  • How i find my i5c is genuine or not

    how i find my i5c is genuine or not?

    See this thread: https://discussions.apple.com/message/23719985#23719985

  • HT2905 how do i hide duplicates in itunes 10, not delete them

    how do i hide duplicates in itunes 10, not delete them

    You can't really "hide" content within the iTunes interface, though at a push you could mislabel some of your Music tracks as Audiobooks if you don't mind the wrong content showing up in that section of the library.
    Perhaps if you tell us the primary reason for wanting to hide things we might be able to suggest an alternative approach.
    tt2

  • How to find file type (Binary/Ascii) using Java IO

    Hi
    I want to find whether the file is binary file or not.
    Can any one please help.
    Thanks,
    Joy

    Encephalopathic wrote:
    I'm no IO expert, but aren't all files binary?
    Yes.
    I think what the OP is after is how to tell if a file contains only characters (e.g. ASCII characters). This can be done, more or less, by reading the file and checking whether every byte is a valid ASCII character. An example would be limiting byte values to 9 (tab), 10 (LF), 12 (FF), 13 (CR), and 32-126 (the printable range; 127 is the DEL control character).
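The byte-range check described above is not specific to Java. As a rough illustration only, here is the same idea sketched in shell: it deletes every allowed byte with tr and tests whether anything is left. is_ascii_text is a hypothetical helper name, and the allowed set (tab, LF, FF, CR, and 32-126, treating 127/DEL as non-text) is an assumption about what should count as text.

```shell
#!/bin/bash
# Succeed if every byte of the file is tab, LF, FF, CR, or in the range 32-126.
# is_ascii_text is a hypothetical helper name; the allowed set is an assumption.
is_ascii_text() {
    local leftover
    # Delete all allowed bytes (octal 11, 12, 14, 15, 40-176) and count what remains.
    leftover=$(LC_ALL=C tr -d '\11\12\14\15\40-\176' < "$1" | wc -c)
    [ "$leftover" -eq 0 ]
}
```

Something like `is_ascii_text somefile && echo text || echo binary` would then classify a file. An empty file passes the check, since it contains no disallowed bytes.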

  • How to find files added in the last 8 hours that caused disk full

    I used a program called Eagle Filer to archive about 20 Gigs of email from Entourage to a library on an external drive. Oddly, this process filled up my main (startup) disk, which had about 16 Gigs available. Clearly this thing wrote a lot of data to my startup disk as well, but I can't seem to find it. It happened in the last 8 hours. What's the best way to figure out where these data are? Because they're individual emails, each file may not be enormous, although I'm guessing the parent folder must be. Simply doing a Spotlight search for "files created in the last day" isn't revealing it. Maybe they're hidden somehow? How do I find where all this junk landed?
    thanks,
    Mike

    To find files in your home directory:
    find ~ -newerBt '8 hours ago'
    To find files on your drive that you have permission to read:
    find / -newerBt '8 hours ago' 2>/dev/null
    To find all files:
    sudo find / -newerBt '8 hours ago'
    To see the file size, add "-ls"
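Since the goal here is to find where the space went, it can help to sort the recent files by size. The following is a minimal sketch under a few assumptions: it uses -mmin (modification time, in minutes) instead of -newerBt (birth time), so it answers a slightly different question, and recent_by_size is a made-up helper name. Sizes are what du -k reports, i.e. disk usage in KiB.

```shell
#!/bin/bash
# List regular files under a directory modified within the last N minutes,
# largest first, as "size-in-KiB<TAB>path" lines.
# recent_by_size is a hypothetical helper name.
# Uses -mmin (modification time) rather than -newerBt (birth time).
recent_by_size() {
    local dir=$1 minutes=$2
    find "$dir" -type f -mmin "-$minutes" -exec du -k {} + 2>/dev/null |
        sort -rn
}
```

For the last 8 hours in your home directory, that would be `recent_by_size ~ 480 | head -n 20`.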

  • Resized partitions how to fix file system to new size.

    So I had to resize my /boot partition which required me to resize my root partition. How do I fix the file system to match the new size of the partition?

    I do not understand why this was moved. This is not a new install; I was tuning my hooks to use systemd, replacing udev, autodetect and base.
    Running
    mkinitcpio -p linux
    resulted in an error saying there was not enough space and that the image was probably not made correctly.
    I chose where to post based on the following post, which seemed to be in the same category.
    https://bbs.archlinux.org/viewtopic.php?id=173359
    Is there a place that provides a narrower definition of the topics that are acceptable for each forum category?
