Optimizing a search for duplicate files in a filesystem

We have a large directory that people have been dumping files into for years (~200k files), and I want to go through it and look for duplicate files. The first step was simply scanning for duplicate file names. I wrote a simple VI to search for duplicates, but it is very inefficient because of the many string comparisons. Even optimized as much as I could think to do, I'm only scanning about 1k files / minute. I have attached my code - any help would be appreciated.
Attachments:
search.vi 22 KB
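
If the scan compares each name against every other name, the work grows quadratically with the file count, which would explain the throughput described above. A minimal sketch of the usual alternative, a hash map keyed on file name, is shown here in Java rather than LabVIEW (the attached VI is not reproduced). The class name `DuplicateNames`, the method `findDuplicateNames`, and the case-insensitive comparison are assumptions made for illustration.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class DuplicateNames {
    /** Groups paths by lower-cased file name and keeps only names seen more than once. */
    static Map<String, List<Path>> findDuplicateNames(List<Path> files) {
        Map<String, List<Path>> byName = new LinkedHashMap<>();
        for (Path file : files) {
            Path namePart = file.getFileName();
            if (namePart == null) {
                continue; // a filesystem root has no file name
            }
            // Assumption: names are compared case-insensitively.
            String key = namePart.toString().toLowerCase(Locale.ROOT);
            byName.computeIfAbsent(key, k -> new ArrayList<>()).add(file);
        }
        // Drop every name that occurred only once.
        byName.values().removeIf(group -> group.size() < 2);
        return byName;
    }

    public static void main(String[] args) {
        List<Path> files = List.of(
            Path.of("a", "report.txt"),
            Path.of("b", "Report.txt"),
            Path.of("b", "notes.txt"));
        System.out.println(findDuplicateNames(files));
    }
}
```

Each file name is hashed once and looked up once, so the pass over 200k names is roughly linear instead of quadratic.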

This is trivial using the variant attributes mentioned earlier. Attached is my method, which uses the Variant Repository XNode I developed. I'm not sure how the performance compares.
EDIT: I'm betting the step that takes the longest will be computing the MD5. I'm sure a CRC32 could be computed faster and is probably just as good in this situation, or even a sum of the bytes of the file as a quick way of identifying the file.
Unofficial Forum Rules and Guidelines - Hooovahh - LabVIEW Overlord
If 10 out of 10 experts in any field say something is bad, you should probably take their opinion seriously.
Attachments:
Find Duplicate Files.vi 23 KB
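
The CRC32 idea from the EDIT above can be sketched in Java using the standard `java.util.zip.CRC32` class. This is not the attached VI: the class name `DuplicateContent`, the method names, and the step of grouping by file size before checksumming are assumptions added for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.CRC32;

final class DuplicateContent {
    /** CRC32 of an in-memory byte array. */
    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    /** CRC32 of a file, streamed in 64 KiB chunks so large files are never fully loaded. */
    static long crc32(Path file) {
        CRC32 crc = new CRC32();
        byte[] buffer = new byte[64 * 1024];
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                crc.update(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return crc.getValue();
    }

    /** Returns groups of files that share both size and CRC32. */
    static List<List<Path>> findCandidates(List<Path> files) {
        // Assumption: group by size first, so files with a unique size are never read.
        Map<Long, List<Path>> bySize = new LinkedHashMap<>();
        for (Path file : files) {
            try {
                bySize.computeIfAbsent(Files.size(file), k -> new ArrayList<>()).add(file);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        List<List<Path>> groups = new ArrayList<>();
        for (List<Path> sameSize : bySize.values()) {
            if (sameSize.size() < 2) {
                continue; // a unique size cannot have a duplicate
            }
            Map<Long, List<Path>> byCrc = new LinkedHashMap<>();
            for (Path file : sameSize) {
                byCrc.computeIfAbsent(crc32(file), k -> new ArrayList<>()).add(file);
            }
            for (List<Path> group : byCrc.values()) {
                if (group.size() > 1) {
                    groups.add(group);
                }
            }
        }
        return groups;
    }
}
```

A CRC32 is only 32 bits wide, so two different files can share a value. Treat each returned group as a list of candidates and confirm with a byte-for-byte comparison before deleting anything.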

Similar Messages

  • How to search for duplicate files?

    Is there a quick and easy way to search for duplicate files? (and other space hogs) I would like to do some hard drive spring cleaning. Trying to free up some space. Any suggestions? Thanks in advance, Carl

    Here are some apps to try:
    Singular, http://www.emeraldion.it/software/macosx/singular.html
    DupeCheck, http://dfay.fastmail.fm/dupecheck/ (uses the terminal stuff and Spotlight under the hood)
    FileBuddy, http://www.skytag.com/filebuddy/

  • Search for duplicate files?

    Hi all,
    I am writing a program to search the drives of a computer and return a list of duplicate files (where two or more copies of a file may exist), with an option to delete one of the files based on date, etc. I also want to be able to search by extension. I have looked at the list() methods in the File API, but have had trouble identifying directories from files. I wish to search the entire drives (all files, subfolders and hidden files).
    Thanks for any help,
    Ciara

    ... had trouble identifying directories from files ...
    You said you looked at the API. Look again. You'll see that there are isDirectory() and isFile() methods.
    If isDirectory() returns true then you'll probably want to recursively call your method.
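
The recursion suggested above can be sketched with `java.io.File` as follows. The class name `FileLister` and the method `collectFiles` are hypothetical. The sketch does not guard against symbolic-link loops and silently skips directories it cannot read.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

final class FileLister {
    /** Recursively collects every regular file under dir, hidden files included. */
    static void collectFiles(File dir, List<File> out) {
        // listFiles() returns null when dir is not a directory or cannot be read.
        File[] entries = dir.listFiles();
        if (entries == null) {
            return;
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                collectFiles(entry, out); // descend into the subfolder
            } else if (entry.isFile()) {
                out.add(entry);
            }
        }
    }

    public static void main(String[] args) {
        File root = new File(args.length > 0 ? args[0] : ".");
        List<File> files = new ArrayList<>();
        collectFiles(root, files);
        System.out.println(files.size() + " files under " + root);
    }
}
```

Filtering by extension then becomes a check on `entry.getName()` before the file is added to the list.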

  • Searching for duplicates

    In iTunes you can search for duplicates of a music file (easily). Can we do the same thing with an iMac using Leopard? If so, could someone please tell me how?
    Cheers,
    sws

    I would probably not buy it if all you want to do is search for duplicate files. If its other capabilities are of use to you, then you might consider it. I bought SC6 way back in the day, but as the OS and other utilities matured I found that I had less and less use for it.
    If for instance you only want to find duplicate photos then perhaps Duplicate Annihilator may be more appropriate?
    Again, wait and maybe others will chime in, if not then google a term like 'find duplicates' and see what comes up.

  • Tell me Logic for search for duplicate words(or strings) in a large file.

    Search for duplicate words (or strings) in a text file containing one word per line. For each word that occurs more than once in the flat file, the output should be as follows:
    <word> <number of occurrences> <line numbers in the file where the word occurs>
    For example, if the word Hello occurs thrice in a file at lines 100, 178 and 3456 the output should read
    Hello, 3, [100, 178, 3456]

    Incidentally, I wrote similar code some days back. You will need to make some modifications to get the exact output you want, but I hope it will be of some help.
    One more thing: it's written using Java 5.
    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.StreamTokenizer;
    import java.util.Collection;
    import java.util.LinkedHashSet;

    public class Test {
        private static final String COLLECTIONS_TEXT = "C:\\Documents and Settings\\amrainder\\Desktop\\Collections.txt";

        public static void main(String[] args) throws IOException {
            findDuplicateWords();
        }

        // Collects each distinct word once, in first-seen order, and prints the set.
        // It does not yet count occurrences or record line numbers.
        private static void findDuplicateWords() throws IOException {
            Collection<String> words = new LinkedHashSet<String>();
            File file = new File(COLLECTIONS_TEXT);
            StreamTokenizer streamTokenizer = new StreamTokenizer(new FileReader(file));
            int token = streamTokenizer.nextToken();
            while (token != StreamTokenizer.TT_EOF) {
                if (token == StreamTokenizer.TT_WORD) {
                    words.add(streamTokenizer.sval);
                }
                token = streamTokenizer.nextToken();
            }
            System.out.println(words);
        }
    }
    Cheers,
    Amrainder
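
One possible modification that produces the requested `<word> <number of occurrences> <line numbers>` output is to map each word to the list of lines where it appears. This is a sketch, not the poster's code: the class name `DuplicateWords` and its methods are hypothetical, and the input is assumed to be already read into a list of lines (for example with `Files.readAllLines`).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DuplicateWords {
    /** Maps each word to the 1-based line numbers where it occurs. */
    static Map<String, List<Integer>> index(List<String> lines) {
        Map<String, List<Integer>> positions = new LinkedHashMap<>();
        for (int i = 0; i < lines.size(); i++) {
            String word = lines.get(i).trim();
            if (word.isEmpty()) {
                continue; // skip blank lines
            }
            positions.computeIfAbsent(word, k -> new ArrayList<>()).add(i + 1);
        }
        return positions;
    }

    /** Formats every word seen more than once as "word, count, [lines]". */
    static List<String> report(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> entry : index(lines).entrySet()) {
            List<Integer> at = entry.getValue();
            if (at.size() > 1) {
                out.add(entry.getKey() + ", " + at.size() + ", " + at);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("Hello", "World", "Hello", "Again", "Hello");
        report(lines).forEach(System.out::println); // prints: Hello, 3, [1, 3, 5]
    }
}
```

The map needs a single pass over the file, and the `LinkedHashMap` keeps words in the order they were first seen.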

  • PSE10 Crashes when searching for duplicate photos

    Hello,
    I recently upgraded to PSE10 and imported my catalog from a different machine (which had been running XP).  Unfortunately, I ended up with a bunch of duplicate entries in the catalog, one with the path c:\users\... the other with c:\my documents\...  These paths both point to the same image on the hard drive.
    When I try to Search for Duplicate Photos (from Find>By Visual Searches), PSE invariably crashes.  I am running PSE10 under Windows 7 (full system info at the end of this message).
    Any idea why this might be happening, and is there a different way I can remove the duplicate entries in the catalog?
    Thanks,
    Chris
    Full system info:
    Elements Organizer 10.0.0.0
    Core Version: 10.0 (20110914.m.17521)
    Language Version: 10.0 (20110914.m.17521)
    Current Catalog:
    Catalog Name: My Photos
    Catalog Location: C:\ProgramData\Adobe\Elements Organizer\Catalogs\My Photos\
    Catalog Size: 52.3MB
    Catalog Cache Size: 391.7MB
    System:
    Operating System Name: Windows 7
    Operating System Version: 6.1
    System Architecture: Intel CPU Family:6 Model:5 Stepping:2 with MMX, SSE Integer, SSE FP
    Built-in Memory: 3.8GB
    Free Memory: 705.3MB
    Important Drivers / Plug-ins / Libraries:
    Microsoft DirectX Version: 9.0
    Apple QuickTime Version: Not installed
    Adobe Reader Version: 10.1
    Adobe Acrobat Version: Not installed
    CD and DVD drives:
    F: (hp DVDRAM BUS: 1 ID: 1 Firmware: GT20L)

    If I understand correctly, what you are calling 'duplicates' are duplicate entries in the catalog: the same media referenced either by its original path+filename or by a shortcut path+filename.
    The organizer will try to open two different files to compare them, but when it tries to open the second 'shortcut', the file will already be locked (in use). That could explain why the similarity comparison fails.
    Try creating a new catalog with similar images, for instance JPEGs and PNG copies. Does the visual search work?
    About backup and restore: if my supposition is right, it should work if you restore to a new custom location while keeping the file structure, because the restore should create real file duplicates on different paths. If it works, finding visual duplicates should work, although this process may take a very long time.

  • No "search for Duplicates" found in address book

    Hey, on my Intel iMac I've got no "search for duplicates" available to me in Address Book. It shows up on my MacBook and my lampstand PowerBook. All three are running 10.4.10, and Address Book is Version 3.1.2 (321).
    Also, when I sync with .Mac it doubles up all of my entries in the libraries on both computers.
    Any ideas?

    AB search should search notes. Quit Address Book and delete the file homedirectory/library/application support/address book/AddressBook-v22.abcddb. Then start Address Book and see if search works correctly now. If it doesn't, reindex your hard drive with Spotlight: add your hard drive to Spotlight's privacy pane for a few seconds and then remove it from there. Wait for reindexing to finish and try AB search again.

  • How can I search for duplicate photos in iPhoto libraries?

    How can I search for duplicate photos in iPhoto 9.5.1 library.  Installed on iMac 2.7 GHz Intel Core i5, Mavericks 10.9.5.

    These applications will identify and help remove duplicate photos from an iPhoto Library:
    iPhoto Library Manager - $29.95
    Duplicate Annihilator - $7.95 - only app able to detect duplicate thumbnail files or faces files when an iPhoto 8 or earlier library has been imported into another.
    PhotoSweeper - $9.95 - This app can search by comparing the image's bitmaps or histograms thus finding duplicates with different file names and dates.
    DeCloner - $19.95 - can find duplicates in iPhoto Libraries or in folders on the HD.
    DupliFinder - $7 - shows which events the photos are in.
    iPhoto AppleScript to Remove Duplicates - Free
    PhotoDedupo - $4.99 (App Store) - this app has a "similar" search feature which is like PhotoSweeper's bitmap comparison. It found all duplicates.
    Duplicate Cleaner for iPhoto - free - was able to recognize the duplicated HDR and normal files from an iPhone shooting in HDR
    Some users have reported that PhotoSweeper did the best in finding all of the dups in their library: iphoto has duplicated many photos, how...: Apple Support Communities.
    If you have an iPhone and have it set to keep the normal photo when shooting HDR photos, the two image files that are created will be duplicates in a manner of speaking (same image), but there are only two apps that detected the iPhone HDR and normal photos as being duplicates: PhotoSweeper and Duplicate Cleaner for iPhoto. None of the other apps detected those two files as being duplicates, as they look for file name as well as other attributes and the two files from the iPhone have different file names.
    iPLM, however, is the best all around iPhoto utility as it can do so much more than just find duplicates.  IMO it's a must have tool if using iPhoto.

  • Is it possible to search for Duplicates in Aperture v3.1

    Hi Guys,
    I'm hoping you can help me. I'm re-scanning my family photo archive at a much higher resolution, and I can't seem to find a way of searching for duplicates of the same image at different resolutions.
    I have achieved this in the past using "ExpressionMedia2" (previously called "iViewMediaPro"). This program had a "Search for Similar" command, with a slider to move from close match to exact match, which was pretty accurate no matter what size the images were.
    I'm tempted to use this program, but I'm not sure how Aperture will handle images that have been deleted by another program. And how can I strip orphaned previews out of the Aperture Library?
    I'm new to Aperture and like its way of working, but it's very powerful and does have many unintuitive commands within it.
    Hoping you guys can help me,
    billy

    Aperture has no in-library dupe detection. It only has it on import.
    As for other apps deleting files from the Aperture library -- they shouldn't do that. If they are merely deleting a file at the Finder level (one that is referenced into AP), then that image will be marked as Missing in Aperture. You can do a Find for all of those yourself, or you can make a smart album, then delete at your convenience.

  • I have 3 older ext. hard drives that I've utilized many times. Today while searching for old files, one of the three is no longer recognized by my PowerMac.  Any suggestions?

    I have 3 older ext. hard drives that I've utilized many times. Today while searching for old files, one of the three is no longer recognized by my PowerMac. The drive is not listed in Disk Utility.  Any suggestions?

    Is the computer in your equipment line:
    Dual Core Intel Xenon
    (which is not a PowerMac but a Mac Pro) the one you are asking about, or do you have an older PowerMac?
    If a Mac Pro, their forums are here:
    Mac Pro
    and, as Mac Pros have a totally different architecture from the pre-2005 Macs this forum covers, you may not have the same issues that can affect the older models. If someone didn't notice your equipment line, you could get advice that doesn't apply.
    If you really have a pre-2005 PowerMac, read on.
    If the stubborn external is USB and does not have its own power brick (i.e., it gets power only from the computer's USB ports--"bus powered"), it may not be getting enough power. As electric motors age, they can demand more power than when new, and the power available on any USB port is limited.
    The typical workarounds for making a computer recognize an aging, bus-powered USB drive are:
    Get a powered USB hub (has its own power brick)
    Get a "Y" USB cable: 1 Meter USB 2.0 A to 5 Pin Mini B Cable - Auxiliary USB "Y" Power Design for external hard drives.
    The second gets power from two USB ports on the computer and often that's enough.
    Remember that the USB ports on your keyboard seldom provide enough power even for a thumb drive, so be sure to use the USB ports on the back of the computer.

  • How Can I Search for Media files by Length?

    I want to know how I can search for media files (mov, mp4, flv and movie files in general) by their length.
    If this search function is confirmed as definitely not possible in the OS, I would appreciate someone recommending an app that might be able to perform such a search.
    Many thanks

    No possible way. Because the files may be encoded with differing codecs, compression, etc., you can only sort by file size, not the 'length' of the movie files.
    Clinton

  • How to search for a file and copy it to somewhere else in terminal

    So basically how can I search for a file on my computer named "testingtesting.txt" and copy it to my desktop using terminal? I have very little experience in terminal, so I was going to try and use the mdfind command, then store that output as a variable, and use that variable as the source for the cp command, but I feel like there is probably a much simpler method. So basically how could I find a file named "testingtesting.txt", copy it, and paste it to my desktop using terminal.

    Is there any particular reason that you must use Terminal?
    You could just download the free EasyFind from the App Store and find the file quickly. Do whatever you wish with it.
    Good luck,
    Clinton

  • How to search for a file?

    Hi, I am currently using the code below to detect a file automatically on a thumb drive. Currently I have to state the path to detect the particular file. Is there a way to search for the file on the thumb drive using Java?
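    import java.io.File;

    public class Detect {
        public static void main(String[] args) {
            File f = new File("f:\\");
            // Poll every 500 ms until the drive root exists.
            while (!f.exists()) {
                try {
                    Thread.sleep(500);
                } catch (InterruptedException e) {
                    // ignored, as in the original: keep polling
                }
            }
            System.out.println("drive inserted!");
        }
    }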

    You'll need to recursively search through a top directory and all its subdirectories for all files. Here is a link that shows how to do this.
    http://www.javapractices.com/Topic68.cjp
    You can check each drive letter from A through Z for the file (using something like new File("F:\\") and testing exists(), which returns false rather than throwing an exception when the drive is absent). Exclude the C and D drives, though, since they are usually hard drives and take forever to search through. Also, you may want to use File.separator instead of "/" or "\" in your code, since Windows XP uses one while Unix uses the other, and you would want your code to be able to run on either operating system.
    Are you building a web page with this? If so, it will not work, since your search code would be running on the server and not on the client's machine. Even if you were able to run it on the client machine via the browser, you can't because of security restrictions (end users don't want to give you access to their drives). If it's a web page, look into the HTML file upload tag, <input type="file">. This allows the end user to navigate to where the file is on their machine and upload it without you having to search for it.
    If you are writing an application instead of a web page, I think there is an applet you can use that will do the same thing as <input type="file">.
    Searching through a directory structure isn't good since it takes too long for users to wait.
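
The recursive search described above can be sketched with `java.nio.file.Files.walk` (Java 8 or later; `Path.of` in `main` needs Java 11). The class name `FileFinder`, the method `find`, the case-insensitive match, and the file name `example.txt` are hypothetical. The drive root `f:\` is taken from the question.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class FileFinder {
    /** Returns every regular file under root whose name equals fileName, ignoring case. */
    static List<Path> find(Path root, String fileName) {
        // Files.walk visits root and everything below it; the stream must be closed.
        try (Stream<Path> walk = Files.walk(root)) {
            return walk
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().equalsIgnoreCase(fileName))
                .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path root = Path.of(args.length > 0 ? args[0] : "f:\\");
        String name = args.length > 1 ? args[1] : "example.txt"; // hypothetical file name
        find(root, name).forEach(System.out::println);
    }
}
```

The walk can also fail partway through with an `UncheckedIOException` if it reaches a directory it is not allowed to read, so a production version would need to decide how to handle that case.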

  • How can i search for multiple file names (images) in bridge?

    Hello everyone!
    Does anyone know how to search for multiple file names in bridge?
    That is to copy & paste something like this: _MG_2152, _MG_2177, _MG_2194, _MG_2195, _MG_2202, _MG_2212, _MG_2219, _MG_2261, _MG_2362, _MG_2401
    That is, without entering several criteria in the search box. That takes too long.
    Thanks
    Steffen Rikenberg Photographer
    Oslo, Norway.
    www.steffenrikenberg.no

    Try this add-on: Find All, https://addons.mozilla.org/it/firefox/addon/find-all/

  • Search for a file with part of a filename in a given folder.

    How can I search for a file with part of a filename in a given folder?
    Or can I change the columns in Advanced Search so that the heading is filename?

    Thanks, Michel. It had occurred to me to put a scratch tag on the folder, but I was hoping for an easier way. In future I'll do as you suggest though, because it's easier than what I have been doing.
    daryl
