On finding duplicate images

I got a shiny new MacBook in my new job at Bakken & Baeck, and figured it was time for a fresh start, so I am decommissioning my old MacBook and, with it, a profile and files so old they used to live on a PowerBook. Most things were easy, until I got to the photos. Over the years I have imported photos to the laptop while travelling, and always tried to import them again into my real, backed-up photo archive once I got home, unless my SD card was full while travelling, or I forgot, or something else went wrong. That means I am fairly sure MOST of the photos on the laptop are also in my archive, but also fairly sure some are not.
And of course, each photo, even an out-of-focus, under-exposed test shot, is a little piece of personal memory, a beautiful little diamond of DATA, and must at all costs NOT BE LOST.

The photos are mostly in iPhoto (but not all), in a mix of the old-style layout buried in iPhoto’s own folder structures and folders I have named myself.

“Easy,” I thought. I trust the computer, and I normally use Picasa, which will detect duplicates when importing! Using


find . -iname \*.jpg -print0 | xargs -0 -I{} cp -v --backup=t {} /disks/1tb/tmp/photos/

I can copy all JPGs from the Pictures folder into one big folder without overwriting files with duplicate names (I ❤ coreutils?), then let Picasa sort it out for me.

Easily done: 7500 photos in one folder. Picasa thinks a bit and detects some duplicates, but not nearly enough. Several photos I KNOW are archived are not flagged as dupes. I give up trusting Picasa. I know who I can trust:

md5sum!

(In retrospect, I should have trusted Picasa a bit, and at least removed the ones it DID claim were duplicates.)

So, next step: compute the md5sum of all 7500 new photos and of all 45,000 already archived photos. Write the shell script, go to work, return, write the Python to find all duplicates, delete the matching ones from the laptop.
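For the curious, the checksum step can be sketched in Python alone. This is not the script I actually ran, just a minimal illustration of the idea; the function names (md5_of_file, iter_files, find_exact_duplicates) are made up for this example, and it hashes every file under both folders without filtering on extension.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def iter_files(root):
    """Yield the path of every file below root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)


def find_exact_duplicates(new_root, archive_root):
    """Return files under new_root whose exact bytes also exist under archive_root."""
    # Hash the archive once and keep only the digests.
    archived = {md5_of_file(path) for path in iter_files(archive_root)}
    # A laptop file is a duplicate if its digest is already in the archive.
    return sorted(p for p in iter_files(new_root) if md5_of_file(p) in archived)
```

Only identical bytes match here, so a file that was re-encoded or had its metadata touched will not be caught, whatever it is called.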

Success! 3700 duplicates gone! But wait! There are still many photos I know for a fact are duplicates. I pick one at random and inspect it. It IS the same photo, but one JPG is 3008×2008 and the other is 3040×2024, and the white balance is very slightly different. Now I understand: back when I had more time, I shot exclusively in RAW, and these are two JPGs produced from the same RAW file, one by iPhoto, one by UFRaw. The iPhoto one is slightly smaller and has worse color. Bah.

Now it’s getting late, my Friday night is slipping away between my fingers, but I am damned if I give up now. Next step: EXIF data! Both files have EXIF intact, and both are (surprise!) taken at the same time. Now, I don’t want to go and look up the EXIF tag on all 45,000 archived photos just now, but I can filter by filename: if two files have the same basename (IMGP1234) AND are taken at the same time, I am willing to risk deleting one of them.

So with the help of the EXIF parsing library from https://github.com/ianare/exif-py and a bit of Python, it is done! Some ~3500 more duplicates removed!
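A rough sketch of that basename-plus-timestamp filter might look like the following. Treat it as an illustration under assumptions rather than the exact script: the helper names (photo_stem, index_by_stem, find_exif_duplicates) are invented here, I am assuming the exif-py package is installed under its import name exifread, and that the capture time sits in the "EXIF DateTimeOriginal" tag. It also strips the .~1~ style suffixes that cp --backup=t leaves behind.

```python
import os
import re
from collections import defaultdict

# Suffix added by GNU cp --backup=t, e.g. IMGP1234.jpg.~1~
BACKUP_SUFFIX = re.compile(r"\.~\d+~$")


def photo_stem(filename):
    """Return the upper-cased basename (e.g. IMGP1234) of a JPEG, else None."""
    stem, ext = os.path.splitext(BACKUP_SUFFIX.sub("", filename))
    return stem.upper() if ext.lower() in (".jpg", ".jpeg") else None


def exif_timestamp(path):
    """Return the EXIF capture time as a string, or None if the tag is missing."""
    import exifread  # third-party (exif-py); assumed to be installed
    with open(path, "rb") as f:
        tags = exifread.process_file(f, details=False)
    tag = tags.get("EXIF DateTimeOriginal")
    return str(tag) if tag is not None else None


def index_by_stem(root):
    """Map each photo basename to the list of JPEG paths below root."""
    index = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            stem = photo_stem(name)
            if stem is not None:
                index[stem].append(os.path.join(dirpath, name))
    return index


def find_exif_duplicates(new_root, archive_root, read_timestamp=exif_timestamp):
    """Return laptop photos sharing both basename and capture time with an archived one."""
    archive = index_by_stem(archive_root)
    duplicates = []
    for stem, new_paths in index_by_stem(new_root).items():
        candidates = archive.get(stem, [])
        if not candidates:
            continue  # no name match, so no archived EXIF needs to be read
        archived_times = {read_timestamp(p) for p in candidates}
        archived_times.discard(None)
        for path in new_paths:
            taken = read_timestamp(path)
            if taken is not None and taken in archived_times:
                duplicates.append(path)
    return sorted(duplicates)
```

Matching on the filename first means EXIF is only read from archived files that already share a basename with a laptop photo, which avoids parsing all 45,000 of them. Photos without a timestamp are never reported as duplicates, so when in doubt they are kept.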

I am left with 202 photos that may actually be SAVED FROM ETERNAL OBLIVION! (Looking more carefully, about half are actually out-of-focus or test shots, or nonsense I probably DID copy to the archive but then deleted.) It was certainly worth spending the entire evening 3 days in a row on this!

Now I can go back to hating Hotmail for having deleted all the emails I received pre-2001…

One comment.

  1. You did it! The professional way. Otherwise you would always, until your deathbed, have thought that this one great shot might be lost.

    Did I tell you I accidentally deleted the best shots I ever took when copying the files from CF to laptop? Full-moon, midnight, long-exposure group shots with all my friends in great poses, looking like daylight postcards on a remote Caribbean beach. Impossible to replicate. See, that’s how you delete priceless photos: drunk in the Caribbean.
