
Image data (only) checksum/digest

Posted: Sun Sep 01, 2019 8:32 pm
by Widgets
For one of my projects, I have ended up with many images (more than can be visually/manually compared in a reasonable time frame) scattered across my older and more recent PCs on the LAN. Most of these images would have started out from one base image, but over time they might have been worked on by various utilities, commercial and otherwise, to clean them up, add some metadata, etc. (Of course, image format changes, such as PNG <> JPG, are excluded from this comparison.)
While it is easy enough to find out where files with the same name are on a network using some of the dup finder utilities, I would like to be able to determine which files contain identical image data, even though the metadata might differ.
This means that a comparison of overall image file digest/checksum won't give the answers I am looking for.

Instead, what I am trying to do is get a checksum/digest for the image data only. Unfortunately, I am not well enough versed in how wxWidgets represents and manipulates image data (for now, JPG, PNG and possibly TIFF only), and I am hoping that someone on this forum might be able to point me in the right direction, or have information on how to go about this, assuming it is possible at all.
So far, my search has been fruitless, either because there is nothing out there or I have not used the proper search criteria.

Re: Image data (only) checksum/digest

Posted: Sun Sep 01, 2019 9:11 pm
by doublemax
wxImage uses a fixed format: width*height*3 bytes of RGB data. If the image has an alpha channel, that's a separate width*height byte buffer. Just calculate a CRC32, MD5 or whatever over the image data.
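For example, a quick FNV-1a sketch over a raw byte buffer (plain C++, nothing wx-specific needed to try it; the idea is that you'd feed it image.GetData() with width*height*3 as the length — FNV-1a is just an easy stand-in here, any real digest like CRC32, MD5 or SHA-1 works the same way over the same buffer):

```cpp
#include <cstdint>
#include <cstddef>

// FNV-1a 64-bit hash over a raw byte buffer. With wxWidgets you would
// pass image.GetData() and size_t(image.GetWidth()) * image.GetHeight() * 3.
uint64_t fnv1a64(const unsigned char* data, size_t len)
{
    uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    for (size_t i = 0; i < len; ++i) {
        h ^= data[i];
        h *= 1099511628211ULL;             // FNV prime
    }
    return h;
}
```

Two files whose pixel buffers are byte-identical get the same hash regardless of their metadata, which is exactly the grouping you're after.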

Re: Image data (only) checksum/digest

Posted: Mon Sep 02, 2019 12:21 am
by Widgets
Thank you.
That was one of my options, but with so many images to check, I did not want to go too far down a blind alley without some guidance. There is a chance that some of the applications used for editing wrote back even the image data at a different compression setting than the original, and I did not want to get too many false negatives. In fact, at this stage I have no clue what to expect from this exercise.
I will have a go at this over the next few days and report back.

Re: Image data (only) checksum/digest

Posted: Mon Sep 02, 2019 5:15 am
by PB
As you wrote, if lossy compression was used (JPG), a checksum of the data alone will not be enough: depending on the compression setting, an unmodified TIFF image saved as JPEG can have different pixels than the original. One would have to compare the images by actual picture content (OpenCV?), which gets more complicated and less reliable.

Re: Image data (only) checksum/digest

Posted: Tue Sep 03, 2019 2:55 pm
by Widgets
Quite right, it won't be easy and I'll have to see just what I can look for to try and sort this out.
Still, even being able to identify some images which only differ in 'peripheral' data will be a big help - I hope :-)
My first and very limited test shows a very distinct difference between the SHA-1 of the file and the SHA-1 of the RGB wxImage data only.
(FWIW, since my main aim is to identify differences, I won't worry too much about the newer digests, which put more emphasis on security, etc.)

Do you have any thoughts on how to handle the alpha channel, if present?
A separate hash? I have not investigated calculating a hash in parts, i.e., RGB first and then, if present, continuing with the alpha data.
Or concatenate the buffers, if alpha is present?
Someone else seems to be using the approach in the link for 'similarity' tests, but again, my relatively limited tests have not convinced me that it is a good alternative. Then again, I have neither investigated its original purpose nor how closely the implementation in the link matches the 'paper' specs.

Re: Image data (only) checksum/digest

Posted: Tue Sep 03, 2019 4:18 pm
by doublemax
Concatenate the buffers, if alpha is present?
That's what I would do.
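E.g. with an incremental hash you can fold in the RGB buffer first and then the alpha buffer (assuming the usual wxImage accessors GetData()/GetAlpha()/HasAlpha()); that's equivalent to hashing the concatenation without actually copying anything:

```cpp
#include <cstdint>
#include <cstddef>

const uint64_t FNV_BASIS = 14695981039346656037ULL;

// Fold a buffer into an existing FNV-1a state. Feeding the RGB buffer
// and then the alpha buffer gives the same hash as hashing their
// concatenation. Sketch of the wx side (hypothetical variable names):
//   size_t n = size_t(img.GetWidth()) * img.GetHeight();
//   uint64_t h = fnv1a64_update(FNV_BASIS, img.GetData(), n * 3);
//   if (img.HasAlpha())
//       h = fnv1a64_update(h, img.GetAlpha(), n);
uint64_t fnv1a64_update(uint64_t h, const unsigned char* data, size_t len)
{
    for (size_t i = 0; i < len; ++i)
        h = (h ^ data[i]) * 1099511628211ULL;
    return h;
}
```

One wrinkle to keep in mind: an image with a fully opaque alpha channel will hash differently from the same image without one, so you may want to treat "no alpha" and "all-255 alpha" as equivalent before hashing.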

To compare two images of the same size, one way is to calculate the sum of squared distances between corresponding pixels and set a threshold below which the images are considered similar. But this is computationally expensive, because you have to compare each image against every other image of the same size.
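A rough sketch of that comparison (plain C++ over raw RGB buffers as wxImage::GetData() would give them; the default threshold here is made up and would need tuning against your real images):

```cpp
#include <cstddef>

// Mean squared error between two same-sized RGB buffers
// (width*height*3 bytes each).
double meanSquaredError(const unsigned char* a, const unsigned char* b,
                        size_t len)
{
    double sum = 0.0;
    for (size_t i = 0; i < len; ++i) {
        double d = double(a[i]) - double(b[i]);
        sum += d * d;
    }
    return len ? sum / double(len) : 0.0;
}

// Hypothetical threshold: images below it count as "the same picture".
bool roughlySame(const unsigned char* a, const unsigned char* b,
                 size_t len, double threshold = 4.0)
{
    return meanSquaredError(a, b, len) < threshold;
}
```

Using the mean rather than the raw sum keeps the threshold independent of image size.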

It's better if you can calculate a signature for each image and then just sort all images by that signature. E.g. you could scale each image down to a 32x32 greyscale image. That gives you a 1024-byte signature, which is fast enough for comparisons. But I don't know if that's good enough for the type of images you need to compare.

Re: Image data (only) checksum/digest

Posted: Tue Sep 03, 2019 7:47 pm
by Widgets
Up to now I had considered only a brute-force, full-size image comparison, and in my limited tests that did not seem to involve any undue computational effort, though my view may well change once I look at the overall picture.
In that case I may well have to consider comparing only reduced-size images. FWIW, this is not any sort of technically defensible problem or issue: all of the images involved are either family photographs or scans of documents related to my research for our family tree. Typically, they are readily identifiable even as thumbnails.

As for time constraints or computational effort, my expectation is that the initial data collection will run as a side task on one of the PCs on the LAN, with intermittent attention to change CDs or DVDs. Part of this process is to collect what information I can and save the details in a database. This database will then be used to sift through the data so that I can collect and consolidate the information into a base set of images and documents.

Basically, I am re-inventing a document/image management system, but one under my control, where I know where the data is kept and how to get at it. I've tried a few free as well as trial commercial apps for this, but have not found anything I can both afford and feel confident will do what I need, now and down the road, and that won't either raise the cost of support or be abandoned. :shock:
Part of the need for this effort is the several times I trusted the promises of whatever program I tried; part of it is that I have learned a lot about what I really need or want :)
Hence I very much appreciate all help and comments.