Code: identical_image_checker

I was labelling images to fine-tune an object detection model. You occasionally come across funny samples in these images, but overall it’s tedious work that makes you dizzy and less and less confident about what you’re doing.

There were 1,526 images in my labelling folder, which I had been labelling on and off over the past few months. Sometimes I would stare at an image and think, “Wait, I’ve labelled this one before… haven’t I?”

It’s impossible for me to check all 1,526 images manually and pick out the identical ones. I wanted to give ImageMagick a try, but I couldn’t find examples that I could follow. Furthermore, I am not very good at handling files and paths in shell scripts. So I decided to use OpenCV for the image comparison and Python for processing the comparison results.

Methods to compare two images

There are several possible methods to determine whether two images are identical or similar to some degree. For my application, similar images can be treated as a kind of image augmentation, so I can leave them alone. What I need is much simpler: find the identical (duplicated) images, which might have been gathered from different sources and therefore have different filenames.

The following function shows my approach:

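import cv2
import numpy as np

def is_identical(files):
  # files is a (filename1, filename2) pair
  f1, f2 = files
  img1 = cv2.imread(f1, cv2.IMREAD_UNCHANGED)
  img2 = cv2.imread(f2, cv2.IMREAD_UNCHANGED)
  # cv2.imread returns None when a file cannot be read
  if img1 is None or img2 is None:
    return
  # note: COLOR_BGR2GRAY expects colour input, so this assumes
  # none of the files is a single-channel (grayscale) image
  gray1 = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)
  gray2 = cv2.cvtColor(img2, cv2.COLOR_BGR2GRAY)
  # identical = same size and no differing pixel after XOR
  identical = (gray1.shape == gray2.shape and
      not np.bitwise_xor(gray1, gray2).any())
  if identical:
    MP_idendical_list.append((f1, f2))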

It receives a pair of image filenames and checks whether the two images have the same size and completely identical pixels (after conversion to grayscale). If they are identical, the filename pair is appended to a list created by multiprocessing.Manager as follows:

import multiprocessing as MP

manager = MP.Manager()
MP_idendical_list = manager.list()

The list has to be created this way because it is accessed by the multiprocessing workers. With an ordinary list, each worker would only append to its own copy, so the list in the main process would stay empty and be of no use.
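
Going back to the comparison itself, the pixel test inside is_identical can be tried on its own, without reading any files. The following is a minimal sketch that uses small synthetic NumPy arrays in place of the grayscale images returned by cv2.cvtColor; the helper name same_pixels is hypothetical and not part of the program.

```python
import numpy as np


def same_pixels(gray1, gray2):
    """Return True when both arrays have the same shape and identical values."""
    # The shape check comes first: `and` short-circuits, so the XOR is
    # never attempted on arrays whose shapes do not match.
    return gray1.shape == gray2.shape and not np.bitwise_xor(gray1, gray2).any()


a = np.zeros((4, 6), dtype=np.uint8)   # stand-in for a small grayscale image
b = a.copy()                           # exact duplicate of a
c = a.copy()
c[2, 3] = 1                            # differs from a in a single pixel
d = np.zeros((6, 4), dtype=np.uint8)   # same pixel count, different shape

print(same_pixels(a, b))  # duplicate
print(same_pixels(a, c))  # one pixel differs
print(same_pixels(a, d))  # shapes differ
```

Only the first call should report a match. A single differing pixel is enough to make the XOR result non-zero, which is why this test finds exact duplicates only and ignores merely similar images.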

Pair the images for comparison

The next problem is how to pick two images at a time for comparison without missing any pair. It looks like a \(C^n_r\) problem to me, and Python has itertools.combinations to deal with it.

Just call itertools.combinations(n, r) and Python returns an iterator over all \(C^n_r\) combinations. Note that the n argument here is an iterable, which in my case is the list containing all the image filenames to be checked.
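
As a small illustration of the pairing, here is what itertools.combinations produces for a made-up list of three filenames (the names are placeholders):

```python
import itertools

files_list = ['imgA.jpeg', 'imgB.jpeg', 'imgC.jpeg']  # placeholder filenames

# Each unordered pair appears exactly once; no image is paired with itself.
pairs = list(itertools.combinations(files_list, 2))
for pair in pairs:
    print(pair)
```

Three files give three pairs: (A, B), (A, C) and (B, C).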

How big is the combination?

To pair any two of my \(1,526\) images, there are \(C^{1526}_{2}=1,163,575\) combinations!
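
The count can be double-checked with the formula \(C^n_2 = n(n-1)/2\), or with math.comb (available since Python 3.8):

```python
import math

n = 1526
by_formula = n * (n - 1) // 2   # C(n, 2) = n(n-1)/2
by_comb = math.comb(n, 2)       # same value from the standard library

print(by_formula, by_comb)
```

Both values come out to 1163575, the same total that tqdm shows in its output.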

Show the progress along with multiprocessing

For such a big number, the program takes a long time to finish. Without any indication, the user (that’s me) has no idea how far the processing has progressed, which is definitely not good design.

To my knowledge, tqdm is the most convenient solution.

Besides showing progress, another consideration is how to speed up the computation and cut the processing time as much as possible. Python’s multiprocessing package comes to the rescue.

However, using tqdm together with multiprocessing is not so straightforward (to me). Luckily, I managed to come up with a working solution, shown in the following block from the main() function:

  # processes=None uses the number returned by cpu_count()
  with MP.Pool(processes=None) as p:
    # process_len is the total number of pairs to compare
    with tqdm(total=process_len) as pbar:
      # imap_unordered yields once per finished pair, in completion order
      for _ in p.imap_unordered(is_identical, itertools.combinations(files_list, 2)):
        pbar.set_description('Identical pairs {}'.format(len(MP_idendical_list)))
        pbar.update()
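
For reference, here is a minimal self-contained sketch of the same pattern that runs without any image files. The worker compare_pair, the function count_equal_pairs and the integer inputs are hypothetical placeholders; the point is only to show where the tqdm total comes from and how the bar advances. Two things differ from the block above and are assumptions: the worker returns its result instead of appending to a Manager list (to keep the sketch short), and the chunksize argument, which hands pairs to the workers in batches, is an addition whose effect on the image task is untested here.

```python
import itertools
import math
import multiprocessing as MP

from tqdm import tqdm


def compare_pair(pair):
    """Placeholder worker: report whether the two items are equal."""
    a, b = pair
    return a == b


def count_equal_pairs(items, processes=None, chunksize=64):
    """Compare every pair of items in a process pool, with a progress bar."""
    total = math.comb(len(items), 2)          # number of pairs = tqdm total
    pairs = itertools.combinations(items, 2)  # pairs are generated lazily
    equal = 0
    with MP.Pool(processes=processes) as pool:
        with tqdm(total=total) as pbar:
            # One result arrives per pair, in completion order.
            for is_equal in pool.imap_unordered(compare_pair, pairs, chunksize=chunksize):
                equal += is_equal
                pbar.set_description('Equal pairs {}'.format(equal))
                pbar.update()
    return equal
```

Called from under an if __name__ == '__main__': guard, count_equal_pairs([1, 2, 3, 1, 2, 4]) walks through all 15 pairs and should return 2, because only the two 1s and the two 2s match.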

Final tweak

The MP_idendical_list mentioned above has a problem. For example, if the working directory contains three identical images named imgA.jpeg, imgB.jpeg, and imgC.jpeg, they will be listed in MP_idendical_list as follows:

MP_idendical_list = [
  ('imgA.jpeg', 'imgB.jpeg'),
  ('imgB.jpeg', 'imgC.jpeg'),
  ('imgA.jpeg', 'imgC.jpeg')
]

This is not only redundant but also hard to read when we use the results later.

To solve this problem, I found a solution in a Stack Overflow post and slightly modified it to become the clean_duplicated function in my program.
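
The Stack Overflow code itself is not reproduced here. As an illustration of the idea only, the following sketch merges overlapping pairs into groups with a small union-find. The name merge_pairs is hypothetical, and this is not necessarily how clean_duplicated is implemented.

```python
def merge_pairs(pairs):
    """Merge pairs that share a filename into sorted groups of duplicates."""
    parent = {}

    def find(name):
        # Follow parent links to the group's representative,
        # shortening the path along the way.
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a   # join the two groups

    groups = {}
    for name in list(parent):
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


pairs = [
    ('imgA.jpeg', 'imgB.jpeg'),
    ('imgB.jpeg', 'imgC.jpeg'),
    ('imgA.jpeg', 'imgC.jpeg'),
    ('imgX.jpeg', 'imgY.jpeg'),   # an unrelated second group
]
print(merge_pairs(pairs))
```

With this input the three pairs for imgA, imgB and imgC collapse into a single group of three filenames, and imgX and imgY form a second group of two.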

Test with all the images

I ran the program on a notebook with an Intel i7 CPU using 8 workers. The processing took 4 hours!

Here is the output of tqdm:

Identical pairs 3: 100%|███████████████████████████████████████████████████| 1163575/1163575 [4:01:48<00:00, 80.20it/s]

Apparently, my solution is usable for small image sets but does not scale. If you know of better solutions (I believe there are many), please let me know.