Code: identical_image_checker
I was labelling images to fine-tune an object detection model. Of course, you occasionally find funny samples among these images, but overall it’s still tedious work that makes you dizzy and less and less confident in what you’re doing.
There were 1,526 images in my labelling folder, which I had labelled on and off over the past few months. Sometimes I would stare at an image and think, “Wait, I’ve labelled this one before… Haven’t I?”
It’s impossible for me to check all 1,526 images manually and pick out the identical ones. I wanted to give ImageMagick a try, but I couldn’t find examples I could follow. Furthermore, I am not very good at handling files and paths in shell scripts. So I decided to use OpenCV for the image comparison and Python for processing the comparison results.
Methods to compare two images
There are several possible methods to determine whether two images are identical or similar to some degree. For my application, similar images can be treated as a kind of image augmentation, so I can leave them alone. What I need is much simpler: just find the identical (duplicated) images, which might have been gathered from different sources and therefore have different filenames.
The following function shows my approach:
def is_identical(files):
    # files is a (filename1, filename2) pair
    f1, f2 = files
    img1 = cv2.imread(f1, cv2.IMREAD_UNCHANGED)
    img2 = cv2.imread(f2, cv2.IMREAD_UNCHANGED)
    gray1 = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)
    gray2 = cv2.cvtColor(img2, cv2.COLOR_BGR2GRAY)
    # same size, and the XOR of the pixels has no non-zero element
    identical = (gray1.shape == gray2.shape and
                 not(np.bitwise_xor(gray1,gray2).any()))
    if identical:
        MP_idendical_list.append((f1,f2))
It receives a pair of image filenames and checks whether the two images have the same size and completely identical pixels (after conversion to grayscale).
If they are identical, the filename pair is appended to a list created by multiprocessing.Manager as follows:
import multiprocessing as MP
manager = MP.Manager()
MP_idendical_list = manager.list()
The list has to be created this way because it is accessed by the multiprocessing workers. With an ordinary list, each worker would append to its own copy, so the list in the main process would stay empty.
Pair the images for comparison
The next problem is how to pick two images at a time for comparison without missing any pair. It looks like a \(C^n_r\) problem to me, and Python has itertools.combinations to deal with it.
Just call itertools.combinations(n, r) and Python returns a lazy iterator over the \(C^n_r\) combinations.
Note that the n argument here is an iterable, which in my case is the list containing all the image filenames to be checked.
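As a small illustration of the pairing, here is a sketch with three made-up filenames (placeholders, not files from the actual folder). Each unordered pair appears exactly once, so no image is compared with itself and no pair is compared twice.

```python
import itertools

# Hypothetical filenames, used only for illustration.
files_list = ["imgA.jpeg", "imgB.jpeg", "imgC.jpeg"]

# combinations(iterable, r) lazily yields every r-element subset as a tuple,
# following the order in which the elements appear in the input.
pairs = list(itertools.combinations(files_list, 2))
print(pairs)
# [('imgA.jpeg', 'imgB.jpeg'), ('imgA.jpeg', 'imgC.jpeg'), ('imgB.jpeg', 'imgC.jpeg')]
```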
How big is the combination?
To pair any two of my \(1,526\) images, there are \(C^{1526}_{2}=1,163,575\) combinations!
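The count can be double-checked with math.comb (available in Python 3.8 and later). The variable names below are made up for this sketch; the resulting number is also what a progress bar would need as its total.

```python
import math

n_images = 1526  # number of images in the labelling folder

# C(n, 2) = n * (n - 1) / 2 unordered pairs
n_pairs = math.comb(n_images, 2)
print(n_pairs)  # 1163575
```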
Show the progress along with multiprocessing
With such a big number, the program takes a long time to finish. Without any indication, the user (that’s me) has no idea how far the processing has progressed, which is definitely not good design.
The tool I know for this is tqdm.
In addition to showing progress, another consideration is how to speed up the computation so that the processing time is as short as possible. The multiprocessing package comes to the rescue.
However, using tqdm along with multiprocessing is not so straightforward (to me).
Luckily, I managed to come up with a working solution, shown in the following block from the main() function:
# set None to use the number returned by cpu_count()
with MP.Pool(processes=None) as p:
    with tqdm(total=process_len) as pbar:
        for _ in p.imap_unordered(is_identical, itertools.combinations(files_list, 2)):
            pbar.set_description('Identical pairs {}'.format(len(MP_idendical_list)))
            pbar.update()
Final tweak
The previously mentioned MP_idendical_list has a problem.
For example, if we have three identical images in the working directory, named imgA.jpeg, imgB.jpeg, and imgC.jpeg, then they will be listed in MP_idendical_list as follows:
MP_idendical_list = [
    ('imgA.jpeg', 'imgB.jpeg'),
    ('imgB.jpeg', 'imgC.jpeg'),
    ('imgA.jpeg', 'imgC.jpeg')
]
This is not only redundant but also hard to read later on.
To solve this problem, I found a solution in a Stack Overflow post and slightly modified it to become the clean_duplicated function in my program.
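The clean_duplicated function itself is not reproduced here. As a rough idea of what such a cleanup can look like, the following sketch merges pairs that share a filename into groups using a small union-find. It is not the code from the Stack Overflow post; the function name group_identical_pairs, its output format (a sorted list of sorted lists), and the extra imgX.png and imgY.png entries are assumptions made for illustration.

```python
def group_identical_pairs(pairs):
    """Merge (f1, f2) pairs that share a filename into groups of duplicates."""
    parent = {}

    def find(x):
        # Follow parent links to the representative, compressing the path.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for f1, f2 in pairs:
        root1, root2 = find(f1), find(f2)
        if root1 != root2:
            parent[root2] = root1  # union the two groups

    groups = {}
    for name in parent:
        groups.setdefault(find(name), []).append(name)
    # Sort for a stable, readable result.
    return sorted(sorted(group) for group in groups.values())


# Hypothetical input: the three-image example plus one unrelated pair.
pairs = [
    ("imgA.jpeg", "imgB.jpeg"),
    ("imgB.jpeg", "imgC.jpeg"),
    ("imgA.jpeg", "imgC.jpeg"),
    ("imgX.png", "imgY.png"),
]
print(group_identical_pairs(pairs))
# [['imgA.jpeg', 'imgB.jpeg', 'imgC.jpeg'], ['imgX.png', 'imgY.png']]
```

With this shape, the three redundant pairs collapse into a single group, which is easier to read when deciding which copies to delete.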
Test with all the images
I ran the program on a notebook with an Intel i7 CPU, using 8 workers. The processing took 4 hours!
Here is the output of tqdm:
Identical pairs 3: 100%|███████████████████████████████████████████████████| 1163575/1163575 [4:01:48<00:00, 80.20it/s]
Apparently, my solution is usable for small image sets but does not scale. If you know better solutions (I believe there are many), please let me know.
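One commonly used alternative to pairwise comparison, offered here as a sketch rather than a tested replacement, is to compute a digest of each image once and group the images whose digests match. That needs one pass over n images instead of roughly n²/2 comparisons. The sketch below works on raw bytes so that it stays self-contained; to mirror the check used above, the payload would presumably be the grayscale array's shape together with its pixel bytes, which is an assumption about how it would be wired in. The name group_by_digest and the sample data are hypothetical.

```python
import hashlib
from collections import defaultdict


def group_by_digest(named_payloads):
    """Group names whose byte payloads have the same SHA-256 digest."""
    buckets = defaultdict(list)
    for name, payload in named_payloads:
        buckets[hashlib.sha256(payload).hexdigest()].append(name)
    # Keep only digests shared by two or more names.
    return [names for names in buckets.values() if len(names) > 1]


# Made-up in-memory payloads standing in for decoded pixel data.
samples = [
    ("imgA.jpeg", b"\x00\x01\x02"),
    ("imgB.jpeg", b"\x00\x01\x02"),
    ("imgC.jpeg", b"\xff\xff\xff"),
]
print(group_by_digest(samples))  # [['imgA.jpeg', 'imgB.jpeg']]
```

Matching digests can still be confirmed with a direct pixel comparison on the few candidates that remain.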