Label detection advice needed

Hi, guys!
I'd like to hear your suggestions about this:

I’m working on an application that recognizes labels on postal items.
Attached below are two sample files. One of them has a white label on a relatively dark background, and the other a similar white label on a much lighter background.
Currently, I apply a color mask in the range of 200 to 255 for each of the color channels, then a threshold on the result, and recognize the contours with cv2.findContours(). (I'm skipping the filtering of the found contours, minAreaRect(), and so on here.)
This approach gives very good results for files with a darker background, but not so good when the background is lighter: the mask then covers a large part of the package, leading to wrong results.
If you look at the file with the light background, the label is clearly outlined and visible to the human eye, and I would like to know how to find it with better accuracy.
What would you advise me for a solution that would work on many different photos?
Maybe fine-tuned Canny edge detection, or maybe a deep-learning approach?
Thanks in advance!

maybe you can work on it "inward out":
detect the barcode or spot text (MSER, EAST, DB),
and infer the label borders from that

A U-Net with as few as 5 labeled samples would detect it easily. You just mask the label area, and with some good augmentations like random rotation / flip / warp you will get reliable results. You can use findContours or boundingRect later with the predicted mask.

An even better augmentation would be separating the package and the label, then rotating them independently around the same origin. You can fill the void created by the label using inpaint.

I use this method for detecting anything, even if I need it only once, and it takes me 10 minutes at most.

U-Net is something new for me. Are there any guides or samples available for using it with OpenCV?

It's a convolutional neural network architecture. You can find implementations anywhere; here's a good one: GitHub - ternaus/TernausNet: UNet model with VGG11 encoder pre-trained on Kaggle Carvana dataset

I would give you more details but I’m not sure if it’s in the scope of this forum.


take my word, it is ;]


Thanks @berak. I needed to create a repo of my U-Net, so I used this as an excuse :)

Here’s my U-Net implementation. You just need to create a dataset folder with images and masks folders and place the files in them. Corresponding image and mask file names should be the same.

Model input size is 256x256. I don't have the package dataset, but I used 16 images from a face dataset for testing.

Here are the training images and masks I used.

After about 15 mins of training, here’s the result.

It’s not great but for 16 labeled images, it’s perfectly usable.

The code is very small and easy to use; reading the usage message should be enough.


He actually sent me 10 images privately, and I used 8 for training and 2 for evaluation. I just made the images square, but in practice it's better to localize the box first and zoom in more, since 256x256 is not a high resolution.

Also using a bigger dataset (especially if it will be used in production) would make a huge difference.

Anyhow, it worked pretty well imo. Here are the results for the two test samples (one being from the original post).


You just resize the masks to the original image size and you're good to go.


Well, I used the same 10 pictures, resized them to 256x256 and produced the mask equivalents.
I ran the training script ~40 minutes ago and now it is on epoch #2500.
The training machine runs Debian 10; i7 CPU; 8 GB RAM; GTX 1050 2GB.

It is still running, until epoch 10000, I guess.

As a novice, I have the following questions (sorry if the answers are obvious):

  1. Is it mandatory for the training files to be square and exactly 256x256, or can I use, for example, 512x512 and expect more accurate results?
  2. What does the second optional parameter to training.p do?
  3. You say that your training took 10 minutes, how do you achieve such speed?
  4. My goal is to process about 20,000 such unique shipments per month. In your opinion, what is the reasonable size for the data set and what size should the images be?

@wency.www ,

  1. U-Net is a fully convolutional network, so there's actually no limitation on the input size, but this particular architecture works best at 256x256 resolution. If I need higher-resolution segmentation, I usually use the pix2pixHD generator, which is a similar network.

  2. If you stop training or it crashes somehow and you later want to continue where you left off, you can pass the latest model as the second parameter and it will continue from there. It's also used for fine-tuning and other purposes.

  3. The only difference is that I use a 2080 Ti; it must be that. Also, 10k epochs is just an arbitrary number; you can stop training once you're happy with the test-set results.

  4. The dataset size depends on the task, of course. For instance, if you want to create a generic face segmentation model, you'll want to cover as many different ethnicities, genders, poses and lighting conditions as possible; but if you just want to segment a particular video with a single person, you can create a dataset with 10-50 photos of the most distinct poses and it will segment almost perfectly.

The key here is to construct the dataset in a way that it has good variation and covers most of the cases (like brown box, white bag, black bag, etc.).

U-Net is very fast, and you can enable AMP to make it even faster; it's a flag in the training script. It uses 16-bit floating point rather than 32-bit, which makes the computation faster on GPU. Be aware, though, that it will be slower on CPU, since CPUs don't support half precision natively.

As for higher resolution, you can use a two-step U-Net: the first detects the box, and once you have localized it (by resizing the mask to the original image size and using boundingRect), a second U-Net can localize the label on a much better image.

It’s basically trial and error as most of the deep learning tasks. Good luck.


A heartfelt thank you! I really couldn't hope for such a detailed guide. I will test the results, and if I have questions, I hope it will not be a problem to ask them here. Thank you for your time.


Well, time for some results.
With the model checkpoint from epoch #8200 I tested it on 36 real images (some of them among the training files).
Initially I tested with the original file dimensions, but this does not work; it seems that the evaluated images need to be square, so I decided to rescale/crop them to 1024x1024.
From the results, 29 of the masks are OK, which is ~80% accuracy. Quite good (on the same set of images my own record was 26/36)!
What surprised me are the results on 005, 019 and 038. They are very similar to what I got using color masks.

Regarding the training: if I decide to add 100 more photos/masks to the training set, do I need to restart the training from scratch, or can I use the last saved model?

Just to clarify, did you use 1024x1024 or 256x256 images as model input? 1024 is not optimal for this model; I strongly recommend using 256 and resizing the mask to 1024 later. The mask will be fine after a 4x upscale.

Train and test samples should be the same format as you realized. Increasing dataset size should improve the accuracy.

Also, I added an optional vertical flip augmentation, so make sure to update the code and change line 125 (after the update) to:
img, mask = random_transform(img, mask, vflip=True)

It will probably help for your dataset as it increases variation.

Another thing: 8200 epochs for such a small dataset can be too much. Although U-Net doesn't overfit very easily, it can hurt the model's ability to generalize. Watch the predictions, and if they look OK, try evaluating on the test set; if the test-set masks don't improve anymore, it's better to stop training.

I used 256x256 for training and 1024x1024 for evaluation. I'll try with 256x256 evaluation also.

And do I need to restart the training from scratch after adding files, or can I continue from the most recent model checkpoint?

Since the model never saw 1024x1024 images during training, 80% is much better than I expected.

If the dataset didn’t change dramatically, starting from a pre-trained model (checkpoint) is better.

And it is even better with 256x256 model input - 89%!

Nice to hear. Another tip: if you grow the dataset, mask the failed cases by hand rather than randomly picking more data.

Well, just for the record: I removed all files from the training dataset and added one additional file/mask.
After that, I restarted the training from the last available checkpoint for ~5 minutes.
I stopped the training and ran the same test images; this, however, significantly reduced the accuracy.
When I ran the test with the original model, things were back to normal.
So I'm missing something about the training process. Do I need to keep the initial training files when I train on additional ones?

Definitely. The model is alive and the weights are constantly changing as you train; if you continue training with just a single example, it will overfit to that sample very quickly.

In order to generalize, more data is always better, so you should add data. The motivation behind adding hard examples is that the model will correct itself, but it should still see the rest of the dataset as before.