Tracking with OpenCV: errors with threads and VideoCapture

Hello everyone! I am building a counting system for a manufacturing line. I use YOLOv5 for classification and OpenCV to receive the video stream via RTSP.
My current problems are:

  1. When I run my system on my local machine, I get one counting result. When I run it on the server, I get a different result. The server has enough resources for my system.
  2. When I run my system with cv.imshow("frame", frame), the video freezes for a few seconds WITHOUT ANY ERRORS (though not every time). How can I trace the problem (maybe with the logging module) and make my system stable?
  3. Which camera should I use for my business? Right now I have 25 frames per second with the 263/265 codec.
  4. I need to connect about 100 cameras to my system (each camera counts different things). I tried to use threads, but I cannot add more than 6 threads (although I am trying to connect more than 6 cameras). What should I do?

Thanks a lot for every opinion!

import datetime
import cv2 as cv
import torch
import numpy as np
from tracker import *

now =
now2 =
count_set =
cap = cv.VideoCapture("rtsp://")

model = torch.hub.load("/network/yolov5", 'custom', source='local', path='/network/', force_reload=True) # yolov5n - yolov5x6 or custom
model.conf = 0.5
model.iou = 0.2
width = cap.get(cv.CAP_PROP_FRAME_WIDTH)
height = cap.get(cv.CAP_PROP_FRAME_HEIGHT)
device = torch.device('cuda')
sum_count = 0

area = [(width/2-0.15*width, 0), (width/2-0.05*width, 0), (width/2-0.05*width, height), (width/2-0.15*width, height)]
area1 = [(width/2-0.4*width, 0), (width/2-0.05*width, 0), (width/2-0.05*width, height), (width/2-0.4*width, height)]
tracker = Tracker()  # module for updating coordinates at each moment in time
ret, frame =
if ret == False:
cv.normalize(frame, None, alpha=0, beta=1000, norm_type=cv.NORM_MINMAX)
results = model(frame)
res = results.pandas().xyxy[0]
# print(results.pandas().xyxy, "RES")
list = []
for index, row in results.pandas().xyxy[0].iterrows():

x1 = row['xmin']
y1 = row['ymin']
x2 = row['xmax']
y2 = row['ymax']
b = str(row['name'])
boxes_ids = tracker.update(list)

for box_id in boxes_ids:
x,y,w,h,id = box_id

cv.rectangle(frame, (int(x),int(y)), (int(w),int(h)), (255,0,255), 2)
cv.putText(frame, str(id), (int(x),int(y)), cv.FONT_HERSHEY_PLAIN, 1, (255, 0, 0), 2)
result1 = cv.pointPolygonTest(np.array(area, np.int32), (int(w),int(h)), False)

if result1>0:

count = len(area_set)
if count <=2:
count = 0

print("Baskets passed: ", count)
print('Total passed: ', sum_count)

now1 =
if (((now1-now).total_seconds())>=2) and len(area_set)<=2:
area_set = set()
now2 =
if ((now1-now).total_seconds())>=60:
with open("time.txt", 'a') as my_file:
if count>=2:
a = "FROM " + str(now)[11:19] + ' TO ' + str(now1)[11:19] + " Count " + str(count) + '\n' + 'Total passed since the system started: ' + str(sum_count) + ' baskets' + '\n'

area_set = set()

now =
cv.imshow("frame", frame)

except Exception as e:

if cv.waitKey(10) == ord('q'):
# cap.release()
# cv.destroyAllWindows()


Tracker module (tracker.py)

import math

class Tracker:
    def __init__(self):
        # Store the center positions of the objects
        self.center_points = {}
        # Keep the count of the IDs
        # each time a new object is detected, the count will increase by one
        self.id_count = 0

    def update(self, objects_rect):
        # Objects boxes and ids
        objects_bbs_ids = []

        # Get center point of new object
        for rect in objects_rect:
            x, y, w, h = rect
            cx = (x + x + w) // 2
            cy = (y + y + h) // 2

            # Find out if that object was detected already
            same_object_detected = False
            for id, pt in self.center_points.items():
                dist = math.hypot(cx - pt[0], cy - pt[1])

                if dist < 30:
                    self.center_points[id] = (cx, cy)
                    # print(self.center_points)
                    objects_bbs_ids.append([x, y, w, h, id])
                    same_object_detected = True
                    # Stop at the first match so one box is not reported under several IDs
                    break

            # New object is detected: assign a new ID to that object
            if same_object_detected is False:
                self.center_points[self.id_count] = (cx, cy)
                objects_bbs_ids.append([x, y, w, h, self.id_count])
                self.id_count += 1

        # Clean the dictionary of center points to remove IDs not used anymore
        new_center_points = {}
        for obj_bb_id in objects_bbs_ids:
            _, _, _, _, object_id = obj_bb_id
            center = self.center_points[object_id]
            new_center_points[object_id] = center

        # Update dictionary with unused IDs removed
        self.center_points = new_center_points.copy()
        return objects_bbs_ids

contact a company for industrial machine vision applications.

From your questions I suspect that the advice above (contact a company for industrial machine vision applications) is the right answer. However, I may be wrong, so here are my responses in case they apply to you.

  1. Do you mean that you get different results when streaming from RTSP? If so, the most common cause is that you are not requesting frames fast enough and they are being lost.
  2. If (1) applies, then calling cv.imshow() just compounds the issue by adding extra delay.
  3. If you mean compatibility with OpenCV, then any camera that supports RTSP should work. That said, there are so many cameras on the market that it's impossible to predict how they will perform, so I would test a few, choosing ones that provide the resolution/quality you require.
  4. Is this on a single system? Which GPU are you using for this? It will only be possible if you decode on the GPU (not with cv.VideoCapture). I can't see any way around using threads for this, at least one per camera.
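To make points 1 and 2 concrete, here is a minimal sketch of a reader thread that pulls frames as fast as the stream delivers them, keeps only the newest one, and logs failed reads through the logging module. The class name LatestFrameReader and its attributes are made up for illustration; the capture argument is assumed to be any object with a read() method returning (ok, frame), which is how cv.VideoCapture behaves.

```python
import logging
import threading
import time

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("reader")


class LatestFrameReader:
    """Hypothetical helper: read in a background thread, keep only the newest frame.

    `capture` is assumed to expose read() -> (ok, frame), like cv.VideoCapture.
    """

    def __init__(self, capture, name="cam"):
        self.capture = capture
        self.name = name
        self.frame = None
        self.frames_read = 0
        self.failures = 0
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()
        return self

    def _run(self):
        while not self._stop.is_set():
            ok, frame = self.capture.read()
            if not ok:
                # Failed reads are logged instead of passing silently
                self.failures += 1
                log.warning("%s: read failed (%d failures so far)",
                            self.name, self.failures)
                time.sleep(0.01)  # avoid a busy loop while the stream is down
                continue
            with self._lock:
                self.frame = frame
                self.frames_read += 1

    def latest(self):
        """Return (frame_index, frame); frame is None until the first good read."""
        with self._lock:
            return self.frames_read, self.frame

    def stop(self):
        self._stop.set()
        self._thread.join(timeout=1.0)
```

In the script from the question this would wrap the existing cap object, for example reader = LatestFrameReader(cap).start(), with the detection loop calling reader.latest() instead of reading from cap directly. Comparing consecutive frame indices then shows how many frames were skipped while the model was busy. This only keeps one stream from backing up; it does not address the cost of decoding many streams.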

Thanks a lot for the answer!

  1. I mean that I run it on both my local machine and the server, and I get different results from each (because YOLO detects at a different speed).
  2. Understood!
  3. Understood!
  4. I need to run a lot of cameras through my system (on the server). I thought I could do it with threads. I need to connect each camera to YOLO (pretrained on my custom dataset) and run 100 cameras in parallel.
    How can I do it?
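A rough illustration of why detection speed can change the count in point 1: the Tracker above re-uses an ID only when the center has moved less than 30 pixels since the previous processed frame. If a slower machine processes fewer frames per second, the same object moves further between processed frames and may be given a new ID. The object speed of 200 pixels per second and both helper function names are assumptions chosen for the example.

```python
MATCH_DISTANCE_PX = 30  # threshold taken from Tracker.update (dist < 30)


def displacement_per_processed_frame(speed_px_per_s, processed_fps):
    """Pixels an object moves between two frames that the model actually sees."""
    return speed_px_per_s / processed_fps


def keeps_same_id(speed_px_per_s, processed_fps):
    """True if the movement stays under the tracker's matching threshold."""
    moved = displacement_per_processed_frame(speed_px_per_s, processed_fps)
    return moved < MATCH_DISTANCE_PX


speed = 200  # ASSUMPTION: object speed in pixels per second

print(displacement_per_processed_frame(speed, 25), keeps_same_id(speed, 25))  # 8.0 True
print(displacement_per_processed_frame(speed, 5), keeps_same_id(speed, 5))    # 40.0 False
```

Under these assumed numbers, a machine that processes all 25 frames per second keeps the ID, while one that manages only 5 frames per second would assign a new ID on every processed frame.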

statements like these tell me this task exceeds your training and expertise significantly. I’m not saying threads are bad. I’m saying, if that’s the level you’re thinking about this, you are missing the big picture.

there is simply no way you’re gonna decode and run AI on 100 video streams on a single computer.

you can do some napkin math. your bottlenecks will be the AI inference and the actual video decoding.
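the napkin math could look like the sketch below. the camera count and frame rate come from the question; the 10 ms per-frame inference time is an assumption picked only to make the arithmetic concrete, and the function name is made up. measure your own model on your own GPU before trusting any of it.

```python
def inference_load(cameras, fps, inference_ms):
    """Return (total frames per second, number of GPUs needed for inference alone)."""
    total_fps = cameras * fps             # frames arriving every second
    per_gpu_fps = 1000.0 / inference_ms   # frames one GPU can process per second
    return total_fps, total_fps / per_gpu_fps


# 100 cameras and 25 fps are from the question.
# ASSUMPTION: 10 ms of inference per frame on one GPU.
total, gpus = inference_load(cameras=100, fps=25, inference_ms=10.0)
print(total)  # 2500
print(gpus)   # 25.0
```

that figure covers inference only; decoding 100 compressed streams is a separate cost on top of it.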

you need to deploy “smart cameras” that do the processing right there at the edge, and only report the important data back (detections/counts).
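to give a sense of how little needs to travel over the network in that setup, here is a sketch of a count report an edge device might send. the field names mirror what the script in the question writes to time.txt (start time, end time, count, running total); the CountReport class, the camera_id field, and the JSON encoding are assumptions for illustration, not any vendor's format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class CountReport:
    """Hypothetical message sent by one camera once per reporting period."""
    camera_id: str
    period_start: str
    period_end: str
    count: int
    total: int


def encode_report(report):
    """Serialize a report to UTF-8 JSON bytes."""
    return json.dumps(asdict(report), sort_keys=True).encode("utf-8")


report = CountReport(camera_id="line-07", period_start="10:00:00",
                     period_end="10:01:00", count=12, total=345)
payload = encode_report(report)

# One uncompressed 1920x1080 BGR frame, for comparison
raw_frame_bytes = 1920 * 1080 * 3
print(len(payload) < 200, raw_frame_bytes)  # True 6220800
```

a message of this size once a minute per camera is trivial for a central server to collect, unlike 100 video streams.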

industrial vision does that all the time. there are manufacturers that simply make those things (and all the smarts that go into them), and there are integrator companies that know how to use those things and will deliver you an entire solution including setup and operation.