4K processing on Nvidia Jetson Xavier NX


I’m in a project that requires processing images for object detection. We are using YOLOv5 and the code from the following website: Object Detection using YOLOv5 OpenCV DNN in C++ and Python (learnopencv.com)

We want to do 4K processing and one requirement is to use the following function

blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (3840, 2160), swapRB=True, crop=False)

However, the processing gets stuck or crashes (out of memory), as I’m running 4 inference passes in series (4 YOLOv5 models for 4 different classes).
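For a rough sense of the memory involved, here is a small sketch (my own arithmetic, not from the post) of the size of the input blob alone, as produced by `cv2.dnn.blobFromImage` in NCHW float32 layout:

```python
import numpy as np

def blob_bytes(width, height, channels=3, dtype=np.float32):
    """Memory for a single NCHW float blob such as blobFromImage produces."""
    return width * height * channels * np.dtype(dtype).itemsize

mb = 1024 ** 2
print(f"640x640 blob:   {blob_bytes(640, 640) / mb:.1f} MB")    # 4.7 MB
print(f"3840x2160 blob: {blob_bytes(3840, 2160) / mb:.1f} MB")  # 94.9 MB
```

And that is only the input: the intermediate feature maps of a convolutional network are many times larger than the input blob, and running 4 models in series multiplies the peak further.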

Does anyone know the minimum device for this kind of processing? Right now we are using an Nvidia Jetson Xavier NX. I contacted Nvidia and they told me to use TensorRT, but unfortunately that would require a lot of changes that can’t be made right now.

I’m using opencv-python with CUDA enabled.


explain the expected benefit of using 20 times the original 640x640 pixel count, given memory / compute constraints, please!

(I mean, a ridiculous attempt, given the table from the tutorial you quote:)

why try to defeat a multi-class object detection model like this?? YOLO?

going by those numbers, a 3840 by 2160 image would take 3-7 seconds to infer, assuming the model is fully convolutional (even if strided).

downscale the image, or process it in tiles. you clearly don’t have the RAM for the full resolution.
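A minimal tiling sketch of the second option (the function name, tile size, and overlap are my own choices, assuming an OpenCV/NumPy frame; edge tiles come out smaller and would need padding, and detections near tile borders need de-duplication, e.g. NMS, both omitted here):

```python
import numpy as np

def tile_frame(frame, tile=640, overlap=64):
    """Split an HxWxC frame into overlapping tiles, yielding (x0, y0, tile)."""
    h, w = frame.shape[:2]
    step = tile - overlap
    for y0 in range(0, max(h - overlap, 1), step):
        for x0 in range(0, max(w - overlap, 1), step):
            y1, x1 = min(y0 + tile, h), min(x0 + tile, w)
            yield x0, y0, frame[y0:y1, x0:x1]

# each tile can then go through blobFromImage at the model's native 640x640
frame = np.zeros((2160, 3840, 3), dtype=np.uint8)
tiles = list(tile_frame(frame))
print(len(tiles))  # 28 tiles for a 4K frame with these settings
```

This keeps per-inference memory at the 640x640 level while preserving the full-resolution pixels of small objects, at the cost of running many inferences per frame.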


The 4K resolution is because we think that more detail of the object is better for the model’s accuracy. Besides, if the object is far away or very small, a 4K image has more pixels on the object than a 640x640 image.

Also, the 4 different models for the 4 classes is because some models overfit after 50 epochs while others are more accurate at 100 epochs. I mean, if every class were trained for 100 epochs, I would get worse results than with 4 separate models.


Does it even make sense to perform inference at 4K resolution in terms of getting more detail of the object? I mean, when we resize an image we lose information, and the idea is to detect far-away objects. How much accuracy (or confidence) would we gain with this approach?


without seeing your pictures, I can’t say for sure.

distance doesn’t matter. size of object relative to the picture also doesn’t matter.

apparent size of an object matters. that is the pixel size of the object, regardless of how large the picture is.

to a point. “enough” is a thing. you need enough resolution so that objects are comfortably recognizable. more doesn’t help more infinitely.

I don’t have pictures right now. But let’s say we have an object that in 4K has an apparent size of 20x20 pixels. Imagine we run two trainings, one with the 4K image and another with a resized version of this image (to 640x640, for example). If at inference time we find an object of a similar size to the training one, is it more likely to be detected with 4K training/4K inference or with 640x640 training/640x640 inference? Are the differences that considerable? What about resizing to 1280x1280?
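To put numbers on the apparent-size point above (my own arithmetic, not from the thread): resizing scales the object’s pixel footprint by the same factor as the image, so a 20x20 px object shrinks well below the size most detectors handle comfortably.

```python
def apparent_size(obj_px, src_w, dst_w):
    """Object pixel size after resizing an image from src_w to dst_w wide."""
    return obj_px * dst_w / src_w

obj = 20  # object is 20x20 px in the native 3840-wide frame
print(f"{apparent_size(obj, 3840, 640):.1f} px")   # 3.3 px
print(f"{apparent_size(obj, 3840, 1280):.1f} px")  # 6.7 px
```

For reference, YOLOv5 detects on feature maps with strides 8, 16, and 32, so a ~3 px object is smaller than a single cell even on the finest map; at 1280x1280 the object at least spans a cell, which is one reason tiling or training at larger input sizes tends to help with small objects.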

Thank you