Which approach to choose for icon localisation + classification?

Hi, I’m working on a project in which I would like to localise and classify icons on smartphone screens (so that a robotic arm holding a stylus can press them). The range of icons will be expanded over time. What approach should I take? I’ve been thinking about the following ones:

-train a network for semantic segmentation to localise icons on the screen (they will be arranged in a grid, but the position of a given icon can differ from phone to phone, as well as from user to user)
-use one-shot learning to train Siamese networks to classify app icons (using older or distorted versions of the original icon as the anchor and other icons as negatives)
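The triplet setup described above can be sketched in a few lines. This is a minimal numpy version of the triplet loss only (the embedding network itself is omitted); the names `anchor`, `positive` and `negative` are illustrative and stand for embedding vectors produced by the shared Siamese network:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss on embedding vectors: pull the positive (e.g. a
    distorted version of the same icon) towards the anchor, push the
    negative (a different icon) at least `margin` further away."""
    d_pos = np.sum((anchor - positive) ** 2)  # squared distance to positive
    d_neg = np.sum((anchor - negative) ** 2)  # squared distance to negative
    return max(0.0, d_pos - d_neg + margin)
```

At inference time a new icon class then only needs one reference embedding, which is the appeal of the one-shot approach when the icon set keeps growing.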

-use classical image processing to detect the icons on the screen, exploiting a priori information such as their square-shaped boundaries, or methods used in text detection (projecting the binarised image onto the vertical and horizontal axes and detecting the bands in which the icons are located)
-use one-shot learning on the icons as described above, or train one simple CNN, which would then have to be retrained each time a new icon is added.
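The projection-profile trick borrowed from text detection is simple to prototype. A minimal numpy sketch, assuming the screen has already been binarised so icons are foreground (non-zero) pixels; projecting onto one axis and finding runs of non-empty rows/columns gives the candidate icon bands:

```python
import numpy as np

def icon_bands(binary, axis):
    """Project a binarised image onto one axis and return the (start, end)
    index ranges where foreground pixels are present.
    axis=0 -> vertical bands (columns), axis=1 -> horizontal bands (rows)."""
    profile = binary.sum(axis=axis) > 0  # True where the band is non-empty
    bands, start = [], None
    for i, on in enumerate(profile):
        if on and start is None:
            start = i                    # band begins
        elif not on and start is not None:
            bands.append((start, i))     # band ends
            start = None
    if start is not None:
        bands.append((start, len(profile)))
    return bands
```

Intersecting the vertical and horizontal bands yields rectangular candidate regions for the icons.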

The answer depends on many things: do you see the screen through a camera? Do you have a fixed top-down view or a variable perspective view? Is it a single standard phone, or are there different models and screen sizes?

In the best case you simply split the phone screen into a constant grid and use L1 distance or SSIM to measure icon similarity against reference templates.
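The best case above can be sketched in a few lines of numpy. This is an illustrative sketch with L1 distance only (SSIM would need `skimage.metrics.structural_similarity` or similar); the grid shape and the template dictionary are assumptions for the example:

```python
import numpy as np

def grid_cells(screen, rows, cols):
    """Split a screen image (H x W array) into a rows x cols grid of cells."""
    h, w = screen.shape[0] // rows, screen.shape[1] // cols
    return {(r, c): screen[r * h:(r + 1) * h, c * w:(c + 1) * w]
            for r in range(rows) for c in range(cols)}

def l1_distance(a, b):
    """Mean absolute pixel difference; lower means more similar."""
    return np.mean(np.abs(a.astype(float) - b.astype(float)))

def classify_cell(cell, templates):
    """Return the name of the reference template closest to the cell."""
    return min(templates, key=lambda name: l1_distance(cell, templates[name]))
```

Adding a new icon then just means adding one more template image, with no retraining at all.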

In the worst case: get the phone mask with segmentation, use findContours to get contours, simplify each contour to a quadrilateral (not exactly a rectangle, but a contour with 4 coordinates), use the corner coordinates to calculate a perspective transform, normalise the screen with that transform, use an object-detection regression model or segmentation to detect icons, and a CNN classifier to classify them.
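In practice the contour and warping steps would be `cv2.findContours`, `cv2.approxPolyDP`, `cv2.getPerspectiveTransform` and `cv2.warpPerspective`. To show what the transform-estimation step actually computes, here is a dependency-free numpy sketch that solves the same 8-unknown linear system `cv2.getPerspectiveTransform` solves, assuming the four screen corners have already been found:

```python
import numpy as np

def perspective_matrix(src, dst):
    """Solve for the 3x3 homography mapping 4 source corners to 4
    destination corners (8 equations, 8 unknowns, h33 fixed to 1)."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.array(A, float), np.array(b, float))
    return np.append(h, 1.0).reshape(3, 3)

def warp_point(H, x, y):
    """Apply the homography to one point (homogeneous coordinates)."""
    p = H @ np.array([x, y, 1.0])
    return p[0] / p[2], p[1] / p[2]
```

Mapping the detected screen quadrilateral to a fixed upright rectangle this way normalises away the camera perspective before the icon detector and classifier run.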

And anything in between :slight_smile: