Segmenting an input video of a sentence in sign language into individual words

Input: A video of a sentence in sign language. The input video contains the gesture for each individual word that makes up the sentence.
Example: "How are you". For this sentence, the input video is a sequence of the gestures for the words "how", "are", and "you".
My task is to segment the input video into individual gestures and later perform some operations on individual gestures.
I referred to some articles and papers which state that this task can be achieved based on the gradient values of each frame. If the gradient value remains constant, that means it's the end or the start of a gesture.
I know how to calculate gradient values using Sobel, but I am unable to apply it here to segment the input video. I am looking for suggestions to understand this and move forward in the right direction.

btw, which sign language are you talking about?

that this task can be achieved based on gradient values of each frame

that probably needs more explanation
(do they mean temporal gradients? Sobel is a spatial operator)
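
If the papers do mean a temporal gradient, one rough way to try the idea is plain frame differencing: measure how much each frame changes from the previous one, and treat runs of near-zero change as the pauses between gestures. The sketch below is only an illustration under that assumption, not the method from any particular paper. The function names `motion_energy` and `find_gestures`, the blur size, `threshold`, and `min_pause` are all made up for this example and would need tuning on real footage.

```python
def motion_energy(video_path):
    """Mean absolute difference between consecutive grayscale frames.

    Entry i describes the change between frame i and frame i + 1.
    """
    import cv2  # imported here so find_gestures() below has no dependencies

    cap = cv2.VideoCapture(video_path)
    energies = []
    prev = None
    while True:
        ok, frame = cap.read()
        if not ok:  # end of video or read error
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        gray = cv2.GaussianBlur(gray, (5, 5), 0)  # damp sensor noise
        if prev is not None:
            energies.append(float(cv2.absdiff(gray, prev).mean()))
        prev = gray
    cap.release()
    return energies


def find_gestures(energies, threshold, min_pause=3):
    """Return (start, end) index pairs, end exclusive, of high-motion runs.

    A gesture ends only after at least `min_pause` consecutive values at or
    below `threshold`, so brief holds inside a sign do not split it.
    """
    segments = []
    start = None  # index where the current gesture began
    still = 0     # length of the current low-motion run
    for i, e in enumerate(energies):
        if e > threshold:
            if start is None:
                start = i
            still = 0
        elif start is not None:
            still += 1
            if still >= min_pause:
                # The gesture ended where the low-motion run began.
                segments.append((start, i - still + 1))
                start = None
                still = 0
    if start is not None:
        # Video ended mid-gesture; drop any trailing still frames.
        segments.append((start, len(energies) - still))
    return segments
```

Usage would be something like `find_gestures(motion_energy("sentence.mp4"), threshold=2.0)`, where the file name and threshold are placeholders. Fluent signing often has no clean pause between words, so whole-frame differencing like this may merge neighbouring signs; restricting the difference to the hand region would probably help.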