Patent attributes
A system comprising a non-transient computer-readable storage medium having stored thereon instructions, and at least one hardware processor configured to execute the instructions to:
receive a video sequence;
divide the video sequence into one or more scenes based on scene boundaries, wherein each scene comprises a plurality of temporally-contiguous image frames, and wherein said scene boundaries are determined based on a similarity metric between two temporally-contiguous image frames meeting a dissimilarity threshold; and
for each scene of the one or more scenes, (i) generate a plurality of preliminary classifications of an object appearing in at least some of said image frames in the scene, wherein each of said plurality of preliminary classifications has a confidence score, and (ii) calculate a combined classification of the object based on said plurality of preliminary classifications, wherein each of said preliminary classifications is weighted in accordance with its confidence score.
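The claimed pipeline has two steps. First, scenes are cut wherever adjacent frames are too dissimilar. Second, the per-frame classifications within each scene are merged using confidence weights. The Python below is a minimal sketch of that flow under several assumptions the claim itself does not fix. Each frame is represented as a flattened list of 8-bit grayscale pixels. The similarity metric is histogram intersection, and a boundary is placed when the dissimilarity (1 minus similarity) is at or above a threshold of 0.5. The per-frame classifier is a hypothetical caller-supplied `classify_frame` that returns a `(label, confidence)` pair. The weighting rule is a simple confidence-weighted vote. All function names here are illustrative only.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Frame = Sequence[int]               # assumption: flattened 8-bit grayscale pixels
Classification = Tuple[str, float]  # (label, confidence score)


def histogram(frame: Frame, bins: int = 16) -> List[float]:
    """Normalized intensity histogram of a single frame."""
    counts = [0] * bins
    for pixel in frame:
        counts[min(pixel * bins // 256, bins - 1)] += 1
    total = len(frame) or 1
    return [c / total for c in counts]


def similarity(a: Frame, b: Frame) -> float:
    """Histogram intersection in [0, 1]; 1.0 means identical histograms."""
    return sum(min(x, y) for x, y in zip(histogram(a), histogram(b)))


def split_into_scenes(frames: List[Frame], threshold: float = 0.5) -> List[List[Frame]]:
    """Open a new scene wherever two adjacent frames are dissimilar enough."""
    if not frames:
        return []
    scenes: List[List[Frame]] = [[frames[0]]]
    for prev, cur in zip(frames, frames[1:]):
        if 1.0 - similarity(prev, cur) >= threshold:  # dissimilarity meets threshold
            scenes.append([cur])  # scene boundary: start a new scene
        else:
            scenes[-1].append(cur)
    return scenes


def combine(preliminary: List[Classification]) -> Classification:
    """Confidence-weighted vote: sum confidences per label and keep the best."""
    if not preliminary:
        raise ValueError("need at least one preliminary classification")
    weights: Dict[str, float] = defaultdict(float)
    for label, confidence in preliminary:
        weights[label] += confidence
    total = sum(weights.values())
    best = max(weights, key=lambda lbl: weights[lbl])
    # Return the winning label and its share of the total confidence.
    return best, (weights[best] / total if total else 0.0)


def classify_scenes(
    frames: List[Frame],
    classify_frame: Callable[[Frame], Classification],  # hypothetical per-frame classifier
    threshold: float = 0.5,
) -> List[Classification]:
    """Produce one combined classification per detected scene."""
    results = []
    for scene in split_into_scenes(frames, threshold):
        preliminary = [classify_frame(frame) for frame in scene]
        results.append(combine(preliminary))
    return results


if __name__ == "__main__":
    dark, bright = [10] * 64, [240] * 64
    toy = lambda f: ("night", 0.8) if sum(f) / len(f) < 128 else ("day", 0.7)
    print(classify_scenes([dark, dark, bright, bright, bright], toy))
    # [('night', 1.0), ('day', 1.0)]
```

Summing confidences per label is only one way to weight each preliminary classification by its confidence score. Averaging full per-class probability vectors would also match the claim language, as would discarding low-confidence frames before voting. A real system would also swap the toy histogram metric for whatever frame-similarity measure it actually uses.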