Patent attributes
Systems and methods for improving video understanding tasks based on higher-order object interactions (HOIs) between object features are provided. A plurality of frames of a video are obtained. A coarse-grained feature representation is generated by generating an image feature for each of for each of a plurality of timesteps respectively corresponding to each of the frames and performing attention based on the image features. A fine-grained feature representation is generated by generating an object feature for each of the plurality of timesteps and generating the HOIs between the object features. The coarse-grained and the fine-grained feature representations are concatenated to generate a concatenated feature representation.