Patent attributes
Described herein is a system for improving sentiment detection and/or recognition using multiple inputs. For example, an autonomously motile device is configured to generate audio data and/or image data and perform sentiment detection processing. The device may process the audio data and the image data using a multimodal temporal attention model to generate sentiment data that estimates a sentiment score and/or a sentiment category. In some examples, the device may also process language data (e.g., lexical information) using the multimodal temporal attention model. The device can adjust its operations based on the sentiment data. For example, the device may improve an interaction with the user by estimating the user's current emotional state, or may change a position of the device and/or its sensor(s) relative to the user to improve the accuracy of the sentiment data.
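The abstract above describes fusing audio, image, and language inputs through a multimodal temporal attention model to produce a sentiment score and category. The following is a minimal NumPy sketch of that general idea; the dimensions, pooling strategy, fusion by concatenation, and output heads are all illustrative assumptions and not the patented implementation.

```python
import numpy as np

# Hypothetical sketch: each modality is a (time, dim) feature sequence;
# temporal attention pools each sequence into one vector, the pooled
# vectors are fused, and two heads emit a score and a category.
rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class TemporalAttentionPool:
    """Collapse a (T, dim) sequence into one (dim,) vector by
    attending over the time axis with a learned query vector."""
    def __init__(self, dim):
        self.query = rng.standard_normal(dim) / np.sqrt(dim)

    def __call__(self, seq):                  # seq: (T, dim)
        weights = softmax(seq @ self.query)   # (T,) attention over time
        return weights @ seq                  # (dim,) weighted sum

class MultimodalSentimentModel:
    """Fuse pooled audio, image, and lexical features, then emit a
    continuous sentiment score and a category distribution.
    Weights are random placeholders standing in for trained parameters."""
    def __init__(self, dim=16, n_categories=3):
        self.pools = {m: TemporalAttentionPool(dim)
                      for m in ("audio", "image", "language")}
        fused_dim = dim * 3  # simple concatenation fusion (an assumption)
        self.w_score = rng.standard_normal(fused_dim) / np.sqrt(fused_dim)
        self.w_cat = rng.standard_normal((fused_dim, n_categories)) / np.sqrt(fused_dim)

    def __call__(self, audio, image, language):
        fused = np.concatenate([self.pools["audio"](audio),
                                self.pools["image"](image),
                                self.pools["language"](language)])
        score = float(np.tanh(fused @ self.w_score))  # sentiment score in [-1, 1]
        category = softmax(fused @ self.w_cat)        # category probabilities
        return score, category

model = MultimodalSentimentModel()
score, category = model(rng.standard_normal((20, 16)),  # e.g. 20 audio frames
                        rng.standard_normal((10, 16)),  # e.g. 10 image frames
                        rng.standard_normal((5, 16)))   # e.g. 5 token embeddings
```

Each modality can contribute a different number of time steps, since the attention pooling normalizes over whatever sequence length it receives; this mirrors the abstract's point that audio, image, and lexical streams are processed jointly rather than requiring aligned inputs.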