Patent attributes
The systems and methods described herein may generate multi-modal embeddings with sub-symbolic features and symbolic features. The sub-symbolic embeddings may be generated with computer vision processing. The symbolic features may include mathematical representations of image content, which are enriched with information from background knowledge sources. The system may aggregate the sub-symbolic and symbolic features using aggregation techniques such as concatenation, averaging, summing, and/or maxing. The multi-modal embeddings may be included in a multi-modal embedding model and trained via supervised learning. Once the multi-modal embeddings are trained, the system may generate inferences based on linear algebra operations involving the multi-modal embeddings that are relevant to an inference response to the natural language question and input image.