A method comprises receiving an input of at least one image associated with an issue. In the method, text is extracted from the at least one image, and an intent is determined from the extracted text. The method further includes recommending a response to the issue based at least in part on the determined intent, and transmitting the recommended response to a user. The extracting, determining and recommending are performed at least in part using one or more machine learning models.