Obtain access to a three-dimensional point cloud representation of an object, including poses of a scanning digital camera and corresponding video frames. Down-sample the three-dimensional point cloud representation to obtain a set of region-of-interest candidates. Filter the region-of-interest candidates, based at least in part on the poses of the camera, to select as regions of interest those candidates whose appearance changes distinguish different visual states. Generate region-of-interest images for the selected regions of interest from corresponding ones of the video frames, and train a deep learning recognition model based on the region-of-interest images. The trained deep learning recognition model can be used, for example, to determine a visual state of the object for repair instructions.
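The steps above can be illustrated with a minimal sketch. It is not the claimed implementation: the voxel-grid down-sampling, the pinhole camera model with a world-to-camera pose (R, t) and intrinsic matrix K, and the mean-absolute-difference appearance score are all assumptions chosen for illustration, and every function name is hypothetical. The training step is omitted.

```python
import numpy as np


def voxel_downsample(points, voxel_size):
    """Assumed down-sampling: return one centroid per occupied voxel."""
    points = np.asarray(points, dtype=np.float64)
    keys = np.floor(points / voxel_size).astype(np.int64)
    _, inverse, counts = np.unique(
        keys, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.reshape(-1)  # shape differs across NumPy versions
    sums = np.zeros((counts.shape[0], 3))
    np.add.at(sums, inverse, points)  # accumulate points per voxel
    return sums / counts[:, None]


def project_point(point, R, t, K):
    """Project a 3D world point into a frame (assumed pinhole model).

    R, t map world coordinates to camera coordinates; K is the 3x3
    intrinsic matrix. Returns (u, v) or None if the point is behind
    the camera.
    """
    p_cam = R @ np.asarray(point, dtype=np.float64) + t
    if p_cam[2] <= 0:
        return None
    uv = K @ (p_cam / p_cam[2])
    return uv[:2]


def crop_roi(frame, uv, half):
    """Cut a (2*half x 2*half) region-of-interest image around pixel uv.

    Returns None when the window does not fit inside the frame.
    """
    h, w = frame.shape[:2]
    u, v = int(round(uv[0])), int(round(uv[1]))
    if u - half < 0 or v - half < 0 or u + half > w or v + half > h:
        return None
    return frame[v - half:v + half, u - half:u + half]


def appearance_change_score(crops_state_a, crops_state_b):
    """Assumed filter criterion: mean absolute difference between the
    average crop in one visual state and the average crop in another."""
    mean_a = np.mean(crops_state_a, axis=0)
    mean_b = np.mean(crops_state_b, axis=0)
    return float(np.mean(np.abs(mean_a - mean_b)))
```

Under these assumptions, a candidate would be kept as a selected region of interest when its score exceeds a chosen threshold, and the crops from the kept candidates would form the training images for the recognition model.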