An apparatus for processing image data associated with at least one input image, the apparatus including: a convolutional neural network (CNN)-based encoder configured to provide a plurality of hierarchical feature maps based on the image data; and a decoder configured to provide output data based on the plurality of hierarchical feature maps, wherein the decoder includes a convolutional long short-term memory (Conv-LSTM) module configured to sequentially process at least some of the plurality of hierarchical feature maps.
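
By way of illustration only, the encoder/decoder arrangement described above might be sketched as follows. This is a minimal, single-channel, pure-Python sketch under stated assumptions, not the claimed implementation: the names `encode`, `decode`, `ConvLSTMCell`, `conv3x3`, `avg_pool2` and `upsample2` are hypothetical; the random, untrained 3x3 kernels, the 2x2 average pooling, the coarsest-to-finest processing order and the nearest-neighbour upsampling of the recurrent state between steps are all assumptions not stated in the text; and the Conv-LSTM cell is a simplified variant without bias or peephole terms.

```python
import math
import random

def conv3x3(x, k):
    """Zero-padded 3x3 convolution (cross-correlation) of a 2D list x with kernel k."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:
                        out[i][j] += x[ii][jj] * k[di + 1][dj + 1]
    return out

def avg_pool2(x):
    """2x2 average pooling; halves each spatial dimension."""
    return [[(x[2 * i][2 * j] + x[2 * i][2 * j + 1]
              + x[2 * i + 1][2 * j] + x[2 * i + 1][2 * j + 1]) / 4.0
             for j in range(len(x[0]) // 2)] for i in range(len(x) // 2)]

def upsample2(x):
    """Nearest-neighbour upsampling; doubles each spatial dimension."""
    out = []
    for row in x:
        wide = [v for v in row for _ in (0, 1)]
        out += [wide, list(wide)]
    return out

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def rand_kernel(rng):
    # Assumption: random, untrained weights stand in for learned ones.
    return [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(3)]

def encode(image, kernels):
    """Stand-in CNN encoder: one conv + ReLU per level, pooled between levels.
    Returns the hierarchical feature maps, finest resolution first."""
    maps, x = [], image
    for k in kernels:
        x = [[max(0.0, v) for v in row] for row in conv3x3(x, k)]
        maps.append(x)
        x = avg_pool2(x)
    return maps

class ConvLSTMCell:
    """Simplified single-channel Conv-LSTM cell (no bias, no peepholes)."""
    def __init__(self, rng):
        # One input kernel and one hidden-state kernel per gate:
        # i = input, f = forget, o = output, g = candidate.
        self.wx = {name: rand_kernel(rng) for name in "ifog"}
        self.wh = {name: rand_kernel(rng) for name in "ifog"}

    def step(self, x, h, c):
        pre = {}
        for name in "ifog":
            a, b = conv3x3(x, self.wx[name]), conv3x3(h, self.wh[name])
            pre[name] = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]
        rows, cols = len(x), len(x[0])
        new_h = [[0.0] * cols for _ in range(rows)]
        new_c = [[0.0] * cols for _ in range(rows)]
        for r in range(rows):
            for s in range(cols):
                i = sigmoid(pre["i"][r][s])
                f = sigmoid(pre["f"][r][s])
                o = sigmoid(pre["o"][r][s])
                cand = math.tanh(pre["g"][r][s])
                new_c[r][s] = f * c[r][s] + i * cand
                new_h[r][s] = o * math.tanh(new_c[r][s])
        return new_h, new_c

def decode(feature_maps, cell):
    """Feed the feature maps to the Conv-LSTM one at a time, coarsest first
    (an assumed order), upsampling the recurrent state to match each map."""
    coarse_first = feature_maps[::-1]
    rows, cols = len(coarse_first[0]), len(coarse_first[0][0])
    h = [[0.0] * cols for _ in range(rows)]
    c = [[0.0] * cols for _ in range(rows)]
    for fmap in coarse_first:
        if len(h) != len(fmap):
            h, c = upsample2(h), upsample2(c)
        h, c = cell.step(fmap, h, c)
    return h  # final hidden state serves as the output data

rng = random.Random(0)
image = [[rng.random() for _ in range(8)] for _ in range(8)]
encoder_kernels = [rand_kernel(rng) for _ in range(3)]
feature_maps = encode(image, encoder_kernels)
output = decode(feature_maps, ConvLSTMCell(rng))
print([len(m) for m in feature_maps], len(output), len(output[0]))
```

In this sketch an 8x8 input yields three feature maps of sizes 8x8, 4x4 and 2x2, and the decoder returns an 8x8 output. The hierarchy level plays the role that the time step plays in a conventional LSTM, so the recurrent state carries information from one resolution to the next.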