Systems and methods for detecting anomalies in video data are provided. The system includes a generator that receives past video frames, extracts spatio-temporal features of the past video frames, and generates frames. The generator includes fully convolutional transformer-based generative adversarial networks (FCT-GANs). The system includes an image discriminator that discriminates between generated frames and real frames. The system also includes a video discriminator that discriminates between generated video and real video. The generator trains a fully convolutional transformer network (FCTN) model and determines an anomaly score of at least one test video based on a prediction residual map from the FCTN model.
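The scoring step based on a prediction residual map can be illustrated with a minimal sketch. The abstract does not state how the residual map or the score is computed, so the choices here are assumptions made only for illustration: the residual is taken as the per-pixel squared error between a generated frame and the corresponding real frame, and the score is the mean of that map. The function names `residual_map` and `anomaly_score` are hypothetical and do not come from the source.

```python
import numpy as np


def residual_map(predicted, actual):
    """Per-pixel squared error between a generated frame and the real frame.

    Assumption: squared error averaged over the channel axis; the source
    does not specify the residual definition.
    """
    predicted = np.asarray(predicted, dtype=np.float64)
    actual = np.asarray(actual, dtype=np.float64)
    if predicted.shape != actual.shape:
        raise ValueError("predicted and actual frames must have the same shape")
    squared_error = (predicted - actual) ** 2
    # Collapse the channel axis for (H, W, C) inputs; leave (H, W) inputs as-is.
    return squared_error.mean(axis=-1) if squared_error.ndim == 3 else squared_error


def anomaly_score(residual):
    """Reduce a residual map to one scalar (assumed here: the mean residual)."""
    return float(np.mean(residual))


# Usage: a frame predicted perfectly scores 0; a mispredicted pixel raises the score.
real = np.zeros((4, 4, 3))
perfect = np.zeros((4, 4, 3))
mispredicted = np.zeros((4, 4, 3))
mispredicted[0, 0, :] = 1.0  # one pixel is wrong in every channel

score_normal = anomaly_score(residual_map(perfect, real))
score_anomalous = anomaly_score(residual_map(mispredicted, real))
```

Under these assumptions, frames the model predicts poorly yield a larger score, which could then be compared against a threshold chosen on held-out data. A real system might instead use a different reduction, such as a maximum over local patches.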