A method may include obtaining a video collected by a visual sensor, the video including a plurality of frames, and detecting one or more objects from the video in at least a portion of the plurality of frames. The method may also include determining, with a trained self-learning model, a first detection result associated with the one or more objects. The method may further include selecting a target moving object of interest from the one or more objects based at least in part on the first detection result. The trained self-learning model may be provided based on a plurality of training samples collected by the visual sensor.
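
By way of illustration only, the selecting step might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the `Detection` record, the `(x, y, width, height)` box layout, the displacement test for "moving", the thresholds, and `toy_model` (a placeholder standing in for the trained self-learning model) are all hypothetical names and choices introduced here for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x, y, width, height)


@dataclass
class Detection:
    """One detected object in one frame (hypothetical record)."""
    frame_index: int
    object_id: int
    box: Box


def box_center(box: Box) -> Tuple[float, float]:
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)


def displacement(track: List[Detection]) -> float:
    """Distance between the box centers in the earliest and latest frames."""
    ordered = sorted(track, key=lambda d: d.frame_index)
    x0, y0 = box_center(ordered[0].box)
    x1, y1 = box_center(ordered[-1].box)
    return ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5


def select_target(
    detections: List[Detection],
    model: Callable[[List[Detection]], float],
    min_displacement: float = 5.0,  # assumed threshold for "moving"
    min_score: float = 0.5,         # assumed threshold on the model output
) -> Optional[int]:
    """Return the id of the highest-scoring moving object, or None."""
    # Group per-frame detections into one track per object.
    tracks: Dict[int, List[Detection]] = {}
    for det in detections:
        tracks.setdefault(det.object_id, []).append(det)

    best_id: Optional[int] = None
    best_score = 0.0
    for object_id, track in tracks.items():
        if displacement(track) < min_displacement:
            continue  # treated as stationary, so not a moving object
        score = model(track)  # plays the role of the first detection result
        if score >= min_score and (best_id is None or score > best_score):
            best_id, best_score = object_id, score
    return best_id


def toy_model(track: List[Detection]) -> float:
    """Placeholder scorer; a real system would call the trained model here."""
    return min(1.0, displacement(track) / 25.0)


detections = [
    Detection(0, 1, (0, 0, 10, 10)), Detection(1, 1, (20, 0, 10, 10)),    # moves 20 px
    Detection(0, 2, (50, 50, 10, 10)), Detection(1, 2, (50, 50, 10, 10)),  # stationary
    Detection(0, 3, (0, 100, 10, 10)), Detection(1, 3, (8, 100, 10, 10)),  # moves 8 px
]
target_id = select_target(detections, toy_model)
```

In this sketch, object 2 is discarded as stationary, object 3 moves but scores below the assumed threshold, and object 1 is selected. Passing the model as a callable keeps the selection logic independent of how the self-learning model is trained or updated.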