A method and system for indexing a video sequence comprising video information and audio information, the video sequence having been created by at least one camera. Based on changes in the camera motion mode, the video sequence is separated into motion segments, which were created while the camera was in motion, and still segments, which were created while the camera was in a fixed position. Each still segment is partitioned into episodes based on changes in the audio information in that still segment, and the video sequence is indexed with an identifier that signifies at least a start or an end of an episode contained in the video sequence.
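
The three steps described above (separating by camera motion mode, partitioning still segments by audio change, and indexing episode boundaries) can be illustrated with a minimal Python sketch. It is not the claimed implementation. The abstract does not say how camera motion or audio changes are detected, so the sketch assumes a per-frame boolean motion flag and a per-frame scalar audio level are already available, and it treats a jump in audio level above a threshold as an audio change. All names (`Segment`, `split_by_camera_motion`, `partition_into_episodes`, `index_video`, `threshold`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Segment:
    start: int    # first frame index, inclusive
    end: int      # last frame index, inclusive
    moving: bool  # True for a motion segment, False for a still segment


def split_by_camera_motion(moving_flags: List[bool]) -> List[Segment]:
    """Group consecutive frames that share the same camera motion mode.

    Assumption: moving_flags[i] is True when the camera was moving at frame i.
    """
    segments: List[Segment] = []
    for i, flag in enumerate(moving_flags):
        if segments and segments[-1].moving == flag:
            segments[-1].end = i  # same mode: extend the current segment
        else:
            segments.append(Segment(i, i, flag))  # mode changed: new segment
    return segments


def partition_into_episodes(
    segment: Segment, audio_levels: List[float], threshold: float
) -> List[Tuple[int, int]]:
    """Split one still segment wherever the audio level jumps by more than threshold.

    Assumption: a large frame-to-frame change in a scalar audio level stands in
    for whatever audio-change criterion a real system would use.
    """
    episodes: List[Tuple[int, int]] = []
    start = segment.start
    for i in range(segment.start + 1, segment.end + 1):
        if abs(audio_levels[i] - audio_levels[i - 1]) > threshold:
            episodes.append((start, i - 1))
            start = i
    episodes.append((start, segment.end))
    return episodes


def index_video(
    moving_flags: List[bool], audio_levels: List[float], threshold: float
) -> List[Dict[str, int]]:
    """Build an index of episode start and end frames for all still segments."""
    index: List[Dict[str, int]] = []
    for seg in split_by_camera_motion(moving_flags):
        if seg.moving:
            continue  # only still segments are partitioned into episodes
        for start, end in partition_into_episodes(seg, audio_levels, threshold):
            index.append({"episode_start": start, "episode_end": end})
    return index


# Illustrative input: 9 frames, camera moving at frames 0-1 and 6.
moving_flags = [True, True, False, False, False, False, True, False, False]
audio_levels = [0.1, 0.1, 0.2, 0.2, 0.9, 0.9, 0.5, 0.3, 0.3]
episode_index = index_video(moving_flags, audio_levels, threshold=0.5)
```

With this illustrative input, the still segment spanning frames 2 to 5 is split at frame 4, where the audio level jumps, and the still segment spanning frames 7 to 8 forms a single episode. Storing both a start and an end per episode is a choice made for the sketch; the abstract requires only that the identifier signify at least one of them.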