In some cases, one or more heuristics can be generated automatically from a small dataset of segments previously labeled by one or more domain experts. The generated heuristics, along with one or more patterns, can be used to assign training labels to a large unlabeled dataset of segments. A subset of segments representing an occurrence of verbal harassment can be selected using the assigned training labels. Randomly selected segments can be treated as indicative of a non-occurrence of verbal harassment. The selected subset of segments and the randomly selected segments can then be used to train one or more machine learning models for verbal harassment detection.
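The labeling flow described above can be illustrated with a minimal Python sketch. It is not the described system's implementation. Everything in it is an assumption made for illustration: the function names, the keyword-based form of the generated heuristics, the `min_count` threshold, the example regular expression, and the toy transcripts. It also assumes that the random non-occurrence segments are drawn from segments that no heuristic or pattern flagged, which the text does not specify. Model training is left out; the resulting `(segment, label)` pairs could be passed to any classifier.

```python
import random
import re
from collections import Counter


def tokenize(text):
    """Lowercase a segment and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def generate_keyword_heuristics(labeled, min_count=2):
    """Hypothetical heuristic generator: keep tokens that occur in at least
    `min_count` expert-labeled positive segments and in no negative segment."""
    pos, neg = Counter(), Counter()
    for text, label in labeled:
        (pos if label == 1 else neg).update(set(tokenize(text)))
    return {tok for tok, n in pos.items() if n >= min_count and neg[tok] == 0}


def assign_label(text, keywords, patterns):
    """Return 1 if a generated heuristic or a pattern fires, else None (abstain)."""
    if keywords & set(tokenize(text)):
        return 1
    if any(p.search(text) for p in patterns):
        return 1
    return None


def build_training_set(unlabeled, keywords, patterns, seed=0):
    """Select flagged segments as positives and random other segments as negatives."""
    positives = [s for s in unlabeled if assign_label(s, keywords, patterns) == 1]
    rest = [s for s in unlabeled if s not in positives]
    # Assumption: negatives are sampled from unflagged segments, one per positive.
    rng = random.Random(seed)
    negatives = rng.sample(rest, k=min(len(positives), len(rest)))
    return [(s, 1) for s in positives] + [(s, 0) for s in negatives]


# Toy expert-labeled segments (1 = verbal harassment, 0 = none); invented data.
labeled = [
    ("shut up you idiot", 1),
    ("you are such an idiot", 1),
    ("please turn left here", 0),
    ("thank you have a nice day", 0),
    ("you can drop me here", 0),
]
# Illustrative hand-written pattern, used alongside the generated heuristics.
patterns = [re.compile(r"\bget out of my (car|face)\b", re.IGNORECASE)]
# Toy stand-in for the large unlabeled dataset.
unlabeled = [
    "what an idiot driver",
    "get out of my car right now",
    "the airport is ten minutes away",
    "could you open the window",
    "nice weather today",
    "take the next exit please",
]

keywords = generate_keyword_heuristics(labeled)
training = build_training_set(unlabeled, keywords, patterns)
for segment, label in training:
    print(label, segment)
```

In this sketch the heuristics only ever vote for an occurrence or abstain, so the non-occurrence class comes entirely from random sampling, mirroring the description. Random sampling is a reasonable source of negatives only when harassment is rare in the unlabeled data; otherwise some sampled segments will be mislabeled.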