Methods, systems, and media for determining similar malware samples are disclosed. Two or more malware samples are received and analyzed to extract information from each sample. The extracted information is converted to a plurality of sets of strings. A similarity between the two or more malware samples is determined based on the plurality of sets of strings.
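
By way of illustration only, the flow summarized above (extract information, convert it to sets of strings, compare the sets) might be sketched as follows. The abstract does not name a similarity measure or a string format, so the Jaccard index, the "key=value" string encoding, and every name here (`to_string_set`, `jaccard`, `pairwise_similarity`, the sample data) are assumptions made for this sketch, not the disclosed implementation.

```python
from itertools import combinations


def to_string_set(extracted):
    """Convert extracted information (attribute -> list of values) to a set of strings.

    The "key=value" encoding is an assumption of this sketch.
    """
    return {f"{key}={value}" for key, values in extracted.items() for value in values}


def jaccard(a, b):
    """Jaccard index of two sets (an assumed measure; the source names none)."""
    union = a | b
    if not union:
        return 0.0  # two empty sets: treated here as no evidence of similarity
    return len(a & b) / len(union)


def pairwise_similarity(samples):
    """Score every pair of samples by comparing their sets of strings."""
    sets = {name: to_string_set(info) for name, info in samples.items()}
    return {
        (x, y): jaccard(sets[x], sets[y])
        for x, y in combinations(sorted(sets), 2)
    }


# Hypothetical extracted information for two samples.
samples = {
    "sample_a": {"imports": ["CreateFileA", "WriteFile"], "domains": ["example.test"]},
    "sample_b": {"imports": ["CreateFileA", "WriteFile"], "domains": ["other.test"]},
}
scores = pairwise_similarity(samples)
print(scores[("sample_a", "sample_b")])  # 2 shared strings out of 4 distinct -> 0.5
```

Representing each sample as a set of strings keeps the comparison independent of the order in which information was extracted; a different measure could be substituted for `jaccard` without changing the rest of the sketch.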