Patent attributes
One embodiment of the invention provides a method for utility-preserving text de-identification. The method comprises generating corresponding processed text for each text document by applying at least one natural language processor (NLP) annotator to the text document to recognize and tag privacy-sensitive personal information corresponding to an individual, and replacing selected words in the text document with replacement values. The method further comprises determining terms that occur infrequently across all of the processed texts, filtering those infrequent terms out of the processed texts, and selectively reinstating to the processed texts at least one of the infrequent terms that is innocuous. The method further comprises generating a corresponding de-identified text document for each processed text by anonymizing the privacy-sensitive personal information in the processed text to an extent that preserves the data utility of the processed text while concealing the individual's personal identity.
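The steps described above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the regular expressions stand in for NLP annotators, "infrequent" is assumed to mean a document frequency below a threshold, and the names `ANNOTATORS`, `tag_and_replace`, `tokenize`, `deidentify`, `min_doc_freq`, and `innocuous` are hypothetical choices made for illustration only.

```python
import re
from collections import Counter

# Hypothetical stand-ins for NLP annotators: each pairs a replacement tag
# with a pattern that recognizes one kind of privacy-sensitive value.
ANNOTATORS = [
    ("[PHONE]", re.compile(r"\b\d{3}-\d{3}-\d{4}\b")),
    ("[DATE]", re.compile(r"\b\d{4}-\d{2}-\d{2}\b")),
]

# A token is either a replacement tag such as [PHONE] or a run of letters.
TOKEN = re.compile(r"\[[A-Z]+\]|[A-Za-z]+")


def tag_and_replace(text):
    """Replace every span an annotator recognizes with that annotator's tag."""
    for tag, pattern in ANNOTATORS:
        text = pattern.sub(tag, text)
    return text


def tokenize(text):
    """Split into tokens, lowercasing words but leaving tags unchanged."""
    return [t if t.startswith("[") else t.lower() for t in TOKEN.findall(text)]


def deidentify(documents, min_doc_freq=2, innocuous=frozenset()):
    """Tag, filter infrequent terms, and reinstate the innocuous ones.

    Assumption: a term is infrequent when it appears in fewer than
    min_doc_freq processed texts. Replacement tags are never filtered.
    """
    processed = [tokenize(tag_and_replace(d)) for d in documents]

    # Document frequency: in how many processed texts each term occurs.
    doc_freq = Counter(term for tokens in processed for term in set(tokens))
    infrequent = {
        term
        for term, count in doc_freq.items()
        if count < min_doc_freq and not term.startswith("[")
    }

    # Selective reinstatement: innocuous infrequent terms are kept.
    removed = infrequent - set(innocuous)

    return [
        " ".join(t for t in tokens if t not in removed) for tokens in processed
    ]


docs = [
    "Patient Zebulon called 555-123-4567 about a headache.",
    "Patient reported a headache on 2020-01-15.",
    "Patient called about nausea.",
]
result = deidentify(docs, innocuous={"nausea", "reported", "on"})
```

In this toy corpus the rare name "Zebulon" appears in only one document and is dropped, while "nausea" is equally rare but survives because it is on the assumed innocuous list. A real system would need a principled way to decide which infrequent terms are innocuous; the hand-written set here is only a placeholder for that decision.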