A candidate attribute combination of a first data set is identified, such that the candidate attribute combination meets a data type similarity criterion with respect to a collection of data types of sensitive information for which the first data set is to be analyzed. A collection of input features is generated for a machine learning model from the candidate attribute combination, including at least one feature indicative of a statistical relationship between the values of the candidate attribute combination and a second data set. An indication of a predicted probability of a presence of sensitive information in the first data set is obtained using the machine learning model.