Techniques for machine learning (ML) and natural language processing (NLP) are described. One technique enables the creation of a clean training dataset through just a few API calls. A second technique provides an automated process for generating a domain-specific lexicon, which is then used to generate ML training datasets with little or no human labor. A third technique gathers ML training data from domain-specific public sources, which are more likely than typical public sources to contain focused terminology and to be free from errors, so that the trained ML models provide more accurate inferences.
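
As a rough illustration of the lexicon-driven technique, the following minimal Python sketch derives a small domain lexicon by comparing term frequencies in domain-specific text against general text, then uses that lexicon to label sentences automatically. The function names (`build_lexicon`, `label_sentences`), the frequency-ratio heuristic, and the thresholds are assumptions made for this example only; the abstract does not specify how the lexicon is generated or applied.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_lexicon(domain_docs, general_docs, min_ratio=2.0, min_count=2):
    """Return terms that are markedly more frequent in domain text.

    Hypothetical heuristic: keep a term when its relative frequency in the
    domain corpus is at least `min_ratio` times its (smoothed) relative
    frequency in the general corpus.
    """
    domain = Counter(t for doc in domain_docs for t in tokenize(doc))
    general = Counter(t for doc in general_docs for t in tokenize(doc))
    domain_total = sum(domain.values()) or 1
    general_total = sum(general.values()) or 1

    lexicon = set()
    for term, count in domain.items():
        if count < min_count:
            continue  # ignore terms seen too rarely to trust
        domain_rate = count / domain_total
        # Add-one smoothing so unseen general terms do not divide by zero.
        general_rate = (general[term] + 1) / (general_total + 1)
        if domain_rate / general_rate >= min_ratio:
            lexicon.add(term)
    return lexicon


def label_sentences(sentences, lexicon):
    """Label a sentence 1 if it contains any lexicon term, else 0."""
    return [
        (s, int(any(tok in lexicon for tok in tokenize(s))))
        for s in sentences
    ]


if __name__ == "__main__":
    domain_docs = [
        "The patient received a stent after angioplasty.",
        "Angioplasty and stent placement reduce stenosis.",
    ]
    general_docs = [
        "The weather was nice and the team won the game.",
        "The market opened after the holiday.",
    ]
    lexicon = build_lexicon(domain_docs, general_docs)
    for sentence, label in label_sentences(
        ["A stent was placed.", "The game was fun."], lexicon
    ):
        print(label, sentence)
```

In this sketch the labeled pairs stand in for a generated training dataset; a production pipeline would likely use a richer term-scoring method and a larger general-language reference corpus.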