Patent attributes
A largely automated method of categorizing spend data is provided that does not require prior in-depth knowledge of an organization's transactional data. Natural language processing is applied to text data from the transactional data to generate a consolidated cleaned data set (CDS) containing information for categorization. Transaction logs are clustered based on similarity, forming the minimal data set (MDS). An automated algorithm selects a subset of high-value clusters, which are categorized by asking users to manually categorize one or more representative logs from each cluster in the subset. A machine learning model is then trained on the manually categorized clusters and used to predict spend categories for the remaining logs with high accuracy. An AI engine automatically analyzes the predictions based on client context and either auto-tunes the machine learning model or identifies a new subset of clusters to be manually categorized. This loop may continue until 95% to 100% of the spend is categorized.
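The loop described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: every function name, the Jaccard-similarity clustering, the token-voting classifier, and the parameters `threshold`, `k`, `target`, and `max_rounds` are assumptions chosen for brevity. The auto-tuning branch is omitted; the sketch only requests more manual labels when the categorized share of spend is below the target.

```python
import re
from collections import Counter, defaultdict

def clean(text):
    """Stand-in for the NLP cleaning step: lowercase alphabetic tokens only."""
    return frozenset(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Token-set similarity in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(logs, threshold=0.5):
    """Greedy one-pass clustering (assumed technique): a log joins the first
    cluster whose representative is similar enough, else starts a new one."""
    clusters = []
    for i, (text, amount) in enumerate(logs):
        tokens = clean(text)
        for c in clusters:
            if jaccard(tokens, c["rep"]) >= threshold:
                c["members"].append(i)
                c["spend"] += amount
                break
        else:
            clusters.append({"rep": tokens, "members": [i], "spend": amount})
    return clusters

def train(labeled):
    """Toy model: each token remembers which categories it appeared under."""
    model = defaultdict(Counter)
    for tokens, category in labeled:
        for t in tokens:
            model[t][category] += 1
    return model

def predict(model, tokens):
    """Majority vote over known tokens; None if no token was seen in training."""
    votes = Counter()
    for t in tokens:
        votes.update(model.get(t, {}))
    return votes.most_common(1)[0][0] if votes else None

def categorize(logs, ask_user, k=2, target=0.95, max_rounds=10):
    """Label the k highest-spend open clusters, train, predict the rest,
    and repeat until the target share of spend is categorized."""
    clusters = cluster(logs)
    total = sum(c["spend"] for c in clusters)
    manual, predicted = {}, {}
    for _ in range(max_rounds):
        todo = [i for i in range(len(clusters))
                if i not in manual and i not in predicted]
        todo.sort(key=lambda i: clusters[i]["spend"], reverse=True)
        for i in todo[:k]:
            # Show the user one representative log from the cluster.
            manual[i] = ask_user(logs[clusters[i]["members"][0]][0])
        model = train([(clusters[i]["rep"], cat) for i, cat in manual.items()])
        predicted = {}
        for i, c in enumerate(clusters):
            if i not in manual:
                category = predict(model, c["rep"])
                if category is not None:
                    predicted[i] = category
        covered = sum(clusters[i]["spend"] for i in list(manual) + list(predicted))
        if total == 0 or covered / total >= target:
            break
    # Map each log index to its cluster's category (None if uncategorized).
    result = {}
    for i, c in enumerate(clusters):
        category = manual.get(i, predicted.get(i))
        for member in c["members"]:
            result[member] = category
    return result
```

Here `logs` is assumed to be a list of `(description, amount)` pairs, and `ask_user` is a hypothetical callback that returns a category for one representative log. Selecting clusters by total spend is one plausible reading of "high-value"; a production system would likely use a stronger text representation and classifier.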