An emotion estimation system includes a feature amount extraction unit, a vowel section specification unit, and an estimation unit. The feature amount extraction unit analyzes recorded speech uttered by a speaker to extract a predetermined feature amount. Based on the feature amount extracted by the feature amount extraction unit, the vowel section specification unit specifies a section in which a vowel is uttered. The estimation unit estimates the emotion of the speaker based on the feature amount within the vowel section specified by the vowel section specification unit.
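The abstract does not say which feature amount is extracted, how vowel sections are detected, or how emotions are classified. The following Python sketch only illustrates how the three units could be chained, and every choice in it is an assumption. Per-frame energy and zero-crossing rate stand in for the feature amount. A crude heuristic treats runs of high-energy, low-zero-crossing frames as vowel sections. A toy nearest-centroid rule on mean vowel energy serves as the estimator. All function names, thresholds, centroid values, and emotion labels are hypothetical and are not taken from the source.

```python
import math
from dataclasses import dataclass


@dataclass
class FrameFeatures:
    energy: float  # mean squared amplitude of the frame
    zcr: float     # fraction of adjacent sample pairs whose sign differs


def extract_features(samples, frame_len=400, hop=160):
    """Feature amount extraction unit (hypothetical sketch).

    Computes per-frame energy and zero-crossing rate. frame_len=400 and
    hop=160 correspond to 25 ms / 10 ms at an assumed 16 kHz sampling rate.
    """
    feats = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
        feats.append(FrameFeatures(energy, crossings / (frame_len - 1)))
    return feats


def find_vowel_sections(feats, energy_thr=0.01, zcr_thr=0.1, min_frames=3):
    """Vowel section specification unit (hypothetical sketch).

    Assumed heuristic: vowels are voiced and periodic, so their frames tend to
    have high energy and a low zero-crossing rate. Returns [start, end)
    frame-index pairs for runs of at least min_frames qualifying frames.
    """
    sections, run_start = [], None
    for i, f in enumerate(feats):
        vowel_like = f.energy >= energy_thr and f.zcr <= zcr_thr
        if vowel_like and run_start is None:
            run_start = i
        elif not vowel_like and run_start is not None:
            if i - run_start >= min_frames:
                sections.append((run_start, i))
            run_start = None
    if run_start is not None and len(feats) - run_start >= min_frames:
        sections.append((run_start, len(feats)))
    return sections


# Hypothetical reference points (mean vowel energy in dB, i.e. 10*log10 of the
# mean squared amplitude) for a toy nearest-centroid classifier. A real system
# would learn its model from labeled data.
EMOTION_CENTROIDS_DB = {"calm": -30.0, "neutral": -20.0, "excited": -10.0}


def estimate_emotion(feats, sections, centroids=EMOTION_CENTROIDS_DB):
    """Estimation unit (hypothetical sketch): uses only frames inside vowel sections."""
    vowel_energies = [feats[i].energy for start, end in sections for i in range(start, end)]
    if not vowel_energies:
        return None  # no vowel section found, so no estimate is made
    mean_energy = sum(vowel_energies) / len(vowel_energies)
    mean_db = 10.0 * math.log10(max(mean_energy, 1e-12))  # guard against log(0)
    return min(centroids, key=lambda label: abs(centroids[label] - mean_db))
```

For example, consider a 0.5 s, 200 Hz tone at 16 kHz with silence on either side, followed by an alternating-sign burst that stands in for a fricative. The tone yields a single vowel section. Because estimation is restricted to that section, the high-energy but high-zero-crossing burst does not affect the result.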