20. Frequency Analysis
Find the most frequent words in a text document.
In this assignment, we will find the most frequent words in Jane Austen’s Sense and Sensibility.
Your goal is to print the top 100 most frequent words and their frequency counts. Your method should be fast and flexible so that you can handle longer documents and print more or fewer than the top 100 words.
- Read the information in the file using `File` and `Scanner`.
- Clean up the information by removing non-word characters. Make everything lowercase for consistency. This step is surprisingly tricky to get perfect. I recommend `replaceAll` using the regular expression `"\\W+"` to match at least one non-word character. See a handy cheat sheet (pdf) for more information.
- Create a `Map` to count each word and its frequency.
- Create your own custom `WordCount` class that will hold one word and its frequency together.
  - Write a constructor.
  - Write a `toString` method.
  - Write a `compareTo` method and implement the `Comparable` interface. This should be written so that the highest frequency comes first.
- Make a `TreeSet` with the `WordCount` class using the data from the file.
- Print the top 100 most frequent words in your `TreeSet`.
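The steps above could be sketched roughly as follows. This is one possible outline, not the required solution. The class name `FrequencyAnalysis`, the helper methods `countWords`, `rank`, and `printTop`, the `word: count` output format, and the file name `sense_and_sensibility.txt` are all assumptions made for illustration. The sketch also assumes that non-word characters are simply deleted from each whitespace-separated token, so `don't` becomes `dont`; you may prefer a different cleanup rule.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;
import java.util.TreeSet;

class FrequencyAnalysis {

    /** Holds one word together with its frequency. */
    static class WordCount implements Comparable<WordCount> {
        final String word;
        final int count;

        WordCount(String word, int count) {
            this.word = word;
            this.count = count;
        }

        @Override
        public String toString() {
            // Output format is an assumption; the assignment does not fix one.
            return word + ": " + count;
        }

        @Override
        public int compareTo(WordCount other) {
            // Reverse the usual order so that the highest frequency comes first.
            int byCount = Integer.compare(other.count, this.count);
            // Break ties alphabetically. Without this, a TreeSet would treat
            // two words with the same count as duplicates and keep only one.
            return byCount != 0 ? byCount : this.word.compareTo(other.word);
        }
    }

    /** Counts cleaned, lowercased words read from the scanner. */
    static Map<String, Integer> countWords(Scanner in) {
        Map<String, Integer> counts = new HashMap<>();
        while (in.hasNext()) {
            // Assumption: delete non-word characters inside each token.
            String word = in.next().replaceAll("\\W+", "").toLowerCase();
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Copies the map entries into a TreeSet ordered by compareTo. */
    static TreeSet<WordCount> rank(Map<String, Integer> counts) {
        TreeSet<WordCount> ranked = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            ranked.add(new WordCount(entry.getKey(), entry.getValue()));
        }
        return ranked;
    }

    /** Prints at most n entries, most frequent first. */
    static void printTop(TreeSet<WordCount> ranked, int n) {
        int printed = 0;
        for (WordCount wc : ranked) {
            if (printed++ >= n) {
                break;
            }
            System.out.println(wc);
        }
    }

    public static void main(String[] args) throws FileNotFoundException {
        // Hypothetical default file name; pass your own path as the first argument.
        String path = args.length > 0 ? args[0] : "sense_and_sensibility.txt";
        int n = args.length > 1 ? Integer.parseInt(args[1]) : 100;
        try (Scanner in = new Scanner(new File(path))) {
            printTop(rank(countWords(in)), n);
        }
    }
}
```

Taking the number of words as a parameter rather than hard-coding 100 keeps the method flexible. Note that `\W` treats apostrophes and hyphens as non-word characters but keeps digits and underscores, which is part of why the cleanup step is hard to get perfect.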