CS405
HW #4
Due Friday, December 11, midnight
40 points total
1) (20 pts). You would like to write an email add-on so that when the user clicks on an email message, a window containing a list of related email messages is automatically displayed. This gives the user context for the current message and frees the user from having to search through folders for related messages.
To implement the engine for this idea, use the vector space document model to represent each email message. Each term in the email should be weighted using a tf-idf value. You can then compare a selected email to every other email using one of the similarity metrics for weighted terms (take your pick among inner product, Dice, cosine, or Jaccard; inner product is definitely the easiest).
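As a rough illustration of the weighting and comparison described above, here is a minimal Python sketch. It assumes each message has already been split into a list of terms, and it assumes one common tf-idf variant (raw term count times log(N / df)); other variants are equally acceptable. The function names are made up for this example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of term lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: raw term frequency times log(N / df).
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def inner_product(a, b):
    """Sum of weight products over the terms the two vectors share."""
    return sum(w * b[t] for t, w in a.items() if t in b)

def cosine(a, b):
    """Inner product normalized by the two vector lengths."""
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return inner_product(a, b) / (norm_a * norm_b)
```

To rank related messages, compute the chosen similarity between the selected message's vector and every other vector, then sort in descending order. Note that with this weighting a term that appears in every message gets weight 0.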
Here are the steps your program should take:
Here are the 6 sample messages to use. 1-3 are from the Green and Gold and 4-6 are about academic conferences.
1.txt
2.txt
3.txt
4.txt
5.txt
6.txt
Here is the list of stop words: stoplist.txt
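One possible way to turn a message into terms while dropping stop words is sketched below. It assumes the stop list is whitespace-separated (for example, one word per line) and that lowercasing plus a simple regular expression is an adequate tokenizer; both are assumptions, and the helper names are hypothetical.

```python
import re

def load_stop_words(path):
    """Read a stop list, assuming the words are separated by whitespace."""
    with open(path) as f:
        return set(f.read().lower().split())

def tokenize(text, stop_words):
    """Lowercase the text, pull out word-like tokens, and drop stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in stop_words]
```

For example, `tokenize(open("1.txt").read(), load_stop_words("stoplist.txt"))` would give the term list for the first sample message.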
This program will be easiest to implement in a language that has built-in or easy support for hash tables and dynamic data structures.
2) (20 pts) Write a program in any language to implement your choice of one of the following and briefly discuss the results. If you want extra credit (up to 20 more points) you could do both problems.
a) Genetic algorithm that learns the Boolean even-parity problem. In this problem you are given a Boolean value composed of k bits. If there is an even number of ones, then the parity is 1. If there is an odd number of ones, then the parity is 0. Your program should learn the parity given the bit pattern.
Use k = 8 and all 256 cases for training. It is up to you to design the genome representation, the crossover and mutation operations, the population size, and the fitness metric. Most likely you will want to use a fitness metric such as % correct on the training data, or perhaps the Laplace heuristic.
Your program should come close to learning correctly for all training data (and ideally would learn all of them).
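To make the pieces concrete, here is a small Python sketch of the target function, a % correct fitness metric, and basic crossover and mutation operators. The genome representation is only an assumption for illustration: a list of 256 bits where entry i is the predicted parity of the i-th training case. That representation memorizes the training data rather than generalizing, so you may well prefer a different design; the rates and names are also hypothetical.

```python
import random
from itertools import product

K = 8
CASES = list(product((0, 1), repeat=K))  # all 256 bit patterns

def parity(bits):
    """Target from the problem statement: 1 for an even number of ones, else 0."""
    return 1 if sum(bits) % 2 == 0 else 0

def fitness(genome):
    """Fraction of training cases predicted correctly.
    Assumed representation: genome[i] is the predicted parity of CASES[i]."""
    correct = sum(g == parity(c) for g, c in zip(genome, CASES))
    return correct / len(CASES)

def crossover(a, b, rng=random):
    """One-point crossover: head of parent a joined to tail of parent b."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]

def mutate(genome, rate=0.01, rng=random):
    """Flip each bit independently with probability `rate`."""
    return [1 - g if rng.random() < rate else g for g in genome]
```

A full run would repeatedly select fit parents, apply `crossover` and `mutate`, and stop once `fitness` reaches 1.0 or a generation limit is hit.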
b) Kohonen network to classify characters. Use the following input data as training: train.txt
This file contains an LED-style representation of the 10 digits 0-9. The first line in the file is the number of digits, followed by the actual digit, and then a 4x5 bitmap representation of the digit. Your program should have twenty input nodes, where each node corresponds to one of the bits in a digit.
Use a 2D Kohonen map that is 10x10, with weights randomly initialized to numbers between 0 and 1. The program should make 1000 iterations through the training data. For each iteration, find the winning node with weights closest to the input pattern, adjust its weights to be closer to the input, and then also adjust the weights in a neighborhood of approximately 2 cells around the winner. You don't need to implement the sombrero or Gaussian function; an approximate radius of 2 cells is fine.
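A single training step for the map might look like the following Python sketch. It assumes squared Euclidean distance for finding the winner, a square neighborhood of radius 2 cells, and a fixed learning rate of 0.1 applied equally to every node in the neighborhood; all three are illustrative choices, not requirements.

```python
import random

ROWS, COLS, INPUTS = 10, 10, 20

def init_map(rng=random):
    """10x10 grid of weight vectors with random values in [0, 1)."""
    return [[[rng.random() for _ in range(INPUTS)] for _ in range(COLS)]
            for _ in range(ROWS)]

def sq_dist(w, x):
    """Squared Euclidean distance between a weight vector and an input."""
    return sum((wi - xi) ** 2 for wi, xi in zip(w, x))

def find_winner(weights, x):
    """Grid position (row, col) of the node whose weights are closest to x."""
    cells = ((r, c) for r in range(ROWS) for c in range(COLS))
    return min(cells, key=lambda rc: sq_dist(weights[rc[0]][rc[1]], x))

def train_step(weights, x, rate=0.1, radius=2):
    """Move the winner and its neighbors within `radius` cells toward x."""
    wr, wc = find_winner(weights, x)
    for r in range(max(0, wr - radius), min(ROWS, wr + radius + 1)):
        for c in range(max(0, wc - radius), min(COLS, wc + radius + 1)):
            w = weights[r][c]
            for i in range(INPUTS):
                w[i] += rate * (x[i] - w[i])
    return wr, wc
```

Training would call `train_step` once per digit pattern, repeated for 1000 passes over the training data. Shrinking the rate or the radius over time is a common refinement but is not needed here.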
After the network has been trained, label the winning nodes with the category (i.e., the digit) for each of the training patterns. Then allow the user to input a pattern from the keyboard and output the predicted class. It should predict the class by finding the labelled node that is closest to the winning node for the input pattern (since the actual winning node might not be labelled at all when at most 10 of the 100 nodes are labelled).
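The lookup of the closest labelled node can be sketched as below. This assumes "closest" means distance between grid positions on the map, and that the labels are stored in a dictionary keyed by (row, column); the function name is hypothetical.

```python
def nearest_label(winner, labels):
    """Return the digit of the labelled node nearest to the winner on the grid.

    winner: (row, col) of the winning node for the user's input pattern.
    labels: {(row, col): digit} built from the training patterns.
    """
    wr, wc = winner
    cell = min(labels, key=lambda rc: (rc[0] - wr) ** 2 + (rc[1] - wc) ** 2)
    return labels[cell]
```

If the winner itself is labelled, its own digit is returned because its distance is zero. Ties between equally distant labelled nodes are broken by dictionary order in this sketch.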