CS 470: Applied Software Development Project
2.0 Data Source
4.0 Experimental Results
6.0 Future Work
Collaborative filtering has become a popular method for delivering recommendations to individuals on a wide range of items, most typically books, movies, music and news articles. The basic idea behind collaborative filtering is to automate word-of-mouth. We all rely on friends and family to recommend books, movies and music to us; collaborative filtering aims to expand each person’s small network of friends to the greater realm of the Internet. Collaborative filtering works by finding individuals with similar tastes, and making recommendations based on their likes and dislikes. This relies on the idea that if two individuals have similar tastes on a number of items, they are likely to have similar tastes on other items as well. However, there are a number of problems with typical collaborative filtering methods, which can lead to poor recommendations in many situations. This paper presents a method of combining typical collaborative filtering techniques with content-based analysis of the items in order to provide accurate recommendations for a wide range of situations.
Collaborative filtering has developed as a means to create accurate recommender systems. Recommender systems allow individuals to receive recommendations on a wide range of items, including books, movies, music and news articles. Ratings data is collected from a large group of users for a given set of items. These ratings can be collected either explicitly, by asking a user to rate an item they are familiar with (on a 1-5 scale, for example), or implicitly, by inferring a user’s likes and dislikes from their behavior. Many online stores, such as Amazon and Barnes & Noble, collect ratings implicitly; when a user buys an item, it is assumed that they like that item. By comparing a user’s likes and dislikes to those of other users, the system forms a group of users with similar tastes, called a neighborhood. The predicted rating of an item for a particular user is then the (possibly weighted) average of the ratings given to that item by the user’s neighborhood.
In the past, collaborative filtering systems have been able to produce accurate recommendations for a wide range of items. However, several problems cannot be overcome by a typical collaborative filtering system:
|50 First Dates
Table 1: User profiles of movies rated
Table 1 shows the profiles of two users, with the movies that each has rated on a 1-5 scale. Since the two users have not rated any of the same items, a typical collaborative filtering system will not classify them as neighbors. However, it is clear that both users enjoy Adam Sandler movies. Collaborative filtering alone cannot discover this similarity between the two users, and instead views them as having no similarity.
A content-based approach to collaborative filtering is able to overcome these problems. By analyzing the content information of an item, it is possible to deliver accurate recommendations for items with few ratings, and for users that have only rated a few items. Additionally, it is possible to compare users that have no common ratings. This paper will discuss a method of combining classic collaborative filtering techniques with content analysis of the items in order to deliver more robust and accurate recommendations to users. Finally, this paper will discuss several strategies for improving the performance of the proposed methods.
The effectiveness of the proposed method is demonstrated by a movie recommendation system. The rating data used for this system has been obtained from the EachMovie project [1]. Content information for each movie is obtained from the Internet Movie Database (IMDb) [2]. The following information for each movie is collected into a local database: actors, director, plot description and genre.
The EachMovie project was conducted by the
For the purposes of this study, the ratings are converted to either a positive or a negative rating. A rating between three and five is considered positive, while a rating of one or two is considered negative. When the predicted ratings are generated, they are calculated simply as a positive or a negative rating.
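The conversion described above can be sketched as follows. This is a minimal illustration rather than the study's actual code; the function name `binarize`, the dictionary layout and the sample ratings are assumptions made for the example.

```python
def binarize(rating, threshold=3):
    """Map a 1-5 rating to True (positive) or False (negative).

    Ratings of 3, 4 or 5 are positive; ratings of 1 or 2 are negative.
    """
    return rating >= threshold


# Hypothetical ratings for one user, keyed by movie identifier.
ratings = {"user_a": {"movie_1": 5, "movie_2": 2, "movie_3": 3}}

binary = {
    user: {movie: binarize(r) for movie, r in movies.items()}
    for user, movies in ratings.items()
}
```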
Unlike other systems that use only a subset of this data, the entire dataset is used for testing the effectiveness of this system. The ratings data is randomly split into a training set and a test set; the training set contains approximately 75 percent of the ratings, while the test set contains the remaining 25 percent.
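One way to produce such a split is sketched below. The study does not describe its exact procedure, so the shuffle-and-cut approach, the function name `split_ratings` and the fixed seed are assumptions chosen for illustration.

```python
import random


def split_ratings(ratings, train_fraction=0.75, seed=0):
    """Shuffle a list of rating records and cut it into training and test sets."""
    rng = random.Random(seed)      # seeded so the split is reproducible
    shuffled = list(ratings)       # copy; the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical records of the form (user, movie, is_positive).
records = [(u, m, (u + m) % 2 == 0) for u in range(10) for m in range(10)]
train, test = split_ratings(records)
```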
The content information for each movie is collected from the Internet Movie Database (IMDb). Information for each movie, such as actors, director, plot summary and genre, is gathered and entered into a local database. This data is provided in tab-delimited text files, which are parsed to collect the information for each movie in the EachMovie dataset.
Before a movie’s content information is added to the local database, the frequency of each word is weighted based on the total number of terms in the bag-of-words.
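A minimal sketch of this weighting is shown below, assuming that "weighted" means dividing each word's count by the total number of terms in the movie's bag-of-words. The whitespace tokenizer and the function name `weighted_bag_of_words` are simplifications introduced for the example.

```python
from collections import Counter


def weighted_bag_of_words(text):
    """Return {term: count / total_terms} for one movie's content text."""
    terms = text.lower().split()   # naive tokenizer, for illustration only
    total = len(terms)
    if total == 0:
        return {}
    counts = Counter(terms)
    return {term: count / total for term, count in counts.items()}


weights = weighted_bag_of_words("comedy romance comedy drama")
```

With this weighting, a long plot description does not dominate a short one simply because it contains more words.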
Each user is assigned two feature lists, one for the movies they have rated positively, and another for the movies they have rated negatively. For each movie that a user has rated positively, the feature list of that movie is added to the user’s positive feature list. Similarly, the feature list for each movie a user has rated negatively is added to the user’s negative feature list.
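The construction of the two feature lists might look like the sketch below. It assumes that "adding" a movie's feature list means summing the term weights; the function name `build_user_profiles` and the sample data are hypothetical.

```python
from collections import Counter


def build_user_profiles(user_ratings, movie_features):
    """Build (positive, negative) feature lists for one user.

    user_ratings:   {movie: True for a positive rating, False for a negative one}
    movie_features: {movie: {term: weight}}
    """
    positive, negative = Counter(), Counter()
    for movie, liked in user_ratings.items():
        target = positive if liked else negative
        for term, weight in movie_features.get(movie, {}).items():
            target[term] += weight   # assumption: weights are summed
    return dict(positive), dict(negative)


movie_features = {
    "m1": {"sandler": 0.5, "comedy": 0.5},
    "m2": {"sandler": 0.25, "golf": 0.75},
    "m3": {"war": 1.0},
}
user_ratings = {"m1": True, "m2": True, "m3": False}
positive, negative = build_user_profiles(user_ratings, movie_features)
```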
It is possible to overcome many of the limitations of collaborative filtering methods by combining them with content-based analysis. However, overall system performance remains problematic. Three methods are attempted by this study with the aim of addressing this problem. The first two attempt to improve performance by reducing the search space. The third circumvents collaborative filtering methods altogether by making purely content-based predictions.
It is possible to reduce the search space significantly by pre-processing the ratings data in order to cluster users into smaller, more manageable groups. When predictions are to be made, the system need only analyze a user’s cluster group instead of the entire data set. Clustering of users (and/or items) has been implemented by several collaborative filtering systems [5,6] as a method of improving performance.
For this study, a relatively simple clustering algorithm was devised:
It is important to note that with this content-based approach, the terms found in a user’s feature list (or the feature list of a cluster) are what is being used to compute the Pearson Correlation Coefficient. The feature list of a cluster is comprehensive of the feature lists of all its users.
Obviously, this is not the most efficient solution possible; the worst-case runtime is O(n^2). However, it is not necessary for this algorithm to be run every time a new rating is added; small changes in the rating data should not affect the structure of the clusters greatly. Instead, the algorithm could be run on a nightly basis to reorganize the clusters. Furthermore, the focus of this study is not the development of an efficient clustering algorithm. Thus, the described algorithm is sufficient for the implementation of this system.
Once the cluster groups are formed, the system must only search a user’s cluster group in order to make predictions for that user. From a user’s cluster group, their 10-nearest neighbors (called their neighborhood) are calculated, again using the Pearson Correlation Coefficient. The predicted rating for a particular item is the average rating of that item from the user’s neighborhood.
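The neighborhood step can be sketched as below. The example takes the correlations as already computed and assumes that, with positive/negative ratings, an average of at least one half counts as a positive prediction; that threshold, the function name `predict_from_neighbors` and the sample data are not specified by the study.

```python
def predict_from_neighbors(similarities, neighbor_ratings, item, k=10):
    """Predict a positive/negative rating from the k most similar users.

    similarities:     {user: correlation with the target user}
    neighbor_ratings: {user: {item: True or False}}
    Returns True, False, or None when no neighbor has rated the item.
    """
    neighborhood = sorted(similarities, key=similarities.get, reverse=True)[:k]
    votes = [
        neighbor_ratings[user][item]
        for user in neighborhood
        if item in neighbor_ratings.get(user, {})
    ]
    if not votes:
        return None
    # Average of the 0/1 ratings; 0.5 or above is treated as positive (assumption).
    return sum(votes) / len(votes) >= 0.5


similarities = {"u1": 0.9, "u2": 0.8, "u3": 0.1, "u4": -0.5}
neighbor_ratings = {
    "u1": {"m": True},
    "u2": {"m": True},
    "u3": {"m": False},
    "u4": {"m": False},
}
prediction = predict_from_neighbors(similarities, neighbor_ratings, "m", k=3)
```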
A simpler method for reducing the search space is to select a random group of users that will serve as a (hopefully) representative sample of the entire set of users. For this study, 5000 users are randomly selected to serve as the sample group. A new random group is generated to make predictions for each user.
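Drawing the sample group is straightforward, as the sketch below suggests. Excluding the target user from their own sample is an assumption, as are the function name `sample_users` and the optional seed.

```python
import random


def sample_users(all_users, target_user, sample_size=5000, seed=None):
    """Return a fresh random group of users to search for neighbors."""
    rng = random.Random(seed)
    candidates = [u for u in all_users if u != target_user]   # assumption
    return rng.sample(candidates, min(sample_size, len(candidates)))


# Hypothetical population of 20,000 user identifiers.
group = sample_users(range(20000), target_user=7, seed=1)
```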
This method is much simpler to implement than the clustering of users. The questions that remain are how representative a sample of 5000 random users will be, and how accurate the predictions produced from that group can be.
Once a random group of users is selected, the method of prediction is identical to that used with the cluster groups. From a user’s random group, their 10-nearest neighbors are calculated, and the predicted rating for a particular item is the average rating of that item from the user’s neighborhood.
The final method of prediction is the simplest of the three. By completely circumventing the methods of collaborative filtering, predicted ratings are made by simply analyzing the content information of the items a user has rated. Just as feature lists were previously compared between users, it is also possible to compare the feature list of a user directly to the feature list of an item.
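A purely content-based prediction could be sketched as follows. The study does not state the decision rule, so this example assumes an item is predicted positive when it is at least as similar to the user's positive feature list as to the negative one. The `dot_similarity` function is only an illustrative stand-in; any measure for comparing two feature lists, such as the Pearson Correlation Coefficient, could be passed in its place.

```python
def dot_similarity(features_a, features_b):
    """Illustrative stand-in: sum of weight products over shared terms."""
    return sum(w * features_b[t] for t, w in features_a.items() if t in features_b)


def predict_from_content(positive, negative, item_features, similarity=dot_similarity):
    """Predict True (positive) or False (negative) from content alone."""
    like_score = similarity(positive, item_features)
    dislike_score = similarity(negative, item_features)
    return like_score >= dislike_score   # decision rule is an assumption


positive = {"sandler": 0.75, "comedy": 0.5}
negative = {"war": 1.0}
likes_golf_comedy = predict_from_content(positive, negative, {"sandler": 0.5, "golf": 0.5})
likes_war_drama = predict_from_content(positive, negative, {"war": 0.5, "drama": 0.5})
```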
Once the predicted ratings were made by each of the three described methods, those predictions were analyzed to assess the quality of the prediction method.
Accuracy = (TP + TN) / (TN + TP + FN + FP)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
TP: number of true positive predictions
TN: number of true negative predictions
FP: number of false positive predictions
FN: number of false negative predictions
Figure 1: Calculations for accuracy, precision and recall
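The calculations in Figure 1 can be expressed directly in code. The sketch below assumes predictions and actual ratings are given as parallel lists of positive/negative values, and it returns 0.0 for a measure whose denominator is zero; the function name `evaluate` is chosen for the example.

```python
def evaluate(predicted, actual):
    """Return (accuracy, precision, recall) for parallel lists of booleans."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)           # true positives
    tn = sum(1 for p, a in pairs if not p and not a)   # true negatives
    fp = sum(1 for p, a in pairs if p and not a)       # false positives
    fn = sum(1 for p, a in pairs if not p and a)       # false negatives
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


predicted = [True, True, False, False, True]
actual = [True, False, False, True, True]
accuracy, precision, recall = evaluate(predicted, actual)
```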
Figure 2: Comparison of algorithm results
As expected, the method of clustering users produced the best predictions. Of the three, this method produced the highest levels of accuracy, precision and recall. However, a large amount of pre-processing of the ratings data is required to generate the clusters of users. For this reason, this method may not be appropriate for all applications.
While simpler to implement, the other two methods each produced levels of accuracy, precision and recall high enough to make them viable choices. Very little pre-processing is required for either of these methods, and their run-time requirements are much more in line with a real-time system. No matter which of the three methods is most appropriate for a specific application, each of them addresses the problem of system performance.
This content-based approach to collaborative filtering successfully overcomes the difficulties outlined in section 1.0. A user who has rated only a few items is able to receive recommendations based on their feature list; a user need only rate one or two items in order to have a usable feature list. Additionally, it is now possible to give recommendations for an item that has only a few ratings. Instead of making predictions based on the ratings that an item has already received, the prediction is made by comparing the feature lists of the user directly to the feature list of the item. Finally, it is possible to compare users that have no common ratings, allowing the system to discover meaningful relationships between users that would not otherwise have been known. This also allows all users to be considered as potential neighbors, thus increasing the chances of finding similar users.
There is still a great deal of work that can be done to improve upon this content-based approach to collaborative filtering. As mentioned in section 3.1, the efficiency of the clustering algorithm used by this study could be greatly improved. There are a number of efficient clustering algorithms that could be used instead, such as k-means or hierarchical clustering. Additionally, a more complex hybrid of collaborative filtering and content-based analysis could be developed to improve the predictions. A method proposed by Melville et al. [3] produces pseudo ratings by purely content-based means to create a dense ratings matrix. It then uses the combination of actual ratings and pseudo ratings to compare users. This approach eliminates the need for users to have co-rated items to be considered neighbors.
There exist many appropriate applications of recommender systems; this study has demonstrated that a content-based approach to collaborative filtering is a superior method for delivering such recommendations.
[1] EachMovie Project. http://research.compaq.com/src/eachmovie
[2] Internet Movie Database (IMDb). http://www.imdb.com
[3] P. Melville, R.J. Mooney and R. Nagarajan. “Content-boosted collaborative filtering.” In Proceedings of the SIGIR Workshop on Recommender Systems, 2001.
[4] M.J. Pazzani. “A framework for collaborative, content-based and demographic filtering.” Artificial Intelligence Review, 13(5-6):393-408, 1999.
[5] M. O’Connor and J. Herlocker. “Clustering items for collaborative filtering.” Presented at the Recommender Systems Workshop at Conference on Research and Development in Information Retrieval, 1999.
[6] L.H. Ungar and D.P. Foster. “Clustering methods for collaborative filtering.” Retrieved
[7] “Computing Pearson's correlation coefficient.” http://davidmlane.com/hyperstat/A51911.html
[8] M. Claypool, A. Gokhale, T. Miranda, P. Murnikov, D. Netes and M. Sartin. “Combining content-based and collaborative filters in an online newspaper.” Presented at the ACM SIGIR Workshop on
[9] MovieLens Project. http://movielens.umn.edu