The futility of the “Netflix Prize”

“If you liked this book, then you will like these other books.” If you are an Amazon user, you probably could not avoid this presumably tailored recommendation.

Amazon and other, mostly e-commerce, sites allow users to grade the products they sell, usually on a scale from 1 to 5. From there, they try to forecast what you are most likely to purchase next, based on your and other people’s purchasing patterns.
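To make the idea concrete, a bare-bones collaborative-filtering sketch might predict your rating of an unseen item from the ratings of users whose past ratings resemble yours. Everything below, from the sample data to the similarity formula, is invented for illustration and is not Amazon’s actual system:

# A minimal, hypothetical user-based collaborative-filtering sketch (illustrative data only).
ratings = {
    "alice": {"book_a": 5, "book_b": 3, "book_c": 4},
    "bob":   {"book_a": 4, "book_b": 3, "book_c": 5, "book_d": 4},
    "carol": {"book_a": 1, "book_b": 5, "book_d": 2},
}

def similarity(u, v):
    # Crude similarity: the smaller the average rating gap on shared items,
    # the more alike two users are taken to be.
    shared = set(ratings[u]) & set(ratings[v])
    if not shared:
        return 0.0
    gap = sum(abs(ratings[u][i] - ratings[v][i]) for i in shared) / len(shared)
    return 1.0 / (1.0 + gap)

def predict(user, item):
    # Weighted average of other users' ratings of the item,
    # weighted by how similar each of them is to `user`.
    votes = [(similarity(user, other), their_ratings[item])
             for other, their_ratings in ratings.items()
             if other != user and item in their_ratings]
    total = sum(weight for weight, _ in votes)
    return sum(weight * score for weight, score in votes) / total if total else None

print(predict("alice", "book_d"))  # a guess at what alice would give book_d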

The next step in the search for the ultimate oracle is to find an algorithm that will forecast, as accurately as possible, your likes and dislikes.

Netflix put up a prize of $1,000,000 for anyone who can devise such an algorithm. In particular, the aim is to improve on the current “CinematchSM” software, whose “job is to predict whether someone will enjoy a movie based on how much they liked or disliked other movies. We [Netflix] use those predictions to make personal movie recommendations based on each customer’s unique tastes. And while Cinematch is doing pretty well, it can always be made better.”

This particular endeavour is most likely a futile pursuit.

There are at least four fundamental reasons why this is so.

1. Relative grading as opposed to absolute.

Look at your own grading behaviour. Let us say that you think a film deserves 4 stars. You then realise that it scores an average of 3 stars. Therefore, you will not give it 4 stars but 5, in order to skew the average upwards.

Are you influenced by the existing rating of 3.5?
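A quick back-of-the-envelope sketch shows why that strategic vote matters. The figures below are invented for illustration: a film with nine existing ratings averaging 3 stars, and a viewer who privately thinks it deserves 4.

existing_ratings = [3] * 9  # nine hypothetical ratings averaging 3.0 stars

honest = (sum(existing_ratings) + 4) / 10     # vote what you believe: the average becomes 3.1
strategic = (sum(existing_ratings) + 5) / 10  # vote 5 to drag the average up: it becomes 3.2

print(honest, strategic)

Either way, the published average ends up telling the forecaster less about what the voter actually believes than about the average the voter was reacting to.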

2. Different people grade different features.

Consider a 007 film; let us say that you give it a 3 out of 5 stars. But what exactly is the “3” reflecting? It all depends on who you are.

If you have been a fan since the 1960s, your grade might reflect a relative judgement with respect to all 007 films: you rank them in your mind from 1 to 5 (presumably with the Sean Connery ones at the top).

If, on the other hand, you are interested solely in the creative direction of the film, you might consider this 007 with respect to all the other films you have seen by the same director.

There are countless features people can consider, and their importance differs from one person to the next. Worse for forecasters, the weight of a given characteristic can shift even within the same person. For example, acting might be paramount in a film such as “The Remains of the Day” while mattering less in “The Empire Strikes Back”.

3. One’s mood and feelings.

Another problem with trying to forecast grades is that they are subject to the individual’s mood at the time of grading.

Take the film “Under the Tuscan Sun”, for instance. A viewer might feel the film was worth a 5 right after watching it. However, this person does not immediately log the rating. Then this person has an argument with their significant other. By the time the film gets rated, it may be a 1.

4. Opinion makers.

Other important events happen between watching the film and casting a rating. For instance, a chance conversation with your older brother, who is an amateur film critic and whose opinion you respect, might turn what started as a 3 into a 5 or a 1.

Other important factors might include the self-selected nature of the people who rate the film. Was it a good film that was rated only by those who happened to care about rating it at that particular moment? How many potential ratings were never given, and might they have had a significant impact?

In a Wired Magazine article, we read that “The benchmark Netflix uses for the contest is called root mean square error, or RMSE. Essentially, this measures the typical amount by which a prediction misses the actual score. When the competition began, Cinematch had an RMSE of 0.9525, which means that its predictions are typically off by about one point from users’ actual ratings. That’s not very impressive on a five-point scale: Cinematch might think you’re likely to rate a movie a 4, but you might rank it a 3 or a 5. To win the million, a team will have to make predictions accurate enough to lower that RMSE to 0.8572.”
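For readers who want the quoted benchmark spelled out, here is a minimal sketch of how an RMSE of that sort could be computed; the function name and the sample numbers are assumptions for illustration, not Netflix’s actual evaluation code.

import math

def rmse(predicted, actual):
    # Root mean square error: the typical size of the gap between prediction and reality.
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

predicted = [4.2, 3.1, 4.8, 2.5, 3.9]  # what a recommender guessed (illustrative)
actual = [4, 3, 5, 1, 4]               # what the users actually rated (illustrative)

print(round(rmse(predicted, actual), 4))

Lowering that number from 0.9525 to 0.8572 amounts to shaving roughly a tenth of a star off the typical miss, across every user and every film at once.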

There are simply too many relative, time-dependent, idiosyncratic variables for this effort to become much more accurate than it already is. In other words, the software would have to know how one feels, what scale a person chooses for this particular rating, the person’s emotional state of mind, and what conversations took place regarding the film. Perhaps this is too tall a challenge in the existing state of technology.
