Friday, June 4, 2010

Netflix prize paper and iTunes Genius

http://www2.research.att.com/~volinsky/netflix/
http://www.technologyreview.com/blog/guest/25267/

The Netflix recommendation engine is an interesting problem. I ran across a mention of it in an article on iTunes Genius.

The iTunes Genius system is simply leveraging a massive data set to appear clever. I respect that it has taken a lot of work to get it working, but the essential strategy is not particularly special. It's the size of the data set that makes it viable. It's the same as any system that has a huge "memory" and can effectively leverage it to improve its performance.

The Netflix problem is similar, but it's more of an optimization problem. They are still doing the same thing as any recommendation engine: trying to match a product with a consumer.

It would be interesting to compare the properties of the product, the properties the consumer thought they were looking for, and the properties of products the consumer has already consumed along with their ratings of those products.

There is a classification problem underneath all of this as well: how subjective or objective are the properties being discussed?

There is another difference: the magnitude of the experience. A music track (the iTunes problem) is a couple of minutes of your life, while a movie may be a couple of hours. If you don't like a song, it's a fairly small cost to discard it, or not even bother discarding it. But a movie you don't like has a large cost; you will probably avoid it completely in the future, so it generates a much stronger response.

The experience is also different. Over the course of a two-hour movie, the watcher may go through a range of experiences (especially with a good narrative arc), so they may try to report a much more varied response when asked whether they liked the movie. If you look at film review forums, there are a lot of aspects that get discussed, while music tracks are much quicker and get a much simpler discussion (like or not like). Anyway, these are just data points at the end of the day.

In summary, the iTunes problem is a simple recommendation engine with fairly simple data points and a large set of training data. The Netflix problem is twofold: the first part is getting a good recommendation engine, and the second is getting it to present a result in a reasonable time. The second part is just an optimization problem.

Recommendation engines have two input problems. The first is classifying the properties of the product being recommended. The second is getting useful data from the consumer about what they might like. It's then just a matter of finding all the possible matches and ranking them with some ranking scheme.
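
To make that concrete, here is a rough Python sketch of the matching and ranking step. The films, the property vectors and the idea of building a taste profile as a rating-weighted average are all my own invented illustration, not how Netflix or iTunes actually do it.

import numpy as np

products = {
    "film_a": np.array([0.9, 0.1, 0.3]),   # invented property scores, e.g. [action, romance, comedy]
    "film_b": np.array([0.2, 0.8, 0.4]),
    "film_c": np.array([0.8, 0.2, 0.6]),
}

history = {"film_a": 5, "film_b": 2}        # ratings the user has already given (1-5)

def taste_profile(history, products):
    # Taste vector = rating-weighted average of the films already rated.
    weights = np.array([history[f] for f in history], dtype=float)
    vectors = np.array([products[f] for f in history])
    return (weights @ vectors) / weights.sum()

def rank(products, profile, seen):
    # Score every unseen film by cosine similarity to the taste profile.
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = {f: cosine(v, profile) for f, v in products.items() if f not in seen}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

profile = taste_profile(history, products)
print(rank(products, profile, seen=history))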

Fair enough, this is a problem with real scale issues, but it can be simplified by splitting the search space in a couple of ways and doing some pre-computing.

The fact that people are so predictable means that you can probably pre-compute a great deal of this: build a set of "stereotype" user profiles and keep them up to date, then build an individual profile for each actual user as the nearest "stereotype" plus a customized set of deltas representing their divergence from it.

It would probably be easy enough at scale to build a hierarchy of stereotypes and move the actual user between more or less specialized stereotypes as their taste changes. Then it simply becomes a matter of searching through the stereotypes for the nearest match rather than comparing that actual user with each and every film in existence. All you would need to do is update the stereotypes as each new film is added to the database. Even with a few thousand stereotypes, it would still be cheap to keep it all up to date; a sort of intermediate processing strategy.
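
A rough sketch of the stereotype idea, with invented names and numbers: each stereotype is a taste vector with a pre-computed ranking, an actual user is stored as the nearest stereotype plus a delta, and recommending means searching the small stereotype set instead of every film.

import numpy as np

stereotypes = {
    "action_fan":  np.array([0.9, 0.1, 0.2]),
    "romance_fan": np.array([0.1, 0.9, 0.3]),
    "comedy_fan":  np.array([0.3, 0.2, 0.9]),
}

# Pre-computed per-stereotype rankings, updated as each new film arrives.
precomputed_rankings = {
    "action_fan":  ["film_a", "film_c", "film_b"],
    "romance_fan": ["film_b", "film_c", "film_a"],
    "comedy_fan":  ["film_c", "film_a", "film_b"],
}

def nearest_stereotype(user_vector):
    # Search the (small) set of stereotypes, not every film in existence.
    return min(stereotypes, key=lambda s: np.linalg.norm(stereotypes[s] - user_vector))

def store_user(user_vector):
    # An actual user = nearest stereotype + delta from it.
    s = nearest_stereotype(user_vector)
    return {"stereotype": s, "delta": user_vector - stereotypes[s]}

user = store_user(np.array([0.8, 0.2, 0.4]))
print(user["stereotype"], precomputed_rankings[user["stereotype"]])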

The number of stereotypes would probably be something like the number of combinations of product properties, minus the silly and unpopular ones. The list could probably be simplified even further by collapsing similar stereotypes among the less popular and increasingly specializing the popular ones. This could then be managed with an evolutionary strategy.

Once the problem starts to be described in terms of entities, it's possible to play all sorts of social and population games with them.

Thursday, June 3, 2010

Thought exercise on applying Neural Nets used to sort galaxy images

http://www.space.com/businesstechnology/computer-learns-galaxies-100601.html

An article on using a neural net to sort galaxies. It's a good application of known technology, but that's not the point I'm interested in. My interest is in how the tool is applied to "help" a human function more effectively.

Imagine the scenario, if you can: a human slaving away over a pile of galaxy images, sorting them into the relevant type piles. No problem, except for boredom and scaling. The human can sort them into all the type piles, plus a "weird" pile and maybe a "problem" pile for the ones they are unsure about. Later on they have another look at the weird and problem piles, maybe with some friends to help. Finally they get them all sorted and start again. Keep in mind that the flow of images never stops.

Now get a computer to do it. Easy enough, but semantically slightly different. Sort all the easy ones, put the "maybe" ones into another pile, the "problem" ones into another, and the "weird" ones into yet another. Pass the weird and problem ones to the humans and have them spend some quality time sorting them out.
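
Something like this little triage function is what I mean, assuming the classifier can give a probability for each galaxy type; the thresholds and type names are invented.

def triage(probabilities, confident=0.9, maybe=0.6, floor=0.4):
    # probabilities: {galaxy_type: probability} from the classifier (invented interface).
    best = max(probabilities, key=probabilities.get)
    p = probabilities[best]
    if p >= confident:
        return ("easy", best)
    if p >= maybe:
        return ("maybe", best)
    if p >= floor:
        return ("problem", best)      # the humans take a look
    return ("weird", None)            # nothing fits: the humans and their friends

print(triage({"spiral": 0.95, "elliptical": 0.03, "irregular": 0.02}))   # ('easy', 'spiral')
print(triage({"spiral": 0.45, "elliptical": 0.35, "irregular": 0.20}))   # ('problem', 'spiral')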

The beauty of a neural net is that you can now feed the weird and problem items back in (with their new classification applied by the human think tank) as training data and improve the performance of the net. This can happen every time the system finds weird or problem data. I remember reading someone's idea that exceptions in software are "an opportunity for more processing". If you think of the whole system (neural net + data + humans) as a single system, then each edge case becomes an opportunity to improve the system.
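
A toy version of that feedback loop, using synthetic feature vectors instead of real galaxy images and scikit-learn's MLPClassifier as a stand-in for whatever net they actually used; the thresholds and labels are made up.

import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

def synthetic_batch(n=200):
    # Fake 8-number "feature vectors" with a true class we can use as the human answer.
    X = rng.normal(size=(n, 8))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # 0 = elliptical, 1 = spiral, say
    return X, y

X_train, y_train = synthetic_batch()
model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0)
model.fit(X_train, y_train)

for round_no in range(3):                            # the image stream never really stops
    X_new, y_true = synthetic_batch()
    confidence = model.predict_proba(X_new).max(axis=1)
    hard = confidence < 0.9                          # the problem/weird piles
    # The "humans" classify the hard ones (here we cheat and use the true labels),
    # and those labels go straight back into the training set.
    X_train = np.vstack([X_train, X_new[hard]])
    y_train = np.concatenate([y_train, y_true[hard]])
    model.fit(X_train, y_train)
    print(f"round {round_no}: {hard.sum()} images sent to the humans")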

All in all it's a pretty boring job, classifying galaxies based on an image (I assume there is a lot more to it, so work with my line of thought rather than the actuality), but the one thing the job does have is a huge, rich data stream and a fairly straightforward classification problem.

So the question arises: could the computer do a job beyond the capacity of the human classifiers? The whole idea of applying a classification structure to a set of data points is to simplify it and apply a human-scale structure to the data for some purpose. But what if the software were used instead just to add metadata to the images at a much finer granularity than the simple classification scheme used by humans? (The metadata could then be simplified down again if humans wanted to search for a set of images at some later point.)

Taken to its logical conclusion, however, this would generate a set of data as complex as the original data stream, providing no additional value. (Interesting that "additional value" in this case equates to "simplified".) So perhaps this is not actually a classification problem; rather, it's a search problem. The data already exists in the original image/data stream (different wavelength images of the galaxy: x-ray, radio, etc.), so rather than using the software to add metadata to each image to simplify future searches, it would be better to have a faster search engine that could look at all the original images in the database and return a set matching the search parameters, without the additional layer of metadata.
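
A crude sketch of what I mean by searching the raw data: score every stored image directly against a query image, no metadata index involved. Real multi-wavelength astronomy data would need far more care; this just shows the brute-force shape of the idea.

import numpy as np

rng = np.random.default_rng(1)
database = [rng.random((64, 64)) for _ in range(1000)]   # stand-ins for raw images

def search(query_image, database, top_k=5):
    # Score every raw image against the query by normalised correlation; no metadata index.
    q = query_image.ravel()
    q = (q - q.mean()) / q.std()
    scores = []
    for i, img in enumerate(database):
        v = img.ravel()
        v = (v - v.mean()) / v.std()
        scores.append((float(q @ v) / q.size, i))
    return sorted(scores, reverse=True)[:top_k]

print(search(database[42], database))   # image 42 should come back first, with a score of about 1.0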

Keep in mind that the metadata is only going to be as accurate as the system (human or NN) that applied it in the first place. All neural nets have some "certainty" or confidence function that essentially means "I am this sure that this image should go in that pile". The implicit inverse of this statement is that the net is also correspondingly sure that the image should NOT go in each of the other possible piles. And if the neural net is always being retrained, it may improve over time and change its ideas about which pile earlier images should have gone into. So the metadata may change and evolve.
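
A tiny illustration of that point: if the net's class probabilities come from a softmax, being roughly 80% sure an image is a spiral is implicitly being roughly 20% sure, in total, that it is anything else.

import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

p = softmax(np.array([2.0, 0.5, -1.0]))      # made-up scores for spiral, elliptical, irregular
print(p, p.max(), 1.0 - p.max())             # confidence in one pile vs. everything else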

The other thing is that the metadata scheme itself may change. Obviously with computers it's just a matter of re-classifying all the earlier work; that is simply a matter of applying computing power to the problem. It may or may not be physically or economically viable, but it is theoretically the solution.

Which gets me back to my earlier point: don't bother with a metadata scheme at all; just build a database of images and a better search engine that can work from the raw data rather than from some pre-constructed but potentially flawed metadata index that may or may not have evolved.

Conceptually neat, but maybe impractical in reality. This then leads into an argument about how to "optimise" the solution so it becomes practical. Which probably leads back to doing some sort of pre-sort, which leads to a finer-grained sort, which leads to applying metadata to help the sort, which leads back to the original point of building a neural net to apply metadata so a big dumb search engine can build an index and return a result in a reasonable amount of time. Circle complete.

We get to a point where it's a game of pick-your-compromise. The three corners of this trade-off are search time, completeness of search, and correctness of search.


And the same optimization strategies keep recurring: more computing power, pre-processing, constant improvement, partial results, imperfect results, etc.

As I said, pick your compromise.

Perhaps, rather than seeing the metadata as a subset or simplification of the data within the image for search and indexing purposes (plus the context it was captured in: time, date, source device, blah blah), use the pre-processing to add value to the data. Add data that helps shape future analysis rather than categorisation.
Look for interesting features and make predictions based on current state-of-the-art knowledge, but also enrich the data set by integrating it with other data sets, and make notes on any gaps in the data or aspects that need to be re-examined from a better angle. Aim for completeness.

This becomes another impractical game, but it's fun nonetheless.

Imagine being able to split the data on a star into layers, drill down into the spectral frequencies of a particular star, find that some frequency has been incompletely documented, and have the system automatically schedule telescope time to re-capture it in a future pass, and also learn to capture that aspect for all future images because some researchers are interested in it.

So the system could evolve in response to use. Which raises the issue of data that can be generated from the data set: do we store it for future re-use, or is it more efficient (and less flawed) to discard it and re-generate it when it's next needed, on the assumption that the tool used to re-generate it later will potentially be better and include fewer flaws and errors? This then becomes merely a function of available computing power at any point in time. And with the cloud, we can start to do some really big data crunching without the previous compromises. It then becomes a question of how clever the tool and the tool creators are. (parallelism + marshalling + visualization = Data Geek Bliss)

I would be very interested in the size of the neural net they used and some of the other factors, such as the number of classification classes and all the other fun details, but the study seems to be unnamed, and the only identified source may or may not be involved. (His page shows some similar work.)

An issue with all this waffling is the actual quality of the data in the data stream. It's far from "perfect": these are astronomical images in various wavelengths reaching us across space, taken with very good but imperfect devices and then sampled into some digital format with additional limitations, artifacts and assumptions. So building a perfect system on imperfect data is possibly another case of me over-engineering something.

Such is life.

Wednesday, June 2, 2010

Simplify for sanity

Reduce, simplify, clarify.

This seems to be the theme for my week at the moment. I have been cleaning out and clearing up at home, at work and on the web. Nothing spectacular, but it's all been lightening the load I have been dragging around. My task at the moment has been to simplify all my web properties and remove the duplication between them. I am about 50% done so far. A couple more sites need a refresh, some profiles on various other sites need to be cleansed, and then I will be up to date.

Probably just in time to do it all again, but it's worth doing anyway.