Thursday, June 3, 2010

Thought exercise on applying neural nets to sorting galaxy images

http://www.space.com/businesstechnology/computer-learns-galaxies-100601.html

Article on using a neural net to sort galaxies. A good application of known technology, but that's not the point I'm interested in. My interest is in how the tool is applied to "help" a human function more effectively.

Imagine the scenario if you can: a human slaving away over a pile of images of galaxies, sorting them into the relevant type piles. No problem, except for boredom and scaling. The human can sort them into all the type piles, plus a "weird" pile and maybe a "problem" pile for the ones they are unsure about. Later on they have another look at the weird and problem piles, maybe with some friends to help. Finally they get them all sorted and start again. Keep in mind that the flow of images never stops.

Now get a computer to do it. Easy enough, but slightly semantically different. Sort all the easy ones into their type piles, put the "maybe" ones into another pile, the "problem" ones into another, and finally the "weird" ones into yet another. Pass the weird and problem ones to the humans and have them spend some quality time sorting them out.
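
If I were sketching that routing in code it might look something like the following. The pile names are from my description above, but the thresholds and the shape of the classifier output are entirely my own assumptions, not anything from the article.

    # Route each image into a pile based on how confident the classifier is.
    # The thresholds here are illustrative only.
    EASY = 0.90      # confident enough to file straight into the type pile
    MAYBE = 0.60     # probably right, but worth a second look

    def route(probabilities):
        """probabilities maps galaxy type -> classifier confidence (sums to ~1)."""
        best_type, best_p = max(probabilities.items(), key=lambda kv: kv[1])
        guessing = 1.0 / len(probabilities)
        if best_p >= EASY:
            return ("easy", best_type)
        if best_p >= MAYBE:
            return ("maybe", best_type)
        if best_p < guessing + 0.05:
            return ("weird", None)       # barely better than a random guess
        return ("problem", best_type)    # uncertain; hand it to the humans

    print(route({"spiral": 0.95, "elliptical": 0.03, "irregular": 0.02}))  # easy
    print(route({"spiral": 0.45, "elliptical": 0.35, "irregular": 0.20}))  # problem
    print(route({"spiral": 0.34, "elliptical": 0.33, "irregular": 0.33}))  # weird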

The beauty of a neural net is that you can now feed the weird and problem items back in (with their new classification applied by the human think tank) as training data and improve the performance of the neural net. This process can occur every time the system finds weird or problem data.
I remember reading someone's idea about exceptions in software being "an opportunity for more processing". If you think of the whole system (neural net + data + humans) as a single system, then each edge case becomes an opportunity to improve the system.
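
A rough sketch of that whole loop, with the classifier, the humans and the retraining step as plain callables. None of this comes from the study; it's just my mental model written down.

    # The whole system: classifier + humans + feedback loop.
    # Every "weird" or "problem" image the humans resolve becomes new
    # training data, so the edge cases are what improve the net.
    def run_one_pass(classify, human_classify, retrain, incoming_images):
        hard_cases = []
        for image in incoming_images:
            pile, label = classify(image)
            if pile in ("weird", "problem"):
                hard_cases.append(image)
            # easy/maybe images are filed and move on
        resolved = [(image, human_classify(image)) for image in hard_cases]
        retrain(resolved)  # the exception really is "an opportunity for more processing"
        return len(resolved)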

All in all it's a pretty boring job, classifying galaxies based on an image (I assume there is a lot more to it, so work with my line of thought rather than the actuality), but the one thing the job does have is a huge, rich data stream and a fairly straightforward classification problem.

So the question arises: could the computer do a job beyond the capacity of the human classifiers? The whole idea of applying a classification structure to a set of data points is to simplify it and apply a human-scale structure to the data for some purpose. But what if the software were used instead just to add metadata to the images at a much finer granularity than the simple classification scheme used by humans? (The metadata could then be simplified if humans wanted to search for a set of images at some later point.)

Taken to its logical conclusion, however, this would generate a set of data that was as complex as the original data stream and provided no additional value. (Interesting that "additional value" in this case equates to "simplified".) So perhaps this is not actually a classification problem; rather, it's a search problem. The data already exists in the original image/data stream (different wavelength images of the galaxy: X-ray, radio, etc.), so rather than trying to use the software to add metadata to each image to simplify future searches, it would be better to have a faster search engine that could look at all the original images in the database and return a set matching the search parameters, without the additional layer of metadata.
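
A toy version of "search the raw data, no metadata index" might be nothing more than a nearest-neighbour lookup over the raw pixel vectors. The sizes and random data below are obviously placeholders for real multi-wavelength images.

    import numpy as np

    # Pretend each image is a flat vector of pixel values across the
    # different wavelength bands. A search is then just "find the stored
    # images closest to this query image", with no metadata layer at all.
    rng = np.random.default_rng(0)
    database = rng.random((10_000, 256))   # 10,000 fake "images"
    query = rng.random(256)

    distances = np.linalg.norm(database - query, axis=1)
    best_matches = np.argsort(distances)[:5]   # the 5 nearest images
    print(best_matches)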

Keep in mind that the metadata is going to be only as accurate as the system (human or NN) that applied it in the first place. All neural nets have some "certainty" or confidence function that essentially means "I am this sure that this image should go in that pile". The implicit inverse of this statement is that the neural net is also "this" sure that the image should NOT go in each of the other possible piles (there's a tiny numeric sketch of this below).
And if the neural net is always being retrained, then it may improve over time and change its ideas about which pile earlier images should have gone into. So the metadata may change and evolve.
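
On the confidence point: assuming the net has a softmax-style output (an assumption on my part, the article says nothing about the architecture), the confidences over the piles sum to one, so the "this sure it goes in that pile" number automatically carries the "this sure it does NOT go in the other piles" number with it.

    import numpy as np

    scores = np.array([2.0, 0.5, -1.0])                    # raw outputs for three piles
    confidences = np.exp(scores) / np.exp(scores).sum()    # softmax; sums to 1.0
    print(confidences)            # roughly [0.79, 0.18, 0.04]
    print(1.0 - confidences[0])   # how sure the net is the image is NOT pile 0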

The other thing is that the metadata scheme itself may change. Obviously with computers it's just a matter of re-classifying all the earlier work. This is just a matter of applying computing power to the problem; it may or may not be physically or economically viable, but it is theoretically the solution.
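
Re-classification is then just a batch job over the archive, something like this sketch, where the archive and the classifier are stand-ins for whatever the real system would use.

    # When the scheme (or the net) changes, re-run every stored image
    # through the newest classifier and replace the old labels.
    def reclassify_archive(archive, classify_latest):
        """archive maps image_id -> raw image data."""
        return {image_id: classify_latest(image)
                for image_id, image in archive.items()}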

Which gets me back to my earlier point about not bothering with a metadata scheme at all: just build a database of images and a better search engine that can work from the raw data, rather than from some pre-constructed but potentially flawed index of metadata that may or may not have evolved.

Conceptually neat, but maybe impractical in reality. This then leads into an argument about how to "optimise" the solution so it becomes practical. Which probably leads back to doing some sort of pre-sort, which then leads to a finer-grained sort, which then leads to applying metadata to help the sort, which then leads back to the original point of building a neural net to apply metadata so a big dumb search engine can build an index and return a result in a reasonable amount of time. Circle complete.

We get to a point where it's a game of pick-your-compromise. The three corners of this triangle are search time, completeness of search, and correctness of search.


And the same optimization strategies keep recurring: more computing power, pre-processing, constant improvement, partial results, imperfect results, etc.

As I said, pick your compromise.

Perhaps, rather than seeing the metadata as a subset or simplification of the data within the image for search and indexing purposes (plus the context in which it was captured: time, date, source device, blah blah), use the pre-processing to add value to the data. Add data that helps to shape future analysis rather than categorisation.
Look for interesting features and make predictions based on the current state-of-the-art knowledge, but also enrich the data set by integrating it with other data sets, and make notes on any gaps in the data or aspects that need to be re-examined from a better angle. Aim for completeness.

This becomes another game of impractical activity, but is fun nonetheless.

Imagine being able to split the data on a star into layers and drill down into the spectral frequencies of a particular star, then find that there is some frequency that has been incompletely documented, and have the system automatically schedule some telescope time to re-capture it on a future pass, and also learn to capture that aspect for all future images because some researchers are interested in it.
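
A crude version of "notice the gap and ask for more telescope time" could start as nothing more than bookkeeping over which wavelength bands have actually been captured. The band list and the scheduling step here are entirely made up.

    # Flag wavelength bands we have no coverage of for a given star,
    # so they can be queued for a future observation pass.
    DESIRED_BANDS = ["radio", "infrared", "visible", "ultraviolet", "xray"]

    def find_gaps(captured_bands):
        return [band for band in DESIRED_BANDS if band not in captured_bands]

    star_coverage = {"radio", "visible", "xray"}
    gaps = find_gaps(star_coverage)
    if gaps:
        print("schedule re-observation for:", gaps)   # stand-in for a real scheduler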

So the system could evolve in response to use. Which raises the issue of data that can be generated from the data set. Do we store that for future re-use, or is it more efficient (and less flawed) to discard it and re-generate it when it's next needed, on the assumption that the tool used to re-generate it later will potentially be better and include fewer flaws and errors? This then becomes merely a matter of available computing power at any point in time. And with the cloud, we can start to do some really big data crunching without the previous compromises. It then becomes a matter of how clever the tool and the tool creators are. (Parallelism + marshaling + visualization = Data Geek Bliss.)

I would be very interested in the size of the neural net they used and some of the other factors, such as the number of classification classes and all the other fun details, but the study seems to be unnamed, and the only identified source may or may not be involved. (His page shows some similar work.)

An issue with all this waffling is the actual quality of the data in the data stream. It's far from "perfect": it's astronomical images in various wavelengths, reaching us across space, taken with very good but imperfect devices, and then sampled into some digital format with additional limitations, artifacts and assumptions. So building a perfect system based on imperfect data is possibly another case of me over-engineering something.

Such is life.
