Friday, August 20, 2010

Tools for research students

I keep encountering research students wrestling with problems that were solved long ago in other industries. Nothing new here, it's just frustrating to find someone struggling to re-invent the wheel.

My current wish list for research students would be:

Project Management Tool
Basic task and resource tracking, critical path analysis, Gantt charts
Microsoft Project is the simplest and easiest option we have access to.
Single user is fine. Little or no collaboration needed.

Project file management
Subversion with TortoiseSVN is my favorite combination.
Still a little bit complex to explain and use, but it's the best I have so far.


The other issue I constantly deal with is research students trying to re-invent the wheel on their project processes:
  1. Formulate a hypothesis
  2. Come up with an experiment to try to destroy that hypothesis
  3. Perform the experiment to collect data 
  4. Evaluate the results of the experiment against the hypothesis
  5. Publish the results, data and ideas generated

How hard is that conceptually? I get that it takes some repetition to understand and appreciate the subtlety of the scientific method, but these are research students. They are supposed to have seen this idea in print at least once.

http://en.wikipedia.org/wiki/Scientific_method
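To make step 2 concrete (try to destroy the hypothesis rather than prove it), here's a toy sketch in Python with entirely made-up numbers, not data from any real project: state a null hypothesis, collect data, then see whether the data lets you reject it.

```python
# Toy example of step 2: try to knock the hypothesis over, don't try to "prove" it.
# Claim under test: the new training method changes reaction time.
# Null hypothesis: it makes no difference.
from scipy import stats

# Made-up reaction times (ms), purely for illustration.
control = [412, 398, 405, 420, 388, 401, 417, 395]
treatment = [390, 377, 401, 385, 368, 399, 380, 372]

t_stat, p_value = stats.ttest_ind(control, treatment)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# If p falls below the threshold chosen in advance (say 0.05) we reject the null.
# We never "prove" the hypothesis; at best we fail to destroy it this time.
alpha = 0.05
print("Reject the null" if p_value < alpha else "Fail to reject the null")
```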

I keep having conversations with students who are doing an experiment to "find" something or "prove" something.....  it bothers me.  All this being said, I remember as a student how weird it seemed the first time I was confronted with the ideas of hypothesis testing. It seems totally ass about. So I forgive without reservation and try my best to explain the ideas again... but it still bothers me.

I have the sneaking suspicion that I might be getting a little out of touch with my own ignorance. I may have been doing the same thing too long. It's all getting a bit familiar and I am starting to imagine that I am seeing patterns. I think a little bit of fear and uncertainty keeps me grounded. I have the disturbing habit of feeling like I know what I am doing a little too frequently at the moment.

Still, there are surprises every day. It's just not the surprises of discovery and success, because I have had all those already; now it's just the surprises of violated assumptions and forgotten but important details and meetings.

Moving on. It seems like I didn't have as much to talk about on the subject I started with as I thought. Such is life.

Friday, July 16, 2010

BrainVision Analyzer 2 Workshop on EEG & TMS and EEG & fMRI

Ahhhh, professional development courses.....

This one was held over three days at the QBI at UQ and hosted by JLM Accutek, the Australian distributors for Brain Products. The lecturer was Dr. Ingmar Gutberlet.

The course was very intensive: three days of technical demonstrations and in-depth software tutorial sessions. I'm still digesting everything that we covered. I guess it will only really sink in once I get some serious practice time back home.

Being on campus at UQ has also been quite thought-provoking. It's pretty intimidating to go from a relatively tiny regional campus to one of the G8 campuses. Something of a culture shock. I have probably got just as much to think about from the campus experience and the people I've met as from the content of the course.

One thing that does need some comment is the quality of the accommodation. I have to say that for the price we paid, I feel we didn't get value for money.

The room takes some figuring out. The weather is freezing at night because of the river, so you wake up intermittently to turn on the air conditioner, which turns out to sound like a small jet engine. This makes sleeping a bit challenging.
The plumbing is terrible. They're on a water-saving kick, so someone has gone around and sabotaged the shower with a flow restrictor and a water-saving nozzle. The difference between arctic and third-degree burns is a very fine line.
And then there's the alarm clock. This consists of a 20-ton excavator tearing up a giant hole right beside the building. Strangely enough the digging starts at about 7am every day and seems to be done for the day about half an hour later. Perhaps it's just my persecution complex...

I have to say that I was surprised by the attendance pattern of some of the other attendees. I get that they're busy and have other calls on their time but it seems like such a waste to sign up and show up for only a couple of sessions. Fully half the attendees were AWOL most of the time.  Makes you wonder what they were getting out of it that was worth the price. 

I think many of the attendees were there to get some practical skills applicable to a particular problem they were facing in their work. Perhaps they were just better able to pick out the sessions that were relevant to that work. I was a bit of a kid in a candy shop. Everything was good.

Some of the software was a bit rugged, but that's the nature of these kinds of systems: half of it is a hack and half of it is done but lacking polish. Usually it's just amazing that it works as well as it does. It's an incredibly complex domain to work with, and the marketplace is saturated while the customers are non-uniform, so the number of users of most features may be quite low. Makes for a hard business environment and low margins.

The people here are different. I've never before been surrounded by such a bunch of high achievers.  This is no bad thing as it has provided a real learning experience.  There are so many things I need to work on that are just not getting exercise at Coffs.  I understand some of the more traveled staff a little better now.

I've spent the time harvesting ideas from everything. The workshop, the people, the campus, the software, the uni website. Maybe it was just the scary amount of coffee I've been drinking to try to stay awake and the sense of being away from the usual distractions.  Now I just need the time to write some of it up before it all turns to smoke.

I need to figure out a good time to leave tomorrow to miss the rush hour traffic. It was insane coming up. I managed to hit the rush about 110km south of Brisbane and was in rush hour traffic for more than an hour at freeway speeds. Not really good when the fatigue is at its maximum.

Back to thinking and catching up on all the work that's been piling up....

Thursday, July 1, 2010

Building a Calibration Wand for a Phasespace Motion Capture System from a Walkingstick

This post is documenting an interesting hardware hack.

The background.

A research project has just landed that involves using our Phasespace motion capture system. Since it's been idle for some time, I turned it on to check it out and remind myself how all the bits worked. Obviously it was broken.

So after replacing a video card in the hub computer and figuring out that the batteries in the LED driver units were dead, I finally got the rig up and talking. Then I found that the calibration wand was non-functional. Goes without saying really... IGOR rule 101 states: "Any equipment left in the proximity of students will be TOUCHED, no matter what you say, how well it's locked up or how many signs are erected."

The wand is one of those "damage magnets"! It's just too visually attractive. People are fascinated with it and will ask about it first out of all the equipment. It's just too pretty to live!

Anyway, today's IGOR hack-of-the-day is to build a calibration wand for a Phasespace (http://www.phasespace.com/) motion capture system.

Step 1 Scrounging
Find something to use as the wand shaft. Search the store rooms and the junk pile in my office. Nothing... nothing... nothing. I'm almost ready to head to metal fabrication and scrounge there when I find an old walking stick that was given to someone as a joke. Perfect. It also has a bit more chic than a length of plastic pipe or whatever else I might have found.

Step 2 Procurement
Think quick and figure out how to attach a string of multi-LEDs to the stick without destroying them. They have a Velcro backing, so all I need is some Velcro and some hot-melt glue. Time for a "Bunnings Run"(TM).

Shopping list
Hot-melt glue gun and reloads
Some Velcro cable holding tape
More cable ties



Step 3 Assemble the stick 

Here you can see the walking stick measured up, with pieces of the Velcro tape glued strategically in place, alternating by 90 degrees around the front of the stick.

Here is a detail of two Velcro pads.

When one of the pads is out of alignment, rip it off and do it again.


And a final overview of the stick and Velcro assembly.

Step 4 Building the wiring loom
Wire spool
Punch down tool
Punch down connectors

Now measure out the wire. Remember to add a bit of slack between each LED position, as they are fiddly to position and you don't want them under any tension. Velcro vs wire will also only end one way. Wire wins!
Be careful with the punch-down tool too. It doesn't really work the way it's intended on heavily insulated speaker wire. Mostly it tries to puncture your finger rather than securing the wire.

I use a knife to split the wire strands and then remove some of the insulation to help the punch down connector make a good connection. 


Once you have all the connectors on and facing the right way, put some hot glue in the back of each one to make sure it's not going to come off the loom. Let it cool and pull off the hot-glue cobwebs.

Step 5 Assembly


Assemble the stick. Lots of cable ties make it look better.


Add an LED driver. Cable ties make everything good.


Now plug in and turn the whole system on, put it into calibration mode and you can test your wiring. Note how only three of the eight LEDs work. Debugging time! Cut off all the cable ties...

Now take it apart again, pull the cable out of the connectors and hot glue, clean the glue off, cut away a little bit more insulation and re-assemble the wiring loom. This time, before you put the glue in each connector, assemble and test using calibration mode again. If a connector still does not work, cut away a little more insulation until you have bare wire and then punch it down into the connector again. When all are working, glue them up again.

Note the working LEDs this time.

I now have a functional calibration wand. All I need to do is change the values in the wand.rb file to match the positions of the LEDs on this wand and I can get the system calibrated. Get out the ruler and begin measuring...
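The arithmetic for turning the ruler measurements into coordinates is simple enough to sketch. The Python below uses made-up distances and a guessed shaft radius, and assumes a coordinate convention (z along the shaft); the real values come off the ruler, and the actual convention and wand.rb syntax need to be checked against the Phasespace documentation before transcribing anything.

```python
# Hypothetical measurements for illustration only; the real numbers come from the ruler.
import math

led_distances_mm = [50, 150, 250, 350, 450, 550, 650, 750]  # the eight multi-LEDs, from the tip
shaft_radius_mm = 12.0  # guessed: roughly half the stick's diameter

positions = []
for i, d in enumerate(led_distances_mm):
    # The LEDs alternate by 90 degrees around the front of the stick,
    # so the radial offset swaps between two perpendicular directions.
    angle = math.radians(90 * (i % 2))
    x = shaft_radius_mm * math.cos(angle)
    y = shaft_radius_mm * math.sin(angle)
    z = float(d)  # distance along the shaft
    positions.append((x, y, z))

for i, (x, y, z) in enumerate(positions):
    print(f"LED {i}: x={x:.1f} y={y:.1f} z={z:.1f}")
```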

And that's all folks. Pretty straightforward.

Friday, June 4, 2010

Netflix prize paper and iTunes Genius

http://www2.research.att.com/~volinsky/netflix/
http://www.technologyreview.com/blog/guest/25267/

The Netflix recommendation engine is an interesting problem. I ran across a mention of it in an article on iTunes Genius.

The iTunes Genius system is simply leveraging a massive data set to appear clever. I respect that it's taken a lot of work to get it working, but the essential strategy is not particularly special. It's just the effect of the massive data set that makes it viable. It's the same as any system that has a huge "memory" and can effectively leverage it to improve its performance.

The Netflix problem is similar, but it's more of an optimization problem. They are still doing the same thing as any recommendation engine in that they are trying to match a product with a consumer.

It would be interesting to look at the properties of the product vs the properties the consumer thought they were looking for vs the properties of previous products the consumer had consumed and their ratings of those products.

This is all based on a classification problem as well: how subjective or objective are the properties being discussed?

There is another difference: the magnitude of the experience. A music track (the iTunes problem) is a couple of minutes of your life, while a movie may be a couple of hours. If you don't like a song, it's a fairly small cost to discard it, or not even discard it. But a movie that you don't like has a large cost and you will probably avoid it completely in the future, so it generates a much stronger response.

The experience is also different. Over the course of a two-hour movie, the watcher may go through a range of experiences (especially with a good narrative arc), so they may try to report a much more varied response when asked whether they liked the movie or not. If you look at some of the film review forums, there are a lot of aspects that get discussed, while music tracks are much quicker and get a much simpler discussion (like or not like). Anyway, these are just data points at the end of the day.

In summary, the iTunes problem is a simple recommendation engine with fairly simple data points and a large set of sample training data. The Netflix problem is twofold: the first part is getting a good recommendation engine and the second is getting it to present a result in a reasonable time. The second part is just an optimization problem.

The recommendation engines have two input problems. The first is classification of the properties of the product being recommended. The second is getting useful data from a consumer about what they might like. It's then just a matter of finding all the possible matches and ranking them using some ranking scheme.
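As a bare-bones sketch of that matching-and-ranking step (this is not Netflix's or Apple's actual algorithm, just the general idea, with invented property vectors and cosine similarity standing in for "some ranking scheme"):

```python
# Bare-bones sketch of "match product properties to a consumer and rank".
# Property values are invented for illustration.
import numpy as np

# Each product described by the same set of properties (say: action, romance, comedy).
products = {
    "film_a": np.array([0.9, 0.1, 0.3]),
    "film_b": np.array([0.2, 0.8, 0.5]),
    "film_c": np.array([0.4, 0.4, 0.9]),
}

# Consumer profile inferred from what they rated highly in the past.
consumer = np.array([0.7, 0.2, 0.6])

def score(item_vec, user_vec):
    # Cosine similarity: just one of many possible ranking schemes.
    return float(item_vec @ user_vec / (np.linalg.norm(item_vec) * np.linalg.norm(user_vec)))

ranked = sorted(products, key=lambda name: score(products[name], consumer), reverse=True)
print(ranked)  # best match first
```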

Fair enough, this is a problem with real scale issues, but it can be simplified by splitting the search space in a couple of ways and doing some pre-computing.

The fact that people are so predictable means that you can probably pre-compute a great deal of this: build a set of "stereotype" user profiles and keep them up to date, then build an individual profile for each actual user as a function of the nearest "stereotype" with a customized set of deltas to represent their divergence from the stereotype.

It would probably be easy enough at scale to build a hierarchy of stereotypes and move the actual user between more or less specialized stereotypes as their taste changes. Then it simply becomes a matter of searching through the stereotypes for the nearest match rather than comparing the actual user with each and every film in existence.
All you would need to do is update the stereotypes as each new film is added to the database. Even if there were a few thousand stereotypes, it would still be nice and cheap to keep it all up to date. Sort of an intermediate processing strategy.
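Here's a toy version of the stereotype idea, assuming users are just vectors of ratings and using plain k-means for the clustering; the data is random and the real thing would be far more involved, but it shows the "nearest stereotype plus delta" shape.

```python
# Toy sketch of "stereotype + delta" user profiles. Invented random data,
# plain k-means for the stereotypes.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_users, n_films = 1000, 500
ratings = rng.random((n_users, n_films))   # stand-in for real user rating vectors

# 1. Cluster users into a modest number of stereotype profiles (cheap to keep up to date).
kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(ratings)
stereotypes = kmeans.cluster_centers_

# 2. Pre-compute a ranked film list per stereotype, refreshed whenever films are added.
precomputed = {s: np.argsort(-stereotypes[s])[:20] for s in range(len(stereotypes))}

# 3. An actual user is just "nearest stereotype + delta from it".
def recommend(user_vec, n=5):
    sid = int(kmeans.predict(user_vec.reshape(1, -1))[0])
    delta = user_vec - stereotypes[sid]
    # Start from the stereotype's pre-computed list, then nudge the order by the user's delta.
    candidates = precomputed[sid]
    return sorted(candidates, key=lambda film: delta[film], reverse=True)[:n]

print(recommend(ratings[0]))
```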

The number of stereotypes would probably be something like the number of permutations of combinations of the product's properties, minus the silly and unpopular. The list could probably be simplified even further by collapsing similar stereotypes among the less popular and increasingly specializing those that are popular. This could then be managed with an evolutionary strategy.

Once the problem starts to be described in terms of entities, it's possible to play all sorts of social and population games with them.

Thursday, June 3, 2010

Thought exercise on applying neural nets to sort galaxy images

http://www.space.com/businesstechnology/computer-learns-galaxies-100601.html

Article on using a neural net to sort galaxies. Good application of known technology but that's not the point I'm interested in.  My interest is how the tool is applied to "help" a human function more effectively.

Imagine the scenario, if you can: a human slaving away over a pile of images of galaxies, sorting them into the relevant type piles. No problem, except for boredom and scaling. The human can sort them into all the type piles, plus a "weird" pile and maybe a "problem" pile for the ones they are unsure about. Later on, they have another look at the weird and problem piles, maybe with some friends to help. Finally they get them all sorted and start again. Keep in mind that the flow of images never stops.

Now get a computer to do it. Easy enough, but slightly semantically different. Sort all the easy ones, put the "maybe" ones into one pile, the "problem" ones into another and the "weird" ones into a third. Pass the weird and problem ones to the humans and have them spend some quality time sorting them out.

The beauty of a neural net is that you can now feed the weird and problem items back in (with their new classification applied by the human think tank) as training data and improve the performance of the neural net. This process can occur every time the system finds weird and problem data.
I remember reading someone's idea about exceptions in software being "an opportunity for more processing". If you think of the whole system (neural net + data + humans) as a single system, then each edge case becomes an opportunity to improve the system.
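A toy sketch of that loop, using synthetic data and an off-the-shelf classifier rather than anything from the actual galaxy project: classify, route the low-confidence cases to the humans, fold their answers back into the training set and retrain. The 0.8 confidence threshold is an arbitrary choice for illustration.

```python
# Toy human-in-the-loop sketch: route low-confidence classifications to humans,
# then retrain on their answers. Synthetic data only; not the galaxy pipeline.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_classes=3,
                           n_informative=6, random_state=0)
X_train, y_train = X[:500], y[:500]          # small initial labelled set
X_stream, y_stream = X[500:], y[500:]        # "incoming images" (labels stand in for the human think tank)

model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
model.fit(X_train, y_train)

proba = model.predict_proba(X_stream)
confidence = proba.max(axis=1)               # "I am this sure it goes in that pile"
easy = confidence >= 0.8                     # sorted automatically
problem = ~easy                              # sent to the humans

print(f"auto-sorted: {easy.sum()}, sent to humans: {problem.sum()}")

# Humans classify the problem pile; their answers become new training data.
X_train = np.vstack([X_train, X_stream[problem]])
y_train = np.concatenate([y_train, y_stream[problem]])
model.fit(X_train, y_train)                  # retrain; repeat every time new edge cases appear
```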

All in all it's a pretty boring job, classifying galaxies based on an image (I assume there is a lot more to it, so work with my line of thought rather than the actuality), but the one thing the job does have is a huge, rich data stream and a fairly straightforward classification problem.

So the question arises: could the computer do a job beyond the capacity of the human classifiers? The whole idea of applying a classification structure to a set of data points is to simplify and apply a human-scale structure to the data for some purpose. But what if the software were used instead just to add metadata to the images at a much finer granularity than the simple classification scheme used by humans? (This could then be simplified if humans wanted to search for a set of images at some later point.)

Taken to its logical conclusion, however, this would generate a set of data that was as complex as the original data stream and provided no additional value. (Interesting that "additional value" in this case equates to "simplified".) So perhaps this is not actually a classification problem; rather it's a search problem, in that the data already exists in the original image/data stream (different wavelength images of the galaxy: x-ray, radio, etc.). So rather than trying to use the software to add metadata to each image to simplify any future searches, it would be better to have a faster search engine that could look at all the original images in the database and return a set that matched the search parameters, without the additional layer of metadata.

Keep in mind that the metadata is going to be only as accurate as the system (human or NN) that applied it in the first place. All neural nets have some "certainty" or confidence function that essentially means "I am this sure that this image should go in that pile". The implicit inverse of this statement is that the neural net is also "this" sure that the image should NOT go in each of the other possible piles.
And if the neural net is always being retrained, then it may improve over time and change its ideas about which pile earlier images should have gone into. So the metadata may change and evolve.

The other thing is that the metadata scheme may change. Obviously with computers it's just a matter of re-classifying all the earlier work. This is just a factor of applying computing power to the problem. It may or may not be physically or economically viable, but it is theoretically the solution.

Which gets me back to my earlier point about not bothering with a metadata scheme, and instead just building a database of images and a better search engine that can work from the raw data rather than from some pre-constructed but potentially flawed index of metadata that may or may not have evolved.

Conceptually neat, but maybe impractical in reality. This then leads into an argument about how to "optimise" the solution so it becomes practical. Which probably leads back to doing some sort of pre-sort, which then leads to a finer-grained sort, which then leads to applying metadata to help the sort, which then leads back to the original point of building a neural net to apply metadata so a big dumb search engine can build an index and return a result in a reasonable amount of time. Circle complete.

We get to a point where it's a game of pick-your-compromise. The three corners of this triangle are search time, completeness of search and correctness of search.


And the same optimization strategies keep recurring: more computing power, pre-processing, constant improvement, partial results, imperfect results, etc.

As I said, pick your compromise.

Perhaps, rather than seeing the metadata as a subset or simplification of the data within the image for search and indexing purposes (plus the context it was captured in: time, date, source device, blah blah), use the pre-processing to add value to the data. Add data that helps to shape future analysis rather than categorisation.
Look for interesting features and make predictions based on current state-of-the-art knowledge, but also enrich the data set by integrating it with other data sets, and make notes on any gaps in the data or aspects that need to be re-examined from a better angle. Aim for completeness.

This becomes another game of impractical activity, but it is fun nonetheless.

Imagine being able to split the data on a star into layers and drill down into the spectral frequencies of a particular star, then find that some frequency has been incompletely documented and have the system automatically schedule telescope time to re-capture it in a future pass, and also learn to capture that aspect for all future images because some researchers are interested in it.

So the system could evolve in response to use. Which raises the issue of data that can be generated from the data set. Do we store that for future re-use, or is it more efficient (and less flawed) to discard it and re-generate it when it's next needed (on the assumption that the tool used to re-generate it later will potentially be better and include fewer flaws and errors)? This then becomes merely a factor of available computing power at any point in time. And with the cloud, we can start to do some really big data crunching without the previous compromises. It then becomes a factor of how clever the tool and the tool creators are. (Parallelism + marshalling + visualization = Data Geek Bliss.)

I would be very interested in the size of the neural net they used and some of the other factors, such as the number of classification classes and all the other fun details, but the study seems to be unnamed and the only identified source may or may not be involved. (His page shows some similar work.)

An issue with all this waffling is the actual quality of the data in the data stream. It's far from "perfect": astronomical images in various wavelengths reaching us across space, taken with very good but imperfect devices and then sampled into some digital format with additional limitations, artifacts and assumptions. So building a perfect system on imperfect data is possibly another case of me over-engineering something.

Such is life.

Wednesday, June 2, 2010

Simplify for sanity

Reduce, simplify, clarify.

This seems to be the theme for my week at the moment. I have been cleaning out and clearing up at home, at work and on the web. Nothing spectacular, but it's all been lightening the load that I have been dragging around. My task at the moment has been to simplify all my web properties and remove the duplication between them. I am about 50% done so far. I've got a couple more sites that need a refresh and some profiles on various other sites that need to be cleansed, and I will be up to date.

Probably just in time to do it all again, but it's worth doing anyway.

Saturday, May 29, 2010

Search strategies

Ever lost something in your house? Thought you knew where it was but it turns out you didn't? When you go looking for it, it's just not there. What do you do?

Search nearby? Search in ever-widening circles around the spot where it should be? Try to retrace your steps? Look in the lost-and-found basket? Ask someone else? Systematically begin searching everywhere? Quarter the house and start a search grid? Do a sampled search of specific areas? Try to apply probability to where it most likely could be? Employ search agents (not your children... really, it doesn't work)?

There are some interesting strategies for searching for a thing in an unknown environment. There are a few ways to try to optimize the search, but they are often dependent on properties of the thing, the environment or the search tool(s). Not always generalizable.
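For what it's worth, the "apply probability" option has a tidy formal version (Bayesian search theory, the approach famously used to find lost submarines): look wherever the chance of finding the thing right now is highest, and downgrade a spot each time a look there comes up empty. A toy sketch with invented rooms and numbers:

```python
# Toy Bayesian-flavoured search: look where P(it's there) * P(I'd spot it if it were) is highest,
# and downgrade a room each time a look there fails. All numbers are invented.
prior = {"desk": 0.30, "couch": 0.25, "car": 0.20, "kitchen": 0.15, "shed": 0.10}
detect = {"desk": 0.9, "couch": 0.6, "car": 0.8, "kitchen": 0.7, "shed": 0.4}  # how thorough a look is

def next_place(belief):
    return max(belief, key=lambda room: belief[room] * detect[room])

def failed_look(belief, room):
    # Bayes update after not finding it in `room`.
    miss = belief[room] * (1 - detect[room])
    total = miss + sum(p for r, p in belief.items() if r != room)
    return {r: (miss if r == room else p) / total for r, p in belief.items()}

belief = dict(prior)
for attempt in range(5):
    room = next_place(belief)
    print(f"look {attempt + 1}: {room} (belief {belief[room]:.2f})")
    belief = failed_look(belief, room)   # pretend every look fails, just to show the updates
```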

As you might have guessed, I have lost something.