Stratasphere: Big Data

Showing posts with label Big Data. Show all posts

Thursday, November 10, 2011

Automating Greed and Fear

http://blog.recursivity.com/post/12321213411/the-big-picture-true-machine-intelligence-predictive

https://www.greedandfearindex.com/tour/?_escaped_fragment_=/#!/

This is a really interesting project that relates to some work I am doing at the moment. I want to know more about it. Revisit this later.

http://www.forbes.com/sites/danwoods/2011/11/06/can-mapreduce-be-made-easy/

This is a good article on the commercialization of MapReduce and the integration with business. Kinda sounds like the same story as BI... just five years later. I paraphrase..."you can know stuff if you have a big enough database and this kinda smart layer of software that you can just use..." avoiding the issue of giving technically illiterate people very powerful tools with nuanced semantics....

Its the dream of every decision maker... to have infinite knowledge to procrastinate about. But, as with most things, the devil is in the detail.... when you deal with millions of details all at once, it makes a damn big devil.

Tuesday, November 8, 2011

Big Data generates Opportunity

http://arstechnica.com/business/consumerization-of-it/2011/09/information-explosion-how-rapidly-expanding-storage-spurs-innovation.ars

Leveraging big data to generate innovation. This kind of emergent opportunity is fascinating. Lots of thoughts to be had here. Need to read again.

Wednesday, September 28, 2011

Big data patterns

http://www.iheavy.com/2011/08/26/5-things-are-toxic-to-scalability/

Some good thoughts on large scale data driven apps here.

Thursday, July 21, 2011

Big data case studies

http://gigaom.com/cloud/5-real-world-uses-of-big-data/

Some case studies using big data.

Tuesday, July 5, 2011

Signal to noise ratio on Linkedin

Made the mistake of joining a few groups on Linkedin. The topics looked interesting but the content has turned out to be drivel. Reminds me why I really hate social networks. They are just awash with noise. I guess this in itself is a kind of signal; at a meta level; but....

I think thats one of the mistakes when looking at a massive noisy data set is to find patterns much easier. My sanity checker is simply look at the magnitude. If you find 5000 people talking about a brand name, do you care? Do you spend time digging into it to understand why? Certainly if your a consultant you try to make something out of it... but as a researcher you simply calcuate the magnitude of the signal in the sample and see if you have something thats statistically unusual. I have the sneaking suspicion that quite a few consultants would be finding another job if their clients understood this kind of simple acid test.

Anyway, my video render job has just about finished, so time to go scrape some more malware of a computer and debug a software job. Just another day at the office.

Tuesday, May 31, 2011

NodeXL Network Graph Tool

http://nodexl.codeplex.com/

I have been using this recently and its lots of fun.

Thursday, May 19, 2011

http://chronicle.com/article/Dumped-On-by-Data-Scientists/126324/

Article on problems with large data sets in research.

Flaws in data anonomyzing

http://radar.oreilly.com/2011/05/anonymize-data-limits.html

Article on some rules for dealing with public data sets and the problems with the idea of "anonymous" in data sets.

Wednesday, April 27, 2011

Crowdsourcing is Evil

I recently had a look at a site/app called Kaggle. (www.kaggle.com.au) and have come to the conclusion that crowd sourcing can be used for evil.

This is not specifically an exercise in Kaggle bashing... I have also formed an intense dislike of a number of other crowd source systems.
My fundamental dislike is that it takes a few hundred highly trained people and all their time and rewards a single person/team a few hundred bucks while on the other side of the equation, providing the "sponsor" with a solution that can be leveraged for significant gain.

This simply turns the value of all that labor and knowledge into virtually a worthless commodity. Each person had to pay dearly in time and money for the education to even get into the competition, they then had to spend the time, labor and equipment resources to actually compete and then (mostly) did not get any material reward. To add insult to injury, the sponsor gets to keep the solution all for themselves. No public good what so ever.

The worst problem is that this kind of system provides a fairly deterministic result for the "sponsor". In that the vast majority of the challenges result in a solution (where the data set actually has a useful solution) this means that this is a viable and insanely cheap way to do R&D rather than employ a bunch of knowledge workers and finance all the failures, you only need to pay a pittance for the best solution from a buffet of options. SCORE!!!!.

Suddenly using something like Kaggle is soooooo much more compelling than actually employing all those highly trained but expensive knowledge workers that the western economies have been hoping like hell would start to pay the bills after their tax base all went to the developing world. DUH!

So.... not only does it suck to be a knowledge worker in a country that has no R&D sector... but it sucks to be a knowledge worker anywhere. You now have to "compete" in a game to try to "win" a few cents on the dollar for your labor.... lulz.

There is a shiny inner core to this cloud of choking gunk... the problems that are being solved and the solutions have two nasty stings in the tail for a company hoping to turn them into a product... the first is implementation and the second is .... maintenance.

So while a company can skip the majority of the R&D cost and time required to get a solution, they still need people to product-ise it and keep the system running. At some point they will have to hire someone somewhere who cares enough about the product to actually do some work.
If they out-source everything their product will only look good on paper. There will be no support, no in-house knowledge and no way to evolve. The question is what sort of clock-speed this realisation will have. If the crowd sourcing model destroys the knowledge economy before the downside of crowd sourcing catches up with the head of the cycle, will there be any knowledge economy left?

Thursday, February 24, 2011

Dumped on by Data

http://chronicle.com/article/Dumped-On-by-Data-Scientists/126324/

Another article about the impact of IT and big data on researchers and their ability to function.
This implicitly identifies a lot of associated issues with researchers that is of interest from an IT perspective.

Friday, June 4, 2010

Netflix prize paper and iTunes Genius

http://www2.research.att.com/~volinsky/netflix/
http://www.technologyreview.com/blog/guest/25267/

The netflix recommendation engine is an interesting problem. I ran across a mention of this in an article on iTunes Genius.

The iTunes genius system is a simply leveraging a massive data set to appear cleaver. I respect that its taken a lot of work to get it to work but the essential strategy is not particularly special. Its just the effect of the massive data set that allows it to be viable. Its the same as any system that has a huge "memory" and can effectively leverage it to improve its performance.

The netflix problem is similar but its more of an optimization problem. They are still doing the same thing as any recommendation engine in that they are trying to match a product with a consumer.

It would be interesting to try to look at the properties of the product vs the properties that the consumer thought they were looking for vs the properties of previous products that the consumer had consumed and their rating of that product.

This is all based on a classification problem as well. How subjective/objective are the properties that are being discussed?

There is another difference. The magnitude of the experience. A music track ( iTunes problem ) is a couple of minutes of your life; while a movie may be a couple of hours. If you don't like a song, its a fairly small cost to discard it or not even discard it. But a movie that you don't like has a large cost and you will probably avoid it completely in the future, so it generates a much stronger response.

The experiences is also different. Over the course of a two hour movie, the watcher may go through a range of experiences ( especially with a good narrative arc. ) So they may try to report a much more varied response when asked if they liked the movie or not. If you look at some of the film review forums there is a lot of aspects that get discussed. While music tracks are much quicker and get a much simpler discussion ( like or not like ). Anyway, these are just data points at the end of the day.

In summary, the iTunes problem is a simple recommendation engine with fairly simple data points and a large set of sample training data. The netflix problem is two fold, the first is getting a good recommendation engine and the second is getting it to present a result in a reasonable time. The second part is just an optimization problem.

The recommendation engines have two input problems. The first is classification of the properties of the product being recommended. The second is getting useful data from a consumer about what they might like. Its then just a matter of finding all the possible matches and ranking them using some ranking scheme.

Fair enough this is a problem with real scale issues but it can be simplified by splitting the search space in a couple of ways and doing some pre-computing.

The fact that people are so predictable means that you can probably pre-computer a great deal of this and build a set of "stereotype" user profiles and keep them up to date then build an individual profile for each actual user as a function of the nearest "stereotype" with a customized set of deltas to represent their divergence from the stereotype.

It would probably be easy enough at scale to build a hierarchy of stereotypes and move the actual user between more or less specialized stereotypes as their taste changes. Then it simply becomes a matter of searching through the stereotypes for the nearest match rather than doing a comparison of that actual user with each and every film in existence.
All you would need to do is to update the stereotypes as each new film is added to the database. Even if there were a few thousand stereotypes, it would still be nice and cheap to keep it all up to date. Sort of an intermediate processing strategy.

The number of stereotypes would probably be something like the number of permutations of combination of the properties of the product minus the silly and unpopular. The list could probably be simplifying even further by collapsing similar stereotypes for the less popular and increasingly specializing those that are popular. This could then be managed with an evolutionary strategy.

Once the problem starts to be described in terms of entities its possible to play all sorts of social and population games with them.

Thursday, June 3, 2010

Thought exercise on applying Neural Nets used to sort galaxy images

http://www.space.com/businesstechnology/computer-learns-galaxies-100601.html

Article on using a neural net to sort galaxies. Good application of known technology but that's not the point I'm interested in. My interest is how the tool is applied to "help" a human function more effectively.

Imagine the scenario if you can, a human slaving away over a pile of images of galaxies and sorting them into the relevant type piles. No problem except for boredom and scaling. The human can sort them into all the type piles, plus a "weird" pile and maybe a "problem" pile for the ones they are unsure about. Later on have another look at the weird and problem piles, maybe with some friends to help. Finally get them all sorted and start again. Keep in mind that the flow of images never stops.

Now get a computer to do it. Easy enough, but slightly semantically different. Sort all the easy ones, put all the "maybe" ones into a third pile, the "problem" ones into another pile and finally the "weird" ones into another. Pass the weird and problem ones to the humans and have them spend some quality time sorting them out.

The beauty with a neural net is that you can now feed the weird and problem items back in ( with their new classification applied by the human think tank ) as training data and improve the performance of the neural net. This process can occur every time the system finds weird and problem data.
I remember reading someones idea about exceptions in software as being "an opportunity for more processing". If you think of the whole system ( neural net + data + humans ) as a single system then each edge case becomes the opportunity to improve the system.

All in all its a pretty boring job, classifying galaxies based on an image ( I assume there is a lot more to it, so work with my line of through rather than the actuality) but the one thing the job does have is a huge, rich data stream and a fairly straight forward classification problem.

So the question arises, could the computer do a job beyond the capacity of the human classifiers? The whole idea of applying a classification structure to a set of data points is to simplify and apply a human scale structure to the data for some purpose. But what if the software was used instead just to add meta data to the images in much finer granularity than the simple classification scheme used by humans. (This could then be a simplification of the meta data if humans wanted to search for a set of images at some later point)

Taken to its logical conclusion however, this would generate a set of data that was as complex as the original data stream and provided no additional value. (Interesting that "additional value" in this case equates to "simplified") So perhaps this is not actually a classification problem, rather its a search problem. In that the data already exists in the original image/data stream (different wave length images, xray, radio etc of the galaxy) so rather that trying to use the software to add metadata to each image to simplify any future searches, it would be better to have a faster search engine that could look at all the original images in the database and return a set that matched the search parameters without having the additional layer of metadata.

Keep in mind that the meta data is going to be only as accurate as the system (human or NN) that applied it in the first place. All neural nets have some "certainty" or confidence function that essentially means "I am this sure that this image should go in that pile". The implicit inverse of this statement is that the neural net is also "this" sure that the image should NOT go in each of the other possible piles.
And the if the neural net is always being retrained, then it may improve over time and change its ideas about which pile earlier images should have gone into. So the meta data may change and evolve.

The other thing is that the meta data scheme may change. Obviously with computers it just a matter of re-classifying all the ealier work. This is just a factor of applying computing power to the problem. This may or may not be physically or economically viable but is theoretically the solution.

Which gets me back to my earlier point about not bothering with a metadata scheme and just building a database of images and building a better search engine that can work from the raw data rather than from some pre-constructed but potentially flawed index of meta data that may or may not have evolved.

Conceptually neat but may be impractical in reality. This then leads into an argument about how to "optimise" the solution so it becomes practical. Which probably leads back to doing some sort of pre-sort, which then leads to a finer grained sort, which then leads to applying metadata to help the sort, which then leads back to the original point of building a neural net to apply meta data so a big dumb search engine can build and index and return a result in a reasonable amount of time. Circle complete.

We get to a point where its a game of pick-your-compromise. The three corners of this equation are search time, completeness of search, correctness of search.

And the same optimization strategies keep recurring, more computing power, per-processing, constant improvement, partial results, imperfect results etc.

As I said, pick your compromise.

Perhaps, rather than seeing the meta-data as a subset or simplification of the data within the image for search and indexing purposes (and the context it was captured. Time, date, source device blah blah) use the pre-processing to value add to the data. Add data that helps to shape future analysis rather than categorisation.
Look for interesting features and make predictions based on current state of the art knowledge but also do some enrichment of the data set by integrating it with other data sets and make notes on any gaps in the data or aspects that need to be re-examined from a better angle. Aim for completeness.

This becomes another game of impractical activity but is fun none the less.

Imagine being able to split the data on a star into layers and drill down into the spectral frequencies of a particular star, and then find that there is some frequency that has been incompletely documented and have the system automatically schedule some telescope time to re-capture that in a future pass but also learn to capture that aspect for all future images because some researchers are interested in that aspect.

So the system could evolve in response to use. Which raises the issue of data that can be generated from the data set. Do we store that for future re-use or is it more efficient (and less flawed) to discard it and re-generate it when its next needed (based on the assumption that the tool used to re-generate it later will potentially be better and include less flaws and errors). This then becomes merely a factor of available computing power at any point in time. And with the cloud, we can start to do some really big data crunching without the previous compromises. It then becomes a factor of how cleaver the tool and the tool creators are. (parallelism + marshaling + visualization = Data Geek Bliss )

I would be very interested in the size of the neural net they used and some of the other factors, such as number of classification classes and all the other fun details but the study seems to be both unnamed and the only identified source may or may not be involved. (His page shows some similar work)

An issue with all this waffling is the actual quality of the data in the data stream. Its far from being "perfect" as its astronomical images in various wavelengths reaching us across space take with very good but imperfect devices and then sampled into some digital format with additional limitations, artifacts and assumptions. So to build a perfect system based on imperfect data is possibly another case of me over-engineering something.

Such is life.

Monday, April 26, 2010

Archive from 5_6_07 - Knowledge harvesting

Just had a thought about harvesting knowledge in a public space.
Articles on web sites are written and posted by the author. In general there is no structure or expectation for feedback or update. They are "published". The only knowledge captured is that of the author. A hierarchy of one.
Compare that with a site like codeproject (www.codeproject.com)where an article( specifically about some programming topic ) is written and posted by the author but has a threaded forum attached to the bottom where comments and discussion can be collected and retained. The author remains the top level arbiter of the discussion with ( I assume ) edit rights to both the article and the discussion. The discussion participants retain edit rights to their contribution. A two level hierarchy.
Look now at something like slashdot. (www.slashdot.org) Here the author of the post contributes it then its chewed upon by the community. The signal to noise ratio is so low in the following discussion that the author has little hope of managing or replying to the discussion posts. The site employs a voting system to attempt to help the signal to be elevated above the noise. The author looses edit rights to the article; which are taken over by one of the site editors. Discussion posts are editable by their contributors and there is influence by moderation and voting. This creates a complected three level hierarchy with some complicated and non-deterministic rules for predicting what the final knowledge collection might look like. This in itself is an interesting example of a system to harvest knowledge out of an essentially chaotic community. The beauty of the system is that it works. The drawback is that the voting system can be 'gamed' and will corrupt the process if it happens on a large scale.
One of the key elements is that the dynamic updates promote the signal upward while 'modding' down the rubbish. It is not discarded but is visually diminished and tagged so it can be filtered out if desired. However at any time later, this rubbish can still be examined by researchers.
Compare this to a wiki system. Again it uses an article as the cornerstone of the knowledge accumulation. This article then is modified directly by random contributors. They can add, edit or delete information. Most wiki's seem to have the capacity for a simple comments system which is attached to the page. This is un-threaded and in my experience relatively unused. However it does make for an ideal place to ask questions and make requests. The problem being that without a responsible maintainer, there is no one to service those requests reliably. Its the responsibility of the 'community' to chaotically make improvements.

Wiki systems have an interesting 'rollback' feature that allows changes to be 'undone'. This mitigates the damage of authors work being deleted maliciously. However as the author and the subsequent contributors are effectively peers, who is to say that the original author's contribution is better than the deletion made by the peer.
Wiki systems spawn pages. These are hyper linked to older pages and the resulting knowledge base becomes a standard hyper linked mess. There is no way to automatically assess quality or topic or relevance to any particular subject different from a normal keyword search or page rank system.
Now look at discussion forums. Commonly called 'forums'. These usually have a threaded and date sorted post structure. A number of separate 'forums' are often collected under a major subject heading or site. These sub-forums are usually just arbitrary separations of the topic area or activity of the 'community' that they are servicing. The membership of the community is often fairly chaotic but share some commonality. Either a topical interest or some common property.

A community member can make a 'post' at the root level of the forum. This post can then be commented upon by other forum members who add to the thread. The original author has edit control over the post and any subsequent comments they might make in the resulting thread. These threads can branch into sub thread; making for a hierarchy of comment and response.

The root of the forum has three classes of post. There are 'recent' posts that are sorted to the top of the list by date. 'older' posts that have moved down the list simply due to the date they were posted and what are called 'stickies'. These 'stickies' are often posts addressing common topics or rules of conduct for the forum. They are fixed at the top of the forum and remain there irrespective of the date sorting.

The only other meta data that can be automatically derived from a post-thread object is its activity, number of posts, frequency of posts, statistics on who has posted, size of the posts and the number of thread branches. Otherwise the knowledge encoded in the structure will slowly drift away from the root of the forum.
Should write a paper on this at some point....

Stratasphere