Monday, April 26, 2010

Archive from 5_6_07 - Knowledge harvesting

Just had a thought about harvesting knowledge in a public space.
Articles on web sites are written and posted by the author. In general there is no structure or expectation for feedback or update. They are "published". The only knowledge captured is that of the author. A hierarchy of one.
Compare that with a site like codeproject (www.codeproject.com), where an article (specifically about some programming topic) is written and posted by the author but has a threaded forum attached to the bottom where comments and discussion can be collected and retained. The author remains the top-level arbiter of the discussion with (I assume) edit rights to both the article and the discussion. The discussion participants retain edit rights to their contributions. A two-level hierarchy.
Look now at something like slashdot (www.slashdot.org). Here the author of the post contributes it, then it's chewed upon by the community. The signal-to-noise ratio in the following discussion is so low that the author has little hope of managing or replying to the discussion posts. The site employs a voting system to help elevate the signal above the noise. The author loses edit rights to the article, which are taken over by one of the site editors. Discussion posts are editable by their contributors and are influenced by moderation and voting. This creates a complex three-level hierarchy with some complicated and non-deterministic rules for predicting what the final knowledge collection might look like. This in itself is an interesting example of a system to harvest knowledge out of an essentially chaotic community. The beauty of the system is that it works. The drawback is that the voting system can be 'gamed', which will corrupt the process if it happens on a large scale.
One of the key elements is that the dynamic updates promote the signal upward while 'modding' down the rubbish. The rubbish is not discarded but is visually diminished and tagged so it can be filtered out if desired. At any later time, it can still be examined by researchers.
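As a rough sketch of the 'mod down but never discard' idea: the code below is not slashdot's actual algorithm, and the names (Comment, moderate, visible), the starting score of 1 and the -1 to 5 score range are all my own assumptions for illustration. It just shows votes moving a score and a reader-chosen threshold hiding the low end without deleting anything.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    author: str
    text: str
    score: int = 1  # assumed starting score, not taken from any real site


def moderate(comment: Comment, delta: int, lo: int = -1, hi: int = 5) -> None:
    """Apply one moderation vote, clamping the score to an assumed range."""
    comment.score = max(lo, min(hi, comment.score + delta))


def visible(comments: list[Comment], threshold: int) -> list[Comment]:
    """Return comments at or above the reader's threshold, best first.

    Nothing is deleted: low-scored comments remain in `comments`.
    """
    kept = [c for c in comments if c.score >= threshold]
    return sorted(kept, key=lambda c: c.score, reverse=True)


comments = [
    Comment("alice", "Detailed correction to the article"),
    Comment("bob", "First post!"),
    Comment("carol", "Link to the primary source"),
]
moderate(comments[0], +2)  # modded up
moderate(comments[1], -2)  # modded down
moderate(comments[2], +1)

shown = visible(comments, threshold=2)
print([c.author for c in shown])
```

The full list is still there for anyone who wants to browse at a lower threshold, which is the property that keeps the 'rubbish' available to researchers.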
Compare this to a wiki system. Again it uses an article as the cornerstone of the knowledge accumulation. This article is then modified directly by random contributors. They can add, edit or delete information. Most wikis seem to have the capacity for a simple comments system which is attached to the page. This is unthreaded and in my experience relatively unused. However, it does make for an ideal place to ask questions and make requests. The problem is that without a responsible maintainer, there is no one to service those requests reliably. It's the responsibility of the 'community' to chaotically make improvements.

Wiki systems have an interesting 'rollback' feature that allows changes to be 'undone'. This mitigates the damage of an author's work being deleted maliciously. However, as the author and the subsequent contributors are effectively peers, who is to say that the original author's contribution is better than the deletion made by the peer?
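A minimal sketch of how such a rollback could work, assuming the page simply keeps every revision in a list. This is not how any particular wiki engine implements it, and WikiPage, edit and rollback are names I made up for the example.

```python
class WikiPage:
    """A page that keeps every revision so any edit can be undone."""

    def __init__(self, text: str, author: str) -> None:
        # Each revision is an (author, text) pair; the last one is current.
        self.revisions = [(author, text)]

    @property
    def text(self) -> str:
        return self.revisions[-1][1]

    def edit(self, text: str, author: str) -> None:
        self.revisions.append((author, text))

    def rollback(self, author: str) -> None:
        """Restore the previous revision by appending it as a new one."""
        if len(self.revisions) < 2:
            return  # nothing to roll back to
        previous_text = self.revisions[-2][1]
        self.revisions.append((author, previous_text))


page = WikiPage("Original article text.", author="author")
page.edit("", author="vandal")   # a malicious blanking of the page
page.rollback(author="peer")     # any peer can undo it
print(page.text)
```

Note that the rollback is itself just another revision by a peer, so the blanked version stays in the history and could be restored just as easily. The mechanism does not decide whose version is 'better'.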
Wiki systems spawn pages. These are hyperlinked to older pages and the resulting knowledge base becomes a standard hyperlinked mess. There is no way to automatically assess quality, topic or relevance to any particular subject other than a normal keyword search or page rank system.
Now look at discussion forums, commonly called 'forums'. These usually have a threaded and date-sorted post structure. A number of separate 'forums' are often collected under a major subject heading or site. These sub-forums are usually just arbitrary separations of the topic area or activity of the 'community' that they are servicing. The membership of the community is often fairly chaotic but shares some commonality: either a topical interest or some common property.

A community member can make a 'post' at the root level of the forum. This post can then be commented upon by other forum members who add to the thread. The original author has edit control over the post and any subsequent comments they might make in the resulting thread. These threads can branch into sub-threads, making for a hierarchy of comment and response.
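The thread structure described here is essentially a tree, and can be sketched roughly as below. The Post class, its method names and the sample posts are hypothetical, and the rule that only a post's own author may edit it is my reading of the description, not any specific forum package's behaviour.

```python
from dataclasses import dataclass, field


@dataclass
class Post:
    author: str
    text: str
    replies: list["Post"] = field(default_factory=list)

    def reply(self, author: str, text: str) -> "Post":
        """Attach a comment under this post and return it."""
        child = Post(author, text)
        self.replies.append(child)
        return child

    def edit(self, user: str, new_text: str) -> bool:
        """Only the post's own author may edit it."""
        if user != self.author:
            return False
        self.text = new_text
        return True


def depth(post: Post) -> int:
    """Number of levels in the thread rooted at `post`."""
    return 1 + max((depth(r) for r in post.replies), default=0)


root = Post("dana", "How should stickies be sorted?")
first = root.reply("eli", "Keep them above the dated posts.")
first.reply("dana", "Thanks, that makes sense.")  # a sub-thread
root.reply("fay", "See the forum rules.")          # a second branch

print(depth(root))
print(first.edit("dana", "hijacked"))
```

Each participant controls only their own nodes in the tree, wherever those nodes happen to sit.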

The root of the forum has three classes of post. There are 'recent' posts that are sorted to the top of the list by date, 'older' posts that have moved down the list simply due to the date they were posted, and what are called 'stickies'. These 'stickies' are often posts addressing common topics or rules of conduct for the forum. They are fixed at the top of the forum and remain there irrespective of the date sorting.
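The ordering of the forum root can be expressed as a two-part sort key: sticky first, then newest first. This is only a sketch under that assumption; RootPost, forum_order and the sample posts are invented for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class RootPost:
    title: str
    posted: date
    sticky: bool = False


def forum_order(posts: list[RootPost]) -> list[RootPost]:
    """Stickies first, then everything else newest-first."""
    # False sorts before True, so `not p.sticky` puts stickies on top.
    return sorted(posts, key=lambda p: (not p.sticky, -p.posted.toordinal()))


posts = [
    RootPost("Old question", date(2007, 1, 10)),
    RootPost("Forum rules", date(2006, 3, 1), sticky=True),
    RootPost("New question", date(2007, 6, 5)),
]

titles = [p.title for p in forum_order(posts)]
print(titles)
```

The 'recent' and 'older' classes are really the same class at different positions in the date sort; only the sticky flag overrides it.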

The only other metadata that can be automatically derived from a post-thread object is its activity, number of posts, frequency of posts, statistics on who has posted, size of the posts and the number of thread branches. Otherwise the knowledge encoded in the structure will slowly drift away from the root of the forum.
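To make that list concrete, here is a rough sketch of deriving those numbers from a thread tree. The Node class and thread_stats function are hypothetical, and a couple of definitions are my own assumptions: a 'branch' is counted as any post with more than one direct reply, and 'frequency' as posts per day between the first and last post.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Node:
    author: str
    text: str
    posted: datetime
    replies: list["Node"] = field(default_factory=list)


def walk(node: Node):
    """Yield every post in the thread, root first."""
    yield node
    for child in node.replies:
        yield from walk(child)


def thread_stats(root: Node) -> dict:
    nodes = list(walk(root))
    times = [n.posted for n in nodes]
    span_days = (max(times) - min(times)).total_seconds() / 86400
    return {
        "posts": len(nodes),
        "authors": Counter(n.author for n in nodes),
        "total_chars": sum(len(n.text) for n in nodes),
        # Assumption: a branch point is a post with more than one reply.
        "branches": sum(1 for n in nodes if len(n.replies) > 1),
        "posts_per_day": len(nodes) / span_days if span_days else None,
        "last_activity": max(times),
    }


root = Node("dana", "Question", datetime(2007, 5, 1, 9, 0), replies=[
    Node("eli", "Answer", datetime(2007, 5, 1, 21, 0), replies=[
        Node("dana", "Thanks", datetime(2007, 5, 2, 9, 0)),
    ]),
    Node("fay", "Another view", datetime(2007, 5, 3, 9, 0)),
])

stats = thread_stats(root)
print(stats["posts"], stats["branches"], stats["posts_per_day"])
```

None of these numbers says anything about the quality of what was written. They only measure how much activity the thread attracted.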
Should write a paper on this at some point....
