Sunday, April 30, 2006

"Manual" is the new "algorithmic"

A recent conversation with a friend about the whole Web 2.0 madness ended with him flattering me into pimping my opinion on the subject here on the blog. The title of the post was (of course) inspired by a conversation with the other wife.

In the beginning, there were directories (think Yahoo! and the ODP). These were manually populated by some trusted community of people, who made sure that the links pointed to relevant content. Eventually the amount of content on the web grew way beyond the capability of manual discovery, and fairly complicated algorithms crawled and sifted through the mounds of crud to find the data relevant to most queries (think Google).

Once it became obvious that search was an incredibly powerful driving force for web commerce, it wasn't long before an entire community of black-hat search engine optimizers (SEOs) popped up to manipulate the rankings to their advantage. After all, "There was GOLD in them thar SERPs!" (SERPs being Search Engine Result Pages.) Most search engines of course have groups of people dedicated to making sure the ranking algorithms are wise to the black-hats' tricks.

Fast-forward to 2003(ish), and the pendulum swings back to manual labor. del.icio.us and Flickr introduce this novel concept: Let users "tag" content (URLs and images respectively) with words representative of the content (like "family", "poodle", "jazz"). This works great. Free labelled data! Naively, one could use this as a direct relevance statement. An object tagged with the term "jazz" must obviously be a valid search result for the query "jazz", right? Quite so, but the real power of this turns up if you can generalize the labelling to unlabelled content on the web. That's exactly what machine learning algorithms do. If Yahoo!'s smart, their boffins are using their acquisitions of del.icio.us and Flickr to do exactly that.
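
To make the "generalize the labelling" bit concrete, here's a toy sketch of how user tags could train a classifier that then tags content nobody has labelled. It's a bare-bones naive Bayes text classifier; the sample pages, the function names (train, predict_tag) and the choice of naive Bayes are all my own assumptions for illustration, not anything Yahoo! or anyone else is known to run.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Crude whitespace tokenizer; a real system would do far more.
    return text.lower().split()

def train(tagged_docs):
    """tagged_docs: list of (text, tag) pairs, i.e. the free labelled data."""
    word_counts = defaultdict(Counter)  # tag -> word -> count
    tag_counts = Counter()              # tag -> number of documents
    vocab = set()
    for text, tag in tagged_docs:
        tag_counts[tag] += 1
        for word in tokenize(text):
            word_counts[tag][word] += 1
            vocab.add(word)
    return word_counts, tag_counts, vocab

def predict_tag(text, model):
    """Pick the most probable tag for untagged text (naive Bayes, add-one smoothing)."""
    word_counts, tag_counts, vocab = model
    total_docs = sum(tag_counts.values())
    best_tag, best_score = None, float("-inf")
    for tag in sorted(tag_counts):
        score = math.log(tag_counts[tag] / total_docs)
        total_words = sum(word_counts[tag].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen in training
            score += math.log((word_counts[tag][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag

# Hypothetical user-tagged pages (made up for this example).
TAGGED = [
    ("miles davis trumpet quintet live recording", "jazz"),
    ("saxophone solo improvisation bebop trumpet", "jazz"),
    ("poodle puppy grooming tips", "poodle"),
    ("standard poodle puppy training", "poodle"),
]
MODEL = train(TAGGED)

# An untagged page gets a tag inferred from what users labelled elsewhere.
print(predict_tag("bebop trumpet recording", MODEL))
```

The point is only that the tags act as training labels: the more honest tagging users do, the better the guesses for the untagged rest of the web.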

From a naive point of view, it would appear that we're done. We've solved the relevance problem if the users themselves tell us what's relevant. Right?

Not quite.

Keep in mind that the only reason index spam wasn't a problem with algorithmic search from 1998 to about 2002 was that search didn't (yet) drive commerce. Once Yahoo! really does start using that label data, and the black-hats catch on that tagging is being used to influence search results, what's going to stop an SEO from tagging affiliate pages for online casinos with "cooking"? Pretty much nothing. At that point the value of the labelled data is zilch. We'll have to resort to natural language techniques for summarization to automatically generate tags. Guess what? That's back to algorithmic information retrieval again.
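
Here's a rough sketch of what "back to algorithmic" might look like for the casino-tagged-as-cooking case: before trusting a tag, check whether the page's own words look anything like pages already known to deserve that tag. Everything here is hypothetical (the trusted set, the word-overlap score, the 0.2 threshold, the function names), and a determined spammer would get around something this simple. The point is only that the check is an algorithm again.

```python
def tokenize(text):
    # Set of lowercase words; crude, but enough for a sketch.
    return set(text.lower().split())

def tag_support(page_text, tag, trusted_pages):
    """Fraction of the page's words that also occur on trusted pages carrying `tag`."""
    tag_vocab = set()
    for text, tags in trusted_pages:
        if tag in tags:
            tag_vocab |= tokenize(text)
    words = tokenize(page_text)
    if not words or not tag_vocab:
        return 0.0
    return len(words & tag_vocab) / len(words)

def is_suspicious(page_text, tag, trusted_pages, threshold=0.2):
    # threshold is an arbitrary illustrative cut-off, not a tuned value.
    return tag_support(page_text, tag, trusted_pages) < threshold

# Hypothetical pages whose tags we already believe.
TRUSTED = [
    ("slow roasted chicken recipe with garlic and thyme", {"cooking"}),
    ("easy pasta recipe with tomato sauce", {"cooking"}),
]

CASINO = "online casino bonus poker slots jackpot play now"
RECIPE = "garlic chicken pasta recipe"

print(is_suspicious(CASINO, "cooking", TRUSTED))  # casino page claiming "cooking"
print(is_suspicious(RECIPE, "cooking", TRUSTED))  # genuine recipe page
```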

So that's my $0.02. We're in a temporarily happy phase where "manual is the new algorithmic" (smile Coe). In a couple of years' time we'll be back to where we started. Enjoy it while it lasts.



saturn air jam said...

Maybe you should append "NOT" to the title of the post... *snigger*
Is that why you haven't tagged this particular post? :p

Zoonie said...

I hadn't tagged it because I wasn't on a machine with my tag-generator bookmarklet. I'm just lazy :))

coeman said...

I want to tag people. If they say stupid stuff they get a "stupid tag", smart stuff a "smart tag", like Slashdot karma. Then I want to make a reader so I know how smart/stupid people are (according to those they've interacted with) before I chat with them. I really just want to apply it to cars, but hey, a man can dream, no?