Peter and Trish Royal are heading away from San Francisco, back to their beloved New York. I first met Pete electronically on Apache’s Avalon project nine years ago - where Inversion of Control started. Lots of ahead-of-its-time fun before it ended in tears. Being in the same city as Peter was a privilege; we got to deliver features for various open source projects, TDD style and paired of course. For a while we were in the same building, as ThoughtWorks was upstairs from Radar near the Caltrain.

Semantic Categorization or Search?

More recently, Peter used to work for Radar Networks on the Twine platform. As far as I can work out, Twine is a Semantic Categorization platform, but in my opinion it lacks an essential meme to help it catch on. Peter’s role (and he’s too humble to say) was, I think, lead engineer for the platform; not so much the scientist behind the Semantic purpose for Twine, nor the Marketeer trying to work out how/when to make it popular and to whom. Pete’s old boss at Radar, Nova Spivack, was in the news yesterday with an item on Wolfram Alpha. Nova’s article tries not to give the game away, but does give enough detail for us to know that it is Semantic Search or Semantic Answers to Questions. You would type a question and Wolfram Alpha would give you the answer based on some mathematical representation of the question.

!/files/Mosbys_Circus_Movement.jpg! p. Better search results are sorely needed. Take for example “Circus Movements”, or more accurately “What is Circus Movement”, as a search term. The particular answer I am looking for is in the excellent Mosby’s Medical Dictionary, and is an unfortunate condition causing the sufferer to do things like somersaults spontaneously. It is also something to do with a heart rhythm, or a gait. Here is almost the only reference online that you’ll find for it. Now Peter does not really have this condition. Indeed, I am quite sure he could not somersault if he wanted to, but it is useful to talk about while we consider the semantic search stuff.

Semantic search results today

Any answers machine (AKA Semantic Search) is going to have to disambiguate as it attempts to answer the question. Google presently does not (though our ref is in the top ten), nor does MS Live ( “though our page is #1 in their rank”:/files/MS-Live.jpg ). PowerSet does a good job of the disambiguation, but misses that ‘circus movement’ itself is an overloaded term, so it could drop or separate matches for ‘circus’ alone. There was a Reddit response to Nova’s article puffing MIT’s Start, though it returns nothing for our question. True Knowledge (mentioned in a comment to Nova’s posting) has not been taught about circus movement yet. Cuil sadly “ranks pages with the words circus and movement in them”:/files/Cuil.jpg (not necessarily together). Cuil has allegedly relevant categories in a second list to the right of the main search results, but in this case it is just nuts. Lastly, Twine ranks articles in its own system with either circus or movement in them, and like True Knowledge needs some data entry.
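To make the difference between those behaviours concrete, here is a minimal Python sketch. It is not how any of the engines above actually work; the three-page corpus, the page names and the function names are all invented for illustration. It only contrasts matching either word, matching both words anywhere on the page, and matching the two words together as a phrase.

```python
import re

# Hypothetical mini-corpus: page names and texts are invented for illustration.
PAGES = {
    "big-top": "The circus came to town and the labor movement held a rally.",
    "mosby": "Circus movement: involuntary somersaults; also a gait or a heart rhythm.",
    "gait": "A shuffling movement of the legs is one kind of gait.",
}


def tokens(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def any_word_match(query, pages):
    """Pages containing at least one query word (either circus or movement)."""
    wanted = set(tokens(query))
    return sorted(name for name, text in pages.items()
                  if wanted & set(tokens(text)))


def all_words_match(query, pages):
    """Pages containing every query word, not necessarily together."""
    wanted = set(tokens(query))
    return sorted(name for name, text in pages.items()
                  if wanted <= set(tokens(text)))


def phrase_match(query, pages):
    """Pages where the query words appear next to each other, in order."""
    wanted = tokens(query)
    hits = []
    for name, text in pages.items():
        words = tokens(text)
        last_start = len(words) - len(wanted)
        if any(words[i:i + len(wanted)] == wanted
               for i in range(last_start + 1)):
            hits.append(name)
    return sorted(hits)


if __name__ == "__main__":
    query = "circus movement"
    print("any word: ", any_word_match(query, PAGES))
    print("all words:", all_words_match(query, PAGES))
    print("phrase:   ", phrase_match(query, PAGES))
```

Only the phrase match isolates the page that treats ‘circus movement’ as a single term. Even then, that one page still carries several senses (somersaults, gait, heart rhythm), which is the disambiguation a real answers machine would have to do on top.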

PowerSet leading the pack?

So it seems to me that PowerSet has the best foundation. It is doing the categorization automatically, I hope, which is far superior to Twine’s human-entered categorization. PowerSet’s data is (or was) taken from Wikipedia and Freebase and categorized with deep understanding gleaned from WordNet. It is not nearly enough, though you can guess that Microsoft, PowerSet’s owner now, is loading a far greater data set into it. After at least one too many glasses of wine, I bumped into two of the PowerSet founders at a post-conference party in San Francisco, and blathered on about the need to start over without the WordNet seed; to instead make a learning system that sorts fact from fiction on the regular web. Indeed I went on to say that their issue was one of computational power - that before semantic search takes off (or takes over) a new computer architecture will need to emerge. Specifically, run-cold computers that are the size of thumb drives, organized more like a brain than a bus. Weld an ARM chip onto a USB flash drive and have a hundred in a shoebox and you get the idea.

The brain analogy is interesting. When you ask your own brain “what is that condition where people do somersaults involuntarily”, you either know it as circus movement or you do not. Actually, you may struggle to remember, and kick yourself as you once knew, but that is a different issue. If you ask your own brain “what is circus movement” you will get one or more of the three answers, always disambiguated. You might cheekily invent a fourth - “a cause in favor of the Circus” - or you might not know. Rarely will you get ‘no idea’ if you once knew it.

Wolfram Alpha

So if Dr Wolfram has been able to map reduce questions and facts to a mathematical hash, then that could be a leap forward. Let’s wait and see, and hope it is more cool than Cuil was. Will it ultimately come up with a single answer gleaned from something it learned on the web, or will it link to pages and somehow categorize and rank them - “this is 93% likely to be correct”? We won’t have to wait long, it seems.

March 9th, 2009