Saturday, April 19, 2008

The Woolliness of the Wild Wild Web

WWW is the acronym for World Wide Web, but it more often seems to stand for the Wild and Woolly Web.

Call me old-fashioned, but one of the things that drives me up the wall about publication on the web in general, and technical expositions in particular, is the lack of both time-stamps and citations. These two things have existed in the scientific media since even before formal journal publication. For example, 17th century scientists like Newton and Hooke wrote missives to each other and it was convention then, as it is today, to commence a letter with the date. That's how we know that Hooke was very close to coming up with the law of gravitation that is now attributed to Newton (also aided by the latter meticulously eliding all reference to Hooke after the first edition of The Principia). Could we know those things today if they had been using the Web? It's not clear. It depends. And that's the problem: a lack of consistency and a lack of web tools to enforce consistency.

These same issues of time-stamping and citation of related work are very important for doing good performance analysis and capacity planning.

Generally speaking, an isolated piece of text provides little information. That's the problem with Google. It retrieves text based on string matches and page ranking (previous hits), without any context, so you have to weed your way through a lot of irrelevant junk. One also has to be careful because the first few Google items of text could be years apart in time, and that will likely alter your perspective on the significance of those items. Of course, that's assuming the text even has a time-stamp. You need context, and time-stamps and citations are both an integral part of context.
  • Time Stamps
    Email has always applied automatic time-stamping by default. Newsgroups provided both time-stamping and their own internal form of citation (before hyperlinks). Even the social noise generator Twitter applies automatic time-stamping. Google's Blogger (used here) has time-stamping, but it is ordered by the creation time, not the time of publication, which can be confusing if a blog item sits in draft mode for a week or more (it happens). Other blog tools (lesser ones, IMHO) may show some permutation of wall-clock time, day, and month, but not the year, for example. HTML has no tag field to encourage either manual or automatic editor-based time-stamping.

  • Citations
    At least Wikipedia offers a manual way for readers to flag articles that do not include any references.

    Even when references are included, hyper-references have a nasty habit of vaporizing over time. A more pressing and still unsolved problem is the fact that hyperlinks are voided if a web article is printed as hard copy. Books and hardcopy are not going away as fast as some people would like to think.

Here's a recent case in point. Agence France Presse (AFP) reported that Swedish researchers had found a flaw in a quantum-crypto protocol. The AFP text carried no citations or links to sources. Although the names of authors and their respective affiliations were provided, there were no hyperlinks to them. The kindest thing one can say is that AFP upheld the old (pre-web) standard for newspaper reporting. But today, because it is such an active area, any reference to quantum crypto or quantum computing is automatically going to get a lot of attention (i.e., hits on web sites), so the AFP text was immediately cloned all over the web: almost 800 clones, according to Google. From my perspective, since I know something about this quantum stuff, the AFP article gave the (wrong) impression that the authors or their institutions were just seeking press attention without any substance to their claim. Another science scam.

Coincidental reinforcement for my skepticism came on the same day, as NASA responded to an apparently over-zealous German press release (from the previous day) that a Berlin schoolboy Wunderkind had found a flaw in NASA's calculations concerning the asteroid Apophis hitting the earth in 2036. The next day, NASA had to deny that there had even been an error, that they had had any contact with the schoolboy, or that they had contacted the European Space Agency correcting the alleged error. What a mess.

It's hard enough hacking through the real science without this kind of noise to confuse the issue. Of course, the lack of time-stamping and proper quoting of sources just encourages this kind of scramble for media attention and ultimately conspires to produce misinformation. Worse yet, if something is repeated enough times, it must be right. Right? The near instantaneous cloning of press releases on the web is naively viewed by most readers as enhancing credibility, a view which is further reinforced by the way Google (search) and Google News promote rank.

Back on the quantum-crypto front, a clone of the AFP text naturally made it to but with one tiny difference. One of the authors (perhaps out of embarrassment) added a comment which simply contained a link to his FAQ (not available from his home page, which I had already checked) and even then, only at the end of that FAQ are there links to the peer-reviewed journal paper in IEEE Transactions on Information Theory. So, this one was legit, after all. On reflection, I think the authors had originally taken a low profile because the "flaw" hinged on a very technical point and one for which they even proposed a solution. Probably someone else made the contact with AFP and things took off from there. But I had to do a lot of extra, and what should have been unnecessary, work to reach that conclusion.

