04 October 2006

01. The Infinite Archive

As I mentioned in my first entry, for the next few months, this blog will be preoccupied with Prof. Bill Turkel’s Digital History course at the University of Western Ontario, London. I will try to provide the links for readings on which I am commenting, but if I slip up, they are easily found by going to the course website, and clicking on the relevant week; the title of each blog will include the week’s number. Blog entries not closely related to the course work will not begin with a number.

Battelle’s postulate, “The Database of Intentions is simply this: The aggregate results of every search ever entered, every result list ever tendered, and every path taken as a result. It lives in many places, but three or four places in particular hold a massive amount of this data (ie MSN, Google, and Yahoo),” is mildly intriguing, but fundamentally problematic on two levels. First, it is entirely unreflective about the concept of intention it employs. Quite aside from intentionality being a heavy hitter in philosophy (cf. the Stanford Encyclopedia of Philosophy), it is central to the theory of ‘purposeful systems’, defined as ‘systems capable of intentionality’ (Ackoff, Emery, Trist [1]). But neither intention nor intentionality is a univocal concept. I suggest there are fundamental differences in the nature of the intentionality involved in my doing a playful, time-killing search while waiting for the kettle to boil, and my doing a search for information fundamental to my work, or perhaps to my health; or in an individual doing a search for his/her own purposes, and a corporate entity doing a search for its purposes, etc.

Or consider an article in the New York Times (9 August 2006) on how data on searches by 675,000 American AOL subscribers, which AOL posted for ‘research purposes,’ was used to identify at least one individual. For the individual profiled, her searches reveal very different orders of intentionality: some were for herself, some were for gifts for others, some were carried out for friends without net access, etc. Until there is a way to differentiate the types of intentionality at play in web searches -- and of course intentionality resides in the mind of the purposeful system -- it is difficult to see how reliable conclusions can be drawn from the Stew of Intentions.

The disturbing abuse of privacy in the AOL example leads to a second concern. Battelle makes the customary nod to the downside: the database has the “potential to be abused in equally extraordinary fashion” and is open to being “supoenaed” (sic; he should be ‘under penalty’ for misspelling the word). But then, of course, he breathlessly gushes on about how wonderful the prospect of a ‘Database of Intentions’ is.

Operating here is perhaps the oldest flaw of the technophile: that ‘can implies ought’. In other words, if something is technically possible, it should be unquestioningly pursued. [2]

I clicked on another Battelle piece, ‘Google as Builder,’ which I believe it certainly will be; investors will push it to deploy its huge cash reserves. It brought to mind Chris Anderson’s Wired article (and now book) ‘The Long Tail’. Anderson’s argument is that the internet changes the economics of identifying and fulfilling (in the business sense) orders for items for which there is only a minute but persistent demand over time. It’s an exponential extension of the move from mass to niche markets, right down to the level of individual markets (as on eBay, which Anderson doesn’t say much about). Suddenly, the marketer’s dismissive comment -- ‘sample of one’ -- is inverted; that market of one now has purchasing power, and access to relevant product. That very different concepts of viable markets are at play today was brought home by a recent announcement by one American TV network that it was cancelling a new series which had drawn ‘only’ three million viewers.

But what Anderson doesn’t acknowledge is the enormous concentration on the other end of the tail. The mediators, or in some cases direct retailers, linking buyer and seller are some of the largest corporations in the US (in market cap): Google, Amazon, MSC, etc. So there’s an odd kind of asymmetry operative here, which bears attending to. Do we really believe that the tail is wagging the other end?

“The Single Box Humanities Search”: I have never found a useful reference through Google Scholar (and consequently have discontinued using it), and this may remain the case while so much of the relevant journal content is gated.
There are two rather simple things which Google Scholar could do which I would find valuable:
- keep track of where scholars are. I notice that Anthony Pagden is no longer at Johns Hopkins. Where did he go? (The AHA Directory is useful but is an annual hard-copy publication and thus often lags changes. Oh, Pagden's moved to UCLA.)
- identify where scholars’ CVs are available. CVs which include lists of publications are useful when one wants to track the evolution of a scholar’s work. But CVs are surprisingly elusive. Sometimes only highly abbreviated ones are available; sometimes quite full CVs; sometimes both. They are often not available through the academic department directly. Sometimes they are accessible only through the faculty of graduate studies, the presumption apparently being that undergrads wouldn’t/shouldn’t be interested in this information. Sometimes they are on a scholar’s web site. One of the more extensive CVs I have found is Marg Conrad’s at UNB, but look where it’s buried! A search engine which located CVs would be valuable.

Courant’s article is provocative, and one I expect I will return to. It’s easy to share his dismay at the distortion of copyright that has developed in the U.S. and elsewhere. But I am less convinced by his arguments in favour of seeing ideas and information as a ‘public good’ which normatively ought to be accessible without cost.

He claims, “Once something is in digital form, and on a server, the cost of use, pretty much anywhere in the world, is essentially zero.” He thus acknowledges but then sidesteps the question of how the up-front costs of building and maintaining the system and digitizing the content are to be financed. IMHO, Courant may be misapplying the concept of a ‘public good’. In this instance, the good is not simply the ‘idea’ or ‘phrase -- string of code,’ but that together with the systems (hard and soft) that preserve it and enable the user to access it, and which entail both capital costs and marginal costs in supporting users.

Courant discusses Michigan’s freely available ‘Making of America’ (I just looked and it only (or should that be ‘only’?) includes 9,300 books). It provides an interesting contrast with the American Antiquarian Society’s ‘Archive of Americana,’ which provides the full text of virtually everything printed in the ‘US’ prior to 1820 -- over a hundred thousand items -- but which is only available by subscription. To cover the up-front costs, AAS entered into a joint venture with Readex (Readex, via a few corporate holding companies, is part of the Thomson empire). The trade-offs between the two approaches deserve discussion.

We are also seeing, I think, hybrid systems: Anderson’s Long Tail article (and his Long Tail blog) are freely available, but his book is sold. So he preserves the income stream from the book (and I expect from corporate clients), but gives the others away (or is this just smart marketing?).

Courant ignores a range of competing institutions that do research and develop ideas, and which often have more immediate and largely invisible impact. These are the large consulting firms and for-profit think tanks; some even offer credible graduate degrees: e.g., AD Little, or for another option, Sotheby’s, the auction house. Their work is proprietary, and the results impinge on us every day.

Courant focuses on the increasing amount of information (or data) available on the web. I am equally aware of the large amounts of once public information that have become proprietary and inaccessible. One example is the demise of City Directories, which are of huge value to local historians. They were uneven, but often listed not only the ‘head of household’ and occupation (and frequently the employer), but also other members of the family and even boarders, sometimes with ages, schools, etc. Comparable data is now sold to businesses, and no, the public/university library no longer has a copy, and couldn’t get one even in the unlikely event it could pay for it.

Rosenzweig's AHR article is convincing on the issues involved in preserving and accessing digital data. Having owned PCs for 25 years (my first, and second, were the Osborne ‘micro-computer’; the term ‘PC’ only entered the language with IBM’s first product), and having migrated through several operating systems and word processing apps (for example), I can testify to the limited life of digital info. I have a range of information in word processing and db files which I cannot convert to a Windows operating environment, and can only access by booting up an old machine. Mine is simply a microcosm of a systemic problem.

The contrast between the postures of historians (save everything) and librarians/archivists (selection is inevitable) is illuminating.

The financial services industry has pushed disintermediation for almost two decades, with considerable success in corporate/commercial banking. But Rosenzweig (above, fn. 51) is incorrect in viewing eBay as an example of direct seller-buyer transactions. eBay is in the middle, collecting its cut on every deal. (As does Abebooks, which in the past year has changed its system to make it considerably more difficult for buyers to transact directly with sellers.) Indeed mediation, not disintermediation, is the financial life blood of the web: I believe Google, and most other sites, only get paid for ads (sponsored links) on which a viewer clicks.

The title, ‘The Infinite Archive,’ has obvious sex appeal. Not to quibble, but it seems to me that as long as the web is made up of binary digits, its size is, at least in principle, calculable, and therefore not infinite. Indeed, one of our readings contains estimates of its size.

What may indeed be infinite are the provocative issues and contested perspectives that it raises.


[1] Russell L. Ackoff, and Fred E. Emery, On Purposeful Systems, (London: Tavistock Publications, 1972); Fred E. Emery, and Eric Trist, eds., Systems Thinking: Selected Readings. Rev. ed., 2 vols., (Harmondsworth: Penguin Books, 1981).

[2] I do not recall where I first encountered this fallacy. A quick Google search turned up material primarily on the Kantian inverse: ‘ought implies can’. But one reference intrigues me: Hasan Ozbekhan, “The Triumph of Technology: 'Can' Implies 'Ought'.” It was first published by the System Development Corp. in 1967, and then as a chapter in Nigel Cross, David Elliott, and Robin Roy, eds., Man-Made Futures: Readings in Society, Technology and Design, (London: Hutchinson, 1974). An abstract of the SDC version is available.
I was acquainted with Prof Ozbekhan c.1980, when he was chair of the Social Systems Sciences Dept. at Wharton (UPenn), but I believe my awareness of ‘can implies ought’ precedes my knowing him.

