A history of enterprise search 1948-1959
I have set the start date for this history somewhat arbitrarily as 1948, the year of the Royal Society Scientific Information Conference. This conference identified the challenges that lay ahead in managing the flow of scientific information, challenges that we have still not solved. The earliest research into how computers might help was undertaken by Philip Bagley as part of a Master's project at MIT. His thesis was entitled Electronic Digital Machines for High-Speed Information Searching. He set out the basic principles of ‘information searching’ and wrote a program for the Whirlwind computer at MIT. This Master's thesis should not be confused with his later PhD thesis, in which he was the first to use the term ‘metadata’.
By June 1952 there was enough interest in the subject at a number of research centres across the USA to hold a Symposium for Machine Techniques for Information Selection at MIT. One of the speakers at the Symposium was Hans-Peter Luhn, at that time working on punched-card retrieval systems for IBM. Luhn would turn out to be hugely influential in information retrieval. Another very influential person was to be Eugene Garfield, who in 1955 published a paper in Science about the value of citation analysis. From this approach Garfield developed the Institute for Scientific Information, in due course the provider of one of the leading online databases. His insight also became one of the innovations incorporated into Google at its outset in the 1990s, but that is another story. Of more immediate interest is a paper by Allen Kent and his colleagues at the Battelle Memorial Institute, Ohio. In this paper the concepts of ‘recall’ and ‘pertinency’ are proposed as metrics for a search application; ‘relevance’ later replaced pertinency.
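The two metrics can be illustrated with a small calculation. This is a minimal sketch using the modern set-based definitions, on the assumption that ‘pertinency’ corresponds to what is now usually called precision; the function name and the document identifiers are hypothetical and are not taken from the paper by Kent and his colleagues.

```python
def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) for one search.

    retrieved: identifiers of the documents the system returned
    relevant:  identifiers of the documents a human judged relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually found
    # Recall: share of all relevant documents that the search found.
    recall = len(hits) / len(relevant) if relevant else 0.0
    # Precision: share of the returned documents that were relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical search: four documents returned, three judged relevant overall.
recall, precision = recall_and_precision(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d5"],
)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

The two figures pull in different directions: returning more documents tends to raise recall and lower precision, which is why both are reported together.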
There were two further important conferences in the 1950s. The first was the International Study Conference on Classification for Information Retrieval, held in Dorking, UK in 1957. This was the first opportunity for UK and US research teams to exchange ideas and research on information retrieval. The US may have had a technology lead but the UK was held in high regard for research and implementation of classification and index frameworks. A year later an International Conference on Scientific Information was held in Washington to take note of developments since the 1948 Royal Society conference, and much of the discussion was about information retrieval. The papers make for some fascinating reading. Even in 1958 Dow Chemical was studying how computer-based systems could be used to manage in-house documentation.
The chemistry community have long had some special information retrieval challenges (such as searching chemical structures) and have always been in the vanguard of search development. It was at an American Chemical Society meeting in Miami in 1957 that Luhn gave a paper on A Statistical Approach to Mechanized Encoding and Searching of Literary Information in which (in effect) he set out the constituent elements of a search application.
The following year Luhn published a paper on his work at IBM in which (according to the abstract) “excerpts of technical papers and magazine articles that serve the purposes of conventional abstracts have been created entirely by automatic means. In the exploratory research described, the complete text of an article in machine-readable form is scanned by an IBM 704 data-processing machine and analyzed in accordance with a standard program. Statistical information derived from word frequency and distribution is used by the machine to compute a relative measure of significance, first for individual words and then for sentences. Sentences scoring highest in significance are extracted and printed out to become the ‘auto-abstract.’” A visionary approach. Luhn also proposed that the frequency of word occurrence in an article furnished a useful measurement of word significance. This is the origin of the now familiar term frequency – inverse document frequency model, although it was not until 1972 that Karen Spärck Jones developed a rigorous statistical basis for TF.IDF.
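The procedure in Luhn's abstract can be sketched in a few lines of modern code. This is a simplified illustration rather than Luhn's actual program: the stop-word list, the frequency threshold and the scoring formula (significant words squared, divided by sentence length) are assumptions made for this sketch, and all function names are hypothetical.

```python
import re
from collections import Counter

# Assumed stop-word list for this sketch; very common words carry little meaning.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "is", "to", "by", "for", "it", "on"}


def tokenize(text):
    """Lower-case the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def auto_abstract(text, num_sentences=2, min_count=2):
    """Return the highest-scoring sentences, kept in their original order."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    # Step 1: word significance, taken here as raw frequency above a threshold.
    counts = Counter(w for w in tokenize(text) if w not in STOP_WORDS)
    significant = {w for w, c in counts.items() if c >= min_count}

    # Step 2: sentence significance, from the density of significant words.
    def score(sentence):
        words = tokenize(sentence)
        if not words:
            return 0.0
        hits = sum(1 for w in words if w in significant)
        return hits * hits / len(words)

    # Step 3: extract the top sentences and restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:num_sentences])]


sample = (
    "Search engines index documents. Documents are ranked by search engines. "
    "The weather was pleasant. Ranking documents helps search users."
)
for sentence in auto_abstract(sample):
    print(sentence)
```

In the sample text the off-topic sentence about the weather contains no frequent words, so it scores zero and is left out of the extract.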
In 1959 Maron and Kuhns wrote a seminal paper entitled On Relevance, Probabilistic Indexing and Information Retrieval. In it they defined ‘relevance’ and proposed ‘probabilistic indexing’, which allows a computing machine, given a request for information, to make a statistical inference and derive a number (which they called the “relevance number”) for each document. This number is a measure of the probability that the document will satisfy the given request. The result of a search would then be an ordered list of those documents which satisfy the request, ranked according to their probable relevance. The importance of the paper is that Maron and Kuhns then evaluated their proposal through a manual (rather than computer-based) trial, so setting out not only the fundamental principle of determining the probability that a document was relevant but also the importance of system evaluation. Fifty years later Maron published a short account of the background to this paper in which he provides a fascinating insight into how he and Kuhns developed this principle.
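The idea of a relevance number and a ranked result list can be illustrated as follows. This is a simplified reading for a single-term request, assuming the score is proportional to a prior probability for the document multiplied by the weight with which the term was assigned to it; the function name, the weights and the priors are all hypothetical and are not figures from the paper.

```python
def rank_by_relevance_number(request_term, index_weights, priors):
    """Rank documents for a one-term request, highest relevance number first.

    index_weights: document id -> {index term: weight between 0 and 1}
    priors:        document id -> prior probability that the document is wanted
    """
    scores = {}
    for doc_id, weights in index_weights.items():
        weight = weights.get(request_term, 0.0)
        if weight > 0.0:
            # Unnormalised Bayesian score: prior times the term weight.
            scores[doc_id] = priors.get(doc_id, 0.0) * weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical probabilistic index: each term carries a weight, not just yes/no.
index_weights = {
    "doc_a": {"retrieval": 0.9, "indexing": 0.2},
    "doc_b": {"retrieval": 0.4},
    "doc_c": {"indexing": 0.8},
}
# Hypothetical priors, for example derived from how often each document is used.
priors = {"doc_a": 0.2, "doc_b": 0.5, "doc_c": 0.3}

for doc_id, number in rank_by_relevance_number("retrieval", index_weights, priors):
    print(doc_id, round(number, 2))
```

A Boolean search for the same term would simply return doc_a and doc_b as an unordered set. Here doc_b is ranked first because its higher prior outweighs its lower term weight.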
Although Maron and Kuhns had shown that a probabilistic approach was superior to a Boolean approach, virtually all of what might be seen as the first generation of commercial search applications used Boolean logic, because the challenge of calculating a ‘relevance number’ had yet to be solved. It is of note that Maron was at the Rand Corporation, because its spin-off, System Development Corporation, went on to play an important role in search development. Another important development in 1959 was the establishment of the Augmentation Research Center at Stanford Research Institute under the direction of Doug Engelbart.
So by the end of the 1950s almost all the core elements were in place, including understanding the required modularity of the search process, the benefits of a probabilistic view of document retrieval (rather than using Boolean operators), the concepts of precision, recall and relevance, and the value of testing and evaluation. What was needed now was computing power. In the 1960s California would join forces with Massachusetts in the quest to scale up search and make it widely available within and outside of the organisation.