Hardly a day goes by without an announcement from a search software business of how clever the company has been to develop a dense vector approach to the similarity matching of documents (‘document’ used in its generic sense). The impression created is that this is absolutely leading-edge technology, and that it is going to transform the performance of search applications.
On my desk as I write this blog is a copy of “Introduction to Modern Information Retrieval” by Gerard Salton and Michael McGill, published in 1983. Chapter 4 describes the SMART retrieval system developed by Salton, with pages 120-127 devoted to an explanation of vector representation and similarity computation. The acronym SMART was derived from System for the Mechanical Analysis and Retrieval of Text. The system was developed by Salton and his colleagues in the mid-1960s and represented a significant leap forward in the mathematical modelling of information retrieval processes.
A detailed presentation by Salton of his vector model was published in 1978 as an internal Cornell University document. Although Salton had referred to a vector space model (VSM) in earlier publications, this document was the first in which he referred to the VSM as a vector processing model, thereby moving it from a mathematical construct to an ‘information retrieval’ model. The history of the gradual evolution of the VSM IR model is well documented by David Dubin in a paper published in early 2004. The journey is quite a complex one, but there is no doubt that it was Salton who conceived of the utility of the vector space model and used it very effectively in his SMART system.
Gerard Salton (1927-1995) is regarded as the father of modern information retrieval, and over the course of his career (initially at Harvard and then at Cornell) he received many awards and honours for his pioneering work. The advent of dense vectors takes his work on sparse vectors and builds substantially on it. It is therefore disappointing that so few of the papers published on dense vector developments make any reference to Salton’s pioneering work, on which many search applications have been built over the last 50 years.