It is not easy to understand how search works. The usual explanation is along the lines of crawling content, creating an index and then matching the query term against the index. If only it were that simple. In reality a great deal of processing is going on in the background, much of it using techniques from computational linguistics. These techniques enable search applications to extract entities from text, interpret natural language queries and generate summaries, and enable computers to analyse the structure of words and sentences up to the point of semantic analysis. These are all ‘wicked problems’ but solving them is essential to any search application. Computational linguistics emerged from early work in the 1960s on machine translation. There is a good account of the history and applications of the science in the Stanford Encyclopedia of Philosophy.
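To show what that "usual explanation" amounts to, here is a minimal Python sketch of an index and an exact-match query. It is only an illustration under my own assumptions: the three sample documents and the function names (`tokenize`, `build_index`, `search`) are made up for this post and do not describe how any particular search product works.

```python
from collections import defaultdict

PUNCTUATION = ".,;:!?\"'()"

def tokenize(text):
    """Lowercase the text, split on whitespace and strip surrounding punctuation."""
    tokens = []
    for word in text.split():
        cleaned = word.strip(PUNCTUATION).lower()
        if cleaned:
            tokens.append(cleaned)
    return tokens

def build_index(documents):
    """Build an inverted index: each term maps to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every term in the query (exact match only)."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

# Hypothetical sample content, standing in for crawled documents.
documents = {
    1: "Search engines crawl content and build an index.",
    2: "Computational linguistics helps interpret natural language queries.",
    3: "An index maps each term to the documents that contain it.",
}

index = build_index(documents)
print(sorted(search(index, "index")))    # [1, 3]
print(sorted(search(index, "indexes")))  # []
```

The second query is the interesting one. A person would expect "indexes" to find the same documents as "index", but exact matching finds nothing. Closing that gap, by analysing the structure of words rather than comparing strings, is one small example of where computational linguistics comes in.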

The main reason why I’m blogging on this topic is that I am often told that there is no development taking place in search. Certainly some of the core principles date back to the 1960s and 1970s, but under the surface there is a considerable amount of research being undertaken in both information retrieval and computational linguistics. Because the problems that need to be solved are complex, it takes time to find solutions. Indeed, sometimes a research paper can do no more than clearly state the problem. Good examples include recent papers on deciding how to manage queries that include a reference to a percentage, and on parsing models for identifying multiword phrases.

Both these papers were published in the journal Computational Linguistics. This is an open access journal, so copies of all the papers published can be downloaded at no cost. It is well worth browsing through some of the back issues to get a sense of the scope and scale of computational linguistics research, and of the challenges in ‘understanding’ language in a way that can be processed by a computer. The rapid adoption of open source search solutions is very likely to reduce the time taken from research to implementation, so some of the techniques described in the papers could be available to developers in months rather than years. Even when they do emerge onto the desktop it may be very difficult to spot exactly what has changed behind the scenes, but overall search performance and satisfaction are likely to increase quite substantially over the next few years as a result of computational linguistics.

Martin White