Vespa arrives under Oath via FAST and Yahoo!

by | Oct 8, 2017 | Search

Apache Lucene and Apache Solr have become almost generic synonyms for open source search. Lucene is also a key element of the Elasticsearch stack and some commercial vendors make use of it, including IBM and Attivio. Over the last couple of years there has been quite a lot of debate about the respective merits of Solr vs Elastic, and I’m not going to take sides. However, a new open source index candidate has now been released by Yahoo! under the name Vespa. (I realise that Hadoop is open source, but humour me for the purposes of this post!) According to an introduction to Vespa by Jon Bratseth, the business case is that “serving search results often involves more than looking up items by ID or computing a few numbers from a model. Many applications need to compute over large datasets at serving time. Two well-known examples are search and recommendation. To deliver a search result or a list of recommended articles to a user, you need to find all the items matching the query, determine how good each item is for the particular request using a relevance/recommendation model, organize the matches to remove duplicates, add navigation aids, and then return a response to the user. As these computations depend on features of the request, such as the user’s query or interests, it won’t do to compute the result upfront. It must be done at serving time, and since a user is waiting, it has to be done fast. Combining speedy completion of the aforementioned operations with the ability to perform them over large amounts of data requires a lot of infrastructure – distributed algorithms, data distribution and management, efficient data structures and memory management, and more. This is what Vespa provides in a neatly-packaged and easy to use engine.” He has also written a good (though undated!) blog post which has a schematic of the architecture.
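For readers who think better in code, the serving-time steps in that quote (match, score against the request, remove duplicates, return) can be sketched in a few lines of Python. This is purely my own toy illustration and has nothing to do with Vespa's actual API or implementation: the `Doc` class, the `serve` function, the sample `CORPUS` and the scoring weights are all hypothetical, and the "add navigation aids" step is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    # Hypothetical document shape, invented for this illustration.
    doc_id: str
    title: str
    topics: frozenset


# Tiny made-up corpus; "a" and "b" are deliberate duplicates.
CORPUS = [
    Doc("a", "Open source search engines", frozenset({"search"})),
    Doc("b", "Open source search engines", frozenset({"search"})),
    Doc("c", "Search ranking with tensors", frozenset({"ranking", "ml"})),
    Doc("d", "Cooking pasta", frozenset({"food"})),
]


def serve(query_terms, user_interests, corpus, limit=3):
    """Match, score, de-duplicate and return results at request time."""
    query = {term.lower() for term in query_terms}

    def title_words(doc):
        return set(doc.title.lower().split())

    # 1. Find all the items matching the query.
    matches = [doc for doc in corpus if query & title_words(doc)]

    # 2. Score each match for this particular request. The weights are
    #    arbitrary assumptions, standing in for a real relevance model.
    def score(doc):
        term_overlap = len(query & title_words(doc))
        interest_overlap = len(set(user_interests) & doc.topics)
        return term_overlap + 0.5 * interest_overlap

    ranked = sorted(matches, key=score, reverse=True)

    # 3. Remove duplicates, keeping the best-ranked copy of each title.
    seen_titles = set()
    results = []
    for doc in ranked:
        key = doc.title.lower()
        if key not in seen_titles:
            seen_titles.add(key)
            results.append(doc)

    # 4. Return the response to the user.
    return results[:limit]


if __name__ == "__main__":
    for hit in serve(["search", "ranking"], {"ml"}, CORPUS):
        print(hit.doc_id, hit.title)
```

Even in this toy the score depends on the user's interests as well as the query, which is why the result cannot be computed upfront. Doing the same over very large collections, while a user waits, is the infrastructure problem the quote describes.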

The software has an interesting back story. It starts in 1999 with AlltheWeb, which came out of the same search team that developed FAST Search and Transfer. AlltheWeb was technically very sophisticated, but Google had the momentum and AlltheWeb found it difficult to gain market share. At around the same time GoTo.com was developing pay-for-placement search services. In 2001 GoTo.com changed its name to Overture; in 2003 Overture purchased AlltheWeb and AltaVista and was then itself acquired by Yahoo! In 2006 IBM and Yahoo! collaborated on the IBM OmniFind Yahoo! Edition, which was built on Lucene (though that was not visibly disclosed) and was a free enterprise-level search application for up to 500,000 documents. It worked quite well, arguably too well, and IBM withdrew the product in 2010. Back at Yahoo! there was a significant amount of search development taking place, including bringing Doug Cutting on board in 2006 to continue the development of Hadoop. I should say at this stage that despite the gradual decline of Yahoo! the quality of research and development in its laboratories around the world was of the highest order. In 2016 the labs were closed down and integrated into Yahoo! Research.

The story continues under Verizon, which acquired AOL in 2015 and most of Yahoo’s operating divisions in 2017, bringing them together under the name of Oath. Now we are seeing the results of many years of development and the rebuilding of Vespa from an internal application running many of the Yahoo! sites into an open source offering. Why they called it Vespa when there is already a software product called Vespa I have no idea. Be that as it may, the quality of the documentation is first class. I will admit to having some difficulty getting my head around the ranking model, which makes extensive use of tensor mathematics. That is interesting both as a novel approach to ranking and as one of the reasons for the very fast processing speed.

So far the only comparison between Lucene and Vespa that I have seen comes from Matt Overstreet at OpenSource Connections, and he also focuses on the ranking engine, which is one of the major differentiators from Lucene. Vespa can be downloaded from GitHub. Unlike Lucene and Solr it is not an Apache project, at least at present. I think that the arrival of Vespa is very timely, given the computation requirements of content analytics. I suspect the release to open source is because Yahoo! is not in the software development and support business and recognises the value of having a community take on the development, which is why the documentation is so comprehensive! It will not be an overnight game-changer, but it promises solutions to many emerging problems in search and my expectation is that development momentum will build quickly. A good indicator will be how quickly O’Reilly Media adds a Vespa book to its list of titles! It certainly does not mean the end of Lucene, but now you will have a choice of index management technology.

Martin White