Enterprise search research in 2020

by | Dec 16, 2020 | Search

It is important to appreciate that enterprise content presents some significant challenges to search developers. There are likely to be tens of millions of files of uncurated content in a wide range of file formats and the subject scope will be quite focused compared with web search. The scope is set by the business interests of the organisation and if that is pharmaceuticals there are going to be a massive number of documents relating to perhaps just a dozen therapeutic areas. Then of course there are the trade-offs between precision and recall! 

The papers listed below are just a selection of those published in 2020. It is very difficult to do justice to these papers in just a sentence or two, so even if they look only slightly interesting I would recommend a click and read. Many are not explicitly ‘enterprise search’ but cover issues that are a feature of searching enterprise content. Only the final paper in this selection is not open access.

Inverted index architecture

The standard index database structure is the inverted index and it has served its purpose well, but now the advent of the processing capabilities of cloud applications could potentially offer alternative high-performance index architectures. File collections in the hundreds of millions also give rise to substantial index sizes, and that requires careful attention to index compression.

IIU: Specialized Architecture for Inverted Index Search (taejunham.github.io)

Techniques for Inverted Index Compression (arxiv.org) 
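
To make the two ideas in this section concrete, here is a minimal Python sketch of an inverted index and of one simple compression scheme for its postings lists (gap encoding followed by variable-byte encoding). It is an illustration only and is not taken from either paper; the function names, the whitespace tokenizer and the sample document IDs are all assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        # Assumption: naive lowercase whitespace tokenization.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def varint_encode(numbers):
    """Variable-byte encoding: 7 data bits per byte, high bit set means 'more follows'."""
    out = bytearray()
    for n in numbers:
        while n >= 0x80:
            out.append((n & 0x7F) | 0x80)
            n >>= 7
        out.append(n)
    return bytes(out)


def varint_decode(data):
    """Inverse of varint_encode."""
    numbers, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            numbers.append(current)
            current, shift = 0, 0
    return numbers


def compress_postings(doc_ids):
    """Store the gaps between sorted doc IDs, which are small and encode compactly."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return varint_encode(gaps)


def decompress_postings(data):
    """Rebuild the doc IDs by summing the decoded gaps."""
    doc_ids, total = [], 0
    for gap in varint_decode(data):
        total += gap
        doc_ids.append(total)
    return doc_ids


index = build_inverted_index(["clinical trial report", "trial protocol", "annual report"])
packed = compress_postings([3, 130, 131, 1000])
print(index["trial"], len(packed), decompress_postings(packed))
```

Because the doc IDs in a postings list are sorted, the gaps between them tend to be small numbers, and small numbers fit in a single byte under variable-byte encoding. That is the basic reason index compression pays off on very large collections.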

Interoperability

Federated search has come on a great deal over the last few years but interoperability is still a challenging issue.  

[2003.08276] Supporting Interoperability Between Open-Source Search Engines with the Common Index File Format (arxiv.org)

BM25

The BM25 ranking model was developed in the early 1990s and has become the de facto ranking model for most text search applications. However, there are many variants of BM25 and there is ongoing research to understand the opportunities and challenges of these variants.

Improvements to BM25 and Language Models Examined (otago.ac.nz)

Which BM25 Do You Mean? A Large-Scale Reproducibility Study of Scoring Variants (ru.nl)
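
For readers who have not seen BM25 written out, the sketch below shows one common formulation in Python. It is a hedged illustration rather than the definition used in either paper: the parameter defaults (k1 = 1.2, b = 0.75), the whitespace tokenizer and the log(1 + ...) form of the IDF term are assumptions, and they are exactly the kind of detail in which the variants differ.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document for the given query string."""
    if not docs:
        return []
    # Assumption: naive lowercase whitespace tokenization.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Non-negative IDF form; other variants use a different expression here.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term-frequency saturation (k1) with document-length normalization (b).
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "trial protocol for oncology",
    "annual report",
    "oncology trial trial results report",
]
print(bm25_scores("oncology trial", docs))
```

The k1 parameter controls how quickly repeated occurrences of a term stop adding to the score, and b controls how strongly long documents are penalised. Both matter for enterprise content, where document lengths vary enormously.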

Query understanding

Although much of the focus of enterprise search is on standalone applications, many (indeed most) enterprise applications have good search functionality. The paper below is an account of the development by Salesforce of the search capabilities in its CRM application. There are many wider lessons to be learned from the team’s experience with search optimization.

[2012.06238] Query Understanding for Natural Language Enterprise Search (arxiv.org)

Over the last few years the concept of ‘professional search’ has come to the fore. Clinicians, lawyers and patent agents are just a few of the professions that need to create complex queries, often using Boolean strings.

City Research Online – Towards Explainability in Professional Search

Microsoft Research

The level of detail in the Salesforce paper is very unusual. Most vendors talk about AI, NLP and ML in very generic terms. However, this year Microsoft Research published two papers that went beneath the surface of SharePoint. The first of these is an analysis of millions of search sessions originating from within Microsoft Office applications, collected over one month of activity, in an effort to characterize search behavior in productivity software.

Characterizing Search Behavior in Productivity Software (microsoft.com)

In another large-scale study Microsoft Research analysed a number of factors that are associated with whether users will recognize or open recommended documents. These include display position, file type, authorship, recency of last access and, most importantly, the recommendation explanations.

Understanding User Behavior For Document Recommendation – Microsoft Research

Machine learning

The Salesforce paper referred to above touches on the issues around deep learning. Sinequa reports on the use of BERT with the longer document formats that are common in enterprise search, and in doing so makes the point that web search and enterprise search invariably require very different solutions. This paper will give you a good sense of what is involved in taking a powerful but basic framework and transforming it into a workable solution for a specific piece of search software.

Classifying long textual documents (up to 25 000 tokens) using BERT | by Sinequa | Dec, 2020 | Medium

NLP techniques that work well with short, curated web content can be difficult to extend into the enterprise search space. This paper discusses the challenges in detail and outlines approaches ranging from categorical (bag-of-words) representations to word-embedding representations in the latent space.

[2012.01941] On Extending NLP Techniques from the Categorical to the Latent Space: KL Divergence, Zipf’s Law, and Similarity Search (arxiv.org)

When Google itself raises questions about the value of machine learning, it is probably time to take a deep breath and read through this paper, co-authored by a substantial team of Google developers.

[2011.03395] Underspecification Presents Challenges for Credibility in Modern Machine Learning (arxiv.org)

Perceptual speed

So we have come to the point of scanning the list of results. This can be much more challenging than it might seem as the results may have different snippet formats and may present metadata in different ways. This is where the concept of perceptual speed comes into play.

Predicting perceptual speed from search behaviour — University of Strathclyde

Auto summarization

One of the features of enterprise content is that the documents can be long and complex, making it very difficult to make an immediate judgment on their value. This is where auto-summarisation can make a substantial difference.

[2012.07619] What Makes a Good Summary? Reconsidering the Focus of Automatic Summarization (arxiv.org)

The true measure of relevance

One of the most disturbing research papers of 2020 suggests that even finding relevant papers does not necessarily lead to better decisions.

Do better search engines really equate to better clinical decisions? If not, why not? – Vegt – Journal of the Association for Information Science and Technology – Wiley Online Library

The authors found that the ability to interpret documents correctly was a much more important factor impacting task success. Despite the aid of the search engine, half of the clinical questions that were used as test cases were answered incorrectly. The authors commented that if their findings are representative, information retrieval research may need to reorient its emphasis towards helping users to better understand information, rather than just finding it for them. [This paper is not open access]

The aim of this post is to surface three important considerations. The first is that a great deal of development work is going into optimizing search across complex documents, and this will continue. As a result, the performance of enterprise search software (ignoring for the moment what ‘performance’ means!) is going to improve, but achieving this will take more than the insertion of AI routines.

This brings me to the second consideration. Enterprise search applications are modular. This gives application vendors immense flexibility to build a technology stack, but the stack has to work in a totally integrated fashion. It also has to be capable of post-implementation creativity to meet requirements which may not have been on the original specification a year earlier, especially when that year is 2020!

The third is that enterprise search, as with all search applications, is not a solved problem, even though the origins of enterprise search date back to the 1970s. The question that needs to be asked when considering the replacement of a current search application is whether the technology offered by a vendor will meet future user requirements encompassing a very wide range of search intents.

Martin White