Google recently published a research paper on a new algorithm called SMITH. The paper reports that SMITH outperforms BERT at understanding long queries and long documents. Where BERT understands words in the context of sentences, SMITH understands passages in the context of whole documents, which lets it interpret long documents more accurately.
What is the SMITH Algorithm?
The SMITH algorithm is a new model designed to understand an entire document. BERT is better suited to understanding words within the context of sentences. BERT is trained to predict randomly hidden words from the context of the surrounding sentence, whereas SMITH is trained to predict what the next block of sentences is. As a result, SMITH handles longer documents better than BERT does.
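To make the idea of a "block of sentences" concrete, here is a minimal Python sketch of how a document could be cut into blocks and one block hidden as a prediction target. It is only an illustration: the function names `split_into_blocks` and `hide_block`, the `[BLOCK_MASK]` placeholder, the two-sentence block size, and the naive sentence splitter are all assumptions made for this example, not details taken from the SMITH paper.

```python
import re


def split_into_blocks(text, sentences_per_block=2):
    """Group the sentences of a document into fixed-size blocks.

    The splitter is deliberately naive: it breaks after '.', '!' or '?'
    when followed by whitespace. The block size is an assumed value.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        " ".join(sentences[i:i + sentences_per_block])
        for i in range(0, len(sentences), sentences_per_block)
    ]


def hide_block(blocks, index, placeholder="[BLOCK_MASK]"):
    """Return a copy of the blocks with one block hidden, plus the hidden text.

    The copy is what a model would see; the hidden text is what it
    would be asked to predict. The placeholder name is hypothetical.
    """
    context = list(blocks)
    target = context[index]
    context[index] = placeholder
    return context, target


document = (
    "SMITH reads long documents. It groups sentences into blocks. "
    "Each block is encoded separately. The block encodings are then combined. "
    "The result represents the whole document."
)

blocks = split_into_blocks(document)
context, target = hide_block(blocks, 1)

print(context)
print("Target to predict:", target)
```

The point of the sketch is the shape of the training example: the unit being hidden is a whole block of sentences rather than a single word, so the model has to rely on the rest of the document to recover it.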
BERT Algorithm has some limitations
First, the algorithm undergoes pre-training on a data set. In typical pre-training, engineers mask random words within sentences and the algorithm tries to predict the masked words. As the algorithm learns, it becomes optimized to make fewer mistakes on the training data. Then, the relations between sentence blocks in a document are used to understand what the document is about. In testing, the researchers found that SMITH performs better on longer text documents.
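The word-masking step described above can be sketched in a few lines of Python. This is a simplified illustration, not the actual pre-training code: the function name `mask_words`, the `[MASK]` token string, the fixed random seed, and the 15% default rate (the figure commonly cited for BERT) are assumptions chosen for the example, and real systems work on sub-word tokens rather than whole words.

```python
import random


def mask_words(words, mask_rate=0.15, seed=0, mask_token="[MASK]"):
    """Hide a random subset of words and remember what was hidden.

    Returns the masked word list (the model's input) and a dict that
    maps each masked position to the original word (the labels the
    model is trained to predict).
    """
    rng = random.Random(seed)  # seeded so the example is repeatable
    masked = list(words)
    labels = {}
    # Always mask at least one word so short inputs still give a training signal.
    n_to_mask = max(1, round(len(words) * mask_rate))
    for position in rng.sample(range(len(words)), n_to_mask):
        labels[position] = words[position]
        masked[position] = mask_token
    return masked, labels


sentence = "the model learns to predict hidden words from their context".split()
masked, labels = mask_words(sentence)

print(" ".join(masked))
print(labels)
```

During training, the model's guesses for the masked positions are compared with the stored labels, and its parameters are adjusted so that it makes fewer mistakes over time.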