Concord — A Digital Concordance

Christopher Lee
Dec 23, 2021

This writeup was written on 10/23/20 and describes a project called Concord that I built while I was at Lambda School. I published it on Medium on 12/23/21 to serve as an external reference that offers a more detailed explanation of Concord than the short summary I provide on my portfolio.

Here’s what Concord looks like. I hacked the GUI together using Python. Looks stunning, right?

Currently, Concord can take in PDFs and text files and extract the text from them. I’ve preloaded some texts and PDF files here to demonstrate some examples, but you can upload any arbitrary PDF or text file, and any arbitrary number of files, to run a query against.
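The ingestion step boils down to pulling the raw text out of each uploaded file so it can be queried. Here’s a minimal sketch of that idea; PyPDF2 is used purely for illustration here, not as a statement of Concord’s exact dependencies:

```python
from pathlib import Path
from PyPDF2 import PdfReader

def extract_text(path):
    """Return the full text of a .pdf or .txt file as one string."""
    if path.lower().endswith(".pdf"):
        reader = PdfReader(path)
        # Extraction quality varies a lot with the PDF's layout.
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    return Path(path).read_text(encoding="utf-8", errors="ignore")
```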

I thought it’d be interesting to compare the Holy Bible (KJV) and the Satanic Bible. Suppose I wanted to compare how the Bible and the Satanic Bible deal with the concept of “evil”. I simply select both texts from the listbox and enter the query “evil”.

When I click the “Find” button, it opens up a new window displaying the relevant results. If multiple texts were selected, each passage is color-coded according to which text it came from.
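Under the hood, this naive search just matches the literal query term in each passage and remembers which text the passage came from. A rough sketch of the idea (the function name and placeholder passages below are illustrative, not Concord’s actual code):

```python
import re

def find_passages(query, texts):
    """Return (source, passage) pairs where the passage contains the literal query term."""
    pattern = re.compile(rf"\b{re.escape(query)}\b", re.IGNORECASE)
    return [(source, passage)
            for source, passages in texts.items()
            for passage in passages
            if pattern.search(passage)]

# Placeholder passages standing in for the extracted texts. Keeping the source
# name with each hit is what lets the results window color-code passages per text.
texts = {
    "KJV Bible": ["a placeholder verse about evil", "another placeholder verse"],
    "Satanic Bible": ["a placeholder passage about evil"],
}
print(find_passages("evil", texts))
```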

One of the main purposes of Concord is to facilitate the collision of ideas between disparate sources of text in order to unearth some sort of interesting insight or connection. So I can click on the “Shuffle” button and it’ll randomly shuffle the results, juxtaposing the passages in different ways and laying them out for visual comparison. I can shuffle the results as many times as I want, each time producing a novel arrangement of results.
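The shuffle itself can be as simple as a random in-place permutation of the result list, something along these lines (placeholder results shown):

```python
import random

# Placeholder (source, passage) results standing in for real query hits.
results = [("KJV Bible", "first placeholder passage"),
           ("Satanic Bible", "second placeholder passage"),
           ("KJV Bible", "third placeholder passage")]

random.shuffle(results)            # a new random juxtaposition each time
for source, passage in results:
    print(f"{source}: {passage}")
```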

But this is a naive type of search. We were lucky that “evil” was contained in both texts. What if our query wasn’t explicitly contained in the text? For instance, suppose I was curious about passages from the Bible that had to do with the infamous snake that tempted Eve. I’d select the Bible from the listbox and query “snake”.

I’d get the results below.

No results are found because the King James Version uses the term “serpent” instead of “snake”. In this case, the program failed to fulfill my true query intent, which wasn’t specifically “snakes” per se, but rather “fetch me all passages that have ‘snake’ in them and if other terms are used in place of ‘snake’, fetch me those passages instead.” I don’t want to have to run down the list of all synonyms of snake until I land a hit. So I decided to set this particular example as my benchmark test. If Concord successfully returns passages with “serpent” in them, then it will have satisfied the actual intent of my query and thus passed the benchmark.

I tried implementing a Natural Language Processing technique called Latent Semantic Analysis (LSA), which is used for comparing similarities between documents by inferring semantic relationships between terms. What became immediately apparent was that LSA wasn’t going to cut it. It failed to pick up on the semantic relatedness of words outside the model’s vocabulary. Essentially, the only words the model thinks exist in the universe are those contained in the documents we fed it. Because “snake” didn’t appear anywhere in the Bible, LSA didn’t have any context or basis to form a semantic understanding of “snake”. Had “snake” occurred at least once in the Bible, perhaps LSA would’ve picked up on its contextual usage and linked it to “serpent.” So LSA is useful for certain jobs, like comparing similarities between documents, but it’s poor at fulfilling the sort of semantic search I actually want. As a result, I had to abandon LSA and adopt a different method.
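To make the limitation concrete, here’s a rough sketch of the LSA idea using scikit-learn (illustrative only, not Concord’s exact pipeline). Because the vectorizer’s vocabulary comes entirely from the uploaded documents, an out-of-vocabulary query like “snake” collapses to an empty vector:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

passages = [
    "Now the serpent was more subtil than any beast of the field",
    "And the serpent said unto the woman, Ye shall not surely die",
    "In the beginning God created the heaven and the earth",
    "And God saw every thing that he had made",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(passages)      # term-passage matrix built only from these passages
lsa = TruncatedSVD(n_components=2).fit(tfidf)   # project into a latent semantic space

query_tfidf = vectorizer.transform(["snake"])   # "snake" never occurs in the passages...
print(query_tfidf.nnz)                          # ...so this prints 0 (an all-zero vector)
print(lsa.transform(query_tfidf))               # and its latent projection is all zeros too
```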

The alternative model I decided to implement is called Word2Vec, which does a much better job of capturing the semantic relationships between words. Because it comes pretrained on a large external corpus, it can understand words well beyond those found in the uploaded text. To use it in my program, I check the “word2vec” box to tell the program to use the word2vec model, and rerun the same “snake” query.
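Here’s a minimal sketch of the general idea behind the word2vec-powered search: embed the query and each passage (for example, by averaging word vectors) and rank passages by cosine similarity. The gensim Google News vectors and helper names below are illustrative choices, not Concord’s actual code:

```python
import numpy as np
import gensim.downloader as api

w2v = api.load("word2vec-google-news-300")     # large pretrained word2vec model (big download)

def embed(text):
    """Average the vectors of the in-vocabulary words in a piece of text."""
    vecs = [w2v[w] for w in text.lower().split() if w in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom else 0.0

passages = [
    "Now the serpent was more subtil than any beast of the field",
    "In the beginning God created the heaven and the earth",
]

# Rank passages by similarity to the query in embedding space.
ranked = sorted(passages, key=lambda p: cosine(embed("snake"), embed(p)), reverse=True)
print(ranked[0])   # the serpent verse should rank first even though "snake" never appears in it
```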

As you can see from the results below, it passes our benchmark test, retrieving sentences that contain “serpent”.

Not only that, but check out the first result at the top. The model retrieved a passage with the word “lizard” in it. It appears to pick up on something belonging to the same conceptual category as “snake”, which I can only assume must be “reptiles”. So it appears the model is flexible enough to identify similarities beyond just the synonyms of a word. Indeed, word2vec is a powerful way to retrieve conceptually related terms belonging to the same latent category as our query. This can be very useful for running queries not just on the word level but on the concept level, extending the powers and capabilities of Concord and perhaps surprising users by presenting relevant yet unexpected results they didn’t even know they wanted. This is the power of semantic search.

In Concord’s current version, using word2vec is slow and clunky because I was in a mad dash to finish the project before the deadline, so I used word2vec out of the box and didn’t do any optimizing or fine-tuning. But the important thing is that it works. It gets the job done! Additionally, Concord doesn’t handle PDFs very well. Depending on how many pages a PDF contains, it can take many seconds to read it in, extract the text, and process it for querying. The quality of text extraction isn’t great either, especially on PDFs with double-column pages. But these issues can be addressed with more work.

EDIT 12/30/21: I built a Streamlit app to continue exploring what semantic search is capable of. I felt it was better to focus on a specific text instead of accepting arbitrary documents and indexing them on the fly like in the Concord prototype I built for Lambda School. I decided to use the King James Bible, one of the most canonical texts in the Western world. I used a Huggingface model to produce vector embeddings for every verse in the KJV Bible, and a vector database called Pinecone to store and index those vector embeddings in order to perform similarity search.
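In outline, the indexing step looks something like the sketch below. The model name, index name, and keys are placeholders, and the exact Pinecone client calls vary by version:

```python
import pinecone
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")          # an example Hugging Face embedding model

pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")
index = pinecone.Index("kjv-verses")                     # assumes the index was already created

# Toy stand-in for the full verse dictionary; the real run over every KJV verse took ~9 minutes.
verses = {"Genesis 1:1": "In the beginning God created the heaven and the earth."}

vectors = [(ref, model.encode(text).tolist(), {"text": text}) for ref, text in verses.items()]
index.upsert(vectors=vectors)                            # the one-time indexing cost

# Thereafter, every query is a fast similarity search against the stored vectors.
results = index.query(vector=model.encode("the serpent that tempted Eve").tolist(),
                      top_k=5, include_metadata=True)
```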

This way of indexing information and getting it into a semantically searchable state has an initial time cost: encoding the text as vectors and upserting them into the vector database. Encoding every verse in the Bible and upserting them into Pinecone took around 9 minutes. But it’s a cost you pay once, and thereafter you can perform semantic search to your heart’s content across millions of vectors in milliseconds.

Some avenues of experimentation I’m exploring are trying out other models with different performance characteristics to handle the encoding of text into embeddings; a bigger model will improve the quality of results, but will be slower. I’d also like to try adding a re-ranking stage at the end of the pipeline before the final search results are presented to the user. After the top k results are found using a vanilla semantic search, the re-ranking stage does a second pass over those results and re-orders them according to their similarity to the query. This should significantly improve the relevance of the search results, because it applies an additional relevance filter on top of the vanilla semantic search, which is already finding the semantically closest passages to the query in the vector database. Of course, this extra step introduces some inference latency, so it’s a question of whether the quality boost justifies the latency cost. See here for more information on the re-ranking pipeline.
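For illustration, the re-ranking stage could look something like this sketch, using a cross-encoder from the sentence-transformers library to score each (query, passage) pair directly; the model name is an example, not a final choice:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # example re-ranking model

query = "the serpent that tempted Eve"
top_k_passages = [   # stand-ins for the top-k hits returned by the vanilla semantic search
    "Now the serpent was more subtil than any beast of the field",
    "In the beginning God created the heaven and the earth",
]

# The cross-encoder reads the query and each passage together, which is slower
# than a vector lookup but usually a sharper judge of relevance.
scores = reranker.predict([(query, p) for p in top_k_passages])
reranked = [p for _, p in sorted(zip(scores, top_k_passages), reverse=True)]
print(reranked)
```

Because the re-ranker only ever sees the top k candidates rather than the whole index, the added latency stays bounded by k.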
