Search has long been a tricky problem domain for developers. We want users to input good data without scrolling through thousands of options, so we need search. But how do we accomplish that search within a specific data set?
Basic tools such as SQL and similarity scoring, as well as natural language processing, can each search data effectively, depending on the user's context and intent.
SQL and Similarity Scoring
In the past, we had some limited tools for this. We could run a SQL query with the LIKE operator. Or, to get fancy, we'd make that query case-insensitive.
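As a minimal sketch of that idea, here's a case-insensitive LIKE query against an in-memory SQLite table (the table and column names are just illustrations):

```python
import sqlite3

# A throwaway in-memory database with a hypothetical companies table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (name TEXT)")
conn.executemany(
    "INSERT INTO companies (name) VALUES (?)",
    [("Acme Corp",), ("acme corporation",), ("Globex",)],
)

term = "acme"
# LIKE with wildcards finds substring matches; lowercasing both sides
# makes the comparison explicitly case-insensitive.
rows = conn.execute(
    "SELECT name FROM companies WHERE LOWER(name) LIKE LOWER(?)",
    (f"%{term}%",),
).fetchall()
print(rows)  # [('Acme Corp',), ('acme corporation',)]
```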
But what if the user makes a singular word plural or writes one word as two words? Then we can lean on similarity-scoring tools like Levenshtein distance and trigrams.
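Both ideas can be sketched with nothing but the standard library. In practice you'd more likely reach for a library such as rapidfuzz or a database extension like pg_trgm, but the mechanics look roughly like this:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two strings' three-character shingles."""
    def grams(s: str) -> set:
        s = s.lower()
        return {s[i:i + 3] for i in range(len(s) - 2)} or {s}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb)


print(levenshtein("color", "colours"))            # 2
print(trigram_similarity("web site", "website"))  # nonzero despite the split word
```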
Potential Problems
For some data, that's all you need. But say you're searching through a list of company names, and you want to find the company closest to the current input.
We don't want the search to get too creative in that case. We wouldn't want "Scottish restaurant" to return "McDonald's," and we may not be looking for "FedEx" when searching for "UPS."
For other data domains, we want some kind of cross-referencing of terms or phrases. If I'm looking for a physician, I might search for “doctor.” That's not going to be within any useful trigram or Levenshtein distance.
These are the searches where our clients ask for a Google-like experience. So, we turn to natural language processing (NLP) and embeddings.
Natural Language Processing
With NLP in search, we take a trained machine learning model that already "knows" which terms are similar and use it to compute a numeric vector, called an embedding, for each item in the data set we want to search.
When we receive user input, we can compute an embedding for that input and compare its similarity to other embeddings to get the top results.
Hugging Face has many open-source base models for this purpose. Hardware requirements depend on the model, as some are more computationally intensive than others.
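As a rough sketch, here's how that flow looks with the sentence-transformers library and the all-MiniLM-L6-v2 model from Hugging Face; the specific library and model are illustrative choices, and any similar bi-encoder would work the same way:

```python
from sentence_transformers import SentenceTransformer, util

# all-MiniLM-L6-v2 is a small, general-purpose example model.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Embed the searchable terms once, up front.
terms = ["physician", "nurse practitioner", "dentist", "physical therapist"]
term_embeddings = model.encode(terms, convert_to_tensor=True)

# Embed the user's input and rank terms by cosine similarity.
query_embedding = model.encode("doctor", convert_to_tensor=True)
scores = util.cos_sim(query_embedding, term_embeddings)[0]
ranked = sorted(zip(terms, scores.tolist()), key=lambda pair: pair[1], reverse=True)
print(ranked[:3])  # "physician" should land near the top
```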
The most intensive piece of the process is generating the embeddings for the search set. If the search set is fairly static and won't get updated very often, it's best to compute the embeddings once and store them in a file.
While there are several options for file storage, I have favored storing a pandas dataframe in a Parquet file using PyArrow. This keeps the list of strings for which embeddings were computed in the same file as the embeddings.
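A sketch of that approach, with illustrative file and column names:

```python
import pandas as pd
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
terms = ["physician", "nurse practitioner", "dentist"]

# Keep the original strings and their embeddings side by side.
df = pd.DataFrame({
    "term": terms,
    "embedding": model.encode(terms).tolist(),  # one list of floats per row
})
df.to_parquet("term_embeddings.parquet", engine="pyarrow")

# Later, load the precomputed embeddings instead of recomputing them.
df = pd.read_parquet("term_embeddings.parquet", engine="pyarrow")
```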
Of course, as the data set grows, we need to consider alternative solutions like vector databases. But for searching a limited set of terms, the Parquet file will suffice.
Choosing the Right Search Approach
Search is a deceptively complex problem, and the right approach depends entirely on the data and user expectations.
Simple SQL queries and similarity scoring work well for structured data with predictable variations, while NLP-powered embeddings are best when users expect a more intuitive, context-aware search experience.
Understanding these trade-offs is key to implementing an effective search solution. Whether you're fine-tuning SQL queries or leveraging machine learning models, the goal is the same: helping users find what they need.