Tutorial 6: Complex Components (Embedding Based Search Engine)

In this tutorial, we’ll show you how you can build a simple search engine over a dataset, using the CLIP model to drive the search. Users will be able to type in a query to search over images, and will see the dataset images ranked by their similarity to the query.

To get started, run the tutorial demo script.

mk demo match

You should see the tutorial app when you open the link in your browser. Let’s break down the code in the demo script.

Installing dependencies

This tutorial has additional dependencies that you need to install. Run the following command to install them.

pip install ftfy regex git+https://github.com/openai/CLIP.git

Once you run the script, it will download the CLIP model and cache it in your home directory. This will take a few minutes.

Loading in the dataset

The first few lines just load in the imagenette dataset, a small 10-class subset of ImageNet.

IMAGE_COLUMN = "img"
df = mk.get("imagenette", version="160px")
EMBED_COLUMN = "img_clip"

# Download the precomupted CLIP embeddings for imagenette.
df_clip = mk.DataFrame.read(
    "https://huggingface.co/datasets/meerkat-ml/meerkat-dataframes/resolve/main/embeddings/imagenette_160px.mk.tar.gz",
    overwrite=False,
)
Extracting tar archive, this may take a few minutes...
df = df.merge(df_clip[["img_id", "img_clip"]], on="img_id")
img_id path noisy_labels_0 noisy_labels_1 noisy_labels_5 noisy_labels_25 noisy_labels_50 is_valid label_id label label_idx split img_path index img img_clip
0 n02979186_9036 train/n02979186/n02979186_9036.JPEG n02979186 n02979186 n02979186 n02979186 n02979186 False n02979186 cassette player 482 train train/n02979186/n02979186_9036.JPEG 0 np.ndarray(shape=(512,))
1 n02979186_11957 train/n02979186/n02979186_11957.JPEG n02979186 n02979186 n02979186 n02979186 n03000684 False n02979186 cassette player 482 train train/n02979186/n02979186_11957.JPEG 1 np.ndarray(shape=(512,))
2 n02979186_9715 train/n02979186/n02979186_9715.JPEG n02979186 n02979186 n02979186 n03417042 n03000684 False n02979186 cassette player 482 train train/n02979186/n02979186_9715.JPEG 2 np.ndarray(shape=(512,))
3 n02979186_21736 train/n02979186/n02979186_21736.JPEG n02979186 n02979186 n02979186 n02979186 n03417042 False n02979186 cassette player 482 train train/n02979186/n02979186_21736.JPEG 3 np.ndarray(shape=(512,))
4 ILSVRC2012_val_00046953 train/n02979186/ILSVRC2012_val_00046953.JPEG n02979186 n02979186 n02979186 n02979186 n03394916 False n02979186 cassette player 482 train train/n02979186/ILSVRC2012_val_00046953.JPEG 4 np.ndarray(shape=(512,))

Creating the Match component

Now that we have a data frame with the dataset and CLIP image embeddings, we want to be able to run a search over the dataset. To do this, we’ll use the Match component, which allows us to match a query to the dataset.

match = mk.gui.Match(df, against=EMBED_COLUMN)

Here, against is the column in the data frame that we want to match against. In this case, we want to match against the CLIP embeddings.

Match is a complex component that does a few things:

  • as a component, it visualizes a search bar where users can type in a query

  • it provides an endpoint attribute, on_match, which can be used to run an endpoint when the user types in a query

  • by default, Match assigns on_match to an endpoint that adds a column to the data frame with the match scores

Getting the match criteria

When a match query is run, it generates a new column in the data frame with the match scores. The name of this column is stored in the criterion attribute of the Match component. We can use this to get the name of the match criterion.

Let’s use a magic context manager to get the name of the match criterion. Using magic makes the match.criterion Store property accessors reactive, so accessing its name attribute will cause this property access to re-run when the criterion is updated.

# Get the name of the match criterion in a reactive way.
with mk.magic():
    criterion_name = match.criterion.name

Now, when a new match query is run, criterion_name will be updated to the name of the new match criterion.

Sorting by match scores

Now that we have the name of the match criterion, we’ll use it to sort the data frame by the match scores. We’ll use the mk.sort function to do this.

We also want sorting to be reactive, so that when a new match query is run, the data frame is sorted by the new match scores. Fortunately, mk.sort is a reactive function.

However, before we run sort, we need to mark the data frame df. Any reactive functions that take df as an argument when it is marked will re-run when it is updated.

df.mark()
df_sorted = mk.sort(data=df, by=criterion_name, ascending=False)

With this, we now ensure that when df is updated with the new match scores from the user’s query, df_sorted will be re-sorted by the column with the new match scores.

Putting it all together

With all the pieces, let’s put them together into a Page and launch the app.

page = mk.gui.Page(
    component=mk.gui.html.flexcol([match, gallery]),
    id="match",
)
page.launch()