{ "cells": [ { "cell_type": "markdown", "id": "fc2f293d", "metadata": {}, "source": [ "# Tutorial 6: Complex Components (Embedding Based Search Engine)\n", "\n", "In this tutorial, we'll build a simple search engine over an image dataset, using the CLIP model to drive the search. Users type in a query to search over the images and see the dataset images ranked by their similarity to the query.\n", "\n", "\n", "To get started, run the tutorial demo script.\n", "\n", "```{code-block} bash\n", "mk demo match\n", "```\n", "\n", "You should see the tutorial app when you open the link in your browser. Let's break down the code in the demo script.\n", "\n", "## Installing dependencies\n", "This tutorial has additional dependencies that you need to install. Run the following command to install them.\n", "\n", "```{code-block} bash\n", "pip install ftfy regex git+https://github.com/openai/CLIP.git\n", "```\n", "\n", "When you run the script, it downloads the CLIP model and caches it in your home directory. This takes a few minutes.\n", "\n", "## Loading in the dataset" ] }, { "cell_type": "code", "execution_count": 1, "id": "4f995290", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "import meerkat as mk\n", "import rich" ] }, { "cell_type": "markdown", "id": "ac8b9407", "metadata": {}, "source": [ "The first few lines just load in the `imagenette` dataset, a small 10-class subset of ImageNet." 
] }, { "cell_type": "code", "execution_count": 2, "id": "f203f5c4", "metadata": {}, "outputs": [], "source": [ "IMAGE_COLUMN = \"img\"\n", "# Assumed name of the CLIP embedding column, taken from the `img_clip` column in the `df.head()` output.\n", "EMBED_COLUMN = \"img_clip\"\n", "df = mk.get(\"imagenette\", version=\"160px\")" ] }, { "cell_type": "code", "execution_count": 3, "id": "0e1d21de", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "04212565526b4600b88f7bc476370e21", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading: 0%| | 0.00/114M [00:00" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "<table>\n", " <thead>\n", " <tr><th></th><th>img_id</th><th>path</th><th>noisy_labels_0</th><th>noisy_labels_1</th><th>noisy_labels_5</th><th>noisy_labels_25</th><th>noisy_labels_50</th><th>is_valid</th><th>label_id</th><th>label</th><th>label_idx</th><th>split</th><th>img_path</th><th>index</th><th>img</th><th>img_clip</th></tr>\n", " </thead>\n", " <tbody>\n", " <tr><th>0</th><td>n02979186_9036</td><td>train/n02979186/n02979186_9036.JPEG</td><td>n02979186</td><td>n02979186</td><td>n02979186</td><td>n02979186</td><td>n02979186</td><td>False</td><td>n02979186</td><td>cassette player</td><td>482</td><td>train</td><td>train/n02979186/n02979186_9036.JPEG</td><td>0</td><td></td><td>np.ndarray(shape=(512,))</td></tr>\n", " <tr><th>1</th><td>n02979186_11957</td><td>train/n02979186/n02979186_11957.JPEG</td><td>n02979186</td><td>n02979186</td><td>n02979186</td><td>n02979186</td><td>n03000684</td><td>False</td><td>n02979186</td><td>cassette player</td><td>482</td><td>train</td><td>train/n02979186/n02979186_11957.JPEG</td><td>1</td><td></td><td>np.ndarray(shape=(512,))</td></tr>\n", " <tr><th>2</th><td>n02979186_9715</td><td>train/n02979186/n02979186_9715.JPEG</td><td>n02979186</td><td>n02979186</td><td>n02979186</td><td>n03417042</td><td>n03000684</td><td>False</td><td>n02979186</td><td>cassette player</td><td>482</td><td>train</td><td>train/n02979186/n02979186_9715.JPEG</td><td>2</td><td></td><td>np.ndarray(shape=(512,))</td></tr>\n", " <tr><th>3</th><td>n02979186_21736</td><td>train/n02979186/n02979186_21736.JPEG</td><td>n02979186</td><td>n02979186</td><td>n02979186</td><td>n02979186</td><td>n03417042</td><td>False</td><td>n02979186</td><td>cassette player</td><td>482</td><td>train</td><td>train/n02979186/n02979186_21736.JPEG</td><td>3</td><td></td><td>np.ndarray(shape=(512,))</td></tr>\n", " <tr><th>4</th><td>ILSVRC2012_val_00046953</td><td>train/n02979186/ILSVRC2012_val_00046953.JPEG</td><td>n02979186</td><td>n02979186</td><td>n02979186</td><td>n02979186</td><td>n03394916</td><td>False</td><td>n02979186</td><td>cassette player</td><td>482</td><td>train</td><td>train/n02979186/ILSVRC2012_val_00046953.JPEG</td><td>4</td><td></td><td>np.ndarray(shape=(512,))</td></tr>\n", " </tbody>\n", "</table>" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "id": "08ec50fc", "metadata": {}, "source": [ "## Creating the `Match` component\n", "Now that we have a data frame with the dataset and CLIP image embeddings, we want to run a search over it. To do this, we'll use the `Match` component, which matches a query against the dataset." ] }, { "cell_type": "code", "execution_count": 6, "id": "a83de5f2", "metadata": {}, "outputs": [], "source": [ "match = mk.gui.Match(df, against=EMBED_COLUMN)" ] }, { "cell_type": "markdown", "id": "cfe17222", "metadata": {}, "source": [ "Here, `against` is the column in the data frame that we want to match against. In this case, that is the column of CLIP embeddings.\n", "\n", "`Match` is a complex component that does a few things:\n", "- as a component, it renders a search bar where users can type in a query\n", "- it provides an endpoint attribute, `on_match`, which can be used to run an endpoint when the user types in a query\n", "- by default, `Match` assigns `on_match` to an endpoint that adds a column to the data frame with the match scores\n", "\n", "## Getting the match criterion\n", "\n", "When a match query is run, it generates a new column in the data frame with the match scores. The name of this column is stored in the `criterion` attribute of the `Match` component. 
We can use this to get the name of the match criterion.\n", "\n", "Let's use a `magic` context manager to get the name of the match criterion. Inside `magic`, attribute access on the `match.criterion` Store is reactive, so the `name` lookup re-runs whenever the criterion is updated." ] }, { "cell_type": "code", "execution_count": 7, "id": "09126ded", "metadata": {}, "outputs": [], "source": [ "# Get the name of the match criterion in a reactive way.\n", "with mk.magic():\n", "    criterion_name = match.criterion.name" ] }, { "cell_type": "markdown", "id": "cf2dfe14", "metadata": {}, "source": [ "Now, when a new match query is run, `criterion_name` will be updated to the name of the new match criterion.\n", "\n", "## Sorting by match scores\n", "Now that we have the name of the match criterion, we'll use it to sort the data frame by the match scores with the `mk.sort` function.\n", "\n", "We also want sorting to be reactive, so that when a new match query is run, the data frame is re-sorted by the new match scores. Fortunately, `mk.sort` is a reactive function.\n", "\n", "However, before we run `sort`, we need to _mark_ the data frame `df`. Any reactive function that takes a marked `df` as an argument will re-run when `df` is updated." ] }, { "cell_type": "code", "execution_count": 8, "id": "7a943174", "metadata": {}, "outputs": [], "source": [ "df.mark()\n", "df_sorted = mk.sort(data=df, by=criterion_name, ascending=False)" ] }, { "cell_type": "markdown", "id": "12cd33e1", "metadata": {}, "source": [ "This ensures that when `df` is updated with the new match scores from the user's query, `df_sorted` is re-sorted by the column with the new match scores.\n", "\n", "## Visualizing results in a `Gallery`\n", "Now that we have a sorted data frame, we can visualize the results. Let's use a `Gallery` component to visualize the data frame, and show the images by default." 
] }, { "cell_type": "code", "execution_count": 9, "id": "180f1bad", "metadata": {}, "outputs": [], "source": [ "gallery = mk.gui.Gallery(df_sorted, main_column=IMAGE_COLUMN)" ] }, { "cell_type": "markdown", "id": "b055f509", "metadata": {}, "source": [ "## Putting it all together\n", "With all the pieces in place, let's put them together into a `Page` and launch the app.\n", "\n", "```python\n", "page = mk.gui.Page(\n", "    component=mk.gui.html.flexcol([match, gallery]),\n", "    id=\"match\",\n", ")\n", "page.launch()\n", "```" ] } ], "metadata": { "file_format": "mystnb", "kernelspec": { "display_name": "python3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.17" }, "source_map": [ 5, 31, 35, 39, 44, 54, 59, 62, 68, 70, 83, 87, 97, 100, 107, 109 ], "widgets": { "application/vnd.jupyter.widget-state+json": { "state": { "04212565526b4600b88f7bc476370e21": { "model_module": "@jupyter-widgets/controls", "model_module_version": "2.0.0", "model_name": "HBoxModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "2.0.0", "_model_name": "HBoxModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "2.0.0", "_view_name": "HBoxView", "box_style": "", "children": [ "IPY_MODEL_9c092cbd212144deb8abd134d1d1405a", "IPY_MODEL_fa43a8814af14dceb86c1592da15304a", "IPY_MODEL_d619795f045c4824a5f70f88dc53b929" ], "layout": "IPY_MODEL_d07079b21c144b48b61c7d695b8c1509", "tabbable": null, "tooltip": null } }, "73a37b6406ff4a6aaac15c4fc0ff3b63": { "model_module": "@jupyter-widgets/controls", "model_module_version": "2.0.0", "model_name": "ProgressStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "2.0.0", "_model_name": "ProgressStyleModel", "_view_count": null, 
"_view_module": "@jupyter-widgets/base", "_view_module_version": "2.0.0", "_view_name": "StyleView", "bar_color": null, "description_width": "" } }, "7b93c2a9a9c748d58be593b480f8b7f6": { "model_module": "@jupyter-widgets/controls", "model_module_version": "2.0.0", "model_name": "HTMLStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "2.0.0", "_model_name": "HTMLStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "2.0.0", "_view_name": "StyleView", "background": null, "description_width": "", "font_size": null, "text_color": null } }, "7fcb9b1480354d0eb71844aed805fe8d": { "model_module": "@jupyter-widgets/base", "model_module_version": "2.0.0", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "2.0.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "2.0.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border_bottom": null, "border_left": null, "border_right": null, "border_top": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "9b4c6cb42cb34435ba867e8eba948aa2": { "model_module": "@jupyter-widgets/base", "model_module_version": "2.0.0", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", 
"_model_module_version": "2.0.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "2.0.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border_bottom": null, "border_left": null, "border_right": null, "border_top": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "9c092cbd212144deb8abd134d1d1405a": { "model_module": "@jupyter-widgets/controls", "model_module_version": "2.0.0", "model_name": "HTMLModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "2.0.0", "_model_name": "HTMLModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "2.0.0", "_view_name": "HTMLView", "description": "", "description_allow_html": false, "layout": "IPY_MODEL_7fcb9b1480354d0eb71844aed805fe8d", "placeholder": "​", "style": "IPY_MODEL_7b93c2a9a9c748d58be593b480f8b7f6", "tabbable": null, "tooltip": null, "value": "Downloading: 100%" } }, "a0a756587d1044a785f5801a4b9a34ec": { "model_module": "@jupyter-widgets/controls", "model_module_version": "2.0.0", "model_name": "HTMLStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "2.0.0", "_model_name": "HTMLStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": 
"2.0.0", "_view_name": "StyleView", "background": null, "description_width": "", "font_size": null, "text_color": null } }, "c7c90c1570cc4268963f4ddb1fad9234": { "model_module": "@jupyter-widgets/base", "model_module_version": "2.0.0", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "2.0.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "2.0.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border_bottom": null, "border_left": null, "border_right": null, "border_top": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "d07079b21c144b48b61c7d695b8c1509": { "model_module": "@jupyter-widgets/base", "model_module_version": "2.0.0", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "2.0.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "2.0.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border_bottom": null, "border_left": null, "border_right": null, "border_top": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": 
null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "d619795f045c4824a5f70f88dc53b929": { "model_module": "@jupyter-widgets/controls", "model_module_version": "2.0.0", "model_name": "HTMLModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "2.0.0", "_model_name": "HTMLModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "2.0.0", "_view_name": "HTMLView", "description": "", "description_allow_html": false, "layout": "IPY_MODEL_c7c90c1570cc4268963f4ddb1fad9234", "placeholder": "​", "style": "IPY_MODEL_a0a756587d1044a785f5801a4b9a34ec", "tabbable": null, "tooltip": null, "value": " 114M/114M [00:01<00:00, 67.1MB/s]" } }, "fa43a8814af14dceb86c1592da15304a": { "model_module": "@jupyter-widgets/controls", "model_module_version": "2.0.0", "model_name": "FloatProgressModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "2.0.0", "_model_name": "FloatProgressModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "2.0.0", "_view_name": "ProgressView", "bar_style": "success", "description": "", "description_allow_html": false, "layout": "IPY_MODEL_9b4c6cb42cb34435ba867e8eba948aa2", "max": 114352332.0, "min": 0.0, "orientation": "horizontal", "style": "IPY_MODEL_73a37b6406ff4a6aaac15c4fc0ff3b63", "tabbable": null, "tooltip": null, "value": 114352332.0 } } }, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }