{ "cells": [ { "cell_type": "markdown", "id": "a0e367b4", "metadata": {}, "source": [ "(quickstart-df)=\n", "\n", "# Quickstart: Data Frames\n", "\n", "This quickstart walks through the `Meerkat` data frame, which lets users work with unstructured data alongside standard tabular data." ] }, { "cell_type": "code", "execution_count": 1, "id": "3cab2328", "metadata": {}, "outputs": [], "source": [ "import os\n", "import meerkat as mk\n", "import numpy as np" ] }, { "cell_type": "markdown", "id": "980b84dc", "metadata": {}, "source": [ "## 💾 Downloading the data\n", "First, we'll download some data to explore. We're going to use the [Imagenette dataset](https://github.com/fastai/imagenette#image%E7%BD%91), a small subset of the original [ImageNet](https://www.image-net.org/update-mar-11-2021.php). This dataset is made up of 10 classes (e.g. \"garbage truck\", \"gas pump\", \"golf ball\").\n", "- Download time: < 1 minute\n", "- Download size: 130M\n", "\n", "In addition to downloading the data, `download_imagenette` prepares a CSV, `imagenette.csv`, with a row for each image." 
] }, { "cell_type": "code", "execution_count": 2, "id": "21f2f799", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "7e8b2044cad24e29bba79ae8544bbf84", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading: 0%| | 0.00/99.0M [00:00" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "       label            split  img_path                                      img\n", "0      cassette player  train  train/n02979186/n02979186_9036.JPEG\n", "1      cassette player  train  train/n02979186/n02979186_11957.JPEG\n", "2      cassette player  train  train/n02979186/n02979186_9715.JPEG\n", "3      cassette player  train  train/n02979186/n02979186_21736.JPEG\n", "4      cassette player  train  train/n02979186/ILSVRC2012_val_00046953.JPEG\n", "...    ...              ...    ...                                           ...\n", "13389  gas pump         valid  val/n03425413/n03425413_17521.JPEG\n", "13390  gas pump         valid  val/n03425413/n03425413_20711.JPEG\n", "13391  gas pump         valid  val/n03425413/n03425413_19050.JPEG\n", "13392  gas pump         valid  val/n03425413/n03425413_13831.JPEG\n", "13393  gas pump         valid  val/n03425413/n03425413_1242.JPEG" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Create a `DataFrame`\n", "df = mk.from_csv(\"./downloads/imagenette2-160/imagenette.csv\")\n", "\n", "# Create an `ImageColumn` and add it to the `DataFrame`\n", "df[\"img\"] = mk.image(\n", "    df[\"img_path\"], \n", "    base_dir=os.path.join(dataset_dir, \"imagenette2-160\")\n", ")\n", "df" ] }, { "cell_type": "markdown", "id": "75e36222", "metadata": {}, "source": [ "Displaying `df` shows the 
first few rows in the `DataFrame`. You can see that there are a few metadata columns, as well as the \"img\" column we added in.\n", "\n", "## 🗂 Selecting data\n", "*For more information see the user guide section on {ref}`guide/dataframe/selection`.*\n", "\n", "When we create an `ImageColumn`, we don't load the images into memory. Instead, the `ImageColumn` keeps track of the filepaths we passed in and loads each image only when it is needed. \n", "\n", "When we select a row of the `ImageColumn`, we get an instance of `FileCell` back. A `FileCell` is an object that holds everything we need to materialize the cell (e.g. the filepath to the image and the loading function), but stops just short of doing so." ] }, { "cell_type": "code", "execution_count": 5, "id": "49427d88", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Indexing the `ImageColumn` returns an object of type: .\n" ] } ], "source": [ "img_cell = df[\"img\"][0]\n", "print(f\"Indexing the `ImageColumn` returns an object of type: {type(img_cell)}.\")" ] }, { "cell_type": "markdown", "id": "fa5c4e99", "metadata": {}, "source": [ "To actually materialize the image, we simply call the cell." ] }, { "cell_type": "code", "execution_count": 6, "id": "27cef896", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "img = img_cell()\n", "img" ] }, { "cell_type": "markdown", "id": "f83e9ef4", "metadata": {}, "source": [ "We can subselect a **batch** of images by indexing with a slice. Notice that this returns a smaller column rather than a single cell." 
] }, { "cell_type": "code", "execution_count": 7, "id": "e9c2479e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Indexing a slice of the `ImageColumn` returns a: .\n" ] }, { "data": { "text/plain": [ "column([FileCell(fn=<...7f9ca8184b50>), FileCell(fn=<...7f9ca8184b50>), FileCell(fn=<...7f9ca8184b50>)], backend=FileColumn" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "imgs = df[\"img\"][1:4]\n", "print(f\"Indexing a slice of the `ImageColumn` returns a: {type(imgs)}.\")\n", "imgs" ] }, { "cell_type": "markdown", "id": "c31224e4", "metadata": {}, "source": [ "The whole batch of images can be loaded together by calling the column. \n", "```\n", "imgs();\n", "```\n", "\n", "One can load multiple rows using any one of the following indexing schemes:\n", "- **Slice indexing**: _e.g._ `column[4:10]`\n", "- **Integer array indexing**: _e.g._ `column[[0, 4, 6, 11]]`\n", "- **Boolean array indexing**: _e.g._ `column[np.array([True, False, False ..., True, False])]`\n", "\n", "### 📎 _Aside_: `ImageColumn` under the hood, `DeferredColumn`.\n", "\n", "If you check out the implementation of `ImageColumn` (at [meerkat/column/image_column.py](https://github.com/HazyResearch/meerkat/blob/main/meerkat/column/image_column.py)), you'll notice that it's a very simple subclass of `DeferredColumn`. \n", "\n", "_What's a `DeferredColumn`?_\n", "In `meerkat`, high-dimensional data types like images and videos are typically stored in a `DeferredColumn`. A `DeferredColumn` wraps around another column and lazily applies a function to its content as it is indexed. Consider the following example, where we create a simple `meerkat` column..." ] }, { "cell_type": "code", "execution_count": 8, "id": "fbcdc6f2", "metadata": {}, "outputs": [], "source": [ "col = mk.column([0, 1, 2])" ] }, { "cell_type": "markdown", "id": "bfe0f28a", "metadata": {}, "source": [ "...and wrap it in a deferred column.\n", "```\n", "dcol = col.defer(fn=lambda x: x + 10)\n", "dcol[1]()  # the function is only called at this point!\n", "```\n", "Critically, the function inside a deferred column is only called when the column is called! 
This is very useful for columns with large data types that we don't want to load into memory all at once. For example, we could create a `DeferredColumn` that lazily loads images...\n", "```\n", ">>> filepath_col = mk.PandasSeriesColumn([\"path/to/image0.jpg\", ...])\n", ">>> img_col = filepath_col.defer(lambda x: load_image(x))\n", "```\n", "An `ImageColumn` is just a `DeferredColumn` like this one, with a few more bells and whistles!\n", "\n", "## 🛠 Applying operations over the DataFrame.\n", "\n", "When analyzing data, we often perform operations on each example in our dataset (e.g. compute a model's prediction on each example, tokenize each sentence, compute a model's embedding for each example) and store the results. The `DataFrame` makes it easy to perform these operations: \n", "- Produce new columns (via `DataFrame.map`)\n", "- Produce new columns and store the columns alongside the original data (via `DataFrame.update`)\n", "- Extract an important subset of the dataset (via `DataFrame.filter`). \n", "\n", "Under the hood, dataloading is multiprocessed so that costly I/O doesn't bottleneck our computation.\n", "\n", "Let's start by filtering the `DataFrame` down to the examples in the validation set." ] }, { "cell_type": "code", "execution_count": 9, "id": "55aef16b", "metadata": {}, "outputs": [], "source": [ "valid_df = df[df[\"split\"] == \"valid\"]" ] }, { "cell_type": "markdown", "id": "15bd8702", "metadata": {}, "source": [ "### 🫐 Using `DataFrame.map` to compute the average intensity of the blue color channel in the images.\n", "\n", "To demonstrate the utility of the `map` operation, we'll explore the relationship between the \"blueness\" of an image and the class of the image. \n", "\n", "We'll quantify the \"blueness\" of each image by simply computing the mean intensity of the blue color channel. 
This can be accomplished with a simple `map` operation over the `DataFrame`:" ] }, { "cell_type": "code", "execution_count": 10, "id": "32cd4aa5", "metadata": {}, "outputs": [], "source": [ "blue_col = valid_df.map(\n", "    lambda img: np.array(img)[:, :, 2].mean(), \n", "    num_workers=2\n", ")\n", "\n", "# Add the intensities as a new column in the `DataFrame` \n", "valid_df[\"avg_blue\"] = blue_col" ] }, { "cell_type": "markdown", "id": "7ef51694", "metadata": {}, "source": [ "### 🪂 vs. ⛳️\n", "Next, we'll explore the relationship between blueness and the class label of the image. To do so, we'll compare the blue intensity distribution of images labeled \"parachute\" to the distribution of images labeled \"golf ball\".\n", "Using the [`seaborn`](https://seaborn.pydata.org/installing.html) plotting package and our `DataFrame`, this can be accomplished in just a few lines:" ] }, { "cell_type": "code", "execution_count": 11, "id": "a14aea38", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/runner/work/meerkat/meerkat/meerkat/dataframe.py:901: UserWarning: Could not convert column img of type , it will be dropped from the output.\n", " warnings.warn(\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "## OPTIONAL: this cell requires the seaborn dependency: https://seaborn.pydata.org/installing.html \n", "import seaborn as sns\n", "\n", "plot_df = valid_df[np.isin(valid_df[\"label\"], [\"golf ball\", \"parachute\"])]\n", "sns.displot(\n", " data=plot_df.to_pandas(), \n", " x=\"avg_blue\", \n", " hue=\"label\", \n", " height=3, \n", " aspect=2\n", ")" ] }, { "cell_type": "code", "execution_count": 12, "id": "6f3dc1f5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "FileCell(fn=)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "valid_df[\"img\"][int(np.argmax(valid_df[\"avg_blue\"]))]" ] }, { "cell_type": "markdown", "id": "ef61a0d6", "metadata": {}, "source": [ "## 💾 Writing a `DataFrame` to disk. \n", "Finally, we can write the updated `DataFrame` to disk for later use." ] }, { "cell_type": "code", "execution_count": 13, "id": "61d8705c", "metadata": {}, "outputs": [], "source": [ "valid_df.write(os.path.join(dataset_dir, \"valid_df\"))" ] }, { "cell_type": "code", "execution_count": 14, "id": "97f6a240", "metadata": {}, "outputs": [], "source": [ "valid_df = mk.read(os.path.join(dataset_dir, \"valid_df\"))" ] } ], "metadata": { "file_format": "mystnb", "kernelspec": { "display_name": "python3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.17" }, "source_map": [ 5, 13, 17, 26, 32, 36, 40, 50, 62, 71, 74, 77, 80, 83, 87, 105, 107, 130, 132, 140, 148, 154, 168, 170, 176, 180 ], "widgets": { "application/vnd.jupyter.widget-state+json": { "state": { "1566be8159bb4cbdac03174484956dd0": { "model_module": "@jupyter-widgets/controls", "model_module_version": "2.0.0", "model_name": "FloatProgressModel", "state": { "_dom_classes": [], 
"_model_module": "@jupyter-widgets/controls", "_model_module_version": "2.0.0", "_model_name": "FloatProgressModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "2.0.0", "_view_name": "ProgressView", "bar_style": "success", "description": "", "description_allow_html": false, "layout": "IPY_MODEL_5f647a36bbec417d8f0483ef0eb6610e", "max": 99003388.0, "min": 0.0, "orientation": "horizontal", "style": "IPY_MODEL_31f38e8e764947cab875f55dd82f205e", "tabbable": null, "tooltip": null, "value": 99003388.0 } }, "31f38e8e764947cab875f55dd82f205e": { "model_module": "@jupyter-widgets/controls", "model_module_version": "2.0.0", "model_name": "ProgressStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "2.0.0", "_model_name": "ProgressStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "2.0.0", "_view_name": "StyleView", "bar_color": null, "description_width": "" } }, "5769b06e9bed489388896fc01dd0b133": { "model_module": "@jupyter-widgets/controls", "model_module_version": "2.0.0", "model_name": "HTMLStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "2.0.0", "_model_name": "HTMLStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "2.0.0", "_view_name": "StyleView", "background": null, "description_width": "", "font_size": null, "text_color": null } }, "594d9864be874976b3f7880b930e4c6c": { "model_module": "@jupyter-widgets/base", "model_module_version": "2.0.0", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "2.0.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "2.0.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border_bottom": null, "border_left": null, "border_right": null, 
"border_top": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "5f647a36bbec417d8f0483ef0eb6610e": { "model_module": "@jupyter-widgets/base", "model_module_version": "2.0.0", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "2.0.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "2.0.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border_bottom": null, "border_left": null, "border_right": null, "border_top": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "7660c2f60e0c4341b020fd908a082309": { "model_module": "@jupyter-widgets/base", "model_module_version": "2.0.0", "model_name": "LayoutModel", "state": { "_model_module": 
"@jupyter-widgets/base", "_model_module_version": "2.0.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "2.0.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border_bottom": null, "border_left": null, "border_right": null, "border_top": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "7e8b2044cad24e29bba79ae8544bbf84": { "model_module": "@jupyter-widgets/controls", "model_module_version": "2.0.0", "model_name": "HBoxModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "2.0.0", "_model_name": "HBoxModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "2.0.0", "_view_name": "HBoxView", "box_style": "", "children": [ "IPY_MODEL_ae22ebc288b0482687930df91bf44385", "IPY_MODEL_1566be8159bb4cbdac03174484956dd0", "IPY_MODEL_e188978339384512b0846af38f1ac5ae" ], "layout": "IPY_MODEL_b352d6a448514850ad67fa5e42410ec2", "tabbable": null, "tooltip": null } }, "a5fd1ffff5b243aeafaf40725bbd28b2": { "model_module": "@jupyter-widgets/controls", "model_module_version": "2.0.0", "model_name": "HTMLStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "2.0.0", "_model_name": "HTMLStyleModel", "_view_count": null, "_view_module": 
"@jupyter-widgets/base", "_view_module_version": "2.0.0", "_view_name": "StyleView", "background": null, "description_width": "", "font_size": null, "text_color": null } }, "ae22ebc288b0482687930df91bf44385": { "model_module": "@jupyter-widgets/controls", "model_module_version": "2.0.0", "model_name": "HTMLModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "2.0.0", "_model_name": "HTMLModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "2.0.0", "_view_name": "HTMLView", "description": "", "description_allow_html": false, "layout": "IPY_MODEL_594d9864be874976b3f7880b930e4c6c", "placeholder": "​", "style": "IPY_MODEL_a5fd1ffff5b243aeafaf40725bbd28b2", "tabbable": null, "tooltip": null, "value": "Downloading: 100%" } }, "b352d6a448514850ad67fa5e42410ec2": { "model_module": "@jupyter-widgets/base", "model_module_version": "2.0.0", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "2.0.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "2.0.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border_bottom": null, "border_left": null, "border_right": null, "border_top": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } 
}, "e188978339384512b0846af38f1ac5ae": { "model_module": "@jupyter-widgets/controls", "model_module_version": "2.0.0", "model_name": "HTMLModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "2.0.0", "_model_name": "HTMLModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "2.0.0", "_view_name": "HTMLView", "description": "", "description_allow_html": false, "layout": "IPY_MODEL_7660c2f60e0c4341b020fd908a082309", "placeholder": "​", "style": "IPY_MODEL_5769b06e9bed489388896fc01dd0b133", "tabbable": null, "tooltip": null, "value": " 99.0M/99.0M [00:03<00:00, 36.8MB/s]" } } }, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }