Datasets
Meerkat provides a dataset registry that makes it easy to download datasets and load them into Meerkat data structures.
For example, using get(), we can download and prepare the Imagenette dataset:
In [1]: import meerkat as mk
In [2]: df = mk.datasets.get("imagenette")
Some datasets have multiple versions. For example, Imagenette provides a full-size version as well as 320-pixel and 160-pixel versions. You can list a dataset’s available versions with versions() and then pass one of them to get():
In [3]: mk.datasets.versions("imagenette")
Out[3]: ['full', '320px', '160px']
In [4]: mk.datasets.get("imagenette", version="160px")
By default, datasets are downloaded to ~/.meerkat/datasets/{name}/{version}. However, if you have already downloaded the dataset elsewhere, or you want to download it to a different location, you can specify the dataset_dir argument:
df = mk.datasets.get("imagenette", dataset_dir="/local/download/of/imagenette/full")
You can also configure Meerkat to use a different default root directory. After setting mk.config.datasets.root_dir = "/local/download/of", the default location for datasets will be /local/download/of/datasets/{name}/{version}.
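The directory layout described above can be sketched in plain Python. This is only an illustration: resolve_dataset_dir and DEFAULT_ROOT_DIR are hypothetical names, not part of Meerkat, and the ~/.meerkat default root is inferred from the documented default path rather than taken from the library's code.

```python
from pathlib import Path
from typing import Optional

# Hypothetical constant: the default root is inferred from the documented
# default location ~/.meerkat/datasets/{name}/{version}.
DEFAULT_ROOT_DIR = "~/.meerkat"


def resolve_dataset_dir(
    name: str,
    version: str,
    dataset_dir: Optional[str] = None,
    root_dir: str = DEFAULT_ROOT_DIR,
) -> Path:
    """Return where a dataset would live under the documented layout.

    Hypothetical helper for illustration only; it does not call Meerkat.
    """
    if dataset_dir is not None:
        # An explicit dataset_dir is used as-is and bypasses the root directory.
        return Path(dataset_dir).expanduser()
    # Otherwise the path is {root_dir}/datasets/{name}/{version}.
    return Path(root_dir).expanduser() / "datasets" / name / version


# Custom root directory, as in the root_dir example above.
print(resolve_dataset_dir("imagenette", "160px", root_dir="/local/download/of"))

# Explicit dataset_dir, as in the dataset_dir example above.
print(resolve_dataset_dir(
    "imagenette", "full", dataset_dir="/local/download/of/imagenette/full"
))
```

The sketch assumes that an explicit dataset_dir takes precedence over root_dir, which matches how the two options are described here but is not confirmed against Meerkat's source.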
How does Meerkat’s dataset registry fit in with other dataset hubs?
The purpose of the Meerkat dataset registry is to provide code for downloading datasets and loading them into DataFrame objects. The Meerkat registry, like Torchvision Datasets, doesn’t actually host any data.
In contrast, dataset hubs like HuggingFace Datasets and Activeloop Hub are great community efforts that do host data. The Meerkat registry is therefore complementary to these hubs: in fact, any dataset on the HuggingFace hub can currently be loaded directly through our registry. For example, we can load the IMDB dataset hosted on HuggingFace with mk.datasets.get("imdb").
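To make the distinction concrete, here is a toy sketch of a registry that stores loading code rather than data. It is not Meerkat's implementation: register, versions, get, and the "toy" dataset are all hypothetical, and plain lists of dicts stand in for DataFrame objects.

```python
from typing import Callable, Dict, List, Optional

Rows = List[dict]

# Toy registry, NOT Meerkat's implementation: it maps a dataset name and
# version to a builder function. The registry itself holds no data.
_BUILDERS: Dict[str, Dict[str, Callable[[], Rows]]] = {}


def register(name: str, version: str = "main"):
    """Decorator that records a builder function under (name, version)."""
    def decorator(fn: Callable[[], Rows]) -> Callable[[], Rows]:
        _BUILDERS.setdefault(name, {})[version] = fn
        return fn
    return decorator


def versions(name: str) -> List[str]:
    """List the versions registered for a dataset, in registration order."""
    return list(_BUILDERS[name])


def get(name: str, version: Optional[str] = None) -> Rows:
    """Run the builder for a dataset; default to the first registered version."""
    available = _BUILDERS[name]
    if version is None:
        version = next(iter(available))
    return available[version]()


@register("toy", version="full")
def build_toy_full() -> Rows:
    # A real builder would download files from the dataset's own host here.
    return [{"id": 0, "label": "a"}, {"id": 1, "label": "b"}]


@register("toy", version="small")
def build_toy_small() -> Rows:
    return [{"id": 0, "label": "a"}]


print(versions("toy"))
print(get("toy", version="small"))
```

Because the registry only stores functions, contributing a dataset amounts to contributing a builder, while the data stays wherever it is hosted.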
Contributing Datasets
We encourage users to contribute datasets to the Meerkat registry. If you’re already using Meerkat with your dataset, contributing it to the registry is straightforward: you just share the code that you’re already using to load the dataset into Meerkat. Please follow the instructions in Contributing Datasets.
The table below lists all of the datasets currently in the Meerkat registry.
You can also list these datasets programmatically with mk.datasets.catalog.
name | tags | versions | homepage
---|---|---|---
celeba | | main | link
coco | | 2014 | link
expw | | main | link
fer | | plus | link
imagenet | | ilsvrc2012 | link
imagenette | | full, 320px, 160px | link
lvis | | v1 | link
mirflickr | | 25k | link
ngoa | | main | link
pascal | | 2012 | link