I/O
Contents
I/O¶
In this guide, we will discuss how to bring data into Meerkat from various file formats and external libraries. We will also discuss how to export data in Meerkat DataFrames back into these formats. Finally, we’ll also discuss how to persist Meerkat DataFrames using write()
and read()
.
Should I export or persist?
This guide discusses two different ways of saving data in Meerkat: exporting and persisting. They serve different purposes:
Export your DataFrame to another format using one of the
to_*
methods if you need to use the data with other libraries or tools. (See the section on Exporting DataFrames.)Persist your DataFrame using
write()
if you simply want to save it to disk for later use. (See the section on Persisting DataFrames.)
Importing DataFrames¶
Meerkat has a number of built-in functions for reading in data from various file formats and Python libraries. We’ll provide one in depth example for reading in data from a CSV file, and then provide a list of the other supported file formats and libraries.
Example: Importing a dataset from CSV¶
Let’s load a CSV file from disk and read it into a Meerkat DataFrame.
We will be using a small sample of data from the National Gallery of Art Open Data Program. We’ve included this data at _data/art_ngoa.csv
in the Meerkat repository.
We will use the from_csv()
function to read in the data.
import meerkat as mk
df = mk.from_csv("_data/art_ngoa.csv")
df.head()
title | attribution | medium | objectid | iiifthumburl | |
---|---|---|---|---|---|
0 | Boy with Cat | Ernst Ludwig Kirchner | lithograph on yellow wove paper | 103631 | https://api.nga.gov/iiif/b309afc0-b270-4c29-9ce7-e8f39125f40d/full/!200,200/0/default.jpg |
1 | A Hound Chasing a Hare | Benozzo Gozzoli | pen and brown ink with traces of red chalk, heightened with white on pink prepared paper | 75840 | https://api.nga.gov/iiif/6e53c5bd-799d-4ac8-a57d-2c831a57da14/full/!200,200/0/default.jpg |
2 | Lamp, Lower Portion | Anonymous Artist after Andrea Briosco, called Riccio | gilded bronze | 130749 | https://api.nga.gov/iiif/8e5477ed-040f-4240-89ca-08d95b1c955b/full/!200,200/0/default.jpg |
3 | Andiron with Putto Finial | Nicolò Roccatagliata | bronze | 73441 | https://api.nga.gov/iiif/a6126402-696e-4333-96af-43362bef1afa/full/!200,200/0/default.jpg |
4 | Dispatch No. 4: THREE VALLEYS. The Valleys of Silicon, San Joaquin, and Death | Alec Soth | inkjet print and 48 page newsprint booklet in portfolio case | 221224 | https://api.nga.gov/iiif/29e05dbf-f0e3-4088-9ffe-83d4ca8c1968/full/!200,200/0/default.jpg |
Notice that each row corresponds to a single work of art, and each column corresponds to a different attribute of the work of art.
Representing images. The last column, iiifthumburl
, contains a URL to a thumbnail image of the work.
Using image()
, we can download the thumbnail image and display it in the DataFrame.
df["image"] = mk.image(df["iiifthumburl"], downloader="url")
df[["image", "title", "attribution"]].head()
image | title | attribution | |
---|---|---|---|
0 | Boy with Cat | Ernst Ludwig Kirchner | |
1 | A Hound Chasing a Hare | Benozzo Gozzoli | |
2 | Lamp, Lower Portion | Anonymous Artist after Andrea Briosco, called Riccio | |
3 | Andiron with Putto Finial | Nicolò Roccatagliata | |
4 | Dispatch No. 4: THREE VALLEYS. The Valleys of Silicon, San Joaquin, and Death | Alec Soth |
The function mk.image
creates a ImageColumn
which defers the downloading of images from the URLs until the data is needed.
Deferred Columns
If you’re wondering how ImageColumn
works, check out the guide on Deferred Columns.
Adding a primary key. The objectid
column contains a unique identifier for each work of art. We can use set_primary_key()
to set this column as the primary key for the DataFrame, which allows us to perform key-based indexing on the DataFrame.
df = df.set_primary_key("objectid")
df.loc[221224]
{'title': 'Dispatch No. 4: THREE VALLEYS. The Valleys of Silicon, San Joaquin, and Death',
'attribution': 'Alec Soth',
'medium': 'inkjet print and 48 page newsprint booklet in portfolio case',
'objectid': 221224,
'iiifthumburl': 'https://api.nga.gov/iiif/29e05dbf-f0e3-4088-9ffe-83d4ca8c1968/full/!200,200/0/default.jpg',
'image': FileCell(fn=<meerkat.columns.deferred.file.FileLoader object at 0x7f2a8439e310>)}
The from_csv()
function has a utility parameter primary_key
which can be used to set the primary key when the DataFrame is created.
mk.from_csv("_data/art_ngoa.csv", primary_key="objectid")
title | attribution | medium | objectid | iiifthumburl | |
---|---|---|---|---|---|
0 | Boy with Cat | Ernst Ludwig Kirchner | lithograph on yellow wove paper | 103631 | https://api.nga.gov/iiif/b309afc0-b270-4c29-9ce7-e8f39125f40d/full/!200,200/0/default.jpg |
1 | A Hound Chasing a Hare | Benozzo Gozzoli | pen and brown ink with traces of red chalk, heightened with white on pink prepared paper | 75840 | https://api.nga.gov/iiif/6e53c5bd-799d-4ac8-a57d-2c831a57da14/full/!200,200/0/default.jpg |
2 | Lamp, Lower Portion | Anonymous Artist after Andrea Briosco, called Riccio | gilded bronze | 130749 | https://api.nga.gov/iiif/8e5477ed-040f-4240-89ca-08d95b1c955b/full/!200,200/0/default.jpg |
3 | Andiron with Putto Finial | Nicolò Roccatagliata | bronze | 73441 | https://api.nga.gov/iiif/a6126402-696e-4333-96af-43362bef1afa/full/!200,200/0/default.jpg |
4 | Dispatch No. 4: THREE VALLEYS. The Valleys of Silicon, San Joaquin, and Death | Alec Soth | inkjet print and 48 page newsprint booklet in portfolio case | 221224 | https://api.nga.gov/iiif/29e05dbf-f0e3-4088-9ffe-83d4ca8c1968/full/!200,200/0/default.jpg |
... | ... | ... | ... | ... | ... |
94 | Mountain Landscape with a Hollow | Alexander Cozens | brush drawing in brown wash on laid paper | 65569 | https://api.nga.gov/iiif/d26d1ba9-7814-4a7e-9d8b-1c0d24f76d7b/full/!200,200/0/default.jpg |
95 | On the Vimy to Lens Road | David Young Cameron | watercolor and graphite | 5708 | https://api.nga.gov/iiif/5e06918b-e6cb-4639-8452-2e050db3d2e9/full/!200,200/0/default.jpg |
96 | Girl Writer | Ralph Austin | lithograph | 148623 | https://api.nga.gov/iiif/57e15461-94b3-4020-8a8a-46884c2cc50d/full/!200,200/0/default.jpg |
97 | Jean de Saulx, 1555-1629, Viscount of Tavanes and Lugny, and Marquess of Mirabet [obverse] | French 17th Century | bronze | 45344 | https://api.nga.gov/iiif/ccd2d7b2-6119-4326-aa04-cb61e28f11a3/full/!200,200/0/default.jpg |
98 | From My Window at the Shelton, North | Alfred Stieglitz | gelatin silver print | 36364 | https://api.nga.gov/iiif/cc3afc42-5e1a-4228-8c45-7f17f27c224c/full/!200,200/0/default.jpg |
Primary Keys
To learn more about primary keys and key-based indexing, check out the section Selecting Rows by Key.
Importing from storage formats¶
Meerkat supports importing data from a number of other file formats. As in the example above, you may need to set the primary key and/or add additional columns for complex data (e.g. images, audio).
from_csv()
: Reads in data from a CSV (comma-separated values) file. CSV files are a common format for storing tabular data.from_feather()
: Reads in data from a Feather file. Feather is a language-agnostic file format for storing DataFrames. It can provide significantly faster I/O than CSV.from_parquet()
: Reads in data from a Parquet file. Parquet is a columnar storage format that is designed for efficiency.from_json()
: Reads in data from a JSON file.
If your data is in a format not listed here, load it into a Pandas DataFrame and use from_pandas()
to convert it to a Meerkat DataFrame.
Importing from other libraries¶
It’s also posible to import data from third-party Python libraries like Pandas and HuggingFace Datasets.
from_pandas()
: Converts a Pandas DataFrame to a Meerkat DataFrame. By default, the index of the Pandas DataFrame will be used as the primary key for the Meerkat DataFrame.from_arrow()
: Converts an Arrow Table to a Meerkat DataFrame.from_huggingface()
: Converts a HuggingFace Dataset to a Meerkat DataFrame. By default, the index of the HuggingFace Dataset will be used as the primary key for the Meerkat DataFrame.
Exporting DataFrames¶
Meerkat supports exporting DataFrames from Meerkat to other file formats and libraries. These methods are useful for converting data into formats that can be used by other libraries or software.
Warning
Most file formats designed for tabular data do not offer the same flexibility as Meerkat DataFrames, especially when it comes to storing complex data types and multi-dimensional tensors. As a result, exporting a Meerkat DataFrame to a file format may result in data loss.
Specifically, any DeferredColumn
(or its subclasses) will not be exported. If you want to export a DeferredColumn
, you should first materialize the column(s) by calling the DataFrame. Also, depending on the export destination, any TensorColumn
and/or ObjectColumn
in the DataFrame may not be exported.
If you simply want to save a Meerkat DataFrame to disk, you should use write()
instead (see Writing DataFrames). This will persist the DataFrame in a format that can be read back into Meerkat without any data loss.
Continuing with the example above, let’s export the DataFrame to a CSV file.
df.to_csv("_data/art_ngoa_export.csv")
/home/runner/work/meerkat/meerkat/meerkat/dataframe.py:901: UserWarning: Could not convert column image of type <class 'meerkat.columns.deferred.file.FileColumn'>, it will be dropped from the output.
warnings.warn(
If we inspect the first 5 lines of the CSV file from the command line, we can see that the image
column is missing. This is because the image
column is a DeferredColumn
and was not exported.
!head -n 5 _data/art_ngoa_export.csv
title,attribution,medium,objectid,iiifthumburl
Boy with Cat,Ernst Ludwig Kirchner,lithograph on yellow wove paper,103631,"https://api.nga.gov/iiif/b309afc0-b270-4c29-9ce7-e8f39125f40d/full/!200,200/0/default.jpg"
A Hound Chasing a Hare,Benozzo Gozzoli,"pen and brown ink with traces of red chalk, heightened with white on pink prepared paper",75840,"https://api.nga.gov/iiif/6e53c5bd-799d-4ac8-a57d-2c831a57da14/full/!200,200/0/default.jpg"
"Lamp, Lower Portion","Anonymous Artist after Andrea Briosco, called Riccio",gilded bronze,130749,"https://api.nga.gov/iiif/8e5477ed-040f-4240-89ca-08d95b1c955b/full/!200,200/0/default.jpg"
Andiron with Putto Finial,Nicolò Roccatagliata,bronze,73441,"https://api.nga.gov/iiif/a6126402-696e-4333-96af-43362bef1afa/full/!200,200/0/default.jpg"
When columns are dropped during export, a warning is raised.
Exporting to storage formats¶
Meerkat supports exporting DataFrames to a number of file formats, with the DataFrame
class providing the methods listed below.
to_csv()
: Writes the DataFrame to a CSV file. CSV files are a common format for storing tabular data. Unlike some alternatives, CSV files are human-readable in a text-editor and can be easily imported into spreadsheet software.to_feather()
: Writes the DataFrame to a Feather file. Feather is a language-agnostic file format for storing DataFrames. It can provide significantly faster I/O than CSV.to_parquet()
: Writes the DataFrame to a Parquet file. Parquet is a columnar storage format that is designed for efficiency.to_json()
: Writes the DataFrame to a JSON file.
Note that several of the methods take an optional engine
parameter. This parameter allows you to control the underlying library that is used to write the DataFrame to disk. Options include: pandas
and arrow
. If no engine
is specified, one is automatically chosen based on the columns in the DataFrame. For example, we can write the DataFrame to a CSV file using the Arrow library instead of Pandas.
df.to_csv("_data/art_ngoa_export_arrow.csv", engine="arrow")
Exporting to other libraries¶
It is also possible to export Meerkat DataFrames to other Python DataFrame libraries.
to_pandas()
: Converts the DataFrame to a Pandas DataFrameto_arrow()
: Converts the DataFrame to an Arrow Table.
Persisting DataFrames¶
In this section, we discuss how to persist Meerkat DataFrames to disk using the write()
method. Unlike the export methods discussed above, write()
guarantees that the DataFrame read back in with read()
will contain the exact sam columns as the original DataFrame.
Writing DataFrames¶
Above we saw how some column types in Meerkat DataFrames cannot be exported to a single file format. Specifically, we saw that the column we created to display images was dropped when exporting to CSV.
write()
allows us to persist the DataFrame to disk in a way that will preserve the image column.
df.write("_data/art_ngoa.mk")
How does it work? Under the hood, write()
works by splitting the DataFrame among several different files. For example, the ScalarColumn
s could be stored together in a Feather file, while the TensorColumn
s could be stored in NPY format. The path passed to write()
is the directory where the files will be stored. The directory will also contain a meta.yaml
file that contains information about the DataFrame. This file is used by read()
to reconstruct the DataFrame.
Reading DataFrames¶
To read the DataFrame back in from disk, we can use the read()
function.
mk.read("_data/art_ngoa.mk")
title | attribution | medium | objectid | iiifthumburl | image | |
---|---|---|---|---|---|---|
0 | Boy with Cat | Ernst Ludwig Kirchner | lithograph on yellow wove paper | 103631 | https://api.nga.gov/iiif/b309afc0-b270-4c29-9ce7-e8f39125f40d/full/!200,200/0/default.jpg | |
1 | A Hound Chasing a Hare | Benozzo Gozzoli | pen and brown ink with traces of red chalk, heightened with white on pink prepared paper | 75840 | https://api.nga.gov/iiif/6e53c5bd-799d-4ac8-a57d-2c831a57da14/full/!200,200/0/default.jpg | |
2 | Lamp, Lower Portion | Anonymous Artist after Andrea Briosco, called Riccio | gilded bronze | 130749 | https://api.nga.gov/iiif/8e5477ed-040f-4240-89ca-08d95b1c955b/full/!200,200/0/default.jpg | |
3 | Andiron with Putto Finial | Nicolò Roccatagliata | bronze | 73441 | https://api.nga.gov/iiif/a6126402-696e-4333-96af-43362bef1afa/full/!200,200/0/default.jpg | |
4 | Dispatch No. 4: THREE VALLEYS. The Valleys of Silicon, San Joaquin, and Death | Alec Soth | inkjet print and 48 page newsprint booklet in portfolio case | 221224 | https://api.nga.gov/iiif/29e05dbf-f0e3-4088-9ffe-83d4ca8c1968/full/!200,200/0/default.jpg | |
... | ... | ... | ... | ... | ... | ... |
94 | Mountain Landscape with a Hollow | Alexander Cozens | brush drawing in brown wash on laid paper | 65569 | https://api.nga.gov/iiif/d26d1ba9-7814-4a7e-9d8b-1c0d24f76d7b/full/!200,200/0/default.jpg | |
95 | On the Vimy to Lens Road | David Young Cameron | watercolor and graphite | 5708 | https://api.nga.gov/iiif/5e06918b-e6cb-4639-8452-2e050db3d2e9/full/!200,200/0/default.jpg | |
96 | Girl Writer | Ralph Austin | lithograph | 148623 | https://api.nga.gov/iiif/57e15461-94b3-4020-8a8a-46884c2cc50d/full/!200,200/0/default.jpg | |
97 | Jean de Saulx, 1555-1629, Viscount of Tavanes and Lugny, and Marquess of Mirabet [obverse] | French 17th Century | bronze | 45344 | https://api.nga.gov/iiif/ccd2d7b2-6119-4326-aa04-cb61e28f11a3/full/!200,200/0/default.jpg | |
98 | From My Window at the Shelton, North | Alfred Stieglitz | gelatin silver print | 36364 | https://api.nga.gov/iiif/cc3afc42-5e1a-4228-8c45-7f17f27c224c/full/!200,200/0/default.jpg |
Note that all the columns are present, including the image
column, which had previously been lost with to_csv()
.