DataFrame


class DataFrame(data: Optional[Union[dict, list]] = None, primary_key: Union[str, bool] = True, *args, **kwargs)[source]

A collection of equal-length columns.

Parameters
  • data (Union[dict, list]) – A dictionary of columns or a list of dictionaries.

  • primary_key (Union[str, bool], optional) – The name of the primary key column. Defaults to True.
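
For instance, a DataFrame can be constructed from a dictionary of equal-length columns. A minimal sketch (the column names and values here are hypothetical):

import meerkat as mk

# Each key becomes a column name; all columns must have the same length.
df = mk.DataFrame({
    "id": ["a", "b", "c"],
    "value": [1, 2, 3],
})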

format(formatters: Dict[str, FormatterGroup]) DataFrame[source]

Create a view of the DataFrame with formatted columns.


Parameters

formatters (Dict[str, FormatterGroup]) – A dictionary mapping column names to FormatterGroups.

Returns

A view of the DataFrame with formatted columns.

Return type

DataFrame

Examples

# assume df is a DataFrame with columns "img", "text", "id"

gallery = mk.Gallery(
    df=df.format(
        img={"thumbnail": ImageFormatter(max_size=(48, 48))},
        text={"icon": TextFormatter()},
    )
)
property data: meerkat.block.manager.BlockManager

Get the underlying data (excluding invisible rows).

To access underlying data with invisible rows, use _data.

property columns

Column names in the DataFrame.

property primary_key: meerkat.columns.scalar.abstract.ScalarColumn

The column acting as the primary key.

property primary_key_name: str

The name of the column acting as the primary key.

set_primary_key(column: str, inplace: bool = False) meerkat.dataframe.DataFrame[source]

Set the DataFrame’s primary key using an existing column. This is an out-of-place operation. For more information on primary keys, see the User Guide.

Parameters

column (str) – The name of an existing column to set as the primary key.
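
A minimal usage sketch, assuming df already has a column of unique identifiers named "id":

# Returns a new DataFrame whose primary key is the existing "id" column.
df = df.set_primary_key("id")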

create_primary_key(column: str)[source]

Create a primary key of contiguous integers.

Parameters

column (str) – The name of the column to create.

property nrows

Number of rows in the DataFrame.

property ncols

Number of columns in the DataFrame.

property shape

Shape of the DataFrame (num_rows, num_columns).

size()[source]

Shape of the DataFrame (num_rows, num_columns).

add_column(name: str, data: Union[Sequence, numpy.ndarray, pandas.core.series.Series, torch.Tensor], overwrite=False) None[source]

Add a column to the DataFrame.

remove_column(column: str) None[source]

Remove a column from the dataset.

append(df: meerkat.dataframe.DataFrame, axis: Union[str, int] = 'rows', suffixes: Tuple[str] = None, overwrite: bool = False) meerkat.dataframe.DataFrame[source]

Append a batch of data to the dataset.

df must have the same columns as the dataset (regardless of which columns are visible).

head(n: int = 5) meerkat.dataframe.DataFrame[source]

Get the first n examples of the DataFrame.

tail(n: int = 5) meerkat.dataframe.DataFrame[source]

Get the last n examples of the DataFrame.

set(value: meerkat.dataframe.DataFrame)[source]

Set the data of this DataFrame to the data of another DataFrame.

This is used inside endpoints to tell Meerkat when a DataFrame has been modified. Calling this method outside of an endpoint will not have any effect on the graph.

classmethod from_batch(batch: Dict[str, Union[List, meerkat.columns.abstract.Column]]) meerkat.dataframe.DataFrame[source]

Convert a batch to a DataFrame.

classmethod from_batches(batches: Sequence[Dict[str, Union[List, meerkat.columns.abstract.Column]]]) meerkat.dataframe.DataFrame[source]

Convert a list of batches to a DataFrame.

classmethod from_pandas(df: pandas.core.frame.DataFrame, index: bool = True, primary_key: Optional[str] = None) meerkat.dataframe.DataFrame[source]

Create a Meerkat DataFrame from a Pandas DataFrame.

Warning

In Meerkat, column names must be strings, so non-string column names in the Pandas DataFrame will be converted.

Parameters
  • df – The Pandas DataFrame to convert.

  • index – Whether to include the index of the Pandas DataFrame as a column in the Meerkat DataFrame.

  • primary_key – The name of the column to use as the primary key. If index is True and primary_key is None, the index will be used as the primary key. If index is False, then no primary key will be set. Optional default is None.

Returns

The Meerkat DataFrame.

Return type

DataFrame
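
A minimal sketch of the conversion (the pandas data shown is hypothetical):

import pandas as pd
import meerkat as mk

pdf = pd.DataFrame({"id": ["a", "b", "c"], "value": [1, 2, 3]})
# Use the "id" column as the primary key instead of the pandas index.
df = mk.DataFrame.from_pandas(pdf, index=False, primary_key="id")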

classmethod from_arrow(table: pyarrow.lib.Table)[source]

Create a DataFrame from an Arrow Table.

classmethod from_huggingface(*args, **kwargs)[source]

Load a Huggingface dataset as a DataFrame.

Use this to replace datasets.load_dataset, so

>>> dict_of_datasets = datasets.load_dataset('boolq')

becomes

>>> dict_of_dataframes = DataFrame.from_huggingface('boolq')
classmethod from_csv(filepath: str, primary_key: Optional[str] = None, backend: str = 'pandas', *args, **kwargs) meerkat.dataframe.DataFrame[source]

Create a DataFrame from a csv file. All of the columns will be meerkat.ScalarColumn with backend Pandas.

Returns

The constructed dataframe.

Return type

DataFrame
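
For example, assuming a CSV file with a header row at a hypothetical path:

import meerkat as mk

df = mk.DataFrame.from_csv("data.csv")  # "data.csv" is a hypothetical path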

classmethod from_feather(filepath: str, primary_key: Optional[str] = None, columns: Optional[Sequence[str]] = None, use_threads: bool = True, **kwargs) meerkat.dataframe.DataFrame[source]

Create a DataFrame from a feather file. All of the columns will be meerkat.ScalarColumn with backend Pandas.

Returns

The constructed dataframe.

Return type

DataFrame

classmethod from_parquet(filepath: str, primary_key: Optional[str] = None, engine: str = 'auto', columns: Optional[Sequence[str]] = None, **kwargs) meerkat.dataframe.DataFrame[source]

Create a DataFrame from a parquet file. All of the columns will be meerkat.ScalarColumn with backend Pandas.

Returns

The constructed dataframe.

Return type

DataFrame

classmethod from_json(filepath: str, primary_key: Optional[str] = None, orient: str = 'records', lines: bool = False, backend: str = 'pandas', **kwargs) meerkat.dataframe.DataFrame[source]

Load a DataFrame from a json file.

By default, data in the JSON file should be a list of dictionaries, each with an entry for each column. This is the orient="records" format. If the data is in a different format in the JSON, you can specify the orient parameter. See pandas.read_json() for more details.

Parameters
  • filepath (str) – The file path or buffer to load from. Same as pandas.read_json().

  • orient (str) – The expected JSON string format. Options are: “split”, “records”, “index”, “columns”, “values”. Same as pandas.read_json().

  • lines (bool) – Whether the json file is a jsonl file. Same as pandas.read_json().

  • backend (str) – The backend to use for the loading and resulting columns.

  • **kwargs – Keyword arguments forwarded to pandas.read_json().

Returns

The constructed dataframe.

Return type

DataFrame
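
For example, to load a JSON Lines file, where each line is a single record (the path is hypothetical):

import meerkat as mk

df = mk.DataFrame.from_json("data.jsonl", orient="records", lines=True)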

to_pandas(index: bool = False, allow_objects: bool = False) pandas.core.frame.DataFrame[source]

Convert a Meerkat DataFrame to a Pandas DataFrame.

Parameters

index (bool) – Use the primary key as the index of the Pandas DataFrame. Defaults to False.

Returns

The constructed dataframe.

Return type

pd.DataFrame

to_arrow() pyarrow.lib.Table[source]

Convert a Meerkat DataFrame to an Arrow Table.

Returns

The constructed table.

Return type

pa.Table

to_csv(filepath: str, engine: str = 'auto')[source]

Save a DataFrame to a csv file.


Parameters
  • filepath (str) – The file path to save to.

  • engine (str) – The library to use to write the csv. One of [“pandas”, “arrow”, “auto”]. If “auto”, then the library will be chosen based on the column types.

to_feather(filepath: str, engine: str = 'auto')[source]

Save a DataFrame to a feather file.


Parameters
  • filepath (str) – The file path to save to.

  • engine (str) – The library to use to write the feather. One of [“pandas”, “arrow”, “auto”]. If “auto”, then the library will be chosen based on the column types.

to_parquet(filepath: str, engine: str = 'auto')[source]

Save a DataFrame to a parquet file.


Parameters
  • filepath (str) – The file path to save to.

  • engine (str) – The library to use to write the parquet. One of [“pandas”, “arrow”, “auto”]. If “auto”, then the library will be chosen based on the column types.

to_json(filepath: str, lines: bool = False, orient: str = 'records') None[source]

Save a DataFrame to a json file.

Parameters
  • filepath (str) – The file path to save to.

  • lines (bool) – Whether to write the json file as a jsonl file.

  • orient (str) – The orientation of the json file. Same as pandas.DataFrame.to_json().

batch(batch_size: int = 1, drop_last_batch: bool = False, num_workers: int = 0, materialize: bool = True, shuffle: bool = False, *args, **kwargs)[source]

Batch the dataset.

Parameters
  • batch_size – integer batch size

  • drop_last_batch – drop the last batch if it is smaller than batch_size

Returns

batches of data
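
A minimal usage sketch: each yielded batch is itself a DataFrame with up to batch_size rows (process is a hypothetical user function):

for batch in df.batch(batch_size=32):
    process(batch)  # batch is a DataFrame of at most 32 rows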

update(function: Optional[Callable] = None, with_indices: bool = False, input_columns: Optional[Union[str, List[str]]] = None, is_batched_fn: bool = False, batch_size: Optional[int] = 1, remove_columns: Optional[List[str]] = None, num_workers: int = 0, output_type: Optional[Union[type, Dict[str, type]]] = None, mmap: bool = False, mmap_path: Optional[str] = None, materialize: bool = True, pbar: bool = False, **kwargs) meerkat.dataframe.DataFrame[source]

Update the columns of the dataset.

filter(function: Optional[Callable] = None, with_indices=False, input_columns: Optional[Union[str, List[str]]] = None, is_batched_fn: bool = False, batch_size: Optional[int] = 1, drop_last_batch: bool = False, num_workers: int = 0, materialize: bool = True, pbar: bool = False, **kwargs) Optional[meerkat.dataframe.DataFrame][source]

Filter operation on the DataFrame.
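
A minimal sketch, assuming that filter, like map, matches the function's argument names to column names (the "birth_year" column is hypothetical):

# Keep only the rows whose birth_year is before 1990.
kept = df.filter(lambda birth_year: birth_year < 1990)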

sort(by: Union[str, List[str]], ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.dataframe.DataFrame[source]

Sort the DataFrame by the values in the specified columns. Similar to sort_values in pandas.

Parameters
  • by (Union[str, List[str]]) – The columns to sort by.

  • ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.

  • kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.

Returns

A sorted view of DataFrame.

Return type

DataFrame
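
For example, to sort by one column in descending order and break ties with a second column in ascending order (the column names are hypothetical):

df_sorted = df.sort(by=["birth_year", "residence"], ascending=[False, True])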

sample(n: int = None, frac: float = None, replace: bool = False, weights: Union[str, numpy.ndarray] = None, random_state: Union[int, numpy.random.mtrand.RandomState] = None) meerkat.dataframe.DataFrame[source]

Select a random sample of rows from the DataFrame. Roughly equivalent to sample in Pandas (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html).

Parameters
  • n (int) – Number of samples to draw. If frac is specified, this parameter should not be passed. Defaults to 1 if frac is not passed.

  • frac (float) – Fraction of rows to sample. If n is specified, this parameter should not be passed.

  • replace (bool) – Sample with or without replacement. Defaults to False.

  • weights (Union[str, np.ndarray]) – Weights to use for sampling. If None (default), the rows will be sampled uniformly. If a numpy array, the sample will be weighted accordingly. If a string, the weights will be applied to the rows based on the column with the name specified. If weights do not sum to 1 they will be normalized to sum to 1.

  • random_state (Union[int, np.random.RandomState]) – Random state or seed to use for sampling.

Returns

A random sample of rows from the DataFrame.

Return type

DataFrame
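
For example, a reproducible weighted sample, assuming df has a numeric column named "weight":

sampled = df.sample(n=100, weights="weight", random_state=42)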

shuffle(seed: int = None) meerkat.dataframe.DataFrame[source]

Shuffle the rows of the DataFrame out-of-place.

Parameters

seed (int) – Random seed to use for shuffling.

Returns

A shuffled view of the DataFrame.

Return type

DataFrame

rename(mapper: Union[Dict, Callable] = None, errors: Literal['ignore', 'raise'] = 'ignore') meerkat.dataframe.DataFrame[source]

Return a new DataFrame with the specified column labels renamed.

Dictionary values must be unique (1-to-1). Labels not specified will be left unchanged. Extra labels will not throw an error unless errors is set to 'raise'.

Parameters
  • mapper (Union[Dict, Callable], optional) – Dict-like or function transformations to apply to the values of the columns. Defaults to None.

  • errors (Literal['ignore', 'raise'], optional) – If ‘raise’, raise a KeyError when the Dict contains labels that do not exist in the DataFrame. If ‘ignore’, extra keys will be ignored. Defaults to ‘ignore’.

Raises

ValueError – If the dictionary values in mapper are not unique (the mapping must be 1-to-1).

Returns

A new DataFrame with the specified column labels renamed.

Return type

DataFrame
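
For example, renaming with a dictionary or with a callable (the column names are hypothetical):

df2 = df.rename({"birth_year": "year"})  # only the listed labels change
df3 = df.rename(str.upper)               # the callable is applied to every label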

drop(columns: Union[str, Collection[str]], check_exists=True) meerkat.dataframe.DataFrame[source]

Return a new DataFrame with the specified columns dropped.

Parameters

columns (Union[str, Collection[str]]) – The columns to drop.

Returns

A new DataFrame with the specified columns dropped.

Return type

DataFrame
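
For example (the column names are hypothetical):

df2 = df.drop("residence")            # drop a single column
df3 = df.drop(["residence", "age"])   # drop several columns at once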

classmethod read(path: str, overwrite: bool = False, *args, **kwargs) meerkat.dataframe.DataFrame[source]

Load a DataFrame stored on disk.

write(path: str) None[source]

Save a DataFrame to disk.
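
A minimal round-trip sketch (the directory path is hypothetical):

df.write("/tmp/my_df.mk")
df2 = mk.DataFrame.read("/tmp/my_df.mk")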

to_huggingface(repository, commit_message: Optional[str] = None)[source]

Upload a DataFrame to a HuggingFace repository.

This method will dump the dataframe into the repository.local_dir. If commit_message is specified, the repository will be pushed to the hub.

The dataframe can then be accessed with:
>>> repo = huggingface_hub.snapshot_download(repository)
>>> # or repo = huggingface_hub.Repository(clone_from=repository)
>>> df = mk.read(repo)
Parameters
  • repository – The huggingface_hub.Repository object to upload to.

  • commit_message – The commit message to use when pushing to the Hugging Face Hub.

Note

This will overwrite the existing DataFrame in the repository.

Example

>>> repo = huggingface_hub.Repository(
...     local_dir="my-dataset",
...     clone_from="user/my-dataset",
...     repo_type="dataset")
>>> df.to_huggingface(repo, commit_message="uploading dataframe")
mark()

Converts the object to a reactive object in-place.

unmark()

Converts the object to a non-reactive object in-place.

class Row[source]
reactive(fn: Optional[Callable] = None, nested_return: bool = False, skip_fn: Optional[Callable[[...], bool]] = None, backend_only: bool = False) Callable[source]

Internal decorator that is used to mark a function as reactive. This is only meant for internal use, and users should use the react() decorator instead.

Functions decorated with this will create nodes in the operation graph, which are executed whenever their inputs are modified.

A basic example that adds two numbers:

@reactive
def add(a: int, b: int) -> int:
    return a + b

a = Store(1)
b = Store(2)
c = add(a, b)

When either a or b is modified, the add function will be called again with the new values of a and b.

A more complex example that concatenates two mk.DataFrame objects:

@reactive
def concat(df1: mk.DataFrame, df2: mk.DataFrame) -> mk.DataFrame:
    return mk.concat([df1, df2])

df1 = mk.DataFrame(...)
df2 = mk.DataFrame(...)
df3 = concat(df1, df2)
Parameters
  • fn – See react().

  • nested_return – See react().

  • skip_fn – See react().

Returns

See react().

class unmarked[source]

A context manager and decorator that forces all objects within it to behave as if they are not marked. This means that any functions (reactive or not) called with those objects will never be rerun.

Effectively, functions (by decoration) or blocks of code (with the context manager) behave as if they are not reactive.

Examples:

Consider the following function:

>>> @reactive
... def f(x):
...     return x + 1

If we call f with a marked object, then it will be rerun if the object changes:

>>> x = mark(1)
>>> f(x) # f is rerun when x changes

Now, suppose we call f inside another function g that is not reactive:

>>> def g(x):
...     out = f(x)
...     return out

If we call g with a marked object, then the out variable will be recomputed if the object changes. Even though g is not reactive, f is, and f is called within g with a marked object.

Sometimes, this might be what we want. However, sometimes we want to ensure that a function or block of code behaves as if it is not reactive.

For this behavior, we can use the unmarked context manager:

>>> with unmarked():
...     g(x) # g and nothing in g is rerun when x changes

Or, we can use the unmarked decorator:

>>> @unmarked
... def g(x):
...     out = f(x)
...     return out

In both cases, the out variable will not be recomputed if the object changes, even though f is reactive.

class Store(wrapped: meerkat.interactive.graph.store.T, backend_only: bool = False)[source]
to_json() Any[source]

Converts the wrapped object into a jsonifiable object.

property frontend

Returns a Pydantic model that should be sent to the frontend.

These models are typically named <something>Frontend (e.g. ComponentFrontend, StoreFrontend).

set(new_value: meerkat.interactive.graph.store.T) None[source]

Set the value of the store.

This will trigger any reactive functions that depend on this store.

Parameters

new_value (T) – The new value of the store.

Returns

None

Note

Even if the new_value is the same as the current value, this will still trigger any reactive functions that depend on this store. To avoid this, check for equality before calling this method.
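
A minimal sketch of creating and updating a Store:

count = Store(0)
count.set(count + 1)  # triggers reactive functions that depend on `count`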

mark(input: meerkat.interactive.graph.marking.T) meerkat.interactive.graph.marking.T[source]

Mark an object.

If the input is an object, then the object will become reactive: all of its methods and properties will become reactive. It will be returned as a Store object.

Parameters

input – Any object to mark.

Returns

A reactive function or object.

Examples:

Use mark on primitive types:

>>> x = mark(1)
>>> # x is now a `Store` object

Use mark on complex types:

>>> x = mark([1, 2, 3])

Use mark on instances of classes:

>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 2, 3]})
>>> x: Store = mark(df)
>>> y = x.head()
>>> class Foo:
...     def __init__(self, x):
...         self.x = x
...     def __call__(self):
...         return self.x + 1
>>> f = Foo(1)
>>> x = mark(f)

Use mark on functions:

>>> aggregation = mark(mean)
endpoint(fn: Optional[Callable] = None, prefix: Optional[Union[str, fastapi.routing.APIRouter]] = None, route: Optional[str] = None, method: str = 'POST') meerkat.interactive.endpoint.Endpoint[source]

Decorator to mark a function as an endpoint.

An endpoint is a function that can be called to
  • update the value of a Store (e.g. incrementing a counter)

  • update a DataFrame (e.g. adding a new row)

  • run a computation and return its result to the frontend

  • run a function in response to a frontend event (e.g. a button click)

Endpoints differ from reactive functions in that they are not automatically triggered by changes in their inputs. Instead, they are triggered by explicit calls to the endpoint function.

The Store and DataFrame objects that are modified inside the endpoint function will automatically trigger reactive functions that depend on them.

@endpoint
def increment(count: Store, step: int = 1):
    count.set(count + step)
    # ^ update the count Store, which will trigger operations
    #   that depend on it

# Create a button that calls the increment endpoint
counter = Store(0)
button = Button(on_click=increment(counter))
# ^ read this as: call the increment endpoint with the `counter`
# Store when the button is clicked
Parameters
  • fn – The function to decorate.

  • prefix – The prefix to add to the route. If a string, it will be prepended to the route. If an APIRouter, the route will be added to the router.

  • route – The route to add to the endpoint. If not specified, the route will be the name of the function.

  • method – The HTTP method to use for the endpoint. Defaults to “POST”.

Returns

The decorated function, as an Endpoint object.

class magic(magic: bool = True)[source]

A context manager and decorator that changes the behavior of Store objects inside it. All methods, properties and public attributes of Store objects will be wrapped in @reactive decorators.


column(data: Sequence, scalar_backend: Optional[str] = None) meerkat.columns.abstract.Column[source]

Create a Meerkat column from data.

The Meerkat column type is inferred from the type and structure of the data passed in.
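
A minimal sketch; the inferred column types noted in the comments are assumptions based on the data passed in:

import numpy as np
import meerkat as mk

col1 = mk.column([1, 2, 3])        # scalar data, expected to infer a ScalarColumn
col2 = mk.column(np.ones((3, 4)))  # array data, expected to infer a TensorColumn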

class Column(data: Sequence = None, collate_fn: Callable = None, formatters: FormatterGroup = None, *args, **kwargs)[source]

An abstract class for Meerkat columns.

property data

Get the underlying data.

filter(function: Callable, with_indices=False, input_columns: Optional[Union[str, List[str]]] = None, is_batched_fn: bool = False, batch_size: Optional[int] = 1, drop_last_batch: bool = False, num_workers: Optional[int] = 0, materialize: bool = True, **kwargs) Optional[meerkat.columns.abstract.Column][source]

Filter the elements of the column using a function.

sort(ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.columns.abstract.Column[source]

Return a sorted view of the column.

Parameters
  • ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.

  • kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.

Returns

A view of the column with the sorted data.

Return type

Column

argsort(ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.columns.abstract.Column[source]

Return the indices that would sort the column.

Parameters
  • ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.

  • kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.

Returns

A column containing the indices that would sort the column.

Return type

Column

sample(n: Optional[int] = None, frac: Optional[float] = None, replace: bool = False, weights: Optional[Union[str, numpy.ndarray]] = None, random_state: Optional[Union[int, numpy.random.mtrand.RandomState]] = None) meerkat.columns.abstract.Column[source]

Select a random sample of rows from the Column. Roughly equivalent to sample in Pandas (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html).

Parameters
  • n (int) – Number of samples to draw. If frac is specified, this parameter should not be passed. Defaults to 1 if frac is not passed.

  • frac (float) – Fraction of rows to sample. If n is specified, this parameter should not be passed.

  • replace (bool) – Sample with or without replacement. Defaults to False.

  • weights (np.ndarray) – Weights to use for sampling. If None (default), the rows will be sampled uniformly. If a numpy array, the sample will be weighted accordingly. If weights do not sum to 1 they will be normalized to sum to 1.

  • random_state (Union[int, np.random.RandomState]) – Random state or seed to use for sampling.

Returns

A random sample of rows from the Column.

Return type

Column

is_equal(other: meerkat.columns.abstract.Column) bool[source]

Tests whether two columns are equal.

Parameters

other (Column) – The column to compare against.

batch(batch_size: int = 1, drop_last_batch: bool = False, collate: bool = True, num_workers: int = 0, materialize: bool = True, *args, **kwargs)[source]

Batch the column.

Parameters
  • batch_size – integer batch size

  • drop_last_batch – drop the last batch if it is smaller than batch_size

  • collate – whether to collate the returned batches

Returns

batches of data

classmethod from_data(data: Union[Columnable, Column])[source]

Convert data to a meerkat column using the appropriate Column type.

head(n: int = 5) meerkat.columns.abstract.Column[source]

Get the first n examples of the column.

tail(n: int = 5) meerkat.columns.abstract.Column[source]

Get the last n examples of the column.

to_pandas(allow_objects: bool = False) pandas.core.series.Series[source]

Convert the column to a Pandas Series.

If the column cannot be converted to a Pandas Series, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as a Pandas Series.

Return type

pd.Series

to_arrow() pyarrow.lib.Array[source]

Convert the column to an Arrow Array.

If the column cannot be converted to an Arrow Array, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as an Arrow Array.

Return type

pa.Array

to_torch() torch.Tensor[source]

Convert the column to a PyTorch Tensor.

If the column cannot be converted to a PyTorch Tensor, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as a PyTorch Tensor.

Return type

torch.Tensor

to_numpy() numpy.ndarray[source]

Convert the column to a Numpy array.

If the column cannot be converted to a Numpy array, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as a Numpy array.

Return type

np.ndarray

to_json() dict[source]

Convert the column to a JSON object.

If the column cannot be converted to a JSON object, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as a JSON object.

Return type

dict

class ObjectColumn(data: Optional[Sequence] = None, *args, **kwargs)[source]
batch(batch_size: int = 1, drop_last_batch: bool = False, collate: bool = True, *args, **kwargs)[source]

Batch the column.

Parameters
  • batch_size – integer batch size

  • drop_last_batch – drop the last batch if it is smaller than batch_size

  • collate – whether to collate the returned batches

Returns

batches of data

is_equal(other: meerkat.columns.abstract.Column) bool[source]

Tests whether two columns are equal.

Parameters

other (Column) – The column to compare against.

to_pandas(allow_objects: bool = False) pandas.core.series.Series[source]

Convert the column to a Pandas Series.

If the column cannot be converted to a Pandas Series, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as a Pandas Series.

Return type

pd.Series

to_numpy()[source]

Convert the column to a Numpy array.

If the column cannot be converted to a Numpy array, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as a Numpy array.

Return type

np.ndarray

class ScalarColumn(data: Optional[Union[numpy.ndarray, torch.TensorType, pandas.core.series.Series, List]] = None, backend: Optional[str] = None)[source]
class PandasScalarColumn(data: Optional[Union[numpy.ndarray, torch.TensorType, pandas.core.series.Series, List]] = None, backend: Optional[str] = None)[source]
dt

alias of meerkat.columns.scalar.pandas._MeerkatCombinedDatetimelikeProperties

cat

alias of meerkat.columns.scalar.pandas._MeerkatCategoricalAccessor

sort(ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.columns.scalar.pandas.PandasScalarColumn[source]

Return a sorted view of the column.

Parameters
  • ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.

  • kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.

Returns

A view of the column with the sorted data.

Return type

PandasScalarColumn

argsort(ascending: bool = True, kind: str = 'quicksort') meerkat.columns.scalar.pandas.PandasScalarColumn[source]

Return the indices that would sort the column.

Parameters
  • ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.

  • kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.

Returns

A column containing the indices that would sort the column.

Return type

PandasScalarColumn

Note: currently raises an error if the input array has more than one dimension.

to_tensor() torch.Tensor[source]

Use column.to_tensor() instead of torch.tensor(column), which is very slow.

to_numpy() numpy.ndarray[source]

Convert the column to a Numpy array.

If the column cannot be converted to a Numpy array, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as a Numpy array.

Return type

np.ndarray

to_pandas(allow_objects: bool = False) pandas.core.series.Series[source]

Convert the column to a Pandas Series.

If the column cannot be converted to a Pandas Series, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as a Pandas Series.

Return type

pd.Series

to_arrow() pyarrow.lib.Array[source]

Convert the column to an Arrow Array.

If the column cannot be converted to an Arrow Array, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as an Arrow Array.

Return type

pa.Array

is_equal(other: meerkat.columns.abstract.Column) bool[source]

Tests whether two columns are equal.

Parameters

other (Column) – The column to compare against.

to_json() List[Any][source]

Convert the column to a JSON object.

If the column cannot be converted to a JSON object, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as a JSON object.

Return type

list

class ArrowScalarColumn(data: Optional[Union[numpy.ndarray, torch.TensorType, pandas.core.series.Series, List]] = None, backend: Optional[str] = None)[source]
is_equal(other: meerkat.columns.abstract.Column) bool[source]

Tests whether two columns are equal.

Parameters

other (Column) – The column to compare against.

to_numpy()[source]

Convert the column to a Numpy array.

If the column cannot be converted to a Numpy array, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as a Numpy array.

Return type

np.ndarray

to_pandas(allow_objects: bool = False)[source]

Convert the column to a Pandas Series.

If the column cannot be converted to a Pandas Series, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as a Pandas Series.

Return type

pd.Series

to_arrow() pyarrow.lib.Array[source]

Convert the column to an Arrow Array.

If the column cannot be converted to an Arrow Array, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as an Arrow Array.

Return type

pa.Array

class TensorColumn(data: TensorColumnTypes = None, backend: str = None)[source]
class NumPyTensorColumn(data: TensorColumnTypes = None, backend: str = None)[source]
is_equal(other: meerkat.columns.abstract.Column) bool[source]

Tests whether two columns are equal.

Parameters

other (Column) – The column to compare against.

sort(ascending: Union[bool, List[bool]] = True, axis: int = -1, kind: str = 'quicksort', order: Optional[Union[str, List[str]]] = None) meerkat.columns.tensor.numpy.NumPyTensorColumn[source]

Return a sorted view of the column.

Parameters
  • ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.

  • kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.

Returns

A view of the column with the sorted data.

Return type

Column

argsort(ascending: bool = True, kind: str = 'quicksort') meerkat.columns.tensor.numpy.NumPyTensorColumn[source]

Return the indices that would sort the column.

Parameters
  • ascending (bool) – Whether to sort in ascending or descending order.

  • kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.

Returns

A column containing the indices that would sort the column.

Return type

NumPyTensorColumn

Note: currently raises an error if the input array has more than one dimension.

to_torch() torch.Tensor[source]

Convert the column to a PyTorch Tensor.

If the column cannot be converted to a PyTorch Tensor, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as a PyTorch Tensor.

Return type

torch.Tensor

to_pandas(allow_objects: bool = True) pandas.core.series.Series[source]

Convert the column to a Pandas Series.

If the column cannot be converted to a Pandas Series, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as a Pandas Series.

Return type

pd.Series

to_arrow() pyarrow.lib.Array[source]

Convert the column to an Arrow Array.

If the column cannot be converted to an Arrow Array, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as an Arrow Array.

Return type

pa.Array

to_numpy() numpy.ndarray[source]

Convert the column to a Numpy array.

If the column cannot be converted to a Numpy array, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as a Numpy array.

Return type

np.ndarray

to_json() List[Any][source]

Convert the column to a JSON object.

If the column cannot be converted to a JSON object, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as a JSON object.

Return type

list

class TorchTensorColumn(data: TensorColumnTypes = None, backend: str = None)[source]
classmethod from_data(data: Union[Sequence, numpy.ndarray, pandas.core.series.Series, torch.Tensor, meerkat.columns.abstract.Column])[source]

Convert data to a TorchTensorColumn.

sort(ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.columns.tensor.torch.TorchTensorColumn[source]

Return a sorted view of the column.

Parameters
  • ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.

  • kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.

Returns

A view of the column with the sorted data.

Return type

Column

argsort(ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.columns.tensor.torch.TorchTensorColumn[source]

Return the indices that would sort the column.

Parameters
  • ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.

  • kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.

Returns

A column containing the indices that would sort the column.

Return type

TorchTensorColumn

Note: currently raises an error if the input array has more than one dimension.

is_equal(other: meerkat.columns.abstract.Column) bool[source]

Tests whether two columns are equal.

Parameters

other (Column) – The column to compare against.

to_pandas(allow_objects: bool = True) pandas.core.series.Series[source]

Convert the column to a Pandas Series.

If the column cannot be converted to a Pandas Series, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as a Pandas Series.

Return type

pd.Series

to_numpy() numpy.ndarray[source]

Convert the column to a Numpy array.

If the column cannot be converted to a Numpy array, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as a Numpy array.

Return type

np.ndarray

to_arrow() pyarrow.lib.Array[source]

Convert the column to an Arrow Array.

If the column cannot be converted to an Arrow Array, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as an Arrow Array.

Return type

pa.Array

class DeferredColumn(data: Union[meerkat.block.deferred_block.DeferredOp, meerkat.block.abstract.BlockView], output_type: Optional[Type[meerkat.columns.abstract.Column]] = None, *args, **kwargs)[source]
property fn: Callable

Subclasses like ImageColumn should be able to implement their own version.

is_equal(other: meerkat.columns.abstract.Column) bool[source]

Tests whether two columns are equal.

Parameters

other (Column) – The column to compare against.

class FileColumn(data: Sequence[str] = None, type: str = None, loader: callable = None, downloader: Union[callable | str] = None, base_dir: str = None, cache_dir: str = None, formatters: FormatterGroup = None, *args, **kwargs)[source]

A column where each cell represents a file stored on disk or on the web. The underlying data is a PandasSeriesColumn of strings, where each string is the path to a file. The column materializes the files into memory when indexed. If the column is lazily indexed with the lz indexer, the files are not materialized and a FileCell or a FileColumn is returned instead.

Parameters
  • data (Sequence[str]) – A list of filepaths.

  • loader (Union[str, Callable[[Union[str, IO]], Any]]) – a callable that accepts a filepath or an I/O stream and returns data.

  • base_dir (str, optional) –

    an absolute path to a directory containing the files. If provided, the filepath to be loaded will be joined with the base_dir. As such, this argument should only be used if the loader will be applied to relative paths.

    The base_dir can also include environment variables (e.g. $DATA_DIR/images) which will be expanded prior to loading. This is useful when sharing DataFrames between machines.

  • downloader (Union[str, callable], optional) –

    a callable that accepts at least two positional arguments - a URI and a destination (which could be either a string or file object).

    Meerkat includes a small set of built-in downloaders [“url”, “gcs”] which can be specified via string.

  • cache_dir (str, optional) – the directory on disk where downloaded files are to be cached. Defaults to None, in which case files will be re-downloaded on every access of the data. The cache_dir can also include environment variables (e.g. $DATA_DIR/images) which will be expanded prior to loading. This is useful when sharing DataFrames between machines.
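
A minimal sketch, assuming FileColumn is available from the top-level meerkat namespace and that the paths and $DATA_DIR environment variable are hypothetical:

import meerkat as mk
from PIL import Image

col = mk.FileColumn(
    data=["0.jpg", "1.jpg"],      # paths relative to base_dir
    loader=Image.open,            # callable applied to each filepath on access
    base_dir="$DATA_DIR/images",  # environment variable expanded prior to loading
)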

is_equal(other: meerkat.columns.abstract.Column) bool[source]

Tests whether two columns are equal.

Parameters

other (Column) – The column to compare against.

class ImageColumn(data: Sequence[str] = None, type: str = None, loader: callable = None, downloader: Union[callable | str] = None, base_dir: str = None, cache_dir: str = None, formatters: FormatterGroup = None, *args, **kwargs)[source]

DEPRECATED. A column where each cell represents an image stored on disk. The underlying data is a PandasSeriesColumn of strings, where each string is the path to an image. The column materializes the images into memory when indexed. If the column is lazily indexed with the lz indexer, the images are not materialized and an ImageCell or an ImageColumn is returned instead.

Parameters
  • data (Sequence[str]) – A list of filepaths to images.

  • transform (callable) –

    A function that transforms the image (e.g. torchvision.transforms.functional.center_crop).

    Warning

    In order for the column to be serializable, the transform function must be pickleable.

  • loader (callable) –

    A callable with signature def loader(filepath: str) -> PIL.Image:. Defaults to torchvision.datasets.folder.default_loader.

    Warning

    In order for the column to be serializable with write(), the loader function must be pickleable.

  • base_dir (str) – A base directory that the paths in data are relative to. If None, the paths are assumed to be absolute.

class AudioColumn(data: Sequence[str] = None, type: str = None, loader: callable = None, downloader: Union[callable | str] = None, base_dir: str = None, cache_dir: str = None, formatters: FormatterGroup = None, *args, **kwargs)[source]

A lambda column where each cell represents an audio file on disk. The underlying data is a PandasSeriesColumn of strings, where each string is the path to an audio file. The column materializes the audio into memory when indexed. If the column is lazily indexed with the lz indexer, the files are not materialized and a FileCell or an AudioColumn is returned instead.

Parameters
  • data (Sequence[str]) – A list of filepaths to audio files.

  • transform (callable) –

    A function that transforms the raw audio data.

    Warning

    In order for the column to be serializable, the transform function must be pickleable.

  • loader (callable) –

    A callable with signature def loader(filepath: str) -> PIL.Image:. Defaults to torchvision.datasets.folder.default_loader.

    Warning

    In order for the column to be serializable with write(), the loader function must be pickleable.

  • base_dir (str) – A base directory that the paths in data are relative to. If None, the paths are assumed to be absolute.

collate(batch)[source]

Collate data.

class AbstractCell(*args, **kwargs)[source]
get(*args, **kwargs) object[source]

Get the object that this cell represents.

property metadata: dict

Get the metadata associated with this cell.

class DeferredCell(data: meerkat.block.deferred_block.DeferredCellOp)[source]
property data: object

Get the data associated with this cell.

get(*args, **kwargs)[source]

Get the object that this cell represents.

class FileCell(data: meerkat.block.deferred_block.DeferredCellOp)[source]
map(data: Union[DataFrame, Column], function: Callable, is_batched_fn: bool = False, batch_size: int = 1, inputs: Union[Mapping[str, str], Sequence[str]] = None, outputs: Union[Mapping[any, str], Sequence[str]] = None, output_type: Union[Mapping[str, Type[Column]], Type[Column]] = None, materialize: bool = True, use_ray: bool = False, num_blocks: int = 100, blocks_per_window: int = 10, pbar: bool = False, **kwargs)[source]

Create a new Column or DataFrame by applying a function to each row in data.

This function shares nearly the exact same signature with defer(); the difference is that defer() returns a column that has not yet been computed. It is a placeholder for a column that will be computed later.

Learn more in the user guide: Mapping: map and defer.


What gets returned by map?

  • If function returns a single value, then map will return a Column object.

  • If function returns a dictionary, then map will return a DataFrame. The keys of the dictionary are used as column names. The outputs argument can be used to override the column names.

  • If function returns a tuple, then map will return a DataFrame. The column names will be integers. The column names can be overridden by passing a tuple to the outputs argument.

  • If function returns a tuple or a dictionary, then passing "single" to the outputs argument will cause map to return a single ObjectColumn.

Note

This function is also available as a method of DataFrame and Column under the name map.

Parameters
  • data (DataFrame) – The DataFrame or Column to which the function will be applied.

  • function (Callable) – The function that will be applied to the rows of data.

  • is_batched_fn (bool, optional) – Whether the function must be applied on a batch of rows. Defaults to False.

  • batch_size (int, optional) – The size of the batch. Defaults to 1.

  • inputs (Dict[str, str], optional) – Dictionary mapping column names in data to keyword arguments of function. Ignored if data is a column. When calling function values from the columns will be fed to the corresponding keyword arguments. Defaults to None, in which case it inspects the signature of the function. It then finds the columns with the same names in the DataFrame and passes the corresponding values to the function. If the function takes a non-default argument that is not a column in the DataFrame, the operation will raise a ValueError.

  • outputs (Union[Dict[any, str], Tuple[str]], optional) –

    Controls how the output of function is mapped to the output of map(). Defaults to None.

    • If None: the output is inferred from the return type of the function. See explanation above.

    • If "single": a single DeferredColumn is returned.

    • If a Dict[any, str]: then a DataFrame is returned. This is useful when the output of function is a Dict. outputs maps the outputs of function to column names in the resulting DataFrame.

    • If a Tuple[str]: then a DataFrame is returned. This is useful when the output of function is a Tuple. outputs maps the outputs of function to column names in the resulting DataFrame.

  • output_type (Union[Dict[str, type], type], optional) – The type to coerce the output column(s) to. Defaults to None.

  • materialize (bool, optional) – Whether to materialize the input column(s). Defaults to True.

  • use_ray (bool) – Use Ray to parallelize the computation. Defaults to False.

  • num_blocks (int) – When using Ray, the number of blocks to split the data into. Defaults to 100.

  • blocks_per_window (int) – When using Ray, the number of blocks to process in a single Ray task. Defaults to 10.

  • pbar (bool) – Show a progress bar. Defaults to False.

Returns

A Column or a DataFrame.

Return type

Union[DataFrame, Column]

Examples

We start with a small DataFrame of voters with two columns: birth_year, which contains the birth year of each person, and residence, which contains the state in which each person lives.

In [1]: import datetime

In [2]: import meerkat as mk

In [3]: df = mk.DataFrame({
   ...:     "birth_year": [1967, 1993, 2010, 1985, 2007, 1990, 1943],
   ...:     "residence": ["MA", "LA", "NY", "NY", "MA", "MA", "LA"]
   ...: })
   ...: 

Single input column. Map a column of birth years to a column of ages.

In [4]: df["age"] = df["birth_year"].map(
   ...:     lambda x: datetime.datetime.now().year - x
   ...: )
   ...: 

In [5]: df["age"]
Out[5]: column([56, 30, 13, 38, 16, 33, ...], backend=PandasScalarColumn)

Multiple input columns. Compute eligibility (ma_eligible) from the age and residence columns.

In [6]: df["ma_eligible"] = df.map(
   ...:     lambda age, residence: (residence == "MA") and (age >= 18)
   ...: )
   ...: 

In [7]: df["ma_eligible"]
Out[7]: column([True, False, False, False, False, True, ...], backend=PandasScalarColumn)
defer(data: Union[DataFrame, Column], function: Callable, is_batched_fn: bool = False, batch_size: int = 1, inputs: Union[Mapping[str, str], Sequence[str]] = None, outputs: Union[Mapping[any, str], Sequence[str]] = None, output_type: Union[Mapping[str, Type[Column]], Type[Column]] = None, materialize: bool = True) Union[DataFrame, DeferredColumn][source]

Create one or more DeferredColumns that lazily applies a function to each row in data.

This function shares nearly the exact same signature with map(), the difference is that defer() returns a column that has not yet been computed. It is a placeholder for a column that will be computed later.

Learn more in the user guide: Deferred map and chaining.


What gets returned by defer?

  • If function returns a single value, then defer will return a DeferredColumn object.

  • If function returns a dictionary, then defer will return a DataFrame containing DeferredColumn objects. The keys of the dictionary are used as column names. The outputs argument can be used to override the column names.

  • If function returns a tuple, then defer will return a DataFrame containing DeferredColumn objects. The column names will be integers. The column names can be overriden by passing a tuple to the outputs argument.

  • If function returns a tuple or a dictionary, then passing "single" to the outputs argument will cause defer to return a single DeferredColumn that materializes to a ObjectColumn.

How do you execute the deferred map?

Depending on function and the outputs argument, returns either a DeferredColumn or a DataFrame. Both are callables. To execute the deferred map, simply call the returned object.

Note

This function is also available as a method of DataFrame and Column under the name defer.

Parameters
  • data (DataFrame) – The DataFrame or Column to which the function will be applied.

  • function (Callable) – The function that will be applied to the rows of data.

  • is_batched_fn (bool, optional) – Whether the function must be applied on a batch of rows. Defaults to False.

  • batch_size (int, optional) – The size of the batch. Defaults to 1.

  • inputs (Dict[str, str], optional) – Dictionary mapping column names in data to keyword arguments of function. Ignored if data is a column. When calling function values from the columns will be fed to the corresponding keyword arguments. Defaults to None, in which case it inspects the signature of the function. It then finds the columns with the same names in the DataFrame and passes the corresponding values to the function. If the function takes a non-default argument that is not a column in the DataFrame, the operation will raise a ValueError.

  • outputs (Union[Dict[any, str], Tuple[str]], optional) –

    Controls how the output of function is mapped to the output of defer(). Defaults to None.

    • If None: the output is inferred from the return type of the function. See explanation above.

    • If "single": a single DeferredColumn is returned.

    • If a Dict[any, str]: then a DataFrame containing DeferredColumns is returned. This is useful when the output of function is a Dict. outputs maps the outputs of function to column names in the resulting DataFrame.

    • If a Tuple[str]: then a DataFrame containing DeferredColumns is returned. This is useful when the output of function is a Tuple. outputs maps the outputs of function to column names in the resulting DataFrame.

  • output_type (Union[Dict[str, type], type], optional) – The type to coerce the output column(s) to. Defaults to None.

  • materialize (bool, optional) – Whether to materialize the input column(s). Defaults to True.

Returns

A DeferredColumn, or a DataFrame containing DeferredColumns, representing the deferred map.

Return type

Union[DataFrame, DeferredColumn]

Examples

We start with a small DataFrame of voters with two columns: birth_year, which contains the birth year of each person, and residence, which contains the state in which each person lives.

In [1]: import datetime

In [2]: import meerkat as mk

In [3]: df = mk.DataFrame({
   ...:     "birth_year": [1967, 1993, 2010, 1985, 2007, 1990, 1943],
   ...:     "residence": ["MA", "LA", "NY", "NY", "MA", "MA", "LA"]
   ...: })
   ...: 

Single input column. Lazily map a column of birth years to a column of ages.

In [4]: df["age"] = df["birth_year"].defer(
   ...:     lambda x: datetime.datetime.now().year - x
   ...: )
   ...: 

In [5]: df["age"]
Out[5]: column([DeferredCell(fn=<lambda>), DeferredCell(fn=<lambda>), DeferredCell(fn=<lambda>), DeferredCell(fn=<lambda>), DeferredCell(fn=<lambda>), DeferredCell(fn=<lambda>), ...], backend=DeferredColumn)

We can materialize the deferred map (i.e. run it) by calling the column.

In [6]: df["age"]()
Out[6]: column([56, 30, 13, 38, 16, 33, ...], backend=PandasScalarColumn)

Multiple input columns. Lazily compute eligibility (ma_eligible) from the age and residence columns.

In [7]: df["ma_eligible"] = df.defer(
   ...:     lambda age, residence: (residence == "MA") and (age >= 18)
   ...: )
   ...: 

In [8]: df["ma_eligible"]()
Out[8]: column([True, False, False, False, False, True, ...], backend=PandasScalarColumn)
concat(objs: Union[Sequence[meerkat.dataframe.DataFrame], Sequence[meerkat.columns.abstract.Column]], axis: Union[str, int] = 'rows', suffixes: Tuple[str] = None, overwrite: bool = False) Union[meerkat.dataframe.DataFrame, meerkat.columns.abstract.Column][source]

Concatenate a sequence of columns or a sequence of DataFrames. If the sequence is empty, returns an empty DataFrame.

  • If concatenating columns, all columns must be of the same type (e.g. all ListColumn).

  • If concatenating DataFrames along axis 0 (rows), all DataFrames must have the same set of columns.

  • If concatenating DataFrames along axis 1 (columns), all DataFrames must have the same length and cannot share any column names.

Parameters
  • objs (Union[Sequence[DataFrame], Sequence[AbstractColumn]]) – sequence of columns or DataFrames.

  • axis (Union[str, int]) – The axis along which to concatenate. Ignored if concatenating columns.

Returns

concatenated DataFrame or column

Return type

Union[DataFrame, AbstractColumn]
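
For example, stacking two DataFrames with the same set of columns row-wise (df_train and df_test are hypothetical):

import meerkat as mk

df_all = mk.concat([df_train, df_test], axis="rows")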

complete(df: meerkat.dataframe.DataFrame, prompt: str, engine: str, batch_size: int = 1, use_ray: bool = False, num_blocks: int = 100, blocks_per_window: int = 10, pbar: bool = False, client_connection: Optional[str] = None, cache_connection: str = '~/.manifest/cache.sqlite') meerkat.columns.scalar.abstract.ScalarColumn[source]

Apply a generative language model to each row in a DataFrame.

Parameters
  • df (DataFrame) – The DataFrame to which the function will be applied.

  • prompt (str) – The prompt to apply to each row of the DataFrame.

  • engine (str) – The name of the language model engine to use.

  • batch_size (int, optional) – The size of the batch. Defaults to 1.

  • materialize (bool, optional) – Whether to materialize the input column(s). Defaults to True.

  • use_ray (bool) – Use Ray to parallelize the computation. Defaults to False.

  • num_blocks (int) – When using Ray, the number of blocks to split the data into. Defaults to 100.

  • blocks_per_window (int) – When using Ray, the number of blocks to process in a single Ray task. Defaults to 10.

  • pbar (bool) – Show a progress bar. Defaults to False.

  • client_connection – The connection string for the client. This is typically the key (e.g. OPENAI). If it is not provided, it will be inferred from the engine.

  • cache_connection – The sqlite connection string for the cache.

Returns

A column containing the model's completion for each row.

Return type

ScalarColumn

merge(left: meerkat.dataframe.DataFrame, right: meerkat.dataframe.DataFrame, how: str = 'inner', on: Union[str, List[str]] = None, left_on: Union[str, List[str]] = None, right_on: Union[str, List[str]] = None, sort: bool = False, suffixes: Sequence[str] = ('_x', '_y'), validate=None) meerkat.dataframe.DataFrame[source]

Perform a database-style join operation between two DataFrames.

Parameters
  • left (DataFrame) – Left DataFrame.

  • right (DataFrame) – Right DataFrame.

  • how (str, optional) – The join type. Defaults to “inner”.

  • on (Union[str, List[str]], optional) – The columns(s) to join on. These columns must be ScalarColumn. Defaults to None, in which case the left_on and right_on parameters must be passed.

  • left_on (Union[str, List[str]], optional) – The column(s) in the left DataFrame to join on. These columns must be ScalarColumn. Defaults to None.

  • right_on (Union[str, List[str]], optional) – The column(s) in the right DataFrame to join on. These columns must be ScalarColumn. Defaults to None.

  • sort (bool, optional) – Whether to sort the result DataFrame by the join key(s). Defaults to False.

  • suffixes (Sequence[str], optional) – Suffixes to use in the case there are conflicting column names in the result DataFrame. Should be a sequence of length two, with suffixes[0] the suffix for the column from the left DataFrame and suffixes[1] the suffix for the right. Defaults to (“_x”, “_y”).

  • validate (str, optional) –

    The check to perform on the result DataFrame. Defaults to None, in which case no check is performed. Valid options are:

    • “one_to_one” or “1:1”: check if merge keys are unique in both left and right datasets.

    • “one_to_many” or “1:m”: check if merge keys are unique in left dataset.

    • “many_to_one” or “m:1”: check if merge keys are unique in right dataset.

    • “many_to_many” or “m:m”: allowed, but does not result in checks.

Returns

The merged DataFrame.

Return type

DataFrame
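
For example, an inner join on a shared key column, assuming both frames have a ScalarColumn named "id" (left_df and right_df are hypothetical):

import meerkat as mk

joined = mk.merge(left_df, right_df, on="id", how="inner")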

embed(data: Union[meerkat.dataframe.DataFrame, meerkat.columns.abstract.Column, str, PIL.Image.Image], input: Optional[str] = None, encoder: Union[str, meerkat.ops.embed.encoder.Encoder] = 'clip', modality: Optional[str] = None, out_col: Optional[str] = None, device: Union[int, str] = 'auto', mmap_dir: Optional[str] = None, num_workers: int = 0, batch_size: int = 128, pbar: bool = True, **kwargs) Union[meerkat.dataframe.DataFrame, meerkat.columns.abstract.Column][source]

Embed a column of data with an encoder from the encoder registry.

Examples

Suppose you have an Image dataset (e.g. Imagenette, CIFAR-10) loaded into a Meerkat DataFrame. You can embed the images in the dataset with CLIP using a code snippet like:

import meerkat as mk

df = mk.datasets.get("imagenette")

df = mk.embed(
    data=df,
    input="img",
    encoder="clip"
)
Parameters
  • data (Union[mk.DataFrame, mk.AbstractColumn]) – A dataframe or column containing the data to embed.

  • input (str, optional) – If data is a dataframe, the name of the column to embed. If data is a column, then the parameter is ignored. Defaults to None.

  • encoder (Union[str, Encoder], optional) – Name of the encoder to use. List supported encoders with domino.encoders. Defaults to “clip”. Alternatively, pass an Encoder object containing a custom encoder.

  • modality (str, optional) – The modality of the data to be embedded. Defaults to None, in which case the modality is inferred from the type of the input column.

  • out_col (str, optional) – The name of the column where the embeddings are stored. Defaults to None, in which case it is "{encoder}({input_col})".

  • device (Union[int, str], optional) – The device on which to run the encoder. Defaults to “auto”.

  • mmap_dir (str, optional) – The path to directory where a memory-mapped file containing the embeddings will be written. Defaults to None, in which case the embeddings are not memmapped.

  • num_workers (int, optional) – Number of worker processes used to load the data from disk. Defaults to 0.

  • batch_size (int, optional) – Size of the batches used when encoding the data. Defaults to 128.

  • pbar (bool, optional) – Whether to display a progress bar. Defaults to True.

  • **kwargs – Additional keyword arguments are passed to the encoder. To see supported arguments for each encoder, see the encoder documentation (e.g. clip()).

Returns

A view of data with a new column containing the embeddings. This column will be named according to the out_col parameter.

Return type

mk.DataFrame

sort(data: meerkat.dataframe.DataFrame, by: Union[str, List[str]], ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.dataframe.DataFrame[source]

Sort a DataFrame or Column. If a DataFrame, sort by the values in the specified columns. Similar to sort_values in pandas.

Parameters
  • data (Union[DataFrame, AbstractColumn]) – DataFrame or Column to sort.

  • by (Union[str, List[str]]) – The columns to sort by. Ignored if data is a Column.

  • ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, it must be the same length as by. Defaults to True.

  • kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.

Returns

A sorted view of DataFrame.

Return type

DataFrame
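
Example

A short sketch, assuming df is a DataFrame with hypothetical "score" and "id" columns:

# Sort by "score" descending, breaking ties by "id" ascending.
sorted_df = mk.sort(df, by=["score", "id"], ascending=[False, True])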

sample(data: Union[meerkat.dataframe.DataFrame, meerkat.columns.abstract.Column], n: int = None, frac: float = None, replace: bool = False, weights: Union[str, numpy.ndarray] = None, random_state: Union[int, numpy.random.mtrand.RandomState] = None) Union[meerkat.dataframe.DataFrame, meerkat.columns.abstract.Column][source]

Select a random sample of rows from DataFrame or Column. Roughly equivalent to sample in Pandas https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html.

Parameters
  • data (Union[DataFrame, AbstractColumn]) – DataFrame or Column to sample from.

  • n (int) – Number of samples to draw. If frac is specified, this parameter should not be passed. Defaults to 1 if frac is not passed.

  • frac (float) – Fraction of rows to sample. If n is specified, this parameter should not be passed.

  • replace (bool) – Sample with or without replacement. Defaults to False.

  • weights (Union[str, np.ndarray]) – Weights to use for sampling. If None (default), the rows will be sampled uniformly. If a numpy array, the sample will be weighted accordingly. If a string and data is a DataFrame, the weights will be taken from the column with the specified name. If the weights do not sum to 1, they will be normalized to sum to 1.

  • random_state (Union[int, np.random.RandomState]) – Random state or seed to use for sampling.

Returns

A random sample of rows from the DataFrame or Column.

Return type

Union[DataFrame, AbstractColumn]
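
Example

A sketch assuming df has a hypothetical "importance" column holding sampling weights:

# Draw 100 rows without replacement, weighted by the "importance" column.
sampled = mk.sample(df, n=100, weights="importance", random_state=42)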

shuffle(data: Union[meerkat.dataframe.DataFrame, meerkat.columns.abstract.Column], seed=None) Union[meerkat.dataframe.DataFrame, meerkat.columns.abstract.Column][source]

Shuffle the rows of a DataFrame or Column.

Shuffling is done out-of-place using numpy.

Parameters
  • data (Union[DataFrame, Column]) – DataFrame or Column to shuffle.

  • seed (int) – Seed to use for shuffling.

Returns

Shuffled DataFrame or Column.

Return type

Union[DataFrame, Column]
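
Example

For instance, to get a reproducible shuffle:

shuffled = mk.shuffle(df, seed=0)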

groupby(data: meerkat.dataframe.DataFrame, by: Union[str, Sequence[str]] = None) meerkat.ops.sliceby.groupby.GroupBy[source]

Perform a groupby operation on a DataFrame or Column (similar to the DataFrame.groupby and Series.groupby operations in Pandas).

Parameters
  • data (Union[DataFrame, AbstractColumn]) – The data to group.

  • by (Union[str, Sequence[str]]) – The column(s) to group by. Ignored if data is a Column.

Returns

A GroupBy object.

Return type

Union[DataFrameGroupBy, AbstractColumnGroupBy]
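
Example

A sketch assuming df has hypothetical "label" and "score" columns, and that the resulting GroupBy supports column indexing and a mean() aggregation, as in Pandas:

gb = mk.groupby(df, by="label")
mean_scores = gb["score"].mean()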

clusterby(data: DataFrame, by: Union[str, Sequence[str]], method: Union[str, 'ClusterMixin'] = 'KMeans', encoder: str = 'clip', modality: str = None, **kwargs) ClusterBy[source]

Perform a clusterby operation on a DataFrame.

Parameters
  • data (DataFrame) – The dataframe to cluster.

  • by (Union[str, Sequence[str]]) – The column(s) to cluster by. These columns will be embedded using the encoder and the resulting embedding will be used.

  • method (Union[str, "ClusterMixin"]) – The clustering method to use. Defaults to “KMeans”.

  • encoder (str) – The encoder to use for the embedding. Defaults to clip.

  • modality (str, optional) – The modality of the data to be embedded. Defaults to None, in which case the modality is inferred from the type of the column(s) being embedded.

  • **kwargs – Additional keyword arguments to pass to the clustering method.

Returns

A ClusterBy object.

Return type

ClusterBy
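
Example

A sketch for clustering an image DataFrame by a hypothetical "img" column:

cb = mk.clusterby(df, by="img", method="KMeans", encoder="clip")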

explainby(data: DataFrame, by: Union[str, Sequence[str]], target: Union[str, Mapping[str]], method: Union[str, 'domino.Slicer'] = 'MixtureSlicer', encoder: str = 'clip', modality: str = None, scores: bool = False, use_cache: bool = True, output_col: str = None, **kwargs) ExplainBy[source]

Perform an explainby operation on a DataFrame.

Parameters
  • data (DataFrame) – The dataframe to explain.

  • by (Union[str, Sequence[str]]) – The column(s) to explain by. These columns will be embedded using the encoder and the resulting embedding will be used.

  • method (Union[str, domino.Slicer]) – The slicing method to use. Defaults to “MixtureSlicer”.

  • encoder (str) – The encoder to use for the embedding. Defaults to clip.

  • modality (str, optional) – The modality of the data to be embedded. Defaults to None, in which case the modality is inferred from the type of the column(s) being embedded.

  • **kwargs – Additional keyword arguments to pass to the clustering method.

Returns

An ExplainBy object.

Return type

ExplainBy
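
Example

A sketch, assuming a hypothetical "img" input column and "label" target column:

eb = mk.explainby(df, by="img", target="label", method="MixtureSlicer", encoder="clip")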

cand(*args)[source]

Overloaded and operator.

Use this when you want to use the and operator on reactive values (e.g. Store)

Parameters

*args – The arguments to and together.

Returns

The result of the and operation.

cor(*args)[source]

Overloaded or operator.

Use this when you want to use the or operator on reactive values (e.g. Store)

Parameters

*args – The arguments to or together.

Returns

The result of the or operation.

cnot(x)[source]

Overloaded not operator.

Use this when you want to use the not operator on reactive values (e.g. Store).

Parameters

x – The argument to not.

Returns

The result of the not operation.
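
Example

A small sketch of the three reactive operators on Store values (assuming they are exported at the top level, e.g. mk.cand):

import meerkat as mk

x = mk.Store(True)
y = mk.Store(False)

both = mk.cand(x, y)    # reactive "and"
either = mk.cor(x, y)   # reactive "or"
flipped = mk.cnot(x)    # reactive "not"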

all(iterable, /) bool

Return True if bool(x) is True for all values x in the iterable.

If the iterable is empty, return True.

any(iterable, /) bool

Return True if bool(x) is True for any x in the iterable.

If the iterable is empty, return False.

bool(x) bool

Overloaded bool operator.

Use this when you want to use the bool operator on reactive values (e.g. Store).

Parameters

x – The argument to convert to a bool.

Returns

The result of the bool operation.

Return type

Store[bool] | bool

len(obj, /)

Return the number of items in a container.

hex(number, /) str

Return the hexadecimal representation of an integer.

>>> hex(12648430)
'0xc0ffee'
oct(number, /) str

Return the octal representation of an integer.

>>> oct(342391)
'0o1234567'
slice(*args)

Overloaded slice class.

sum(iterable, /, start=0) float

Return the sum of a ‘start’ value (default: 0) plus an iterable of numbers

When the iterable is empty, return the start value. This function is intended specifically for use with numeric values and may reject non-numeric types.

abs(x, /) float

Return the absolute value of the argument.

from_csv(filepath: str, primary_key: Optional[str] = None, backend: str = 'pandas', *args, **kwargs) meerkat.dataframe.DataFrame

Create a DataFrame from a csv file. All of the columns will be meerkat.ScalarColumn with the Pandas backend.

Parameters
  • filepath (str) – The file path or buffer to load from. Same as pandas.read_csv().

  • primary_key (str, optional) – The name of the column to use as the primary key. Defaults to None.

  • backend (str) – The backend to use for the loading and resulting columns. Defaults to “pandas”.

  • **kwargs – Keyword arguments forwarded to pandas.read_csv().

Returns

The constructed dataframe.

Return type

DataFrame
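
Example

A sketch with a hypothetical file path and primary key column:

df = mk.from_csv("data.csv", primary_key="id")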

from_json(filepath: str, primary_key: Optional[str] = None, orient: str = 'records', lines: bool = False, backend: str = 'pandas', **kwargs) meerkat.dataframe.DataFrame

Load a DataFrame from a json file.

By default, data in the JSON file should be a list of dictionaries, each with an entry for each column. This is the orient="records" format. If the data is in a different format in the JSON, you can specify the orient parameter. See pandas.read_json() for more details.

Parameters
  • filepath (str) – The file path or buffer to load from. Same as pandas.read_json().

  • orient (str) – The expected JSON string format. Options are: “split”, “records”, “index”, “columns”, “values”. Same as pandas.read_json().

  • lines (bool) – Whether the json file is a jsonl file. Same as pandas.read_json().

  • backend (str) – The backend to use for the loading and resulting columns.

  • **kwargs – Keyword arguments forwarded to pandas.read_json().

Returns

The constructed dataframe.

Return type

DataFrame
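
Example

A sketch for a hypothetical JSON-lines file with one record per line:

df = mk.from_json("data.jsonl", lines=True, orient="records")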

from_parquet(filepath: str, primary_key: Optional[str] = None, engine: str = 'auto', columns: Optional[Sequence[str]] = None, **kwargs) meerkat.dataframe.DataFrame

Create a DataFrame from a parquet file. All of the columns will be meerkat.ScalarColumn with the Pandas backend.

Parameters
  • filepath (str) – The file path or buffer to load from. Same as pandas.read_parquet().

  • primary_key (str, optional) – The name of the column to use as the primary key. Defaults to None.

  • engine (str) – The parquet library to use. Same as pandas.read_parquet(). Defaults to “auto”.

  • columns (Sequence[str], optional) – The subset of columns to load. Defaults to None, in which case all columns are loaded.

  • **kwargs – Keyword arguments forwarded to pandas.read_parquet().

Returns

The constructed dataframe.

Return type

DataFrame

from_feather(filepath: str, primary_key: Optional[str] = None, columns: Optional[Sequence[str]] = None, use_threads: bool = True, **kwargs) meerkat.dataframe.DataFrame

Create a DataFrame from a feather file. All of the columns will be meerkat.ScalarColumn with the Pandas backend.

Parameters
  • filepath (str) – The file path or buffer to load from. Same as pandas.read_feather().

  • primary_key (str, optional) – The name of the column to use as the primary key. Defaults to None.

  • columns (Sequence[str], optional) – The subset of columns to load. Defaults to None, in which case all columns are loaded.

  • use_threads (bool) – Whether to parallelize reading using multiple threads. Defaults to True.

  • **kwargs – Keyword arguments forwarded to pandas.read_feather().

Returns

The constructed dataframe.

Return type

DataFrame

from_pandas(df: pandas.core.frame.DataFrame, index: bool = True, primary_key: Optional[str] = None) meerkat.dataframe.DataFrame

Create a Meerkat DataFrame from a Pandas DataFrame.

Warning

In Meerkat, column names must be strings, so non-string column names in the Pandas DataFrame will be converted.

Parameters
  • df – The Pandas DataFrame to convert.

  • index – Whether to include the index of the Pandas DataFrame as a column in the Meerkat DataFrame.

  • primary_key – The name of the column to use as the primary key. If index is True and primary_key is None, the index will be used as the primary key. If index is False, then no primary key will be set. Defaults to None.

Returns

The Meerkat DataFrame.

Return type

DataFrame
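
Example

For instance, converting a small Pandas DataFrame:

import pandas as pd
import meerkat as mk

pdf = pd.DataFrame({"id": [1, 2, 3], "text": ["a", "b", "c"]})
# With index=True (the default), the Pandas index becomes a column and the primary key.
df = mk.from_pandas(pdf)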

from_arrow(table: pyarrow.lib.Table)

Create a DataFrame from a pyarrow Table.

from_huggingface(*args, **kwargs)

Load a Huggingface dataset as a DataFrame.

Use this to replace datasets.load_dataset, so

>>> dict_of_datasets = datasets.load_dataset('boolq')

becomes

>>> dict_of_dataframes = DataFrame.from_huggingface('boolq')

read(path: str, overwrite: bool = False, *args, **kwargs) meerkat.dataframe.DataFrame

Load a DataFrame stored on disk.

class BaseFormatter[source]
encode(cell: Any, **kwargs)[source]

Encode the cell on the backend before sending it to the frontend.

The cell is lazily loaded, so when used on a DeferredColumn, cell will be a DeferredCell. This is important for displays that don’t actually need to apply the deferred function in order to display the value.

static to_yaml(dumper: yaml.dumper.Dumper, data: meerkat.interactive.formatter.base.BaseFormatter)[source]

This function is called by the YAML dumper to convert a Formatter object into a YAML node.

It should not be called directly.

static from_yaml(loader, node)[source]

This function is called by the YAML loader to convert a YAML node into a Formatter object.

It should not be called directly.

html(cell: Any)[source]

When not in interactive mode, objects are visualized using static html.

This method should produce that static html for the cell.

class FormatterGroup(base: Optional[meerkat.interactive.formatter.base.BaseFormatter] = None, **kwargs)[source]

A formatter group is a mapping from formatter placeholders to formatters.

Data in a Meerkat column sometimes need to be displayed differently in different GUI contexts. For example, in a table, we display thumbnails of images, but in a carousel view, we display the full image.

Because most components in Meerkat work on any data type, it is important that they are implemented in a formatter-agnostic way. So, instead of specifying formatters, components make requests for data specifying a formatter placeholder. For example, the mk.gui.Gallery component requests data using the thumbnail formatter placeholder.

For a specific column of data, we specify which formatters to use for each placeholder using a formatter group. A formatter group is a mapping from formatter placeholders to formatters. Each column in Meerkat has a formatter_group property. A column’s formatter group controls how it will be displayed in different contexts in Meerkat GUIs.

Parameters
  • base (FormatterGroup) – The base formatter group to use.

  • **kwargs – The formatters to add to the formatter group.
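
Example

A sketch of a formatter group for an image column, reusing ImageFormatter from the format() example (and assuming it can be constructed with default arguments); "thumbnail" is the placeholder requested by gallery-style components:

group = FormatterGroup(
    base=ImageFormatter(),
    thumbnail=ImageFormatter(max_size=(48, 48)),
)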

static to_yaml(dumper: yaml.dumper.Dumper, data: meerkat.interactive.formatter.base.BaseFormatter)[source]

This function is called by the YAML dumper to convert a Formatter object into a YAML node.

It should not be called directly.

static from_yaml(loader, node)[source]

This function is called by the YAML loader to convert a YAML node into a Formatter object.

It should not be called directly.

get(name: str, dataset_dir: Optional[str] = None, version: Optional[str] = None, download_mode: str = 'reuse', registry: Optional[str] = None, **kwargs) Union[meerkat.dataframe.DataFrame, Dict[str, meerkat.dataframe.DataFrame]][source]

Load a dataset into Meerkat.

Parameters
  • name (str) – Name of the dataset.

  • dataset_dir (str) – The directory containing dataset data. Defaults to ~/.meerkat/datasets/{name}.

  • version (str) – The version of the dataset. Defaults to latest.

  • download_mode (str) – The download mode. Options are: “reuse” (default) will download the dataset if it does not exist, “force” will download the dataset even if it exists, “extract” will reuse any downloaded archives but force extracting those archives, and “skip” will not download the dataset if it doesn’t yet exist. Defaults to reuse.

  • registry (str) – The registry to use. If None, then checks each supported registry in turn. Currently, supported registries include meerkat and huggingface.

  • **kwargs – Additional arguments passed to the dataset.
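
Example

For instance, with the default download mode:

# Downloads the dataset on first use, then reuses the cached copy.
# Note: some datasets return a dict of DataFrames rather than a single DataFrame.
df = mk.get("imagenette")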

DataPanel

alias of meerkat.dataframe.DataFrame

scalar

alias of meerkat.columns.scalar.abstract.ScalarColumn

tensor

alias of meerkat.columns.tensor.abstract.TensorColumn

deferred

alias of meerkat.columns.deferred.base.DeferredColumn

objects

alias of meerkat.columns.object.base.ObjectColumn

files

alias of meerkat.columns.deferred.file.FileColumn

image(filepaths: typing.Sequence[str], base_dir: typing.Optional[str] = None, downloader: typing.Optional[typing.Union[callable, str]] = None, loader: callable = <function load_image>, cache_dir: typing.Optional[str] = None)[source]

Create a FileColumn where each cell represents an image stored on disk. The underlying data is a ScalarColumn of strings, where each string is the path to an image.

Parameters
  • filepaths (Sequence[str]) – A list of filepaths to images.

  • loader (Union[str, Callable[[Union[str, IO]], Any]]) – a callable that accepts a filepath or an I/O stream and returns data.

  • base_dir (str, optional) –

    an absolute path to a directory containing the files. If provided, the filepath to be loaded will be joined with the base_dir. As such, this argument should only be used if the loader will be applied to relative paths.

    The base_dir can also include environment variables (e.g. $DATA_DIR/images) which will be expanded prior to loading. This is useful when sharing DataFrames between machines.

  • downloader (Union[str, callable], optional) –

    a callable that accepts at least two positional arguments - a URI and a destination (which could be either a string or file object).

    Meerkat includes a small set of built-in downloaders [“url”, “gcs”] which can be specified via string.

  • fallback_downloader (callable, optional) – a callable that will be run each time the downloader fails (for any reason). This is useful, for example, if you expect some of the URIs in a dataset to be broken: fallback_downloader could write an empty file in place of the original. If fallback_downloader is not supplied, the original exception is re-raised.

  • cache_dir (str, optional) – the directory on disk where downloaded files are to be cached. Defaults to None, in which case files will be re-downloaded on every access of the data. The cache_dir can also include environment variables (e.g. $DATA_DIR/images) which will be expanded prior to loading. This is useful when sharing DataFrames between machines.
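
Example

A sketch with hypothetical relative paths, resolved against an environment-variable base_dir:

img_col = mk.image(
    ["images/0.jpg", "images/1.jpg"],
    base_dir="$DATA_DIR",
)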

audio

alias of meerkat.columns.deferred.audio.AudioColumn

class classproperty(fget=None, fset=None, fdel=None, doc=None)[source]

Taken from https://stackoverflow.com/a/13624858.

The behavior of class properties using the @classmethod and @property decorators has changed across Python versions. This class should provide consistent behavior across Python versions. See https://stackoverflow.com/a/1800999 for more information.