DataFrame¶
- class DataFrame(data: Optional[Union[dict, list]] = None, primary_key: Union[str, bool] = True, *args, **kwargs)[source]¶
A collection of equal-length columns.
- format(formatters: Dict[str, FormatterGroup]) DataFrame [source]¶
Create a view of the DataFrame with formatted columns.
- Parameters
formatters (Dict[str, FormatterGroup]) – A dictionary mapping column names to FormatterGroups.
- Returns
A view of the DataFrame with formatted columns.
- Return type
DataFrame
Examples
# assume df is a DataFrame with columns "img", "text", "id"
gallery = mk.Gallery(
    df=df.format(
        img={"thumbnail": ImageFormatter(max_size=(48, 48))},
        text={"icon": TextFormatter()},
    )
)
- property data: meerkat.block.manager.BlockManager¶
Get the underlying data (excluding invisible rows).
To access underlying data with invisible rows, use _data.
- property columns¶
Column names in the DataFrame.
- property primary_key: meerkat.columns.scalar.abstract.ScalarColumn¶
The column acting as the primary key.
- set_primary_key(column: str, inplace: bool = False) meerkat.dataframe.DataFrame [source]¶
Set the DataFrame's primary key using an existing column. By default, this is an out-of-place operation (see the inplace parameter). For more information on primary keys, see the User Guide.
- Parameters
column (str) – The name of an existing column to set as the primary key.
- create_primary_key(column: str)[source]¶
Create a primary key of contiguous integers.
- Parameters
column (str) – The name of the column to create.
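A minimal sketch of both approaches (the column names are illustrative):
import meerkat as mk

df = mk.DataFrame({"id": ["a", "b", "c"], "val": [1, 2, 3]})

# Promote an existing column of unique values to the primary key.
df = df.set_primary_key("id")

# Or create a fresh primary key of contiguous integers.
df.create_primary_key("pk")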
- property nrows¶
Number of rows in the DataFrame.
- property ncols¶
Number of columns in the DataFrame.
- property shape¶
Shape of the DataFrame (num_rows, num_columns).
- add_column(name: str, data: Union[Sequence, numpy.ndarray, pandas.core.series.Series, torch.Tensor], overwrite=False) None [source]¶
Add a column to the DataFrame.
- append(df: meerkat.dataframe.DataFrame, axis: Union[str, int] = 'rows', suffixes: Tuple[str] = None, overwrite: bool = False) meerkat.dataframe.DataFrame [source]¶
Append a DataFrame along the specified axis.
df must have the same columns as this DataFrame (regardless of which columns are visible).
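For example, a minimal sketch of a row-wise append (column names are illustrative):
import meerkat as mk

df1 = mk.DataFrame({"a": [1, 2], "b": ["x", "y"]})
df2 = mk.DataFrame({"a": [3], "b": ["z"]})

# Both DataFrames share the same columns, so rows can be stacked.
combined = df1.append(df2, axis="rows")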
- head(n: int = 5) meerkat.dataframe.DataFrame [source]¶
Get the first n examples of the DataFrame.
- tail(n: int = 5) meerkat.dataframe.DataFrame [source]¶
Get the last n examples of the DataFrame.
- set(value: meerkat.dataframe.DataFrame)[source]¶
Set the data of this DataFrame to the data of another DataFrame.
This is used inside endpoints to tell Meerkat when a DataFrame has been modified. Calling this method outside of an endpoint will not have any effect on the graph.
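A hedged sketch of the intended pattern (the endpoint name and logic are illustrative):
from meerkat import DataFrame, endpoint

@endpoint
def drop_column(df: DataFrame, column: str):
    # Build the modified frame out-of-place, then call .set() so that
    # reactive functions depending on df are re-run.
    df.set(df.drop(column))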
- classmethod from_batch(batch: Dict[str, Union[List, meerkat.columns.abstract.Column]]) meerkat.dataframe.DataFrame [source]¶
Convert a batch to a DataFrame.
- classmethod from_batches(batches: Sequence[Dict[str, Union[List, meerkat.columns.abstract.Column]]]) meerkat.dataframe.DataFrame [source]¶
Convert a list of batches to a DataFrame.
- classmethod from_pandas(df: pandas.core.frame.DataFrame, index: bool = True, primary_key: Optional[str] = None) meerkat.dataframe.DataFrame [source]¶
Create a Meerkat DataFrame from a Pandas DataFrame.
Warning
In Meerkat, column names must be strings, so non-string column names in the Pandas DataFrame will be converted.
- Parameters
df – The Pandas DataFrame to convert.
index – Whether to include the index of the Pandas DataFrame as a column in the Meerkat DataFrame.
primary_key – The name of the column to use as the primary key. If index is True and primary_key is None, the index will be used as the primary key. If index is False, then no primary key will be set. Defaults to None.
- Returns
The Meerkat DataFrame.
- Return type
DataFrame
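For example (a minimal sketch; the column names are illustrative):
import pandas as pd
import meerkat as mk

pdf = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# index=True carries the pandas index over as a column, which also
# becomes the primary key since primary_key is not specified.
df = mk.DataFrame.from_pandas(pdf, index=True)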
- classmethod from_huggingface(*args, **kwargs)[source]¶
Load a Huggingface dataset as a DataFrame.
Use this to replace datasets.load_dataset, so
>>> dict_of_datasets = datasets.load_dataset('boolq')
becomes
>>> dict_of_dataframes = DataFrame.from_huggingface('boolq')
- classmethod from_csv(filepath: str, primary_key: Optional[str] = None, backend: str = 'pandas', *args, **kwargs) meerkat.dataframe.DataFrame [source]¶
Create a DataFrame from a csv file. All of the columns will be meerkat.ScalarColumn with backend Pandas.
- Parameters
filepath (str) – The file path or buffer to load from. Same as pandas.read_csv().
*args – Argument list for pandas.read_csv().
**kwargs – Keyword arguments forwarded to pandas.read_csv().
- Returns
The constructed dataframe.
- Return type
DataFrame
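For example (the file path is hypothetical; extra keyword arguments are forwarded to pandas.read_csv):
import meerkat as mk

df = mk.DataFrame.from_csv("data.csv", sep=",")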
- classmethod from_feather(filepath: str, primary_key: Optional[str] = None, columns: Optional[Sequence[str]] = None, use_threads: bool = True, **kwargs) meerkat.dataframe.DataFrame [source]¶
Create a DataFrame from a feather file. All of the columns will be meerkat.ScalarColumn with backend Pandas.
- Parameters
filepath (str) – The file path or buffer to load from. Same as pandas.read_feather().
columns (Optional[Sequence[str]]) – The columns to load. Same as pandas.read_feather().
use_threads (bool) – Whether to use threads to read the file. Same as pandas.read_feather().
**kwargs – Keyword arguments forwarded to pandas.read_feather().
- Returns
The constructed dataframe.
- Return type
DataFrame
- classmethod from_parquet(filepath: str, primary_key: Optional[str] = None, engine: str = 'auto', columns: Optional[Sequence[str]] = None, **kwargs) meerkat.dataframe.DataFrame [source]¶
Create a DataFrame from a parquet file. All of the columns will be meerkat.ScalarColumn with backend Pandas.
- Parameters
filepath (str) – The file path or buffer to load from. Same as pandas.read_parquet().
engine (str) – The parquet engine to use. Same as pandas.read_parquet().
columns (Optional[Sequence[str]]) – The columns to load. Same as pandas.read_parquet().
**kwargs – Keyword arguments forwarded to pandas.read_parquet().
- Returns
The constructed dataframe.
- Return type
DataFrame
- classmethod from_json(filepath: str, primary_key: Optional[str] = None, orient: str = 'records', lines: bool = False, backend: str = 'pandas', **kwargs) meerkat.dataframe.DataFrame [source]¶
Load a DataFrame from a json file.
By default, data in the JSON file should be a list of dictionaries, each with an entry for each column. This is the orient="records" format. If the data is in a different format in the JSON, you can specify the orient parameter. See pandas.read_json() for more details.
- Parameters
filepath (str) – The file path or buffer to load from. Same as pandas.read_json().
orient (str) – The expected JSON string format. Options are: "split", "records", "index", "columns", "values". Same as pandas.read_json().
lines (bool) – Whether the json file is a jsonl file. Same as pandas.read_json().
backend (str) – The backend to use for the loading and resulting columns.
**kwargs – Keyword arguments forwarded to pandas.read_json().
- Returns
The constructed dataframe.
- Return type
DataFrame
- to_pandas(index: bool = False, allow_objects: bool = False) pandas.core.frame.DataFrame [source]¶
Convert a Meerkat DataFrame to a Pandas DataFrame.
- Parameters
index (bool) – Use the primary key as the index of the Pandas DataFrame. Defaults to False.
- Returns
The constructed dataframe.
- Return type
pd.DataFrame
- to_arrow() pyarrow.lib.Table [source]¶
Convert a Meerkat DataFrame to an Arrow Table.
- Returns
The constructed table.
- Return type
pa.Table
- to_csv(filepath: str, engine: str = 'auto')[source]¶
Save a DataFrame to a csv file.
- Parameters
filepath (str) – The file path to save to.
engine (str) – The engine used to write the csv to disk.
- to_feather(filepath: str, engine: str = 'auto')[source]¶
Save a DataFrame to a feather file.
- Parameters
filepath (str) – The file path to save to.
engine (str) – The engine used to write the feather to disk.
- to_parquet(filepath: str, engine: str = 'auto')[source]¶
Save a DataFrame to a parquet file.
- Parameters
filepath (str) – The file path to save to.
engine (str) – The engine used to write the parquet to disk.
- to_json(filepath: str, lines: bool = False, orient: str = 'records') None [source]¶
Save a DataFrame to a json file.
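A short round-trip sketch using the json writer and reader (the file path is hypothetical):
import meerkat as mk

df = mk.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
df.to_json("out.json", orient="records")
loaded = mk.DataFrame.from_json("out.json", orient="records")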
- batch(batch_size: int = 1, drop_last_batch: bool = False, num_workers: int = 0, materialize: bool = True, shuffle: bool = False, *args, **kwargs)[source]¶
Batch the DataFrame.
- Parameters
batch_size – integer batch size
drop_last_batch – drop the last batch if it is smaller than batch_size
- Returns
batches of data
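For example, a minimal sketch of iterating over batches:
import meerkat as mk

df = mk.DataFrame({"a": list(range(10))})

# Yields DataFrame batches of 4 rows; the final partial batch is
# kept because drop_last_batch defaults to False.
for batch in df.batch(batch_size=4):
    print(batch["a"])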
- update(function: Optional[Callable] = None, with_indices: bool = False, input_columns: Optional[Union[str, List[str]]] = None, is_batched_fn: bool = False, batch_size: Optional[int] = 1, remove_columns: Optional[List[str]] = None, num_workers: int = 0, output_type: Optional[Union[type, Dict[str, type]]] = None, mmap: bool = False, mmap_path: Optional[str] = None, materialize: bool = True, pbar: bool = False, **kwargs) meerkat.dataframe.DataFrame [source]¶
Update the columns of the DataFrame.
- filter(function: Optional[Callable] = None, with_indices=False, input_columns: Optional[Union[str, List[str]]] = None, is_batched_fn: bool = False, batch_size: Optional[int] = 1, drop_last_batch: bool = False, num_workers: int = 0, materialize: bool = True, pbar: bool = False, **kwargs) Optional[meerkat.dataframe.DataFrame] [source]¶
Filter operation on the DataFrame.
- sort(by: Union[str, List[str]], ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.dataframe.DataFrame [source]¶
Sort the DataFrame by the values in the specified columns. Similar to sort_values in pandas.
- Parameters
by (Union[str, List[str]]) – The columns to sort by.
ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.
kind (str) – The kind of sort to use. Defaults to 'quicksort'. Options include 'quicksort', 'mergesort', 'heapsort', 'stable'.
- Returns
A sorted view of the DataFrame.
- Return type
DataFrame
- sample(n: int = None, frac: float = None, replace: bool = False, weights: Union[str, numpy.ndarray] = None, random_state: Union[int, numpy.random.mtrand.RandomState] = None) meerkat.dataframe.DataFrame [source]¶
Select a random sample of rows from the DataFrame. Roughly equivalent to sample in Pandas (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html).
- Parameters
n (int) – Number of samples to draw. If frac is specified, this parameter should not be passed. Defaults to 1 if frac is not passed.
frac (float) – Fraction of rows to sample. If n is specified, this parameter should not be passed.
replace (bool) – Sample with or without replacement. Defaults to False.
weights (Union[str, np.ndarray]) – Weights to use for sampling. If None (default), the rows will be sampled uniformly. If a numpy array, the sample will be weighted accordingly. If a string, the weights will be applied to the rows based on the column with the name specified. If weights do not sum to 1 they will be normalized to sum to 1.
random_state (Union[int, np.random.RandomState]) – Random state or seed to use for sampling.
- Returns
A random sample of rows from the DataFrame.
- Return type
DataFrame
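For example (values are illustrative):
import meerkat as mk

df = mk.DataFrame({"a": list(range(100))})

# Draw 10 rows without replacement, reproducibly.
sampled = df.sample(n=10, random_state=42)

# Or sample 20% of rows, weighting rows by the "a" column.
weighted = df.sample(frac=0.2, weights="a")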
- shuffle(seed: int = None) meerkat.dataframe.DataFrame [source]¶
Shuffle the rows of the DataFrame out-of-place.
- rename(mapper: Union[Dict, Callable] = None, errors: Literal['ignore', 'raise'] = 'ignore') meerkat.dataframe.DataFrame [source]¶
Return a new DataFrame with the specified column labels renamed.
Dictionary values must be unique (1-to-1). Labels not specified will be left unchanged. Extra labels will not throw an error.
- Parameters
mapper (Union[Dict, Callable], optional) – Dict-like or function transformations to apply to the values of the columns. Defaults to None.
errors (Literal['ignore', 'raise'], optional) – If ‘raise’, raise a KeyError when the Dict contains labels that do not exist in the DataFrame. If ‘ignore’, extra keys will be ignored. Defaults to ‘ignore’.
- Raises
ValueError – If the mapper's values are not unique (1-to-1).
- Returns
A new DataFrame with the specified column labels renamed.
- Return type
DataFrame
- drop(columns: Union[str, Collection[str]], check_exists=True) meerkat.dataframe.DataFrame [source]¶
Return a new DataFrame with the specified columns dropped.
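A minimal sketch combining rename and drop (column names are illustrative):
import meerkat as mk

df = mk.DataFrame({"a": [1, 2], "b": ["x", "y"], "tmp": [0, 0]})

# Rename "a" to "id"; labels not in the mapper are left unchanged.
df = df.rename({"a": "id"})

# Drop a column out-of-place.
df = df.drop("tmp")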
- classmethod read(path: str, overwrite: bool = False, *args, **kwargs) meerkat.dataframe.DataFrame [source]¶
Load a DataFrame stored on disk.
- to_huggingface(repository, commit_message: Optional[str] = None)[source]¶
Upload a DataFrame to a HuggingFace repository.
This method will dump the dataframe into the repository.local_dir. If commit_message is specified, the repository will be pushed to the hub.
The dataframe can then be accessed with:
>>> repo = huggingface_hub.snapshot_download(repository)
>>> # or repo = huggingface_hub.Repository(clone_from=repository)
>>> df = mk.read(repo)
- Parameters
repository – The huggingface_hub.Repository object to upload to.
commit_message – The commit message to use when pushing to the HuggingFace Hub.
Note
This will overwrite the existing DataFrame in the repository.
Example
>>> repo = huggingface_hub.Repository(
...     local_dir="my-dataset",
...     clone_from="user/my-dataset",
...     repo_type="dataset")
>>> df.to_huggingface(repo, commit_message="uploading dataframe")
- mark()¶
Converts the object to a reactive object in-place.
- unmark()¶
Converts the object to a non-reactive object in-place.
- reactive(fn: Optional[Callable] = None, nested_return: bool = False, skip_fn: Optional[Callable[[...], bool]] = None, backend_only: bool = False) Callable [source]¶
Internal decorator that is used to mark a function as reactive. This is only meant for internal use; users should use the react() decorator instead.
Functions decorated with this will create nodes in the operation graph, which are executed whenever their inputs are modified.
A basic example that adds two numbers:
@reactive
def add(a: int, b: int) -> int:
    return a + b

a = Store(1)
b = Store(2)
c = add(a, b)
When either a or b is modified, the add function will be called again with the new values of a and b.
A more complex example that concatenates two mk.DataFrame objects:
@reactive
def concat(df1: mk.DataFrame, df2: mk.DataFrame) -> mk.DataFrame:
    return mk.concat([df1, df2])

df1 = mk.DataFrame(...)
df2 = mk.DataFrame(...)
df3 = concat(df1, df2)
- Parameters
fn – See react().
nested_return – See react().
skip_fn – See react().
- Returns
See react().
- class unmarked[source]¶
A context manager and decorator that forces all objects within it to behave as if they are not marked. This means that any functions (reactive or not) called with those objects will never be rerun.
Effectively, functions (by decoration) or blocks of code (with the context manager) behave as if they are not reactive.
Examples:
Consider the following function:
>>> @reactive
... def f(x):
...     return x + 1
If we call f with a marked object, then it will be rerun if the object changes:
>>> x = mark(1)
>>> f(x)  # f is rerun when x changes
Now, suppose we call f inside another function g that is not reactive:
>>> def g(x):
...     out = f(x)
...     return out
If we call g with a marked object, then the out variable will be recomputed if the object changes. Even though g is not reactive, f is, and f is called within g with a marked object.
Sometimes, this might be what we want. However, sometimes we want to ensure that a function or block of code behaves as if it is not reactive.
For this behavior, we can use the unmarked context manager:
>>> with unmarked():
...     g(x)  # g and nothing in g is rerun when x changes
Or, we can use the unmarked decorator:
>>> @unmarked
... def g(x):
...     out = f(x)
...     return out
In both cases, the out variable will not be recomputed if the object changes, even though f is reactive.
- class Store(wrapped: meerkat.interactive.graph.store.T, backend_only: bool = False)[source]¶
- property frontend¶
Returns a Pydantic model that should be sent to the frontend.
These models are typically named <something>Frontend (e.g. ComponentFrontend, StoreFrontend).
- set(new_value: meerkat.interactive.graph.store.T) None [source]¶
Set the value of the store.
This will trigger any reactive functions that depend on this store.
- Parameters
new_value (T) – The new value of the store.
- Returns
None
Note
Even if the new_value is the same as the current value, this will still trigger any reactive functions that depend on this store. To avoid this, check for equality before calling this method.
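A minimal sketch of the set-and-trigger pattern (the endpoint name is illustrative):
from meerkat import Store, endpoint

count = Store(0)

@endpoint
def increment(count: Store):
    # .set() writes the new value and re-runs reactive functions that
    # depend on the store, even if the value is unchanged.
    count.set(count + 1)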
- mark(input: meerkat.interactive.graph.marking.T) meerkat.interactive.graph.marking.T [source]¶
Mark an object.
If the input is an object, then the object will become reactive: all of its methods and properties will become reactive. It will be returned as a Store object.
- Parameters
input – Any object to mark.
- Returns
A reactive function or object.
Examples:
Use mark on primitive types:
>>> x = mark(1)
>>> # x is now a `Store` object
Use mark on complex types:
>>> x = mark([1, 2, 3])
Use mark on instances of classes:
>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 2, 3]})
>>> x: Store = mark(df)
>>> y = x.head()
>>> class Foo:
...     def __init__(self, x):
...         self.x = x
...     def __call__(self):
...         return self.x + 1
>>> f = Foo(1)
>>> x = mark(f)
Use mark on functions:
>>> aggregation = mark(mean)
- endpoint(fn: Optional[Callable] = None, prefix: Optional[Union[str, fastapi.routing.APIRouter]] = None, route: Optional[str] = None, method: str = 'POST') meerkat.interactive.endpoint.Endpoint [source]¶
Decorator to mark a function as an endpoint.
An endpoint is a function that can be called to:
- update the value of a Store (e.g. incrementing a counter)
- update a DataFrame (e.g. adding a new row)
- run a computation and return its result to the frontend
- run a function in response to a frontend event (e.g. a button click)
Endpoints differ from reactive functions in that they are not automatically triggered by changes in their inputs. Instead, they are triggered by explicit calls to the endpoint function.
The Store and DataFrame objects that are modified inside the endpoint function will automatically trigger reactive functions that depend on them.
@endpoint
def increment(count: Store, step: int = 1):
    count.set(count + step)
    # ^ update the count Store, which will trigger operations
    #   that depend on it

# Create a button that calls the increment endpoint
counter = Store(0)
button = Button(on_click=increment(counter))
# ^ read this as: call the increment endpoint with the `counter`
#   Store when the button is clicked
- Parameters
fn – The function to decorate.
prefix – The prefix to add to the route. If a string, it will be prepended to the route. If an APIRouter, the route will be added to the router.
route – The route to add to the endpoint. If not specified, the route will be the name of the function.
method – The HTTP method to use for the endpoint. Defaults to “POST”.
- Returns
The decorated function, as an Endpoint object.
- class magic(magic: bool = True)[source]¶
A context manager and decorator that changes the behavior of Store objects inside it. All methods, properties and public attributes of Store objects will be wrapped in @reactive decorators.
Examples:
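A hedged sketch of the intended usage (assuming magic is importable from the top-level package): inside the context manager, operations on Store objects build reactive nodes.
>>> from meerkat import Store, magic
>>> x = Store(1)
>>> y = Store(2)
>>> with magic():
...     z = x + y  # z is a Store that is recomputed when x or y changes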
- column(data: Sequence, scalar_backend: Optional[str] = None) meerkat.columns.abstract.Column [source]¶
Create a Meerkat column from data.
The Meerkat column type is inferred from the type and structure of the data passed in.
- class Column(data: Sequence = None, collate_fn: Callable = None, formatters: FormatterGroup = None, *args, **kwargs)[source]¶
An abstract class for Meerkat columns.
- property data¶
Get the underlying data.
- filter(function: Callable, with_indices=False, input_columns: Optional[Union[str, List[str]]] = None, is_batched_fn: bool = False, batch_size: Optional[int] = 1, drop_last_batch: bool = False, num_workers: Optional[int] = 0, materialize: bool = True, **kwargs) Optional[meerkat.columns.abstract.Column] [source]¶
Filter the elements of the column using a function.
- sort(ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.columns.abstract.Column [source]¶
Return a sorted view of the column.
- Parameters
- Returns
A view of the column with the sorted data.
- Return type
- argsort(ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.columns.abstract.Column [source]¶
Return indices that would sort the column.
- Parameters
- Returns
The indices that would sort the column.
- Return type
- sample(n: Optional[int] = None, frac: Optional[float] = None, replace: bool = False, weights: Optional[Union[str, numpy.ndarray]] = None, random_state: Optional[Union[int, numpy.random.mtrand.RandomState]] = None) meerkat.columns.abstract.Column [source]¶
Select a random sample of rows from the Column. Roughly equivalent to sample in Pandas (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html).
- Parameters
n (int) – Number of samples to draw. If frac is specified, this parameter should not be passed. Defaults to 1 if frac is not passed.
frac (float) – Fraction of rows to sample. If n is specified, this parameter should not be passed.
replace (bool) – Sample with or without replacement. Defaults to False.
weights (np.ndarray) – Weights to use for sampling. If None (default), the rows will be sampled uniformly. If a numpy array, the sample will be weighted accordingly. If weights do not sum to 1 they will be normalized to sum to 1.
random_state (Union[int, np.random.RandomState]) – Random state or seed to use for sampling.
- Returns
A random sample of rows from the Column.
- Return type
Column
- is_equal(other: meerkat.columns.abstract.Column) bool [source]¶
Tests whether two columns are equal.
- Parameters
other (Column) – The column to compare against.
- batch(batch_size: int = 1, drop_last_batch: bool = False, collate: bool = True, num_workers: int = 0, materialize: bool = True, *args, **kwargs)[source]¶
Batch the column.
- Parameters
batch_size – integer batch size
drop_last_batch – drop the last batch if it is smaller than batch_size
collate – whether to collate the returned batches
- Returns
batches of data
- classmethod from_data(data: Union[Columnable, Column])[source]¶
Convert data to a meerkat column using the appropriate Column type.
- head(n: int = 5) meerkat.columns.abstract.Column [source]¶
Get the first n examples of the column.
- tail(n: int = 5) meerkat.columns.abstract.Column [source]¶
Get the last n examples of the column.
- to_pandas(allow_objects: bool = False) pandas.core.series.Series [source]¶
Convert the column to a Pandas Series.
If the column cannot be converted to a Pandas Series, this method will raise a ~meerkat.errors.ConversionError.
- Returns
The column as a Pandas Series.
- Return type
pd.Series
- to_arrow() pyarrow.lib.Array [source]¶
Convert the column to an Arrow Array.
If the column cannot be converted to an Arrow Array, this method will raise a ~meerkat.errors.ConversionError.
- Returns
The column as an Arrow Array.
- Return type
pa.Array
- to_torch() torch.Tensor [source]¶
Convert the column to a PyTorch Tensor.
If the column cannot be converted to a PyTorch Tensor, this method will raise a ~meerkat.errors.ConversionError.
- Returns
The column as a PyTorch Tensor.
- Return type
torch.Tensor
- to_numpy() numpy.ndarray [source]¶
Convert the column to a Numpy array.
If the column cannot be converted to a Numpy array, this method will raise a ~meerkat.errors.ConversionError.
- Returns
The column as a Numpy array.
- Return type
np.ndarray
- class ObjectColumn(data: Optional[Sequence] = None, *args, **kwargs)[source]¶
- batch(batch_size: int = 1, drop_last_batch: bool = False, collate: bool = True, *args, **kwargs)[source]¶
Batch the column.
- Parameters
batch_size – integer batch size
drop_last_batch – drop the last batch if it is smaller than batch_size
collate – whether to collate the returned batches
- Returns
batches of data
- is_equal(other: meerkat.columns.abstract.Column) bool [source]¶
Tests whether two columns are equal.
- Parameters
other (Column) – The column to compare against.
- to_pandas(allow_objects: bool = False) pandas.core.series.Series [source]¶
Convert the column to a Pandas Series.
If the column cannot be converted to a Pandas Series, this method will raise a ~meerkat.errors.ConversionError.
- Returns
The column as a Pandas Series.
- Return type
pd.Series
- class ScalarColumn(data: Optional[Union[numpy.ndarray, torch.TensorType, pandas.core.series.Series, List]] = None, backend: Optional[str] = None)[source]¶
- class PandasScalarColumn(data: Optional[Union[numpy.ndarray, torch.TensorType, pandas.core.series.Series, List]] = None, backend: Optional[str] = None)[source]¶
- dt¶
alias of meerkat.columns.scalar.pandas._MeerkatCombinedDatetimelikeProperties
- cat¶
alias of meerkat.columns.scalar.pandas._MeerkatCategoricalAccessor
- sort(ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.columns.scalar.pandas.PandasScalarColumn [source]¶
Return a sorted view of the column.
- Parameters
- Returns
A view of the column with the sorted data.
- Return type
PandasScalarColumn
- argsort(ascending: bool = True, kind: str = 'quicksort') meerkat.columns.scalar.pandas.PandasScalarColumn [source]¶
Return indices that would sort the column.
- Parameters
- Returns
The indices that would sort the column.
- Return type
PandasScalarColumn
Currently raises an error if the input array has more than one dimension.
- to_tensor() torch.Tensor [source]¶
Use column.to_tensor() instead of torch.tensor(column), which is very slow.
- to_numpy() numpy.ndarray [source]¶
Convert the column to a Numpy array.
If the column cannot be converted to a Numpy array, this method will raise a ~meerkat.errors.ConversionError.
- Returns
The column as a Numpy array.
- Return type
np.ndarray
- to_pandas(allow_objects: bool = False) pandas.core.series.Series [source]¶
Convert the column to a Pandas Series.
If the column cannot be converted to a Pandas Series, this method will raise a ~meerkat.errors.ConversionError.
- Returns
The column as a Pandas Series.
- Return type
pd.Series
- to_arrow() pyarrow.lib.Array [source]¶
Convert the column to an Arrow Array.
If the column cannot be converted to an Arrow Array, this method will raise a ~meerkat.errors.ConversionError.
- Returns
The column as an Arrow Array.
- Return type
pa.Array
- is_equal(other: meerkat.columns.abstract.Column) bool [source]¶
Tests whether two columns are equal.
- Parameters
other (Column) – The column to compare against.
- class ArrowScalarColumn(data: Optional[Union[numpy.ndarray, torch.TensorType, pandas.core.series.Series, List]] = None, backend: Optional[str] = None)[source]¶
- is_equal(other: meerkat.columns.abstract.Column) bool [source]¶
Tests whether two columns are equal.
- Parameters
other (Column) – The column to compare against.
- to_numpy()[source]¶
Convert the column to a Numpy array.
If the column cannot be converted to a Numpy array, this method will raise a ~meerkat.errors.ConversionError.
- Returns
The column as a Numpy array.
- Return type
np.ndarray
- class NumPyTensorColumn(data: TensorColumnTypes = None, backend: str = None)[source]¶
- is_equal(other: meerkat.columns.abstract.Column) bool [source]¶
Tests whether two columns are equal.
- Parameters
other (Column) – The column to compare against.
- sort(ascending: Union[bool, List[bool]] = True, axis: int = -1, kind: str = 'quicksort', order: Optional[Union[str, List[str]]] = None) meerkat.columns.tensor.numpy.NumPyTensorColumn [source]¶
Return a sorted view of the column.
- Parameters
- Returns
A view of the column with the sorted data.
- Return type
- argsort(ascending: bool = True, kind: str = 'quicksort') meerkat.columns.tensor.numpy.NumPyTensorColumn [source]¶
Return indices that would sort the column.
- Parameters
- Returns
The indices that would sort the column.
- Return type
NumPyTensorColumn
Currently raises an error if the input array has more than one dimension.
- to_torch() torch.Tensor [source]¶
Convert the column to a PyTorch Tensor.
If the column cannot be converted to a PyTorch Tensor, this method will raise a ~meerkat.errors.ConversionError.
- Returns
The column as a PyTorch Tensor.
- Return type
torch.Tensor
- to_pandas(allow_objects: bool = True) pandas.core.series.Series [source]¶
Convert the column to a Pandas Series.
If the column cannot be converted to a Pandas Series, this method will raise a ~meerkat.errors.ConversionError.
- Returns
The column as a Pandas Series.
- Return type
pd.Series
- to_arrow() pyarrow.lib.Array [source]¶
Convert the column to an Arrow Array.
If the column cannot be converted to an Arrow Array, this method will raise a ~meerkat.errors.ConversionError.
- Returns
The column as an Arrow Array.
- Return type
pa.Array
- to_numpy() numpy.ndarray [source]¶
Convert the column to a Numpy array.
If the column cannot be converted to a Numpy array, this method will raise a ~meerkat.errors.ConversionError.
- Returns
The column as a Numpy array.
- Return type
np.ndarray
- class TorchTensorColumn(data: TensorColumnTypes = None, backend: str = None)[source]¶
- classmethod from_data(data: Union[Sequence, numpy.ndarray, pandas.core.series.Series, torch.Tensor, meerkat.columns.abstract.Column])[source]¶
Convert data to a TorchTensorColumn.
- sort(ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.columns.tensor.torch.TorchTensorColumn [source]¶
Return a sorted view of the column.
- Parameters
- Returns
A view of the column with the sorted data.
- Return type
- argsort(ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.columns.tensor.torch.TorchTensorColumn [source]¶
Return indices that would sort the column.
- Parameters
- Returns
The indices that would sort the column.
- Return type
TorchTensorColumn
Currently raises an error if the input array has more than one dimension.
- is_equal(other: meerkat.columns.abstract.Column) bool [source]¶
Tests whether two columns are equal.
- Parameters
other (Column) – The column to compare against.
- to_pandas(allow_objects: bool = True) pandas.core.series.Series [source]¶
Convert the column to a Pandas Series.
If the column cannot be converted to a Pandas Series, this method will raise a ~meerkat.errors.ConversionError.
- Returns
The column as a Pandas Series.
- Return type
pd.Series
- to_numpy() numpy.ndarray [source]¶
Convert the column to a Numpy array.
If the column cannot be converted to a Numpy array, this method will raise a ~meerkat.errors.ConversionError.
- Returns
The column as a Numpy array.
- Return type
np.ndarray
- class DeferredColumn(data: Union[meerkat.block.deferred_block.DeferredOp, meerkat.block.abstract.BlockView], output_type: Optional[Type[meerkat.columns.abstract.Column]] = None, *args, **kwargs)[source]¶
- property fn: Callable¶
Subclasses like ImageColumn should be able to implement their own version.
- is_equal(other: meerkat.columns.abstract.Column) bool [source]¶
Tests whether two columns are equal.
- Parameters
other (Column) – The column to compare against.
- class FileColumn(data: Sequence[str] = None, type: str = None, loader: callable = None, downloader: Union[callable | str] = None, base_dir: str = None, cache_dir: str = None, formatters: FormatterGroup = None, *args, **kwargs)[source]¶
A column where each cell represents a file stored on disk or on the web. The underlying data is a PandasSeriesColumn of strings, where each string is the path to a file. The column materializes the files into memory when indexed. If the column is lazily indexed with the lz indexer, the files are not materialized and a FileCell or a FileColumn is returned instead.
- Parameters
data (Sequence[str]) – A list of filepaths.
loader (Union[str, Callable[[Union[str, IO]], Any]]) – a callable that accepts a filepath or an I/O stream and returns data.
base_dir (str, optional) – An absolute path to a directory containing the files. If provided, the filepath to be loaded will be joined with the base_dir. As such, this argument should only be used if the loader will be applied to relative paths. The base_dir can also include environment variables (e.g. $DATA_DIR/images), which will be expanded prior to loading. This is useful when sharing DataFrames between machines.
downloader (Union[str, callable], optional) – A callable that accepts at least two positional arguments – a URI and a destination (which could be either a string or a file object). Meerkat includes a small set of built-in downloaders ["url", "gcs"] which can be specified via string.
cache_dir (str, optional) – The directory on disk where downloaded files are to be cached. Defaults to None, in which case files will be re-downloaded on every access of the data. The cache_dir can also include environment variables (e.g. $DATA_DIR/images), which will be expanded prior to loading. This is useful when sharing DataFrames between machines.
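A sketch of constructing the column (assuming FileColumn is exported at the top level; the paths and base_dir are illustrative):
import meerkat as mk

# Paths are relative to base_dir; files are only materialized when
# the column is indexed.
col = mk.FileColumn(
    data=["img0.jpg", "img1.jpg"],
    base_dir="$DATA_DIR/images",
)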
- is_equal(other: meerkat.columns.abstract.Column) bool [source]¶
Tests whether two columns are equal.
- Parameters
other (Column) – The column to compare against.
- class ImageColumn(data: Sequence[str] = None, type: str = None, loader: callable = None, downloader: Union[callable | str] = None, base_dir: str = None, cache_dir: str = None, formatters: FormatterGroup = None, *args, **kwargs)[source]¶
DEPRECATED. A column where each cell represents an image stored on disk. The underlying data is a PandasSeriesColumn of strings, where each string is the path to an image. The column materializes the images into memory when indexed. If the column is lazily indexed with the lz indexer, the images are not materialized and an ImageCell or an ImageColumn is returned instead.
- Parameters
data (Sequence[str]) – A list of filepaths to images.
transform (callable) –
A function that transforms the image (e.g. torchvision.transforms.functional.center_crop).
Warning
In order for the column to be serializable, the transform function must be pickleable.
loader (callable) –
A callable with signature def loader(filepath: str) -> PIL.Image. Defaults to torchvision.datasets.folder.default_loader.
Warning
In order for the column to be serializable with write(), the loader function must be pickleable.
base_dir (str) – A base directory that the paths in data are relative to. If None, the paths are assumed to be absolute.
- class AudioColumn(data: Sequence[str] = None, type: str = None, loader: callable = None, downloader: Union[callable | str] = None, base_dir: str = None, cache_dir: str = None, formatters: FormatterGroup = None, *args, **kwargs)[source]¶
A lambda column where each cell represents an audio file on disk. The underlying data is a PandasSeriesColumn of strings, where each string is the path to an audio file. The column materializes the audio files into memory when indexed. If the column is lazily indexed with the lz indexer, the files are not materialized and a FileCell or an AudioColumn is returned instead.
- Parameters
data (Sequence[str]) – A list of filepaths to audio files.
transform (callable) –
A function that transforms the audio (e.g. a torchaudio transform).
Warning
In order for the column to be serializable, the transform function must be pickleable.
loader (callable) –
A callable with signature def loader(filepath: str) -> PIL.Image. Defaults to torchvision.datasets.folder.default_loader.
Warning
In order for the column to be serializable with write(), the loader function must be pickleable.
base_dir (str) – A base directory that the paths in data are relative to. If None, the paths are assumed to be absolute.
- map(data: Union[DataFrame, Column], function: Callable, is_batched_fn: bool = False, batch_size: int = 1, inputs: Union[Mapping[str, str], Sequence[str]] = None, outputs: Union[Mapping[any, str], Sequence[str]] = None, output_type: Union[Mapping[str, Type[Column]], Type[Column]] = None, materialize: bool = True, use_ray: bool = False, num_blocks: int = 100, blocks_per_window: int = 10, pbar: bool = False, **kwargs)[source]¶
Create a new Column or DataFrame by applying a function to each row in data.
This function shares nearly the exact same signature with defer(); the difference is that defer() returns a column that has not yet been computed, a placeholder for a column that will be computed later.
Learn more in the user guide: Mapping: map and defer.
What gets returned by map?
If function returns a single value, then map will return a Column object.
If function returns a dictionary, then map will return a DataFrame. The keys of the dictionary are used as column names. The outputs argument can be used to override the column names.
If function returns a tuple, then map will return a DataFrame. The column names will be integers. The column names can be overridden by passing a tuple to the outputs argument.
If function returns a tuple or a dictionary, then passing "single" to the outputs argument will cause map to return a single ObjectColumn.
- Parameters
data (DataFrame) – The DataFrame or Column to which the function will be applied.
function (Callable) – The function that will be applied to the rows of data.
is_batched_fn (bool, optional) – Whether the function must be applied on a batch of rows. Defaults to False.
batch_size (int, optional) – The size of the batch. Defaults to 1.
inputs (Dict[str, str], optional) – Dictionary mapping column names in data to keyword arguments of function. Ignored if data is a column. When calling function, values from the columns will be fed to the corresponding keyword arguments. Defaults to None, in which case it inspects the signature of the function. It then finds the columns with the same names in the DataFrame and passes the corresponding values to the function. If the function takes a non-default argument that is not a column in the DataFrame, the operation will raise a ValueError.
outputs (Union[Dict[any, str], Tuple[str]], optional) –
Controls how the output of function is mapped to the output of map(). Defaults to None.
- If None: the output is inferred from the return type of the function. See explanation above.
- If "single": a single column is returned.
- If a Dict[any, str]: then a DataFrame is returned. This is useful when the output of function is a Dict. outputs maps the outputs of function to column names in the resulting DataFrame.
- If a Tuple[str]: then a DataFrame is returned. This is useful when the output of function is a Tuple. outputs maps the outputs of function to column names in the resulting DataFrame.
output_type (Union[Dict[str, type], type], optional) – Coerce the column. Defaults to None.
materialize (bool, optional) – Whether to materialize the input column(s). Defaults to True.
use_ray (bool) – Use Ray to parallelize the computation. Defaults to False.
num_blocks (int) – When using Ray, the number of blocks to split the data into. Defaults to 100.
blocks_per_window (int) – When using Ray, the number of blocks to process in a single Ray task. Defaults to 10.
pbar (bool) – Show a progress bar. Defaults to False.
- Returns
A new Column or DataFrame with the result of applying function to each row of data.
- Return type
Union[DataFrame, Column]
Examples
We start with a small DataFrame of voters with two columns: birth_year, which contains the birth year of each person, and residence, which contains the state in which each person lives.
In [1]: import datetime

In [2]: import meerkat as mk

In [3]: df = mk.DataFrame({
   ...:     "birth_year": [1967, 1993, 2010, 1985, 2007, 1990, 1943],
   ...:     "residence": ["MA", "LA", "NY", "NY", "MA", "MA", "LA"]
   ...: })
Single input column. Map a column of birth years to a column of ages.
In [4]: df["age"] = df["birth_year"].map(
   ...:     lambda x: datetime.datetime.now().year - x
   ...: )

In [5]: df["age"]
Out[5]: column([56, 30, 13, 38, 16, 33, ...], backend=PandasScalarColumn
Multiple input columns. Compute a column of Massachusetts voting eligibility from the age and residence columns.
In [6]: df["ma_eligible"] = df.map(
   ...:     lambda age, residence: (residence == "MA") and (age >= 18)
   ...: )

In [7]: df["ma_eligible"]
Out[7]: column([True, False, False, False, False, True, ...], backend=PandasScalarColumn
- defer(data: Union[DataFrame, Column], function: Callable, is_batched_fn: bool = False, batch_size: int = 1, inputs: Union[Mapping[str, str], Sequence[str]] = None, outputs: Union[Mapping[any, str], Sequence[str]] = None, output_type: Union[Mapping[str, Type[Column]], Type[Column]] = None, materialize: bool = True) Union[DataFrame, DeferredColumn] [source]¶
Create one or more DeferredColumns that lazily apply a function to each row in data.
This function shares nearly the exact same signature with map(); the difference is that defer() returns a column that has not yet been computed. It is a placeholder for a column that will be computed later.
Learn more in the user guide: Deferred map and chaining.
What gets returned by defer?
If function returns a single value, then defer will return a DeferredColumn object.
If function returns a dictionary, then defer will return a DataFrame containing DeferredColumn objects. The keys of the dictionary are used as column names. The outputs argument can be used to override the column names.
If function returns a tuple, then defer will return a DataFrame containing DeferredColumn objects. The column names will be integers. The column names can be overridden by passing a tuple to the outputs argument.
If function returns a tuple or a dictionary, then passing "single" to the outputs argument will cause defer to return a single DeferredColumn that materializes to an ObjectColumn.
How do you execute the deferred map?
Depending on function and the outputs argument, returns either a DeferredColumn or a DataFrame. Both are callables. To execute the deferred map, simply call the returned object.
- Parameters
data (DataFrame) – The DataFrame or Column to which the function will be applied.
function (Callable) – The function that will be applied to the rows of data.
is_batched_fn (bool, optional) – Whether the function must be applied on a batch of rows. Defaults to False.
batch_size (int, optional) – The size of the batch. Defaults to 1.
inputs (Dict[str, str], optional) – Dictionary mapping column names in data to keyword arguments of function. Ignored if data is a column. When calling function, values from the columns will be fed to the corresponding keyword arguments. Defaults to None, in which case it inspects the signature of the function. It then finds the columns with the same names in the DataFrame and passes the corresponding values to the function. If the function takes a non-default argument that is not a column in the DataFrame, the operation will raise a ValueError.
outputs (Union[Dict[any, str], Tuple[str]], optional) –
Controls how the output of function is mapped to the output of defer(). Defaults to None.
- If None: the output is inferred from the return type of the function. See explanation above.
- If "single": a single DeferredColumn is returned.
- If a Dict[any, str]: then a DataFrame containing DeferredColumns is returned. This is useful when the output of function is a Dict. outputs maps the outputs of function to column names in the resulting DataFrame.
- If a Tuple[str]: then a DataFrame containing DeferredColumns is returned. This is useful when the output of function is a Tuple. outputs maps the outputs of function to column names in the resulting DataFrame.
output_type (Union[Dict[str, type], type], optional) – Coerce the column. Defaults to None.
materialize (bool, optional) – Whether to materialize the input column(s). Defaults to True.
- Returns
A DeferredColumn or a DataFrame containing DeferredColumn objects, representing the deferred map.
- Return type
Union[DataFrame, DeferredColumn]
We start with a small DataFrame of voters with two columns: birth_year, which contains the birth year of each person, and residence, which contains the state in which each person lives.
In [1]: import datetime

In [2]: import meerkat as mk

In [3]: df = mk.DataFrame({
   ...:     "birth_year": [1967, 1993, 2010, 1985, 2007, 1990, 1943],
   ...:     "residence": ["MA", "LA", "NY", "NY", "MA", "MA", "LA"]
   ...: })
Single input column. Lazily map a column of birth years to a column of ages.
In [4]: df["age"] = df["birth_year"].defer(
   ...:     lambda x: datetime.datetime.now().year - x
   ...: )

In [5]: df["age"]
Out[5]: column([DeferredCell(fn=<lambda>), DeferredCell(fn=<lambda>), DeferredCell(fn=<lambda>), DeferredCell(fn=<lambda>), DeferredCell(fn=<lambda>), DeferredCell(fn=<lambda>), ...], backend=DeferredColumn
We can materialize the deferred map (i.e. run it) by calling the column.
In [6]: df["age"]()
Out[6]: column([56, 30, 13, 38, 16, 33, ...], backend=PandasScalarColumn
Multiple input columns. Lazily compute a column of Massachusetts voting eligibility from the age and residence columns.
In [7]: df["ma_eligible"] = df.defer(
   ...:     lambda age, residence: (residence == "MA") and (age >= 18)
   ...: )

In [8]: df["ma_eligible"]()
Out[8]: column([True, False, False, False, False, True, ...], backend=PandasScalarColumn
- concat(objs: Union[Sequence[meerkat.dataframe.DataFrame], Sequence[meerkat.columns.abstract.Column]], axis: Union[str, int] = 'rows', suffixes: Tuple[str] = None, overwrite: bool = False) Union[meerkat.dataframe.DataFrame, meerkat.columns.abstract.Column] [source]¶
Concatenate a sequence of columns or a sequence of DataFrames. If the sequence is empty, returns an empty DataFrame.
- If concatenating columns, all columns must be of the same type (e.g. all ListColumn).
- If concatenating DataFrames along axis 0 (rows), all DataFrames must have the same set of columns.
- If concatenating DataFrames along axis 1 (columns), all DataFrames must have the same length and cannot have any of the same column names.
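For example (column names are illustrative):
import meerkat as mk

df1 = mk.DataFrame({"a": [1, 2], "b": ["x", "y"]})
df2 = mk.DataFrame({"a": [3, 4], "b": ["z", "w"]})

# Along rows, both DataFrames must have the same set of columns.
rows = mk.concat([df1, df2], axis="rows")

# Along columns, lengths must match and names must be disjoint.
df3 = mk.DataFrame({"c": [5, 6]})
cols = mk.concat([df1, df3], axis="columns")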
- complete(df: meerkat.dataframe.DataFrame, prompt: str, engine: str, batch_size: int = 1, use_ray: bool = False, num_blocks: int = 100, blocks_per_window: int = 10, pbar: bool = False, client_connection: Optional[str] = None, cache_connection: str = '~/.manifest/cache.sqlite') meerkat.columns.scalar.abstract.ScalarColumn [source]¶
Apply a generative language model to each row in a DataFrame.
- Parameters
df (DataFrame) – The DataFrame to which the function will be applied.
prompt (str) – The prompt to apply to each row.
engine (str) – The name of the language model engine to use.
batch_size (int, optional) – The size of the batch. Defaults to 1.
materialize (bool, optional) – Whether to materialize the input column(s). Defaults to True.
use_ray (bool) – Use Ray to parallelize the computation. Defaults to False.
num_blocks (int) – When using Ray, the number of blocks to split the data into. Defaults to 100.
blocks_per_window (int) – When using Ray, the number of blocks to process in a single Ray task. Defaults to 10.
pbar (bool) – Show a progress bar. Defaults to False.
client_connection – The connection string for the client. This is typically the key (e.g. OPENAI). If it is not provided, it will be inferred from the engine.
cache_connection – The sqlite connection string for the cache.
- Returns
A column containing the model completions for each row.
- Return type
ScalarColumn
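A heavily hedged sketch of usage (the prompt template and engine string are illustrative; consult the engine registry for valid names):
import meerkat as mk

df = mk.DataFrame({"review": ["Great movie!", "Terrible plot."]})

# Hypothetical engine name; completions come back as a ScalarColumn.
sentiments = mk.complete(
    df,
    prompt="Classify the sentiment of this review: {review}",
    engine="openai/text-davinci-003",
)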
- merge(left: meerkat.dataframe.DataFrame, right: meerkat.dataframe.DataFrame, how: str = 'inner', on: Union[str, List[str]] = None, left_on: Union[str, List[str]] = None, right_on: Union[str, List[str]] = None, sort: bool = False, suffixes: Sequence[str] = ('_x', '_y'), validate=None) meerkat.dataframe.DataFrame [source]¶
Perform a database-style join operation between two DataFrames.
- Parameters
left (DataFrame) – Left DataFrame.
right (DataFrame) – Right DataFrame.
how (str, optional) – The join type. Defaults to “inner”.
on (Union[str, List[str]], optional) – The column(s) to join on. These columns must be ScalarColumn. Defaults to None, in which case the left_on and right_on parameters must be passed.
left_on (Union[str, List[str]], optional) – The column(s) in the left DataFrame to join on. These columns must be ScalarColumn. Defaults to None.
right_on (Union[str, List[str]], optional) – The column(s) in the right DataFrame to join on. These columns must be ScalarColumn. Defaults to None.
sort (bool, optional) – Whether to sort the result DataFrame by the join key(s). Defaults to False.
suffixes (Sequence[str], optional) – Suffixes to use when there are conflicting column names in the result DataFrame. Should be a sequence of length two, with suffixes[0] the suffix for the column from the left DataFrame and suffixes[1] the suffix for the right. Defaults to ("_x", "_y").
validate (str, optional) –
The check to perform on the result DataFrame. Defaults to None, in which case no check is performed. Valid options are:
“one_to_one” or “1:1”: check if merge keys are unique in both left and right datasets.
“one_to_many” or “1:m”: check if merge keys are unique in left dataset.
“many_to_one” or “m:1”: check if merge keys are unique in right dataset.
“many_to_many” or “m:m”: allowed, but does not result in checks.
- Returns
The merged DataFrame.
- Return type
DataFrame
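For example (column names are illustrative):
import meerkat as mk

users = mk.DataFrame({"id": [1, 2, 3], "name": ["ann", "bo", "cy"]})
scores = mk.DataFrame({"id": [1, 2, 2], "score": [0.9, 0.7, 0.8]})

# Inner join on the shared "id" column; validate="1:m" checks that
# the join keys are unique in the left DataFrame.
merged = mk.merge(users, scores, on="id", how="inner", validate="1:m")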
- embed(data: Union[meerkat.dataframe.DataFrame, meerkat.columns.abstract.Column, str, PIL.Image.Image], input: Optional[str] = None, encoder: Union[str, meerkat.ops.embed.encoder.Encoder] = 'clip', modality: Optional[str] = None, out_col: Optional[str] = None, device: Union[int, str] = 'auto', mmap_dir: Optional[str] = None, num_workers: int = 0, batch_size: int = 128, pbar: bool = True, **kwargs) Union[meerkat.dataframe.DataFrame, meerkat.columns.abstract.Column] [source]¶
Embed a column of data with an encoder from the encoder registry.
Examples
Suppose you have an Image dataset (e.g. Imagenette, CIFAR-10) loaded into a Meerkat DataFrame. You can embed the images in the dataset with CLIP using a code snippet like:
import meerkat as mk

df = mk.datasets.get("imagenette")
df = mk.embed(
    data=df,
    input_col="img",
    encoder="clip"
)
- Parameters
data (Union[mk.DataFrame, mk.AbstractColumn]) – A dataframe or column containing the data to embed.
input_col (str, optional) – If data is a dataframe, the name of the column to embed. If data is a column, then the parameter is ignored. Defaults to None.
encoder (Union[str, Encoder], optional) – Name of the encoder to use. List supported encoders with domino.encoders. Defaults to "clip". Alternatively, pass an Encoder object containing a custom encoder.
modality (str, optional) – The modality of the data to be embedded. Defaults to None, in which case the modality is inferred from the type of the input column.
out_col (str, optional) – The name of the column where the embeddings are stored. Defaults to None, in which case it is "{encoder}({input_col})".
device (Union[int, str], optional) – The device on which the encoder will run. Defaults to "auto".
mmap_dir (str, optional) – The path to the directory where a memory-mapped file containing the embeddings will be written. Defaults to None, in which case the embeddings are not memmapped.
num_workers (int, optional) – Number of worker processes used to load the data from disk. Defaults to 0.
batch_size (int, optional) – Size of the batches to use. Defaults to 128.
**kwargs – Additional keyword arguments are passed to the encoder. To see supported arguments for each encoder, see the encoder documentation (e.g. clip()).
- Returns
A view of data with a new column containing the embeddings. This column will be named according to the out_col parameter.
- Return type
mk.DataFrame
- sort(data: meerkat.dataframe.DataFrame, by: Union[str, List[str]], ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.dataframe.DataFrame [source]¶
Sort a DataFrame or Column. If a DataFrame, sort by the values in the specified columns. Similar to sort_values in pandas.
- Parameters
data (Union[DataFrame, AbstractColumn]) – DataFrame or Column to sort.
by (Union[str, List[str]]) – The columns to sort by. Ignored if data is a Column.
ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.
kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.
- Returns
A sorted view of DataFrame.
- Return type
DataFrame
- sample(data: Union[meerkat.dataframe.DataFrame, meerkat.columns.abstract.Column], n: int = None, frac: float = None, replace: bool = False, weights: Union[str, numpy.ndarray] = None, random_state: Union[int, numpy.random.mtrand.RandomState] = None) Union[meerkat.dataframe.DataFrame, meerkat.columns.abstract.Column] [source]¶
Select a random sample of rows from a DataFrame or Column. Roughly equivalent to sample in Pandas (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html).
- Parameters
data (Union[DataFrame, AbstractColumn]) – DataFrame or Column to sample from.
n (int) – Number of samples to draw. If frac is specified, this parameter should not be passed. Defaults to 1 if frac is not passed.
frac (float) – Fraction of rows to sample. If n is specified, this parameter should not be passed.
replace (bool) – Sample with or without replacement. Defaults to False.
weights (Union[str, np.ndarray]) – Weights to use for sampling. If None (default), the rows will be sampled uniformly. If a numpy array, the sample will be weighted accordingly. If a string and data is a DataFrame, the weights will be taken from the column with the specified name. If weights do not sum to 1 they will be normalized to sum to 1.
random_state (Union[int, np.random.RandomState]) – Random state or seed to use for sampling.
- Returns
A random sample of rows from the DataFrame or Column.
- Return type
Union[DataFrame, AbstractColumn]
- shuffle(data: Union[meerkat.dataframe.DataFrame, meerkat.columns.abstract.Column], seed=None) Union[meerkat.dataframe.DataFrame, meerkat.columns.abstract.Column] [source]¶
Shuffle the rows of a DataFrame or Column.
Shuffling is done out-of-place and with numpy.
- groupby(data: meerkat.dataframe.DataFrame, by: Union[str, Sequence[str]] = None) meerkat.ops.sliceby.groupby.GroupBy [source]¶
Perform a groupby operation on a DataFrame or Column (similar to the DataFrame.groupby and Series.groupby operations in Pandas).
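A hedged sketch of the pattern (the mean aggregation on the resulting GroupBy is assumed, mirroring the Pandas API):
import meerkat as mk

df = mk.DataFrame({
    "residence": ["MA", "LA", "MA", "NY"],
    "age": [56, 30, 16, 38],
})

# Group rows by state, then aggregate a column per group.
gb = mk.groupby(df, by="residence")
mean_age = gb["age"].mean()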
- clusterby(data: DataFrame, by: Union[str, Sequence[str]], method: Union[str, 'ClusterMixin'] = 'KMeans', encoder: str = 'clip', modality: str = None, **kwargs) ClusterBy [source]¶
Perform a clusterby operation on a DataFrame.
- Parameters
data (DataFrame) – The dataframe to cluster.
by (Union[str, Sequence[str]]) – The column(s) to cluster by. These columns will be embedded using the encoder and the resulting embedding will be used.
method (Union[str, "ClusterMixin"]) – The clustering method to use.
encoder (str) – The encoder to use for the embedding. Defaults to clip.
modality (Union[str, Sequence[str]]) – The modality of the columns to embed.
**kwargs – Additional keyword arguments to pass to the clustering method.
- Returns
A ClusterBy object.
- Return type
ClusterBy
- explainby(data: DataFrame, by: Union[str, Sequence[str]], target: Union[str, Mapping[str]], method: Union[str, 'domino.Slicer'] = 'MixtureSlicer', encoder: str = 'clip', modality: str = None, scores: bool = False, use_cache: bool = True, output_col: str = None, **kwargs) ExplainBy [source]¶
Perform an explainby operation on a DataFrame.
- Parameters
data (DataFrame) – The dataframe to cluster.
by (Union[str, Sequence[str]]) – The column(s) to slice by. These columns will be embedded using the encoder and the resulting embedding will be used.
method (Union[str, domino.Slicer]) – The slicing method to use.
encoder (str) – The encoder to use for the embedding. Defaults to clip.
modality (Union[str, Sequence[str]]) – The modality of the columns to embed.
**kwargs – Additional keyword arguments to pass to the slicing method.
- Returns
A ExplainBy object.
- Return type
ExplainBy
- cand(*args)[source]¶
Overloaded and operator.
Use this when you want to use the and operator on reactive values (e.g. Store).
- Parameters
*args – The arguments to and together.
- Returns
The result of the and operation.
- cor(*args)[source]¶
Overloaded or operator.
Use this when you want to use the or operator on reactive values (e.g. Store).
- Parameters
*args – The arguments to or together.
- Returns
The result of the or operation.
- cnot(x)[source]¶
Overloaded not operator.
Use this when you want to use the not operator on reactive values (e.g. Store).
- Parameters
x – The arguments to not.
- Returns
The result of the not operation.
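A short sketch of why these helpers exist: Python's and/or/not keywords cannot be overloaded on Store objects, so the functional forms are used instead (assuming the helpers are exported at the top level):
import meerkat as mk

a = mk.Store(True)
b = mk.Store(False)

# Reactive boolean logic; re-evaluated when a or b changes.
both = mk.cand(a, b)     # reactive `a and b`
either = mk.cor(a, b)    # reactive `a or b`
negated = mk.cnot(a)     # reactive `not a`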
- all(iterable, /) bool ¶
Return True if bool(x) is True for all values x in the iterable.
If the iterable is empty, return True.
- any(iterable, /) bool ¶
Return True if bool(x) is True for any x in the iterable.
If the iterable is empty, return False.
- bool(x) bool ¶
Overloaded bool operator.
Use this when you want to use the bool operator on reactive values (e.g. Store).
- len(obj, /)¶
Return the number of items in a container.
- hex(number, /) str ¶
Return the hexadecimal representation of an integer.
>>> hex(12648430)
'0xc0ffee'
- slice(*args)¶
Overloaded slice class.
- sum(iterable, /, start=0) float ¶
Return the sum of a 'start' value (default: 0) plus an iterable of numbers.
When the iterable is empty, return the start value. This function is intended specifically for use with numeric values and may reject non-numeric types.
- from_csv(filepath: str, primary_key: Optional[str] = None, backend: str = 'pandas', *args, **kwargs) meerkat.dataframe.DataFrame ¶
Create a DataFrame from a csv file. All of the columns will be meerkat.ScalarColumn with backend Pandas.
- Parameters
filepath (str) – The file path or buffer to load from. Same as pandas.read_csv().
*args – Argument list for pandas.read_csv().
**kwargs – Keyword arguments forwarded to pandas.read_csv().
- Returns
The constructed dataframe.
- Return type
DataFrame
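Example
A minimal sketch (the file name and primary key column are assumptions):
import meerkat as mk
df = mk.from_csv("data.csv", primary_key="id")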
- from_json(filepath: str, primary_key: Optional[str] = None, orient: str = 'records', lines: bool = False, backend: str = 'pandas', **kwargs) meerkat.dataframe.DataFrame ¶
Load a DataFrame from a json file.
By default, data in the JSON file should be a list of dictionaries, each with an entry for each column. This is the orient="records" format. If the data is in a different format in the JSON, you can specify the orient parameter. See pandas.read_json() for more details.
- Parameters
filepath (str) – The file path or buffer to load from. Same as pandas.read_json().
orient (str) – The expected JSON string format. Options are: “split”, “records”, “index”, “columns”, “values”. Same as pandas.read_json().
lines (bool) – Whether the json file is a jsonl file. Same as pandas.read_json().
backend (str) – The backend to use for the loading and resulting columns.
**kwargs – Keyword arguments forwarded to pandas.read_json().
- Returns
The constructed dataframe.
- Return type
DataFrame
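Example
A minimal sketch for a JSON-lines file (the file name is an assumption):
import meerkat as mk
df = mk.from_json("data.jsonl", orient="records", lines=True)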
- from_parquet(filepath: str, primary_key: Optional[str] = None, engine: str = 'auto', columns: Optional[Sequence[str]] = None, **kwargs) meerkat.dataframe.DataFrame ¶
Create a DataFrame from a parquet file. All of the columns will be meerkat.ScalarColumn with backend Pandas.
- Parameters
filepath (str) – The file path or buffer to load from. Same as pandas.read_parquet().
engine (str) – The parquet engine to use. Same as pandas.read_parquet().
columns (Optional[Sequence[str]]) – The columns to load. Same as pandas.read_parquet().
**kwargs – Keyword arguments forwarded to pandas.read_parquet().
- Returns
The constructed dataframe.
- Return type
DataFrame
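Example
A minimal sketch (the file name and column names are assumptions):
import meerkat as mk
df = mk.from_parquet("data.parquet", columns=["id", "text"])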
- from_feather(filepath: str, primary_key: Optional[str] = None, columns: Optional[Sequence[str]] = None, use_threads: bool = True, **kwargs) meerkat.dataframe.DataFrame ¶
Create a DataFrame from a feather file. All of the columns will be meerkat.ScalarColumn with backend Pandas.
- Parameters
filepath (str) – The file path or buffer to load from. Same as pandas.read_feather().
columns (Optional[Sequence[str]]) – The columns to load. Same as pandas.read_feather().
use_threads (bool) – Whether to use threads to read the file. Same as pandas.read_feather().
**kwargs – Keyword arguments forwarded to pandas.read_feather().
- Returns
The constructed dataframe.
- Return type
DataFrame
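Example
A minimal sketch (the file name is an assumption):
import meerkat as mk
df = mk.from_feather("data.feather", use_threads=True)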
- from_pandas(df: pandas.core.frame.DataFrame, index: bool = True, primary_key: Optional[str] = None) meerkat.dataframe.DataFrame ¶
Create a Meerkat DataFrame from a Pandas DataFrame.
Warning
In Meerkat, column names must be strings, so non-string column names in the Pandas DataFrame will be converted.
- Parameters
df – The Pandas DataFrame to convert.
index – Whether to include the index of the Pandas DataFrame as a column in the Meerkat DataFrame.
primary_key – The name of the column to use as the primary key. If index is True and primary_key is None, the index will be used as the primary key. If index is False, then no primary key will be set. Optional; defaults to None.
- Returns
The Meerkat DataFrame.
- Return type
DataFrame
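Example
A minimal sketch converting a small pandas DataFrame; with index=True, the pandas index is carried over and used as the primary key:
import meerkat as mk
import pandas as pd
pdf = pd.DataFrame({"text": ["a", "b"]})
df = mk.from_pandas(pdf, index=True)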
- from_arrow(table: pyarrow.lib.Table)¶
Create a DataFrame from a pyarrow Table.
- from_huggingface(*args, **kwargs)¶
Load a Huggingface dataset as a DataFrame.
Use this to replace datasets.load_dataset, so
>>> dict_of_datasets = datasets.load_dataset('boolq')
becomes
>>> dict_of_dataframes = DataFrame.from_huggingface('boolq')
- read(path: str, overwrite: bool = False, *args, **kwargs) meerkat.dataframe.DataFrame ¶
Load a DataFrame stored on disk.
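Example
A minimal sketch (the path is hypothetical and assumes a DataFrame was previously written to disk):
import meerkat as mk
df = mk.DataFrame.read("path/to/saved_df")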
- class BaseFormatter[source]¶
- encode(cell: Any, **kwargs)[source]¶
Encode the cell on the backend before sending it to the frontend.
The cell is lazily loaded, so when used on a LambdaColumn, cell will be a LambdaCell. This is important for displays that don’t actually need to apply the lambda in order to display the value.
- static to_yaml(dumper: yaml.dumper.Dumper, data: meerkat.interactive.formatter.base.BaseFormatter)[source]¶
This function is called by the YAML dumper to convert a Formatter object into a YAML node. It should not be called directly.
- class FormatterGroup(base: Optional[meerkat.interactive.formatter.base.BaseFormatter] = None, **kwargs)[source]¶
A formatter group is a mapping from formatter placeholders to formatters.
Data in a Meerkat column sometimes need to be displayed differently in different GUI contexts. For example, in a table, we display thumbnails of images, but in a carousel view, we display the full image.
Because most components in Meerkat work on any data type, it is important that they are implemented in a formatter-agnostic way. So, instead of specifying formatters, components make requests for data specifying a formatter placeholder. For example, the mk.gui.Gallery component requests data using the thumbnail formatter placeholder.
For a specific column of data, we specify which formatters to use for each placeholder using a formatter group. A formatter group is a mapping from formatter placeholders to formatters. Each column in Meerkat has a formatter_group property. A column’s formatter group controls how it will be displayed in different contexts in Meerkat GUIs.
- Parameters
base (BaseFormatter) – The base formatter to use.
**kwargs – The formatters to add to the formatter group.
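Example
A minimal sketch of building a formatter group (the ImageFormatter import path and the "thumbnail" placeholder key are assumptions for illustration):
from meerkat.interactive.formatter import ImageFormatter  # import path assumed
from meerkat.interactive.formatter.base import FormatterGroup

group = FormatterGroup(
    base=ImageFormatter(),  # fallback when no placeholder-specific formatter is set
    thumbnail=ImageFormatter(max_size=(48, 48)),  # used for "thumbnail" requests
)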
- static to_yaml(dumper: yaml.dumper.Dumper, data: meerkat.interactive.formatter.base.BaseFormatter)[source]¶
This function is called by the YAML dumper to convert a Formatter object into a YAML node. It should not be called directly.
- get(name: str, dataset_dir: Optional[str] = None, version: Optional[str] = None, download_mode: str = 'reuse', registry: Optional[str] = None, **kwargs) Union[meerkat.dataframe.DataFrame, Dict[str, meerkat.dataframe.DataFrame]] [source]¶
Load a dataset into Meerkat.
- Parameters
name (str) – Name of the dataset.
dataset_dir (str) – The directory containing dataset data. Defaults to ~/.meerkat/datasets/{name}.
version (str) – The version of the dataset. Defaults to latest.
download_mode (str) – The download mode. Options are: “reuse” (default) will download the dataset if it does not exist, “force” will download the dataset even if it exists, “extract” will reuse any downloaded archives but force extracting those archives, and “skip” will not download the dataset if it doesn’t yet exist. Defaults to reuse.
registry (str) – The registry to use. If None, then checks each supported registry in turn. Currently, supported registries include meerkat and huggingface.
**kwargs – Additional arguments passed to the dataset.
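Example
A minimal sketch ("imagenette" stands in for any registered dataset name):
import meerkat as mk
df = mk.get("imagenette", download_mode="reuse")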
- DataPanel¶
alias of meerkat.dataframe.DataFrame
- scalar¶
- tensor¶
- deferred¶
- objects¶
- files¶
- image(filepaths: typing.Sequence[str], base_dir: typing.Optional[str] = None, downloader: typing.Optional[typing.Union[callable, str]] = None, loader: callable = <function load_image>, cache_dir: typing.Optional[str] = None)[source]¶
Create a FileColumn where each cell represents an image stored on disk. The underlying data is a ScalarColumn of strings, where each string is the path to an image.
- Parameters
filepaths (Sequence[str]) – A list of filepaths to images.
loader (Union[str, Callable[[Union[str, IO]], Any]]) – a callable that accepts a filepath or an I/O stream and returns data.
base_dir (str, optional) – an absolute path to a directory containing the files. If provided, the filepath to be loaded will be joined with the base_dir. As such, this argument should only be used if the loader will be applied to relative paths. The base_dir can also include environment variables (e.g. $DATA_DIR/images) which will be expanded prior to loading. This is useful when sharing DataFrames between machines.
downloader (Union[str, callable], optional) – a callable that accepts at least two positional arguments - a URI and a destination (which could be either a string or file object). Meerkat includes a small set of built-in downloaders [“url”, “gcs”] which can be specified via string.
fallback_downloader (callable, optional) – a callable that will be run each time the downloader fails (for any reason). This is useful, for example, if you expect some of the URIs in a dataset to be broken; fallback_downloader could write an empty file in place of the original. If fallback_downloader is not supplied, the original exception is re-raised.
cache_dir (str, optional) – the directory on disk where downloaded files are to be cached. Defaults to None, in which case files will be re-downloaded on every access of the data. The cache_dir can also include environment variables (e.g. $DATA_DIR/images) which will be expanded prior to loading. This is useful when sharing DataFrames between machines.
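Example
A minimal sketch (the file names and $DATA_DIR layout are assumptions):
import meerkat as mk
df = mk.DataFrame({
    "img": mk.image(["a.jpg", "b.jpg"], base_dir="$DATA_DIR/images"),
})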
- audio¶
- class classproperty(fget=None, fset=None, fdel=None, doc=None)[source]¶
Taken from https://stackoverflow.com/a/13624858.
The behavior of class properties using the @classmethod and @property decorators has changed across Python versions. This class (should) provide consistent behavior across Python versions. See https://stackoverflow.com/a/1800999 for more information.