Copy vs. View Behavior
In Meerkat, as with other data structure libraries (e.g. NumPy, Pandas), it is important to understand whether two variables point to objects that share the same underlying data. If they do, modifying one will affect the other. If they don’t, the data must have been copied, which has implications for efficiency. Consider the following example:
>>> import numpy as np
>>> import meerkat as mk
>>> col1 = mk.TensorColumn(np.arange(10))
>>> col2 = col1[:4]
>>> col2[0] = -1
>>> print(col1[0])
Is 0 or -1 printed out? It turns out that in this case it is -1 that is printed. This is because col2 is a “view” of the col1 array, meaning that the two variables point to objects that share the same underlying data. However, if we were to change the assignment of col2 to col2 = col1[np.arange(4)], a seemingly inconsequential change, then the underlying data would be copied and it would be 0 that is printed.
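The same distinction can be reproduced with plain NumPy arrays, independent of Meerkat. The following is a minimal sketch that uses only NumPy; the variable names are illustrative and do not come from Meerkat.

```python
import numpy as np

first = np.arange(10)

# Basic indexing (a slice) returns a view: writes go through to `first`.
view = first[:4]
view[0] = -1
print(first[0])                          # -1
print(np.shares_memory(first, view))     # True

second = np.arange(10)

# Advanced indexing (an integer array) returns a copy: `second` is untouched.
copied = second[np.arange(4)]
copied[0] = -1
print(second[0])                         # 0
print(np.shares_memory(second, copied))  # False
```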
In this guide, we will discuss how to know when two variables in Meerkat share underlying data. In general, Meerkat inherits the copy and view behavior of its backend data structures (NumPy arrays, Pandas Series, Torch tensors). So, users who are familiar with those libraries should find it straightforward to predict Meerkat’s copying and viewing behavior.
We’ll begin by defining some terms: coreferences, views and copies. These terms describe the different relationships that could exist between two variables pointing to Column or DataFrame objects. Then, we’ll discuss how to know whether indexing a Meerkat data structure will result in a copy, coreference or view.
Copies, Views, and Coreferences
Columns
Let’s enumerate the different relationships that could exist between two column variables col1 and col2.
Coreferences - Both variables refer to the same Column object.
>>> col1 is col2
True
Of course, in this case, any changes made to col1 will also be made to col2 and vice versa.
Views - The variables refer to different Column objects (i.e. col1 is not col2), but modifying the data of col1 affects col2:
either because col1.data and col2.data reference the same object
# a. the underlying data variables reference the same object
>>> col1.data is col2.data
True
or because col1.data is a view of col2.data (or vice versa)
# For example, if col1.data is an np.ndarray
>>> isinstance(col1.data, np.ndarray)
True
# b. the underlying data share memory
>>> np.shares_memory(col1.data, col2.data)
True
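For NumPy-backed data, there are several ways to check these conditions, and they do not always agree. The sketch below uses plain NumPy arrays (not Meerkat columns) to contrast object identity, the .base attribute, and np.shares_memory; the variable names are illustrative.

```python
import numpy as np

owner = np.arange(10)    # owns its buffer, so owner.base is None
head = owner[:4]         # view of owner
middle = owner[2:6]      # another view of owner, overlapping head
tail = owner[6:]         # view of owner that does not overlap head

print(owner.base is None)              # True
print(head.base is owner)              # True
print(head.base is middle.base)        # True: both views have the same base
print(np.shares_memory(owner, head))   # True
print(np.shares_memory(head, middle))  # True: elements 2 and 3 overlap
print(np.shares_memory(head, tail))    # False, despite the common base
```

Comparing .base attributes only tells us that two arrays derive from the same owner, while np.shares_memory reports whether they actually overlap.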
How are views created? Views of a column are created in one of two ways:
Implicitly with col._clone(data=new_data) where col.data shares memory with new_data for one of the reasons described above.
Explicitly with col.view() which is simply a wrapper around col._clone:
def view(self):
    return self._clone()
What about other attributes? (e.g. loader in an ImageColumn) It depends. col1 and col2 refer to different column objects, so assignment to attributes in col1 will not affect col2 (and vice versa):
>>> col1.loader = fn1
>>> col1.loader == col2.loader
False
However, these attributes are not copied! So, stateful changes to the attributes will carry across columns:
>>> col1.loader.size = 224
>>> col2.loader.size == 224
True
If we’d like the attributes to be copied as well, we’ll have to use “Deep Copies”.
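The difference between rebinding an attribute and mutating a shared one can be illustrated with Python's standard copy module. This is only an analogy under stated assumptions: Loader and ToyColumn are hypothetical classes, and copy.copy and copy.deepcopy stand in for whatever mechanism Meerkat uses internally.

```python
import copy


class Loader:
    """Hypothetical stateful attribute, standing in for an image loader."""

    def __init__(self, size):
        self.size = size


class ToyColumn:
    """Hypothetical stand-in for a column; not Meerkat's implementation."""

    def __init__(self, data, loader):
        self.data = data
        self.loader = loader


col1 = ToyColumn(data=[1, 2, 3], loader=Loader(size=128))

# A shallow copy creates a new column object that references the same loader.
col2 = copy.copy(col1)
col1.loader.size = 224         # stateful change to the shared loader
print(col2.loader.size)        # 224

# Rebinding the attribute on one column leaves the other untouched.
col1.loader = Loader(size=64)
print(col2.loader.size)        # 224

# A deep copy duplicates the loader, so later changes do not carry over.
col3 = copy.deepcopy(col2)
col2.loader.size = 512
print(col3.loader.size)        # 224
```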
Copies - The variables refer to different Column objects (i.e. col1 is not col2), and modifying the data of col1 does not affect col2. In this case, col1.data and col2.data do not share memory.
How are copies created? Copies of a column are created in one of two ways:
Implicitly with col._clone(data=new_data) where col.data does not share memory with new_data.
Explicitly with col.copy() which is simply a wrapper around col._clone:
def copy(self):
    new_data = self._copy_data()
    return self._clone(data=new_data)
where _copy_data is a backend-specific method that copies the data. For example, if the backend is a NumPy array, then _copy_data will simply return self.data.copy(). This is an important point: each column must know how to truly copy its data.
What about other attributes? (e.g. loader in an ImageColumn) Same as “View” above.
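To make the relationship between _clone, view and copy concrete, here is a minimal sketch with a list-backed column. ToyColumn is a hypothetical class written for this illustration; it mirrors the method names used above but is not Meerkat's implementation.

```python
class ToyColumn:
    """Hypothetical list-backed column; a sketch, not Meerkat's implementation."""

    def __init__(self, data):
        self.data = data

    def _clone(self, data=None):
        # Reuse the existing data object unless new data is supplied.
        return ToyColumn(self.data if data is None else data)

    def _copy_data(self):
        # Backend-specific copy; for a list of ints, a new list is enough.
        return list(self.data)

    def view(self):
        return self._clone()

    def copy(self):
        return self._clone(data=self._copy_data())


col = ToyColumn([0, 1, 2, 3])
viewed = col.view()
copied = col.copy()

viewed.data[0] = -1
print(viewed is col)            # False: different column objects
print(viewed.data is col.data)  # True: same underlying data
print(col.data[0])              # -1: the write is visible through col
print(copied.data[0])           # 0: the copy is unaffected
```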
DataFrames
Let’s do the same for two DataFrame variables df1 and df2.
Coreferences - Both variables refer to the same DataFrame object.
>>> df1 is df2
True
Of course, in this case, anything that is done to df1 will also be done to df2 and vice versa.
Views - The variables refer to different DataFrame objects (i.e. df1 is not df2), but some of the columns in df1 are coreferences or views of some of the columns in df2.
How are views created? Views of a DataFrame are created in one of three ways:
Implicitly with df._clone(data=new_data) where new_data includes columns that are coreferences or views of columns in df, for one of the reasons described above.
Implicitly when a column from one DataFrame is added to another (e.g. df1["a"] = df2["b"]).
Explicitly with df.view() which simply calls col.view() on all its columns and then passes them to df._clone(data=view_columns).
What about other attributes? (e.g. index_column in an EntityDataFrame) It depends. df1 and df2 refer to different DataFrame objects, so assignment to attributes in df1 will not affect df2 (and vice versa):
>>> df1.loader = fn1
>>> df1.loader == df2.loader
False
However, these attributes are not copied! So, stateful changes to the attributes will carry across DataFrames:
>>> df1.loader.size = 224
>>> df2.loader.size == 224
True
Copies - The variables refer to different DataFrame objects (i.e. df1 is not df2), and all of the columns in df1 are copies of the columns in df2.
How are copies created? Copies of a column are created in one of two ways:
Implicitly with col._clone(data=new_data) where col.data does not share memory with new_data.
Explicitly with col.copy() which is simply a wrapper around col._clone:
def copy(self):
    new_data = self._copy_data()
    return self._clone(data=new_data)
where _copy_data is a backend-specific method that copies the data. For example, if the backend is a NumPy array, then _copy_data will simply return self.data.copy(). This is an important point: each column must know how to truly copy its data.
What about other attributes? (e.g. index_column in an EntityDataFrame) Same as “View” above.
Behavior when Indexing
Indexing rows
In Meerkat, we select rows by indexing with int, slice, Sequence[int], or an np.ndarray, torch.Tensor, or pandas.Series with an integer or boolean type.
We can select rows from a Column …
col: mk.Column = ...
# (1) int -> single value
value: object = col[0]
# (2) slice -> a sub column
new_col: mk.Column = col[0:10]
# (3) sequence -> a sub column
new_col: mk.Column = col[[0, 4, 6]]
… or from a DataFrame
df: mk.DataFrame = ...
# (1) int -> dict
row: dict = df[0]
# (2) slice -> a DataFrame slice
new_df: mk.DataFrame = df[0:10]
# (3) sequence -> a DataFrame slice
new_df: mk.DataFrame = df[[0, 4, 6]]
From a column. When selecting rows from a column col, Meerkat takes the following approach:
Step 1. Indexes the underlying data object stored at col.data (e.g. np.ndarray or torch.Tensor), always deferring to the copy/view strategy of that data structure. This gives us a new data object, new_data, which may or may not share memory with the original col.data depending on the strategy of the underlying data structure.
Copy/View strategies of data structures underlying core Meerkat columns:
torch: When accessing the contents of a tensor via indexing, PyTorch follows Numpy behaviors that basic indexing returns views, while advanced indexing returns a copy. Assignment via either basic or advanced indexing is in-place. See more examples in Numpy indexing documentation.
numpy: Advanced indexing always returns a copy of the data (contrast with basic slicing that returns a view). (source)
pandas: But in pandas, whether you get a view or not depends on the structure of the DataFrame and, if you are trying to modify a slice, the nature of the modification. (source)
Step 2. Clones the original column, col, and stores the newly indexed data object, new_data, in it (i.e. with col._clone(data=new_data)).
So, selecting rows from a column col returns either a view or a copy, depending on the underlying data structure.
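The two steps above can be sketched with a NumPy-backed toy column. ToyColumn and its methods are hypothetical names used only for this illustration; the real Meerkat classes do more work than this.

```python
import numpy as np


class ToyColumn:
    """Hypothetical NumPy-backed column; a sketch, not Meerkat's implementation."""

    def __init__(self, data):
        self.data = data

    def _clone(self, data):
        return ToyColumn(data)

    def __getitem__(self, index):
        # Step 1: index the underlying data, deferring to NumPy's copy/view rules.
        new_data = self.data[index]
        if isinstance(index, (int, np.integer)):
            return new_data  # a single value, not a column
        # Step 2: clone the column around the newly indexed data.
        return self._clone(new_data)


col = ToyColumn(np.arange(10))
sliced = col[0:4]        # basic indexing -> view
picked = col[[0, 4, 6]]  # advanced indexing -> copy

print(np.shares_memory(col.data, sliced.data))  # True
print(np.shares_memory(col.data, picked.data))  # False
print(col[3])                                   # 3
```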
From a DataFrame. When selecting rows from a DataFrame df, Meerkat takes the following approach:
Step 1. Indexes each of the columns using the strategy above.
Note: sometimes this step proceeds in batches according to the BlockManager.
Step 2. Clones the original DataFrame, df, passing the newly indexed columns. This new DataFrame will be:
either a view of the original df, if any of the indexed columns are views
or a copy if all of the indexed columns are copies
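The rule in Step 2 can be sketched with a dictionary of columns standing in for a DataFrame. Everything here is an assumption made for illustration: the helper names are hypothetical, one column is a NumPy array and the other a plain Python list (whose slices are always copies), and Meerkat's own bookkeeping is not modeled.

```python
import numpy as np


def index_column(data, index):
    """Index one column, deferring to the backend's copy/view behavior."""
    if isinstance(data, np.ndarray):
        return data[index]             # NumPy decides: view or copy
    if isinstance(index, slice):
        return data[index]             # list slicing always copies
    return [data[i] for i in index]    # list with integer sequence: copy


def index_rows(columns, index):
    """Step 1: index every column with the same row index."""
    return {name: index_column(data, index) for name, data in columns.items()}


def shares_data(old, new):
    if isinstance(old, np.ndarray) and isinstance(new, np.ndarray):
        return np.shares_memory(old, new)
    return old is new


def relationship(old_columns, new_columns):
    """Step 2 rule: 'view' if any column shares data, otherwise 'copy'."""
    shared = [shares_data(old_columns[n], new_columns[n]) for n in new_columns]
    return "view" if any(shared) else "copy"


df = {"a": np.arange(10), "b": list(range(10))}

print(relationship(df, index_rows(df, slice(0, 4))))  # view
print(relationship(df, index_rows(df, [0, 4, 6])))    # copy
```

With a slice, the NumPy column is a view even though the list column is copied, so the result counts as a view; with an integer sequence, both columns are copied.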
Indexing columns
In Meerkat, we select columns from a DataFrame by indexing with either a str or a Sequence[str]:
# (1) `str` -> single column
col: mk.Column = df["col_a"]
# (2) `Sequence[str]` -> multiple columns
df: mk.DataFrame = df[["col_a", "col_b"]]
When selecting columns from a DataFrame, Meerkat always returns a coreference to the underlying column(s), not a copy or view.
Indexing a single column (i.e. with a str) returns the underlying Column object directly. In the example below, col1 and col2 are coreferences of the same column.
# (1) `str` -> single column
>>> col1: mk.Column = df["col_a"]
>>> col2: mk.Column = df["col_a"]
>>> col1 is col2
True
Indexing multiple columns (i.e. with Sequence[str]) returns a view of the DataFrame holding coreferences to the columns in the original DataFrame. This means the Column objects held in the new DataFrame are the same Column objects held in the original DataFrame.
# (2) `Sequence[str]` -> multiple columns
>>> new_df: mk.DataFrame = df[["col_a", "col_b"]]
>>> new_df["col_a"] is df["col_a"]
True
>>> new_df["col_a"].data is df["col_a"].data
True