Blocks and the BlockManager

In Meerkat, the columns of a DataFrame are grouped together into blocks, sets of columns with similar underlying storage (e.g. NumPy arrays). Organizing columns into blocks enables:

  1. Vectorized row-wise operations (e.g. slicing, reduction)

  2. Simplified I/O and improved latency
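To make the first benefit concrete, here is a minimal sketch that uses plain NumPy rather than Meerkat's own API. Three same-typed columns are stacked into one 2-D array, which stands in for a block, so a row slice or a reduction becomes a single array operation instead of one per column. The column names and the row-by-column layout are assumptions made for illustration.

```python
import numpy as np

# Three hypothetical columns with the same dtype and length.
columns = {
    "a": np.array([1.0, 2.0, 3.0, 4.0]),
    "b": np.array([10.0, 20.0, 30.0, 40.0]),
    "c": np.array([100.0, 200.0, 300.0, 400.0]),
}

# Stand-in for a block: one 2-D array holding every column side by side.
names = list(columns)
block = np.stack([columns[name] for name in names], axis=1)  # shape (4, 3)

# One slice on the block selects rows 1..2 for every column at once.
block_rows = block[1:3]

# The per-column equivalent needs one slice per column.
per_column_rows = {name: col[1:3] for name, col in columns.items()}

# A reduction over rows is likewise a single call on the block.
block_sums = block.sum(axis=0)
```

Both routes produce the same values; the block version simply issues one NumPy call regardless of how many columns share the block.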

The most important internal piece of the Meerkat DataFrame implementation is the BlockManager, a dict-like object that maps column names to columns. It manages the links between a DataFrame’s columns and the data blocks (AbstractBlock, NumpyBlock) where the data is actually stored. It implements consolidate, which takes columns of similar type in a DataFrame and stores their data together in a block, and apply, which applies row-wise operations (e.g. __getitem__) to the blocks in a vectorized fashion. Other important classes:

  • BlockRef objects link a block with the BlockManager. These are critical to the functioning of the BlockManager and are the primary type of object passed between the blocks and the BlockManager. They consist of two things:

    1. A reference to the block (self.block)

    2. A set of columns in the BlockManager whose data live in the block

  • BlockableMixin - a mixin used with Column that holds references to a column’s block and the column’s index in the block

  • BlockView - a simple dataclass holding a block and an index into the block. New columns are typically created from a BlockView
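The relationships between these classes can be sketched with a few small stand-ins. This is not Meerkat's implementation: every class name ending in `Sketch`, the attribute names `_block_index` and `data`, and the use of a dict (rather than a set) for the columns of a block reference are assumptions chosen to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Dict

import numpy as np


@dataclass
class NumpyBlockSketch:
    """Hypothetical stand-in for a block: columns stored side by side."""
    data: np.ndarray  # shape (n_rows, n_columns)


@dataclass
class BlockViewSketch:
    """A block plus an index into it, enough to materialize one column."""
    block: NumpyBlockSketch
    block_index: int

    @property
    def data(self) -> np.ndarray:
        # Basic indexing returns a NumPy view, so no data is copied here.
        return self.block.data[:, self.block_index]


class BlockableColumnSketch:
    """Plays the role of a column using BlockableMixin: it remembers
    its block and its index in that block."""

    def __init__(self, view: BlockViewSketch):
        self._block = view.block
        self._block_index = view.block_index
        self.data = view.data


@dataclass
class BlockRefSketch:
    """A block together with the named columns whose data live in it."""
    block: NumpyBlockSketch
    columns: Dict[str, BlockableColumnSketch] = field(default_factory=dict)


block = NumpyBlockSketch(np.arange(6).reshape(3, 2))
ref = BlockRefSketch(block)
ref.columns["x"] = BlockableColumnSketch(BlockViewSketch(block, 0))
ref.columns["y"] = BlockableColumnSketch(BlockViewSketch(block, 1))
```

Each column here is created from a view, and its data aliases the block's storage rather than copying it.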

BlockManager

Manages all the columns in a DataFrame and holds references (BlockRef) to all the blocks in a DataFrame. This is done with two collections:

  • _columns, a dictionary mapping from column names to Column

  • _block_refs, a dictionary mapping from the block’s id to BlockRef
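A minimal sketch of these two collections, assuming NumPy arrays as blocks and Python's built-in `id()` as the block id. The class name `BlockManagerSketch`, the helper `add_block`, and the plain-dict block references are hypothetical and only illustrate the bookkeeping.

```python
from typing import Dict, List

import numpy as np


class BlockManagerSketch:
    """Hypothetical, heavily simplified manager; not Meerkat's implementation."""

    def __init__(self) -> None:
        # column name -> column data (a view into some block)
        self._columns: Dict[str, np.ndarray] = {}
        # id(block) -> {"block": the block, "columns": {name: index in block}}
        self._block_refs: Dict[int, dict] = {}

    def add_block(self, block: np.ndarray, names: List[str]) -> None:
        """Register a block and one named column per column of the block."""
        ref = {"block": block, "columns": {}}
        for index, name in enumerate(names):
            ref["columns"][name] = index
            self._columns[name] = block[:, index]
        self._block_refs[id(block)] = ref

    def __getitem__(self, name: str) -> np.ndarray:
        return self._columns[name]

    def __len__(self) -> int:
        return len(self._columns)


manager = BlockManagerSketch()
floats = np.zeros((3, 2))
ints = np.ones((3, 1), dtype=np.int64)
manager.add_block(floats, ["a", "b"])
manager.add_block(ints, ["c"])
```

Lookups by column name go through `_columns`, while anything that has to touch the underlying storage goes through `_block_refs`.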

It implements the following methods:

**consolidate**

### PSEUDOCODE
block_groups = group blocks by signature
for group in block_groups:
    for block in group:
        # get a "view" of the subset of the columns in the block
        # (note this may take multiple )
    # concat the views into a single new block
    # and record the mapping from column name to index in the new block

IMPORTANT: After a consolidate, all columns have their own memory!
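The pseudocode above can be sketched in runnable form with NumPy. Several things here are assumptions rather than Meerkat's behavior: columns are plain 1-D arrays, the signature is `(dtype, length)`, and the function returns its own `(blocks, mapping)` pair. The sketch also follows one reading of the note above, namely that the consolidated data no longer aliases the pre-consolidation storage.

```python
from collections import defaultdict

import numpy as np


def consolidate(columns):
    """Group 1-D columns by signature and copy each group into one 2-D block.

    Returns (blocks, mapping), where mapping[name] = (signature, index in block).
    """
    # Group column names by signature; (dtype, length) is an assumption.
    groups = defaultdict(list)
    for name, col in columns.items():
        groups[(col.dtype.str, col.shape[0])].append(name)

    blocks, mapping = {}, {}
    for signature, names in groups.items():
        # np.stack allocates a new array, so the new block does not
        # alias the storage the columns previously pointed into.
        blocks[signature] = np.stack([columns[n] for n in names], axis=1)
        for index, name in enumerate(names):
            mapping[name] = (signature, index)
    return blocks, mapping


old_block = np.arange(12, dtype=np.float64).reshape(4, 3)
columns = {
    "a": old_block[:, 0],  # view into an existing block
    "b": old_block[:, 2],  # another view; column 1 is no longer used
    "n": np.array([1, 2, 3, 4], dtype=np.int64),
}
blocks, mapping = consolidate(columns)
new_columns = {name: blocks[sig][:, idx] for name, (sig, idx) in mapping.items()}
```

In this example the two float columns end up in one compact block, the unused middle column of `old_block` is dropped, and the integer column gets a block of its own.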

**apply**

How do block operations work?

  • Apply the operation to each block in the DataFrame

    • Each new block should

  • Create mapping
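The two steps above can be sketched as follows, assuming each block is a 2-D NumPy array and the operation only changes the row axis. The function `apply_to_blocks` and the `(block, {name: index})` representation of a block reference are hypothetical simplifications, not Meerkat's API.

```python
import numpy as np


def apply_to_blocks(block_refs, op):
    """Apply `op` once per block, then rebuild every column from the new blocks.

    block_refs: list of (block, {column name: index in block}) pairs.
    `op` is assumed to leave the column axis of each block untouched.
    """
    new_refs, new_columns = [], {}
    for block, name_to_index in block_refs:
        new_block = op(block)  # one vectorized call covers all columns in the block
        new_refs.append((new_block, dict(name_to_index)))
        # Create the mapping from column names to views of the new block.
        for name, index in name_to_index.items():
            new_columns[name] = new_block[:, index]
    return new_refs, new_columns


floats = np.arange(8, dtype=np.float64).reshape(4, 2)
ints = np.arange(4).reshape(4, 1)
block_refs = [(floats, {"a": 0, "b": 1}), (ints, {"id": 0})]

# A row-wise __getitem__: keep rows 1 and 3 of every column.
new_refs, new_columns = apply_to_blocks(block_refs, lambda block: block[[1, 3]])
```

The operation runs twice here, once per block, even though three columns are affected.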

**add**

  • Single

  • Multiple

**remove**

When deleting a column, we have to be sure to delete the reference to the block
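A minimal sketch of that bookkeeping, under the assumption that a block reference is dropped once the last column using the block has been removed. The function `remove_column` and the plain-dict structures are hypothetical stand-ins for the BlockManager's internals.

```python
import numpy as np


def remove_column(columns, block_refs, name):
    """Drop `name`; release its block once no remaining column uses it."""
    block_id = columns.pop(name)
    ref = block_refs[block_id]
    ref["columns"].discard(name)
    if not ref["columns"]:
        # Last column gone: delete the reference so the block can be freed.
        del block_refs[block_id]


block = np.zeros((3, 2))
block_refs = {id(block): {"block": block, "columns": {"a", "b"}}}
columns = {"a": id(block), "b": id(block)}  # column name -> id of its block

remove_column(columns, block_refs, "a")
refs_after_first = len(block_refs)   # block still referenced by "b"
remove_column(columns, block_refs, "b")
refs_after_second = len(block_refs)  # no columns left, reference dropped
```

Without the final `del`, the manager would keep the block alive even though no column can reach its data any more.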

**get_columns**

BlockRef

A BlockRef is the link between a DataFrame and a single block. It consists of two things:

  • A reference to the block (self._block)

  • A set of columns (of type BlockableMixin)

AbstractBlock

Multiple A block can exist in multiple .

BlockableMixin

This is mixed into Column subclasses that can take part in a block (e.g.