etl_lib.data_source.ParquetBatchSource module

class ParquetBatchSource(file, context, task=None, **kwargs)

Bases: BatchProcessor

BatchProcessor that reads a Parquet file using pyarrow.

Each row in a returned batch has an additional _row column containing the zero-based index of that row in the source file.

Parameters:
__init__(file, context, task=None, **kwargs)

Constructs a new ParquetBatchSource.

Parameters:
get_batch(max_batch_size)

Provides a batch of data to the caller.

The batch may be fetched from the provided predecessor and processed, or it may be generated from another source.

Parameters:
  • max_batch_size (int) – The maximum size of the batch the caller expects to receive.

Return type:

Generator[BatchResults, None, None]

Returns:

A generator that yields batches.
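The generator contract can be illustrated without any library code. The sketch below is a stand-in, assuming batches are plain lists of dicts; the real method yields BatchResults objects, whose structure is not shown here. The name get_batch_sketch is hypothetical.

```python
from typing import Any, Dict, Generator, Iterable, List

Row = Dict[str, Any]


def get_batch_sketch(
    rows: Iterable[Row], max_batch_size: int
) -> Generator[List[Row], None, None]:
    """Yield lists of at most max_batch_size rows until rows is exhausted."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    batch: List[Row] = []
    for row in rows:
        batch.append(row)
        if len(batch) == max_batch_size:
            yield batch
            batch = []
    if batch:
        # The final batch may be smaller than max_batch_size.
        yield batch


source = ({"id": i} for i in range(5))
sizes = [len(batch) for batch in get_batch_sketch(source, max_batch_size=2)]
print(sizes)  # [2, 2, 1]
```

A caller simply iterates over the generator; the loop ends when the source has no more rows, so no sentinel value or explicit end-of-data check is needed.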

static get_total_rows(file)

Return type:

int

Parameters:

file (Path)