etl_lib.data_source.ParquetBatchSource module
- class ParquetBatchSource(file, context, task=None, **kwargs)
Bases: BatchProcessor
BatchProcessor that reads a Parquet file using pyarrow.
Each returned batch of rows has an additional _row column containing the row's position in the source data, counted from 0.
- Parameters:
file (Path)
context (ETLContext)
task (Task | None)
- __init__(file, context, task=None, **kwargs)
Constructs a new ParquetBatchSource.
- Parameters:
file (Path) – Path to the Parquet file.
context (ETLContext) – etl_lib.core.ETLContext.ETLContext instance.
kwargs – Will be passed on to the pyarrow.parquet.ParquetFile.iter_batches method.
task (Task | None)
- get_batch(max_batch_size)
Provides a batch of data to the caller.
The batch may be requested from the provided predecessor and processed, or generated from another source.
- Parameters:
max_batch_size (int) – The maximum size of the batch the caller expects to receive.
- Return type:
- Returns:
A generator that yields batches.
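The batching behaviour described for this class, a generator that yields batches of at most max_batch_size rows, each row tagged with a zero-based _row value, can be sketched in plain Python. This illustrates the documented contract only and is not the library's implementation; the function batches_with_row_index and the sample rows are hypothetical.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def batches_with_row_index(
    rows: Iterable[Dict[str, Any]], max_batch_size: int
) -> Iterator[List[Dict[str, Any]]]:
    """Yield lists of at most max_batch_size rows, adding a _row index."""
    # Attach the source position (starting at 0) to a copy of every row.
    numbered = ({**row, "_row": index} for index, row in enumerate(rows))
    while True:
        batch = list(islice(numbered, max_batch_size))
        if not batch:
            return  # Source exhausted.
        yield batch


source_rows = [{"name": "a"}, {"name": "b"}, {"name": "c"}]
batches = list(batches_with_row_index(source_rows, max_batch_size=2))
```

With three input rows and max_batch_size=2, the generator yields one batch of two rows followed by one batch of a single row, and the _row values run from 0 to 2 across batches.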