etl_lib.data_source.ParquetBatchSource module
- class ParquetBatchSource(context, task=None, file=None, **kwargs)[source]
Bases: BatchProcessor
BatchProcessor that reads a Parquet file using pyarrow.
Each returned batch of rows has an additional _row column containing the row's zero-based position in the source file.
- Parameters:
context (ETLContext)
task (Task | None)
file (Path)
- __init__(context, task=None, file=None, **kwargs)[source]
Constructs a new ParquetBatchSource.
- Parameters:
context (ETLContext) – etl_lib.core.ETLContext.ETLContext instance.
task (Optional[Task]) – etl_lib.core.Task.Task instance owning this processor.
file (Path) – Path to the Parquet file.
kwargs – Will be passed on to the pyarrow.parquet.ParquetFile.iter_batches method.
- get_batch(max_batch_size)[source]
Provides a batch of data to the caller.
The batch can either be requested from the provided predecessor and processed, or generated from another source.
- Parameters:
max_batch_size (int) – The maximum size of the batch the caller expects to receive.
- Return type:
- Returns
A generator that yields batches.
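The generator contract and the _row column can be sketched in plain Python. This is an illustration under assumptions, not the library's code: batches_with_row_index is a hypothetical name, and plain lists of dicts stand in for whatever batch type get_batch actually yields.

```python
from itertools import islice


def batches_with_row_index(rows, max_batch_size):
    """Hypothetical stand-in for the batching behaviour of get_batch.

    Yields lists of at most max_batch_size row dicts. Each row gets a
    _row key holding its zero-based position in the source.
    """
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    # Number the rows lazily so the whole source never has to fit in memory.
    numbered = ({**row, "_row": index} for index, row in enumerate(rows))
    while True:
        batch = list(islice(numbered, max_batch_size))
        if not batch:
            return
        yield batch


source_rows = [{"name": "a"}, {"name": "b"}, {"name": "c"}]
result = list(batches_with_row_index(source_rows, max_batch_size=2))
```

The last batch is shorter than max_batch_size when the row count is not a multiple of it, so callers should treat max_batch_size as an upper bound.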