etl_lib.core.BatchProcessor module

class BatchProcessor(context, task=None, predecessor=None)[source]

Bases: ABC

Allows assembly of etl_lib.core.Task.Task out of smaller building blocks.

This way, functionally, such as reading from a CSV file, writing to a database or validation can be implemented and tested independently and re-used.

BatchProcessors form, a linked list, where each processor only knows about its predecessor.

BatchProcessors process data in batches. A batch of data is requested from the provided predecessors get_batch() and returned in batches to the caller. Usage of Generators ensure that not all data must be loaded at once.

Parameters:

task (Task)

__init__(context, task=None, predecessor=None)[source]

Constructs a new etl_lib.core.BatchProcessor instance.

Parameters:
  • contextetl_lib.core.ETLContext.ETLContext instance. It Will be available to subclasses.

  • task (Task) – etl_lib.core.Task.Task this processor is part of. Needed for status reporting only.

  • predecessor – Source of batches for this processor. Can be None if no predecessor is needed (such as when this processor is the start of the queue).

context

etl_lib.core.ETLContext.ETLContext instance. Providing access to general facilities.

abstractmethod get_batch(max_batch__size)[source]

Provides a batch of data to the caller.

The batch itself could be called and processed from the provided predecessor or generated from other sources.

Parameters:

max_batch__size (int) – The max size of the batch the caller expects to receive.

Return type:

Generator[BatchResults, None, None]

Returns

A generator that yields batches.

predecessor

Predecessor, used as a source of batches.

task

The etl_lib.core.Task.Task owning instance.

class BatchResults(chunk, statistics=<factory>, batch_size=9223372036854775807)[source]

Bases: object

Return object of the get_batch() method, wrapping a batched data together with meta information.

Parameters:
__init__(chunk, statistics=<factory>, batch_size=9223372036854775807)
Parameters:
Return type:

None

batch_size: int = 9223372036854775807

size of the batch.

chunk: List[Any]

The batch of data.

statistics: dict

dict of statistic information, such as row processed, nodes writen, ..

append_result(org, stats)[source]

Appends the stats dict to the provided org.

Parameters:
  • org (BatchResults) – The original BatchResults object.

  • stats (dict) – dict containing statistics to be added to the org object.

Return type:

BatchResults

Returns:

New BatchResults object, where the statistics attribute is the merged result of the

provided parameters. Values in the dicts with the same key are added.