etl_lib.task.data_loading.ParallelCSVLoad2Neo4jTask module

class ParallelCSVLoad2Neo4jTask(context, file, model=None, error_file=None, table_size=10, batch_size=5000, max_workers=None, prefetch=4, **csv_reader_kwargs)[source]

Bases: Task

Parallel CSV → Neo4j load using the mix-and-batch strategy.

Wires a CSV reader, optional Pydantic validation, a diagonal splitter (to avoid overlapping node locks), and a Cypher sink. Rows are distributed into (row, col) partitions and processed in non-overlapping groups.
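
Conceptually, rows are bucketed into a table_size × table_size grid and the grid is processed diagonal by diagonal: within one wave, no two cells share a row or a column, so concurrent batches touch disjoint node partitions. A minimal sketch of that grouping idea, for illustration only (not the library's actual implementation):

    def diagonal_waves(table_size: int):
        # Yield waves of (row, col) cells; within a wave, no two cells
        # share a row or a column, so concurrent writers never contend
        # for locks on the same node partition.
        for d in range(table_size):
            yield [(r, (r + d) % table_size) for r in range(table_size)]

    for wave in diagonal_waves(3):
        print(wave)
    # [(0, 0), (1, 1), (2, 2)]
    # [(0, 1), (1, 2), (2, 0)]
    # [(0, 2), (1, 0), (2, 1)]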

Parameters:
  • context (ETLContext) – Shared ETL context.

  • file (Path) – CSV file to load.

  • model (Optional[Type[BaseModel]]) – Optional Pydantic model for row validation; invalid rows go to error_file.

  • error_file (Path | None) – Output for invalid rows. Required when model is set.

  • table_size (int) – Bucketing grid size for the splitter.

  • batch_size (int) – Per-cell target batch size from the splitter.

  • max_workers (int | None) – Worker threads per wave.

  • prefetch (int) – Number of waves to prefetch from the splitter.

  • **csv_reader_kwargs – Forwarded to etl_lib.data_source.CSVBatchSource.CSVBatchSource.

Returns:

TaskReturn with merged validation and Neo4j counters.

Notes
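
A usage sketch, assuming an already constructed ETLContext (not shown), a hypothetical RowModel Pydantic model matching the CSV columns, and that the public entry point inherited from the Task base class is named execute() — check the base class for the actual name:

    from pathlib import Path
    from pydantic import BaseModel
    # Import path taken from this module's page.
    from etl_lib.task.data_loading.ParallelCSVLoad2Neo4jTask import ParallelCSVLoad2Neo4jTask

    class RowModel(BaseModel):
        # Hypothetical model; field names must match the CSV header.
        start_id: str
        end_id: str
        weight: float

    context = ...  # an existing ETLContext; construction not shown here

    task = ParallelCSVLoad2Neo4jTask(
        context=context,
        file=Path("relationships.csv"),
        model=RowModel,                       # rows failing validation go to error_file
        error_file=Path("invalid_rows.csv"),  # required because model is set
        table_size=10,
        batch_size=5000,
        max_workers=8,
        prefetch=4,
    )
    result = task.execute()  # assumed entry point; returns a TaskReturn

Note that a larger table_size yields more, smaller cells; tune it together with batch_size so each cell still produces reasonably sized batches.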

__init__(context, file, model=None, error_file=None, table_size=10, batch_size=5000, max_workers=None, prefetch=4, **csv_reader_kwargs)[source]

Construct the task. Parameters are as documented on the class above.

run_internal()[source]

Hook where subclasses provide the logic to be performed.

The Task base class provides all the housekeeping and reporting, so implementations do not need to handle those concerns. Implementations should not catch exceptions; the base class handles them. A minimal subclass sketch follows at the end of this entry.

Parameters:

kwargs – keyword arguments forwarded to this method by the base class

Return type:

TaskReturn

Returns:

An instance of TaskReturn.
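
A minimal subclass sketch of this contract. The import path and the base-class constructor signature are assumptions, and the TaskReturn construction is left schematic because its constructor is not documented on this page:

    from etl_lib.core.Task import Task, TaskReturn  # assumed module path

    class RowCountTask(Task):
        def __init__(self, context, file):
            super().__init__(context)  # assumed base-class signature
            self.file = file

        def run_internal(self, **kwargs):
            # Only the task logic lives here; housekeeping, reporting,
            # and exception handling are done by the Task base class,
            # so nothing is caught or logged in this method.
            with open(self.file) as fh:
                rows = sum(1 for _ in fh)
            return TaskReturn()  # fill in counters as appropriate; constructor details not shown here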