etl_lib.task.data_loading.ParallelCSVLoad2Neo4jTask module
- class ParallelCSVLoad2Neo4jTask(context, file, model=None, error_file=None, table_size=10, batch_size=5000, max_workers=None, prefetch=4, **csv_reader_kwargs)[source]
Bases: Task
Parallel CSV → Neo4j load using the mix-and-batch strategy.
Wires a CSV reader, optional Pydantic validation, a diagonal splitter (to avoid overlapping node locks), and a Cypher sink. Rows are distributed into (row, col) partitions and processed in non-overlapping groups.
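The grouping idea can be illustrated with a minimal, self-contained sketch. This is not etl_lib's implementation: the names `bucket`, `partition`, and `diagonal_groups` and the modulo bucketing are assumptions made for illustration, and the sketch assumes start and end nodes are bucketed independently. Each relationship lands in a (row, col) cell, and cells are grouped by diagonal so that no two cells in a group share a row or a column.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Cell = Tuple[int, int]
Rel = Tuple[int, int]  # (start node id, end node id); integer ids are an assumption


def bucket(node_id: int, table_size: int) -> int:
    """Map a node id to a bucket index in [0, table_size). Hypothetical rule."""
    return node_id % table_size


def partition(rels: Iterable[Rel], table_size: int) -> Dict[Cell, List[Rel]]:
    """Place each relationship into the (row, col) cell of its start/end buckets."""
    cells: Dict[Cell, List[Rel]] = defaultdict(list)
    for start, end in rels:
        cell = (bucket(start, table_size), bucket(end, table_size))
        cells[cell].append((start, end))
    return dict(cells)


def diagonal_groups(table_size: int) -> List[List[Cell]]:
    """Return table_size groups of cells; group k holds (r, (r + k) % table_size).

    Within one group every row index and every column index occurs exactly once,
    so the cells of a group touch disjoint start buckets and disjoint end buckets.
    """
    return [
        [(r, (r + k) % table_size) for r in range(table_size)]
        for k in range(table_size)
    ]
```

In this sketch, the cells of one group could be handed to separate workers at the same time, with groups processed one after another; how the library actually schedules them is not specified here.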
- Parameters:
context (ETLContext) – Shared ETL context.
file (Path) – CSV file to load.
model (Optional[Type[BaseModel]]) – Optional Pydantic model for row validation; invalid rows go to error_file.
error_file (Path | None) – Output for invalid rows. Required when model is set.
table_size (int) – Bucketing grid size for the splitter.
batch_size (int) – Per-cell target batch size from the splitter.
prefetch (int) – Number of waves to prefetch from the splitter.
**csv_reader_kwargs – Forwarded to etl_lib.data_source.CSVBatchSource.CSVBatchSource.
- Returns:
TaskReturn with merged validation and Neo4j counters.
Notes
_query() must return Cypher that starts with UNWIND $batch AS row.
Override _id_extractor() if your CSV schema doesn't expose start/end; the default uses etl_lib.core.SplittingBatchProcessor.dict_id_extractor().
See the nyc-taxi example for a working subclass.
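The two notes above can be sketched as plain functions. Everything here is hypothetical: the labels, property names, and CSV columns (`person_id`, `company_id`) are invented, and the exact contract of _id_extractor() is not stated on this page, so the sketch simply assumes an extractor maps one row to a (row, col) grid coordinate. It only shows the required query prefix and how differently named columns might stand in for start/end.

```python
from typing import Any, Dict, Tuple


def example_query() -> str:
    """Hypothetical Cypher that begins with the required UNWIND prefix."""
    return (
        "UNWIND $batch AS row\n"
        "MATCH (p:Person {id: row.person_id})\n"
        "MATCH (c:Company {id: row.company_id})\n"
        "MERGE (p)-[:WORKS_AT]->(c)"
    )


def example_id_extractor(row: Dict[str, Any], table_size: int = 10) -> Tuple[int, int]:
    """Assumed shape of an extractor: derive (row, col) buckets from two id columns.

    The columns person_id/company_id replace the default start/end keys;
    the modulo bucketing is an illustrative choice, not the library's rule.
    """
    return int(row["person_id"]) % table_size, int(row["company_id"]) % table_size
```

A subclass would return such a query from _query() and supply comparable logic through _id_extractor(); the nyc-taxi example remains the authoritative reference for the real signatures.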
- __init__(context, file, model=None, error_file=None, table_size=10, batch_size=5000, max_workers=None, prefetch=4, **csv_reader_kwargs)[source]
Construct a Task object.
- Parameters:
context (ETLContext) – ETLContext instance. Will be available to subclasses.
file (Path)
error_file (Path | None)
table_size (int)
batch_size (int)
max_workers (int | None)
prefetch (int)
- run_internal()[source]
Place to provide the logic to be performed.
This base class handles all housekeeping and reporting, so implementations do not need to. Implementations should not catch exceptions; the base class handles them.
- Parameters:
kwargs – will be passed to run_internal
- Return type:
TaskReturn
- Returns:
An instance of TaskReturn.