Validation
==========

Input data can optionally be validated using the excellent `Pydantic <https://docs.pydantic.dev/>`_ package.

The :class:`~etl_lib.core.ValidationBatchProcessor.ValidationBatchProcessor` class wraps a Pydantic model class so that it can be used inside a ``BatchProcessor`` chain.

Each row in the incoming batch is validated against the supplied model. Rows that fail validation are written to a JSON file; each entry in that file retains the original input data together with all errors Pydantic reported for the row. The outgoing batch contains only the rows that passed validation.

The following class from the GTFS example demonstrates loading data from a CSV file into Neo4j:

.. code-block:: python

    from pathlib import Path

    from pydantic import BaseModel, Field

    # ETLContext and CSVLoad2Neo4jTask are provided by etl_lib; the exact
    # import paths depend on the package layout.


    class LoadAgenciesTask(CSVLoad2Neo4jTask):
        class Agency(BaseModel):
            # Aliases map the GTFS column names onto the shorter
            # property names used in the graph.
            id: str = Field(alias="agency_id", default="generic")
            name: str = Field(alias="agency_name")
            url: str = Field(alias="agency_url")
            timezone: str = Field(alias="agency_timezone")
            lang: str = Field(alias="agency_lang")

        def __init__(self, context: ETLContext, file: Path):
            super().__init__(context, LoadAgenciesTask.Agency, file)

        def _query(self):
            # Cypher statement executed once per batch of validated rows.
            return """
            UNWIND $batch AS row
            MERGE (a:Agency {id: row.id})
            SET a.name = row.name,
                a.url = row.url,
                a.timezone = row.timezone,
                a.lang = row.lang
            """

The ``class Agency(BaseModel)`` defines a simple Pydantic model for validation purposes.
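Because each field declares an ``alias``, the model accepts the original GTFS column names on input while exposing the field names referenced by the Cypher query on output. A minimal sketch of this mapping (the sample row values are made up for illustration):

.. code-block:: python

    from pydantic import BaseModel, Field


    class Agency(BaseModel):
        # Same alias pattern as in the task above, trimmed to two fields.
        id: str = Field(alias="agency_id", default="generic")
        name: str = Field(alias="agency_name")


    # Validation reads the GTFS column names (the aliases) ...
    agency = Agency.model_validate(
        {"agency_id": "BVG", "agency_name": "Berliner Verkehrsbetriebe"}
    )

    # ... while dumping yields the field names used as row.id and
    # row.name in the Cypher statement.
    assert agency.model_dump() == {"id": "BVG", "name": "Berliner Verkehrsbetriebe"}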
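For readers curious how the row filtering described above can work, the following sketch shows the general idea. It is illustrative only; the function name, batch shape, and error-file handling are assumptions, not the library's API:

.. code-block:: python

    import json
    from pathlib import Path

    from pydantic import BaseModel, ValidationError


    def validate_batch(
        model: type[BaseModel], batch: list[dict], error_file: Path
    ) -> list[dict]:
        """Return the rows that pass validation; log the rest to a JSON file."""
        valid, failed = [], []
        for row in batch:
            try:
                valid.append(model.model_validate(row).model_dump())
            except ValidationError as exc:
                # Keep the original row together with all Pydantic errors.
                failed.append({"row": row, "errors": exc.errors()})
        if failed:
            with error_file.open("a") as fh:
                for entry in failed:
                    fh.write(json.dumps(entry, default=str) + "\n")
        return valid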