configuration

BaseDatasetConfiguration

Bases: BaseModel

This is the class that should be imported into main.py files within dataset directories, and built upon with Dataset-specific parameters. Common examples include overwrite_download, overwrite_processing, and year_list.

Source code in data_manager/configuration.py
class BaseDatasetConfiguration(BaseModel):
    """
    This is the class that should be imported into
    `main.py` files within dataset directories, and
    built upon with Dataset-specific parameters.
    Common examples include `overwrite_download`,
    `overwrite_processing`, and `year_list`.
    """

    run: RunParameters
    """
    A `RunParameters` model that defines how this model should be run.
    This is passed into the `Dataset.run()` function.
    """

run: RunParameters instance-attribute

A RunParameters model that defines how this model should be run. This is passed into the Dataset.run() function.
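
For illustration only, a dataset's main.py might extend this class with dataset-specific parameters along these lines. The import path and the field names (overwrite_download, overwrite_processing, year_list) are assumptions based on the examples in the docstring, not a prescribed layout.

from typing import List

from data_manager.configuration import BaseDatasetConfiguration


class ExampleDatasetConfiguration(BaseDatasetConfiguration):
    # Dataset-specific parameters; the names follow the examples mentioned
    # in the class docstring and are purely illustrative.
    overwrite_download: bool = False
    overwrite_processing: bool = False
    year_list: List[int] = [2015, 2016, 2017]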

RunParameters

Bases: BaseModel

This is a pydantic BaseModel that represents the run parameters for a Dataset. This model is consumed by Dataset.run() as settings for how to run the Dataset.

Source code in data_manager/configuration.py
class RunParameters(BaseModel):
    """
    This is a pydantic BaseModel that represents the run
    parameters for a Dataset. This model is consumed by
    Dataset.run() as settings for how to run the Dataset.
    """

    backend: Literal["local", "mpi", "prefect"] = "prefect"
    task_runner: Literal[
        "concurrent",
        "dask",
        "hpc",
        "kubernetes",
        "sequential",
    ] = "concurrent"
    """
    The task runner to use when running the Dataset.
    The most common values are "sequential" and "concurrent".
    """
    run_parallel: bool = True
    """
    Whether or not to run the Dataset in parallel.
    """
    max_workers: Optional[int] = 4
    """
    Maximum number of concurrent tasks that may be run for this Dataset.
    This may be overridden when calling `Dataset.run_tasks()`
    """
    bypass_error_wrapper: bool = False
    """
    If set to `True`, exceptions will not be caught when running tasks, and will instead stop execution of the entire dataset.
    This can be helpful for quickly debugging a dataset, especially when it is running sequentially.
    """
    threads_per_worker: Optional[int] = 1
    """
    `threads_per_worker` passed through to the DaskCluster when using the dask task runner.
    """
    # cores_per_process: Optional[int] = None
    chunksize: int = 1
    """
    Sets the chunksize for pools created for concurrent or MPI task runners.
    """
    log_dir: str
    """
    Path to directory where logs for this Dataset run should be saved.
    This is the only run parameter without a default, so it must be set in a Dataset's configuration file.
    """
    logger_level: int = logging.INFO
    """
    Minimum log level to log.
    For more information, see the [relevant Python documentation](https://docs.python.org/3/library/logging.html#logging-levels).
    """
    retries: int = 3
    """
    Number of times to retry each task before giving up.
    This parameter can be overridden per task run when calling `Dataset.run_tasks()`
    """
    retry_delay: int = 5
    """
    Time in seconds to wait between task retries.
    This parameter can be overridden per task run when calling `Dataset.run_tasks()`
    """
    conda_env: str = "geodata38"
    """
    Conda environment to use when running the dataset.
    **Deprecated because we do not use this in the new Prefect/Kubernetes setup**
    """

bypass_error_wrapper: bool = False class-attribute instance-attribute

If set to True, exceptions will not be caught when running tasks, and will instead stop execution of the entire dataset. This can be helpful for quickly debugging a dataset, especially when it is running sequentially.

chunksize: int = 1 class-attribute instance-attribute

Sets the chunksize for pools created for concurrent or MPI task runners.

conda_env: str = 'geodata38' class-attribute instance-attribute

Conda environment to use when running the dataset. Deprecated because we do not use this in the new Prefect/Kubernetes setup

log_dir: str instance-attribute

Path to directory where logs for this Dataset run should be saved. This is the only run parameter without a default, so it must be set in a Dataset's configuration file.

logger_level: int = logging.INFO class-attribute instance-attribute

Minimum log level to log. For more information, see the relevant Python documentation (https://docs.python.org/3/library/logging.html#logging-levels).

max_workers: Optional[int] = 4 class-attribute instance-attribute

Maximum number of concurrent tasks that may be run for this Dataset. This may be overridden when calling Dataset.run_tasks()

retries: int = 3 class-attribute instance-attribute

Number of times to retry each task before giving up. This parameter can be overridden per task run when calling Dataset.run_tasks()

retry_delay: int = 5 class-attribute instance-attribute

Time in seconds to wait between task retries. This parameter can be overridden per task run when calling Dataset.run_tasks()

run_parallel: bool = True class-attribute instance-attribute

Whether or not to run the Dataset in parallel.

task_runner: Literal['concurrent', 'dask', 'hpc', 'kubernetes', 'sequential'] = 'concurrent' class-attribute instance-attribute

The task runner to use when running the Dataset. The most common values are "sequential" and "concurrent".

threads_per_worker: Optional[int] = 1 class-attribute instance-attribute

threads_per_worker passed through to the DaskCluster when using the dask task runner.
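
A minimal sketch of constructing RunParameters directly, assuming it is importable from data_manager.configuration: log_dir is the only field without a default, so it is the only value that must be supplied.

from data_manager.configuration import RunParameters

# Only log_dir lacks a default; everything else uses the defaults listed above.
params = RunParameters(log_dir="logs/example_dataset")
assert params.backend == "prefect"
assert params.task_runner == "concurrent"

# Typical overrides for debugging a dataset sequentially.
debug_params = RunParameters(
    log_dir="logs/example_dataset",
    task_runner="sequential",
    run_parallel=False,
    bypass_error_wrapper=True,
)

In practice these values usually come from a dataset's config.toml (presumably via a [run] table) through get_config() below, rather than being constructed by hand.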

get_config(model, config_path='config.toml')

Load the configuration for a Dataset.

This function reads a TOML configuration file (usually config.toml) out of the same directory as the main.py file, and returns a BaseDatasetConfiguration model filled in with the values from that configuration file.

Parameters:

model (BaseDatasetConfiguration, required)
    The model to load the configuration values into. This should nearly always be a Dataset-specific model defined in main.py that inherits BaseDatasetConfiguration.

config_path (Union[Path, str], default 'config.toml')
    The relative path to the TOML configuration file. It's unlikely this parameter should ever be changed from its default.
Source code in data_manager/configuration.py
def get_config(
    model: BaseDatasetConfiguration, config_path: Union[Path, str] = "config.toml"
):
    """
    Load the configuration for a Dataset.

    This function reads a TOML configuration
    file (usually `config.toml`) out of the
    same directory as the `main.py` file, and
    returns a `BaseDatasetConfiguration` model
    filled in with the values from that
    configuration file.

    Parameters:
        model: The model to load the configuration values into. This should nearly always be a Dataset-specific model defined in `main.py` that inherits `BaseDatasetConfiguration`.
        config_path: The relative path to the TOML configuration file. It's unlikely this parameter should ever be changed from its default.
    """
    config_path = Path(config_path)
    if config_path.exists():
        with open(config_path, "rb") as src:
            return model.model_validate(tomllib.load(src))
    else:
        raise FileNotFoundError("No TOML config file found for dataset.")
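
A hedged usage sketch, run from a dataset directory containing config.toml: the class passed in is the dataset-specific subclass (the illustrative ExampleDatasetConfiguration from above), and the TOML file is assumed to hold a [run] table for RunParameters alongside the dataset-specific keys.

from data_manager.configuration import get_config

# model_validate() is a classmethod, so the subclass itself can be passed;
# the relative default path means this should run from the dataset directory.
config = get_config(ExampleDatasetConfiguration)

print(config.run.log_dir)
print(config.overwrite_download)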