# configuration

## BaseDatasetConfiguration

Bases: `BaseModel`

This is the class that should be imported into `main.py` files within dataset directories and built upon with Dataset-specific parameters. Common examples are `overwrite_download`, `overwrite_processing`, or `year_list`.

Source code in `data_manager/configuration.py`
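For illustration, a minimal Dataset-specific subclass might look like the sketch below. The import path follows the source location shown above; the class name, extra fields, and default values are hypothetical examples, not part of the library.

```python
from typing import List

from data_manager.configuration import BaseDatasetConfiguration


class MyDatasetConfiguration(BaseDatasetConfiguration):
    # Dataset-specific parameters layered on top of the common run settings.
    # These names mirror the common examples mentioned above.
    overwrite_download: bool = False
    overwrite_processing: bool = False
    year_list: List[int] = [2000, 2010, 2020]  # placeholder years
```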
### `run: RunParameters` *instance-attribute*

A `RunParameters` model that defines how this Dataset should be run. This is passed into the `Dataset.run()` function.
## RunParameters

Bases: `BaseModel`

This is a pydantic `BaseModel` that represents the run parameters for a Dataset. This model is consumed by `Dataset.run()` as settings for how to run the Dataset.

Source code in `data_manager/configuration.py`
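As a sketch (assuming the import path shown above), a `RunParameters` instance can be constructed directly with keyword arguments; `log_dir` is the only field without a default, as noted below. The path value here is a hypothetical placeholder.

```python
from data_manager.configuration import RunParameters

params = RunParameters(
    log_dir="logs/my_dataset",  # required: log_dir has no default
    task_runner="sequential",   # override the 'concurrent' default
    retries=1,                  # override the default of 3
)
print(params.max_workers)  # untouched defaults remain, e.g. 4
```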
### `bypass_error_wrapper: bool = False` *class-attribute instance-attribute*

If set to `True`, exceptions will not be caught when running tasks, and will instead stop execution of the entire dataset. This can be helpful for quickly debugging a dataset, especially when it is running sequentially.
### `chunksize: int = 1` *class-attribute instance-attribute*

Sets the chunksize for pools created for concurrent or MPI task runners.
### `conda_env: str = 'geodata38'` *class-attribute instance-attribute*

Conda environment to use when running the dataset. Deprecated: this is no longer used in the new Prefect/Kubernetes setup.
### `log_dir: str` *instance-attribute*

Path to the directory where logs for this Dataset run should be saved. This is the only run parameter without a default, so it must be set in a Dataset's configuration file.
### `logger_level: int = logging.INFO` *class-attribute instance-attribute*

Minimum log level to log. For more information, see the Python `logging` module documentation.
### `max_workers: Optional[int] = 4` *class-attribute instance-attribute*

Maximum number of concurrent tasks that may be run for this Dataset. This may be overridden when calling `Dataset.run_tasks()`.
### `retries: int = 3` *class-attribute instance-attribute*

Number of times to retry each task before giving up. This parameter can be overridden per task run when calling `Dataset.run_tasks()`.
### `retry_delay: int = 5` *class-attribute instance-attribute*

Time in seconds to wait between task retries. This parameter can be overridden per task run when calling `Dataset.run_tasks()`.
### `run_parallel: bool = True` *class-attribute instance-attribute*

Whether or not to run the Dataset in parallel.
### `task_runner: Literal['concurrent', 'dask', 'hpc', 'kubernetes', 'sequential'] = 'concurrent'` *class-attribute instance-attribute*

The backend to run the dataset on. The most common values are `"sequential"` and `"concurrent"`.
### `threads_per_worker: Optional[int] = 1` *class-attribute instance-attribute*

`threads_per_worker` value passed through to the `DaskCluster` when using the dask task runner.
## `get_config(model, config_path='config.toml')`

Load the configuration for a Dataset.

This function reads a TOML configuration file (usually `config.toml`) out of the same directory as the `main.py` file, and returns a `BaseDatasetConfiguration` model filled in with the values from that configuration file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `BaseDatasetConfiguration` | The model to load the configuration values into. This should nearly always be a Dataset-specific model defined in `main.py`. | *required* |
| `config_path` | `Union[Path, str]` | The relative path to the TOML configuration file. It's unlikely this parameter should ever be changed from its default. | `'config.toml'` |
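To tie the pieces together, here is a sketch of an end-to-end setup. The TOML layout is an assumption (fields mirroring the model, with a `[run]` table supplying the nested `RunParameters`), `MyDatasetConfiguration` is a hypothetical subclass like the one sketched earlier, and the sketch assumes `get_config` is passed the model class itself.

```toml
# config.toml (sketch -- layout assumed to mirror the model's fields)
overwrite_download = false

[run]
log_dir = "logs/my_dataset"
task_runner = "concurrent"
```

```python
# main.py (sketch)
from data_manager.configuration import BaseDatasetConfiguration, get_config


class MyDatasetConfiguration(BaseDatasetConfiguration):
    overwrite_download: bool = False  # hypothetical Dataset-specific parameter


# Reads config.toml from the directory containing main.py and fills in
# a MyDatasetConfiguration model with its values.
config = get_config(MyDatasetConfiguration)
print(config.run.log_dir)
```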