configuration
BaseDatasetConfiguration
Bases: BaseModel
This is the class that should be imported into main.py files within dataset directories, and built upon with Dataset-specific parameters.
Common examples are overwrite_download, overwrite_processing, or year_list.
Source code in data_manager/configuration.py
run
instance-attribute
A RunParameters model that defines how this Dataset should be run.
This is passed into the Dataset.run() function.
RunParameters
Bases: BaseModel
This is a pydantic BaseModel that represents the run parameters for a Dataset. This model is consumed by Dataset.run() as settings for how to run the Dataset.
bypass_error_wrapper = False
class-attribute
instance-attribute
If set to True, exceptions will not be caught when running tasks, and will instead stop execution of the entire dataset.
This can be helpful for quickly debugging a dataset, especially when it is running sequentially.
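The difference between the two settings can be pictured with a minimal sketch. The run_task helper and its ("ok"/"error", value) return shape are hypothetical stand-ins, not the data_manager implementation; they only contrast catching an exception with letting it propagate.

```python
def run_task(func, *args, bypass_error_wrapper=False):
    """Run func(*args). Hypothetical helper, not part of data_manager."""
    if bypass_error_wrapper:
        # No wrapper: any exception propagates and stops the whole run.
        return ("ok", func(*args))
    try:
        return ("ok", func(*args))
    except Exception as exc:
        # Wrapper active: record the failure so other tasks can continue.
        return ("error", exc)


def flaky(x):
    # Made-up task that fails on negative input.
    if x < 0:
        raise ValueError("negative input")
    return x * 2


# With the wrapper in place, the failing task does not stop the others.
results = [run_task(flaky, x) for x in (1, -1, 3)]
```

Calling run_task(flaky, -1, bypass_error_wrapper=True) in this sketch raises the ValueError directly, which is the behavior that makes a traceback easy to inspect while debugging.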
chunksize = 1
class-attribute
instance-attribute
Sets the chunksize for pools created for concurrent or MPI task runners.
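As a rough illustration of what a chunk size means for a worker pool, the sketch below uses the standard library's multiprocessing.pool.ThreadPool, whose map() accepts a chunksize argument. Whether data_manager's task runners use this particular pool is an assumption, and the names square and items are made up for the example.

```python
from multiprocessing.pool import ThreadPool


def square(x):
    return x * x


items = list(range(10))

# chunksize=1 hands items to workers one at a time (finest-grained scheduling).
# A larger value sends batches, which reduces per-task dispatch overhead.
with ThreadPool(processes=4) as pool:
    one_by_one = pool.map(square, items, chunksize=1)
    batched = pool.map(square, items, chunksize=5)
```

Both calls return the same results in input order; the chunk size only changes how the work is divided among workers.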
conda_env = 'geodata38'
class-attribute
instance-attribute
Conda environment to use when running the dataset. Deprecated: this parameter is not used in the new Prefect/Kubernetes setup.
log_dir
instance-attribute
Path to directory where logs for this Dataset run should be saved. This is the only run parameter without a default, so it must be set in a Dataset's configuration file.
logger_level = logging.INFO
class-attribute
instance-attribute
Minimum level a message must have to be logged. For more information, see the description of logging levels in the Python logging documentation.
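A short sketch of how a minimum level filters messages, using only the standard logging module. The logger name example_dataset is hypothetical, and how data_manager applies logger_level internally is not shown here.

```python
import logging

logger = logging.getLogger("example_dataset")  # hypothetical logger name
logger.setLevel(logging.INFO)

# Levels are integers: DEBUG=10, INFO=20, WARNING=30, ERROR=40, CRITICAL=50.
# Messages below the configured level are dropped; the rest are kept.
debug_enabled = logger.isEnabledFor(logging.DEBUG)
warning_enabled = logger.isEnabledFor(logging.WARNING)
```

With the level set to logging.INFO, debug_enabled is False and warning_enabled is True.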
max_workers = 4
class-attribute
instance-attribute
Maximum number of concurrent tasks that may be run for this Dataset.
This may be overridden when calling Dataset.run_tasks().
retries = 3
class-attribute
instance-attribute
Number of times to retry each task before giving up.
This parameter can be overridden per task run when calling Dataset.run_tasks().
retry_delay = 5
class-attribute
instance-attribute
Time in seconds to wait between task retries.
This parameter can be overridden per task run when calling Dataset.run_tasks().
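How retries and retry_delay interact can be sketched as a plain retry loop. The run_with_retries helper is hypothetical and is not how data_manager schedules tasks; it also assumes that retries counts additional attempts after the first failure, so retries = 3 allows up to four attempts in total.

```python
import time


def run_with_retries(func, retries=3, retry_delay=5, sleep=time.sleep):
    """Call func(), retrying up to `retries` times after the first failure.

    Returns (result, number_of_attempts). Hypothetical helper.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return func(), attempts
        except Exception:
            if attempts > retries:
                raise  # retries exhausted: give up and re-raise
            sleep(retry_delay)  # wait before the next attempt


calls = {"n": 0}


def fails_twice():
    # Made-up task that succeeds on its third call.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "done"


# Record the delays instead of actually sleeping, to keep the example fast.
delays = []
result, attempts = run_with_retries(fails_twice, sleep=delays.append)
```

Here the task succeeds on the third attempt, after two waits of retry_delay seconds each.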
run_parallel = True
class-attribute
instance-attribute
Whether or not to run the Dataset in parallel.
task_runner = 'concurrent'
class-attribute
instance-attribute
The backend to run the dataset on. The most common values are "sequential" and "concurrent".
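The two most common backends can be contrasted with a small dispatcher. This is a sketch under assumptions: run_all is a hypothetical function, the "concurrent" branch is modeled with the standard library's ThreadPoolExecutor, and the other runners mentioned on this page (dask, MPI) are left out.

```python
from concurrent.futures import ThreadPoolExecutor


def run_all(func, inputs, task_runner="concurrent", max_workers=4):
    """Hypothetical dispatcher, not the data_manager implementation."""
    if task_runner == "sequential":
        # One task at a time, in order; easiest to debug.
        return [func(x) for x in inputs]
    if task_runner == "concurrent":
        # At most max_workers tasks in flight; map() keeps input order.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(func, inputs))
    raise ValueError(f"unknown task_runner: {task_runner!r}")


def double(x):
    return 2 * x


sequential = run_all(double, range(5), task_runner="sequential")
concurrent = run_all(double, range(5), task_runner="concurrent")
```

Both calls produce the same list; only the scheduling differs.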
threads_per_worker = 1
class-attribute
instance-attribute
The threads_per_worker value passed through to the DaskCluster when using the dask task runner.
get_config(model, config_path='config.toml')
Load the configuration for a Dataset.
This function reads a TOML configuration file (usually config.toml) out of the same directory as the main.py file, and returns a BaseDatasetConfiguration model filled in with the values from that configuration file.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
model | BaseDatasetConfiguration | The model to load the configuration values into. This should nearly always be a Dataset-specific model defined in | required |
config_path | Union[Path, str] | The relative path to the TOML configuration file. It's unlikely this parameter should ever be changed from its default. | 'config.toml' |