Overview of the Dataset class¶
The idea behind the Dataset class is that it represents the complete logic of a dataset import, providing "a means of bundling data and functionality together".
Once you have determined what you want your script to accomplish, this class provides a framework that:
- Organizes groups of tasks into "task runs," standardizing their outputs and logging their progress,
- Provides convenience functions to help manage common tasks in safer ways, and
- Takes care of running the pipeline on our backend infrastructure.
The Dataset class is provided by a Python package called data_manager, stored in the /data_manager directory in the geo-datasets repository.
By updating the data_manager package, we can update the behavior of all pipelines at once.
Each dataset (in /datasets) can choose to use any version of data_manager using a configuration parameter (more on that later).
Dataset Class Functions¶
This is a high-level overview of the functions within the Dataset class.
For technical details, please refer to the corresponding reference page.
Required Functions¶
main()¶
When a Dataset is run, Dataset.main() gets called.
main() defines the game plan for a dataset run, describing the order of each set of tasks.
To do this, main() contains function calls wrapped with self.run_tasks() to manage groups of tasks.
It might be helpful to read the main() function in the script template below to see how this works.
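To make the calling pattern concrete, here is a minimal, self-contained sketch of a main() that sequences two groups of tasks. The ToyDataset class and its run_tasks() are hypothetical stand-ins written for illustration. They only mimic the shape of the calls and are not the data_manager implementation, which also standardizes outputs and logs progress.

```python
from typing import Any, Callable, Iterable, List


class ToyDataset:
    """Hypothetical stand-in for Dataset; NOT the data_manager implementation."""

    def run_tasks(
        self, func: Callable[..., Any], inputs: Iterable[Iterable[Any]]
    ) -> List[Any]:
        # Assumption: each element of `inputs` is the argument list for one task.
        # This stand-in simply runs the tasks one after another.
        return [func(*args) for args in inputs]

    def download(self, year: int) -> str:
        # Placeholder for real download logic.
        return f"raw_{year}.zip"

    def process(self, raw_name: str, out_name: str) -> str:
        # Placeholder for real processing logic.
        return f"{raw_name} -> {out_name}"

    def main(self) -> List[str]:
        years = [2019, 2020]
        # First group of tasks: one download per year.
        downloaded = self.run_tasks(self.download, [[y] for y in years])
        # Second group of tasks: one processing step per downloaded file.
        outputs = [f"out_{y}.tif" for y in years]
        return self.run_tasks(self.process, zip(downloaded, outputs))


results = ToyDataset().main()
print(results)
```

The point of the sketch is the ordering: main() finishes one group of tasks before handing its results to the next group.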
Provided Functions¶
run_tasks()¶
todo
tmp_to_dst_file()¶
todo
Adding Your Own Functions¶
When writing a Dataset, you will need to add your own functions to power it.
For example, most pipelines will include functions to download units of data.
This is illustrated in the template code below.
The BaseDatasetConfiguration Model¶
Info
In pydantic lingo, a "model" is a class that inherits pydantic.BaseModel and includes internal type-checking logic.
Check out the pydantic documentation for more information.
BaseDatasetConfiguration is a pydantic model that represents the configuration parameters for running a dataset.
As well as defining a class that inherits Dataset, you should also define a configuration class that inherits BaseDatasetConfiguration.
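As a rough illustration of what "internal type-checking logic" means in practice, the sketch below defines a small pydantic model and feeds it a bad value. ToyConfiguration and its fields are hypothetical, and the class inherits pydantic.BaseModel directly so the example runs without data_manager installed. A real dataset would inherit BaseDatasetConfiguration instead.

```python
from typing import List

from pydantic import BaseModel, ValidationError


class ToyConfiguration(BaseModel):
    # Hypothetical fields, chosen only for illustration.
    raw_dir: str
    years: List[int]


# Well-typed data loads without complaint.
good = ToyConfiguration(raw_dir="/tmp/raw", years=[2019, 2020])

# A value that cannot be interpreted as an integer is rejected at load time.
try:
    ToyConfiguration(raw_dir="/tmp/raw", years=["not a year"])
    rejected = False
except ValidationError:
    rejected = True

print(good.years, rejected)
```

Because the check happens when the model is created, a malformed configuration fails before any task starts running.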
The run Parameter¶
BaseDatasetConfiguration comes with one built-in parameter, called run.
run defines the options for how the computer should run the dataset, such as whether the tasks should be run sequentially or in parallel.
The config file (see below) can override any of the default run parameters in the [run] table.
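The idea of defaults being overridden by the [run] table can be sketched with a plain dataclass. ToyRunConfig is a hypothetical stand-in, not the data_manager run model. The field names log_dir and max_workers are taken from the template config file, but the default value shown for max_workers is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyRunConfig:
    # Hypothetical stand-in for the run options; NOT the data_manager model.
    log_dir: str          # required: no default
    max_workers: int = 1  # assumed default, for illustration only


# A [run] table that sets only the required key keeps every default.
only_required = ToyRunConfig(**{"log_dir": "/tmp/logs"})

# A [run] table that also sets max_workers overrides that default.
overridden = ToyRunConfig(**{"log_dir": "/tmp/logs", "max_workers": 4})

print(only_required.max_workers, overridden.max_workers)
```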
Main Script Template¶
import os
from pathlib import Path
from typing import List

from data_manager import BaseDatasetConfiguration, Dataset, get_config  # (1)!


class ExampleDatasetConfiguration(BaseDatasetConfiguration):  # (2)!
    raw_dir: str
    output_dir: str
    years: List[int]  # (3)!
    overwrite_download: bool
    overwrite_processing: bool


class ExampleDataset(Dataset):  # (4)!
    name = "Official Name of Example Dataset"  # (5)!

    def __init__(self, config: ExampleDatasetConfiguration):  # (6)!
        self.raw_dir = Path(config.raw_dir)
        self.output_dir = Path(config.output_dir)  # (7)!
        self.years = config.years
        self.overwrite_download = config.overwrite_download  # (8)!
        self.overwrite_processing = config.overwrite_processing

    def download(self, year):  # (9)!
        logger = self.get_logger()
        # Logic to download a year's worth of data
        # (output_file_path is set by the download logic above)
        return output_file_path

    def process(self, input_path, output_path):
        logger = self.get_logger()
        if self.overwrite_download and not self.overwrite_processing:
            logger.warning("Overwrite download set but not overwrite processing.")  # (10)!
        if output_path.exists() and not self.overwrite_processing:
            logger.info(f"Processed layer exists: {output_path}")
        else:
            logger.info(f"Processing: {input_path}")
            tmp_input_path = self.process_dir / Path(input_path).name
        return

    def main(self):
        logger = self.get_logger()

        os.makedirs(self.raw_dir / "compressed", exist_ok=True)
        os.makedirs(self.raw_dir / "uncompressed", exist_ok=True)

        # Download data
        logger.info("Running data download")
        download = self.run_tasks(self.download, [[y] for y in self.years])
        self.log_run(download)

        os.makedirs(self.output_dir, exist_ok=True)

        # Process data
        logger.info("Running processing")
        process_inputs = zip(
            download.results(),
            [self.output_dir / f"esa_lc_{year}.tif" for year in self.years],
        )
        process = self.run_tasks(self.process, process_inputs)
        self.log_run(process)


# ---- BEGIN BOILERPLATE ---- (11)
try:
    from prefect import flow
except ImportError:
    pass
else:
    @flow
    def name_of_dataset(config: ExampleDatasetConfiguration):
        ExampleDataset(config).run(config.run)

if __name__ == "__main__":
    config = get_config(ExampleDatasetConfiguration)
    ExampleDataset(config).run(config.run)
- This import is explained in full in the Adding Boilerplate section.
- This is the configuration pydantic model, inherited from BaseDatasetConfiguration. See configuration for more information.
- Since pydantic type checks when data is loaded into a model, this type hint enforces the content of the config file config.toml. If the type is List[int], the TOML representation of this parameter will have to look something like: years = [ 2018, 2019, 2020 ]
- Here is the main Dataset definition. Note that each of its attributes and methods is indented below it. Also, the Python community has decided to name classes using the CapWords convention.
- This str attribute of the Dataset class should be set to the full proper name of the dataset, for convenient reference. In the Prefect UI, deployed pipelines will be labeled with this name.
- The __init__() function is called when a class is first instantiated. This function sets all of the variables within Dataset (stored as attributes of self) for future reference by the other methods within Dataset.
- pathlib.Path makes working with file paths so much nicer. More on that here.
- All these "self.XXX = config.XXX" lines could be replaced with a single self.config = config statement. Then, other methods could reference self.config.overwrite_download, for example. Your call as to what feels cleaner / more ergonomic.
- Here is the first custom method in this example. When this Dataset class is run, the main() method will call this download() method for each year it wants to download.
- Here is a nice example of the logger in use. As long as you add the line logger = self.get_logger() at the top of any Dataset method, you can call it to automatically log pipeline events. logger supports the levels debug, info, warning, error, and critical.
- Explained in detail in the Adding Boilerplate section.
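The alternative mentioned in the notes above, storing the whole configuration object rather than copying each field, might look roughly like this. ToyConfig and ConfigHoldingDataset are hypothetical stand-ins so the sketch runs on its own. They do not inherit from the real data_manager classes.

```python
from dataclasses import dataclass


@dataclass
class ToyConfig:
    # Hypothetical stand-in for a configuration model.
    raw_dir: str
    overwrite_download: bool


class ConfigHoldingDataset:
    # Hypothetical stand-in for a Dataset subclass.
    def __init__(self, config: ToyConfig):
        # Keep the whole configuration object instead of copying each field.
        self.config = config

    def should_download(self, already_exists: bool) -> bool:
        # Methods read settings through self.config when they need them.
        return self.config.overwrite_download or not already_exists


dataset = ConfigHoldingDataset(ToyConfig(raw_dir="/tmp/raw", overwrite_download=False))
print(dataset.should_download(already_exists=True))
```

The trade-off is mostly stylistic: a single self.config keeps __init__() short, while individual attributes let __init__() convert values once, for example wrapping a string in Path.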
Configuration¶
In addition to main.py, we store configuration values in a separate TOML file, config.toml.
How the Config File is Loaded¶
...
Template Config File¶
# top-level key/value pairs load into dataset configuration (1)
raw_dir = "/sciclone/aiddata10/REU/geo/raw/esa_landcover"
years = [ 2018, 2019, 2020 ]
overwrite_download = false
api_key = "f6d4343e-0639-45e1-b865-84bae3cce4ee"

[run]  # (2)!
max_workers = 4
log_dir = "/sciclone/aiddata10/REU/geo/raw/example_dataset/logs"  # (3)

[repo]  # (4)!
url = "https://github.com/aiddata/geo-datasets.git"
branch = "master"
directory = "datasets/example_dataset"  # (5)!

[deploy]  # (6)!
deployment_name = "example_dataset"
image_tag = "05dea6e"  # (7)!
version = 1
flow_file_name = "main"
flow_name = "example_dataset"
work_pool = "geodata-pool"
data_manager_version = "0.4.0"  # (8)!
- As this comment implies, the top-level key/value pairs (those not within a [table] as seen below) are loaded into a BaseDatasetConfiguration model as defined in main.py.
- These...
- This is the one required parameter in the run table. log_dir instructs the Dataset where to save log files for each run.
- The repo table instructs the deployment where to find the dataset once it's been pushed to the geo-datasets repository on GitHub. This table should generally be left as-is, replacing "example_dataset" with the name of your dataset as appropriate.
- This refers to the path to the dataset directory relative to the root of the repository.
- The deploy table provides the deployment script with settings and metadata for the Prefect deployment.
- The OCI image tag for the container to run this deployment in. See the deployment guide for more information.
- The data_manager package is versioned using git tags, pushed to the geo-datasets repository on GitHub. This string specifies which tag to pull from GitHub and install when the container spins up.