Setting Up Your Environment
Before developing a dataset pipeline, you'll need a development environment with the appropriate packages installed.
Note
You will need to use a command-line interface to work with conda and other Python-related tools. A great resource for getting comfortable with the command line is the MIT Missing Semester course, which is available for free online. If you use Microsoft Windows, please consider installing Windows Subsystem for Linux.
Environment Management System
When developing on your local machine, you'll likely need a system for keeping the environments of different development projects separate. While this setup is entirely up to you, we've found success using conda. Another option is mamba, a faster alternative to conda that is fully compatible with it.
Whichever tool you choose, follow its installation instructions before proceeding.
Clone the geo-datasets Repository
- Make sure git is installed.
- cd to the directory you'd like to clone geo-datasets into. This can be ~/Documents, for example.
- Run git clone git@github.com:aiddata/geo-datasets.git.
- cd into geo-datasets.
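The steps above can be sketched as a single shell function, so that a failure at any step stops the rest. The function name clone_geo_datasets is hypothetical, the default parent directory follows the ~/Documents example, and the HTTPS URL mentioned in the comment is an assumption based on GitHub's standard URL scheme.

```shell
# Hypothetical helper: clone geo-datasets into a parent directory and enter it.
clone_geo_datasets() {
    # Parent directory to clone into; defaults to ~/Documents as in the example.
    parent_dir="${1:-$HOME/Documents}"

    # Stop early if git is not installed.
    if ! command -v git >/dev/null 2>&1; then
        echo "git is not installed" >&2
        return 1
    fi

    cd "$parent_dir" || return 1

    # SSH URL from the instructions above. If you have not set up SSH keys,
    # https://github.com/aiddata/geo-datasets.git should work instead (assumption).
    git clone git@github.com:aiddata/geo-datasets.git || return 1

    cd geo-datasets
}
```

After defining the function in your shell, call it with the directory of your choice, for example clone_geo_datasets ~/Documents. Because it runs in your current shell, you end up inside the geo-datasets directory when it succeeds.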
Install Dependencies
Note
This section assumes you are using conda (or mamba). If you are using some other environment management system, you'll have to adapt these instructions accordingly.
- Create an environment for geo-datasets. We usually name the environment "geodataXXX", replacing the "XXX" with the version of Python we are currently using. At the time of writing, that was 3.11.
- Activate your new environment.
- Change directory to the kubernetes/containers/job-runner subdirectory of the geo-datasets repository.
- Install the Python packages used by the latest job runner.
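As a rough sketch, the four steps above might look like the following. It assumes conda is on your PATH, that you start from the root of the geo-datasets repository, and that "XXX" is the Python version with the dot removed (so 3.11 gives geodata311). The function name setup_geodata_env is hypothetical, and the requirements.txt file name is an assumption: check the job-runner directory for the file that actually lists its packages.

```shell
# Python version to target; 3.11 matches the text above.
PYTHON_VERSION="3.11"

# Build the environment name by dropping the dot: 3.11 -> geodata311.
# ASSUMPTION: this is how "XXX" in "geodataXXX" is meant to be filled in.
ENV_NAME="geodata$(printf '%s' "$PYTHON_VERSION" | tr -d '.')"

# Hypothetical helper that runs the four steps in order and stops on failure.
setup_geodata_env() {
    # 1. Create the environment with the chosen Python version.
    conda create --name "$ENV_NAME" "python=$PYTHON_VERSION" || return 1

    # 2. Activate it (requires a shell initialized with `conda init`).
    conda activate "$ENV_NAME" || return 1

    # 3. Move to the job runner directory, relative to the repository root.
    cd kubernetes/containers/job-runner || return 1

    # 4. Install the job runner's Python packages.
    # ASSUMPTION: the packages are listed in a file named requirements.txt.
    pip install -r requirements.txt
}
```

Define the variables and function in your shell, then run setup_geodata_env from the repository root. You can also type the four commands one at a time, which makes it easier to see where a step fails.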