Installation¶
This section covers how to set up an environment for building your data lake with the AWS CDK, and how to install the oedi package and use oedi commands to run crawlers and test SQL queries.
The easiest way to set up the environment is with Docker, but you can also set it up step by step in your local environment.
Please refer to the oedi S3 viewer for information about what data sets are currently available.
Docker Environment¶
First, you will need to install and configure Docker. To do this, please refer to Docker’s documentation for your specific machine. Once you have Docker installed, there are two ways to obtain the Docker image of the oedi tools: either pull it from Docker Hub, or build it from the source code.
Pull Docker Image from Docker Hub¶
The simplest way to obtain the Docker image is to pull it directly from our Docker Hub repo. To do anything with Docker, you will first need to get an instance of the Docker daemon running. If you installed Docker Desktop, then you just need to open the app, and the daemon will start automatically. Next, open a command line and run:
$ docker pull openenergydatainitiative/oedi
If you are using Docker Desktop, you should now see the image under the Images tab. Alternatively, you can run docker images in the terminal to see a list of images.
Note
The deployment package of the AWS data lake was migrated from cdk1 to cdk2. The last version that supports cdk1 is v0.1.6. Starting with v0.2.0, the AWS data lake deployment uses cdk2. Because cdk2 does not include the experimental L2/L3 constructs that this package used up to and including v0.1.6, the migration caused a compatibility issue related to Glue databases. If you have already deployed the data lake, please destroy it before re-deploying with a newer version.
Build Docker Image from Source Code¶
If you’re having trouble with Docker Hub, you can have Docker build the image from a clone of our repo. Get a copy of the source code from our public GitHub repository, open-data-access-tools:
$ git clone git@github.com:openEDI/open-data-access-tools.git
In the terminal, navigate to the directory where you saved the source code, open-data-access-tools, and build the Docker image using the build command:
$ cd <path to open-data-access-tools folder>
$ docker build -t openenergydatainitiative/oedi .
If you are using Docker Desktop, you should now see the image under the Images tab. Alternatively, you can run docker images in the terminal to see a list of images.
Run OEDI Docker Container¶
To use these tools, you’ll need an AWS account and must provide your AWS credentials.
The AWS credentials can be passed to the docker run command in several ways. Three are shown here; you can use any of them.
Attach the .aws directory with the --volume / -v flag
$ docker run --rm -it \
-v <path to credentials>:/root/.aws \
openenergydatainitiative/oedi bash
Pass AWS environment variables with the --env / -e flag
$ docker run --rm -it \
-e AWS_ACCESS_KEY_ID=<YOUR KEY ID> \
-e AWS_SECRET_ACCESS_KEY=<YOUR SECRET KEY> \
-e AWS_DEFAULT_REGION=<AWS REGION> \
openenergydatainitiative/oedi bash
Pass AWS environment variables with the --env-file flag
Create a text file, for example named credentials.txt, and save the AWS credentials in it:
AWS_ACCESS_KEY_ID=<YOUR KEY ID>
AWS_SECRET_ACCESS_KEY=<YOUR SECRET KEY>
AWS_DEFAULT_REGION=<AWS REGION>
Then run the Docker container like this:
$ docker run --rm -it \
--env-file credentials.txt \
openenergydatainitiative/oedi bash
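Because the --env-file option stores a secret key in plain text, it can help to create the file with owner-only permissions. The following is a minimal sketch, not part of the oedi tools: the helper name write_env_file is hypothetical, and the values in the usage comment are placeholders. Values are written without surrounding quotes, one VAR=value pair per line.

```shell
# write_env_file PATH KEY_ID SECRET REGION
# Hypothetical helper (not part of oedi): writes the three AWS variables
# to PATH in the VAR=value form used by `docker run --env-file`.
write_env_file() {
  (
    # Restrict permissions for files created in this subshell to the owner.
    # This only affects a newly created file; an existing file keeps its mode.
    umask 077
    printf 'AWS_ACCESS_KEY_ID=%s\nAWS_SECRET_ACCESS_KEY=%s\nAWS_DEFAULT_REGION=%s\n' \
      "$2" "$3" "$4" > "$1"
  )
}

# Example usage (placeholders, replace with your own values):
#   write_env_file credentials.txt "<YOUR KEY ID>" "<YOUR SECRET KEY>" "<AWS REGION>"
#   docker run --rm -it --env-file credentials.txt openenergydatainitiative/oedi bash
```

Keep the resulting file out of version control, since it contains your secret key.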
You are now inside an oedi container environment and can build and use your OEDI data lake!
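If you passed credentials with -e or --env-file, a quick check inside the container can confirm that the variables arrived. The sketch below is an illustration only: missing_aws_vars is a hypothetical helper, not an oedi command. It does not apply to the -v option, where credentials live in /root/.aws rather than in environment variables.

```shell
# missing_aws_vars: print the name of each expected AWS variable
# that is unset or empty in the current shell (hypothetical helper).
missing_aws_vars() {
  for name in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_DEFAULT_REGION; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    [ -n "$value" ] || printf '%s\n' "$name"
  done
}

# Example usage:
#   missing=$(missing_aws_vars)
#   [ -z "$missing" ] || echo "Not set: $missing" >&2
```

The function prints nothing when all three variables are present, so an empty result means the container received the credentials.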
Local Environment¶
If you want to set up the environment directly on your computer, please follow the steps below.
1. Get a copy of the source code from our public GitHub repository, open-data-access-tools:
$ git clone git@github.com:openEDI/open-data-access-tools.git
2. Install Node.js (>=10.3.0) and npm on your computer. The cdk command-line tool and the AWS Construct Library are developed in TypeScript and run on Node.js, and the bindings for Python use this backend and toolset as well.
3. Create a virtual Python environment for the project.
It’s recommended to create a virtual environment for a Python project. There are many tools and tutorials online for this, such as virtualenv, virtualenv with virtualenvwrapper, pipenv, and conda. You can choose based on your own preference. Here, we use virtualenv with virtualenvwrapper as an example.
# Make virtual environment
$ mkvirtualenv -p python3 oedi
# Activate virtual environment
$ workon oedi
# Deactivate virtual environment
(oedi) $ deactivate
4. Make sure your oedi virtual environment is activated, then go to the root directory of open-data-access-tools and install this package in editable mode.
$ workon oedi
(oedi) $ cd open-data-access-tools
(oedi) $ pip install -e .
5. Change the working directory to the one that contains the AWS CDK app.
(oedi) $ cd oedi/AWS
(oedi) $ pwd
~/open-data-access-tools/oedi/AWS
Now you are in the oedi local environment and can build and use your OEDI data lake.
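The prerequisites from the steps above (Node.js >=10.3.0 and an activated virtual environment) can be checked with a small preflight sketch. The function names version_ge and in_virtualenv are hypothetical helpers, not part of oedi. The virtual environment check assumes a tool that exports VIRTUAL_ENV on activation, as virtualenv and virtualenvwrapper do; conda uses different variables, so the check would need adapting there.

```shell
# version_ge HAVE WANT: succeed if dotted version HAVE >= WANT.
# A leading "v" is stripped, since `node --version` prints e.g. v18.17.0.
version_ge() {
  awk -v have="${1#v}" -v want="${2#v}" 'BEGIN {
    split(have, h, "."); split(want, w, ".")
    # Compare major, minor, patch numerically; missing fields count as 0.
    for (i = 1; i <= 3; i++) {
      if (h[i] + 0 > w[i] + 0) exit 0
      if (h[i] + 0 < w[i] + 0) exit 1
    }
    exit 0
  }'
}

# in_virtualenv: succeed if VIRTUAL_ENV is set and non-empty.
in_virtualenv() {
  [ -n "${VIRTUAL_ENV:-}" ]
}

# Example usage (assumes node is on PATH):
#   version_ge "$(node --version)" 10.3.0 || echo "Node.js >= 10.3.0 is required" >&2
#   in_virtualenv || echo "Activate the environment first: workon oedi" >&2
```

Comparing the fields numerically avoids the pitfall of plain string comparison, which would rank 9.x above 10.x.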