Build Data Lake¶
Now that you have your oedi environment set up, you are almost ready to access the data.
Configure OEDI¶
First, you need to run the configuration command:
(oedi) $ oedi config sync
This will use your AWS credentials file to establish access to the cloud. It will also create a config file, config.yaml, in the ~/.oedi directory, which contains the default OEDI config settings.
You need to edit config.yaml to select the data you would like to access and provide the path to your S3 staging location. Here is what the default config file looks like:
(oedi) $ oedi config show --provider AWS
AWS:
  Region Name: us-west-2
  Datalake Name: oedi-data-lake
  Databases:
    - Identifier: pv_rooftops
      Name: oedi_pv_rooftops
      Locations:
        - s3://oedi-data-lake/pv-rooftop/aspects/
        - s3://oedi-data-lake/pv-rooftop/buildings/
        - s3://oedi-data-lake/pv-rooftop/developable-planes/
        - s3://oedi-data-lake/pv-rooftop/rasd/
        - s3://oedi-data-lake/pv-rooftop-pr/developable-planes/
    - Identifier: buildstock
      Name: oedi_buildstock
      Locations:
        - s3://nrel-pds-building-stock/comstock/athena/2020/comstock_v1/state
        - s3://nrel-pds-building-stock/comstock/athena/2020/comstock_v1/metadata
    - Identifier: tracking_the_sun
      Name: oedi_tracking_the_sun
      Locations:
        - s3://oedi-data-lake/tracking-the-sun/2018/
        - s3://oedi-data-lake/tracking-the-sun/2019/
        - s3://oedi-data-lake/tracking-the-sun/2020/
    - Identifier: atb
      Name: oedi_atb
      Locations:
        - s3://oedi-data-lake/ATB/electricity/parquet/2019/
        - s3://oedi-data-lake/ATB/electricity/parquet/2020/
        - s3://oedi-data-lake/ATB/electricity/parquet/2021/
    - Identifier: pvdaq
      Name: oedi_pvdaq
      Locations:
        - s3://oedi-data-lake/pvdaq/parquet/site/
        - s3://oedi-data-lake/pvdaq/parquet/system/
        - s3://oedi-data-lake/pvdaq/parquet/inverters/
        - s3://oedi-data-lake/pvdaq/parquet/meters/
        - s3://oedi-data-lake/pvdaq/parquet/metrics/
        - s3://oedi-data-lake/pvdaq/parquet/modules/
        - s3://oedi-data-lake/pvdaq/parquet/mount/
        - s3://oedi-data-lake/pvdaq/parquet/other-instruments/
        - s3://oedi-data-lake/pvdaq/parquet/pvdata/
  Staging Location: s3://user-owned-staging-bucket/
OEDI may have multiple providers in the future. For now, we focus only on AWS.
These configuration settings apply to AWS and its related services in your data lake:
Region Name: the AWS region in which the data lake resources are created.
Datalake Name: the name of the AWS CloudFormation stack.
Databases: the databases that will be created in AWS Glue.
Identifier: a string that identifies the dataset.
Name: the name of the database that will be created in AWS Glue.
Locations: the AWS S3 locations containing the columnar datasets.
Staging Location: the AWS S3 location used by Athena for query outputs.
Use a text editor of your choice to modify the file (e.g. vi ~/.oedi/config.yaml). At a minimum, you will need to change the staging location to a bucket that your AWS account has access to. If you provide a path to a bucket that does not exist, AWS will create the bucket for you. However, S3 bucket names must be globally unique, so you cannot use a bucket name that someone else has already taken. You can either use a bucket name that you think is unlikely to exist, or use the AWS management console to create the bucket in your account manually, so that you know it will work (don't forget to choose the correct region!).
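For example, assuming you have the AWS CLI installed, you can also create the staging bucket from the command line; the bucket name below is a hypothetical placeholder that you would replace with your own globally unique name:
(oedi) $ aws s3 mb s3://my-unique-oedi-staging-bucket --region us-west-2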
Additionally, you should keep only the databases and tables that you are interested in, so that your AWS costs are minimized.
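As a sketch, a trimmed-down config.yaml that keeps only the buildstock database and points Athena at your own staging bucket (again a hypothetical bucket name) might look like this:
AWS:
  Region Name: us-west-2
  Datalake Name: oedi-data-lake
  Databases:
    - Identifier: buildstock
      Name: oedi_buildstock
      Locations:
        - s3://nrel-pds-building-stock/comstock/athena/2020/comstock_v1/state
        - s3://nrel-pds-building-stock/comstock/athena/2020/comstock_v1/metadata
  Staging Location: s3://my-unique-oedi-staging-bucket/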
Deploy Data Lake¶
Once your configuration file is ready, you can use cdk commands to manage the AWS infrastructure required by the data lake. Change into the open-data-access-tools/oedi/AWS directory, which contains the CDK app.
(oedi) $ cd open-data-access-tools/oedi/AWS
To deploy the data lake stack configured above, run the following command:
(oedi) $ cdk deploy
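Note that if this is the first time the CDK has been used in your AWS account and region, the deployment may fail with a message about a missing bootstrap stack. In that case, run the standard CDK bootstrap command once and then retry the deploy (this is general CDK behavior, not something specific to OEDI):
(oedi) $ cdk bootstrap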
What happens behind the scenes? Based on the configuration provided above, a stack of AWS resources is created, updated, or deleted. Assuming this is the first time the data lake is deployed, the following resources are created:
An AWS Glue database is created.
An IAM role for AWS Glue crawlers is created, which the crawlers assume when they run.
A number of AWS Glue crawlers are created.
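If you want a quick sanity check that the deployment succeeded, and assuming the AWS CLI is installed and configured with the same account, you can list the Glue databases in your region and look for the names from config.yaml:
(oedi) $ aws glue get-databases --region us-west-2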
Now you have the data lake infrastructure launched. Later on, after any change to config.yaml, you will need to re-deploy via cdk deploy to apply the updated configuration.
There are also other common cdk commands:
cdk ls: lists all stacks in the app.
cdk synth: emits the synthesized CloudFormation template.
cdk diff: compares the deployed stack with the current state.
cdk destroy: deletes all deployed AWS resources.
cdk docs: opens the CDK documentation.
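For example, after editing config.yaml, a typical workflow is to preview the pending changes before applying them:
(oedi) $ cdk diff
(oedi) $ cdk deploy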
For more information about cdk commands, please refer to the official documentation: https://docs.aws.amazon.com/cdk/latest/guide/home.html