
OpenNEX Climate Data Access Tools

In this user manual, we show you how to use the climate access tools to create and use your own datasets. These tools were produced by Planet OS in collaboration with the NASA Earth Exchange (NEX) team with the intent of improving data access to OpenNEX datasets.

OpenNEX contains a huge reserve of Coupled Model Intercomparison Project Phase 5 (CMIP5) climate modelling data that is freely available to all researchers. This includes DCP30 model runs for the Continental United States and GDDP model runs for the full globe.

The climate access tool makes it easy to select the data of interest to you and create a custom data product from that data. The climate access tool requires no programming and produces CSV or NetCDF files that can be used for climate science work done in languages such as Python, R, Matlab, and even Excel.

Please note that the climate access tools have been released as a pilot version. That means they have a preliminary set of capabilities and features. We plan to refine and extend them based on what users (such as yourself) find useful and want. Please share any feedback you have with us by sending an email to feedback@planetos.com.

OpenNEX Datasets

The OpenNEX project provides data and machine images via the Amazon Web Services (AWS) cloud platform for a variety of Earth science data from the NASA Earth Exchange (NEX) program. An overview of the available resources can be found on the AWS blog.

The two datasets supported in this release are NEX-DCP30 and NEX-GDDP. Both are downscaled climate scenarios that are derived from the General Circulation Model (GCM) runs conducted under the Coupled Model Intercomparison Project Phase 5 (CMIP5). They have been run across the greenhouse gas emissions scenarios known as Representative Concentration Pathways (RCPs) developed for the Fifth Assessment Report of the Intergovernmental Panel on Climate Change (IPCC AR5).

The datasets cover the periods from 1950 through 2005 (Retrospective Run) and from 2006 to 2099 (Prospective Run). Note that not all models cover all the scenarios. See Appendix II for a chart of which models have which of the four scenarios.

For information on the provenance of each model, see the overview at http://cmip-pcmdi.llnl.gov/cmip5/availability.html. Not all the models mentioned on that page are available as part of OpenNEX.

DCP30

The NEX-DCP30 dataset includes downscaled projections from 33 models calculated for all four RCPs (RCP 2.6, RCP 4.5, RCP 6.0, and RCP 8.5), as well as ensemble statistics calculated for each of the RCPs from all model runs available.

Each of the climate projections includes monthly averaged maximum temperature, minimum temperature, and precipitation. The spatial coverage encompasses the conterminous United States at a resolution of 30 arc seconds, with a monthly temporal resolution. The total dataset size is 17 TB, with an individual file size of 2 GB.

For further information on the methods and contents of the data refer to the technical note on the NEX-DCP30 dataset.

GDDP

The NEX-GDDP dataset includes downscaled projections for RCP 4.5 and RCP 8.5 from the 21 models for which daily scenarios were produced and distributed under CMIP5. Each of the climate projections includes daily maximum temperature, minimum temperature, and precipitation for the period from 1950 through 2100. The spatial coverage encompasses the entire earth at a resolution of 0.25 degrees (~25 km x 25 km).

For further information on the methods and contents of the data refer to the technical note on the NEX-GDDP dataset.

How it Works

There are three stages to creating and using a custom climate dataset with the tool.

Select Your Data

Using a simple web front end, select the climate data that you want by model, date, scenario, variable, and geographic region.

Create Your Dataset

Once your selection is made, the tool creates a simple script to transform the selected data into a CSV or NetCDF file. This script can be run on any supported system; however, we recommend running it in Amazon’s cloud because that’s where the OpenNEX data lives, making it the fastest environment. Don’t worry, we’ll walk you through this in detail - it’s not hard!

When a geographic region under 25 km² is selected, a download link is also provided, allowing you to access the data directly from your browser.

Use the Data

Most data analysis and programming environments have easy-to-use tools for handling the resulting CSV files.

Selecting Data

To begin, visit the Climate Data Access Tool and select a dataset of interest. In this document, we’ll be working with data from the DCP30 dataset.

NASA NEX-DCP30 Dataset

The DCP30 dataset screen allows you to select the models, dates, variables, and scenarios that you wish to include in your dataset, and provides a map for selecting the geographic extent.

Overview

To make a selection, choose the models, dates, variables, and scenarios you want to include, and outline your geographic region on the map.

The order doesn’t matter; selections can be made in any order you want. When you’re happy with your selection, click “Create New Dataset” to generate your script.

As you build your dataset, be mindful of the size of your selection. In particular, selecting large geographic regions will result in many lines of data. For example, if you choose the entire continental US for one month, one model, one scenario and one variable, you’ll get roughly 12 million points. Because CSV is not a very efficient format - each point becomes a full line of text - this selection creates over a gigabyte of CSV data.

Selecting a Region on the Map

Spatial extent selection

Selecting an area of interest directly on the map is straightforward.

First, navigate to the region you’re interested in. You can use the zoom controls in the bottom left corner to zoom in and out, and pan by clicking and dragging anywhere on the map.

Once you’ve located the region of interest, click on the “Select an Area” button to activate the polygon tool.

Now you can click anywhere on the map to start your polygon. As you drag the mouse, you’ll pull out a dotted line:

Note the floating tooltip that shows the longitude and latitude to help you know where to click.

Additional single clicks will add points to the polygon. As your area takes shape, the enclosed region will be shaded and selection points denoted with white squares.

To complete the polygon, you can either double-click to make the last point or simply click back on the first point.

Once the polygon is complete, its coordinates and enclosed area are displayed below the map.

Loading Polygons From a File

Upload a shapefile

If you have a file in a standard format that describes the area you’re interested in, you can upload it by using the “Upload a Shapefile” button on the top left of the map.

Alternatively, you can simply drag-and-drop the file onto the map.

Shapefiles (raw or zipped), GeoJSON files, and KML files are supported. You can even upload a file containing multiple polygons.

There are two caveats worth mentioning regarding polygon uploads.

First, if a polygon contains holes, those holes will be ignored and the entire area within the polygon will be included in the geographic selection.

Second, polygons with very high complexity can potentially make both the web interface and the data access tool run slowly. In such cases, you may wish to simplify your polygons prior to upload with a tool like GDAL/OGR.
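
To give a concrete idea of what such a simplification might look like, here is a minimal Python sketch using the geopandas library (which builds on GDAL/OGR and shapely). The file names and the tolerance value are hypothetical placeholders; GDAL’s ogr2ogr utility offers a similar -simplify option on the command line.

# A hedged sketch: reduce the vertex count of a complex polygon before upload.
# "regions.shp" and the 0.01 degree tolerance are placeholders for your own data.
import geopandas as gpd

shapes = gpd.read_file("regions.shp")

# Douglas-Peucker simplification; the tolerance is in the layer's coordinate
# units (degrees here), and preserve_topology avoids creating invalid shapes.
shapes["geometry"] = shapes.geometry.simplify(0.01, preserve_topology=True)

shapes.to_file("regions_simplified.shp")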

After You’ve Created Your Dataset

Once you’ve submitted your selection criteria by clicking the “Create New Dataset” button, your data access options will appear in the right sidebar, as well as a permalink to the saved selection criteria.

Data access options

Data Access Options

All datasets can be accessed using a Docker container. The command to launch this container is provided under the Deploy Docker Container heading. We’ll cover how to use this command to deploy a Docker container in the next section.

Datasets with an area less than 25km² can be downloaded directly in either CSV or NetCDF format. This option is not available for datasets that exceed 25km² in geographic extent.

Share This Dataset

When a new dataset is created, the tool stores the selection criteria that were used and assigns the dataset a unique identifier.

The dataset’s unique ID and corresponding permalink are shown below the Share This Dataset header. You could share this URL with a colleague so they could reference the same dataset in their own analysis, or return at a later time to slightly modify your selection criteria and generate a related dataset.

When the “Create New Dataset” button is clicked, your browser’s location will update with the permalink, allowing you to bookmark the page in your browser or save it to a bookmarking tool such as Pinboard.

Data Access

Datasets created using the Climate Data Access Tool may be accessed in two ways.

The first is by deploying a Docker container that assembles and serves the data, an option that is always available regardless of the dataset’s geographic extent.

The second is via a direct download in either CSV or NetCDF format. This option is not always available, and depends on the geographic extent of the selection.

Dataset Downloads

Datasets with an area less than 25km² can be downloaded directly by clicking on the CSV or NetCDF button.

If your data workflow only requires a URL to the data, you can skip the download process and right-click on either the CSV or NetCDF button to copy the link address to your clipboard. This link can then be pasted into the script or program you use to access the data.

Note that these direct download options are only available for small datasets where the selected geographic extent is less than 25km². They will not be shown for datasets that exceed this threshold.

Deploy A Data Server via Docker

The second option involves using a Docker container to deploy an access server that will acquire and deliver your requested data. The command you’ll need to execute to launch the server is displayed in the Deploy Docker Container section.

Deploy Docker Container

This command can be run on a Linux or OS X system command line. Windows is not currently supported, but details on how to get access from a Windows computer are provided below.

# A sample Docker container deployment command
curl -sS http://opennex.planetos.com/p/VFuTg | bash

An example of the basic shell command is shown at right. Let’s quickly review the command to understand what’s happening.

curl is a command that downloads a file from the Internet. It’s basically a web browser without the browser part. bash is the Linux command interpreter and the pipe character | tells the computer to pass the downloaded file to bash for execution.

The real magic is in http://opennex.planetos.com/p/VFuTg. This URL will be different every time you select new data and contains a shell script that deploys the data access component responsible for retrieving, formatting, and delivering your data.

This script uses a technology called Docker which allows complex programs to run on various kinds of computers without installing any software, except for Docker itself. The script will check whether Docker is installed and, depending on what kind of system you’re on, offer to install it for you or show you where to go to install it yourself.

When you run a climate data access script for the first time on a machine, it will take a few minutes to download the necessary images, depending on your network bandwidth. Once you’ve run any script once, all subsequent scripts will be fast to start because all the images will be loaded.

Where to Deploy Your Server

The data access server can run on various types of machines including your desktop, a local Linux server, a node in Amazon’s Elastic Compute Cloud (EC2), or a node in another cloud.

The OpenNEX datasets are stored in Amazon S3, so EC2 nodes provide very fast access to the source files. This will make a significant difference in how fast your dataset is assembled, especially for selections that cover a large geographic region or long temporal extent, or include multiple models and scenarios.

If you don’t have an Amazon account, it’s easy to set one up and try the access tools using Amazon’s free tier instances. See the Appendix to learn how to set up an account and launch an instance of your own. If you already have an Amazon account, you can use an existing instance or launch a new one.

Running the Server

Launch the data access server with the supplied command

curl -sS http://opennex.planetos.com/p/VFuTg | bash

Output from the command above

The available endpoint is:
    http://192.168.99.100:7645/data.csv
        Dataset: NEX-DCP30
        Model: FGOALS-g2
        Scenario: rcp45
        Variable: tasmax
        Dates: 2016-01-01 to 2021-12-31
        Region 1: (-89.3794, 43.0804), (-89.3419, 43.0591), (-89.3494, 43.0280),
                  (-89.4277, 43.0413), (-89.3863, 43.0795), (-89.3794, 43.0804)

        CSV columns and values:
          1 - Date
          2 - Longitude
          3 - Latitude
          4 - Model
          5 - Scenario
          6 - Variable
          7 - Value

There are various ways that you can run the data access server.

If you run it as shown at right, with no arguments, you’ll start a local web server that you can use to access the data, either locally or remotely. Note that in order to access the data remotely, you’ll need to ensure the port is exposed on the host machine.

The output shows you what data is in your data product and then waits for you to request it. The available endpoint at which data can be requested is shown at the top. In this example, it’s located at http://192.168.99.100:7645/data.csv, though the address will differ depending on your local environment.

Optional Arguments

# Display available script options using `/dev/stdin --help`
curl -sS http://opennex.planetos.com/p/VFuTg | bash /dev/stdin --help

Output from the command above

With no command line arguments, the script will start an HTTP server and output
the available dataset endpoints that match your request.

You can modify the script's behavior with options. To use options when piping
straight from the web, do something like:
    $ curl <url> | bash /dev/stdin --extract data

Script options:
    -l | --list                      List available datasets and their formats
    -e | --extract <endpoint(s)>     Transform one or more datasets (separated by commas),
                                     saving them in the current directory
    -a | --all                       Transform all available datasets, saving them in the
                                     current directory
    -f | --format                    Pull data in the requested format (default: csv)
    -t | --thredds                   Pull from the THREDDS server at http://dataserver.nccs.nasa.gov
                                     rather than S3
         --localhost                 Allow access only from this host
    -h | --help                      Show this help message and exit

All of the above options except for "--localhost" will shut down the server as soon as they are complete.

By adding options to your script invocation, you can change its behavior. You can see all your options by adding /dev/stdin --help at the end of the command line.

The most common option is -a, which automatically downloads data into a file called “data.csv” in the local directory.

Another commonly used option is -f nc, which is used to format data as NetCDF instead of the default CSV.

When you use options, the Docker container is stopped when the command is complete so no server is available afterwards.

Accessing Data

You can request this data either from the machine you’re running the climate access server on or from another machine that has access to that port (like your laptop).

Use curl to grab data.csv from the server endpoint

curl http://localhost:7645/data.csv > output.csv

The easiest way to access the data is to simply use the curl command again. The command at right will grab the output from the climate access server and send it to output.csv in the current directory.

However, you don’t need to save the output to a file at all. You can read it directly from the server into the tool you’re using to do the data analysis. We’ll discuss how to do that below.

Monitoring the Container

Display the Docker container log

docker logs -f opennex-server

Sometimes it’s handy to peek into the Docker container to see what it’s doing. When running the Docker container locally and not in an Amazon EC2 instance, acquisition of the source files can take some time and is subject to dataset size and network bandwidth.

You may suspect your container is unresponsive, but it may simply be busy acquiring data. For example, in the Temperatures in Chicago use case it took ~14 minutes to complete the endpoint request on a local machine. While the R command may appear hung, the Docker logs reveal ongoing container activity as the relevant files are sourced from S3.

Shutting Down

Shut down your Docker container

# Note that you may need to prefix this with `sudo`
docker rm -f opennex-server

When you are done using your climate access tool, you can shut down the docker container by running the command at right.

However, the container doesn’t use many resources when it’s idle, and new requests will replace previous ones, so this step is not required.

Updating Your Server

If you return to the browser, create a new dataset, and then run the provided script, it automatically replaces the previous data server with your new one. You can iteratively define and use datasets over and over with no penalty.

Do note, however, that the server is currently designed for a single instance per machine, so multiple users on the same instance should not start servers simultaneously.

Future plans include the ability to select multiple datasets in a single climate access instance, to add new datasets to existing instances, and to support multiple users and accesses in parallel.

Working With Your Dataset

Data can be accessed in both CSV and NetCDF format. This section highlights the commands used to request data in either format, as well as the structure of the data that’s returned.

CSV Format

Use curl to inspect the data.csv file from a Docker-deployed data access server. Note that the IP of your container may differ from the one shown below.

curl -sS http://192.168.99.100:7645/data.csv | more

Sample CSV format response

Date,Longitude,Latitude,Model,Scenario,Variable,Value
2016-01-01,-89.3792,43.0292,FGOALS-g2,rcp45,tasmax,268.77978515625
2016-01-01,-89.3708,43.0292,FGOALS-g2,rcp45,tasmax,268.77032470703125
2016-01-01,-89.3625,43.0292,FGOALS-g2,rcp45,tasmax,268.76373291015625
2016-01-01,-89.3542,43.0292,FGOALS-g2,rcp45,tasmax,268.8587341308594
2016-01-01,-89.3458,43.0292,FGOALS-g2,rcp45,tasmax,268.8514099121094
2016-01-01,-89.4292,43.0375,FGOALS-g2,rcp45,tasmax,268.86834716796875
2016-01-01,-89.4208,43.0375,FGOALS-g2,rcp45,tasmax,268.8631591796875
2016-01-01,-89.4125,43.0375,FGOALS-g2,rcp45,tasmax,268.945068359375
...

You can access datasets produced by the climate access tools in a very simple comma-separated value (CSV) format that can be read by almost all packages and languages. If you’re not familiar with data in CSV format, it’s easy: each line represents a single point in the dataset, and the values on each line are separated by commas.

The fields on each line are as follows: Date, Longitude, Latitude, Model, Scenario, Variable, and Value.
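
As a quick illustration, here is a minimal Python sketch that reads the CSV output with pandas; the same call works with a local download or with the URL of a running data access server.

import pandas as pd

# "data.csv" is a locally downloaded file; a server endpoint URL such as
# http://192.168.99.100:7645/data.csv can be passed here instead.
temps = pd.read_csv("data.csv", parse_dates=["Date"])

print(temps.head())               # Date, Longitude, Latitude, Model, Scenario, Variable, Value
print(temps["Value"].describe())  # temperature values are reported in Kelvin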

NetCDF Format

Use curl to save your dataset in NetCDF format to the local directory. This command works remotely as well, but requires exposing your data access server and replacing the hostname appropriately.

curl http://192.168.99.100:7645/data.nc > data.nc

Use the ncdump tool to view the dimensions, variables, and attributes of the saved data.nc file; the ncdump -h command returns the data.nc header information. Learn more about ncdump.

ncdump -h data.nc

Sample response from the ncdump -h data.nc command.

netcdf data {
dimensions:
    time = 72 ;
    lon = 11 ;
    lat = 7 ;
    bnds = 2 ;
variables:
    double time(time) ;
        time:units = "days since 1950-01-01 00:00:00" ;
        time:calendar = "standard" ;
        time:axis = "T" ;
        time:long_name = "time" ;
        time:standard_name = "time" ;
        time:bounds = "time_bnds" ;
        time:_CoordinateAxisType = "Time" ;
    double lon(lon) ;
        lon:valid_range = 0., 360. ;
        lon:standard_name = "longitude" ;
        lon:long_name = "longitude" ;
        lon:units = "degrees_east" ;
        lon:bounds = "lon_bnds" ;
        lon:axis = "X" ;
        lon:_CoordinateAxisType = "Lon" ;
    double lat(lat) ;
        lat:standard_name = "latitude" ;
        lat:long_name = "latitude" ;
        lat:units = "degrees_north" ;
        lat:bounds = "lat_bnds" ;
        lat:axis = "Y" ;
        lat:_CoordinateAxisType = "Lat" ;
    double time_bnds(time, bnds) ;
    double lon_bnds(lon, bnds) ;
    double lat_bnds(lat, bnds) ;
    float tasmax(time, lat, lon) ;
        tasmax:_FillValue = 1.e+20f ;
        tasmax:associated_files = "baseURL: http://cmip-pcmdi.llnl.gov/CMIP5/dataLocation gridspecFile: gridspec_atmos_fx_FGOALS-g2_rcp45_r0i0p0.nc areacella: areacella_fx_FGOALS-g2_rcp45_r0i0p0.nc" ;
        tasmax:cell_measures = "area: areacella" ;
        tasmax:cell_methods = "time: maximum (interval: 20minutes) within days time: mean over days" ;
        tasmax:comment = "monthly mean of the daily-maximum near-surface air temperature." ;
        tasmax:coordinates = "height" ;
        tasmax:history = "Original Name is TREFMXAV. 2011-12-26T06:10:16Z altered by CMOR: Treated scalar dimension: \'height\'." ;
        tasmax:long_name = "Daily Maximum Near-Surface Air Temperature" ;
        tasmax:missing_value = 1.e+20f ;
        tasmax:standard_name = "air_temperature" ;
        tasmax:units = "K" ;

// global attributes:
        :creation_date = "Mon Jul 23 02:17:20 PDT 2012" ;
        :parent_experiment = "historical" ;
        :parent_experiment_id = "historical" ;
        :parent_experiment_rip = "r1i1p1" ;
        :Conventions = "CF-1.4" ;
        :project_id = "NEX" ;
        :product = "downscaled" ;
        :institution = "NASA Earth Exchange, NASA Ames Research Center, Moffett Field, CA 94035" ;
        :institute_id = "NASA-Ames" ;
        :realm = "atmos" ;
        :modeling_realm = "atmos" ;
        :region = "CONUS" ;
        :CMIPtable = "Amon" ;
        :version = "1.0" ;
        :downscalingModel = "BCSD" ;
        :experiment_id = "rcp45" ;
        :frequency = "mon" ;
        :table_id = "Table Amon" ;
        :realization = "1" ;
        :initialization_method = "1" ;
        :physics_version = "1" ;
        :variableName = "tasmax" ;
        :tracking_id = "27d795da-af48-11e2-9374-e41f13efa9fe" ;
        :driving_data_tracking_ids = "N/A" ;
        :driving_model_id = "FGOALS_g2" ;
        :driving_model_ensemble_member = "r1i1p1" ;
        :driving_experiment_name = "historical" ;
        :driving_experiment = "historical" ;
        :region_id = "CONUS" ;
        :region_lexicon = "http://en.wikipedia.org/wiki/Contiguous_United_States" ;
        :resolution_id = "800m" ;
        :title = "800m Downscaled NEX CMIP5 Climate Projections for the Continental US" ;
        :model_id = "BCSD" ;
        :references = "BCSD method: Wood AW, Maurer EP, Kumar A, Lettenmaier DP, 2002, J Geophys Res 107(D20):4429 & \n",
            " Wood AW, Leung LR, Sridhar V, Lettenmaier DP, 2004, Clim Change 62:189-216\n",
            " Reference period obs: PRISM (http://www.prism.oregonstate.edu/)" ;
        :DOI = "http://dx.doi.org/10.7292/W0WD3XH4" ;
        :experiment = "RCP4.5" ;
        :contact = "Dr. Rama Nemani: rama.nemani@nasa.gov, Dr. Bridget Thrasher: bridget@climateanalyticsgroup.org, and Dr. Mark Snyder: mark@climateanalyticsgroup.org" ;
        :_CoordSysBuilder = "ucar.nc2.dataset.conv.CF1Convention" ;

When creating large or complex datasets, it’s more efficient to use NetCDF, a binary format optimized for handling arrays. NetCDF is usually more complex to deal with than CSV, but Python and R both have easy to use libraries that let you work with NetCDF files.

NetCDF is not a streaming format, so you can’t consume a NetCDF file as it’s produced the way you can with CSV. Therefore, it usually makes sense to download the entire file with curl from a running data access server before using it (see sidebar for example).

An additional method is to use the -f nc -a options to run the Docker container solely to build the “data.nc” file. The command below will launch a Docker container, create a data.nc file in the local directory, then exit.

curl -sS http://opennex.planetos.com/p/VFuTg | bash /dev/stdin -f nc -a
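
Once the data.nc file exists, a short Python sketch such as the one below can inspect and subset it. It uses the netCDF4 package, one of several libraries that read NetCDF; the tasmax variable name matches the example dataset used throughout this manual, so substitute the variables present in your own file.

from netCDF4 import Dataset

src = Dataset("data.nc")

print(src.variables.keys())               # e.g. time, lon, lat, tasmax, ...

# Read a single time step of tasmax; fill values are masked automatically.
tasmax = src.variables["tasmax"]
first_month = tasmax[0, :, :] - 273.15    # Kelvin -> Celsius

print(first_month.min(), first_month.max())
src.close()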

Accessing Multiple Datasets

If you want to create a series of datasets and explore them iteratively, repeatedly starting new containers can be annoying. To make this easier, once you’ve started a container you can use it to access any dataset by using its unique ID in the URL as follows:

http://192.168.99.100:7645/dataset/<unique-id>/data.csv

For NetCDF, simply replace the .csv with .nc.

The unique ID of your dataset is displayed near the bottom of the right sidebar on the web site when you’re viewing your dataset:

Dataset unique ID

Using this facility allows you to create a new dataset using the selection tools and immediately use it in your application as long as you leave a Docker container running.
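
As an illustration, the hypothetical Python sketch below pulls several datasets from one running container by substituting their unique IDs into the URL pattern shown above. The server address and the IDs are placeholders; replace them with your own container address and the IDs shown in the sidebar.

import urllib.request

SERVER = "http://192.168.99.100:7645"     # address of your running container
DATASET_IDS = ["id-one", "id-two"]        # hypothetical unique dataset IDs

for dataset_id in DATASET_IDS:
    url = "%s/dataset/%s/data.csv" % (SERVER, dataset_id)
    # Save each dataset to a local CSV file named after its ID.
    with urllib.request.urlopen(url) as resp, open(dataset_id + ".csv", "wb") as out:
        out.write(resp.read())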

Remote Access via AWS EC2 Instance

Many users want to take advantage of the speed of filtering and transforming the data directly in the Amazon AWS cloud, but want to use their desktop tools to fetch and analyze the data. By default, instances in EC2 only allow access via the ssh protocol.

Here are some ways to use the data from your Amazon EC2 instance directly from your desktop:

Customize your security group

EC2 instances have associated security groups. You can view the security group for your instance by selecting the instance on the AWS console and clicking on the security group name in the “Description” tab for that instance.

While viewing the security group, select the “Inbound” tab and click “Edit” to modify the rules. To open the port that the climate access tool uses (7645), click “Add Rule” and select “Custom TCP Rule”, enter 7645 in the “Port Range” column, then either select your IP address or “Anywhere” in the “Source” field.

Now you should be able to directly access the climate access tool by entering http://<public-IP-of-your-instance>:7645/data.csv in your browser.

Create an ssh tunnel

ssh can forward a port on your desktop machine to a port on your server. First run an instance of ssh as follows:

ssh -L7645:127.0.0.1:7645 <IP-address-of-your-EC2-node>

Now you can connect to the local port and your data will be pulled from the remote system. Just use the url http://localhost:7645/data.csv.

Copy the file using scp

Sometimes it’s easiest just to generate the file remotely on the EC2 instance and copy it to your desktop. In this case, simply use the -a option on the downloaded script as shown above in the “Optional Arguments” section, which will create a file called “data.csv” on your instance. Copy it to your desktop with:

scp <IP-address-of-your-EC2-node>:data.csv .

Examples

This section includes examples of how datasets generated by the climate access tool can be used within R and Python workflows.

R

R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS. For more information on R, visit the R Project website.

Temperatures in Chicago

An example using R to analyze DCP30 data. First we load the required packages and create the load.data and do.graph functions.

library(ggplot2)
library(dplyr)

load.data <- function(ipaddr) {
  # Allow up to 10 minutes for the server to assemble and stream the CSV
  options(timeout=600)
  # Read the CSV endpoint (timing the transfer) and parse the Date column as dates
  print(system.time(temps <- read.csv(url(sprintf("http://%s:7645/data.csv", ipaddr)),
                                      colClasses=c(Date="Date"))))
  # Convert Kelvin to Celsius
  temps$Temperature <- temps$Value - 273.15
  temps
}

do.graph <- function(temps) {
  # Summarise the gridded values into mean/max/min per month and scenario
  by.month <-
    temps %>%
    group_by(Date,Scenario) %>%
    summarise(Mean=mean(Temperature),
              Max=max(Temperature),
              Min=min(Temperature))
  by.month$year <- as.Date(format(by.month$Date, "%Y-01-01"))

  # Keep the warmest month of each year under each scenario
  by.year <-
    by.month %>%
    group_by(year, Scenario) %>%
    summarise(Temperature=max(Max))

  # Plot the yearly maxima with a loess smoother and 80% confidence band
  ggplot(by.year, aes(x=year, y=Temperature,
                      color=Scenario, group=Scenario)) +
    geom_line() +
    stat_smooth(method="loess", level=.8) +
    ggtitle(sprintf("Maximum mean temperature for warmest month using model %s",
                    temps$Model[1]))
}

Next, assuming the data server is deployed and our endpoint is available at http://192.168.99.100:7645/data.csv, let’s load the data.

chicago <- load.data("192.168.99.100")

Note that depending on the location of your data server and your network bandwidth, loading the data may take some time. Performance is best using an EC2 instance, given its locality to the S3 source data.

   user  system elapsed
439.136   9.057 843.423

Use the summary function to get a statistical summary of the temperature data

summary(chicago)

The response of the above summary command

      Date              Longitude         Latitude            Model               Scenario
 Min.   :1950-01-01   Min.   :-87.87   Min.   :41.59   CESM1-CAM5:8512128   historical:1103424
 1st Qu.:2018-12-24   1st Qu.:-87.77   1st Qu.:41.72                        rcp26     :1852176
 Median :2045-12-16   Median :-87.69   Median :41.81                        rcp45     :1852176
 Mean   :2043-03-27   Mean   :-87.69   Mean   :41.81                        rcp60     :1852176
 3rd Qu.:2072-12-08   3rd Qu.:-87.62   3rd Qu.:41.90                        rcp85     :1852176
 Max.   :2099-12-01   Max.   :-87.53   Max.   :42.00
   Variable           Value        Temperature
 tasmax:8512128   Min.   :264.0   Min.   :-9.153
                  1st Qu.:281.4   1st Qu.: 8.219
                  Median :291.7   Median :18.531
                  Mean   :290.7   Mean   :17.527
                  3rd Qu.:300.6   3rd Qu.:27.453
                  Max.   :314.5   Max.   :41.340

Plot the data using the do.graph function

do.graph(chicago)

The following example uses the statistical language R and DCP30 data to explore maximum temperatures in the Chicago area according to a single climate model (CESM1-CAM5) under various RCP scenarios.

R code and console responses are shown in the sidebar. The dataset used in this example is available at http://opennex.planetos.com/dcp30/ZKXr4.

DCP30 data in the Chicago region

Working with CSV files in R is easy, and you can even access the container endpoint directly instead of saving CSV files locally. This saves a step if you’re working with multiple datasets. In this example, however, we’ll focus on just one.

Increase the Default Timeout

Since the copy from S3 can be slow, it’s a good idea to increase the timeout used for retrieval. Just use the options function:

options(timeout=600)

Reading CSV Data

To read the file straight from your Docker container, use read.csv with the endpoint url provided when you deployed your data server. Note that localhost is used below as an example and your endpoint hostname may differ. If you’ve deployed the container remotely (e.g. in EC2), you’ll need to replace localhost with the appropriate IP address and ensure it’s remotely accessible.

temps <- read.csv(url("http://localhost:7645/data.csv"), colClasses=c(Date="Date"))

Handling Variables

We explicitly specify the Date column as a date so that R will convert it to its internal Date class. R will figure out the other column types without you having to specify them.

Temperature variables are provided in Kelvin, so you’ll probably want to convert to Celsius. Converting to Fahrenheit is left as an exercise for the reader.

temps$Temperature <- temps$Value - 273.15

Custom Functions

We’ve created two functions in R to do some very simple data analysis: load.data combines the read.csv and the Kelvin -> Celsius conversion into a single function which takes the IP address of the climate data access tool container.

do.graph uses the dplyr and ggplot2 libraries to plot the temperature for the warmest month of each year under the various scenarios and adds smoothing and confidence on top of the raw data.

Using these functions, we can load the data, see the summary of what’s in it, and visualize the warmest months in our area of interest from 1950 - 2099.

Maximum mean temperature for warmest months

Incrementally Processing Large Datasets

An example showing incremental processing of a large CSV dataset to determine the maximum and minimum temperatures (with locations) within the continental US in 2016.

print.min.max <- function(ipaddr) {
  # Running extremes (in Kelvin) and the CSV fields where they occurred
  minval <- 1000.0
  maxval <- 0.0
  mins <- NULL
  maxs <- NULL

  options(timeout=600)
  # Open the CSV endpoint as a text connection so it can be read in chunks
  con <- url(sprintf("http://%s:7645/data.csv", ipaddr), open="rt")

  # Skip the header line, then process 10,000 data lines at a time
  header <- readLines(con, n=1)
  while (length(input <- readLines(con, n=10000)) > 0){
    for (i in 1:length(input)){
      line <- input[i]
      fields <- unlist(strsplit(line,',',fixed=TRUE))
      val <- as.numeric(fields[7])   # Value column
      var <- fields[6]               # Variable column (tasmax or tasmin)
      if (var == "tasmax" && val > maxval) {
        maxval <- val
        maxs <- fields
      }
      if (var == "tasmin" && val < minval) {
        minval <- val
        mins <- fields
      }
    }
  }
  close(con)

  if (!is.null(mins)) {
    print(sprintf("The minimum temperature is %.2f degrees C on %s at (%.3fW, %.3fN)",
                  minval-273.15, substr(mins[1], 1, 7),
                  as.numeric(mins[2]), as.numeric(mins[3])))
  }
  if (!is.null(maxs)) {
    print(sprintf("The maximum temperature is %.2f degrees C on %s at (%.3fW, %.3fN)",
                  maxval-273.15, substr(maxs[1], 1, 7),
                  as.numeric(maxs[2]), as.numeric(maxs[3])))
  }
}

Sometimes we want to read more data than will fit comfortably in a data frame. In this case, we can use readLines to read the data in chunks from the server and process each chunk individually.

In this example, we read 10,000 lines at a time from the server, remembering the minimum and maximum monthly average temperatures as we go. At the end, we report when and where those extremes were found within the dataset.

Running this for the entire continental US for the year 2016 using the CESM1-CAM5 model under the RCP4.5 scenario, we see this:

Continental US minimum and maximum temperatures

The hottest month will be July in Death Valley, California and the coldest month will be December in Bighorn National Forest in Wyoming. These seem about right!

To compute this result, the print.min.max function examined 291 million data points, which were 17.1GB in CSV format. This would have been challenging to store in memory as a data frame. The computation took about 1.25 hours on an EC2 t2.small instance (though this is highly variable) and would run correspondingly faster on a larger instance.

Processing NetCDF Files

In the previous example, the data access tool had to convert 291 million data points to CSV and then R parsed each of those lines back into numeric values. In cases like these, it can be more efficient to use the binary NetCDF format. In addition, since NetCDF works with matrices, it can be a more natural fit for many techniques which work with matrices rather than data frames.

Here, we’ll redo the above example of finding the minimum and maximum temperature for the entire continental US for the year 2016 according to the CESM1-CAM5 model under the RCP4.5 scenario. We’ll use the same source data, which is available at http://opennex.planetos.com/dcp30/LpJMh.

Spin up a container and save the data as a NetCDF file

time (curl -sS http://opennex.planetos.com/p/LpJMh | bash /dev/stdin -a -f nc)

First, let’s spin up our container locally and generate a NetCDF file. We’ll use the optional argument -a to transform all available datasets and -f nc to format the output as NetCDF.

This gives us the file data.nc, which is 565MB and includes two variables, tasmax and tasmin.

R code example of how to read a data access server NetCDF file and determine minimum and maximum average temperature.

library(ncdf4)

print.min.max.nc <- function(filename) {
  src <- nc_open(filename)
  lats=ncvar_get(src,"lat")
  lons=ncvar_get(src,"lon")

  # Get the minimum temp and where it occurred for all the months
  monthly <- sapply(1:src$dim[1]$time$len, # No. of months
                    function(i) {
                      tmp=ncvar_get(src, "tasmin",
                                    start=c(1,1,i),
                                    count=c(-1,-1,1));
                      mn=min(tmp, na.rm=T);
                      result=which(tmp==mn, arr.ind = T);
                      c(result, mn)})
  # Find the month with the coldest temperature
  min.mon <- which.min(monthly[3,])
  lon <- 360-lons[monthly[1,min.mon]]
  lat <- lats[monthly[2,min.mon]]
  temp <- monthly[3,min.mon]-273.15

  print(sprintf("The minimum temperature is %.2f degrees C on 2016-%02d at (%.3fW, %.3fN)",
                temp, min.mon, lon, lat))

  # Get the maximum temp and where it occurred for all the months
  monthly <- sapply(1:src$dim[1]$time$len, # No. of months
                    function(i) {
                      tmp=ncvar_get(src, "tasmax",
                                    start=c(1,1,i),
                                    count=c(-1,-1,1));
                      mx=max(tmp, na.rm=T);
                      result=which(tmp==mx, arr.ind = T);
                      c(result, mx)})
  # Find the month with the hottest temperature
  max.mon <- which.max(monthly[3,])
  lon <- 360-lons[monthly[1,max.mon]]
  lat <- lats[monthly[2,max.mon]]
  temp <- monthly[3,max.mon]-273.15

  print(sprintf("The maximum temperature is %.2f degrees C on 2016-%02d at (%.3fW, %.3fN)",
                temp, max.mon, lon, lat))
  nc_close(src)
}

We can use the ncdf4 package to read this data into R. Since the data requested is large, we need to read it in chunks. In this case, we read it one month at a time, but it can sometimes be challenging to determine the correct batch sizes to work with.

Run the print.min.max.nc function on the local data.nc file. We’ll time it to see how it performs.

system.time(print.min.max.nc("~/data.nc"))

When we run the print.min.max.nc function, we get the same minimum and maximum temperatures as in the CSV example above. However, using NetCDF provides a significant performance improvement of roughly 5x: 15 minutes for the combination of creating the file and running the R code, compared to 1.25 hours for CSV.

Maximum and minimum temperatures in the continental US

Python

Python has some great tools to read and process tabular data stored in CSV format. We’re going to look at two ways to examine the data here.

First, we’re going to work with some data using the pandas and matplotlib libraries to read it, do a simple computation, and plot the results. Then, we’ll show how to use the csv module from the standard library to process very large datasets iteratively. If you have different tools that you’re using for processing data, it should be easy to use those as well.

These examples are published as Jupyter notebooks in the Planet OS notebooks repository, which can be reviewed online or cloned and run locally.

DCP30 Analysis in Pandas

Pandas is a powerful data analysis library that works in conjunction with other libraries such as numpy and scipy as part of Python’s rich ecosystem for scientific and numerical processing. You can learn all about pandas at http://pandas.pydata.org.

Pandas provides a tabular data structure called a dataframe that corresponds very well to the data layout produced by the climate access tool. We’ve published an example that uses a dataframe to look at how temperature projections in a model change under the different scenarios presented.

View the OpenNEX DCP30 Analysis Using Pandas notebook on GitHub.
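
If you prefer to start from scratch rather than from the notebook, the sketch below shows the general shape of such an analysis, assuming a data server endpoint like the example used earlier: read the CSV into a dataframe, convert Kelvin to Celsius, and plot the warmest month of each year by scenario.

import pandas as pd
import matplotlib.pyplot as plt

URL = "http://192.168.99.100:7645/data.csv"   # replace with your own endpoint

temps = pd.read_csv(URL, parse_dates=["Date"])
temps["Temperature"] = temps["Value"] - 273.15            # Kelvin -> Celsius

# Warmest month of each year under each scenario, mirroring the R example.
warmest = (temps.groupby([temps["Date"].dt.year, "Scenario"])["Temperature"]
                .max()
                .unstack("Scenario"))

warmest.plot(title="Warmest month per year by scenario")
plt.xlabel("Year")
plt.ylabel("Temperature (degrees C)")
plt.show()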

Incrementally Processing Large Datasets with the CSV Module

When we want to process more data than can fit into memory, we can use the csv module to read data from the stream as it’s sent.

Similar to the earlier R example, we’ll identify the minimum and maximum temperatures in the continental United States as predicted by the CESM1-CAM5 climate model under the RCP4.5 scenario.

View the Processing Large Datasets with the CSV Module notebook on GitHub.
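
A rough sketch of that streaming approach is shown below, assuming the same example endpoint used earlier. The csv module reads the server response row by row, so only the running extremes are kept in memory.

import csv
import io
import urllib.request

URL = "http://192.168.99.100:7645/data.csv"   # replace with your own endpoint

coldest = (float("inf"), None)                # (value in Kelvin, CSV row)
hottest = (float("-inf"), None)

with urllib.request.urlopen(URL) as resp:
    reader = csv.DictReader(io.TextIOWrapper(resp, encoding="utf-8"))
    for row in reader:
        value = float(row["Value"])
        if row["Variable"] == "tasmin" and value < coldest[0]:
            coldest = (value, row)
        elif row["Variable"] == "tasmax" and value > hottest[0]:
            hottest = (value, row)

for label, (value, row) in (("Minimum", coldest), ("Maximum", hottest)):
    if row is not None:
        print("%s: %.2f degrees C in %s at (%s, %s)"
              % (label, value - 273.15, row["Date"][:7],
                 row["Longitude"], row["Latitude"]))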

Appendix I: Creating a Node in Amazon’s Elastic Compute Cloud

Setting up Amazon EC2 instances to run the data access tool is easy. If you don’t already have an AWS account, you can set one up for free, which will give you the resources you need to run the data access tool for a year. Of course, if you want higher performance, you may need to get a bigger instance than is allowed under the free program, but even this is very affordable for the duration of a single run.

Note: This section is aimed at those new to AWS. If you have experience using AWS, there’s no need to follow these instructions as laid out. If you want to set up your instance differently (or via a mechanism other than the AWS console), that will be fine. You can also use an existing Linux-based EC2 instance, if you prefer.

Create an AWS Account

If you haven’t used Amazon Web Services (AWS) before, simply go to https://aws.amazon.com to sign up for an account and get access to the free tier services.

On that page, you’ll see the big button for creating a free account. Just click it.

On the following page, enter your email and choose “I am a new user”. Then click on “Sign in using our secure server”.

From this point, you can just follow the prompts to get your new account. You will need three things in order to do that: a credit card to guarantee the account (it won’t be charged unless you access non-free resources), a phone number to use for confirming the account, and an email address to associate with the account.

Setup an EC2 Instance

To run the climate access tool in the cloud, you need to create an EC2 instance. You can do this very easily from the AWS web-based dashboard.

First, go to http://console.aws.amazon.com. You will be prompted to log in to your AWS account if you are not already logged in.

From that point, select EC2 from the gigantic screen full of choices.

This will take you to the EC2 dashboard which has a big blue button for launching a new instance.

Once you’ve clicked there, you can choose an image for the instance. This specifies the OS you want to use there. We recommend the standard Ubuntu image, but the access tool will work on most common Linux images.

Next you get to choose an instance type. This lets you specify the size and characteristics of the instance you’re creating. If you want to use the free tier, make sure the instance you choose is marked “Free tier eligible”:

At this point, you can simply click “Review and Launch” to start the instance or walk through the wizard to configure the specifics of your instance.

Two things to consider here are:

Once you’ve reviewed your configuration and launched your instance, you need to create a key pair to go with it.

Download your key pair and put it in a safe place. You’ll need it to access your system. If you’re on Mac or Linux, make sure that you make the key privately accessible to only your user (“chmod 600 filename” will do it) or you won’t be able to use it to access your system.

Once you hit “Launch Instances,” your instance will launch. It will take about 5 minutes to be fully ready. You can view its status on the EC2 dashboard.

When the instance has launched, you can get its public IP address by finding it on the EC2 dashboard and looking in the “Public IP” column.

You’ll need this address to connect to the instance.

Connect to Your Instance

If you’re using a Linux or Mac system or you have a Windows system with Cygwin installed, you can use ssh from the command line to connect to your instance.

ssh -i ~/.ssh/climate-test.pem -L7645:localhost:7645 ubuntu@52.90.217.249

Use an ssh command like the one in the sidebar to connect.

This command has the following parts: -i ~/.ssh/climate-test.pem tells ssh to use the key pair file you downloaded; -L7645:localhost:7645 forwards port 7645 on your desktop to port 7645 on the instance, so the data access server is reachable at http://localhost:7645/; and ubuntu@52.90.217.249 is the default user name for the Ubuntu image followed by the public IP address of your instance.

If you’re on a Windows system without Cygwin, you can use a GUI-based ssh client like putty. See http://www.chiark.greenend.org.uk/~sgtatham/putty/ for information about getting and using putty.

If you’re going to use putty, you need to convert the key you got from EC2 using puttygen. You can get puttygen at the same page you get putty from. Instructions for doing the conversion are at http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html, but it’s pretty easy to do it without the instructions.

Running Jupyter or RStudio

Once your instance has Docker installed (this will be done automatically when you run the climate tool for the first time as described above), you can use it to create a Jupyter environment for doing analysis in Python or an RStudio environment for doing analysis in R. This can be easier than installing these packages on your client machine if you only want them for analyzing the DCP30 data.

Jupyter for Python Users

For Python users, the Jupyter team has put together a series of docker images that automatically run a Jupyter notebook server without any installation. You can find them at https://github.com/jupyter/docker-stacks. We’ve been using their scipy-notebook which includes many of the analytical packages that you’ll want.

Launch Jupyter via Docker

docker run -d -p 8888:8888 -v ${HOME}/notebooks:/home/jovyan/work jupyter/scipy-notebook

To start it, run the command in the sidebar at the command prompt in your EC2 instance.

This will create a notebooks directory in your home directory on that instance, which is where things created in the Jupyter server will be saved. Make sure to copy these notebooks to your local machine from time to time (you can use scp or simply use the “File > Download As > IPython Notebook” menu item in Jupyter).

To reference the climate tool from the notebook, you’ll need to use the internal IP address of your instance. To find this, run ip addr show eth0 and look for the “inet” key.

To access this Jupyter from your desktop, simply add the argument -L8888:localhost:8888 to the ssh command that you created above. Then you can steer your browser to http://localhost:8888/ to get to the Jupyter main page.

RStudio for R Users

For R users, the rocker-org project (https://github.com/rocker-org) provides useful docker images for R. We have been using their hadleyverse image for our work.

Launch RStudio via Docker

docker run -d -p 8787:8787 -v ${HOME}/rstudio:/home/rstudio rocker/hadleyverse

To set up an RStudio server using this image, run the command in the sidebar at the command prompt in your EC2 instance.

To access this RStudio from your desktop, simply add the argument -L8787:localhost:8787 to the ssh command that you created above. Then you can steer your browser to http://localhost:8787/ to get to RStudio. To login, use the username “rstudio” and the password “rstudio”.

This will create an rstudio directory in your home directory on that instance, which is where things created in RStudio will be saved. Make sure to copy any files to your local machine from time to time (you can use scp or simply use the “Export…” command on the “More” dropdown in the Files tab).

To reference the climate tool from RStudio, you’ll need to use the internal IP address of your instance. To find this, run ip addr show eth0 and look for the “inet” key.

Appendix II: Climate Model Data Availability

The following tables display which RCP scenarios are available for each climate model.

DCP30 Data Availability

Only the historical and RCP4.5 scenarios are available for all climate models in the DCP30 dataset.

Note that the models HadGEM2-CC and HadGEM2-ES end their retrospective runs in November 2005 and start their prospective runs in December 2005. This is one month earlier than all the other models.

Model Historical RCP2.6 RCP4.5 RCP6.0 RCP8.5
ACCESS1.0 x x x
BCC-CSM1.1 x x x x x
BCC-CSM1.1(m) x x x
BNU-ESM x x x x
CanESM2 x x x x
CCSM4 x x x x x
CESM1(BGC) x x x
CESM1(CAM5) x x x x x
CMCC-CM x x x
CNRM-CM5 x x x
CSIRO-Mk3.6.0 x x x x x
FGOALS-g2 x x x x
FIO-ESM x x x x x
GFDL-CM3 x x x x x
GFDL-ESM2G x x x x x
GFDL-ESM2M x x x x x
GISS-E2-H-CC x x
GISS-E2-R x x x x x
GISS-E2-R-CC x x
HadGEM2-AO x x x x x
HadGEM2-CC x x x
HadGEM2-ES x x x x x
INM-CM4 x x x
IPSL-CM5A-LR x x x x x
IPSL-CM5A-MR x x x x x
IPSL-CM5B-LR x x x
MIROC-ESM x x x x x
MIROC-ESM-CHEM x x x x x
MIROC5 x x x x x
MPI-ESM-LR x x x x
MPI-ESM-MR x x x x
MRI-CGCM3 x x x x
NorESM1-M x x x x x
Ensemble Average x x x x x
Quartile 25 x x x x x
Quartile 50 x x x x x
Quartile 75 x x x x x

GDDP Data Availability

Historical, RCP4.5, and RCP8.5 scenarios are available for all climate models in the GDDP dataset.

Note, however, that the following data is missing from the GDDP dataset:
- ACCESS1-0 is missing the variable pr (precipitation) for RCP 8.5 for the year 2100.
- GFDL-CM3 is missing the variable pr (precipitation) for RCP 4.5 for the years 2096-2100.
- BCC-CSM1-1 and MIROC5 are missing all data for the year 2100.

Model Historical RCP4.5 RCP8.5
ACCESS1.0 x x x
BCC-CSM1.1 x x x
BNU-ESM x x x
CanESM2 x x x
CCSM4 x x x
CESM1(BGC) x x x
CNRM-CM5 x x x
CSIRO-Mk3.6.0 x x x
GFDL-CM3 x x x
GFDL-ESM2G x x x
GFDL-ESM2M x x x
INM-CM4 x x x
IPSL-CM5A-LR x x x
IPSL-CM5A-MR x x x
MIROC-ESM x x x
MIROC-ESM-CHEM x x x
MIROC5 x x x
MPI-ESM-LR x x x
MPI-ESM-MR x x x
MRI-CGCM3 x x x
NorESM1-M x x x

Support

Supported Browsers

This pilot version of the climate access tool supports Google Chrome, Apple Safari, and Mozilla Firefox.

Slack Community

Join the Planet OS Slack Community to discuss new features, request datasets, and ask for help from Planet OS team members and other data enthusiasts.

Contact Us

Feedback, support requests, and other inquiries may be directed to help@planetos.com.