Project Kickoff at NCSA

Image credit: Blue Waters, National Petascale Computing Facility

By David LeBauer; ARPA-E TERRA Reference Data and Computing Team

Attendees

Gabrielle Allen, Maxwell Burnette, Doug Fein, Chris Harbourt, Rob Kooper, Daniel Lapine, David LeBauer, Stephen Long, Yan Liu, Paul Miller, David Raila, Jay Roloff, Aiman Soliman, Rachel Shekar, Edward Seidel, John Towns, Kandace Turner

Agenda

  • 9:00-9:05 Welcome and introductions
  • 9:05-9:20 David LeBauer: Introduction to the TERRAref program
  • 9:20-9:50 Introductions to Collaborating NCSA Teams
    • Yan Liu: CyberGIS Center for Advanced Digital and Spatial Studies
    • Ed Seidel: National Data Service (NDS)
    • Rob Kooper: Innovative Software and Data Analysis (ISDA)
  • 9:50-10:05 Break
  • 10:05-10:25 Open forum
  • 10:25-10:55 Structured discussion
  • 10:55-11:00 Summary and Closing

Presentations

David LeBauer: Project motivation and overview

Rob Kooper: ISDA, BrownDog, and Clowder

Yan Liu: CyberGIS Center

Ed Seidel: Midwest Big Data Hub and the National Data Service

The Midwest Big Data Hub

Ed Seidel described the Midwest Big Data Hub, a project recently funded by NSF to coordinate academic, government, and industry users and creators of Big Data. He also described the National Data Service and the opportunity for TERRA Ref to participate in it.

One of the proposed 'spokes' of the Midwest Big Data Hub will be 'Digital Agriculture', and TERRA Ref is helping to coordinate this effort. We also support the National Agricultural Research Data Network for Harmonized Data (NARDN-HD) spoke of the South Big Data Hub, and PI Cheryl Porter is on our reference data products team to help us build NARDN-HD compatibility into the system.

The National Data Service

"The National Data Service (NDS) is an emerging vision for how scientists and researchers across all disciplines can find, reuse, and publish data. It builds on the data archiving and sharing efforts already underway within specific communities and links them together with a common set of tools." (NDS website)

The TERRA Ref project is one of the initial projects featured in the NDS Labs.

Discussion Topics

Data Access and Intellectual Property

Paul Miller from Agrible, a user-focused company that helps farmers optimize their agronomic practices, expressed interest in using both the reference data and the computing pipeline. Contributing to open-source development of the cyberinfrastructure is one option.

  • Q: Are there any constraints on the use of data?
  • A: The reference dataset and algorithms developed by TERRA Ref will be open-access, but the computing pipeline will be built to protect intellectual property of end users.
  • The platform will be designed for easy deployment on any OpenStack server, and components will function independently as well as within the overall pipeline.
  • TERRAref will need to set up different levels of open access while hosting all data and algorithms; Agrible needs to know the constraints on the data.

Data Volume

The funding agency ARPA-E has revised the system specifications, meaning that total data volume will be 10x larger than originally estimated: 10 PB instead of the 1 PB for which we have budgeted. There was a discussion about options for storing and managing this increased requirement.

The central resources used by the TERRA Ref team at NCSA will be ROGER, the campus cluster, the OpenStack-based Nebula cloud, and ADS. XSEDE and Blue Waters allocations may also be available.

  • XSEDE allocations must be renewed annually because future space and computing needs cannot be predicted in advance.
  • TERRAref should ensure that software is capable of handling large data volumes. This can be tested using synthetic data.
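As a minimal sketch of the synthetic-data testing idea above: generate files of a known size, run a stand-in processing step over them, and measure throughput. The function names (`make_synthetic_files`, `measure_throughput`, `checksum_file`) are illustrative assumptions, not part of the actual TERRA Ref pipeline; a real test would substitute genuine pipeline steps and sensor-scale file sizes.

```python
import os
import tempfile
import time

def make_synthetic_files(directory, count, size_bytes):
    """Write `count` files of `size_bytes` pseudo-random bytes each."""
    paths = []
    for i in range(count):
        path = os.path.join(directory, f"synthetic_{i:04d}.bin")
        with open(path, "wb") as f:
            f.write(os.urandom(size_bytes))
        paths.append(path)
    return paths

def measure_throughput(paths, process):
    """Time `process` over each file; return throughput in MB/s."""
    total_bytes = sum(os.path.getsize(p) for p in paths)
    start = time.perf_counter()
    for p in paths:
        process(p)
    elapsed = time.perf_counter() - start
    return total_bytes / elapsed / 1e6

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        files = make_synthetic_files(tmp, count=10, size_bytes=1_000_000)

        def checksum_file(path):
            # Stand-in for a real pipeline step (e.g. format conversion)
            with open(path, "rb") as f:
                return sum(f.read()) % 256

        rate = measure_throughput(files, checksum_file)
        print(f"processed {len(files)} files at {rate:.1f} MB/s")
```

Scaling the file count and sizes up toward the projected data volume would expose bottlenecks before real sensor data arrives.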

Follow up

  1. Apply for Blue Waters storage space.
  2. Address open questions:
    • What type of data access will we need to provide (when, how often, to whom)?
    • Will the data be stored long term or will it be active data?
    • What are the computing requirements? Will they require specific architectures or resources available through XSEDE or Blue Waters?
    • What data products will be kept in the long term? Raw or just products?
    • What will the annual access patterns be?
    • What is the time and space dimension of data that algorithms will use?
    • Will historical data only be accessible at limited times?
    • Will there be distributed data management?
    • What will happen with queries that are too large? How will we adapt to changing needs of end users?
  3. Collaborate with Digital Agriculture Spoke of Midwest Big Data Hub