SPIDR middleware for WDCs

System Architecture

[7] The SPIDR system architecture has the following main components: a web-portal, a metadata repository, visualization and data mining engines, and a grid of virtual data sources exposed to external clients, including the SPIDR portal, via data query and inject web services. Behind a data source's web service one can have a database, a set of files in a file system local to the server, or a set of URLs to remote data sources.

SPIDR Portals

[8] A web-portal serves as an agent between the user and the Grid of environmental data sources. It performs two main functions. The first function is metadata management, which allows for fast and efficient catalog-level metadata search. Here by catalog-level metadata we mean general descriptions of data resources (owner information, geographic coverage, time coverage, data description, visualization methods, etc.), stored as a managed collection of XML documents with known XML schemas. Our catalog-level metadata collection works much the same as other similar resources, e.g., the Global Change Master Directory (GCMD) from NASA (http://gcmd.nasa.gov).

Figure 1

[9] The second function of the web-portal is data access. In Figure 1 the web-portal is shown as a client, which connects to virtual data sources, retrieves the requested data, and delivers it back to the user. Advanced web-portal functions can include visualization and data mining. Data access web forms are built using inventory (or granule-level) metadata describing the availability of stations, satellites, instruments, and parameters or channels for the given time interval. The inventory metadata can also be used to compare and synchronize mirrored data sources.

[10] The SPIDR portal combines a central metadata registry with a set of distributed data web services, web map services, and replica sets of data files. A user can search catalog-level metadata and inventory, use a persistent data basket to save the selection between sessions, and plot or download the selected data in different formats, including XML and NetCDF. A database administrator can upload files into the SPIDR databases using either a web service or a web portal interface.

Metadata Registry and Data Inventory

[11] Both the catalog- and granule-level metadata records, which contain respectively a general description and detailed inventory of SPIDR data resources, can be updated either manually by a system administrator or automatically by the data robot collecting records from the data grid (see Figure 1). The catalog-level metadata registry uses a native XML database backend based on the open-source product eXist [Meier, 2006]. The metadata engine has no predefined XML schema; it is possible to have different metadata schemas for different data categories. For example, data sources with spatial content, such as OpenGIS Web Map Services (http://www.opengeospatial.org/standards/wms) and time series databases with ground observations, can use the FGDC Content Standard for Digital Geospatial Metadata (document FGDC-STD-001-1998, http://www.fgdc.gov/standards/projects/FGDC-standards-projects/metadata/base-metadata/index_html), and at the same time databases with satellite telemetry can have SPASE-formatted metadata records (Space Physics Archive Search and Extract, 2006, http://www.spase-group.org/data/). The SPIDR high-level metadata engine has extended search capabilities that allow searching within specific metadata elements (keyword, title, provider, etc.). In addition, it supports Web 2.0 style functionality with direct editing of the metadata records at the SPIDR portal, a user discussion forum, internal e-mail messaging, and wiki-style documentation and system help.

[12] The SPIDR granule-level inventory metadata registry uses an SQL database backend built on the open-source product MySQL (http://dev.mysql.com/doc/refman/5.0/en/index.html, 2008). The main purpose of the inventory is to list available parameters and stations from each database with some granularity in time, currently taken as monthly; that is, the inventory records whether a given station has any data for a given month. This information is needed for early validation of data requests, both for availability and for the size of the data export, and for comparison of data holdings at different SPIDR nodes for database synchronization. When new data is added to SPIDR, the inventory can be updated either in real time or by periodic queries of the corresponding data source, depending on the input data load. Whenever the inventory metadata is updated, an inventory summary, such as the station and parameter list with maximum date ranges, is passed up to update the corresponding catalog-level metadata.
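
The monthly inventory described above can be illustrated with a minimal in-memory sketch. It is not the SPIDR MySQL schema: the class name `MonthlyInventory`, its methods, and the station and parameter codes in the usage note are hypothetical, chosen only to show how a per-month availability record supports request validation, comparison of two nodes, and the date-range summary.

```python
from datetime import date
from typing import Iterator, List, Optional, Set, Tuple

YearMonth = Tuple[int, int]


def months_between(start: date, end: date) -> Iterator[YearMonth]:
    """Yield (year, month) pairs covering start..end inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield (year, month)
        month += 1
        if month > 12:
            year, month = year + 1, 1


class MonthlyInventory:
    """Hypothetical stand-in for an inventory table with monthly granularity."""

    def __init__(self) -> None:
        # Each entry means: this station has some data for this parameter
        # in this month. Entries are (station, parameter, year, month).
        self._entries: Set[Tuple[str, str, int, int]] = set()

    def mark(self, station: str, parameter: str, day: date) -> None:
        """Record that data exists for the month containing `day`."""
        self._entries.add((station, parameter, day.year, day.month))

    def available_months(self, station: str, parameter: str,
                         start: date, end: date) -> List[YearMonth]:
        """Early validation: which requested months actually hold data?"""
        return [ym for ym in months_between(start, end)
                if (station, parameter, *ym) in self._entries]

    def missing_from(self, other: "MonthlyInventory") -> Set[Tuple[str, str, int, int]]:
        """Entries held by `other` but not here (candidates for synchronization)."""
        return other._entries - self._entries

    def summary(self, station: str, parameter: str) -> Optional[Tuple[YearMonth, YearMonth]]:
        """Maximum date range, as fed to the catalog-level metadata."""
        months = sorted((y, m) for s, p, y, m in self._entries
                        if s == station and p == parameter)
        return (months[0], months[-1]) if months else None
```

A request for a station over several months would first call `available_months` and proceed to the data source only if the list is non-empty; two nodes can compare holdings with `missing_from` before deciding what to transfer.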

Grid of Web-services

Figure 2

[13] Web Services (WS) technology is used by SPIDR to access databases and metadata both for the SPIDR web application (interactive interface for human users) and for the SPIDR web clients (third-party programs exporting and importing data and metadata in batch mode). In addition to the WS SOAP protocol, the SPIDR web application can access databases directly using JDBC drivers. We call the JDBC-connected databases "local" and the WS-connected databases "remote" (Figure 2). The access mode is defined in the database configuration files. If a database is hosted on a server on the same LAN as the SPIDR web application, then the local access mode may be more efficient than the remote one; but if the database is located outside the local network, then JDBC connections will most probably be blocked for security reasons, and the SOAP protocol becomes the only reliable way to access the data.

[14] SPIDR data archives are logically organized into thematic groups called viewGroups (e.g., Geomagnetic Indices or Interplanetary Magnetic Field). Each viewGroup may include several databases or groups of database tables, which we call tables. Each table may be considered a virtual database with a single configuration file describing the access mode (local or remote) together with the URL and access credentials.
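
As an illustration of the per-table configuration, the sketch below parses a made-up `key = value` file and chooses between the local (JDBC) and remote (SOAP) access modes. The file syntax, the key names, and the example.org addresses are assumptions for this sketch only; the actual SPIDR configuration format is not described here.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class TableConfig:
    """One virtual database ('table') inside a viewGroup."""
    view_group: str
    table: str
    access_mode: str              # "local" (JDBC) or "remote" (SOAP web service)
    url: str
    user: Optional[str] = None
    password: Optional[str] = None


def parse_table_config(text: str) -> TableConfig:
    """Parse hypothetical 'key = value' lines; '#' starts a comment line."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so URLs containing '=' survive intact.
        key, _, value = line.partition("=")
        fields[key.strip()] = value.strip()
    mode = fields.get("access_mode", "")
    if mode not in ("local", "remote"):
        raise ValueError(f"unknown access mode: {mode!r}")
    return TableConfig(
        view_group=fields["view_group"],
        table=fields["table"],
        access_mode=mode,
        url=fields["url"],
        user=fields.get("user"),
        password=fields.get("password"),
    )


def describe_connection(cfg: TableConfig) -> str:
    """Report which access path the configuration selects."""
    if cfg.access_mode == "local":
        return f"JDBC connection to {cfg.url}"
    return f"SOAP web service at {cfg.url}"
```

With this layout, moving a table from a database on the local network to a remote node would only require changing `access_mode` and `url` in its configuration file.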

[15] The system currently implements web services for the following use cases:

Get a metadata record for a given viewGroup (Virtual Observatory WS).
For a given table, element, station and date interval get a data inventory and export data values in a variety of scientific formats, including XML and NetCDF (Data Source WS).
Load several "standard" data files of several scientific formats into the database (Data Sink WS).
Synchronize two SPIDR archives by exporting data from one archive and loading into another (WS orchestration).

Data Source Web-service.

[16] Data source web service URLs are stored in SPIDR configuration files. When a web service call is performed, the web service returns a URL pointing to a data file that contains the serialized CDM object with the requested data. The SPIDR application itself can act as a web service and process remote calls, thus allowing data export web services to be chained.

[17] In any case, a SPIDR data source service will supply metadata describing parameter names, units of measure, visualization options, etc., and the data accreditation describing the data origin. Data serialization formats include direct Java object serialization, XML, NetCDF, and, for some databases, special formats introduced by the data user community. For example, geomagnetic field variations can be exported in WDC or Intermagnet formats. For geomagnetic variations, the data accreditation describes the observatory that provided the data to SPIDR.

Figure 3

[18] With the SPIDR portal, a user can collect the serialized data from the distributed data sources into a single "user basket", re-format and re-package all the data for download, or visualize the selection with multiple time-synchronous plots, either as static GIF images or as dynamic "zoomable" Java applets which share the same time scale limits. In Figure 3 we present an example plot of selections from two databases with the planetary geomagnetic disturbance index Kp and the Disturbance Storm Time index DST. Both indices are prime indicators of a magnetic storm, and the simultaneous plots help to estimate the storm intensity.

[19] Data query options (time interval, data source, parameters, stations) are saved in the user basket, so the data can be re-selected in the future. Because of the real-time nature of the SPIDR databases, the data selection itself is transient, so in the next session a user may find different (updated) observations in the data basket. All the data selection queries are logged, so the SPIDR administrator can view not only user session statistics, but also the frequency of data requests by source.
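
The distinction between persistent query options and transient data can be sketched as follows. The names `Query`, `UserBasket`, and `resolve` are hypothetical, and the `fetch` callable stands in for whatever data source call the portal makes; the sketch only shows that the basket stores queries, re-executes them on demand, and counts requests per source.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Query:
    """Saved query options; no data values are stored in the basket."""
    source: str
    parameter: str
    station: str
    start: date
    end: date


class UserBasket:
    """Hypothetical persistent basket holding queries between sessions."""

    def __init__(self) -> None:
        self.queries: List[Query] = []

    def add(self, query: Query) -> None:
        # Identical selections are kept only once.
        if query not in self.queries:
            self.queries.append(query)

    def resolve(self, fetch: Callable[[Query], List[float]],
                request_log: Counter) -> Dict[Query, List[float]]:
        """Re-run every saved query against the live data source."""
        results: Dict[Query, List[float]] = {}
        for query in self.queries:
            request_log[query.source] += 1   # per-source request frequency
            results[query] = fetch(query)
        return results
```

Calling `resolve` in two different sessions can return different values for the same saved query if the underlying database was updated in between, which matches the transient behaviour described above.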

Satellite Data Granules and Image Archive Web-services.

[20] Remote sensing and imagery databases have a different data model from that of a sequential database. Usually the data collection is divided into "elementary" blocks called granules. A granule can be a daily set of solar images from different observatories, or a fixed-length section of a satellite orbit with Earth observations in different spectral bands.

Figure 4

[21] For example, the magnetic storm shown by the time series plots in Figure 4 is manifested in the daily solar image granule by a bright solar flare in the 164 MHz radiotelescope image. The flare erupts from a large system of sunspots visible in the solar X-ray and magnetogram images. At the same time, the aurora produced by this magnetic storm on the night side of the Earth can be seen in the cloud-free night-time image granules, each covering 1/8th of the DMSP satellite orbit; one of them is shown in Figure 5.

Figure 5

[22] All the granule-based web-services in SPIDR have the same design pattern. The user's data export request specifies the date range and type of the image. The web-service returns a list of granules with metadata and links to the preview and high-resolution images or binary files for granule data products like DMSP satellite SSJ/4 sensor readings.
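
The common request and response pattern can be sketched as a filter over a granule catalog. The `Granule` fields, the `list_granules` function, and the example.org links in the usage note are illustrative assumptions, not the SPIDR service interface.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List


@dataclass(frozen=True)
class Granule:
    """Granule metadata plus links, as a granule listing might return it."""
    image_type: str
    day: date
    preview_url: str
    full_url: str


def list_granules(catalog: Iterable[Granule], image_type: str,
                  start: date, end: date) -> List[Granule]:
    """Return granules of one type within [start, end], ordered by date."""
    matches = [g for g in catalog
               if g.image_type == image_type and start <= g.day <= end]
    return sorted(matches, key=lambda g: g.day)
```

A client would pass the date range and image type from the export request and receive the list of matching granules, then follow `preview_url` or `full_url` to retrieve the image or binary product itself.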

Data Sink Web-service.

[23] Clients can load data into SPIDR databases from files located on a local workstation, along with relevant loading options which are passed to the SPIDR web services together with the data over SOAP with attachments. The database loading web service called by the client will parse the input file format, load data into the local database, add a bookkeeping record into the SPIDR data input/output logging database, and check the list of mirrored SPIDR nodes to send the input data file there to keep those databases in sync.
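
The four steps performed by the loading service (parse, load, log, forward to mirrors) can be sketched as below. Everything here is a simplified assumption: the comma-separated `station,date,value` input, the `SinkNode` class, and in particular the `forward` flag, which is introduced in this sketch to stop mirrors from re-sending a file back and forth and is not taken from the SPIDR design.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, str, float]


def parse_lines(text: str) -> List[Record]:
    """Parse a made-up 'station,date,value' format, one record per line."""
    records: List[Record] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        station, day, value = line.split(",")
        records.append((station.strip(), day.strip(), float(value)))
    return records


class SinkNode:
    """Hypothetical node holding a database, an I/O log, and a mirror list."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.database: Dict[Tuple[str, str], float] = {}
        self.log: List[str] = []
        self.mirrors: List["SinkNode"] = []

    def load(self, text: str, forward: bool = True) -> int:
        records = parse_lines(text)                        # 1. parse the input file
        for station, day, value in records:                # 2. load into the local database
            self.database[(station, day)] = value
        self.log.append(f"loaded {len(records)} records")  # 3. bookkeeping record
        if forward:                                        # 4. send the file to mirrors
            for mirror in self.mirrors:
                mirror.load(text, forward=False)           # assumption: mirrors do not re-forward
        return len(records)
```

In this sketch a file loaded at one node reaches each mirror exactly once, so two nodes that list each other as mirrors end with identical contents and one log record each.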