SPIDR middleware for WDCs

Database Synchronization

Figure 6

[24] SPIDR databases are self-synchronizing (Figure 6). Synchronization works in both push and pull modes and is built on the data source and data sink web services. In push mode, when a new data set is successfully parsed and loaded into the database at one SPIDR node (the "master") through the data sink web service, every node subscribed to that data stream (the "slaves") receives the same data set, exported from the "master" node through its data source web service. Any SPIDR node can act as "master" or "slave", depending on whether it receives the data from an external source or from another node. Such peer-to-peer synchronization through the exchange of CDM objects over web services has many advantages for a heterogeneous distributed system, in which SPIDR nodes may run different operating systems and database engines and operate under different network security policies. For a high volume of short input messages, pull-mode synchronization can be used instead: the "slave" node periodically calls the "master" data source and receives, say, the last day of observations as a single data set.
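The two modes described above can be illustrated with a minimal in-memory sketch. It is not the SPIDR implementation: the `Node` class, its `data_source` and `data_sink` methods, the `from_peer` flag, and the `pull_last_day` helper are hypothetical names chosen for illustration, and plain method calls stand in for the web-service requests.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Node:
    """Hypothetical stand-in for a SPIDR node; not the real SPIDR API."""
    name: str
    # stream name -> {observation time -> value}
    store: dict = field(default_factory=dict)
    # stream name -> list of nodes subscribed to that stream
    subscribers: dict = field(default_factory=dict)

    def data_source(self, stream, since=None):
        """Export records of a stream, optionally only those at or after `since`."""
        records = self.store.get(stream, {})
        return {t: v for t, v in records.items() if since is None or t >= since}

    def data_sink(self, stream, records, from_peer=False):
        """Load records into the local store.

        Data arriving from an external source makes this node the "master"
        for that data set, so it pushes the set to its subscribers. Data
        arriving from another node (from_peer=True) is only stored.
        """
        self.store.setdefault(stream, {}).update(records)
        if from_peer or not records:
            return
        # Export the newly loaded set through the local data source ...
        exported = {t: v for t, v in self.data_source(stream).items() if t in records}
        # ... and deliver it to the data sink of every subscribed "slave".
        for peer in self.subscribers.get(stream, []):
            peer.data_sink(stream, exported, from_peer=True)


def pull_last_day(slave, master, stream, now):
    """Pull mode: the slave fetches the last day of data as a single set."""
    batch = master.data_source(stream, since=now - timedelta(days=1))
    slave.data_sink(stream, batch, from_peer=True)
    return len(batch)
```

In this sketch a subscriber is registered with `master.subscribers["kp"] = [slave]`, after which any externally loaded "kp" data set reaches the slave without further action. A node that is not subscribed can instead call `pull_last_day` on a schedule, which trades latency for fewer, larger transfers.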

[25] The SPIDR admin web interface provides tools to compare the same database across several nodes and, if necessary, to order background synchronization from or to any of them. Inventory-level metadata from the "master" and "slave" nodes can be used to compare data holdings; when they differ, a background process is started at the "slave" node, which pulls the locally missing data from the "master" node through its data source web service and loads it into SPIDR through the local data sink web service.
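The inventory comparison can be sketched as follows. This is an illustration under stated assumptions, not SPIDR code: the inventory is assumed to be a record count per stream and day, each store is a plain dictionary, and the function names `inventory`, `differing_days`, and `background_sync` are hypothetical. Direct dictionary access stands in for the data source and data sink web-service calls.

```python
from collections import Counter
from datetime import datetime


def inventory(store):
    """Inventory-level metadata (assumed form): record count per (stream, day)."""
    counts = Counter()
    for stream, records in store.items():
        for t in records:
            counts[(stream, t.date())] += 1
    return counts


def differing_days(master_inv, slave_inv):
    """(stream, day) pairs for which the slave holds fewer records than the master."""
    return sorted(k for k, n in master_inv.items() if slave_inv.get(k, 0) < n)


def background_sync(master_store, slave_store):
    """Pull the locally missing records for each differing day.

    Returns the number of records loaded into the slave store.
    """
    loaded = 0
    gaps = differing_days(inventory(master_store), inventory(slave_store))
    for stream, day in gaps:
        local = slave_store.setdefault(stream, {})
        # Stand-in for calling the master's data source for that day ...
        for t, value in master_store[stream].items():
            if t.date() == day and t not in local:
                # ... and loading the result through the local data sink.
                local[t] = value
                loaded += 1
    return loaded
```

Comparing counts rather than full record sets keeps the comparison cheap, at the cost of missing days where both nodes hold the same number of records but different values. A checksum per day would close that gap.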

[26] This web-services-based synchronization mechanism is a new step in automating data exchange between World Data Centers in different countries. The common data model used by all SPIDR nodes eliminates unnecessary format translations when synchronizing databases at different nodes. Peer-to-peer push synchronization aligns with agency priorities: data are first loaded into the national "master" node and then exported to a given list of subscribers abroad. The existence of several copies of the same database in a widely distributed network helps ensure long-term data preservation.