Data Integration-Intensive

According to the IRI Architecture Blueprint Activity [B4], data integration-intensive patterns are characterized by a need to perform analysis of data combined from multiple sources, which can include data from multiple sites, experiments, and/or simulations. This can also include tracking metadata and provenance for reproducible science and interactive data analysis, possibly at scale.

The Data Integration-Intensive Pattern Group of the IRI Architecture Blueprint Activity recognized at least two broad pattern areas within this class:

  • Integration of data from simulations and experiments/observations to generate new insight and subsequent direct actions.

  • Cross-site data-driven discovery, which includes using similar, multimodal, or heterogeneous data already generated at different facilities; running the same tool (e.g., simulation software) on different systems; or using experimental/observational data originating at different sources. In each case, the results must be combined, processed, and analyzed.

The Data Integration-Intensive Patterns Group also identified some gaps and opportunities:

  • Gaps: Cross-facility API for resource co-operations; common/appropriate resource allocation models; standard abstracted workflow and automation tools; complex-wide data storage and searching capabilities; new models for “wide-area” cybersecurity; common or well-understood data policies; lack of FAIR data; user-focused user experience; lack of portable code; and cross-training of staff (scientific, engineering, support, and administrative).

  • Opportunities: Many early-win science opportunities exist for this pattern; common API for facilities; standards for metadata; streaming data to/from compute and storage facilities; common and well-understood data policies; support for FAIR data; and templates for portable code.

Relationship to INTERSECT Science Use Case Design Patterns

The Data Integration-Intensive IRI Pattern is related to all INTERSECT patterns in terms of the data generated by experiments and the corresponding need for metadata tracking and data provenance to enable reproducibility and insight through analysis.

This IRI Pattern is more specifically related to the INTERSECT Experiment Steering strategic pattern and its Local Experiment Steering and Distributed Experiment Steering architectural patterns, as experiment data analysis is performed in the feedback loop during the experiment. This data may come from different sources, such as different types of temperature sensors for a laser that is depositing metal in a 3D manufacturing process. It may even include data from a computational simulation, such as a structural stress simulation that takes past and current temperature sensor readings into account.
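The steering feedback loop described above can be illustrated with a minimal Python sketch. It assumes two hypothetical sensor sources ("pyrometer" and "thermocouple"), a temperature limit supplied by a simulation, and invented action names ("reduce_power", "increase_power", "hold"); none of these names come from INTERSECT or IRI, and a real steering service would use its own data model and control logic.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Iterable, Optional


@dataclass(frozen=True)
class Reading:
    source: str           # hypothetical source label, e.g. "pyrometer"
    timestamp_s: float    # acquisition time in seconds
    temperature_c: float  # measured temperature in degrees Celsius


def fuse_temperature(readings: Iterable[Reading],
                     window_start_s: float,
                     window_end_s: float) -> Optional[float]:
    """Average all readings, from any source, inside [start, end)."""
    in_window = [r.temperature_c for r in readings
                 if window_start_s <= r.timestamp_s < window_end_s]
    return fmean(in_window) if in_window else None


def steering_action(fused_temp_c: Optional[float],
                    simulated_limit_c: float,
                    margin_c: float = 10.0) -> str:
    """Pick an (assumed) action by comparing measurement and simulation."""
    if fused_temp_c is None:
        return "hold"  # no data in the window: change nothing
    if fused_temp_c > simulated_limit_c:
        return "reduce_power"
    if fused_temp_c < simulated_limit_c - margin_c:
        return "increase_power"
    return "hold"


readings = [
    Reading("pyrometer", 0.1, 1500.0),
    Reading("thermocouple", 0.2, 1520.0),
    Reading("pyrometer", 1.5, 1600.0),  # outside the first window
]
fused = fuse_temperature(readings, 0.0, 1.0)
action = steering_action(fused, simulated_limit_c=1500.0)
```

Here the simulation contributes only a single limit value; in practice it could just as well consume the fused readings and return a richer prediction.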

It is also related in a special way to the INTERSECT Design of Experiments strategic pattern and its Local Design of Experiments and Distributed Design of Experiments architectural patterns, as experiment data analysis is performed in the feedback loop between subsequent experiments to design them. The involved data may come from one or more preceding experiments and from prior analysis of such data. For example, a series of experiments is conducted to find a chemical compound that fits certain characteristics. Before each experiment, the results of prior experiments are analyzed to find the design point for the next experiment. The analysis may include Bayesian design of experiments and/or domain-science-informed AI.
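The loop between subsequent experiments can be sketched as follows. This is a deliberately simple stand-in, not Bayesian design of experiments: it uses a nearest-neighbour guess in place of a statistical surrogate model. The function names, the one-dimensional design space, and the `run_experiment` callable are all assumptions made for illustration.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple


def next_design_point(results: Dict[float, float],
                      candidates: Sequence[float],
                      target: float) -> Optional[float]:
    """Choose the next untested candidate from prior results.

    Each untested candidate is scored by how far the measured value of
    its nearest tested neighbour is from the target; ties are broken by
    preferring the candidate closest to an already tested point.
    """
    untested = [c for c in candidates if c not in results]
    if not untested:
        return None
    if not results:
        return untested[0]  # nothing to analyze yet: start anywhere

    def score(c: float) -> Tuple[float, float]:
        nearest = min(results, key=lambda x: abs(x - c))
        return (abs(results[nearest] - target), abs(nearest - c))

    return min(untested, key=score)


def design_loop(run_experiment: Callable[[float], float],
                candidates: Sequence[float],
                target: float,
                budget: int) -> Tuple[Optional[float], Dict[float, float]]:
    """Alternate between analyzing prior results and running an experiment."""
    results: Dict[float, float] = {}
    for _ in range(budget):
        x = next_design_point(results, candidates, target)
        if x is None:
            break  # every candidate has been tested
        results[x] = run_experiment(x)
    if not results:
        return None, results
    best = min(results, key=lambda x: abs(results[x] - target))
    return best, results


# Hypothetical experiment: the measured property is the square of the setting.
best, history = design_loop(lambda x: x * x, [0, 1, 2, 3, 4],
                            target=9, budget=4)
```

Replacing `next_design_point` with an acquisition function over a probabilistic model would turn the same loop into a Bayesian design of experiments.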

This pattern is also related in a particular way to the INTERSECT Multi-Experiment Workflow strategic pattern and its Local Multi-Experiment Workflow and Distributed Multi-Experiment Workflow architectural patterns, as data from multiple experiments that may depend on each other is analyzed. This may include combining and analyzing data from multiple experiments to inform a subsequent experiment in a complex workflow. It also may include combining and analyzing data from multiple experiments to produce the overall result of the workflow.
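As a rough illustration of combining data from multiple experiments while keeping provenance, the following Python sketch merges measurements by sample identifier and records which experiments contributed to each combined value. The record layout, the experiment identifiers, and the use of a plain mean are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def combine_experiments(
    datasets: Dict[str, List[Tuple[str, float]]]
) -> Dict[str, dict]:
    """Merge (sample_id, value) records from several experiments.

    Returns, per sample, the mean of all contributing values and the
    sorted list of experiment ids they came from (the provenance).
    """
    merged = defaultdict(lambda: {"values": [], "provenance": []})
    for experiment_id, records in datasets.items():
        for sample_id, value in records:
            merged[sample_id]["values"].append(value)
            merged[sample_id]["provenance"].append(experiment_id)
    return {
        sample_id: {
            "mean": sum(entry["values"]) / len(entry["values"]),
            "provenance": sorted(entry["provenance"]),
        }
        for sample_id, entry in merged.items()
    }


# Hypothetical data from two experiments in one workflow.
combined = combine_experiments({
    "exp-A": [("s1", 2.0), ("s2", 4.0)],
    "exp-B": [("s1", 4.0)],
})
```

The combined result could feed either a subsequent experiment in the workflow or the workflow's overall result; in both cases the provenance list lets a later analysis trace each value back to its source experiments.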