A universal and multi-dimensional model for analytical data on geological samples

To promote the sharing and reutilization of geoanalytical data, various geoanalytical databases have been established over the last 30 years. Data models, which form the core of a database, are themselves the subjects of intensive studies. Data models determine the contents stored in the databases and applications of the databases. However, most geoanalytical data models have been designed for specific geological applications, which has led to strong heterogeneity between databases. It is therefore difficult for researchers to communicate and integrate geoanalytical data between databases. In particular, every time a new database is constructed, the time-consuming process of redesigning a data model significantly increases the development cycle. This study introduces a new data model that is universally applicable and highly efficient. The data model is applied to various geoanalytical methods and corresponding applications, and comprehensive analytical data contents together with associated background metadata are summarized and catalogued. Universal data


Introduction
Geoanalytical data include measurements of major and trace elements, rare earth elements (REEs), isotopes, and the structure and morphology of geological samples, analyzed by various analytical instruments and techniques such as ICP (inductively coupled plasma), LA-ICP-MS (laser ablation inductively coupled plasma mass spectrometry), ICP-MS (inductively coupled plasma mass spectrometry), EPMA (electron probe microanalyzer), SIMS (secondary ion mass spectrometry), SEM (scanning electron microscopy), TEM (transmission electron microscopy), XRF (X-ray fluorescence), and XRD (X-ray diffraction). Geoanalytical data effectively reflect the material composition, internal structure, external characteristics, interaction, and evolution history of the Earth and represent the most important support for geological researchers in their aim to understand the Earth and exploit its resources for the survival and development of human society. Enormous financial, material, and human resources have been invested in the geological surveys and geoanalysis required to acquire more comprehensive and abundant geoanalytical data. Over time, tremendous volumes of geoanalytical data have been created, and these volumes continue to increase at a high rate. It is of paramount importance that these data are curated effectively and that adequate background information, such as sample description, sampling information, and analysis information, is included, so that geological researchers can utilize the data according to their requirements. This will also facilitate the reutilization of the precious geoanalytical data. In addition, with the accumulation of large volumes of data, statistical analysis and data mining can be conducted on these data to provide a more comprehensive and advanced scientific understanding of the Earth. Hence, a variety of geoanalytical databases aimed at managing, sharing, and reutilizing geoanalytical information have been constructed and used as advanced tools in geological studies.
The analysis and comparison of existing geoanalytical data models, as well as the development of improved models, are therefore worthwhile and significant subjects of study. Over the last decades, several studies of geoanalytical data models have been conducted. As early as 1977, George Van Trump and colleagues described a data model for environmental geochemical surveying and mineral resource exploration in the United States of America (Jr and Miesch, 1977). Lehnert et al. (2000) suggested a data model for the storage of global geochemical data of rocks.
Their data model provides a complete summary of essential geochemical data contents and a robust structure with a relational database management system (RDBMS). Numerous databases such as GEOROC (Geochemistry of Rocks of the Oceans and Continents), NAVDAT (the North American Volcanic and Intrusive Rock Database), and PetDB (the interactive web-based Petrological Database of the Ocean Floor) have since been constructed based on this model, and it is used by geological researchers worldwide. In particular, PetDB has been used for a considerable amount of high-impact research published in journals such as Nature (Brandl et al., 2013; Carbotte et al., 2013; Cheng et al., 2016; Dick and Zhou, 2014; Helo et al., 2011; Hoernle et al., 2011; Kamenov et al., 2011; Kelley, 2014; Samuel and King, 2014; Schlindwein and Schmid, 2016; Straub et al., 2009) and Science (Cottrell and Kelley, 2013; Greber et al., 2017; Joy et al., 2012; Kelley and Cottrell, 2009; Mcnutt et al., 2016). A limitation of existing geoanalytical data models is their specificity to particular applications or geological domains and their focus on the description and curation of only a certain portion of geoanalytical data. For example, RU_CAGeochem is specifically focused on major and trace element concentrations and Sr, Nd, and Pb isotopic ratios of Central American volcanic rocks (Carr et al., 2014). Another database is focused on lead isotopes of copper ores from the southeastern Alps (Artioli et al., 2016). Many other examples of similarly specific geoanalytical databases and associated models exist (e.g., Artioli et al., 2016; Hellström, 2016; Lopes et al., 2014; Siegel et al., 2012; Strong et al., 2016). The consequence of this development is that each database exists as a separate island, and it is difficult for researchers to communicate and integrate geoanalytical data between databases. In particular, every time a database is constructed, a data model has to be redesigned. This consumes considerable amounts of time and prolongs the development cycle. In addition, the vast majority of models are designed based on relational models, which focus on the construction of relations between different data categories. When users query and utilize the geoanalytical data from different dimensions, these types of models rely on complicated joins between different tables to retrieve the target data, which decreases efficiency as the amount of data increases. However, the exploration of such data models, including the background items, has laid a solid foundation for later study of advanced geoanalytical data models. At present, the development of various new techniques provides us with the opportunity to design more comprehensive and advanced geoanalytical data models. In this study, we introduce a novel, universal, and efficient geoanalytical data model. First, we provide an overview of geoanalytical methods and applications to summarize the geoanalytical data available. Then, we design universal data attributes based on these data and develop a multi-dimensional data model. Finally, we evaluate the model to validate its efficiency.

Overview of geoanalytical data contents
In recent years, many new geoanalytical methods and instruments have been developed, creating novel kinds of data (Linge et al., 2017). A truly universal data model should have the ability to accommodate all kinds of geoanalytical data. In addition, the data model should be capable of making all stored data readily available for reutilization by geological researchers. In order to develop a model with such capabilities, a comprehensive set of geoanalytical data, together with related background information required for reutilization of the data, was summarized and categorized, as outlined below. First, analytical techniques and their applications were studied to comprehensively summarize geoanalytical measurement data. This process is outlined in Fig. 1. Because of the great diversity of analytical methods and geological applications, Fig. 1 only shows a few examples to indicate the method adopted in this paper. The five categories (namely, bulk analysis; microanalysis; isotope analysis; morphology, structure, and valence analysis; and organic analysis) were divided according to the analytical technique used. Within each category, data were then categorized according to analytical instruments (e.g., SEM, SNM, and EPMA for microanalysis). In the next step, the data were grouped according to geological applications. The comprehensive list of geoanalytical measurement data items used in the present study, compiled from a thorough literature review, is presented in Fig. 2.
In the case of bulk analysis, most measurements ultimately provide major, trace, and ultratrace element concentration data. Microanalysis can yield data of elemental concentrations in a microregion, as well as structural information of geological samples acquired by secondary electron and backscattered electron techniques, commonly stored as image files. For geochronology and stable isotopic analysis (GSI analysis), most measurement data are isotopic ratios. For morphology, structure, and valence analysis (MSV analysis), the most common measurement data are image files such as X-ray photoelectron spectroscopy (XPS) spectra or XRD patterns. Organic analysis is a new analytical method which is used for the analysis of environmental geological samples. The most common application of this method in the geological literature is the analysis of the 16 kinds of polycyclic aromatic hydrocarbons (PAHs) in soils.
Background information describing the analyzed samples and data quality has to be incorporated, because it is indispensable for proper evaluation, efficient recovery, and sorting of the compiled data. Hence, background metadata are summarized based on the investigations of the geological researchers and the contents of existing databases (Adcock et al., 2003; Lehnert et al., 2000). Table 1 lists details of the background metadata used during the present study. In this study, the background metadata are divided into three parts: sample metadata provide geological researchers with information about geological materials, sampling metadata provide information about environmental conditions in the field, and quality metadata allow geological researchers to make an assessment of data quality and usability (Table 1). The background metadata items listed in Table 1 are the most essential information required for every kind of geoanalytical measurement data. More specific attributes are not included in our model.

Geoanalytical data modeling
This section outlines how the novel geoanalytical data model was designed, utilizing the data summarized above. Despite its limitations, the relational data model is currently the most commonly used pattern for geoanalytical data models. The relational data model constructs relations between each group of data within the database. This means that more data categories inevitably lead to many more data relations, increasing storage demands and the time required to query the database. Compared to such conventional relational data models (Beynon-Davies, 2004), multi-dimensional models (MDMs), which are widely utilized in big data science and data mining, are single-subject-oriented sources for analyzing data based on various dimensions (Niemi and Hirvonen, 2003). Multi-dimensional modeling approaches share characteristics with fast analysis of shared multi-dimensional information (FASMI). In particular, MDM offers the advantage of a relatively simple and straightforward database design, which nevertheless supports powerful analyses and is relatively well understood by end users (Hoberman, 2005). As a modeling framework, MDM has a conceptual and a logical phase of design, and the resulting model is composed of a fact table and several dimension tables (Höpken et al., 2013). Facts comprise numeric and additive characteristics of the data, which can be accumulated along multiple dimensions. Frequently, researchers are interested in analyzing geoanalytical measurement data from different metadata perspectives. Hence, the MDM approach is ideally suited for the design of geoanalytical data models. Here, the geoanalytical data are the fact data, and the other background information is the dimension data. The MDM modeling framework applied in the present study will allow geological researchers to rapidly analyze geoanalytical data based on numerous metadata criteria.
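The fact-and-dimension layout described above can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module in place of a full RDBMS, and all table names, column names, and sample values are hypothetical choices made for illustration; they are not taken from the paper's figures.

```python
import sqlite3

# Hypothetical star schema: one fact table holding measurement values and
# two dimension tables. Names and rows are illustrative assumptions only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_sample (sample_id INTEGER PRIMARY KEY, rock_type TEXT);
CREATE TABLE dim_method (method_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_measurement (
    sample_id INTEGER REFERENCES dim_sample(sample_id),
    method_id INTEGER REFERENCES dim_method(method_id),
    item      TEXT,   -- measured quantity, e.g. an oxide
    value     REAL    -- numeric fact that can be aggregated
);
""")
conn.executemany("INSERT INTO dim_sample VALUES (?, ?)",
                 [(1, "andesite"), (2, "basalt")])
conn.executemany("INSERT INTO dim_method VALUES (?, ?)",
                 [(1, "ICP-MS"), (2, "XRF")])
conn.executemany("INSERT INTO fact_measurement VALUES (?, ?, ?, ?)", [
    (1, 1, "SiO2", 58.9),
    (1, 2, "SiO2", 59.1),
    (2, 2, "SiO2", 49.5),
])


def mean_by_rock_type(item):
    """Aggregate one measured item along the sample dimension."""
    rows = conn.execute("""
        SELECT s.rock_type, AVG(f.value)
        FROM fact_measurement AS f
        JOIN dim_sample AS s ON s.sample_id = f.sample_id
        WHERE f.item = ?
        GROUP BY s.rock_type
        ORDER BY s.rock_type
    """, (item,)).fetchall()
    return dict(rows)


print(mean_by_rock_type("SiO2"))
```

Each analysis touches the fact table plus only the dimension tables it needs, which is the property the MDM approach relies on.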

Conceptual data model (CDM)
A conceptual data model (CDM) includes the definition of its universal attributes and a rough design of its structure. It represents the primary phase of data model design, independent from the detailed techniques of computer systems. Figure 3 presents the multi-dimensional CDM we developed for geoanalytical data. Here, with the abstraction of universal concepts present in geoanalytical data, the model becomes more flexible and universally applicable. The geoanalytical data are placed in the center of the model, in the form of a fact table. The associated background information is categorized and abstracted as various dimensions, which are represented by different axes in Fig. 3. The six dimensions of our CDM are sample, analysis type, analytical methods, location, time, and quality. This arrangement allows geological researchers to analyze geoanalytical data from six different dimensions or any combination thereof. The marks in each dimension represent the detailed measurement conditions. The "n" dimension is an expansible dimension, which can be added according to the specific model application.
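As a rough illustration of querying along any combination of dimensions, the following sketch models a fact as a value tagged with dimension coordinates. The class name `Fact`, the function `select`, the `extra_dimensions` parameter (standing in for the expansible "n" dimension), and the example records are all hypothetical.

```python
from dataclasses import dataclass, field

# The six dimensions named in the text, written as identifiers.
DIMENSIONS = ("sample", "analysis_type", "analytical_method",
              "location", "time", "quality")


@dataclass
class Fact:
    """One measurement value with its coordinates along the dimensions."""
    item: str
    value: float
    dims: dict = field(default_factory=dict)  # dimension name -> coordinate


def select(facts, extra_dimensions=(), **criteria):
    """Return facts matching every given dimension criterion.

    extra_dimensions is an assumed hook for application-specific
    dimensions beyond the six listed above.
    """
    allowed = set(DIMENSIONS) | set(extra_dimensions)
    unknown = set(criteria) - allowed
    if unknown:
        raise KeyError(f"unknown dimension(s): {sorted(unknown)}")
    return [f for f in facts
            if all(f.dims.get(name) == wanted
                   for name, wanted in criteria.items())]


# Illustrative records only.
facts = [
    Fact("SiO2", 58.9, {"sample": "andesite", "analytical_method": "ICP-MS",
                        "location": "Sycamore Hall"}),
    Fact("SiO2", 49.5, {"sample": "basalt", "analytical_method": "XRF",
                        "location": "Sycamore Hall"}),
]

by_location = select(facts, location="Sycamore Hall")
by_sample_and_method = select(facts, sample="andesite",
                              analytical_method="ICP-MS")
```

A single-dimension query and a combined query go through the same function, which mirrors the idea of slicing the fact data along one axis or several.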

Logical data model (LDM)
A logical data model (LDM) is a CDM written in unified modeling language (UML) (Evans et al., 2014). Logical model design leads to a logical scheme, defining objects, attributes, and relationships (Chmura and Heumann, 2005).
The LDM scheme can be easily implemented by any DBMS. Figure 4 shows the LDM scheme designed for geoanalytical data. Each box in the LDM represents an object, and items in the box are its attributes. The relations between objects are represented with lines. There are three kinds of symbols associated with the lines. The short line denotes "1", the circle denotes "0" (which means "maybe"), and the triangle denotes "many". Lines and symbols define the relations between objects. The additional notation foreign key (FK) is added if the attribute in one object uniquely identifies an attribute in another object. For example, the sample ID in the geoanalytical data object is a foreign key of "sampleid" in the sample object, because they have the same value. By means of this foreign key, the data contents of the two objects are connected. For each object, a few extended attributes are added (Extend_n in Fig. 4). This feature allows developers to add database-specific attributes to this model, increasing its flexibility and universal applicability.
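The foreign key relation and the extended attributes can be sketched as follows, again using SQLite through Python's sqlite3 module rather than a specific production DBMS. Only the name "sampleid" comes from the text; the other table and column names, including `extend_1`, are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE sample (
    sampleid  INTEGER PRIMARY KEY,
    rock_type TEXT,
    extend_1  TEXT              -- spare attribute for database-specific use
);
CREATE TABLE geoanalytical_data (
    dataid   INTEGER PRIMARY KEY,
    sampleid INTEGER NOT NULL REFERENCES sample(sampleid),  -- FK
    item     TEXT,
    value    REAL
);
""")
conn.execute("INSERT INTO sample (sampleid, rock_type) VALUES (1, 'andesite')")


def try_insert(sampleid, item, value):
    """Insert a measurement; return False if the sample does not exist."""
    try:
        conn.execute(
            "INSERT INTO geoanalytical_data (sampleid, item, value) "
            "VALUES (?, ?, ?)",
            (sampleid, item, value),
        )
        return True
    except sqlite3.IntegrityError:
        return False


ok = try_insert(1, "SiO2", 58.9)        # sample 1 exists
rejected = try_insert(99, "SiO2", 58.9)  # no sample 99, FK check fails
```

One sample row can be referenced by many measurement rows, which corresponds to the "1" to "many" relation drawn between the two objects.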

Implementation and evaluation
In order to evaluate the performance of our model, we carried out a comparison experiment with the widely used Lehnert rock geochemical data model (Lehnert et al., 2000). In order to conduct the experiment, a physical data model (PDM) needed to be created with a database management system. As RDBMS is the most common technique used in geoanalytical databases, MySQL, which is a widely used RDBMS, was adopted to implement the two models. A specific data item (rock type: andesite; location: Sycamore Hall; latitude: 36.27.12° N, longitude: 83.34.12° W; institution: Jilin University; method: ICP-MS; SiO2: 58.9; TiO2: 1.13) was used as test data, and tables related to the data contents were implemented. We analyzed the two models from two perspectives: developers and users. For developers, the comparison of the PDM structure is shown in Fig. 5, and query operation descriptions are presented in Fig. 6. The comparison clearly indicates that the geoanalytical data model is more succinct than the rock data model and saves time and computer resources. Three model performance indicators (insert time, retrieval time, and storage space usage) were evaluated with increasing amounts of data. The results are shown in Figs. 7, 8, and 9, respectively. Figure 7 shows clearly that the process of data insertion is considerably faster for the geoanalytical data model than for the rock data model. Figure 9 shows clearly that the storage space usage of the geoanalytical data model is lower than that of the rock data model. In the case of data query (Fig. 8), the difference in time consumption is even more striking. With an increasing amount of data items, the query time of the geoanalytical data model remains very fast and efficient. In contrast, for the rock data model, query time costs increased exponentially with the increasing amount of data items.
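A timing harness of the kind used in such a comparison might look like the sketch below. It is not the experiment reported here: it uses an in-memory SQLite database instead of MySQL, a hypothetical two-table schema, and synthetic coordinates, and it only shows how insert time and latitude/longitude query time can be measured for a growing number of rows.

```python
import sqlite3
import time


def benchmark(n_rows):
    """Time bulk inserts and one latitude/longitude query for n_rows rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE location (
        locationid INTEGER PRIMARY KEY, latitude REAL, longitude REAL);
    CREATE TABLE geoanalytical_data (
        dataid     INTEGER PRIMARY KEY,
        locationid INTEGER REFERENCES location(locationid),
        item TEXT, value REAL);
    """)

    start = time.perf_counter()
    # Synthetic coordinates on a whole-degree grid (not real sampling sites).
    conn.executemany(
        "INSERT INTO location VALUES (?, ?, ?)",
        ((i, float(i % 90), -float(i % 180)) for i in range(n_rows)))
    conn.executemany(
        "INSERT INTO geoanalytical_data (locationid, item, value) "
        "VALUES (?, ?, ?)",
        ((i, "SiO2", 58.9) for i in range(n_rows)))
    conn.commit()
    insert_s = time.perf_counter() - start

    start = time.perf_counter()
    hits = conn.execute("""
        SELECT COUNT(*)
        FROM geoanalytical_data AS d
        JOIN location AS l ON l.locationid = d.locationid
        WHERE l.latitude BETWEEN 30 AND 40
          AND l.longitude BETWEEN -40 AND -30
    """).fetchone()[0]
    query_s = time.perf_counter() - start

    conn.close()
    return {"rows": n_rows, "insert_s": insert_s,
            "query_s": query_s, "hits": hits}


for n in (1_000, 10_000):
    print(benchmark(n))
```

Running the same harness against two competing schemas, with the same row counts, would give curves comparable in spirit to those in Figs. 7 and 8; absolute timings depend on the machine and the database engine.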

Conclusions
The geoanalytical data model presented herein is flexible and appropriate for a broad range of applications to geoanalytical data. The model has the following general characteristics:
1. Its universality allows the model to accommodate any type of geoanalytical data for various geological materials, as well as all significant metadata.
2. The adoption of a multi-dimensional data model framework provides geological researchers with the ability to analyze geoanalytical data from different dimensions.
In addition to the sample description and location criteria commonly used in existing databases, this model provides four additional query criteria (method, quality, time, and analysis type).
3. There are minimal data relations between different objects. Relations between different background metadata objects have been avoided in order to construct robust relations between background metadata and measurement data. This increases the model's efficiency when geoanalytical data are inserted or queried while simultaneously decreasing its space usage.
It is hoped that the design of this model will allow for the unified construction of geoanalytical databases. The model enables the accumulation and integration of significant amounts of diverse geoanalytical data. By utilizing the big data analysis techniques described in our study, geological researchers could analyze geoanalytical data with high efficiency and develop novel methods to conduct Earth science studies.

Figure 1. Process of summarizing the geoanalytical data contents.

Figure 2. Lists of geoanalytical methods and measurement data items.

Figure 4. Logical data model (LDM) of the geoanalytical data model.

Figure 5. Physical data model (PDM) structure comparison with the Lehnert rock data model.

Figure 6. Comparison of insert and query operations in structured query language (SQL).

Figure 7. Time spent on data insert operations with increasing amounts of data.

Figure 8. Time requirement for data queries (latitude and longitude).

Figure 9. Space usage of the two data models with increasing amounts of data items.

Table 1. Background metadata of geoanalytical data.