Benefits of using convolutional neural networks for seismic data quality analysis

Casale, Paolo; Pignatelli, Alessandro

doi:https://doi.org/10.5194/gi-2023-4

Preprints

08 May 2023

Status: this preprint was under review for the journal GI but the revision was not accepted.

Abstract. Seismic data represent an excellent source of information and can be used to investigate several phenomena such as earthquake nature, fault geometry, tomography, etc. These data are affected by several types of noise that are often grouped into two main classes: anthropogenic and environmental. Nevertheless, detecting instrumental noise or malfunctioning stations is also a relevant step for data quality control and for the efficiency of the seismic network. As we will show, visual inspection of seismic spectral diagrams allows us to detect problems that can compromise data quality, for example by invalidating subsequent calculations such as Magnitude or Peak Ground Acceleration (PGA). However, such visual inspection requires human experience (due to the complexity of the diagrams), time and effort, as there are too many stations to be checked. That’s why, in this paper, we have explored the possibility of “transferring” such human experience into an artificial intelligence system in order to perform such detection automatically and quickly. The results have been very encouraging, as the automatic system we have set up shows a detection accuracy of over 90 % on a set of 840 noise spectral diagrams obtained from seismic station records.

Received: 07 Apr 2023 – Discussion started: 08 May 2023

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this preprint. The responsibility to include appropriate place names lies with the authors.

Status: closed

RC1:
'Comment on gi-2023-4', Anonymous Referee #1, 31 May 2023
The paper discusses the applicability of AlexNet to seismic spectral plots to detect data quality problems. They combined existing diagrams from different sources into a new database containing 840 diagrams. By labeling (classifying) the diagrams using expert knowledge, they created a new training dataset for CNNs. In general, the approach was successful and the AlexNet achieved an accuracy of 90% on this dataset.
In my opinion, data quality is one of the most important issues in any domain. Therefore, the authors have addressed a very important problem. They show a possible solution to automate a pre-analysis to identify erroneous stations or stations that should be treated with caution.

1 Does the paper address relevant scientific questions within the scope of GI:
Yes. Data quality is important in any scientific field. In this case, the authors focus on seismic data, so this paper is relevant to GI.
2 Does the paper present novel concepts, ideas, tools, or data:
The authors have applied an existing and well-known concept to a new dataset. It is obvious that a CNN approach to classification will work. However, the authors noted that there are no large databases on which to train the neural networks (they “focused on the entire noise spectra to detect acceptable signals from anomalous ones”). As datasets are rare and very time consuming to create, this contribution is relevant to the scientific community if the data is made (“correctly”) publicly available (see 6).
3 Are substantial conclusions reached:
As stated above, it is obvious that the concept will work. The authors have proven this once again. Therefore, the paper does not provide any substantial new conclusions.
4 Are the scientific methods and assumptions valid and clearly outlined:
The methods are clear with strong weaknesses regarding “4. Machine learning general description”. The headline is not consistent with the following subsections. The heading should be: “4. Deep Learning” with 4.1 being: “4.1 General description”.

Some assumptions are mixed with results, such as the data splits in 5.1, 5.2, 5.3 and 5.4 and their explanations.
5 Are the results sufficient to support the interpretations and conclusions:
Partially. Comparison of the four tests is not possible due to changing test sets.
6 Is the description of experiments and calculations sufficiently complete and precise to allow their reproduction by fellow scientists (traceability of results):
No. The data provided are only some diagrams without labels. What is missing:
Source code

Correctly labelled data – from the given data, I cannot reproduce their approach. Please create 4 folders, each containing the labels and images used in the corresponding experiment.

The pre-trained checkpoint of the CNN

Furthermore, the images (which by themselves have no use) are published only on Google Drive, which is a bad decision. I suggest using an appropriate website to share your data, such as https://www.kaggle.com/.
7 Do the authors give proper credit to related work and clearly indicate their own new/original contribution:
Yes
8 Does the title clearly reflect the contents of the paper:
Yes. However, in my opinion, the title suggests that a CNN is the best approach, which seems logical. However, the paper only uses a 2D CNN, and a comparison with a 1D CNN is missing. I would at least expect a section discussing the advantages and disadvantages of a 1D and 2D approach.
9 Does the abstract provide a concise and complete summary:
Partially. In my opinion, the results of the large multiclass experiment need to be added.
10 Is the overall presentation well structured and clear:
The general presentation of the introduction and theory up to section 3 is OK. Sections 4 and 5 are not. The experiments are mixed with results and discussion, which makes it difficult to follow the idea. Regarding the general structure, I suggest following common headings such as Introduction, (materials/data and) Methods, Experiments, Results and Discussion, Conclusions. See also the suggestions in 4 and the general comments below.
11 Is the language fluent and precise:
No. Some sentences are difficult to read because they are unnecessarily nested. Proofreading is needed.

There are many typing errors such as:
Missing/additional whitespaces e.g.: line 66; line 86; line 112 and many more

Missing commas:
line 61 end of line to 62: […], and […]

line 62: […] in some of them, […]

line 128: […] trace statistics, [...]

line 145: […] (McNamara Boaz, 2006), […]

…

spelling: line 154 (inserted new line separates the t from interest); line 156

Please use cross-linking when referring to another section (example: line 363)

12 Are mathematical formulae, symbols, abbreviations, and units correctly defined and used:
Yes
13 Should any parts of the paper (text, formulae, figures, tables) be clarified, reduced, combined, or eliminated:
Please add a section: "Dataset" and move relevant parts to this section, as in my eyes this is an important contribution. This will make it easier to extract relevant information (see also general comments). I have major concerns with sections 4 and 5. See general comments.
14 Are the number and quality of references appropriate:
The references regarding section 4 are a little outdated in some cases or not present at all. Please add more and also recent references to section 4 as the field of deep learning is rapidly evolving. AlexNet should not be used.
15 Is the amount and quality of supplementary material appropriate:
Yes. To make it even more precise, I suggest adding a table with all four tests to make the comparison easier.
General comments:
The authors have used several data sets in the manuscript, which is confusing. I do not agree with the division into smaller datasets (of size 224, 447 and 840 plots) and one large dataset (1865 PDF spectral plots). It is obvious that more data will improve training accuracy. Removing uncertain diagrams from the training is not a good practice as results become not useful in real world applications and conclusions are not representative anymore. The authors could have just used the large dataset as it is the most general dataset in use.
Weighting strategies were not covered, for example, to address their imbalance problem faced in tests 1 and 2. Furthermore, the authors used different test sets to validate their approaches, making it impossible to compare results.
In the whole manuscript there is no information about hyperparameters such as batch size, epochs trained, duration of training and many more.
A very important topic to discuss is overfitting, which has not been considered at all. They did not provide any information about the training and validation loss, which is a common practice in applied deep learning. The lack of this information makes it very difficult to interpret the results.
Since reproducibility was not covered, the results will vary between two training runs. The authors should retrain the model several times and provide an average accuracy and standard deviation.
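The reporting suggested here can be sketched in a few lines. This is only an illustration: the five accuracy values are hypothetical placeholders, not results from the manuscript, and `summarize_runs` is a helper name introduced for this sketch.

```python
import statistics

def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) of per-run test accuracies."""
    mean = statistics.mean(accuracies)
    std = statistics.stdev(accuracies)  # sample std; needs at least two runs
    return mean, std

# Hypothetical accuracies from five retrainings with different random seeds.
run_accuracies = [0.90, 0.92, 0.88, 0.91, 0.89]
mean_acc, std_acc = summarize_runs(run_accuracies)
print(f"accuracy = {mean_acc:.3f} +/- {std_acc:.3f}")
```

Each value would come from evaluating one retrained model on the same fixed test set, so that the spread reflects training variability only.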
In conclusion: The experimental design is bad. Too many mistakes have been made.
There is no explanation as to why a 2D CNN approach was chosen in the first place. The authors claim that: “In our study we find that just images of data fulfil our goal” (line 336). Proof is lacking. In fact, they have shown that uncertain graphs are likely to be misclassified. A 1D CNN can be applied directly to the data without any loss of information due to encoding into an image. As the seismic information is stored in vectors, 1D CNNs are more efficient, easier to train and less complex. Since far fewer parameters are needed, the loss of information due to encoding the data into an image can be avoided and the only loss is introduced by subsampling the spectral data itself. Did the authors take this into account in their research? For example, the images could be used by the experts to classify the data, but the 1D data could be used for training. Why is it advantageous to use a 2D CNN? I suggest adding a section on this. I refer also to: https://doi.org/10.1016/j.ymssp.2020.107398.
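The parameter argument can be made concrete with a small count. The sketch below compares one 1D and one 2D convolutional layer with the same kernel width and channel numbers; the layer sizes are assumptions chosen for illustration and are not taken from the manuscript or from AlexNet, and the count says nothing about which approach classifies better.

```python
def conv_params(kernel_size, in_channels, out_channels, dims):
    """Trainable parameters of one convolutional layer with bias.

    dims=1 gives a 1D kernel of length kernel_size,
    dims=2 gives a square 2D kernel of kernel_size x kernel_size.
    """
    weights = (kernel_size ** dims) * in_channels * out_channels
    return weights + out_channels  # one bias per output channel

# Hypothetical layer: kernel width 3, 64 input and 64 output channels.
params_1d = conv_params(3, 64, 64, dims=1)
params_2d = conv_params(3, 64, 64, dims=2)
print(params_1d, params_2d)
```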
Did the authors consider data augmentation? I would like to see a section discussing the possibility of data augmentation in such a case as it can improve accuracy a lot. I refer to: https://doi.org/10.1088/1741-2552/ac4430.
Further comments:
Section 4:
Overall, this part is not well-structured and lacks some important parameters to reproduce the results. A suggestion for a new structure:

- CNNs (reduce the general description, remove ANNs and focus on the CNN),

- Training (a short overview, perhaps supported by an enumeration of important training steps),

- Metrics (accuracy, confusion matrix, F1 score (all results also class-based for multiclass problems), Precision, Recall) and

- Hyperparameters (training - validation - test split, batch size, epochs, loss function, optimizer, did the authors use dropout? did the authors use regularization techniques? did the authors use weighting to deal with class imbalances?)

Please also use F1 scores, as the training data is class-imbalanced, which will bias the result towards the most common class, or at least use some weighting during training. Accuracy alone can be misleading when the classes are imbalanced because a model that predicts the majority class most of the time could still achieve high accuracy while performing poorly on the minority class. The confusion matrix, on the other hand, provides a more detailed breakdown of predictions but does not directly incorporate the trade-off between precision and recall. I refer you to the tutorials of PyTorch or TensorFlow.
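A minimal worked example of why accuracy alone can mislead on imbalanced classes is given below. The 90/10 class counts and the always-"ok" classifier are hypothetical, and `binary_metrics` is a helper introduced only for this sketch.

```python
def binary_metrics(y_true, y_pred, positive):
    """Accuracy plus precision, recall and F1 for the class named `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Hypothetical test set: 90 "ok" diagrams and 10 "bad" ones.
y_true = ["ok"] * 90 + ["bad"] * 10
# A degenerate model that always predicts the majority class.
y_pred = ["ok"] * 100

acc, prec, rec, f1 = binary_metrics(y_true, y_pred, positive="bad")
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} F1={f1:.2f}")
```

The degenerate model reaches 90 % accuracy while detecting none of the "bad" diagrams, which the F1 score of the "bad" class exposes.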

AlexNet is very outdated and should not be used for such an important domain (see graph at: https://paperswithcode.com/sota/image-classification-on-imagenet). I refer to PyTorch, which provides pre-trained networks and their top-1 and top-5 accuracies on the ImageNet dataset: https://pytorch.org/vision/stable/models.html. It should be easy to replace your existing AlexNet with a newer network. Suggestion: EfficientNet V2, since it has fewer parameters and should be trainable on the used system. Please retrain on the final dataset.

Please revise some claims that are incomplete or wrong:
Line 262: “As well known, machine learning algorithms are divided into two groups: supervised methods and unsupervised ones.” – Reinforcement learning is missing. Please provide sources for all such claims in these sections.

Line 300 following: “the main task of CNNs is to identify local features”. I cannot agree to that. While CNNs initially focus on local features in the first few layers, deeper layers and the latent space contain global information. However, the network's ability to determine the importance of global versus local features is achieved through the learning process by adjusting the weights based on the task's requirements.

Comment: I recommend not to go too deep into the topic, because many mistakes can be avoided. Instead, explain the general concept of CNNs and cite accordingly.

I recommend replacing: “predictors and answers” for the more commonly used terms “inputs and ground truth”. The terms “inputs” and “ground truth” are more widely understood and commonly used in the deep learning community, making them preferable for clear and consistent communication.

The dataset split is incorrectly explained: the dataset should be split into training, validation and test data. Training data is used to train the network. By evaluating the model's performance on the validation data, adjustments can be made to the model architecture, regularization techniques, or other hyperparameters to improve its generalization capabilities. The goal is to optimize the network for the validation data, e.g., by increasing the batch size or changing the optimizer. In the end, the final performance and generalization ability of the trained models is evaluated using the test dataset. I recommend further research. Because the authors split the test data in different random ways (In 5.1, 5.2 and 5.4: 80-20 split, in 5.3: 90-10 split), the results are not comparable. The test data set must be exactly the same for all experiments.
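One way to obtain a single, fixed split that can be reused across all experiments is sketched below. The 70/15/15 fractions, the class counts and the helper name `stratified_split` are assumptions for illustration only; the manuscript does not specify them.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.7, 0.15, 0.15), seed=0):
    """Split sample indices into train/validation/test, keeping class ratios."""
    rng = random.Random(seed)  # local generator: same seed, same split
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, val, test = [], [], []
    for label in sorted(by_class):  # sorted for a deterministic class order
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# Hypothetical class counts for a set of 840 labelled diagrams.
labels = ["ok"] * 600 + ["bad"] * 240
train_idx, val_idx, test_idx = stratified_split(labels, seed=42)
print(len(train_idx), len(val_idx), len(test_idx))
```

Storing `test_idx` alongside the data would let every experiment be evaluated on exactly the same test diagrams.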

Regarding the source code, I doubt that this result can ever be reproduced, since no code was provided and neither reproducibility nor the train/test split was covered (see previous and question 6). Neural network training is inherently non-deterministic. However, you can achieve reproducibility by correctly initializing the random generators, which makes the result almost perfectly reproducible. I refer you to the reproducibility documentation from PyTorch: https://pytorch.org/docs/stable/notes/randomness.html. Please cover this by adding a few sentences in the paper (e.g. in the suggested hyperparameters section) so that the community can reproduce the results.
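A minimal sketch of such an initialization is shown below. It assumes a Python workflow; the helper name `seed_everything` and the seed value are hypothetical, and the NumPy and PyTorch calls are only executed if those libraries happen to be installed. Full determinism on GPUs may need further settings described in the PyTorch notes linked above.

```python
import random

def seed_everything(seed):
    """Seed the random generators that are available in this environment."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed: nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        # Trade some speed for repeatable cuDNN convolution algorithms.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # PyTorch not installed: nothing to seed

seed_everything(2023)
first = random.random()
seed_everything(2023)
second = random.random()
print(first == second)
```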

Section 5:
Section 5.1 and 5.2 and 5.3 – it makes no sense to split these sections. Training with less data gives no information. It is obvious that with such a small data set, more data will improve performance. Remove sections 5.1 and 5.2. I suggest adding a section "Dataset" explaining the two remaining training datasets (one without dubious diagrams, representing the binary classification, and one with the dubious diagrams, representing the multi-class problem). In both cases, the test dataset has to be the same.

At the end of Section 5.3, the authors state that their data reduction is probably perfect because a human has discarded uncertain diagrams. Such an assumption is very risky and should be avoided, since errors can still occur.

In 5.4, the authors somehow explain the problem of class imbalances without naming it. These imbalances can be addressed using weighting strategies, which the authors did not consider. Since this is one of the fundamentals of training neural networks, the authors should retrain their networks using weighting.
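One common weighting strategy is to weight each class inversely to its frequency. The sketch below uses the heuristic n_samples / (n_classes * class_count); the class counts are hypothetical and `inverse_frequency_weights` is a helper introduced for illustration, not the authors' code.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical, imbalanced three-class training set.
labels = ["ok"] * 700 + ["dubious"] * 100 + ["bad"] * 200
weights = inverse_frequency_weights(labels)
for cls, weight in weights.items():
    print(f"{cls}: {weight:.3f}")
```

Weights of this kind could then be passed to a weighted loss function during training, so that errors on the rare classes count more than errors on the majority class.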

Section 6:
The authors conclude that the experiment with the large data set reduced accuracy, which again needs to be verified using exactly the same test data. They also concluded that no BAD diagrams were labelled OK, which is very important and would be reflected if class-wise accuracies were shown (and precision/recall/F1). In such a multi-class scenario, the authors need to add class-wise metrics.
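Class-wise metrics follow directly from a confusion matrix, as sketched below. The counts are invented for illustration (they are not the manuscript's results) and are arranged so that no "bad" diagram is predicted "ok" and vice versa, mirroring the situation described above; `per_class_metrics` is a hypothetical helper.

```python
def per_class_metrics(confusion, classes):
    """Per-class precision, recall and F1.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    metrics = {}
    for k, name in enumerate(classes):
        tp = confusion[k][k]
        predicted_k = sum(row[k] for row in confusion)  # column sum
        actual_k = sum(confusion[k])                    # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[name] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics

classes = ["ok", "dubious", "bad"]
# Hypothetical counts: rows are true classes, columns are predicted classes.
confusion = [
    [80, 10, 0],
    [5, 20, 5],
    [0, 4, 36],
]
metrics = per_class_metrics(confusion, classes)
for name in classes:
    m = metrics[name]
    print(f"{name}: precision={m['precision']:.2f} "
          f"recall={m['recall']:.2f} F1={m['f1']:.2f}")
```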
Citation: https://doi.org/10.5194/gi-2023-4-RC1
RC2:
'Comment on gi-2023-4', Anonymous Referee #2, 02 Jun 2023
In this study, the authors address the topic of data quality control within seismic records, which is a time consuming process in light of rapidly growing data sets and requires human experience (experts) for the identification of malfunctioning or instrumental noise.
The authors propose to use machine learning (CNNs, AlexNet) in order to facilitate and speed up data quality control checks. This subject is definitely of importance to the seismic community and well suited for GI.
The authors put a big effort into manually labeling data (PSDs). This data is used for training and testing AlexNet to predict in total three classes, which divide the PSDs into “ok”, “dubious” and “bad” PSDs/signals. Overall, the accuracy of 85 % is very promising.
However, I do feel there are some substantial shortcomings in the way the study is presented and list here some general remarks followed by some more detailed comments on the sections of the manuscript.

1. Does the paper address relevant scientific questions within the scope of GI?
It really does. Data quality is substantial and the authors propose an efficient way to account for it! Therefore, the subject is definitely suited for GI.

2. Does the paper present novel concepts, ideas, tools, or data?
The application of CNNs is not new in this context, neither is the data. The latter is not problematic. I feel it is rather meaningful to make use of existing archives here. However, methodologically – at least the way it is presented – this study seems rather basic and lacks information in the way it is presented now, e.g. on preprocessing, the manual labeling process, model hyperparameters or a comparison between different models.

3. Are substantial conclusions reached?
The application of AlexNet works and the conclusions are meaningful. However, this is not a groundbreaking/new result.

4. Are the scientific methods and assumptions valid and clearly outlined?
In my opinion and as stated above, details on preprocessing, labeling and the model itself are missing. Section 4 (Machine Learning general description) is, in its current state, rather superficial and overall references are not up to date.
Pros and Cons of AlexNet are not discussed, nor is this approach compared to others.
I had difficulties to follow the criteria for the four tests and wonder if it is really necessary to present all of them. Generally - the more data the better - and it is natural, that predictions become better with an increase in training data. It would be rather interesting to see how much data is really needed and when performance begins to drop. This is, however, not demonstrated based on these four tests.

5. Are the results sufficient to support the interpretations and conclusions?
In general and just looking at the outcome, it seems like it – yes. But as stated above, information is missing, which makes it hard to judge.

6. Is the description of experiments and calculations sufficiently complete and precise to allow their reproduction by fellow scientists (traceability of results)?
No. Neither data, source code nor the model are publicly available. It is impossible to reproduce the results.

7. Do the authors give proper credit to related work and clearly indicate their own new/original contribution?
Partly. But there for sure is more out there when it comes to CNN application in this context, which the authors did not mention.

8. Does the title clearly reflect the contents of the paper?
Yes, it does.

9. Does the abstract provide a concise and complete summary?
Yes.

10. Is the overall presentation well structured and clear?
The structure is clear. However, the section and subsection titles sometimes sound a bit generic and bloated. This is a general remark though and also applies to parts of the main text. The presentation of methods and results is rather patchy and needs to be more precise.

11. Is the language fluent and precise?
Rather not. The authors need to carefully check spelling and writing throughout the entire manuscript again. Sentences were sometimes difficult to read, formatting or grammar was off.

12. Are mathematical formulae, symbols, abbreviations, and units correctly defined and used?
Yes.

13. Should any parts of the paper (text, formulae, figures, tables) be clarified, reduced, combined, or eliminated?
I would propose to carefully check for new/state-of-the-art papers on that matter and to include those recent studies to put the one presented here into the right light and framework. More information is needed in the methods section, and the results/discussion should be adapted accordingly.

14. Are the number and quality of references appropriate?
Not really. Partly, the authors use only one reference for most parts of a section and those studies often go back > 10 yrs. This needs to be addressed.

15. Is the amount and quality of supplementary material appropriate?
Yes.

General Remarks
Spelling and writing needs to be carefully checked throughout the entire manuscript.

A more explicit style of writing would be beneficial. While reading I regularly felt like it was too generic and more details supporting those generic statements are necessary.

The references need to be updated. I am not an expert on the specific field (instrumentation or CNN application in this context), but most of the studies cited lie > 10 years in the past while both the technology and the field are evolving rapidly.

As a follow up, the study needs to be contextualized in light of the current state-of-the-art within this field. What do others use/propose to address this topic and why is your approach more suited?

More details on preprocessing, manual labeling and model hyperparameters need to be included.

A stress test would be particularly interesting. How much training data is needed? Is this transferable to other data sets? What are the criteria?

Chapter 2. Seismic noise and its spectral representation
The definition of “noise” vs. “signal” is a bit old fashioned in my opinion. As seismology turned into a very broad field (earthquake seismology, environmental seismology, microseism), we always define our signals of interest, while the remaining fraction of the data is considered as noise. If this study is only related to earthquake seismology, it should be clearly stated here.

Chapter 4. Machine Learning general description
Overall this chapter is rather sketchy. In the beginning, machine learning is almost “advertised” as being simple and easy to apply without the need of being particularly knowledgeable about it, only the amount of training data would be of importance.
I would be very careful about those statements and can actually not support them. It is why machine learning is often still considered as “black magic” or a “black box” with some shortcoming input data and the shiny output data. This is not the case. Machine learning models can be very sensitive to the type of data they were trained on (even biased), but also to hyperparameters of the model, which can cause e.g. overfitting.

The right terminology is sometimes missing or off. For example: the authors describe the problem of overfitting with their own words, but do not mention the specific term.

References are missing. The section is based on only one reference, which again is > 10 years old.

What is the difference between “images of data” and “data”? I guess I see what is meant here, but the terminology needs to be more specific to remove the guesswork.

AlexNet is briefly introduced, but it is not discussed why this is the right choice here compared to other approaches.

5. Method application
ll. 340-341 – how do you account for that?

It would be nice if the authors could elaborate on the classification decision into “ok” and “broken”. It is stated, that this has been done by a human operator, but the criteria are not explicitly described. The only statement I found was “… showing trends that, according to the human eye, definitely belong to defective stations”, which in my opinion is quite subjective and loosely formulated. Overall, this is not reproducible.

Second test: I see why the authors wanted to increase the number of “broken” labels. However, to not bias the network (now you definitely do so), the number of “ok” signals should be equally increased as well. I also doubt that “ok” signals are more homogeneous from the ML model perspective, if you consider all the potential environmental noise.

Fourth test: looking at this database and test, I wonder why tests 1-3 are presented.
Citation: https://doi.org/10.5194/gi-2023-4-RC2
AC1:
'Comment on gi-2023-4', Alessandro Pignatelli, 23 Jun 2023
Thanks to the referees for the effort put into analysing the article. As their comments and requests look important and require a big effort to be answered (and then addressed), and as we don’t agree with some of them, we would like to be sure that you agree with our plans before going through the detailed responses.

In the following lines you’ll find our answers to those referee comments that we do not intend to address, totally or partially. If you agree, we will dedicate all the time needed to answer all the other comments and requirements as best we can.
The most important task of our paper is to understand whether it is possible, by means of deep learning, to recognize seismic diagrams coming from broken stations with sufficient accuracy to create an automatic system to improve data quality analysis. We thought this was within the scope of this journal. We did not mean, in the first instance, to study how much the accuracy depends on the specific model and/or how the parameters or hyperparameters could affect such accuracy. (However, we can try to use a different network, as suggested by a referee, and try small variations of some hyperparameters.)
We want to remark that the major part of the work (as recognized by both referees, especially referee 2) has been labelling data in order to “create the experiment” and possibly use the results in the field. We will probably write a separate paper to develop the suggested in-depth analysis.
We will organise the next lines by quoting the referees in italic and our answers will follow.
Referee 1
RC1 point 3. “As stated above, it is obvious that the concept will work”. In our opinion it was not so obvious as sometimes images may not include enough features to recognize image classes with an accuracy percentage acceptable to build an automatic system. A significant emphasis of the paper lies in the extensive effort invested in data analysis and the provision of labelled spectral images for training the system. Notably, there are no comparable works in the existing literature (at least not with spectral images obtained from seismic monitoring stations). So it could have been plausible that the collected data might have been inadequate for effectively training the network or that the accuracy achieved was insufficient, owing to the diverse and complex nature of PSD-PDF spectral images. We verified that the procedure provides satisfactory results for developing an automatic data quality analysis system.

RC1 point 5. “Comparison of the four tests is not possible due to changing test sets”. In our opinion this comment is not applicable, as the four experiments are too different, both in terms of data and goals, to be compared. Moreover, the four experiments describe the evolution of our research and we think it’s valuable to describe the progress. More specifically, the first experiment was designed to check whether the method could work when trained on data from a specific year (so we needed a test data set only for that year, 2017). In the second experiment we wanted to see how well the system could generalise to data from more years (so we had to include data from different years in the test set). The third experiment was designed just to check how well the already trained network (trained in the same way as in the second experiment, but with an additional 10% of data) could generalise when classifying images from a year not included in the training. In fact, in the second experiment we excluded the 2018 data from the training on purpose, and for the third experiment we had to include just 2018 data in the test set. Regarding the fourth test, as the network accuracy was very good but not perfect (a few dubious diagrams had been discarded a priori, as we explained in paragraph 5.3), we designed the fourth test in order to make the automatic system work better by adding a new class. In fact, the main problem in using the automatic system could be that a small percentage of bad data is included among the “good” signals. This means that there are broken stations or bad metadata that go undetected. To avoid this as much as possible, we added a third class called “doubt”. As shown in the paper, by means of this additional class, there are no “good” signals recognized as “bad” by the system (only as “doubt”) and vice versa.
This means that the problematic signals will be put into “bad” or “doubt” class, significantly decreasing the operator's work and the number of undetected broken stations. In this experiment we needed to change the test set in order to stratify the three rather than two classes. Summarising, in our opinion, the four types of experiments have different objectives and need four different test data so they cannot be compared. Maybe in the manuscript it was not clear so we will explain more clearly the specific goals of each experiment.

RC1 Point 6. “What is missing….”:

“Source code”. Unfortunately we cannot share the code as we implemented it, because it is not a single script but is spread across many general classes and libraries, most of which are part of confidential business between our institute and private companies. Additionally, we did not find an obligation to share code in the journal policy. However, to meet the referee’s requirement as far as we can, we will rewrite the code just to reproduce the experiments and add it to the auxiliary material once the paper has been accepted.

“Correct labelled data”. In this case we agree with the referee; however, as a huge effort has been made to label this data, we will be happy to provide it on the Kaggle website in the format suggested by the referee (dividing the classes into different folders) once the paper has been accepted.

“The pre-trained checkpoint of the CNN”. In this case we totally agree with the referee. This will be added once the final neural network has been selected and trained again.

RC1 Point 8: “the paper only uses a 2D CNN, and a comparison with a 1D CNN is missing. I would at least expect a section discussing the advantages and disadvantages of a 1D and 2D approach”. As stated in the previous points, the main task of this work is to demonstrate that geometric patterns in PSD-PDF spectral images are recognizable by artificial neural networks and to build an automatic system able to recognize the noisy signal (in our view, improving measurements is one of the main topics of this journal). The next step would be to study the model dependency and how model accuracy may be affected by the neural network architecture (we plan a dedicated work on this topic). However, we do not think it is possible to use a 1D convolutional neural network, as it would require using time series rather than images. Retrieving and analysing such data, rather than compressed images, would be much more time demanding and therefore unusable for an automatic system. We will add a sentence to the paper to express this point more clearly.

RC1, Point 14: “The references regarding section 4 are a little outdated in some cases or not present at all. Please add more and also recent references to section 4 as the field of deep learning is rapidly evolving. AlexNet should not be used”. As stated in previous points we intend to study how model accuracy may be affected by neural networks architecture in a separate work as it was not this work’s main task. However, to accomplish as much as possible the referee’s suggestion, we used the efficientnet network architecture suggested by the referee but, by looking at the first preliminary results, it did not improve the accuracy. We will add some references about EfficiententNet together with alexnet hoping this is ok. Furthermore we will put the results of efficientnet in an auxiliary material with the shared code so that people will be able to run using both Alexnet and Efficientnet and compare the results

RC1 General comments (last point): “Did the authors consider data augmentation?”. We don’t, but it has not been considered for specific reasons. Data augmentation is a technique mostly used to increase the generalizability of an overfitted data model. In this case we did not experience any overfitting problem (we will add the training progress plot and we will explain the overfitting absence). Additionally, augmentation may be useful when images can be “generalised” in terms of orientation, contrast and so on. We want to create a system recognizing the classification of the exact PSD-PDF figures as they are. So, in our opinion, in this case data augmentation would be unuseful..

RC1 Further comments (section 4, third point): “AlexNet is very outdated and should not be used for such an important domain”....”Suggestion: EfficientNet V2”. See the second part of point 5.

Referee 2
RC2 Point 2 “comparison between different models”. See points 4 and 5 of previous referees’ answers.

RC2 Point 3: “However, this is not a groundbreaking/new result”. The new result is that deep learning works on spectral data produced by monitoring seismic stations with an acceptable accuracy to realise an automatic system to quickly check the data quality of such monitoring stations.

RC2 Point 6: “Neither data, source code nor the model are publicly available. It is impossible to reproduce the results”. See point 3 of referee 1.

RC2 General comment: “A stress test would be particularly interesting. How much training data is needed? Is this transferable to other data sets? What are the criteria?”. Giving a minimum number is problematic, because the outcome depends not only on the number of images but also on how well the features needed for classification are represented in the training set. Very different results can therefore be obtained with the same number of images. What we show here is that the images we provided have enough variety and that the results generalize. This was the main purpose of experiment 3, in which we applied a trained neural network to data from 2018, while no data from that year were used for training. We will add text to the paper to better explain the goals of the different tests, especially the third.

RC2 General comments (Chapter 4. … last point): “AlexNet is briefly introduced, but it is not discussed why this is the right choice here compared to other approaches”. See points 4 and 5 of previous referees’ answers.

RC2 General comments (5. Method application): “It would be nice if the authors could elaborate on the classification decision into “ok” and “broken””. In our opinion, this was described extensively in paragraph 3 (“Criteria …”), but we can provide some additional examples.
Citation: https://doi.org/10.5194/gi-2023-4-AC1

AC2: 'Further answers to comments (to be completed in the revised version)', Alessandro Pignatelli, 27 Jun 2023

In this comment the referees can find additional answers. If you agree with our plan, you will find the remaining ones in the revised version of the paper.

RC1 Point 4: “The methods are clear with strong weaknesses regarding “4. Machine learning general description”. The headline is not consistent with the following subsections. The heading should be: “4. Deep Learning” with 4.1 being: “4.1 General description”.”

Ok, we will change some headings and the names of some subsections. “Some assumptions are mixed with results, such as the data splits in 5.1, 5.2, 5.3 and 5.4 and their explanations.” In the description we preferred to follow the heuristic method that we adopted. However, we will try to separate theory and applications as best we can and, as suggested, we will summarize the data used in the different experiments in a dedicated table with a template similar to the following one:

Experiment number	Year of training data	N. of training diagrams	Percentage of training data	Year of test data	N. of test diagrams	K-fold (yes/no)	Network parameters (Alessandro)
1	2017	…	80	2017	…	Yes
2	2017+2016+…	…	80	2017+…	…	Yes
3	2017+2016+…	…	90	2018	840	No
4	2021+…	…	80	2021+…	…	Yes

RC1 Point 6: “I suggest using an appropriate website to share your data such as https://www.kaggle.com/.” We will do so once the paper has been accepted, as already stated in the previous comment (specifically point 3.2).
RC1 Point 9: “In my opinion, the results of the large multiclass experiment need to be added.” We completely agree. We will add the results of the large multiclass experiment to the abstract.
RC1 Point 10: “The experiments are mixed with results and discussion, which makes it difficult to follow the idea.” See RC1 Point 4.
RC1 Point 11: “Is the language fluent and precise: No. …” Ok, we will try to improve the English as best we can.
RC1 Point 13: “Please add a section: "Dataset" and move relevant parts to this section…” Ok, we will add a “Dataset” section including the table shown above.
RC1, Point 14: “The references regarding section 4 are a little outdated in some cases or not present at all. Please add more and also recent references to section 4 as the field of deep learning is rapidly evolving…”. Ok! We will add more recent references.
RC1 Point 15: “I suggest adding a table with all four tests to make the comparison easier…” See RC1 Point 4.
RC1 General comments (first line): “The authors have used several data sets in the manuscript, which is confusing. I do not agree with the division into smaller datasets (of size 224, 447 and 840 plots) and one large dataset (1865 PDF spectral plots)...” See RC1 Point 5.
RC1 General comments (next point): “Removing uncertain diagrams from the training is not a good practice as results become not useful in real world applications and conclusions are not representative anymore. The authors could have just used the large dataset as it is the most general dataset in use.” Uncertain diagrams were removed from the training only in the experiments involving years in which such diagrams were a small percentage. In the fourth experiment they were instead placed in the third class; the mid-term discussion explains the reasons for these choices, including why we introduced the third class only for the 2021 data.
RC1 General comments (next point): “Weighting strategies were not covered, for example, to address their imbalance problem faced in tests 1 and 2.” We are not completely sure we understand the comment. If by “weighting” the referee refers to the number of samples in each class, we do not think there is a particular problem, as the number of samples is of the same order of magnitude for each class. We will add precision and recall to the final results to check this point better.
RC1 General comments: “In the whole manuscript there is no information about hyperparameters such as batch size, epochs trained, duration of training and many more”. Ok, we will specify the parameters used.
RC1 General comments “Since reproducibility was not covered, the results will vary between two training runs. The authors should retrain the model several times and provide an average accuracy and standard deviation”. We agree with the referee and we will use a k-fold approach to better estimate the accuracy.
RC2 Point 7: “Do the authors give proper credit to related work and clearly indicate their own new/original contribution? Partly. But there for sure is more out there when it comes to CNN application in this context, which the authors did not mention”. We are not sure about this point. If the referee could specify what is missing, we would be pleased to consider it.
RC2 Point 10: “The structure is clear. However, the section and subsection titles sometimes sound a bit generic and bloated. This is a general remark though and also applies to parts of the main text. The presentation of methods and results is rather patchy and needs to be more precise”. Ok, we will try to improve these points as best we can.
RC2 Point 11: “Is the language fluent and precise? Rather not. The authors need to carefully…” Ok, we will try to improve the English as best we can.
RC2 point 13: “I would propose to carefully check for new/state-of-the-art papers on that matter and to include those recent studies to put this one presented here into the right light and framework”. As we have explained in different sections of our answer, this was not the main task of this paper. We will try to find a balance between the requests and the article goal.
RC2 point 14: “partly, the authors use only one reference for most parts of a section and those studies often go back > 10 yrs.”. Ok! We will try to insert more recent references.

Citation: https://doi.org/10.5194/gi-2023-4-AC2

Status: closed

RC1:
'Comment on gi-2023-4', Anonymous Referee #1, 31 May 2023
The paper discusses the applicability of AlexNet to seismic spectral plots to detect data quality problems. They combined existing diagrams from different sources into a new database containing 840 diagrams. By labeling (classifying) the diagrams using expert knowledge, they created a new training dataset for CNNs. In general, the approach was successful and the AlexNet achieved an accuracy of 90% on this dataset.
In my opinion, data quality is one of the most important issues in any domain. Therefore, the authors have addressed a very important problem. They show a possible solution to automate a pre-analysis to identify erroneous stations or stations that should be treated with caution.

1 Does the paper address relevant scientific questions within the scope of GI:
Yes. Data quality is important in any scientific field. In this case, the authors focus on seismic data, so this paper is relevant to GI.
2 Does the paper present novel concepts, ideas, tools, or data:
The authors have applied an existing and well-known concept to a new dataset. It is obvious that a CNN approach to classification will work. However, the authors noted that there are no large databases on which to train the neural networks (they “focused on the entire noise spectra to detect acceptable signals from anomalous ones”). As datasets are rare and very time consuming to create, this contribution is relevant to the scientific community if the data is made (“correctly”) publicly available (see 6).
3 Are substantial conclusions reached:
As stated above, it is obvious that the concept will work. The authors have proven this once again. Therefore, the paper does not provide any substantial new conclusions.
4 Are the scientific methods and assumptions valid and clearly outlined:
The methods are clear with strong weaknesses regarding “4. Machine learning general description”. The headline is not consistent with the following subsections. The heading should be: “4. Deep Learning” with 4.1 being: “4.1 General description”.

Some assumptions are mixed with results, such as the data splits in 5.1, 5.2, 5.3 and 5.4 and their explanations.
5 Are the results sufficient to support the interpretations and conclusions:
Partially. Comparison of the four tests is not possible due to changing test sets.
6 Is the description of experiments and calculations sufficiently complete and precise to allow their reproduction by fellow scientists (traceability of results):
No. The data provided are only some diagrams without labels. What is missing:
Source code

Correct labelled data – from the given data, I cannot reproduce their approach. Please create 4 folders, each containing the labels and images used during every experiment.

The pre-trained checkpoint of the CNN

Furthermore, the images (which by themselves have no use) are published only on Google Drive, which is a bad decision. I suggest using an appropriate website to share your data such as https://www.kaggle.com/.
7 Do the authors give proper credit to related work and clearly indicate their own new/original contribution:
Yes
8 Does the title clearly reflect the contents of the paper:
Yes. However, in my opinion, the title suggests that a CNN is the best approach, which seems logical. However, the paper only uses a 2D CNN, and a comparison with a 1D CNN is missing. I would at least expect a section discussing the advantages and disadvantages of a 1D and 2D approach.
9 Does the abstract provide a concise and complete summary:
Partially. In my opinion, the results of the large multiclass experiment need to be added.
10 Is the overall presentation well structured and clear:
The general presentation of the introduction and theory up to section 3 is OK. Sections 4 and 5 are not. The experiments are mixed with results and discussion, which makes it difficult to follow the idea. Regarding the general structure, I suggest following common headings such as Introduction, (materials/data and) Methods, Experiments, Results and Discussion, Conclusions. See also the suggestions in 4 and the general comments below.
11 Is the language fluent and precise:
No. Some sentences are difficult to read because they are unnecessarily nested. Proofreading is needed.

There are many typing errors such as:
Missing/additional whitespaces e.g.: line 66; line 86; line 112 and many more

Missing commas:
line 61 end of line to 62: […], and […]

line 62: […] in some of them, […]

line 128: […] trace statistics, [...]

line 145: […] (McNamara Boaz, 2006), […]

…

spelling: line 154 (inserted new line separates the t from interest); line 156

Please use cross-linking when referring to another section (example: line 363)

12 Are mathematical formulae, symbols, abbreviations, and units correctly defined and used:
Yes
13 Should any parts of the paper (text, formulae, figures, tables) be clarified, reduced, combined, or eliminated:
Please add a section: "Dataset" and move relevant parts to this section, as in my eyes this is an important contribution. This will make it easier to extract relevant information (see also general comments). I have major concerns with sections 4 and 5. See general comments.
14 Are the number and quality of references appropriate:
The references regarding section 4 are a little outdated in some cases or not present at all. Please add more and also recent references to section 4 as the field of deep learning is rapidly evolving. AlexNet should not be used.
15 Is the amount and quality of supplementary material appropriate:
Yes. To make it even more precise, I suggest adding a table with all four tests to make the comparison easier.
General comments:
The authors have used several data sets in the manuscript, which is confusing. I do not agree with the division into smaller datasets (of size 224, 447 and 840 plots) and one large dataset (1865 PDF spectral plots). It is obvious that more data will improve training accuracy. Removing uncertain diagrams from the training is not a good practice as results become not useful in real world applications and conclusions are not representative anymore. The authors could have just used the large dataset as it is the most general dataset in use.
Weighting strategies were not covered, for example, to address their imbalance problem faced in tests 1 and 2. Furthermore, the authors used different test sets to validate their approaches, making it impossible to compare results.
In the whole manuscript there is no information about hyperparameters such as batch size, epochs trained, duration of training and many more.
A very important topic to discuss is overfitting, which has not been considered at all. They did not provide any information about the training and validation loss, which is a common practice in applied deep learning. The lack of this information makes it very difficult to interpret the results.
Since reproducibility was not covered, the results will vary between two training runs. The authors should retrain the model several times and provide an average accuracy and standard deviation.
In conclusion: The experimental design is bad. Too many mistakes have been made.
There is no explanation as to why a 2D CNN approach was chosen in the first place. The authors claim that: “In our study we find that just images of data fulfil our goal” (line 336). Proof is lacking. In fact, they have shown that uncertain graphs are likely to be misclassified. A 1D CNN can be applied directly to the data without any loss of information due to encoding into an image. As the seismic information is stored in vectors, 1D CNNs are more efficient, easier to train and less complex. Since far fewer parameters are needed, the loss of information due to encoding the data into an image can be avoided and the only loss is introduced by subsampling the spectral data itself. Did the authors take this into account in their research? For example, the images could be used by the experts to classify the data, but the 1D data could be used for training. Why is it advantageous to use a 2D CNN? I suggest adding a section on this. I refer also to: https://doi.org/10.1016/j.ymssp.2020.107398.
Did the authors consider data augmentation? I would like to see a section discussing the possibility of data augmentation in such a case as it can improve accuracy a lot. I refer to: https://doi.org/10.1088/1741-2552/ac4430.
Further comments:
Section 4:
Overall, this part is not well-structured and lacks some important parameters to reproduce the results. A suggestion for a new structure:

- CNNs (reduce the general description, remove ANNs and focus on the CNN),

- Training (a short overview, perhaps supported by an enumeration of important training steps),

- Metrics (accuracy, confusion matrix, F1 score (all results also class-based for multiclass problems), Precision, Recall) and

- Hyperparameters (training - validation - test split, batch size, epochs, loss function, optimizer, did the authors use dropout? did the authors use regularization techniques? did the authors use weighting to deal with class imbalances?)

Please also use F1 scores, as the training data is class unbalanced, which will bias the result towards the most common class, or at least use some weighting during training. Accuracy alone can be misleading when the classes are imbalanced because a model that predicts the majority class most of the time could still achieve high accuracy while performing poorly on the minority class. The confusion matrix, on the other hand, provides a more detailed breakdown of predictions but does not directly incorporate the trade-off between precision and recall. I refer you to the tutorials of PyTorch or tensorflow.
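To illustrate the point above, here is a minimal Python sketch (not the authors' code) of how per-class precision, recall and F1 can be derived from a confusion matrix. The toy labels are invented for illustration and have no relation to the paper's data; the function names are hypothetical. The example shows a classifier that always predicts the majority class: it reaches 90 % accuracy while its recall on the minority class is zero.

```python
from typing import Dict, List, Sequence


def confusion_matrix(y_true: Sequence[str], y_pred: Sequence[str],
                     labels: List[str]) -> List[List[int]]:
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def per_class_metrics(matrix: List[List[int]],
                      labels: List[str]) -> Dict[str, Dict[str, float]]:
    """Precision, recall and F1 for each class (0.0 when undefined)."""
    metrics = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fp = sum(matrix[r][i] for r in range(len(labels))) - tp
        fn = sum(matrix[i]) - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


def macro_f1(metrics: Dict[str, Dict[str, float]]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(m["f1"] for m in metrics.values()) / len(metrics)


# Toy, invented example: 9 "ok" diagrams, 1 "bad" diagram,
# and a classifier that always answers "ok".
LABELS = ["ok", "bad"]
y_true = ["ok"] * 9 + ["bad"]
y_pred = ["ok"] * 10

cm = confusion_matrix(y_true, y_pred, LABELS)
metrics = per_class_metrics(cm, LABELS)
accuracy = sum(cm[i][i] for i in range(len(LABELS))) / len(y_true)

print("accuracy:", accuracy)
print("recall on 'bad':", metrics["bad"]["recall"])
print("macro F1:", macro_f1(metrics))
```

The macro-averaged F1 weights every class equally, which is why it falls well below the accuracy in this toy case.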

AlexNet is very outdated and should not be used for such an important domain (see graph at: https://paperswithcode.com/sota/image-classification-on-imagenet). I refer to PyTorch, which provides pre-trained networks and their top-1 and top-5 accuracies on the ImageNet dataset: https://pytorch.org/vision/stable/models.html. It should be easy to replace your existing AlexNet with a newer network. Suggestion: EfficientNet V2, since it has fewer parameters and should be trainable on the used system. Please retrain on the final dataset.

Please revise some claims that are incomplete or wrong:
Line 262: “As well known, machine learning algorithms are divided into two groups: supervised methods and unsupervised ones.” – Reinforcement learning is missing. Please provide sources for all such claims in these sections.

Line 300 following: “the main task of CNNs is to identify local features”. I cannot agree to that. While CNNs initially focus on local features in the first few layers, deeper layers and the latent space contain global information. However, the network's ability to determine the importance of global versus local features is achieved through the learning process by adjusting the weights based on the task's requirements.

Comment: I recommend not to go too deep into the topic, because many mistakes can be avoided. Instead, explain the general concept of CNNs and cite accordingly.

I recommend replacing: “predictors and answers” for the more commonly used terms “inputs and ground truth”. The terms “inputs” and “ground truth” are more widely understood and commonly used in the deep learning community, making them preferable for clear and consistent communication.

The dataset split is incorrectly explained: the dataset should be split into training, validation and test data. Training data is used to train the network. By evaluating the model's performance on the validation data, adjustments can be made to the model architecture, regularization techniques, or other hyperparameters to improve its generalization capabilities. The goal is to optimize the network for the validation data, e.g., by increasing the batch size or changing the optimizer. In the end, the final performance and generalization ability of the trained models is evaluated using the test dataset. I recommend further research. Because the authors split the test data in different random ways (In 5.1, 5.2 and 5.4: 80-20 split, in 5.3: 90-10 split), the results are not comparable. The test data set must be exactly the same for all experiments.
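As a rough illustration of the split described above, the following Python sketch (an assumption-laden example, not the authors' procedure) divides a list of files once into training, validation and test subsets using a fixed seed, so that the same test set can be reused in every experiment. The file names, the 80/10/10 fractions and the function name `split_dataset` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(items: Sequence[str],
                  val_fraction: float = 0.1,
                  test_fraction: float = 0.1,
                  seed: int = 0) -> Tuple[List[str], List[str], List[str]]:
    """Return (train, val, test) lists; deterministic for a given seed."""
    if not 0 < val_fraction + test_fraction < 1:
        raise ValueError("val_fraction + test_fraction must be in (0, 1)")
    # Sort first so the result does not depend on the input order.
    shuffled = sorted(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Hypothetical file names standing in for labelled spectral diagrams.
files = [f"diagram_{i:04d}.png" for i in range(100)]
train, val, test = split_dataset(files, seed=42)
print(len(train), len(val), len(test))
```

Storing the resulting test list (for example as a text file next to the data) is one simple way to guarantee that later experiments are evaluated on exactly the same diagrams.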

Regarding the source code, I have doubts that this result can ever be reproduced, since no code was provided and reproducibility was not covered at all as well as the train/test split (see previous and question 6). Neural networks are inherently non-deterministic. However, you can achieve reproducibility by correctly initializing the random generators, which makes the result almost perfectly reproducible. I refer you to the reproducibility documentation from pytorch: https://pytorch.org/docs/stable/notes/randomness.html. Please cover this by adding a few sentences in the paper (e.g. in the suggested hyperparameters section) so that the community can reproduce the results.
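The idea can be sketched in a few lines of Python using only the standard library. This is an illustrative stand-in, not the authors' pipeline: `train_and_evaluate` is a hypothetical placeholder that returns a dummy score drawn from a seeded generator, where a real implementation would seed the deep learning framework (for PyTorch, `torch.manual_seed(seed)` as described in the linked notes), train the network and return the test accuracy. Running it over several seeds gives the mean and standard deviation requested in the general comments.

```python
import random
import statistics
from typing import List, Tuple


def train_and_evaluate(seed: int) -> float:
    """Placeholder for one full training run.

    Returns a dummy score in [0, 1) that depends only on the seed.
    It is NOT a real accuracy and has no relation to the paper's results.
    """
    rng = random.Random(seed)  # all randomness flows from this seeded generator
    return rng.random()


def summarize_runs(seeds: List[int]) -> Tuple[float, float]:
    """Mean and sample standard deviation of the scores over several seeds."""
    scores = [train_and_evaluate(seed) for seed in seeds]
    return statistics.mean(scores), statistics.stdev(scores)


seeds = [0, 1, 2, 3, 4]
mean_score, std_score = summarize_runs(seeds)
print(f"score over {len(seeds)} runs: {mean_score:.3f} +/- {std_score:.3f}")
```

Because every run is fully determined by its seed, repeating the script reproduces the same summary, and reporting the list of seeds alongside the results lets others rerun the same experiments.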

Section 5:
Section 5.1 and 5.2 and 5.3 – it makes no sense to split these sections. Training with less data gives no information. It is obvious that with such a small data set, more data will improve performance. Remove sections 5.1 and 5.2. I suggest adding a section "Dataset" explaining the two remaining training datasets (one without dubious diagrams, representing the binary classification, and one with the dubious diagrams, representing the multi-class problem). In both cases, the test dataset has to be the same.

At the end of Section 5.3, the authors state that their data reduction is probably perfect because a human has discarded uncertain diagrams. Such an assumption is very risky and should be avoided, since errors can still occur.

In 5.4, the authors somehow explain the problem of class imbalances without naming it. These imbalances can be addressed using weighting strategies, which the authors did not consider. Since this is one of the fundamentals of training neural networks, the authors should retrain their networks using weighting.
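One common weighting strategy is to weight each class inversely to its frequency. The Python sketch below is an illustration under stated assumptions, not the authors' method: it uses the "balanced" heuristic, weight = n_samples / (n_classes * class_count), and the toy label counts are invented. Such weights are typically passed to a weighted loss function, for example the `weight` argument of `torch.nn.CrossEntropyLoss` in PyTorch.

```python
from collections import Counter
from typing import Dict, Sequence


def balanced_class_weights(labels: Sequence[str]) -> Dict[str, float]:
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Toy, invented label distribution for the three classes named in the review.
toy_labels = ["ok"] * 6 + ["dubious"] * 3 + ["bad"] * 1
weights = balanced_class_weights(toy_labels)
for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

With these weights, errors on a rare class contribute as much to the total loss as errors on a common one, since the weights summed over all samples equal the number of samples.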

Section 6:
The authors conclude that the experiment with the large data set reduced accuracy, which again needs to be verified using exactly the same test data. They also concluded that no BAD diagrams were labelled OK, which is very important and would be reflected if class-wise accuracies were shown (and precision/recall/F1). In such a multi-class scenario, the authors need to add class-wise metrics.
Citation: https://doi.org/10.5194/gi-2023-4-RC1
RC2:
'Comment on gi-2023-4', Anonymous Referee #2, 02 Jun 2023
In this study, the authors address the topic of data quality control within seismic records, which is a time consuming process in light of rapidly growing data sets and requires human experience (experts) for the identification of malfunctioning or instrumental noise.
The authors propose to use machine learning (CNNs, AlexNet) in order to facilitate and speed up data quality control checks. This subject is definitely of importance to the seismic community and well suited for GI.
The authors put a big effort into manually labeling data (PSDs). This data is used for training and testing AlexNet to predict in total three classes, which divide the PSDs into “ok”, “dubious” and “bad” PSDs/signals. Overall, the accuracy of 85 % is very promising.
However, I do feel there are some substantial shortcomings in the way the study is presented and list here some general remarks followed by some more detailed comments on the sections of the manuscript.

1. Does the paper address relevant scientific questions within the scope of GI?
It really does. Data quality is substantial and the authors propose an efficient way to account for it! Therefore, the subject is definitely suited for GI.

2. Does the paper present novel concepts, ideas, tools, or data?
The application of CNNs is not new in this context, neither is the data. The latter is not problematic. I feel it is rather meaningful to make use of existing archives here. However, methodologically – at least the way it is presented now – this study seems rather basic and lacks information, e.g. on preprocessing, the manual labeling process, model hyperparameters or a comparison between different models.

3. Are substantial conclusions reached?
The application of AlexNet works and the conclusions are meaningful. However, this is not a groundbreaking/new result.

4. Are the scientific methods and assumptions valid and clearly outlined?
In my opinion and as stated above, details on preprocessing, labeling and the model itself are missing. Section 4 (Machine Learning general description) is, in its current state, rather superficial and overall references are not up to date.
Pros and Cons of AlexNet are not discussed, nor is this approach compared to others.
I had difficulty following the criteria for the four tests and wonder if it is really necessary to present all of them. Generally - the more data the better - and it is natural that predictions become better with an increase in training data. It would be rather interesting to see how much data is really needed and when performance begins to drop. This is, however, not demonstrated based on these four tests.

5. Are the results sufficient to support the interpretations and conclusions?
In general and just looking at the outcome, it seems like it – yes. But as stated above, information is missing and I feel like this makes it hard to judge.

6. Is the description of experiments and calculations sufficiently complete and precise to allow their reproduction by fellow scientists (traceability of results)?
No. Neither data, source code nor the model are publicly available. It is impossible to reproduce the results.

7. Do the authors give proper credit to related work and clearly indicate their own new/original contribution?
Partly. But there for sure is more out there when it comes to CNN application in this context, which the authors did not mention.

8. Does the title clearly reflect the contents of the paper?
Yes, it does.

9. Does the abstract provide a concise and complete summary?
Yes.

10. Is the overall presentation well structured and clear?
The structure is clear. However, the section and subsection titles sometimes sound a bit generic and bloated. This is a general remark though and also applies to parts of the main text. The presentation of methods and results is rather patchy and needs to be more precise.

11. Is the language fluent and precise?
Rather not. The authors need to carefully check spelling and writing throughout the entire manuscript again. Sentences were sometimes difficult to read, formatting or grammar was off.

12. Are mathematical formulae, symbols, abbreviations, and units correctly defined and used?
Yes.

13. Should any parts of the paper (text, formulae, figures, tables) be clarified, reduced, combined, or eliminated?
I would propose to carefully check for new/state-of-the-art papers on that matter and to include those recent studies to put this one presented here into the right light and framework. More information is needed in the methods section, and the results/discussion should be adapted accordingly.

14. Are the number and quality of references appropriate?
Not really. Partly, the authors use only one reference for most parts of a section and those studies often go back > 10 yrs. This needs to be addressed.

15. Is the amount and quality of supplementary material appropriate?
Yes.

General Remarks
Spelling and writing needs to be carefully checked throughout the entire manuscript.

A more explicit style of writing would be beneficial. While reading I regularly felt like it was too generic and more details supporting those generic statements are necessary.

The references need to be updated. I am not an expert on the specific field (instrumentation or CNN application in this context), but most of the studies cited lie > 10 years in the past while both the technology and the field are evolving rapidly.

As a follow up, the study needs to be contextualized in light of the current state-of-the-art within this field. What do others use/propose to address this topic and why is your approach more suited?

More details on preprocessing, manual labeling and model hyperparameters need to be included.

A stress test would be particularly interesting. How much training data is needed? Is this transferable to other data sets? What are the criteria?

Chapter 2. Seismic noise and its spectral representation
The definition of “noise” vs. “signal” is a bit old fashioned in my opinion. As seismology turned into a very broad field (earthquake seismology, environmental seismology, microseism), we always define our signals of interest, while the remaining fraction of the data is considered as noise. If this study is only related to earthquake seismology, it should be clearly stated here.

Chapter 4. Machine Learning general description
Overall this chapter is rather sketchy. In the beginning, machine learning is almost “advertised” as being simple and easy to apply without the need of being particularly knowledgeable about it, only the amount of training data would be of importance.
I would be very careful about those statements and can actually not support them. It is why machine learning is often still considered as “black magic” or a “black box” with some shortcoming input data and the shiny output data. This is not the case. Machine learning models can be very sensitive to the type of data they were trained on (even biased), but also to hyperparameters of the model, which can cause e.g. overfitting.

The right terminology is sometimes missing or off. For example: the authors describe the problem of overfitting with their own words, but do not mention the specific term.

References are missing. The section is only based on one reference, which again is > 10 years old.

What is the difference between “images of data” and “data”? I guess I see what is meant here, but terminology needs to be more specific to take off the guesswork.

AlexNet is briefly introduced, but it is not discussed why this is the right choice here compared to other approaches.

5. Method application
ll. 340-341 – how do you account for that?

It would be nice if the authors could elaborate on the classification decision into “ok” and “broken”. It is stated, that this has been done by a human operator, but the criteria are not explicitly described. The only statement I found was “… showing trends that, according to the human eye, definitely belong to defective stations”, which in my opinion is quite subjective and loosely formulated. Overall, this is not reproducible.

Second test: I see why the authors wanted to increase the number of “broken” labels. However, to not bias the network (now you definitely do so), the number of “ok” signals should be equally increased as well. I also doubt that “ok” signals are more homogeneous from the ML model perspective, if you consider all the potential environmental noise.

Fourth test: looking at this database and test, I wonder why tests 1-3 are presented.
Citation: https://doi.org/10.5194/gi-2023-4-RC2
AC1: 'Comment on gi-2023-4', Alessandro Pignatelli, 23 Jun 2023
Thanks to the referees for the effort they put into analysing the article. Because their comments and requests are important and will take considerable effort to answer (and then implement), and because we do not agree with some of them, we would like to be sure that you agree with our plans before we write the detailed responses.

In the following lines you will find our answers to those referee comments that we do not intend to address, either fully or partially. If you agree, we will dedicate all the time needed to answer all the other comments and requirements as best we can.
The main aim of our paper is to understand whether deep learning can recognise seismic diagrams coming from broken stations with sufficient accuracy to build an automatic system that improves data quality analysis. We thought this fell within the scope of this journal. We do not intend, in the first instance, to study how much the accuracy depends on the specific model or how the parameters or hyperparameters affect it. (However, we can try a different network, as suggested by a referee, and try small variations of some hyperparameters.)
We want to stress that the bulk of the work (as recognised by both referees, especially referee 2) has been labelling data in order to “create the experiment” and possibly use the results in the field. We will probably write a separate paper to develop the suggested in-depth analysis.
We organise the following lines by quoting the referees in italics, with our answer following each quote.
Referee 1
RC1 point 3. “As stated above, it is obvious that the concept will work”. In our opinion it was not so obvious as sometimes images may not include enough features to recognize image classes with an accuracy percentage acceptable to build an automatic system. A significant emphasis of the paper lies in the extensive effort invested in data analysis and the provision of labelled spectral images for training the system. Notably, there are no comparable works in the existing literature (at least not with spectral images obtained from seismic monitoring stations). So it could have been plausible that the collected data might have been inadequate for effectively training the network or that the accuracy achieved was insufficient, owing to the diverse and complex nature of PSD-PDF spectral images. We verified that the procedure provides satisfactory results for developing an automatic data quality analysis system.

RC1 point 5. “Comparison of the four tests is not possible due to changing test sets”. In our opinion this comment is not applicable, as the four experiments differ too much in both data and goals to be compared. Moreover, the four experiments describe the evolution of our research and we think it is valuable to describe that progress. More specifically, the first experiment was designed to check whether the method could work when trained on data from a single year (so we needed a test set only for that year, 2017). In the second experiment we wanted to see how well the system could generalise to data from several years (so we had to include data from different years in the test set). The third experiment was designed to check how well the already trained network (trained in the same way as in the second experiment, but with an additional 10% of the data) maintains its accuracy when classifying images from a year not included in the training. In fact, in the second experiment we deliberately excluded the 2018 data from the training, and for the third experiment we included only 2018 data in the test set. Regarding the fourth test, as the network accuracy was very good but not perfect (a few dubious diagrams had been discarded a priori, as explained in paragraph 5.3), we designed it to make the automatic system work better by adding a new class. The main problem in using the automatic system could be that a small percentage of bad data ends up among the “good” signals, which means that broken stations or bad metadata go undetected. To avoid this as far as possible, we added a third class called “doubt”. As shown in the paper, with this additional class no “good” signals are recognized as “bad” by the system (only as “doubt”), and vice versa.
This means that the problematic signals are put into the “bad” or “doubt” class, significantly decreasing the operator's work and the number of undetected broken stations. In this experiment we needed to change the test set in order to stratify it across three rather than two classes. In summary, in our opinion the four experiments have different objectives and need four different test sets, so they cannot be compared. This may not have been clear in the manuscript, so we will explain the specific goals of each experiment more clearly.
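
The two kinds of test set described in this answer (a whole year held out from training, as in the third experiment, and a split stratified over the classes, as in the fourth) can be sketched as follows. This is an illustration only, not the authors' code, which is not public. The record fields `year` and `label`, the function names and the 20% test fraction are assumptions made for the example.

```python
import random
from collections import defaultdict


def split_by_year(records, held_out_year):
    """Train on every year except one; test only on the held-out year.

    `records` is a list of dicts with (assumed) keys "year" and "label".
    """
    train = [r for r in records if r["year"] != held_out_year]
    test = [r for r in records if r["year"] == held_out_year]
    return train, test


def stratified_split(records, test_fraction=0.2, seed=0):
    """Split so that each class contributes the same fraction to the test set."""
    rng = random.Random(seed)

    # Group the diagrams by class label ("good", "bad", "doubt", ...).
    by_label = defaultdict(list)
    for r in records:
        by_label[r["label"]].append(r)

    train, test = [], []
    for items in by_label.values():
        items = items[:]          # copy so the caller's list is not reordered
        rng.shuffle(items)
        # Keep at least one test example per class.
        n_test = max(1, round(len(items) * test_fraction))
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test
```

With a year-based split, the test set measures generalisation to an unseen year; with a stratified split, each class is represented in the test set in proportion to its size, which matters once a small third class is introduced.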

RC1 Point 6. “What is missing….”:

“Source code”. Unfortunately we cannot share the code as implemented, because it is not a single script but is spread across many general classes and libraries, most of which are part of confidential business between our institute and private companies. Additionally, we did not find an obligation to share code in the journal policy. However, to meet the referee’s requirement as far as we can, we will rewrite the code needed to reproduce the experiments and add it to the auxiliary material once the paper has been accepted.

“Correct labelled data”. In this case we agree with the referee; however, as a huge effort has gone into labelling these data, we will be happy to provide them on the Kaggle website in the format suggested by the referee (with the classes divided into different folders) once the paper has been accepted.

“The pre-trained checkpoint of the CNN”. In this case we totally agree with the referee. This will be added once the final neural network has been selected and trained again.

RC1 Point 8: “the paper only uses a 2D CNN, and a comparison with a 1D CNN is missing. I would at least expect a section discussing the advantages and disadvantages of a 1D and 2D approach”. As stated in the previous points, the main task of this work is to demonstrate that geometric patterns in PSD-PDF spectral images are recognizable by artificial neural networks and to build an automatic system able to recognize noisy signals (in our view, improving measurements is one of the main topics of this journal). The next step would be to study model dependency and how accuracy is affected by the neural network architecture (we plan dedicated work on this topic). However, we do not think it is possible to use a 1D convolutional neural network, as it would require time series rather than images. Retrieving and analysing the time series rather than compressed images would be much more time-consuming, and therefore unusable for an automatic system. We will add a sentence to the paper to express this point more clearly.

RC1, Point 14: “The references regarding section 4 are a little outdated in some cases or not present at all. Please add more and also recent references to section 4 as the field of deep learning is rapidly evolving. AlexNet should not be used”. As stated in previous points, we intend to study how model accuracy is affected by the neural network architecture in a separate work, as this was not the main task of this paper. However, to follow the referee’s suggestion as far as possible, we tried the EfficientNet architecture suggested by the referee; judging from the first preliminary results, it did not improve the accuracy. We will add some references about EfficientNet together with AlexNet, hoping this is acceptable. Furthermore, we will put the EfficientNet results in the auxiliary material with the shared code, so that readers will be able to run both AlexNet and EfficientNet and compare the results.

RC1 General comments (last point): “Did the authors consider data augmentation?”. We did not use it, for specific reasons. Data augmentation is a technique mostly used to improve the generalizability of an overfitted model. In this case we did not experience any overfitting problem (we will add the training progress plot and explain the absence of overfitting). Additionally, augmentation may be useful when images can be “generalised” in terms of orientation, contrast and so on. We want to create a system that classifies the exact PSD-PDF figures as they are. So, in our opinion, data augmentation would not be useful in this case.

RC1 Further comments (section 4, third point): “AlexNet is very outdated and should not be used for such an important domain”....”Suggestion: EfficientNet V2”. See the second part of point 5.

Referee 2
RC2 Point 2 “comparison between different models”. See points 4 and 5 of previous referees’ answers.

RC2 Point 3: “However, this is not a groundbreaking/new result”. The new result is that deep learning works on spectral data produced by monitoring seismic stations with an acceptable accuracy to realise an automatic system to quickly check the data quality of such monitoring stations.

RC2 Point 6: “Neither data, source code nor the model are publically available. It is impossible to reproduce the results”. See point 3 of referee 1.

RC2 General comment: “A stress test would be particularly interesting. How much training data is needed? Is this transferable to other data sets? What are the criteria?”. If the referee wants a minimum number, this can be a problem, because it depends not only on the number of data but also on how well the features necessary for classification are represented in the set of images provided for training. So one can get very different results with the same number of images. What we show here is that the images we provide have enough variety and that the results are generalizable. This was the main purpose of experiment 3, in which we applied a trained neural network to a set of 2018 data while no data from that year had been used for training. We will add text to the paper to better explain the goals of the different tests, especially the third.

RC2 General comments (Chapter 4. … last point): “AlexNet is briefly introduced, but it is not discussed why this is the right choice here compared to other approaches”. See points 4 and 5 of previous referees’ answers.

RC2 General comments (5. Method application): “It would be nice if the authors could elaborate on the classification decision into “ok” and “broken””. In our opinion, this was extensively described in paragraph 3 (“Criteria …”), but we can provide some additional examples.
Citation: https://doi.org/10.5194/gi-2023-4-AC1

AC2: 'Further answers to comments (to be completed in the revised version)', Alessandro Pignatelli, 27 Jun 2023

In this comment the referees can find additional answers. If you agree with our plan for the future work, you will find the missing ones in the revised version of the paper.

RC1 Point 4: The methods are clear with strong weaknesses regarding “4. Machine learning general description”. The headline is not consistent with the following subsections. The heading should be: “4. Deep Learning” with 4.1 being: “4.1 General description”.

Ok, we will change some headlines and the names of some subsections. “Some assumptions are mixed with results, such as the data splits in 5.1, 5.2, 5.3 and 5.4 and their explanations.” In the description we preferred to follow the heuristic method that we adopted. However, we will try to separate theory and applications as best we can and, as suggested, we will summarize the data used in the different experiments in a dedicated table with a template similar to the following one:

Experiment number	Year(s) of training data	N. of training diagrams	Percentage of training data	Year(s) of test data	N. of test diagrams	K-fold (yes/no)	Network parameters
1	2017	…	80	2017	…	Yes	…
2	2017+2016+…	…	80	2017+…	…	Yes	…
3	2017+2016+…	…	90	2018	840	No	…
4	2021+…	…	80	2021+…	…	Yes	…

RC1 Point 6. I suggest using an appropriate website to share your data such as https://www.kaggle.com/. We will do it once the paper has been accepted, as already stated in the previous comment (specifically point 3.2).
RC1 point 9 . In my opinion, the results of the large multiclass experiment need to be added. --Completely agree. We will add the results of the large multiclass experiment in the abstract
RC1 point 10. The experiments are mixed with results and discussion, which makes it difficult to follow the idea. See RC1 point 4
RC1 point 11 Is the language fluent and precise: No. … OK We will try to improve the English as best as we can.
RC1 point 13 Please add a section: "Dataset" and move relevant parts to this section… OK we will add a section: Dataset including the table shown above.
RC1, Point 14: “The references regarding section 4 are a little outdated in some cases or not present at all. Please add more and also recent references to section 4 as the field of deep learning is rapidly evolving…”. Ok! We will add more recent references.
RC1 point 15: “I suggest adding a table with all four tests to make the comparison easier…” see RC1 point 4
RC1 General comments (first line): “The authors have used several data sets in the manuscript, which is confusing. I do not agree with the division into smaller datasets (of size 224, 447 and 840 plots) and one large dataset (1865 PDF spectral plots)...”. See RC1 point 5.
RC1 General comments (next point): “Removing uncertain diagrams from the training is not a good practice as results become not useful in real world applications and conclusions are not representative anymore. The authors could have just used the large dataset as it is the most general dataset in use.” Uncertain diagrams were removed from the training only in the experiments involving years in which such diagrams were a small percentage. Subsequently, in the 4th experiment they were placed in the 3rd class, and the mid-term discussion explains the reasons for these choices, including why we introduced the third class only for the 2021 data.
RC1 General comments (next point): “Weighting strategies were not covered, for example, to address their imbalance problem faced in tests 1 and 2.” We are not completely sure we understand the comment. If by “weighting” the referee means the number of data in each class, we do not think there is a particular problem, as the number of data is of the same order for each class. We will add precision and recall to the final results to check this point better.
RC1 General comments “In the whole manuscript there is no information about hyperparameters such as batch size, epochs trained, duration of training and many more”. Ok we will specify the parameters used
RC1 General comments “Since reproducibility was not covered, the results will vary between two training runs. The authors should retrain the model several times and provide an average accuracy and standard deviation”. We agree with the referee and we will use a k-fold approach to better estimate the accuracy.
RC2 point 7. “Do the authors give proper credit to related work and clearly indicate their own new/original contribution? Partly. But there for sure is more out there when it comes to CNN application in this context, which the authors did not mention”. We are not sure about this point. If the referee may specify better what is missing we would be pleased to consider it.
RC2 point 10. “The structure is clear. However, the section and subsection titles sometimes sound a bit generic and bloated. This is a general remark though and also applies to parts of the main text. The presentation of methods and results is rather patchy and needs to be more precise”. Ok! We will try to improve these points as best we can.
RC2 point 11. “Is the language fluent and precise? Rather not. The authors need to carefully…” Ok! We will try to improve the English as best we can.
RC2 point 13: “I would propose to carefully check for new/state-of-the-art papers on that matter and to include those recent studies to put this one presented here into the right light and framework”. As we have explained in different sections of our answer, this was not the main task of this paper. We will try to find a balance between the requests and the article goal.
RC2 point 14: “partly, the authors use only one reference for most parts of a section and those studies often go back > 10 yrs.”. Ok! We will try to insert more recent references.
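
Several of the answers above refer to metrics that are planned for the revised version: precision and recall per class, an accuracy averaged over k folds with its standard deviation, and (from the referee's side) class weighting. A minimal sketch of how these quantities are usually computed from confusion matrices is given below. It is an illustration only, not the authors' code; the function names are hypothetical, and the inverse-frequency weighting shown is one common heuristic, not a method described in the manuscript.

```python
from statistics import mean, stdev


def precision_recall(confusion, labels):
    """Per-class precision and recall.

    confusion[i][j] = number of diagrams of true class i predicted as class j.
    Returns {label: (precision, recall)}.
    """
    out = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        predicted = sum(row[k] for row in confusion)  # column sum
        actual = sum(confusion[k])                    # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        out[label] = (precision, recall)
    return out


def accuracy(confusion):
    """Fraction of diagrams on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total if total else 0.0


def summarize_folds(fold_confusions):
    """Mean accuracy and sample standard deviation over k folds."""
    accs = [accuracy(c) for c in fold_confusions]
    spread = stdev(accs) if len(accs) > 1 else 0.0
    return mean(accs), spread


def inverse_frequency_weights(class_counts):
    """Weight each class by total / (n_classes * count), so rare classes count more."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: total / (n_classes * n) for label, n in class_counts.items()}
```

Reporting precision and recall alongside accuracy shows whether a high accuracy hides poor detection of the smaller class, and the standard deviation over folds indicates how much the accuracy varies between training runs.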

Citation: https://doi.org/10.5194/gi-2023-4-AC2

Viewed

Total article views: 1,307 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
1,013	246	48	1,307	62	64

Views and downloads (calculated since 08 May 2023)

Month	HTML	PDF	XML	Total
May 2023	129	22	7	158
Jun 2023	159	14	9	182
Jul 2023	121	18	6	145
Aug 2023	96	12	1	109
Sep 2023	98	17	0	115
Oct 2023	79	13	3	95
Nov 2023	35	4	0	39
Dec 2023	35	18	3	56
Jan 2024	17	12	0	29
Feb 2024	7	11	0	18
Mar 2024	16	15	1	32
Apr 2024	17	6	1	24
May 2024	12	7	2	21
Jun 2024	8	5	3	16
Jul 2024	11	2	1	14
Aug 2024	12	4	1	17
Sep 2024	5	3	1	9
Oct 2024	8	4	0	12
Nov 2024	12	2	1	15
Dec 2024	2	5	0	7
Jan 2025	12	6	0	18
Feb 2025	18	5	0	23
Mar 2025	15	12	6	33
Apr 2025	15	4	1	20
May 2025	29	10	1	40
Jun 2025	33	5	0	38
Jul 2025	12	10	0	22

Viewed (geographical distribution)

Total article views: 1,267 (including HTML, PDF, and XML) Thereof 1,267 with geography defined and 0 with unknown origin.

Latest update: 10 Jul 2025

Short summary

Thanks to technological developments, the volume of seismic signals being collected is growing rapidly. Unfortunately, the use of these additional data is limited by the human capacity to handle them in a reasonable time. That's why, in this paper, we propose to “transfer” human experience into an artificial-intelligence-based system able to automatically classify data collected by seismometers as “good” or “bad” using spectral diagram images.