Evaluating methods for reconstructing large gaps in historic snow depth time series

Historic measurements are often temporally incomplete and may contain longer periods of missing data, whereas climatological analyses require continuous measurement records. This also holds for historic manual snow depth (HS) measurement time series, for which even whole winters can be missing in a station record, and suitable methods have to be found to reconstruct the missing data. Daily in situ HS data from 126 nivo-meteorological stations in Switzerland in an altitudinal range of 230 to 2536 m above sea level are used to compare six different methods for reconstructing long gaps in manual HS time series by performing a “leave-one-winter-out” cross-validation in 21 winters at 33 evaluation stations. Synthetic gaps of one winter length are filled with bias-corrected data from the best-correlated neighboring station (BCS), inverse distance-weighted (IDW) spatial interpolation, a weighted normal ratio (WNR) method, elastic net (ENET) regression, random forest (RF) regression and a temperature index snow model (SM). Methods that use neighboring station data are tested in two station networks with different density. The ENET, RF, SM and WNR methods are able to reconstruct missing data with a coefficient of determination (r²) above 0.8 regardless of the two station networks used. The median root mean square error (RMSE) in the filled winters is below 5 cm for all methods.


Introduction
Climatological analyses require continuous measurement series of meteorological data. Unfortunately, historical measurement series are prone to containing periods of missing data. Longer data gaps can, for example, originate from temporarily abandoning a measurement site, not properly reporting measurements or archiving errors. Therefore, periods of missing data ideally need to be interpolated prior to execution of any analysis. This also applies to manual snow depth (HS) measurement time series. For example, many instances of a whole winter of missing data are present in the manual station HS data records in Switzerland. At the same time, long-term continuous records of HS are, for example, necessary to perform climatological trend analyses (e.g., Matiu et al., 2021), to verify modeling studies (e.g., Olefs et al., 2020) or to calculate return levels of extreme events for constructional guidelines (e.g., Marty and Blanchet, 2012).
A number of studies have evaluated and compared methods for reconstructing missing data, mostly for the two variables temperature and precipitation (e.g., Kanda et al., 2018; Woldesenbet et al., 2017; Yozgatligil et al., 2013; Kemp et al., 1983). For longer gaps, inter-station approaches are usually used whereby missing data from one station are imputed with the help of one or more neighboring stations (Massetti, 2014). For this purpose, most often multiple regressions, weighted averages or ratios of average values between the neighboring station and the station to be filled are used (Woldesenbet et al., 2017; Tardivo and Berti, 2012; Auer et al., 2007). More recently, machine-learning approaches have also been used to estimate missing values (Kim and Pachepsky, 2010; Kashani and Dinpashoh, 2012).

Published by Copernicus Publications on behalf of the European Geosciences Union.
Snow depth is the result of an interplay between temperature and precipitation as well as the radiation-driven energy budget. Therefore, it is unclear if the methods developed for the reconstruction of other meteorological parameters are also easily applicable for snow depth time series. Additionally, for inter-station approaches there might be the problem of different relationships during the accumulation and ablation phase between stations, which could hinder such approaches (Bales et al., 2018). This might be especially true for stations at different elevations. Inter-station approaches are limited by the fact that a suitable set of reference stations needs to be available. Additionally, different predominant macroscale weather patterns from one winter to the next can violate the assumption that relationships between stations are stationary. If other meteorological parameters have been continuously measured in the period of missing HS at the target station, HS can also be derived from these parameters with snow models. For the climatological use case in which measured data are often limited by the number of input variables and the temporal resolution, temperature index models can be used for this task as they only need daily precipitation and mean temperature as input variables. Although temperature index models are very simplistic and, for example, neglect effects such as snow redistribution by wind, they have been used in snow climatological impact studies (e.g., Marke et al., 2018; Notaro et al., 2011). Flat field locations, which are often characteristic for snow measurement sites, are thought to be less affected by such effects.
Reconstruction of HS data has been done by several studies (e.g., Brown, 1996; Brown et al., 2003; Witmer, 1984; Falarz, 2002; Avanzi et al., 2020). Some of the studies focus on shorter gaps in hourly automatic measured snow data (Avanzi et al., 2020), while other studies focus on monthly means and employ very simple statistical models based on temperature only (Hughes and Robinson, 1993; Brown et al., 1995). For daily data, weighted averages of HS data from neighboring stations are employed (Matiu et al., 2021). Schöner and Koch (2016) use spatial averages and a temperature index model to reconstruct missing daily HS data in a project of the Austrian meteorological service. However, except for Witmer (1984), who compares spatial interpolation methods for short gaps, no general comparison of different methods for reconstructing long gaps in daily HS time series exists to our knowledge. It remains unclear which methods are most appropriate for climatological analyses because the existing methods from different studies are not easily comparable and are also only applicable for specific setups. For climatological analyses covering snow, most often annual or seasonal snow climate indicators are used to evaluate trends and changes in the snow cover rather than the daily values (e.g., Marty, 2008; Beniston, 2012; Buchmann et al., 2021; Marke et al., 2018; Olefs et al., 2020). These snow climate indicators, for example mean snow depth or duration of the snow cover, are derived from daily data. However, no such studies evaluate the influence of missing data and gap filling procedures on these snow climate indicators. With this study, we perform a quantitative comparison of different methods for reconstructing typical year-long gaps in manual daily HS time series with a focus on climatological analyses and the ability to reproduce important annual snow climate indicators. A specific aim is to test the performance of simple temperature index models because gaps often occur at the beginning of a measurement series (i.e., in the first half of the 20th century) when no suitable neighboring stations are typically available. We compare different spatial interpolation methods as well as a simple snow model by imputing synthetic gaps in a "leave-one-winter-out" cross-validation study. The remainder of the paper is structured as follows: the data and methods used are described in Sect. 2, results are presented and discussed in Sect. 3, and concluding remarks are given in Sect. 4.

Data and methods
We use daily manual snow depth, mean temperature and sum of precipitation data from 126 nivo-meteorological stations in Switzerland. The majority (93) of the stations primarily measure snow-related variables and not necessarily temperature and precipitation. The stations are either operated by the Swiss Federal Office of Meteorology and Climatology (MeteoSwiss) or by the WSL Institute for Snow and Avalanche Research SLF (SLF), and data are provided by these two institutions. The data cover 21 hydrological years in the period between 1999 and 2020. A hydrological year is defined as the period from September until the end of August. The snow depth is measured manually between 07:00 and 08:00 local time each morning from a fixed snow stake and has the date stamp of the day of measurement. Although many stations already measured snow before 1999, we decided to use only the last 21 years in order to have as many complete and thoroughly quality-controlled time series in our station set as possible. The 21-year time period was chosen because we wanted to have a long enough dataset on the one hand (containing a few well-known snow-abundant and snow-scarce years) and a common (realistic) length of available snow depth time series for the training period (see below) on the other hand. The daily sum of precipitation data covers the period 07:00 of the previous day until 07:00 local time and has the date stamp of the previous day. Mean temperature is aggregated over the whole day and has no date shift. The change in an HS measurement of date i relative to the preceding measurement is therefore influenced by the precipitation of date i − 1 and a combination of the temperature on the two dates i and i − 1. To test methods for reconstructing missing data in a controlled environment, a leave-one-winter-out cross-validation is performed. Data for one winter (November-April) are deleted (gap period), and in the case that parameter training is required for the respective method, this is done with the winter data for the remaining 20 winters (training period). Locations of the stations used in the cross-validation study can be seen in Fig. 1. We test the spatial interpolation methods in two different station networks in order to assess sensitivity to sparser station networks. Sparser networks can be expected in areas of the world which are not as densely populated as Switzerland or in earlier times such as in the mid-20th century when far fewer stations measured snow depth in Switzerland. The dense network contains 33 evaluation stations (blue triangles in Fig. 1) as well as an additional 93 neighboring predictor stations (orange squares in Fig. 1) and covers stations in an altitudinal range of 230 to 2536 m above sea level. The sparser network consists of the evaluation stations only and covers an altitudinal range of 273 to 1970 m above sea level. If two stations were situated closer than 3 km to each other, one of the two stations was excluded from the station sets. In order to test every method at the same set of stations, evaluation stations are chosen such that they have a continuous record for all three variables HS, temperature and precipitation. Therefore, gaps are only filled at the evaluation stations of both station networks. For the stations ARO, DAV and ULR we combined temperature and precipitation data measured by MeteoSwiss with HS data that were measured by the SLF at a nearby partner station. Gaps shorter than 3 d in the HS time series (only rarely occurring) have been filled by linear interpolation. If any variable had data gaps longer than 3 d, the corresponding station was excluded from the station dataset.
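The leave-one-winter-out procedure described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation; the function names `winter_of` and `leave_one_winter_out`, and the convention of labelling a winter by the calendar year in which it ends, are assumptions made here.

```python
from datetime import date

WINTER_MONTHS = {11, 12, 1, 2, 3, 4}  # November-April, as stated in the text


def winter_of(day):
    """Return the winter a date belongs to (labelled by its ending year), or None.

    The labelling convention is an assumption made for this sketch.
    """
    if day.month not in WINTER_MONTHS:
        return None
    return day.year + 1 if day.month >= 11 else day.year


def leave_one_winter_out(days):
    """Yield (gap_winter, gap_indices, training_indices) for every winter.

    Indices refer to positions in `days`. Only winter days of the remaining
    winters enter the training period; summer days are ignored.
    """
    labels = [winter_of(d) for d in days]
    winters = sorted({w for w in labels if w is not None})
    for gap in winters:
        gap_idx = [i for i, w in enumerate(labels) if w == gap]
        train_idx = [i for i, w in enumerate(labels) if w is not None and w != gap]
        yield gap, gap_idx, train_idx
```

Each method would then be trained on the training indices and scored on the gap indices, once per station and winter.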

Selection of neighboring stations for spatial interpolation methods
Six different methods are employed to interpolate a missing winter of snow depth data at a certain station with the help of neighboring stations or by using measured meteorological data at the gap station. In the case that neighboring stations are used as predictors for reconstructing the missing data, these stations have to be within a radius of 200 km and show an absolute elevation difference of less than 500 m. We choose these limits based on a correlation analysis of Matiu et al. (2021). For all methods which use HS data from neighboring stations, the n best-correlated neighboring stations are chosen as predictor stations. If fewer than n stations meet the constraints defined above, the number of predictor stations is reduced accordingly.

Best-correlated station (BCS)
For the best-correlated station (BCS) method, the gap is filled with data from the best-correlated neighboring station, and the constraints defined in Sect. 2.1.1 have to be fulfilled. As a simple bias correction measure, the data from the BCS are multiplied with the ratio of the mean at the target site to the mean at the BCS calculated in the training period.
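The neighbor selection and the mean-ratio bias correction can be illustrated with a short sketch. It assumes that each station is a plain dictionary with the hypothetical keys `dist_km` (distance to the target), `elev_m` and `hs_train` (HS in the training period), and that "best-correlated" refers to the Pearson correlation in the training period; none of these details are specified in the text.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long series (assumes non-constant input)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def select_predictors(target, candidates, n, max_dist_km=200.0, max_dz_m=500.0):
    """Return up to n best-correlated stations within 200 km and 500 m elevation difference."""
    eligible = [
        c for c in candidates
        if c["dist_km"] <= max_dist_km
        and abs(c["elev_m"] - target["elev_m"]) < max_dz_m
    ]
    eligible.sort(key=lambda c: pearson(target["hs_train"], c["hs_train"]), reverse=True)
    return eligible[:n]  # fewer than n if not enough stations qualify


def bcs_fill(target_train, neighbor_train, neighbor_gap):
    """Scale neighbor HS in the gap by the ratio of training-period means."""
    ratio = (sum(target_train) / len(target_train)) / (sum(neighbor_train) / len(neighbor_train))
    return [ratio * v for v in neighbor_gap]
```

With `n=1`, `select_predictors` returns the single station whose gap-period data would be passed to `bcs_fill`.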

Inverse distance weighting (IDW)
The inverse distance weighting (IDW) method uses a weighted spatial average of neighboring stations to impute missing values at the target station, neglecting any elevation gradients. Weights are the inverse squared distance of the respective neighboring station to the target station such that

$$\hat{y} = \frac{\sum_{i=1}^{n} y_i / d_i^2}{\sum_{i=1}^{n} 1 / d_i^2},$$

where $\hat{y}$ is the estimated snow depth at the target station, $n$ is the number of neighboring reference stations, $y_i$ is the snow depth at neighboring station $i$ and $d_i$ is the distance of the neighboring station $i$ to the target station. Imputed values are rounded to the nearest centimeter integer. Besides nearest-neighbor and non-weighted local averages, IDW is one of the most often-used methods for reconstructing climatological data (Beguería et al., 2019; Kanda et al., 2018).
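As a worked illustration of the inverse-squared-distance weighting described above, the following sketch computes one imputed value. The function name `idw_estimate` is hypothetical, and rounding half up is an assumption, since the text only states that values are rounded to the nearest centimeter.

```python
def idw_estimate(values_cm, distances_km):
    """Inverse-squared-distance weighted mean of neighbor snow depths, in whole cm."""
    weights = [1.0 / d ** 2 for d in distances_km]
    estimate = sum(w * v for w, v in zip(weights, values_cm)) / sum(weights)
    # Round half up; valid here because snow depth is never negative.
    return int(estimate + 0.5)
```

For two neighbors with 10 cm at 1 km and 30 cm at 2 km, the weights are 1 and 0.25, so the estimate is (10 + 7.5) / 1.25 = 14 cm. The closer station dominates, and elevation plays no role.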

Weighted normal ratio (WNR)
Matiu et al. (2021) use a variation of the weighted normal ratio (WNR) method for filling short and longer gaps (up to a few years) in daily snow depth time series. The normal ratio method was first introduced by Paulhus and Kohler (1952) and assumes a constant ratio of the average state of two neighboring stations (Young, 1992; Yozgatligil et al., 2013).
Missing values are filled by

$$\hat{y} = \frac{\sum_{i=1}^{n} w_i \, \frac{\bar{y}}{\bar{y}_i} \, y_i}{\sum_{i=1}^{n} w_i},$$

where $n$ is the number of neighboring reference stations, $y_i$ is the snow depth at neighboring station $i$, $\bar{y}$ and $\bar{y}_i$ are the mean snow depth at the target station and reference station $i$ in the training period, respectively, and $w_i$ is the weight of station $i$ based on the vertical distance $Z - Z_i$ calculated as

$$w_i = \exp\left(-4 \ln 2 \, \frac{(Z - Z_i)^2}{(500\,\mathrm{m})^2}\right),$$

which is a Gaussian weight function with a full width at half maximum of 500 m. Reconstructed values are rounded to the nearest centimeter integer. In order to have equal conditions within our method comparison, the selected neighboring stations do not need to have a correlation coefficient larger than 0.7 with the target, contrary to the WNR method used in Matiu et al. (2021).
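The WNR estimate can be sketched as follows. Two points are assumptions of this sketch rather than statements from the text: the weighted ratios are normalized by the sum of the weights, and the Gaussian is parameterized directly by its full width at half maximum so that a vertical distance of 250 m yields a weight of 0.5. The function names are hypothetical.

```python
from math import exp, log

FWHM_M = 500.0  # full width at half maximum of the Gaussian weight, from the text


def gaussian_weight(dz_m, fwhm_m=FWHM_M):
    """Gaussian weight of a neighbor at vertical distance dz_m (1 at dz = 0)."""
    return exp(-4.0 * log(2.0) * (dz_m / fwhm_m) ** 2)


def wnr_estimate(values_cm, neighbor_means, target_mean, dz_m):
    """Weighted normal ratio estimate of snow depth at the target, in whole cm.

    Each neighbor value is scaled by the ratio of training-period means
    (target / neighbor) and weighted by vertical distance.
    Normalizing by the sum of weights is an assumption of this sketch.
    """
    weights = [gaussian_weight(dz) for dz in dz_m]
    scaled = [target_mean / m * v for m, v in zip(neighbor_means, values_cm)]
    estimate = sum(w * s for w, s in zip(weights, scaled)) / sum(weights)
    return int(estimate + 0.5)  # round half up, snow depth is non-negative
```

Unlike IDW, horizontal distance does not enter the weights, and the mean ratio compensates for systematic differences in snow depth between stations.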

Elastic net (ENET) regression
As a fourth method for reconstructing missing HS data at a target station, we use a multiple linear regression of the HS data from the best-correlated neighboring stations. As the neighboring stations often are correlated with each other as well, we use elastic net (ENET) regularization to reduce the variance of the model (Zou and Hastie, 2005; Friedman et al., 2010). Elastic net combines the l1 regularization term employed in LASSO (Tibshirani, 1996) and the l2 regularization term used in ridge regression (Hoerl and Kennard, 1970) and is thus able to deal with multicollinearity in the predictors. The ratios between l1 and l2 regularization and the hyperparameter α are optimized in a 5-fold cross-validation on the data in the training period. Before fitting and predicting with the model, predictors and the target are standard-scaled to have a mean of 0 and standard deviation of 1 based on the data in the training period. Reconstructed values are rounded to the nearest centimeter integer, and negative predicted values are clipped to zero.

Random forest (RF) regression
As a fifth method we employ random forest (RF) regression as a nonlinear combination of neighboring stations. A random forest is an ensemble of decision trees that are trained on random subsets of the training data (Breiman, 2001). The prediction of the ensemble is the average of the individual trees. As predictors, we use the best-correlated neighboring stations that satisfy the requirements defined in Sect. 2.1.1. In order to capture potentially different relationships between stations in the course of a snow season, we additionally pass the three seasons early winter (November, December), midwinter (January, February) and late winter (March, April) as a categorical predictor to the model.
Prior to fitting the model, this "seasons" predictor is one-hot encoded, whereas the other predictors of neighboring station HS data are standard-scaled as for the elastic net regression (Sect. 2.1.5). The random forest model has a tree number of 200 and a maximum depth of 70.
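The predictor preparation described for the regression methods can be sketched without any modelling library. The names below are hypothetical, and the use of the population standard deviation for scaling is an assumption; the text only states that scaling statistics come from the training period.

```python
SEASONS = ("early", "mid", "late")
SEASON_OF_MONTH = {11: "early", 12: "early", 1: "mid", 2: "mid", 3: "late", 4: "late"}


def one_hot_season(month):
    """One-hot encode the season (early winter, midwinter, late winter) of a winter month."""
    season = SEASON_OF_MONTH[month]
    return [1 if season == s else 0 for s in SEASONS]


def fit_scaler(train_values):
    """Return mean and (population) standard deviation of the training period."""
    mean = sum(train_values) / len(train_values)
    var = sum((v - mean) ** 2 for v in train_values) / len(train_values)
    return mean, var ** 0.5


def scale(values, mean, std):
    """Standard-scale values with statistics fitted on the training period only."""
    return [(v - mean) / std for v in values]
```

Fitting the scaler on the training period and reusing it for the gap period keeps information from the deleted winter out of the model inputs.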

Snow model (SM)
As a last method we make use of a simple snow model (SM).
The snow model consists of a temperature index model, which is then coupled to a density model to estimate the snow depth.For estimating snow water equivalent (SWE) in the snowpack, we use the Snow-17 model, which uses a temperature index approach with a seasonally varying melt factor (Anderson, 1973).However, we do not use the density parameterization described in the former reference.Instead, we post-process the SWE time series of the temperature index model with a very simple density model.The density model uses an approach based on Martinec and Rango (1991) in which a time-dependent density for the different layers in the snowpack is assumed.
Each layer that is identified by an increase in SWE has an initial new snow density ρ0, which increases over time according to Eq. (3) at each time step t until it reaches a maximum density ρmax. When SWE decreases during a day, the density model removes layers from the top of the snowpack to compensate for the loss in SWE. During the cross-validation, only the parameters of the density model ρ0, ρmax and τ are optimized by grid searching a predefined reasonable parameter space for each station and synthetic gap individually to minimize the root mean squared error (RMSE) in the training period. No parameter optimizations are done for the melt and accumulation model, and the parameters defined in Anderson (1973) are used. We considered using a combined temperature of 2 d to correspond with the interval of precipitation and HS data (see Sect. 2). However, we found negligible differences in model performance and decided to leave the input data as they are to avoid potential smoothing of temperature signals. In contrast to the inter-station methods described above, we apply the snow model over the full hydrological year in order to account for snow which has already built up by November. However, scoring is only done in the winter months November-April.
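The layer bookkeeping of the density model can be sketched as below. The densification law is not reproduced in the text shown here, so the exponential relaxation from ρ0 towards ρmax with time constant τ used in `layer_density` is purely an assumed placeholder, as are the default parameter values and all function names. Only the layer logic (new layer on SWE increase, removal from the top on SWE loss) follows the description.

```python
from math import exp


def layer_density(age_days, rho_0, rho_max, tau):
    """ASSUMED densification law: exponential approach from rho_0 to rho_max (kg m-3)."""
    return rho_max - (rho_max - rho_0) * exp(-age_days / tau)


def snow_depth_from_swe(swe_mm, rho_0=100.0, rho_max=400.0, tau=20.0):
    """Convert a daily SWE series (mm w.e., i.e. kg m-2) to snow depth in cm.

    Default parameter values are illustrative assumptions, not fitted values.
    """
    layers = []  # each layer is [swe_mm, age_days], oldest first, newest on top
    previous = 0.0
    depths_cm = []
    for swe in swe_mm:
        for layer in layers:
            layer[1] += 1  # every existing layer ages by one day
        delta = swe - previous
        if delta > 0:
            layers.append([delta, 0])  # SWE gain creates a new top layer
        elif delta < 0:
            loss = -delta
            while layers and loss > 0:  # SWE loss removes mass from the top
                top = layers[-1]
                removed = min(top[0], loss)
                top[0] -= removed
                loss -= removed
                if top[0] <= 0:
                    layers.pop()
        depth_m = sum(s / layer_density(a, rho_0, rho_max, tau) for s, a in layers)
        depths_cm.append(100.0 * depth_m)
        previous = swe
    return depths_cm
```

A grid search over `rho_0`, `rho_max` and `tau` that minimizes the RMSE against measured HS in the training period would then mirror the calibration step described in the text.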

Evaluation metrics
As score metrics of the reproduced daily HS values we use the RMSE, the coefficient of determination (r²) and the bias. The bias is calculated as the average error.

[...] maximum number of predictor stations. However, a further increase from 15 stations does not yield remarkable differences in median MAAPE and its variance. For RF, RMSE steadily decreases with an increasing maximum number of predictor stations from 3.5 for one predictor station to 2.9 for a maximum number of 25 predictor stations. MAAPE scores for RF are slightly, but not significantly, better for maximum station numbers of 3, 5 and 10 than for higher maximum station numbers.
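The score metrics named in this section can be written out as a short sketch. Several choices are assumptions: r² is computed as the squared Pearson correlation, the bias uses the sign convention predicted minus observed (so that overestimation is positive), and MAAPE is taken to be the mean arctangent absolute percentage error, whose definition is not part of the text shown here.

```python
from math import atan, pi, sqrt


def rmse(obs, pred):
    """Root mean square error."""
    return sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))


def bias(obs, pred):
    """Average error, predicted minus observed (positive means overestimation)."""
    return sum(p - o for o, p in zip(obs, pred)) / len(obs)


def r_squared(obs, pred):
    """Squared Pearson correlation (assumed definition of r2, non-constant input)."""
    mo, mp = sum(obs) / len(obs), sum(pred) / len(pred)
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_p = sum((p - mp) ** 2 for p in pred)
    return cov ** 2 / (var_o * var_p)


def maape(obs, pred):
    """Mean arctangent absolute percentage error (assumed definition of MAAPE).

    For obs == 0 the arctangent of an infinite ratio is pi/2; the case
    obs == pred == 0 is counted as zero error, which is a convention chosen here.
    """
    total = 0.0
    for o, p in zip(obs, pred):
        if o == 0:
            total += 0.0 if p == 0 else pi / 2
        else:
            total += atan(abs((p - o) / o))
    return total / len(obs)
```

The bounded range of the arctangent keeps days with zero or near-zero observed snow depth from dominating the score.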
Some of the methods are more sensitive to the maximum number of neighboring stations used than others. The deterministic approaches (IDW, WNR) lose skill with more stations because additional stations introduce unnecessary noise. This is the reason why other studies that use regional averages or simple linear regressions also use only few neighboring stations for reconstructing missing data (e.g., Matiu et al., 2021; Tardivo and Berti, 2014). Regularization measures, which are included in both the ENET and RF regression, allow choosing the best predictors from a given set of predictor stations. Therefore, overfitting is prevented even for a larger number of predictors with these two methods. Tests on how many predictor stations are influential for the random forests showed that only few stations (fewer than ∼5) share the majority of feature importance. The selected number of maximum neighboring stations for the method comparison in Sect. 3.2.1 and 3.2.2 is mainly based on the median RMSE and MAAPE scores presented earlier. If scores from two different maximum numbers of predictor stations are approximately equal for one method, we decided to use the lower number of stations to keep the method as simple as possible. Accordingly, we use the maximum numbers of predictor stations listed in Table 1 for the comparison of different methods in the following sections.

Daily values
Predicted daily values are plotted against measured daily values for the different methods and station densities in Fig. 3. Values are aggregated over every filled gap in the cross-validation. The three score metrics r², RMSE and bias are indicated in each panel. For both the sparse and dense station network, ENET regression almost always yields the best results for all score metrics, closely followed by RF regression and the WNR method. In the dense station network, WNR, ENET and RF have similar score values, with RMSE ranging between 6.5 and 7.0, similar r² of 0.94, and an equally small bias of 0.06 for ENET and RF as well as a bias of −0.07 for WNR. BCS closely follows WNR, ENET and RF in the dense station network with r² of 0.92, RMSE of 7.6 and a bias of −0.1. IDW performs more poorly than the four aforementioned methods with r² of 0.85, RMSE of 10.6 and a positive bias of 1.78. The snow model performs similarly to IDW in the dense station network in terms of RMSE and r², with RMSE of 10.2 and r² of 0.86. SM predictions are negatively biased with a bias of −0.74. The SM thus cannot compete with the WNR, BCS, ENET and RF methods in the dense station network. However, the SM (in contrast to IDW) can compete with the WNR and BCS methods in the sparse station network, for which the RMSE of WNR and BCS increases by ∼35 % and ∼40 % compared to the dense station network, respectively. RF and ENET are less sensitive to station network density than the WNR and BCS methods, but performance still decreases with decreasing station network density. RMSE in the sparser station network increases by ∼30 % compared to the dense station network for RF and ENET. IDW is the most sensitive to station network density. Its RMSE in the sparse station network increases by ∼75 %, and explained variance is significantly lower with r² of 0.55 in the sparse station network.
The RMSE scores and bias of daily values aggregated over all reconstructed gaps are about twice as high as the median RMSE and bias obtained from each gap individually (Fig. A2 in the Appendix).

Annual snow climate indicators
HSavg, HSmax and dHS1 derived from the reconstructed daily data (Sect. 3.2.1) are plotted against the same snow climate indicators derived from the measured data in Fig. 4. The score values bias, RMSE and the coefficient of determination (r²) accompanying the data shown in Fig. 4 are listed in Table 2. Absolute errors of the same snow climate indicators derived from reconstructed data versus those derived from the measured data in the reconstructed winters are shown in Fig. 5.
BCS, WNR, ENET, RF and SM yield unbiased reconstructions of HSavg for both the dense and the sparser station network with a bias smaller than 0.15 cm. For all methods, RMSE for HSavg is about 30 % to 40 % smaller than the RMSE derived from the aggregated daily values (see Sect. 3.2.1) for the reconstructions from both the dense and sparser station network. The absolute error of HSavg and HSmax increases with an increasing HSavg for all methods (Fig. 5). However, the increase is much larger for BCS and IDW in the case of the sparser station network. HSmax derived from the filled gaps shows a ∼5 %-10 % lower explained variance than HSavg. RMSE values for HSmax are larger than for HSavg but should be compared cautiously because of the different scales of the two snow climate indicators. BCS, WNR, ENET, RF and the SM yield negatively biased HSmax with biases ranging from −2.3 to −7.4 cm in the dense and −1.6 to −7.4 cm in the sparse station networks, respectively. IDW shows a slightly positive bias of 2.8 and 2.9 cm for the dense and sparse station networks, respectively. Median absolute errors of HSmax increase with [...]

https://doi.org/10.5194/gi-10-297-2021 Geosci. Instrum. Method. Data Syst., 10, 297-312, 2021

The dHS1 is reproduced less precisely than HSavg with ∼10 %-20 % lower explained variance r². All methods apart from BCS and SM strongly overestimate the number of snow days with HS ≥ 1 cm of the reconstructed winters, with a positive bias of 14.6 to 18.4 d for the dense station network and 16.0 to 23.3 d for the sparse station network. However, the BCS also slightly overestimates dHS1 with a bias of 3.7 and 6.6 d in the dense and sparse station networks, respectively. All methods (except SM by the method definition) experience an increase in bias of dHS1 in the sparse station network compared to the dense station network. For all methods, the absolute error of dHS1 is largest in winters with HSavg below 40 cm.

Applicability and limitations
Snow depth appears to be a well-behaved variable with respect to reconstructing missing data. All methods except IDW are able to reconstruct HS with a coefficient of determination above 0.8 regardless of the two station networks used.
The choice of method depends on the use case (daily values or derived annual climate indicators) and the setting (station network, surrounding topography, gaps in neighboring stations) in which one wants to reconstruct the data. A qualitative assessment for the suitability of the different methods in different situations and for different applications is given in Table 3.
In a very dense station network such as the one in Switzerland, BCS is able to reproduce annual snow climate indicators HSavg, HSmax and dHS1 with r² above 0.8 and RMSE below 10 cm for the reconstructed daily HS values. This performance could probably be improved with more advanced bias correction of the neighboring station such as quantile mapping (Gudmundsson et al., 2012). However, simple approaches such as BCS, IDW and to a smaller extent WNR are sensitive to the density and representativity of the station network. While this is true for every method that uses neighboring stations, more sophisticated methods such as ENET and the nonlinear RF regression largely retain their skill for sparser station networks. Consequently, ENET and RF are, besides the SM, the most promising candidates in regions with a sparser station network.
Simple spatial averaging with IDW is not able to reproduce strong gradients that are present in an alpine topography. We therefore also tested the gradient-plus-inverse-distance-squared (GIDS) method (not shown in results) introduced by Nalder and Wein (1998), which was used in a project of the Austrian meteorological service for imputing gaps in HS time series (Schöner and Koch, 2016). In the sparse network GIDS performed even worse than IDW, which is in accordance with Price et al. (2000), who observed poor results with GIDS for temperature and precipitation reconstruction in areas with strong topography. Nalder and Wein (1998) compare GIDS to kriging-based methods. We also expect a strong dependence on station network density for kriging and therefore refrained from including these kinds of methods in our method comparison. However, in dense station networks, kriging can be an alternative approach to our proposed methods for interpolating snow depth data, especially when it comes to spatially continuous reconstructions and not only estimations on a single point.
Buchmann et al. (2021) evaluated the natural variability of annual snow climate indicators by comparing data from parallel station pairs (< 3 km distance and < 100 m elevation difference). They find RMSE for HSavg within a station pair to be in the same range as RMSE for reproduced HSavg with the ENET, RF, WNR and SM methods. This indicates that HSavg can be reproduced reasonably well with these four methods. Even the best-performing method in our comparison study cannot reach the quality of a parallel station pair for HSmax and dHS1. RMSE of the RF method is 2 and 4 times larger, respectively, than the median RMSE within a parallel station pair for these two snow climate indicators (Buchmann et al., 2021). For all methods, the highest median absolute errors and bias for dHS1 can be observed in winters with low HSavg. These winters are often characterized by an ephemeral snow cover which builds up and vanishes again in the course of the winter. Temperature index models are prone to problems with this kind of snow cover, which could explain the weaker performance of the SM method in these conditions (Hughes and Robinson, 1993; Gray and Landine, 1988). The positive bias of dHS1 for the methods that use several neighboring stations may be explained as follows. The probability that at least one of the neighboring stations has snow on a certain day is higher than the probability of snow at the target station. Since most of the methods combine data from the neighboring stations, this will statistically result in more days with snow. When trying to minimize bias in dHS1, it is therefore best to rely on only few neighboring stations. Accordingly, BCS yields predictions for dHS1 that have a lower positive bias. One possible approach to reproduce dHS1 more accurately than deriving it from reconstructed daily values could be to model dHS1 directly. This could be realized by fitting a nonlinear statistical model such as random forest to the dHS1 series of the target station with dHS1 series derived from neighboring stations as predictors. However, the reduced number of data points would ideally require a longer training period of simultaneous measurements at target and neighboring stations. The number of snow-covered days can be defined with different thresholds. While a large positive bias for the 1 cm threshold (dHS1) can be observed for all methods, this bias decreases with increasing thresholds for the snow-covered days (see Table A1 in the Appendix). For the number of snow days with HS ≥ 10 cm (dHS10) the bias is less than 2 d for all methods and decreases further for the number of snow days with HS ≥ 30 cm (dHS30). The coefficient of determination also increases with an increasing snow-day threshold.
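The snow climate indicators and the mechanism behind the positive dHS1 bias can be made concrete with a small sketch. It assumes that HSavg is the mean and HSmax the maximum of the daily values in a winter, and that dHSx counts days with HS of at least x cm; the helper `rounded_mean` is a hypothetical stand-in for any method that averages several neighbors.

```python
def snow_indicators(hs_cm, thresholds=(1, 10, 30)):
    """Return HSavg, HSmax and the number of days with HS >= each threshold (cm)."""
    result = {"HSavg": sum(hs_cm) / len(hs_cm), "HSmax": max(hs_cm)}
    for t in thresholds:
        result[f"dHS{t}"] = sum(1 for v in hs_cm if v >= t)
    return result


def rounded_mean(series_list):
    """Day-by-day unweighted mean of several stations, rounded half up to whole cm."""
    return [int(sum(vals) / len(vals) + 0.5) for vals in zip(*series_list)]
```

For two neighbors with the series `[3, 0, 0, 0]` and `[0, 0, 4, 2]`, the individual dHS1 values are 1 and 2, whereas their rounded mean `[2, 0, 2, 1]` has a dHS1 of 3. Combining stations whose thin snow covers occur on different days therefore inflates the number of snow days, while a higher threshold such as 10 cm is unaffected in this example.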
An option to increase the skill of the deterministic methods BCS, IDW and WNR is to apply stricter constraints to the neighboring stations as done in Matiu et al. (2021) by introducing a correlation constraint to the neighboring stations (see Sect. 2.1.4).In the station networks applied in this study, this would lead to a failure in filling data in 15 % and 20 % of the filled gaps (station years) for the dense and sparse station network, respectively.These cases occurred mostly for stations at low elevations (AIG, ALT, GVE, SIO, VIS; see Fig. 1) with an ephemeral snow cover.
Due to semiautomatic quality control procedures and careful station preselection, our test dataset only contained very few missing HS values for the reference stations.However, this is rather unlikely to be encountered in a real application.Missing values in neighboring stations can be handled differently by different methods.ENET does not allow a single missing value in one of the neighboring stations in the training and gap period.On the other hand, RF and the WNR method are able to deal with missing values in the predictor stations, which is a huge asset when it comes to applicability.The effect of missing values in neighboring stations on performance has not been tested in this study.However, this is an important point to keep in mind when trying to apply any of the evaluated methods.For RF, it is also possible to add other non-snow-depth categorical or continuous predictors such as the mean HSavg anomaly of the predictor stations or prevailing large-scale atmospheric conditions in the winter of interest.We tested an RF version with an additional categorical predictor calculated from binned quantiles of the mean of all predictor stations used but did not see any improvement over the simpler version using only the season as a categorical predictor.
One potential limitation of the SM approach is that if the snow measurements are interrupted at a certain station, other variables needed as input for the snow model could also be missing. However, this is a rather unlikely case to encounter, at least in the dataset for Switzerland. Temperature and precipitation traditionally have a higher priority for weather services than the variables associated with snow; therefore, if an issue occurred at a station, the probability that these two classic meteorological variables continued to be measured is higher than for any snow variable. After the automation of many weather stations (not for snow) in Switzerland in the 1980s, long gaps in the temperature and precipitation record are even less likely to be encountered. If other variables such as wind and incoming shortwave and longwave radiation are also available at high temporal resolution for a station, a more sophisticated snow model such as SNOWPACK (Bartelt and Lehning, 2002; Lehning et al., 2002) or CROCUS (Brun et al., 1989, 1992) would probably improve the performance of the gap reconstruction. These physics-based models cover processes such as erosion by wind and are thought to better represent settling and melting than the very simple approach used in our study. However, the required input data are, if at all, only available for the most recent decades.
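To make concrete why temperature and precipitation alone suffice as input for a temperature index approach, the following is a generic degree-day sketch. It is not the snow model used in this study: the function name `degree_day_swe`, the degree-day factor and both temperature thresholds are illustrative assumptions, and the output is snow water equivalent (SWE) rather than snow depth, so a density assumption would still be needed to obtain HS.

```python
def degree_day_swe(temps_c, precip_mm, ddf=3.0, t_snow=1.0, t_melt=0.0):
    """Generic degree-day snow model returning daily SWE in mm.

    All parameter values are illustrative assumptions:
    ddf    -- degree-day factor in mm per degree C per day
    t_snow -- air temperature at or below which precipitation falls as snow
    t_melt -- air temperature above which melt occurs
    """
    swe = 0.0
    series = []
    for temp, precip in zip(temps_c, precip_mm):
        if temp <= t_snow:
            swe += precip  # accumulate solid precipitation
        melt = ddf * max(temp - t_melt, 0.0)  # potential melt for the day
        swe = max(swe - melt, 0.0)  # snowpack cannot become negative
        series.append(swe)
    return series


# Hypothetical daily mean temperature (degrees C) and precipitation (mm).
temps = [-5.0, -2.0, 0.5, 4.0, 6.0]
precip = [10.0, 5.0, 2.0, 0.0, 0.0]
print(degree_day_swe(temps, precip))
```

The sketch needs no neighboring snow stations at all, which is the property that makes this class of model attractive when only temperature and precipitation records survive a gap in the snow measurements.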
A general limitation of our analysis may be the fact that the sparse station network is still dense when compared to station networks present in other regions of the world (Gubler et al., 2017). If the station network is sparser than in our example, the snow model and RF should be favored over the other approaches as both these methods show the least sensitivity to station network density in our analysis. Especially in data-sparse regions, the probability of having temperature and precipitation data available is much higher than for snow depth observations, which points towards the use of a snow model for data reconstruction. Alternatively, one could make use of output from reanalysis products such as ERA5-Land (Muñoz Sabater et al., 2021). If available, snow depth can be used directly from the reanalysis product, or other meteorological variables from the reanalysis product can be used to model snow depth with a snow model. Either way, some sort of downscaling is necessary since reanalysis products are available in a spatial resolution of about 10 km or more. This can, for example, be done statistically by using the random forest model described in Sect. 2.1.6 with data from the nine surrounding grid cells of the target station as predictor variables. This method would be independent of neighboring stations and can be applied worldwide if a global reanalysis product is employed. However, the low spatial resolution of reanalysis products will always limit application in complex mountainous terrain. Moreover, reanalysis products often suffer from a temperature bias (e.g., Scherrer, 2020), which is crucial for a variable as temperature-sensitive as snow cover.
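The idea of using the nine surrounding grid cells as predictor variables can be sketched as follows. This is an untested illustration under assumptions, not part of the study: the function names, the nested-list grid layout and the example values are hypothetical, and the "nine surrounding grid cells" are interpreted here as the 3 x 3 block centered on the cell containing the target station.

```python
def neighborhood_predictors(grid, row, col):
    """Return the values of the 3 x 3 block centered on (row, col), row by row."""
    n_rows, n_cols = len(grid), len(grid[0])
    if not (1 <= row < n_rows - 1 and 1 <= col < n_cols - 1):
        raise ValueError("target cell must not lie on the grid edge")
    return [
        grid[r][c]
        for r in range(row - 1, row + 2)
        for c in range(col - 1, col + 2)
    ]


def build_feature_matrix(daily_grids, row, col):
    """Stack one nine-value predictor row per day."""
    return [neighborhood_predictors(grid, row, col) for grid in daily_grids]


# Hypothetical reanalysis snow depth fields (cm) for two days on a 4 x 4 grid.
day_1 = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
day_2 = [[v + 1 for v in grid_row] for grid_row in day_1]

features = build_feature_matrix([day_1, day_2], row=1, col=1)
print(features[0])
```

Each row of `features` could then serve as the predictor vector for one day in a regression model such as a random forest, trained against the station measurements of HS in the periods where they exist.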
Ultimately, gap filling is often a preceding step when it comes to data homogenization in order to correct time series that show breaks due to station relocations or changes in measurement techniques or instrumentation (Marcolini et al., 2019). These breaks can be accompanied by a period of missing data. Reconstruction methods that are trained on data from both before and after a data gap could level out breaks and potentially complicate their detection and correction. Therefore, it might be advisable to only use a training period from either before or after the data gap. Caution is also necessary when performing, e.g., break detection on reconstructed annual dHS1 series due to the positive biases introduced by most of the methods.

Conclusions
We compared different methods for reconstructing long gaps in daily manual HS data records as well as their ability to reconstruct the annual snow climate indicators HSavg, HSmax and dHS1. The ENET, RF, WNR and BCS methods are able to reproduce daily HS values with a coefficient of determination above 0.9 in the dense and above 0.8 in the sparse station network. Median RMSEs of the filled gaps are below 4 cm for all methods. The SM, which does not need data from neighboring stations, reveals only a slightly lower coefficient of determination (0.86) for daily HS values. In contrast to dHS1, the two annual climate indicators HSavg and HSmax can be reproduced well by BCS, ENET, RF, SM and WNR. All methods except for SM and BCS overestimate dHS1 with a bias of 15 to 23 d. In a sparse station network, a simple snow model is best suited to reproduce dHS1, with an r² of 0.93. The reconstruction errors of HSavg are within the natural variability of a parallel station pair. Snow depth seems to be a relatively good-natured parameter when it comes to gap filling with data from neighboring stations. However, when station networks get sparse, temperature index snow models serve as a suitable alternative to classic inter-station gap filling approaches.
Since most of the methods perform reasonably well, the choice of which method to use depends on the specific use case and setting. If a serially complete, highly correlated station is available, bias-corrected data from this station are easy to calculate and, in many instances, sufficient for a climatological use case. If no meteorological data are available at the target station and if neighboring stations regularly contain missing data as well, WNR is a suitable deterministic approach to reconstruct data from neighboring stations. Missing data in neighboring stations can also be handled by RF. If the station network is sparser than in our study and if neighboring stations are further away and weakly correlated, the snow model, ENET and RF should be favored over the other approaches as these three methods show the least sensitivity to station network density in our analysis. If the focus of the analysis is on dHS1, a simple snow model is best suited to reconstruct a complete missing winter. If no meteorological data are available, BCS should be the fallback solution for dHS1, provided that a suitable reference station is available.

Table A1. Bias, RMSE and coefficient of determination (r²) for the 1 cm (dHS1), 2 cm (dHS2), 5 cm (dHS5), 10 cm (dHS10) and 30 cm (dHS30) thresholds for the number of snow days reconstructed with the different methods in the dense and sparse station networks.

Figure 1. Location of evaluation stations (blue triangles) and predictor stations (orange squares) for the cross-validation study. The background color indicates elevation.
RMSE and bias can be interpreted in the same unit as the HS measurements [cm]. As a fourth metric, we use the mean arctangent absolute percentage error (MAAPE), a relative error term introduced by Kim and Kim (2016) that is limited to a maximum of 1.6. We use it because frequent close-to-zero HS values at low-elevation stations blow up traditional relative error terms such as the mean absolute percentage error. Since we are interested in gap filling for climatological analyses, we additionally test how well the different methods are able to reproduce three snow climate indicators which are frequently used by practitioners. These snow climate indicators are (i) the average snow depth in a winter (HSavg), which is widely used to test for trends in snow climatology, (ii) the maximum snow depth in a winter (HSmax), which is an important indicator for, e.g., prevention measures in engineering, and (iii) the number of snow days with HS ≥ 1 cm (dHS1), which has vital importance for ecology and the winter tourism industry.

3 Results and discussion

3.1 Number of potential predictor stations

The influence of the maximum number of neighboring stations is displayed in Fig. 2. Box plots of RMSE and MAAPE scores calculated in the reconstructed winters are shown for varying numbers of neighboring stations for the different
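The error metrics RMSE, bias and MAAPE named above can be written out in a short sketch. The function names are illustrative, and two conventions in it are assumptions rather than statements from the study: bias is taken as modeled minus measured, and for a measured value of zero the MAAPE term is set to 0 if the modeled value is also zero and to the upper limit π/2 (about 1.57) otherwise.

```python
import math


def rmse(measured, modeled):
    """Root mean square error, in the unit of the input (cm for HS)."""
    squared = [(mod - obs) ** 2 for obs, mod in zip(measured, modeled)]
    return math.sqrt(sum(squared) / len(squared))


def bias(measured, modeled):
    """Mean of modeled minus measured (assumed sign convention)."""
    diffs = [mod - obs for obs, mod in zip(measured, modeled)]
    return sum(diffs) / len(diffs)


def maape(measured, modeled):
    """Mean arctangent absolute percentage error, bounded by pi/2."""
    total = 0.0
    for obs, mod in zip(measured, modeled):
        if obs == 0:
            # Assumption: 0/0 counts as no error, x/0 as the maximum error.
            total += 0.0 if mod == 0 else math.pi / 2
        else:
            total += math.atan(abs((obs - mod) / obs))
    return total / len(measured)


# Hypothetical daily HS values (cm) with near-zero measurements.
measured = [0.0, 1.0, 2.0, 40.0]
modeled = [1.0, 3.0, 2.0, 38.0]
print(rmse(measured, modeled), bias(measured, modeled), maape(measured, modeled))
```

Because the arctangent bounds every term, a day with 1 cm measured and 3 cm modeled contributes arctan(2) instead of a 200 % error, which is what keeps the metric usable for stations with ephemeral snow cover.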

Figure 2. Box plots of RMSE and MAAPE calculated in the individual reconstructed winters with a varied maximum number of predictor stations for the spatial interpolation methods. The methods have been applied to the complete station network. For better comparison, outliers are not shown in the box plots. Note that WNR with one predictor station is equivalent to the BCS method.

Figure 3. Reconstructed daily snow depth values plotted against the measured values for the methods used (columns). Data in the top row are calculated in the full station network, and data in the bottom row are calculated using the evaluation stations only. The solid black line represents perfect predictions, and the dashed line is a linear fit of predicted versus measured values. The three score metrics coefficient of determination (r²), root mean squared error (RMSE) and bias are indicated in each panel.

Figure 4. Modeled average snow depth in a winter (HSavg, top row), maximum snow depth in a winter (HSmax, middle row) and number of snow days with HS ≥ 1 cm (dHS1, bottom row) of the reconstructed winters from the cross-validation trials versus the respective snow climate indicator value derived from measurements. The columns refer to the different interpolation methods. Orange squares are gaps reconstructed with the complete station network, and blue triangles are gaps that have been reconstructed solely using the evaluation stations as depicted in Fig. 1. The black line represents perfect predictions. The dashed and dotted lines are linear fits to the data points of the dense and sparse station networks, respectively.

Figure 5. Box plots of absolute errors in average snow depth in a winter (HSavg, top row), maximum snow depth in a winter (HSmax, middle row) and number of snow days with HS ≥ 1 cm (dHS1, bottom row) calculated for 20 cm HSavg bins of the respective gap winter and the different methods (columns). Colors of the box plots denote the different station networks that have been used for reconstruction. Outliers in the box plots are not shown for better comparison.

Figure A3. Histograms showing the difference in days between the measured date of HSmax and the date of HSmax in the reconstructed winters in the dense and sparse station networks. In the case that the same HSmax is recorded on more than one day, the date of the first occurrence is taken.

Table 1. Selected number of neighboring stations for each method.

Table 2. Bias, RMSE and coefficient of determination (r²) for the three climate metrics HSavg, HSmax and dHS1 reconstructed with the different methods in the dense and sparse station networks as shown in Fig. 4.