A comparison of gap-filling algorithms for eddy covariance fluxes and their drivers

Geosci. Instrum. Method. Data Syst., 10, 123-140, 2021. https://doi.org/10.5194/gi-10-123-2021

Abstract

The errors and uncertainties associated with gap-filling algorithms for water, carbon, and energy flux data have always been one of the main challenges of the global network of microclimatological tower sites that use the eddy covariance (EC) technique. To address these concerns and find more efficient gap-filling algorithms, we reviewed eight algorithms to estimate missing values of environmental drivers and, separately, nine algorithms for the three major fluxes in EC time series. We then examined the algorithms' performance for different gap-filling scenarios utilising the data from five EC towers during 2013. This research's objectives were (a) to evaluate the impact of the gap lengths on the performance of each algorithm and (b) to compare the performance of traditional and new gap-filling techniques for the EC data, for fluxes and separately for their corresponding meteorological drivers. The algorithms' performance was evaluated by generating nine gap windows with different lengths, ranging from a day to 365 days. In each scenario, a gap period was chosen randomly, and the corresponding data were removed from the dataset. After running each scenario, a variety of statistical metrics were used to evaluate the algorithms' performance. The algorithms showed different levels of sensitivity to the gap lengths; the Prophet Forecast Model (FBP) revealed the most sensitivity.


Introduction
To address the global challenges of climatological and ecological changes, environmental scientists and policy makers are demanding data that are continuous in time and space. In addition, there is a need to quantify and reduce uncertainties in such data, including observations of carbon, water, and energy exchanges that are crucial components in national and international flux networks as well as global Earth-observing systems. Satellites partially fill this gap as they provide excellent spatial coverage but have limited temporal resolution and do not measure at a point scale. As such, high-quality long-term site observations of ecosystem processes and fluxes are needed that are continuous in time and space. The global eddy covariance (EC) flux tower network (FLUXNET) consists of regional counterparts (i.e. AmeriFlux, EUROFLUX, OzFlux) and was established in the late 1990s to address the global demand for such information (Aubinet et al., 1999; Baldocchi et al., 2001; Beringer et al., 2016; Hollinger et al., 1999; Menzer et al., 2013; Tenhunen et al., 1998). Despite EC data being frequently used to validate process modelling analyses, field surveys, and remote sensing assessments (Hagen et al., 2006), there are some serious concerns regarding the technique's challenges, e.g. data gaps and uncertainties. Hence, filling data gaps and reducing uncertainties through better gap-filling techniques are highly needed.
Even though the EC is a common technique to measure fluxes of carbon, water, and energy, there are some challenges in providing robust, high-quality, continuous observations. One of the challenges regarding the technique, and therefore the flux networks, is addressing data gaps and the uncertainties associated with the gap-filling process, mainly when the gap windows are long (longer than 12 consecutive days, as described by Moffat et al., 2007). These gaps happen quite often for a variety of reasons, such as values out of range, spike detection or manual exclusion of date and time ranges, instrument or power failure, herbivores, fire, eagles' nests, lightning, and/or researchers on leave (Beringer et al., 2017). Since EC flux towers are often located in harsh climates, their data are more susceptible to adverse weather (i.e. rain conditions), which can also prevent quick access to sites for repair and maintenance. As a result, this issue can, in turn, produce gaps which might be relatively long (Isaac et al., 2017) and thus problematic, as explained in the following. Firstly, loss of data is considered a threat to scientific studies, depending on the missing data quantity, pattern, mechanism, and nature (Altman and Bland, 2007; Molenberghs et al., 2014; Tannenbaum, 2010). That is because using an incomplete dataset might lead to biased, invalid, and unreliable results (Allison, 2000; Kang, 2013; Little, 2002). Second, continuous gap-filled data are required to calculate the annual or monthly budgets of carbon and water balance components (Hutley et al., 2005).
Other than the challenges caused by missing data, there are several sources of errors and uncertainties in the EC technique. Firstly, random error is associated with the stochastic nature of turbulence, associated sampling errors (incomplete sampling of large eddies, uncertainty in the calculated covariance between the vertical wind velocity and the scalar of interest), instrument errors, and footprint variability (Aubinet et al., 2012). For instance, Dragoni et al. (2007) analysed EC-based data from the Morgan-Monroe State Forest for 8 years (1999-2006) and assessed instrument uncertainty as equal to 3 % of the total annual net ecosystem exchange (NEE). Another primary source of uncertainty in EC measurements is systematic errors caused by methodological challenges and instrument calibration problems (e.g. sonic anemometer errors, spikes, gas analyser errors). Finally, one of the sources of uncertainties is data processing, especially data gap-filling (Isaac et al., 2017; Moffat et al., 2007; Richardson et al., 2012; Richardson and Hollinger, 2007).
There are several uncertainties pertaining to the gap-filling of missing values, including measurement uncertainty (Richardson and Hollinger, 2007), the lengths and timing of the gaps (Falge et al., 2001; Richardson and Hollinger, 2007), and the particular gap-filling algorithm that is used (Falge et al., 2001; Moffat et al., 2007). However, there are two dominant issues: long data gaps and the choice of a particular gap-filling algorithm (Aubinet et al., 2012). Firstly, long gaps can significantly increase the total amount of uncertainty, as the ecosystem behaviour might change because of different agricultural periods or phenological phases (e.g. growing season, harvest period, bushfire) and thereby show different responses under similar meteorological conditions (Aubinet et al., 2012; Isaac et al., 2017; Richardson and Hollinger, 2007). Consequently, the period in which a long gap happens is important. For example, research undertaken by Richardson and Hollinger (2007) on data from a range of FLUXNET sites revealed that a 1-week data gap during spring green-up in a forest led to a higher uncertainty than a 3-week gap period during winter. Second, each gap-filling algorithm has its strengths and weaknesses; for instance, Moffat et al. (2007) compared 15 different commonly used gap-filling algorithms. They found no significant difference between the performance of the algorithms with "good" reliability based on analysis of variance of the root mean square error (RMSE). The overall gap-filling uncertainty was within ±25 g C m −2 yr −1 for most of the proper algorithms, whereas the other algorithms generated higher uncertainties of up to ±75 g C m −2 yr −1 , showing that the uncertainty provided by reliable methods can be considerably smaller. This result is similar to the findings of Richardson and Hollinger (2007), who found that, for the datasets used in their study, uncertainties from long gaps filled by appropriate algorithms were up to ±30 g C m −2 yr −1 . Considering that the data provided by EC tower networks are of use for research, government, and policy makers, robust gap-filling is needed to quantify and reduce uncertainties in flux estimations.
Several methods have typically been used to fill data gaps in both fluxes and their meteorological drivers to manage the missing data problem. Due to the computational constraints of complex algorithms, early works to impute EC data gaps used interpolation methods based mostly on linear regression or temporal autocorrelation (Falge et al., 2001; Lee et al., 1999). These approaches were quickly replaced by more sophisticated methods such as non-linear regressions (Barr et al., 2004; Falge et al., 2001; Moffat et al., 2007; Richardson et al., 2006), look-up tables (Falge et al., 2001; Law et al., 2002; Zhao and Huang, 2015), artificial neural networks (ANNs) (Aubinet et al., 1999; Beringer et al., 2016; Cleverly et al., 2013; Hagen et al., 2006; Isaac et al., 2017; Kunwor et al., 2017; Moffat et al., 2007; Papale and Valentini, 2003; Pilegaard et al., 2001; Staebler, 1999), mean diurnal variation (Falge et al., 2001; Moffat et al., 2007; Zhao and Huang, 2015), and multiple imputations (Hui et al., 2004; Moffat et al., 2007). Each of these methods has its pros and cons as follows. (a) Interpolation methods such as the mean diurnal variation (MDV) do not need any drivers, yet their accuracy is lower than that of other approaches (Aubinet et al., 2012). Moreover, this method may provide biased results on extremely clear or cloudy days (Falge et al., 2001). MDV is not recommended when a gap is longer than 2 weeks because it cannot consider the non-linear relations between the drivers and the flux, leading to a high level of uncertainty (Falge et al., 2001). (b) The look-up table, especially its modified version, marginal distribution sampling (MDS), has provided performance close to ANNs and has so far been more reliable and consistent than the other algorithms. Hence, MDS was chosen as one of the standard gap-filling methods in EUROFLUX (Aubinet et al., 2012). Nevertheless, the performance of MDS in gap-filling of extra-long gaps is not well known (Kim et al., 2020). (c) ANNs have commonly been used to gap-fill EC fluxes since 2000, and because of their robust and consistent results they are considered a standard gap-filling algorithm in several networks, e.g. ICOS, FLUXNET, and OzFlux (Aubinet et al., 2012; Beringer et al., 2017; Isaac et al., 2017).

Despite their reliable performance, ANNs, and generally all other ML algorithms, face some challenges. Over-fitting, for instance, is a big concern and can happen when the number of degrees of freedom is high while the training window is not long enough or the quality of the training dataset is low. This challenge becomes acute when gaps happen while the ecosystem behaviour is changing and shows different responses under similar meteorological conditions. Furthermore, there is a desire to keep the training windows short so that the algorithm can track shifts in ecosystem behaviour; yet this increases the risk of over-fitting, depending on the algorithm. In other words, the training window should be neither so short that it causes over-fitting nor so long that it leads the algorithms to ignore changes in ecological conditions. Long gaps are considered one of the primary uncertainty sources of the CO 2 flux in FLUXNET (Aubinet et al., 2012). As a result, studying the effects of the gap lengths and of the window length on which an algorithm is trained are both critical challenges associated with environmental data gap-filling.
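As a concrete illustration of the simplest family above, the MDV interpolation in point (a) can be sketched in a few lines of Python. This is a minimal sketch, not the implementation used in any network pipeline; the window length and slot layout are illustrative.

```python
from statistics import mean

def mdv_fill(series, slots_per_day=48, window_days=7):
    """Mean diurnal variation gap-fill: replace each missing value (None)
    with the mean of observations at the same time-of-day slot within
    +/- window_days around the gap. Needs no driver variables."""
    n = len(series)
    filled = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        candidates = []
        for d in range(-window_days, window_days + 1):
            j = i + d * slots_per_day  # same slot of the day, d days away
            if 0 <= j < n and series[j] is not None:
                candidates.append(series[j])
        if candidates:
            filled[i] = mean(candidates)
    return filled
```

For a 30 min series, slots_per_day would be 48. MDV's weakness for long gaps is visible in the sketch: every filled value is just a time-of-day climatology, with no driver information and no non-linear driver-flux response.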
Apart from the limitations and disadvantages of the mentioned algorithms, gap-filling of fluxes (e.g. NEE) faces some other challenges that make it necessary to find or develop new gap-filling algorithms. That is because the current methods are not flexible enough to perform well on special occasions or with extreme values (Kunwor et al., 2017), and there is almost no room to optimise them to improve their outcome (Moffat et al., 2007). Moreover, even using the best available algorithm, such as ANNs, the model (gap-filling) uncertainty still accounts for a sizeable proportion of the total uncertainty, especially when the gaps are relatively long. Since the 2000s, when MDS and ANNs were chosen as the most reliable gap-filling methods for EC flux observations, many new ML and optimisation algorithms have been developed and used in various scientific fields. Some have shown superiority over ANNs, either individually or as part of a hybrid or ensemble model (e.g. Gani et al., 2016). As a result, comparing cutting-edge algorithms with the current standard ones can show whether there is any room to improve the gap-filling process within the field. Given the concerns mentioned above, this paper has two objectives: (a) to find out the impact of different gap lengths on the performance of each algorithm and (b) to compare the performance of traditional with new gap-filling techniques, separately for fluxes and their meteorological drivers, particularly soil moisture, because this has always been a challenging variable to gap-fill due to the biology and heterogeneity of soil parameters. To address these objectives, we utilised nine different algorithms, namely eXtreme Gradient Boost (XGB), random forest (RF), artificial neural networks (ANNs), marginal distribution sampling (MDS), classic linear regression (CLR), support vector regression (SVR), elastic net regularisation (ELN), panel data (PD), and the Prophet Forecast Model (FBP), to fill the gaps in the major fluxes, and eight of them (excluding MDS) to fill the gaps in the environmental drivers. We then assessed their relative performance to evaluate potentially better ways to fill EC flux data. To test the approaches, we used five flux towers from the OzFlux network. To evaluate the performance of these algorithms, nine gap scenarios, from a day to a whole year, were planned and applied to the datasets, and common performance metrics (e.g. RMSE, MBE) and visual graphs were used.

Materials and methods
In order to address the first objective of this research, nine different gap lengths were superimposed on the datasets, i.e. 1, 5, 10, 20, 30, 60, 90, 180, and 365 d. To address the second objective, we chose nine different algorithms to fill the gaps, covering a wide variety of approaches, from a simple algorithm like CLR to cutting-edge ML algorithms like XGB (MDS was not used to gap-fill the environmental drivers). The data used in this paper came from five EC towers of the OzFlux network, i.e. Alice Springs Mulga, Calperum, Gingin, Howard Springs, and Tumbarumba, from 2012 to 2013, with a time resolution of 30 min, except for Tumbarumba (60 min). Additionally, data coming from three sources outside the network were also used as ancillary data to help the algorithms fill environmental driver gaps.

Data
The data used for this research came from OzFlux, the regional Australian and New Zealand flux tower network that aims to provide a continental-scale national research facility to monitor and assess Australia's terrestrial biosphere and climate (Beringer et al., 2016). The datasets included 15 meteorological drivers and three major fluxes (Table 2) recorded using the EC technique at a 30 min temporal resolution, except for Tumbarumba, which was hourly. Additionally, relevant ancillary datasets for the mentioned towers were used to follow the OzFlux network gap-filling protocol (Table 3). Each dataset was quality checked at three levels based on the OzFlux network protocol described in Isaac et al. (2017) and applied using PyFluxPro version 0.9.2. To address the underestimation of canopy respiration by EC measurements at night, we used the change-point detection (CPD) method (Barr et al., 2013) to reject nocturnal records when the friction velocity fell below each site's threshold value. After discarding the inappropriate measurements, overall coverage of 72 %-88 % and 21 %-48 % was achieved for diurnal and nocturnal records, respectively, during 2013 (the year on which the artificial gaps were superimposed).
The datasets whereby each environmental variable was gap-filled are shown in Table 3. For each of these variables, the same variable from the ancillary source was used to fill the gaps. For instance, to gap-fill Ah, the Ah records of AWS, ACCESS-R, and BIOS2 were used. To gap-fill the missing values of the fluxes, i.e. F c (NEE), F h (H), and F e (LE), eight drivers were used, namely T a , W s , Sws, F g , vapour pressure deficit (VPD), F n , q, and T s , based on a combination of random forest (RF) feature selection and testing out a series of feature combinations. Different Python (version 3.6.4) libraries were utilised for training and testing the algorithms, i.e. XGBoost for XGB, fbprophet for FBP, statsmodels for PD, and sklearn for the rest of the algorithms. Each algorithm was tuned individually using a grid search, and the numbers of nodes, layers, and iterations were chosen accordingly.

Gap-filling algorithms
Eight imputation algorithms for estimating 15 environmental drivers and nine algorithms for the three major fluxes were chosen for the comparison. These algorithms were selected in such a way that a variety of approaches were tested, from the standard methods like ANNs and MDS to newer algorithms which have rarely or never been used in the field, such as eXtreme Gradient Boosting and panel data (Table 4).

Marginal distribution sampling (MDS)

Reichstein et al. (2005) introduced the MDS as an enhanced look-up table method, which considers both the covariation of fluxes with meteorological variables and the temporal autocorrelation of the fluxes (Aubinet et al., 2012). Alongside the ANNs, the MDS is considered one of the standard gap-filling methods for flux data within FLUXNET and was selected in this study to help the community have a clear idea of the performance of the other algorithms. Unlike the other algorithms used in this research, we used F sd , T a , and VPD as the input features for the MDS to be consistent with its standard application, since using more than three or four drivers is not generally recommended (Aubinet et al., 2012). PyFluxPro version 0.9.2 was used to apply the algorithm (with modified code used for gaps longer than 10 d).

Artificial neural networks (ANNs)
Rooted in the 1950s, artificial neural networks are ML methods inspired by biological neural networks and are classified as supervised learning methods (Dreyfus, 1990; Farley and Clark, 1954). ANNs work based on several connected units called nodes, which are used to mimic a neuron's functionality in an animal brain by sending and receiving signals to other nodes. The ANN technique used in this paper was the Multi-Layer Perceptron regressor, which optimises the squared loss using stochastic gradient descent. sklearn.neural_network.MLPRegressor was used to apply this method in Python, and its hyperparameters were 800 and 500 for "hidden_layer_sizes" and "max_iter", respectively, based on a grid search. ANNs are one of the current standard approaches for gap-filling in FLUXNET and in this research were picked as a performance reference for the other algorithms.

Table 2. List of variables and their units used in this research, including the three main fluxes and their environmental drivers (e.g. Ah: absolute humidity, g m −3 ).

Table 3. The ancillary source variables used to gap-fill each environmental driver.

Classical linear regression (CLR)
A classical linear regression is an equation developed to estimate the value of the dependent variable (y) from the independent variables (x i ). Each x i has its own coefficient, and the model includes an overall intercept. The coefficients are determined by minimising the squared residuals (errors) of estimated vs. observed values, the method of least squares.
A CLR algorithm can be formulated as follows (Freedman, 2009):

y = α + Σ i β i X i + ε,

where y is the dependent variable, α is the intercept, the X i represent the independent variables, β i is the coefficient of X i , and ε is the error term. We chose this algorithm as a baseline to find out how much better the more complicated algorithms can comparatively estimate the dependent variables.
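For the single-driver case, the least-squares coefficients have the familiar closed form, which the following sketch implements (illustrative only; the paper fitted multi-driver regressions):

```python
def fit_clr(xs, ys):
    """Least-squares fit of y = alpha + beta * x + eps for one driver:
    beta = cov(x, y) / var(x), alpha = mean(y) - beta * mean(x)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    beta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
    alpha = my - beta * mx
    return alpha, beta
```

With several drivers, the same criterion is solved via the normal equations; the sketch keeps the one-dimensional case for clarity.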

Random forests (RFs)
Random forest, a supervised ML algorithm used for both classification and regression, consists of multiple trees constructed systematically by pseudo-randomly selecting subsets of components of the feature vector: that is, trees constructed in randomly chosen subspaces (Ho, 1998).
The RF algorithm has been developed to overcome the over-fitting problem, a commonplace limitation of its preceding decision-tree-based methods (Ho, 1995, 1998). sklearn.ensemble.RandomForestRegressor was used to apply this method in Python, and the hyperparameters used were 5 and 1000 for "max_depth" and "n_estimators", respectively, based on a grid search.
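The bagging idea behind RF can be shown with a deliberately tiny sketch: many weak trees (here one-split "stumps"), each trained on a bootstrap resample, are averaged. This is a conceptual toy, not sklearn's implementation, and it omits RF's per-split random feature subsampling for brevity.

```python
import random

def fit_stump(X, y):
    """Best single-feature threshold split minimising squared error."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[f] <= thr]
            right = [yi for row, yi in zip(X, y) if row[f] > thr]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = (sum((yi - lm) ** 2 for yi in left)
                   + sum((yi - rm) ** 2 for yi in right))
            if best is None or err < best[0]:
                best = (err, f, thr, lm, rm)
    if best is None:  # degenerate bootstrap sample (all rows identical)
        m = sum(y) / len(y)
        return lambda row: m
    _, f, thr, lm, rm = best
    return lambda row: lm if row[f] <= thr else rm

def fit_forest(X, y, n_trees=50, seed=0):
    """Bagging: each stump sees a bootstrap resample; predictions are averaged."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]
        stumps.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return lambda row: sum(s(row) for s in stumps) / len(stumps)
```

The averaging over resamples is what tempers the over-fitting of any individual tree, which is the property the text attributes to RF.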

Support vector regression (SVR)
As a non-linear method, support vector regression was developed based on Vapnik's concept of support vector theory (Drucker et al., 1997). An SVR algorithm is trained by solving the following problem:

minimise (1/2)‖w‖ 2 subject to |y i − (⟨w, x i ⟩ + b)| ≤ ε,

where x i and y i are the training sample and target value, respectively. The inner product plus intercept, ⟨w, x i ⟩ + b, is the prediction for that sample, and ε is a free parameter that serves as a threshold. sklearn.svm.SVR was used to apply this method in Python, and the hyperparameters used were 1 and 0.001 for "C" and "γ", respectively, based on a grid search.
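The ε threshold in this formulation corresponds to the ε-insensitive loss: residuals inside the ε tube contribute nothing, and only larger deviations are penalised. A minimal sketch (the soft-margin SVR used in practice adds slack variables and the C penalty, which are omitted here):

```python
def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """Epsilon-insensitive loss underlying SVR: residuals with absolute
    value below eps are ignored; larger ones are penalised linearly."""
    return max(0.0, abs(y_true - y_pred) - eps)
```

Samples whose residuals exceed ε become the support vectors that determine the fitted function.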

Elastic net regularisation (ELN)
The elastic net is a linear regularised regression method that introduces a small amount of bias by adding two penalty components to the regression, shrinking the coefficients of the independent variables; it thus provides better long-term predictions. Given that these two penalty components come from ridge regression and LASSO, the elastic net is considered a hybrid model consisting of ridge and LASSO regressions, thereby overcoming the limitations of both. The estimates from the ELN method can be formulated (for an orthonormal design) as below (Zou and Hastie, 2005):

β(ELN) = sgn(β(OLS)) · max(|β(OLS)| − λ 1 /2, 0) / (1 + λ 2 ),

where β is the coefficient of each ELN independent variable, λ 1 and λ 2 are the penalty coefficients of LASSO and ridge regression, respectively, β(OLS) is the coefficient of an independent variable calculated based on ordinary least squares, and "sgn" stands for the sign function:

sgn(x) = 1 if x > 0, 0 if x = 0, and −1 if x < 0.

ELN regression is good at addressing situations where the training dataset has few samples or where there are correlations between parameters. sklearn.linear_model.ElasticNet was used to apply this method in Python, and the hyperparameters used were as follows: {"alpha": 0.01, "fit_intercept": True, "max_iter": 5000, "normalize": False}, based on a grid search.
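The coefficient formula above can be read as "soft-threshold (LASSO part), then shrink (ridge part)", which a one-function sketch makes explicit (illustrative; the function name is ours):

```python
def eln_coefficient(beta_ols, lam1, lam2):
    """Naive elastic-net estimate of one coefficient for an orthonormal
    design: soft-threshold the OLS coefficient by lam1/2, then shrink
    the result by 1 / (1 + lam2)."""
    sgn = (beta_ols > 0) - (beta_ols < 0)  # sign function: -1, 0, or 1
    return sgn * max(abs(beta_ols) - lam1 / 2, 0.0) / (1.0 + lam2)
```

Small OLS coefficients are set exactly to zero (sparsity from the LASSO term), while the surviving ones are uniformly shrunk (stability from the ridge term).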

Panel data (PD)
The panel data method is a multidimensional statistical method mainly used in econometrics to analyse datasets which involve time series of observations amongst individual cross sections (Baltagi, 1995), usually based on ordinary least squares (OLS) or generalised least squares (GLS). A two-way panel data model consists of two extra components beyond a CLR as follows (Baltagi, 1995; Hsiao et al., 2002; Wooldridge, 2002):

y it = α + X it β + µ i + λ t + ν it ,

where i and t denote the cross-section and time series dimensions, respectively, y is a dependent-variable vector, X is an independent-variable matrix, α is a scalar, β is the coefficient of the independent-variable matrix, µ i is the unobservable individual-specific effect, λ t is the unobservable time-specific effect, and ν it is the remainder disturbance. The panel data method can provide a holistic analysis of different individuals and determine the specific impact of each time period, which gives it an advantage over CLR. Since PD requires cross sections, we used a cross-section tower for each of the five main towers as follows: Ti Tree East for Alice Springs Mulga, Whroo for Calperum, Great Western Woodlands for Gingin, Daly River for Howard Springs, and Cumberland Plain for Tumbarumba. The cross-section towers were chosen based on their distances (the closest ones with common years of data).
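For a balanced panel, the two-way effects µ i and λ t can be removed by double-demeaning (the within transformation), after which a plain OLS slope recovers β. The following self-contained sketch is our own simplification of that idea (the paper used the statsmodels library rather than this code):

```python
def twoway_fe_slope(y, x, entity, time):
    """Estimate beta in y_it = alpha + beta*x_it + mu_i + lam_t + nu_it
    for a balanced panel: double-demean y and x over entities and time
    periods (which annihilates mu_i, lam_t and alpha), then regress."""
    def demean(v):
        groups_e, groups_t = {}, {}
        for vi, e, t in zip(v, entity, time):
            groups_e.setdefault(e, []).append(vi)
            groups_t.setdefault(t, []).append(vi)
        me = {e: sum(g) / len(g) for e, g in groups_e.items()}
        mt = {t: sum(g) / len(g) for t, g in groups_t.items()}
        grand = sum(v) / len(v)
        return [vi - me[e] - mt[t] + grand
                for vi, e, t in zip(v, entity, time)]
    yd, xd = demean(y), demean(x)
    return sum(a * b for a, b in zip(xd, yd)) / sum(a * a for a in xd)
```

In the paper's setting, the "entities" would be the paired towers and the time index the shared timestamps; the entity effect absorbs site-specific offsets that a pooled CLR would confound with the slope.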

eXtreme Gradient Boost (XGB)
The eXtreme Gradient Boost algorithm is a reinforced version of gradient boosting, introduced in 1999, that works based on parallel boosted decision trees. Similar to RF, it can be used for a variety of data processing purposes, including classification and regression (Friedman, 2001, 2002; Ye et al., 2009). The XGB method is resistant to over-fitting and provides a robust, portable, and scalable algorithm for large-scale boosted decision-tree-based techniques. sklearn.ensemble.GradientBoostingRegressor was used to apply this method in Python, and its hyperparameters were chosen based on a grid search as follows: {"learning_rate": 0.001, "max_depth": 8, "reg_alpha": 0.1, "subsample": 0.5}.
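The residual-fitting loop at the heart of gradient boosting can be sketched for a single feature and squared loss. This is a conceptual toy with threshold "stumps" as weak learners, not the XGBoost or sklearn implementation:

```python
def fit_gbm(xs, ys, n_rounds=200, learning_rate=0.1):
    """Gradient boosting for squared loss on one feature: each round fits a
    threshold stump to the current residuals, and the ensemble moves a
    small step (learning_rate) towards them."""
    base = sum(ys) / len(ys)          # initial constant prediction
    resid = [yi - base for yi in ys]  # residuals = negative gradient
    stumps = []
    for _ in range(n_rounds):
        best = None
        for thr in sorted(set(xs)):
            left = [r for x, r in zip(xs, resid) if x <= thr]
            right = [r for x, r in zip(xs, resid) if x > thr]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = (sum((r - lm) ** 2 for r in left)
                   + sum((r - rm) ** 2 for r in right))
            if best is None or err < best[0]:
                best = (err, thr, lm, rm)
        if best is None:
            break
        _, thr, lm, rm = best
        stumps.append((thr, learning_rate * lm, learning_rate * rm))
        resid = [r - (learning_rate * lm if x <= thr else learning_rate * rm)
                 for x, r in zip(xs, resid)]
    def predict(x):
        out = base
        for thr, step_l, step_r in stumps:
            out += step_l if x <= thr else step_r
        return out
    return predict
```

XGBoost adds, on top of this loop, the regularised objective (penalising tree complexity and leaf weights) that the text credits for its resistance to over-fitting.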

The Prophet Forecasting Model (FBP)
The Prophet Forecasting Model, also known as "Prophet", is a time series forecasting model developed by Facebook to manage the common features of business time series. It is designed to have intuitive parameters that can be adjusted without knowing the details of the underlying model (Taylor and Letham, 2018). A decomposable time series model (Harvey and Peters, 1990) was used to develop this model, with three main components, trend, seasonality, and holidays (Taylor and Letham, 2018):

y(t) = g(t) + s(t) + h(t) + ε t ,

where g(t) is the trend function, which models non-periodic changes, s(t) is a function representing periodic changes, e.g. seasonality, h(t) assesses the effects of potential anomalies which occur over one or more days, e.g. holidays, and ε t is the error term.
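The additive decomposition can be illustrated by evaluating a toy model with a linear trend and a single sinusoidal seasonal term (Prophet itself uses a piecewise trend and a Fourier-series seasonality; every parameter value here is arbitrary):

```python
import math

def prophet_like(t, trend_k=0.01, trend_m=0.0, season_amp=1.0, period=365.25):
    """Toy additive decomposition y(t) = g(t) + s(t) + h(t):
    a linear trend, one sinusoidal seasonal term, and a
    holiday/anomaly component that is zero in this sketch."""
    g = trend_k * t + trend_m                       # non-periodic trend
    s = season_amp * math.sin(2 * math.pi * t / period)  # seasonal cycle
    h = 0.0                                         # no anomalies/holidays
    return g + s + h
```

Because the model forecasts from the series' own history rather than from concurrent drivers, it is the only algorithm in the comparison that fills gaps without any meteorological inputs, which is relevant to its behaviour in the results.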

The gap scenarios
In order to find out the effect of gap size on the performance of our gap-filling algorithms, data were removed randomly within nine different gap windows (i.e. 1, 5, 10, 20, 30, 60, 90, 180, and 365 consecutive days) during 2013. Afterwards, the data from 2012 to 2013 were used to train the algorithms (excluding the superimposed gaps). Finally, the trained algorithms were used to fill the artificial gaps superimposed on the datasets. The entire process was permuted five times in each scenario to ensure the performance was not sensitive to the gap position (i.e. seasonally). As such, 15 variables, nine window lengths, eight gap-filling methods (MDS excluded), and five permutations across five towers resulted in 27 000 computations for the meteorological features. Similarly, three fluxes, nine window lengths, nine gap-filling methods, and five permutations across five towers resulted in 6075 computations for the major fluxes overall.
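The scenario construction reduces to choosing a random contiguous window to blank out, plus the bookkeeping of the computation counts. A brief sketch (index layout and function name are ours):

```python
import random

def make_gap(n_records, gap_days, slots_per_day=48, seed=None):
    """Pick a random contiguous gap of gap_days; returns the half-open
    index range [start, end) of records to remove before refilling."""
    rng = random.Random(seed)
    gap_len = gap_days * slots_per_day
    start = rng.randrange(n_records - gap_len + 1)
    return start, start + gap_len

# Computation counts reported in the text:
drivers = 15 * 9 * 8 * 5 * 5  # variables x windows x methods x permutations x towers
fluxes = 3 * 9 * 9 * 5 * 5    # fluxes x windows x methods x permutations x towers
```

For a 30 min series, a non-leap year has 365 * 48 = 17 520 records, so a 30 d scenario removes 1440 consecutive records.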

Statistical performance measures
Different statistical metrics were used to evaluate the algorithms' performance and enable comparison of the measured values from the flux towers with each gap-filling algorithm's predictions. These metrics included the coefficient of determination (R 2 ) to measure the square of the coefficient of multiple correlation (Devore, 1991), the variance of measured and modelled values (S 2 ) to indicate how well the algorithms could follow the variations in the recorded data, the root mean square error (RMSE) and the mean bias error (MBE) to capture the distribution and bias of the residuals, the variance ratio (VR) to compare the variance of the estimated values with that of the measured values, and the index of agreement (IoAd) to compare the sum of the squared error to the potential error (Bennett et al., 2013). The formulas for these metrics are as follows (Bennett et al., 2013):

RMSE = sqrt( (1/n) Σ (p i − o i ) 2 ),
MBE = (1/n) Σ (o i − p i ),
S 2 = Σ (x i − x̄) 2 / (n − 1),
VR = σ 2 p / σ 2 o ,
IoAd = 1 − Σ (o i − p i ) 2 / Σ ( |p i − ō| + |o i − ō| ) 2 .
Here, o i and p i are individual measured and predicted values, respectively, ō and p̄ are the means of the measured and predicted values, and σ 2 is the variance. S 2 is calculated separately for the observed and predicted values, with x representing each observed or predicted value. All of these metrics were calculated for each of the gap scenarios after the results of the five permutations were concatenated; computing the metrics on the concatenated results avoids Simpson's paradox and related averaging issues, as described by Kock and Gaskins (2016).
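The metrics defined above are straightforward to compute; the following sketch implements RMSE, MBE, VR, and IoAd with the sign convention implied by the text (negative MBE indicating overestimation):

```python
def performance_metrics(obs, pred):
    """RMSE, MBE, variance ratio (VR) and index of agreement (IoAd)
    for comparing gap-filled predictions against measurements."""
    n = len(obs)
    ob = sum(obs) / n  # mean of the observations
    def var(v):
        m = sum(v) / len(v)
        return sum((x - m) ** 2 for x in v) / (len(v) - 1)
    rmse = (sum((p - o) ** 2 for o, p in zip(obs, pred)) / n) ** 0.5
    mbe = sum(o - p for o, p in zip(obs, pred)) / n  # negative = overestimation
    vr = var(pred) / var(obs)
    ioad = 1.0 - (sum((o - p) ** 2 for o, p in zip(obs, pred))
                  / sum((abs(p - ob) + abs(o - ob)) ** 2 for o, p in zip(obs, pred)))
    return {"RMSE": rmse, "MBE": mbe, "VR": vr, "IoAd": ioad}
```

A perfect prediction yields RMSE = MBE = 0, VR = 1, and IoAd = 1; VR below 1 flags the damped variability discussed in the results.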

CO 2 flux (F c )
Even though variables such as the ground heat flux (F g ) and net radiation (F n ) are themselves fluxes, we treated them as environmental drivers since they drive the three major turbulent fluxes.
The metrics used to evaluate the algorithms' performance (RMSE, R 2 , MBE, IoAd, and VR) (Table 5) illustrated that, overall, the performance of these algorithms, particularly the ML ones, was similar, with the MDS following closely. The XGB provided the lowest RMSE values and one of the highest R 2 values, while the FBP had the lowest R 2 values and the ELN the highest RMSE values. The algorithms, however, showed different levels of sensitivity to the gap lengths; e.g. the CLR and PD showed lower sensitivity, while the FBP showed the most sensitivity (Fig. 1).
These outcomes were expected for the XGB, as it uses a more regularised model formalisation to control over-fitting (Chen and Guestrin, 2016), which in principle leads to better performance than its ML rivals. The relatively poor performance of FBP was also foreseen because, unlike the other algorithms, FBP did not use any features to estimate flux values other than the previous time series of flux values. However, the weaker performance of the ELN compared to CLR was unforeseen, as by adding two penalty components to the regression the ELN is supposed to improve long-term prediction compared with traditional linear regression methods. Tukey's HSD (honestly significant difference) test at the 0.05 level was applied to the results to determine whether the differences amongst the algorithms were significant (Table 5); when the null hypothesis is not rejected, there is no significant difference between the mean values of the RMSE. According to the results, there were significant differences between certain algorithms, and the XGB, RF, and ANNs differed from the rest, showing that these three performed considerably better. Tukey's HSD test, however, found no significant difference amongst RF, XGB, and ANNs themselves. This result agrees with the results of Falge et al. (2001) and Moffat et al. (2007) in the sense that ANNs are one of the best available gap-filling algorithms and that there is no significant difference amongst the appropriate algorithms. However, the test showed that the performance of the MDS was significantly different from the ANNs. It seems that this difference occurred because of the longer gaps (> 10 d) that were absent from previous studies. Finally, it is worth mentioning that Tukey's HSD is well known as a conservative test. That being said, despite no meaningful difference based on Tukey's HSD, XGB and RF might have performed better than ANNs, as the superiority of RF over the ANNs, SVR, and MDS in gap-filling the methane flux has recently been reported by Kim et al. (2020).
To address this paper's first objective, which was to find the sensitivity of the gap-filling algorithms to the gap window length, we used the averaged RMSE, R 2 , and MBE for each gap size over the output of all algorithms for all sites (Table 6). The outcome illustrates that the longer the window length, the larger the RMSE. Yet, no such pattern was recognisable for the R 2 and MBE. As a result, any consecutive gap longer than 30 d seems to decrease the algorithms' performance noticeably. A reason for this may be that longer windows do not let the algorithms accommodate seasonal changes and therefore different canopy physiological behaviour.
According to the MBE values (Table 5), nearly all algorithms had negative MBEs, indicating an overestimation of the F c values. This bias varied from tower to tower and depended on the window lengths. For instance, the MBE absolute values were larger at Gingin and Tumbarumba, while they were considerably smaller (closer to zero) at Alice Springs Mulga and Calperum (Supplement). The lower leaf area index of the two latter sites, and thus their smaller amounts of photosynthesis, is likely the reason for this. FBP, nonetheless, provided a substantially lower mean bias (−0.06) compared to the other algorithms, which varied between −0.32 and −0.43.
Observations from the EC technique often include extremely low or high values even after quality control (QC), especially at night, when some of the theoretical assumptions may be violated. One of the practical challenges of the EC technique is that it is often difficult to distinguish good data from noise (Aubinet et al., 2012; Burba and Anderson, 2010). This problem appears to have affected the outcomes of the gap-filling algorithms in this research, as none of them performed ideally in capturing the observed variance (Table 5). Even though RMSE, R², and IoAd showed the superiority of the XGB, RF, and ANNs, the variance ratio between the estimated and measured values revealed different information (Table 5), which is recognisable in Fig. 2. The variance ratios (VRs) showed that SVR captured the extreme values of F c better than the other algorithms, with 0.75 on average. The other ML algorithms (plus the MDS), however, performed similarly in capturing the extremes, in line with both expectations and the performance metrics (Table 5).
The linear algorithms, CLR, PD, and ELN, performed worse with respect to the VR than the ML algorithms, with the VR of F c for Calperum (Fig. 2) confirming this. Based on the figure, as expected, the ELN performed the worst in capturing the fluctuations in F c (VR = 0.39), while the performance of the other algorithms, apart from the top five, was not significantly better, with the exception of FBP. It is noteworthy that CLR, PD, and ELN frequently predicted nocturnal photosynthesis. Overall, the results showed a significant difference between the top five algorithms (XGB, RF, ANNs, SVR, and MDS) and the remaining algorithms, particularly in capturing the fluctuations and the min-max range of F c. However, a comprehensive comparison shows a slight superiority of the XGB and RF.

Latent heat flux (F e)
The performance of the algorithms for F e was similar to that for F c with respect to RMSE, MBE, and R² (Table 7). This similarity was not surprising, since these processes are partially coupled via stomatal conductance (Scanlon and Kustas, 2010; Scanlon and Sahu, 2008). Again, the top three ML algorithms performed better, with the advantage of XGB and RF being statistically significant according to Tukey's HSD (Table 7). The null hypothesis was not rejected when comparing FBP and SVR, whereas the better performance of the other algorithms was confirmed. Accordingly, the FBP and SVR provided the least satisfactory estimates of F e based on the average RMSE values. No significant improvement in RMSE occurred when the gap lengths became shorter than 60 d, meaning that the algorithms' performance did not vary considerably between a 30 d and a 1 d window, especially for the top algorithms (XGB, RF, and ANNs). CLR and PD results were very similar to those for F c, showing a lower RMSE and higher R² than ELN, although ELN yielded a slightly lower MBE. The MBE values were also moderately high for the SVR, indicating an absolute bias in its output that might be related to over-fitting. The source of the SVR's bias (Fig. 3) was its inability to capture the minimum values appropriately, resulting in considerable overestimation. A common issue in estimating F e, affecting all algorithms other than the FBP, was the inability to capture the negative values. In contrast to the F c results, the ANNs did not perform as well as the XGB and RF, which could be due to their failure to capture the maximum values as well as their rivals did.

Sensible heat flux (F h)
As with the other fluxes, the RMSE, R², and MBE metrics showed a slight superiority of XGB and RF, as well as the inferiority of the SVR and FBP to the other algorithms (Table 8). Likewise, the SVR produced relatively large negative MBE values, indicating considerable overestimation. Tukey's HSD test on the average RMSE values confirmed that the performance of the FBP differed significantly from the rest at the 0.05 level, making FBP the weakest performer for F h. On the other hand, although there was no significant difference amongst the XGB, RF, and ANNs, the first two were considerably superior to the other algorithms according to the Tukey's HSD test. Similarly to F e, the F h values estimated by SVR had a negative bias (Fig. 4) because the algorithm could not provide appropriate estimates of the F h minima. In contrast, the ANNs performed best in capturing the minimum values, while the other top algorithms performed almost equally well. Despite this, ANNs and MDS did not capture the overall values as well as XGB and RF, resulting in a higher RMSE. Finally, as for the other fluxes, the PD performed slightly better than the CLR and ELN.

Meteorological and environmental drivers
Since meteorological and environmental drivers are needed to fill the gaps in the three turbulent fluxes (F c, F e, and F h), the eight algorithms (excluding the MDS) were used to fill these drivers' gaps. The R², RMSE, and MBE metrics were calculated for all five towers and nine window lengths (16 meteorological and environmental drivers). Overall, for most meteorological drivers, the linear algorithms, especially the CLR and PD, performed slightly better than ML algorithms such as the XGB, RF, ANNs, and SVR, except for Ah, F g, and F n. This unexpected superiority can be explained by two factors. Firstly, unlike for the fluxes, the input and output features were the same variable here, e.g. T a for T a, which led to strong correlations (e.g. up to 0.99 for atmospheric pressure, ps) and hence strong linear relationships between the independent and dependent features. These strong correlations helped the linear algorithms perform well and reduced the ML algorithms' advantage in capturing non-linear behaviour. Secondly, the ML algorithms' slight inferiority could be due to data noise; simple linear algorithms such as the CLR are usually less sensitive to noise, so over-fitting is not an issue for them when the number of observations is large enough (i.e. at least 10 to 20 observations per parameter; Harrell, 2014). The exceptions, Ah, F g, and F n, were estimated more accurately by the XGB, ANNs, and RF, especially F g, with the RMSE of RF and CLR for F g being 28.91 vs.
33.92, respectively. Tukey's HSD test on the mean RMSE values of F g confirmed that the XGB, ANNs, and RF gave significantly better results, while, as for all other fluxes and drivers, the FBP was the worst algorithm (Table 9). Yet, according to the same test for the other drivers, there was no significant difference between the algorithms other than the FBP, which produced the largest mean RMSE values (results not shown). Importantly, though, none of the algorithms provided adequate estimates of soil moisture (Sws), particularly in drier regions. This weak performance arose because Sws changes dramatically during rainfall in a pulsed manner, often from near zero to saturation in a short time, whereas the algorithms had been trained on datasets mostly reflecting non-rainy periods. Consequently, the fitted algorithms could not estimate Sws accurately when precipitation occurred and soil moisture increased dramatically. In a wet region like Tumbarumba, where rainy days are frequent, the time series are much less spiky; thus, the overall performance was better there than in the drier regions (e.g. R² of 0.45 and 0.26 on average for Tumbarumba and Calperum, respectively). In addition, the dataset used to gap-fill the soil moisture was a model derivation from gridded data or regional reanalysis and therefore may not be close to reality. A further challenge in estimating soil moisture is its low spatial coherence; it can differ markedly just a couple of hundred metres away owing to storms, topography, and soil structure heterogeneity (Reichle et al., 2004; Sahoo et al., 2008).

Discussion
Nine gap-filling algorithms were used in this study: eXtreme Gradient Boost (XGB), random forest (RF), artificial neural networks (ANNs), marginal distribution sampling (MDS), support vector regression (SVR), classical linear regression (CLR), panel data (PD), elastic net regularisation (ELN), and the Prophet Forecasting Model (FBP). All algorithms performed similarly in estimating the meteorological and environmental drivers (turbulent fluxes included) across all stations except the FBP, which performed poorly because it did not use any ancillary data. The best results were achieved for gaps of 30 d and shorter, while the worst results were obtained for the longest windows of 180 and 365 d. Although most of the algorithms performed almost equally well in estimating the meteorological and environmental drivers, the linear algorithms (CLR, ELN, and PD) performed slightly better, though not significantly so according to Tukey's HSD test. The only apparent exception was F g, for which the RF provided more accurate and robust estimates. The ML algorithms and MDS, on the other hand, showed their superiority over the linear algorithms in estimating the main fluxes, F c, F e, and F h. For F c, the XGB, RF, and ANNs performed significantly better than the FBP and all linear algorithms (i.e. the CLR, PD, and ELN), followed closely by the SVR and MDS. The superiority of the ML algorithms and their similar performance agree with the results of previous researchers (Falge et al., 2001; Moffat et al., 2007), who showed the superiority of non-linear algorithms and no significant difference amongst the top algorithms in estimating F c. Also, given the slight superiority of XGB and RF over ANNs, our results support the finding of Kim et al. (2020) for methane that RF performs well for EC flux gap-filling.
The XGB was the most novel ML algorithm used in this research, and based on most performance metrics it provided comparatively robust estimates of the fluxes. In estimating the meteorological drivers, though, the XGB did not show any superiority over the other algorithms, especially the linear ones. Moreover, the XGB needed 4 to 6 times longer to be trained and tuned, making it a less feasible algorithm when time and processing power are important factors or when several years of data need to be gap-filled. Hence, we do not recommend the XGB as an alternative to the current standard algorithms. Nevertheless, because of its local superiorities, this algorithm might be suitable for use in an ensemble model alongside algorithms with different weaknesses.
The RF was the best all-round algorithm amongst the nine used in this study, providing the most consistent and robust estimates of the fluxes (similar to XGB), while being less complicated and faster to train than the XGB. The RF also provided the best results for F g, for which the linear algorithms did not perform well. This superiority of RF over ANNs, MDS, and SVR has been shown previously by Kim et al. (2020) for gap-filling of methane, suggesting that it is worth testing the RF for other towers and fluxes across FLUXNET.
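The RF-based gap-filling workflow can be sketched as below, assuming scikit-learn is available. The driver names (Fsd, Ta, VPD, Ws), the synthetic relationship, and the gap position are illustrative assumptions, not the study's actual inputs.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 2000
# Hypothetical drivers: shortwave radiation, air temperature, VPD, wind speed
X = np.column_stack([
    rng.uniform(0, 1000, n),   # Fsd (W m-2)
    rng.uniform(0, 40, n),     # Ta (degC)
    rng.uniform(0, 4, n),      # VPD (kPa)
    rng.uniform(0, 10, n),     # Ws (m s-1)
])
# Synthetic flux with a non-linear driver response plus noise
y = -0.01 * X[:, 0] + 0.05 * X[:, 1] ** 1.5 + rng.normal(0, 1, n)

gap = np.zeros(n, dtype=bool)
gap[800:1280] = True           # a 10 d gap at half-hourly resolution (480 records)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[~gap], y[~gap])    # train on the observed records only
y_filled = y.copy()
y_filled[gap] = model.predict(X[gap])   # impute the gap from the drivers
```

The same fit/predict pattern applies to the other ML regressors compared in the study; only the estimator changes.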
The ANNs estimated the fluxes better than the linear algorithms, most notably for F c, yet they were not as robust as the XGB and RF in general. For F c and F h, the ANNs produced a bias, mainly due to overestimating minimum values when the window lengths were longer than 30 d. However, since the superiority of the XGB and RF was not considerable, it is difficult at this point to recommend XGB or RF as better alternatives. The utility of ANNs has been validated over a long time in different locations, and they have been considered among the most reliable algorithms in the field for more than a decade (Aubinet et al., 2012; Hagen et al., 2006; Kunwor et al., 2017; Moffat et al., 2007). In other words, the superiority of RF should be confirmed in several future studies before the network could be convinced to adopt RF instead of ANNs or to identify it as another standard gap-filling method. Furthermore, a wide variety of ANN architectures are used in the field (Beringer et al., 2017; Hagen et al., 2006; Isaac et al., 2017; Kunwor et al., 2017; Moffat et al., 2007), and the minor superiority of RF and XGB cannot be generalised without additional case studies. As such, we suggest that other researchers use the RF, especially for F h and F c, alongside ANNs to determine which performs better in challenging scenarios (e.g. when the gaps are long). Another option is to develop ensemble models to improve the results over a single algorithm (Moffat et al., 2007). Ideally, a model with a higher level of flexibility is required in the field (Hagen et al., 2006; Kunwor et al., 2017; Richardson and Hollinger, 2007). Finally, the ANNs, like the other ML algorithms, did not show a consistent superiority over the linear algorithms for the environmental drivers. Therefore, we do not recommend using ML algorithms in such scenarios, except for F g, for which RF seems to be the better option.
The MDS performed similarly to, yet not quite as well as, the XGB, RF, and ANNs in gap-filling the fluxes. Its performance was close to that of the SVR but more reliable for F e and F h. It is worth mentioning that this performance was achieved despite the MDS using fewer input features. Its performance was comparable with the ML algorithms, particularly when the gap lengths were relatively short (10 d or less). As such, we recommend using the MDS when the gaps are not long or the available input features are limited, especially considering that the MDS runs significantly faster than the ML algorithms and is easier to use.
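The look-up-table idea behind MDS (Reichstein et al., 2005) can be sketched as follows: a missing flux is replaced with the mean of fluxes observed under similar meteorological conditions within a time window. The tolerances and window length below are the commonly cited defaults, used here only for illustration, and the demo data are synthetic; a full MDS implementation also widens the window and relaxes the similarity conditions stepwise when no match is found.

```python
import numpy as np

def mds_fill(i_gap, time, flux, drivers, valid,
             tol=(50.0, 2.5, 5.0), window_days=7):
    """Fill flux[i_gap] with the mean of similar-condition observations.

    drivers: (n, 3) array of Fsd (W m-2), Ta (degC), VPD (hPa);
    valid:   boolean mask marking records with an observed flux.
    """
    near = np.abs(time - time[i_gap]) <= window_days
    similar = np.all(np.abs(drivers - drivers[i_gap]) < np.asarray(tol), axis=1)
    pool = near & similar & valid
    if pool.any():
        return float(flux[pool].mean())
    return float("nan")   # full MDS would widen the window or drop conditions

# Minimal demo: identical conditions, one missing record
time = np.arange(10.0)              # days
drivers = np.zeros((10, 3))         # identical met conditions, for illustration
flux = np.arange(10.0)
valid = np.ones(10, dtype=bool)
valid[5] = False                    # the record to fill
print(mds_fill(5, time, flux, drivers, valid))
```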
The SVR showed consistent inferiority to the other ML algorithms and did not meet our expectations for either the meteorological drivers or the major fluxes. Its only strength was that it captured the extreme values better than any other algorithm. However, given its larger RMSE, this apparent advantage is suspect and may have arisen from over-fitting. This dubious performance suggests that SVR is more vulnerable to over-fitting for these data types. Hence, we suggest that the SVR not be used in environmental modelling of the reviewed drivers and fluxes.
The CLR, the simplest algorithm used in this research, delivered comparatively acceptable performance in estimating the meteorological drivers, except for F g. This algorithm, however, did not perform well for the fluxes, especially F c, mainly because of its inability to capture the extreme values arising from the non-linear relation of F c to its drivers. Overall, considering the CLR's simple, resource-saving, and robust performance for the drivers, it seems to be the most suitable way to fill the gaps of meteorological parameters in similar scenarios in which the same ancillary datasets are available.
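For a driver, the CLR approach amounts to an ordinary least-squares fit of the tower's series against the same variable from an ancillary source (e.g. a nearby station or reanalysis). The sketch below uses synthetic air temperature; the coefficients and gap position are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical ancillary air temperature (e.g. reanalysis) and tower record
ta_ancillary = rng.uniform(5, 35, 1000)
ta_tower = 0.9 * ta_ancillary + 1.5 + rng.normal(0, 0.5, 1000)

gap = np.zeros(1000, dtype=bool)
gap[200:248] = True                    # a 1 d gap at half-hourly resolution

# Fit ta_tower = a * ta_ancillary + b on the observed records only
a, b = np.polyfit(ta_ancillary[~gap], ta_tower[~gap], 1)
ta_filled = ta_tower.copy()
ta_filled[gap] = a * ta_ancillary[gap] + b   # impute the gap
```

Because the input and output are the same physical variable, the relationship is close to linear, which is why this simple fit competes with the ML algorithms for most drivers.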
The PD performed slightly better than the CLR, yet it did not show a significant superiority over the other linear algorithms used in the research. This unforeseen weak performance can be explained by two factors. First, one of the assumptions of the PD is that the cross-sectional units (here, towers) behave similarly under similar conditions (the independent variables), the only source of difference being the specific characteristics of each individual cross section. The five towers selected in this research appear to have violated this assumption because they lie in widely different ecosystems. Based on previous studies in which the PD performed well (Izady et al., 2013, 2016; Mahabbati et al., 2017), a decent level of homogeneity appears vital for the PD to perform satisfactorily: in all those cases the cross-sectional ecosystems had significant similarities and the distances between them were smaller, so characteristics such as radiation, climate, and rainfall were considerably more homogeneous than for the towers used in this research. Finally, it is worth mentioning that the PD has commonly been applied to time series with weekly or longer resolution, with some exceptions using daily time steps. In this research, the data resolution was half-hourly, which dramatically increased the computational demands of the algorithm and led to days of processing for a single run. This demand arises because the algorithm creates a dummy variable for each time step, and the resulting matrix of variables becomes too large to compute on a regular PC. Considering this computational expense, we recommend that other researchers not use the PD when the time resolution is finer than daily. Despite this limitation, we still encourage further use of the PD whenever there is a decent level of homogeneity amongst the cross sections and the time resolution is daily or longer.
As a hybrid linear model, the ELN did not show any superiority over the CLR, despite its modifications intended to provide more accurate estimates. Although the ELN estimated the drivers well, with a slight edge on some occasions (e.g. for F ld), the CLR remains the more appropriate algorithm for gap-filling the drivers because of its simplicity and lower computational cost.
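The elastic net blends the L1 (lasso) and L2 (ridge) penalties on top of a linear fit, which is the "hybrid" modification referred to above. A minimal sketch with scikit-learn follows; the regularisation strengths and the synthetic data are arbitrary examples.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 8))        # hypothetical standardised drivers
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 0.1, 500)  # only 2 matter

# alpha sets the overall penalty; l1_ratio the L1/L2 mix (0 = ridge, 1 = lasso)
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
# The L1 part shrinks irrelevant coefficients towards (or exactly to) zero
print(np.round(model.coef_, 2))
```

When the relationship is already strongly linear with few irrelevant inputs, as for most drivers here, this shrinkage buys little over plain least squares, consistent with the result above.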
The FBP was unique amongst the algorithms used in this research in that it did not use any independent variables to estimate the drivers and fluxes. Its performance was the least satisfactory of all the algorithms. Therefore, the FBP cannot be considered a reliable alternative to the current gap-filling algorithms, especially for longer gaps.
Given that some of the environmental drivers affecting F c differ between day and night, separating the diurnal and nocturnal datasets to train the algorithms could improve the outcome. Mainly because of the u * threshold filtering and other problems associated with the nocturnal period, the diurnal data generally far outweigh the nocturnal data, which potentially biases the algorithm. The same challenge applies to soil moisture estimation, as the behaviour of the system on sunny days is utterly different from that during rainy periods. Moreover, system memory and antecedent conditions are undeniable features of soil moisture (Ogle et al., 2015). Therefore, models that can address these considerations are more likely to improve the estimates.
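The diurnal/nocturnal split suggested above can be sketched as follows. The shortwave-radiation threshold of 10 W m-2 and the synthetic series are illustrative assumptions; any day/night criterion (e.g. solar elevation) could be substituted.

```python
import numpy as np

def split_day_night(fsd, threshold=10.0):
    """Boolean day mask from incoming shortwave radiation Fsd (W m-2)."""
    return fsd > threshold

rng = np.random.default_rng(5)
fsd = np.clip(rng.normal(300, 350, 1000), 0, None)   # crude Fsd stand-in
day = split_day_night(fsd)

# Train one gap-filler per regime instead of a single pooled model, e.g.:
#   model_day.fit(X[day], y[day]); model_night.fit(X[~day], y[~day])
# so the abundant diurnal data cannot dominate the nocturnal fit.
print(day.sum(), (~day).sum())
```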
Finally, it is noteworthy that some of the flux drivers used in this study as input features for the gap-filling algorithms are not commonly used or might not be available globally. However, given that similar relative performance has been achieved in other research using different sets of input features (Kim et al., 2020), the relative ranking of the algorithms reviewed here should generally hold when different input features are used.

Conclusions
Eight gap-filling algorithms for estimating 16 meteorological drivers, and nine algorithms for the three key ecosystem turbulent fluxes (sensible heat flux, F h; latent heat flux, F e; and net carbon flux, F c), were investigated, and their performance was evaluated using datasets from five towers in Australia. Overall, three ML algorithms, XGB, RF, and ANNs, performed nearly equally well and significantly better than their linear rivals (the CLR, PD, and ELN) in estimating the flux values, whereas the linear algorithms performed almost as well as the ML algorithms for the meteorological drivers. Amongst these nine algorithms, the RF and XGB showed the highest level of robustness and reliability in estimating F c, F e, and F h. The PD was expected to perform better than the other linear methods, and it was hoped that it could compete with the ML algorithms in estimating the fluxes, but it failed to do so. The SVR was the only ML algorithm that did not perform at the level of the rest of the ML algorithms, which we suspect was due to over-fitting, while the MDS performed somewhere in between. Considering the outcomes of previous research undertaken in the OzFlux network (e.g. Cleverly et al., 2013; Isaac et al., 2017), none of the ML algorithms used in this research proved to provide substantially better flux estimates than the standard method (ANNs). Nonetheless, amongst the algorithms tested, the RF showed potential as an alternative owing to its more consistent performance over long gaps. Finally, we offer the suggestions below to improve results for prospective researchers, as well as the QC and gap-filling procedures of flux networks.
1. Since the RF was more consistent than its competitors, including the ANNs, we suggest using RF alongside the commonly used algorithms in challenging scenarios, such as long gaps, to establish whether this superiority can be generalised.
2. Even after three levels of quality control by flux processing software (e.g. PyFluxPro), the data remain quite noisy. These noisy data are an essential source of both uncertainty and inaccuracy in the outcome, regardless of the algorithm used to gap-fill the data. An additional level of quality control, such as wavelet-based methods or matrix factorisation, on top of the classical methods used by PyFluxPro and similar platforms, could therefore improve the data quality and thereby the final imputation results.
3. For future researchers, using recurrent neural networks (RNNs) instead of feed-forward neural networks (FFNNs) could improve the estimates, as RNNs allow the model to capture the temporal dynamics of the time series. Unlike FFNNs, in which activations flow only from the input layer to the output layer, RNNs also have neuron connections pointing backwards (Géron, 2019). The demand for an algorithm capable of considering time has been mentioned in previous research as one of the reasons why testing new algorithms is needed (Richardson and Hollinger, 2007).
4. Developing ensemble models from algorithms with different weaknesses and strengths may also enhance the results where a single algorithm falls short.
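The ensemble idea in suggestion 4 can be sketched as a (possibly weighted) average of several gap-fillers' predictions; weighting by inverse validation RMSE is one simple option. The algorithm labels and numbers below are placeholders.

```python
import numpy as np

def ensemble_mean(predictions, rmses=None):
    """predictions: (k, n) array from k algorithms; rmses: optional (k,) weights."""
    predictions = np.asarray(predictions, dtype=float)
    if rmses is None:
        return predictions.mean(axis=0)
    w = 1.0 / np.asarray(rmses, dtype=float)   # better algorithms weigh more
    return (w[:, None] * predictions).sum(axis=0) / w.sum()

preds = np.array([[1.0, 2.0, 3.0],   # e.g. RF
                  [2.0, 3.0, 4.0],   # e.g. ANNs
                  [0.0, 1.0, 2.0]])  # e.g. MDS
print(ensemble_mean(preds))                          # unweighted mean
print(ensemble_mean(preds, rmses=[1.0, 2.0, 4.0]))   # RMSE-weighted mean
```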

Figure 1 .
Figure 1.A heat map of mean RMSE values of F c across all sites based on nine algorithms and nine window lengths in 2013.

Figure 2 .
Figure 2. Measured vs. estimated values of F c for Calperum based on a 10 d gap window (22-31 March 2013): (a) the ML algorithms plus the MDS and (b) the linear models plus FBP.

Figure 3 .
Figure 3. Measured vs. estimated values of F e for Calperum based on a 10 d gap window (22-31 March 2013).

Figure 4 .
Figure 4. Measured vs. estimated values of F h for Calperum based on a 10 d gap window (22-31 March 2013).

Table 1 .
Information on the five towers from which data were used, including their name, location, dominant species, and climate.

Table 4 .
The name and the abbreviation of the gap-filling algorithms.

Table 5 .
The average performance metrics for each gap-filling algorithm regarding F c , which includes all window lengths and sites, ranked by RMSE using the Tukey's HSD test at the level of 0.05.

Table 6 .
The average RMSE, R², and MBE for F c gap-filling based on the window length, including the outcome of all sites; the differences in RMSE values were tested using the Tukey's HSD test at the level of 0.05.

Table 7 .
The average metrics for F e gap-filling based on the algorithms, ranked by RMSE using the Tukey's HSD test at the level of 0.05.

Table 8 .
The average metrics for F h gap-filling based on the algorithms, ranked by RMSE using the Tukey's HSD test at the level of 0.05.

Table 9 .
The average RMSE for F g gap-filling based on the algorithms using the Tukey's HSD test at the level of 0.05.