The two main changes in this version of the paper are the addition of the MDS method for gap-filling fluxes and a randomized selection of introduced gaps. These are both considerable improvements and make the results from the methods comparison more robust. However, the short data records (only 1-2 years) were also a key concern from both reviewers and were not addressed in this revision. The paper, as it is, represents an interesting contribution showing that the methods were mostly comparable with the short records, that the traditional MDS algorithm still performs reasonably well, and that the more complex methods might not lead to improvements worth the extra costs. However, these are not conclusive results from the paper. The paper could have been a key contribution to the literature, and although the contributions seem to be technically sound, they do not advance the state of the art in gap-filling of eddy covariance data.
In an answer to a reviewer comment, the authors state: "...since the main goal of the study was to compare different gap-filling algorithms, we do not believe changing the input data leads to a difference in the relative performance of the algorithm". For the comparison of these algorithms, the only factor changing their performance will be the input data. The input data is even more important for methods such as ANNs and RF, which are entirely dependent on relationships between the data variables.
Still in the answers, about ancillary datasets: "Even though this is true, the ancillary data used in the current study have been used to gap-fill the drivers’ data, and not the fluxes directly. As such, it might not be a concern." It might be good to clarify in the methods that only the measured values for the drivers were used to gap-fill fluxes. Although it is fair to assume no gap-filled driver data was used to fill the fluxes, I couldn't find this statement in the paper.
The argument that many previous studies used only single years for evaluation omits that most of these had limited access to long and uniform records. With records spanning over 20 years now available from most regional flux networks, this is no longer a limitation, and longer records should have been integral to the paper. Seasonal patterns can be correctly identified by many of the methods used, but only if using multi-year data. Using a single year limited the results of the paper, which could have been a considerable contribution to both the eddy covariance and machine learning scientific communities.
In Moffat 2007 the RMSE values for the best performing algorithms (mainly ANN variants but also MDS) were consistently under 3.0 gC m-2 d-1. Since the values in this manuscript were consistently higher, this might support the argument that there was too little data to train the runs presented in this paper. Since the year selected to perform the tests was very complete, if the short record is not a limitation, as argued by the authors, one could expect these results to be better.
The introduction of randomized gaps improves the soundness of the results. However, in the methods, it is somewhat unclear how the many realizations of the random gaps were aggregated for the final results. This could be explained in more detail. As an example, it is curious that the RMSE values for Fc at Alice Springs Mulga are so low, yet the R2 values for the site are also low, while for Tumbarumba, the RMSE values are more within the expected ranges while R2 values are also higher.
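One possible explanation, offered only as an illustration: RMSE is expressed in flux units and scales with the magnitude of the fluxes, whereas R2 is relative to the variance of the observations, so a site with small fluxes can show low RMSE and low R2 at the same time. The Python sketch below uses synthetic numbers, not data from the manuscript, and assumes R2 is the coefficient of determination (the manuscript may use a different definition). The two "sites", the sample size, and the noise levels are all hypothetical.

```python
import numpy as np

def rmse(obs, pred):
    """Root mean square error, in the same units as the flux."""
    return float(np.sqrt(np.mean((obs - pred) ** 2)))

def r2(obs, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    ss_res = np.sum((obs - pred) ** 2)
    ss_tot = np.sum((obs - np.mean(obs)) ** 2)
    return float(1.0 - ss_res / ss_tot)

rng = np.random.default_rng(0)
n = 5000  # hypothetical number of artificially gapped records

# Hypothetical low-flux site: small flux variance, small absolute error.
obs_low = rng.normal(0.0, 0.5, n)
pred_low = obs_low + rng.normal(0.0, 0.4, n)

# Hypothetical high-flux site: large flux variance, larger absolute error.
obs_high = rng.normal(0.0, 5.0, n)
pred_high = obs_high + rng.normal(0.0, 2.0, n)

rmse_low, r2_low = rmse(obs_low, pred_low), r2(obs_low, pred_low)
rmse_high, r2_high = rmse(obs_high, pred_high), r2(obs_high, pred_high)

print(f"low-flux site:  RMSE={rmse_low:.2f}  R2={r2_low:.2f}")
print(f"high-flux site: RMSE={rmse_high:.2f}  R2={r2_high:.2f}")
```

In this synthetic case the low-flux site has both the lower RMSE and the lower R2. If that is what happens at Alice Springs Mulga, reporting an error metric normalized by the flux variability at each site would make the sites easier to compare.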
Finally, I will note that I disagree with the last recommendation in the conclusions. Ensembles are useful when there isn't a "true" value against which one can compare an estimated value. In gap-filling, artificially introducing gaps (where the original true values are known) for comparison allows precise estimates of uncertainty. Using ensembles for gap-filling would introduce unnecessary uncertainty. However, playing to the strengths of each method, one can procedurally combine them (e.g., one method for short gaps and one for long gaps) to improve the final results without mixed uncertainties.
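The procedural combination suggested above can be sketched as follows, in Python. This is a minimal illustration rather than a recommended implementation: the function names, the `max_short` threshold, and the two arrays of estimates (`short_fill`, `long_fill`, standing in for the output of any two gap-filling methods) are hypothetical.

```python
import numpy as np

def gap_runs(is_gap):
    """Return (start, stop) index pairs, stop exclusive, for each run of gaps."""
    padded = np.concatenate(([False], is_gap, [False]))
    edges = np.flatnonzero(np.diff(padded.astype(int)))
    return list(zip(edges[::2], edges[1::2]))

def fill_by_gap_length(flux, short_fill, long_fill, max_short=4):
    """Fill each gap from exactly one method, chosen by the gap's length.

    flux       : measured series with NaN marking gaps
    short_fill : estimates from the method preferred for short gaps
    long_fill  : estimates from the method preferred for long gaps
    max_short  : longest gap (in records) still treated as short (assumed)
    """
    flux = np.asarray(flux, dtype=float)
    short_fill = np.asarray(short_fill, dtype=float)
    long_fill = np.asarray(long_fill, dtype=float)
    filled = flux.copy()
    for start, stop in gap_runs(np.isnan(flux)):
        source = short_fill if (stop - start) <= max_short else long_fill
        filled[start:stop] = source[start:stop]
    return filled

# Toy series: one 1-record gap and one 3-record gap.
flux = [1.0, np.nan, 3.0, np.nan, np.nan, np.nan, 7.0]
short_fill = np.full(7, 10.0)  # placeholder output of a "short gap" method
long_fill = np.full(7, 20.0)   # placeholder output of a "long gap" method

filled = fill_by_gap_length(flux, short_fill, long_fill, max_short=2)
print(filled)
```

Because each gap is filled by a single method, the uncertainty estimated for that method from the artificial-gap experiments applies to the filled values directly.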
- Net ecosystem exchange (NEE) is usually defined as the sum of CO2 turbulent fluxes (commonly represented as Fc) and CO2 storage fluxes (commonly represented as Sc); so the definition in the paper for Fc as equivalent to NEE can be misinterpreted.
- It might be good to harmonize formatting for Figures 2, 3, and 4.
- page 15, L449: missing reference "()"
- page 24, L703: "3)" -> "4)"
- From the previous review, in the abstract: the acronyms RF and CLR are used before being defined