The Impact and Resolution of the GPS Week Number Rollover of April 2019 on Autonomous Geophysical Instrument Platforms

Instrument platforms the world over often rely on GPS or similar satellite constellations for accurate timekeeping and synchronization. This reliance can create problems when the timekeeping counter aboard a satellite overflows and begins a new epoch. Due to the rarity of these events (19.6 years for GPS), software designers may be unaware of such circumstances, or may choose to ignore them to limit development complexity. Although it is impossible to predict every fault that may occur in a complicated system, there are a few "best practices" that can allow for graceful fault recovery and restorative action. These guiding principles are especially pertinent for instrument platforms operating in space or in remote locations like Antarctica, where restorative maintenance is both difficult and expensive. In this work, we describe how these principles apply to a communications failure on Autonomous Adaptive Low-Power Instrument Platforms (AAL-PIP) deployed in Antarctica. In particular, we describe how code execution patterns were subtly altered after the GPS week number rollover of April 2019, how this led to Iridium satellite communications and data collection failures, and how communications and data collection were ultimately restored. Finally, we offer some core tenets of instrument platform design as guidance for future development.

so the systems eventually run out of power and gracefully enter hibernation during the austral winter. The time frame for this hibernation period varies by site location and weather, but typically begins around August and ends in October. It is customary to reduce the amount of data being retrieved on a daily basis to only important housekeeping values once the sun retreats in the northern spring to extend the operational period. Science data is stored in local memory for later download when solar power is plentiful.
The AAL-PIPs that are the focus of this paper used the Router-based Unrestricted Digital Interworking Connectivity Solutions (RUDICS) Iridium protocol for communications. With RUDICS, a remote AAL-PIP system connects to the Iridium network with its modem, and communications traffic is routed through a Department of Defense gateway in Hawaii. The gateway in turn connects via the internet (Transmission Control Protocol/Internet Protocol, or TCP/IP) to a server at Virginia Tech.
The GPS week number rollover of 2019 created an unanticipated software fault that affected both data collection and Iridium communications at the AAL-PIPs. In this paper, we describe why this fault occurred, how it was diagnosed, and how it was ultimately resolved. Thus, we use this fault as an example to describe best practices for designing future systems against similar faults, or to at least enable graceful fault recovery and restorative action if a fault occurs.
Section 1 (this section) establishes a sufficient background in the operation and usage of satellite timing systems (Section 1.1), as well as details about the remote instrument platform operated by the authors (Section 1.2). Section 2 describes the onset of the fault as discovered by the operators. Section 3 describes the development and implementation of a recovery solution, and Section 4 lists some of the valuable insights gained from this effort. A summary is provided in Section 5.

Fault Description
Below, we describe a fault that at first appeared very minor but months later became a major failure, halting operation at 4 of 6 stations. The fault encountered on the AAL-PIPs relates to the nature of clocks on Linux systems.

The single board computer at the heart of the AAL-PIP platform runs a very lightweight version of Linux based on the Debian distribution. Like all Linux systems, it maintains two clocks during operation. The "hardware" clock is typically used when a device is "powered off" to keep track of time elapsed between reboot cycles, and often relies on a coin cell battery to maintain operation. The "system" clock is the clock that users are more familiar with, and represents the time and date with a software counter. These clocks drift at different rates, and there are a number of different techniques to synchronize the two over long periods of time.
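The distinction between the two clocks can be illustrated with a short sketch. On Linux, the system (wall) clock can be stepped by software at any time, whereas the monotonic clock only ever moves forward; comparing interval measurements made against each shows why they behave differently under adjustment:

```python
import time

# System ("wall") clock: adjustable by software, used for timestamps.
wall_a = time.clock_gettime(time.CLOCK_REALTIME)
# Monotonic clock: never steps backwards, unaffected by clock adjustments.
mono_a = time.monotonic()

time.sleep(0.1)

wall_b = time.clock_gettime(time.CLOCK_REALTIME)
mono_b = time.monotonic()

# Both intervals should be ~0.1 s here, but only the monotonic interval is
# guaranteed: a settimeofday()-style adjustment between the two reads could
# make the wall-clock delta arbitrary, even negative.
print(f"wall interval:      {wall_b - wall_a:.3f} s")
print(f"monotonic interval: {mono_b - mono_a:.3f} s")
```

On an undisturbed system the two intervals agree; the divergence only appears when the system clock is stepped, which is precisely the situation the rollover created.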
For the purposes of operating the AAL-PIP platform, the system clock is synchronized only once after a power outage or reboot. To compensate for drift, GPS time is used to make subsequent adjustments to the system clock thereafter. The drift rate measurements and compensations occur hourly. The distinction between clocks and the details of the adjustment process are important for illustrating the complex behavior relating to the GPS fault on the AAL-PIP systems.
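The hourly measure-and-compensate cycle can be sketched as follows. This is a hypothetical illustration, not the actual AAL-PIP code: `read_gps_time` and `slew` are assumed helpers standing in for the GPS receiver interface and the kernel clock-adjustment call, respectively:

```python
import time

def measure_drift(read_gps_time):
    """Return system-clock error in seconds (positive = system clock fast)."""
    gps_now = read_gps_time()   # GPS-derived Unix time (assumed helper)
    sys_now = time.time()       # current system clock
    return sys_now - gps_now

def compensate(read_gps_time, slew, max_step=0.5):
    """Correct small errors by slewing; flag large ones for operator review."""
    error = measure_drift(read_gps_time)
    if abs(error) <= max_step:
        slew(-error)            # gentle adjustment, e.g. via adjtime()
    else:
        print(f"drift {error:+.3f} s exceeds limit; needs attention")
    return error
```

The sketch assumes GPS time is trustworthy; as the rest of this paper shows, that assumption is exactly what the rollover violated.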

Clocks and Watchdogs
In the case of the AAL-PIP system, there are two important software design considerations that had to be addressed during the patch process. First, the communications watchdog timer issue had to be resolved. Typically, the watchdog will automatically reboot the system after either staying connected or disconnected for more than 24 hours. When the watchdog first starts, it obtains the current time and then sets a trigger for 24 hours later. This reboot process was prevented by rolling the clock back to a time several years before the trigger time. An alternative to using the full date and time for timers like this would have been to strip the date from the trigger; however, this can still lead to confusion when adjustments are made to the system clock. Using a monotonic hardware clock for this timer would provide a more robust solution, as these clocks typically are not adjusted by any software process.
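A minimal sketch of the recommended approach, using the monotonic clock rather than the wall clock for the deadline (the class and method names here are illustrative, not the AAL-PIP implementation):

```python
import time

WATCHDOG_PERIOD = 24 * 3600  # seconds; 24-hour connect/disconnect limit

class LinkWatchdog:
    """Reboot trigger based on the monotonic clock, so that stepping the
    system clock backwards (or forwards) can neither defer the trigger
    nor fire it early."""

    def __init__(self, period=WATCHDOG_PERIOD):
        self.period = period
        self.deadline = time.monotonic() + period

    def kick(self):
        """Reset the deadline, e.g. on each connect/disconnect transition."""
        self.deadline = time.monotonic() + self.period

    def expired(self):
        """True once the period has elapsed without a kick()."""
        return time.monotonic() >= self.deadline
```

Because `time.monotonic()` is immune to wall-clock adjustments, rolling the system clock back by years would have had no effect on a watchdog built this way.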

Checking Assumptions
The solution described in the Fault Recovery section involved checking the reported time against some epoch, and only adding an offset if the reported time was incorrect based on that comparison. This solution will only work until the next GPS rollover in 2038. Ideally, the systems will go through a hardware refurbishment before then, and newer GPS modules using the 13-bit week number can be installed. That rollover won't occur until 2137, and is sufficiently beyond consideration here. However, a rollover of a different kind will also occur in 2038, that of the signed integer time using the Unix epoch date of 1970. Though an entire adolescence will occur between now and that point, it is never too soon to consider what problems may arise as a result.
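The epoch-comparison fix can be sketched in a few lines. This is an illustrative reconstruction, not the deployed code: it assumes a known "build epoch" that valid GPS times must postdate, and advances any earlier reported time by whole 1024-week eras:

```python
from datetime import datetime, timedelta, timezone

GPS_ERA = timedelta(weeks=1024)  # one 10-bit week-number era (~19.6 years)

# Any date the software is known to postdate; a GPS time earlier than this
# must have rolled over. The value here is illustrative.
BUILD_EPOCH = datetime(2019, 4, 6, tzinfo=timezone.utc)

def correct_rollover(gps_time: datetime) -> datetime:
    """Add whole 1024-week eras until the reported time is plausible."""
    while gps_time < BUILD_EPOCH:
        gps_time += GPS_ERA
    return gps_time
```

The check is self-limiting: once the true date passes `BUILD_EPOCH` plus 1024 weeks (in 2038), a freshly rolled-over receiver would again pass the plausibility test, which is why this patch is only a bridge to a hardware refurbishment.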
It is often assumed that clocks are relatively trivial devices inside a computing system, and that the reported time can be taken for granted. For most applications, especially those using a network for timing, this is typically true. However, we have shown here that it is not always the case, and that the consequences for system performance can be dramatic. It was originally assumed that the GPS time reported would always be accurate to within milliseconds, and this assumption was written into our software.
When the time was offset by 1024 weeks, our system performance suffered as a result. In this instance, it was worth considering that the reported time could be incorrect, and planning for the possibility that the correct time would not be relayed. Our software was able to account for the situation where no GPS time was reported, but not for an incorrect GPS time being reported.

Summary and Conclusions
To summarize our suggested design guidelines:
-Maintaining control of the system behavior through use of a bidirectional communications link is of the utmost importance for system maintenance and repair.
-Reliance on software-based timing is not recommended for subsystems critical to instrument platform operation. Monotonic hardware timers should be used for watchdogs and other critical timers whenever possible.
https://doi.org/10.5194/gi-2020-47 Preprint. Discussion started: 18 January 2021 © Author(s) 2021. CC BY 4.0 License.
-It is important to consider both hard and soft subsystem failures, where outputs may be either missing or invalid.
-Special consideration during the design phase should be given to identifying and reducing the number of single point failure mechanisms. This includes any subsystem that could fail in a way that eliminates control of the system as a whole.
-Integration testing should be extensive, and attention should be given to potential faults in each subsystem.

-Whenever possible, duplicate hardware setups should remain onsite at the operational facility to develop solutions to unforeseen faults. If this is considered infeasible, some other method of simulating system behavior (through software / virtual instruments) should be considered.
Operating an instrument platform in a remote location like Antarctica provides a unique set of challenges for maintenance and operations. When setting out to develop a new remote instrument project or refurbish an existing project, there are a number of guiding principles one can follow to ensure success. Maintaining a robust communications link with the platform ensures that any eventualities that may arise from unsuspected fault cases can be addressed remotely. This ensures the platform can be maintained at minimal cost and expenditure, prolonging the life of the instrument and leading to greater proliferation. Properly identifying potential faults and adding fail-safes to critical systems is paramount to achieving this low-cost, long-life status.
Careful consideration must be given to how a system reacts in the event of hard failures as well as soft failures; the difference between a module not working at all and one working incorrectly can produce dramatically different behaviors. This is exactly the sort of situation that occurred on the AAL-PIP platforms as a result of the GPS rollover in 2019, and this kind of situation may occur twice in the year 2038. With thoughtful attention to how a platform will operate in the event of a failure, instrument designers and operators can plan for and recover from even the most unexpected and thorough fault cases.
Author contributions. SC currently maintains the AAL-PIP array operations and has prepared the manuscript with assistance from all co-authors. SC is also responsible for the development of software updates previously mentioned. YP provided troubleshooting assistance by operating a GPS synthesizer during fault replication attempts. CRC, MH, and ZX are all members originally involved in the deployment of the AAL-PIP magnetometer array.
Competing interests. The authors declare they have no conflict of interest.
Acknowledgements. This material is based upon work supported by the National Science Foundation under Grant No. PLR-1543364, AGS-2027210, and AGS-2027168. The authors acknowledge and appreciate the assistance provided by USAP, specifically attributing the efforts of Dan Wagster to the successful diagnosis of the communications fault. John Bowman, an undergraduate student at Virginia Tech, is also credited with assistance during the fault diagnosis. The New Jersey Institute of Technology (NJIT) is credited with providing Antarctic field