Opened 12 years ago

Last modified 5 years ago

#602 new defect/bug

Improve resilience of Navit

Reported by: mvglasow (2) Owned by: KaZeR
Priority: major Milestone: version 0.6.0
Component: core Version: git master
Severity: complex Keywords: logging, testing

Description (last modified by usul)

Navit appears to be very prone to crashing (see ticket #567 for a practical "case study" of several things going wrong, resulting in a crash). While crashes are generally annoying, a crash of your navigation system in the middle of a particularly tangled highway intersection, forcing you to restart the program and re-enter your destination, is among the worst examples.

That being said, Navit needs some generic improvements in this field. Besides the most obvious step of writing resilient-per-se code, the two key ingredients are crash prevention and graceful reaction where the crash is inevitable. I will give a good example of each.

First, crash prevention. It boils down to exception handling, and the good example here is Delphi: The window message loop is enclosed in a try block; if an exception occurs in the code for handling a particular message, the exception handling code is executed and then execution continues with the next message. (Since almost all code in a Delphi app typically goes inside some event handlers called from inside the message loop, it is hard to get a Delphi app to crash - the only remaining risk is that some data may be botched internally due to the exception and further errors result from this.)

Applied to Navit, this would mean a "global" exception handler around the central point at which Navit processes all input from the user or the GPS receiver. When an exception occurs during the handling of some event, this would abort the handling procedures for that event and trigger a default action, for example playing a sound or displaying a message on the screen which disappears after a few seconds. (Remember that the attention of the driver is on the road and not on Navit, hence we need a compromise which does inform the user of the fault but does not require any interaction on their part.) The handler should also do any cleanup that may need to be done, such as committing any unwritten data to file (losing a GPX log to a crash can be annoying when tracking).

That way even the most careless code would be unlikely to bring Navit down altogether. (Unless it were to mess with some data structures that get processed outside the global exception handler.)
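For illustration, plain C can approximate such a global handler with setjmp/longjmp from <setjmp.h>. The following is only a minimal sketch under assumptions: none of the names (struct event, handle_event, raise_error, dispatch_events, notify_user) exist in Navit, and the "fault" is simulated. It also only covers errors that the code detects and reports itself; it does not turn a segmentation fault into a recoverable error.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdio.h>
#include <stdlib.h>

/* All names below are hypothetical; they are not existing Navit functions. */

enum event_type { EVENT_GPS_FIX, EVENT_KEY_PRESS, EVENT_BROKEN };

struct event {
    enum event_type type;
};

static jmp_buf recovery_point;  /* where raise_error() jumps back to */
static int recovery_armed = 0;  /* nonzero while a handler runs under protection */
static int last_error = 0;      /* error code of the most recent failure */

/* "Throw": abandon the current handler and return to the dispatch loop. */
static void raise_error(int code)
{
    assert(code != 0);
    last_error = code;
    if (!recovery_armed)
        abort();                /* nobody can catch this, give up */
    longjmp(recovery_point, 1);
}

/* Stand-in for the real per-event processing. */
static void handle_event(const struct event *ev)
{
    if (ev->type == EVENT_BROKEN)
        raise_error(42);        /* simulated fault inside a handler */
    /* normal processing would go here */
}

/* Default action: inform the user without requiring any interaction. */
static void notify_user(int code)
{
    fprintf(stderr, "event handler failed (error %d), continuing\n", code);
}

/* Processes all events; a failing handler aborts only its own event.
 * Returns the number of events whose handler failed. */
int dispatch_events(const struct event *events, int count)
{
    /* volatile so the values are well defined after a longjmp */
    volatile int failures = 0;
    volatile int i;

    for (i = 0; i < count; i++) {
        recovery_armed = 1;
        if (setjmp(recovery_point) == 0) {
            handle_event(&events[i]);
        } else {
            /* we arrive here via longjmp() from raise_error() */
            notify_user(last_error);
            failures++;
        }
        recovery_armed = 0;
    }
    return failures;
}
```

Note that longjmp skips any cleanup in the functions it unwinds through, so memory or file handles acquired by the aborted handler would have to be released explicitly in the "catch" branch.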

Other critical code can be wrapped into separate exception handlers so that errors can be handled in a graceful manner. This is especially important when aborting an operation requires some cleanup. Exception handlers can always define some other behavior than the default, as deemed appropriate by the programmer. (Also see the next section for that.)

A side effect would be that, if some central data structures were to get corrupted, even the menu command to exit Navit might become unreachable. We need to keep the code needed for exiting as free of dependencies as possible (maybe implement the exit command itself as an exception?), else the user will never be able to leave Navit when they need to most urgently.

Second, crash recovery. Sometimes, as a result of a previous exception or other fault, internal data may be corrupted and Navit may be unable to continue. As mentioned before, this may (and according to Murphy WILL) happen at a point when the driver has other things to worry about than restarting Navit and re-entering all navigation data.

How about letting the global exception handler take care of that? It does require some work in advance, though:

Whenever the user changes a setting (view, 2D to 3D, ...), sets a destination or the like, store that information in a separate data structure. When exiting due to a fault, write that information to a file and restart Navit.

When starting Navit, look for a crash-recovery file and, if there is one, load it. (Preferably play a sound and display a dialog asking the user if he/she wants to reload the recovered data; assume Yes if the user does not respond within five or so seconds.) With the recovered information, resume where the user left off.

Caveat: the very recovery information may be corrupt and cause Navit to go into an infinite loop of crashes. Therefore, add an extra flag to the recovery data, setting it to 1 if the recovery information was already loaded from file and not touched since, 0 in all other cases. When loading a recovery file with the flag set to 1 (meaning Navit has already crashed once with the same set of recovery data in memory), offer "do not reload recovered data" as the default option (again with a timeout).
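The flag logic described above could look roughly like the sketch below. Everything in it is an assumption made for illustration: the struct fields, the one-line text file format and the function names are hypothetical and not taken from Navit. The dialog with its timeout is left out; recovery_load() only reports which default the dialog should offer.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical session state; fields and file format are illustrative only. */
struct recovery_state {
    int loaded_untouched;   /* 1 = loaded from file and not changed since */
    int view_3d;            /* 0 = 2D view, 1 = 3D view */
    double dest_lat;
    double dest_lon;
};

/* Call whenever the user changes something worth recovering.
 * Any change clears the "loaded and untouched" flag. */
void recovery_set_destination(struct recovery_state *s, double lat, double lon)
{
    s->dest_lat = lat;
    s->dest_lon = lon;
    s->loaded_untouched = 0;
}

/* Write the state to a file, e.g. from the global fault handler.
 * Returns 0 on success, -1 on failure. */
int recovery_save(const struct recovery_state *s, const char *path)
{
    assert(s != NULL && path != NULL);
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    int ok = fprintf(f, "%d %d %.6f %.6f\n", s->loaded_untouched,
                     s->view_3d, s->dest_lat, s->dest_lon) > 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Read the state back at startup.
 * Returns -1 if there is no usable file,
 *          0 if the data looks safe (default answer: reload),
 *          1 if Navit already crashed once with this data untouched
 *            (default answer: do not reload). */
int recovery_load(struct recovery_state *s, const char *path)
{
    assert(s != NULL && path != NULL);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    struct recovery_state tmp;
    int n = fscanf(f, "%d %d %lf %lf", &tmp.loaded_untouched,
                   &tmp.view_3d, &tmp.dest_lat, &tmp.dest_lon);
    fclose(f);
    if (n != 4)
        return -1;
    int suspect = (tmp.loaded_untouched != 0);
    tmp.loaded_untouched = 1;   /* from now on: loaded and not touched */
    *s = tmp;
    return suspect;
}
```

A caller would load into a temporary struct, show the dialog with the default chosen from the return value, and only copy the data into the live session if the answer is Yes.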

A good real-world example of crash recovery is Firefox - early versions were prone to crashes but the current one is quite stable, and in case of a crash, it restarts itself and reloads all tabs that were open at the time of the crash. The web site claims that Firefox will preserve even the e-mail you were just typing up to the last word. If somebody here is familiar with the Firefox code, this might be the help we need.

However... I realize that all I'm writing is based on exception handling. In C++ (or Delphi) I would just wrap the critical sections into a try block and drop some code into the exception handler, but Navit code is pure C. Any ideas on how to handle exceptions in C? I admit not being an expert on this... maybe the following link helps:

Please submit ideas and feedback here!

Change History (4)

comment:1 Changed 11 years ago by mvglasow (2)

For the record: we're one step ahead, since now at least the destination is saved and re-read when Navit starts up.

comment:2 Changed 9 years ago by usul

  • Description modified (diff)
  • Keywords logging testing added
  • Milestone set to version 0.5.1

This is slightly related to #313 (gdb crash dump). I agree that we need to improve code quality in general, but we currently lack the resources to take big steps:

  • establish testing procedures
  • establish auto testing (CUnit etc).

I have no idea how exception handling currently works, as Navit is a mix of C and C++ and sometimes even other code (e.g. Java for Android).

BTW: Currently we save your old destinations etc.

I will assign this ticket to the next minor release that focuses on stability.

comment:3 Changed 7 years ago by kazer

  • Severity set to normal

For the record, there has been a big effort on testing and continuous integration.

Documentation is available here.

Keeping the ticket open, as some more work is needed to save more session information.

comment:4 Changed 5 years ago by

  • Milestone changed from version 0.5.1 to version 0.6.0
  • Severity changed from normal to complex

This ticket requires a lot more work, even though Navit is (at least for me) quite stable. I am moving this ticket to 0.6.0 in order to get 0.5.2 out in a short time.
