Research activities

Rollback recovery in Distributed Systems

A backward error recovery (BER) is a well know general tool of fault-tolerance. BER consists in saving periodically the current state of computation (a checkpoint) on a stable storage (supposed to never fail), and restoring back the saved state when recovering from a failure. In a distributed system a global checkpoint is composed of local checkpoints of all processes in the system. Local checkpoints can be saved in a coordinated way (synchronous checkpointing) - when all processes cooperate on saving checkpoints that represents a consistent global state of the system. This approach requires a synchronization between all cooperating nodes which introduce the time and communication overhead. Another approach is to save local checkpoints independently, without any cooperation of nodes (asynchronous checkpointing). This approach offers a minimal overhead on a checkpointing process, but suffers from a domino effect during a recovery procedure. To reduce the cost of recovery and eliminate the domino effect, a message logging and dependency tracking can be used.
  • bookRecommended basic bibliography
    PDF R. Koo, S. Toueg, "Checkpointing and Rollback-Recovery for Distributed Systems", IEEE Transactions on Software Engineering, vol.13 no.1, pp.23-31, 1987.
    gzipped Postscript Y.M. Wang, W. K. Fuchs, "Optimistic Message Logging for Independent Checkpointing in Message-Passing Systems", Proc. of Symposium on Reliable Distributed Systems - SRDS'92, pp.147-154, 1992.
    gzipped Postscript E.N. Elnozahy, W. Zwaenepoel. "Manetho, Transparent Rollback-Recovery with Low Overhead, Limited Rollback and Fast Output Commit", IEEE Transactions on Computers, vol.41, no.5, pp.526-531, 1992
    PDF E.N. Elnozahy, D.B. Johnson, Y.M. Wang. "A Survey of Rollback-Recovery Protocols in Message-Passing Systems", Cornegie Mellon University technical report CMU-CS-96-181, 1996.

Reliable DSM Systems

My initial research interests tended to DSM systems providing recoverability of shared objects. There are 3 main approaches:
  1. adaptation of message-passing checkpoint-recovery techniques (coordinated/independent checkpointing)
    • bookMain references
      PDF K.-L. Wu and W. K. Fuchs. "Recoverable Distributed Shared Memory", IEEE Transactions on Computer, 39(4), April 1990, pp. 460-469.
      gzipped Postscript Bob Janssens, W. Kent Fuchs. "Relaxing Consistency in Recoverable Distributed Shared Memory", Proc. of International Symposium on Fault-Tolerant Computing Systems - FTCS'93, June 1993, pp. 155-163.
      gzipped Postscript Golden G. Richard III, Mukesh Singhal. "Using Logging and Asynchronous Checkpoint to Implement Recoverable Distributed Shared Memory", Proc. of 12th Symposium on Reliable Distributed Systems, 1993, pp. 58-67.
  2. dedicated checkpoint-recovery protocols with dependency tracking of coherency protocol messages
    • bookMain references
      gzipped Postscript Takeshi Fuchi, Mario Tokoro. "A Mechanism for Recoverable Shared Virtual Memory", Manuscript University of Tokio, 1994.
      gzipped Postscript Taesoon Park, Sungbok Cho, Y. Yeom. "An Efficient Logging Scheme for Recoverable Distributed Shared Memory Systems", International Conference on Distributed Computing Systems, 1997.
      gzipped Postscript Bob Janssens, W. Kent Fuchs. "Reducing Interprocessor Dependence in Recoverable Distributed Shared Memory", Proc. of 13th Symposium on Reliable Distributed Systems, 1994, pp.34-41.
  3. integration with coherency protocols
    • bookMain references
      gzipped Postscript Anne-Marie Kermarrec, Gilbert Cabillic, Alain Gefflaut, Christine Morin, Isabelle Puant. "A Recoverable Distributed Shared Memory Integrating Coherence and Recoverability", IRISA - Publication Interne no. 897, 1995.
      gzipped Postscript Christine Morin, Anne-Marie Kermarrec, Michel Banâtre. "An Efficient and Scalable Approach for Implementing Fault Tolerant DSM Architectures", INRIA - Raport de recherche no. 3103, 1997.

  • bookRecommended additional bibliography
    gzipped Postscript Christine Morin, Isabelle Puant. "A Survey of Recoverable Distributed Shared Memory Systems", IRISA - Publication Interne no. 975, 1995.
    PDF William Dieter James Lumpp. "Fault Recovery for Distributed Shared Memory Systems", Proceedings of the IEEE-Aerospace Conference, 1997.

Some of my publications on this subject:

gzipped Postscript Jerzy Brzeziński, Michał Szychowiak: Reliability of Distributed Shared Memory Systems, Proceedings of the European Conference on Research and development for Information Society - ISTHmus 2000.
PDF Jerzy Brzeziński, Michał Szychowiak: Fast and Low Cost Recovery Techniques for Distributed Shared Memory, Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA`02), Las Vegas, Nevada, June, 2002.
PDF Jerzy Brzeziński, Michał Szychowiak: Replication of Checkpoints in Recoverable DSM Systems, Proceedings of the 21st Parallel and Distributed Computing and Networks (PDCN 2003), February 2003.
PDF Jerzy Brzeziński, Michał Szychowiak: An Extended Home-Based Coherence Protocol for Causally Consistent Replicated Read-Write Objects, Proceedings of the International Workshop on Distributed Shared Memory on Clusters - DSM 2003 in Proceedings of the IEEE Symposium on Cluster Computing and the Grid (CCGrid 2003), Tokyo, Japan, May 2003, pp. 510-515.

Process and Object Replication for High-Availability

Consensus / Failure Detectors

  • bookRecommended basic bibliography
    compressed Postscript T.D. Chandra, S. Toueg. "Unreliable Failure Detectors for Reliable Distributed Systems", Proc. 10th ACM Symp. Principles of Distributed Computing, ACM Press, 1991, pp.325-340.

Reliable Group Communication

  • bookRecommended basic bibliography
    PDF Xavier Défago, André Schiper, Péter Urbán "Totally Ordered Broadcast and Multicast Algorithms: A Comprehensive Survey." Technical Report DSC/2000/036, École Polytechnique Fédérale de Lausanne, Switzerland, September 2000.
    gzipped Postscript Rachid Guerraoui. "Revisiting the relationship between Non Blocking Atomic Commitment and Consensus problems", International Workshop on Distributed Algorithms (WDAG-95), Springer Verlag (LNCS 972), 1995.

Self-Stabilization

Self-stabilization of a system guarantees that regardless of its current state, the system will converge to a legal state in a finite number of steps.
  • bookRecommended basic bibliography
    PDF Jerzy Brzeziński, Michał Szychowiak: Self-Stabilization in Distributed Systems - a Short Survey, Foundations of Computing and Decision Sciences, 25(1), 2000.
    PDF Mitchell Flatebo, Ajoy Kumar Datta, Sukumar Ghosh: Self-Stabilization in Distributed Systems, in Advances in Distributed Computing: Concepts and Design, IEEE Computer Society Press, T.L. Casavant and M. Singhal eds., 1992.
    PDF Ted Herman: Probabilistic Self-Stabilization, Information Processing Letters, vol. 35, 1990, pp. 63-67.