Research activities

Rollback recovery in Distributed Systems

A backward error recovery (BER) is a well know general tool of fault-tolerance. BER consists in saving periodically the current state of computation (a checkpoint) on a stable storage (supposed to never fail), and restoring back the saved state when recovering from a failure. In a distributed system a global checkpoint is composed of local checkpoints of all processes in the system. Local checkpoints can be saved in a coordinated way (synchronous checkpointing) — when all processes cooperate on saving checkpoints that represents a consistent global state of the system. This approach requires a synchronization between all cooperating nodes which introduce the time and communication overhead. Another approach is to save local checkpoints independently, without any cooperation of nodes (asynchronous checkpointing). This approach offers a minimal overhead on a checkpointing process, but suffers from a domino effect during a recovery procedure. To reduce the cost of recovery and eliminate the domino effect, a message logging and dependency tracking can be used.

Recommended basic bibliography

R. Koo, S. Toueg, "Checkpointing and Rollback-Recovery for Distributed Systems", IEEE Transactions on Software Engineering, vol.13 no.1, pp.23-31, 1987.

Y.M. Wang, W. K. Fuchs, "Optimistic Message Logging for Independent Checkpointing in Message-Passing Systems", Proc. of Symposium on Reliable Distributed Systems — SRDS'92, pp.147-154, 1992.

E.N. Elnozahy, W. Zwaenepoel. "Manetho, Transparent Rollback-Recovery with Low Overhead, Limited Rollback and Fast Output Commit", IEEE Transactions on Computers, vol.41, no.5, pp.526-531, 1992.

E.N. Elnozahy, D.B. Johnson, Y.M. Wang. "A Survey of Rollback-Recovery Protocols in Message-Passing Systems", Cornegie Mellon University technical report CMU-CS-96-181, 1996.

Reliable DSM Systems

My initial research interests tended to DSM systems providing recoverability of shared objects. There are 3 main approaches:

adaptation of message-passing checkpoint-recovery techniques (coordinated/independent checkpointing)

Main references

K.-L. Wu and W. K. Fuchs. "Recoverable Distributed Shared Memory", IEEE Transactions on Computer, 39(4), April 1990, pp. 460-469.

Bob Janssens, W. Kent Fuchs. "Relaxing Consistency in Recoverable Distributed Shared Memory", Proc. of International Symposium on Fault-Tolerant Computing Systems — FTCS'93, June 1993, pp. 155-163.

Golden G. Richard III, Mukesh Singhal. "Using Logging and Asynchronous Checkpoint to Implement Recoverable Distributed Shared Memory", Proc. of 12th Symposium on Reliable Distributed Systems, 1993, pp. 58-67.

dedicated checkpoint-recovery protocols with dependency tracking of coherency protocol messages

Main references

Takeshi Fuchi, Mario Tokoro. "A Mechanism for Recoverable Shared Virtual Memory", Manuscript University of Tokio, 1994.

Taesoon Park, Sungbok Cho, Y. Yeom. "An Efficient Logging Scheme for Recoverable Distributed Shared Memory Systems", International Conference on Distributed Computing Systems, 1997.

Bob Janssens, W. Kent Fuchs. "Reducing Interprocessor Dependence in Recoverable Distributed Shared Memory", Proc. of 13th Symposium on Reliable Distributed Systems, 1994, pp.34-41.

integration with coherency protocols

Main references

Anne-Marie Kermarrec, Gilbert Cabillic, Alain Gefflaut, Christine Morin, Isabelle Puant. "A Recoverable Distributed Shared Memory Integrating Coherence and Recoverability", IRISA — Publication Interne no. 897, 1995.

Christine Morin, Anne-Marie Kermarrec, Michel Banâtre. "An Efficient and Scalable Approach for Implementing Fault Tolerant DSM Architectures", INRIA — Raport de recherche no. 3103, 1997.

Recommended additional bibliography

Christine Morin, Isabelle Puant. "A Survey of Recoverable Distributed Shared Memory Systems", IRISA —Publication Interne no. 975, 1995.

William Dieter James Lumpp. "Fault Recovery for Distributed Shared Memory Systems", Proceedings of the IEEE-Aerospace Conference, 1997.

Some of my publications on this subject:

Jerzy Brzeziński, Michał Szychowiak: Reliability of Distributed Shared Memory Systems, Proceedings of the European Conference on Research and development for Information Society — ISTHmus 2000.

Jerzy Brzeziński, Michał Szychowiak: Fast and Low Cost Recovery Techniques for Distributed Shared Memory, Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA`02), Las Vegas, Nevada, June, 2002.

Jerzy Brzeziński, Michał Szychowiak: Replication of Checkpoints in Recoverable DSM Systems, Proceedings of the 21st Parallel and Distributed Computing and Networks (PDCN 2003), February 2003.

Jerzy Brzeziński, Michał Szychowiak: An Extended Home-Based Coherence Protocol for Causally Consistent Replicated Read-Write Objects, Proceedings of the International Workshop on Distributed Shared Memory on Clusters — DSM 2003 in Proceedings of the IEEE Symposium on Cluster Computing and the Grid (CCGrid 2003), Tokyo, Japan, May 2003, pp. 510-515.

Process and Object Replication for High-Availability

Consensus / Failure Detectors

Recommended basic bibliography

T.D. Chandra, S. Toueg. "Unreliable Failure Detectors for Reliable Distributed Systems", Proc. 10th ACM Symp. Principles of Distributed Computing, ACM Press, 1991, pp.325-340.

Reliable Group Communication

Recommended basic bibliography

Xavier Défago, André Schiper, Péter Urbán "Totally Ordered Broadcast and Multicast Algorithms: A Comprehensive Survey." Technical Report DSC/2000/036, École Polytechnique Fédérale de Lausanne, Switzerland, September 2000.

Rachid Guerraoui. "Revisiting the relationship between Non Blocking Atomic Commitment and Consensus problems", International Workshop on Distributed Algorithms (WDAG-95), Springer Verlag (LNCS 972), 1995.

Self-Stabilization

Self-stabilization of a system guarantees that regardless of its current state, the system will converge to a legal state in a finite number of steps.

Recommended basic bibliography

Jerzy Brzeziński, Michał Szychowiak: Self-Stabilization in Distributed Systems—a Short Survey, Foundations of Computing and Decision Sciences, 25(1), 2000.

Mitchell Flatebo, Ajoy Kumar Datta, Sukumar Ghosh: Self-Stabilization in Distributed Systems, in Advances in Distributed Computing: Concepts and Design, IEEE Computer Society Press, T.L. Casavant and M. Singhal eds., 1992.

Ted Herman: Probabilistic Self-Stabilization, Information Processing Letters, vol. 35, 1990, pp. 63-67.

Acces to locally cached copies of the papers is restricted.