Research activities
Rollback recovery in Distributed Systems
A backward error recovery (BER) is a well know general tool of fault-tolerance.
BER consists in saving periodically the current state of computation (a checkpoint) on a stable storage (supposed to never fail), and restoring back the saved state when recovering from a failure.
In a distributed system a global checkpoint is composed of local checkpoints of all processes in the system.
Local checkpoints can be saved in a coordinated way (synchronous checkpointing) — when all processes cooperate on saving checkpoints that represents a consistent global state of the system.
This approach requires a synchronization between all cooperating nodes which introduce the time and communication overhead.
Another approach is to save local checkpoints independently, without any cooperation of nodes (asynchronous checkpointing).
This approach offers a minimal overhead on a checkpointing process, but suffers from a domino effect during a recovery procedure.
To reduce the cost of recovery and eliminate the domino effect, a message logging and dependency tracking can be used.
-
bookRecommended basic bibliography
R. Koo, S. Toueg, "Checkpointing and Rollback-Recovery for Distributed Systems", IEEE Transactions on Software Engineering, vol.13 no.1, pp.23-31, 1987.
Y.M. Wang, W. K. Fuchs, "Optimistic Message Logging for Independent Checkpointing in Message-Passing Systems", Proc. of Symposium on Reliable Distributed Systems — SRDS'92, pp.147-154, 1992.
E.N. Elnozahy, W. Zwaenepoel. "Manetho, Transparent Rollback-Recovery with Low Overhead, Limited Rollback and Fast Output Commit", IEEE Transactions on Computers, vol.41, no.5, pp.526-531, 1992.
E.N. Elnozahy, D.B. Johnson, Y.M. Wang. "A Survey of Rollback-Recovery Protocols in Message-Passing Systems", Cornegie Mellon University technical report CMU-CS-96-181, 1996.
Reliable DSM Systems
My initial research interests tended to DSM systems providing recoverability of shared objects. There are 3 main approaches:
- adaptation of message-passing checkpoint-recovery techniques (coordinated/independent checkpointing)
-
bookMain references
K.-L. Wu and W. K. Fuchs. "Recoverable Distributed Shared Memory", IEEE Transactions on Computer, 39(4), April 1990, pp. 460-469.
Bob Janssens, W. Kent Fuchs. "Relaxing Consistency in Recoverable Distributed Shared Memory", Proc. of International Symposium on Fault-Tolerant Computing Systems — FTCS'93, June 1993, pp. 155-163.
Golden G. Richard III, Mukesh Singhal. "Using Logging and Asynchronous Checkpoint to Implement Recoverable Distributed Shared Memory", Proc. of 12th Symposium on Reliable Distributed Systems, 1993, pp. 58-67.
- dedicated checkpoint-recovery protocols with dependency tracking of coherency protocol messages
-
bookMain references
Takeshi Fuchi, Mario Tokoro. "A Mechanism for Recoverable Shared Virtual Memory", Manuscript University of Tokio, 1994.
Taesoon Park, Sungbok Cho, Y. Yeom. "An Efficient Logging Scheme for Recoverable Distributed Shared Memory Systems", International Conference on Distributed Computing Systems, 1997.
Bob Janssens, W. Kent Fuchs. "Reducing Interprocessor Dependence in Recoverable Distributed Shared Memory", Proc. of 13th Symposium on Reliable Distributed Systems, 1994, pp.34-41.
- integration with coherency protocols
-
bookMain references
Anne-Marie Kermarrec, Gilbert Cabillic, Alain Gefflaut, Christine Morin, Isabelle Puant. "A Recoverable Distributed Shared Memory Integrating Coherence and Recoverability", IRISA — Publication Interne no. 897, 1995.
Christine Morin, Anne-Marie Kermarrec, Michel Banâtre. "An Efficient and Scalable Approach for Implementing Fault Tolerant DSM Architectures", INRIA — Raport de recherche no. 3103, 1997.
-
bookRecommended additional bibliography
Some of my publications on this subject:
-
Jerzy Brzeziński, Michał Szychowiak: Reliability of Distributed Shared Memory Systems, Proceedings of the European Conference on Research and development for Information Society — ISTHmus 2000.
-
Jerzy Brzeziński, Michał Szychowiak: Fast and Low Cost Recovery Techniques for Distributed Shared Memory, Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA`02), Las Vegas, Nevada, June, 2002.
-
Jerzy Brzeziński, Michał Szychowiak: Replication of Checkpoints in Recoverable DSM Systems, Proceedings of the 21st Parallel and Distributed Computing and Networks (PDCN 2003), February 2003.
-
Jerzy Brzeziński, Michał Szychowiak: An Extended Home-Based Coherence Protocol for Causally Consistent Replicated Read-Write Objects, Proceedings of the International Workshop on Distributed Shared Memory on Clusters — DSM 2003 in Proceedings of the IEEE Symposium on Cluster Computing and the Grid (CCGrid 2003), Tokyo, Japan, May 2003, pp. 510-515.
Process and Object Replication for High-Availability
Consensus / Failure Detectors
-
bookRecommended basic bibliography
Reliable Group Communication
-
bookRecommended basic bibliography
-
Xavier Défago, André Schiper, Péter Urbán "Totally Ordered Broadcast and Multicast Algorithms: A Comprehensive Survey." Technical Report DSC/2000/036, École Polytechnique Fédérale de Lausanne, Switzerland, September 2000.
-
Rachid Guerraoui. "Revisiting the relationship between Non Blocking Atomic Commitment and Consensus problems", International Workshop on Distributed Algorithms (WDAG-95), Springer Verlag (LNCS 972), 1995.
-
Self-Stabilization
Self-stabilization of a system guarantees that regardless of its current state, the system will converge to a legal state in a finite number of steps.
-
bookRecommended basic bibliography
-
Jerzy Brzeziński, Michał Szychowiak: Self-Stabilization in Distributed Systems—a Short Survey, Foundations of Computing and Decision Sciences, 25(1), 2000.
-
Mitchell Flatebo, Ajoy Kumar Datta, Sukumar Ghosh: Self-Stabilization in Distributed Systems, in Advances in Distributed Computing: Concepts and Design, IEEE Computer Society Press, T.L. Casavant and M. Singhal eds., 1992.
-
Ted Herman: Probabilistic Self-Stabilization, Information Processing Letters, vol. 35, 1990, pp. 63-67.
-