Research activities
Rollback recovery in Distributed Systems
A backward error recovery (BER) is a well know general tool of fault-tolerance.
BER consists in saving periodically the current state of computation (a checkpoint) on a stable storage (supposed to never fail), and restoring back the saved state when recovering from a failure.
In a distributed system a global checkpoint is composed of local checkpoints of all processes in the system.
Local checkpoints can be saved in a coordinated way (synchronous checkpointing) - when all processes cooperate on saving checkpoints that represents a consistent global state of the system.
This approach requires a synchronization between all cooperating nodes which introduce the time and communication overhead.
Another approach is to save local checkpoints independently, without any cooperation of nodes (asynchronous checkpointing).
This approach offers a minimal overhead on a checkpointing process, but suffers from a domino effect during a recovery procedure.
To reduce the cost of recovery and eliminate the domino effect, a message logging and dependency tracking can be used.
-
bookRecommended basic bibliography
- R. Koo, S. Toueg, "Checkpointing and Rollback-Recovery for Distributed Systems", IEEE Transactions on Software Engineering, vol.13 no.1, pp.23-31, 1987.
- Y.M. Wang, W. K. Fuchs, "Optimistic Message Logging for Independent Checkpointing in Message-Passing Systems", Proc. of Symposium on Reliable Distributed Systems - SRDS'92, pp.147-154, 1992.
- E.N. Elnozahy, W. Zwaenepoel. "Manetho, Transparent Rollback-Recovery with Low Overhead, Limited Rollback and Fast Output Commit", IEEE Transactions on Computers, vol.41, no.5, pp.526-531, 1992
- E.N. Elnozahy, D.B. Johnson, Y.M. Wang. "A Survey of Rollback-Recovery Protocols in Message-Passing Systems", Cornegie Mellon University technical report CMU-CS-96-181, 1996.
Reliable DSM Systems
My initial research interests tended to DSM systems providing recoverability of shared objects. There are 3 main approaches:
- adaptation of message-passing checkpoint-recovery techniques (coordinated/independent checkpointing)
-
bookMain references
- K.-L. Wu and W. K. Fuchs. "Recoverable Distributed Shared Memory", IEEE Transactions on Computer, 39(4), April 1990, pp. 460-469.
- Bob Janssens, W. Kent Fuchs. "Relaxing Consistency in Recoverable Distributed Shared Memory", Proc. of International Symposium on Fault-Tolerant Computing Systems - FTCS'93, June 1993, pp. 155-163.
- Golden G. Richard III, Mukesh Singhal. "Using Logging and Asynchronous Checkpoint to Implement Recoverable Distributed Shared Memory", Proc. of 12th Symposium on Reliable Distributed Systems, 1993, pp. 58-67.
- dedicated checkpoint-recovery protocols with dependency tracking of coherency protocol messages
-
bookMain references
- Takeshi Fuchi, Mario Tokoro. "A Mechanism for Recoverable Shared Virtual Memory", Manuscript University of Tokio, 1994.
- Taesoon Park, Sungbok Cho, Y. Yeom. "An Efficient Logging Scheme for Recoverable Distributed Shared Memory Systems", International Conference on Distributed Computing Systems, 1997.
- Bob Janssens, W. Kent Fuchs. "Reducing Interprocessor Dependence in Recoverable Distributed Shared Memory", Proc. of 13th Symposium on Reliable Distributed Systems, 1994, pp.34-41.
- integration with coherency protocols
-
bookMain references
- Anne-Marie Kermarrec, Gilbert Cabillic, Alain Gefflaut, Christine Morin, Isabelle Puant. "A Recoverable Distributed Shared Memory Integrating Coherence and Recoverability", IRISA - Publication Interne no. 897, 1995.
- Christine Morin, Anne-Marie Kermarrec, Michel Banâtre. "An Efficient and Scalable Approach for Implementing Fault Tolerant DSM Architectures", INRIA - Raport de recherche no. 3103, 1997.
-
bookRecommended additional bibliography
Some of my publications on this subject:
- Jerzy Brzeziński, Michał Szychowiak: Reliability of Distributed Shared Memory Systems, Proceedings of the European Conference on Research and development for Information Society - ISTHmus 2000.
- Jerzy Brzeziński, Michał Szychowiak: Fast and Low Cost Recovery Techniques for Distributed Shared Memory, Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA`02), Las Vegas, Nevada, June, 2002.
- Jerzy Brzeziński, Michał Szychowiak: Replication of Checkpoints in Recoverable DSM Systems, Proceedings of the 21st Parallel and Distributed Computing and Networks (PDCN 2003), February 2003.
- Jerzy Brzeziński, Michał Szychowiak: An Extended Home-Based Coherence Protocol for Causally Consistent Replicated Read-Write Objects, Proceedings of the International Workshop on Distributed Shared Memory on Clusters - DSM 2003 in Proceedings of the IEEE Symposium on Cluster Computing and the Grid (CCGrid 2003), Tokyo, Japan, May 2003, pp. 510-515.
Process and Object Replication for High-Availability
Consensus / Failure Detectors
-
bookRecommended basic bibliography
- T.D. Chandra, S. Toueg. "Unreliable Failure Detectors for Reliable Distributed Systems", Proc. 10th ACM Symp. Principles of Distributed Computing, ACM Press, 1991, pp.325-340.
Reliable Group Communication
-
bookRecommended basic bibliography
- Xavier Défago, André Schiper, Péter Urbán "Totally Ordered Broadcast and Multicast Algorithms: A Comprehensive Survey." Technical Report DSC/2000/036, École Polytechnique Fédérale de Lausanne, Switzerland, September 2000.
- Rachid Guerraoui. "Revisiting the relationship between Non Blocking Atomic Commitment and Consensus problems", International Workshop on Distributed Algorithms (WDAG-95), Springer Verlag (LNCS 972), 1995.
-
bookGroup communication systems
Self-Stabilization
Self-stabilization of a system guarantees that regardless of its current state, the system will converge to a legal state in a finite number of steps.-
bookRecommended basic bibliography
- Jerzy Brzeziński, Michał Szychowiak: Self-Stabilization in Distributed Systems - a Short Survey, Foundations of Computing and Decision Sciences, 25(1), 2000.
- Mitchell Flatebo, Ajoy Kumar Datta, Sukumar Ghosh: Self-Stabilization in Distributed Systems, in Advances in Distributed Computing: Concepts and Design, IEEE Computer Society Press, T.L. Casavant and M. Singhal eds., 1992.
- Ted Herman: Probabilistic Self-Stabilization, Information Processing Letters, vol. 35, 1990, pp. 63-67.
-