Research activities

Rollback recovery in Distributed Systems

Backward error recovery (BER) is a well-known general fault-tolerance technique. BER consists of periodically saving the current state of the computation (a checkpoint) to stable storage (assumed never to fail) and restoring the saved state when recovering from a failure. In a distributed system, a global checkpoint is composed of the local checkpoints of all processes in the system. Local checkpoints can be saved in a coordinated way (synchronous checkpointing), in which all processes cooperate to save checkpoints that together represent a consistent global state of the system. This approach requires synchronization among all cooperating nodes, which introduces time and communication overhead. Another approach is to save local checkpoints independently, without any cooperation between nodes (asynchronous checkpointing). This approach imposes minimal overhead on checkpointing but suffers from the domino effect during recovery: rolling one process back can force other processes to roll back in turn, possibly all the way to the initial state. To reduce the cost of recovery and eliminate the domino effect, message logging and dependency tracking can be used.
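
The difference between coordinated and independent checkpointing can be illustrated with a minimal Python sketch. It assumes a simplified model that is not taken from any particular system: each process has a sorted list of local checkpoint times starting at 0 (the initial state), and each message is a tuple (sender, send_time, receiver, recv_time) of local event times. The names is_consistent and recovery_line are hypothetical. A set of local checkpoints is treated as consistent when it contains no orphan message, that is, a message recorded as received but not as sent.

```python
from bisect import bisect_left


def is_consistent(cut, messages):
    """cut[p] is the local time of the checkpoint chosen for process p.

    The cut is inconsistent if some message was sent after the sender's
    checkpoint but received before (or at) the receiver's checkpoint.
    """
    return not any(
        send_t > cut[s] and recv_t <= cut[r]
        for s, send_t, r, recv_t in messages
    )


def recovery_line(checkpoints, messages):
    """Return the most recent consistent set of checkpoints.

    checkpoints[p] is a sorted list of local checkpoint times for process p
    and must start with 0 (assumption: the initial state is always saved).
    Event times are assumed to be positive integers.
    """
    cut = [cps[-1] for cps in checkpoints]  # start from the latest checkpoints
    changed = True
    while changed:
        changed = False
        for s, send_t, r, recv_t in messages:
            if send_t > cut[s] and recv_t <= cut[r]:
                # Orphan message: roll the receiver back to its latest
                # checkpoint taken strictly before the receive event.
                idx = bisect_left(checkpoints[r], recv_t) - 1
                cut[r] = checkpoints[r][idx]
                changed = True
    return cut


# Two processes exchanging messages in a ping-pong pattern.
messages = [
    (1, 1, 0, 1),  # P1 sends at local time 1, P0 receives at local time 1
    (0, 3, 1, 3),
    (1, 5, 0, 5),
    (0, 7, 1, 7),
]

# Independent checkpoints: every checkpoint is crossed by a message.
independent = [[0, 2, 6], [0, 4, 8]]
# Coordinated checkpoints: both processes checkpoint between two exchanges.
coordinated = [[0, 4], [0, 4]]

domino_result = recovery_line(independent, messages)
coordinated_result = recovery_line(coordinated, messages)
```

In this example the independent checkpoints invalidate one another in turn, so domino_result is [0, 0] (both processes restart from the initial state), while coordinated_result stays at [4, 4]. The sketch only detects orphan messages; messages that are in transit across the recovery line are the ones that message logging would have to replay.
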
Reliable DSM Systems

My initial research interests focused on distributed shared memory (DSM) systems that provide recoverability of shared objects. There are three main approaches:
  1. adaptation of message-passing checkpoint-recovery techniques (coordinated/independent checkpointing)
  2. dedicated checkpoint-recovery protocols with dependency tracking of coherency protocol messages
  3. integration with coherency protocols
      Christine Morin, Anne-Marie Kermarrec, Michel Banâtre. "An Efficient and Scalable Approach for Implementing Fault Tolerant DSM Architectures", INRIA - Rapport de recherche no. 3103, 1997.

Some of my publications on this subject:

Jerzy Brzeziński, Michał Szychowiak: Reliability of Distributed Shared Memory Systems, Proceedings of the European Conference on Research and development for Information Society - ISTHmus 2000.
Jerzy Brzeziński, Michał Szychowiak: Fast and Low Cost Recovery Techniques for Distributed Shared Memory, Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'02), Las Vegas, Nevada, June, 2002.
Jerzy Brzeziński, Michał Szychowiak: Replication of Checkpoints in Recoverable DSM Systems, Proceedings of the 21st Parallel and Distributed Computing and Networks (PDCN 2003), February 2003.
Jerzy Brzeziński, Michał Szychowiak: An Extended Home-Based Coherence Protocol for Causally Consistent Replicated Read-Write Objects, Proceedings of the International Workshop on Distributed Shared Memory on Clusters - DSM 2003 in Proceedings of the IEEE Symposium on Cluster Computing and the Grid (CCGrid 2003), Tokyo, Japan, May 2003, pp. 510-515.

Process and Object Replication for High-Availability

Consensus / Failure Detectors

Reliable Group Communication

Self-stabilization guarantees that, regardless of its current state, a system will converge to a legal state in a finite number of steps.
    Jerzy Brzeziński, Michał Szychowiak: Self-Stabilization in Distributed Systems - a Short Survey, Foundations of Computing and Decision Sciences, 25(1), 2000.
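
As an illustration of this convergence property, the sketch below simulates Dijkstra's K-state token ring, a classic textbook example of a self-stabilizing algorithm (it is chosen here for illustration and is not taken from the survey above). The sketch assumes K >= n for a ring of n processes and a scheduler that lets one privileged process move at a time; the function names privileged, move and stabilize are hypothetical. A state is legal when exactly one process holds the privilege (the token).

```python
import random


def privileged(x):
    """Return the indices of the processes that currently hold a privilege."""
    n = len(x)
    result = [0] if x[0] == x[-1] else []   # process 0 compares with the last one
    result += [i for i in range(1, n) if x[i] != x[i - 1]]
    return result


def move(x, K, i):
    """Let privileged process i make its move (modifies x in place)."""
    if i == 0:
        x[0] = (x[0] + 1) % K               # process 0 advances its counter
    else:
        x[i] = x[i - 1]                     # others copy their left neighbour


def stabilize(x, K, rng, max_steps=10_000):
    """Run moves chosen at random until exactly one privilege remains.

    Returns the number of moves made. max_steps is only a safety limit
    for this sketch.
    """
    steps = 0
    while len(privileged(x)) != 1:
        move(x, K, rng.choice(privileged(x)))
        steps += 1
        if steps > max_steps:
            raise RuntimeError("did not stabilize within max_steps")
    return steps


# Start from an arbitrary (possibly illegal) state of a 5-process ring.
rng = random.Random(0)
state = [rng.randrange(6) for _ in range(5)]
stabilize(state, 6, rng)
```

After stabilize returns, exactly one process is privileged, and every further move passes that single privilege to the next process in the ring, so the system stays in a legal state.
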
