Big Data and Distributed Computing

Consistent snapshots of global state

dr inż. Arkadiusz Danilecki

based on lectures by prof. J. Brzeziński

The outline

Definitions
Why do we want to do it?
Why is it difficult?
Stop-and-sync
Lamport algorithm
Lai-Yang algorithm

The notation

$\Sigma^i, \Sigma(\tau)$ - The $i$th global state, the global state of the system at the time $\tau$

$P_i$ - $i$th process

$S_i$ - the state of the $i$th process

$a, b, c ...$ - the events

${\bf a}\mapsto {\bf b}\iff { \begin{cases} {\mbox{1) }} {\bf a} {\mbox{ and }} {\bf b} {\mbox{ are events in the same process and }} {\bf a} {\mbox{ precedes }} {\bf b} {\mbox{, OR}}\\ {\mbox{2) }} {\bf a} {\mbox{ is the sending of a message }}M{\mbox{ and }} {\bf b} {\mbox{ is the receipt of }} M {\mbox{, OR}}\\ {\mbox{3) there is a sequence of events }} {\bf a}, \ldots, x, y, \ldots, {\bf b} {\mbox{ such that for each pair of consecutive events }} x, y \\{\mbox{ in this sequence, case 1) or 2) above holds.}}\end{cases}} $
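The three cases of the definition can be sketched in code as direct edges (cases 1 and 2) plus a transitive closure (case 3). This is only an illustration: the function name `happened_before`, the event names and the example message pattern are assumptions made for this sketch, not part of the lecture.

```python
from itertools import product

def happened_before(process_events, messages):
    """Return the set of pairs (a, b) such that a happened before b.

    process_events: dict mapping a process name to its events in local order.
    messages: list of (send_event, receive_event) pairs.
    """
    # Case 2): the sending of a message precedes its receipt.
    closure = set(messages)
    # Case 1): consecutive events of the same process.
    for events in process_events.values():
        closure.update(zip(events, events[1:]))
    # Case 3): transitive closure (Warshall style, k is the outermost index).
    all_events = [e for events in process_events.values() for e in events]
    for k, a, b in product(all_events, repeat=3):
        if (a, k) in closure and (k, b) in closure:
            closure.add((a, b))
    return closure

# Hypothetical execution: a1 is the sending of a message received as b2,
# and b3 is the sending of a message received as a3.
hb = happened_before(
    {"P1": ["a1", "a2", "a3"], "P2": ["b1", "b2", "b3"]},
    [("a1", "b2"), ("b3", "a3")],
)
```

Here `("b1", "a3")` ends up in `hb` only through case 3), while `a2` and `b2` are not related in either direction, so they are concurrent.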

What do we want to achieve?

Gather the state of all processes in the system, creating a "snapshot" of a computation. We want to ensure the state will be useful (in an informal sense) or, more formally, consistent.

Why do we want to do it?

The post-mortem execution analysis and debugging
Checking what's happening in the system
System recovery using checkpoints

Unfortunately, creating a consistent state snapshot is difficult!

The first (naïve) approach

The initiator sends everyone a message asking them to record their state at a specific time
After getting the message, each process notes when to record its state

The second (still naïve) approach

When choosing the time at which to record the state, the initiator takes into account the maximum communication delays

The initiator sends everyone a message asking them to record their state at a specific time

After getting the message, each process notes when to record its state

The third approach (you guessed it, it's still naïve)

The initiator determines the time to record the state, taking into account both the communication delays and the current clock readings of all other processes
The rest is as before

Still wrong if the processes' clocks run at different speeds
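A small numeric illustration of why clock drift breaks this approach. All numbers are made up for the example: two clocks agree at real time 0, one of them runs 1% fast, and a single message `M` happens to be sent in the gap between the two recording moments.

```python
def real_time_of_recording(target_local_time, offset, rate):
    """Real time at which a clock local(t) = offset + rate * t shows the target."""
    return (target_local_time - offset) / rate

# Hypothetical clocks: both read 0 at real time 0, but the clock of P2 runs 1% fast.
t1 = real_time_of_recording(100.0, offset=0.0, rate=1.0)    # P1 records at 100.0
t2 = real_time_of_recording(100.0, offset=0.0, rate=1.01)   # P2 records slightly earlier
skew = t1 - t2

# Hypothetical message M from P2 to P1, sent and received inside the gap.
send_time, receive_time = 99.2, 99.5
send_recorded = send_time <= t2          # False: P2 had already recorded its state
receive_recorded = receive_time <= t1    # True: P1 has not recorded its state yet
orphan = receive_recorded and not send_recorded
```

With these values `orphan` is true: the state of P1 contains the receipt of `M`, while the state of P2 does not contain its sending.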

Wait a minute! But do we even need a state from a specific time?

The fourth approach, where we play the "let's assume" game

Let's assume a perfectly synchronous system

The initiator sends everyone a message asking them to record their state at a specific time
After getting the message, each process notes when to record its state

Haven't we forgotten about something?

Dealing with the problem: two possibilities

We record the history of communication, i.e. all sent and received messages

If $P_i$ recorded that message $M$ was sent to $P_j$, and $P_j$ has not recorded the event of receiving $M$, then the message is still in transit (in the channel)

We record the channels' states

By sending special messages of the kind "I've already counted the students, so take that into account", which would work nicely with FIFO students
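The first possibility (recording the history of communication) can be sketched as follows. The function name `channel_states`, the log layout and the assumption that every message carries a unique identifier are choices made for this illustration only.

```python
def channel_states(sent_logs, received_logs):
    """Return {(sender, receiver): [messages still in transit]}.

    sent_logs[p]     : list of (destination, message) pairs recorded by process p
    received_logs[p] : list of (source, message) pairs recorded by process p
    Assumes that message identifiers are unique.
    """
    in_transit = {}
    for sender, log in sent_logs.items():
        for receiver, message in log:
            # Recorded as sent but not recorded as received: still in the channel.
            if (sender, message) not in received_logs.get(receiver, []):
                in_transit.setdefault((sender, receiver), []).append(message)
    return in_transit

# Hypothetical recorded histories of two processes.
sent = {"P1": [("P2", "m1"), ("P2", "m2")], "P2": [("P1", "m3")]}
received = {"P1": [("P2", "m3")], "P2": [("P1", "m1")]}
channels = channel_states(sent, received)
```

In this example only `m2` was recorded as sent without being recorded as received, so it is the whole recorded state of the channel from `P1` to `P2`.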

Informal reasoning

In the states recorded in two nodes, we have the same student

The student left $P_j$ after the state was recorded - we missed the fact the student left afterwards

The student arrived at $P_i$ before the state was recorded - so the saved state had "seen" the arrival

Informal conclusion

The student left $P_j$ after the state was recorded - we missed the fact the student left afterwards

We recorded a state $S_j^l$ before the event $a$ of sending $M$

The student arrived at $P_i$ before the state was recorded - so the saved state had "seen" the arrival

In state $S_i^k$ we recorded the event $b$ of receiving the message $M$

In terms of the happened-before relation defined earlier, what is the relation between events $a$ and $b$?

In other words - we recorded a state containing the event $b$, but not the event $a$, which causally precedes $b$

Time for more formal specifications!

Cut and consistent cut

The cut ${\mathit C}$ of a set of events $\mathcal{E}$ is a set ${\mathit C}\subseteq \mathcal{E}$ that contains at least one event from each process. A consistent cut is a cut where $$(a\in {\mathit C}\land b\mapsto a)\Rightarrow (b\in {\mathit C})$$

If an event $a$ belongs to a consistent cut, all events causally preceding $a$ also must belong to the cut.
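The consistency condition translates almost literally into a check over the happened-before pairs. The sketch below assumes the relation is given explicitly as a set of pairs; the event names are hypothetical, and the requirement of at least one event per process is not checked here.

```python
def is_consistent(cut, hb):
    """cut: set of events; hb: set of (b, a) pairs meaning that b happened before a."""
    # For every a in the cut, each b with b -> a must be in the cut as well.
    return all(b in cut for (b, a) in hb if a in cut)

# Hypothetical execution: P_j performs s0 and then sends M,
# P_i performs r0 and then receives M.
hb = {("s0", "send_M"), ("r0", "recv_M"), ("send_M", "recv_M"), ("s0", "recv_M")}

in_transit_cut = {"s0", "send_M", "r0"}   # M was sent but not yet received
orphan_cut = {"s0", "r0", "recv_M"}       # M was received but never sent

consistent = is_consistent(in_transit_cut, hb)
inconsistent = not is_consistent(orphan_cut, hb)
```

The first cut is consistent (the message is simply still in the channel), while the second one contains the receipt of `M` without its sending.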

Configuration

The configuration ${\boldsymbol {\mathit {\Gamma }}}$ is a vector of local states $\left\langle S_{1}^{k_1},S_{2}^{k_2},\cdots ,S_{n}^{k_n}\right\rangle $ of all processes $P_{1},P_{2},\ldots ,P_{n}$, such that for all $u$, $1\leq u\leq n$, $S_{u}^{k_u}\in {\mathcal {S}}_{u}$.

Cuts and configurations are equivalent, assuming that given a sequence of events, always the same final state will be reached.
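One direction of this equivalence can be sketched by replaying the events of a cut. The sketch assumes deterministic processes, a cut that contains a prefix of each local history, and a hypothetical `apply_event` function; in the example the events simply add a number to an integer state.

```python
def configuration_from_cut(initial_states, process_events, cut, apply_event):
    """Replay, for each process, the events of the cut on top of its initial state."""
    configuration = {}
    for process, events in process_events.items():
        state = initial_states[process]
        for event in events:
            if event not in cut:
                break              # the cut holds a prefix of each local history
            state = apply_event(state, event)
        configuration[process] = state
    return configuration

# Hypothetical events of the form (name, change of an integer state).
process_events = {"P1": [("a1", -5), ("a2", +2)], "P2": [("b1", +5)]}
initial_states = {"P1": 10, "P2": 0}
cut = {("a1", -5), ("b1", +5)}
configuration = configuration_from_cut(
    initial_states, process_events, cut, lambda state, event: state + event[1])
```

Replaying the same cut always gives the same vector of local states, which is exactly the determinism assumption made above.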

The graphical representation

Consistent cut and a global state

Each consistent cut is equivalent to a global state in some theoretically possible execution of a program.

Later and earlier cuts

We say that the cut ${\mathit C}$ is later than some cut ${\mathit D}$ if ${\mathit D}\subseteq {\mathit C}$. In other words, $\mathit C$ contains all events from $\mathit D$, and $\mathit D$ does not contain events which are not in $\mathit C$. The cut $\mathit D$ is then earlier than $\mathit C$.

Cuts more informally

Minimal cut

Why we do it - checkpoints and the state recovery

Obviously, for a given application we can always develop a tailored solution that recovers from a set of checkpoints representing a theoretically inconsistent cut.

Stop-and-sync

Idea - stop all processes and then record their states

... obviously, we must also ensure that the communication channels are empty.

    message stop, ready, save, saved, continue

    local bool $start_i := \mbox{false}$
    local bool $flushed_i[n] := \mbox{false}$
    local bool $ready_{\alpha}[n] := \mbox{false}$
    local set of messages $log_i := \emptyset$

    when $P_{\alpha}$ wants to take a snapshot
        suspend application
        save local state
        $start_{\alpha}$ := true
        $flushed_{\alpha}[\alpha]$ := true
        broadcast $stop$ to all $P_{j\neq\alpha}\in\mathcal{P}$
    end when

    when a message $stop$ arrives at $P_i$ from $P_j$ do
        if not $start_i$ then
            suspend application
            save local state
            $start_i$ := true
            $flushed_i[i]$ := true
            broadcast $stop$ to all $P_{k\neq i}\in\mathcal{P}$
        end if

        $flushed_i[j]$ := true
        if $\forall k: flushed_i[k] == \mbox{true}$ then
            save $log_i$
            send $ready$ to $P_{\alpha}$
        end if
    end when

    when an application message $m$ arrives at $P_i$ from $P_j$ do
        if $start_i$ and not $flushed_i[j]$ then
            $log_i$ := $log_i \cup \{m\}$
        else if $start_i$ then
            delay $m$
        else
            deliver $m$
        end if
    end when

    when a message $ready$ arrives at $P_{\alpha}$ from $P_j$ do
        $ready_{\alpha}[j] := \mbox{true}$
        if $\forall k: ready_{\alpha}[k] == \mbox{true}$ then
            save $log_{\alpha}$
            broadcast $continue$ to all $P_k\in\mathcal{P}$ including itself
        end if
    end when

    when a message $continue$ arrives at $P_i$ do
        resume application
        $start_i := \mbox{false}$
        deliver delayed messages
    end when
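The handlers above can be exercised in a small single-threaded simulation. This is a sketch under stated assumptions rather than a reference implementation: the application just passes tokens around, channels are FIFO queues served in random order, only one snapshot is taken, a message a process sends to itself is handled immediately, and logged messages are handed to the application on resume (the pseudocode leaves that last point implicit). All class and function names are made up for this example.

```python
import random
from collections import deque

class Process:
    def __init__(self, pid, n, tokens):
        self.pid, self.tokens = pid, tokens
        self.start = False               # start_i: a snapshot is in progress
        self.flushed = [False] * n       # flushed_i[j]: stop received from P_j
        self.ready = [False] * n         # ready_alpha[j]: used by the initiator only
        self.log, self.delayed = [], []  # log_i and the messages held back
        self.saved_state = self.saved_log = None

def simulate(n=4, initiator=0, steps=400, seed=1):
    rng = random.Random(seed)
    procs = [Process(i, n, tokens=10) for i in range(n)]
    # One reliable FIFO channel per ordered pair of distinct processes.
    chan = {(i, j): deque() for i in range(n) for j in range(n) if i != j}

    def send(src, dst, kind, payload=None):
        if src == dst:                   # a message to itself is handled at once
            handle(procs[dst], src, kind, payload)
        else:
            chan[(src, dst)].append((kind, payload))

    def begin_snapshot(p):
        p.start = True                   # suspends the application part of p
        p.saved_state = p.tokens
        p.flushed[p.pid] = True
        for dst in range(n):
            if dst != p.pid:
                send(p.pid, dst, "stop")

    def handle(p, src, kind, payload):
        if kind == "stop":
            if not p.start:
                begin_snapshot(p)
            p.flushed[src] = True
            if all(p.flushed):
                p.saved_log = list(p.log)
                send(p.pid, initiator, "ready")
        elif kind == "ready":
            p.ready[src] = True
            if all(p.ready):
                for dst in range(n):
                    send(p.pid, dst, "continue")
        elif kind == "continue":
            # Assumption: logged messages reach the application on resume too.
            p.tokens += sum(p.log) + sum(p.delayed)
            p.log, p.delayed, p.start = [], [], False
        elif p.start and not p.flushed[src]:
            p.log.append(payload)        # sent before P_src recorded its state
        elif p.start:
            p.delayed.append(payload)    # sent after P_src resumed
        else:
            p.tokens += payload          # normal delivery

    def deliver_one():
        busy = [c for c in chan if chan[c]]
        if busy:
            src, dst = rng.choice(busy)
            handle(procs[dst], src, *chan[(src, dst)].popleft())
        return bool(busy)

    for step in range(steps):
        if step == steps // 4:
            begin_snapshot(procs[initiator])
        p = rng.choice(procs)
        if not p.start and p.tokens > 0 and rng.random() < 0.5:
            dst = rng.choice([j for j in range(n) if j != p.pid])
            p.tokens -= 1
            send(p.pid, dst, "app", 1)
        else:
            deliver_one()
    while deliver_one():                 # drain every channel
        pass
    return procs

procs = simulate()
recorded = sum(p.saved_state + sum(p.saved_log) for p in procs)
```

Because tokens are only moved around, the recorded local states plus the logged in-transit messages should add up to the initial total of 40 tokens, which gives a simple sanity check of the recorded snapshot.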

Stop-and-sync: The algorithm correctness

Progress?

All processes will eventually record their states and communication logs, and all will resume their execution.

Safety?

The recorded states will form a consistent cut (actually, a consistent configuration)

If $P_i$ has recorded the event of receiving $M$ from $P_j$, then $P_j$ has recorded the event of sending $M$

Assumptions?

We will assume reliable FIFO communication channels and no node crashes

Stop-and-sync: Progress

The initiator starts by sending a special message $stop$ to all processes. Since channels are reliable and processes do not crash, that message must eventually reach all processes.

According to the algorithm, each process must then send the same message to all other processes.

Since we assumed reliable channels, those $stop$ messages will reach their destinations, so eventually $\forall k: flushed_i[k] == \mbox{true}$ for each $P_i$

Hence, all processes will eventually record their state and send $ready$ to the initiator.

Since channels are reliable, the initiator will get $ready$ from all processes. This will cause the initiator to save its log and send $continue$, and finally to resume execution.

As before, $continue$ will reach its destinations, so all other processes will resume their execution too. QED.

Stop-and-sync: Safety

We would have an inconsistent cut (actually a configuration, but let's leave that aside) iff we included an event of receiving some message $M$ without including the event of sending $M$

To get an inconsistent cut, $P_i$ would have to record its state after receiving a message $M$ that $P_j$ sent after recording its own state

But after recording its state, process $P_j$ won't send new messages until it resumes execution after receiving $continue$

The initiator will send $continue$ only after it gets $ready$ from all processes.

A process can send $ready$ only when $\forall k: flushed_i[k] == \mbox{true}$ (and it records its state earlier)

This condition is true only if $P_i$ has received $stop$ from every other process

A process may send $stop$ only after recording its state

So the initiator will send $continue$ only when all processes have already recorded their states

No new message $M$ is sent until all processes have recorded their states

Hence, it's impossible to include an event of receiving $M$ without including an event of sending $M$
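The safety property can also be checked mechanically on recorded data: every recorded receive must have a matching recorded send. The sketch below assumes that recorded events are available as sets of hypothetical `(sender, receiver, message_id)` triples.

```python
def orphan_messages(recorded_sends, recorded_receives):
    """Return the receives whose matching send is missing from the snapshot."""
    return sorted(recorded_receives - recorded_sends)

# Hypothetical recorded events: m2 was recorded as received but never as sent.
sends = {("Pj", "Pi", "m1")}
receives = {("Pj", "Pi", "m1"), ("Pj", "Pi", "m2")}
orphans = orphan_messages(sends, receives)
```

An empty result means the recorded states satisfy the condition argued above; here the check reports `m2` as an orphan.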