glados.no


title	description	date	math
Summary of TTK4145	A lot of theory and discussion, some formulas, spring 2021.	2021-05-04	true

Fault tolerance

Hard to capture faults.

Bugs

1 bug per 50 lines before testing
1 bug per 500 lines at release
1 bug per 550 lines after a year; from then on the rate stays roughly constant

Make the program work within specs.
Run/test the program
Errors happen
Locate errors
- Incomplete spec
- Missing handling of some situation
Fix code

Traditional error handling

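{% highlight c %}
#include <errno.h>
#include <stdio.h>

FILE *openConfigFile(void) {
    FILE *f = fopen("/path/to/config.conf", "r");  // fopen needs a mode argument
    if (f == NULL) {
        switch (errno) {
            case ENOMEM: {
                ...
                break;
            }
            case ENOTDIR: {
                ...
                break;
            }
            // Do this for all errors
        }
    }
    return f;  // NULL if the file could not be opened
}
{% endhighlight %}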

Causes of errors

Incomplete specification
Software bugs
HW problems
Communication problems

Fault tolerance in real time systems

The problem with traditional error handling is that errors can happen at any possible time. This is extremely hard to test.

Error handling in real-time programming has these characteristics:

Handling of unexpected errors
Multiple threads handle errors
Cannot test the conventional way
- Can only show the existence of errors
- Cannot find errors in the specification
- Cannot find race conditions

The fault path is shown in the figure below.

With fault tolerance the path looks more like the figure below.

Error handling

Keep it simple!

The error modes are part of the module interface.

One way is to handle all errors the same way. Handle each one as if it were the worst possible error: crash and start again.

A different approach is to check that everything is OK.

To test how the system responds to an unknown error, inject a failed acceptance test (a "not OK" signal).
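The "treat every error as the worst one" idea can be sketched as follows. This is a minimal illustration, not the course's implementation: `run_with_restart`, `AcceptanceTestFailed` and the flaky worker are hypothetical names, and the worker injects two failed acceptance tests before it behaves.

```python
class AcceptanceTestFailed(Exception):
    """The single merged error mode: something, anything, was not OK."""


def run_with_restart(work, accept, max_restarts=3):
    """Run work(); if it raises or its result fails accept(), start again."""
    for _ in range(max_restarts):
        try:
            result = work()
        except Exception:
            continue                 # any error: throw the attempt away
        if accept(result):
            return result            # acceptance test passed
        # a rejected result is handled exactly like a crash: restart
    raise AcceptanceTestFailed("gave up after repeated failures")


# Hypothetical worker with error injection: the first two results are bad.
calls = {"n": 0}


def flaky_sensor_average():
    calls["n"] += 1
    return -1.0 if calls["n"] < 3 else 21.5


value = run_with_restart(flaky_sensor_average, accept=lambda v: 0.0 <= v <= 100.0)
```

The caller never learns why an attempt failed, which is the point: there is only one error mode to handle and to test.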

Redundancy

If I have N copies of my data, it is possible to handle that one is destroyed.
Sending N messages, trying N times.

Static redundancy

N active copies. Sending N messages whether it is necessary or not.
Detecting errors is not important.
Handles cosmic rays easily.
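One common way to use N active copies without detecting which one is wrong is majority voting. The notes do not name a voting scheme, so the sketch below is an assumption; `majority_vote` is a hypothetical helper.

```python
from collections import Counter


def majority_vote(replies):
    """Return the value a strict majority of the copies agree on, else raise."""
    value, count = Counter(replies).most_common(1)[0]
    if count * 2 <= len(replies):
        raise ValueError("no majority among the copies")
    return value


# Three active copies; one has been changed by a simulated bit flip.
copies = [42, 42, 46]
voted = majority_vote(copies)
```

The faulty copy is simply outvoted; nobody has to notice that it was faulty.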

Dynamic redundancy

Relies on detecting the error and recovering
- Resend if timeout and not receiving "ack"
- Go with default if no messages have been received
The acceptance test must be good.
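The "go with default" case can be sketched with a bounded wait on a message queue. The function name, the timeout and the values are illustrative assumptions.

```python
import queue


def reference_or_default(inbox, default, timeout_s=0.05):
    """Wait a bounded time for a message; fall back to the default on timeout."""
    try:
        return inbox.get(timeout=timeout_s)
    except queue.Empty:
        return default


inbox = queue.Queue()
silent = reference_or_default(inbox, default=0.0)   # nothing arrived in time
inbox.put(0.75)
fresh = reference_or_default(inbox, default=0.0)    # a message was waiting
```

Here the timeout is the error detection, and the default value is the recovery.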

Fault model

Example with storage functions.

Step 1: Failure modes

Find the failure modes: What could go wrong?

Write: May return "I failed". Does not know why it failed.
Read: May return "I failed". Does not know why it failed.

Step 2: Detect, Simplify, Inject errors

Write information on where/what/how the process is doing.
All errors --> Fail
Inject errors

Step 3: Handling with redundancy

Have multiple copies of the information
- Use only the newest
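The three storage steps can be sketched in memory like this. It is only an illustration under assumptions: `Copy` and `RedundantStore` are hypothetical classes, a CRC32 checksum plus a version number is one possible choice of "information added on write", and errors are injected by tampering with the copies directly.

```python
import zlib


class Copy:
    """One storage location holding (version, payload, checksum)."""

    def __init__(self):
        self.record = None

    def write(self, version, payload):
        self.record = (version, payload, zlib.crc32(payload))

    def read(self):
        """Return (version, payload), or None for every kind of failure."""
        if self.record is None:
            return None
        version, payload, checksum = self.record
        if zlib.crc32(payload) != checksum:
            return None              # corruption is merged into "I failed"
        return version, payload


class RedundantStore:
    def __init__(self, n=3):
        self.copies = [Copy() for _ in range(n)]
        self.version = 0

    def write(self, payload):
        self.version += 1
        for c in self.copies:
            c.write(self.version, payload)

    def read(self):
        good = [r for r in (c.read() for c in self.copies) if r is not None]
        if not good:
            raise IOError("all copies failed")
        return max(good, key=lambda r: r[0])[1]   # use only the newest


store = RedundantStore()
store.write(b"floor=2")
store.write(b"floor=3")
# Error injection: corrupt one copy and leave another one stale.
version, _, checksum = store.copies[0].record
store.copies[0].record = (version, b"floor=9", checksum)
store.copies[1].write(1, b"floor=2")
latest = store.read()
```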

Example with communication function

Step 1: Failure modes

Message
- Lost
- Delayed
- Corrupted
- Duplicated
- Wrong recipient

Step 2: Detection, merging of error modes and error injection

Adding information to message
- Checksum
- Session ID
- Sequence number
Adding "ack" on well-received messages
All errors will be treated as "Lost message"
Injection
- Occasionally throw away some messages

Step 3: Handling with redundancy

Timeout
Retransmit message
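The communication example maps onto a small stop-and-wait simulation. Everything here is a sketch under assumptions: the packet layout, `DroppingChannel`, `Receiver` and `send_all` are hypothetical, there is no real network, and a missing ack stands in for the timeout. The channel deterministically drops transfers number 2 and 5 so the run is repeatable.

```python
import zlib


def make_packet(seq, payload):
    """Attach a sequence number and a checksum to the payload."""
    return {"seq": seq, "payload": payload, "crc": zlib.crc32(payload)}


class DroppingChannel:
    """Error injection: throws away the transfers whose number is in drop."""

    def __init__(self, drop):
        self.drop = set(drop)
        self.count = 0

    def transfer(self, item):
        self.count += 1
        return None if self.count in self.drop else item


class Receiver:
    def __init__(self):
        self.expected = 0
        self.delivered = []

    def receive(self, packet):
        """Return the seq to ack, or None. Every error looks like a lost message."""
        if packet is None or zlib.crc32(packet["payload"]) != packet["crc"]:
            return None
        if packet["seq"] > self.expected:
            return None
        if packet["seq"] == self.expected:
            self.delivered.append(packet["payload"])
            self.expected += 1
        # A duplicate (seq < expected) is acked again but not delivered twice.
        return packet["seq"]


def send_all(messages, channel, receiver, max_tries=5):
    """Retransmit each message until its ack arrives; return the send count."""
    sends = 0
    for seq, payload in enumerate(messages):
        for _ in range(max_tries):
            sends += 1
            ack = receiver.receive(channel.transfer(make_packet(seq, payload)))
            if ack is not None:
                ack = channel.transfer(ack)   # the ack can be lost as well
            if ack == seq:
                break                         # acked before the "timeout"
        else:
            raise TimeoutError(f"message {seq} was never acknowledged")
    return sends


messages = [b"order:1", b"order:2", b"order:3"]
receiver = Receiver()
sends = send_all(messages, DroppingChannel(drop={2, 5}), receiver)
```

With this drop pattern the first ack and the second data packet are lost, so the sender transmits five times for three messages, and the sequence number keeps the retransmitted first message from being delivered twice.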

Example with processes and calculations

A calculation is an abstract concept, so how can we talk generally about its failure modes?

Step 1: Failure modes

One failure mode

Step 2: Detect, simplify, inject errors

All failed acceptance tests will "PANIC" or "STOP".

Step 3: Handling with redundancy

There are three solutions:

Checkpoint restart
- Do all the work including the acceptance test
- Wait with the "side effects"
- Store a checkpoint
- Do the "side effects"
Process pairs
- Crash and let another process take over
Persistent processes
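A checkpoint-restart loop following the four sub-steps above could look like this. It is a sketch under assumptions: the state layout, the JSON checkpoint file, `step`, `acceptance_test` and the `crash_at` injection parameter are all hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write to a temporary file first, then rename, so a crash leaves the old one."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path, default):
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def step(state):
    """Hypothetical unit of work: new state plus a side effect that is held back."""
    new_state = {"total": state["total"] + 10, "steps": state["steps"] + 1}
    return new_state, f"display total={new_state['total']}"


def acceptance_test(state):
    return state["total"] == 10 * state["steps"]


def run(path, n_steps, effects, crash_at=None):
    state = load_checkpoint(path, {"total": 0, "steps": 0})
    while state["steps"] < n_steps:
        new_state, effect = step(state)          # do all the work
        injected = new_state["steps"] == crash_at
        if injected or not acceptance_test(new_state):
            raise RuntimeError("PANIC")          # single failure mode
        save_checkpoint(path, new_state)         # store a checkpoint
        effects.append(effect)                   # only now do the side effect
        state = new_state


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "checkpoint.json")
    effects = []
    try:
        run(path, 4, effects, crash_at=3)        # injected crash during step 3
    except RuntimeError:
        pass
    run(path, 4, effects)                        # restart from the checkpoint
    final = load_checkpoint(path, None)
```

After the restart, steps 1 and 2 are not repeated; the run continues from the stored state.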

Transactions

A transaction is a design framework for Damage Confinement and Error Recovery.

An atomic action, just with the backward recovery error mode as standard mode
Indivisible and instantaneous "calculation" seen from the outside
A transformation from one consistent state to another
A modular computation

Four features: ACID

Atomicity: Either all side effects happen or none
Consistency: Leaves the system in a consistent state when finished
Isolation: Errors do not spread
Durability: Results are not lost
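Atomicity, consistency and isolation can be illustrated by working on a private copy and committing only when a consistency check passes. This is a minimal in-memory sketch; `run_transaction`, the account data and the consistency rule are hypothetical, and durability is not covered since nothing is written to stable storage.

```python
import copy


class TransactionAborted(Exception):
    pass


def run_transaction(state, operation, is_consistent):
    """Apply operation to a private copy; commit only if the result is consistent."""
    working = copy.deepcopy(state)       # nobody sees half-done work
    try:
        operation(working)
    except Exception as exc:
        raise TransactionAborted(f"operation failed: {exc}") from exc
    if not is_consistent(working):
        raise TransactionAborted("consistency check failed")
    state.clear()
    state.update(working)                # commit: all side effects or none


accounts = {"alice": 100, "bob": 50}


def transfer(amount):
    def operation(acc):
        acc["alice"] -= amount
        acc["bob"] += amount
    return operation


def consistent(acc):
    return sum(acc.values()) == 150 and min(acc.values()) >= 0


run_transaction(accounts, transfer(30), consistent)
try:
    run_transaction(accounts, transfer(500), consistent)
    aborted = False
except TransactionAborted:
    aborted = True
```

The second transfer would overdraw the account, so it is aborted and the committed state from the first transfer is left untouched.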

Atomic Actions

Resumption vs. Termination mode

If we continue where we were (e.g. after the interrupt) --> Resumption
If we continue somewhere else (i.e. terminating what we were doing) --> Termination

Async Notification (AN) = Low level thread interaction

Async event handling. ("Signals") (resumption)
- Modeled after a HW interrupt
- Can be sent to the correct thread
- Can be handled, ignored, blocked --> The domain can be controlled.
- Often leads to polling
  - Could rather skip the signal and poll a status variable or a message queue
  - Useless
ATC --> Async transfer of Control (termination)
- Canceling threads
- setjmp/longjmp could convert signals to ATC (not really, but still)
- Ada: a structured mechanism for ATC is integrated with the select statement
- RT Java: a structured mechanism for ATC is integrated with the exception-handling mechanism

Cancelling threads

Yes, killing threads is ATC!

Can make a termination model by letting the domain be a thread
- "Create a doWork thread, and kill it if the action fails"
Can still control the domain by disabling "cancelstate"

But, but, but: It leaves us in an undefined state!?

Not if we have...
- Full control over changed state (like logs or recovery points) or some other way of recovering well.
- A lock manager that can unlock on behalf of killed thread
- Some control of where we were killed (like not in the middle of a lock manager or log call)
And this is what we have!

Shared variable synchronization

Non-Preemptive scheduling

Controlling a pump filling a tank.

Spec:

Every second: measure the water level of the tank and generate the reference to the pump
10 times a second: Set the power of the pump motor
Do some GUI: let the human control the process

A trivial solution: "Cyclic Executive"

{% highlight c %}
oldTime = now();
i = 0;
while (true) {
    i = i + 1;
    if (i % 10 == 0) {
        // Every 10th cycle, i.e. once per second
        i = 0;
        calculatePumpReference();
    }
    controlPump();
    do {
        handleUserEvent();
    } while (now() < oldTime + 0.1);  // Fill the rest of the 0.1 s slot
    oldTime = oldTime + 0.1;
}
{% endhighlight %}

Drawbacks

OK tasks?
Timing hard to tune (what if the pump sampling period should be $\pi/10$?)
Overload (what if calculatePumpReference uses more than 1/10 second?)
How to add new tasks? (Everything is coupled)
Waste of time in the do-loop?
What is the priority of handleUserEvent?
How are errors, exceptions, alarms etc. handled?

Better solution with a non-preemptive scheduler

3 tasks administered by a scheduler
The scheduler takes care of who runs and timing
Scheduler often included in OSes
Introducing priorities

{% highlight c %}
/**
 * scheduler_registerThread(function, time, priority)
 * Higher priority number means higher priority in scheduler
 */
main() {
    scheduler_registerThread(controlPump, 0.1, 3);
    scheduler_registerThread(calculatePumpReference, 1, 2);
    scheduler_registerThread(handleUserEvents, 0.2, 1);
    scheduler_mainLoop();
}
{% endhighlight %}
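What such a scheduler might do internally can be sketched with a simulated clock. This is an assumption-laden illustration: the `Scheduler` class, the millisecond ticks and the zero execution time of the tasks are all made up for the example, and only the registration interface mirrors the one above.

```python
class Scheduler:
    """Non-preemptive: a task always runs to completion before the next is picked."""

    def __init__(self):
        self.tasks = []

    def register_thread(self, function, period_ms, priority):
        self.tasks.append({"fn": function, "period": period_ms,
                           "priority": priority, "next_release": 0})

    def main_loop(self, until_ms, tick_ms=100):
        now = 0
        while now < until_ms:
            due = [t for t in self.tasks if t["next_release"] <= now]
            # Higher priority number runs first among the tasks that are due.
            for task in sorted(due, key=lambda t: -t["priority"]):
                task["fn"](now)
                task["next_release"] += task["period"]
            now += tick_ms       # simulated clock; a real loop would sleep


log = []
scheduler = Scheduler()
scheduler.register_thread(lambda now: log.append((now, "controlPump")), 100, 3)
scheduler.register_thread(lambda now: log.append((now, "calculatePumpReference")), 1000, 2)
scheduler.register_thread(lambda now: log.append((now, "handleUserEvents")), 200, 1)
scheduler.main_loop(until_ms=1000)
```

Over one simulated second the pump task runs ten times, the user task five times and the reference calculation once. Because tasks are never interrupted, a task that overruns its slot would delay every other task, which is the overload weakness noted in the text.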

Some notes on priorities

Priority is not the same as importance; the main rule is to give higher priority to tasks with shorter deadlines.
- This allows tasks to reach their deadlines.
... but this is not always the case, e.g. if the tasks are cooperating
We still handle overload badly
And: why should deadline and priority be connected to start with?
- Is this a good dependency seen from a code quality perspective?

Pros and cons of nonpreemptive scheduling

| Pros | Cons |
| --- | --- |
| Simple, intuitive, predictable | C macro hell |
| No kernel | Threads must cooperate <-- a form of dependency breaking module boundaries |
| Fast switching times | Heavy threads must be divided |
| Some elegant synchronization patterns possible | Can we handle blocking of library functions? |
| | Not robust to errors |
| | Not robust to (heavy) error handling |
| | Hard to tune at end of project |
{: .table-responsive-lg .table }

Preemptive Kernel

Preemption, thread objects and the timer interrupt
Enabling synchronization: Busy waiting, test-and-set, disabling the timer interrupt
Blocking and suspend & resume
An API for synchronization? Semaphores!

Preemption

Make a handler for a timer interrupt
Store all registers (including IP & SP) in a "thread object"
Organize queue of processes (Round Robin e.g. - a collection of thread objects?)
Can synchronize by: while(!ready); (busy waiting, "spin locks")

Bad solution

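{% highlight c %}
while (lock == 1) {}  // Spin until the lock looks free
lock = 1;             // Not atomic with the test above: two threads can both get here
// We may run
lock = 0;
{% endhighlight %}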
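Why this is a bad solution can be shown with a deterministic simulation in which each thread is a generator and every `yield` marks a point where the timer interrupt could preempt it. The model is an assumption made for illustration (real preemption is finer grained), and `naive_lock_thread` and `run_schedule` are hypothetical names.

```python
def naive_lock_thread(name, shared):
    """The test-then-set lock, with a yield wherever preemption may strike."""
    while shared["lock"] == 1:
        yield "spinning"
    yield "saw lock == 0"            # preempted between the test and the set
    shared["lock"] = 1
    shared["inside"].add(name)
    shared["max_inside"] = max(shared["max_inside"], len(shared["inside"]))
    yield "in critical section"
    shared["inside"].discard(name)
    shared["lock"] = 0


def run_schedule(schedule):
    """Run the two threads in the given order; return the peak number inside."""
    shared = {"lock": 0, "inside": set(), "max_inside": 0}
    threads = {n: naive_lock_thread(n, shared) for n in ("t1", "t2")}
    for name in schedule:
        next(threads[name], None)    # let that thread run to its next yield
    return shared["max_inside"]


unlucky = run_schedule(["t1", "t2", "t1", "t2"])
lucky = run_schedule(["t1", "t1", "t1", "t2", "t2", "t2"])
```

In the unlucky schedule both threads see the lock as free before either sets it, so both end up in the critical section. The lucky schedule works, which is why testing alone rarely finds this bug.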

Better solution

{% highlight c %}
void t1() {
    flag1 = 1;  // Declare my intention
    turn = 2;   // But try to be polite
    while (flag2 == 1 && turn == 2) {}
    // We may run
    flag1 = 0;
}
{% endhighlight %}
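The flag-and-turn solution (known as Peterson's algorithm) can be checked with a small interleaving model. The sketch assumes each statement executes atomically and in program order, which real hardware does not guarantee without memory barriers; threads are numbered 0 and 1 here instead of 1 and 2, and the helper names are hypothetical.

```python
from itertools import product


def peterson_thread(me, other, s):
    """One thread of the algorithm; each yield is a possible preemption point."""
    s["flag"][me] = 1                # declare my intention
    yield
    s["turn"] = other                # but try to be polite
    yield
    while s["flag"][other] == 1 and s["turn"] == other:
        yield                        # busy waiting
    s["inside"] += 1                 # we may run
    s["max_inside"] = max(s["max_inside"], s["inside"])
    yield
    s["inside"] -= 1
    s["flag"][me] = 0


def most_threads_inside(schedule):
    s = {"flag": [0, 0], "turn": 0, "inside": 0, "max_inside": 0}
    threads = [peterson_thread(0, 1, s), peterson_thread(1, 0, s)]
    for who in schedule:
        next(threads[who], None)     # run one step of the chosen thread
    return s["max_inside"]


# Try every schedule of 12 steps over the two threads (4096 interleavings).
worst = max(most_threads_inside(sched) for sched in product((0, 1), repeat=12))
```

No interleaving in this model puts both threads in the critical section at once. That is evidence for the idea, not a proof for real machines.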

Looking more closely at the arsenal

How can we make basic synchronization under preemption?

Spin locks (wasting time and cpu)
Test&Set (swap) assembly instruction (atomic, but not obvious)
Disable interrupt (steals control from OS/scheduler)

But

If we disable the timer interrupt we do not have preemption any more
And... Are these good abstractions in the application programmer domain?

Blocked threads

Let us introduce another queue; the collection of threads not running, waiting for something

Fixes the bad performance of spin locks. Is conceptually better.
suspend moves a thread object from "run" queue to "blocked" queue
resume moves it back.

Two bad solutions

{% highlight c %}
t1() {
    while (busy == 1) suspend();
    busy = 1;    // It is free; take it - No
    // Run
    busy = 0;    // Release resource

    resume(t2);  // No
}
{% endhighlight %}

{% highlight c %}
t1() {
    while (TestNSet(busy, 1) == 1) suspend();
    // We own resource
    // Run
    busy = 0;

    resume(t2);  // No
}
{% endhighlight %}

The suspend/resume problem

{% highlight c %}
// Global variables
bool g_initDone = False;

// Threads
t1() {
    /* Do init */
    g_initDone = True;
    resume(t2);
    // Continue executing
}

t2() {
    if (g_initDone == False) {
        // If t2 is preempted here, t1's resume(t2) is lost and t2 suspends forever
        Suspend();
    }
    // Continue executing
}
{% endhighlight %}

Priorities

Threads may have different priorities. (A sorted run-queue, or several of them.)
A thread will run only if there are no runnable threads with a higher priority.
We are not aiming for some sense of fairness (!). But predictability.
And priorities support schedulability proofs.
But we open ourselves up to starvation. A thread may not ever get to run, even if it is runnable.
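Starvation under strict priorities is easy to reproduce in a toy simulation. The thread names, priorities and runnable patterns below are invented for illustration; one tick of CPU goes to the highest-priority thread that is runnable.

```python
def run_strict_priority(threads, ticks):
    """threads maps name -> (priority, is_runnable(tick)); returns ticks received."""
    got = {name: 0 for name in threads}
    for tick in range(ticks):
        runnable = [(prio, name)
                    for name, (prio, is_runnable) in threads.items()
                    if is_runnable(tick)]
        if runnable:
            _, name = max(runnable)      # highest priority wins, always
            got[name] += 1
    return got


threads = {
    "control": (3, lambda tick: True),           # always has work to do
    "logger": (2, lambda tick: tick % 2 == 0),
    "gui": (1, lambda tick: True),               # runnable, but never chosen
}
starved = run_strict_priority(threads, 100)

# Same system, but the high-priority thread only needs every 4th tick.
relaxed = dict(threads, control=(3, lambda tick: tick % 4 == 0))
shared_out = run_strict_priority(relaxed, 100)
```

In the first run the lower-priority threads get no CPU at all even though they are runnable. In the second run they share whatever the high-priority thread leaves over.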

Application-level synchronization

So, the application programmer needs some synchronization primitives...

sleep()? - Ok
Publish suspend and resume - No
Events (wait and signal) - Just named versions of suspend & resume semantics.
- Fixes the need to know about "thread objects". But no
...or "Condition variables" - same

Semaphores

A counting semaphore

signal(SEM) increases the counter (possibly resuming a thread waiting for the semaphore)
wait(SEM) decrements the counter - will block (be suspended) if SEM == 0
- The semaphore's value cannot be negative
Of course; These calls are protected from interleaving by disabling the timer interrupt
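A counting semaphore with these semantics can be sketched on top of Python's `threading.Condition`, which here plays the role that disabling the timer interrupt plays inside a kernel. `CountingSemaphore` is a hypothetical class written for illustration; the standard library already provides `threading.Semaphore`.

```python
import threading


class CountingSemaphore:
    def __init__(self, value=0):
        if value < 0:
            raise ValueError("the value of a semaphore cannot be negative")
        self._value = value
        self._cond = threading.Condition()   # protects _value from interleaving

    def wait(self):
        with self._cond:
            while self._value == 0:
                self._cond.wait()            # blocked until someone signals
            self._value -= 1

    def signal(self):
        with self._cond:
            self._value += 1
            self._cond.notify()              # possibly resume one waiting thread


# The init-done scenario from the suspend/resume problem, with a semaphore.
init_done = CountingSemaphore(0)
log = []


def t1():
    log.append("init")
    init_done.signal()


def t2():
    init_done.wait()
    log.append("t2 continues")


# t1 finishes before t2 even starts: the signal is remembered in the counter.
first = threading.Thread(target=t1)
first.start()
first.join()
second = threading.Thread(target=t2)
second.start()
second.join(timeout=5)
```

Unlike a bare resume, a signal that arrives before the wait is not lost, so t2 continues regardless of the order in which the threads run.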

We solve beautifully:

Mutual Exclusion
Conditional Synchronization (ref suspend/resume)
Basic resource allocation
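Mutual exclusion and basic resource allocation can both be sketched with the standard `threading.Semaphore`, where `acquire` corresponds to wait and `release` to signal. The worker, the pool size of two and the counters are illustrative assumptions.

```python
import threading

mutex = threading.Semaphore(1)       # binary use: mutual exclusion
pool = threading.Semaphore(2)        # counting use: two identical resources
stats = {"counter": 0, "in_use": 0, "max_in_use": 0}


def worker():
    pool.acquire()                   # wait(pool): blocks while both are taken
    mutex.acquire()
    stats["in_use"] += 1
    stats["max_in_use"] = max(stats["max_in_use"], stats["in_use"])
    mutex.release()
    for _ in range(1000):
        mutex.acquire()              # wait(mutex)
        stats["counter"] += 1        # critical section
        mutex.release()              # signal(mutex)
    mutex.acquire()
    stats["in_use"] -= 1
    mutex.release()
    pool.release()                   # signal(pool): hand the resource back


workers = [threading.Thread(target=worker) for _ in range(5)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

Five threads compete, but at most two hold a resource at any time, and no increment of the shared counter is lost.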

Semaphore variations

wait and signal may take a parameter: the value to add or subtract
getValue(SEM) returning the value of the semaphore. (Fishy)
Binary semaphores (signal will fail if SEM == 1)
Who is woken at signal (FIFO, Arbitrary, Highest priority)
The mutex
- binary
- ownership
- allows multiple waits by owner
- regions (may be released by Java's wait or POSIX condition variables)
RTFM

Semaphore challenges

Breaks modules (both ways)
- Does not scale!
Deadlocks
- Global analysis --> Does not scale
Cannot release "temporarily"
"Limited expressive power". Some reasonable problems are hard to solve
- Ref "The Little Book of Semaphores"

Why shared-variable synchronization

Why not?

"Shared variables" is bad code quality
- Ref global variables, and data members in module interfaces
An obvious bottleneck? Scales terribly
"Variables" are passive objects
- They can not protect themselves
Why use synchronization when it is communication we need?
Technology transfers badly to distributed systems
... and this is before we start discussing how hard it is

Why?

Part of the "real-time" design pattern
- "One thread per timing demand"
- We do have scheduling proofs and best practices
Timing analysis is global anyway
- Scalability and deadlock analysis may not be the limiting constraint
HW is shared memory architecture
- Infrastructure is available
Communication systems require infrastructure that we may not have

All resources are shared!

Memory, certainly
"Hidden" memory used by libraries (.. your own modules and the kernel)
- If the library takes care of this itself, it is called "reentrant"
Sensors and actuators
"CPU" - Computing capacity
- This is real-time programming; We solve it by Scheduling
... any other interface

Some standard problems/pit-falls

Race condition: A bug that surfaces by unfortunate timing or order of events
Deadlock: system in circular wait
- Special case of livelock
- Does not use CPU
Livelock: system locked in a subset of states
- like deadlock, but we use CPU
- Busy-Waiting is a livelock
Starvation: A thread does "by accident" not get the necessary resources

Features in synchronization

Critical Section - Code that must not be interrupted
Mutual Exclusion - Several pieces of code that must not interrupt each other
Bounded buffer - Buffer with full/empty synchronization
Read/Write Locks
- Readers can interleave each other
- Writers have mutual exclusion
Condition Synchronization - Blocking on event or status
- Guards etc.
Resource allocation
- More than mutual exclusion!
- Ref: The lock manager
Rendezvous/barrier - Synchronization point
- Ref: AA "end boundary"
Communication
Broadcast
...
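As an example of one of these features, a bounded buffer with full/empty synchronization can be sketched with two counting semaphores and a lock. `BoundedBuffer` is a hypothetical class for illustration; Python's `queue.Queue` offers the same behaviour ready-made.

```python
import threading
from collections import deque


class BoundedBuffer:
    def __init__(self, capacity):
        self.items = deque()
        self.free = threading.Semaphore(capacity)   # counts empty slots
        self.used = threading.Semaphore(0)          # counts filled slots
        self.mutex = threading.Lock()               # protects the deque

    def put(self, item):
        self.free.acquire()          # blocks while the buffer is full
        with self.mutex:
            self.items.append(item)
        self.used.release()

    def get(self):
        self.used.acquire()          # blocks while the buffer is empty
        with self.mutex:
            item = self.items.popleft()
        self.free.release()
        return item


buffer = BoundedBuffer(capacity=3)


def producer():
    for i in range(20):
        buffer.put(i)


worker = threading.Thread(target=producer)
worker.start()
received = [buffer.get() for _ in range(20)]
worker.join()
```

The producer can never run more than three items ahead of the consumer, and with a single producer and a single consumer the items arrive in order.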
