123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195 |
- An ad-hoc collection of notes on IA64 MCA and INIT processing. Feel
- free to update it with notes about any area that is not clear.
- ---
- MCA/INIT are completely asynchronous. They can occur at any time, when
- the OS is in any state. Including when one of the cpus is already
- holding a spinlock. Trying to get any lock from MCA/INIT state is
- asking for deadlock. Also the state of structures that are protected
- by locks is indeterminate, including linked lists.
- ---
- The complicated ia64 MCA process. All of this is mandated by Intel's
- specification for ia64 SAL, error recovery and unwind, it is not as
- if we have a choice here.
- * MCA occurs on one cpu, usually due to a double bit memory error.
- This is the monarch cpu.
- * SAL sends an MCA rendezvous interrupt (which is a normal interrupt)
- to all the other cpus, the slaves.
- * Slave cpus that receive the MCA interrupt call down into SAL, they
- end up spinning disabled while the MCA is being serviced.
- * If any slave cpu was already spinning disabled when the MCA occurred
- then it cannot service the MCA interrupt. SAL waits ~20 seconds then
- sends an unmaskable INIT event to the slave cpus that have not
- already rendezvoused.
- * Because MCA/INIT can be delivered at any time, including when the cpu
- is down in PAL in physical mode, the registers at the time of the
- event are _completely_ undefined. In particular the MCA/INIT
- handlers cannot rely on the thread pointer, PAL physical mode can
- (and does) modify TP. It is allowed to do that as long as it resets
- TP on return. However MCA/INIT events expose us to these PAL
- internal TP changes. Hence curr_task().
- * If an MCA/INIT event occurs while the kernel was running (not user
- space) and the kernel has called PAL then the MCA/INIT handler cannot
- assume that the kernel stack is in a fit state to be used. Mainly
- because PAL may or may not maintain the stack pointer internally.
- Because the MCA/INIT handlers cannot trust the kernel stack, they
- have to use their own, per-cpu stacks. The MCA/INIT stacks are
- preformatted with just enough task state to let the relevant handlers
- do their job.
- * Unlike most other architectures, the ia64 struct task is embedded in
- the kernel stack[1]. So switching to a new kernel stack means that
- we switch to a new task as well. Because various bits of the kernel
- assume that current points into the struct task, switching to a new
- stack also means a new value for current.
- * Once all slaves have rendezvoused and are spinning disabled, the
- monarch is entered. The monarch now tries to diagnose the problem
- and decide if it can recover or not.
- * Part of the monarch's job is to look at the state of all the other
- tasks. The only way to do that on ia64 is to call the unwinder,
- as mandated by Intel.
- * The starting point for the unwind depends on whether a task is
- running or not. That is, whether it is on a cpu or is blocked. The
- monarch has to determine whether or not a task is on a cpu before it
- knows how to start unwinding it. The tasks that received an MCA or
- INIT event are no longer running, they have been converted to blocked
- tasks. But (and its a big but), the cpus that received the MCA
- rendezvous interrupt are still running on their normal kernel stacks!
- * To distinguish between these two cases, the monarch must know which
- tasks are on a cpu and which are not. Hence each slave cpu that
- switches to an MCA/INIT stack, registers its new stack using
- set_curr_task(), so the monarch can tell that the _original_ task is
- no longer running on that cpu. That gives us a decent chance of
- getting a valid backtrace of the _original_ task.
- * MCA/INIT can be nested, to a depth of 2 on any cpu. In the case of a
- nested error, we want diagnostics on the MCA/INIT handler that
- failed, not on the task that was originally running. Again this
- requires set_curr_task() so the MCA/INIT handlers can register their
- own stack as running on that cpu. Then a recursive error gets a
- trace of the failing handler's "task".
- [1] My (Keith Owens) original design called for ia64 to separate its
- struct task and the kernel stacks. Then the MCA/INIT data would be
- chained stacks like i386 interrupt stacks. But that required
- radical surgery on the rest of ia64, plus extra hard wired TLB
- entries with its associated performance degradation. David
- Mosberger vetoed that approach. Which meant that separate kernel
- stacks meant separate "tasks" for the MCA/INIT handlers.
- ---
- INIT is less complicated than MCA. Pressing the nmi button or using
- the equivalent command on the management console sends INIT to all
- cpus. SAL picks one of the cpus as the monarch and the rest are
- slaves. All the OS INIT handlers are entered at approximately the same
- time. The OS monarch prints the state of all tasks and returns, after
- which the slaves return and the system resumes.
- At least that is what is supposed to happen. Alas there are broken
- versions of SAL out there. Some drive all the cpus as monarchs. Some
- drive them all as slaves. Some drive one cpu as monarch, wait for that
- cpu to return from the OS then drive the rest as slaves. Some versions
- of SAL cannot even cope with returning from the OS, they spin inside
- SAL on resume. The OS INIT code has workarounds for some of these
- broken SAL symptoms, but some simply cannot be fixed from the OS side.
- ---
- The scheduler hooks used by ia64 (curr_task, set_curr_task) are layer
- violations. Unfortunately MCA/INIT start off as massive layer
- violations (can occur at _any_ time) and they build from there.
- At least ia64 makes an attempt at recovering from hardware errors, but
- it is a difficult problem because of the asynchronous nature of these
- errors. When processing an unmaskable interrupt we sometimes need
- special code to cope with our inability to take any locks.
- ---
- How is ia64 MCA/INIT different from x86 NMI?
- * x86 NMI typically gets delivered to one cpu. MCA/INIT gets sent to
- all cpus.
- * x86 NMI cannot be nested. MCA/INIT can be nested, to a depth of 2
- per cpu.
- * x86 has a separate struct task which points to one of multiple kernel
- stacks. ia64 has the struct task embedded in the single kernel
- stack, so switching stack means switching task.
- * x86 does not call the BIOS so the NMI handler does not have to worry
- about any registers having changed. MCA/INIT can occur while the cpu
- is in PAL in physical mode, with undefined registers and an undefined
- kernel stack.
- * i386 backtrace is not very sensitive to whether a process is running
- or not. ia64 unwind is very, very sensitive to whether a process is
- running or not.
- ---
- What happens when MCA/INIT is delivered what a cpu is running user
- space code?
- The user mode registers are stored in the RSE area of the MCA/INIT on
- entry to the OS and are restored from there on return to SAL, so user
- mode registers are preserved across a recoverable MCA/INIT. Since the
- OS has no idea what unwind data is available for the user space stack,
- MCA/INIT never tries to backtrace user space. Which means that the OS
- does not bother making the user space process look like a blocked task,
- i.e. the OS does not copy pt_regs and switch_stack to the user space
- stack. Also the OS has no idea how big the user space RSE and memory
- stacks are, which makes it too risky to copy the saved state to a user
- mode stack.
- ---
- How do we get a backtrace on the tasks that were running when MCA/INIT
- was delivered?
- mca.c:::ia64_mca_modify_original_stack(). That identifies and
- verifies the original kernel stack, copies the dirty registers from
- the MCA/INIT stack's RSE to the original stack's RSE, copies the
- skeleton struct pt_regs and switch_stack to the original stack, fills
- in the skeleton structures from the PAL minstate area and updates the
- original stack's thread.ksp. That makes the original stack look
- exactly like any other blocked task, i.e. it now appears to be
- sleeping. To get a backtrace, just start with thread.ksp for the
- original task and unwind like any other sleeping task.
- ---
- How do we identify the tasks that were running when MCA/INIT was
- delivered?
- If the previous task has been verified and converted to a blocked
- state, then sos->prev_task on the MCA/INIT stack is updated to point to
- the previous task. You can look at that field in dumps or debuggers.
- To help distinguish between the handler and the original tasks,
- handlers have _TIF_MCA_INIT set in thread_info.flags.
- The sos data is always in the MCA/INIT handler stack, at offset
- MCA_SOS_OFFSET. You can get that value from mca_asm.h or calculate it
- as KERNEL_STACK_SIZE - sizeof(struct pt_regs) - sizeof(struct
- ia64_sal_os_state), with 16 byte alignment for all structures.
- Also the comm field of the MCA/INIT task is modified to include the pid
- of the original task, for humans to use. For example, a comm field of
- 'MCA 12159' means that pid 12159 was running when the MCA was
- delivered.
|