#493 1440x900 P8600 T400: kernel panic during heavy I/O

Closed
opened 4 months ago by specing · 8 comments
[1489743.926522] Disabling lock debugging due to kernel taint
[1489743.926522] mce: [Hardware Error]: CPU 0: Machine Check Exception: 4 Bank 5: b200121014040400
[1489743.926522] mce: [Hardware Error]: TSC 1808047d6d192
[1489743.926522] mce: [Hardware Error]: PROCESSOR 0:1067a TIME 1534131347 SOCKET 0 APIC 0 microcode 0
[1489743.926522] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[1489743.926522] mce: [Hardware Error]: CPU 1: Machine Check Exception: 5 Bank 5: b200121020080400
[1489743.926522] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff817129b7> [1489743.926522] {acpi_processor_ffh_cstate_enter+0x77/0xc0}
mce: [Hardware Error]: TSC 1808047d73a05
[1489743.926522] mce: [Hardware Error]: PROCESSOR 0:1067a TIME 1534131347 SOCKET 0 APIC 1 microcode 0
[1489743.926522] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[1489743.926522] mce: [Hardware Error]: CPU 0: Machine Check Exception: 4 Bank 0: b200004000000800
[1489743.926522] mce: [Hardware Error]: TSC 1808047d6d192
[1489743.926522] mce: [Hardware Error]: PROCESSOR 0:1067a TIME 1534131347 SOCKET 0 APIC 0 microcode 0
[1489743.926522] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[1489743.926522] mce: [Hardware Error]: Machine check: Processor context corrupt
[1489743.926522] Kernel panic - not syncing: Fatal machine check
[1489743.926522] Kernel Offset: disabled
[1489743.926522] Rebooting in 30 seconds..
Fedja Beader commented 4 months ago
Poster

And here is mcelog --ascii's output:

Disabling lock debugging due to kernel taint
Hardware event. This is not a software error.
CPU 0 BANK 5 TSC 1808047d6d192 
TIME 1534131347 Mon Aug 13 05:35:47 2018
MCG status:MCIP 
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Internal Timer error
STATUS b200121014040400 MCGSTATUS 4
CPUID Vendor Intel Family 6 Model 23
SOCKET 0 APIC 0 microcode 0
Run the above through 'mcelog --ascii'
Hardware event. This is not a software error.
CPU 1 BANK 5 TSC 1808047d73a05 
RIP !INEXACT! 10:ffffffff817129b7
TIME 1534131347 Mon Aug 13 05:35:47 2018
MCG status:RIPV MCIP 
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Internal Timer error
STATUS b200121020080400 MCGSTATUS 5
CPUID Vendor Intel Family 6 Model 23
SOCKET 0 APIC 1 microcode 0
Run the above through 'mcelog --ascii'
Hardware event. This is not a software error.
CPU 0 BANK 0 TSC 1808047d6d192 
TIME 1534131347 Mon Aug 13 05:35:47 2018
MCG status:MCIP 
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: BUS error: -1 0 Level-0 Local-CPU-originated-request Generic Memory-access Request-did-not-timeout
BQ_DCU_READ_TYPE BQ_ERR_HARD_TYPE BQ_ERR_HARD_TYPE
timeout BINIT (ROB timeout). No micro-instruction retired for some time
STATUS b200004000000800 MCGSTATUS 4
CPUID Vendor Intel Family 6 Model 23
SOCKET 0 APIC 0 microcode 0
Run the above through 'mcelog --ascii'
Machine check: Processor context corrupt
Kernel panic - not syncing: Fatal machine check
Kernel Offset: disabled
Rebooting in 30 seconds..
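For readers who want to check the decoding by hand, here is a minimal sketch of how the STATUS and MCGSTATUS words above map to the flags mcelog prints. The bit positions follow the Intel SDM's description of IA32_MCi_STATUS and IA32_MCG_STATUS as far as I know them; the function and table names are made up for this illustration, and only the two MCA error codes seen in this report are handled.

```python
# Hypothetical helper names; bit layout assumed from the Intel SDM
# (IA32_MCi_STATUS upper flag bits, IA32_MCG_STATUS low bits).
STATUS_FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
                59: "MISCV", 58: "ADDRV", 57: "PCC"}
MCG_FLAGS = {0: "RIPV", 1: "EIPV", 2: "MCIP"}


def decode_flags(value, table):
    """Return the names of all flag bits from `table` that are set in `value`."""
    return {name for bit, name in table.items() if (value >> bit) & 1}


def decode_mca_code(status):
    """Classify the low 16 bits (MCA error code) for the codes seen in this log."""
    code = status & 0xFFFF
    if code == 0x0400:
        return "Internal Timer error"
    if (code & 0xF800) == 0x0800:
        # Pattern 0000 1PPT RRRR IILL: bus and interconnect errors.
        return "Bus/interconnect error"
    return f"other (0x{code:04x})"


for status, mcg in [(0xB200121014040400, 4),
                    (0xB200121020080400, 5),
                    (0xB200004000000800, 4)]:
    print(sorted(decode_flags(status, STATUS_FLAGS)),
          decode_mca_code(status),
          sorted(decode_flags(mcg, MCG_FLAGS)))
```

All three records come out with UC, EN and PCC set, which is why the kernel treats the event as fatal rather than logging and continuing.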
Fedja Beader commented 3 months ago
Poster

About a month ago I applied microcode update 0xa0e to the above machine (a stepping 10 Core 2 Duo P8600), and since then I have been unable to reproduce this issue, despite hitting it with heavy I/O for 10+ hours on end.
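To confirm which revision is actually loaded, something like the following sketch may help. It assumes a Linux system whose `/proc/cpuinfo` exposes a `microcode` field per logical CPU; the function name is hypothetical.

```python
import re


def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions found in /proc/cpuinfo-style text."""
    pattern = r"^microcode\s*:\s*(0x[0-9a-fA-F]+)"
    return {int(rev, 16) for rev in re.findall(pattern, cpuinfo_text, re.M)}


# Sample text shaped like /proc/cpuinfo on a two-core machine (illustrative only).
sample = "processor\t: 0\nmicrocode\t: 0xa0e\n\nprocessor\t: 1\nmicrocode\t: 0xa0e\n"
print({hex(rev) for rev in microcode_revisions(sample)})
```

On a real machine, pass in the contents of `/proc/cpuinfo` instead of `sample`; more than one value in the result would mean the cores are not all running the same revision.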


@specing Are you suggesting that the microcode update might have fixed the issue, based on anecdotal evidence? Of course we can't formally prove anything or be sure that some factor contributing to the issue got fixed. But it's a good start. Pardon me if I am misunderstanding something.

Did the microcode update come with release notes or anything like that?

Thank you for sending this report and following up with your update.

Fedja Beader commented 2 months ago
Poster

I am merely stating that there has been no panic since updating the microcode. 6 weeks now.

I do not know of any release notes.

Leah Rowe commented 2 months ago
Owner

could you make a note about this somewhere in the libreboot documentation? this is useful information which some users will benefit from. i've seen people report this issue before

Fedja Beader commented 2 months ago
Poster

I was hoping this could be resolved some other way than encouraging users to install proprietary software (at least if Intel microcode is in any way similar to AMD's in that lecture at CCC).

Will probably start by adding "Capturing panic logs with netconsole" to FAQ.
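As a rough idea of the receiving side, here is a minimal sketch of a UDP listener that appends netconsole messages to a file on a second machine. It assumes the panicking machine is already configured to send netconsole output to this host on UDP port 6666; the function names and the log file name are hypothetical.

```python
import socket


def format_record(payload: bytes, sender: str) -> str:
    """Turn one received datagram into a single log line tagged with the sender."""
    text = payload.decode("utf-8", errors="replace").rstrip("\n")
    return f"{sender} {text}"


def listen(port: int = 6666, logfile: str = "netconsole.log") -> None:
    """Receive datagrams forever and append them to `logfile` (assumed port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock, \
            open(logfile, "a") as out:
        sock.bind(("0.0.0.0", port))
        while True:
            payload, (host, _port) = sock.recvfrom(65535)
            out.write(format_record(payload, host) + "\n")
            out.flush()  # flush per message so a crash loses as little as possible
```

Calling `listen()` blocks until interrupted. Flushing after every message matters here because the interesting lines are the last ones sent before the sender reboots.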

Leah Rowe commented 2 months ago
Owner

well you don't tell the user to use updates. you simply tell them that (assuming it's even true, that is) having no updates can cause such instability

it's simply a statement of fact. depending on how you word it, there's no violation of mission statement

Leah Rowe commented 1 month ago
Owner

Fedja, I can confirm. I've been running with microcode updates on my own X200T for a few weeks to test this. I've been heavily stress-testing my own personal workstation and it's been rock solid with the updates. Same cannot be said without.

I'm inclined to believe that these instability issues only appear without the microcode updates. Which means that it's not possible to have stability on heavy I/O in Libreboot.

Since this is unsolvable by Libreboot, this issue can therefore be closed.
