Re: RPi Kernel Panic on Bookworm


 


Hello Michael,

It's great to see your persistence here, and I think with some additional details we can narrow things down. A few questions:

   - You mentioned that when you rebuilt the TEST machine to use a real sound device instead of a sound loopback, it started to crash. To me, this could be RFI induced. How physically close is the radio to your Raspberry Pis? How much RF power is the radio transmitting?

   - What is the interval on your beacon message(s)? Are you only sending one beacon or multiple? Are you using a digi path on your beacons?

   - When you use Pat, are you making an outbound AX.25 connection to a remote system, or is some remote system making an AX.25 connection to you?
      - How often do you make this connection attempt?

   - It might be good to run a script via cron every minute on the Pis under test that:
      - Writes all of the following to a file on permanent storage, say in your home directory:
         - all running processes and the system load
         - the amount of free memory, disk space, etc.
         - the bottom of dmesg, for any new lines
      - Issues a sync command to push the changes to the "disk", so that you hopefully lose as little data as possible when the Pi crashes
      - Once the Pi crashes, reboot it and review the file. See if there are any concerning trends (memory leaks, USB resets, etc.)

   - What kernel version are you currently using?

   - Can you take a picture of the kernel crash screen?
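
The per-minute logging script suggested above could look something like the following. This is only a minimal sketch: the script name (snapshot.py), the log path, the 30-line dmesg tail and the exact commands (uptime, ps aux, free -m, df -h, dmesg) are assumptions chosen for illustration, not something taken from this thread. On some systems reading dmesg requires root, in which case the entry would need to go in root's crontab.

```python
#!/usr/bin/env python3
"""Append a system snapshot to a log file and push it out to storage."""
import os
import subprocess
import sys
from datetime import datetime
from pathlib import Path

# (label, command, tail): tail=N keeps only the last N lines of output.
# The command choices are assumptions; adjust them to taste.
COMMANDS = [
    ("load", ["uptime"], None),
    ("processes", ["ps", "aux"], None),
    ("memory", ["free", "-m"], None),
    ("disk", ["df", "-h"], None),
    ("dmesg tail", ["dmesg"], 30),
]


def run(command, tail=None):
    """Run a command and return its output, or a short note if it fails."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=20
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"[could not run {command[0]}: {exc}]\n"
    text = result.stdout + result.stderr
    if tail is not None:
        text = "".join(text.splitlines(keepends=True)[-tail:])
    return text


def write_snapshot(log_path, commands=COMMANDS):
    """Append one timestamped snapshot to log_path and force it to storage."""
    with open(log_path, "a") as log:
        stamp = datetime.now().isoformat(timespec="seconds")
        log.write(f"===== {stamp} =====\n")
        for label, command, tail in commands:
            log.write(f"--- {label} ---\n{run(command, tail)}\n")
        log.flush()
        os.fsync(log.fileno())  # flush this file's data to the SD card
    os.sync()  # equivalent of the sync command, for everything else


# Only runs when a log path is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    write_snapshot(Path(sys.argv[1]))
```

A crontab line along the lines of "* * * * * /usr/bin/python3 /home/pi/snapshot.py /home/pi/crash-watch.log" (both paths hypothetical) would run it once a minute. Capturing the full process list every minute makes the file grow quickly, so it may be worth trimming the command list once a trend shows up.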


One note:

   - I don't think your statement, "I thought this detail was significant as beacon and pat are the only processes that produce UI frames", is correct. Connectionless beacons use UI frames, but PAT uses connected sessions.
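
To illustrate the distinction: in AX.25 the frame type is encoded in the control field of each frame. Here is a rough sketch of how the types can be told apart, assuming the ordinary one-byte (modulo-8) control field. The function name is made up for illustration, and only the common unnumbered frame types are listed.

```python
PF_BIT = 0x10  # poll/final bit, ignored when identifying the frame type

# Unnumbered frame types, keyed by control byte with the P/F bit cleared.
U_FRAMES = {
    0x03: "UI",    # connectionless data (beacons, APRS)
    0x2F: "SABM",  # connect request
    0x43: "DISC",  # disconnect request
    0x63: "UA",    # unnumbered acknowledge
    0x0F: "DM",    # disconnected mode
    0x87: "FRMR",  # frame reject
}


def classify_control(control):
    """Name the AX.25 frame type for a one-byte (modulo-8) control field."""
    if control & 0x01 == 0:
        return "I"  # numbered information frame, only used inside a connection
    if control & 0x03 == 0x01:
        return "S"  # supervisory frame (e.g. RR, RNR, REJ)
    return U_FRAMES.get(control & ~PF_BIT & 0xFF, "U (other)")


for byte in (0x03, 0x3F, 0x00, 0x01):
    print(f"0x{byte:02X} -> {classify_control(byte)}")
```

A beacon goes out as a single UI frame, whereas a connected session is set up with SABM/UA and then carries its data in I frames, acknowledged by S frames.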

--David
KI6ZHD


On 01/28/2024 02:12 PM, Michael Dunn wrote:

Hi everyone,

It's been a few weeks since I've posted on this, but I've been busy working through testing and troubleshooting on multiple crashes. This might get a bit lengthy, so here's your mute switch to tap out :). TL;DR: the bad news is that the Pi still consistently crashes regardless of what I've changed, but the good news is that I've duplicated the crash elsewhere.

To summarize previous posts, the Pi had been crashing consistently "every few days", so I followed the usual advice of performing software and firmware updates; the system still crashed, however. I connected a serial console to the Pi to catch the kernel stack trace. The Pi's kernel blamed beacon for the crash, so I disabled beacons thinking it might be a workaround, but the crash came anyway.

Since the last post I've introduced a second Pi for testing; I'm referring to it as TEST and the original Pi as PROD. Following KI6ZHD's advice, I built the VE7FET replacement libax25 and tools on TEST and installed the packages (after removing the distro default packages) on PROD. PROD crashed a few days later, but as an interesting note, the kernel trace blamed pat instead of beacon. I thought this detail was significant as beacon and pat are the only processes that produce UI frames, so my next test was to remove the beacon configuration from pat on the PROD Pi.

At the risk of sounding like a broken record, PROD crashed again. Next up was another KI6ZHD suggestion to bounce the AX.25 stack after a clean reboot; however, PROD crashed several days past the reboot and stack bounce. At this point I backed out the VE7FET packages, trying to get back to a baseline configuration. One interesting item that came out of this crash was the time of day, 9:15a local; more on that later. Another item was some self-inflicted "complexity" on the AX.25 stack on PROD. Without going into details (would be lengthy), I removed this complexity, but PROD crashed again at 9:15a a few days later (today).

So, that's the bad news, but there is a bit of good news. Sometime early on, I started working on building a second Pi (TEST) to reproduce this crash. The thought was to reduce the time it takes to test ideas by running multiple tests in parallel. As it turns out, getting TEST to crash has been elusive. On the initial build, I imaged the TEST SD card from the same Raspbian image, manually built the AX.25 stack and configured a dummy sound device. This build ran for 2 weeks, beaconing every 60 seconds without a crash.

On the second iteration, I built TEST by dd'ing PROD's SD card to TEST's SD card. The thought was that I must have missed some critical configuration, thus a block-for-block copy would solve that. On booting TEST, I changed a minimum number of items; just what was required for it to operate. I changed the hostname, configured snd-dummy and swapped out the audio device in Direwolf; I commented out Pat's scheduler so it wouldn't steal my email, and set beacons to 60s. I let this build of TEST run for about 9 days before I decided that it had failed to produce a crash.

Since TEST is using snd-dummy and is essentially an echo chamber, the thought occurred to me that the crash might need a chain of events that require the kernel to hear traffic and enter some state prior to the crash. I ordered an additional CM108 card for TEST and proceeded to wire a tap into my radio connection, routing RX audio and ground to the CM108. This allowed TEST to hear but not speak, and a crash followed about 6-7 days later.

I mentioned the significance of 9:15a earlier; 9:15a is the time that pat is scheduled to check my mail on PROD. I have witnessed at least 3 crashes around 9:15a, which should be a big, red flag. However, I say "around" because I've noted crashes that happen moments before my mail is checked and even crashes that happen in the minutes after my mail is retrieved. Recall that I disabled the pat schedule on TEST; that crash occurred sometime later in the day (11a?), but the kernel still blamed pat. I think pat is just the thing that happens to be generating traffic, and the schedule simply coalesces the time of day that the crash happens.

So, fresh out of good ideas, yet flush with bad ones, I'm at a bit of an impasse. Does anyone have any ideas for me?

One bad idea I have is to shut off pat entirely; however, the lesson learned from shutting off beacon is that the crash will just stem from some other process using the AX.25 stack. That would just prove the producer of the crash is mobile, but not help me to work around it.

Another bad idea would be to recompile the kernel on TEST to produce the core and debug symbols needed to trace the crash. That seems like a fair bit of work and I'm not really confident the output would be useful.

Any thoughts would be greatly appreciated. In lieu of any good ideas, I plan to proceed with the bad ones mentioned above. :)

  Thanks
  Mike



