Hello Michael,
It's great to see your persistence here, and I think that with
some additional details we can narrow things down.
A few questions:
   - You mentioned that when you rebuilt the TEST machine to use a
     real sound device instead of a sound loopback, it started to
     crash. To me, this could be RFI induced. How physically close is
     the radio to your Raspberry Pis? How much RF power is the radio
     transmitting?
   - What is the interval on your beacon message(s)? Are you only
     sending one beacon or multiple? Are you using a digi path on your
     beacons?
   - When you use Pat, are you making an outbound AX.25 connection
     to a remote system, or is some remote system making an AX.25
     connection to you?
      - How often do you make this connection attempt?
   - It might be good to run a script via cron every minute on the
     Pis under test that:
      - Writes all of the following to a file on permanent storage,
        say in your home directory:
         - captures all running processes and the system load
         - captures the amount of free memory, disk space, etc.
         - captures the bottom of dmesg for any new lines
         - issues a sync command to push the changes to the "disk",
           to hopefully lose as little data as possible when the
           Pi crashes
      - Once the Pi crashes, reboot it and review the file. See
        if there are any concerning trends (memory leaks, USB resets, etc.)
   - What kernel version are you currently using?
   - Can you take a picture of the kernel crash screen?
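
As a rough illustration of the cron logging script suggested above, here is a minimal sketch in Python. The log file name, the number of dmesg lines kept, and the function names are all my own placeholders, not anything from an existing tool. It only shells out to standard utilities (uptime, ps, free, df, dmesg), and on some systems dmesg needs root, in which case that section will just contain the error text.

```python
#!/usr/bin/env python3
"""Append a system snapshot to a log file; meant to be run from cron every minute."""
import os
import subprocess
import sys
from datetime import datetime

# Hypothetical log location: any file on permanent (non-tmpfs) storage will do.
LOG_PATH = os.path.join(os.path.expanduser("~"), "pi-crash-watch.log")

# Standard Linux utilities to capture on every run.
COMMANDS = [
    ["uptime"],      # system load averages
    ["ps", "aux"],   # all running processes
    ["free", "-m"],  # free memory, in MiB
    ["df", "-h"],    # disk space
    ["dmesg"],       # kernel ring buffer; only the tail is kept
]
DMESG_TAIL_LINES = 30  # arbitrary choice for "the bottom of dmesg"


def run(cmd):
    """Run one command and return its output, or a note if it could not run."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=20)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"[could not run {' '.join(cmd)}: {exc}]\n"
    out = result.stdout + result.stderr
    if cmd[0] == "dmesg":
        out = "\n".join(out.splitlines()[-DMESG_TAIL_LINES:]) + "\n"
    return out


def write_snapshot(log_path=LOG_PATH):
    """Append one timestamped snapshot and force it out to storage."""
    stamp = datetime.now().isoformat(timespec="seconds")
    parts = [f"===== snapshot {stamp} =====\n"]
    for cmd in COMMANDS:
        parts.append(f"--- {' '.join(cmd)} ---\n")
        parts.append(run(cmd))
    with open(log_path, "a") as fh:
        fh.write("".join(parts))
        fh.flush()
        os.fsync(fh.fileno())  # push this file's data to the SD card
    os.sync()                  # equivalent of issuing the sync command


if __name__ == "__main__":
    try:
        write_snapshot()
    except OSError as exc:
        print(f"snapshot failed: {exc}", file=sys.stderr)
```

Assuming the script is saved somewhere like /home/pi/pi_crash_watch.py (a made-up path), a crontab entry of the form "* * * * * /usr/bin/python3 /home/pi/pi_crash_watch.py" would run it every minute. After a crash, the last snapshot or two in the log are the ones to compare against earlier ones.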
One note:
   - I don't think your statement that "I thought this detail was
     significant as beacon and pat are the only processes that
     produce UI frames" is correct. Connectionless beacons use UI
     frames, but Pat uses connected sessions.
--David
KI6ZHD
On 01/28/2024 02:12 PM, Michael Dunn wrote:
Hi everyone,
It's been a few weeks since I've posted on this, but I've been
busy working through testing and troubleshooting on multiple
crashes. This might get a bit lengthy, so here's your mute switch
to tap out :). TL;DR: the bad news is that the Pi still
consistently crashes regardless of what I've changed, but the good
news is that I've duplicated the crash elsewhere.
To summarize previous posts, the Pi had been crashing consistently
"every few days", so I followed the utility advice of performing
software and firmware updates; the system still crashed, however.
I connected a serial console to the Pi to catch the kernel stack
trace. The Pi's kernel blamed beacon for the crash, so I
disabled beacons thinking it might be a workaround, but the crash
came anyway.
Since the last post I've introduced a 2nd Pi for testing; I'm
referring to it as TEST and the original Pi as PROD. Following
KI6ZHD's advice, I built the VE7FET replacement libax25 and tools
on TEST and installed the packages (after removing the distro
default packages) on PROD. PROD crashed a few days later, but as an
interesting note, the kernel trace blamed pat instead of beacon. I
thought this detail was significant as beacon and pat are the only
processes that produce UI frames, so my next test was to remove the
beacon configuration from pat on the PROD Pi.
At the risk of sounding like a broken record, PROD crashed
again. Next up was another KI6ZHD suggestion to bounce the
AX.25 stack after a clean reboot; however, PROD crashed several
days after the reboot and stack bounce. At this point I backed out
the VE7FET packages, trying to get back to a baseline
configuration. One interesting item that came out of this crash
was the time of day, 9:15a local; more on that later. Another
item was some self-inflicted "complexity" in the AX.25 stack on
PROD. Without going into details (would be lengthy), I removed
this complexity, but PROD crashed again at 9:15a a few days later
(today).
So, that's the bad news, but there is a bit of good news.
Sometime early on, I started working on building a second Pi
(TEST) to reproduce this crash. The thought was to reduce the
time it takes to test ideas by running multiple tests in parallel.
As it turns out, getting TEST to crash has been elusive. On the
initial build, I imaged the TEST SD card from the same Raspbian
image, manually built the AX.25 stack and configured a dummy sound
device. This build ran for 2 weeks, beaconing every 60 seconds,
without a crash.
On the second iteration, I built TEST by dd'ing
PROD's SD card to TEST's SD card. The thought was that I must
have missed some critical configuration, thus a block-for-block
copy would solve that. On booting TEST, I changed a minimum
number of items; just what was required for it to operate. I
changed the hostname, configured snd-dummy and swapped out the
audio device in Direwolf; I commented out Pat's scheduler so it
wouldn't steal my email, and set beacons to 60s. I let this build
of TEST run for about 9 days before I decided that it had failed
to produce a crash.
Since TEST was using snd-dummy and was essentially an echo chamber,
the thought occurred to me that the crash might need a chain of
events that requires the kernel to hear traffic and enter some
state prior to the crash. I ordered an additional CM108 card for
TEST and proceeded to wire a tap into my radio connection, routing
RX audio and ground to the CM108. This allowed TEST to hear but
not speak, and a crash followed about 6-7 days later.
I mentioned the significance of 9:15a earlier; 9:15a is the time
that pat is scheduled to check my mail on PROD. I have witnessed
at least 3 crashes around 9:15a, which should be a big, red flag.
However, I say "around" because I've noted crashes that happen
moments before my mail is checked and even crashes that happen in
the minutes after my mail is retrieved. Recall that I disabled the
pat schedule on TEST; that crash occurred sometime later in the day
(11a?), but the kernel still blamed pat. I think pat is just the
thing that happens to be generating traffic, and the schedule
simply coalesces the time of day that the crash happens.
So fresh out of good ideas, yet flush with bad ones, I'm at a bit
of an impasse. Does anyone have any ideas for me?
One bad idea I have is to shut off pat entirely; however, the
lesson learned from shutting off beacon is that the crash
will just stem from some other process using the AX.25 stack.
That would just prove the producer of the crash is mobile, but
not help me to work around it.
Another bad idea would be to recompile the kernel on TEST to
produce the core and debug symbols needed to trace the crash.
That seems like a fair bit of work and I'm not really confident
the output would be useful.
Any thoughts would be greatly appreciated. In lieu of any good
ideas, I plan to proceed with the bad ones mentioned above. :)
Thanks
Mike