Hello Mike,
Again, thank you for the detailed email; I think this all helps
in tracking down the real issue here. I've been discussing this on
the side with Bernard F6BVP, who maintains FPAC (node) and FBB (BBS)
and uses the ROSE protocol heavily. He reported that he's "running
three ROSE/FPAC nodes on a local network and I haven't observed any
connections issues with Raspbian OS 64bit for a long time nor with
Ubuntu (20.04)". He showed months of uptime with LOTS of
connections and neither panics nor orphaned AX.25
connections. One key point he mentioned is that he does NOT have
any RF connections; it's all via AXUDP. He also noted he's NOT
using mkiss for linking the AXUDP to the kernel with kissattach.
--David
On 02/10/2024 12:27 PM, Michael Dunn wrote:
Hi everyone,
I've been quiet on this topic recently as I haven't had much to report.
I've had no crashes for over 12 days now, which seems
to be related to disabling Pat in my environment. Please don't
jump to conclusions here; this is a complex issue. As I've said
in the past, Pat doesn't appear to be the cause of the crash, just
the process that trips over the kernel garbage to trigger it.
Since I've had some time to think about this problem, here's what
I've noticed:
 - Jon and I both run rmsgw. We both have crashes on the system
   running rmsgw.
 - The process that triggers the crash varies (beacon, pat,
   netstat), but is never rmsgw.
 - Jon and I both have few outside connections to rmsgw. My last
   outside connection was 10 days ago.
 - Jon has crashes in just a few hours; my crashes take days to
   weeks.
 - Jon frequently self-checks his mail (possibly hourly? but I
   don't think he stated). I self-check my mail infrequently
   (daily).
Based on these facts, my theory is this: rmsgw puts the kernel in
some sort of bad state. This state is tripped over later by some
unsuspecting process, causing the kernel crash.
To prove this out, I need to separate my RMS server from my RMS
client. Unfortunately, I have only one radio, so I decided to
take this test off-air. In my test setup, I built two Pis; let's
call them RMS and PAT to distinguish their roles. The Pis were
built by imaging the SD card from PROD in the manner I've
described previously.
Instead of connecting to a radio, I simply connected sound card to
sound card using a pair of TRS cables (headphone to microphone in
both directions). Direwolf was configured to match and PTT was
disabled. RMS ran rmsgw from ax25d, and a shell loop on PAT
checked my mail every 30 minutes using pat -s (send only, so as not
to eat my actual inbox).
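
For anyone wanting to repeat the test, the periodic mail check can be sketched roughly as follows. This is Python rather than the shell loop actually used, and the `poll` helper and its parameters are illustrative names, not part of Pat. The exact Pat command line isn't given here, so the caller supplies it.

```python
import subprocess
import time


def poll(command, interval_s, iterations, run=subprocess.run, sleep=time.sleep):
    """Run `command` every `interval_s` seconds, `iterations` times.

    Returns the list of exit codes so a run of failed connections
    is easy to spot afterwards. `run` and `sleep` can be swapped out
    for testing.
    """
    exit_codes = []
    for i in range(iterations):
        result = run(command, capture_output=True, text=True)
        exit_codes.append(result.returncode)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} attempt {i + 1}: exit code {result.returncode}")
        # No need to wait after the final attempt.
        if i + 1 < iterations:
            sleep(interval_s)
    return exit_codes
```

Called as `poll(cmd, 30 * 60, 48)`, with `cmd` set to whatever Pat invocation you use, this covers roughly a day of checks.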
I let this setup run overnight. In the morning, neither Pi had
crashed, but I did notice that PAT was no longer able to connect
to RMS. Tracking through the logs, it looks like about 7 hours
after I set up the test, connections started failing. Digging into
the RMS Pi, I found the netstat condition that Jon first reported.
Note ax0 missing from the Device column:
Dest       Source     Device  State        Vr/Vs    Send-Q  Recv-Q
*          MYCALL-10          LISTENING    000/000  0       0

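A check for this condition could be scripted so the bad state is caught as soon as it appears rather than the next morning. The sketch below assumes the seven-column netstat layout shown above with whitespace-separated fields. The function name `sockets_missing_device` is made up for illustration, and the rule of treating a six-field row as a missing device is a heuristic, not something netstat itself reports.

```python
def sockets_missing_device(netstat_text):
    """Return the Source callsigns of AX.25 sockets with an empty Device column.

    Assumes rows of the form:
        Dest Source Device State Vr/Vs Send-Q Recv-Q
    """
    missing = []
    for line in netstat_text.splitlines():
        fields = line.split()
        # A healthy row has 7 fields. With the device blank there are
        # only 6: the State (e.g. LISTENING) sits where the Device
        # should be and the Vr/Vs pair follows it.
        if len(fields) == 6 and fields[2].isupper() and "/" in fields[3]:
            missing.append(fields[1])
    return missing
```

Feeding it the saved output of the AX.25 socket listing on each polling cycle and logging any non-empty result would give a timestamp for when the kernel state went bad.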
The kernel on RMS had been trashed, but the Pi was still
operating. Checking the PAT Pi, netstat output looked normal.
Realizing I still needed an AX25 event to trigger the crash, I
used axcall on RMS to generate some traffic. The RMS Pi
immediately crashed, blaming axcall as it went down:
[61160.353159] CPU: 1 PID: 130380 Comm: axcall Tainted: G        WC         6.1.0-rpi7-rpi-vB #1 Debian 1:6.1.63-1+rpt1

For me, this is great news. I have an off-air way to quickly show
the problem. This also continues to show that the crash moves
between processes and demonstrates an unrelated trigger event.
Next steps are to reproduce the crash to ensure it is reliable.
I'm also going to move RMS and PAT out of the RF environment
(e.g. the other end of the house) to ensure there is no RFI
element.
Cheers
Mike