Hi David,

Sorry, I didn't see your response until just now. While the bug isn't specific to ax25d, it does relate to freeing kernel resources and cleaning up after a received (inbound) connection has terminated. If you were to have a packet service that does all of its own socket code, then it would likely still be affected. However, since packet apps like rmsgw and node leave the socket code to ax25d, in practical terms, the bug only presents itself when running ax25d.

Cheers
Mike

#15839

Hey Mike,

It's good to hear that some of the issues are being worked out, though most Raspberry Pi users will still have to wait since Raspberry Pi OS is still on 6.6.47. Curious: would you only see this bug when using ax25d, or would you see it in other situations?

Btw, for reporting various AX.25 kernel-related bugs, consider joining linux-hams@...

--David
KI6ZHD

On 10/21/2024 08:48 PM, Michael Dunn wrote:

#15838

On Wed, Feb 28, 2024 at 07:01 PM, Michael Dunn wrote:

What this tells me is that I don't need to waste any time troubleshooting rmsgw. The problem lies between the kernel and ax25d's use of the kernel API.

Hey all. Sorry to bump an old thread, but I may have made some progress on this.

For a refresher, the issue was a persistent kernel crash on a Raspberry Pi running Bookworm and RMS Gateway. At least one other individual and I were having this issue. In my case the kernel would blame whatever AX25 app was performing an operation at the time, but I traced the source of the fault back to the handoff between ax25d and rmsgw.

I had given up on troubleshooting this problem, instead just resetting my Pi once or twice a week when it would crash. However, when I was looking for known AX25 issues this week, I bumped into this CVE from July:

The description neatly fit my panic, so I dusted off the test environment I built (2 RPis with cross-connected sound cards) and set about reproducing the crash. In the test environment, I'm able to consistently get crashes every 1 - 2 hours, so I was quickly able to validate the problem was still present. According to the CVE, custom kernel versions less than 6.6.35 are impacted, so I upgraded the test environment kernel image to 6.6.51.
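For anyone wanting to check a station against the version mentioned above, here is a minimal sketch. It assumes the 6.6.35 threshold exactly as quoted in this message (it is not taken from the CVE text itself), it does a naive numeric comparison that ignores how fixes are backported to other stable branches, and the names `FIXED_IN`, `parse_release` and `predates_fix` are made up for illustration.

```python
import re

# Threshold as quoted in the message above; an assumption, not an
# authoritative statement of the CVE's affected range.
FIXED_IN = (6, 6, 35)


def parse_release(release):
    """Extract (major, minor, patch) from a string such as '6.6.51+rpt-rpi-v8'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch = match.groups()
    # A missing patch level is treated as 0.
    return int(major), int(minor), int(patch or 0)


def predates_fix(release, fixed_in=FIXED_IN):
    """Return True if the release sorts numerically below the threshold."""
    return parse_release(release) < fixed_in


if __name__ == "__main__":
    import platform

    release = platform.release()
    print(release, "predates 6.6.35:", predates_fix(release))
```

One caveat: on Debian-based images the release string may not carry the real patch level. The Oops header quoted elsewhere in this thread shows a release of 6.1.0-rpi7 alongside a package version of 1:6.1.63-1+rpt1, so the package version is the safer thing to feed in.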

After upgrading the test environment, I've been running for a full day without a crash. I'll probably give it another day before upgrading my actual packet station, but thought I would share in case other folks have the same issue.

Cheers
Mike

#15645

I will try this scenario on my Pi with bookworm, and replace the /usr/local/bin/rmsgw with a shell script that writes to a log file.

I had an interesting result, basically trying this same test. I modified ax25d.conf and replaced rmsgw with /usr/games/fortune on my test Pi pair. The idea was to use a non-AX.25 app to produce some output and see if we can isolate the problem.

After making the change on the RMS Pi, the PAT Pi connected as it normally does, receiving a fortune cookie instead of a WL2K banner. Obviously the connection from PAT failed, but I let it run for about 2 days before I noticed the "netstat issue" again. I killed ax25d and immediately received a kernel Oops but, interestingly, not an immediate crash. The crash happened when I restarted ax25d:

root@hammyRMS:~# grep -v "^#" /etc/ax25/ax25d.conf
[MYCALL-4 via dw12]
NOCALL * * * * * * L
default  * * * * * *  0    root  /usr/games/fortune fortune
root@hammyRMS:~# netstat -an
....
Active AX.25 sockets
Dest       Source     Device  State        Vr/Vs    Send-Q  Recv-Q
*          MYCALL-4           LISTENING    000/000  0       0
root@hammyRMS:~#
root@hammyRMS:~# ps -ef | grep ax25
root         968       1  0 Feb25 ?        00:00:00 /usr/sbin/ax25d -l
root      479232     792  0 09:08 pts/3    00:00:00 grep ax25
root@hammyRMS:~# kill 968
root@hammyRMS:~#
Message from syslogd@hammyRMS at Feb 28 09:09:05 ...
 kernel:[227603.958206] Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP

Message from syslogd@hammyRMS at Feb 28 09:09:05 ...
 kernel:[227603.958206] Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP

Message from syslogd@hammyRMS at Feb 28 09:09:05 ...
 kernel:[227604.256120] Code: f9426400 d538d082 12800003 8b020000 (885f7c05)

root@hammyRMS:~#
root@hammyRMS:~# uptime
 09:09:18 up 2 days, 15:13,  9 users,  load average: 0.09, 0.17, 0.20
root@hammyRMS:~#
root@hammyRMS:~# ax25d -l
root@hammyRMS:~# client_loop: send disconnect: Broken pipe

What this tells me is that I don't need to waste any time troubleshooting rmsgw. The problem lies between the kernel and ax25d's use of the kernel API. I think I've taken this about as far as I can without digging into the code. I think the next steps would be to see if I can add some verbose logging to ax25d and try to understand what has changed in the kernel's AX.25 stack between my version and a working version.

Cheers
Mike

#15644

Did you find anything in the OS system logs related to this issue? Maybe a kernel oops, etc?

Hi David,

No entries in the log files on disk; syslog simply can't get them written with the kernel panicked.

However, I do get an Oops with every crash. Certainly on the serial console, but I also often get a wall message from syslog:

Message from syslogd@hammy at Feb 25 13:03:30 ...
 kernel:[2412078.321381] Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP
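The hex number in that Oops line can be unpacked a little further. The sketch below assumes a 64-bit ARM kernel, where the value printed after "Oops:" is the exception syndrome register (ESR) contents; the field positions are from the Arm architecture, while the lookup tables are deliberately incomplete and the function name `decode_oops` is only illustrative.

```python
import re

# A few ESR exception classes (bits 31:26). Not a complete table.
EXCEPTION_CLASSES = {
    0x21: "instruction abort, same exception level",
    0x25: "data abort, same exception level",
}

# A few fault status codes (bits 5:0). Not a complete table.
FAULT_STATUS = {
    0x04: "translation fault, level 0",
    0x05: "translation fault, level 1",
    0x06: "translation fault, level 2",
    0x07: "translation fault, level 3",
}


def decode_oops(line):
    """Pull the hex code out of an 'Oops:' line and split it into fields."""
    match = re.search(r"Oops: ([0-9a-fA-F]{8,16})", line)
    if match is None:
        return None
    esr = int(match.group(1), 16)
    ec = (esr >> 26) & 0x3F   # exception class
    fsc = esr & 0x3F          # fault status code
    return {
        "esr": esr,
        "ec": ec,
        "class": EXCEPTION_CLASSES.get(ec, "unknown"),
        "fsc": fsc,
        "fault": FAULT_STATUS.get(fsc, "unknown"),
    }


if __name__ == "__main__":
    sample = ("kernel:[2412078.321381] Internal error: "
              "Oops: 0000000096000005 [#1] PREEMPT SMP")
    print(decode_oops(sample))
```

Under those assumptions, 0x96000005 reads as a data abort taken inside the kernel with a level 1 translation fault, in other words the kernel touched an address with no valid mapping, which would be consistent with following a stale or NULL pointer.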

Cheers
Mike

#15643

I'll try the test the way you describe it (run ax25d, comment out rmsgw), but I don't think I'll get the correct kernel state without rmsgw. Set up this way, there is no "LISTEN" entry, so the kernel likely doesn't have the correct kernel structures loaded.

So, I tried testing this way, but I messed up the test; I haven't fixed persistent device names for my USB audio, so the AX.25 stack was down for a couple of days before I noticed.

Instead of starting over, I took a peek at my PROD Pi, which has been healthy for over 3 weeks since turning off my mail check schedule. With 27 days of uptime, PROD finally encountered the "(null)" column in netstat output. PROD had been running rmsgw from ax25d during those 27 days and I've had about 6 connections to the gateway in that time. I decided to stop and restart ax25d. The stop cleared out the netstat entries, as you would expect when the service isn't listening any more. However, when I tried to start ax25d again, the system immediately crashed. I think that proves the bad state is held in the kernel and is not resolved by releasing the listening socket.

Cheers
Mike

#15642

Hello Michael,

Did you find anything in the OS system logs related to this issue? Maybe a kernel oops, etc?

--David
KI6ZHD

On 02/20/2024 08:24 PM, Michael Dunn wrote:

#15640

Something to try (in all your free time): let the system run nicely without rmsgw, then turn it on for a bit and if you catch the netstat output with an empty or null value, stop/start ax25d. Does it become stable again?

Hi Jon,

I think you make a very good point here; ax25d is a wrapper to rmsgw, much like inetd was a wrapper to telnet way back in the day. If I recall, the wrapper handles binding to the interface and listening for connections. When a connection happens, it forks and execs the process (e.g. rmsgw, telnet), leaving the child with a set of open file handles to manipulate the connection. Put another way, both ax25d and rmsgw interact with the kernel's ax25 stack, so either could be responsible.
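The wrapper pattern described above can be sketched in a few lines. This is not ax25d's code: a `socketpair` stands in for an accepted AX.25 connection, a one-line Python child stands in for rmsgw, and `hand_off` is a hypothetical helper name. It only illustrates how the child ends up holding the connection's file descriptors after the parent lets go of them (Unix-like systems only).

```python
import socket
import subprocess
import sys


def hand_off(conn, argv):
    """Start argv with the connected socket as its stdin and stdout.

    This mirrors what an inetd-style wrapper does once a connection
    has been accepted: the child inherits the descriptors and the
    wrapper closes its own copy.
    """
    child = subprocess.Popen(argv, stdin=conn.fileno(), stdout=conn.fileno())
    conn.close()  # parent drops its copy; the child still holds the connection
    return child


if __name__ == "__main__":
    # One end plays the accepted connection, the other plays the remote station.
    accepted, remote = socket.socketpair()
    child = hand_off(accepted, [sys.executable, "-c", "print('fortune cookie')"])
    child.wait()
    print(remote.recv(1024).decode().strip())
```

The point of interest for this thread is the moment after the child exits: the last descriptor for the connection closes and the kernel has to tear the socket down, which is where both the wrapper and the child have had a hand in the kernel's state.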

I'll try the test the way you describe it (run ax25d, comment out rmsgw), but I don't think I'll get the correct kernel state without rmsgw. Set up this way, there is no "LISTEN" entry, so the kernel likely doesn't have the correct kernel structures loaded. Assuming nothing happens after a day, I'll modify the test, letting rmsgw run until the kernel state is triggered and try restarting ax25d to see what happens to netstat.

Cheers
Mike

#15639

Mike, excellent observation.

I will try this scenario on my Pi with bookworm, and replace the /usr/local/bin/rmsgw with a shell script that writes to a log file.
Disconnect the VHF radio, disable the rmsgw_aci, and only have the kissattach, ax25d, and beacon processes running.

And regularly check netstat for weird device entries. This would help verify your theory that rmsgw running on current kernels is a main problem.

Since moving down to buster, it's been crash-free for me. Two weeks plus.

Something to try (in all your free time): let the system run nicely without rmsgw, then turn it on for a bit and if you catch the netstat output with an empty or null value, stop/start ax25d. Does it become stable again?

#15637

I'm going to be away from the console of these test Pis for a few days, so it's a bit pointless to test a crash that only takes a few hours to happen. Instead, I'm going to remove rmsgw from ax25d and let the test cycle run while I'm away.

Just a quick update on this: I checked on the Pis I had left in this state (with rmsgw disabled) and found that both were healthy after 8 days. To close the loop on this and demonstrate that rmsgw was causing the crash, I re-enabled rmsgw in ax25d.conf and restarted the daemon. There were no other changes and PAT was able to connect to check mail in the next connection cycle. About 12 hours later, I checked and found RMS had a faulted netstat output:

Active AX.25 sockets
Dest       Source     Device  State        Vr/Vs    Send-Q  Recv-Q
*          MYCALL-10  (null)  LISTENING    000/000  0       0

I found it interesting that netstat printed "(null)" this time, instead of just an empty column. Not relevant, just interesting. After finding netstat like this, a quick axcall command crashed the Pi. I think this confirms that rmsgw is the culprit here.
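Since the empty or "(null)" Device column has shown up before every crash reported in this thread, it may be worth checking for it automatically. The sketch below assumes the column layout shown in the netstat output above (seven whitespace-separated fields on a healthy row, six when the device is blank); `find_suspect_sockets` is a hypothetical helper, and how netstat gets run and how often is left out.

```python
def find_suspect_sockets(netstat_text):
    """Return (source, device) pairs for AX.25 rows with no usable device.

    device is None when the column is empty and "(null)" when netstat
    printed that literal string.
    """
    suspects = []
    in_ax25_table = False
    for line in netstat_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Active "):
            # Only rows under the AX.25 heading are of interest.
            in_ax25_table = stripped.startswith("Active AX.25 sockets")
            continue
        fields = stripped.split()
        if not in_ax25_table or not fields or fields[0] == "Dest":
            continue
        if len(fields) == 7:
            device = fields[2]
        elif len(fields) == 6:
            device = None  # Device column was blank
        else:
            continue  # unexpected layout; ignore rather than guess
        if device is None or device == "(null)":
            suspects.append((fields[1], device))
    return suspects


if __name__ == "__main__":
    sample = """Active AX.25 sockets
Dest       Source     Device  State        Vr/Vs    Send-Q  Recv-Q
*          MYCALL-10  (null)  LISTENING    000/000  0       0
"""
    print(find_suspect_sockets(sample))
```

Given what happened with axcall and beacon once the table looked like this, any alerting built on top of it should avoid touching the AX.25 stack itself.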

Still, this might not be specific to rmsgw. I think I'm seeing this crop up with rmsgw because it is the only process that is actively accepting AX.25 connections on my Pi. I might see if I can use node to generate some traffic on the test box.

Cheers
Mike

#15626

I imagine you're getting confused by Pat's version numbering scheme for its infrastructure modules (wl2k-go), which include all kinds of stuff. Its 0.11.8 version is just one of many versions that look similar to what the official AX.25 repo as well as the VE7FET repo use:

Linux's current AX.25 woes are not a problem in these user-space libraries and utilities. The issues are in the kernel itself.

--David
KI6ZHD

On 02/12/2024 11:29 PM, JJ wrote:

#15625

Hmmm..this shows ax25 version 0.11.8

newer? changes?

Just sorta stumbled upon this..haven't tried anything tho...

On 2024-02-12 12:23 a.m., David Ranch wrote:

#15624

Hello Mike,

Again, thank you for the detailed email; I think this all helps in tracking down the real issue here. I've been discussing this on the side with Bernard F6BVP, who maintains FPAC (node) and FBB (BBS) and uses the ROSE protocol heavily. He reported that he's "running three ROSE/FPAC nodes on a local network and I haven't observed any connections issues with Raspbian OS 64bit for a long time nor with Ubuntu (20.04)". He showed months of uptime with LOTS of connections without any panics or orphaned AX.25 connections. One key point he mentioned is that he does NOT have any RF connections; it's all via AXUDP, and he also noted he's NOT using mkiss for linking the AXUDP to the kernel with kissattach.

--David

On 02/10/2024 12:27 PM, Michael Dunn wrote:

#15623

I think using badly outdated software is the answer here. I never thought I'd make that recommendation for stuff.

No doubt, it definitely feels wrong. But I think you have a different objective than I do, so it sounds like the best thing to do for your use case.

Just a quick update: I did more testing and was able to crash the setup between RMS and PAT twice yesterday, so this is definitely repeatable. In a crash cycle, once I can demonstrate "kernel weirdness" with netstat output, it usually takes two AX.25 commands to crash the RMS Pi. In one instance I ran axcall twice, in another instance I ran beacon twice.

I'm going to be away from the console of these test Pis for a few days, so it's a bit pointless to test a crash that only takes a few hours to happen. Instead, I'm going to remove rmsgw from ax25d and let the test cycle run while I'm away. PAT will try to connect to RMS, but fail; that's ok. If I don't see any "kernel weirdness" when I return, then it's safe to conclude that rmsgw is a required element of the crash.

Cheers
Mike

#15622

After moving my Pi3 down to Buster, kernel 4.19.50-v7+, running rmsgw has been extremely stable. Beacon does its thing, the ax0 interface is stable in the netstat command, the ACI script doesn't complain about radios being up or down, and I took the Pi+Baofeng gateway to our monthly club meeting and turned our members loose on the device for testing. Several of us finished making Nino TNC boards, so we had a lot of clients connecting.
The Pi+radio, connected to the meeting hall's wifi, held up like a champ.

I think using badly outdated software is the answer here. I never thought I'd make that recommendation for stuff.

#15621

Hi everyone,

Been quiet on this topic recently as I haven't had much to report. I've had no crashes for over 12 days now, which seems to be related to disabling Pat in my environment. Please don't jump to conclusions here; this is a complex issue. As I've said in the past, Pat doesn't appear to be the cause of the crash, just the process that trips over the kernel garbage to trigger it.

Since I've had some time to think about this problem, here's what I've noticed:

  - Jon and I both run rmsgw. We both have crashes on the system running rmsgw.
  - The process that triggers the crash is mobile (beacon, pat, netstat), but is never rmsgw.
  - Jon and I both have few outside connections to rmsgw. My last outside connection was 10 days ago.
  - Jon has crashes in just a few hours; my crashes take days to weeks.
  - Jon frequently self checks his mail (possibly hourly? but I don't think he stated). I self check my mail infrequently (daily).

Based on these facts, my theory is this: rmsgw puts the kernel in some sort of bad state. This state is tripped over later by some unsuspecting process, causing the kernel crash.

To prove this out, I need to separate my rms server from my rms client. Unfortunately, I have only one radio, so I decided to take this test off-air. In my test setup, I built 2 Pis; let's call them RMS and PAT to distinguish their roles. The Pis were built by imaging the SD card from PROD in the manner I've described previously.

Instead of connecting to a radio, I simply connected sound card to sound card using a pair of TRS cables (headphone to microphone in both directions). Direwolf was configured to match and PTT was disabled. RMS ran rmsgw from ax25d and a shell loop on PAT checked my mail every 30 minutes using pat -s (send only, so as not to eat my actual inbox).

I let this setup run overnight. In the morning, neither Pi had crashed, but I did notice that PAT was no longer able to connect to RMS. Tracking through the logs, it looks like connections started failing about 7 hours after I set up the test. Digging into the RMS Pi, I found the netstat condition that Jon first reported. Note ax0 missing from the device column:

Dest       Source     Device  State        Vr/Vs    Send-Q  Recv-Q
*          MYCALL-10          LISTENING    000/000  0       0

The kernel on RMS had been trashed, but the Pi was still operating. Checking the PAT Pi, netstat output looked normal. Realizing I still needed an AX25 event to trigger the crash, I used axcall on RMS to generate some traffic. The RMS Pi immediately crashed, blaming axcall as it went down:

[61160.353159] CPU: 1 PID: 130380 Comm: axcall Tainted: G       WC      6.1.0-rpi7-rpi-vB #1 Debian 1:6.1.63-1+rpt1

For me, this is great news. I have an off-air way to quickly show the problem. This also continues to show that the crash is mobile between processes and demonstrates an unrelated trigger event. Next steps are to reproduce the crash to ensure it is reliable. I'm also going to move RMS and PAT out of the RF environment (e.g. the other end of the house) to ensure there is no RFI element.

Cheers
Mike

#15610

If you read the linux-ham@... archives (
 ), you can see how
various patches have been submitted and accepted in w/o any testing or
validation.  These patches have been "fixing" potential race conditions
found by automated tools, security issues, and also changing out various
legacy kernel data structures for new ones.  What that has created is a
moving target where we cannot just backout the changes as the AX.25 and
surrounding kernel code has moved forward and is now incompatible.  <sigh>

--David
KI6ZHD

I figured that would be the case if it were indeed not possible.  Thank
you for the clarification.

-73 de Chris KQ6UP

#15609

Hey Chris,

Well if the cause is/was bad patches, I am wondering if a frankenstein
kernel would work with the old AX.25 code "grafted in" to the new
kernel.  I am wondering what kind of mess that would make.  Not a kernel
programmer, so I am not sure what all would be involved.  I think
compiling an old slackware distro for a R-Pi 3 would be interesting too.

If you read the linux-ham@... archives ( ), you can see how various patches have been submitted and accepted w/o any testing or validation. These patches have been "fixing" potential race conditions found by automated tools, security issues, and also changing out various legacy kernel data structures for new ones. What that has created is a moving target where we cannot just back out the changes, as the AX.25 and surrounding kernel code has moved forward and is now incompatible. <sigh>

--David
KI6ZHD

#15608

On Wed, Feb 07, 2024 at 08:33:04AM -0800, David Ranch wrote:

Hey Jon,

I saw both of your emails and I think this is great work, and your observation
that the AX.25 device (like "ax0") is missing from the output of "netstat -A ax25
-an" is an interesting one.  I'll look for that as well.

Beyond that, if you're looking for a solid AX.25 stack for your Raspberry Pi
1 or Raspberry Pi 2 hardware, I'd have to recommend using a VERY old release
like Jessie with the 4.14 kernel.  Making that recommendation really pains
me as it's EOL, known insecure, etc. but bad patches started getting
integrated into the Linux kernel around the 4.17 time frame.

You can see some of this hardware compatibility vs. OS version detail
captured here:

   

--David
KI6ZHD




Well, if the cause is/was bad patches, I am wondering if a Frankenstein
kernel would work with the old AX.25 code "grafted in" to the new
kernel.  I am wondering what kind of mess that would make.  I'm not a kernel
programmer, so I am not sure what all would be involved.  I think
compiling an old Slackware distro for an R-Pi 3 would be interesting too.

73 de Chris KQ6UP

#15606

Hey Jon,

I saw both of your emails and I think this is great work, and your observation that the AX.25 device (like "ax0") is missing from the output of "netstat -A ax25 -an" is an interesting one. I'll look for that as well.

Beyond that, if you're looking for a solid AX.25 stack for your Raspberry Pi 1 or Raspberry Pi 2 hardware, I'd have to recommend using a VERY old release like Jessie with the 4.14 kernel. Making that recommendation really pains me as it's EOL, known insecure, etc. but bad patches started getting integrated into the Linux kernel around the 4.17 time frame.

You can see some of this hardware compatibility vs. OS version detail captured here:

--David
KI6ZHD

On 01/31/2024 09:43 PM, Jon Bousselot KK6VLO wrote:

Mike and David, all great input, and I hadn't considered 32-bit vs. 64-bit as a factor in application stability. I chose 32-bit because anyone in our club with a Pi-1 or Pi-2 will be choosing the 32-bit release, and rmsgw is such a lightweight app and compiles from source... I started with the most compatible release.
The Pi-1 is a pilot project for our club, and the gateway and client radios are all within my house. I've told others to finish soldering their TNC kit and connect, but so far I'm the only one. I will attempt to monitor the number of ax25 packets that travel through the host. I am usually checking winmail twice a day, and don't always have a lot of mail to get/send. The quantity of packets on that interface is probably small, even for 1200 baud.

The Pi-1 crashed on its own today without the beacon app, after 1d 3h 47m uptime. Here is what it cried out when going down.
[100075.375841] CPU: 0 PID: 3001 Comm: netstat Tainted: G         C         6.1.0-rpi7-rpi-v6 #1  Raspbian 1:6.1.63-1+rpt1
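
Since the crash keeps getting blamed on a different process each time, it can help to pull the process name and the time since boot out of each panic header mechanically. Here is a minimal Python sketch, assuming the header always has the `[seconds.micros] CPU: n PID: n Comm: name` shape seen in the lines quoted in this thread, and assuming the bracketed timestamp is seconds since boot. The function name `parse_panic_line` is made up for illustration.

```python
import re

# Header line of a kernel oops/panic, in the shape quoted in this thread.
PANIC_RE = re.compile(
    r"\[\s*(?P<secs>\d+)\.\d+\]\s+CPU:\s*(?P<cpu>\d+)"
    r"\s+PID:\s*(?P<pid>\d+)\s+Comm:\s*(?P<comm>\S+)"
)


def parse_panic_line(line):
    """Return the blamed process and time since boot, or None if no match."""
    m = PANIC_RE.search(line)
    if m is None:
        return None
    secs = int(m.group("secs"))
    days, rem = divmod(secs, 86400)    # whole days since boot
    hours, rem = divmod(rem, 3600)     # whole hours left over
    minutes = rem // 60                # whole minutes left over
    return {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "cpu": int(m.group("cpu")),
        "uptime": f"{days}d {hours}h {minutes}m",
    }


sample = ("[100075.375841] CPU: 0 PID: 3001 Comm: netstat Tainted: G C "
          "6.1.0-rpi7-rpi-v6 #1 Raspbian 1:6.1.63-1+rpt1")
info = parse_panic_line(sample)
print(info["comm"], info["uptime"])   # netstat 1d 3h 47m
```

Run over a saved serial console log, this would give a quick table of which command was blamed for each crash and how long the box had been up.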

Netstat, while not part of the ax25 tools, does query the ax25 interface statistics. You might be onto something with the ax25 packet count theory. I wasn't home when it crashed, so I did not run netstat myself.
However, rmsgw_aci is set to run from cron, and it does call the netstat command several times, once specifically for the ax25 protocol. See for yourself with the strace command below; the -o option names the output log file.
strace -ormsgw_aci.strace-log -fvtTq -s1024 /usr/local/bin/rmsgw_aci

I'm going to downgrade the Pi-1 to Bullseye, patch it thoroughly, and drop this config on it again, this time with beacon running every 35 minutes.

I'll re-image my Pi-3 with a 64-bit Bookworm release and get it queued up for a second round of tests. The two Pis will swap roles for monitoring the serial console, to catch the panic strings when the other goes crazy.

We're collecting quite a few variables to test here. This could take a while if crashes only happen every few days.
32 or 64 bit?
Bullseye or Bookworm?
How many ax25 packets until the system becomes unstable?
Beacon or no-beacon?
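
To see how quickly those variables multiply, here is a rough Python sketch that enumerates the combinations. It only covers the three yes/no style choices from the list above; the packet count is left out because it is something to measure rather than a setting. The names `variables` and `build_matrix` are made up for illustration.

```python
from itertools import product

# Settings taken from the list above; packet count is omitted because it
# is a measurement, not a configuration choice.
variables = {
    "arch": ["32-bit", "64-bit"],
    "release": ["Bullseye", "Bookworm"],
    "beacon": ["beacon", "no-beacon"],
}


def build_matrix(variables):
    """Return one dict per combination of the given settings."""
    names = list(variables)
    return [dict(zip(names, combo)) for combo in product(*variables.values())]


matrix = build_matrix(variables)
for number, cfg in enumerate(matrix, 1):
    print(number, cfg["arch"], cfg["release"], cfg["beacon"])
```

That is eight configurations, so at one crash every few days per configuration, a full pass on two Pis would take weeks.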

RPi Kernel Panic on Bookworm