RPi Kernel Panic on Bookworm
Hi everyone,
I was wondering if anyone has run into kernel panic issues on Bookworm? About a month ago, I decided to finally upgrade my fairly reliable RPi3 on Stretch to an 8G RPi 4 running Bookworm. I used the RPi3 mostly for packet, an RMS station, and Pat client. When I say upgrade, I mean I built the new Pi from scratch, merged prior configs into new configs, etc. After moving everything over, I started noticing crashes every 4 - 5 days. It took me a few crashes to figure things out. The usual suspects were ruled out (power, temperature, load, RAM, apt-get upgrade); I wasn't receiving any log messages indicating any type of issue. I also run all of my Pis headless, so as it turns out I had to order one of those Micro HDMI cables to finally see the panic, relevant section below:
Curious to see if anyone is having issues like this. My uneducated read of this is that /usr/sbin/beacon did something bad with a pointer. I was able to verify that PID 5851 was the "beacon" process, and this is the second panic I caught which names beacon as the culprit. My short-term workaround is to have Direwolf beacon for me and prevent /usr/sbin/beacon from running. It will take a week or two to know whether that workaround is helpful. I usually prefer to use the native AX.25 tools whenever I can, but that toolset doesn't seem to be aging well.

Thanks
Mike
There has been a newer build or two of Bookworm since then, and the firmware has also been updated. Run
sudo apt update && sudo apt full-upgrade -y && sudo rpi-eeprom-update -d -a
to bring it all up to the latest version. I'm part of the beta testers for Bookworm, and its first release was way too premature, but it seems the powers that be ignored the beta testers and their bug reports to surprise users with the new OS and make sure the Pi 5s had an OS that would (somewhat) work.
On Dec 26, 2023, at 12:36, Michael Dunn <ml000-0013@...> wrote: Hi everyone, |
Hello Mike,

Do you have a more complete kernel panic screen capture? I see it's showing "AX.25" but there should be a full screen worth of detail. Regardless, there are a few known issues with modern Linux kernels and their AX.25 stack, though nothing I'm aware of with UI packets sent from tools like beacon. Which ax25-apps / ax25-tools are you using? The ones included from the Raspberry Pi OS repo? Maybe the VE7FET repo (recommended) or maybe the official AX.25 repo (somewhat out of date)?

If you're using the Raspberry Pi OS packages, I recommend you remove them, build up the VE7FET ones, and see if that helps. I have that all documented here:

If that doesn't help, I have received a few proposed AX.25 kernel module changes that are supposed to fix a few of these known connected-mode packet connection issues (decent description is here: ), but using them requires recompiling the AX.25 kernel module. It's not too difficult to do once you have the setup, but it has to be redone whenever a new kernel is released as part of the usual patching process.

--David KI6ZHD

On 12/26/2023 12:36 PM, Michael Dunn
wrote:
Hi everyone, |
On Tue, Dec 26, 2023 at 02:34 PM, David Ranch wrote:
> Do you have a more complete kernel panic screen capture? I see it's showing "AX.25" but there should be a full screen worth of detail.

I do have a screenshot, but groups.io keeps downscaling the image to the point where it isn't readable. Is there a better way to post a picture of sufficient resolution?

Thanks for this; I am using the built-in packages. Before I try building the VE7FET packages, I'm going to let the system run for a couple of weeks to verify it doesn't crash with /usr/sbin/beacon disabled. Once that is confirmed, I'll give this a go.

Thanks
Mike
Following up on this issue, I've had a couple more kernel panics since my last message. Unfortunately, I couldn't dedicate my monitor to troubleshooting, so I missed the first panic text output. However, since I had disabled beacon prior to the crash, I at least know it wasn't involved in this one.

I strung up a serial console and ran "apt full-upgrade" to continue troubleshooting. The second crash happened this afternoon, from which I was able to capture the console messages below. It looks like Pat made the call that triggered the panic. I think that lets me rule out the user space AX.25 tools.

I've updated the EEPROM firmware, but I'm not really hopeful that is going to make a difference. I'm trying to figure out what I should try after the next crash. It seems like replacing ax25-apps and ax25-tools with the VE7FET versions may not be helpful, since Pat is outside of this code base and it triggered the crash. I thought about replacing libax25 with VE7FET's version, but I note that Pat isn't linked against libax25:

Any ideas would be appreciated.

Thanks
Mike
Hello Michael,

> Following up on this issue, I've had a couple more kernel panics since my last message. Unfortunately, I couldn't dedicate my monitor to troubleshooting, so I missed the first panic text output. However, since I had disabled beacon prior to the crash, I at least know it wasn't involved in this one.

I don't think there are any known issues with the Linux AX.25 stack and UI (aka unconnected) packets. There *are* known issues with connected-mode sessions though.

> I strung up a serial console and ran "apt full-upgrade" to continue troubleshooting. The second crash happened this afternoon, from which I was able to capture the console messages below. It looks like Pat made the call that triggered the panic. I think that lets me rule out the user space AX.25 tools.

Not necessarily. I do suspect the issue is in the in-kernel AX.25 stack, but it can be provoked by user-space I/O coming through libax25.

That thought process is incorrect here. Pat is just a userland application and, depending on how you configured it to make AX.25 packet connections, it will either do it via the Linux in-kernel AX.25 support or via an AGW connection (offered via, say, Direwolf, G8BPQ Qtsoundmodem, etc.). By the way, the RPi 4/5 EEPROM firmware is really more for hardware initialization and boot-level stuff, not OS/software-related functions.

> I thought about replacing libax25 with VE7FET's version, but I note that Pat isn't linked against libax25:

I don't think that's accurate, as Pat is a modular Go program and the one binary might not link in stuff like this. If you look at the package dependencies, they are there:

I would recommend this sequence:

1. Reboot your Pi and start up the whole Linux AX.25 stack however you do that (script, system, etc.). Now bring DOWN the entire AX.25 stack. Now bring the stack back up again, try Pat, and try to reproduce the issue again (might take days, as you said before).

If you still eventually hit the kernel panic again, now try these steps:

2. Uninstall the old OS-provided ax25-apps, ax25-tools, and libax25 packages.
3. Download, compile, and install the VE7FET ax25-apps, ax25-tools, and libax25 packages.
4. Bring up the AX.25 stack and try to reproduce the panic again.

If you still eventually hit the kernel panic again, try doing the #1 workaround and see if it helps (it works around a known issue where ANSWERING an incoming AX.25 connected-mode session will panic the kernel on Ubuntu 20.04 hosts using the 5.15.0-xx kernel).

--David KI6ZHD
Hi David,
Thanks for your help on this. Had another crash this morning (kernel blamed Pat) and, for reference, the last change I made was the firmware update.

Since I don't have much for connected mode packet, I skipped step 1 and instead did 2, 3, and 4. I built the VE7FET tools on another identical RPi4 with Bookworm, removed the stock ax25 packages and installed the VE7FET versions on my packet station. I assume the *dbgsym* packages aren't necessary and were built by default?

> I don't think that's accurate as Pat is a modular Go program and the one binary might not link in stuff like this. If you look at the package dependencies, they are there:

I should mention that I'm using Martin's build of Pat 0.15.1 from . Unlike the Debian-maintained package, this version isn't linked against libax25, nor is libax25 a dependency for the package. Anyway, it's a moot point, as my traffic still goes through /usr/sbin/kissattach (thus libax25) even if Pat doesn't use it, so I can't eliminate anything for troubleshooting.

> I don't think there are any known issues with the Linux AX.25 stack and UI (aka unconnected) packets. There *are* known issues with connected-mode sessions though.

It's really hard to tell what traffic may be triggering this crash, so I can't really rule out either UI or connected. However, my station is very quiet and it can go weeks without answering a connected mode session. I did leave the Direwolf console open prior to the last crash so I could see the last traffic samples; nothing was addressed to my station.

It is interesting, however, that the kernel shifted the blame from 'beacon' to 'pat', both apps generating UI packets. After the next crash, I think I'll disable the beacon in Pat to see if that shifts the kernel's blame again. I have also built a separate, nearly identical RPi4 and configured it to beacon every minute into snd-dummy. The thought was to reproduce the crash on another node and hopefully speed it up. It's been running for 2 weeks and no joy.

> If you still eventually hit the kernel panic again, try doing the #1 work around and see if it helps (works around a known issue of ANSWERING an incoming AX.25 connected-mode session which will panic the kernel on Ubuntu 20.04 hosts using the 5.15.0-xx kernel.

Do you have any details on this bug? I assume this is related to your thread on the Direwolf list. If there is a known way to exercise the panic, then I could try to duplicate it on my setup. I'd welcome a way to crash my station on demand :) .

Thanks
Mike
Hello Michael,

> Thanks for your help on this. Had another crash this morning (kernel blamed Pat) and, for reference, the last change I made was the firmware update.

Interesting. If your custom-built Pat really isn't linked to AX.25 at all, yet the panic is coming from Pat, this means Pat is doing something very bad.

> Since I don't have much for connected mode packet, I skipped step 1 and instead did 2, 3, and 4. I built the VE7FET tools on another identical RPi4 with Bookworm, removed the stock ax25 packages and installed the VE7FET versions on my packet station. I assume the *dbgsym* packages aren't necessary and were built by default?

OK, by skipping #1 you made your life a bit harder, since you might be hitting a known kernel issue, but ultimately replacing the AX.25 packages with the VE7FET versions is a good thing. Correct, you don't need to install the dbgsym debugging symbol packages unless you want to improve the decoding of core dumps of programs (but not the kernel itself).

>> I don't think that's accurate as Pat is a modular Go program and the one binary might not link in stuff like this. If you look at the package dependencies, they are there:
> I should mention that I'm using Martin's build of Pat 0.15.1 from . Unlike the Debian maintained package, this version isn't linked against libax25 nor is it a dependency for the package. Anyways, it's a moot point as my traffic still goes through /usr/sbin/kissattach (thus libax25) even if Pat doesn't use it, so I can't eliminate anything for troubleshooting.

OK, sounds like a packaging miss, but regardless, it sounds like your issue is still in the Linux AX.25 stack.

>> I don't think there are any known issues with the Linux AX.25 stack and UI (aka unconnected) packets. There *are* known issues with connected-mode sessions though.
> It's really hard to tell what traffic may be triggering this crash, so I can't really rule out either UI or connected. However, my station is very quiet and it can go weeks without answering a connected mode session. I did leave the Direwolf console open prior to the last crash so I could see the last traffic samples; nothing addressed to my station.

I'm currently aware of two ways to break the Linux kernel. Doing this the most basic way:

1. I'm not 100% sure this is reproducible for all current kernel versions, but I CAN reproduce this on Ubuntu 20.04 running 5.15.0. Bring up your AX.25 stack with a "server" program that creates a listening AX.25 socket (I also bring up other things like netromd, mheardd, etc.):

   Maybe Pat can do this for accepting incoming connections, but Linpac will do this as well. In this example, I use Linpac (0.29 develop branch). When in Linpac, make an outbound connection to a remote packet node that supports making new connections. Once connected to that remote node, initiate a connection back to your system's callsign+SSID. Once connected, request some data, such as the help command with Linpac's "//h" command. At that point, your machine running Linpac will kernel panic. If you're doing all this from within an X Window-based GUI, the machine will just seize up. If you do this from a text console view, you'll see the panic output showing AX.25. This was reported a long time ago, but no official fixes have been upstreamed into newer kernels. To work around this issue, I've found that if you bring up the Linux AX.25 stack, then bring it down, and then bring it back up again, the machine will not panic when incoming connections happen.

2. "Address already in use": From the Linux machine's console (NOT SSHing into the AX.25-enabled Linux host) and using the "call" program, create an outbound packet connection to any remote device, be it a node, a BBS, a Winlink station, etc. Once the connection is up, forcefully disconnect the session with the key sequence: tilde (~) and then period (.). By the way, this is the same force-disconnect sequence as OpenSSH's disconnect. Once this happens, run the Linux command "netstat -A ax25 -an". In this view, you will see the old session pair between the remote callsign+SSID and your local callsign+SSID present but listed in the "LISTENING" state. See  for more detail. At this point, that callsign+SSID pair will never work again from this Linux machine due to a bad kernel state. The only way to clear this state is to reboot. When you try to reboot, you will see messages like the following on the console (the final negative number will differ depending on how many stale kernel sessions are present):

   "unregister_netdevice: waiting for ax0 to become free. Usage count = -2"

   The machine WILL reboot, but it can take anywhere from 2 - 5 minutes before it finally goes down, though I don't understand the variability in the time it takes.

3. There are other minor "known issues" with the Linux AX.25 stack, including some level of AX.25 state loss where received packets won't be acknowledged for a short period of time but resent packets will be recognized, some packet loss with TCP/IP over AX.25 (AX-IP) packets (seen with high-rate pings, etc. over Ethernet or IP-IP tunnels), and I think there are a few other minor ones I'm forgetting.

--David KI6ZHD
On Mon, Jan 8, 2024 at 09:37 AM, David Ranch wrote:
> Interesting.. if your custom built Pat really isn't linked to AX.25 at all yet the panic is coming from Pat.. this means Pat is doing something very bad.

Well, Pat probably isn't doing anything bad; you have to consider that /usr/sbin/beacon triggers this as well. It's unlikely that separate bugs in both of those programs trigger the same crash. It's most likely a bug in the kernel, IMHO.

> Ok.. sounds like a packaging miss but regardless, it sounds like your issue is still on the Linux AX.25 stack.

After I posted my note yesterday, I pulled down the source for Pat and compiled it on my test Pi. The recommended build process includes a step where the build scripts grab a copy of libax25, build it, and statically link it. It's not a packaging miss; Martin decided to statically include a version of the libax25 code, one that presumably has fixes he finds useful. From the non-preferred build method:

> I'm currently aware of two ways to break the Linux kernel.
> 1. I'm not 100% sure this is reproducible for all current kernel versions but I CAN reproduce this on Ubuntu 20.04 running 5.15.0. Bring up your AX.25 stack with a ...

I tested the first bug by connecting to a remote node nearby and connecting back to my -10 RMS server on the Pi. The Winlink banner came across and I issued the RMS help command, which sent about 2 KB of data total through the remote node and back to my Pi. No crash or any notable issues.

> 2. "Address already in use": From the Linux machines console (NOT SSHing into the AX.25 enabled LInux host) and using the "call" program, create an outbound ...

I tested the second bug, but I did use an ssh connection. You can simply escape the sequence by sending ~~. ; unfortunately, my node is headless and my serial console is busy waiting for a kernel panic, so I didn't have a console option. When I terminated the session, my terminal broke back to the Pi (i.e. the ssh connection was still established), but netstat did not show any hung connections. Be that as it may, I have seen this bug many times in older versions of Raspbian, some 4-5 years ago.

Thanks
Mike
Hi everyone,
It's been a few weeks since I've posted on this, but I've been busy working through testing and troubleshooting on multiple crashes. This might get a bit lengthy, so here's your mute switch to tap out :) . TL;DR: the bad news is that the Pi still consistently crashes regardless of what I've changed, but the good news is that I've duplicated the crash elsewhere.

To summarize previous posts, the Pi had been crashing consistently "every few days", so I followed the usual advice of performing software and firmware updates; the system still crashed, however. I connected a serial console to the Pi to catch the kernel stack trace. The Pi's kernel blamed beacon for the crash, so I disabled beacons thinking it might be a workaround, but the crash came anyway.

Since the last post I've introduced a 2nd Pi for testing; I'm referring to it as TEST and the original Pi as PROD. Following KI6ZHD's advice, I built the VE7FET replacement libax25 and tools on TEST and installed the packages (after removing the distro default packages) on PROD. PROD crashed a few days later but, as an interesting note, the kernel trace blamed pat instead of beacon. I thought this detail was significant as beacon and pat are the only processes that produce UI frames, so my next test was to remove the beacon configuration from pat on the PROD Pi. At the risk of sounding like a broken record, PROD crashed again.

Next up was another KI6ZHD suggestion, to bounce the AX.25 stack after a clean reboot; however, PROD crashed several days past the reboot and stack bounce. At this point I backed out the VE7FET packages, trying to get back to a baseline configuration. One interesting item that came out of this crash was the time of day, 9:15a local; more on that later. Another item was some self-inflicted "complexity" on the AX.25 stack on PROD. Without going into details (would be lengthy), I removed this complexity, but PROD crashed again at 9:15a a few days later (today).

So, that's the bad news, but there is a bit of good news. Sometime early on, I started working on building a second Pi (TEST) to reproduce this crash. The thought was to reduce the time it takes to test ideas by running multiple tests in parallel. As it turns out, getting TEST to crash has been elusive. On the initial build, I imaged the TEST SD card from the same Raspbian image, manually built the AX.25 stack and configured a dummy sound device. This build ran for 2 weeks, beaconing every 60 seconds without a crash.

On the second iteration, I built TEST by dd'ing PROD's SD card to TEST's SD card. The thought was that I must have missed some critical configuration, thus a block-for-block copy would solve that. On booting TEST, I changed a minimum number of items; just what was required for it to operate. I changed the hostname, configured snd-dummy and swapped out the audio device in Direwolf; I commented out Pat's scheduler so it wouldn't steal my email, and set beacons to 60s. I let this build of TEST run for about 9 days before I decided that it had failed to produce a crash.

Since TEST is using snd-dummy and is essentially an echo chamber, the thought occurred to me that the crash might need a chain of events that requires the kernel to hear traffic and enter some state prior to the crash. I ordered an additional CM108 card for TEST and proceeded to wire a tap into my radio connection, routing RX audio and ground to the CM108. This allowed TEST to hear but not speak, and a crash followed about 6-7 days later.

I mentioned the significance of 9:15a earlier; 9:15a is the time that pat is scheduled to check my mail on PROD. I have witnessed at least 3 crashes around 9:15a, which should be a big, red flag. However, I say "around" because I've noted crashes that happen moments before my mail is checked and even crashes that happen in the minutes after my mail is retrieved. Recall that I disabled the pat schedule on TEST; that crash occurred sometime later in the day (11a?), but the kernel still blamed pat. I think pat is just the thing that happens to be generating traffic, and the schedule simply coalesces the time of day that the crash happens.

So, fresh out of good ideas yet flush with bad ones, I'm at a bit of an impasse. Does anyone have any ideas for me?

One bad idea I have is to shut off pat entirely; however, the lesson learned from shutting off beacon is that the crash will just stem from some other process using the AX.25 stack. That would just prove the producer of the crash is mobile, but not help me to work around it.

Another bad idea would be to recompile the kernel on TEST to produce the core and debug symbols needed to trace the crash. That seems like a fair bit of work and I'm not really confident the output would be useful.

Any thoughts would be greatly appreciated. In lieu of any good ideas, I plan to proceed with the bad ones mentioned above. :)

Thanks
Mike
Hello Michael,

It's great to see your persistence here, and I think with some additional details, maybe we can narrow things down. A few questions:

- You mentioned that when you rebuilt the TEST machine to use a real sound device instead of a sound loopback, it started to crash. To me, this could be RFI induced. How physically close is the radio to your Raspberry Pis? How much RF power is the radio transmitting at?
- What is the interval on your beacon message(s)? Are you only sending one beacon or multiple? Are you using a digi path on your beacons?
- When you use Pat, are you making an outbound AX.25 connection to a remote system or is some remote system making an AX.25 connection to you?
  - How often do you make this connection attempt?
- It might be good to run a script via cron every minute on your Pis under test that:
  - Writes all of the following to a file on permanent storage, say in your home directory:
    - captures all running processes and system load
    - captures the amount of free memory, disk space, etc.
    - captures the bottom of dmesg for any new lines
    - issues a sync command to push the changes to the file to the "disk", to hopefully lose as little data as possible when the Pi crashes
  - Once the Pi crashes, reboot it and review the file. See if there are any concerning trends (memory leaks, USB resets, etc.)
- What kernel version are you currently using?
- Can you take a picture of the kernel crash screen?

One note:

- I don't think your statement of "I thought this detail was significant as beacon and pat are the only processes that produce UI frames" is correct. Connectionless beacons use UI frames, but Pat uses connected sessions.

--David KI6ZHD

On 01/28/2024 02:12 PM, Michael Dunn
wrote:
Hi everyone, |
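The per-minute capture script David describes above could look roughly like the following. This is only a sketch under stated assumptions: the command list, the 40-line dmesg tail, and the log path ~/crash-snapshots.log are illustrative choices, not anything from the thread, and on Bookworm reading dmesg may require root. It uses os.fsync on the log file as a narrower stand-in for the "sync" command he mentions.

```python
import os
import subprocess
import time

# (command, tail) pairs; tail=None keeps the full output.
# The selection below is an assumption, chosen to match the list above.
DEFAULT_COMMANDS = [
    (["uptime"], None),
    (["free", "-m"], None),
    (["df", "-h"], None),
    (["ps", "-eo", "pid,ppid,pcpu,pmem,rss,comm", "--sort=-pcpu"], None),
    (["dmesg"], 40),  # only the most recent kernel messages
]


def run(cmd, tail=None, timeout=20):
    """Run cmd and return its output; never raise, so one failing
    command cannot prevent the rest of the snapshot from being written."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        text = done.stdout + done.stderr
    except (OSError, subprocess.SubprocessError) as exc:
        return f"<failed: {exc}>\n"
    if tail is not None:
        text = "\n".join(text.splitlines()[-tail:]) + "\n"
    return text


def write_snapshot(path, commands=DEFAULT_COMMANDS):
    """Append one timestamped snapshot to path and push it to storage."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(path, "a") as fh:
        fh.write(f"===== {stamp} =====\n")
        for cmd, tail in commands:
            fh.write(f"--- {' '.join(cmd)}\n")
            fh.write(run(cmd, tail=tail))
        fh.flush()             # Python buffer -> kernel
        os.fsync(fh.fileno())  # kernel page cache -> SD card


if __name__ == "__main__":
    write_snapshot(os.path.expanduser("~/crash-snapshots.log"))
```

Saved under a hypothetical name such as /home/pi/snapshot.py, it could be run from cron with a line like "* * * * * /usr/bin/python3 /home/pi/snapshot.py". Whatever the kernel logs in the final moments before a panic will usually still be missing from the file, so this is mainly useful for spotting slow trends (memory growth, USB resets) leading up to a crash.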
Hi David,

Thanks for looking through my notes. Let's look at this one first, as I think it dovetails into some of the other topics.

> - I don't think your statement of "I thought this detail was significant as beacon and pat are the only processes that produce UI frames" is correct? Connectionless beacons use UI frames but PAT uses connected sessions.

I think we may have talked past each other on this one before, so to clarify, Pat uses connected mode sessions to retrieve mail AND can be configured to send beacons via UI. Below is how I had Pat configured (from Pat's config.json) to advertise peer-to-peer Winlink. I've removed this configuration for testing, but does that make sense why I was concerned with UI frames? Pat can send UI frames and, of course, use connected mode sessions.

  "ax25": {
    "port": "dw12",
    "beacon": {
      "every": 3600,
      "message": "Winlink P2P",
      "destination": "BEACON"
    }
  }

> - What is the interval on your beacon message(s)? Are you only sending one beacon or multiple? Are you using a digi path on your beacons?

I have two beacons; one transmits every 30 minutes with node and RMS SSIDs. The other is the hourly Winlink P2P beacon you see above. I've de-configured /usr/sbin/beacon and Pat's beacons, and instead I'm now producing these beacons out of Direwolf directly. No digipeater, just direct to local RF.

> - When you use Pat, are you making an outbound AX.25 connection to a remote system or is some remote system making a AX.25 connection to you?

Do you remember I mentioned the "complexity" in my AX.25 stack? I might need a little latitude on this one, so apologies in advance.

As I'm sure you are aware, when you transmit through the AX.25 stack, you cannot "hear" that transmission. If you could, chaos would ensue. This is a bit inconvenient in the scenario where you want to monitor your own services. As an example, if you ran your Winlink client (pat) on the same station as your Winlink server (rmsgw), those two could never talk. Such is my situation, where pat and rmsgw run on the same Pi and it is my only packet station.

I found two solutions to this dilemma; the first is to digi back to yourself. Of course that's a massive waste of RF and I didn't want to be "that guy". The other, more elegant solution involves kissnetd and a pair of loopback ax ports (a.k.a. "the complexity"). You can probably see why I didn't want to muddy the waters with this before; sorry about that. I've had this loopback config set up for about 5 years now. Did I mention I have disabled this? Just making sure.

To answer the question, I've normally used the loopback for Pat's connections, so checking my mail doesn't cross RF. Realizing the loopback complexity was a liability, I removed those parts last week and I've been using the digi-back-to-myself option. The schedule runs just once a day at 9:15a. I've had very few inbound connections to my RMS server, but I've never been able to correlate an inbound connection with a crash.

> - It might be good to run a script via cron running every minute on your Pis under test that:

I've actually done most of that (uptime, free, slabtop, proc temperature, voltage events) on a 10 second cycle. I've not recorded disk space or the process list, as I didn't see much value there. As you implied, when the kernel crashes, nothing gets flushed to disk, so my logs are unremarkable. However, I like the dmesg idea. Given that's a direct read from /dev/kmsg, I think I'll avoid writing that to disk and just stream it to a terminal. The hot pipe means it has a better chance of making it to the wire than down to disk.

> - What kernel version are you currently using?
> - Can you take a picture of the kernel crash screen?

The kernel is 6.1.0-rpi7-rpi-v8. I do have a photo of the crash, but groups.io compresses the picture to such an extent that it is illegible. I posted the text of the crash (from the serial console) in a message a few weeks ago. Were you looking for something that might be on the screen that wouldn't be on the serial console?

> - You mentioned that when you rebuilt the TEST machine to use a real sound device instead of a sound-loopback, it started to crash. To me, this could be RFI induced. How physically close is the radio to your Raspberry PIs? How much RF power is the radio transmitting at?

Drats, RFI! Don't worry, I'm not going to pull the "not in my shack" argument. RFI is always an option, but is notoriously hard to diagnose. To answer your questions, I have 3 Pis (PROD, TEST, and an unrelated Pi4) 3 to 4 feet away from the radio, and my radio is set to "Mid" power (which is 10 W in Kenwood speak). I mentioned the audio tap, so my data cable is your typical Mini-DIN6 on the radio end and a very atypical, butchered mess of ribbon cable and Dupont/TRS connectors on the other (i.e. RFI playground). However, do think back just a moment to my loopback remarks, and understand that the crash has usually happened without any RF transmit (i.e. no RFI) while mail was being checked over the loopback.

Since RFI is so difficult to identify, how about a differential test against RFI? As it happens, I have an idle Pi3 sitting here. This Pi, let's call it LEGACY, is the direct predecessor to PROD, so it has a fully operational packet configuration that was decommissioned a couple of months ago. More importantly though, it has 2 years of clean operational history (no crashes) in this environment, just a few inches away from where TEST and PROD are right now. It would be trivial to swap LEGACY into where TEST sits today. In that scenario, if LEGACY were to crash, then it points the finger at RFI; if it doesn't crash, then a kernel bug is likely to blame. Thoughts?

Thanks
Mike
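For the idea above of streaming kernel messages from /dev/kmsg to a terminal instead of to disk, a minimal reader might look like the sketch below. It assumes the record layout described in the kernel's dev-kmsg ABI documentation ("priority,sequence,timestamp_usec,flags;message", one record per read), and the names parse_kmsg and follow are made up for illustration. Reading /dev/kmsg normally requires root.

```python
import os
from typing import NamedTuple

LEVELS = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]


class KmsgRecord(NamedTuple):
    level: str      # syslog severity name
    facility: int   # syslog facility number (0 = kernel)
    seq: int        # kernel sequence number of the record
    seconds: float  # monotonic timestamp since boot, in seconds
    message: str


def parse_kmsg(line: str) -> KmsgRecord:
    """Parse one /dev/kmsg record line of the form 'prio,seq,usec,flags;msg'."""
    header, _, message = line.rstrip("\n").partition(";")
    fields = header.split(",")
    prio, seq, usec = int(fields[0]), int(fields[1]), int(fields[2])
    return KmsgRecord(LEVELS[prio & 7], prio >> 3, seq, usec / 1_000_000, message)


def follow(path="/dev/kmsg"):
    """Yield records as the kernel emits them; blocks while waiting."""
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            try:
                data = os.read(fd, 8192)  # the kernel hands back one record per read
            except BrokenPipeError:
                continue  # older records were overwritten before we got to them
            if not data:
                return
            # Lines after the first are indented key=value metadata; skip them.
            first_line = data.decode("utf-8", errors="replace").split("\n", 1)[0]
            yield parse_kmsg(first_line)
    finally:
        os.close(fd)


if __name__ == "__main__":
    for rec in follow():
        print(f"[{rec.seconds:12.6f}] {rec.level:7s} {rec.message}", flush=True)
```

Run inside an ssh session from another machine, the output leaves the Pi over the network as each record arrives. There is no guarantee the last lines before a panic make it out, which is why the serial console remains the more dependable capture path.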
Hello Mike,

>> - I don't think your statement of "I thought this detail was significant as beacon and pat are the only processes that produce UI frames" is correct? Connectionless beacons use UI frames but PAT uses connected sessions.
> I think we may have talked past each other on this one before, so to clarify, Pat uses connected mode sessions to retrieve mail AND can be configured to send beacons via UI. Below is how I had Pat configured (from Pat's config.json) to advertise peer to peer Winlink. I've removed this configuration for testing, but does that make sense why I was concerned with UI frames? Pat can send UI frames and, of course, use connected mode sessions.

OK, I didn't know that Pat included its own unique "beacon" program. Disabling that will hopefully help narrow things down.

>> - What is the interval on your beacon message(s)? Are you only sending one beacon or multiple? Are you using a digi path on your beacons?
> I have two beacons; one transmits every 30 minutes with node and RMS SSIDs. The other is the hourly Winlink P2P beacon you see above. I've de-configured /usr/sbin/beacon and Pat's beacons, and instead I'm now producing these beacons out of direwolf directly. No digipeater, just direct to local RF.

Got it.

No, the Linux "listen" program can print out TXed packets, but you need to enable that feature with the "-a" option:

    "-a    Allow for the monitoring of outgoing frames as well as incoming ones."
Ah.. ok, so you running rmsgw also on this machine is news.? Might not be a problem but it's worth knowing since as I mentioned before, there are known INCOMING "connected" mode issues with the Linux AX.25 stack but they aren't kernel panic level issues.
Ok.. for now, please keep things simple to figure out this issue.
Ok, this is the second part of new news. Pat is making an outbound connection via a local digi and back to your Raspberry Pi. Now to be clear, are you digipeating or NODEing out and back? I ask because when you NODE around, your SSID gets decremented by one. That nuance might matter here.
Ok.. so it's the same output as before. Got it.
That's a LOT of power for only being so close to each other. Can you put them into "EL" or Extra Low mode, which is 0.5 W? That might help here. I would also argue that moving them farther apart and also onto different Z-planes (aka elevation) might help if this is really an RFI issue. If it is RFI related, I would expect to see other errors like USB device drops, etc.
Understood.
Swapping the SD cards around might help here, and I imagine that the "LEGACY" OS is using an older kernel that might not have these AX.25 issues. Did that LEGACY setup also have rmsgw and Pat running at the same time on it? In addition to this test, and since you have multiple Pis, you might consider splitting Pat and rmsgw onto different Pis. That might help isolate the issue as well.
--David KI6ZHD
Looking for answers on this specifically, I see others are having the same problem. Beacon seems to be in the panic string each time my Pi crashes. I'm using a Nino TNC and Raspbian Bookworm 32-bit, and can duplicate the crash on a Pi-3 and a Pi-1. I needed a serial console to capture the full crash content. If I launch beacon to fork into the background, sending a station beacon every 35 minutes, the system crashes after about 2 days. I can also sometimes get the system to crash if I invoke beacon to do a one-time send.
I'm trying to set up RMSGW for Winlink mail, using the distro-supplied packages for ax25-tools and ax25-apps, and the N7NIX source for rmsgw. This is a new install.
    ax25-tools    0.0.10-rc5+git20190411+3595f87-6
    ax25-apps     0.0.8-rc5+git20190411+0ff1383-5
Without beacon, it is fairly stable on Pi-1 and Pi-3. Yes, the Pi-1 is kind of slow, but the TNC is doing the work. The TNC can also beacon on its own, an option I have on my list to try.
Hi David,
> No, the Linux "listen" program can print out TXed packets but you need to enable that feature with the "-a" option:

What I mean by this is that a client and a service sharing the same ax port will never hear each other. If rmsgw listens for -10 on ax0 and pat tries to call -10 on ax0, the two will never connect. I built the loopback to solve this problem.

> Ok, this is the second part of new news. Pat is making an outbound connection via a local digi and back to your Raspberry Pi. Now to be clear, are you digipeating or NODEing out and back? I ask because when you NODE around, your SSID gets decremented by one. That nuance might matter here.

So, nomenclature here to make sure we are on the same page: by digipeating you mean adding a digipeater to the initial connection, correct? Versus NODEing, where you make an initial connection to a NODE and make a second, in-band connection to the destination? In that context, definitely digipeating. In general form, the pat connect alias looks like this: ax25://dw12/DIGI/MYRMS-10 where "dw12" is the ax port, DIGI is any digipeater, and MYRMS-10 is rmsgw on the Pi.

> That's a LOT of power for only being so close to each other. Can you put them in to "EL" or Extra Low mode which is 0.5w? That might help here. I would also argue that moving them father apart and also onto different Z-planes aka elevation might help if this is really an RFI issue. If it is RFI related, I would expect to see other errors like USB device drops, etc.

I'm not sure what you mean here by "close to each other". Are you referring to the distance between Pi and radio, or are you talking about my RF partners? I actually think 10 W is pretty conservative, given my closest RF partners are 25-30 miles away. I did run this radio at 5 W for a few years, but noticed that distant partners were unreliable at that power setting. I would be glad to test with the power set to "low", but I don't have an "EL" option. Again, this is a single radio being controlled by a single Pi (PROD). You used the plural "them" when referring to the "EL" setting as if maybe you thought I had 2 radios, but I don't.

The Z-plane confused me for a minute because I was thinking Cartesian coordinates, but you must be referring to an antenna coordinate system where Z runs up and down (parallel to gravity). In that coordinate system, the relationship between radio and Pi would be described as the radio at the origin and the Pi at x=0, y=0, z=-3, where units are in feet. To paint a word picture, the radio is mounted on the top of a baker's rack, with the Pi directly underneath, but 3 shelves down. If I moved the Pi, the best distance I could practically get would be x=-2, y=-8 and z=-5. It would take me a while to get the lengths of cables needed to make that happen.

> Swapping the SD cards around might help here and I imagine that "LEGACY" os is using an older kernel that might not have these AX.25 issues. Did that LEGACY setup also have rmsgw and Pat running at the same time on it? In addition to this test and since you have multiple PIs, you might consider splitting apart of the Pat and rmsgw onto different Pis. That might help isolate the issue as well.

My thought was actually swapping the whole Pi (LEGACY for TEST). The value in the LEGACY unit is that it is a known quantity: working software, firmware and hardware in the RF environment. Besides, LEGACY is a Pi3, where TEST is a Pi4, so I don't think a swapped SD would boot. Correct, LEGACY is an older build, maybe jessie or buster, but definitely an older kernel. LEGACY has both rmsgw and Pat; it's nearly an identical configuration to PROD/TEST, just older software versions. However, LEGACY was using different sound hardware, a UDRC-II. I would pry the UDRC-II off the header and use the CM108 USB adapter from TEST.

As for splitting up rmsgw and Pat, I decided that I would just turn off Pat for a test cycle. If the Pi crashes, then we know that the kernel blaming Pat is a red herring. Since we need some traffic to trigger the crash, I've been just axcall'ing to another node and disconnecting on occasion.

Thanks
Mike
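The connect alias layout described above (ax port, then any digipeaters, then the target callsign) can be split apart mechanically. This is only a minimal sketch in Python based on the layout given in this message; it is not Pat's own parser, and the names `Ax25Alias` and `parse_alias` are made up for illustration.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass
class Ax25Alias:
    """Pieces of an alias shaped like ax25://PORT/DIGI.../TARGET (hypothetical helper)."""
    port: str
    digipeaters: list
    target: str


def parse_alias(url):
    """Split an alias into ax port, digipeater list, and target callsign.

    Assumes only the layout described in the thread; other forms that
    Pat may accept are not handled here.
    """
    parts = urlsplit(url)
    if parts.scheme != "ax25":
        raise ValueError("not an ax25:// alias: %r" % url)
    hops = [hop for hop in parts.path.split("/") if hop]
    if not hops:
        raise ValueError("alias has no target callsign: %r" % url)
    # The last path element is the station being called; anything before it is a digipeater.
    return Ax25Alias(port=parts.netloc, digipeaters=hops[:-1], target=hops[-1])
```

With the alias from this message, `parse_alias("ax25://dw12/DIGI/MYRMS-10")` gives port `dw12`, one digipeater `DIGI`, and target `MYRMS-10`, which makes it easy to confirm that the client port differs from the port rmsgw is listening on.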
Hi Jon,
Very interesting and welcome to the crash club :) . Could you share what kernel version you are on? I'm on 6.1.0-rpi7-rpi-v8. I also match on the ax25 package versions up to the "+" sign; since we are on different architectures, the package version won't be an exact match. Same here for rmsgw, N7NIX built from source.
If you can hook up a serial console, I'd recommend it. That would allow you to copy/paste the crash messages. Make sure you have a TTL-level serial adapter.
I would recommend moving the beacon to the TNC or just turning it off for a few days. You may find the system still crashes, which would be diagnostic. In my case a different process triggered the crash, so the serial console would be important.
Cheers
Mike
Crash club. I think I've overpaid my dues to this club over the years. I like it.
I'm currently testing the rmsgw without beacon running at all, to see if I can make it past two days. Telling the TNC to beacon should be easy and that is next on my list. I think the final production deployment for our club gateway will be on an x86 system, a GMKTEK N5105, also using a TNC. The out-the-door price for that PC is really close to all the peripherals needed to deploy a Pi, and we have bigger plans for the system, not just RMSGW. For other club members who want the Pi solution at their home (like myself), I want to figure it out.
Here is my current kernel on the Pi-1:
    Linux rms-gw 6.1.0-rpi7-rpi-v6 #1 Raspbian 1:6.1.63-1+rpt1 (2023-11-24) armv6l GNU/Linux
And the kernel on the Pi-3, which has the same issue:
    Linux wa6bgs-rms 6.1.0-rpi7-rpi-v7 #1 SMP Raspbian 1:6.1.63-1+rpt1 (2023-11-24) armv7l GNU/Linux
I do have a serial console attached to the Pi-1 to catch kernel panic messages. Here is the string I saw from prior crashes. I was only able to capture some because I had journalctl -af running in a shell. The serial console prints this automatically.
Some crash strings from this month:
    Jan 25 19:22:40 wa6bgs-rms kernel: CPU: 0 PID: 976 Comm: beacon Tainted: G        WC         6.1.0-rpi7-rpi-v7 #1  Raspbian 1:6.1.63-1+rpt1
    Jan 26 09:06:37 wa6bgs-rms kernel: CPU: 0 PID: 988 Comm: beacon Tainted: G         C         6.1.0-rpi7-rpi-v7 #1  Raspbian 1:6.1.63-1+rpt1
    [208158.485745] CPU: 0 PID: 850 Comm: beacon Tainted: G         C         6.1.0-rpi7-rpi-v6 #1  Raspbian 1:6.1.63-1+rpt1
And this happened when the system was only up for 3.5 hours, and I ran the beacon command manually:
    root@rms-gw:~# beacon -c WA6BGS -d "beacon" -s radio "RMS Gate = WA6BGS-10"
It crashed instantly.
    [12716.423664] CPU: 0 PID: 1217 Comm: beacon Tainted: G         C         6.1.0-rpi7-rpi-v6 #1  Raspbian 1:6.1.63-1+rpt1
This is when I was convinced beacon was causing the problem. Going back one Raspbian release is possible, but not permanent.
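For anyone collecting crash strings like these from a serial console capture or a saved journal, a short script can tally which process the kernel names on each crash header line. This is a minimal Python sketch; the function name `blamed_processes` is an assumption made for illustration, and the pattern only targets the "CPU: ... PID: ... Comm: ..." header shown above.

```python
import re
from collections import Counter

# Matches the crash header line shown in the thread, for example:
#   "CPU: 0 PID: 976 Comm: beacon Tainted: G  WC  6.1.0-rpi7-rpi-v7 #1 ..."
# The process name is whatever sits between "Comm:" and the taint field.
CRASH_HEADER = re.compile(
    r"CPU:\s*(\d+)\s+PID:\s*(\d+)\s+Comm:\s*(.+?)\s+(?:Tainted:|Not tainted)"
)


def blamed_processes(lines):
    """Count how often each process name (Comm) appears in crash header lines."""
    counts = Counter()
    for line in lines:
        match = CRASH_HEADER.search(line)
        if match:
            counts[match.group(3)] += 1
    return counts
```

Feeding it the lines of a saved console log (for example `blamed_processes(open("console.log"))`, where `console.log` is a placeholder file name) returns a counter keyed by process name, so a crash that blames something other than beacon stands out immediately.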
I have a test Raspberry Pi 4 + 64-bit Bookworm setup (6.1.0-rpi7-rpi-v8 #1 SMP PREEMPT Debian 1:6.1.63-1+rpt1 (2023-11-24) aarch64 GNU/Linux) using the Linux AX.25 stack + VE7FET AX.25 libs/apps/tools + Direwolf 1.7 (with GPIO PTT) and a Syba USB sound device, but it's NOT connected to a radio to send RF traffic. That said, it's been running beacon for several *months* w/o any crashes:
    /usr/sbin/beacon -c KI6ZHD-8 -d beacon -t 60 vhfdrop KI6ZHD/k KI6ZHD-1/b SCLARA/n 44.128.0.1/ip : Linpac in Santa Clara
One thing I notice in Jon's most recent post below is that they are using a 32-bit kernel (aka armv6l) but 64-bit binaries (Pat), whereas my setup is using the 64-bit kernel. That might be an important difference.
--David KI6ZHD
On 01/31/2024 09:04 AM, Jon Bousselot KK6VLO wrote:
> Crash club. I think I've overpaid my dues to this club over the years. I like it.
> I do have a serial console attached to the Pi-1

Sorry, I misread your message ... At least I didn't post an Amazon link to an adapter :) .

> Some crash strings from this month.

Cool, so between the two of us we have 4 Pis on 6.1.0 that crash. If you get a crash after you disable beacon, grab the crash string; I'm very curious to know what process it blames.

> And this happened when the system was only up for 3.5 hours, and I ran the beacon command manually.

I had the same thought about beacon, but was disappointed to see another process trigger the crash; reverting to an earlier release may be your only solid workaround right now.

So I heard you say that prior crashes happened every 1-2 days, and it looks like this crash happened after only 3.5 hours. My systems have varied between 4 - 6 days between crashes. I have a theory that the crash happens after the kernel processes a certain amount of AX.25 traffic. It could explain why your Pi crashes faster than my Pi. My packet channel is pretty quiet; sometimes 2 - 3 minutes go by without even a single transmission. Would you say your packet channel is busier than this? Maybe I should trend the packet count from ifconfig ...

Cheers
Mike
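One way to trend the packet count without scraping ifconfig output is to read the kernel's per-interface counters from /proc/net/dev, which carries the same RX/TX packet totals. The following is a minimal Python sketch under that assumption; the function names are made up for illustration, and the interface name (ax0 here) will depend on the local axports setup.

```python
def parse_net_dev(text):
    """Return {interface: (rx_packets, tx_packets)} from /proc/net/dev text."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 10:
            continue
        # Field 1 is received packets, field 9 is transmitted packets.
        counters[name.strip()] = (int(fields[1]), int(fields[9]))
    return counters


def packets_since(previous, current):
    """Total packets (RX + TX) handled between two samples of one interface."""
    return (current[0] - previous[0]) + (current[1] - previous[1])


def read_counters(path="/proc/net/dev"):
    """Read and parse the live counters (Linux only)."""
    with open(path) as handle:
        return parse_net_dev(handle.read())
```

Logging `read_counters().get("ax0")` with a timestamp every few minutes to a file that survives a reboot would show roughly how many AX.25 packets the kernel had handled before each crash, which is what is needed to compare a busy channel against a quiet one.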