I have faced a problem with my code for a long time that only occurs during the call to EClientSocket::eDisconnect() I have a separate message processing thread running which looks like this:
????? ftrMsgProcThrd_ = pool_->submit( ????????? [&]() ????????? { ??????????? while (clientSocket_->isConnected()) ??????????? { ????????????? signal_.waitForSignal();?? // This waits 2 seconds. ????????????? reader_->processMsgs(); ??????????? } ????????? });
I decided to tackle this annoying bug (not the first time) and have found that after the call to EClientSocket::eDisconnect() which calls EClientSocket::SocketClose() which just calls a Windows Sockets closesocket() on the open socket, I am still getting the message processing thread (EReader thread) trying to perform a Windows Sockets recv() messages on the closed socket resulting in a 509 error. I have traced that error to be socket error 10038 which confirms it is an invalid socket (in this case, a closed socket).
Before the line "reader_->processMsgs()"? I have tried checking for the socket still being open with if(clientSocket_->isConnected()) but it does not solve the problem. The EReader is running in its own thread as per the reader_->start() call.
I thought, perhaps I need to close the EReader before calling eDisconnect so I tried deleting the object and removing the call to eDisconnect() because the destructor of the EReader calls eDisconnect() itself but this does not fix the error. i still get the 509 caused by a read on the closed socket.
I am struggling here and would appreciate advise from any C++ coders that use a multi-threaded approach like the above. It is likely a threading issue but if anyone else has faced a similar "disconnect" issue I would be happy to hear what you did to resolve it.
|
After much deliberation, I think I figured out the problem. In my opinion it is a bug in the TWS API for C++. I would like someone's help to go thru' my logic and confirm it. The EReader::~EReader() destructor closes the socket by calling eDisconnect() then it waits for the thread to complete by calling WaitForSingleObject(m_hReadThread, INFINITE); In my opinion this is wrong. Why would you want to disconnect the socket when you have a thread running which is potentially calling recv() on the same socket? In my view, we have to wait for the thread to finish then disconnect the socket. My fix to the problem is to swap the two steps, i.e.
??? if (m_hReadThread) { ??? ? ? ?? m_isAlive = false; ?? ??? ??? m_pClientSocket->eDisconnect(); ? ?? ????? WaitForSingleObject(m_hReadThread, INFINITE); ?? ???? }
becomes
??? if (m_hReadThread) { ????????? m_isAlive = false; ? ? ????? WaitForSingleObject(m_hReadThread, INFINITE); ?? ?????? m_pClientSocket->eDisconnect(); ?? ???? }
This has resolved the issue. Could someone confirm my logic so I can issue a bug report to the IBKR guys? Thanks
|
I guess the assumption is that using std::atomic<bool> m_isAlive; turns the following into a critical section:
void EReader::readToQueue() {
//EMessage *msg = 0;
while (m_isAlive) {
if (m_buf.size() == 0 && !processNonBlockingSelect() && m_pClientSocket->isSocketOK())
continue;
if (!putMessageToQueue())
break;
}
But it does not, so I take your point.
|
Thanks for your comment Buddy.
I agree that the logic would seem that setting m_isAlive to false during the ~EReader() call should prevent the code trying to read from the socket.
I tried playing around with that. If I pause after m_isAlive is set to false, then continue, the code works successfully without 509 error. It is as if at full speed the atomic operation is not being completed before the call to eDisconnect() happens. On Visual Studio this even happens when I turn off all Optimisations. As eDisconnect() does not use anything related to that atomic, I wonder if the compiler is parallelising the operation as all atomic operations are very slow.
I also noticed that eDisconnect() sets m_fd = -1.? This should prevent the call to processNonBlockingSelect() from trying to call onReceive() which ultimately triggers receive() and recv() causing the error. Again the setting of the atomic variable is not being done in time.
I am still scratching my head over this. I suspect some synchronization issue between the various threads.
|
On Sun, Oct 1, 2023 at 10:02 AM, David Armour wrote:
Thanks for your comment Buddy.
You're welcome. I think you've found an interesting oversight.
I agree that the logic would seem that setting m_isAlive to false during the
~EReader() call should prevent the code trying to read from the socket.
Yeah, superficially and at first blush it would seem that way. But it'd be a mistake to think so. I can understand how that initial thought may have fooled the original author into thinking the code was correct though. All this is post-mortem speculation on my part however.
Nevertheless, once considered more thoughtfully, it becomes apparent that nothing prevents the destructor from being called while the processing thread's readToQueue is in the while block.
The std::atomic will only insure that reading and writing m_isAlive from multiple threads is well-defined; but guarantees nothing w.r.t. the destructor and what's happening in the while loop.
I tried playing around with that. If I pause after m_isAlive is set to false,
then continue, the code works successfully without 509 error. It is as if at
"Pause", with a sleep or something? No... that's surely a game of whack-a-mole. It might work sometimes, depending on unpredictable factors, and might not.
It can be fun to experiment with the timing while you investigate though. You can also try playing w/ ordering by running the code on a single core/cpu and see how that affects things.
full speed the atomic operation is not being completed before the call to
eDisconnect() happens. On Visual Studio this even happens when I turn off all
Optimisations. As eDisconnect() does not use anything related to that atomic,
I wonder if the compiler is parallelising the operation as all atomic
operations are very slow.
Compiler optimizations fall somewhat into the category of "unpredictable factors" I alluded to (although they'd be deterministic given the compiler source and enough work). It's a bit surprising that a critical section wasn't implemented via mutex, but maybe the author wanted to avoid a performance penalty.
I also noticed that eDisconnect() sets m_fd = -1.? This should prevent the
call to processNonBlockingSelect() from trying to call onReceive() which
ultimately triggers receive() and recv() causing the error. Again the setting
of the atomic variable is not being done in time.
I am still scratching my head over this. I suspect some synchronization issue
between the various threads.
Anyway... I think you found a legitimate problem and I wouldn't worry about it too much. I like your proposed solution as well. Joining the thread and then disconnecting makes clear sense to me. And, as long as m_pClientSocket isn't being shared all over the place I don't see a reason to use more complicated locking via mutex.
If I were the reviewer I could see signing off after some confirmation tests... open a ticket and see what they have to say. If you submit a patch remember to change the IB_POSIX ifdef block too :-)
|
I agree with everything you said although I am not entirely happy with my final solution. I do not feel that destroying the EReader object is the correct way to disconnect. It is certainly not intuitive which means I cannot be the only person finding this problem. Is everybody using C++ just ignoring this error?
Anyway, I will sleep on it one more night before submitting a bug report.
Is there a special place to report API bugs or do we just raise ticket as usual?
|
On Sun, Oct 1, 2023 at 03:15 PM, David Armour wrote:
I agree with everything you said although I am not entirely happy with my
final solution. I do not feel that destroying the EReader object is the
correct way to disconnect. It is certainly not intuitive which means I cannot
be the only person finding this problem. Is everybody using C++ just ignoring
this error?
You may see this as a wart but I, personally, won't comment with regard to how intuitive things are or any design aesthetic. I will say that consideration should be given to backward compatibility and just how much existing code would break (and therefore need to change) if the disconnect where removed from the destructor.
Just some food for thought... since there's usually ripple effects, unintended consequences, and API stability is actually very important. Also, I don't think it's wise to hold off fixing some simple, obvious, and existing issue merely because there's an intention of addressing it in a more significant re-design. As they say, there is no time like the present and I appreciate small and tractable, incremental changes.
However, this may come down to a judgment call. Since, I guess, people have indeed been ignoring it or working around it (or it's simply gone unnoticed). After all, it only happens late in the game when there usually isn't much network activity expected or necessarily happening.
Anyway, I will sleep on it one more night before submitting a bug report.
Is there a special place to report API bugs or do we just raise ticket as
usual?
A pull request is probably the best way to go about this:
|
On Sun, Oct 1, 2023 at 03:15 PM, David Armour wrote:
I do not feel that destroying the EReader object is the correct way to
disconnect.
IDK if this is what made you uneasy or what you meant, but maybe you'd prefer to see the call to eDisconnect happen in readToQueue as the penultimate action (i.e. before the final signal is issued)? In retrospect this makes a bit more sense to me.
|
What makes me feel uneasy is that according to the documentation:  It makes users think that by calling eDisconnect() you can safely disconnect from TWS without causing errors but we have shown that this is not the case. You must "stop" the thread, then disconnect from the socket, in that order. It is not possible for eDisconnect() to do that without creating a mutex locking system that shares mutexes between both EClientSocket and EReader. I agree with you that adding this complexity is not the right approach, for one as it will slow down the message receiving loop. What I am opting for and will propose in the GitHub when I get the approval to create a branch, is creating an EReader::stop() function like this:  then modifying the destructor to the following:  Then I will also propose that the C++ documentation for the eDisconnect() function be changed to say that to disconnect safely you must call EReader::stop() before calling EClientSocket::eDisconnect().
|
... I should add with the proposed code change, you can now stop the reader and disconnect the socket without actually destroying the EReader object by doing the following:  ' The last line here is where I wait for my? own message processing thread to end. This is now very clean and you can connect and disconnect from TWS without error and without having to destroy the EReader.
|
I like evolution, so I may be wrong but here I am not sure I would flag this as a bug. Just out of curiosity what undesirable side effect the current implementation generate? ?
Some food for thoughts: Theses are tricky parts as it require tracking all threads together to do proper analysis. again I won't pretend I know all the implications. I use it in C++/Win32 and it work really fine and fast (async) without any issues that worth a change from my perspective (IMHO).
Because: 1- If I ask a "disconnect" it's to be acted upon immediately I don't want any new data to be processed. Exiting gracefully and fast a mutithread application can become a tricky process; I am ready to live with recv receiving remnant of previous telco. It's a philosophical/aesthetic issue I agree, in an ideal world, we are 100% sure that the last transaction was completed. yes. but ... we are not (at least I am not ) in this ideal world.
Also the "error" system of IB have to be seen as a client messaging service Same for WIN32 where GetLastError() is sometime used as a signal (Example when looking for enumeration of files, this is the only way to get a "no more file" info.) The OS handle gracefully zombie packet on close socket, so no harm. (surely not the only apps that do that). Codes are used to that, just look at handleSocketError you see a hard errno = 0; typical of way to reset it without concerns for unhandled error.
2- Related to #1 above, Putting the WaitForSingleObject(m_hReadThread, INFINITE); before eDisconnect may execute a last putMessageToQueue. While the basic supplied code will do a hard beak, the while() loop with m_buf.size==0 , exit because m_isAlive==false, without executing anything else.
3- Notice that a EReader object is used during connection as a limited life (scoped) heap object to fetch server protocol revision in a synchronous mode, and to call startApi() (I am not aware of what this proc? does but by the name of it I would avoid missing this call) this looks like a "trick" to overcome an un-anticipated evolution of the protocol, See EClientSocket::eConnectImpl(....) Actions happening in the dctor are then to be taken carefully.
I would agree that the inversion of code lines you suggested might not affect the behavior at it's synchronous, (m_hReadThread=0) , that you ought to get the protocol revision and I don't even see how you can abort this single message operation anyway. (aside of a socket shutdown) never see a need.
As buddy point it out: - it show that playing with the dctor of EReader may have side effect. In particular there are legacy code without a call to a ->stop(), ?? it's implicit in ~dctor and somewhere does the job. - Just to mention, the m_signal used for this Connect EReader is shared by this ephemeral EReader, you most probably will use the same for your main EReader.? It need care, there are people who use more than one EReader on same m_signal
|
Yeah, I see most of this as making a mountain out of a molehill. It strikes me, largely, as a misunderstanding; much like issuing a SIGKILL but expecting SIGTERM.
If you are familiar with unix signal processing you know that SIGKILL is immediately forced upon the process by the OS and the process my be left in an untidy state. SIGTERM, on the other hand, gives the process an opportunity to clean up and exit gracefully.
Since the API code is readily available, folks can implement their own version of EReader which suits their preference in this regard. Therefore the distinction becomes moot.
There may be some chance of acceptance for a minor pull request which, e.g., swaps the calls to the thread join and socket disconnect. But, even this could be seen as a dog chasing it's own tail since the argument would merely come down to the dislike of a default.
Moreover though, when you consider that the code for EReader hasn't changed in years... even a small change becomes unlikely. And, the suggestion of anything "major", like an intermediate "stop" method, becomes far fetched IMHO.
It's certainly an interesting nuance but probably better addressed by documentation alone. At least we have this thread of conversation now for reference :-)
|
@Gordon, @Buddy,
"mountain"?? : hardly. There is no impact to existing code with this proposed change (tested). I am simply adding functionality that allows users to manually stop the thread that they started manually.? In fact I can leave the destructor of EReader the same as it was without swapping anything and just create a new function "stop()" which will allow users to disconnect cleanly if they desire to,
(Note to Gordon on your concern about the initial use of EReader during the connection, this is what the "if (m_hReadThread)" statement takes care of in the original TWS API code which is still there in my proposed change.)
Let me remind you that the current situation results in a 509 error, which is basically the API's way of saying something bad happened and I do not know what it is. Your code must be bad, which is not the case here. It is the API code that is bad. Now that we know what causes it, we can probably live with it and safely ignore it, but personally I don't like that sort of situation. If the developers intended to use this way of disconnecting the socket, then they should have properly taken care of the error message it creates. The fact that they didn't means that what they coded was unintentional and therefore is clearly a bug. A bug needs to be fixed.
However, I realise changing (adding to) the interface will make the C++ interface different than the other languages which is probably not desirable.
The only way to correct this and make the existing interface the same is to create an EMutex variable which is shared by both EReader and EClientSocket (suggest using a composition pattern here). Changes to the m_isAlive variable need to be locked by the mutex as well as inside the "while(m_isAlive)" loop. This is a larger code change and needs more care to be done right but it is not that difficult and I have implemented it successfully as well. In fact using EMutex here does not slow the loop down (checking the mutex is as simple as reading a flag) since the only time it is ever locked is when we are disconnecting which means speed is not a "critical trading" issue.
I agree it is good to have this thread capture this discussion for the record but I will definitely raise this problem to the coders as I still firmly believe this to be a bug. I do not wish to have to fix this myself every time there is a new API update.
|
Hey, if you feel that strongly about it and have the time then go ahead and make the pull request. In the worst case scenario it'll be good practice, a learning experience and software development lesson.
?袄冲(ツ)冲/?
|
A fix has been submitted a while ago and is now in the "latest" version of the API.
|
?Not sure if this is the right topic but I got a problem with these newly added code in EClientSocket::eDisconnect.
void EClientSocket::eDisconnect(bool resetState) {
// Stop EReader thread to avoid 509 error caused by reading on a closed socket.
? if (m_pEReader)
? ? m_pEReader->stop();
...
When TWS is shutdown manually, the code will hang at m_pEReader->stop(). The call stack is:
?
> ? ?APITWS.dll!EClientSocket::eDisconnect(bool resetState) Line 241 ? ?C++ ? ? ?APITWS.dll!EClientSocket::onClose() Line 370 ? ?C++ ? ? ?APITWS.dll!EClientSocket::receive(char * buf, unsigned int sz) Line 274 ? ?C++ ? ? ?APITWS.dll!EReader::onReceive() Line 199 ? ?C++ ? ? ?APITWS.dll!EReader::processNonBlockingSelect() Line 181 ? ?C++ ? ? ?APITWS.dll!EReader::readToQueue() Line 101 ? ?C++ ? ? ?APITWS.dll!EReader::readToQueueThread(void * lpParam) Line 93 ? ?C++
?
Commenting out m_pEReader->stop() the code runs happily and will generate a 509 error as expected.
?
|
I cannot replicate your problem. When I shutdown TWS manually, the EReader stops properly and calls EWrapperImpl::connectionClosed() which is what is supposed to happen.
?
A 509 error is not what is expected. A 509 error means your code is trying to read on a closed socket which is a bug. By commenting out "stop()" your EReader thread has hung in an infinite wait and will never end until you force the program to stop.
?
What version TWS and API are you using? Are you on Windows or Linux?My test just now is using the latest version of TWS and latest API running on Windows.
|
My IB Gateway version is latest stable 10.30.1t, API version is latest stable 10.30.01 running on Windows. I replicated this on the sample TestCppClientStatic (debug, both win32 and x64). 1) start IB gateway, 2) start TestCppClientStatic, 3) shutdown IB gateway, 4) the program hangs. After putting a breakpoint at m_pEReader->stop() in void EClientSocket::eDisconnect(bool resetState), it seems the hang happened at WaitForSingleObject in EReader::stop(). See call stack below.
?
--hang here --> WaitForSingleObject(m_hReadThread, INFINITE);
> ? ?TestCppClientStatic.exe!EReader::stop() Line 70 ? ?C++ ? ? ?TestCppClientStatic.exe!EClientSocket::eDisconnect(bool resetState) Line 241 ? ?C++ ? ? ?TestCppClientStatic.exe!EClientSocket::onClose() Line 370 ? ?C++ ? ? ?TestCppClientStatic.exe!EClientSocket::receive(char * buf, unsigned int sz) Line 274 ? ?C++ ? ? ?TestCppClientStatic.exe!EReader::onReceive() Line 191 ? ?C++ ? ? ?TestCppClientStatic.exe!EReader::processNonBlockingSelect() Line 173 ? ?C++ ? ? ?TestCppClientStatic.exe!EReader::readToQueue() Line 93 ? ?C++ ? ? ?TestCppClientStatic.exe!EReader::readToQueueThread(void * lpParam) Line 85 ? ?C++
|
You are running an old version. There was a bug in the Windows version which was corrected on 5 Sep 2024 and release in the "latest" version.
?
Version 10.30.1t was released back in July 2024.
?
If you do not want to install the full latest version 10.33 you can simply modify the stop() function as follows:
?
void EReader::stop() { ? if (m_hReadThread) { ? ? m_isAlive = false; #if defined(IB_POSIX) ? ? if (!pthread_equal(pthread_self(), m_hReadThread)) ? ? ? pthread_join(m_hReadThread, NULL); #elif defined(IB_WIN32) ? ? if (!(GetCurrentThreadId() == GetThreadId(m_hReadThread))) ? ? ? WaitForSingleObject(m_hReadThread, INFINITE); #endif ? } }
?
?
There is a detailed explanation for the hang in the bug fix documentation:
|
Just tested API version 10.33.02. It works as expected. Thanks!
|