Sounds similar to what I have been banging my head against.
As I said in my post, you should be able to filter the additional ticks out of reqHistoricalTicks() and convert them into streams identical to recorded reqTickByTick() streams for the same period. Just try it for a day or two next week.
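A minimal sketch of how such a comparison could work, assuming both streams are reduced to (time, price, size) tuples. The dict field names and the filter predicate here are placeholders, not the actual ibapi attributes; you would adapt the `keep` filter once you see which extra ticks reqHistoricalTicks() returns for your contract:

```python
from collections import Counter

def normalize(ticks, keep=lambda t: True):
    """Reduce a tick list to comparable (time, price, size) tuples."""
    return [(t["time"], t["price"], t["size"]) for t in ticks if keep(t)]

def diff_streams(historical, live):
    """Return ticks present in one stream but not the other (as multisets)."""
    h, l = Counter(historical), Counter(live)
    return list((h - l).elements()), list((l - h).elements())

# toy example: the historical stream carries one extra size-0 tick
hist = [{"time": 1, "price": 3.10, "size": 2},
        {"time": 1, "price": 3.10, "size": 0},   # hypothetical extra tick
        {"time": 2, "price": 3.11, "size": 1}]
live = [{"time": 1, "price": 3.10, "size": 2},
        {"time": 2, "price": 3.11, "size": 1}]

extra, missing = diff_streams(normalize(hist, keep=lambda t: t["size"] > 0),
                              normalize(live))
print(extra, missing)   # -> [] [] once the filter removes the extras
```

If the two filtered streams diff to empty in both directions over a recorded day, you can treat the historical data as equivalent for backtesting.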
I have no recorded natural gas ticks, but here is the trade profile for ESH5 last week from recorded reqTickByTick() data:
There were 2.5Mio trades last week with a total volume of just shy of 7Mio contracts.
55% of the trades, and 20% of the total volume, came from trades with a size of 1.
90% of the trades had a size of 5 or less, while 90% of the total volume came from trades with sizes of 25 or less.
You can see there is a "long tail" where the "large" trades live, ending with a single trade of 1,895 contracts (just over $555Mio).
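For what it's worth, a profile like the one above can be computed from the size field of the recorded trade ticks. This is just a sketch of the mechanics with a small synthetic list, not the actual ESH5 data:

```python
def trade_profile(sizes):
    """Summarize a list of trade sizes: share of size-1 trades (by count
    and by volume), and the sizes below which 90% of trades and 90% of
    volume accumulate."""
    n = len(sizes)
    total_vol = sum(sizes)
    pct_trades_size1 = 100.0 * sum(1 for s in sizes if s == 1) / n
    pct_vol_size1 = 100.0 * sum(s for s in sizes if s == 1) / total_vol
    ranked = sorted(sizes)
    p90_trades = ranked[int(0.9 * n) - 1]      # 90th percentile by trade count
    cum, p90_vol = 0, ranked[-1]
    for s in ranked:                           # walk up until 90% of volume
        cum += s
        if cum >= 0.9 * total_vol:
            p90_vol = s
            break
    return pct_trades_size1, pct_vol_size1, p90_trades, p90_vol

pct_t1, pct_v1, p90t, p90v = trade_profile([1]*6 + [5]*3 + [100])
print(pct_t1, p90t, p90v)   # -> 60.0 5 100
```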
Jürgen
On Sat, Jan 11, 2025 at 09:38 PM, Brendan Lydon wrote:
Jurgen,
Yes, definitely a complicated task with a lot of considerations, both in backtest and application. I do need to compare the data streams of the two functions, reqHistoricalTicks() and reqTickByTick(), and check feasibility there.

My strategy is similar to another classification machine learning model I wrote for 1-minute OHLCV bars, which I used to map a bunch of different features and ran in realtime. That was straightforward, since you can stream reqHistoricalData() in realtime for each new bar update on the minute, feed it through the model to get a prediction on the next candle (up or down), and form a strategy around existing positions that way.

With this, though, I plan to center my classification model around 'large' orders to answer the same question (up or down), but I want to pass a new x-vector through my trained model on each 'large' order. Assume it's a basic logistic regression binary classification problem, so the x-vector I pass into the model will produce a signal every time the large-order threshold is met. I will obviously set a threshold for 'large', probably dynamic based on current tick-by-tick volumes, say outside 2 standard deviations for example. I want to avoid bar-like data structures such as tick bars, volume bars, imbalance bars, etc. I said a basic logistic regression binary classification model, but in reality it will probably end up being a multi-class classification model using an LSTM-like architecture with classes like long, short, and do nothing, passing a multi-dimensional vector that packs in all tick data and associated, mapped features between large orders into the next signal. None of the strategy matters, though, if the data I use in realtime is not as close as possible to the data I backtest on.
And I am worried that a 'large' order, as defined by the backtest data collected via reqHistoricalTicks(), will fit my feature parameters to values I may never see in realtime if I then switch to reqTickByTick(), making the backtest obsolete along with the data I currently have. Collecting enough reqTickByTick() data could take years to build a valid training dataset if I started now. I hope that clears up my intentions for the strategy, and I appreciate your thoughts.
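The dynamic "outside 2 standard deviations" threshold described above could be sketched as a rolling mean-plus-2-sigma detector over recent trade sizes. The window length, the 2-sigma cutoff, and the 30-tick warm-up guard here are all illustrative choices, not anything prescribed by the API or the strategy:

```python
from collections import deque
from statistics import mean, stdev

class LargeTradeDetector:
    def __init__(self, window=1000, n_sigma=2.0):
        self.sizes = deque(maxlen=window)   # rolling window of recent sizes
        self.n_sigma = n_sigma

    def on_trade(self, size):
        """Feed every trade tick; return True when this one is 'large'."""
        is_large = (len(self.sizes) >= 30 and   # warm-up guard before stats
                    size > mean(self.sizes) + self.n_sigma * stdev(self.sizes))
        self.sizes.append(size)
        return is_large

det = LargeTradeDetector(window=100)
signals = [det.on_trade(s) for s in [1]*50 + [2]*50 + [500]]
print(signals[-1])   # -> True: the 500-lot fires the signal
```

In practice this would sit inside the tickByTickAllLast callback, and each True would trigger building the x-vector of features accumulated since the previous large order.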