Re: What type of storage or db is best for a lot of trading data that streams to you?
I'm locking this thread now. It's really off-topic, and there are probably as many answers to it as there are members of the group! Please, folks, try to remember what the purpose of this group is: to help people who are having problems with the TWS API. It's not a free-for-all discussion on any topic that's in some way related (or even unrelated) to the API. You need to make your own decisions about what technologies to use. Members who persist in making off-topic posts, or who appear not to be willing to make an effort to find answers themselves before asking the group, are likely to be put back on moderation. Richard King, Group Owner and Moderator
Re: What type of storage or db is best for a lot of trading data that streams to you?
Hi Bruce, I use pandas (Python) to and from CSV, and that works for me. My datasets are not in the GB range, so YMMV depending on whether you are trading at the tick level. -Ajay On Sun, Feb 14, 2021 at 2:19 PM Amaganset <joe.paoloni@...> wrote:
Re: What type of storage or db is best for a lot of trading data that streams to you?
MS Access
Re: What type of storage or db is best for a lot of trading data that streams to you?
Hi Bruce, I've never played with "billions of lines of data" and use sqlite3 for storage (with a framework written in Java). I keep it simple: one file per table. So if, for example, I download a few months of 1-sec bars with "what to show" set to "trades", it all ends up in one file; if "what to show" is something else, it goes into a different file. It works for me and covers all my needs.

The most complex use case for that storage was replaying a few years of all 500+ components of the S&P 500. It took about 2-3 weeks back then to download all the 1-min history bars for more than 10 years (for some tickers more, for some less), and I ended up with ~40-50GB of db files or so (it won't fit in RAM :). Then I solved the challenge of reading historical data from multiple DBs/tables in parallel without doing a "join" on those huge datasets. I put a "prototype" here as a self-sufficient standalone project: And then I spent a couple of evenings integrating it into my Java "speculant" framework, so it is now part of the "replay" mechanism. Here is an example of the generated files for one particular contract:

I put all the scanner results in one file (one table), and have "symbol_store.db" for all the contract details I've ever seen. When I run scanners and get back another 50 conids, I automatically query ContractDetails for all of them. I also recently added "tags.db" so I can associate a tag (string) with a contract id - this is useful for grouping things together. If interested, PM me and I'll send the "create table" statements for all the mentioned DBs. Cheers, Dmitry Shevkoplyas On Sun, Feb 14, 2021 at 2:06 PM Sean McNamara <tank@...> wrote:
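[Editor's note: for readers who want to try the same layout, here is a minimal sketch of the one-file-per-table SQLite approach described above, assuming the sqlite-jdbc driver (org.xerial:sqlite-jdbc) on the classpath. The file, table, and column names are illustrative, not taken from Dmitry's framework.]

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Statement;

    public class BarStore {
        public static void main(String[] args) throws Exception {
            // One DB file per (contract, bar size, whatToShow) combination.
            try (Connection db = DriverManager.getConnection(
                    "jdbc:sqlite:AAPL_1sec_TRADES.db")) {
                try (Statement st = db.createStatement()) {
                    st.execute("CREATE TABLE IF NOT EXISTS bars (" +
                            "ts INTEGER PRIMARY KEY, " + // epoch seconds
                            "open REAL, high REAL, low REAL, close REAL, " +
                            "volume INTEGER)");
                }
                // Insert one bar; in practice you would batch these.
                try (PreparedStatement ins = db.prepareStatement(
                        "INSERT OR REPLACE INTO bars VALUES (?,?,?,?,?,?)")) {
                    ins.setLong(1, 1613331540L);
                    ins.setDouble(2, 135.32);
                    ins.setDouble(3, 135.40);
                    ins.setDouble(4, 135.30);
                    ins.setDouble(5, 135.37);
                    ins.setLong(6, 1200);
                    ins.executeUpdate();
                }
            }
        }
    }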
Re: What type of storage or db is best for a lot of trading data that streams to you?
You would be better off using 4K blocks, as that's the filesystem block size on disk. The next size up is 2MB: one parent block filled with 64-bit pointers to full 4K blocks. Those are the fastest sizes of data to write on most filesystems. Hunter
On Sunday, February 14, 2021, 12:43:16 PM PST, btw <newguyanon@...> wrote:
Re: What type of storage or db is best for a lot of trading data that streams to you?
I use gzipped text files, basically just as they come from IB. Gzip can compress on the fly, so I append 32k blocks when I have enough data, so it doesn't write all the time. I get about 7x compression on my data, which increases to 8x if I recompress an existing file. Reading is faster for compressed files. I use a normal SSD.
            if (rtvBytes > 16000 || isForce) { // ~16k block size, but prefer more writes over fewer; force on close, still good compression
                try (GZIPOutputStream gzout = new GZIPOutputStream(new FileOutputStream(outf, true))) { // true = append
                    for (; idxF < indVals.size(); idxF++) {
                        RTV rtv = (RTV) indVals.get(idxF).misc;
                        gzout.write(rtv.toSCSV(fmtStr).getBytes());
                    }
                }
            }
Oops, maybe I use 16k blocks. I don't need any specific database queries since I just ask for a full month/year/contract etc. at a time. If you needed one symbol for one week and another symbol for a different week in reverse order or something, then a database would make sense. Obviously that never happens. For the historical 5-sec bars I pick up each week, I gzip them all at the end. If you screw up and get too many, or forget a week and fetch them out of order, the file ends up out of order. So when loading I put them in a set and then sort (all by date). I was going to write code to detect errors and save the fixed file, but I never bothered, since it takes just a few msecs to fix every time. I think these would all be called timeseries, since I save a timestamp.
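[Editor's note: as a companion to the writer snippet above, here is a hedged sketch of the load step described in this post (read the gzipped file, de-duplicate, sort by date). It assumes semicolon-separated records with an epoch timestamp in the first field, which is only a guess based on the toSCSV name in the snippet; the file name is illustrative.]

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeMap;
    import java.util.zip.GZIPInputStream;

    public class GzipBarLoader {
        public static List<String> load(String path) throws Exception {
            // A TreeMap keyed by timestamp drops duplicates and sorts in one pass.
            TreeMap<Long, String> byTime = new TreeMap<>();
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    new GZIPInputStream(new FileInputStream(path))))) {
                String line;
                while ((line = in.readLine()) != null) {
                    long ts = Long.parseLong(line.substring(0, line.indexOf(';')));
                    byTime.put(ts, line);
                }
            }
            return new ArrayList<>(byTime.values());
        }
    }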
Re: What type of storage or db is best for a lot of trading data that streams to you?
If you are doing large-scale backtesting with the full sets of data, then the suggestion from ds-avatar is a good one, as the Parquet file format is quite nice.
If you are looking for a more database-centric approach, I've had great luck using PostgreSQL () with the TimescaleDB () module enabled. I like the fact that you get excellent compression of data, that you can generate on-demand subsets of history (vs. file-based persistence like Parquet), and that the interface to the data is normal SQL queries.
I've played around with a few other options such as InfluxDB, but found that for my personal use-case TimescaleDB was preferable.
It's not clear how you intend to interact with the data, and those specifics will most likely point you in the right direction.
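[Editor's note: for illustration, a minimal sketch of this setup from Java, assuming the PostgreSQL JDBC driver and a database with the timescaledb extension already installed. The connection string, table, and columns are placeholders, not from the post.]

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class TickDb {
        public static void main(String[] args) throws Exception {
            try (Connection db = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/market", "user", "pass");
                 Statement st = db.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS ticks (" +
                        "time TIMESTAMPTZ NOT NULL, symbol TEXT, " +
                        "price DOUBLE PRECISION, size BIGINT)");
                // Turn the plain table into a time-partitioned hypertable.
                st.execute("SELECT create_hypertable('ticks', 'time', " +
                        "if_not_exists => TRUE)");
                // Normal SQL afterwards, e.g. an on-demand subset of history:
                // SELECT * FROM ticks WHERE symbol = 'AAPL'
                //   AND time > now() - interval '1 day';
            }
        }
    }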
Re: What type of storage or db is best for a lot of trading data that streams to you?
Thanks for the feedback. I will read up on Parquet. I am surprised by the sheer number of companies tackling the storage issue. Another tool with quite a bit of bragging vs. Redis is Tarantool, which I have read about; apparently they solve the hot/cold issue with a cache... It seems picking a storage technology is a whole project in itself nowadays. Reading just a bit about kdb+, it seems pretty interesting how simple, high-level, and meaningful the syntax is, and column-oriented storage instead of row-oriented storage makes sense. I wish someone had benchmarked all the top storage systems and written about them all in one place, specifically for trading use :) Maybe Dmitry and Richard can give some feedback too. Thanks, On Sun, Feb 14, 2021, 5:02 AM ds-avatar <dimsal.public@...> wrote:
Re: What type of storage or db is best for a lot of trading data that streams to you?
Trying out Parquet. It's a binary, columnar table file format, good for time series and apparently designed for big data. I've only started storing harvested tick data with it recently, but I hear lots of good things about its efficiency, and Matlab seems to make it easy to use for out-of-memory processing. Sun, 14 Feb 2021, 5:55 Bruce B <bruceb444@...>: Hello,
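[Editor's note: a rough sketch of writing tick data to Parquet from Java using the parquet-avro bindings (org.apache.parquet:parquet-avro plus a Hadoop client on the classpath). The schema, field names, and file name are illustrative, not from the post.]

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;

    public class ParquetTicks {
        public static void main(String[] args) throws Exception {
            Schema schema = SchemaBuilder.record("Tick").fields()
                    .requiredLong("ts")        // epoch millis
                    .requiredDouble("price")
                    .requiredLong("size")
                    .endRecord();
            try (ParquetWriter<GenericRecord> writer =
                    AvroParquetWriter.<GenericRecord>builder(new Path("ticks.parquet"))
                            .withSchema(schema)
                            .build()) {
                GenericRecord tick = new GenericData.Record(schema);
                tick.put("ts", 1613331540123L);
                tick.put("price", 135.37);
                tick.put("size", 100L);
                writer.write(tick); // columnar layout is handled by the library
            }
        }
    }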
Re: Individual examples for each function of API?
For C#, there is a pretty good sample app that comes with the API, source included. It is not perfect, due to its somewhat outdated and contrived backbone (it uses WinForms and lacks full support for task-based async code, while relying on an internal custom messaging subsystem that is a bit overwhelming), but it is very extensive and well structured, and can be used for tinkering and prototyping minor custom incremental functionality. Sun, 14 Feb 2021, 5:52 Bruce B <bruceb444@...>: Hello,
Re: What type of storage or db is best for a lot of trading data that streams to you?
Dean Williams
Bruce, unfortunately I have no experience with Redis. This gives a short overview of kdb: There is also a developers group for the non-commercial version: Dean
Re: Semantic difference between tickByTickBidAsk() and tickByTickAllLast()?
That is not correct. tickPrice() and tickByTickAllLast() are both Level 1. BidAsk is Level 2.
I point you again to
AllLast includes additional trade types, such as combos, derivatives, and average-price trades, which are not included in Last. On Sat, Feb 13, 2021 at 08:48 PM, Bruce B wrote:
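[Editor's note: to make the distinction concrete, here is a hedged sketch of the three subscriptions side by side, assuming an already-connected EClientSocket; the contract and request ids are arbitrary.]

    import com.ib.client.Contract;
    import com.ib.client.EClientSocket;

    public class TickByTickDemo {
        // Assumes `client` is already connected; see the TWS API docs for setup.
        static void subscribe(EClientSocket client) {
            Contract contract = new Contract();
            contract.symbol("AAPL");
            contract.secType("STK");
            contract.exchange("SMART");
            contract.currency("USD");

            // Top-of-book quote changes -> EWrapper.tickByTickBidAsk()
            client.reqTickByTickData(1001, contract, "BidAsk", 0, false);

            // Every trade print, including combos, derivatives, and
            // average-price trades -> EWrapper.tickByTickAllLast()
            client.reqTickByTickData(1002, contract, "AllLast", 0, false);

            // Same callback, but regular trades only
            client.reqTickByTickData(1003, contract, "Last", 0, false);
        }
    }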
Re: What type of storage or db is best for a lot of trading data that streams to you?
I just use CSV files for pricing data. The size of the pricing data just isn't that large, and it makes the complexity of using a DB for pricing data unnecessary. I use a DB for other things in my system, and I still don't use it for pricing data. If you must, you probably want a time-series database, which is what kdb+ is. Q is analogous to SQL; it's not suitable for general-purpose programming, but it's fine as a query language. I imagine you would have to write something in your language to call the Q queries. Redis is probably completely unsuitable for your needs. Hunter
On Saturday, February 13, 2021, 7:31:10 PM PST, Bruce B <bruceb444@...> wrote:
Re: What type of storage or db is best for a lot of trading data that streams to you?
Dean,
Thanks for the feedback. Can you expand on this please, especially if you have Redis experience and can compare? What about kdb+ is useful for this purpose: the Q language and future analytics capability, the speed of reads/writes, or something else? Thanks,
||||
What type of storage or db is best for a lot of trading data that streams to you?
Hello,
Those of you who record streaming and historical data: what do you use for storage? And what type of storage do you use for persistent and non-persistent use (on-the-fly analysis)? Do you use any timeseries database? I am talking billions of lines of data, of course. I would like to hear about your experience. Thanks,
Individual examples for each function of API?
Hello,
Are there any individual/simple examples (single files) for each function listed in the API, posted anywhere (in any of the supported languages)? Or a project that breaks down the whole API, or big parts of it, into examples? The samples that come with the SDK are full of bugs and cluttered, with a lot of things in one place; it's not efficient to pull simple, runnable examples out of them. Thanks,
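[Editor's note: until such a collection exists, a single-file example can be fairly small. Below is a hedged sketch (not an official sample) that connects, requests the server time, prints it, and exits. It assumes TWS or IB Gateway listening on 127.0.0.1:7497, the TWS API jar on the classpath, and that DefaultEWrapper (empty EWrapper stubs) is available in your API version.]

    import com.ib.client.DefaultEWrapper;
    import com.ib.client.EClientSocket;
    import com.ib.client.EJavaSignal;
    import com.ib.client.EReader;
    import com.ib.client.EReaderSignal;

    public class MinimalExample extends DefaultEWrapper {
        public static void main(String[] args) throws Exception {
            MinimalExample wrapper = new MinimalExample();
            EReaderSignal signal = new EJavaSignal();
            EClientSocket client = new EClientSocket(wrapper, signal);
            client.eConnect("127.0.0.1", 7497, 0);

            // Standard reader loop: one thread decodes, another dispatches.
            EReader reader = new EReader(client, signal);
            reader.start();
            new Thread(() -> {
                while (client.isConnected()) {
                    signal.waitForSignal();
                    try { reader.processMsgs(); } catch (Exception e) { e.printStackTrace(); }
                }
            }).start();

            client.reqCurrentTime();
            Thread.sleep(2000); // crude wait for the callback, demo only
            client.eDisconnect();
        }

        @Override
        public void currentTime(long time) {
            System.out.println("Server time: " + time);
        }
    }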
Re: Semantic difference between tickByTickBidAsk() and tickByTickAllLast()?
Thanks for the explanation. In IBKR jargon, tickPrice is Level 1 and tickByTickAllLast is Level 2 data. I think the only main question I have left now is: what is the difference between tickType = AllLast and tickType = Last from the tickByTickAllLast() callback? Thanks, On Sat, Feb 13, 2021, 8:55 PM JR <TwsApiOnGroupsIo@...> wrote: The main difference is that reqMktData() returns aggregated data snapshots, while reqTickByTickData() does not aggregate and reports all relevant events individually.
Re: Semantic difference between tickByTickBidAsk() and tickByTickAllLast()?
Thanks JR.
"This is very different and distinct from the Tick-By-Tick data interfaces." This is different and distinct in the way request is made versus how tickByTickLastAll request is made or also result is different too? If result is different from BidAskLast/LasAll then how is tickPrice different? Below is how they describe it and it shows "contract traded" so this is really a trade too. How many trades types are there??? there must be only one. Ref:?
Thanks, On Sat, Feb 13, 2021 at 02:04 PM, JR wrote: