
Re: What type of storage or db is best for a lot of trading data that streams to you?


 

Hi Bruce,

I've never played with billions of lines of data and use only SQLite for storage (I use sqlite3 with a framework written in Java).

I keep it simple - one file per table. So if, for example, I download a few months of 1-sec bars with "what to show" set to "trades", they all end up in one file; if "what to show" is something else, it goes into a different file. It works for me and covers all my needs. The most complex use case for that storage was to replay a few years of all 500+ components of the S&P 500. It took about 2-3 weeks back then to download all the 1-min history bars for more than 10 years (for some tickers more, for some less), and I ended up with ~40-50GB of DB files or so (won't fit in RAM :). Then I solved the challenge of reading historical data from multiple DBs/tables in parallel without doing a "join" on those huge datasets. I put a "prototype" here as a self-sufficient standalone project:
I then spent a couple of evenings integrating this into my Java "speculant" framework, so it is now part of the "replay" mechanism.
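The rough idea of that parallel reader, sketched here with made-up table and column names (not my actual schema; it assumes each per-contract file has a bars(time, open, high, low, close, volume) table and that the sqlite-jdbc driver is on the classpath), is a k-way merge over one cursor per file:

// Rough sketch only: replay bars from several per-contract SQLite files in
// time order without a SQL join. Table/column/file names are illustrative.
import java.sql.*;
import java.util.*;

public class MultiDbReplay {

    record Bar(String symbol, long time, double close) {}

    public static void main(String[] args) throws SQLException {
        // One SQLite file per contract; file names here are illustrative.
        Map<String, String> files = Map.of(
                "AAPL", "jdbc:sqlite:AAPL_1min_trades.db",
                "MSFT", "jdbc:sqlite:MSFT_1min_trades.db");

        List<Connection> conns = new ArrayList<>();
        List<Statement> stmts = new ArrayList<>();

        // Each heap entry is "next unread bar of one contract" plus its cursor;
        // the heap always hands back the earliest timestamp across all files.
        Comparator<Map.Entry<Bar, ResultSet>> byTime =
                Comparator.comparingLong(e -> e.getKey().time());
        PriorityQueue<Map.Entry<Bar, ResultSet>> heap = new PriorityQueue<>(byTime);

        for (var f : files.entrySet()) {
            Connection c = DriverManager.getConnection(f.getValue());
            conns.add(c);
            Statement st = c.createStatement();
            stmts.add(st);
            ResultSet rs = st.executeQuery("SELECT time, close FROM bars ORDER BY time");
            if (rs.next())
                heap.add(Map.entry(new Bar(f.getKey(), rs.getLong(1), rs.getDouble(2)), rs));
        }

        // K-way merge: pop the earliest bar, hand it to the strategy callback,
        // then advance that contract's cursor and push it back.
        while (!heap.isEmpty()) {
            var entry = heap.poll();
            Bar bar = entry.getKey();
            ResultSet rs = entry.getValue();
            System.out.println(bar);   // replay/strategy hook goes here
            if (rs.next())
                heap.add(Map.entry(new Bar(bar.symbol(), rs.getLong(1), rs.getDouble(2)), rs));
        }
        for (Connection c : conns) c.close();
    }
}

Each file is read through its own cursor, so nothing ever gets joined or loaded fully into RAM.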

Here is an example of the generated files for one particular contract:
[image.png: listing of the generated DB files for this contract]

I put all the scanner results in one file (one table), and have a "symbol_store.db" for all the contract details I've ever seen. When I run scanners and get back another 50 conids, I automatically query ContractDetails for all of them. I also recently added "tags.db" simply to be able to associate a tag (string) with a contract id - this is useful for grouping things, etc. If interested, PM me and I'll send the "create table" statements for all the DBs mentioned.
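As a rough illustration only (not my actual schema - PM me for the real statements), a tags DB can be as small as a conid-to-tag mapping in SQLite:

// Hypothetical minimal layout for tags.db: a plain conid -> tag mapping.
import java.sql.*;

public class TagsDb {
    public static void main(String[] args) throws SQLException {
        try (Connection c = DriverManager.getConnection("jdbc:sqlite:tags.db");
             Statement st = c.createStatement()) {
            st.execute("CREATE TABLE IF NOT EXISTS tags ("
                     + " conid INTEGER NOT NULL,"   // IB contract id
                     + " tag   TEXT    NOT NULL,"   // free-form label, e.g. 'sp500'
                     + " PRIMARY KEY (conid, tag))");
            // Tag a contract, then list everything carrying that tag.
            try (PreparedStatement ins =
                     c.prepareStatement("INSERT OR IGNORE INTO tags VALUES (?, ?)")) {
                ins.setLong(1, 265598L);   // illustrative conid
                ins.setString(2, "sp500");
                ins.executeUpdate();
            }
            try (ResultSet rs = st.executeQuery(
                     "SELECT conid FROM tags WHERE tag = 'sp500'")) {
                while (rs.next()) System.out.println(rs.getLong(1));
            }
        }
    }
}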

Cheers,
Dmitry Shevkoplyas


On Sun, Feb 14, 2021 at 2:06 PM Sean McNamara <tank@...> wrote:
If you are doing large-scale backtesting with the full sets of data, then the suggestion from ds-avatar is a good one, as the Parquet file format is quite nice.

If you are looking for a more database-centric approach, I've had great luck using PostgreSQL with the TimescaleDB module enabled. I like the fact that you get excellent compression of data, that you can generate on-demand subsets of history (vs. file-based persistence like Parquet), and that the interface to the data is normal SQL queries.
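As a rough illustration (connection details, table and column names are made up here), the basic setup through plain JDBC looks something like this:

// Sketch only: a tick table turned into a TimescaleDB hypertable, then
// queried for on-demand 1-minute bars with ordinary SQL.
import java.sql.*;

public class TimescaleExample {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:postgresql://localhost:5432/market";  // illustrative
        try (Connection c = DriverManager.getConnection(url, "trader", "secret");
             Statement st = c.createStatement()) {

            st.execute("CREATE TABLE IF NOT EXISTS ticks ("
                     + " time  TIMESTAMPTZ NOT NULL,"
                     + " conid BIGINT      NOT NULL,"
                     + " price DOUBLE PRECISION,"
                     + " size  DOUBLE PRECISION)");
            // Turn the plain table into a hypertable partitioned on time.
            st.execute("SELECT create_hypertable('ticks', 'time', if_not_exists => TRUE)");

            // On-demand subset of history: 1-minute OHLC buckets.
            String q = "SELECT time_bucket('1 minute', time) AS bucket,"
                     + "       first(price, time) AS open, max(price) AS high,"
                     + "       min(price) AS low,  last(price, time) AS close"
                     + "  FROM ticks WHERE conid = ?"
                     + " GROUP BY bucket ORDER BY bucket";
            try (PreparedStatement ps = c.prepareStatement(q)) {
                ps.setLong(1, 265598L);  // illustrative conid
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.printf("%s O=%.2f H=%.2f L=%.2f C=%.2f%n",
                                rs.getTimestamp(1), rs.getDouble(2), rs.getDouble(3),
                                rs.getDouble(4), rs.getDouble(5));
                    }
                }
            }
        }
    }
}

The time_bucket/first/last aggregates come from TimescaleDB, so the on-demand bar subsets are just ordinary SQL queries against the same table the live stream writes into.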
I've played around with a few other options, such as InfluxDB, but found that for my personal use case TimescaleDB was preferable.
It's not clear how you intend to interact with the data, and those specifics will most likely point you in the right direction.
