Relational/time series databases and very large SELECT queries

I need to store a large number of structured records (potentially hundreds of billions) in a database. Data would be written continuously by many sensors, with a high insert rate (up to 100k rows/second).

Since the data is well structured, it seems like a good match for a relational database such as Postgres. However, I am afraid the performance would not be sufficient for the volume of data that needs to be ingested.
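
For what it's worth, here is roughly what the ingestion path would look like with Postgres; a minimal sketch using psycopg2's COPY support (the connection string, table and column names are placeholders matching the query below):

import io
import psycopg2

# Sketch: bulk-load a batch of readings via COPY, which is much faster
# than row-by-row INSERTs. All names here are placeholders.
conn = psycopg2.connect("dbname=sensors")
batch = [(1000, "home", 21.5), (1001, "home", 21.6)]  # (time, location, value)

buf = io.StringIO()
for t, loc, val in batch:
    buf.write(f"{t}\t{loc}\t{val}\n")
buf.seek(0)

with conn.cursor() as cur:
    cur.copy_expert("COPY data (time, sensor_location, value) FROM STDIN", buf)
conn.commit()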

Furthermore, I do not need all of the features of a relational database (no need for full SQL support). Data would be written once, and read a few times as large chunks using basic queries such as:

SELECT time, value FROM data WHERE time > 1000 AND time < 2500 AND sensor_location = 'home' ORDER BY time

That is, select all records between two timestamps for a given sensor (or set of sensors). I do not need complex operations such as joins or updates. The ORDER BY clause is important, since I need to process these records (using a Python script) in the order they were written. These queries typically return so many rows that the result set does not fit in RAM. Furthermore, transferring that many rows is very slow with most RDBMSs due to their text-based wire protocols, even if I split the query into smaller chunks.
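
To illustrate how I consume these results, here is a sketch of the kind of streaming read I have in mind, using a psycopg2 named (server-side) cursor so the full result set never has to fit in the client's RAM (connection string and names are placeholders):

import psycopg2

conn = psycopg2.connect("dbname=sensors")

# A named cursor is a server-side cursor: rows are fetched in batches of
# itersize instead of materializing the whole result set client-side.
with conn.cursor(name="range_scan") as cur:
    cur.itersize = 50_000
    cur.execute(
        "SELECT time, value FROM data "
        "WHERE time > %s AND time < %s AND sensor_location = %s "
        "ORDER BY time",
        (1000, 2500, "home"),
    )
    for time_, value in cur:
        print(time_, value)  # stand-in for the actual processing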

This seems to be a good use case for a time series database such as InfluxDB. However, its open source version cannot easily be distributed (which is a requirement in my case, both for resilience and scalability), and my tests showed that it is not performant enough when it comes to large queries (in particular, its wire protocol is too slow to efficiently transfer this many rows - and it sometimes even crashes when the query returns too many rows).

I recently learned about ClickHouse, which is horizontally scalable and highly performant. It has a binary, compressed wire protocol, and one of its Python drivers (clickhouse_driver) has an execute_iter function which avoids blowing up the client's RAM on these large queries. However, I am quite worried about its resilience (data corruption is not tolerable in my use case), since it is fairly recent and has a limited user base.
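
For reference, the streaming read I tested looks roughly like this (a sketch based on clickhouse_driver's execute_iter; host, table and column names are placeholders):

from clickhouse_driver import Client

client = Client("clickhouse-host")

# execute_iter streams the result block by block instead of
# materializing it, so the client's RAM usage stays bounded.
rows = client.execute_iter(
    "SELECT time, value FROM data "
    "WHERE time > %(lo)s AND time < %(hi)s AND sensor_location = %(loc)s "
    "ORDER BY time",
    {"lo": 1000, "hi": 2500, "loc": "home"},
    settings={"max_block_size": 100000},
)
for time_, value in rows:
    print(time_, value)  # stand-in for the actual processing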

I am aware that my use case is quite specific. Are there other free/open source options I should be aware of?

3 months ago · Santiago Trujillo

3 answers


It looks like your case is a typical one for ClickHouse. Use the ReplicatedMergeTree table engine: https://clickhouse.tech/docs/en/engines/table-engines/mergetree-family/replication/
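
A minimal sketch of what that could look like (untested; the ZooKeeper path and the {shard}/{replica} macros depend on how the cluster is configured, and the schema is just the one implied by the question):

from clickhouse_driver import Client

client = Client("clickhouse-host")

# Each replica of the table registers itself under the given ZooKeeper
# path; writes to one replica are replicated to the others.
client.execute("""
    CREATE TABLE IF NOT EXISTS data (
        time UInt64,
        sensor_location String,
        value Float64
    )
    ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/data', '{replica}')
    ORDER BY (sensor_location, time)
""")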

3 months ago · Santiago Trujillo


Take a look at the VictoriaMetrics time series database. It easily handles a 100k rows/sec ingestion rate on a single node with a few CPU cores. It is optimized for storing and querying trillions (10^12) of rows (see its case studies), and it also scales to multiple nodes (see the docs for the cluster version).

It also provides the MetricsQL query language, which is optimized for typical production time series queries. For instance, the following query would return the time series for all temperature sensors at home: temperature{sensor_location="home"}.
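
If it helps, such a query can be issued over the Prometheus-compatible HTTP API that VictoriaMetrics exposes; a rough sketch (host, port and the time range are placeholders):

import requests

# Sketch: fetch the "temperature" series tagged sensor_location="home"
# over a time range via the Prometheus-compatible query_range endpoint.
resp = requests.get(
    "http://victoria-metrics:8428/api/v1/query_range",
    params={
        "query": 'temperature{sensor_location="home"}',
        "start": 1000,
        "end": 2500,
        "step": "1s",
    },
)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    print(series["metric"], len(series["values"]))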

3 months ago · Santiago Trujillo


You should be aware of Warp 10. It is scalable and looks like a good fit for your use case.

Since you process the records using Python, the fact that it is well integrated with Python should be relevant to you. It supports both Pickle and Arrow for transferring data to Python, and you can also distribute the processing using its Spark integration.

3 months ago · Santiago Trujillo