From August 26 to 30, VLDB 2024, a premier international conference in the database field, was held in Guangzhou. Three research papers presenting the latest advances in IoTDB were accepted by the conference. Additionally, at the TPCTC 2024 conference hosted by the Transaction Processing Performance Council (TPC), the IoTDB team was invited to present a paper, deliver a keynote talk, and organize a panel discussion centered on timeseries databases.
The Significance of TPC and TPCTC Conferences
The Transaction Processing Performance Council (TPC) is a non-profit organization established in 1988 and is one of the most authoritative bodies for database performance benchmarking. Most of the world's leading database vendors and their enterprise products, including Oracle, Microsoft SQL Server, IBM DB2, and Databricks, have participated in TPC benchmarks. At the invitation of the TPC committee, a group of academic and industrial experts gathered at TPCTC 2024 for a panel discussion on "Benchmarking Timeseries Databases: Current State and Future Perspectives."
Special guests included:
Lei Chen: Dean of the Information Hub at The Hong Kong University of Science and Technology (Guangzhou), VLDB 2024 Chair
Jianmin Wang: Dean of the School of Software at Tsinghua University and Executive Director of the National Engineering Research Center for Big Data Software
Raghu Nambiar: AMD Global Vice President and CTO, TPCTC 2024 Chair
Hongzhi Wang: Chair of the Department of Computer Science and Engineering, Harbin Institute of Technology
Mingsheng Long: Associate Professor, School of Software, Tsinghua University
Qiang Li: Deputy General Manager and CTO of CISDI Information Technology (Chongqing) Co., Ltd.
Pengcheng Zheng: Managing Director of Timecho Europe
The discussion explored IoT scenarios from both academic and industrial perspectives, focusing on the integration of cutting-edge AI and machine learning technologies, addressing challenges in timeseries database research and industrial practice, and shaping benchmarks for evaluating timeseries databases.
1. Why Are Database Benchmarks Necessary?
When discussing database systems, particularly timeseries databases, experts agreed that standardized benchmarks are not only essential tools for evaluating database performance but also foundational for driving technological innovation.
Prof. Lei Chen, Dean of the Information Hub at The Hong Kong University of Science and Technology (Guangzhou), pointed out that existing general-purpose database benchmarks fail to capture the specific performance requirements of timeseries databases in IoT scenarios, especially the dual demands of high throughput and low latency. Without benchmarks specialized for timeseries databases, researchers and end-users cannot compare different systems fairly and objectively.
Prof. Lei Chen, Hong Kong University of Science and Technology (Guangzhou), VLDB 2024 Chair
Prof. Jianmin Wang, Dean of the School of Software at Tsinghua University, emphasized that benchmarks provide a standardized method for comparing database solutions, ensuring that performance evaluations are conducted under consistent conditions. This avoids biases introduced by vendor-optimized tests and offers users an objective framework for making informed decisions between competing database systems.
Prof. Jianmin Wang, Tsinghua University
Additionally, benchmarks play a pivotal role in driving innovation by providing developers with clear performance targets. Feedback from benchmarks allows developers to refine system architectures to meet evolving performance standards. Through this iterative process, benchmarks foster transparency, objectivity, and continual advancement in database technology.
To ensure fairness, experts highlighted the need for stringent oversight of benchmark tools, testing processes, parameter configurations, and hardware/software environments to guarantee consistency and accuracy in performance evaluations.
2. Key Characteristics of Databases for IoT Scenarios
✍ Keywords: Embedded-edge-cloud, high cardinality, scalability, out-of-order data, reliability and maintainability, AI integration
The role of timeseries databases in IoT environments was a central topic. The rise of Industry 4.0 has amplified the complexity of IoT systems, which typically span multiple layers—including data-generating devices, gateways for aggregation and transmission, and data centers for processing and storage. Timeseries databases must address the challenges of distributed edge devices and centralized data management. Building a robust IoT data management infrastructure requires optimizing the performance of each layer to ensure seamless end-to-end responsiveness.
Prof. Jianmin Wang highlighted the massive data traffic characteristic of IoT scenarios, driven by high-frequency measurements and vast numbers of sensors and devices. This leads to the problem of high cardinality. For example, Changan Automobile’s connected vehicle platform manages over 1.5 billion measurement points, necessitating databases capable of efficiently handling diverse and complex datasets.
Analyzing the evolution of traditional relational databases, Prof. Lei Chen proposed that, akin to relational algebra for relational databases, timeseries databases should feature native algebra support to efficiently represent, store, and query data. Simply adapting relational database structures to timeseries scenarios is inadequate; purpose-built mechanisms are crucial for ensuring processing efficiency. Experts agreed that timeseries data should be treated as "first-class citizens" in data management.
Scalability, both vertical and horizontal, is another core characteristic. With the growing number of connected devices and data volumes, databases must scale effectively to maintain performance as workloads increase.
Dr. Qiang Li, CTO of CISDI, emphasized that industrial applications demand stability and maintainability. Industrial systems often require 24/7 uptime, and database failures or interruptions can lead to significant economic losses. Thus, timeseries databases must ensure long-term stability, ease of maintenance, and upgradeability. Pengcheng Zheng added that in industrial scenarios, network latency or sensor malfunctions often result in out-of-order data. Traditional databases, designed around the assumption of in-order arrivals, suffer performance degradation when handling such cases. IoT databases must process out-of-order data efficiently while maintaining write and query performance.
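To make the cost of late arrivals concrete, here is a minimal Python sketch of one common coping strategy: buffer incoming points, then sort and merge late ones into a timestamp-ordered run at flush time so that range scans stay sequential. The buffer size, flush policy, and in-memory layout are illustrative assumptions for this sketch, not the design of IoTDB or any other specific system.

```python
import bisect
from typing import List, Tuple

Point = Tuple[int, float]  # (timestamp_ms, value)

class OutOfOrderBuffer:
    """Toy ingestion path: late points land in an unsorted buffer that is
    sorted and merged into the sealed, timestamp-ordered run on flush."""

    def __init__(self, flush_threshold: int = 4) -> None:
        self.sealed: List[Point] = []   # always ordered by timestamp
        self.buffer: List[Point] = []   # may contain out-of-order points
        self.flush_threshold = flush_threshold

    def write(self, ts: int, value: float) -> None:
        self.buffer.append((ts, value))
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        # Sorting the small buffer and merging keeps the sealed run ordered,
        # so a range query remains a single sequential scan.
        for point in sorted(self.buffer):
            bisect.insort(self.sealed, point)
        self.buffer.clear()

    def range_query(self, start: int, end: int) -> List[Point]:
        self.flush()  # make recent writes visible
        lo = bisect.bisect_left(self.sealed, (start, float("-inf")))
        hi = bisect.bisect_right(self.sealed, (end, float("inf")))
        return self.sealed[lo:hi]

buf = OutOfOrderBuffer()
for ts, v in [(100, 1.0), (300, 3.0), (200, 2.0), (150, 1.5)]:  # 200 and 150 arrive late
    buf.write(ts, v)
print(buf.range_query(100, 250))  # [(100, 1.0), (150, 1.5), (200, 2.0)]
```

Real engines face the same trade-off at far larger scale: the more out-of-order data arrives, the more merge work must happen before queries can scan a single ordered run, which is why the panelists singled this out as a distinguishing workload characteristic.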
Dr. Qiang Li, CISDI Information Technology
Looking ahead, the development of artificial intelligence (AI) technology holds tremendous potential for advancing database systems. Dr. Raghu Nambiar, Global Vice President of AMD, remarked that over the past decades, we have witnessed technological transformations such as the internet, IoT, cloud-native systems, and AI, highlighting the growing importance of data. Extracting valuable insights from the massive data generated by connected devices has become increasingly critical. Native timeseries databases, with their capabilities for efficient management and utilization of timeseries data, are now more vital than ever. By leveraging AI technologies, databases can unlock the latent value of timeseries data, offering more intelligent real-time analytics and predictive capabilities.
Dr. Raghu Nambiar, AMD, TPCTC 2024 Chair
Prof. Jianmin Wang noted that the development of native timeseries databases, combined with optimization for timeseries-specific operations, presents significant opportunities for the field. By focusing on high-cardinality data characteristics, native support for timeseries operations, real-time responsiveness, scalability, out-of-order data handling, stability, and deep integration with AI, timeseries databases can meet the stringent demands of IoT applications, driving the continuous evolution of both consumer-grade and industrial-scale connected systems.
3. Integrating AI Technologies with Database Systems for IoT Scenarios
✍ Keywords: Timeseries large models, anomaly detection, prediction, in-database inference/training, reliability, interpretability
AI technologies, such as machine learning, deep learning, natural language processing, and large models, are playing an increasingly significant role in managing and analyzing timeseries data in IoT scenarios. However, integrating AI with database systems presents numerous challenges.
(1) The Role of AI in Timeseries Databases
AI enhances multiple aspects of database performance, from query optimization to storage efficiency. Timeseries data, characterized by high frequency, large volume, and real-time requirements, presents unique challenges to database management systems (DBMS). AI can help address these challenges, for example by improving query performance and by enabling anomaly detection and trend forecasting.
Prof. Mingsheng Long, Associate Professor at Tsinghua University, highlighted the potential of large-scale pretrained models for timeseries analysis. These models can handle tasks like prediction, data imputation, and classification. By integrating these AI-driven capabilities into databases, systems can perform real-time predictions, anomaly detection, and pattern recognition without external processing. Direct in-database training enables continuous learning from incoming data streams, dynamically adapting to changing patterns and improving resource efficiency.
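As a rough illustration of what such in-database, continuously learning inference could look like, the Python sketch below maintains an exponentially weighted one-step forecast over a stream and flags points whose error exceeds a threshold. The EWMA model, the smoothing factor, and the threshold are placeholder assumptions standing in for the pretrained timeseries models discussed here.

```python
from dataclasses import dataclass

@dataclass
class StreamingForecaster:
    """Toy stand-in for an in-database model: keeps an exponentially
    weighted moving average (EWMA) as a one-step forecast and flags points
    whose error exceeds k standard deviations. alpha and k are assumptions."""
    alpha: float = 0.3   # smoothing factor for the EWMA forecast
    k: float = 3.0       # anomaly threshold in standard deviations
    mean: float = 0.0
    var: float = 0.0
    seen: int = 0

    def update(self, value: float) -> tuple[float, bool]:
        """Return (forecast_for_this_point, is_anomaly), then learn from it."""
        forecast = self.mean
        error = value - forecast
        is_anomaly = self.seen > 10 and error * error > self.k**2 * self.var
        # Online update: continuous learning from the incoming stream.
        self.mean += self.alpha * error
        self.var = (1 - self.alpha) * (self.var + self.alpha * error * error)
        self.seen += 1
        return forecast, is_anomaly

model = StreamingForecaster()
for v in [10.0, 10.2, 9.9, 10.1, 10.0, 10.3, 9.8, 10.1, 10.0, 10.2, 10.1, 55.0]:
    forecast, flagged = model.update(v)
    if flagged:
        print(f"anomaly: observed {v}, forecast {forecast:.2f}")
```

Because the model state lives next to the data and updates on every write, there is no round trip to an external service, which is the main appeal of in-database inference that Prof. Long described.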
Prof. Mingsheng Long, Tsinghua University
While large models offer powerful solutions, the heterogeneity of IoT systems requires flexible adaptation to varying application needs. For instance, lightweight models may be necessary for edge devices to process data locally while preserving privacy. Prof. Lei Chen emphasized that AI model design should be tailored to specific applications to strike a balance between real-time performance and privacy protection. Developing models specialized for timeseries data can automate various tasks within databases.
However, challenges such as model interpretability and reliability remain. Prof. Hongzhi Wang, Chair of the Department of Computer Science and Engineering at Harbin Institute of Technology, stressed the importance of ensuring the transparency and trustworthiness of AI models, especially in high-stakes domains like industrial automation and healthcare.
Prof. Hongzhi Wang, Harbin Institute of Technology
(2) AI-Driven Optimization of Benchmarks
AI not only enhances database performance but also revolutionizes benchmarking methods. Traditional benchmarks operate like standardized "tests," with databases evaluated based on predefined workloads. AI introduces the potential for more dynamic "interview-style" evaluations, where benchmarks adapt to the unique characteristics of each database. This approach is particularly significant for timeseries databases, as application requirements vary widely. AI-driven benchmarks can better assess database performance for specific workloads, offering valuable insights for developers and users.
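One way to picture an "interview-style" benchmark is a driver that re-weights its operation mix toward whatever the system under test handles worst, instead of replaying one fixed script. The Python sketch below simulates this idea; the operation set, the p95-based re-weighting rule, and the latency distributions are illustrative assumptions, not part of any TPC specification.

```python
import random
from statistics import quantiles

OPS = ["ingest", "point_query", "range_scan"]

def run_op(op: str) -> float:
    """Simulated per-operation latency in ms; a real driver would time
    the system under test. These distributions are placeholders."""
    base = {"ingest": 1.0, "point_query": 2.0, "range_scan": 8.0}[op]
    return random.expovariate(1.0 / base)

weights = {op: 1.0 for op in OPS}  # start with a balanced mix
for round_no in range(5):
    latencies = {op: [] for op in OPS}
    for _ in range(3000):
        op = random.choices(OPS, weights=[weights[o] for o in OPS])[0]
        latencies[op].append(run_op(op))
    # Re-weight toward the slowest tail: the next round asks harder
    # "follow-up questions" where the system struggled most.
    p95 = {op: quantiles(latencies[op], n=20)[18] for op in OPS}
    total = sum(p95.values())
    weights = {op: p95[op] / total for op in OPS}
    print(round_no, {op: f"{w:.2f}" for op, w in weights.items()})
```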
Despite these advancements, challenges such as the reliability of AI-generated data must be addressed. Generative AI could produce inaccurate data, distorting benchmark results. Human oversight remains essential to ensure the credibility of AI-assisted benchmarks.
Prof. Jianmin Wang emphasized that introducing AI technologies into database systems necessitates the development of new benchmark methods and AI-driven workloads tailored to specific requirements. Traditional database operations, such as Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP), have well-established benchmarks. However, these standards do not adequately address the demands of AI workloads. Thus, new benchmarks should evaluate database performance in areas like query optimization, data consistency, and real-time capabilities.
4. TPCx-IoT Benchmarking
As IoT environments expand rapidly, businesses face the dual challenge of managing massive data streams and deriving meaningful insights from them. To evaluate the performance of hardware and software systems for such tasks, the TPC committee introduced the TPCx-IoT benchmark in 2017. This standard provides a framework for assessing the scalability and efficiency of IoT systems in handling real-time, high-throughput workloads.
The TPCx-IoT benchmark is based on the widely used Yahoo Cloud Serving Benchmark (YCSB) framework and simulates workloads such as managing substation sensor data. These workloads include continuous data flows from edge devices to gateways and backend data centers. The benchmark focuses on tasks like data ingestion and edge analytics, which are critical for real-time IoT processing and insights.
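For intuition about the workload's shape, here is a minimal Python sketch of a YCSB-style loop in the spirit of TPCx-IoT: high-rate writes keyed by substation and sensor, interleaved with short time-window scans that stand in for edge analytics. The record layout, rates, and the in-memory `store` are illustrative assumptions; the actual benchmark kit drives a real database system under test.

```python
import random
import time
from collections import defaultdict

# Illustrative in-memory stand-in for the system under test.
store = defaultdict(list)  # key "substation.sensor" -> [(ts_ms, value)]

SUBSTATIONS = [f"substation_{i}" for i in range(4)]
SENSORS = [f"sensor_{i}" for i in range(50)]

def ingest(n_records: int) -> None:
    """Continuous high-throughput writes from edge devices."""
    for _ in range(n_records):
        key = f"{random.choice(SUBSTATIONS)}.{random.choice(SENSORS)}"
        ts = int(time.time() * 1000)
        value = random.random() * 100  # stand-in for a sensor payload
        store[key].append((ts, value))

def analytics_scan(key: str, window_ms: int = 5_000) -> float:
    """Edge-analytics style query: average over a recent time window."""
    cutoff = int(time.time() * 1000) - window_ms
    recent = [v for ts, v in store[key] if ts >= cutoff]
    return sum(recent) / len(recent) if recent else float("nan")

# Interleave bulk writes with occasional scans, mimicking the
# ingestion-dominated mix the benchmark is built around.
for _ in range(5):
    ingest(n_records=200)
    key = f"{random.choice(SUBSTATIONS)}.{random.choice(SENSORS)}"
    print(f"avg over last 5 s for {key}: {analytics_scan(key):.2f}")
```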
Dr. Raghu Nambiar emphasized that TPCx-IoT employs an objective and neutral framework to evaluate the performance and cost-effectiveness of systems under real-world conditions. The benchmark ensures fair and consistent assessments, verifying system reliability for sustained performance. This provides industrial users with objective and verifiable performance evaluations.
Several database systems have been evaluated using TPCx-IoT, including the enterprise-grade timeseries database TimechoDB (based on Apache IoTDB), the scalable Machbase, the cloud-native Lindorm, and the widely used distributed storage system HBase. These systems have demonstrated diverse performance characteristics when handling different IoT workloads. In recent benchmark runs, TimechoDB, built on Apache IoTDB's native timeseries technology, set a new performance record and ranked first.
TPCx-IoT Benchmark Top Results
5. Future Optimization and Development of the TPCx-IoT Benchmark
At the conclusion of the roundtable discussion, invited experts shared insights and recommendations for the future optimization and development of the TPCx-IoT benchmark. Key improvement areas included diversifying data types, handling out-of-order data, enhancing query performance, integrating AI-driven analytical workloads, and addressing industry-specific needs.
Prof. Jianmin Wang proposed two suggestions: First, benchmark development should attract greater participation from industry stakeholders. Second, he emphasized the need for more involvement from academia to collaboratively explore shared challenges in database development, thus driving continuous improvement in databases and benchmarks.
Prof. Lei Chen highlighted the critical role of AI in advancing benchmarks, particularly in the field of Artificial Intelligence of Things (AIoT). He suggested defining foundational models and operations required for benchmark design and continuously enhancing their comprehensiveness.
Dr. Raghu Nambiar noted that TPC’s organization of conferences like TPCTC helps maintain technological relevance and adapt to developments in areas such as timeseries databases and AI. TPC’s dedication to establishing industry standards benefits both academia and the commercial sector.
Prof. Hongzhi Wang stressed that different subdomains may require distinct benchmarks. He recommended involving domain experts in benchmark development to ensure that database evaluations reflect the unique needs of specialized timeseries databases, rather than relying on generic systems like MySQL or Oracle.
Prof. Mingsheng Long suggested incorporating AI-related requirements into future benchmarks and collecting feedback from AI researchers to better involve experts in improving timeseries data analysis, forecasting, and large models.
Dr. Qiang Li pointed out that TPC benchmarks serve as valuable references for solution providers, helping them identify risks and performance metrics. He advocated closer collaboration with industry to identify additional performance metrics relevant to specific sectors, enabling more targeted benchmarks.
Pengcheng Zheng emphasized that benchmarks should fully consider the demands of real-world production environments, such as multidimensional workloads, diverse data types, aggregation queries, and out-of-order data handling. This would prevent benchmarks from becoming mere "tests for testing’s sake" and ensure that the results hold practical significance.
Conclusion
As industrial digitalization accelerates, the volume of data in IoT scenarios continues to grow, making the innovation of timeseries databases and comprehensive benchmarking methods increasingly critical. Benchmarks like TPCx-IoT play an irreplaceable role in evaluating database performance under complex IoT conditions.
This roundtable highlighted the importance of standardized benchmarks and key characteristics required for databases in real-time, high-frequency scenarios. While AI introduces significant opportunities for database systems and benchmarking, it also brings challenges related to interpretability, reliability, and complexity.
The experts offered specific suggestions for optimizing TPCx-IoT, including diversifying data types, improving out-of-order data handling, enhancing query capabilities, integrating AI-driven analytical workloads, and aligning benchmarks closely with industrial scenarios. These improvements will help benchmarks better reflect real-world applications, especially when addressing unique challenges in different industries.
Looking ahead, benchmarks must evolve alongside technological advancements to remain relevant in both traditional and AI-enhanced IoT scenarios. Through close collaboration between academia and industry, benchmarks will continue to drive innovation and optimization in timeseries database technologies, enabling them to meet the growing complexity of IoT applications.