Wang Jianmin Talks About Industrial Big Data Software in Universities | ApacheCON Asia 2021

Author: Jianmin Wang

Design: Ying Zhou

Editor in charge: Mingkang Li

"Open source in universities serves to cultivate international talent, especially in the software field, where the open source innovation model is a basic norm. If our graduates do not understand the logic of open source, it will be difficult for them to lead the world on the stage of software engineering. Therefore, we must do a good job in open source education and cultivate college students with open source thinking; we must disseminate scientific research results and provide support for open source development; and we must vigorously create an atmosphere in which the whole of society values and participates in open source, in order to help China's software industry go global."

These remarks were made by Wang Jianmin, Dean of the School of Software, Tsinghua University, at the Open Science and Open Source Innovation and Development Sub-forum of the 3rd World Forum on Science and Technology and Development [1]. This time, we would like to share a similar speech given by Dean Wang at the ApacheCON Asia 2021 annual meeting. The speech not only introduces the current state of industrial big data, but also explains why his team chose open source for this work. It also covers the specific practice of the School of Software at Tsinghua University in open source innovation and open source education. We hope this article will be helpful to readers interested in open source software for industrial big data and in open source education.

Part 1 Speaker Introduction

Professor Jianmin Wang

Professor Jianmin Wang is Dean of the School of Software and Deputy Dean of the School of Information at Tsinghua University, Executive Director of the National Engineering Laboratory for Big Data System Software, Leader of the National 863 Program Advanced Manufacturing Expert Group, and a member of the National Industrial Internet Strategy Advisory Expert Committee. His main research fields are big data and knowledge engineering, including unstructured data management, business process and product life cycle management, digital copyright and system security technology, and database testing technology.

Part 2 Excerpts from the text of the speech[2]

It is an honor to share with you our work on industrial big data software research and open source practice at Tsinghua University.

Today, big data is not a new word. There are many leaders in the Internet industry in China, such as Alibaba, Tencent, Baidu, and Huawei. They are all strong players in the field of big data, and most of them are consumer-oriented.

However, if we look closely at the Chinese economy, we will find that China still lags in some major areas of big data application, such as manufacturing, construction, and transportation. Today, these industries face two major challenges: a lack of talent with a deep understanding of advanced big data technologies, and a lack of available technology for solving the specific problems at hand. At the same time, big data has many new areas of focus, such as artificial intelligence, machine learning, and data science.

1. The source of industrial big data

Our mission is to innovate big data technologies and applications for these industries. According to the big data report released by the McKinsey Global Institute in 2011, the data volume of the manufacturing industry even exceeds that of the financial industry. Where does industrial big data come from?

The first data source is enterprise information systems, such as CAD systems, PDM and PLM systems, and ERP and CRM systems (these systems have been used by enterprises since the 1960s). The second source of data is the Industrial Internet of Things, which started in the early 2000s and connects equipment such as airplanes and wind turbines. Industrial IoT data, also known as machine equipment data or working condition data, constitutes the main body of industrial big data. The third source of data is cross-domain data from the Internet, such as meteorological, geographic, and environmental data, which is readily available in today’s AI era.

Where does the Industrial Big Data Come from?

The first data source of industrial big data is the enterprise information system. The data in an enterprise information system includes unstructured data, such as 2D engineering drawings, 3D part models, service cards, and business documents, which are usually stored in the file system. It also includes structured data, such as bills of materials and the metadata of part items, including the file paths of a product's unstructured data, which are stored in a relational DBMS.
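The split between file-system storage and relational metadata can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for the relational DBMS; the table name, columns, part numbers, and file paths are all hypothetical and are not taken from any system mentioned in the speech.

```python
import sqlite3

# In-memory database standing in for the enterprise relational DBMS (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE part_document (
        part_no   TEXT NOT NULL,   -- structured: part item identifier
        doc_type  TEXT NOT NULL,   -- e.g. '2D drawing', '3D model'
        file_path TEXT NOT NULL    -- where the unstructured file lives
    )
    """
)

# Hypothetical rows: the files themselves stay in the file system,
# only their metadata and paths are stored relationally.
rows = [
    ("P-1001", "2D drawing", "/plm/drawings/P-1001.dwg"),
    ("P-1001", "3D model", "/plm/models/P-1001.step"),
    ("P-2002", "service card", "/plm/service/P-2002.pdf"),
]
conn.executemany("INSERT INTO part_document VALUES (?, ?, ?)", rows)


def find_documents(part_no):
    """Return (doc_type, file_path) pairs recorded for one part."""
    cursor = conn.execute(
        "SELECT doc_type, file_path FROM part_document "
        "WHERE part_no = ? ORDER BY doc_type",
        (part_no,),
    )
    return cursor.fetchall()


print(find_documents("P-1001"))
```

A query on the structured side returns the paths, and the application then opens the drawings or models from the file system.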

According to the theory of PLM (Product Lifecycle Management), the design and manufacturing stage of a product is called the beginning of life (hereinafter BOL), and the maintenance and service stage of a product is called the middle of life (hereinafter MOL).

To meet the requirement of a bidirectional connection between BOL data and MOL data, we introduced a neutral BOM (Bill of Materials) structure that coordinates the BOM of the design and manufacturing stage with the BOM of the service stage. The neutral BOM reduces the complexity of associating BOMs from different life cycle stages. It is widely used in enterprises (such as Dongfang Steam Turbine Co., Ltd.) and has been released as a national standard.
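The idea of routing lookups through a neutral BOM can be sketched as follows. This is only an illustration of the hub principle, not the schema of the national standard: the item identifiers are hypothetical, and the sketch assumes a one-to-one correspondence between items, which real BOMs need not satisfy.

```python
# Hypothetical identifiers: each life cycle BOM maps its own item IDs
# to a shared neutral item ID.
design_to_neutral = {"D-ROTOR-01": "N-ROTOR", "D-BLADE-07": "N-BLADE"}
service_to_neutral = {"S-ROTOR-A": "N-ROTOR", "S-BLADE-K": "N-BLADE"}


def invert(mapping):
    """Build the reverse lookup (neutral ID -> stage-specific ID)."""
    return {neutral: local for local, neutral in mapping.items()}


neutral_to_design = invert(design_to_neutral)
neutral_to_service = invert(service_to_neutral)


def design_to_service(design_id):
    """BOL -> MOL: go through the neutral BOM instead of a direct mapping."""
    return neutral_to_service[design_to_neutral[design_id]]


def service_to_design(service_id):
    """MOL -> BOL: the same neutral hub serves the reverse direction."""
    return neutral_to_design[service_to_neutral[service_id]]


print(design_to_service("D-ROTOR-01"))
print(service_to_design("S-BLADE-K"))
```

With k stage-specific BOMs, a hub of this kind needs k mappings to the neutral BOM rather than k(k-1)/2 pairwise mappings, which is one way to read the reduction in complexity described above.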

The second data source of industrial big data is industrial IoT data from engineering equipment or mechanical equipment. For equipment to operate efficiently, we need to collect, store, and analyze as much operating data as possible. Original equipment manufacturers (e.g., Sany, Zoomlion) embed many sensors in their machines. Take excavators as an example. When they work on construction sites, sensors collect data and send it to cloud data centres through Wi-Fi and 5G networks. This data records the operating status of the machines. For example, when the equipment moves from one site to another, we collect data on its speed, location, and fuel consumption; when the equipment works, we collect data such as chassis angle and pump pressure. Assuming that a device has an average of 500 sensors, 10,000 such engineering devices generate more than 50 billion records every year.
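The figures above can be checked with a short calculation. The sketch assumes that one record is one reading from one sensor and that sampling is spread evenly over the year; neither assumption is stated in the speech.

```python
DEVICES = 10_000                   # engineering devices (from the text)
SENSORS_PER_DEVICE = 500           # average sensors per device (from the text)
RECORDS_PER_YEAR = 50_000_000_000  # lower bound quoted in the text

MINUTES_PER_YEAR = 365 * 24 * 60

# Total number of sensor time series across the fleet.
series_count = DEVICES * SENSORS_PER_DEVICE

# Records each sensor would need to contribute to reach the quoted total.
records_per_series = RECORDS_PER_YEAR // series_count

# Average spacing between two records of the same sensor (assumes even sampling).
minutes_between_records = MINUTES_PER_YEAR / records_per_series

print(series_count)
print(records_per_series)
print(round(minutes_between_records, 2))
```

Under these assumptions, 5 million sensors each contributing about 10,000 records a year, or roughly one record every 53 minutes, already reach 50 billion records; denser sampling pushes the total far higher.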

Today, IoT data has become the main body of industrial big data, and it will still dominate the total amount of industrial big data in the future.

The third source of data is cross-domain data from the Internet and third parties. According to Michael Porter's article, information technology is revolutionizing industrial products. In the future, most products will be connected to the Internet and become smart products. For example, agricultural equipment systems will work with weather data systems, seed optimization systems, and irrigation systems, which means that data from the Internet and third-party systems will be integrated and aggregated with enterprise data and industrial IoT data. Since June 2013, we have been building a climate big data integrated processing system in cooperation with the National Meteorological Center of China. Currently, it manages 1,073 unstructured real-time weather data, and its predictions have been used by Goldwind to allow wind turbines to run smoothly and generate more power.

2. Application scenarios of industrial big data

As we have seen, industrial big data comes from three sources: enterprise information systems, industrial IoT, and the Internet, and can be used in four application scenarios:

  • Scenario 1, monitoring and alerting. Industrial IoT data and cross-domain data can be used to monitor the working status of equipment, social events, and abnormal alarms, and even to perform closed-loop control on the equipment.

  • Scenario 2, query and search. The data accumulated in enterprise information systems such as ERP, PLM, and SCM has a high value density and serves as the master data of industrial big data. On the one hand, this data is used for query and search tasks in the day-to-day operations of the enterprise. On the other hand, with enterprise information system data as the master data, industrial IoT data and cross-domain data are organized together to form an industrial data lake.

  • Scenario 3, processing and reporting, that is, business intelligence (hereinafter BI) applications. Industrial raw data stored in a data lake needs to be processed (transformed from one dataset to another), usually into key performance indicators (KPIs). The processing results are delivered as reports, which is a typical BI application scenario.

  • Scenario 4, decision-making and prediction, that is, artificial intelligence (AI) applications. If BI applications only complete the transformation between datasets, then what AI applications do is extract knowledge from datasets, especially training datasets. Today, machine learning is the mainstream of artificial intelligence applications. Therefore, neural networks produced by deep learning or transfer learning can be applied to decision-making and predictive work.
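Scenarios 1 and 3 can be made concrete with a minimal sketch. The readings, the device IDs, and the pressure threshold are hypothetical; the sketch only shows the difference between flagging individual readings (monitoring and alerting) and transforming a raw dataset into a KPI (processing and reporting).

```python
from collections import defaultdict

# Hypothetical raw readings: (device_id, day, pump_pressure_bar, fuel_litres)
readings = [
    ("EX-01", "2021-08-01", 310.0, 42.0),
    ("EX-01", "2021-08-01", 365.0, 40.0),
    ("EX-02", "2021-08-01", 298.0, 55.0),
    ("EX-01", "2021-08-02", 305.0, 38.0),
]

PRESSURE_LIMIT_BAR = 350.0  # assumed alert threshold, for illustration only


def pressure_alerts(rows, limit=PRESSURE_LIMIT_BAR):
    """Scenario 1: flag readings whose pump pressure exceeds the limit."""
    return [(dev, day, p) for dev, day, p, _fuel in rows if p > limit]


def daily_fuel_kpi(rows):
    """Scenario 3: transform raw readings into a KPI (total fuel per day)."""
    totals = defaultdict(float)
    for _dev, day, _p, fuel in rows:
        totals[day] += fuel
    return dict(totals)


print(pressure_alerts(readings))
print(daily_fuel_kpi(readings))
```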

In the above four application scenarios, the industrial big data life cycle can be divided into five stages: collection, management, processing, analysis and application.

3. Our Industrial Big Data Project

In specific industrial big data applications, these five stages may be intertwined. The industrial big data software stack is consistent with the big data life cycle and the DIKW pyramid (the Data-Information-Knowledge-Wisdom model). Considering the five-stage life cycle of data and its application in four scenarios, we propose a new big data software architecture: Tsinghua Dataway. Within it, we have developed several projects for industrial big data needs (the light yellow box in the figure below), such as IoTDB, TsClean, Flok, AnyLearn, and AutoVis. Due to limited time, I will share three of them with you.


The first is the Digital Framework (DWF), a rapid development platform for data-intensive applications. It serves two purposes. The first is rapid development: we adopt a model-driven architecture to change the way applications are implemented, moving from traditional hard-coding to lightweight configuration so that even junior engineers can create applications in a low-code way. The second is support for data-intensive applications: DWF has an underlying model that makes it easy for different big data components (such as Hadoop and Spark) to collaborate and to be integrated into an application, so users can use the framework as a data bus, a control bus, and an interaction bus.

DWF Flok is the control bus for big data processing and is responsible for managing the workflow between big data software components. CRISP-DM, a widely recognized paradigm for industrial big data analysis, contains six steps: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. Flok supports the rapid construction of data processing workflows that follow this paradigm via drag-and-drop operators, and has over 180 built-in algorithms.
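The notion of a workflow built from chained operators can be sketched in a few lines. This is not Flok's API, which the speech does not describe; the operator names and the trivial "model" are hypothetical and only illustrate how the output of a data preparation step feeds a modeling step.

```python
from statistics import mean


def drop_missing(values):
    """Data preparation: remove missing readings."""
    return [v for v in values if v is not None]


def scale_to_unit(values):
    """Data preparation: rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def fit_mean_model(values):
    """Modeling: a trivial 'model' that predicts the mean of its input."""
    return mean(values)


def run_workflow(data, operators):
    """Apply operators in order, feeding each output to the next operator."""
    for op in operators:
        data = op(data)
    return data


workflow = [drop_missing, scale_to_unit, fit_mean_model]
result = run_workflow([10.0, None, 20.0, 30.0], workflow)
print(result)
```

Representing the workflow as an ordered list of operators is what makes it easy to rearrange, which is the property a drag-and-drop editor relies on.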


The second is AnyLearn, a cloud-native machine learning system for AI applications. AnyLearn is built for expert users in the industrial domain who have extensive industry knowledge but are not experts in machine learning. AnyLearn offers a variety of user-friendly interfaces, such as an interactive web interface, Jupyter Notebook, and a command line. Additionally, AnyLearn makes it easy to deploy models in production, whether in the cloud or at the edge. Finally, AnyLearn has transfer learning built in as an inherent capability, which suits the many similar scenarios found in industry.

AnyLearn provides a library of algorithms for different domains, such as weather forecasting and wind forecasting, and a transfer learning framework developed by our team. AnyLearn also provides the AnyLearn edge inference engine for industrial IoT scenarios, which runs on Android and real-time Linux operating systems and allows inference results from ML models to be merged with state diagrams (e.g., merging business rules and sensor data from real-time monitoring). We built a wind prediction test bench supported by AnyLearn and IoTDB in the laboratory, with the sensors installed on the roof of the east wing of our college. It collects sensor data with IoTDB via a solar-powered Raspberry Pi and provides wind forecasting services using AnyLearn as the engine. The machine learning models for wind forecasting are trained in the cloud.


The third is IoTDB, a time series database management system. There are three usage scenarios for this database. First, it can be used as a data file on a terminal device, where it provides a high compression ratio and a simple write and read system. Second, it can be used as a shop-floor or factory-level database, and it is more powerful in centralized control scenarios such as asset monitoring and processing. Finally, IoTDB natively supports big data analysis frameworks such as Spark and Hadoop, making it easier for industry to carry out industrial big data analysis, especially for cloud computing-based industrial Internet applications.

Launched in 2011, the IoTDB project grew out of helping Sany Heavy Industry upgrade its Enterprise Control Center (ECC) system, which monitors more than 100,000 devices worldwide. At the time, the ECC system stored device data in relational databases, but its performance was not sufficient for applications such as on-board concrete pump locking and diesel theft detection. After analyzing this application, we found the following three main challenges:

  1. In industrial IoT application scenarios, the metadata of a time series is defined by the terminal device; that is, a new time series may appear without back-end registration.

  2. In IIoT scenarios, we should process data as close to the field as possible, which is consistent with the L0 to L4 factory hierarchy.

  3. The data from the Industrial Internet is usually data about the health of machines, and signal processing functions are frequently applied to such IoT datasets.
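Challenges 1 and 3 can be illustrated with a toy example. The class below is not IoTDB's implementation or API; it is a hypothetical in-memory store, with a made-up series path, that creates a time series on its first write (no back-end registration) and then applies a simple signal processing function to the stored values.

```python
from collections import defaultdict


class SeriesStore:
    """Toy in-memory store; a series is created on its first write."""

    def __init__(self):
        self._series = defaultdict(list)  # path -> [(timestamp, value)]

    def write(self, path, timestamp, value):
        # No prior registration step: an unknown path simply starts a series.
        self._series[path].append((timestamp, value))

    def paths(self):
        return sorted(self._series)

    def values(self, path):
        # Use .get so that reading an unknown path does not create it.
        return [v for _t, v in sorted(self._series.get(path, []))]


def moving_average(values, window):
    """Simple signal processing function: mean over a sliding window."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


store = SeriesStore()
for t, v in [(1, 2.0), (2, 4.0), (3, 6.0), (4, 8.0)]:
    store.write("root.site1.excavator7.pump_pressure", t, v)

print(store.paths())
print(moving_average(store.values("root.site1.excavator7.pump_pressure"), 2))
```

A relational schema that must be altered before each new sensor can report is a poor fit for the first challenge, which is one reason a purpose-built time series store is attractive here.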

4. Why choose open source?

In 2015, we officially launched the development of a new version of IoTDB; in 2017, we open-sourced the code on GitHub; in November 2018, the Apache Software Foundation (ASF) accepted IoTDB as an incubator project; 20 months later, Apache IoTDB became a top-level project of the ASF.

Why open source? Nowadays, open source has become an innovative paradigm in the software industry and research field.

Recall two projects developed by Google: Android, a mobile operating system, and TensorFlow, an end-to-end open source platform for machine learning. The former has changed the world of mobile Internet, while the latter makes models easy to build and deploy.

In addition, open source is an effective means for universities to transfer technology to the outside world. Examples include Spark, a unified analysis engine for large-scale data processing, and Ray, a high-performance distributed execution framework, both of which come from the University of California, Berkeley.

Finally, open source is an educational platform for the new generation of software talent, where students can get in touch with the needs of real-life applications and build development skills that they cannot learn in school.

5. Why choose Apache?

When it comes to open source options, why choose Apache? Because if you just publish the code on GitHub, you may well end up working on the project by yourself. As we all know, the Apache Software Foundation has made many achievements in the past 20 years, but what is more important to us is Apache's culture. They believe in community over code, which means that a healthy community is more important than good code. Moreover, the ASF has a good code of conduct and respects everyone in the community.

6. Open source practice at the School of Software of Tsinghua University

Next, let me introduce our open source practice. I have been responsible for teaching database courses to senior undergraduates for 20 years. The goal of this course is for students to understand relational data models, the SQL language, database design methods, and the structure and implementation of a DBMS (database management system). One of the project assignments of this course is to develop a small database management system. I encourage students to learn from open source projects (previously using HSQL) and to do their best to contribute to the open source projects they study. In addition, software engineering is a very important discipline in our college, and it requires strong practical ability and practical experience. After participating in open source projects, students have a good understanding of the nature of agile development and the concept of SCRUM project management. They also gain experience with test-driven software development through unit testing, integration testing, and continuous integration testing. When they contribute code to the community, they also use checkers such as Sonar to find problems in the source code.

Open source platforms are a very important software engineering training environment. At the School of Software of Tsinghua University, we encourage students and teachers to contribute to open source software projects. Since 2018, the selection criteria for our student scholarships have changed: they no longer emphasise only the publication of papers, but also consider students' contributions to open source projects, such as submitting pull requests, fixing bugs, and putting forward new ideas that are adopted by projects. In addition, we actively promote open source culture through open source conferences and sharing meetings, inter-university project cooperation, and speeches.

7. Conclusion

At the end of this speech, I would like to emphasise that our mission is to innovate industrial big data technology and software tools so that applications can be created quickly and conveniently. We believe that industrial big data software and its applications will be a long-term undertaking in China and around the world; the Tsinghua Dataway software stack is our initial exploration in this direction and the starting point of our open source journey. Finally, I hope you will follow our Tsinghua Dataway project, and I invite you to participate in the development of the Apache IoTDB project.

Reference material

[1] Wang Jianmin, Dean of the School of Software of Tsinghua University: Cultivate open source thinking and help the software industry go global:

[2] Excerpt Source: Excerpt and translation from Mr. Wang Jianmin's English keynote speech "Industrial Big Data Software and Open Source Innovation" at the ApacheCon-Asia-2021 conference venue in August 2021. Translation: Wang Kehan, proofreader: Huang Xiangdong

This article is shared from WeChat official account - Kaiyuanshe, author: Wang Jianmin

Original release time: 2021-11-29