Workshop "Data Science for Engineers"

Venue

The event will take place on April 5th, 2018 at 13:00 in AVG 50.41 Room 045/046 (directions via KIT Campus Map).

To further promote interaction and cooperation between data science on the one side and the engineering disciplines on the other side within KIT, our research-training group is organizing a workshop "Data Science for Engineers", on April 5th 2018. Doctoral students from the research-training group will talk about their experience on how data-science methods/modern data-analysis methods may help or already have helped in concrete terms to solve engineering problems and to improve technical systems. The presenters will take their respective projects and use cases as starting points and will focus on how one can transfer ideas and results to other settings.

There will be ample space for questions from the audience and discussions.

Registration

This event is not open; invitations have already been sent out. If you, as an employee of KIT, are nevertheless very interested in participating, please send us an email (sekretariat Boehm ∂does-not-exist.ipd kit edu). If demand is high, we are in principle prepared to run this event again.

Program

13:00

Fundamentals

User-Centric Data Synthesis (Adrian Englhardt)

Data of various kinds is collected in virtually any branch of engineering. Systems exist to process such data and to cover different information needs. In many cases, however, not enough data is available to validate the full functionality and performance of a system that processes data. A common solution is to generate (additional) synthetic data. Such benchmark data must be diverse, to allow validation of a broad range of states of a system. Data-science research has developed a variety of data-generation approaches. However, most approaches are specific to one application and are difficult to generalize to other settings. In the project presented in this talk, we focus on the user perspective, i.e., how to synthesize data with little intellectual effort. In other words, we will discuss how to directly include a user in the synthesis process. Our core idea is that the system envisioned actively asks a user for feedback on data already generated. Additionally, we sketch our ideas on how including user feedback might help to generate sufficiently diverse data.

User-centric Outlier Detection for Experimental and Simulation Data (Holger Trittenbach)

Outlier detection is a data-mining paradigm to discover unusual objects in data. Typical use cases come from engineering disciplines where researchers want to identify observations that are exceptional compared to the bulk of their experimental or simulation data. Conventional methods report a ranking of potential outliers but leave further analysis to the user. However, the interpretation of outlier-detection results is often intricate, and not all outliers identified might be of equal interest.

In many scenarios, users are willing to give feedback to increase the quality of data mining. In our current research thread, we strive towards methods that query the user for information to improve the detection of outliers. In this talk, we discuss how such an interactive approach can help researchers to identify outliers and to explore interesting regions in the data space.

Generating Synthetic Outliers (Georg Steinbuss)

Outliers are data objects that are unusual in comparison with most other data objects. Many types of system faults reveal themselves as outliers in monitoring data. To illustrate, think of a temperature which is twice as high as the usual temperature of an engine. Often, monitoring data contains no or only a limited number of cases exhibiting system faults. Hence, it is difficult to develop and assess a detection model for faults based on outliers. In a current research project, we want to synthesize possibly novel types of faults by creating artificial outliers. The process we envision to this end is to create many artificial outliers and then check (by domain constraints) whether they potentially are realistic system faults. Finally, the resulting artificial system faults can be used either to assess an existing model or to develop a new one. Assessing a model with this procedure yields insights regarding the expected quality of the detection system, for example, which types of faults will be detected.
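The generate-then-check process described in this abstract can be illustrated with a minimal sketch. It is not the presenter's method: the sampling scheme, the 3-standard-deviation outlier rule, the function name synthesize_outliers, and the temperature bounds are all assumptions chosen only for illustration.

```python
import random

def synthesize_outliers(normal, n_candidates, is_plausible, scale=20.0, seed=0):
    """Sample candidates far from the bulk of `normal`; keep the plausible ones.

    Assumptions (illustrative only): one-dimensional data, a Gaussian
    candidate generator, and "outlier" meaning more than 3 standard
    deviations away from the mean of the observed data.
    """
    rng = random.Random(seed)
    mean = sum(normal) / len(normal)
    std = (sum((x - mean) ** 2 for x in normal) / len(normal)) ** 0.5
    kept = []
    for _ in range(n_candidates):
        # Draw from a much wider distribution than the observed data.
        candidate = rng.gauss(mean, scale * std)
        is_outlier = abs(candidate - mean) > 3 * std
        # The domain constraint decides whether the outlier could be a real fault.
        if is_outlier and is_plausible(candidate):
            kept.append(candidate)
    return kept

# Hypothetical engine temperatures in degrees Celsius.
engine_temps = [88.0, 90.5, 91.2, 89.7, 90.1, 92.3, 89.0, 90.8]
# Hypothetical domain constraint: the sensor can only report 20 to 150 degrees.
faults = synthesize_outliers(engine_temps, 1000, lambda t: 20.0 <= t <= 150.0)
```

The surviving values in faults could then be fed to an existing detection model to see which of them it flags.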

14:15

Break, with Coffee/Drinks and Snacks

14:30

Streams

Discovering and Monitoring Relationships in Sensor Data (Michael Vollmer)

While automation is becoming ubiquitous, monitoring the involved devices by hand is infeasible. Automatic analysis of data collected with sensors is a promising alternative. However, this analysis often evaluates each sensor individually and does not consider relationships between them. These relationships, and particularly their changes over time, can offer important insights regarding the efficiency or condition of devices. While there are different ways to quantify and measure the dependencies between different sensors, this session focusses on the most general measure, “Mutual Information” (MI). We will discuss benefits, requirements and limitations of MI with regard to sensor data and other relationship measures. Additionally, we will talk about state-of-the-art techniques to apply MI to both recorded, “static” data and streaming data.
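As a rough illustration of what MI measures, the following sketch computes the simple plug-in estimate for two sequences of discrete symbols. It assumes the sensor readings have already been discretized (for example into "low"/"high"), and it is not one of the state-of-the-art estimators the talk refers to; the function name and the example readings are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in MI estimate in bits for two equally long discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of co-occurring symbol pairs
    px = Counter(xs)              # marginal counts of the first sensor
    py = Counter(ys)              # marginal counts of the second sensor
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi

# Hypothetical discretized readings of two sensors.
valve = ["open", "closed", "open", "closed"]
flow = ["high", "low", "high", "low"]     # fully determined by the valve
noise = ["a", "a", "b", "b"]              # unrelated to the valve

mi_dependent = mutual_information(valve, flow)
mi_independent = mutual_information(valve, noise)
```

In this toy example the dependent pair yields one bit of MI and the unrelated pair yields zero, which is the kind of contrast a monitoring system could track over time.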

Knowledge Discovery in High-Dimensional Streams (Edouard Fouché)

"Knowledge Discovery in Databases" (KDD) is known as the process of extracting useful insights from data. KDD is challenging with high-dimensional streams, i.e., when (a) the data is composed of many signals, e.g., hundreds or thousands, and (b) the data can evolve over time in unpredictable ways. "Knowledge Discovery in High-dimensional Streams" (KD-HS) is an open problem in the field of Machine Learning. On the other hand, it is extremely useful in many scenarios: If knowledge hidden in the data can be extracted immediately, production volumes can increase, or additional costs can be avoided. This would lead to significant impacts on "Predictive Maintenance" and "Industry 4.0". In this project, we aim at establishing the foundations for KD-HS. In this talk, we will present our current work in progress, including the development of novel methods to estimate the correlation between multivariate streams and the monitoring of correlation changes in high-dimensional streams. We will give an outlook on possible deployments in manufacturing processes, such as the detection of anomalies, intrusions or data clusters.

15:20

Break, with Coffee/Drinks and Snacks

15:40

Use Cases

What data science can say about the energy behaviour of processes (Nicole Ludwig)

By now, many machines are equipped with smart meters, measuring energy consumption in sub-minute intervals. This measuring generates a significant amount of data, which might contain valuable information. However, finding this information is not trivial. This talk gives an overview of what data science can say about the energy consumption of processes. We will focus on finding patterns in time series without prior knowledge about them. For example, when does a process start, and what is the regular energy-consumption shape of a specific machine? We will show how analysing these patterns in industry might help to stabilise the future energy grid. Additionally, we give an outlook on how other applications in mechanical engineering might benefit from the presented methods.

Concise metamodels of simulated systems (Vadim Arzamasov)

Simulations or computer experiments are often used to learn about the properties of physical systems, e.g., electricity grids or vehicles. Simulations allow one to directly answer questions like “If we change the characteristics of the electricity transmission line to another value, will the power grid remain stable?” or “If the shape of the car is different, will it be faster/safer?”. However, in any interesting simulation there is more than one input variable. For example, “shape of the car” can comprise its length, height, width, diameter of wheels or location of the engine. A power grid can be modelled as a system of energy producers and consumers, each having its own characteristics and connection with transmission lines of various capacities. New types of questions arise. For example:

Which inputs have a large effect on the system, and which ones do not?
What are the interactions between inputs? E.g., do some inputs a and b have some synergy, i.e., does changing both of them have an effect that one could not have expected from changing their values one at a time?
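The notion of synergy between two inputs a and b can be made concrete with a small sketch. It compares the effect of changing both inputs with the sum of the effects of changing each one alone. The function name and the two toy simulators are hypothetical stand-ins, not the models used in the project.

```python
def interaction(simulate, a_lo, a_hi, b_lo, b_hi):
    """Effect of changing a and b together minus the sum of their separate effects."""
    base = simulate(a_lo, b_lo)
    joint = simulate(a_hi, b_hi) - base    # both inputs changed
    only_a = simulate(a_hi, b_lo) - base   # only a changed
    only_b = simulate(a_lo, b_hi) - base   # only b changed
    return joint - (only_a + only_b)

# Hypothetical simulators standing in for an expensive simulation run.
def additive(a, b):
    return 2 * a + 3 * b        # inputs act independently

def synergistic(a, b):
    return 2 * a + 3 * b + a * b  # extra term only appears when both change

no_synergy = interaction(additive, 0, 1, 0, 1)       # 0
with_synergy = interaction(synergistic, 0, 1, 0, 1)  # 1
```

A value of zero means the joint effect is exactly what the one-at-a-time changes predict; a non-zero value indicates an interaction. Note that this check alone already needs four simulation runs for a single pair of inputs.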

Different combinations of input values could lead to different simulation outcomes, and their number grows exponentially. Thus, to be able to answer the questions above, one should decide:

1. For which input values to run the simulations?
2. How many simulations to run?
3. How to represent their result, so that it is simple enough to grasp and still insightful?
4. How to minimize the number of simulations while still achieving the same result representation?

Some of these questions, such as the first one (known as experimental design), have been studied extensively. However, we believe that the results will be better when the questions are considered jointly. In our current work, we first choose the model (answering Question 3). Having done this, we address the other issues (Questions 1, 2 and 4).

16:30

End of Presentations, Further Opportunities for Discussion

Organization

In case of any queries regarding the event, please contact us at the above email address.