Data, Knowledge, and the Web

The advent of large-scale data on the Web and elsewhere poses new challenges and opportunities. Concepts, models, and algorithms from several fields, including database systems, information retrieval, natural language processing, statistical learning, and data mining, can help us analyze and learn from this data.

Groups and Researchers in this Field


Text + Time Search & Analytics

Klaus Berberich coordinates the Text+Time Search and Analytics research area in the Databases and Information Systems Department at the Max Planck Institute for Informatics, focusing on developing efficient, effective methods to search and analyze natural language texts that come with associated temporal information. Such information may include temporal expressions, which convey the time periods a text refers to, as well as publication timestamps. Data of interest include web archives, newspaper corpora, and other collections of born-digital or now-digital documents. Implementing and experimentally evaluating methods on real-world data is integral to the group’s approach. Recent and ongoing efforts include time-travel text search, algorithms to compute n-gram statistics at large scale, and redundancy-aware retrieval models.

Klaus Berberich

MPI-INF, Senior Researcher

Machine Learning and Large-scale Data Mining Methods

Manuel Gomez Rodriguez is a research group leader at the Max Planck Institute for Software Systems. He is interested in developing machine learning and large-scale data mining methods for the analysis and modeling of large real-world networks and the processes that take place over them. His research comprises several dimensions: developing models of these networks and processes and assessing their theoretical properties and limitations; developing machine learning algorithms to fit the models and computational methods to influence processes over networks; and validating models and methods on gigabyte- and terabyte-scale real-world datasets. Ultimately, he aims to provide computational tools with applications in a variety of domains, e.g., social and information sciences, economics, decision theory, causality, and epidemiology.

Manuel Gomez Rodriguez

MPI-SWS, Faculty

Data Mining

Pauli Miettinen leads the Data Mining area in the Databases and Information Systems Department of the Max Planck Institute for Informatics, and is also an adjunct professor (docent) of computer science at the University of Helsinki. His group’s goal is to develop data mining methods and algorithms based on well-founded algorithmic principles, by analyzing theoretical aspects of a problem, developing practical algorithms, and then applying those algorithms to real-world problems. Many interesting datasets contain binary or higher-arity relations, collections of sets of elements, or bipartite graphs; all of these can be expressed using binary matrices or tensors. Currently, the group focuses on decomposition methods for binary matrices and tensors, as well as variations of these problems.

Pauli Miettinen

MPI-INF, Senior Researcher

Semantic Data

Daria Stepanova leads the Semantic Data research group in the Databases and Information Systems Department of the Max Planck Institute for Informatics. Her group’s goal is to advance deductive and inductive reasoning methods for knowledge graphs. Specifically, the group addresses problems related to handling incomplete and inconsistent data, diagnostic reasoning, knowledge revision, and inductive learning. Examples of ongoing work include automatic rule extraction from structured and unstructured sources, semantically enhanced methods for Web-based fact checking, and the effective combination of reasoning and learning algorithms for data exploration.

Daria Stepanova

MPI-INF, Senior Researcher

Text Analysis

Despite the development of knowledge bases and their improvements in recent years, most human knowledge is still available only in unstructured form, in particular as natural language text. Thus, applications such as search engines and question answering systems benefit from knowledge extracted from texts. Our group aims to develop natural language processing tools for extracting valuable information from large document collections. In particular, we tackle the extraction and interpretation of temporal information, as time is an important dimension in any information space. For instance, we are extending and improving the temporal tagger HeidelTime. We also work on semantically refined tasks: we have developed a time-aware search engine, which allows one to formulate queries with temporal constraints on the documents’ content and to perform exploratory corpus analysis.

Jannik Strötgen

MPI-INF, Senior Researcher

Exploratory Data Analysis

Jilles Vreeken is a senior researcher in the Databases and Information Systems Department at the Max Planck Institute for Informatics, and leads the Exploratory Data Analysis independent research group at the Cluster of Excellence on Multimodal Computing and Interaction. His research focuses on exploratory data mining: developing theory and algorithms to identify interesting structures within given data. Of particular value here are statistical methods, such as the information-theoretic principles of minimum description length and maximum entropy. He also develops efficient algorithms to extract these structures from large and complex data, and investigates how they can be used in a range of applications, including identifying rare diseases, e-health, bioinformatics, market analysis, and product recommendation.

Jilles Vreeken

MPI-INF, Senior Researcher

Knowledge Harvesting

Gerhard Weikum is a Research Director at the Max Planck Institute for Informatics, where he leads the Databases and Information Systems Department. He is also an adjunct professor in the Department of Computer Science of Saarland University, and a Principal Investigator of the Cluster of Excellence on Multimodal Computing and Interaction. The long-term objective of his research is to develop methodology for knowledge discovery: collecting, organizing, searching, exploring, and ranking facts from a wide array of structured, semistructured, and textual information sources, which may exhibit varying levels of credibility. His group’s approach towards this goal combines concepts, models, and algorithms from several fields, including database systems, information retrieval, statistical learning, and data mining.

Gerhard Weikum

MPI-INF, Scientific Director
