Current Events
Past Seminars
This presentation introduces the Celonis Process Query Language (PQL), a powerful and intuitive data query language designed specifically for business users. We will explore PQL’s core data model, which leverages tables and a cycle-free join graph, and its “column-oriented” language structure that simplifies data retrieval and knowledge management. Key features like decoupled filtering, late filter application, and composability will be highlighted, alongside the concept of virtual tables generated by operators. The presentation will also delve into how PQL handles process-specific data through activity tables and specialized operators. Finally, we will touch upon the implementation and architecture, including the easy caching enabled by PQL’s design, concluding with an outlook on PQL’s evolution to natively query OCDMs.
Large-scale machine learning (ML) underpins many applications that profoundly transform our lives, but the ML systems that execute these workloads are still in their infancy. In the first part of this talk, we give an overview of Apache SystemML as a representative ML system for declarative, large-scale ML. SystemML provides an R-like syntax and automatically compiles these high-level linear algebra programs into hybrid runtime plans of single-node, in-memory operations and distributed operations on Spark. In the second part, we present a selected research result on optimizing operator fusion plans. Opportunities for fused operators (fused chains of basic operators) are ubiquitous, and their benefits include fewer intermediates, scan sharing, and sparsity exploitation across operators. However, existing fusion heuristics struggle to find good plans for complex operator DAGs or hybrid plans. Therefore, we introduce an exact yet practical cost-based optimization framework for fusion plans, including techniques for candidate exploration, candidate selection, and code generation of local and distributed operations over dense, sparse, and compressed data. Finally, we share some lessons learned and ongoing work on properly supporting the entire end-to-end data science lifecycle.
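To make the benefits of fusion named in the abstract concrete, the following minimal Python sketch contrasts an unfused and a fused evaluation of the expression sum(X * Y * Z) over plain lists. It is an illustration only, not SystemML's generated code or its optimizer; the function names are hypothetical.

```python
def unfused_sum_product(x, y, z):
    """Evaluate sum(x * y * z) one basic operator at a time.

    Each operator scans its inputs and materializes a full intermediate.
    """
    tmp1 = [a * b for a, b in zip(x, y)]     # intermediate 1: x * y
    tmp2 = [t * c for t, c in zip(tmp1, z)]  # intermediate 2: (x * y) * z
    return sum(tmp2)                         # third pass: aggregation


def fused_sum_product(x, y, z):
    """Evaluate the same expression in a single fused pass.

    No intermediates are materialized, the inputs are scanned once, and
    zero entries of x are skipped (a simple form of sparsity exploitation).
    """
    total = 0.0
    for a, b, c in zip(x, y, z):
        if a == 0:
            continue  # the whole product is zero, so skip the multiplications
        total += a * b * c
    return total


x = [1.0, 0.0, 2.0]
y = [3.0, 4.0, 5.0]
z = [2.0, 2.0, 2.0]
print(unfused_sum_product(x, y, z), fused_sum_product(x, y, z))
```

Both functions return the same value; the fused variant simply avoids the two temporary lists and the extra scans.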
How do we select content that will go viral across a whole network after we share it with friends or followers? Significant research activity has been dedicated to the problem of strategically selecting a seed set of initial adopters so as to maximize a meme’s spread in a network. This line of work assumes that the success of such a campaign depends solely on the choice of a tunable set of initiators, regardless of how users perceive the propagated meme, which is fixed. In many real-world settings, however, the opposite holds: a meme’s propagation depends on users’ perceptions of its tunable characteristics, while the set of initiators is fixed.
We address the natural problem that arises in such circumstances: suggest content, expressed as a limited set of attributes, for a creative promotion campaign that starts out from a given seed set of initiators, so as to maximize its expected spread over a social network. To our knowledge, no previous work addresses this problem. We find that the problem is NP-hard and inapproximable. Since no tight approximation guarantee is attainable, we design an efficient heuristic, Explore-Update, as well as a conventional Greedy solution. Our experimental evaluation demonstrates that Explore-Update selects near-optimal attribute sets on real data, achieves 30% higher spread than baselines, and runs an order of magnitude faster than Greedy.
SQL-99 allows nested subqueries at nearly all places within a query. From a user’s point of view, nested queries can greatly simplify the formulation of complex queries. However, nested queries that are correlated with the outer query frequently lead to dependent joins with nested-loop evaluation and thus poor performance. Existing systems therefore use a number of heuristics to unnest these queries, i.e., to de-correlate them. These unnesting techniques can greatly speed up query processing, but are usually limited to certain classes of queries. To the best of our knowledge, no existing system can de-correlate queries in the general case. We present a generic approach for unnesting arbitrary queries. As a result, the de-correlated queries allow for much simpler and much more efficient query evaluation.
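The performance gap between a dependent join and its de-correlated form can be sketched in a few lines of Python. The example assumes a hypothetical relation exams(student_id, course, grade) and the query "for each student, return the exams with that student's best (lowest) grade". It shows only one simple, well-known case of unnesting, not the generic approach presented in the talk.

```python
def correlated(exams):
    """Dependent join: the subquery is re-evaluated for every outer row.

    Roughly: SELECT * FROM exams e1
             WHERE e1.grade = (SELECT MIN(e2.grade) FROM exams e2
                               WHERE e2.student_id = e1.student_id)
    This costs O(n^2) comparisons for n rows.
    """
    result = []
    for sid, course, grade in exams:
        best = min(g for s, _, g in exams if s == sid)  # inner scan per row
        if grade == best:
            result.append((sid, course, grade))
    return result


def decorrelated(exams):
    """Unnested form: aggregate once per student, then join on student_id.

    One pass builds the grouped minimum, a second pass probes it,
    so the cost is O(n) with a hash table.
    """
    best_by_student = {}
    for sid, _, grade in exams:
        if sid not in best_by_student or grade < best_by_student[sid]:
            best_by_student[sid] = grade
    return [row for row in exams if row[2] == best_by_student[row[0]]]


exams = [(1, "DB", 1.3), (1, "OS", 2.0), (2, "DB", 3.0), (2, "ML", 1.7)]
print(correlated(exams))
print(decorrelated(exams))
```

Both functions produce the same rows; only the evaluation strategy differs.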
Search results for a given query topic are typically unstructured, making it hard to understand the relationships between the different sources of information. Thus, there is a need to organize search results to help users (1) gain more insight into query topics and (2) easily access the information sources that match their interests. This is particularly helpful for ambiguous queries or faceted topics that involve a variety of sub-topics, meanings, versions, arguments, opinions, and many other aspects. In this talk, I present techniques that exploit existing knowledge bases to enhance information search. I first show how to exploit Wikipedia for query expansion and search result diversification. Then, I proceed with the organization of information sources, which allows effective navigation through knowledge facets.
Keynote at the Thesis Development Workshop of the Doctoral College GIScience.
NoSQL databases are increasingly popular, particularly in web development. Often the appeal lies in the large volumes of data that need to be managed, but these systems are sometimes also attractive to agile development teams because of their schema flexibility. Because many NoSQL databases offer no support for defining, enforcing, and maintaining a global schema, classic tasks of the database management system shift into the application software. This talk gives an overview of concrete challenges that arise in practice when designing a data model for key-value and document databases. These include modeling that enables atomic updates, avoiding hot-spot data objects, such as those caused by high-frequency concurrent writes to the same object, and strategies for dealing with continuous schema evolution. The talk shows that the database community in particular, with its wealth of experience in schema management and its broad repertoire of formal methods, can make a valuable contribution here.
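One common way to avoid a hot-spot object is to split a frequently updated value across several keys, often called a sharded counter. The sketch below is an assumption-laden illustration and is not taken from the talk: the store is a plain in-memory dictionary standing in for a key-value database, and the class name, key layout, and shard count are hypothetical.

```python
import random


class ShardedCounter:
    """Counter spread over several keys so writers rarely hit the same object."""

    def __init__(self, store, name, num_shards=8, rng=None):
        self.store = store            # stand-in for a key-value database
        self.name = name
        self.num_shards = num_shards
        self.rng = rng or random.Random()

    def _key(self, shard):
        return f"{self.name}:shard:{shard}"

    def increment(self, amount=1):
        # Each write touches one randomly chosen shard, so concurrent
        # writers are spread over num_shards objects instead of one.
        key = self._key(self.rng.randrange(self.num_shards))
        self.store[key] = self.store.get(key, 0) + amount

    def value(self):
        # Reads are more expensive: they sum over all shards.
        return sum(self.store.get(self._key(s), 0)
                   for s in range(self.num_shards))


store = {}
likes = ShardedCounter(store, "post:42:likes", num_shards=4,
                       rng=random.Random(0))
for _ in range(100):
    likes.increment()
print(likes.value())
```

The design trades cheaper, less contended writes for more expensive reads, which suits values that are written far more often than they are read.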
Many application scenarios can benefit significantly from the identification and processing of similarities in the data. Even though some work has been done to extend the semantics of some operators, e.g., join and selection, to be aware of data similarities, there has not been much study of the role and implementation of similarity-aware operations as first-class database operators. Furthermore, very little work has addressed the problem of evaluating and optimizing queries that combine several similarity operations. The focus of this presentation is the study of similarity queries that contain one or multiple first-class similarity database operators, e.g., Similarity Selection, Similarity Join, and Similarity Group-by. We will present implementation techniques for several similarity operators, a comprehensive conceptual evaluation model for similarity queries, and a rich set of transformation rules that extend cost-based query optimization to similarity queries. We will also discuss techniques to implement similarity operators using the MapReduce framework to process massive datasets.
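As a small illustration of what a similarity operator computes, the following Python sketch implements a range-based similarity join on one-dimensional numeric values: it returns all pairs whose distance is at most eps. This is a minimal sketch under the assumption of a simple absolute-difference distance; the function name and the sort-and-probe strategy are illustrative and are not the implementation techniques from the talk.

```python
from bisect import bisect_left, bisect_right


def similarity_join(left, right, eps):
    """Return all pairs (a, b) with a in left, b in right, |a - b| <= eps.

    The right input is sorted once; each left value then probes the
    sorted list with two binary searches instead of scanning it fully.
    """
    right_sorted = sorted(right)
    pairs = []
    for a in left:
        lo = bisect_left(right_sorted, a - eps)   # first b >= a - eps
        hi = bisect_right(right_sorted, a + eps)  # one past the last b <= a + eps
        pairs.extend((a, b) for b in right_sorted[lo:hi])
    return pairs


print(similarity_join([1.0, 5.0], [0.5, 1.4, 4.0, 5.2, 9.0], eps=0.5))
```

With eps set to 0 the operator degenerates to an ordinary equi-join, which is one way to see similarity join as a generalization of the regular join.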