Shaoxu Song

Research Topics

Differential dependencies (DDs) [TODS11] are an extension of functional dependencies (FDs). They specify constraints on difference/distance rather than on equality, which is what traditional data dependency notations such as FDs use. Informally, a differential dependency states that if the distances between two tuples on attributes X agree with a given differential function, then their distances on attributes Y should also agree with the corresponding differential function on Y. For example, two customer records with similar (not necessarily equal) Name values probably have similar Address values as well. Differential dependencies are useful in various applications, such as data cleaning, data partitioning, query optimization, and record linkage.
The determination of differential functions, in particular, the distance/similarity thresholds of DDs, is studied in [ICDE12] [TKDE14].
Data repairing with DDs, by modeling the data and constraints as graphs [VLDB14].
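To make the informal definition of a DD concrete, here is a minimal sketch in Python. It assumes the simplest form of differential function, an upper bound on distance, and uses edit distance on strings. The helper names (`edit_distance`, `dd_violations`), the sample records, and the thresholds are all hypothetical illustrations, not the notation or algorithms of [TODS11].

```python
from itertools import combinations


def edit_distance(s, t):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution or match
        prev = cur
    return prev[-1]


def dd_violations(rows, lhs, rhs, dist=edit_distance):
    """Return index pairs whose distances satisfy every threshold in lhs
    but exceed at least one threshold in rhs.

    lhs and rhs map attribute names to maximum allowed distances
    (an assumed, simplified form of a differential function).
    """
    bad = []
    for (i, a), (j, b) in combinations(enumerate(rows), 2):
        lhs_ok = all(dist(a[k], b[k]) <= t for k, t in lhs.items())
        rhs_ok = all(dist(a[k], b[k]) <= t for k, t in rhs.items())
        if lhs_ok and not rhs_ok:
            bad.append((i, j))
    return bad


# Hypothetical customer records.
customers = [
    {"name": "John Smith", "address": "12 Oak Street"},
    {"name": "Jon Smith",  "address": "12 Oak St."},
    {"name": "John Smyth", "address": "98 Pine Avenue"},
]

# Assumed DD: name distance <= 2  ->  address distance <= 6.
violating_pairs = dd_violations(customers, {"name": 2}, {"address": 6})
print(violating_pairs)
```

In this toy data the first two records have similar names and similar addresses, so they are consistent with the assumed DD. The third record has a similar name but a very different address, so it forms a violating pair with each of the other two.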

Business processes continuously generate huge volumes of event data, in settings ranging from traditional enterprise office automation systems and scientific workflows to recent Web services and online transactions. The event data are often dirty for various reasons: events may be left unsubmitted when logs are recorded manually, lost in system failures, or mixed up when collected from heterogeneous execution environments.
Event matching, with opaque names, dislocated traces and composite events [SIGMOD14], using complex event patterns as discriminative features [ICDE14].
Recovering missing events [VLDB13], repairing erroneous events [ICDE15].
Repairing time series under speed constraints [SIGMOD15].
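As a rough illustration of what a speed constraint on a time series can look like, the sketch below flags pairs of points whose rate of change falls outside an allowed range. It assumes one common reading of such a constraint: within a time window, the speed (x_j - x_i) / (t_j - t_i) must lie between a minimum and a maximum. The function name, parameters, and data are hypothetical, and the sketch only detects violations. It does not reproduce the repair method of [SIGMOD15].

```python
def speed_violations(points, s_min, s_max, window):
    """Return index pairs (i, j), i < j, whose speed is outside [s_min, s_max].

    points: list of (timestamp, value) pairs with strictly increasing timestamps.
    window: only pairs at most this far apart in time are checked.
    """
    bad = []
    for i, (ti, xi) in enumerate(points):
        for j in range(i + 1, len(points)):
            tj, xj = points[j]
            if tj - ti > window:
                break  # later points are even further away in time
            speed = (xj - xi) / (tj - ti)
            if not (s_min <= speed <= s_max):
                bad.append((i, j))
    return bad


# Hypothetical readings: the value at t=3 jumps implausibly.
readings = [(1, 10.0), (2, 10.5), (3, 15.0), (4, 11.5)]
suspect_pairs = speed_violations(readings, s_min=-1.0, s_max=1.0, window=2)
print(suspect_pairs)
```

Here every flagged pair involves the point at t=3, which suggests that this single reading is the likely error. Choosing which value to change, and by how much, is the repair problem itself and is outside the scope of this sketch.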

A novel type of Comparable Dependencies (CDs) is proposed for specifying integrity constraints over heterogeneous data [ICDE11] [VLDBJ13]. It considers the comparable correspondences between synonym values (e.g., Apple vs. Apple Inc) on comparable attributes (e.g., manufacturer vs. producer) from heterogeneous sources.
Discovery of Matching Dependencies (MDs) [CIKM09] [DKE13], concise set of Relative Candidate Keys [VLDB14].
Query optimization on heterogeneous data [TKDE11].
Some other aspects of heterogeneous data, such as similarity measures [CIKM10] [CIKM07] [INS14] [DASFAA07] and indexing [WWWJ13].

[TKDE11] studies the optimization of probabilistic inference queries over Bayesian Networks by using RDBMS facilities, such as materialized views.
[SIGMOD10] considers consistent query answering in inconsistent probabilistic databases.