Entity resolution (ER) is an important problem in data cleaning. It is about identifying and merging records in a database that represent the same external entity. Relatively recently, declarative rules called matching dependencies (MDs) have been proposed for specifying similarity conditions under which attribute values in database records are merged. An ER process supported by MDs over a dirty instance may lead to multiple clean instances.
In this thesis, we first present disjunctive answer set programs (ASPs) whose models capture the class of alternative clean instances obtained after an ER process based on MDs. With these programs, we can obtain clean answers to queries by skeptical reasoning from the program. As an important practical case of ER, we provide a declarative reconstruction of the so-called union-case ER methodology, as presented in the generic Swoosh approach to ER. We extend our ASP-based account of the union case of Swoosh with negative rules.
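As a rough procedural illustration of the union-case merging mentioned above (not the declarative ASP specification developed in the thesis), the following minimal sketch assumes that each record maps an attribute to a set of values, that two matching records are merged by taking the union of their value sets, and that merging is repeated until no two records match. The attribute names and the `match` rule are hypothetical and chosen only for the example.

```python
from itertools import combinations

# A record maps each attribute name to a frozenset of values (assumed schema).

def match(r1, r2):
    # Hypothetical match rule: two records match if they share
    # at least one name value or at least one phone value.
    return bool(r1["name"] & r2["name"]) or bool(r1["phone"] & r2["phone"])

def merge(r1, r2):
    # Union-case merge: attribute-wise union of the value sets.
    # Assumes both records have the same attributes.
    return {attr: r1[attr] | r2[attr] for attr in r1}

def resolve(records):
    # Naive fixpoint: replace any matching pair by its merge until
    # no pair matches. Each merge removes one record, so this terminates.
    current = list(records)
    changed = True
    while changed:
        changed = False
        for i, j in combinations(range(len(current)), 2):
            if match(current[i], current[j]):
                merged = merge(current[i], current[j])
                current = [r for k, r in enumerate(current) if k not in (i, j)]
                current.append(merged)
                changed = True
                break
    return current

def rec(name, phone):
    # Helper for building single-valued records.
    return {"name": frozenset([name]), "phone": frozenset([phone])}

dirty = [
    rec("J. Doe", "555-1234"),
    rec("John Doe", "555-1234"),
    rec("John Doe", "555-9999"),
    rec("Mary Roe", "555-0000"),
]
clean = resolve(dirty)
```

In this toy instance the first three records are merged transitively into a single record, while the fourth is left untouched. The sketch uses a plain pairwise loop for readability rather than any optimized Swoosh variant.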
We then extend MDs to relational MDs, which capture more application semantics, and identify classes of relational MDs for which the proposed declarative specifications for ER via MDs can be automatically rewritten into stratified Datalog programs.
We also show the process and the benefits of integrating four components of ER: (a) building a classifier for duplicate/non-duplicate record pairs using machine learning (ML) techniques; (b) using relational MDs to support the blocking phase of ML; (c) merging records on the basis of the classifier results; and (d) using the declarative language LogiQL, an extended form of Datalog supported by the LogicBlox platform, for all activities related to data processing and for the specification and enforcement of MDs.