Declarative Entity Resolution Via Matching Dependencies and Combining Matching Dependencies With Machine Learning for Entity Resolution

Creator: 

Bahmani, Zeinab

Date: 

2017

Abstract: 

Entity resolution (ER) is an important problem in data cleaning. It is about identifying and merging records in a database that represent the same external entity. Relatively recently, declarative rules called matching dependencies (MDs) have been proposed for specifying similarity conditions under which attribute values in database records are merged. An ER process supported by MDs over a dirty instance may lead to multiple clean instances.

In this thesis, we first present disjunctive answer set programs that capture through their models the class of alternative clean instances obtained after an ER process based on MDs. With these programs, we can obtain clean answers to queries by skeptically reasoning from the program. As an important practical case of ER, we provide a declarative reconstruction of the so-called union-case ER methodology, as presented in Swoosh, a generic approach to ER. We extend our ASP-based account of the union case of Swoosh with negative rules.
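
As an illustration only, the union-case idea mentioned above can be sketched procedurally in Python. This is not the thesis's answer set program: it is a minimal sketch in the style of a generic match-and-merge loop, assuming records are maps from attribute names to sets of values, that merging takes the attribute-wise union, and that two records match when they share a value for a name or phone attribute. The attribute names, the match rule, and the sample data are all hypothetical.

```python
from typing import Dict, FrozenSet, List

Record = Dict[str, FrozenSet[str]]


def merge(r1: Record, r2: Record) -> Record:
    """Union-case merge: every attribute keeps the union of both value sets."""
    return {a: r1.get(a, frozenset()) | r2.get(a, frozenset())
            for a in sorted(set(r1) | set(r2))}


def match(r1: Record, r2: Record) -> bool:
    """Hypothetical match rule: the records share a name value or a phone value."""
    return any(r1.get(a, frozenset()) & r2.get(a, frozenset())
               for a in ("name", "phone"))


def resolve(records: List[Record]) -> List[Record]:
    """Merge matching records repeatedly until no two resolved records match."""
    pending = list(records)
    resolved: List[Record] = []
    while pending:
        current = pending.pop()
        partner = next((r for r in resolved if match(current, r)), None)
        if partner is None:
            resolved.append(current)
        else:
            # Put the merged record back so it is compared against the rest.
            resolved.remove(partner)
            pending.append(merge(current, partner))
    return resolved


# Hypothetical sample data.
dirty = [
    {"name": frozenset({"J. Doe"}), "phone": frozenset({"555-1234"})},
    {"name": frozenset({"John Doe"}), "phone": frozenset({"555-1234"})},
    {"name": frozenset({"John Doe"}), "email": frozenset({"jd@example.com"})},
    {"name": frozenset({"Ann Lee"}), "phone": frozenset({"555-9999"})},
]
clean = resolve(dirty)
```

With this sample data the first three records collapse into one record that carries both name variants, and the fourth record stays separate. A merged record is re-queued because taking unions can create new matches that neither original record had on its own.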

In this work, we extend MDs to relational MDs, which capture more application semantics, and identify classes of relational MDs for which the proposed declarative specifications for ER via MDs can be automatically rewritten into stratified Datalog programs.

We also show the process and the benefits of integrating four components of ER: (a) building a classifier for duplicate/non-duplicate record pairs using machine learning (ML) techniques; (b) using relational MDs to support the blocking phase of ML; (c) merging records on the basis of the classifier results; and (d) using the declarative language LogiQL, an extended form of Datalog supported by the LogicBlox platform, for all activities related to data processing, and for the specification and enforcement of MDs.
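
To make the role of blocking concrete, here is a small Python sketch of how blocking limits which record pairs reach a duplicate/non-duplicate decision. It rests on assumptions that are not from the thesis: the blocking key is a plain attribute value rather than a relational MD, a token-overlap threshold stands in for a trained classifier, and the thesis itself uses LogiQL rather than Python. All names and sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample records keyed by record id.
records = {
    1: {"name": "john a smith", "city": "ottawa"},
    2: {"name": "john smith", "city": "ottawa"},
    3: {"name": "john smith", "city": "toronto"},
    4: {"name": "mary jones", "city": "ottawa"},
}


def block_by(recs, key):
    """Group record ids by the value of a blocking attribute."""
    blocks = defaultdict(list)
    for rid, rec in recs.items():
        blocks[rec[key]].append(rid)
    return blocks


def candidate_pairs(blocks):
    """Yield only pairs of record ids that fall in the same block."""
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)


def jaccard(a, b):
    """Token-level Jaccard similarity between two strings."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def is_duplicate(r1, r2, threshold=0.5):
    """Stand-in for a trained classifier: threshold on name similarity."""
    return jaccard(r1["name"], r2["name"]) >= threshold


pairs = list(candidate_pairs(block_by(records, "city")))
duplicates = [(i, j) for i, j in pairs if is_duplicate(records[i], records[j])]
```

Here only three of the six possible pairs are compared, and records 1 and 2 are classified as duplicates. Records 2 and 3 have identical names but are never compared because they fall in different blocks, which shows why the choice of blocking rule matters.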

Subject: 

Computer Science

Language: 

English

Publisher: 

Carleton University

Contributor: 

Author: 
Zeinab Bahmani

Thesis Degree Name: 

Doctor of Philosophy: 
Ph.D.

Thesis Degree Level: 

Doctoral

Thesis Degree Discipline: 

Computer Science

Parent Collection: 

Theses and Dissertations

Items in CURVE are protected by copyright, with all rights reserved, unless otherwise indicated. They are made available with permission from the author(s).