A Revival of Data Dependencies for Improving Data Quality

Wednesday 14th January 2009, 6:30 pm

Speaker: Professor Wenfei Fan, School of Informatics, The University of Edinburgh.

Venue: Room G.07, University of Edinburgh Informatics Forum, 10 Crichton Street, Edinburgh, EH8 9AB.

This talk is free of charge. Refreshments available from 6:00 pm.

Synopsis

This is a repeat of Professor Fan's BCS Roger Needham Lecture 2008.

Recent statistics reveal that 1%-5% of real-world data in enterprises is dirty: inconsistent, inaccurate, incomplete and/or stale.

The prevalent use of the Internet has been increasing the risks, on an unprecedented scale, of creating and propagating dirty data. Dirty data is estimated to cost US industry alone billions of dollars a year.

There is no reason to believe that the scale of the problem is any different in the UK, or in any other society that is dependent on information technology. This highlights the need for principled approaches to improving data quality.

This talk presents a recent approach for detecting and repairing real-life dirty data. It is based on conditional dependencies, a revision of traditional database dependencies that enforces bindings of semantically related data values.
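To make the idea of a conditional dependency concrete, here is a minimal sketch in Python of how violations of one might be detected. The customer records, the attribute names (CC, AC, zip, street, city), the wildcard symbol and the helper functions are all hypothetical illustrations chosen for this example; they are not taken from the talk and are not the speaker's algorithms. A dependency is written as a pair of patterns: a constant in a pattern binds an attribute to a specific value, and "_" stands for any value.

```python
from collections import defaultdict

WILDCARD = "_"  # hypothetical marker meaning "any value"


def matches(row, pattern):
    """True if the row agrees with every constant in the pattern."""
    return all(v == WILDCARD or row[a] == v for a, v in pattern.items())


def cfd_violations(rows, lhs, rhs):
    """Return sorted indices of rows that violate the dependency lhs -> rhs.

    lhs and rhs map attribute names to a constant or to WILDCARD.
    """
    violations = set()
    groups = defaultdict(list)
    # Only wildcard attributes on the right need a pairwise comparison.
    var_rhs = [a for a, v in rhs.items() if v == WILDCARD]
    for i, row in enumerate(rows):
        if not matches(row, lhs):
            continue  # the condition does not apply to this row
        if not matches(row, rhs):
            violations.add(i)  # a single row breaks a constant binding
        groups[tuple(row[a] for a in lhs)].append(i)
    for idxs in groups.values():
        # Rows that agree on the left-hand side must agree on the right.
        if len({tuple(rows[i][a] for a in var_rhs) for i in idxs}) > 1:
            violations.update(idxs)
    return sorted(violations)


# Made-up sample data for illustration only.
rows = [
    {"CC": "44", "AC": "131", "zip": "EH8 9AB", "street": "Crichton St", "city": "Edinburgh"},
    {"CC": "44", "AC": "131", "zip": "EH8 9AB", "street": "Mayfield Rd", "city": "Edinburgh"},
    {"CC": "44", "AC": "131", "zip": "EH1 1AA", "street": "High St", "city": "London"},
    {"CC": "01", "AC": "908", "zip": "07974", "street": "Mountain Ave", "city": "Murray Hill"},
    {"CC": "01", "AC": "908", "zip": "07974", "street": "Main St", "city": "Murray Hill"},
]

# Traditional dependency: zip determines street for every row.
fd = cfd_violations(rows, {"zip": WILDCARD}, {"street": WILDCARD})

# Conditional: zip determines street only where the country code is 44.
cfd_uk = cfd_violations(rows, {"CC": "44", "zip": WILDCARD}, {"street": WILDCARD})

# Conditional with constants: country code 44 and area code 131 imply Edinburgh.
cfd_city = cfd_violations(rows, {"CC": "44", "AC": "131"}, {"city": "Edinburgh"})

print(fd)        # [0, 1, 3, 4]
print(cfd_uk)    # [0, 1]
print(cfd_city)  # [2]
```

In this sketch the unconditional rule flags the two rows with country code 01 as well, whereas the conditional version applies only where its pattern holds. The last rule flags a single row on its own, without comparing it to any other row.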

As opposed to traditional database dependencies, which were developed for improving the quality of schemas, conditional dependencies provide a theory for improving the quality of the data itself.

Based on this theory, practical techniques have been developed for cleaning dirty data, which effectively reduce human effort and improve data quality. The techniques have drawn attention from industry in the UK and beyond.

About the speaker

Professor Wenfei Fan is part of the Database Group at the Laboratory for Foundations of Computer Science within the School of Informatics at the University of Edinburgh. He is also a Research Scientist at Bell Laboratories, Alcatel-Lucent.

He received his PhD from the University of Pennsylvania, and his MS and BS from Peking University.

He is a recipient of the:

  • BCS Roger Needham Award (2008)
  • Yangtze River Scholar Award (also known as the Chang Jiang Scholar Award) (2007)
  • The ICDE Best Paper Award (2007)
  • Outstanding Overseas Young Scholar Award (2003)
  • Best Paper of the Year Award from Computer Networks (2002)
  • Career Award (2001)

His current research interests include data quality, data integration, integrity constraints, distributed query processing, Web services and XML.

In the past few years his research has resulted in several practical developments:

  • A system for efficiently mapping relational databases into XML documents of a specified type, using a generalization of attribute grammars and a new theory of tree transducers. This is currently in use for generating exportable scientific data sets.
  • A system for imposing secure views of an XML document based on the DTD of that document.
  • The use of partial evaluation (borrowed from functional programming) for the efficient evaluation of boolean queries on distributed data.
  • A method of data cleaning based on conditional functional dependencies, which Wenfei proposed for this purpose. This is a nice application of his early ideas on constraints in databases to the highly practical problem of data cleaning (unclean data is estimated to cost US companies alone billions of dollars a year).