Title
Probabilistic Uncertainty Management in the DataRing
Description
Sources of uncertainty in data abound: noisy measurements, data produced by imperfect automatic systems such as information extraction or natural language processing, inherently imprecise data such as a human-made diagnosis, etc. In the context of an autonomous, heterogeneous, decentralized system such as the one investigated in the DataRing project [1], uncertainty also originates from inherently imperfect schema matchings, doubts about the actual presence of a fact or of a whole document on a given peer, or redundancy and contradiction among the information present in different peers. One of the most natural ways to represent this uncertainty is through probabilistic databases.
The objective of this PhD position is to find formal models for the representation and efficient querying of probabilistic databases in a peer-to-peer environment, and to build corresponding prototype systems.
Because of the heterogeneous nature of the information shared in the DataRing, semi-structured (i.e., XML) models should be favored, though the simplicity of the flat-tuple representation of the relational model can also be an inspiration. Previously studied probabilistic semi-structured models [2,3,4] can be a basis for the proposed work. Particular aspects of interest include:
- management of the various forms of uncertainty;
- routing and distributed computation of probabilistic queries over the peer-to-peer network;
- corroboration of information across sources;
- ranking of query results and top-k query processing.
Supervision
The 3-year PhD thesis will be supervised by Pierre Senellart and Talel Abdessalem in the Computer Science and Networking Department at TELECOM ParisTech, in interaction with the other partners of the ANR DataRing project, notably Serge Abiteboul's Gemo team at INRIA Saclay.
TELECOM ParisTech, formerly known as ENST, is the leading French engineering school specializing in information technology, and is located in Paris.
Conditions
Starting date: early 2009 (flexible).
Prerequisites for applying: Master's degree in computer science (or equivalent diploma), background in applied and theoretical database management.
Salary: ~1500 € net per month, over 3 years.
Please contact Pierre Senellart <pierre.senellart@telecom-paristech.fr> for further information and to apply.
References
[1] S. Abiteboul and N. Polyzotis, The Data Ring: Community Content Sharing. In Proc. CIDR, January 2007, Asilomar, USA.
[2] P. Senellart and S. Abiteboul, On the complexity of managing probabilistic XML data. In Proc. PODS, June 2007, Beijing, China.
[3] B. Kimelfeld, Y. Kosharovski, and Y. Sagiv, Query efficiency in probabilistic XML models. In Proc. SIGMOD, June 2008, Vancouver, Canada.
[4] S. Cohen, B. Kimelfeld, and Y. Sagiv, Incorporating constraints in probabilistic XML. In Proc. PODS, June 2008, Vancouver, Canada.