Data on the Web: From Relations to Semistructured Data and XML

The Web is causing a revolution in how we represent, retrieve, and process information. Its growth has given us a universally accessible database, but in the form of a largely unorganized collection of documents. This is changing, thanks to the simultaneous emergence of new ways of representing data: from within the Web community, XML; and from within the database community, semistructured data. The convergence of these two approaches has rendered them nearly identical. Now, there is a concerted effort to develop effective techniques for retrieving and processing both kinds of data.

Data on the Web is the only comprehensive, up-to-date examination of these rapidly evolving retrieval and processing strategies, which are of critical importance for almost all Web- and data-intensive enterprises. This book offers detailed solutions to a wide range of practical problems while equipping you with a keen understanding of the fundamental issues (including data models, query languages, and schemas) involved in their design, implementation, and optimization. You'll find it to be compelling reading, whether your interest is that of a practitioner involved in a database-driven Web enterprise or a researcher in computer science or a related field.

Author Notes

Serge Abiteboul is Senior Researcher at I.N.R.I.A. and a professor at the Ecole Polytechnique. He received his Ph.D. in computer science from the University of Southern California in 1982 and his These d'Etat from the University of Paris XI in 1986. His recent research has focused on object databases, digital libraries, semistructured data, data integration, and electronic commerce.

Peter Buneman is a professor in the Computer and Information Science Department at the University of Pennsylvania. He earned his undergraduate degree from Cambridge and his Ph.D. from the University of Warwick. His research interests include databases, programming languages, cognitive science, and classification theory.

Dan Suciu is a researcher at AT&T Labs who received his Ph.D. from the University of Pennsylvania in 1995. He has devoted his recent research and publications to various aspects of semistructured data, organizing several workshops on the topic and serving on the committees of ICDT, PODS, and EDBT.

Reviews

Library Journal Review

Most data on the web are not well structured, making the search and retrieval process difficult since the spiders, robots, and other search engines don't really understand the context of the data they are indexing and storing. This very advanced book examines the new retrieval and processing techniques, such as semistructured data and XML (as a data transfer language), that aim to merge a document-based web with a data-driven infrastructure. Hardcore programmers will want this. Recommended for university and large public libraries. (c) Copyright 2010. Library Journals LLC, a wholly owned subsidiary of Media Source, Inc. No redistribution permitted.

Foreword
Acknowledgments
1 Introduction
1.1 Audience
1.2 Web Data and the Two Cultures
1.3 Organization
I Data Model
2 A Syntax for Data
2.1 Base Types
2.2 Representing Relational Databases
2.3 Representing Object Databases
2.4 Specification of Syntax
2.5 The Object Exchange Model (OEM)
2.6 Object Databases
2.7 Other Representations
2.7.1 ACeDB
2.8 Terminology
2.9 Bibliographic Remarks
3 XML
3.1 Basic Syntax
3.1.1 XML Elements
3.1.2 XML Attributes
3.1.3 Well-Formed XML Documents
3.2 XML and Semistructured Data
3.2.1 XML Graph Model
3.2.2 XML References
3.2.3 Order
3.2.4 Mixing Elements and Text
3.2.5 Other XML Constructs
3.3 Document Type Definitions
3.3.1 A Simple DTD
3.3.2 DTDs as Grammars
3.3.3 DTDs as Schemas
3.3.4 Declaring Attributes in DTDs
3.3.5 Valid XML Documents
3.3.6 Limitations of DTDs as Schemas
3.4 Document Navigation
3.5 DCD
3.6 Paraphernalia
3.6.1 RDF
3.6.2 Stylesheets
3.6.3 SAX and DOM
3.7 Bibliographic Remarks
II Queries
4 Query Languages
4.1 Path Expressions
4.2 A Core Language
4.2.1 The Basic Syntax
4.3 More on Lorel
4.3.1 Less Essential Syntactic Sugaring
4.4 UnQL
4.5 Label and Path Variables
4.5.1 Paths as Data
4.6 Mixing with Structured Data
4.7 Bibliographic Remarks
5 Query Languages for XML
5.1 XML-QL
5.1.1 Constructing New XML Data
5.1.2 Processing Optional Elements with Nested Queries
5.1.3 Grouping with Nested Queries
5.1.4 Binding Elements and Contents
5.1.5 Querying Attributes
5.1.6 Joining Elements by Value
5.1.7 Tag Variables
5.1.8 Regular Path Expressions
5.1.9 Order
5.2 XSL
5.3 Bibliographic Remarks
6 Interpretation and Advanced Features
6.1 First-Order Interpretation
6.2 Object Creation
6.3 Graphical Languages
6.4 Structural Recursion
6.4.1 Structural Recursion on Trees
6.4.2 XSL and Structural Recursion
6.4.3 Bisimulation in Semistructured Data
6.4.4 Structural Recursion on Cyclic Data
6.5 StruQL
6.6 Bibliographic Remarks
III Types
7 Typing Semistructured Data
7.1 What Is Typing Good For?
7.1.1 Browsing and Querying Data
7.1.2 Optimizing Query Evaluation
7.1.3 Improving Storage
7.2 Analyzing the Problem
7.3 Schema Formalisms
7.3.1 Logic
7.3.2 Datalog
7.3.3 Simulation
7.3.4 Comparison between Datalog Rules and Simulation
7.4 Extracting Schemas from Data
7.4.1 Data Guides
7.4.2 Extracting Datalog Rules from Data
7.5 Inferring Schemas from Queries
7.6 Sharing, Multiplicity, and Order
7.6.1 Sharing
7.6.2 Attribute Multiplicity
7.6.3 Order
7.7 Path Constraints
7.7.1 Constraints in Relational Databases
7.7.2 Constraints in Object-Oriented Databases
7.7.3 Path Constraints in Semistructured Data
7.7.4 The Constraint Inference Problem
7.7.5 Constraints in XML
7.8 Bibliographic Remarks
IV Systems
8 Query Processing
8.1 Architecture
8.2 Semistructured Data Servers
8.2.1 Storage
8.2.2 Indexing
8.2.3 Distributed Evaluation
8.3 Mediators for Semistructured Data
8.3.1 A Simple Mediator: Converting Relational Data to XML
8.3.2 Mediators for Data Integration
8.4 Incremental Maintenance
8.5 Bibliographic Remarks
9 The Lore System
9.1 Architecture
9.2 Query Processing and Indexes
9.3 Other Aspects of Lore
9.3.1 The Data Guide
9.3.2 Managing External Data
9.3.3 Proximity Search
9.3.4 Views
9.3.5 Dynamic OEM and Chorel
9.3.6 Mixing Structured and Semistructured in Ozone
9.4 Bibliographic Remarks
10 Strudel
10.1 An Example
10.1.1 Data Management
10.1.2 Structure Management
10.1.3 Management of the Graphical Presentation
10.2 Advantages of Declarative Web Site Design
10.3 Bibliographic Remarks
11 Database Products Supporting XML
11.1 Architecture
11.2 Storage
11.3 Application Programming Interface
11.4 Query language
11.5 Scalability
11.6 Bibliographic Remarks
Bibliography
Index
About the Authors
