New York Area DB/IR Day

April 15, 2005

Student Posters

Student	University	Email	Title
Emiran Curtmola	UCSD	ecurtmola@cs.ucsd.edu	Implementation and Open Research Issues in XML Full-Text Search
Ilya Pevzner	NYU	pevzner@cs.nyu.edu	Resolving Attribute Value Inconsistencies
Ziyang Wang	NYU	ziyang@cs.nyu.edu	Web Daily News Assistant: finding what's new on Your Web
Milan Manavat	Stony Brook	sion@cs.stonybrook.edu	Data Outsourcing with Privacy in the Presence of a Secure Co-Processor
Cristina Schmidt	Rutgers	cristins@caip.rutgers.edu	SQUID: Flexible P2P Information Discovery Infrastructure
Julia Stoyanovich	Columbia	Jds1@cs.columbia.edu	A Faceted Query Engine Applied to Archaeology
Erich Schmidt	Princeton	eschmidt@cs.princeton.edu	Internet-scale Distributed Persistent Search
John Cieslewicz	Columbia	johnc@cs.columbia.edu	Improving Database Performance on Simultaneous Multithreading Processors
Chavdar Botev	Cornell	cbotev@cs.cornell.edu	Integrating Full-Text Search and Structured Search in an XML Database System
Ashwin Machanavajjhala & Dan Kife	Cornell	mvnak@cs.cornell.edu dkifer@cs.cornell.edu	Beyond k-anonymity: New Scheme for Privacy Preserving Data Publishing
Fan Yang	Cornell	yangf@cs.cornell.edu	Hilda: high level language for data driven web-based applications
Mingsheng Hong	Cornell	mshong@cs.cornell.edu	Cayuga Stream Processing System
Prakash Linga	Cornell	linga@cs.cornell.edu	An Indexing Framework for P2P Systems
Wisam Dakka	Columbia	wisam@cs.columbia.edu	Summarization-Aware Search for Online News Articles
Alpa Shah	Columbia	ajs248@cs.columbia.edu	Relational Query Processing Over Text Documents
Amélie Marian	Columbia	amelie@cs.columbia.edu	Top-k Queries over Structured Web Data

Abstracts

Emiran Curtmola

Implementation and Open Research Issues in XML Full-Text Search

The increase of large XML repositories being made available lately has created and determined the need to search both the structure and text content of XML documents. While current XML query processing languages, XPath 2.0 and XQuery 1.0 which are the W3C recommended standards for querying XML documents, operate on structured XML data, they are limited in expressing full-text queries.Recently, the W3C has been working on XQuery Full-Text, a language that extends XPath and XQuery with fully composable full-text search primitives such as phrase matching, Boolean connectives, keyword-distance, stemming and thesauri.

In this poster, I will describe the data model and the query semantics as well as different query evaluation strategies for XQuery Full-Text. I will also discuss the architecture of GalaTex, the first conformant implementation of XQuery Full-Text, which uses Galax as a complete XQuery processor. GalaTex is initially focused on completeness and conformance rather than on efficiency. However, its main benefit is to serve as a reference implementation for XQuery Full-Text and as a platform for addressing new research ideas in XML full-text search. I will discuss ideas on optimizing XML queries over both structure and text, providing a logical framework for evaluating top-K answers based on score pruning, and full-text query equivalence.

A demonstration of GalaTex is provided at http://www.galaxquery.com/galatex/ and will also be available along with this poster.

Ilya Pevzner

Resolving Attribute Value Inconsistencies

Resolving inconsistencies in data is a problem of critical practical importance. Inconsistent data arises whenever an attribute takes on multiple, inconsistent, values. This may occur when a particular entity is stored multiple times in one database, or in multiple databases that are combined. Two types of such inconsistencies can be identified: contextdependent inconsistencies, such as mismatched attribute domains, and context-independent, such as input or measurement errors. Most existing research in the area focuses on context-dependent conflicts, and few works that address context-independent conflicts are based on data quality characteristics of sources of the conflicting data. In practice, such characteristics are often unavailable. We propose an approach based on a probabilistic learning that would allow to automatically resolve context-independent conflicts. The initial implementation uses supervised Bayesian learning with maximum likelihood estimation.

Ziyang Wang

Web Daily News Assistant: Find What's New on Your Web

Large-scale web search engines update their web index slowly and can only serve long life information well. To search new updates and present valuable new information on the Web requires a much higher update rate of the local index and more complex evaluating strategies than existing search technologies. We proposed and developed Web Daily News Assistant, a novel online search service on the scale of community web (a localized web site or web domain) that can quickly and automatically find what is updated on the Web and present valuable new information to users. Finding valuable new information in the hypertext media is seldom addressed in current literature. We exploit novel solutions for web change detection, new information extraction and evaluation in our service. The current deployment of the preliminary engine, called Web Daily News Assistant @ New York University, is providing a daily news digest as well as the access to the complete change history of indexed web pages on university web.

Milan Manavat

Data Outsourcing with Privacy in the Presence of a Secure Co-Processor

In this work we explore mechanisms for outsourcing, updating and querying encrypted data in the presence of potentially malicious or compromised entities. We are mainly concerned with frameworks where clients store data encrypted on servers, and specialized query protocols allow ulterior access to that data.

Existing research efforts can handle specific query types (e.g., aggregation, ranges) but feature large, quickly degrading overheads and requirements of client-side processing for more complex queries (e.g., JOINS).

Additionally, the type of security assurances offered are only statistical, at the extreme converging to a total leak of information of the stored data.

In this research we ask whether a secure co-processor on the server side can offer significant overhead reductions and security guarantees for arbitrary query types. We propose to deploy cryptographically strong mechanisms that leverage the existence of a a secure co-processor on the server side.

In our initial efforts we look at arbitrary relational JOINs and how they can be performed securely over encrypted data with complete computational privacy.

Cristina Schmidt

Squid – Flexible P2P Information Discovery Infrastructure

Scalable information discovery in the absence of global knowledge of names and/or naming conventions remains a fundamental problem in large, decentralized, distributed environments. This is due to the heterogeneous nature and large volume of data and resources, their dynamism and the dynamism of the sharing environment (with nodes joining and leaving). As a result, an ideal information discovery system has to be efficient, fault-tolerant and self-organizing. Furthermore it has to offer guarantees and support flexible searches (using keywords, wildcards and range queries). Peerto- Peer (P2P) systems, whose key characteristics include decentralization, self-organization, dynamism and fault-tolerance, make naturally scalable and attractive solutions for information discovery systems. This poster presents Squid, a P2P information discovery system. Squid is essentially a content-based, Internet-scale distributed hash table (DHT), built on a structured overlay of nodes. Unlike most existing information discovery systems, Squid supports keyword searches, including wildcards, partial keywords and ranges, and guarantees that the information will be found if it exists in the system. The key innovation is a dimension reducing indexing scheme based on Space Filling Curves (SFC), which maintains data locality and effectively maps the multidimensional information space to physical peers. Further, Squid takes advantage of the recursive nature of SFCs to design an efficient query engine. The query engine decentralizes query resolution, allowing the querying process to proceed in parallel at multiple nodes, and typically the nodes that store matching data elements. Squid also defines a hybrid overlay which extends a structured overlay to be flexible and fault-tolerant.

Julia Stoyanovich

A Faceted Query Engine Applied to Archaeology

In this poster, we present a system for storing and querying faceted hierarchies. We have developed a general faceted domain model and a query language for hierarchically classified data. We present here the use of our system on two real archaeological datasets containing thousands of artifacts. Our system is a sharable, evolvable resource that can provide global access to sizeable datasets in queriable format, and can serve as a valuable tool for data analysis and research in many application domains. Faceted classification treats entities or groups of entities as collections of facets: clearly defined, mutually exclusive, and collectively exhaustive aspects, properties or characteristics. For example, in an archaeology domain, we may classify artifacts along multiple orthogonal dimensions: by time period, by culture, by material, by geographic location, etc. Recent work describes search facilities for hierarchical data using faceted hierarchies. We extend this idea and develop a data model and a query language appropriate for such domains. Our primary goal was to create a data model and a query language that is understandable to users, while still allowing one to compose complex queries from simple pieces. A typical user has expert knowledge of the underlying domain, but does not know SQL.

Erich Schmidt

Internet-scale Distributed Persistent Search

Persistent search is available from many information providers today: one can set up persistent queries on Google, Yahoo or CNN, subscribe to RSS feeds from The New York Times, eWeek, or from aggregators like PubSub.com. However, all these services rely on either a small number of sources, or a traditional search engine's database, limited by a sequential (single server/location) architecture . Therefore, either their publication base is small, or the refresh rate is too low, limiting their usefulness. While some experimental search engines (grub.org) already use a distributed architecture to acquire information, they still rely on a single control node and lack a ranking mechanism needed to filter out lower quality publications. We propose a new architecture called Distributed Persistent Search (DPS), based on a distributed publish-subscribe framework that can achieve high publication throughput from a very large publication base. Observing that publications are much larger and more dynamic than queries in a persistent search system, we eliminate inter-server publication traffic at the expense of subscription replication. We also developed a method for ranking documents that is much more appropriate for persistent search than traditional ranking mechanisms used in search engines.

John Cieslewicz

Improving Database Performance on Simultaneous Multithreading Processors

Simultaneous multithreading (SMT) allows multiple threads to supply instructions to the instruction pipeline of a superscalar processor. Because threads share processor resources, an SMT system is inherently different from a multiprocessor system and, therefore, utilizing multiple threads on an SMT processor creates new challenges for database implementers. Three thread-based techniques to exploit SMT architectures on memory- resident data are investigated. First, we consider running independent operations in separate threads, a technique applied to conventional multi- processor systems. Second, we describe a novel implementation strategy in which indi- vidual operators are implemented in a multi- threaded fashion. Finally, we introduce a new data-structure called a work-ahead set that al- lows us to use one of the threads to aggres- sively preload data into the cache for use by the other thread.

Chavdar Botev

Integrating Full-Text Search and Structured Search in an XML Database System

We present the design and implementation of Quark, an XML database system that tightly integrates structured queries and full-text search (including ranked keyword search). Quark supports structured queries using XQuery and supports full-text queries using TeXQuery. TeXQuery is a language that we have developed. It allows complex fully-composable full-text search over XML collections and supports extensible scoring. TeXQuery is also the precursor of XQuery 1.0 and XPath 2.0 Full-Text standard, currently developed by the World Wide Web Consortium. Our TeXQuery implementation supports search conditions like word and phrase matching, proximity, match scoping, Boolean combinations of these, and various aspects like case-sensitivity, stemming, etc. Our implementation also supports a large class of scoring methods for full-text search, including many methods proposed in the literature, such as XXL, XSEarch, and XIRQL. The TeXQuery implementation fully utilizes the existing XQuery rewrite and optimization algorithms. Furthermore, our implementation enables the tight integration and joint optimization of structured queries and full-text search.

Ashwin Machanavajjhala & Dan Kifer

Beyond k-anonymity: New Scheme for Privacy Preserving Data Publishing

Abstract: Publishing summary information about sensitive data, while preserving the privacy of individuals that contributed to the data, is an important problem. In recent years, a definition of privacy called $k$-anonymity has gained importance. In a $k$-anonymized dataset, each record is indistinguishable from at least $k-1$ other records with respect to some ``identifying'' attributes. In this poster we show, with two simple attacks, that a $k$-anonymized dataset has some subtle, but severe privacy problems. First, we show that the attacker can sometimes discover the value of a sensitive attribute due to the lack of diversity in the sensitive attributes. Second, a real adversaries has background knowledge and we show $k$-anonymity does not guarantee privacy with respect to such an adversary. In order to address these issues, we give a detailed analysis of those attacks and we propose an alternative privacy definition which defends against them.

Fan Yang

Hilda: high level language for data driven applications

Data-driven web based applications contain web pages generated based on the content of the database. Such application systems span three conceptual layers: the database, the application logic, and the web interface. We propose Hilda, a high-level language for developing data-driven web applications and the goal is to simplify the development of such applications. The primary benefits of Hilda over existing development platforms are: (1) it uses a unified data model for all layers of the application, (2) it is declarative, (3) it models both application queries and updates, (4) it supports structured programming for web sites, (5) it enables conflict detection due to concurrent updates, and (6) it separates application logic from presentation. We also describe the implementation of a simple proof-of-concept Hilda compiler, which translates a Hilda application program into executable code

We have a project website : http://www.cs.cornell.edu/database/hilda.

Mingsheng Hong

Cayuga Stream Processing System

Recently there has been considerable research on Data Stream Management Systems (DSMS) to support analysis of data that arrives rapidly in high-speed streams. Most of these systems have very expressive query languages in order to cover a wide range of applications. In this work, we take a different approach. Instead of starting with a very powerful query language, we begin with a very simple language that has a well-known scalable implementation --- attribute-value publish/subscribe --- and augment it with a class of stateful subscriptions over sets of records. Our main contributions are a novel algebra for expressing stateful subscriptions, a corresponding transformation of algebra expressions into simple automata, and effective methods for multi-query optimization for sharing processing between queries. Our language can express surprisingly powerful queries. Nevertheless, due to its simple implementation, our system achieves very good performance.

Prakash Linga

An Indexing Framework for P2P Systems

We present a modularized storage and indexing framework that cleanly separates the functional components of a P2P system. This framework enables us to tailor the P2P infrastructure to the specific needs of various Internet applications, without having to devise completely new storage management and index structures for each application. In the context of this indexing framework, we present new techniques to guarantee correctness and availability of P2P range indices.

Wisam Dakka

Summarization-Aware Search for Online News Articles

Machine generated summaries of news article clusters can be informative in cluster-based search. News readers can utilize these summaries in the query results to get a general idea about the events associated with the clusters and to only explore clusters that are likely to fit their interest. Unfortunately, current document summarizers are still not suitable to be applied on the fly due to their unacceptably long running times, which rules out cluster-based search strategies that define documents clusters online, in a query-specific manner. In contrast, cluster-based search strategies that only return offline, query-independent document clusters are efficient, but can often return clusters of little relevance to the queries. We present a Hybrid search strategy to address the limitation of fully online and fully offline cluster-based search approaches. Specifically, our strategy attempts to maximize the use of offline clusters-and their associated summaries computed offline-resorting to online clustering and summarization only when strictly necessary. The Hybrid strategy uses supervised learning to identify offline clusters that are relevant to the user query. At the same time, our strategy forms new clusters with relevant documents from otherwise irrelevant offline clusters. We then utilize cluster ranking to determine the top-k offline and online clusters for the query, and to only summarize any new, online clusters among them. A thorough evaluation of the proposed techniques, involving user relevance judgments and real news articles from a variety of news sources, shows that the quality of our Hybrid results is high, and that these results are computed in substantially less time than with the fully online strategy.

Alpa Shah

Relational Query Processing Over Text Documents

Text documents often embed data that is structured in nature. For many applications, this data may be best exploited if it is available, at least conceptually, as a relational table that could be used to answer structure-aware queries or to run data mining tasks. In this poster, we describe an early-stage project at Columbia to allow relational-style query processing over natural-language text documents. For example, consider a user who requests the names of the CEOs of all companies whose headquarters are in the New York City area. Our goal is to be able to return to the user a reliable answer derived on the fly from, say, a repository of newspaper articles, where all the information that is needed to answer the query is present but "buried" in natural-language text, ready to be extracted. To address this challenging problem, we draw on ideas and tools from a variety of fields, most notably from information extraction (to derive the appropriate structured information from the text documents, hopefully with as little human supervision and training as possible), information retrieval (to identify the documents that are relevant to a particular user query and extraction task, for scalability), and databases (to identify efficient query plans among the many choices in a cost-based manner, as well as to eliminate noise in the query results by applying data cleaning techniques).

Amélie Marian

Top-k Queries over Structured Web Data

Traditionally, query processing strategies for structured (e.g., relational) data identify the ``exact matches'' for the queries. This exact-match query model is not appropriate for many database applications and scenarios where queries are inherently fuzzy --often expressing user preferences and not hard Boolean constraints-- and are best answered with a ranked, or ``top-k,'' list of the best matching data objects. In this poster, I will present top-k query processing algorithms. for a common web-application scenario where the data is structured and only available through autonomous web services with heterogeneous access interfaces and constraints. For efficiency, my algorithms focus on the objects that are most likely to be in the top-k query answers, and discard --as early as possible-- objects that are guaranteed not to qualify for the answers. By taking into account the peculiarities of the sources and potentially choosing a different query execution plan for each individual candidate object, my adaptive algorithms significantly reduce the query processing time over previously existing algorithms, which select coarser ``global'' query execution plans.