Student | University | Email | Title |
Emiran Curtmola | UCSD | | Implementation and Open Research Issues in XML Full-Text Search |
Ilya Pevzner | NYU | | |
Ziyang Wang | NYU | | |
| Stony Brook | | Data Outsourcing with Privacy in the Presence of a Secure Co-Processor |
Cristina Schmidt | | | |
Julia Stoyanovich | | | |
Erich Schmidt | | | |
John Cieslewicz | | | Improving Database Performance on Simultaneous Multithreading Processors |
Chavdar Botev | Cornell | | Integrating Full-Text Search and Structured Search in an XML Database System |
Ashwin Machanavajjhala & Dan Kifer | Cornell | dkifer@cs.cornell.edu | Beyond k-anonymity: New Scheme for Privacy Preserving Data Publishing |
Fan Yang | Cornell | yangf@cs.cornell.edu | Hilda: high level language for data driven web-based applications |
Mingsheng Hong | Cornell | | |
Prakash Linga | Cornell | | |
Wisam Dakka | | | |
Alpa Shah | | | |
Amélie Marian | | amelie@cs.columbia.edu | |
Emiran Curtmola
Implementation and Open Research Issues in XML Full-Text Search
The increase of large XML
repositories being made available lately has
In this poster, I will describe the
data model and the query semantics
A demonstration of GalaTex is provided at
Ilya Pevzner
Resolving Attribute Value Inconsistencies
Resolving inconsistencies in data is a problem of critical practical importance. Inconsistent data arises whenever an attribute takes on multiple, conflicting values. This may occur when a particular entity is stored multiple times in one database, or in multiple databases that are combined. Two types of such inconsistencies can be identified: context-dependent inconsistencies, such as mismatched attribute domains, and context-independent inconsistencies, such as input or measurement errors. Most existing research in the area focuses on context-dependent conflicts, and the few works that address context-independent conflicts rely on data quality characteristics of the sources of the conflicting data. In practice, such characteristics are often unavailable.
We propose an approach based on probabilistic learning that automatically resolves context-independent conflicts. The initial implementation uses supervised Bayesian learning with maximum likelihood estimation.
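The abstract does not spell out the model, so the following is only a minimal sketch of one way supervised Bayesian learning with maximum likelihood estimation could resolve a conflict. It assumes each source has a fixed accuracy, that sources err independently, and that the prior over candidate values is uniform. The function names, the source names (crm, billing, web), and the training records are all hypothetical.

```python
import math
from collections import defaultdict

def estimate_source_accuracy(training):
    """MLE of each source's accuracy: the fraction of its reports that matched the truth."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for reports, truth in training:
        for source, value in reports.items():
            total[source] += 1
            if value == truth:
                correct[source] += 1
    return {s: correct[s] / total[s] for s in total}

def resolve(reports, accuracy, eps=1e-6):
    """Pick the reported value with the highest posterior under a uniform prior."""
    candidates = sorted(set(reports.values()))
    n_wrong = max(len(candidates) - 1, 1)  # wrong values assumed equally likely
    best_value, best_score = None, -math.inf
    for candidate in candidates:
        score = 0.0
        for source, value in reports.items():
            # Clamp the estimate so log() never sees 0 (a practical guard, not part of the MLE).
            p = min(max(accuracy.get(source, 0.5), eps), 1 - eps)
            score += math.log(p if value == candidate else (1 - p) / n_wrong)
        if score > best_score:
            best_value, best_score = candidate, score
    return best_value

# Hypothetical labeled conflicts: (value reported by each source, known true value).
training = [
    ({"crm": "NY", "billing": "NY", "web": "NJ"}, "NY"),
    ({"crm": "CA", "billing": "CA", "web": "CA"}, "CA"),
    ({"crm": "TX", "billing": "TX", "web": "TX"}, "TX"),
    ({"crm": "WA", "billing": "OR", "web": "OR"}, "WA"),
]
accuracy = estimate_source_accuracy(training)
resolved = resolve({"crm": "IL", "billing": "IN", "web": "IN"}, accuracy)
```

In this toy data the learned accuracies let a single reliable source outvote two less reliable ones, which a plain majority vote would not do.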
Ziyang Wang
Web Daily News Assistant: Find What's New on Your Web
Large-scale web search engines update their web index slowly and serve only long-lived information well. Finding new updates and presenting valuable new information on the Web requires a much higher update rate for the local index and more complex evaluation strategies than existing search technologies provide. We propose and have developed Web Daily News Assistant, a novel online search service at the scale of a community web (a localized web site or web domain) that quickly and automatically finds what has been updated on the Web and presents valuable new information to users. Finding valuable new information in hypertext media is seldom addressed in the current literature. Our service employs novel solutions for web change detection, new information extraction, and evaluation. The current deployment of the preliminary engine, called Web Daily News Assistant @ New York University, provides a daily news digest as well as access to the complete change history of indexed web pages on the university web.
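The abstract does not describe how the service detects changes, so the sketch below is only a generic illustration of the idea: compare two stored snapshots of a page and keep the lines that are new. It uses a line-level diff from the Python standard library. The function names and the sample page text are hypothetical and are not taken from the deployed system.

```python
import difflib
import hashlib

def fingerprint(text):
    """Cheap content hash used to skip pages that did not change at all."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def added_lines(old_text, new_text):
    """Return the lines of the new snapshot that were inserted or rewritten."""
    if fingerprint(old_text) == fingerprint(new_text):
        return []
    old = old_text.splitlines()
    new = new_text.splitlines()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    added = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            added.extend(new[j1:j2])
    return added

# Hypothetical snapshots of the same page on two consecutive days.
yesterday = "Welcome\nSeminar on Friday\nContact us"
today = "Welcome\nSeminar on Friday\nNew course: Databases\nContact us"
fresh = added_lines(yesterday, today)
```

A real service would still have to decide which of the added lines are valuable, which is the evaluation step the abstract mentions and this sketch leaves out.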
Data Outsourcing with Privacy in the Presence of a Secure Co-Processor
In this work we explore mechanisms for outsourcing, updating, and querying encrypted data in the presence of potentially malicious or compromised entities. We are mainly concerned with frameworks where clients store encrypted data on servers, and specialized query protocols allow later access to that data.
Existing research efforts can handle specific query types (e.g., aggregation, ranges) but feature large, quickly degrading overheads and require client-side processing for more complex queries (e.g., JOINs). Additionally, the security assurances offered are only statistical, at the extreme converging to a total leak of the stored data.
In this research we ask whether a secure co-processor on the server side can offer significant overhead reductions and security guarantees for arbitrary query types.
We propose to deploy cryptographically strong mechanisms that leverage the existence of a secure co-processor on the server side.
In our initial efforts we look at arbitrary relational JOINs and how they can be performed securely over encrypted data with complete computational privacy.
Cristina Schmidt
Squid – Flexible P2P Information Discovery Infrastructure
Julia Stoyanovich
A Faceted Query Engine Applied to Archaeology
Erich Schmidt
Internet-scale Distributed Persistent Search
Persistent search is available from many information providers today: one can set up persistent queries on Google, Yahoo, or CNN, or subscribe to RSS feeds from The New York Times, eWeek, or aggregators like PubSub.com. However, all these services rely on either a small number of sources or a traditional search engine's database, which is limited by a sequential (single server/location) architecture. Therefore, either their publication base is small or their refresh rate is too low, which limits their usefulness. While some experimental search engines (grub.org) already use a distributed architecture to acquire information, they still rely on a single control node and lack the ranking mechanism needed to filter out lower-quality publications. We propose a new architecture called Distributed Persistent Search (DPS), based on a distributed publish-subscribe framework that can achieve high publication throughput from a very large publication base. Observing that publications are much larger and more dynamic than queries in a persistent search system, we eliminate inter-server publication traffic at the expense of subscription replication. We also developed a method for ranking documents that is much more appropriate for persistent search than the traditional ranking mechanisms used in search engines.
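The trade-off in the last sentences (replicate subscriptions so that publications never travel between servers) can be pictured with a small sketch. This is not the DPS implementation: the class names are hypothetical, a subscription is assumed to be a set of keywords that must all appear in a publication, and the ranking method is omitted because the abstract does not describe it.

```python
class Server:
    """One node; it holds a full copy of every subscription."""

    def __init__(self, name):
        self.name = name
        self.subscriptions = {}  # subscriber -> set of required keywords

    def add_subscription(self, subscriber, keywords):
        self.subscriptions[subscriber] = {k.lower() for k in keywords}

    def publish(self, text):
        """Match a publication locally; nothing is forwarded to other servers."""
        words = set(text.lower().split())
        return sorted(s for s, kw in self.subscriptions.items() if kw <= words)


class Cluster:
    def __init__(self, names):
        self.servers = [Server(n) for n in names]

    def subscribe(self, subscriber, keywords):
        # Subscriptions are small, so each one is copied to every server.
        for server in self.servers:
            server.add_subscription(subscriber, keywords)

    def publish(self, server_index, text):
        # Publications are large, so each is handled only where it arrives.
        return self.servers[server_index].publish(text)


cluster = Cluster(["a", "b", "c"])
cluster.subscribe("alice", ["xml", "search"])
cluster.subscribe("bob", ["p2p"])
matched_xml = cluster.publish(2, "New XML full-text search engine released")
matched_p2p = cluster.publish(0, "P2P range index")
```

Because every server can answer for every subscriber, a publication is matched wherever it enters the system, at the cost of storing each subscription once per server.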
John Cieslewicz
Improving Database Performance on Simultaneous Multithreading Processors
Simultaneous multithreading (SMT) allows multiple threads to supply instructions to the instruction pipeline of a superscalar processor. Because threads share processor resources, an SMT system is inherently different from a multiprocessor system and, therefore, utilizing multiple threads on an SMT processor creates new challenges for database implementers.
Three thread-based techniques to exploit SMT architectures on memory-resident data are investigated. First, we consider running independent operations in separate threads, a technique applied to conventional multiprocessor systems. Second, we describe a novel implementation strategy in which individual operators are implemented in a multithreaded fashion. Finally, we introduce a new data structure called a work-ahead set that allows us to use one of the threads to aggressively preload data into the cache for use by the other thread.
Chavdar Botev
Integrating Full-Text Search and Structured Search in an XML Database System
Ashwin Machanavajjhala & Dan Kifer
Beyond k-anonymity: New Scheme for Privacy Preserving Data Publishing
Fan Yang
Hilda: high level language for data driven applications
Data-driven web-based applications contain web pages generated from the content of a database. Such application systems span three conceptual layers: the database, the application logic, and the web interface. We propose Hilda, a high-level language for developing data-driven web applications, whose goal is to simplify the development of such applications. The primary benefits of Hilda over existing development platforms are: (1) it uses a unified data model for all layers of the application, (2) it is declarative, (3) it models both application queries and updates, (4) it supports structured programming for web sites, (5) it enables detection of conflicts due to concurrent updates, and (6) it separates application logic from presentation. We also describe the implementation of a simple proof-of-concept Hilda compiler, which translates a Hilda application program into executable code.
The project website is http://www.cs.cornell.edu/database/hilda.
Mingsheng Hong
Cayuga Stream Processing System
Prakash Linga
An Indexing Framework for P2P Systems
We present a modularized storage and
indexing framework that cleanly separates the functional components of a P2P
system. This framework enables us to tailor the P2P infrastructure to the
specific needs of various Internet applications, without having to devise
completely new storage management and index structures for each application. In
the context of this indexing framework, we present new techniques to guarantee
correctness and availability of P2P range indices.
Wisam Dakka
Summarization-Aware Search for Online News Articles
Relational Query Processing Over Text Documents
Text documents often embed data that is structured in nature. For many applications, this data may be best exploited if it is available, at least conceptually, as a relational table that could be used to answer structure-aware queries or to run data mining tasks. In this poster, we describe an early-stage project at Columbia to allow relational-style query processing over natural-language text documents. For example, consider a user who requests the names of the CEOs of all companies whose headquarters are in the New York City area. Our goal is to be able to return to the user a reliable answer derived on the fly from, say, a repository of newspaper articles, where all the information that is needed to answer the query is present but "buried" in natural-language text, ready to be extracted. To address this challenging problem, we draw on ideas and tools from a variety of fields, most notably from information extraction (to derive the appropriate structured information from the text documents, hopefully with as little human supervision and training as possible), information retrieval (to identify the documents that are relevant to a particular user query and extraction task, for scalability), and databases (to identify efficient query plans among the many choices in a cost-based manner, as well as to eliminate noise in the query results by applying data cleaning techniques).
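The CEO example above can be made concrete with a toy sketch. It is not the project's system: two hand-written regular expressions stand in for a real information extraction component, the sentences, people, and companies are invented, and SQLite stands in for the relational engine. Its only purpose is to show text being turned into tables that a join can query.

```python
import re
import sqlite3

# Invented sentences standing in for a repository of newspaper articles.
ARTICLES = [
    "Jane Doe, CEO of Acme Corp, announced record earnings on Monday.",
    "Acme Corp is headquartered in New York, near its main customers.",
    "John Roe, CEO of Globex Inc, declined to comment.",
    "Globex Inc is headquartered in Chicago, where it was founded.",
]

# Hand-written patterns; a real extractor would be learned or far more robust.
CEO_PATTERN = re.compile(
    r"(?P<person>[A-Z]\w+ [A-Z]\w+), CEO of (?P<company>[A-Z]\w+ (?:Corp|Inc))")
HQ_PATTERN = re.compile(
    r"(?P<company>[A-Z]\w+ (?:Corp|Inc)) is headquartered in "
    r"(?P<city>[A-Z]\w+(?: [A-Z]\w+)*)")

def build_tables(articles):
    """Extract (person, company) and (company, city) tuples into in-memory tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ceo (person TEXT, company TEXT)")
    conn.execute("CREATE TABLE headquarters (company TEXT, city TEXT)")
    for text in articles:
        for m in CEO_PATTERN.finditer(text):
            conn.execute("INSERT INTO ceo VALUES (?, ?)", (m["person"], m["company"]))
        for m in HQ_PATTERN.finditer(text):
            conn.execute("INSERT INTO headquarters VALUES (?, ?)", (m["company"], m["city"]))
    return conn

def ceos_in(conn, city):
    """Answer the structured query with an ordinary relational join."""
    rows = conn.execute(
        "SELECT ceo.person FROM ceo JOIN headquarters USING (company) "
        "WHERE headquarters.city = ? ORDER BY ceo.person", (city,))
    return [row[0] for row in rows]

conn = build_tables(ARTICLES)
nyc_ceos = ceos_in(conn, "New York")
```

The sketch extracts everything up front; the abstract instead describes choosing among query plans in a cost-based manner and retrieving only relevant documents, which this toy version does not attempt.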
Top-k Queries over Structured Web Data