Spring '08 North East DB/IR Day

Keynote Speech I: Mor Naaman

Data by the People, for the People [video]

What can we learn from social media and community-contributed collections of information on the web? The most salient attribute of social media is the creation of an environment that promotes user contributions in the form of authoring, curation, discussion and re-use of content. This activity generates large volumes of data, including some types of data that were not previously available. Even more importantly, design decisions in these applications can directly influence the users' motivations to participate, and hugely affect the resultant data. I will discuss the cycle of social media, and argue that a 'holistic' approach to social media systems, which includes design of applications and user research, can advance data mining and information retrieval systems.

Using Flickr as an example, I will describe a study in which we examine what motivates users to add tags and "geotags" to their photos. The new data enables extraction of meaningful (not to say "semantic") information from the Flickr collection. We use the extracted information, for example, to produce summaries and visualizations of the Flickr collection, making the repository more accessible and easier to search, browse and understand as it scales. In the process, the user input helps alleviate previously intractable problems in multimedia content analysis.

Speaker's Bio:

Mor Naaman is a research scientist at Yahoo! Advanced Development Research in Berkeley, where since 2005 he has been leading a team of research engineers and interns. His areas of interest include mobile and ubiquitous computing, interactive multimedia systems, and location- and context-aware computing. Mor received a Ph.D. in Computer Science from Stanford University. His research in the Stanford Infolab also focused on digital media, and in particular the management of digital photographs, thereby allowing (and requiring!) him to take photos throughout his academic career. Mor is a co-chair of the JCDL 2008 Program Committee, a co-chair of ACM Multimedia 2009's Grand Challenge, and a recipient of two JCDL best paper awards. In previous careers, Mor was a professional basketball player as well as a software developer and a college radio DJ. In subsequent careers, Mor hopes to be a professional backpacker and traveler.

Keynote Speech II: Susan Davidson

Provenance and Scientific Workflows [video]

Scientific workflow systems have become popular as a way of specifying and executing data-intensive analyses. As the number of intermediate and final data products produced by these in-silico experiments grows, there is an increasing need to record provenance information to answer questions such as: Who created this data product and when? What was the process used to create the data product? Were two data products derived from the same raw data? A number of specialized workshops, surveys and tutorials have recently appeared on this topic, raising interesting challenges for the database community.

In this talk, I will discuss what workflow provenance is and how it differs from database-style provenance. I will also discuss several of the recent challenges that have been raised, in particular focusing user attention on meaningful provenance, managing the provenance of nested data, and understanding the differences between workflow runs.

Speaker's Bio:

Susan B. Davidson received the B.A. degree in Mathematics from Cornell University, Ithaca, NY, in 1978, and the M.A. and Ph.D. degrees in Electrical Engineering and Computer Science from Princeton University, Princeton, NJ, in 1980 and 1982, respectively. Dr. Davidson joined the University of Pennsylvania in 1982, and is now the Weiss Professor and Department Chair of Computer and Information Science. She is an ACM Fellow, a Fulbright scholar, founding co-Director of the Center for Bioinformatics (PCBI), and recently stepped down as Deputy Dean of the School of Engineering and Applied Science (SEAS).

Dr. Davidson has been heavily involved in curricular initiatives and degree programs in bioinformatics. She also recently co-created the Advancing Women in Engineering (AWE) program, whose goal is to recruit and retain women in engineering.

Dr. Davidson's research interests include database systems, database modeling, distributed systems, and bioinformatics. Within bioinformatics she is best known for her work in data integration, XML query and update technologies, and more recently provenance in workflow systems.

Keynote Speech III: Jeff Naughton

Extracting Problems for Database and IR Researchers [video]

Storing, querying, and managing the evolution of extracted data sets poses an interesting set of challenges for information management systems. In particular, the schema may be partially known, chaotic, and constantly evolving; incomplete, inconsistent, and/or incorrect data is normal and unavoidable; and users are likely to want to query the structure of this data without wanting to learn the schema of the data or a structured query language. It turns out that far from being unique to extracted data sets, all of these problems have long existed, unfortunately largely unsolved, in "classical" database applications. However, it is possible that a change in perspective about the requirements of their solution may allow better progress.

Speaker's Bio:

Jeff Naughton is a Professor of Computer Sciences at the University of Wisconsin-Madison. Professor Naughton received a B.S. degree from the University of Wisconsin-Madison in 1982 and a Ph.D. degree from Stanford University in 1987.

Professor Naughton was awarded a Presidential Young Investigator award in 1991 and is a Fellow of the ACM. He has published over 100 technical papers.