Welcome to the Databases and Information Systems Group
 

iMeMex: DataSpace Management System


Project Description

Today's computer workstations offer thousands of applications
that store data in hundreds of thousands of files
on the file system of the underlying operating system.
Each application reinvents the program logic for processing
these files, which results in a jungle of data processing
solutions and data. Users find it very hard to organize or
retrieve information in this jungle, let alone to use several
files together in a single query.

Operating system vendors such as Microsoft (Longhorn) and Apple (Tiger)
are currently extending their systems with efficient search technologies
that bring the power of Google Web Search to the desktop. However, these
approaches leave many of the problems that users face today unsolved.

In this project we are developing iMeMex, a generic system that goes beyond
the approaches above and solves the problem of integrating
structured data (XML, relations, etc.) and unstructured data (text)
in a generic way. iMeMex is an extensible platform similar to
the Eclipse platform; it can be extended through a plugin concept.
More information on this project is available at www.imemex.org.

Open Topics

All projects should be implemented in Java 1.5.

Combine del.icio.us and Google (Semester thesis)

Write an iTrails plugin that connects to del.icio.us and downloads all the bookmarks.
Register a data source for the URL of each bookmark. iMeMex will then index the full text
of all the web documents. Afterwards, create a trail for each tag that associates the
tag name with the web documents.
Extend our AJAX user interface to provide a mechanism to query by tag names and keywords.
For each result obtained, show the tags (trails) that point to this result. Allow the user
to navigate to new tags (neighborhood queries).
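
The tag-to-document association and the neighborhood query described above can be sketched in memory as follows. This is only an illustration under stated assumptions: the class `TagTrails` and all of its methods are hypothetical and not part of the iMeMex or iTrails API, and downloading bookmarks from del.icio.us is left out entirely.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory model of tag trails; not the iMeMex/iTrails API.
class TagTrails {
    // tag name -> URLs of the bookmarked documents carrying that tag
    private final Map<String, Set<String>> urlsByTag = new HashMap<String, Set<String>>();
    // URL -> tag names (trails) that point to that document
    private final Map<String, Set<String>> tagsByUrl = new HashMap<String, Set<String>>();

    // Records one bookmark together with all of its tags.
    void addBookmark(String url, String... tags) {
        for (String tag : tags) {
            put(urlsByTag, tag, url);
            put(tagsByUrl, url, tag);
        }
    }

    // Documents reachable through the trail of the given tag.
    Set<String> documentsFor(String tag) {
        return copyOf(urlsByTag, tag);
    }

    // Tags (trails) that point to the given result document.
    Set<String> tagsFor(String url) {
        return copyOf(tagsByUrl, url);
    }

    // Neighborhood query: tags sharing at least one document with the given tag.
    Set<String> neighborTags(String tag) {
        Set<String> neighbors = new TreeSet<String>();
        for (String url : documentsFor(tag)) {
            neighbors.addAll(tagsFor(url));
        }
        neighbors.remove(tag);
        return neighbors;
    }

    private static void put(Map<String, Set<String>> map, String key, String value) {
        Set<String> values = map.get(key);
        if (values == null) {
            values = new TreeSet<String>();
            map.put(key, values);
        }
        values.add(value);
    }

    // Returns a defensive copy, or an empty set for an unknown key.
    private static Set<String> copyOf(Map<String, Set<String>> map, String key) {
        Set<String> values = map.get(key);
        return values == null ? new TreeSet<String>() : new TreeSet<String>(values);
    }
}
```

In a real plugin the two maps would presumably be replaced by trails stored in iMeMex itself; the sketch only shows which lookups the user interface needs.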

Improving XML search with iTrails (MSc thesis)

Extend iQL and its parser to support NEXI as a subset. TopX is a state-of-the-art XML/IR
search engine. Provide a TopX plugin that sends queries compiled in iMeMex to TopX in mediation
mode and obtains query results. Index the standard INEX Wikipedia dataset with iMeMex/TopX and
evaluate precision and recall performance.
Afterwards, check whether there are meaningful trails that could be manually created for the INEX
data set. Also work on automatic trail definition from the data. For example, create trails from
structured sources, such as ontologies or dictionaries, or from keyword co-occurrence summaries
(e.g. detect that "vegetable oil" and "oil refinery" are searches for different types of oil).
How much can you improve precision and recall of the queries in INEX with your trails?
Your work will enable us to evaluate how much the iTrails integration framework improves on a
state-of-the-art XML/IR search engine on a standard dataset.
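
For reference, set-based precision and recall can be computed as in the following minimal sketch. It assumes document identifiers are plain strings and that relevance is binary; the official INEX measures may be defined differently, and the class name `RetrievalMetrics` is made up for this illustration.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for set-based precision and recall (binary relevance).
class RetrievalMetrics {
    // Number of distinct retrieved documents that are also relevant.
    private static int hits(Set<String> retrieved, Set<String> relevant) {
        Set<String> common = new HashSet<String>(retrieved);
        common.retainAll(relevant);
        return common.size();
    }

    // precision = |retrieved AND relevant| / |retrieved|
    static double precision(List<String> retrieved, Set<String> relevant) {
        Set<String> distinct = new HashSet<String>(retrieved);
        if (distinct.isEmpty()) {
            return 0.0; // convention chosen for this sketch
        }
        return (double) hits(distinct, relevant) / distinct.size();
    }

    // recall = |retrieved AND relevant| / |relevant|
    static double recall(List<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0; // convention chosen for this sketch
        }
        return (double) hits(new HashSet<String>(retrieved), relevant) / relevant.size();
    }
}
```

Running the same query set with and without trails and comparing the two pairs of numbers gives the improvement the topic asks about.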

Differential Indexing / Update streams (MSc / Semester thesis)

Create a generic mechanism to allow all iMeMex indexes to be updated, as long as
they provide functionality for bulk insert and for scanning the contents of the index.
Updates should be collected in a differential file. Queries are answered by merging
the differential file with the index responses at query time. When differential files
become big, we should perform a merge operation to integrate the updates in the
differential file with the original index. This means creating a new index from the
contents of the old one and from the differential file.
To integrate and test differential indexing in iMeMex, create update streams in the
system. These update streams are push-operator plans. For some data sources (e.g.
Windows NTFS), we already have plugins that detect update events. These events would be
pushed through the update stream. The update stream should lead to all structures that
need updates (materialized views and indexes).
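
The differential file idea can be illustrated with a small key-value sketch. Everything here is an assumption made for illustration: the interface `BulkIndex`, the classes `MapIndex` and `DifferentialIndex`, and the size-based merge threshold are hypothetical and do not reflect the actual iMeMex index interfaces. Update streams are not modeled; `put` and `delete` stand in for the events such a stream would deliver.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical minimal index contract: bulk insert, full scan, and lookup.
interface BulkIndex {
    void bulkInsert(Map<String, String> entries);
    Map<String, String> scan();
    String lookup(String key);
}

// Trivial in-memory index used as the base index in this sketch.
class MapIndex implements BulkIndex {
    private final Map<String, String> entries = new TreeMap<String, String>();

    public void bulkInsert(Map<String, String> newEntries) {
        entries.putAll(newEntries);
    }

    public Map<String, String> scan() {
        return new TreeMap<String, String>(entries);
    }

    public String lookup(String key) {
        return entries.get(key);
    }
}

// Collects updates in a differential structure and merges them on demand.
class DifferentialIndex {
    private BulkIndex base;
    // Pending updates; a null value marks a deletion.
    private final Map<String, String> diff = new HashMap<String, String>();
    private final int mergeThreshold;

    DifferentialIndex(BulkIndex base, int mergeThreshold) {
        this.base = base;
        this.mergeThreshold = mergeThreshold;
    }

    void put(String key, String value) {
        diff.put(key, value);
        mergeIfLarge();
    }

    void delete(String key) {
        diff.put(key, null);
        mergeIfLarge();
    }

    // Query-time merge: the differential entry overrides the base index.
    String lookup(String key) {
        if (diff.containsKey(key)) {
            return diff.get(key); // null if the key was deleted
        }
        return base.lookup(key);
    }

    int pendingUpdates() {
        return diff.size();
    }

    private void mergeIfLarge() {
        if (diff.size() >= mergeThreshold) {
            merge();
        }
    }

    // Builds a new index from the old contents plus the differential file.
    void merge() {
        Map<String, String> merged = base.scan();
        for (Map.Entry<String, String> update : diff.entrySet()) {
            if (update.getValue() == null) {
                merged.remove(update.getKey());
            } else {
                merged.put(update.getKey(), update.getValue());
            }
        }
        BulkIndex fresh = new MapIndex(); // a generic version would use a factory
        fresh.bulkInsert(merged);
        base = fresh;
        diff.clear();
    }
}
```

The old base index is never modified in place, which matches the requirement that an index only needs bulk insert and scan to become updatable.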

Bringing public and personal search closer together (Semester thesis)

Implement plugins that allow users to query public web sites with iMeMex.
Possible example web sites are Google, SBB, Amazon (OpenSearch protocol), map.search.ch,
www.weisseseitech.ch, ...
Define an encoding from iQL to the "query language" of a particular web site, e.g. a REST
interface or OpenSearch.
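
As a rough sketch of such an encoding, the following turns a list of keywords into a request URL by filling in an OpenSearch-style URL template. The assumptions are significant: the iQL query is reduced to a plain keyword list (real iQL is richer), the class `KeywordQueryEncoder` is hypothetical, and the endpoint used in the usage example is a placeholder rather than a real service. Only the `{searchTerms}` placeholder is taken from the OpenSearch URL template syntax.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.List;

// Hypothetical encoder from a keyword query to an OpenSearch-style URL.
class KeywordQueryEncoder {
    // Replaces {searchTerms} in the template with the URL-encoded keywords.
    static String toUrl(String template, List<String> keywords) {
        StringBuilder terms = new StringBuilder();
        for (String keyword : keywords) {
            if (terms.length() > 0) {
                terms.append(' ');
            }
            terms.append(keyword);
        }
        try {
            String encoded = URLEncoder.encode(terms.toString(), "UTF-8");
            return template.replace("{searchTerms}", encoded);
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

For example, `KeywordQueryEncoder.toUrl("http://search.example.org/?q={searchTerms}", java.util.Arrays.asList("oil", "refinery"))` yields `http://search.example.org/?q=oil+refinery`. A plugin would then fetch that URL and map the response back to iMeMex results, which this sketch does not cover.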

Support for contacts (Semester thesis)

Extract contacts from address book files (e.g. vCard 3.0) and email messages.
Mine emails to detect hot topics of discussion. Relate contacts to the topics
they discuss, assigning weights, and use trails to store these relationships.
Finally, use the Enron dataset to evaluate your work.
Your work will enable people to search for other people or for who was
discussing a certain topic.
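
The contact extraction step could start from something like the following sketch, which pulls the formatted name (`FN`) and email addresses (`EMAIL`) out of vCard text. It is deliberately simplified and makes assumptions: folded lines, property groups, quoted parameter values, and escaping are all ignored, and the classes `Contact` and `VCardReader` are hypothetical names for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical contact record for this sketch.
class Contact {
    String name;
    final List<String> emails = new ArrayList<String>();
}

// Simplified vCard reader: handles only unfolded FN and EMAIL lines.
class VCardReader {
    static List<Contact> parse(String text) {
        List<Contact> contacts = new ArrayList<Contact>();
        Contact current = null;
        for (String rawLine : text.split("\r?\n")) {
            String line = rawLine.trim();
            if (line.equalsIgnoreCase("BEGIN:VCARD")) {
                current = new Contact();
                continue;
            }
            if (line.equalsIgnoreCase("END:VCARD")) {
                if (current != null) {
                    contacts.add(current);
                }
                current = null;
                continue;
            }
            int colon = line.indexOf(':');
            if (current == null || colon < 0) {
                continue; // outside a card, or not a property line
            }
            // The property name ends at the first ';' (parameters) or ':'.
            String head = line.substring(0, colon);
            int semicolon = head.indexOf(';');
            String property = (semicolon < 0 ? head : head.substring(0, semicolon)).toUpperCase();
            String value = line.substring(colon + 1);
            if (property.equals("FN")) {
                current.name = value;
            } else if (property.equals("EMAIL")) {
                current.emails.add(value);
            }
        }
        return contacts;
    }
}
```

The extracted email addresses give a natural join key between address book entries and the senders and recipients of mined email messages.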

Write iMeMex plugins in Python (or another scripting language) (Semester thesis)

Enable iMeMex plugins to be written in scripting languages, e.g. with Jython (formerly JPython).
Implement two proof-of-concept plugins of your own choice.
Some suggestions:
- Content converters to extract text from HTML / OpenOffice.org / Microsoft Word
- A content converter or a data source plugin to read Microsoft Outlook contacts
- Content converters that know how to extract metadata from different image file
types (e.g. JPEG, GIF, PNG) and sound file types (e.g. MP3, Ogg, AAC, WMA).
- Data source plugins to access Flickr or Picasa

Contact:

Dr. Jens Dittrich (jens.dittrich at inf dot ethz dot ch)

 

© 2012 ETH Zurich | Imprint | Disclaimer | 17 September 2007