Denoyer Ludovic & Gallinari Patrick (2006). « The Wikipedia XML corpus ». SIGIR Forum, vol. 40, n° 1, p. 64–69.
Added by: Laure Endrizzi 2007-11-25 15:27:33    Last edited by: Laure Endrizzi 2007-11-25 15:45:04
Categories: 4. interfaces and modes of consultation
Keywords: structured documents, information extraction, interface, information retrieval, Wikipedia
Creators: Denoyer, Gallinari
Collection: SIGIR Forum

Wikipedia is a well-known free-content, multilingual encyclopedia written collaboratively by contributors around the world. Anybody can edit an article using a wiki markup language that offers a simplified alternative to HTML. The encyclopedia comprises millions of articles in different languages.
Content-oriented XML retrieval is an area of Information Retrieval (IR) research that is receiving increasing interest. A very active community already exists in the IR/XML domain and has begun working on XML search engines and XML textual data. Since 2002, this community has been organized mainly around the INEX initiative (INitiative for the Evaluation of XML Retrieval), which is funded by the DELOS Network of Excellence on Digital Libraries.
In this article, we describe a set of XML collections based on Wikipedia. These collections can be used in a wide variety of XML IR and machine learning tasks, such as ad hoc retrieval, categorization, clustering, or structure mapping. The corpora are currently used for both INEX 2006 and the XML Document Mining Challenge. The article provides a description of the corpus.
