==================================================================
DIP2016Corpus v1.0
==================================================================

As presented in the SIGIR 2016 article "New Collection Announcement:
Focused Retrieval Over the Web".

Please use the following citation:

@InProceedings{Habernal.et.al.SIGIR.2016,
  author    = {Habernal, Ivan and Sukhareva, Maria and Raiber, Fiana and
               Shtok, Anna and Kurland, Oren and Ronen, Hadar and
               Bar-Ilan, Judit and Gurevych, Iryna},
  title     = {{New Collection Announcement: Focused Retrieval Over the Web}},
  booktitle = {Proceedings of the 39th International ACM SIGIR Conference
               on Research and Development in Information Retrieval},
  month     = {July},
  year      = {2016},
  publisher = {ACM},
  address   = {New York, NY, USA},
  pages     = {701--704},
  series    = {SIGIR '16},
  location  = {Pisa, Italy},
  url       = {http://dl.acm.org/citation.cfm?doid=2911451.2914682},
  doi       = {10.1145/2911451.2914682}
}

------------------------------------------------------------------
Content
------------------------------------------------------------------

There are two folders available:

* Step10AggregatedCleanGoldData
    - Contains intermediate data with the original plain text, votes from
      Amazon Mechanical Turk workers, additional instructions for labeling
      relevant/irrelevant sentences, etc.
* DIP2016Corpus
    - The final clean exported corpus

------------------------------------------------------------------
DIP2016Corpus Data format
------------------------------------------------------------------

The data are split into 49 files, one file per query. The files are
UTF-8 encoded XML; each file starts with the query text:

    cellphone for 12 years old kid
    ...

and each document contains the ClueWeb ID and a list of annotated sentences:

    You should also know that The Sacramento ....
    You are more likely to see inappropriate comments before our staff does, so we ask that you click the "Report Abuse" link to submit those comments for moderator review.
    You also may notify us via email at feedback@sacbee.com .
    ...
    Brandon Gonzales, 12, has been using a cellphone since he was 10.
    Almost all his Sutter Middle School friends have cellphones, too.
    His mom, Elizabeth Gonzales, likes knowing that he can call home at any time.
    ...

------------------------------------------------------------------
Usage
------------------------------------------------------------------

* The annotations are licensed under CC-BY 4.0. The original content from
  ClueWeb12 keeps its original license.
* Please cite the SIGIR 2016 article if you use the data in any of your work.

------------------------------------------------------------------
Processing software
------------------------------------------------------------------

* The software package used for preparing this data can be found at the
  following GitHub repository:
  https://github.com/UKPLab/sigir2016-collection-for-focused-retrieval
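
------------------------------------------------------------------
Example: inspecting the corpus files
------------------------------------------------------------------

The data format section describes one UTF-8 XML file per query. The
following is a minimal Python sketch for getting a first overview of those
files. It deliberately assumes nothing about element or attribute names and
only counts the tags it finds. The function name summarize_corpus, the
".xml" file extension, and the folder path in the usage note are assumptions
for illustration, not part of the released package.

```python
import collections
import pathlib
import xml.etree.ElementTree as ET


def summarize_corpus(corpus_dir):
    """Return {file name: Counter of element tags} for every XML file found.

    Assumption: the per-query files carry an ".xml" extension. No element
    names are hard-coded; the tags are read from the files themselves.
    """
    summary = {}
    for path in sorted(pathlib.Path(corpus_dir).glob("*.xml")):
        # ElementTree honours the encoding declared in the XML prolog.
        root = ET.parse(str(path)).getroot()
        summary[path.name] = collections.Counter(el.tag for el in root.iter())
    return summary
```

Calling summarize_corpus("DIP2016Corpus") from the directory that contains
the corpus folder (a hypothetical location) should yield one entry per query
file, from which the actual element names can be read off before writing a
real parser.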