One Million Posts: A Data Set of German Online Discussions

by Dietmar Schabus, Marcin Skowron, Martin Trapp
Abstract:
In this paper we introduce a new data set consisting of user comments posted to the website of a German-language Austrian newspaper. Professional forum moderators have annotated 11,773 posts according to seven categories they considered crucial for the efficient moderation of online discussions in the context of news articles. In addition to this taxonomy and annotated posts, the data set contains one million unlabeled posts. Our experimental results using six methods establish a first baseline for predicting these categories. The data and our code are available for research purposes from https://ofai.github.io/million-post-corpus.
Reference:
Dietmar Schabus, Marcin Skowron, and Martin Trapp, “One Million Posts: A Data Set of German Online Discussions”, in Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), Tokyo, Japan, pp. 1241–1244, 2017.
Bibtex Entry:
@InProceedings{Schabus2017,
  Title                    = {One Million Posts: A Data Set of German Online Discussions},
  Author                   = {Dietmar Schabus and Marcin Skowron and Martin Trapp},
  Booktitle                = {Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR)},
  Year                     = {2017},
  Pages                    = {1241--1244},
  Address                  = {Tokyo, Japan},
  Month                    = aug,
  Abstract                 = {In this paper we introduce a new data set consisting of user comments posted to the website of a German-language Austrian newspaper. Professional forum moderators have annotated 11,773 posts according to seven categories they considered crucial for the efficient moderation of online discussions in the context of news articles. In addition to this taxonomy and annotated posts, the data set contains one million unlabeled posts. Our experimental results using six methods establish a first baseline for predicting these categories. The data and our code are available for research purposes from https://ofai.github.io/million-post-corpus.},
  Doi                      = {10.1145/3077136.3080711},
}