Prashant Khare, Gregoire Burel, Diana Maynard and Harith Alani.
Abstract: Many citizens now turn to social media during crises to share or acquire the latest information about the event. Because of the sheer volume of data typically circulated during such events, it is necessary to filter out irrelevant posts efficiently and focus attention on those that are truly relevant to the crisis. Recent research has experimented with various statistical and semantic methods to automatically classify posts as relevant or irrelevant to a given crisis or set of crises. However, it is unclear how such approaches perform when the posts about a crisis are written in different languages. The typical approach is to train a model for each language, but this is costly, time consuming, and not a viable option for rapidly evolving crisis situations.
In this paper we test statistical and semantic classification approaches on cross-lingual datasets from 30 crisis events, consisting of posts written mainly in English, Spanish, and Italian. We experiment with scenarios where the model is trained on one language and tested on another, and where the data is translated into a single language. We show that adding semantic features extracted from external knowledge bases increases accuracy over the statistical model.
Keywords: semantics; cross-lingual; multilingual; crisis informatics; tweet classification
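
The train-on-one-language, test-on-another scenario from the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the tiny `CONCEPTS` lookup table stands in for an external knowledge base, the Naive Bayes classifier is a stand-in for whatever model the authors use, and the eight example posts are invented. The sketch only shows why language-independent concept features might transfer across languages when word features do not.

```python
import math
from collections import Counter

# Hypothetical concept lookup standing in for an external knowledge base.
# A real system would query a multilingual semantic resource instead.
CONCEPTS = {
    "flood": "NaturalDisaster", "inundación": "NaturalDisaster",
    "earthquake": "NaturalDisaster", "terremoto": "NaturalDisaster",
    "football": "Sport", "fútbol": "Sport",
    "phone": "Device", "teléfono": "Device",
}

def features(text, use_semantics):
    """Word tokens, optionally extended with language-independent concept tags."""
    tokens = text.lower().split()
    feats = list(tokens)
    if use_semantics:
        feats += ["CONCEPT=" + CONCEPTS[t] for t in tokens if t in CONCEPTS]
    return feats

def train(docs, use_semantics):
    """Collect per-class feature counts for a multinomial Naive Bayes model."""
    counts = {True: Counter(), False: Counter()}
    priors = Counter()
    for text, label in docs:
        priors[label] += 1
        counts[label].update(features(text, use_semantics))
    vocab = set(counts[True]) | set(counts[False])
    return counts, priors, vocab

def predict(model, text, use_semantics):
    """Return the label with the highest log score; ties go to 'relevant'."""
    counts, priors, vocab = model
    n_docs = sum(priors.values())
    best_label, best_score = None, -math.inf
    for label in (True, False):
        total = sum(counts[label].values())
        score = math.log(priors[label] / n_docs)
        for f in features(text, use_semantics):
            if f in vocab:  # features never seen in training are ignored
                # Laplace-smoothed likelihood of the feature given the class
                score += math.log((counts[label][f] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(train_docs, test_docs, use_semantics):
    model = train(train_docs, use_semantics)
    hits = sum(predict(model, t, use_semantics) == y for t, y in test_docs)
    return hits / len(test_docs)

# Invented toy posts: label True means "relevant to a crisis".
ENGLISH_TRAIN = [
    ("flood destroyed the bridge", True),
    ("earthquake victims need help", True),
    ("watching football tonight", False),
    ("new phone is great", False),
]
SPANISH_TEST = [
    ("inundación en la ciudad", True),
    ("terremoto deja heridos", True),
    ("partido de fútbol hoy", False),
    ("me gusta mi teléfono", False),
]

acc_statistical = accuracy(ENGLISH_TRAIN, SPANISH_TEST, use_semantics=False)
acc_semantic = accuracy(ENGLISH_TRAIN, SPANISH_TEST, use_semantics=True)
print("words only:", acc_statistical, "words + concepts:", acc_semantic)
```

On this toy data the word-only model shares no vocabulary with the Spanish posts and falls back on the class prior, while the concept tags carry over between languages. This is an illustration of the idea only and says nothing about the size of the effect reported in the paper.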