About

TWITA is a collection of tweets identified as being written in the Italian language.

In its first version, containing data from 2012 to 2015, the collection of tweets was harvested using two-pass language identification, aiming for general Italian language. We used cURL to download tweets from the Twitter Streaming API, searching for a list of representative words:

vita Roma forza alla quanto amore Milano Italia fare grazie
della anche periodo bene scuola dopo tutto ancora tutti fatto

The list consists of the most frequent lemmas in the ItWaC corpus; all words that are also frequent in other languages (English, Spanish and Portuguese, for the most part) were filtered out (e.g. come). As a second step, the tweets were passed to the language identification software langid.py to detect Italian language.
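The two passes described above can be sketched in a few lines of Python. This is only an illustration, not the original pipeline: in the original setup the first pass was performed by the Streaming API itself, so the local keyword matcher below is a stand-in, and the names KEYWORDS, matches_keywords and two_pass_filter are hypothetical. The second pass is represented by a classifier function supplied by the caller.

```python
import re
from typing import Callable, Iterable, Iterator

# Keyword list as given in the text above, lowercased for matching.
KEYWORDS = frozenset(
    "vita Roma forza alla quanto amore Milano Italia fare grazie "
    "della anche periodo bene scuola dopo tutto ancora tutti fatto".lower().split()
)

# Assumption: simple word-character tokenisation, not Twitter's own tokeniser.
TOKEN_RE = re.compile(r"\w+")


def matches_keywords(text: str) -> bool:
    """First pass: True if any token of the tweet is one of the keywords."""
    return any(tok in KEYWORDS for tok in TOKEN_RE.findall(text.lower()))


def two_pass_filter(
    tweets: Iterable[str], classify: Callable[[str], str]
) -> Iterator[str]:
    """Yield tweets that pass the keyword match and are classified as "it".

    `classify` maps a text to a language code; it stands in for the
    language identification step.
    """
    for text in tweets:
        if matches_keywords(text) and classify(text) == "it":
            yield text
```

With langid.py installed, a classifier such as `lambda t: langid.classify(t)[0]` could be passed as `classify`, since `langid.classify` returns a (language, score) pair.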

When it was first published in 2013, the resource contained about 100 million tweets in Italian, from February 2012 to February 2013.

The automatic collection continued, however, and in 2015 it was transferred from the University of Groningen to the University of Turin. In June 2018, a new filter based on the five Italian vowels was added to the pipeline, along with the language filter provided by the Twitter API, which was not previously available, in order to limit the number of tweets in other languages that are captured accidentally. In the latest version of the data collection pipeline, a Python script employing the tweepy library gathers JSON tweets using the following filter:

track=["a","e","i","o","u"]
languages=["it"]
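To make the effect of these two parameters concrete, the sketch below applies an approximation of the same filter locally to a single JSON tweet. It is not the collection script itself: the function name passes_filter is hypothetical, the "text" and "lang" fields are assumed to follow the Twitter v1.1 tweet JSON layout, and matching whole tokens is only an approximation of how the API applies the track terms.

```python
import json
import re

# Filter parameters as listed above.
TRACK = ["a", "e", "i", "o", "u"]
LANGUAGES = ["it"]

# Assumption: simple word-character tokenisation, not Twitter's own tokeniser.
TOKEN_RE = re.compile(r"\w+")


def passes_filter(raw_line: str) -> bool:
    """Return True if a JSON-encoded tweet satisfies both filter parameters."""
    tweet = json.loads(raw_line)
    # Language filter: the tweet's language tag must be in LANGUAGES.
    if tweet.get("lang") not in LANGUAGES:
        return False
    # Track filter: at least one track term must occur as a whole token.
    tokens = set(TOKEN_RE.findall(tweet.get("text", "").lower()))
    return any(term in tokens for term in TRACK)
```

Because "a", "e", "i" and "o" are all very common Italian function words, a check of this kind accepts most Italian tweets while staying independent of any topic-specific vocabulary.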

In September 2018, the collection comprised more than 500 million tweets in the Italian language, spanning 7 years (57 months) from February 2012 to July 2018.

If you use TWITA, please cite the following paper describing the resource:

@inproceedings{DBLP:conf/clic-it/BasileLS18,
  author    = {Valerio Basile and
               Mirko Lai and
               Manuela Sanguinetti},
  title     = {Long-term Social Media Data Collection at the University of Turin},
  booktitle = {Proceedings of the Fifth Italian Conference on Computational Linguistics
               (CLiC-it 2018), Torino, Italy, December 10-12, 2018.},
  year      = {2018},
  crossref  = {DBLP:conf/clic-it/2018},
  url       = {http://ceur-ws.org/Vol-2253/paper48.pdf},
  timestamp = {Mon, 17 Dec 2018 17:18:40 +0100},
  biburl    = {https://dblp.org/rec/bib/conf/clic-it/BasileLS18},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}