TWITA is a collection of tweets identified as being written in the Italian language.
In its first version, containing data from 2012 to 2015, the collection of
tweets was harvested using a two-pass language identification procedure
aimed at general Italian. We used cURL to download tweets from the
Twitter Streaming API, searching for a list of representative words:
vita Roma forza alla quanto amore Milano Italia fare grazie della anche periodo bene scuola dopo tutto ancora tutti fatto

The list consists of the most frequent lemmas in the ItWaC corpus; all words that are also frequent in other languages (English, Spanish and Portuguese, for the most part) were filtered out (e.g. come). As a second step, the tweets were passed to the language identification software langid.py to identify those written in Italian.
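The two-pass procedure described above can be sketched in Python. This is only an illustration, not the code used to build TWITA: the function names `first_pass` and `second_pass` are hypothetical, the keyword match is done locally here (in the original setup it was performed by the Twitter Streaming API), and the language classifier is passed in as a function so that the sketch runs without any third-party package.

```python
import re

# Keyword list quoted in the text above, lowercased for matching.
KEYWORDS = {
    "vita", "roma", "forza", "alla", "quanto", "amore", "milano", "italia",
    "fare", "grazie", "della", "anche", "periodo", "bene", "scuola", "dopo",
    "tutto", "ancora", "tutti", "fatto",
}


def first_pass(tweets, keywords=KEYWORDS):
    """Keep tweets that contain at least one keyword as a whole word.

    Assumption: this is a local stand-in for the server-side keyword
    search of the streaming API.
    """
    kept = []
    for text in tweets:
        tokens = set(re.findall(r"\w+", text.lower()))
        if tokens & keywords:
            kept.append(text)
    return kept


def second_pass(tweets, classify, target="it"):
    """Keep tweets whose predicted language code equals `target`.

    `classify` is any function mapping a string to a (language, score)
    pair; only the language code is used here.
    """
    return [text for text in tweets if classify(text)[0] == target]


def toy_classifier(text):
    """Toy classifier for demonstration only; not a real language model."""
    return ("it", 1.0) if "grazie" in text.lower() else ("en", 1.0)


sample = [
    "Grazie a tutti per la bella serata",
    "Thanks to all of you",
    "Forza team, let's go",
]
candidates = first_pass(sample)
italian = second_pass(candidates, toy_classifier)
```

With langid.py installed, `langid.classify`, which returns a (language, score) tuple, could be passed as `classify` in place of the toy classifier. The third sample tweet shows why the second pass is needed: it matches a keyword but is not written in Italian.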
track=["a","e","i","o","u"] languages=["it"]

In September 2018, the collection comprised more than 500 million tweets in the Italian language, spanning 7 years (57 months) from February 2012 to July 2018.
If you use TWITA, please cite the following paper describing the resource:
@inproceedings{DBLP:conf/clic-it/BasileLS18,
  author    = {Valerio Basile and Mirko Lai and Manuela Sanguinetti},
  title     = {Long-term Social Media Data Collection at the University of Turin},
  booktitle = {Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Torino, Italy, December 10-12, 2018.},
  year      = {2018},
  crossref  = {DBLP:conf/clic-it/2018},
  url       = {http://ceur-ws.org/Vol-2253/paper48.pdf},
  timestamp = {Mon, 17 Dec 2018 17:18:40 +0100},
  biburl    = {https://dblp.org/rec/bib/conf/clic-it/BasileLS18},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}