FROM PEC TO PEC24: A NEW REFERENCE CORPUS FOR ITALIAN
DOI: https://doi.org/10.54103/2037-3597/29101
Abstract
This article introduces the PEC24, an extension of the Perugia Corpus, as a new reference corpus for Italian. The update mainly concerns the size of the corpus, which now comprises approximately 47 million tokens following the addition of over 100,000 texts. The PEC24 retains the structure of its predecessor: 10 sections, each representing a different written or spoken genre. After reviewing the spoken, written, and web corpora available for Italian, the article describes the internal composition of each section of the corpus and explains how the corpus was annotated. Because the PEC24 is available and searchable online, the article also illustrates how it can be queried. The PEC24 thus represents a significant advance in the panorama of Italian corpora, offering a representative and more comprehensive resource for linguistic research and corpus-based studies.
Italian title: Dal PEC al PEC24: un nuovo corpus di riferimento per l’italiano
License
Copyright (c) 2025 Stefania Spina, Fabio Zanda, Irene Fioravanti

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 license.


