Factsheet

Resource reviewed
Title “La Repubblica” Corpus
Editors

Bibliographical description of responsible personnel. Please indicate names as “forename surname”.
Marco Baroni, Silvia Bernardini, Sara Castagnoli, Federica Comastri, Lorenzo Piccioni, Alessandra Volpi, Guy Aston, Marco Mazzoleni, Eros Zanchetta
URI https://corpora.dipintra.it/public/run.cgi/first?corpname=repubblica
Publication Date

Format should be either yyyy, yyyyff. or yyyy-yyyy, e.g. 2007 or 2007-2013
2004ff.
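The three date shapes named in the format hint can be checked mechanically. The following is a minimal sketch, not part of the factsheet form itself: the function name `parse_publication_date` and the returned tuple layout are assumptions made for illustration only.

```python
import re

# Hypothetical validator for the three shapes named in the hint:
# yyyy, yyyyff. (open-ended) and yyyy-yyyy (closed range).
DATE_PATTERN = re.compile(r"^(\d{4})(?:(ff\.)|-(\d{4}))?$")


def parse_publication_date(value):
    """Return (start_year, end_year, open_ended), or None if malformed."""
    match = DATE_PATTERN.match(value.strip())
    if match is None:
        return None
    start = int(match.group(1))
    if match.group(2):
        # "2004ff." means: from 2004 onwards, no end year given.
        return (start, None, True)
    if match.group(3):
        # "2007-2013" is a closed range; reject ranges that run backwards.
        end = int(match.group(3))
        return (start, end, False) if end >= start else None
    # A single year such as "2007".
    return (start, start, False)
```

Under these assumptions, the value recorded above, `2004ff.`, would be read as an open-ended publication period starting in 2004.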
Date of last access 01.08.2017
Reviewer
Surname Sierig
First Name Rebecca
Organization University of Leipzig
Place Leipzig
Email rebecca.sierig (at) uni-leipzig.de
Personnel
Contributors Marco Baroni
Silvia Bernardini
Sara Castagnoli
Federica Comastri
Lorenzo Piccioni
Alessandra Volpi
Guy Aston
Marco Mazzoleni
Eros Zanchetta
General Information
Bibliographic description
Can the text collection be identified in terms similar to traditional bibliographic descriptions (title, responsible editors, institution, date(s) of publication, identifier/address)?
yes
Contributors
Are the contributors (editors, institutions, associates) of the project documented?
yes
Contacts
Is contact information given?
yes
Aims
Documentation
Is there a description of the aims and contents of the text collection?
yes
Purpose

We assume that the text collection is published to be used in a certain context and that the intended usage scenario can be inferred (if it is not described), so if the purpose of the text collection is not stated explicitly, please answer according to your own impression.
What is the purpose of the text collection?
Research, Teaching
Kind of research

In many cases both qualitative and quantitative research will be possible. Please choose the methodological orientation that prevails from your point of view.
What kind of research does the collection primarily allow?
Qualitative research
Self-classification
How does the text collection classify itself (e.g. in its title or documentation)?
Corpus
Field of research
To which field(s) of research does the text collection contribute?
Linguistics, other: Translation Studies
Content
Era

Classics: before 500 CE. Medieval: 501 CE until 1500 CE. Early Modern: 1501 CE until 1800 CE. Modern: 1801 CE until 1945. Contemporary: 1945 until today.
What era(s) do the texts belong to?
Contemporary
Language

Please choose the language name(s) that correspond(s) best to the language(s) of the texts, e.g. for Old English choose English; for Mexican Spanish choose Spanish. Choose ‘other’ if none of the given language names matches the language(s) in question. If you wish to specify the language(s) further, you can give an additional explanation in the ‘note’ field.
What languages are the texts in?
Italian
Types of text

Literary works: artistic creations like poems, novels, plays, transmitted in one or several documents. Private documents: personal or family documents or papers, e.g. letters and diaries. Essays: formal or informal, argumentative or reflective writings. Newspaper/journal articles: articles discussing recent news of general interest or a specific topic, published in a newspaper or journal. Charters: texts documenting a legal fact by using a special form supporting its validity, e.g. the Magna Carta. Inscriptions: texts on stone, metal, wood, etc., fixed through carving, engraving or embossing. Files/records: classified collections of documents, e.g. business and trial records, personnel and administration files. Protocols: written records of meetings. Scientific papers: (previously published) scientific literature, e.g. articles, monographs or reference works. Speech transcripts: written records of speech, transcribed orthographically or phonetically, for example.
What kind of texts are in the collection?
Newspaper/journal articles
Additional information

Information that is derived from the texts (e.g. analytical data or visualizations) is not considered here. Commentary: (scholarly) commentary on the content of the documents or regarding other textual phenomena. Context material: additional (textual) sources which put the texts in the collection into context but are not considered part of the collection itself. Facsimiles: any copy of historical documents.
What kind of information is published in addition to the texts?
Context material, other
Composition
Documentation
Are the principles and decisions regarding the design of the text collection, its composition and the selection of texts documented?
yes
Selection

Selection criteria: factors that guided the selection of texts for the collection and the composition of the text collection. Language: one or several standard languages or other language varieties (including for example sociolects). Author: one or several persons or figures that are supposed to have authored the texts. Country: one or several countries or geographical regions. Epoch: one or several (historical) epochs or other kinds of time periods. Genre: literary (e.g. novella, fable) or non-literary genres/text types (letter, statute, recipe, chat log, etc.). Topic: one or several thematic aspects of the texts. Style: one or several writing styles characteristic of authors, periods, schools, text types, etc. (e.g. romantic or realist novels, satirical texts, Victorian style). Linguistic characteristics: language properties other than style (e.g. the presence of certain grammatical constructions).
What selection criteria have been chosen for the text collection?
Language, Epoch, Genre
Size

If the size of the text collection is not given explicitly but can be inferred, please choose appropriate numbers, otherwise choose ‘unknown’.
Texts/records
How large is the text collection in number of texts/records?
> 1000
Tokens

Tokens: Sequences of strings delimited by whitespace or punctuation and roughly corresponding to words.
How large is the text collection in number of tokens?
> 10 Mio.
Structure
Does the text collection have identifiable sub-collections or components?
yes
Data acquisition and integration
Text recording
Does the text collection record or transcribe the textual data for the first time?
yes
Text integration
What kind of material has been taken over from other sources?
Full texts, Metadata
Quality assurance

Choose ‘yes’ if there has been a quality check for which results are reported, regardless of whether corrections have been made or not.
Has the quality of the data (transcriptions, metadata, annotations, etc.) been checked?
yes
Typology

General purpose collection: a text collection of a very general nature (e.g. Wikisource, Project Gutenberg); often created in a collaborative fashion; with no specific or very loose selection criteria; usually not bound to a certain time frame for its creation and completion. Corpus: a collection of texts that has been created according to some selection criteria (language, author, country, epoch, genre, topic, style, etc.) which makes it more specific than a general purpose collection; not necessarily aiming at completeness or representativeness; e.g. the ‘Corpus of English Religious Prose’, ‘Letters of 1916’, ‘Corpus of Literary Modernism’. Collection of records: a collection of texts that are held together for organisational reasons, e.g. a collection of historical documents that has been kept in the same archive. Canon: collection of works that is considered most important for a certain period, culture or discipline (e.g. the biblical canon, the canon of English 19th century literature); might be formally approved or authoritative and subject to debate and revision. Complete works/œuvre: collection of all works by a single author (e.g. complete works of Mark Twain). Reference corpus: collection of texts that have been selected in order to be representative of a certain genre or language (e.g. reference corpus of New High German Language). Contrastive corpus: a collection of texts aiming at the systematic comparison of its sub-components, to get to a description of differences and similarities between them (e.g. FinDe, a contrastive corpus of Finnish and German). Parallel corpus: a collection of texts which are contrasted with other versions, often translations (e.g. the Parallel Bible Corpus). A parallel corpus can be considered a certain kind of contrastive corpus. Diachronic corpus: a collection of texts that have been selected in order to reflect evolution over time (e.g. the Diachronic Corpus of Present-Day Spoken English (c. 1960-1980)).
Considering aims and methods of the text collection, how would you classify it further? For definitions please consider the help-texts.
Corpus
Data Modelling
Text treatment

Normalized transcription: if the orthography has been normalized according to a chosen standard (e.g. ’seyn’ to ’sein’). Orthographic transcription: a transcription that employs the standard spelling system of each target language (e.g. the surname “Pushkin” in English orthographic transcriptions of the Russian surname “Пу́шкин”). Phonetic/phonemic transcription: a transcription that is the visual representation of speech sounds or phones (e.g. [ˈpuʂkʲɪn]) or a phonemic transcription (e.g. /ˈpʊʃkɪn/). Diplomatic transcription: a transcription of the document taking into account features like spelling, punctuation, abbreviations, deletions, insertions, alterations, etc. Transliteration: A conversion of a text from one script to another (e.g. “Russia” in Cyrillic script, “Россия”, is transliterated as “Rossiya” in Latin script). Edited text: A reading text as constituted by the editor(s), based on text-critical procedures like recensio, examinatio, emendatio, correction, normalization, modernization etc. Translated text: Any translations into languages different from that of the original text. Summarized text: A summary of the source text. Sampled text transcriptions: parts of texts that have been selected and transcribed to represent whole texts (e.g. out of theoretical considerations or for statistical reasons).
How are the textual sources represented in the digital collection?
Normalized transcription
Basic format

Plain text: a pure sequence of character codes supported by the underlying standard (ASCII, Unicode). XML: Extensible Markup Language, a general markup language that defines a set of rules for encoding documents. HTML: Hypertext Markup Language, a standard markup language for web pages.
In which basic format are the texts encoded?
XML
Annotations
Annotation type

Semantic annotations: e.g. key words, links to (controlled) vocabularies, norm data. Linguistic annotations: additional information about linguistic characteristics of the texts, e.g. lemmata or PoS-tags. Editorial annotations: e.g. editorial comments and/or text-critical components such as the apparatus criticus. Structural information: e.g. markup to capture the textual structure (e.g. headings, chapters) and layout information (e.g. paragraphs, indents).
With what information are the texts further enriched?
Semantic annotations, Linguistic annotations, Structural information
Annotation integration

Please choose ‘not applicable’ if there are no annotations.
How are the annotations linked to the texts themselves?
not applicable
Metadata
Metadata type

Descriptive: to describe and identify a resource, e.g. unique identifier, physical, bibliographic and content related attributes (such as medium, dimensions, author, title, publication year, genre, topic). Structural: information about the internal structure of a resource (such as parts, volumes, chapters, sections, pages). Administrative: for example technical details, access rights, history of changes.
What kind of metadata are included in the text collection?
Descriptive, Structural
Metadata level
On which level are the metadata included?
Whole collection, Individual texts
Data schemas and standards
Schemas

General standardized schema: TEI All, TEI Lite, TCF, EAD, etc. Customized standard schema: a project specific customization of a standardized schema, e.g. a certain RDFS(chema) or the DTABf. Project specific schema: a schema that does not conform to any standard vocabulary, e.g. a custom XML dialect.
What kind of data/metadata/annotation schemas are used for the text collection?
unknown
Standards

TEI: Text Encoding Initiative, cf. http://www.tei-c.org CEI: Charters Encoding Initiative, cf. https://www.cei.lmu.de EAD: Encoded Archival Description, cf. https://www.loc.gov/ead/ (X)CES: Corpus Encoding Standard (in XML), cf. https://www.cs.vassar.edu/CES/ and http://www.xces.org/ Dublin Core: a set of vocabulary terms for the description of web resources; cf. http://dublincore.org/ EDM: Europeana Data Model, cf. http://pro.europeana.eu/page/edm-documentation METS: Metadata Encoding and Transmission Standard, cf. http://www.loc.gov/standards/mets/ MODS: Metadata Object Description Schema, cf. www.loc.gov/mods/ SKOS: Simple Knowledge Organization System, cf. https://www.w3.org/2004/02/skos/ OWL: Web Ontology Language, cf. https://www.w3.org/OWL/ IMDI: Isle Metadata Initiative, cf. https://tla.mpi.nl/imdi-metadata/ CMDI: Component Metadata Infrastructure, cf. https://www.clarin.eu/content/component-metadata TCF: Text Corpus Format, cf. https://weblicht.sfs.uni-tuebingen.de/weblichtwiki/index.php/The_TCF_Format OLAC: Open Language Archive Metadata, cf. http://www.language-archives.org/OLAC/metadata-20080531.html EAGLES: Guidelines of the Expert Advisory Group on Language Engineering Standards, cf. http://www.ilc.cnr.it/EAGLES/browse.html standardized PoS tagset(s): Part-of-Speech tagsets that have been standardized, for example the ‘Part-of-Speech Tagging Guidelines for the Penn Treebank Project’.
Which standards for text encoding, metadata and annotation are used in the text collection?
TEI, EAGLES
Provision
Accessibility of the basic data
Is the textual data accessible in a source format (e.g. XML, TXT)?
no
Download
Can the entire raw data of the project be downloaded (as a whole)?
no
Technical interfaces

OAI-PMH: Protocol for Metadata Harvesting; a protocol for harvesting metadata descriptions of items in a collection. REST: Representational State Transfer; a paradigm for the architecture of so-called RESTful web services. SPARQL endpoint: a SPARQL Protocol and RDF Query Language endpoint to retrieve data stored in the RDF format. General API: an Application Programming Interface other than OAI-PMH or REST.
Are there technical interfaces which allow the reuse of the data of the text collection in other contexts?
none
Analytical data
Besides the textual data, does the project provide analytical data (e.g. statistics) to download or harvest?
yes
Reuse
Can you use the data with other tools useful for this kind of content?
yes
User Interface
Interface provision

For example, a website created for the presentation of the texts or software developed for the display and usage of the text collection in question is considered a dedicated user interface, while a general repository (e.g. a library publication server), versioning platform (e.g. GitHub) or archive (e.g. Zenodo) is not.
Does the text collection have a dedicated user interface designed for the collection at hand in which the texts of the collection are represented and/or in which the data is analyzable?
yes
User Interface questions
Usability
From your point of view, is the interface of the text collection clearly arranged and easy to navigate so that the user can quickly identify the purpose, the content and the main access methods of the resource?
no
Access modes
Browsing
Does the project offer the possibility to browse the contents by simple browsing options or advanced structured access via indices (e.g. by author, year, genre)?
yes
Fulltext search
Does the project offer a fulltext search?
no
Advanced search

Does the project offer an advanced search?
yes
Analysis
Tools
Does the text collection integrate tools for analyses of the data?
yes
Customization
Can the user alter the interface in order to affect the outcomes of representation and analysis of the text collection (besides basic search functionalities), e.g. by applying his or her own queries or by choosing analysis parameters?
yes
Visualization
Does the text collection provide particular visualizations of the data?
Charts
Personalization
Is there a personalisation mode that enables the users e.g. to create their own sub-collections of the existing text collection?
yes
Preservation
Documentation
Does the text collection provide sufficient documentation about the project in general as well as about the aims, contents and methods of the text collection?
no
Open Access

Are the contents of the presentation freely accessible without subscription fee?
Is the text collection Open Access?
no
Rights
Declared
Are the rights to (re)use the content declared?
yes
License

CC0: Creative Commons license CC0 applied. CC-BY: Creative Commons license CC-BY applied. CC-BY-ND: Creative Commons license CC-BY-ND applied. CC-BY-NC: Creative Commons license CC-BY-NC applied. CC-BY-SA: Creative Commons license CC-BY-SA applied. CC-BY-NC-ND: Creative Commons license CC-BY-NC-ND applied. CC-BY-NC-SA: Creative Commons license CC-BY-NC-SA applied. PDM: Work is in the Public Domain.
Under what license are the contents released?
other
Persistent identification and addressing

DOI: Digital Object Identifier according to the definition of The International DOI Foundation. The DOIs should be resolvable through http://doi.org/. ARK: Archival Resource Key according to the definition of the California Digital Library. An ARK URL contains the label: ‘ark’ after the URL’s hostname. URN: Uniform Resource Name using the urn: scheme. URNs always start with the label ’urn:’. PURL.ORG: Persistent Uniform Resource Locator using the PURL concept and administered by the Online Computer Library Centre. other service: Choose this if an external service other than the above options is used. Persistent URLs: Choose this if the project promises permanent URLs or uses a local resolving system between URLs and underlying technical addresses but does not use any of the external services mentioned in the options. none: Choose this if no persistent identifiers and addressing system are used at all.
Are there persistent identifiers and an addressing system for the text collection and/or parts/objects of it and which mechanism is used to that end?
Persistent URLs
Citation
Does the text collection supply citation guidelines?
yes
Archiving of the data

Choose yes if you have reason to believe that the archiving and long term sustainability of the data is cared for (e.g. because the data is part of a platform that cares for these aspects), even if the documentation makes no explicit statement about it.
Does the documentation include information about the long term sustainability of the basic data (archiving of the data)?
no
Institutional curation

Select yes, if there is either an explicit claim that continuous maintenance for the project is provided by some institution or you have strong reason to believe that this is the case, even if it is not explicitly claimed, otherwise select no.
Does the project provide information about institutional support for the curation and sustainability of the project?
no
Completion

Choose ‘yes’ if you consider the collection complete. Choose ‘no’ if further additions and modifications are promised for the text collection to be completed.
Is the text collection completed?
yes