Corpus

The DISCOWER corpus describes 2136 tokens, i.e. abstracts in three layers (abstract texts, basic abstracts and elaborated abstracts; see Constructions for details). The data can be accessed through a dedicated browser. Abstract texts can also be downloaded directly.

The browser allows the user to search and filter tokens by selected data values: technical data (documenting e.g., their origin) and attribute-related data (implementing our theoretical underpinnings; see Foundations for details). It also allows the user to access and download .txt files with the exemplars of abstract texts.

Our database can also be downloaded directly as one of two .xml spreadsheets: a core data spreadsheet, or an expanded data spreadsheet with extra columns that give further details on or feed into the core data. You can also download the .txt files directly in tagged or untagged versions (.zip, UTF-8).

Please note that, since we study abstracts as multi-mode constructions (see Constructions), i.e. we recognize such features as spaces or fontface, the plain .txt files are only their approximations, intended to be helpful for linguistic analysis.

More on the text files

For ease of processing with corpus tools, the .txt files reduce an abstract text to plain-text format. The files, in UTF-8 encoding, formatted without newline characters (i.e. no line breaks), do not preserve information on the original font and fontface, line length, word divisions, line and paragraph spacing, colour, etc. Instead, selected features of the original form are documented by the attribute data pertaining to each abstract token (see Attributes), and a selected few which could be documented in the text files without making them difficult to read by humans or use with corpus tools are also marked in the files with the following tags: [IND], [LAB], [PAR], [LIST] and [NOTE]. The tags are excluded from the word count provided by the online browser.
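As a rough illustration of how the tags can be removed before further processing, here is a minimal Python sketch. It assumes the tags appear exactly as written above, and that [NOTE] is preceded by an added space (as described in the [NOTE] section below). The function names and the sample sentence are hypothetical. The count is a simple whitespace-based one, so it may differ from the Number of words value, which is generated with spaCy.

```python
import re

# [NOTE] is removed together with the space added before it,
# so that following punctuation re-attaches to the preceding word.
NOTE_TAG = re.compile(r"\s*\[NOTE\]")
# The remaining layout tags are replaced with a space.
OTHER_TAGS = re.compile(r"\[(?:IND|LAB|PAR|LIST)\]")


def strip_tags(text: str) -> str:
    """Return the text without corpus tags, with whitespace collapsed."""
    text = NOTE_TAG.sub("", text)
    text = OTHER_TAGS.sub(" ", text)
    return " ".join(text.split())


def count_words(text: str) -> int:
    """Count whitespace-separated words, ignoring the tags."""
    return len(strip_tags(text).split())


# Hypothetical sample, not taken from the corpus.
sample = "[LAB] [IND] This paper examines abstracts [NOTE]. [PAR] It uses a corpus."
print(strip_tags(sample))
print(count_words(sample))
```

If you downloaded the untagged version of the files, this step is not needed.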

Detailed descriptions of the tags

The indentation tag [IND]

The indentation tag marks the indentation of the first paragraph of the abstract text, and the first paragraph only. Other indentations, which serve to mark further paragraphs in multi-paragraph abstract texts, or to mark itemized lists, use [PAR] and [LIST] tags respectively.

The label tag [LAB]

The label tag marks the presence of a run-in abstract label, i.e. the word 'abstract' or 'summary' that appears immediately before the beginning of the abstract text's first line. These labels themselves are not part of the abstract text, but they cause it to develop a first-line indentation of a kind; thus, their presence is marked with the [LAB] tag. Occasionally a paragraph opening with such a label is also indented in the usual way (with white space), which results in two tags, [LAB] [IND], side by side.

The paragraph tag [PAR]

The paragraph tag marks paragraph divisions, regardless of the way in which the original signalled them: with a new line, a new line plus indentation, paragraph spacing, or a (partially) empty line at the end of a paragraph. The [PAR] tags appear at the division between consecutive paragraphs; for instance, a two-paragraph abstract text will feature one [PAR] tag, before the first word of its second paragraph.
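Since the text files contain no line breaks, the [PAR] tag is the only way to recover paragraph divisions. The sketch below shows one possible way to do this in Python. The function name and the sample text are hypothetical, and other tags (such as [IND] or [LIST]) are left untouched inside the resulting paragraphs.

```python
def split_paragraphs(text: str) -> list[str]:
    """Split a tagged abstract text into paragraphs at each [PAR] tag."""
    return [part.strip() for part in text.split("[PAR]") if part.strip()]


# Hypothetical sample, not taken from the corpus.
sample = "First paragraph of the abstract. [PAR] Second paragraph."
paragraphs = split_paragraphs(sample)

# A two-paragraph text contains exactly one [PAR] tag.
print(len(paragraphs), sample.count("[PAR]"))
```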

The itemized list tag [LIST]

The itemized list tag is used in those cases where the abstract text features a list (bulleted, numbered or otherwise) with each item formatted as a separate line. As the text file does not contain line breaks or bullet-point characters, the tag marks the start of each item that forms such a list. Although such listed items share some features with paragraphs, they were not included in the total paragraph count for each abstract text. Some abstract texts in the corpus feature lists set in running text, i.e. without line divisions at each item, as in a) item one, or 1. item one, or (i) item one, etc. The [LIST] tag is not used for such lists, as itemization labels such as a) item one, 1. item one, (i) item one, etc. are preserved in the files.

The footnote tag [NOTE]

The footnote tag [NOTE] was used to replace superscripted numbers or letters appearing in the abstract text, including right after its final full-stop. While the superscripted numbers/letters were usually not preceded by spaces, such spaces were added before the [NOTE] tag, but not after it, so that the tag can be immediately followed by punctuation, as in [NOTE]. The superscripted characters themselves, whether numbers, letters, or asterisks, are not included in the text files.

Corpus data

The data used in the online browser and the spreadsheets document the linguistic and paralinguistic features of each basic and elaborated abstract. Data referring to a particular layer (the basic abstract or the elaborated abstract) are marked accordingly.

Both the online browser and the spreadsheets also include some technical data, pertaining to the organization of the corpus (e.g., file name) and each abstract's origin (e.g. discipline, journal, or year), allowing the user to identify and filter particular tokens.

To learn about particular data:

when using the browser to filter the tokens, hover the mouse cursor over the question marks next to the names of data filters to display a tooltip: a short reminder of what each term represents, referring to typical cases.
read the Attributes page to find out about crucial data and concepts linking to the foundational commitments of our project, and examples of typical cases from our corpus.
read the Constructions page to learn about the Composition data and the conventions used to document the text type-based parts and their relationships within elaborated and basic abstracts.
use the list of all data below, featuring their format (i.e. type of values) and details of their usage.

Data list

The order of the data listed below corresponds to their order in the spreadsheets.

Data names in ALL-CAPS identify attributes.

Data names in boldface mean particular data can be used to filter the online database.

Data names in italicised font denote data only contained in the expanded spreadsheet.

Unless indicated otherwise, the word "abstract" refers collectively to the three abstract strata.

Technical data

Data	Format	About
Row	numbers	Row number in the spreadsheet, to facilitate checking whether the data have been sorted and, if necessary, returning to the default token order.
File name	text	Unique file names identifying corpus tokens. The names start with a journal prefix (a letter code identical for all tokens originating in the same journal).
Journal title	text	Titles of the journals where particular abstracts were originally published. Where the journals have alternative Polish- and English-language titles, or changed their titles during the period of our interest, we use a single form of the title throughout.
Discipline	law / linguistics / literary studies	The three disciplines under study: law, linguistics and literary studies. According to the official Ministry list issued in 2019, in effect when we started the project, all the journals represented single disciplines (rather than combinations of several disciplines), although in some cases this classification was retroactively changed in 2021. We reacted to the changes inclusively: we retroactively accepted newly eligible journals into the corpus, and where the changed list reassigned journals already in our database to new disciplines, so that they were no longer single-discipline journals as per our initial criteria, we kept the data in the corpus.
Journal link (current and past issues)	text	A link pointing to a list of volumes, issue archive, or whichever part of the journal website makes it easiest to find the past and present issues, which may be useful if links to individual articles change. It also allows the user to easily go back to see the articles' wider context, such as the whole issue. Please note that, with time, as the journal websites moved or were restructured, we updated the links, so that the present links do not necessarily reflect the original context as we found it when compiling the corpus.
*Ministry score 2021*	number	The 2021 score awarded to each journal by the Polish Ministry of Science and Higher Education.
Authors	text	The name(s) of the author(s) of the article to which a given abstract referred; presumably the author(s) of the abstract itself. The names appear in the order and form in which we found them in the journals.
More than one author?	yes / no	Whether the article has more than one author.
Number of countries	text	The number of countries represented by the author(s) according to their affiliation. A single author could have multiple affiliations, which is reflected by the number.
Countries of affiliation	text	The country or countries of affiliation of the article author(s). We only state the affiliation once, however many authors were affiliated in a given country; for instance, an article written by two authors affiliated with Polish institutions and one with a German one is described as "Poland, Germany" rather than "Poland, Poland, Germany". For the sake of conciseness, we use abbreviations in some cases (UK, USA), and shorter versions of official country names in others (e.g. South Africa rather than the Republic of South Africa, Taiwan rather than the Republic of China).
Article title	text	The title of the article to which the abstract refers.
Issue	text	The issue of a given journal in which the abstract appears. As journals use different conventions in this regard, the issue numbers are not entirely consistent in terms of their form, which has been chosen in each case to help the user identify an issue on the journal's website.
Year	2018 / 2019 / 2020 / 2021	The year of the issue in which a given article appears, as featured on the journal's cover or in its title on the website (not necessarily the actual date of its online publication). The corpus comprises tokens dating from 2018 to September 30, 2021.
Link 1: website context	text	A link to the article on the journal's website. Wherever possible, the link connects to an individual abstract text view on the journal's website, from which the PDF can be accessed. Please note that such views are not available on all journal websites, and that with time, as the journal websites moved or were restructured, we also updated the links, so that the present links do not necessarily reflect the original context we found when compiling the corpus.
Link 2: pdf file	text	A link to the article in PDF format. Most PDF links refer to individual articles, but in some cases only full-issue PDFs are available. If a link could be found that enabled an embedded PDF to be viewed on a journal's website, this option was chosen as more convenient than providing a direct link that would download it. Please note that, with time, as the journal websites moved or were restructured, we updated the links, so that the present links do not necessarily reflect the original context as we found it when compiling the corpus.
*Number of languages*	number	The number of language versions in which the abstract text is available in the same PDF file.
Languages	text	The languages of the abstract texts accompanying a given article. The order in which the languages are listed corresponds to the order of appearance. If the versions are in close vicinity, the language names are joined with a plus, "+"; otherwise, where one language version appears at the top of the document and another is at its end, they are separated by a comma. For instance, "Polish, English" means the abstract text appears in two language versions, first in Polish, then in English, but at a distance from each other.
Number of words	number	The number of graphically distinct words in the abstract text. Automatically generated with spaCy. Does not include the tags.
Location	above / below	The abstract text's position relative to the article it concerns in the PDF file.
Label	abstract / summary / none	The presence of the "abstract" or "summary" label in the PDF file.
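The convention used in the Languages field (a plus for versions in close vicinity, a comma for versions at a distance) can be unpacked programmatically. The Python sketch below is only an illustration: the function name is hypothetical, the exact spacing around "+" in the spreadsheets is an assumption (the code tolerates both variants), and the example values other than "Polish, English" are made up.

```python
def parse_languages(value: str) -> list[list[str]]:
    """Parse a Languages value into groups of adjacent language versions.

    Commas separate versions placed at a distance from each other;
    a plus joins versions that appear in close vicinity.
    """
    groups = []
    for chunk in value.split(","):
        names = [name.strip() for name in chunk.split("+") if name.strip()]
        if names:
            groups.append(names)
    return groups


# Two versions at a distance: two groups of one language each.
print(parse_languages("Polish, English"))
# Two versions in close vicinity: one group of two languages.
print(parse_languages("Polish + English"))
```

The total number of inner items corresponds to the number of language versions listed in the field, and the order of the groups follows the order of appearance in the PDF file.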

Attribute-related data

Name	Format	About
DISTINCT?	yes / no	See Attributes.
No objects (linguistic or paralinguistic) above?	yes / no	Feeds into "DISTINCT".
No objects (linguistic or paralinguistic) below?	yes / no	Feeds into "DISTINCT".
SELF-CONTAINED?	yes / no	See Attributes.
CLOSED?	yes / no	See Attributes.
Linguistic or paralinguistic objects in upper-left corner?	yes / no	Feeds into "CLOSED".
Linguistic or paralinguistic objects in lower-right corner?	yes / no	Feeds into "CLOSED".
SIMPLE?	yes / no	See Attributes.
Number of parts	yes / no	Feeds into "SIMPLE".
Composition	text	A list of text type-based parts forming the basic or elaborated abstract, ordered to reflect their position on the page, following the direction in which we scanned the page (generally top to bottom and left to right). The symbols between part names indicate the direction of scanning, the absence (single symbols) or presence (double symbols) of extra white space, and the absence (straight lines) or presence (curly lines) of paralinguistic objects at part borders relevant to the direction of scanning.
SYNCHRONOUS?	yes / no	See Attributes.
CONTINUOUS?	yes / no	See Attributes.
COMPACT?	yes / no	See Attributes.
No extra white space at borders between parts?	yes / no	Feeds into "COMPACT".
No paralinguistic objects at borders between parts?	yes / no	Feeds into "COMPACT".
HOMOGENEOUS?	yes / no	See Attributes.
All parts use the same font colour (e.g., grey vs black)?	yes / no	Feeds into "HOMOGENEOUS".
All parts use the same text background (e.g., grey vs white)?	yes / no	Feeds into "HOMOGENEOUS".
All parts use a similar font size?	yes / no	Feeds into "HOMOGENEOUS".
NEUTRAL?	yes / no	See Attributes.
All parts use sentence case rather than all-caps?	yes / no	Feeds into "NEUTRAL".
All parts use regular rather than bold fonts?	yes / no	Feeds into "NEUTRAL".
All parts use roman rather than italic fonts?	yes / no	Feeds into "NEUTRAL".
All parts use standard spacing rather than spaced-out characters?	yes / no	Feeds into "NEUTRAL".

More information is available upon request: contact D. Guttfeld (contact details in the footer).