21 datasets found
  1. Data and tools for studying isograms

    • figshare.com
    Updated Jul 31, 2017
    Cite
    Florian Breit (2017). Data and tools for studying isograms [Dataset]. http://doi.org/10.6084/m9.figshare.5245810.v1
    Explore at:
    Available download formats: application/x-sqlite3
    Dataset updated
    Jul 31, 2017
    Dataset provided by
    figshare
    Authors
    Florian Breit
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A collection of datasets and Python scripts for the extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.

    1. Datasets

    The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.

    1.1 CSV format

    The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" file contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name.

    The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure; see the section below):

    Label Data type Description

    isogramy int The order of isogramy, e.g. "2" is a second order isogram

    length int The length of the word in letters

    word text The actual word/isogram in ASCII

    source_pos text The Part of Speech tag from the original corpus

    count int Token count (total number of occurrences)

    vol_count int Volume count (number of different sources which contain the word)

    count_per_million int Token count per million words

    vol_count_as_percent int Volume count as percentage of the total number of volumes

    is_palindrome bool Whether the word is a palindrome (1) or not (0)

    is_tautonym bool Whether the word is a tautonym (1) or not (0)
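
    For illustration, such a file can be loaded with pandas (a minimal sketch; the filename matches the one used in section 2.4 below, and the column names follow the table above):

        import pandas as pd

        # Column layout documented above; the CSV files carry no header row
        # and use a single tab stop as separator.
        columns = ["isogramy", "length", "word", "source_pos", "count",
                   "vol_count", "count_per_million", "vol_count_as_percent",
                   "is_palindrome", "is_tautonym"]

        df = pd.read_csv("ngrams-isograms.csv", sep="\t", header=None, names=columns)
        print(df.head())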

    The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:

    Label Data type Description

    !total_1grams int The total number of words in the corpus

    !total_volumes int The total number of volumes (individual sources) in the corpus

    !total_isograms int The total number of isograms found in the corpus (before compacting)

    !total_palindromes int How many of the isograms found are palindromes

    !total_tautonyms int How many of the isograms found are tautonyms

    The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

    1.2 SQLite database format

    The SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:

    • Compacted versions of each dataset, where identical headwords are combined into a single entry.
    • A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
    • An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.

    The intersected dataset is by far the least noisy, but is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

    2. Scripts

    There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second using SQLite 3 from the command line, and the third in R/RStudio (R version 3).

    2.1 Source data

    The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files; for BNC, the direct path to the *.gz file.

    2.2 Data preparation

    Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:

        python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
        python isograms.py --bnc --indir=INFILE --outfile=OUTFILE

    Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

    2.3 Isogram extraction

    After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:

        python isograms.py --batch --infile=INFILE --outfile=OUTFILE

    Here INFILE should refer to the output of the previous data cleaning process. Please note that the script will actually write two output files: one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

    2.4 Creating a SQLite3 database

    The output data from the above step can easily be collated into a SQLite3 database, which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:

    1. Make sure the files with the Ngrams and BNC data are named "ngrams-isograms.csv" and "bnc-isograms.csv" respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
    2. Copy the "create-database.sql" script into the same directory as the two data files.
    3. On the command line, go to the directory where the files and the SQL script are.
    4. Type: sqlite3 isograms.db
    5. This will create a database called "isograms.db".

    See section 1 for a basic description of the output data and how to work with the database.

    2.5 Statistical processing

    The repository includes an R script (R version 3) named "statistics.r" that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
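
    As an illustration, once built, the database can also be queried from Python's standard sqlite3 module (a minimal sketch; the table name used here is hypothetical, so check create-database.sql for the actual schema):

        import sqlite3

        conn = sqlite3.connect("isograms.db")

        # "ngrams_isograms" is a hypothetical table name; the columns follow
        # the layout documented in section 1.1. "count" is quoted because it
        # is also a SQL function name.
        query = """
            SELECT word, length, "count"
            FROM ngrams_isograms
            WHERE is_palindrome = 1
            ORDER BY "count" DESC
            LIMIT 10
        """
        for word, length, count in conn.execute(query):
            print(word, length, count)
        conn.close()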

  2. Data from: ERA5 hourly data on single levels from 1940 to present

    • cds.climate.copernicus.eu
    • search-sandbox-2.test.dataone.org
    • +1more
    grib
    Updated Oct 2, 2025
    + more versions
    Cite
    ECMWF (2025). ERA5 hourly data on single levels from 1940 to present [Dataset]. http://doi.org/10.24381/cds.adbb2d47
    Explore at:
    Available download formats: grib
    Dataset updated
    Oct 2, 2025
    Dataset provided by
    European Centre for Medium-Range Weather Forecasts (http://ecmwf.int/)
    Authors
    ECMWF
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 1940 - Sep 26, 2025
    Description

    ERA5 is the fifth generation ECMWF reanalysis of the global climate and weather for the past 8 decades. Data is available from 1940 onwards. ERA5 replaces the ERA-Interim reanalysis. Reanalysis combines model data with observations from across the world into a globally complete and consistent dataset using the laws of physics. This principle, called data assimilation, is based on the method used by numerical weather prediction centres, where every so many hours (12 hours at ECMWF) a previous forecast is combined with newly available observations in an optimal way to produce a new best estimate of the state of the atmosphere, called the analysis, from which an updated, improved forecast is issued. Reanalysis works in the same way, but at reduced resolution to allow for the provision of a dataset spanning back several decades. Reanalysis does not have the constraint of issuing timely forecasts, so there is more time to collect observations and, when going further back in time, to allow for the ingestion of improved versions of the original observations, all of which benefits the quality of the reanalysis product.

    ERA5 provides hourly estimates for a large number of atmospheric, ocean-wave and land-surface quantities. An uncertainty estimate is sampled by an underlying 10-member ensemble at three-hourly intervals. Ensemble mean and spread have been pre-computed for convenience. Such uncertainty estimates are closely related to the information content of the available observing system, which has evolved considerably over time. They also indicate flow-dependent sensitive areas. To facilitate many climate applications, monthly-mean averages have been pre-calculated too, though monthly means are not available for the ensemble mean and spread.

    ERA5 is updated daily with a latency of about 5 days. If serious flaws are detected in this early release (called ERA5T), the data could differ from the final release published 2 to 3 months later; users are notified when this occurs.

    The data set presented here is a regridded subset of the full ERA5 data set on native resolution. It is online on spinning disk, which should ensure fast and easy access. It should satisfy the requirements of most common applications. An overview of all ERA5 datasets can be found in this article. Information on access to ERA5 data on native resolution is provided in these guidelines. Data has been regridded to a regular lat-lon grid of 0.25 degrees for the reanalysis and 0.5 degrees for the uncertainty estimate (0.5 and 1 degree respectively for ocean waves). There are four main sub-sets: hourly and monthly products, both on pressure levels (upper-air fields) and single levels (atmospheric, ocean-wave and land-surface quantities). The present entry is "ERA5 hourly data on single levels from 1940 to present".
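
    For reference, this subset is typically retrieved through the Climate Data Store API (a minimal sketch, assuming the cdsapi package and a configured ~/.cdsapirc with CDS credentials; the variable, date and output filename are illustrative):

        import cdsapi

        c = cdsapi.Client()  # reads credentials from ~/.cdsapirc

        # Request one hourly 2 m temperature field on the 0.25-degree grid.
        c.retrieve(
            "reanalysis-era5-single-levels",
            {
                "product_type": "reanalysis",
                "variable": "2m_temperature",
                "year": "2020",
                "month": "01",
                "day": "01",
                "time": "12:00",
                "format": "grib",
            },
            "era5_t2m.grib",
        )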

  3. Data from: Long-term Data Collection at Select Antarctic Peninsula Visitor...

    • cmr.earthdata.nasa.gov
    • get.iedadata.org
    • +2more
    Updated Nov 14, 2020
    Cite
    (2020). Long-term Data Collection at Select Antarctic Peninsula Visitor Sites [Dataset]. https://cmr.earthdata.nasa.gov/search/concepts/C2532069330-AMD_USAPDC.html
    Explore at:
    Dataset updated
    Nov 14, 2020
    Time period covered
    Aug 15, 2003 - Jul 31, 2008
    Area covered
    Antarctica, Antarctic Peninsula,
    Description

    The Antarctic Site Inventory Project has collected biological data and site-descriptive information in the Antarctic Peninsula region since 1994. This research effort has provided data on those sites which are visited by tourists on shipboard expeditions in the region. The aim is to obtain data on the population status of several key species of Antarctic seabirds, which might be affected by the cumulative impact resulting from visits to the sites. This project will continue the effort by focusing on two heavily-visited Antarctic Peninsula sites: Paulet Island, in the northwestern Weddell Sea and Petermann Island, in the Lemaire Channel near Anvers Island. These sites were selected because both rank among the ten most visited sites in Antarctica each year in terms of numbers of visitors and zodiac landings; both are diverse in species composition, and both are sensitive to potential environmental disruptions from visitors. These data collected focus on two important biological parameters for penguins and blue-eyed shags: (1) breeding population size (number of occupied nests) and (2) breeding success (number of chicks per occupied nests). A long-term data program will be supported, with studies at the two sites over a five-year period. The main focus will be at Petermann Island, selected for intensive study due to its visitor status and location in the region near Palmer Station. This will allow for comparative data with the Palmer Long Term Ecological Research program. Demographic data will be collected in accordance with Standard Methods established by the Convention for the Conservation of Antarctic Marine Living Resources Ecosystem Monitoring Program and thus will be comparable with similar data sets being collected by other international Antarctic Treaty nation research programs. While separating human-induced change from change resulting from a combination of environmental factors will be difficult, this work will provide a first step to identify potential impacts. These long-term data sets will contribute to a better understanding of biological processes in the entire region and will contribute valuable information to be used by the Antarctic Treaty Parties as they address issues in environmental stewardship in Antarctica.

  4. IMAGE RPI Monthly Electron Density Values - Dataset - NASA Open Data Portal

    • data.nasa.gov
    Updated Apr 8, 2025
    Cite
    nasa.gov (2025). IMAGE RPI Monthly Electron Density Values - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/image-rpi-monthly-electron-density-values
    Explore at:
    Dataset updated
    Apr 8, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    The electron density values listed in this file are derived from the IMAGE Radio Plasma Imager (B.W. Reinisch, PI) data using an automatic fitting program written by Phillip Webb, with manual correction. The electron number densities were produced using an automated procedure (with manual correction when necessary) which attempted to self-consistently fit an enhancement in the IMAGE RPI dynamic spectra to either 1) the upper hybrid resonance band, 2) the Z-mode, or 3) the continuum edge. The automatic algorithm works according to rules determined by comparison of the active and passive RPI data [Benson et al., GRL, vol. 31, L20803, doi:10.1029/2004GL020847, 2004]. The manual data points are not from frequencies chosen freely by a human; rather, the human specifies that the computer should search for a peak or continuum edge in a certain frequency region. Thus even the manual points are determined, in part, by the automatic algorithms. Of course that does not guarantee that the data points are right, but it does eliminate some human bias.

  5. Lemmatized English Word2Vec data

    • data.europa.eu
    • data.niaid.nih.gov
    unknown
    Updated Jan 7, 2021
    Cite
    Zenodo (2021). Lemmatized English Word2Vec data [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-4421380?locale=bg
    Explore at:
    Available download formats: unknown (1209)
    Dataset updated
    Jan 7, 2021
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    Description

    Lemmatized English Word2Vec data

    This is a version of the original GoogleNews-vectors-negative300 Word2Vec embeddings for English. In addition, we provide the following modified files:

    • converted to conventional CSV format (and gzipped)
    • subclassified: for the most frequent 1,000,000 words, subclassified according to WordNet parts of speech (ADJ, ADV, NOUN, VERB, OTHER); note that one embedding can be associated with multiple parts of speech. For the remaining words: RARE (top 1,000,001-2,000,000 words) and VERY_RARE (top 2,000,001-3,000,000 words)
    • WordNet lemmatization (via NLTK) in separate files (first lemma only)

    Note that this is not a product of original research, but a derived work, deposited here as a point of permanent reference and as a building block for subsequent research. For such applications, a publication independent from Google is necessary to guarantee stability against changes in their data releases. The original Word2vec code and data were published via https://code.google.com/archive/p/word2vec/ under an Apache License 2.0. We obtained the Word2vec data from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing on Jun 3, 2020.

    The Word2vec documentation included the following references:

    [1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.
    [2] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.
    [3] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL HLT, 2013.

    The derived data is made available under the same license (Apache License 2.0). However, note that the content derived from WordNet (lemmas) is subject to the Princeton WordNet license as stated in LICENSE.wordnet. Data provided by the Applied Computational Linguistics Lab of the Goethe University Frankfurt, Germany. Original data developed by Mikolov et al.
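
    As an illustration, the original binary embeddings can be loaded with gensim (a minimal sketch; the filename is the conventional Google News archive name and is an assumption here, and the derived CSV files can instead be read with ordinary CSV tooling):

        from gensim.models import KeyedVectors

        # Load the original 300-dimensional Google News vectors (binary format).
        vectors = KeyedVectors.load_word2vec_format(
            "GoogleNews-vectors-negative300.bin.gz", binary=True
        )

        # Nearest neighbours in embedding space.
        print(vectors.most_similar("language", topn=5))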

  6. Data from: Predictive language processing revealing usage-based variation

    • narcis.nl
    • dataverse.nl
    doc
    Updated Jun 21, 2018
    + more versions
    Cite
    Verhagen, Véronique (Tilburg University); Mos, Maria (Tilburg University); Backus, Ad (Tilburg University); Schilperoord, Joost (Tilburg University) (2018). Predictive language processing revealing usage-based variation [Dataset]. http://doi.org/10.34894/4ezrro
    Explore at:
    Available download formats: doc
    Dataset updated
    Jun 21, 2018
    Dataset provided by
    DataverseNL
    Authors
    Verhagen, Véronique (Tilburg University); Mos, Maria (Tilburg University); Backus, Ad (Tilburg University); Schilperoord, Joost (Tilburg University)
    Description

    While theories of predictive processing posit that predictions are based on one's prior experiences, experimental work has effectively ignored the fact that people differ from each other in their linguistic experiences and, consequently, in the predictions they generate. We examine usage-based variation by means of three groups of participants (recruiters, job-seekers, and people not (yet) looking for a job), two stimulus sets (word sequences characteristic of either job ads or news reports), and two experiments (a completion task and a voice onset time task). We show that differences in experience with a particular register result in different expectations regarding word sequences characteristic of that register, thus pointing to differences in mental representations of language. Subsequently, we investigate to what extent different operationalizations of word predictability are accurate predictors of voice onset times. A measure of a participant's own expectations proves to be a significant predictor of processing speed over and above word predictability measures based on amalgamated data. These findings point to actual individual differences and highlight the merits of going beyond amalgamated data. We thus demonstrate that it is feasible to empirically assess the variation implied in usage-based theories, and we advocate exploiting this opportunity.

  7. Data from: Approach or Avoidance: How Does Employees’ Generative AI...

    • data.mendeley.com
    Updated Jan 8, 2025
    Cite
    yihang yan (2025). Approach or Avoidance: How Does Employees’ Generative AI Awareness Shape Their Job Crafting Behavior? A Sensemaking Perspective [Dataset]. http://doi.org/10.17632/fn4snf9hcj.1
    Explore at:
    Dataset updated
    Jan 8, 2025
    Authors
    yihang yan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We engaged 315 participants from the Credamo platform, a well-established and reliable source for large-scale data collection. We gathered survey data at three distinct time points, each separated by a two-week interval. At the first time point, we invited 600 participants to provide information on their gender, age, education level, and frequency of Gen AI use. Additionally, we asked them to rate their individual regulatory focus, AI awareness and perceived CSR. We obtained 590 responses, which corresponds to a high response rate of 98%. At the second time point, we invited the 590 participants who responded at Time 1 to rate their work passion. We received 540 responses, a response rate of 90%. At Time 3, we invited the 540 respondents to complete the final assessment, in which they rated their job crafting behaviors. In all, 316 participants returned their responses, yielding a response rate of 88%.

  8. Raw data.

    • plos.figshare.com
    xlsx
    Updated Jan 3, 2025
    + more versions
    Cite
    Mi Hwa Seo; Eun A. Kim; Hae Ran Kim (2025). Raw data. [Dataset]. http://doi.org/10.1371/journal.pone.0316654.s001
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jan 3, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Mi Hwa Seo; Eun A. Kim; Hae Ran Kim
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Person-centered care focuses on individualized care that respects patients’ values, preferences, and autonomy. To enhance the quality of critical care nursing, institutions need to identify the factors influencing ICU nurses’ ability to provide person-centered care. This study explored the relationship between clinical judgment ability and person-centered care among intensive care unit (ICU) nurses, emphasizing how the ICU nursing work environment moderates this relation.

    Methods: A cross-sectional survey was conducted between September 4 and September 18, 2023, with 192 ICU nurses recruited as a convenience sample from four general hospitals (valid response rate = 97.4%). Participants completed online self-report structured questionnaires. The collected data were analyzed using hierarchical multiple regression and PROCESS macro Model 1, with 95% bias-corrected bootstrap confidence intervals to verify moderating effects.

    Results: Clinical judgment ability (β = .24, p < .001) and the ICU nursing work environment (β = .50, p < .001) were significant predictors of person-centered care. Together these two predictors explained 47.0% of the variance in person-centered care in the final hierarchical regression model. Additionally, clinical judgment (B = 0.28, p < .001, bootstrap 95% CI = 0.13~0.42) and the ICU nursing work environment (B = 0.41, p < .001, bootstrap 95% CI = 0.30~0.52) positively affected person-centered care, and the interaction term of clinical judgment and ICU nursing work environment (B = 0.16, p = .026, bootstrap 95% CI = 0.02~0.30) also positively affected person-centered care. The moderating effect was significant when the ICU nursing work environment score was 2.90 points or higher on a scale of 1-5 (below: 14.6%, above: 85.4%), and the positive moderating effect increased as the work environment score increased.

    Conclusions: ICU nurses’ clinical judgment ability positively affected person-centered care, and the nursing work environment moderated the relationship between clinical judgment ability and person-centered care. Therefore, strategies for enhancing person-centered care among ICU nurses should focus on developing educational programs to improve clinical judgment ability and implementing comprehensive efforts to effectively improve and manage the nursing work environment.
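
    To make the moderation structure concrete, here is a minimal sketch with simulated data (the variable names are hypothetical and the simulated coefficients merely echo the abstract for illustration; this is not the study's data or code). PROCESS Model 1 corresponds to an OLS regression with an interaction term:

        import numpy as np
        import pandas as pd
        import statsmodels.formula.api as smf

        rng = np.random.default_rng(0)
        n = 192
        df = pd.DataFrame({
            "judgment": rng.normal(3.5, 0.5, n),
            "environment": rng.normal(2.9, 0.4, n),
        })
        # Simulated outcome loosely echoing the reported coefficients.
        df["pcc"] = (0.28 * df["judgment"] + 0.41 * df["environment"]
                     + 0.16 * df["judgment"] * df["environment"]
                     + rng.normal(0, 0.3, n))

        # "judgment * environment" expands to both main effects plus their
        # product, i.e. the moderation structure tested with PROCESS Model 1.
        model = smf.ols("pcc ~ judgment * environment", data=df).fit()
        print(model.params)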

  9. Price Paid Data

    • gov.uk
    Updated Sep 29, 2025
    Cite
    HM Land Registry (2025). Price Paid Data [Dataset]. https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads
    Explore at:
    Dataset updated
    Sep 29, 2025
    Dataset provided by
    GOV.UK (http://gov.uk/)
    Authors
    HM Land Registry
    Description

    Our Price Paid Data includes information on all property sales in England and Wales that are sold for value and are lodged with us for registration.

    Get up to date with the permitted use of our Price Paid Data:
    check what to consider when using or publishing our Price Paid Data

    Using or publishing our Price Paid Data

    If you use or publish our Price Paid Data, you must add the following attribution statement:

    Contains HM Land Registry data © Crown copyright and database right 2021. This data is licensed under the Open Government Licence v3.0.

    Price Paid Data is released under the Open Government Licence (OGL): http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/. You need to make sure you understand the terms of the OGL before using the data.

    Under the OGL, HM Land Registry permits you to use the Price Paid Data for commercial or non-commercial purposes. However, OGL does not cover the use of third party rights, which we are not authorised to license.

    Price Paid Data contains address data processed against Ordnance Survey’s AddressBase Premium product, which incorporates Royal Mail’s PAF® database (Address Data). Royal Mail and Ordnance Survey permit your use of Address Data in the Price Paid Data:

    • for personal and/or non-commercial use
    • to display for the purpose of providing residential property price information services

    If you want to use the Address Data in any other way, you must contact Royal Mail. Email address.management@royalmail.com.

    Address data

    The following fields comprise the address data included in Price Paid Data:

    • Postcode
    • PAON Primary Addressable Object Name (typically the house number or name)
    • SAON Secondary Addressable Object Name – if there is a sub-building, for example, the building is divided into flats, there will be a SAON
    • Street
    • Locality
    • Town/City
    • District
    • County
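
    For reference, a monthly PPD file can be loaded in Python. This is a rough sketch: the filename and the full column order are assumptions based on HM Land Registry's published layout, so verify both against the official column documentation before relying on them.

        import pandas as pd

        # Assumed headerless PPD CSV layout; check HM Land Registry's
        # Price Paid Data documentation for the authoritative column order.
        columns = [
            "transaction_id", "price", "date_of_transfer", "postcode",
            "property_type", "old_new", "duration", "paon", "saon",
            "street", "locality", "town_city", "district", "county",
            "ppd_category_type", "record_status",
        ]

        ppd = pd.read_csv("pp-monthly-update.csv", header=None, names=columns)

        # Example: mean price paid per town/city.
        print(ppd.groupby("town_city")["price"].mean()
                 .sort_values(ascending=False).head())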

    August 2025 data (current month)

    The August 2025 release includes:

    • the first release of data for August 2025 (transactions received from the first to the last day of the month)
    • updates to earlier data releases
    • Standard Price Paid Data (SPPD) and Additional Price Paid Data (APPD) transactions

    As we will be adding to the August data in future releases, we would not recommend using it in isolation as an indication of market or HM Land Registry activity. When the full dataset is viewed alongside the data we’ve previously published, it adds to the overall picture of market activity.

    Your use of Price Paid Data is governed by conditions and by downloading the data you are agreeing to those conditions.

    Google Chrome (Chrome 88 onwards) is blocking downloads of our Price Paid Data. Please use another internet browser while we resolve this issue. We apologise for any inconvenience caused.

    We update the data on the 20th working day of each month. You can download the:

    Single file

    These include standard and additional price paid data transactions received at HM Land Registry from 1 January 1995 to the most current monthly data.

    The data is updated monthly. The average size of this file is 3.7 GB. You can download:

  10. Eye-Tracking Sentence Reading Task

    • osf.io
    Updated Oct 6, 2023
    Cite
    Oren Kobo; Tom Schonberg (2023). Eye-Tracking Sentence Reading Task [Dataset]. https://osf.io/xwfs3
    Explore at:
    Dataset updated
    Oct 6, 2023
    Dataset provided by
    Center For Open Science
    Authors
    Oren Kobo; Tom Schonberg
    Description

    We aim to compare the processing of the experimental sentences between subjects with varied levels of tendency to depression (as measured by questionnaires) to probe whether we can predict emotional state based on an accessible, objective measure. Any difference may indicate alternate processing strategies or allocation of attention during reading that might cause a different distribution of probabilities over the possible subsequent sentence information. Participants will read sentences, one at a time, while eye-tracking data are collected. We will use thirty-two sets of four sentences. All sentences are constructed such that a specific position near the beginning of the sentence (denoted as the source word) interacts with an upcoming negative/positive word (denoted as the target word). The nature of this interaction is such that the target word is either expected or surprising (but plausible) given the source word. Sets follow a 2x2 design (Agreement x Sentiment), split into four conditions, with word frequencies controlled using a Hebrew dataset (Linzen, 2009). Every subject sees exactly one sentence from each set, and the same number of sentences from each of the four conditions. To prevent structural priming (Bock, 1986), we added 56 fillers, which are regular, probable sentences taken from previous studies. Participants take part in 20-minute sessions while eye-tracking data are collected throughout the experiment using an EyeLink 1000 Plus at 1000 Hz. Afterwards, they will be asked to fill out three questionnaires measuring different aspects of mental condition, PHQ-9 (depression) and state and trait anxiety (STAI; Spielberger, 1983), to correlate with the eye-tracking measures collected at the target word during comprehension of the sentences and to validate our implicit measures. We will explore the processing differences reflected in the eye-tracking data when reading the target word, and hypothesize that tendency to depression will interact with reading time and various known eye-tracking-related linguistic measures when processing a negative/positive surprise at that point. That is, we will analyze the gaze pattern when reading the target words to probe whether participants with more depression (i.e. higher PHQ) process negative stimuli differently, while altering the predictability of this sentiment given prior context (surprising/expected). The measures extracted from the raw eye-tracking signal and used for analysis are: total gaze distance covered, total gaze duration on ROI, first fixation duration, second fixation duration, regression path duration, skipping probability, whether the word has a first-pass regression, number of fixations, and pupil diameter. All these measures are with regard to the defined ROIs of the specific presented sentences. The ROIs are the locations of the target and source words, meaning the exact range of pixels on screen on which the words are presented. Additionally, we will: 1. explore the inter-relations between these features; 2. build the following classifiers: predict sentence type from the raw eye-tracking signal using a deep-learning model (Bi-LSTM); predict sentence type from the extracted features (with an ML model such as a random forest); predict PHQ level from the raw eye-tracking signal using a deep-learning model (Bi-LSTM); and predict PHQ level from the extracted features (with an ML model such as a random forest). We will perform both within-subject and between-subject evaluation.
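
    As a rough sketch of the feature-based classification step, assuming a feature matrix built from the eye-tracking measures listed above (the data below is random placeholder data, not the study's):

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import cross_val_score

        # Placeholder feature matrix: one row per trial, one column per
        # eye-tracking measure (gaze duration, fixation counts, etc.).
        rng = np.random.default_rng(0)
        X = rng.random((128, 9))
        y = rng.integers(0, 2, size=128)  # e.g. expected vs. surprising target

        clf = RandomForestClassifier(n_estimators=200, random_state=0)
        print(cross_val_score(clf, X, y, cv=5).mean())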

    To ensure participants pay attention and remain engaged, they will be asked simple yes/no comprehension questions after reading each sentence (both the experimental and the filler sentences). Participants scoring below a predetermined threshold of 75% success on these questions will be disqualified.

  11. ONS Opinions Survey, February 2010 - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Feb 15, 2010
    + more versions
    Cite
    (2010). ONS Opinions Survey, February 2010 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/33813e4c-79e5-522c-8025-5f8e24109780
    Explore at:
    Dataset updated
    Feb 15, 2010
    Description

    Abstract copyright UK Data Service and data collection copyright owner.

    The Opinions and Lifestyle Survey (OPN) is an omnibus survey that collects data from respondents in Great Britain. Information is gathered on a range of subjects, commissioned both internally by the Office for National Statistics (ONS) and by external clients (other government departments, charities, non-profit organisations and academia). One individual respondent, aged 16 or over, is selected from each sampled private household to answer questions. Data are gathered on the respondent, their family, address, household, income and education, plus responses and opinions on a variety of subjects within commissioned modules. Each regular OPN survey consists of two elements: core questions, covering demographic information, are asked together with non-core questions that vary depending on the module(s) fielded. The OPN collects timely data for research and policy analysis evaluation on the social impacts of recent topics of national importance, such as the coronavirus (COVID-19) pandemic and the cost of living. The OPN has expanded to include questions on other topics of national importance, such as health and the cost of living. For more information about the survey and its methodology, see the gov.uk OPN Quality and Methodology Information (QMI) webpage.

    Changes over time: Up to March 2018, the OPN was conducted as a face-to-face survey. From April 2018 to November 2019, the OPN changed to a mixed-mode design (online first with telephone interviewing where necessary). Mixed-mode collection allows respondents to complete the survey more flexibly and provides a more cost-effective service for module customers. In March 2020, the OPN was adapted to become a weekly survey used to collect data on the social impacts of the coronavirus (COVID-19) pandemic on the lives of people of Great Britain. These data are held under Secure Access conditions in SN 8635, ONS Opinions and Lifestyle Survey, Covid-19 Module, 2020-2022: Secure Access. (See below for information on other Secure Access OPN modules.) From August 2021, as coronavirus (COVID-19) restrictions were lifted across Great Britain, the OPN moved to fortnightly data collection, sampling around 5,000 households in each survey wave to ensure the survey remained sustainable.

    Secure Access OPN modules: Besides SN 8635 (the COVID-19 Module), other Secure Access OPN data includes sensitive modules run at various points from 1997-2019, including Census religion (SN 8078), cervical cancer screening (SN 8080), contact after separation (SN 8089), contraception (SN 8095), disability (SNs 8680 and 8096), general lifestyle (SN 8092), illness and activity (SN 8094), and non-resident parental contact (SN 8093). See the individual studies for further details and information on how to apply to use them.

    Main Topics: The non-core questions for this month were:

    • Tobacco consumption (Module 210): asked on behalf of HM Revenue and Customs to help estimate the amount of tobacco consumed as cigarettes. Due to the potentially sensitive nature of the data within this module, cases for respondents aged under 18 have been removed.
    • Disability monitoring (Module 363): asked on behalf of the Department for Work and Pensions (DWP), which is interested in information on disability; includes two questions that ask about awareness of the Disability Discrimination Act. The module aims to identify the scale of problems those with long-term illnesses or disabilities have accessing goods, facilities and services. This version of the data does not contain variables M363_3M, M363_6AM, M363_6bM, M363_7M, M363_26, M363_27, M363_28, and M363_29. The Special Licence version of the data is held under SN 6992.
    • Road pricing (Module MAE): asked on behalf of the Department for Transport; asks for opinions on road pricing.
    • Disability (Module MCA): asked by the Office for National Statistics on behalf of the Centre for Health Analysis and Life Events; seeks information regarding health problems which are long-lasting in nature and cause problems with normal daily activities. Variables MCA_1b1M and MCA_2b2M have been recoded into smaller groupings.
    • Later life (Module MCE): asked by DWP on behalf of a number of other government departments which are interested in what people think of the support available to help older people continue to live independently in later life.
    • Health and work (Module MCP): asked by DWP on behalf of the Health, Work and Well-being Delivery Unit. Questions relate to health, well-being and work. This version of the data does not contain variables MCP_14, MCP_15M, MCP_16 and MCP_17 as they are considered disclosive.
    • Migration (Module MCR): asked on behalf of ONS; looks at migration into the UK and patterns of movement around the UK after arrival. The UN recommendation for defining an international long-term migrant, "a person who moves to a country other than that of his or her usual residence for a period of at least a year", was used.

    Sampling: multi-stage stratified random sample. Mode of collection: face-to-face interview.

  12. Correlation among variables (N = 192).

    • plos.figshare.com
    xls
    Updated Jan 3, 2025
    + more versions
    Cite
    Mi Hwa Seo; Eun A. Kim; Hae Ran Kim (2025). Correlation among variables (N = 192). [Dataset]. http://doi.org/10.1371/journal.pone.0316654.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Jan 3, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Mi Hwa Seo; Eun A. Kim; Hae Ran Kim
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Person-centered care focuses on individualized care that respects patients’ values, preferences, and autonomy. To enhance the quality of critical care nursing, institutions need to identify the factors influencing ICU nurses’ ability to provide person-centered care. This study explored the relationship between clinical judgment ability and person-centered care among intensive care unit (ICU) nurses, emphasizing how the ICU nursing work environment moderates this relation.

    Methods: A cross-sectional survey was conducted between September 4 and September 18, 2023, with 192 ICU nurses recruited as a convenience sample from four general hospitals (valid response rate = 97.4%). Participants completed online self-report structured questionnaires. The collected data were analyzed using hierarchical multiple regression and PROCESS macro Model 1, with 95% bias-corrected bootstrap confidence intervals to verify moderating effects.

    Results: Clinical judgment ability (β = .24, p < .001) and the ICU nursing work environment (β = .50, p < .001) were significant predictors of person-centered care. Together these two predictors explained 47.0% of the variance in person-centered care in the final hierarchical regression model. Additionally, clinical judgment (B = 0.28, p < .001, bootstrap 95% CI = 0.13~0.42) and the ICU nursing work environment (B = 0.41, p < .001, bootstrap 95% CI = 0.30~0.52) positively affected person-centered care, and the interaction term of clinical judgment and ICU nursing work environment (B = 0.16, p = .026, bootstrap 95% CI = 0.02~0.30) also positively affected person-centered care. The moderating effect was significant when the ICU nursing work environment score was 2.90 points or higher on a scale of 1-5 (below: 14.6%, above: 85.4%), and the positive moderating effect increased as the work environment score increased.

    Conclusions: ICU nurses’ clinical judgment ability positively affected person-centered care, and the nursing work environment moderated the relationship between clinical judgment ability and person-centered care. Therefore, strategies for enhancing person-centered care among ICU nurses should focus on developing educational programs to improve clinical judgment ability and implementing comprehensive efforts to effectively improve and manage the nursing work environment.

  13. Long-term unemployment rate, % of active population aged 15-74

    • ec.europa.eu
    • db.nomics.world
    • +1more
    + more versions
    Cite
    European Commission, Long-term unemployment rate, % of active population aged 15-74 [Dataset]. https://ec.europa.eu/eurostat/databrowser/view/une_ltu_a/default/table?lang=en
    Explore at:
    Dataset authored and provided by
    European Commission (http://ec.europa.eu/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The long-term unemployment rate is the number of persons unemployed for 12 months or longer, expressed as a percentage of the labour force (the total number of people employed and unemployed). Unemployed persons are those aged 15 to 74 who meet all three of the following conditions: were not employed during the reference week; were available to start working within two weeks after the reference week; and have actively sought work in the four weeks prior to the reference week or have already found a job to begin within the next three months. The MIP auxiliary indicator is expressed as a percentage of the active population aged 15 to 74 years. In the table, the values are also presented as changes over a three-year period (in percentage points). The data source is the quarterly EU Labour Force Survey (EU-LFS), which covers the resident population living in private households.
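
    As a worked example with purely illustrative numbers (not Eurostat figures), the definition reduces to a simple ratio:

        # Illustrative numbers only: 2 million people unemployed for 12 months
        # or longer, out of a labour force (employed + unemployed, aged 15-74)
        # of 50 million.
        long_term_unemployed = 2_000_000
        labour_force = 50_000_000

        ltu_rate = 100 * long_term_unemployed / labour_force
        print(f"Long-term unemployment rate: {ltu_rate:.1f}%")  # 4.0%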

  14. Sign Language Gesture Images Dataset

    • kaggle.com
    zip
    Updated Sep 10, 2019
    Cite
    Ahmed Khan (2019). Sign Language Gesture Images Dataset [Dataset]. https://www.kaggle.com/datasets/ahmedkhanak1995/sign-language-gesture-images-dataset
    Explore at:
    Available download formats: zip (199984313 bytes)
    Dataset updated
    Sep 10, 2019
    Authors
    Ahmed Khan
    License

    https://ec.europa.eu/info/legal-notice_en

    Description

    Context

    Sign language is a communication language just like any other, used among the deaf community. This dataset is a complete set of gestures used in sign language, and it can also help hearing people better understand sign language gestures.

    Content

    The dataset consists of 37 different hand sign gestures, including A-Z alphabet gestures, 0-9 number gestures, and a gesture for space, which is how deaf people represent a space between two letters or two words while communicating. The dataset has two parts, i.e. two folders. (1) Gesture Image Data consists of coloured images of hands for the different gestures. Each gesture image is 50x50 pixels and sits in its designated folder: the A-Z folders contain the A-Z gesture images, the 0-9 folders contain the 0-9 gesture images, and the '_' folder contains images of the space gesture. Each gesture has 1,500 images, so with 37 gestures altogether there are 55,500 images in the first folder. (2) Gesture Image Pre-Processed Data has the same number of folders and the same number of images (55,500); the difference is that these images are threshold binary converted images for training and testing purposes. A convolutional neural network is well suited to this dataset for model training and gesture prediction.
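
    As a starting point, here is a minimal sketch in Python/TensorFlow (the folder path is illustrative and may differ from the extracted archive's actual layout; the 50x50 image size and 37 classes follow the description above):

        import tensorflow as tf

        # Folder names (A-Z, 0-9, '_') double as class labels.
        train_ds = tf.keras.utils.image_dataset_from_directory(
            "Gesture Image Data",
            image_size=(50, 50),
            batch_size=32,
        )

        # A small CNN of the kind the description suggests; 37 output classes.
        model = tf.keras.Sequential([
            tf.keras.layers.Rescaling(1.0 / 255, input_shape=(50, 50, 3)),
            tf.keras.layers.Conv2D(32, 3, activation="relu"),
            tf.keras.layers.MaxPooling2D(),
            tf.keras.layers.Conv2D(64, 3, activation="relu"),
            tf.keras.layers.MaxPooling2D(),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(37, activation="softmax"),
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        model.fit(train_ds, epochs=5)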

    Acknowledgements

    I wouldn't be here without the help of others. This dataset was created with reference to prior work on sign language in data science and to prior work on image processing.

  15. Event construal and temporal distance in natural language - Dataset - B2FIND...

    • b2find.eudat.eu
    Updated Apr 28, 2023
    Cite
    (2023). Event construal and temporal distance in natural language - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/bf45ce3f-b925-5b81-b0a3-cba9f1bee2b3
    Explore at:
    Dataset updated
    Apr 28, 2023
    Description

    Construal level theory proposes that events that are temporally proximate are represented more concretely than events that are temporally distant. We tested this prediction using two large natural language text corpora. In study 1 we examined posts on Twitter that referenced the future, and found that tweets mentioning temporally proximate dates used more concrete words than those mentioning distant dates. In study 2 we obtained all New York Times articles that referenced U.S. presidential elections between 1987 and 2007. We found that the concreteness of the words in these articles increased with temporal proximity to the corresponding election. Additionally, the reduction in concreteness after the election was much greater than the increase in concreteness leading up to the election, though both changes in concreteness were well described by an exponential function. We replicated this finding with New York Times articles referencing US public holidays. Overall, our results provide strong support for the predictions of construal level theory, and additionally illustrate how large natural language datasets can be used to inform psychological theory.

    This network project brings together economists, psychologists, computer and complexity scientists from three leading centres for behavioural social science at Nottingham, Warwick and UEA. This group will lead a research programme with two broad objectives: to develop and test cross-disciplinary models of human behaviour and behaviour change, and to draw out their implications for the formulation and evaluation of public policy. Foundational research will focus on three inter-related themes: understanding individual behaviour and behaviour change; understanding social and interactive behaviour; and rethinking the foundations of policy analysis. The project will explore implications of the basic science for policy via a series of applied projects connecting naturally with the three themes. These will include: the determinants of consumer credit behaviour; the formation of social values; and strategies for evaluation of policies affecting health and safety. The research will integrate theoretical perspectives from multiple disciplines and utilise a wide range of complementary methodologies including: theoretical modelling of individuals, groups and complex systems; conceptual analysis; lab and field experiments; and analysis of large data sets. The Network will promote high quality cross-disciplinary research and serve as a policy forum for understanding behaviour and behaviour change.

    Experimental data. In study 1, we collected and analyzed millions of time-indexed posts on Twitter. We obtained a large number of tweets that referenced dates in the future, and used these tweets to determine the concreteness of the language used to describe events at those dates. This allowed us to observe how psychological distance influences everyday discourse, and put the key assumptions of CLT to a real-world test. In study 2, we analyzed word concreteness in news articles using the New York Times (NYT) Annotated Corpus (Sandhaus, 2008). This corpus contains over 1.8 million NYT articles written between 1987 and 2007. Importantly for our purposes, these articles are tagged with keywords describing their topics. In this study we obtained all NYT articles written before and after the 1988, 1992, 1996, 2000, and 2004 US presidential elections which were tagged as pertaining to these elections. We subsequently tested how the concreteness of the words used in the articles varied as a function of temporal distance to the election they reference. We also performed this analysis with NYT articles referencing three popular public holidays. Unlike study 1 and prior work (such as Snefjella & Kuperman, 2015), study 2 allowed us to examine the influence of temporal distance in the past as well as the future, while controlling for the exact time when specific events occurred.
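
    To illustrate the exponential description of the concreteness trajectory, a minimal curve-fitting sketch with made-up numbers (not the study's data):

        import numpy as np
        from scipy.optimize import curve_fit

        # Made-up example: mean word concreteness vs. days until the event.
        days = np.array([1.0, 7.0, 30.0, 90.0, 180.0, 365.0])
        concreteness = np.array([2.90, 2.80, 2.60, 2.45, 2.40, 2.38])

        def exponential(t, a, b, c):
            # Concreteness decays toward an asymptote c at rate b.
            return a * np.exp(-b * t) + c

        params, _ = curve_fit(exponential, days, concreteness, p0=(0.5, 0.02, 2.4))
        print(dict(zip("abc", params)))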

  16. S1 Data -

    • figshare.com
    • plos.figshare.com
    xlsx
    Updated May 16, 2024
    Cite
    Markus W. Lenizky; Sean K. Meehan (2024). S1 Data - [Dataset]. http://doi.org/10.1371/journal.pone.0302989.s001
    Explore at:
    Available download formats: xlsx
    Dataset updated
    May 16, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Markus W. Lenizky; Sean K. Meehan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Multiple sensorimotor loops converge in the motor cortex to create an adaptable system capable of context-specific sensorimotor control. Afferent inhibition provides a non-invasive tool to investigate the substrates by which procedural and cognitive control processes interact to shape motor corticospinal projections. Varying the transcranial magnetic stimulation properties during afferent inhibition can probe specific sensorimotor circuits that contribute to short- and long-latency periods of inhibition in response to the peripheral stimulation. The current study used short- (SAI) and long-latency (LAI) afferent inhibition to probe the influence of verbal and spatial working memory load on the specific sensorimotor circuits recruited by posterior-anterior (PA) and anterior-posterior (AP) TMS-induced current. Participants completed two sessions where SAI and LAI were assessed during the short-term maintenance of two- or six-item sets of letters (verbal) or stimulus locations (spatial). The only difference between the sessions was the direction of the induced current. PA SAI decreased as the verbal working memory load increased. In contrast, AP SAI was not modulated by verbal working memory load. Visuospatial working memory load did not affect PA or AP SAI. Neither PA LAI nor AP LAI were sensitive to verbal or spatial working memory load. The dissociation of short-latency PA and AP sensorimotor circuits and short- and long-latency PA sensorimotor circuits with increasing verbal working memory load support multiple convergent sensorimotor loops that provide distinct functional information to facilitate context-specific supraspinal control.

  17. Children's People and Nature Survey for England, 2021-2023 - Dataset -...

    • b2find.eudat.eu
    Updated Dec 6, 2024
    + more versions
    Cite
    (2024). Children's People and Nature Survey for England, 2021-2023 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/de4f372c-2db0-50b5-abee-bfe621024782
    Explore at:
    Dataset updated
    Dec 6, 2024
    Area covered
    England
    Description

    Abstract copyright UK Data Service and data collection copyright owner.

    The Children’s People and Nature Survey (C-PaNS) provides information on how children and young people experience and think about the natural environment. Each year, the survey samples around 4,000 children and young people aged 8-15 years across two survey waves, one in term time and one in holiday time. Waves 5 and 6 of the C-PaNS ran between 16 and 23 August 2023 (during the school holidays) and 18 and 26 September 2023 (during term time). Different versions of the C-PaNS are available from the UK Data Archive under Open Access (SN 9174), End User Licence (SN 9175), and Secure Access (SN 9176) conditions.

    The Secure Access version includes the same data as the End User Licence version, but with more detailed variables, including:

    • age as a continuous variable
    • income (all categories)
    • number of people living in household as a continuous variable
    • ethnicity
    • disability
    • home geography variables, including local authority district and urban/rural area
    • open answers for thematic analysis in CS_Q14 and CS_Q15

    The Open Access version includes the same data as the End User Licence version, but does not include the following variables:

    • age band
    • number of people living in household (top coded to ‘6 and over’)
    • access to private garden
    • income (top coded to £50,000+)
    • gender
    • places within walking distance from home

    Researchers are advised to review the Open Access and/or the End User Licence versions to determine if these are adequate prior to ordering the Secure Access version.

    Accredited official statistics are called National Statistics in the Statistics and Registration Service Act 2007. An explanation can be found on the Office for Statistics Regulation website. Natural England's statistical practice is regulated by the Office for Statistics Regulation (OSR). OSR sets the standards of trustworthiness, quality and value in the Code of Practice for Statistics that all producers of official statistics should adhere to. These accredited official statistics were independently reviewed by the Office for Statistics Regulation in January 2023. They comply with the standards of trustworthiness, quality and value in the Code of Practice for Statistics and should be labelled ‘accredited official statistics’. Users are welcome to contact Natural England directly at people_and_nature@naturalengland.org.uk with any comments about how they meet these standards. Alternatively, users can contact OSR by emailing regulation@statistics.gov.uk or via the OSR website.

    Since the latest review by the Office for Statistics Regulation, Natural England have continued to comply with the Code of Practice for Statistics, and have made the following improvements:

    • published a development plan with timetables for future work, which will be updated annually
    • ensured that users have opportunities to contribute to development planning through their biannual Research User Group
    • enabled wider access to the data by publishing raw data sets through the UK Data Service
    • provided users with guidance on how statistics from their products can be compared with those produced in the devolved nations
    • published guidance on the differences between PaNS and MENE
    • improved estimates of the percentage of people visiting nature in the previous 14 days by reducing the number of respondents answering ‘don’t know’

    These data are available in Excel, SPSS, and Open Document Spreadsheet (ODS) formats.

    Main Topics: The Children's People and Nature Survey for England covers topics including: wellbeing; time spent outside; quality of outdoor spaces; opportunities and barriers to spending time outside; environmental concern and action; nature connection; and the Countryside Code. Quota sample. Self-administered questionnaire: computer-assisted (CASI).

  18. Impact indicator: housing starts

    • data.wu.ac.at
    • ckan.publishing.service.gov.uk
    • +1 more
    html, sparql
    Updated Feb 26, 2018
    + more versions
    Cite
    Ministry of Housing, Communities and Local Government (2018). Impact indicator: housing starts [Dataset]. https://data.wu.ac.at/schema/data_gov_uk/OWIwODk1YzQtZGRmMC00Nzc5LWEwNDktNjAxNzVjN2M0NGUy
    Explore at:
    html, sparql (available download formats)
    Dataset updated
    Feb 26, 2018
    License

    Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Description

    Total number of housing starts (seasonally adjusted)

    How the figure is calculated:

    Total housing starts are reported by local authority and private building control organisations after the end of each quarter. A start is counted from the point at which foundation work begins. The figures are seasonally adjusted to allow comparisons with previous quarters.

    Why is this indicator in the business plan?

    Increasing the supply of housing is a key part of DCLG policy. The house building figures are the most frequent and timely indicator of housing delivery.

    How often is it updated?

    Quarterly

    Where does the data come from?

    P2 quarterly house building returns by local authority building control departments; monthly information from the National House Building Council (NHBC) on the volume of building control inspections; and a quarterly survey of private building control companies. Published figures are at https://www.gov.uk/government/organisations/department-for-communities-and-local-government/series/house-building-statistics.

    What area does the headline figure cover?

    England

    Are further breakdowns of the data available?

    Yes; the data can be split by local authority area and by tenure.

    What does a change in this indicator show?

    An increase in this indicator is good and shows more new houses are being started.

    Time Lag

    Figures are published within two months of the end of the reporting period.

    Next available update

    May 2015.

    Type of Data

    National Statistics.

    Robustness and data limitations

    The P2 figures from local authorities and figures from private building control companies include imputation for a small number of missing returns.

    Seasonal factors for the house building time series are re-calculated annually back to 2000. This is usually done in the second quarter of the calendar year. Therefore the seasonally adjusted house building figures throughout the whole period change slightly at that time but are not marked as 'revised'.

    Links to Further Information

    https://www.gov.uk/government/organisations/department-for-communities-and-local-government/series/house-building-statistics

    Contact Details

    CorporatePerformance@communities.gsi.gov.uk

  19. Data from: War of Words II: Enriched Models of Law-Making Processes

    • zenodo.org
    txt, zip
    Updated Apr 27, 2021
    Cite
    Victor Kristof; Aswin Suresh; Matthias Grossglauser; Patrick Thiran (2021). War of Words II: Enriched Models of Law-Making Processes [Dataset]. http://doi.org/10.5281/zenodo.4709248
    Explore at:
    txt, zip (available download formats)
    Dataset updated
    Apr 27, 2021
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Victor Kristof; Aswin Suresh; Matthias Grossglauser; Patrick Thiran
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This upload contains the dataset presented and used in the paper:

    Victor Kristof, Aswin Suresh, Matthias Grossglauser, Patrick Thiran, War of Words II: Enriched Models of Law-Making Processes, The Web Conference 2021, April 19-23, 2021, Ljubljana, Slovenia

    The code to process and use the dataset can be found on GitHub.

    This is a follow-up work to War of Words: The Competitive Dynamics of Legislative Processes.

    The dataset is split into two legislature periods of the European Parliament, the 7th (war-of-words-2-ep7.txt) and the 8th (war-of-words-2-ep8.txt) legislature. Here is a snippet to load the dataset (for EP8 in this example) in Python:

    import json

    # Each line of the file is one data point (a conflict between edits),
    # encoded as a JSON list of edit dictionaries.
    with open('path/to/war-of-words-2-ep8.txt') as f:
        dataset = [json.loads(line) for line in f]

    In the two text files, each line is a data point representing a conflict between edits. It is encoded as a JSON list of dictionaries, where each dictionary is an edit. Each edit has the following structure:

    {
     'edit_id': 163187,           // Unique edit identifier
     'edit_type': 'insert',         // One of 'insert', 'delete', or 'replace'
     'accepted': True,           // Label
     'dossier_ref': 'ENVI-AD(2012)487738', // Reference to dossier (see below)
     'dossier_type': 'opinion',       // One of 'opinion' or 'report'
     'date': '2017-03-02',         // Date of vote of all amendments for this dossier
     'legal_act': 'regulation',       // One of 'regulation', 'directive', or 'decision'
     'committee': 'BUDG',          // Committee in which this edit was proposed
     'outsider': False,           // Whether the above committee is the reporting committee
     'article_type': 'recital',       // One of 7 article types
     'source': 'BUDG-AM(2017)599742',    // Reference to original document of the amendment
     'justification': None,         // The text of the optional justification (or None)
     'edit_indices': {...},         // Indices of edit in the amendment (see below)
     'text_original': [...],        // Original text reported in the source document (see below)
     'text_amended': [...],         // Amended text reported in the source document (see below)
     'authors': [              // List of authors
      {
       'id': 88882,            // Unique MEP identifier (see below)
       'name': 'Victor NEGRESCU',     // MEP full name
       'gender': 'M',           // Gender as reported on the Parliament database
       'nationality': 'Romania',     // One of 28 nationalities
      'group': 'Group of the Progressive Alliance of Socialists and Democrats in the European Parliament', // One of 9 political groups
       'rapporteur': False        // Whether the MEP is rapporteur for this dossier
      },
     ],
    }

    The text_original and text_amended keys contain the portion of text reported in the `source` document. The text is tokenized as a list of terms (words, numbers, punctuation, ...). These two keys are not the actual edit, because amendments are reported as a whole edited paragraph (which also gives some context to the edit), and an amendment contains one or more edits. To access the actual text of the edit, use the `edit_indices` key, which is a dictionary (such as `{'i1': 80, 'i2': 80, 'j1': 80, 'j2': 101}`). The `i1` and `i2` keys delimit the changed span in the original text, and the `j1` and `j2` keys delimit the changed span in the amended text. Hence, you can access the text of the edit by doing:

    idx = edit['edit_indices']
    i1, i2, j1, j2 = idx['i1'], idx['i2'], idx['j1'], idx['j2']
    old = ' '.join(edit['text_original'][i1:i2])  # changed span in the original text
    new = ' '.join(edit['text_amended'][j1:j2])   # changed span in the amended text

    print(f'"{old}" is replaced by "{new}"')

    If an edit is an insertion, then `i1 == i2`. If it is a deletion, then `j1 == j2`. Read the documentation of difflib to learn more about how these indices are obtained.
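
    These indices follow the conventions of Python's difflib. As a minimal, hedged sketch of how such (i1, i2, j1, j2) spans arise (the token lists below are invented for illustration and do not come from the dataset):

    import difflib

    # Invented token lists, purely for illustration.
    original = ['The', 'Commission', 'shall', 'adopt', 'the', 'measure', '.']
    amended = ['The', 'Commission', 'may', 'adopt', 'the', 'measure', '.']

    sm = difflib.SequenceMatcher(a=original, b=amended)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != 'equal':
            # For an insertion i1 == i2; for a deletion j1 == j2.
            print(tag, original[i1:i2], '->', amended[j1:j2])
    # Prints: replace ['shall'] -> ['may']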

    You can assume that (a sanity-check sketch follows this list):

    • Each data point has at least one edit.
    • If there is only one edit, then it is in conflict with the status quo (see Section 2 of the paper).
    • If there are two or more edits in conflict, then they are all in conflict against each other and they are in conflict with the status quo (see Section 2 of the paper).
    • At most one edit is accepted in each data point.
    • In each legislature, each edit has a unique identifier.
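
    Under these assumptions, the invariants can be checked directly on the loaded data. A minimal sketch, reusing the `dataset` variable from the loading snippet above:

    # Check the invariants listed above; `dataset` as loaded earlier.
    for conflict in dataset:
        assert len(conflict) >= 1, 'each data point has at least one edit'
        assert sum(e['accepted'] for e in conflict) <= 1, 'at most one accepted edit'

    # Edit identifiers are unique within a legislature.
    edit_ids = [e['edit_id'] for conflict in dataset for e in conflict]
    assert len(edit_ids) == len(set(edit_ids)), 'edit_id is unique per legislature'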

    You can find the original documents where the amendments were proposed using the `source` key, which has the format `COMM-AM(YEAR)PENUMBER`. Use the following search tools for EP7 and EP8 (the PE number field should be enough, adding a "." to fit the required format).

    The parliamentarians (MEPs, Members of the European Parliament) have a unique identifier that you can use to get more details about them on the Parliament website: go to https://www.europarl.europa.eu/meps/en/MEP_ID, where MEP_ID is the id of the MEP of interest.
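
    For example, a hypothetical one-line helper that builds this URL (the pattern is exactly the one given above; the function name is ours):

    def mep_url(mep_id: int) -> str:
        # Profile-page URL pattern described above.
        return f'https://www.europarl.europa.eu/meps/en/{mep_id}'

    print(mep_url(88882))  # Victor NEGRESCU, the author in the example edit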

    Don't hesitate to reach out to me if you have any questions!

    To cite this work:

    @inproceedings{kristof2021war,
     author = {Kristof, Victor and Suresh, Aswin and Grossglauser, Matthias and Thiran, Patrick},
     title = {War of Words II: Enriched Models of Law-Making Processes},
     year = {2021},
     booktitle = {Proceedings of The Web Conference 2021},
     TODO: pages = {2803–2809},
     numpages = {12},
     location = {Ljubljana, Slovenia},
     series = {WWW '21}
    }

  20. Logistic regression

    • plos.figshare.com
    xls
    Updated May 7, 2025
    + more versions
    Cite
    Bojian Wang; Yanwei Du; Pengyu Cao; Min Liu; Jinting Yang; Ningning Zhang; Wangshu Shao; Lijing Zhao; Rongyu Li; Lin Wang (2025). Logistic regression. [Dataset]. http://doi.org/10.1371/journal.pone.0318445.t003
    Explore at:
    xls (available download formats)
    Dataset updated
    May 7, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Bojian Wang; Yanwei Du; Pengyu Cao; Min Liu; Jinting Yang; Ningning Zhang; Wangshu Shao; Lijing Zhao; Rongyu Li; Lin Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: The aim of this study is to examine the critical variables that impact the long-term prognosis of patients with acute coronary syndrome (ACS) after percutaneous coronary intervention (PCI), and to create a multidimensional predictive risk assessment model that can serve as a theoretical basis for accurate cardiac rehabilitation.

    Methods: The study involved ACS patients who received PCI at the First Hospital of Jilin University from June 2020 to March 2021. Participants were categorized into two groups, acute myocardial infarction (AMI) and unstable angina (UA), according to clinical data and angiographic findings. Hospitalization data, physical performance, exercise tolerance prior to discharge, average daily steps, and major adverse cardiac events (MACE) were documented over a follow-up period of 36 months. The dates for accessing data for research purposes are February 10, 2022 to December 10, 2023.

    Results: We observed substantial increases in weight, fasting plasma glucose (FPG), total cholesterol, high-density lipoprotein cholesterol (HDL-C), low-density lipoprotein cholesterol (LDL-C), white blood cell (WBC) count, neutrophil granulocyte count, monocyte count, hemoglobin (Hb) levels, aspartate aminotransferase (AST), and alanine aminotransferase (ALT) levels in the AMI cohort relative to the UA cohort. WBC count (OR: 4.110) and the effective average number of daily steps (ANS) (OR: 2.689) were independent prognostic risk factors for AMI. The independent risk factors for UA prognosis were WBC count (OR: 6.257), VO2 at anaerobic threshold (OR: 4.294), and effective autonomic nervous system function (OR: 4.097). The overall prognostic risk assessment score for AMI is 5 points, with 0 points signifying low risk, 2-3 points intermediate risk, and 5 points high risk. The overall prognostic risk assessment score for UA is 7 points, with 0-3 classified as low risk, 4-5 as intermediate risk, and 6-7 as high risk.

    Conclusion: This study developed a multimodal predictive model that integrates the inflammatory response after onset, physical performance and exercise tolerance before discharge, and daily activity after discharge to predict the long-term prognosis of patients with ACS. The multidimensional model is more effective than a single-factor model for assessing risk in ACS patients. This work also establishes a theoretical basis for improving the prognosis of potentially high-risk individuals through accurate and reasonable exercise prescriptions.
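
    As a reading aid, the UA banding above can be written as a small lookup. This is a hedged sketch of the score-to-band mapping only (the per-factor point weights are not reproduced here, and the function name is ours):

    def ua_risk_band(score: int) -> str:
        # Bands from the description: 0-3 low, 4-5 intermediate, 6-7 high.
        if not 0 <= score <= 7:
            raise ValueError('UA prognostic scores range from 0 to 7')
        if score <= 3:
            return 'low'
        if score <= 5:
            return 'intermediate'
        return 'high'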
