100+ datasets found

l
LScD (Leicester Scientific Dictionary)
figshare.le.ac.uk
docx
Updated Apr 15, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Neslihan Suzen (2020). LScD (Leicester Scientific Dictionary) [Dataset]. http://doi.org/10.25392/leicester.data.9746900.v3
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.25392/leicester.data.9746900.v3
Dataset updated
Apr 15, 2020
Dataset provided by
University of Leicester
Authors
Neslihan Suzen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Leicester
Description
LScD (Leicester Scientific Dictionary)April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk/suzenneslihan@hotmail.com)Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes[Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus) - Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in description of Version 2 below. We did not repeat the explanation. After pre-processing steps, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are also same as described as for LScD Version 2 below.* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2[Version 2] Getting StartedThis document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). This dictionary is created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from LSC and instructions for usage of the code are available in [2]. The code can be also used for list of texts from other sources, amendments to the code may be required.LSC is a collection of abstracts of articles and proceeding papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018. The total number of documents in LSC is 1,673,824.LScD is an ordered list of words from texts of abstracts in LSC.The dictionary stores 974,238 unique words, is sorted by the number of documents containing the word in descending order. All words in the LScD are in stemmed form of words. The LScD contains the following information:1.Unique words in abstracts2.Number of documents containing each word3.Number of appearance of a word in the entire corpusProcessing the LSCStep 1.Downloading the LSC Online: Use of the LSC is subject to acceptance of request of the link by email. To access the LSC for research purposes, please email to ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.Step 2.Importing the Corpus to R: The full R code for processing the corpus can be found in the GitHub [2].All following steps can be applied for arbitrary list of texts from any source with changes of parameter. The structure of the corpus such as file format and names (also the position) of fields should be taken into account to apply our code. The organisation of CSV files of LSC is described in README file for LSC [1].Step 3.Extracting Abstracts and Saving Metadata: Metadata that include all fields in a document excluding abstracts and the field of abstracts are separated. Metadata are then saved as MetaData.R. Fields of metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.Step 4.Text Pre-processing Steps on the Collection of Abstracts: In this section, we presented our approaches to pre-process abstracts of the LSC.1.Removing punctuations and special characters: This is the process of substitution of all non-alphanumeric characters by space. We did not substitute the character “-” in this step, because we need to keep words like “z-score”, “non-payment” and “pre-processing” in order not to lose the actual meaning of such words. A processing of uniting prefixes with words are performed in later steps of pre-processing.2.Lowercasing the text data: Lowercasing is performed to avoid considering same words like “Corpus”, “corpus” and “CORPUS” differently. Entire collection of texts are converted to lowercase.3.Uniting prefixes of words: Words containing prefixes joined with character “-” are united as a word. The list of prefixes united for this research are listed in the file “list_of_prefixes.csv”. The most of prefixes are extracted from [4]. We also added commonly used prefixes: ‘e’, ‘extra’, ‘per’, ‘self’ and ‘ultra’.4.Substitution of words: Some of words joined with “-” in the abstracts of the LSC require an additional process of substitution to avoid losing the meaning of the word before removing the character “-”. Some examples of such words are “z-test”, “well-known” and “chi-square”. These words have been substituted to “ztest”, “wellknown” and “chisquare”. Identification of such words is done by sampling of abstracts form LSC. The full list of such words and decision taken for substitution are presented in the file “list_of_substitution.csv”.5.Removing the character “-”: All remaining character “-” are replaced by space.6.Removing numbers: All digits which are not included in a word are replaced by space. All words that contain digits and letters are kept because alphanumeric characters such as chemical formula might be important for our analysis. Some examples are “co2”, “h2o” and “21st”.7.Stemming: Stemming is the process of converting inflected words into their word stem. This step results in uniting several forms of words with similar meaning into one form and also saving memory space and time [5]. All words in the LScD are stemmed to their word stem.8.Stop words removal: Stop words are words that are extreme common but provide little value in a language. Some common stop words in English are ‘I’, ‘the’, ‘a’ etc. We used ‘tm’ package in R to remove stop words [6]. There are 174 English stop words listed in the package.Step 5.Writing the LScD into CSV Format: There are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written in the file “LScD.csv”.The Organisation of the LScDThe total number of words in the file “LScD.csv” is 974,238. Each field is described below:Word: It contains unique words from the corpus. All words are in lowercase and their stem forms. The field is sorted by the number of documents that contain words in descending order.Number of Documents Containing the Word: In this content, binary calculation is used: if a word exists in an abstract then there is a count of 1. If the word exits more than once in a document, the count is still 1. Total number of document containing the word is counted as the sum of 1s in the entire corpus.Number of Appearance in Corpus: It contains how many times a word occurs in the corpus when the corpus is considered as one large document.Instructions for R CodeLScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as RData file and in CSV format. Outputs of the code are:Metadata File: It includes all fields in a document excluding abstracts. Fields are List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.File of Abstracts: It contains all abstracts after pre-processing steps defined in the step 4.DTM: It is the Document Term Matrix constructed from the LSC[6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.LScD: An ordered list of words from LSC as defined in the previous section.The code can be used by:1.Download the folder ‘LSC’, ‘list_of_prefixes.csv’ and ‘list_of_substitution.csv’2.Open LScD_Creation.R script3.Change parameters in the script: replace with the full path of the directory with source files and the full path of the directory to write output files4.Run the full code.References[1]N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1[2]N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION[3]Web of Science. (15 July). Available: https://apps.webofknowledge.com/[4]A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.[5]C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved porter’s stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.[6]I. Feinerer, "Introduction to the tm Package Text Mining in R," Accessible en ligne: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.
H
Replication Data for: A Definition-By-Example Approach and Visual Language...
dataverse.harvard.edu
Updated Dec 16, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
anonymous anonymous (2019). Replication Data for: A Definition-By-Example Approach and Visual Language for Activity Patterns [Dataset]. http://doi.org/10.7910/DVN/FTPF3Z
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/FTPF3Z
Dataset updated
Dec 16, 2019
Dataset provided by
Harvard Dataverse
Authors
anonymous anonymous
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
public_supplementary_material.pdf includes the questionnaire, the tutorial, the instructions and tasks shown during the experiment and the visual and textual activity definitions for the tasks used for the experiment reported in our paper. data.xls includes all our raw data.
q
Biobyte 1 - Where are we in the data science landscape?
qubeshub.org
Updated Aug 6, 2019
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sam Donovan (2019). Biobyte 1 - Where are we in the data science landscape? [Dataset]. http://doi.org/10.25334/03VE-VK77
Explore at:
Unique identifier
https://doi.org/10.25334/03VE-VK77
Dataset updated
Aug 6, 2019
Dataset provided by
QUBES
Authors
Sam Donovan
Description
This short activity can be used to introduce the NAS Data Science For Undergraduates report's definition of data acumen and engage participants in a self assessment of how they connect with those 10 data science concepts.
Z
Dataset: A Systematic Literature Review on the topic of High-value datasets
data.niaid.nih.gov
Updated Jun 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nina Rizun (2023). Dataset: A Systematic Literature Review on the topic of High-value datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7944424
Explore at:
Dataset updated
Jun 23, 2023
Dataset provided by
Andrea Miletič
Charalampos Alexopoulos
Anastasija Nikiforova
Magdalena Ciesielska
Nina Rizun
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains data collected during a study ("Towards High-Value Datasets determination for data-driven development: a systematic literature review") conducted by Anastasija Nikiforova (University of Tartu), Nina Rizun, Magdalena Ciesielska (Gdańsk University of Technology), Charalampos Alexopoulos (University of the Aegean) and Andrea Miletič (University of Zagreb) It being made public both to act as supplementary data for "Towards High-Value Datasets determination for data-driven development: a systematic literature review" paper (pre-print is available in Open Access here -> https://arxiv.org/abs/2305.10234) and in order for other researchers to use these data in their own work.

The protocol is intended for the Systematic Literature review on the topic of High-value Datasets with the aim to gather information on how the topic of High-value datasets (HVD) and their determination has been reflected in the literature over the years and what has been found by these studies to date, incl. the indicators used in them, involved stakeholders, data-related aspects, and frameworks. The data in this dataset were collected in the result of the SLR over Scopus, Web of Science, and Digital Government Research library (DGRL) in 2023.

Methodology

To understand how HVD determination has been reflected in the literature over the years and what has been found by these studies to date, all relevant literature covering this topic has been studied. To this end, the SLR was carried out to by searching digital libraries covered by Scopus, Web of Science (WoS), Digital Government Research library (DGRL).

These databases were queried for keywords ("open data" OR "open government data") AND ("high-value data*" OR "high value data*"), which were applied to the article title, keywords, and abstract to limit the number of papers to those, where these objects were primary research objects rather than mentioned in the body, e.g., as a future work. After deduplication, 11 articles were found unique and were further checked for relevance. As a result, a total of 9 articles were further examined. Each study was independently examined by at least two authors.

To attain the objective of our study, we developed the protocol, where the information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design- related information, (3) quality-related information, (4) HVD determination-related information.

Test procedure Each study was independently examined by at least two authors, where after the in-depth examination of the full-text of the article, the structured protocol has been filled for each study. The structure of the survey is available in the supplementary file available (see Protocol_HVD_SLR.odt, Protocol_HVD_SLR.docx) The data collected for each study by two researchers were then synthesized in one final version by the third researcher.

Description of the data in this data set

Protocol_HVD_SLR provides the structure of the protocol Spreadsheets #1 provides the filled protocol for relevant studies. Spreadsheet#2 provides the list of results after the search over three indexing databases, i.e. before filtering out irrelevant studies

The information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design- related information, (3) quality-related information, (4) HVD determination-related information

Descriptive information
1) Article number - a study number, corresponding to the study number assigned in an Excel worksheet 2) Complete reference - the complete source information to refer to the study 3) Year of publication - the year in which the study was published 4) Journal article / conference paper / book chapter - the type of the paper -{journal article, conference paper, book chapter} 5) DOI / Website- a link to the website where the study can be found 6) Number of citations - the number of citations of the article in Google Scholar, Scopus, Web of Science 7) Availability in OA - availability of an article in the Open Access 8) Keywords - keywords of the paper as indicated by the authors 9) Relevance for this study - what is the relevance level of the article for this study? {high / medium / low}

Approach- and research design-related information 10) Objective / RQ - the research objective / aim, established research questions 11) Research method (including unit of analysis) - the methods used to collect data, including the unit of analy-sis (country, organisation, specific unit that has been ana-lysed, e.g., the number of use-cases, scope of the SLR etc.) 12) Contributions - the contributions of the study 13) Method - whether the study uses a qualitative, quantitative, or mixed methods approach? 14) Availability of the underlying research data- whether there is a reference to the publicly available underly-ing research data e.g., transcriptions of interviews, collected data, or explanation why these data are not shared? 15) Period under investigation - period (or moment) in which the study was conducted 16) Use of theory / theoretical concepts / approaches - does the study mention any theory / theoretical concepts / approaches? If any theory is mentioned, how is theory used in the study?

Quality- and relevance- related information
17) Quality concerns - whether there are any quality concerns (e.g., limited infor-mation about the research methods used)? 18) Primary research object - is the HVD a primary research object in the study? (primary - the paper is focused around the HVD determination, sec-ondary - mentioned but not studied (e.g., as part of discus-sion, future work etc.))

HVD determination-related information
19) HVD definition and type of value - how is the HVD defined in the article and / or any other equivalent term? 20) HVD indicators - what are the indicators to identify HVD? How were they identified? (components & relationships, “input -> output") 21) A framework for HVD determination - is there a framework presented for HVD identification? What components does it consist of and what are the rela-tionships between these components? (detailed description) 22) Stakeholders and their roles - what stakeholders or actors does HVD determination in-volve? What are their roles? 23) Data - what data do HVD cover? 24) Level (if relevant) - what is the level of the HVD determination covered in the article? (e.g., city, regional, national, international)

Format of the file .xls, .csv (for the first spreadsheet only), .odt, .docx

Licenses or restrictions CC-BY

For more info, see README.txt
d
Replication Data for: How faculty define quality, prestige, and impact of...
search.dataone.org
Updated Nov 13, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Morales, Esteban; Alperin, Juan Pablo (2023). Replication Data for: How faculty define quality, prestige, and impact of academic journals [Dataset]. http://doi.org/10.7910/DVN/2FNDXL
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/2FNDXL
Dataset updated
Nov 13, 2023
Dataset provided by
Harvard Dataverse
Authors
Morales, Esteban; Alperin, Juan Pablo
Description
Anonymized and coded survey respondents for open-ended questions asking for the definitions of three terms related to academic journals: High Quality, Prestigious, and High Impact. Each column represents a code, as described in Morales et al. (2021). A value of 1 indicates that this respondent's answer was deemed to include a reference to the concept described by the code, and a 0 indicates that the concept was not present in the response.
d
INTEGRAL Science Window Data
catalog.data.gov
s.cnmilf.com
+2more
Updated Jun 2, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
High Energy Astrophysics Science Archive Research Center (2025). INTEGRAL Science Window Data [Dataset]. https://catalog.data.gov/dataset/integral-science-window-data
Explore at:
Dataset updated
Jun 2, 2025
Dataset provided by
High Energy Astrophysics Science Archive Research Center
Description
Because of the pointing-slew-pointing dithering-nature of INTEGRAL operations, each observation of a celestial target is actually comprised of numerous individual S/C pointings and slews. In addition, there are periods within a given sequence where scheduled observations occur, i.e., engineering windows, yet the instruments still acquire data. The INTEGRAL Science Data Center (ISDC) generalizes all of these data acquisition periods into so-called `Science Windows.' A Science Window (ScW) is a continuous time interval during which all data acquired by the INTEGRAL instruments result from a specific S/C attitude orientation state. Pointing (fixed orientation), Slew (changing orientation), and Engineering (undefined orientation) windows are all special cases of a Science Window. The key is that the same attitude information may be associated with all acquired data of a given Science Window. Note that it is possible to divide a time interval that qualifies as a Science Window under this definition into several smaller Science Windows using arbitrary criteria. The INTEGRAL Science Window Data Catalog allows for the keyed search and selection of sets of Science Windows and the retrieval of the corresponding data products. This database table was first created at the HEASARC in October 2004. It is a slightly modified mirror of the online database maintained by the ISDC at the URL http://isdc.unige.ch/index.cgi?Data+browse
The HEASARC version of this table is updated automatically within a day of the ISDC updating their database table. This is a service provided by NASA HEASARC .
w
Data from: Forensic science : an illustrated dictionary
workwithdata.com
Updated May 18, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Work With Data (2023). Forensic science : an illustrated dictionary [Dataset]. https://www.workwithdata.com/book/Forensic%20science%20:%20an%20illustrated%20dictionary_864990
Explore at:
Dataset updated
May 18, 2023
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Explore Forensic science : an illustrated dictionary through data • Key facts: author, publication date, book publisher, book series, book subjects • Real-time news, visualizations and datasets
d
Pond data: physical, chemical, and biological characteristics with...
search.dataone.org
portal.edirepository.org
Updated Apr 4, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
David C. Richardson; Meredith A. Holgerson; Matthew J. Farragher; Kathryn K. Hoffman; Katelyn B.S. King; Maria B. Alfonso; Mikkel R. Andersen; Kendra Spence Cheruveil; Kristen A. Coleman; Mary Jade Farruggia; Rocio L. Fernandez; Kelly L. Hondula; Gregorio A. Lopez Moreira M.; Katherine E. Paul; Benjamin L. Peierls; Joseph S. Rabaey; Steven Sadro; Maria Laura Sanchez; Robyn L. Smyth; Jon N. Sweetman (2022). Pond data: physical, chemical, and biological characteristics with scientific and United States of America state definitions from literature and legislative surveys [Dataset]. https://search.dataone.org/view/https%3A%2F%2Fpasta.lternet.edu%2Fpackage%2Fmetadata%2Feml%2Fedi%2F1014%2F1
Explore at:
Dataset updated
Apr 4, 2022
Dataset provided by
Environmental Data Initiative
Authors
David C. Richardson; Meredith A. Holgerson; Matthew J. Farragher; Kathryn K. Hoffman; Katelyn B.S. King; Maria B. Alfonso; Mikkel R. Andersen; Kendra Spence Cheruveil; Kristen A. Coleman; Mary Jade Farruggia; Rocio L. Fernandez; Kelly L. Hondula; Gregorio A. Lopez Moreira M.; Katherine E. Paul; Benjamin L. Peierls; Joseph S. Rabaey; Steven Sadro; Maria Laura Sanchez; Robyn L. Smyth; Jon N. Sweetman
Time period covered
Jan 1, 1946 - Apr 30, 2019
Area covered
Variables measured
ph, Link, Year, year, title, Author, author, Journal, journal, landuse, and 33 more
Description
Ponds are often identified by their small size and shallow depths, but the lack of a universal definition hampers science and weakens legal protection. In order to determine a working definition of ‘pond’, we conducted a literature search for scientific definitions, a U.S. state survey for management definitions, and looked at pond ecosystem function using data from the literature search. Our dataset includes physical, chemical, and biological data for 1327 waterbodies ≤ 20 ha in surface area and ≤ 9 m in maximum or mean depth from our literature review. These data have a global distribution, we include a table of latitudes and longitudes, and span many years (1946-2019). We have also included a table of 54 pond definitions from the literature review and a table of U.S. state definitions of ponds, wetlands, and lakes resulting from our survey.
d
Dictionary of Microsatellite Loci from the U.S. Geological Survey, Alaska...
catalog.data.gov
data.usgs.gov
Updated Nov 10, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2024). Dictionary of Microsatellite Loci from the U.S. Geological Survey, Alaska Science Center, Molecular Ecology Laboratory [Dataset]. https://catalog.data.gov/dataset/dictionary-of-microsatellite-loci-from-the-u-s-geological-survey-alaska-science-center-mol
Explore at:
Dataset updated
Nov 10, 2024
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Description
This data package provides a dictionary of microsatellite locus primers used by the USGS Alaska Science Center, Molecular Ecology Laboratory (MEL). It is a look-up file of microsatellite locus names and citations to the original publication or source where additional information about the locus primers may be found.
WiDS data dictionary v2.xlsx
kaggle.com
Updated Feb 13, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
VivekSingh (2018). WiDS data dictionary v2.xlsx [Dataset]. https://www.kaggle.com/datasets/viveksinghub/wids-data-dictionary-v2xlsx
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 13, 2018
Dataset provided by
Kagglehttp://kaggle.com/
Authors
VivekSingh
Description
Dataset

This dataset was created by VivekSingh

Released under Data files © Original Authors

Contents
t
Data from: Trusted Research Environments: Analysis of Characteristics and...
researchdata.tuwien.ac.at
bin, csv
Updated Jun 25, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Martin Weise; Martin Weise; Andreas Rauber; Andreas Rauber (2024). Trusted Research Environments: Analysis of Characteristics and Data Availability [Dataset]. http://doi.org/10.48436/cv20m-sg117
Explore at:
bin, csvAvailable download formats
Unique identifier
https://doi.org/10.48436/cv20m-sg117
Dataset updated
Jun 25, 2024
Dataset provided by
TU Wien
Authors
Martin Weise; Martin Weise; Andreas Rauber; Andreas Rauber
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Trusted Research Environments (TREs) enable analysis of sensitive data under strict security assertions that protect the data with technical organizational and legal measures from (accidentally) being leaked outside the facility. While many TREs exist in Europe, little information is available publicly on the architecture and descriptions of their building blocks & their slight technical variations. To shine light on these problems, we give an overview of existing, publicly described TREs and a bibliography linking to the system description. We further analyze their technical characteristics, especially in their commonalities & variations and provide insight on their data type characteristics and availability. Our literature study shows that 47 TREs worldwide provide access to sensitive data of which two-thirds provide data themselves, predominantly via secure remote access. Statistical offices make available a majority of available sensitive data records included in this study.
Methodology
We performed a literature study covering 47 TREs worldwide using scholarly databases (Scopus, Web of Science, IEEE Xplore, Science Direct), a computer science library (dblp.org), Google and grey literature focusing on retrieving the following source material:
Peer-reviewed articles where available,
TRE websites,
TRE metadata catalogs.
The goal for this literature study is to discover existing TREs, analyze their characteristics and data availability to give an overview on available infrastructure for sensitive data research as many European initiatives have been emerging in recent months.
Technical details
This dataset consists of five comma-separated values (.csv) files describing our inventory:
countries.csv: Table of countries with columns id (number), name (text) and code (text, in ISO 3166-A3 encoding, optional)
tres.csv: Table of TREs with columns id (number), name (text), countryid (number, refering to column id of table countries), structureddata (bool, optional), datalevel (one of [1=de-identified, 2=pseudonomized, 3=anonymized], optional), outputcontrol (bool, optional), inceptionyear (date, optional), records (number, optional), datatype (one of [1=claims, 2=linked records]), optional), statistics_office (bool), size (number, optional), source (text, optional), comment (text, optional)
access.csv: Table of access modes of TREs with columns id (number), suf (bool, optional), physical_visit (bool, optional), external_physical_visit (bool, optional), remote_visit (bool, optional)
inclusion.csv: Table of included TREs into the literature study with columns id (number), included (bool), exclusion reason (one of [peer review, environment, duplicate], optional), comment (text, optional)
major_fields.csv: Table of data categorization into the major research fields with columns id (number), life_sciences (bool, optional), physical_sciences (bool, optional), arts_and_humanities (bool, optional), social_sciences (bool, optional).
Additionally, a MariaDB (10.5 or higher) schema definition .sql file is needed, properly modelling the schema for databases:
schema.sql: Schema definition file to create the tables and views used in the analysis.
The analysis was done through Jupyter Notebook which can be found in our source code repository: https://gitlab.tuwien.ac.at/martin.weise/tres/-/blob/master/analysis.ipynb
w
Data from: Purnell's concise dictionary of science
workwithdata.com
Updated Jul 1, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Work With Data (2024). Purnell's concise dictionary of science [Dataset]. https://www.workwithdata.com/object/purnell-s-concise-dictionary-of-science-book-by-robin-kerrod-1940
Explore at:
Dataset updated
Jul 1, 2024
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Explore Purnell's concise dictionary of science through data • Key facts: author, publication date, book publisher, book series, book subjects • Real-time news, visualizations and datasets
d
Training Webinars on China Research Data: Sources, Tools and Applications
search.dataone.org
Updated Mar 6, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Spatial Data Lab (2024). Training Webinars on China Research Data: Sources, Tools and Applications [Dataset]. http://doi.org/10.7910/DVN/LN6OHH
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/LN6OHH
Dataset updated
Mar 6, 2024
Dataset provided by
Harvard Dataverse
Authors
Spatial Data Lab
Description
This webinar series introduce some research data with a focus on China and discuss the difference from the US data. Each webinar will cover the following topics: (1) data sources, data collection, data category, definition, description, and interpretation; (2) alternative data and derivable data from other data sources, especially some big data sources; (3) comparison of data difference between the US and China; (4) available tools for efficient data analysis; (5) discussions on pros and cons; and (6) data applications in research and teaching.
Data Science for Environmental Justice PBL Module: Air Pollution Data
figshare.com
pdf
Updated May 23, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
RN Uma; Marja H. Bakermans; Elisabeth Stoddard; Rakesh Malhotra; Alade Tokuta; Adrienne Smith; Rebecca Zulli Lowe (2025). Data Science for Environmental Justice PBL Module: Air Pollution Data [Dataset]. http://doi.org/10.6084/m9.figshare.24902889.v6
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.24902889.v6
Dataset updated
May 23, 2025
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
RN Uma; Marja H. Bakermans; Elisabeth Stoddard; Rakesh Malhotra; Alade Tokuta; Adrienne Smith; Rebecca Zulli Lowe
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is a PBL module on Air Pollution to be used in an introductory environmental science course to motivate students to analyze related environmental justice issues.Original data was from the US EPA data on "State EJScreen Data at the Block Group Level" (EJSCREEN_2023_BG_StatePct_with_AS_CNMI_GU_VI.csv) which was downloaded from https://www.epa.gov/ejscreen/download-ejscreen-data on December 20, 2023. (Note: Access to the EJSCREEN tool was removed during February 2005).This data was processed and cleaned as described in the data provenance document.Lecture Slides, Activity Sheets and Instructor Notes are available here.The following files are included:Data Provenance and Data Dictionary: Data Provenance and Data Dictionary.pdfR Script for Data Processing: EJSCREEN_Data_Curation_NC_Summarized_by_County.RProcessed Dataset for North Carolina: EJScreen_State_BGLevel_NC_13Columns.csvCurated Data used in the Module - Summarized Dataset for North Carolina (summarized by county): EJScreen_State_BGLevel_NC_Summarized_By_County_13Columns.csvData Dictionary: Data_Dictionary_EJSCREEN_2023_BG_Columns.pdfOriginal Dataset from EPA/EJSCREEN from which Data was Extracted for North Carolina: DS4EJ_EJSCREEN_2023_BG_StatePct_with_AS_CNMI_GU_VI.csv
Z
Data from: Diversity in citations to a single study: Supplementary data set...
data.niaid.nih.gov
zenodo.org
Updated Aug 25, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Leng, Rhodri Ivor (2021). Diversity in citations to a single study: Supplementary data set for citation context network analysis [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5244799
Explore at:
Dataset updated
Aug 25, 2021
Dataset authored and provided by
Leng, Rhodri Ivor
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Introduction

This document describes the data set used for all analyses in 'Diversity in citations to a single study: A citation context network analysis of how evidence from a prospective cohort study was cited' accepted for publication in Quantitative Science Studies [1].

Data Collection

The data collection procedure has been fully described [1]. Concisely, the data set contains bibliometric data collected from Web of Science Core Collection via the University of Edinburgh’s Library subscription concerning all papers that cited a cohort study, Paul et al. [2], in the period <1985. This includes a full list of citing papers, and the citations between these papers. Additionally, it includes textual passages (citation contexts) from 343 citing papers, which were manually recovered from the full-text documents accessible via the University of Edinburgh’s Library subscription. These data have been cleaned, converted into network readable datasets, and are coded into particular classifications reflecting content, which are described fully in the supplied code book and within the manuscript [1].

Data description

All relevant data can be found in the attached file 'Supplementary_material_Leng_QSS_2021.xlsx', which contains the following five workbooks:

“Overview” includes a list of the content of the workbooks.

“Code Book” contains the coding rules and definitions used for the classification of findings and paper titles.

“Node attribute list” includes a workbook containing all node attributes for the citation network, which includes Paul et al. [2] and its citing papers as of 1984. Highlighted in yellow at the bottom of this workbook is two papers that were discarded due to duplication - remove these if analysing this dataset in a network analysis. The columns refer to:

Id, the node identifier

Label, the formal citation of the paper to which data within this row corresponds. Citation is in the following format: last name of first author, year of publication, journal of publication, volume number, start page, and DOI (if available).

Title, the paper title for the paper in question.

Publication_year, the year of publication.

Document_type, the document type (e.g. review, article)

WoS_ID, the paper’s unique Web of Science accession number.

Citation_context, a column specifying whether citation context data is available from that paper

Explanans, the title explanans terms for that paper;

Explanandum, the explanandum terms for that paper.

Combined_Title_Classification, the combined terms used for fig 2 of the published manuscript.

Serum_cholesterol_(SC), a column identifying papers that cited the serum cholesterol findings.

Blood_Pressure_(BP), a column identifying papers that cited the blood pressure findings.

Coffee_(C), a column identifying papers that cited the coffee findings.

Diet_(D), a column identifying papers that cited the dietary findings.

Smoking_(S), a column identifying papers that cited the smoking findings.

Alcohol_(A), a column identifying papers that cited the alcohol findings.

Physical_Activity_(PA), a column identifying papers that cited the physical activity findings.

Body_Fatness (BF), a column identifying papers that cited the body fatness findings.

Indegree, the number of within network citations to that paper, calculated for the network shown in Fig 4 of the manuscript.

Outdegree, the number of within network references of that paper as calculated for the network in Fig 4.

Main_component, a column specifying whether a node is contained in the largest weakly connect component as shown in Fig 4 of the manuscript.

Cluster, provides the cluster membership number as discussed within the manuscript (Fig 5).

“Edge list” includes a workbook including the edges for the network. The columns refer to:

Source, contains the node identifier of the citing paper.

Target, contains the node identifier of the cited paper.

“Citation context classification” includes a workbook containing the WoS accession number for the paper analysed, and any finding category discussed in that paper established via context analysis (see the code book for definitions). The columns refer to:

Id, the node identifier

Finding_Class, the findings discussed from Paul et al. within the body of the citing paper.

“Citation context data” includes a workbook containing the WoS accession number for papers in which citation context data was available, the citation context passages, the reference number or format of Paul et al. within the citing paper, and the finding categories discussed in those contexts (see code book for definitions). The columns refer to:

Id, the node identifier

Citation_context, the passage copied from the full text of the citing paper containing discussion of the findings of Paul et al.

Reference_in_citing_article, the reference number or format of Paul et al. within the citing paper.

Finding_class, the findings discussed from Paul et al. within the body of the citing paper.

Software recommended for analysis

For the analyses performed within the manuscript, Gephi version 0.9.2 was used [3], and both the edge and node lists are in a format that is easily read into this software. The Sci2 tool was used to parse data initially [4].

Notes

Leng, R. I. (Forthcoming). Diversity in citations to a single study: A citation context network analysis of how evidence from a prospective cohort study was cited. Quantitative Science Studies.

Paul, O., Lepper, M. H., Phelan, W. H., Dupertuis, G. W., Macmillan, A., McKean, H., et al. (1963). A longitudinal study of coronary heart disease. Circulation, 28, 20-31. https://doi.org/10.1161/01.cir.28.1.20.

Bastian, M., Heymann, S., & Jacomy, M. (2009). Gephi: an open source software for exploring and manipulating networks. International AAAI Conference on Weblogs and Social Media.

Sci2 Team. (2009). Science of Science (Sci2) Tool. Indiana University and SciTech Strategies. Stable URL: https://sci2.cns.iu.edu
Z
Survey: Open Science in Higher Education
data.niaid.nih.gov
zenodo.org
Updated Aug 3, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Blümel, Ina (2024). Survey: Open Science in Higher Education [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_400518
Explore at:
Dataset updated
Aug 3, 2024
Dataset provided by
Mazarakis, Athanasios
Blümel, Ina
Scherp, Ansgar
Heck, Tamara
Peters, Isabella
Weisel, Luzian
Heller, Lambert
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Open Science in (Higher) Education – data of the February 2017 survey

This data set contains:

Full raw (anonymised) data set (completed responses) of Open Science in (Higher) Education February 2017 survey. Data are in xlsx and sav format.

Survey questionnaires with variables and settings (German original and English translation) in pdf. The English questionnaire was not used in the February 2017 survey, but only serves as translation.

Readme file (txt)

Survey structure

The survey includes 24 questions and its structure can be separated in five major themes: material used in courses (5), OER awareness, usage and development (6), collaborative tools used in courses (2), assessment and participation options (5), demographics (4). The last two questions include an open text questions about general issues on the topics and singular open education experiences, and a request on forwarding the respondent's e-mail address for further questionings. The online survey was created with Limesurvey[1]. Several questions include filters, i.e. these questions were only shown if a participants did choose a specific answer beforehand ([n/a] in Excel file, [.] In SPSS).

Demographic questions

Demographic questions asked about the current position, the discipline, birth year and gender. The classification of research disciplines was adapted to general disciplines at German higher education institutions. As we wanted to have a broad classification, we summarised several disciplines and came up with the following list, including the option "other" for respondents who do not feel confident with the proposed classification:

Natural Sciences

Arts and Humanities or Social Sciences

Economics

Law

Medicine

Computer Sciences, Engineering, Technics

Other

The current job position classification was also chosen according to common positions in Germany, including positions with a teaching responsibility at higher education institutions. Here, we also included the option "other" for respondents who do not feel confident with the proposed classification:

Professor

Special education teacher

Academic/scientific assistant or research fellow (research and teaching)

Academic staff (teaching)

Student assistant

Other

We chose to have a free text (numerical) for asking about a respondent's year of birth because we did not want to pre-classify respondents' age intervals. It leaves us options to have different analysis on answers and possible correlations to the respondents' age. Asking about the country was left out as the survey was designed for academics in Germany.

Remark on OER question

Data from earlier surveys revealed that academics suffer confusion about the proper definition of OER[2]. Some seem to understand OER as free resources, or only refer to open source software (Allen & Seaman, 2016, p. 11). Allen and Seaman (2016) decided to give a broad explanation of OER, avoiding details to not tempt the participant to claim "aware". Thus, there is a danger of having a bias when giving an explanation. We decided not to give an explanation, but keep this question simple. We assume that either someone knows about OER or not. If they had not heard of the term before, they do not probably use OER (at least not consciously) or create them.

Data collection

The target group of the survey was academics at German institutions of higher education, mainly universities and universities of applied sciences. To reach them we sent the survey to diverse institutional-intern and extern mailing lists and via personal contacts. Included lists were discipline-based lists, lists deriving from higher education and higher education didactic communities as well as lists from open science and OER communities. Additionally, personal e-mails were sent to presidents and contact persons from those communities, and Twitter was used to spread the survey.

The survey was online from Feb 6th to March 3rd 2017, e-mails were mainly sent at the beginning and around mid-term.

Data clearance

We got 360 responses, whereof Limesurvey counted 208 completes and 152 incompletes. Two responses were marked as incomplete, but after checking them turned out to be complete, and we added them to the complete responses dataset. Thus, this data set includes 210 complete responses. From those 150 incomplete responses, 58 respondents did not answer 1st question, 40 respondents discontinued after 1st question. Data shows a constant decline in response answers, we did not detect any striking survey question with a high dropout rate. We deleted incomplete responses and they are not in this data set.

Due to data privacy reasons, we deleted seven variables automatically assigned by Limesurvey: submitdate, lastpage, startlanguage, startdate, datestamp, ipaddr, refurl. We also deleted answers to question No 24 (email address).

References

Allen, E., & Seaman, J. (2016). Opening the Textbook: Educational Resources in U.S. Higher Education, 2015-16.

First results of the survey are presented in the poster:

Heck, Tamara, Blümel, Ina, Heller, Lambert, Mazarakis, Athanasios, Peters, Isabella, Scherp, Ansgar, & Weisel, Luzian. (2017). Survey: Open Science in Higher Education. Zenodo. http://doi.org/10.5281/zenodo.400561

Contact:

Open Science in (Higher) Education working group, see http://www.leibniz-science20.de/forschung/projekte/laufende-projekte/open-science-in-higher-education/.

[1] https://www.limesurvey.org

[2] The survey question about the awareness of OER gave a broad explanation, avoiding details to not tempt the participant to claim "aware".
d
Replication Data for: Analysis of Social Media to support definition and...
search.dataone.org
dataverse.harvard.edu
Updated Nov 13, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sotelo Docio, Susana; Benitez-Baleato, Jesus M (2023). Replication Data for: Analysis of Social Media to support definition and evaluation of tourism public policy. The case of the Way of Saint James. [Dataset]. http://doi.org/10.7910/DVN/CUFZKT
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/CUFZKT
Dataset updated
Nov 13, 2023
Dataset provided by
Harvard Dataverse
Authors
Sotelo Docio, Susana; Benitez-Baleato, Jesus M
Description
Replication Data for: Digital Tracks: Application of Artificial Intelligence Technologies for Automatic Detection of Perceptions from Social Media. The case of the Saint James Way, with a focus on COVID-19
f
iCite Database Snapshot 2023-03
nih.figshare.com
bin
Updated Jun 1, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
iCite; B. Ian Hutchins; George Santangelo (2023). iCite Database Snapshot 2023-03 [Dataset]. http://doi.org/10.35092/yhjc22589044.v1
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.35092/yhjc22589044.v1
Dataset updated
Jun 1, 2023
Dataset provided by
The NIH Figshare Archive
Authors
iCite; B. Ian Hutchins; George Santangelo
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This is a database snapshot of the iCite web service (provided here as a single zipped CSV file, or compressed, tarred JSON files). In addition, citation links in the NIH Open Citation Collection are provided as a two-column CSV table in open_citation_collection.zip. iCite provides bibliometrics and metadata on publications indexed in PubMed, organized into three modules:

Influence: Delivers metrics of scientific influence, field-adjusted and benchmarked to NIH publications as the baseline.

Translation: Measures how Human, Animal, or Molecular/Cellular Biology-oriented each paper is; tracks and predicts citation by clinical articles

Open Cites: Disseminates link-level, public-domain citation data from the NIH Open Citation Collection

Definitions for individual data fields:

pmid: PubMed Identifier, an article ID as assigned in PubMed by the National Library of Medicine

doi: Digital Object Identifier, if available

year: Year the article was published

title: Title of the article

authors: List of author names

journal: Journal name (ISO abbreviation)

is_research_article: Flag indicating whether the Publication Type tags for this article are consistent with that of a primary research article

relative_citation_ratio: Relative Citation Ratio (RCR)--OPA's metric of scientific influence. Field-adjusted, time-adjusted and benchmarked against NIH-funded papers. The median RCR for NIH funded papers in any field is 1.0. An RCR of 2.0 means a paper is receiving twice as many citations per year than the median NIH funded paper in its field and year, while an RCR of 0.5 means that it is receiving half as many citations per year. Calculation details are documented in Hutchins et al., PLoS Biol. 2016;14(9):e1002541.

provisional: RCRs for papers published in the previous two years are flagged as "provisional", to reflect that citation metrics for newer articles are not necessarily as stable as they are for older articles. Provisional RCRs are provided for papers published previous year, if they have received with 5 citations or more, despite being, in many cases, less than a year old. All papers published the year before the previous year receive provisional RCRs. The current year is considered to be the NIH Fiscal Year which starts in October. For example, in July 2019 (NIH Fiscal Year 2019), papers from 2018 receive provisional RCRs if they have 5 citations or more, and all papers from 2017 receive provisional RCRs. In October 2019, at the start of NIH Fiscal Year 2020, papers from 2019 receive provisional RCRs if they have 5 citations or more and all papers from 2018 receive provisional RCRs.

citation_count: Number of unique articles that have cited this one

citations_per_year: Citations per year that this article has received since its publication. If this appeared as a preprint and a published article, the year from the published version is used as the primary publication date. This is the numerator for the Relative Citation Ratio.

field_citation_rate: Measure of the intrinsic citation rate of this paper's field, estimated using its co-citation network.

expected_citations_per_year: Citations per year that NIH-funded articles, with the same Field Citation Rate and published in the same year as this paper, receive. This is the denominator for the Relative Citation Ratio.

nih_percentile: Percentile rank of this paper's RCR compared to all NIH publications. For example, 95% indicates that this paper's RCR is higher than 95% of all NIH funded publications.

human: Fraction of MeSH terms that are in the Human category (out of this article's MeSH terms that fall into the Human, Animal, or Molecular/Cellular Biology categories)

animal: Fraction of MeSH terms that are in the Animal category (out of this article's MeSH terms that fall into the Human, Animal, or Molecular/Cellular Biology categories)

molecular_cellular: Fraction of MeSH terms that are in the Molecular/Cellular Biology category (out of this article's MeSH terms that fall into the Human, Animal, or Molecular/Cellular Biology categories)

x_coord: X coordinate of the article on the Triangle of Biomedicine

y_coord: Y Coordinate of the article on the Triangle of Biomedicine

is_clinical: Flag indicating that this paper meets the definition of a clinical article.

cited_by_clin: PMIDs of clinical articles that this article has been cited by.

apt: Approximate Potential to Translate is a machine learning-based estimate of the likelihood that this publication will be cited in later clinical trials or guidelines. Calculation details are documented in Hutchins et al., PLoS Biol. 2019;17(10):e3000416.

cited_by: PMIDs of articles that have cited this one.

references: PMIDs of articles in this article's reference list.

Large CSV files are zipped using zip version 4.5, which is more recent than the default unzip command line utility in some common Linux distributions. These files can be unzipped with tools that support version 4.5 or later such as 7zip.

Comments and questions can be addressed to iCite@mail.nih.gov
f
iCite Database Snapshot 2023-06
nih.figshare.com
bin
Updated Jul 10, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
iCite; B. Ian Hutchins; George Santangelo (2023). iCite Database Snapshot 2023-06 [Dataset]. http://doi.org/10.35092/yhjc23643690.v1
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.35092/yhjc23643690.v1
Dataset updated
Jul 10, 2023
Dataset provided by
The NIH Figshare Archive
Authors
iCite; B. Ian Hutchins; George Santangelo
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This is a database snapshot of the iCite web service (provided here as a single zipped CSV file, or compressed, tarred JSON files). In addition, citation links in the NIH Open Citation Collection are provided as a two-column CSV table in open_citation_collection.zip. iCite provides bibliometrics and metadata on publications indexed in PubMed, organized into three modules:

Influence: Delivers metrics of scientific influence, field-adjusted and benchmarked to NIH publications as the baseline.

Translation: Measures how Human, Animal, or Molecular/Cellular Biology-oriented each paper is; tracks and predicts citation by clinical articles

Open Cites: Disseminates link-level, public-domain citation data from the NIH Open Citation Collection

Definitions for individual data fields:

pmid: PubMed Identifier, an article ID as assigned in PubMed by the National Library of Medicine

doi: Digital Object Identifier, if available

year: Year the article was published

title: Title of the article

authors: List of author names

journal: Journal name (ISO abbreviation)

is_research_article: Flag indicating whether the Publication Type tags for this article are consistent with that of a primary research article

relative_citation_ratio: Relative Citation Ratio (RCR)--OPA's metric of scientific influence. Field-adjusted, time-adjusted and benchmarked against NIH-funded papers. The median RCR for NIH funded papers in any field is 1.0. An RCR of 2.0 means a paper is receiving twice as many citations per year than the median NIH funded paper in its field and year, while an RCR of 0.5 means that it is receiving half as many citations per year. Calculation details are documented in Hutchins et al., PLoS Biol. 2016;14(9):e1002541.

provisional: RCRs for papers published in the previous two years are flagged as "provisional", to reflect that citation metrics for newer articles are not necessarily as stable as they are for older articles. Provisional RCRs are provided for papers published previous year, if they have received with 5 citations or more, despite being, in many cases, less than a year old. All papers published the year before the previous year receive provisional RCRs. The current year is considered to be the NIH Fiscal Year which starts in October. For example, in July 2019 (NIH Fiscal Year 2019), papers from 2018 receive provisional RCRs if they have 5 citations or more, and all papers from 2017 receive provisional RCRs. In October 2019, at the start of NIH Fiscal Year 2020, papers from 2019 receive provisional RCRs if they have 5 citations or more and all papers from 2018 receive provisional RCRs.

citation_count: Number of unique articles that have cited this one

citations_per_year: Citations per year that this article has received since its publication. If this appeared as a preprint and a published article, the year from the published version is used as the primary publication date. This is the numerator for the Relative Citation Ratio.

field_citation_rate: Measure of the intrinsic citation rate of this paper's field, estimated using its co-citation network.

expected_citations_per_year: Citations per year that NIH-funded articles, with the same Field Citation Rate and published in the same year as this paper, receive. This is the denominator for the Relative Citation Ratio.

nih_percentile: Percentile rank of this paper's RCR compared to all NIH publications. For example, 95% indicates that this paper's RCR is higher than 95% of all NIH funded publications.

human: Fraction of MeSH terms that are in the Human category (out of this article's MeSH terms that fall into the Human, Animal, or Molecular/Cellular Biology categories)

animal: Fraction of MeSH terms that are in the Animal category (out of this article's MeSH terms that fall into the Human, Animal, or Molecular/Cellular Biology categories)

molecular_cellular: Fraction of MeSH terms that are in the Molecular/Cellular Biology category (out of this article's MeSH terms that fall into the Human, Animal, or Molecular/Cellular Biology categories)

x_coord: X coordinate of the article on the Triangle of Biomedicine

y_coord: Y Coordinate of the article on the Triangle of Biomedicine

is_clinical: Flag indicating that this paper meets the definition of a clinical article.

cited_by_clin: PMIDs of clinical articles that this article has been cited by.

apt: Approximate Potential to Translate is a machine learning-based estimate of the likelihood that this publication will be cited in later clinical trials or guidelines. Calculation details are documented in Hutchins et al., PLoS Biol. 2019;17(10):e3000416.

cited_by: PMIDs of articles that have cited this one.

references: PMIDs of articles in this article's reference list.

Large CSV files are zipped using zip version 4.5, which is more recent than the default unzip command line utility in some common Linux distributions. These files can be unzipped with tools that support version 4.5 or later such as 7zip.

Comments and questions can be addressed to iCite@mail.nih.gov
e
Thesaurus for the definition of scientific and technological heritage
data.europa.eu
json-ld, rdf turtle +1
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ministero dei Beni Culturali, Thesaurus for the definition of scientific and technological heritage [Dataset]. https://data.europa.eu/data/datasets/iccd_st
Explore at:
rdf xml, json-ld, rdf turtleAvailable download formats
Dataset authored and provided by
Ministero dei Beni Culturali
License
https://w3id.org/italia/controlled-vocabulary/licences/A33_CCBYSA30IThttps://w3id.org/italia/controlled-vocabulary/licences/A33_CCBYSA30IT
Description
Terminological tools. PST - Scientific and technological heritage. Thesaurus for the definition of the asset (in skos format).

Facebook

Twitter

Click to copy link

Link copied

Cite

Neslihan Suzen (2020). LScD (Leicester Scientific Dictionary) [Dataset]. http://doi.org/10.25392/leicester.data.9746900.v3

LScD (Leicester Scientific Dictionary)

Explore at:

2 scholarly articles cite this dataset (View in Google Scholar)

docxAvailable download formats

Unique identifier

https://doi.org/10.25392/leicester.data.9746900.v3

Dataset updated

Apr 15, 2020

Dataset provided by

University of Leicester

Authors

Neslihan Suzen

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Area covered

Leicester

Description

LScD (Leicester Scientific Dictionary)April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk/suzenneslihan@hotmail.com)Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes[Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus) - Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in description of Version 2 below. We did not repeat the explanation. After pre-processing steps, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are also same as described as for LScD Version 2 below.* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2[Version 2] Getting StartedThis document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). This dictionary is created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from LSC and instructions for usage of the code are available in [2]. The code can be also used for list of texts from other sources, amendments to the code may be required.LSC is a collection of abstracts of articles and proceeding papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018. The total number of documents in LSC is 1,673,824.LScD is an ordered list of words from texts of abstracts in LSC.The dictionary stores 974,238 unique words, is sorted by the number of documents containing the word in descending order. All words in the LScD are in stemmed form of words. The LScD contains the following information:1.Unique words in abstracts2.Number of documents containing each word3.Number of appearance of a word in the entire corpusProcessing the LSCStep 1.Downloading the LSC Online: Use of the LSC is subject to acceptance of request of the link by email. To access the LSC for research purposes, please email to ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.Step 2.Importing the Corpus to R: The full R code for processing the corpus can be found in the GitHub [2].All following steps can be applied for arbitrary list of texts from any source with changes of parameter. The structure of the corpus such as file format and names (also the position) of fields should be taken into account to apply our code. The organisation of CSV files of LSC is described in README file for LSC [1].Step 3.Extracting Abstracts and Saving Metadata: Metadata that include all fields in a document excluding abstracts and the field of abstracts are separated. Metadata are then saved as MetaData.R. Fields of metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.Step 4.Text Pre-processing Steps on the Collection of Abstracts: In this section, we presented our approaches to pre-process abstracts of the LSC.1.Removing punctuations and special characters: This is the process of substitution of all non-alphanumeric characters by space. We did not substitute the character “-” in this step, because we need to keep words like “z-score”, “non-payment” and “pre-processing” in order not to lose the actual meaning of such words. A processing of uniting prefixes with words are performed in later steps of pre-processing.2.Lowercasing the text data: Lowercasing is performed to avoid considering same words like “Corpus”, “corpus” and “CORPUS” differently. Entire collection of texts are converted to lowercase.3.Uniting prefixes of words: Words containing prefixes joined with character “-” are united as a word. The list of prefixes united for this research are listed in the file “list_of_prefixes.csv”. The most of prefixes are extracted from [4]. We also added commonly used prefixes: ‘e’, ‘extra’, ‘per’, ‘self’ and ‘ultra’.4.Substitution of words: Some of words joined with “-” in the abstracts of the LSC require an additional process of substitution to avoid losing the meaning of the word before removing the character “-”. Some examples of such words are “z-test”, “well-known” and “chi-square”. These words have been substituted to “ztest”, “wellknown” and “chisquare”. Identification of such words is done by sampling of abstracts form LSC. The full list of such words and decision taken for substitution are presented in the file “list_of_substitution.csv”.5.Removing the character “-”: All remaining character “-” are replaced by space.6.Removing numbers: All digits which are not included in a word are replaced by space. All words that contain digits and letters are kept because alphanumeric characters such as chemical formula might be important for our analysis. Some examples are “co2”, “h2o” and “21st”.7.Stemming: Stemming is the process of converting inflected words into their word stem. This step results in uniting several forms of words with similar meaning into one form and also saving memory space and time [5]. All words in the LScD are stemmed to their word stem.8.Stop words removal: Stop words are words that are extreme common but provide little value in a language. Some common stop words in English are ‘I’, ‘the’, ‘a’ etc. We used ‘tm’ package in R to remove stop words [6]. There are 174 English stop words listed in the package.Step 5.Writing the LScD into CSV Format: There are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written in the file “LScD.csv”.The Organisation of the LScDThe total number of words in the file “LScD.csv” is 974,238. Each field is described below:Word: It contains unique words from the corpus. All words are in lowercase and their stem forms. The field is sorted by the number of documents that contain words in descending order.Number of Documents Containing the Word: In this content, binary calculation is used: if a word exists in an abstract then there is a count of 1. If the word exits more than once in a document, the count is still 1. Total number of document containing the word is counted as the sum of 1s in the entire corpus.Number of Appearance in Corpus: It contains how many times a word occurs in the corpus when the corpus is considered as one large document.Instructions for R CodeLScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as RData file and in CSV format. Outputs of the code are:Metadata File: It includes all fields in a document excluding abstracts. Fields are List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.File of Abstracts: It contains all abstracts after pre-processing steps defined in the step 4.DTM: It is the Document Term Matrix constructed from the LSC[6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.LScD: An ordered list of words from LSC as defined in the previous section.The code can be used by:1.Download the folder ‘LSC’, ‘list_of_prefixes.csv’ and ‘list_of_substitution.csv’2.Open LScD_Creation.R script3.Change parameters in the script: replace with the full path of the directory with source files and the full path of the directory to write output files4.Run the full code.References[1]N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1[2]N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION[3]Web of Science. (15 July). Available: https://apps.webofknowledge.com/[4]A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.[5]C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved porter’s stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.[6]I. Feinerer, "Introduction to the tm Package Text Mining in R," Accessible en ligne: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.

Clear search

Close search

Google apps

Main menu

LScD (Leicester Scientific Dictionary)

Replication Data for: A Definition-By-Example Approach and Visual Language...

Biobyte 1 - Where are we in the data science landscape?

Dataset: A Systematic Literature Review on the topic of High-value datasets

Replication Data for: How faculty define quality, prestige, and impact of...

INTEGRAL Science Window Data

Data from: Forensic science : an illustrated dictionary

Pond data: physical, chemical, and biological characteristics with...

Dictionary of Microsatellite Loci from the U.S. Geological Survey, Alaska...

WiDS data dictionary v2.xlsx

Dataset

Contents

Data from: Trusted Research Environments: Analysis of Characteristics and...

Methodology

Technical details

Data from: Purnell's concise dictionary of science

Training Webinars on China Research Data: Sources, Tools and Applications

Data Science for Environmental Justice PBL Module: Air Pollution Data

Data from: Diversity in citations to a single study: Supplementary data set...

Survey: Open Science in Higher Education

Replication Data for: Analysis of Social Media to support definition and...

iCite Database Snapshot 2023-03

iCite Database Snapshot 2023-06

Thesaurus for the definition of scientific and technological heritage

LScD (Leicester Scientific Dictionary)See More Versions

LScD (Leicester Scientific Dictionary)