100+ datasets found
  1. LScD (Leicester Scientific Dictionary)

    • figshare.le.ac.uk
    docx
    Updated Apr 15, 2020
    + more versions
    Cite
    Neslihan Suzen (2020). LScD (Leicester Scientific Dictionary) [Dataset]. http://doi.org/10.25392/leicester.data.9746900.v3
    Available download formats: docx
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

    LScD (Leicester Scientific Dictionary)

    April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

    [Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus), Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in the description of Version 2 below; we do not repeat the explanation. After the pre-processing steps, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are the same as those described for LScD Version 2 below.

    * Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
    ** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2

    [Version 2] Getting Started

    This document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). The dictionary is created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from the LSC, with instructions for its usage, is available in [2]. The code can also be used for lists of texts from other sources; amendments to the code may be required.

    LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains the title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018. The total number of documents in LSC is 1,673,824.

    LScD is an ordered list of words from the texts of abstracts in LSC. The dictionary stores 974,238 unique words and is sorted by the number of documents containing each word in descending order. All words in the LScD are in stemmed form. The LScD contains the following information:

    1. Unique words in abstracts
    2. Number of documents containing each word
    3. Number of appearances of a word in the entire corpus

    Processing the LSC

    Step 1. Downloading the LSC online: Use of the LSC is subject to acceptance of a request for the link by email. To access the LSC for research purposes, please email ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.

    Step 2. Importing the corpus to R: The full R code for processing the corpus can be found on GitHub [2]. All following steps can be applied to an arbitrary list of texts from any source, with changes of parameters. The structure of the corpus, such as the file format and the names (and positions) of fields, should be taken into account when applying our code. The organisation of the CSV files of LSC is described in the README file for LSC [1].

    Step 3. Extracting abstracts and saving metadata: Metadata, which include all fields in a document excluding abstracts, are separated from the field of abstracts and saved as MetaData.R. Fields of metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.

    Step 4. Text pre-processing steps on the collection of abstracts: In this section, we present our approaches to pre-processing the abstracts of the LSC.

    1. Removing punctuation and special characters: This is the process of substituting all non-alphanumeric characters by space. We did not substitute the character "-" in this step, because we need to keep words like "z-score", "non-payment" and "pre-processing" in order not to lose their actual meaning. Uniting prefixes with words is performed in a later step of pre-processing.
    2. Lowercasing the text data: Lowercasing is performed to avoid treating words like "Corpus", "corpus" and "CORPUS" differently. The entire collection of texts is converted to lowercase.
    3. Uniting prefixes of words: Words containing prefixes joined with the character "-" are united into one word. The prefixes united for this research are listed in the file "list_of_prefixes.csv". Most of the prefixes are extracted from [4]. We also added the commonly used prefixes 'e', 'extra', 'per', 'self' and 'ultra'.
    4. Substitution of words: Some words joined with "-" in the abstracts of the LSC require an additional substitution step to avoid losing their meaning before the character "-" is removed. Examples of such words are "z-test", "well-known" and "chi-square", which were substituted by "ztest", "wellknown" and "chisquare". Such words were identified by sampling abstracts from the LSC. The full list of such words and the substitution decisions are presented in the file "list_of_substitution.csv".
    5. Removing the character "-": All remaining occurrences of the character "-" are replaced by space.
    6. Removing numbers: All digits not included in a word are replaced by space. All words that contain both digits and letters are kept, because alphanumeric tokens such as chemical formulae might be important for our analysis. Examples are "co2", "h2o" and "21st".
    7. Stemming: Stemming is the process of converting inflected words into their word stem. This step unites several forms of words with similar meaning into one form, and also saves memory space and time [5]. All words in the LScD are stemmed to their word stem.
    8. Stop word removal: Stop words are words that are extremely common but provide little value in a language; common English stop words are 'I', 'the', 'a', etc. We used the 'tm' package in R to remove stop words [6]. There are 174 English stop words listed in the package.

    Step 5. Writing the LScD into CSV format: There are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written to the file "LScD.csv".

    The Organisation of the LScD

    The total number of words in the file "LScD.csv" is 974,238. Each field is described below:

    Word: Unique words from the corpus, in lowercase and in stemmed form. The field is sorted by the number of documents containing each word, in descending order.

    Number of Documents Containing the Word: A binary count is used here: if a word exists in an abstract, it counts as 1; if the word occurs more than once in a document, the count is still 1. The total number of documents containing the word is the sum of these 1s over the entire corpus.

    Number of Appearances in Corpus: How many times a word occurs in the corpus when the corpus is considered as one large document.

    Instructions for R Code

    LScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as an RData file and in CSV format. Outputs of the code are:

    Metadata File: All fields in a document excluding abstracts. Fields are List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.

    File of Abstracts: All abstracts after the pre-processing steps defined in Step 4.

    DTM: The Document Term Matrix constructed from the LSC [6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.

    LScD: An ordered list of words from LSC as defined in the previous section.

    The code can be used as follows:

    1. Download the folder 'LSC', 'list_of_prefixes.csv' and 'list_of_substitution.csv'.
    2. Open the LScD_Creation.R script.
    3. Change the parameters in the script: the full path of the directory with the source files and the full path of the directory for output files.
    4. Run the full code.

    References

    [1] N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1
    [2] N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION
    [3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
    [4] A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.
    [5] C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved Porter's stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.
    [6] I. Feinerer, "Introduction to the tm Package: Text Mining in R," available online: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.
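    For illustration, the following is a minimal R sketch of the kind of cleaning pipeline described in Step 4 and the dictionary construction of Step 5, using the 'tm' package [6]. It is a simplified approximation, not the published LScD_Creation.R [2]: the prefix-uniting and substitution steps (whose full lists live in list_of_prefixes.csv and list_of_substitution.csv) are reduced to a single illustrative gsub() rule.

      # Simplified sketch of the Step 4/5 pipeline; see LScD_Creation.R [2] for the full code
      library(tm)

      abstracts <- c("Z-score and chi-square tests are well-known.",
                     "CO2 and H2O appear in 21st-century abstracts.")
      corpus <- VCorpus(VectorSource(abstracts))

      # 1. Replace non-alphanumeric characters (except "-") by space
      corpus <- tm_map(corpus, content_transformer(function(x) gsub("[^[:alnum:]-]", " ", x)))

      # 2. Lowercase the text data
      corpus <- tm_map(corpus, content_transformer(tolower))

      # 3-4. Unite prefixes / substitute words joined with "-" (one illustrative rule)
      corpus <- tm_map(corpus, content_transformer(function(x) gsub("(chi|z|well)-", "\\1", x)))

      # 5. Replace remaining "-" by space; 6. remove free-standing digits ("co2" is kept)
      corpus <- tm_map(corpus, content_transformer(function(x) gsub("-", " ", x)))
      corpus <- tm_map(corpus, content_transformer(function(x) gsub("\\b[0-9]+\\b", " ", x)))

      # 8. Remove the 174 English stop words shipped with tm, then 7. stem
      # (stop words are removed before stemming here, since tm's list is unstemmed)
      corpus <- tm_map(corpus, removeWords, stopwords("english"))
      corpus <- tm_map(corpus, stemDocument)

      # DTM: each entry is the number of times a word occurs in a document
      dtm <- as.matrix(DocumentTermMatrix(corpus))

      # Dictionary: word, number of documents containing it, total appearances,
      # sorted by document count in descending order
      dictionary <- data.frame(word = colnames(dtm),
                               n_docs = colSums(dtm > 0),
                               n_total = colSums(dtm))
      dictionary <- dictionary[order(-dictionary$n_docs), ]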

  2. Replication Data for: A Definition-By-Example Approach and Visual Language...

    • dataverse.harvard.edu
    Updated Dec 16, 2019
    Cite
    anonymous anonymous (2019). Replication Data for: A Definition-By-Example Approach and Visual Language for Activity Patterns [Dataset]. http://doi.org/10.7910/DVN/FTPF3Z
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Dec 16, 2019
    Dataset provided by
    Harvard Dataverse
    Authors
    anonymous anonymous
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    public_supplementary_material.pdf includes the questionnaire, the tutorial, the instructions and tasks shown during the experiment, and the visual and textual activity definitions for the tasks used in the experiment reported in our paper. data.xls includes all our raw data.

  3. Biobyte 1 - Where are we in the data science landscape?

    • qubeshub.org
    Updated Aug 6, 2019
    + more versions
    Cite
    Sam Donovan (2019). Biobyte 1 - Where are we in the data science landscape? [Dataset]. http://doi.org/10.25334/03VE-VK77
    Dataset updated
    Aug 6, 2019
    Dataset provided by
    QUBES
    Authors
    Sam Donovan
    Description

    This short activity can be used to introduce the definition of data acumen from the NAS Data Science for Undergraduates report and to engage participants in a self-assessment of how they connect with those 10 data science concepts.

  4. Dataset: A Systematic Literature Review on the topic of High-value datasets

    • data.niaid.nih.gov
    Updated Jun 23, 2023
    Cite
    Nina Rizun (2023). Dataset: A Systematic Literature Review on the topic of High-value datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7944424
    Dataset updated
    Jun 23, 2023
    Dataset provided by
    Andrea Miletič
    Charalampos Alexopoulos
    Anastasija Nikiforova
    Magdalena Ciesielska
    Nina Rizun
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains data collected during the study "Towards High-Value Datasets determination for data-driven development: a systematic literature review", conducted by Anastasija Nikiforova (University of Tartu), Nina Rizun, Magdalena Ciesielska (Gdańsk University of Technology), Charalampos Alexopoulos (University of the Aegean) and Andrea Miletič (University of Zagreb). It is being made public both to act as supplementary data for the paper (a pre-print is available in Open Access at https://arxiv.org/abs/2305.10234) and to allow other researchers to use these data in their own work.

    The protocol is intended for the systematic literature review on the topic of high-value datasets, with the aim of gathering information on how the topic of high-value datasets (HVD) and their determination has been reflected in the literature over the years, and on what has been found by these studies to date, incl. the indicators used in them, involved stakeholders, data-related aspects, and frameworks. The data in this dataset were collected as the result of the SLR over Scopus, Web of Science, and the Digital Government Research library (DGRL) in 2023.

    Methodology

    To understand how HVD determination has been reflected in the literature over the years and what has been found by these studies to date, all relevant literature covering this topic has been studied. To this end, the SLR was carried out by searching digital libraries covered by Scopus, Web of Science (WoS), and the Digital Government Research library (DGRL).

    These databases were queried for the keywords ("open data" OR "open government data") AND ("high-value data*" OR "high value data*"), which were applied to the article title, keywords, and abstract to limit the number of papers to those where these objects were primary research objects rather than merely mentioned in the body, e.g., as future work. After deduplication, 11 articles were found to be unique and were further checked for relevance. As a result, a total of 9 articles were examined further. Each study was independently examined by at least two authors.

    To attain the objective of our study, we developed the protocol, where the information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design- related information, (3) quality-related information, (4) HVD determination-related information.

    Test procedure

    Each study was independently examined by at least two authors: after an in-depth examination of the full text of the article, the structured protocol was filled in for each study. The structure of the protocol is available in the supplementary files (see Protocol_HVD_SLR.odt, Protocol_HVD_SLR.docx). The data collected for each study by two researchers were then synthesized into one final version by the third researcher.

    Description of the data in this data set

    Protocol_HVD_SLR provides the structure of the protocol. Spreadsheet #1 provides the filled protocol for relevant studies. Spreadsheet #2 provides the list of results after the search over the three indexing databases, i.e. before filtering out irrelevant studies.

    The information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design- related information, (3) quality-related information, (4) HVD determination-related information

    Descriptive information
    1) Article number - a study number, corresponding to the study number assigned in an Excel worksheet
    2) Complete reference - the complete source information to refer to the study
    3) Year of publication - the year in which the study was published
    4) Journal article / conference paper / book chapter - the type of the paper {journal article, conference paper, book chapter}
    5) DOI / Website - a link to the website where the study can be found
    6) Number of citations - the number of citations of the article in Google Scholar, Scopus, Web of Science
    7) Availability in OA - availability of the article in Open Access
    8) Keywords - keywords of the paper as indicated by the authors
    9) Relevance for this study - the relevance level of the article for this study {high / medium / low}

    Approach- and research design-related information
    10) Objective / RQ - the research objective / aim and the established research questions
    11) Research method (including unit of analysis) - the methods used to collect data, including the unit of analysis (country, organisation, specific unit that has been analysed, e.g. the number of use-cases, scope of the SLR, etc.)
    12) Contributions - the contributions of the study
    13) Method - whether the study uses a qualitative, quantitative, or mixed-methods approach
    14) Availability of the underlying research data - whether there is a reference to publicly available underlying research data, e.g. transcriptions of interviews or collected data, or an explanation of why these data are not shared
    15) Period under investigation - the period (or moment) in which the study was conducted
    16) Use of theory / theoretical concepts / approaches - does the study mention any theory / theoretical concepts / approaches? If any theory is mentioned, how is it used in the study?

    Quality- and relevance-related information
    17) Quality concerns - whether there are any quality concerns (e.g., limited information about the research methods used)
    18) Primary research object - is the HVD a primary research object in the study? (primary - the paper is focused on HVD determination; secondary - HVD are mentioned but not studied (e.g., as part of the discussion, future work etc.))

    HVD determination-related information
    19) HVD definition and type of value - how is the HVD defined in the article and / or any other equivalent term?
    20) HVD indicators - what are the indicators to identify HVD? How were they identified? (components & relationships, "input -> output")
    21) A framework for HVD determination - is there a framework presented for HVD identification? What components does it consist of and what are the relationships between these components? (detailed description)
    22) Stakeholders and their roles - what stakeholders or actors does HVD determination involve? What are their roles?
    23) Data - what data do HVD cover?
    24) Level (if relevant) - what is the level of the HVD determination covered in the article? (e.g., city, regional, national, international)

    Format of the files: .xls, .csv (for the first spreadsheet only), .odt, .docx

    Licenses or restrictions: CC-BY

    For more info, see README.txt

  5. Replication Data for: How faculty define quality, prestige, and impact of...

    • search.dataone.org
    Updated Nov 13, 2023
    Cite
    Morales, Esteban; Alperin, Juan Pablo (2023). Replication Data for: How faculty define quality, prestige, and impact of academic journals [Dataset]. http://doi.org/10.7910/DVN/2FNDXL
    Dataset updated
    Nov 13, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Morales, Esteban; Alperin, Juan Pablo
    Description

    Anonymized and coded survey responses to open-ended questions asking for definitions of three terms related to academic journals: High Quality, Prestigious, and High Impact. Each column represents a code, as described in Morales et al. (2021). A value of 1 indicates that the respondent's answer was deemed to include a reference to the concept described by the code; a 0 indicates that the concept was not present in the response.

  6. INTEGRAL Science Window Data

    • catalog.data.gov
    • s.cnmilf.com
    • +2more
    Updated Jun 2, 2025
    + more versions
    Cite
    High Energy Astrophysics Science Archive Research Center (2025). INTEGRAL Science Window Data [Dataset]. https://catalog.data.gov/dataset/integral-science-window-data
    Dataset updated
    Jun 2, 2025
    Dataset provided by
    High Energy Astrophysics Science Archive Research Center
    Description

    Because of the pointing-slew-pointing, dithering nature of INTEGRAL operations, each observation of a celestial target actually comprises numerous individual S/C pointings and slews. In addition, there are periods within a given sequence where no observations are scheduled, i.e., engineering windows, yet the instruments still acquire data. The INTEGRAL Science Data Center (ISDC) generalizes all of these data acquisition periods into so-called "Science Windows". A Science Window (ScW) is a continuous time interval during which all data acquired by the INTEGRAL instruments result from a specific S/C attitude orientation state. Pointing (fixed orientation), Slew (changing orientation), and Engineering (undefined orientation) windows are all special cases of a Science Window. The key point is that the same attitude information may be associated with all acquired data of a given Science Window. Note that it is possible to divide a time interval that qualifies as a Science Window under this definition into several smaller Science Windows using arbitrary criteria.

    The INTEGRAL Science Window Data Catalog allows for the keyed search and selection of sets of Science Windows and the retrieval of the corresponding data products. This database table was first created at the HEASARC in October 2004. It is a slightly modified mirror of the online database maintained by the ISDC at http://isdc.unige.ch/index.cgi?Data+browse

    The HEASARC version of this table is updated automatically within a day of the ISDC updating their database table. This is a service provided by NASA HEASARC.

  7. Data from: Forensic science : an illustrated dictionary

    • workwithdata.com
    Updated May 18, 2023
    Cite
    Work With Data (2023). Forensic science : an illustrated dictionary [Dataset]. https://www.workwithdata.com/book/Forensic%20science%20:%20an%20illustrated%20dictionary_864990
    Dataset updated
    May 18, 2023
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Explore Forensic science : an illustrated dictionary through data • Key facts: author, publication date, book publisher, book series, book subjects • Real-time news, visualizations and datasets

  8. Pond data: physical, chemical, and biological characteristics with...

    • search.dataone.org
    • portal.edirepository.org
    Updated Apr 4, 2022
    Cite
    David C. Richardson; Meredith A. Holgerson; Matthew J. Farragher; Kathryn K. Hoffman; Katelyn B.S. King; Maria B. Alfonso; Mikkel R. Andersen; Kendra Spence Cheruveil; Kristen A. Coleman; Mary Jade Farruggia; Rocio L. Fernandez; Kelly L. Hondula; Gregorio A. Lopez Moreira M.; Katherine E. Paul; Benjamin L. Peierls; Joseph S. Rabaey; Steven Sadro; Maria Laura Sanchez; Robyn L. Smyth; Jon N. Sweetman (2022). Pond data: physical, chemical, and biological characteristics with scientific and United States of America state definitions from literature and legislative surveys [Dataset]. https://search.dataone.org/view/https%3A%2F%2Fpasta.lternet.edu%2Fpackage%2Fmetadata%2Feml%2Fedi%2F1014%2F1
    Dataset updated
    Apr 4, 2022
    Dataset provided by
    Environmental Data Initiative
    Authors
    David C. Richardson; Meredith A. Holgerson; Matthew J. Farragher; Kathryn K. Hoffman; Katelyn B.S. King; Maria B. Alfonso; Mikkel R. Andersen; Kendra Spence Cheruveil; Kristen A. Coleman; Mary Jade Farruggia; Rocio L. Fernandez; Kelly L. Hondula; Gregorio A. Lopez Moreira M.; Katherine E. Paul; Benjamin L. Peierls; Joseph S. Rabaey; Steven Sadro; Maria Laura Sanchez; Robyn L. Smyth; Jon N. Sweetman
    Time period covered
    Jan 1, 1946 - Apr 30, 2019
    Variables measured
    ph, Link, Year, year, title, Author, author, Journal, journal, landuse, and 33 more
    Description

    Ponds are often identified by their small size and shallow depths, but the lack of a universal definition hampers science and weakens legal protection. In order to determine a working definition of 'pond', we conducted a literature search for scientific definitions and a U.S. state survey for management definitions, and examined pond ecosystem function using data from the literature search. Our dataset includes physical, chemical, and biological data for 1327 waterbodies ≤ 20 ha in surface area and ≤ 9 m in maximum or mean depth from our literature review. These data have a global distribution (we include a table of latitudes and longitudes) and span many years (1946-2019). We have also included a table of 54 pond definitions from the literature review and a table of U.S. state definitions of ponds, wetlands, and lakes resulting from our survey.

  9. Dictionary of Microsatellite Loci from the U.S. Geological Survey, Alaska...

    • catalog.data.gov
    • data.usgs.gov
    Updated Nov 10, 2024
    Cite
    U.S. Geological Survey (2024). Dictionary of Microsatellite Loci from the U.S. Geological Survey, Alaska Science Center, Molecular Ecology Laboratory [Dataset]. https://catalog.data.gov/dataset/dictionary-of-microsatellite-loci-from-the-u-s-geological-survey-alaska-science-center-mol
    Dataset updated
    Nov 10, 2024
    Dataset provided by
    United States Geological Survey, http://www.usgs.gov/
    Description

    This data package provides a dictionary of microsatellite locus primers used by the USGS Alaska Science Center, Molecular Ecology Laboratory (MEL). It is a look-up file of microsatellite locus names and citations to the original publication or source where additional information about the locus primers may be found.

  10. WiDS data dictionary v2.xlsx

    • kaggle.com
    Updated Feb 13, 2018
    Cite
    VivekSingh (2018). WiDS data dictionary v2.xlsx [Dataset]. https://www.kaggle.com/datasets/viveksinghub/wids-data-dictionary-v2xlsx
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Feb 13, 2018
    Dataset provided by
    Kaggle, http://kaggle.com/
    Authors
    VivekSingh
    Description

    Dataset

    This dataset was created by VivekSingh

    Released under Data files © Original Authors


  11. Data from: Trusted Research Environments: Analysis of Characteristics and...

    • researchdata.tuwien.ac.at
    bin, csv
    Updated Jun 25, 2024
    Cite
    Martin Weise; Martin Weise; Andreas Rauber; Andreas Rauber (2024). Trusted Research Environments: Analysis of Characteristics and Data Availability [Dataset]. http://doi.org/10.48436/cv20m-sg117
    Available download formats: bin, csv
    Dataset updated
    Jun 25, 2024
    Dataset provided by
    TU Wien
    Authors
    Martin Weise; Martin Weise; Andreas Rauber; Andreas Rauber
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Trusted Research Environments (TREs) enable analysis of sensitive data under strict security assertions that protect the data with technical, organizational and legal measures from (accidentally) being leaked outside the facility. While many TREs exist in Europe, little information is publicly available on their architecture, the descriptions of their building blocks, and their slight technical variations. To shed light on these problems, we give an overview of existing, publicly described TREs and a bibliography linking to the system descriptions. We further analyze their technical characteristics, especially their commonalities and variations, and provide insight into their data type characteristics and availability. Our literature study shows that 47 TREs worldwide provide access to sensitive data, of which two-thirds provide data themselves, predominantly via secure remote access. Statistical offices make available the majority of the sensitive data records included in this study.

    Methodology

    We performed a literature study covering 47 TREs worldwide using scholarly databases (Scopus, Web of Science, IEEE Xplore, Science Direct), a computer science library (dblp.org), Google and grey literature focusing on retrieving the following source material:

    • Peer-reviewed articles where available,
    • TRE websites,
    • TRE metadata catalogs.

    The goal for this literature study is to discover existing TREs, analyze their characteristics and data availability to give an overview on available infrastructure for sensitive data research as many European initiatives have been emerging in recent months.

    Technical details

    This dataset consists of five comma-separated values (.csv) files describing our inventory:

    • countries.csv: Table of countries with columns id (number), name (text) and code (text, in ISO 3166-A3 encoding, optional)
    • tres.csv: Table of TREs with columns id (number), name (text), countryid (number, referring to column id of table countries), structureddata (bool, optional), datalevel (one of [1=de-identified, 2=pseudonymized, 3=anonymized], optional), outputcontrol (bool, optional), inceptionyear (date, optional), records (number, optional), datatype (one of [1=claims, 2=linked records], optional), statistics_office (bool), size (number, optional), source (text, optional), comment (text, optional)
    • access.csv: Table of access modes of TREs with columns id (number), suf (bool, optional), physical_visit (bool, optional), external_physical_visit (bool, optional), remote_visit (bool, optional)
    • inclusion.csv: Table of included TREs into the literature study with columns id (number), included (bool), exclusion reason (one of [peer review, environment, duplicate], optional), comment (text, optional)
    • major_fields.csv: Table of data categorization into the major research fields with columns id (number), life_sciences (bool, optional), physical_sciences (bool, optional), arts_and_humanities (bool, optional), social_sciences (bool, optional).

    Additionally, a MariaDB (10.5 or higher) schema definition .sql file models the schema for the database:

    • schema.sql: Schema definition file to create the tables and views used in the analysis.

    The analysis was done through Jupyter Notebook which can be found in our source code repository: https://gitlab.tuwien.ac.at/martin.weise/tres/-/blob/master/analysis.ipynb
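
    For orientation, here is a minimal R sketch (our illustration, not part of the dataset; the authors' own analysis is in the Jupyter notebook linked above) that joins tres.csv to countries.csv using the column names listed above:

      # Join TREs to their countries and summarise; column names as documented above.
      countries <- read.csv("countries.csv")
      tres <- read.csv("tres.csv")

      merged <- merge(tres, countries, by.x = "countryid", by.y = "id",
                      suffixes = c("_tre", "_country"))

      # Number of TREs per country, descending
      sort(table(merged$name_country), decreasing = TRUE)

      # Share of TREs run by statistical offices
      # (the bool encoding of the CSV is an assumption here)
      mean(as.logical(merged$statistics_office), na.rm = TRUE)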

  12. Data from: Purnell's concise dictionary of science

    • workwithdata.com
    Updated Jul 1, 2024
    Cite
    Work With Data (2024). Purnell's concise dictionary of science [Dataset]. https://www.workwithdata.com/object/purnell-s-concise-dictionary-of-science-book-by-robin-kerrod-1940
    Dataset updated
    Jul 1, 2024
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Explore Purnell's concise dictionary of science through data • Key facts: author, publication date, book publisher, book series, book subjects • Real-time news, visualizations and datasets

  13. Training Webinars on China Research Data: Sources, Tools and Applications

    • search.dataone.org
    Updated Mar 6, 2024
    Cite
    Spatial Data Lab (2024). Training Webinars on China Research Data: Sources, Tools and Applications [Dataset]. http://doi.org/10.7910/DVN/LN6OHH
    Dataset updated
    Mar 6, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Spatial Data Lab
    Description

    This webinar series introduces research data with a focus on China and discusses how they differ from US data. Each webinar will cover the following topics: (1) data sources, data collection, data categories, definitions, descriptions, and interpretation; (2) alternative data and derivable data from other data sources, especially big data sources; (3) comparison of data differences between the US and China; (4) available tools for efficient data analysis; (5) discussions of pros and cons; and (6) data applications in research and teaching.

  14. Data Science for Environmental Justice PBL Module: Air Pollution Data

    • figshare.com
    pdf
    Updated May 23, 2025
    Cite
    RN Uma; Marja H. Bakermans; Elisabeth Stoddard; Rakesh Malhotra; Alade Tokuta; Adrienne Smith; Rebecca Zulli Lowe (2025). Data Science for Environmental Justice PBL Module: Air Pollution Data [Dataset]. http://doi.org/10.6084/m9.figshare.24902889.v6
    Available download formats: pdf
    Dataset updated
    May 23, 2025
    Dataset provided by
    Figshare, http://figshare.com/
    Authors
    RN Uma; Marja H. Bakermans; Elisabeth Stoddard; Rakesh Malhotra; Alade Tokuta; Adrienne Smith; Rebecca Zulli Lowe
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a PBL module on air pollution to be used in an introductory environmental science course to motivate students to analyze related environmental justice issues.

    The original data came from the US EPA "State EJScreen Data at the Block Group Level" (EJSCREEN_2023_BG_StatePct_with_AS_CNMI_GU_VI.csv), downloaded from https://www.epa.gov/ejscreen/download-ejscreen-data on December 20, 2023. (Note: access to the EJSCREEN tool was removed during February 2025.) This data was processed and cleaned as described in the data provenance document. Lecture slides, activity sheets and instructor notes are available here.

    The following files are included:
    Data Provenance and Data Dictionary: Data Provenance and Data Dictionary.pdf
    R Script for Data Processing: EJSCREEN_Data_Curation_NC_Summarized_by_County.R
    Processed Dataset for North Carolina: EJScreen_State_BGLevel_NC_13Columns.csv
    Curated Data used in the Module - Summarized Dataset for North Carolina (summarized by county): EJScreen_State_BGLevel_NC_Summarized_By_County_13Columns.csv
    Data Dictionary: Data_Dictionary_EJSCREEN_2023_BG_Columns.pdf
    Original Dataset from EPA/EJSCREEN from which Data was Extracted for North Carolina: DS4EJ_EJSCREEN_2023_BG_StatePct_with_AS_CNMI_GU_VI.csv
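    As an illustration of the summarisation step, here is a hypothetical R sketch in the spirit of EJSCREEN_Data_Curation_NC_Summarized_by_County.R; the column names CNTY_NAME and PM25 are assumptions for illustration, not necessarily the names in the curated CSV.

      # Hypothetical sketch: mean block-group PM2.5 per North Carolina county.
      # Column names (CNTY_NAME, PM25) are illustrative assumptions.
      ej <- read.csv("EJScreen_State_BGLevel_NC_13Columns.csv")

      by_county <- aggregate(PM25 ~ CNTY_NAME, data = ej, FUN = mean)
      by_county <- by_county[order(-by_county$PM25), ]  # highest-burden counties first
      head(by_county)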

  15. Data from: Diversity in citations to a single study: Supplementary data set...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 25, 2021
    Cite
    Leng, Rhodri Ivor (2021). Diversity in citations to a single study: Supplementary data set for citation context network analysis [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5244799
    Dataset updated
    Aug 25, 2021
    Dataset authored and provided by
    Leng, Rhodri Ivor
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    This document describes the data set used for all analyses in 'Diversity in citations to a single study: A citation context network analysis of how evidence from a prospective cohort study was cited' accepted for publication in Quantitative Science Studies [1].

    Data Collection

    The data collection procedure has been fully described [1]. Concisely, the data set contains bibliometric data collected from Web of Science Core Collection via the University of Edinburgh’s Library subscription concerning all papers that cited a cohort study, Paul et al. [2], in the period <1985. This includes a full list of citing papers, and the citations between these papers. Additionally, it includes textual passages (citation contexts) from 343 citing papers, which were manually recovered from the full-text documents accessible via the University of Edinburgh’s Library subscription. These data have been cleaned, converted into network readable datasets, and are coded into particular classifications reflecting content, which are described fully in the supplied code book and within the manuscript [1].

    Data description

    All relevant data can be found in the attached file 'Supplementary_material_Leng_QSS_2021.xlsx', which contains the following five workbooks:

    “Overview” includes a list of the content of the workbooks.

    “Code Book” contains the coding rules and definitions used for the classification of findings and paper titles.

    “Node attribute list” is a workbook containing all node attributes for the citation network, which includes Paul et al. [2] and its citing papers as of 1984. Highlighted in yellow at the bottom of this workbook are two papers that were discarded due to duplication - remove these if analysing this dataset in a network analysis. The columns refer to:

    Id, the node identifier

    Label, the formal citation of the paper to which data within this row corresponds. Citation is in the following format: last name of first author, year of publication, journal of publication, volume number, start page, and DOI (if available).

    Title, the paper title for the paper in question.

    Publication_year, the year of publication.

    Document_type, the document type (e.g. review, article)

    WoS_ID, the paper’s unique Web of Science accession number.

    Citation_context, a column specifying whether citation context data is available from that paper

    Explanans, the title explanans terms for that paper.

    Explanandum, the explanandum terms for that paper.

    Combined_Title_Classification, the combined terms used for fig 2 of the published manuscript.

    Serum_cholesterol_(SC), a column identifying papers that cited the serum cholesterol findings.

    Blood_Pressure_(BP), a column identifying papers that cited the blood pressure findings.

    Coffee_(C), a column identifying papers that cited the coffee findings.

    Diet_(D), a column identifying papers that cited the dietary findings.

    Smoking_(S), a column identifying papers that cited the smoking findings.

    Alcohol_(A), a column identifying papers that cited the alcohol findings.

    Physical_Activity_(PA), a column identifying papers that cited the physical activity findings.

    Body_Fatness (BF), a column identifying papers that cited the body fatness findings.

    Indegree, the number of within network citations to that paper, calculated for the network shown in Fig 4 of the manuscript.

    Outdegree, the number of within network references of that paper as calculated for the network in Fig 4.

    Main_component, a column specifying whether a node is contained in the largest weakly connected component as shown in Fig 4 of the manuscript.

    Cluster, provides the cluster membership number as discussed within the manuscript (Fig 5).

    “Edge list” is a workbook containing the edges of the network. The columns refer to:

    Source, contains the node identifier of the citing paper.

    Target, contains the node identifier of the cited paper.

    “Citation context classification” is a workbook containing the WoS accession number for each paper analysed, and any finding category discussed in that paper established via context analysis (see the code book for definitions). The columns refer to:

    Id, the node identifier

    Finding_Class, the findings discussed from Paul et al. within the body of the citing paper.

    “Citation context data” is a workbook containing the WoS accession number for papers in which citation context data was available, the citation context passages, the reference number or format of Paul et al. within the citing paper, and the finding categories discussed in those contexts (see the code book for definitions). The columns refer to:

    Id, the node identifier

    Citation_context, the passage copied from the full text of the citing paper containing discussion of the findings of Paul et al.

    Reference_in_citing_article, the reference number or format of Paul et al. within the citing paper.

    Finding_class, the findings discussed from Paul et al. within the body of the citing paper.

    Software recommended for analysis

    For the analyses performed within the manuscript, Gephi version 0.9.2 was used [3], and both the edge and node lists are in a format that is easily read into this software. The Sci2 tool was used to parse data initially [4].
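
    As an alternative to Gephi, a short R sketch (our illustration, not part of the original analysis) can recompute the within-network indegree from the two workbooks described above:

      # Load the node and edge workbooks and recompute within-network indegree.
      # Remember to drop the two duplicate papers highlighted at the bottom of
      # the node attribute list before analysing the network.
      library(readxl)
      library(igraph)

      nodes <- read_excel("Supplementary_material_Leng_QSS_2021.xlsx", sheet = "Node attribute list")
      edges <- read_excel("Supplementary_material_Leng_QSS_2021.xlsx", sheet = "Edge list")

      # Source cites Target: edges point from citing paper to cited paper
      g <- graph_from_data_frame(edges[, c("Source", "Target")], directed = TRUE,
                                 vertices = nodes)

      # Indegree = within-network citations received (cf. the Indegree column)
      sort(degree(g, mode = "in"), decreasing = TRUE)[1:10]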

    Notes

    [1] Leng, R. I. (Forthcoming). Diversity in citations to a single study: A citation context network analysis of how evidence from a prospective cohort study was cited. Quantitative Science Studies.

    [2] Paul, O., Lepper, M. H., Phelan, W. H., Dupertuis, G. W., Macmillan, A., McKean, H., et al. (1963). A longitudinal study of coronary heart disease. Circulation, 28, 20-31. https://doi.org/10.1161/01.cir.28.1.20

    [3] Bastian, M., Heymann, S., & Jacomy, M. (2009). Gephi: an open source software for exploring and manipulating networks. International AAAI Conference on Weblogs and Social Media.

    [4] Sci2 Team. (2009). Science of Science (Sci2) Tool. Indiana University and SciTech Strategies. https://sci2.cns.iu.edu

  16. Survey: Open Science in Higher Education

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 3, 2024
    Cite
    Blümel, Ina (2024). Survey: Open Science in Higher Education [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_400518
    Dataset updated
    Aug 3, 2024
    Dataset provided by
    Mazarakis, Athanasios
    Blümel, Ina
    Scherp, Ansgar
    Heck, Tamara
    Peters, Isabella
    Weisel, Luzian
    Heller, Lambert
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Open Science in (Higher) Education – data of the February 2017 survey

    This data set contains:

    Full raw (anonymised) data set (completed responses) of Open Science in (Higher) Education February 2017 survey. Data are in xlsx and sav format.

    Survey questionnaires with variables and settings (German original and English translation) in pdf. The English questionnaire was not used in the February 2017 survey, but only serves as translation.

    Readme file (txt)

    Survey structure

    The survey includes 24 questions, and its structure can be separated into five major themes: material used in courses (5), OER awareness, usage and development (6), collaborative tools used in courses (2), assessment and participation options (5), and demographics (4). The last two questions are an open text question about general issues on the topics and singular open education experiences, and a request to forward the respondent's e-mail address for further questioning. The online survey was created with Limesurvey[1]. Several questions include filters, i.e. these questions were only shown if a participant chose a specific answer beforehand ([n/a] in Excel file, [.] in SPSS).

    Demographic questions

    Demographic questions asked about the current position, the discipline, birth year and gender. The classification of research disciplines was adapted to general disciplines at German higher education institutions. As we wanted to have a broad classification, we summarised several disciplines and came up with the following list, including the option "other" for respondents who do not feel confident with the proposed classification:

    Natural Sciences

    Arts and Humanities or Social Sciences

    Economics

    Law

    Medicine

    Computer Sciences, Engineering, Technics

    Other

    The current job position classification was also chosen according to common positions in Germany, including positions with a teaching responsibility at higher education institutions. Here, we also included the option "other" for respondents who do not feel confident with the proposed classification:

    Professor

    Special education teacher

    Academic/scientific assistant or research fellow (research and teaching)

    Academic staff (teaching)

    Student assistant

    Other

    We chose a free-text (numerical) question for the respondent's year of birth because we did not want to pre-classify respondents' age intervals. This leaves us options for different analyses of the answers and of possible correlations with respondents' age. A question about country was left out, as the survey was designed for academics in Germany.

    Remark on OER question

    Data from earlier surveys revealed that academics suffer confusion about the proper definition of OER [2]. Some seem to understand OER as free resources, or only refer to open source software (Allen & Seaman, 2016, p. 11). Allen and Seaman (2016) decided to give a broad explanation of OER, avoiding details so as not to tempt participants to claim awareness; thus, there is a danger of bias when giving an explanation. We decided not to give an explanation, but to keep this question simple. We assume that someone either knows about OER or not. If they had not heard of the term before, they probably do not use OER (at least not consciously) or create them.

    Data collection

    The target group of the survey was academics at German institutions of higher education, mainly universities and universities of applied sciences. To reach them, we sent the survey to diverse internal and external institutional mailing lists and via personal contacts. Included lists were discipline-based lists, lists from higher education and higher-education-didactics communities, as well as lists from open science and OER communities. Additionally, personal e-mails were sent to presidents and contact persons from those communities, and Twitter was used to spread the survey.

    The survey was online from Feb 6th to March 3rd 2017; e-mails were mainly sent at the beginning and around mid-term.

    Data clearance

    We got 360 responses, of which Limesurvey counted 208 as complete and 152 as incomplete. Two responses were marked as incomplete but turned out, after checking, to be complete, and we added them to the complete responses. Thus, this data set includes 210 complete responses. Of the 150 incomplete responses, 58 respondents did not answer the 1st question and 40 discontinued after the 1st question. The data show a constant decline in answers; we did not detect any particular survey question with a high dropout rate. We deleted incomplete responses, and they are not in this data set.

    Due to data privacy reasons, we deleted seven variables automatically assigned by Limesurvey: submitdate, lastpage, startlanguage, startdate, datestamp, ipaddr, refurl. We also deleted answers to question No 24 (e-mail address).

    References

    Allen, E., & Seaman, J. (2016). Opening the Textbook: Educational Resources in U.S. Higher Education, 2015-16.

    First results of the survey are presented in the poster:

    Heck, Tamara, Blümel, Ina, Heller, Lambert, Mazarakis, Athanasios, Peters, Isabella, Scherp, Ansgar, & Weisel, Luzian. (2017). Survey: Open Science in Higher Education. Zenodo. http://doi.org/10.5281/zenodo.400561

    Contact:

    Open Science in (Higher) Education working group, see http://www.leibniz-science20.de/forschung/projekte/laufende-projekte/open-science-in-higher-education/.

    [1] https://www.limesurvey.org

    [2] The survey question about the awareness of OER gave a broad explanation, avoiding details to not tempt the participant to claim "aware".

  17. Replication Data for: Analysis of Social Media to support definition and...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 13, 2023
    Cite
    Sotelo Docio, Susana; Benitez-Baleato, Jesus M (2023). Replication Data for: Analysis of Social Media to support definition and evaluation of tourism public policy. The case of the Way of Saint James. [Dataset]. http://doi.org/10.7910/DVN/CUFZKT
    Dataset updated
    Nov 13, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Sotelo Docio, Susana; Benitez-Baleato, Jesus M
    Description

    Replication Data for: Digital Tracks: Application of Artificial Intelligence Technologies for Automatic Detection of Perceptions from Social Media. The case of the Saint James Way, with a focus on COVID-19

  18. iCite Database Snapshot 2023-03

    • nih.figshare.com
    bin
    Updated Jun 1, 2023
    + more versions
    Cite
    iCite; B. Ian Hutchins; George Santangelo (2023). iCite Database Snapshot 2023-03 [Dataset]. http://doi.org/10.35092/yhjc22589044.v1
    Available download formats: bin
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    The NIH Figshare Archive
    Authors
    iCite; B. Ian Hutchins; George Santangelo
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This is a database snapshot of the iCite web service (provided here as a single zipped CSV file, or compressed, tarred JSON files). In addition, citation links in the NIH Open Citation Collection are provided as a two-column CSV table in open_citation_collection.zip. iCite provides bibliometrics and metadata on publications indexed in PubMed, organized into three modules:

    Influence: Delivers metrics of scientific influence, field-adjusted and benchmarked to NIH publications as the baseline.

    Translation: Measures how Human, Animal, or Molecular/Cellular Biology-oriented each paper is; tracks and predicts citation by clinical articles

    Open Cites: Disseminates link-level, public-domain citation data from the NIH Open Citation Collection

    Definitions for individual data fields:

    pmid: PubMed Identifier, an article ID as assigned in PubMed by the National Library of Medicine

    doi: Digital Object Identifier, if available

    year: Year the article was published

    title: Title of the article

    authors: List of author names

    journal: Journal name (ISO abbreviation)

    is_research_article: Flag indicating whether the Publication Type tags for this article are consistent with that of a primary research article

    relative_citation_ratio: Relative Citation Ratio (RCR), OPA's metric of scientific influence. Field-adjusted, time-adjusted and benchmarked against NIH-funded papers. The median RCR for NIH-funded papers in any field is 1.0. An RCR of 2.0 means a paper is receiving twice as many citations per year as the median NIH-funded paper in its field and year, while an RCR of 0.5 means that it is receiving half as many. Calculation details are documented in Hutchins et al., PLoS Biol. 2016;14(9):e1002541.

    provisional: RCRs for papers published in the previous two years are flagged as "provisional", to reflect that citation metrics for newer articles are not necessarily as stable as they are for older articles. Provisional RCRs are provided for papers published in the previous year if they have received 5 citations or more, despite being, in many cases, less than a year old. All papers published the year before the previous year receive provisional RCRs. The current year is taken to be the NIH Fiscal Year, which starts in October. For example, in July 2019 (NIH Fiscal Year 2019), papers from 2018 receive provisional RCRs if they have 5 citations or more, and all papers from 2017 receive provisional RCRs. In October 2019, at the start of NIH Fiscal Year 2020, papers from 2019 receive provisional RCRs if they have 5 citations or more, and all papers from 2018 receive provisional RCRs.

    citation_count: Number of unique articles that have cited this one

    citations_per_year: Citations per year that this article has received since its publication. If this appeared as a preprint and a published article, the year from the published version is used as the primary publication date. This is the numerator for the Relative Citation Ratio.

    field_citation_rate: Measure of the intrinsic citation rate of this paper's field, estimated using its co-citation network.

    expected_citations_per_year: Citations per year that NIH-funded articles, with the same Field Citation Rate and published in the same year as this paper, receive. This is the denominator for the Relative Citation Ratio.

    nih_percentile: Percentile rank of this paper's RCR compared to all NIH publications. For example, 95% indicates that this paper's RCR is higher than 95% of all NIH funded publications.

    human: Fraction of MeSH terms that are in the Human category (out of this article's MeSH terms that fall into the Human, Animal, or Molecular/Cellular Biology categories)

    animal: Fraction of MeSH terms that are in the Animal category (out of this article's MeSH terms that fall into the Human, Animal, or Molecular/Cellular Biology categories)

    molecular_cellular: Fraction of MeSH terms that are in the Molecular/Cellular Biology category (out of this article's MeSH terms that fall into the Human, Animal, or Molecular/Cellular Biology categories)

    x_coord: X coordinate of the article on the Triangle of Biomedicine

    y_coord: Y Coordinate of the article on the Triangle of Biomedicine

    is_clinical: Flag indicating that this paper meets the definition of a clinical article.

    cited_by_clin: PMIDs of clinical articles that this article has been cited by.

    apt: Approximate Potential to Translate is a machine learning-based estimate of the likelihood that this publication will be cited in later clinical trials or guidelines. Calculation details are documented in Hutchins et al., PLoS Biol. 2019;17(10):e3000416.

    cited_by: PMIDs of articles that have cited this one.

    references: PMIDs of articles in this article's reference list.
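
    The relationship between these fields can be checked directly from the snapshot: relative_citation_ratio = citations_per_year / expected_citations_per_year. A minimal R sketch follows; the CSV file name inside the zip is an assumption.

      # Verify RCR = citations_per_year / expected_citations_per_year on a
      # sample of rows; field names as defined in this description.
      icite <- read.csv("icite_metadata.csv", nrows = 10000)  # file name assumed

      rcr <- icite$citations_per_year / icite$expected_citations_per_year
      summary(rcr - icite$relative_citation_ratio)  # differences should be ~0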

    Large CSV files are zipped using zip version 4.5, which is more recent than the format supported by the default unzip command-line utility in some common Linux distributions. These files can be unzipped with tools that support version 4.5 or later, such as 7zip.

    Comments and questions can be addressed to iCite@mail.nih.gov

  19. iCite Database Snapshot 2023-06

    • nih.figshare.com
    bin
    Updated Jul 10, 2023
    + more versions
    Cite
    iCite; B. Ian Hutchins; George Santangelo (2023). iCite Database Snapshot 2023-06 [Dataset]. http://doi.org/10.35092/yhjc23643690.v1
    Available download formats: bin
    Dataset updated
    Jul 10, 2023
    Dataset provided by
    The NIH Figshare Archive
    Authors
    iCite; B. Ian Hutchins; George Santangelo
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This is a database snapshot of the iCite web service, provided here as a single zipped CSV file or as compressed, tarred JSON files. In addition, the citation links in the NIH Open Citation Collection are provided as a two-column CSV table in open_citation_collection.zip (see the loading sketch after the module list below). iCite provides bibliometrics and metadata on publications indexed in PubMed, organized into three modules:

    Influence: Delivers metrics of scientific influence, field-adjusted and benchmarked to NIH publications as the baseline.

    Translation: Measures how Human, Animal, or Molecular/Cellular Biology-oriented each paper is; tracks and predicts citation by clinical articles.

    Open Cites: Disseminates link-level, public-domain citation data from the NIH Open Citation Collection.
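
    The two-column table in open_citation_collection.zip pairs a citing PMID with a referenced PMID on each row. A minimal loading sketch, assuming the CSV header names the columns "citing" and "referenced" (an assumption about the file layout):

    ```python
    import csv
    import io
    import zipfile
    from collections import defaultdict

    cited_by = defaultdict(list)  # referenced PMID -> list of citing PMIDs

    with zipfile.ZipFile("open_citation_collection.zip") as zf:
        inner = zf.namelist()[0]  # the single CSV inside the archive
        with io.TextIOWrapper(zf.open(inner), encoding="utf-8") as f:
            for row in csv.DictReader(f):
                cited_by[row["referenced"]].append(row["citing"])
    ```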

    Definitions for individual data fields:

    pmid: PubMed Identifier, an article ID as assigned in PubMed by the National Library of Medicine

    doi: Digital Object Identifier, if available

    year: Year the article was published

    title: Title of the article

    authors: List of author names

    journal: Journal name (ISO abbreviation)

    is_research_article: Flag indicating whether the Publication Type tags for this article are consistent with those of a primary research article

    relative_citation_ratio: Relative Citation Ratio (RCR)--OPA's metric of scientific influence. Field-adjusted, time-adjusted, and benchmarked against NIH-funded papers. The median RCR for NIH-funded papers in any field is 1.0. An RCR of 2.0 means a paper is receiving twice as many citations per year as the median NIH-funded paper in its field and year, while an RCR of 0.5 means that it is receiving half as many. Calculation details are documented in Hutchins et al., PLoS Biol. 2016;14(9):e1002541.

    provisional: RCRs for papers published in the previous two years are flagged as "provisional", to reflect that citation metrics for newer articles are not necessarily as stable as they are for older articles. Provisional RCRs are provided for papers published in the previous year if they have received 5 or more citations, despite being, in many cases, less than a year old. All papers published the year before the previous year receive provisional RCRs. The current year is considered to be the NIH Fiscal Year, which starts in October. For example, in July 2019 (NIH Fiscal Year 2019), papers from 2018 receive provisional RCRs if they have 5 or more citations, and all papers from 2017 receive provisional RCRs. In October 2019, at the start of NIH Fiscal Year 2020, papers from 2019 receive provisional RCRs if they have 5 or more citations, and all papers from 2018 receive provisional RCRs.

    citation_count: Number of unique articles that have cited this one

    citations_per_year: Citations per year that this article has received since its publication. If an article appeared as both a preprint and a published article, the year of the published version is used as the primary publication date. This is the numerator for the Relative Citation Ratio.

    field_citation_rate: Measure of the intrinsic citation rate of this paper's field, estimated using its co-citation network.

    expected_citations_per_year: Citations per year that NIH-funded articles, with the same Field Citation Rate and published in the same year as this paper, receive. This is the denominator for the Relative Citation Ratio.

    nih_percentile: Percentile rank of this paper's RCR compared to all NIH publications. For example, 95% indicates that this paper's RCR is higher than 95% of all NIH funded publications.

    human: Fraction of MeSH terms that are in the Human category (out of this article's MeSH terms that fall into the Human, Animal, or Molecular/Cellular Biology categories)

    animal: Fraction of MeSH terms that are in the Animal category (out of this article's MeSH terms that fall into the Human, Animal, or Molecular/Cellular Biology categories)

    molecular_cellular: Fraction of MeSH terms that are in the Molecular/Cellular Biology category (out of this article's MeSH terms that fall into the Human, Animal, or Molecular/Cellular Biology categories)

    x_coord: X coordinate of the article on the Triangle of Biomedicine (see the coordinate sketch after this field list)

    y_coord: Y coordinate of the article on the Triangle of Biomedicine

    is_clinical: Flag indicating that this paper meets the definition of a clinical article.

    cited_by_clin: PMIDs of clinical articles that this article has been cited by.

    apt: Approximate Potential to Translate is a machine learning-based estimate of the likelihood that this publication will be cited in later clinical trials or guidelines. Calculation details are documented in Hutchins et al., PLoS Biol. 2019;17(10):e3000416.

    cited_by: PMIDs of articles that have cited this one.

    references: PMIDs of articles in this article's reference list.
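
    The three MeSH fractions above act as barycentric weights that place each paper inside the Triangle of Biomedicine, and x_coord/y_coord record the result. A sketch of one plausible mapping from fractions to coordinates; the vertex placement (the three categories at the corners of a unit triangle) is an assumption, not iCite's documented scaling:

    ```python
    import math

    # Assumed vertex positions; iCite's actual scaling may differ.
    VERTICES = {
        "human": (0.0, 0.0),
        "animal": (1.0, 0.0),
        "molecular_cellular": (0.5, math.sqrt(3) / 2),
    }

    def triangle_coords(human: float, animal: float, molecular_cellular: float):
        # The weights are this paper's MeSH fractions in each category
        # and are expected to sum to 1.
        weights = {
            "human": human,
            "animal": animal,
            "molecular_cellular": molecular_cellular,
        }
        x = sum(w * VERTICES[k][0] for k, w in weights.items())
        y = sum(w * VERTICES[k][1] for k, w in weights.items())
        return x, y
    ```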

    Large CSV files are zipped using zip version 4.5, which is newer than what the default unzip utility in some common Linux distributions supports. These files can be unzipped with tools that support version 4.5 or later, such as 7zip.

    Comments and questions can be addressed to iCite@mail.nih.gov

  20. Thesaurus for the definition of scientific and technological heritage

    • data.europa.eu
    json-ld, rdf turtle +1
    Ministero dei Beni Culturali, Thesaurus for the definition of scientific and technological heritage [Dataset]. https://data.europa.eu/data/datasets/iccd_st
    Available download formats: rdf xml, json-ld, rdf turtle
    Dataset authored and provided by
    Ministero dei Beni Culturali
    License

    https://w3id.org/italia/controlled-vocabulary/licences/A33_CCBYSA30IT

    Description

    Terminological tools for PST (Scientific and Technological Heritage): a thesaurus for the definition of the asset, in SKOS format.
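
    Since the thesaurus is published as RDF (Turtle, RDF/XML, JSON-LD), it can be inspected with any SKOS-aware tool. A minimal sketch using rdflib; the local filename is hypothetical:

    ```python
    from rdflib import Graph
    from rdflib.namespace import SKOS

    g = Graph()
    g.parse("iccd_st.ttl", format="turtle")  # hypothetical downloaded file

    # List every concept with its SKOS preferred label.
    for concept, label in g.subject_objects(SKOS.prefLabel):
        print(concept, label.toPython())
    ```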

LScD (Leicester Scientific Dictionary)

Description

LScD (Leicester Scientific Dictionary)April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk/suzenneslihan@hotmail.com)Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes[Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus) - Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in description of Version 2 below. We did not repeat the explanation. After pre-processing steps, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are also same as described as for LScD Version 2 below.* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2[Version 2] Getting StartedThis document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). This dictionary is created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from LSC and instructions for usage of the code are available in [2]. The code can be also used for list of texts from other sources, amendments to the code may be required.LSC is a collection of abstracts of articles and proceeding papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018. The total number of documents in LSC is 1,673,824.LScD is an ordered list of words from texts of abstracts in LSC.The dictionary stores 974,238 unique words, is sorted by the number of documents containing the word in descending order. All words in the LScD are in stemmed form of words. The LScD contains the following information:1.Unique words in abstracts2.Number of documents containing each word3.Number of appearance of a word in the entire corpusProcessing the LSCStep 1.Downloading the LSC Online: Use of the LSC is subject to acceptance of request of the link by email. To access the LSC for research purposes, please email to ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.Step 2.Importing the Corpus to R: The full R code for processing the corpus can be found in the GitHub [2].All following steps can be applied for arbitrary list of texts from any source with changes of parameter. The structure of the corpus such as file format and names (also the position) of fields should be taken into account to apply our code. The organisation of CSV files of LSC is described in README file for LSC [1].Step 3.Extracting Abstracts and Saving Metadata: Metadata that include all fields in a document excluding abstracts and the field of abstracts are separated. Metadata are then saved as MetaData.R. 
Fields of metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.

Step 4. Text Pre-processing Steps on the Collection of Abstracts: In this section, we present our approaches to pre-processing the abstracts of the LSC.

1. Removing punctuation and special characters: This is the process of substituting all non-alphanumeric characters by space. We did not substitute the character “-” in this step, because we need to keep words like “z-score”, “non-payment” and “pre-processing” in order not to lose the actual meaning of such words. A process of uniting prefixes with words is performed in later steps of pre-processing.
2. Lowercasing the text data: Lowercasing is performed to avoid treating words like “Corpus”, “corpus” and “CORPUS” differently. The entire collection of texts is converted to lowercase.
3. Uniting prefixes of words: Words containing prefixes joined with the character “-” are united as one word. The list of prefixes united for this research is in the file “list_of_prefixes.csv”. Most of the prefixes are extracted from [4]. We also added commonly used prefixes: ‘e’, ‘extra’, ‘per’, ‘self’ and ‘ultra’.
4. Substitution of words: Some words joined with “-” in the abstracts of the LSC require an additional substitution step to avoid losing their meaning before the character “-” is removed. Examples are “z-test”, “well-known” and “chi-square”, which are substituted by “ztest”, “wellknown” and “chisquare”. Such words were identified by sampling abstracts from the LSC. The full list of these words and the substitution decisions are presented in the file “list_of_substitution.csv”.
5. Removing the character “-”: All remaining “-” characters are replaced by space.
6. Removing numbers: All digits that are not part of a word are replaced by space. Words that contain both digits and letters are kept, because alphanumeric tokens such as chemical formulas might be important for our analysis. Examples are “co2”, “h2o” and “21st”.
7. Stemming: Stemming is the process of converting inflected words into their word stem. This step unites several forms of words with similar meaning into one form, and also saves memory space and time [5]. All words in the LScD are stemmed to their word stem.
8. Stop word removal: Stop words are words that are extremely common but provide little value in a language, such as ‘I’, ‘the’ and ‘a’ in English. We used the ‘tm’ package in R to remove stop words [6]; the package lists 174 English stop words.

Step 5. Writing the LScD into CSV Format: There are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written to the file “LScD.csv”.

The Organisation of the LScD

The total number of words in the file “LScD.csv” is 974,238. Each field is described below (a Python sketch of the counting logic appears after the references):

Word: Unique words from the corpus, in lowercase and in their stem forms. The field is sorted by the number of documents containing the word, in descending order.

Number of Documents Containing the Word: A binary count is used: if a word exists in an abstract, it counts as 1 for that document, even if it occurs more than once. The total number of documents containing the word is the sum of these 1s over the entire corpus.

Number of Appearances in Corpus: How many times a word occurs in the corpus when the corpus is considered as one large document.

Instructions for R Code

LScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as an RData file and in CSV format. Outputs of the code are:

Metadata File: All fields in a document excluding abstracts: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.

File of Abstracts: All abstracts after the pre-processing steps defined in Step 4.

DTM: The Document Term Matrix constructed from the LSC [6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.

LScD: An ordered list of words from the LSC as defined in the previous section.

To use the code:

1. Download the folder ‘LSC’, ‘list_of_prefixes.csv’ and ‘list_of_substitution.csv’.
2. Open the LScD_Creation.R script.
3. Change the parameters in the script: replace them with the full path of the directory containing the source files and the full path of the directory for output files.
4. Run the full code.

References

[1] N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1
[2] N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION
[3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[4] A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.
[5] C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved porter’s stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.
[6] I. Feinerer, "Introduction to the tm Package: Text Mining in R," available online: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.
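
The two counts above are simple to restate. The original pipeline is an R script [2], so the following Python sketch only illustrates the counting logic, assuming `abstracts` is the list of pre-processed, stemmed abstract texts:

```python
from collections import Counter

def build_dictionary(abstracts):
    doc_freq = Counter()     # number of documents containing each word
    corpus_freq = Counter()  # number of appearances in the entire corpus
    for text in abstracts:
        tokens = text.split()
        corpus_freq.update(tokens)
        doc_freq.update(set(tokens))  # count each word once per document
    # LScD.csv is sorted by document frequency, descending.
    order = sorted(doc_freq, key=doc_freq.get, reverse=True)
    return [(w, doc_freq[w], corpus_freq[w]) for w in order]
```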
