67 datasets found
  1. LScD (Leicester Scientific Dictionary)

    • figshare.le.ac.uk
    docx
    Updated Apr 15, 2020
    Cite
    Neslihan Suzen (2020). LScD (Leicester Scientific Dictionary) [Dataset]. http://doi.org/10.25392/leicester.data.9746900.v3
    Available download formats: docx
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

    LScD (Leicester Scientific Dictionary)
    April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

    [Version 3]
    The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus), Version 2*. All pre-processing steps applied to build this version of the dictionary are the same as in Version 2** and can be found in the description of Version 2 below; the explanation is not repeated here. After pre-processing, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are the same as those described for LScD Version 2 below.

    * Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
    ** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2

    [Version 2] Getting Started
    This document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). The dictionary is created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from the LSC, and instructions for its usage, are available in [2]. The code can also be used for lists of texts from other sources; amendments to the code may be required.

    LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains the title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. It was collected in July 2018 and contains the number of citations from publication date to July 2018. The total number of documents in LSC is 1,673,824.

    LScD is an ordered list of words from the texts of abstracts in LSC. The dictionary stores 974,238 unique words and is sorted by the number of documents containing each word, in descending order. All words in the LScD are in stemmed form. The LScD contains the following information:
    1. Unique words in abstracts
    2. Number of documents containing each word
    3. Number of appearances of a word in the entire corpus

    Processing the LSC
    Step 1. Downloading the LSC Online: Use of the LSC is subject to acceptance of a request for the link by email. To access the LSC for research purposes, please email ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.
    Step 2. Importing the Corpus to R: The full R code for processing the corpus can be found on GitHub [2]. All following steps can be applied to an arbitrary list of texts from any source, with changes of parameters. The structure of the corpus, such as the file format and the names (and positions) of fields, should be taken into account when applying the code. The organisation of the CSV files of LSC is described in the README file for LSC [1].
    Step 3. Extracting Abstracts and Saving Metadata: The metadata (all fields in a document excluding abstracts) and the abstracts are separated. Metadata are then saved as MetaData.R. Fields of metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.
    Step 4. Text Pre-processing Steps on the Collection of Abstracts: This section presents the approaches used to pre-process the abstracts of the LSC.
    1. Removing punctuation and special characters: all non-alphanumeric characters are substituted by a space. The character "-" is not substituted in this step, because words like "z-score", "non-payment" and "pre-processing" need to be kept so as not to lose their actual meaning. A process of uniting prefixes with words is performed in a later step.
    2. Lowercasing the text data: lowercasing is performed to avoid treating words like "Corpus", "corpus" and "CORPUS" differently. The entire collection of texts is converted to lowercase.
    3. Uniting prefixes of words: words containing prefixes joined with the character "-" are united into a single word. The prefixes united for this research are listed in the file "list_of_prefixes.csv". Most of the prefixes were extracted from [4]; the commonly used prefixes 'e', 'extra', 'per', 'self' and 'ultra' were also added.
    4. Substitution of words: some words joined with "-" in the abstracts of the LSC require an additional substitution step to avoid losing their meaning when the character "-" is removed. Examples of such words are "z-test", "well-known" and "chi-square", which are substituted by "ztest", "wellknown" and "chisquare". Identification of such words was done by sampling abstracts from the LSC. The full list of such words and the decisions taken for substitution are presented in the file "list_of_substitution.csv".
    5. Removing the character "-": all remaining "-" characters are replaced by a space.
    6. Removing numbers: all digits that are not part of a word are replaced by a space. Words that contain both digits and letters are kept, because alphanumeric tokens such as chemical formulas might be important for the analysis; examples are "co2", "h2o" and "21st".
    7. Stemming: stemming is the process of converting inflected words into their word stem. This step unites several forms of words with similar meaning into one form and also saves memory space and time [5]. All words in the LScD are stemmed to their word stem.
    8. Stop word removal: stop words are words that are extremely common but provide little value in a language; common English stop words are 'I', 'the', 'a', etc. The 'tm' package in R is used to remove stop words [6]; there are 174 English stop words listed in the package.
    Step 5. Writing the LScD into CSV Format: there are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written to the file "LScD.csv".

    The Organisation of the LScD
    The total number of words in the file "LScD.csv" is 974,238. Each field is described below:
    Word: unique words from the corpus, in lowercase and stemmed form. The field is sorted by the number of documents that contain the word, in descending order.
    Number of Documents Containing the Word: a binary count is used: if a word exists in an abstract, it counts as 1; if the word exists more than once in the same document, the count is still 1. The total number of documents containing the word is the sum of these 1s over the entire corpus.
    Number of Appearances in Corpus: how many times a word occurs in the corpus when the corpus is considered as one large document.

    Instructions for R Code
    LScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as an RData file and in CSV format. Outputs of the code are:
    Metadata File: all fields in a document excluding abstracts. Fields are List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.
    File of Abstracts: all abstracts after the pre-processing steps defined in Step 4.
    DTM: the Document Term Matrix constructed from the LSC [6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.
    LScD: an ordered list of words from LSC as defined in the previous section.
    The code can be used as follows:
    1. Download the folder 'LSC' and the files 'list_of_prefixes.csv' and 'list_of_substitution.csv'.
    2. Open the LScD_Creation.R script.
    3. Change the parameters in the script: replace them with the full path of the directory with the source files and the full path of the directory to write the output files.
    4. Run the full code.

    References
    [1] N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1
    [2] N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION
    [3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
    [4] A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.
    [5] C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved Porter's stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.
    [6] I. Feinerer, "Introduction to the tm Package: Text Mining in R," available online: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.
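The dataset's own pipeline is the R script LScD_Creation.R referenced in [2]. Purely as a minimal sketch of the Step 4 chain and the two word counts, assuming a small character vector of abstracts and using one illustrative prefix rule and one substitution in place of the full list_of_prefixes.csv and list_of_substitution.csv:

```r
library(tm)  # stopwords(), removeWords(), stemDocument() (stemming needs SnowballC)

# Sketch of the Step 4 pre-processing chain for a character vector.
pre_process <- function(txt) {
  txt <- gsub("[^[:alnum:]-]", " ", txt)             # 1. drop punctuation, keep "-"
  txt <- tolower(txt)                                # 2. lowercase
  txt <- gsub("\\bpre-", "pre", txt)                 # 3. unite one example prefix
  txt <- gsub("z-test", "ztest", txt, fixed = TRUE)  # 4. one example substitution
  txt <- gsub("-", " ", txt, fixed = TRUE)           # 5. remove remaining "-"
  txt <- gsub("\\b[[:digit:]]+\\b", " ", txt)        # 6. drop standalone numbers
  txt <- stemDocument(txt)                           # 7. Porter stemming
  removeWords(txt, stopwords("english"))             # 8. tm's 174 English stop words
}

abstracts <- c("The z-test and pre-processing of co2 data",
               "A corpus of 2014 abstracts was pre-processed")
clean  <- pre_process(abstracts)
tokens <- strsplit(trimws(clean), "[[:space:]]+")

# The two counts stored in LScD.csv: number of documents containing each
# word (binary per abstract) and total appearances in the whole corpus.
doc_freq  <- sort(table(unlist(lapply(tokens, unique))), decreasing = TRUE)
corp_freq <- table(unlist(tokens))
```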

  2. Dataset: A Systematic Literature Review on the topic of High-value datasets

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 23, 2023
    Cite
    Anastasija Nikiforova (2023). Dataset: A Systematic Literature Review on the topic of High-value datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7944424
    Dataset updated
    Jun 23, 2023
    Dataset provided by
    Charalampos Alexopoulos
    Magdalena Ciesielska
    Nina Rizun
    Andrea Miletič
    Anastasija Nikiforova
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains data collected during a study ("Towards High-Value Datasets determination for data-driven development: a systematic literature review") conducted by Anastasija Nikiforova (University of Tartu), Nina Rizun, Magdalena Ciesielska (Gdańsk University of Technology), Charalampos Alexopoulos (University of the Aegean) and Andrea Miletič (University of Zagreb). It is made public both to act as supplementary data for the "Towards High-Value Datasets determination for data-driven development: a systematic literature review" paper (a pre-print is available in Open Access at https://arxiv.org/abs/2305.10234) and to let other researchers use these data in their own work.

    The protocol is intended for the systematic literature review on the topic of high-value datasets, with the aim of gathering information on how the topic of high-value datasets (HVD) and their determination has been reflected in the literature over the years and what has been found by these studies to date, including the indicators used, the stakeholders involved, data-related aspects, and frameworks. The data in this dataset were collected as the result of the SLR over Scopus, Web of Science, and the Digital Government Research Library (DGRL) in 2023.

    Methodology

    To understand how HVD determination has been reflected in the literature over the years and what has been found by these studies to date, all relevant literature covering this topic was studied. To this end, the SLR was carried out by searching the digital libraries covered by Scopus, Web of Science (WoS), and the Digital Government Research Library (DGRL).

    These databases were queried for the keywords ("open data" OR "open government data") AND ("high-value data*" OR "high value data*"), applied to the article title, keywords, and abstract to limit the results to papers in which these concepts were primary research objects rather than merely mentioned in the body, e.g., as future work. After deduplication, 11 unique articles were found and further checked for relevance. As a result, a total of 9 articles were examined further. Each study was independently examined by at least two authors.

    To attain the objective of our study, we developed a protocol in which the information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design-related information, (3) quality-related information, (4) HVD determination-related information.

    Test procedure: Each study was independently examined by at least two authors; after an in-depth examination of the full text of the article, the structured protocol was filled in for each study. The structure of the survey is available in the supplementary files (see Protocol_HVD_SLR.odt, Protocol_HVD_SLR.docx). The data collected for each study by two researchers were then synthesized into one final version by a third researcher.

    Description of the data in this data set

    Protocol_HVD_SLR provides the structure of the protocol. Spreadsheet #1 provides the filled protocol for the relevant studies. Spreadsheet #2 provides the list of results from the search over the three indexing databases, i.e., before filtering out irrelevant studies.

    The information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design-related information, (3) quality-related information, (4) HVD determination-related information.

    Descriptive information
    1) Article number - a study number, corresponding to the study number assigned in an Excel worksheet
    2) Complete reference - the complete source information to refer to the study
    3) Year of publication - the year in which the study was published
    4) Journal article / conference paper / book chapter - the type of the paper: {journal article, conference paper, book chapter}
    5) DOI / Website - a link to the website where the study can be found
    6) Number of citations - the number of citations of the article in Google Scholar, Scopus, and Web of Science
    7) Availability in OA - availability of the article in Open Access
    8) Keywords - keywords of the paper as indicated by the authors
    9) Relevance for this study - the relevance level of the article for this study: {high / medium / low}

    Approach- and research design-related information
    10) Objective / RQ - the research objective / aim and the established research questions
    11) Research method (including unit of analysis) - the methods used to collect data, including the unit of analysis (country, organisation, specific unit that has been analysed, e.g., the number of use-cases, scope of the SLR, etc.)
    12) Contributions - the contributions of the study
    13) Method - whether the study uses a qualitative, quantitative, or mixed-methods approach
    14) Availability of the underlying research data - whether there is a reference to publicly available underlying research data, e.g., transcriptions of interviews or collected data, or an explanation of why these data are not shared
    15) Period under investigation - the period (or moment) in which the study was conducted
    16) Use of theory / theoretical concepts / approaches - whether the study mentions any theory / theoretical concepts / approaches and, if so, how theory is used in the study

    Quality- and relevance-related information
    17) Quality concerns - whether there are any quality concerns (e.g., limited information about the research methods used)
    18) Primary research object - whether the HVD is a primary research object in the study (primary - the paper is focused on HVD determination; secondary - HVD are mentioned but not studied, e.g., as part of the discussion or future work)

    HVD determination-related information
    19) HVD definition and type of value - how the HVD is defined in the article and / or any other equivalent term used
    20) HVD indicators - the indicators used to identify HVD and how they were identified (components & relationships, "input -> output")
    21) A framework for HVD determination - whether a framework for HVD identification is presented, which components it consists of, and the relationships between these components (detailed description)
    22) Stakeholders and their roles - the stakeholders or actors involved in HVD determination and their roles
    23) Data - the data covered by HVD
    24) Level (if relevant) - the level of HVD determination covered in the article (e.g., city, regional, national, international)

    Format of the files: .xls, .csv (for the first spreadsheet only), .odt, .docx

    Licenses or restrictions: CC-BY

    For more info, see README.txt

  3. Data Science Glossary For QA

    • kaggle.com
    Updated Mar 8, 2024
    Cite
    Sofianesun (2024). Data Science Glossary For QA [Dataset]. https://www.kaggle.com/datasets/sofianesun/data-science-glossary-for-qa
    Available download formats: Croissant (mlcommons.org/croissant)
    Dataset updated
    Mar 8, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sofianesun
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    A dataset for the first task ("Explain or teach basic data science concepts") of the competition Google – AI Assistants for Data Tasks with Gemma. It contains several Data Science glossaries, where every sample has two keys: term (the vocabulary entry) and definition.

  4. Dataset of book subjects that contain Dictionary of science

    • workwithdata.com
    Updated Nov 7, 2024
    Cite
    Work With Data (2024). Dataset of book subjects that contain Dictionary of science [Dataset]. https://www.workwithdata.com/datasets/book-subjects?f=1&fcol0=j0-book&fop0=%3D&fval0=Dictionary+of+science&j=1&j0=books
    Dataset updated
    Nov 7, 2024
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about book subjects. It has 2 rows and is filtered to subjects where the book is Dictionary of science. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.

  5. Scientific JOURNALS Indicators & Info - SCImagoJR

    • kaggle.com
    Updated Apr 9, 2025
    Cite
    Ali Jalaali (2025). Scientific JOURNALS Indicators & Info - SCImagoJR [Dataset]. https://www.kaggle.com/datasets/alijalali4ai/scimagojr-scientific-journals-dataset
    Available download formats: Croissant (mlcommons.org/croissant)
    Dataset updated
    Apr 9, 2025
    Dataset provided by
    Kaggle
    Authors
    Ali Jalaali
    Description


    The SCImago Journal & Country Rank is a publicly available portal that includes the journals and country scientific indicators developed from the information contained in the Scopus® database (Elsevier B.V.). These indicators can be used to assess and analyze scientific domains. Journals can be compared or analysed separately.


    💬 Also have a look at:
    💡 COUNTRIES Research & Science Dataset - SCImagoJR
    💡 UNIVERSITIES & Research INSTITUTIONS Rank - SCImagoIR

    • Journals can be grouped by subject area (27 major thematic areas), subject category (309 specific subject categories), or by country.
    • Citation data are drawn from over 34,100 titles from more than 5,000 international publishers.
    • The platform takes its name from the SCImago Journal Rank (SJR) indicator, developed by SCImago from the widely known Google PageRank™ algorithm. This indicator shows the visibility of the journals contained in the Scopus® database from 1996 onward.
    • SCImago is a research group from the Consejo Superior de Investigaciones Científicas (CSIC) and the universities of Granada, Extremadura, Carlos III (Madrid) and Alcalá de Henares, dedicated to information analysis, representation and retrieval by means of visualisation techniques.

    The entire dataset is obtained from public and open-access data of ScimagoJR (SCImago Journal & Country Rank):
    ScimagoJR Journal Rank
    SCImagoJR About Us

    Available indicators:

    • SJR (SCImago Journal Rank) indicator: expresses the average number of weighted citations received in the selected year by the documents published in the selected journal in the three previous years, i.e., weighted citations received in year X to documents published in the journal in years X-1, X-2 and X-3. See the detailed description of SJR (PDF).
    • H Index: the h index expresses the journal's number of articles (h) that have received at least h citations. It quantifies both journal scientific productivity and scientific impact, and it is also applicable to scientists, countries, etc. (see the H-index Wikipedia definition, and the sketch after this list).
    • Total Documents: output of the selected period. All types of documents are considered, including citable and non-citable documents.
    • Total Documents (3 years): documents published in the three previous years (selected-year documents are excluded), i.e., when year X is selected, documents published in X-1, X-2 and X-3 are retrieved. All types of documents are considered, including citable and non-citable documents.
    • Citable Documents (3 years): number of citable documents published by a journal in the three previous years (selected-year documents are excluded). Exclusively articles, reviews and conference papers are considered.
    • Non-citable Documents (available in the graphics): ratio of non-citable documents in the period being considered.
    • Total Cites (3 years): number of citations received in the selected year by a journal to the documents published in the three previous years, i.e., citations received in year X to documents published in years X-1, X-2 and X-3. All types of documents are considered.
    • Cites per Document (2 years): average citations per document in a 2-year period, computed as the citations received by a journal in the current year to the documents published in the two previous years, i.e., citations received in year X to documents published in years X-1 and X-2.
    • Cites per Document (3 years): average citations per document in a 3-year period, computed as the citations received by a journal in the current year to the documents published in the three previous years, i.e., citations received in year X to documents published in years X-1, X-2 and X-3.
    • Self Cites: number of a journal's self-citations in the selected year to its own documents published in the three previous years, i.e., self-citations in year X to documents published in years X-1, X-2 and X-3. All types of documents are considered.
    • Cited Documents: number of documents cited at least once in the three previous years, i.e., years X-1, X-2 and X-3.
    • Uncited Documents: number of uncited documents in the three previous years, i.e., years X-1, X-2 and X-3.
    • Total References: all bibliographical references in a journal in the selected period.
    • References per Document: average number of references per document in the selected year.
    • % International Collaboration: ratio of documents whose affiliations include more than one country address.
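To make two of these definitions concrete, here is a minimal R sketch; the citation counts are invented for illustration and this is not SCImago's own implementation:

```r
# h index: the largest h such that h articles have at least h citations each.
h_index <- function(cites) {
  ranked <- sort(cites, decreasing = TRUE)
  sum(ranked >= seq_along(ranked))
}
h_index(c(10, 8, 5, 4, 3))  # 4: four articles each received at least 4 citations

# Cites per Document (3 years) for a selected year X: citations received in
# year X to documents published in X-1, X-2 and X-3, over those documents.
cites_3y <- 1200   # invented: citations received in year X to X-1..X-3 documents
docs_3y  <- 400    # invented: documents published in X-1..X-3
cites_3y / docs_3y # 3 citations per document
```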
  6. Dataset of books called Dictionary of environmental science and technology

    • workwithdata.com
    Updated Apr 17, 2025
    Cite
    Work With Data (2025). Dataset of books called Dictionary of environmental science and technology [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Dictionary+of+environmental+science+and+technology
    Dataset updated
    Apr 17, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about books. It has 5 rows and is filtered where the book is Dictionary of environmental science and technology. It features 7 columns including author, publication date, language, and book publisher.

  7. PDS Data Dictionary (1r47)

    • catalog.data.gov
    • datasets.ai
    Updated Aug 30, 2025
    Cite
    National Aeronautics and Space Administration (2025). PDS Data Dictionary (1r47) [Dataset]. https://catalog.data.gov/dataset/pds-data-dictionary-1r47
    Dataset updated
    Aug 30, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    Data Dictionary and Index Files

  8. INTEGRAL Science Window Data - Dataset - NASA Open Data Portal

    • data.nasa.gov
    Updated Apr 1, 2025
    Cite
    nasa.gov (2025). INTEGRAL Science Window Data - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/integral-science-window-data
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    Because of the pointing-slew-pointing dithering nature of INTEGRAL operations, each observation of a celestial target actually comprises numerous individual S/C pointings and slews. In addition, there are periods within a given sequence where no scheduled observations occur, i.e., engineering windows, yet the instruments still acquire data. The INTEGRAL Science Data Center (ISDC) generalizes all of these data acquisition periods into so-called "Science Windows". A Science Window (ScW) is a continuous time interval during which all data acquired by the INTEGRAL instruments result from a specific S/C attitude orientation state. Pointing (fixed orientation), Slew (changing orientation), and Engineering (undefined orientation) windows are all special cases of a Science Window. The key point is that the same attitude information may be associated with all acquired data of a given Science Window. Note that it is possible to divide a time interval that qualifies as a Science Window under this definition into several smaller Science Windows using arbitrary criteria. The INTEGRAL Science Window Data Catalog allows for the keyed search and selection of sets of Science Windows and the retrieval of the corresponding data products. This database table was first created at the HEASARC in October 2004. It is a slightly modified mirror of the online database maintained by the ISDC at the URL http://isdc.unige.ch/index.cgi?Data+browse

    The HEASARC version of this table is updated automatically within a day of the ISDC updating their database table. This is a service provided by NASA HEASARC.

  9. WiDS data dictionary v2.xlsx

    • kaggle.com
    Updated Feb 13, 2018
    Cite
    VivekSingh (2018). WiDS data dictionary v2.xlsx [Dataset]. https://www.kaggle.com/viveksinghub/wids-data-dictionary-v2xlsx/code
    Available download formats: Croissant (mlcommons.org/croissant)
    Dataset updated
    Feb 13, 2018
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    VivekSingh
    Description

    Dataset

    This dataset was created by VivekSingh

    Released under Data files © Original Authors


  10. Replication Data for: Automated Dictionary Generation for Political Event Coding

    • dataverse.harvard.edu
    Updated Nov 12, 2018
    Cite
    Benjamin J Radford (2018). Replication Data for: Automated Dictionary Generation for Political Event Coding [Dataset]. http://doi.org/10.7910/DVN/N3Y3MF
    Available download formats: Croissant (mlcommons.org/croissant)
    Dataset updated
    Nov 12, 2018
    Dataset provided by
    Harvard Dataverse
    Authors
    Benjamin J Radford
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Event data provide high-resolution and high-volume information about political events and have supported a variety of research efforts across fields within and beyond political science. While these datasets are machine coded from vast amounts of raw text input, the necessary dictionaries require substantial prior knowledge and human effort to produce and update, effectively limiting the application of automated event-coding solutions to those domains for which dictionaries already exist. I introduce a novel method for generating dictionaries appropriate for event coding given only a small sample dictionary. This technique leverages recent advances in natural language processing and machine learning to reduce the prior knowledge and researcher-hours required to go from defining a new domain-of-interest to producing structured event data that describe that domain. I evaluate the method via actor-country classification and demonstrate the method’s ability to generalize to new domains with the production of a novel event dataset on cybersecurity.

  11. Latest Data Professionals Salary Dataset

    • kaggle.com
    Updated Jul 9, 2023
    Cite
    Aman Chauhan (2023). Latest Data Professionals Salary Dataset [Dataset]. https://www.kaggle.com/datasets/whenamancodes/data-professionals-salary-dataset-2022
    Available download formats: Croissant (mlcommons.org/croissant)
    Dataset updated
    Jul 9, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aman Chauhan
    Description

    About Dataset

    Context

    Analytics refers to the methodical examination and calculation of data or statistics. Its purpose is to uncover, interpret, and convey meaningful patterns found within the data. Additionally, analytics involves utilizing these data patterns to make informed decisions. It proves valuable in domains abundant with recorded information, employing a combination of statistics, computer programming, and operations research to measure performance.

    Businesses can leverage analytics to describe, predict, and enhance their overall performance. Various branches of analytics encompass predictive analytics, prescriptive analytics, enterprise decision management, descriptive analytics, cognitive analytics, Big Data Analytics, retail analytics, supply chain analytics, store assortment and stock-keeping unit optimization, marketing optimization and marketing mix modeling, web analytics, call analytics, speech analytics, sales force sizing and optimization, price and promotion modeling, predictive science, graph analytics, credit risk analysis, and fraud analytics. Due to the extensive computational requirements involved (particularly with big data), analytics algorithms and software utilize state-of-the-art methods from computer science, statistics, and mathematics.

    Data Dictionary

    Each column is described below; a loading sketch follows the file list in the Content section.

    Company Name - the name of the organization or company where an individual is employed. It represents the specific entity that provides job opportunities and is associated with a particular industry or sector.
    Job Title - the official designation or position held by an individual within a company or organization. It represents the specific role or responsibilities assigned to the person in their professional capacity.
    Salaries Reported - information related to the salaries of employees within a company or industry. These data may be collected and reported through various sources, such as surveys, employee disclosures, or public records.
    Location - the specific geographical location or area where a company or job position is situated. It provides information about the physical location or address associated with the company's operations or the job's work environment.
    Salary - the monetary compensation or remuneration received by an employee in exchange for their work or services. It represents the amount of money paid to an individual on a regular basis, typically in the form of wages or a fixed annual income.

    Content

    This Dataset consists of salaries for Data Scientists, Machine Learning Engineers, Data Analysts, and Data Engineers in various cities across India (2022).

    - Salary Dataset.csv
    - Partially Cleaned Salary Dataset.csv
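A minimal R sketch for loading the main file and checking it against the data dictionary above; the column names are assumed to match the dictionary and may need adjusting to the CSV's actual header:

```r
# Load the salary table and verify the five documented columns exist.
salaries <- read.csv("Salary Dataset.csv", check.names = FALSE)

expected <- c("Company Name", "Job Title", "Salaries Reported",
              "Location", "Salary")
stopifnot(all(expected %in% names(salaries)))

str(salaries)  # column types and sample values
```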

    Acknowledgements

    This dataset was created from https://www.glassdoor.co.in/. To learn more, visit the website.

  12. INTEGRAL Science Window Data

    • s.cnmilf.com
    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    Updated Aug 22, 2025
    Cite
    High Energy Astrophysics Science Archive Research Center (2025). INTEGRAL Science Window Data [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/integral-science-window-data
    Dataset updated
    Aug 22, 2025
    Dataset provided by
    High Energy Astrophysics Science Archive Research Center
    Description

    Because of the pointing-slew-pointing dithering nature of INTEGRAL operations, each observation of a celestial target actually comprises numerous individual S/C pointings and slews. In addition, there are periods within a given sequence where no scheduled observations occur, i.e., engineering windows, yet the instruments still acquire data. The INTEGRAL Science Data Center (ISDC) generalizes all of these data acquisition periods into so-called "Science Windows". A Science Window (ScW) is a continuous time interval during which all data acquired by the INTEGRAL instruments result from a specific S/C attitude orientation state. Pointing (fixed orientation), Slew (changing orientation), and Engineering (undefined orientation) windows are all special cases of a Science Window. The key point is that the same attitude information may be associated with all acquired data of a given Science Window. Note that it is possible to divide a time interval that qualifies as a Science Window under this definition into several smaller Science Windows using arbitrary criteria. The INTEGRAL Science Window Data Catalog allows for the keyed search and selection of sets of Science Windows and the retrieval of the corresponding data products. This database table was first created at the HEASARC in October 2004. It is a slightly modified mirror of the online database maintained by the ISDC at the URL http://isdc.unige.ch/index.cgi?Data+browse

    The HEASARC version of this table is updated automatically within a day of the ISDC updating their database table. This is a service provided by NASA HEASARC.

  13. Dataset of book subjects that contain Longman illustrated dictionary of computing science : computers and their application

    • workwithdata.com
    Updated Nov 7, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Work With Data (2024). Dataset of book subjects that contain Longman illustrated dictionary of computing science : computers and their application [Dataset]. https://www.workwithdata.com/datasets/book-subjects?f=1&fcol0=j0-book&fop0=%3D&fval0=Longman+illustrated+dictionary+of+computing+science+:+computers+and+their+application&j=1&j0=books
    Dataset updated
    Nov 7, 2024
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about book subjects. It has 1 row and is filtered to subjects where the book is Longman illustrated dictionary of computing science : computers and their application. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.

  14. Data for: Sustainable connectivity in a community repository

    • explore.openaire.eu
    • data.niaid.nih.gov
    Updated Dec 7, 2023
    Cite
    Ted Habermann (2023). Data for: Sustainable connectivity in a community repository [Dataset]. http://doi.org/10.5061/dryad.nzs7h44xr
    Dataset updated
    Dec 7, 2023
    Authors
    Ted Habermann
    Description

    Data For: Sustainable Connectivity in a Community Repository

    ## GENERAL INFORMATION

    This readme.txt file was generated on 20231110 by Ted Habermann.

    ### Title of Dataset
    Data For: Sustainable Connectivity in a Community Repository

    ### Author Information
    Principal Investigator Contact Information
    Name: Ted Habermann (0000-0003-3585-6733)
    Institution: Metadata Game Changers
    Email:
    ORCID: 0000-0003-3585-6733

    ### Date published or finalized for release
    November 10, 2023

    ### Date of data collection (single date, range, approximate date)
    May and June 2023

    ### Information about funding sources that supported the collection of the data
    National Science Foundation (Crossref Funder ID: 100000001), Award 2134956.

    ### Overview of the data (abstract)
    These data are Dryad metadata retrieved via the Dryad API and translated into csv files. There are two datasets:
    1. DryadJournalDataset was retrieved from Dryad using the ISSNs in the file DryadJournalDataset_ISSNs.txt, although some had no data.
    2. DryadOrganizationDataset was retrieved from Dryad using the RORs in the file DryadOrganizationDataset_RORs.txt, although some had no data.

    Each dataset includes four types of metadata: identifiers, funders, keywords, and related works, each in a separate comma (.csv) or tab (.tsv) delimited file. There are also Microsoft Excel files (.xlsx) for the identifier metadata and connectivity summaries for each dataset (*.html). The connectivity summaries include summaries of each parameter in all four data files with definitions, counts, unique counts, most frequent values, and completeness. These data formed the basis for an analysis of the connectivity of the Dryad repository for organizations, funders, and people.

    | Size | FileName |
    | --------: | :------------------------------------------------------ |
    | 90541505 | DryadJournalDataset_Identifiers_20230520_12.csv |
    | 9017051 | DryadJournalDataset_funders_20230520_12.tsv |
    | 29108477 | DryadJournalDataset_keywords_20230520_12.tsv |
    | 8833842 | DryadJournalDataset_relatedWorks_20230520_12.tsv |
    | 18260935 | DryadOrganizationDataset_funders_20230601_12.tsv |
    | 240128730 | DryadOrganizationDataset_identifiers_20230601_12.tsv |
    | 39600659 | DryadOrganizationDataset_keywords_20230601_12.tsv |
    | 11520475 | DryadOrganizationDataset_relatedWorks_20230601_12.tsv |
    | 40726143 | DryadJournalDataset_identifiers_20230520_12.xlsx |
    | 81894301 | DryadOrganizationDataset_identifiers_20230601_12.xlsx |
    | 842827 | DryadJournalDataset_ConnectivitySummary.html |
    | 387551 | DryadOrganizationDataset_ConnectivitySummary.html |

    ### Field Definitions

    ## SHARING/ACCESS INFORMATION

    Licenses/restrictions placed on the data: Creative Commons Public Domain License (CC0)
    Links to publications that cite or use the data: TBD
    Was data derived from another source? No

    ## DATA & FILE OVERVIEW

    ### File List
    A. *Dataset_identifiers_YYYYMMDD_HH.*sv: identifier metadata from Dryad for Dataset collected at YYYYMMDD_HH using the Dryad API.
    B. *Dataset_funders_YYYYMMDD_HH.*sv: funder metadata from Dryad for Dataset collected at YYYYMMDD_HH using the Dryad API.
    C. *Dataset_keywords_YYYYMMDD_HH.*sv: keyword metadata from Dryad for Dataset collected at YYYYMMDD_HH using the Dryad API.
    D. *Dataset_relatedWorks_YYYYMMDD_HH.*sv: related work metadata from Dryad for Dataset collected at YYYYMMDD_HH using the Dryad API.
    E. *Dataset_identifiers_YYYYMMDD_HH.xlsx: Excel spreadsheet with identifier metadata from Dryad for Dataset collected at YYYYMMDD_HH using the Dryad API.
    F. *Dataset_ConnectivitySummary.html: connectivity summary for Dataset.
    G. summarizeConnectivity.ipynb: Python notebook with code for creating connectivity summaries and plots.

    ### Relationship between files
    All files with the same dataset name make up a dataset. The .*sv files are the original metadata extracted from Dryad.

    ## METHODOLOGICAL INFORMATION

    Description of methods used for collection/generation of data: most of the analysis is simply extracting and comparing counts of various metadata elements.

    ## DATA-SPECIFIC INFORMATION

    See the connectivity summaries (*ConnectivitySummary.html) for a list of parameters in each file and summaries of their values.

    ### Identifier Metadata
    The identifier metadata datasets include the following fields:

    | Field | Definition |
    | :--- | :--- |
    | DOI | Digital object identifier for the dataset |
    | title | Title for the dataset |
    | datePublished | Date dataset published |
    | relatedPublicationISSN | International Standard Serial Number for journal with related publication |
    | primary_article | Digital object identifier for pr...
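The dataset's own analysis code is the Python notebook summarizeConnectivity.ipynb listed above. Purely as an illustration of the statistics the connectivity summaries report (counts, unique counts, most frequent values, completeness), here is a small R sketch over one of the tab-delimited files:

```r
# Summarize every column of one Dryad metadata file with the statistics
# the connectivity summaries describe. File name taken from the table above.
tsv <- read.delim("DryadJournalDataset_funders_20230520_12.tsv",
                  stringsAsFactors = FALSE)

summarize_field <- function(x) {
  present <- !(is.na(x) | x == "")
  c(count        = sum(present),                # non-empty values
    unique_count = length(unique(x[present])),  # distinct values
    top_value    = if (any(present))
                     names(which.max(table(x[present]))) else NA,
    completeness = round(mean(present), 3))     # share of rows filled
}

t(sapply(tsv, summarize_field))  # one summary row per metadata field
```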

  15. Data Dictionary for Electron Microprobe Data Collected with Probe for EPMA Software Package Developed by Probe Software

    • data.usgs.gov
    • res1catalogd-o-tdatad-o-tgov.vcapture.xyz
    Cite
    Heather Lowers, Data Dictionary for Electron Microprobe Data Collected with Probe for EPMA Software Package Developed by Probe Software [Dataset]. http://doi.org/10.5066/P91HKRPM
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Authors
    Heather Lowers
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Time period covered
    Jan 1, 1995 - Dec 31, 2050
    Description

    This data dictionary describes most of the possible output options given in the Probe for EPMA software package developed by Probe Software. Examples of the data output options include sample identification, analytical conditions, elemental weight percents, atomic percents, detection limits, and stage coordinates. Many more options are available and the data that is output will depend upon the end use.

  16. Data from: Trusted Research Environments: Analysis of Characteristics and Data Availability

    • researchdata.tuwien.ac.at
    bin, csv
    Updated Jun 25, 2024
    Cite
    Martin Weise; Andreas Rauber (2024). Trusted Research Environments: Analysis of Characteristics and Data Availability [Dataset]. http://doi.org/10.48436/cv20m-sg117
    Available download formats: bin, csv
    Dataset updated
    Jun 25, 2024
    Dataset provided by
    TU Wien
    Authors
    Martin Weise; Andreas Rauber
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Trusted Research Environments (TREs) enable analysis of sensitive data under strict security assertions that protect the data with technical, organizational, and legal measures from (accidentally) being leaked outside the facility. While many TREs exist in Europe, little information is publicly available on their architecture, descriptions of their building blocks, and their slight technical variations. To shed light on these problems, we give an overview of existing, publicly described TREs and a bibliography linking to the system descriptions. We further analyze their technical characteristics, especially their commonalities and variations, and provide insight into their data type characteristics and availability. Our literature study shows that 47 TREs worldwide provide access to sensitive data, of which two-thirds provide data themselves, predominantly via secure remote access. Statistical offices make available the majority of the sensitive data records included in this study.

    Methodology

    We performed a literature study covering 47 TREs worldwide, using scholarly databases (Scopus, Web of Science, IEEE Xplore, Science Direct), a computer science library (dblp.org), Google, and grey literature, focusing on retrieving the following source material:

    • Peer-reviewed articles where available,
    • TRE websites,
    • TRE metadata catalogs.

    The goal of this literature study is to discover existing TREs and analyze their characteristics and data availability, giving an overview of the available infrastructure for sensitive-data research as many European initiatives have been emerging in recent months.

    Technical details

    This dataset consists of five comma-separated values (.csv) files describing our inventory:

    • countries.csv: Table of countries with columns id (number), name (text) and code (text, in ISO 3166-A3 encoding, optional)
    • tres.csv: Table of TREs with columns id (number), name (text), countryid (number, referring to column id of table countries), structureddata (bool, optional), datalevel (one of [1=de-identified, 2=pseudonymized, 3=anonymized], optional), outputcontrol (bool, optional), inceptionyear (date, optional), records (number, optional), datatype (one of [1=claims, 2=linked records], optional), statistics_office (bool), size (number, optional), source (text, optional), comment (text, optional)
    • access.csv: Table of access modes of TREs with columns id (number), suf (bool, optional), physical_visit (bool, optional), external_physical_visit (bool, optional), remote_visit (bool, optional)
    • inclusion.csv: Table of TREs included in the literature study with columns id (number), included (bool), exclusion reason (one of [peer review, environment, duplicate], optional), comment (text, optional)
    • major_fields.csv: Table of data categorization into the major research fields with columns id (number), life_sciences (bool, optional), physical_sciences (bool, optional), arts_and_humanities (bool, optional), social_sciences (bool, optional).

    Additionally, a MariaDB (10.5 or higher) schema definition .sql file is needed to properly model the schema in a database:

    • schema.sql: Schema definition file to create the tables and views used in the analysis.

    The analysis was done in a Jupyter notebook, which can be found in our source code repository: https://gitlab.tuwien.ac.at/martin.weise/tres/-/blob/master/analysis.ipynb
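The analysis itself lives in the notebook linked above. As a minimal sketch of how the inventory tables fit together, assuming only the column names listed for countries.csv and tres.csv, the files can be loaded and joined in R:

```r
countries <- read.csv("countries.csv")  # columns: id, name, code
tres      <- read.csv("tres.csv")       # columns: id, name, countryid, ...

# Resolve the countryid foreign key to country names; the overlapping
# "name" column gets disambiguating suffixes.
inventory <- merge(tres, countries,
                   by.x = "countryid", by.y = "id",
                   suffixes = c("_tre", "_country"))

# Example query: TREs with output control, with their record counts
# (assumes outputcontrol is parsed as a logical column).
subset(inventory, outputcontrol == TRUE,
       select = c(name_tre, name_country, records))
```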

  17. Dictionary of Microsatellite Loci from the U.S. Geological Survey, Alaska Science Center, Molecular Ecology Laboratory

    • catalog.data.gov
    • data.usgs.gov
    Updated Nov 10, 2024
    Cite
    U.S. Geological Survey (2024). Dictionary of Microsatellite Loci from the U.S. Geological Survey, Alaska Science Center, Molecular Ecology Laboratory [Dataset]. https://catalog.data.gov/dataset/dictionary-of-microsatellite-loci-from-the-u-s-geological-survey-alaska-science-center-mol
    Dataset updated
    Nov 10, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    This data package provides a dictionary of microsatellite locus primers used by the USGS Alaska Science Center, Molecular Ecology Laboratory (MEL). It is a look-up file of microsatellite locus names and citations to the original publication or source where additional information about the locus primers may be found.

  18. Definition of our abstract data model.

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Jun 3, 2023
    Cite
    Vanessa Cedeno-Mieles; Zhihao Hu; Yihui Ren; Xinwei Deng; Noshir Contractor; Saliya Ekanayake; Joshua M. Epstein; Brian J. Goode; Gizem Korkmaz; Chris J. Kuhlman; Dustin Machi; Michael Macy; Madhav V. Marathe; Naren Ramakrishnan; Parang Saraf; Nathan Self (2023). Definition of our abstract data model. [Dataset]. http://doi.org/10.1371/journal.pone.0242453.t003
    Available download formats: xls
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Vanessa Cedeno-Mieles; Zhihao Hu; Yihui Ren; Xinwei Deng; Noshir Contractor; Saliya Ekanayake; Joshua M. Epstein; Brian J. Goode; Gizem Korkmaz; Chris J. Kuhlman; Dustin Machi; Michael Macy; Madhav V. Marathe; Naren Ramakrishnan; Parang Saraf; Nathan Self
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Definition of our abstract data model.

  19. INTEGRAL Public Pointed Science Window Data

    • res1catalogd-o-tdatad-o-tgov.vcapture.xyz
    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    Updated Jul 11, 2025
    Cite
    High Energy Astrophysics Science Archive Research Center (2025). INTEGRAL Public Pointed Science Window Data [Dataset]. https://res1catalogd-o-tdatad-o-tgov.vcapture.xyz/dataset/integral-public-pointed-science-window-data
    Dataset updated
    Jul 11, 2025
    Dataset provided by
    High Energy Astrophysics Science Archive Research Center
    Description

    Because of the pointing-slew-pointing dithering nature of INTEGRAL operations, each observation of a celestial target actually comprises numerous individual S/C pointings and slews. In addition, there are periods within a given sequence where no scheduled observations occur, i.e., engineering windows, yet the instruments still acquire data. The INTEGRAL Science Data Center (ISDC) generalizes all of these data acquisition periods into so-called "Science Windows". A Science Window (ScW) is a continuous time interval during which all data acquired by the INTEGRAL instruments result from a specific S/C attitude orientation state. Pointing (fixed orientation), Slew (changing orientation), and Engineering (undefined orientation) windows are all special cases of a Science Window. The key point is that the same attitude information may be associated with all acquired data of a given Science Window. Note that it is possible to divide a time interval that qualifies as a Science Window under this definition into several smaller Science Windows using arbitrary criteria. The INTEGRAL Public Pointed Science Window Data Catalog is a subset of the INTEGRAL Science Window Data Catalog. It only includes pointed science windows that have non-private proprietary status and at least one instrument's "good" exposure time greater than zero. This database table was first created at the HEASARC in October 2004. It is a slightly modified mirror of the online database maintained by the ISDC at the URL http://isdc.unige.ch/index.cgi?Data+browse

    The HEASARC version of this table is updated automatically within a day of the ISDC updating their database table. This is a service provided by NASA HEASARC.
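As a toy illustration of that subset rule (pointed, non-private, and at least one instrument with a positive "good" exposure time), using an invented table whose columns are not the actual HEASARC schema:

```r
# Invented science-window rows; column names are illustrative only.
scw <- data.frame(
  scw_id    = c("005400340010", "005400350010", "005400360010"),
  type      = c("POINTING", "SLEW", "POINTING"),
  status    = c("public", "public", "private"),
  good_spi  = c(1800, 0, 2100),  # "good" exposure per instrument, in seconds
  good_ibis = c(0, 0, 2050)
)

# Keep pointed, non-private windows where any instrument has good exposure > 0.
subset(scw, type == "POINTING" & status != "private" &
            pmax(good_spi, good_ibis) > 0)
```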

  20. Data

    • figshare.com
    txt
    Updated Oct 18, 2021
    Cite
    Anusha Sathyanadh; Guillaume Monteil; Marko Scholze; Anne Klosterhalfen; Hjalmar Laudon (2021). Data [Dataset]. http://doi.org/10.6084/m9.figshare.16676656.v3
    Available download formats: txt
    Dataset updated
    Oct 18, 2021
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Anusha Sathyanadh; Guillaume Monteil; Marko Scholze; Anne Klosterhalfen; Hjalmar Laudon
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The modelled concentrations with different vegetation flux inputs and background concentrations are provided in modeldata.csv. The regional sensitivity indices for 2017 and 2018 are added separately. The region definitions used in the manuscript are given as a csv file, regmap.
