67 datasets found

l
LScD (Leicester Scientific Dictionary)
figshare.le.ac.uk
docx
Updated Apr 15, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Neslihan Suzen (2020). LScD (Leicester Scientific Dictionary) [Dataset]. http://doi.org/10.25392/leicester.data.9746900.v3
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.25392/leicester.data.9746900.v3
Dataset updated
Apr 15, 2020
Dataset provided by
University of Leicester
Authors
Neslihan Suzen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Leicester
Description
LScD (Leicester Scientific Dictionary)April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk/suzenneslihan@hotmail.com)Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes[Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus) - Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in description of Version 2 below. We did not repeat the explanation. After pre-processing steps, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are also same as described as for LScD Version 2 below.* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2[Version 2] Getting StartedThis document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). This dictionary is created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from LSC and instructions for usage of the code are available in [2]. The code can be also used for list of texts from other sources, amendments to the code may be required.LSC is a collection of abstracts of articles and proceeding papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018. The total number of documents in LSC is 1,673,824.LScD is an ordered list of words from texts of abstracts in LSC.The dictionary stores 974,238 unique words, is sorted by the number of documents containing the word in descending order. All words in the LScD are in stemmed form of words. The LScD contains the following information:1.Unique words in abstracts2.Number of documents containing each word3.Number of appearance of a word in the entire corpusProcessing the LSCStep 1.Downloading the LSC Online: Use of the LSC is subject to acceptance of request of the link by email. To access the LSC for research purposes, please email to ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.Step 2.Importing the Corpus to R: The full R code for processing the corpus can be found in the GitHub [2].All following steps can be applied for arbitrary list of texts from any source with changes of parameter. The structure of the corpus such as file format and names (also the position) of fields should be taken into account to apply our code. The organisation of CSV files of LSC is described in README file for LSC [1].Step 3.Extracting Abstracts and Saving Metadata: Metadata that include all fields in a document excluding abstracts and the field of abstracts are separated. Metadata are then saved as MetaData.R. Fields of metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.Step 4.Text Pre-processing Steps on the Collection of Abstracts: In this section, we presented our approaches to pre-process abstracts of the LSC.1.Removing punctuations and special characters: This is the process of substitution of all non-alphanumeric characters by space. We did not substitute the character “-” in this step, because we need to keep words like “z-score”, “non-payment” and “pre-processing” in order not to lose the actual meaning of such words. A processing of uniting prefixes with words are performed in later steps of pre-processing.2.Lowercasing the text data: Lowercasing is performed to avoid considering same words like “Corpus”, “corpus” and “CORPUS” differently. Entire collection of texts are converted to lowercase.3.Uniting prefixes of words: Words containing prefixes joined with character “-” are united as a word. The list of prefixes united for this research are listed in the file “list_of_prefixes.csv”. The most of prefixes are extracted from [4]. We also added commonly used prefixes: ‘e’, ‘extra’, ‘per’, ‘self’ and ‘ultra’.4.Substitution of words: Some of words joined with “-” in the abstracts of the LSC require an additional process of substitution to avoid losing the meaning of the word before removing the character “-”. Some examples of such words are “z-test”, “well-known” and “chi-square”. These words have been substituted to “ztest”, “wellknown” and “chisquare”. Identification of such words is done by sampling of abstracts form LSC. The full list of such words and decision taken for substitution are presented in the file “list_of_substitution.csv”.5.Removing the character “-”: All remaining character “-” are replaced by space.6.Removing numbers: All digits which are not included in a word are replaced by space. All words that contain digits and letters are kept because alphanumeric characters such as chemical formula might be important for our analysis. Some examples are “co2”, “h2o” and “21st”.7.Stemming: Stemming is the process of converting inflected words into their word stem. This step results in uniting several forms of words with similar meaning into one form and also saving memory space and time [5]. All words in the LScD are stemmed to their word stem.8.Stop words removal: Stop words are words that are extreme common but provide little value in a language. Some common stop words in English are ‘I’, ‘the’, ‘a’ etc. We used ‘tm’ package in R to remove stop words [6]. There are 174 English stop words listed in the package.Step 5.Writing the LScD into CSV Format: There are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written in the file “LScD.csv”.The Organisation of the LScDThe total number of words in the file “LScD.csv” is 974,238. Each field is described below:Word: It contains unique words from the corpus. All words are in lowercase and their stem forms. The field is sorted by the number of documents that contain words in descending order.Number of Documents Containing the Word: In this content, binary calculation is used: if a word exists in an abstract then there is a count of 1. If the word exits more than once in a document, the count is still 1. Total number of document containing the word is counted as the sum of 1s in the entire corpus.Number of Appearance in Corpus: It contains how many times a word occurs in the corpus when the corpus is considered as one large document.Instructions for R CodeLScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as RData file and in CSV format. Outputs of the code are:Metadata File: It includes all fields in a document excluding abstracts. Fields are List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.File of Abstracts: It contains all abstracts after pre-processing steps defined in the step 4.DTM: It is the Document Term Matrix constructed from the LSC[6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.LScD: An ordered list of words from LSC as defined in the previous section.The code can be used by:1.Download the folder ‘LSC’, ‘list_of_prefixes.csv’ and ‘list_of_substitution.csv’2.Open LScD_Creation.R script3.Change parameters in the script: replace with the full path of the directory with source files and the full path of the directory to write output files4.Run the full code.References[1]N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1[2]N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION[3]Web of Science. (15 July). Available: https://apps.webofknowledge.com/[4]A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.[5]C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved porter’s stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.[6]I. Feinerer, "Introduction to the tm Package Text Mining in R," Accessible en ligne: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.
Z
Dataset: A Systematic Literature Review on the topic of High-value datasets
data.niaid.nih.gov
zenodo.org
Updated Jun 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anastasija Nikiforova (2023). Dataset: A Systematic Literature Review on the topic of High-value datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7944424
Explore at:
Dataset updated
Jun 23, 2023
Dataset provided by
Charalampos Alexopoulos
Magdalena Ciesielska
Nina Rizun
Andrea Miletič
Anastasija Nikiforova
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains data collected during a study ("Towards High-Value Datasets determination for data-driven development: a systematic literature review") conducted by Anastasija Nikiforova (University of Tartu), Nina Rizun, Magdalena Ciesielska (Gdańsk University of Technology), Charalampos Alexopoulos (University of the Aegean) and Andrea Miletič (University of Zagreb) It being made public both to act as supplementary data for "Towards High-Value Datasets determination for data-driven development: a systematic literature review" paper (pre-print is available in Open Access here -> https://arxiv.org/abs/2305.10234) and in order for other researchers to use these data in their own work.

The protocol is intended for the Systematic Literature review on the topic of High-value Datasets with the aim to gather information on how the topic of High-value datasets (HVD) and their determination has been reflected in the literature over the years and what has been found by these studies to date, incl. the indicators used in them, involved stakeholders, data-related aspects, and frameworks. The data in this dataset were collected in the result of the SLR over Scopus, Web of Science, and Digital Government Research library (DGRL) in 2023.

Methodology

To understand how HVD determination has been reflected in the literature over the years and what has been found by these studies to date, all relevant literature covering this topic has been studied. To this end, the SLR was carried out to by searching digital libraries covered by Scopus, Web of Science (WoS), Digital Government Research library (DGRL).

These databases were queried for keywords ("open data" OR "open government data") AND ("high-value data*" OR "high value data*"), which were applied to the article title, keywords, and abstract to limit the number of papers to those, where these objects were primary research objects rather than mentioned in the body, e.g., as a future work. After deduplication, 11 articles were found unique and were further checked for relevance. As a result, a total of 9 articles were further examined. Each study was independently examined by at least two authors.

To attain the objective of our study, we developed the protocol, where the information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design- related information, (3) quality-related information, (4) HVD determination-related information.

Test procedure Each study was independently examined by at least two authors, where after the in-depth examination of the full-text of the article, the structured protocol has been filled for each study. The structure of the survey is available in the supplementary file available (see Protocol_HVD_SLR.odt, Protocol_HVD_SLR.docx) The data collected for each study by two researchers were then synthesized in one final version by the third researcher.

Description of the data in this data set

Protocol_HVD_SLR provides the structure of the protocol Spreadsheets #1 provides the filled protocol for relevant studies. Spreadsheet#2 provides the list of results after the search over three indexing databases, i.e. before filtering out irrelevant studies

The information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design- related information, (3) quality-related information, (4) HVD determination-related information

Descriptive information
1) Article number - a study number, corresponding to the study number assigned in an Excel worksheet 2) Complete reference - the complete source information to refer to the study 3) Year of publication - the year in which the study was published 4) Journal article / conference paper / book chapter - the type of the paper -{journal article, conference paper, book chapter} 5) DOI / Website- a link to the website where the study can be found 6) Number of citations - the number of citations of the article in Google Scholar, Scopus, Web of Science 7) Availability in OA - availability of an article in the Open Access 8) Keywords - keywords of the paper as indicated by the authors 9) Relevance for this study - what is the relevance level of the article for this study? {high / medium / low}

Approach- and research design-related information 10) Objective / RQ - the research objective / aim, established research questions 11) Research method (including unit of analysis) - the methods used to collect data, including the unit of analy-sis (country, organisation, specific unit that has been ana-lysed, e.g., the number of use-cases, scope of the SLR etc.) 12) Contributions - the contributions of the study 13) Method - whether the study uses a qualitative, quantitative, or mixed methods approach? 14) Availability of the underlying research data- whether there is a reference to the publicly available underly-ing research data e.g., transcriptions of interviews, collected data, or explanation why these data are not shared? 15) Period under investigation - period (or moment) in which the study was conducted 16) Use of theory / theoretical concepts / approaches - does the study mention any theory / theoretical concepts / approaches? If any theory is mentioned, how is theory used in the study?

Quality- and relevance- related information
17) Quality concerns - whether there are any quality concerns (e.g., limited infor-mation about the research methods used)? 18) Primary research object - is the HVD a primary research object in the study? (primary - the paper is focused around the HVD determination, sec-ondary - mentioned but not studied (e.g., as part of discus-sion, future work etc.))

HVD determination-related information
19) HVD definition and type of value - how is the HVD defined in the article and / or any other equivalent term? 20) HVD indicators - what are the indicators to identify HVD? How were they identified? (components & relationships, “input -> output") 21) A framework for HVD determination - is there a framework presented for HVD identification? What components does it consist of and what are the rela-tionships between these components? (detailed description) 22) Stakeholders and their roles - what stakeholders or actors does HVD determination in-volve? What are their roles? 23) Data - what data do HVD cover? 24) Level (if relevant) - what is the level of the HVD determination covered in the article? (e.g., city, regional, national, international)

Format of the file .xls, .csv (for the first spreadsheet only), .odt, .docx

Licenses or restrictions CC-BY

For more info, see README.txt
Data Science Glossary For QA
kaggle.com
Updated Mar 8, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sofianesun (2024). Data Science Glossary For QA [Dataset]. https://www.kaggle.com/datasets/sofianesun/data-science-glossary-for-qa
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 8, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Sofianesun
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
A dataset for the 1st task Explain or teach basic data science concepts of the competition Google – AI Assistants for Data Tasks with Gemma. This dataset contains several glossaries of Data Science, where every sample contains two keys term(vocab name) and definition.
w
Dataset of book subjects that contain Dictionary of science
workwithdata.com
Updated Nov 7, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Work With Data (2024). Dataset of book subjects that contain Dictionary of science [Dataset]. https://www.workwithdata.com/datasets/book-subjects?f=1&fcol0=j0-book&fop0=%3D&fval0=Dictionary+of+science&j=1&j0=books
Explore at:
Dataset updated
Nov 7, 2024
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is about book subjects. It has 2 rows and is filtered where the books is Dictionary of science. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.
Scientific JOURNALS Indicators & Info - SCImagoJR
kaggle.com
Updated Apr 9, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ali Jalaali (2025). Scientific JOURNALS Indicators & Info - SCImagoJR [Dataset]. https://www.kaggle.com/datasets/alijalali4ai/scimagojr-scientific-journals-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 9, 2025
Dataset provided by
Kaggle
Authors
Ali Jalaali
Description
The SCImago Journal & Country Rank is a publicly available portal that includes the journals and country scientific indicators developed from the information contained in the Scopus® database (Elsevier B.V.). These indicators can be used to assess and analyze scientific domains. Journals can be compared or analysed separately.

💬Also have a look at
💡 COUNTRIES Research & Science Dataset - SCImagoJR
💡 UNIVERSITIES & Research INSTITUTIONS Rank - SCImagoIR

Journals can be grouped by subject area (27 major thematic areas), subject category (309 specific subject categories) or by country.

Citation data is drawn from over 34,100 titles from more than 5,000 international publishers

This platform takes its name from the SCImago Journal Rank (SJR) indicator , developed by SCImago from the widely known algorithm Google PageRank™. This indicator shows the visibility of the journals contained in the Scopus® database from 1996.

SCImago is a research group from the Consejo Superior de Investigaciones Científicas (CSIC), University of Granada, Extremadura, Carlos III (Madrid) and Alcalá de Henares, dedicated to information analysis, representation and retrieval by means of visualisation techniques.

☢️❓The entire dataset is obtained from public and open-access data of ScimagoJR (SCImago Journal & Country Rank)
ScimagoJR Journal Rank
SCImagoJR About Us

Available indicators:

SJR (SCImago Journal Rank) indicator: It expresses the average number of weighted citations received in the selected year by the documents published in the selected journal in the three previous years, --i.e. weighted citations received in year X to documents published in the journal in years X-1, X-2 and X-3. See detailed description of SJR (PDF).

H Index: The h index expresses the journal's number of articles (h) that have received at least h citations. It quantifies both journal scientific productivity and scientific impact and it is also applicable to scientists, countries, etc. (see H-index wikipedia definition)

Total Documents: Output of the selected period. All types of documents are considered, including citable and non citable documents.

Total Documents (3years): Published documents in the three previous years (selected year documents are excluded), i.e.when the year X is selected, then X-1, X-2 and X-3 published documents are retrieved. All types of documents are considered, including citable and non citable documents.

Citable Documents (3 years): Number of citable documents published by a journal in the three previous years (selected year documents are excluded). Exclusively articles, reviews and conference papers are considered. Non-citable Docs. (Available in the graphics) Non-citable documents ratio in the period being considered.

Total Cites (3years): Number of citations received in the seleted year by a journal to the documents published in the three previous years, --i.e. citations received in year X to documents published in years X-1, X-2 and X-3. All types of documents are considered.

Cites per Document (2 years): Average citations per document in a 2 year period. It is computed considering the number of citations received by a journal in the current year to the documents published in the two previous years, --i.e. citations received in year X to documents published in years X-1 and X-2.

Cites per Document (3 years): Average citations per document in a 3 year period. It is computed considering the number of citations received by a journal in the current year to the documents published in the three previous years, --i.e. citations received in year X to documents published in years X-1, X-2 and X-3.

Self Cites: Number of journal's self-citations in the seleted year to its own documents published in the three previous years, --i.e. self-citations in year X to documents published in years X-1, X-2 and X-3. All types of documents are considered.

Cited Documents: Number of documents cited at least once in the three previous years, --i.e. years X-1, X-2 and X-3

Uncited Documents: Number of uncited documents in the three previous years, --i.e. years X-1, X-2 and X-3

Total References: It includes all the bibliographical references in a journal in the selected period.

References per Document:Average number of references per document in the selected year.

% International Collaboration: Document ratio whose affiliation includes more than one country address.
w
Dataset of books called Dictionary of environmental science and technology
workwithdata.com
Updated Apr 17, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Work With Data (2025). Dataset of books called Dictionary of environmental science and technology [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Dictionary+of+environmental+science+and+technology
Explore at:
Dataset updated
Apr 17, 2025
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is about books. It has 5 rows and is filtered where the book is Dictionary of environmental science and technology. It features 7 columns including author, publication date, language, and book publisher.
PDS Data Dictionary (1r47)
catalog.data.gov
datasets.ai
+2more
Updated Aug 30, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Aeronautics and Space Administration (2025). PDS Data Dictionary (1r47) [Dataset]. https://catalog.data.gov/dataset/pds-data-dictionary-1r47
Explore at:
Dataset updated
Aug 30, 2025
Dataset provided by
NASAhttp://nasa.gov/
Description
Data Dictionary and Index Files
INTEGRAL Science Window Data - Dataset - NASA Open Data Portal
data.nasa.gov
Updated Apr 1, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
nasa.gov (2025). INTEGRAL Science Window Data - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/integral-science-window-data
Explore at:
Dataset updated
Apr 1, 2025
Dataset provided by
NASAhttp://nasa.gov/
Description
Because of the pointing-slew-pointing dithering-nature of INTEGRAL operations, each observation of a celestial target is actually comprised of numerous individual S/C pointings and slews. In addition, there are periods within a given sequence where scheduled observations occur, i.e., engineering windows, yet the instruments still acquire data. The INTEGRAL Science Data Center (ISDC) generalizes all of these data acquisition periods into so-called `Science Windows.' A Science Window (ScW) is a continuous time interval during which all data acquired by the INTEGRAL instruments result from a specific S/C attitude orientation state. Pointing (fixed orientation), Slew (changing orientation), and Engineering (undefined orientation) windows are all special cases of a Science Window. The key is that the same attitude information may be associated with all acquired data of a given Science Window. Note that it is possible to divide a time interval that qualifies as a Science Window under this definition into several smaller Science Windows using arbitrary criteria. The INTEGRAL Science Window Data Catalog allows for the keyed search and selection of sets of Science Windows and the retrieval of the corresponding data products. This database table was first created at the HEASARC in October 2004. It is a slightly modified mirror of the online database maintained by the ISDC at the URL http://isdc.unige.ch/index.cgi?Data+browse
The HEASARC version of this table is updated automatically within a day of the ISDC updating their database table. This is a service provided by NASA HEASARC .
WiDS data dictionary v2.xlsx
kaggle.com
Updated Feb 13, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
VivekSingh (2018). WiDS data dictionary v2.xlsx [Dataset]. https://www.kaggle.com/viveksinghub/wids-data-dictionary-v2xlsx/code
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 13, 2018
Dataset provided by
Kagglehttp://kaggle.com/
Authors
VivekSingh
Description
Dataset

This dataset was created by VivekSingh

Released under Data files © Original Authors

Contents
H
Replication Data for: Automated Dictionary Generation for Political Event...
dataverse.harvard.edu
Updated Nov 12, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Benjamin J Radford (2018). Replication Data for: Automated Dictionary Generation for Political Event Coding [Dataset]. http://doi.org/10.7910/DVN/N3Y3MF
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/N3Y3MF
Dataset updated
Nov 12, 2018
Dataset provided by
Harvard Dataverse
Authors
Benjamin J Radford
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Event data provide high-resolution and high-volume information about political events and have supported a variety of research efforts across fields within and beyond political science. While these datasets are machine coded from vast amounts of raw text input, the necessary dictionaries require substantial prior knowledge and human effort to produce and update, effectively limiting the application of automated event-coding solutions to those domains for which dictionaries already exist. I introduce a novel method for generating dictionaries appropriate for event coding given only a small sample dictionary. This technique leverages recent advances in natural language processing and machine learning to reduce the prior knowledge and researcher-hours required to go from defining a new domain-of-interest to producing structured event data that describe that domain. I evaluate the method via actor-country classification and demonstrate the method’s ability to generalize to new domains with the production of a novel event dataset on cybersecurity.

Latest Data Professionals Salary Dataset

kaggle.com

Updated Jul 9, 2023

Facebook

Twitter

Click to copy link

Link copied

Cite

Aman Chauhan (2023). Latest Data Professionals Salary Dataset [Dataset]. https://www.kaggle.com/datasets/whenamancodes/data-professionals-salary-dataset-2022

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Jul 9, 2023

Dataset provided by

Kagglehttp://kaggle.com/

Authors

Aman Chauhan

Description

About Dataset

Context

Analytics refers to the methodical examination and calculation of data or statistics. Its purpose is to uncover, interpret, and convey meaningful patterns found within the data. Additionally, analytics involves utilizing these data patterns to make informed decisions. It proves valuable in domains abundant with recorded information, employing a combination of statistics, computer programming, and operations research to measure performance.

Businesses can leverage analytics to describe, predict, and enhance their overall performance. Various branches of analytics encompass predictive analytics, prescriptive analytics, enterprise decision management, descriptive analytics, cognitive analytics, Big Data Analytics, retail analytics, supply chain analytics, store assortment and stock-keeping unit optimization, marketing optimization and marketing mix modeling, web analytics, call analytics, speech analytics, sales force sizing and optimization, price and promotion modeling, predictive science, graph analytics, credit risk analysis, and fraud analytics. Due to the extensive computational requirements involved (particularly with big data), analytics algorithms and software utilize state-of-the-art methods from computer science, statistics, and mathematics.

Data Dictionary

Columns	Description
Company Name	Company Name refers to the name of the organization or company where an individual is employed. It represents the specific entity that provides job opportunities and is associated with a particular industry or sector.
Job Title	Job Title refers to the official designation or position held by an individual within a company or organization. It represents the specific role or responsibilities assigned to the person in their professional capacity.
Salaries Reported	Salaries Reported indicates the information or data related to the salaries of employees within a company or industry. This data may be collected and reported through various sources, such as surveys, employee disclosures, or public records.
Location	Location refers to the specific geographical location or area where a company or job position is situated. It provides information about the physical location or address associated with the company's operations or the job's work environment.
Salary	Salary refers to the monetary compensation or remuneration received by an employee in exchange for their work or services. It represents the amount of money paid to an individual on a regular basis, typically in the form of wages or a fixed annual income.

Content

This Dataset consists of salaries for Data Scientists, Machine Learning Engineers, Data Analysts, and Data Engineers in various cities across India (2022).

-Salary Dataset.csv -Partially Cleaned Salary Dataset.csv

Acknowledgements

This Dataset is created from https://www.glassdoor.co.in/. If you want to learn more, you can visit the Website.

c
INTEGRAL Science Window Data
s.cnmilf.com
data.staging.idas-ds1.appdat.jsc.nasa.gov
+1more
Updated Aug 22, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
High Energy Astrophysics Science Archive Research Center (2025). INTEGRAL Science Window Data [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/integral-science-window-data
Explore at:
Dataset updated
Aug 22, 2025
Dataset provided by
High Energy Astrophysics Science Archive Research Center
Description
Because of the pointing-slew-pointing dithering-nature of INTEGRAL operations, each observation of a celestial target is actually comprised of numerous individual S/C pointings and slews. In addition, there are periods within a given sequence where scheduled observations occur, i.e., engineering windows, yet the instruments still acquire data. The INTEGRAL Science Data Center (ISDC) generalizes all of these data acquisition periods into so-called `Science Windows.' A Science Window (ScW) is a continuous time interval during which all data acquired by the INTEGRAL instruments result from a specific S/C attitude orientation state. Pointing (fixed orientation), Slew (changing orientation), and Engineering (undefined orientation) windows are all special cases of a Science Window. The key is that the same attitude information may be associated with all acquired data of a given Science Window. Note that it is possible to divide a time interval that qualifies as a Science Window under this definition into several smaller Science Windows using arbitrary criteria. The INTEGRAL Science Window Data Catalog allows for the keyed search and selection of sets of Science Windows and the retrieval of the corresponding data products. This database table was first created at the HEASARC in October 2004. It is a slightly modified mirror of the online database maintained by the ISDC at the URL http://isdc.unige.ch/index.cgi?Data+browse
The HEASARC version of this table is updated automatically within a day of the ISDC updating their database table. This is a service provided by NASA HEASARC .
w
Dataset of book subjects that contain Longman illustrated dictionary of...
workwithdata.com
Updated Nov 7, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Work With Data (2024). Dataset of book subjects that contain Longman illustrated dictionary of computing science : computers and their application [Dataset]. https://www.workwithdata.com/datasets/book-subjects?f=1&fcol0=j0-book&fop0=%3D&fval0=Longman+illustrated+dictionary+of+computing+science+:+computers+and+their+application&j=1&j0=books
Explore at:
Dataset updated
Nov 7, 2024
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is about book subjects. It has 1 row and is filtered where the books is Longman illustrated dictionary of computing science : computers and their application. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.
o
Data for: Sustainable connectivity in a community repository
explore.openaire.eu
data.niaid.nih.gov
+3more
Updated Dec 7, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ted Habermann (2023). Data for: Sustainable connectivity in a community repository [Dataset]. http://doi.org/10.5061/dryad.nzs7h44xr
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.nzs7h44xr
Dataset updated
Dec 7, 2023
Authors
Ted Habermann
Description
Data For: Sustainable Connectivity in a Community Repository ## GENERAL INFORMATION This readme.txt file was generated on 30231110 by Ted Habermann ### Title of Dataset Data For: Sustainable Connectivity in a Community Repository ### Author Information Principal Investigator Contact Information Name: Ted Habermann (0000-0003-3585-6733) Institution: Metadata Game Changers () Email: ORCID: 0000-0003-3585-6733 ### Date published or finalized for release: November 10, 2023 ## Date of data collection (single date, range, approximate date) May and June 2023 ### Information about funding sources that supported the collection of the data: National Science Foundation (Crossref Funder ID: 100000001) Award 2134956. ### Overview of the data (abstract): These data are Dryad metadata retrieved from and translated into csv files. There are two datasets: 1. DryadJournalDataset was retrieved from Dryad using the ISSNs in the file DryadJournalDataset_ISSNs.txt, although some had no data. 2. DryadOrganizationDataset was retrieved from Dryad using the RORs in the file DryadOrganizationDataset_RORs.txt, although some had no data. Each dataset includes four types of metadata: identifiers, funders, keywords, and related works, each in a separate comma (.csv) or tab (.tsv) delimited files. There are also Microsoft Excel files (.xlsx) for the identifier metadata and connectivity summaries for each dataset (*.html). The connectivity summaries include summaries of each parameter in all four data files with definitions, counts, unique counts, most frequent values, and completeness. These data formed the basis for an analysis of the connectivity of the Dryad repository for organizations, funders, and people. | Size | FileName | | --------: | :--------------------------------------------------------- | | 90541505 | DryadJournalDataset_Identifiers_20230520_12.csv | | 9017051 | DryadJournalDataset_funders_20230520_12.tsv | | 29108477 | DryadJournalDataset_keywords_20230520_12.tsv | | 8833842 | DryadJournalDataset_relatedWorks_20230520_12.tsv | | | | | 18260935 | DryadOrganizationDataset_funders_20230601_12.tsv | | 240128730 | DryadOrganizationDataset_identifiers_20230601_12.tsv | | 39600659 | DryadOrganizationDataset_keywords_20230601_12.tsv | | 11520475 | DryadOrganizationDataset_relatedWorks_20230601_12.tsv | | | | | 40726143 | DryadJournalDataset_identifiers_20230520_12.xlsx | | 81894301 | DryadOrganizationDataset_identifiers_20230601_12.xlsx | | | | | 842827 | DryadJournalDataset_ConnectivitySummary.html | | 387551 | DryadOrganizationDataset_ConnectivitySummary.html | ### Field Definitions ## SHARING/ACCESS INFORMATION ### Licenses/restrictions placed on the data: Creative Commons Public Domain License (CC0) ### Links to publications that cite or use the data: TBD ### Was data derived from another source? No ## DATA & FILE OVERVIEW ### File List A. *Dataset_identifiers_YYYYMMDD_HH.*sv: Short description: Identifier metadata from Dryad for Dataset collected at YYYYMMDD_HH using the Dryad API. B. *Dataset_funders_YYYYMMDD_HH.*sv: Short description: Funder metadata from Dryad for Dataset collected at YYYYMMDD_HH using the Dryad API. C. *Dataset_keywords_YYYYMMDD_HH.*sv: Short description: Keyword metadata from Dryad for Dataset collected at YYYYMMDD_HH using the Dryad API. D. *Dataset_relatedWorks_YYYYMMDD_HH.*sv: Short description: Related work metadata from Dryad for Dataset collected at YYYYMMDD_HH using the Dryad API. E. *Dataset_identifiers_YYYYMMDD_HH.xlsx: Short description: Excel spreadsheet with identifier metadata from Dryad for Dataset collected at YYYYMMDD_HH using the Dryad API. F. *Dataset_ConnectivitySummary.html: Short description: Connectivity summary for Dataset. G. summarizeConnectivity.ipynb Short description: Python notebook with code for creating connectivity summaries and plots. ### Relationship between files: All files with the same dataset name make up a dataset. The .*sv are original metadata extracted from Dryad. ## METHODOLOGICAL INFORMATION ### Description of methods used for collection/generation of data: Most of the analysis is simply extracting and comparing counts of various metadata elements. ## DATA-SPECIFIC INFORMATION See connectivity summaries (*ConnectivitySummary.html) for a list of parameters in each file and summaries of their values. ### Identifier Metadata The identifier metadata datasets include the following fields: | Field | Definition | | :------------------------------- | :--------------------------------------------------------------------------------------------------- | | DOI | Digital object identifier for the dataset | | title | Title for the dataset | | datePublished | Date dataset published | | relatedPublicationISSN | International Standard Serial Number for journal with related publication | | primary_article | Digital object identifier for pr...
U
Data Dictionary for Electron Microprobe Data Collected with Probe for EPMA...
data.usgs.gov
res1catalogd-o-tdatad-o-tgov.vcapture.xyz
+1more
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Heather Lowers, Data Dictionary for Electron Microprobe Data Collected with Probe for EPMA Software Package Developed by Probe Software [Dataset]. http://doi.org/10.5066/P91HKRPM
Explore at:
Unique identifier
https://doi.org/10.5066/P91HKRPM
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Authors
Heather Lowers
License
U.S. Government Workshttps://www.usa.gov/government-works
License information was derived automatically
Time period covered
Jan 1, 1995 - Dec 31, 2050
Description
This data dictionary describes most of the possible output options given in the Probe for EPMA software package developed by Probe Software. Examples of the data output options include sample identification, analytical conditions, elemental weight percents, atomic percents, detection limits, and stage coordinates. Many more options are available and the data that is output will depend upon the end use.
t
Data from: Trusted Research Environments: Analysis of Characteristics and...
researchdata.tuwien.ac.at
bin, csv
Updated Jun 25, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Martin Weise; Martin Weise; Andreas Rauber; Andreas Rauber (2024). Trusted Research Environments: Analysis of Characteristics and Data Availability [Dataset]. http://doi.org/10.48436/cv20m-sg117
Explore at:
bin, csvAvailable download formats
Unique identifier
https://doi.org/10.48436/cv20m-sg117
Dataset updated
Jun 25, 2024
Dataset provided by
TU Wien
Authors
Martin Weise; Martin Weise; Andreas Rauber; Andreas Rauber
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Trusted Research Environments (TREs) enable analysis of sensitive data under strict security assertions that protect the data with technical organizational and legal measures from (accidentally) being leaked outside the facility. While many TREs exist in Europe, little information is available publicly on the architecture and descriptions of their building blocks & their slight technical variations. To shine light on these problems, we give an overview of existing, publicly described TREs and a bibliography linking to the system description. We further analyze their technical characteristics, especially in their commonalities & variations and provide insight on their data type characteristics and availability. Our literature study shows that 47 TREs worldwide provide access to sensitive data of which two-thirds provide data themselves, predominantly via secure remote access. Statistical offices make available a majority of available sensitive data records included in this study.
Methodology
We performed a literature study covering 47 TREs worldwide using scholarly databases (Scopus, Web of Science, IEEE Xplore, Science Direct), a computer science library (dblp.org), Google and grey literature focusing on retrieving the following source material:
Peer-reviewed articles where available,
TRE websites,
TRE metadata catalogs.
The goal for this literature study is to discover existing TREs, analyze their characteristics and data availability to give an overview on available infrastructure for sensitive data research as many European initiatives have been emerging in recent months.
Technical details
This dataset consists of five comma-separated values (.csv) files describing our inventory:
countries.csv: Table of countries with columns id (number), name (text) and code (text, in ISO 3166-A3 encoding, optional)
tres.csv: Table of TREs with columns id (number), name (text), countryid (number, refering to column id of table countries), structureddata (bool, optional), datalevel (one of [1=de-identified, 2=pseudonomized, 3=anonymized], optional), outputcontrol (bool, optional), inceptionyear (date, optional), records (number, optional), datatype (one of [1=claims, 2=linked records]), optional), statistics_office (bool), size (number, optional), source (text, optional), comment (text, optional)
access.csv: Table of access modes of TREs with columns id (number), suf (bool, optional), physical_visit (bool, optional), external_physical_visit (bool, optional), remote_visit (bool, optional)
inclusion.csv: Table of included TREs into the literature study with columns id (number), included (bool), exclusion reason (one of [peer review, environment, duplicate], optional), comment (text, optional)
major_fields.csv: Table of data categorization into the major research fields with columns id (number), life_sciences (bool, optional), physical_sciences (bool, optional), arts_and_humanities (bool, optional), social_sciences (bool, optional).
Additionally, a MariaDB (10.5 or higher) schema definition .sql file is needed, properly modelling the schema for databases:
schema.sql: Schema definition file to create the tables and views used in the analysis.
The analysis was done through Jupyter Notebook which can be found in our source code repository: https://gitlab.tuwien.ac.at/martin.weise/tres/-/blob/master/analysis.ipynb
d
Dictionary of Microsatellite Loci from the U.S. Geological Survey, Alaska...
catalog.data.gov
data.usgs.gov
+1more
Updated Nov 10, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2024). Dictionary of Microsatellite Loci from the U.S. Geological Survey, Alaska Science Center, Molecular Ecology Laboratory [Dataset]. https://catalog.data.gov/dataset/dictionary-of-microsatellite-loci-from-the-u-s-geological-survey-alaska-science-center-mol
Explore at:
Dataset updated
Nov 10, 2024
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Description
This data package provides a dictionary of microsatellite locus primers used by the USGS Alaska Science Center, Molecular Ecology Laboratory (MEL). It is a look-up file of microsatellite locus names and citations to the original publication or source where additional information about the locus primers may be found.
Definition of our abstract data model.
plos.figshare.com
datasetcatalog.nlm.nih.gov
xls
Updated Jun 3, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Vanessa Cedeno-Mieles; Zhihao Hu; Yihui Ren; Xinwei Deng; Noshir Contractor; Saliya Ekanayake; Joshua M. Epstein; Brian J. Goode; Gizem Korkmaz; Chris J. Kuhlman; Dustin Machi; Michael Macy; Madhav V. Marathe; Naren Ramakrishnan; Parang Saraf; Nathan Self (2023). Definition of our abstract data model. [Dataset]. http://doi.org/10.1371/journal.pone.0242453.t003
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0242453.t003
Dataset updated
Jun 3, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Vanessa Cedeno-Mieles; Zhihao Hu; Yihui Ren; Xinwei Deng; Noshir Contractor; Saliya Ekanayake; Joshua M. Epstein; Brian J. Goode; Gizem Korkmaz; Chris J. Kuhlman; Dustin Machi; Michael Macy; Madhav V. Marathe; Naren Ramakrishnan; Parang Saraf; Nathan Self
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Definition of our abstract data model.
v
INTEGRAL Public Pointed Science Window Data
res1catalogd-o-tdatad-o-tgov.vcapture.xyz
data.staging.idas-ds1.appdat.jsc.nasa.gov
+1more
Updated Jul 11, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
High Energy Astrophysics Science Archive Research Center (2025). INTEGRAL Public Pointed Science Window Data [Dataset]. https://res1catalogd-o-tdatad-o-tgov.vcapture.xyz/dataset/integral-public-pointed-science-window-data
Explore at:
Dataset updated
Jul 11, 2025
Dataset provided by
High Energy Astrophysics Science Archive Research Center
Description
Because of the pointing-slew-pointing dithering-nature of INTEGRAL operations, each observation of a celestial target is actually comprised of numerous individual S/C pointings and slews. In addition, there are periods within a given sequence where scheduled observations occur, i.e., engineering windows, yet the instruments still acquire data. The INTEGRAL Science Data Center (ISDC) generalizes all of these data acquisition periods into so-called `Science Windows.' A Science Window (ScW) is a continuous time interval during which all data acquired by the INTEGRAL instruments result from a specific S/C attitude orientation state. Pointing (fixed orientation), Slew (changing orientation), and Engineering (undefined orientation) windows are all special cases of a Science Window. The key is that the same attitude information may be associated with all acquired data of a given Science Window. Note that it is possible to divide a time interval that qualifies as a Science Window under this definition into several smaller Science Windows using arbitrary criteria. The INTEGRAL Public Pointed Science Window Data Catalog is a subset of the INTEGRAL Science Window Data Catalog. It only includes pointed science windows that have non-private proprietary status and at least one instrument's "good" exposure time greater than zero. This database table was first created at the HEASARC in October 2004. It is a slightly modified mirror of the online database maintained by the ISDC at the URL http://isdc.unige.ch/index.cgi?Data+browse
The HEASARC version of this table is updated automatically within a day of the ISDC updating their database table. This is a service provided by NASA HEASARC .
Data
figshare.com
txt
Updated Oct 18, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anusha Sathyanadh; Guillaume Monteil; Marko Scholze; Anne Klosterhalfen; Hjalmar Laudon (2021). Data [Dataset]. http://doi.org/10.6084/m9.figshare.16676656.v3
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.16676656.v3
Dataset updated
Oct 18, 2021
Dataset provided by
Figsharehttp://figshare.com/
Authors
Anusha Sathyanadh; Guillaume Monteil; Marko Scholze; Anne Klosterhalfen; Hjalmar Laudon
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The modelled concentrations with different vegetation flux inputs and background concentrations are added in the modeldata.csv. Regional sensitivity index for 2017 and 2018 are added seperately. The regions definition used in the ms is given as a csv file, regmap

Facebook

Twitter

Click to copy link

Link copied

Cite

Neslihan Suzen (2020). LScD (Leicester Scientific Dictionary) [Dataset]. http://doi.org/10.25392/leicester.data.9746900.v3

LScD (Leicester Scientific Dictionary)

Explore at:

2 scholarly articles cite this dataset (View in Google Scholar)

docxAvailable download formats

Unique identifier

https://doi.org/10.25392/leicester.data.9746900.v3

Dataset updated

Apr 15, 2020

Dataset provided by

University of Leicester

Authors

Neslihan Suzen

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Area covered

Leicester

Description

LScD (Leicester Scientific Dictionary)April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk/suzenneslihan@hotmail.com)Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes[Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus) - Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in description of Version 2 below. We did not repeat the explanation. After pre-processing steps, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are also same as described as for LScD Version 2 below.* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2[Version 2] Getting StartedThis document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). This dictionary is created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from LSC and instructions for usage of the code are available in [2]. The code can be also used for list of texts from other sources, amendments to the code may be required.LSC is a collection of abstracts of articles and proceeding papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018. The total number of documents in LSC is 1,673,824.LScD is an ordered list of words from texts of abstracts in LSC.The dictionary stores 974,238 unique words, is sorted by the number of documents containing the word in descending order. All words in the LScD are in stemmed form of words. The LScD contains the following information:1.Unique words in abstracts2.Number of documents containing each word3.Number of appearance of a word in the entire corpusProcessing the LSCStep 1.Downloading the LSC Online: Use of the LSC is subject to acceptance of request of the link by email. To access the LSC for research purposes, please email to ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.Step 2.Importing the Corpus to R: The full R code for processing the corpus can be found in the GitHub [2].All following steps can be applied for arbitrary list of texts from any source with changes of parameter. The structure of the corpus such as file format and names (also the position) of fields should be taken into account to apply our code. The organisation of CSV files of LSC is described in README file for LSC [1].Step 3.Extracting Abstracts and Saving Metadata: Metadata that include all fields in a document excluding abstracts and the field of abstracts are separated. Metadata are then saved as MetaData.R. Fields of metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.Step 4.Text Pre-processing Steps on the Collection of Abstracts: In this section, we presented our approaches to pre-process abstracts of the LSC.1.Removing punctuations and special characters: This is the process of substitution of all non-alphanumeric characters by space. We did not substitute the character “-” in this step, because we need to keep words like “z-score”, “non-payment” and “pre-processing” in order not to lose the actual meaning of such words. A processing of uniting prefixes with words are performed in later steps of pre-processing.2.Lowercasing the text data: Lowercasing is performed to avoid considering same words like “Corpus”, “corpus” and “CORPUS” differently. Entire collection of texts are converted to lowercase.3.Uniting prefixes of words: Words containing prefixes joined with character “-” are united as a word. The list of prefixes united for this research are listed in the file “list_of_prefixes.csv”. The most of prefixes are extracted from [4]. We also added commonly used prefixes: ‘e’, ‘extra’, ‘per’, ‘self’ and ‘ultra’.4.Substitution of words: Some of words joined with “-” in the abstracts of the LSC require an additional process of substitution to avoid losing the meaning of the word before removing the character “-”. Some examples of such words are “z-test”, “well-known” and “chi-square”. These words have been substituted to “ztest”, “wellknown” and “chisquare”. Identification of such words is done by sampling of abstracts form LSC. The full list of such words and decision taken for substitution are presented in the file “list_of_substitution.csv”.5.Removing the character “-”: All remaining character “-” are replaced by space.6.Removing numbers: All digits which are not included in a word are replaced by space. All words that contain digits and letters are kept because alphanumeric characters such as chemical formula might be important for our analysis. Some examples are “co2”, “h2o” and “21st”.7.Stemming: Stemming is the process of converting inflected words into their word stem. This step results in uniting several forms of words with similar meaning into one form and also saving memory space and time [5]. All words in the LScD are stemmed to their word stem.8.Stop words removal: Stop words are words that are extreme common but provide little value in a language. Some common stop words in English are ‘I’, ‘the’, ‘a’ etc. We used ‘tm’ package in R to remove stop words [6]. There are 174 English stop words listed in the package.Step 5.Writing the LScD into CSV Format: There are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written in the file “LScD.csv”.The Organisation of the LScDThe total number of words in the file “LScD.csv” is 974,238. Each field is described below:Word: It contains unique words from the corpus. All words are in lowercase and their stem forms. The field is sorted by the number of documents that contain words in descending order.Number of Documents Containing the Word: In this content, binary calculation is used: if a word exists in an abstract then there is a count of 1. If the word exits more than once in a document, the count is still 1. Total number of document containing the word is counted as the sum of 1s in the entire corpus.Number of Appearance in Corpus: It contains how many times a word occurs in the corpus when the corpus is considered as one large document.Instructions for R CodeLScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as RData file and in CSV format. Outputs of the code are:Metadata File: It includes all fields in a document excluding abstracts. Fields are List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.File of Abstracts: It contains all abstracts after pre-processing steps defined in the step 4.DTM: It is the Document Term Matrix constructed from the LSC[6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.LScD: An ordered list of words from LSC as defined in the previous section.The code can be used by:1.Download the folder ‘LSC’, ‘list_of_prefixes.csv’ and ‘list_of_substitution.csv’2.Open LScD_Creation.R script3.Change parameters in the script: replace with the full path of the directory with source files and the full path of the directory to write output files4.Run the full code.References[1]N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1[2]N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION[3]Web of Science. (15 July). Available: https://apps.webofknowledge.com/[4]A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.[5]C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved porter’s stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.[6]I. Feinerer, "Introduction to the tm Package Text Mining in R," Accessible en ligne: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.

Clear search

Close search

Google apps

Main menu

LScD (Leicester Scientific Dictionary)

Dataset: A Systematic Literature Review on the topic of High-value datasets

Data Science Glossary For QA

Dataset of book subjects that contain Dictionary of science

Scientific JOURNALS Indicators & Info - SCImagoJR

Available indicators:

Dataset of books called Dictionary of environmental science and technology

PDS Data Dictionary (1r47)

INTEGRAL Science Window Data - Dataset - NASA Open Data Portal

WiDS data dictionary v2.xlsx

Dataset

Contents

Replication Data for: Automated Dictionary Generation for Political Event...

Latest Data Professionals Salary Dataset

About Dataset

Context

Data Dictionary

Content

Acknowledgements

INTEGRAL Science Window Data

Dataset of book subjects that contain Longman illustrated dictionary of...

Data for: Sustainable connectivity in a community repository

Data Dictionary for Electron Microprobe Data Collected with Probe for EPMA...

Data from: Trusted Research Environments: Analysis of Characteristics and...

Methodology

Technical details

Dictionary of Microsatellite Loci from the U.S. Geological Survey, Alaska...

Definition of our abstract data model.

INTEGRAL Public Pointed Science Window Data

Data

LScD (Leicester Scientific Dictionary)See More Versions

LScD (Leicester Scientific Dictionary)