100+ datasets found
  1. COUNTRIES Research & Science Dataset - SCImagoJR

    • kaggle.com
    zip
    Updated Apr 10, 2025
    Cite
    Ali Jalaali (2025). COUNTRIES Research & Science Dataset - SCImagoJR [Dataset]. https://www.kaggle.com/datasets/alijalali4ai/scimago-country-info-and-rank
    Explore at:
    zip (54895151 bytes)
    Dataset updated
    Apr 10, 2025
    Authors
    Ali Jalaali
    Description


    The SCImago Journal & Country Rank is a publicly available portal that includes the journals and country scientific indicators developed from the information contained in the Scopus® database (Elsevier B.V.). These indicators can be used to assess and analyze scientific domains. Country rankings may also be compared or analysed separately.

    ✅Collected by: SCImagoJR Country Data Collector Notebook

    💬Also have a look at
    💡 UNIVERSITIES & Research INSTITUTIONS Rank - SCImagoIR
    💡 Scientific JOURNALS Indicators & Info - SCImagoJR

    • 27 major thematic subject areas as well as 309 specific subject categories according to Scopus® Classification.
    • Citation data is drawn from over 34,100 titles from more than 5,000 international publishers
    • SCImago is a research group from the Consejo Superior de Investigaciones Científicas (CSIC), University of Granada, Extremadura, Carlos III (Madrid) and Alcalá de Henares, dedicated to information analysis, representation and retrieval by means of visualisation techniques.

    ☢️❓The entire dataset is obtained from public and open-access data of ScimagoJR (SCImago Journal & Country Rank)
    ScimagoJR Country Rank
    SCImagoJR About Us

    Available indicators:

    • Documents: Number of documents published during the selected year. It is usually called the country's scientific output.

    • Citable Documents: Citable documents published during the selected year. Only articles, reviews, and conference papers are considered.

    • Citations: Number of citations received by the documents published during the source year, i.e., citations in years X, X+1, X+2, X+3... to documents published during year X. When the indicator refers to the period 1996-2021, all documents published during this period are considered.

    • Citations per Document: Average citations per document published during the source year, i.e., citations in years X, X+1, X+2, X+3... to documents published during year X. When the indicator refers to the period 1996-2021, all documents published during this period are considered.

    • Self Citations: Country self-citations. Number of self-citations of all dates received by the documents published during the source year, i.e., self-citations in years X, X+1, X+2, X+3... to documents published during year X. When the indicator refers to the period 1996-2021, all documents published during this period are considered.

    • H index: The h index is the number of a country's articles (h) that have received at least h citations each. It quantifies both a country's scientific productivity and scientific impact, and it is also applicable to scientists, journals, etc.; a small computational sketch follows below.
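
    For illustration, a minimal sketch (in Python, with hypothetical citation counts, not values from this dataset) of how an h index can be computed from per-document citation counts:

    # Largest h such that at least h documents have received at least h citations each.
    def h_index(citations):
        counts = sorted(citations, reverse=True)
        h = 0
        for rank, cites in enumerate(counts, start=1):
            if cites >= rank:
                h = rank
            else:
                break
        return h

    print(h_index([10, 8, 5, 4, 3]))  # -> 4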

  2. S&T Project 22041 Final Report: Evaluation of file formats for storage and...

    • data.usbr.gov
    Updated Dec 1, 2023
    + more versions
    Cite
    United States Bureau of Reclamation (2023). S&T Project 22041 Final Report: Evaluation of file formats for storage and transfer of large datasets in the RISE platform [Dataset]. https://data.usbr.gov/catalog/8002/item/128581
    Explore at:
    Dataset updated
    Dec 1, 2023
    Dataset authored and provided by
    United States Bureau of Reclamation (http://www.usbr.gov/)
    Area covered
    Description

    The Reclamation Research and Development Office funded an evaluation of file formats for large datasets to use in RISE through the Science & Technology Program. A team of Reclamation scientific and information technology (IT) subject matter experts evaluated multiple file formats commonly utilized for scientific data through literature review and independent benchmarks. The network Common Data Form (netCDF) and Zarr formats were identified as open-source options that could meet a variety of Reclamation use cases. The formats allow for metadata, data compression, subsetting, and appending in a single file using an efficient binary format. Additionally, the Zarr format is optimized for cloud storage applications. While support of both formats would provide the most flexibility, the maturity of the netCDF format led to its prioritization as the preferred RISE file format for large datasets.

    This report documents the evaluation and selection of large data file formats for the RISE platform. Additionally, a preliminary list of identified changes to the RISE platform needed to support the netCDF format is provided. The intent is to frame future RISE development by providing a roadmap to support large datasets within the platform.
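
    For illustration only (this sketch is not from the report): writing the same small array to both netCDF and Zarr with xarray, which shows the features highlighted above (embedded metadata, compression, and cloud-friendly chunked storage). It assumes the xarray, netCDF4, and zarr packages are installed; the variable names are made up.

    import numpy as np
    import xarray as xr

    ds = xr.Dataset(
        {"flow": (("time", "site"), np.random.rand(365, 3))},
        coords={"time": np.arange(365), "site": ["A", "B", "C"]},
        attrs={"units": "cfs", "source": "hypothetical example"},
    )

    # netCDF with zlib compression (requires the netCDF4 backend).
    ds.to_netcdf("flow.nc", encoding={"flow": {"zlib": True, "complevel": 4}})

    # Zarr store, chunk-based and well suited to cloud object storage.
    ds.to_zarr("flow.zarr", mode="w")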

  3. Data from: MonkeyPox2022Tweets: The First Public Twitter Dataset on the 2022...

    • data.mendeley.com
    Updated Jul 25, 2022
    + more versions
    Cite
    Nirmalya Thakur (2022). MonkeyPox2022Tweets: The First Public Twitter Dataset on the 2022 MonkeyPox Outbreak [Dataset]. http://doi.org/10.17632/xmcg82mx9k.3
    Explore at:
    Dataset updated
    Jul 25, 2022
    Authors
    Nirmalya Thakur
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Please cite the following paper when using this dataset: N. Thakur, “MonkeyPox2022Tweets: The first public Twitter dataset on the 2022 MonkeyPox outbreak,” Preprints, 2022, DOI: 10.20944/preprints202206.0172.v2

    Abstract: The world is currently facing an outbreak of the monkeypox virus, and confirmed cases have been reported from 28 countries. Following a recent “emergency meeting”, the World Health Organization just declared monkeypox a global health emergency. As a result, people from all over the world are using social media platforms, such as Twitter, for information seeking and sharing related to the outbreak, as well as for familiarizing themselves with the guidelines and protocols that are being recommended by various policy-making bodies to reduce the spread of the virus. This is resulting in the generation of tremendous amounts of Big Data related to such paradigms of social media behavior. Mining this Big Data and compiling it in the form of a dataset can serve a wide range of use-cases and applications such as analysis of public opinions, interests, views, perspectives, attitudes, and sentiment towards this outbreak. Therefore, this work presents MonkeyPox2022Tweets, an open-access dataset of Tweets related to the 2022 monkeypox outbreak that were posted on Twitter since the first detected case of this outbreak on May 7, 2022. The dataset is compliant with the privacy policy, developer agreement, and guidelines for content redistribution of Twitter, as well as with the FAIR principles (Findability, Accessibility, Interoperability, and Reusability) for scientific data management.

    Data Description: The dataset consists of a total of 255,363 Tweet IDs of the same number of tweets about monkeypox that were posted on Twitter from 7th May 2022 to 23rd July 2022 (the most recent date at the time of dataset upload). The Tweet IDs are presented in 6 different .txt files based on the timelines of the associated tweets. The following provides the details of these dataset files.
    • Filename: TweetIDs_Part1.txt (No. of Tweet IDs: 13926, Date Range of the Tweet IDs: May 7, 2022 to May 21, 2022)
    • Filename: TweetIDs_Part2.txt (No. of Tweet IDs: 17705, Date Range of the Tweet IDs: May 21, 2022 to May 27, 2022)
    • Filename: TweetIDs_Part3.txt (No. of Tweet IDs: 17585, Date Range of the Tweet IDs: May 27, 2022 to June 5, 2022)
    • Filename: TweetIDs_Part4.txt (No. of Tweet IDs: 19718, Date Range of the Tweet IDs: June 5, 2022 to June 11, 2022)
    • Filename: TweetIDs_Part5.txt (No. of Tweet IDs: 47718, Date Range of the Tweet IDs: June 12, 2022 to June 30, 2022)
    • Filename: TweetIDs_Part6.txt (No. of Tweet IDs: 138711, Date Range of the Tweet IDs: July 1, 2022 to July 23, 2022)

    The dataset contains only Tweet IDs in compliance with the terms and conditions mentioned in the privacy policy, developer agreement, and guidelines for content redistribution of Twitter. The Tweet IDs need to be hydrated to be used.
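
    As one possible (hypothetical) way to hydrate the Tweet IDs, the sketch below uses the Tweepy client; "BEARER_TOKEN" is a placeholder for your own API credential, and current Twitter/X API access terms may restrict this endpoint.

    import tweepy

    client = tweepy.Client(bearer_token="BEARER_TOKEN")

    with open("TweetIDs_Part1.txt") as f:
        tweet_ids = [line.strip() for line in f if line.strip()]

    # The tweet lookup endpoint accepts at most 100 IDs per request.
    for i in range(0, len(tweet_ids), 100):
        batch = tweet_ids[i:i + 100]
        response = client.get_tweets(ids=batch, tweet_fields=["created_at", "lang"])
        for tweet in response.data or []:
            print(tweet.id, tweet.created_at, tweet.text[:80])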

  4. August 2024 data-update for "Updated science-wide author databases of...

    • elsevier.digitalcommonsdata.com
    Updated Sep 16, 2024
    + more versions
    Cite
    John P.A. Ioannidis (2024). August 2024 data-update for "Updated science-wide author databases of standardized citation indicators" [Dataset]. http://doi.org/10.17632/btchxktzyw.7
    Explore at:
    Dataset updated
    Sep 16, 2024
    Authors
    John P.A. Ioannidis
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    Citation metrics are widely used and misused. We have created a publicly available database of top-cited scientists that provides standardized information on citations, h-index, co-authorship adjusted hm-index, citations to papers in different authorship positions, and a composite indicator (c-score). Separate data are shown for career-long impact and for single recent year impact. Metrics with and without self-citations and the ratio of citations to citing papers are given, and data on retracted papers (based on the Retraction Watch database) as well as citations to/from retracted papers have been added in the most recent iteration. Scientists are classified into 22 scientific fields and 174 sub-fields according to the standard Science-Metrix classification. Field- and subfield-specific percentiles are also provided for all scientists with at least 5 papers. Career-long data are updated to end-of-2023 and single recent year data pertain to citations received during calendar year 2023. The selection is based on the top 100,000 scientists by c-score (with and without self-citations) or a percentile rank of 2% or above in the sub-field. This version (7) is based on the August 1, 2024 snapshot from Scopus, updated to end of citation year 2023.

    This work uses Scopus data. Calculations were performed using all Scopus author profiles as of August 1, 2024. If an author is not on the list it is simply because the composite indicator value was not high enough to appear on the list. It does not mean that the author does not do good work. PLEASE ALSO NOTE THAT THE DATABASE HAS BEEN PUBLISHED IN AN ARCHIVAL FORM AND WILL NOT BE CHANGED. The published version reflects Scopus author profiles at the time of calculation. We thus advise authors to ensure that their Scopus profiles are accurate. REQUESTS FOR CORRECTIONS OF THE SCOPUS DATA (INCLUDING CORRECTIONS IN AFFILIATIONS) SHOULD NOT BE SENT TO US. They should be sent directly to Scopus, preferably by use of the Scopus to ORCID feedback wizard (https://orcid.scopusfeedback.com/) so that the correct data can be used in any future annual updates of the citation indicator databases.

    The c-score focuses on impact (citations) rather than productivity (number of publications) and it also incorporates information on co-authorship and author positions (single, first, last author). If you have additional questions, see the attached file on FREQUENTLY ASKED QUESTIONS. Finally, we alert users that all citation metrics have limitations and their use should be tempered and judicious. For more reading, we refer to the Leiden manifesto: https://www.nature.com/articles/520429a

  5. MIMIC-III - Deep Reinforcement Learning

    • kaggle.com
    zip
    Updated Apr 7, 2022
    Cite
    Asjad K (2022). MIMIC-III - Deep Reinforcement Learning [Dataset]. https://www.kaggle.com/datasets/asjad99/mimiciii
    Explore at:
    zip (11100065 bytes)
    Dataset updated
    Apr 7, 2022
    Authors
    Asjad K
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Digitization of healthcare data, along with algorithmic breakthroughs in AI, will have a major impact on healthcare delivery in the coming years. It is interesting to see AI applied to assist clinicians during patient treatment in a privacy-preserving way. While scientific knowledge can help guide interventions, there remains a key need to quickly cut through the space of decision policies to find effective strategies to support patients during the care process.

    Offline reinforcement learning (also referred to as safe or batch reinforcement learning) is a promising sub-field of RL which provides a mechanism for solving real-world sequential decision-making problems where access to a simulator is not available. Here we assume that a policy is learned from a fixed dataset of trajectories without further interaction with the environment (the agent does not receive reward or punishment signals from the environment). It has been shown that such an approach can leverage vast amounts of existing logged data (in the form of previous interactions with the environment) and can outperform supervised learning approaches or heuristic-based policies for solving real-world decision-making problems. Offline RL algorithms, when trained on sufficiently large and diverse offline datasets, can produce close-to-optimal policies (with the ability to generalize beyond the training data).

    As part of my PhD research, I investigated the problem of developing a Clinical Decision Support System for Sepsis Management using offline deep reinforcement learning.

    MIMIC-III ('Medical Information Mart for Intensive Care') is a large, open-access, anonymized, single-center database which consists of comprehensive clinical data of 61,532 critical care admissions from 2001-2012 collected at a Boston teaching hospital. The dataset consists of 47 features (including demographics, vitals, and lab test results) on a cohort of sepsis patients who meet the sepsis-3 definition criteria.

    We try to answer the following question:

    Given a particular patient's characteristics and physiological information at each time step as input, can our deep RL approach learn an optimal treatment policy that prescribes the right intervention (e.g., use of a ventilator) at each stage of the treatment process, in order to improve the final outcome (e.g., patient mortality)?

    We can use popular state-of-the-art algorithms such as Deep Q-Learning (DQN), Double Deep Q-Learning (DDQN), DDQN combined with BNC, Mixed Monte Carlo (MMC), and Persistent Advantage Learning (PAL). Using these methods we can train an RL policy to recommend an optimal treatment path for a given patient.
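
    As a generic illustration only (not the pipeline used in this study), the sketch below performs a single DQN-style update on a batch of offline transitions in PyTorch; the network sizes, the 25-action space, and the random batch are assumptions.

    import torch
    import torch.nn as nn

    n_features, n_actions, gamma = 47, 25, 0.99
    q_net = nn.Sequential(nn.Linear(n_features, 128), nn.ReLU(), nn.Linear(128, n_actions))
    target_net = nn.Sequential(nn.Linear(n_features, 128), nn.ReLU(), nn.Linear(128, n_actions))
    target_net.load_state_dict(q_net.state_dict())
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)

    def dqn_update(state, action, reward, next_state, done):
        # Q(s, a) for the actions actually taken in the logged data.
        q_pred = q_net(state).gather(1, action.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            q_next = target_net(next_state).max(dim=1).values
            q_target = reward + gamma * (1 - done) * q_next
        loss = nn.functional.mse_loss(q_pred, q_target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    # Hypothetical batch of logged transitions (state, action, reward, next state, done flag).
    b = 32
    print(dqn_update(torch.randn(b, n_features), torch.randint(0, n_actions, (b,)),
                     torch.randn(b), torch.randn(b, n_features), torch.zeros(b)))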

    Data acquisition, standard pre-processing, and modelling details can be found in the GitHub repo: https://github.com/asjad99/MIMIC_RL_COACH

  6. Dataset 1: Studies included in literature review

    • catalog.data.gov
    • data.amerigeoss.org
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Dataset 1: Studies included in literature review [Dataset]. https://catalog.data.gov/dataset/dataset-1-studies-included-in-literature-review
    Explore at:
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    This dataset contains the results of a literature review of experimental nutrient addition studies to determine which nutrient forms were most often measured in the scientific literature. To obtain a representative selection of relevant studies, we searched Web of Science™ using a search string to target experimental studies in artificial and natural lotic systems while limiting irrelevant papers. We screened the titles and abstracts of returned papers for relevance (experimental studies in streams/stream mesocosms that manipulated nutrients). To supplement this search, we sorted the relevant articles from the Web of Science™ search alphabetically by author and sequentially examined the bibliographies for additional relevant articles (screening titles for relevance, and then screening abstracts of potentially relevant articles) until we had obtained a total of 100 articles. If we could not find a relevant article electronically, we moved to the next article in the bibliography. Our goal was not to be completely comprehensive, but to obtain a fairly large sample of published, peer-reviewed studies from which to assess patterns. We excluded any lentic or estuarine studies from consideration and included only studies that used mesocosms mimicking stream systems (flowing water or stream water source) or that manipulated nutrient concentrations in natural streams or rivers. We excluded studies that used nutrient diffusing substrate (NDS) because these manipulate nutrients on substrates and not in the water column. We also excluded studies examining only nutrient uptake, which rely on measuring dissolved nutrient concentrations with the goal of characterizing in-stream processing (e.g., Newbold et al., 1983). From the included studies, we extracted or summarized the following information: study type, study duration, nutrient treatments, nutrients measured, inclusion of TN and/or TP response to nutrient additions, and a description of how results were reported in relation to the research-management mismatch, if it existed.

    Below is information on how the search was conducted. Search string used for Web of Science advanced search (search conducted on 27 September 2016):

    TS= (stream OR creek OR river* OR lotic OR brook OR headwater OR tributary) AND TS = (mesocosm OR flume OR "artificial stream" OR "experimental stream" OR "nutrient addition") AND TI= (nitrogen OR phosphorus OR nutrient OR enrichment OR fertilization OR eutrophication)

  7. LScDC Word-Category RIG Matrix

    • figshare.le.ac.uk
    pdf
    Updated Apr 28, 2020
    Cite
    Neslihan Suzen (2020). LScDC Word-Category RIG Matrix [Dataset]. http://doi.org/10.25392/leicester.data.12133431.v2
    Explore at:
    pdf
    Dataset updated
    Apr 28, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LScDC Word-Category RIG Matrix, April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

    Getting Started: This file describes the Word-Category RIG Matrix for the Leicester Scientific Corpus (LSC) [1] and the procedure used to build the matrix, and introduces the Leicester Scientific Thesaurus (LScT) with its construction process. The Word-Category RIG Matrix is a 103,998 by 252 matrix, where rows correspond to words of the Leicester Scientific Dictionary-Core (LScDC) [2] and columns correspond to 252 Web of Science (WoS) categories [3, 4, 5]. Each entry in the matrix corresponds to a pair (category, word). Its value shows the Relative Information Gain (RIG) on the belonging of a text from the LSC to the category from observing the word in this text. The CSV file of the Word-Category RIG Matrix in the published archive includes two additional columns: the sum of RIGs in categories and the maximum of RIGs over categories (last two columns of the matrix). So the file ‘Word-Category RIG Matrix.csv’ contains a total of 254 columns. This matrix is created to be used in future research on quantifying meaning in scientific texts, under the assumption that words have scientifically specific meanings in subject categories and that meaning can be estimated by information gains from word to categories.

    LScT (Leicester Scientific Thesaurus) is a scientific thesaurus of English. The thesaurus includes a list of 5,000 words from the LScDC. We order the words of LScDC by the sum of their RIGs in categories; that is, words are arranged by their informativeness in the scientific corpus LSC. The meaningfulness of words is therefore evaluated by the words' average informativeness in the categories, and we have decided to include the most informative 5,000 words in the scientific thesaurus.

    Words as a Vector of Frequencies in WoS Categories: Each word of the LScDC is represented as a vector of frequencies in WoS categories. Given the collection of the LSC texts, each entry of the vector consists of the number of texts containing the word in the corresponding category. It is noteworthy that texts in a corpus do not necessarily belong to a single category, as they are likely to correspond to multidisciplinary studies, specifically in a corpus of scientific texts; in other words, categories may not be exclusive. There are 252 WoS categories and a text can be assigned to at least 1 and at most 6 categories in the LSC. Using the binary calculation of frequencies, we record the presence of a word in a category. We create a vector of frequencies for each word, where the dimensions are the categories in the corpus. The collection of vectors, with all words and categories in the entire corpus, can be shown in a table where each entry corresponds to a pair (word, category). This table is built for the LScDC with 252 WoS categories and presented in the published archive with this file. The value of each entry in the table shows how many times a word of LScDC appears in a WoS category. The occurrence of a word in a category is determined by counting the number of LSC texts containing the word in that category.
    Words as a Vector of Relative Information Gains Extracted for Categories: In this section, we introduce our approach to representing a word as a vector of relative information gains for categories, under the assumption that the meaning of a word can be quantified by its information gained for categories. For each category, a function is defined on texts that takes the value 1 if the text belongs to the category, and 0 otherwise. For each word, a function is defined on texts that takes the value 1 if the word belongs to the text, and 0 otherwise. Consider the LSC as a probabilistic sample space (the space of equally probable elementary outcomes). For these Boolean random variables, the joint probability distribution, the entropy and the information gains are defined. The information gain about the category from the word is the amount of information on the belonging of a text from the LSC to the category from observing the word in the text [6]. We used the Relative Information Gain (RIG), which provides a normalised measure of the Information Gain and thereby allows information gains for different categories to be compared. The calculations of entropy, Information Gains and Relative Information Gains can be found in the README file in the published archive.

    Given a word, we created a vector where each component corresponds to a category; each word is therefore represented as a vector of relative information gains, and the dimension of the vector is the number of categories. The set of vectors is used to form the Word-Category RIG Matrix, in which each column corresponds to a category, each row corresponds to a word, and each component is the relative information gain from the word to the category. In the Word-Category RIG Matrix, a row vector represents the corresponding word as a vector of RIGs in categories, while a column vector represents the RIGs of all words in an individual category. If we choose an arbitrary category, words can be ordered by their RIGs from the most informative to the least informative for that category. As well as ordering words within each category, words can be ordered by two criteria: the sum and the maximum of RIGs in categories. The top n words in this list can be considered the most informative words in the scientific texts. For a given word, the sum and maximum of RIGs are calculated from the Word-Category RIG Matrix. RIGs for each word of LScDC in 252 categories are calculated and the word vectors are formed; we then form the Word-Category RIG Matrix for the LSC. For each word, the sum (S) and maximum (M) of RIGs in categories are calculated and appended at the end of the matrix (last two columns). The Word-Category RIG Matrix for the LScDC with 252 categories, the sum of RIGs in categories and the maximum of RIGs over categories can be found in the database.

    Leicester Scientific Thesaurus (LScT): LScT is a list of 5,000 words from the LScDC [2]. Words of LScDC are sorted in descending order by the sum (S) of RIGs in categories and the top 5,000 words are selected for inclusion in the LScT. We consider these 5,000 words as the most meaningful words in the scientific corpus: the meaningfulness of words is evaluated by the words' average informativeness in the categories, and the list of these words is considered a ‘thesaurus’ for science. The LScT with the value of the sum can be found as a CSV file in the published archive.
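
    A minimal sketch (my own illustration in Python, not the code released with this archive) of how the RIG of a category from a word can be computed from two binary indicator vectors over the texts of a corpus:

    import numpy as np

    def entropy(probs):
        p = np.array([q for q in probs if q > 0])
        return float(-np.sum(p * np.log2(p)))

    def rig(word_in_text, text_in_category):
        # Both arguments are binary arrays with one entry per text.
        w = np.asarray(word_in_text, dtype=bool)
        c = np.asarray(text_in_category, dtype=bool)
        h_c = entropy([c.mean(), 1 - c.mean()])          # H(category)
        h_c_given_w = 0.0                                # H(category | word)
        for observed in (True, False):
            mask = (w == observed)
            if mask.any():
                p_c = c[mask].mean()
                h_c_given_w += mask.mean() * entropy([p_c, 1 - p_c])
        return (h_c - h_c_given_w) / h_c if h_c > 0 else 0.0

    print(rig([1, 1, 0, 0], [1, 1, 0, 0]))  # a perfectly informative word -> 1.0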
    The published archive contains the following files:
    1) Word_Category_RIG_Matrix.csv: A 103,998 by 254 matrix where columns are 252 WoS categories plus the sum (S) and the maximum (M) of RIGs in categories (last two columns), and rows are words of LScDC. Each entry in the first 252 columns is the RIG from the word to the category. Words are ordered as in the LScDC.
    2) Word_Category_Frequency_Matrix.csv: A 103,998 by 252 matrix where columns are 252 WoS categories and rows are words of LScDC. Each entry of the matrix is the number of texts containing the word in the corresponding category. Words are ordered as in the LScDC.
    3) LScT.csv: List of words of LScT with sum (S) values.
    4) Text_No_in_Cat.csv: The number of texts in categories.
    5) Categories_in_Documents.csv: List of WoS categories for each document of the LSC.
    6) README.txt: Description of the Word-Category RIG Matrix, the Word-Category Frequency Matrix and LScT, and the procedures used to form them.
    7) README.pdf (same as 6 in PDF format)

    References:
    [1] Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
    [2] Suzen, Neslihan (2019): LScDC (Leicester Scientific Dictionary-Core). figshare. Dataset. https://doi.org/10.25392/leicester.data.9896579.v3
    [3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
    [4] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
    [5] Suzen, N., Mirkes, E. M., & Gorban, A. N. (2019). LScDC-new large scientific dictionary. arXiv preprint arXiv:1912.06858.
    [6] Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379-423.

  8. Controlled feature selection and compressive big data analytics:...

    • plos.figshare.com
    docx
    Updated May 30, 2023
    Cite
    Simeone Marino; Jiachen Xu; Yi Zhao; Nina Zhou; Yiwang Zhou; Ivo D. Dinov (2023). Controlled feature selection and compressive big data analytics: Applications to biomedical and health studies [Dataset]. http://doi.org/10.1371/journal.pone.0202674
    Explore at:
    docx
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Simeone Marino; Jiachen Xu; Yi Zhao; Nina Zhou; Yiwang Zhou; Ivo D. Dinov
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The theoretical foundations of Big Data Science are not fully developed, yet. This study proposes a new scalable framework for Big Data representation, high-throughput analytics (variable selection and noise reduction), and model-free inference. Specifically, we explore the core principles of distribution-free and model-agnostic methods for scientific inference based on Big Data sets. Compressive Big Data analytics (CBDA) iteratively generates random (sub)samples from a big and complex dataset. This subsampling with replacement is conducted on the feature and case levels and results in samples that are not necessarily consistent or congruent across iterations. The approach relies on an ensemble predictor where established model-based or model-free inference techniques are iteratively applied to preprocessed and harmonized samples. Repeating the subsampling and prediction steps many times, yields derived likelihoods, probabilities, or parameter estimates, which can be used to assess the algorithm reliability and accuracy of findings via bootstrapping methods, or to extract important features via controlled variable selection. CBDA provides a scalable algorithm for addressing some of the challenges associated with handling complex, incongruent, incomplete and multi-source data and analytics challenges. Albeit not fully developed yet, a CBDA mathematical framework will enable the study of the ergodic properties and the asymptotics of the specific statistical inference approaches via CBDA. We implemented the high-throughput CBDA method using pure R as well as via the graphical pipeline environment. To validate the technique, we used several simulated datasets as well as a real neuroimaging-genetics of Alzheimer’s disease case-study. The CBDA approach may be customized to provide generic representation of complex multimodal datasets and to provide stable scientific inference for large, incomplete, and multisource datasets.
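
    A conceptual sketch of the iterative subsampling loop described above (a simplification in Python, not the authors' R implementation): repeatedly draw cases with replacement and a random subset of features, fit a base learner, and count how often each feature is flagged as important.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 100))              # hypothetical data matrix
    y = (X[:, 3] + X[:, 7] > 0).astype(int)      # outcome driven by features 3 and 7

    n_iter, case_frac, feat_frac = 200, 0.3, 0.1
    hits = np.zeros(X.shape[1])

    for _ in range(n_iter):
        cases = rng.choice(X.shape[0], size=int(case_frac * X.shape[0]), replace=True)
        feats = rng.choice(X.shape[1], size=int(feat_frac * X.shape[1]), replace=False)
        model = LogisticRegression(max_iter=200).fit(X[np.ix_(cases, feats)], y[cases])
        strongest = feats[np.argsort(np.abs(model.coef_[0]))[-2:]]   # keep the 2 largest coefficients
        hits[strongest] += 1

    print(np.argsort(hits)[-5:])  # features most frequently flagged across iterations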

  9. Data from: Supernova Detection Datasets

    • dataverse.harvard.edu
    Updated Nov 16, 2020
    Cite
    Kai Yin (2020). Supernova Detection Datasets [Dataset]. http://doi.org/10.7910/DVN/JGO6VI
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 16, 2020
    Dataset provided by
    Harvard Dataverse
    Authors
    Kai Yin
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Supernovae are dying stars whose light curves change in certain special patterns, which helps astronomers infer the expansion coefficient of the Universe. A series of sky surveys were launched in search of supernovae and generated a tremendous amount of data, which pushed astronomy into a new era of big data. While traditional machine learning methods perform well on such data, deep learning methods such as convolutional neural networks demonstrate more powerful adaptability for big data in this area. However, most data in existing works are either simulated or lack generality. To address these problems, we collected and sorted all the known objects of the Pan-STARRS and the Popular Supernova Project (PSP) surveys, produced two datasets, and then compared the YOLOv3 and FCOS algorithms on them.

  10. SILO (Scientific Information for Land Owners) is a database of Australian...

    • access.earthdata.nasa.gov
    • cmr.earthdata.nasa.gov
    Updated Jan 2, 2025
    + more versions
    Cite
    (2025). SILO (Scientific Information for Land Owners) is a database of Australian climate data from 1889 (current to yesterday). [Dataset]. https://access.earthdata.nasa.gov/collections/C1214600564-SCIOPS
    Explore at:
    Dataset updated
    Jan 2, 2025
    Time period covered
    Jan 1, 1889 - Present
    Area covered
    Earth
    Description

    SILO (Scientific Information for Land Owners) is a database of Australian climate data from 1889 (current to yesterday). It provides daily datasets for a range of climate variables in ready-to-use formats suitable for research and climate applications. SILO products provide national coverage with interpolated infills for missing data, which allows you to focus on your research or model development without the burden of data preparation.

    SILO is hosted by the Science and Technology Division of the Queensland Government's Department of Environment and Science (DES). The datasets are constructed from observational data obtained from the Australian Bureau of Meteorology.

  11. ObjectNET [4 of 10]

    • kaggle.com
    zip
    Updated Jul 15, 2022
    + more versions
    Cite
    Darien Schettler (2022). ObjectNET [4 of 10] [Dataset]. https://www.kaggle.com/datasets/dschettler8845/objectnet-4-of-10
    Explore at:
    zip (19434765775 bytes)
    Dataset updated
    Jul 15, 2022
    Authors
    Darien Schettler
    Description

    NOTE: BY USING THIS DATASET YOU ACKNOWLEDGE THAT YOU HAVE READ THE LICENSE AND WILL ABIDE BY THE TERMS THEREWITHIN

    THE LICENSE

    ObjectNet is free to use for both research and commercial
    applications. The authors own the source images and allow their use
    under a license derived from Creative Commons Attribution 4.0 with
    two additional clauses:
    
    1. ObjectNet may never be used to tune the parameters of any
      model. This includes, but is not limited to, computing statistics
      on ObjectNet and including those statistics into a model,
      fine-tuning on ObjectNet, performing gradient updates on any
      parameters based on these images.
    
    2. Any individual images from ObjectNet may only be posted to the web
      including their 1 pixel red border.
    
    If you post this archive in a public location, please leave the password
    intact as "objectnetisatestset".
    
    [Other General License Information Conforms to Attribution 4.0 International]
    



    This is Part 4 of 10 * Original Paper Link * ObjectNet Website


    The links to the various parts of the dataset are:



    Description From ObjectNET Homepage



    What is ObjectNet?

    • A new kind of vision dataset borrowing the idea of controls from other areas of science.
    • No training set, only a test set! Put your vision system through its paces.
    • Collected to intentionally show objects from new viewpoints on new backgrounds.
    • 50,000 image test set, same as ImageNet, with controls for rotation, background, and viewpoint.
    • 313 object classes, 113 of which overlap with ImageNet classes
    • Large performance drop, what you can expect from vision systems in the real world!
    • Robust to fine-tuning and a very difficult transfer learning problem


    Controls For Biases Increase Variation


    (Image: https://objectnet.dev/images/objectnet_controls_table.png)



    Easy For Humans, Hard For Machines

    • Ready to help develop the next generation of object recognition algorithms that have robustness, bias, and safety in mind.
    • Controls can remove bias from other machine learning datasets, not just vision datasets.


    (Image: https://objectnet.dev/images/objectnet_results.png)



    Full Description

    ObjectNet is a large real-world test set for object recognition with controls, where object backgrounds, rotations, and imaging viewpoints are random.

    Most scientific experiments have controls, confounds which are removed from the data, to ensure that subjects cannot perform a task by exploiting trivial correlations in the data. Historically, large machine learning and computer vision datasets have lacked such controls. This has resulted in models that must be fine-tuned for new datasets and perform better on datasets than in real-world applications. When tested on ObjectNet, object detectors show a 40-45% drop in performance, with respect to their performance on other benchmarks, due to the controls for biases. Controls make ObjectNet robust to fine-tuning showing only small performance increases.

    We develop a highly automated platform that enables gathering datasets with controls by crowdsourcing image capturing and annotation. ObjectNet is the same size as the ImageNet test set (50,000 images), and by design does not come paired with a training set in order to encourage generalization. The dataset is both easier than ImageNet – objects are largely centred and unoccluded – and harder, due to the controls. Although we focus on object recognition here, data with controls can be gathered at scale using automated tools throughout machine learning to generate datasets that exercise models in new ways thus providing valuable feedback to researchers. This work opens up new avenues for research in generalizable, robust, and more human-like computer vision and in creating datasets where results are predictive of real-world performance.


    Citation

    ...

  12. Data from: MOF-ChemUnity: Literature-Informed Large Language Models for...

    • acs.figshare.com
    xlsx
    Updated Nov 10, 2025
    Cite
    Thomas Michael Pruyn; Amro Aswad; Sartaaj Takrim Khan; Ju Huang; Robert Black; Seyed Mohamad Moosavi (2025). MOF-ChemUnity: Literature-Informed Large Language Models for Metal–Organic Framework Research [Dataset]. http://doi.org/10.1021/jacs.5c11789.s002
    Explore at:
    xlsx
    Dataset updated
    Nov 10, 2025
    Dataset provided by
    ACS Publications
    Authors
    Thomas Michael Pruyn; Amro Aswad; Sartaaj Takrim Khan; Ju Huang; Robert Black; Seyed Mohamad Moosavi
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Artificial intelligence (AI) is transforming research in metal–organic frameworks (MOFs), where models trained on structured computational data routinely predict new materials and optimize their properties. This raises a central question: What if we could leverage the full breadth of MOF knowledge, not just structured data sets, but also the scientific literature? For researchers, the literature remains the primary source of knowledge, yet much of its content, including experimental data and expert insight, remains underutilized by AI systems. We introduce MOF-ChemUnity, a structured, extensible, and scalable knowledge graph that unifies MOF data by linking literature-derived insights to crystal structures and computational data sets. By disambiguating MOF names in the literature and connecting them to crystal structures in the Cambridge Structural Database, MOF-ChemUnity unifies experimental and computational sources and enables cross-document knowledge extraction and linking. We showcase how this enables multiproperty machine learning across simulated and experimental data, compilation of complete synthesis records for individual compounds by aggregating information across multiple publications, and expert-guided materials recommendations via structure-based machine learning descriptors for pore geometry and chemistry. When used as a knowledge source to augment large language models (LLMs), MOF-ChemUnity enables a literature-informed AI assistant that operates over the full scope of MOF knowledge. Expert evaluations show improved accuracy, interpretability, and trustworthiness across tasks such as retrieval, inference of structure–property relationships, and materials recommendation, outperforming standard LLMs. This work lays the foundation for literature-informed materials discovery, enabling both scientists and AI systems to reason over the full existing knowledge in a new way.

  13. PMC-Patients-Dataset for Clinical Decision Support

    • kaggle.com
    zip
    Updated Jul 22, 2024
    Cite
    Priyam Choksi (2024). PMC-Patients-Dataset for Clinical Decision Support [Dataset]. https://www.kaggle.com/datasets/priyamchoksi/pmc-patients-dataset-for-clinical-decision-support
    Explore at:
    zip (191511713 bytes)
    Dataset updated
    Jul 22, 2024
    Authors
    Priyam Choksi
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The PMC-Patients dataset is a pioneering resource designed for developing and benchmarking Retrieval-based Clinical Decision Support (ReCDS) systems. It comprises 167,000 patient summaries extracted from case reports in PubMed Central (PMC), along with 3.1 million patient-article relevance annotations and 293,000 patient-patient similarity annotations defined by the PubMed citation graph. This dataset is invaluable for advancing research in clinical decision support and patient information retrieval.
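
    A hypothetical sketch (not the official benchmark code) of ranking patient summaries against a query patient with BM25, in the spirit of the ReCDS retrieval tasks; the file name and field names below are assumptions about the data layout.

    import json
    from rank_bm25 import BM25Okapi

    with open("PMC-Patients.json") as f:             # assumed file name
        patients = json.load(f)

    summaries = [p["patient"] for p in patients]      # assumed field holding the summary text
    bm25 = BM25Okapi([s.lower().split() for s in summaries])

    query = "65 year old male with sepsis and acute kidney injury".lower().split()
    for summary in bm25.get_top_n(query, summaries, n=5):
        print(summary[:100])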

    Dataset Details:

    • Patient Summaries: 167,000 patient summaries extracted from PubMed Central (PMC) case reports.
    • Patient-Article Relevance: 3.1 million annotations indicating relevance between patients and articles.
    • Patient-Patient Similarity: 293,000 annotations defining similarities between patients based on the PubMed citation graph.
    • Benchmarking: Includes training, development, and test data for the ReCDS benchmark.
    • References: Articles used in the dataset are credited in meta_data/PMC-Patients_citations.json.

    Usage:

    • Clinical Decision Support: Develop and evaluate systems for retrieving relevant patient information to aid clinical decision-making.
    • Benchmarking: Use the provided data to benchmark and compare different ReCDS systems.
    • Similarity Analysis: Analyze patient similarities and their relevance to clinical information.

    Citation:

    If you use this dataset in your research, please cite the following paper:

    @article{Zhao2023ALD,
     title={A large-scale dataset of patient summaries for retrieval-based clinical decision support systems.},
     author={Zhengyun Zhao and Qiao Jin and Fangyuan Chen and Tuorui Peng and Sheng Yu},
     journal={Scientific data},
     year={2023},
     volume={10 1},
     pages={909},
     url={https://api.semanticscholar.org/CorpusID:266360591}
    }
    
  14. Spearman Correlation Heatmaps After Feature Selection

    • data.mendeley.com
    Updated Nov 20, 2024
    Cite
    abdulkader hajjouz (2024). Spearman Correlation Heatmaps After Feature Selection [Dataset]. http://doi.org/10.17632/hxd7gmrvth.1
    Explore at:
    Dataset updated
    Nov 20, 2024
    Authors
    abdulkader hajjouz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description: This is a Spearman correlation heatmap of the 32 features used for machine learning and deep learning models in cybersecurity. The diagonal cells are perfect self-correlation (value = 1) and the off-diagonal cells are pairwise correlations between features. Since there are no strong correlations (close to 1 or -1), the redundant or irrelevant features have already been removed, so each selected feature brings unique and independent information to the model. Feature selection is key in building cyber intrusion detection systems as it reduces computational overhead, simplifies the model, and improves accuracy and robustness. This is part of a systematic feature engineering process to optimize datasets for anomaly detection, network traffic analysis, and intrusion detection. Researchers in AI for cybersecurity can use this to build more interpretable and efficient models for detection in large-scale networks. This figure shows the importance of correlation analysis for high-dimensional datasets and contributes to cybersecurity, data science, and machine learning.
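
    A minimal sketch (not the code behind the published figure) of producing such a heatmap with pandas and seaborn; "features.csv" and its columns are placeholders for the selected features.

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.read_csv("features.csv")            # one column per selected feature
    corr = df.corr(method="spearman")           # rank-based pairwise correlations

    plt.figure(figsize=(12, 10))
    sns.heatmap(corr, cmap="coolwarm", vmin=-1, vmax=1, square=True)
    plt.title("Spearman correlation of selected features")
    plt.tight_layout()
    plt.savefig("spearman_heatmap.png", dpi=200)

    # Feature pairs with |correlation| above a chosen threshold (e.g. 0.9)
    # would be candidates for removal during feature selection.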

    Why It Matters: Reduces overfitting in machine learning models. Improves computational efficiency for large-scale datasets. Enhances feature interpretability for robust cybersecurity solutions.

    Keywords: Spearman Correlation Heatmap, Feature Selection, Intrusion Detection System, Cybersecurity, Machine Learning, Deep Learning, Anomaly Detection, Network Traffic Analysis, Artificial Intelligence in Cybersecurity, Dataset Optimization, Feature Engineering for Cyber Threats

    References: This file pertains to our research study, which has been accepted for publication in the Scientific and Technical Journal of Information Technologies, Mechanics and Optics. The study is titled: "Enhancing and Extending CatBoost for Accurate Detection and Classification of DoS and DDoS Attack Subtypes in Network Traffic."

    https://doi.org/10.1109/ICSIP61881.2024.10671552 https://doi.org/10.24143/2072-9502-2024-3-65-74

  15. Harnessing the Power of Digital Data for Science and Society: Report of the...

    • catalog.data.gov
    • s.cnmilf.com
    • +2more
    Updated May 14, 2025
    Cite
    NCO NITRD (2025). Harnessing the Power of Digital Data for Science and Society: Report of the Interagency Working Group on Digital Data to the Committee on Science of the National Science and Technology Council [Dataset]. https://catalog.data.gov/dataset/harnessing-the-power-of-digital-data-for-science-and-society-report-of-the-interagency-wor
    Explore at:
    Dataset updated
    May 14, 2025
    Dataset provided by
    NCO NITRD
    Description

    This report provides a strategy to ensure that digital scientific data can be reliably preserved for maximum use in catalyzing progress in science and society. Empowered by an array of new digital technologies, science in the 21st century will be conducted in a fully digital world. In this world, the power of digital information to catalyze progress is limited only by the power of the human mind. Data are not consumed by the ideas and innovations they spark but are an endless fuel for creativity. A few bits, well found, can drive a giant leap of creativity. The power of a data set is amplified by ingenuity through applications unimagined by the authors and distant from the original field...

  16. Research Data supporting "Extraction of chemical synthesis information using...

    • repository.cam.ac.uk
    bin, xml
    Updated Aug 6, 2025
    Cite
    Rihm, Simon; Saluz, Fabio; Kondinski, Aleksandar; Kraft, Markus (2025). Research Data supporting "Extraction of chemical synthesis information using The World Avatar" [Dataset]. http://doi.org/10.17863/CAM.118147
    Explore at:
    bin (20400427 bytes), bin (3382149 bytes), xml (39948 bytes), bin (3017390 bytes), bin (2904136 bytes)
    Dataset updated
    Aug 6, 2025
    Dataset provided by
    University of Cambridge
    Apollo
    Authors
    Rihm, Simon; Saluz, Fabio; Kondinski, Aleksandar; Kraft, Markus
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    World
    Description

    An LLM-based automated pipeline was created (see referenced article) for extracting synthesis procedures of metal-organic polyhedra, structuring the information in ontology-based triples, and integrating them with pre-existing data in a knowledge graph.

    This data set includes the underlying ontology for general synthesis procedures as well as the structured and integrated synthesis information extracted from 75 selected scientific papers. Moreover, 2 smaller datasets containing synthesis information from 9 papers (one automatically extracted and one manually curated) are included that were used for verification purposes and calculating performance metrics.
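
    For illustration only, a hypothetical sketch of expressing one extracted synthesis step as RDF triples with rdflib; the namespace and property names are invented and are not the ontology released with this dataset.

    from rdflib import Graph, Literal, Namespace, RDF

    ONTO = Namespace("https://example.org/synthesis#")   # placeholder namespace
    g = Graph()

    step = ONTO["step_1"]
    g.add((step, RDF.type, ONTO.ChemicalSynthesisStep))
    g.add((step, ONTO.usesChemical, ONTO["copper_nitrate"]))
    g.add((step, ONTO.hasTemperature, Literal("80 C")))
    g.add((step, ONTO.hasDuration, Literal("12 h")))

    print(g.serialize(format="turtle"))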

  17. SIMORC, System of Industry Metocean data for the Offshore and Research...

    • bodc.ac.uk
    nc
    Updated Sep 2, 2008
    Cite
    Total SA (2008). SIMORC, System of Industry Metocean data for the Offshore and Research Communities [Dataset]. https://www.bodc.ac.uk/resources/inventories/edmed/report/2998/
    Explore at:
    nc
    Dataset updated
    Sep 2, 2008
    Dataset authored and provided by
    Total SA
    License

    https://vocab.nerc.ac.uk/collection/L08/current/LI/

    Time period covered
    1970 - Present
    Area covered
    World,
    Description

    Observed metocean data, analyses and climate studies provide the oil and gas industry with essential information and knowledge for the design and engineering of offshore installations, such as production platforms and pipelines, and for assessing workability conditions. In addition the information is used for supporting the planning of, for example diving operations and the installation of pipelines, and the forecasting of storms and heavy weather conditions, which might require timely evacuation or other safety measures to be taken during the operation of offshore installations. To support these activities, and to complement metocean data and information, that can be retrieved from public sources, major oil and gas companies are active in monitoring and collecting metocean data themselves. This is done all over the world and over many years the oil and gas companies have acquired together a large volume of data sets. These data sets are acquired, processed and analysed both in joint industry projects (JIPs) as well as in field surveys and monitoring activities performed for individual companies. Often these data sets are acquired at substantial cost and in remote areas. These datasets are managed by the metocean departments of the oil and gas companies and stored in various formats and are only exchanged on a limited scale between companies. Despite various industry co-operative joint projects, there is not yet a common awareness of available data sets and no systematic indexing and archival of these data sets within the industry. Furthermore there is only limited reporting and access to these data sets and results of field studies for other parties, in particular the scientific community. Opening up these data sets for further use will provide favourable conditions for creating highly valuable extra knowledge of both local and regional ocean and marine systems. To stimulate and support a wider application of these industry metocean datasets a System of Industry Metocean data for the Offshore and Research Communities (SIMORC) has been established within the framework of the SIMORC project. The SIMORC project is co-funded by the European Commission for a 2 year project period starting 1st June 2005.
    The SIMORC system consists of an index metadatabase and a database of actual data sets that together are accessible through the Internet. The index metadatabase is public domain, while access to data is regulated by a dedicated SIMORC Data Protocol. This contains rules for access and use of data sets by scientific users, by oil & gas companies, and by third parties. All metocean data sets in the database have undergone quality control and conversion to unified formats, resulting in consistent and high quality, harmonized data sets. SIMORC is a unique and challenging development, undertaken by major ocean data management specialists: MARIS (NL) coordinator and operator, BODC (UK) and IOC-IODE (UNESCO), and the International Association of Oil & Gas Producers (OGP), involving participation of major oil & gas companies that bring in their considerable data sets. The objective is to expand the coverage of the SIMORC database with regular submissions of major oil & gas companies, while the SIMORC service will be operated as part of OGP services.

  18. ccPDB - Compilation and Creation of datasets from PDB

    • neuinfo.org
    • scicrunch.org
    Updated Jan 29, 2022
    Cite
    (2022). ccPDB - Compilation and Creation of datasets from PDB [Dataset]. http://identifiers.org/RRID:SCR_005870
    Explore at:
    Dataset updated
    Jan 29, 2022
    Description

    ccPDB (Compilation and Creation of datasets from PDB) is designed to provide a service to the scientific community working in the field of function or structure annotation of proteins. This database of datasets is based on the Protein Data Bank (PDB); all datasets were derived from PDB. ccPDB has four modules: i) compilation of datasets, ii) creation of datasets, iii) web services and iv) important links.

    • Compilation of Datasets: Datasets at ccPDB can be classified into two categories: i) datasets collected from the literature and ii) datasets compiled from PDB. We are in the process of collecting PDB datasets from the literature and maintaining them at ccPDB, and we also ask the community to suggest datasets. In addition, we generate datasets from PDB; these datasets were generated using commonly used standard protocols such as non-redundant chains and structures solved at high resolution.

    • Creation of Datasets: This module was developed for creating customized datasets, where users can create a dataset from PDB using their own conditions. It will be useful for users who wish to create a new dataset as per their requirements. The module has six steps, which are described in the help page.

    • Web Services: We integrated the following web services in ccPDB: i) the Analyze PDB ID service allows users to submit their PDB ID to around 40 servers from a single point, ii) BLAST search allows users to perform a BLAST search of their protein against PDB, iii) the Structural information service is designed for annotating a protein structure from a PDB ID, iv) Search in PDB helps users search for structures in PDB, v) the Generate patterns service generates different types of patterns required for machine learning techniques and vi) Download useful information allows users to download various types of information for a given set of proteins (PDB IDs).

    • Important Links: One of the major objectives of this web site is to provide links to web servers related to functional annotation of proteins. In the first phase we have collected and compiled these links in different categories. In the future, attempts will be made to collect as many links as possible.

  19. The Enhanced Microsoft Academic Knowledge Graph - Dataset - B2FIND

    • demo-b2find.dkrz.de
    Updated May 3, 2024
    Cite
    (2024). The Enhanced Microsoft Academic Knowledge Graph - Dataset - B2FIND [Dataset]. http://demo-b2find.dkrz.de/dataset/0b242683-17d4-5b73-8606-3ea007e5e3c2
    Explore at:
    Dataset updated
    May 3, 2024
    Description

    The Enhanced Microsoft Academic Knowledge Graph (EMAKG) is a large dataset of scientific publications and related entities, including authors, institutions, journals, conferences, and fields of study. The proposed dataset originates from the Microsoft Academic Knowledge Graph (MAKG), one of the most extensive freely available knowledge graphs of scholarly data. To build the dataset, we first assessed the limitations of the current MAKG. Then, based on these, several methods were designed to enhance the data and increase the number of use case scenarios, particularly in mobility and network analysis. EMAKG provides two main advantages: it has improved usability, facilitating access for non-expert users, and it includes an increased number of types of information obtained by integrating various datasets and sources, which helps expand the application domains. For instance, geographical information could help mobility and migration research. The knowledge graph's completeness is improved by retrieving and merging information on publications and other entities no longer available in the latest version of MAKG. Furthermore, geographical and collaboration network details are employed to provide data on authors as well as their annual locations and career nationalities, together with worldwide yearly stocks and flows. Among others, the dataset also includes: fields of study (and publications) labelled by their discipline(s); abstracts and linguistic features, i.e., standard language codes, tokens, and types; entities' general information, e.g., date of foundation and type of institutions; and academia-related metrics, i.e., h-index. The resulting dataset maintains all the characteristics of the parent datasets and includes a set of additional subsets and data that can be used for new case studies relating to network analysis, knowledge exchange, linguistics, computational linguistics, and mobility and human migration, among others.

  20. Biolinks, datasets and algorithms supporting semantic-based distribution and...

    • zenodo.org
    • data.niaid.nih.gov
    bin, tsv, zip
    Updated Jan 24, 2020
    Cite
    Leyla Jael Garcia Castro; Rafael Berlanga; Alexander Garcia (2020). Biolinks, datasets and algorithms supporting semantic-based distribution and similarity for scientific publications [Dataset]. http://doi.org/10.5281/zenodo.829920
    Explore at:
    zip, tsv, binAvailable download formats
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Leyla Jael Garcia Castro; Rafael Berlanga; Alexander Garcia
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Finding articles related to a publication of interest remains a challenge in the Life Sciences domain as the number of scientific publications grows day by day. Publication repositories such as PubMed and Elsevier provide lists of similar articles, where similarity is commonly calculated from the title, the abstract and some keywords assigned to articles. Here we present the datasets and algorithms used in Biolinks. Biolinks uses ontological concepts extracted from publications and makes it possible to calculate a distribution score across semantic groups, as well as a semantic similarity based on either all identified annotations or narrowed to one or more particular semantic groups. Biolinks supports both title-and-abstract-only and full-text articles.
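    As a minimal sketch of the idea only (not Biolinks' exact formulas), a document can be summarised as annotation counts per semantic group; the distribution score is then the normalised share of each group, and the similarity between two documents is a cosine over the group vectors, optionally narrowed to selected groups. The group labels and counts below are hypothetical.

      # Hedged sketch: group distribution score and group-narrowed cosine similarity.
      import math

      def distribution(counts: dict) -> dict:
          total = sum(counts.values()) or 1
          return {group: n / total for group, n in counts.items()}

      def similarity(a: dict, b: dict, groups=None) -> float:
          keys = set(a) | set(b)
          if groups is not None:                 # narrow to chosen semantic groups
              keys &= set(groups)
          va = [a.get(k, 0) for k in keys]
          vb = [b.get(k, 0) for k in keys]
          norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(x * x for x in vb))
          return sum(x * y for x, y in zip(va, vb)) / norm if norm else 0.0

      doc1 = {"DISO": 12, "CHEM": 3, "GENE": 7}  # hypothetical annotation counts
      doc2 = {"DISO": 9, "CHEM": 1, "PROC": 4}
      print(distribution(doc1))
      print(similarity(doc1, doc2, groups=["DISO", "CHEM"]))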

    Materials: In a previous work [1], 4,240 articles from the TREC-05 collection [2] were selected. The titles and abstracts of those 4,240 articles were annotated with Unified Medical Language System (UMLS) concepts; these annotations are referred to as our TA-dataset and correspond to the JSON files under the pubmed folder in the JSON-LD.zip file. Of those 4,240 articles, full text was available for only 62. The title-and-abstract annotations for those 62 articles, the TAFT-dataset, are located under the pubmed-pmc folder in the JSON-LD.zip file, which also contains the full-text annotations under the pmc folder, the FT-dataset. The list of articles with title and abstract is found in the genomics.qrels.large.pubmed.onlyRelevants.titleAndAbstract.tsv file, while those with full text are recorded in the genomics.qrels.large.pmc.onlyRelevants.fullContent.tsv file.

    Here we include the annotations on title and abstract as well as those for full-text for all our datasets (profiles.zip). We also provide the global similarity matrices (similarity.zip).

    Methods: The TA-dataset was used to calculate the Information Gain (IG) according to the UMLS semantic groups, see IG_umls_groups.PMID.xlsx. A new grouping is proposed for Biolinks, see biolinks_groups.tsv. The IG was also calculated for the Biolinks groups, see IG_biolinks_groups.PMID.xlsx, showing an improvement of around 5%.
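    How the classes and features for this IG computation are defined is detailed in the paper and not restated here; as a heavily hedged sketch, the information gain of one grouping with respect to a set of class labels is IG = H(class) - H(class | grouping), which could be computed as follows (the toy labels are invented):

      # Hedged sketch: information gain of a categorical feature w.r.t. class labels.
      from collections import Counter
      from math import log2

      def entropy(labels):
          total = len(labels)
          return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

      def information_gain(classes, features):
          """classes[i] and features[i] describe the same article."""
          h_class = entropy(classes)
          h_cond = 0.0
          for value in set(features):
              subset = [c for c, f in zip(classes, features) if f == value]
              h_cond += len(subset) / len(classes) * entropy(subset)
          return h_class - h_cond

      # toy example: does an article's dominant semantic group predict its topic?
      topics = ["genomics", "genomics", "clinical", "clinical", "genomics"]
      dominant_group = ["GENE", "GENE", "DISO", "DISO", "DISO"]
      print(information_gain(topics, dominant_group))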

    In order to assess the similarity metric with respect to the cohesion of the TREC-05 groups, we used Silhouette Coefficient analyses. An additional dataset, the Stem-TAFT-dataset, was used and compared to the TAFT and FT datasets.
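    As a minimal sketch, assuming a precomputed document-by-document similarity matrix (as in similarity.zip) and a group or topic label per document, the Silhouette Coefficient can be obtained from the corresponding distance matrix with scikit-learn; the matrix and labels below are toy values.

      # Hedged sketch: silhouette analysis on distances derived from similarities.
      import numpy as np
      from sklearn.metrics import silhouette_score

      similarities = np.array([[1.0, 0.8, 0.1, 0.2],
                               [0.8, 1.0, 0.2, 0.1],
                               [0.1, 0.2, 1.0, 0.7],
                               [0.2, 0.1, 0.7, 1.0]])   # toy pairwise similarities
      labels = np.array([0, 0, 1, 1])                   # toy group assignments

      distances = 1.0 - similarities                    # turn similarity into distance
      np.fill_diagonal(distances, 0.0)
      print(silhouette_score(distances, labels, metric="precomputed"))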

    Biolinks groups were used to calculate a semantic group distribution score for each article in all our datasets. A semantic similarity metric based on PubMed related articles [3] is also provided; the Biolinks groups can be used to narrow the similarity to one or more selected groups. All the corresponding algorithms are open access and available on GitHub under the Apache-2.0 license; a frozen version, biotea-io-parser-master.zip, is provided here. To facilitate the analysis of our datasets based on the annotations as well as the distribution and similarity scores, some web-based visualization components were created. All of them are open access and available on GitHub under the Apache-2.0 license; frozen versions are provided here, see the files biotea-vis-annotation-master.zip, biotea-vis-similarity-master.zip, biotea-vis-tooltip-master.zip and biotea-vis-topicDistribution-master.zip. These components are brought together by biotea-vis-biolinks-master.zip. A demo is provided at http://ljgarcia.github.io/biotea-biolinks/; the demo was built on top of GitHub Pages, and a frozen version of the gh-pages branch is provided here, see biotea-biolinks-gh-pages.zip.

    Conclusions: Biolinks assigns a weight to each semantic group based on the annotations extracted from either title-and-abstract or full-text articles. It also measures similarity for a pair of documents using the semantic information. The distribution and similarity metrics can be narrowed to a subset of the semantic groups, enabling researchers to focus on what is more relevant to them.

    [1] Garcia Castro, L.J., R. Berlanga, and A. Garcia, In the pursuit of a semantic similarity metric based on UMLS annotations for articles in PubMed Central Open Access. Journal of Biomedical Informatics, 2015. 57: p. 204-218

    [2] Text Retrieval Conference 2005 - Genomics Track. TREC-05 Genomics Track ad hoc relevance judgement. 2005 [cited 2016 23rd August]; Available from: http://trec.nist.gov/data/genomics/05/genomics.qrels.large.txt

    [3] Lin, J. and W.J. Wilbur, PubMed related articles: a probabilistic topic-based model for content similarity. BMC Bioinformatics, 2007. 8(1): p. 423
