💬 Also have a look at
💡 UNIVERSITIES & Research INSTITUTIONS Rank - SCImagoIR
💡 Scientific JOURNALS Indicators & Info - SCImagoJR
☢️❓The entire dataset is obtained from public and open-access data of ScimagoJR (SCImago Journal & Country Rank)
ScimagoJR Country Rank
SCImagoJR About Us
Documents: Number of documents published during the selected year. It is usually called the country's scientific output.
Citable Documents: Number of citable documents published during the selected year. Only articles, reviews and conference papers are considered.
Citations: Number of citations received by the documents published during the source year, i.e., citations in years X, X+1, X+2, X+3, ... to documents published during year X. When the period 1996-2021 is referred to, all documents published during this period are considered.
Citations per Document: Average number of citations per document published during the source year, i.e., citations in years X, X+1, X+2, X+3, ... to documents published during year X. When the period 1996-2021 is referred to, all documents published during this period are considered.
Self Citations: Country self-citations. Number of self-citations of all dates received by the documents published during the source year, i.e., self-citations in years X, X+1, X+2, X+3, ... to documents published during year X. When the period 1996-2021 is referred to, all documents published during this period are considered.
H index: The h-index is the number of a country's articles (h) that have received at least h citations each. It quantifies both a country's scientific productivity and its scientific impact, and it is also applicable to scientists, journals, etc.
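As an illustration of the h-index definition above, here is a minimal Python sketch (not part of the SCImago data itself) that computes the index from a list of per-document citation counts:

```python
def h_index(citations):
    """Return the largest h such that at least h documents
    have received at least h citations each."""
    counts = sorted(citations, reverse=True)
    h = 0
    for rank, c in enumerate(counts, start=1):
        if c >= rank:
            h = rank
        else:
            break
    return h

# Example: five documents with these citation counts yield an h-index of 3.
print(h_index([10, 8, 5, 3, 0]))  # -> 3
```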
The Reclamation Research and Development Office funded an evaluation of file formats for large datasets to use in RISE through the Science & Technology Program. A team of Reclamation scientific and information technology (IT) subject matter experts evaluated multiple file formats commonly utilized for scientific data through literature review and independent benchmarks. The network Common Data Form (netCDF) and Zarr formats were identified as open-source options that could meet a variety of Reclamation use cases. The formats allow for metadata, data compression, subsetting, and appending in a single file using an efficient binary format. Additionally, the Zarr format is optimized for cloud storage applications. While support of both formats would provide the most flexibility, the maturity of the netCDF format led to its prioritization as the preferred RISE file format for large datasets.
This report documents the evaluation and selection of large data file formats for the RISE platform. Additionally, a preliminary list of identified changes to the RISE platform needed to support the netCDF format is provided. The intent is to frame future RISE development by providing a roadmap to support large datasets within the platform.
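As an illustration of what these formats offer, here is a minimal sketch using the xarray library (an assumption for illustration; the report does not prescribe specific tooling) that writes a small dataset to both netCDF and Zarr, with metadata attributes and coordinate-based subsetting:

```python
# Minimal sketch (assumes xarray, netCDF4 and zarr are installed); the values
# and variable names are illustrative only.
import numpy as np
import pandas as pd
import xarray as xr

# A small time series of daily reservoir elevations (illustrative values).
times = pd.date_range("2020-01-01", periods=5, freq="D")
ds = xr.Dataset(
    {"elevation": ("time", np.array([1200.1, 1200.3, 1200.2, 1199.9, 1200.0]))},
    coords={"time": times},
    attrs={"units": "feet", "description": "example reservoir elevation"},
)

ds.to_netcdf("example.nc")    # netCDF: self-describing, compressed binary file
ds.to_zarr("example.zarr")    # Zarr: chunked store suited to cloud object storage

subset = xr.open_dataset("example.nc").sel(time="2020-01-03")  # subsetting by coordinate
print(subset)
```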
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please cite the following paper when using this dataset: N. Thakur, “MonkeyPox2022Tweets: The first public Twitter dataset on the 2022 MonkeyPox outbreak,” Preprints, 2022, DOI: 10.20944/preprints202206.0172.v2
Abstract
The world is currently facing an outbreak of the monkeypox virus, and confirmed cases have been reported from 28 countries. Following a recent “emergency meeting”, the World Health Organization just declared monkeypox a global health emergency. As a result, people from all over the world are using social media platforms, such as Twitter, for information seeking and sharing related to the outbreak, as well as for familiarizing themselves with the guidelines and protocols that are being recommended by various policy-making bodies to reduce the spread of the virus. This is resulting in the generation of tremendous amounts of Big Data related to such paradigms of social media behavior. Mining this Big Data and compiling it in the form of a dataset can serve a wide range of use-cases and applications such as analysis of public opinions, interests, views, perspectives, attitudes, and sentiment towards this outbreak. Therefore, this work presents MonkeyPox2022Tweets, an open-access dataset of Tweets related to the 2022 monkeypox outbreak that were posted on Twitter since the first detected case of this outbreak on May 7, 2022. The dataset is compliant with the privacy policy, developer agreement, and guidelines for content redistribution of Twitter, as well as with the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles for scientific data management.
Data Description
The dataset consists of a total of 255,363 Tweet IDs of the same number of tweets about monkeypox that were posted on Twitter from 7th May 2022 to 23rd July 2022 (the most recent date at the time of dataset upload). The Tweet IDs are presented in 6 different .txt files based on the timelines of the associated tweets. The following provides the details of these dataset files.
• Filename: TweetIDs_Part1.txt (No. of Tweet IDs: 13926, Date Range of the Tweet IDs: May 7, 2022 to May 21, 2022)
• Filename: TweetIDs_Part2.txt (No. of Tweet IDs: 17705, Date Range of the Tweet IDs: May 21, 2022 to May 27, 2022)
• Filename: TweetIDs_Part3.txt (No. of Tweet IDs: 17585, Date Range of the Tweet IDs: May 27, 2022 to June 5, 2022)
• Filename: TweetIDs_Part4.txt (No. of Tweet IDs: 19718, Date Range of the Tweet IDs: June 5, 2022 to June 11, 2022)
• Filename: TweetIDs_Part5.txt (No. of Tweet IDs: 47718, Date Range of the Tweet IDs: June 12, 2022 to June 30, 2022)
• Filename: TweetIDs_Part6.txt (No. of Tweet IDs: 138711, Date Range of the Tweet IDs: July 1, 2022 to July 23, 2022)
The dataset contains only Tweet IDs in compliance with the terms and conditions mentioned in the privacy policy, developer agreement, and guidelines for content redistribution of Twitter. The Tweet IDs need to be hydrated to be used.
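For illustration, a minimal hydration sketch using the Twitter API v2 tweet-lookup endpoint is shown below; the bearer token is a placeholder, the endpoint accepts up to 100 IDs per request, and API access terms may have changed since the dataset was published (dedicated tools such as twarc serve the same purpose):

```python
# Minimal hydration sketch; requires your own bearer token and current API access.
import requests

BEARER_TOKEN = "YOUR_BEARER_TOKEN"  # placeholder

def hydrate(tweet_ids):
    url = "https://api.twitter.com/2/tweets"
    headers = {"Authorization": f"Bearer {BEARER_TOKEN}"}
    tweets = []
    for i in range(0, len(tweet_ids), 100):  # at most 100 IDs per request
        batch = tweet_ids[i:i + 100]
        resp = requests.get(url, headers=headers,
                            params={"ids": ",".join(batch),
                                    "tweet.fields": "created_at,text"})
        resp.raise_for_status()
        tweets.extend(resp.json().get("data", []))
    return tweets

with open("TweetIDs_Part1.txt") as f:
    ids = [line.strip() for line in f if line.strip()]
print(len(hydrate(ids)))
```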
Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
Citation metrics are widely used and misused. We have created a publicly available database of top-cited scientists that provides standardized information on citations, h-index, co-authorship-adjusted hm-index, citations to papers in different authorship positions, and a composite indicator (c-score). Data are shown separately for career-long impact and for single recent year impact. Metrics with and without self-citations and the ratio of citations to citing papers are given, and data on retracted papers (based on the Retraction Watch database) as well as citations to/from retracted papers have been added in the most recent iteration. Scientists are classified into 22 scientific fields and 174 sub-fields according to the standard Science-Metrix classification. Field- and subfield-specific percentiles are also provided for all scientists with at least 5 papers. Career-long data are updated to end-of-2023 and single recent year data pertain to citations received during calendar year 2023. The selection is based on the top 100,000 scientists by c-score (with and without self-citations) or a percentile rank of 2% or above in the sub-field. This version (7) is based on the August 1, 2024 snapshot from Scopus, updated to the end of citation year 2023.
This work uses Scopus data. Calculations were performed using all Scopus author profiles as of August 1, 2024. If an author is not on the list, it is simply because the composite indicator value was not high enough to appear on the list. It does not mean that the author does not do good work. PLEASE ALSO NOTE THAT THE DATABASE HAS BEEN PUBLISHED IN AN ARCHIVAL FORM AND WILL NOT BE CHANGED. The published version reflects Scopus author profiles at the time of calculation. We thus advise authors to ensure that their Scopus profiles are accurate. REQUESTS FOR CORRECTIONS OF THE SCOPUS DATA (INCLUDING CORRECTIONS IN AFFILIATIONS) SHOULD NOT BE SENT TO US. They should be sent directly to Scopus, preferably by use of the Scopus to ORCID feedback wizard (https://orcid.scopusfeedback.com/) so that the correct data can be used in any future annual updates of the citation indicator databases.
The c-score focuses on impact (citations) rather than productivity (number of publications) and it also incorporates information on co-authorship and author positions (single, first, last author). If you have additional questions, see the attached file on FREQUENTLY ASKED QUESTIONS. Finally, we alert users that all citation metrics have limitations and their use should be tempered and judicious. For more reading, we refer to the Leiden Manifesto: https://www.nature.com/articles/520429a
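For readers who want to explore the released tables programmatically, here is a minimal pandas sketch; the filename and column names are placeholders to be checked against the actual spreadsheet headers in the downloaded version:

```python
# Minimal exploration sketch; "career_2023.xlsx", "rank_pct_field" and
# "c_score_ns" are hypothetical names, not the release's real headers.
import pandas as pd

df = pd.read_excel("career_2023.xlsx")

# Scientists at or above the top-2% percentile rank within their sub-field,
# sorted by the c-score computed without self-citations.
top = df[df["rank_pct_field"] <= 2.0].sort_values("c_score_ns", ascending=False)
print(top.head(10))
```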
https://creativecommons.org/publicdomain/zero/1.0/
Digitization of healthcare data, along with algorithmic breakthroughs in AI, will have a major impact on healthcare delivery in the coming years. It is interesting to see AI applied to assist clinicians during patient treatment in a privacy-preserving way. While scientific knowledge can help guide interventions, there remains a key need to quickly cut through the space of decision policies to find effective strategies to support patients during the care process.
Offline reinforcement learning (also referred to as safe or batch reinforcement learning) is a promising sub-field of RL which provides a mechanism for solving real-world sequential decision-making problems where access to a simulator is not available. Here we learn a policy from a fixed dataset of trajectories without further interaction with the environment (the agent doesn't receive a reward or punishment signal from the environment). It has been shown that such an approach can leverage vast amounts of existing logged data (in the form of previous interactions with the environment) and can outperform supervised learning approaches or heuristic-based policies for solving real-world decision-making problems. Offline RL algorithms, when trained on sufficiently large and diverse offline datasets, can produce close-to-optimal policies (with the ability to generalize beyond the training data).
As part of my PhD research, I investigated the problem of developing a Clinical Decision Support System for Sepsis Management using Offline Deep Reinforcement Learning.
MIMIC-III ('Medical Information Mart for Intensive Care') is a large open-access, anonymized, single-center database which consists of comprehensive clinical data of 61,532 critical care admissions from 2001–2012 collected at a Boston teaching hospital. The dataset consists of 47 features (including demographics, vitals, and lab test results) on a cohort of sepsis patients who meet the Sepsis-3 definition criteria.
We try to answer the following question:
Given a particular patient's characteristics and physiological information at each time step as input, can our deep RL approach learn an optimal treatment policy that prescribes the right intervention (e.g., use of a ventilator) to the patient at each stage of the treatment process, in order to improve the final outcome (e.g., patient mortality)?
We can use popular state-of-the-art algorithms such as Deep Q-Learning (DQN), Double Deep Q-Learning (DDQN), DDQN combined with BNC, Mixed Monte Carlo (MMC) and Persistent Advantage Learning (PAL). Using these methods we can train an RL policy to recommend an optimum treatment path for a given patient.
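For illustration only, the sketch below shows a single Double-DQN-style update on a batch of logged (state, action, reward, next state, done) transitions; it is not the implementation used in the linked repository, and the action-space size is an assumption:

```python
# Minimal Double-DQN update sketch for offline (logged) transitions.
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 47, 5, 0.99  # 47 features; action count is illustrative

q_net = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(), nn.Linear(128, N_ACTIONS))
target_net = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(), nn.Linear(128, N_ACTIONS))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)

def ddqn_update(s, a, r, s_next, done):
    """s, s_next: (B, STATE_DIM) floats; a: (B,) long; r, done: (B,) floats."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Double DQN: online net selects the next action, target net evaluates it.
        a_next = q_net(s_next).argmax(dim=1, keepdim=True)
        q_next = target_net(s_next).gather(1, a_next).squeeze(1)
        target = r + GAMMA * (1.0 - done) * q_next
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```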
Data acquisition, standard pre-processing and modelling details can be found in the GitHub repo: https://github.com/asjad99/MIMIC_RL_COACH
This dataset contains the results of a literature review of experimental nutrient addition studies to determine which nutrient forms were most often measured in the scientific literature. To obtain a representative selection of relevant studies, we searched Web of Science™ using a search string to target experimental studies in artificial and natural lotic systems while limiting irrelevant papers. We screened the titles and abstracts of returned papers for relevance (experimental studies in streams/stream mesocosms that manipulated nutrients). To supplement this search, we sorted the relevant articles from the Web of Science™ search alphabetically by author and sequentially examined the bibliographies for additional relevant articles (screening titles for relevance, and then screening abstracts of potentially relevant articles) until we had obtained a total of 100 articles. If we could not find a relevant article electronically, we moved to the next article in the bibliography. Our goal was not to be completely comprehensive, but to obtain a fairly large sample of published, peer-reviewed studies from which to assess patterns. We excluded any lentic or estuarine studies from consideration and included only studies that used mesocosms mimicking stream systems (flowing water or stream water source) or that manipulated nutrient concentrations in natural streams or rivers. We excluded studies that used nutrient diffusing substrate (NDS) because these manipulate nutrients on substrates and not in the water column. We also excluded studies examining only nutrient uptake, which rely on measuring dissolved nutrient concentrations with the goal of characterizing in-stream processing (e.g., Newbold et al., 1983). From the included studies, we extracted or summarized the following information: study type, study duration, nutrient treatments, nutrients measured, inclusion of TN and/or TP response to nutrient additions, and a description of how results were reported in relation to the research-management mismatch, if it existed.
Below is information on how the search was conducted.
Search string used for Web of Science advanced search (search conducted on 27 September 2016):
TS=(stream OR creek OR river* OR lotic OR brook OR headwater OR tributary) AND TS=(mesocosm OR flume OR "artificial stream" OR "experimental stream" OR "nutrient addition") AND TI=(nitrogen OR phosphorus OR nutrient OR enrichment OR fertilization OR eutrophication)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LScDC Word-Category RIG Matrix
April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com)
Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes

Getting Started
This file describes the Word-Category RIG Matrix for the Leicester Scientific Corpus (LSC) [1], the procedure to build the matrix, and introduces the Leicester Scientific Thesaurus (LScT) with its construction process. The Word-Category RIG Matrix is a 103,998 by 252 matrix, where rows correspond to words of the Leicester Scientific Dictionary-Core (LScDC) [2] and columns correspond to 252 Web of Science (WoS) categories [3, 4, 5]. Each entry in the matrix corresponds to a pair (category, word). Its value for the pair shows the Relative Information Gain (RIG) on the belonging of a text from the LSC to the category from observing the word in this text. The CSV file of the Word-Category RIG Matrix in the published archive is presented with two additional columns: the sum of RIGs in categories and the maximum of RIGs over categories (last two columns of the matrix). So, the file ‘Word-Category RIG Matrix.csv’ contains a total of 254 columns. This matrix is created to be used in future research on quantifying meaning in scientific texts, under the assumption that words have scientifically specific meanings in subject categories and that the meaning can be estimated by information gains from word to categories.
LScT (Leicester Scientific Thesaurus) is a scientific thesaurus of English. The thesaurus includes a list of 5,000 words from the LScDC. We order the words of LScDC by the sum of their RIGs in categories; that is, words are arranged by their informativeness in the scientific corpus LSC. Therefore, the meaningfulness of words is evaluated by the words’ average informativeness in the categories. We have decided to include the most informative 5,000 words in the scientific thesaurus.

Words as a Vector of Frequencies in WoS Categories
Each word of the LScDC is represented as a vector of frequencies in WoS categories. Given the collection of the LSC texts, each entry of the vector consists of the number of texts containing the word in the corresponding category. It is noteworthy that texts in a corpus do not necessarily belong to a single category, as they are likely to correspond to multidisciplinary studies, specifically in a corpus of scientific texts. In other words, categories may not be exclusive. There are 252 WoS categories and a text can be assigned to at least 1 and at most 6 categories in the LSC. Using the binary calculation of frequencies, we introduce the presence of a word in a category. We create a vector of frequencies for each word, where dimensions are categories in the corpus. The collection of vectors, with all words and categories in the entire corpus, can be shown in a table, where each entry corresponds to a pair (word, category). This table is built for the LScDC with 252 WoS categories and presented in the published archive with this file. The value of each entry in the table shows how many times a word of LScDC appears in a WoS category. The occurrence of a word in a category is determined by counting the number of the LSC texts containing the word in a category.
Words as a Vector of Relative Information Gains Extracted for Categories
In this section, we introduce our approach to representing a word as a vector of relative information gains for categories, under the assumption that the meaning of a word can be quantified by its information gain for categories. For each category, a function is defined on texts that takes the value 1 if the text belongs to the category, and 0 otherwise. For each word, a function is defined on texts that takes the value 1 if the word belongs to the text, and 0 otherwise. Consider the LSC as a probabilistic sample space (the space of equally probable elementary outcomes). For these Boolean random variables, the joint probability distribution, the entropy and the information gains are defined. The information gain about the category from the word is the amount of information on the belonging of a text from the LSC to the category from observing the word in the text [6]. We used the Relative Information Gain (RIG), which provides a normalised measure of the Information Gain. This makes it possible to compare information gains for different categories. The calculations of entropy, Information Gains and Relative Information Gains can be found in the README file in the published archive. Given a word, we created a vector where each component of the vector corresponds to a category. Therefore, each word is represented as a vector of relative information gains; the dimension of the vector for each word is the number of categories. The set of vectors is used to form the Word-Category RIG Matrix, in which each column corresponds to a category, each row corresponds to a word and each component is the relative information gain from the word to the category. In the Word-Category RIG Matrix, a row vector represents the corresponding word as a vector of RIGs in categories, while a column vector represents the RIGs of all words in an individual category. If we choose an arbitrary category, words can be ordered by their RIGs from the most informative to the least informative for the category. As well as ordering words within each category, words can be ordered by two criteria: the sum and the maximum of RIGs in categories. The top n words in such a list can be considered as the most informative words in the scientific texts. For a given word, the sum and maximum of RIGs are calculated from the Word-Category RIG Matrix. RIGs for each word of LScDC in 252 categories are calculated and vectors of words are formed. We then form the Word-Category RIG Matrix for the LSC. For each word, the sum (S) and maximum (M) of RIGs in categories are calculated and added at the end of the matrix (last two columns of the matrix). The Word-Category RIG Matrix for the LScDC with 252 categories, the sum of RIGs in categories and the maximum of RIGs over categories can be found in the database.

Leicester Scientific Thesaurus (LScT)
Leicester Scientific Thesaurus (LScT) is a list of 5,000 words from the LScDC [2]. Words of LScDC are sorted in descending order by the sum (S) of RIGs in categories and the top 5,000 words are selected to be included in the LScT. We consider these 5,000 words as the most meaningful words in the scientific corpus. In other words, the meaningfulness of words is evaluated by the words’ average informativeness in the categories, and the list of these words is considered as a ‘thesaurus’ for science. The LScT with the value of the sum can be found as a CSV file in the published archive.
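As an illustration of the RIG computation described above, here is a minimal sketch for a single (word, category) pair, treating the corpus as a space of equally probable texts; the exact formulas used for the matrix are given in the archive's README:

```python
# Minimal Relative Information Gain sketch for one (word, category) pair.
import numpy as np

def entropy(p):
    p = np.array([q for q in p if q > 0])
    return -np.sum(p * np.log2(p))

def rig(word_in_text, text_in_category):
    """Both arguments are binary arrays of length n_texts."""
    w = np.asarray(word_in_text, dtype=bool)
    c = np.asarray(text_in_category, dtype=bool)
    h_c = entropy([c.mean(), 1 - c.mean()])            # H(C)
    h_c_given_w = 0.0
    for val in (True, False):                          # condition on word present / absent
        mask = (w == val)
        if mask.any():
            p_c = c[mask].mean()
            h_c_given_w += mask.mean() * entropy([p_c, 1 - p_c])
    return (h_c - h_c_given_w) / h_c if h_c > 0 else 0.0
```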
The published archive contains the following files:
1) Word_Category_RIG_Matrix.csv: A 103,998 by 254 matrix where columns are 252 WoS categories, the sum (S) and the maximum (M) of RIGs in categories (last two columns of the matrix), and rows are words of LScDC. Each entry in the first 252 columns is the RIG from the word to the category. Words are ordered as in the LScDC.
2) Word_Category_Frequency_Matrix.csv: A 103,998 by 252 matrix where columns are 252 WoS categories and rows are words of LScDC. Each entry of the matrix is the number of texts containing the word in the corresponding category. Words are ordered as in the LScDC.
3) LScT.csv: List of words of LScT with sum (S) values.
4) Text_No_in_Cat.csv: The number of texts in categories.
5) Categories_in_Documents.csv: List of WoS categories for each document of the LSC.
6) README.txt: Description of the Word-Category RIG Matrix, the Word-Category Frequency Matrix and LScT, and the forming procedures.
7) README.pdf (same as 6, in PDF format)
References
[1] Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
[2] Suzen, Neslihan (2019): LScDC (Leicester Scientific Dictionary-Core). figshare. Dataset. https://doi.org/10.25392/leicester.data.9896579.v3
[3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[4] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
[5] Suzen, N., Mirkes, E. M., & Gorban, A. N. (2019). LScDC-new large scientific dictionary. arXiv preprint arXiv:1912.06858.
[6] Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379-423.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The theoretical foundations of Big Data Science are not yet fully developed. This study proposes a new scalable framework for Big Data representation, high-throughput analytics (variable selection and noise reduction), and model-free inference. Specifically, we explore the core principles of distribution-free and model-agnostic methods for scientific inference based on Big Data sets. Compressive Big Data analytics (CBDA) iteratively generates random (sub)samples from a big and complex dataset. This subsampling with replacement is conducted on the feature and case levels and results in samples that are not necessarily consistent or congruent across iterations. The approach relies on an ensemble predictor where established model-based or model-free inference techniques are iteratively applied to preprocessed and harmonized samples. Repeating the subsampling and prediction steps many times yields derived likelihoods, probabilities, or parameter estimates, which can be used to assess the algorithm's reliability and the accuracy of findings via bootstrapping methods, or to extract important features via controlled variable selection. CBDA provides a scalable algorithm for addressing some of the challenges associated with handling complex, incongruent, incomplete and multi-source data and analytics. Albeit not fully developed yet, a CBDA mathematical framework will enable the study of the ergodic properties and the asymptotics of the specific statistical inference approaches via CBDA. We implemented the high-throughput CBDA method using pure R as well as via the graphical pipeline environment. To validate the technique, we used several simulated datasets as well as a real neuroimaging-genetics of Alzheimer’s disease case-study. The CBDA approach may be customized to provide generic representation of complex multimodal datasets and to provide stable scientific inference for large, incomplete, and multisource datasets.
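A minimal sketch of the CBDA-style iteration is shown below; the base learner and the feature-selection rule are placeholders rather than the published pipeline:

```python
# CBDA-style loop sketch: repeatedly subsample cases and features, fit a base
# learner, and tally which features keep getting selected across iterations.
import numpy as np
from sklearn.linear_model import LogisticRegression

def cbda_feature_votes(X, y, n_iter=200, case_frac=0.5, feat_frac=0.1, seed=0):
    """X: numpy array (n, p); y: binary labels (n,). Returns per-feature selection frequency."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    votes = np.zeros(p)
    for _ in range(n_iter):
        rows = rng.choice(n, size=int(case_frac * n), replace=True)   # cases, with replacement
        cols = rng.choice(p, size=int(feat_frac * p), replace=False)  # a small feature subset
        model = LogisticRegression(max_iter=200).fit(X[np.ix_(rows, cols)], y[rows])
        # Vote for features whose coefficients are comparatively large in this sample
        # (an illustrative selection rule, not the one from the paper).
        strong = np.abs(model.coef_[0]) > np.abs(model.coef_[0]).mean()
        votes[cols[strong]] += 1
    return votes / n_iter
```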
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Supernovae are dying stars whose light curves change in certain special patterns, which helps astronomers infer the expansion coefficient of the Universe. A series of sky surveys were launched in search of supernovae and generated a tremendous amount of data, which pushed astronomy into a new era of big data. While traditional machine learning methods perform well on such data, deep learning methods such as convolutional neural networks demonstrate more powerful adaptability for big data in this area. However, most data in the existing works are either simulated or lack generality. To address these problems, we collected and sorted all the known objects of the Pan-STARRS and the Popular Supernova Project (PSP), produced two datasets, and then compared the YOLOv3 and the FCOS algorithms on them.
SILO (Scientific Information for Land Owners) is a database of Australian climate data from 1889 (current to yesterday). It provides daily datasets for a range of climate variables in ready-to-use formats suitable for research and climate applications. SILO products provide national coverage with interpolated infills for missing data, which allows you to focus on your research or model development without the burden of data preparation.
SILO is hosted by the Science and Technology Division of the Queensland Government's Department of Environment and Science (DES). The datasets are constructed from observational data obtained from the Australian Bureau of Meteorology.
ObjectNet is free to use for both research and commercial
applications. The authors own the source images and allow their use
under a license derived from Creative Commons Attribution 4.0 with
two additional clauses:
1. ObjectNet may never be used to tune the parameters of any
model. This includes, but is not limited to, computing statistics
on ObjectNet and including those statistics into a model,
fine-tuning on ObjectNet, performing gradient updates on any
parameters based on these images.
2. Any individual images from ObjectNet may only be posted to the web
including their 1 pixel red border.
If you post this archive in a public location, please leave the password
intact as "objectnetisatestset".
[Other General License Information Conforms to Attribution 4.0 International]
This is Part 4 of 10 * Original Paper Link * ObjectNet Website
The links to the various parts of the dataset are:
https://objectnet.dev/images/objectnet_controls_table.png
https://objectnet.dev/images/objectnet_results.png
ObjectNet is a large real-world test set for object recognition with controls where object backgrounds, rotations, and imaging viewpoints are random.
Most scientific experiments have controls, confounds which are removed from the data, to ensure that subjects cannot perform a task by exploiting trivial correlations in the data. Historically, large machine learning and computer vision datasets have lacked such controls. This has resulted in models that must be fine-tuned for new datasets and perform better on datasets than in real-world applications. When tested on ObjectNet, object detectors show a 40-45% drop in performance, with respect to their performance on other benchmarks, due to the controls for biases. Controls make ObjectNet robust to fine-tuning showing only small performance increases.
We develop a highly automated platform that enables gathering datasets with controls by crowdsourcing image capturing and annotation. ObjectNet is the same size as the ImageNet test set (50,000 images), and by design does not come paired with a training set in order to encourage generalization. The dataset is both easier than ImageNet – objects are largely centred and unoccluded – and harder, due to the controls. Although we focus on object recognition here, data with controls can be gathered at scale using automated tools throughout machine learning to generate datasets that exercise models in new ways thus providing valuable feedback to researchers. This work opens up new avenues for research in generalizable, robust, and more human-like computer vision and in creating datasets where results are predictive of real-world performance.
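A minimal, evaluation-only sketch (no tuning, in line with the license above) is shown below: it loads an ObjectNet image, crops away the red border, and classifies it with a pretrained torchvision model; the border width and the mapping from ImageNet to ObjectNet classes are assumptions to verify against the official documentation:

```python
# Evaluation-only sketch; the pretrained model is not tuned on ObjectNet.
import torch
from PIL import Image
from torchvision import models, transforms

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2).eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = Image.open("objectnet_example.png").convert("RGB")  # hypothetical filename
w, h = img.size
img = img.crop((2, 2, w - 2, h - 2))  # strip the red border (width assumed here)

with torch.no_grad():
    logits = model(preprocess(img).unsqueeze(0))
print(logits.argmax(dim=1).item())  # ImageNet class index; map to ObjectNet classes separately
```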
...
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Artificial intelligence (AI) is transforming research in metal–organic frameworks (MOFs), where models trained on structured computational data routinely predict new materials and optimize their properties. This raises a central question: What if we could leverage the full breadth of MOF knowledge, not just structured data sets, but also the scientific literature? For researchers, the literature remains the primary source of knowledge, yet much of its content, including experimental data and expert insight, remains underutilized by AI systems. We introduce MOF-ChemUnity, a structured, extensible, and scalable knowledge graph that unifies MOF data by linking literature-derived insights to crystal structures and computational data sets. By disambiguating MOF names in the literature and connecting them to crystal structures in the Cambridge Structural Database, MOF-ChemUnity unifies experimental and computational sources and enables cross-document knowledge extraction and linking. We showcase how this enables multiproperty machine learning across simulated and experimental data, compilation of complete synthesis records for individual compounds by aggregating information across multiple publications, and expert-guided materials recommendations via structure-based machine learning descriptors for pore geometry and chemistry. When used as a knowledge source to augment large language models (LLMs), MOF-ChemUnity enables a literature-informed AI assistant that operates over the full scope of MOF knowledge. Expert evaluations show improved accuracy, interpretability, and trustworthiness across tasks such as retrieval, inference of structure–property relationships, and materials recommendation, outperforming standard LLMs. This work lays the foundation for literature-informed materials discovery, enabling both scientists and AI systems to reason over the full existing knowledge in a new way.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The PMC-Patients dataset is a pioneering resource designed for developing and benchmarking Retrieval-based Clinical Decision Support (ReCDS) systems. It comprises 167,000 patient summaries extracted from case reports in PubMed Central (PMC), along with 3.1 million patient-article relevance annotations and 293,000 patient-patient similarity annotations defined by the PubMed citation graph. This dataset is invaluable for advancing research in clinical decision support and patient information retrieval.
Dataset Details:
meta_data/PMC-Patients_citations.json
Usage:
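A minimal usage sketch for inspecting the citation metadata file named above is given below; the JSON structure is an assumption to verify against the dataset documentation:

```python
# Minimal inspection sketch; the file's internal structure is not assumed beyond valid JSON.
import json

with open("meta_data/PMC-Patients_citations.json") as f:
    citations = json.load(f)

print(type(citations))                # inspect the top-level container first
if isinstance(citations, dict):
    first_key = next(iter(citations))
    print(first_key, citations[first_key])
```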
Citation:
If you use this dataset in your research, please cite the following paper:
@article{Zhao2023ALD,
title={A large-scale dataset of patient summaries for retrieval-based clinical decision support systems.},
author={Zhengyun Zhao and Qiao Jin and Fangyuan Chen and Tuorui Peng and Sheng Yu},
journal={Scientific data},
year={2023},
volume={10 1},
pages={909},
url={https://api.semanticscholar.org/CorpusID:266360591}
}
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description: This is a Spearman Correlation Heatmap of the 32 features used for machine learning and deep learning models in cybersecurity. The diagonal cells are perfect self-correlations (value = 1) and the off-diagonal cells are pairwise correlations between features. Because no feature pairs show strong correlations (close to 1 or -1), the redundant or irrelevant features have already been removed, so each selected feature brings unique and independent information to the model. Feature selection is key in building cyber intrusion detection systems, as it reduces computational overhead, simplifies the model, and improves accuracy and robustness. This is part of a systematic feature engineering process to optimize datasets for anomaly detection, network traffic analysis and intrusion detection. Researchers in AI for cybersecurity can use this to build more interpretable and efficient models for detecting intrusions in large-scale networks. This figure shows the importance of correlation analysis for high-dimensional datasets and contributes to cybersecurity, data science and machine learning research.
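The sketch below illustrates the correlation check described above: it computes the Spearman matrix over the feature columns and flags pairs whose absolute correlation exceeds a threshold; the filename and threshold are illustrative, not taken from the paper:

```python
# Spearman correlation check sketch; "network_traffic_features.csv" is a placeholder.
import pandas as pd

df = pd.read_csv("network_traffic_features.csv")
corr = df.corr(method="spearman")

threshold = 0.9  # illustrative cutoff for "strong" correlation
redundant_pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(redundant_pairs)  # an empty list means each feature carries largely independent information

# Optional heatmap, e.g. with seaborn: sns.heatmap(corr, cmap="coolwarm", center=0)
```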
Why It Matters: Reduces overfitting in machine learning models. Improves computational efficiency for large-scale datasets. Enhances feature interpretability for robust cybersecurity solutions.
Keywords: Spearman Correlation Heatmap, Feature Selection, Intrusion Detection System, Cybersecurity, Machine Learning, Deep Learning, Anomaly Detection, Network Traffic Analysis, Artificial Intelligence in Cybersecurity, Dataset Optimization, Feature Engineering for Cyber Threats
References: This file pertains to our research study, which has been accepted for publication in the Scientific and Technical Journal of Information Technologies, Mechanics and Optics. The study is titled: "Enhancing and Extending CatBoost for Accurate Detection and Classification of DoS and DDoS Attack Subtypes in Network Traffic."
https://doi.org/10.1109/ICSIP61881.2024.10671552 https://doi.org/10.24143/2072-9502-2024-3-65-74
This report provides a strategy to ensure that digital scientific data can be reliably preserved for maximum use in catalyzing progress in science and society. Empowered by an array of new digital technologies, science in the 21st century will be conducted in a fully digital world. In this world, the power of digital information to catalyze progress is limited only by the power of the human mind. Data are not consumed by the ideas and innovations they spark but are an endless fuel for creativity. A few bits, well found, can drive a giant leap of creativity. The power of a data set is amplified by ingenuity through applications unimagined by the authors and distant from the original field...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
An LLM-based automated pipeline was created (see referenced article) for extracting synthesis procedures of metal-organic polyhedra, structuring the information in ontology-based triples, and integrating them with pre-existing data in a knowledge graph.
This data set includes the underlying ontology for general synthesis procedures as well as the structured and integrated synthesis information extracted from 75 selected scientific papers. Moreover, two smaller datasets containing synthesis information from 9 papers (one automatically extracted and one manually curated) are included; these were used for verification purposes and for calculating performance metrics.
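For illustration, the sketch below structures one extracted synthesis step as ontology-based triples with rdflib; the namespace and the class and property names are placeholders, not the ontology published in this data set:

```python
# Triple-structuring sketch; ONTO and its terms are hypothetical, not the published ontology.
from rdflib import Graph, Literal, Namespace, RDF

ONTO = Namespace("https://example.org/synthesis-ontology#")   # placeholder namespace
g = Graph()
g.bind("onto", ONTO)

procedure = ONTO["Procedure_MOP1"]
step = ONTO["Step_1"]
g.add((procedure, RDF.type, ONTO.SynthesisProcedure))
g.add((procedure, ONTO.hasStep, step))
g.add((step, RDF.type, ONTO.AddingChemical))
g.add((step, ONTO.usesChemical, Literal("Cu(NO3)2·3H2O")))
g.add((step, ONTO.amount, Literal("0.5 mmol")))

print(g.serialize(format="turtle"))
```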
https://vocab.nerc.ac.uk/collection/L08/current/LI/
Observed metocean data, analyses and climate studies provide the oil and gas industry with essential information and knowledge for the design and engineering of offshore installations, such as production platforms and pipelines, and for assessing workability conditions. In addition, the information is used to support the planning of, for example, diving operations and the installation of pipelines, and the forecasting of storms and heavy weather conditions, which might require timely evacuation or other safety measures to be taken during the operation of offshore installations.
To support these activities, and to complement metocean data and information, that can be retrieved from public sources, major oil and gas companies are active in monitoring and collecting metocean data themselves. This is done all over the world and over many years the oil and gas companies have acquired together a large volume of data sets. These data sets are acquired, processed and analysed both in joint industry projects (JIPs) as well as in field surveys and monitoring activities performed for individual companies. Often these data sets are acquired at substantial cost and in remote areas.
These datasets are managed by the metocean departments of the oil and gas companies and stored in various formats and are only exchanged on a limited scale between companies. Despite various industry co-operative joint projects, there is not yet a common awareness of available data sets and no systematic indexing and archival of these data sets within the industry. Furthermore there is only limited reporting and access to these data sets and results of field studies for other parties, in particular the scientific community.
Opening up these data sets for further use will provide favourable conditions for creating highly valuable extra knowledge of both local and regional ocean and marine systems. To stimulate and support a wider application of these industry metocean datasets a System of Industry Metocean data for the Offshore and Research Communities (SIMORC) has been established within the framework of the SIMORC project. The SIMORC project is co-funded by the European Commission for a 2 year project period starting 1st June 2005.
The SIMORC system consists of an index metadatabase and a database of actual data sets that together are accessible through the Internet. The index metadatabase is public domain, while access to data is regulated by a dedicated SIMORC Data Protocol. This contains rules for access and use of data sets by scientific users, by oil & gas companies, and by third parties. All metocean data sets in the database have undergone quality control and conversion to unified formats, resulting in consistent and high quality, harmonized data sets.
SIMORC is a unique and challenging development, undertaken by major ocean data management specialists: MARIS (NL) coordinator and operator, BODC (UK) and IOC-IODE (UNESCO), and the International Association of Oil & Gas Producers (OGP), involving participation of major oil & gas companies that bring in their considerable data sets.
The objective is to expand the coverage of the SIMORC database with regular submissions of major oil & gas companies, while the SIMORC service will be operated as part of OGP services.
ccPDB (Compilation and Creation of datasets from PDB) is designed to provide a service to the scientific community working in the field of function or structure annotation of proteins. This database of datasets is based on the Protein Data Bank (PDB); all datasets were derived from PDB. ccPDB has four modules: i) compilation of datasets, ii) creation of datasets, iii) web services and iv) important links.
* Compilation of Datasets: Datasets at ccPDB can be classified in two categories, i) datasets collected from literature and ii) datasets compiled from PDB. We are in the process of collecting PDB datasets from the literature and maintaining them at ccPDB. We also invite the community to suggest datasets. In addition, we generate datasets from PDB; these datasets were generated using commonly used standard protocols like non-redundant chains and structures solved at high resolution.
* Creation of Datasets: This module was developed for creating customized datasets, where a user can create a dataset using his/her own conditions from PDB. This module will be useful for users who wish to create a new dataset as per their own requirements. The module has six steps, which are described in the help page.
* Web Services: We integrated the following web services in ccPDB: i) the Analyze PDB ID service allows users to submit their PDB to around 40 servers from a single point, ii) BLAST search allows users to perform a BLAST search of their protein against PDB, iii) the Structural information service is designed for annotating a protein structure from a PDB ID, iv) Search in PDB facilitates searching structures in PDB, v) the Generate patterns service provides the facility to generate different types of patterns required for machine learning techniques and vi) Download useful information allows users to download various types of information for a given set of proteins (PDB IDs).
* Important Links: One of the major objectives of this web site is to provide links to web servers related to functional annotation of proteins. In the first phase we have collected and compiled these links in different categories. In future, an attempt will be made to collect as many links as possible.
The Enhanced Microsoft Academic Knowledge Graph (EMAKG) is a large dataset of scientific publications and related entities, including authors, institutions, journals, conferences, and fields of study. The proposed dataset originates from the Microsoft Academic Knowledge Graph (MAKG), one of the most extensive freely available knowledge graphs of scholarly data. To build the dataset, we first assessed the limitations of the current MAKG. Then, based on these, several methods were designed to enhance the data and expand the number of use case scenarios, particularly in mobility and network analysis. EMAKG provides two main advantages: it has improved usability, facilitating access for non-expert users, and it includes an increased number of types of information, obtained by integrating various datasets and sources, which helps expand the application domains. For instance, geographical information could help mobility and migration research. The completeness of the knowledge graph is improved by retrieving and merging information on publications and other entities no longer available in the latest version of MAKG. Furthermore, geographical and collaboration network details are employed to provide data on authors as well as their annual locations and career nationalities, together with worldwide yearly stocks and flows. Among others, the dataset also includes: fields of study (and publications) labelled by their discipline(s); abstracts and linguistic features, i.e., standard language codes, tokens, and types; entities' general information, e.g., date of foundation and type of institutions; and academia-related metrics, i.e., h-index. The resulting dataset maintains all the characteristics of the parent datasets and includes a set of additional subsets and data that can be used for new case studies relating to network analysis, knowledge exchange, linguistics, computational linguistics, and mobility and human migration, among others.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Finding articles related to a publication of interest remains a challenge in the Life Sciences domain as the number of scientific publications grows day by day. Publication repositories such as PubMed and Elsevier provide a list of similar articles. There, similarity is commonly calculated based on the title, abstract and some keywords assigned to articles. Here we present the datasets and algorithms used in Biolinks. Biolinks uses ontological concepts extracted from publications and makes it possible to calculate a distribution score according to semantic groups, as well as a semantic similarity based on either all identified annotations or narrowed to one or more particular semantic groups. Biolinks supports both title-and-abstract-only and full-text articles.
Materials: In a previous work [1], 4,240 articles from the TREC-05 collection [2] were selected. The titles and abstracts of those 4,240 articles were annotated with Unified Medical Language System (UMLS) concepts; such annotations are referred to as our TA-dataset and correspond to the JSON files under the pubmed folder in the JSON-LD.zip file. Of those 4,240 articles, full-text was available for only 62. The title-and-abstract annotations for those 62 articles, the TAFT-dataset, are located under the pubmed-pmc folder in the JSON-LD.zip file, which also contains the full-text annotations under the folder pmc, the FT-dataset. The list corresponding to articles with title-and-abstract is found in the genomics.qrels.large.pubmed.onlyRelevants.titleAndAbstract.tsv file, while those with full-text are recorded in the genomics.qrels.large.pmc.onlyRelevants.fullContent.tsv file.
Here we include the annotations on title and abstract as well as those for full-text for all our datasets (profiles.zip). We also provide the global similarity matrices (similarity.zip).
Methods: The TA-dataset was used to calculate the Information Gain (IG) according to the UMLS semantic groups, see IG_umls_groups.PMID.xlsx. A new grouping is proposed for Biolinks, see biolinks_groups.tsv. The IG was calculated for the Biolinks groups as well, see IG_biolinks_groups.PMID.xlsx, showing an improvement of around 5%.
In order to assess the similarity metric regarding the cohesion of TREC-05 groups, we used Silhouette Coefficient analyses. An additional dataset, the Stem-TAFT-dataset, was used and compared to the TAFT and FT datasets.
Biolinks groups were used to calculate a semantic group distribution score for each article in all our datasets. A semantic similarity metric based on PubMed related articles [3] is also provided; the Biolinks groups can be used to narrow the similarity to one or more selected groups. All the corresponding algorithms are open-access and available on GitHub under the Apache-2.0 license; a frozen version, biotea-io-parser-master.zip, is provided here. In order to facilitate the analysis of our datasets based on the annotations as well as the distribution and similarity scores, some web-based visualization components were created. All of them are open-access and available on GitHub under the Apache-2.0 license; frozen versions are provided here, see files biotea-vis-annotation-master.zip, biotea-vis-similarity-master.zip, biotea-vis-tooltip-master.zip and biotea-vis-topicDistribution-master.zip. These components are brought together by biotea-vis-biolinks-master.zip. A demo is provided at http://ljgarcia.github.io/biotea-biolinks/; this demo was built on top of GitHub Pages, and a frozen version of the gh-pages branch is provided here, see biotea-biolinks-gh-pages.zip.
Conclusions: Biolinks assigns a weight to each semantic group based on the annotations extracted from either title-and-abstract or full-text articles. It also measures similarity for a pair of documents using the semantic information. The distribution and similarity metrics can be narrowed to a subset of the semantic groups, enabling researchers to focus on what is more relevant to them.
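As an illustration of group-restricted similarity, the sketch below computes a cosine similarity over concept counts limited to selected semantic groups; it illustrates the idea of narrowing similarity to chosen groups rather than reproducing the exact Biolinks metric:

```python
# Group-restricted similarity sketch; concept IDs and group labels are illustrative.
from collections import Counter
from math import sqrt

def group_similarity(annotations_a, annotations_b, groups=None):
    """annotations_*: list of (concept_id, semantic_group) pairs per article."""
    def profile(annotations):
        return Counter(c for c, g in annotations if groups is None or g in groups)
    pa, pb = profile(annotations_a), profile(annotations_b)
    dot = sum(pa[c] * pb[c] for c in set(pa) & set(pb))
    norm = sqrt(sum(v * v for v in pa.values())) * sqrt(sum(v * v for v in pb.values()))
    return dot / norm if norm else 0.0

a = [("C0027651", "DISO"), ("C0006826", "DISO"), ("C0017337", "GENE")]
b = [("C0006826", "DISO"), ("C0017337", "GENE"), ("C0017337", "GENE")]
print(group_similarity(a, b, groups={"DISO"}))  # similarity restricted to one semantic group
```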
[1] Garcia Castro, L.J., R. Berlanga, and A. Garcia, In the pursuit of a semantic similarity metric based on UMLS annotations for articles in PubMed Central Open Access. Journal of Biomedical Informatics, 2015. 57: p. 204-218
[2] Text Retrieval Conference 2005 - Genomics Track. TREC-05 Genomics Track ad hoc relevance judgement. 2005 [cited 2016 23rd August]; Available from: http://trec.nist.gov/data/genomics/05/genomics.qrels.large.txt
[3] Lin, J. and W.J. Wilbur, PubMed related articles: a probabilistic topic-based model for content similarity. BMC Bioinformatics, 2007. 8(1): p. 423