26 datasets found
  1. Additional file 1 of Learning from biomedical linked data to suggest valid...

    • springernature.figshare.com
    • datasetcatalog.nlm.nih.gov
    txt
    Updated Jun 1, 2023
    Cite
    Kevin Dalleau; Yassine Marzougui; SĂŠbastien Da Silva; Patrice Ringot; Ndeye Coumba Ndiaye; Adrien Coulet (2023). Additional file 1 of Learning from biomedical linked data to suggest valid pharmacogenes [Dataset]. http://doi.org/10.6084/m9.figshare.c.3747806_D1.v1
    Available download formats: txt
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Kevin Dalleau; Yassine Marzougui; SĂŠbastien Da Silva; Patrice Ringot; Ndeye Coumba Ndiaye; Adrien Coulet
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SPARQL query example 1. This text file contains the SPARQL query we apply to our PGx linked data to obtain the data graph represented in Fig. 3. The query includes the definition of the prefixes mentioned in Figs. 2 and 3, and takes about 30 s on our https://pgxlod.loria.fr server. (TXT 2 kb)

  2. Data from: Historical Data Mining Deep Dive into Machine Learning-Aided 2D...

    • datasetcatalog.nlm.nih.gov
    • acs.figshare.com
    Updated Jun 23, 2025
    Cite
    Chavalekvirat, Panwad; Chuang, Ho-Chiao; Deepaisarn, Somrudee; Deshsorn, Krittapong; Iamprasertkun, Pawin (2025). Historical Data Mining Deep Dive into Machine Learning-Aided 2D Materials Research in Electrochemical Applications [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0002054578
    Dataset updated
    Jun 23, 2025
    Authors
    Chavalekvirat, Panwad; Chuang, Ho-Chiao; Deepaisarn, Somrudee; Deshsorn, Krittapong; Iamprasertkun, Pawin
    Description

    Machine learning transforms the landscape of 2D materials design, particularly in accelerating discovery, optimization, and screening processes. This review has delved into the historical and ongoing integration of machine learning in 2D materials for electrochemical energy applications, using the Knowledge Discovery in Databases (KDD) approach to guide the research through data mining from the Scopus database using analysis of citations, keywords, and trends. The topics will first focus on a “macro” scope, where hundreds of literature reports are computer analyzed for key insights, such as year analysis, publication origin, and word co-occurrence using heat maps and network graphs. Afterward, the focus will be narrowed down into a more specific “micro” scope obtained from the “macro” overview, which is intended to dive deep into machine learning usage. From the gathered insights, this work highlights how machine learning, density functional theory (DFT), and traditional experimentation are jointly advancing the field of materials science. Overall, the resulting review offers a comprehensive analysis, touching on essential applications such as batteries, fuel cells, supercapacitors, and synthesis processes while showcasing machine learning techniques that enhance the identification of critical material properties.

  3. Data Mining and Unsupervised Machine Learning in Canadian In Situ Oil Sands...

    • data.mendeley.com
    Updated Feb 10, 2021
    + more versions
    Cite
    Minxing Si (2021). Data Mining and Unsupervised Machine Learning in Canadian In Situ Oil Sands Database for Knowledge Discovery and Carbon Cost Analysis [Dataset]. http://doi.org/10.17632/8ngkgz69zb.4
    Dataset updated
    Feb 10, 2021
    Authors
    Minxing Si
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A better understanding of greenhouse gas (GHG) emissions resulting from oil sands (bitumen) extraction can help to meet global oil demands, identify potential mitigation measures, and design effective carbon policies. While several studies have attempted to model GHG emissions from oil sands extractions, these studies have encountered data availability challenges, particularly with respect to actual fuel use data, and have thus struggled to accurately quantify GHG emissions. This dataset contains actual operational data from 20 in-situ oil sands operations, including information for fuel gas, flare gas, vented gas, production, steam injection, gas injection, condensate injection, and C3 injection.

  4. Tiselac

    • huggingface.co
    Updated Jan 22, 2025
    Cite
    Monash Scalable Time Series Evaluation Repository (2025). Tiselac [Dataset]. https://huggingface.co/datasets/monster-monash/Tiselac
    Dataset updated
    Jan 22, 2025
    Dataset authored and provided by
    Monash Scalable Time Series Evaluation Repository
    License

    https://choosealicense.com/licenses/other/

    Description

    Part of MONSTER: https://arxiv.org/abs/2502.15122.

    Tiselac

    Category: Satellite
    Num. Examples: 99,687
    Num. Channels: 10
    Length: 23
    Sampling Freq.: 16 days
    Num. Classes: 9
    License: Other
    Citations: [1] [2]

    TiSeLaC (Time Series Land Cover Classification) was created for the time series land cover classification challenge held in conjunction with the 2017 European Conference on Machine Learning & Principles and Practice of Knowledge Discovery in Databases [1]. It was… See the full description on the dataset page: https://huggingface.co/datasets/monster-monash/Tiselac.

  5. Key Characteristics of Algorithms' Dynamics Beyond Accuracy - Evaluation...

    • b2find.eudat.eu
    Updated Aug 17, 2025
    + more versions
    Cite
    (2025). Key Characteristics of Algorithms' Dynamics Beyond Accuracy - Evaluation Tests (v2) - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/3524622d-2099-554c-826a-f2155c3f4bb4
    Dataset updated
    Aug 17, 2025
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Key Characteristics of Algorithms' Dynamics Beyond Accuracy - Evaluation Tests (v2), conducted for the paper: "What do anomaly scores actually mean? Key characteristics of algorithms' dynamics beyond accuracy" by F. Iglesias, H. O. Marques, A. Zimek, T. Zseby.

    Context and methodology: Anomaly detection is intrinsic to a large number of data analysis applications today. Most of the algorithms used assign an outlierness score to each instance before establishing anomalies in a binary form. The experiments in this repository study how different algorithms generate different dynamics in the outlierness scores and react in very different ways to possible model perturbations that affect data. The study elaborated in the referred paper presents new indices and coefficients to assess these dynamics and explores the responses of the algorithms as a function of variations in these indices, revealing key aspects of the interdependence between algorithms, data geometries, and the ability to discriminate anomalies. Therefore, this repository reproduces the conducted experiments, which study eight algorithms (ABOD, HBOS, iForest, K-NN, LOF, OCSVM, SDO and GLOSH) submitted to seven perturbations related to cardinality, dimensionality, outlier proportion, inlier-outlier density ratio, density layers, clusters, and local outliers, and collects behavioural profiles with eleven measurements (Adjusted Average Precision, ROC-AUC, Perini's Confidence [1], Perini's Stability [2], S-curves, Discriminant Power, Robust Coefficients of Variation for Inliers and Outliers, Coherence, Bias and Robustness) under two types of normalization: linear and Gaussian, the latter aiming to standardize the outlierness scores issued by different algorithms [3]. This repository is framed within research on the following domains: algorithm evaluation, outlier detection, anomaly detection, unsupervised learning, machine learning, data mining, data analysis. Datasets and algorithms can be used for experiment replication and for further evaluation and comparison.

    References:
    [1] Perini, L., Vercruyssen, V., Davis, J.: Quantifying the confidence of anomaly detectors in their example-wise predictions. In: The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Springer Verlag (2020).
    [2] Perini, L., Galvin, C., Vercruyssen, V.: A Ranking Stability Measure for Quantifying the Robustness of Anomaly Detection Methods. In: 2nd Workshop on Evaluation and Experimental Design in Data Mining and Machine Learning @ ECML/PKDD (2020).
    [3] Kriegel, H.-P., Kröger, P., Schubert, E., Zimek, A.: Interpreting and unifying outlier scores. In: Proceedings of the 2011 SIAM International Conference on Data Mining (SDM), pp. 13-24 (2011).

    Technical details: Experiments were tested with Python 3.9.6. The provided scripts generate all synthetic data and results; these are also kept in the repository for the sake of comparability and replicability ("outputs.zip"). The file and folder structure is as follows:
    • "compare_scores_group.py": extracts the new dynamic indices proposed in the paper.
    • "generate_data.py": generates the datasets used for evaluation.
    • "latex_table.py": shows results in a LaTeX table format.
    • "merge_indices.py": merges accuracy and dynamic indices into the same table-structured summary.
    • "metric_corr.py": calculates correlation estimates between indices.
    • "outdet.py": runs outlier detection with different algorithms on diverse datasets.
    • "perini_tests.py": runs Perini's confidence and stability on all datasets and algorithms' performances.
    • "scatterplots.py": generates scatter plots comparing accuracy and dynamic performances.
    • "README.md": provides explanations and step-by-step instructions for replication.
    • "requirements.txt": references the required Python libraries and versions.
    • "outputs.zip": contains all result tables, plots and synthetic data generated with the scripts.
    • [data/real_data]: contains CSV versions of the Wilt, Shuttle, Waveform and Cardiotocography datasets (inherited and adapted from the LMU repository).

    License: The CC-BY license applies to all data generated with the "generate_data.py" script. All distributed code is under the GNU GPL license. For the "ExCeeD.py" and "stability.py" scripts, please consult and refer to the original sources provided above.
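    The two score normalizations mentioned in the description, linear (min-max) and Gaussian scaling in the spirit of Kriegel et al. [3], can be sketched in a few lines of Python. The function names and the toy scores below are illustrative only and are not part of the repository's scripts:

    ```python
    import math
    from statistics import mean, stdev

    def linear_norm(scores):
        """Min-max scaling of raw outlierness scores into [0, 1]."""
        lo, hi = min(scores), max(scores)
        return [(s - lo) / (hi - lo) for s in scores]

    def gaussian_norm(scores):
        """Gaussian scaling: treat scores as roughly normal and map each to
        max(0, erf((s - mu) / (sigma * sqrt(2)))), so only scores above the
        mean receive a positive, probability-like outlierness value."""
        mu, sigma = mean(scores), stdev(scores)
        return [max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2)))) for s in scores]

    scores = [0.1, 0.2, 0.15, 0.18, 0.9]  # one clear outlier at the end
    print(linear_norm(scores))
    print(gaussian_norm(scores))
    ```

    The point of the Gaussian variant is that scores from different algorithms, which live on incomparable raw scales, end up on a shared, roughly probabilistic scale.
    
    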

  6. datasheet1_Q-Finder: An Algorithm for Credible Subgroup Discovery in...

    • frontiersin.figshare.com
    pdf
    Updated Jun 4, 2023
    Cite
    Cyril Esnault; May-Line Gadonna; Maxence Queyrel; Alexandre Templier; Jean-Daniel Zucker (2023). datasheet1_Q-Finder: An Algorithm for Credible Subgroup Discovery in Clinical Data Analysis — An Application to the International Diabetes Management Practice Study.pdf [Dataset]. http://doi.org/10.3389/frai.2020.559927.s001
    Available download formats: pdf
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    Frontiers
    Authors
    Cyril Esnault; May-Line Gadonna; Maxence Queyrel; Alexandre Templier; Jean-Daniel Zucker
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Addressing the heterogeneity of both the outcome of a disease and the treatment response to an intervention is a mandatory pathway for regulatory approval of medicines. In randomized clinical trials (RCTs), confirmatory subgroup analyses focus on the assessment of drugs in predefined subgroups, while exploratory ones allow the a posteriori identification of subsets of patients who respond differently. Within the latter area, the subgroup discovery (SD) data mining approach is widely used, particularly in precision medicine, to evaluate treatment effects across different groups of patients from various data sources (be it clinical trials or real-world data). However, both the limited consideration by standard SD algorithms of the recommended criteria for defining credible subgroups and the lack of statistical power of the findings after correcting for multiple testing hinder the generation of hypotheses and their acceptance by healthcare authorities and practitioners. In this paper, we present the Q-Finder algorithm, which aims to generate statistically credible subgroups to answer clinical questions, such as finding drivers of natural disease progression or treatment response. It combines an exhaustive search with a cascade of filters based on metrics assessing key credibility criteria, including relative risk reduction assessment, adjustment for confounding factors, each individual feature's contribution to the subgroup's effect, interaction tests for assessing between-subgroup treatment effect interactions, and adjustment for multiple testing. This allows Q-Finder to directly target and assess subgroups on the recommended credibility criteria. The top-k credible subgroups are then selected, while accounting for subgroup diversity and, possibly, clinical relevance. Those subgroups are tested on independent data to assess their consistency across databases, while preserving statistical power by limiting the number of tests.
    To illustrate this algorithm, we applied it to the database of the International Diabetes Management Practice Study (IDMPS) to better understand the drivers of improved glycemic control and the rate of hypoglycemia episodes in patients with type 2 diabetes. We compared Q-Finder with state-of-the-art approaches from both the subgroup identification and knowledge discovery in databases literature. The results demonstrate its ability to identify and support a short list of highly credible and diverse data-driven subgroups for both prognostic and predictive tasks.

  7. Crop trait regulating-genes knowledge graph dataset

    • scidb.cn
    Updated Jan 3, 2025
    Cite
    zhang dan dan (2025). Crop trait regulating-genes knowledge graph dataset [Dataset]. http://doi.org/10.57760/sciencedb.agriculture.00175
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 3, 2025
    Dataset provided by
    Science Data Bank
    Authors
    zhang dan dan
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    In crop breeding research, developing new varieties with excellent traits has always been the goal of breeders. With the accelerating application of information technology in crop breeding, the multi-dimensional scientific data related to crop breeding have grown exponentially. These semi-structured and structured data are distributed across domain-specific scientific databases and lack cross-species association and fusion, which hinders the transfer and reuse of existing crop breeding knowledge, limits the value of crop breeding scientific data, and poses challenges for knowledge discovery about crop trait-regulating genes. Therefore, more and more crop breeding research builds on the reorganization, correlation, analysis and utilization of existing breeding data in order to discover knowledge about crop trait-regulating genes.
    The crop trait-regulating genes knowledge graph dataset draws on the following data sources: the PubMed literature database; Phytozome (genome information for 4 species); Ensembl Plants, hosted by the European Molecular Biology Laboratory's European Bioinformatics Institute (genome information for 4 species); UniProt (Universal Protein; protein annotation for 4 species); the Rice Genome Annotation Project (RGAP); STRING (protein interaction information for 4 species); Pfam (protein family information for 4 species); KEGG (Kyoto Encyclopedia of Genes and Genomes; pathway annotation for 4 species); and the GO (Gene Ontology) domain database. Entities and relationships were extracted from these multi-source scientific data in their different formats: mapping-based knowledge extraction for structured data; Kettle-based extraction for XML semi-structured data; BLAST-based extraction for FASTA semi-structured data; and large-language-model-based extraction for unstructured text. On this basis, multi-source crop breeding knowledge was associated and fused through entity mapping and specific attribute associations. The resulting knowledge graph dataset consists of 13 entity datasets and 16 entity-relationship datasets. The crop trait-regulating gene knowledge graph dataset provides a key semantic model and an important data basis for crop breeding knowledge discovery, such as discovering excellent pleiotropic genes, predicting cross-species gene function, and uncovering pathway gene networks.

  8. Beyoglu Preservation Area Building Features Database

    • figshare.com
    • data.4tu.nl
    txt
    Updated Jun 3, 2023
    Cite
    Ahu Sokmenoglu Sohtorik (2023). Beyoglu Preservation Area Building Features Database [Dataset]. http://doi.org/10.4121/uuid:37ccd095-8523-419f-97e1-b6368a821a4f
    Available download formats: txt
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    4TU.ResearchData
    Authors
    Ahu Sokmenoglu Sohtorik
    License

    https://doi.org/10.4121/resource:terms_of_use

    Area covered
    Beyoğlu
    Description

    The Beyoglu Preservation Area Building Features Database. A large and quite comprehensive GIS database was constructed in order to implement the data mining analysis, based mainly on the traditional thematic maps of the Master Plan for the Beyoğlu Preservation Area. This database consists of 45 spatial and non-spatial features attributed to the 11,984 buildings located in the Beyoğlu Preservation Area and it is one of the original outputs of the PhD Thesis entitled "A Knowledge Discovery Approach to Urban Analysis: The Beyoglu Preservation Area as a data mine".

  9. CODE dataset

    • researchdata.se
    • figshare.scilifelab.se
    Updated Feb 27, 2025
    Cite
    Antonio H. Ribeiro; Manoel Horta Ribeiro; Gabriela M. Paixão; Derick M. Oliveira; Paulo R. Gomes; Jéssica A. Canazart; Milton P. Ferreira; Carl R. Andersson; Peter W. Macfarlane; Wagner Meira Jr.; Thomas B. Schön; Antonio Luiz P. Ribeiro (2025). CODE dataset [Dataset]. http://doi.org/10.17044/SCILIFELAB.15169716
    Dataset updated
    Feb 27, 2025
    Dataset provided by
    Uppsala University
    Authors
    Antonio H. Ribeiro; Manoel Horta Ribeiro; Gabriela M. Paixão; Derick M. Oliveira; Paulo R. Gomes; Jéssica A. Canazart; Milton P. Ferreira; Carl R. Andersson; Peter W. Macfarlane; Wagner Meira Jr.; Thomas B. Schön; Antonio Luiz P. Ribeiro
    Description

    Dataset with annotated 12-lead ECG records. The exams were taken in 811 counties in the state of Minas Gerais, Brazil, by the Telehealth Network of Minas Gerais (TNMG) between 2010 and 2016, and organized by the CODE (Clinical Outcomes in Digital Electrocardiography) group.

    Requesting access: Researchers affiliated with educational or research institutions may request access to this dataset. Requests will be analyzed on an individual basis and should contain:
    • the name of the PI and host organisation;
    • contact details (including your name and email); and
    • the scientific purpose of the data access request.
    If approved, a data user agreement will be forwarded to the requesting researcher (through the email that was provided). After the agreement has been signed (by the researcher or by the research institution), access to the dataset will be granted.

    Openly available subset: A subset of this dataset (with 15% of the patients) is openly available. See "CODE-15%: a large scale annotated dataset of 12-lead ECGs": https://doi.org/10.5281/zenodo.4916206.

    Content: The folder contains:
    • a column-separated file containing basic patient attributes;
    • the ECG waveforms in the wfdb format.

    Additional references: The dataset is described in the paper "Automatic diagnosis of the 12-lead ECG using a deep neural network": https://www.nature.com/articles/s41467-020-15432-4. Related publications also using this dataset are:
    [1] G. Paixao et al., "Validation of a Deep Neural Network Electrocardiographic-Age as a Mortality Predictor: The CODE Study," Circulation, vol. 142, no. Suppl_3, pp. A16883–A16883, Nov. 2020, doi: 10.1161/circ.142.suppl_3.16883.
    [2] A. L. P. Ribeiro et al., "Tele-electrocardiography and big data: The CODE (Clinical Outcomes in Digital Electrocardiography) study," Journal of Electrocardiology, Sep. 2019, doi: 10/gf7pwg.
    [3] D. M. Oliveira, A. H. Ribeiro, J. A. O. Pedrosa, G. M. M. Paixao, A. L. P. Ribeiro, and W. Meira Jr, "Explaining end-to-end ECG automated diagnosis using contextual features," in Machine Learning and Knowledge Discovery in Databases. European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD), Ghent, Belgium, Sep. 2020, vol. 12461, pp. 204–219, doi: 10.1007/978-3-030-67670-4_13.
    [4] D. M. Oliveira, A. H. Ribeiro, J. A. O. Pedrosa, G. M. M. Paixao, A. L. Ribeiro, and W. M. Jr, "Explaining black-box automated electrocardiogram classification to cardiologists," in 2020 Computing in Cardiology (CinC), 2020, vol. 47, doi: 10.22489/CinC.2020.452.
    [5] G. M. M. Paixão et al., "Evaluation of mortality in bundle branch block patients from an electronic cohort: Clinical Outcomes in Digital Electrocardiography (CODE) study," Journal of Electrocardiology, Sep. 2019, doi: 10/dcgk.
    [6] G. M. M. Paixão et al., "Evaluation of Mortality in Atrial Fibrillation: Clinical Outcomes in Digital Electrocardiography (CODE) Study," Global Heart, vol. 15, no. 1, p. 48, Jul. 2020, doi: 10.5334/gh.772.
    [7] G. M. M. Paixão et al., "Electrocardiographic Predictors of Mortality: Data from a Primary Care Tele-Electrocardiography Cohort of Brazilian Patients," Hearts, vol. 2, no. 4, Art. no. 4, Dec. 2021, doi: 10.3390/hearts2040035.
    [8] G. M. Paixão et al., "ECG-age from artificial intelligence: a new predictor for mortality? The CODE (Clinical Outcomes in Digital Electrocardiography) study," Journal of the American College of Cardiology, vol. 75, no. 11, Supplement 1, p. 3672, 2020, doi: 10.1016/S0735-1097(20)34299-6.
    [9] E. M. Lima et al., "Deep neural network estimated electrocardiographic-age as a mortality predictor," Nature Communications, vol. 12, 2021, doi: 10.1038/s41467-021-25351-7.
    [10] W. Meira Jr, A. L. P. Ribeiro, D. M. Oliveira, and A. H. Ribeiro, "Contextualized Interpretable Machine Learning for Medical Diagnosis," Communications of the ACM, 2020, doi: 10.1145/3416965.
    [11] A. H. Ribeiro et al., "Automatic diagnosis of the 12-lead ECG using a deep neural network," Nature Communications, vol. 11, no. 1, p. 1760, 2020, doi: 10/drkd.
    [12] A. H. Ribeiro et al., "Automatic Diagnosis of Short-Duration 12-Lead ECG using a Deep Convolutional Network," Machine Learning for Health (ML4H) Workshop at NeurIPS, 2018.
    [13] A. H. Ribeiro et al., "Automatic 12-lead ECG classification using a convolutional network ensemble," 2020, doi: 10.22489/CinC.2020.130.
    [14] V. Sangha et al., "Automated Multilabel Diagnosis on Electrocardiographic Images and Signals," medRxiv, Sep. 2021, doi: 10.1101/2021.09.22.21263926.
    [15] S. Biton et al., "Atrial fibrillation risk prediction from the 12-lead ECG using digital biomarkers and deep representation learning," European Heart Journal - Digital Health, 2021, doi: 10.1093/ehjdh/ztab071.

    Code: The following GitHub repositories perform analyses that use this dataset:
    • https://github.com/antonior92/automatic-ecg-diagnosis
    • https://github.com/antonior92/ecg-age-prediction

    Related datasets:
    • CODE-test: An annotated 12-lead ECG dataset (https://doi.org/10.5281/zenodo.3765780)
    • CODE-15%: a large scale annotated dataset of 12-lead ECGs (https://doi.org/10.5281/zenodo.4916206)
    • Sami-Trop: 12-lead ECG traces with age and mortality annotations (https://doi.org/10.5281/zenodo.4905618)

    Ethics declarations: The CODE Study was approved by the Research Ethics Committee of the Universidade Federal de Minas Gerais, protocol 49368496317.7.0000.5149.

  10. Data from: Is this bug severe? A text-cum-graph based model for bug severity...

    • zenodo.org
    csv, txt
    Updated Aug 26, 2023
    Cite
    Rima Hazra; Arpit Dwivedi; Animesh Mukherjee (2023). Is this bug severe? A text-cum-graph based model for bug severity prediction [Dataset]. http://doi.org/10.5281/zenodo.5554978
    Available download formats: csv, txt
    Dataset updated
    Aug 26, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Rima Hazra; Arpit Dwivedi; Animesh Mukherjee
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A snapshot of the dataset has been updated. For the time being, we are publishing a snapshot of the dataset where the bugs were reported after 2017.

    Paper link: https://arxiv.org/abs/2207.00623 (ECML-PKDD 2022)

    Cite our paper:

    @InProceedings{10.1007/978-3-031-26422-1_15,
    author="Hazra, Rima
    and Dwivedi, Arpit
    and Mukherjee, Animesh",
    editor="Amini, Massih-Reza
    and Canu, St{\'e}phane
    and Fischer, Asja
    and Guns, Tias
    and Kralj Novak, Petra
    and Tsoumakas, Grigorios",
    title="Is This Bug Severe? A Text-Cum-Graph Based Model for Bug Severity Prediction",
    booktitle="Machine Learning and Knowledge Discovery in Databases",
    year="2023",
    publisher="Springer Nature Switzerland",
    address="Cham",
    pages="236--252",
    isbn="978-3-031-26422-1"
    }

    *** Please see the new version. (10.5281/zenodo.5554978)

    There are six files in total:

    • bug_descriptions.csv: contains the bug id and its description.
    • bug_comments.csv: contains three columns: the bug id, the comment, and the timestamp of the comment.
    • bug_REPORTED_ON_details.csv: contains the bug id and the package name on which the bug is reported.
    • affect_dataset.csv: contains the bug id and the affected packages, along with the affect timestamp.
    • bug_heat_2019.csv: contains the bug ids and their bug heat values, crawled in November 2019.
    • bug_heat_2020.csv: contains the bug ids and their bug heat values, crawled in November 2020.
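    As a minimal illustration of how these files fit together, the sketch below joins rows shaped like bug_descriptions.csv and bug_heat_2020.csv on the bug id. The column names follow the file descriptions above, but the rows themselves are invented for the example:

    ```python
    import csv
    import io

    # Tiny synthetic stand-ins for two of the described CSV files.
    bug_descriptions = io.StringIO("bug_id,description\n101,Crash on start\n102,Wrong output\n")
    bug_heat_2020 = io.StringIO("bug_id,bug_heat\n101,14\n102,6\n")

    # Index bug heat by bug id, then attach it to each description row.
    heat = {row["bug_id"]: int(row["bug_heat"]) for row in csv.DictReader(bug_heat_2020)}
    joined = [
        {**row, "bug_heat": heat.get(row["bug_id"])}
        for row in csv.DictReader(bug_descriptions)
    ]
    print(joined[0])  # {'bug_id': '101', 'description': 'Crash on start', 'bug_heat': 14}
    ```

    With the real files, the StringIO objects would simply be replaced by open() calls on the CSVs.
    
    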
  11. Data from: Proteomics profiling of research models for studying pancreatic...

    • ebi.ac.uk
    Updated Dec 18, 2024
    + more versions
    Cite
    Animesh Sharma (2024). Proteomics profiling of research models for studying pancreatic ductal adenocarcinoma [Dataset]. https://www.ebi.ac.uk/pride/archive/projects/PXD057804
    Dataset updated
    Dec 18, 2024
    Authors
    Animesh Sharma
    Variables measured
    Proteomics
    Description

    Pancreatic ductal adenocarcinoma (PDAC) remains one of the most lethal malignancies, with a five-year survival rate of 10-15% due to late-stage diagnosis and the limited efficacy of existing treatments. This study utilized proteomics-based system modelling to generate multimodal datasets from various research models, including PDAC cells, spheroids, organoids, and tissues derived from murine and human samples. Identical mass spectrometry-based proteomics was applied across the different models, and the preparation and validation of the research models and the proteomics are described in detail. The assembled datasets presented here may contribute to the data collection on PDAC, which will be useful for systems modeling, data mining, knowledge discovery in databases, and bioinformatics of the individual models. Further data analysis may lead to the generation of research hypotheses, predictions of targets for diagnosis and treatment, and relationships between data variables, bridging the gap between preclinical research and clinical trials and enhancing the possibilities for discovering early diagnostic biomarkers and effective therapeutic targets.

  12. Playlist2vec: Spotify Million Playlist Dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Jun 22, 2021
    Cite
    Piyush Papreja (2021). Playlist2vec: Spotify Million Playlist Dataset [Dataset]. http://doi.org/10.5281/zenodo.5002584
    Available download formats: bin
    Dataset updated
    Jun 22, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Piyush Papreja
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was created using Spotify developer API. It consists of user-created as well as Spotify-curated playlists.
    The dataset consists of 1 million playlists, 3 million unique tracks, 3 million unique albums, and 1.3 million artists.
    The data is stored in a SQL database, with the primary entities being songs, albums, artists, and playlists.
    Each of the aforementioned entities is represented by a unique ID (Spotify URI).
    Data is stored in the following tables:

    • album
    • artist
    • track
    • playlist
    • track_artist1
    • track_playlist1

    album

    | id | name | uri |

    id: Album ID as provided by Spotify
    name: Album Name as provided by Spotify
    uri: Album URI as provided by Spotify


    artist

    | id | name | uri |

    id: Artist ID as provided by Spotify
    name: Artist Name as provided by Spotify
    uri: Artist URI as provided by Spotify


    track

    | id | name | duration | popularity | explicit | preview_url | uri | album_id |

    id: Track ID as provided by Spotify
    name: Track Name as provided by Spotify
    duration: Track Duration (in milliseconds) as provided by Spotify
    popularity: Track Popularity as provided by Spotify
    explicit: Whether the track has explicit lyrics or not. (true or false)
    preview_url: A link to a 30 second preview (MP3 format) of the track. Can be null
    uri: Track Uri as provided by Spotify
    album_id: Album Id to which the track belongs


    playlist

    | id | name | followers | uri | total_tracks |

    id: Playlist ID as provided by Spotify
    name: Playlist Name as provided by Spotify
    followers: Playlist Followers as provided by Spotify
    uri: Playlist Uri as provided by Spotify
    total_tracks: Total number of tracks in the playlist.

    track_artist1

    | track_id | artist_id |

    Track-Artist association table

    track_playlist1

    | track_id | playlist_id |

    Track-Playlist association table
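The schema above can be exercised end-to-end with a few rows. A minimal sketch, using Python's built-in SQLite in place of the MySQL dump; table and column names are taken from the listing above, while the sample rows ('Some Track', 'Road Trip', etc.) are invented purely for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Tables mirroring the listing above (types simplified for SQLite).
cur.executescript("""
CREATE TABLE album    (id TEXT PRIMARY KEY, name TEXT, uri TEXT);
CREATE TABLE artist   (id TEXT PRIMARY KEY, name TEXT, uri TEXT);
CREATE TABLE track    (id TEXT PRIMARY KEY, name TEXT, duration INTEGER,
                       popularity INTEGER, explicit TEXT, preview_url TEXT,
                       uri TEXT, album_id TEXT REFERENCES album(id));
CREATE TABLE playlist (id TEXT PRIMARY KEY, name TEXT, followers INTEGER,
                       uri TEXT, total_tracks INTEGER);
CREATE TABLE track_artist1   (track_id TEXT, artist_id TEXT);
CREATE TABLE track_playlist1 (track_id TEXT, playlist_id TEXT);
""")

# Invented sample rows, just to demonstrate the joins.
cur.execute("INSERT INTO album VALUES ('al1', 'Some Album', 'spotify:album:al1')")
cur.execute("INSERT INTO track VALUES ('t1', 'Some Track', 215000, 57, 'false', "
            "NULL, 'spotify:track:t1', 'al1')")
cur.execute("INSERT INTO playlist VALUES ('p1', 'Road Trip', 42, "
            "'spotify:playlist:p1', 1)")
cur.execute("INSERT INTO track_playlist1 VALUES ('t1', 'p1')")

# All tracks of a playlist, resolved through the association table.
rows = cur.execute("""
    SELECT t.name, a.name
    FROM playlist p
    JOIN track_playlist1 tp ON tp.playlist_id = p.id
    JOIN track t            ON t.id = tp.track_id
    JOIN album a            ON a.id = t.album_id
    WHERE p.name = 'Road Trip'
""").fetchall()
print(rows)  # [('Some Track', 'Some Album')]
```

The same joins apply unchanged against the MySQL database restored from the dump.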

    - - - - - SETUP - - - - -


    The data is in the form of a SQL dump. The download size is about 10 GB, and the database populated from it comes out to about 35 GB.

    spotifydbdumpschemashare.sql contains the schema for the database (for reference).
    spotifydbdumpshare.sql is the actual data dump.


    Setup steps:
    1. Create database

    - - - - - PAPER - - - - -


    The description of this dataset can be found in the following paper:

    Papreja P., Venkateswara H., Panchanathan S. (2020) Representation, Exploration and Recommendation of Playlists. In: Cellier P., Driessens K. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2019. Communications in Computer and Information Science, vol 1168. Springer, Cham

  13. Data from: Car Evaluation Data Set

    • hypi.ai
    zip
    Updated Sep 1, 2017
    Ahiale Darlington (2017). Car Evaluation Data Set [Dataset]. https://hypi.ai/wp/wp-content/uploads/2019/10/car-evaluation-data-set/
    Explore at:
    zip(4775 bytes)Available download formats
    Dataset updated
    Sep 1, 2017
    Authors
    Ahiale Darlington
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    from: https://archive.ics.uci.edu/ml/datasets/car+evaluation

    1. Title: Car Evaluation Database

    2. Sources: (a) Creator: Marko Bohanec (b) Donors: Marko Bohanec (marko.bohanec@ijs.si) Blaz Zupan (blaz.zupan@ijs.si) (c) Date: June, 1997

    3. Past Usage:

      The hierarchical decision model, from which this dataset is derived, was first presented in

      M. Bohanec and V. Rajkovic: Knowledge acquisition and explanation for multi-attribute decision making. In 8th Intl Workshop on Expert Systems and their Applications, Avignon, France. pages 59-78, 1988.

      Within machine learning, this dataset was used for the evaluation of HINT (Hierarchy INduction Tool), which proved able to completely reconstruct the original hierarchical model. This, together with a comparison with C4.5, is presented in

      B. Zupan, M. Bohanec, I. Bratko, J. Demsar: Machine learning by function decomposition. ICML-97, Nashville, TN. 1997 (to appear)

    4. Relevant Information Paragraph:

      Car Evaluation Database was derived from a simple hierarchical decision model originally developed for the demonstration of DEX (M. Bohanec, V. Rajkovic: Expert system for decision making. Sistemica 1(1), pp. 145-157, 1990.). The model evaluates cars according to the following concept structure:

      CAR              car acceptability
      . PRICE          overall price
      . . buying       buying price
      . . maint        price of the maintenance
      . TECH           technical characteristics
      . . COMFORT      comfort
      . . . doors      number of doors
      . . . persons    capacity in terms of persons to carry
      . . . lug_boot   the size of luggage boot
      . . safety       estimated safety of the car

      Input attributes are printed in lowercase. Besides the target concept (CAR), the model includes three intermediate concepts: PRICE, TECH, COMFORT. Every concept is in the original model related to its lower level descendants by a set of examples (for these examples sets see http://www-ai.ijs.si/BlazZupan/car.html).

      The Car Evaluation Database contains examples with the structural information removed, i.e., directly relates CAR to the six input attributes: buying, maint, doors, persons, lug_boot, safety.

      Because of known underlying concept structure, this database may be particularly useful for testing constructive induction and structure discovery methods.

    5. Number of Instances: 1728 (instances completely cover the attribute space)

    6. Number of Attributes: 6

    7. Attribute Values:

      buying     v-high, high, med, low
      maint      v-high, high, med, low
      doors      2, 3, 4, 5-more
      persons    2, 4, more
      lug_boot   small, med, big
      safety     low, med, high

    8. Missing Attribute Values: none

    9. Class Distribution (number of instances per class)

      class    N      N[%]
      unacc    1210   (70.023 %)
      acc      384    (22.222 %)
      good     69     ( 3.993 %)
      v-good   65     ( 3.762 %)
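Since the instances completely cover the attribute space, the instance count equals the product of the attribute domain sizes (4 Ă— 4 Ă— 4 Ă— 3 Ă— 3 Ă— 3 = 1728). A short Python sanity check, using the attribute values and class counts listed above:

```python
from itertools import product

# Attribute domains as listed in the dataset description above.
domains = {
    "buying":   ["v-high", "high", "med", "low"],
    "maint":    ["v-high", "high", "med", "low"],
    "doors":    ["2", "3", "4", "5-more"],
    "persons":  ["2", "4", "more"],
    "lug_boot": ["small", "med", "big"],
    "safety":   ["low", "med", "high"],
}

# One instance per attribute combination: 4*4*4*3*3*3 = 1728.
n_instances = len(list(product(*domains.values())))
print(n_instances)  # 1728

# Recompute the class distribution percentages from the reported counts.
counts = {"unacc": 1210, "acc": 384, "good": 69, "v-good": 65}
for cls, n in counts.items():
    print(f"{cls}: {100 * n / n_instances:.3f} %")
```

The printed percentages match the table above (70.023, 22.222, 3.993, 3.762).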

  14. International Journal of Engineering and Advanced Technology FAQ - ResearchHelpDesk

    • researchhelpdesk.org
    Updated May 28, 2022
    + more versions
    Research Help Desk (2022). International Journal of Engineering and Advanced Technology FAQ - ResearchHelpDesk [Dataset]. https://www.researchhelpdesk.org/journal/faq/552/international-journal-of-engineering-and-advanced-technology
    Explore at:
    Dataset updated
    May 28, 2022
    Dataset authored and provided by
    Research Help Desk
    Description

    International Journal of Engineering and Advanced Technology FAQ - ResearchHelpDesk - International Journal of Engineering and Advanced Technology (IJEAT), Online-ISSN 2249-8958, is a bi-monthly international journal published in the months of February, April, June, August, October, and December by Blue Eyes Intelligence Engineering & Sciences Publication (BEIESP), Bhopal (M.P.), India, since 2011. It is an academic, online, open-access, double-blind, peer-reviewed international journal. All submitted papers are reviewed by the board of committee of IJEAT.

    Aim of IJEAT: to
    • disseminate original, scientific, theoretical or applied research in the field of Engineering and allied fields;
    • publish original, theoretical and practical advances in Computer Science & Engineering, Information Technology, Electrical and Electronics Engineering, Electronics and Telecommunication, Mechanical Engineering, Civil Engineering, Textile Engineering and all interdisciplinary streams of Engineering Sciences;
    • provide a platform for publishing results and research with a strong empirical component;
    • bridge the significant gap between research and practice by promoting the publication of original, novel, industry-relevant research;
    • solicit original and unpublished research papers, based on theoretical or experimental works, for publication globally.

    Scope of IJEAT: IJEAT covers all topics of all engineering branches, among them Computer Science & Engineering, Information Technology, Electronics & Communication, Electrical and Electronics, Electronics and Telecommunication, Civil Engineering, Mechanical Engineering, Textile Engineering and all interdisciplinary streams of Engineering Sciences. The main topics include, but are not limited to:

    1. Smart Computing and Information Processing: Signal and Speech Processing; Image Processing and Pattern Recognition; WSN; Artificial Intelligence and machine learning; Data mining and warehousing; Data Analytics; Deep learning; Bioinformatics; High Performance computing; Advanced Computer networking; Cloud Computing; IoT; Parallel Computing on GPU; Human Computer Interactions
    2. Recent Trends in Microelectronics and VLSI Design: Process & Device Technologies; Low-power design; Nanometer-scale integrated circuits; Application specific ICs (ASICs); FPGAs; Nanotechnology; Nano electronics and Quantum Computing
    3. Challenges of Industry and their Solutions, Communications: Advanced Manufacturing Technologies; Artificial Intelligence; Autonomous Robots; Augmented Reality; Big Data Analytics and Business Intelligence; Cyber Physical Systems (CPS); Digital Clone or Simulation; Industrial Internet of Things (IIoT); Manufacturing IOT; Plant Cyber security; Smart Solutions – Wearable Sensors and Smart Glasses; System Integration; Small Batch Manufacturing; Visual Analytics; Virtual Reality; 3D Printing
    4. Internet of Things (IoT): Internet of Things (IoT) & IoE & Edge Computing; Distributed Mobile Applications Utilizing IoT; Security, Privacy and Trust in IoT & IoE; Standards for IoT Applications; Ubiquitous Computing; Block Chain-enabled IoT Device and Data Security and Privacy; Application of WSN in IoT; Cloud Resources Utilization in IoT; Wireless Access Technologies for IoT; Mobile Applications and Services for IoT; Machine/Deep Learning with IoT & IoE; Smart Sensors and Internet of Things for Smart City; Logic, Functional programming and Microcontrollers for IoT; Sensor Networks, Actuators for Internet of Things; Data Visualization using IoT; IoT Application and Communication Protocol; Big Data Analytics for Social Networking using IoT; IoT Applications for Smart Cities; Emulation and Simulation Methodologies for IoT; IoT Applied for Digital Contents
    5. Microwaves and Photonics: Microwave filter; Micro Strip antenna; Microwave Link design; Microwave oscillator; Frequency selective surface; Microwave Antenna; Microwave Photonics; Radio over fiber; Optical communication; Optical oscillator; Optical Link design; Optical phase lock loop; Optical devices
    6. Computation Intelligence and Analytics: Soft Computing; Advance Ubiquitous Computing; Parallel Computing; Distributed Computing; Machine Learning; Information Retrieval; Expert Systems; Data Mining; Text Mining; Data Warehousing; Predictive Analysis; Data Management; Big Data Analytics; Big Data Security
    7. Energy Harvesting and Wireless Power Transmission: Energy harvesting and transfer for wireless sensor networks; Economics of energy harvesting communications; Waveform optimization for wireless power transfer; RF Energy Harvesting; Wireless Power Transmission; Microstrip Antenna design and application; Wearable Textile Antenna; Luminescence; Rectenna
    8. Advance Concept of Networking and Database: Computer Network; Mobile Adhoc Network; Image Security Application; Artificial Intelligence and machine learning in the Field of Network and Database; Data Analytic; High performance computing; Pattern Recognition
    9. Machine Learning (ML) and Knowledge Mining (KM): Regression and prediction; Problem solving and planning; Clustering; Classification; Neural information processing; Vision and speech perception; Heterogeneous and streaming data; Natural language processing; Probabilistic Models and Methods; Reasoning and inference; Marketing and social sciences; Data mining; Knowledge Discovery; Web mining; Information retrieval; Design and diagnosis; Game playing; Streaming data; Music Modelling and Analysis; Robotics and control; Multi-agent systems; Bioinformatics; Social sciences; Industrial, financial and scientific applications of all kind
    10. Advanced Computer networking: Computational Intelligence; Data Management, Exploration, and Mining; Robotics; Artificial Intelligence and Machine Learning; Computer Architecture and VLSI; Computer Graphics, Simulation, and Modelling; Digital System and Logic Design; Natural Language Processing and Machine Translation; Parallel and Distributed Algorithms; Pattern Recognition and Analysis; Systems and Software Engineering; Nature Inspired Computing; Signal and Image Processing; Reconfigurable Computing; Cloud, Cluster, Grid and P2P Computing; Biomedical Computing; Advanced Bioinformatics; Green Computing; Mobile Computing; Nano Ubiquitous Computing; Context Awareness and Personalization, Autonomic and Trusted Computing; Cryptography and Applied Mathematics; Security, Trust and Privacy; Digital Rights Management; Networked-Driven Multicourse Chips; Internet Computing; Agricultural Informatics and Communication; Community Information Systems; Computational Economics; Digital Photogrammetric Remote Sensing, GIS and GPS; Disaster Management; e-governance, e-Commerce, e-business, e-Learning; Forest Genomics and Informatics; Healthcare Informatics; Information Ecology and Knowledge Management; Irrigation Informatics; Neuro-Informatics; Open Source: Challenges and opportunities; Web-Based Learning: Innovation and Challenges; Soft computing; Signal and Speech Processing; Natural Language Processing
    11. Communications: Microstrip Antenna; Microwave; Radar and Satellite; Smart Antenna; MIMO Antenna; Wireless Communication; RFID Network and Applications; 5G Communication; 6G Communication
    12. Algorithms and Complexity: Sequential, Parallel And Distributed Algorithms And Data Structures; Approximation And Randomized Algorithms; Graph Algorithms And Graph Drawing; On-Line And Streaming Algorithms; Analysis Of Algorithms And Computational Complexity; Algorithm Engineering; Web Algorithms; Exact And Parameterized Computation; Algorithmic Game Theory; Computational Biology; Foundations Of Communication Networks; Computational Geometry; Discrete Optimization
    13. Software Engineering and Knowledge Engineering: Software Engineering Methodologies; Agent-based software engineering; Artificial intelligence approaches to software engineering; Component-based software engineering; Embedded and ubiquitous software engineering; Aspect-based software engineering; Empirical software engineering; Search-Based Software engineering; Automated software design and synthesis; Computer-supported cooperative work; Automated software specification; Reverse engineering; Software Engineering Techniques and Production Perspectives; Requirements engineering; Software analysis, design and modelling; Software maintenance and evolution; Software engineering tools and environments; Software engineering decision support; Software design patterns; Software product lines; Process and workflow management; Reflection and metadata approaches; Program understanding and system maintenance; Software domain modelling and analysis; Software economics; Multimedia and hypermedia software engineering; Software engineering case study and experience reports; Enterprise software, middleware, and tools; Artificial intelligent methods, models, techniques; Artificial life and societies; Swarm intelligence; Smart Spaces; Autonomic computing and agent-based systems; Autonomic computing; Adaptive Systems; Agent architectures, ontologies, languages and protocols; Multi-agent systems; Agent-based learning and knowledge discovery; Interface agents; Agent-based auctions and marketplaces; Secure mobile and multi-agent systems; Mobile agents; SOA and Service-Oriented Systems; Service-centric software engineering; Service oriented requirements engineering; Service oriented architectures; Middleware for service based systems; Service discovery and composition; Service level agreements (drafting,

  15. Data and code files for the Adaptive Skip-Train Structured Ensemble for...

    • figshare.com
    txt
    Updated Mar 7, 2018
    Martin Pavlovski; Fang Zhou; Ivan Stojkovic; Ljupco Kocarev; Zoran Obradovic (2018). Data and code files for the Adaptive Skip-Train Structured Ensemble for Temporal Networks [Dataset]. http://doi.org/10.6084/m9.figshare.5444500.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    Mar 7, 2018
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Martin Pavlovski; Fang Zhou; Ivan Stojkovic; Ljupco Kocarev; Zoran Obradovic
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This fileset contains the data and source code related to the paper: Pavlovski, M., Zhou, F., Stojkovic, I., Kocarev, L., & Obradovic, Z. "Adaptive Skip-Train Structured Regression for Temporal Networks", Proc. European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2017.

    The code files contain the experimental setups and code for running the various models:
    • AST-SE: Adaptive Skip-Train Structured Ensemble, a sampling-based structured regression ensemble for prediction on top of temporal networks
    • LR: An L1-regularized linear regression, employed as an unstructured predictor for each of the following models in order to achieve efficiency
    • GCRF: Standard GCRF model that enables the chosen unstructured predictor to learn the network structure
    • SE: Structured ensemble composed of multiple GCRF models
    • WSE: Weighted structured ensemble that combines the predictions of multiple GCRFs in a weighted mixture in order to predict the nodes' outputs in the next timestep

    The data file H3N2_data.mat contains temporally collected gene expression measurements (12,032 genes) of a human subject infected with the H3N2 virus. For further details see the related conference paper.

    All code is written in MATLAB and is available in .m format files. Raw code can be accessed from these files using openly accessible text-editing software. Data are provided in .mat format, accessible using the MATLAB computing environment.

    Background: A broad range of high-impact applications involve learning a predictive model in a temporal network environment. In weather forecasting, predicting effectiveness of treatments, outcomes in healthcare, and in many other domains, networks are often large, while intervals between consecutive time moments are brief. Therefore, models are required to forecast in a more scalable and efficient way, without compromising accuracy. The Gaussian Conditional Random Field (GCRF) is a widely used graphical model for performing structured regression on networks. However, GCRF is not applicable to large networks and it cannot capture different network substructures (communities) since it considers the entire network while learning. In this study, we present a novel model, Adaptive Skip-Train Structured Ensemble (AST-SE), a sampling-based structured regression ensemble for prediction on top of temporal networks. AST-SE takes advantage of the scheme of ensemble methods to allow multiple GCRFs to learn from several subnetworks. The proposed model is able to automatically skip the entire training or some phases of the training process. The prediction accuracy and efficiency of AST-SE were assessed and compared against alternatives on synthetic temporal networks and the H3N2 Virus Influenza network. The obtained results provide evidence that (1) AST-SE is ~140 times faster than GCRF as it skips retraining quite frequently; (2) it still captures the original network structure more accurately than GCRF while operating solely on partial views of the network; (3) it outperforms both unweighted and weighted GCRF ensembles which also operate on subnetworks but require retraining at each timestep.
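The weighted-mixture idea behind the WSE baseline (combining several models' per-node predictions with fixed weights) can be illustrated with a toy example. This is not the authors' MATLAB code; the predictions and weights below are invented purely to show the combination step:

```python
# Per-node predictions from three models for the next timestep (toy values,
# standing in for the outputs of GCRFs trained on different subnetworks).
model_preds = [
    [1.0, 2.0, 3.0],   # model trained on subnetwork 1
    [1.2, 1.8, 3.4],   # model trained on subnetwork 2
    [0.8, 2.2, 2.6],   # model trained on subnetwork 3
]
weights = [0.5, 0.3, 0.2]  # e.g. proportional to each model's past accuracy
assert abs(sum(weights) - 1.0) < 1e-9  # convex combination

# Weighted mixture: per-node weighted average of the models' outputs.
combined = [
    sum(w * preds[i] for w, preds in zip(weights, model_preds))
    for i in range(len(model_preds[0]))
]
print(combined)
```

The point of AST-SE, per the abstract, is that the member models are retrained only adaptively rather than at every timestep; this sketch shows only the prediction-combination step.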

  16. Table_3_Computational Advances in Drug Safety: Systematic and Mapping Review of Knowledge Engineering Based Approaches

    • frontiersin.figshare.com
    docx
    Updated Jun 9, 2023
    + more versions
    Pantelis Natsiavas; Andigoni Malousi; Cédric Bousquet; Marie-Christine Jaulent; Vassilis Koutkias (2023). Table_3_Computational Advances in Drug Safety: Systematic and Mapping Review of Knowledge Engineering Based Approaches.DOCX [Dataset]. http://doi.org/10.3389/fphar.2019.00415.s003
    Explore at:
    docxAvailable download formats
    Dataset updated
    Jun 9, 2023
    Dataset provided by
    Frontiers
    Authors
    Pantelis Natsiavas; Andigoni Malousi; Cédric Bousquet; Marie-Christine Jaulent; Vassilis Koutkias
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Drug Safety (DS) is a domain with significant public health and social impact. Knowledge Engineering (KE) is the Computer Science discipline elaborating on methods and tools for developing “knowledge-intensive” systems, depending on a conceptual “knowledge” schema and some kind of “reasoning” process. The present systematic and mapping review aims to investigate KE-based approaches employed for DS and highlight the introduced added value as well as trends and possible gaps in the domain. Journal articles published between 2006 and 2017 were retrieved from PubMed/MEDLINE and Web of Science® (873 in total) and filtered based on a comprehensive set of inclusion/exclusion criteria. The 80 finally selected articles were reviewed on full-text, while the mapping process relied on a set of concrete criteria (concerning specific KE and DS core activities, special DS topics, employed data sources, reference ontologies/terminologies, and computational methods, etc.). The analysis results are publicly available as online interactive analytics graphs. The review clearly depicted increased use of KE approaches for DS. The collected data illustrate the use of KE for various DS aspects, such as Adverse Drug Event (ADE) information collection, detection, and assessment. Moreover, the quantified analysis of using KE for the respective DS core activities highlighted room for intensifying research on KE for ADE monitoring, prevention and reporting. Finally, the assessed use of the various data sources for DS special topics demonstrated extensive use of dominant data sources for DS surveillance, i.e., Spontaneous Reporting Systems, but also increasing interest in the use of emerging data sources, e.g., observational healthcare databases, biochemical/genetic databases, and social media. 
Various exemplar applications were identified with promising results, e.g., improvement in Adverse Drug Reaction (ADR) prediction, detection of drug interactions, and novel ADE profiles related with specific mechanisms of action, etc. Nevertheless, since the reviewed studies mostly concerned proof-of-concept implementations, more intense research is required to increase the maturity level that is necessary for KE approaches to reach routine DS practice. In conclusion, we argue that efficiently addressing DS data analytics and management challenges requires the introduction of high-throughput KE-based methods for effective knowledge discovery and management, resulting ultimately, in the establishment of a continuous learning DS system.

  17. MIMIC-IV

    • physionet.org
    Updated Oct 11, 2024
    Alistair Johnson; Lucas Bulgarelli; Tom Pollard; Brian Gow; Benjamin Moody; Steven Horng; Leo Anthony Celi; Roger Mark (2024). MIMIC-IV [Dataset]. http://doi.org/10.13026/kpb9-mt58
    Explore at:
    Dataset updated
    Oct 11, 2024
    Authors
    Alistair Johnson; Lucas Bulgarelli; Tom Pollard; Brian Gow; Benjamin Moody; Steven Horng; Leo Anthony Celi; Roger Mark
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/draftshttps://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    Retrospectively collected medical data has the opportunity to improve patient care through knowledge discovery and algorithm development. Broad reuse of medical data is desirable for the greatest public good, but data sharing must be done in a manner which protects patient privacy. Here we present Medical Information Mart for Intensive Care (MIMIC)-IV, a large deidentified dataset of patients admitted to the emergency department or an intensive care unit at the Beth Israel Deaconess Medical Center in Boston, MA. MIMIC-IV contains data for over 65,000 patients admitted to an ICU and over 200,000 patients admitted to the emergency department. MIMIC-IV incorporates contemporary data and adopts a modular approach to data organization, highlighting data provenance and facilitating both individual and combined use of disparate data sources. MIMIC-IV is intended to carry on the success of MIMIC-III and support a broad set of applications within healthcare.

  18. Tiered Human Integrated Sequence Search Databases for Shotgun Proteomics

    • datasetcatalog.nlm.nih.gov
    Updated Sep 12, 2016
    Sun, Zhi; Omenn, Gilbert S.; Moritz, Robert L.; Shteynberg, David; Campbell, David S.; Binz, Pierre-Alain; Mendoza, Luis; Deutsch, Eric W.; Farrah, Terry (2016). Tiered Human Integrated Sequence Search Databases for Shotgun Proteomics [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001553063
    Explore at:
    Dataset updated
    Sep 12, 2016
    Authors
    Sun, Zhi; Omenn, Gilbert S.; Moritz, Robert L.; Shteynberg, David; Campbell, David S.; Binz, Pierre-Alain; Mendoza, Luis; Deutsch, Eric W.; Farrah, Terry
    Description

    The results of analysis of shotgun proteomics mass spectrometry data can be greatly affected by the selection of the reference protein sequence database against which the spectra are matched. For many species there are multiple sources from which somewhat different sequence sets can be obtained. This can lead to confusion about which database is best in which circumstances, a problem especially acute in human sample analysis. All sequence databases are genome-based, with sequences for the predicted genes and their protein translation products compiled. Our goal is to create a set of primary sequence databases that comprise the union of sequences from many of the different available sources and make the result easily available to the community. We have compiled a set of four sequence databases of varying sizes, from a small database consisting of only the ∼20,000 primary isoforms plus contaminants to a very large database that includes almost all nonredundant protein sequences from several sources. This set of tiered, increasingly complete human protein sequence databases suitable for mass spectrometry proteomics sequence database searching is called the Tiered Human Integrated Search Proteome set. In order to evaluate the utility of these databases, we have analyzed two different data sets, one from the HeLa cell line and the other from normal human liver tissue, with each of the four tiers of database complexity. The result is that approximately 0.8%, 1.1%, and 1.5% additional peptides can be identified for Tiers 2, 3, and 4, respectively, as compared with the Tier 1 database, at substantially increasing computational cost. This increase in computational cost may be worth bearing if the identification of sequence variants or the discovery of sequences that are not present in the reviewed knowledge base entries is an important goal of the study.
We find that it is useful to search a data set against a simpler database, and then check the uniqueness of the discovered peptides against a more complex database. We have set up an automated system that downloads all the source databases on the first of each month and automatically generates a new set of search databases and makes them available for download at http://www.peptideatlas.org/thisp/.
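The two-pass strategy described above (search against a simpler database, then re-check peptide uniqueness against a more complex one) can be sketched with toy data. The peptide strings and protein sequences below are invented for illustration and are far shorter than real tryptic peptides and proteins:

```python
# Peptides "identified" in a search against the small Tier 1 database (toy).
tier1_peptides = ["AGGK", "LMNPQ", "TTSR"]

# A larger, more redundant database (standing in for Tier 3/4); invented.
large_db = {
    "P1":     "MMAGGKLLTTSR",
    "P1_var": "MMAGGKLLTTSK",   # a sequence variant sharing the AGGK peptide
    "P2":     "QQLMNPQAA",
}

def matching_proteins(peptide, db):
    """Return accessions of all proteins in db containing the peptide."""
    return [acc for acc, seq in db.items() if peptide in seq]

# A peptide that was unique in the small database may map to several
# entries in the larger one, changing its protein-level interpretation.
for pep in tier1_peptides:
    hits = matching_proteins(pep, large_db)
    status = "unique" if len(hits) == 1 else "shared"
    print(pep, status, hits)
```

Real pipelines do this mapping with indexed tools rather than substring scans, but the check itself is exactly this: count how many entries of the larger database contain each identified peptide.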

  19. kddcup99

    • tensorflow.org
    Updated Jan 4, 2023
    (2023). kddcup99 [Dataset]. https://www.tensorflow.org/datasets/catalog/kddcup99
    Explore at:
    Dataset updated
    Jan 4, 2023
    Description

    This is the data set used for the Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99, the Fifth International Conference on Knowledge Discovery and Data Mining. The competition task was to build a network intrusion detector, a predictive model capable of distinguishing between 'bad' connections, called intrusions or attacks, and 'good' normal connections. This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('kddcup99', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  20. Table_1_Computational Advances in Drug Safety: Systematic and Mapping Review of Knowledge Engineering Based Approaches

    • frontiersin.figshare.com
    docx
    Updated May 31, 2023
    + more versions
    Pantelis Natsiavas; Andigoni Malousi; Cédric Bousquet; Marie-Christine Jaulent; Vassilis Koutkias (2023). Table_1_Computational Advances in Drug Safety: Systematic and Mapping Review of Knowledge Engineering Based Approaches.DOCX [Dataset]. http://doi.org/10.3389/fphar.2019.00415.s001
    Explore at:
    docxAvailable download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers
    Authors
    Pantelis Natsiavas; Andigoni Malousi; Cédric Bousquet; Marie-Christine Jaulent; Vassilis Koutkias
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Drug Safety (DS) is a domain with significant public health and social impact. Knowledge Engineering (KE) is the Computer Science discipline elaborating on methods and tools for developing “knowledge-intensive” systems, depending on a conceptual “knowledge” schema and some kind of “reasoning” process. The present systematic and mapping review aims to investigate KE-based approaches employed for DS and highlight the introduced added value as well as trends and possible gaps in the domain. Journal articles published between 2006 and 2017 were retrieved from PubMed/MEDLINE and Web of Science® (873 in total) and filtered based on a comprehensive set of inclusion/exclusion criteria. The 80 finally selected articles were reviewed on full-text, while the mapping process relied on a set of concrete criteria (concerning specific KE and DS core activities, special DS topics, employed data sources, reference ontologies/terminologies, and computational methods, etc.). The analysis results are publicly available as online interactive analytics graphs. The review clearly depicted increased use of KE approaches for DS. The collected data illustrate the use of KE for various DS aspects, such as Adverse Drug Event (ADE) information collection, detection, and assessment. Moreover, the quantified analysis of using KE for the respective DS core activities highlighted room for intensifying research on KE for ADE monitoring, prevention and reporting. Finally, the assessed use of the various data sources for DS special topics demonstrated extensive use of dominant data sources for DS surveillance, i.e., Spontaneous Reporting Systems, but also increasing interest in the use of emerging data sources, e.g., observational healthcare databases, biochemical/genetic databases, and social media. 
Various exemplar applications were identified with promising results, e.g., improvement in Adverse Drug Reaction (ADR) prediction, detection of drug interactions, and novel ADE profiles related with specific mechanisms of action, etc. Nevertheless, since the reviewed studies mostly concerned proof-of-concept implementations, more intense research is required to increase the maturity level that is necessary for KE approaches to reach routine DS practice. In conclusion, we argue that efficiently addressing DS data analytics and management challenges requires the introduction of high-throughput KE-based methods for effective knowledge discovery and management, resulting ultimately, in the establishment of a continuous learning DS system.
