Modeling data and analysis scripts generated during the current study are available in the GitHub repository: https://github.com/USEPA/CompTox-MIEML. RefChemDB is available for download as supplemental material from its original publication (PMID: 30570668). LINCS gene expression data are publicly available through the Gene Expression Omnibus (GSE92742 and GSE70138) at https://www.ncbi.nlm.nih.gov/geo/. This dataset is associated with the following publication: Bundy, J., R. Judson, A. Williams, C. Grulke, I. Shah, and L. Everett. Predicting Molecular Initiating Events Using Chemical Target Annotations and Gene Expression. BioData Mining. BioMed Central Ltd, London, UK, 15: 7 (2022).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
RCV1
ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
Machine learning is a subfield of artificial intelligence (AI) that focuses on enabling computers to learn and make predictions or decisions without being explicitly programmed. Here are some key concepts and terms to help you get started:
Supervised Learning: In supervised learning, the machine learning algorithm learns from labeled training data. The training data consists of input examples and their corresponding correct output or target values. The algorithm learns to generalize from this data and make predictions or classify new, unseen examples.
Unsupervised Learning: Unsupervised learning involves learning patterns and relationships from unlabeled data. Unlike supervised learning, there are no target values provided. Instead, the algorithm aims to discover inherent structures or clusters in the data.
Training Data and Test Data: Machine learning models require a dataset to learn from. The dataset is typically split into two parts: the training data and the test data. The model learns from the training data, and the test data is used to evaluate its performance and generalization ability.
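As an illustration, a shuffled hold-out split can be sketched in a few lines of pure Python (the function name is ours; in practice a library routine such as scikit-learn's `train_test_split` is the usual choice):

```python
import random

def shuffled_split(examples, test_fraction=0.2, seed=42):
    """Shuffle the dataset, then hold out the last test_fraction as test data."""
    items = list(examples)
    random.Random(seed).shuffle(items)           # fixed seed for reproducibility
    n_test = max(1, int(len(items) * test_fraction))
    return items[:-n_test], items[-n_test:]      # (train, test)

train, test = shuffled_split(range(100), test_fraction=0.2)
```

Shuffling before splitting matters: if the data are ordered (e.g. by class), a naive tail split would give a test set that does not represent the training distribution.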
Features and Labels: In supervised learning, the input examples are often represented by features or attributes. For example, in a spam email classification task, features might include the presence of certain keywords or the length of the email. The corresponding output or target values are called labels, indicating the class or category to which the example belongs (e.g., spam or not spam).
Model Evaluation Metrics: To assess the performance of a machine learning model, various evaluation metrics are used. Common metrics include accuracy (the proportion of correctly predicted examples), precision (the proportion of true positives among all positive predictions), recall (the proportion of actual positives that are correctly identified), and F1 score (the harmonic mean of precision and recall).
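These four metrics follow directly from the counts of true/false positives and negatives; a minimal pure-Python sketch for a binary task (the helper name is ours):

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary classification task."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1
```

For example, with three true positives in the data and predictions that catch two of them while raising one false alarm, precision and recall both come out to 2/3.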
Overfitting and Underfitting: Overfitting occurs when a model becomes too complex and learns to memorize the training data instead of generalizing well to unseen examples. On the other hand, underfitting happens when a model is too simple and fails to capture the underlying patterns in the data. Balancing the complexity of the model is crucial to achieve good generalization.
Feature Engineering: Feature engineering involves selecting or creating relevant features that can help improve the performance of a machine learning model. It often requires domain knowledge and creativity to transform raw data into a suitable representation that captures the important information.
Bias and Variance Trade-off: The bias-variance trade-off is a fundamental concept in machine learning. Bias refers to the errors introduced by the model's assumptions and simplifications, while variance refers to the model's sensitivity to small fluctuations in the training data. Reducing bias may increase variance and vice versa. Finding the right balance is important for building a well-performing model.
Supervised Learning Algorithms: There are various supervised learning algorithms, including linear regression, logistic regression, decision trees, random forests, support vector machines (SVM), and neural networks. Each algorithm has its own strengths, weaknesses, and specific use cases.
Unsupervised Learning Algorithms: Unsupervised learning algorithms include clustering algorithms like k-means clustering and hierarchical clustering, dimensionality reduction techniques like principal component analysis (PCA) and t-SNE, and anomaly detection algorithms, among others.
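As a toy illustration of clustering, here is a minimal one-dimensional k-means (the function and data are hypothetical; real work would use a library implementation operating on multi-dimensional features):

```python
import random

def kmeans_1d(values, k=2, iters=20, seed=0):
    """Toy k-means for one-dimensional data; returns the sorted cluster centres."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)  # initialise centres from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:  # assignment step: each point goes to its nearest centre
            clusters[min(range(k), key=lambda i: abs(v - centers[i]))].append(v)
        # update step: move each centre to the mean of its assigned points
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return sorted(centers)

centers = kmeans_1d([1.0, 1.1, 0.9, 10.0, 10.2, 9.8])  # two obvious groups
```

The alternation of assignment and update steps is the core of k-means; on this toy data the centres settle at the means of the two groups.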
These concepts provide a starting point for understanding the basics of machine learning. As you delve deeper, you can explore more advanced topics such as deep learning, reinforcement learning, and natural language processing. Remember to practice hands-on with real-world datasets to gain practical experience and further refine your skills.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SANAD Dataset is a large collection of Arabic news articles that can be used in different Arabic NLP tasks such as Text Classification and Word Embedding. The articles were collected using Python scripts written specifically for three popular news websites: AlKhaleej, AlArabiya and Akhbarona.
All datasets have seven categories [Culture, Finance, Medical, Politics, Religion, Sports and Tech], except AlArabiya which doesn't have [Religion]. SANAD contains a total number of 190k+ articles.
SANAD_SUBSET is a balanced benchmark dataset (from SANAD) that is used in our research work. It contains the training (90%) and testing (10%) sets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The multi-label code-smell dataset for studies related to multi-label classification
The open dataset, software, and other files accompanying the manuscript "An Open Combinatorial Diffraction Dataset Including Consensus Human and Machine Learning Labels with Quantified Uncertainty for Training New Machine Learning Models," submitted for publication to Integrated Materials and Manufacturing Innovations. Machine learning and autonomy are increasingly prevalent in materials science, but existing models are often trained or tuned using idealized data as absolute ground truths. In actual materials science, "ground truth" is often a matter of interpretation and is more readily determined by consensus. Here we present the data, software, and other files for a study using as-obtained diffraction data as a test case for evaluating the performance of machine learning models in the presence of differing expert opinions. We demonstrate that experts with similar backgrounds can disagree greatly even for something as intuitive as using diffraction to identify the start and end of a phase transformation. We then use a logarithmic likelihood method to evaluate the performance of machine learning models in relation to the consensus expert labels and their variance. We further illustrate this method's efficacy in ranking a number of state-of-the-art phase mapping algorithms. We propose a materials data challenge centered around the problem of evaluating models based on consensus with uncertainty. The data, labels, and code used in this study are all available online at data.gov, and the interested reader is encouraged to replicate and improve the existing models or to propose alternative methods for evaluating algorithmic performance.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Color Label is a dataset for object detection tasks - it contains Color Bottle annotations for 2,258 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Multi-Label Web Page Classification Dataset
Dataset Description
The Multi-Label Web Page Classification Dataset is a curated dataset containing web page titles and snippets, extracted from the CC-Meta25-1M dataset. Each entry has been automatically categorized into multiple predefined categories using ChatGPT-4o-mini. This dataset is designed for multi-label text classification tasks, making it ideal for training and evaluating machine learning models in web content… See the full description on the dataset page: https://huggingface.co/datasets/tshasan/multi-label-web-categorization.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.
With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.
We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.
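A minimal sketch of how such an index file could be read in order to replicate a sample (the helper name is ours; the layout of one sample per CSV row follows the description above):

```python
import csv
from io import StringIO

def load_sample_indices(fileobj):
    """Each CSV row lists the data-item indices that make up one evaluation sample."""
    return [[int(i) for i in row] for row in csv.reader(fileobj)]

# hypothetical usage: samples = load_sample_indices(open("app_val_indices.csv"))
demo = load_sample_indices(StringIO("0,5,7\n1,2,9\n"))  # two samples of three items
```

Each returned list can then be used to draw exactly the specified data items from the extracted CSV files.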
Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
Usage
You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.
Preliminaries: You need a working Julia installation. We used Julia v1.6.5 in our experiments.
Data Extraction: In your terminal, you can call either
make
(recommended), or
julia --project="." --eval "using Pkg; Pkg.instantiate()"
julia --project="." extract-oq.jl
Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.
Further Reading
Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
These images and associated binary labels were collected from collaborators across multiple universities to serve as a diverse representation of biomedical images of vessel structures, for use in the training and validation of machine learning tools for vessel segmentation. The dataset contains images from a variety of imaging modalities, at different resolutions, using different sources of contrast and featuring different organs/pathologies. These data were used to train, test and validate a foundational model for 3D vessel segmentation, tUbeNet, which can be found on GitHub. The paper describing the training and validation of the model can be found here. Filenames are structured as follows: Data - [Modality]_[species Organ]_[resolution].tif; Labels - [Modality]_[species Organ]_[resolution]_labels.tif; Sub-volumes of larger datasets - [Modality]_[species Organ]_subvolume[dimensions in pixels].tif. Manual labelling of blood vessels was carried out using Amira (2020.2, Thermo-Fisher, UK). Training data: opticalHREM_murineLiver_2.26x2.26x1.75um.tif: a high resolution episcopic microscopy (HREM) dataset, acquired in house by staining a healthy mouse liver with Eosin B and imaging using a standard HREM protocol. NB: 25% of this image volume was withheld from training, for use as test data. CT_murineTumour_20x20x20um.tif: X-ray microCT images of a microvascular cast, taken from a subcutaneous mouse model of colorectal cancer (acquired in house). NB: 25% of this image volume was withheld from training, for use as test data. RSOM_murineTumour_20x20um.tif: Raster-Scanning Optoacoustic Mesoscopy (RSOM) data from a subcutaneous tumour model (provided by Emma Brown, Bohndiek Group, University of Cambridge). The image data have undergone filtering to reduce the background (Brown et al., 2019). OCTA_humanRetina_24x24um.tif: retinal angiography data obtained using Optical Coherence Tomography Angiography (OCT-A) (provided by Dr Ranjan Rajendram, Moorfields Eye Hospital).
Test data: MRI_porcineLiver_0.9x0.9x5mm.tif: T1-weighted Balanced Turbo Field Echo Magnetic Resonance Imaging (MRI) data from a machine-perfused porcine liver, acquired in house. MFHREM_murineTumourLectin_2.76x2.76x2.61um.tif: a subcutaneous colorectal tumour mouse model imaged in house using Multi-fluorescence HREM (MF-HREM), with Dylight 647 conjugated lectin staining the vasculature (Walsh et al., 2021). The image data have been processed using an asymmetric deconvolution algorithm described by Walsh et al., 2020. NB: a sub-volume of 480x480x640 voxels was manually labelled (MFHREM_murineTumourLectin_subvolume480x480x640.tif). MFHREM_murineBrainLectin_0.85x0.85x0.86um.tif: an MF-HREM image of the cortex of a mouse brain, stained with Dylight-647 conjugated lectin, acquired in house (Walsh et al., 2021). The image data have been downsampled and processed using an asymmetric deconvolution algorithm described by Walsh et al., 2020. NB: a sub-volume of 1000x1000x99 voxels was manually labelled. This sub-volume is provided at full resolution and without preprocessing (MFHREM_murineBrainLectin_subvol_0.57x0.57x0.86um.tif). 2Photon_murineOlfactoryBulbLectin_0.2x0.46x5.2um.tif: two-photon data of mouse olfactory bulb blood vessels, labelled with sulforhodamine 101, kindly provided by Yuxin Zhang at the Sensory Circuits and Neurotechnology Lab, the Francis Crick Institute (Bosch et al., 2022). NB: a sub-volume of 500x500x79 voxels was manually labelled (2Photon_murineOlfactoryBulbLectin_subvolume500x500x79.tif).
References: Bosch, C., Ackels, T., Pacureanu, A., Zhang, Y., Peddie, C. J., Berning, M., Rzepka, N., Zdora, M. C., Whiteley, I., Storm, M., Bonnin, A., Rau, C., Margrie, T., Collinson, L., & Schaefer, A. T. (2022). Functional and multiscale 3D structural investigation of brain tissue through correlative in vivo physiology, synchrotron microtomography and volume electron microscopy. Nature Communications, 13(1), 1-16. https://doi.org/10.1038/s41467-022-30199-6 Brown, E., Brunker, J., & Bohndiek, S. E. (2019). Photoacoustic imaging as a tool to probe the tumour microenvironment. DMM Disease Models and Mechanisms, 12(7). https://doi.org/10.1242/DMM.039636 Walsh, C., Holroyd, N. A., Finnerty, E., Ryan, S. G., Sweeney, P. W., Shipley, R. J., & Walker-Samuel, S. (2021). Multifluorescence High-Resolution Episcopic Microscopy for 3D Imaging of Adult Murine Organs. Advanced Photonics Research, 2(10), 2100110. https://doi.org/10.1002/ADPR.202100110 Walsh, C., Holroyd, N., Shipley, R., & Walker-Samuel, S. (2020). Asymmetric Point Spread Function Estimation and Deconvolution for Serial-Sectioning Block-Face Imaging. Communications in Computer and Information Science, 1248 CCIS, 235-249. https://doi.org/10.1007/978-3-030-52791-4_19
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The MAAD dataset is a comprehensive collection of Arabic news articles that may be employed across a diverse array of Arabic Natural Language Processing (NLP) applications, including but not limited to classification, text generation, summarization, and various other tasks. The dataset was assembled using specifically designed Python scripts that targeted six prominent news platforms: Al Jazeera, BBC Arabic, Youm7, Russia Today, and Al Ummah, in conjunction with regional and local media outlets, ultimately resulting in a total of 602,792 articles. The dataset has a total word count of 29,371,439, with 296,518 unique words; the average word length is 6.36 characters, while the mean article length is 736.09 characters. This extensive dataset is categorized into ten distinct classifications: Political, Economic, Cultural, Arts, Sports, Health, Technology, Community, Incidents, and Local. The data fields are of five distinct types: Title, Article, Summary, Category, and Published_Date. The MAAD dataset is structured into six files, each named after the corresponding news outlet from which the data was sourced; within each directory, text files are provided, one per category, formatted in txt to accommodate all news articles. This dataset serves as an expansive standard resource designed for utilization within the context of our research endeavors.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Last update: February 2021. The dataset folder includes both raw and post-processed radar data used for training and testing the networks proposed in Sect. VIII of the article "Machine Learning and Deep Learning Techniques for Colocated MIMO Radars: A Tutorial Overview".

The folder Human Activity Classification contains:
- "Raw" folder: 150 files acquired with our FMCW radar sensor, provided inside the "doppler_dataset" zip folder; they are divided into 50 for walking, 50 for jumping and 50 for running.
- "Post_process" folder, divided into:
  - "Machine Learning" folder including "dataset_ML_doppler_real_activities.mat"; this dataset has been used for training and testing the SVM, K-NN and Adaboost classifiers described in Sect. VIII-A). It contains the 150x4 matrix "X_meas" with the features described by eqs. (227)-(234) and the 150x1 vector of chars "labels_py" containing the associated labels.
  - "Deep Learning" folder containing "dataset_DL_doppler_real_activities.mat"; this dataset is composed of 150 structs, each associated with a specific activity and including:
    - the label associated with the considered activity;
    - the overall range variation from the beginning to the end of the motion ("delta_R");
    - the Range-Doppler map ("RD_map");
    - the normalized spectrogram ("SP_Norm");
    - the Cadence Velocity Diagram ("CVD");
    - the period of the spectrogram ("per");
    - the peaks associated with the three greatest cadence frequencies ("peaks_cad");
    - the three strongest cadence frequencies and their normalized version ("cad_freqs" and "cad_freqs_norm");
    - the strongest cadence frequency ("c1");
    - the three velocity profiles associated with the three strongest cadence frequencies ("matr_vex").

The spectrogram images (SP_Norm) contained in this dataset were used for training and testing the CNN in Sect. VIII-A).

The folder Obstacle Detection contains:
- "Raw" folder: raw data acquired with our radar system and TOF camera in multi-target or single-target scenarios, provided inside the "obst_detect_Raw_mat" zip folder. It's important to note that each radar frame and each TOF camera image has its own time stamp, but since they come from different sensors they have to be synchronized.
- "Post_process" folder, divided into:
  - "Neural Net" folder containing "inputs_bis_1.mat" and "t_d_1.mat", where "inputs_bis_1.mat" contains the 32x1 feature vectors used for training and testing the feed-forward neural network described in Sect. VIII-B) (see eqs. (243)-(251)), and "t_d_1.mat" contains the associated 2x1 label vectors (see eq. (235)).
  - "Yolo v2" folder containing the folder "Dataset_YOLO" and the table "obj_dataset_tab", where "Dataset_YOLO_v2" contains (inside the sub-folder "obj_dataset") the Range-Azimuth maps used for training the YOLO v2 network (see eqs. (257)-(258) and Fig. 30), and "obj_dataset_tab" contains the path, the bounding box and the label associated with the Range-Azimuth maps (see eqs. (256)-(266)).

Cite as: A. Davoli, G. Guerzoni and G. M. Vitetta, "Machine Learning and Deep Learning Techniques for Colocated MIMO Radars: A Tutorial Overview," IEEE Access, vol. 9, pp. 33704-33755, 2021, doi: 10.1109/ACCESS.2021.3061424.
Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of "A Plan for the North American Bat Monitoring Program" (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in "A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program" (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class).
Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This collection consists of six multi-label datasets from the UCI Machine Learning Repository.
Each dataset contains missing values which have been artificially added at the following rates: 5, 10, 15, 20, 25, and 30%. The "amputation" was performed using the "Missing Completely at Random" mechanism.
File names are represented as follows:
amp_DB_MR.arff
where:
DB = original dataset;
MR = missing rate.
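A small helper along these lines could recover the original dataset name and missing rate from a filename (the helper and the example dataset name "emotions" are ours, for illustration):

```python
import re

def parse_amputed_name(filename):
    """Recover (original dataset, missing rate in %) from an amp_DB_MR.arff name."""
    m = re.match(r"amp_(?P<db>.+)_(?P<mr>\d+)\.arff$", filename)
    if m is None:
        raise ValueError(f"unexpected filename: {filename!r}")
    return m.group("db"), int(m.group("mr"))
```

The greedy `.+` lets dataset names that themselves contain underscores still parse, since the missing rate is anchored as the final digit group.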
For more details, please read:
IEEE Access article (in review process)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MLRSNet provides different perspectives of the world captured from satellites. That is, it is composed of high spatial resolution optical satellite images. MLRSNet contains 109,161 remote sensing images that are annotated into 46 categories, and the number of sample images in a category varies from 1,500 to 3,000. The images have a fixed size of 256×256 pixels with various pixel resolutions (~10m to 0.1m). Moreover, each image in the dataset is tagged with several of 60 predefined class labels, and the number of labels associated with each image varies from 1 to 13. The dataset can be used for multi-label based image classification, multi-label based image retrieval, and image segmentation.
The Dataset includes: 1. Images folder: 46 categories, 109,161 high-spatial resolution remote sensing images. 2. Labels folder: each category has a .csv file. 3. Categories_names.xlsx: Sheet1 lists the names of the 46 categories, and Sheet2 shows the multi-labels associated with each category.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The RTAnews dataset is a collection of multi-label Arabic texts, collected from the Russia Today in Arabic news portal. It consists of 23,837 texts (news articles) distributed over 40 categories, divided into 15,001 texts for training and 8,836 texts for testing.
The original dataset (without preprocessing), a preprocessed version, versions in MEKA and Mulan formats, a single-label version, and a WEAK version are all available.
For any enquiry or support regarding the dataset, please feel free to contact us via bassalemi at gmail dot com
The ArXiv CS Papers Multi-Label Classification dataset is a comprehensive collection of research papers from the computer science domain. This dataset is intended for multi-label classification tasks and contains a diverse range of research papers spanning various topics within computer science.
The dataset consists of approximately 200,000+ research papers and includes the following columns:
- Paper ID: a unique identifier for each research paper in the dataset.
- Title: the title of the research paper.
- Abstract: a brief summary or abstract of the research paper.
- Year: the publication year of the research paper.
- Primary Category: the primary category of the research paper, representing the main topic or area of focus.
- Categories: additional categories or subtopics associated with the research paper.

This dataset is well-suited for tasks related to text classification, topic modeling, information retrieval, and other natural language processing (NLP) tasks. Researchers and practitioners can leverage this dataset to develop and evaluate machine learning models for multi-label classification on a wide range of computer science topics.
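For multi-label classification, the Categories column would typically be converted into a multi-hot label matrix; a minimal pure-Python sketch (the helper name and the category values shown are illustrative):

```python
def multi_hot(categories_per_paper):
    """Turn per-paper category lists into a sorted vocabulary and a 0/1 label matrix."""
    vocab = sorted({c for cats in categories_per_paper for c in cats})
    index = {c: i for i, c in enumerate(vocab)}
    rows = []
    for cats in categories_per_paper:
        row = [0] * len(vocab)
        for c in cats:
            row[index[c]] = 1  # mark every category the paper belongs to
        rows.append(row)
    return vocab, rows
```

Each row then serves as the multi-label target vector for one paper, with one column per known category.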
Note: Please refer to the original ArXiv repository for access to the full-text content of the papers and proper citation guidelines. This dataset contains metadata and should be used for research and educational purposes only.
We hope that the ArXiv CS Papers Multi-Label Classification dataset serves as a valuable resource for researchers, data scientists, and machine learning enthusiasts in their quest to advance knowledge and understanding in the field of computer science.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The BUTTER Empirical Deep Learning Dataset represents an empirical study of the deep learning phenomena on dense fully connected networks, scanning across thirteen datasets, eight network shapes, fourteen depths, twenty-three network sizes (number of trainable parameters), four learning rates, six minibatch sizes, four levels of label noise, and fourteen levels of L1 and L2 regularization each. Multiple repetitions (typically 30, sometimes 10) of each combination of hyperparameters were performed, and statistics including training and test loss (using an 80% / 20% shuffled train-test split) are recorded at the end of each training epoch. In total, this dataset covers 178 thousand distinct hyperparameter settings ("experiments"), 3.55 million individual training runs (an average of 20 repetitions of each experiment), and a total of 13.3 billion training epochs (three thousand epochs were covered by most runs). Accumulating this dataset consumed 5,448.4 CPU core-years, 17.8 GPU-years, and 111.2 node-years.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This comprehensive pharmaceutical synthetic dataset contains 1,393 records of synthetic drug information with 15 columns, designed for data science projects focusing on healthcare analytics, drug safety analysis, and pharmaceutical research. The dataset simulates real-world pharmaceutical data with appropriate variety and realistic constraints for machine learning applications.
| Attribute | Value |
|---|---|
| Total Records | 1,393 |
| Total Columns | 15 |
| File Format | CSV |
| Data Types | Mixed (intentional for data cleaning practice) |
| Domain | Pharmaceutical/Healthcare |
| Use Case | ML Training, Data Analysis, Healthcare Research |
| Column Name | Data Type | Unique Values | Description | Example Values |
|---|---|---|---|---|
| drug_name | Object | 1,283 unique | Pharmaceutical drug names with realistic naming patterns | "Loxozepam32", "Amoxparin43", "Virazepam10" |
| manufacturer | Object | 10 unique | Major pharmaceutical companies | Pfizer Inc., AstraZeneca, Johnson & Johnson |
| drug_class | Object | 10 unique | Therapeutic drug classifications | Antibiotic, Analgesic, Antidepressant, Vaccine |
| indications | Object | 10 unique | Medical conditions the drug treats | "Pain relief", "Bacterial infections", "Depression treatment" |
| side_effects | Object | 434 unique | Combination of side effects (1-3 per drug) | "Nausea, Dizziness", "Headache, Fatigue, Rash" |
| administration_route | Object | 7 unique | Method of drug delivery | Oral, Intravenous, Topical, Inhalation, Sublingual |
| contraindications | Object | 10 unique | Medical warnings for drug usage | "Pregnancy", "Heart disease", "Liver disease" |
| warnings | Object | 10 unique | Safety instructions and precautions | "Take with food", "Avoid alcohol", "Monitor blood pressure" |
| batch_number | Object | 1,393 unique | Manufacturing batch identifiers | "xr691zv", "Ye266vU", "Rm082yX" |
| expiry_date | Object | 782 unique | Drug expiration dates (YYYY-MM-DD) | "2025-12-13", "2027-03-09", "2026-10-06" |
| side_effect_severity | Object | 3 unique | Severity classification | Mild, Moderate, Severe |
| approval_status | Object | 3 unique | Regulatory approval status | Approved, Pending, Rejected |
| Column Name | Data Type | Range | Mean | Std Dev | Description |
|---|---|---|---|---|---|
approval_year | Float/String* | 1990-2024 | 2006.7 | 10.0 | FDA/regulatory approval year |
dosage_mg | Float/String* | 10-990 mg | 499.7 | 290.0 | Medication strength in milligrams |
price_usd | Float/String* | $2.32-$499.24 | $251.12 | $144.81 | Drug price in US dollars |
*Intentionally stored as mixed types for data cleaning practice
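Since these columns are intentionally stored as mixed types, a first cleaning step is to coerce them to numbers. A pure-Python sketch of the idea (pandas users would typically reach for `pd.to_numeric(..., errors="coerce")` instead):

```python
def coerce_numeric(values):
    """Convert a mixed str/float column to floats; unparseable entries become None."""
    out = []
    for v in values:
        try:
            out.append(float(v))
        except (TypeError, ValueError):
            out.append(None)  # flag entries that need manual inspection
    return out
```

Entries mapped to None can then be counted, inspected, and imputed or dropped as the analysis requires.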
| Manufacturer | Count | Percentage |
|---|---|---|
| Pfizer Inc. | 170 | 12.2% |
| AstraZeneca | ~140 | ~10.0% |
| Merck & Co. | ~140 | ~10.0% |
| Johnson & Johnson | ~140 | ~10.0% |
| GlaxoSmithKline | ~140 | ~10.0% |
| Others | ~623 | ~44.8% |
| Drug Class | Count | Most Common |
|---|---|---|
| Anti-inflammatory | 154 | ✓ |
| Antibiotic | ~140 | |
| Antidepressant | ~140 | |
| Antiviral | ~140 | |
| Vaccine | ~140 | |
| Others | ~679 | |
| Severity | Count | Percentage |
|---|---|---|
| Severe | 488 | 35.0% |
| Moderate | ~453 | ~32.5% |
| Mild | ~452 | ~32.5% |
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset used in this work comprises data from four participants, two men and two women. Each of them carried the wearable device Empatica E4 for a total of 15 days. They wore the device during the day, and during the nights we asked participants to charge it and load the data into an external memory unit. During these days, participants were asked to answer EMA questionnaires, which are used to label our data. However, some participants could not complete the full experiment, and some days were discarded due to data corruption. Specific demographic information, total sampling days and total number of EMA answers can be found in Table I.
| | Participant 1 | Participant 2 | Participant 3 | Participant 4 |
|---|---|---|---|---|
| Age | 67 | 55 | 60 | 63 |
| Gender | Male | Female | Male | Female |
| Final Valid Days | 9 | 15 | 12 | 13 |
| Total EMAs | 42 | 57 | 64 | 46 |
Table I. Summary of participants' collected data.
This dataset provides three different types of labels. Activeness and happiness are two of them: the answers to the EMA questionnaires that participants reported during their daily activities, as numbers between 0 and 4.
These labels are used to interpolate the mental well-being state according to [1]. We report in our dataset a total of eight emotional states: (1) pleasure, (2) excitement, (3) arousal, (4) distress, (5) misery, (6) depression, (7) sleepiness, and (8) contentment.
The data we provide in this repository consist of two types of files:
NOTE: Files are numbered according to each specific sampling day. For example, ACC1.csv corresponds to the signal ACC for sampling day 1. The same applies to the Excel files.
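A small helper could split such a filename back into signal name and sampling day (the helper is ours; the naming pattern follows the note above):

```python
import re

def parse_day_file(name):
    """Split a per-day filename such as 'ACC1.csv' into (signal, sampling day)."""
    m = re.match(r"([A-Za-z_]+)(\d+)\.csv$", name)
    if m is None:
        raise ValueError(f"unexpected filename: {name!r}")
    return m.group(1), int(m.group(2))
```

This makes it straightforward to group the per-day CSV files by signal or by day when iterating over a participant's folder.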
Code and a tutorial on how to label the data and extract features can be found in this repository: https://github.com/edugm94/temporal-feat-emotion-prediction
References:
[1] J. A. Russell, "A circumplex model of affect," Journal of Personality and Social Psychology, vol. 39, no. 6, p. 1161, 1980.