100+ datasets found
  1. Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...

    • zenodo.org
    csv
    Updated Sep 15, 2023
    Cite
    Anonymous authors; Anonymous authors (2023). Code4ML: a Large-scale Dataset of annotated Machine Learning Code [Dataset]. http://doi.org/10.5281/zenodo.6607065
    Explore at:
    csv
    Dataset updated
    Sep 15, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous authors; Anonymous authors
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle.

    The data is organized in a table structure. Code4ML includes several main objects: competition information, raw code blocks collected from Kaggle, and manually marked-up snippets. Each table is stored as a .csv file.

    Each competition (competitions.csv) has a text description and metadata reflecting the characteristics of the competition and the dataset used, as well as the evaluation metrics. The corresponding datasets can be loaded using the Kaggle API and the listed data sources.

    The code blocks and their metadata are collected into data frames according to the publishing year of the initial kernels. The current version of the corpus includes two code-block files: snippets from kernels up to 2020 (code_blocks_upto_20.csv) and those from 2021 (code_blocks_21.csv), with corresponding metadata. The corpus consists of 2,743,615 ML code blocks collected from 107,524 Jupyter notebooks.

    Marked-up code blocks have the following metadata: an anonymized id, the format of the data used (for example, table or audio), the id of the semantic type, a flag for code errors, the estimated relevance to the semantic class (from 1 to 5), the id of the parent notebook, and the name of the competition. The current version of the corpus has ~12,000 labeled snippets (markup_data_20220415.csv).

    As the marked-up code block data contains the numeric id of each block's semantic type, we also provide a mapping from this number to the semantic type and subclass (actual_graph_2022-06-01.csv).

    The dataset can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.
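    As a hedged sketch of how these files fit together (a minimal example; the join-key column names are assumptions, since the exact CSV schemas are not listed here — inspect the headers first):

    import pandas as pd

    # Load the labeled snippets and the semantic-type mapping described above.
    markup = pd.read_csv("markup_data_20220415.csv")
    semantic_types = pd.read_csv("actual_graph_2022-06-01.csv")

    # Assumed join key: the numeric semantic-type id present in both files.
    labeled = markup.merge(semantic_types, left_on="semantic_type_id",
                           right_on="id", how="left")
    print(labeled.head())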

  2. Musical Scale Classification Dataset using Chroma

    • kaggle.com
    zip
    Updated Apr 8, 2025
    Cite
    Om Avashia (2025). Musical Scale Classification Dataset using Chroma [Dataset]. https://www.kaggle.com/datasets/omavashia/synthetic-scale-chromagraph-tensor-dataset
    Explore at:
    zip (392580911 bytes)
    Dataset updated
    Apr 8, 2025
    Authors
    Om Avashia
    License

    https://cdla.io/sharing-1-0/

    Description

    Dataset Description

    Musical Scale Dataset: 1900+ Chroma Tensors Labeled by Scale

    This dataset contains 1900+ unique synthetic musical audio samples generated from melodies in each of the 24 Western scales (12 major and 12 minor). Each sample has been converted into a chroma tensor, a 12-dimensional pitch class representation commonly used in music information retrieval (MIR) and deep learning tasks.

    What’s Inside

    • chroma_tensor: A JSON-safe serialization of a PyTorch tensor with shape [1, 12, T], where:
      • 12 = the 12 pitch classes (C, C#, D, ... B)
      • T = time steps
    • scale_index: An integer label from 0–23 identifying the scale the sample belongs to

    Use Cases

    This dataset is ideal for:

    • Training deep learning models (CNNs, MLPs) to classify musical scales
    • Exploring pitch-class distributions in Western tonal music
    • Prototyping models for music key detection, chord prediction, or tonal analysis
    • Teaching or demonstrating chromagram-based ML workflows

    Labels

    Index | Scale
    0     | C major
    1     | C# major
    ...   | ...
    11    | B major
    12    | C minor
    ...   | ...
    23    | B minor

    Quick Load Example (PyTorch)

    Chroma tensors are of shape [1, 12, T], where:

    • 1 is the channel dimension (for CNN input)
    • 12 represents the 12 pitch classes (C through B)
    • T is the number of time frames

    import torch
    import pandas as pd
    from tqdm import tqdm
    
    df = pd.read_csv("/content/scale_dataset.csv")
    
    # Reconstruct chroma tensors
    X = [torch.tensor(eval(row)).reshape(1, 12, -1) for row in tqdm(df['chroma_tensor'])]
    y = df['scale_index'].tolist()
    

    Alternatively, you could directly load the chroma tensors and target scale indices using the .pt file.

    import torch
    import pandas as pd
    
    data = torch.load("chroma_tensors.pt")
    X_pt = data['X'] # list of [1, 12, 302] tensors
    y_pt = data['y'] # list of scale indices
    

    How It Was Built

    • Notes generated from random melodies using music21
    • MIDI converted to WAV via FluidSynth
    • Chromagrams extracted with librosa.feature.chroma_stft (see the sketch after this list)
    • Tensors flattened and saved alongside scale index labels
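
    The extraction step can be reproduced roughly as follows (a minimal sketch; melody.wav is a placeholder file, and the exact STFT parameters used to build the dataset are not stated here):

    import librosa
    import torch

    # Load a WAV rendered from a generated melody (placeholder filename).
    y, sr = librosa.load("melody.wav", sr=None)

    # 12 x T chromagram, as in the pipeline above.
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)

    # Add a channel dimension to match the dataset's [1, 12, T] tensors.
    chroma_tensor = torch.tensor(chroma, dtype=torch.float32).unsqueeze(0)
    print(chroma_tensor.shape)  # torch.Size([1, 12, T])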

    File Format

    Column        | Type | Description
    chroma_tensor | str  | Flattened 1D chroma tensor [1×12×T]
    scale_index   | int  | Label from 0 to 23

    Notes

    • Data is synthetic but musically valid and well-balanced
    • Each of the 24 scales appears 300 times
    • All tensors have fixed length (T) for easy batching (see the sketch below)
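
    Since every tensor shares the same T, batching is a straight stack; a minimal sketch reusing X and y from the quick-load example above (the batch size is an arbitrary choice):

    import torch
    from torch.utils.data import TensorDataset, DataLoader

    # X: list of [1, 12, T] tensors; y: list of integer scale indices.
    dataset = TensorDataset(torch.stack(X), torch.tensor(y))
    loader = DataLoader(dataset, batch_size=32, shuffle=True)

    for chroma_batch, scale_batch in loader:
        print(chroma_batch.shape)  # [32, 1, 12, T]
        break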
  3. Rescaled Fashion-MNIST dataset

    • zenodo.org
    Updated Jun 27, 2025
    Cite
    Andrzej Perzanowski; Andrzej Perzanowski; Tony Lindeberg; Tony Lindeberg (2025). Rescaled Fashion-MNIST dataset [Dataset]. http://doi.org/10.5281/zenodo.15187793
    Explore at:
    Dataset updated
    Jun 27, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrzej Perzanowski; Andrzej Perzanowski; Tony Lindeberg; Tony Lindeberg
    Time period covered
    Apr 10, 2025
    Description

    Motivation

    The goal of introducing the Rescaled Fashion-MNIST dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.

    The Rescaled Fashion-MNIST dataset was introduced in the paper:

    [1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, Journal of Mathematical Imaging and Vision, 67(29), https://doi.org/10.1007/s10851-025-01245-x.

    with a pre-print available at arXiv:

    [2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, arXiv preprint arXiv:2409.11140.

    Importantly, the Rescaled Fashion-MNIST dataset is more challenging than the MNIST Large Scale dataset, introduced in:

    [3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2.

    Access and rights

    The Rescaled Fashion-MNIST dataset is provided on the condition that you provide proper citation for the original Fashion-MNIST dataset:

    [4] Xiao, H., Rasul, K., and Vollgraf, R. (2017) “Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms”, arXiv preprint arXiv:1708.07747

    and also for this new rescaled version, using the reference [1] above.

    The data set is made available on request. If you would be interested in trying out this data set, please make a request in the system below, and we will grant you access as soon as possible.

    The dataset

    The Rescaled Fashion-MNIST dataset is generated by rescaling 28×28 grey-scale images of clothes from the original Fashion-MNIST dataset [4]. The scale variations are up to a factor of 4, and the images are embedded within black images of size 72×72, with the object in the frame always centred. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].

    There are 10 different classes in the dataset: “T-shirt/top”, “trouser”, “pullover”, “dress”, “coat”, “sandal”, “shirt”, “sneaker”, “bag” and “ankle boot”. In the dataset, these are represented by integer labels in the range [0, 9].

    The dataset is split into 50 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 50 000 samples from the original Fashion-MNIST training set. The validation dataset, on the other hand, is formed from the final 10 000 images of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original Fashion-MNIST test set.

    The h5 files containing the dataset

    The training dataset file (~2.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:

    fashionmnist_with_scale_variations_tr50000_vl10000_te10000_outsize72-72_scte1p000_scte1p000.h5

    Additionally, for the Rescaled Fashion-MNIST dataset, there are 9 datasets (~415 MB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor, 2^(k/4), with k being an integer in the range [-4, 4] (see the sketch below):

    fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p500.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p595.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p707.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p841.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p000.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p189.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p414.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p682.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte2p000.h5

    These dataset files were used for the experiments presented in Figures 6, 7, 14, 16, 19 and 23 in [1].
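
    As a quick sanity check, the nine scale factors and the scteXpYYY suffixes in the filenames above follow directly from 2^(k/4):

    # Print 2**(k/4) for k in [-4, 4] and the matching filename suffix.
    for k in range(-4, 5):
        s = 2 ** (k / 4)
        suffix = ("scte%.3f" % s).replace(".", "p")
        print(f"k={k:+d}  scale={s:.3f}  {suffix}")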

    Instructions for loading the data set

    The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as
    ('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.

    The training dataset can be loaded in Python as:

    import h5py
    import numpy as np

    with h5py.File('fashionmnist_with_scale_variations_tr50000_vl10000_te10000_outsize72-72_scte1p000_scte1p000.h5', 'r') as f:
        x_train = np.array(f["/x_train"], dtype=np.float32)
        x_val = np.array(f["/x_val"], dtype=np.float32)
        x_test = np.array(f["/x_test"], dtype=np.float32)
        y_train = np.array(f["/y_train"], dtype=np.int32)
        y_val = np.array(f["/y_val"], dtype=np.int32)
        y_test = np.array(f["/y_test"], dtype=np.int32)

    We also need to permute the data, since PyTorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:

    x_train = np.transpose(x_train, (0, 3, 1, 2))
    x_val = np.transpose(x_val, (0, 3, 1, 2))
    x_test = np.transpose(x_test, (0, 3, 1, 2))
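
    After the permutation, the arrays can be wrapped for PyTorch directly; a minimal sketch (the batch size and the [0, 255] → [0, 1] rescaling are our choices, not prescribed by the dataset):

    import torch
    from torch.utils.data import TensorDataset, DataLoader

    # Scale intensities to [0, 1] and wrap the training split as tensors.
    train_ds = TensorDataset(torch.from_numpy(x_train / 255.0).float(),
                             torch.from_numpy(y_train).long())
    train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)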

    The test datasets can be loaded in Python as:

    # Example: the scale-1.414 test file from the list above.
    with h5py.File('fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p414.h5', 'r') as f:
        x_test = np.array(f["/x_test"], dtype=np.float32)
        y_test = np.array(f["/y_test"], dtype=np.int32)

    The test datasets can be loaded in Matlab as:

    x_test = h5read('fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p414.h5', '/x_test');

    The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.

    There is also a closely related Fashion-MNIST with translations dataset, which in addition to scaling variations also comprises spatial translations of the objects.

  4. Cloud Task Scheduling Dataset

    • kaggle.com
    zip
    Updated Oct 16, 2025
    Cite
    Python Developer (2025). Cloud Task Scheduling Dataset [Dataset]. https://www.kaggle.com/datasets/programmer3/cloud-task-scheduling-dataset
    Explore at:
    zip (796496 bytes)
    Dataset updated
    Oct 16, 2025
    Authors
    Python Developer
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The Cloud Task Scheduling Dataset represents large-scale workload management across heterogeneous computing environments, including cloud, fog, and edge systems. It contains realistic data capturing the behavior of distributed tasks and virtual machines under varying computational loads and network conditions.

    The dataset includes over 6000 tasks with parameters such as task length, priority, deadline, memory, bandwidth, execution time, completion time, energy use, and resource utilization metrics. Performance indicators such as makespan, cost, response time, imbalance, storage efficiency, and network path load are also provided.

  5. DataSheet1_Multi_Scale_Tools: A Python Library to Exploit Multi-Scale Whole...

    • frontiersin.figshare.com
    pdf
    Updated Jun 9, 2023
    Cite
    Niccolò Marini; Sebastian Otálora; Damian Podareanu; Mart van Rijthoven; Jeroen van der Laak; Francesco Ciompi; Henning Müller; Manfredo Atzori (2023). DataSheet1_Multi_Scale_Tools: A Python Library to Exploit Multi-Scale Whole Slide Images.PDF [Dataset]. http://doi.org/10.3389/fcomp.2021.684521.s001
    Explore at:
    pdf
    Dataset updated
    Jun 9, 2023
    Dataset provided by
    Frontiers
    Authors
    Niccolò Marini; Sebastian Otálora; Damian Podareanu; Mart van Rijthoven; Jeroen van der Laak; Francesco Ciompi; Henning Müller; Manfredo Atzori
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Algorithms proposed in computational pathology make it possible to automatically analyze digitized tissue samples of histopathological images to help diagnose diseases. Tissue samples are scanned at high resolution and usually saved as images with several magnification levels, namely whole slide images (WSIs). Convolutional neural networks (CNNs) represent the state-of-the-art computer vision methods for the analysis of histopathology images, targeting detection, classification and segmentation. However, the development of CNNs that work with multi-scale images such as WSIs is still an open challenge. The image characteristics and the CNN properties impose architecture designs that are not trivial; therefore, single-scale CNN architectures are still often used. This paper presents Multi_Scale_Tools, a library that aims to facilitate exploiting the multi-scale structure of WSIs. Multi_Scale_Tools currently includes four components: a pre-processing component, a scale detector, a multi-scale CNN for classification and a multi-scale CNN for segmentation of the images. The pre-processing component includes methods to extract patches at several magnification levels. The scale detector identifies the magnification level of images that do not contain this information, such as images from the scientific literature. The multi-scale CNNs are trained by combining features and predictions that originate from different magnification levels. The components were developed using private datasets, including colon and breast cancer tissue samples, and tested on private and public external data sources, such as The Cancer Genome Atlas (TCGA). The results demonstrate the library's effectiveness and applicability: the scale detector accurately predicts multiple levels of image magnification and generalizes well to independent external data, and the multi-scale CNNs outperform the single-magnification CNN for both classification and segmentation tasks. The code is developed in Python and will be made publicly available upon publication. It aims to be easy to use and easy to extend with additional functions.

  6. Rescaled CIFAR-10 dataset

    • zenodo.org
    Updated Jun 27, 2025
    Cite
    Andrzej Perzanowski; Andrzej Perzanowski; Tony Lindeberg; Tony Lindeberg (2025). Rescaled CIFAR-10 dataset [Dataset]. http://doi.org/10.5281/zenodo.15188748
    Explore at:
    Dataset updated
    Jun 27, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrzej Perzanowski; Andrzej Perzanowski; Tony Lindeberg; Tony Lindeberg
    Description

    Motivation

    The goal of introducing the Rescaled CIFAR-10 dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.

    The Rescaled CIFAR-10 dataset was introduced in the paper:

    [1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, Journal of Mathematical Imaging and Vision, 67(29), https://doi.org/10.1007/s10851-025-01245-x.

    with a pre-print available at arXiv:

    [2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, arXiv preprint arXiv:2409.11140.

    Importantly, the Rescaled CIFAR-10 dataset contains substantially more natural textures and patterns than the MNIST Large Scale dataset, introduced in:

    [3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2

    and is therefore significantly more challenging.

    Access and rights

    The Rescaled CIFAR-10 dataset is provided on the condition that you provide proper citation for the original CIFAR-10 dataset:

    [4] Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Tech. rep., University of Toronto.

    and also for this new rescaled version, using the reference [1] above.

    The data set is made available on request. If you would be interested in trying out this data set, please make a request in the system below, and we will grant you access as soon as possible.

    The dataset

    The Rescaled CIFAR-10 dataset is generated by rescaling 32×32 RGB images of animals and vehicles from the original CIFAR-10 dataset [4]. The scale variations are up to a factor of 4. In order for all test images to have the same resolution, mirror extension is used to extend the images to size 64×64. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].

    There are 10 distinct classes in the dataset: “airplane”, “automobile”, “bird”, “cat”, “deer”, “dog”, “frog”, “horse”, “ship” and “truck”. In the dataset, these are represented by integer labels in the range [0, 9].

    The dataset is split into 40 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 40 000 samples from the original CIFAR-10 training set. The validation dataset, on the other hand, is formed from the final 10 000 image batch of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original CIFAR-10 test set.

    The h5 files containing the dataset

    The training dataset file (~5.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:

    cifar10_with_scale_variations_tr40000_vl10000_te10000_outsize64-64_scte1p000_scte1p000.h5

    Additionally, for the Rescaled CIFAR-10 dataset, there are 9 datasets (~1 GB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor, 2^(k/4), with k being an integer in the range [-4, 4]:

    cifar10_with_scale_variations_te10000_outsize64-64_scte0p500.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte0p595.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte0p707.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte0p841.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p000.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p189.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p414.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p682.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte2p000.h5

    These dataset files were used for the experiments presented in Figures 9, 10, 15, 16, 20 and 24 in [1].

    Instructions for loading the data set

    The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as
    ('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.

    The training dataset can be loaded in Python as:

    import h5py
    import numpy as np

    with h5py.File('cifar10_with_scale_variations_tr40000_vl10000_te10000_outsize64-64_scte1p000_scte1p000.h5', 'r') as f:
        x_train = np.array(f["/x_train"], dtype=np.float32)
        x_val = np.array(f["/x_val"], dtype=np.float32)
        x_test = np.array(f["/x_test"], dtype=np.float32)
        y_train = np.array(f["/y_train"], dtype=np.int32)
        y_val = np.array(f["/y_val"], dtype=np.int32)
        y_test = np.array(f["/y_test"], dtype=np.int32)

    We also need to permute the data, since PyTorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:

    x_train = np.transpose(x_train, (0, 3, 1, 2))
    x_val = np.transpose(x_val, (0, 3, 1, 2))
    x_test = np.transpose(x_test, (0, 3, 1, 2))

    The test datasets can be loaded in Python as:

    # Example: the scale-1.414 test file from the list above.
    with h5py.File('cifar10_with_scale_variations_te10000_outsize64-64_scte1p414.h5', 'r') as f:
        x_test = np.array(f["/x_test"], dtype=np.float32)
        y_test = np.array(f["/y_test"], dtype=np.int32)

    The test datasets can be loaded in Matlab as:

    x_test = h5read('cifar10_with_scale_variations_te10000_outsize64-64_scte1p414.h5', '/x_test');

    The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.

  7. Python package Modin

    • kaggle.com
    zip
    Updated Nov 15, 2020
    Cite
    Kaihua Zhang (2020). Python package Modin [Dataset]. https://www.kaggle.com/zhangkaihua88/pythonmodin
    Explore at:
    zip (481120 bytes)
    Dataset updated
    Nov 15, 2020
    Authors
    Kaihua Zhang
    Description

    Dataset

    This dataset was created by Kaihua Zhang

    Contents

  8. Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter...

    • zenodo.org
    bz2
    Updated Mar 15, 2021
    + more versions
    Cite
    João Felipe; João Felipe; Leonardo; Leonardo; Vanessa; Vanessa; Juliana; Juliana (2021). Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks [Dataset]. http://doi.org/10.5281/zenodo.2592524
    Explore at:
    bz2
    Dataset updated
    Mar 15, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    João Felipe; João Felipe; Leonardo; Leonardo; Vanessa; Vanessa; Juliana; Juliana
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices, and produces results that can be hard to reproduce. To understand the good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.

    Paper: https://2019.msrconf.org/event/msr-2019-papers-a-large-scale-study-about-quality-and-reproducibility-of-jupyter-notebooks

    This repository contains two files:

    • dump.tar.bz2
    • jupyter_reproducibility.tar.bz2

    The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.

    The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:

    • analyses: this folder has all the notebooks we use to analyze the data in the PostgreSQL database.
    • archaeology: this folder has all the scripts we use to query, download, and extract data from GitHub notebooks.
    • paper: empty. The notebook analyses/N12.To.Paper.ipynb moves data into it.

    In the remainder of this text, we give instructions for reproducing the analyses using the data provided in the dump, and for reproducing the collection by collecting data from GitHub again.

    Reproducing the Analysis

    This section shows how to load the data in the database and run the analyses notebooks. In the analysis, we used the following environment:

    Ubuntu 18.04.1 LTS
    PostgreSQL 10.6
    Conda 4.5.11
    Python 3.7.2
    PdfCrop 2012/11/02 v1.38

    First, download dump.tar.bz2 and extract it:

    tar -xjf dump.tar.bz2

    It extracts the file db2019-03-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:

    psql jupyter < db2019-03-13.dump

    It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTION:

    export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";
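
    Scripts that use sqlalchemy can then pick the connection string up from the environment; a minimal sketch (not code from the replication package):

    import os
    from sqlalchemy import create_engine, text

    # Build an engine from the JUP_DB_CONNECTION variable set above.
    engine = create_engine(os.environ["JUP_DB_CONNECTION"])
    with engine.connect() as conn:
        print(conn.execute(text("SELECT 1")).scalar())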

    Download and extract jupyter_reproducibility.tar.bz2:

    tar -xjf jupyter_reproducibility.tar.bz2

    Create a conda environment with Python 3.7:

    conda create -n analyses python=3.7
    conda activate analyses

    Go to the analyses folder and install all the dependencies listed in requirements.txt:

    cd jupyter_reproducibility/analyses
    pip install -r requirements.txt

    To reproduce the analyses, run Jupyter in this folder:

    jupyter notebook

    Execute the notebooks in this order:

    • Index.ipynb
    • N0.Repository.ipynb
    • N1.Skip.Notebook.ipynb
    • N2.Notebook.ipynb
    • N3.Cell.ipynb
    • N4.Features.ipynb
    • N5.Modules.ipynb
    • N6.AST.ipynb
    • N7.Name.ipynb
    • N8.Execution.ipynb
    • N9.Cell.Execution.Order.ipynb
    • N10.Markdown.ipynb
    • N11.Repository.With.Notebook.Restriction.ipynb
    • N12.To.Paper.ipynb

    Reproducing or Expanding the Collection

    The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.

    Requirements

    This time, we have extra requirements:

    • All the analysis requirements
    • lbzip2 2.5
    • gcc 7.3.0
    • GitHub account
    • Gmail account

    Environment

    First, set the following environment variables:

    export JUP_MACHINE="db"; # machine identifier
    export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
    export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
    export JUP_COMPRESSION="lbzip2"; # compression program
    export JUP_VERBOSE="5"; # verbose level
    export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlalchemy connection
    export JUP_GITHUB_USERNAME="github_username"; # your github username
    export JUP_GITHUB_PASSWORD="github_password"; # your github password
    export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
    export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
    export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
    export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
    export JUP_OAUTH_FILE="~/oauth2_creds.json" # oauth2 authentication file
    export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it blank
    export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it blank
    export JUP_WITH_EXECUTION="1"; # whether to execute python notebooks
    export JUP_WITH_DEPENDENCY="0"; # whether to run notebooks with and without declared dependencies
    export JUP_EXECUTION_MODE="-1"; # run following the execution order
    export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
    export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
    export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
    export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
    export JUP_NOTEBOOK_TIMEOUT="300"; # timeout the extraction
    
    
    # Frequency of log reports
    export JUP_ASTROID_FREQUENCY="5";
    export JUP_IPYTHON_FREQUENCY="5";
    export JUP_NOTEBOOKS_FREQUENCY="5";
    export JUP_REQUIREMENT_FREQUENCY="5";
    export JUP_CRAWLER_FREQUENCY="1";
    export JUP_CLONE_FREQUENCY="1";
    export JUP_COMPRESS_FREQUENCY="5";
    
    export JUP_DB_IP="localhost"; # postgres database IP

    Then, configure the file ~/oauth2_creds.json, according to yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf
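
    Once the credentials file is in place, a test notification can be sent with yagmail; a minimal sketch using the placeholder addresses from the variables above:

    import yagmail

    # Authenticate with the OAuth2 credentials file configured above.
    yag = yagmail.SMTP("gmail@gmail.com", oauth2_file="~/oauth2_creds.json")
    yag.send(to="target@email.com", subject="ghstudy test", contents="Setup works.")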

    Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the directories; the second one should umount it. You can leave the scripts blank, but this is not advisable: the reproducibility study runs arbitrary code on your machine and you may lose your data.

    Scripts

    Download and extract jupyter_reproducibility.tar.bz2:

    tar -xjf jupyter_reproducibility.tar.bz2

    Create 5 conda environments and 5 Anaconda environments, one pair for each Python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (note that it is a local package that has not been published to PyPI; make sure to use the -e option):

    Conda 2.7

    conda create -n raw27 python=2.7 -y
    conda activate raw27
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 2.7

    conda create -n py27 python=2.7 anaconda -y
    conda activate py27
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology
    

    Conda 3.4

    It requires a manual jupyter and pathlib2 installation due to some incompatibilities found in the default installation.

    conda create -n raw34 python=3.4 -y
    conda activate raw34
    conda install jupyter -c conda-forge -y
    conda uninstall jupyter -y
    pip install --upgrade pip
    pip install jupyter
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology
    pip install pathlib2

    Anaconda 3.4

    conda create -n py34 python=3.4 anaconda -y
    conda activate py34
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.5

    conda create -n raw35 python=3.5 -y
    conda activate raw35
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 3.5

    It requires the manual installation of other anaconda packages.

    conda create -n py35 python=3.5 anaconda -y
    conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
    conda activate py35
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.6

    conda create -n raw36 python=3.6 -y
    conda activate raw36
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 3.6

    conda create -n py36 python=3.6 anaconda -y
    conda activate py36
    conda install -y anaconda-navigator jupyterlab_server navigator-updater
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.7

    conda create -n raw37 python=3.7 -y
    conda activate raw37
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

  9. Data and Code for: "Universal Adaptive Normalization Scale (AMIS):...

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Nov 12, 2025
    Cite
    Gennady Kravtsov (2025). Data and Code for: "Universal Adaptive Normalization Scale (AMIS): Integration of Heterogeneous Metrics into a Unified System" [Dataset]. http://doi.org/10.7910/DVN/BISM0N
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 12, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Gennady Kravtsov
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Dataset Title: Data and Code for: "Universal Adaptive Normalization Scale (AMIS): Integration of Heterogeneous Metrics into a Unified System"

    Description: This dataset contains source data and processing results for validating the Adaptive Multi-Interval Scale (AMIS) normalization method. It includes educational performance data (student grades), economic statistics (World Bank GDP), and a Python implementation of the AMIS algorithm with a graphical interface.

    Contents:

    • Source data: educational grades and GDP statistics
    • AMIS normalization results (3-, 5-, 9-, and 17-point models)
    • Comparative analysis with linear normalization
    • Ready-to-use Python code for data processing

    Applications:

    • Educational data normalization and analysis
    • Economic indicators comparison
    • Development of unified metric systems
    • Methodology research in data scaling

    Technical info: Python code with pandas, numpy, scipy and matplotlib dependencies. Data in Excel format.
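
    The AMIS algorithm itself ships with the dataset; as a point of reference for the comparative analysis with linear normalization mentioned above, here is a minimal min-max (linear) normalization sketch onto a 1–5 scale (illustrative only, with hypothetical scores — this is not the AMIS method):

    import numpy as np

    def linear_normalize(values, lo=1.0, hi=5.0):
        # Linearly rescale raw scores onto the [lo, hi] interval.
        v = np.asarray(values, dtype=float)
        return lo + (v - v.min()) * (hi - lo) / (v.max() - v.min())

    grades = [42, 55, 61, 78, 95]    # hypothetical raw scores
    print(linear_normalize(grades))  # values mapped onto [1, 5]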

  10. wikihow

    • tensorflow.org
    • opendatalab.com
    Updated Dec 6, 2022
    Cite
    (2022). wikihow [Dataset]. https://www.tensorflow.org/datasets/catalog/wikihow
    Explore at:
    Dataset updated
    Dec 6, 2022
    Description

    WikiHow is a new large-scale dataset using the online WikiHow (http://www.wikihow.com/) knowledge base.

    There are two features:

    • text: WikiHow answer texts.
    • headline: bold lines as summary.

    There are two separate versions:

    • all: the concatenation of all paragraphs as the articles and the bold lines as the reference summaries.
    • sep: each paragraph and its summary.

    Download "wikihowAll.csv" and "wikihowSep.csv" from https://github.com/mahnazkoupaee/WikiHow-Dataset and place them in manual folder https://www.tensorflow.org/datasets/api_docs/python/tfds/download/DownloadConfig. Train/validation/test splits are provided by the authors. Preprocessing is applied to remove short articles (abstract length < 0.75 article length) and clean up extra commas.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('wikihow', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  11. Dataset for Large-Scale Analysis of Modern Code Review Practices and...

    • datadryad.org
    zip
    Updated Nov 28, 2017
    Cite
    Christopher Thompson (2017). Dataset for Large-Scale Analysis of Modern Code Review Practices and Software Security in Open Source Software [Dataset]. http://doi.org/10.6078/D13M3J
    Explore at:
    zip
    Dataset updated
    Nov 28, 2017
    Dataset provided by
    Dryad
    Authors
    Christopher Thompson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Nov 28, 2017
    Description

    Dataset for my thesis titled "Large-Scale Analysis of Modern Code Review Practices and Software Security in Open Source Software".

    Contains:

    • Labeled issues data used to train and test the quantifier models.
    • Post-quantification datasets the analyses were performed on.
    • Scripts (R, Python) for all analyses performed.
    
  12. Code4ML 2.0

    • zenodo.org
    csv, txt
    Updated May 19, 2025
    Cite
    Anonymous authors; Anonymous authors (2025). Code4ML 2.0 [Dataset]. http://doi.org/10.5281/zenodo.15465737
    Explore at:
    csv, txt
    Dataset updated
    May 19, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous authors; Anonymous authors
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is an enriched version of the Code4ML dataset, a large-scale corpus of annotated Python code snippets, competition summaries, and data descriptions sourced from Kaggle. The initial release includes approximately 2.5 million snippets of machine learning code extracted from around 100,000 Jupyter notebooks. A portion of these snippets has been manually annotated by human assessors through a custom-built, user-friendly interface designed for this task.

    The original dataset is organized into multiple CSV files, each containing structured data on different entities:

    • code_blocks.csv: Contains raw code snippets extracted from Kaggle.
    • kernels_meta.csv: Metadata for the notebooks (kernels) from which the code snippets were derived.
    • competitions_meta.csv: Metadata describing Kaggle competitions, including information about tasks and data.
    • markup_data.csv: Annotated code blocks with semantic types, allowing deeper analysis of code structure.
    • vertices.csv: A mapping from numeric IDs to semantic types and subclasses, used to interpret annotated code blocks.

    Table 1. code_blocks.csv structure

    Column            | Description
    code_blocks_index | Global index linking code blocks to markup_data.csv.
    kernel_id         | Identifier for the Kaggle Jupyter notebook from which the code block was extracted.
    code_block_id     | Position of the code block within the notebook.
    code_block        | The actual machine learning code snippet.

    Table 2. kernels_meta.csv structure

    Column          | Description
    kernel_id       | Identifier for the Kaggle Jupyter notebook.
    kaggle_score    | Performance metric of the notebook.
    kaggle_comments | Number of comments on the notebook.
    kaggle_upvotes  | Number of upvotes the notebook received.
    kernel_link     | URL to the notebook.
    comp_name       | Name of the associated Kaggle competition.

    Table 3. competitions_meta.csv structure

    Column                          | Description
    comp_name                       | Name of the Kaggle competition.
    description                     | Overview of the competition task.
    data_type                       | Type of data used in the competition.
    comp_type                       | Classification of the competition.
    subtitle                        | Short description of the task.
    EvaluationAlgorithmAbbreviation | Metric used for assessing competition submissions.
    data_sources                    | Links to datasets used.
    metric type                     | Class label for the assessment metric.

    Table 4. markup_data.csv structure

    Column          | Description
    code_block      | Machine learning code block.
    too_long        | Flag indicating whether the block spans multiple semantic types.
    marks           | Confidence level of the annotation.
    graph_vertex_id | ID of the semantic type.

    The dataset allows mapping between these tables. For example:

    • code_blocks.csv can be linked to kernels_meta.csv via the kernel_id column.
    • kernels_meta.csv is connected to competitions_meta.csv through comp_name. To maintain quality, kernels_meta.csv includes only notebooks with available Kaggle scores.

    In addition, data_with_preds.csv contains automatically classified code blocks, with a mapping back to code_blocks.csv via the code_blocks_index column.
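
    A minimal sketch of that linkage, using the key columns documented in the tables above:

    import pandas as pd

    code_blocks = pd.read_csv("code_blocks.csv")
    kernels = pd.read_csv("kernels_meta.csv")
    competitions = pd.read_csv("competitions_meta.csv")

    # code_blocks -> kernels via kernel_id; kernels -> competitions via comp_name.
    snippets = (code_blocks
                .merge(kernels, on="kernel_id")
                .merge(competitions, on="comp_name"))
    print(snippets.head())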

    Code4ML 2.0 Enhancements

    The updated Code4ML 2.0 corpus introduces kernels extracted from Meta Kaggle Code. These kernels correspond to Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the help of an LLM.

    Notebooks in kernels_meta2.csv may not have a Kaggle score but include a leaderboard ranking (rank), providing additional context for evaluation.

    competitions_meta_2.csv is enriched with data_cards, describing the data used in the competitions.

    Applications

    The Code4ML 2.0 corpus is a versatile resource, enabling training and evaluation of models in areas such as:

    • Code generation
    • Code understanding
    • Natural language processing of code-related tasks
  13. MUSCLE (MUltiplexed Single-molecule Characterization at the Library scalE)...

    • figshare.scilifelab.se
    • researchdata.se
    • +1more
    zip
    Updated Jan 15, 2025
    Cite
    Mikhail Panfilov; Guanzhong Mao; Jianfeng Guo; Javier Aguirre Rivera; Anton Sabantcev; Sebastian Deindl (2025). MUSCLE (MUltiplexed Single-molecule Characterization at the Library scalE) protocol data and codes [Dataset]. http://doi.org/10.17044/scilifelab.28008872.v1
    Explore at:
    zip
    Dataset updated
    Jan 15, 2025
    Dataset provided by
    Uppsala University
    Authors
    Mikhail Panfilov; Guanzhong Mao; Jianfeng Guo; Javier Aguirre Rivera; Anton Sabantcev; Sebastian Deindl
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A test dataset for MUSCLE (MUltiplexed Single-molecule Characterization at the Library scalE) data analysis. See "\Python codes for MUSCLE data analysis\README.txt" for instructions on running the data analysis code. Use the files in the "Test MUSCLE dataset" folder as input for the code; "Test MUSCLE dataset\Output_tile1" contains the code output for the test dataset. The example dataset corresponds to one MiSeq tile in an experiment analyzing dCas9-induced R-loop formation for a library of 256 different target sequences. The latest version of the Python code for matching single-molecule FRET traces with sequenced clusters is available at https://github.com/deindllab/MUSCLE/.

  14. Data associated with manuscript "Linearizing the vertical scale of an...

    • catalog.data.gov
    • gimi9.com
    Updated Sep 11, 2024
    + more versions
    Cite
    National Institute of Standards and Technology (2024). Data associated with manuscript "Linearizing the vertical scale of an interferometric microscope and its effect on step-height measurement" [Dataset]. https://catalog.data.gov/dataset/data-associated-with-manuscript-linearizing-the-vertical-scale-of-an-interferometric-micro
    Explore at:
    Dataset updated
    Sep 11, 2024
    Dataset provided by
    National Institute of Standards and Technology
    Description

    This repository contains all of the data used in the manuscript "Linearizing the vertical scale of an interferometric microscope and its effect on step-height measurement," by Thomas A. Germer, T. Brian Renegar, Ulf Griesmann, and Johannes A. Soons, which has been published in Surface Topography: Metrology and Properties volume 12, number 2, article 025012 on 8 May 2024. The repository also contains a Python Jupyter notebook that performs the analysis of the data and generates the figures in the manuscript.

  15. Python code data of attention-based dual-scale hierarchical LSTM for tool...

    • scidb.cn
    Updated Nov 7, 2022
    Cite
    Hao Guo; Kunpeng Zhu (2022). Python code data of attention-based dual-scale hierarchical LSTM for tool wear monitoring [Dataset]. http://doi.org/10.57760/sciencedb.06004
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 7, 2022
    Dataset provided by
    Science Data Bank
    Authors
    Hao Guo; Kunpeng Zhu
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The experiment is based on a common high-speed milling data set to verify the robustness of the model to various tool types. The data set contains six sub-datasets, corresponding to the wear process of six different types of tools. Three of the sub-datasets contain tool wear labels, while the other three do not. The tools used are all three-edged 6 mm ball-end cemented carbide tools, but their geometry and coating differ. The workpiece is Inconel 718, which is widely used for jet engine blade milling. The spindle speed is 10,360 rpm, and the cutting depth is 0.25 mm. The tool cuts from the upper edge of the workpiece surface to the lower edge in a zigzag manner. Over the whole milling process, the cutting length of each tool is about 0.1125 m × 315 passes = 35.44 m. The cutting signal in Experiment 1 includes the cutting force signal collected by a three-channel Kistler dynamometer and the vibration signal collected by a three-channel Kistler accelerometer, both at a sampling rate of 50 kHz. A LEICA MZ12 microscope is used to measure the flank wear of the three teeth offline after each tool pass. In this experiment, a cutting signal is collected at regular intervals to predict the wear of the three teeth of the tool.

    The samples are divided into a training set, an evaluation set, a test set and a reconstruction set. The training and evaluation sets come from two kinds of tools, containing 30,000 and 4,096 samples respectively; the test set comes from another tool, containing 9,472 samples; the reconstruction set comes from the unlabeled data generated by the other three tools, containing 40,832 samples. Each sample contains three channels of cutting force signal and three channels of vibration signal, with 2,304 sampling points per channel. The following preprocessing steps are performed:

    1) Signal clipping. Since the feed rate and sampling rate are constant throughout the experiment, the data set of each experiment can be approximately understood as a signal matrix evenly distributed over the workpiece surface, ignoring the slight difference in the number of sampling points for each tool path. The ordinate of the matrix corresponds to the index of the tool pass, and the abscissa corresponds to the index of the sampling point. Because the generation rules of cutting signals differ in the uncut, cut-in, cut-out and stable states, the sampling points close to the edge of the workpiece are removed: 2% is simply cut off each end of the cutting signal obtained by each tool pass.

    2) Data amplification. Because tool wear can only be observed with a microscope after each tool pass, each wear label corresponds to a cutting signal containing about 120,000 sampling points, and measuring tool wear also takes considerable time. In this case, the number of labels is not enough to fit the model, nor can the robustness of the algorithm be guaranteed, so it is necessary to split the samples artificially and expand the tool wear labels. Considering that tool wear is a slow and continuous process, and that there is some deviation in the experimental measurement, linear interpolation is adopted here. Quadratic interpolation and polynomial fitting were also tested, but no better results were observed. Note that the essence of prediction is to find a function that maps the sample space to the target space: for any point in the sample space, the model can find the corresponding value in the target space. Sample amplification simply samples the target space more densely, so as to describe this mapping relationship more comprehensively, rather than redefining it.

    The task of this study is to monitor the flank wear of the three teeth from the six-channel sensor signals. On the test set, the mean squared error (MSE) and mean absolute percentage error (MAPE) between the predicted values and the microscope observations are 0.0013 and 4%, respectively, and the average and maximum final prediction errors (FPE) are 5 μm and 23 μm. The training time was 2,130 s, and a single prediction takes 1.79 ms. The accuracy, training time and detection efficiency of the tool wear monitoring can meet current industrial needs. As MPAN realizes the mapping from cutting signal to tool wear, the attention unit, acting as the gate of the information flow, retains the importance of the input features. The predicted tool wear curve is basically consistent with the curve observed by the microscope.
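
    A minimal sketch of the linear label expansion described above (the wear values are hypothetical; the paper's actual measurement arrays are not included in this listing):

    import numpy as np

    # Flank wear measured after each tool pass (hypothetical values, in mm).
    wear_per_pass = np.array([0.00, 0.02, 0.05, 0.09, 0.12])
    pass_index = np.arange(len(wear_per_pass))

    # Split each pass into 10 sub-samples and interpolate a wear label
    # for every sub-sample position between the per-pass measurements.
    sub_positions = np.linspace(0, len(wear_per_pass) - 1,
                                10 * (len(wear_per_pass) - 1) + 1)
    expanded_labels = np.interp(sub_positions, pass_index, wear_per_pass)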

  16. cornstack-python-v1

    • huggingface.co
    Updated Dec 10, 2024
    + more versions
    Cite
    Nomic AI (2024). cornstack-python-v1 [Dataset]. https://huggingface.co/datasets/nomic-ai/cornstack-python-v1
    Explore at:
    Dataset updated
    Dec 10, 2024
    Dataset authored and provided by
    Nomic AI
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    CoRNStack Python Dataset

    The CoRNStack Dataset, accepted to ICLR 2025, is a large-scale, high-quality training dataset specifically for code retrieval across multiple programming languages. This dataset comprises …

    CoRNStack Dataset Curation


    Starting with the deduplicated Stackv2, we create text-code pairs from function docstrings and respective code. We filtered out… See the full description on the dataset page: https://huggingface.co/datasets/nomic-ai/cornstack-python-v1.
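
    The corpus can be pulled from the Hugging Face Hub with the datasets library; a minimal sketch (the split name and the streaming choice are assumptions — check the dataset page):

    from datasets import load_dataset

    # Stream to avoid downloading the full corpus up front.
    ds = load_dataset("nomic-ai/cornstack-python-v1", split="train", streaming=True)
    print(next(iter(ds)))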

  17. Multimodal Vision-Audio-Language Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 11, 2024
    Cite
    Schaumlöffel, Timothy; Roig, Gemma; Choksi, Bhavin (2024). Multimodal Vision-Audio-Language Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10060784
    Explore at:
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Goethe University Frankfurt
    Authors
    Schaumlöffel, Timothy; Roig, Gemma; Choksi, Bhavin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities. Details can be found in the attached report.

    Annotation

    The annotation files are provided as Parquet files. They can be read using Python with the pandas and pyarrow libraries. The split into train, validation and test sets follows the split of the original datasets.

    Installation

    pip install pandas pyarrow

    Example

    import pandas as pd

    df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
    print(df.iloc[0])

    dataset            AudioSet
    filename           train/---2_BBVHAA.mp3
    captions_visual    [a man in a black hat and glasses.]
    captions_auditory  [a man speaks and dishes clank.]
    tags               [Speech]

    Description

    The annotation file consists of the following fields:

    • filename: Name of the corresponding file (video or audio file)
    • dataset: Source dataset associated with the data point
    • captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
    • captions_auditory: A list of captions related to the auditory content of the video
    • tags: A list of tags classifying the sound of a file. Can be NaN if no tags are provided

    Data files

    The raw data files for most datasets are not released due to licensing issues. They must be downloaded from the source. However, due to missing files, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de

  18. The dataset of properly-licensed Jupyter notebooks

    • kaggle.com
    zip
    Updated Nov 19, 2022
    Cite
    Dmytro Poplavskiy (2022). The dataset of properly-licensed Jupyter notebooks [Dataset]. https://www.kaggle.com/datasets/dmytropoplavskiy/the-dataset-of-properlylicensed-jupyter-notebooks
    Explore at:
    zip (3641456019 bytes)
    Dataset updated
    Nov 19, 2022
    Authors
    Dmytro Poplavskiy
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset of Jupyter Notebooks from the paper "A Large-Scale Comparison of Python Code in Jupyter Notebooks and Scripts"

    This archive contains the dataset of properly-licensed Jupyter notebooks from the MSR'22 paper "A Large-Scale Comparison of Python Code in Jupyter Notebooks and Scripts".

    The dataset contains 847,881 notebooks, converted to the format used for the "Google AI4Code – Understand Code in Python Notebooks" competition.

    Original dataset source: https://zenodo.org/record/6383115

  19. Input data to model multiple effects of large-scale deployment of grass in...

    • researchdata.se
    • data-staging.niaid.nih.gov
    • +3more
    Updated Jul 1, 2024
    Cite
    Oskar Englund (2024). Input data to model multiple effects of large-scale deployment of grass in crop-rotations at European scale [Dataset]. http://doi.org/10.5061/dryad.18931zd1m
    Explore at:
    Dataset updated
    Jul 1, 2024
    Dataset provided by
    Mid Sweden University
    Authors
    Oskar Englund
    Description

    This is the input dataset to a Python script (https://github.com/oskeng/MF-bio-grass) used to model the effects of widespread deployment of grass in rotations with annual crops to provide biomass while remediating soil organic carbon (SOC) losses and other environmental impacts.

    For more information about the dataset and the study, see the original article:

    Englund, O., Mola-Yudego, B., Börjesson, P., Cederberg, C., Dimitriou, I., Scarlat, N., Berndes, G. Large-scale deployment of grass in crop rotations as a multifunctional climate mitigation strategy. GCB Bioenergy

    Usage Notes:

    The data file (GeoPackage) can be opened using standard GIS software, preferably GRASS GIS or QGIS (both open source). This dataset is intended as input to a Python script (https://github.com/oskeng/MF-bio-grass) that must be run from within a GRASS GIS session.
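
    For a quick look at the GeoPackage outside of a GRASS session, geopandas can read it (a hedged sketch; "input_data.gpkg" is a placeholder for the dataset's file name, and the modelling script itself still requires GRASS GIS):

    import geopandas as gpd

    # Load the GeoPackage for inspection (placeholder filename).
    gdf = gpd.read_file("input_data.gpkg")
    print(gdf.head())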

    The dataset was originally published in DiVA and moved to SND in 2024.

  20. winogrande

    • tensorflow.org
    • opendatalab.com
    • +1more
    Updated Dec 11, 2024
    Cite
    (2024). winogrande [Dataset]. https://www.tensorflow.org/datasets/catalog/winogrande
    Explore at:
    Dataset updated
    Dec 11, 2024
    Description

    WinoGrande is a large-scale dataset of 44k problems, inspired by the original Winograd Schema Challenge design but adjusted to improve both the scale and the hardness of the dataset.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('winogrande', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.
