Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub. Based on the results, we proposed and evaluated Julynter, a linting tool for Jupyter Notebooks.
Papers:
This repository contains three files:
Reproducing the Notebook Study
The db2020-09-22.dump.gz file contains a PostgreSQL dump of the database, with all the data we extracted from notebooks. To load it into an existing database named jupyter, run:
gunzip -c db2020-09-22.dump.gz | psql jupyter
Note that this file contains only the database with the extracted data. The actual repositories are available in a Google Drive folder, which also contains the Docker images we used in the reproducibility study. The repositories are stored as content/{hash_dir1}/{hash_dir2}.tar.bz2, where hash_dir1 and hash_dir2 are columns of the repositories table in the database.
For scripts, notebooks, and detailed instructions on how to analyze or reproduce the data collection, please check the instructions on the Jupyter Archaeology repository (tag 1.0.0).
The sample.tar.gz file contains the repositories obtained during the manual sampling.
Reproducing the Julynter Experiment
The julynter_reproducility.tar.gz file contains all the data collected in the Julynter experiment and the analysis notebooks. Reproducing the analysis is straightforward:
The collected data is stored in the julynter/data folder.
Changelog
2019/01/14 - Version 1 - Initial version
2019/01/22 - Version 2 - Update N8.Execution.ipynb to calculate the rate of failure for each reason
2019/03/13 - Version 3 - Update package for camera ready. Add columns to db to detect duplicates, change notebooks to consider them, and add N1.Skip.Notebook.ipynb and N11.Repository.With.Notebook.Restriction.ipynb.
2021/03/15 - Version 4 - Add Julynter experiment; Update database dump to include new data collected for the second paper; remove scripts and analysis notebooks from this package (moved to GitHub), add a link to Google Drive with collected repository files
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The target company's hydraulic modelling package uses Innovyze Infoworks™. This product enables third-party integration through APIs and Ruby scripts when the ICM Exchange service is enabled. As a result, the research looked at opportunities to exploit scripting in order to run the chosen optimisation strategy. The first approach investigated the use of a CS-script tool that would export the results tables directly from the Innovyze Infoworks™ environment into CSV-format workbooks. From here the data could then be inspected, with the application of mathematical tooling to optimise the pump start parameters before returning these to the model and rerunning. Note that the computational resource the research obtained to deploy the modelling and analysis tools comprised the following specification.
Hardware
Dell Poweredge R720
Intel Xeon Processor E5-2600 v2
2x Processor Sockets
32GB random access memory (RAM) – 1866MT/s
Virtual Machine
Hosted on VMWare Hypervisor v6.0.
Windows Server 2012R2.
Microsoft Excel 64bit.
16 virtual central processing units (vCPUs).
Full provision of 32GB RAM – 1866MT/s.
Issues were highlighted in the first round of data exports as, even with a dedicated server offering 16 vCPUs and the specification shown above, the Excel front-end environment was unable to process the very large data matrices being generated. There were regular failures of the Excel executable, which led to an overall inability to inspect the data, let alone run calculations on the matrices. When considering the five-second sample over 31 days, this resulted in matrices in the order of [44x535682] per model run, with the calculations in (14-19) needing to be applied on a per-cell basis.
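As a quick sanity check on the reported matrix dimensions, the row count follows from the sampling arithmetic alone; the short Python sketch below takes the 44 result columns as given from the text above and reproduces the order of magnitude.

# Approximate number of 5-second samples in a 31-day period
seconds_per_day = 24 * 60 * 60            # 86,400
rows = 31 * seconds_per_day // 5          # 535,680, close to the reported 535,682
columns = 44                              # result fields per time step, as reported
print(rows, columns * rows)               # roughly 23.6 million cells per model run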
The Permian Basin, straddling New Mexico and Texas, is one of the most productive oil and gas (OG) provinces in the United States. OG production yields large volumes of wastewater that contain elevated concentrations of major ions including salts (also referred to as brines), and trace organic and inorganic constituents. These OG wastewaters pose unknown environmental health risks, particularly in the case of accidental or intentional releases. Releases of OG wastewaters have resulted in water-quality and environmental health effects at sites in West Virginia (Akob, et al., 2016, Orem et al. 2017, Kassotis et al. 2016) and in the Williston Basin region in Montana and North Dakota (Cozzarelli et al. 2017, Cozzarelli et al. 2021, Lauer et al. 2016, Gleason et al. 2014, and Mills et al. 2011). Starting in November 2017, 39 illegal dumps of OG wastewater were identified in southeastern New Mexico on public lands by the Bureau of Land Management (BLM). Illegal dumping is an unpermitted release of waste materials that is in violation of Federal and State laws including the U.S. Resource Conservation and Recovery Act (U.S. EPA, 1976), Federal Land Policy and Management Act (U.S. DOI, 2016; 43 USC 1701(a)(8); 43 USC 1733(g)), the State of New Mexico’s Oil and Gas Act (New Mexico Legislature. 2019), and New Mexico Administrative Code § 19.15.34.20. To evaluate the effects of these releases, changes in soil geochemistry and microbial community structure at 6 sites were analyzed by comparing soils from within OG wastewater dump-affected zones to corresponding unaffected (control) soils. In addition, the effects on local vegetation were evaluated by measuring the chemistry of 4 plant species from dump-affected and control zones at a single site. Samples of local produced waters were geochemically and isotopically characterized to link soil geochemistry to reservoir geochemistry. These data sets included field observations; soil water extractable inorganic chemical composition, pH, strontium (Sr) isotopes, and specific conductance; bulk soil Raman, carbon (C), nitrogen (N), mercury (Hg), radium (Ra) and thorium (Th) isotopes, and percent moisture; plant inorganic chemical composition; and soil microbial community composition data. At each site, triplicate soil samples were collected from dump-affected and control zones and duplicate field samples were collected at each site. Plant biomass was collected in triplicate from dump-affected and control zones at a single site. 
This data release includes eleven data tables provided in machine-readable comma-separated values format (*.csv):
T01_Permian_Data_Dictionary.csv: the entity and attribute metadata section for tables T02-T11 in table format
T02_Soil_Geochemistry.csv: descriptions of sampling sites and concentrations of major anions, cations, and trace elements from the soil samples
T03_Plant_Geochemistry.csv: concentrations of major anions, cations, trace elements, and Sr isotopes from the vegetation samples
T04_Soil_Isotopes.csv: Sr, Ra, and Th isotopes from the soils
T05_Raman_Counts.csv: Raman spectra counts from the soil samples
T06_Raman_Band_Separation.csv: Raman band separation from selected soil samples
T07_Soil_Organics_Spectra.csv: spectral data of alkane unresolved complex mixtures (UCMs) from soil extracts
T08_Soil_Organics_Summary.csv: a summary of alkane UCMs from soil extracts
T09_Soil_16S_BIOM.csv: microbial operational taxonomic units from the soils
T10_Produced_Water.csv: selected geochemistry and isotopic measurements from produced water samples
T11_Limits_AnalyticalMethods.csv: a listing of analytical detection limits
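As a minimal sketch (assuming only that the files are standard CSVs as stated, with column definitions documented in T01_Permian_Data_Dictionary.csv), the tables can be inspected with pandas:

import pandas as pd

# Load one of the released tables and list its columns; no column names are
# assumed here, they are documented in T01_Permian_Data_Dictionary.csv.
soil = pd.read_csv("T02_Soil_Geochemistry.csv")
print(soil.shape)
print(list(soil.columns))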
Object detection is a vital part of any autonomous vision system and to obtain a high performing object detector data is needed. The object detection task aims to detect and classify different objects using camera input and getting bounding boxes containing the objects as output. This is usually done by utilizing deep neural networks.
When training an object detector, a large amount of data is used; however, it is not always practical to collect large amounts of data. This has led to multiple techniques which decrease the amount of data needed, such as transfer learning and domain adaptation. Working with construction equipment is a time-consuming process, and we wanted to examine whether it was possible to use scale-model data to train a network and then use that network to detect real objects with no additional training.
This small dataset contains training and validation data of a scale dump truck in different environments, while the test set contains images of a full-size dump truck of a similar model. The aim of the dataset is to train a network to classify wheels, cabs and tipping bodies of a scale-model dump truck and use that to classify the same classes on a full-scale dump truck.
The label structure of the dataset is the YOLO v3 structure, where each class corresponds to an integer value: Wheel: 0, Cab: 1, Tipping body: 2
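For illustration, a minimal label reader is sketched below; it assumes the standard YOLO convention of one line per object, with the class index followed by the normalised box centre, width and height.

# Hypothetical helper for reading a YOLO-format label file of this dataset.
CLASS_NAMES = {0: "Wheel", 1: "Cab", 2: "Tipping body"}

def read_yolo_labels(path):
    boxes = []
    with open(path) as fh:
        for line in fh:
            cls, xc, yc, w, h = line.split()
            boxes.append((CLASS_NAMES[int(cls)], float(xc), float(yc), float(w), float(h)))
    return boxes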
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
In our work, we have designed and implemented a novel workflow with several heuristic methods to combine state-of-the-art methods for gathering CVE fix commits. As a consequence of our improvements, we have been able to gather the largest programming-language-independent real-world dataset of CVE vulnerabilities with the associated fix commits. Our dataset, containing 29,203 unique CVEs coming from 7,238 unique GitHub projects, is, to the best of our knowledge, by far the biggest CVE vulnerability dataset with fix commits available today. These CVEs are associated with 35,276 unique commits in the SQL dump and 39,931 patch commit files that fixed those vulnerabilities (some patch files could not be saved in the SQL dump due to several technical reasons). Our larger dataset thus substantially improves over the current real-world vulnerability datasets and enables further progress in research on vulnerability detection and software security. We used the NVD (nvd.nist.gov) and the GitHub Security Advisory Database as the main sources of our pipeline.
We release to the community a 16GB PostgreSQL database that contains information on CVEs up to 2024-09-26, CWEs of each CVE, files and methods changed by each commit, and repository metadata. Additionally, patch files related to the fix commits are available as a separate package. Furthermore, we make our dataset collection tool also available to the community.
The cvedataset-patches.zip file contains fix patches, and postgrescvedumper.sql.zip contains a PostgreSQL dump of fixes, together with several other fields such as CVEs, CWEs, repository metadata, commit data, file changes, methods changed, etc.
The MoreFixes data-storage strategy is based on CVEFixes to store CVE fix commits from open-source repositories, and it uses a modified version of Prospector (part of Project KB from SAP) as a module to detect the fix commits of a CVE. Our full methodology is presented in the paper "MoreFixes: A Large-Scale Dataset of CVE Fix Commits Mined through Enhanced Repository Discovery", published at the PROMISE conference (2024).
For more information about usage and sample queries, visit the Github repository: https://github.com/JafarAkhondali/Morefixes
If you are using this dataset, please be aware that the repositories we mined carry different licenses, and you are responsible for handling any licensing issues. The same applies to CVEFixes.
This product uses the NVD API but is not endorsed or certified by the NVD.
This research was partially supported by the Dutch Research Council (NWO) under the project NWA.1215.18.008 Cyber Security by Integrated Design (C-SIDe).
To restore the dataset, you can use the docker-compose file available at the GitHub repository. Default database credentials after restoring the dump:
POSTGRES_USER=postgrescvedumper POSTGRES_DB=postgrescvedumper POSTGRES_PASSWORD=a42a18537d74c3b7e584c769152c3d
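As a minimal sketch, assuming the dump has been restored locally with these default credentials and PostgreSQL is listening on localhost, the database can be queried from Python; the table listing below makes no assumption about the schema itself.

from sqlalchemy import create_engine, text

engine = create_engine(
    "postgresql://postgrescvedumper:a42a18537d74c3b7e584c769152c3d@localhost/postgrescvedumper"
)
with engine.connect() as conn:
    tables = conn.execute(text(
        "SELECT table_name FROM information_schema.tables WHERE table_schema = 'public'"
    ))
    print([row[0] for row in tables])  # inspect the available tables before querying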
Please use this for citation:
title={MoreFixes: A large-scale dataset of CVE fix commits mined through enhanced repository discovery},
author={Akhoundali, Jafar and Nouri, Sajad Rahim and Rietveld, Kristian and Gadyatskaya, Olga},
booktitle={Proceedings of the 20th International Conference on Predictive Models and Data Analytics in Software Engineering},
pages={42--51},
year={2024}
}
RealNews is a large corpus of news articles from Common Crawl. Data is scraped from Common Crawl, limited to the 5000 news domains indexed by Google News. The authors used the Newspaper Python library to extract the body and metadata from each article. News from Common Crawl dumps from December 2016 through March 2019 were used as training data; articles published in April 2019 from the April 2019 dump were used for evaluation. After deduplication, RealNews is 120 gigabytes without compression.
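For illustration only, the kind of per-article extraction described above can be sketched with the Newspaper (newspaper3k) library; the URL is a placeholder, and this is not the authors' actual pipeline code.

from newspaper import Article

article = Article("https://example.com/some-news-story")  # placeholder URL
article.download()
article.parse()
print(article.title, article.authors, article.publish_date)  # extracted metadata
print(article.text[:200])                                    # extracted body text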
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
From 2016 to 2018, we surveyed the world’s largest natural history museum collections to begin mapping this globally distributed scientific infrastructure. The resulting dataset includes 73 institutions across the globe. It has:
Basic institution data for the 73 contributing institutions, including estimated total collection sizes, geographic locations (to the city) and latitude/longitude, and Research Organization Registry (ROR) identifiers where available.
Resourcing information, covering the numbers of research, collections and volunteer staff in each institution.
Indicators of the presence and size of collections within each institution broken down into a grid of 19 collection disciplines and 16 geographic regions.
Measures of the depth and breadth of individual researcher experience across the same disciplines and geographic regions.
This dataset contains the data (raw and processed) collected for the survey, and specifications for the schema used to store the data. It includes:
The global collections data may also be accessed at https://rebrand.ly/global-collections. This is a preliminary dashboard, constructed and published using Microsoft Power BI, that enables the exploration of the data through a set of visualisations and filters. The dashboard consists of three pages:
Institutional profile: Enables the selection of a specific institution and provides summary information on the institution and its location, staffing, total collection size, collection breakdown and researcher expertise.
Overall heatmap: Supports an interactive exploration of the global picture, including a heatmap of collection distribution across the discipline and geographic categories, and visualisations that demonstrate the relative breadth of collections across institutions and correlations between collection size and breadth. Various filters allow the focus to be refined to specific regions and collection sizes.
Browse: Provides some alternative methods of filtering and visualising the global dataset to look at patterns in the distribution and size of different types of collections across the global view.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for openlegaldata.io bulk case data
Dataset Description
This is a copy of the latest dump from openlegaldata.io. I will try to keep this updated, since there is no official Hugging Face dataset repo.
Homepage: https://de.openlegaldata.io/ Repository: Bulk Data
Dataset Summary
This is the openlegaldata bulk case download from October 2022. Please refer to the official website (above) for more information. I have not made any changes for… See the full description on the dataset page: https://huggingface.co/datasets/LennardZuendorf/openlegaldata-bulk-data.
How large is the impact of a dump site on house prices in the area?
You work for a local government agency. They need to locate a new garbage dump site near the city and are looking for the optimal location to minimize its impact on house prices in the area. Your task is to take the available historical data about house prices near dump sites for the two available years and estimate the impact of a dump site's vicinity on house prices. Present an econometric model.
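One possible specification, sketched below under the assumption of a simple difference-in-differences style hedonic model, compares houses near a dump site with other houses across the two available years; the file and column names are hypothetical, not part of the task data.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("house_prices.csv")   # hypothetical columns: price, near_dump (0/1), year2 (0/1)
model = smf.ols("np.log(price) ~ near_dump * year2", data=df).fit()
print(model.summary())
# The near_dump:year2 interaction estimates how the (log) price gap between
# houses near a dump site and other houses changed between the two years.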
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Data dump from the Research Organization Registry (ROR), a community-led registry of open identifiers for research organizations.
Release v1.63 contains ROR IDs and metadata for 115,299 research organizations in JSON and CSV format, in schema versions 1 and 2. This includes the addition of 287 new records and metadata updates to 105 existing records. See the release notes.
Data format
Beginning with release v1.45 on 11 April 2024, data releases contain JSON and CSV files formatted according to both schema v1 and schema v2. v2 files have _schema_v2 appended to the end of the filename, ex v1.45-2024-04-11-ror-data_schema_v2.json. In order to maintain compatibility with previous releases, v1 files have no version information in the filename, ex v1.45-2024-04-11-ror-data.json.
For both versions, the CSV file contains a subset of fields from the JSON file, some of which have been flattened for easier parsing. As ROR records and the ROR schema are maintained in JSON, CSVs are for convenience only. JSON is the format of record.
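As a minimal sketch of working with a release, the JSON file can be loaded directly; the filename below is the schema v2 example given above (v1.45), so substitute the file from the release you actually downloaded.

import json

with open("v1.45-2024-04-11-ror-data_schema_v2.json") as fh:
    records = json.load(fh)   # the dump is a single JSON array of organization records
print(len(records))           # number of records in the release
print(records[0]["id"])       # ROR ID of the first record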
Release versioning
Beginning with v1.45 in April 2024, ROR has introduced schema versioning, with files available in schema v1 and schema v2. The ROR API default version, however, remains v1 and will be changed to v2 in April 2025. To align with the API, the data dump major version will remain 1 until the API default version is changed to v2. At that time, the data dump major version will be incremented to 2 per below.
Data releases are versioned as follows:
Minor versions (ex 1.1, 1.2, 1.3): Contain changes to data, such as new records and updates to existing records. No changes to the data model/structure.
Patch versions (ex 1.0.1): Used infrequently to correct errors in a release. No changes to the data model/structure.
Major versions (ex 1.x, 2.x, 3.x): Contain changes to the data model/structure, as well as the data itself. Major versions will be released with significant advance notice.
For convenience, the date is also included in the release file name, ex: v1.0-2022-03-15-ror-data.zip.
The ROR data dump is provided under the Creative Commons CC0 Public Domain Dedication. Location data in ROR comes from GeoNames and is licensed under a Creative Commons Attribution 4.0 license.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Note: This dataset is the combination of four related datasets that were originally hosted on OpenfMRI.org: ds000113, ds000113b, ds000113c and ds000113d. The combined dataset is now in BIDS format and is simply referred to as ds000113 on OpenNeuro.org.
For more information about the project visit: http://studyforrest.org
This dataset contains high-resolution functional magnetic resonance imaging (fMRI) data from 20 participants recorded at high field strength (7 Tesla) during prolonged stimulation with an auditory feature film ("Forrest Gump"). In addition, a comprehensive set of auxiliary data (T1w, T2w, DTI, susceptibility-weighted image, angiography) as well as measurements to assess technical and physiological noise components have been acquired. An initial analysis confirms that these data can be used to study common and idiosyncratic brain response patterns to complex auditory stimulation. Among the potential uses of this dataset are the study of auditory attention and cognition, language and music perception, as well as social perception. The auxiliary measurements enable a large variety of additional analysis strategies that relate functional response patterns to structural properties of the brain. Alongside the acquired data, we provide source code and detailed information on all employed procedures — from stimulus creation to data analysis. (https://www.nature.com/articles/sdata20143)
The dataset also contains data from the same twenty participants while being repeatedly stimulated with a total of 25 music clips, with and without speech content, from five different genres using a slow event-related paradigm. It also includes raw fMRI data, as well as pre-computed structural alignments for within-subject and group analysis.
Additionally, seven of the twenty subjects participated in another study: empirical ultra high-field fMRI data recorded at four spatial resolutions (0.8 mm, 1.4 mm, 2 mm, and 3 mm isotropic voxel size) for orientation decoding in visual cortex — in order to test hypotheses on the strength and spatial scale of orientation discriminating signals. (https://www.sciencedirect.com/science/article/pii/S2352340917302056)
Finally, there are additional acquisitions for fifteen of the twenty participants: retinotopic mapping, a localizer paradigm for higher visual areas (FFA, EBA, PPA), and another 2-hour movie recording with 3T full-brain BOLD fMRI with simultaneous 1000 Hz eyetracking.
For more information about the project visit: http://studyforrest.org
./sourcedata/acquisition_protocols/04-sT1W_3D_TFE_TR2300_TI900_0_7iso_FS.txt
./sourcedata/acquisition_protocols/05-sT2W_3D_TSE_32chSHC_0_7iso.txt
./sourcedata/acquisition_protocols/06-VEN_BOLD_HR_32chSHC.txt
./sourcedata/acquisition_protocols/07-DTI_high_2iso.txt
./sourcedata/acquisition_protocols/08-field_map.txt
Philips-specific MRI acquisition parameter dumps (plain text) for structural MRI (T1w, T2w, SWI, DTI, fieldmap -- in this order)
./sourcedata/acquisition_protocols/task01_fmri_session1.pdf
./sourcedata/acquisition_protocols/task01_fmri_session2.pdf
./sourcedata/acquisition_protocols/angio_session.pdf
Siemens-specific MRI acquisition parameter dumps (PDF format) for functional MRI and angiography.
./stimuli/annotations/german_audio_description.csv
Audio-description transcript
This transcript contains all information on the audio-movie content that cannot be inferred from the DVD release — in a plain text, comma-separated-value table. Start and end time stamp, as well as the spoken text are provided for each continuous audio description segment.
./stimuli/annotations/scenes.csv
Movie scenes
A plain text, comma-separated-value table with start and end time for all 198 scenes in the presented movie cut. In addition, each table row contains whether a scene takes place indoors or outdoors.
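A quick, hedged way to inspect both annotation tables is sketched below; the exact column headers are not specified here, so the snippet only loads the files and prints the first rows (add header=None to read_csv if the files turn out to have no header row).

import pandas as pd

audio_desc = pd.read_csv("./stimuli/annotations/german_audio_description.csv")
scenes = pd.read_csv("./stimuli/annotations/scenes.csv")
print(audio_desc.head())   # start/end time stamps and spoken text segments
print(scenes.head())       # start/end times and indoor/outdoor flag for the 198 scenes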
./stimuli/generate/generate_melt_cmds.py Python script to generate commands for stimuli generation
./stimuli/psychopy/buttons.csv
./stimuli/psychopy/forrest_gump.psyexp
./stimuli/psychopy/segment_cfg.csv
Source code of the stimuli presentation in PsychoPy
Prolonged quasi-natural auditory stimulation (Forrest Gump audio movie)
Eight approximately 15 min long recording runs, together comprising the entire duration of a two-hour presentation of an audio-only version of the Hollywood feature film "Forrest Gump" made for a visually impaired audience (German dubbing).
For each run, there are 4D volumetric images (160x160x36) in NIfTI format, one volume recorded every 2 s, obtained from a Siemens MR scanner at 7 Tesla using a T2*-weighted gradient-echo EPI sequence (1.4 mm isotropic voxel size). These images have partial brain coverage, centered on the auditory cortices in both brain hemispheres, and include frontal and posterior portions of the brain. There is no coverage of the upper portion of the brain (e.g. large parts of the motor and somatosensory cortices).
Several flavors of raw and preprocessed data are available:
Raw BOLD functional MRI
~~~~~~~~~~~~~~~~~~~~~~~
These raw data suffer from severe geometric distortions.
Filename examples for subject 01 and run 01
./sub-01/ses-forrestgump/func/sub-01_ses-forrestgump_task-forrestgump_acq-raw_run-01_bold.nii.gz BOLD data
./sourcedata/dicominfo/sub-01/ses-forrestgump/func/sub-01_ses-forrestgump_task-forrestgump_acq-raw_run-01_bold_dicominfo.txt Image property dump from DICOM conversion
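As a small sketch (using nibabel, which is not prescribed by the dataset itself), one raw BOLD run can be loaded to confirm the matrix size and repetition time described above.

import nibabel as nib

img = nib.load("./sub-01/ses-forrestgump/func/"
               "sub-01_ses-forrestgump_task-forrestgump_acq-raw_run-01_bold.nii.gz")
print(img.shape)               # expected (160, 160, 36, n_volumes)
print(img.header.get_zooms())  # ~1.4 mm voxels, with the 2 s TR as the fourth entry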
Raw BOLD functional MRI (with applied distortion correction)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Identical to the raw BOLD data, but with a scanner-side correction for geometric distortions applied (also including correction for participant motion). These data are most suitable for analysis of individual brains.
Filename examples for subject 01 and run 01
./sub-01/ses-forrestgump/func/sub-01_ses-forrestgump_task-forrestgump_acq-dico_run-01_bold.nii.gz BOLD data
./derivatives/motion/sub-01/ses-forrestgump/func/sub-01_ses-forrestgump_task-forrestgump_acq-dico_run-01_moco_ref.nii.gz Reference volume used for motion correction. Only runs 1 and 5 (first runs in each session)
./sourcedata/dicominfo/sub-01/ses-forrestgump/func/sub-01_ses-forrestgump_task-forrestgump_acq-dico_run-01_bold_dicominfo.txt Image property dump from DICOM conversion
Raw BOLD functional MRI (linear anatomical alignment)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
These images are motion and distortion corrected and have been anatomically aligned to a BOLD group template image that was generated from the entire group of participants.
The alignment procedure was linear (image projection using an affine transformation). These data are most suitable for group analyses and inter-individual comparisons.
Filename examples for subject 01 and run 01
./derivatives/linear_anatomical_alignment/sub-01/ses-forrestgump/func/sub-01_ses-forrestgump_task-forrestgump_rec-dico7Tad2grpbold7Tad_run-01_bold.nii.gz BOLD data
./derivatives/linear_anatomical_alignment/sub-01/ses-forrestgump/func/sub-01_ses-forrestgump_task-forrestgump_rec-dico7Tad2grpbold7TadBrainMask_run-01_bold.nii.gz Matching brain mask volume
./derivatives/linear_anatomical_alignment/sub-01/ses-forrestgump/func/sub-01_ses-forrestgump_task-forrestgump_rec-XFMdico7Tad2grpbold7Tad_run-01_bold.mat 4x4 affine transformation matrix (plain text format)
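The plain-text .mat file can be read as a 4x4 matrix, for example as sketched below; how the matrix is applied (voxel vs. world coordinates) depends on the registration tool used, so treat this purely as an illustration of the file format.

import numpy as np

xfm = np.loadtxt("./derivatives/linear_anatomical_alignment/sub-01/ses-forrestgump/func/"
                 "sub-01_ses-forrestgump_task-forrestgump_rec-XFMdico7Tad2grpbold7Tad_run-01_bold.mat")
print(xfm.shape)                            # (4, 4) affine into the group template space
point = np.array([10.0, 20.0, 15.0, 1.0])   # an example coordinate in homogeneous form
print(xfm @ point)                          # the same point after the affine mapping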
Raw BOLD functional MRI (non-linear anatomical alignment)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
These images are motion and distortion corrected and have been anatomically aligned to a BOLD group template image that was generated from the entire group of participants.
The alignment procedure was non-linear (image projection using an affine transformation with an additional transformation by non-linear warpfields). These data are most suitable for group analyses and inter-individual comparisons.
Filename examples for subject 01 and run 01
./derivatives/non-linear_anatomical_alignment/sub-01/ses-forrestgump/func/sub-01_ses-forrestgump_task-forrestgump_rec-dico7Tad2grpbold7TadNL_run-01_bold.nii.gz BOLD data
./derivatives/non-linear_anatomical_alignment/sub-01/ses-forrestgump/func/sub-01_ses-forrestgump_task-forrestgump_rec-dico7Tad2grpbold7TadBrainMaskNLBrainMask_run-01_bold.nii.gz Matching brain mask volume
./derivatives/non-linear_anatomical_alignment/sub-01/ses-forrestgump/func/sub-01_ses-forrestgump_task-forrestgump_rec-dico7Tad2grpbold7TadNLWarp_run-01_bold.nii.gz Warpfield (the associated affine transformation is identical to the "linear" alignment)
Participants were repeatedly stimulated with a total of 25 music clips, with and without speech content, from five different genres using a slow event-related paradigm.
Filename examples for subject 01 and run 01
./sub-01/ses-auditoryperception/func/sub-01_ses-auditoryperception_task-auditoryperception_run-01_bold.nii.gz
./sub-01/ses-auditoryperception/func/sub-01_ses-auditoryperception_task-auditoryperception_run-01_events.tsv
Filename examples for subject 01 and run 01
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Wikipedia plain text data obtained from Wikipedia dumps with WikiExtractor in February 2018.
The data come from all Wikipedias for which dumps could be downloaded at [https://dumps.wikimedia.org/]. This amounts to 297 Wikipedias, usually corresponding to individual languages and identified by their ISO codes. Several special Wikipedias are included, most notably "simple" (Simple English Wikipedia) and "incubator" (tiny hatching Wikipedias in various languages). For a list of all the Wikipedias, see [https://meta.wikimedia.org/wiki/List_of_Wikipedias].
The script which can be used to get a new version of the data is included, but note that Wikipedia limits the download speed for downloading a lot of the dumps, so it takes a few days to download all of them (but one or a few can be downloaded fast). Also, the format of the dumps changes from time to time, so the script will probably eventually stop working one day. The WikiExtractor tool [http://medialab.di.unipi.it/wiki/Wikipedia_Extractor] used to extract text from the Wikipedia dumps is not mine; I only modified it slightly to produce plaintext outputs [https://github.com/ptakopysk/wikiextractor].
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Research in information science and scholarly communication strongly relies on the availability of openly accessible datasets of metadata and, where possible, their relative payloads. To this end, CrossRef plays a pivotal role by providing free access to its entire metadata collection, and allowing other initiatives to link and enrich its information. Therefore, a number of key pieces of information end up scattered across diverse datasets and resources freely available online. As a result of this fragmentation, researchers in this domain end up struggling with daily integration problems, producing a plethora of ad-hoc datasets, thereby wasting time and resources and infringing open science best practices.
The latest DOIBoost release is a metadata collection that enriches CrossRef (October 2019 release: 108,048,986 publication records) with inputs from Microsoft Academic Graph (October 2019 release: 76,171,072 publication records), ORCID (October 2019 release: 12,642,131 publication records), and Unpaywall (August 2019 release: 26,589,869 publication records) for the purpose of supporting high-quality and robust research experiments. As a result of DOIBoost, CrossRef records have been "boosted" as follows:
47,254,618 CrossRef records have been enriched with an abstract from MAG;
33,279,428 CrossRef records have been enriched with an affiliation from MAG and/or ORCID;
509,588 CrossRef records have been enriched with an ORCID identifier from ORCID.
This entry consists of three files: doiboost_dump-2019-11-27.tar (contains a set of partXYZ.gz files, each one containing the JSON files relative to the enriched CrossRef records), schemaAndSample.zip, and termsOfUse.doc (contains details on the terms of use of DOIBoost).
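A rough sketch of peeking into the archive is shown below; it assumes each part file decompresses to one JSON record per line, which should be verified against schemaAndSample.zip before relying on it.

import gzip
import json
import tarfile

with tarfile.open("doiboost_dump-2019-11-27.tar") as tar:
    member = next(m for m in tar.getmembers() if m.name.endswith(".gz"))
    with gzip.open(tar.extractfile(member), mode="rt") as fh:
        first_record = json.loads(fh.readline())   # assumes JSON-lines content
print(sorted(first_record.keys()))                 # fields of one enriched CrossRef record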
Note that this record comes with two relationships to other results of this experiment:
link to the data paper: for more information on how the dataset is (and can be) generated;
link to the software: to repeat the experiment
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.
This repository contains two files:
The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.
The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:
In the remainder of this text, we give instructions for reproducing the analyses using the data provided in the dump, and for reproducing the collection by collecting data from GitHub again.
Reproducing the Analysis
This section shows how to load the data in the database and run the analysis notebooks. In the analysis, we used the following environment:
Ubuntu 18.04.1 LTS
PostgreSQL 10.6
Conda 4.5.1
Python 3.6.8
PdfCrop 2012/11/02 v1.38
First, download dump.tar.bz2 and extract it:
tar -xjf dump.tar.bz2
It extracts the file db2019-01-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:
psql jupyter < db2019-01-13.dump
It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTION:
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";
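For reference, a minimal sketch of how an analysis notebook can pick up this connection string with sqlalchemy (the exact usage inside the provided notebooks may differ):

import os
from sqlalchemy import create_engine, text

engine = create_engine(os.environ["JUP_DB_CONNECTION"])
with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())   # connection sanity check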
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Create a conda environment with Python 3.6:
conda create -n py36 python=3.6
Go to the analyses folder and install all the dependencies from requirements.txt:
cd jupyter_reproducibility/analyses
pip install -r requirements.txt
To reproduce the analyses, run Jupyter in this folder:
jupyter notebook
Execute the notebooks in this order:
Reproducing or Expanding the Collection
The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.
Requirements
This time, we have extra requirements:
All the analysis requirements
lbzip2 2.5
gcc 7.3.0
GitHub account
Gmail account
Environment
First, set the following environment variables:
export JUP_MACHINE="db"; # machine identifier
export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
export JUP_COMPRESSION="lbzip2"; # compression program
export JUP_VERBOSE="5"; # verbose level
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlalchemy connection
export JUP_GITHUB_USERNAME="github_username"; # your github username
export JUP_GITHUB_PASSWORD="github_password"; # your github password
export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
export JUP_OAUTH_FILE="~/oauth2_creds.json" # oauth2 authentication file
export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it blank
export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it blank
export JUP_WITH_EXECUTION="1"; # execute Python notebooks
export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependencies
export JUP_EXECUTION_MODE="-1"; # run following the execution order
export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
export JUP_NOTEBOOK_TIMEOUT="300"; # timeout for the extraction
# Frequency of log reports
export JUP_ASTROID_FREQUENCY="5";
export JUP_IPYTHON_FREQUENCY="5";
export JUP_NOTEBOOKS_FREQUENCY="5";
export JUP_REQUIREMENT_FREQUENCY="5";
export JUP_CRAWLER_FREQUENCY="1";
export JUP_CLONE_FREQUENCY="1";
export JUP_COMPRESS_FREQUENCY="5";
export JUP_DB_IP="localhost"; # postgres database IP
Then, configure the file ~/oauth2_creds.json according to the yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf
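A hedged example of the resulting notification setup is sketched below; the addresses simply mirror the placeholder values of the environment variables above.

import yagmail

yag = yagmail.SMTP("gmail@gmail.com", oauth2_file="~/oauth2_creds.json")
yag.send(to="target@email.com", subject="crawler status", contents="test notification")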
Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the directories; the second one should unmount it. You can leave the scripts blank, but it is not advisable, as the reproducibility study runs arbitrary code on your machine and you may lose your data.
Scripts
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Install 5 conda environments and 5 anaconda environments, one for each Python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (note that it is a local package that has not been published to PyPI; make sure to use the -e option):
Conda 2.7
conda create -n raw27 python=2.7 -y
conda activate raw27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 2.7
conda create -n py27 python=2.7 anaconda -y
conda activate py27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.4
It requires a manual jupyter and pathlib2 installation due to some incompatibilities found with the default installation.
conda create -n raw34 python=3.4 -y
conda activate raw34
conda install jupyter -c conda-forge -y
conda uninstall jupyter -y
pip install --upgrade pip
pip install jupyter
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
pip install pathlib2
Anaconda 3.4
conda create -n py34 python=3.4 anaconda -y
conda activate py34
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.5
conda create -n raw35 python=3.5 -y
conda activate raw35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.5
It requires the manual installation of other anaconda packages.
conda create -n py35 python=3.5 anaconda -y
conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
conda activate py35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.6
conda create -n raw36 python=3.6 -y
conda activate raw36
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.6
conda create -n py36 python=3.6 anaconda -y
conda activate py36
conda install -y anaconda-navigator jupyterlab_server navigator-updater
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.7
conda create -n raw37 python=3.7 -y
conda activate raw37
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.7
When we
https://www.datainsightsmarket.com/privacy-policy
The market for bulk dump trucks is anticipated to register a CAGR of 7% over the forecast period of 2023-2033, reaching a market size of 16,000 million value units by 2033. The growth of the market is attributed to the increasing demand for bulk materials in various industries such as construction, mining, and agriculture. Additionally, the rising adoption of automated and semi-automated bulk dump trucks is expected to further drive market growth. Manual bulk dump trucks currently account for the majority of the market share, but automated and semi-automated trucks are expected to gain traction due to their increased efficiency and safety features. North America and Europe are expected to remain the dominant regions in the bulk dump truck market, with a significant share in the global market. However, the Asia Pacific region is expected to witness the highest growth rate during the forecast period, driven by the increasing demand from developing countries such as China and India. Major companies operating in the bulk dump truck market include Automated Conveyor Company, CDS-LIPE, National Bulk Equipment, TOTE Systems, and Weening Brothers. These companies are focusing on product development and innovation to meet the evolving needs of customers and enhance their competitive advantages in the market.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains original data used for data analyses and visualizations discussed in the “Embracing data-driven decision making to manage and communicate the impact of big science collaborations” paper. Four High Energy Physics and Astrophysics projects were studied: ATLAS, BaBar, LIGO, and IceCube. Data for these projects was collected from INSPIRE-HEP (https://inspirehep.net), from dumps available at http://old.inspirehep.net/dumps/inspire-dump.html, on Jan 8, 2020. The Processed folder contains preprocessed data according to the code in the repository: https://github.com/bigscience/bigscience.
Data dump from the Research Organization Registry (ROR), a community-led registry of open identifiers for research organizations. Release v1.21 contains ROR IDs and metadata for 104,834 research organizations. This includes the addition of 104 new records and metadata updates to 265 existing records. See the release notes.
Starting with this release, the data dump includes a CSV version of the ROR data file in addition to the canonical JSON file. The data dump zip therefore now contains two files instead of one. If your code currently expects only one file, you will need to update it accordingly. The CSV contains a subset of fields from the JSON file, some of which have been flattened for easier parsing.
Beginning with its March 2022 release, ROR is curated independently from GRID. Semantic versioning beginning with v1.0 was added to reflect this departure from GRID. The existing data structure was not changed. From March 2022 onward, data releases are versioned as follows:
Minor versions (ex 1.1, 1.2, 1.3): Contain changes to data, such as new records and updates to existing records. No changes to the data model/structure.
Patch versions (ex 1.0.1): Used infrequently to correct errors in a release. No changes to the data model/structure.
Major versions (ex 1.x, 2.x, 3.x): Contain changes to the data model/structure, as well as the data itself. Major versions will be released with significant advance notice.
For convenience, the date is also included in the release file name, ex: v1.0-2022-03-15-ror-data.zip.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
See https://meta.wikimedia.org/wiki/Data_dumps for more detail on using these dumps.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Business process event data modeled as labeled property graphs
Data Format
-----------
The dataset comprises one labeled property graph in two different file formats.
#1) Neo4j .dump format
A neo4j (https://neo4j.com) database dump that contains the entire graph and can be imported into a fresh neo4j database instance using the following command; see also the neo4j documentation: https://neo4j.com/docs/
/bin/neo4j-admin.(bat|sh) load --database=graph.db --from=
The .dump was created with Neo4j v3.5.
#2) .graphml format
A .zip file containing a .graphml file of the entire graph
Data Schema
-----------
The graph is a labeled property graph over business process event data. Each graph uses the following concepts:
:Event nodes - each event node describes a discrete event, i.e., an atomic observation described by attribute "Activity" that occurred at the given "timestamp"
:Entity nodes - each entity node describes an entity (e.g., an object or a user), it has an EntityType and an identifier (attribute "ID")
:Log nodes - describes a collection of events that were recorded together, most graphs only contain one log node
:Class nodes - each class node describes a type of observation that has been recorded, e.g., the different types of activities that can be observed, :Class nodes group events into sets of identical observations
:CORR relationships - from :Event to :Entity nodes, describes whether an event is correlated to a specific entity; an event can be correlated to multiple entities
:DF relationships - "directly-followed by" between two :Event nodes describes which event is directly-followed by which other event; both events in a :DF relationship must be correlated to the same entity node. All :DF relationships form a directed acyclic graph.
:HAS relationship - from a :Log to an :Event node, describes which events had been recorded in which event log
:OBSERVES relationship - from an :Event to a :Class node, describes to which event class an event belongs, i.e., which activity was observed in the graph
:REL relationship - placeholder for any structural relationship between two :Entity nodes
The concepts are further defined in Stefan Esser, Dirk Fahland: Multi-Dimensional Event Data in Graph Databases. CoRR abs/2005.14552 (2020) https://arxiv.org/abs/2005.14552
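As an illustrative query against a restored instance of the dump (the bolt URI and credentials below are placeholders for whatever your local Neo4j is configured with), the schema can be explored with the official Python driver:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "neo4j"))
with driver.session() as session:
    result = session.run(
        "MATCH (e:Event)-[:CORR]->(n:Entity) "
        "RETURN n.EntityType AS entity_type, count(e) AS events "
        "ORDER BY events DESC LIMIT 5"
    )
    for record in result:
        print(record["entity_type"], record["events"])  # events correlated per entity type
driver.close()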
Data Contents
-------------
neo4j-bpic19-2021-02-17 (.dump|.graphml.zip)
An integrated graph describing the raw event data of the entire BPI Challenge 2019 dataset.
van Dongen, B.F. (Boudewijn) (2019): BPI Challenge 2019. 4TU.ResearchData. Collection. https://doi.org/10.4121/uuid:d06aff4b-79f0-45e6-8ec8-e19730c248f1
This data originated from a large multinational company operating from The Netherlands in the area of coatings and paints, and we ask participants to investigate the purchase order handling process for some of its 60 subsidiaries. In particular, the process owner has compliance questions. In the data, each purchase order (or purchase document) contains one or more line items. For each line item, there are roughly four types of flows in the data:
(1) 3-way matching, invoice after goods receipt: For these items, the value of the goods receipt message should be matched against the value of an invoice receipt message and the value put during creation of the item (indicated by both the GR-based flag and the Goods Receipt flags set to true).
(2) 3-way matching, invoice before goods receipt: Purchase items that do require a goods receipt message, while they do not require GR-based invoicing (indicated by the GR-based IV flag set to false and the Goods Receipt flags set to true). For such purchase items, invoices can be entered before the goods are received, but they are blocked until the goods are received. This unblocking can be done by a user, or by a batch process at regular intervals. Invoices should only be cleared if goods are received and the value matches with the invoice and the value at creation of the item.
(3) 2-way matching (no goods receipt needed): For these items, the value of the invoice should match the value at creation (in full or partially until the PO value is consumed), but there is no separate goods receipt message required (indicated by both the GR-based flag and the Goods Receipt flags set to false).
(4) Consignment: For these items, there are no invoices on PO level as this is handled fully in a separate process. Here we see the GR indicator is set to true but the GR IV flag is set to false, and we also know by item type (consignment) that we do not expect an invoice against this item.
Unfortunately, the complexity of the data goes further than just this division into four categories. For each purchase item, there can be many goods receipt messages and corresponding invoices which are subsequently paid. Consider for example the process of paying rent. There is a Purchase Document with one item for paying rent, but a total of 12 goods receipt messages with (cleared) invoices with a value equal to 1/12 of the total amount. For logistical services, there may even be hundreds of goods receipt messages for one line item. Overall, for each line item, the amounts of the line item, the goods receipt messages (if applicable) and the invoices have to match for the process to be compliant.
Of course, the log is anonymized, but some semantics are left in the data, for example: The resources are split between batch users and normal users indicated by their name. The batch users are automated processes executed by different systems. The normal users refer to human actors in the process. The monetary values of each event are anonymized from the original data using a linear translation respecting 0, i.e. addition of multiple invoices for a single item should still lead to the original item worth (although there may be small rounding errors for numerical reasons). Company, vendor, system and document names and IDs are anonymized in a consistent way throughout the log. The company has the key, so any result can be translated by them to business insights about real customers and real purchase documents.
The case ID is a combination of the purchase document and the purchase item. There is a total of 76,349 purchase documents containing in total 251,734 items, i.e. there are 251,734 cases. In these cases, there are 1,595,923 events relating to 42 activities performed by 627 users (607 human users and 20 batch users). Sometimes the user field is empty, or NONE, which indicates no user was recorded in the source system. For each purchase item (or case) the following attributes are recorded:
concept:name: A combination of the purchase document id and the item id
Purchasing Document: The purchasing document ID
Item: The item ID
Item Type: The type of the item
GR-Based Inv. Verif.: Flag indicating if GR-based invoicing is required (see above)
Goods Receipt: Flag indicating if 3-way matching is required (see above)
Source: The source system of this item
Doc. Category name: The name of the category of the purchasing document
Company: The subsidiary of the company from where the purchase originated
Spend classification text: A text explaining the class of purchase item
Spend area text: A text explaining the area for the purchase item
Sub spend area text: Another text explaining the area for the purchase item
Vendor: The vendor to which the purchase document was sent
Name: The name of the vendor
Document Type: The document type
Item Category: The category as explained above (3-way with GR-based invoicing, 3-way without, 2-way, consignment)
The data contains the following entities and their events
- PO - Purchase Order documents handled at a large multinational company operating from The Netherlands
- POItem - an item in a Purchase Order document describing a specific item to be purchased
- Resource - the user or worker handling the document or a specific item
- Vendor - the external organization from which an item is to be purchased
Data Size
---------
BPIC19, nodes: 1926651, relationships: 15082099