100+ datasets found
  1. Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter...

    • zenodo.org
    application/gzip
    Updated Mar 16, 2021
    Cite
    João Felipe; Leonardo; Vanessa; Juliana (2021). Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks / Understanding and Improving the Quality and Reproducibility of Jupyter Notebooks [Dataset]. http://doi.org/10.5281/zenodo.3519618
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Mar 16, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    João Felipe; Leonardo; Vanessa; Juliana
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub. Based on the results, we proposed and evaluated Julynter, a linting tool for Jupyter Notebooks.

    Papers:

    This repository contains three files:

    Reproducing the Notebook Study

    The db2020-09-22.dump.gz file contains a PostgreSQL dump of the database, with all the data we extracted from notebooks. For loading it, run:

    gunzip -c db2020-09-22.dump.gz | psql jupyter
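
    Note that psql restores into an existing database, so it may need to be created first (a minimal sketch, assuming a local PostgreSQL with sufficient privileges):

    createdb jupyter                               # create the empty target database
    gunzip -c db2020-09-22.dump.gz | psql jupyter  # then load the dump as above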

    Note that this file contains only the database with the extracted data. The actual repositories are available in a Google Drive folder, which also contains the docker images we used in the reproducibility study. The repositories are stored as content/{hash_dir1}/{hash_dir2}.tar.bz2, where hash_dir1 and hash_dir2 are columns of repositories in the database.
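
    For example, the archive path for a given repository can be looked up directly in the restored database (a sketch; the repositories table and the hash_dir1/hash_dir2 columns come from the description above):

    # each row maps to content/{hash_dir1}/{hash_dir2}.tar.bz2
    psql jupyter -c "SELECT hash_dir1, hash_dir2 FROM repositories LIMIT 5;"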

    For scripts, notebooks, and detailed instructions on how to analyze or reproduce the data collection, please check the instructions on the Jupyter Archaeology repository (tag 1.0.0).

    The sample.tar.gz file contains the repositories obtained during the manual sampling.

    Reproducing the Julynter Experiment

    The julynter_reproducibility.tar.gz file contains all the data collected in the Julynter experiment and the analysis notebooks. Reproducing the analysis is straightforward:

    • Uncompress the file: $ tar zxvf julynter_reproducibility.tar.gz
    • Install the dependencies: $ pip install -r julynter/requirements.txt
    • Run the notebooks in order: J1.Data.Collection.ipynb; J2.Recommendations.ipynb; J3.Usability.ipynb.

    The collected data is stored in the julynter/data folder.

    Changelog

    2019/01/14 - Version 1 - Initial version
    2019/01/22 - Version 2 - Update N8.Execution.ipynb to calculate the rate of failure for each reason
    2019/03/13 - Version 3 - Update package for camera ready. Add columns to db to detect duplicates, change notebooks to consider them, and add N1.Skip.Notebook.ipynb and N11.Repository.With.Notebook.Restriction.ipynb.
    2021/03/15 - Version 4 - Add Julynter experiment; Update database dump to include new data collected for the second paper; remove scripts and analysis notebooks from this package (moved to GitHub), add a link to Google Drive with collected repository files

  2. CSV Data Dump for 31 Day Model Run

    • brunel.figshare.com
    xlsx
    Updated Jul 14, 2023
    Cite
    Ioana Pisica; Alex Gray (2023). CSV Data Dump for 31 Day Model Run [Dataset]. http://doi.org/10.17633/rd.brunel.23545038.v1
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jul 14, 2023
    Dataset provided by
    Brunel University London
    Authors
    Ioana Pisica; Alex Gray
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The target company's hydraulic modelling package uses Innovyze Infoworks™. This product enables third-party integration through APIs and Ruby scripts when the ICM Exchange service is enabled. As a result, the research looked at opportunities to exploit scripting in order to run the chosen optimisation strategy. The first approach investigated the use of a CS-script tool that would export the results tables directly from the Innovyze Infoworks™ environment into CSV-format workbooks. From there the data could be inspected, with mathematical tooling applied to optimise the pump start parameters before returning these to the model and rerunning. Note that the computational resource the research obtained to deploy the modelling and analysis tools comprised the following specification.

    Hardware

    Dell Poweredge R720

    Intel Xeon Processor E5-2600 v2

    2x Processor Sockets

    32 GB random access memory (RAM) – 1866 MT/s

    Virtual Machine

    Hosted on VMWare Hypervisor v6.0.

    Windows Server 2012R2.

    Microsoft Excel 64bit.

    16 virtual central processing units (V-CPUs).

    Full provision of 32GB RAM – 1866MT/s.

    Issues were highlighted in the first round of data exports as, even with a dedicated server offering 16 V-CPUs and the specification shown above, the Excel frontend environment was unable to process the very large data matrices being generated. There were regular failures of the Excel executable, which led to an overall inability to inspect the data, let alone run calculations on the matrices. When considering the five-second sample over 31 days, this resulted in matrices in the order of [44x535682] per model run (a quick sanity check: 31 days × 86,400 s/day ÷ 5 s = 535,680 samples, i.e. roughly 23.6 million cells across 44 columns), with the calculations in (14-19) needing to be applied on a per-cell basis.

  3. Geochemistry and microbiology data collected to study the effects of oil and...

    • catalog.data.gov
    • data.usgs.gov
    Updated Jul 6, 2024
    Cite
    U.S. Geological Survey (2024). Geochemistry and microbiology data collected to study the effects of oil and gas wastewater dumping on arid lands in New Mexico [Dataset]. https://catalog.data.gov/dataset/geochemistry-and-microbiology-data-collected-to-study-the-effects-of-oil-and-gas-wastewate
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    U.S. Geological Survey
    Area covered
    New Mexico
    Description

    The Permian Basin, straddling New Mexico and Texas, is one of the most productive oil and gas (OG) provinces in the United States. OG production yields large volumes of wastewater that contain elevated concentrations of major ions including salts (also referred to as brines), and trace organic and inorganic constituents. These OG wastewaters pose unknown environmental health risks, particularly in the case of accidental or intentional releases. Releases of OG wastewaters have resulted in water-quality and environmental health effects at sites in West Virginia (Akob et al., 2016; Orem et al., 2017; Kassotis et al., 2016) and in the Williston Basin region in Montana and North Dakota (Cozzarelli et al., 2017; Cozzarelli et al., 2021; Lauer et al., 2016; Gleason et al., 2014; Mills et al., 2011). Starting in November 2017, 39 illegal dumps of OG wastewater were identified in southeastern New Mexico on public lands by the Bureau of Land Management (BLM). Illegal dumping is an unpermitted release of waste materials that is in violation of Federal and State laws including the U.S. Resource Conservation and Recovery Act (U.S. EPA, 1976), the Federal Land Policy and Management Act (U.S. DOI, 2016; 43 USC 1701(a)(8); 43 USC 1733(g)), the State of New Mexico's Oil and Gas Act (New Mexico Legislature, 2019), and New Mexico Administrative Code § 19.15.34.20. To evaluate the effects of these releases, changes in soil geochemistry and microbial community structure at 6 sites were analyzed by comparing soils from within OG wastewater dump-affected zones to corresponding unaffected (control) soils. In addition, the effects on local vegetation were evaluated by measuring the chemistry of 4 plant species from dump-affected and control zones at a single site. Samples of local produced waters were geochemically and isotopically characterized to link soil geochemistry to reservoir geochemistry. These data sets included field observations; soil water extractable inorganic chemical composition, pH, strontium (Sr) isotopes, and specific conductance; bulk soil Raman, carbon (C), nitrogen (N), mercury (Hg), radium (Ra) and thorium (Th) isotopes, and percent moisture; plant inorganic chemical composition; and soil microbial community composition data. At each site, triplicate soil samples were collected from dump-affected and control zones and duplicate field samples were collected at each site.
    This data release includes eleven data tables provided in machine-readable comma-separated values format (*.csv):

    • T01_Permian_Data_Dictionary.csv, the entity and attribute metadata section for tables T02-T11 in table format;
    • T02_Soil_Geochemistry.csv, descriptions of sampling sites and concentrations of major anions, cations, and trace elements from the soil samples;
    • T03_Plant_Geochemistry.csv, concentrations of major anions, cations, trace elements, and Sr isotopes from the vegetation samples;
    • T04_Soil_Isotopes.csv, Sr, Ra, and Th isotopes from the soils;
    • T05_Raman_Counts.csv, Raman spectra counts from the soil samples;
    • T06_Raman_Band_Separation.csv, Raman band separation from selected soil samples;
    • T07_Soil_Organics_Spectra.csv, spectral data of alkane unresolved complex mixtures (UCMs) from soil extracts;
    • T08_Soil_Organics_Summary.csv, a summary of alkane UCMs from soil extracts;
    • T09_Soil_16S_BIOM.csv, microbial operational taxonomic units from the soils;
    • T10_Produced_Water.csv, selected geochemistry and isotopic measurements from produced water samples;
    • T11_Limits_AnalyticalMethods.csv, a listing of analytical detection limits.

  4. Dump truck object detection dataset including scale-models

    • demo.researchdata.se
    • researchdata.se
    Updated May 8, 2020
    Cite
    Carl Borngrund (2020). Dump truck object detection dataset including scale-models [Dataset]. http://doi.org/10.5878/8z9b-1718
    Explore at:
    Dataset updated
    May 8, 2020
    Dataset provided by
    Luleå University of Technology
    Authors
    Carl Borngrund
    Description

    Object detection is a vital part of any autonomous vision system, and obtaining a high-performing object detector requires data. The object detection task aims to detect and classify objects using camera input, producing bounding boxes that contain the objects as output. This is usually done by utilizing deep neural networks.

    Training an object detector uses a large amount of data, but it is not always practical to collect. This has led to multiple techniques that decrease the amount of data needed, such as transfer learning and domain adaptation. Working with construction equipment is a time-consuming process, and we wanted to examine whether it was possible to use scale-model data to train a network and then use that network to detect real objects with no additional training.

    This small dataset contains training and validation data of a scale dump truck in different environments while the test set contains images of a full size dump truck of similar model. The aim of the dataset is to train a network to classify wheels, cabs and tipping bodies of a scale-model dump truck and use that to classify the same classes on a full-scale dump truck.

    The label structure of the dataset is the YOLO v3 structure, where each class corresponds to an integer value: Wheel: 0, Cab: 1, Tipping body: 2.
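
    For reference, YOLO-style labels store one object per line as the class index followed by a normalized bounding box (the example values below are hypothetical):

    # class x_center y_center width height  (coordinates relative to image size, in [0, 1])
    0 0.512 0.634 0.120 0.110   # wheel
    1 0.300 0.410 0.180 0.220   # cab
    2 0.700 0.380 0.350 0.300   # tipping body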

  5. MoreFixes: Largest CVE dataset with fixes

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 23, 2024
    Cite
    GADYATSKAYA, Olga (2024). MoreFixes: Largest CVE dataset with fixes [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11199119
    Explore at:
    Dataset updated
    Oct 23, 2024
    Dataset provided by
    GADYATSKAYA, Olga
    Akhoundali, Jafar
    Rahim Nouri, Sajad
    Rietveld, Kristian F. D.
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    In our work, we have designed and implemented a novel workflow with several heuristic methods to combine state-of-the-art methods for gathering CVE fix commits. As a consequence of our improvements, we have been able to gather the largest programming-language-independent real-world dataset of CVE vulnerabilities with the associated fix commits. Our dataset, containing 29,203 unique CVEs coming from 7,238 unique GitHub projects, is, to the best of our knowledge, by far the biggest CVE vulnerability dataset with fix commits available today. These CVEs are associated with 35,276 unique commits in the SQL dump and 39,931 patch commit files that fixed those vulnerabilities (some patch files cannot be saved in the SQL dump for several technical reasons). Our larger dataset thus substantially improves over the current real-world vulnerability datasets and enables further progress in research on vulnerability detection and software security. We used the NVD (nvd.nist.gov) and the GitHub Security Advisory Database as the main sources of our pipeline.

    We release to the community a 16GB PostgreSQL database that contains information on CVEs up to 2024-09-26, CWEs of each CVE, files and methods changed by each commit, and repository metadata. Additionally, patch files related to the fix commits are available as a separate package. Furthermore, we make our dataset collection tool also available to the community.

    The cvedataset-patches.zip file contains fix patches, and postgrescvedumper.sql.zip contains a PostgreSQL dump of fixes, together with several other fields such as CVEs, CWEs, repository metadata, commit data, file changes, methods changed, etc.

    The MoreFixes data-storage strategy is based on CVEFixes for storing CVE fix commits from open-source repositories, and uses a modified version of Prospector (part of Project KB from SAP) as a module to detect the fix commits of a CVE. Our full methodology is presented in the paper "MoreFixes: A Large-Scale Dataset of CVE Fix Commits Mined through Enhanced Repository Discovery", published at the PROMISE conference (2024).

    For more information about usage and sample queries, visit the Github repository: https://github.com/JafarAkhondali/Morefixes

    If you are using this dataset, please be aware that the repositories we mined carry different licenses and you are responsible for handling any licensing issues. The same applies to CVEFixes.

    This product uses the NVD API but is not endorsed or certified by the NVD.

    This research was partially supported by the Dutch Research Council (NWO) under the project NWA.1215.18.008 Cyber Security by Integrated Design (C-SIDe).

    To restore the dataset, you can use the docker-compose file available at the GitHub repository. Default database credentials after restoring the dump:

    POSTGRES_USER=postgrescvedumper POSTGRES_DB=postgrescvedumper POSTGRES_PASSWORD=a42a18537d74c3b7e584c769152c3d
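
    Once the containers are up, the restored database can be sanity-checked with psql (a sketch assuming the default credentials above and a PostgreSQL instance reachable on localhost at the default port; host and port may differ in your setup):

    psql "postgresql://postgrescvedumper:a42a18537d74c3b7e584c769152c3d@localhost:5432/postgrescvedumper" -c '\dt'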

    Please use this for citation:

    @inproceedings{morefixes2024,
     title={MoreFixes: A large-scale dataset of CVE fix commits mined through enhanced repository discovery},
     author={Akhoundali, Jafar and Nouri, Sajad Rahim and Rietveld, Kristian and Gadyatskaya, Olga},
     booktitle={Proceedings of the 20th International Conference on Predictive Models and Data Analytics in Software Engineering},
     pages={42--51},
     year={2024}
    }
    
  6. RealNews Dataset

    • paperswithcode.com
    • opendatalab.com
    Cite
    Rowan Zellers; Ari Holtzman; Hannah Rashkin; Yonatan Bisk; Ali Farhadi; Franziska Roesner; Yejin Choi, RealNews Dataset [Dataset]. https://paperswithcode.com/dataset/realnews
    Explore at:
    Authors
    Rowan Zellers; Ari Holtzman; Hannah Rashkin; Yonatan Bisk; Ali Farhadi; Franziska Roesner; Yejin Choi
    Description

    RealNews is a large corpus of news articles scraped from Common Crawl, limited to the 5,000 news domains indexed by Google News. The authors used the Newspaper Python library to extract the body and metadata from each article. News from Common Crawl dumps from December 2016 through March 2019 were used as training data; articles published in April 2019, from the April 2019 dump, were used for evaluation. After deduplication, RealNews is 120 gigabytes without compression.

  7. The dataset of the Global Collections survey of natural history collections

    • zenodo.org
    • data.niaid.nih.gov
    bin, pdf, txt, zip
    Updated Jul 16, 2024
    Cite
    Matt Woodburn; Robert J. Corrigan; Nicholas Drew; Cailin Meyer; Vincent S. Smith; Sarah Vincent (2024). The dataset of the Global Collections survey of natural history collections [Dataset]. http://doi.org/10.5281/zenodo.6985399
    Explore at:
    Available download formats: pdf, bin, zip, txt
    Dataset updated
    Jul 16, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Matt Woodburn; Robert J. Corrigan; Nicholas Drew; Cailin Meyer; Vincent S. Smith; Sarah Vincent
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    From 2016 to 2018, we surveyed the world’s largest natural history museum collections to begin mapping this globally distributed scientific infrastructure. The resulting dataset includes 73 institutions across the globe. It has:

    • Basic institution data for the 73 contributing institutions, including estimated total collection sizes, geographic locations (to the city) and latitude/longitude, and Research Organization Registry (ROR) identifiers where available.

    • Resourcing information, covering the numbers of research, collections and volunteer staff in each institution.

    • Indicators of the presence and size of collections within each institution broken down into a grid of 19 collection disciplines and 16 geographic regions.

    • Measures of the depth and breadth of individual researcher experience across the same disciplines and geographic regions.

    This dataset contains the data (raw and processed) collected for the survey, and specifications for the schema used to store the data. It includes:

    1. A diagram of the MySQL database schema.
    2. A SQL dump of the MySQL database schema, excluding the data.
    3. A SQL dump of the MySQL database schema with all data. This may be imported into an instance of MySQL Server to create a complete reconstruction of the database (see the import sketch after this list).
    4. Raw data from each database table in CSV format.
    5. A set of more human-readable views of the data in CSV format. These correspond to the database tables, but foreign keys are substituted for values from the linked tables to make the data easier to read and analyse.
    6. A text file containing the definitions of the size categories used in the collection_unit table.
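
    As referenced in item 3, importing the full dump into a local MySQL server typically looks like this (a sketch; the database name and dump file name are hypothetical placeholders for the files in this package):

    mysql -u username -p -e "CREATE DATABASE global_collections;"
    mysql -u username -p global_collections < schema_with_data.sql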

    The global collections data may also be accessed at https://rebrand.ly/global-collections. This is a preliminary dashboard, constructed and published using Microsoft Power BI, that enables the exploration of the data through a set of visualisations and filters. The dashboard consists of three pages:

    Institutional profile: Enables the selection of a specific institution and provides summary information on the institution and its location, staffing, total collection size, collection breakdown and researcher expertise.

    Overall heatmap: Supports an interactive exploration of the global picture, including a heatmap of collection distribution across the discipline and geographic categories, and visualisations that demonstrate the relative breadth of collections across institutions and correlations between collection size and breadth. Various filters allow the focus to be refined to specific regions and collection sizes.

    Browse: Provides some alternative methods of filtering and visualising the global dataset to look at patterns in the distribution and size of different types of collections across the global view.

  8. openlegaldata-bulk-data

    • huggingface.co
    Updated Sep 5, 2023
    Cite
    Lennard Zündorf (2023). openlegaldata-bulk-data [Dataset]. https://huggingface.co/datasets/LennardZuendorf/openlegaldata-bulk-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 5, 2023
    Authors
    Lennard Zündorf
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for openlegaldata.io bulk case data

      Dataset Description
    

    This is a copy of the latest dump from openlegaldata.io. I will try to keep this updated, since there is no official Hugging Face dataset repo.

    Homepage: https://de.openlegaldata.io/
    Repository: Bulk Data

      Dataset Summary
    

    This is the openlegaldata bulk case download from October 2022. Please refer to the official website (above) for any more information. I have not made any changes for… See the full description on the dataset page: https://huggingface.co/datasets/LennardZuendorf/openlegaldata-bulk-data.
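
    One way to fetch the files locally is the Hugging Face CLI (a sketch; assumes a recent huggingface_hub release where the download command is available):

    pip install -U huggingface_hub
    huggingface-cli download LennardZuendorf/openlegaldata-bulk-data --repo-type dataset --local-dir openlegaldata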

  9. dumpdata

    • kaggle.com
    Updated Apr 26, 2020
    Cite
    Serdar Goler (2020). dumpdata [Dataset]. https://www.kaggle.com/serdargoler/dumpdata/kernels
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 26, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Serdar Goler
    Description

    Context

    How large is the impact of a dump site on house prices in the area?

    Content

    You work for a local government agency. They need to locate a new garbage dump site near the city and are looking for the optimal location to minimize its impact on house prices in the area. Your task is to take the available historical data about house prices near the dump sites for the two available years and estimate the impact of a dump site's vicinity on house prices. Present an econometric model.
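
    One possible starting point (an illustrative sketch, not part of the original task statement) is a difference-in-differences style regression over the two years:

    price_ht = b0 + b1*near_h + b2*year2_t + b3*(near_h * year2_t) + e_ht

    where near_h flags houses close to a dump site and year2_t marks the later of the two years; the coefficient b3 then estimates the effect of dump-site vicinity on prices.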

  10. ROR Data

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 3, 2025
    Cite
    Research Organization Registry (2025). ROR Data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6347574
    Explore at:
    Dataset updated
    Apr 3, 2025
    Dataset authored and provided by
    Research Organization Registry (https://ror.org/)
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Data dump from the Research Organization Registry (ROR), a community-led registry of open identifiers for research organizations.

    Release v1.63 contains ROR IDs and metadata for 115,299 research organizations in JSON and CSV format, in schema versions 1 and 2. This includes the addition of 287 new records and metadata updates to 105 existing records. See the release notes.

    Data format
    Beginning with release v1.45 on 11 April 2024, data releases contain JSON and CSV files formatted according to both schema v1 and schema v2. v2 files have _schema_v2 appended to the end of the filename, ex v1.45-2024-04-11-ror-data_schema_v2.json. In order to maintain compatibility with previous releases, v1 files have no schema version information in the filename, ex v1.45-2024-04-11-ror-data.json.
    For both versions, the CSV file contains a subset of fields from the JSON file, some of which have been flattened for easier parsing. As ROR records and the ROR schema are maintained in JSON, CSVs are for convenience only. JSON is the format of record.
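    A quick way to inspect a release locally (a sketch using the v1.45 file names quoted above; assumes unzip and jq are available):

    unzip v1.45-2024-04-11-ror-data.zip
    jq '.[0] | {id, name}' v1.45-2024-04-11-ror-data.json   # peek at the first record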
    Release versioning
    Beginning with v1.45 in April 2024, ROR has introduced schema versioning, with files available in schema v1 and schema v2. The ROR API default version, however, remains v1 and will be changed to v2 in April 2025. To align with the API, the data dump major version will remain 1 until the API default version is changed to v2. At that time, the data dump major version will be incremented to 2 per below.
    Data releases are versioned as follows:

    Minor versions (ex 1.1, 1.2, 1.3): Contain changes to data, such as new records and updates to existing records. No changes to the data model/structure.

    Patch versions (ex 1.0.1): Used infrequently to correct errors in a release. No changes to the data model/structure.

    Major versions (ex 1.x, 2.x, 3.x): Contain changes to the data model/structure, as well as the data itself. Major versions will be released with significant advance notice.

    For convenience, the date is also included in the release file name, ex: v1.0-2022-03-15-ror-data.zip.
    The ROR data dump is provided under the Creative Commons CC0 Public Domain Dedication. Location data in ROR comes from GeoNames and is licensed under a Creative Commons Attribution 4.0 license.

  11. Forrest Gump

    • openneuro.org
    Updated Sep 12, 2018
    Cite
    Michael Hanke; Florian J. Baumgartner; Pierre Ibe; Falko R. Kaule; Stefan Pollmann; Oliver Speck; Wolf Zinke; Jorg Stadler (2018). Forrest Gump [Dataset]. http://doi.org/10.18112/openneuro.ds000113.v1.1.0
    Explore at:
    Dataset updated
    Sep 12, 2018
    Dataset provided by
    OpenNeuro (https://openneuro.org/)
    Authors
    Michael Hanke; Florian J. Baumgartner; Pierre Ibe; Falko R. Kaule; Stefan Pollmann; Oliver Speck; Wolf Zinke; Jorg Stadler
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Note: This dataset is the combination of four related datasets that were originally hosted on OpenfMRI.org: ds000113, ds000113b, ds000113c and ds000113d. The combined dataset is now in BIDS format and is simply referred to as ds000113 on OpenNeuro.org.

    For more information about the project visit: http://studyforrest.org

    This dataset contains high-resolution functional magnetic resonance (fMRI) data from 20 participants recorded at high field strength (7 Tesla) during prolonged stimulation with an auditory feature film ("Forrest Gump"). In addition, a comprehensive set of auxiliary data (T1w, T2w, DTI, susceptibility-weighted image, angiography) as well as measurements to assess technical and physiological noise components have been acquired. An initial analysis confirms that these data can be used to study common and idiosyncratic brain response patterns to complex auditory stimulation. Among the potential uses of this dataset is the study of auditory attention and cognition, language and music perception, as well as social perception. The auxiliary measurements enable a large variety of additional analysis strategies that relate functional response patterns to structural properties of the brain. Alongside the acquired data, we provide source code and detailed information on all employed procedures — from stimulus creation to data analysis. (https://www.nature.com/articles/sdata20143)

    The dataset also contains data from the same twenty participants while being repeatedly stimulated with a total of 25 music clips, with and without speech content, from five different genres using a slow event-related paradigm. It also includes raw fMRI data, as well as pre-computed structural alignments for within-subject and group analysis.

    Additionally, seven of the twenty subjects participated in another study: empirical ultra high-field fMRI data recorded at four spatial resolutions (0.8 mm, 1.4 mm, 2 mm, and 3 mm isotropic voxel size) for orientation decoding in visual cortex — in order to test hypotheses on the strength and spatial scale of orientation discriminating signals. (https://www.sciencedirect.com/science/article/pii/S2352340917302056)

    Finally, there are additional acquisitions for fifteen of the twenty participants: retinotopic mapping, a localizer paradigm for higher visual areas (FFA, EBA, PPA), and another 2-hour movie recording with 3T full-brain BOLD fMRI with simultaneous 1000 Hz eyetracking.

    For more information about the project visit: http://studyforrest.org

    Dataset content overview

    Stimulus material and protocol descriptions

    ./sourcedata/acquisition_protocols/04-sT1W_3D_TFE_TR2300_TI900_0_7iso_FS.txt
    ./sourcedata/acquisition_protocols/05-sT2W_3D_TSE_32chSHC_0_7iso.txt
    ./sourcedata/acquisition_protocols/06-VEN_BOLD_HR_32chSHC.txt
    ./sourcedata/acquisition_protocols/07-DTI_high_2iso.txt
    ./sourcedata/acquisition_protocols/08-field_map.txt
    Philips-specific MRI acquisition parameter dumps (plain text) for structural MRI (T1w, T2w, SWI, DTI, fieldmap -- in this order)

    ./sourcedata/acquisition_protocols/task01_fmri_session1.pdf
    ./sourcedata/acquisition_protocols/task01_fmri_session2.pdf
    ./sourcedata/acquisition_protocols/angio_session.pdf
    Siemens-specific MRI acquisition parameter dumps (PDF format) for functional MRI and angiography.

    ./stimuli/annotations/german_audio_description.csv

    Audio-description transcript

    This transcript contains all information on the audio-movie content that cannot be inferred from the DVD release — in a plain text, comma-separated-value table. Start and end time stamp, as well as the spoken text are provided for each continuous audio description segment.
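
    For orientation, a row in such a table might look like this (illustrative only; consult the file header for the exact columns):

    start,end,text
    12.4,17.9,"Er sitzt auf einer Bank."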

    ./stimuli/annotations/scenes.csv

    Movie scenes

    A plain text, comma-separated-value table with start and end time for all 198 scenes in the presented movie cut. In addition, each table row contains whether a scene takes place indoors or outdoors.

    ./stimuli/generate/generate_melt_cmds.py
    Python script to generate commands for stimuli generation

    ./stimuli/psychopy/buttons.csv
    ./stimuli/psychopy/forrest_gump.psyexp
    ./stimuli/psychopy/segment_cfg.csv
    Source code of the stimuli presentation in PsychoPy

    Functional imaging - Forrest Gump Task

    Prolonged quasi-natural auditory stimulation (Forrest Gump audio movie)

    Eight approximately 15 min long recording runs, together comprising the entire duration of a two-hour presentation of an audio-only version of the Hollywood feature film "Forrest Gump" made for a visually impaired audience (German dubbing).

    For each run, there are 4D volumetric images (160x160x36) in NIfTI format, one volume recorded every 2 s, obtained from a Siemens MR scanner at 7 Tesla using a T2*-weighted gradient-echo EPI sequence (1.4 mm isotropic voxel size). These images have partial brain coverage, centered on the auditory cortices in both brain hemispheres, and include frontal and posterior portions of the brain. There is no coverage of the upper portion of the brain (e.g. large parts of motor and somato-sensory cortices).

    Several flavors of raw and preprocessed data are available:

    Raw BOLD functional MRI

    These raw data suffer from severe geometric distortions.

    Filename examples for subject 01 and run 01

    ./sub-01/ses-forrestgump/func/sub-01_ses-forrestgump_task-forrestgump_acq-raw_run-01_bold.nii.gz BOLD data

    ./sourcedata/dicominfo/sub-01/ses-forrestgump/func/sub-01_ses-forrestgump_task-forrestgump_acq-raw_run-01_bold_dicominfo.txt Image property dump from DICOM conversion

    Raw BOLD functional MRI (with applied distortion correction)

    Identical to the raw BOLD data, but with a scanner-side correction for geometric distortions applied (also including correction for participant motion). These data are most suitable for analysis of individual brains.

    Filename examples for subject 01 and run 01

    ./sub-01/ses-forrestgump/func/sub-01_ses-forrestgump_task-forrestgump_acq-dico_run-01_bold.nii.gz BOLD data

    ./derivatives/motion/sub-01/ses-forrestgump/func/sub-01_ses-forrestgump_task-forrestgump_acq-dico_run-01_moco_ref.nii.gz Reference volume used for motion correction. Only runs 1 and 5 (first runs in each session)

    ./sourcedata/dicominfo/sub-01/ses-forrestgump/func/sub-01_ses-forrestgump_task-forrestgump_acq-dico_run-01_bold_dicominfo.txt Image property dump from DICOM conversion

    Raw BOLD functional MRI (linear anatomical alignment)

    These images are motion and distortion corrected and have been anatomically aligned to a BOLD group template image that was generated from the entire group of participants.

    Alignment procedure was linear (image projection using an affine transformation). These data are most suitable for group-analyses and inter-individual comparisons.

    Filename examples for subject 01 and run 01

    ./derivatives/linear_anatomical_alignment/sub-01/ses-forrestgump/func/sub-01_ses-forrestgump_task-forrestgump_rec-dico7Tad2grpbold7Tad_run-01_bold.nii.gz BOLD data

    ./derivatives/linear_anatomical_alignment/sub-01/ses-forrestgump/func/sub-01_ses-forrestgump_task-forrestgump_rec-dico7Tad2grpbold7TadBrainMask_run-01_bold.nii.gz Matching brain mask volume

    ./derivatives/linear_anatomical_alignment/sub-01/ses-forrestgump/func/sub-01_ses-forrestgump_task-forrestgump_rec-XFMdico7Tad2grpbold7Tad_run-01_bold.mat 4x4 affine transformation matrix (plain text format)

    Raw BOLD functional MRI (non-linear anatomical alignment)

    These images are motion and distortion corrected and have been anatomically aligned to a BOLD group template image that was generated from the entire group of participants.

    Alignment procedure was non-linear (image projection using an affine transformation with additional transformation by non-linear warpfields). These data are most suitable for group-analyses and inter-individual comparisons.

    Filename examples for subject 01 and run 01

    ./derivatives/non-linear_anatomical_alignment/sub-01/ses-forrestgump/func/sub-01_ses-forrestgump_task-forrestgump_rec-dico7Tad2grpbold7TadNL_run-01_bold.nii.gz BOLD data

    ./derivatives/non-linear_anatomical_alignment/sub-01/ses-forrestgump/func/sub-01_ses-forrestgump_task-forrestgump_rec-dico7Tad2grpbold7TadBrainMaskNLBrainMask_run-01_bold.nii.gz Matching brain mask volume

    ./derivatives/non-linear_anatomical_alignment/sub-01/ses-forrestgump/func/sub-01_ses-forrestgump_task-forrestgump_rec-dico7Tad2grpbold7TadNLWarp_run-01_bold.nii.gz Warpfield (the associated affine transformation is identical to the "linear" alignment)

    Functional imaging - Auditory Perception Session

    Participants were repeatedly stimulated with a total of 25 music clips, with and without speech content, from five different genres using a slow event-related paradigm.

    Filename examples for subject 01 and run 01

    ./sub-01/ses-auditoryperception/func/sub-01_ses-auditoryperception_task-auditoryperception_run-01_bold.nii.gz
    ./sub-01/ses-auditoryperception/func/sub-01_ses-auditoryperception_task-auditoryperception_run-01_events.tsv

    Functional imaging - Localizer Session

    Filename examples for subject 01 and run

  12. Plaintext Wikipedia dump 2018

    • lindat.mff.cuni.cz
    • live.european-language-grid.eu
    Updated Feb 25, 2018
    Cite
    Rudolf Rosa (2018). Plaintext Wikipedia dump 2018 [Dataset]. https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2735
    Explore at:
    Dataset updated
    Feb 25, 2018
    Authors
    Rudolf Rosa
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Wikipedia plain text data obtained from Wikipedia dumps with WikiExtractor in February 2018.

    The data come from all Wikipedias for which dumps could be downloaded at [https://dumps.wikimedia.org/]. This amounts to 297 Wikipedias, usually corresponding to individual languages and identified by their ISO codes. Several special Wikipedias are included, most notably "simple" (Simple English Wikipedia) and "incubator" (tiny hatching Wikipedias in various languages). For a list of all the Wikipedias, see [https://meta.wikimedia.org/wiki/List_of_Wikipedias].

    The script that can be used to get a new version of the data is included, but note that Wikipedia limits the download speed when fetching many of the dumps, so it takes a few days to download all of them (though one or a few can be downloaded quickly). Also, the format of the dumps changes from time to time, so the script will probably stop working one day. The WikiExtractor tool [http://medialab.di.unipi.it/wiki/Wikipedia_Extractor] used to extract text from the Wikipedia dumps is not mine; I only modified it slightly to produce plaintext outputs [https://github.com/ptakopysk/wikiextractor].
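
    For orientation, a typical run of the stock WikiExtractor over one downloaded dump looks roughly like this (a sketch; flags vary between versions, so check the script's --help):

    wget https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2
    python WikiExtractor.py -o extracted/ simplewiki-latest-pages-articles.xml.bz2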

  13. DOIBoost Dataset Dump

    • data.niaid.nih.gov
    Updated Jan 24, 2020
    Cite
    Mannocci, Andrea (2020). DOIBoost Dataset Dump [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1438355
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Mannocci, Andrea
    La Bruzzo, Sandro
    Manghi, Paolo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Research in information science and scholarly communication strongly relies on the availability of openly accessible datasets of metadata and, where possible, their relative payloads. To this end, CrossRef plays a pivotal role by providing free access to its entire metadata collection, and allowing other initiatives to link and enrich its information. Therefore, a number of key pieces of information end up scattered across diverse datasets and resources freely available online. As a result of this fragmentation, researchers in this domain end up struggling with daily integration problems, producing a plethora of ad-hoc datasets, thereby wasting time and resources and infringing open science best practices.

    The latest DOIBoost release is a metadata collection that enriches CrossRef (October 2019 release: 108,048,986 publication records) with inputs from Microsoft Academic Graph (October 2019 release: 76,171,072 publication records), ORCID (October 2019 release: 12,642,131 publication records), and Unpaywall (August 2019 release: 26,589,869 publication records) for the purpose of supporting high-quality and robust research experiments. As a result of DOIBoost, CrossRef records have been "boosted" as follows:

    47,254,618 CrossRef records have been enriched with an abstract from MAG;

    33,279,428 CrossRef records have been enriched with an affiliation from MAG and/or ORCID;

    509,588 CrossRef records have been enriched with an ORCID identifier from ORCID.

    This entry consists of three files: doiboost_dump-2019-11-27.tar (contains a set of partXYZ.gz files, each one containing the JSON files relative to the enriched CrossRef records), schemaAndSample.zip, and termsOfUse.doc (contains details on the terms of use of DOIBoost).

    Note that this record comes with two relationships to other results of this experiment:

    link to the data paper: for more information on how the dataset is (and can be) generated;

    link to the software: to repeat the experiment

  14. Reproducibility in Practice: Dataset of a Large-Scale Study of Jupyter...

    • zenodo.org
    bz2
    Updated Mar 15, 2021
    Cite
    Anonymous (2021). Reproducibility in Practice: Dataset of a Large-Scale Study of Jupyter Notebooks [Dataset]. http://doi.org/10.5281/zenodo.2546834
    Explore at:
    Available download formats: bz2
    Dataset updated
    Mar 15, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices, and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.

    This repository contains two files:

    • dump.tar.bz2
    • jupyter_reproducibility.tar.bz2

    The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.

    The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:

    • analyses: this folder has all the notebooks we use to analyze the data in the PostgreSQL database.
    • archaeology: this folder has all the scripts we use to query, download, and extract data from GitHub notebooks.
    • paper: empty. The notebook analyses/N11.To.Paper.ipynb moves data to it.

    In the remainder of this text, we give instructions for reproducing the analyses using the data provided in the dump, and for reproducing the collection by collecting data from GitHub again.

    Reproducing the Analysis

    This section shows how to load the data in the database and run the analyses notebooks. In the analysis, we used the following environment:

    Ubuntu 18.04.1 LTS
    PostgreSQL 10.6
    Conda 4.5.1
    Python 3.6.8
    PdfCrop 2012/11/02 v1.38

    First, download dump.tar.bz2 and extract it:

    tar -xjf dump.tar.bz2

    It extracts the file db2019-01-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:

    psql jupyter < db2019-01-13.dump
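
    A quick way to confirm the restore succeeded is to list the tables it created (a sketch; the table names come from the dump itself):

    psql jupyter -c '\dt'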

    The restore populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTION:

    export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";

    Download and extract jupyter_reproducibility.tar.bz2:

    tar -xjf jupyter_reproducibility.tar.bz2

    Create a conda environment with Python 3.6:

    conda create -n py36 python=3.6

    Go to the analyses folder and install all the dependencies from requirements.txt:

    cd jupyter_reproducibility/analyses
    pip install -r requirements.txt

    For reproducing the analyses, run jupyter on this folder:

    jupyter notebook

    Execute the notebooks in this order:

    • N0.Index.ipynb
    • N1.Repository.ipynb
    • N2.Notebook.ipynb
    • N3.Cell.ipynb
    • N4.Features.ipynb
    • N5.Modules.ipynb
    • N6.AST.ipynb
    • N7.Name.ipynb
    • N8.Execution.ipynb
    • N9.Cell.Execution.Order.ipynb
    • N10.Markdown.ipynb
    • N11.To.Paper.ipynb

    Reproducing or Expanding the Collection

    The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.

    Requirements

    This time, we have extra requirements:

    All the analysis requirements
    lbzip2 2.5
    gcc 7.3.0
    GitHub account
    Gmail account

    Environment

    First, set the following environment variables:

    export JUP_MACHINE="db"; # machine identifier
    export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
    export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
    export JUP_COMPRESSION="lbzip2"; # compression program
    export JUP_VERBOSE="5"; # verbose level
    export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlalchemy connection
    export JUP_GITHUB_USERNAME="github_username"; # your github username
    export JUP_GITHUB_PASSWORD="github_password"; # your github password
    export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
    export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
    export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
    export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
    export JUP_OAUTH_FILE="~/oauth2_creds.json"; # oauth2 authentication file
    export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it blank
    export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it blank
    export JUP_WITH_EXECUTION="1"; # whether to execute Python notebooks
    export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependencies
    export JUP_EXECUTION_MODE="-1"; # run following the execution order
    export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
    export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
    export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
    export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
    export JUP_NOTEBOOK_TIMEOUT="300"; # timeout the extraction
    
    
    # Frequency of log reports
    export JUP_ASTROID_FREQUENCY="5";
    export JUP_IPYTHON_FREQUENCY="5";
    export JUP_NOTEBOOKS_FREQUENCY="5";
    export JUP_REQUIREMENT_FREQUENCY="5";
    export JUP_CRAWLER_FREQUENCY="1";
    export JUP_CLONE_FREQUENCY="1";
    export JUP_COMPRESS_FREQUENCY="5";
    
    export JUP_DB_IP="localhost"; # postgres database IP

    Then, configure the file ~/oauth2_creds.json, according to yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf

    Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the directories; the second one should umount it. You can leave the scripts blank, but it is not advisable, as the reproducibility study runs arbitrary code on your machine and you may lose your data.
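
    A minimal shape for these scripts might be the following (purely illustrative; the device, filesystem, and mount point depend entirely on your setup):

    # mount_ghstudy.sh -- mount the repository storage
    sudo mount /dev/sdb1 /mnt/jupyter/github

    # umount_ghstudy.sh -- unmount it again
    sudo umount /mnt/jupyter/github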

    Scripts

    Download and extract jupyter_reproducibility.tar.bz2:

    tar -xjf jupyter_reproducibility.tar.bz2

    Install 5 conda environments and 5 anaconda environments, one for each Python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (note that it is a local package that has not been published to PyPI; make sure to use the -e option):

    Conda 2.7

    conda create -n raw27 python=2.7 -y
    conda activate raw27
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 2.7

    conda create -n py27 python=2.7 anaconda -y
    conda activate py27
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology
    

    Conda 3.4

    It requires a manual jupyter and pathlib2 installation due to some incompatibilities found in the default installation.

    conda create -n raw34 python=3.4 -y
    conda activate raw34
    conda install jupyter -c conda-forge -y
    conda uninstall jupyter -y
    pip install --upgrade pip
    pip install jupyter
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology
    pip install pathlib2

    Anaconda 3.4

    conda create -n py34 python=3.4 anaconda -y
    conda activate py34
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.5

    conda create -n raw35 python=3.5 -y
    conda activate raw35
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 3.5

    It requires the manual installation of other anaconda packages.

    conda create -n py35 python=3.5 anaconda -y
    conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
    conda activate py35
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.6

    conda create -n raw36 python=3.6 -y
    conda activate raw36
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 3.6

    conda create -n py36 python=3.6 anaconda -y
    conda activate py36
    conda install -y anaconda-navigator jupyterlab_server navigator-updater
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.7

    conda create -n raw37 python=3.7 -y
    conda activate raw37
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 3.7

    When we

  15. Bulk Dump Truck Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Dec 29, 2024
    Cite
    Data Insights Market (2024). Bulk Dump Truck Report [Dataset]. https://www.datainsightsmarket.com/reports/bulk-dump-truck-766420
    Explore at:
    ppt, doc, pdfAvailable download formats
    Dataset updated
    Dec 29, 2024
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The market for bulk dump trucks is anticipated to register a CAGR of 7% over the forecast period of 2023-2033, reaching a market size of 16,000 million value units by 2033. The growth of the market is attributed to the increasing demand for bulk materials in various industries such as construction, mining, and agriculture. Additionally, the rising adoption of automated and semi-automated bulk dump trucks is expected to further drive market growth.

    Manual bulk dump trucks currently account for the majority of the market share, but automated and semi-automated trucks are expected to gain traction due to their increased efficiency and safety features. North America and Europe are expected to remain the dominant regions in the bulk dump truck market, with a significant share in the global market. However, the Asia Pacific region is expected to witness the highest growth rate during the forecast period, driven by the increasing demand from developing countries such as China and India.

    Major companies operating in the bulk dump truck market include Automated Conveyor Company, CDS-LIPE, National Bulk Equipment, TOTE Systems, and Weening Brothers. These companies are focusing on product development and innovation to meet the evolving needs of customers and enhance their competitive advantages in the market.

  16. INSPIRE HEP Dataset on 2021-01-08 for the paper Embracing data-driven...

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 30, 2021
    Cite
    Börner, Katy (2021). INSPIRE HEP Dataset on 2021-01-08 for the paper Embracing data-driven decision making to manage and communicate the impact of big science collaborations [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4496557
    Explore at:
    Dataset updated
    May 30, 2021
    Dataset provided by
    Silva, Filipi N.
    Börner, Katy
    Milojević, Staša
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the original data used for the data analyses and visualizations discussed in the paper "Embracing data-driven decision making to manage and communicate the impact of big science collaborations". Four High Energy Physics and Astrophysics projects were studied: ATLAS, BaBar, LIGO, and IceCube. Data for these projects was collected from INSPIRE HEP (https://inspirehep.net), from the dumps available at http://old.inspirehep.net/dumps/inspire-dump.html, on Jan 8, 2021. The Processed folder contains data preprocessed according to the code in the repository: https://github.com/bigscience/bigscience.

  17. ROR Data

    • explore.openaire.eu
    Updated Mar 16, 2023
    Cite
    Research Organization Registry (2023). ROR Data [Dataset]. http://doi.org/10.5281/zenodo.7742581
    Explore at:
    Dataset updated
    Mar 16, 2023
    Authors
    Research Organization Registry
    Description

    Data dump from the Research Organization Registry (ROR), a community-led registry of open identifiers for research organizations.

    Release v1.21 contains ROR IDs and metadata for 104,834 research organizations. This includes the addition of 104 new records and metadata updates to 265 existing records. See the release notes.

    Starting with this release, the data dump includes a CSV version of the ROR data file in addition to the canonical JSON file. The data dump zip therefore now contains two files instead of one. If your code currently expects only one file, you will need to update it accordingly. The CSV contains a subset of fields from the JSON file, some of which have been flattened for easier parsing.

    Beginning with its March 2022 release, ROR is curated independently from GRID. Semantic versioning beginning with v1.0 was added to reflect this departure from GRID. The existing data structure was not changed. From March 2022 onward, data releases are versioned as follows:

    Minor versions (ex 1.1, 1.2, 1.3): Contain changes to data, such as new records and updates to existing records. No changes to the data model/structure.

    Patch versions (ex 1.0.1): Used infrequently to correct errors in a release. No changes to the data model/structure.

    Major versions (ex 1.x, 2.x, 3.x): Contain changes to the data model/structure, as well as the data itself. Major versions will be released with significant advance notice.

    For convenience, the date is also included in the release file name, ex: v1.0-2022-03-15-ror-data.zip.

  18. DOIBoost Dataset Dump | gimi9.com

    • gimi9.com
    Updated Oct 3, 2018
    Cite
    (2018). DOIBoost Dataset Dump | gimi9.com [Dataset]. https://gimi9.com/dataset/eu_oai-zenodo-org-3559699/
    Explore at:
    Dataset updated
    Oct 3, 2018
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Research in information science and scholarly communication strongly relies on the availability of openly accessible datasets of metadata and, where possible, their relative payloads. To this end, CrossRef plays a pivotal role by providing free access to its entire metadata collection and allowing other initiatives to link to and enrich its information. As a consequence, a number of key pieces of information end up scattered across diverse datasets and resources freely available online. Because of this fragmentation, researchers in this domain struggle with daily integration problems, producing a plethora of ad-hoc datasets, thereby wasting time and resources and infringing open science best practices. The latest DOIBoost release is a metadata collection that enriches CrossRef (October 2019 release: 108,048,986 publication records) with inputs from Microsoft Academic Graph (October 2019 release: 76,171,072 publication records), ORCID (October 2019 release: 12,642,131 publication records), and Unpaywall (August 2019 release: 26,589,869 publication records) for the purpose of supporting high-quality and robust research experiments. As a result of DOIBoost, CrossRef records have been "boosted" as follows:

    • 47,254,618 CrossRef records have been enriched with an abstract from MAG;
    • 33,279,428 CrossRef records have been enriched with an affiliation from MAG and/or ORCID;
    • 509,588 CrossRef records have been enriched with an ORCID identifier from ORCID.

    This entry consists of three files: doiboost_dump-2019-11-27.tar (a set of partXYZ.gz files, each containing the JSON files for the enriched CrossRef records), schemaAndSample.zip, and termsOfUse.doc (details on the terms of use of DOIBoost). Note that this record comes with two relationships to other results of this experiment: a link to the data paper, for more information on how the dataset is (and can be) generated, and a link to the software, to repeat the experiment.
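    As a hedged sketch of scanning the dump with Python's standard library only (the internal layout of each part is an assumption here, namely one JSON record per line; the actual layout and field names should be checked against schemaAndSample.zip):

    import gzip
    import json
    import tarfile

    with tarfile.open("doiboost_dump-2019-11-27.tar") as tar:
        for member in tar:
            if not (member.isfile() and member.name.endswith(".gz")):
                continue
            with gzip.open(tar.extractfile(member), mode="rt", encoding="utf-8") as part:
                for line in part:
                    record = json.loads(line)
                    doi = record.get("doi")  # field name assumed; see the schema sample
                    print(member.name, doi)
                    break  # demo: inspect only the first record of each part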

  19. Wikipedia XML revision history data dumps (stub-meta-history.xml.gz) from 20...

    • datadryad.org
    zip
    Updated Aug 15, 2017
    Cite
    R. Stuart Geiger; Aaron Halfaker (2017). Wikipedia XML revision history data dumps (stub-meta-history.xml.gz) from 20 April 2017 [Dataset]. http://doi.org/10.6078/D1FD3K
    Explore at:
    zipAvailable download formats
    Dataset updated
    Aug 15, 2017
    Dataset provided by
    Dryad
    Authors
    R. Stuart Geiger; Aaron Halfaker
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2017
    Description

    See https://meta.wikimedia.org/wiki/Data_dumps for more detail on using these dumps.
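    The stub-meta-history dumps contain per-revision metadata (page titles, revision IDs, timestamps, contributors) without article text, so they can be streamed rather than loaded whole. A minimal Python sketch under stated assumptions (the file name is an example for this snapshot, and the export namespace URI should be verified against the <mediawiki> root element of the file):

    import gzip
    import xml.etree.ElementTree as ET

    # Namespace assumed for 2017-era MediaWiki export files.
    NS = "{http://www.mediawiki.org/xml/export-0.10/}"

    with gzip.open("enwiki-20170420-stub-meta-history.xml.gz", "rb") as f:
        for _, elem in ET.iterparse(f):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title")
                n_revisions = len(elem.findall(NS + "revision"))
                print(title, n_revisions)
                elem.clear()  # release processed pages to keep memory bounded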

  20. Event Graph of BPI Challenge 2019

    • data.4tu.nl
    zip
    Updated Apr 22, 2021
    + more versions
    Cite
    Dirk Fahland (2021). Event Graph of BPI Challenge 2019 [Dataset]. http://doi.org/10.4121/14169614.v1
    Explore at:
    zipAvailable download formats
    Dataset updated
    Apr 22, 2021
    Dataset provided by
    4TU.ResearchData
    Authors
    Dirk Fahland
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Business process event data modeled as labeled property graphs

    Data Format
    -----------

    The dataset comprises one labeled property graph in two different file formats.

    #1) Neo4j .dump format

    A Neo4j (https://neo4j.com) database dump that contains the entire graph and can be imported into a fresh Neo4j database instance using the following command (see also the Neo4j documentation: https://neo4j.com/docs/):

    <neo4j-home>/bin/neo4j-admin(.bat|.sh) load --database=graph.db --from=<path-to-dump-file>

    The .dump was created with Neo4j v3.5.

    #2) .graphml format

    A .zip file containing a .graphml file of the entire graph
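    For exploratory work outside Neo4j, the .graphml file can be read with standard graph libraries. A hedged sketch using networkx (file name assumed from the Data Contents section below; note the full graph is large, so this route needs substantial memory, and the Neo4j dump is the more practical option):

    import zipfile
    import networkx as nx

    with zipfile.ZipFile("neo4j-bpic19-2021-02-17.graphml.zip") as z:
        # Assumes the zip contains a single .graphml member.
        with z.open(z.namelist()[0]) as f:
            g = nx.read_graphml(f)
    print(g.number_of_nodes(), "nodes,", g.number_of_edges(), "edges")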


    Data Schema
    -----------

    The graph is a labeled property graph over business process event data. Each graph uses the following concepts:

    :Event nodes - each event node describes a discrete event, i.e., an atomic observation described by attribute "Activity" that occurred at the given "timestamp"

    :Entity nodes - each entity node describes an entity (e.g., an object or a user), it has an EntityType and an identifier (attribute "ID")

    :Log nodes - describes a collection of events that were recorded together; most graphs contain only one log node

    :Class nodes - each class node describes a type of observation that has been recorded, e.g., the different types of activities that can be observed; :Class nodes group events into sets of identical observations

    :CORR relationships - from :Event to :Entity nodes, describes whether an event is correlated to a specific entity; an event can be correlated to multiple entities

    :DF relationships - "directly-followed by" between two :Event nodes describes which event is directly-followed by which other event; both events in a :DF relationship must be correlated to the same entity node. All :DF relationships form a directed acyclic graph.

    :HAS relationship - from a :Log to an :Event node, describes which events were recorded in which event log

    :OBSERVES relationship - from an :Event to a :Class node, describes to which event class an event belongs, i.e., which activity was observed in the graph

    :REL relationship - placeholder for any structural relationship between two :Entity nodes

    The concepts are further defined in Stefan Esser, Dirk Fahland: Multi-Dimensional Event Data in Graph Databases. CoRR abs/2005.14552 (2020), https://arxiv.org/abs/2005.14552
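    To make the schema concrete, the following is a minimal sketch using the official neo4j Python driver (the bolt URI, credentials, and exact property casing are assumptions for a local instance restored from the dump). It reconstructs, per POItem entity, the sequence of activities from its correlated events; traversing the :DF relationships yields the same chains.

    from neo4j import GraphDatabase

    # Connection details are placeholders for a locally restored database.
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    # Events are correlated (:CORR) to entities; ordering a POItem's events
    # by timestamp recovers the trace that the :DF edges encode.
    query = """
    MATCH (e:Entity {EntityType: 'POItem'})<-[:CORR]-(ev:Event)
    WITH e, ev ORDER BY ev.timestamp
    RETURN e.ID AS item, collect(ev.Activity) AS trace
    LIMIT 5
    """

    with driver.session() as session:
        for record in session.run(query):
            print(record["item"], record["trace"])

    driver.close()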


    Data Contents
    -------------

    neo4j-bpic19-2021-02-17 (.dump|.graphml.zip)

    An integrated graph describing the raw event data of the entire BPI Challenge 2019 dataset.
    van Dongen, B.F. (Boudewijn) (2019): BPI Challenge 2019. 4TU.ResearchData. Collection. https://doi.org/10.4121/uuid:d06aff4b-79f0-45e6-8ec8-e19730c248f1

    This data originated from a large multinational company operating from The Netherlands in the area of coatings and paints, and we ask participants to investigate the purchase order handling process for some of its 60 subsidiaries. In particular, the process owner has compliance questions. In the data, each purchase order (or purchase document) contains one or more line items. For each line item, there are roughly four types of flows in the data:

    (1) 3-way matching, invoice after goods receipt: for these items, the value of the goods receipt message should be matched against the value of an invoice receipt message and the value put in during creation of the item (indicated by both the GR-based flag and the Goods Receipt flag set to true).

    (2) 3-way matching, invoice before goods receipt: purchase items that require a goods receipt message but do not require GR-based invoicing (indicated by the GR-based IV flag set to false and the Goods Receipt flag set to true). For such purchase items, invoices can be entered before the goods are received, but they are blocked until the goods arrive. This unblocking can be done by a user or by a batch process at regular intervals. Invoices should only be cleared if goods are received and the value matches the invoice and the value at creation of the item.

    (3) 2-way matching (no goods receipt needed): for these items, the value of the invoice should match the value at creation (in full, or partially until the PO value is consumed), but no separate goods receipt message is required (indicated by both the GR-based flag and the Goods Receipt flag set to false).

    (4) Consignment: for these items, there are no invoices on PO level, as this is handled fully in a separate process. Here the GR indicator is set to true but the GR IV flag is set to false, and the item type (consignment) tells us that no invoice is expected against this item.

    Unfortunately, the complexity of the data goes further than this division into four categories. For each purchase item, there can be many goods receipt messages and corresponding invoices, which are subsequently paid. Consider, for example, the process of paying rent: there is a purchase document with one item for paying rent, but a total of 12 goods receipt messages with (cleared) invoices, each with a value equal to 1/12 of the total amount. For logistical services, there may even be hundreds of goods receipt messages for one line item. Overall, for each line item, the amounts of the line item, the goods receipt messages (if applicable), and the invoices have to match for the process to be compliant.

    Of course, the log is anonymized, but some semantics are left in the data, for example:

    - The resources are split between batch users and normal users, indicated by their names. The batch users are automated processes executed by different systems; the normal users refer to human actors in the process.
    - The monetary values of each event are anonymized from the original data using a linear translation respecting 0, i.e., the addition of multiple invoices for a single item should still lead to the original item worth (although there may be small rounding errors for numerical reasons).
    - Company, vendor, system, and document names and IDs are anonymized in a consistent way throughout the log. The company has the key, so any result can be translated by them into business insights about real customers and real purchase documents.

    The case ID is a combination of the purchase document and the purchase item. There is a total of 76,349 purchase documents containing in total 251,734 items, i.e., there are 251,734 cases. In these cases, there are 1,595,923 events relating to 42 activities performed by 627 users (607 human users and 20 batch users). Sometimes the user field is empty, or NONE, which indicates no user was recorded in the source system. For each purchase item (or case) the following attributes are recorded:

    - concept:name: a combination of the purchase document id and the item id
    - Purchasing Document: the purchasing document ID
    - Item: the item ID
    - Item Type: the type of the item
    - GR-Based Inv. Verif.: flag indicating if GR-based invoicing is required (see above)
    - Goods Receipt: flag indicating if 3-way matching is required (see above)
    - Source: the source system of this item
    - Doc. Category name: the name of the category of the purchasing document
    - Company: the subsidiary of the company from where the purchase originated
    - Spend classification text: a text explaining the class of the purchase item
    - Spend area text: a text explaining the area for the purchase item
    - Sub spend area text: another text explaining the area for the purchase item
    - Vendor: the vendor to which the purchase document was sent
    - Name: the name of the vendor
    - Document Type: the document type
    - Item Category: the category as explained above (3-way with GR-based invoicing, 3-way without, 2-way, consignment)

    The data contains the following entities and their events

    - PO - Purchase Order documents handled at a large multinational company operating from The Netherlands
    - POItem - an item in a Purchase Order document describing a specific item to be purchased
    - Resource - the user or worker handling the document or a specific item
    - Vendor - the external organization from which an item is to be purchased

    Data Size
    ---------

    BPIC19: 1,926,651 nodes, 15,082,099 relationships
