Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub. Based on the results, we proposed and evaluated Julynter, a linting tool for Jupyter Notebooks.
Papers:
This repository contains three files:
Reproducing the Notebook Study
The db2020-09-22.dump.gz file contains a PostgreSQL dump of the database, with all the data we extracted from notebooks. To load it into an existing database named jupyter, run:
gunzip -c db2020-09-22.dump.gz | psql jupyter
Note that this file contains only the database with the extracted data. The actual repositories are available in a Google Drive folder, which also contains the Docker images we used in the reproducibility study. The repositories are stored as content/{hash_dir1}/{hash_dir2}.tar.bz2, where hash_dir1 and hash_dir2 are columns of the repositories table in the database.
For scripts, notebooks, and detailed instructions on how to analyze or reproduce the data collection, please check the instructions on the Jupyter Archaeology repository (tag 1.0.0).
The sample.tar.gz file contains the repositories obtained during the manual sampling.
Reproducing the Julynter Experiment
The julynter_reproducility.tar.gz file contains all the data collected in the Julynter experiment and the analysis notebooks. Reproducing the analysis is straightforward:
The collected data is stored in the julynter/data folder.
Changelog
2019/01/14 - Version 1 - Initial version
2019/01/22 - Version 2 - Update N8.Execution.ipynb to calculate the rate of failure for each reason
2019/03/13 - Version 3 - Update package for camera ready. Add columns to db to detect duplicates, change notebooks to consider them, and add N1.Skip.Notebook.ipynb and N11.Repository.With.Notebook.Restriction.ipynb.
2021/03/15 - Version 4 - Add Julynter experiment; Update database dump to include new data collected for the second paper; remove scripts and analysis notebooks from this package (moved to GitHub), add a link to Google Drive with collected repository files
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The target company's hydraulic modelling package uses Innovyze Infoworks™. This product enables third-party integration through APIs and Ruby scripts when the ICM Exchange service is enabled. As a result, the research looked at opportunities to exploit scripting in order to run the chosen optimisation strategy. The first approach investigated the use of a CS-script tool that would export the results tables directly from the Innovyze Infoworks™ environment into CSV-format workbooks. From here the data could then be inspected, with the application of mathematical tooling to optimise the pump start parameters before returning these to the model and rerunning. Note that the computational resource the research obtained to deploy the modelling and analysis tools comprised the following specification.
Hardware
Dell Poweredge R720
Intel Xeon Processor E5-2600 v2
2x Processor Sockets
32GB random access memory (RAM) – 1866MT/s
Virtual Machine
Hosted on VMWare Hypervisor v6.0.
Windows Server 2012R2.
Microsoft Excel 64bit.
16 virtual central processing units (vCPUs).
Full provision of 32GB RAM – 1866MT/s.
Issues were highlighted in the first round of data exports as, even with a dedicated server offering 16 vCPUs and the specification shown above, the Excel front-end environment was unable to process the very large data matrices being generated. There were regular failures of the Excel executable, which led to an overall inability to inspect the data, let alone run calculations on the matrices. When considering the five-second sample over 31 days, this resulted in matrices in the order of [44x535682] per model run, with the calculations in (14-19) needing to be applied on a per-cell basis.
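As a quick sanity check on the reported matrix dimensions, the row count follows from the sampling arithmetic alone; the short Python sketch below takes the 44 result columns as given from the text above and reproduces the order of magnitude.

# Approximate number of 5-second samples in a 31-day period
seconds_per_day = 24 * 60 * 60            # 86,400
rows = 31 * seconds_per_day // 5          # 535,680, close to the reported 535,682
columns = 44                              # result fields per time step, as reported
print(rows, columns * rows)               # roughly 23.6 million cells per model run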
The Permian Basin, straddling New Mexico and Texas, is one of the most productive oil and gas (OG) provinces in the United States. OG production yields large volumes of wastewater that contain elevated concentrations of major ions including salts (also referred to as brines), and trace organic and inorganic constituents. These OG wastewaters pose unknown environmental health risks, particularly in the case of accidental or intentional releases. Releases of OG wastewaters have resulted in water-quality and environmental health effects at sites in West Virginia (Akob, et al., 2016, Orem et al. 2017, Kassotis et al. 2016) and in the Williston Basin region in Montana and North Dakota (Cozzarelli et al. 2017, Cozzarelli et al. 2021, Lauer et al. 2016, Gleason et al. 2014, and Mills et al. 2011). Starting in November 2017, 39 illegal dumps of OG wastewater were identified in southeastern New Mexico on public lands by the Bureau of Land Management (BLM). Illegal dumping is an unpermitted release of waste materials that is in violation of Federal and State laws including the U.S. Resource Conservation and Recovery Act (U.S. EPA, 1976), Federal Land Policy and Management Act (U.S. DOI, 2016; 43 USC 1701(a)(8); 43 USC 1733(g)), the State of New Mexico’s Oil and Gas Act (New Mexico Legislature. 2019), and New Mexico Administrative Code § 19.15.34.20. To evaluate the effects of these releases, changes in soil geochemistry and microbial community structure at 6 sites were analyzed by comparing soils from within OG wastewater dump-affected zones to corresponding unaffected (control) soils. In addition, the effects on local vegetation were evaluated by measuring the chemistry of 4 plant species from dump-affected and control zones at a single site. Samples of local produced waters were geochemically and isotopically characterized to link soil geochemistry to reservoir geochemistry. These data sets included field observations; soil water extractable inorganic chemical composition, pH, strontium (Sr) isotopes, and specific conductance; bulk soil Raman, carbon (C), nitrogen (N), mercury (Hg), radium (Ra) and thorium (Th) isotopes, and percent moisture; plant inorganic chemical composition; and soil microbial community composition data. At each site, triplicate soil samples were collected from dump-affected and control zones and duplicate field samples were collected at each site. Plant biomass was collected in triplicate from dump-affected and control zones at a single site. 
This data release includes eleven data tables provided in machine-readable comma-separated values format (*.csv):
T01_Permian_Data_Dictionary.csv: the entity and attribute metadata section for tables T02-T11 in table format
T02_Soil_Geochemistry.csv: descriptions of sampling sites and concentrations of major anions, cations, and trace elements from the soil samples
T03_Plant_Geochemistry.csv: concentrations of major anions, cations, trace elements, and Sr isotopes from the vegetation samples
T04_Soil_Isotopes.csv: Sr, Ra, and Th isotopes from the soils
T05_Raman_Counts.csv: Raman spectra counts from the soil samples
T06_Raman_Band_Separation.csv: Raman band separation from selected soil samples
T07_Soil_Organics_Spectra.csv: spectral data of alkane unresolved complex mixtures (UCMs) from soil extracts
T08_Soil_Organics_Summary.csv: a summary of alkane UCMs from soil extracts
T09_Soil_16S_BIOM.csv: microbial operational taxonomic units from the soils
T10_Produced_Water.csv: selected geochemistry and isotopic measurements from produced water samples
T11_Limits_AnalyticalMethods.csv: a listing of analytical detection limits
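As a minimal sketch (assuming only that the files are standard CSVs as stated, with column definitions documented in T01_Permian_Data_Dictionary.csv), the tables can be inspected with pandas:

import pandas as pd

# Load one of the released tables and list its columns; no column names are
# assumed here, they are documented in T01_Permian_Data_Dictionary.csv.
soil = pd.read_csv("T02_Soil_Geochemistry.csv")
print(soil.shape)
print(list(soil.columns))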
Object detection is a vital part of any autonomous vision system and to obtain a high performing object detector data is needed. The object detection task aims to detect and classify different objects using camera input and getting bounding boxes containing the objects as output. This is usually done by utilizing deep neural networks.
When training an object detector, a large amount of data is used; however, it is not always practical to collect large amounts of data. This has led to multiple techniques which decrease the amount of data needed, such as transfer learning and domain adaptation. Working with construction equipment is a time-consuming process, and we wanted to examine whether it was possible to use scale-model data to train a network and then use that network to detect real objects with no additional training.
This small dataset contains training and validation data of a scale dump truck in different environments, while the test set contains images of a full-size dump truck of a similar model. The aim of the dataset is to train a network to classify wheels, cabs and tipping bodies of a scale-model dump truck and use that to classify the same classes on a full-scale dump truck.
The label structure of the dataset is the YOLO v3 structure, where each class corresponds to an integer value: Wheel: 0, Cab: 1, Tipping body: 2
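For illustration, a minimal label reader is sketched below; it assumes the standard YOLO convention of one line per object, with the class index followed by the normalised box centre, width and height.

# Hypothetical helper for reading a YOLO-format label file of this dataset.
CLASS_NAMES = {0: "Wheel", 1: "Cab", 2: "Tipping body"}

def read_yolo_labels(path):
    boxes = []
    with open(path) as fh:
        for line in fh:
            cls, xc, yc, w, h = line.split()
            boxes.append((CLASS_NAMES[int(cls)], float(xc), float(yc), float(w), float(h)))
    return boxes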
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
In our work, we have designed and implemented a novel workflow with several heuristic methods to combine state-of-the-art methods for gathering CVE fix commits. As a consequence of our improvements, we have been able to gather the largest programming-language-independent real-world dataset of CVE vulnerabilities with the associated fix commits. Our dataset, containing 29,203 unique CVEs coming from 7,238 unique GitHub projects, is, to the best of our knowledge, by far the biggest CVE vulnerability dataset with fix commits available today. These CVEs are associated with 35,276 unique commits in the SQL dump and 39,931 patch commit files that fixed those vulnerabilities (some patch files could not be saved in the SQL dump due to several technical reasons). Our larger dataset thus substantially improves over the current real-world vulnerability datasets and enables further progress in research on vulnerability detection and software security. We used the NVD (nvd.nist.gov) and the GitHub Security Advisory Database as the main sources of our pipeline.
We release to the community a 16GB PostgreSQL database that contains information on CVEs up to 2024-09-26, CWEs of each CVE, files and methods changed by each commit, and repository metadata. Additionally, patch files related to the fix commits are available as a separate package. Furthermore, we make our dataset collection tool also available to the community.
The cvedataset-patches.zip file contains fix patches, and postgrescvedumper.sql.zip contains a PostgreSQL dump of fixes, together with several other fields such as CVEs, CWEs, repository metadata, commit data, file changes, methods changed, etc.
The MoreFixes data-storage strategy is based on CVEFixes to store CVE fix commits from open-source repositories, and it uses a modified version of Prospector (part of Project KB from SAP) as a module to detect the fix commits of a CVE. Our full methodology is presented in the paper "MoreFixes: A Large-Scale Dataset of CVE Fix Commits Mined through Enhanced Repository Discovery", published at the PROMISE conference (2024).
For more information about usage and sample queries, visit the Github repository: https://github.com/JafarAkhondali/Morefixes
If you are using this dataset, please be aware that the repositories we mined carry different licenses, and you are responsible for handling any licensing issues. The same applies to CVEFixes.
This product uses the NVD API but is not endorsed or certified by the NVD.
This research was partially supported by the Dutch Research Council (NWO) under the project NWA.1215.18.008 Cyber Security by Integrated Design (C-SIDe).
To restore the dataset, you can use the docker-compose file available at the GitHub repository. Default database credentials after restoring the dump:
POSTGRES_USER=postgrescvedumper POSTGRES_DB=postgrescvedumper POSTGRES_PASSWORD=a42a18537d74c3b7e584c769152c3d
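As a minimal sketch, assuming the dump has been restored locally with these default credentials and PostgreSQL is listening on localhost, the database can be queried from Python; the table listing below makes no assumption about the schema itself.

from sqlalchemy import create_engine, text

engine = create_engine(
    "postgresql://postgrescvedumper:a42a18537d74c3b7e584c769152c3d@localhost/postgrescvedumper"
)
with engine.connect() as conn:
    tables = conn.execute(text(
        "SELECT table_name FROM information_schema.tables WHERE table_schema = 'public'"
    ))
    print([row[0] for row in tables])  # inspect the available tables before querying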
Please use this for citation:
title={MoreFixes: A large-scale dataset of CVE fix commits mined through enhanced repository discovery},
author={Akhoundali, Jafar and Nouri, Sajad Rahim and Rietveld, Kristian and Gadyatskaya, Olga},
booktitle={Proceedings of the 20th International Conference on Predictive Models and Data Analytics in Software Engineering},
pages={42--51},
year={2024}
}
RealNews is a large corpus of news articles from Common Crawl. Data is scraped from Common Crawl, limited to the 5000 news domains indexed by Google News. The authors used the Newspaper Python library to extract the body and metadata from each article. News from Common Crawl dumps from December 2016 through March 2019 were used as training data; articles published in April 2019 from the April 2019 dump were used for evaluation. After deduplication, RealNews is 120 gigabytes without compression.
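For illustration only, the kind of per-article extraction described above can be sketched with the Newspaper (newspaper3k) library; the URL is a placeholder, and this is not the authors' actual pipeline code.

from newspaper import Article

article = Article("https://example.com/some-news-story")  # placeholder URL
article.download()
article.parse()
print(article.title, article.authors, article.publish_date)  # extracted metadata
print(article.text[:200])                                    # extracted body text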
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
From 2016 to 2018, we surveyed the world’s largest natural history museum collections to begin mapping this globally distributed scientific infrastructure. The resulting dataset includes 73 institutions across the globe. It has:
Basic institution data for the 73 contributing institutions, including estimated total collection sizes, geographic locations (to the city) and latitude/longitude, and Research Organization Registry (ROR) identifiers where available.
Resourcing information, covering the numbers of research, collections and volunteer staff in each institution.
Indicators of the presence and size of collections within each institution broken down into a grid of 19 collection disciplines and 16 geographic regions.
Measures of the depth and breadth of individual researcher experience across the same disciplines and geographic regions.
This dataset contains the data (raw and processed) collected for the survey, and specifications for the schema used to store the data. It includes:
The global collections data may also be accessed at https://rebrand.ly/global-collections. This is a preliminary dashboard, constructed and published using Microsoft Power BI, that enables the exploration of the data through a set of visualisations and filters. The dashboard consists of three pages:
Institutional profile: Enables the selection of a specific institution and provides summary information on the institution and its location, staffing, total collection size, collection breakdown and researcher expertise.
Overall heatmap: Supports an interactive exploration of the global picture, including a heatmap of collection distribution across the discipline and geographic categories, and visualisations that demonstrate the relative breadth of collections across institutions and correlations between collection size and breadth. Various filters allow the focus to be refined to specific regions and collection sizes.
Browse: Provides some alternative methods of filtering and visualising the global dataset to look at patterns in the distribution and size of different types of collections across the global view.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for openlegaldata.io bulk case data
Dataset Description
This is a copy of the latest dump from openlegaldata.io. I will try to keep this updated, since there is no official Hugging Face dataset repo.
Homepage: https://de.openlegaldata.io/ Repository: Bulk Data
Dataset Summary
This is the openlegaldata bulk case download from October 2022. Please refer to the official website (above) for more information. I have not made any changes for… See the full description on the dataset page: https://huggingface.co/datasets/LennardZuendorf/openlegaldata-bulk-data.
How large is the impact of a dump site on house prices in the area?
You work for a local government agency. They need to locate a new garbage dump site near the city and are looking for the optimal location to minimize its impact on house prices in the area. Your task is to take the available historical data about house prices near dump sites for the two available years and estimate the impact of a dump site's vicinity on house prices. Present an econometric model.
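One possible specification, sketched below under the assumption of a simple difference-in-differences style hedonic model, compares houses near a dump site with other houses across the two available years; the file and column names are hypothetical, not part of the task data.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("house_prices.csv")   # hypothetical columns: price, near_dump (0/1), year2 (0/1)
model = smf.ols("np.log(price) ~ near_dump * year2", data=df).fit()
print(model.summary())
# The near_dump:year2 interaction estimates how the (log) price gap between
# houses near a dump site and other houses changed between the two years.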
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Data dump from the Research Organization Registry (ROR), a community-led registry of open identifiers for research organizations.
Release v1.63 contains ROR IDs and metadata for 115,299 research organizations in JSON and CSV format, in schema versions 1 and 2. This includes the addition of 287 new records and metadata updates to 105 existing records. See the release notes.
Data format
Beginning with release v1.45 on 11 April 2024, data releases contain JSON and CSV files formatted according to both schema v1 and schema v2. v2 files have _schema_v2 appended to the end of the filename, ex v1.45-2024-04-11-ror-data_schema_v2.json. In order to maintain compatibility with previous releases, v1 files have no version information in the filename, ex v1.45-2024-04-11-ror-data.json.
For both versions, the CSV file contains a subset of fields from the JSON file, some of which have been flattened for easier parsing. As ROR records and the ROR schema are maintained in JSON, CSVs are for convenience only. JSON is the format of record.
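As a minimal sketch of working with a release, the JSON file can be loaded directly; the filename below is the schema v2 example given above (v1.45), so substitute the file from the release you actually downloaded.

import json

with open("v1.45-2024-04-11-ror-data_schema_v2.json") as fh:
    records = json.load(fh)   # the dump is a single JSON array of organization records
print(len(records))           # number of records in the release
print(records[0]["id"])       # ROR ID of the first record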
Release versioning
Beginning with v1.45 in April 2024, ROR has introduced schema versioning, with files available in schema v1 and schema v2. The ROR API default version, however, remains v1 and will be changed to v2 in April 2025. To align with the API, the data dump major version will remain 1 until the API default version is changed to v2. At that time, the data dump major version will be incremented to 2 per below.
Data releases are versioned as follows:
Minor versions (ex 1.1, 1.2, 1.3): Contain changes to data, such as new records and updates to existing records. No changes to the data model/structure.
Patch versions (ex 1.0.1): Used infrequently to correct errors in a release. No changes to the data model/structure.
Major versions (ex 1.x, 2.x, 3.x): Contain changes to the data model/structure, as well as the data itself. Major versions will be released with significant advance notice.
For convenience, the date is also included in the release file name, ex: v1.0-2022-03-15-ror-data.zip.
The ROR data dump is provided under the Creative Commons CC0 Public Domain Dedication. Location data in ROR comes from GeoNames and is licensed under a Creative Commons Attribution 4.0 license.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Note: This dataset is the combination of four related datasets that were originally hosted on OpenfMRI.org: ds000113, ds000113b, ds000113c and ds000113d. The combined dataset is now in BIDS format and is simply referred to as ds000113 on OpenNeuro.org.
For more information about the project visit: http://studyforrest.org
This dataset contains high-resolution functional magnetic resonance imaging (fMRI) data from 20 participants recorded at high field strength (7 Tesla) during prolonged stimulation with an auditory feature film ("Forrest Gump"). In addition, a comprehensive set of auxiliary data (T1w, T2w, DTI, susceptibility-weighted image, angiography) as well as measurements to assess technical and physiological noise components have been acquired. An initial analysis confirms that these data can be used to study common and idiosyncratic brain response patterns to complex auditory stimulation. Among the potential uses of this dataset are the study of auditory attention and cognition, language and music perception, as well as social perception. The auxiliary measurements enable a large variety of additional analysis strategies that relate functional response patterns to structural properties of the brain. Alongside the acquired data, we provide source code and detailed information on all employed procedures — from stimulus creation to data analysis. (https://www.nature.com/articles/sdata20143)
The dataset also contains data from the same twenty participants while being repeatedly stimulated with a total of 25 music clips, with and without speech content, from five different genres using a slow event-related paradigm. It also includes raw fMRI data, as well as pre-computed structural alignments for within-subject and group analysis.
Additionally, seven of the twenty subjects participated in another study: empirical ultra high-field fMRI data recorded at four spatial resolutions (0.8 mm, 1.4 mm, 2 mm, and 3 mm isotropic voxel size) for orientation decoding in visual cortex — in order to test hypotheses on the strength and spatial scale of orientation discriminating signals. (https://www.sciencedirect.com/science/article/pii/S2352340917302056)
Finally, there are additional acquisitions for fifteen of the twenty participants: retinotopic mapping, a localizer paradigm for higher visual areas (FFA, EBA, PPA), and another 2-hour movie recording with 3T full-brain BOLD fMRI with simultaneous 1000 Hz eyetracking.
For more information about the project visit: http://studyforrest.org
./sourcedata/acquisition_protocols/04-sT1W_3D_TFE_TR2300_TI900_0_7iso_FS.txt
./sourcedata/acquisition_protocols/05-sT2W_3D_TSE_32chSHC_0_7iso.txt
./sourcedata/acquisition_protocols/06-VEN_BOLD_HR_32chSHC.txt
./sourcedata/acquisition_protocols/07-DTI_high_2iso.txt
./sourcedata/acquisition_protocols/08-field_map.txt
Philips-specific MRI acquisition parameter dumps (plain text) for structural MRI (T1w, T2w, SWI, DTI, fieldmap -- in this order)
./sourcedata/acquisition_protocols/task01_fmri_session1.pdf
./sourcedata/acquisition_protocols/task01_fmri_session2.pdf
./sourcedata/acquisition_protocols/angio_session.pdf
Siemens-specific MRI acquisition parameter dumps (PDF format) for functional MRI and angiography.
./stimuli/annotations/german_audio_description.csv
Audio-description transcript
This transcript contains all information on the audio-movie content that cannot be inferred from the DVD release — in a plain text, comma-separated-value table. Start and end time stamp, as well as the spoken text are provided for each continuous audio description segment.
./stimuli/annotations/scenes.csv
Movie scenes
A plain text, comma-separated-value table with start and end time for all 198 scenes in the presented movie cut. In addition, each table row contains whether a scene takes place indoors or outdoors.
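A quick, hedged way to inspect both annotation tables is sketched below; the exact column headers are not specified here, so the snippet only loads the files and prints the first rows (add header=None to read_csv if the files turn out to have no header row).

import pandas as pd

audio_desc = pd.read_csv("./stimuli/annotations/german_audio_description.csv")
scenes = pd.read_csv("./stimuli/annotations/scenes.csv")
print(audio_desc.head())   # start/end time stamps and spoken text segments
print(scenes.head())       # start/end times and indoor/outdoor flag for the 198 scenes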
./stimuli/generate/generate_melt_cmds.py Python script to generate commands for stimuli generation
./stimuli/psychopy/buttons.csv
./stimuli/psychopy/forrest_gump.psyexp
./stimuli/psychopy/segment_cfg.csv
Source code of the stimuli presentation in PsychoPy
Prolonged quasi-natural auditory stimulation (Forrest Gump audio movie)
Eight approximately 15 min long recording runs, together comprising the entire duration of a two-hour presentation of an audio-only version of the Hollywood feature film "Forrest Gump" made for a visually impaired audience (German dubbing).
For each run, there are 4D volumetric images (160x160x36) in NIfTI format, one volume recorded every 2 s, obtained from a Siemens MR scanner at 7 Tesla using a T2*-weighted gradient-echo EPI sequence (1.4 mm isotropic voxel size). These images have partial brain coverage, centered on the auditory cortices in both brain hemispheres, and include frontal and posterior portions of the brain. There is no coverage of the upper portion of the brain (e.g. large parts of the motor and somatosensory cortices).
Several flavors of raw and preprocessed data are available:
Raw BOLD functional MRI
~~~~~~~~~~~~~~~~~~~~~~~
These raw data suffer from severe geometric distortions.
Filename examples for subject 01 and run 01
./sub-01/ses-forrestgump/func/sub-01_ses-forrestgump_task-forrestgump_acq-raw_run-01_bold.nii.gz BOLD data
./sourcedata/dicominfo/sub-01/ses-forrestgump/func/sub-01_ses-forrestgump_task-forrestgump_acq-raw_run-01_bold_dicominfo.txt Image property dump from DICOM conversion
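As a small sketch (using nibabel, which is not prescribed by the dataset itself), one raw BOLD run can be loaded to confirm the matrix size and repetition time described above.

import nibabel as nib

img = nib.load("./sub-01/ses-forrestgump/func/"
               "sub-01_ses-forrestgump_task-forrestgump_acq-raw_run-01_bold.nii.gz")
print(img.shape)               # expected (160, 160, 36, n_volumes)
print(img.header.get_zooms())  # ~1.4 mm voxels, with the 2 s TR as the fourth entry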
Raw BOLD functional MRI (with applied distortion correction)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Identical to the raw BOLD data, but with a scanner-side correction for geometric distortions applied (also including correction for participant motion). These data are most suitable for analysis of individual brains.
Filename examples for subject 01 and run 01
./sub-01/ses-forrestgump/func/sub-01_ses-forrestgump_task-forrestgump_acq-dico_run-01_bold.nii.gz BOLD data
./derivatives/motion/sub-01/ses-forrestgump/func/sub-01_ses-forrestgump_task-forrestgump_acq-dico_run-01_moco_ref.nii.gz Reference volume used for motion correction. Only runs 1 and 5 (first runs in each session)
./sourcedata/dicominfo/sub-01/ses-forrestgump/func/sub-01_ses-forrestgump_task-forrestgump_acq-dico_run-01_bold_dicominfo.txt Image property dump from DICOM conversion
Raw BOLD functional MRI (linear anatomical alignment)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
These images are motion and distortion corrected and have been anatomically aligned to a BOLD group template image that was generated from the entire group of participants.
The alignment procedure was linear (image projection using an affine transformation). These data are most suitable for group analyses and inter-individual comparisons.
Filename examples for subject 01 and run 01
./derivatives/linear_anatomical_alignment/sub-01/ses-forrestgump/func/sub-01_ses-forrestgump_task-forrestgump_rec-dico7Tad2grpbold7Tad_run-01_bold.nii.gz BOLD data
./derivatives/linear_anatomical_alignment/sub-01/ses-forrestgump/func/sub-01_ses-forrestgump_task-forrestgump_rec-dico7Tad2grpbold7TadBrainMask_run-01_bold.nii.gz Matching brain mask volume
./derivatives/linear_anatomical_alignment/sub-01/ses-forrestgump/func/sub-01_ses-forrestgump_task-forrestgump_rec-XFMdico7Tad2grpbold7Tad_run-01_bold.mat 4x4 affine transformation matrix (plain text format)
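The plain-text .mat file can be read as a 4x4 matrix, for example as sketched below; how the matrix is applied (voxel vs. world coordinates) depends on the registration tool used, so treat this purely as an illustration of the file format.

import numpy as np

xfm = np.loadtxt("./derivatives/linear_anatomical_alignment/sub-01/ses-forrestgump/func/"
                 "sub-01_ses-forrestgump_task-forrestgump_rec-XFMdico7Tad2grpbold7Tad_run-01_bold.mat")
print(xfm.shape)                            # (4, 4) affine into the group template space
point = np.array([10.0, 20.0, 15.0, 1.0])   # an example coordinate in homogeneous form
print(xfm @ point)                          # the same point after the affine mapping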
Raw BOLD functional MRI (non-linear anatomical alignment)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
These images are motion and distortion corrected and have been anatomically aligned to a BOLD group template image that was generated from the entire group of participants.
The alignment procedure was non-linear (image projection using an affine transformation with an additional transformation by non-linear warpfields). These data are most suitable for group analyses and inter-individual comparisons.
Filename examples for subject 01 and run 01
./derivatives/non-linear_anatomical_alignment/sub-01/ses-forrestgump/func/sub-01_ses-forrestgump_task-forrestgump_rec-dico7Tad2grpbold7TadNL_run-01_bold.nii.gz BOLD data
./derivatives/non-linear_anatomical_alignment/sub-01/ses-forrestgump/func/sub-01_ses-forrestgump_task-forrestgump_rec-dico7Tad2grpbold7TadBrainMaskNLBrainMask_run-01_bold.nii.gz Matching brain mask volume
./derivatives/non-linear_anatomical_alignment/sub-01/ses-forrestgump/func/sub-01_ses-forrestgump_task-forrestgump_rec-dico7Tad2grpbold7TadNLWarp_run-01_bold.nii.gz Warpfield (the associated affine transformation is identical to the "linear" alignment)
Participants were repeatedly stimulated with a total of 25 music clips, with and without speech content, from five different genres using a slow event-related paradigm.
Filename examples for subject 01 and run 01
./sub-01/ses-auditoryperception/func/sub-01_ses-auditoryperception_task-auditoryperception_run-01_bold.nii.gz
./sub-01/ses-auditoryperception/func/sub-01_ses-auditoryperception_task-auditoryperception_run-01_events.tsv
Filename examples for subject 01 and run 01
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Wikipedia plain text data obtained from Wikipedia dumps with WikiExtractor in February 2018.
The data come from all Wikipedias for which dumps could be downloaded at [https://dumps.wikimedia.org/]. This amounts to 297 Wikipedias, usually corresponding to individual languages and identified by their ISO codes. Several special Wikipedias are included, most notably "simple" (Simple English Wikipedia) and "incubator" (tiny hatching Wikipedias in various languages). For a list of all the Wikipedias, see [https://meta.wikimedia.org/wiki/List_of_Wikipedias].
The script which can be used to get a new version of the data is included, but note that Wikipedia limits the download speed for downloading a lot of the dumps, so it takes a few days to download all of them (but one or a few can be downloaded fast). Also, the format of the dumps changes from time to time, so the script will probably eventually stop working one day. The WikiExtractor tool [http://medialab.di.unipi.it/wiki/Wikipedia_Extractor] used to extract text from the Wikipedia dumps is not mine; I only modified it slightly to produce plaintext outputs [https://github.com/ptakopysk/wikiextractor].
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Research in information science and scholarly communication strongly relies on the availability of openly accessible datasets of metadata and, where possible, their relative payloads. To this end, CrossRef plays a pivotal role by providing free access to its entire metadata collection, and allowing other initiatives to link and enrich its information. Therefore, a number of key pieces of information end up scattered across diverse datasets and resources freely available online. As a result of this fragmentation, researchers in this domain end up struggling with daily integration problems, producing a plethora of ad-hoc datasets, thereby wasting time and resources and infringing open science best practices.
The latest DOIBoost release is a metadata collection that enriches CrossRef (October 2019 release: 108,048,986 publication records) with inputs from Microsoft Academic Graph (October 2019 release: 76,171,072 publication records), ORCID (October 2019 release: 12,642,131 publication records), and Unpaywall (August 2019 release: 26,589,869 publication records) for the purpose of supporting high-quality and robust research experiments. As a result of DOIBoost, CrossRef records have been "boosted" as follows:
47,254,618 CrossRef records have been enriched with an abstract from MAG;
33,279,428 CrossRef records have been enriched with an affiliation from MAG and/or ORCID;
509,588 CrossRef records have been enriched with an ORCID identifier from ORCID.
This entry consists of three files: doiboost_dump-2019-11-27.tar (contains a set of partXYZ.gz files, each one containing the JSON files relative to the enriched CrossRef records), schemaAndSample.zip, and termsOfUse.doc (contains details on the terms of use of DOIBoost).
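A rough sketch of peeking into the archive is shown below; it assumes each part file decompresses to one JSON record per line, which should be verified against schemaAndSample.zip before relying on it.

import gzip
import json
import tarfile

with tarfile.open("doiboost_dump-2019-11-27.tar") as tar:
    member = next(m for m in tar.getmembers() if m.name.endswith(".gz"))
    with gzip.open(tar.extractfile(member), mode="rt") as fh:
        first_record = json.loads(fh.readline())   # assumes JSON-lines content
print(sorted(first_record.keys()))                 # fields of one enriched CrossRef record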
Note that this record comes with two relationships to other results of this experiment:
link to the data paper: for more information on how the dataset is (and can be) generated;
link to the software: to repeat the experiment
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.
This repository contains two files:
The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.
The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:
In the remainder of this text, we give instructions for reproducing the analyses using the data provided in the dump, and for reproducing the collection by collecting data from GitHub again.
Reproducing the Analysis
This section shows how to load the data in the database and run the analysis notebooks. In the analysis, we used the following environment:
Ubuntu 18.04.1 LTS
PostgreSQL 10.6
Conda 4.5.1
Python 3.6.8
PdfCrop 2012/11/02 v1.38
First, download dump.tar.bz2 and extract it:
tar -xjf dump.tar.bz2
It extracts the file db2019-01-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:
psql jupyter < db2019-01-13.dump
It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTION:
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";
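For reference, a minimal sketch of how an analysis notebook can pick up this connection string with sqlalchemy (the exact usage inside the provided notebooks may differ):

import os
from sqlalchemy import create_engine, text

engine = create_engine(os.environ["JUP_DB_CONNECTION"])
with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())   # connection sanity check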
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Create a conda environment with Python 3.6:
conda create -n py36 python=3.6
Go to the analyses folder and install all the dependencies from requirements.txt:
cd jupyter_reproducibility/analyses
pip install -r requirements.txt
To reproduce the analyses, run Jupyter in this folder:
jupyter notebook
Execute the notebooks in this order:
Reproducing or Expanding the Collection
The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.
Requirements
This time, we have extra requirements:
All the analysis requirements
lbzip2 2.5
gcc 7.3.0
GitHub account
Gmail account
Environment
First, set the following environment variables:
export JUP_MACHINE="db"; # machine identifier
export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
export JUP_COMPRESSION="lbzip2"; # compression program
export JUP_VERBOSE="5"; # verbose level
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlalchemy connection
export JUP_GITHUB_USERNAME="github_username"; # your github username
export JUP_GITHUB_PASSWORD="github_password"; # your github password
export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
export JUP_OAUTH_FILE="~/oauth2_creds.json" # oauth2 authentication file
export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it blank
export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it blank
export JUP_WITH_EXECUTION="1"; # execute Python notebooks
export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependencies
export JUP_EXECUTION_MODE="-1"; # run following the execution order
export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
export JUP_NOTEBOOK_TIMEOUT="300"; # timeout for the extraction
# Frequency of log reports
export JUP_ASTROID_FREQUENCY="5";
export JUP_IPYTHON_FREQUENCY="5";
export JUP_NOTEBOOKS_FREQUENCY="5";
export JUP_REQUIREMENT_FREQUENCY="5";
export JUP_CRAWLER_FREQUENCY="1";
export JUP_CLONE_FREQUENCY="1";
export JUP_COMPRESS_FREQUENCY="5";
export JUP_DB_IP="localhost"; # postgres database IP
Then, configure the file ~/oauth2_creds.json according to the yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf
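A hedged example of the resulting notification setup is sketched below; the addresses simply mirror the placeholder values of the environment variables above.

import yagmail

yag = yagmail.SMTP("gmail@gmail.com", oauth2_file="~/oauth2_creds.json")
yag.send(to="target@email.com", subject="crawler status", contents="test notification")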
Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the directories; the second one should unmount it. You can leave the scripts blank, but it is not advisable, as the reproducibility study runs arbitrary code on your machine and you may lose your data.
Scripts
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Install 5 conda environments and 5 anaconda environments, one for each Python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (note that it is a local package that has not been published to PyPI; make sure to use the -e option):
Conda 2.7
conda create -n raw27 python=2.7 -y
conda activate raw27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 2.7
conda create -n py27 python=2.7 anaconda -y
conda activate py27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.4
It requires a manual jupyter and pathlib2 installation due to some incompatibilities found with the default installation.
conda create -n raw34 python=3.4 -y
conda activate raw34
conda install jupyter -c conda-forge -y
conda uninstall jupyter -y
pip install --upgrade pip
pip install jupyter
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
pip install pathlib2
Anaconda 3.4
conda create -n py34 python=3.4 anaconda -y
conda activate py34
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.5
conda create -n raw35 python=3.5 -y
conda activate raw35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.5
It requires the manual installation of other anaconda packages.
conda create -n py35 python=3.5 anaconda -y
conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
conda activate py35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.6
conda create -n raw36 python=3.6 -y
conda activate raw36
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.6
conda create -n py36 python=3.6 anaconda -y
conda activate py36
conda install -y anaconda-navigator jupyterlab_server navigator-updater
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.7
conda create -n raw37 python=3.7 -y
conda activate raw37
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.7
When we
https://www.datainsightsmarket.com/privacy-policy
The market for bulk dump trucks is anticipated to register a CAGR of 7% over the forecast period of 2023-2033, reaching a market size of 16,000 million value units by 2033. The growth of the market is attributed to the increasing demand for bulk materials in various industries such as construction, mining, and agriculture. Additionally, the rising adoption of automated and semi-automated bulk dump trucks is expected to further drive market growth. Manual bulk dump trucks currently account for the majority of the market share, but automated and semi-automated trucks are expected to gain traction due to their increased efficiency and safety features. North America and Europe are expected to remain the dominant regions in the bulk dump truck market, with a significant share in the global market. However, the Asia Pacific region is expected to witness the highest growth rate during the forecast period, driven by the increasing demand from developing countries such as China and India. Major companies operating in the bulk dump truck market include Automated Conveyor Company, CDS-LIPE, National Bulk Equipment, TOTE Systems, and Weening Brothers. These companies are focusing on product development and innovation to meet the evolving needs of customers and enhance their competitive advantages in the market.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains original data used for data analyses and visualizations discussed in the “Embracing data-driven decision making to manage and communicate the impact of big science collaborations” paper. Four High Energy Physics and Astrophysics projects were studied: ATLAS, BaBar, LIGO, and IceCube. Data for these projects was collected from INSPIRE-HEP (https://inspirehep.net), from dumps available at http://old.inspirehep.net/dumps/inspire-dump.html, on Jan 8, 2020. The Processed folder contains preprocessed data according to the code in the repository: https://github.com/bigscience/bigscience.
Data dump from the Research Organization Registry (ROR), a community-led registry of open identifiers for research organizations. Release v1.21 contains ROR IDs and metadata for 104,834 research organizations. This includes the addition of 104 new records and metadata updates to 265 existing records. See the release notes.
Starting with this release, the data dump includes a CSV version of the ROR data file in addition to the canonical JSON file. The data dump zip therefore now contains two files instead of one. If your code currently expects only one file, you will need to update it accordingly. The CSV contains a subset of fields from the JSON file, some of which have been flattened for easier parsing.
Beginning with its March 2022 release, ROR is curated independently from GRID. Semantic versioning beginning with v1.0 was added to reflect this departure from GRID. The existing data structure was not changed. From March 2022 onward, data releases are versioned as follows:
Minor versions (ex 1.1, 1.2, 1.3): Contain changes to data, such as new records and updates to existing records. No changes to the data model/structure.
Patch versions (ex 1.0.1): Used infrequently to correct errors in a release. No changes to the data model/structure.
Major versions (ex 1.x, 2.x, 3.x): Contain changes to the data model/structure, as well as the data itself. Major versions will be released with significant advance notice.
For convenience, the date is also included in the release file name, ex: v1.0-2022-03-15-ror-data.zip.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
See https://meta.wikimedia.org/wiki/Data_dumps for more detail on using these dumps.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Business process event data modeled as labeled property graphs
Data Format
-----------
The dataset comprises one labeled property graph in two different file formats.
#1) Neo4j .dump format
A neo4j (https://neo4j.com) database dump that contains the entire graph and can be imported into a fresh neo4j database instance using the following command; see also the neo4j documentation: https://neo4j.com/docs/
/bin/neo4j-admin.(bat|sh) load --database=graph.db --from=
The .dump was created with Neo4j v3.5.
#2) .graphml format
A .zip file containing a .graphml file of the entire graph
Data Schema
-----------
The graph is a labeled property graph over business process event data. Each graph uses the following concepts:
:Event nodes - each event node describes a discrete event, i.e., an atomic observation described by attribute "Activity" that occurred at the given "timestamp"
:Entity nodes - each entity node describes an entity (e.g., an object or a user), it has an EntityType and an identifier (attribute "ID")
:Log nodes - describes a collection of events that were recorded together, most graphs only contain one log node
:Class nodes - each class node describes a type of observation that has been recorded, e.g., the different types of activities that can be observed, :Class nodes group events into sets of identical observations
:CORR relationships - from :Event to :Entity nodes, describes whether an event is correlated to a specific entity; an event can be correlated to multiple entities
:DF relationships - "directly-followed by" between two :Event nodes describes which event is directly-followed by which other event; both events in a :DF relationship must be correlated to the same entity node. All :DF relationships form a directed acyclic graph.
:HAS relationship - from a :Log to an :Event node, describes which events had been recorded in which event log
:OBSERVES relationship - from an :Event to a :Class node, describes to which event class an event belongs, i.e., which activity was observed in the graph
:REL relationship - placeholder for any structural relationship between two :Entity nodes
The concepts are further defined in Stefan Esser, Dirk Fahland: Multi-Dimensional Event Data in Graph Databases. CoRR abs/2005.14552 (2020) https://arxiv.org/abs/2005.14552
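As an illustrative query against a restored instance of the dump (the bolt URI and credentials below are placeholders for whatever your local Neo4j is configured with), the schema can be explored with the official Python driver:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "neo4j"))
with driver.session() as session:
    result = session.run(
        "MATCH (e:Event)-[:CORR]->(n:Entity) "
        "RETURN n.EntityType AS entity_type, count(e) AS events "
        "ORDER BY events DESC LIMIT 5"
    )
    for record in result:
        print(record["entity_type"], record["events"])  # events correlated per entity type
driver.close()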
Data Contents
-------------
neo4j-bpic19-2021-02-17 (.dump|.graphml.zip)
An integrated graph describing the raw event data of the entire BPI Challenge 2019 dataset.
van Dongen, B.F. (Boudewijn) (2019): BPI Challenge 2019. 4TU.ResearchData. Collection. https://doi.org/10.4121/uuid:d06aff4b-79f0-45e6-8ec8-e19730c248f1
This data originated from a large multinational company operating from The Netherlands in the area of coatings and paints, and we ask participants to investigate the purchase order handling process for some of its 60 subsidiaries. In particular, the process owner has compliance questions. In the data, each purchase order (or purchase document) contains one or more line items. For each line item, there are roughly four types of flows in the data:
(1) 3-way matching, invoice after goods receipt: For these items, the value of the goods receipt message should be matched against the value of an invoice receipt message and the value put during creation of the item (indicated by both the GR-based flag and the Goods Receipt flags set to true).
(2) 3-way matching, invoice before goods receipt: Purchase items that do require a goods receipt message, while they do not require GR-based invoicing (indicated by the GR-based IV flag set to false and the Goods Receipt flags set to true). For such purchase items, invoices can be entered before the goods are received, but they are blocked until the goods are received. This unblocking can be done by a user, or by a batch process at regular intervals. Invoices should only be cleared if goods are received and the value matches with the invoice and the value at creation of the item.
(3) 2-way matching (no goods receipt needed): For these items, the value of the invoice should match the value at creation (in full or partially until the PO value is consumed), but there is no separate goods receipt message required (indicated by both the GR-based flag and the Goods Receipt flags set to false).
(4) Consignment: For these items, there are no invoices on PO level as this is handled fully in a separate process. Here we see the GR indicator is set to true but the GR IV flag is set to false, and we also know by item type (consignment) that we do not expect an invoice against this item.
Unfortunately, the complexity of the data goes further than just this division into four categories. For each purchase item, there can be many goods receipt messages and corresponding invoices which are subsequently paid. Consider for example the process of paying rent. There is a Purchase Document with one item for paying rent, but a total of 12 goods receipt messages with (cleared) invoices with a value equal to 1/12 of the total amount. For logistical services, there may even be hundreds of goods receipt messages for one line item. Overall, for each line item, the amounts of the line item, the goods receipt messages (if applicable) and the invoices have to match for the process to be compliant.
Of course, the log is anonymized, but some semantics are left in the data, for example: The resources are split between batch users and normal users indicated by their name. The batch users are automated processes executed by different systems. The normal users refer to human actors in the process. The monetary values of each event are anonymized from the original data using a linear translation respecting 0, i.e. addition of multiple invoices for a single item should still lead to the original item worth (although there may be small rounding errors for numerical reasons). Company, vendor, system and document names and IDs are anonymized in a consistent way throughout the log. The company has the key, so any result can be translated by them to business insights about real customers and real purchase documents.
The case ID is a combination of the purchase document and the purchase item. There is a total of 76,349 purchase documents containing in total 251,734 items, i.e. there are 251,734 cases. In these cases, there are 1,595,923 events relating to 42 activities performed by 627 users (607 human users and 20 batch users). Sometimes the user field is empty, or NONE, which indicates no user was recorded in the source system. For each purchase item (or case) the following attributes are recorded:
concept:name: A combination of the purchase document id and the item id
Purchasing Document: The purchasing document ID
Item: The item ID
Item Type: The type of the item
GR-Based Inv. Verif.: Flag indicating if GR-based invoicing is required (see above)
Goods Receipt: Flag indicating if 3-way matching is required (see above)
Source: The source system of this item
Doc. Category name: The name of the category of the purchasing document
Company: The subsidiary of the company from where the purchase originated
Spend classification text: A text explaining the class of purchase item
Spend area text: A text explaining the area for the purchase item
Sub spend area text: Another text explaining the area for the purchase item
Vendor: The vendor to which the purchase document was sent
Name: The name of the vendor
Document Type: The document type
Item Category: The category as explained above (3-way with GR-based invoicing, 3-way without, 2-way, consignment)
The data contains the following entities and their events
- PO - Purchase Order documents handled at a large multinational company operating from The Netherlands
- POItem - an item in a Purchase Order document describing a specific item to be purchased
- Resource - the user or worker handling the document or a specific item
- Vendor - the external organization from which an item is to be purchased
Data Size
---------
BPIC19, nodes: 1926651, relationships: 15082099