30 datasets found

Z
Data Cleaning, Translation & Split of the Dataset for the Automatic...
data.niaid.nih.gov
zenodo.org
Updated Aug 8, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Köhler, Juliane (2022). Data Cleaning, Translation & Split of the Dataset for the Automatic Classification of Documents for the Classification System for the Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6957841
Explore at:
Dataset updated
Aug 8, 2022
Dataset authored and provided by
Köhler, Juliane
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Cleaned_Dataset.csv – The combined CSV files of all scraped documents from DABI, e-LiS, o-bib and Springer.

Data_Cleaning.ipynb – The Jupyter Notebook with python code for the analysis and cleaning of the original dataset.

ger_train.csv – The German training set as CSV file.

ger_validation.csv – The German validation set as CSV file.

en_test.csv – The English test set as CSV file.

en_train.csv – The English training set as CSV file.

en_validation.csv – The English validation set as CSV file.

splitting.py – The python code for splitting a dataset into train, test and validation set.

DataSetTrans_de.csv – The final German dataset as a CSV file.

DataSetTrans_en.csv – The final English dataset as a CSV file.

translation.py – The python code for translating the cleaned dataset.
Pre-Processed Power Grid Frequency Time Series
data.subak.org
csv
Updated Feb 16, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zenodo (2023). Pre-Processed Power Grid Frequency Time Series [Dataset]. https://data.subak.org/dataset/pre-processed-power-grid-frequency-time-series
Explore at:
csvAvailable download formats
Dataset updated
Feb 16, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Description
Overview

This repository contains ready-to-use frequency time series as well as the corresponding pre-processing scripts in python. The data covers three synchronous areas of the European power grid:

Continental Europe

Great Britain

Nordic

This work is part of the paper "Predictability of Power Grid Frequency"[1]. Please cite this paper, when using the data and the code. For a detailed documentation of the pre-processing procedure we refer to the supplementary material of the paper.

Data sources

We downloaded the frequency recordings from publically available repositories of three different Transmission System Operators (TSOs).

Continental Europe [2]: We downloaded the data from the German TSO TransnetBW GmbH, which retains the Copyright on the data, but allows to re-publish it upon request [3].

Great Britain [4]: The download was supported by National Grid ESO Open Data, which belongs to the British TSO National Grid. They publish the frequency recordings under the NGESO Open License [5].

Nordic [6]: We obtained the data from the Finish TSO Fingrid, which provides the data under the open license CC-BY 4.0 [7].

Content of the repository

A) Scripts

In the "Download_scripts" folder you will find three scripts to automatically download frequency data from the TSO's websites.

In "convert_data_format.py" we save the data with corrected timestamp formats. Missing data is marked as NaN (processing step (1) in the supplementary material of [1]).

In "clean_corrupted_data.py" we load the converted data and identify corrupted recordings. We mark them as NaN and clean some of the resulting data holes (processing step (2) in the supplementary material of [1]).

The python scripts run with Python 3.7 and with the packages found in "requirements.txt".

B) Yearly converted and cleansed data

The folders "

File type: The files are zipped csv-files, where each file comprises one year.

Data format: The files contain two columns. The second column contains the frequency values in Hz. The first one represents the time stamps in the format Year-Month-Day Hour-Minute-Second, which is given as naive local time. The local time refers to the following time zones and includes Daylight Saving Times (python time zone in brackets):

TransnetBW: Continental European Time (CE)

Nationalgrid: Great Britain (GB)

Fingrid: Finland (Europe/Helsinki)

NaN representation: We mark corrupted and missing data as "NaN" in the csv-files.

Use cases

We point out that this repository can be used in two different was:

Use pre-processed data: You can directly use the converted or the cleansed data. Note however, that both data sets include segments of NaN-values due to missing and corrupted recordings. Only a very small part of the NaN-values were eliminated in the cleansed data to not manipulate the data too much.

Produce your own cleansed data: Depending on your application, you might want to cleanse the data in a custom way. You can easily add your custom cleansing procedure in "clean_corrupted_data.py" and then produce cleansed data from the raw data in "

License

This work is licensed under multiple licenses, which are located in the "LICENSES" folder.

We release the code in the folder "Scripts" under the MIT license .

The pre-processed data in the subfolders "**/Fingrid" and "**/Nationalgrid" are licensed under CC-BY 4.0.

TransnetBW originally did not publish their data under an open license. We have explicitly received the permission to publish the pre-processed version from TransnetBW. However, we cannot publish our pre-processed version under an open license due to the missing license of the original TransnetBW data.

Changelog

Version 2:

Add time zone information to description

Include new frequency data

Update references

Change folder structure to yearly folders

Version 3:

Correct TransnetBW files for missing data in May 2016
o
Data from: ManyTypes4Py: A benchmark Python Dataset for Machine...
explore.openaire.eu
Updated Apr 26, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Amir M. Mir; Evaldas Latoskinas; Georgios Gousios (2021). ManyTypes4Py: A benchmark Python Dataset for Machine Learning-Based Type Inference [Dataset]. http://doi.org/10.5281/zenodo.4044635
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.4044635
Dataset updated
Apr 26, 2021
Authors
Amir M. Mir; Evaldas Latoskinas; Georgios Gousios
Description
The dataset is gathered on Sep. 17th 2020 from GitHub. It has more than 5.2K Python repositories and 4.2M type annotations. The dataset is also de-duplicated using the CD4Py tool. Check out the README.MD file for the description of the dataset. Notable changes to each version of the dataset are documented in CHANGELOG.md. The dataset's scripts and utilities are available on its GitHub repository.
t
Data from: Decoding Wayfinding: Analyzing Wayfinding Processes in the...
researchdata.tuwien.at
html, pdf, zip
Updated Feb 23, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Negar Alinaghi; Ioannis Giannopoulos; Ioannis Giannopoulos; Negar Alinaghi; Negar Alinaghi; Negar Alinaghi (2025). Decoding Wayfinding: Analyzing Wayfinding Processes in the Outdoor Environment [Dataset]. http://doi.org/10.48436/m2ha4-t1v92
Explore at:
html, zip, pdfAvailable download formats
Unique identifier
https://doi.org/10.48436/m2ha4-t1v92
Dataset updated
Feb 23, 2025
Dataset provided by
TU Wien
Authors
Negar Alinaghi; Ioannis Giannopoulos; Ioannis Giannopoulos; Negar Alinaghi; Negar Alinaghi; Negar Alinaghi
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Folder Structure

The folder named “submission” contains the following:

“pythonProject”: This folder contains all the Python files and subfolders needed for analysis.

ijgis.yml: This file lists all the Python libraries and dependencies required to run the code.

Setting Up the Environment

Use the ijgis.yml file to create a Python project and environment. Ensure you activate the environment before running the code.

The pythonProject folder contains several .py files and subfolders, each with specific functionality as described below.

Subfolders

1. Data_4_IJGIS

This folder contains the data used for the results reported in the paper.

Note: The data analysis that we explain in this paper already begins with the synchronization and cleaning of the recorded raw data. The published data is already synchronized and cleaned. Both the cleaned files and the merged files with features extracted for them are given in this directory. If you want to perform the segmentation and feature extraction yourself, you should run the respective Python files yourself. If not, you can use the “merged_…csv” files as input for the training.

2. results_[DateTime] (e.g., results_20240906_15_00_13)

This folder will be generated when you run the code and will store the output of each step.

The current folder contains results created during code debugging for the submission.

When you run the code, a new folder with fresh results will be generated.

Python Files

1. helper_functions.py

Contains reusable functions used throughout the analysis.

Each function includes a description of its purpose and the input parameters required.

2. create_sanity_plots.py

Generates scatter plots like those in Figure 3 of the paper.

Although the code has been run for all 309 trials, it can be used to check the sample data provided.

Output: A .png file for each column of the raw gaze and IMU recordings, color-coded with logged events.

Usage: Run this file to create visualizations similar to Figure 3.

3. overlapping_sliding_window_loop.py

Implements overlapping sliding window segmentation and generates plots like those in Figure 4.

Output:

Two new subfolders, “Gaze” and “IMU”, will be added to the Data_4_IJGIS folder.

Segmented files (default: 2–10 seconds with a 1-second step size) will be saved as .csv files.

A visualization of the segments, similar to Figure 4, will be automatically generated.

4. gaze_features.py & imu_features.py

These files compute features as explained in Tables 1 and 2 of the paper, respectively.

They process the segmented recordings generated by the overlapping_sliding_window_loop.py.

Usage: Just to know how the features are calculated, you can run this code after the segmentation with the sliding window and run these files to calculate the features from the segmented data.

5. training_prediction.py

This file contains the main machine learning analysis of the paper. This file contains all the code for the training of the model, its evaluation, and its use for the inference of the “monitoring part”. It covers the following steps:

a. Data Preparation (corresponding to Section 5.1.1 of the paper)

Prepares the data according to the research question (RQ) described in the paper. Since this data was collected with several RQs in mind, we remove parts of the data that are not related to the RQ of this paper.

A function named plot_labels_comparison(df, save_path, x_label_freq=10, figsize=(15, 5)) in line 116 visualizes the data preparation results. As this visualization is not used in the paper, the line is commented out, but if you want to see visually what has been changed compared to the original data, you can comment out this line.

b. Training/Validation/Test Split

Splits the data for machine learning experiments (an explanation can be found in Section 5.1.1. Preparation of data for training and inference of the paper).

Make sure that you follow the instructions in the comments to the code exactly.

Output: The split data is saved as .csv files in the results folder.

c. Machine and Deep Learning Experiments

This part contains three main code blocks:

iii. One for the XGboost code with correct hyperparameter tuning:
Please read the instructions for each block carefully to ensure that the code works smoothly. Regardless of which block you use, you will get the classification results (in the form of scores) for unseen data. The way we empirically test the confidence threshold of

MLP Network (Commented Out): This code was used for classification with the MLP network, and the results shown in Table 3 are from this code. If you wish to use this model, please comment out the following blocks accordingly.

XGBoost without Hyperparameter Tuning: If you want to run the code but do not want to spend time on the full training with hyperparameter tuning (as was done for the paper), just uncomment this part. This will give you a simple, untuned model with which you can achieve at least some results.

XGBoost with Hyperparameter Tuning: If you want to train the model the way we trained it for the analysis reported in the paper, use this block (the plots in Figure 7 are from this block). We ran this block with different feature sets and different segmentation files and created a simple bar chart from the saved results, shown in Figure 6.

Note: Please read the instructions for each block carefully to ensure that the code works smoothly. Regardless of which block you use, you will get the classification results (in the form of scores) for unseen data. The way we empirically calculated the confidence threshold of the model (explained in the paper in Section 5.2. Part II: Decoding surveillance by sequence analysis) is given in this block in lines 361 to 380.

d. Inference (Monitoring Part)

Final inference is performed using the monitoring data. This step produces a .csv file containing inferred labels.

Figure 8 in the paper is generated using this part of the code.

6. sequence_analysis.py

Performs analysis on the inferred data, producing Figures 9 and 10 from the paper.

This file reads the inferred data from the previous step and performs sequence analysis as described in Sections 5.2.1 and 5.2.2.

Licenses

The data is licensed under CC-BY, the code is licensed under MIT.
f
Data from: S1 Data -
plos.figshare.com
zip
Updated Oct 11, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yancong Zhou; Wenyue Chen; Xiaochen Sun; Dandan Yang (2023). S1 Data - [Dataset]. http://doi.org/10.1371/journal.pone.0292466.s001
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0292466.s001
Dataset updated
Oct 11, 2023
Dataset provided by
PLOS ONE
Authors
Yancong Zhou; Wenyue Chen; Xiaochen Sun; Dandan Yang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analyzing customers’ characteristics and giving the early warning of customer churn based on machine learning algorithms, can help enterprises provide targeted marketing strategies and personalized services, and save a lot of operating costs. Data cleaning, oversampling, data standardization and other preprocessing operations are done on 900,000 telecom customer personal characteristics and historical behavior data set based on Python language. Appropriate model parameters were selected to build BPNN (Back Propagation Neural Network). Random Forest (RF) and Adaboost, the two classic ensemble learning models were introduced, and the Adaboost dual-ensemble learning model with RF as the base learner was put forward. The four models and the other four classical machine learning models-decision tree, naive Bayes, K-Nearest Neighbor (KNN), Support Vector Machine (SVM) were utilized respectively to analyze the customer churn data. The results show that the four models have better performance in terms of recall rate, precision rate, F1 score and other indicators, and the RF-Adaboost dual-ensemble model has the best performance. Among them, the recall rates of BPNN, RF, Adaboost and RF-Adaboost dual-ensemble model on positive samples are respectively 79%, 90%, 89%,93%, the precision rates are 97%, 99%, 98%, 99%, and the F1 scores are 87%, 95%, 94%, 96%. The RF-Adaboost dual-ensemble model has the best performance, and the three indicators are 10%, 1%, and 6% higher than the reference. The prediction results of customer churn provide strong data support for telecom companies to adopt appropriate retention strategies for pre-churn customers and reduce customer churn.
T
wikipedia
tensorflow.org
huggingface.co
Updated Aug 9, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2019). wikipedia [Dataset]. https://www.tensorflow.org/datasets/catalog/wikipedia
Explore at:
Dataset updated
Aug 9, 2019
Description
Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).

To use this dataset:

import tensorflow_datasets as tfds ds = tfds.load('wikipedia', split='train') for ex in ds.take(4): print(ex)

See the guide for more informations on tensorflow_datasets.
Europe PMC Full Text Corpus
figshare.com
zip
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Santosh Tirunagari; Xiao Yang; Shyamasree Saha; Aravind Venkatesan; Vid Vartak; Johanna McEntyre (2023). Europe PMC Full Text Corpus [Dataset]. http://doi.org/10.6084/m9.figshare.22848380.v2
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.22848380.v2
Dataset updated
May 30, 2023
Dataset provided by
figshare
Authors
Santosh Tirunagari; Xiao Yang; Shyamasree Saha; Aravind Venkatesan; Vid Vartak; Johanna McEntyre
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains the Europe PMC full text corpus, a collection of 300 articles from the Europe PMC Open Access subset. Each article contains 3 core entity types, manually annotated by curators: Gene/Protein, Disease and Organism.

Corpus Directory Structure

annotations/: contains annotations of the 300 full-text articles in the Europe PMC corpus. Annotations are provided in 3 different formats.

hypothesis/csv/: contains raw annotations fetched from the annotation platform Hypothes.is in comma-separated values (CSV) format.
GROUP0/: contains raw manual annotations made by curator GROUP0. GROUP1/: contains raw manual annotations made by curator GROUP1. GROUP2/: contains raw manual annotations made by curator GROUP2.

IOB/: contains automatically extracted annotations using raw manual annotations in hypothesis/csv/, which is in Inside–Outside–Beginning tagging format.
dev/: contains IOB format annotations of 45 articles, suppose to be used a dev set in machine learning task. test/: contains IOB format annotations of 45 articles, suppose to be used a test set in machine learning task. train/: contains IOB format annotations of 210 articles, suppose to be used a training set in machine learning task.

JSON/: contains automatically extracted annotations using raw manual annotations in hypothesis/csv/, which is in JSON format. README.md: a detailed description of all the annotation formats.

articles/: contains the full-text articles annotated in Europe PMC corpus.

Sentencised/: contains XML articles whose text has been split into sentences using the Europe PMC sentenciser. XML/: contains XML articles directly fetched using Europe PMC Article Restful API. README.md: a detailed description of the sentencising and fetching of XML articles.

docs/: contains related documents that were used for generating the corpus.

Annotation guideline.pdf: annotation guideline that is provided to curators to assist the manual annotation. demo to molecular conenctions.pdf: annotation platform guideline that is provided to curator to help them get familiar with the Hypothes.is platform. Training set development.pdf: initial document that details the paper selection procedures.

pilot/: contains annotations and articles that were used in a pilot study.

annotations/csv/: contains raw annotations fetched from the annotation platform Hypothes.is in comma-separated values (CSV) format. articles/: contains the full-text articles annotated in the pilot study.

Sentencised/: contains XML articles whose text has been split into sentences using the Europe PMC sentenciser. XML/: contains XML articles directly fetched using Europe PMC Article Restful API.

README.md: a detailed description of the sentencising and fetching of XML articles.

src/: source codes for cleaning annotations and generating IOB files

metrics/ner_metrics.py: Python script contains SemEval evaluation metrics. annotations.py: Python script used to extract annotations from raw Hypothes.is annotations. generate_IOB_dataset.py: Python script used to convert JSON format annotations to IOB tagging format. generate_json_dataset.py: Python script used to extract annotations to JSON format. hypothesis.py: Python script used to fetch raw Hypothes.is annotations.

License

CCBY

Feedback

For any comment, question, and suggestion, please contact us through helpdesk@europepmc.org or Europe PMC contact page.
T
wikihow
tensorflow.org
paperswithcode.com
+2more
Updated Dec 6, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2022). wikihow [Dataset]. https://www.tensorflow.org/datasets/catalog/wikihow
Explore at:
Dataset updated
Dec 6, 2022
Description
WikiHow is a new large-scale dataset using the online WikiHow (http://www.wikihow.com/) knowledge base.

There are two features: - text: wikihow answers texts. - headline: bold lines as summary.

There are two separate versions: - all: consisting of the concatenation of all paragraphs as the articles and the bold lines as the reference summaries. - sep: consisting of each paragraph and its summary.

Download "wikihowAll.csv" and "wikihowSep.csv" from https://github.com/mahnazkoupaee/WikiHow-Dataset and place them in manual folder https://www.tensorflow.org/datasets/api_docs/python/tfds/download/DownloadConfig. Train/validation/test splits are provided by the authors. Preprocessing is applied to remove short articles (abstract length < 0.75 article length) and clean up extra commas.

To use this dataset:

import tensorflow_datasets as tfds ds = tfds.load('wikihow', split='train') for ex in ds.take(4): print(ex)

See the guide for more informations on tensorflow_datasets.
B
Office Light Level
borealisdata.ca
Updated Dec 7, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Y. Aussat; Costin Ograda-Bratu; S. Huo; S. Keshav (2020). Office Light Level [Dataset]. http://doi.org/10.5683/SP2/3BCADM
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.5683/SP2/3BCADM
Dataset updated
Dec 7, 2020
Dataset provided by
Borealis
Authors
Y. Aussat; Costin Ograda-Bratu; S. Huo; S. Keshav
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Time period covered
Nov 16, 2020
Description
Data Owner: Y. Aussat, S. Keshav Data File: 32.8 MB zip file containing the data files and description Data Description: This dataset contains daylight signals collected over approximately 200 days in four unoccupied offices in the Davis Center building at the University of Waterloo. Thus, these measure the available daylight in the room. Light levels were measured using custom-built light sensing modules based on the Omega Onion microcomputer with a light sensor. An example of the module is shown in the file sensing-module.png in this directory. Each sensing module is named using four hex digits. We started all modules on August 30, 2018, which corresponds to minute 0 in the dataset. However, the modules were not deployed immediately. Below are the times when we started collecting the light data in each office and corresponding sensing module names. Office number Devices Start time DC3526 af65, b02d September 6, 2018, 11:00 am DC2518 afa7 September 6, 2018, 11:00 am DC2319 af67, f073 September 21, 2018, 11:00 am DC3502 afa5, b969 September 21, 2018, 11:00 am Moreover, due to some technical problems, the initial 6 days for offices 1 and 2 and initial 21 days for offices 3 and 4 are dummy data and should be ignored. Finally, there were two known outages in DC during the data collection process: from 00:00 AM to 4:00 AM on September 17, 2018 from 11:00pm on 10/9/2018 until 7:45am on October 10, 2018 We stopped collecting the data around 2:45 pm on May 16, 2019. Therefore, we have 217 uninterrupted days of clean collected data from October 11, 2018 to May 15, 2019. To take care of these problems, we have provided a python script process-lighting-data.ipynb that extracts clean data from the raw data. Both raw and processed data are provided as described next. Raw data: Raw data folder names correspond to the device names. The light sensing modules log (minute_count, visible_light, IR_light) every minute to a file. Here, minute 0 corresponds to August 30, 2018. Every 1440 minutes (i.e., 1 day) we saved the current file, created a new one, and started writing to it. The filename format is {device_name}_{starting_minute}. For example Omega-AF65_28800.csv is data collected by Omega-AF65, starting at minute 28800. A metadata file can also be found in each folder with the details of the log file structure. Processed data: The folder named ‘processed_data’ contains the processed data, which results from running the python script. Each file in this directory is named after the device ID, for example af65.csv stores the processed data of the device Omega-AF65. The columns in this file are: Minutes: Consecutive minute of the experiment Illum: Illumination level (lux) Min_from_midnight: Minutes from midnight of the current day Day_of_exp: Count of the day number starting from October 11, 2018 Day_of_year: Day of the year Funding: The Natural Sciences and Engineering Research Council of Canada (NSERC)
c
Data to Estimate Water Use Associated with Oil and Gas Development within...
s.cnmilf.com
data.usgs.gov
+1more
Updated Jul 6, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2024). Data to Estimate Water Use Associated with Oil and Gas Development within the Bureau of Land Management Carlsbad Field Office Area, New Mexico [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/data-to-estimate-water-use-associated-with-oil-and-gas-development-within-the-bureau-of-la
Explore at:
Dataset updated
Jul 6, 2024
Dataset provided by
U.S. Geological Survey
Area covered
Carlsbad, New Mexico
Description
The purpose of this data release is to provide data in support of the Bureau of Land Management's (BLM) Reasonably Foreseeable Development (RFD) Scenario by estimating water-use associated with oil and gas extraction methods within the BLM Carlsbad Field Office (CFO) planning area, located in Eddy and Lea Counties as well as part of Chaves County, New Mexico. Three comma separated value files and two python scripts are included in this data release. It was determined that all reported oil and gas wells within Chaves County from the FracFocus and New Mexico Oil Conservation Division (NM OCD) databases were outside of the CFO administration area and were excluded from well_records.csv and modeled_estimates.csv. Data from Chaves County are included in the produced_water.csv file to be consistent with the BLM’s water support document. Data were synthesized into comma separated values which include, produced_water.csv (volume) from NM OCD, well_records.csv (including _location and completion) from NM OCD and FracFocus, and modeled_estimates.csv (using FracFocus as well as Ball and others (2020) as input data). The results from modeled_estimates.csv were obtained using a previously published regression model (McShane and McDowell, 2021) to estimate water use associated with unconventional oil and gas activities in the Permian Basin (Valder and others, 2021) for the period of interest (2010-2021). Additionally, python scripts to process, clean, and categorize FracFocus data are provided in this data release.
Loan Prediction Problem Dataset
kaggle.com
Updated Mar 12, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Debdatta Chatterjee (2019). Loan Prediction Problem Dataset [Dataset]. https://www.kaggle.com/datasets/altruistdelhite04/loan-prediction-problem-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 12, 2019
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Debdatta Chatterjee
Description
Dataset

This dataset was created by Debdatta Chatterjee

Contents
Spatialized sorghum & millet yields in West Africa, derived from LSMS-ISA...
zenodo.org
data.niaid.nih.gov
csv
Updated Jul 7, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Eliott Baboz; Eliott Baboz; Jérémy Lavarenne; Jérémy Lavarenne (2024). Spatialized sorghum & millet yields in West Africa, derived from LSMS-ISA and RHoMIS datasets [Dataset]. http://doi.org/10.5281/zenodo.10556266
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.10556266
Dataset updated
Jul 7, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Eliott Baboz; Eliott Baboz; Jérémy Lavarenne; Jérémy Lavarenne
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
West Africa
Description
Description: The dataset represents a significant effort to compile and clean a comprehensive set of seasonal yield data for sub-saharan West Africa (Benin, Burkina Faso, Mali, Niger). This dataset, overing more than 22,000 survey answers scattered across more than 2500 unique locations of smallholder producers’ households groups, is instrumental for researchers and policymakers working in agricultural planning and food security in the region. It integrates data from two sources, the LSMS-ISA program (link to the World Bank's site), and the RHoMIS dataset (link to RHoMIS files, RHoMIS' DOI).

The construction of the dataset involved meticulous processes, including converting production into standardized unit, yield calculation for each dataset, standardization of column names, assembly of data, extensive data cleaning, and making it a hopefully robust and reliable resource for understanding spatial yield distribution in the region.

Data Sources: The dataset comprises seven spatialized yield data sources, six of which are from the LSMS-ISA program (Mali 2014, Mali 2017, Mali 2018, Benin 2018, Burkina Faso 2018, Niger 2018) and one from the RHoMIS study (only Mali 2017 and Burkina Faso 2018 data selected).

Dataset Preparation Methods: The preparation involved integration of machine-readable files, data cleaning and finalization using Python/Jupyter Notebook. This process should ensure the accuracy and consistency of the dataset. Yield have been calculated with declared production quantities and GPS-measured plot areas. Each yield value corresponds to a single plot.

Discussion: This dataset, with its extensive data compilation, presents an invaluable resource for agricultural productivity-related studies in West Africa. However, users must navigate its complexities, including potential biases due to survey and due to UML units, and data inconsistencies. The dataset's comprehensive nature requires careful handling and validation in research applications.

Authors Contributions:

Data treatment: Eliott Baboz, Jérémy Lavarenne.

Documentation: Jérémy Lavarenne.

Funding: This project was funded by the INTEN-SAHEL TOSCA project (Centre national d’études spatiales). "123456789" was chosen randomly and is not the actual award number because there is none, but it was mandatory to put one here on Zenodo.

Changelog:

v1.0.0 : initial submission
2019 Kaggle Machine Learning & Data Science Survey
kaggle.com
Updated Dec 22, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
EK (2020). 2019 Kaggle Machine Learning & Data Science Survey [Dataset]. https://www.kaggle.com/eswarankrishnasamy/2019-kaggle-machine-learning-data-science-survey/notebooks
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 22, 2020
Dataset provided by
Kagglehttp://kaggle.com/
Authors
EK
Description
Overview Welcome to Kaggle's third annual Machine Learning and Data Science Survey ― and our second-ever survey data challenge. You can read our executive summary here.

This year, as in 2017 and 2018, we set out to conduct an industry-wide survey that presents a truly comprehensive view of the state of data science and machine learning. The survey was live for three weeks in October, and after cleaning the data we finished with 19,717 responses!

There's a lot to explore here. The results include raw numbers about who is working with data, what’s happening with machine learning in different industries, and the best ways for new data scientists to break into the field. We've published the data in as raw a format as possible without compromising anonymization, which makes it an unusual example of a survey dataset.

Challenge This year Kaggle is launching the second annual Data Science Survey Challenge, where we will be awarding a prize pool of $30,000 to notebook authors who tell a rich story about a subset of the data science and machine learning community.

In our third year running this survey, we were once again awed by the global, diverse, and dynamic nature of the data science and machine learning industry. This survey data EDA provides an overview of the industry on an aggregate scale, but it also leaves us wanting to know more about the many specific communities comprised within the survey. For that reason, we’re inviting the Kaggle community to dive deep into the survey datasets and help us tell the diverse stories of data scientists from around the world.

The challenge objective: tell a data story about a subset of the data science community represented in this survey, through a combination of both narrative text and data exploration. A “story” could be defined any number of ways, and that’s deliberate. The challenge is to deeply explore (through data) the impact, priorities, or concerns of a specific group of data science and machine learning practitioners. That group can be defined in the macro (for example: anyone who does most of their coding in Python) or the micro (for example: female data science students studying machine learning in masters programs). This is an opportunity to be creative and tell the story of a community you identify with or are passionate about!

Submissions will be evaluated on the following:

Composition - Is there a clear narrative thread to the story that’s articulated and supported by data? The subject should be well defined, well researched, and well supported through the use of data and visualizations. Originality - Does the reader learn something new through this submission? Or is the reader challenged to think about something in a new way? A great entry will be informative, thought provoking, and fresh all at the same time. Documentation - Are your code, and notebook, and additional data sources well documented so a reader can understand what you did? Are your sources clearly cited? A high quality analysis should be concise and clear at each step so the rationale is easy to follow and the process is reproducible To be valid, a submission must be contained in one notebook, made public on or before the submission deadline. Participants are free to use any datasets in addition to the Kaggle Data Science survey, but those datasets must also be publicly available on Kaggle by the deadline for a submission to be valid.

How to Participate To make a submission, complete the submission form. Only one submission will be judged per participant, so if you make multiple submissions we will review the last (most recent) entry.

No submission is necessary for the Weekly Notebook Award. To be eligible, a notebook must be public and use the 2019 Data Science Survey as a data source.

Submission deadline: 11:59PM UTC, December 2nd, 2019.

Survey Methodology This survey received 19,717 usable respondents from 171 countries and territories. If a country or territory received less than 50 respondents, we grouped them into a group named “Other” for anonymity.

We excluded respondents who were flagged by our survey system as “Spam”.

Most of our respondents were found primarily through Kaggle channels, like our email list, discussion forums and social media channels.

The survey was live from October 8th to October 28th. We allowed respondents to complete the survey at any time during that window. The median response time for those who participated in the survey was approximately 10 minutes.

Not every question was shown to every respondent. You can learn more about the different segments we used in the survey_schema.csv file. In general, respondents with more experience were asked more questions and respondents with less experience were asked less questions.

To protect the respondents’ identity, the answers to multiple choice questions have been separated into a separate data file from the open-ended responses. We do not provide a key to match up the multiple choice and free form responses. Further, the free form responses have been randomized column-wise such that the responses that appear on the same row did not necessarily come from the same survey-taker.

Multiple choice single response questions fit into individual columns whereas multiple choice multiple response questions were split into multiple columns. Text responses were encoded to protect user privacy and countries with fewer than 50 respondents were grouped into the category "other".

Data has been released under a CC 2.0 license: https://creativecommons.org/licenses/by/2.0/
P
Kitsune Network Attack Dataset Dataset
paperswithcode.com
Updated Oct 16, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yisroel Mirsky; Tomer Doitshman; Yuval Elovici; Asaf Shabtai (2023). Kitsune Network Attack Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/kitsune-network-attack-dataset
Explore at:
Dataset updated
Oct 16, 2023
Authors
Yisroel Mirsky; Tomer Doitshman; Yuval Elovici; Asaf Shabtai
Description
Kitsune Network Attack Dataset This is a collection of nine network attack datasets captured from a either an IP-based commercial surveillance system or a network full of IoT devices. Each dataset contains millions of network packets and diffrent cyber attack within it.

For each attack, you are supplied with:

A preprocessed dataset in csv format (ready for machine learning) The corresponding label vector in csv format The original network capture in pcap format (in case you want to engineer your own features)

We will now describe in detail what's in these datasets and how they were collected.

The Network Attacks We have collected a wide variety of attacks which you would find in a real network intrusion. The following is a list of the cyber attack datasets avalaible:

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F827271%2F79e305668553e521b0709a2413323c45%2Fkaggle_dataset_table.png?generation=1598461684070844&alt=media" alt="image" width="100">

For more details on the attacks themselves, please refer to our NDSS paper (citation below).

The Data Collection The following figure presents the network topologies which we used to collect the data, and the corrisponding attack vectors at which the attacks were performed. The network capture took place at point 1 and point X at the router (where a network intrusion detection system could feasibly be placed). For each dataset, clean network traffic was captured for the first 1 million packets, then the cyber attack was performed.

The Dataset Format Each preprocessed dataset csv has m rows (packets) and 115 columns (features) with no header. The 115 features were extracted using our AfterImage feature extractor, described in our NDSS paper (see below) and available in Python here. In summary, the 115 features provide a statistical snapshot of the network (hosts and behaviors) in the context of the current packet traversing the network. The AfterImage feature extractor is unique in that it can efficiently process millions of streams (network channels) in real-time, incrementally, making it suitable for handling network traffic.

Citation If you use these datasets, please cite:

@inproceedings{mirsky2018kitsune, title={Kitsune: An Ensemble of Autoencoders for Online Network Intrusion Detection}, author={Mirsky, Yisroel and Doitshman, Tomer and Elovici, Yuval and Shabtai, Asaf}, booktitle={The Network and Distributed System Security Symposium (NDSS) 2018}, year={2018} }
f
DataSheet_2_tidytcells: standardizer for TR/MH nomenclature.zip
frontiersin.figshare.com
zip
Updated Oct 25, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yuta Nagano; Benjamin Chain (2023). DataSheet_2_tidytcells: standardizer for TR/MH nomenclature.zip [Dataset]. http://doi.org/10.3389/fimmu.2023.1276106.s002
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.3389/fimmu.2023.1276106.s002
Dataset updated
Oct 25, 2023
Dataset provided by
Frontiers
Authors
Yuta Nagano; Benjamin Chain
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
T cell receptors (TR) underpin the diversity and specificity of T cell activity. As such, TR repertoire data is valuable both as an adaptive immune biomarker, and as a way to identify candidate therapeutic TR. Analysis of TR repertoires relies heavily on computational analysis, and therefore it is of vital importance that the data is standardized and computer-readable. However in practice, the usage of different abbreviations and non-standard nomenclature in different datasets makes this data pre-processing non-trivial. tidytcells is a lightweight, platform-independent Python package that provides easy-to-use standardization tools specifically designed for TR nomenclature. The software is open-sourced under the MIT license and is available to install from the Python Package Index (PyPI). At the time of publishing, tidytcells is on version 2.0.0.
d
Analyzed Data for The Impact of COVID-19 on Technical Services Units Survey...
search.dataone.org
figshare.com
Updated Nov 12, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Szkirpan, Elizabeth (2023). Analyzed Data for The Impact of COVID-19 on Technical Services Units Survey Results [Dataset]. http://doi.org/10.7910/DVN/DGBUV7
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/DGBUV7
Dataset updated
Nov 12, 2023
Dataset provided by
Harvard Dataverse
Authors
Szkirpan, Elizabeth
Description
These datasets contain cleaned data survey results from the October 2021-January 2022 survey titled "The Impact of COVID-19 on Technical Services Units". This data was gathered from a Qualtrics survey, which was anonymized to prevent Qualtrics from gathering identifiable information from respondents. These specific iterations of data reflect cleaning and standardization so that data can be analyzed using Python. Ultimately, the three files reflect the removal of survey begin/end times, other data auto-recorded by Qualtrics, blank rows, blank responses after question four (the first section of the survey), and non-United States responses. Note that State names for "What state is your library located in?" (Q36) were also standardized beginning in Impact_of_COVID_on_Tech_Services_Clean_3.csv to aid in data analysis. In this step, state abbreviations were spelled out and spelling errors were corrected.
f
Data Sheet 7_Prediction of outpatient rehabilitation patient preferences and...
frontiersin.figshare.com
docx
Updated Jan 15, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Xuehui Fan; Ruixue Ye; Yan Gao; Kaiwen Xue; Zeyu Zhang; Jing Xu; Jingpu Zhao; Jun Feng; Yulong Wang (2025). Data Sheet 7_Prediction of outpatient rehabilitation patient preferences and optimization of graded diagnosis and treatment based on XGBoost machine learning algorithm.docx [Dataset]. http://doi.org/10.3389/frai.2024.1473837.s008
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.3389/frai.2024.1473837.s008
Dataset updated
Jan 15, 2025
Dataset provided by
Frontiers
Authors
Xuehui Fan; Ruixue Ye; Yan Gao; Kaiwen Xue; Zeyu Zhang; Jing Xu; Jingpu Zhao; Jun Feng; Yulong Wang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
BackgroundThe Department of Rehabilitation Medicine is key to improving patients’ quality of life. Driven by chronic diseases and an aging population, there is a need to enhance the efficiency and resource allocation of outpatient facilities. This study aims to analyze the treatment preferences of outpatient rehabilitation patients by using data and a grading tool to establish predictive models. The goal is to improve patient visit efficiency and optimize resource allocation through these predictive models.MethodsData were collected from 38 Chinese institutions, including 4,244 patients visiting outpatient rehabilitation clinics. Data processing was conducted using Python software. The pandas library was used for data cleaning and preprocessing, involving 68 categorical and 12 continuous variables. The steps included handling missing values, data normalization, and encoding conversion. The data were divided into 80% training and 20% test sets using the Scikit-learn library to ensure model independence and prevent overfitting. Performance comparisons among XGBoost, random forest, and logistic regression were conducted using metrics, including accuracy and receiver operating characteristic (ROC) curves. The imbalanced learning library’s SMOTE technique was used to address the sample imbalance during model training. The model was optimized using a confusion matrix and feature importance analysis, and partial dependence plots (PDP) were used to analyze the key influencing factors.ResultsXGBoost achieved the highest overall accuracy of 80.21% with high precision and recall in Category 1. random forest showed a similar overall accuracy. Logistic Regression had a significantly lower accuracy, indicating difficulties with nonlinear data. The key influencing factors identified include distance to medical institutions, arrival time, length of hospital stay, and specific diseases, such as cardiovascular, pulmonary, oncological, and orthopedic conditions. The tiered diagnosis and treatment tool effectively helped doctors assess patients’ conditions and recommend suitable medical institutions based on rehabilitation grading.ConclusionThis study confirmed that ensemble learning methods, particularly XGBoost, outperform single models in classification tasks involving complex datasets. Addressing class imbalance and enhancing feature engineering can further improve model performance. Understanding patient preferences and the factors influencing medical institution selection can guide healthcare policies to optimize resource allocation, improve service quality, and enhance patient satisfaction. Tiered diagnosis and treatment tools play a crucial role in helping doctors evaluate patient conditions and make informed recommendations for appropriate medical care.
Z
Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic...
data.niaid.nih.gov
zenodo.org
Updated Dec 24, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lignos, Dimitrios G. (2022). Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6965146
Explore at:
Dataset updated
Dec 24, 2022
Dataset provided by
Ozden, Selimcan
Hartloper, Alexander R.
de Castro e Sousa, Albano
Lignos, Dimitrios G.
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials

Background

This dataset contains data from monotonic and cyclic loading experiments on structural metallic materials. The materials are primarily structural steels and one iron-based shape memory alloy is also included. Summary files are included that provide an overview of the database and data from the individual experiments is also included.

The files included in the database are outlined below and the format of the files is briefly described. Additional information regarding the formatting can be found through the post-processing library (https://github.com/ahartloper/rlmtp/tree/master/protocols).

Usage

The data is licensed through the Creative Commons Attribution 4.0 International.

If you have used our data and are publishing your work, we ask that you please reference both:

this database through its DOI, and

any publication that is associated with the experiments. See the Overall_Summary and Database_References files for the associated publication references.

Included Files

Overall_Summary_2022-08-25_v1-0-0.csv: summarises the specimen information for all experiments in the database.

Summarized_Mechanical_Props_Campaign_2022-08-25_v1-0-0.csv: summarises the average initial yield stress and average initial elastic modulus per campaign.

Unreduced_Data-#_v1-0-0.zip: contain the original (not downsampled) data

Where # is one of: 1, 2, 3, 4, 5, 6. The unreduced data is broken into separate archives because of upload limitations to Zenodo. Together they provide all the experimental data.

We recommend you un-zip all the folders and place them in one "Unreduced_Data" directory similar to the "Clean_Data"

The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.

There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the unreduced data.

The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.

Clean_Data_v1-0-0.zip: contains all the downsampled data

The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.

There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the clean data.

The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.

Database_References_v1-0-0.bib

Contains a bibtex reference for many of the experiments in the database. Corresponds to the "citekey" entry in the summary files.

File Format: Downsampled Data

These are the "LP_

The header of the first column is empty: the first column corresponds to the index of the sample point in the original (unreduced) data

Time[s]: time in seconds since the start of the test

e_true: true strain

Sigma_true: true stress in MPa

(optional) Temperature[C]: the surface temperature in degC

These data files can be easily loaded using the pandas library in Python through:

import pandas data = pandas.read_csv(data_file, index_col=0)

The data is formatted so it can be used directly in RESSPyLab (https://github.com/AlbanoCastroSousa/RESSPyLab). Note that the column names "e_true" and "Sigma_true" were kept for backwards compatibility reasons with RESSPyLab.

File Format: Unreduced Data

These are the "LP_

The first column is the index of each data point

S/No: sample number recorded by the DAQ

System Date: Date and time of sample

Time[s]: time in seconds since the start of the test

C_1_Force[kN]: load cell force

C_1_Déform1[mm]: extensometer displacement

C_1_Déplacement[mm]: cross-head displacement

Eng_Stress[MPa]: engineering stress

Eng_Strain[]: engineering strain

e_true: true strain

Sigma_true: true stress in MPa

(optional) Temperature[C]: specimen surface temperature in degC

The data can be loaded and used similarly to the downsampled data.

File Format: Overall_Summary

The overall summary file provides data on all the test specimens in the database. The columns include:

hidden_index: internal reference ID

grade: material grade

spec: specifications for the material

source: base material for the test specimen

id: internal name for the specimen

lp: load protocol

size: type of specimen (M8, M12, M20)

gage_length_mm_: unreduced section length in mm

avg_reduced_dia_mm_: average measured diameter for the reduced section in mm

avg_fractured_dia_top_mm_: average measured diameter of the top fracture surface in mm

avg_fractured_dia_bot_mm_: average measured diameter of the bottom fracture surface in mm

fy_n_mpa_: nominal yield stress

fu_n_mpa_: nominal ultimate stress

t_a_deg_c_: ambient temperature in degC

date: date of test

investigator: person(s) who conducted the test

location: laboratory where test was conducted

machine: setup used to conduct test

pid_force_k_p, pid_force_t_i, pid_force_t_d: PID parameters for force control

pid_disp_k_p, pid_disp_t_i, pid_disp_t_d: PID parameters for displacement control

pid_extenso_k_p, pid_extenso_t_i, pid_extenso_t_d: PID parameters for extensometer control

citekey: reference corresponding to the Database_References.bib file

yield_stress_mpa_: computed yield stress in MPa

elastic_modulus_mpa_: computed elastic modulus in MPa

fracture_strain: computed average true strain across the fracture surface

c,si,mn,p,s,n,cu,mo,ni,cr,v,nb,ti,al,b,zr,sn,ca,h,fe: chemical compositions in units of %mass

file: file name of corresponding clean (downsampled) stress-strain data

File Format: Summarized_Mechanical_Props_Campaign

Meant to be loaded in Python as a pandas DataFrame with multi-indexing, e.g.,

tab1 = pd.read_csv('Summarized_Mechanical_Props_Campaign_' + date + version + '.csv', index_col=[0, 1, 2, 3], skipinitialspace=True, header=[0, 1], keep_default_na=False, na_values='')

citekey: reference in "Campaign_References.bib".

Grade: material grade.

Spec.: specifications (e.g., J2+N).

Yield Stress [MPa]: initial yield stress in MPa

size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign

Elastic Modulus [MPa]: initial elastic modulus in MPa

size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign

Caveats

The files in the following directories were tested before the protocol was established. Therefore, only the true stress-strain is available for each:

A500

A992_Gr50

BCP325

BCR295

HYP400

S460NL

S690QL/25mm

S355J2_Plates/S355J2_N_25mm and S355J2_N_50mm
h
openlegaldata
huggingface.co
Updated Feb 14, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Harshil (2020). openlegaldata [Dataset]. https://huggingface.co/datasets/harshildarji/openlegaldata
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 14, 2020
Authors
Harshil
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Clean Open Legal Data

Overview | Dataset Structure | Key Fields | Example Entry | Using the Dataset with Python | License Overview

This dataset is a comprehensive collection of open legal case records in JSONL format. It comprises 251,038 cases extracted and processed from the Open Legal Data dump (as of 2022-10-18). The dataset is designed for legal research, data science, and natural language processing applications. While… See the full description on the dataset page: https://huggingface.co/datasets/harshildarji/openlegaldata.
MME-only models trained with clean data for JAMES paper "Machine-learned...
zenodo.org
tar
Updated Nov 8, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ryan Lagerquist; Ryan Lagerquist (2023). MME-only models trained with clean data for JAMES paper "Machine-learned uncertainty quantification is not magic" [Dataset]. http://doi.org/10.5281/zenodo.10084394
Explore at:
tarAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.10084394
Dataset updated
Nov 8, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Ryan Lagerquist; Ryan Lagerquist
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This tar file contains all 100 trained models in the MME-only ensemble from Experiment 1 (i.e., those trained with clean data, not with lightly perturbed data). To read one of the models into Python, you can use the method neural_net.read_model in the ml4rt library.

Facebook

Twitter

Click to copy link

Link copied

Cite

Köhler, Juliane (2022). Data Cleaning, Translation & Split of the Dataset for the Automatic Classification of Documents for the Classification System for the Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6957841

Data Cleaning, Translation & Split of the Dataset for the Automatic Classification of Documents for the Classification System for the Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft

Explore at:

Dataset updated

Aug 8, 2022

Dataset authored and provided by

Köhler, Juliane

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Cleaned_Dataset.csv – The combined CSV files of all scraped documents from DABI, e-LiS, o-bib and Springer.

Data_Cleaning.ipynb – The Jupyter Notebook with python code for the analysis and cleaning of the original dataset.

ger_train.csv – The German training set as CSV file.

ger_validation.csv – The German validation set as CSV file.

en_test.csv – The English test set as CSV file.

en_train.csv – The English training set as CSV file.

en_validation.csv – The English validation set as CSV file.

splitting.py – The python code for splitting a dataset into train, test and validation set.

DataSetTrans_de.csv – The final German dataset as a CSV file.

DataSetTrans_en.csv – The final English dataset as a CSV file.

translation.py – The python code for translating the cleaned dataset.

Clear search

Close search

Google apps

Main menu

Data Cleaning, Translation & Split of the Dataset for the Automatic...

Pre-Processed Power Grid Frequency Time Series

Data from: ManyTypes4Py: A benchmark Python Dataset for Machine...

Data from: Decoding Wayfinding: Analyzing Wayfinding Processes in the...

Folder Structure

Setting Up the Environment

Subfolders

1. Data_4_IJGIS

2. results_[DateTime] (e.g., results_20240906_15_00_13)

Python Files

1. helper_functions.py

2. create_sanity_plots.py

3. overlapping_sliding_window_loop.py

4. gaze_features.py & imu_features.py

5. training_prediction.py

a. Data Preparation (corresponding to Section 5.1.1 of the paper)

b. Training/Validation/Test Split

c. Machine and Deep Learning Experiments

d. Inference (Monitoring Part)

6. sequence_analysis.py

Licenses

Data from: S1 Data -

wikipedia

Europe PMC Full Text Corpus

wikihow

Office Light Level

Data to Estimate Water Use Associated with Oil and Gas Development within...

Loan Prediction Problem Dataset

Dataset

Contents

Spatialized sorghum & millet yields in West Africa, derived from LSMS-ISA...

2019 Kaggle Machine Learning & Data Science Survey

Kitsune Network Attack Dataset Dataset

DataSheet_2_tidytcells: standardizer for TR/MH nomenclature.zip

Analyzed Data for The Impact of COVID-19 on Technical Services Units Survey...

Data Sheet 7_Prediction of outpatient rehabilitation patient preferences and...

Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic...

openlegaldata

MME-only models trained with clean data for JAMES paper "Machine-learned...

Data Cleaning, Translation & Split of the Dataset for the Automatic Classification of Documents for the Classification System for the Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft