100+ datasets found

Level Crossing Warning Bell (LCWB) Dataset
data.niaid.nih.gov
zenodo.org
Updated May 20, 2023
Cite
Flammini, Francesco (2023). Level Crossing Warning Bell (LCWB) Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7945411
Dataset updated
May 20, 2023
Dataset provided by
De Donato, Lorenzo
Flammini, Francesco
Marrone, Stefano
Vittorini, Valeria
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Acknowledgement

These data are a product of a research activity conducted in the context of the RAILS (Roadmaps for AI integration in the raiL Sector) project which has received funding from the Shift2Rail Joint Undertaking under the European Union’s Horizon 2020 research and innovation programme under grant agreement n. 881782 Rails. The JU receives support from the European Union’s Horizon 2020 research and innovation program and the Shift2Rail JU members other than the Union.

Disclaimers

The information and views set out in this document are those of the author(s) and do not necessarily reflect the official opinion of Shift2Rail Joint Undertaking. The JU does not guarantee the accuracy of the data included in this document. Neither the JU nor any person acting on the JU’s behalf may be held responsible for the use which may be made of the information contained therein.

This "dataset" has been created for scientific purposes only - and WITHOUT ANY COMMERCIAL purposes - to study the potentials of Deep Learning and Transfer Learning approaches. We are NOT re-distributing any video or audio; our files just contain pointers and indications needed to reproduce our study. The authors DO NOT ASSUME any responsibility for the use that other researchers or users will make of these data.

General Info

The CSV files contained in this folder (and subfolders) compose the Level Crossing (LC) Warning Bell (WB) Dataset.

When using any of these data, please mention:

De Donato, L., Marrone, S., Flammini, F., Sansone, C., Vittorini, V., Nardone, R., Mazzariello, C., and Bernaudine, F., "Intelligent Detection of Warning Bells at Level Crossings through Deep Transfer Learning for Smarter Railway Maintenance", Engineering Applications of Artificial Intelligence, Elsevier, 2023

Content of the folder

This folder contains the following subfolders and files.

"Data Files" contains all the CSV files related to the data composing the LCWB Dataset:

WB_data.csv (WB_labels.csv): representing data of the "Warning Bell (WB)" class;

NA_data.csv (NA_labels.csv): representing data of the "No Alarm (NA)" class;

GE_data.csv (GE_labels.csv): representing data of the "GEneric alarm (GE)" class.

"LCWB Dataset" contains all the JSON files that show how the aforementioned data have been distributed among training, validation, and test sets:

IT_Distribution.json and UK_distribution.json respectively show how Italian (IT) WBs and British (UK) WBs have been distributed;

The same goes for NA_Distribution.json and GE_Distribution.json, which show the distribution of NA and GE data respectively;

DatasetDistribution.json simply combines the content of the aforementioned JSON files into a single file that can be used to obtain exactly the same dataset we adopted in our analyses.

"Additional Files" contains some CSV files related to data we adopted to further test the deep neural network leveraged in the aforementioned manuscript:

FR_DE_data.csv (FR_DE_labels.csv): representing data that have been used to test the generalisation performances of the network we exploited on LC WBs related to countries that were not considered in the training phase.

Noises_data.csv (Noises_labels.csv): representing the noises that were considered to study the behaviour of the network in case of noisy data.

CSV Files Structure

Each "XX_labels.csv" file contains, for each entry, the following information:

The identifier ("index") of the sub-class (which is not relevant in our case);

The code-name ("mid") of the class, which is used in the "XX_data.csv" file to indicate the sub-class of a specific audio;

The extended name of the class ("display_name").

It is worth mentioning that sub-classes do not have a specific purpose in our task. They have been kept to preserve, as far as possible, the structure of the "class_labels_indices.csv" file provided by AudioSet. The same applies to the "XX_data.csv" files, which have roughly the same structure as the "Evaluation", "Balanced train", and "Unbalanced train" AudioSet CSV files.

Indeed, each "XX_data.csv" file contains, for each entry, the following information:

ID: the identifier of the entry;

YTID: the YouTube identifier of the video;

start_seconds and end_seconds: which delimit the portion of audio (extracted from YTID) which is of interest for this task;

positive_labels: the label(s) associated with the audio.
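
As a minimal sketch of how an "XX_data.csv" file with the fields listed above might be read, the following assumes a header row and comma-separated values. The sample rows, the YouTube identifiers, and the label codes are invented for illustration; the real files may follow AudioSet's layout more closely.

```python
import csv
import io

# Hypothetical sample in the layout described above; IDs and labels are invented.
SAMPLE = """ID,YTID,start_seconds,end_seconds,positive_labels
0,abcdefghijk,10.0,20.0,"/m/wb_it"
1,lmnopqrstuv,0.0,7.5,"/m/wb_uk"
"""

def load_clips(text):
    """Parse rows into dicts with numeric start/end times and a list of labels."""
    clips = []
    for row in csv.DictReader(io.StringIO(text), skipinitialspace=True):
        start = float(row["start_seconds"])
        end = float(row["end_seconds"])
        clips.append({
            "id": row["ID"],
            "ytid": row["YTID"],
            "start": start,
            "end": end,
            "duration": end - start,  # length of the audio portion of interest
            "labels": row["positive_labels"].split(","),
        })
    return clips

clips = load_clips(SAMPLE)
```

Each resulting entry only points at a portion of a YouTube video, which matches the statement above that the files contain pointers rather than audio.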

Credits

The structure of the CSV files contained in this dataset, as well as part of their content, was inspired by the CSV files composing the AudioSet dataset which is made available by Google Inc. under a Creative Commons Attribution 4.0 International (CC BY 4.0) license, while its ontology is available under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Particularly, from AudioSet, we retrieved:

The structure of the CSV files as discussed above.

Data contained in GE_data.csv (which is a minimal portion of data made available by AudioSet) as well as the related 19 classes (in GE_labels.csv) which we selected among the hundreds of classes included in the AudioSet ontology.

Pointers contained in "XX_data.csv" files other than GE_data.csv have been retrieved manually from scratch. The related "XX_labels.csv" files have then been created accordingly.

More about downloading the AudioSet dataset can be found here.
economic-for-llama2-ft-just-train-csv
huggingface.co
Updated Apr 25, 2024
Cite
Nguyen Ngoc Dat (2024). economic-for-llama2-ft-just-train-csv [Dataset]. https://huggingface.co/datasets/dchatca/economic-for-llama2-ft-just-train-csv
Dataset updated
Apr 25, 2024
Authors
Nguyen Ngoc Dat
Description
dchatca/economic-for-llama2-ft-just-train-csv dataset hosted on Hugging Face and contributed by the HF Datasets community
LLM: 7 prompt training dataset
kaggle.com
Updated Nov 15, 2023
Cite
Carl McBride Ellis (2023). LLM: 7 prompt training dataset [Dataset]. https://www.kaggle.com/datasets/carlmcbrideellis/llm-7-prompt-training-dataset
Dataset updated
Nov 15, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Carl McBride Ellis
License
https://cdla.io/sharing-1-0/
Description
Version 4: Adding the data from "LLM-generated essay using PaLM from Google Gen-AI" kindly generated by Kingki19 / Muhammad Rizqi.
File: train_essays_RDizzl3_seven_v2.csv
Human texts: 14247 LLM texts: 3004

See also: a new dataset of an additional 4900 LLM generated texts: LLM: Mistral-7B Instruct texts

Version 3: "**The RDizzl3 Seven**"
File: train_essays_RDizzl3_seven_v1.csv

"Car-free cities"

"Does the electoral college work?"

"Exploring Venus"

"The Face on Mars"

"Facial action coding system"

"A Cowboy Who Rode the Waves"

"Driverless cars"

How this dataset was made: see the notebook "LLM: Make 7 prompt train dataset"

Version 2: (train_essays_7_prompts_v2.csv) This dataset is composed of 13,712 human texts and 1638 AI-LLM generated texts originating from 7 of the PERSUADE 2.0 corpus prompts.

Namely:

"Car-free cities"

"Does the electoral college work?"

"Exploring Venus"

"The Face on Mars"

"Facial action coding system"

"Seeking multiple opinions"

"Phones and driving"

This dataset is a derivative of the datasets

LLM Generated Essays for the Detect AI Comp! by Radek Osmulski

persuade corpus 2.0 provided by Nicholas Broad

daigt data - llama 70b and falcon180b by Nicholas Broad

Hello, Claude! 1000 essays from Anthropic... by Darragh

as well as the original competition training dataset

Version 1: This dataset is composed of 13,712 human texts and 1165 AI-LLM generated texts originating from 7 of the PERSUADE 2.0 corpus prompts.
Datasets
figshare.com
zip
Updated May 31, 2023
Cite
Bastian Eichenberger; YinXiu Zhan (2023). Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.12958037.v1
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.12958037.v1
Dataset updated
May 31, 2023
Dataset provided by
figshare
Authors
Bastian Eichenberger; YinXiu Zhan
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
The benchmarking datasets used for deepBlink. The npz files contain train/valid/test splits inside and can be used directly. The files belong to the following challenges / classes:

- ISBI Particle tracking challenge: microtubule, vesicle, receptor
- Custom synthetic (based on http://smal.ws): particle
- Custom fixed cell: smfish
- Custom live cell: suntag

The csv files indicate which image in the test splits corresponds to which original image, SNR, and density.
Network traffic and code for machine learning classification
data.mendeley.com
Updated Feb 20, 2020
Cite
Víctor Labayen (2020). Network traffic and code for machine learning classification [Dataset]. http://doi.org/10.17632/5pmnkshffm.2
Unique identifier
https://doi.org/10.17632/5pmnkshffm.2
Dataset updated
Feb 20, 2020
Authors
Víctor Labayen
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset is a set of network traffic traces in pcap/csv format captured from a single user. The traffic is classified in 5 different activities (Video, Bulk, Idle, Web, and Interactive) and the label is shown in the filename. There is also a file (mapping.csv) with the mapping of the host's IP address, the csv/pcap filename and the activity label.

Activities:

Interactive: applications that perform real-time interactions in order to provide a suitable user experience, such as editing a file in Google Docs and remote CLI sessions over SSH.

Bulk data transfer: applications that transfer large files over the network. Some examples are SCP/FTP applications and direct downloads of large files from web servers such as Mediafire, Dropbox, or the university repository.

Web browsing: contains all the traffic generated while searching and consuming different web pages. Examples of those pages are several blogs and news sites and the university's Moodle.

Video playback: contains traffic from applications that consume video via streaming or pseudo-streaming. The best-known servers used are Twitch and YouTube, but the university online classroom has also been used.

Idle behaviour: composed of the background traffic generated by the user's computer when the user is idle. This traffic was captured with every application closed and with some pages open, such as Google Docs, YouTube, and several web pages, but always without user interaction.

The capture is performed in a network probe, attached to the router that forwards the user network traffic, using a SPAN port. The traffic is stored in pcap format with all the packet payload. In the csv file, every non TCP/UDP packet is filtered out, as well as every packet with no payload. The fields in the csv files are the following (one line per packet): Timestamp, protocol, payload size, IP address source and destination, UDP/TCP port source and destination. The fields are also included as a header in every csv file.
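
The per-packet layout described above can be summarised with a few lines of Python. This is only a sketch: the header names and the three packets below are assumptions made for illustration, since the description lists the fields but not their exact column names.

```python
import csv
import io
from collections import defaultdict

# Hypothetical header names and rows; the real files carry their own header line.
SAMPLE = """timestamp,protocol,payload_size,ip_src,ip_dst,port_src,port_dst
0.00,TCP,1200,10.0.0.2,93.184.216.34,50000,443
0.50,UDP,300,10.0.0.2,8.8.8.8,50001,53
2.00,TCP,800,93.184.216.34,10.0.0.2,443,50000
"""

def summarize(text):
    """Return payload bytes per protocol and the time span covered by the trace."""
    bytes_by_proto = defaultdict(int)
    timestamps = []
    for row in csv.DictReader(io.StringIO(text)):
        bytes_by_proto[row["protocol"]] += int(row["payload_size"])
        timestamps.append(float(row["timestamp"]))
    duration = max(timestamps) - min(timestamps)
    return dict(bytes_by_proto), duration

totals, duration = summarize(SAMPLE)
```

Because packets without payload are already filtered out of the csv files, every row contributes a non-zero payload size to these totals.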

The amount of data is stated as follows:

Bulk: 19 traces, 3599 s of total duration, 8704 MBytes of pcap files

Video: 23 traces, 4496 s, 1405 MBytes

Web: 23 traces, 4203 s, 148 MBytes

Interactive: 42 traces, 8934 s, 30.5 MBytes

Idle: 52 traces, 6341 s, 0.69 MBytes
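
To put these amounts in perspective, the figures above can be turned into mean data rates per activity. The sketch below only restates the numbers given in the description; the variable and function names are illustrative.

```python
# (traces, total seconds, pcap megabytes) copied from the amounts listed above.
AMOUNTS = {
    "Bulk": (19, 3599, 8704),
    "Video": (23, 4496, 1405),
    "Web": (23, 4203, 148),
    "Interactive": (42, 8934, 30.5),
    "Idle": (52, 6341, 0.69),
}

def mean_rate_mb_per_s(seconds, megabytes):
    """Average pcap volume per second of capture."""
    return megabytes / seconds

rates = {name: mean_rate_mb_per_s(s, mb) for name, (_, s, mb) in AMOUNTS.items()}
total_traces = sum(t for t, _, _ in AMOUNTS.values())
total_seconds = sum(s for _, s, _ in AMOUNTS.values())
```

The classes differ by several orders of magnitude in volume, which is worth keeping in mind when balancing training data across activities.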

The code of our machine learning approach is also included. There is a README.txt file with the documentation of how to use the code.
Disease Prediction Using Machine Learning
dataandsons.com
csv, zip
Updated Oct 31, 2022
Cite
test test (2022). Disease Prediction Using Machine Learning [Dataset]. https://www.dataandsons.com/categories/machine-learning/disease-prediction-using-machine-learning
Available download formats: csv, zip
Dataset updated
Oct 31, 2022
Dataset provided by
Authors
test test
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
About this Dataset

This dataset lets you put your existing knowledge to use. It has 132 parameters from which 42 different types of diseases can be predicted. The dataset consists of 2 CSV files: one for training and one for testing your model. Each CSV file has 133 columns: 132 are symptoms that a person experiences, and the last column is the prognosis. The symptoms are mapped to 42 diseases, so the task is to classify each set of symptoms into one of them. You are expected to train your model on the training data and test it on the testing data.
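
A minimal sketch of separating the symptom columns from the prognosis column is shown below. It uses a tiny stand-in with 3 symptom columns instead of 132; the column names, the disease names, and the assumption that symptoms are encoded as 0/1 integers are all illustrative and not confirmed by the description.

```python
import csv
import io

# Tiny stand-in with 3 symptom columns instead of 132; all names are invented.
SAMPLE = """itching,fever,cough,prognosis
1,0,0,Allergy
0,1,1,Flu
"""

def split_features_label(text):
    """Treat the last column as the label and every other column as a feature."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    feature_names, label_name = header[:-1], header[-1]
    X, y = [], []
    for row in reader:
        X.append([int(v) for v in row[:-1]])  # assumes integer-coded symptoms
        y.append(row[-1])
    return feature_names, label_name, X, y

feature_names, label_name, X, y = split_features_label(SAMPLE)
```

The same function applies unchanged to the training and the testing file, as both are described as having the same 133-column layout.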

Category

Machine Learning

Keywords

medicine,disease,Healthcare,ML,Machine Learning

Row Count

4962

Price

$109.00
Data Cleaning, Translation & Split of the Dataset for the Automatic...
zenodo.org
bin, csv +1
Updated Apr 24, 2025
Cite
Juliane Köhler (2025). Data Cleaning, Translation & Split of the Dataset for the Automatic Classification of Documents for the Classification System for the Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft [Dataset]. http://doi.org/10.5281/zenodo.6957842
Available download formats: text/x-python, csv, bin
Unique identifier
https://doi.org/10.5281/zenodo.6957842
Dataset updated
Apr 24, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Juliane Köhler
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Cleaned_Dataset.csv – The combined CSV files of all scraped documents from DABI, e-LiS, o-bib and Springer.

Data_Cleaning.ipynb – The Jupyter Notebook with python code for the analysis and cleaning of the original dataset.

ger_train.csv – The German training set as CSV file.

ger_validation.csv – The German validation set as CSV file.

en_test.csv – The English test set as CSV file.

en_train.csv – The English training set as CSV file.

en_validation.csv – The English validation set as CSV file.

splitting.py – The python code for splitting a dataset into train, test and validation set.

DataSetTrans_de.csv – The final German dataset as a CSV file.

DataSetTrans_en.csv – The final English dataset as a CSV file.

translation.py – The python code for translating the cleaned dataset.
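
For orientation, a train/validation/test split of the kind splitting.py is described as performing could look like the sketch below. This is not the author's script: the function name, the 80/10/10 ratios, and the fixed seed are assumptions chosen for illustration.

```python
import random

def split_dataset(rows, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle rows and cut them into train, validation, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test

train, val, test = split_dataset(range(100))
```

Every row ends up in exactly one of the three lists, so no document is shared between training and evaluation.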
UCI and OpenML Data Sets for Ordinal Quantification
data.niaid.nih.gov
zenodo.org
Updated Jul 25, 2023
Cite
UCI and OpenML Data Sets for Ordinal Quantification [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8177301
Dataset updated
Jul 25, 2023
Dataset provided by
Moreo, Alejandro
Bunse, Mirko
Sebastiani, Fabrizio
Senz, Martin
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.

With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.

We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.

Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
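
The two protocols can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation (which is written in Julia): label distributions are drawn uniformly from the probability simplex by normalising exponential draws, and smoothness is scored with a sum of squared second differences, which is an assumed measure because the description does not define one.

```python
import random

def sample_prevalences(n_classes, rng):
    """Draw one label distribution uniformly from the probability simplex."""
    draws = [rng.expovariate(1.0) for _ in range(n_classes)]
    total = sum(draws)
    return [d / total for d in draws]

def roughness(p):
    """Assumed smoothness score: squared second differences (lower is smoother)."""
    return sum((p[i - 1] - 2 * p[i] + p[i + 1]) ** 2 for i in range(1, len(p) - 1))

def app_oq(n_samples, n_classes, keep_frac=0.2, seed=0):
    """APP draws all samples; the APP-OQ variant keeps only the smoothest fraction."""
    rng = random.Random(seed)
    samples = [sample_prevalences(n_classes, rng) for _ in range(n_samples)]
    samples.sort(key=roughness)
    return samples[: int(n_samples * keep_frac)]

smooth = app_oq(n_samples=1000, n_classes=5)
```

Keeping only the smoothest samples favours distributions in which neighbouring classes have similar prevalence, which is the property the description associates with ordinal tasks.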

Usage

You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.

Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.

Data Extraction: In your terminal, you can call either

make

(recommended), or

julia --project="." --eval "using Pkg; Pkg.instantiate()"

julia --project="." extract-oq.jl

Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.

Further Reading

Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
Data from: Data on the Construction Processes of Regression Models
jstagedata.jst.go.jp
jpeg
Updated Jul 27, 2023
Cite
Taichi Kimura; Riko Iwamoto; Mikio Yoshida; Tatsuya Takahashi; Shuji Sasabe; Yoshiyuki Shirakawa (2023). Data on the Construction Processes of Regression Models [Dataset]. http://doi.org/10.50931/data.kona.22180318.v2
Available download formats: jpeg
Unique identifier
https://doi.org/10.50931/data.kona.22180318.v2
Dataset updated
Jul 27, 2023
Dataset provided by
Hosokawa Powder Technology Foundation
Authors
Taichi Kimura; Riko Iwamoto; Mikio Yoshida; Tatsuya Takahashi; Shuji Sasabe; Yoshiyuki Shirakawa
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This CSV dataset (numbered 1–8) demonstrates the construction processes of the regression models using machine learning methods, which are used to plot Fig. 2–7. The CSV file of 1.LSM_R^2 (plotting Fig. 2) shows the data of the relationship between estimated values and actual values when the least-squares method was used for a model construction. In the CSV file 2.PCR_R^2 (plotting Fig. 3), the number of the principal components was varied from 1 to 5 during the construction of a model using the principal component regression. The data in the CSV file 3.SVR_R^2 (plotting Fig. 4) is the result of the construction using the support vector regression. The hyperparameters were decided by the comprehensive combination from the listed candidates by exploring hyperparameters with maximum R^2 values. When a deep neural network was applied to the construction of a regression model, NNeur., NH.L. and NL.T. were varied. The CSV file 4.DNN_HL (plotting Fig. 5a)) shows the changes in the relationship between estimated values and actual values at each NH.L.. Similarly, changes in the relationships between estimated values and actual values in the case NNeur. or NL.T. were varied in the CSV files 5.DNN_ Neur (plotting Fig. 5b)) and 6.DNN_LT (plotting Fig. 5c)). The data in the CSV file 7.DNN_R^2 (plotting Fig. 6) is the result using optimal NNeur., NH.L. and NL.T.. In the CSV file 8.R^2 (plotting Fig. 7), the validity of each machine learning method was compared by showing the optimal results for each method.

Experimental conditions

Supply volume of the raw material: 25–125 mL

Addition rate of TiO2: 5.0–15.0 wt%

Operation time: 1–15 min

Rotation speed: 2,200–5,700 min⁻¹

Temperature: 295–319 K

Nomenclature

NNeur.: the number of neurons

NH.L.: the number of hidden layers

NL.T.: the number of learning times
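
The R^2 values these files report are presumably the usual coefficient of determination between actual and estimated values; the description does not state the formula, so that reading is an assumption. Under it, the score can be computed as in this sketch, where the four value pairs are made up for illustration.

```python
def r_squared(actual, estimated):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - e) ** 2 for a, e in zip(actual, estimated))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Invented example values, not taken from the dataset.
score = r_squared([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```

A score close to 1 means the estimated values track the actual values closely, which is how the files compare the regression methods.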
Ransomware and user samples for training and validating ML models
data.mendeley.com
Updated Sep 17, 2021
Cite
Ransomware and user samples for training and validating ML models [Dataset]. https://data.mendeley.com/datasets/yhg5wk39kf/2
Unique identifier
https://doi.org/10.17632/yhg5wk39kf.2
Dataset updated
Sep 17, 2021
Authors
Eduardo Berrueta
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Ransomware has been considered a significant threat for most enterprises for the past few years. In scenarios wherein users can access all files on a shared server, one infected host is capable of locking the access to all shared files. In the article related to this repository, we detect ransomware infection based on file-sharing traffic analysis, even in the case of encrypted traffic. We compare three machine learning models and choose the best for validation. We train and test the detection model using more than 70 ransomware binaries from 26 different families and more than 2500 h of ‘not infected’ traffic from real users. The results reveal that the proposed tool can detect all ransomware binaries, including those not used in the training phase (zero-days). This paper provides a validation of the algorithm by studying the false positive rate and the amount of information from user files that the ransomware could encrypt before being detected.

This dataset directory contains the 'infected' and 'not infected' samples and the models used for each T configuration, each in a separate folder.

The folders are named NxSy, where x is the number of 1-second intervals per sample and y is the sliding step in seconds.
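
The naming scheme can be read as a sliding-window configuration. The sketch below parses a folder name and lists the window start offsets it would imply; the helper names are hypothetical, and the assumption that windows are taken at every step over a trace of known length is an interpretation of the description rather than the authors' code.

```python
import re

def parse_config(name):
    """Split a folder name such as 'N30S30' into (N, S)."""
    match = re.fullmatch(r"N(\d+)S(\d+)", name)
    if match is None:
        raise ValueError(f"unexpected folder name: {name!r}")
    return int(match.group(1)), int(match.group(2))

def window_starts(total_seconds, n, s):
    """Start offsets (in seconds) of every full N-second window taken with step S."""
    return list(range(0, total_seconds - n + 1, s))

n, s = parse_config("N30S30")
starts = window_starts(90, n, s)

# Illustrative values only: a step shorter than the window makes windows overlap.
overlapping = window_starts(60, 30, 10)
```

With equal window length and step, as in N10S10 or N30S30, consecutive samples do not overlap.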

Each folder (for example N10S10/) contains:

- tree.py -> Python script with the Tree model.
- ensemble.json -> JSON file with the information about the Ensemble model.
- NN_XhiddenLayer.json -> JSON file with the information about the NN model with X hidden layers (1, 2 or 3).
- N10S10.csv -> All samples used for training each model in this folder, in csv format for use in the bigML application.
- zeroDays.csv -> All zero-day samples used for testing each model in this folder, in csv format for use in the bigML application.
- userSamples_test -> All samples used for validating each model in this folder, in csv format for use in the bigML application.
- userSamples_train -> User samples used for training the models.
- ransomware_train -> Ransomware samples used for training the models.
- scaler.scaler -> Standard Scaler from a Python library, used to scale the samples.
- zeroDays_notFiltered -> Folder with the zero-day samples.

In the case of the N30S30 folder, there is an additional folder (SMBv2SMBv3NFS) with the samples extracted from the SMBv2, SMBv3 and NFS traffic traces. There are more binaries than the ones presented in the article, because some of them are not "unseen" binaries (their families are present in the training set).

The files containing samples (NxSy.csv, zeroDays.csv and userSamples_test.csv) are structured as follows:

- Each line is one sample.
- Each sample has 3*T features and the label (1 if it is an 'infected' sample and 0 if it is not).
- The features are separated by ',' because it is a csv file.
- The last column is the label of the sample.
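
A sample file with this structure could be loaded as in the sketch below. It assumes the files have no header row, which the description does not state, and the two rows are invented with T = 2 (so 6 features) purely for illustration.

```python
import csv
import io

# Made-up rows with T = 2 (3*T = 6 features) followed by the label.
SAMPLE = """0.1,0.2,0.3,0.4,0.5,0.6,1
0.0,0.0,0.1,0.0,0.2,0.0,0
"""

def load_samples(text):
    """Split each line into its feature values and the trailing 0/1 label."""
    features, labels = [], []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        *values, label = row
        if len(values) % 3 != 0:
            raise ValueError("expected 3*T feature columns")
        features.append([float(v) for v in values])
        labels.append(int(label))
    return features, labels

features, labels = load_samples(SAMPLE)
```

The divisibility check is a cheap guard against reading a file whose last column is not the label.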

Additionally, we have placed two pcap files in the root directory. These are the traces used to compare both versions of SMB.
csv training
kaggle.com
Updated May 16, 2023
Cite
Tờ Rung (2023). csv training [Dataset]. https://www.kaggle.com/datasets/btrung/csv-training/versions/1
Dataset updated
May 16, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Tờ Rung
Description
Dataset

This dataset was created by Tờ Rung

📊 Yahoo Answers 10 categories for NLP CSV
kaggle.com
Updated Apr 7, 2023
Cite
Yassir Acharki (2023). 📊 Yahoo Answers 10 categories for NLP CSV [Dataset]. http://doi.org/10.34740/kaggle/dsv/5339321
Unique identifier
https://doi.org/10.34740/kaggle/dsv/5339321
Dataset updated
Apr 7, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Yassir Acharki
License
ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
Description
The Yahoo! Answers topic classification dataset is constructed using the 10 largest main categories. Each class contains 140,000 training samples and 6,000 testing samples. Therefore, the total number of training samples is 1,400,000 and the total number of testing samples is 60,000. From all the answers and other meta-information, we only used the best answer content and the main category information.

The file classes.txt contains a list of classes corresponding to each label.

The files train.csv and test.csv contain the training and testing samples, respectively, as comma-separated values. There are 4 columns in them, corresponding to class index (1 to 10), question title, question content and best answer. The text fields are escaped using double quotes ("), and any internal double quote is escaped by 2 double quotes (""). New lines are escaped by a backslash followed by an "n" character, that is "\n".
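
The escaping rules above map directly onto Python's standard csv module, which handles doubled double quotes by default; only the backslash-n sequences need to be turned back into real newlines afterwards. The single row below is invented to exercise both rules, and the dictionary keys are illustrative names for the four columns.

```python
import csv
import io

# One made-up row: class index, question title, question content, best answer.
SAMPLE = '"5","What is ""CSV""?","Line one\\nLine two","Comma-separated values."\n'

def read_rows(text):
    """Parse rows and restore newlines that were stored as backslash + 'n'."""
    rows = []
    for class_index, title, content, answer in csv.reader(io.StringIO(text)):
        rows.append({
            "class_index": int(class_index),
            "title": title,
            "content": content.replace("\\n", "\n"),
            "answer": answer.replace("\\n", "\n"),
        })
    return rows

rows = read_rows(SAMPLE)
```

The class index can then be looked up in classes.txt, which lists the class corresponding to each label.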
Dataset of the paper: "How do Hugging Face Models Document Datasets, Bias,...
zenodo.org
explore.openaire.eu
+1more
zip
Updated Jan 16, 2024
Cite
Federica Pepe; Vittoria Nardone; Antonio Mastropaolo; Gerardo Canfora; Gabriele BAVOTA; Massimiliano Di Penta (2024). Dataset of the paper: "How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study" [Dataset]. http://doi.org/10.5281/zenodo.10058142
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.10058142
Dataset updated
Jan 16, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Federica Pepe; Vittoria Nardone; Antonio Mastropaolo; Gerardo Canfora; Gabriele BAVOTA; Massimiliano Di Penta
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This replication package contains datasets and scripts related to the paper: "*How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study*"

## Root directory
- `statistics.r`: R script used to compute the correlation between usage and downloads, and the RQ1/RQ2 inter-rater agreements
- `modelsInfo.zip`: zip file containing all the downloaded model cards (in JSON format)
- `script`: directory containing all the scripts used to collect and process data. For further details, see README file inside the script directory.

## Dataset
- `Dataset/Dataset_HF-models-list.csv`: list of HF models analyzed
- `Dataset/Dataset_github-prj-list.txt`: list of GitHub projects using the *transformers* library
- `Dataset/Dataset_github-Prj_model-Used.csv`: contains usage pairs: project, model
- `Dataset/Dataset_prj-num-models-reused.csv`: number of models used by each GitHub project
- `Dataset/Dataset_model-download_num-prj_correlation.csv` contains, for each model used by GitHub projects: the name, the task, the number of reusing projects, and the number of downloads

## RQ1
- `RQ1/RQ1_dataset-list.txt`: list of HF datasets
- `RQ1/RQ1_datasetSample.csv`: sample set of models used for the manual analysis of datasets
- `RQ1/RQ1_analyzeDatasetTags.py`: Python script to analyze model tags for the presence of datasets. It requires unzipping `modelsInfo.zip` into a directory with the same name (`modelsInfo`) at the root of the replication package folder. It writes its output to stdout; redirect it to a file to be analyzed by the `RQ2/countDataset.py` script
- `RQ1/RQ1_countDataset.py`: given the output of `RQ2/analyzeDatasetTags.py` (passed as argument) produces, for each model, a list of Booleans indicating whether (i) the model only declares HF datasets, (ii) the model only declares external datasets, (iii) the model declares both, and (iv) the model is part of the sample for the manual analysis
- `RQ1/RQ1_datasetTags.csv`: output of `RQ2/analyzeDatasetTags.py`
- `RQ1/RQ1_dataset_usage_count.csv`: output of `RQ2/countDataset.py`

## RQ2
- `RQ2/tableBias.pdf`: table detailing the number of occurrences of different types of bias by model Task
- `RQ2/RQ2_bias_classification_sheet.csv`: results of the manual labeling
- `RQ2/RQ2_isBiased.csv`: file to compute the inter-rater agreement of whether or not a model documents Bias
- `RQ2/RQ2_biasAgrLabels.csv`: file to compute the inter-rater agreement related to bias categories
- `RQ2/RQ2_final_bias_categories_with_levels.csv`: for each model in the sample, this file lists (i) the bias leaf category, (ii) the first-level category, and (iii) the intermediate category

## RQ3
- `RQ3/RQ3_LicenseValidation.csv`: manual validation of a sample of licenses
- `RQ3/RQ3_{NETWORK-RESTRICTIVE|RESTRICTIVE|WEAK-RESTRICTIVE|PERMISSIVE}-license-list.txt`: lists of licenses with different permissiveness
- `RQ3/RQ3_prjs_license.csv`: for each project linked to models, among other fields it indicates the license tag and name
- `RQ3/RQ3_models_license.csv`: for each model, indicates among other pieces of info, whether the model has a license, and if yes what kind of license
- `RQ3/RQ3_model-prj-license_contingency_table.csv`: usage contingency table between projects' licenses (columns) and models' licenses (rows)
- `RQ3/RQ3_models_prjs_licenses_with_type.csv`: pairs project-model, with their respective licenses and permissiveness level

## scripts
Contains the scripts used to mine Hugging Face and GitHub. Details are in the enclosed README
Seq2Seq training data.csv
figshare.com
txt
Updated Apr 4, 2022
Cite
jithin cheriyan (2022). Seq2Seq training data.csv [Dataset]. http://doi.org/10.6084/m9.figshare.19513705.v1
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.19513705.v1
Dataset updated
Apr 4, 2022
Dataset provided by
Figshare (http://figshare.com/)
Authors
jithin cheriyan
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the training dataset for the seq2seq model that contains source comments that are offensive and target comments which are non-offensive.
train-csv
kaggle.com
Updated Jun 6, 2025
Cite
Aliaa Osama Esmail (2025). train-csv [Dataset]. https://www.kaggle.com/datasets/aliaaosamaesmail/train-csv
Dataset updated
Jun 6, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Aliaa Osama Esmail
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset

This dataset was created by Aliaa Osama Esmail

Released under Apache 2.0

Contents
expression data.csv
figshare.com
Updated Jan 30, 2022
Cite
Jihan Wang (2022). expression data.csv [Dataset]. http://doi.org/10.6084/m9.figshare.19093307.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.19093307.v1
Dataset updated
Jan 30, 2022
Dataset provided by
Figshare (http://figshare.com/)
Authors
Jihan Wang
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In this research, we proposed the SNR-PPFS feature selection algorithms to identify key gene signatures that distinguish COAD tumor samples from normal colon tissues. Machine learning-based feature selection of key gene signatures from high-dimensional datasets can be an effective way to study cancer genomic characteristics.
‘Titanic-Dataset (train.csv)’ analyzed by Analyst-2
analyst-2.ai
Updated Jan 28, 2022
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Titanic-Dataset (train.csv)’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-titanic-dataset-train-csv-1d8d/f4271729/?iid=006-918&v=presentation
Explore at:
Dataset updated
Jan 28, 2022
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘Titanic-Dataset (train.csv)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/hesh97/titanicdataset-traincsv on 28 January 2022.

--- No further description of dataset provided by original source ---

--- Original source retains full ownership of the source dataset ---
ScanGrow Manuscript files
figshare.com
Updated Jun 3, 2023
Cite
Laura Espina; Ross Worth (2023). ScanGrow Manuscript files [Dataset]. http://doi.org/10.6084/m9.figshare.16822924.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.16822924.v1
Dataset updated
Jun 3, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Laura Espina; Ross Worth
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Datasets related to the manuscript describing the ScanGrow [Proof of Concept] application:

Worth RM and Espina L (2022) ScanGrow: Deep Learning-Based Live Tracking of Bacterial Growth in Broth. Front. Microbiol. 13:900596.

doi: 10.3389/fmicb.2022.900596

The contents of the three compressed folders are described below.

TRAINING_MODEL.ZIP

Collection of images and spreadsheets used to train the image classification model that ScanGrow [PoC] uses by default. This training dataset should be run through the pre-processing workflow provided with ScanGrow to obtain the grouped images that are fed to the model training utility.

TEST_MODEL.ZIP

Collection of images and spreadsheets comprising the Test dataset used in the evaluation of the image classification model. This includes:

- New scans and spreadsheets (represented in Figure 3 as gray triangles).
- Evaluation.csv: combined results of the output files from the command "Test Model" when run with:
  - Dataset Test: these scans and spreadsheets (not used for training),
  - Dataset Training: the dataset used for training the model, or
  - Dataset Validation: the Training dataset after flipping the images horizontally, offsetting them, and adjusting the spectrophotometric values according to the newly inverted well positions.

SAMPLE_RUN.ZIP

Data from a sample run used to test ScanGrow on a microplate containing different concentrations of several antibiotics. This includes:

- Scans used in the "Sample run" with added antibiotics in the bacterial cultures.
- Sample_run_raw.csv: Data exported from the Table view after the run.
- Sample_run_processed.csv: Data from the Sample_run_raw.csv file after the introduction of metadata (e.g. contents of each well) and calculation of the AUC (area under the curve).
- Sample_run_json.json: JSON file showing the results of this run. It can be loaded into a ScanGrow session by clicking on "Show Graphs" -> "Open".
- ImageMask.csv: alternative ImageMask to substitute the original one in "C:\Program Files\Riverwell Consultancy Services Ltd\Scan Grow\Configuration". In this alternative ImageMask file, well C11 was modified to overcome an artefact in the scan.
E-learning Recommender System Dataset
search.dataone.org
dataverse.harvard.edu
Updated Nov 8, 2023
Cite
Hafsa, Mounir (2023). E-learning Recommender System Dataset [Dataset]. http://doi.org/10.7910/DVN/BMY3UD
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/BMY3UD
Dataset updated
Nov 8, 2023
Dataset provided by
Harvard Dataverse
Authors
Hafsa, Mounir
Description
Mandarine Academy Recommender System (MARS) Dataset is captured from a real-world open MOOC (https://mooc.office365-training.com/). The dataset offers both explicit and implicit ratings for the French and English versions of the MOOC. Compared with classical recommendation datasets like Movielens, this is a rather small dataset due to the educational nature of the available content. However, the dataset offers insights into real-world ratings and provides testing grounds away from common datasets. All items are available online for viewing in both French and English versions. All selected users have rated at least 1 item. No demographic information is included. Each user is represented by an id and a job (if available).

For both French and English, the same kinds of files are available in .csv format. We provide the following files:

- Users: contains information about user ids and their jobs.
- Items: contains information about items (resources) in the selected language. Contains a mix of feature types.
- Ratings: both explicit (watch time) and implicit (page views of items).

Formatting and Encoding

The dataset files are written as comma-separated values files with a single header row. Columns that contain commas (,) are escaped using double quotes ("). These files are encoded as UTF-8.

User Ids

User ids are consistent between explicit_ratings.csv, implicit_ratings.csv, and users.csv (i.e., the same id refers to the same user across the dataset).

Item Ids

Item ids are consistent between explicit_ratings.csv, implicit_ratings.csv, and items.csv (i.e., the same id refers to the same item across the dataset).

Ratings Data File Structure

All ratings are contained in the files explicit_ratings.csv and implicit_ratings.csv. Each line of these files after the header row represents one rating of one item by one user, and has the following format:

- implicit_ratings.csv: item_id,user_id,created_at
- explicit_ratings.csv: user_id,item_id,watch_percentage,created_at,rating

Item Data File Structure

Item information is contained in the file items.csv. Each line of this file after the header row represents one item, and has the following format:

item_id,language,name,nb_views,description,created_at,Difficulty,Job,Software,Theme,duration,type
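
As a minimal sketch of how the explicit ratings layout described above might be read, the following uses Python's standard csv module, which already handles double-quoted fields containing commas. The sample rows, the timestamp format, and the rating values are hypothetical and only illustrate the column order; the helper name load_explicit_ratings is likewise an assumption, not part of the dataset.

```python
import csv
import io

# Hypothetical rows for illustration only; they are not taken from the dataset.
# The column order follows the explicit_ratings.csv format stated in the description.
SAMPLE_EXPLICIT = """user_id,item_id,watch_percentage,created_at,rating
1,10,80,2023-01-05 10:00:00,8
1,11,20,2023-01-06 11:30:00,2
2,10,100,2023-01-07 09:15:00,10
"""


def load_explicit_ratings(handle):
    """Read explicit ratings into a list of dicts, converting numeric fields."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "user_id": row["user_id"],
            "item_id": row["item_id"],
            "watch_percentage": float(row["watch_percentage"]),
            "created_at": row["created_at"],  # kept as text; the real format is not specified
            "rating": float(row["rating"]),
        })
    return rows


ratings = load_explicit_ratings(io.StringIO(SAMPLE_EXPLICIT))

# Group the rated item ids by user, relying on ids being consistent across files.
items_per_user = {}
for r in ratings:
    items_per_user.setdefault(r["user_id"], set()).add(r["item_id"])
```

For a real file, the same function could be called on `open("explicit_ratings.csv", newline="", encoding="utf-8")`, since the description states that the files are UTF-8 encoded with a single header row.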
CIFAR-10 Python in CSV
kaggle.com
Updated Jun 22, 2021
Cite
fedesoriano (2021). CIFAR-10 Python in CSV [Dataset]. https://www.kaggle.com/fedesoriano/cifar10-python-in-csv
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Jun 22, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
fedesoriano
Description
Context

The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. The classes are completely mutually exclusive. There are 50000 training images and 10000 test images.

The batches.meta file contains the label names of each class.

The dataset was originally divided into 5 training batches with 10000 images per batch. The original dataset can be found here: https://www.cs.toronto.edu/~kriz/cifar.html. This dataset contains all the training data and test data in the same CSV file so it is easier to load.

Content

Here is the list of the 10 classes in the CIFAR-10:

Classes:

- 0: airplane
- 1: automobile
- 2: bird
- 3: cat
- 4: deer
- 5: dog
- 6: frog
- 7: horse
- 8: ship
- 9: truck
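
The class indices listed above can be kept as a simple lookup when decoding labels. This is only an illustrative sketch: the names CIFAR10_CLASSES and class_name are hypothetical, and the code does not assume any particular column layout of the CSV file.

```python
# Class names in index order, as listed in the dataset description.
CIFAR10_CLASSES = [
    "airplane", "automobile", "bird", "cat", "deer",
    "dog", "frog", "horse", "ship", "truck",
]


def class_name(label):
    """Return the class name for an integer label in the range 0-9."""
    if not 0 <= label < len(CIFAR10_CLASSES):
        raise ValueError(f"label out of range: {label}")
    return CIFAR10_CLASSES[label]
```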

Acknowledgements

Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, 2009.

How to load the batches.meta file (Python)

The function used to open the file:

    import pickle

    def unpickle(file):
        # Load a pickled CIFAR-10 file; with encoding='bytes' the keys are bytes objects.
        with open(file, 'rb') as fo:
            data = pickle.load(fo, encoding='bytes')
        return data

Example of how to read the file:

    metadata_path = './cifar-10-python/batches.meta'  # change this path
    metadata = unpickle(metadata_path)

Cite

Flammini, Francesco (2023). Level Crossing Warning Bell (LCWB) Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7945411

Level Crossing Warning Bell (LCWB) Dataset

Explore at:

Dataset updated

May 20, 2023

Dataset provided by

De Donato, Lorenzo
Flammini, Francesco
Marrone, Stefano
Vittorini, Valeria

License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Acknowledgement These data are a product of a research activity conducted in the context of the RAILS (Roadmaps for AI integration in the raiL Sector) project which has received funding from the Shift2Rail Joint Undertaking under the European Union’s Horizon 2020 research and innovation programme under grant agreement n. 881782 Rails. The JU receives support from the European Union’s Horizon 2020 research and innovation program and the Shift2Rail JU members other than the Union.

Disclaimers The information and views set out in this document are those of the author(s) and do not necessarily reflect the official opinion of Shift2Rail Joint Undertaking. The JU does not guarantee the accuracy of the data included in this document. Neither the JU nor any person acting on the JU’s behalf may be held responsible for the use which may be made of the information contained therein.

This "dataset" has been created for scientific purposes only - and WITHOUT ANY COMMERCIAL purposes - to study the potentials of Deep Learning and Transfer Learning approaches. We are NOT re-distributing any video or audio; our files just contain pointers and indications needed to reproduce our study. The authors DO NOT ASSUME any responsibility for the use that other researchers or users will make of these data.

General Info The CSV files contained in this folder (and subfolders) compose the Level Crossing (LC) Warning Bell (WB) Dataset.

When using any of these data, please mention:

De Donato, L., Marrone, S., Flammini, F., Sansone, C., Vittorini, V., Nardone, R., Mazzariello, C., and Bernaudine, F., "Intelligent Detection of Warning Bells at Level Crossings through Deep Transfer Learning for Smarter Railway Maintenance", Engineering Applications of Artificial Intelligence, Elsevier, 2023

Content of the folder This folder contains the following subfolders and files.

"Data Files" contains all the CSV files related to the data composing the LCWB Dataset:

WB_data.csv (WB_labels.csv): representing data of the "Warning Bell (WB)" class;

NA_data.csv (NA_labels.csv): representing data of the "No Alarm (NA)" class;

GE_data.csv (GE_labels.csv): representing data of the "GEneric alarm (GE)" class.

"LCWB Dataset" contains all the JSON files that show how the aforementioned data have been distributed among training, validation, and test sets:

IT_Distribution.json and UK_distribution.json respectively show how Italian (IT) WBs and British (UK) WBs have been distributed;

The same goes for NA_Distribution.json and GE_Distribution.json, which show the distribution of NA and GE data respectively;

DatasetDistribution.json simply incorporates the content of the aforementioned JSON files in a unique file that can be exploited to obtain exactly the same dataset we adopted in our analyses.

"Additional Files" contains some CSV files related to data we adopted to further test the deep neural network leveraged in the aforementioned manuscript:

FR_DE_data.csv (FR_DE_labels.csv): representing data that have been used to test the generalisation performances of the network we exploited on LC WBs related to countries that were not considered in the training phase.

Noises_data.csv (Noises_labels.csv): representing the noises that were considered to study the behaviour of the network in case of noisy data.

CSV Files Structure Each "XX_labels.csv" file contains, for each entry, the following information:

The identifier ("index") of the sub-class (which is not relevant in our case);

The code-name ("mid") of the class, which is used in the "XX_data.csv" file to indicate the sub-class of a specific audio;

The extended name of the class ("display_name").

It is worth mentioning that sub-classes do not have a specific purpose in our task. They have been kept to maintain as much as possible the structure of the "class_labels_indices.csv" file provided by AudioSet. The same applies to the "XX_data.csv" files, which have roughly the same structure as the "Evaluation", "Balanced train", and "Unbalanced train" AudioSet CSV files.

Indeed, each "XX_data.csv" file contains, for each entry, the following information:

ID: the identifier of the entry;

YTID: the YouTube identifier of the video;

start_seconds and end_seconds: which delimit the portion of audio (extracted from YTID) which is of interest for this task;

positive_labels: the label(s) associated with the audio.
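
A minimal sketch of reading such pointer entries is given below. It assumes a header row with exactly the field names listed above and a quoted, comma-separated positive_labels field; the sample rows, YouTube identifiers, and label codes are hypothetical. AudioSet's own CSV files begin with comment lines, so a real file may need those skipped or the header adapted.

```python
import csv
import io

# Hypothetical rows for illustration only; identifiers and labels are made up.
SAMPLE_DATA = '''ID,YTID,start_seconds,end_seconds,positive_labels
0,AAAAAAAAAAA,30.0,40.0,"/m/hypo1,/m/hypo2"
1,BBBBBBBBBBB,0.0,10.0,"/m/hypo1"
'''


def read_pointers(handle):
    """Parse pointer entries and derive the clip duration for each one."""
    entries = []
    for row in csv.DictReader(handle):
        start = float(row["start_seconds"])
        end = float(row["end_seconds"])
        labels = [lab.strip() for lab in row["positive_labels"].split(",") if lab.strip()]
        entries.append({
            "id": row["ID"],
            "ytid": row["YTID"],
            "start": start,
            "end": end,
            "duration": end - start,  # length of the audio portion of interest
            "labels": labels,
        })
    return entries


entries = read_pointers(io.StringIO(SAMPLE_DATA))
```

Each entry only points at a portion of a YouTube video; the audio itself is not part of the files and has to be retrieved separately.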

Credits The structure of the CSV files contained in this dataset, as well as part of their content, was inspired by the CSV files composing the AudioSet dataset which is made available by Google Inc. under a Creative Commons Attribution 4.0 International (CC BY 4.0) license, while its ontology is available under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Particularly, from AudioSet, we retrieved:

The structure of the CSV files as discussed above.

Data contained in GE_data.csv (which is a minimal portion of data made available by AudioSet) as well as the related 19 classes (in GE_labels.csv) which we selected among the hundreds of classes included in the AudioSet ontology.

Pointers contained in "XX_data.csv" files other than GE_data.csv have been retrieved manually from scratch. The related "XX_labels.csv" files have then been created accordingly.

More about downloading the AudioSet dataset can be found here.
