100+ datasets found
  1. Level Crossing Warning Bell (LCWB) Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 20, 2023
    Cite
    Flammini, Francesco (2023). Level Crossing Warning Bell (LCWB) Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7945411
    Dataset updated
    May 20, 2023
    Dataset provided by
    De Donato, Lorenzo
    Flammini, Francesco
    Marrone, Stefano
    Vittorini, Valeria
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Acknowledgement These data are a product of a research activity conducted in the context of the RAILS (Roadmaps for AI integration in the raiL Sector) project which has received funding from the Shift2Rail Joint Undertaking under the European Union’s Horizon 2020 research and innovation programme under grant agreement n. 881782 Rails. The JU receives support from the European Union’s Horizon 2020 research and innovation program and the Shift2Rail JU members other than the Union.

    Disclaimers The information and views set out in this document are those of the author(s) and do not necessarily reflect the official opinion of Shift2Rail Joint Undertaking. The JU does not guarantee the accuracy of the data included in this document. Neither the JU nor any person acting on the JU’s behalf may be held responsible for the use which may be made of the information contained therein.

    This "dataset" has been created for scientific purposes only - and WITHOUT ANY COMMERCIAL purposes - to study the potentials of Deep Learning and Transfer Learning approaches. We are NOT re-distributing any video or audio; our files just contain pointers and indications needed to reproduce our study. The authors DO NOT ASSUME any responsibility for the use that other researchers or users will make of these data.

    General Info The CSV files contained in this folder (and subfolders) compose the Level Crossing (LC) Warning Bell (WB) Dataset.

    When using any of these data, please mention:

    De Donato, L., Marrone, S., Flammini, F., Sansone, C., Vittorini, V., Nardone, R., Mazzariello, C., and Bernaudine, F., "Intelligent Detection of Warning Bells at Level Crossings through Deep Transfer Learning for Smarter Railway Maintenance", Engineering Applications of Artificial Intelligence, Elsevier, 2023

    Content of the folder This folder contains the following subfolders and files.

    "Data Files" contains all the CSV files related to the data composing the LCWB Dataset:

    WB_data.csv (WB_labels.csv): representing data of the "Warning Bell (WB)" class;

    NA_data.csv (NA_labels.csv): representing data of the "No Alarm (NA)" class;

    GE_data.csv (GE_labels.csv): representing data of the "GEneric alarm (GE)" class.

    "LCWB Dataset" contains all the JSON files that show how the aforementioned data have been distributed among training, validation, and test sets:

    IT_Distribution.json and UK_distribution.json respectively show how Italian (IT) WBs and British (UK) WBs have been distributed;

    The same goes for NA_Distribution.json and GE_Distribution.json, which show the distribution of NA and GE data respectively;

    DatasetDistribution.json simply incorporates the content of the aforementioned JSON files in a unique file that can be exploited to obtain exactly the same dataset we adopted in our analyses.

    "Additional Files" contains some CSV files related to data we adopted to further test the deep neural network leveraged in the aforementioned manuscript:

    FR_DE_data.csv (FR_DE_labels.csv): representing data that have been used to test the generalisation performances of the network we exploited on LC WBs related to countries that were not considered in the training phase.

    Noises_data.csv (Noises_labels.csv): representing the noises that were considered to study the behaviour of the network in case of noisy data.

    CSV Files Structure Each "XX_labels.csv" file contains, for each entry, the following information:

    The identifier ("index") of the sub-class (which is not relevant in our case);

    The code-name ("mid") of the class, which is used in the "XX_data.csv" file to indicate the sub-class of a specific audio;

    The extended name of the class ("display_name").

    It is worth mentioning that sub-classes do not serve a specific purpose in our task. They have been kept to preserve, as much as possible, the structure of the "class_labels_indices.csv" file provided by AudioSet. The same applies to the "XX_data.csv" files, which have roughly the same structure as the "Evaluation", "Balanced train", and "Unbalanced train" AudioSet CSV files.

    Indeed, each "XX_data.csv" file contains, for each entry, the following information:

    ID: the identifier of the entry;

    YTID: the YouTube identifier of the video;

    start_seconds and end_seconds: which delimit the portion of audio (extracted from YTID) which is of interest for this task;

    positive_labels: the label(s) associated with the audio.
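    A minimal sketch of how these files could be loaded and joined (the paths and header handling are assumptions based on the descriptions above):

        import pandas as pd

        # Hypothetical paths; column names follow the descriptions above.
        labels = pd.read_csv("Data Files/WB_labels.csv")   # index, mid, display_name
        data = pd.read_csv("Data Files/WB_data.csv")       # ID, YTID, start_seconds, end_seconds, positive_labels

        # Map each entry's "mid" codes in positive_labels to readable class names.
        mid_to_name = dict(zip(labels["mid"], labels["display_name"]))
        data["class_names"] = data["positive_labels"].map(
            lambda mids: [mid_to_name.get(m.strip(), m) for m in str(mids).split(",")]
        )
        print(data[["YTID", "start_seconds", "end_seconds", "class_names"]].head())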

    Credits The structure of the CSV files contained in this dataset, as well as part of their content, was inspired by the CSV files composing the AudioSet dataset which is made available by Google Inc. under a Creative Commons Attribution 4.0 International (CC BY 4.0) license, while its ontology is available under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

    Particularly, from AudioSet, we retrieved:

    The structure of the CSV files as discussed above.

    Data contained in GE_data.csv (which is a minimal portion of data made available by AudioSet) as well as the related 19 classes (in GE_labels.csv) which we selected among the hundreds of classes included in the AudioSet ontology.

    Pointers contained in "XX_data.csv" files other than GE_data.csv have been retrieved manually from scratch. Then, the related "XX_labels.csv" files have been created consequently.

    More about downloading the AudioSet dataset can be found here.

  2. economic-for-llama2-ft-just-train-csv

    • huggingface.co
    Updated Apr 25, 2024
    + more versions
    Cite
    Nguyen Ngoc Dat (2024). economic-for-llama2-ft-just-train-csv [Dataset]. https://huggingface.co/datasets/dchatca/economic-for-llama2-ft-just-train-csv
    Dataset updated
    Apr 25, 2024
    Authors
    Nguyen Ngoc Dat
    Description

    dchatca/economic-for-llama2-ft-just-train-csv dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. LLM: 7 prompt training dataset

    • kaggle.com
    Updated Nov 15, 2023
    Cite
    Carl McBride Ellis (2023). LLM: 7 prompt training dataset [Dataset]. https://www.kaggle.com/datasets/carlmcbrideellis/llm-7-prompt-training-dataset
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Nov 15, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Carl McBride Ellis
    License

    Community Data License Agreement, Sharing, Version 1.0 (CDLA-Sharing-1.0): https://cdla.io/sharing-1-0/

    Description
    • Version 4: Adding the data from "LLM-generated essay using PaLM from Google Gen-AI" kindly generated by Kingki19 / Muhammad Rizqi.
      File: train_essays_RDizzl3_seven_v2.csv
      Human texts: 14,247; LLM texts: 3,004

      See also: a new dataset of an additional 4,900 LLM-generated texts: LLM: Mistral-7B Instruct texts



    • Version 3: "The RDizzl3 Seven"
      File: train_essays_RDizzl3_seven_v1.csv

    • "Car-free cities"

    • "Does the electoral college work?"

    • "Exploring Venus"

    • "The Face on Mars"

    • "Facial action coding system"

    • "A Cowboy Who Rode the Waves"

    • "Driverless cars"

    How this dataset was made: see the notebook "LLM: Make 7 prompt train dataset"

    • Version 2: (train_essays_7_prompts_v2.csv) This dataset is composed of 13,712 human texts and 1,638 AI-LLM-generated texts originating from 7 of the PERSUADE 2.0 corpus prompts.

    Namely:

    • "Car-free cities"
    • "Does the electoral college work?"
    • "Exploring Venus"
    • "The Face on Mars"
    • "Facial action coding system"
    • "Seeking multiple opinions"
    • "Phones and driving"

    This dataset is a derivative of several source datasets, as well as the original competition training dataset.

    • Version 1: This dataset is composed of 13,712 human texts and 1,165 AI-LLM-generated texts originating from 7 of the PERSUADE 2.0 corpus prompts.
  4. Datasets

    • figshare.com
    zip
    Updated May 31, 2023
    Cite
    Bastian Eichenberger; YinXiu Zhan (2023). Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.12958037.v1
    Explore at:
    zip (available download formats)
    Dataset updated
    May 31, 2023
    Dataset provided by
    figshare
    Authors
    Bastian Eichenberger; YinXiu Zhan
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The benchmarking datasets used for deepBlink. The npz files contain train/valid/test splits inside and can be used directly. The files belong to the following challenges/classes:
    • ISBI particle tracking challenge: microtubule, vesicle, receptor
    • Custom synthetic (based on http://smal.ws): particle
    • Custom fixed cell: smfish
    • Custom live cell: suntag
    The csv files determine which image in the test splits corresponds to which original image, SNR, and density.
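    Since the array keys inside the npz archives are not documented here, a first step could be to inspect one of them (the file name below is a hypothetical example):

        import numpy as np

        # List the contained arrays before relying on any particular key name.
        with np.load("microtubule.npz", allow_pickle=True) as archive:
            for key in archive.files:
                print(key, archive[key].shape)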

  5. Network traffic and code for machine learning classification

    • data.mendeley.com
    Updated Feb 20, 2020
    + more versions
    Cite
    Víctor Labayen (2020). Network traffic and code for machine learning classification [Dataset]. http://doi.org/10.17632/5pmnkshffm.2
    Dataset updated
    Feb 20, 2020
    Authors
    Víctor Labayen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset is a set of network traffic traces in pcap/csv format captured from a single user. The traffic is classified into 5 different activities (Video, Bulk, Idle, Web, and Interactive), and the label is shown in the filename. There is also a file (mapping.csv) with the mapping between the host's IP address, the csv/pcap filename, and the activity label.

    Activities:

    Interactive: applications that perform real-time interactions in order to provide a suitable user experience, such as editing a file in Google Docs or remote CLI sessions over SSH.

    Bulk data transfer: applications that transfer large data volumes over the network, for example SCP/FTP applications and direct downloads of large files from web servers such as Mediafire, Dropbox, or the university repository.

    Web browsing: all the traffic generated while searching and consuming different web pages, for example several blogs, news sites, and the university's Moodle.

    Video playback: traffic from applications that consume video in streaming or pseudo-streaming. The best-known services used are Twitch and YouTube, but the university's online classroom has also been used.

    Idle behaviour: the background traffic generated by the user's computer when the user is idle. This traffic has been captured with every application closed and with some pages open, such as Google Docs, YouTube, and several other web pages, but always without user interaction.

    The capture is performed on a network probe, attached via a SPAN port to the router that forwards the user's network traffic. The traffic is stored in pcap format with the full packet payload. In the csv files, every non-TCP/UDP packet is filtered out, as well as every packet with no payload. The fields in the csv files are the following (one line per packet): timestamp, protocol, payload size, source and destination IP address, and source and destination UDP/TCP port. The fields are also included as a header in every csv file.
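    A minimal sketch of working with one trace (file names are hypothetical; the header names are not listed here, so the sketch uses positional columns per the field order above):

        import pandas as pd

        # mapping.csv links each trace file to its activity label.
        mapping = pd.read_csv("mapping.csv")
        trace = pd.read_csv("video_capture_01.csv")   # one line per packet

        # The header names are not documented here, so confirm them before use.
        print(trace.columns.tolist())

        # Example feature: payload bytes per second, a simple activity signature,
        # using positional columns (0 = timestamp, 2 = payload size).
        seconds = trace.iloc[:, 0].astype(float).astype(int)
        bytes_per_second = trace.iloc[:, 2].groupby(seconds).sum()
        print(bytes_per_second.head())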

    The amount of data is stated as follows:

    • Bulk: 19 traces, 3,599 s of total duration, 8,704 MBytes of pcap files
    • Video: 23 traces, 4,496 s, 1,405 MBytes
    • Web: 23 traces, 4,203 s, 148 MBytes
    • Interactive: 42 traces, 8,934 s, 30.5 MBytes
    • Idle: 52 traces, 6,341 s, 0.69 MBytes

    The code of our machine learning approach is also included. There is a README.txt file with the documentation of how to use the code.

  6. Disease Prediction Using Machine Learning

    • dataandsons.com
    csv, zip
    Updated Oct 31, 2022
    Cite
    test test (2022). Disease Prediction Using Machine Learning [Dataset]. https://www.dataandsons.com/categories/machine-learning/disease-prediction-using-machine-learning
    Explore at:
    csv, zip (available download formats)
    Dataset updated
    Oct 31, 2022
    Authors
    test test
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    About this Dataset

    This dataset will help you put your existing knowledge to great use. It has 132 parameters from which 42 different types of diseases can be predicted. The dataset consists of 2 CSV files, one for training and one for testing your model. Each CSV file has 133 columns: 132 of these columns are symptoms that a person experiences, and the last column is the prognosis. The symptoms are mapped to 42 diseases into which you can classify a given set of symptoms. You are required to train your model on the training data and test it on the testing data.
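    A minimal sketch of the intended workflow, assuming hypothetical file names and a "prognosis" header for the last column:

        import pandas as pd
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.metrics import accuracy_score

        # "Training.csv"/"Testing.csv" are assumed names; each file has
        # 132 symptom columns plus a final prognosis column.
        train = pd.read_csv("Training.csv")
        test = pd.read_csv("Testing.csv")

        X_train, y_train = train.drop(columns=["prognosis"]), train["prognosis"]
        X_test, y_test = test.drop(columns=["prognosis"]), test["prognosis"]

        model = RandomForestClassifier(n_estimators=200, random_state=0)
        model.fit(X_train, y_train)
        print("accuracy:", accuracy_score(y_test, model.predict(X_test)))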

    Category

    Machine Learning

    Keywords

    medicine, disease, healthcare, ML, machine learning

    Row Count

    4962

    Price

    $109.00

  7. Data Cleaning, Translation & Split of the Dataset for the Automatic...

    • zenodo.org
    bin, csv +1
    Updated Apr 24, 2025
    Cite
    Juliane Köhler (2025). Data Cleaning, Translation & Split of the Dataset for the Automatic Classification of Documents for the Classification System for the Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft [Dataset]. http://doi.org/10.5281/zenodo.6957842
    Explore at:
    text/x-python, csv, bin (available download formats)
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Juliane Köhler
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    • Cleaned_Dataset.csv – The combined CSV files of all scraped documents from DABI, e-LiS, o-bib and Springer.
    • Data_Cleaning.ipynb – The Jupyter notebook with the Python code for the analysis and cleaning of the original dataset.
    • ger_train.csv – The German training set as CSV file.
    • ger_validation.csv – The German validation set as CSV file.
    • en_test.csv – The English test set as CSV file.
    • en_train.csv – The English training set as CSV file.
    • en_validation.csv – The English validation set as CSV file.
    • splitting.py – The Python script for splitting a dataset into training, test, and validation sets (see the sketch after this list).
    • DataSetTrans_de.csv – The final German dataset as a CSV file.
    • DataSetTrans_en.csv – The final English dataset as a CSV file.
    • translation.py – The Python script for translating the cleaned dataset.
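    As referenced above, a rough sketch of the kind of three-way split splitting.py performs; the 80/10/10 ratio and output file names here are assumptions, not the script's actual behaviour:

        import pandas as pd
        from sklearn.model_selection import train_test_split

        df = pd.read_csv("Cleaned_Dataset.csv")

        # Carve out the training set first, then divide the remainder
        # evenly into validation and test sets.
        train, rest = train_test_split(df, test_size=0.2, random_state=42)
        validation, test = train_test_split(rest, test_size=0.5, random_state=42)

        train.to_csv("train.csv", index=False)
        validation.to_csv("validation.csv", index=False)
        test.to_csv("test.csv", index=False)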
  8. UCI and OpenML Data Sets for Ordinal Quantification

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 25, 2023
    Cite
    UCI and OpenML Data Sets for Ordinal Quantification [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8177301
    Dataset updated
    Jul 25, 2023
    Dataset provided by
    Moreo, Alejandro
    Bunse, Mirko
    Sebastiani, Fabrizio
    Senz, Martin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.

    With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.

    We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.
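    A sample could thus be rebuilt from its index row roughly as follows (the data file name and the absence of a header in the index files are assumptions):

        import pandas as pd

        data = pd.read_csv("data.csv")   # one extracted data set (hypothetical name)
        samples = pd.read_csv("app_val_indices.csv", header=None)

        # Rebuild the first validation sample from its row of data-item indices.
        first = samples.iloc[0].dropna().astype(int)
        sample = data.iloc[first]
        print(sample["class_label"].value_counts(normalize=True))   # label distribution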

    Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.

    Usage

    You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.

    Preliminaries: You need a working Julia installation. We used Julia v1.6.5 in our experiments.

    Data Extraction: In your terminal, you can call either

    make

    (recommended), or

    julia --project="." --eval "using Pkg; Pkg.instantiate()"
    julia --project="." extract-oq.jl

    Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.

    Further Reading

    Implementation of our experiments: https://github.com/mirkobunse/regularized-oq

  9. Data from: Data on the Construction Processes of Regression Models

    • jstagedata.jst.go.jp
    jpeg
    Updated Jul 27, 2023
    Cite
    Taichi Kimura; Riko Iwamoto; Mikio Yoshida; Tatsuya Takahashi; Shuji Sasabe; Yoshiyuki Shirakawa (2023). Data on the Construction Processes of Regression Models [Dataset]. http://doi.org/10.50931/data.kona.22180318.v2
    Explore at:
    jpeg (available download formats)
    Dataset updated
    Jul 27, 2023
    Dataset provided by
    Hosokawa Powder Technology Foundation
    Authors
    Taichi Kimura; Riko Iwamoto; Mikio Yoshida; Tatsuya Takahashi; Shuji Sasabe; Yoshiyuki Shirakawa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This CSV dataset (numbered 1–8) demonstrates the construction processes of the regression models built using machine learning methods, which are used to plot Figs. 2–7. The CSV file 1.LSM_R^2 (plotting Fig. 2) shows the relationship between estimated and actual values when the least-squares method was used for model construction. In the CSV file 2.PCR_R^2 (plotting Fig. 3), the number of principal components was varied from 1 to 5 during the construction of a model using principal component regression. The data in the CSV file 3.SVR_R^2 (plotting Fig. 4) are the result of construction using support vector regression; the hyperparameters were decided by comprehensively combining the listed candidates and exploring for the maximum R^2 value. When a deep neural network was applied to the construction of a regression model, N_Neur., N_H.L., and N_L.T. were varied. The CSV file 4.DNN_HL (plotting Fig. 5a) shows the changes in the relationship between estimated and actual values at each N_H.L.; similarly, the CSV files 5.DNN_Neur (plotting Fig. 5b) and 6.DNN_LT (plotting Fig. 5c) show these changes when N_Neur. or N_L.T. was varied. The data in the CSV file 7.DNN_R^2 (plotting Fig. 6) are the result of using the optimal N_Neur., N_H.L., and N_L.T.. In the CSV file 8.R^2 (plotting Fig. 7), the validity of each machine learning method is compared by showing the optimal results for each method.

    Experimental conditions:
    • Supply volume of the raw material: 25–125 mL
    • Addition rate of TiO2: 5.0–15.0 wt%
    • Operation time: 1–15 min
    • Rotation speed: 2,200–5,700 min^-1
    • Temperature: 295–319 K

    Nomenclature:
    • N_Neur.: the number of neurons
    • N_H.L.: the number of hidden layers
    • N_L.T.: the number of learning times

  10. Ransomware and user samples for training and validating ML models

    • data.mendeley.com
    Updated Sep 17, 2021
    + more versions
    Cite
    Ransomware and user samples for training and validating ML models [Dataset]. https://data.mendeley.com/datasets/yhg5wk39kf/2
    Dataset updated
    Sep 17, 2021
    Authors
    Eduardo Berrueta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Ransomware has been considered a significant threat to most enterprises for the past few years. In scenarios where users can access all files on a shared server, one infected host is capable of locking access to all shared files. In the article related to this repository, we detect ransomware infection based on file-sharing traffic analysis, even in the case of encrypted traffic. We compare three machine learning models and choose the best for validation. We train and test the detection model using more than 70 ransomware binaries from 26 different families and more than 2,500 h of 'not infected' traffic from real users. The results reveal that the proposed tool can detect all ransomware binaries, including those not used in the training phase (zero-days). This paper provides a validation of the algorithm by studying the false positive rate and the amount of information from user files that the ransomware could encrypt before being detected.

    This dataset directory contains the 'infected' and 'not infected' samples and the models used for each T configuration, each in a separate folder.

    The folders are named NxSy, where x is the number of 1-second intervals per sample and y is the sliding step in seconds.

    Each folder (for example N10S10/) contains:

    • tree.py: Python script with the Tree model.
    • ensemble.json: JSON file with the information about the Ensemble model.
    • NN_XhiddenLayer.json: JSON file with the information about the NN model with X hidden layers (1, 2, or 3).
    • N10S10.csv: all samples used for training each model in this folder, in CSV format for use in the BigML application.
    • zeroDays.csv: all zero-day samples used for testing each model in this folder, in CSV format for use in the BigML application.
    • userSamples_test: all samples used for validating each model in this folder, in CSV format for use in the BigML application.
    • userSamples_train: user samples used for training the models.
    • ransomware_train: ransomware samples used for training the models.
    • scaler.scaler: Standard Scaler from the Python library, used to scale the samples.
    • zeroDays_notFiltered: folder with the zero-day samples.

    In the case of the N30S30 folder, there is an additional folder (SMBv2SMBv3NFS) with the samples extracted from the SMBv2, SMBv3, and NFS traffic traces. There are more binaries than those presented in the article, because some of them are not "unseen" binaries (their families are present in the training set).

    The files containing samples (NxSy.csv, zeroDays.csv, and userSamples_test.csv) are structured as follows:

    • Each line is one sample.
    • Each sample has 3*T features plus the label (1 if it is an 'infected' sample and 0 if it is not).
    • The features are separated by ',' because it is a csv file.
    • The last column is the label of the sample.
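    A minimal sketch of loading one sample file under the structure above (the absence of a header row is an assumption):

        import pandas as pd

        samples = pd.read_csv("N10S10/N10S10.csv", header=None)

        X = samples.iloc[:, :-1].to_numpy()   # 3*T features per sample
        y = samples.iloc[:, -1].to_numpy()    # 1 = 'infected', 0 = 'not infected'

        T = X.shape[1] // 3
        print(f"{len(y)} samples, T = {T} one-second intervals each")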

    Additionally, we have placed two pcap files in the root directory. These are the traces used to compare both versions of SMB.

  11. csv training

    • kaggle.com
    Updated May 16, 2023
    Cite
    Tờ Rung (2023). csv training [Dataset]. https://www.kaggle.com/datasets/btrung/csv-training/versions/1
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    May 16, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Tờ Rung
    Description

    Dataset

    This dataset was created by Tờ Rung


  12. 📊 Yahoo Answers 10 categories for NLP CSV

    • kaggle.com
    Updated Apr 7, 2023
    Cite
    Yassir Acharki (2023). 📊 Yahoo Answers 10 categories for NLP CSV [Dataset]. http://doi.org/10.34740/kaggle/dsv/5339321
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Apr 7, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Yassir Acharki
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Description

    The Yahoo! Answers topic classification dataset is constructed using the 10 largest main categories. Each class contains 140,000 training samples and 6,000 testing samples, so the dataset totals 1,400,000 training samples and 60,000 testing samples. From all the answers and other meta-information, we only used the best answer content and the main category information.

    The file classes.txt contains a list of classes corresponding to each label.

    The files train.csv and test.csv contain all the training and testing samples as comma-separated values. There are 4 columns in them, corresponding to class index (1 to 10), question title, question content, and best answer. The text fields are escaped using double quotes ("), and any internal double quote is escaped by 2 double quotes (""). Newlines are escaped by a backslash followed by an "n" character, that is, "\n".
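    A minimal sketch of reading the files under this format (assuming no header row; the column names are taken from the description above):

        import pandas as pd

        cols = ["class_index", "question_title", "question_content", "best_answer"]
        train = pd.read_csv("train.csv", names=cols)

        # Undo the escaped newlines: the literal two-character sequence \n
        # becomes a real newline inside the text fields.
        for col in cols[1:]:
            train[col] = train[col].str.replace("\\n", "\n", regex=False)

        print(train["class_index"].value_counts())   # expect 140,000 per class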

  13. Dataset of the paper: "How do Hugging Face Models Document Datasets, Bias,...

    • zenodo.org
    • explore.openaire.eu
    • +1more
    zip
    Updated Jan 16, 2024
    Cite
    Federica Pepe; Vittoria Nardone; Antonio Mastropaolo; Gerardo Canfora; Gabriele Bavota; Massimiliano Di Penta (2024). Dataset of the paper: "How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study" [Dataset]. http://doi.org/10.5281/zenodo.10058142
    Explore at:
    zip (available download formats)
    Dataset updated
    Jan 16, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Federica Pepe; Vittoria Nardone; Antonio Mastropaolo; Gerardo Canfora; Gabriele Bavota; Massimiliano Di Penta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This replication package contains datasets and scripts related to the paper: "*How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study*"

    ## Root directory

    - `statistics.r`: R script used to compute the correlation between usage and downloads, and the RQ1/RQ2 inter-rater agreements

    - `modelsInfo.zip`: zip file containing all the downloaded model cards (in JSON format)

    - `script`: directory containing all the scripts used to collect and process data. For further details, see README file inside the script directory.

    ## Dataset

    - `Dataset/Dataset_HF-models-list.csv`: list of HF models analyzed

    - `Dataset/Dataset_github-prj-list.txt`: list of GitHub projects using the *transformers* library

    - `Dataset/Dataset_github-Prj_model-Used.csv`: contains usage pairs: project, model

    - `Dataset/Dataset_prj-num-models-reused.csv`: number of models used by each GitHub project

    - `Dataset/Dataset_model-download_num-prj_correlation.csv` contains, for each model used by GitHub projects: the name, the task, the number of reusing projects, and the number of downloads

    ## RQ1

    - `RQ1/RQ1_dataset-list.txt`: list of HF datasets

    - `RQ1/RQ1_datasetSample.csv`: sample set of models used for the manual analysis of datasets

    - `RQ1/RQ1_analyzeDatasetTags.py`: Python script to analyze model tags for the presence of datasets. It requires the `modelsInfo.zip` archive to be unzipped into a directory with the same name (`modelsInfo`) at the root of the replication package folder. It prints its output to stdout; redirect it to a file to be analyzed by the `RQ2/countDataset.py` script

    - `RQ1/RQ1_countDataset.py`: given the output of `RQ2/analyzeDatasetTags.py` (passed as an argument), produces, for each model, a list of Booleans indicating whether (i) the model only declares HF datasets, (ii) the model only declares external datasets, (iii) the model declares both, and (iv) the model is part of the sample for the manual analysis

    - `RQ1/RQ1_datasetTags.csv`: output of `RQ2/analyzeDatasetTags.py`

    - `RQ1/RQ1_dataset_usage_count.csv`: output of `RQ2/countDataset.py`

    ## RQ2

    - `RQ2/tableBias.pdf`: table detailing the number of occurrences of different types of bias by model task

    - `RQ2/RQ2_bias_classification_sheet.csv`: results of the manual labeling

    - `RQ2/RQ2_isBiased.csv`: file to compute the inter-rater agreement of whether or not a model documents Bias

    - `RQ2/RQ2_biasAgrLabels.csv`: file to compute the inter-rater agreement related to bias categories

    - `RQ2/RQ2_final_bias_categories_with_levels.csv`: for each model in the sample, this file lists (i) the bias leaf category, (ii) the first-level category, and (iii) the intermediate category

    ## RQ3

    - `RQ3/RQ3_LicenseValidation.csv`: manual validation of a sample of licenses

    - `RQ3/RQ3_{NETWORK-RESTRICTIVE|RESTRICTIVE|WEAK-RESTRICTIVE|PERMISSIVE}-license-list.txt`: lists of licenses with different permissiveness

    - `RQ3/RQ3_prjs_license.csv`: for each project linked to models, among other fields it indicates the license tag and name

    - `RQ3/RQ3_models_license.csv`: for each model, indicates among other pieces of info, whether the model has a license, and if yes what kind of license

    - `RQ3/RQ3_model-prj-license_contingency_table.csv`: usage contingency table between projects' licenses (columns) and models' licenses (rows)

    - `RQ3/RQ3_models_prjs_licenses_with_type.csv`: pairs project-model, with their respective licenses and permissiveness level

    ## scripts

    Contains the scripts used to mine Hugging Face and GitHub. Details are in the enclosed README

  14. Seq2Seq training data.csv

    • figshare.com
    txt
    Updated Apr 4, 2022
    Cite
    jithin cheriyan (2022). Seq2Seq training data.csv [Dataset]. http://doi.org/10.6084/m9.figshare.19513705.v1
    Explore at:
    txt (available download formats)
    Dataset updated
    Apr 4, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    jithin cheriyan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the training dataset for the seq2seq model. It contains source comments that are offensive and target comments that are non-offensive.

  15. train-csv

    • kaggle.com
    Updated Jun 6, 2025
    Cite
    Aliaa Osama Esmail (2025). train-csv [Dataset]. https://www.kaggle.com/datasets/aliaaosamaesmail/train-csv
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Jun 6, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aliaa Osama Esmail
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Aliaa Osama Esmail

    Released under Apache 2.0


  16. expression data.csv

    • figshare.com
    txt
    Updated Jan 30, 2022
    Cite
    Jihan Wang (2022). expression data.csv [Dataset]. http://doi.org/10.6084/m9.figshare.19093307.v1
    Explore at:
    txt (available download formats)
    Dataset updated
    Jan 30, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Jihan Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this research, we proposed the SNR-PPFS feature selection algorithms to identify key gene signatures for distinguishing COAD tumor samples from normal colon tissues. Using machine learning-based feature selection approaches to select key gene signatures from high-dimensional datasets can be an effective way to study cancer genomic characteristics.

  17. ‘Titanic-Dataset (train.csv)’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Titanic-Dataset (train.csv)’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-titanic-dataset-train-csv-1d8d/f4271729/?iid=006-918&v=presentation
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Titanic-Dataset (train.csv)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/hesh97/titanicdataset-traincsv on 28 January 2022.

    --- No further description of dataset provided by original source ---

    --- Original source retains full ownership of the source dataset ---

  18. ScanGrow Manuscript files

    • figshare.com
    txt
    Updated Jun 3, 2023
    Cite
    Laura Espina; Ross Worth (2023). ScanGrow Manuscript files [Dataset]. http://doi.org/10.6084/m9.figshare.16822924.v1
    Explore at:
    txt (available download formats)
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Laura Espina; Ross Worth
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Datasets related to the manuscript describing the ScanGrow [Proof of Concept] application:

    Worth RM and Espina L (2022) ScanGrow: Deep Learning-Based Live Tracking of Bacterial Growth in Broth. Front. Microbiol. 13:900596.

    doi: 10.3389/fmicb.2022.900596

    The contents of the three compressed folders are described below.

    1. TRAINING_MODEL.ZIP

    Collection of images and spreadsheets used in the training of the image classification model that ScanGrow [PoC] uses by default. This training dataset should be subjected to the pre-processing workflow provided with ScanGrow to obtain the grouped images to be fed to the model training utility.

    2. TEST_MODEL.ZIP

    Collection of images and spreadsheets comprising the test dataset used in the evaluation of the image classification model. This includes:

    • New scans and spreadsheets (represented in Figure 3 as gray triangles).
    • Evaluation.csv: combined results of the output files from the "Test Model" command when run with:
      * Dataset Test: these scans and spreadsheets (not used for training);
      * Dataset Training: the dataset used for training the model; or
      * Dataset Validation: the training dataset after flipping the images horizontally, offsetting them, and adjusting the spectrophotometric values according to the newly inverted well positions.

    3. SAMPLE_RUN.ZIP

    Data from a sample run used to test ScanGrow on a microplate containing different concentrations of several antibiotics. This includes:

    • Scans used in the "Sample run", with antibiotics added to the bacterial cultures.
    • Sample_run_raw.csv: Data exported from the Table view after the run.
    • Sample_run_processed.csv: Data from the Sample_run_raw.csv file after the introduction of metadata (e.g., contents of each well) and calculation of the AUC (area under the curve).
    • Sample_run_json.json: JSON file showing the results of this run. It can be loaded into a ScanGrow session by clicking on "Show Graphs" -> "Open".
    • ImageMask.csv: alternative ImageMask to substitute the original one in "C:\Program Files\Riverwell Consultancy Services Ltd\Scan Grow\Configuration". In this alternative ImageMask file, well C11 was modified to overcome an artefact in the scan.
  19. E-learning Recommender System Dataset

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 8, 2023
    Cite
    Hafsa, Mounir (2023). E-learning Recommender System Dataset [Dataset]. http://doi.org/10.7910/DVN/BMY3UD
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Hafsa, Mounir
    Description

    Mandarine Academy Recommender System (MARS) Dataset is captured from a real-world open MOOC (https://mooc.office365-training.com/). The dataset offers both explicit and implicit ratings, for both the French and English versions of the MOOC. Compared with classical recommendation datasets like MovieLens, this is a rather small dataset due to the nature of the available content (educational). However, the dataset offers insights into real-world ratings and provides testing grounds away from common datasets. All items are available online for viewing in both French and English versions. All selected users had rated at least 1 item. No demographic information is included. Each user is represented by an id and a job (if available).

    For both French and English, the same kinds of files are available in .csv format:
    • Users: contains information about user ids and their jobs.
    • Items: contains information about items (resources) in the selected language; contains a mix of feature types.
    • Ratings: both explicit (watch time) and implicit (page views of items).

    Formatting and Encoding: The dataset files are written as comma-separated values files with a single header row. Columns that contain commas (,) are escaped using double quotes ("). The files are encoded as UTF-8.

    User Ids: User ids are consistent between explicit_ratings.csv, implicit_ratings.csv, and users.csv (i.e., the same id refers to the same user across the dataset).

    Item Ids: Item ids are consistent between explicit_ratings.csv, implicit_ratings.csv, and items.csv (i.e., the same id refers to the same item across the dataset).

    Ratings Data File Structure: All ratings are contained in the files explicit_ratings.csv and implicit_ratings.csv. Each line of these files after the header row represents one rating of one item by one user, and has the following format:
    item_id,user_id,created_at (implicit_ratings.csv)
    user_id,item_id,watch_percentage,created_at,rating (explicit_ratings.csv)

    Item Data File Structure: Item information is contained in the file items.csv. Each line of this file after the header row represents one item, and has the following format:
    item_id,language,name,nb_views,description,created_at,Difficulty,Job,Software,Theme,duration,type
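    A minimal sketch of loading and joining these files (the flat file paths are assumptions about how the language versions are stored):

        import pandas as pd

        # Column layouts follow the "Ratings Data File Structure" above.
        explicit = pd.read_csv("explicit_ratings.csv")   # user_id,item_id,watch_percentage,created_at,rating
        items = pd.read_csv("items.csv")

        # Join explicit ratings with item metadata for a quick look.
        rated = explicit.merge(items, on="item_id")
        print(rated[["user_id", "name", "rating"]].head())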

  20. CIFAR-10 Python in CSV

    • kaggle.com
    Updated Jun 22, 2021
    Cite
    fedesoriano (2021). CIFAR-10 Python in CSV [Dataset]. https://www.kaggle.com/fedesoriano/cifar10-python-in-csv
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Jun 22, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    fedesoriano
    Description

    Context

    The CIFAR-10 dataset consists of 60,000 32x32 colour images in 10 classes, with 6,000 images per class. The classes are completely mutually exclusive. There are 50,000 training images and 10,000 test images.

    The batches.meta file contains the label names of each class.

    The dataset was originally divided into 5 training batches with 10,000 images per batch. The original dataset can be found here: https://www.cs.toronto.edu/~kriz/cifar.html. This dataset contains all the training data and test data in the same CSV file, so it is easier to load.
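    A minimal sketch of loading the CSV version (the file name and the column layout, a "label" column plus 3072 pixel columns in channel-first order, are assumptions; inspect the header first):

        import numpy as np
        import pandas as pd

        df = pd.read_csv("train.csv")
        print(df.shape, df.columns[:5].tolist())   # confirm the layout before use

        # Reshape one row back into a 32x32x3 image, assuming CIFAR's native
        # channel-first pixel order.
        pixels = df.drop(columns=["label"]).iloc[0].to_numpy(dtype=np.uint8)
        image = pixels.reshape(3, 32, 32).transpose(1, 2, 0)
        print("label:", df["label"].iloc[0], "image shape:", image.shape)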

    Content

    Here is the list of the 10 classes in the CIFAR-10:

    Classes:
    • 0: airplane
    • 1: automobile
    • 2: bird
    • 3: cat
    • 4: deer
    • 5: dog
    • 6: frog
    • 7: horse
    • 8: ship
    • 9: truck

    Acknowledgements

    • Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, 2009.

    How to load the batches.meta file (Python)

    The function used to open the file:

        def unpickle(file):
            import pickle
            with open(file, 'rb') as fo:
                data = pickle.load(fo, encoding='bytes')
            return data

    Example of how to read the file:

        metadata_path = './cifar-10-python/batches.meta'  # change this path
        metadata = unpickle(metadata_path)
