100+ datasets found

CSV Export
kaggle.com
zip
Updated Aug 25, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dhruv loves school Shah (2024). CSV Export [Dataset]. https://www.kaggle.com/datasets/dhruvlovesschoolshah/csv-export
Explore at:
zip(766 bytes)Available download formats
Dataset updated
Aug 25, 2024
Authors
Dhruv loves school Shah
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Dataset

This dataset was created by Dhruv loves school Shah

Released under CC0: Public Domain

Contents

Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source...

zenodo.org

application/gzip, bin +2

Updated Aug 2, 2024

+ more versions

Facebook

Twitter

Click to copy link

Link copied

Cite

Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem [Dataset]. http://doi.org/10.5281/zenodo.1419788

Explore at:

bin, application/gzip, zip, text/x-pythonAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.1419788

Dataset updated

Aug 2, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb

License

https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.htmlhttps://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

Description

Replication pack, FSE2018 submission #164:
------------------------------------------

**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
A Case Study of the PyPI Ecosystem

**Note:** link to data artifacts is already included in the paper. 
Link to the code will be included in the Camera Ready version as well.


Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
 described below
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
 This dataset only includes stats aggregated by the ecosystem (PyPI)
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
 statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages
 themselves, which take around 2TB.
- **build_model.r, helpers.r** - R files to process the survival data 
  (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
  `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
  **dataset_full_Jan_2018.tgz**)
- **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
- LICENSE - text of GPL v3, under which this dataset is published
- INSTALL.md - replication guide (~2 pages)

Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)

Depending on detalization level (see Step 2 for more details):
- up to 2Tb of disk space (see Step 2 detalization levels)
- at least 16Gb of RAM (64 preferable)
- few hours to few month of processing time

Step 1 - software
----------------

- unpack **ghd-0.1.0.zip**, or clone from gitlab:

   git clone https://gitlab.com/user2589/ghd.git
   git checkout 0.1.0
 
 `cd` into the extracted folder. 
 All commands below assume it as a current directory.
  
- copy `settings.py` into the extracted folder. Edit the file:
  * set `DATASET_PATH` to some newly created folder path
  * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
- install docker. For Ubuntu Linux, the command is 
  `sudo apt-get install docker-compose`
- install libarchive and headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
 Without this dependency, you might get an error on the next step, 
 but it's safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt` . 
- disable all APIs except GitHub (Bitbucket and Gitlab support were
 not yet implemented when this study was in progress): edit
 `scraper/init.py`, comment out everything except GitHub support
 in `PROVIDERS`.

Step 2 - obtaining the dataset
-----------------------------

The ultimate goal of this step is to get output of the Python function 
`common.utils.survival_data()` and save it into a CSV file:

  # copy and paste into a Python console
  from common import utils
  survival_data = utils.survival_data('pypi', '2008', smoothing=6)
  survival_data.to_csv('survival_data.csv')

Since full replication will take several months, here are some ways to speedup
the process:

####Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv`, go to the next step

####Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table.
The whole process will take 15..30 minutes.

- create a folder `

Wikipedia Biographies Text Generation Dataset
kaggle.com
zip
Updated Dec 3, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The Devastator (2023). Wikipedia Biographies Text Generation Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/wikipedia-biographies-text-generation-dataset/code
Explore at:
zip(269983242 bytes)Available download formats
Dataset updated
Dec 3, 2023
Authors
The Devastator
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Wikipedia Biographies Text Generation Dataset

Wikipedia Biographies: Infobox and First Paragraphs Texts

By wiki_bio (From Huggingface) [source]

About this dataset

The dataset contains several key columns: input_text and target_text. The input_text column includes the infobox and first paragraph of a Wikipedia biography, providing essential information about the individual's background, accomplishments, and notable features. The target_text column consists of the complete biography text extracted from the corresponding Wikipedia page.

In order to facilitate model training and validation, the dataset is divided into three main files: train.csv, val.csv, and test.csv. The train.csv file contains pairs of input text and target text for model training. It serves as a fundamental resource to develop accurate language generation models by providing abundant examples for learning to generate coherent biographical texts.

The val.csv file provides further validation data consisting of additional Wikipedia biographies with their corresponding infoboxes and first paragraphs. This subset allows researchers to evaluate their trained models' performance on unseen examples during development or fine-tuning stages.

Finally, the test.csv file offers a separate set of input texts paired with corresponding target texts for generating complete biographies using pre-trained models or newly developed algorithms. The purpose of this file is to benchmark system performance on unseen data in order to assess generalization capabilities.

This extended description aims to provide an informative overview of the dataset structure, its intended use cases in natural language processing research tasks such as text generation or summarization. Researchers can leverage this comprehensive collection to advance various applications in automatic biography writing systems or content generation tasks that require coherent textual output based on provided partial information extracted from an infobox or initial paragraph sources from online encyclopedias like Wikipedia

How to use the dataset

Overview:

This dataset consists of biographical information from Wikipedia pages, specifically the infobox and the first paragraph of each biography.

The dataset is provided in three separate files: train.csv, val.csv, and test.csv.

Each file contains pairs of input text and target text.

File Descriptions:

train.csv: This file is used for training purposes. It includes pairs of input text (infobox and first paragraph) and target text (complete biography).

val.csv: Validation purposes can be fulfilled using this file. It contains a collection of biographies with infobox and first paragraph texts.

test.csv: This file can be used to generate complete biographies based on the given input texts.

Column Information:

a) For train.csv:

input_text: Input text column containing the infobox and first paragraph of a Wikipedia biography.

target_text: Target text column containing the complete biography text for each entry.

b) For val.csv: - input_text: Infobox and first paragraph texts are included in this column. - target_text: Complete biography texts are present in this column.

c) For test.csv: The columns follow the pattern mentioned previously, i.e.,input_text followed by target_text.

Usage Guidelines:

Training Model or Algorithm Development: If you are working on training a model or developing an algorithm for generating complete biographies from given inputs, it is recommended to use train.csv as your primary dataset.

Model Validation or Evaluation: To validate or evaluate your trained model, you can use val.csv as an independent dataset. This dataset contains biographies that have been withheld from the training data.

Generating Biographies with Trained Models: To generate complete biographies using your trained model, you can make use of test.csv. This dataset provides input texts for which you need to generate the corresponding target texts.

Additional Information and Tips:

The input text in this dataset includes both an infobox (a structured section containing key-value pairs) and the first paragraph of a Wikipedia biography.

The target text is the complete biography for each entry.

While working with this dataset, make sure to preprocess and

Research Ideas

Text Generation: The dataset can be used to train language models to generate complete Wikipedia biographies given only the infobox and first paragraph ...
h
hsk-dataset
huggingface.co
Updated Sep 13, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
William Liaw (2025). hsk-dataset [Dataset]. https://huggingface.co/datasets/willfliaw/hsk-dataset
Explore at:
Dataset updated
Sep 13, 2025
Authors
William Liaw
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
HSK Dataset (CSV)

A curated CSV export of HSK vocabulary (here: selected levels) produced with the Chinese2PDF tooling.

GitHub repository: https://github.com/willfliaw/Chinese2PDF Builder script: https://github.com/willfliaw/Chinese2PDF/blob/main/scripts/build_hsk_dataset.py

This dataset is generated by running the builder (example): python ./scripts/build_hsk_dataset.py --out ./scripts/hsk_words.csv

Files

data/hsk_words.csv

Schema

column… See the full description on the dataset page: https://huggingface.co/datasets/willfliaw/hsk-dataset.
CORD-19 Dataset(Minified)
kaggle.com
zip
Updated Jan 15, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
PRATIYUSH MISHRA (2021). CORD-19 Dataset(Minified) [Dataset]. https://www.kaggle.com/pratiyushmishra/cord19-dataset
Explore at:
zip(79580283 bytes)Available download formats
Dataset updated
Jan 15, 2021
Authors
PRATIYUSH MISHRA
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

This is the dataset I created while working with the original CORD-19 Dataset.I preprocessed the dataset to generate.the final CSV file but I felt that even that was too large to effectively execute operations on it especially when working with different models on the CSV file and this is why I created a subset of the data which can be opened and browsed easily as well as apply different models quickly and effectively to establish a good baseline model.

Content

The original dataset that is available for the CORD-19 needs to be compiled and preprocessed before we can use it to apply any models.Moreover the final compiled CSV is too large to test different models quickly to establish a good baseline model.This is where the small subset of the original dataset comes to play since it is small it can be used to quickly test out different models and understand the data in a much more better and effective way.

Acknowledgements

This dataset is totally based on the original dataset of CORD-19 Open Research Dataset and I would like to thank all the original creators and collaborators of the dataset without which this CSV would not be possible. Also I took help from the following notebook while preprocessing the original dataset. https://www.kaggle.com/danielwolffram/cord-19-create-dataframe

Inspiration

This dataset is mainly used for easy testing of different models on the dataset along with easy feature engineering which can after thorough testing can be easily applied and tested on the larger dataset.
h
AI_World_Generator_csv
huggingface.co
Updated Aug 3, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jensin (2025). AI_World_Generator_csv [Dataset]. https://huggingface.co/datasets/Jensin/AI_World_Generator_csv
Explore at:
Dataset updated
Aug 3, 2025
Authors
Jensin
Area covered
World
Description
Jensin/AI_World_Generator_csv dataset hosted on Hugging Face and contributed by the HF Datasets community
Export Excel fieldbook to csv-file
figshare.com
mp4
Updated Jul 6, 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Wouter Marra (2016). Export Excel fieldbook to csv-file [Dataset]. http://doi.org/10.6084/m9.figshare.3472199.v1
Explore at:
mp4Available download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.3472199.v1
Dataset updated
Jul 6, 2016
Dataset provided by
Figsharehttp://figshare.com/
figshare
Authors
Wouter Marra
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Screencast on how to export field observations with gps coordinates in Excel to a .csv file.
The Canada Trademarks Dataset
zenodo.org
pdf, zip
Updated Jul 19, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jeremy Sheff; Jeremy Sheff (2024). The Canada Trademarks Dataset [Dataset]. http://doi.org/10.5281/zenodo.4999655
Explore at:
zip, pdfAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.4999655
Dataset updated
Jul 19, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Jeremy Sheff; Jeremy Sheff
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Canada Trademarks Dataset

18 Journal of Empirical Legal Studies 908 (2021), prepublication draft available at https://papers.ssrn.com/abstract=3782655, published version available at https://onlinelibrary.wiley.com/share/author/CHG3HC6GTFMMRU8UJFRR?target=10.1111/jels.12303

Dataset Selection and Arrangement (c) 2021 Jeremy Sheff

Python and Stata Scripts (c) 2021 Jeremy Sheff

Contains data licensed by Her Majesty the Queen in right of Canada, as represented by the Minister of Industry, the minister responsible for the administration of the Canadian Intellectual Property Office.

This individual-application-level dataset includes records of all applications for registered trademarks in Canada since approximately 1980, and of many preserved applications and registrations dating back to the beginning of Canada’s trademark registry in 1865, totaling over 1.6 million application records. It includes comprehensive bibliographic and lifecycle data; trademark characteristics; goods and services claims; identification of applicants, attorneys, and other interested parties (including address data); detailed prosecution history event data; and data on application, registration, and use claims in countries other than Canada. The dataset has been constructed from public records made available by the Canadian Intellectual Property Office. Both the dataset and the code used to build and analyze it are presented for public use on open-access terms.

Scripts are licensed for reuse subject to the Creative Commons Attribution License 4.0 (CC-BY-4.0), https://creativecommons.org/licenses/by/4.0/. Data files are licensed for reuse subject to the Creative Commons Attribution License 4.0 (CC-BY-4.0), https://creativecommons.org/licenses/by/4.0/, and also subject to additional conditions imposed by the Canadian Intellectual Property Office (CIPO) as described below.

Terms of Use:

As per the terms of use of CIPO's government data, all users are required to include the above-quoted attribution to CIPO in any reproductions of this dataset. They are further required to cease using any record within the datasets that has been modified by CIPO and for which CIPO has issued a notice on its website in accordance with its Terms and Conditions, and to use the datasets in compliance with applicable laws. These requirements are in addition to the terms of the CC-BY-4.0 license, which require attribution to the author (among other terms). For further information on CIPO’s terms and conditions, see https://www.ic.gc.ca/eic/site/cipointernet-internetopic.nsf/eng/wr01935.html. For further information on the CC-BY-4.0 license, see https://creativecommons.org/licenses/by/4.0/.

The following attribution statement, if included by users of this dataset, is satisfactory to the author, but the author makes no representations as to whether it may be satisfactory to CIPO:

The Canada Trademarks Dataset is (c) 2021 by Jeremy Sheff and licensed under a CC-BY-4.0 license, subject to additional terms imposed by the Canadian Intellectual Property Office. It contains data licensed by Her Majesty the Queen in right of Canada, as represented by the Minister of Industry, the minister responsible for the administration of the Canadian Intellectual Property Office. For further information, see https://creativecommons.org/licenses/by/4.0/ and https://www.ic.gc.ca/eic/site/cipointernet-internetopic.nsf/eng/wr01935.html.

Details of Repository Contents:

This repository includes a number of .zip archives which expand into folders containing either scripts for construction and analysis of the dataset or data files comprising the dataset itself. These folders are as follows:

/csv: contains the .csv versions of the data files

/do: contains Stata do-files used to convert the .csv files to .dta format and perform the statistical analyses set forth in the paper reporting this dataset

/dta: contains the .dta versions of the data files

/py: contains the python scripts used to download CIPO’s historical trademarks data via SFTP and generate the .csv data files

If users wish to construct rather than download the datafiles, the first script that they should run is /py/sftp_secure.py. This script will prompt the user to enter their IP Horizons SFTP credentials; these can be obtained by registering with CIPO at https://ised-isde.survey-sondage.ca/f/s.aspx?s=59f3b3a4-2fb5-49a4-b064-645a5e3a752d&lang=EN&ds=SFTP. The script will also prompt the user to identify a target directory for the data downloads. Because the data archives are quite large, users are advised to create a target directory in advance and ensure they have at least 70GB of available storage on the media in which the directory is located.

The sftp_secure.py script will generate a new subfolder in the user’s target directory called /XML_raw. Users should note the full path of this directory, which they will be prompted to provide when running the remaining python scripts. Each of the remaining scripts, the filenames of which begin with “iterparse”, corresponds to one of the data files in the dataset, as indicated in the script’s filename. After running one of these scripts, the user’s target directory should include a /csv subdirectory containing the data file corresponding to the script; after running all the iterparse scripts the user’s /csv directory should be identical to the /csv directory in this repository. Users are invited to modify these scripts as they see fit, subject to the terms of the licenses set forth above.

With respect to the Stata do-files, only one of them is relevant to construction of the dataset itself. This is /do/CA_TM_csv_cleanup.do, which converts the .csv versions of the data files to .dta format, and uses Stata’s labeling functionality to reduce the size of the resulting files while preserving information. The other do-files generate the analyses and graphics presented in the paper describing the dataset (Jeremy N. Sheff, The Canada Trademarks Dataset, 18 J. Empirical Leg. Studies (forthcoming 2021)), available at https://papers.ssrn.com/abstract=3782655). These do-files are also licensed for reuse subject to the terms of the CC-BY-4.0 license, and users are invited to adapt the scripts to their needs.

The python and Stata scripts included in this repository are separately maintained and updated on Github at https://github.com/jnsheff/CanadaTM.

This repository also includes a copy of the current version of CIPO's data dictionary for its historical XML trademarks archive as of the date of construction of this dataset.
Podcast PR Contacts - Self-Service CSV Batch Export
datarade.ai
.csv, .xls
Updated May 27, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Listen Notes (2025). Podcast PR Contacts - Self-Service CSV Batch Export [Dataset]. https://datarade.ai/data-products/podcast-pr-contacts-self-service-csv-batch-export-listen-notes
Explore at:
.csv, .xlsAvailable download formats
Dataset updated
May 27, 2025
Dataset authored and provided by
Listen Notes
Area covered
Bulgaria, Kuwait, Algeria, French Polynesia, Dominican Republic, Israel, Gibraltar, Costa Rica, Congo, Benin
Description
== Quick starts ==

Batch export podcast metadata to CSV files:

1) Export by search keyword: https://www.listennotes.com/podcast-datasets/keyword/

2) Export by category: https://www.listennotes.com/podcast-datasets/category/

== Quick facts ==

The most up-to-date and comprehensive podcast database available All languages & All countries Includes over 3,500,000 podcasts Features 35+ data fields , such as basic metadata, global rank, RSS feed (with audio URLs), Spotify links, and more Delivered in CSV format

== Data Attributes ==

See the full list of data attributes on this page: https://www.listennotes.com/podcast-datasets/fields/?filter=podcast_only

How to access podcast audio files: Our dataset includes RSS feed URLs for all podcasts. You can retrieve audio for over 170 million episodes directly from these feeds. With access to the raw audio, you’ll have high-quality podcast speech data ideal for AI training and related applications.

== Custom Offers ==

We can provide custom datasets based on your needs, such as language-specific data, daily/weekly/monthly update frequency, or one-time purchases.

We also provide a RESTful API at PodcastAPI.com

Contact us: hello@listennotes.com

== Need Help? ==

If you have any questions about our products, feel free to reach out hello@listennotes.com

== About Listen Notes, Inc. ==

Since 2017, Listen Notes, Inc. has provided the leading podcast search engine and podcast database.
Dataset of the paper: "How do Hugging Face Models Document Datasets, Bias,...
zenodo.org
data.niaid.nih.gov
+1more
zip
Updated Jan 16, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Federica Pepe; Vittoria Nardone; Vittoria Nardone; Antonio Mastropaolo; Antonio Mastropaolo; Gerardo Canfora; Gerardo Canfora; Gabriele BAVOTA; Gabriele BAVOTA; Massimiliano Di Penta; Massimiliano Di Penta; Federica Pepe (2024). Dataset of the paper: "How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study" [Dataset]. http://doi.org/10.5281/zenodo.10058142
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.10058142
Dataset updated
Jan 16, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Federica Pepe; Vittoria Nardone; Vittoria Nardone; Antonio Mastropaolo; Antonio Mastropaolo; Gerardo Canfora; Gerardo Canfora; Gabriele BAVOTA; Gabriele BAVOTA; Massimiliano Di Penta; Massimiliano Di Penta; Federica Pepe
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This replication package contains datasets and scripts related to the paper: "*How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study*"

## Root directory
- `statistics.r`: R script used to compute the correlation between usage and downloads, and the RQ1/RQ2 inter-rater agreements
- `modelsInfo.zip`: zip file containing all the downloaded model cards (in JSON format)
- `script`: directory containing all the scripts used to collect and process data. For further details, see README file inside the script directory.

## Dataset
- `Dataset/Dataset_HF-models-list.csv`: list of HF models analyzed
- `Dataset/Dataset_github-prj-list.txt`: list of GitHub projects using the *transformers* library
- `Dataset/Dataset_github-Prj_model-Used.csv`: contains usage pairs: project, model
- `Dataset/Dataset_prj-num-models-reused.csv`: number of models used by each GitHub project
- `Dataset/Dataset_model-download_num-prj_correlation.csv` contains, for each model used by GitHub projects: the name, the task, the number of reusing projects, and the number of downloads

## RQ1
- `RQ1/RQ1_dataset-list.txt`: list of HF datasets
- `RQ1/RQ1_datasetSample.csv`: sample set of models used for the manual analysis of datasets
- `RQ1/RQ1_analyzeDatasetTags.py`: Python script to analyze model tags for the presence of datasets. it requires to unzip the `modelsInfo.zip` in a directory with the same name (`modelsInfo`) at the root of the replication package folder. Produces the output to stdout. To redirect in a file fo be analyzed by the `RQ2/countDataset.py` script
- `RQ1/RQ1_countDataset.py`: given the output of `RQ2/analyzeDatasetTags.py` (passed as argument) produces, for each model, a list of Booleans indicating whether (i) the model only declares HF datasets, (ii) the model only declares external datasets, (iii) the model declares both, and (iv) the model is part of the sample for the manual analysis
- `RQ1/RQ1_datasetTags.csv`: output of `RQ2/analyzeDatasetTags.py`
- `RQ1/RQ1_dataset_usage_count.csv`: output of `RQ2/countDataset.py`

## RQ2
- `RQ2/tableBias.pdf`: table detailing the number of occurrences of different types of bias by model Task
- `RQ2/RQ2_bias_classification_sheet.csv`: results of the manual labeling
- `RQ2/RQ2_isBiased.csv`: file to compute the inter-rater agreement of whether or not a model documents Bias
- `RQ2/RQ2_biasAgrLabels.csv`: file to compute the inter-rater agreement related to bias categories
- `RQ2/RQ2_final_bias_categories_with_levels.csv`: for each model in the sample, this file lists (i) the bias leaf category, (ii) the first-level category, and (iii) the intermediate category

## RQ3
- `RQ3/RQ3_LicenseValidation.csv`: manual validation of a sample of licenses
- `RQ3/RQ3_{NETWORK-RESTRICTIVE|RESTRICTIVE|WEAK-RESTRICTIVE|PERMISSIVE}-license-list.txt`: lists of licenses with different permissiveness
- `RQ3/RQ3_prjs_license.csv`: for each project linked to models, among other fields it indicates the license tag and name
- `RQ3/RQ3_models_license.csv`: for each model, indicates among other pieces of info, whether the model has a license, and if yes what kind of license
- `RQ3/RQ3_model-prj-license_contingency_table.csv`: usage contingency table between projects' licenses (columns) and models' licenses (rows)
- `RQ3/RQ3_models_prjs_licenses_with_type.csv`: pairs project-model, with their respective licenses and permissiveness level

## scripts
Contains the scripts used to mine Hugging Face and GitHub. Details are in the enclosed README
R
Csv To Txt Dataset
universe.roboflow.com
zip
Updated Mar 9, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
conver3 (2023). Csv To Txt Dataset [Dataset]. https://universe.roboflow.com/conver3/csv-to-txt-fs3su/model/1
Explore at:
zipAvailable download formats
Dataset updated
Mar 9, 2023
Dataset authored and provided by
conver3
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Tooth Bounding Boxes
Description
Csv To Txt

## Overview Csv To Txt is a dataset for object detection tasks - it contains Tooth annotations for 297 images. ## Getting Started You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model. ## License This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
h
human_ai_generated_text
huggingface.co
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
DV, human_ai_generated_text [Dataset]. http://doi.org/10.57967/hf/1617
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.57967/hf/1617
Authors
DV
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Human or AI-Generated Text

The data can be valuable for educators, policymakers, and researchers interested in the evolving education landscape, particularly in detecting or identifying texts written by Humans or Artificial Intelligence systems.

File Name

model_training_dataset.csv

File Structure

id: Unique identifier for each record. human_text: Human-written content. ai_text: AI-generated texts. instructions: Description of the task given to both Humans and… See the full description on the dataset page: https://huggingface.co/datasets/dmitva/human_ai_generated_text.
Z
PIPr: A Dataset of Public Infrastructure as Code Programs
data.niaid.nih.gov
zenodo.org
Updated Nov 28, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sokolowski, Daniel; Spielmann, David; Salvaneschi, Guido (2023). PIPr: A Dataset of Public Infrastructure as Code Programs [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8262770
Explore at:
Dataset updated
Nov 28, 2023
Dataset provided by
University of St. Gallen
Authors
Sokolowski, Daniel; Spielmann, David; Salvaneschi, Guido
License
Open Data Commons Attribution License (ODC-By) v1.0https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
Description
Programming Languages Infrastructure as Code (PL-IaC) enables IaC programs written in general-purpose programming languages like Python and TypeScript. The currently available PL-IaC solutions are Pulumi and the Cloud Development Kits (CDKs) of Amazon Web Services (AWS) and Terraform. This dataset provides metadata and initial analyses of all public GitHub repositories in August 2022 with an IaC program, including their programming languages, applied testing techniques, and licenses. Further, we provide a shallow copy of the head state of those 7104 repositories whose licenses permit redistribution. The dataset is available under the Open Data Commons Attribution License (ODC-By) v1.0. Contents:

metadata.zip: The dataset metadata and analysis results as CSV files. scripts-and-logs.zip: Scripts and logs of the dataset creation. LICENSE: The Open Data Commons Attribution License (ODC-By) v1.0 text. README.md: This document. redistributable-repositiories.zip: Shallow copies of the head state of all redistributable repositories with an IaC program. This artifact is part of the ProTI Infrastructure as Code testing project: https://proti-iac.github.io. Metadata The dataset's metadata comprises three tabular CSV files containing metadata about all analyzed repositories, IaC programs, and testing source code files. repositories.csv:

ID (integer): GitHub repository ID url (string): GitHub repository URL downloaded (boolean): Whether cloning the repository succeeded name (string): Repository name description (string): Repository description licenses (string, list of strings): Repository licenses redistributable (boolean): Whether the repository's licenses permit redistribution created (string, date & time): Time of the repository's creation updated (string, date & time): Time of the last update to the repository pushed (string, date & time): Time of the last push to the repository fork (boolean): Whether the repository is a fork forks (integer): Number of forks archive (boolean): Whether the repository is archived programs (string, list of strings): Project file path of each IaC program in the repository programs.csv:

ID (string): Project file path of the IaC program repository (integer): GitHub repository ID of the repository containing the IaC program directory (string): Path of the directory containing the IaC program's project file solution (string, enum): PL-IaC solution of the IaC program ("AWS CDK", "CDKTF", "Pulumi") language (string, enum): Programming language of the IaC program (enum values: "csharp", "go", "haskell", "java", "javascript", "python", "typescript", "yaml") name (string): IaC program name description (string): IaC program description runtime (string): Runtime string of the IaC program testing (string, list of enum): Testing techniques of the IaC program (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking") tests (string, list of strings): File paths of IaC program's tests testing-files.csv:

file (string): Testing file path language (string, enum): Programming language of the testing file (enum values: "csharp", "go", "java", "javascript", "python", "typescript") techniques (string, list of enum): Testing techniques used in the testing file (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking") keywords (string, list of enum): Keywords found in the testing file (enum values: "/go/auto", "/testing/integration", "@AfterAll", "@BeforeAll", "@Test", "@aws-cdk", "@aws-cdk/assert", "@pulumi.runtime.test", "@pulumi/", "@pulumi/policy", "@pulumi/pulumi/automation", "Amazon.CDK", "Amazon.CDK.Assertions", "Assertions_", "HashiCorp.Cdktf", "IMocks", "Moq", "NUnit", "PolicyPack(", "ProgramTest", "Pulumi", "Pulumi.Automation", "PulumiTest", "ResourceValidationArgs", "ResourceValidationPolicy", "SnapshotTest()", "StackValidationPolicy", "Testing", "Testing_ToBeValidTerraform(", "ToBeValidTerraform(", "Verifier.Verify(", "WithMocks(", "[Fact]", "[TestClass]", "[TestFixture]", "[TestMethod]", "[Test]", "afterAll(", "assertions", "automation", "aws-cdk-lib", "aws-cdk-lib/assert", "aws_cdk", "aws_cdk.assertions", "awscdk", "beforeAll(", "cdktf", "com.pulumi", "def test_", "describe(", "github.com/aws/aws-cdk-go/awscdk", "github.com/hashicorp/terraform-cdk-go/cdktf", "github.com/pulumi/pulumi", "integration", "junit", "pulumi", "pulumi.runtime.setMocks(", "pulumi.runtime.set_mocks(", "pulumi_policy", "pytest", "setMocks(", "set_mocks(", "snapshot", "software.amazon.awscdk.assertions", "stretchr", "test(", "testing", "toBeValidTerraform(", "toMatchInlineSnapshot(", "toMatchSnapshot(", "to_be_valid_terraform(", "unittest", "withMocks(") program (string): Project file path of the testing file's IaC program Dataset Creation scripts-and-logs.zip contains all scripts and logs of the creation of this dataset. In it, executions/executions.log documents the commands that generated this dataset in detail. On a high level, the dataset was created as follows:

A list of all repositories with a PL-IaC program configuration file was created using search-repositories.py (documented below). The execution took two weeks due to the non-deterministic nature of GitHub's REST API, causing excessive retries. A shallow copy of the head of all repositories was downloaded using download-repositories.py (documented below). Using analysis.ipynb, the repositories were analyzed for the programs' metadata, including the used programming languages and licenses. Based on the analysis, all repositories with at least one IaC program and a redistributable license were packaged into redistributable-repositiories.zip, excluding any node_modules and .git directories. Searching Repositories The repositories are searched through search-repositories.py and saved in a CSV file. The script takes these arguments in the following order:

Github access token. Name of the CSV output file. Filename to search for. File extensions to search for, separated by commas. Min file size for the search (for all files: 0). Max file size for the search or * for unlimited (for all files: *). Pulumi projects have a Pulumi.yaml or Pulumi.yml (case-sensitive file name) file in their root folder, i.e., (3) is Pulumi and (4) is yml,yaml. https://www.pulumi.com/docs/intro/concepts/project/ AWS CDK projects have a cdk.json (case-sensitive file name) file in their root folder, i.e., (3) is cdk and (4) is json. https://docs.aws.amazon.com/cdk/v2/guide/cli.html CDK for Terraform (CDKTF) projects have a cdktf.json (case-sensitive file name) file in their root folder, i.e., (3) is cdktf and (4) is json. https://www.terraform.io/cdktf/create-and-deploy/project-setup Limitations The script uses the GitHub code search API and inherits its limitations:

Only forks with more stars than the parent repository are included. Only the repositories' default branches are considered. Only files smaller than 384 KB are searchable. Only repositories with fewer than 500,000 files are considered. Only repositories that have had activity or have been returned in search results in the last year are considered. More details: https://docs.github.com/en/search-github/searching-on-github/searching-code The results of the GitHub code search API are not stable. However, the generally more robust GraphQL API does not support searching for files in repositories: https://stackoverflow.com/questions/45382069/search-for-code-in-github-using-graphql-v4-api Downloading Repositories download-repositories.py downloads all repositories in CSV files generated through search-respositories.py and generates an overview CSV file of the downloads. The script takes these arguments in the following order:

Name of the repositories CSV files generated through search-repositories.py, separated by commas. Output directory to download the repositories to. Name of the CSV output file. The script only downloads a shallow recursive copy of the HEAD of the repo, i.e., only the main branch's most recent state, including submodules, without the rest of the git history. Each repository is downloaded to a subfolder named by the repository's ID.
m
KU-MG2: A Dataset for Hybrid Photovoltaic-Natural Gas Generator Microgrid...
data.mendeley.com
search.datacite.org
Updated Jul 28, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Abdullah-Al Nahid (2020). KU-MG2: A Dataset for Hybrid Photovoltaic-Natural Gas Generator Microgrid Model of a Residential Area. (For Padma residential area, Rajshahi, Bangladesh) [Dataset]. http://doi.org/10.17632/js5mtkf5yk.1
Explore at:
Unique identifier
https://doi.org/10.17632/js5mtkf5yk.1
Dataset updated
Jul 28, 2020
Authors
Abdullah-Al Nahid
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Padma Residential Area, Bangladesh, Rajshahi
Description
a renewable energy resource-based sustainable microgrid model for a residential area is designed by HOMER PRO microgrid software. A small-sized residential area of 20 buildings of about 60 families with 219 MWh and an electric vehicle charging station of daily 10 batteries with 18.3MWh annual energy consumption considered for Padma residential area, Rajshahi (24°22.6'N, 88°37.2'E) is selected as our case study. Solar panels, natural gas generator, inverter and Li-ion batteries are required for our proposed model. The HOMER PRO microgrid software is used to optimize our designed microgrid model. Data were collected from HOMER PRO for the year 2007. We have compared our daily load demand 650KW with the results varying the load by 10%, 5%, 2.5% more and less to find out the best case according to our demand. We have a total of 7 different datasets for different load conditions where each dataset contains a total of 8760 sets of data having 6 different parameters for each set. Data file contents: Data 1:: original_load.csv: This file contains data for 650KW load demand. The dataset contains a total of 8760 sets of data having 6 different parameters for each set. Data arrangement is given below: Column 1: Date and time of data recording in the format of MM-DD- YYYY [hh]:[mm]. Time is in 24-hour format. Column 2: Solar power output in KW unit. Column 3: Generator power output in KW unit. Column 4: Total Electrical load served in KW unit. Column 5: Excess electrical production in KW unit. Column 6: Li-ion battery energy content in KWh unit. Column 7: Li-ion battery state of charge in % unit.

Data 2:: 2.5%_more_load.csv: This file contains data for 677KW load demand. The dataset contains a total of 8760 sets of data having 6 different parameters for each set. Column information is the same for every dataset.

Data 3:: 2.5%_less_load.csv: This file contains data for 622KW load demand. The dataset contains a total of 8760 sets of data having 6 different parameters for each set. Column information is the same for every dataset.

Data 4:: 5%_more_load.csv: This file contains data for 705KW load demand. The dataset contains a total of 8760 sets of data having 6 different parameters for each set. Column information is the same for every dataset. Data 5:: 5%_less_load.csv: This file contains data for 595KW load demand. The dataset contains a total of 8760 sets of data having 6 different parameters for each set. Column information is the same for every dataset. Data 6:: 10%_more_load.csv: This file contains data for the 760KW load demand. The dataset contains a total of 8760 sets of data having 6 different parameters for each set. Column information is the same for every dataset. Data 7:: 10%_less_load.csv: This file contains data for 540KW load demand. The dataset contains a total of 8760 sets of data having 6 different parameters for each set. Column information is the same for every dataset.
ChatGPT Classification Dataset
kaggle.com
zip
Updated Sep 7, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mahdi (2023). ChatGPT Classification Dataset [Dataset]. https://www.kaggle.com/datasets/mahdimaktabdar/chatgpt-classification-dataset
Explore at:
zip(718710 bytes)Available download formats
Dataset updated
Sep 7, 2023
Authors
Mahdi
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
We have compiled a dataset that consists of textual articles including common terminology, concepts and definitions in the field of computer science, artificial intelligence, and cyber security. This dataset consists of both human-generated text and OpenAI’s ChatGPT-generated text. Human-generated answers were collected from different computer science dictionaries and encyclopedias including “The Encyclopedia of Computer Science and Technology” and "Encyclopedia of Human-Computer Interaction". AI-generated content in our dataset was produced by simply posting questions to OpenAI’s ChatGPT and manually documenting the resulting responses. A rigorous data-cleaning process has been performed to remove unwanted Unicode characters, styling and formatting tags. To structure our dataset for binary classification, we combined both AI-generated and Human-generated answers into a single column and assigned appropriate labels to each data point (Human-generated = 0 and AI-generated = 1).

This creates our article-level dataset (article_level_data.csv) which consists of a total of 1018 articles, 509 AI-generated and 509 Human-generated. Additionally, we have divided each article into its sentences and labelled them accordingly. This is mainly to evaluate the performance of classification models and pipelines when it comes to shorter sentence-level data points. This constructs our sentence-level dataset (sentence_level_data.csv) which consists of a total of 7344 entries (4008 AI-generated and 3336 Human-generated).

We appreciate it, if you cite the following article if you happen to use this dataset in any scientific publication:

Maktab Dar Oghaz, M., Dhame, K., Singaram, G., & Babu Saheer, L. (2023). Detection and Classification of ChatGPT Generated Contents Using Deep Transformer Models. Frontiers in Artificial Intelligence.

https://www.techrxiv.org/users/692552/articles/682641/master/file/data/ChatGPT_generated_Content_Detection/ChatGPT_generated_Content_Detection.pdf
Screw and hole detection output results dataset in an autonomous screwing...
data.europa.eu
unknown
Updated Jan 31, 2019
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zenodo (2019). Screw and hole detection output results dataset in an autonomous screwing operation [Dataset]. https://data.europa.eu/88u/dataset/oai-zenodo-org-2555108
Explore at:
unknown(83)Available download formats
Dataset updated
Jan 31, 2019
Dataset authored and provided by
Zenodohttp://zenodo.org/
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains a set of csv files containing the results of a screwing and hole detection developed for a screwing mobile robot. Thee different scenarios were built, and a total of 25 missions were executed in order to build the datasets presented in appendix 5.7. The first scenario (missions 1 to 15), represents a normal operation procedure, meaning, a screwing 6 hole screwing and 5 screws refilling. The second scenario (missions 16 to 20), represents an operation with an exception handling during the refill (the second, third and fourth screws are missing in the refill support). Lastly, the third scenario (missions 21 to 25) has another exception handling, this time on both hole and screw detections (the third, fourth and fifth holes are already screwed, and the screwdriver trigger is not pressed, so the screw is not tightened resulting in a screw detecting after the refilling phase. Also three different group of csv files can be found: hole_detection_mission_xx.csv1: contains the hole detection result (detected or non detected) and the located hole coordinates (u,v), in pixels; detect_screw_screwing_result_mission_xx.csv: contains the screw detection result performed after a screwing to detect if the screw was screwed successfully. 1 means the screwing was ok (there is no screw on the screwdriver), 0 means the screwing was not ok (there was a screw on the screwdriver); detect_screw_refill_mission_xx.csv: contains the screw detection result performed after a refill attempt, to check if there was any screw missing in the refill support. 0 means that no screw was missing on the refill support (there was a screw on the screwdriver), 1 means there was a screw missing on the support (there was no screw on the screwdriver). The mission_xx suffix represents which mission file (rosbag) the mission was acquired in and, each csv file contains the timestamp in which it happens. 1 xx represents the mission number
m
Dataset for twitter Sentiment Analysis using Roberta and Vader
data.mendeley.com
Updated May 14, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jannatul Ferdoshi Jannatul Ferdoshi (2023). Dataset for twitter Sentiment Analysis using Roberta and Vader [Dataset]. http://doi.org/10.17632/2sjt22sb55.1
Explore at:
Unique identifier
https://doi.org/10.17632/2sjt22sb55.1
Dataset updated
May 14, 2023
Authors
Jannatul Ferdoshi Jannatul Ferdoshi
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Our dataset comprises 1000 tweets, which were taken from Twitter using the Python programming language. The dataset was stored in a CSV file and generated using various modules. The random module was used to generate random IDs and text, while the faker module was used to generate random user names and dates. Additionally, the textblob module was used to assign a random sentiment to each tweet.

This systematic approach ensures that the dataset is well-balanced and represents different types of tweets, user behavior, and sentiment. It is essential to have a balanced dataset to ensure that the analysis and visualization of the dataset are accurate and reliable. By generating tweets with a range of sentiments, we have created a diverse dataset that can be used to analyze and visualize sentiment trends and patterns.

In addition to generating the tweets, we have also prepared a visual representation of the data sets. This visualization provides an overview of the key features of the dataset, such as the frequency distribution of the different sentiment categories, the distribution of tweets over time, and the user names associated with the tweets. This visualization will aid in the initial exploration of the dataset and enable us to identify any patterns or trends that may be present.
Z
TRAVEL: A Dataset with Toolchains for Test Generation and Regression Testing...
data.niaid.nih.gov
data-staging.niaid.nih.gov
Updated Jul 17, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Pouria Derakhshanfar; Annibale Panichella; Alessio Gambi; Vincenzo Riccio; Christian Birchler; Sebastiano Panichella (2024). TRAVEL: A Dataset with Toolchains for Test Generation and Regression Testing of Self-driving Cars Software [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5911160
Explore at:
Dataset updated
Jul 17, 2024
Dataset provided by
University of Passau
Delft University of Technology
Zurich University of Applied Sciences
Università della Svizzera Italiana
Authors
Pouria Derakhshanfar; Annibale Panichella; Alessio Gambi; Vincenzo Riccio; Christian Birchler; Sebastiano Panichella
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Introduction

This repository hosts the Testing Roads for Autonomous VEhicLes (TRAVEL) dataset. TRAVEL is an extensive collection of virtual roads that have been used for testing lane assist/keeping systems (i.e., driving agents) and data from their execution in state of the art, physically accurate driving simulator, called BeamNG.tech. Virtual roads consist of sequences of road points interpolated using Cubic splines.

Along with the data, this repository contains instructions on how to install the tooling necessary to generate new data (i.e., test cases) and analyze them in the context of test regression. We focus on test selection and test prioritization, given their importance for developing high-quality software following the DevOps paradigms.

This dataset builds on top of our previous work in this area, including work on

test generation (e.g., AsFault, DeepJanus, and DeepHyperion) and the SBST CPS tool competition (SBST2021),

test selection: SDC-Scissor and related tool

test prioritization: automated test cases prioritization work for SDCs.

Dataset Overview

The TRAVEL dataset is available under the data folder and is organized as a set of experiments folders. Each of these folders is generated by running the test-generator (see below) and contains the configuration used for generating the data (experiment_description.csv), various statistics on generated tests (generation_stats.csv) and found faults (oob_stats.csv). Additionally, the folders contain the raw test cases generated and executed during each experiment (test..json).

The following sections describe what each of those files contains.

Experiment Description

The experiment_description.csv contains the settings used to generate the data, including:

Time budget. The overall generation budget in hours. This budget includes both the time to generate and execute the tests as driving simulations.

The size of the map. The size of the squared map defines the boundaries inside which the virtual roads develop in meters.

The test subject. The driving agent that implements the lane-keeping system under test. The TRAVEL dataset contains data generated testing the BeamNG.AI and the end-to-end Dave2 systems.

The test generator. The algorithm that generated the test cases. The TRAVEL dataset contains data obtained using various algorithms, ranging from naive and advanced random generators to complex evolutionary algorithms, for generating tests.

The speed limit. The maximum speed at which the driving agent under test can travel.

Out of Bound (OOB) tolerance. The test cases' oracle that defines the tolerable amount of the ego-car that can lie outside the lane boundaries. This parameter ranges between 0.0 and 1.0. In the former case, a test failure triggers as soon as any part of the ego-vehicle goes out of the lane boundary; in the latter case, a test failure triggers only if the entire body of the ego-car falls outside the lane.

Experiment Statistics

The generation_stats.csv contains statistics about the test generation, including:

Total number of generated tests. The number of tests generated during an experiment. This number is broken down into the number of valid tests and invalid tests. Valid tests contain virtual roads that do not self-intersect and contain turns that are not too sharp.

Test outcome. The test outcome contains the number of passed tests, failed tests, and test in error. Passed and failed tests are defined by the OOB Tolerance and an additional (implicit) oracle that checks whether the ego-car is moving or standing. Tests that did not pass because of other errors (e.g., the simulator crashed) are reported in a separated category.

The TRAVEL dataset also contains statistics about the failed tests, including the overall number of failed tests (total oob) and its breakdown into OOB that happened while driving left or right. Further statistics about the diversity (i.e., sparseness) of the failures are also reported.

Test Cases and Executions

Each test..json contains information about a test case and, if the test case is valid, the data observed during its execution as driving simulation.

The data about the test case definition include:

The road points. The list of points in a 2D space that identifies the center of the virtual road, and their interpolation using cubic splines (interpolated_points)

The test ID. The unique identifier of the test in the experiment.

Validity flag and explanation. A flag that indicates whether the test is valid or not, and a brief message describing why the test is not considered valid (e.g., the road contains sharp turns or the road self intersects)

The test data are organized according to the following JSON Schema and can be interpreted as RoadTest objects provided by the tests_generation.py module.

{ "type": "object", "properties": { "id": { "type": "integer" }, "is_valid": { "type": "boolean" }, "validation_message": { "type": "string" }, "road_points": { §\label{line:road-points}§ "type": "array", "items": { "$ref": "schemas/pair" }, }, "interpolated_points": { §\label{line:interpolated-points}§ "type": "array", "items": { "$ref": "schemas/pair" }, }, "test_outcome": { "type": "string" }, §\label{line:test-outcome}§ "description": { "type": "string" }, "execution_data": { "type": "array", "items": { "$ref" : "schemas/simulationdata" } } }, "required": [ "id", "is_valid", "validation_message", "road_points", "interpolated_points" ] }

Finally, the execution data contain a list of timestamped state information recorded by the driving simulation. State information is collected at constant frequency and includes absolute position, rotation, and velocity of the ego-car, its speed in Km/h, and control inputs from the driving agent (steering, throttle, and braking). Additionally, execution data contain OOB-related data, such as the lateral distance between the car and the lane center and the OOB percentage (i.e., how much the car is outside the lane).

The simulation data adhere to the following (simplified) JSON Schema and can be interpreted as Python objects using the simulation_data.py module.

{ "$id": "schemas/simulationdata", "type": "object", "properties": { "timer" : { "type": "number" }, "pos" : { "type": "array", "items":{ "$ref" : "schemas/triple" } } "vel" : { "type": "array", "items":{ "$ref" : "schemas/triple" } } "vel_kmh" : { "type": "number" }, "steering" : { "type": "number" }, "brake" : { "type": "number" }, "throttle" : { "type": "number" }, "is_oob" : { "type": "number" }, "oob_percentage" : { "type": "number" } §\label{line:oob-percentage}§ }, "required": [ "timer", "pos", "vel", "vel_kmh", "steering", "brake", "throttle", "is_oob", "oob_percentage" ] }

Dataset Content

The TRAVEL dataset is a lively initiative so the content of the dataset is subject to change. Currently, the dataset contains the data collected during the SBST CPS tool competition, and data collected in the context of our recent work on test selection (SDC-Scissor work and tool) and test prioritization (automated test cases prioritization work for SDCs).

SBST CPS Tool Competition Data

The data collected during the SBST CPS tool competition are stored inside data/competition.tar.gz. The file contains the test cases generated by Deeper, Frenetic, AdaFrenetic, and Swat, the open-source test generators submitted to the competition and executed against BeamNG.AI with an aggression factor of 0.7 (i.e., conservative driver).

Name Map Size (m x m) Max Speed (Km/h) Budget (h) OOB Tolerance (%) Test Subject DEFAULT 200 × 200 120 5 (real time) 0.95 BeamNG.AI - 0.7 SBST 200 × 200 70 2 (real time) 0.5 BeamNG.AI - 0.7

Specifically, the TRAVEL dataset contains 8 repetitions for each of the above configurations for each test generator totaling 64 experiments.

SDC Scissor

With SDC-Scissor we collected data based on the Frenetic test generator. The data is stored inside data/sdc-scissor.tar.gz. The following table summarizes the used parameters.

Name Map Size (m x m) Max Speed (Km/h) Budget (h) OOB Tolerance (%) Test Subject SDC-SCISSOR 200 × 200 120 16 (real time) 0.5 BeamNG.AI - 1.5

The dataset contains 9 experiments with the above configuration. For generating your own data with SDC-Scissor follow the instructions in its repository.

Dataset Statistics

Here is an overview of the TRAVEL dataset: generated tests, executed tests, and faults found by all the test generators grouped by experiment configuration. Some 25,845 test cases are generated by running 4 test generators 8 times in 2 configurations using the SBST CPS Tool Competition code pipeline (SBST in the table). We ran the test generators for 5 hours, allowing the ego-car a generous speed limit (120 Km/h) and defining a high OOB tolerance (i.e., 0.95), and we also ran the test generators using a smaller generation budget (i.e., 2 hours) and speed limit (i.e., 70 Km/h) while setting the OOB tolerance to a lower value (i.e., 0.85). We also collected some 5, 971 additional tests with SDC-Scissor (SDC-Scissor in the table) by running it 9 times for 16 hours using Frenetic as a test generator and defining a more realistic OOB tolerance (i.e., 0.50).

Generating new Data

Generating new data, i.e., test cases, can be done using the SBST CPS Tool Competition pipeline and the driving simulator BeamNG.tech.

Extensive instructions on how to install both software are reported inside the SBST CPS Tool Competition pipeline Documentation;
LLM Prompt Recovery - Synthetic Datastore
kaggle.com
zip
Updated Feb 29, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Darien Schettler (2024). LLM Prompt Recovery - Synthetic Datastore [Dataset]. https://www.kaggle.com/datasets/dschettler8845/llm-prompt-recovery-synthetic-datastore
Explore at:
zip(988448 bytes)Available download formats
Dataset updated
Feb 29, 2024
Authors
Darien Schettler
License
https://www.licenses.ai/ai-licenseshttps://www.licenses.ai/ai-licenses
Description
High Level Description

This dataset uses Gemma 7B-IT to generate synthetic dataset for the LLM Prompt Recovery competition.

Contributors

Please go upvote these other datasets as my work is not possible without them

thedrcat's dataset - LLM Prompt Recovery Data

TBD

First Dataset - 1000 Examples From @thedrcat

Update 1 - February 29, 2024

The only file presently found in this dataset is gemma1000_7b.csv which uses the dataset created by @thedrcat found here: https://www.kaggle.com/datasets/thedrcat/llm-prompt-recovery-data?select=gemma1000.csv

The file below is the file Darek created with two additional columns appended. The first is the output of Gemma 7B-IT (raw based on the instructions below)(vs. 2B-IT that Darek used) and the second is the output with the 'Sure... blah blah

' sentence removed.

I generated things using the following setup:

# I used a vLLM server to host Gemma 7B on paperspace (A100) # Step 1 - Install vLLM >>> pip install vllm # Step 2 - Authenticate HuggingFace CLI (for model weights) >>> huggingface-cli login --token
h
Data from: playlist-generator
huggingface.co
Updated Jun 29, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nima Boscarino (2022). playlist-generator [Dataset]. https://huggingface.co/datasets/NimaBoscarino/playlist-generator
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 29, 2022
Authors
Nima Boscarino
Description
Playlist Generator Dataset

This dataset contains three files, used in the Playlist Generator space. Visit the blog post to learn more about the project: https://huggingface.co/blog/your-first-ml-project

verse-embeddings.pkl contains Sentence Transformer embeddings for each verse for each song in a private (unreleased) dataset of song lyrics. The embeddings were generated using this model: https://huggingface.co/sentence-transformers/msmarco-MiniLM-L-6-v3 verses.csv maps each verse… See the full description on the dataset page: https://huggingface.co/datasets/NimaBoscarino/playlist-generator.

Facebook

Twitter

Click to copy link

Link copied

Cite

Dhruv loves school Shah (2024). CSV Export [Dataset]. https://www.kaggle.com/datasets/dhruvlovesschoolshah/csv-export

CSV Export

Explore at:

zip(766 bytes)Available download formats

Dataset updated

Aug 25, 2024

Authors

Dhruv loves school Shah

License

https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

Description

Dataset

This dataset was created by Dhruv loves school Shah

Released under CC0: Public Domain

Clear search

Close search

Google apps

Main menu

CSV Export

Dataset

Contents

Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source...

Wikipedia Biographies Text Generation Dataset

Wikipedia Biographies Text Generation Dataset

Wikipedia Biographies: Infobox and First Paragraphs Texts

About this dataset

How to use the dataset

Research Ideas

hsk-dataset

CORD-19 Dataset(Minified)

Context

Content

Acknowledgements

Inspiration

AI_World_Generator_csv

Export Excel fieldbook to csv-file

The Canada Trademarks Dataset

Podcast PR Contacts - Self-Service CSV Batch Export

Dataset of the paper: "How do Hugging Face Models Document Datasets, Bias,...

Csv To Txt Dataset

Csv To Txt

human_ai_generated_text

PIPr: A Dataset of Public Infrastructure as Code Programs

KU-MG2: A Dataset for Hybrid Photovoltaic-Natural Gas Generator Microgrid...

ChatGPT Classification Dataset

Screw and hole detection output results dataset in an autonomous screwing...

Dataset for twitter Sentiment Analysis using Roberta and Vader

TRAVEL: A Dataset with Toolchains for Test Generation and Regression Testing...

LLM Prompt Recovery - Synthetic Datastore

High Level Description

Contributors

First Dataset - 1000 Examples From @thedrcat

Data from: playlist-generator

CSV Export

Dataset

Contents