78 datasets found

Z
Data from: A Large-scale Dataset of (Open Source) License Text Variants
data.niaid.nih.gov
Updated Mar 31, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Stefano Zacchiroli (2022). A Large-scale Dataset of (Open Source) License Text Variants [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6379163
Explore at:
Dataset updated
Mar 31, 2022
Dataset authored and provided by
Stefano Zacchiroli
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We introduce a large-scale dataset of the complete texts of free/open source software (FOSS) license variants. To assemble it we have collected from the Software Heritage archive—the largest publicly available archive of FOSS source code with accompanying development history—all versions of files whose names are commonly used to convey licensing terms to software users and developers. The dataset consists of 6.5 million unique license files that can be used to conduct empirical studies on open source licensing, training of automated license classifiers, natural language processing (NLP) analyses of legal texts, as well as historical and phylogenetic studies on FOSS licensing. Additional metadata about shipped license files are also provided, making the dataset ready to use in various contexts; they include: file length measures, detected MIME type, detected SPDX license (using ScanCode), example origin (e.g., GitHub repository), oldest public commit in which the license appeared. The dataset is released as open data as an archive file containing all deduplicated license blobs, plus several portable CSV files for metadata, referencing blobs via cryptographic checksums.

For more details see the included README file and companion paper:

Stefano Zacchiroli. A Large-scale Dataset of (Open Source) License Text Variants. In proceedings of the 2022 Mining Software Repositories Conference (MSR 2022). 23-24 May 2022 Pittsburgh, Pennsylvania, United States. ACM 2022.

If you use this dataset for research purposes, please acknowledge its use by citing the above paper.
The Canada Trademarks Dataset
zenodo.org
pdf, zip
Updated Jul 19, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jeremy Sheff; Jeremy Sheff (2024). The Canada Trademarks Dataset [Dataset]. http://doi.org/10.5281/zenodo.4999655
Explore at:
zip, pdfAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.4999655
Dataset updated
Jul 19, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Jeremy Sheff; Jeremy Sheff
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Canada
Description
The Canada Trademarks Dataset

18 Journal of Empirical Legal Studies 908 (2021), prepublication draft available at https://papers.ssrn.com/abstract=3782655, published version available at https://onlinelibrary.wiley.com/share/author/CHG3HC6GTFMMRU8UJFRR?target=10.1111/jels.12303

Dataset Selection and Arrangement (c) 2021 Jeremy Sheff

Python and Stata Scripts (c) 2021 Jeremy Sheff

Contains data licensed by Her Majesty the Queen in right of Canada, as represented by the Minister of Industry, the minister responsible for the administration of the Canadian Intellectual Property Office.

This individual-application-level dataset includes records of all applications for registered trademarks in Canada since approximately 1980, and of many preserved applications and registrations dating back to the beginning of Canada’s trademark registry in 1865, totaling over 1.6 million application records. It includes comprehensive bibliographic and lifecycle data; trademark characteristics; goods and services claims; identification of applicants, attorneys, and other interested parties (including address data); detailed prosecution history event data; and data on application, registration, and use claims in countries other than Canada. The dataset has been constructed from public records made available by the Canadian Intellectual Property Office. Both the dataset and the code used to build and analyze it are presented for public use on open-access terms.

Scripts are licensed for reuse subject to the Creative Commons Attribution License 4.0 (CC-BY-4.0), https://creativecommons.org/licenses/by/4.0/. Data files are licensed for reuse subject to the Creative Commons Attribution License 4.0 (CC-BY-4.0), https://creativecommons.org/licenses/by/4.0/, and also subject to additional conditions imposed by the Canadian Intellectual Property Office (CIPO) as described below.

Terms of Use:

As per the terms of use of CIPO's government data, all users are required to include the above-quoted attribution to CIPO in any reproductions of this dataset. They are further required to cease using any record within the datasets that has been modified by CIPO and for which CIPO has issued a notice on its website in accordance with its Terms and Conditions, and to use the datasets in compliance with applicable laws. These requirements are in addition to the terms of the CC-BY-4.0 license, which require attribution to the author (among other terms). For further information on CIPO’s terms and conditions, see https://www.ic.gc.ca/eic/site/cipointernet-internetopic.nsf/eng/wr01935.html. For further information on the CC-BY-4.0 license, see https://creativecommons.org/licenses/by/4.0/.

The following attribution statement, if included by users of this dataset, is satisfactory to the author, but the author makes no representations as to whether it may be satisfactory to CIPO:

The Canada Trademarks Dataset is (c) 2021 by Jeremy Sheff and licensed under a CC-BY-4.0 license, subject to additional terms imposed by the Canadian Intellectual Property Office. It contains data licensed by Her Majesty the Queen in right of Canada, as represented by the Minister of Industry, the minister responsible for the administration of the Canadian Intellectual Property Office. For further information, see https://creativecommons.org/licenses/by/4.0/ and https://www.ic.gc.ca/eic/site/cipointernet-internetopic.nsf/eng/wr01935.html.

Details of Repository Contents:

This repository includes a number of .zip archives which expand into folders containing either scripts for construction and analysis of the dataset or data files comprising the dataset itself. These folders are as follows:

/csv: contains the .csv versions of the data files

/do: contains Stata do-files used to convert the .csv files to .dta format and perform the statistical analyses set forth in the paper reporting this dataset

/dta: contains the .dta versions of the data files

/py: contains the python scripts used to download CIPO’s historical trademarks data via SFTP and generate the .csv data files

If users wish to construct rather than download the datafiles, the first script that they should run is /py/sftp_secure.py. This script will prompt the user to enter their IP Horizons SFTP credentials; these can be obtained by registering with CIPO at https://ised-isde.survey-sondage.ca/f/s.aspx?s=59f3b3a4-2fb5-49a4-b064-645a5e3a752d&lang=EN&ds=SFTP. The script will also prompt the user to identify a target directory for the data downloads. Because the data archives are quite large, users are advised to create a target directory in advance and ensure they have at least 70GB of available storage on the media in which the directory is located.

The sftp_secure.py script will generate a new subfolder in the user’s target directory called /XML_raw. Users should note the full path of this directory, which they will be prompted to provide when running the remaining python scripts. Each of the remaining scripts, the filenames of which begin with “iterparse”, corresponds to one of the data files in the dataset, as indicated in the script’s filename. After running one of these scripts, the user’s target directory should include a /csv subdirectory containing the data file corresponding to the script; after running all the iterparse scripts the user’s /csv directory should be identical to the /csv directory in this repository. Users are invited to modify these scripts as they see fit, subject to the terms of the licenses set forth above.

With respect to the Stata do-files, only one of them is relevant to construction of the dataset itself. This is /do/CA_TM_csv_cleanup.do, which converts the .csv versions of the data files to .dta format, and uses Stata’s labeling functionality to reduce the size of the resulting files while preserving information. The other do-files generate the analyses and graphics presented in the paper describing the dataset (Jeremy N. Sheff, The Canada Trademarks Dataset, 18 J. Empirical Leg. Studies (forthcoming 2021)), available at https://papers.ssrn.com/abstract=3782655). These do-files are also licensed for reuse subject to the terms of the CC-BY-4.0 license, and users are invited to adapt the scripts to their needs.

The python and Stata scripts included in this repository are separately maintained and updated on Github at https://github.com/jnsheff/CanadaTM.

This repository also includes a copy of the current version of CIPO's data dictionary for its historical XML trademarks archive as of the date of construction of this dataset.
Data from: Dataflow Analysis-Inspired Deep Learning for Efficient...
figshare.com
zip
Updated Jan 11, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Benjamin Steenhoek (2024). Dataflow Analysis-Inspired Deep Learning for Efficient Vulnerability Detection [Dataset]. http://doi.org/10.6084/m9.figshare.21225413.v3
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.21225413.v3
Dataset updated
Jan 11, 2024
Dataset provided by
Figsharehttp://figshare.com/
Authors
Benjamin Steenhoek
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data package for "Dataflow Analysis-Inspired Deep Learning for Efficient Vulnerability Detection", published in ICSE 2024, with updates from Artifact Evaluation.Paper link: https://www.computer.org/csdl/proceedings-article/icse/2024/021700a166/1RLIWqviwEMSee Github repo for updates: https://github.com/ISU-PAAL/DeepDFAData dictionary:before.zip: CFGs of Big-Vul dataset, generated by Joern.preprocessed_data.zip: preprocessed data from Big-Vul for running DeepDFA, including preprocessed Joern CFGs and abstract dataflow embeddings.DeepDFA-code.zip: most recent version of the code as of the publication of this artifact, see Github repo for updates: https://github.com/ISU-PAAL/DeepDFAMSR_data_cleaned.csv: original Big-Vul dataset, see original source: https://github.com/ZeoVan/MSR_20_Code_vulnerability_CSV_DatasetMSR_LineVul: LineVul's preprocessed version of the Big-Vul dataset, see original source: https://github.com/awsm-research/LineVulChangelog:v1 2023-09-20: original data package and Github repo published.v2 2024-01-04: added full instructions and bug fixes for Artifact Evaluation.v3 2024-01-10: integrated feedback from Artifact Evaluation.
(No) Influence of Continuous Integration on the Development Activity in...
zenodo.org
data.niaid.nih.gov
csv
Updated Jan 24, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sebastian Baltes; Sebastian Baltes; Jascha Knack; Jascha Knack (2020). (No) Influence of Continuous Integration on the Development Activity in GitHub Projects — Dataset [Dataset]. http://doi.org/10.5281/zenodo.1291582
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.1291582
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Sebastian Baltes; Sebastian Baltes; Jascha Knack; Jascha Knack
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is based on the TravisTorrent dataset released 2017-01-11 (https://travistorrent.testroots.org), the Google BigQuery GHTorrent dataset accessed 2017-07-03, and the Git log history of all projects in the dataset, retrieved 2017-07-16 and 2017-07-17.

We selected projects hosted on GitHub that employ the Continuous Integration (CI) system Travis CI. We identified the projects using the TravisTorrent data set and considered projects that:

used GitHub from the beginning (first commit not more than seven days before project creation date according to GHTorrent),

were active for at least one year (365 days) before the first build with Travis CI (before_ci),

used Travis CI at least for one year (during_ci),

had commit or merge activity on the default branch in both of these phases, and

used the default branch to trigger builds.

To derive the time frames, we employed the GHTorrent Big Query data set. The resulting sample contains 113 projects. Of these projects, 89 are Ruby projects and 24 are Java projects. For our analysis, we only consider the activity one year before and after the first build.

We cloned the selected project repositories and extracted the version history for all branches (see https://github.com/sbaltes/git-log-parser). For each repo and branch, we created one log file with all regular commits and one log file with all merges. We only considered commits changing non-binary files and applied a file extension filter to only consider changes to Java or Ruby source code files. From the log files, we then extracted metadata about the commits and stored this data in CSV files (see https://github.com/sbaltes/git-log-parser).

We also retrieved a random sample of GitHub project to validate the effects we observed in the CI project sample. We only considered projects that:

have Java or Ruby as their project language

used GitHub from the beginning (first commit not more than seven days before project creation date according to GHTorrent)

have commit activity for at least two years (730 days)

are engineered software projects (at least 10 watchers)

were not in the TravisTorrent dataset

In total, 8,046 projects satisfied those constraints. We drew a random sample of 800 projects from this sampling frame and retrieved the commit and merge data in the same way as for the CI sample. We then split the development activity at the median development date, removed projects without commits or merges in either of the two resulting time spans, and then manually checked the remaining projects to remove the ones with CI configuration files. The final comparision sample contained 60 non-CI projects.

This dataset contains the following files:

tr_projects_sample_filtered_2.csv
A CSV file with information about the 113 selected projects.

tr_sample_commits_default_branch_before_ci.csv
tr_sample_commits_default_branch_during_ci.csv
One CSV file with information about all commits to the default branch before and after the first CI build. Only commits modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the following columns:

project: GitHub project name ("/" replaced by "_").
branch: The branch to which the commit was made.
hash_value: The SHA1 hash value of the commit.
author_name: The author name.
author_email: The author email address.
author_date: The authoring timestamp.
commit_name: The committer name.
commit_email: The committer email address.
commit_date: The commit timestamp.
log_message_length: The length of the git commit messages (in characters).
file_count: Files changed with this commit.
lines_added: Lines added to all files changed with this commit.
lines_deleted: Lines deleted in all files changed with this commit.
file_extensions: Distinct file extensions of files changed with this commit.

tr_sample_merges_default_branch_before_ci.csv
tr_sample_merges_default_branch_during_ci.csv
One CSV file with information about all merges into the default branch before and after the first CI build. Only merges modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the following columns:

project: GitHub project name ("/" replaced by "_").
branch: The destination branch of the merge.
hash_value: The SHA1 hash value of the merge commit.
merged_commits: Unique hash value prefixes of the commits merged with this commit.
author_name: The author name.
author_email: The author email address.
author_date: The authoring timestamp.
commit_name: The committer name.
commit_email: The committer email address.
commit_date: The commit timestamp.
log_message_length: The length of the git commit messages (in characters).
file_count: Files changed with this commit.
lines_added: Lines added to all files changed with this commit.
lines_deleted: Lines deleted in all files changed with this commit.
file_extensions: Distinct file extensions of files changed with this commit.
pull_request_id: ID of the GitHub pull request that has been merged with this commit (extracted from log message).
source_user: GitHub login name of the user who initiated the pull request (extracted from log message).
source_branch : Source branch of the pull request (extracted from log message).

comparison_project_sample_800.csv
A CSV file with information about the 800 projects in the comparison sample.

commits_default_branch_before_mid.csv
commits_default_branch_after_mid.csv
One CSV file with information about all commits to the default branch before and after the medium date of the commit history. Only commits modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the same columns as the commits tables described above.

merges_default_branch_before_mid.csv
merges_default_branch_after_mid.csv
One CSV file with information about all merges into the default branch before and after the medium date of the commit history. Only merges modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the same columns as the merge tables described above.
LLM: 7 prompt training dataset
kaggle.com
Updated Nov 15, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Carl McBride Ellis (2023). LLM: 7 prompt training dataset [Dataset]. https://www.kaggle.com/datasets/carlmcbrideellis/llm-7-prompt-training-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 15, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Carl McBride Ellis
License
https://cdla.io/sharing-1-0/https://cdla.io/sharing-1-0/
Description
Version 4: Adding the data from "LLM-generated essay using PaLM from Google Gen-AI" kindly generated by Kingki19 / Muhammad Rizqi.
File: train_essays_RDizzl3_seven_v2.csv
Human texts: 14247 LLM texts: 3004

See also: a new dataset of an additional 4900 LLM generated texts: LLM: Mistral-7B Instruct texts

Version 3: "**The RDizzl3 Seven**"
File: train_essays_RDizzl3_seven_v1.csv

"Car-free cities"

"Does the electoral college work?"

"Exploring Venus"

"The Face on Mars"

"Facial action coding system"

"A Cowboy Who Rode the Waves"

"Driverless cars"

How this dataset was made: see the notebook "LLM: Make 7 prompt train dataset"

Version 2: (train_essays_7_prompts_v2.csv) This dataset is composed of 13,712 human texts and 1638 AI-LLM generated texts originating from 7 of the PERSUADE 2.0 corpus prompts.

Namely:

"Car-free cities"

"Does the electoral college work?"

"Exploring Venus"

"The Face on Mars"

"Facial action coding system"

"Seeking multiple opinions"

"Phones and driving"

This dataset is a derivative of the datasets

LLM Generated Essays for the Detect AI Comp! by Radek Osmulski

persuade corpus 2.0 provided by Nicholas Broad

daigt data - llama 70b and falcon180b by Nicholas Broad

Hello, Claude! 1000 essays from Anthropic... by Darragh

as well as the original competition training dataset

Version 1:This dataset is composed of 13,712 human texts and 1165 AI-LLM generated texts originating from 7 of the PERSUADE 2.0 corpus prompts.
1000 Empirical Time series
figshare.com
researchdata.edu.au
png
Updated May 30, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ben Fulcher (2023). 1000 Empirical Time series [Dataset]. http://doi.org/10.6084/m9.figshare.5436136.v10
Explore at:
pngAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.5436136.v10
Dataset updated
May 30, 2023
Dataset provided by
Figsharehttp://figshare.com/
Authors
Ben Fulcher
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A diverse selection of 1000 empirical time series, along with results of an hctsa feature extraction, using v1.06 of hctsa and Matlab 2019b, computed on a server at The University of Sydney.The results of the computation are in the hctsa file, HCTSA_Empirical1000.mat for use in Matlab using v1.06 of hctsa.The same data is also provided in .csv format for the hctsa_datamatrix.csv (results of feature computation), with information about rows (time series) in hctsa_timeseries-info.csv, information about columns (features) in hctsa_features.csv (and corresponding hctsa code used to compute each feature in hctsa_masterfeatures.csv), and the data of individual time series (each line a time series, for time series described in hctsa_timeseries-info.csv) is in hctsa_timeseries-data.csv. These .csv files were produced by running >>OutputToCSV(HCTSA_Empirical1000.mat,true,true); in hctsa.The input file, INP_Empirical1000.mat, is for use with hctsa, and contains the time-series data and metadata for the 1000 time series. For example, massive feature extraction from these data on the user's machine, using hctsa, can proceed as>> TS_Init('INP_Empirical1000.mat');Some visualizations of the dataset are in CarpetPlot.png (first 1000 samples of all time series as a carpet (color) plot) and 150TS-250samples.png (conventional time-series plots of the first 250 samples of a sample of 150 time series from the dataset). More visualizations can be performed by the user using TS_PlotTimeSeries from the hctsa package.See links in references for more comprehensive documentation for performing methodological comparison using this dataset, and on how to download and use v1.06 of hctsa.
Data from: A Toolbox for Surfacing Health Equity Harms and Biases in Large...
springernature.figshare.com
application/csv
Updated Sep 24, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Stephen R. Pfohl; Heather Cole-Lewis; Rory Sayres; Darlene Neal; Mercy Asiedu; Awa Dieng; Nenad Tomasev; Qazi Mamunur Rashid; Shekoofeh Azizi; Negar Rostamzadeh; Liam G. McCoy; Leo Anthony Celi; Yun Liu; Mike Schaekermann; Alanna Walton; Alicia Parrish; Chirag Nagpal; Preeti Singh; Akeiylah Dewitt; Philip Mansfield; Sushant Prakash; Katherine Heller; Alan Karthikesalingam; Christopher Semturs; Joëlle K. Barral; Greg Corrado; Yossi Matias; Jamila Smith-Loud; Ivor B. Horn; Karan Singhal (2024). A Toolbox for Surfacing Health Equity Harms and Biases in Large Language Models [Dataset]. http://doi.org/10.6084/m9.figshare.26133973.v1
Explore at:
application/csvAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.26133973.v1
Dataset updated
Sep 24, 2024
Dataset provided by
Figsharehttp://figshare.com/
Authors
Stephen R. Pfohl; Heather Cole-Lewis; Rory Sayres; Darlene Neal; Mercy Asiedu; Awa Dieng; Nenad Tomasev; Qazi Mamunur Rashid; Shekoofeh Azizi; Negar Rostamzadeh; Liam G. McCoy; Leo Anthony Celi; Yun Liu; Mike Schaekermann; Alanna Walton; Alicia Parrish; Chirag Nagpal; Preeti Singh; Akeiylah Dewitt; Philip Mansfield; Sushant Prakash; Katherine Heller; Alan Karthikesalingam; Christopher Semturs; Joëlle K. Barral; Greg Corrado; Yossi Matias; Jamila Smith-Loud; Ivor B. Horn; Karan Singhal
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Supplementary material and data for Pfohl and Cole-Lewis et al., "A Toolbox for Surfacing Health Equity Harms and Biases in Large Language Models" (2024).

We include the sets of adversarial questions for each of the seven EquityMedQA datasets (OMAQ, EHAI, FBRT-Manual, FBRT-LLM, TRINDS, CC-Manual, and CC-LLM), the three other non-EquityMedQA datasets used in this work (HealthSearchQA, Mixed MMQA-OMAQ, and Omiye et al.), as well as the data generated as a part of the empirical study, including the generated model outputs (Med-PaLM 2 [1] primarily, with Med-PaLM [2] answers for pairwise analyses) and ratings from human annotators (physicians, health equity experts, and consumers). See the paper for details on all datasets.

We include other datasets evaluated in this work: HealthSearchQA [2], Mixed MMQA-OMAQ, and Omiye et al [3].

Mixed MMQA-OMAQ is composed of the 140 question subset of MultiMedQA questions described in [1,2] with an additional 100 questions from OMAQ (described below). The 140 MultiMedQA questions are composed of 100 from HealthSearchQA, 20 from LiveQA [4], and 20 from MedicationQA [5]. In the data presented here, we do not reproduce the text of the questions from LiveQA and MedicationQA. For LiveQA, we instead use identifier that correspond to those presented in the original dataset. For MedicationQA, we designate "MedicationQA_N" to refer to the N-th row of MedicationQA (0-indexed).

A limited number of data elements described in the paper are not included here. The following elements are excluded:

The reference answers written by physicians to HealthSearchQA questions, introduced in [2], and the set of corresponding pairwise ratings. This accounts for 2,122 rated instances.

The free-text comments written by raters during the ratings process.

Demographic information associated with the consumer raters (only age group information is included).

References

Singhal, K., et al. Towards expert-level medical question answering with large language models. arXiv preprint arXiv:2305.09617 (2023).

Singhal, K., Azizi, S., Tu, T. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023). https://doi.org/10.1038/s41586-023-06291-2

Omiye, J.A., Lester, J.C., Spichak, S. et al. Large language models propagate race-based medicine. npj Digit. Med. 6, 195 (2023). https://doi.org/10.1038/s41746-023-00939-z

Abacha, Asma Ben, et al. "Overview of the medical question answering task at TREC 2017 LiveQA." TREC. 2017.

Abacha, Asma Ben, et al. "Bridging the gap between consumers’ medication questions and trusted answers." MEDINFO 2019: Health and Wellbeing e-Networks for All. IOS Press, 2019. 25-29.

Description of files and sheets

Independent Ratings [ratings_independent.csv]: Contains ratings of the presence of bias and its dimensions in Med-PaLM 2 outputs using the independent assessment rubric for each of the datasets studied. The primary response regarding the presence of bias is encoded in the column bias_presence with three possible values (No bias, Minor bias, Severe bias). Binary assessments of the dimensions of bias are encoded in separate columns (e.g., inaccuracy_for_some_axes). Instances for the Mixed MMQA-OMAQ dataset are triple-rated for each rater group; other datasets are single-rated. Instances were missing for five instances in MMQA-OMAQ and two instances in CC-Manual. This file contains 7,519 rated instances.

Paired Ratings [ratings_pairwise.csv]: Contains comparisons of the presence or degree of bias and its dimensions in Med-PaLM and Med-PaLM 2 outputs for each of the datasets studied. Pairwise responses are encoded in terms of two binary columns corresponding to which of the answers was judged to contain a greater degree of bias (e.g., Med-PaLM-2_answer_more_bias). Dimensions of bias are encoded in the same way as for ratings_independent.csv. Instances for the Mixed MMQA-OMAQ dataset are triple-rated for each rater group; other datasets are single-rated. Four ratings were missing (one for EHAI, two for FRT-Manual, one for FBRT-LLM). This file contains 6,446 rated instances.

Counterfactual Paired Ratings [ratings_counterfactual.csv]: Contains ratings under the counterfactual rubric for pairs of questions defined in the CC-Manual and CC-LLM datasets. Contains a binary assessment of the presence of bias (bias_presence), columns for each dimension of bias, and categorical columns corresponding to other elements of the rubric (ideal_answers_diff, how_answers_diff). Instances for the CC-Manual dataset are triple-rated, instances for CC-LLM are single-rated. Due to a data processing error, we removed questions that refer to `Natal'' from the analysis of the counterfactual rubric on the CC-Manual dataset. This affects three questions (corresponding to 21 pairs) derived from one seed question based on the TRINDS dataset. This file contains 1,012 rated instances.

Open-ended Medical Adversarial Queries (OMAQ) [equitymedqa_omaq.csv]: Contains questions that compose the OMAQ dataset. The OMAQ dataset was first described in [1].

Equity in Health AI (EHAI) [equitymedqa_ehai.csv]: Contains questions that compose the EHAI dataset.

Failure-Based Red Teaming - Manual (FBRT-Manual) [equitymedqa_fbrt_manual.csv]: Contains questions that compose the FBRT-Manual dataset.

Failure-Based Red Teaming - LLM (FBRT-LLM); full [equitymedqa_fbrt_llm.csv]: Contains questions that compose the extended FBRT-LLM dataset.

Failure-Based Red Teaming - LLM (FBRT-LLM) [equitymedqa_fbrt_llm_661_sampled.csv]: Contains questions that compose the sampled FBRT-LLM dataset used in the empirical study.

TRopical and INfectious DiseaseS (TRINDS) [equitymedqa_trinds.csv]: Contains questions that compose the TRINDS dataset.

Counterfactual Context - Manual (CC-Manual) [equitymedqa_cc_manual.csv]: Contains pairs of questions that compose the CC-Manual dataset.

Counterfactual Context - LLM (CC-LLM) [equitymedqa_cc_llm.csv]: Contains pairs of questions that compose the CC-LLM dataset.

HealthSearchQA [other_datasets_healthsearchqa.csv]: Contains questions sampled from the HealthSearchQA dataset [1,2].

Mixed MMQA-OMAQ [other_datasets_mixed_mmqa_omaq]: Contains questions that compose the Mixed MMQA-OMAQ dataset.

Omiye et al. [other datasets_omiye_et_al]: Contains questions proposed in Omiye et al. [3].

Version history

Version 2: Updated to include ratings and generated model outputs. Dataset files were updated to include unique ids associated with each question. Version 1: Contained datasets of questions without ratings. Consistent with v1 available as a preprint on Arxiv (https://arxiv.org/abs/2403.12025)

WARNING: These datasets contain adversarial questions designed specifically to probe biases in AI systems. They can include human-written and model-generated language and content that may be inaccurate, misleading, biased, disturbing, sensitive, or offensive.

NOTE: the content of this research repository (i) is not intended to be a medical device; and (ii) is not intended for clinical use of any kind, including but not limited to diagnosis or prognosis.
Z
GPT vs Stack Overflow: data collection (A2I2 T2 2023)
data.niaid.nih.gov
zenodo.org
Updated Oct 6, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Heath, Mark (2023). GPT vs Stack Overflow: data collection (A2I2 T2 2023) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8403467
Explore at:
Dataset updated
Oct 6, 2023
Dataset authored and provided by
Heath, Mark
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
About

The dataset components produced by this repo. Please see the documentation there for more information.

Each CSV has been individually zipped so that you only have to download the specific file(s) that you want.

Overview of Files

From using the Stack Exchange Data Dump as the data source (these zip files have a DD_ prefix):

Raw dataset before processing: saved_dataset.csv (DD_saved_dataset.zip)

Completed tag count: tag_count.csv (DD_tag_count.zip)

Processed dataset with completed evaluations: dataset_results.csv (DD_dataset_results.zip)

From using Google BigQuery as the data source (these zip files have a BQ_ prefix):

Raw dataset before processing: saved_dataset.csv (BQ_saved_dataset.zip)

Completed tag count: tag_count.csv (BQ_tag_count.zip)

No large-scale evaluation was completed when using BigQuery as a data source.

As noted in the linked repo, the use of Google BigQuery as a data source is not recommended for this work, but the working code and dataset have nonetheless been provided for completeness.

License

This dataset is licensed under the CC BY-SA 4.0 license, the same license used by the Stack Exchange Data Dump.
P
News Interactions on Globo.com Dataset
paperswithcode.com
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gabriel de Souza Pereira Moreira; Dietmar Jannach; Adilson Marques da Cunha, News Interactions on Globo.com Dataset [Dataset]. https://paperswithcode.com/dataset/news-interactions-on-globo-com
Explore at:
Authors
Gabriel de Souza Pereira Moreira; Dietmar Jannach; Adilson Marques da Cunha
Description
Context This large dataset with users interactions logs (page views) from a news portal was kindly provided by Globo.com, the most popular news portal in Brazil, for reproducibility of the experiments with CHAMELEON - a meta-architecture for contextual hybrid session-based news recommender systems. The source code was made available at GitHub.

The first version (v1) (download) of this dataset was released for reproducibility of the experiments presented in the following paper:

Gabriel de Souza Pereira Moreira, Felipe Ferreira, and Adilson Marques da Cunha. 2018. News Session-Based Recommendations using Deep Neural Networks. In 3rd Workshop on Deep Learning for Recommender Systems (DLRS 2018), October 6, 2018, Vancouver, BC, Canada. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3270323.3270328

A second version (v2) (download) of this dataset was made available for reproducibility of the experiments presented in the following paper. Compared to the v1, the only differences are:

Included four additional user contextual attributes (click_os, click_country, click_region, click_referrer_type) Removed repeated clicks (clicks in the same articles) within sessions. Those sessions with less than two clicks (minimum for the next-click prediction task) were removed

Gabriel de Souza Pereira Moreira, Dietmar Jannach, and Adilson Marques da Cunha. 2019. Contextual Hybrid Session-based News Recommendation with Recurrent Neural Networks. arXiv preprint arXiv:1904.10367, 49 pages

You are not allowed to use this dataset for commercial purposes, only with academic objectives (like education or research). If used for research, please cite the above papers.

Content The dataset contains a sample of user interactions (page views) in G1 news portal from Oct. 1 to 16, 2017, including about 3 million clicks, distributed in more than 1 million sessions from 314,000 users who read more than 46,000 different news articles during that period.

It is composed by three files/folders:

clicks.zip - Folder with CSV files (one per hour), containing user sessions interactions in the news portal. articles_metadata.csv - CSV file with metadata information about all (364047) published articles articles_embeddings.pickle Pickle (Python 3) of a NumPy matrix containing the Article Content Embeddings (250-dimensional vectors), trained upon articles' text and metadata by the CHAMELEON's ACR module (see paper for details) for 364047 published articles. P.s. The full text of news articles could not be provided due to license restrictions, but those embeddings can be used by Neural Networks to represent their content. See this paper for a t-SNE visualization of these embeddings, colored by category.

Acknowledgements I would like to acknowledge Globo.com for providing this dataset for this research and for the academic community, in special to Felipe Ferreira for preparing the original dataset by Globo.com.

Dataset banner photo by rawpixel on Unsplash

Inspiration This dataset might be very useful if you want to implement and evaluate hybrid and contextual news recommender systems, using both user interactions and articles content and metadata to provide recommendations. You might also use it for analytics, trying to understand how users interactions in a news portal are distributed by user, by article, or by category, for example.

If you are interested in a dataset of user interactions on articles with the full text provided, to experiment with some different text representations using NLP, you might want to take a look in this smaller dataset.
d
Data release: Flood and Storm Tracker (FaST) data
catalog.data.gov
s.cnmilf.com
Updated Jul 6, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2024). Data release: Flood and Storm Tracker (FaST) data [Dataset]. https://catalog.data.gov/dataset/data-release-flood-and-storm-tracker-fast-data
Explore at:
Dataset updated
Jul 6, 2024
Dataset provided by
U.S. Geological Survey
Description
This product summarizes data used in the analysis portion of our Flood and Storm Tracker (FaST) manuscript (see larger work citation). The dataset titled HUCsppMatrices2012-2022.csv has each Hydraulic Unit Code (HUC) with an introduced taxon in each storm and the HUC it connected to by flood waters (lateral or longitudinal). The dataset titled ConnectionPoints_2012-2022.csv has each lateral (not longitudinal or downstream) connection point for each storm event. The dataset titled LongitudinalConnectionPoints_2012-2022.csv has each longitudinal or downstream connection point for each storm event.
Z
Data from: The Software Heritage License Dataset (2022 Edition)
data.niaid.nih.gov
Updated Jan 10, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sergio Montes-Leon (2024). The Software Heritage License Dataset (2022 Edition) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8200351
Explore at:
Dataset updated
Jan 10, 2024
Dataset provided by
Jesus M. Gonzalez-Barahona
Sergio Montes-Leon
Gregorio Robles
Stefano Zacchiroli
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains all “license files” extracted from a snapshot of the Software Heritage archive taken on 2022-04-25. (Other, possibly more recent, versions of the datasets can be found at https://annex.softwareheritage.org/public/dataset/license-blobs/).

In this context, a license file is a unique file content (or “blob”) that appeared in a software origin archived by Software Heritage as a file whose name is often used to ship licenses in software projects. Some name examples are: COPYING, LICENSE, NOTICE, COPYRIGHT, etc. The exact file name pattern used to select the blobs contained in the dataset can be found in the SQL query file 01-select-blobs.sql. Note that the file name was not expected to be at the project root, because project subdirectories can contain different licenses than the top-level one, and we wanted to include those too.

Format

The dataset is organized as follows:

blobs.tar.zst: a Zst-compressed tarball containing deduplicated license blobs, one per file. The tarball contains 6’859’189 blobs, for a total uncompressed size on disk of 66 GiB.

The blobs are organized in a sharded directory structure that contains files named like blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02, where:

blobs/ is the root directory containing all license blobs

8624bcdae55baeef00cd11d5dfcfa60f68710a02 is the SHA1 checksum of a specific license blobs, a copy of the GPL3 license in this case. Each license blob is ultimately named with its SHA1:

$ head -n 3 blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02 GNU GENERAL PUBLIC LICENSE Version 3, 29 June 2007

$ sha1sum blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02 8624bcdae55baeef00cd11d5dfcfa60f68710a02 blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02

86 and 24 are, respectively, the first and second group of two hex digits in the blob SHA1

One blob is missing, because its size (313MB) prevented its inclusion; (it was originally a tarball containing source code):

swh:1:cnt:61bf63793c2ee178733b39f8456a796b72dc8bde,1340d4e2da173c92d432026ecdc54b4859fe9911,"AUTHORS"

blobs-sample20k.tar.zst: analogous to blobs.tar.zst, but containing “only” 20’000 randomly selected license blobs

license-blobs.csv.zst a Zst-compressed CSV index of all the blobs in the dataset. Each line in the index (except the first one, which contains column headers) describes a license blob and is in the format SWHID,SHA1,NAME, for example:

swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING" swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING.GPL3" swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING.GLP-3"

where:

SWHID: the Software Heritage persistent identifier of the blob. It can be used to retrieve and cross-reference the license blob via the Software Heritage archive, e.g., at: https://archive.softwareheritage.org/swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2

SHA1: the blob SHA1, that can be used to cross-reference blobs in the blobs/ directory

NAME: a file name given to the license blob in a given software origin. As the same license blob can have different names in different contexts, the index contain multiple entries for the same blob with different names, as it is the case in the example above (yes, one of those has a typo in it, but it’s an original typo from some repository!).

blobs-fileinfo.csv.zst a Zst-compressed CSV mapping from blobs to basic file information in the format: SHA1,MIME_TYPE,ENCODING,LINE_COUNT,WORD_COUNT,SIZE, where:

SHA1: blob SHA1

MIME_TYPE: blob MIME type, as detected by libmagic

ENCODING: blob character encoding, as detected by libmagic

LINE_COUNT: number of lines in the blob (only for textual blobs with UTF8 encoding)

WORD_COUNT: number of words in the blob (only for textual blobs with UTF8 encoding)

SIZE: blob size in bytes

blobs-scancode.csv.zst a Zst-compressed CSV mapping from blobs to software license detected in them by ScanCode, in the format: SHA1,LICENSE,SCORE, where:

SHA1: blob SHA1

LICENSE: license detected in the blob, as an SPDX identifier (or ScanCode identifier for non-SPDX-indexed licenses)

SCORE: confidence score in the result, as a decimal number between 0 and 100

There may be zero or arbitrarily many lines for each blob.

blobs-scancode.ndjson.zst a Zst-compressed line-delimited JSON, containing a superset of the information in blobs-scancode.csv.zst. Each line is a JSON dictionary with three keys:

sha1: blob SHA1

licenses: output of scancode.api.get_licenses(..., min_score=0)

copyrights: output of scancode.api.get_copyrights(...)

There is exactly one line for each blob. licenses and copyrights keys are omitted for files not detected as plain text.

blobs-origins.csv.zst a Zst-compressed CSV mapping of where license blobs come from. Each line in the index associate a license blob to one of its origins in the format SWHIDURL, for example:

swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2 https://github.com/pombreda/Artemis

Note that a license blob can come from many different places, only an arbitrary (and somewhat random) one is listed in this mapping.

If no origin URL is found in the Software Heritage archive, then a blank is used instead. This happens when they were either being loaded when the dataset was generated, or the loader process crashed before completing the blob’s origin’s ingestion.

blobs-nb-origins.csv.zst a Zst-compressed CSV mapping of how many origins of this blob are known to Software Heritage. Each line in the index associate a license blob to this count in the format SWHIDNUMBER, for example:

swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2 2822260

Two blobs are missing because the computation crashes:

swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 swh:1:cnt:8b137891791fe96927ad78e64b0aad7bded08bdc

This issue will be fixed in a future version of the dataset

blobs-earliest.csv.zst a Zst-compressed CSV mapping from blobs to information about their (earliest) known occurence(s) in the archive. Format: SWHIDEARLIEST_SWHIDEARLIEST_TSOCCURRENCES, where:

SWHID: blob SWHID

EARLIEST_SWHID: SWHID of the earliest known commit containing the blob

EARLIEST_TS: timestamp of the earliest known commit containing the blob, as a Unix time integer

OCCURRENCES: number of known commits containing the blob

replication-package.tar.gz: code and scripts used to produce the dataset

licenses-annotated-sample.tar.gz: ground truth, i.e., manually annotated random sample of license blobs, with details about the kind of information they contain.

Changes since the 2021-03-23 dataset

More input data, due to the SWH archive growing: more origins in supported forges and package managers; and support for more forges and package managers. See the SWH Archive Changelog for details.

Values in the NAME column of license-blobs.csv.zst are quoted, as some file names now contain commas.

Replication package now contains all the steps needed to reproduce all artefacts including the licenseblobs/fetch.py script.

blobs-nb-origins.csv.zst is added.

blobs-origins.csv.zst is now generated using the first origin returned by swh-graph’s leaves endpoint, instead of its randomwalk endpoint. This should have no impact on the result, other than a different distribution of “random” origins being picked.

blobs-origins.csv.zst was missing ~10% of its results in previous versions of the dataset, due to errors and/or timeouts in its generation, this is now down to 0.02% (1254 of the 6859445 unique blobs). Blobs with no known origins are now present, with a blank instead of URL.

blobs-earliest.csv.zst was missing ~10% of its results in previous versions of the dataset. It is complete now.

blobs-scancode.csv.zst is generated with a newer scancode-toolkit version (31.2.1)

blobs-scancode.ndjson.zst is added.

Errata

A file name .tmp_1340d4e2da173c92d432026ecdc54b4859fe9911 was present in the initial version of the dataset (published on 2022-11-07). It was removed on 2022-11-09 using these two commands:

pv blobs-fileinfo.csv.zst | zstdcat | grep -v ".tmp" | zstd -19 pv blobs.tar.zst| zstdcat | tar --delete blobs/13/40/.tmp_1340d4e2da173c92d432026ecdc54b4859fe9911 | zstd -19 -T12

The total uncompressed size was announced as 84 GiB based on the physical size on ext4, but it is actually 66 GiB.

Citation

If you use this dataset for research purposes, please acknowledge its use by citing one or both of the following papers:

[pdf, bib] Jesús M. González-Barahona, Sergio Raúl Montes León, Gregorio Robles, Stefano Zacchiroli. The software heritage license dataset (2022 edition). Empirical Software Engineering, Volume 28, Number 6, Article number 147 (2023).

[pdf, bib] Stefano Zacchiroli. A Large-scale Dataset of (Open Source) License Text Variants. In proceedings of the 2022 Mining Software Repositories Conference (MSR 2022). 23-24 May 2022 Pittsburgh, Pennsylvania, United States. ACM 2022.

References

The dataset has been built using primarily the data sources described in the following papers:

[pdf, bib] Roberto Di Cosmo, Stefano Zacchiroli. Software Heritage: Why and How to Preserve Software Source Code. In Proceedings of iPRES 2017: 14th International Conference on Digital Preservation, Kyoto, Japan, 25-29 September 2017.

[pdf, bib] Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli. The Software Heritage Graph Dataset: Public software development under one roof. In proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Pages 138-142, IEEE 2019.

Errata (v2, 2024-01-09)

licenses-annotated-sample.tar.gz: some comments not intended for publication were removed, and 4
Open Data Portal Catalogue
open.canada.ca
datasets.ai
+1more
csv, json, jsonl, png +2
Updated Jul 13, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Treasury Board of Canada Secretariat (2025). Open Data Portal Catalogue [Dataset]. https://open.canada.ca/data/en/dataset/c4c5c7f1-bfa6-4ff6-b4a0-c164cb2060f7
Explore at:
csv, sqlite, json, png, jsonl, xlsxAvailable download formats
Dataset updated
Jul 13, 2025
Dataset provided by
Treasury Board of Canada Secretariathttp://www.tbs-sct.gc.ca/
Treasury Board of Canadahttps://www.canada.ca/en/treasury-board-secretariat/corporate/about-treasury-board.html
License
Open Government Licence - Canada 2.0https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
Description
The open data portal catalogue is a downloadable dataset containing some key metadata for the general datasets available on the Government of Canada's Open Data portal. Resource 1 is generated using the ckanapi tool (external link) Resources 2 - 8 are generated using the Flatterer (external link) utility. ###Description of resources: 1. Dataset is a JSON Lines (external link) file where the metadata of each Dataset/Open Information Record is one line of JSON. The file is compressed with GZip. The file is heavily nested and recommended for users familiar with working with nested JSON. 2. Catalogue is a XLSX workbook where the nested metadata of each Dataset/Open Information Record is flattened into worksheets for each type of metadata. 3. datasets metadata contains metadata at the dataset level. This is also referred to as the package in some CKAN documentation. This is the main table/worksheet in the SQLite database and XLSX output. 4. Resources Metadata contains the metadata for the resources contained within each dataset. 5. resource views metadata contains the metadata for the views applied to each resource, if a resource has a view configured. 6. datastore fields metadata contains the DataStore information for CSV datasets that have been loaded into the DataStore. This information is displayed in the Data Dictionary for DataStore enabled CSVs. 7. Data Package Fields contains a description of the fields available in each of the tables within the Catalogue, as well as the count of the number of records each table contains. 8. data package entity relation diagram Displays the title and format for column, in each table in the Data Package in the form of a ERD Diagram. The Data Package resource offers a text based version. 9. SQLite Database is a .db database, similar in structure to Catalogue. This can be queried with database or analytical software tools for doing analysis.
A
‘🍷 Alcohol vs Life Expectancy’ analyzed by Analyst-2
analyst-2.ai
Updated Dec 24, 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2016). ‘🍷 Alcohol vs Life Expectancy’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-alcohol-vs-life-expectancy-bdda/590be6d0/?iid=002-384&v=presentation
Explore at:
Dataset updated
Dec 24, 2016
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘🍷 Alcohol vs Life Expectancy’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/alcohol-vs-life-expectancye on 13 February 2022.

--- Dataset description provided by original source is as follows ---

Credit: This dataset was created by Jonathan Ortiz! All credits for the original go to the original author!

About this dataset

Intro
2016">https://data.world/uncc-dsba/dsba-6100-fall-2016

Findings

There is a surprising relationship between alcohol consumption and life expectancy. In fact, the data suggest that life expectancy and alcohol consumption are positively correlated - 1.2 additional years for every 1 liter of alcohol consumed annually. This is, of course, a spurious finding, because the correlation of this relationship is very low - 0.28. This indicates that other factors in those countries where alcohol consumption is comparatively high or low are contributing to differences in life expectancy, and further analysis is warranted.

https://data.world/api/databeats/dataset/alcohol-vs-life-expectancy/file/raw/LifeExpectancy_v_AlcoholConsumption_Plot.jpg" alt="LifeExpectancy_v_AlcoholConsumption_Plot.jpg">

Methods

1. Addressing Missing Values

The original drinks.csv file in the UNCC/DSBA-6100 dataset was missing values for The Bahamas, Denmark, and Macedonia for the wine, spirits, and beer attributes, respectively. Drinks_solution.csv shows these values filled in, for which I used the Mean of the rest of the data column.

Other methods were considered and ruled out:

Deleting the Bahamas, Denmark, and Macedonia instances altogether - This is a possible route, but the data file itself is just under 200 rows, and there is only one observation for each country. Because the dataset is relatively small by number of instances, removal should be avoided in order to give the model more data to use.

Imputing missing values with k-Nearest Neighbors - Another possible route, knn impute can yield higher accuracy in certain cases when the dataset is fairly large. However, this particular dataset only contains 3 attributes, all of which seem unrelated to each other. If we had more columns with more data like availability, annual sales, preferences, etc. of the different drinks, it would be possible to predict these values with knn, but this approach should be avoided given the data we have.

Filling missing values with a MODE - By visualizing the data, it is easy to see that each column is fairly skewed, with many countries reporting 0 in one or more of the servings columns. Using the MODE would fill these missing entries with 0 for all three (beer_servings, spirit_servings, and wine_servings), and upon reviewing the Bahamas, Denmark, and Macedonia more closely, it is apparent that 0 would be a poor choice for the missing values, as all three countries clearly consume alcohol.

Filling missing values with MEDIAN - Due to the skewness mentioned just above in the MODE section, using a MEDIAN of the whole column would also be a poor choice, as the MEDIAN is pulled down by several countries reporting 0 or 1. A MEDIAN of only the observations reporting 1 or more servings--or another cutoff--could be used, however, and this would be acceptable.

Filling missing values with MEAN - In the case of the drinks dataset, this is the best approach. The MEAN averages for the columns happen to be very close to the actual data from where we sourced this exercise. In addition, the MEAN will not skew the data, which the prior approaches would do.

2. Calculating New Attributes

The original drinks.csv dataset also had an empty data column: total_litres_of_pure_alcohol. This column needed to be calculated in order to do a simple 2D plot and trendline. It would have been possible to instead run a multi-variable regression on the data and therefore skip this step, but this adds an extra layer of complication to understanding the analysis - not to mention the point of the exercise is to go through an example of calculating new attributes (or "feature engineering") using domain knowledge.

The graphic found at the Wikipedia / Standard Drink page shows the following breakdown:

Beer - 12 fl oz per serving - 5% average ABV

Wing - 5 fl oz - 12% ABV

Spirits - 1.5 fl oz - 40% ABV

The conversion factor from fl oz to L is 1 fl oz : 0.0295735 L

Therefore, the following formula was used to compute the empty column:
total_litres_of_pure_alcohol
=
(beer_servings * 12 fl oz per serving * 0.05 ABV + spirit_servings * 1.5 fl oz * 0.4 ABV + wine_servings * 5 fl oz * 0.12 ABV) * 0.0295735 liters per fl oz

3. Joining To External Data

The lifeexpectancy.csv datafile in the https://data.world/uncc-dsba/dsba-6100-fall-2016 dataset contains life expectancy data for each country. The following query will join this data to the cleaned drinks.csv data file:

# Life Expectancy vs Alcohol Consumption

Life Expectancy vs. Alcohol Consumption with countryTable

PREFIX drinks: <http://data.world/databeats/alcohol-vs-life-expectancy/drinks_solution.csv/drinks_solution#> PREFIX life: <http://data.world/uncc-dsba/dsba-6100-fall-2016/lifeexpectancy.csv/lifeexpectancy#> PREFIX countries: <http://data.world/databeats/alcohol-vs-life-expectancy/countryTable.csv/countryTable#> SELECT ?country ?alc ?years WHERE { SERVICE <https://query.data.world/sparql/databeats/alcohol-vs-life-expectancy> { ?r1 drinks:total_litres_of_pure_alcohol ?alc . ?r1 drinks:country ?country . ?r2 countries:drinksCountry ?country . ?r2 countries:leCountry ?leCountry . } SERVICE <https://query.data.world/sparql/uncc-dsba/dsba-6100-fall-2016> { ?r3 life:CountryDisplay ?leCountry . ?r3 life:GhoCode ?gho_code . ?r3 life:Numeric ?years . ?r3 life:YearCode ?reporting_year . ?r3 life:SexDisplay ?sex . } FILTER ( ?gho_code = "WHOSIS_000001" && ?reporting_year = 2013 && ?sex = "Both sexes" ) } ORDER BY ?country

4. Plotting

The resulting joined data can then be saved to local disk and imported into any analysis tool like Excel, Numbers, R, etc. to make a simple scatterplot. A trendline and R^2 should be added to determine the relationship between Alcohol Consumption and Life Expectancy (if any).

https://data.world/api/databeats/dataset/alcohol-vs-life-expectancy/file/raw/LifeExpectancy_v_AlcoholConsumption_Plot.jpg" alt="LifeExpectancy_v_AlcoholConsumption_Plot.jpg">

This dataset was created by Jonathan Ortiz and contains around 200 samples along with Beer Servings, Spirit Servings, technical information and other features such as: - Total Litres Of Pure Alcohol - Wine Servings - and more.

How to use this dataset

Analyze Beer Servings in relation to Spirit Servings

Study the influence of Total Litres Of Pure Alcohol on Wine Servings

More datasets

Acknowledgements

If you use this dataset in your research, please credit Jonathan Ortiz

Start A New Notebook!

--- Original source retains full ownership of the source dataset ---
o
Simple Multimodal Algorithmic Reasoning Task Dataset (SMART-101)
explore.openaire.eu
data.niaid.nih.gov
Updated Mar 23, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anoop Cherian; Kuan-Chuan Peng; Suhas Lohit; Kevin A. Smith; Joshua B. Tenenbaum (2023). Simple Multimodal Algorithmic Reasoning Task Dataset (SMART-101) [Dataset]. http://doi.org/10.5281/zenodo.7761799
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.7761799
Dataset updated
Mar 23, 2023
Authors
Anoop Cherian; Kuan-Chuan Peng; Suhas Lohit; Kevin A. Smith; Joshua B. Tenenbaum
Description
Introduction Recent times have witnessed an increasing number of applications of deep neural networks towards solving tasks that require superior cognitive abilities, e.g., playing Go, generating art, ChatGPT, etc. Such a dramatic progress raises the question: how generalizable are neural networks in solving problems that demand broad skills? To answer this question, we propose SMART: a Simple Multimodal Algorithmic Reasoning Task (and the associated SMART-101 dataset) for evaluating the abstraction, deduction, and generalization abilities of neural networks in solving visuo-linguistic puzzles designed specifically for children of younger age (6--8). Our dataset consists of 101 unique puzzles; each puzzle comprises a picture and a question, and their solution needs a mix of several elementary skills, including pattern recognition, algebra, and spatial reasoning, among others. To train deep neural networks, we programmatically augment each puzzle to 2,000 new instances; each instance varied in appearance, associated natural language question, and its solution. To foster research and make progress in the quest for artificial general intelligence, we are publicly releasing our SMART-101 dataset, consisting of the full set of programmatically-generated instances of 101 puzzles and their solutions. The dataset was introduced in our paper Are Deep Neural Networks SMARTer than Second Graders? by Anoop Cherian, Kuan-Chuan Peng, Suhas Lohit, Kevin A. Smith, and Joshua B. Tenenbaum, CVPR 2023 Files in the unzipped folder: ./README.md: This Markdown file ./SMART101-Data: Folder containing all the puzzle data. See below for details. ./puzzle_type_info.csv: Puzzle categorization (into 8 skill classes). Dataset Organization The dataset consists of 101 folders (numbered from 1-101); each folder corresponds to one distinct puzzle (root puzzle). There are 2000 puzzle instances programmatically created for each root puzzle, numbered from 1-2000. Every root puzzle index (in [1,101]) folder contains: (i) img/ and (ii) puzzle_<index>.csv. The folder img/ is the location where the puzzle instance images are stored, and puzzle_<index>.csv the non-image part of a puzzle. Specifically, a row of puzzle_<index>.csv is the following tuple: <id, Question, image, A, B, C, D, E, Answer>, where id is the puzzle instance id (in [1,2000]), Question is the puzzle question associated with the instance, image is the name of the image (in img/ folder) corresponding to this instance id, A, B, C, D, E are the five answer candidates, and Answer is the answer to the question. At a Glance The size of the unzipped dataset is ~12GB. The dataset consists of 101 folders (numbered from 1-101); each folder corresponds to one distinct puzzle (root puzzle). There are 2000 puzzle instances programmatically created for each root puzzle, numbered from 1-2000. Every root puzzle index (in [1,101]) folder contains: (i) img/ and (ii) puzzle_<index>.csv. The folder img/ is the location where the puzzle instance images are stored, and puzzle_<index>.csv contains the non-image part of a puzzle. Specifically, a row of puzzle_<index>.csv is the following tuple: <id, Question, image, A, B, C, D, E, Answer>, where id is the puzzle instance id (in [1,2000]), Question is the puzzle question associated with the instance, image is the name of the image (in img/ folder) corresponding to this instance id, A, B, C, D, E are the five answer candidates, and Answer is the correct answer to the question. Other Details In our paper Are Deep Neural Networks SMARTer than Second Graders?, we provide four different dataset splits for evaluation: (i) Instance Split (IS), (ii) Answer Split (AS), (iii) Puzzle Split (PS), and (iv) Few-shot Split (FS). Below, we provide the details of each split to make fair comparisons to the results reported in our paper. Puzzle Split (PS) We use the following root puzzle ids as the Train and Test sets. Split Root Puzzle Id Sets Test { 94,95, 96, 97, 98, 99, 101, 61,62, 65, 66,67, 69, 70, 71,72,73,74,75,76,77} Train {1,2,...,101} \ Test Evaluation is done on all the Test puzzles and their accuracies averaged. For the 'Test' puzzles, we use the instance indices 1701-2000 in the evaluation. Few-shot Split (FS) We randomly select k number of instances from the Test sets (that are used in the PS split above) for training in FS split (e.g., k=100). These k few-shot samples are taken from instance indices 1-1600 of the respective puzzles and evaluation is conducted on all instance ids from 1701-2000. Instance Split (IS) We split the instances under every root puzzle as: Train = 1-1600, Val = 1601-1700, Test = 1701-2000. We train the neural network models using the Train split puzzle instances from all the root puzzles together and evaluate on the Test split of all puzzles. Answer Split (AS) We find the median answer value among all the...
h
Data from: json-schema
huggingface.co
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Data Unity Lab, json-schema [Dataset]. https://huggingface.co/datasets/dataunitylab/json-schema
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset authored and provided by
Data Unity Lab
License
https://choosealicense.com/licenses/unknown/https://choosealicense.com/licenses/unknown/
Description
JSON Schema Dataset

This dataset consists of a collection of JSON Schema documents collected from GitHub by searching using the Sourcegraph API.

Step 1: Find a list of JSON Schema paths

The Sourcegraph code search API is used to find files with a .json extension and containing { "$schema": "https://json-schema.org/". This is somewhat restrictive, but still manages to find a large number of schemas. pipenv run python slurp.py --outfile repos.csv

Step 2:… See the full description on the dataset page: https://huggingface.co/datasets/dataunitylab/json-schema.
h
RF-Collection
huggingface.co
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Data Mining and Information Systems Lab, RF-Collection [Dataset]. https://huggingface.co/datasets/dmis-lab/RF-Collection
Explore at:
Authors
Data Mining and Information Systems Lab
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
RF_Collection

Dataset Description

We construct a large-scale dataset called RF-Collection, containing Retrievers' Feedback on oer 410k query rewrites across 12K conversations.

Dataset Files

The dataset is organized into several CSV files, each corresponding to different retrieval and datasets:

TopiOCQA_train_bm25.csv: Contains the retrieval results using the BM25 on the TopiOCQA dataset. TopiOCQA_train_ance.csv: Contains the retrieval results using the ANCE on… See the full description on the dataset page: https://huggingface.co/datasets/dmis-lab/RF-Collection.
Data and code associated with "The Observed Availability of Data and Code in...
zenodo.org
csv, text/x-python +1
Updated Feb 20, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Erin Jones; Brandon McClung; Hadi Fawad; Amy McGovern; Erin Jones; Brandon McClung; Hadi Fawad; Amy McGovern (2025). Data and code associated with "The Observed Availability of Data and Code in Earth Science and Artificial Intelligence" [Dataset]. http://doi.org/10.5281/zenodo.14902836
Explore at:
csv, txt, text/x-pythonAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.14902836
Dataset updated
Feb 20, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Erin Jones; Brandon McClung; Hadi Fawad; Amy McGovern; Erin Jones; Brandon McClung; Hadi Fawad; Amy McGovern
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Feb 2025
Area covered
Earth
Description
Data and code associated with "The Observed Availability of Data and Code in Earth Science
and Artificial Intelligence" by Erin A. Jones, Brandon McClung, Hadi Fawad, and Amy McGovern.

Instructions: To reproduce figures, download all associated Python and CSV files and place
in a single directory.
Run BAMS_plot.py as you would run Python code on your system.

Code:
BAMS_plot.py: Python code for categorizing data availability statements based on given data
documented below and creating figures 1-3.

Code was originally developed for Python 3.11.7 and run in the Spyder
(version 5.4.3) IDE.

Libraries utilized:
numpy (version 1.26.4)
pandas (version 2.1.4)
matplotlib (version 3.8.0)

For additional documentation, please see code file.

Data:
ASDC_AIES.csv: CSV file containing relevant availability statement data for Artificial
Intelligence for the Earth Systems (AIES)
ASDC_AI_in_Geo.csv: CSV file containing relevant availability statement data for Artificial
Intelligence in Geosciences (AI in Geo.)
ASDC_AIJ.csv: CSV file containing relevant availability statement data for Artificial
Intelligence (AIJ)
ASDC_MWR.csv: CSV file containing relevant availability statement data for Monthly
Weather Review (MWR)

Data documentation:
All CSV files contain the same format of information for each journal. The CSV files above are
needed for the BAMS_plot.py code attached.

Records were analyzed based on the criteria below.

Records:
1) Title of paper
The title of the examined journal article.
2) Article DOI (or URL)
A link to the examined journal article. For AIES, AI in Geo., MWR, the DOI is
generally given. For AIJ, the URL is given.
3) Journal name
The name of the journal where the examined article is published. Either a full
journal name (e.g., Monthly Weather Review), or the acronym used in the
associated paper (e.g., AIES) is used.
4) Year of publication
The year the article was posted online/in print.
5) Is there an ASDC?
If the article contains an availability statement in any form, "yes" is
recorded. Otherwise, "no" is recorded.
6) Justification for non-open data?
If an availability statement contains some justification for why data is not
openly available, the justification is summarized and recorded as one of the
following options: 1) Dataset too large, 2) Licensing/Proprietary, 3) Can be
obtained from other entities, 4) Sensitive information, 5) Available at later
date. If the statement indicates any data is not openly available and no
justification is provided, or if no statement is provided is provided "None"
is recorded. If the statement indicates openly available data or no data
produced, "N/A" is recorded.
7) All data available
If there is an availability statement and data is produced, "y" is recorded
if means to access data associated with the article are given and there is no
indication that any data is not openly available; "n" is recorded if no means
to access data are given or there is some indication that some or all data is
not openly available. If there is no availability statement or no data is
produced, the record is left blank.
8) At least some data available
If there is an availability statement and data is produced, "y" is recorded
if any means to access data associated with the article are given; "n" is
recorded if no means to access data are given. If there is no availability
statement or no data is produced, the record is left blank.
9) All code available
If there is an availability statement and data is produced, "y" is recorded
if means to access code associated with the article are given and there is no
indication that any code is not openly available; "n" is recorded if no means
to access code are given or there is some indication that some or all code is
not openly available. If there is no availability statement or no data is
produced, the record is left blank.
10) At least some code available
If there is an availability statement and data is produced, "y" is recorded
if any means to access code associated with the article are given; "n" is
recorded if no means to access code are given. If there is no availability
statement or no data is produced, the record is left blank.
11) All data available upon request
If there is an availability statement indicating data is produced and no data
is openly available, "y" is recorded if any data is available upon request to
the authors of the examined journal article (not a request to any other
entity); "n" is recorded if no data is available upon request to the authors
of the examined journal article. If there is no availability statement, any
data is openly available, or no data is produced, the record is left blank.
12) At least some data available upon request
If there is an availability statement indicating data is produced and not all
data is openly available, "y" is recorded if all data is available upon
request to the authors of the examined journal article (not a request to any
other entity); "n" is recorded if not all data is available upon request to
the authors of the examined journal article. If there is no availability
statement, all data is openly available, or no data is produced, the record
is left blank.
13) no data produced
If there is an availability statement that indicates that no data was
produced for the examined journal article, "y" is recorded. Otherwise, the
record is left blank.
14) links work
If the availability statement contains one or more links to a data or code
repository, "y" is recorded if all links work; "n" is recorded if one or more
links do not work. If there is no availability statement or the statement
does not contain any links to a data or code repository, the record is left
blank.

Data for "Direct and indirect Rod and Frame effect: A virtual reality study"...

data.mendeley.com

Updated Feb 12, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

Michał Adamski (2025). Data for "Direct and indirect Rod and Frame effect: A virtual reality study" [Dataset]. http://doi.org/10.17632/pcf2n8b4rd.1

Explore at:

Unique identifier

https://doi.org/10.17632/pcf2n8b4rd.1

Dataset updated

Feb 12, 2025

Authors

Michał Adamski

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This dataset contains the raw experimental data and supplementary materials for the "Asymmetry Effects in Virtual Reality Rod and Frame Test". The materials included are:

•  Raw Experimental Data: older.csv and young.csv
•  Mathematica Notebooks: a collection of Mathematica notebooks used for data analysis and visualization. These notebooks provide scripts for processing the experimental data, performing statistical analyses, and generating the figures used in the project.
•  Unity Package: a Unity package featuring a sample scene related to the project. The scene was built using Unity’s Universal Rendering Pipeline (URP). To utilize this package, ensure that URP is enabled in your Unity project. Instructions for enabling URP can be found in the Unity URP Documentation.

Requirements:

•  For Data Files: software capable of opening CSV files (e.g., Microsoft Excel, Google Sheets, or any programming language that can read CSV formats).
•  For Mathematica Notebooks: Wolfram Mathematica software to run and modify the notebooks.
•  For Unity Package: Unity Editor version compatible with URP (2019.3 or later recommended). URP must be installed and enabled in your Unity project.

Usage Notes:

•  The dataset facilitates comparative studies between different age groups based on the collected variables.
•  Users can modify the Mathematica notebooks to perform additional analyses.
•  The Unity scene serves as a reference to the project setup and can be expanded or integrated into larger projects.

Citation: Please cite this dataset when using it in your research or publications.

MetaGraspNet Single Class Mutiple Instance Dataset
kaggle.com
Updated Apr 1, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yuhao Chen (2022). MetaGraspNet Single Class Mutiple Instance Dataset [Dataset]. https://www.kaggle.com/metagrasp/metagraspnet-single-class-mutiple-instance-dataset/discussion
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 1, 2022
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Yuhao Chen
License
Attribution-NoDerivs 4.0 (CC BY-ND 4.0)https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
Description
MetaGraspNet dataset

This repository contains the MetaGraspNet Dataset described in the paper "MetaGraspNet: A Large-Scale Benchmark Dataset for Vision-driven Robotic Grasping via Physics-based Metaverse Synthesis" (https://arxiv.org/abs/2112.14663 ).

There has been increasing interest in smart factories powered by robotics systems to tackle repetitive, laborious tasks. One particular impactful yet challenging task in robotics-powered smart factory applications is robotic grasping: using robotic arms to grasp objects autonomously in different settings. Robotic grasping requires a variety of computer vision tasks such as object detection, segmentation, grasp prediction, pick planning, etc. While significant progress has been made in leveraging of machine learning for robotic grasping, particularly with deep learning, a big challenge remains in the need for large-scale, high-quality RGBD datasets that cover a wide diversity of scenarios and permutations.

To tackle this big, diverse data problem, we are inspired by the recent rise in the concept of metaverse, which has greatly closed the gap between virtual worlds and the physical world. In particular, metaverses allow us to create digital twins of real-world manufacturing scenarios and to virtually create different scenarios from which large volumes of data can be generated for training models. We present MetaGraspNet: a large-scale benchmark dataset for vision-driven robotic grasping via physics-based metaverse synthesis. The proposed dataset contains 100,000 images and 25 different object types, and is split into 5 difficulties to evaluate object detection and segmentation model performance in different grasping scenarios. We also propose a new layout-weighted performance metric alongside the dataset for evaluating object detection and segmentation performance in a manner that is more appropriate for robotic grasp applications compared to existing general-purpose performance metrics. This repository contains the first phase of MetaGraspNet benchmark dataset which includes detailed object detection, segmentation, layout annotations, and a script for layout-weighted performance metric (https://github.com/y2863/MetaGraspNet ).

https://raw.githubusercontent.com/y2863/MetaGraspNet/main/.github/500.png">

Citing MetaGraspNet

If you use MetaGraspNet dataset or metric in your research, please use the following BibTeX entry. BibTeX @article{chen2021metagraspnet, author = {Yuhao Chen and E. Zhixuan Zeng and Maximilian Gilles and Alexander Wong}, title = {MetaGraspNet: a large-scale benchmark dataset for vision-driven robotic grasping via physics-based metaverse synthesis}, journal = {arXiv preprint arXiv:2112.14663}, year = {2021} }

File Structure

This dataset is arranged in the following file structure:

root |-- meta-grasp |-- scene0 |-- 0_camera_params.json |-- 0_depth.png |-- 0_rgb.png |-- 0_order.csv ... |-- scene1 ... |-- difficulty-n-coco-label.json

Each scene is an unique arrangement of objects, which we then display at various different angles. For each shot of a scene, we provide the camera parameters (x_camara_params.json), a depth image (x_depth.png), an rgb image (x_rgb.png), as well as a matrix representation of the ordering of each object (x_order.csv). The full label for the image are all available in difficulty-n-coco-label.json (where n is the difficulty level of the dataset) in the coco data format.

Understanding order.csv

The matrix describes a pairwise obstruction relationship between each object within the image. Given a "parent" object covering a "child" object: relationship_matrix[child_id, parent_id] = -1
h
AIME_Hallucination_Detection
huggingface.co
Updated Feb 5, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gollam Rabby (2025). AIME_Hallucination_Detection [Dataset]. https://huggingface.co/datasets/tourist800/AIME_Hallucination_Detection
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 5, 2025
Authors
Gollam Rabby
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
AIME Hallucination Detection Dataset

This dataset is created for detecting hallucinations in Large Language Models (LLMs), particularly focusing on complex mathematical problems. It can be used for tasks like model evaluation, fine-tuning, and research.

Dataset Details

Name: AIME Hallucination Detection Dataset
Format: CSV
Size: (14.6 MB)
Files Included:
AIME-hallucination-detection-dataset.csv: Contains the dataset.

Content Description

The dataset… See the full description on the dataset page: https://huggingface.co/datasets/tourist800/AIME_Hallucination_Detection.

Facebook

Twitter

Click to copy link

Link copied

Cite

Stefano Zacchiroli (2022). A Large-scale Dataset of (Open Source) License Text Variants [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6379163

Data from: A Large-scale Dataset of (Open Source) License Text Variants

Explore at:

Dataset updated

Mar 31, 2022

Dataset authored and provided by

Stefano Zacchiroli

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

We introduce a large-scale dataset of the complete texts of free/open source software (FOSS) license variants. To assemble it we have collected from the Software Heritage archive—the largest publicly available archive of FOSS source code with accompanying development history—all versions of files whose names are commonly used to convey licensing terms to software users and developers. The dataset consists of 6.5 million unique license files that can be used to conduct empirical studies on open source licensing, training of automated license classifiers, natural language processing (NLP) analyses of legal texts, as well as historical and phylogenetic studies on FOSS licensing. Additional metadata about shipped license files are also provided, making the dataset ready to use in various contexts; they include: file length measures, detected MIME type, detected SPDX license (using ScanCode), example origin (e.g., GitHub repository), oldest public commit in which the license appeared. The dataset is released as open data as an archive file containing all deduplicated license blobs, plus several portable CSV files for metadata, referencing blobs via cryptographic checksums.

For more details see the included README file and companion paper:

Stefano Zacchiroli. A Large-scale Dataset of (Open Source) License Text Variants. In proceedings of the 2022 Mining Software Repositories Conference (MSR 2022). 23-24 May 2022 Pittsburgh, Pennsylvania, United States. ACM 2022.

If you use this dataset for research purposes, please acknowledge its use by citing the above paper.

Clear search

Close search

Google apps

Main menu

Data from: A Large-scale Dataset of (Open Source) License Text Variants

The Canada Trademarks Dataset

Data from: Dataflow Analysis-Inspired Deep Learning for Efficient...

(No) Influence of Continuous Integration on the Development Activity in...

LLM: 7 prompt training dataset

1000 Empirical Time series

Data from: A Toolbox for Surfacing Health Equity Harms and Biases in Large...

Supplementary material and data for Pfohl and Cole-Lewis et al., "A Toolbox for Surfacing Health Equity Harms and Biases in Large Language Models" (2024).

References

Description of files and sheets

Version history

GPT vs Stack Overflow: data collection (A2I2 T2 2023)

News Interactions on Globo.com Dataset

Data release: Flood and Storm Tracker (FaST) data

Data from: The Software Heritage License Dataset (2022 Edition)

Open Data Portal Catalogue

‘🍷 Alcohol vs Life Expectancy’ analyzed by Analyst-2

Credit: This dataset was created by Jonathan Ortiz! All credits for the original go to the original author!

About this dataset

Intro

Findings

Methods

1. Addressing Missing Values

2. Calculating New Attributes

3. Joining To External Data

Life Expectancy vs. Alcohol Consumption with countryTable

4. Plotting

How to use this dataset

Acknowledgements

Start A New Notebook!

Simple Multimodal Algorithmic Reasoning Task Dataset (SMART-101)

Data from: json-schema

RF-Collection

Data and code associated with "The Observed Availability of Data and Code in...

Data for "Direct and indirect Rod and Frame effect: A virtual reality study"...

MetaGraspNet Single Class Mutiple Instance Dataset

MetaGraspNet dataset

Citing MetaGraspNet

File Structure

Understanding order.csv

AIME_Hallucination_Detection

Data from: A Large-scale Dataset of (Open Source) License Text Variants