93 datasets found

food data cleaning
kaggle.com
zip
Updated Apr 13, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
AbdElRahman16 (2024). food data cleaning [Dataset]. https://www.kaggle.com/datasets/abdelrahman16/food-n
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Apr 13, 2024
Authors
AbdElRahman16
Description
Dataset

This dataset was created by AbdElRahman16

Contents
_labels1.csv. This data set representss the label of the corresponding...
figshare.com
txt
Updated Oct 9, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
naillah gul (2023). _labels1.csv. This data set representss the label of the corresponding samples in data.csv file [Dataset]. http://doi.org/10.6084/m9.figshare.24270088.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.24270088.v1
Dataset updated
Oct 9, 2023
Dataset provided by
Figsharehttp://figshare.com/
figshare
Authors
naillah gul
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The datasets contain pixel-level hyperspectral data of six snow and glacier classes. They have been extracted from a Hyperspectral image. The dataset "data.csv" has 5417 * 142 samples belonging to the classes: Clean snow, Dirty ice, Firn, Glacial ice, Ice mixed debris, and Water body. The dataset "_labels1.csv" has corresponding labels of the "data.csv" file. The dataset "RGB.csv" has only 5417 * 3 samples. There are only three band values in this file while "data.csv" has 142 band values.
B
Data Cleaning Sample
borealisdata.ca
dataone.org
Updated Jul 13, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.5683/SP3/ZCN177
Dataset updated
Jul 13, 2023
Dataset provided by
Borealis
Authors
Rong Luo
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Sample data for exercises in Further Adventures in Data Cleaning.
ToS;DR policies dataset (clean)
zenodo.org
csv
Updated May 5, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mahmoud Istaiti; Mahmoud Istaiti (2025). ToS;DR policies dataset (clean) [Dataset]. http://doi.org/10.5281/zenodo.15013541
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.15013541
Dataset updated
May 5, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Mahmoud Istaiti; Mahmoud Istaiti
License
https://www.gnu.org/licenses/gpl-3.0-standalone.htmlhttps://www.gnu.org/licenses/gpl-3.0-standalone.html
Description
Overview

This dataset contains two CSV files derived from Terms of Service; Didn't Read (ToS;DR) data. These files contain analyzed and categorized terms of service snippets from various online services after the cleaning process. The privacy dataset is a subset of the full dataset, focusing exclusively on privacy-related terms.

File Descriptions

1. clean_tosdr_all_data.csv

This file contains a comprehensive collection of terms of service data.

Each row represents a statement (or "point") extracted from a service's terms.

Key columns:

point_quote_text: Extracted text from the terms of service.

case_id: Unique identifier for the case.

case_title: Brief description of the case.

topic_id: Unique identifier for the topic.

topic_title: Broad category the case falls under (e.g., Transparency, Copyright License).

2. clean_tosdr_privacy_data.csv

This file is a subset of clean_tosdr_all_data.csv containing only privacy-related entries.

Includes cases related to tracking, data collection, account deletion policies, and more.

Has the same structure as clean_tosdr_all_data.csv but filtered to include only privacy-related topics.

Usage

Use clean_tosdr_all_data.csv for a broad analysis of various terms of service aspects.

Use clean_tosdr_privacy_data.csv for focused studies on privacy-related clauses.
Datasets and scripts related to the paper: "*Can Generative AI Help us in...
zenodo.org
explore.openaire.eu
zip
Updated Jul 30, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anonymous Anonymous; Anonymous Anonymous (2024). Datasets and scripts related to the paper: "*Can Generative AI Help us in Qualitative Software Engineering?*" [Dataset]. http://doi.org/10.5281/zenodo.13134104
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.13134104
Dataset updated
Jul 30, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Anonymous Anonymous; Anonymous Anonymous
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This replication package contains datasets and scripts related to the paper: "*Can Generative AI Help us in Qualitative Software Engineering?*"

The replication package is organized into two directories:

- `manual_analysis`: This directory contains all sheets used to perform the manual analysis for RQ1, RQ2, and RQ3.

- `stats`: This directory contains all datasets, scripts, and results metrics used for the quantitative analyses of RQ1 and RQ2.

In the following, we describe the content of each directory:

## manual_analysis

- `manual_analysis_rq1`: This directory contains all sheets used to perform manual analysis for RQ1 (independent and incremental coding).

- The sub-directory `incremental_coding` contains .csv files for all datasets (`DL_Faults_COMMIT_incremental.csv`, `DL_Faults_ISSUE_incremental.csv`, `DL_Fault_SO_incremental.csv`, `DRL_Challenges_incremental.csv` and `Functional_incremental.csv`). All these .csv files contain the following columns:

- *Link*: The link to the instances

- *Prompt*: Prompt used as input to GPT-4-Turbo

- *ID*: Instance ID

- *FinalTag*: Tag assigned by the human in the original paper

- *Chatgpt\_output\_memory*: Output of GPT-4-Turbo with incremental coding

- *Chatgpt\_output\_memory\_clean*: (only for the DL Faults datasets) output of GPT-4-Turbo considering only the label assigned, excluding the text

- *Author1*: Label assigned by the first author

- *Author2*: Label assigned by the second author

- *FinalOutput*: Label assigned after the resolution of the conflicts

- The sub-directory `independent_coding` contains .csv files for all datasets (`DL_Faults_COMMIT_independent.csv`, `DL_Faults_ISSUE_ independent.csv`, `DL_Fault_SO_ independent.csv`, `DRL_Challenges_ independent.csv` and `Functional_ independent.csv`), containing the following columns:

- *Link*: The link to the instances

- *Prompt*: Prompt used as input to GPT-4-Turbo

- *ID*: Specific ID for the instance

- *FinalTag*: Tag assigned by the human in the original paper

- *Chatgpt\_output*: Output of GPT-4-Turbo with independent coding

- *Chatgpt\_output\_clean*: (only for DL Faults datasets) output of GPT-4-Turbo considering only the label assigned, excluding the text

- *Author1*: Label assigned by the first author

- *Author2*: Label assigned by the second author

- *FinalOutput*: Label assigned after the resolution of the conflicts.

- Also, the sub-directory contains sheets with inconsistencies after resolving conflicts. The directory `inconsistency_incremental_coding` contains .csv files with the following columns:

- *Dataset*: The dataset considered

- *Human*: The label assigned by the human in the original paper

- *Machine*: The label assigned by GPT-4-Turbo

- *Classification*: The final label assigned by the authors after resolving the conflicts. Multiple classifications for a single instance are separated by a comma “,”

- *Final*: final label assigned after the resolution of the incompatibilities

- Similarly, the sub-directory `inconsistency_independent_coding` contains a .csv file with the same columns as before, but this is for the case of independent coding.

- `manual_analysis_rq2`: This directory contains .csv files for all datasets (`DL_Faults_redundant_tag.csv`, `DRL_Challenges_redundant_tag.csv`, `Functional_redundant_tag.csv`) to perform manual analysis for RQ2.

- The `DL_Faults_redundant_tag.csv` file contains the following columns:

- *Tags Redundant*: tags identified as redundant by GPT-4-Turbo

- *Matched*: inspection by the authors to see if the tags are redundant matching or not

- *FinalTag*: final tag assigned by the authors after the resolution of the conflict

- The `Functional_redundant_tag.csv` file contains the same columns as before

- The `DRL_Challenges_redundant_tag.csv` file is organized as follows:

- *Tags Suggested*: The final tag suggested by GPT-4-Turbo

- *Tags Redundant*: tags identified as redundant by GPT-4-Turbo

- *Matched*: inspection by the authors to see if the tags redundant matching or not with the tags suggested

- *FinalTag*: final tag assigned by the authors after the resolution of the conflict

- The sub-directory `code_consolidation_mapping_overview` contains .csv files (`DL_Faults_rq2_overview.csv`, `DRL_Challenges_rq2_overview.csv`, `Functional_rq2_overview.csv`) organized as follows:

- *Initial_Tags*: list of the unique initial tags assigned by GPT-4-Turbo for each dataset

- *Mapped_tags*: list of tags mapped by GPT-4-Turbo

- *Unmatched_tags*: list of unmatched tags by GPT-4-Turbo

- *Aggregating_tags*: list of consolidated tags

- *Final_tags*: list of final tags after the consolidation task

## stats

- `RQ1`: contains script and datasets used to perform metrics for RQ1. The analysis calculates all possible combinations between Matched, More Abstract, More Specific, and Unmatched.

- `RQ1_Stats.ipynb` is a Python Jupyter nooteook to compute the RQ1 metrics. To use it, as explained in the notebook, it is necessary to change the values of variables contained in the first code block.

- `independent-prompting`: Contains the datasets related to the independent prompting. Each line contains the following fields:

- *Link*: Link to the artifact being tagged

- *Prompt*: Prompt sent to GPT-4-Turbo

- *FinalTag*: Artifact coding from the replicated study

- *chatgpt\_output_text*: GPT-4-Turbo output

- *chatgpt\_output*: Codes parsed from the GPT-4-Turbo output

- *Author1*: Annotator 1 evaluation of the coding

- *Author2*: Annotator 2 evaluation of the coding

- *FinalOutput*: Consolidated evaluation

- `incremental-prompting`: Contains the datasets related to the incremental prompting (same format as independent prompting)

- `results`: contains files for the RQ1 quantitative results. The files are named `RQ1\_<

- `RQ2`: contains the script used to perform metrics for RQ2, the datasets it uses, and its output.

- `RQ2_SetStats.ipynb` is the Python Jupyter notebook to perform the analyses. The scripts takes as input the following types of files, contained in the directory contains the script used to perform the metrics for RQ2. The script takes in input:

- RQ1 Data Files (`RQ1_DLFaults_Issues.csv`, `RQ1_DLFaults_Commits.csv`, and `RQ1_DLFaults_SO.csv`, joined in a single .csv `RQ1_DLFaults.csv`). These are the same files used in RQ1.

- Mapping Files (`RQ2_Mappings_DRL.csv`, `RQ2_Mappings_Functional.csv`, `RQ2_Mappings_DLFaults.csv`). These contain the mappings between human tags (*HumanTags*), GPT-4-Turbo tags (*Final Tags*), with indicated the type of matching (*MatchType*).

- Additional codes creating during the consolidation (`RQ2_newCodes_DRL.csv`, `RQ2_newCodes_Functional.csv`, `RQ2_newCodes_DLFaults.csv`), annotated with the matching: *new code*,*old code*,*human code*,*match type*

- Set files (`RQ2_Sets_DRL.csv`, `RQ2_Sets_Functional.csv`, `RQ2_Sets_DLFaults.csv`). Each file contains the following columns:

- *HumanTags*: List of tags from the original dataset

- *InitialTags*: Set of tags from RQ1,

- *ConsolidatedTags*: Tags that have been consolidated,

- *FinalTags*: Final set of tags (results of RQ2, used in RQ3)

- *NewTags*: New tags created during consolidation

- `RQ2_Set_Metrics.csv`: Reports the RQ2 output metrics (Precision, Recall, F1-Score, Jaccard).
11 Benchmark Clean-Clean ER datasets in CSV format
zenodo.org
application/gzip
Updated Feb 26, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anonymous; Anonymous (2025). 11 Benchmark Clean-Clean ER datasets in CSV format [Dataset]. http://doi.org/10.5281/zenodo.14923071
Explore at:
application/gzipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.14923071
Dataset updated
Feb 26, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Anonymous; Anonymous
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Contains:

D1: Contains restaurant descriptions, first introduced in OAEI 2010.

D2: Includes duplicate products from Abt.com and Buy.com.

D3: Matches product descriptions from Amazon and Google Base.

D4: Compares bibliographic data from DBLP and ACM.

D5, D6, D7: Contain descriptions of television shows and movies from TheTVDB, IMDb, and TMDb.

D8: Matches product descriptions from Walmart and Amazon.

D9: Involves bibliographic data from DBLP and Google Scholar.

D10: Links movie descriptions from IMDb and DBpedia.

D11: A large-scale dataset with millions of heterogeneous entities from two DBpedia versions spanning a 3-year gap.

Pokemon TCG Pocket Dataset

kaggle.com

Updated Jun 26, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

JoaoCoelho03 (2025). Pokemon TCG Pocket Dataset [Dataset]. https://www.kaggle.com/datasets/joaocoelho03/pocket-tcg-dataset

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Jun 26, 2025

Dataset provided by

Kaggle

Authors

JoaoCoelho03

License

https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

Description

Pokémon TCG Pocket Card Dataset

This dataset contains detailed information about all cards available in the Pokémon Trading Card Game Pocket mobile app. The data has been carefully curated and cleaned to provide Pokémon enthusiasts and developers with accurate and comprehensive card information.

Dataset Contents

8+ Complete Sets: All major card sets including latest expansions
1000+ Cards: Every card with detailed metadata and classifications
Clean Format: CSV format optimized for analysis, machine learning, and research

Key Features

🃏 Complete Card Data

Card names and numbers with proper formatting
Complete set and pack organization structure
Release dates for all sets and expansions
Total card counts per set for completion tracking

💎 Rarity Classifications

7+ Rarity Types including:
- Common, Uncommon, Rare
- Ultra Rare, Secret Rare, Special Art Rare
- Crown Rare and other premium classifications
Includes shiny and special variant cards
Standardized rarity naming conventions

Use Cases

📊 Data Analysis & Research

Card rarity distribution analysis across sets
Set completion and collection tracking

🤖 Machine Learning & AI

Card classification models
Recommendation systems for collectors
Rarity prediction algorithms
Collection optimization models

📈 Visualization & Dashboards

Interactive card browsers
Collection progress tracking
Rarity distribution charts
Set release timeline visualizations

Data Quality

✅ Manually Verified: All card information cross-checked for accuracy
✅ Standardized Format: Consistent naming and classification across all entries
✅ Complete Coverage: All available cards from the mobile game
✅ Clean Structure: Optimized for both human readability and machine processing

Technical Specifications

📋 File Format

Format: CSV (Comma Separated Values)
Encoding: UTF-8 with full international character support
Delimiter: Comma (,)
Headers: Included in first row

🗂️ Column Structure (9 columns)

Column	Description	Example
`set_name`	Full name of the card set	"Eevee Grove"
`set_code`	Official set identifier	"a3b"
`set_release_date`	Set release date	"June 26, 2025"
`set_total_cards`	Total cards in the set	107
`pack_name`	Name of the specific pack	"Eevee Grove"
`card_name`	Full card name	"Leafeon"
`card_number`	Card number within set	"2"
`card_rarity`	Rarity classification	"Rare"
`card_type`	Card type category	"Pokémon"

If you find this dataset useful, consider giving it an upvote — it really helps others discover it too! 🔼😊

Happy analyzing! 🎯📊

Alpaca Cleaned
kaggle.com
huggingface.co
Updated Nov 26, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The Devastator (2023). Alpaca Cleaned [Dataset]. https://www.kaggle.com/datasets/thedevastator/alpaca-language-instruction-training
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 26, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
The Devastator
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Alpaca Cleaned

Improving Pretrained Language Model Understanding

By Huggingface Hub [source]

About this dataset

Alpaca is the perfect dataset for fine-tuning your language models to better understand and follow instructions, capable of taking you beyond standard Natural Language Processing (NLP) abilities! This curated, cleaned dataset provides you with over 52,000 expertly crafted instructions and demonstrations generated by OpenAI's text-davinci-003 engine - all in English (BCP-47 en). Improve the quality of your language models with fields such as instruction, output, and input which have been designed to enhance every aspect of their comprehension. The data here has gone through rigorous cleaning to ensure there are no errors or biases present; allowing you to trust that this data will result in improved performance for any language model that uses it! Get ready to see what Alpaca can do for your NLP needs

More Datasets

For more datasets, click here.

Featured Notebooks

🚨 Your notebook can be here! 🚨!

How to use the dataset

This dataset provides a unique and valuable resource for anyone who wishes to create, develop and train language models. Alpaca provides users with 52,000 instruction-demonstration pairs generated by OpenAI's text-davinci-003 engine.

The data included in this dataset is formatted into 3 columns: “instruction”, “output” and “input.” All the data is written in English (BCP-47 en).

To make the most out of this dataset it is recommended to:

Familiarize yourself with the instructions in the instruction column as these provide guidance on how to use the other two columns; input and output.

Once comfortable with understanding the instructions columns move onto exploring what you are provided within each 14 sets of triplets – instruction, output and input – that are included in this clean version of Alpaca.

Read through many examples paying attention to any areas where you feel more clarification could be added or could be further improved upon for better understanding of language models however bear in mind that these examples have been cleaned from any errors or biases found from original dataset

Get inspired! As mentioned earlier there are more than 52k sets provided meaning having much flexibility for varying training strategies or unique approaches when creating your own language model!

Finally while not essential it may be helpful to have familiarity with OpenAI's text-davinci engine as well as enjoy playing around with different parameters/options depending on what type of outcomes you wish achieve

Research Ideas

Developing natural language processing (NLP) tasks that aim to better automate and interpret instructions given by humans.

Training machine learning models of robotic agents to be able to understand natural language commands, as well as understand the correct action that needs to be taken in response.

Creating a system that can generate personalized instructions and feedback in real time based on language models, catering specifically to each individual user's preferences or needs

Acknowledgements

If you use this dataset in your research, please credit the original authors. Data Source

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: train.csv | Column name | Description | |:----------------|:-------------------------------------------------------------------------| | instruction | This column contains the instructions for the language model. (Text) | | output | This column contains the expected output from the language model. (Text) | | input | This column contains the input given to the language model. (Text) |

Acknowledgements

If you use this dataset in your research, please credit the original authors. If you use this dataset in your research, please credit Huggingface Hub.
Z
Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic...
data.niaid.nih.gov
zenodo.org
Updated Dec 24, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lignos, Dimitrios G. (2022). Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6965146
Explore at:
Dataset updated
Dec 24, 2022
Dataset provided by
Hartloper, Alexander R.
Ozden, Selimcan
de Castro e Sousa, Albano
Lignos, Dimitrios G.
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials

Background

This dataset contains data from monotonic and cyclic loading experiments on structural metallic materials. The materials are primarily structural steels and one iron-based shape memory alloy is also included. Summary files are included that provide an overview of the database and data from the individual experiments is also included.

The files included in the database are outlined below and the format of the files is briefly described. Additional information regarding the formatting can be found through the post-processing library (https://github.com/ahartloper/rlmtp/tree/master/protocols).

Usage

The data is licensed through the Creative Commons Attribution 4.0 International.

If you have used our data and are publishing your work, we ask that you please reference both:

this database through its DOI, and

any publication that is associated with the experiments. See the Overall_Summary and Database_References files for the associated publication references.

Included Files

Overall_Summary_2022-08-25_v1-0-0.csv: summarises the specimen information for all experiments in the database.

Summarized_Mechanical_Props_Campaign_2022-08-25_v1-0-0.csv: summarises the average initial yield stress and average initial elastic modulus per campaign.

Unreduced_Data-#_v1-0-0.zip: contain the original (not downsampled) data

Where # is one of: 1, 2, 3, 4, 5, 6. The unreduced data is broken into separate archives because of upload limitations to Zenodo. Together they provide all the experimental data.

We recommend you un-zip all the folders and place them in one "Unreduced_Data" directory similar to the "Clean_Data"

The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.

There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the unreduced data.

The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.

Clean_Data_v1-0-0.zip: contains all the downsampled data

The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.

There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the clean data.

The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.

Database_References_v1-0-0.bib

Contains a bibtex reference for many of the experiments in the database. Corresponds to the "citekey" entry in the summary files.

File Format: Downsampled Data

These are the "LP_Specimen_processed_data.csv" files in the "Clean_Data" directory. The is the load protocol designation and the is the specimen number for that load protocol and material source. Each file contains the following columns:

The header of the first column is empty: the first column corresponds to the index of the sample point in the original (unreduced) data

Time[s]: time in seconds since the start of the test

e_true: true strain

Sigma_true: true stress in MPa

(optional) Temperature[C]: the surface temperature in degC

These data files can be easily loaded using the pandas library in Python through:

import pandas data = pandas.read_csv(data_file, index_col=0)

The data is formatted so it can be used directly in RESSPyLab (https://github.com/AlbanoCastroSousa/RESSPyLab). Note that the column names "e_true" and "Sigma_true" were kept for backwards compatibility reasons with RESSPyLab.

File Format: Unreduced Data

These are the "LP_Specimen_processed_data.csv" files in the "Unreduced_Data" directory. The is the load protocol designation and the is the specimen number for that load protocol and material source. Each file contains the following columns:

The first column is the index of each data point

S/No: sample number recorded by the DAQ

System Date: Date and time of sample

Time[s]: time in seconds since the start of the test

C_1_Force[kN]: load cell force

C_1_Déform1[mm]: extensometer displacement

C_1_Déplacement[mm]: cross-head displacement

Eng_Stress[MPa]: engineering stress

Eng_Strain[]: engineering strain

e_true: true strain

Sigma_true: true stress in MPa

(optional) Temperature[C]: specimen surface temperature in degC

The data can be loaded and used similarly to the downsampled data.

File Format: Overall_Summary

The overall summary file provides data on all the test specimens in the database. The columns include:

hidden_index: internal reference ID

grade: material grade

spec: specifications for the material

source: base material for the test specimen

id: internal name for the specimen

lp: load protocol

size: type of specimen (M8, M12, M20)

gage_length_mm_: unreduced section length in mm

avg_reduced_dia_mm_: average measured diameter for the reduced section in mm

avg_fractured_dia_top_mm_: average measured diameter of the top fracture surface in mm

avg_fractured_dia_bot_mm_: average measured diameter of the bottom fracture surface in mm

fy_n_mpa_: nominal yield stress

fu_n_mpa_: nominal ultimate stress

t_a_deg_c_: ambient temperature in degC

date: date of test

investigator: person(s) who conducted the test

location: laboratory where test was conducted

machine: setup used to conduct test

pid_force_k_p, pid_force_t_i, pid_force_t_d: PID parameters for force control

pid_disp_k_p, pid_disp_t_i, pid_disp_t_d: PID parameters for displacement control

pid_extenso_k_p, pid_extenso_t_i, pid_extenso_t_d: PID parameters for extensometer control

citekey: reference corresponding to the Database_References.bib file

yield_stress_mpa_: computed yield stress in MPa

elastic_modulus_mpa_: computed elastic modulus in MPa

fracture_strain: computed average true strain across the fracture surface

c,si,mn,p,s,n,cu,mo,ni,cr,v,nb,ti,al,b,zr,sn,ca,h,fe: chemical compositions in units of %mass

file: file name of corresponding clean (downsampled) stress-strain data

File Format: Summarized_Mechanical_Props_Campaign

Meant to be loaded in Python as a pandas DataFrame with multi-indexing, e.g.,

tab1 = pd.read_csv('Summarized_Mechanical_Props_Campaign_' + date + version + '.csv', index_col=[0, 1, 2, 3], skipinitialspace=True, header=[0, 1], keep_default_na=False, na_values='')

citekey: reference in "Campaign_References.bib".

Grade: material grade.

Spec.: specifications (e.g., J2+N).

Yield Stress [MPa]: initial yield stress in MPa

size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign

Elastic Modulus [MPa]: initial elastic modulus in MPa

size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign

Caveats

The files in the following directories were tested before the protocol was established. Therefore, only the true stress-strain is available for each:

A500

A992_Gr50

BCP325

BCR295

HYP400

S460NL

S690QL/25mm

S355J2_Plates/S355J2_N_25mm and S355J2_N_50mm
Z
Data from: ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction
data.niaid.nih.gov
explore.openaire.eu
+1more
Updated Jan 27, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Keshavarz, Hossein (2022). ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5907001
Explore at:
Dataset updated
Jan 27, 2022
Dataset provided by
Nagappan, Meiyappan
Keshavarz, Hossein
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction

This archive contains the ApacheJIT dataset presented in the paper "ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction" as well as the replication package. The paper is submitted to MSR 2022 Data Showcase Track.

The datasets are available under directory dataset. There are 4 datasets in this directory.

apachejit_total.csv: This file contains the entire dataset. Commits are specified by their identifier and a set of commit metrics that are explained in the paper are provided as features. Column buggy specifies whether or not the commit introduced any bug into the system.

apachejit_train.csv: This file is a subset of the entire dataset. It provides a balanced set that we recommend for models that are sensitive to class imbalance. This set is obtained from the first 14 years of data (2003 to 2016).

apachejit_test_large.csv: This file is a subset of the entire dataset. The commits in this file are the commits from the last 3 years of data. This set is not balanced to represent a real-life scenario in a JIT model evaluation where the model is trained on historical data to be applied on future data without any modification.

apachejit_test_small.csv: This file is a subset of the test file explained above. Since the test file has more than 30,000 commits, we also provide a smaller test set which is still unbalanced and from the last 3 years of data.

In addition to the dataset, we also provide the scripts using which we built the dataset. These scripts are written in Python 3.8. Therefore, Python 3.8 or above is required. To set up the environment, we have provided a list of required packages in file requirements.txt. Additionally, one filtering step requires GumTree [1]. For Java, GumTree requires Java 11. For other languages, external tools are needed. Installation guide and more details can be found here.

The scripts are comprised of Python scripts under directory src and Python notebooks under directory notebooks. The Python scripts are mainly responsible for conducting GitHub search via GitHub search API and collecting commits through PyDriller Package [2]. The notebooks link the fixed issue reports with their corresponding fixing commits and apply some filtering steps. The bug-inducing candidates then are filtered again using gumtree.py script that utilizes the GumTree package. Finally, the remaining bug-inducing candidates are combined with the clean commits in the dataset_construction notebook to form the entire dataset.

More specifically, git_token.py handles GitHub API token that is necessary for requests to GitHub API. Script collector.py performs GitHub search. Tracing changed lines and git annotate is done in gitminer.py using PyDriller. Finally, gumtree.py applies 4 filtering steps (number of lines, number of files, language, and change significance).

References:

GumTree

https://github.com/GumTreeDiff/gumtree

Jean-Rémy Falleri, Floréal Morandat, Xavier Blanc, Matias Martinez, and Martin Monperrus. 2014. Fine-grained and accurate source code differencing. In ACM/IEEE International Conference on Automated Software Engineering, ASE ’14,Vasteras, Sweden - September 15 - 19, 2014. 313–324

PyDriller

https://pydriller.readthedocs.io/en/latest/

Davide Spadini, Maurício Aniche, and Alberto Bacchelli. 2018. PyDriller: Python Framework for Mining Software Repositories. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering(Lake Buena Vista, FL, USA)(ESEC/FSE2018). Association for Computing Machinery, New York, NY, USA, 908–911
f
Cleaned NHANES 1988-2018
figshare.com
txt
Updated Feb 18, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet (2025). Cleaned NHANES 1988-2018 [Dataset]. http://doi.org/10.6084/m9.figshare.21743372.v9
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.21743372.v9
Dataset updated
Feb 18, 2025
Dataset provided by
figshare
Authors
Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The National Health and Nutrition Examination Survey (NHANES) provides data and have considerable potential to study the health and environmental exposure of the non-institutionalized US population. However, as NHANES data are plagued with multiple inconsistencies, processing these data is required before deriving new insights through large-scale analyses. Thus, we developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous (1999-2018), totaling 135,310 participants and 5,078 variables. The variables conveydemographics (281 variables),dietary consumption (324 variables),physiological functions (1,040 variables),occupation (61 variables),questionnaires (1444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood),medications (29 variables),mortality information linked from the National Death Index (15 variables),survey weights (857 variables),environmental exposure biomarker measurements (598 variables), andchemical comments indicating which measurements are below or above the lower limit of detection (505 variables).csv Data Record: The curated NHANES datasets and the data dictionaries includes 23 .csv files and 1 excel file.The curated NHANES datasets involves 20 .csv formatted files, two for each module with one as the uncleaned version and the other as the cleaned version. The modules are labeled as the following: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments."dictionary_nhanes.csv" is a dictionary that lists the variable name, description, module, category, units, CAS Number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES."dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables.“dictionary_drug_codes.csv” contains the dictionary for descriptors on the drugs codes.“nhanes_inconsistencies_documentation.xlsx” is an excel file that contains the cleaning documentation, which records all the inconsistencies for all affected variables to help curate each of the NHANES modules.R Data Record: For researchers who want to conduct their analysis in the R programming language, only cleaned NHANES modules and the data dictionaries can be downloaded as a .zip file which include an .RData file and an .R file.“w - nhanes_1988_2018.RData” contains all the aforementioned datasets as R data objects. We make available all R scripts on customized functions that were written to curate the data.“m - nhanes_1988_2018.R” shows how we used the customized functions (i.e. our pipeline) to curate the original NHANES data.Example starter codes: The set of starter code to help users conduct exposome analysis consists of four R markdown files (.Rmd). We recommend going through the tutorials in order.“example_0 - merge_datasets_together.Rmd” demonstrates how to merge the curated NHANES datasets together.“example_1 - account_for_nhanes_design.Rmd” demonstrates how to conduct a linear regression model, a survey-weighted regression model, a Cox proportional hazard model, and a survey-weighted Cox proportional hazard model.“example_2 - calculate_summary_statistics.Rmd” demonstrates how to calculate summary statistics for one variable and multiple variables with and without accounting for the NHANES sampling design.“example_3 - run_multiple_regressions.Rmd” demonstrates how run multiple regression models with and without adjusting for the sampling design.
A Twitter Dataset of 150+ million tweets related to COVID-19 for open...
zenodo.org
application/gzip, csv +1
Updated Apr 17, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Juan M. Banda; Juan M. Banda; Ramya Tekumalla; Ramya Tekumalla; Guanyu Wang; Jingyuan Yu; Tuo Liu; Yuning Ding; Gerardo Chowell; Gerardo Chowell; Guanyu Wang; Jingyuan Yu; Tuo Liu; Yuning Ding (2023). A Twitter Dataset of 150+ million tweets related to COVID-19 for open research [Dataset]. http://doi.org/10.5281/zenodo.3738018
Explore at:
application/gzip, csv, tsvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.3738018
Dataset updated
Apr 17, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Juan M. Banda; Juan M. Banda; Ramya Tekumalla; Ramya Tekumalla; Guanyu Wang; Jingyuan Yu; Tuo Liu; Yuning Ding; Gerardo Chowell; Gerardo Chowell; Guanyu Wang; Jingyuan Yu; Tuo Liu; Yuning Ding
Description
Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. Since our first release we have received additional data from our new collaborators, allowing this resource to grow to its current size. Dedicated data gathering started from March 11th yielding over 4 million tweets a day. We have added additional data provided by our new collaborators from January 27th to March 27th, to provide extra longitudinal coverage.

The data collected from the stream captures all languages, but the higher prevalence are: English, Spanish, and French. We release all tweets and retweets on the full_dataset.tsv file (152,920,832 unique tweets), and a cleaned version with no retweets on the full_dataset-clean.tsv file (30,990,645 unique tweets). There are several practical reasons for us to leave the retweets, tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the statistics-full_dataset.tsv and statistics-full_dataset-clean.tsv files.

More details can be found (and will be updated faster at: https://github.com/thepanacealab/covid19_twitter)

As always, the tweets distributed here are only tweet identifiers (with date and time added) due to the terms and conditions of Twitter to re-distribute Twitter data ONLY for research purposes. The need to be hydrated to be used.
Z
ZOOOM Literature Review Clean Dataset
data.niaid.nih.gov
zenodo.org
Updated Nov 21, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
serpico, davide (2023). ZOOOM Literature Review Clean Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10143324
Explore at:
Dataset updated
Nov 21, 2023
Dataset authored and provided by
serpico, davide
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
The csv file contains the dataset of literature search produced by the ZOOOM EU Funded Project on open software, open hardware, open data business models.
Z
Sound field image dataset
data.niaid.nih.gov
zenodo.org
Updated Jul 11, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kenji (2024). Sound field image dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8357752
Explore at:
Dataset updated
Jul 11, 2024
Dataset provided by
Takehiro
Daiki
Noboru
Kenji
Description
Description

This sound field image dataset contains clean-noisy pairs of complex-valued sound-field images generated by 2D acoustic simulations. The dataset was initially prepared for deep sound-field denoiser (https://github.com/nttcslab/deep-sound-field-denoiser), a DNN-based denoising method for optically measured sound fields. Since the data is a two-dimensional sound field based on the Helmholtz equation, one can use this dataset for any acoustic application. Please check our GitHub repository and paper for details.

Directory structure

The dataset contains three directories: training, validation, and evaluation. Each directory contains "soundsource#" sub-directories (# represents the number of sound sources used in the acoustic simulation). Each sub-directory has three h5 files for data (clean, white noise, and speckle noise) and three CSV files listing random parameter values used in the simulation.

/training

/soundsource#

constants.csv

random_variable_ranges.csv

random_variables.csv

sf_true.h5

sf_noise_white.h5

sf_noise_speckle.h5

Condition of use

This dataset is available under the attached license file. Read the terms and conditions in NTTSoftwareLicenseAgreement.pdf carefully.

Citation

If you use this dataset, please cite the following paper.

K. Ishikawa, D. Takeuchi, N. Harada, and T. Moriya ``Deep sound-field denoiser: optically-measured sound-field denoising using deep neural network,'' arXiv:2304.14923 (2023).

EcoAdvDepression Dataset

zenodo.org

application/gzip, pdf

Updated Sep 13, 2024

Facebook

Twitter

Click to copy link

Link copied

Cite

Arpita Ghosh; Aakash Madhav Rao; Prarthana Middha; Shambhavi Rai; Bittu Kaveri Rajaraman; Arpita Ghosh; Aakash Madhav Rao; Prarthana Middha; Shambhavi Rai; Bittu Kaveri Rajaraman (2024). EcoAdvDepression Dataset [Dataset]. http://doi.org/10.5281/zenodo.13755105

Explore at:

application/gzip, pdfAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.13755105

Dataset updated

Sep 13, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Arpita Ghosh; Aakash Madhav Rao; Prarthana Middha; Shambhavi Rai; Bittu Kaveri Rajaraman; Arpita Ghosh; Aakash Madhav Rao; Prarthana Middha; Shambhavi Rai; Bittu Kaveri Rajaraman

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This record corresponds to the data collected, analysed, and used in the "Investigating Context-Specific Advantages of Depression-like behaviour in Wild-type Zebrafish (Danio rerio)" paper. The total dataset size exceeds 50GB and has hence been split into individual Zenodo records and have been linked in the table listed below. This record consists of the data used to train the YOLOv8 Model as well as the various train-time inferences and parameters of the model. The final trained model is also linked in this record.

Intervention Stage	Data	Link	Description
Pre	Processed	https://doi.org/10.5281/zenodo.13733970	Processed videos of social interaction, converted to grey-scale, removed audio, and in .MP4 format, at the pre-intervention stage. Contains video data for 26 fish with alternatively positioned shoals, totalling 52 videos.
Pre	Resized	https://doi.org/10.5281/zenodo.13735945	Final cropped video files used for analysis at the pre-intervention stage. Model predictions were run on these processed video files. Contains video data for 26 fish with alternatively positioned shoals, totalling 52 videos.
Pre	Tracked, Metrics, and Clean Tracked Centres	https://doi.org/10.5281/zenodo.13743184	Tracked files include the raw frame-by-frame predictions of the YOLOv8 Model over the Resized video files for 26 fish across 52 trials stored as CSV files. Clean Tracked Centres include the cleaned predictions consisting of the centres of the predicted bounding boxes, additionally accounting for incorrect predictions for 26 fish across 52 trials stored as CSV files. Metrics consist of the analysed inferences of all the clean tracked centres producing various movement and social interaction parameters in a single CSV file. Data for the pre-intervention stage.
Post	Processed	https://doi.org/10.5281/zenodo.13754882	Processed videos of social interaction, converted to grey-scale, removed audio, and in .MP4 format, at the post-intervention stage. Contains video data for 26 fish with alternatively positioned shoals, totalling 52 videos.
Post	Resized	https://doi.org/10.5281/zenodo.13743218	Final cropped video files used for analysis at the post-intervention stage. Model predictions were run on these processed video files. Contains video data for 26 fish with alternatively positioned shoals, totalling 52 videos.
Post	Tracked, Metrics, and Clean Tracked Centres	https://doi.org/10.5281/zenodo.13743210	Tracked files include the raw frame-by-frame predictions of the YOLOv8 Model over the Resized video files for 26 fish across 52 trials stored as CSV files. Clean Tracked Centres include the cleaned predictions consisting of the centres of the predicted bounding boxes, additionally accounting for incorrect predictions for 26 fish across 52 trials stored as CSV files. Metrics consist of the analysed inferences of all the clean tracked centres producing various movement and social interaction parameters in a single CSV file. Data for the post-intervention stage.

o
Data from: A large-scale COVID-19 Twitter chatter dataset for open...
explore.openaire.eu
data.niaid.nih.gov
+1more
Updated Aug 9, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Juan M. Banda; Ramya Tekumalla; Guanyu Wang; Jingyuan Yu; Tuo Liu; Yuning Ding; Katya Artemova; Elena Tutubalina; Gerardo Chowell (2020). A large-scale COVID-19 Twitter chatter dataset for open scientific research - an international collaboration [Dataset]. http://doi.org/10.5281/zenodo.3977558
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.3977558
Dataset updated
Aug 9, 2020
Authors
Juan M. Banda; Ramya Tekumalla; Guanyu Wang; Jingyuan Yu; Tuo Liu; Yuning Ding; Katya Artemova; Elena Tutubalina; Gerardo Chowell
Description
Version 22 of the dataset, we have refactored the full_dataset.tsv and full_dataset_clean.tsv files (since version 20) to include two additional columns: language and place country code (when available). This change now includes language and country code for ALL the tweets in the dataset, not only clean tweets. With this change we have removed the clean_place_country.tar.gz and clean_languages.tar.gz files. With our refactoring of the dataset generating code we also found a small bug that made some of the retweets not be counted properly, hence the extra increase on tweets available. Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. Since our first release we have received additional data from our new collaborators, allowing this resource to grow to its current size. Dedicated data gathering started from March 11th yielding over 4 million tweets a day. We have added additional data provided by our new collaborators from January 27th to March 27th, to provide extra longitudinal coverage. Version 10 added ~1.5 million tweets in the Russian language collected between January 1st and May 8th, gracefully provided to us by: Katya Artemova (NRU HSE) and Elena Tutubalina (KFU). From version 12 we have included daily hashtags, mentions and emoijis and their frequencies the respective zip files. From version 14 we have included the tweet identifiers and their respective language for the clean version of the dataset. Since version 20 we have included language and place location for all tweets. The data collected from the stream captures all languages, but the higher prevalence are: English, Spanish, and French. We release all tweets and retweets on the full_dataset.tsv file (602,921,788 unique tweets), and a cleaned version with no retweets on the full_dataset-clean.tsv file (142,360,288 unique tweets). There are several practical reasons for us to leave the retweets, tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the full_dataset-statistics.tsv and full_dataset-clean-statistics.tsv files. For more statistics and some visualizations visit: http://www.panacealab.org/covid19/ More details can be found (and will be updated faster at: https://github.com/thepanacealab/covid19_twitter) and our pre-print about the dataset (https://arxiv.org/abs/2004.03688) As always, the tweets distributed here are only tweet identifiers (with date and time added) due to the terms and conditions of Twitter to re-distribute Twitter data ONLY for research purposes. They need to be hydrated to be used. This dataset will be updated bi-weekly at least with additional tweets, look at the github repo for these updates. Release: We have standardized the name of the resource to match our pre-print manuscript and to not have to update it every week.
f
NHANES 1988-2018
figshare.com
application/gzip
Updated Feb 18, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet (2025). NHANES 1988-2018 [Dataset]. http://doi.org/10.6084/m9.figshare.21743372.v2
Explore at:
application/gzipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.21743372.v2
Dataset updated
Feb 18, 2025
Dataset provided by
figshare
Authors
Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The National Health and Nutrition Examination Survey (NHANES) provides data on the health and environmental exposure of the non-institutionalized US population. Such data have considerable potential to understand how the environment and behaviors impact human health. These data are also currently leveraged to answer public health questions such as prevalence of disease. However, these data need to first be processed before new insights can be derived through large-scale analyses. NHANES data are stored across hundreds of files with multiple inconsistencies. Correcting such inconsistencies takes systematic cross examination and considerable efforts but is required for accurately and reproducibly characterizing the associations between the exposome and diseases (e.g., cancer mortality outcomes). Thus, we developed a set of curated and unified datasets and accompanied code by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous (1999-2018), totaling 134,310 participants and 4,740 variables. The variables convey 1) demographic information, 2) dietary consumption, 3) physical examination results, 4) occupation, 5) questionnaire items (e.g., physical activity, general health status, medical conditions), 6) medications, 7) mortality status linked from the National Death Index, 8) survey weights, 9) environmental exposure biomarker measurements, and 10) chemical comments that indicate which measurements are below or above the lower limit of detection. We also provide a data dictionary listing the variables and their descriptions to help researchers browse the data. We also provide R markdown files to show example codes on calculating summary statistics and running regression models to help accelerate high-throughput analysis of the exposome and secular trends on cancer mortality. csv Data Record: The curated NHANES datasets and the data dictionaries includes 13 .csv files and 1 excel file. The curated NHANES datasets involves 10 .csv formatted files, one for each module and labeled as the following: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments. The eleventh file is a dictionary that lists the variable name, description, module, category, units, CAS Number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 4,740 variables in NHANES ("dictionary_nhanes.csv"). The 12th csv file contains the harmonized categories for the categorical variables ("dictionary_harmonized_categories.csv"). The 13th file contains the dictionary for descriptors on the drugs codes (“dictionary_drug_codes.csv”). The 14th file is an excel file that contains the cleaning documentation, which records all the inconsistencies for all affected variables to help curate each of the NHANES datasets (“nhanes_inconsistencies_documentation.xlsx”). R Data Record: For researchers who want to conduct their analysis in the R programming language, the curated NHANES datasets and the data dictionaries can be downloaded as a .zip file which include an .RData file and an .R file. We provided an .RData file that contains all the aforementioned datasets as R data objects (“w - nhanes_1988_2018.RData”). Also in this .RData file, we make available all R scripts on customized functions that were written to curate the data. We also provide an .R file that shows how we used the customized functions (i.e. our pipeline) to curate the data (“m - nhanes_1988_2018.R”).
c
BBC News Dataset – February 2023 Edition
crawlfeeds.com
csv, zip
Updated Jun 14, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Crawl Feeds (2025). BBC News Dataset – February 2023 Edition [Dataset]. https://crawlfeeds.com/datasets/bbc-news-dataset-feb-2023
Explore at:
zip, csvAvailable download formats
Dataset updated
Jun 14, 2025
Dataset authored and provided by
Crawl Feeds
License
https://crawlfeeds.com/privacy_policyhttps://crawlfeeds.com/privacy_policy
Description
Get access to a comprehensive and structured dataset of BBC News articles, freshly crawled and compiled in February 2023. This collection includes 1 million records from one of the world’s most trusted news organizations — perfect for training NLP models, sentiment analysis, and trend detection across global topics.

💾 Format: CSV (available in ZIP archive)

📢 Status: Published and available for immediate access

Use Cases

Train language models to summarize or categorize news

Detect media bias and compare narrative framing

Conduct research in journalism, politics, and public sentiment

Enrich news aggregation platforms with clean metadata

Analyze content distribution across categories (e.g. health, politics, tech)

This dataset ensures reliable and high-quality information sourced from a globally respected outlet. The format is optimized for quick ingestion into your pipelines — with clean text, timestamps, image links, and more.

Need a filtered dataset or want this refreshed for a later date? We offer on-demand news scraping as well.

👉 Request access or sample now
l
LSC (Leicester Scientific Corpus)
figshare.le.ac.uk
Updated Apr 15, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Neslihan Suzen (2020). LSC (Leicester Scientific Corpus) [Dataset]. http://doi.org/10.25392/leicester.data.9449639.v2
Explore at:
Unique identifier
https://doi.org/10.25392/leicester.data.9449639.v2
Dataset updated
Apr 15, 2020
Dataset provided by
University of Leicester
Authors
Neslihan Suzen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Leicester
Description
The LSC (Leicester Scientific Corpus)

April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk) Supervised by Prof Alexander Gorban and Dr Evgeny MirkesThe data are extracted from the Web of Science [1]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.[Version 2] A further cleaning is applied in Data Processing for LSC Abstracts in Version 1*. Details of cleaning procedure are explained in Step 6.* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v1.Getting StartedThis text provides the information on the LSC (Leicester Scientific Corpus) and pre-processing steps on abstracts, and describes the structure of files to organise the corpus. This corpus is created to be used in future work on the quantification of the meaning of research texts and make it available for use in Natural Language Processing projects.LSC is a collection of abstracts of articles and proceeding papers published in 2014, and indexed by the Web of Science (WoS) database [1]. The corpus contains only documents in English. Each document in the corpus contains the following parts:1. Authors: The list of authors of the paper2. Title: The title of the paper 3. Abstract: The abstract of the paper 4. Categories: One or more category from the list of categories [2]. Full list of categories is presented in file ‘List_of _Categories.txt’. 5. Research Areas: One or more research area from the list of research areas [3]. Full list of research areas is presented in file ‘List_of_Research_Areas.txt’. 6. Total Times cited: The number of times the paper was cited by other items from all databases within Web of Science platform [4] 7. Times cited in Core Collection: The total number of times the paper was cited by other papers within the WoS Core Collection [4]The corpus was collected in July 2018 online and contains the number of citations from publication date to July 2018. We describe a document as the collection of information (about a paper) listed above. The total number of documents in LSC is 1,673,350.Data ProcessingStep 1: Downloading of the Data Online

The dataset is collected manually by exporting documents as Tab-delimitated files online. All documents are available online.Step 2: Importing the Dataset to R

The LSC was collected as TXT files. All documents are extracted to R.Step 3: Cleaning the Data from Documents with Empty Abstract or without CategoryAs our research is based on the analysis of abstracts and categories, all documents with empty abstracts and documents without categories are removed.Step 4: Identification and Correction of Concatenate Words in AbstractsEspecially medicine-related publications use ‘structured abstracts’. Such type of abstracts are divided into sections with distinct headings such as introduction, aim, objective, method, result, conclusion etc. Used tool for extracting abstracts leads concatenate words of section headings with the first word of the section. For instance, we observe words such as ConclusionHigher and ConclusionsRT etc. The detection and identification of such words is done by sampling of medicine-related publications with human intervention. Detected concatenate words are split into two words. For instance, the word ‘ConclusionHigher’ is split into ‘Conclusion’ and ‘Higher’.The section headings in such abstracts are listed below:

Background Method(s) Design Theoretical Measurement(s) Location Aim(s) Methodology Process Abstract Population Approach Objective(s) Purpose(s) Subject(s) Introduction Implication(s) Patient(s) Procedure(s) Hypothesis Measure(s) Setting(s) Limitation(s) Discussion Conclusion(s) Result(s) Finding(s) Material (s) Rationale(s) Implications for health and nursing policyStep 5: Extracting (Sub-setting) the Data Based on Lengths of AbstractsAfter correction, the lengths of abstracts are calculated. ‘Length’ indicates the total number of words in the text, calculated by the same rule as for Microsoft Word ‘word count’ [5].According to APA style manual [6], an abstract should contain between 150 to 250 words. In LSC, we decided to limit length of abstracts from 30 to 500 words in order to study documents with abstracts of typical length ranges and to avoid the effect of the length to the analysis.

Step 6: [Version 2] Cleaning Copyright Notices, Permission polices, Journal Names and Conference Names from LSC Abstracts in Version 1Publications can include a footer of copyright notice, permission policy, journal name, licence, author’s right or conference name below the text of abstract by conferences and journals. Used tool for extracting and processing abstracts in WoS database leads to attached such footers to the text. For example, our casual observation yields that copyright notices such as ‘Published by Elsevier ltd.’ is placed in many texts. To avoid abnormal appearances of words in further analysis of words such as bias in frequency calculation, we performed a cleaning procedure on such sentences and phrases in abstracts of LSC version 1. We removed copyright notices, names of conferences, names of journals, authors’ rights, licenses and permission policies identiﬁed by sampling of abstracts.Step 7: [Version 2] Re-extracting (Sub-setting) the Data Based on Lengths of AbstractsThe cleaning procedure described in previous step leaded to some abstracts having less than our minimum length criteria (30 words). 474 texts were removed.Step 8: Saving the Dataset into CSV FormatDocuments are saved into 34 CSV files. In CSV files, the information is organised with one record on each line and parts of abstract, title, list of authors, list of categories, list of research areas, and times cited is recorded in fields.To access the LSC for research purposes, please email to ns433@le.ac.uk.References[1]Web of Science. (15 July). Available: https://apps.webofknowledge.com/ [2]WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html [3]Research Areas in WoS. Available: https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html [4]Times Cited in WoS Core Collection. (15 July). Available: https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US [5]Word Count. Available: https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3 [6]A. P. Association, Publication manual. American Psychological Association Washington, DC, 1983.
o
Turbine Bending Moment Dataset
explore.openaire.eu
Updated Sep 19, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yu Ding (2021). Turbine Bending Moment Dataset [Dataset]. http://doi.org/10.5281/zenodo.5516560
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.5516560
Dataset updated
Sep 19, 2021
Authors
Yu Ding
Description
his dataset includes two parts. The first part is three sets of physically measured blade-root flapwise bending moments on three respective turbines, courtesy of Riso-DTU (Technical University of Denmark). The basic characteristics of the three turbines can be found in Table 10.1 of the Data Science for Wind Energy book. These datasets include three columns. The first column is the 10-min average wind speed, the second column is the standard deviation of wind speed within a 10-min block, and the third column is the maximum bending moment, in the unit of MN-m, recorded in a 10-min block. The second part of the dataset is the simulated load data used in Section 10.6.5 of the same book. This part has two sets. The first set is the training data that has 1,000 observations and is used to fit an extreme load model. The second set is the test data that consists of 100 subsets, each of which has 100,000 observations. In other words, the second dataset for testing has a total of 10,000,000 observations, which are used to verify the extreme load extrapolation made by a respective model. Both simulated datasets have two columns: the first is the 10-min average wind speed and the second is the maximum bending moment in the corresponding 10-min block. While all other datasets are saved in CSV file format, this simulated test dataset is saved in a text file format, due to its large size. The data simulation procedure is explained in Section 10.6.5. {"references": ["Ding, Y. (2019) Data Science for Wind Energy, Chapman & Hall/CRC Press, Boca Raton, FL"]}

Facebook

Twitter

Click to copy link

Link copied

Cite

AbdElRahman16 (2024). food data cleaning [Dataset]. https://www.kaggle.com/datasets/abdelrahman16/food-n

food data cleaning

Explore at:

zip(0 bytes)Available download formats

Dataset updated

Apr 13, 2024

Authors

AbdElRahman16

Description

Dataset

This dataset was created by AbdElRahman16

Clear search

Close search

Google apps

Main menu

food data cleaning

Dataset

Contents

_labels1.csv. This data set representss the label of the corresponding...

Data Cleaning Sample

ToS;DR policies dataset (clean)

Overview

File Descriptions

1. clean_tosdr_all_data.csv

2. clean_tosdr_privacy_data.csv

Usage

Datasets and scripts related to the paper: "*Can Generative AI Help us in...

11 Benchmark Clean-Clean ER datasets in CSV format

Pokemon TCG Pocket Dataset

Pokémon TCG Pocket Card Dataset

Dataset Contents

Key Features

🃏 Complete Card Data

💎 Rarity Classifications

Use Cases

📊 Data Analysis & Research

🤖 Machine Learning & AI

📈 Visualization & Dashboards

Data Quality

Technical Specifications

📋 File Format

🗂️ Column Structure (9 columns)

Alpaca Cleaned

Alpaca Cleaned

Improving Pretrained Language Model Understanding

About this dataset

More Datasets

Featured Notebooks

How to use the dataset

Research Ideas

Acknowledgements

License

Columns

Acknowledgements

Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic...

Data from: ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction

Cleaned NHANES 1988-2018

A Twitter Dataset of 150+ million tweets related to COVID-19 for open...

ZOOOM Literature Review Clean Dataset

Sound field image dataset

EcoAdvDepression Dataset

Data from: A large-scale COVID-19 Twitter chatter dataset for open...

NHANES 1988-2018

BBC News Dataset – February 2023 Edition

Use Cases

LSC (Leicester Scientific Corpus)

Turbine Bending Moment Dataset

food data cleaning

Dataset

Contents