100+ datasets found

Clean Data.csv
figshare.com
txt
Updated Dec 3, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zaid Hattab (2023). Clean Data.csv [Dataset]. http://doi.org/10.6084/m9.figshare.24718401.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.24718401.v1
Dataset updated
Dec 3, 2023
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Zaid Hattab
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A subset of the Oregon Health Insurance Experiment (OHIE) contains 12,229 individuals who satisfied the inclusion criteria and who responded to the in-person survey by October 2010. It has been used to explore the heterogeneity of the effects of the lottery and the Insurance on a number of outcomes.
food data cleaning
kaggle.com
zip
Updated Apr 13, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
AbdElRahman16 (2024). food data cleaning [Dataset]. https://www.kaggle.com/datasets/abdelrahman16/food-n
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Apr 13, 2024
Authors
AbdElRahman16
Description
Dataset

This dataset was created by AbdElRahman16

Contents
Cleaned Contoso Dataset
kaggle.com
Updated Aug 27, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bhanu (2023). Cleaned Contoso Dataset [Dataset]. https://www.kaggle.com/datasets/bhanuthakurr/cleaned-contoso-dataset/discussion?sort=undefined
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 27, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Bhanu
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Data was imported from the BAK file found here into SQL Server, and then individual tables were exported as CSV. Jupyter Notebook containing the code used to clean the data can be found here

Version 6 has a some more cleaning and structuring that was noticed after importing in Power BI. Changes were made by adding code in python notebook to export new cleaned dataset, such as adding MonthNumber for sorting by month number, similar for WeekDayNumber.

Cleaning was done in python while also using SQL Server to quickly find things. Headers were added separately, ensuring no data loss.Data was cleaned for NaN, garbage values and other columns.
B
Data Cleaning Sample
borealisdata.ca
Updated Jul 13, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.5683/SP3/ZCN177
Dataset updated
Jul 13, 2023
Dataset provided by
Borealis
Authors
Rong Luo
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Sample data for exercises in Further Adventures in Data Cleaning.
ToS;DR policies dataset (clean)
zenodo.org
csv
Updated May 5, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mahmoud Istaiti; Mahmoud Istaiti (2025). ToS;DR policies dataset (clean) [Dataset]. http://doi.org/10.5281/zenodo.15013541
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.15013541
Dataset updated
May 5, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Mahmoud Istaiti; Mahmoud Istaiti
License
https://www.gnu.org/licenses/gpl-3.0-standalone.htmlhttps://www.gnu.org/licenses/gpl-3.0-standalone.html
Description
Overview

This dataset contains two CSV files derived from Terms of Service; Didn't Read (ToS;DR) data. These files contain analyzed and categorized terms of service snippets from various online services after the cleaning process. The privacy dataset is a subset of the full dataset, focusing exclusively on privacy-related terms.

File Descriptions

1. clean_tosdr_all_data.csv

This file contains a comprehensive collection of terms of service data.

Each row represents a statement (or "point") extracted from a service's terms.

Key columns:

point_quote_text: Extracted text from the terms of service.

case_id: Unique identifier for the case.

case_title: Brief description of the case.

topic_id: Unique identifier for the topic.

topic_title: Broad category the case falls under (e.g., Transparency, Copyright License).

2. clean_tosdr_privacy_data.csv

This file is a subset of clean_tosdr_all_data.csv containing only privacy-related entries.

Includes cases related to tracking, data collection, account deletion policies, and more.

Has the same structure as clean_tosdr_all_data.csv but filtered to include only privacy-related topics.

Usage

Use clean_tosdr_all_data.csv for a broad analysis of various terms of service aspects.

Use clean_tosdr_privacy_data.csv for focused studies on privacy-related clauses.
BBC Full Text Preprocessed
kaggle.com
Updated Feb 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dheemanth Bhat (2023). BBC Full Text Preprocessed [Dataset]. https://www.kaggle.com/datasets/dheemanthbhat/bbc-full-text-preprocessed
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 23, 2023
Dataset provided by
Kaggle
Authors
Dheemanth Bhat
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Original Dataset

Original dataset consists of 2225 documents (as text files) from the BBC news website corresponding to stories in five topical areas from 2004-2005. Files are segregated into 5 folders:

business

entertainment

politics

sport

tech

This Dataset

As part of Data Wrangling, original dataset is pre-processed in three stages:

Stage 1: Extract Metadata from files that are segregated in 5 folders into a single csv.

Stage 2: Clean and compress text content (remove extra spaces and newlines) in files into a single csv.

Stage 3: Process English language (stop-word removal, lemmatization and NER) using spaCy.

Note: Every next stage persists and improves data from previous stage into a new csv file.
N
NYC Clean Heat Dataset (Historical)
data.cityofnewyork.us
catalog.data.gov
+2more
application/rdfxml +5
Updated Apr 30, 2019
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mayor's Office of Climate and Environmental Justice (MOCEJ) (2019). NYC Clean Heat Dataset (Historical) [Dataset]. https://data.cityofnewyork.us/City-Government/NYC-Clean-Heat-Dataset-Historical-/8isn-pgv3
Explore at:
json, csv, application/rdfxml, xml, tsv, application/rssxmlAvailable download formats
Dataset updated
Apr 30, 2019
Dataset authored and provided by
Mayor's Office of Climate and Environmental Justice (MOCEJ)
Area covered
New York
Description
NYC Clean Heat dataset
Datasets and scripts related to the paper: "*Can Generative AI Help us in...
zenodo.org
explore.openaire.eu
zip
Updated Jul 30, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anonymous Anonymous; Anonymous Anonymous (2024). Datasets and scripts related to the paper: "*Can Generative AI Help us in Qualitative Software Engineering?*" [Dataset]. http://doi.org/10.5281/zenodo.13134104
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.13134104
Dataset updated
Jul 30, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Anonymous Anonymous; Anonymous Anonymous
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This replication package contains datasets and scripts related to the paper: "*Can Generative AI Help us in Qualitative Software Engineering?*"

The replication package is organized into two directories:

- `manual_analysis`: This directory contains all sheets used to perform the manual analysis for RQ1, RQ2, and RQ3.

- `stats`: This directory contains all datasets, scripts, and results metrics used for the quantitative analyses of RQ1 and RQ2.

In the following, we describe the content of each directory:

## manual_analysis

- `manual_analysis_rq1`: This directory contains all sheets used to perform manual analysis for RQ1 (independent and incremental coding).

- The sub-directory `incremental_coding` contains .csv files for all datasets (`DL_Faults_COMMIT_incremental.csv`, `DL_Faults_ISSUE_incremental.csv`, `DL_Fault_SO_incremental.csv`, `DRL_Challenges_incremental.csv` and `Functional_incremental.csv`). All these .csv files contain the following columns:

- *Link*: The link to the instances

- *Prompt*: Prompt used as input to GPT-4-Turbo

- *ID*: Instance ID

- *FinalTag*: Tag assigned by the human in the original paper

- *Chatgpt\_output\_memory*: Output of GPT-4-Turbo with incremental coding

- *Chatgpt\_output\_memory\_clean*: (only for the DL Faults datasets) output of GPT-4-Turbo considering only the label assigned, excluding the text

- *Author1*: Label assigned by the first author

- *Author2*: Label assigned by the second author

- *FinalOutput*: Label assigned after the resolution of the conflicts

- The sub-directory `independent_coding` contains .csv files for all datasets (`DL_Faults_COMMIT_independent.csv`, `DL_Faults_ISSUE_ independent.csv`, `DL_Fault_SO_ independent.csv`, `DRL_Challenges_ independent.csv` and `Functional_ independent.csv`), containing the following columns:

- *Link*: The link to the instances

- *Prompt*: Prompt used as input to GPT-4-Turbo

- *ID*: Specific ID for the instance

- *FinalTag*: Tag assigned by the human in the original paper

- *Chatgpt\_output*: Output of GPT-4-Turbo with independent coding

- *Chatgpt\_output\_clean*: (only for DL Faults datasets) output of GPT-4-Turbo considering only the label assigned, excluding the text

- *Author1*: Label assigned by the first author

- *Author2*: Label assigned by the second author

- *FinalOutput*: Label assigned after the resolution of the conflicts.

- Also, the sub-directory contains sheets with inconsistencies after resolving conflicts. The directory `inconsistency_incremental_coding` contains .csv files with the following columns:

- *Dataset*: The dataset considered

- *Human*: The label assigned by the human in the original paper

- *Machine*: The label assigned by GPT-4-Turbo

- *Classification*: The final label assigned by the authors after resolving the conflicts. Multiple classifications for a single instance are separated by a comma “,”

- *Final*: final label assigned after the resolution of the incompatibilities

- Similarly, the sub-directory `inconsistency_independent_coding` contains a .csv file with the same columns as before, but this is for the case of independent coding.

- `manual_analysis_rq2`: This directory contains .csv files for all datasets (`DL_Faults_redundant_tag.csv`, `DRL_Challenges_redundant_tag.csv`, `Functional_redundant_tag.csv`) to perform manual analysis for RQ2.

- The `DL_Faults_redundant_tag.csv` file contains the following columns:

- *Tags Redundant*: tags identified as redundant by GPT-4-Turbo

- *Matched*: inspection by the authors to see if the tags are redundant matching or not

- *FinalTag*: final tag assigned by the authors after the resolution of the conflict

- The `Functional_redundant_tag.csv` file contains the same columns as before

- The `DRL_Challenges_redundant_tag.csv` file is organized as follows:

- *Tags Suggested*: The final tag suggested by GPT-4-Turbo

- *Tags Redundant*: tags identified as redundant by GPT-4-Turbo

- *Matched*: inspection by the authors to see if the tags redundant matching or not with the tags suggested

- *FinalTag*: final tag assigned by the authors after the resolution of the conflict

- The sub-directory `code_consolidation_mapping_overview` contains .csv files (`DL_Faults_rq2_overview.csv`, `DRL_Challenges_rq2_overview.csv`, `Functional_rq2_overview.csv`) organized as follows:

- *Initial_Tags*: list of the unique initial tags assigned by GPT-4-Turbo for each dataset

- *Mapped_tags*: list of tags mapped by GPT-4-Turbo

- *Unmatched_tags*: list of unmatched tags by GPT-4-Turbo

- *Aggregating_tags*: list of consolidated tags

- *Final_tags*: list of final tags after the consolidation task

## stats

- `RQ1`: contains script and datasets used to perform metrics for RQ1. The analysis calculates all possible combinations between Matched, More Abstract, More Specific, and Unmatched.

- `RQ1_Stats.ipynb` is a Python Jupyter nooteook to compute the RQ1 metrics. To use it, as explained in the notebook, it is necessary to change the values of variables contained in the first code block.

- `independent-prompting`: Contains the datasets related to the independent prompting. Each line contains the following fields:

- *Link*: Link to the artifact being tagged

- *Prompt*: Prompt sent to GPT-4-Turbo

- *FinalTag*: Artifact coding from the replicated study

- *chatgpt\_output_text*: GPT-4-Turbo output

- *chatgpt\_output*: Codes parsed from the GPT-4-Turbo output

- *Author1*: Annotator 1 evaluation of the coding

- *Author2*: Annotator 2 evaluation of the coding

- *FinalOutput*: Consolidated evaluation

- `incremental-prompting`: Contains the datasets related to the incremental prompting (same format as independent prompting)

- `results`: contains files for the RQ1 quantitative results. The files are named `RQ1\_<

- `RQ2`: contains the script used to perform metrics for RQ2, the datasets it uses, and its output.

- `RQ2_SetStats.ipynb` is the Python Jupyter notebook to perform the analyses. The scripts takes as input the following types of files, contained in the directory contains the script used to perform the metrics for RQ2. The script takes in input:

- RQ1 Data Files (`RQ1_DLFaults_Issues.csv`, `RQ1_DLFaults_Commits.csv`, and `RQ1_DLFaults_SO.csv`, joined in a single .csv `RQ1_DLFaults.csv`). These are the same files used in RQ1.

- Mapping Files (`RQ2_Mappings_DRL.csv`, `RQ2_Mappings_Functional.csv`, `RQ2_Mappings_DLFaults.csv`). These contain the mappings between human tags (*HumanTags*), GPT-4-Turbo tags (*Final Tags*), with indicated the type of matching (*MatchType*).

- Additional codes creating during the consolidation (`RQ2_newCodes_DRL.csv`, `RQ2_newCodes_Functional.csv`, `RQ2_newCodes_DLFaults.csv`), annotated with the matching: *new code*,*old code*,*human code*,*match type*

- Set files (`RQ2_Sets_DRL.csv`, `RQ2_Sets_Functional.csv`, `RQ2_Sets_DLFaults.csv`). Each file contains the following columns:

- *HumanTags*: List of tags from the original dataset

- *InitialTags*: Set of tags from RQ1,

- *ConsolidatedTags*: Tags that have been consolidated,

- *FinalTags*: Final set of tags (results of RQ2, used in RQ3)

- *NewTags*: New tags created during consolidation

- `RQ2_Set_Metrics.csv`: Reports the RQ2 output metrics (Precision, Recall, F1-Score, Jaccard).
Melanoma_clean_csv
kaggle.com
Updated May 26, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
NISHCHHAL PACHOURI (2024). Melanoma_clean_csv [Dataset]. https://www.kaggle.com/nishchhalpachouri/melanoma-clean-csv/discussion
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 26, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
NISHCHHAL PACHOURI
Description
Dataset

This dataset was created by NISHCHHAL PACHOURI

Contents
🔍 Diverse CSV Dataset Samples
kaggle.com
Updated Nov 6, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Samy Baladram (2023). 🔍 Diverse CSV Dataset Samples [Dataset]. https://www.kaggle.com/datasets/samybaladram/multidisciplinary-csv-datasets-collection/code
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 6, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Samy Baladram
License
http://www.gnu.org/licenses/lgpl-3.0.htmlhttp://www.gnu.org/licenses/lgpl-3.0.html
Description
https://i.imgur.com/PcSDv8A.png" alt="Imgur">

Overview

The dataset provided here is a rich compilation of various data files gathered to support diverse analytical challenges and education in data science. It is especially curated to provide researchers, data enthusiasts, and students with real-world data across different domains, including biostatistics, travel, real estate, sports, media viewership, and more.

Files

Below is a brief overview of what each CSV file contains: - Addresses: Practical examples of string manipulation and address data formatting in CSV. - Air Travel: Historical dataset suitable for analyzing trends in air travel over a period of three years. - Biostats: A dataset of office workers' biometrics, ideal for introductory statistics and biology. - Cities: Geographic and administrative data for urban analysis or socio-demographic studies. - Car Crashes in Catalonia: Weekly traffic accident data from Catalonia, providing a base for public policy research. - De Niro's Film Ratings: Analyze trends in film ratings over time with this entertainment-focused dataset. - Ford Escort Sales: Pre-owned vehicle sales data, perfect for regression analysis or price prediction models. - Old Faithful Geyser: Geological data for pattern recognition and prediction in natural phenomena. - Freshman Year Weights and BMIs: Dataset depicting weight and BMI changes for health and lifestyle studies. - Grades: Education performance data which can be correlated with demographics or study patterns. - Home Sales: A dataset reflecting the housing market dynamics, useful for economic analysis or real estate appraisal. - Hooke's Law Demonstration: Physics data illustrating the classic principle of elasticity in springs. - Hurricanes and Storm Data: Climate data on hurricane and storm frequency for environmental risk assessments. - Height and Weight Measurements: Public health research dataset on anthropometric data. - Lead Shot Specs: Detailed engineering data for material sciences and manufacturing studies. - Alphabet Letter Frequency: Text analysis dataset for frequency distribution studies in large text samples. - MLB Player Statistics: Comprehensive athletic data set for analysis of performance metrics in sports. - MLB Teams' Seasonal Performance: A dataset combining financial and sports performance data from the 2012 MLB season. - TV News Viewership: Media consumption data which can be used to analyze viewing patterns and trends. - Historical Nile Flood Data: A unique environmental dataset for historical trend analysis in flood levels. - Oscar Winner Ages: A dataset to explore age trends among Oscar-winning actors and actresses. - Snakes and Ladders Statistics: Data from the game outcomes useful in studying probability and game theory. - Tallahassee Cab Fares: Price modeling data from the real-world pricing of taxi services. - Taxable Goods Data: A snapshot of economic data concerning taxation impact on prices. - Tree Measurements: Ecological and environmental science data related to tree growth and forest management. - Real Estate Prices from Zillow: Market analysis dataset for those interested in housing price determinants.

Format

The enclosed data respect the comma-separated values (CSV) file format standards, ensuring compatibility with most data processing libraries in Python, R, and other languages. The datasets are ready for import into Jupyter notebooks, RStudio, or any other integrated development environment (IDE) used for data science.

Quality Assurance

The data is pre-checked for common issues such as missing values, duplicate records, and inconsistent entries, offering a clean and reliable dataset for various analytical exercises. With initial header lines in some CSV files, users can easily identify dataset fields and start their analysis without additional data cleaning for headers.

Acknowledgements

The dataset adheres to the GNU LGPL license, making it freely available for modification and distribution, provided that the original source is cited. This opens up possibilities for educators to integrate real-world data into curricula, researchers to validate models against diverse datasets, and practitioners to refine their analytical skills with hands-on data.

This dataset has been compiled from https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html, with gratitude to the authors and maintainers for their dedication to providing open data resources for educational and research purposes. https://i.imgur.com/HOtyghv.png" alt="Imgur">
d
Coresignal | Clean Data | Company Data | AI-Enriched Datasets | Global /...
datarade.ai
.json, .csv
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Coresignal, Coresignal | Clean Data | Company Data | AI-Enriched Datasets | Global / 35M+ Records / Updated Weekly [Dataset]. https://datarade.ai/data-products/coresignal-clean-data-company-data-ai-enriched-datasets-coresignal
Explore at:
.json, .csvAvailable download formats
Dataset authored and provided by
Coresignal
Area covered
Guatemala, Guinea-Bissau, Saint Barthélemy, Namibia, Guadeloupe, Andorra, Hungary, Niue, Panama, Chile
Description
This clean dataset is a refined version of our company datasets, consisting of 35M+ data records.

It’s an excellent data solution for companies with limited data engineering capabilities and those who want to reduce their time to value. You get filtered, cleaned, unified, and standardized B2B data. After cleaning, this data is also enriched by leveraging a carefully instructed large language model (LLM).

AI-powered data enrichment offers more accurate information in key data fields, such as company descriptions. It also produces over 20 additional data points that are very valuable to B2B businesses. Enhancing and highlighting the most important information in web data contributes to quicker time to value, making data processing much faster and easier.

For your convenience, you can choose from multiple data formats (Parquet, JSON, JSONL, or CSV) and select suitable delivery frequency (quarterly, monthly, or weekly).

Coresignal is a leading public business data provider in the web data sphere with an extensive focus on firmographic data and public employee profiles. More than 3B data records in different categories enable companies to build data-driven products and generate actionable insights. Coresignal is exceptional in terms of data freshness, with 890M+ records updated monthly for unprecedented accuracy and relevance.
Z
ZOOOM Literature Review Clean Dataset
data.niaid.nih.gov
zenodo.org
Updated Nov 21, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
serpico, davide (2023). ZOOOM Literature Review Clean Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10143324
Explore at:
Dataset updated
Nov 21, 2023
Dataset authored and provided by
serpico, davide
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
The csv file contains the dataset of literature search produced by the ZOOOM EU Funded Project on open software, open hardware, open data business models.
Enhanced Latin Lemma Dataset
zenodo.org
huggingface.co
+1more
csv
Updated Sep 25, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kristiyan Simeonov; Kristiyan Simeonov (2024). Enhanced Latin Lemma Dataset [Dataset]. http://doi.org/10.57967/hf/3130
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.57967/hf/3130
Dataset updated
Sep 25, 2024
Dataset provided by
Hugging Facehttps://huggingface.co/
Authors
Kristiyan Simeonov; Kristiyan Simeonov
License
Attribution 3.0 (CC BY 3.0)https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Time period covered
Feb 15, 2024
Description
Overview

The Latin Lexicon Dataset contains information about Latin words collected through webscraping from Wiktionary. The dataset includes various linguistic features such as part of speech, lemma, aspect, tense, verb form, voice, mood, number, person, case, and gender. Additionally, it provides source URLs and links to the Wiktionary pages for further reference. The dataset aims to contribute to linguistic research and analysis of Latin language elements.

Versions of the Dataset

This dataset is available in three versions, each offering varying levels of refinement:

wiki_latin_data_v1.csv(v1): The initial raw version, containing all webscraped data without extensive cleaning or filtering.

wiki_latin_data_v2.csv(v2): A more processed version, where some inconsistencies and duplicates were removed, and linguistic features were better aligned.

wiki_latin_data_v3.csv (v3): The most refined version, offering a clean, well-organized dataset with comprehensive linguistic features and translation equivalents with minimal errors. This version is recommended for most use cases.

Data Source:

Webscraped from Wiktionary

Produced by:

Python-based web scraping algorithms

Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source...

zenodo.org

application/gzip, bin +2

Updated Aug 2, 2024

+ more versions

Facebook

Twitter

Click to copy link

Link copied

Cite

Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem [Dataset]. http://doi.org/10.5281/zenodo.1419788

Explore at:

bin, application/gzip, zip, text/x-pythonAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.1419788

Dataset updated

Aug 2, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb

License

https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.htmlhttps://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

Description

Replication pack, FSE2018 submission #164:
------------------------------------------

**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
A Case Study of the PyPI Ecosystem

**Note:** link to data artifacts is already included in the paper. 
Link to the code will be included in the Camera Ready version as well.


Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
 described below
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
 This dataset only includes stats aggregated by the ecosystem (PyPI)
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
 statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages
 themselves, which take around 2TB.
- **build_model.r, helpers.r** - R files to process the survival data 
  (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
  `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
  **dataset_full_Jan_2018.tgz**)
- **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
- LICENSE - text of GPL v3, under which this dataset is published
- INSTALL.md - replication guide (~2 pages)

Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)

Depending on detalization level (see Step 2 for more details):
- up to 2Tb of disk space (see Step 2 detalization levels)
- at least 16Gb of RAM (64 preferable)
- few hours to few month of processing time

Step 1 - software
----------------

- unpack **ghd-0.1.0.zip**, or clone from gitlab:

   git clone https://gitlab.com/user2589/ghd.git
   git checkout 0.1.0
 
 `cd` into the extracted folder. 
 All commands below assume it as a current directory.
  
- copy `settings.py` into the extracted folder. Edit the file:
  * set `DATASET_PATH` to some newly created folder path
  * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
- install docker. For Ubuntu Linux, the command is 
  `sudo apt-get install docker-compose`
- install libarchive and headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
 Without this dependency, you might get an error on the next step, 
 but it's safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt` . 
- disable all APIs except GitHub (Bitbucket and Gitlab support were
 not yet implemented when this study was in progress): edit
 `scraper/init.py`, comment out everything except GitHub support
 in `PROVIDERS`.

Step 2 - obtaining the dataset
-----------------------------

The ultimate goal of this step is to get output of the Python function 
`common.utils.survival_data()` and save it into a CSV file:

  # copy and paste into a Python console
  from common import utils
  survival_data = utils.survival_data('pypi', '2008', smoothing=6)
  survival_data.to_csv('survival_data.csv')

Since full replication will take several months, here are some ways to speedup
the process:

####Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv`, go to the next step

####Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table.
The whole process will take 15..30 minutes.

- create a folder `

ENTSO-E Hydropower modelling data (PECD) in CSV format
zenodo.org
csv
Updated Aug 14, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Matteo De Felice; Matteo De Felice (2020). ENTSO-E Hydropower modelling data (PECD) in CSV format [Dataset]. http://doi.org/10.5281/zenodo.3950048
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.3950048
Dataset updated
Aug 14, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Matteo De Felice; Matteo De Felice
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
PECD Hydro modelling

This repository contains a more user-friendly version of the Hydro modelling data released by ENTSO-E with their latest Seasonal Outlook.

The original URLs:

The zipped file: https://eepublicdownloads.blob.core.windows.net/public-cdn-container/clean-documents/sdc-documents/seasonal/SOR2020/data/Hydro.zip

The documentation file (v 1.0): https://eepublicdownloads.blob.core.windows.net/public-cdn-container/clean-documents/sdc-documents/MAF/2019/Hydropower_Modelling_New_database_and_methodology.pdf

The original ENTSO-E hydropower dataset integrates the PECD (Pan-European Climate Database) released for the MAF 2019

As I did for the wind & solar data, the datasets released in this repository are only a more user- and machine-readable version of the original Excel files. As avid user of ENTSO-E data, with this repository I want to share my data wrangling efforts to make this dataset more accessible.

Data description

The zipped file contains 86 Excel files, two different files for each ENTSO-E zone.

In this repository you can find 5 CSV files:

PECD-hydro-capacities.csv: installed capacities

PECD-hydro-weekly-inflows.csv: weekly inflows for reservoir and open-loop pumping

PECD-hydro-daily-ror-generation.csv: daily run-of-river generation

PECD-hydro-weekly-reservoir-min-max-generation.csv: minimum and maximum weekly reservoir generation

PECD-hydro-weekly-reservoir-min-max-levels.csv: weekly minimum and maximum reservoir levels

Capacities

The file PECD-hydro-capacities.csv contains: run of river capacity (MW) and storage capacity (GWh), reservoir plants capacity (MW) and storage capacity (GWh), closed-loop pumping/turbining (MW) and storage capacity and open-loop pumping/turbining (MW) and storage capacity. The data is extracted from the Excel files with the name starting with PEMM from the following sections:

sheet Run-of-River and pondage, rows from 5 to 7, columns from 2 to 5

sheet Reservoir, rows from 5 to 7, columns from 1 to 3

sheet Pump storage - Open Loop, rows from 5 to 7, columns from 1 to 3

sheet Pump storage - Closed Loop, rows from 5 to 7, columns from 1 to 3

Inflows

The file PECD-hydro-weekly-inflows.csv contains the weekly inflow (GWh) for the climatic years 1982-2017 for reservoir plants and open-loop pumping. The data is extracted from the Excel files with the name starting with PEMM from the following sections:

sheet Reservoir, rows from 13 to 66, columns from 16 to 51

sheet Pump storage - Open Loop, rows from 13 to 66, columns from 16 to 51

Daily run-of-river

The file PECD-hydro-daily-ror-generation.csv contains the daily run-of-river generation (GWh). The data is extracted from the Excel files with the name starting with PEMM from the following sections:

sheet Run-of-River and pondage, rows from 13 to 378, columns from 15 to 51

Miminum and maximum reservoir generation

The file PECD-hydro-weekly-reservoir-min-max-generation.csv contains the minimum and maximum generation (MW, weekly) for reservoir-based plants for the climatic years 1982-2017. The data is extracted from the Excel files with the name starting with PEMM from the following sections:

sheet Reservoir, rows from 13 to 66, columns from 196 to 231

sheet Reservoir, rows from 13 to 66, columns from 232 to 267

Minimum/Maximum reservoir levels

The file PECD-hydro-weekly-reservoir-min-max-levels.csv contains the minimum/maximum reservoir levels at beginning of each week (scaled coefficient from 0 to 1). The data is extracted from the Excel files with the name starting with PEMM from the following sections:

sheet Reservoir, rows from 14 to 66, column 12

sheet Reservoir, rows from 14 to 66, column 13

CHANGELOG

[2020/07/17] Added maximum generation for the reservoir
h
text2cypher-gpt4o-clean
huggingface.co
Updated May 23, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Tomaž Bratanič (2024). text2cypher-gpt4o-clean [Dataset]. https://huggingface.co/datasets/tomasonjo/text2cypher-gpt4o-clean
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 23, 2024
Authors
Tomaž Bratanič
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Synthetic dataset created with GPT-4o

Synthetic dataset of text2cypher over 16 different graph schemas. Questions were generated using GPT-4-turbo, and the corresponding Cypher statements with gpt-4o using Chain of Thought. Here, there are only questions that return results when queried against the database. For more information visit: https://github.com/neo4j-labs/text2cypher/tree/main/datasets/synthetic_gpt4o_demodbs Dataset is available as train.csv. Columns are the following:… See the full description on the dataset page: https://huggingface.co/datasets/tomasonjo/text2cypher-gpt4o-clean.
A
‘Disease Symptom Prediction’ analyzed by Analyst-2
analyst-2.ai
Updated May 25, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2020). ‘Disease Symptom Prediction’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-disease-symptom-prediction-154b/335de7fc/?iid=006-793&v=presentation
Explore at:
Dataset updated
May 25, 2020
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘Disease Symptom Prediction’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/itachi9604/disease-symptom-description-dataset on 28 January 2022.

--- Dataset description provided by original source is as follows ---

Context

A dataset to provide the students a source to create a healthcare related system. A project on the same using double Decision Tree Classifiication is available at : https://github.com/itachi9604/healthcare-chatbot

Get_dummies processed file will be available at https://www.kaggle.com/rabisingh/symptom-checker?select=Training.csv

Content

There are columns containing diseases, their symptoms , precautions to be taken, and their weights. This dataset can be easily cleaned by using file handling in any language. The user only needs to understand how rows and coloumns are arranged.

Acknowledgements

I have created this dataset with help of a friend Pratik Rathod. As there was an existing dataset like this which was difficult to clean.

Query

uchihaitachi9604@gmail.com

--- Original source retains full ownership of the source dataset ---
g
Clean points | gimi9.com
gimi9.com
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Clean points | gimi9.com [Dataset]. https://gimi9.com/dataset/eu_https-abertos-xunta-gal-catalogo-medio-abiente-dataset-0303-puntos-limpos/
Explore at:
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
List of the clean points of the Waste Information System of Galicia. Cleaning points are facilities with adequate equipment for the reception, selective separation and temporary storage of waste of domestic origin of special characteristics. The data are available in .kml format (with the basic contact information, schedule and georeferencing) and in .csv (which also incorporates the information of the entity that owns the point, its current state of operation, year and cost of execution, municipalities to which it provides service and the reference of the entity or company managing the installation). View in Services on the map
CSV Clean Fleet Vehicles LISI AUTOMOTIVE FORMER
data.europa.eu
csv
Updated Nov 22, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
LISI AUTOMOTIVE SAS (2023). CSV Clean Fleet Vehicles LISI AUTOMOTIVE FORMER [Dataset]. https://data.europa.eu/data/datasets/65081fd6f24090e1db9e52ee?locale=sv
Explore at:
csv(1521)Available download formats
Dataset updated
Nov 22, 2023
Dataset provided by
LISI Automotive SAS
Authors
LISI AUTOMOTIVE SAS
Description
CSV Clean Fleet Vehicles LISI AUTOMOTIVE FORMER
d
Crypto Market Data CSV Export: Trades, Quotes & Order Book Access via S3
datarade.ai
.json, .csv
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
CoinAPI, Crypto Market Data CSV Export: Trades, Quotes & Order Book Access via S3 [Dataset]. https://datarade.ai/data-products/coinapi-comprehensive-crypto-market-data-in-flat-files-tra-coinapi
Explore at:
.json, .csvAvailable download formats
Dataset provided by
Coinapi Ltd
Authors
CoinAPI
Area covered
Solomon Islands, Montserrat, Kyrgyzstan, Qatar, Liechtenstein, Norfolk Island, Iraq, Tanzania, Latvia, Northern Mariana Islands
Description
When you need to analyze crypto market history, batch processing often beats streaming APIs. That's why we built the Flat Files S3 API - giving analysts and researchers direct access to structured historical cryptocurrency data without the integration complexity of traditional APIs.

Pull comprehensive historical data across 800+ cryptocurrencies and their trading pairs, delivered in clean, ready-to-use CSV formats that drop straight into your analysis tools. Whether you're building backtest environments, training machine learning models, or running complex market studies, our flat file approach gives you the flexibility to work with massive datasets efficiently.

Why work with us?

Market Coverage & Data Types: - Comprehensive historical data since 2010 (for chosen assets) - Comprehensive order book snapshots and updates - Trade-by-trade data

Technical Excellence: - 99,9% uptime guarantee - Standardized data format across exchanges - Flexible Integration - Detailed documentation - Scalable Architecture

CoinAPI serves hundreds of institutions worldwide, from trading firms and hedge funds to research organizations and technology providers. Our S3 delivery method easily integrates with your existing workflows, offering familiar access patterns, reliable downloads, and straightforward automation for your data team. Our commitment to data quality and technical excellence, combined with accessible delivery options, makes us the trusted choice for institutions that demand both comprehensive historical data and real-time market intelligence

Facebook

Twitter

Click to copy link

Link copied

Cite

Zaid Hattab (2023). Clean Data.csv [Dataset]. http://doi.org/10.6084/m9.figshare.24718401.v1

Clean Data.csv

Explore at:

5 scholarly articles cite this dataset (View in Google Scholar)

txtAvailable download formats

Unique identifier

https://doi.org/10.6084/m9.figshare.24718401.v1

Dataset updated

Dec 3, 2023

Dataset provided by

figshare
Figsharehttp://figshare.com/

Authors

Zaid Hattab

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

A subset of the Oregon Health Insurance Experiment (OHIE) contains 12,229 individuals who satisfied the inclusion criteria and who responded to the in-person survey by October 2010. It has been used to explore the heterogeneity of the effects of the lottery and the Insurance on a number of outcomes.

Clear search

Close search

Google apps

Main menu

Clean Data.csv

food data cleaning

Dataset

Contents

Cleaned Contoso Dataset

Data Cleaning Sample

ToS;DR policies dataset (clean)

Overview

File Descriptions

1. clean_tosdr_all_data.csv

2. clean_tosdr_privacy_data.csv

Usage

BBC Full Text Preprocessed

Original Dataset

This Dataset

NYC Clean Heat Dataset (Historical)

Datasets and scripts related to the paper: "*Can Generative AI Help us in...

Melanoma_clean_csv

Dataset

Contents

🔍 Diverse CSV Dataset Samples

Overview

Files

Format

Quality Assurance

Acknowledgements

Coresignal | Clean Data | Company Data | AI-Enriched Datasets | Global /...

ZOOOM Literature Review Clean Dataset

Enhanced Latin Lemma Dataset

Overview

Versions of the Dataset

Data Source:

Produced by:

Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source...

ENTSO-E Hydropower modelling data (PECD) in CSV format

text2cypher-gpt4o-clean

‘Disease Symptom Prediction’ analyzed by Analyst-2

Context

Content

Acknowledgements

Query

Clean points | gimi9.com

CSV Clean Fleet Vehicles LISI AUTOMOTIVE FORMER

Crypto Market Data CSV Export: Trades, Quotes & Order Book Access via S3

Clean Data.csv