100+ datasets found
  1. Clean Data.csv

    • figshare.com
    txt
    Updated Dec 3, 2023
    Cite
    Zaid Hattab (2023). Clean Data.csv [Dataset]. http://doi.org/10.6084/m9.figshare.24718401.v1
    Explore at:
    5 scholarly articles cite this dataset (Google Scholar)
    Available download formats: txt
    Dataset updated
    Dec 3, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Zaid Hattab
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A subset of the Oregon Health Insurance Experiment (OHIE) containing 12,229 individuals who satisfied the inclusion criteria and responded to the in-person survey by October 2010. It has been used to explore heterogeneity in the effects of the lottery and of insurance coverage on a number of outcomes.

  2. food data cleaning

    • kaggle.com
    zip
    Updated Apr 13, 2024
    Cite
    AbdElRahman16 (2024). food data cleaning [Dataset]. https://www.kaggle.com/datasets/abdelrahman16/food-n
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Apr 13, 2024
    Authors
    AbdElRahman16
    Description

    Dataset

    This dataset was created by AbdElRahman16

    Contents

  3. Cleaned Contoso Dataset

    • kaggle.com
    Updated Aug 27, 2023
    Cite
    Bhanu (2023). Cleaned Contoso Dataset [Dataset]. https://www.kaggle.com/datasets/bhanuthakurr/cleaned-contoso-dataset/discussion?sort=undefined
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 27, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Bhanu
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Data was imported from the BAK file found here into SQL Server, and then individual tables were exported as CSV. A Jupyter Notebook containing the code used to clean the data can be found here.

    Version 6 includes some additional cleaning and structuring of issues noticed after importing into Power BI. The changes were made by adding code to the Python notebook that exports the new cleaned dataset, such as adding a MonthNumber column for sorting by month number, and similarly a WeekDayNumber column.

    Cleaning was done in Python, with SQL Server used to quickly inspect the data. Headers were added separately, ensuring no data loss. The data was cleaned of NaN and garbage values across the columns.
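
    A minimal pandas sketch of the kind of restructuring described above (the DimDate.csv file name and the Date column are assumptions for illustration; only the MonthNumber and WeekDayNumber column names come from the dataset description):

      # Hypothetical sketch: add the sorting helper columns described above.
      import pandas as pd

      dates = pd.read_csv("DimDate.csv")             # assumed table/file name
      dates["Date"] = pd.to_datetime(dates["Date"])  # assumed date column
      dates["MonthNumber"] = dates["Date"].dt.month          # lets Power BI sort month names
      dates["WeekDayNumber"] = dates["Date"].dt.dayofweek    # lets Power BI sort weekday names
      dates.to_csv("DimDate_clean.csv", index=False)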

  4. Data Cleaning Sample

    • borealisdata.ca
    Updated Jul 13, 2023
    Cite
    Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 13, 2023
    Dataset provided by
    Borealis
    Authors
    Rong Luo
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Sample data for exercises in Further Adventures in Data Cleaning.

  5. ToS;DR policies dataset (clean)

    • zenodo.org
    csv
    Updated May 5, 2025
    Cite
    Mahmoud Istaiti; Mahmoud Istaiti (2025). ToS;DR policies dataset (clean) [Dataset]. http://doi.org/10.5281/zenodo.15013541
    Explore at:
    Available download formats: csv
    Dataset updated
    May 5, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mahmoud Istaiti; Mahmoud Istaiti
    License

    GNU General Public License v3.0: https://www.gnu.org/licenses/gpl-3.0-standalone.html

    Description

    Overview

    This dataset contains two CSV files derived from Terms of Service; Didn't Read (ToS;DR) data. These files contain analyzed and categorized terms of service snippets from various online services after the cleaning process. The privacy dataset is a subset of the full dataset, focusing exclusively on privacy-related terms.

    File Descriptions

    1. clean_tosdr_all_data.csv

    • This file contains a comprehensive collection of terms of service data.
    • Each row represents a statement (or "point") extracted from a service's terms.
    • Key columns:
      • point_quote_text: Extracted text from the terms of service.
      • case_id: Unique identifier for the case.
      • case_title: Brief description of the case.
      • topic_id: Unique identifier for the topic.
      • topic_title: Broad category the case falls under (e.g., Transparency, Copyright License).

    2. clean_tosdr_privacy_data.csv

    • This file is a subset of clean_tosdr_all_data.csv containing only privacy-related entries.
    • Includes cases related to tracking, data collection, account deletion policies, and more.
    • Has the same structure as clean_tosdr_all_data.csv but filtered to include only privacy-related topics.

    Usage

    • Use clean_tosdr_all_data.csv for a broad analysis of various terms of service aspects.
    • Use clean_tosdr_privacy_data.csv for focused studies on privacy-related clauses.
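
    As an illustration, a minimal pandas sketch for loading the two files and grouping points by topic (the column names come from the list above; everything else is an assumption):

      import pandas as pd

      # Load the full dataset and the privacy-only subset described above.
      all_points = pd.read_csv("clean_tosdr_all_data.csv")
      privacy_points = pd.read_csv("clean_tosdr_privacy_data.csv")

      # Count points per broad topic, e.g. Transparency or Copyright License.
      topic_counts = all_points.groupby("topic_title")["case_id"].count()
      print(topic_counts.sort_values(ascending=False).head(10))
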
  6. BBC Full Text Preprocessed

    • kaggle.com
    Updated Feb 23, 2023
    Cite
    Dheemanth Bhat (2023). BBC Full Text Preprocessed [Dataset]. https://www.kaggle.com/datasets/dheemanthbhat/bbc-full-text-preprocessed
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 23, 2023
    Dataset provided by
    Kaggle
    Authors
    Dheemanth Bhat
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Original Dataset

    The original dataset consists of 2225 documents (as text files) from the BBC news website, corresponding to stories in five topical areas from 2004-2005. Files are segregated into 5 folders:

    1. business
    2. entertainment
    3. politics
    4. sport
    5. tech

    This Dataset

    As part of data wrangling, the original dataset is pre-processed in three stages:

    1. Stage 1: Extract Metadata from files that are segregated in 5 folders into a single csv.
    2. Stage 2: Clean and compress text content (remove extra spaces and newlines) in files into a single csv.
    3. Stage 3: Process English language (stop-word removal, lemmatization and NER) using spaCy.

    Note: each stage takes the data from the previous stage, improves it, and persists it into a new CSV file.
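
    A minimal sketch of the Stage 3 style of processing with spaCy (the model name, file names, and the text column are assumptions, not the exact pipeline used for this dataset):

      import pandas as pd
      import spacy

      nlp = spacy.load("en_core_web_sm")          # assumed English model
      df = pd.read_csv("stage2_clean_text.csv")   # hypothetical Stage 2 output

      def process(text):
          doc = nlp(text)
          # stop-word removal + lemmatization
          lemmas = [t.lemma_ for t in doc if not t.is_stop and not t.is_punct]
          # named entities recognised by spaCy
          entities = [(ent.text, ent.label_) for ent in doc.ents]
          return " ".join(lemmas), entities

      # "text" is an assumed column name holding the cleaned article body.
      df[["lemmatized", "entities"]] = df["text"].apply(lambda s: pd.Series(process(s)))
      df.to_csv("stage3_processed.csv", index=False)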

  7. NYC Clean Heat Dataset (Historical)

    • data.cityofnewyork.us
    • catalog.data.gov
    • +2 more
    application/rdfxml +5
    Updated Apr 30, 2019
    + more versions
    Cite
    Mayor's Office of Climate and Environmental Justice (MOCEJ) (2019). NYC Clean Heat Dataset (Historical) [Dataset]. https://data.cityofnewyork.us/City-Government/NYC-Clean-Heat-Dataset-Historical-/8isn-pgv3
    Explore at:
    Available download formats: json, csv, application/rdfxml, xml, tsv, application/rssxml
    Dataset updated
    Apr 30, 2019
    Dataset authored and provided by
    Mayor's Office of Climate and Environmental Justice (MOCEJ)
    Area covered
    New York
    Description

    NYC Clean Heat dataset

  8. Datasets and scripts related to the paper: "*Can Generative AI Help us in...

    • zenodo.org
    • explore.openaire.eu
    zip
    Updated Jul 30, 2024
    Cite
    Anonymous Anonymous; Anonymous Anonymous (2024). Datasets and scripts related to the paper: "*Can Generative AI Help us in Qualitative Software Engineering?*" [Dataset]. http://doi.org/10.5281/zenodo.13134104
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 30, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous Anonymous; Anonymous Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    This replication package contains datasets and scripts related to the paper: "*Can Generative AI Help us in Qualitative Software Engineering?*"
    The replication package is organized into two directories:
    - `manual_analysis`: This directory contains all sheets used to perform the manual analysis for RQ1, RQ2, and RQ3.
    - `stats`: This directory contains all datasets, scripts, and results metrics used for the quantitative analyses of RQ1 and RQ2.
    In the following, we describe the content of each directory:
    ## manual_analysis
    - `manual_analysis_rq1`: This directory contains all sheets used to perform manual analysis for RQ1 (independent and incremental coding).
    - The sub-directory `incremental_coding` contains .csv files for all datasets (`DL_Faults_COMMIT_incremental.csv`, `DL_Faults_ISSUE_incremental.csv`, `DL_Fault_SO_incremental.csv`, `DRL_Challenges_incremental.csv` and `Functional_incremental.csv`). All these .csv files contain the following columns:
    - *Link*: The link to the instances
    - *Prompt*: Prompt used as input to GPT-4-Turbo
    - *ID*: Instance ID
    - *FinalTag*: Tag assigned by the human in the original paper
    - *Chatgpt\_output\_memory*: Output of GPT-4-Turbo with incremental coding
    - *Chatgpt\_output\_memory\_clean*: (only for the DL Faults datasets) output of GPT-4-Turbo considering only the label assigned, excluding the text
    - *Author1*: Label assigned by the first author
    - *Author2*: Label assigned by the second author
    - *FinalOutput*: Label assigned after the resolution of the conflicts
    - The sub-directory `independent_coding` contains .csv files for all datasets (`DL_Faults_COMMIT_independent.csv`, `DL_Faults_ISSUE_ independent.csv`, `DL_Fault_SO_ independent.csv`, `DRL_Challenges_ independent.csv` and `Functional_ independent.csv`), containing the following columns:
    - *Link*: The link to the instances
    - *Prompt*: Prompt used as input to GPT-4-Turbo
    - *ID*: Specific ID for the instance
    - *FinalTag*: Tag assigned by the human in the original paper
    - *Chatgpt\_output*: Output of GPT-4-Turbo with independent coding
    - *Chatgpt\_output\_clean*: (only for DL Faults datasets) output of GPT-4-Turbo considering only the label assigned, excluding the text
    - *Author1*: Label assigned by the first author
    - *Author2*: Label assigned by the second author
    - *FinalOutput*: Label assigned after the resolution of the conflicts.
    - Also, the sub-directory contains sheets with inconsistencies after resolving conflicts. The directory `inconsistency_incremental_coding` contains .csv files with the following columns:
    - *Dataset*: The dataset considered
    - *Human*: The label assigned by the human in the original paper
    - *Machine*: The label assigned by GPT-4-Turbo
    - *Classification*: The final label assigned by the authors after resolving the conflicts. Multiple classifications for a single instance are separated by a comma “,”
    - *Final*: final label assigned after the resolution of the incompatibilities
    - Similarly, the sub-directory `inconsistency_independent_coding` contains a .csv file with the same columns as before, but this is for the case of independent coding.
    - `manual_analysis_rq2`: This directory contains .csv files for all datasets (`DL_Faults_redundant_tag.csv`, `DRL_Challenges_redundant_tag.csv`, `Functional_redundant_tag.csv`) to perform manual analysis for RQ2.
    - The `DL_Faults_redundant_tag.csv` file contains the following columns:
    - *Tags Redundant*: tags identified as redundant by GPT-4-Turbo
    - *Matched*: inspection by the authors to check whether the tags identified as redundant actually match
    - *FinalTag*: final tag assigned by the authors after the resolution of the conflict
    - The `Functional_redundant_tag.csv` file contains the same columns as before
    - The `DRL_Challenges_redundant_tag.csv` file is organized as follows:
    - *Tags Suggested*: The final tag suggested by GPT-4-Turbo
    - *Tags Redundant*: tags identified as redundant by GPT-4-Turbo
    - *Matched*: inspection by the authors to check whether the tags identified as redundant match the suggested tags
    - *FinalTag*: final tag assigned by the authors after the resolution of the conflict
    - The sub-directory `code_consolidation_mapping_overview` contains .csv files (`DL_Faults_rq2_overview.csv`, `DRL_Challenges_rq2_overview.csv`, `Functional_rq2_overview.csv`) organized as follows:
    - *Initial_Tags*: list of the unique initial tags assigned by GPT-4-Turbo for each dataset
    - *Mapped_tags*: list of tags mapped by GPT-4-Turbo
    - *Unmatched_tags*: list of unmatched tags by GPT-4-Turbo
    - *Aggregating_tags*: list of consolidated tags
    - *Final_tags*: list of final tags after the consolidation task
    ## stats
    - `RQ1`: contains the script and datasets used to compute the metrics for RQ1. The analysis calculates all possible combinations between Matched, More Abstract, More Specific, and Unmatched.
    - `RQ1_Stats.ipynb` is a Python Jupyter notebook to compute the RQ1 metrics. To use it, as explained in the notebook, it is necessary to change the values of the variables contained in the first code block.
    - `independent-prompting`: Contains the datasets related to the independent prompting. Each line contains the following fields:
    - *Link*: Link to the artifact being tagged
    - *Prompt*: Prompt sent to GPT-4-Turbo
    - *FinalTag*: Artifact coding from the replicated study
    - *chatgpt\_output_text*: GPT-4-Turbo output
    - *chatgpt\_output*: Codes parsed from the GPT-4-Turbo output
    - *Author1*: Annotator 1 evaluation of the coding
    - *Author2*: Annotator 2 evaluation of the coding
    - *FinalOutput*: Consolidated evaluation
    - `incremental-prompting`: Contains the datasets related to the incremental prompting (same format as independent prompting)
    - `results`: contains files for the RQ1 quantitative results. The files are named `RQ1\_<
    - `RQ2`: contains the script used to perform metrics for RQ2, the datasets it uses, and its output.
    - `RQ2_SetStats.ipynb` is the Python Jupyter notebook that performs the analyses. The script takes as input the following types of files, contained in the same directory:
    - RQ1 Data Files (`RQ1_DLFaults_Issues.csv`, `RQ1_DLFaults_Commits.csv`, and `RQ1_DLFaults_SO.csv`, joined in a single .csv `RQ1_DLFaults.csv`). These are the same files used in RQ1.
    - Mapping Files (`RQ2_Mappings_DRL.csv`, `RQ2_Mappings_Functional.csv`, `RQ2_Mappings_DLFaults.csv`). These contain the mappings between human tags (*HumanTags*), GPT-4-Turbo tags (*Final Tags*), with indicated the type of matching (*MatchType*).
    - Additional codes created during the consolidation (`RQ2_newCodes_DRL.csv`, `RQ2_newCodes_Functional.csv`, `RQ2_newCodes_DLFaults.csv`), annotated with the matching columns: *new code*, *old code*, *human code*, *match type*
    - Set files (`RQ2_Sets_DRL.csv`, `RQ2_Sets_Functional.csv`, `RQ2_Sets_DLFaults.csv`). Each file contains the following columns:
    - *HumanTags*: List of tags from the original dataset
    - *InitialTags*: Set of tags from RQ1
    - *ConsolidatedTags*: Tags that have been consolidated
    - *FinalTags*: Final set of tags (results of RQ2, used in RQ3)
    - *NewTags*: New tags created during consolidation
    - `RQ2_Set_Metrics.csv`: Reports the RQ2 output metrics (Precision, Recall, F1-Score, Jaccard).
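
    For reference, a small self-contained sketch (not the authors' script) of how set-overlap metrics such as Precision, Recall, F1-Score, and Jaccard can be computed for a pair of tag sets:

      def set_metrics(human_tags, machine_tags):
          """Precision, recall, F1 and Jaccard between two tag sets."""
          human, machine = set(human_tags), set(machine_tags)
          overlap = len(human & machine)
          precision = overlap / len(machine) if machine else 0.0
          recall = overlap / len(human) if human else 0.0
          f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
          jaccard = overlap / len(human | machine) if human | machine else 0.0
          return precision, recall, f1, jaccard

      # Example with made-up tags:
      print(set_metrics({"api misuse", "wrong tensor shape"}, {"wrong tensor shape", "training bug"}))
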
  9. Melanoma_clean_csv

    • kaggle.com
    Updated May 26, 2024
    Cite
    NISHCHHAL PACHOURI (2024). Melanoma_clean_csv [Dataset]. https://www.kaggle.com/nishchhalpachouri/melanoma-clean-csv/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 26, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    NISHCHHAL PACHOURI
    Description

    Dataset

    This dataset was created by NISHCHHAL PACHOURI

    Contents

  10. 🔍 Diverse CSV Dataset Samples

    • kaggle.com
    Updated Nov 6, 2023
    Cite
    Samy Baladram (2023). 🔍 Diverse CSV Dataset Samples [Dataset]. https://www.kaggle.com/datasets/samybaladram/multidisciplinary-csv-datasets-collection/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 6, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Samy Baladram
    License

    GNU Lesser General Public License v3.0 (LGPL-3.0): http://www.gnu.org/licenses/lgpl-3.0.html

    Description


    Overview

    The dataset provided here is a rich compilation of various data files gathered to support diverse analytical challenges and education in data science. It is especially curated to provide researchers, data enthusiasts, and students with real-world data across different domains, including biostatistics, travel, real estate, sports, media viewership, and more.

    Files

    Below is a brief overview of what each CSV file contains:

    • Addresses: Practical examples of string manipulation and address data formatting in CSV.
    • Air Travel: Historical dataset suitable for analyzing trends in air travel over a period of three years.
    • Biostats: A dataset of office workers' biometrics, ideal for introductory statistics and biology.
    • Cities: Geographic and administrative data for urban analysis or socio-demographic studies.
    • Car Crashes in Catalonia: Weekly traffic accident data from Catalonia, providing a base for public policy research.
    • De Niro's Film Ratings: Analyze trends in film ratings over time with this entertainment-focused dataset.
    • Ford Escort Sales: Pre-owned vehicle sales data, perfect for regression analysis or price prediction models.
    • Old Faithful Geyser: Geological data for pattern recognition and prediction in natural phenomena.
    • Freshman Year Weights and BMIs: Dataset depicting weight and BMI changes for health and lifestyle studies.
    • Grades: Education performance data which can be correlated with demographics or study patterns.
    • Home Sales: A dataset reflecting the housing market dynamics, useful for economic analysis or real estate appraisal.
    • Hooke's Law Demonstration: Physics data illustrating the classic principle of elasticity in springs.
    • Hurricanes and Storm Data: Climate data on hurricane and storm frequency for environmental risk assessments.
    • Height and Weight Measurements: Public health research dataset on anthropometric data.
    • Lead Shot Specs: Detailed engineering data for material sciences and manufacturing studies.
    • Alphabet Letter Frequency: Text analysis dataset for frequency distribution studies in large text samples.
    • MLB Player Statistics: Comprehensive athletic data set for analysis of performance metrics in sports.
    • MLB Teams' Seasonal Performance: A dataset combining financial and sports performance data from the 2012 MLB season.
    • TV News Viewership: Media consumption data which can be used to analyze viewing patterns and trends.
    • Historical Nile Flood Data: A unique environmental dataset for historical trend analysis in flood levels.
    • Oscar Winner Ages: A dataset to explore age trends among Oscar-winning actors and actresses.
    • Snakes and Ladders Statistics: Data from the game outcomes useful in studying probability and game theory.
    • Tallahassee Cab Fares: Price modeling data from the real-world pricing of taxi services.
    • Taxable Goods Data: A snapshot of economic data concerning taxation impact on prices.
    • Tree Measurements: Ecological and environmental science data related to tree growth and forest management.
    • Real Estate Prices from Zillow: Market analysis dataset for those interested in housing price determinants.

    Format

    The enclosed data respect the comma-separated values (CSV) file format standards, ensuring compatibility with most data processing libraries in Python, R, and other languages. The datasets are ready for import into Jupyter notebooks, RStudio, or any other integrated development environment (IDE) used for data science.
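
    For instance, a minimal pandas sketch for loading one of the files ("biostats.csv" is an assumed file name for illustration):

      import pandas as pd

      # Any of the CSVs can be read the same way.
      df = pd.read_csv("biostats.csv")
      print(df.head())
      print(df.describe(include="all"))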

    Quality Assurance

    The data is pre-checked for common issues such as missing values, duplicate records, and inconsistent entries, offering a clean and reliable dataset for various analytical exercises. Since some CSV files include initial header lines, users can easily identify dataset fields and start their analysis without additional cleaning of headers.

    Acknowledgements

    The dataset adheres to the GNU LGPL license, making it freely available for modification and distribution, provided that the original source is cited. This opens up possibilities for educators to integrate real-world data into curricula, researchers to validate models against diverse datasets, and practitioners to refine their analytical skills with hands-on data.

    This dataset has been compiled from https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html, with gratitude to the authors and maintainers for their dedication to providing open data resources for educational and research purposes.

  11. Coresignal | Clean Data | Company Data | AI-Enriched Datasets | Global /...

    • datarade.ai
    .json, .csv
    + more versions
    Cite
    Coresignal, Coresignal | Clean Data | Company Data | AI-Enriched Datasets | Global / 35M+ Records / Updated Weekly [Dataset]. https://datarade.ai/data-products/coresignal-clean-data-company-data-ai-enriched-datasets-coresignal
    Explore at:
    Available download formats: .json, .csv
    Dataset authored and provided by
    Coresignal
    Area covered
    Guatemala, Guinea-Bissau, Saint Barthélemy, Namibia, Guadeloupe, Andorra, Hungary, Niue, Panama, Chile
    Description

    This clean dataset is a refined version of our company datasets, consisting of 35M+ data records.

    It’s an excellent data solution for companies with limited data engineering capabilities and those who want to reduce their time to value. You get filtered, cleaned, unified, and standardized B2B data. After cleaning, this data is also enriched by leveraging a carefully instructed large language model (LLM).

    AI-powered data enrichment offers more accurate information in key data fields, such as company descriptions. It also produces over 20 additional data points that are very valuable to B2B businesses. Enhancing and highlighting the most important information in web data contributes to quicker time to value, making data processing much faster and easier.

    For your convenience, you can choose from multiple data formats (Parquet, JSON, JSONL, or CSV) and select a suitable delivery frequency (quarterly, monthly, or weekly).

    Coresignal is a leading public business data provider in the web data sphere with an extensive focus on firmographic data and public employee profiles. More than 3B data records in different categories enable companies to build data-driven products and generate actionable insights. Coresignal is exceptional in terms of data freshness, with 890M+ records updated monthly for unprecedented accuracy and relevance.

  12. ZOOOM Literature Review Clean Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Nov 21, 2023
    Cite
    serpico, davide (2023). ZOOOM Literature Review Clean Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10143324
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset authored and provided by
    serpico, davide
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The CSV file contains the literature-search dataset produced by the ZOOOM EU-funded project on open software, open hardware, and open data business models.

  13. Enhanced Latin Lemma Dataset

    • zenodo.org
    • huggingface.co
    • +1 more
    csv
    Updated Sep 25, 2024
    Cite
    Kristiyan Simeonov; Kristiyan Simeonov (2024). Enhanced Latin Lemma Dataset [Dataset]. http://doi.org/10.57967/hf/3130
    Explore at:
    Available download formats: csv
    Dataset updated
    Sep 25, 2024
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    Kristiyan Simeonov; Kristiyan Simeonov
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Time period covered
    Feb 15, 2024
    Description

    Overview

    The Latin Lexicon Dataset contains information about Latin words collected through webscraping from Wiktionary. The dataset includes various linguistic features such as part of speech, lemma, aspect, tense, verb form, voice, mood, number, person, case, and gender. Additionally, it provides source URLs and links to the Wiktionary pages for further reference. The dataset aims to contribute to linguistic research and analysis of Latin language elements.

    Versions of the Dataset

    This dataset is available in three versions, each offering varying levels of refinement:

    • wiki_latin_data_v1.csv (v1): The initial raw version, containing all webscraped data without extensive cleaning or filtering.
    • wiki_latin_data_v2.csv (v2): A more processed version, where some inconsistencies and duplicates were removed, and linguistic features were better aligned.
    • wiki_latin_data_v3.csv (v3): The most refined version, offering a clean, well-organized dataset with comprehensive linguistic features and translation equivalents with minimal errors. This version is recommended for most use cases.
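
    A minimal pandas sketch for loading the recommended v3 file and inspecting its linguistic features (the file name comes from the list above; no specific column names are assumed):

      import pandas as pd

      lemmas = pd.read_csv("wiki_latin_data_v3.csv")
      print(lemmas.columns.tolist())  # e.g. part of speech, lemma, tense, mood, case, ...
      print(lemmas.head())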

    Data Source:

    • Webscraped from Wiktionary

    Produced by:

    • Python-based web scraping algorithms
  14. Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source...

    • zenodo.org
    application/gzip, bin +2
    Updated Aug 2, 2024
    + more versions
    Cite
    Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem [Dataset]. http://doi.org/10.5281/zenodo.1419788
    Explore at:
    Available download formats: bin, application/gzip, zip, text/x-python
    Dataset updated
    Aug 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb
    License

    https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

    Description
    Replication pack, FSE2018 submission #164:
    ------------------------------------------
    
    **Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
    A Case Study of the PyPI Ecosystem
    
    **Note:** link to data artifacts is already included in the paper. 
    Link to the code will be included in the Camera Ready version as well.
    
    
    Content description
    ===================
    
    - **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
     described below
    - **settings.py** - settings template for the code archive.
    - **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
     This dataset only includes stats aggregated by the ecosystem (PyPI)
    - **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
     statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages
     themselves, which take around 2TB.
    - **build_model.r, helpers.r** - R files to process the survival data 
      (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
      `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
      **dataset_full_Jan_2018.tgz**)
    - **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
    - LICENSE - text of GPL v3, under which this dataset is published
    - INSTALL.md - replication guide (~2 pages)
    Replication guide
    =================
    
    Step 0 - prerequisites
    ----------------------
    
    - Unix-compatible OS (Linux or OS X)
    - Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
    - R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)
    
    Depending on the level of detail (see Step 2 for more details):
    - up to 2TB of disk space (see Step 2 detail levels)
    - at least 16GB of RAM (64GB preferable)
    - a few hours to a few months of processing time
    
    Step 1 - software
    ----------------
    
    - unpack **ghd-0.1.0.zip**, or clone from gitlab:
    
       git clone https://gitlab.com/user2589/ghd.git
       git checkout 0.1.0
     
     `cd` into the extracted folder. 
     All commands below assume it is the current directory.
      
    - copy `settings.py` into the extracted folder. Edit the file:
      * set `DATASET_PATH` to some newly created folder path
      * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
    - install docker. For Ubuntu Linux, the command is 
      `sudo apt-get install docker-compose`
    - install libarchive and headers: `sudo apt-get install libarchive-dev`
    - (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
     Without this dependency, you might get an error on the next step, 
     but it's safe to ignore.
    - install Python libraries: `pip install --user -r requirements.txt` . 
    - disable all APIs except GitHub (Bitbucket and Gitlab support were
     not yet implemented when this study was in progress): edit
     `scraper/init.py`, comment out everything except GitHub support
     in `PROVIDERS`.
    
    Step 2 - obtaining the dataset
    -----------------------------
    
    The ultimate goal of this step is to get the output of the Python function 
    `common.utils.survival_data()` and save it into a CSV file:
    
      # copy and paste into a Python console
      from common import utils
      survival_data = utils.survival_data('pypi', '2008', smoothing=6)
      survival_data.to_csv('survival_data.csv')
    
    Since full replication will take several months, here are some ways to speed up
    the process:
    
    #### Option 2.a, difficulty level: easiest
    
    Just use the precomputed data. Step 1 is not necessary under this scenario.
    
    - extract **dataset_minimal_Jan_2018.zip**
    - get `survival_data.csv`, go to the next step
    
    #### Option 2.b, difficulty level: easy
    
    Use precomputed longitudinal feature values to build the final table.
    The whole process will take 15..30 minutes.
    
    - create a folder `
  15. ENTSO-E Hydropower modelling data (PECD) in CSV format

    • zenodo.org
    csv
    Updated Aug 14, 2020
    Cite
    Matteo De Felice; Matteo De Felice (2020). ENTSO-E Hydropower modelling data (PECD) in CSV format [Dataset]. http://doi.org/10.5281/zenodo.3950048
    Explore at:
    Available download formats: csv
    Dataset updated
    Aug 14, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Matteo De Felice; Matteo De Felice
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    PECD Hydro modelling

    This repository contains a more user-friendly version of the Hydro modelling data released by ENTSO-E with their latest Seasonal Outlook.

    The original URLs:

    The original ENTSO-E hydropower dataset integrates the PECD (Pan-European Climate Database) released for the MAF 2019

    As I did for the wind & solar data, the datasets released in this repository are only a more user- and machine-readable version of the original Excel files. As an avid user of ENTSO-E data, I want to share my data wrangling efforts through this repository to make this dataset more accessible.

    Data description

    The zipped file contains 86 Excel files, two different files for each ENTSO-E zone.

    In this repository you can find 5 CSV files:

    • PECD-hydro-capacities.csv: installed capacities
    • PECD-hydro-weekly-inflows.csv: weekly inflows for reservoir and open-loop pumping
    • PECD-hydro-daily-ror-generation.csv: daily run-of-river generation
    • PECD-hydro-weekly-reservoir-min-max-generation.csv: minimum and maximum weekly reservoir generation
    • PECD-hydro-weekly-reservoir-min-max-levels.csv: weekly minimum and maximum reservoir levels

    Capacities

    The file PECD-hydro-capacities.csv contains: run of river capacity (MW) and storage capacity (GWh), reservoir plants capacity (MW) and storage capacity (GWh), closed-loop pumping/turbining (MW) and storage capacity and open-loop pumping/turbining (MW) and storage capacity. The data is extracted from the Excel files with the name starting with PEMM from the following sections:

    • sheet Run-of-River and pondage, rows from 5 to 7, columns from 2 to 5
    • sheet Reservoir, rows from 5 to 7, columns from 1 to 3
    • sheet Pump storage - Open Loop, rows from 5 to 7, columns from 1 to 3
    • sheet Pump storage - Closed Loop, rows from 5 to 7, columns from 1 to 3
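
    A minimal pandas sketch of pulling one of these ranges out of the original Excel files (the PEMM file name is a placeholder; the row and column indices follow the Run-of-River section above, converted to zero-based offsets):

      import pandas as pd

      xls = "PEMM_DE00.xlsx"  # placeholder; actual files start with PEMM, one pair per zone

      # Run-of-River and pondage: rows 5 to 7, columns 2 to 5 (1-based in the description).
      ror_caps = pd.read_excel(
          xls,
          sheet_name="Run-of-River and pondage",
          header=None,
          skiprows=4,    # skip rows 1-4 so the block starts at row 5
          nrows=3,       # rows 5, 6 and 7
          usecols="B:E", # spreadsheet columns 2-5
      )
      print(ror_caps)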

    Inflows

    The file PECD-hydro-weekly-inflows.csv contains the weekly inflow (GWh) for the climatic years 1982-2017 for reservoir plants and open-loop pumping. The data is extracted from the Excel files with the name starting with PEMM from the following sections:

    • sheet Reservoir, rows from 13 to 66, columns from 16 to 51
    • sheet Pump storage - Open Loop, rows from 13 to 66, columns from 16 to 51

    Daily run-of-river

    The file PECD-hydro-daily-ror-generation.csv contains the daily run-of-river generation (GWh). The data is extracted from the Excel files with the name starting with PEMM from the following sections:

    • sheet Run-of-River and pondage, rows from 13 to 378, columns from 15 to 51

    Minimum and maximum reservoir generation

    The file PECD-hydro-weekly-reservoir-min-max-generation.csv contains the minimum and maximum generation (MW, weekly) for reservoir-based plants for the climatic years 1982-2017. The data is extracted from the Excel files with the name starting with PEMM from the following sections:

    • sheet Reservoir, rows from 13 to 66, columns from 196 to 231
    • sheet Reservoir, rows from 13 to 66, columns from 232 to 267

    Minimum/Maximum reservoir levels

    The file PECD-hydro-weekly-reservoir-min-max-levels.csv contains the minimum/maximum reservoir levels at the beginning of each week (scaled coefficient from 0 to 1). The data is extracted from the Excel files with the name starting with PEMM from the following sections:

    • sheet Reservoir, rows from 14 to 66, column 12
    • sheet Reservoir, rows from 14 to 66, column 13

    CHANGELOG

    [2020/07/17] Added maximum generation for the reservoir

  16. text2cypher-gpt4o-clean

    • huggingface.co
    Updated May 23, 2024
    Cite
    Tomaž Bratanič (2024). text2cypher-gpt4o-clean [Dataset]. https://huggingface.co/datasets/tomasonjo/text2cypher-gpt4o-clean
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 23, 2024
    Authors
    Tomaž Bratanič
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Synthetic dataset created with GPT-4o

    A synthetic text2cypher dataset covering 16 different graph schemas. Questions were generated using GPT-4-Turbo, and the corresponding Cypher statements with GPT-4o using Chain of Thought. Only questions that return results when queried against the database are included. For more information visit: https://github.com/neo4j-labs/text2cypher/tree/main/datasets/synthetic_gpt4o_demodbs. The dataset is available as train.csv. Columns are the following:… See the full description on the dataset page: https://huggingface.co/datasets/tomasonjo/text2cypher-gpt4o-clean.
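
    A minimal sketch for loading the dataset with the Hugging Face datasets library (a plain pandas read of train.csv would work as well; the "train" split name is inferred from the description above):

      from datasets import load_dataset

      # Load the dataset straight from the Hugging Face Hub.
      ds = load_dataset("tomasonjo/text2cypher-gpt4o-clean")
      print(ds)              # splits and row counts
      print(ds["train"][0])  # first question/Cypher pair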

  17. ‘Disease Symptom Prediction’ analyzed by Analyst-2

    • analyst-2.ai
    Updated May 25, 2020
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2020). ‘Disease Symptom Prediction’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-disease-symptom-prediction-154b/335de7fc/?iid=006-793&v=presentation
    Explore at:
    Dataset updated
    May 25, 2020
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Disease Symptom Prediction’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/itachi9604/disease-symptom-description-dataset on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    A dataset to provide students with a source for creating a healthcare-related system. A project on the same data using a double Decision Tree classification is available at: https://github.com/itachi9604/healthcare-chatbot

    A get_dummies-processed file is available at https://www.kaggle.com/rabisingh/symptom-checker?select=Training.csv

    Content

    There are columns containing diseases, their symptoms, precautions to be taken, and their weights. This dataset can be easily cleaned using file handling in any language; the user only needs to understand how the rows and columns are arranged.
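
    A minimal pandas sketch of the kind of light cleaning described above ("dataset.csv" and the output name are assumptions, not the actual Kaggle file names):

      import pandas as pd

      df = pd.read_csv("dataset.csv")  # assumed file name for the symptom table

      # Basic cleaning: trim stray whitespace in text cells and drop exact duplicates.
      text_cols = df.select_dtypes(include="object").columns
      df[text_cols] = df[text_cols].apply(lambda col: col.str.strip())
      df = df.drop_duplicates().fillna("")
      df.to_csv("dataset_clean.csv", index=False)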

    Acknowledgements

    I created this dataset with the help of a friend, Pratik Rathod, because the existing dataset of this kind was difficult to clean.

    Query

    uchihaitachi9604@gmail.com

    --- Original source retains full ownership of the source dataset ---

  18. Clean points | gimi9.com

    • gimi9.com
    Cite
    Clean points | gimi9.com [Dataset]. https://gimi9.com/dataset/eu_https-abertos-xunta-gal-catalogo-medio-abiente-dataset-0303-puntos-limpos/
    Explore at:
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    List of the clean points of the Waste Information System of Galicia. Clean points are facilities with adequate equipment for the reception, selective separation, and temporary storage of waste of domestic origin with special characteristics. The data are available in .kml format (with basic contact information, schedule, and georeferencing) and in .csv format (which also incorporates the owning entity, its current state of operation, the year and cost of execution, the municipalities it serves, and the reference of the entity or company managing the installation).

  19. CSV Clean Fleet Vehicles LISI AUTOMOTIVE FORMER

    • data.europa.eu
    csv
    Updated Nov 22, 2023
    + more versions
    Cite
    LISI AUTOMOTIVE SAS (2023). CSV Clean Fleet Vehicles LISI AUTOMOTIVE FORMER [Dataset]. https://data.europa.eu/data/datasets/65081fd6f24090e1db9e52ee?locale=sv
    Explore at:
    Available download formats: csv (1521)
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    LISI Automotive SAS
    Authors
    LISI AUTOMOTIVE SAS
    Description

    CSV Clean Fleet Vehicles LISI AUTOMOTIVE FORMER

  20. Crypto Market Data CSV Export: Trades, Quotes & Order Book Access via S3

    • datarade.ai
    .json, .csv
    Cite
    CoinAPI, Crypto Market Data CSV Export: Trades, Quotes & Order Book Access via S3 [Dataset]. https://datarade.ai/data-products/coinapi-comprehensive-crypto-market-data-in-flat-files-tra-coinapi
    Explore at:
    Available download formats: .json, .csv
    Dataset provided by
    Coinapi Ltd
    Authors
    CoinAPI
    Area covered
    Solomon Islands, Montserrat, Kyrgyzstan, Qatar, Liechtenstein, Norfolk Island, Iraq, Tanzania, Latvia, Northern Mariana Islands
    Description

    When you need to analyze crypto market history, batch processing often beats streaming APIs. That's why we built the Flat Files S3 API - giving analysts and researchers direct access to structured historical cryptocurrency data without the integration complexity of traditional APIs.

    Pull comprehensive historical data across 800+ cryptocurrencies and their trading pairs, delivered in clean, ready-to-use CSV formats that drop straight into your analysis tools. Whether you're building backtest environments, training machine learning models, or running complex market studies, our flat file approach gives you the flexibility to work with massive datasets efficiently.
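
    As an illustration, a minimal sketch of pulling one flat file from S3 with boto3 and loading it into pandas (the bucket name, key, and file layout here are placeholders, not CoinAPI's actual structure):

      import boto3
      import pandas as pd

      # Placeholder bucket/key; the real S3 paths and credentials come with the subscription.
      s3 = boto3.client("s3")
      s3.download_file(
          "example-coinapi-flat-files",        # hypothetical bucket
          "trades/BTC-USD/2024-01-01.csv",     # hypothetical key
          "trades_btc_usd_2024-01-01.csv",
      )

      trades = pd.read_csv("trades_btc_usd_2024-01-01.csv")
      print(trades.head())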

    Why work with us?

    Market Coverage & Data Types:
    • Comprehensive historical data since 2010 (for chosen assets)
    • Comprehensive order book snapshots and updates
    • Trade-by-trade data

    Technical Excellence:
    • 99.9% uptime guarantee
    • Standardized data format across exchanges
    • Flexible Integration
    • Detailed documentation
    • Scalable Architecture

    CoinAPI serves hundreds of institutions worldwide, from trading firms and hedge funds to research organizations and technology providers. Our S3 delivery method easily integrates with your existing workflows, offering familiar access patterns, reliable downloads, and straightforward automation for your data team. Our commitment to data quality and technical excellence, combined with accessible delivery options, makes us the trusted choice for institutions that demand both comprehensive historical data and real-time market intelligence.
