96 datasets found
  1. WikiTableQuestions (Semi-structured Tables Q&A)

    • kaggle.com
    Updated Nov 27, 2022
    Cite
    The Devastator (2022). WikiTableQuestions (Semi-structured Tables Q&A) [Dataset]. https://www.kaggle.com/datasets/thedevastator/investigation-of-semi-structured-tables-wikitabl
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 27, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Investigation of Semi-Structured Tables: WikiTableQuestions

    A Dataset of Complex Questions on Semi-Structured Wikipedia Tables


    About this dataset

    The WikiTableQuestions dataset poses complex questions about the contents of semi-structured Wikipedia tables. Beyond merely testing a model's knowledge retrieval capabilities, these questions require an understanding of both the natural language used and the structure of the table itself in order to provide a correct answer. This makes the dataset an excellent testing ground for AI models that aim to replicate or exceed human-level intelligence


    How to use the dataset

    In order to use the WikiTableQuestions dataset, you will need to first understand the structure of the dataset. The dataset is comprised of two types of files: questions and answers. The questions are in natural language, and are designed to test a model's ability to understand the table structure, understand the natural language question, and reason about the answer. The answers are in a list format, and provide additional information about each table that can be used to answer the questions.

    To start working with the WikiTableQuestions dataset, you will need to download both the questions and answers files. Once you have downloaded both files, you can begin working with the dataset by loading it into a pandas dataframe. From there, you can begin exploring the data and developing your own models for answering the questions.
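
    A minimal loading sketch in pandas, assuming the questions ship as a tab-separated file (the file name training.tsv below is hypothetical; the table files such as 0.csv are listed under Files further down):

    ```python
    import pandas as pd

    # Hypothetical file names; check the actual file listing of this upload.
    questions = pd.read_csv("training.tsv", sep="\t")  # natural-language questions
    table = pd.read_csv("0.csv")                       # one of the semi-structured tables

    print(questions.head())
    print(table.head())
    ```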

    Happy Kaggling!

    Research Ideas

    • The WikiTableQuestions dataset can be used to train a model to answer complex questions about semi-structured Wikipedia tables.

    • The WikiTableQuestions dataset can be used to train a model to understand the structure of semi-structured Wikipedia tables.

    • The WikiTableQuestions dataset can be used to train a model to understand the natural language questions and reason about the answers

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Files

    0.csv, 1.csv, 10.csv, 11.csv, 12.csv, 14.csv, 15.csv, 17.csv, 18.csv


  2. Code4ML 2.0

    • zenodo.org
    csv, txt
    Updated May 19, 2025
    Cite
    Anonymous authors; Anonymous authors (2025). Code4ML 2.0 [Dataset]. http://doi.org/10.5281/zenodo.15465737
    Explore at:
    Available download formats: csv, txt
    Dataset updated
    May 19, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous authors; Anonymous authors
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is an enriched version of the Code4ML dataset, a large-scale corpus of annotated Python code snippets, competition summaries, and data descriptions sourced from Kaggle. The initial release includes approximately 2.5 million snippets of machine learning code extracted from around 100,000 Jupyter notebooks. A portion of these snippets has been manually annotated by human assessors through a custom-built, user-friendly interface designed for this task.

    The original dataset is organized into multiple CSV files, each containing structured data on different entities:

    • code_blocks.csv: Contains raw code snippets extracted from Kaggle.
    • kernels_meta.csv: Metadata for the notebooks (kernels) from which the code snippets were derived.
    • competitions_meta.csv: Metadata describing Kaggle competitions, including information about tasks and data.
    • markup_data.csv: Annotated code blocks with semantic types, allowing deeper analysis of code structure.
    • vertices.csv: A mapping from numeric IDs to semantic types and subclasses, used to interpret annotated code blocks.

    Table 1. code_blocks.csv structure

    Column | Description
    code_blocks_index | Global index linking code blocks to markup_data.csv.
    kernel_id | Identifier for the Kaggle Jupyter notebook from which the code block was extracted.
    code_block_id | Position of the code block within the notebook.
    code_block | The actual machine learning code snippet.

    Table 2. kernels_meta.csv structure

    Column | Description
    kernel_id | Identifier for the Kaggle Jupyter notebook.
    kaggle_score | Performance metric of the notebook.
    kaggle_comments | Number of comments on the notebook.
    kaggle_upvotes | Number of upvotes the notebook received.
    kernel_link | URL to the notebook.
    comp_name | Name of the associated Kaggle competition.

    Table 3. competitions_meta.csv structure

    Column | Description
    comp_name | Name of the Kaggle competition.
    description | Overview of the competition task.
    data_type | Type of data used in the competition.
    comp_type | Classification of the competition.
    subtitle | Short description of the task.
    EvaluationAlgorithmAbbreviation | Metric used for assessing competition submissions.
    data_sources | Links to datasets used.
    metric type | Class label for the assessment metric.

    Table 4. markup_data.csv structure

    Column | Description
    code_block | Machine learning code block.
    too_long | Flag indicating whether the block spans multiple semantic types.
    marks | Confidence level of the annotation.
    graph_vertex_id | ID of the semantic type.

    The dataset allows mapping between these tables. For example:

    • code_blocks.csv can be linked to kernels_meta.csv via the kernel_id column.
    • kernels_meta.csv is connected to competitions_meta.csv through comp_name. To maintain quality, kernels_meta.csv includes only notebooks with available Kaggle scores.

    In addition, data_with_preds.csv contains automatically classified code blocks, with a mapping back to code_blocks.csv via the code_blocks_index column.
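
    A minimal sketch of the linkage described above (not the authors' code; file and column names are taken from the tables above):

    ```python
    import pandas as pd

    code_blocks = pd.read_csv("code_blocks.csv")
    kernels_meta = pd.read_csv("kernels_meta.csv")
    competitions_meta = pd.read_csv("competitions_meta.csv")
    data_with_preds = pd.read_csv("data_with_preds.csv")

    # code_blocks -> kernels_meta via kernel_id, then -> competitions_meta via comp_name.
    snippets = (code_blocks
                .merge(kernels_meta, on="kernel_id", how="inner")
                .merge(competitions_meta, on="comp_name", how="left"))

    # Automatically classified blocks map back to code_blocks via code_blocks_index.
    classified = data_with_preds.merge(code_blocks, on="code_blocks_index", how="left")
    print(snippets.columns.tolist())
    ```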

    Code4ML 2.0 Enhancements

    The updated Code4ML 2.0 corpus introduces kernels extracted from Meta Kaggle Code. These kernels correspond to the Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the aid of an LLM.

    Notebooks in kernels_meta2.csv may not have a Kaggle score but include a leaderboard ranking (rank), providing additional context for evaluation.

    competitions_meta_2.csv is enriched with data_cards describing the data used in the competitions.

    Applications

    The Code4ML 2.0 corpus is a versatile resource, enabling training and evaluation of models in areas such as:

    • Code generation
    • Code understanding
    • Natural language processing of code-related tasks
  3. Data from: Indicators Table

    • kaggle.com
    Updated May 9, 2018
    Cite
    MarcoMarchetti (2018). Indicators Table [Dataset]. https://www.kaggle.com/datasets/marcomarchetti/indicators-table
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 9, 2018
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    MarcoMarchetti
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by MarcoMarchetti

    Released under CC0: Public Domain


  4. Chinook CSV Dataset

    • kaggle.com
    Updated Nov 9, 2023
    Cite
    Anurag Verma (2023). Chinook CSV Dataset [Dataset]. https://www.kaggle.com/datasets/anurag629/chinook-csv-dataset/data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 9, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Anurag Verma
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset is an export of the tables from the Chinook sample database into CSV files. The Chinook database contains information about a fictional digital media store, including tables for artists, albums, media tracks, invoices, customers, and more.

    The CSV file for each table contains the columns and all rows of data. The column headers match the table schema. Refer to the Chinook schema documentation for more details on each table and column.

    The files are encoded as UTF-8. The delimiter is a comma. Strings are quoted. Null values are represented by empty strings.

    Files

    1. albums.csv
    2. artists.csv
    3. customers.csv
    4. employees.csv
    5. genres.csv
    6. invoice_items.csv
    7. invoices.csv
    8. media_types.csv
    9. playlist_track.csv
    10. playlists.csv
    11. tracks.csv

    Usage

    This dataset can be used to analyze the Chinook store data. For example, you could build models on customer purchases, track listening patterns, or identify trends in genres or artists.

    The data is ideal for practicing with libraries such as pandas, NumPy, and PySpark. The database schema provides a realistic set of tables and relationships.
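
    A minimal pandas sketch of the kind of analysis described above; it assumes this export preserves the standard Chinook column headers (CustomerId, Total, FirstName, LastName), which should be verified against the files:

    ```python
    import pandas as pd

    customers = pd.read_csv("customers.csv")
    invoices = pd.read_csv("invoices.csv")

    # Total spend per customer (column names assume the standard Chinook schema).
    spend = (invoices.groupby("CustomerId", as_index=False)["Total"].sum()
             .merge(customers[["CustomerId", "FirstName", "LastName"]], on="CustomerId")
             .sort_values("Total", ascending=False))
    print(spend.head(10))
    ```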

  5. Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...

    • zenodo.org
    csv
    Updated Sep 15, 2023
    + more versions
    Cite
    Anonymous authors; Anonymous authors (2023). Code4ML: a Large-scale Dataset of annotated Machine Learning Code [Dataset]. http://doi.org/10.5281/zenodo.6607065
    Explore at:
    Available download formats: csv
    Dataset updated
    Sep 15, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous authors; Anonymous authors
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle.

    The data is organized in a table structure. Code4ML includes several main objects: competition information, raw code blocks collected from Kaggle, and manually marked-up snippets. Each table is in .csv format.

    Each competition has a text description and metadata reflecting the competition, the characteristics of the dataset used, and the evaluation metrics (competitions.csv). The corresponding datasets can be loaded using the Kaggle API and the listed data sources.

    The code blocks and their metadata are collected into data frames according to the publishing year of the original kernels. The current version of the corpus includes two code block files: snippets from kernels up to 2020 (code_blocks_upto_20.csv) and those from 2021 (code_blocks_21.csv), with corresponding metadata. The corpus consists of 2,743,615 ML code blocks collected from 107,524 Jupyter notebooks.

    Marked up code blocks have the following metadata: anonymized id, the format of the used data (for example, table or audio), the id of the semantic type, a flag for the code errors, the estimated relevance to the semantic class (from 1 to 5), the id of the parent notebook, and the name of the competition. The current version of the corpus has ~12 000 labeled snippets (markup_data_20220415.csv).

    As marked up code blocks data contains the numeric id of the code block semantic type, we also provide a mapping from this number to semantic type and subclass (actual_graph_2022-06-01.csv).
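
    A minimal sketch for combining the labeled snippets with the semantic-type mapping; the join key names are assumptions and should be checked against the files:

    ```python
    import pandas as pd

    markup = pd.read_csv("markup_data_20220415.csv")
    mapping = pd.read_csv("actual_graph_2022-06-01.csv")

    # Inspect the columns first; the exact key names are not documented here.
    print(markup.columns.tolist())
    print(mapping.columns.tolist())

    # Example join (column names are assumptions; adjust after inspecting the files):
    # labeled = markup.merge(mapping, left_on="graph_vertex_id", right_on="id", how="left")
    ```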

    The dataset can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.

  6. General Table Detection Dataset

    • kaggle.com
    Updated Jan 10, 2022
    Cite
    Rohit singh (2022). General Table Detection Dataset [Dataset]. https://www.kaggle.com/datasets/rhtsingh/general-table-recognition-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 10, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rohit singh
    Description

    Dataset

    This dataset was created by Rohit singh

    Released under Data files © Original Authors


  7. Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...

    • zenodo.org
    Updated May 18, 2024
    + more versions
    Cite
    Ekaterina Trofimova; Ekaterina Trofimova; Emil Sataev; Anastasia Drozdova; Polina Guseva; Anna Scherbakova; Andrey Ustyuzhanin; Anastasia Gorodilova; Valeriy Berezovskiy; Emil Sataev; Anastasia Drozdova; Polina Guseva; Anna Scherbakova; Andrey Ustyuzhanin; Anastasia Gorodilova; Valeriy Berezovskiy (2024). Code4ML: a Large-scale Dataset of annotated Machine Learning Code [Dataset]. http://doi.org/10.5281/zenodo.11213783
    Explore at:
    Dataset updated
    May 18, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ekaterina Trofimova; Ekaterina Trofimova; Emil Sataev; Anastasia Drozdova; Polina Guseva; Anna Scherbakova; Andrey Ustyuzhanin; Anastasia Gorodilova; Valeriy Berezovskiy; Emil Sataev; Anastasia Drozdova; Polina Guseva; Anna Scherbakova; Andrey Ustyuzhanin; Anastasia Gorodilova; Valeriy Berezovskiy
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is an enriched version of Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle. The initial corpus consists of ≈ 2.5 million snippets of ML code collected from ≈ 100 thousand Jupyter notebooks. A representative fraction of the snippets is annotated by human assessors through a user-friendly interface specially designed for that purpose.

    The data is organized as a set of tables in CSV format. It includes several central entities: raw code blocks collected from Kaggle (code_blocks.csv), kernels (kernels_meta.csv), and competition meta information (competitions_meta.csv). Manually annotated code blocks are presented as a separate table (markup_data.csv). As this table contains the numeric id of the code block semantic type, we also provide a mapping from the id to semantic class and subclass (vertices.csv).

    Snippet information (code_blocks.csv) can be mapped to kernel metadata via kernel_id. Kernel metadata is linked to Kaggle competition information through comp_name. To ensure the quality of the data, kernels_meta.csv includes only notebooks with an available Kaggle score.

    Automatically classified code blocks are stored in data_with_preds.csv. This table can be mapped to code_blocks.csv through the code_blocks_index column, which corresponds to the code_blocks indices.

    The updated Code4ML 2.0 corpus includes kernels retrieved from Meta Kaggle Code. These kernels correspond to the Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the aid of an LLM.

    kernels_meta2.csv may contain kernels without a Kaggle score but with a leaderboard position (rank).

    Code4ML 2.0 dataset can be used for various purposes, including training and evaluating models for code generation, code understanding, and natural language processing tasks.

  8. NBA Player Dataset & Prediction Model Artifacts

    • test.researchdata.tuwien.ac.at
    bin, csv, json, png +2
    Updated Apr 28, 2025
    Cite
    Burak Baltali; Burak Baltali (2025). NBA Player Dataset & Prediction Model Artifacts [Dataset]. http://doi.org/10.70124/ymgzs-z3s43
    Explore at:
    Available download formats: csv, text/markdown, png, bin, txt, json
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    TU Wien
    Authors
    Burak Baltali; Burak Baltali
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains end-of-season box-score aggregates for NBA players over the 2012–13 through 2023–24 seasons, split into training and test sets for both regular season and playoffs. Each CSV has one row per player per season with columns for points, rebounds, steals, turnovers, 3-pt attempts, FG attempts, plus identifiers.

    Brief overview of Files

    1. End-of-season box-score aggregates (2012-13 through 2023-24), split into train/test sets.

    2. The Jupyter notebook (Analysis.ipynb); all the code can be executed in there.

    3. The trained model binary (nba_model.pkl), a serialized Random Forest model artifact.

    4. Evaluation plots (LAL vs. whole-league) for regular-season and playoff predictions, provided as PNG outputs.

    5. FAIR4ML metadata (fair4ml_metadata.jsonld); see README.md and abbreviations.txt for file details.

    6. For further information, see the GitHub repository (link below).

    File Details

    Notebook

    Analysis.ipynb: Contains the graphical output of the trained and tested data.

    Training/Test CSV Data

    Name | Description | PID
    regular_train.csv | Regular-season training data (2012-2013 through 2021-2022 seasons) | 4421e56c-4cd3-4ec1-a566-a89d7ec0bced
    regular_test.csv | Regular-season test data (2022-2023 season) | f9d84d5e-db01-4475-b7d1-80cfe9fe0e61
    playoff_train.csv | Playoff training data (2012-2013 through 2022-2023 seasons) | bcb3cf2b-27df-48cc-8b76-9e49254783d0
    playoff_test.csv | Playoff test data (2023-2024 season) | de37d568-e97f-4cb9-bc05-2e600cc97102

    Others

    abbrevations.txt: Contains the fundamental abbreviations for the columns in the CSV data.

    Additional Notes

    Raw csv files are taken from Kaggle (Source: https://www.kaggle.com/datasets/shivamkumar121215/nba-stats-dataset-for-last-10-years/data)

    Some preprocessing has to be done before uploading into dbrepo

    Plots have also been uploaded as an output for visual purposes.

    A more detailed version can be found on github (Link: https://github.com/bubaltali/nba-prediction-analysis/)
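
    A minimal sketch for loading the splits and the serialized model; it assumes nba_model.pkl was saved with the standard pickle protocol and that the feature columns match those used for training (see abbrevations.txt and README.md), so the column selection below is illustrative only:

    ```python
    import pickle

    import pandas as pd

    regular_train = pd.read_csv("regular_train.csv")
    regular_test = pd.read_csv("regular_test.csv")

    # Load the serialized Random Forest artifact shipped with the dataset.
    with open("nba_model.pkl", "rb") as f:
        model = pickle.load(f)

    # Illustrative feature selection: drop identifier/target columns (names are assumptions).
    feature_cols = [c for c in regular_test.columns
                    if c not in ("player", "season", "points")]
    predictions = model.predict(regular_test[feature_cols])
    print(predictions[:10])
    ```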

  9. Text-audio pairs (4 of 4)

    • kaggle.com
    zip
    Updated Aug 14, 2024
    + more versions
    Cite
    Jorvan (2024). Text-audio pairs (4 of 4) [Dataset]. https://www.kaggle.com/datasets/jorvan/text-audio-pairs-4-of-4
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Aug 14, 2024
    Authors
    Jorvan
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This is the fourth of the four datasets that we have created for audio-text training tasks. These collect pairs of texts and audios, based on the audio-image pairs from our datasets [1, 2, 3]. They are intended for research purposes only.

    For the conversion, .csv tables were created in which the audio values were separated into 16,000 columns and the images were transformed into texts using the public BLIP model [4]. The original images are also preserved for future reference.
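
    A minimal sketch of reading one of these tables back into a waveform matrix; the file name and the assumption that the 16,000 audio-sample columns have purely numeric names are illustrative and should be checked against the actual files:

    ```python
    import numpy as np
    import pandas as pd

    # Hypothetical file name; use one of the CSV tables shipped with this part.
    df = pd.read_csv("text_audio_pairs_part4.csv")

    # Assumption: the 16,000 audio sample values sit in numerically named columns,
    # with the remaining columns holding the BLIP-generated caption and metadata.
    audio_cols = [c for c in df.columns if c.isdigit()]
    waveforms = df[audio_cols].to_numpy(dtype=np.float32)  # shape: (n_rows, 16000)
    metadata = df.drop(columns=audio_cols)
    print(waveforms.shape, metadata.columns.tolist())
    ```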

    To allow other researchers a quick evaluation of the potential usefulness of our datasets for their purposes, we have made available a public page where anyone can check 60 random samples that we extracted from all of our data [5].

    [1] Jorge E. León. Image-audio pairs (1 of 3). 2024. URL: https://www.kaggle.com/datasets/jorvan/image-audio-pairs-1-of-3.
    [2] Jorge E. León. Image-audio pairs (2 of 3). 2024. URL: https://www.kaggle.com/datasets/jorvan/image-audio-pairs-2-of-3.
    [3] Jorge E. León. Image-audio pairs (3 of 3). 2024. URL: https://www.kaggle.com/datasets/jorvan/image-audio-pairs-3-of-3.
    [4] Junnan Li et al. "BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation". In: ArXiv 2201.12086 (2022).
    [5] Jorge E. León. AVT Multimodal Dataset. 2024. URL: https://jorvan758.github.io/AVT-Multimodal-Dataset/.

  10. stackoverflow_python

    • huggingface.co
    • opendatalab.com
    Cite
    Charles Koutcheme, stackoverflow_python [Dataset]. https://huggingface.co/datasets/koutch/stackoverflow_python
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    Charles Koutcheme
    Description

    Dataset Card for "stackoverflow_python"

      Dataset Summary
    

    This dataset comes originally from Kaggle. It was originally split into three tables (CSV files: Questions, Answers, and Tags), which have been merged into a single table. Each row corresponds to a question-answer pair and its associated tags. The dataset contains all questions asked between August 2, 2008 and October 19, 2016.

      Supported Tasks and Leaderboards
    

    This might be useful for open-domain… See the full description on the dataset page: https://huggingface.co/datasets/koutch/stackoverflow_python.
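
    A minimal loading sketch using the Hugging Face datasets library; the dataset path comes from the card above, while the split and field names are not documented here and may need adjusting:

    ```python
    from datasets import load_dataset

    ds = load_dataset("koutch/stackoverflow_python")
    print(ds)                      # available splits and features
    first_split = next(iter(ds.values()))
    print(first_split[0])          # peek at one question-answer pair
    ```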

  11. ‘COVID vaccination vs. mortality’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Aug 4, 2020
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2020). ‘COVID vaccination vs. mortality ’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-covid-vaccination-vs-mortality-cbd8/06c8ccd2/?iid=010-492&v=presentation
    Explore at:
    Dataset updated
    Aug 4, 2020
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘COVID vaccination vs. mortality ’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/sinakaraji/covid-vaccination-vs-death on 12 November 2021.

    --- Dataset description provided by original source is as follows ---

    Context

    The COVID-19 outbreak has brought the whole planet to its knees. More than 4.5 million people have died since the writing of this notebook, and the only acceptable way out of the disaster is to vaccinate all parts of society. Despite the fact that the benefits of vaccination have been proved to the world many times, anti-vaccine groups are springing up all over the world. This data set was generated to investigate the impact of coronavirus vaccinations on coronavirus mortality.

    Content

    Column | Description
    country | country name
    iso_code | ISO code for each country
    date | date that this data belongs to
    total_vaccinations | number of all doses of COVID vaccine used in that country
    people_vaccinated | number of people who got at least one shot of COVID vaccine
    people_fully_vaccinated | number of people who got full vaccine shots
    New_deaths | number of daily new deaths
    population | 2021 country population
    ratio | % of vaccinations in that country at that date = people_vaccinated / population * 100
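
    A minimal pandas sketch that recomputes the documented ratio column; the CSV file name is an assumption, while the column names follow the table above:

    ```python
    import pandas as pd

    # File name is an assumption; use the CSV shipped with this dataset.
    df = pd.read_csv("covid_vaccination_vs_death_ratio.csv")

    # Recompute the documented ratio: people_vaccinated / population * 100.
    df["ratio_check"] = df["people_vaccinated"] / df["population"] * 100
    print(df[["country", "date", "ratio", "ratio_check"]].head())
    ```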

    Data Collection

    This dataset is a combination of the following three datasets:

    1. https://www.kaggle.com/gpreda/covid-world-vaccination-progress

    2. https://covid19.who.int/WHO-COVID-19-global-data.csv

    3. https://www.kaggle.com/rsrishav/world-population

    you can find more detail about this dataset by reading this notebook:

    https://www.kaggle.com/sinakaraji/simple-linear-regression-covid-vaccination

    Countries in this dataset:

    Afghanistan, Albania, Algeria, Andorra, Angola,
    Anguilla, Antigua and Barbuda, Argentina, Armenia, Aruba,
    Australia, Austria, Azerbaijan, Bahamas, Bahrain,
    Bangladesh, Barbados, Belarus, Belgium, Belize,
    Benin, Bermuda, Bhutan, Bolivia (Plurinational State of), Brazil,
    Bosnia and Herzegovina, Botswana, Brunei Darussalam, Bulgaria, Burkina Faso,
    Cambodia, Cameroon, Canada, Cabo Verde, Cayman Islands,
    Central African Republic, Chad, Chile, China, Colombia,
    Comoros, Cook Islands, Costa Rica, Croatia, Cuba,
    Curaçao, Cyprus, Denmark, Djibouti, Dominica,
    Dominican Republic, Ecuador, Egypt, El Salvador, Equatorial Guinea,
    Estonia, Ethiopia, Falkland Islands (Malvinas), Fiji, Finland,
    France, French Polynesia, Gabon, Gambia, Georgia,
    Germany, Ghana, Gibraltar, Greece, Greenland,
    Grenada, Guatemala, Guinea, Guinea-Bissau, Guyana,
    Haiti, Honduras, Hungary, Iceland, India,
    Indonesia, Iran (Islamic Republic of), Iraq, Ireland, Isle of Man,
    Israel, Italy, Jamaica, Japan, Jordan,
    Kazakhstan, Kenya, Kiribati, Kuwait, Kyrgyzstan,
    Lao People's Democratic Republic, Latvia, Lebanon, Lesotho, Liberia,
    Libya, Liechtenstein, Lithuania, Luxembourg, Madagascar,
    Malawi, Malaysia, Maldives, Mali, Malta,
    Mauritania, Mauritius, Mexico, Republic of Moldova, Monaco,
    Mongolia, Montenegro, Montserrat, Morocco, Mozambique,
    Myanmar, Namibia, Nauru, Nepal, Netherlands,
    New Caledonia, New Zealand, Nicaragua, Niger, Nigeria,
    Niue, North Macedonia, Norway, Oman, Pakistan,
    occupied Palestinian territory (including east Jerusalem),
    Panama, Papua New Guinea, Paraguay, Peru, Philippines,
    Poland, Portugal, Qatar, Romania, Russian Federation,
    Rwanda, Saint Kitts and Nevis, Saint Lucia,
    Saint Vincent and the Grenadines, Samoa, San Marino, Sao Tome and Principe, Saudi Arabia,
    Senegal, Serbia, Seychelles, Sierra Leone, Singapore,
    Slovakia, Slovenia, Solomon Islands, Somalia, South Africa,
    Republic of Korea, South Sudan, Spain, Sri Lanka, Sudan,
    Suriname, Sweden, Switzerland, Syrian Arab Republic, Tajikistan,
    United Republic of Tanzania, Thailand, Togo, Tonga, Trinidad and Tobago,
    Tunisia, Turkey, Turkmenistan, Turks and Caicos Islands, Tuvalu,
    Uganda, Ukraine, United Arab Emirates, The United Kingdom, United States of America,
    Uruguay, Uzbekistan, Vanuatu, Venezuela (Bolivarian Republic of), Viet Nam,
    Wallis and Futuna, Yemen, Zambia, Zimbabwe

    --- Original source retains full ownership of the source dataset ---

  12. Purchase Order Data

    • data.ca.gov
    • catalog.data.gov
    csv, docx, pdf
    Updated Oct 23, 2019
    Cite
    California Department of General Services (2019). Purchase Order Data [Dataset]. https://data.ca.gov/dataset/purchase-order-data
    Explore at:
    Available download formats: docx, csv, pdf
    Dataset updated
    Oct 23, 2019
    Dataset authored and provided by
    California Department of General Services
    Description

    The State Contract and Procurement Registration System (SCPRS) was established in 2003 as a centralized database of information on State contracts and purchases over $5,000. eSCPRS represents the data captured in the State's eProcurement (eP) system, Bidsync, as of March 16, 2009. The data provided is an extract from that system for fiscal years 2012-2013, 2013-2014, and 2014-2015.

    Data Limitations:
    Some purchase orders have multiple UNSPSC numbers; however, only the first was used to identify the purchase order. Multiple UNSPSC numbers were included to provide additional data for a DGS special event; however, this affects the formatting of the file. The source system Bidsync is being deprecated, and these issues will be resolved in the future as state systems transition to Fi$cal.

    Data Collection Methodology:

    The data collection process starts with a data file from eSCPRS that is scrubbed and standardized prior to being uploaded into a SQL Server database. There are four primary tables. The Supplier, Department and United Nations Standard Products and Services Code (UNSPSC) tables are reference tables. The Supplier and Department tables are updated and mapped to the appropriate numbering schema and naming conventions. The UNSPSC table is used to categorize line item information and requires no further manipulation. The Purchase Order table contains raw data that requires conversion to the correct data format and mapping to the corresponding data fields. A stacking method is applied to the table to eliminate blanks where needed. Extraneous characters are removed from fields. The four tables are joined together and queries are executed to update the final Purchase Order Dataset table. Once the scrubbing and standardization process is complete the data is then uploaded into the SQL Server database.

    Secondary/Related Resources:

  13. Cyclistic_csv_data_Pivot_table

    • kaggle.com
    Updated Dec 29, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Stephen Aidoo (2024). Cyclistic_csv_data_Pivot_table [Dataset]. https://www.kaggle.com/datasets/stevenaidoo/cyclistic-csv-data-pivot-table/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 29, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Stephen Aidoo
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Dataset

    This dataset was created by Stephen Aidoo

    Released under Database: Open Database, Contents: Database Contents


  14. Hospital Management Dataset

    • kaggle.com
    Updated May 30, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Kanak Baghel (2025). Hospital Management Dataset [Dataset]. https://www.kaggle.com/datasets/kanakbaghel/hospital-management-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 30, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Kanak Baghel
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This is a structured, multi-table dataset designed to simulate a hospital management system. It is ideal for practicing data analysis, SQL, machine learning, and healthcare analytics.

    Dataset Overview

    This dataset includes five CSV files:

    1. patients.csv – Patient demographics, contact details, registration info, and insurance data

    2. doctors.csv – Doctor profiles with specializations, experience, and contact information

    3. appointments.csv – Appointment dates, times, visit reasons, and statuses

    4. treatments.csv – Treatment types, descriptions, dates, and associated costs

    5. billing.csv – Billing amounts, payment methods, and status linked to treatments

    📁 Files & Column Descriptions

    patients.csv

    Contains patient demographic and registration details.

    Column -> Description
    patient_id -> Unique ID for each patient
    first_name -> Patient's first name
    last_name -> Patient's last name
    gender -> Gender (M/F)
    date_of_birth -> Date of birth
    contact_number -> Phone number
    address -> Address of the patient
    registration_date -> Date of first registration at the hospital
    insurance_provider -> Insurance company name
    insurance_number -> Policy number
    email -> Email address

    doctors.csv

    Details about the doctors working in the hospital.

    Column -> Description
    doctor_id -> Unique ID for each doctor
    first_name -> Doctor's first name
    last_name -> Doctor's last name
    specialization -> Medical field of expertise
    phone_number -> Contact number
    years_experience -> Total years of experience
    hospital_branch -> Branch of hospital where the doctor is based
    email -> Official email address

    appointments.csv

    Records of scheduled and completed patient appointments.

    Column -> Description
    appointment_id -> Unique appointment ID
    patient_id -> ID of the patient
    doctor_id -> ID of the attending doctor
    appointment_date -> Date of the appointment
    appointment_time -> Time of the appointment
    reason_for_visit -> Purpose of visit (e.g., checkup)
    status -> Status (Scheduled, Completed, Cancelled)

    treatments.csv

    Information about the treatments given during appointments.

    Column -> Description
    treatment_id -> Unique ID for each treatment
    appointment_id -> Associated appointment ID
    treatment_type -> Type of treatment (e.g., MRI, X-ray)
    description -> Notes or procedure details
    cost -> Cost of treatment
    treatment_date -> Date when treatment was given

    billing.csv

    Billing and payment details for treatments.

    Column -> Description
    bill_id -> Unique billing ID
    patient_id -> ID of the billed patient
    treatment_id -> ID of the related treatment
    bill_date -> Date of billing
    amount -> Total amount billed
    payment_method -> Mode of payment (Cash, Card, Insurance)
    payment_status -> Status of payment (Paid, Pending, Failed)
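
    A minimal pandas sketch of the relational joins this schema supports, using the documented keys (patient_id, treatment_id):

    ```python
    import pandas as pd

    patients = pd.read_csv("patients.csv")
    treatments = pd.read_csv("treatments.csv")
    billing = pd.read_csv("billing.csv")

    # Join billing to treatments and patients, then total the billed amount per patient.
    bills = (billing
             .merge(treatments, on="treatment_id", how="left")
             .merge(patients[["patient_id", "first_name", "last_name"]], on="patient_id", how="left"))
    per_patient = (bills.groupby(["patient_id", "first_name", "last_name"], as_index=False)["amount"]
                   .sum()
                   .sort_values("amount", ascending=False))
    print(per_patient.head())
    ```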

    Possible Use Cases

    SQL queries and relational database design

    Exploratory data analysis (EDA) and dashboarding

    Machine learning projects (e.g., cost prediction, no-show analysis)

    Feature engineering and data cleaning practice

    End-to-end healthcare analytics workflows

    Recommended Tools & Resources

    SQL (joins, filters, window functions)

    Pandas and Matplotlib/Seaborn for EDA

    Scikit-learn for ML models

    Pandas Profiling for automated EDA

    Plotly for interactive visualizations

    Please note:

    All data is synthetically generated for educational and project use. No real patient information is included.

    If you find this dataset helpful, consider upvoting or sharing your insights by creating a Kaggle notebook.

  15. ‘Rare Pepes’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Rare Pepes’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-rare-pepes-3b0c/6139e02f/?iid=001-911&v=presentation
    Explore at:
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Rare Pepes’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/rare-pepese on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    About this dataset

    Data behind the story Can The Blockchain Turn Pepe The Frog Into Modern Art?
    There are four data files, described below. You can also find further information about individual Rare Pepe assets at Rare Pepe Wallet.

    ordermatches_all.csv contains all Rare Pepe order matches from the beginning of the project, in late 2016, until Feb. 3. All order matches include a pair of assets (a “forward asset” and a “backward asset”), one of which is a Rare Pepe and the other of which is either XCP, the native Counterparty token, or Pepe Cash. The time of the order match can be determined by the block.

    Header | Description
    Block | The block number
    ForwardAsset | The type of forward asset
    ForwardQuantity | The quantity of forward asset
    BackwardAsset | The type of backward asset
    BackwardQuantity | The quantity of backward asset

    blocks_timestamps.csv is a pairing of block and timestamp. This can be used to determine the actual time an order match occurred, which can then be used to determine the dollar value of Pepe Cash or XCP at the time of the trade.

    Header | Description
    Block | The block number
    Timestamp | A Unix timestamp

    pepecash_prices.csv contains the dollar price of Pepe Cash over time.

    Header | Description
    Timestamp | A Unix timestamp
    Price | The price of Pepe Cash in dollars

    xcp_prices.csv contains the dollar price of XCP over time.

    Header | Description
    Timestamp | A Unix timestamp
    Price | The price of XCP in dollars
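
    A minimal pandas sketch of the lookup chain described above (order match -> block -> timestamp -> nearest price), using the documented file and column names:

    ```python
    import pandas as pd

    orders = pd.read_csv("ordermatches_all.csv")
    blocks = pd.read_csv("blocks_timestamps.csv")
    pepecash = pd.read_csv("pepecash_prices.csv")

    # Attach a timestamp to each order match via its block number.
    orders = (orders.merge(blocks, on="Block", how="left")
              .dropna(subset=["Timestamp"])
              .sort_values("Timestamp"))

    # Attach the nearest preceding Pepe Cash price (merge_asof needs sorted keys).
    pepecash = pepecash.sort_values("Timestamp")
    orders = pd.merge_asof(orders, pepecash, on="Timestamp", direction="backward")
    print(orders.head())
    ```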

    Source: Rare Pepe Foundation

    The data is available under the Creative Commons Attribution 4.0 International License and the code is available under the MIT License. If you do find it useful, please let us know.

    Source: https://github.com/fivethirtyeight/data

    This dataset was created by FiveThirtyEight and contains around 30,000 samples along with Block, Backward Quantity, technical information, and other features such as Forward Quantity and Backward Asset.

    How to use this dataset

    • Analyze Forward Asset in relation to Backward Quantity
    • Study the influence of Block on Forward Quantity
    • More datasets

    Acknowledgements

    If you use this dataset in your research, please credit FiveThirtyEight


    --- Original source retains full ownership of the source dataset ---

  16. ‘Hotel Prices - Beginner Dataset’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Hotel Prices - Beginner Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-hotel-prices-beginner-dataset-6aca/74a157b1/?iid=000-816&v=presentation
    Explore at:
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Hotel Prices - Beginner Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/sveneschlbeck/hotel-prices-beginner-dataset on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    This dataset is aimed at data science students and beginners who want to dive into regression or clustering without the need to pre-clean the data first.

    Content

    This dataset consists of a pre-cleaned .csv table that has been translated from German to English.

    There are four columns in this dataset:

    • Profit (How much money does this hotel make in a year)
    • Price in Millions (€)
    • Square Meter (Hotel Area)
    • City

    Here, "Hotel Prices" does not refer to the cost of spending a night at those hotels but the price for buying them. This would be an interesting chart for someone who wants to buy a hotel and needs to judge whether he/she is overpaying or getting a great deal depending on similar objects in other comparable cities.

    --- Original source retains full ownership of the source dataset ---

  17. ‘Winter Olympics Prediction - Fantasy Draft Picks’ analyzed by Analyst-2

    • analyst-2.ai
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com), ‘Winter Olympics Prediction - Fantasy Draft Picks’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-winter-olympics-prediction-fantasy-draft-picks-2684/07d15ca8/?iid=004-753&v=presentation
    Explore at:
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Winter Olympics Prediction - Fantasy Draft Picks’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/ericsbrown/winter-olympics-prediction-fantasy-draft-picks on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Olympic Draft Predictive Model

    Our family runs an Olympic Draft - similar to fantasy football or baseball - for each Olympic cycle. The purpose of this case study is to identify trends in medal count / point value to create a predictive analysis of which teams should be selected in which order.

    There are a few assumptions that will impact the final analysis:

    • Point value: each medal is worth Gold - 6 points, Silver - 4 points, Bronze - 3 points.
    • The analysis reviews the last 10 Olympic cycles.
    • Winter Olympics only.

    All GDP numbers are in USD

    My initial hypothesis is that larger GDP per capita and size of contingency are correlated with better points values for the Olympic draft.

    All Data pulled from the following Datasets:

    Winter Olympics Medal Count - https://www.kaggle.com/ramontanoeiro/winter-olympic-medals-1924-2018
    Worldwide GDP History - https://data.worldbank.org/indicator/NY.GDP.MKTP.CD?end=2020&start=1984&view=chart

    GDP data was a wide format when downloaded from the World Bank. Opened file in Excel, removed irrelevant years, and saved as .csv.

    Process

    In RStudio utilized the following code to convert wide data to long:

    install.packages("tidyverse") library(tidyverse) library(tidyr)

    Converting to long data from wide

    long <- newgdpdata %>% gather(year, value, -c("Country Name","Country Code"))

    Completed these same steps for GDP per capita.

    Primary Key Creation

    There are differing types of data between these two databases, and there is not a good primary key to utilize. Used CONCAT to create a new key column in both, combining the year and country code to create a unique identifier that matches between the datasets.

    SELECT *, CONCAT(year,country_code) AS "Primary" FROM medal_count

    Saved as new table "medals_w_primary"

    Utilized Excel to concatenate the primary key for GDP and GDP per capita utilizing:

    =CONCAT()

    Saved as new csv files.

    Uploaded all to SSMS.

    Contingent Size

    Next need to add contingent size.

    No existing database had this information. Pulled data from Wikipedia.

    2018 - No problem, pulled existing table. 2014 - Table was not created. Pulled information into excel, needed to convert the country NAMES into the country CODES.

    Created excel document with all ISO Country Codes. Items were broken down between both formats, either 2 or 3 letters. Example:

    AF/AFG

    Used =RIGHT(C1,3) to extract only the country codes.

    For the country participants list in 2014, copied source data from Wikipedia and pasted as plain text (not HTML).

    Items then showed as: Albania (2)

    Broke cells using "(" as the delimiter to separate country names and numbers, then find and replace to remove all parenthesis from this data.

    We were left with: Albania 2

    Used VLOOKUP to create correct country code: =VLOOKUP(A1,'Country Codes'!A:D,4,FALSE)

    This worked for almost all items with a few exceptions that didn't match. Based on nature and size of items, manually checked on which items were incorrect.

    Chinese Taipei 3 #N/A
    Great Britain 56 #N/A
    Virgin Islands 1 #N/A

    This was relatively easy to fix by adding corresponding line items to the Country Codes sheet to account for future variability in the country code names.

    Copied over to main sheet.

    Repeated this process for additional years.

    Once complete created sheet with all 10 cycles of data. In total there are 731 items.

    Data Cleaning

    Filtered by Country Code since this was an issue early on.

    Found a number of N/A Country Codes:

    Serbia and Montenegro, FR Yugoslavia, FR Yugoslavia, Czechoslovakia, Unified Team, Yugoslavia, Czechoslovakia, East Germany, West Germany, Soviet Union, Yugoslavia, Czechoslovakia, East Germany, West Germany, Soviet Union, Yugoslavia

    Appears to be issues with older codes, Soviet Union block countries especially. Referred to historical data and filled in these country codes manually. Codes found on iso.org.

    Filled all in, one issue that was more difficult is the Unified Team of 1992 and Soviet Union. For simplicity used code for Russia - GDP data does not recognize the Soviet Union, breaks the union down to constituent countries. Using Russia is a reasonable figure for approximations and analysis to attempt to find trends.

    From here created a filter and scanned through the country names to ensure there were no obvious outliers. Found the following:

    Olympic Athletes from Russia[b] -- This is a one-off due to the recent PED controversy for Russia. Amended the Country Code to RUS to more accurately reflect the trends.

    Korea[a] and South Korea -- both were listed in 2018. This is due to the unified Korean team that competed. This is an outlier and does not warrant standing on its own as the 2022 Olympics will not have this team (as of this writing on 01/14/2022). Removed the COR country code item.

    Confirmed Primary Key was created for all entries.

    Ran minimum and maximum years, no unexpected values. Ran minimum and maximum Athlete numbers, no unexpected values. Confirmed length of columns for Country Code and Primary Key.

    No NULL values in any columns. Ready to import to SSMS.

    SQL work

    We now have 4 tables, joined together to create the master table:

    SELECT
        [OlympicDraft].[dbo].[medals_w_primary].[year],
        host_country,
        host_city,
        [OlympicDraft].[dbo].[medals_w_primary].[country_name],
        [OlympicDraft].[dbo].[medals_w_primary].[country_code],
        Gold,
        Silver,
        Bronze,
        [OlympicDraft].[dbo].[gdp_w_primary].[value] AS GDP,
        [OlympicDraft].[dbo].[convertedgdpdatapercapita].[gdp_per_capita],
        Atheletes
    FROM medals_w_primary
    INNER JOIN gdp_w_primary
        ON [OlympicDraft].[dbo].[medals_w_primary].[primary] = [OlympicDraft].[dbo].[gdp_w_primary].[year_country]
    INNER JOIN contingency_cleaned
        ON [OlympicDraft].[dbo].[medals_w_primary].[primary] = [OlympicDraft].[dbo].[contingency_cleaned].[Year_Country]
    INNER JOIN convertedgdpdatapercapita
        ON [OlympicDraft].[dbo].[medals_w_primary].[primary] = [OlympicDraft].[dbo].[convertedgdpdatapercapita].[Year_Country]
    ORDER BY year DESC

    This left us with the following table:

    (Screenshot of the joined table: https://i.imgur.com/tpNhiNs.png)

    Performed some basic cleaning tasks to ensure no outliers:

    Checked GDP numbers: 1992 North Korea shows as null. Updated this row with information from countryeconomy.com - $12,458,000,000

    Checked GDP per capita:

    1992 North Korea again missing. Updated this to $595, utilized same source.

    UPDATE [OlympicDraft].[dbo].[gdp_w_primary] SET [OlympicDraft].[dbo].[gdp_w_primary].[value] = 12458000000 WHERE [OlympicDraft].[dbo].[gdp_w_primary].[year_country] = '1992PRK'

    UPDATE [OlympicDraft].[dbo].[convertedgdpdatapercapita] SET [OlympicDraft].[dbo].[convertedgdpdatapercapita].[gdp_per_capita] = 595 WHERE [OlympicDraft].[dbo].[convertedgdpdatapercapita].[year_country] = '1992PRK'

    Liechtenstein showed as an outlier with GDP per capita at 180,366 in 2018. Confirmed this number is correct per the World Bank; it appears Liechtenstein does not often have athletes in the Winter Olympics. A quick SQL search to verify this shows that they fielded 3 athletes in 2018, with a Bronze medal being won. Initially this appears to be a good ratio for win/loss.

    Finally, need to create a column that shows the total point value for each of these rows based on the above formula (6 points for Gold, 4 points for Silver, 3 points for Bronze).

    Updated query as follows:

    SELECT
        [OlympicDraft].[dbo].[medals_w_primary].[year],
        host_country,
        host_city,
        [OlympicDraft].[dbo].[medals_w_primary].[country_name],
        [OlympicDraft].[dbo].[medals_w_primary].[country_code],
        Gold,
        Silver,
        Bronze,
        [OlympicDraft].[dbo].[gdp_w_primary].[value] AS GDP,
        [OlympicDraft].[dbo].[convertedgdpdatapercapita].[gdp_per_capita],
        Atheletes,
        (Gold*6) + (Silver*4) + (Bronze*3) AS 'Total_Points'
    FROM [OlympicDraft].[dbo].[medals_w_primary]
    INNER JOIN gdp_w_primary
        ON [OlympicDraft].[dbo].[medals_w_primary].[primary] = [OlympicDraft].[dbo].[gdp_w_primary].[year_country]
    INNER JOIN contingency_cleaned
        ON [OlympicDraft].[dbo].[medals_w_primary].[primary] = [OlympicDraft].[dbo].[contingency_cleaned].[Year_Country]
    INNER JOIN convertedgdpdatapercapita
        ON [OlympicDraft].[dbo].[medals_w_primary].[primary] = [OlympicDraft].[dbo].[convertedgdpdatapercapita].[Year_Country]
    ORDER BY [OlympicDraft].[dbo].[convertedgdpdatapercapita].[year]

    Spot checked, calculating correctly.

    Saved result as winter_olympics_study.csv.

    We can now see that all relevant information is in this table:

    (Screenshot of the final table: https://i.imgur.com/ceZvqCA.png)

    RStudio Work

    To continue our analysis, opened this CSV in RStudio.

    install.packages("tidyverse") library(tidyverse) library(ggplot2) install.packages("forecast") library(forecast) install.packages("GGally") library(GGally) install.packages("modelr") library(modelr)

    View(winter_olympic_study)

    Finding correlation between gdp_per_capita and Total_Points

    ggplot(data = winter_olympic_study) + geom_point(aes(x=gdp_per_capita,y=Total_Points,color=country_name)) + facet_wrap(~country_name)

    cor(winter_olympic_study$gdp_per_capita, winter_olympic_study$Total_Points, method = c("pearson"))

    Result is .347, showing a moderate correlation between these two figures.

    Looked next at GDP vs. Total_Points:

    ggplot(data = winter_olympic_study) + geom_point(aes(x=GDP,y=Total_Points,color=country_name)) + facet_wrap(~country_name)

    cor(winter_olympic_study$GDP, winter_olympic_study$Total_Points, method = c("pearson"))

    This resulted in 0.35, essentially no difference from the GDP-per-capita correlation.

    Next looked at contingent size vs. total points:

    ggplot(data = winter_olympic_study) + geom_point(aes(x=Atheletes,y=Total_Points,color=country_name)) +

  18. News Ninja Dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv
    Updated Feb 20, 2024
    Cite
    anon; anon (2024). News Ninja Dataset [Dataset]. http://doi.org/10.5281/zenodo.10683029
    Explore at:
    Available download formats: csv, bin
    Dataset updated
    Feb 20, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    anon; anon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    About
    Recent research shows that visualizing linguistic media bias mitigates its negative effects. However, reliable automatic detection methods to generate such visualizations require costly, knowledge-intensive training data. To facilitate data collection for media bias datasets, we present News Ninja, a game employing data-collecting game mechanics to generate a crowdsourced dataset. Before annotating sentences, players are educated on media bias via a tutorial. Our findings show that datasets gathered with crowdsourced workers trained on News Ninja can reach significantly higher inter-annotator agreements than expert and crowdsourced datasets. As News Ninja encourages continuous play, it allows datasets to adapt to the reception and contextualization of news over time, presenting a promising strategy to reduce data collection expenses, educate players, and promote long-term bias mitigation.

    General
    This dataset was created through player annotations in the News Ninja Game made by ANON. Its goal is to improve the detection of linguistic media bias. Support came from ANON. None of the funders played any role in the dataset creation process or publication-related decisions.

    The dataset includes sentences with binary bias labels (processed, biased or not biased) as well as the annotations of single players used for the majority vote. It includes all game-collected data. All data is completely anonymous. The dataset does not identify sub-populations or can be considered sensitive to them, nor is it possible to identify individuals.

    Some sentences might be offensive or triggering as they were taken from biased or more extreme news sources. The dataset contains topics such as violence, abortion, and hate against specific races, genders, religions, or sexual orientations.

    Description of the Data Files
    This repository contains the datasets for the anonymous News Ninja submission. The tables contain the following data:

    ExportNewsNinja.csv: Contains 370 BABE sentences and 150 new sentences with their text (sentence), words labeled as biased (words), BABE ground truth (ground_Truth), and the sentence bias label from the player annotations (majority_vote). The first 370 sentences are re-annotated BABE sentences, and the following 150 sentences are new sentences.

    AnalysisNewsNinja.xlsx: Contains 370 BABE sentences and 150 new sentences. The first 370 sentences are re-annotated BABE sentences, and the following 150 sentences are new sentences. The table includes the full sentence (Sentence), the sentence bias label from player annotations (isBiased Game), the new expert label (isBiased Expert), if the game label and expert label match (Game VS Expert), if differing labels are a false positives or false negatives (false negative, false positive), the ground truth label from BABE (isBiasedBABE), if Expert and BABE labels match (Expert VS BABE), and if the game label and BABE label match (Game VS BABE). It also includes the analysis of the agreement between the three rater categories (Game, Expert, BABE).

    demographics.csv: Contains demographic information of News Ninja players, including gender, age, education, English proficiency, political orientation, news consumption, and consumed outlets.
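
    A minimal pandas sketch comparing the game's majority vote with the BABE ground truth in ExportNewsNinja.csv, using the column names listed above (label encodings should be checked before comparing):

    ```python
    import pandas as pd

    df = pd.read_csv("ExportNewsNinja.csv")

    # The first 370 rows are re-annotated BABE sentences with a ground-truth label.
    babe = df.head(370)
    agreement = (babe["majority_vote"] == babe["ground_Truth"]).mean()
    print(f"Game vs. BABE agreement on re-annotated sentences: {agreement:.2%}")
    ```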

    Collection Process
    Data was collected through interactions with the NewsNinja game. All participants went through a tutorial before annotating 2x10 BABE sentences and 2x10 new sentences. For this first test, players were recruited using Prolific. The game was hosted on a custom-built responsive website. The collection period was from 20.02.2023 to 28.02.2023. Before starting the game, players were informed about the goal and the data processing. After consenting, they could proceed to the tutorial.

    The dataset will be open source. A link with all details and contact information will be provided upon acceptance. No third parties are involved.

    The dataset will not be maintained as it captures the first test of NewsNinja at a specific point in time. However, new datasets will arise from further iterations. Those will be linked in the repository. Please cite the NewsNinja paper if you use the dataset and contact us if you're interested in more information or joining the project.

  19. BigQuery Sample Tables

    • kaggle.com
    zip
    Updated Sep 4, 2018
    Cite
    Google BigQuery (2018). BigQuery Sample Tables [Dataset]. https://www.kaggle.com/datasets/bigquery/samples
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Sep 4, 2018
    Dataset provided by
    BigQuery (https://cloud.google.com/bigquery)
    Google (http://google.com/)
    Authors
    Google BigQuery
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    BigQuery provides a limited number of sample tables that you can run queries against. These tables are suited for testing queries and learning BigQuery.

    Content

    • gsod: Contains weather information collected by NOAA, such as precipitation amounts and wind speeds from late 1929 to early 2010.

    • github_nested: Contains a timeline of actions such as pull requests and comments on GitHub repositories with a nested schema. Created in September 2012.

    • github_timeline: Contains a timeline of actions such as pull requests and comments on GitHub repositories with a flat schema. Created in May 2012.

    • natality: Describes all United States births registered in the 50 States, the District of Columbia, and New York City from 1969 to 2008.

    • shakespeare: Contains a word index of the works of Shakespeare, giving the number of times each word appears in each corpus.

    • trigrams: Contains English language trigrams from a sample of works published between 1520 and 2008.

    • wikipedia: Contains the complete revision history for all Wikipedia articles up to April 2010.

    Fork this kernel to get started.

    Acknowledgements

    Data Source: https://cloud.google.com/bigquery/sample-tables

    Banner photo by Mervyn Chan on Unsplash.

    Inspiration

    How many babies were born in New York City on Christmas Day?

    How many words are in the play Hamlet?
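
    As a sketch of how the Hamlet question above could be answered, the snippet below sums the word counts in the shakespeare sample table. It assumes the google-cloud-bigquery client is installed and authenticated, and that the sample table is reachable at the public path bigquery-public-data.samples.shakespeare with columns word_count and corpus; adjust the table path if your project references the samples differently.

    ```python
    from google.cloud import bigquery

    # Requires google-cloud-bigquery and authenticated application-default credentials.
    client = bigquery.Client()

    # Total number of word occurrences in Hamlet. The table path and column names
    # are assumed from the public samples dataset; adjust them if they differ.
    query = """
        SELECT SUM(word_count) AS total_words
        FROM `bigquery-public-data.samples.shakespeare`
        WHERE corpus = 'hamlet'
    """
    for row in client.query(query).result():
        print(f"Words in Hamlet: {row.total_words}")
    ```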

  20. Find the Ship

    • kaggle.com
    Updated Jan 21, 2025
    Cite
    Luis (2025). Find the Ship [Dataset]. https://www.kaggle.com/datasets/lireyesc/find-the-ship
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 21, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Luis
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description
    [Sample images from /set-A_train: Freighter, Cruiser-3, Cruiser-2, Fishing-2, an empty board, Fishing-1, and Cruiser-1; the full-resolution files are in the GitHub repository at https://github.com/luis-i-reyes-castro/find-the-ship]

    This is a multi-task classification dataset I made for fun in late 2017 using a cheap webcam, wood, glue, paint, yarn and scotch tape. It consists of 2035 images of a board representing a fictitious ocean area where 6 ship models operate. Every image is 640x480 pixels with three color channels (RGB). Each non-empty image sample contains exactly one scaled model of a ship with a particular location and heading. The tasks are as follows (a filename-parsing sketch for these labels appears after the data split below):

    1. Determine whether or not the image is non-empty (i.e., contains a ship).
    2. If the image is non-empty:
       A. Determine the ship's location.
       B. Determine the ship's heading.
       C. Determine the ship's model.

    The data split is as follows:

    • Directory /set-A_train contains 1635 image samples for training
    • Directory /set-B_test contains 400 image samples for testing (validation)

    Needless to say, you may choose any other data split you find useful for your purposes.
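
    The labels for all four tasks can also be recovered from the sample filenames. The sketch below parses them; note that the filename pattern (timestamp, Location-<cell>, Heading-<direction>, Ship-<model>, or Empty) is inferred from the sample files shown above, so treat it as an assumption and prefer image_labels.csv wherever the two disagree.

    ```python
    import os
    import re
    from typing import Optional

    # Pattern inferred from sample filenames such as
    #   20171105_185533_Location-2D_Heading-West_Ship-Freighter.jpg
    #   20171105_190437_Empty.jpg
    # This is an assumption; image_labels.csv remains the authoritative source.
    SHIP_RE = re.compile(
        r"^\d{8}_\d{6}_Location-(?P<location>[1-7][A-D])"
        r"_Heading-(?P<heading>East|West)_Ship-(?P<model>.+)\.jpg$"
    )
    EMPTY_RE = re.compile(r"^\d{8}_\d{6}_Empty\.jpg$")

    def parse_labels(filename: str) -> Optional[dict]:
        """Return the labels encoded in a sample's filename, or None if it does not match."""
        if EMPTY_RE.match(filename):
            return {"non_empty": False, "location": None, "heading": None, "model": None}
        match = SHIP_RE.match(filename)
        if match:
            return {"non_empty": True, **match.groupdict()}
        return None

    # Example: derive labels for every training image in /set-A_train.
    labels = {
        name: parse_labels(name)
        for name in os.listdir("set-A_train")
        if name.lower().endswith(".jpg")
    }
    ```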

    Board

    The board consists of 28 locations, with rows ranging from 1 through 7 and columns ranging from A through D. Each non-empty image sample contains exactly one ship, and the ship may be facing either West (towards the left of the board) or East (towards the right of the board). The following image sample shows an empty board with each location labeled.

    [Image: the empty board with each location labeled (https://github.com/luis-i-reyes-castro/find-the-ship/blob/main/README_Board.png?raw=true)]

    Ship Models

    Each non-empty image sample contains exactly one of six possible ship models, facing either West (towards the left of the board) or East (towards the right of the board). The following table displays sample images of each ship model.

    | Sample Image | Ship Model | Shown Facing |
    |----------------------|------------|--------------|
    | README_Cruiser-1.jpg | Cruiser-1  | West         |
    | README_Cruiser-2.jpg | Cruiser-2  | East         |
    | README_Cruiser-3.jpg | Cruiser-3  | East         |
    | README_Fishing-1.jpg | Fishing-1  | West         |
    | README_Fishing-2.jpg | Fishing-2  | East         |
    | README_Freighter.jpg | Freighter  | West         |

    All sample image files are under https://github.com/luis-i-reyes-castro/find-the-ship/blob/main/.

    Image Labels

    The image labels can be found in the image_labels.csv files inside each dataset directory. These CSV files contain tables where each row corresponds to an image sample. The columns are structured as follows.

    | Column | Values | |-------------|---------------------------------...
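
    Since the column listing above is truncated here, a safe first step is simply to read the CSV and inspect its actual columns; the sketch below makes no assumptions beyond the documented location of image_labels.csv inside each split directory.

    ```python
    import os
    import pandas as pd

    # Read the per-sample labels shipped with the training split.
    labels = pd.read_csv(os.path.join("set-A_train", "image_labels.csv"))

    # Inspect the actual column names and a few rows (the listing above is truncated).
    print(labels.columns.tolist())
    print(labels.head())
    ```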

