https://creativecommons.org/publicdomain/zero/1.0/
The WikiTableQuestions dataset poses complex questions about the contents of semi-structured Wikipedia tables. Beyond merely testing a model's knowledge retrieval capabilities, these questions require an understanding of both the natural language used and the structure of the table itself in order to provide a correct answer. This makes the dataset an excellent testing ground for AI models that aim to replicate or exceed human-level intelligence.
In order to use the WikiTableQuestions dataset, you will first need to understand its structure. The dataset consists of two types of files: questions and answers. The questions are in natural language and are designed to test a model's ability to understand the table structure, understand the natural language question, and reason about the answer. The answers are in a list format and provide additional information about each table that can be used to answer the questions.
To start working with the WikiTableQuestions dataset, download both the questions and answers files. Once you have downloaded both files, you can begin working with the dataset by loading it into a pandas DataFrame. From there, you can explore the data and develop your own models for answering the questions.
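The loading step above can be sketched with pandas. The column names used here (id, utterance, context, targetValue) follow the common WikiTableQuestions TSV layout, but verify them against the files you actually downloaded; the inline sample is a synthetic stand-in for a real questions file.

```python
import io
import pandas as pd

# Synthetic stand-in for a questions file; in practice replace this with
# something like pd.read_csv("training.tsv", sep="\t").
sample = io.StringIO(
    "id\tutterance\tcontext\ttargetValue\n"
    "nt-0\twhich country won the most medals?\tcsv/204-csv/590.csv\tNorway\n"
)
questions = pd.read_csv(sample, sep="\t")

# Each row pairs a natural-language question with the table it refers to.
print(questions.loc[0, "targetValue"])  # Norway
```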
Happy Kaggling!
The WikiTableQuestions dataset can be used to train a model to:
- answer complex questions about semi-structured Wikipedia tables
- understand the structure of semi-structured Wikipedia tables
- understand the natural language questions and reason about the answers
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. No copyright: you can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.
File: 0.csv
File: 1.csv
File: 10.csv
File: 11.csv
File: 12.csv
File: 14.csv
File: 15.csv
File: 17.csv
File: 18.csv
If you use this dataset in your research, please credit the original authors.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is an enriched version of the Code4ML dataset, a large-scale corpus of annotated Python code snippets, competition summaries, and data descriptions sourced from Kaggle. The initial release includes approximately 2.5 million snippets of machine learning code extracted from around 100,000 Jupyter notebooks. A portion of these snippets has been manually annotated by human assessors through a custom-built, user-friendly interface designed for this task.
The original dataset is organized into multiple CSV files, each containing structured data on different entities:
Table 1. code_blocks.csv structure
| Column | Description |
|---|---|
| code_blocks_index | Global index linking code blocks to markup_data.csv. |
| kernel_id | Identifier for the Kaggle Jupyter notebook from which the code block was extracted. |
| code_block_id | Position of the code block within the notebook. |
| code_block | The actual machine learning code snippet. |
Table 2. kernels_meta.csv structure
| Column | Description |
|---|---|
| kernel_id | Identifier for the Kaggle Jupyter notebook. |
| kaggle_score | Performance metric of the notebook. |
| kaggle_comments | Number of comments on the notebook. |
| kaggle_upvotes | Number of upvotes the notebook received. |
| kernel_link | URL to the notebook. |
| comp_name | Name of the associated Kaggle competition. |
Table 3. competitions_meta.csv structure
| Column | Description |
|---|---|
| comp_name | Name of the Kaggle competition. |
| description | Overview of the competition task. |
| data_type | Type of data used in the competition. |
| comp_type | Classification of the competition. |
| subtitle | Short description of the task. |
| EvaluationAlgorithmAbbreviation | Metric used for assessing competition submissions. |
| data_sources | Links to datasets used. |
| metric type | Class label for the assessment metric. |
Table 4. markup_data.csv structure
| Column | Description |
|---|---|
| code_block | Machine learning code block. |
| too_long | Flag indicating whether the block spans multiple semantic types. |
| marks | Confidence level of the annotation. |
| graph_vertex_id | ID of the semantic type. |
The dataset allows mapping between these tables. For example, code_blocks.csv can be linked to kernels_meta.csv via the kernel_id column, and kernels_meta.csv to competitions_meta.csv via the comp_name column. To maintain quality, kernels_meta.csv includes only notebooks with available Kaggle scores. In addition, data_with_preds.csv contains automatically classified code blocks, with a mapping back to code_blocks.csv via the code_blocks_index column.
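The cross-table mapping described above can be sketched with pandas merges. The tiny frames below are synthetic stand-ins for the real CSV files, with only the columns needed to show the joins.

```python
import pandas as pd

# Stand-ins for code_blocks.csv, kernels_meta.csv and competitions_meta.csv
code_blocks = pd.DataFrame({
    "code_blocks_index": [0, 1],
    "kernel_id": ["k1", "k1"],
    "code_block_id": [0, 1],
    "code_block": ["import pandas as pd", "df = pd.read_csv('train.csv')"],
})
kernels_meta = pd.DataFrame({
    "kernel_id": ["k1"], "kaggle_score": [0.97], "comp_name": ["titanic"],
})
competitions_meta = pd.DataFrame({
    "comp_name": ["titanic"], "data_type": ["tabular"],
})

# code_blocks -> kernels_meta via kernel_id, then -> competitions_meta via comp_name
merged = (code_blocks
          .merge(kernels_meta, on="kernel_id")
          .merge(competitions_meta, on="comp_name"))
print(merged[["code_block_id", "comp_name", "data_type"]])
```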
The updated Code4ML 2.0 corpus introduces kernels extracted from Meta Kaggle Code. These kernels correspond to Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the aid of an LLM.
Notebooks in kernels_meta2.csv may not have a Kaggle score but include a leaderboard ranking (rank), providing additional context for evaluation.
competitions_meta_2.csv is enriched with data_cards describing the data used in the competitions.
The Code4ML 2.0 corpus is a versatile resource, enabling training and evaluation of models in areas such as code generation, code understanding, and semantic code classification.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by MarcoMarchetti
Released under CC0: Public Domain
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset is an export of the tables from the Chinook sample database into CSV files. The Chinook database contains information about a fictional digital media store, including tables for artists, albums, media tracks, invoices, customers, and more.
The CSV file for each table contains the columns and all rows of data. The column headers match the table schema. Refer to the Chinook schema documentation for more details on each table and column.
The files are encoded as UTF-8. The delimiter is a comma. Strings are quoted. Null values are represented by empty strings.
Files
Usage
This dataset can be used to analyze the Chinook store data. For example, you could build models on customer purchases, track listening patterns, identify trends in genres or artists, etc.
The data is ideal for practicing libraries such as Pandas, NumPy, and PySpark. The database schema provides a realistic set of tables and relationships.
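As a minimal sketch of the customer-purchase analysis mentioned above, the snippet below totals invoice amounts per customer with pandas. The file and column names (customers.csv, invoices.csv, CustomerId, Total) mirror the standard Chinook schema, but check them against this export; the inline frames are synthetic stand-ins.

```python
import pandas as pd

# Stand-ins for pd.read_csv("customers.csv") and pd.read_csv("invoices.csv")
customers = pd.DataFrame({"CustomerId": [1, 2], "Country": ["Brazil", "Canada"]})
invoices = pd.DataFrame({"InvoiceId": [10, 11, 12],
                         "CustomerId": [1, 1, 2],
                         "Total": [9.98, 3.96, 5.94]})

# Total spend per customer, joined back to the customer table
spend = (invoices.groupby("CustomerId", as_index=False)["Total"].sum()
                 .merge(customers, on="CustomerId"))
print(spend)
```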
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle.
The data is organized in a table structure. Code4ML includes several main objects: competition information, raw code blocks collected from Kaggle, and manually marked-up snippets. Each table is in .csv format.
Each competition has a text description and metadata reflecting competition and dataset characteristics as well as evaluation metrics (competitions.csv). The corresponding datasets can be loaded using the Kaggle API and data sources.
The code blocks themselves and their metadata are collected into data frames according to the publishing year of the original kernels. The current version of the corpus includes two code-block files: snippets from kernels up to 2020 (code_blocks_upto_20.csv) and those from 2021 (code_blocks_21.csv), with corresponding metadata. The corpus consists of 2,743,615 ML code blocks collected from 107,524 Jupyter notebooks.
Marked-up code blocks have the following metadata: an anonymized id, the format of the data used (for example, table or audio), the id of the semantic type, a flag for code errors, the estimated relevance to the semantic class (from 1 to 5), the id of the parent notebook, and the name of the competition. The current version of the corpus has ~12,000 labeled snippets (markup_data_20220415.csv).
Since the marked-up data contains only the numeric id of each code block's semantic type, we also provide a mapping from this number to semantic type and subclass (actual_graph_2022-06-01.csv).
The dataset can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.
This dataset was created by Rohit singh
Released under Data files © Original Authors
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is an enriched version of Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle. The initial corpus consists of ≈ 2.5 million snippets of ML code collected from ≈ 100 thousand Jupyter notebooks. A representative fraction of the snippets is annotated by human assessors through a user-friendly interface specially designed for that purpose.
The data is organized as a set of tables in CSV format. It includes several central entities: raw code blocks collected from Kaggle (code_blocks.csv), kernels (kernels_meta.csv), and competitions meta information (competitions_meta.csv). Manually annotated code blocks are presented as a separate table (markup_data.csv). As this table contains the numeric id of the code block semantic type, we also provide a mapping from the id to semantic class and subclass (vertices.csv).
Snippet information (code_blocks.csv) can be mapped to kernel metadata via kernel_id. Kernel metadata is linked to Kaggle competition information through comp_name. To ensure data quality, kernels_meta.csv includes only notebooks with an available Kaggle score.
Automatic classifications of code blocks are stored in data_with_preds.csv. This table can be mapped to code_blocks.csv through the code_blocks_index column, which corresponds to code_blocks indices.
The updated Code4ML 2.0 corpus includes kernels retrieved from Meta Kaggle Code. These kernels correspond to Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the aid of an LLM.
kernels_meta2.csv may contain kernels without a Kaggle score but with a leaderboard position (rank).
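The id-to-class mapping mentioned above (markup_data to vertices.csv) can be sketched as a pandas join. The frames below are synthetic stand-ins, and the vertices.csv column names used here are assumptions for illustration.

```python
import pandas as pd

# Stand-in for the annotated snippets; graph_vertex_id is the numeric
# semantic-type id described in the text.
markup_data = pd.DataFrame({"code_block": ["model.fit(X, y)"],
                            "graph_vertex_id": [7]})
# Stand-in for vertices.csv; "semantic_class" is an assumed column name.
vertices = pd.DataFrame({"graph_vertex_id": [7],
                         "semantic_class": ["Model training"]})

# Resolve each annotated block's numeric id to its semantic class
labeled = markup_data.merge(vertices, on="graph_vertex_id")
print(labeled.loc[0, "semantic_class"])  # Model training
```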
Code4ML 2.0 dataset can be used for various purposes, including training and evaluating models for code generation, code understanding, and natural language processing tasks.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains end-of-season box-score aggregates for NBA players over the 2012–13 through 2023–24 seasons, split into training and test sets for both regular season and playoffs. Each CSV has one row per player per season with columns for points, rebounds, steals, turnovers, 3-pt attempts, FG attempts, plus identifiers.
end-of-season box-score aggregates (2012–13 – 2023–24) split into train/test;
the Jupyter notebook (Analysis.ipynb), in which all of the code can be executed;
the trained model binary (nba_model.pkl), a serialized Random Forest model artifact;
evaluation plots (LAL vs. whole league) for regular-season and playoff predictions, provided as PNG outputs;
FAIR4ML metadata (fair4ml_metadata.jsonld);
see README.md and abbreviations.txt for file details.
Notebook
Analysis.ipynb: contains the graphical output of the training and testing runs.
Train/Test CSV Data
| Name | Description | PID |
|---|---|---|
| regular_train.csv | Training data for the regular season: seasons 2012-2013 through 2021-2022 | 4421e56c-4cd3-4ec1-a566-a89d7ec0bced |
| regular_test.csv | Test data for the regular season: the 2022-2023 season | f9d84d5e-db01-4475-b7d1-80cfe9fe0e61 |
| playoff_train.csv | Training data for the playoffs: seasons 2012-2013 through 2022-2023 | bcb3cf2b-27df-48cc-8b76-9e49254783d0 |
| playoff_test.csv | Test data for the playoffs: the 2023-2024 season | de37d568-e97f-4cb9-bc05-2e600cc97102 |
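The season-based split described above can be sketched with pandas. The column names here ("season", "pts") are assumptions based on the described box-score aggregates, not the exact file schema, and the rows are synthetic stand-ins for the real CSVs.

```python
import pandas as pd

# Synthetic per-player, per-season aggregates
stats = pd.DataFrame({
    "player": ["A", "A", "B", "B"],
    "season": ["2021-2022", "2022-2023", "2021-2022", "2022-2023"],
    "pts": [1200, 1350, 900, 950],
})

# Regular-season split: train on 2012-2013 .. 2021-2022, test on 2022-2023.
# The "YYYY-YYYY" strings sort lexicographically, so <= works as a cutoff.
regular_train = stats[stats["season"] <= "2021-2022"]
regular_test = stats[stats["season"] == "2022-2023"]
print(len(regular_train), len(regular_test))  # 2 2
```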
Others
abbreviations.txt: contains the fundamental abbreviations for the columns in the CSV data
Additional Notes
Raw CSV files are taken from Kaggle (source: https://www.kaggle.com/datasets/shivamkumar121215/nba-stats-dataset-for-last-10-years/data)
Some preprocessing had to be done before uploading into DBRepo.
Plots have also been uploaded as an output for visual purposes.
A more detailed version can be found on github (Link: https://github.com/bubaltali/nba-prediction-analysis/)
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This is the fourth of the four datasets that we have created for audio-text training tasks. These collect pairs of texts and audios, based on the audio-image pairs from our datasets [1, 2, 3]. They are intended for research purposes only.
For the conversion, .csv tables were created in which audio values were split across 16,000 columns and images were transformed into texts using the public model BLIP [4]. The original images are also preserved for future reference.
To allow other researchers a quick evaluation of the potential usefulness of our datasets for their purposes, we have made available a public page where anyone can check 60 random samples that we extracted from all of our data [5].
[1] Jorge E. León. Image-audio pairs (1 of 3). 2024. URL: https://www.kaggle.com/datasets/jorvan/image-audio-pairs-1-of-3.
[2] Jorge E. León. Image-audio pairs (2 of 3). 2024. URL: https://www.kaggle.com/datasets/jorvan/image-audio-pairs-2-of-3.
[3] Jorge E. León. Image-audio pairs (3 of 3). 2024. URL: https://www.kaggle.com/datasets/jorvan/image-audio-pairs-3-of-3.
[4] Junnan Li et al. "BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation". In: ArXiv 2201.12086 (2022).
[5] Jorge E. León. AVT Multimodal Dataset. 2024. URL: https://jorvan758.github.io/AVT-Multimodal-Dataset/.
Dataset Card for "stackoverflow_python"
Dataset Summary
This dataset comes originally from Kaggle. It was originally split into three tables (CSV files): Questions, Answers, and Tags, now merged into a single table. Each row corresponds to a question-answer pair and its associated tags. The dataset contains all questions asked between August 2, 2008 and October 19, 2016.
Supported Tasks and Leaderboards
This might be useful for open-domain… See the full description on the dataset page: https://huggingface.co/datasets/koutch/stackoverflow_python.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘COVID vaccination vs. mortality ’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/sinakaraji/covid-vaccination-vs-death on 12 November 2021.
--- Dataset description provided by original source is as follows ---
The COVID-19 outbreak has brought the whole planet to its knees. More than 4.5 million people have died as of the writing of this notebook, and the only acceptable way out of the disaster is to vaccinate all parts of society. Despite the fact that the benefits of vaccination have been proven to the world many times, anti-vaccine groups are springing up all over the world. This data set was generated to investigate the impact of coronavirus vaccinations on coronavirus mortality.
| Column | Description |
|---|---|
| country | Country name |
| iso_code | ISO code for each country |
| date | Date to which this data belongs |
| total_vaccinations | Number of all COVID vaccine doses administered in that country |
| people_vaccinated | Number of people who got at least one shot of a COVID vaccine |
| people_fully_vaccinated | Number of people who got full vaccine shots |
| New_deaths | Number of daily new deaths |
| population | 2021 country population |
| ratio | % of vaccinations in that country at that date = people_vaccinated / population * 100 |
This dataset is a combination of the following three datasets:
1. https://www.kaggle.com/gpreda/covid-world-vaccination-progress
2. https://covid19.who.int/WHO-COVID-19-global-data.csv
3. https://www.kaggle.com/rsrishav/world-population
You can find more detail about this dataset by reading this notebook:
https://www.kaggle.com/sinakaraji/simple-linear-regression-covid-vaccination
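The ratio column defined above (people_vaccinated / population * 100) can be recomputed directly with pandas; the rows below are synthetic stand-ins for the dataset.

```python
import pandas as pd

# Synthetic stand-in rows with the columns described in the table
df = pd.DataFrame({
    "country": ["Albania", "Armenia"],
    "people_vaccinated": [1_000_000, 800_000],
    "population": [2_850_000, 2_960_000],
})

# Vaccination ratio as defined by the dataset: percent of population
# with at least one shot
df["ratio"] = df["people_vaccinated"] / df["population"] * 100
print(df[["country", "ratio"]].round(2))
```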
Afghanistan | Albania | Algeria | Andorra | Angola |
Anguilla | Antigua and Barbuda | Argentina | Armenia | Aruba |
Australia | Austria | Azerbaijan | Bahamas | Bahrain |
Bangladesh | Barbados | Belarus | Belgium | Belize |
Benin | Bermuda | Bhutan | Bolivia (Plurinational State of) | Brazil |
Bosnia and Herzegovina | Botswana | Brunei Darussalam | Bulgaria | Burkina Faso |
Cambodia | Cameroon | Canada | Cabo Verde | Cayman Islands |
Central African Republic | Chad | Chile | China | Colombia |
Comoros | Cook Islands | Costa Rica | Croatia | Cuba |
Curaçao | Cyprus | Denmark | Djibouti | Dominica |
Dominican Republic | Ecuador | Egypt | El Salvador | Equatorial Guinea |
Estonia | Ethiopia | Falkland Islands (Malvinas) | Fiji | Finland |
France | French Polynesia | Gabon | Gambia | Georgia |
Germany | Ghana | Gibraltar | Greece | Greenland |
Grenada | Guatemala | Guinea | Guinea-Bissau | Guyana |
Haiti | Honduras | Hungary | Iceland | India |
Indonesia | Iran (Islamic Republic of) | Iraq | Ireland | Isle of Man |
Israel | Italy | Jamaica | Japan | Jordan |
Kazakhstan | Kenya | Kiribati | Kuwait | Kyrgyzstan |
Lao People's Democratic Republic | Latvia | Lebanon | Lesotho | Liberia |
Libya | Liechtenstein | Lithuania | Luxembourg | Madagascar |
Malawi | Malaysia | Maldives | Mali | Malta |
Mauritania | Mauritius | Mexico | Republic of Moldova | Monaco |
Mongolia | Montenegro | Montserrat | Morocco | Mozambique |
Myanmar | Namibia | Nauru | Nepal | Netherlands |
New Caledonia | New Zealand | Nicaragua | Niger | Nigeria |
Niue | North Macedonia | Norway | Oman | Pakistan |
occupied Palestinian territory, including east Jerusalem |
Panama | Papua New Guinea | Paraguay | Peru | Philippines |
Poland | Portugal | Qatar | Romania | Russian Federation |
Rwanda | Saint Kitts and Nevis | Saint Lucia |
Saint Vincent and the Grenadines | Samoa | San Marino | Sao Tome and Principe | Saudi Arabia |
Senegal | Serbia | Seychelles | Sierra Leone | Singapore |
Slovakia | Slovenia | Solomon Islands | Somalia | South Africa |
Republic of Korea | South Sudan | Spain | Sri Lanka | Sudan |
Suriname | Sweden | Switzerland | Syrian Arab Republic | Tajikistan |
United Republic of Tanzania | Thailand | Togo | Tonga | Trinidad and Tobago |
Tunisia | Turkey | Turkmenistan | Turks and Caicos Islands | Tuvalu |
Uganda | Ukraine | United Arab Emirates | The United Kingdom | United States of America |
Uruguay | Uzbekistan | Vanuatu | Venezuela (Bolivarian Republic of) | Viet Nam |
Wallis and Futuna | Yemen | Zambia | Zimbabwe |
--- Original source retains full ownership of the source dataset ---
The State Contract and Procurement Registration System (SCPRS) was established in 2003 as a centralized database of information on State contracts and purchases over $5,000. eSCPRS represents the data captured in the State's eProcurement (eP) system, Bidsync, as of March 16, 2009. The data provided is an extract from that system for fiscal years 2012-2013, 2013-2014, and 2014-2015.
Data Limitations:
Some purchase orders have multiple UNSPSC numbers; however, only the first was used to identify the purchase order. Multiple UNSPSC numbers were included to provide additional data for a DGS special event; however, this affects the formatting of the file. The source system, Bidsync, is being deprecated, and these issues will be resolved as state systems transition to Fi$cal.
Data Collection Methodology:
The data collection process starts with a data file from eSCPRS that is scrubbed and standardized prior to being uploaded into a SQL Server database. There are four primary tables. The Supplier, Department and United Nations Standard Products and Services Code (UNSPSC) tables are reference tables. The Supplier and Department tables are updated and mapped to the appropriate numbering schema and naming conventions. The UNSPSC table is used to categorize line item information and requires no further manipulation. The Purchase Order table contains raw data that requires conversion to the correct data format and mapping to the corresponding data fields. A stacking method is applied to the table to eliminate blanks where needed. Extraneous characters are removed from fields. The four tables are joined together and queries are executed to update the final Purchase Order Dataset table. Once the scrubbing and standardization process is complete the data is then uploaded into the SQL Server database.
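The join step in the pipeline above can be sketched in pandas: the three reference tables (Supplier, Department, UNSPSC) are joined onto the raw Purchase Order table to produce the final dataset. The table and column names below are illustrative stand-ins, not the actual SCPRS schema.

```python
import pandas as pd

# Illustrative stand-ins for the four primary tables described above
purchase_orders = pd.DataFrame({"po_id": [1], "supplier_id": [10],
                                "dept_id": [20], "unspsc": ["43211503"]})
suppliers = pd.DataFrame({"supplier_id": [10], "supplier_name": ["Acme"]})
departments = pd.DataFrame({"dept_id": [20], "dept_name": ["General Services"]})
unspsc = pd.DataFrame({"unspsc": ["43211503"],
                       "category": ["Notebook computers"]})

# Join the reference tables onto the purchase orders, mirroring the
# described SQL-side assembly of the Purchase Order Dataset table
dataset = (purchase_orders
           .merge(suppliers, on="supplier_id")
           .merge(departments, on="dept_id")
           .merge(unspsc, on="unspsc"))
print(dataset.loc[0, "supplier_name"], dataset.loc[0, "category"])
```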
Secondary/Related Resources:
http://opendatacommons.org/licenses/dbcl/1.0/
This dataset was created by Stephen Aidoo
Released under Database: Open Database, Contents: Database Contents
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This is a structured, multi-table dataset designed to simulate a hospital management system. It is ideal for practicing data analysis, SQL, machine learning, and healthcare analytics.
Dataset Overview
This dataset includes five CSV files:
patients.csv – Patient demographics, contact details, registration info, and insurance data
doctors.csv – Doctor profiles with specializations, experience, and contact information
appointments.csv – Appointment dates, times, visit reasons, and statuses
treatments.csv – Treatment types, descriptions, dates, and associated costs
billing.csv – Billing amounts, payment methods, and status linked to treatments
📁 Files & Column Descriptions
**patients.csv**
Contains patient demographic and registration details.
| Column | Description |
|---|---|
| patient_id | Unique ID for each patient |
| first_name | Patient's first name |
| last_name | Patient's last name |
| gender | Gender (M/F) |
| date_of_birth | Date of birth |
| contact_number | Phone number |
| address | Address of the patient |
| registration_date | Date of first registration at the hospital |
| insurance_provider | Insurance company name |
| insurance_number | Policy number |
| email | Email address |
**doctors.csv**
Details about the doctors working in the hospital.
| Column | Description |
|---|---|
| doctor_id | Unique ID for each doctor |
| first_name | Doctor's first name |
| last_name | Doctor's last name |
| specialization | Medical field of expertise |
| phone_number | Contact number |
| years_experience | Total years of experience |
| hospital_branch | Branch of hospital where the doctor is based |
| email | Official email address |
**appointments.csv**
Records of scheduled and completed patient appointments.
| Column | Description |
|---|---|
| appointment_id | Unique appointment ID |
| patient_id | ID of the patient |
| doctor_id | ID of the attending doctor |
| appointment_date | Date of the appointment |
| appointment_time | Time of the appointment |
| reason_for_visit | Purpose of visit (e.g., checkup) |
| status | Status (Scheduled, Completed, Cancelled) |
**treatments.csv**
Information about the treatments given during appointments.
| Column | Description |
|---|---|
| treatment_id | Unique ID for each treatment |
| appointment_id | Associated appointment ID |
| treatment_type | Type of treatment (e.g., MRI, X-ray) |
| description | Notes or procedure details |
| cost | Cost of treatment |
| treatment_date | Date when treatment was given |
**billing.csv**
Billing and payment details for treatments.
| Column | Description |
|---|---|
| bill_id | Unique billing ID |
| patient_id | ID of the billed patient |
| treatment_id | ID of the related treatment |
| bill_date | Date of billing |
| amount | Total amount billed |
| payment_method | Mode of payment (Cash, Card, Insurance) |
| payment_status | Status of payment (Paid, Pending, Failed) |
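The relationships among the five files can be exercised with pandas merges, for example linking billing back to patients and treatments. A few synthetic rows stand in for the real CSVs.

```python
import pandas as pd

# Synthetic rows following the column descriptions above
patients = pd.DataFrame({"patient_id": ["P1"], "first_name": ["Ada"]})
treatments = pd.DataFrame({"treatment_id": ["T1"], "appointment_id": ["A1"],
                           "treatment_type": ["MRI"], "cost": [250.0]})
billing = pd.DataFrame({"bill_id": ["B1"], "patient_id": ["P1"],
                        "treatment_id": ["T1"], "amount": [250.0],
                        "payment_status": ["Paid"]})

# billing -> patients via patient_id, billing -> treatments via treatment_id
bills = (billing.merge(patients, on="patient_id")
                .merge(treatments, on="treatment_id"))
print(bills.loc[0, "first_name"], bills.loc[0, "treatment_type"])  # Ada MRI
```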
Possible Use Cases
SQL queries and relational database design
Exploratory data analysis (EDA) and dashboarding
Machine learning projects (e.g., cost prediction, no-show analysis)
Feature engineering and data cleaning practice
End-to-end healthcare analytics workflows
Recommended Tools & Resources
SQL (joins, filters, window functions)
Pandas and Matplotlib/Seaborn for EDA
Scikit-learn for ML models
Pandas Profiling for automated EDA
Plotly for interactive visualizations
Please note:
All data is synthetically generated for educational and project use. No real patient information is included.
If you find this dataset helpful, consider upvoting or sharing your insights by creating a Kaggle notebook.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Rare Pepes’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/rare-pepese on 28 January 2022.
--- Dataset description provided by original source is as follows ---
Data behind the story Can The Blockchain Turn Pepe The Frog Into Modern Art?
There are four data files, described below. You can also find further information about individual Rare Pepe assets at Rare Pepe Wallet.
ordermatches_all.csv
contains all Rare Pepe order matches from the beginning of the project, in late 2016, until Feb. 3. All order matches include a pair of assets (a "forward asset" and a "backward asset"), one of which is a Rare Pepe and the other of which is either XCP, the native Counterparty token, or Pepe Cash. The time of the order match can be determined by the block.
| Header | Description |
|---|---|
| Block | The block number |
| ForwardAsset | The type of forward asset |
| ForwardQuantity | The quantity of forward asset |
| BackwardAsset | The type of backward asset |
| BackwardQuantity | The quantity of backward asset |
blocks_timestamps.csv
is a pairing of block and timestamp. This can be used to determine the actual time an order match occurred, which can then be used to determine the dollar value of Pepe Cash or XCP at the time of the trade.
| Header | Description |
|---|---|
| Block | The block number |
| Timestamp | A Unix timestamp |
pepecash_prices.csv
contains the dollar price of Pepe Cash over time.
| Header | Description |
|---|---|
| Timestamp | A Unix timestamp |
| Price | The price of Pepe Cash in dollars |
xcp_prices.csv
contains the dollar price of XCP over time.
| Header | Description |
|---|---|
| Timestamp | A Unix timestamp |
| Price | The price of XCP in dollars |
Source: Rare Pepe Foundation
The data is available under the Creative Commons Attribution 4.0 International License and the code is available under the MIT License. If you do find it useful, please let us know.
Source: https://github.com/fivethirtyeight/data
This dataset was created by FiveThirtyEight and contains around 30,000 samples with features such as Backward Quantity, Block, Forward Quantity, Backward Asset, and more.
- Analyze Forward Asset in relation to Backward Quantity
- Study the influence of Block on Forward Quantity
If you use this dataset in your research, please credit FiveThirtyEight
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Hotel Prices - Beginner Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/sveneschlbeck/hotel-prices-beginner-dataset on 28 January 2022.
--- Dataset description provided by original source is as follows ---
This dataset addresses Data Science students and/or Beginners who want to dive into Regression or Clustering without the need to pre-clean the data first.
This dataset consists of a pre-cleaned .csv table that has been translated from German to English.
There are four columns in this dataset:
Here, "Hotel Prices" does not refer to the cost of spending a night at those hotels but the price for buying them. This would be an interesting chart for someone who wants to buy a hotel and needs to judge whether he/she is overpaying or getting a great deal depending on similar objects in other comparable cities.
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Winter Olympics Prediction - Fantasy Draft Picks’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/ericsbrown/winter-olympics-prediction-fantasy-draft-picks on 28 January 2022.
--- Dataset description provided by original source is as follows ---
Our family runs an Olympic Draft - similar to fantasy football or baseball - for each Olympic cycle. The purpose of this case study is to identify trends in medal count / point value to create a predictive analysis of which teams should be selected in which order.
There are a few assumptions that will impact the final analysis:
- Point value: each medal is worth the following: Gold - 6 points, Silver - 4 points, Bronze - 3 points.
- The analysis reviews the last 10 Olympic cycles.
- Winter Olympics only.
All GDP numbers are in USD
My initial hypothesis is that larger GDP per capita and size of contingency are correlated with better points values for the Olympic draft.
All Data pulled from the following Datasets:
Winter Olympics Medal Count - https://www.kaggle.com/ramontanoeiro/winter-olympic-medals-1924-2018
Worldwide GDP History - https://data.worldbank.org/indicator/NY.GDP.MKTP.CD?end=2020&start=1984&view=chart
GDP data was in a wide format when downloaded from the World Bank. Opened the file in Excel, removed irrelevant years, and saved as .csv.
In RStudio utilized the following code to convert wide data to long:
install.packages("tidyverse") library(tidyverse) library(tidyr)
long <- newgdpdata %>% gather(year, value, -c("Country Name","Country Code"))
Completed these same steps for GDP per capita.
There are differing types of data between these two databases, and there is no good primary key to utilize. Used CONCAT to create a new key column in both, combining the year and country code into a unique identifier that matches between the datasets.
SELECT *, CONCAT(year,country_code) AS "Primary" FROM medal_count
Saved as new table "medals_w_primary"
Utilized Excel to concatenate the primary key for GDP and GDP per capita utilizing:
=CONCAT()
Saved as new csv files.
Uploaded all to SSMS.
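The CONCAT-based key construction above has a direct pandas analogue: build the same year-plus-country-code key in both frames and merge on it. Column names and values here are illustrative stand-ins for the medal and GDP tables.

```python
import pandas as pd

# Stand-ins for the medal-count and GDP tables
medals = pd.DataFrame({"year": [2018], "country_code": ["NOR"], "Gold": [14]})
gdp = pd.DataFrame({"year": [2018], "country_code": ["NOR"], "value": [4.37e11]})

# Build the year+country-code primary key in both frames,
# mirroring CONCAT(year, country_code) on the SQL side
medals["primary"] = medals["year"].astype(str) + medals["country_code"]
gdp["year_country"] = gdp["year"].astype(str) + gdp["country_code"]

# Join the two tables on the constructed key
joined = medals.merge(gdp, left_on="primary", right_on="year_country")
print(joined.loc[0, ["Gold", "value"]].tolist())
```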
Next need to add contingent size.
No existing database had this information. Pulled data from Wikipedia.
2018 - No problem, pulled the existing table. 2014 - The table was not created. Pulled the information into Excel; needed to convert the country NAMES into country CODES.
Created an Excel document with all ISO country codes. Items were broken down between both formats, either 2 or 3 letters. Example:
AF/AFG
Used =RIGHT(C1,3) to extract only the country codes.
For the country participants list in 2014, copied source data from Wikipedia and pasted as plain text (not HTML).
Items then showed as: Albania (2)
Broke cells using "(" as the delimiter to separate country names and numbers, then find and replace to remove all parenthesis from this data.
We were left with: Albania 2
Used VLOOKUP to create correct country code: =VLOOKUP(A1,'Country Codes'!A:D,4,FALSE)
This worked for almost all items, with a few exceptions that didn't match. Based on the nature and size of the items, manually checked which items were incorrect.
Chinese Taipei 3 #N/A
Great Britain 56 #N/A
Virgin Islands 1 #N/A
This was relatively easy to fix by adding corresponding line items to the Country Codes sheet to account for future variability in the country code names.
Copied over to main sheet.
Repeated this process for additional years.
Once complete, created a sheet with all 10 cycles of data. In total there are 731 items.
Filtered by Country Code since this was an issue early on.
Found a number of N/A Country Codes:
Serbia and Montenegro
FR Yugoslavia
FR Yugoslavia
Czechoslovakia
Unified Team
Yugoslavia
Czechoslovakia
East Germany
West Germany
Soviet Union
Yugoslavia
Czechoslovakia
East Germany
West Germany
Soviet Union
Yugoslavia
Appears to be an issue with older codes, Soviet-bloc countries especially. Referred to historical data and filled in these country codes manually; codes found on iso.org.
Filled all in. One more difficult issue was the Unified Team of 1992 and the Soviet Union. For simplicity, used the code for Russia: the GDP data does not recognize the Soviet Union and instead breaks it down into its constituent countries. Using Russia gives a reasonable figure for approximation and trend analysis.
From here created a filter and scanned through the country names to ensure there were no obvious outliers. Found the following:
Olympic Athletes from Russia[b] -- This is a one-off due to the recent PED controversy for Russia. Amended the Country Code to RUS to more accurately reflect the trends.
Korea[a] and South Korea -- both were listed in 2018. This is due to the unified Korean team that competed. This is an outlier and does not warrant standing on its own as the 2022 Olympics will not have this team (as of this writing on 01/14/2022). Removed the COR country code item.
Confirmed Primary Key was created for all entries.
Ran minimum and maximum years, no unexpected values. Ran minimum and maximum Athlete numbers, no unexpected values. Confirmed length of columns for Country Code and Primary Key.
No NULL values in any columns. Ready to import to SSMS.
We now have 4 tables, joined together to create the master table:
SELECT
    [OlympicDraft].[dbo].[medals_w_primary].[year],
    host_country,
    host_city,
    [OlympicDraft].[dbo].[medals_w_primary].[country_name],
    [OlympicDraft].[dbo].[medals_w_primary].[country_code],
    Gold,
    Silver,
    Bronze,
    [OlympicDraft].[dbo].[gdp_w_primary].[value] AS GDP,
    [OlympicDraft].[dbo].[convertedgdpdatapercapita].[gdp_per_capita],
    Atheletes
FROM medals_w_primary
INNER JOIN gdp_w_primary
    ON [OlympicDraft].[dbo].[medals_w_primary].[primary] = [OlympicDraft].[dbo].[gdp_w_primary].[year_country]
INNER JOIN contingency_cleaned
    ON [OlympicDraft].[dbo].[medals_w_primary].[primary] = [OlympicDraft].[dbo].[contingency_cleaned].[Year_Country]
INNER JOIN convertedgdpdatapercapita
    ON [OlympicDraft].[dbo].[medals_w_primary].[primary] = [OlympicDraft].[dbo].[convertedgdpdatapercapita].[Year_Country]
ORDER BY year DESC
This left us with the following table:
https://i.imgur.com/tpNhiNs.png
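The four-way INNER JOIN on the shared key can be sketched in Python as dictionary lookups, which makes the "keep only keys present in both" semantics explicit. Two toy tables stand in for the four real ones:

```python
# Toy stand-ins for two of the four tables, keyed by the primary key
# (assumption: illustrative rows only; values for 1992PRK mirror the
# North Korea fix described below)
medals = {"1992PRK": {"country_name": "North Korea", "Gold": 0, "Silver": 0, "Bronze": 0}}
gdp = {"1992PRK": {"GDP": 12_458_000_000}, "1992XYZ": {"GDP": 1}}

# INNER JOIN semantics: keep only keys present in both tables,
# merging the matched rows' columns
master = {key: {**medals[key], **gdp[key]} for key in medals.keys() & gdp.keys()}
```

Keys present in only one table (like the hypothetical "1992XYZ") are dropped, which is exactly why the primary key had to match across all four sources.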
Performed some basic cleaning tasks and checked for outliers:
Checked GDP numbers: 1992 North Korea shows as null. Updated this row with information from countryeconomy.com - $12,458,000,000
Checked GDP per capita:
1992 North Korea again missing. Updated this to $595, utilized same source.
UPDATE [OlympicDraft].[dbo].[gdp_w_primary]
SET [OlympicDraft].[dbo].[gdp_w_primary].[value] = 12458000000
WHERE [OlympicDraft].[dbo].[gdp_w_primary].[year_country] = '1992PRK'
UPDATE [OlympicDraft].[dbo].[convertedgdpdatapercapita]
SET [OlympicDraft].[dbo].[convertedgdpdatapercapita].[gdp_per_capita] = 595
WHERE [OlympicDraft].[dbo].[convertedgdpdatapercapita].[year_country] = '1992PRK'
Liechtenstein showed as an outlier with a GDP per capita of 180,366 in 2018. Confirmed this number is correct per the World Bank; it appears Liechtenstein does not often field athletes in the Winter Olympics. A quick SQL check confirms they fielded 3 athletes in 2018 and won a Bronze medal. At first glance, this is a strong points-per-athlete ratio.
Finally, need to create a column that shows the total point value for each of these rows based on the above formula (6 points for Gold, 4 points for Silver, 3 points for Bronze).
Updated query as follows:
SELECT
    [OlympicDraft].[dbo].[medals_w_primary].[year],
    host_country,
    host_city,
    [OlympicDraft].[dbo].[medals_w_primary].[country_name],
    [OlympicDraft].[dbo].[medals_w_primary].[country_code],
    Gold,
    Silver,
    Bronze,
    [OlympicDraft].[dbo].[gdp_w_primary].[value] AS GDP,
    [OlympicDraft].[dbo].[convertedgdpdatapercapita].[gdp_per_capita],
    Atheletes,
    (Gold*6) + (Silver*4) + (Bronze*3) AS 'Total_Points'
FROM [OlympicDraft].[dbo].[medals_w_primary]
INNER JOIN gdp_w_primary
    ON [OlympicDraft].[dbo].[medals_w_primary].[primary] = [OlympicDraft].[dbo].[gdp_w_primary].[year_country]
INNER JOIN contingency_cleaned
    ON [OlympicDraft].[dbo].[medals_w_primary].[primary] = [OlympicDraft].[dbo].[contingency_cleaned].[Year_Country]
INNER JOIN convertedgdpdatapercapita
    ON [OlympicDraft].[dbo].[medals_w_primary].[primary] = [OlympicDraft].[dbo].[convertedgdpdatapercapita].[Year_Country]
ORDER BY [OlympicDraft].[dbo].[convertedgdpdatapercapita].[year]
Spot checked, calculating correctly.
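The scoring itself is one line of arithmetic, so the spot check can be reproduced in a few lines of Python (medal counts below are hypothetical):

```python
def total_points(gold: int, silver: int, bronze: int) -> int:
    """Scoring used in the query: 6 per Gold, 4 per Silver, 3 per Bronze."""
    return gold * 6 + silver * 4 + bronze * 3

# e.g. 2 Gold, 1 Silver, 3 Bronze -> 12 + 4 + 9 = 25
example = total_points(2, 1, 3)
```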
Saved result as winter_olympics_study.csv.
We can now see that all relevant information is in this table:
https://i.imgur.com/ceZvqCA.png
To continue our analysis, opened this CSV in RStudio.
install.packages("tidyverse")
library(tidyverse)
library(ggplot2)
install.packages("forecast")
library(forecast)
install.packages("GGally")
library(GGally)
install.packages("modelr")
library(modelr)
View(winter_olympic_study)
ggplot(data = winter_olympic_study) + geom_point(aes(x=gdp_per_capita,y=Total_Points,color=country_name)) + facet_wrap(~country_name)
cor(winter_olympic_study$gdp_per_capita, winter_olympic_study$Total_Points, method = c("pearson"))
The result is 0.347, showing a moderate correlation between these two figures.
Looked next at GDP vs. Total_Points:
ggplot(data = winter_olympic_study) + geom_point(aes(x=GDP,y=Total_Points,color=country_name)) + facet_wrap(~country_name)
cor(winter_olympic_study$GDP, winter_olympic_study$Total_Points, method = c("pearson"))
This resulted in 0.35, practically no difference from the GDP-per-capita correlation.
Next looked at contingent size vs. total points:
ggplot(data = winter_olympic_study) + geom_point(aes(x=Atheletes,y=Total_Points,color=country_name)) + facet_wrap(~country_name)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
About
Recent research shows that visualizing linguistic media bias mitigates its negative effects. However, reliable automatic detection methods to generate such visualizations require costly, knowledge-intensive training data. To facilitate data collection for media bias datasets, we present News Ninja, a game employing data-collecting game mechanics to generate a crowdsourced dataset. Before annotating sentences, players are educated on media bias via a tutorial. Our findings show that datasets gathered with crowdsourced workers trained on News Ninja can reach significantly higher inter-annotator agreements than expert and crowdsourced datasets. As News Ninja encourages continuous play, it allows datasets to adapt to the reception and contextualization of news over time, presenting a promising strategy to reduce data collection expenses, educate players, and promote long-term bias mitigation.
General
This dataset was created through player annotations in the News Ninja Game made by ANON. Its goal is to improve the detection of linguistic media bias. Support came from ANON. None of the funders played any role in the dataset creation process or publication-related decisions.
The dataset includes sentences with binary bias labels (processed: biased or not biased) as well as the individual player annotations used for the majority vote. It includes all game-collected data. All data is completely anonymous. The dataset neither identifies sub-populations nor contains data sensitive to them, and it is not possible to identify individuals.
Some sentences might be offensive or triggering as they were taken from biased or more extreme news sources. The dataset contains topics such as violence, abortion, and hate against specific races, genders, religions, or sexual orientations.
Description of the Data Files
This repository contains the datasets for the anonymous News Ninja submission. The tables contain the following data:
ExportNewsNinja.csv: Contains 370 BABE sentences and 150 new sentences with their text (sentence), words labeled as biased (words), BABE ground truth (ground_Truth), and the sentence bias label from the player annotations (majority_vote). The first 370 sentences are re-annotated BABE sentences, and the following 150 sentences are new sentences.
AnalysisNewsNinja.xlsx: Contains 370 BABE sentences and 150 new sentences. The first 370 sentences are re-annotated BABE sentences, and the following 150 sentences are new sentences. The table includes the full sentence (Sentence), the sentence bias label from player annotations (isBiased Game), the new expert label (isBiased Expert), whether the game label and expert label match (Game VS Expert), whether differing labels are false positives or false negatives (false negative, false positive), the ground truth label from BABE (isBiasedBABE), whether the Expert and BABE labels match (Expert VS BABE), and whether the game label and BABE label match (Game VS BABE). It also includes the analysis of the agreement between the three rater categories (Game, Expert, BABE).
demographics.csv: Contains demographic information of News Ninja players, including gender, age, education, English proficiency, political orientation, news consumption, and consumed outlets.
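The majority_vote column collapses individual player annotations into one label per sentence. A plausible sketch of that aggregation (the exact tie-breaking rule is an assumption, not confirmed by the description):

```python
from collections import Counter

def majority_vote(labels):
    """Collapse per-player labels into one sentence label.
    Assumption: simple plurality; ties resolve to the first-counted label."""
    return Counter(labels).most_common(1)[0][0]

votes = ["biased", "not biased", "biased", "biased"]  # hypothetical annotations
sentence_label = majority_vote(votes)
```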
Collection Process
Data was collected through interactions with the NewsNinja game. All participants went through a tutorial before annotating 2x10 BABE sentences and 2x10 new sentences. For this first test, players were recruited using Prolific. The game was hosted on a custom-built responsive website. The collection period was from 20.02.2023 to 28.02.2023. Before starting the game, players were informed about the goal and the data processing. After consenting, they could proceed to the tutorial.
The dataset will be open source. A link with all details and contact information will be provided upon acceptance. No third parties are involved.
The dataset will not be maintained as it captures the first test of NewsNinja at a specific point in time. However, new datasets will arise from further iterations. Those will be linked in the repository. Please cite the NewsNinja paper if you use the dataset and contact us if you're interested in more information or joining the project.
https://creativecommons.org/publicdomain/zero/1.0/
BigQuery provides a limited number of sample tables that you can run queries against. These tables are suited for testing queries and learning BigQuery.
gsod: Contains weather information collected by NOAA, such as precipitation amounts and wind speeds from late 1929 to early 2010.
github_nested: Contains a timeline of actions such as pull requests and comments on GitHub repositories with a nested schema. Created in September 2012.
github_timeline: Contains a timeline of actions such as pull requests and comments on GitHub repositories with a flat schema. Created in May 2012.
natality: Describes all United States births registered in the 50 States, the District of Columbia, and New York City from 1969 to 2008.
shakespeare: Contains a word index of the works of Shakespeare, giving the number of times each word appears in each corpus.
trigrams: Contains English language trigrams from a sample of works published between 1520 and 2008.
wikipedia: Contains the complete revision history for all Wikipedia articles up to April 2010.
Fork this kernel to get started.
Data Source: https://cloud.google.com/bigquery/sample-tables
Banner Photo by Mervyn Chan from Unsplash.
How many babies were born in New York City on Christmas Day?
How many words are in the play Hamlet?
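The shakespeare table is a word index: a count of how often each word appears in each corpus. The same structure can be built locally with collections.Counter (toy corpus shown; the real table covers Shakespeare's complete works):

```python
from collections import Counter

corpus = "to be or not to be"  # toy stand-in for a play's text
word_index = Counter(corpus.split())

# word_index maps each word to its number of occurrences in the corpus
```

Answering "how many words are in the play Hamlet?" then amounts to summing the counts for the corpus in question.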
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This is a multi-task classification dataset I made for fun in late 2017 using a cheap webcam, wood, glue, paint, yarn and scotch tape. It consists of 2035 images of a board representing a fictitious ocean area where 6 ship models operate. Every image is 640x480 pixels with three color channels (RGB). Each non-empty image sample contains exactly one scaled model of a ship with a particular location and heading. The tasks are:
1. Determine whether or not the image is non-empty (i.e., contains a ship).
2. If the image is non-empty:
   A. Determine the ship's location.
   B. Determine the ship's heading.
   C. Determine the ship's model.
The data split is as follows:
* Directory /set-A_train
contains 1635 image samples for training
* Directory /set-B_test
contains 400 image samples for testing (validation)
Needless to say, you may choose any other data split you find useful for your purposes.
The board consists of 28 locations, with rows ranging from 1 through 7 and columns ranging from A through D. Each non-empty image sample contains exactly one ship, and the ship may be facing either West (towards the left of the board) or East (towards the right of the board). The following image sample shows an empty board with each location labeled.
https://github.com/luis-i-reyes-castro/find-the-ship/blob/main/README_Board.png?raw=true
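The 28 locations follow directly from the 7 rows times 4 columns; a short sketch enumerating the labels (the "1A" label format is an assumption based on the description):

```python
from itertools import product

rows = range(1, 8)   # rows 1 through 7
cols = "ABCD"        # columns A through D

# e.g. "1A", "1B", ..., "7D" -- 7 x 4 = 28 locations
locations = [f"{r}{c}" for r, c in product(rows, cols)]
```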
Each non-empty image sample contains exactly one of six possible ship models, facing either West (towards the left of the board) or East (towards the right of the board). The following table displays sample images of each ship model.
Ship Model | Shown Facing
---|---
Cruiser-1 | West
Cruiser-2 | East
Cruiser-3 | East
Fishing-1 | West
Fishing-2 | East
Freighter | West
The image labels can be found in the image_labels.csv files inside each dataset directory. These CSV files contain tables where each row corresponds to an image sample. The columns are structured as follows.
| Column | Values | |-------------|---------------------------------...
File: 0.csv
File: 1.csv
File: 10.csv
File: 11.csv
File: 12.csv
File: 14.csv
File: 15.csv
File: 17.csv
File: 18.csv
If you use this dataset in your research, please credit the original authors.