https://creativecommons.org/publicdomain/zero/1.0/
The WikiTableQuestions dataset poses complex questions about the contents of semi-structured Wikipedia tables. Beyond merely testing a model's knowledge retrieval capabilities, these questions require an understanding of both the natural language used and the structure of the table itself in order to provide a correct answer. This makes the dataset an excellent testing ground for AI models that aim to replicate or exceed human-level intelligence.
In order to use the WikiTableQuestions dataset, you will first need to understand its structure. The dataset consists of two types of files: questions and answers. The questions are in natural language and are designed to test a model's ability to understand the table structure, understand the natural language question, and reason about the answer. The answers are in a list format and provide additional information about each table that can be used to answer the questions.
To start working with the WikiTableQuestions dataset, download both the questions and answers files. Once you have downloaded both files, you can begin working with the dataset by loading it into a pandas DataFrame. From there, you can explore the data and develop your own models for answering the questions.
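The loading step above can be sketched with pandas. The column names used here (id, utterance, context, targetValue) follow the common WikiTableQuestions TSV layout, but verify them against the files you actually downloaded; the inline sample is a synthetic stand-in for a real questions file.

```python
import io
import pandas as pd

# Synthetic stand-in for a questions file; in practice replace this with
# something like pd.read_csv("training.tsv", sep="\t").
sample = io.StringIO(
    "id\tutterance\tcontext\ttargetValue\n"
    "nt-0\twhich country won the most medals?\tcsv/204-csv/590.csv\tNorway\n"
)
questions = pd.read_csv(sample, sep="\t")

# Each row pairs a natural-language question with the table it refers to.
print(questions.loc[0, "targetValue"])  # Norway
```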
Happy Kaggling!
The WikiTableQuestions dataset can be used to train a model to:
- answer complex questions about semi-structured Wikipedia tables
- understand the structure of semi-structured Wikipedia tables
- understand the natural language questions and reason about the answers
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. No copyright: you can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.
File: 0.csv
File: 1.csv
File: 10.csv
File: 11.csv
File: 12.csv
File: 14.csv
File: 15.csv
File: 17.csv
File: 18.csv
If you use this dataset in your research, please credit the original authors.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is an enriched version of the Code4ML dataset, a large-scale corpus of annotated Python code snippets, competition summaries, and data descriptions sourced from Kaggle. The initial release includes approximately 2.5 million snippets of machine learning code extracted from around 100,000 Jupyter notebooks. A portion of these snippets has been manually annotated by human assessors through a custom-built, user-friendly interface designed for this task.
The original dataset is organized into multiple CSV files, each containing structured data on different entities:
Table 1. code_blocks.csv structure
| Column | Description |
|---|---|
| code_blocks_index | Global index linking code blocks to markup_data.csv. |
| kernel_id | Identifier for the Kaggle Jupyter notebook from which the code block was extracted. |
| code_block_id | Position of the code block within the notebook. |
| code_block | The actual machine learning code snippet. |
Table 2. kernels_meta.csv structure
| Column | Description |
|---|---|
| kernel_id | Identifier for the Kaggle Jupyter notebook. |
| kaggle_score | Performance metric of the notebook. |
| kaggle_comments | Number of comments on the notebook. |
| kaggle_upvotes | Number of upvotes the notebook received. |
| kernel_link | URL to the notebook. |
| comp_name | Name of the associated Kaggle competition. |
Table 3. competitions_meta.csv structure
| Column | Description |
|---|---|
| comp_name | Name of the Kaggle competition. |
| description | Overview of the competition task. |
| data_type | Type of data used in the competition. |
| comp_type | Classification of the competition. |
| subtitle | Short description of the task. |
| EvaluationAlgorithmAbbreviation | Metric used for assessing competition submissions. |
| data_sources | Links to datasets used. |
| metric type | Class label for the assessment metric. |
Table 4. markup_data.csv structure
| Column | Description |
|---|---|
| code_block | Machine learning code block. |
| too_long | Flag indicating whether the block spans multiple semantic types. |
| marks | Confidence level of the annotation. |
| graph_vertex_id | ID of the semantic type. |
The dataset allows mapping between these tables. For example, code_blocks.csv can be linked to kernels_meta.csv via the kernel_id column, and kernels_meta.csv to competitions_meta.csv via the comp_name column. To maintain quality, kernels_meta.csv includes only notebooks with available Kaggle scores. In addition, data_with_preds.csv contains automatically classified code blocks, with a mapping back to code_blocks.csv via the code_blocks_index column.
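The cross-table mapping described above can be sketched with pandas merges. The tiny frames below are synthetic stand-ins for the real CSV files, with only the columns needed to show the joins.

```python
import pandas as pd

# Stand-ins for code_blocks.csv, kernels_meta.csv and competitions_meta.csv
code_blocks = pd.DataFrame({
    "code_blocks_index": [0, 1],
    "kernel_id": ["k1", "k1"],
    "code_block_id": [0, 1],
    "code_block": ["import pandas as pd", "df = pd.read_csv('train.csv')"],
})
kernels_meta = pd.DataFrame({
    "kernel_id": ["k1"], "kaggle_score": [0.97], "comp_name": ["titanic"],
})
competitions_meta = pd.DataFrame({
    "comp_name": ["titanic"], "data_type": ["tabular"],
})

# code_blocks -> kernels_meta via kernel_id, then -> competitions_meta via comp_name
merged = (code_blocks
          .merge(kernels_meta, on="kernel_id")
          .merge(competitions_meta, on="comp_name"))
print(merged[["code_block_id", "comp_name", "data_type"]])
```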
The updated Code4ML 2.0 corpus introduces kernels extracted from Meta Kaggle Code. These kernels correspond to Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the aid of an LLM.
Notebooks in kernels_meta2.csv may not have a Kaggle score but include a leaderboard ranking (rank), providing additional context for evaluation.
competitions_meta_2.csv is enriched with data_cards describing the data used in the competitions.
The Code4ML 2.0 corpus is a versatile resource, enabling training and evaluation of models in areas such as code generation, code understanding, and semantic code classification.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by MarcoMarchetti
Released under CC0: Public Domain
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset is an export of the tables from the Chinook sample database into CSV files. The Chinook database contains information about a fictional digital media store, including tables for artists, albums, media tracks, invoices, customers, and more.
The CSV file for each table contains the columns and all rows of data. The column headers match the table schema. Refer to the Chinook schema documentation for more details on each table and column.
The files are encoded as UTF-8. The delimiter is a comma. Strings are quoted. Null values are represented by empty strings.
Files
Usage
This dataset can be used to analyze the Chinook store data. For example, you could build models on customer purchases, track listening patterns, identify trends in genres or artists, etc.
The data is ideal for practicing libraries such as Pandas, NumPy, and PySpark. The database schema provides a realistic set of tables and relationships.
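As a minimal sketch of the customer-purchase analysis mentioned above, the snippet below totals invoice amounts per customer with pandas. The file and column names (customers.csv, invoices.csv, CustomerId, Total) mirror the standard Chinook schema, but check them against this export; the inline frames are synthetic stand-ins.

```python
import pandas as pd

# Stand-ins for pd.read_csv("customers.csv") and pd.read_csv("invoices.csv")
customers = pd.DataFrame({"CustomerId": [1, 2], "Country": ["Brazil", "Canada"]})
invoices = pd.DataFrame({"InvoiceId": [10, 11, 12],
                         "CustomerId": [1, 1, 2],
                         "Total": [9.98, 3.96, 5.94]})

# Total spend per customer, joined back to the customer table
spend = (invoices.groupby("CustomerId", as_index=False)["Total"].sum()
                 .merge(customers, on="CustomerId"))
print(spend)
```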
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle.
The data is organized in a table structure. Code4ML includes several main objects: competition information, raw code blocks collected from Kaggle, and manually marked-up snippets. Each table is in .csv format.
Each competition has a text description and metadata reflecting competition and dataset characteristics as well as evaluation metrics (competitions.csv). The corresponding datasets can be loaded using the Kaggle API and data sources.
The code blocks themselves and their metadata are collected into data frames according to the publishing year of the original kernels. The current version of the corpus includes two code-block files: snippets from kernels up to 2020 (code_blocks_upto_20.csv) and those from 2021 (code_blocks_21.csv), with corresponding metadata. The corpus consists of 2,743,615 ML code blocks collected from 107,524 Jupyter notebooks.
Marked-up code blocks have the following metadata: an anonymized id, the format of the data used (for example, table or audio), the id of the semantic type, a flag for code errors, the estimated relevance to the semantic class (from 1 to 5), the id of the parent notebook, and the name of the competition. The current version of the corpus has ~12,000 labeled snippets (markup_data_20220415.csv).
Since the marked-up data contains only the numeric id of each code block's semantic type, we also provide a mapping from this number to semantic type and subclass (actual_graph_2022-06-01.csv).
The dataset can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.
This dataset was created by Rohit singh
Released under Data files © Original Authors
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is an enriched version of Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle. The initial corpus consists of ≈ 2.5 million snippets of ML code collected from ≈ 100 thousand Jupyter notebooks. A representative fraction of the snippets is annotated by human assessors through a user-friendly interface specially designed for that purpose.
The data is organized as a set of tables in CSV format. It includes several central entities: raw code blocks collected from Kaggle (code_blocks.csv), kernels (kernels_meta.csv), and competitions meta information (competitions_meta.csv). Manually annotated code blocks are presented as a separate table (markup_data.csv). As this table contains the numeric id of the code block semantic type, we also provide a mapping from the id to semantic class and subclass (vertices.csv).
Snippet information (code_blocks.csv) can be mapped to kernel metadata via kernel_id. Kernel metadata is linked to Kaggle competition information through comp_name. To ensure data quality, kernels_meta.csv includes only notebooks with an available Kaggle score.
Automatic classifications of code blocks are stored in data_with_preds.csv. This table can be mapped to code_blocks.csv through the code_blocks_index column, which corresponds to code_blocks indices.
The updated Code4ML 2.0 corpus includes kernels retrieved from Meta Kaggle Code. These kernels correspond to Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the aid of an LLM.
kernels_meta2.csv may contain kernels without a Kaggle score but with a leaderboard position (rank).
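The id-to-class mapping mentioned above (markup_data to vertices.csv) can be sketched as a pandas join. The frames below are synthetic stand-ins, and the vertices.csv column names used here are assumptions for illustration.

```python
import pandas as pd

# Stand-in for the annotated snippets; graph_vertex_id is the numeric
# semantic-type id described in the text.
markup_data = pd.DataFrame({"code_block": ["model.fit(X, y)"],
                            "graph_vertex_id": [7]})
# Stand-in for vertices.csv; "semantic_class" is an assumed column name.
vertices = pd.DataFrame({"graph_vertex_id": [7],
                         "semantic_class": ["Model training"]})

# Resolve each annotated block's numeric id to its semantic class
labeled = markup_data.merge(vertices, on="graph_vertex_id")
print(labeled.loc[0, "semantic_class"])  # Model training
```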
Code4ML 2.0 dataset can be used for various purposes, including training and evaluating models for code generation, code understanding, and natural language processing tasks.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains end-of-season box-score aggregates for NBA players over the 2012–13 through 2023–24 seasons, split into training and test sets for both regular season and playoffs. Each CSV has one row per player per season with columns for points, rebounds, steals, turnovers, 3-pt attempts, FG attempts, plus identifiers.
end-of-season box-score aggregates (2012–13 – 2023–24) split into train/test;
the Jupyter notebook (Analysis.ipynb), in which all of the code can be executed;
the trained model binary (nba_model.pkl), a serialized Random Forest model artifact;
evaluation plots (LAL vs. whole league) for regular-season and playoff predictions, provided as PNG outputs;
FAIR4ML metadata (fair4ml_metadata.jsonld);
see README.md and abbreviations.txt for file details.
Notebook
Analysis.ipynb: contains the graphical output of the training and testing runs.
Train/Test CSV Data
| Name | Description | PID |
|---|---|---|
| regular_train.csv | Training data for the regular season: seasons 2012-2013 through 2021-2022 | 4421e56c-4cd3-4ec1-a566-a89d7ec0bced |
| regular_test.csv | Test data for the regular season: the 2022-2023 season | f9d84d5e-db01-4475-b7d1-80cfe9fe0e61 |
| playoff_train.csv | Training data for the playoffs: seasons 2012-2013 through 2022-2023 | bcb3cf2b-27df-48cc-8b76-9e49254783d0 |
| playoff_test.csv | Test data for the playoffs: the 2023-2024 season | de37d568-e97f-4cb9-bc05-2e600cc97102 |
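The season-based split described above can be sketched with pandas. The column names here ("season", "pts") are assumptions based on the described box-score aggregates, not the exact file schema, and the rows are synthetic stand-ins for the real CSVs.

```python
import pandas as pd

# Synthetic per-player, per-season aggregates
stats = pd.DataFrame({
    "player": ["A", "A", "B", "B"],
    "season": ["2021-2022", "2022-2023", "2021-2022", "2022-2023"],
    "pts": [1200, 1350, 900, 950],
})

# Regular-season split: train on 2012-2013 .. 2021-2022, test on 2022-2023.
# The "YYYY-YYYY" strings sort lexicographically, so <= works as a cutoff.
regular_train = stats[stats["season"] <= "2021-2022"]
regular_test = stats[stats["season"] == "2022-2023"]
print(len(regular_train), len(regular_test))  # 2 2
```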
Others
abbreviations.txt: contains the fundamental abbreviations for the columns in the CSV data
Additional Notes
Raw CSV files are taken from Kaggle (source: https://www.kaggle.com/datasets/shivamkumar121215/nba-stats-dataset-for-last-10-years/data)
Some preprocessing had to be done before uploading into DBRepo.
Plots have also been uploaded as an output for visual purposes.
A more detailed version can be found on github (Link: https://github.com/bubaltali/nba-prediction-analysis/)
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This is the fourth of the four datasets that we have created for audio-text training tasks. These collect pairs of texts and audios, based on the audio-image pairs from our datasets [1, 2, 3]. They are intended for research purposes only.
For the conversion, .csv tables were created in which audio values were split across 16,000 columns and images were transformed into texts using the public model BLIP [4]. The original images are also preserved for future reference.
To allow other researchers a quick evaluation of the potential usefulness of our datasets for their purposes, we have made available a public page where anyone can check 60 random samples that we extracted from all of our data [5].
[1] Jorge E. León. Image-audio pairs (1 of 3). 2024. URL: https://www.kaggle.com/datasets/jorvan/image-audio-pairs-1-of-3.
[2] Jorge E. León. Image-audio pairs (2 of 3). 2024. URL: https://www.kaggle.com/datasets/jorvan/image-audio-pairs-2-of-3.
[3] Jorge E. León. Image-audio pairs (3 of 3). 2024. URL: https://www.kaggle.com/datasets/jorvan/image-audio-pairs-3-of-3.
[4] Junnan Li et al. "BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation". In: ArXiv 2201.12086 (2022).
[5] Jorge E. León. AVT Multimodal Dataset. 2024. URL: https://jorvan758.github.io/AVT-Multimodal-Dataset/.
Dataset Card for "stackoverflow_python"
Dataset Summary
This dataset comes originally from Kaggle. It was originally split into three tables (CSV files): Questions, Answers, and Tags, now merged into a single table. Each row corresponds to a question-answer pair and its associated tags. The dataset contains all questions asked between August 2, 2008 and October 19, 2016.
Supported Tasks and Leaderboards
This might be useful for open-domain… See the full description on the dataset page: https://huggingface.co/datasets/koutch/stackoverflow_python.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘COVID vaccination vs. mortality ’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/sinakaraji/covid-vaccination-vs-death on 12 November 2021.
--- Dataset description provided by original source is as follows ---
The COVID-19 outbreak has brought the whole planet to its knees. More than 4.5 million people have died as of the writing of this notebook, and the only acceptable way out of the disaster is to vaccinate all parts of society. Despite the fact that the benefits of vaccination have been proven to the world many times, anti-vaccine groups are springing up all over the world. This data set was generated to investigate the impact of coronavirus vaccinations on coronavirus mortality.
| Column | Description |
|---|---|
| country | Country name |
| iso_code | ISO code for each country |
| date | Date to which this data belongs |
| total_vaccinations | Number of all COVID vaccine doses administered in that country |
| people_vaccinated | Number of people who got at least one shot of a COVID vaccine |
| people_fully_vaccinated | Number of people who got full vaccine shots |
| New_deaths | Number of daily new deaths |
| population | 2021 country population |
| ratio | % of vaccinations in that country at that date = people_vaccinated / population * 100 |
This dataset is a combination of the following three datasets:
1. https://www.kaggle.com/gpreda/covid-world-vaccination-progress
2. https://covid19.who.int/WHO-COVID-19-global-data.csv
3. https://www.kaggle.com/rsrishav/world-population
You can find more detail about this dataset by reading this notebook:
https://www.kaggle.com/sinakaraji/simple-linear-regression-covid-vaccination
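The ratio column defined above (people_vaccinated / population * 100) can be recomputed directly with pandas; the rows below are synthetic stand-ins for the dataset.

```python
import pandas as pd

# Synthetic stand-in rows with the columns described in the table
df = pd.DataFrame({
    "country": ["Albania", "Armenia"],
    "people_vaccinated": [1_000_000, 800_000],
    "population": [2_850_000, 2_960_000],
})

# Vaccination ratio as defined by the dataset: percent of population
# with at least one shot
df["ratio"] = df["people_vaccinated"] / df["population"] * 100
print(df[["country", "ratio"]].round(2))
```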
Afghanistan | Albania | Algeria | Andorra | Angola |
Anguilla | Antigua and Barbuda | Argentina | Armenia | Aruba |
Australia | Austria | Azerbaijan | Bahamas | Bahrain |
Bangladesh | Barbados | Belarus | Belgium | Belize |
Benin | Bermuda | Bhutan | Bolivia (Plurinational State of) | Brazil |
Bosnia and Herzegovina | Botswana | Brunei Darussalam | Bulgaria | Burkina Faso |
Cambodia | Cameroon | Canada | Cabo Verde | Cayman Islands |
Central African Republic | Chad | Chile | China | Colombia |
Comoros | Cook Islands | Costa Rica | Croatia | Cuba |
Curaçao | Cyprus | Denmark | Djibouti | Dominica |
Dominican Republic | Ecuador | Egypt | El Salvador | Equatorial Guinea |
Estonia | Ethiopia | Falkland Islands (Malvinas) | Fiji | Finland |
France | French Polynesia | Gabon | Gambia | Georgia |
Germany | Ghana | Gibraltar | Greece | Greenland |
Grenada | Guatemala | Guinea | Guinea-Bissau | Guyana |
Haiti | Honduras | Hungary | Iceland | India |
Indonesia | Iran (Islamic Republic of) | Iraq | Ireland | Isle of Man |
Israel | Italy | Jamaica | Japan | Jordan |
Kazakhstan | Kenya | Kiribati | Kuwait | Kyrgyzstan |
Lao People's Democratic Republic | Latvia | Lebanon | Lesotho | Liberia |
Libya | Liechtenstein | Lithuania | Luxembourg | Madagascar |
Malawi | Malaysia | Maldives | Mali | Malta |
Mauritania | Mauritius | Mexico | Republic of Moldova | Monaco |
Mongolia | Montenegro | Montserrat | Morocco | Mozambique |
Myanmar | Namibia | Nauru | Nepal | Netherlands |
New Caledonia | New Zealand | Nicaragua | Niger | Nigeria |
Niue | North Macedonia | Norway | Oman | Pakistan |
occupied Palestinian territory, including east Jerusalem |
Panama | Papua New Guinea | Paraguay | Peru | Philippines |
Poland | Portugal | Qatar | Romania | Russian Federation |
Rwanda | Saint Kitts and Nevis | Saint Lucia |
Saint Vincent and the Grenadines | Samoa | San Marino | Sao Tome and Principe | Saudi Arabia |
Senegal | Serbia | Seychelles | Sierra Leone | Singapore |
Slovakia | Slovenia | Solomon Islands | Somalia | South Africa |
Republic of Korea | South Sudan | Spain | Sri Lanka | Sudan |
Suriname | Sweden | Switzerland | Syrian Arab Republic | Tajikistan |
United Republic of Tanzania | Thailand | Togo | Tonga | Trinidad and Tobago |
Tunisia | Turkey | Turkmenistan | Turks and Caicos Islands | Tuvalu |
Uganda | Ukraine | United Arab Emirates | The United Kingdom | United States of America |
Uruguay | Uzbekistan | Vanuatu | Venezuela (Bolivarian Republic of) | Viet Nam |
Wallis and Futuna | Yemen | Zambia | Zimbabwe |
--- Original source retains full ownership of the source dataset ---
The State Contract and Procurement Registration System (SCPRS) was established in 2003 as a centralized database of information on State contracts and purchases over $5,000. eSCPRS represents the data captured in the State's eProcurement (eP) system, Bidsync, as of March 16, 2009. The data provided is an extract from that system for fiscal years 2012-2013, 2013-2014, and 2014-2015.
Data Limitations:
Some purchase orders have multiple UNSPSC numbers; however, only the first was used to identify the purchase order. Multiple UNSPSC numbers were included to provide additional data for a DGS special event; however, this affects the formatting of the file. The source system, Bidsync, is being deprecated, and these issues will be resolved as state systems transition to Fi$cal.
Data Collection Methodology:
The data collection process starts with a data file from eSCPRS that is scrubbed and standardized prior to being uploaded into a SQL Server database. There are four primary tables. The Supplier, Department and United Nations Standard Products and Services Code (UNSPSC) tables are reference tables. The Supplier and Department tables are updated and mapped to the appropriate numbering schema and naming conventions. The UNSPSC table is used to categorize line item information and requires no further manipulation. The Purchase Order table contains raw data that requires conversion to the correct data format and mapping to the corresponding data fields. A stacking method is applied to the table to eliminate blanks where needed. Extraneous characters are removed from fields. The four tables are joined together and queries are executed to update the final Purchase Order Dataset table. Once the scrubbing and standardization process is complete the data is then uploaded into the SQL Server database.
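The join step in the pipeline above can be sketched in pandas: the three reference tables (Supplier, Department, UNSPSC) are joined onto the raw Purchase Order table to produce the final dataset. The table and column names below are illustrative stand-ins, not the actual SCPRS schema.

```python
import pandas as pd

# Illustrative stand-ins for the four primary tables described above
purchase_orders = pd.DataFrame({"po_id": [1], "supplier_id": [10],
                                "dept_id": [20], "unspsc": ["43211503"]})
suppliers = pd.DataFrame({"supplier_id": [10], "supplier_name": ["Acme"]})
departments = pd.DataFrame({"dept_id": [20], "dept_name": ["General Services"]})
unspsc = pd.DataFrame({"unspsc": ["43211503"],
                       "category": ["Notebook computers"]})

# Join the reference tables onto the purchase orders, mirroring the
# described SQL-side assembly of the Purchase Order Dataset table
dataset = (purchase_orders
           .merge(suppliers, on="supplier_id")
           .merge(departments, on="dept_id")
           .merge(unspsc, on="unspsc"))
print(dataset.loc[0, "supplier_name"], dataset.loc[0, "category"])
```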
Secondary/Related Resources:
http://opendatacommons.org/licenses/dbcl/1.0/
This dataset was created by Stephen Aidoo
Released under Database: Open Database, Contents: Database Contents
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This is a structured, multi-table dataset designed to simulate a hospital management system. It is ideal for practicing data analysis, SQL, machine learning, and healthcare analytics.
Dataset Overview
This dataset includes five CSV files:
patients.csv – Patient demographics, contact details, registration info, and insurance data
doctors.csv – Doctor profiles with specializations, experience, and contact information
appointments.csv – Appointment dates, times, visit reasons, and statuses
treatments.csv – Treatment types, descriptions, dates, and associated costs
billing.csv – Billing amounts, payment methods, and status linked to treatments
📁 Files & Column Descriptions
**patients.csv**
Contains patient demographic and registration details.
| Column | Description |
|---|---|
| patient_id | Unique ID for each patient |
| first_name | Patient's first name |
| last_name | Patient's last name |
| gender | Gender (M/F) |
| date_of_birth | Date of birth |
| contact_number | Phone number |
| address | Address of the patient |
| registration_date | Date of first registration at the hospital |
| insurance_provider | Insurance company name |
| insurance_number | Policy number |
| email | Email address |
**doctors.csv**
Details about the doctors working in the hospital.
| Column | Description |
|---|---|
| doctor_id | Unique ID for each doctor |
| first_name | Doctor's first name |
| last_name | Doctor's last name |
| specialization | Medical field of expertise |
| phone_number | Contact number |
| years_experience | Total years of experience |
| hospital_branch | Branch of hospital where the doctor is based |
| email | Official email address |
**appointments.csv**
Records of scheduled and completed patient appointments.
| Column | Description |
|---|---|
| appointment_id | Unique appointment ID |
| patient_id | ID of the patient |
| doctor_id | ID of the attending doctor |
| appointment_date | Date of the appointment |
| appointment_time | Time of the appointment |
| reason_for_visit | Purpose of visit (e.g., checkup) |
| status | Status (Scheduled, Completed, Cancelled) |
**treatments.csv**
Information about the treatments given during appointments.
| Column | Description |
|---|---|
| treatment_id | Unique ID for each treatment |
| appointment_id | Associated appointment ID |
| treatment_type | Type of treatment (e.g., MRI, X-ray) |
| description | Notes or procedure details |
| cost | Cost of treatment |
| treatment_date | Date when treatment was given |
**billing.csv**
Billing and payment details for treatments.
| Column | Description |
|---|---|
| bill_id | Unique billing ID |
| patient_id | ID of the billed patient |
| treatment_id | ID of the related treatment |
| bill_date | Date of billing |
| amount | Total amount billed |
| payment_method | Mode of payment (Cash, Card, Insurance) |
| payment_status | Status of payment (Paid, Pending, Failed) |
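The relationships among the five files can be exercised with pandas merges, for example linking billing back to patients and treatments. A few synthetic rows stand in for the real CSVs.

```python
import pandas as pd

# Synthetic rows following the column descriptions above
patients = pd.DataFrame({"patient_id": ["P1"], "first_name": ["Ada"]})
treatments = pd.DataFrame({"treatment_id": ["T1"], "appointment_id": ["A1"],
                           "treatment_type": ["MRI"], "cost": [250.0]})
billing = pd.DataFrame({"bill_id": ["B1"], "patient_id": ["P1"],
                        "treatment_id": ["T1"], "amount": [250.0],
                        "payment_status": ["Paid"]})

# billing -> patients via patient_id, billing -> treatments via treatment_id
bills = (billing.merge(patients, on="patient_id")
                .merge(treatments, on="treatment_id"))
print(bills.loc[0, "first_name"], bills.loc[0, "treatment_type"])  # Ada MRI
```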
Possible Use Cases
SQL queries and relational database design
Exploratory data analysis (EDA) and dashboarding
Machine learning projects (e.g., cost prediction, no-show analysis)
Feature engineering and data cleaning practice
End-to-end healthcare analytics workflows
Recommended Tools & Resources
SQL (joins, filters, window functions)
Pandas and Matplotlib/Seaborn for EDA
Scikit-learn for ML models
Pandas Profiling for automated EDA
Plotly for interactive visualizations
Please note:
All data is synthetically generated for educational and project use. No real patient information is included.
If you find this dataset helpful, consider upvoting or sharing your insights by creating a Kaggle notebook.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Rare Pepes’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/rare-pepese on 28 January 2022.
--- Dataset description provided by original source is as follows ---
Data behind the story Can The Blockchain Turn Pepe The Frog Into Modern Art?
There are four data files, described below. You can also find further information about individual Rare Pepe assets at Rare Pepe Wallet.
ordermatches_all.csv
contains all Rare Pepe order matches from the beginning of the project, in late 2016, until Feb. 3. All order matches include a pair of assets (a "forward asset" and a "backward asset"), one of which is a Rare Pepe and the other of which is either XCP, the native Counterparty token, or Pepe Cash. The time of the order match can be determined by the block.
| Header | Description |
|---|---|
| Block | The block number |
| ForwardAsset | The type of forward asset |
| ForwardQuantity | The quantity of forward asset |
| BackwardAsset | The type of backward asset |
| BackwardQuantity | The quantity of backward asset |
blocks_timestamps.csv
is a pairing of block and timestamp. This can be used to determine the actual time an order match occurred, which can then be used to determine the dollar value of Pepe Cash or XCP at the time of the trade.
| Header | Description |
|---|---|
| Block | The block number |
| Timestamp | A Unix timestamp |
pepecash_prices.csv
contains the dollar price of Pepe Cash over time.
| Header | Description |
|---|---|
| Timestamp | A Unix timestamp |
| Price | The price of Pepe Cash in dollars |
xcp_prices.csv
contains the dollar price of XCP over time.
| Header | Description |
|---|---|
| Timestamp | A Unix timestamp |
| Price | The price of XCP in dollars |
Source: Rare Pepe Foundation
The data is available under the Creative Commons Attribution 4.0 International License and the code is available under the MIT License. If you do find it useful, please let us know.
Source: https://github.com/fivethirtyeight/data
This dataset was created by FiveThirtyEight and contains around 30,000 samples with features such as Backward Quantity, Block, Forward Quantity, Backward Asset, and more.
- Analyze Forward Asset in relation to Backward Quantity
- Study the influence of Block on Forward Quantity
If you use this dataset in your research, please credit FiveThirtyEight
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Hotel Prices - Beginner Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/sveneschlbeck/hotel-prices-beginner-dataset on 28 January 2022.
--- Dataset description provided by original source is as follows ---
This dataset addresses Data Science students and/or Beginners who want to dive into Regression or Clustering without the need to pre-clean the data first.
This dataset consists of a pre-cleaned .csv table that has been translated from German to English.
There are four columns in this dataset:
Here, "Hotel Prices" does not refer to the cost of spending a night at those hotels but the price for buying them. This would be an interesting chart for someone who wants to buy a hotel and needs to judge whether he/she is overpaying or getting a great deal depending on similar objects in other comparable cities.
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Winter Olympics Prediction - Fantasy Draft Picks’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/ericsbrown/winter-olympics-prediction-fantasy-draft-picks on 28 January 2022.
--- Dataset description provided by original source is as follows ---
Our family runs an Olympic Draft - similar to fantasy football or baseball - for each Olympic cycle. The purpose of this case study is to identify trends in medal count / point value to create a predictive analysis of which teams should be selected in which order.
There are a few assumptions that will impact the final analysis:
- Point value: each medal is worth the following: Gold - 6 points, Silver - 4 points, Bronze - 3 points.
- The analysis reviews the last 10 Olympic cycles.
- Winter Olympics only.
All GDP numbers are in USD
My initial hypothesis is that larger GDP per capita and size of contingency are correlated with better points values for the Olympic draft.
All Data pulled from the following Datasets:
Winter Olympics Medal Count - https://www.kaggle.com/ramontanoeiro/winter-olympic-medals-1924-2018
Worldwide GDP History - https://data.worldbank.org/indicator/NY.GDP.MKTP.CD?end=2020&start=1984&view=chart
GDP data was in a wide format when downloaded from the World Bank. Opened the file in Excel, removed irrelevant years, and saved as .csv.
In RStudio utilized the following code to convert wide data to long:
install.packages("tidyverse") library(tidyverse) library(tidyr)
long <- newgdpdata %>% gather(year, value, -c("Country Name","Country Code"))
Completed these same steps for GDP per capita.
There are differing types of data between these two databases, and there is no good primary key to utilize. Used CONCAT to create a new key column in both, combining the year and country code into a unique identifier that matches between the datasets.
SELECT *, CONCAT(year,country_code) AS "Primary" FROM medal_count
Saved as new table "medals_w_primary"
Utilized Excel to concatenate the primary key for GDP and GDP per capita utilizing:
=CONCAT()
Saved as new csv files.
Uploaded all to SSMS.
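The CONCAT-based key construction above has a direct pandas analogue: build the same year-plus-country-code key in both frames and merge on it. Column names and values here are illustrative stand-ins for the medal and GDP tables.

```python
import pandas as pd

# Stand-ins for the medal-count and GDP tables
medals = pd.DataFrame({"year": [2018], "country_code": ["NOR"], "Gold": [14]})
gdp = pd.DataFrame({"year": [2018], "country_code": ["NOR"], "value": [4.37e11]})

# Build the year+country-code primary key in both frames,
# mirroring CONCAT(year, country_code) on the SQL side
medals["primary"] = medals["year"].astype(str) + medals["country_code"]
gdp["year_country"] = gdp["year"].astype(str) + gdp["country_code"]

# Join the two tables on the constructed key
joined = medals.merge(gdp, left_on="primary", right_on="year_country")
print(joined.loc[0, ["Gold", "value"]].tolist())
```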
Next need to add contingent size.
No existing database had this information. Pulled data from Wikipedia.
2018 - No problem, pulled the existing table. 2014 - The table was not created. Pulled the information into Excel; needed to convert the country NAMES into country CODES.
Created an Excel document with all ISO country codes. Items were broken down between both formats, either 2 or 3 letters. Example:
AF/AFG
Used =RIGHT(C1,3) to extract only the country codes.
For the country participants list in 2014, copied source data from Wikipedia and pasted as plain text (not HTML).
Items then showed as: Albania (2)
Broke cells using "(" as the delimiter to separate country names and numbers, then find and replace to remove all parenthesis from this data.
We were left with: Albania 2
Used VLOOKUP to create correct country code: =VLOOKUP(A1,'Country Codes'!A:D,4,FALSE)
This worked for almost all items, with a few exceptions that didn't match. Based on the nature and size of the items, manually checked which items were incorrect.
Chinese Taipei 3 #N/A
Great Britain 56 #N/A
Virgin Islands 1 #N/A
This was relatively easy to fix by adding corresponding line items to the Country Codes sheet to account for future variability in the country code names.
Copied over to main sheet.
Repeated this process for additional years.
Once complete, created a sheet with all 10 cycles of data. In total there are 731 items.
Filtered by Country Code since this was an issue early on.
Found a number of N/A Country Codes:
Serbia and Montenegro
FR Yugoslavia
FR Yugoslavia
Czechoslovakia
Unified Team
Yugoslavia
Czechoslovakia
East Germany
West Germany
Soviet Union
Yugoslavia
Czechoslovakia
East Germany
West Germany
Soviet Union
Yugoslavia
Appears to be an issue with older codes, Soviet-bloc countries especially. Referred to historical data and filled in these country codes manually; codes found on iso.org.
Filled all in. One more difficult issue was the Unified Team of 1992 and the Soviet Union. For simplicity, used the code for Russia: the GDP data does not recognize the Soviet Union and instead breaks it down into its constituent countries. Using Russia gives a reasonable figure for approximation and trend analysis.
From here created a filter and scanned through the country names to ensure there were no obvious outliers. Found the following:
Olympic Athletes from Russia[b] -- This is a one-off due to the recent PED controversy for Russia. Amended the Country Code to RUS to more accurately reflect the trends.
Korea[a] and South Korea -- both were listed in 2018. This is due to the unified Korean team that competed. This is an outlier and does not warrant standing on its own as the 2022 Olympics will not have this team (as of this writing on 01/14/2022). Removed the COR country code item.
Confirmed Primary Key was created for all entries.
Ran minimum and maximum years, no unexpected values. Ran minimum and maximum Athlete numbers, no unexpected values. Confirmed length of columns for Country Code and Primary Key.
No NULL values in any columns. Ready to import to SSMS.
We now have 4 tables, joined together to create the master table:
SELECT
    [OlympicDraft].[dbo].[medals_w_primary].[year],
    host_country,
    host_city,
    [OlympicDraft].[dbo].[medals_w_primary].[country_name],
    [OlympicDraft].[dbo].[medals_w_primary].[country_code],
    Gold,
    Silver,
    Bronze,
    [OlympicDraft].[dbo].[gdp_w_primary].[value] AS GDP,
    [OlympicDraft].[dbo].[convertedgdpdatapercapita].[gdp_per_capita],
    Atheletes
FROM medals_w_primary
INNER JOIN gdp_w_primary
    ON [OlympicDraft].[dbo].[medals_w_primary].[primary] = [OlympicDraft].[dbo].[gdp_w_primary].[year_country]
INNER JOIN contingency_cleaned
    ON [OlympicDraft].[dbo].[medals_w_primary].[primary] = [OlympicDraft].[dbo].[contingency_cleaned].[Year_Country]
INNER JOIN convertedgdpdatapercapita
    ON [OlympicDraft].[dbo].[medals_w_primary].[primary] = [OlympicDraft].[dbo].[convertedgdpdatapercapita].[Year_Country]
ORDER BY year DESC
This left us with the following table:
https://i.imgur.com/tpNhiNs.png
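The four-way INNER JOIN on the shared key can be sketched in Python as dictionary lookups, which makes the "keep only keys present in both" semantics explicit. Two toy tables stand in for the four real ones:

```python
# Toy stand-ins for two of the four tables, keyed by the primary key
# (assumption: illustrative rows only; values for 1992PRK mirror the
# North Korea fix described below)
medals = {"1992PRK": {"country_name": "North Korea", "Gold": 0, "Silver": 0, "Bronze": 0}}
gdp = {"1992PRK": {"GDP": 12_458_000_000}, "1992XYZ": {"GDP": 1}}

# INNER JOIN semantics: keep only keys present in both tables,
# merging the matched rows' columns
master = {key: {**medals[key], **gdp[key]} for key in medals.keys() & gdp.keys()}
```

Keys present in only one table (like the hypothetical "1992XYZ") are dropped, which is exactly why the primary key had to match across all four sources.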
Performed some basic cleaning tasks and checked for outliers:
Checked GDP numbers: 1992 North Korea shows as null. Updated this row with information from countryeconomy.com - $12,458,000,000
Checked GDP per capita:
1992 North Korea again missing. Updated this to $595, utilized same source.
UPDATE [OlympicDraft].[dbo].[gdp_w_primary]
SET [OlympicDraft].[dbo].[gdp_w_primary].[value] = 12458000000
WHERE [OlympicDraft].[dbo].[gdp_w_primary].[year_country] = '1992PRK'
UPDATE [OlympicDraft].[dbo].[convertedgdpdatapercapita]
SET [OlympicDraft].[dbo].[convertedgdpdatapercapita].[gdp_per_capita] = 595
WHERE [OlympicDraft].[dbo].[convertedgdpdatapercapita].[year_country] = '1992PRK'
Liechtenstein showed as an outlier with a GDP per capita of 180,366 in 2018. Confirmed this number is correct per the World Bank; it appears Liechtenstein does not often field athletes in the Winter Olympics. A quick SQL check confirms they fielded 3 athletes in 2018 and won a Bronze medal. At first glance, this is a strong points-per-athlete ratio.
Finally, need to create a column that shows the total point value for each of these rows based on the above formula (6 points for Gold, 4 points for Silver, 3 points for Bronze).
Updated query as follows:
SELECT
    [OlympicDraft].[dbo].[medals_w_primary].[year],
    host_country,
    host_city,
    [OlympicDraft].[dbo].[medals_w_primary].[country_name],
    [OlympicDraft].[dbo].[medals_w_primary].[country_code],
    Gold,
    Silver,
    Bronze,
    [OlympicDraft].[dbo].[gdp_w_primary].[value] AS GDP,
    [OlympicDraft].[dbo].[convertedgdpdatapercapita].[gdp_per_capita],
    Atheletes,
    (Gold*6) + (Silver*4) + (Bronze*3) AS 'Total_Points'
FROM [OlympicDraft].[dbo].[medals_w_primary]
INNER JOIN gdp_w_primary
    ON [OlympicDraft].[dbo].[medals_w_primary].[primary] = [OlympicDraft].[dbo].[gdp_w_primary].[year_country]
INNER JOIN contingency_cleaned
    ON [OlympicDraft].[dbo].[medals_w_primary].[primary] = [OlympicDraft].[dbo].[contingency_cleaned].[Year_Country]
INNER JOIN convertedgdpdatapercapita
    ON [OlympicDraft].[dbo].[medals_w_primary].[primary] = [OlympicDraft].[dbo].[convertedgdpdatapercapita].[Year_Country]
ORDER BY [OlympicDraft].[dbo].[convertedgdpdatapercapita].[year]
Spot checked, calculating correctly.
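The scoring itself is one line of arithmetic, so the spot check can be reproduced in a few lines of Python (medal counts below are hypothetical):

```python
def total_points(gold: int, silver: int, bronze: int) -> int:
    """Scoring used in the query: 6 per Gold, 4 per Silver, 3 per Bronze."""
    return gold * 6 + silver * 4 + bronze * 3

# e.g. 2 Gold, 1 Silver, 3 Bronze -> 12 + 4 + 9 = 25
example = total_points(2, 1, 3)
```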
Saved result as winter_olympics_study.csv.
We can now see that all relevant information is in this table:
https://i.imgur.com/ceZvqCA.png
To continue our analysis, opened this CSV in RStudio.
install.packages("tidyverse")
library(tidyverse)
library(ggplot2)
install.packages("forecast")
library(forecast)
install.packages("GGally")
library(GGally)
install.packages("modelr")
library(modelr)
View(winter_olympic_study)
ggplot(data = winter_olympic_study) + geom_point(aes(x=gdp_per_capita,y=Total_Points,color=country_name)) + facet_wrap(~country_name)
cor(winter_olympic_study$gdp_per_capita, winter_olympic_study$Total_Points, method = c("pearson"))
The result is 0.347, showing a moderate correlation between these two figures.
Looked next at GDP vs. Total_Points:
ggplot(data = winter_olympic_study) + geom_point(aes(x=GDP,y=Total_Points,color=country_name)) + facet_wrap(~country_name)
cor(winter_olympic_study$GDP, winter_olympic_study$Total_Points, method = c("pearson"))
This resulted in 0.35, practically no difference from the GDP-per-capita correlation.
Next looked at contingent size vs. total points:
ggplot(data = winter_olympic_study) + geom_point(aes(x=Atheletes,y=Total_Points,color=country_name)) + facet_wrap(~country_name)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
About
Recent research shows that visualizing linguistic media bias mitigates its negative effects. However, reliable automatic detection methods to generate such visualizations require costly, knowledge-intensive training data. To facilitate data collection for media bias datasets, we present News Ninja, a game employing data-collecting game mechanics to generate a crowdsourced dataset. Before annotating sentences, players are educated on media bias via a tutorial. Our findings show that datasets gathered with crowdsourced workers trained on News Ninja can reach significantly higher inter-annotator agreements than expert and crowdsourced datasets. As News Ninja encourages continuous play, it allows datasets to adapt to the reception and contextualization of news over time, presenting a promising strategy to reduce data collection expenses, educate players, and promote long-term bias mitigation.
General
This dataset was created through player annotations in the News Ninja Game made by ANON. Its goal is to improve the detection of linguistic media bias. Support came from ANON. None of the funders played any role in the dataset creation process or publication-related decisions.
The dataset includes sentences with binary bias labels (processed: biased or not biased) as well as the individual player annotations used for the majority vote. It includes all game-collected data. All data is completely anonymous. The dataset neither identifies sub-populations nor contains data sensitive to them, and it is not possible to identify individuals.
Some sentences might be offensive or triggering as they were taken from biased or more extreme news sources. The dataset contains topics such as violence, abortion, and hate against specific races, genders, religions, or sexual orientations.
Description of the Data Files
This repository contains the datasets for the anonymous News Ninja submission. The tables contain the following data:
ExportNewsNinja.csv: Contains 370 BABE sentences and 150 new sentences with their text (sentence), words labeled as biased (words), BABE ground truth (ground_Truth), and the sentence bias label from the player annotations (majority_vote). The first 370 sentences are re-annotated BABE sentences, and the following 150 sentences are new sentences.
AnalysisNewsNinja.xlsx: Contains 370 BABE sentences and 150 new sentences. The first 370 sentences are re-annotated BABE sentences, and the following 150 sentences are new sentences. The table includes the full sentence (Sentence), the sentence bias label from player annotations (isBiased Game), the new expert label (isBiased Expert), whether the game label and expert label match (Game VS Expert), whether differing labels are false positives or false negatives (false negative, false positive), the ground truth label from BABE (isBiasedBABE), whether the Expert and BABE labels match (Expert VS BABE), and whether the game label and BABE label match (Game VS BABE). It also includes the analysis of the agreement between the three rater categories (Game, Expert, BABE).
demographics.csv: Contains demographic information of News Ninja players, including gender, age, education, English proficiency, political orientation, news consumption, and consumed outlets.
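The majority_vote column collapses individual player annotations into one label per sentence. A plausible sketch of that aggregation (the exact tie-breaking rule is an assumption, not confirmed by the description):

```python
from collections import Counter

def majority_vote(labels):
    """Collapse per-player labels into one sentence label.
    Assumption: simple plurality; ties resolve to the first-counted label."""
    return Counter(labels).most_common(1)[0][0]

votes = ["biased", "not biased", "biased", "biased"]  # hypothetical annotations
sentence_label = majority_vote(votes)
```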
Collection Process
Data was collected through interactions with the NewsNinja game. All participants went through a tutorial before annotating 2x10 BABE sentences and 2x10 new sentences. For this first test, players were recruited using Prolific. The game was hosted on a custom-built responsive website. The collection period was from 20.02.2023 to 28.02.2023. Before starting the game, players were informed about the goal and the data processing. After consenting, they could proceed to the tutorial.
The dataset will be open source. A link with all details and contact information will be provided upon acceptance. No third parties are involved.
The dataset will not be maintained as it captures the first test of NewsNinja at a specific point in time. However, new datasets will arise from further iterations. Those will be linked in the repository. Please cite the NewsNinja paper if you use the dataset and contact us if you're interested in more information or joining the project.
https://creativecommons.org/publicdomain/zero/1.0/
BigQuery provides a limited number of sample tables that you can run queries against. These tables are suited for testing queries and learning BigQuery.
gsod: Contains weather information collected by NOAA, such as precipitation amounts and wind speeds from late 1929 to early 2010.
github_nested: Contains a timeline of actions such as pull requests and comments on GitHub repositories with a nested schema. Created in September 2012.
github_timeline: Contains a timeline of actions such as pull requests and comments on GitHub repositories with a flat schema. Created in May 2012.
natality: Describes all United States births registered in the 50 States, the District of Columbia, and New York City from 1969 to 2008.
shakespeare: Contains a word index of the works of Shakespeare, giving the number of times each word appears in each corpus.
trigrams: Contains English language trigrams from a sample of works published between 1520 and 2008.
wikipedia: Contains the complete revision history for all Wikipedia articles up to April 2010.
Fork this kernel to get started.
Data Source: https://cloud.google.com/bigquery/sample-tables
Banner Photo by Mervyn Chan from Unsplash.
How many babies were born in New York City on Christmas Day?
How many words are in the play Hamlet?
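The shakespeare table is a word index: a count of how often each word appears in each corpus. The same structure can be built locally with collections.Counter (toy corpus shown; the real table covers Shakespeare's complete works):

```python
from collections import Counter

corpus = "to be or not to be"  # toy stand-in for a play's text
word_index = Counter(corpus.split())

# word_index maps each word to its number of occurrences in the corpus
```

Answering "how many words are in the play Hamlet?" then amounts to summing the counts for the corpus in question.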
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This is a multi-task classification dataset I made for fun in late 2017 using a cheap webcam, wood, glue, paint, yarn and scotch tape. It consists of 2035 images of a board representing a fictitious ocean area where 6 ship models operate. Every image is 640x480 pixels with three color channels (RGB). Each non-empty image sample contains exactly one scaled model of a ship with a particular location and heading. The tasks are:
1. Determine whether or not the image is non-empty (i.e., contains a ship).
2. If the image is non-empty:
   A. Determine the ship's location.
   B. Determine the ship's heading.
   C. Determine the ship's model.
The data split is as follows:
* Directory /set-A_train
contains 1635 image samples for training
* Directory /set-B_test
contains 400 image samples for testing (validation)
Needless to say, you may choose any other data split you find useful for your purposes.
The board consists of 28 locations, with rows ranging from 1 through 7 and columns ranging from A through D. Each non-empty image sample contains exactly one ship, and the ship may be facing either West (towards the left of the board) or East (towards the right of the board). The following image sample shows an empty board with each location labeled.
https://github.com/luis-i-reyes-castro/find-the-ship/blob/main/README_Board.png?raw=true
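The 28 locations follow directly from the 7 rows times 4 columns; a short sketch enumerating the labels (the "1A" label format is an assumption based on the description):

```python
from itertools import product

rows = range(1, 8)   # rows 1 through 7
cols = "ABCD"        # columns A through D

# e.g. "1A", "1B", ..., "7D" -- 7 x 4 = 28 locations
locations = [f"{r}{c}" for r, c in product(rows, cols)]
```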
Each non-empty image sample contains exactly one of six possible ship models, facing either West (towards the left of the board) or East (towards the right of the board). The following table displays sample images of each ship model.
Ship Model | Shown Facing
---|---
Cruiser-1 | West
Cruiser-2 | East
Cruiser-3 | East
Fishing-1 | West
Fishing-2 | East
Freighter | West
The image labels can be found in the image_labels.csv files inside each dataset directory. These CSV files contain tables where each row corresponds to an image sample. The columns are structured as follows.
| Column | Values | |-------------|---------------------------------...
File: 0.csv
File: 1.csv
File: 10.csv
File: 11.csv
File: 12.csv
File: 14.csv
File: 15.csv
File: 17.csv
File: 18.csv
If you use this dataset in your research, please credit the original authors.