This dataset was created by Jiayang Gao
This dataset was created by Jannis
https://creativecommons.org/publicdomain/zero/1.0/
One of the most popular competitions on Kaggle is House Prices: Advanced Regression Techniques. The original data comes from the publication by Dean De Cock, "Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project", Journal of Statistics Education, Volume 19, Number 3 (2011). Recently a 'demonstration' notebook, "First place is meaningless in this way!", was published that extracts the 'solution' from the full dataset. Now that the 'solution' is readily available, anyone can reproduce the competition at home without any daily submission limit. This opens up the possibility of experimenting with advanced techniques such as pipelines with various estimators/models in the same notebook, extensive hyper-parameter tuning, and so on, all without the risk of 'upsetting' the public leaderboard. Simply download this solution.csv file, import it into your script or notebook, and evaluate the Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted values and the logarithm of the values in this file.
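A minimal evaluation sketch is shown below. It assumes that solution.csv and your own predictions both contain the competition's usual Id and SalePrice columns; the submission file name is a placeholder.

```python
import numpy as np
import pandas as pd

solution = pd.read_csv("solution.csv")        # ground-truth sale prices described above
submission = pd.read_csv("submission.csv")    # your model's predictions (hypothetical file name)

# Align predictions and ground truth on the Id column, then score in log space
merged = solution.merge(submission, on="Id", suffixes=("_true", "_pred"))
rmsle = np.sqrt(np.mean(
    (np.log(merged["SalePrice_pred"]) - np.log(merged["SalePrice_true"])) ** 2
))
print(f"Leaderboard-style score: {rmsle:.5f}")
```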
This dataset is the submission.csv file that will produce a public leaderboard score of 0.00000.
https://creativecommons.org/publicdomain/zero/1.0/
By [source]
This dataset provides a detailed look into the world of competitive video gaming at universities. It covers a wide range of topics, from performance rankings and results across multiple esports platforms to individual team and university rankings within each tournament. With a wealth of data, fans can discover statistics on their favorite teams or explore the challenges facing university gamers as they battle to be the best. Dive into the information provided and get an inside view of collegiate esports tournaments as you examine everything from Match ID, Team 1 and university affiliations to points earned or lost in each match and the special Seeds or UniSeeds for exceptional teams. And don't forget to explore the team names along with their corresponding websites for further details on stats across tournaments!
Download Files: First, make sure you have downloaded the CS_week1, CS_week2, CS_week3 and seeds datasets from Kaggle. You will also need to download the currentRankings file for each week of competition. All files should be saved under their originally assigned names so that your analysis tools can read them properly (e.g. CS_week1.csv).
Understand File Structure: Once all data has been collected and organized into separate files on your computer, it's time to become familiar with what type of information is included in each file. The main folder contains the three weekly data files (week 1-3) and the seedings. The week 1-3 files contain teams matched against one another by university, the point scores from match results, and the team name and website URL associated with each university entry; the seedings file provides a ranking of the university entries, accompanied by team names, website URLs, etc. In addition, there is a currentRankings file that contains scores for each individual player/team for a given period of competition (e.g. the first week).
Analyzing Data: Now that everything is set up on your end, it's time to explore! You can dive deep into trends among universities or individual players with regard to specific match performances or overall standings across the weeks of competition. You can also generate further insights by building graphs from the data compiled in the BUECTracker dataset. For example, to compare two universities, say Harvard University versus Cornell University, since the beginning of the event, you could extract their respective points and dates (found under the results), their regions (North America vs. Europe, etc.), general stats such as maps played, and any other custom ideas that come up when dealing with similar datasets.
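A rough pandas sketch of that kind of comparison might look like the following; the 'University' and 'Points' column names are assumptions based on the description above, and the university names are placeholders.

```python
import pandas as pd

week1 = pd.read_csv("CS_week1.csv")

# Total points per university for two example schools (placeholder names)
subset = week1[week1["University"].isin(["Harvard University", "Cornell University"])]
print(subset.groupby("University")["Points"].sum().sort_values(ascending=False))
```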
- Analyze the performance of teams and identify areas for improvement for better performance in future competitions.
- Assess which esports platforms are the most popular among gamers.
- Gain a better understanding of player rankings across different regions, based on the ranking system, to create targeted strategies that could boost individual players' scoring potential or a team's overall success in competitive gaming events.
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: CS_week1.csv

| Column name | Description |
|:------------|:------------|
| Match ID | Unique identifier for each match. (Integer) |
| Team 1 | Name of the first team in the match. (String) |
| University | University associated with the team. (String) |
File: CS_week1_currentRankings.csv

| Column name | Description |
|:------------|:------------|
...
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset is the result of my study of web scraping the English Wikipedia in R and my tests of regression and classification modeling in R.
The content was created by reading the appropriate English Wikipedia articles about Italian cities: I did not run NLP analysis, I only parsed the tables with the data, and I ranked every city from 0 to N in every aspect. About the values: 0 means "*the city is not ranked in this aspect*" and N means "*the city is in first place, in descending order of importance, in this aspect*". If there is no ranking in a particular aspect (for example, only the existence of airports/harbours, with no additional data about traffic or size), then 0 means "*no existence*" and N means "*there are N airports/harbours*". The only non-numeric column is the one with the names of the cities in English form, with some exceptions (for example, "*Bra (CN)*") for simplicity.
I acknowledge the Wikimedia Foundation for its work, its mission, and for making available the cover image of this dataset (please read the article "The Ideal City (painting)"). I also acknowledge StackOverflow and Cross Validated for being the most important hubs of technical knowledge in the world, and all the people on Kaggle for their suggestions.
As a beginner in data analysis and modeling (OK, I passed the statistics exam at Politecnico di Milano (Italy), but it has been more than 10 years since I worked on this topic and my memory is getting old ^_^), I worked mostly on data cleaning, dataset building and building the simplest models.
You can use this dataset to figure out which city is a good place to live, or extend it by adding other data from Wikipedia (not only by reading the tables but also by reading the free text and extracting data from it).
Hi guys, I'm new to Kaggle and this is my first dataset, so please support me by giving feedback on my work.
https://cdla.io/sharing-1-0/
Analyzing the World Happiness Report across years can be a tedious task because it takes a lot of time to clean the data. This dataset, which is essentially 3D time-series data ready to use for ML tasks, is a simplified version of the existing World Happiness Report datasets.
This dataset contains the happiness score of each country, and some key factors that contribute directly to the overall happiness of the country, over 6 years (from 2015 to 2020).
For those who are new to the topic, here is the link to the World Happiness Report
Here are some key differences from the original data:
Features that are absent from some of the annual reports, or that are mostly unnecessary, are excluded from the data. Feature names are now consistent.
For the sake of simplification, only the countries which are present in all annual reports are included in the data.
Instead of individual regions like the Middle East and Western Europe, continents are (in my opinion) a better choice for groupby-aggregate operations, so the existing region column is replaced by a new continent column.
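As a quick illustration of the kind of groupby-aggregate operation this enables, here is a hedged sketch; the file name and the continent / year / happiness_score column names are assumptions, not the dataset's exact schema.

```python
import pandas as pd

df = pd.read_csv("world_happiness.csv")   # hypothetical file name

# Average happiness score per continent and year
summary = df.groupby(["continent", "year"])["happiness_score"].mean().unstack("year")
print(summary.round(2))
```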
I am grateful to the Sustainable Development Solutions Network for creating the World Happiness Report and its Kaggle dataset, which I used for preprocessing in the first place.
https://creativecommons.org/publicdomain/zero/1.0/
Data pipeline diagram: https://github.com/IyadElwy/Televisions/assets/83036619/7088d477-2559-4af2-94e9-924274521d36
It's important to note the additionalProperties field, which makes it possible to add more data to a field; i.e., the following fields will have a lot more nested data.
```json
{
"type": "object",
"additionalProperties": true,
"properties": {
"id": {
"type": "integer"
},
"title": {
"type": "string"
},
"normalized_title": {
"type": "string"
},
"wikipedia_url": {
"type": "string"
},
"wikiquotes_url": {
"type": "string"
},
"eztv_url": {
"type": "string"
},
"metacritic_url": {
"type": "string"
},
"wikipedia": {
"type": "object",
"additionalProperties": true
},
"wikiquotes": {
"type": "object",
"additionalProperties": true
},
"metacritic": {
"type": "object",
"additionalProperties": true
},
"tvMaze": {
"type": "object",
"additionalProperties": true
}
}
}
```
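As a small illustration, the schema above can be used to validate a record with the jsonschema package. This is only a sketch: the schema file name and the example record are made up.

```python
import json
from jsonschema import validate  # pip install jsonschema

# Hypothetical file containing the schema shown above
with open("show_schema.json") as f:
    schema = json.load(f)

record = {
    "id": 1,
    "title": "Example Show",
    "normalized_title": "example-show",
    "wikipedia": {"summary": "any nested structure is allowed by additionalProperties"},
}
validate(instance=record, schema=schema)  # raises jsonschema.ValidationError on mismatch
print("record conforms to the schema")
```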
Part of Janatahack Hackathon in Analytics Vidhya
The healthcare sector has long been an early adopter of and benefited greatly from technological advances. These days, machine learning plays a key role in many health-related realms, including the development of new medical procedures, the handling of patient data, health camps and records, and the treatment of chronic diseases.
MedCamp organizes health camps in several cities with low work-life balance. They reach out to working people and ask them to register for these health camps. For those who attend, MedCamp provides the facility to undergo health checks or to increase awareness by visiting various stalls (depending on the format of the camp).
MedCamp has conducted 65 such events over a period of 4 years and they see a high drop-off between "Registration" and the number of people taking tests at the camps. Over the last 4 years, they have stored data on ~110,000 registrations.
One of the biggest costs in arranging these camps is the amount of inventory you need to carry. If you carry more inventory than required, you incur unnecessarily high costs. On the other hand, if you carry less than required for conducting these medical checks, people end up having a bad experience.
The Process:
MedCamp employees / volunteers reach out to people and drive registrations.
During the camp, people who "show up" either undergo the medical tests or visit stalls, depending on the format of the health camp.
Other things to note:
Since this is a completely voluntary activity for the working professionals, MedCamp usually has little profile information about these people.
For a few camps, there was hardware failure, so some information about date and time of registration is lost.
MedCamp runs 3 formats of these camps. The first and second formats provide people with an instantaneous health score. The third format provides information about several health issues through various awareness stalls.
Favorable outcome:
For the first 2 formats, a favourable outcome is defined as getting a health_score, while in the third format it is defined as visiting at least one stall.
You need to predict the chances (probability) of having a favourable outcome.
Train / Test split:
Camps started on or before 31st March 2006 are considered in Train
Test data is for all camps conducted on or after 1st April 2006.
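A hedged starter sketch of that split and prediction task is below. It is not the official solution; it assumes a single prepared dataframe with a Camp_Start_Date column, a binary favourable_outcome target and already-numeric features (all assumptions, as is the file name).

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical pre-merged file of registrations with camp details and outcome labels
df = pd.read_csv("registrations_merged.csv", parse_dates=["Camp_Start_Date"])

train = df[df["Camp_Start_Date"] <= "2006-03-31"]
test = df[df["Camp_Start_Date"] >= "2006-04-01"]

# Assumes the remaining columns are numeric features
feature_cols = [c for c in df.columns if c not in ("favourable_outcome", "Camp_Start_Date")]
model = LogisticRegression(max_iter=1000)
model.fit(train[feature_cols], train["favourable_outcome"])

# Predicted probability of a favourable outcome for the test camps
test_probs = model.predict_proba(test[feature_cols])[:, 1]
```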
Credits to AV
To share with the data science community to jump start their journey in Healthcare Analytics
In this paper, we introduce a novel benchmarking framework designed specifically for evaluating data science agents. Our contributions are three-fold. First, we propose DSEval, an evaluation paradigm that enlarges the evaluation scope to the full lifecycle of LLM-based data science agents. We cover aspects including, but not limited to, the quality of the derived analytical solutions or machine learning models, as well as potential side effects such as unintentional changes to the original data. Second, we incorporate a novel bootstrapped annotation process that lets LLMs themselves generate and annotate the benchmarks with a "human in the loop". A novel language (DSEAL) has been proposed, and the four derived benchmarks significantly improve benchmark scalability and coverage with largely reduced human labor. Third, based on DSEval and the four benchmarks, we conduct a comprehensive evaluation of various data science agents from different aspects. Our findings reveal the common challenges and limitations of current works, providing useful insights and shedding light on future research on LLM-based data science agents.
This is one of DSEval benchmarks.
Market basket analysis with Apriori algorithm
The retailer wants to target customers with suggestions for the itemsets they are most likely to purchase. I was given a dataset containing a retailer's transaction data, which covers all transactions that occurred over a period of time. The retailer will use the results to grow their business: by providing customers with itemset suggestions, we can increase customer engagement, improve the customer experience and identify customer behavior. I will solve this problem using Association Rules, an unsupervised learning technique that checks for the dependency of one data item on another.
Association rules are most useful when you want to find associations between different objects in a set, i.e., frequent patterns in a transaction database. They can tell you what items customers frequently buy together, and they allow the retailer to identify relationships between items.
Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat and 8 bought both. For the rule "bought computer mouse => bought mouse mat":
- support = P(mouse & mat) = 8/100 = 0.08
- confidence = support / P(computer mouse) = 0.08/0.10 = 0.80
- lift = confidence / P(mouse mat) = 0.80/0.09 ≈ 8.9

This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
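The same toy numbers can be reproduced with a few lines of plain Python:

```python
n_customers = 100
n_mouse = 10   # customers who bought a computer mouse
n_mat = 9      # customers who bought a mouse mat
n_both = 8     # customers who bought both

support = n_both / n_customers              # 0.08
confidence = n_both / n_mouse               # 0.80
lift = confidence / (n_mat / n_customers)   # ≈ 8.9
print(support, confidence, round(lift, 1))
```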
Number of Attributes: 7
Screenshot: https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png
First, we need to load the required libraries. Each library is briefly described below.
Screenshot: https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png
Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.
Screenshot: https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png
Screenshot: https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png
Next, we will clean our data frame and remove missing values.
Screenshot: https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png
To apply association rule mining, we need to convert the data frame into transaction data, so that all items bought together in one invoice end up in ...
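The walkthrough above uses R (the screenshots show that workflow). For readers who prefer Python, a rough equivalent using mlxtend is sketched below; the BillNo and Itemname column names are assumptions about the Excel file, not confirmed from it.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

df = pd.read_excel("Assignment-1_Data.xlsx").dropna(subset=["Itemname"])

# One basket per invoice: the list of items bought together
baskets = df.groupby("BillNo")["Itemname"].apply(list).tolist()

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(baskets).transform(baskets), columns=te.columns_)

frequent_itemsets = apriori(onehot, min_support=0.01, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]].head())
```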
Digital image forensics has gained a lot of attention as it becomes easier for anyone to create forged images. Several areas are affected by image manipulation: a doctored image can increase the credibility of fake news, and impostors can use morphed images to pretend to be someone else.
It has become critically important to be able to recognize the manipulations an image has undergone. To do this, the first requirement is reliable, controlled datasets representing the most characteristic cases encountered. The purpose of this work is to lay the foundations of a body of tests that allows both the qualification of automatic methods for authentication and manipulation detection, and the training of these methods.
This dataset contains about 105,000 splicing forgeries, available under the splicing directory. Each splicing is accompanied by two binary masks: one under the probe_mask subdirectory indicates the location of the forgery, and one under the donor_mask subdirectory indicates the location of the source. The external image can be found in the JSON file under the graph subdirectory.
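A purely illustrative loading sketch is below; every path and file name in it is a placeholder, since the exact naming convention inside the splicing, probe_mask, donor_mask and graph directories may differ from this guess.

```python
import json
from PIL import Image

# All paths below are hypothetical examples of the layout described above
forged = Image.open("splicing/img/0000001.tif")          # the spliced (forged) image
probe = Image.open("splicing/probe_mask/0000001.jpg")    # binary mask of the forged region
donor = Image.open("splicing/donor_mask/0000001.jpg")    # binary mask of the source region
with open("splicing/graph/0000001.json") as f:           # JSON referencing the external (donor) image
    graph = json.load(f)
print(forged.size, probe.size, donor.size, list(graph.keys()))
```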
If you use this dataset for your research, please refer to the original paper :
@INPROCEEDINGS{DEFACTODataset,
  AUTHOR = "Gaël Mahfoudi and Badr Tajini and Florent Retraint and Fr{\'e}d{\'e}ric Morain-Nicolier and Jean Luc Dugelay and Marc Pic",
  TITLE = "{DEFACTO:} Image and Face Manipulation Dataset",
  BOOKTITLE = "27th European Signal Processing Conference (EUSIPCO 2019)",
  ADDRESS = "A Coruña, Spain",
  DAYS = 1,
  MONTH = sep,
  YEAR = 2019
}
and to the MSCOCO dataset
The DEFACTO Consortium does not own the copyright of those images. Please refer to the MSCOCO terms of use for all images based on their Dataset.
This dataset was created by HuiUnited
Photo by Anastasiya Pavlova from Unsplash
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset is a side product of a notebook to find out the rules of stress position in English.
It is a work based on another dataset with 300k+ English words.
I looked up phonetic transcriptions with this free dictionary API and got about 30k transcriptions. Then I extracted syllable counts, stress positions and stressed syllables from them to make this new dataset.
words_stress_analyzed.csv is the final dataset. The other files are just intermediate steps in the process.
Column | Datatype | Example | Description |
---|---|---|---|
word | str | complimentary | the English words |
phonetic | str | /ˌkɒmplɪ̈ˈment(ə)ɹɪ/ | the phonetic transcription of the words |
part_of_speech | str (list-like) | ['adjective'] | how these words are used in sentences |
syllable_len | int | 5 | how many syllables are there in these words |
stress_pos | int | 3 | the syllable on which the stress falls; if there is more than one stress, this is the position of the first stress |
stress_syllable | str | e | the vowel of the stressed syllable |
Note: The absence of a stress symbol in some short words led to blanks in this dataset. It is recommended to filter out rows with an empty stress_syllable and rows where syllable_len is 1.
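A small pandas sketch of that recommended filtering (column names taken from the table above):

```python
import pandas as pd

df = pd.read_csv("words_stress_analyzed.csv")

# Drop rows with no detected stress and one-syllable words, as recommended above
clean = df[df["stress_syllable"].notna() & (df["stress_syllable"] != "") & (df["syllable_len"] > 1)]
print(f"{len(df)} rows before filtering, {len(clean)} after")
```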
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset contains information about a number of participants (participants.csv) in a workshop who need to be assigned to a number of rooms (rooms.csv).
Restrictions:
1. The workshop has 5 different activities.
2. Each participant has indicated their first, second and third preferences for the available activities (Priority1, Priority2 and Priority3 columns in participants.csv).
3. Participants are part of teams (Team column in participants.csv) and should be assigned together.
4. Each activity lasts for half a day, and each participant will take part in one activity in the morning and one activity in the afternoon.
5. Each room must contain the SAME activity in the morning and in the afternoon.
Requirements:
A. Define the way in which each participant should be assigned, through a CSV file in the format Name;ActivityAM;RoomAM, ActivityPM;RoomPM.
B. Maximize the number of people getting their 1st and 2nd preferences.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Dataset accompanying the paper "The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning" (https://arxiv.org/abs/2305.14045), including 1.88M CoT rationales extracted across 1,060 tasks.
From the release repo https://github.com/kaistAI/CoT-Collection: Large Language Models (LLMs) have shown enhanced capabilities of solving novel tasks by reasoning step-by-step, known as Chain-of-Thought (CoT) reasoning. How can we instill the same capability of reasoning step-by-step on unseen tasks into LMs that possess fewer than 100B parameters? To address this question, we first introduce the CoT Collection, a new instruction-tuning dataset that augments 1.88 million CoT rationales across 1,060 tasks. We show that continually fine-tuning Flan-T5 (3B & 11B) with the CoT Collection enables the 3B & 11B LMs to perform CoT better on unseen tasks, leading to an improvement in the average zero-shot accuracy on 27 datasets of the BIG-Bench-Hard benchmark by +4.34% and +2.44%, respectively. Furthermore, we show that instruction tuning with CoT allows LMs to possess stronger few-shot learning capabilities, resulting in an improvement of +2.97% and +2.37% on 4 domain-specific tasks over Flan-T5 (3B & 11B), respectively.
This dataset contains trajectory data for the UR3 robot moving from point A to point B using imitation learning. The data was collected using a Spacemouse to manually control the robotic arm. This dataset is intended to facilitate research and development in robotic motion planning and control, specifically focusing on imitation learning algorithms.
The dataset is organized into several CSV files, each representing different trajectories and positions (joint positions and tool positions). The files are structured as follows:
Each CSV file contains the following columns:
Tool Position Files
- x: The x-coordinate of the end-effector relative to the base coordinate system.
- y: The y-coordinate of the end-effector relative to the base coordinate system.
- z: The z-coordinate of the end-effector relative to the base coordinate system.
- rx: The rotation around the x-axis relative to the base coordinate system.
- ry: The rotation around the y-axis relative to the base coordinate system.
- rz: The rotation around the z-axis relative to the base coordinate system.
Joint Position Files
- base: The position of the base joint relative to its neutral position.
- shoulder: The position of the shoulder joint relative to its neutral position.
- elbow: The position of the elbow joint relative to its neutral position.
- wrist1: The position of the first wrist joint relative to its neutral position.
- wrist2: The position of the second wrist joint relative to its neutral position.
- wrist3: The position of the third wrist joint relative to its neutral position.
Usage: This dataset is intended for use in training and testing imitation learning algorithms.
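A minimal loading sketch is shown below; the file name is a placeholder, but the columns follow the tool-position description above.

```python
import numpy as np
import pandas as pd

traj = pd.read_csv("trajectory_01_tool_positions.csv")   # hypothetical file name

# Summary statistics of the end-effector pose
print(traj[["x", "y", "z", "rx", "ry", "rz"]].describe())

# Approximate path length of the end-effector (sum of Euclidean steps in x, y, z)
steps = np.diff(traj[["x", "y", "z"]].to_numpy(), axis=0)
print("approximate path length:", np.linalg.norm(steps, axis=1).sum())
```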
**New to machine learning and data science? No question is too basic or too simple. Use this place to post any first-timer clarifying questions about the classification algorithm or the dataset.** This file contains demographics about customers and whether each customer clicked the ad or not. Use this file with a classification algorithm to predict the click based on the customer demographics as independent variables.
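Here is a hedged starter sketch; the file name and the 'Clicked on Ad' target column name are assumptions and may differ from the actual dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

data = pd.read_csv("advertising.csv")                     # hypothetical file name

# Keep numeric demographic columns as features; the target column name is an assumption
X = data.select_dtypes("number").drop(columns=["Clicked on Ad"])
y = data["Clicked on Ad"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```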
This data set contains the following features:
Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
🔗 Check out my notebook here: Link
This dataset includes malnutrition indicators and some of the features that might impact malnutrition. The detailed description of the dataset is given below:
Percentage-of-underweight-children-data: Percentage of children aged 5 years or below who are underweight by country.
Prevalence of Underweight among Female Adults (Age Standardized Estimate): Percentage of female adults whose BMI is less than 18.
GDP per capita (constant 2015 US$): GDP per capita is gross domestic product divided by midyear population. GDP is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2015 U.S. dollars.
Domestic general government health expenditure (% of GDP): Public expenditure on health from domestic sources as a share of the economy as measured by GDP.
Maternal mortality ratio (modeled estimate, per 100,000 live births): Maternal mortality ratio is the number of women who die from pregnancy-related causes while pregnant or within 42 days of pregnancy termination per 100,000 live births. The data are estimated with a regression model using information on the proportion of maternal deaths among non-AIDS deaths in women ages 15-49, fertility, birth attendants, and GDP measured using purchasing power parities (PPPs).
Mean-age-at-first-birth-of-women-aged-20-50-data: Average age at which women of age 20-50 years have their first child.
School enrollment, secondary, female (% gross): Gross enrollment ratio is the ratio of total enrollment, regardless of age, to the population of the age group that officially corresponds to the level of education shown. Secondary education completes the provision of basic education that began at the primary level, and aims at laying the foundations for lifelong learning and human development, by offering more subject- or skill-oriented instruction using more specialized teachers.
This dataset was created by Jiayang Gao