100+ datasets found
  1. ov-blend-1st-place-sub

    • kaggle.com
    Updated Oct 26, 2020
    Cite
    Jiayang Gao (2020). ov-blend-1st-place-sub [Dataset]. https://www.kaggle.com/nullrecurrent/ov-blend-1st-place-sub/metadata
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 26, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Jiayang Gao
    Description

    Dataset

    This dataset was created by Jiayang Gao

    Contents

  2. Cassava Leaf Disease (1st place models)

    • kaggle.com
    Updated Feb 24, 2021
    Cite
    Jannis (2021). Cassava Leaf Disease (1st place models) [Dataset]. https://www.kaggle.com/jannish/cassava-leaf-disease-1st-place-models/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 24, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Jannis
    Description

    Dataset

    This dataset was created by Jannis

    Contents

  3. House Prices: Advanced Regression 'solution' file

    • kaggle.com
    Updated Sep 11, 2020
    Cite
    Carl McBride Ellis (2020). House Prices: Advanced Regression 'solution' file [Dataset]. https://www.kaggle.com/carlmcbrideellis/house-prices-advanced-regression-solution-file/activity
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 11, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Carl McBride Ellis
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    One of the most popular competitions on Kaggle is House Prices: Advanced Regression Techniques. The original data comes from the publication: Dean De Cock, "Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project", Journal of Statistics Education, Volume 19, Number 3 (2011). Recently a 'demonstration' notebook, "First place is meaningless in this way!", was published that extracts the 'solution' from the full dataset. Now that the 'solution' is readily available, people can reproduce the competition at home without any daily submission limit. This opens up the possibility of experimenting with advanced techniques such as pipelines with various estimators/models in the same notebook, extensive hyper-parameter tuning, etc., all without the risk of 'upsetting' the public leaderboard. Simply download this solution.csv file, import it into your script or notebook, and evaluate the Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the data in this file.

    Content

    This dataset is the submission.csv file that will produce a public leaderboard score of 0.00000.
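    As a quick illustration of the evaluation described above, here is a minimal sketch; the file name my_submission.csv is an assumption, and the column names are assumed to follow the competition's standard submission format (Id, SalePrice):

        import numpy as np
        import pandas as pd

        # Load the 'solution' file from this dataset and your own predictions.
        solution = pd.read_csv("solution.csv")          # assumed columns: Id, SalePrice
        submission = pd.read_csv("my_submission.csv")   # assumed columns: Id, SalePrice

        # Align the two files on Id, then compute the RMSE between the logs of the
        # true and predicted sale prices (the competition's leaderboard metric).
        merged = solution.merge(submission, on="Id", suffixes=("_true", "_pred"))
        rmsle = np.sqrt(np.mean(
            (np.log(merged["SalePrice_pred"]) - np.log(merged["SalePrice_true"])) ** 2))
        print(f"RMSE of log prices: {rmsle:.5f}")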

    Acknowledgements

  4. Esports Performance Rankings and Results

    • kaggle.com
    Updated Dec 12, 2022
    Cite
    The Devastator (2022). Esports Performance Rankings and Results [Dataset]. https://www.kaggle.com/datasets/thedevastator/unlocking-collegiate-esports-performance-with-bu/suggestions
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 12, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Esports Performance Rankings and Results

    Performance Rankings and Results from Multiple Esports Platforms

    By [source]

    About this dataset

    This dataset provides a detailed look into the world of competitive video gaming in universities. It covers a wide range of topics, from performance rankings and results across multiple esports platforms to the individual team and university rankings within each tournament. With an incredible wealth of data, fans can discover statistics on their favorite teams or explore the challenges placed upon university gamers as they battle it out to be the best. Dive into the information provided and get an inside view into the world of collegiate esports tournaments as you assess all things from Match ID, Team 1, University affiliations, Points earned or lost in each match and special Seeds or UniSeeds for exceptional teams. Of course don't forget about exploring all the great Team Names along with their corresponding websites for further details on stats across tournaments!

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    Download Files: First, make sure you have downloaded the CS_week1, CS_week2, CS_week3 and seeds datasets from Kaggle. You will also need to download the currentRankings file for each week of competition. All files should be saved under their originally assigned names so that your analysis tools can read them properly (i.e., CS_week1.csv).

    Understand the File Structure: Once all data has been collected and organized into separate files on your computer, it's time to become familiar with the type of information included in each file. The main folder contains three main data files (week1-3) plus the seedings. The week1-3 files contain teams matched against one another by university, the point score from match results, and the team name and website URL associated with each university entry; the seedings file includes a ranking system for university entries, accompanied by team names, website URLs, etc. Furthermore, there is an additional file containing currentRankings scores for each individual player/team for a given period of competition (i.e., the first week).

    Analyzing the Data: Now that everything is set up, it's time to explore! You can dive into trends among universities or individual players with regard to specific match performances or overall standings throughout the weeks of competition. You can also build graphs from the data compiled in the BUECTracker dataset. For example, to compare two universities (say, Harvard University vs. Cornell University) since the beginning of the event, you could extract their respective points and dates (found under the results), regions (North America vs. Europe, etc.), and general stats such as maps played, along with any other custom ideas that come up when working with similar datasets (see the sketch below).
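    A minimal pandas sketch of such a comparison; the weekly file names come from the steps above, while the exact points column name is an assumption that should be checked against the real headers:

        import pandas as pd

        # Load the three weekly match files documented above.
        weeks = {f"week{i}": pd.read_csv(f"CS_week{i}.csv") for i in (1, 2, 3)}

        # Compare two universities' total points week by week.
        targets = ["Harvard University", "Cornell University"]
        for name, df in weeks.items():
            subset = df[df["University"].isin(targets)]
            # "Points" is an assumed column name for the match score.
            print(name, subset.groupby("University")["Points"].sum().to_dict())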

    Research Ideas

    • Analyze the performance of teams and identify areas for improvement for better performance in future competitions.
    • Assess which esports platforms are the most popular among gamers.
    • Gain a better understanding of player rankings across different regions, based on the ranking system, to create targeted strategies that could boost individual players' scoring potential or overall team success in competitive gaming events.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: CS_week1.csv

    | Column name | Description |
    |:------------|:------------|
    | Match ID    | Unique identifier for each match. (Integer) |
    | Team 1      | Name of the first team in the match. (String) |
    | University  | University associated with the team. (String) |

    File: CS_week1_currentRankings.csv

    | Column name | Description |
    |:------------|:------------|
    ...

  5. University dataset

    • kaggle.com
    Updated Jan 25, 2022
    Cite
    Ritwik (2022). University dataset [Dataset]. https://www.kaggle.com/datasets/ritwiksingh99/university-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 25, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ritwik
    Description

    Context

    Hi, I'm new to Kaggle and this is my first dataset, so please support me by giving feedback on my work.

    Content

    What's inside is more than just rows and columns. Make it easy for others to get started by describing how you acquired the data and what time period it represents, too.

    Acknowledgements

    We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

    Inspiration

    Your data will be in front of the world's largest data science community. What questions do you want to see answered?

  6. World Happiness Report (Preprocessed)

    • kaggle.com
    Updated Jul 22, 2020
    Cite
    Yamac Eren Ay (2020). World Happiness Report (Preprocessed) [Dataset]. https://www.kaggle.com/yamaerenay/world-happiness-report-preprocessed/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 22, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Yamac Eren Ay
    License

    https://cdla.io/sharing-1-0/

    Area covered
    World
    Description

    Context

    Analyzing the World Happiness Report over the years can be a stressful and boring task because it takes a lot of time to clean the data. This dataset, which is basically 3D time-series data ready to use for ML tasks, is a simplified version of the already existing World Happiness Report datasets.

    Content

    This dataset contains the happiness score of each country, and some key factors that contribute directly to the overall happiness of the country, over 6 years (from 2015 to 2020).

    For those who are new to the topic, here is the link to the World Happiness Report.

    Here are some key differences from the original data:

    1. Features that are absent from any year's report, or that are mostly unnecessary, are excluded from the data. Feature names are now consistent.

    2. For the sake of simplification, only the countries which are present in all annual reports are included in the data.

    3. Instead of individual regions like Middle East and Western Europe, continents are (in my opinion) a better choice for performing groupby-aggregate operations, so the existing region column is replaced by a new continent column.
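    For instance, a groupby-aggregate over the new continent column might look like the sketch below; the file name and the exact column names (year, continent, happiness_score) are assumptions to verify against the CSV:

        import pandas as pd

        # File and column names are assumptions for illustration.
        df = pd.read_csv("world_happiness.csv")

        # Mean happiness score per continent for each year (2015-2020).
        by_continent = (df.groupby(["year", "continent"])["happiness_score"]
                          .mean()
                          .unstack("continent"))
        print(by_continent)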

    Acknowledgements

    I am grateful to the Sustainable Development Solutions Network for creating the World Happiness Report and its Kaggle dataset, which I used for preprocessing in the first place.

  7. AI Tool Usage by Indian College Students 2025

    • kaggle.com
    Updated Jun 9, 2025
    Cite
    Rakesh Kapilavayi (2025). AI Tool Usage by Indian College Students 2025 [Dataset]. https://www.kaggle.com/datasets/rakeshkapilavai/ai-tool-usage-by-indian-college-students-2025
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 9, 2025
    Dataset provided by
    Kaggle
    Authors
    Rakesh Kapilavayi
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) - https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    AI Tool Usage by Indian College Students 2025

    This unique dataset, collected via a May 2025 survey, captures how 496 Indian college students use AI tools (e.g., ChatGPT, Gemini, Copilot) in academics. It includes 16 attributes like AI tool usage, trust, impact on grades, and internet access, ideal for education analytics and machine learning.

    Columns

    • Student_Name: Anonymized student name.
    • College_Name: College attended.
    • Stream: Academic discipline (e.g., Engineering, Arts).
    • Year_of_Study: Year of study (1–4).
    • AI_Tools_Used: Tools used (e.g., ChatGPT, Gemini).
    • Daily_Usage_Hours: Hours spent daily on AI tools.
    • Use_Cases: Purposes (e.g., Assignments, Exam Prep).
    • Trust_in_AI_Tools: Trust level (1–5).
    • Impact_on_Grades: Grade impact (-3 to +3).
    • Do_Professors_Allow_Use: Professor approval (Yes/No).
    • Preferred_AI_Tool: Preferred tool.
    • Awareness_Level: AI awareness (1–10).
    • Willing_to_Pay_for_Access: Willingness to pay (Yes/No).
    • State: Indian state.
    • Device_Used: Device (e.g., Laptop, Mobile).
    • Internet_Access: Access quality (Poor/Medium/High).

    Use Cases

    • Predict academic performance using AI tool usage.
    • Analyze trust in AI across streams or regions.
    • Cluster students by usage patterns.
    • Study digital divide via Internet_Access.

    Source: Collected via Google Forms survey in May 2025, ensuring diverse representation across India. Note: First dataset of its kind on Kaggle!
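    As a sketch of the first use case (predicting grade impact from usage), assuming the survey is loaded from a CSV whose name is hypothetical while the column names are taken from the list above:

        import pandas as pd
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.model_selection import train_test_split

        # Hypothetical file name; the columns are documented above.
        df = pd.read_csv("ai_tool_usage_2025.csv")

        features = ["Daily_Usage_Hours", "Trust_in_AI_Tools", "Awareness_Level"]
        X, y = df[features], df["Impact_on_Grades"]
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

        model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
        print("R^2 on held-out students:", model.score(X_test, y_test))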

  8. The 500MB Tv-Show Dataset

    • kaggle.com
    Updated Sep 5, 2023
    Cite
    Iyad Elwy (2023). The 500MB Tv-Show Dataset [Dataset]. https://www.kaggle.com/datasets/iyadelwy/the-500mb-tv-show-dataset/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 5, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Iyad Elwy
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Televisions

    This dataset was extracted, transformed and loaded using various sources. The entire ETL Process looks as follows:

    Data pipeline diagram: https://github.com/IyadElwy/Televisions/assets/83036619/7088d477-2559-4af2-94e9-924274521d36

    Links

    Github ETL Process Code

    Kaggle Dataset

    Explanation

    • First I needed to find some appropriate data-sources. For this I used Wikiquote to extract important and unique script-text from the various shows. Wikipedia was used for the extraction of generic info like summaries, etc. Metacritic was used for the extraction of user reviews (scores + opinions) and the Opensource TvMaze API was used for getting all sorts of data, ranging from titles to cast, episodes, summaries and more.
    • Now the first step was to gather titles from those sources. It was important to divide the scrapers into two categories, scrapers that get the titles and scrapers that do the heavy-scraping which gets you the actual data.
    • For the ETL job orchestration, Apache Airflow was used, hosted on an Azure VM instance with 4 virtual CPUs and about 14 GB of RAM. This was needed because of the heavy Spark data transformations.
    • RDS, running PostgreSQL was used to save the titles and their corresponding urls
    • S3 buckets were mostly used as Data Lakes to hold either raw data or temp data that needed to be further processed
    • AWS Glue was used to do Transformations on the dataset and then output to Redshift which was our Data Warehouse in this case

    CosmosDB NoSQL Schema

    It's important to note the additionalProperties field, which makes the addition of more data to a field possible, i.e. the following fields will have a lot more nested data.

    json
    {
      "type": "object",
      "additionalProperties": true,
      "properties": {
        "id": { "type": "integer" },
        "title": { "type": "string" },
        "normalized_title": { "type": "string" },
        "wikipedia_url": { "type": "string" },
        "wikiquotes_url": { "type": "string" },
        "eztv_url": { "type": "string" },
        "metacritic_url": { "type": "string" },
        "wikipedia": { "type": "object", "additionalProperties": true },
        "wikiquotes": { "type": "object", "additionalProperties": true },
        "metacritic": { "type": "object", "additionalProperties": true },
        "tvMaze": { "type": "object", "additionalProperties": true }
      }
    }
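    Because the snippet above is a JSON Schema, records can be validated against it before loading; here is a minimal sketch using the Python jsonschema package with an abbreviated schema and a made-up record:

        import jsonschema

        # Abbreviated version of the schema above; extra keys are allowed
        # because additionalProperties is true.
        schema = {
            "type": "object",
            "additionalProperties": True,
            "properties": {
                "id": {"type": "integer"},
                "title": {"type": "string"},
                "tvMaze": {"type": "object", "additionalProperties": True},
            },
        }

        # Made-up record for illustration only.
        record = {"id": 1, "title": "Example Show", "tvMaze": {"genres": ["Drama"]}}
        jsonschema.validate(instance=record, schema=schema)  # raises ValidationError on mismatch
        print("record conforms to the schema")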

  9. Health Care Analytics

    • kaggle.com
    Updated Jan 10, 2022
    Cite
    Abishek Sudarshan (2022). Health Care Analytics [Dataset]. https://www.kaggle.com/datasets/abisheksudarshan/health-care-analytics
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 10, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Abishek Sudarshan
    Description

    Context

    Part of Janatahack Hackathon in Analytics Vidhya

    Content

    The healthcare sector has long been an early adopter of and benefited greatly from technological advances. These days, machine learning plays a key role in many health-related realms, including the development of new medical procedures, the handling of patient data, health camps and records, and the treatment of chronic diseases.

    MedCamp organizes health camps in several cities with low work-life balance. They reach out to working people and ask them to register for these health camps. For those who attend, MedCamp provides the facility to undergo health checks or to increase awareness by visiting various stalls (depending on the format of the camp).

    MedCamp has conducted 65 such events over a period of 4 years, and they see a high drop-off between “Registration” and the number of people taking tests at the camps. Over the last 4 years, they have stored data on the ~110,000 registrations they have done.

    One of the huge costs in arranging these camps is the amount of inventory you need to carry. If you carry more inventory than required, you incur unnecessarily high costs. On the other hand, if you carry less inventory than required for conducting the medical checks, people end up having a bad experience.

    The Process:

    MedCamp employees / volunteers reach out to people and drive registrations.
    During the camp, people who show up either undergo the medical tests or visit stalls, depending on the format of the health camp.

    Other things to note:

    Since this is a completely voluntary activity for working professionals, MedCamp usually has little profile information about these people.
    For a few camps, there was a hardware failure, so some information about the date and time of registration is lost.
    MedCamp runs 3 formats of these camps. The first and second formats provide people with an instantaneous health score. The third format provides information about several health issues through various awareness stalls.

    Favorable outcome:

    For the first 2 formats, a favourable outcome is defined as getting a health_score, while in the third format it is defined as visiting at least one stall.
    You need to predict the chances (probability) of having a favourable outcome.

    Train / Test split:

    Camps that started on or before 31st March 2006 are included in Train.
    Test data covers all camps conducted on or after 1st April 2006.
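    A minimal sketch of that date-based split in pandas; the file name and the Camp_Start_Date column are assumptions to check against the actual camp-detail file:

        import pandas as pd

        # Hypothetical file and column names for illustration.
        camps = pd.read_csv("Health_Camp_Detail.csv", parse_dates=["Camp_Start_Date"])

        train_camps = camps[camps["Camp_Start_Date"] <= "2006-03-31"]
        test_camps = camps[camps["Camp_Start_Date"] >= "2006-04-01"]
        print(len(train_camps), "train camps,", len(test_camps), "test camps")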
    

    Acknowledgements

    Credits to AV

    Inspiration

    To share with the data science community to jump start their journey in Healthcare Analytics

  10. Hospital Management Dataset

    • kaggle.com
    Updated May 30, 2025
    Cite
    Kanak Baghel (2025). Hospital Management Dataset [Dataset]. https://www.kaggle.com/datasets/kanakbaghel/hospital-management-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 30, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Kanak Baghel
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) - https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This is a structured, multi-table dataset designed to simulate a hospital management system. It is ideal for practicing data analysis, SQL, machine learning, and healthcare analytics.

    Dataset Overview

    This dataset includes five CSV files:

    1. patients.csv – Patient demographics, contact details, registration info, and insurance data

    2. doctors.csv – Doctor profiles with specializations, experience, and contact information

    3. appointments.csv – Appointment dates, times, visit reasons, and statuses

    4. treatments.csv – Treatment types, descriptions, dates, and associated costs

    5. billing.csv – Billing amounts, payment methods, and status linked to treatments

    📁 Files & Column Descriptions

    **patients.csv**

    Contains patient demographic and registration details.

    Column -> Description

    patient_id -> Unique ID for each patient
    first_name -> Patient's first name
    last_name -> Patient's last name
    gender -> Gender (M/F)
    date_of_birth -> Date of birth
    contact_number -> Phone number
    address -> Address of the patient
    registration_date -> Date of first registration at the hospital
    insurance_provider -> Insurance company name
    insurance_number -> Policy number
    email -> Email address

    **doctors.csv**

    Details about the doctors working in the hospital.

    Column -> Description

    doctor_id -> Unique ID for each doctor
    first_name -> Doctor's first name
    last_name -> Doctor's last name
    specialization -> Medical field of expertise
    phone_number -> Contact number
    years_experience -> Total years of experience
    hospital_branch -> Branch of hospital where doctor is based
    email -> Official email address

    appointments.csv

    Records of scheduled and completed patient appointments.

    Column -> Description

    appointment_id -> Unique appointment ID
    patient_id -> ID of the patient
    doctor_id -> ID of the attending doctor
    appointment_date -> Date of the appointment
    appointment_time -> Time of the appointment
    reason_for_visit -> Purpose of visit (e.g., checkup)
    status -> Status (Scheduled, Completed, Cancelled)

    treatments.csv

    Information about the treatments given during appointments.

    Column -> Description

    treatment_id -> Unique ID for each treatment
    appointment_id -> Associated appointment ID
    treatment_type -> Type of treatment (e.g., MRI, X-ray)
    description -> Notes or procedure details
    cost -> Cost of treatment
    treatment_date -> Date when treatment was given

    **billing.csv**

    Billing and payment details for treatments.

    Column -> Description

    bill_id -> Unique billing ID
    patient_id -> ID of the billed patient
    treatment_id -> ID of the related treatment
    bill_date -> Date of billing
    amount -> Total amount billed
    payment_method -> Mode of payment (Cash, Card, Insurance)
    payment_status -> Status of payment (Paid, Pending, Failed)

    Possible Use Cases

    SQL queries and relational database design

    Exploratory data analysis (EDA) and dashboarding

    Machine learning projects (e.g., cost prediction, no-show analysis)

    Feature engineering and data cleaning practice

    End-to-end healthcare analytics workflows
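    As a small illustration of the first two use cases above, here is a minimal sketch joining billing to patients, using the file names and ID columns documented above:

        import pandas as pd

        patients = pd.read_csv("patients.csv")
        billing = pd.read_csv("billing.csv")

        # Total billed amount per patient, joining on patient_id.
        totals = (billing.merge(patients, on="patient_id")
                         .groupby(["patient_id", "first_name", "last_name"])["amount"]
                         .sum()
                         .reset_index()
                         .sort_values("amount", ascending=False))
        print(totals.head())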

    Recommended Tools & Resources

    SQL (joins, filters, window functions)

    Pandas and Matplotlib/Seaborn for EDA

    Scikit-learn for ML models

    Pandas Profiling for automated EDA

    Plotly for interactive visualizations

    Please note:

    All data is synthetically generated for educational and project use. No real patient information is included.

    If you find this dataset helpful, consider upvoting or sharing your insights by creating a Kaggle notebook.

  11. DSEval-Kaggle Dataset

    • paperswithcode.com
    Updated Apr 19, 2024
    Cite
    Yuge Zhang; Qiyang Jiang; Xingyu Han; Nan Chen; Yuqing Yang; Kan Ren (2024). DSEval-Kaggle Dataset [Dataset]. https://paperswithcode.com/dataset/dseval
    Explore at:
    Dataset updated
    Apr 19, 2024
    Authors
    Yuge Zhang; Qiyang Jiang; Xingyu Han; Nan Chen; Yuqing Yang; Kan Ren
    Description

    In this paper, we introduce a novel benchmarking framework designed specifically for evaluations of data science agents. Our contributions are three-fold. First, we propose DSEval, an evaluation paradigm that enlarges the evaluation scope to the full lifecycle of LLM-based data science agents. We also cover aspects including but not limited to the quality of the derived analytical solutions or machine learning models, as well as potential side effects such as unintentional changes to the original data. Second, we incorporate a novel bootstrapped annotation process letting LLMs themselves generate and annotate the benchmarks with a "human in the loop". A novel language (i.e., DSEAL) has been proposed, and the four derived benchmarks have significantly improved benchmark scalability and coverage, with largely reduced human labor. Third, based on DSEval and the four benchmarks, we conduct a comprehensive evaluation of various data science agents from different aspects. Our findings reveal the common challenges and limitations of the current works, providing useful insights and shedding light on future research on LLM-based data science agents.

    This is one of DSEval benchmarks.

  12. Student Performance Data Set

    • kaggle.com
    Updated Mar 27, 2020
    Cite
    Data-Science Sean (2020). Student Performance Data Set [Dataset]. https://www.kaggle.com/datasets/larsen0966/student-performance-data-set
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 27, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Data-Science Sean
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    If this data set is useful, an upvote is appreciated. This data approaches student achievement in secondary education at two Portuguese schools. The data attributes include student grades and demographic, social and school-related features, and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd-period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).

  13. test titanic

    • kaggle.com
    Updated Mar 13, 2017
    Cite
    Kittisak (2017). test titanic [Dataset]. https://www.kaggle.com/kittisaks/testtitanic
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 13, 2017
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Kittisak
    Description

    Context

    There's a story behind every dataset and here's your opportunity to share yours.

    Content

    What's inside is more than just rows and columns. Make it easy for others to get started by describing how you acquired the data and what time period it represents, too.

    Acknowledgements

    We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

    Inspiration

    Your data will be in front of the world's largest data science community. What questions do you want to see answered?

  14. aggregate-data-italian-cities-from-wikipedia

    • kaggle.com
    Updated May 20, 2020
    Cite
    alepuzio (2020). aggregate-data-italian-cities-from-wikipedia [Dataset]. https://www.kaggle.com/alepuzio/aggregatedataitaliancitiesfromwikipedia/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 20, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    alepuzio
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) - https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Context

    This dataset is the result of my study of web scraping English Wikipedia in R and my tests of regression and classification modeling in R.

    Content

    The content was created by reading the appropriate English Wikipedia articles about Italian cities: I didn't run NLP analysis but only parsed the tables with the data, and I ranked every city from 0 to N in every aspect. About the values, 0 means "*the city is not ranked in this aspect*" and N means "*the city is at first place, in descending order of importance, in this aspect*". If there's no ranking in a particular aspect (for example, only the existence of airports/harbours, with no additional data about traffic or size), then 0 means "*no existence*" and N means "*there are N airports/harbours*". The only non-numeric column is the column with the names of the cities in English form, with some exceptions (for example, "*Bra (CN)*") for simplicity.

    Acknowledgements

    I acknowledge the Wikimedia Foundation for its work, its mission, and for making available the cover image of this dataset (please read the article "The Ideal City (painting)"). I also acknowledge StackOverflow and Cross Validated, the most important hubs of technical knowledge in the world, and all the people on Kaggle for their suggestions.

    Inspiration

    As a beginner in data analysis and modeling (OK, I passed the statistics exam at Politecnico di Milano (Italy), but I haven't worked on this topic for more than 10 years and my memory is getting old ^_^), I worked mostly on data cleaning, dataset building, and building the simplest models.

    You can use this dataset to figure out which city is a good place to live, or extend it by adding other data from Wikipedia (not only reading the tables but also reading the text and extracting data from the unstructured text).

  15. defacto-splicing

    • kaggle.com
    Updated Jan 14, 2022
    Cite
    DEFACTODataset (2022). defacto-splicing [Dataset]. https://www.kaggle.com/datasets/defactodataset/defactosplicing/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 14, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    DEFACTODataset
    Description

    Context

    Digital image forensics has gained a lot of attention as it is becoming easier for anyone to make forged images. Several areas are affected by image manipulation: a doctored image can increase the credibility of fake news, and impostors can use morphed images to pretend to be someone else.

    It has become of critical importance to be able to recognize the manipulations an image has suffered. To do this, the first need is reliable and controlled data sets representing the most characteristic cases encountered. The purpose of this work is to lay the foundations of a body of tests allowing both the qualification of automatic methods for authentication and manipulation detection and the training of these methods.

    Content

    This dataset contains about 105,000 splicing forgeries, available under the splicing directory. Each splicing is accompanied by two binary masks: one under the probe_mask subdirectory indicates the location of the forgery, and one under the donor_mask subdirectory indicates the location of the source. The external image can be found in the JSON file under the graph subdirectory.

    Reference

    If you use this dataset for your research, please refer to the original paper:

    @INPROCEEDINGS{DEFACTODataset,
      AUTHOR = "Gaël Mahfoudi and Badr Tajini and Florent Retraint and Frédéric Morain-Nicolier and Jean Luc Dugelay and Marc Pic",
      TITLE = "{DEFACTO:} Image and Face Manipulation Dataset",
      BOOKTITLE = "27th European Signal Processing Conference (EUSIPCO 2019)",
      ADDRESS = "A Coruña, Spain",
      DAYS = 1,
      MONTH = sep,
      YEAR = 2019
    }

    and to the MSCOCO dataset

    License

    The DEFACTO Consortium does not own the copyright of those images. Please refer to the MSCOCO terms of use for all images based on their Dataset.

  16. Chain-of-Thought collection

    • kaggle.com
    Updated Jun 19, 2023
    Cite
    Konrad Banachewicz (2023). Chain-of-Thought collection [Dataset]. http://identifiers.org/arxiv:2305.140
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 19, 2023
    Dataset provided by
    Kaggle
    Authors
    Konrad Banachewicz
    License

    Open Database License (ODbL) v1.0 - https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Dataset accompanying the paper "The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning", including 1.88M CoT rationales extracted across 1,060 tasks - https://arxiv.org/abs/2305.14045

    From the release repo https://github.com/kaistAI/CoT-Collection: Large Language Models (LLMs) have shown enhanced capabilities of solving novel tasks by reasoning step-by-step, known as Chain-of-Thought (CoT) reasoning; how can we instill the same capability of reasoning step-by-step on unseen tasks into LMs that possess fewer than 100B parameters? To address this question, we first introduce the CoT Collection, a new instruction-tuning dataset that augments 1.88 million CoT rationales across 1,060 tasks. We show that continually fine-tuning Flan-T5 (3B & 11B) with the CoT Collection enables the 3B & 11B LMs to perform CoT better on unseen tasks, leading to an improvement in the average zero-shot accuracy on 27 datasets of the BIG-Bench-Hard benchmark by +4.34% and +2.44%, respectively. Furthermore, we show that instruction tuning with CoT allows LMs to possess stronger few-shot learning capabilities, resulting in an improvement of +2.97% and +2.37% on 4 domain-specific tasks over Flan-T5 (3B & 11B), respectively.

  17. English words with stress position analyzed

    • kaggle.com
    Updated Jul 5, 2024
    Cite
    Victor_42 (2024). English words with stress position analyzed [Dataset]. https://www.kaggle.com/datasets/victorcheng42/english-words-with-stress-position-analyzed
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 5, 2024
    Dataset provided by
    Kaggle
    Authors
    Victor_42
    License

    Apache License, v2.0 - https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Context

    This dataset is a side product of a notebook to find out the rules of stress position in English.

    It is a work based on another dataset with 300k+ English words.

    I looked up phonetic transcriptions with this free dictionary API and got about 30k transcriptions. Then I extracted syllable counts, stress positions and stressed syllables from them to make this new dataset.

    File description

    words_stress_analyzed.csv is the final dataset. Other files are just intermediate steps in the process.

    Column description

    | Column | Datatype | Example | Description |
    |:-------|:---------|:--------|:------------|
    | word | str | complimentary | the English words |
    | phonetic | str | /ˌkɒmplɪ̈ˈment(ə)ɹɪ/ | the phonetic transcription of the words |
    | part_of_speech | str (list like) | ['adjective'] | how these words are used in sentences |
    | syllable_len | int | 5 | how many syllables there are in these words |
    | stress_pos | int | 3 | on which syllable the stress falls; if there is more than one stress, this is the position of the first stress |
    | stress_syllable | str | e | the vowel of the stressed syllable |

    Note: The absence of a stress symbol in some short words led to blanks in this dataset. It is recommended to filter out rows with an empty stress_syllable and rows where syllable_len is 1.
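    A minimal sketch of that filtering, using the final file named above:

        import pandas as pd

        df = pd.read_csv("words_stress_analyzed.csv")

        # Drop rows with no detected stress and monosyllabic words, as recommended above.
        filtered = df[df["stress_syllable"].notna() & (df["syllable_len"] > 1)]
        print(len(df), "->", len(filtered), "rows kept")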

    Images

    • created with Midjourney

  18. IL_UR3_movement_public

    • kaggle.com
    Updated May 22, 2024
    Cite
    Tomáš Kotrba (2024). IL_UR3_movement_public [Dataset]. https://www.kaggle.com/datasets/tomkotrba/il-ur3-movement-public
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 22, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Tomáš Kotrba
    Description

    UR3 Imitation Learning Dataset for Trajectory Movements

    Overview

    This dataset contains trajectory data for the UR3 robot moving from point A to point B using imitation learning. The data was collected using a Spacemouse to manually control the robotic arm. This dataset is intended to facilitate research and development in robotic motion planning and control, specifically focusing on imitation learning algorithms.

    Dataset Contents

    The dataset is organized into several CSV files, each representing different trajectories and positions (joint positions and tool positions). The files are structured as follows:

    Data Structure

    Each CSV file contains the following columns:

    Tool Position Files

    x: The x-coordinate of the end-effector relative to the base coordinate system.
    y: The y-coordinate of the end-effector relative to the base coordinate system.
    z: The z-coordinate of the end-effector relative to the base coordinate system.
    rx: The rotation around the x-axis relative to the base coordinate system.
    ry: The rotation around the y-axis relative to the base coordinate system.
    rz: The rotation around the z-axis relative to the base coordinate system.

    Joint Position Files

    base: The position of the base joint relative to its neutral position.
    shoulder: The position of the shoulder joint relative to its neutral position.
    elbow: The position of the elbow joint relative to its neutral position.
    wrist1: The position of the first wrist joint relative to its neutral position.
    wrist2: The position of the second wrist joint relative to its neutral position.
    wrist3: The position of the third wrist joint relative to its neutral position.
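    A minimal sketch for inspecting one trajectory; the file name is hypothetical, since the exact file listing is not reproduced here, while the columns follow the tool-position layout above:

        import pandas as pd

        # Hypothetical file name; substitute one of the tool-position CSVs from this dataset.
        traj = pd.read_csv("tool_position_trajectory_01.csv")

        print(traj[["x", "y", "z"]].describe())   # spatial extent of the end-effector path
        print(traj[["rx", "ry", "rz"]].head())    # orientation at the start of the motion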

    Usage

    This dataset is intended for use in: training and testing imitation learning algorithms.

  19. Job Offers Web Scraping Search

    • kaggle.com
    Updated Feb 11, 2023
    Cite
    The Devastator (2023). Job Offers Web Scraping Search [Dataset]. https://www.kaggle.com/datasets/thedevastator/job-offers-web-scraping-search
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 11, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Job Offers Web Scraping Search

    Targeted Results to Find the Optimal Work Solution

    By [source]

    About this dataset

    This dataset collects job offers from web scraping which are filtered according to specific keywords, locations and times. This data gives users rich and precise search capabilities to uncover the best working solution for them. With the information collected, users can explore options that match with their personal situation, skillset and preferences in terms of location and schedule. The columns provide detailed information around job titles, employer names, locations, time frames as well as other necessary parameters so you can make a smart choice for your next career opportunity

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    This dataset is a great resource for those looking to find an optimal work solution based on keywords, location and time parameters. With this information, users can quickly and easily search through job offers that best fit their needs. Here are some tips on how to use this dataset to its fullest potential:

    • Start by identifying what type of job offer you want to find. The keyword column will help you narrow down your search by allowing you to search for job postings that contain the word or phrase you are looking for.

    • Next, consider where the job is located – the Location column tells you where in the world each posting is from so make sure it’s somewhere that suits your needs!

    • Finally, consider when the position is available – look at the Time frame column which gives an indication of when each posting was made as well as if it’s a full-time/ part-time role or even if it’s a casual/temporary position from day one so make sure it meets your requirements first before applying!

    • Additionally, if details such as hours per week or further schedule information are important criteria, there is also info provided under the Horari and Temps_Oferta columns! Now that all three criteria have been ticked off - keywords, location and time frame - take a look at the Empresa (Company Name) and Nom_Oferta (Post Name) columns in order to get an idea of who will be employing you should you land the gig!

      All these pieces of data put together should give any motivated individual all they need to seek out an optimal work solution - keep hunting and good luck!

    Research Ideas

    • Machine learning can be used to group job offers in order to facilitate the identification of similarities and differences between them. This could allow users to specifically target their search for a work solution.
    • The data can be used to compare job offerings across different areas or types of jobs, enabling users to make better informed decisions in terms of their career options and goals.
    • It may also provide insight into the local job market, enabling companies and employers to identify potential new opportunities or trends that may have previously gone unnoticed.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: web_scraping_information_offers.csv

    | Column name | Description |
    |:------------|:------------|
    | Nom_Oferta | Name of the job offer. (String) |
    | Empresa | Company offering the job. (String) |
    | Ubicació | Location of the job offer. (String) |
    | Temps_Oferta | Time of the job offer. (String) |
    | Horari | Schedule of the job offer. (String) |

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

  20. Malnutrition: Underweight Women, Children & Others

    • kaggle.com
    Updated Aug 17, 2023
    Cite
    Sarthak Bose (2023). Malnutrition: Underweight Women, Children & Others [Dataset]. https://www.kaggle.com/datasets/sarthakbose/malnutrition-underweight-women-children-and-others
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 17, 2023
    Dataset provided by
    Kaggle
    Authors
    Sarthak Bose
    License

    Attribution-NoDerivs 4.0 (CC BY-ND 4.0) - https://creativecommons.org/licenses/by-nd/4.0/
    License information was derived automatically

    Description

    🔗 Check out my notebook here: Link

    This dataset includes malnutrition indicators and some of the features that might impact malnutrition. The detailed description of the dataset is given below:

    • Percentage-of-underweight-children-data: Percentage of children aged 5 years or below who are underweight by country.

    • Prevalence of Underweight among Female Adults (Age Standardized Estimate): Percentage of female adults whose BMI is less than 18.

    • GDP per capita (constant 2015 US$): GDP per capita is gross domestic product divided by midyear population. GDP is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2015 U.S. dollars.

    • Domestic general government health expenditure (% of GDP): Public expenditure on health from domestic sources as a share of the economy as measured by GDP.

    • Maternal mortality ratio (modeled estimate, per 100,000 live births): Maternal mortality ratio is the number of women who die from pregnancy-related causes while pregnant or within 42 days of pregnancy termination per 100,000 live births. The data are estimated with a regression model using information on the proportion of maternal deaths among non-AIDS deaths in women ages 15-49, fertility, birth attendants, and GDP measured using purchasing power parities (PPPs).

    • Mean-age-at-first-birth-of-women-aged-20-50-data: Average age at which women of age 20-50 years have their first child.

    • School enrollment, secondary, female (% gross): Gross enrollment ratio is the ratio of total enrollment, regardless of age, to the population of the age group that officially corresponds to the level of education shown. Secondary education completes the provision of basic education that began at the primary level, and aims at laying the foundations for lifelong learning and human development, by offering more subject- or skill-oriented instruction using more specialized teachers.
