19 datasets found
  1. Open Images

    • kaggle.com
    • opendatalab.com
    zip
    Updated Feb 12, 2019
    Cite
    Google BigQuery (2019). Open Images [Dataset]. https://www.kaggle.com/datasets/bigquery/open-images
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Feb 12, 2019
    Dataset provided by
    Google (http://google.com/)
    BigQuery (https://cloud.google.com/bigquery)
    Authors
    Google BigQuery
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context

    Labeled datasets are useful in machine learning research.

    Content

    This public dataset contains approximately 9 million URLs and metadata for images that have been annotated with labels spanning more than 6,000 categories.

    Tables: 1) annotations_bbox 2) dict 3) images 4) labels

    Update Frequency: Quarterly

    Querying BigQuery Tables

    Fork this kernel to get started.
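    A minimal query sketch, assuming a Kaggle kernel (or any environment) with the google-cloud-bigquery client installed and credentials configured; the column names in the dict table are assumptions based on the description above, so verify them against the table schema:

```python
# Hypothetical sketch: list Open Images labels whose display name mentions
# "bus", using the BigQuery Python client. The dataset/table path follows
# the description above; the column names are assumed.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT label_name, label_display_name
    FROM `bigquery-public-data.open_images.dict`
    WHERE LOWER(label_display_name) LIKE '%bus%'
    LIMIT 20
"""

for row in client.query(query).result():
    print(row.label_name, row.label_display_name)
```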

    Acknowledgements

    https://bigquery.cloud.google.com/dataset/bigquery-public-data:open_images

    https://cloud.google.com/bigquery/public-data/openimages

    APA-style citation: Google Research (2016). The Open Images dataset [Image urls and labels]. Available from github: https://github.com/openimages/dataset.

    Use: The annotations are licensed by Google Inc. under CC BY 4.0 license.

    The images referenced in the dataset are listed as having a CC BY 2.0 license. Note: while we tried to identify images that are licensed under a Creative Commons Attribution license, we make no representations or warranties regarding the license status of each image and you should verify the license for each image yourself.

    Banner Photo by Mattias Diesel from Unsplash.

    Inspiration

    Which labels are in the dataset? Which labels have "bus" in their display names? How many images of a trolleybus are in the dataset? What are some landing pages of images with a trolleybus? Which images with cherries are in the training set?

  2. Kaggle EyePACS Dataset

    • paperswithcode.com
    Updated Oct 28, 2020
    Cite
    (2020). Kaggle EyePACS Dataset [Dataset]. https://paperswithcode.com/dataset/kaggle-eyepacs
    Explore at:
    Dataset updated
    Oct 28, 2020
    Description

    Diabetic retinopathy is the leading cause of blindness in the working-age population of the developed world. It is estimated to affect over 93 million people.


    The US Center for Disease Control and Prevention estimates that 29.1 million people in the US have diabetes and the World Health Organization estimates that 347 million people have the disease worldwide. Diabetic Retinopathy (DR) is an eye disease associated with long-standing diabetes. Around 40% to 45% of Americans with diabetes have some stage of the disease. Progression to vision impairment can be slowed or averted if DR is detected in time, however this can be difficult as the disease often shows few symptoms until it is too late to provide effective treatment.

    Currently, detecting DR is a time-consuming and manual process that requires a trained clinician to examine and evaluate digital color fundus photographs of the retina. By the time human readers submit their reviews, often a day or two later, the delayed results lead to lost follow up, miscommunication, and delayed treatment.

    Clinicians can identify DR by the presence of lesions associated with the vascular abnormalities caused by the disease. While this approach is effective, its resource demands are high. The expertise and equipment required are often lacking in areas where the rate of diabetes in local populations is high and DR detection is most needed. As the number of individuals with diabetes continues to grow, the infrastructure needed to prevent blindness due to DR will become even more insufficient.

    The need for a comprehensive and automated method of DR screening has long been recognized, and previous efforts have made good progress using image classification, pattern recognition, and machine learning. With color fundus photography as input, the goal of this competition is to push an automated detection system to the limit of what is possible – ideally resulting in models with realistic clinical potential. The winning models will be open sourced to maximize the impact such a model can have on improving DR detection.

    Acknowledgements

    This competition is sponsored by the California Healthcare Foundation.

    Retinal images were provided by EyePACS, a free platform for retinopathy screening.

  3. NYC Open Data

    • kaggle.com
    zip
    Updated Mar 20, 2019
    Cite
    NYC Open Data (2019). NYC Open Data [Dataset]. https://www.kaggle.com/nycopendata/new-york
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Mar 20, 2019
    Dataset authored and provided by
    NYC Open Data
    License

    Public Domain Dedication (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    NYC Open Data is an opportunity to engage New Yorkers in the information that is produced and used by City government. We believe that every New Yorker can benefit from Open Data, and Open Data can benefit from every New Yorker. Source: https://opendata.cityofnewyork.us/overview/

    Content

    Thanks to NYC Open Data, which makes public data generated by city agencies available for public use, and Citi Bike, we've incorporated over 150 GB of data in 5 open datasets into Google BigQuery Public Datasets, including:

    • Over 8 million 311 service requests from 2012-2016

    • More than 1 million motor vehicle collisions 2012-present

    • Citi Bike stations and 30 million Citi Bike trips 2013-present

    • Over 1 billion Yellow and Green Taxi rides from 2009-present

    • Over 500,000 sidewalk trees surveyed decennially in 1995, 2005, and 2015

    This dataset is deprecated and not being updated.

    Fork this kernel to get started with this dataset.

    Acknowledgements

    https://opendata.cityofnewyork.us/

    https://cloud.google.com/blog/big-data/2017/01/new-york-city-public-datasets-now-available-on-google-bigquery

    This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - https://data.cityofnewyork.us/ - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

    By accessing datasets and feeds available through NYC Open Data, the user agrees to all of the Terms of Use of NYC.gov as well as the Privacy Policy for NYC.gov. The user also agrees to any additional terms of use defined by the agencies, bureaus, and offices providing data. Public data sets made available on NYC Open Data are provided for informational purposes. The City does not warranty the completeness, accuracy, content, or fitness for any particular purpose or use of any public data set made available on NYC Open Data, nor are any such warranties to be implied or inferred with respect to the public data sets furnished therein.

    The City is not liable for any deficiencies in the completeness, accuracy, content, or fitness for any particular purpose or use of any public data set, or application utilizing such data set, provided by any third party.

    Banner Photo by @bicadmedia from Unsplash.

    Inspiration

    On which New York City streets are you most likely to find a loud party?

    Can you find the Virginia Pines in New York City?

    Where was the only collision caused by an animal that injured a cyclist?

    What’s the Citi Bike record for the Longest Distance in the Shortest Time (on a route with at least 100 rides)?

    [Image: https://cloud.google.com/blog/big-data/2017/01/images/148467900588042/nyc-dataset-6.png]

  4. Google Landmarks Dataset v2

    • github.com
    • paperswithcode.com
    • +1more
    Updated Sep 27, 2019
    + more versions
    Cite
    Google (2019). Google Landmarks Dataset v2 [Dataset]. https://github.com/cvdfoundation/google-landmark
    Explore at:
    Dataset updated
    Sep 27, 2019
    Dataset provided by
    Google (http://google.com/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the second version of the Google Landmarks dataset (GLDv2), which contains images annotated with labels representing human-made and natural landmarks. The dataset can be used for landmark recognition and retrieval experiments. This version of the dataset contains approximately 5 million images, split into 3 sets of images: train, index and test. The dataset was presented in our CVPR'20 paper. In this repository, we present download links for all dataset files and relevant code for metric computation. This dataset was associated with two Kaggle challenges, on landmark recognition and landmark retrieval. Results were discussed as part of a CVPR'19 workshop. In this repository, we also provide scores for the top 10 teams in the challenges, based on the latest ground-truth version. Please visit the challenge and workshop webpages for more details on the data, tasks and technical solutions from top teams.

  5. GitHub Repos

    • kaggle.com
    zip
    Updated Mar 20, 2019
    + more versions
    Cite
    Github (2019). GitHub Repos [Dataset]. https://www.kaggle.com/datasets/github/github-repos
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Mar 20, 2019
    Dataset provided by
    GitHub (https://github.com/)
    Authors
    Github
    Description

    GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008.

    This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.

    Querying BigQuery tables

    You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]. Fork this kernel to get started and learn how to safely analyze large BigQuery datasets.
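    A minimal sketch of that workflow, assuming credentials are available (e.g. inside a Kaggle kernel); the languages table and its repeated language column are assumptions based on the dataset description, so check the schema first:

```python
# Sketch: count repositories per language in the GitHub Repos dataset with
# the BigQuery Python client. Table and column names are assumptions; verify
# them in the BigQuery UI before relying on this.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT lang.name AS language, COUNT(*) AS repo_count
    FROM `bigquery-public-data.github_repos.languages`,
         UNNEST(language) AS lang
    GROUP BY language
    ORDER BY repo_count DESC
    LIMIT 10
"""

for row in client.query(query).result():
    print(row.language, row.repo_count)
```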

    Acknowledgements

    This dataset was made available per GitHub's terms of service. This dataset is available via Google Cloud Platform's Marketplace, GitHub Activity Data, as part of GCP Public Datasets.

    Inspiration

    • This is the perfect dataset for fighting language wars.
    • Can you identify any signals that predict which packages or languages will become popular, in advance of their mass adoption?
  6. 785 Million Language Translation Database for AI

    • kaggle.com
    Updated Aug 28, 2023
    Cite
    Ramakrishnan Lakshmanan (2023). 785 Million Language Translation Database for AI [Dataset]. https://www.kaggle.com/datasets/ramakrishnan1984/785-million-language-translation-database-ai-ml
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 28, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ramakrishnan Lakshmanan
    License

    GNU LGPL 3.0: http://www.gnu.org/licenses/lgpl-3.0.html

    Description

    Our groundbreaking translation dataset represents a monumental advancement in the field of natural language processing and machine translation. Comprising a staggering 785 million records, this corpus bridges language barriers by offering translations from English to an astonishing 548 languages. The dataset promises to be a cornerstone resource for researchers, engineers, and developers seeking to enhance their machine translation models, cross-lingual analysis, and linguistic investigations.

    Size of the dataset: 41 GB uncompressed, 20 GB compressed.

    Key Features:

    Scope and Scale: With a comprehensive collection of 785 million records, this dataset provides an unparalleled wealth of translated text. Each record consists of an English sentence paired with its translation in one of the 548 target languages, enabling multi-directional translation applications.

    Language Diversity: Encompassing translations into 548 languages, this dataset represents a diverse array of linguistic families, dialects, and scripts. From widely spoken languages to those with limited digital representation, the dataset bridges communication gaps on a global scale.

    Quality and Authenticity: The translations have been meticulously curated, verified, and cross-referenced to ensure high quality and authenticity. This attention to detail guarantees that the dataset is not only extensive but also reliable, serving as a solid foundation for machine learning applications. The data was collected from various open datasets for my personal ML projects, and I am sharing it here with the community.

    Use Case Versatility: Researchers and practitioners across a spectrum of domains can harness this dataset for a myriad of applications. It facilitates the training and evaluation of machine translation models, empowers cross-lingual sentiment analysis, aids in linguistic typology studies, and supports cultural and sociolinguistic investigations.

    Machine Learning Advancement: Machine translation models, especially neural machine translation (NMT) systems, can leverage this dataset to enhance their training. The large-scale nature of the dataset allows for more robust and contextually accurate translation outputs.

    Fine-tuning and Customization: Developers can fine-tune translation models using specific language pairs, offering a powerful tool for specialized translation tasks. This customization capability ensures that the dataset is adaptable to various industries and use cases.

    Data Format: The dataset is provided in a structured JSON format, facilitating easy integration into existing machine learning pipelines. This structured approach expedites research and experimentation. Each record contains an English word and its equivalent translation as a single JSON record. The data was exported from a MongoDB database to ensure the uniqueness of the records; each record is unique, and the records are sorted.
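    A minimal loading sketch, assuming the records are stored one JSON object per line (a common MongoDB export layout); the file name and the field names are hypothetical placeholders:

```python
# Hypothetical sketch: stream translation records one JSON object per line
# so the 41 GB file never has to fit in memory. The file name and record
# keys are placeholders; inspect a few real records for the actual fields.
import json

def iter_records(path):
    """Yield one translation record per line of a JSON-lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

for i, record in enumerate(iter_records("translations.jsonl")):
    print(record)   # e.g. {"english": "...", "language": "...", "translation": "..."}
    if i >= 4:      # peek at the first few records only
        break
```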

    Access: The dataset is available for academic and research purposes, enabling the global AI community to contribute to and benefit from its usage. A well-documented API and sample code are provided to expedite exploration and integration.

    The English-to-548-languages translation dataset represents an incredible leap forward in advancing multilingual communication, breaking down barriers to understanding, and fostering collaboration on a global scale. It holds the potential to reshape how we approach cross-lingual communication, linguistic studies, and the development of cutting-edge translation technologies.

    Dataset Composition: The dataset is a culmination of translations from English, a widely spoken and understood language, into 548 distinct languages. Each language represents a unique linguistic and cultural background, providing a rich array of translation contexts. This diverse range of languages spans across various language families, regions, and linguistic complexities, making the dataset a comprehensive repository for linguistic research.

    Data Volume and Scale: With a staggering 785 million records, the dataset boasts an immense scale that captures a vast array of translations and linguistic nuances. Each translation entry consists of an English source text paired with its corresponding translation in one of the 548 target languages. This vast corpus allows researchers and practitioners to explore patterns, trends, and variations across languages, enabling the development of robust and adaptable translation models.

    Linguistic Coverage: The dataset covers an extensive set of languages, including but not limited to Indo-European, Afroasiatic, Sino-Tibetan, Austronesian, Niger-Congo, and many more. This broad linguistic coverage ensures that languages with varying levels of grammatical complexity, vocabulary richness, and syntactic structures are included, enhancing the applicability of translation models across diverse linguistic landscapes.

    Dataset Preparation: The translation ...

  7. NFL Football Player Stats

    • kaggle.com
    Updated Dec 8, 2017
    Cite
    zackthoutt (2017). NFL Football Player Stats [Dataset]. https://www.kaggle.com/datasets/zynicide/nfl-football-player-stats/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 8, 2017
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    zackthoutt
    License

    Public Domain Dedication (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    NFL Football Stats

    My family has always been serious about fantasy football. I've managed my own team since elementary school. It's a fun reason to talk with each other on a weekly basis for almost half the year.

    Ever since I was in 8th grade I've dreamed of building an AI that could draft players and choose lineups for me. I started off in Excel and have since worked my way up to more sophisticated machine learning. The one thing that I've been lacking is really good data, which is why I decided to scrape pro-football-reference.com for all recorded NFL player data.

    From what I've been able to determine researching, this is the most complete public source of NFL player stats available online. I scraped every NFL player in their database going back to the 1940s. That's over 25,000 players who have played over 1,000,000 football games.

    The scraper code can be found here. Feel free to use, alter, or contribute to the repository.

    The data was scraped 12/1/17-12/4/17

    Shameless plug

    When I uploaded this dataset back in 2017, I had two people reach out to me who shared my passion for fantasy football and data science. We quickly decided to band together to create machine-learning-generated fantasy football predictions. Our website is https://gridironai.com. Over the last several years, we've worked to add dozens of data sources to our data stream that's collected weekly. Feel free to use this scraper for basic stats, but if you'd like a more complete dataset that's updated every week, check out our site.

    The data is broken into two parts. There is a players table where each player has been assigned an ID and a game stats table that has one entry per game played. These tables can be linked together using the player ID.
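    A minimal joining sketch, assuming the two tables are loaded from CSV files; the file names and the player ID column name are placeholders, not the dataset's exact names:

```python
# Sketch: link the player profiles to the per-game stats on the player ID,
# as described above. "players.csv", "game_stats.csv", and "player_id" are
# placeholder names for illustration.
import pandas as pd

players = pd.read_csv("players.csv")
games = pd.read_csv("game_stats.csv")

# One row per game played, enriched with that player's profile fields.
stats = games.merge(players, on="player_id", how="left")
print(stats.head())
```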

    Player Profile Fields

    • Player ID: The assigned ID for the player.
    • Name: The player's full name.
    • Position: The position the player played abbreviated to two characters. If the player played more than one position, the position field will be a comma-separated list of positions (i.e. "hb,qb").
    • Height: The height of the player in feet and inches. The data format is
  8. SpaceNet 7 Change Detection Chips and Masks

    • kaggle.com
    zip
    Updated Dec 24, 2020
    Cite
    A Merii (2020). SpaceNet 7 Change Detection Chips and Masks [Dataset]. https://www.kaggle.com/datasets/amerii/spacenet-7-change-detection-chips-and-masks
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Dec 24, 2020
    Authors
    A Merii
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Context

    This dataset is based on the original SpaceNet 7 dataset, with a few modifications.

    Content

    The original dataset consisted of Planet satellite imagery mosaics, which include 24 images (one per month) covering ~100 unique geographies. The original dataset comprised over 40,000 square kilometers of imagery and exhaustive polygon labels of building footprints in the imagery, totaling over 10 million individual annotations.

    This dataset builds upon the original dataset, such that each image is segmented into 64 x 64 chips, in order to make it easier to build a model.

    [Image: example chips - https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F4101651%2F66851650dbfb7017f1c5717af16cea3c%2Fchips.png?generation=1607947381793575&alt=media]

    The dataset also captures the changes between the images from each month, such that an image taken in month 1 is compared with the images taken in months 2, 3, ..., 24. This is done by taking the cartesian product of the differences between each image. For more information on how this is done, check out the following notebook.

    The differences between the images are captured in the output mask, and the 2 images being compared are stacked, which means that our input images have dimensions of 64 x 64 x 6 and our output mask has dimensions of 64 x 64 x 1. The input has 6 channels because, as mentioned earlier, it is 2 images stacked together. See the image below for more details:
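    A minimal shape sketch of this stacking step, using random arrays as stand-ins for real chips:

```python
# Shape sketch of the stacking described above: two 64x64 RGB chips from
# different months become one 64x64x6 input, and the change mask is
# 64x64x1. Random arrays stand in for real chips.
import numpy as np

chip_month_a = np.random.rand(64, 64, 3)   # chip from the earlier month
chip_month_b = np.random.rand(64, 64, 3)   # chip from the later month

model_input = np.concatenate([chip_month_a, chip_month_b], axis=-1)
change_mask = np.zeros((64, 64, 1))        # placeholder output mask

print(model_input.shape, change_mask.shape)  # (64, 64, 6) (64, 64, 1)
```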

    [Image: mask differences - https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F4101651%2F9cdcf8481d8d81b6d3fed072cea89586%2Fdifference.png?generation=1607947852597860&alt=media]

    The image above shows the masks for each of the original satellite images and what the difference between the two looks like. For more information on how the original data was explored, check out this notebook.

    Data Structure

    The data is structured as follows:
    chip_dataset
    └── change_detection
    └── fname
    ├── chips
    │ └── year1_month1_year2_month2
    │ └── global_monthly_year1_month1_year2_month2_chip_x###_y###_fname.tif
    └── masks
    └── year1_month1_year2_month2
    └── global_monthly_year1_month1_year2_month2_chip_x###_y###_fname_blank.tif

    The _blank suffix in the mask chip file names indicates whether the mask is a blank mask or not.

    For more information on how the data was structured and augmented check out the following notebook.

    Acknowledgements

    All credit goes to the team at SpaceNet for collecting, annotating, and formatting the original dataset.

  9. Synthetic Financial Datasets For Fraud Detection

    • kaggle.com
    zip
    Updated Apr 3, 2017
    Cite
    Edgar Lopez-Rojas (2017). Synthetic Financial Datasets For Fraud Detection [Dataset]. https://www.kaggle.com/ealaxi/paysim1
    Explore at:
    Available download formats: zip (186385561 bytes)
    Dataset updated
    Apr 3, 2017
    Authors
    Edgar Lopez-Rojas
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context

    There is a lack of publicly available datasets on financial services, especially in the emerging mobile money transactions domain. Financial datasets are important to many researchers, and in particular to us performing research in the domain of fraud detection. Part of the problem is the intrinsically private nature of financial transactions, which leads to no publicly available datasets.

    We present a synthetic dataset generated using the simulator called PaySim as an approach to such a problem. PaySim uses aggregated data from the private dataset to generate a synthetic dataset that resembles the normal operation of transactions and injects malicious behaviour to later evaluate the performance of fraud detection methods.

    Content

    PaySim simulates mobile money transactions based on a sample of real transactions extracted from one month of financial logs from a mobile money service implemented in an African country. The original logs were provided by a multinational company, which is the provider of the mobile financial service currently running in more than 14 countries around the world.

    This synthetic dataset is scaled down to 1/4 of the original dataset, and it was created just for Kaggle.

    Headers

    This is a sample of 1 row with headers explanation:

    1,PAYMENT,1060.31,C429214117,1089.0,28.69,M1591654462,0.0,0.0,0,0

    step - maps a unit of time in the real world. In this case 1 step is 1 hour of time. Total steps 744 (30 days simulation).

    type - CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.

    amount - amount of the transaction in local currency.

    nameOrig - customer who started the transaction

    oldbalanceOrg - initial balance before the transaction

    newbalanceOrig - new balance after the transaction

    nameDest - customer who is the recipient of the transaction

    oldbalanceDest - initial balance of the recipient before the transaction. Note that there is no information for customers whose names start with M (Merchants).

    newbalanceDest - new balance of the recipient after the transaction. Note that there is no information for customers whose names start with M (Merchants).

    isFraud - These are the transactions made by the fraudulent agents inside the simulation. In this specific dataset, the fraudulent behaviour of the agents aims to profit by taking control of customers' accounts, trying to empty the funds by transferring them to another account and then cashing out of the system.

    isFlaggedFraud - The business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200,000 in a single transaction.
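    A minimal loading sketch, using the column names from the header description above; the CSV file name is a placeholder:

```python
# Sketch: load the PaySim transactions and take a first look at the fraud
# labels. Column names follow the header description above; "paysim.csv"
# is a placeholder file name.
import pandas as pd

df = pd.read_csv("paysim.csv")

print(df["type"].value_counts())   # CASH-IN, CASH-OUT, DEBIT, PAYMENT, TRANSFER
print(df["isFraud"].mean())        # fraction of fraudulent transactions

# The fraudulent agents transfer funds out of a victim account and then
# cash out, so their rows are a natural starting point for exploration.
frauds = df[df["isFraud"] == 1]
print(frauds[["step", "type", "amount", "nameOrig", "nameDest"]].head())
```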

    Past Research

    There are 5 similar files that contain the runs of 5 different scenarios. These files are explained in more detail in chapter 7 of my PhD thesis (available here: http://urn.kb.se/resolve?urn=urn:nbn:se:bth-12932).

    We ran PaySim several times using random seeds for 744 steps, representing each hour of one month of real time, which matches the original logs. Each run took around 45 minutes on an Intel i7 processor with 16GB of RAM. The final result of a run contains approximately 24 million financial records divided into the 5 types of categories: CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.

    Acknowledgements

    This work is part of the research project ”Scalable resource-efficient systems for big data analytics” funded by the Knowledge Foundation (grant: 20140032) in Sweden.

    Please refer to this dataset using the following citations:

    PaySim first paper of the simulator:

    E. A. Lopez-Rojas , A. Elmir, and S. Axelsson. "PaySim: A financial mobile money simulator for fraud detection". In: The 28th European Modeling and Simulation Symposium-EMSS, Larnaca, Cyprus. 2016

  10. Unsupervised Learning on Country Data

    • kaggle.com
    zip
    Updated Jun 17, 2020
    Cite
    Rohan kokkula (2020). Unsupervised Learning on Country Data [Dataset]. https://www.kaggle.com/rohan0301/unsupervised-learning-on-country-data
    Explore at:
    Available download formats: zip (5340 bytes)
    Dataset updated
    Jun 17, 2020
    Authors
    Rohan kokkula
    Description

    Clustering the Countries by using Unsupervised Learning for HELP International

    Objective:

    To categorise the countries using socio-economic and health factors that determine the overall development of the country.

    About organization:

    HELP International is an international humanitarian NGO that is committed to fighting poverty and providing the people of backward countries with basic amenities and relief during the time of disasters and natural calamities.

    Problem Statement:

    HELP International has been able to raise around $10 million. Now the CEO of the NGO needs to decide how to use this money strategically and effectively. The CEO has to choose the countries that are in the direst need of aid. Hence, your job as a data scientist is to categorise the countries using some socio-economic and health factors that determine the overall development of the country. Then you need to suggest the countries which the CEO needs to focus on the most.
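    A minimal sketch of the clustering task, assuming the indicators live in a single CSV with a country column; the file name and column handling are assumptions for illustration:

```python
# Sketch of the clustering task: group countries on scaled socio-economic
# and health indicators with k-means. "Country-data.csv" and the "country"
# column name are assumptions for illustration.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.read_csv("Country-data.csv")
features = df.drop(columns=["country"])        # keep only numeric indicators

X = StandardScaler().fit_transform(features)   # indicators use very different scales
df["cluster"] = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X)

# Compare cluster profiles to decide which group is in the direst need of aid.
print(df.groupby("cluster").mean(numeric_only=True))
```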

  11. Taxi Trip Fare Prediction

    • kaggle.com
    Updated Dec 15, 2023
    Cite
    Nagendra Kumar Reddy Syamala (2023). Taxi Trip Fare Prediction [Dataset]. http://doi.org/10.34740/kaggle/dsv/7210622
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 15, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Nagendra Kumar Reddy Syamala
    License

    Public Domain Dedication (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Overview

    BigQuery is Google's fully managed, NoOps, low cost analytics database. With BigQuery you can query terabytes and terabytes of data without having any infrastructure to manage, or needing a database administrator.

    BigQuery Machine Learning (BQML) lets data analysts create, train, evaluate, and predict with machine learning models with minimal coding.

    In this project you will explore millions of New York City yellow taxi cab trips available in a BigQuery Public Dataset. You will create a machine learning model inside of BigQuery to predict the fare of a cab ride given your model inputs, evaluate the performance of your model, and make predictions with it.

    You will perform the following tasks:

    • Query and explore the public taxi cab dataset.
    • Create a training and evaluation dataset to be used for batch prediction.
    • Create a forecasting (linear regression) model in BQML.
    • Evaluate the performance of your machine learning model.

    There are several model types to choose from:

    • Forecasting numeric values, like next month's sales, with Linear Regression (linear_reg).
    • Binary or multiclass classification, like spam or not-spam email, using Logistic Regression (logistic_reg).
    • k-Means Clustering for when you want unsupervised learning for exploration (kmeans).

    Note: There are many additional model types used in Machine Learning (like Neural Networks and decision trees) and available using libraries like TensorFlow. At this time, BQML supports the three listed above. Follow the BQML roadmap for more information.
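    A minimal sketch of that BQML workflow driven from the Python client; the my_dataset.taxi_fare_model and my_dataset.taxi_trips names are placeholders, and the feature columns follow the field list further below:

```python
# Sketch: create and evaluate a linear-regression fare model with BigQuery
# ML from the Python client. The dataset/table names are placeholders; the
# feature columns follow the field list in this description.
from google.cloud import bigquery

client = bigquery.Client()

create_model = """
    CREATE OR REPLACE MODEL `my_dataset.taxi_fare_model`
    OPTIONS (model_type = 'linear_reg', input_label_cols = ['total_fare']) AS
    SELECT trip_duration, distance_traveled, num_of_passengers,
           surge_applied, total_fare
    FROM `my_dataset.taxi_trips`
"""
client.query(create_model).result()   # training runs inside BigQuery

evaluate = "SELECT * FROM ML.EVALUATE(MODEL `my_dataset.taxi_fare_model`)"
for row in client.query(evaluate).result():
    print(dict(row.items()))          # mean_absolute_error, r2_score, etc.
```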

    For reference, we have also released a notebook that is available with this dataset; try exploring it. It uses AutoML foundational models to automatically select important features from the dataset and to perform model selection.

    You can also try spectral clustering algorithms (of course, this is not an unsupervised task, but it is related) to visualize trip fare prices, so that cab drivers can easily identify fares in their respective locations.

    Build a forecasting model that helps cab services (like Uber and Rapido) reach their customers easily and in a short time.

    Dataset:

    • ⏱️ 'trip_duration': How long did the journey last? [in seconds]
    • 🛣️ 'distance_traveled': How far did the taxi travel? [in km]
    • 🧑‍🤝‍🧑 'num_of_passengers': How many passengers were in the taxi?
    • 💵 'fare': What's the base fare for the journey? [in INR]
    • 💲 'tip': How much did the driver receive in tips? [in INR]
    • 🎀 'miscellaneous_fees': Were there any additional charges during the trip? e.g. tolls, convenience fees, GST etc. [in INR]
    • 💰 'total_fare': The grand total for the ride (this is your prediction target!) [in INR]
    • ⚡ 'surge_applied': Was there surge pricing applied? Yes or no?

    IF IT IS USEFUL UPVOTE THE DATASET. THANK YOU!

  12. Short Jokes

    • kaggle.com
    • huggingface.co
    zip
    Updated Feb 6, 2017
    + more versions
    Cite
    Abhinav Moudgil (2017). Short Jokes [Dataset]. https://www.kaggle.com/abhinavmoudgil95/short-jokes
    Explore at:
    Available download formats: zip (10299687 bytes)
    Dataset updated
    Feb 6, 2017
    Authors
    Abhinav Moudgil
    License

    Database Contents License (DbCL) 1.0: http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Context

    Generating humor is a complex task in the domain of machine learning, and it requires the models to understand the deep semantic meaning of a joke in order to generate new ones. Such problems, however, are difficult to solve due to a number of reasons, one of which is the lack of a database that gives an elaborate list of jokes. Thus, a large corpus of over 0.2 million jokes has been collected by scraping several websites containing funny and short jokes.

    Visit my Github repository for more information regarding collection of data and the scripts used.

    Content

    This dataset is in the form of a csv file containing 231,657 jokes. Length of jokes ranges from 10 to 200 characters. Each line in the file contains a unique ID and joke.
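    A minimal loading sketch; the shortjokes.csv file name and the ID/Joke column names are assumptions, so check the actual file first:

```python
# Sketch: load the jokes CSV described above. The file name and column
# names are assumptions for illustration.
import pandas as pd

jokes = pd.read_csv("shortjokes.csv")
print(len(jokes))                           # expected ~231,657 rows
print(jokes["Joke"].str.len().describe())   # lengths roughly 10-200 characters
```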

    Disclaimer

    We have attempted to keep the jokes as clean as possible. Since the data has been collected by scraping websites, it is possible that a few jokes are inappropriate or offensive to some people.

  13. Heart Attack Classification Training Dataset

    • kaggle.com
    Updated Oct 22, 2024
    Cite
    Tim Vasilyev (2024). Heart Attack Classification Training Dataset [Dataset]. https://www.kaggle.com/datasets/thxogg/heart-attack-classification-training-dataset/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 22, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Tim Vasilyev
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Cardiovascular diseases (CVDs) are the leading cause of death globally, encompassing conditions like coronary heart disease, cerebrovascular disease, rheumatic heart disease, and other heart and blood vessel disorders. According to the World Health Organization, 17.9 million people die from CVDs annually. Heart attacks and strokes account for over 80% of these deaths, with one-third occurring before the age of 70. A comprehensive dataset has been created to analyze factors that contribute to heart attacks. This dataset contains 1,319 samples with nine fields: eight input variables and one output variable. The input variables include age, gender (0 for female, 1 for male), heart rate, systolic blood pressure (pressurehight), diastolic blood pressure (pressurelow), blood sugar (glucose), CK-MB (kcm), and Test-Troponin (troponin). The output variable indicates the presence or absence of a heart attack, categorized as either negative (no heart attack) or positive (heart attack).
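    A minimal sketch of the classification task described above; the file name and the exact label column name are assumptions for illustration:

```python
# Sketch: predict the positive/negative heart-attack outcome from the
# eight input fields. "heart_attack.csv" and the "class" column name are
# placeholders; check the real file for the exact names.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

df = pd.read_csv("heart_attack.csv")
X = df.drop(columns=["class"])   # age, gender, heart rate, pressures, glucose, kcm, troponin
y = df["class"]                  # "negative" / "positive"

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```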

  14. Diagnosis of COVID-19 and its clinical spectrum

    • kaggle.com
    zip
    Updated Mar 27, 2020
    Cite
    Einstein Data4u (2020). Diagnosis of COVID-19 and its clinical spectrum [Dataset]. https://www.kaggle.com/einsteindata4u/covid19
    Explore at:
    Available download formats: zip (569726 bytes)
    Dataset updated
    Mar 27, 2020
    Authors
    Einstein Data4u
    Description

    Background

    The World Health Organization (WHO) characterized COVID-19, caused by SARS-CoV-2, as a pandemic on March 11, while the exponential increase in the number of cases risked overwhelming health systems around the world with a demand for ICU beds far above the existing capacity, with regions of Italy being prominent examples.

    Brazil recorded the first case of SARS-CoV-2 on February 26, and the virus transmission evolved from imported cases only, to local and finally community transmission very rapidly, with the federal government declaring nationwide community transmission on March 20.

    Until March 27, the state of São Paulo had recorded 1,223 confirmed cases of COVID-19, with 68 related deaths, while the county of São Paulo, with a population of approximately 12 million people and where Hospital Israelita Albert Einstein is located, had 477 confirmed cases and 30 associated deaths, as of March 23. Both the state and the county of São Paulo decided to establish quarantine and social distancing measures, which will be enforced at least until early April, in an effort to slow the spread of the virus.

    One of the motivations for this challenge is the fact that in the context of an overwhelmed health system with the possible limitation to perform tests for the detection of SARS-CoV-2, testing every case would be impractical and tests results could be delayed even if only a target subpopulation would be tested.

    Dataset

    This dataset contains anonymized data from patients seen at the Hospital Israelita Albert Einstein, at São Paulo, Brazil, and who had samples collected to perform the SARS-CoV-2 RT-PCR and additional laboratory tests during a visit to the hospital.

    All data were anonymized following the best international practices and recommendations. All clinical data were standardized to have a mean of zero and a unit standard deviation.

    Task Details

    TASK 1 • Predict confirmed COVID-19 cases among suspected cases. Based on the results of laboratory tests commonly collected for a suspected COVID-19 case during a visit to the emergency room, would it be possible to predict the test result for SARS-Cov-2 (positive/negative)?

    TASK 2 • Predict admission to general ward, semi-intensive unit or intensive care unit among confirmed COVID-19 cases. Based on the results of laboratory tests commonly collected among confirmed COVID-19 cases during a visit to the emergency room, would it be possible to predict which patients will need to be admitted to a general ward, semi-intensive unit or intensive care unit?

    Expected Submission

    Submit a notebook that implements the full lifecycle of data preparation, model creation and evaluation. Feel free to use this dataset plus any other data you have available. Since this is not a formal competition, you're not submitting a single submission file, but rather your whole approach to building a model.

    Evaluation

    This is not a formal competition, so we won't measure the results strictly against a given validation set using a strict metric. Rather, what we'd like to see is a well-defined process to build a model that can deliver decent results (evaluated by yourself).

    Our team will be looking at:

    1. Model Performance - How well does the model perform on the real data? Can it be generalized over time? Can it be applied to other scenarios? Was it overfit?
    2. Data Preparation - How well was the data analysed prior to feeding it into the model? Are there any useful visualisations? Does the reader learn any new techniques through this submission? A great entry will be informative, thought provoking, and fresh all at the same time.
    3. Documentation - Are your code, notebook, and additional data sources well documented so a reader can understand what you did? Are your sources clearly cited? A high quality analysis should be concise and clear at each step so the rationale is easy to follow and the process is reproducible.

    Questions and More Info

    Additional questions and clarifications can be obtained at data4u@einstein.br

    Answers to most voted questions

    Missing data

    Decision making by health care professionals is a complex process, when physicians see a patient for the first time with an acute complaint (e.g., recent onset of fever and respiratory symptoms) they will take a medical history, perform a physical examination, and will base their decisions on this information. To order or not laboratory tests, and which ones to order, is among these decisions, and there is no standard set of tests that are ordered to every individual or to a specific condition. This will depend on the complaints, the findings on the physical examination, personal medical history (e.g., current and prior diagnosed diseases, medications under use, prior surgeries, vaccination), lifestyle habits (e.g., smoking, alcohol use, exercising), family medical history, and prior exposures (e.g., traveling, occupation). The dataset reflects the complexity of decision making during routine clinical care, as opposed to what happens on a more controlled research setting, and data sparsity is, therefore, expected.

    Variables in addition to laboratory results

    We understand that clinical and exposure data, in addition to the laboratory results, are invaluable information to be added to the models, but at this moment they are not available.

    Additional laboratory variables

    A main objective of this challenge is to develop a generalizable model that could be useful during routine clinical care, and although which laboratory exams are ordered can vary for different individuals, even with the same condition, we aimed at including laboratory tests more commonly ordered during a visit to the emergency room. So, if you found some additional laboratory test that was not included, it is because it was not considered commonly ordered in this situation.

    Our message to all participants

    Hospital Israelita Albert Einstein would like to thank you for all the effort and time dedicated to this challenge, the community interest and the number of contributions have surpassed our expectations, and we are extremely satisfied with the results.

    These have been challenging times, and we believe that promoting information sharing and collaboration will be crucial to gain insights, as fast as possible, that could help to implement measures to diminish the burden of COVID-19.

    The multitude of solutions presented focusing on different aspects of the problem could represent a valuable resource in the evaluation of different strategies to implement predictive models for COVID-19. Besides the data visualization methods employed could make it easier for multidisciplinary teams to collaborate around COVID-19 real-world data.

    Although this was not a competition, we would like to highlight some solutions, based on the community and our review of results.

    Lucas Moda (https://www.kaggle.com/lukmoda/covid-19-optimizing-recall-with-smote) utilized interesting data visualization methods for the interpretability of models. Fellipe Gomes (https://www.kaggle.com/gomes555/task2-covid-19-admission-ac-94-sens-0-92-auc-0-96) used concise descriptions of the data and model results. We saw interesting ideas for visualizing and understanding the data, like the dendrogram used by CaesarLupum (https://www.kaggle.com/caesarlupum/brazil-against-the-advance-of-covid-19). Ossamu (https://www.kaggle.com/ossamum/eda-and-feat-import-recall-0-95-roc-auc-0-61) also sought to evaluate several data resampling techniques, to verify how it can improve the performance of predictive models, which was also done by Kaike Reis (https://www.kaggle.com/kaikewreis/a-second-end-to-end-solution-for-covid-19) . Jairo Freitas & Christian Espinoza (https://www.kaggle.com/jairofreitas/covid-19-influence-of-exams-in-recall-precision) sought to understand the distribution of exams regarding the outcomes of task 2, to support the decisions to be made in the construction of predictive models.

    We thank you all for the feedback on the available data, for helping to show its potential, and for taking on the challenge of dealing with a real data feed. Your efforts leave the feeling that it is possible to build good predictive models in real-life healthcare settings.

  15. 2 million rows of data on homes for sale

    • kaggle.com
    Updated Mar 17, 2021
    Cite
    msorondo (2021). 2 million rows of data on homes for sale [Dataset]. https://www.kaggle.com/msorondo/argentina-venta-de-propiedades/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 17, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    msorondo
    Description

    The description in Spanish is on the original page. The data in this dataset was collected by Properati.

    Context

    One of the best applications of data science and machine learning in general is the real estate business. This data set provides data for those who want to make data analysis and use of machine learning models to perform multiple tasks and generate new insights.

    Content

    It consists of a .csv where each row contains a listing. The .csv contains no missing data, which means it is almost ready for use in model training. The only thing necessary is to convert the string-typed data into numerical data.
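    A minimal preprocessing sketch for that string-to-numeric step, using the column names listed below; the file name is a placeholder and dropping the date and free-text columns is just one simple choice:

```python
# Sketch: convert the string-typed columns to numeric features. Column
# names follow the list below; "ar_properties.csv" is a placeholder.
import pandas as pd

df = pd.read_csv("ar_properties.csv")

# Drop free-text and date columns, then one-hot encode the categoricals.
df = df.drop(columns=["title", "description", "start_date", "end_date", "created_on"])
df = pd.get_dummies(df, columns=["l2", "l3", "property_type", "ad_type", "operation_type"])

print(df.dtypes.value_counts())   # everything should now be numeric/boolean
```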

    Columns

    id - Listing identifier. It is not unique: if the listing is updated by the real estate agency (a new version of the listing), a new record is created with the same id but different registration and cancellation dates.

    operation_type - Type of operation (these are all sales, can be removed).

    l2 - Administrative level 2: usually province

    l3 - Administrative level 3: usually city

    lat - Latitude.

    lon - Longitude.

    price - Price published in the ad.

    property_type - Type of property (House, Apartment, PH).

    rooms - Number of rooms (useful in Argentina).

    bathrooms - Number of bathrooms.

    start_date - Date when the ad was created.

    end_date - Date of termination of the advertisement.

    created_on - Date when the first version of the notice was created.

    surface_total - Total area in m².

    surface_covered - Covered area in m².

    title - Title of the advertisement.

    description - Description of the advertisement.

    ad_type - Type of ad (Property, Development/Project).

    Acknowledgements

    The data in this dataset was collected by Properati.

  16. IBRD Statement Of Income FY2013

    • kaggle.com
    zip
    Updated Apr 9, 2019
    + more versions
    Cite
    World Bank (2019). IBRD Statement Of Income FY2013 [Dataset]. https://www.kaggle.com/theworldbank/ibrd-statement-of-income-fy2013
    Explore at:
    Available download formats: zip (3239 bytes)
    Dataset updated
    Apr 9, 2019
    Dataset authored and provided by
    World Bank (http://worldbank.org/)
    License

    Public Domain Dedication (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Content

    Provides data from the IBRD Statement of Income for the fiscal years ended June 30, 2013, June 30, 2012 and June 30, 2011. The values are expressed in millions of U.S. Dollars. Where applicable, changes have been made to certain line items on FY 2012 income statement to conform with the current year's presentation, but the comparable prior years' data sets have not been adjusted to reflect the reclassification impact of those changes.

    Context

    This is a dataset hosted by the World Bank. The organization has an open data platform found here, and they update their information according to the amount of data that is brought in. Explore the World Bank's Financial Data using Kaggle and all of the data sources available through the World Bank organization page!

    • Update Frequency: This dataset is updated daily.

    Acknowledgements

    This dataset is maintained using Socrata's API and Kaggle's API. Socrata has assisted countless organizations with hosting their open data and has been an integral part of the process of bringing more data to the public.

    This dataset is distributed under a Creative Commons Attribution 3.0 IGO license.

    Cover photo by Matt Artz on Unsplash
    Unsplash Images are distributed under a unique Unsplash License.


  17. IBRD Balance Sheet FY2012

    • kaggle.com
    zip
    Updated Apr 9, 2019
    + more versions
    Cite
    World Bank (2019). IBRD Balance Sheet FY2012 [Dataset]. https://www.kaggle.com/theworldbank/ibrd-balance-sheet-fy2012
    Explore at:
    Available download formats: zip (3856 bytes)
    Dataset updated
    Apr 9, 2019
    Dataset authored and provided by
    World Bank
    License

    Public Domain Dedication (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Content

    Provides data from the IBRD Balance Sheet for the fiscal years ended June 30, 2012 and June 30, 2011. The values are expressed in millions of U.S. Dollars. Where applicable, changes have been made to certain line items on the June 30, 2011 balance sheet to conform with the current year's presentation, but the comparable prior years' data sets have not been adjusted to reflect the reclassification impact of those changes.

    Context

    This is a dataset hosted by the World Bank. The organization has an open data platform found here, and they update their information according to the amount of data that is brought in. Explore the World Bank's Financial Data using Kaggle and all of the data sources available through the World Bank organization page!

    • Update Frequency: This dataset is updated daily.

    Acknowledgements

    This dataset is maintained using Socrata's API and Kaggle's API. Socrata has assisted countless organizations with hosting their open data and has been an integral part of the process of bringing more data to the public.

    This dataset is distributed under a Creative Commons Attribution 3.0 IGO license.

    Cover photo by rawpixel on Unsplash
    Unsplash Images are distributed under a unique Unsplash License.


  18. XarpAi Lung Opacity Detector - Desktop App

    • kaggle.com
    Updated Apr 12, 2023
    Cite
    vbookshelf (2023). XarpAi Lung Opacity Detector - Desktop App [Dataset]. https://www.kaggle.com/datasets/vbookshelf/xarpai-lung-opacity-detector/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 12, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    vbookshelf
    Description

    Demo

    [Demo animation: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F1086574%2Feee5467d2935e8c3aaa9af5d58e6bbf4%2Fxarpai-opacity.gif?generation=1681225192483507&alt=media]

    Demo showing what happens after a user submits three images

    Summary

    The XarpAi Lung Opacity Detector is a proof-of-concept for a free, open-source desktop app that uses artificial intelligence to detect opacities on chest x-rays.

    Opacities are characteristic signs of lung diseases like TB and Pneumonia. This app analyses chest x-rays and draws bounding boxes around opacities. Radiologists can then review these areas of interest and use their clinical judgement to make a final diagnosis.

    There’s a shortage of radiologists in poor countries. In 2015, Rwanda had 11 radiologists to serve its population of 12 million people. Liberia, with a population of four million, had two practising radiologists. This app provides high volume diagnosis support. It can help overwhelmed radiologists triage x-rays and speed up their workflows.

    The predictions are made by a Pytorch Faster R-CNN model. The model was fine tuned on data from four chest x-ray datasets:

    • The TBX11K Tuberculosis dataset
    • The Kaggle RSNA Pneumonia Detection Challenge
    • The Kaggle VinBigData Chest X-ray Abnormalities Detection competition
    • The Kaggle SIIM-FISABIO-RSNA COVID-19 Detection competition

    Although the app displays opacity bounding boxes, the model was also trained to detect lungs i.e. it predicts a bounding box that surrounds both lungs. If the model fails to detect the lungs then the app outputs an error message.
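    For orientation, a generic torchvision Faster R-CNN inference sketch (not the app's actual code); the checkpoint path, the class count, and the score threshold are placeholders:

```python
# Generic Faster R-CNN inference sketch. The checkpoint path, the class
# count (background + lungs + opacity is assumed), and the threshold are
# placeholders; the app's real pipeline lives in the project repo.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None, num_classes=3)
model.load_state_dict(torch.load("detector.pth", map_location="cpu"))
model.eval()

image = to_tensor(Image.open("chest_xray.png").convert("RGB"))
with torch.no_grad():
    prediction = model([image])[0]    # dict with "boxes", "labels", "scores"

keep = prediction["scores"] > 0.5
print(prediction["boxes"][keep], prediction["labels"][keep])
# If no lung box is among the detections, the app reports an error instead.
```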

    The model was validated on an 80/20 train test split. It was also tested on three out of sample datasets:

    • The Shenzhen and Montgomery Tuberculosis datasets
    • The DA and DB Tuberculosis datasets
    • Child Chest X-Ray Pneumonia dataset

    These out of sample datasets don’t have annotated opacity bounding boxes. Therefore, accuracy was used as a rough metric - if the target was positive (e.g. positive for TB) and the model predicted a bounding box, the model was deemed to have made a correct prediction. This validation approach is not rigorous. But it’s a quick and simple way to get a feel for the model’s capability.

    Results on the 20% validation data:

    • mAP@0.5: 0.776
    • accuracy: 0.91

    Accuracy on out of sample datasets:

    • Shenzhen and Montgomery TB datasets: 0.84
    • DA and DB TB datasets: 0.85
    • Child Chest X-Ray Pneumonia dataset: 0.83

    Chest x-rays can be difficult for humans to read. One study (TBX11k paper) found that radiologists have a 68.7% accuracy when diagnosing TB on chest x-rays. Using that number for context, the model’s test results look very good. The good performance on the child pneumonia data is surprising because the training data didn’t include a large number of child x-rays.

    These results show that this opacity detection app could be helpful when diagnosing lung diseases like TB and Pneumonia.


    The complete project folder, including the trained models, is stored in this Kaggle dataset.

    For a full project description please refer to the GitHub repo: https://github.com/vbookshelf/XarpAi-Lung-Opacity-Detector

    For info on model training and validation, please refer to the model card. I've included a summary of the datasets, confusion matrices and classification reports. https://github.com/vbookshelf/XarpAi-Lung-Opacity-Detector/blob/main/xarpai-lung-opacity-detector-v1.0/static/assets/Model-Card-and-App-Info-v1.0.pdf

    How to run this app

    I suggest that you download the project folder from Kaggle instead of from the GitHub repo. This is because the project folder on Kaggle includes the trained model. The project folder in the GitHub repo does not include the trained model because GitHub does not allow files larger than 25MB to be uploaded.


    System Requirements

    You'll need about 1.5GB of free disk space. Other than that there are no special system requirements. This app will run on a CPU. I have an old 2014 Macbook Pro laptop with 8GB of RAM. This app runs on it without any issues.


    Overview

    This is a standard flask app. The steps to set up and run the app are the same for both Mac and Windows.

    1. Download the project folder.
    2. Use the command line to pip install the requirements listed in the requirements.txt file. (It’s located inside the project folder.)
    3. Run the app.py file from the command line.
    4. Copy the URL that gets printed in the console.
    5. Paste that URL into your Chrome browser and press Enter. The app will open in the browser.
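
    For reference, the sketch below shows the Flask run pattern behind steps 3 to 5. The route and page content are hypothetical placeholders; the real app.py loads the detection model and defines more routes.

    from flask import Flask, render_template_string

    app = Flask(__name__)

    @app.route("/")
    def index():
        return render_template_string("<h1>XarpAi Lung Opacity Detector</h1>")

    if __name__ == "__main__":
        # Prints a local URL such as http://127.0.0.1:5000 to the console;
        # paste it into the browser to open the app.
        app.run(host="127.0.0.1", port=5000, debug=False)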

    This app is based on Flask and PyTorch, both of which are straightforward to install with pip. If you encounter any errors during installation you should be able to solve them quite easily, and you won’t have to deal with the package dependency issues that often come up with TensorFlow.


    Detailed setup instructions

    The instructions below are for...

  19. Dota 2 Matches

    • kaggle.com
    zip
    Updated Oct 24, 2016
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Devin Anzelmo (2016). Dota 2 Matches [Dataset]. https://www.kaggle.com/datasets/devinanzelmo/dota-2-matches/versions/1
    Explore at:
    zip(0 bytes)Available download formats
    Dataset updated
    Oct 24, 2016
    Authors
    Devin Anzelmo
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Overview

    This dataset contains 50,000 ranked ladder matches from the Dota 2 data dump created by OpenDota. It was inspired by the Dota 2 Matches data published here by Joe Ramir. This is an updated and improved version of that dataset. I have kept the same image and a similar title.

    Dota 2 is a popular free-to-play MOBA that can take up thousands of hours of your life. Roughly as many matches as this dataset contains are played every hour or so. If you like the data, an additional 2-3 million matches are easily available for download.

    The aim of this dataset is to enable the exploration of player behavior, skill estimation, or anything else you find interesting. The intent is to create an accessible and easy-to-use resource that can be expanded and modified if needed. As such I am open to a wide variety of suggestions about what additions or changes to make.

    What's Currently Available

    See https://github.com/odota/core/wiki/JSON-Data-Dump for documentation on the data. I have found a few undocumented areas in the data, including the objectives information. player_slot can be used to combine most of the data, and it is available in most of the tables. Additionally, all tables include match_id, and some have account_id, to make it easier to look at an individual player's matches. match_id and account_id have been re-encoded to save a little space; I can upload tables to allow conversion if needed. I plan to add a small amount of information soon, including outcomes for an additional 50k-100k matches that occurred after the ones currently uploaded, and some tables for determining which continent or region each match was played in. A minimal example of joining the tables is shown after the table descriptions below.

    • matches: Contains top-level information about each match. See https://wiki.teamfortress.com/wiki/WebAPI/GetMatchDetails#Tower_Status for interpreting tower and barracks status. The cluster column can link matches to a geographic region.

    • players: Individual players are identified by account_id, but there is an option to play anonymously and roughly one third of the account_ids are not available; anonymous users have an account_id of 0. Contains totals for kills, deaths, denies, etc. Player action counts are available and are indicated by variable names beginning with unit_order_. Counts of the reasons for acquiring or losing gold, and for gaining experience, have the prefixes gold_ and xp_.

    • player_time: Contains last hits, experience, and gold sampled at one-minute intervals for all players in all matches. The column names indicate the player_slot; for instance, xp_t_1 contains the experience sums for the player in slot one.

    • teamfights: Start and stop time of teamfights, as well as last death time. Teamfights appear to be all battles with three or more deaths. As such this does not include all battles for the entire match.

    • teamfights_players: Additional information provided for each player in each teamfight. player_slot can be used to link this back to players.csv.

    • objectives: Gives information on all the objectives completed, by which player and at what time.

    • chat: All chat for the 50k matches. There is plenty of profanity and good-natured trolling.
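
    As a minimal illustration of how the tables link together, the pandas sketch below joins players.csv to matches.csv. It assumes the CSVs are in the working directory and uses only column names mentioned in the descriptions above.

    import pandas as pd

    matches = pd.read_csv("matches.csv")  # one row per match, keyed by match_id
    players = pd.read_csv("players.csv")  # ten rows per match, keyed by match_id + player_slot

    # Join per-player rows to their match so a match-level field (cluster) is available.
    df = players.merge(matches[["match_id", "cluster"]], on="match_id", how="left")

    # Average per-player totals for non-anonymous accounts (account_id == 0 means anonymous).
    per_account = (df[df["account_id"] != 0]
                   .groupby("account_id")[["kills", "deaths", "denies"]]
                   .mean())
    print(per_account.head())

    # Average kills per region cluster.
    print(df.groupby("cluster")["kills"].mean().head())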

    Past Research

    There have been some efforts to establish indicators of skillful play based on specific parts of gameplay. OpenDota has many statistics and some analysis of specific benchmarks at different times in the game. Dotabuff also has a lot of information, but I have not explored it deeply. This is an area where more information could be gathered.

    Some possible directions of investigation

    Insight from domain experts would also be useful to help clarify which problems are interesting to work on. Some initial task ideas:

    • Predict match outcomes based on aggregates for individual players, using only account_id as prior information
    • Add hero_id to this and see if there is a difference in performance
    • Estimate player skill based on a sample of in-game play (this might need an external MMR source or a different definition of skill)
    • Create improved indicators of skillful play based on game actions, to help players target areas for improvement

    All of these areas have been worked on, but I am not aware of the most up-to-date research on Dota 2 gameplay.

    I plan on setting up several different predictive tasks in the upcoming weeks. One will be a test set of an additional 50 to 100 thousand matches with just hero_id and account_id included, along with the outcome of each match.

    The current dataset seems pretty small for modeling individual players. For the moment, I would prefer to have a wide range of features rather than a larger dataset.

    Dataset idea for anyone interested in creating their own Dota 2 dataset. It would be useful to have a few full matches avai...

