Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Labeled datasets are useful in machine learning research.
This public dataset contains approximately 9 million URLs and metadata for images that have been annotated with labels spanning more than 6,000 categories.
Tables: 1) annotations_bbox 2) dict 3) images 4) labels
Update Frequency: Quarterly
Fork this kernel to get started.
https://bigquery.cloud.google.com/dataset/bigquery-public-data:open_images
https://cloud.google.com/bigquery/public-data/openimages
APA-style citation: Google Research (2016). The Open Images dataset [Image URLs and labels]. Available from GitHub: https://github.com/openimages/dataset.
Use: The annotations are licensed by Google Inc. under CC BY 4.0 license.
The images referenced in the dataset are listed as having a CC BY 2.0 license. Note: while we tried to identify images that are licensed under a Creative Commons Attribution license, we make no representations or warranties regarding the license status of each image and you should verify the license for each image yourself.
Banner Photo by Mattias Diesel from Unsplash.
Which labels are in the dataset? Which labels have "bus" in their display names? How many images of a trolleybus are in the dataset? What are some landing pages of images with a trolleybus? Which images with cherries are in the training set?
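As a rough starting point, the first two questions above can be answered with a short query through the BigQuery Python client. The sketch below is a minimal example; the dict table and its column names (label_name, label_display_name) are assumptions based on the table list above, so verify them against the dataset schema before use.

# Sketch: list Open Images label display names containing "bus".
# Table and column names are assumptions; verify the schema in BigQuery first.
from google.cloud import bigquery

client = bigquery.Client()  # requires a GCP project with BigQuery enabled
query = """
    SELECT label_name, label_display_name
    FROM `bigquery-public-data.open_images.dict`
    WHERE LOWER(label_display_name) LIKE '%bus%'
    LIMIT 20
"""
for row in client.query(query).result():
    print(row.label_name, row.label_display_name)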
Diabetic retinopathy is the leading cause of blindness in the working-age population of the developed world. It is estimated to affect over 93 million people.
The US Centers for Disease Control and Prevention estimates that 29.1 million people in the US have diabetes, and the World Health Organization estimates that 347 million people have the disease worldwide. Diabetic Retinopathy (DR) is an eye disease associated with long-standing diabetes. Around 40% to 45% of Americans with diabetes have some stage of the disease. Progression to vision impairment can be slowed or averted if DR is detected in time; however, this can be difficult because the disease often shows few symptoms until it is too late to provide effective treatment.
Currently, detecting DR is a time-consuming and manual process that requires a trained clinician to examine and evaluate digital color fundus photographs of the retina. By the time human readers submit their reviews, often a day or two later, the delayed results lead to lost follow up, miscommunication, and delayed treatment.
Clinicians can identify DR by the presence of lesions associated with the vascular abnormalities caused by the disease. While this approach is effective, its resource demands are high. The expertise and equipment required are often lacking in areas where the rate of diabetes in local populations is high and DR detection is most needed. As the number of individuals with diabetes continues to grow, the infrastructure needed to prevent blindness due to DR will become even more insufficient.
The need for a comprehensive and automated method of DR screening has long been recognized, and previous efforts have made good progress using image classification, pattern recognition, and machine learning. With color fundus photography as input, the goal of this competition is to push an automated detection system to the limit of what is possible – ideally resulting in models with realistic clinical potential. The winning models will be open sourced to maximize the impact such a model can have on improving DR detection.
Acknowledgements This competition is sponsored by the California Healthcare Foundation.
Retinal images were provided by EyePACS, a free platform for retinopathy screening.
https://creativecommons.org/publicdomain/zero/1.0/
NYC Open Data is an opportunity to engage New Yorkers in the information that is produced and used by City government. We believe that every New Yorker can benefit from Open Data, and Open Data can benefit from every New Yorker. Source: https://opendata.cityofnewyork.us/overview/
Thanks to NYC Open Data, which makes public data generated by city agencies available for public use, and Citi Bike, we've incorporated over 150 GB of data in 5 open datasets into Google BigQuery Public Datasets, including:
Over 8 million 311 service requests from 2012-2016
More than 1 million motor vehicle collisions 2012-present
Citi Bike stations and 30 million Citi Bike trips 2013-present
Over 1 billion Yellow and Green Taxi rides from 2009-present
Over 500,000 sidewalk trees surveyed decennially in 1995, 2005, and 2015
This dataset is deprecated and not being updated.
Fork this kernel to get started with this dataset.
https://opendata.cityofnewyork.us/
This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - https://data.cityofnewyork.us/ - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
By accessing datasets and feeds available through NYC Open Data, the user agrees to all of the Terms of Use of NYC.gov as well as the Privacy Policy for NYC.gov. The user also agrees to any additional terms of use defined by the agencies, bureaus, and offices providing data. Public data sets made available on NYC Open Data are provided for informational purposes. The City does not warranty the completeness, accuracy, content, or fitness for any particular purpose or use of any public data set made available on NYC Open Data, nor are any such warranties to be implied or inferred with respect to the public data sets furnished therein.
The City is not liable for any deficiencies in the completeness, accuracy, content, or fitness for any particular purpose or use of any public data set, or application utilizing such data set, provided by any third party.
Banner Photo by @bicadmedia from Unsplash.
On which New York City streets are you most likely to find a loud party?
Can you find the Virginia Pines in New York City?
Where was the only collision caused by an animal that injured a cyclist?
What’s the Citi Bike record for the Longest Distance in the Shortest Time (on a route with at least 100 rides)?
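As a hedged illustration of how the first question above could be approached with the BigQuery Python client: the table and column names below (311_service_requests, complaint_type, street_name) are assumptions and should be checked against the public dataset before use.

# Sketch: rank streets by noise/party complaints in the NYC 311 data.
# Table and column names are assumptions; verify them in BigQuery first.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT street_name, COUNT(*) AS complaints
    FROM `bigquery-public-data.new_york.311_service_requests`
    WHERE complaint_type LIKE '%Noise%'
    GROUP BY street_name
    ORDER BY complaints DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.street_name, row.complaints)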
Image: https://cloud.google.com/blog/big-data/2017/01/images/148467900588042/nyc-dataset-6.png
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the second version of the Google Landmarks dataset (GLDv2), which contains images annotated with labels representing human-made and natural landmarks. The dataset can be used for landmark recognition and retrieval experiments. This version of the dataset contains approximately 5 million images, split into 3 sets of images: train, index and test. The dataset was presented in our CVPR'20 paper. In this repository, we present download links for all dataset files and relevant code for metric computation. This dataset was associated to two Kaggle challenges, on landmark recognition and landmark retrieval. Results were discussed as part of a CVPR'19 workshop. In this repository, we also provide scores for the top 10 teams in the challenges, based on the latest ground-truth version. Please visit the challenge and workshop webpages for more details on the data, tasks and technical solutions from top teams.
GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008.
This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.
You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that the methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]. Fork this kernel to get started and learn how to safely analyze large BigQuery datasets.
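For example, a minimal sketch of querying one of these tables from a Kernel with the BigQuery Python client is shown below; the licenses table and its columns are assumptions, so check the dataset schema before relying on them.

# Sketch: count repositories per license in the github_repos public dataset.
# Table and column names are assumptions; confirm them in the BigQuery UI.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT license, COUNT(*) AS repo_count
    FROM `bigquery-public-data.github_repos.licenses`
    GROUP BY license
    ORDER BY repo_count DESC
"""
df = client.query(query).to_dataframe()  # keep result sets small in Kernels
print(df.head())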
This dataset was made available per GitHub's terms of service. This dataset is available via Google Cloud Platform's Marketplace, GitHub Activity Data, as part of GCP Public Datasets.
http://www.gnu.org/licenses/lgpl-3.0.html
Our groundbreaking translation dataset represents a monumental advancement in the field of natural language processing and machine translation. Comprising a staggering 785 million records, this corpus bridges language barriers by offering translations from English to an astonishing 548 languages. The dataset promises to be a cornerstone resource for researchers, engineers, and developers seeking to enhance their machine translation models, cross-lingual analysis, and linguistic investigations.
Dataset size: 41 GB uncompressed, 20 GB compressed.
Key Features:
Scope and Scale: With a comprehensive collection of 785 million records, this dataset provides an unparalleled wealth of translated text. Each record consists of an English sentence paired with its translation in one of the 548 target languages, enabling multi-directional translation applications.
Language Diversity: Encompassing translations into 548 languages, this dataset represents a diverse array of linguistic families, dialects, and scripts. From widely spoken languages to those with limited digital representation, the dataset bridges communication gaps on a global scale.
Quality and Authenticity: The translations have been meticulously curated, verified, and cross-referenced to ensure high quality and authenticity. This attention to detail guarantees that the dataset is not only extensive but also reliable, serving as a solid foundation for machine learning applications. The data was collected from various open datasets for my personal ML projects, and I am sharing it here for others to use.
Use Case Versatility: Researchers and practitioners across a spectrum of domains can harness this dataset for a myriad of applications. It facilitates the training and evaluation of machine translation models, empowers cross-lingual sentiment analysis, aids in linguistic typology studies, and supports cultural and sociolinguistic investigations.
Machine Learning Advancement: Machine translation models, especially neural machine translation (NMT) systems, can leverage this dataset to enhance their training. The large-scale nature of the dataset allows for more robust and contextually accurate translation outputs.
Fine-tuning and Customization: Developers can fine-tune translation models using specific language pairs, offering a powerful tool for specialized translation tasks. This customization capability ensures that the dataset is adaptable to various industries and use cases.
Data Format: The dataset is provided in a structured JSON format, facilitating easy integration into existing machine learning pipelines. Each record contains an English word or sentence and its equivalent in the target language. The data was exported from a MongoDB database to ensure the uniqueness of the records; each record is unique, and the records are sorted.
Access: The dataset is available for academic and research purposes, enabling the global AI community to contribute to and benefit from its usage. A well-documented API and sample code are provided to expedite exploration and integration.
The English-to-548-languages translation dataset represents an incredible leap forward in advancing multilingual communication, breaking down barriers to understanding, and fostering collaboration on a global scale. It holds the potential to reshape how we approach cross-lingual communication, linguistic studies, and the development of cutting-edge translation technologies.
Dataset Composition: The dataset is a culmination of translations from English, a widely spoken and understood language, into 548 distinct languages. Each language represents a unique linguistic and cultural background, providing a rich array of translation contexts. This diverse range of languages spans across various language families, regions, and linguistic complexities, making the dataset a comprehensive repository for linguistic research.
Data Volume and Scale: With a staggering 785 million records, the dataset boasts an immense scale that captures a vast array of translations and linguistic nuances. Each translation entry consists of an English source text paired with its corresponding translation in one of the 548 target languages. This vast corpus allows researchers and practitioners to explore patterns, trends, and variations across languages, enabling the development of robust and adaptable translation models.
Linguistic Coverage: The dataset covers an extensive set of languages, including but not limited to Indo-European, Afroasiatic, Sino-Tibetan, Austronesian, Niger-Congo, and many more. This broad linguistic coverage ensures that languages with varying levels of grammatical complexity, vocabulary richness, and syntactic structures are included, enhancing the applicability of translation models across diverse linguistic landscapes.
Dataset Preparation: The translation ...
https://creativecommons.org/publicdomain/zero/1.0/
My family has always been serious about fantasy football. I've managed my own team since elementary school. It's a fun reason to talk with each other on a weekly basis for almost half the year.
Ever since I was in 8th grade I've dreamed of building an AI that could draft players and choose lineups for me. I started off in Excel and have since worked my way up to more sophisticated machine learning. The one thing that I've been lacking is really good data, which is why I decided to scrape pro-football-reference.com for all recorded NFL player data.
From what I've been able to determine through research, this is the most complete public source of NFL player stats available online. I scraped every NFL player in their database going back to the 1940s. That's over 25,000 players who have played over 1,000,000 football games.
The scraper code can be found here. Feel free to use, alter, or contribute to the repository.
The data was scraped from 12/1/17 to 12/4/17.
When I uploaded this dataset back in 2017, I had two people reach out to me who shared my passion for fantasy football and data science. We quickly decided to band together to create machine-learning-generated fantasy football predictions. Our website is https://gridironai.com. Over the last several years, we've worked to add dozens of data sources to our data stream that's collected weekly. Feel free to use this scraper for basic stats, but if you'd like a more complete dataset that's updated every week, check out our site.
The data is broken into two parts. There is a players table where each player has been assigned an ID and a game stats table that has one entry per game played. These tables can be linked together using the player ID.
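A minimal sketch of that join with pandas is shown below; the file names and the name of the shared ID column (player_id) are assumptions, so adjust them to the actual files in the dataset.

# Sketch: join per-game stats to player metadata on the shared player ID.
# File and column names are assumptions; adjust them to the dataset's files.
import pandas as pd

players = pd.read_csv("players.csv")    # one row per player, with an ID column
games = pd.read_csv("game_stats.csv")   # one row per game played
merged = games.merge(players, on="player_id", how="left")
print(merged.head())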
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset is based on the original SpaceNet 7 dataset, with a few modifications.
The original dataset consisted of Planet satellite imagery mosaics, which include 24 images (one per month) covering ~100 unique geographies. The original dataset comprises over 40,000 square kilometers of imagery and exhaustive polygon labels of building footprints in the imagery, totaling over 10 million individual annotations.
This dataset builds upon the original by segmenting each image into 64 x 64 chips, making it easier to train a model on.
Image: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F4101651%2F66851650dbfb7017f1c5717af16cea3c%2Fchips.png?generation=1607947381793575&alt=media
The dataset also captures the changes between the images from each month, such that an image taken in month 1 is compared with the images taken in months 2, 3, ..., 24. This is done by taking the Cartesian product of the image pairs and computing their differences. For more information on how this is done, check out the following notebook.
The differences between the images are captured in the output mask, and the two images being compared are stacked, which means that our input images have dimensions of 64 x 64 x 6 and our output mask has dimensions of 64 x 64 x 1. The input has 6 channels because, as mentioned earlier, it is two 3-channel images stacked together. See the image below for more details:
Image: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F4101651%2F9cdcf8481d8d81b6d3fed072cea89586%2Fdifference.png?generation=1607947852597860&alt=media
The image above shows the masks for each of the original satellite images and what the difference between the two looks like. For more information on how the original data was explored, check out this notebook.
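As a small illustration of that input layout, the sketch below stacks two 3-channel chips from different months into a single 64 x 64 x 6 array; the file names are placeholders, and reading the chips with plain Pillow is an assumption.

# Sketch: stack two 64x64 RGB chips into one 64x64x6 model input.
# File names are placeholders; reading with Pillow is an assumption.
import numpy as np
from PIL import Image

chip_month_1 = np.array(Image.open("chip_month_01.tif"))  # shape (64, 64, 3)
chip_month_2 = np.array(Image.open("chip_month_02.tif"))  # shape (64, 64, 3)
model_input = np.concatenate([chip_month_1, chip_month_2], axis=-1)
print(model_input.shape)  # expected: (64, 64, 6)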
The data is structured as follows:
chip_dataset
└── change_detection
└── fname
├── chips
│ └── year1_month1_year2_month2
│ └── global_monthly_year1_month1_year2_month2_chip_x###_y###_fname.tif
└── masks
└── year1_month1_year2_month2
└── global_monthly_year1_month1_year2_month2_chip_x###_y###_fname_blank.tif
The _blank suffix in the mask file names indicates whether the mask is a blank mask or not.
For more information on how the data was structured and augmented check out the following notebook.
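Given that layout, a minimal sketch for pairing each chip with its mask might look like the following; the root path is a placeholder, and the naming pattern simply follows the tree above.

# Sketch: walk the tree above and pair each chip with its mask file.
# The root path is a placeholder; masks may carry a "_blank" suffix.
from pathlib import Path

root = Path("chip_dataset/change_detection")
pairs = []
for chip_path in root.glob("*/chips/*/*.tif"):
    mask_dir = chip_path.parents[2] / "masks" / chip_path.parent.name
    matches = list(mask_dir.glob(chip_path.stem + "*.tif"))
    if matches:
        pairs.append((chip_path, matches[0]))
print(f"Found {len(pairs)} chip/mask pairs")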
All credit goes to the team at SpaceNet for collecting, annotating, and formatting the original dataset.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
There is a lack of publicly available datasets on financial services, especially in the emerging domain of mobile money transactions. Financial datasets are important to many researchers, and in particular to those of us performing research on fraud detection. Part of the problem is the intrinsically private nature of financial transactions, which leads to a lack of publicly available data.
We present a synthetic dataset generated using the simulator called PaySim as an approach to such a problem. PaySim uses aggregated data from the private dataset to generate a synthetic dataset that resembles the normal operation of transactions and injects malicious behaviour to later evaluate the performance of fraud detection methods.
PaySim simulates mobile money transactions based on a sample of real transactions extracted from one month of financial logs from a mobile money service implemented in an African country. The original logs were provided by a multinational company that provides the mobile financial service, which is currently running in more than 14 countries around the world.
This synthetic dataset is scaled down to 1/4 of the size of the original dataset, and it was created just for Kaggle.
This is a sample of one row, with an explanation of each field below (a short loading sketch follows the list):
1,PAYMENT,1060.31,C429214117,1089.0,28.69,M1591654462,0.0,0.0,0,0
step - maps a unit of time in the real world. In this case 1 step is 1 hour of time. Total steps: 744 (one simulated month of hourly steps).
type - CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.
amount - amount of the transaction in local currency.
nameOrig - customer who started the transaction
oldbalanceOrg - initial balance before the transaction
newbalanceOrig - new balance after the transaction
nameDest - customer who is the recipient of the transaction
oldbalanceDest - initial balance of the recipient before the transaction. Note that there is no information for customers whose names start with M (merchants).
newbalanceDest - new balance of the recipient after the transaction. Note that there is no information for customers whose names start with M (merchants).
isFraud - marks the transactions made by fraudulent agents inside the simulation. In this specific dataset, the fraudulent behaviour aims to profit by taking control of customers' accounts, emptying the funds by transferring them to another account, and then cashing out of the system.
isFlaggedFraud - The business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200,000 in a single transaction.
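A minimal loading sketch using the fields above (the CSV file name is a placeholder):

# Sketch: load the PaySim CSV and inspect the labels described above.
# The file name is a placeholder for the actual dataset file.
import pandas as pd

df = pd.read_csv("paysim.csv")
print(df["type"].value_counts())    # CASH-IN, CASH-OUT, DEBIT, PAYMENT, TRANSFER
print(df["isFraud"].mean())         # fraction of fraudulent transactions
print(df.loc[df["isFlaggedFraud"] == 1, ["step", "type", "amount"]].head())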
There are 5 similar files that contain the runs of 5 different scenarios. These files are explained in more detail in Chapter 7 of my PhD thesis (available here: http://urn.kb.se/resolve?urn=urn:nbn:se:bth-12932).
We ran PaySim several times using random seeds for 744 steps, representing each hour of one month of real time, which matches the original logs. Each run took around 45 minutes on an Intel i7 processor with 16GB of RAM. The final result of a run contains approximately 24 million financial records divided into the 5 transaction types: CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.
This work is part of the research project ”Scalable resource-efficient systems for big data analytics” funded by the Knowledge Foundation (grant: 20140032) in Sweden.
Please refer to this dataset using the following citations:
PaySim first paper of the simulator:
E. A. Lopez-Rojas, A. Elmir, and S. Axelsson. "PaySim: A financial mobile money simulator for fraud detection". In: The 28th European Modeling and Simulation Symposium (EMSS), Larnaca, Cyprus, 2016.
Objective: to categorise countries using socio-economic and health factors that determine the overall development of each country.
HELP International is an international humanitarian NGO that is committed to fighting poverty and providing the people of backward countries with basic amenities and relief during the time of disasters and natural calamities.
HELP International has been able to raise around $10 million. The CEO of the NGO now needs to decide how to use this money strategically and effectively, which means choosing the countries that are in the direst need of aid. Your job as a data scientist is to categorise the countries using socio-economic and health factors that determine each country's overall development, and then to suggest the countries the CEO should focus on the most.
https://creativecommons.org/publicdomain/zero/1.0/
Overview
BigQuery is Google's fully managed, NoOps, low-cost analytics database. With BigQuery you can query terabytes and terabytes of data without having any infrastructure to manage or needing a database administrator.
BigQuery Machine Learning (BQML) lets data analysts create, train, evaluate, and predict with machine learning models using minimal coding.
In this exercise you will explore millions of New York City yellow taxi cab trips available in a BigQuery public dataset. You will create a machine learning model inside BigQuery to predict the fare of a cab ride given your model inputs, evaluate the model's performance, and make predictions with it.
You will perform the following tasks:
- Query and explore the public taxi cab dataset.
- Create a training and evaluation dataset to be used for batch prediction.
- Create a forecasting (linear regression) model in BQML.
- Evaluate the performance of your machine learning model.
There are several model types to choose from:
- Forecasting numeric values, like next month's sales, with linear regression (linear_reg).
- Binary or multiclass classification, like spam or not-spam email, using logistic regression (logistic_reg).
- k-means clustering for unsupervised exploration (kmeans).
Note: There are many additional model types used in Machine Learning (like Neural Networks and decision trees) and available using libraries like TensorFlow. At this time, BQML supports the three listed above. Follow the BQML roadmap for more information.
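As a hedged sketch of what the linear regression step can look like when driven from Python, the snippet below creates and evaluates a BQML model; the dataset, table, and column names are placeholders rather than the lab's actual ones.

# Sketch: create and evaluate a BQML linear regression model from Python.
# Dataset/table/column names are placeholders, not the lab's actual ones.
from google.cloud import bigquery

client = bigquery.Client()
create_model_sql = """
    CREATE OR REPLACE MODEL `my_dataset.taxi_fare_model`
    OPTIONS (model_type = 'linear_reg', input_label_cols = ['fare_amount']) AS
    SELECT fare_amount, passenger_count, trip_distance
    FROM `my_dataset.taxi_training_data`
"""
client.query(create_model_sql).result()  # wait for training to finish
eval_sql = "SELECT * FROM ML.EVALUATE(MODEL `my_dataset.taxi_fare_model`)"
print(client.query(eval_sql).to_dataframe())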
For reference, we have also released a notebook that you can explore. It uses AutoML foundational models to automatically select important features from the dataset and to perform model selection.
You can also try spectral clustering algorithms; of course clustering is an unsupervised approach rather than the forecasting task itself, but it is related, and it can be used to visualize trip fare prices so that cab drivers can easily identify high-fare trips in their respective locations.
Build a forecasting model that helps ride-hailing services like Uber and Rapido reach their customers easily and in a short time.
Dataset fields:
- ⏱️ trip_duration: How long did the journey last? [in seconds]
- 🛣️ distance_traveled: How far did the taxi travel? [in km]
- 🧑🤝🧑 num_of_passengers: How many passengers were in the taxi?
- 💵 fare: What's the base fare for the journey? [in INR]
- 💲 tip: How much did the driver receive in tips? [in INR]
- 🎀 miscellaneous_fees: Were there any additional charges during the trip, e.g. tolls, convenience fees, GST? [in INR]
- 💰 total_fare: The grand total for the ride (this is your prediction target!) [in INR]
- ⚡ surge_applied: Was surge pricing applied? Yes or no?
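As a hedged, non-BigQuery baseline for the same idea, the sketch below fits a simple regression for total_fare using the columns listed above; the CSV file name is a placeholder, and surge_applied is assumed to already be numeric.

# Sketch: baseline regression for total_fare using the fields listed above.
# The CSV file name is a placeholder; surge_applied is assumed to be 0/1.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("taxi_trips.csv")
features = ["trip_duration", "distance_traveled", "num_of_passengers", "surge_applied"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["total_fare"], test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out trips:", model.score(X_test, y_test))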
http://opendatacommons.org/licenses/dbcl/1.0/
Context
Generating humor is a complex task in the domain of machine learning, and it requires the models to understand the deep semantic meaning of a joke in order to generate new ones. Such problems, however, are difficult to solve due to a number of reasons, one of which is the lack of a database that gives an elaborate list of jokes. Thus, a large corpus of over 0.2 million jokes has been collected by scraping several websites containing funny and short jokes.
Visit my Github repository for more information regarding collection of data and the scripts used.
Content
This dataset is in the form of a csv file containing 231,657 jokes. The length of the jokes ranges from 10 to 200 characters. Each line in the file contains a unique ID and a joke.
Disclaimer
An attempt has been made to keep the jokes as clean as possible; however, since the data was collected by scraping websites, a few jokes may be inappropriate or offensive to some people.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Cardiovascular diseases (CVDs) are the leading cause of death globally, encompassing conditions like coronary heart disease, cerebrovascular disease, rheumatic heart disease, and other heart and blood vessel disorders. According to the World Health Organization, 17.9 million people die from CVDs annually. Heart attacks and strokes account for over 80% of these deaths, with one-third occurring before the age of 70. A comprehensive dataset has been created to analyze factors that contribute to heart attacks. This dataset contains 1,319 samples with nine fields: eight input variables and one output variable. The input variables include age, gender (0 for female, 1 for male), heart rate, systolic blood pressure (pressurehight), diastolic blood pressure (pressurelow), blood sugar (glucose), CK-MB (kcm), and Test-Troponin (troponin). The output variable indicates the presence or absence of a heart attack, categorized as either negative (no heart attack) or positive (heart attack).
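A minimal classification sketch on the fields described above is shown below; the CSV file name and the name and encoding of the outcome column ("class", negative/positive) are assumptions, so check them against the actual file header.

# Sketch: baseline classifier for the heart-attack outcome described above.
# File name and the outcome column name/values are assumptions.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("heart_attack.csv")
y = (df["class"] == "positive").astype(int)   # assumed outcome column
X = df.drop(columns=["class"])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
scaler = StandardScaler().fit(X_train)
clf = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)
print("Held-out accuracy:", clf.score(scaler.transform(X_test), y_test))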
The World Health Organization (WHO) characterized COVID-19, caused by SARS-CoV-2, as a pandemic on March 11, while the exponential increase in the number of cases was threatening to overwhelm health systems around the world with a demand for ICU beds far above the existing capacity, with regions of Italy being prominent examples.
Brazil recorded the first case of SARS-CoV-2 on February 26, and the virus transmission evolved from imported cases only, to local and finally community transmission very rapidly, with the federal government declaring nationwide community transmission on March 20.
Until March 27, the state of São Paulo had recorded 1,223 confirmed cases of COVID-19, with 68 related deaths, while the county of São Paulo, with a population of approximately 12 million people and where Hospital Israelita Albert Einstein is located, had 477 confirmed cases and 30 associated deaths, as of March 23. Both the state and the county of São Paulo decided to establish quarantine and social distancing measures, to be enforced at least until early April, in an effort to slow the spread of the virus.
One of the motivations for this challenge is the fact that in the context of an overwhelmed health system with the possible limitation to perform tests for the detection of SARS-CoV-2, testing every case would be impractical and tests results could be delayed even if only a target subpopulation would be tested.
This dataset contains anonymized data from patients seen at the Hospital Israelita Albert Einstein, at São Paulo, Brazil, and who had samples collected to perform the SARS-CoV-2 RT-PCR and additional laboratory tests during a visit to the hospital.
All data were anonymized following the best international practices and recommendations. All clinical data were standardized to have a mean of zero and a unit standard deviation.
TASK 1 • Predict confirmed COVID-19 cases among suspected cases. Based on the results of laboratory tests commonly collected for a suspected COVID-19 case during a visit to the emergency room, would it be possible to predict the test result for SARS-Cov-2 (positive/negative)?
TASK 2 • Predict admission to general ward, semi-intensive unit or intensive care unit among confirmed COVID-19 cases. Based on the results of laboratory tests commonly collected among confirmed COVID-19 cases during a visit to the emergency room, would it be possible to predict which patients will need to be admitted to a general ward, semi-intensive unit or intensive care unit?
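For Task 1, a minimal baseline sketch is shown below. The file name and the target column name are assumptions, and median imputation is used only because, as noted further down, the laboratory columns are sparse.

# Sketch: baseline for Task 1 (predict the SARS-CoV-2 test result from labs).
# File name and target column name are assumptions; the data is sparse, so
# missing values are imputed with a simple median strategy.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

df = pd.read_excel("einstein_covid19.xlsx")          # placeholder file name
target = "SARS-Cov-2 exam result"                    # assumed column name
X = df.select_dtypes("number")                       # numeric lab results only
y = (df[target] == "positive").astype(int)
model = make_pipeline(SimpleImputer(strategy="median"),
                      RandomForestClassifier(n_estimators=200, random_state=0))
print(cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean())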
Submit a notebook that implements the full lifecycle of data preparation, model creation and evaluation. Feel free to use this dataset plus any other data you have available. Since this is not a formal competition, you're not submitting a single submission file, but rather your whole approach to building a model.
This is not a formal competition, so we won't measure the results strictly against a given validation set using a strict metric. Rather, what we'd like to see is a well-defined process to build a model that can deliver decent results (evaluated by yourself).
Our team will be looking at:
1. Model Performance - How well does the model perform on the real data? Can it be generalized over time? Can it be applied to other scenarios? Was it overfit?
2. Data Preparation - How well was the data analysed prior to feeding it into the model? Are there any useful visualisations? Does the reader learn any new techniques through this submission? A great entry will be informative, thought provoking, and fresh all at the same time.
3. Documentation - Are your code, notebook, and additional data sources well documented so a reader can understand what you did? Are your sources clearly cited? A high quality analysis should be concise and clear at each step so the rationale is easy to follow and the process is reproducible.
Additional questions and clarifications can be obtained at data4u@einstein.br
Decision making by health care professionals is a complex process. When physicians see a patient for the first time with an acute complaint (e.g., recent onset of fever and respiratory symptoms), they take a medical history, perform a physical examination, and base their decisions on this information. Whether to order laboratory tests, and which ones to order, is among these decisions, and there is no standard set of tests that is ordered for every individual or for a specific condition. This depends on the complaints, the findings on the physical examination, personal medical history (e.g., current and prior diagnosed diseases, medications in use, prior surgeries, vaccination), lifestyle habits (e.g., smoking, alcohol use, exercising), family medical history, and prior exposures (e.g., traveling, occupation). The dataset reflects the complexity of decision making during routine clinical care, as opposed to what happens in a more controlled research setting, and data sparsity is therefore expected.
We understand that clinical and exposure data, in addition to the laboratory results, are invaluable information to be added to the models, but at this moment they are not available.
A main objective of this challenge is to develop a generalizable model that could be useful during routine clinical care. Although which laboratory exams are ordered can vary for different individuals, even with the same condition, we aimed at including the laboratory tests most commonly ordered during a visit to the emergency room. So, if you find that some laboratory test is not included, it is because it was not considered to be commonly ordered in this situation.
Hospital Israelita Albert Einstein would like to thank you for all the effort and time dedicated to this challenge, the community interest and the number of contributions have surpassed our expectations, and we are extremely satisfied with the results.
These have been challenging times, and we believe that promoting information sharing and collaboration will be crucial to gain insights, as fast as possible, that could help to implement measures to diminish the burden of COVID-19.
The multitude of solutions presented focusing on different aspects of the problem could represent a valuable resource in the evaluation of different strategies to implement predictive models for COVID-19. Besides the data visualization methods employed could make it easier for multidisciplinary teams to collaborate around COVID-19 real-world data.
Although this was not a competition, we would like to highlight some solutions, based on the community and our review of results.
Lucas Moda (https://www.kaggle.com/lukmoda/covid-19-optimizing-recall-with-smote) utilized interesting data visualization methods for the interpretability of models. Fellipe Gomes (https://www.kaggle.com/gomes555/task2-covid-19-admission-ac-94-sens-0-92-auc-0-96) used concise descriptions of the data and model results. We saw interesting ideas for visualizing and understanding the data, like the dendrogram used by CaesarLupum (https://www.kaggle.com/caesarlupum/brazil-against-the-advance-of-covid-19). Ossamu (https://www.kaggle.com/ossamum/eda-and-feat-import-recall-0-95-roc-auc-0-61) also sought to evaluate several data resampling techniques, to verify how it can improve the performance of predictive models, which was also done by Kaike Reis (https://www.kaggle.com/kaikewreis/a-second-end-to-end-solution-for-covid-19) . Jairo Freitas & Christian Espinoza (https://www.kaggle.com/jairofreitas/covid-19-influence-of-exams-in-recall-precision) sought to understand the distribution of exams regarding the outcomes of task 2, to support the decisions to be made in the construction of predictive models.
We thank you all for the feedback on the available data, for helping to show its potential, and for taking on the challenge of dealing with a real data feed. Your efforts leave us with the feeling that it is possible to build good predictive models in real-life healthcare settings.
A description in Spanish is available on the original page. The data in this dataset was collected by Properati.
One of the best applications of data science and machine learning in general is the real estate business. This dataset provides data for those who want to perform data analysis and use machine learning models to carry out multiple tasks and generate new insights.
It consists of a .csv file where each row contains a listing. The .csv contains no missing data, which means it is almost ready for use in model training; the only thing necessary is to convert the string-type columns into numerical data (a short encoding sketch follows the field list below).
id - Notice identifier. It is not unique: if the notification is updated by the real estate agency (new version of the notification) a new record is created with the same id but different dates: registration and cancellation.
operation_type - Type of operation (these are all sales, can be removed).
l2 - Administrative level 2: usually province
l3 - Administrative level 3: usually city
lat - Latitude.
lon - Longitude.
price - Price published in the ad.
property_type - Type of property (House, Apartment, PH).
rooms - Number of rooms (useful in Argentina).
bathrooms - Number of bathrooms.
start_date - Date when the ad was created.
end_date - Date of termination of the advertisement.
created_on - Date when the first version of the notice was created.
surface_total - Total area in m².
surface_covered - Covered area in m².
title - Title of the advertisement.
description - Description of the advertisement.
ad_type - Type of ad (Property, Development/Project).
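A minimal sketch of that string-to-numeric step with pandas, using the columns above (the CSV file name is a placeholder):

# Sketch: one-hot encode the string columns listed above so the table is
# fully numeric and ready for model training. The file name is a placeholder.
import pandas as pd

df = pd.read_csv("properati_listings.csv")
categorical = ["operation_type", "l2", "l3", "property_type", "ad_type"]
encoded = pd.get_dummies(df, columns=categorical, drop_first=True)
# Free-text and date columns are dropped here for a simple numeric baseline.
encoded = encoded.drop(columns=["title", "description", "start_date",
                                "end_date", "created_on", "id"])
print(encoded.dtypes.head(20))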
https://creativecommons.org/publicdomain/zero/1.0/
Provides data from the IBRD Statement of Income for the fiscal years ended June 30, 2013, June 30, 2012 and June 30, 2011. The values are expressed in millions of U.S. Dollars. Where applicable, changes have been made to certain line items on FY 2012 income statement to conform with the current year's presentation, but the comparable prior years' data sets have not been adjusted to reflect the reclassification impact of those changes.
This is a dataset hosted by the World Bank. The organization has an open data platform found here and they update their information according to the amount of data that is brought in. Explore the World Bank's Financial Data using Kaggle and all of the data sources available through the World Bank organization page!
This dataset is maintained using Socrata's API and Kaggle's API. Socrata has assisted countless organizations with hosting their open data and has been an integral part of the process of bringing more data to the public.
This dataset is distributed under a Creative Commons Attribution 3.0 IGO license.
Cover photo by Matt Artz on Unsplash
Unsplash Images are distributed under a unique Unsplash License.
https://creativecommons.org/publicdomain/zero/1.0/
Provides data from the IBRD Balance Sheet for the fiscal years ended June 30, 2012 and June 30, 2011. The values are expressed in millions of U.S. Dollars. Where applicable, changes have been made to certain line items on the June 30, 2011 balance sheet to conform with the current year's presentation, but the comparable prior years' data sets have not been adjusted to reflect the reclassification impact of those changes.
This is a dataset hosted by the World Bank. The organization has an open data platform found here and they update their information according to the amount of data that is brought in. Explore the World Bank's Financial Data using Kaggle and all of the data sources available through the World Bank organization page!
This dataset is maintained using Socrata's API and Kaggle's API. Socrata has assisted countless organizations with hosting their open data and has been an integral part of the process of bringing more data to the public.
This dataset is distributed under a Creative Commons Attribution 3.0 IGO license.
Cover photo by rawpixel on Unsplash
Unsplash Images are distributed under a unique Unsplash License.
Image: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F1086574%2Feee5467d2935e8c3aaa9af5d58e6bbf4%2Fxarpai-opacity.gif?generation=1681225192483507&alt=media
Demo showing what happens after a user submits three images
The XarpAi Lung Opacity Detector is a proof-of-concept for a free, open-source desktop app that uses artificial intelligence to detect opacities on chest x-rays.
Opacities are characteristic signs of lung diseases like TB and Pneumonia. This app analyses chest x-rays and draws bounding boxes around opacities. Radiologists can then review these areas of interest and use their clinical judgement to make a final diagnosis.
There’s a shortage of radiologists in poor countries. In 2015, Rwanda had 11 radiologists to serve its population of 12 million people. Liberia, with a population of four million, had two practising radiologists. This app provides high volume diagnosis support. It can help overwhelmed radiologists triage x-rays and speed up their workflows.
The predictions are made by a Pytorch Faster R-CNN model. The model was fine tuned on data from four chest x-ray datasets:
Although the app displays opacity bounding boxes, the model was also trained to detect lungs, i.e. it predicts a bounding box that surrounds both lungs. If the model fails to detect the lungs, the app outputs an error message.
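The gist of that inference flow, as a hedged sketch: the checkpoint path and the class-index mapping for the lung and opacity classes are assumptions, not the project's actual values (those live in the model card).

# Sketch of the flow described above: run the detector, check that a lung
# box was found, then keep the opacity boxes. Paths/labels are assumptions.
import torch
from PIL import Image
from torchvision import transforms

LUNG_CLASS, OPACITY_CLASS = 1, 2                 # assumed label mapping
model = torch.load("fasterrcnn_lung_opacity.pth", map_location="cpu")
model.eval()
image = transforms.ToTensor()(Image.open("chest_xray.png").convert("RGB"))
with torch.no_grad():
    pred = model([image])[0]                     # dict of boxes, labels, scores
keep = pred["scores"] > 0.5
labels = pred["labels"][keep]
if (labels == LUNG_CLASS).sum() == 0:
    print("Error: lungs not detected in this image.")
else:
    opacity_boxes = pred["boxes"][keep][labels == OPACITY_CLASS]
    print(f"Found {len(opacity_boxes)} opacity boxes")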
The model was validated on an 80/20 train test split. It was also tested on three out of sample datasets:
These out of sample datasets don’t have annotated opacity bounding boxes. Therefore, accuracy was used as a rough metric - if the target was positive (e.g. positive for TB) and the model predicted a bounding box, the model was deemed to have made a correct prediction. This validation approach is not rigorous. But it’s a quick and simple way to get a feel for the model’s capability.
Results on the 20% validation data:
- mAP@0.5: 0.776
- accuracy: 0.91

Accuracy on out-of-sample datasets:
- Shenzhen and Montgomery TB datasets: 0.84
- DA and DB TB datasets: 0.85
- Child Chest X-Ray Pneumonia dataset: 0.83
Chest x-rays can be difficult for humans to read. One study (TBX11k paper) found that radiologists have a 68.7% accuracy when diagnosing TB on chest x-rays. Using that number for context, the model’s test results look very good. The good performance on the child pneumonia data is surprising because the training data didn’t include a large number of child x-rays.
These results show that this opacity detection app could be helpful when diagnosing lung diseases like TB and Pneumonia.
The complete project folder, including the trained models, is stored in this Kaggle dataset.
For a full project description please refer to the GitHub repo: https://github.com/vbookshelf/XarpAi-Lung-Opacity-Detector
For info on model training and validation, please refer to the model card. I've included a summary of the datasets, confusion matrices and classification reports. https://github.com/vbookshelf/XarpAi-Lung-Opacity-Detector/blob/main/xarpai-lung-opacity-detector-v1.0/static/assets/Model-Card-and-App-Info-v1.0.pdf
I suggest that you download the project folder from Kaggle instead of from the GitHub repo. This is because the project folder on Kaggle includes the trained model. The project folder in the GitHub repo does not include the trained model because GitHub does not allow files larger than 25MB to be uploaded.
You'll need about 1.5GB of free disk space. Other than that there are no special system requirements. This app will run on a CPU. I have an old 2014 Macbook Pro laptop with 8GB of RAM. This app runs on it without any issues.
This is a standard flask app. The steps to set up and run the app are the same for both Mac and Windows.
This app is based on Flask and Pytorch, both of which are pure python. If you encounter any errors during installation you should be able to solve them quite easily. You won’t have to deal with the package dependency issues that happen when using Tensorflow.
The instructions below are for...
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains 50,000 ranked ladder matches from the Dota 2 data dump created by OpenDota. It was inspired by the Dota 2 Matches data published here by Joe Ramir. This is an updated and improved version of that dataset; I have kept the same image and a similar title.
Dota 2 is a popular free-to-play MOBA that can take up thousands of hours of your life. Roughly as many games as are in this dataset are played about every hour. If you like the data, an additional 2-3 million matches are easily available for download.
The aim of this dataset is to enable the exploration of player behavior, skill estimation, or anything else you find interesting. The intent is to create an accessible and easy-to-use resource, which can be expanded and modified if needed. As such, I am open to a wide variety of suggestions about what additions or changes to make.
See https://github.com/odota/core/wiki/JSON-Data-Dump for documentation on the data. I have found a few undocumented areas in the data, including the objectives information. player_slot can be used to combine most of the data, and it is available in most of the tables. Additionally, all tables include match_id, and some have account_id to make it easier to look at an individual player's matches. match_id and account_id have been re-encoded to save a little space; I can upload tables to allow conversion if needed. I plan on adding a small amount of information very soon, including the outcome for an additional 50k-100k matches that occurred after the ones currently uploaded, and some tables to enable determining which continent or region a match was played in.
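A minimal sketch of those joins with pandas (the per-table contents are described below; file names are assumed to mirror the table names):

# Sketch: link per-player rows to match-level rows using the shared keys.
# File names are assumed to mirror the table names described below.
import pandas as pd

matches = pd.read_csv("matches.csv")    # one row per match (match_id); name assumed
players = pd.read_csv("players.csv")    # one row per player per match
combined = players.merge(matches, on="match_id", how="left")
# player_slot identifies a player within a match; account_id == 0 is anonymous.
named = combined[combined["account_id"] != 0]
print(named.groupby("account_id")["match_id"].count().sort_values(ascending=False).head())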
matches: contains top-level information about each match. See https://wiki.teamfortress.com/wiki/WebAPI/GetMatchDetails#Tower_Status for interpreting tower and barracks status. The cluster field can link matches to a geographic region.
players: Individual players are identified by account_id, but there is an option to play anonymously, and roughly one third of the account_id values are not available. Anonymous users have the value 0 for account_id. Contains totals for kills, deaths, denies, etc. Player action counts are available and are indicated by variable names beginning with unit_order_. Counts for the reasons for acquiring or losing gold, and for gaining experience, have the prefixes gold_ and xp_.
player_time: Contains last hits, experience, and gold sampled at one-minute intervals for all players in all matches. The column names indicate the player_slot; for instance, xp_t_1 indicates that the column holds experience sums for the player in slot one.
teamfights: Start and stop time of teamfights, as well as last death time. Teamfights appear to be all battles with three or more deaths. As such this does not include all battles for the entire match.
teamfights_players: Additional information provided for each player in each teamfight. player_slot can be used to link this back to players.csv.
objectives: Gives information on all the objectives completed, by which player and at what time.
chat: All chat for the 50k matches. There is plenty of profanity, and good natured trolling.
There seem to be some efforts to establish indicators of skillful play based on specific parts of gameplay. Opendota has many statistics and some analysis of specific benchmarks at different times in the game. Dotabuff has a lot of information, but I have not explored it deeply. This is an area in which to gather more information.
Insight from domain experts would also be useful to help clarify what problems are interesting to work on. Some initial task ideas
All of these areas have been worked on, but I am not aware of the most up to date research on dota2 gameplay.
I plan on setting up several different predictive tasks in the upcoming weeks, starting with a test set of an additional 50 to 100 thousand matches with just hero_id and account_id included, along with the outcome of each match.
The current dataset seems pretty small for modeling individual players. I would prefer to have a wide range of features instead of a larger dataset for the moment.
Dataset idea for anyone interested in creating their own Dota 2 dataset. It would be useful to have a few full matches avai...