This is the dataset that goes along with the Deep Learning basics with Python, TensorFlow and Keras p.2 Tutorial provided by Sentdex. Link here: https://www.youtube.com/watch?v=j-3vuBynnOE&list=PLQVvvaa0QuDfhTox0AjmQ6tvTgMBZBEXN&index=2
Kaggle is a platform for sharing data, performing reproducible analyses, interactive data analysis tutorials, and machine learning competitions.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘US Adult Income’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/johnolafenwa/us-census-data on 28 January 2022.
--- Dataset description provided by original source is as follows ---
US Adult Census data relating income to social factors such as age, education, and race.
The US Adult Income dataset was extracted by Barry Becker from the 1994 US Census Database. The dataset consists of anonymous information such as occupation, age, native country, race, capital gain, capital loss, education, work class and more. Each row is labelled as having a salary of either ">50K" or "<=50K".
The dataset is split into two CSV files, named adult-training.txt and adult-test.txt.
The goal is to train a binary classifier on the training dataset to predict the column income_bracket, which has two possible values, ">50K" and "<=50K", and to evaluate the accuracy of the classifier on the test dataset.
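As a rough illustration of the evaluation step, the sketch below computes accuracy and compares against a majority-class baseline, which is a useful floor for any classifier on this task. The label lists are tiny hypothetical examples, not rows from the dataset, and the function names are illustrative only.

```python
from collections import Counter


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)


def majority_class(labels):
    """Most frequent label, used here as a trivial baseline classifier."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical labels standing in for the income_bracket column.
train_labels = ["<=50K", "<=50K", ">50K", "<=50K"]
test_labels = ["<=50K", ">50K", "<=50K", ">50K", "<=50K"]

# The baseline always predicts the most common training label.
baseline = majority_class(train_labels)
baseline_accuracy = accuracy([baseline] * len(test_labels), test_labels)
print(baseline, baseline_accuracy)
```

A trained classifier is only informative if its test accuracy clearly exceeds this baseline.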
Note that the dataset is made up of categorical and continuous features. It also contains missing values. The categorical columns are: workclass, education, marital_status, occupation, relationship, race, gender, native_country
The continuous columns are: age, education_num, capital_gain, capital_loss, hours_per_week
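The column split above could be applied while loading the files, roughly as sketched below. Several things here are assumptions rather than facts from the description: the column order, the extra fnlwgt column, the use of "?" as the missing-value marker, and the two sample rows, which are made up for illustration.

```python
import csv
import io

# Assumed column order; the description lists the names but not their order.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "gender",
    "capital_gain", "capital_loss", "hours_per_week", "native_country",
    "income_bracket",
]
CATEGORICAL = ["workclass", "education", "marital_status", "occupation",
               "relationship", "race", "gender", "native_country"]
CONTINUOUS = ["age", "education_num", "capital_gain", "capital_loss",
              "hours_per_week"]
MISSING = "?"  # assumed marker for missing values

# Hypothetical rows in the assumed layout, not copied from the dataset.
SAMPLE = """39, Private, 10001, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K
52, ?, 10002, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 45, United-States, >50K
"""


def load_rows(handle):
    """Parse rows into dicts, typing continuous columns and marking gaps."""
    rows = []
    for fields in csv.reader(handle, skipinitialspace=True):
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        record = dict(zip(COLUMNS, (f.strip() for f in fields)))
        for name in CATEGORICAL:
            if record[name] == MISSING:
                record[name] = None  # make missing values explicit
        for name in CONTINUOUS:
            record[name] = int(record[name])
        rows.append(record)
    return rows


rows = load_rows(io.StringIO(SAMPLE))
print(rows[1]["workclass"], rows[1]["age"])
```

To read the real files, an open file handle for adult-training.txt or adult-test.txt can be passed in place of the in-memory sample, provided the assumptions above hold.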
This dataset was obtained from the UCI repository. It can be found at https://archive.ics.uci.edu/ml/datasets/census+income and http://mlr.cs.umass.edu/ml/machine-learning-databases/adult/
USAGE: This dataset is well suited to developing and testing wide linear classifiers, deep neural network classifiers, and a combination of both. For more information on combined wide and deep model classifiers, refer to the research paper by Google: https://arxiv.org/abs/1606.07792
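To make the combination concrete, here is a minimal forward-pass sketch in which a wide linear logit and a deep network logit are summed before a sigmoid. It is not a trained model and not the kernel's implementation: the vocabulary, weights, and feature scaling are hypothetical, and the deep part uses scaled continuous inputs directly instead of learned embeddings.

```python
import math


def one_hot(value, vocabulary):
    """Sparse indicator vector used by the wide (linear) part."""
    return [1.0 if value == v else 0.0 for v in vocabulary]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def dense(inputs, weights, biases):
    """Fully connected layer; weights holds one row per output unit."""
    return [dot(inputs, row) + b for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def wide_and_deep_probability(wide_features, deep_features, params):
    """Sum the wide logit and the deep logit, then squash to a probability."""
    wide_logit = dot(wide_features, params["wide_w"])
    hidden = relu(dense(deep_features, params["hidden_w"], params["hidden_b"]))
    deep_logit = dot(hidden, params["out_w"])
    return sigmoid(wide_logit + deep_logit + params["bias"])


# Hypothetical vocabulary and hand-picked weights, for illustration only.
EDUCATION = ["HS-grad", "Bachelors", "Masters"]
params = {
    "wide_w": [-0.5, 0.3, 0.8],
    "hidden_w": [[1.0, -1.0], [0.5, 2.0]],
    "hidden_b": [0.0, -0.5],
    "out_w": [0.7, 1.2],
    "bias": -0.2,
}

wide = one_hot("Bachelors", EDUCATION)
deep = [39 / 100, 40 / 100]  # age and hours_per_week, crudely scaled
probability = wide_and_deep_probability(wide, deep, params)
print(probability)  # estimated probability of the ">50K" class
```

In practice both parts would be trained jointly on the training file, typically with a deep learning library.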
Refer to this kernel for sample usage: https://www.kaggle.com/johnolafenwa/wage-prediction
The complete tutorial is available at http://johnolafenwa.blogspot.com.ng/2017/07/machine-learning-tutorial-1-wage.html?m=1
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘homeprices-multiple-variables’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/pankeshpatel/homepricesmultiplevariables on 14 February 2022.
--- Dataset description provided by original source is as follows ---
Sample housing price data. We have used this small dataset to create a tutorial, Machine learning for absolute beginners. The topic is multivariate regression.
It has the following four attributes describing a house:
- area: area of the house in square feet
- bedrooms: number of bedrooms in the house
- age: age of the house
- price: price of the house
Area, bedrooms, and age are the feature attributes, and price is the target variable.
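A multivariate regression on these attributes can be sketched with ordinary least squares, as below. The six rows are synthetic values generated from a known linear rule, not the contents of the dataset, and exact fractions are used only so that this tiny example has no rounding noise. Real code would more likely use NumPy or scikit-learn.

```python
from fractions import Fraction


def fit_linear_regression(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    # Prepend a constant 1 to each row so that w[0] acts as the intercept.
    X = [[Fraction(1)] + [Fraction(v) for v in row] for row in rows]
    y = [Fraction(t) for t in targets]
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(n)]
         + [sum(x[i] * t for x, t in zip(X, y))]
         for i in range(n)]
    # Gauss-Jordan elimination reduces the left block to the identity.
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        for r in range(n):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]


def predict(weights, row):
    """Intercept plus the weighted sum of the features."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], row))


# Synthetic (area, bedrooms, age) rows, with prices generated from
# price = 50000 + 100 * area + 20000 * bedrooms - 3000 * age.
features = [(2600, 3, 20), (3000, 4, 15), (3200, 3, 18),
            (3600, 3, 30), (4000, 5, 8), (4100, 6, 8)]
prices = [310000, 385000, 376000, 380000, 526000, 556000]

weights = fit_linear_regression(features, prices)
estimate = predict(weights, (3000, 3, 40))
print([float(w) for w in weights], float(estimate))
```

Because the synthetic prices follow the rule exactly, the fit recovers the rule's coefficients. Real housing data would leave residual error.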
Source: codebasics : https://twitter.com/codebasicshub
--- Original source retains full ownership of the source dataset ---
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
Avengers: Infinity War is one of my favorite movies, not only in the superhero genre but of all time. Besides an insane character crossover, it features quite a few scenes with pop culture references. In one of them, Peter Quill (aka Star-Lord) refers to Thanos as "Grimace".
For those who don't know, Grimace is one of the McDonaldland characters. These were quite popular in the 80s and 90s (which fits with Quill's time on Earth as a kid). This reference inspired cosplayers Kittie Cosplay and Banana Steve to make an even crazier crossover: Grimos. This made me wonder: can we create an algorithm capable of distinguishing between these two purple characters?
Unsurprisingly, I couldn't find an existing dataset for this project. Thus, I created it myself leveraging the search function of Google Images. Originally, I wanted to use a semi-automatic approach, similar to what is suggested in this tutorial. However, I could never get it running. My guess is that it is a bit old and no longer compatible with the most recent versions of web browsers. Therefore, I looked for other alternatives and found Fatkun Batch Download Image extension for Chrome. This is a very handy tool which allows you to manually select and download all the images of a single tab in your browser to your computer. Although at the beginning I was a bit bummed that I had to click through a lot of images, I quickly realized this was a necessary step if I wanted to have a decently curated dataset.
I selected and downloaded images that:
Had the character in any representation (e.g., photo, comic, drawing, cartoon, etc.)
Showed the character from different angles
All images are JPEGs.
You can find the repository of the project where these data are put in action here.
Images were scraped using Google Images and the Fatkun Batch Download Image extension for Chrome.
All images were obtained from publicly available websites. If you own the rights to any of the images shown and wish to have it removed from this dataset, please let me know.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Bootstrapped and manually annotated NER data from Swedish web news from 2012. NER stands for named entity recognition, and it is used to identify entities in a text, such as organisations, locations and people.
It is a very common operation in a general NLP pipeline, and several algorithms can be used to train a model. Traditionally, many NER systems were trained using some kind of CRF (conditional random field) approach, but nowadays many people successfully use LSTMs or other sequence-based deep learning techniques.
A tutorial on how to use this dataset to train an NER for Stanford CoreNLP is available here https://medium.com/@klintcho/training-a-swedish-ner-model-for-stanford-corenlp-part-2-20a0cfd801dd
The dataset is very simple and can easily be adapted into other formats; it is specifically adapted to CoreNLP NER. Thus the first column is a word. The second column (tab separated) is either the NER category (ORG, PER, LOC, MISC) or a 0 if the word does not belong to any category (not an entity). Each word is on its own line, and sentences are separated by an empty line.
Sample structure (of two sentences, one three-word sentence and one four-word sentence):
Apple	ORG
is	0
nice	0
.	0

Per	PER
is	0
not	0
sad	0
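The format described above can be read with a few lines of code. The following is a minimal sketch, assuming one tab-separated word and label per line and an empty line between sentences. The function name is hypothetical and the sample string simply mirrors the two example sentences.

```python
def parse_ner_corpus(text):
    """Split tab-separated word/label lines into a list of sentences."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # An empty line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        # Split on the last tab so the label is always the final field.
        word, label = line.rsplit("\t", 1)
        current.append((word, label))
    if current:
        sentences.append(current)  # final sentence without a trailing blank
    return sentences


SAMPLE = "Apple\tORG\nis\t0\nnice\t0\n.\t0\n\nPer\tPER\nis\t0\nnot\t0\nsad\t0\n"

sentences = parse_ner_corpus(SAMPLE)
# Keep only tokens tagged with a category, i.e. any label other than "0".
entities = [(word, label) for sentence in sentences
            for word, label in sentence if label != "0"]
print(len(sentences), entities)
```

From this structure the data can be written out in whatever layout another NER toolkit expects.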
The text is annotated from http://spraakbanken.gu.se/eng/resource/webbnyheter2012. Thanks to Norah Klintberg Sakal for helping out with the annotation and for reviewing all annotations as well.
Feel free to use this for whatever you like. Like most datasets, it would definitely benefit from becoming larger, so feel free to create a pull request at https://github.com/klintan/swedish-ner-corpus/ or update it here on Kaggle.