https://creativecommons.org/publicdomain/zero/1.0/
The dataset contains information about the world's biggest companies.
Among them are companies founded in the US, the UK, Europe, Asia, South America, South Africa, and Australia.
For each company, the dataset gives the year it was founded, its revenue and net income for 2018-2020, and its industry.
I have included 2 CSV files: the raw CSV file if you want to practice cleaning the data, and the clean CSV file ready to be analyzed.
The third dataset includes the names of all the companies in the previous datasets and 2 additional columns: number of employees and name of the founder.
In addition, there is a tesla.csv file containing share prices for Tesla.
http://opendatacommons.org/licenses/dbcl/1.0/
This dataset requires a lot of preprocessing and rewards it with useful EDA insights for a company. It consists of sales and profit data sorted by market segment and country/region.
Tips for pre-processing:
1. Check the column names first; some of the errors are in the names themselves.
2. Remove the '$' sign and '-' from all columns where they are present.
3. After the two steps above, change the datatype from object to int.
4. Challenge: try removing the ',' (comma) from all numeric values.
5. Try plotting sales and profit over time.
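The cleaning tips above can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `parse_amount` is hypothetical, a lone '-' is assumed to stand for zero, and a leading minus sign on a real number is kept rather than stripped. The actual column formats in the file may differ.

```python
def parse_amount(raw: str) -> int:
    """Turn a formatted currency string such as ' $1,234.00 ' into an int."""
    # Drop the currency sign and thousands separators, then surrounding spaces.
    cleaned = raw.strip().replace("$", "").replace(",", "").strip()
    # Assumption: an empty cell or a lone dash means "no amount", stored as 0.
    if cleaned in ("", "-"):
        return 0
    # Go through float first so values like '1234.00' parse, then round to int.
    return round(float(cleaned))


# Hypothetical sample values, not taken from the dataset.
for sample in [" $1,234.00 ", "$-", "-500"]:
    print(repr(sample), "->", parse_amount(sample))
```

With pandas, the same function could be applied to a column through `Series.map`, which keeps the cleaning rule in one place.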
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
A company has a fleet of devices transmitting daily sensor readings. They would like to create a predictive maintenance solution to proactively identify when maintenance should be performed. This approach promises cost savings over routine or time-based preventive maintenance, because tasks are performed only when warranted.
The task is to build a machine learning model that predicts the probability of a device failure. When building this model, be sure to minimize both false positives and false negatives. The column to predict is called failure, with binary value 0 for non-failure and 1 for failure.
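Since the task asks for both false positives and false negatives to be kept low, it helps to count them explicitly when evaluating a model. The sketch below is a minimal, library-free illustration; the function names and the sample labels are hypothetical and are not part of the dataset.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = failure)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1:
            counts["tp" if actual == 1 else "fp"] += 1
        else:
            counts["fn" if actual == 1 else "tn"] += 1
    return counts


def precision_recall(counts):
    """Precision penalizes false positives; recall penalizes false negatives."""
    predicted_pos = counts["tp"] + counts["fp"]
    actual_pos = counts["tp"] + counts["fn"]
    precision = counts["tp"] / predicted_pos if predicted_pos else 0.0
    recall = counts["tp"] / actual_pos if actual_pos else 0.0
    return precision, recall


# Hypothetical labels and predictions for six device-days.
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
print(counts, precision_recall(counts))
```

Because failures are usually rare in this kind of data, precision and recall tend to be more informative than plain accuracy.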
This version of the CivilComments Dataset provides access to the primary seven labels that were annotated by crowd workers. The toxicity and other tags are values between 0 and 1 indicating the fraction of annotators who assigned these attributes to the comment text.
The other tags are only available for a fraction of the input examples. They are currently ignored for the main dataset; the CivilCommentsIdentities set includes those labels, but only consists of the subset of the data with them. The other attributes that were part of the original CivilComments release are included only in the raw data. See the Kaggle documentation for more details about the available features.
The comments in this dataset come from an archive of the Civil Comments platform, a commenting plugin for independent news sites. These public comments were created from 2015 - 2017 and appeared on approximately 50 English-language news sites across the world. When Civil Comments shut down in 2017, they chose to make the public comments available in a lasting open archive to enable future research. The original data, published on figshare, includes the public comment text, some associated metadata such as article IDs, publication IDs, timestamps and commenter-generated "civility" labels, but does not include user ids. Jigsaw extended this dataset by adding additional labels for toxicity, identity mentions, as well as covert offensiveness. This data set is an exact replica of the data released for the Jigsaw Unintended Bias in Toxicity Classification Kaggle challenge. This dataset is released under CC0, as is the underlying comment text.
For comments that have a parent_id also in the civil comments data, the text of the previous comment is provided as the "parent_text" feature. Note that the splits were made without regard to this information, so using previous comments may leak some information. The annotators did not have access to the parent text when making the labels.
To use this dataset:
import tensorflow_datasets as tfds
# Load the training split of the Civil Comments dataset.
ds = tfds.load('civil_comments', split='train')
# Print the first four examples.
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Google Stock Data’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/varpit94/google-stock-data on 28 January 2022.
--- Dataset description provided by original source is as follows ---
Google LLC is an American multinational technology company that specializes in Internet-related services and products, which include online advertising technologies, a search engine, cloud computing, software, and hardware. It is considered one of the Big Five companies in the American information technology industry, along with Amazon, Facebook, Apple, and Microsoft. Google was founded on September 4, 1998, by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University in California. Together they own about 14% of its publicly-listed shares and control 56% of the stockholder voting power through super-voting stock. The company went public via an initial public offering (IPO) in 2004. In 2015, Google was reorganized as a wholly-owned subsidiary of Alphabet Inc. Google is Alphabet's largest subsidiary and is a holding company for Alphabet's Internet properties and interests. Sundar Pichai was appointed CEO of Google on October 24, 2015, replacing Larry Page, who became the CEO of Alphabet. On December 3, 2019, Pichai also became the CEO of Alphabet.
This dataset provides historical data of Alphabet Inc. (GOOG). The data is available at a daily level. Currency is USD.
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘🐕 Cat VS Dog popularity per state’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/cat-vs-dog-popularity-in-u-se on 13 February 2022.
--- Dataset description provided by original source is as follows ---
This dataset was created by Andrew Duff and contains around 0 samples along with Percentage Of Cat Owners, Mean Number Of Dogs Per Household, technical information and other features such as: - Percentage Of Households With Pets - Mean Number Of Cats - and more.
- Analyze Percentage Of Dog Owners in relation to Number Of Pet Households (in 1000)
- Study the influence of Percentage Of Cat Owners on Mean Number Of Dogs Per Household
If you use this dataset in your research, please credit Andrew Duff
--- Original source retains full ownership of the source dataset ---
https://creativecommons.org/publicdomain/zero/1.0/
Demolitions in the Occupied Territories is a dataset that provides statistics on the demolition of Palestinian-owned homes and structures in the Occupied Territories.
The information is based on investigations conducted by B’Tselem – The Israeli Information Center for Human Rights in the Occupied Territories.
The dataset covers a period from January 2004 to August 2023 and includes information about the date of demolition, locality, district, area, housing units, people left homeless, minors left homeless, type of structure, and reason for demolition.
The intention of using this data should be solely for objective analysis and understanding of the situation, without any political intent. Any analysis or interpretation should be approached with sensitivity and respect for human rights.
Fatalities in the Israeli-Palestinian Conflict
Photo by Oleg Solodkov on Unsplash
This is a dataset containing 10,000 posts from Kaggle and 60,000 comments related to those posts in the question-answer topic.
Data Fields
kaggle_post
- 'pseudo': The question authors.
- 'title': Title of the post.
- 'question': The question's body.
- 'vote': Voting on Kaggle is similar to liking.
- 'medal': I will share with you the Kaggle medal system, which can be found at https://www.kaggle.com/progression. The system awards medals to users based on…
See the full description on the dataset page: https://huggingface.co/datasets/Raaxx/Kaggle-post-and-comments-question-answer-topic.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains video transcripts from a limited number of YouTubers who posted their reviews of the iPhone 15, 15 Plus, Pro, and Pro Max models. These are the videos used for the transcripts. Video credits are owned by the respective creators.
https://choosealicense.com/licenses/odbl/
Source: https://www.kaggle.com/datasets/hugomathien/soccer by Hugo Mathien
About Dataset
The ultimate Soccer database for data analysis and machine learning
What you get:
- +25,000 matches
- +10,000 players
- 11 European Countries with their lead championship
- Seasons 2008 to 2016
- Players and Teams' attributes* sourced from EA Sports' FIFA video game series, including the weekly updates
- Team line up with squad formation (X, Y coordinates)
- Betting odds from up to 10 providers…
See the full description on the dataset page: https://huggingface.co/datasets/julien-c/kaggle-hugomathien-soccer.
https://choosealicense.com/licenses/afl-3.0/
This dataset contains the subset of ArXiv papers with the "cs.LG" tag to indicate the paper is about Machine Learning. The core dataset is filtered from the full ArXiv dataset hosted on Kaggle: https://www.kaggle.com/datasets/Cornell-University/arxiv. The original dataset contains roughly 2 million papers. This dataset contains roughly 100,000 papers following the category filtering. The dataset is maintained by with requests to the ArXiv API. The current iteration of the dataset only contains… See the full description on the dataset page: https://huggingface.co/datasets/CShorten/ML-ArXiv-Papers.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset provides historical stock market performance data for specific companies. It enables users to analyze and understand the past trends and fluctuations in stock prices over time. This information can be utilized for various purposes such as investment analysis, financial research, and market trend forecasting.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset provides comprehensive information on companies listed on the NASDAQ stock exchange. It includes essential details about each company, making it a valuable resource for financial analysis, stock market research, and investment strategies.
Analyze stock symbols, company names, and market data.
Incorporate company details into financial models and investment strategies.
Understand the distribution of companies by country and currency.
Create visualizations of the NASDAQ market landscape.
The data is sourced from the Twelve Data API, which provides up-to-date financial and stock market information.
Notes: The dataset includes only NASDAQ-listed companies and does not cover other exchanges. Be sure to comply with any data usage policies or licensing agreements associated with the data source.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset offers a detailed collection of US-GAAP financial data extracted from the financial statements of exchange-listed U.S. companies, as submitted to the U.S. Securities and Exchange Commission (SEC) via the EDGAR database. Covering filings from January 2009 onwards, this dataset provides key financial figures reported by companies in accordance with U.S. Generally Accepted Accounting Principles (GAAP).
This dataset primarily relies on the SEC's Financial Statement Data Sets and EDGAR APIs:
- SEC Financial Statement Data Sets
- EDGAR Application Programming Interfaces
In instances where specific figures were missing from these sources, data was directly extracted from the companies' financial statements to ensure completeness.
Please note that the dataset presents financial figures exactly as reported by the companies, which may occasionally include errors. A common issue involves incorrect reporting of scaling factors in the XBRL format. XBRL supports two tag attributes related to scaling: 'decimals' and 'scale.' The 'decimals' attribute indicates the number of significant decimal places but does not affect the actual value of the figure, while the 'scale' attribute adjusts the value by a specific factor.
However, there are several instances, numbering in the thousands, where companies have incorrectly used the 'decimals' attribute (e.g., 'decimals="-6"') under the mistaken assumption that it controls scaling. This is not correct, and as a result, some figures may be inaccurately scaled. This dataset does not attempt to detect or correct such errors; it aims to reflect the data precisely as reported by the companies. A future version of the dataset may be introduced to address and correct these issues.
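Because the dataset leaves such figures uncorrected, users may want a rough screen of their own. The sketch below is one possible heuristic, not a method used by this dataset or by any XBRL tooling: the function name `decimals_looks_misused` and the rule itself are assumptions. It flags a nonzero figure whose magnitude is smaller than the precision its 'decimals' attribute claims, for example a value of 1234 tagged as accurate to the nearest million.

```python
def decimals_looks_misused(value: float, decimals: int) -> bool:
    """Heuristic flag for a figure whose 'decimals' may have been used as a scale.

    A negative 'decimals' such as -6 states that the value is accurate to the
    nearest 10**6. A nonzero value smaller than that unit is suspicious.
    """
    if decimals >= 0 or value == 0:
        return False  # nothing to check for non-negative precision or zero values
    return abs(value) < 10 ** (-decimals)


# Hypothetical (value, decimals) pairs, not taken from the dataset.
samples = [(1234, -6), (1_234_000_000, -6), (12.5, 2), (0, -6)]
for value, decimals in samples:
    print(value, decimals, decimals_looks_misused(value, decimals))
```

A check like this can only suggest candidates for manual review against the original filing; it will miss misreported figures that happen to be large, and it cannot say what the intended value was.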
The source code for data extraction is available here
https://creativecommons.org/publicdomain/zero/1.0/
Shark Tank India: information for season 1 to season 4, with 80 fields/columns and 630+ records.
All seasons/episodes of 🦈 SHARKTANK INDIA 🇮🇳 were broadcast on SonyLiv OTT/Sony TV.
Here is the data dictionary for the Shark Tank India seasons dataset.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
In the bustling world of Kanto, where Pokémon battles shape destinies, crime lurks in the shadows. Detective Kotso, the sharpest mind in Pokémon crime investigations, has been tasked with an urgent mission. The mayor suspects that Team Rocket has infiltrated the city, disguising themselves as ordinary citizens.
But Kotso doesn’t work alone—he relies on you, a brilliant data scientist, to uncover the truth. Your job? Analyze the data of 5,000 residents to predict which of the 1,000 unclassified individuals are secretly part of Team Rocket.
Can you spot the hidden patterns? Can Machine Learning crack the case where traditional detective work fails? The fate of Kanto depends on your skills.
This dataset holds the key to exposing Team Rocket’s operatives. Below is a breakdown of the features at your disposal:
Column Name | Description |
---|---|
ID | Unique identifier for each citizen |
Age | Age of the citizen |
City | City the citizen is from |
Economic Status | Low, Medium, High |
Occupation | Profession in the Pokémon world |
Most Frequent Pokémon Type | The type of Pokémon most frequently used |
Average Pokémon Level | Average level of owned Pokémon |
Criminal Record | Clean (0) or Dirty (1) |
Pokéball Usage | Preferred Pokéball type (e.g., DarkBall, UltraBall) |
Winning Percentage | Battle win rate (e.g., 64%, 88%) |
Gym Badges | Number of gym badges collected (0 to 8) |
Is Pokémon Champion | True if the citizen has defeated the Pokémon Elite Four |
Battle Strategy | Defensive, Aggressive, Unpredictable |
City Movement Frequency | Number of times the citizen moved between cities in the last year |
Possession of Rare Items | Yes or No |
Debts to the Kanto System | Amount of debt (e.g., 20,000) |
Charitable Activities | Yes or No |
Team Rocket Membership | Yes or No (target variable) |
This dataset is not just about numbers—it’s a criminal investigation. Hidden patterns lurk beneath the surface, waiting to be uncovered.
This isn’t just another classification task—it’s a race against time to stop Team Rocket before they take control of Kanto!
Detective Kotso is counting on you. Will you rise to the challenge? 🕵️♂️🔎
1️⃣ Do certain Pokémon types indicate suspicious behavior?
- 📈 Graph: Stacked bar chart comparing Pokémon type distribution between Rocket & non-Rocket members.
- 🎯 Test: Chi-square test of association.
2️⃣ Is economic status a reliable predictor of criminal affiliation?
- 📊 Graph: Box plot of debt and economic status per Team Rocket status.
- 🏦 Test: ANOVA test for group differences.
3️⃣ Do Team Rocket members have a preference for specific PokéBalls?
- 🎨 Graph: Heatmap of PokéBall usage vs. Team Rocket status.
- ⚡ Test: Chi-square test for independence.
4️⃣ Does a high battle win ratio correlate with Team Rocket membership?
- 📉 Graph: KDE plot of win ratio distribution for both classes.
- 🏆 Test: T-test for mean differences.
5️⃣ Are migration patterns different for Team Rocket members?
- 📈 Graph: Violin plot of migration counts per group.
- 🌍 Test: Mann-Whitney U test.
6️⃣ Do Rocket members tend to avoid charity participation?
- 📊 Graph: Grouped bar chart of charity participation rates.
- 🕵️♂️ Test: Fisher’s Exact Test for small sample sizes.
7️⃣ Do Rocket members disguise themselves in certain professions?
- 📊 Graph: Horizontal bar chart of profession frequency per group.
- 🕵️♂️ Test: Chi-square test for profession-Team Rocket relationship.
8️⃣ Is there an unusual cluster of Rocket members in specific cities?
- 🗺 Graph: Geographic heatmap of city distributions.
- 📌 Test: Spatial autocorrelation test.
9️⃣ How does badge count affect the likelihood of being a Rocket member?
- 📉 Graph: Histogram of gym badge distributions.
- 🏅 Test: Kruskal-Wallis test.
🔟 **Are there any multi-feature interactions that reve...
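Several of the questions above rely on a chi-square test on a contingency table (for example PokéBall type versus Team Rocket status). The sketch below computes the Pearson chi-square statistic and degrees of freedom by hand so the mechanics are visible. The function name and the counts are hypothetical and are not taken from the dataset.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical counts: rows = two PokéBall types, columns = Rocket / non-Rocket.
observed = [[30, 10],
            [20, 40]]
stat, dof = chi_square_statistic(observed)
print(f"chi2 = {stat:.4f}, dof = {dof}")
```

To turn the statistic into a p-value, a library routine such as `scipy.stats.chi2_contingency` can be used; note that it applies a continuity correction to 2x2 tables by default, so its statistic may differ from the uncorrected value computed here.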
This dataset contains a wealth of information from 52,000 loan applications, offering detailed insights into the factors that influence loan approval decisions. Collected from financial institutions, this data is highly valuable for credit risk analysis, financial modeling, and predictive analytics. The dataset is particularly useful for anyone interested in applying machine learning techniques to real-world financial decision-making scenarios.
Overview: This dataset provides information about various applicants and the loans they applied for, including their demographic details, income, loan terms, and approval status. By analyzing this data, one can gain an understanding of which factors are most critical for determining the likelihood of loan approval. The dataset can also help in evaluating credit risk and building robust credit scoring systems.
Dataset Columns:
- Applicant_ID: Unique identifier for each loan application.
- Gender: Gender of the applicant (Male/Female).
- Age: Age of the applicant.
- Marital_Status: Marital status of the applicant (Single/Married).
- Dependents: Number of dependents the applicant has.
- Education: Education level of the applicant (Graduate/Not Graduate).
- Employment_Status: Employment status of the applicant (Employed, Self-Employed, Unemployed).
- Occupation_Type: Type of occupation, which provides insights into the nature of the applicant’s job (Salaried, Business, Others).
- Residential_Status: Type of residence (Owned, Rented, Mortgage).
- City/Town: The city or town where the applicant resides.
- Annual_Income: The total annual income of the applicant, a key factor in loan eligibility.
- Monthly_Expenses: The monthly expenses of the applicant, indicating their financial obligations.
- Credit_Score: The applicant's credit score, reflecting their creditworthiness.
- Existing_Loans: Number of existing loans the applicant is servicing.
- Total_Existing_Loan_Amount: The total amount of all existing loans the applicant has.
- Outstanding_Debt: The remaining amount of debt yet to be paid by the applicant.
- Loan_History: The applicant’s previous loan history (Good/Bad), indicating their repayment reliability.
- Loan_Amount_Requested: The loan amount the applicant has applied for.
- Loan_Term: The term of the loan in months.
- Loan_Purpose: The purpose of the loan (e.g., Home, Car, Education, Personal, Business).
- Interest_Rate: The interest rate applied to the loan.
- Loan_Type: The type of loan (Secured/Unsecured).
- Co-Applicant: Indicates if there is a co-applicant for the loan (Yes/No).
- Bank_Account_History: Applicant’s banking history, showing past transactions and reliability.
- Transaction_Frequency: The frequency of financial transactions in the applicant’s bank account (Low/Medium/High).
- Default_Risk: The risk level of the applicant defaulting on the loan (Low/Medium/High).
- Loan_Approval_Status: Final decision on the loan application (Approved/Rejected).
https://creativecommons.org/publicdomain/zero/1.0/
Cook County Certified Veteran Owned Businesses
This is a dataset hosted by the City of Chicago. The city has an open data platform, and it updates its information according to the amount of data that is brought in. Explore the City of Chicago using Kaggle and all of the data sources available through the City of Chicago organization page!
This dataset is maintained using Socrata's API and Kaggle's API. Socrata has assisted countless organizations with hosting their open data and has been an integral part of the process of bringing more data to the public.
Cover photo by 刘 帅 on Unsplash
Unsplash Images are distributed under a unique Unsplash License.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created to simulate a market basket dataset, providing insights into customer purchasing behavior and store operations. The dataset facilitates market basket analysis, customer segmentation, and other retail analytics tasks. Here's more information about the context and inspiration behind this dataset:
Context:
Retail businesses, from supermarkets to convenience stores, are constantly seeking ways to better understand their customers and improve their operations. Market basket analysis, a technique used in retail analytics, explores customer purchase patterns to uncover associations between products, identify trends, and optimize pricing and promotions. Customer segmentation allows businesses to tailor their offerings to specific groups, enhancing the customer experience.
Inspiration:
The inspiration for this dataset comes from the need for accessible and customizable market basket datasets. While real-world retail data is sensitive and often restricted, synthetic datasets offer a safe and versatile alternative. Researchers, data scientists, and analysts can use this dataset to develop and test algorithms, models, and analytical tools.
Dataset Information:
The columns provide information about the transactions, customers, products, and purchasing behavior, making the dataset suitable for various analyses, including market basket analysis and customer segmentation. Here's a brief explanation of each column in the Dataset:
Use Cases:
Note: This dataset is entirely synthetic and was generated using the Python Faker library, which means it doesn't contain real customer data. It's designed for educational and research purposes.
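As a starting point for the market basket analysis described above, the sketch below counts how often pairs of products appear together and reports each pair's support (the share of baskets containing both items). The function name `pair_support` and the sample baskets are hypothetical; the real dataset would first need to be grouped into one list of products per transaction.

```python
from collections import Counter
from itertools import combinations


def pair_support(baskets):
    """Return the support of every product pair that co-occurs in a basket."""
    counts = Counter()
    for basket in baskets:
        # De-duplicate and sort so each pair has one canonical ordering.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    n = len(baskets)
    return {pair: count / n for pair, count in counts.items()}


# Hypothetical baskets, not taken from the dataset.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "jam"],
    ["bread", "milk", "jam"],
]
support = pair_support(baskets)
for pair, value in sorted(support.items(), key=lambda kv: -kv[1])[:3]:
    print(pair, value)
```

Pairs with high support are candidates for association rules; a fuller analysis would also compute confidence and lift, or use a dedicated Apriori or FP-Growth implementation.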