This dataset was created by Jiayang Gao
This dataset was created by Jannis
https://creativecommons.org/publicdomain/zero/1.0/
One of the most popular competitions on Kaggle is House Prices: Advanced Regression Techniques. The original data comes from the publication by Dean De Cock, "Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project", Journal of Statistics Education, Volume 19, Number 3 (2011). Recently a 'demonstration' notebook, "First place is meaningless in this way!", was published that extracts the 'solution' from the full dataset. Now that the 'solution' is readily available, anyone can reproduce the competition at home without any daily submission limit. This opens up the possibility of experimenting with advanced techniques such as pipelines with various estimators/models in the same notebook, extensive hyper-parameter tuning, and so on, all without the risk of 'upsetting' the public leaderboard. Simply download this solution.csv file, import it into your script or notebook, and evaluate the Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted values and the logarithm of the values in this file.
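A minimal evaluation sketch is shown below. It assumes that solution.csv and your own predictions both contain the competition's usual Id and SalePrice columns; the submission file name is a placeholder.

```python
import numpy as np
import pandas as pd

solution = pd.read_csv("solution.csv")        # ground-truth sale prices described above
submission = pd.read_csv("submission.csv")    # your model's predictions (hypothetical file name)

# Align predictions and ground truth on the Id column, then score in log space
merged = solution.merge(submission, on="Id", suffixes=("_true", "_pred"))
rmsle = np.sqrt(np.mean(
    (np.log(merged["SalePrice_pred"]) - np.log(merged["SalePrice_true"])) ** 2
))
print(f"Leaderboard-style score: {rmsle:.5f}")
```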
This dataset is the submission.csv file that will produce a public leaderboard score of 0.00000.
https://creativecommons.org/publicdomain/zero/1.0/
By [source]
This dataset provides a detailed look into the world of competitive video gaming at universities. It covers a wide range of topics, from performance rankings and results across multiple esports platforms to individual team and university rankings within each tournament. With a wealth of data, fans can discover statistics on their favorite teams or explore the challenges facing university gamers as they battle to be the best. Dive into the information provided and get an inside view of collegiate esports tournaments as you examine everything from Match ID, Team 1 and university affiliations to points earned or lost in each match and the special Seeds or UniSeeds for exceptional teams. And don't forget to explore the team names along with their corresponding websites for further details on stats across tournaments!
Download Files: First, make sure you have downloaded the CS_week1, CS_week2, CS_week3 and seeds datasets from Kaggle. You will also need to download the currentRankings file for each week of competition. All files should be saved under their originally assigned names so that your analysis tools can read them properly (e.g. CS_week1.csv).
Understand File Structure: Once all data has been collected and organized into separate files on your computer, it's time to become familiar with what type of information is included in each file. The main folder contains the three weekly data files (week 1-3) and the seedings. The week 1-3 files contain teams matched against one another by university, the point scores from match results, and the team name and website URL associated with each university entry; the seedings file provides a ranking of the university entries, accompanied by team names, website URLs, etc. In addition, there is a currentRankings file that contains scores for each individual player/team for a given period of competition (e.g. the first week).
Analyzing Data: Now that everything is set up on your end, it's time to explore! You can dive deep into trends among universities or individual players with regard to specific match performances or overall standings across the weeks of competition. You can also generate further insights by building graphs from the data compiled in the BUECTracker dataset. For example, to compare two universities, say Harvard University versus Cornell University, since the beginning of the event, you could extract their respective points and dates (found under the results), their regions (North America vs. Europe, etc.), general stats such as maps played, and any other custom ideas that come up when dealing with similar datasets.
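A rough pandas sketch of that kind of comparison might look like the following; the 'University' and 'Points' column names are assumptions based on the description above, and the university names are placeholders.

```python
import pandas as pd

week1 = pd.read_csv("CS_week1.csv")

# Total points per university for two example schools (placeholder names)
subset = week1[week1["University"].isin(["Harvard University", "Cornell University"])]
print(subset.groupby("University")["Points"].sum().sort_values(ascending=False))
```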
- Analyze the performance of teams and identify areas for improvement for better performance in future competitions.
- Assess which esports platforms are the most popular among gamers.
- Gain a better understanding of player rankings across different regions, based on the ranking system, to create targeted strategies that could boost individual players' scoring potential or a team's overall success in competitive gaming events.
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: CS_week1.csv

| Column name | Description |
|:------------|:------------|
| Match ID | Unique identifier for each match. (Integer) |
| Team 1 | Name of the first team in the match. (String) |
| University | University associated with the team. (String) |
File: CS_week1_currentRankings.csv

| Column name | Description |
|:------------|:------------|
...
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset is the result of my study of web scraping the English Wikipedia in R and my tests of regression and classification modeling in R.
The content was created by reading the appropriate English Wikipedia articles about Italian cities: I did not run NLP analysis, I only parsed the tables with the data, and I ranked every city from 0 to N in every aspect. About the values: 0 means "*the city is not ranked in this aspect*" and N means "*the city is in first place, in descending order of importance, in this aspect*". If there is no ranking in a particular aspect (for example, only the existence of airports/harbours, with no additional data about traffic or size), then 0 means "*no existence*" and N means "*there are N airports/harbours*". The only non-numeric column is the one with the names of the cities in English form, with some exceptions (for example, "*Bra (CN)*") for simplicity.
I acknowledge the Wikimedia Foundation for its work, its mission, and for making available the cover image of this dataset (please read the article "The Ideal City (painting)"). I also acknowledge StackOverflow and Cross Validated for being the most important hubs of technical knowledge in the world, and all the people on Kaggle for their suggestions.
As a beginner in data analysis and modeling (OK, I passed the statistics exam at Politecnico di Milano (Italy), but it has been more than 10 years since I worked on this topic and my memory is getting old ^_^), I worked mostly on data cleaning, dataset building and building the simplest models.
You can use this dataset to figure out which city is a good place to live, or extend it by adding other data from Wikipedia (not only by reading the tables but also by reading the free text and extracting data from it).
Hi guys, I'm new to Kaggle and this is my first dataset, so please support me by giving feedback on my work.
https://cdla.io/sharing-1-0/
Analyzing the World Happiness Report across years can be a tedious task because it takes a lot of time to clean the data. This dataset, which is essentially 3D time-series data ready to use for ML tasks, is a simplified version of the existing World Happiness Report datasets.
This dataset contains the happiness score of each country, and some key factors that contribute directly to the overall happiness of the country, over 6 years (from 2015 to 2020).
For those who are new to the topic, here is the link to the World Happiness Report
Here are some key differences from the original data:
Features that are absent from some of the annual reports, or that are mostly unnecessary, are excluded from the data. Feature names are now consistent.
For the sake of simplification, only the countries which are present in all annual reports are included in the data.
Instead of individual regions like the Middle East and Western Europe, continents are (in my opinion) a better choice for groupby-aggregate operations, so the existing region column is replaced by a new continent column.
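As a quick illustration of the kind of groupby-aggregate operation this enables, here is a hedged sketch; the file name and the continent / year / happiness_score column names are assumptions, not the dataset's exact schema.

```python
import pandas as pd

df = pd.read_csv("world_happiness.csv")   # hypothetical file name

# Average happiness score per continent and year
summary = df.groupby(["continent", "year"])["happiness_score"].mean().unstack("year")
print(summary.round(2))
```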
I am grateful to the Sustainable Development Solutions Network for creating the World Happiness Report and its Kaggle dataset, which I used for preprocessing in the first place.
https://creativecommons.org/publicdomain/zero/1.0/
Data pipeline diagram: https://github.com/IyadElwy/Televisions/assets/83036619/7088d477-2559-4af2-94e9-924274521d36
It's important to note the additionalProperties field, which makes it possible to add more data to a field; i.e., the following fields will have a lot more nested data.
```json
{
"type": "object",
"additionalProperties": true,
"properties": {
"id": {
"type": "integer"
},
"title": {
"type": "string"
},
"normalized_title": {
"type": "string"
},
"wikipedia_url": {
"type": "string"
},
"wikiquotes_url": {
"type": "string"
},
"eztv_url": {
"type": "string"
},
"metacritic_url": {
"type": "string"
},
"wikipedia": {
"type": "object",
"additionalProperties": true
},
"wikiquotes": {
"type": "object",
"additionalProperties": true
},
"metacritic": {
"type": "object",
"additionalProperties": true
},
"tvMaze": {
"type": "object",
"additionalProperties": true
}
}
}
```
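As a small illustration, the schema above can be used to validate a record with the jsonschema package. This is only a sketch: the schema file name and the example record are made up.

```python
import json
from jsonschema import validate  # pip install jsonschema

# Hypothetical file containing the schema shown above
with open("show_schema.json") as f:
    schema = json.load(f)

record = {
    "id": 1,
    "title": "Example Show",
    "normalized_title": "example-show",
    "wikipedia": {"summary": "any nested structure is allowed by additionalProperties"},
}
validate(instance=record, schema=schema)  # raises jsonschema.ValidationError on mismatch
print("record conforms to the schema")
```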
Part of Janatahack Hackathon in Analytics Vidhya
The healthcare sector has long been an early adopter of and benefited greatly from technological advances. These days, machine learning plays a key role in many health-related realms, including the development of new medical procedures, the handling of patient data, health camps and records, and the treatment of chronic diseases.
MedCamp organizes health camps in several cities with low work-life balance. They reach out to working people and ask them to register for these health camps. For those who attend, MedCamp provides the facility to undergo health checks or to increase awareness by visiting various stalls (depending on the format of the camp).
MedCamp has conducted 65 such events over a period of 4 years and they see a high drop-off between "Registration" and the number of people taking tests at the camps. Over the last 4 years, they have stored data on ~110,000 registrations.
One of the biggest costs in arranging these camps is the amount of inventory you need to carry. If you carry more inventory than required, you incur unnecessarily high costs. On the other hand, if you carry less than required for conducting these medical checks, people end up having a bad experience.
The Process:
MedCamp employees / volunteers reach out to people and drive registrations.
During the camp, people who "show up" either undergo the medical tests or visit stalls, depending on the format of the health camp.
Other things to note:
Since this is a completely voluntary activity for the working professionals, MedCamp usually has little profile information about these people.
For a few camps, there was hardware failure, so some information about date and time of registration is lost.
MedCamp runs 3 formats of these camps. The first and second formats provide people with an instantaneous health score. The third format provides information about several health issues through various awareness stalls.
Favorable outcome:
For the first 2 formats, a favourable outcome is defined as getting a health_score, while in the third format it is defined as visiting at least one stall.
You need to predict the chances (probability) of having a favourable outcome.
Train / Test split:
Camps started on or before 31st March 2006 are considered in Train
Test data is for all camps conducted on or after 1st April 2006.
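A hedged starter sketch of that split and prediction task is below. It is not the official solution; it assumes a single prepared dataframe with a Camp_Start_Date column, a binary favourable_outcome target and already-numeric features (all assumptions, as is the file name).

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical pre-merged file of registrations with camp details and outcome labels
df = pd.read_csv("registrations_merged.csv", parse_dates=["Camp_Start_Date"])

train = df[df["Camp_Start_Date"] <= "2006-03-31"]
test = df[df["Camp_Start_Date"] >= "2006-04-01"]

# Assumes the remaining columns are numeric features
feature_cols = [c for c in df.columns if c not in ("favourable_outcome", "Camp_Start_Date")]
model = LogisticRegression(max_iter=1000)
model.fit(train[feature_cols], train["favourable_outcome"])

# Predicted probability of a favourable outcome for the test camps
test_probs = model.predict_proba(test[feature_cols])[:, 1]
```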
Credits to AV
To share with the data science community to jump start their journey in Healthcare Analytics
In this paper, we introduce a novel benchmarking framework designed specifically for evaluating data science agents. Our contributions are three-fold. First, we propose DSEval, an evaluation paradigm that enlarges the evaluation scope to the full lifecycle of LLM-based data science agents. We cover aspects including, but not limited to, the quality of the derived analytical solutions or machine learning models, as well as potential side effects such as unintentional changes to the original data. Second, we incorporate a novel bootstrapped annotation process that lets LLMs themselves generate and annotate the benchmarks with a "human in the loop". A novel language (DSEAL) has been proposed, and the four derived benchmarks significantly improve benchmark scalability and coverage with largely reduced human labor. Third, based on DSEval and the four benchmarks, we conduct a comprehensive evaluation of various data science agents from different aspects. Our findings reveal the common challenges and limitations of current works, providing useful insights and shedding light on future research on LLM-based data science agents.
This is one of DSEval benchmarks.
Market basket analysis with Apriori algorithm
The retailer wants to target customers with suggestions for the itemsets they are most likely to purchase. I was given a dataset containing a retailer's transaction data, which covers all transactions that occurred over a period of time. The retailer will use the results to grow their business: by providing customers with itemset suggestions, we can increase customer engagement, improve the customer experience and identify customer behavior. I will solve this problem using Association Rules, an unsupervised learning technique that checks for the dependency of one data item on another.
Association rules are most useful when you want to find associations between different objects in a set, i.e., frequent patterns in a transaction database. They can tell you what items customers frequently buy together, and they allow the retailer to identify relationships between items.
Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat and 8 bought both. For the rule "bought computer mouse => bought mouse mat":
- support = P(mouse & mat) = 8/100 = 0.08
- confidence = support / P(computer mouse) = 0.08/0.10 = 0.80
- lift = confidence / P(mouse mat) = 0.80/0.09 ≈ 8.9

This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
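The same toy numbers can be reproduced with a few lines of plain Python:

```python
n_customers = 100
n_mouse = 10   # customers who bought a computer mouse
n_mat = 9      # customers who bought a mouse mat
n_both = 8     # customers who bought both

support = n_both / n_customers              # 0.08
confidence = n_both / n_mouse               # 0.80
lift = confidence / (n_mat / n_customers)   # ≈ 8.9
print(support, confidence, round(lift, 1))
```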
Number of Attributes: 7
Screenshot: https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png
First, we need to load the required libraries. Each library is briefly described below.
Screenshot: https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png
Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.
Screenshot: https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png
Screenshot: https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png
Next, we will clean our data frame and remove missing values.
Screenshot: https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png
To apply association rule mining, we need to convert the data frame into transaction data, so that all items bought together in one invoice end up in ...
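The walkthrough above uses R (the screenshots show that workflow). For readers who prefer Python, a rough equivalent using mlxtend is sketched below; the BillNo and Itemname column names are assumptions about the Excel file, not confirmed from it.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

df = pd.read_excel("Assignment-1_Data.xlsx").dropna(subset=["Itemname"])

# One basket per invoice: the list of items bought together
baskets = df.groupby("BillNo")["Itemname"].apply(list).tolist()

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(baskets).transform(baskets), columns=te.columns_)

frequent_itemsets = apriori(onehot, min_support=0.01, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]].head())
```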
Digital image forensics has gained a lot of attention as it becomes easier for anyone to create forged images. Several areas are affected by image manipulation: a doctored image can increase the credibility of fake news, and impostors can use morphed images to pretend to be someone else.
It has become critically important to be able to recognize the manipulations an image has undergone. To do this, the first requirement is reliable, controlled datasets representing the most characteristic cases encountered. The purpose of this work is to lay the foundations of a body of tests that allows both the qualification of automatic methods for authentication and manipulation detection, and the training of these methods.
This dataset contains about 105,000 splicing forgeries, available under the splicing directory. Each splicing is accompanied by two binary masks: one under the probe_mask subdirectory indicates the location of the forgery, and one under the donor_mask subdirectory indicates the location of the source. The external image can be found in the JSON file under the graph subdirectory.
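A purely illustrative loading sketch is below; every path and file name in it is a placeholder, since the exact naming convention inside the splicing, probe_mask, donor_mask and graph directories may differ from this guess.

```python
import json
from PIL import Image

# All paths below are hypothetical examples of the layout described above
forged = Image.open("splicing/img/0000001.tif")          # the spliced (forged) image
probe = Image.open("splicing/probe_mask/0000001.jpg")    # binary mask of the forged region
donor = Image.open("splicing/donor_mask/0000001.jpg")    # binary mask of the source region
with open("splicing/graph/0000001.json") as f:           # JSON referencing the external (donor) image
    graph = json.load(f)
print(forged.size, probe.size, donor.size, list(graph.keys()))
```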
If you use this dataset for your research, please refer to the original paper :
@INPROCEEDINGS{DEFACTODataset,
  AUTHOR = "Gaël Mahfoudi and Badr Tajini and Florent Retraint and Fr{\'e}d{\'e}ric Morain-Nicolier and Jean Luc Dugelay and Marc Pic",
  TITLE = "{DEFACTO:} Image and Face Manipulation Dataset",
  BOOKTITLE = "27th European Signal Processing Conference (EUSIPCO 2019)",
  ADDRESS = "A Coruña, Spain",
  DAYS = 1,
  MONTH = sep,
  YEAR = 2019
}
and to the MSCOCO dataset
The DEFACTO Consortium does not own the copyright of those images. Please refer to the MSCOCO terms of use for all images based on their Dataset.
This dataset was created by HuiUnited
Photo by Anastasiya Pavlova from Unsplash
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset is a side product of a notebook to find out the rules of stress position in English.
It is a work based on another dataset with 300k+ English words.
I looked up phonetic transcriptions with this free dictionary API and got about 30k transcriptions. Then I extracted syllable counts, stress positions and stressed syllables from them to make this new dataset.
words_stress_analyzed.csv is the final dataset. The other files are just intermediate steps in the process.
Column | Datatype | Example | Description |
---|---|---|---|
word | str | complimentary | the English words |
phonetic | str | /ˌkɒmplɪ̈ˈment(ə)ɹɪ/ | the phonetic transcription of the words |
part_of_speech | str (list-like) | ['adjective'] | how these words are used in sentences |
syllable_len | int | 5 | how many syllables are there in these words |
stress_pos | int | 3 | the syllable on which the stress falls; if there is more than one stress, this is the position of the first stress |
stress_syllable | str | e | the vowel of the stressed syllable |
Note: The absence of a stress symbol in some short words led to blanks in this dataset. It is recommended to filter out rows with an empty stress_syllable and rows where syllable_len is 1.
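A small pandas sketch of that recommended filtering (column names taken from the table above):

```python
import pandas as pd

df = pd.read_csv("words_stress_analyzed.csv")

# Drop rows with no detected stress and one-syllable words, as recommended above
clean = df[df["stress_syllable"].notna() & (df["stress_syllable"] != "") & (df["syllable_len"] > 1)]
print(f"{len(df)} rows before filtering, {len(clean)} after")
```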
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset contains information about a number of participants (participants.csv) in a workshop who need to be assigned to a number of rooms (rooms.csv).
Restrictions:
1. The workshop has 5 different activities.
2. Each participant has indicated their first, second and third preferences for the available activities (Priority1, Priority2 and Priority3 columns in participants.csv).
3. Participants are part of teams (Team column in participants.csv) and should be assigned together.
4. Each activity lasts for half a day, and each participant will take part in one activity in the morning and one activity in the afternoon.
5. Each room must contain the SAME activity in the morning and in the afternoon.
Requirements:
A. Define the way in which each participant should be assigned, through a CSV file in the format Name;ActivityAM;RoomAM, ActivityPM;RoomPM.
B. Maximize the number of people getting their 1st and 2nd preferences.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Dataset accompanying the paper "The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning" (https://arxiv.org/abs/2305.14045), including 1.88M CoT rationales extracted across 1,060 tasks.
From the release repo https://github.com/kaistAI/CoT-Collection: Large Language Models (LLMs) have shown enhanced capabilities of solving novel tasks by reasoning step-by-step, known as Chain-of-Thought (CoT) reasoning. How can we instill the same capability of reasoning step-by-step on unseen tasks into LMs that possess fewer than 100B parameters? To address this question, we first introduce the CoT Collection, a new instruction-tuning dataset that augments 1.88 million CoT rationales across 1,060 tasks. We show that continually fine-tuning Flan-T5 (3B & 11B) with the CoT Collection enables the 3B & 11B LMs to perform CoT better on unseen tasks, leading to an improvement in the average zero-shot accuracy on 27 datasets of the BIG-Bench-Hard benchmark by +4.34% and +2.44%, respectively. Furthermore, we show that instruction tuning with CoT allows LMs to possess stronger few-shot learning capabilities, resulting in an improvement of +2.97% and +2.37% on 4 domain-specific tasks over Flan-T5 (3B & 11B), respectively.
This dataset contains trajectory data for the UR3 robot moving from point A to point B using imitation learning. The data was collected using a Spacemouse to manually control the robotic arm. This dataset is intended to facilitate research and development in robotic motion planning and control, specifically focusing on imitation learning algorithms.
The dataset is organized into several CSV files, each representing different trajectories and positions (joint positions and tool positions). The files are structured as follows:
Each CSV file contains the following columns:
Tool Position Files
- x: The x-coordinate of the end-effector relative to the base coordinate system.
- y: The y-coordinate of the end-effector relative to the base coordinate system.
- z: The z-coordinate of the end-effector relative to the base coordinate system.
- rx: The rotation around the x-axis relative to the base coordinate system.
- ry: The rotation around the y-axis relative to the base coordinate system.
- rz: The rotation around the z-axis relative to the base coordinate system.
Joint Position Files
- base: The position of the base joint relative to its neutral position.
- shoulder: The position of the shoulder joint relative to its neutral position.
- elbow: The position of the elbow joint relative to its neutral position.
- wrist1: The position of the first wrist joint relative to its neutral position.
- wrist2: The position of the second wrist joint relative to its neutral position.
- wrist3: The position of the third wrist joint relative to its neutral position.
Usage: This dataset is intended for use in training and testing imitation learning algorithms.
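A minimal loading sketch is shown below; the file name is a placeholder, but the columns follow the tool-position description above.

```python
import numpy as np
import pandas as pd

traj = pd.read_csv("trajectory_01_tool_positions.csv")   # hypothetical file name

# Summary statistics of the end-effector pose
print(traj[["x", "y", "z", "rx", "ry", "rz"]].describe())

# Approximate path length of the end-effector (sum of Euclidean steps in x, y, z)
steps = np.diff(traj[["x", "y", "z"]].to_numpy(), axis=0)
print("approximate path length:", np.linalg.norm(steps, axis=1).sum())
```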
**New to machine learning and data science? No question is too basic or too simple. Use this place to post any first-timer clarifying questions about the classification algorithm or the dataset.** This file contains demographics about customers and whether each customer clicked the ad or not. Use this file with a classification algorithm to predict the click based on the customer demographics as independent variables.
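Here is a hedged starter sketch; the file name and the 'Clicked on Ad' target column name are assumptions and may differ from the actual dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

data = pd.read_csv("advertising.csv")                     # hypothetical file name

# Keep numeric demographic columns as features; the target column name is an assumption
X = data.select_dtypes("number").drop(columns=["Clicked on Ad"])
y = data["Clicked on Ad"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```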
This data set contains the following features:
Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
🔗 Check out my notebook here: Link
This dataset includes malnutrition indicators and some of the features that might impact malnutrition. The detailed description of the dataset is given below:
Percentage-of-underweight-children-data: Percentage of children aged 5 years or below who are underweight by country.
Prevalence of Underweight among Female Adults (Age Standardized Estimate): Percentage of female adults whose BMI is less than 18.
GDP per capita (constant 2015 US$): GDP per capita is gross domestic product divided by midyear population. GDP is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2015 U.S. dollars.
Domestic general government health expenditure (% of GDP): Public expenditure on health from domestic sources as a share of the economy as measured by GDP.
Maternal mortality ratio (modeled estimate, per 100,000 live births): Maternal mortality ratio is the number of women who die from pregnancy-related causes while pregnant or within 42 days of pregnancy termination per 100,000 live births. The data are estimated with a regression model using information on the proportion of maternal deaths among non-AIDS deaths in women ages 15-49, fertility, birth attendants, and GDP measured using purchasing power parities (PPPs).
Mean-age-at-first-birth-of-women-aged-20-50-data: Average age at which women of age 20-50 years have their first child.
School enrollment, secondary, female (% gross): Gross enrollment ratio is the ratio of total enrollment, regardless of age, to the population of the age group that officially corresponds to the level of education shown. Secondary education completes the provision of basic education that began at the primary level, and aims at laying the foundations for lifelong learning and human development, by offering more subject- or skill-oriented instruction using more specialized teachers.
This dataset was created by Jiayang Gao