Given below are three files that you will be using for the challenge. Download all the files. The training file has a labelled data set; the test file has only the features. Work out your algorithm, make predictions on the test file, and then create a submissions.csv file that will be evaluated. You may refer to the sample_submission.csv file to understand the overall structure of your submission. The dataset consists of overall stats of players in ODIs only.
File descriptions:
train.csv - the training set
test.csv - the test set
sampleSubmission.csv - a sample submission file in the correct format

Data fields:
id - an anonymous id unique to the player
Name - Name of the player
Age - Age
100s - Number of centuries of the player
50s - Number of half centuries of the player
6s - Total number of sixes hit by the player
Balls - Number of balls bowled by the player
Bat_Average - Average batting score
Bowl_Strike_Rate - Average number of balls bowled per wicket taken
Balls faced - Number of balls faced
Economy - Average number of runs conceded for each over bowled
Innings - Number of innings played
Overs - Number of overs bowled
Maidens - Overs in which no run was conceded
Runs - Total runs scored by the player
Wickets - Number of wickets taken
Ratings - Final rating of the player
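A minimal end-to-end sketch of the train/predict/submit workflow described above, assuming the column names match the data fields listed and that Ratings is the target; treat it as a starting point, not the reference solution.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Use the player stats as features; drop the id, name, and target columns.
features = [c for c in train.columns if c not in ("id", "Name", "Ratings")]

model = RandomForestRegressor(n_estimators=300, random_state=0)
model.fit(train[features], train["Ratings"])

# Write predictions in the structure shown by the sample submission file.
submission = pd.DataFrame({"id": test["id"], "Ratings": model.predict(test[features])})
submission.to_csv("submissions.csv", index=False)
```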
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by SURIYA CHAYATUMMAGOON
Released under MIT
Welcome to another exciting weekend hackathon, where you will flex your machine learning classification skills by classifying flowers into 8 different classes. To recognize the right flower, you will use 6 different attributes to classify it into the right class (0-7). Computer-vision approaches to such recognition have reached the state of the art, but collecting image data requires a lot of human labor to annotate the images with labels/bounding boxes for detection/segmentation tasks. Hence, generic attributes that can be collected easily from various areas/localities/regions were captured for various species of flowers.
In this hackathon, we are challenging the MachineHack community to use classical machine learning classification techniques to come up with a model that generalizes well on unseen data, given explanatory attributes about the flower species instead of a picture.
In this competition, you will be learning advanced classification techniques, handling high-cardinality categorical variables, and much more (see the encoding sketch after the attribute list below).
Dataset Description:
Train.csv - 12666 rows x 7 columns (includes Class as target column)
Test.csv - 29555 rows x 6 columns
Sample Submission.csv - Please check the Evaluation section for more details on how to generate a valid submission.
Attributes Description:
Area_Code - Generic code of the area the species were collected from
Locality_Code - Code of the locality the species were collected from
Region_Code - Code of the region the species were collected from
Height - Height collected from lab data
Diameter - Diameter collected from lab data
Species - Species of the flower
Class - Target column (classes 0-7)
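Since Area_Code, Locality_Code, and Region_Code can take many distinct values, one simple way to handle the high cardinality is frequency encoding. A minimal sketch, assuming the file and column names above:

```python
import pandas as pd

train = pd.read_csv("Train.csv")
test = pd.read_csv("Test.csv")

for col in ["Area_Code", "Locality_Code", "Region_Code", "Species"]:
    # Map each category to its relative frequency in the training data;
    # categories unseen at train time fall back to 0.
    freq = train[col].value_counts(normalize=True)
    train[col + "_freq"] = train[col].map(freq)
    test[col + "_freq"] = test[col].map(freq).fillna(0.0)
```

Target encoding is another common choice here, but it needs out-of-fold computation to avoid leaking the Class column into the features.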
GPL 2.0 License: http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
The dataset is from MachineHack.
Buyers spend a significant amount of time surfing an e-commerce store, and since the pandemic, e-commerce has seen a boom in the number of users across domains. In the meantime, store owners are planning to attract customers using various algorithms that leverage customer behavior patterns.

Tracking customer activity is also a great way of understanding customer behavior and figuring out what can be done to serve customers better. Machine learning and AI have already played a significant role in designing recommendation engines that lure customers by predicting their buying patterns.
Overview

Foreseeing bugs, features, and questions on GitHub can be fun, especially when one is provided with a colossal dataset of GitHub issues. In this hackathon, we are challenging the MachineHack community to come up with an algorithm that can predict bugs, features, and questions based on GitHub titles and text bodies. With text data there can be a lot of challenges, especially when the dataset is big. Analyzing such a dataset requires a lot to be taken into account, mainly due to the preprocessing involved in representing raw text and making it machine-understandable. Usually, we stem and lemmatize the raw text and then represent it using TF-IDF, word embeddings, etc.
However, with state-of-the-art NLP models such as Transformer-based BERT models, one can skip manual feature engineering like TF-IDF and count vectorizers. In this short span of time, we encourage you to leverage the ImageNet moment of NLP (transfer learning) using various pre-trained models.
In this hackathon, we also have an interesting learning curve for all machine learning specialists: writing quality code to win the prizes, as the evaluation involves a code quality score from the Embold Code Analysis platform.

Every participant has to register on Embold's platform for free as a mandatory step before proceeding with the hackathon.

Here is a quick tour of how to use the Embold Code Analysis Platform for free.
Dataset Description:
Train.json - 150000 rows x 3 columns (includes the label column as the target variable)
Test.json - 30000 rows x 2 columns
Train_extra.json - 300000 rows x 3 columns (includes the label column as the target variable); provided solely for training purposes and can be appended to Train.json when training the model
Sample Submission.csv - Please check the Evaluation section for more details on how to generate a valid submission
Attribute Description:
Title - the title of the GitHub bug, feature, or question
Body - the body of the GitHub bug, feature, or question
Label - the class of the issue: Bug - 0, Feature - 1, Question - 2

Skills:
Natural language processing
Feature extraction from raw text using TF-IDF, CountVectorizer
Using word embeddings to represent words as vectors
Using pretrained models like Transformers, BERT
Optimizing accuracy score as a metric to generalize well on unseen data
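A hedged TF-IDF baseline for the three-way classification, assuming Train.json/Test.json are record-oriented with the Title, Body, and Label fields described above:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train = pd.read_json("Train.json")
test = pd.read_json("Test.json")

# Concatenate title and body into a single text field per issue.
train_text = train["Title"].fillna("") + " " + train["Body"].fillna("")
test_text = test["Title"].fillna("") + " " + test["Body"].fillna("")

vec = TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))
X_train = vec.fit_transform(train_text)
X_test = vec.transform(test_text)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train["Label"])
preds = clf.predict(X_test)  # 0 = bug, 1 = feature, 2 = question
```

A BERT-style fine-tune would replace the vectorizer and classifier, but this gives a quick sanity-check score first.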
Welcome to the House Price Prediction Challenge, in which you will test your regression skills by designing an algorithm to accurately predict house prices in India. Accurately predicting house prices can be a daunting task. Buyers are not concerned solely with the size (square feet) of a house; various other factors play a key role in deciding the price of a house/property. It can be extremely difficult to figure out the right set of attributes contributing to buyer behavior. This dataset has been collected from various property aggregators across India. In this competition, given the 12 influencing factors, your role as a data scientist is to predict prices as accurately as possible.
This competition also gives you a lot of room for feature engineering and for mastering advanced regression techniques such as Random Forests, deep neural nets, and various other ensembling techniques.
Data Description:
Train.csv - 29451 rows x 12 columns
Test.csv - 68720 rows x 11 columns
Sample Submission - Acceptable submission format (.csv/.xlsx file with 68720 rows)
Attributes Description:
POSTED_BY - Category marking who has listed the property
UNDER_CONSTRUCTION - Under construction or not
RERA - RERA approved or not
BHK_NO - Number of rooms
BHK_OR_RK - Type of property
SQUARE_FT - Total area of the house in square feet
READY_TO_MOVE - Category marking ready to move or not
RESALE - Category marking resale or not
ADDRESS - Address of the property
LONGITUDE - Longitude of the property
LATITUDE - Latitude of the property
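A minimal Random Forest sketch for this task. The target column name is not stated above, so "PRICE" below is a hypothetical placeholder for whatever the 12th column in Train.csv is actually called:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

train = pd.read_csv("Train.csv")
target = "PRICE"  # hypothetical name; check Train.csv for the real one

# Start from the numeric attributes; POSTED_BY, BHK_OR_RK, and ADDRESS
# would need encoding before they can be added as features.
features = ["UNDER_CONSTRUCTION", "RERA", "BHK_NO", "SQUARE_FT",
            "READY_TO_MOVE", "RESALE", "LONGITUDE", "LATITUDE"]

model = RandomForestRegressor(n_estimators=300, random_state=0)
scores = cross_val_score(model, train[features], train[target],
                         scoring="neg_root_mean_squared_error", cv=5)
print("CV RMSE:", -scores.mean())
```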
This dataset was taken from a machine hack competition. Link to the competition: machine hack competition
Dataset Source: https://machinehack.com/hackathons/dare_in_reality_hackathon/overview
Overview

In the heat of a Formula E race, teams need fast access to insights that can help drivers make split-second decisions and cross the finish line first. Can your data-science skills help Envision Racing, one of the founding teams in the championship, take home even more trophies?
To do so, you will have to build a machine learning model that predicts the Envision Racing drivers’ lap times for the all-important qualifying sessions that determine what position they start the race in. Winning races involves a combination of both a driver’s skills and data analytics. To help the team you’ll need to consider several factors that affect performance during a session, including weather, track conditions, and a driver’s familiarity with the track.
Genpact, a leading professional services firm that focuses on digital transformation, is collaborating with Envision Racing, a Formula E racing team, and the digital hackathon platform MachineHack, a brainchild of Analytics India Magazine, to launch 'Dare in Reality'. This two-week hackathon allows data science professionals, machine learning engineers, artificial intelligence practitioners, and other tech enthusiasts to showcase their skills, impress the judges, and stand a chance to win exciting cash prizes.
Genpact (NYSE: G) is a global professional services firm that makes business transformation real, driving digital-led innovation and digitally enabled intelligent operations for our clients.
Challenge
Hackathon Starts: Datasets will be made live on 08th November at 06:00 PM IST.

Dataset Description
train.csv - 10276 rows x 25 columns (includes target column LAP_TIME)
test.csv - 420 rows x 25 columns (includes target column LAP_TIME)
submission.csv - Please check the Evaluation section for more details on how to generate a valid submission.

Attributes
NUMBER: Number in sequence
DRIVER_NUMBER: Driver number
LAP_NUMBER: Lap number
LAP_TIME: Lap time in seconds
LAP_IMPROVEMENT: Number of lap improvements
CROSSING_FINISH_LINE_IN_PIT
S1: Sector 1 time in [min:sec.microseconds]
S1_IMPROVEMENT: Improvement in sector 1
S2: Sector 2 time in [min:sec.microseconds]
S2_IMPROVEMENT: Improvement in sector 2
S3: Sector 3 time in [min:sec.microseconds]
S3_IMPROVEMENT: Improvement in sector 3
KPH: Speed in kilometers/hour
ELAPSED: Time elapsed in [min:sec.microseconds]
HOUR: In [min:sec.microseconds]
S1_LARGE: In [min:sec.microseconds]
S2_LARGE: In [min:sec.microseconds]
S3_LARGE: In [min:sec.microseconds]
DRIVER_NAME: Name of the driver
PIT_TIME: Time the car stops in the pits for fuel and other consumables to be renewed or replenished
GROUP: Group of the driver
TEAM: Team name
POWER: Brake horsepower (bhp)
LOCATION: Location of the event
EVENT: Free practice or qualifying

The challenge is to predict the LAP_TIME for the qualifying groups of locations 6, 7, and 8.

Knowledge and Skills
Multivariate regression
Big dataset, underfitting vs overfitting
Optimizing RMSLE to generalize well on unseen data

Final winners will be notified via email, based on an aggregate score of their private leaderboard rankings.
What is the Metric in this competition? How is the Leaderboard calculated?
The submission will be evaluated using the RMSLE metric. One can use np.sqrt(mean_squared_log_error(actual, predicted)) to calculate it.
This hackathon supports private and public leaderboards.
The public leaderboard is evaluated on 30% of the test data.
The private leaderboard will be made available at the end of the hackathon and will be evaluated on 100% of the test data.
The Final Score represents the score achieved based on the best score on the public leaderboard.

How to Generate a valid Submission File
Sklearn models support the predict() method to generate the predicted values.
You should submit a .csv file with exactly 420 rows and 1 column (LAP_TIME). Your submission will return an Invalid Score if you have extra columns or rows.
The file should have exactly 1 column.
Note: Do not shuffle the sequence of the test series
Using Pandas:
submission_df.to_csv('my_submission_file.csv', index=False)
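Putting the metric and the submission format together: a hedged sketch of validating with RMSLE locally and writing the one-column file, with placeholder arrays standing in for a real validation split and model output.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_log_error

# Placeholders for a real validation split and model predictions.
y_val = np.array([92.3, 94.1, 91.8])
val_preds = np.array([93.0, 93.5, 92.2])

rmsle = np.sqrt(mean_squared_log_error(y_val, val_preds))
print(f"validation RMSLE: {rmsle:.4f}")

# Exactly 420 rows, a single LAP_TIME column, and the original
# test-row order preserved (do not shuffle the test sequence).
test_preds = np.full(420, 93.0)  # placeholder predictions
submission_df = pd.DataFrame({"LAP_TIME": test_preds})
submission_df.to_csv("my_submission_file.csv", index=False)
```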
Your goal is to predict a student's earnings a set number of years after they have enrolled in a United States institution of higher education. The data is compiled from a wide range of sources and made publicly available by the United States Department of Education.
We're trying to predict the variable income, which represents earnings in thousands of US dollars a set interval from when the student first enrolled.
The format for the submission file is two columns: row_id and income. The data type of income is a float, so make sure there is a decimal point in your submission. For example, 0.0 is a valid float; 0 is not.
For example, if you predicted...
| row_id | income |
|---|---|
| 2 | 0.0 |
| 8 | 0.0 |
| 9 | 0.0 |
| 10 | 0.0 |
| 11 | 0.0 |
The first few lines of the .csv file that you submit would look like:
row_id,income
2,0.0
8,0.0
9,0.0
10,0.0
11,0.0
We're predicting a numeric quantity, so this is a regression problem. To measure it, we'll use a metric called root mean squared error (RMSE). It is an error metric, so a lower value is better (as opposed to an accuracy metric, where a higher value is better).
\[RMSE = \sqrt{\frac{1}{N}\sum_{n=1}^{N} (\hat{y}_n - y_n)^2 }\]
Where $\hat{y}_n$ is the predicted earnings and $y_n$ is the actual earnings. The best possible score is 0, but the worst possible score can be infinite.
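A direct translation of the formula above, with a tiny worked example: if every prediction is off by 0.5, the RMSE is 0.5.

```python
import numpy as np

def rmse(y_pred: np.ndarray, y_true: np.ndarray) -> float:
    # Square root of the mean squared difference, exactly as in the formula.
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

print(rmse(np.array([1.5, 2.5]), np.array([1.0, 2.0])))  # 0.5
```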
There are 297 variables in this dataset. Each row in the dataset represents a United States institution of higher education in a specific year. The dataset we are working with covers four particular years, denoted year_a, year_f, year_w, and year_z in our dataset. An institution may have a row for all, some, or just for one of the years. We don't provide a unique identifier for an individual institution, just a row_id for each row.
The variables in the dataset have names of the form category_variable, where category is the high-level category of the variable (e.g. academics or students) and variable is what the specific column contains.
academics
program_assoc_agriculture: Associate degree in Agriculture, Agriculture Operations, And Related Sciences.
program_assoc_architecture: Associate degree in Architecture And Related Services.
program_assoc_biological: Associate degree in Biological And Biomedical Sciences.
program_assoc_business_marketing: Associate degree in Business, Management, Marketing, And Related Support Services.
program_assoc_communication: Associate degree in Communication, Journalism, And Related Programs.
program_assoc_communications_technology: Associate degree in Communications Technologies/Technicians And Support Services.
program_assoc_computer: Associate degree in Computer And Information Sciences And Support Services.
program_assoc_construction: Associate degree in Construction Trades.
program_assoc_education: Associate degree in Education.
program_assoc_engineering: Associate degree in Engineering.
program_assoc_engineering_technology: Associate degree in Engineering Technologies And Engineering-Related Fields.
program_assoc_english: Associate degree in English Language And Literature/Letters.
program_assoc_ethnic_cultural_gender: Associate degree in Area, Ethnic, Cultural, Gender, And Group Studies.
program_assoc_family_consumer_science: Associate degree in Family And Consumer Sciences/Human Sciences.
program_assoc_health: Associate degree in Health Professions And Related Programs.
program_assoc_history: Associate degree in History.
program_assoc_humanities: Associate degree in Liberal Arts And Sciences, General Studies And Humanities.
program_assoc_language: Associate degree in Foreign Languages, Literatures, And Linguistics.
program_assoc_legal: Associate degree in Legal Professions And Studies.
program_assoc_library: Associate degree in Library Science.
program_assoc_mathematics: Associate degree in Mathematics And Statistics.
program_assoc_mechanic_repair_technology: Associate degree in Mechanic And Repair Technologies/Technicians.
program_assoc_military: Associate degree in Military Technologies And Applied Sciences.
...
CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/
The current pandemic has shrunk the data science job market, and recruiters are likewise facing difficulties filtering the right talent. To bridge this gap, we bring the MachineHack community a chance to compete for jobs with some of the key analytics players for a rewarding career in Data Science. In this competition, we are challenging the MachineHack community to come up with an algorithm to predict the price of retail items belonging to different categories. Forecasting retail prices can be a daunting task due to huge datasets with a variety of attributes ranging from text, numbers (floats, integers), and DateTime. Also, outliers can be a big problem when dealing with unit prices.
With a key focus on the Data Scientist role in an esteemed organization, this hackathon can help freshers and experienced folks alike prove their mettle and land a rewarding career.

By participating in this hackathon, every participant will be eligible for the Data Scientist job role, provided their MachineHack information and resume are up to date.
Train.csv - 284780 rows x 8 columns (includes the UnitPrice column as the target)
Test.csv - 122049 rows x 7 columns
Sample Submission.csv - Please check the Evaluation section for more details on how to generate a valid submission
Invoice No - Invoice ID, encoded as a label
StockCode - Unique code per stock item, encoded as a label
Description - The item description, encoded as a label
Quantity - Quantity purchased
InvoiceDate - Date of purchase
UnitPrice - The target value, the price of every product
CustomerID - Unique identifier for every customer
Country - Country of sale, encoded as a label
Multivariate regression
Big dataset, underfitting vs overfitting
Optimizing RMSE to generalize well on unseen data
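One common way to soften the unit-price outliers mentioned above is to train on log1p(UnitPrice) and invert with expm1 at prediction time. A hedged sketch, assuming the column names from the description and non-negative prices:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

train = pd.read_csv("Train.csv")
train = train[train["UnitPrice"] >= 0]  # log1p needs non-negative targets

# These columns are described as label-encoded, so they are already numeric.
features = ["StockCode", "Description", "Quantity", "CustomerID", "Country"]
X = train[features].fillna(-1)

model = GradientBoostingRegressor(random_state=0)
model.fit(X, np.log1p(train["UnitPrice"]))

preds = np.expm1(model.predict(X))  # back to the original price scale
```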
This dataset is taken from Machine Hack challenge https://www.machinehack.com/hackathons/retail_price_prediction_mega_hiring_hackathon/overview
All you need to know about the "Quote to Code I: Pandora's Box by Indus OS":
You are provided with data about:
app metadata: metadata about the app
user metadata: metadata about the user
app installs: apps installed by a user in the last six months
app usage: apps used by the user in the last one week
actual set: for participants to validate their model and results
validation set: UIDs for which participants need to predict the top four recommendations from the universe of apps in the app metadata
data dictionary: check this file for more details about the datasets
sample submission: schema for submitting the final results

Participants need to predict the top four recommendations from the universe of apps in the app metadata.
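A popularity baseline for the top-4 task: rank apps by install count and recommend the four most popular apps a user does not already have. The file and column names here are assumptions based on the description above:

```python
import pandas as pd

installs = pd.read_csv("app_installs.csv")  # assumed file name
# Apps ordered from most to least installed across all users.
popular = list(installs["app_id"].value_counts().index)  # assumed column

def top4(user_apps: set) -> list:
    """Return the four most popular apps the user does not already have."""
    return [app for app in popular if app not in user_apps][:4]

print(top4({"com.example.alpha"}))  # hypothetical app id
```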
Note: The purpose of uploading the data is to support Kagglers in making predictions for this MachineHack competition (source: https://machinehack.com/hackathons/quote_to_code_i_pandoras_box/data).
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This dataset is the "development dataset" for the DCASE 2021 Challenge Task 2 "Unsupervised Anomalous Sound Detection for Machine Condition Monitoring under Domain Shifted Conditions".
The data consists of the normal/anomalous operating sounds of seven types of real/toy machines. Each recording is a single-channel 10-second audio that includes both a machine's operating sound and environmental noise. The following seven types of real/toy machines are used in this task:
Fan, gearbox, pump, slide rail, ToyCar, ToyTrain, and valve.
Why focus on domain shift?
The task setup of the 2020 version was ASD under ideal conditions: the training- and testing-phase datasets were generated under the same recording conditions, and enough normal training clips recorded under the test domain were made available. In contrast, real-world cases are more complicated and often involve different machine operating conditions between the training and testing phases. A frequent example is a conveyor transporting products on a production line, whose motor speed varies continuously with the production volume in response to product demand. Since there is infinite variation in rotation speed, the sound will also change with infinite variation. Because demand for many products is seasonal, training data recorded during a limited period captures only the motor speeds of that period (e.g., 200-300 rpm for autumn), limiting the variation in the training data. However, in the test phase, the ASD system must continue to monitor the conveyor through all seasons, so it must be able to handle all possible motor speed conditions, including those that differ from the training data (such as 100-400 rpm). In addition to the conditions of the machine, environmental noise conditions (SNR, sound characteristics, etc.) also fluctuate uncontrollably depending on the seasonal demand. In such a situation, the distribution of the normal state will change (i.e., domain shift).
Definition
First, we define some important terms in this task: "machine type," "section," "source domain," and "target domain."
The machine type means the kind of machine, which can be one of seven in this task: fan, gearbox, pump, slide rail, ToyCar, ToyTrain, and valve. The section is defined as a subset of the dataset for calculating performance metrics and is almost identical to what was called "machine ID" in the 2020 version. In the 2020 version, there was a one-to-one correspondence between machine IDs and products, but in the 2021 version, the same product may appear in different sections. Different products may appear in the same section. The source domain means the condition under which most of the training data was recorded, and the target domain means a different condition under which some of the test data was recorded. The source and target domains differ in terms of operating speed, machine load, viscosity, heating temperature, environmental noise, SNR, etc.
Data
This dataset consists of three sections for each machine type (Section 00, 01, and 02), and each section is a complete set of training and test data. For each section, this dataset provides (i) around 1,000 clips of normal sounds in a source domain for training, (ii) only three clips of normal sounds in a target domain for training, (iii) around 100 clips each of normal and anomalous sounds in the source domain for the test, and (iv) around 100 clips each of normal and anomalous sounds in the target domain for the test.
Recording procedure
Normal/anomalous operating sounds of machines and related equipment were recorded. Anomalous sounds were collected by deliberately damaging machines. To simplify the task, we only used the first channel of the multi-channel recordings; all recordings were regarded as single-channel recordings from a fixed microphone. We mixed a machine sound with environmental noise, and only noisy recordings are provided as training/test data. The environmental noise clips were recorded in several real factory environments. We will publish papers on the dataset to explain the details of the recording procedure by the submission deadline.
Reference labels
The given labels for each training/test clip are machine type, section index, normal/anomaly information, and brief attribute information about conditions other than normal/abnormal. The machine type information is given by the directory name. The section index is given by their respective file names. For the datasets other than the evaluation dataset, the normal/anomaly information is given by their respective file names. For the training data, the attribute information is given by their respective file names.
Baseline system
Two simple baseline systems are available on the Github repository [URL] and [URL]. The baseline systems provide a simple entry-level approach that gives a reasonable performance in the dataset of Task 2. They...
Overview

Deloitte refers to one or more of Deloitte Touche Tohmatsu Limited ("DTTL"), its global network of member firms, and their related entities (collectively, the "Deloitte organization"). DTTL (also referred to as "Deloitte Global") and each of its member firms and related entities are legally separate and independent entities, which cannot obligate or bind each other in respect of third parties. DTTL and each DTTL member firm and related entity is liable only for its own acts and omissions, and not those of each other. DTTL does not provide services to clients. Please see www.deloitte.com/about to learn more.
All the facts and figures that talk to our size and diversity and years of experiences, as notable and important as they may be, are secondary to the truest measure of Deloitte: the impact we make in the world.
So, when people ask, "what's different about Deloitte?", the answer resides in the many specific examples of where we have helped Deloitte member firm clients, our people, and sections of society to achieve remarkable goals, solve complex problems, or make meaningful progress. Deeper still, it's in the beliefs, behaviours, and fundamental sense of purpose that underpin all that we do. Globally, Deloitte has grown in scale and diversity (more than 345,000 people in 150 countries, providing multidisciplinary services), yet our shared culture remains the same.
(C) 2021 Deloitte Touche Tohmatsu India LLP
Dataset Link: https://machinehack-staging.netlify.app/hackathons/deloitte_hackathon_predict_the_loan_defaulter/overview
**The data has been posted here for easy use of Kaggle kernels by competition participants. I do not claim any ownership of the data.**
Challenge Dataset Description
Train.csv - 70,000 rows x 40 columns (includes the target column Loan Status)
Test.csv - 30,000 rows x 39 columns
Sample Submission.csv - Please check the Evaluation section for more details on how to generate a valid submission.

Attributes: ID, Loan Amount, Funded Amount, Funded Amount Investor, Term, Batch Enrolled, Interest Rate, Grade, Sub Grade, CTC, Designation, Employment Duration, Home Ownership, Verification Status, Payment Plan, Loan Purpose, Loan Title, Zip Code, Address State, Debt to Income, Delinquency - two years, Inquires - six months, Open Account, Public Record, Revolving Balance, Revolving Utilities, Total Accounts, Initial List Status, Total Received Interest, Total Received Late Fee, Recoveries, Collection Recovery Fee, Collection 12 months Medical, Application Type, Last week Pay, Accounts Delinquent, Total Collection Amount, Total Current Balance, Total Revolving Credit Limit, Loan Status.

The challenge is to predict the Loan Status.

Knowledge and Skills
Big dataset, underfitting vs overfitting
Optimising log_loss to generalise well on unseen data
Competition Rules:
ONE ACCOUNT PER PARTICIPANT
One account per participant. Submissions from multiple accounts will lead to disqualification. All registered users are eligible to participate in the hackathon. We ask that you respect the spirit of the competition and do not cheat.
NO PRIVATE SHARING OUTSIDE TEAMS No private sharing outside teams. Any discrepancies reported will be taken seriously and can lead to disqualification.
SUBMISSION LIMITS
The submission limit for the hackathon is 3 per day, after which submissions will not be evaluated. All registered users are eligible to participate in the hackathon. We ask that you respect the spirit of the competition and do not cheat.
COMPETITION TIMELINE
Start Date: 26/11/2021 End Date: 13/12/2021
Hackathon Specific Rules
Deadline: This hackathon will expire on 22nd November at 06:00 PM IST.
Disqualification: Analytics India Magazine and Deloitte reserve the right to disqualify any participant if the details provided are found incorrect. Any external dataset usage is strictly prohibited; participants will be disqualified if found using any external dataset.
Evaluation
What is the Metric in this competition? How is the Leaderboard calculated?
The submission will be evaluated using the Log Loss metric. One can use sklearn.metrics.log_loss to calculate it.
This hackathon supports private and public leaderboards.
The public leaderboard is evaluated on 30% of the test data.
The private leaderboard will be made available at the end of the hackathon and will be evaluated on 100% of the test data.
The Final Score represents the score achieved based on the best score on the public leaderboard.

How to Generate a valid Submission File
Sklearn models support the predict() method to generate the predicted values.
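A minimal log-loss check with toy values only; note the metric scores predicted probabilities, not hard labels, so a probability output (e.g. from predict_proba) is what you want to validate:

```python
from sklearn.metrics import log_loss

y_true = [0, 1, 1, 0]            # toy ground-truth loan statuses
y_prob = [0.1, 0.8, 0.65, 0.3]   # toy P(Loan Status = 1) per row
print(log_loss(y_true, y_prob))  # lower is better
```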
You should submit a .csv file with exactly 100,000 rows and 1 column (loan_status). Your submission will return an Invalid Score if you hav...
Welcome back to the 'Weekend Hackathon Edition 2 - The Last Hacker Standing' at MachineHack. In this edition, we will be posing unique problem statements every week that test various aspects of being a Data Scientist. The Weekend Edition will be held over a six-week period, from 30 July 2021 to 9 Sept 2021.
This time it is dedicated to the passion and fervour that a sport creates.

Challenge Name: THE SOCCER FEVER
Soccer, aka football, is the most popular game in the world. It's a religion of its own. If groups of 10 people can stop time and make people watch them in awe and reverence, it's this beautiful game. Also, anybody can play soccer: all it needs is 4 poles, a ground, and a ball, and you can just get started.
In fact, Nelson Mandela very effectively used Football as the unifying factor when he was elected President of South Africa post the Apartheid era. The sport just cuts across all discriminating factors.
An entire ecosystem revolves around this beautiful sport: clubs, merchandise, listed football clubs, fan clubs, and groups of rivals who can get into a fight over the outcome of a game. The amount of currency involved in this game is phenomenal. It impacts millions of people who depend on it for their livelihood and recreation.

Criticality
We live in ambiguity and always need some information just to make a decision. Decisions are made based on possible outcomes: win/loss, pass/fail, etc.
The problem statement below is a classic study in decision-making and in understanding the odds stacked against a particular situation.
Train Dataset: 7443 rows x 21 columns
Target Column: Outcome
Evaluation Metric: Log Loss

Test Dataset: 4008 rows x 20 columns
Submission: 4008 rows x 1 column (Column Name - 'Outcome')

Skills:
Multi-class classification
Optimizing Log Loss
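Since the metric is log loss, class probabilities are what matter. A hedged validation sketch, assuming Train.csv holds the 21 columns described above with numeric features and the Outcome target:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

train = pd.read_csv("Train.csv")  # assumed file name
X = train.drop(columns=["Outcome"])
y = train["Outcome"]

X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=300, random_state=0)
clf.fit(X_tr, y_tr)

# Log loss is computed on predicted class probabilities, not labels.
print(log_loss(y_val, clf.predict_proba(X_val)))
```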
Payment transfer (Max. score: 100)
You are working as a data scientist with the Payments team of a bank. The team is continually responding to emerging threats by building cutting-edge machine-learning-driven models and strategies, working with best-in-class service providers specialized in counter-fraud solutions. In recent years, there has been increased scrutiny of digital payments to check their genuineness.

To help the team deal with this problem, you are provided with payments data and must predict whether the customers themselves made the transfer or not. The payments data contains the attributes that get captured when a payment is initiated by a banking customer.
Task
You are required to build a machine learning model that can predict whether the customer themselves have made the transfer or not.
Dataset description
The dataset folder contains the following files:
The leaderboard score is the area under the precision-recall curve, scaled to 100:

from sklearn import metrics

# precision_recall_curve takes the true labels and the predicted scores.
precision, recall, thresholds = metrics.precision_recall_curve(actual, predicted)
# metrics.auc expects the monotonic axis first, so recall goes on x.
score = max(0, 100 * metrics.auc(recall, precision))
Note: Ensure that your submission file contains the following:
With the rise in the variety of cars with differentiated capabilities and features, such as model, production year, category, brand, fuel type, engine volume, mileage, cylinders, colour, airbags, and many more, we are bringing a car price prediction challenge for all. We all aspire to own a car within budget with the best features available. To solve the price problem, we have created a training dataset of 19237 rows and a test dataset of 8245 rows.
In the heat of a Formula E race, teams need fast access to insights that can help drivers make split-second decisions and cross the finish line first. Can your data-science skills help Envision Racing, one of the founding teams in the championship, take home even more trophies?
To do so, you will have to build a machine learning model that predicts the Envision Racing drivers' lap times for the all-important qualifying sessions that determine what position they start the race in. Winning races involves a combination of both a driver's skills and data analytics. To help the team you'll need to consider several factors that affect performance during a session, including weather, track conditions, and a driver's familiarity with the track.
Genpact, a leading professional services firm that focuses on digital transformation, is collaborating with Envision Racing, a Formula E racing team, and the digital hackathon platform MachineHack, a brainchild of Analytics India Magazine, to launch 'Dare in Reality'. This two-week hackathon allows data science professionals, machine learning engineers, artificial intelligence practitioners, and other tech enthusiasts to showcase their skills, impress the judges, and stand a chance to win exciting cash prizes.
Genpact (NYSE: G) is a global professional services firm that makes business transformation real, driving digital-led innovation and digitally enabled intelligent operations for our clients.
Dataset Description
train.csv - 10276 rows x 25 columns (includes target column LAP_TIME)
test.csv - 420 rows x 25 columns (includes target column LAP_TIME)
submission.csv - Please check the Evaluation section for more details on how to generate a valid submission.

Attributes
NUMBER: Number in sequence
DRIVER_NUMBER: Driver number
LAP_NUMBER: Lap number
LAP_TIME: Lap time in seconds
LAP_IMPROVEMENT: Number of lap improvements
CROSSING_FINISH_LINE_IN_PIT
S1: Sector 1 time in [min:sec.microseconds]
S1_IMPROVEMENT: Improvement in sector 1
S2: Sector 2 time in [min:sec.microseconds]
S2_IMPROVEMENT: Improvement in sector 2
S3: Sector 3 time in [min:sec.microseconds]
S3_IMPROVEMENT: Improvement in sector 3
KPH: Speed in kilometers/hour
ELAPSED: Time elapsed in [min:sec.microseconds]
HOUR: In [min:sec.microseconds]
S1_LARGE: In [min:sec.microseconds]
S2_LARGE: In [min:sec.microseconds]
S3_LARGE: In [min:sec.microseconds]
DRIVER_NAME: Name of the driver
PIT_TIME: Time the car stops in the pits for fuel and other consumables to be renewed or replenished
GROUP: Group of the driver
TEAM: Team name
POWER: Brake horsepower (bhp)
LOCATION: Location of the event
EVENT: Free practice or qualifying

The challenge is to predict the LAP_TIME for the qualifying groups of locations 6, 7, and 8.

Knowledge and Skills
Multivariate regression
Big dataset, underfitting vs overfitting
Optimizing RMSLE to generalize well on unseen data
The hackathon and the dataset were published on MachineHack: https://machinehack.com/hackathons/dare_in_reality_hackathon/overview
The dataset is taken from MachineHack Weekend Hackathon #18.
The filenames are described as follows:
Train.csv: Training set containing 1558 feature columns (indices 0-1557) with information about various attributes collected from the manufacturing machine, plus a final target column (Class).
Test.csv: Test set containing 1558 feature columns (indices 0-1557) with information about various attributes collected from the manufacturing machine.
The submission .csv file will have just one column (Class), which stores the predicted value of the target variable.
Class (0 or 1): Represents the Good/Anomalous class labels for the products.
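A hedged baseline for the good/anomalous classification, assuming the last column of Train.csv is the Class target as described above:

```python
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

train = pd.read_csv("Train.csv")
X, y = train.iloc[:, :-1], train.iloc[:, -1]  # features, then Class target

clf = HistGradientBoostingClassifier(random_state=0)
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())

# One-column submission, as described above.
test = pd.read_csv("Test.csv")
clf.fit(X, y)
pd.DataFrame({"Class": clf.predict(test)}).to_csv("submission.csv", index=False)
```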
Dataset Description
train.csv - 10276 rows x 25 columns (includes target column LAP_TIME)
test.csv - 420 rows x 25 columns (includes target column LAP_TIME)
submission.csv - predict the LAP_TIME for the qualifying groups of locations 6, 7, and 8 for the test data.

Attributes
NUMBER: Number in sequence
DRIVER_NUMBER: Driver number
LAP_NUMBER: Lap number
LAP_TIME: Lap time in seconds
LAP_IMPROVEMENT: Number of lap improvements
CROSSING_FINISH_LINE_IN_PIT
S1: Sector 1 time in [min:sec.microseconds]
S1_IMPROVEMENT: Improvement in sector 1
S2: Sector 2 time in [min:sec.microseconds]
S2_IMPROVEMENT: Improvement in sector 2
S3: Sector 3 time in [min:sec.microseconds]
S3_IMPROVEMENT: Improvement in sector 3
KPH: Speed in kilometers/hour
ELAPSED: Time elapsed in [min:sec.microseconds]
HOUR: In [min:sec.microseconds]
S1_LARGE: In [min:sec.microseconds]
S2_LARGE: In [min:sec.microseconds]
S3_LARGE: In [min:sec.microseconds]
DRIVER_NAME: Name of the driver
PIT_TIME: Time the car stops in the pits for fuel and other consumables to be renewed or replenished
GROUP: Group of the driver
TEAM: Team name
POWER: Brake horsepower (bhp)
LOCATION: Location of the event
EVENT: Free practice or qualifying
Note: This data is from the MachineHack site below, to help Kaggle users make use of Kaggle notebooks for modelling: https://machinehack.com/hackathons/dare_in_reality_hackathon/data