Given below are three files that you will be using for the challenge. Download all the files. The training file has a labelled data set; the test file has only the features. Work out your algorithm, make predictions on the test file, and then create a submissions.csv file that will be evaluated. You may refer to the sample_submission.csv file to understand the overall structure of your submission. The dataset consists of overall stats of players in ODIs only.
File descriptions:
train.csv - the training set
test.csv - the test set
sampleSubmission.csv - a sample submission file in the correct format

Data fields:
id - an anonymous id unique to the player
Name - Name of the player
Age - Age
100s - Number of centuries of the player
50s - Number of half centuries of the player
6s - Total number of sixes hit by the player
Balls - Number of balls bowled by the player
Bat_Average - Average batting score
Bowl_Strike_Rate - Average number of balls bowled per wicket taken
Balls faced - Number of balls faced
Economy - Average number of runs conceded for each over bowled
Innings - Number of innings played
Overs - Number of overs bowled
Maidens - Overs in which no run was conceded
Runs - Total runs scored by the player
Wickets - Number of wickets taken
Ratings - Final rating of the player
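A minimal end-to-end sketch of the train/predict/submit workflow described above, assuming the column names match the data fields listed and that Ratings is the target; treat it as a starting point, not the reference solution.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Use the player stats as features; drop the id, name, and target columns.
features = [c for c in train.columns if c not in ("id", "Name", "Ratings")]

model = RandomForestRegressor(n_estimators=300, random_state=0)
model.fit(train[features], train["Ratings"])

# Write predictions in the structure shown by the sample submission file.
submission = pd.DataFrame({"id": test["id"], "Ratings": model.predict(test[features])})
submission.to_csv("submissions.csv", index=False)
```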
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by SURIYA CHAYATUMMAGOON
Released under MIT
Welcome to another exciting weekend hackathon, where you will flex your machine learning classification skills by classifying flowers into 8 different classes. To recognize the right flower, you will use 6 different attributes to classify it into the right class (0-7). Computer-vision approaches to such recognition have reached the state of the art, but collecting image data requires a lot of human labor to annotate the images with labels/bounding boxes for detection/segmentation tasks. Hence, generic attributes that can be collected easily from various areas/localities/regions were captured for various species of flowers.
In this hackathon, we are challenging the MachineHack community to use classical machine learning classification techniques to come up with a model that generalizes well on unseen data, given explanatory attributes about the flower species instead of a picture.
In this competition, you will be learning advanced classification techniques, handling high-cardinality categorical variables, and much more (see the encoding sketch after the attribute list below).
Dataset Description:
Train.csv - 12666 rows x 7 columns (includes Class as target column)
Test.csv - 29555 rows x 6 columns
Sample Submission.csv - Please check the Evaluation section for more details on how to generate a valid submission.
Attributes Description:
Area_Code - Generic code of the area the species were collected from
Locality_Code - Code of the locality the species were collected from
Region_Code - Code of the region the species were collected from
Height - Height collected from lab data
Diameter - Diameter collected from lab data
Species - Species of the flower
Class - Target column (classes 0-7)
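Since Area_Code, Locality_Code, and Region_Code can take many distinct values, one simple way to handle the high cardinality is frequency encoding. A minimal sketch, assuming the file and column names above:

```python
import pandas as pd

train = pd.read_csv("Train.csv")
test = pd.read_csv("Test.csv")

for col in ["Area_Code", "Locality_Code", "Region_Code", "Species"]:
    # Map each category to its relative frequency in the training data;
    # categories unseen at train time fall back to 0.
    freq = train[col].value_counts(normalize=True)
    train[col + "_freq"] = train[col].map(freq)
    test[col + "_freq"] = test[col].map(freq).fillna(0.0)
```

Target encoding is another common choice here, but it needs out-of-fold computation to avoid leaking the Class column into the features.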
GPL 2.0 License: http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
The dataset is from MachineHack.
Buyers spend a significant amount of time surfing an e-commerce store, and since the pandemic, e-commerce has seen a boom in the number of users across domains. In the meantime, store owners are planning to attract customers using various algorithms that leverage customer behavior patterns.

Tracking customer activity is also a great way of understanding customer behavior and figuring out what can be done to serve customers better. Machine learning and AI have already played a significant role in designing recommendation engines that lure customers by predicting their buying patterns.
Overview

Foreseeing bugs, features, and questions on GitHub can be fun, especially when one is provided with a colossal dataset of GitHub issues. In this hackathon, we are challenging the MachineHack community to come up with an algorithm that can predict bugs, features, and questions based on GitHub titles and text bodies. With text data there can be a lot of challenges, especially when the dataset is big. Analyzing such a dataset requires a lot to be taken into account, mainly due to the preprocessing involved in representing raw text and making it machine-understandable. Usually, we stem and lemmatize the raw text and then represent it using TF-IDF, word embeddings, etc.
However, with state-of-the-art NLP models such as Transformer-based BERT models, one can skip manual feature engineering like TF-IDF and count vectorizers. In this short span of time, we encourage you to leverage the ImageNet moment of NLP (transfer learning) using various pre-trained models.
In this hackathon, we also have an interesting learning curve for all machine learning specialists: writing quality code to win the prizes, as the evaluation involves a code quality score from the Embold Code Analysis platform.

Every participant has to register on Embold's platform for free as a mandatory step before proceeding with the hackathon.

Here is a quick tour of how to use the Embold Code Analysis Platform for free.
Dataset Description:
Train.json - 150000 rows x 3 columns (includes the label column as the target variable)
Test.json - 30000 rows x 2 columns
Train_extra.json - 300000 rows x 3 columns (includes the label column as the target variable); provided solely for training purposes and can be appended to Train.json when training the model
Sample Submission.csv - Please check the Evaluation section for more details on how to generate a valid submission
Attribute Description:
Title - the title of the GitHub bug, feature, or question
Body - the body of the GitHub bug, feature, or question
Label - the class of the issue: Bug - 0, Feature - 1, Question - 2

Skills:
Natural language processing
Feature extraction from raw text using TF-IDF, CountVectorizer
Using word embeddings to represent words as vectors
Using pretrained models like Transformers, BERT
Optimizing accuracy score as a metric to generalize well on unseen data
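A hedged TF-IDF baseline for the three-way classification, assuming Train.json/Test.json are record-oriented with the Title, Body, and Label fields described above:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train = pd.read_json("Train.json")
test = pd.read_json("Test.json")

# Concatenate title and body into a single text field per issue.
train_text = train["Title"].fillna("") + " " + train["Body"].fillna("")
test_text = test["Title"].fillna("") + " " + test["Body"].fillna("")

vec = TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))
X_train = vec.fit_transform(train_text)
X_test = vec.transform(test_text)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train["Label"])
preds = clf.predict(X_test)  # 0 = bug, 1 = feature, 2 = question
```

A BERT-style fine-tune would replace the vectorizer and classifier, but this gives a quick sanity-check score first.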
Welcome to the House Price Prediction Challenge, in which you will test your regression skills by designing an algorithm to accurately predict house prices in India. Accurately predicting house prices can be a daunting task. Buyers are not concerned solely with the size (square feet) of a house; various other factors play a key role in deciding the price of a house/property. It can be extremely difficult to figure out the right set of attributes contributing to buyer behavior. This dataset has been collected from various property aggregators across India. In this competition, given the 12 influencing factors, your role as a data scientist is to predict prices as accurately as possible.
This competition also gives you a lot of room for feature engineering and for mastering advanced regression techniques such as Random Forests, deep neural nets, and various other ensembling techniques.
Data Description:
Train.csv - 29451 rows x 12 columns
Test.csv - 68720 rows x 11 columns
Sample Submission - Acceptable submission format (.csv/.xlsx file with 68720 rows)
Attributes Description:
POSTED_BY - Category marking who has listed the property
UNDER_CONSTRUCTION - Under construction or not
RERA - RERA approved or not
BHK_NO - Number of rooms
BHK_OR_RK - Type of property
SQUARE_FT - Total area of the house in square feet
READY_TO_MOVE - Category marking ready to move or not
RESALE - Category marking resale or not
ADDRESS - Address of the property
LONGITUDE - Longitude of the property
LATITUDE - Latitude of the property
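A minimal Random Forest sketch for this task. The target column name is not stated above, so "PRICE" below is a hypothetical placeholder for whatever the 12th column in Train.csv is actually called:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

train = pd.read_csv("Train.csv")
target = "PRICE"  # hypothetical name; check Train.csv for the real one

# Start from the numeric attributes; POSTED_BY, BHK_OR_RK, and ADDRESS
# would need encoding before they can be added as features.
features = ["UNDER_CONSTRUCTION", "RERA", "BHK_NO", "SQUARE_FT",
            "READY_TO_MOVE", "RESALE", "LONGITUDE", "LATITUDE"]

model = RandomForestRegressor(n_estimators=300, random_state=0)
scores = cross_val_score(model, train[features], train[target],
                         scoring="neg_root_mean_squared_error", cv=5)
print("CV RMSE:", -scores.mean())
```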
This dataset was taken from a machine hack competition. Link to the competition: machine hack competition
Dataset Source: https://machinehack.com/hackathons/dare_in_reality_hackathon/overview
Overview

In the heat of a Formula E race, teams need fast access to insights that can help drivers make split-second decisions and cross the finish line first. Can your data-science skills help Envision Racing, one of the founding teams in the championship, take home even more trophies?
To do so, you will have to build a machine learning model that predicts the Envision Racing drivers’ lap times for the all-important qualifying sessions that determine what position they start the race in. Winning races involves a combination of both a driver’s skills and data analytics. To help the team you’ll need to consider several factors that affect performance during a session, including weather, track conditions, and a driver’s familiarity with the track.
Genpact, a leading professional services firm that focuses on digital transformation, is collaborating with Envision Racing, a Formula E racing team, and the digital hackathon platform MachineHack, a brainchild of Analytics India Magazine, to launch 'Dare in Reality'. This two-week hackathon allows data science professionals, machine learning engineers, artificial intelligence practitioners, and other tech enthusiasts to showcase their skills, impress the judges, and stand a chance to win exciting cash prizes.
Genpact (NYSE: G) is a global professional services firm that makes business transformation real, driving digital-led innovation and digitally enabled intelligent operations for our clients.
Challenge
Hackathon Starts: Datasets will be made live on 08th November at 06:00 PM IST.

Dataset Description
train.csv - 10276 rows x 25 columns (includes target column LAP_TIME)
test.csv - 420 rows x 25 columns (includes target column LAP_TIME)
submission.csv - Please check the Evaluation section for more details on how to generate a valid submission.

Attributes
NUMBER: Number in sequence
DRIVER_NUMBER: Driver number
LAP_NUMBER: Lap number
LAP_TIME: Lap time in seconds
LAP_IMPROVEMENT: Number of lap improvements
CROSSING_FINISH_LINE_IN_PIT
S1: Sector 1 time in [min:sec.microseconds]
S1_IMPROVEMENT: Improvement in sector 1
S2: Sector 2 time in [min:sec.microseconds]
S2_IMPROVEMENT: Improvement in sector 2
S3: Sector 3 time in [min:sec.microseconds]
S3_IMPROVEMENT: Improvement in sector 3
KPH: Speed in kilometers/hour
ELAPSED: Time elapsed in [min:sec.microseconds]
HOUR: In [min:sec.microseconds]
S1_LARGE: In [min:sec.microseconds]
S2_LARGE: In [min:sec.microseconds]
S3_LARGE: In [min:sec.microseconds]
DRIVER_NAME: Name of the driver
PIT_TIME: Time the car stops in the pits for fuel and other consumables to be renewed or replenished
GROUP: Group of the driver
TEAM: Team name
POWER: Brake horsepower (bhp)
LOCATION: Location of the event
EVENT: Free practice or qualifying

The challenge is to predict the LAP_TIME for the qualifying groups of locations 6, 7, and 8.

Knowledge and Skills
Multivariate regression
Big dataset, underfitting vs overfitting
Optimizing RMSLE to generalize well on unseen data

Final winners will be notified via email, based on an aggregate score of their private leaderboard rankings.
What is the Metric in this competition? How is the Leaderboard calculated?
The submission will be evaluated using the RMSLE metric. One can use np.sqrt(mean_squared_log_error(actual, predicted)) to calculate it.
This hackathon supports private and public leaderboards.
The public leaderboard is evaluated on 30% of the test data.
The private leaderboard will be made available at the end of the hackathon and will be evaluated on 100% of the test data.
The Final Score represents the score achieved based on the best score on the public leaderboard.

How to Generate a valid Submission File
Sklearn models support the predict() method to generate the predicted values.
You should submit a .csv file with exactly 420 rows and 1 column (LAP_TIME). Your submission will return an Invalid Score if you have extra columns or rows.
The file should have exactly 1 column.
Note: Do not shuffle the sequence of the test series
Using Pandas:
submission_df.to_csv('my_submission_file.csv', index=False)
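Putting the metric and the submission format together: a hedged sketch of validating with RMSLE locally and writing the one-column file, with placeholder arrays standing in for a real validation split and model output.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_log_error

# Placeholders for a real validation split and model predictions.
y_val = np.array([92.3, 94.1, 91.8])
val_preds = np.array([93.0, 93.5, 92.2])

rmsle = np.sqrt(mean_squared_log_error(y_val, val_preds))
print(f"validation RMSLE: {rmsle:.4f}")

# Exactly 420 rows, a single LAP_TIME column, and the original
# test-row order preserved (do not shuffle the test sequence).
test_preds = np.full(420, 93.0)  # placeholder predictions
submission_df = pd.DataFrame({"LAP_TIME": test_preds})
submission_df.to_csv("my_submission_file.csv", index=False)
```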
Your goal is to predict a student's earnings a set number of years after they have enrolled in a United States institution of higher education. The data is compiled from a wide range of sources and made publicly available by the United States Department of Education.
We're trying to predict the variable income, which represents earnings in thousands of US dollars a set interval from when the student first enrolled.
The format for the submission file is two columns: row_id and income. The data type of income is a float, so make sure there is a decimal point in your submission. For example, 0.0 is a valid float; 0 is not.
For example, if you predicted...
| row_id | income |
|---|---|
| 2 | 0.0 |
| 8 | 0.0 |
| 9 | 0.0 |
| 10 | 0.0 |
| 11 | 0.0 |
The first few lines of the .csv file that you submit would look like:
row_id,income
2,0.0
8,0.0
9,0.0
10,0.0
11,0.0
We're predicting a numeric quantity, so this is a regression problem. To measure it, we'll use a metric called root mean squared error (RMSE). It is an error metric, so a lower value is better (as opposed to an accuracy metric, where a higher value is better).
\[RMSE = \sqrt{\frac{1}{N}\sum_{n=1}^{N} (\hat{y}_n - y_n)^2 }\]
Where $\hat{y}_n$ is the predicted earnings and $y_n$ is the actual earnings. The best possible score is 0, but the worst possible score can be infinite.
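A direct translation of the formula above, with a tiny worked example: if every prediction is off by 0.5, the RMSE is 0.5.

```python
import numpy as np

def rmse(y_pred: np.ndarray, y_true: np.ndarray) -> float:
    # Square root of the mean squared difference, exactly as in the formula.
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

print(rmse(np.array([1.5, 2.5]), np.array([1.0, 2.0])))  # 0.5
```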
There are 297 variables in this dataset. Each row in the dataset represents a United States institution of higher education in a specific year. The dataset we are working with covers four particular years, denoted year_a, year_f, year_w, and year_z in our dataset. An institution may have a row for all, some, or just for one of the years. We don't provide a unique identifier for an individual institution, just a row_id for each row.
The variables in the dataset have names of the form category_variable, where category is the high-level category of the variable (e.g. academics or students) and variable is what the specific column contains.
academics
program_assoc_agriculture: Associate degree in Agriculture, Agriculture Operations, And Related Sciences.
program_assoc_architecture: Associate degree in Architecture And Related Services.
program_assoc_biological: Associate degree in Biological And Biomedical Sciences.
program_assoc_business_marketing: Associate degree in Business, Management, Marketing, And Related Support Services.
program_assoc_communication: Associate degree in Communication, Journalism, And Related Programs.
program_assoc_communications_technology: Associate degree in Communications Technologies/Technicians And Support Services.
program_assoc_computer: Associate degree in Computer And Information Sciences And Support Services.
program_assoc_construction: Associate degree in Construction Trades.
program_assoc_education: Associate degree in Education.
program_assoc_engineering: Associate degree in Engineering.
program_assoc_engineering_technology: Associate degree in Engineering Technologies And Engineering-Related Fields.
program_assoc_english: Associate degree in English Language And Literature/Letters.
program_assoc_ethnic_cultural_gender: Associate degree in Area, Ethnic, Cultural, Gender, And Group Studies.
program_assoc_family_consumer_science: Associate degree in Family And Consumer Sciences/Human Sciences.
program_assoc_health: Associate degree in Health Professions And Related Programs.
program_assoc_history: Associate degree in History.
program_assoc_humanities: Associate degree in Liberal Arts And Sciences, General Studies And Humanities.
program_assoc_language: Associate degree in Foreign Languages, Literatures, And Linguistics.
program_assoc_legal: Associate degree in Legal Professions And Studies.
program_assoc_library: Associate degree in Library Science.
program_assoc_mathematics: Associate degree in Mathematics And Statistics.
program_assoc_mechanic_repair_technology: Associate degree in Mechanic And Repair Technologies/Technicians.
program_assoc_military: Associate degree in Military Technologies And Applied Sciences.
...
CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/
The current pandemic has shrunk the data science job market, and recruiters are likewise facing difficulties filtering the right talent. To bridge this gap, we bring the MachineHack community a chance to compete for jobs with some of the key analytics players for a rewarding career in Data Science. In this competition, we are challenging the MachineHack community to come up with an algorithm to predict the price of retail items belonging to different categories. Forecasting retail prices can be a daunting task due to huge datasets with a variety of attributes ranging from text, numbers (floats, integers), and DateTime. Also, outliers can be a big problem when dealing with unit prices.
With a key focus on the Data Scientist role in an esteemed organization, this hackathon can help freshers and experienced folks alike prove their mettle and land a rewarding career.

By participating in this hackathon, every participant will be eligible for the Data Scientist job role, provided their MachineHack information and resume are up to date.
Train.csv - 284780 rows x 8 columns (includes the UnitPrice column as the target)
Test.csv - 122049 rows x 7 columns
Sample Submission.csv - Please check the Evaluation section for more details on how to generate a valid submission
Invoice No - Invoice ID, encoded as a label
StockCode - Unique code per stock item, encoded as a label
Description - The item description, encoded as a label
Quantity - Quantity purchased
InvoiceDate - Date of purchase
UnitPrice - The target value, the price of every product
CustomerID - Unique identifier for every customer
Country - Country of sale, encoded as a label
Multivariate regression
Big dataset, underfitting vs overfitting
Optimizing RMSE to generalize well on unseen data
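One common way to soften the unit-price outliers mentioned above is to train on log1p(UnitPrice) and invert with expm1 at prediction time. A hedged sketch, assuming the column names from the description and non-negative prices:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

train = pd.read_csv("Train.csv")
train = train[train["UnitPrice"] >= 0]  # log1p needs non-negative targets

# These columns are described as label-encoded, so they are already numeric.
features = ["StockCode", "Description", "Quantity", "CustomerID", "Country"]
X = train[features].fillna(-1)

model = GradientBoostingRegressor(random_state=0)
model.fit(X, np.log1p(train["UnitPrice"]))

preds = np.expm1(model.predict(X))  # back to the original price scale
```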
This dataset is taken from Machine Hack challenge https://www.machinehack.com/hackathons/retail_price_prediction_mega_hiring_hackathon/overview
All you need to know about the "Quote to Code I: Pandora's Box by Indus OS":
You are provided with data about:
app metadata: metadata about the app
user metadata: metadata about the user
app installs: apps installed by a user in the last six months
app usage: apps used by the user in the last one week
actual set: for participants to validate their model and results
validation set: UIDs for which participants need to predict the top four recommendations from the universe of apps in the app metadata
data dictionary: check this file for more details about the datasets
sample submission: schema for submitting the final results

Participants need to predict the top four recommendations from the universe of apps in the app metadata.
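A popularity baseline for the top-4 task: rank apps by install count and recommend the four most popular apps a user does not already have. The file and column names here are assumptions based on the description above:

```python
import pandas as pd

installs = pd.read_csv("app_installs.csv")  # assumed file name
# Apps ordered from most to least installed across all users.
popular = list(installs["app_id"].value_counts().index)  # assumed column

def top4(user_apps: set) -> list:
    """Return the four most popular apps the user does not already have."""
    return [app for app in popular if app not in user_apps][:4]

print(top4({"com.example.alpha"}))  # hypothetical app id
```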
Note: The purpose of uploading the data is to support Kagglers in making predictions for this MachineHack competition (source: https://machinehack.com/hackathons/quote_to_code_i_pandoras_box/data).
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This dataset is the "development dataset" for the DCASE 2021 Challenge Task 2 "Unsupervised Anomalous Sound Detection for Machine Condition Monitoring under Domain Shifted Conditions".
The data consists of the normal/anomalous operating sounds of seven types of real/toy machines. Each recording is a single-channel 10-second audio that includes both a machine's operating sound and environmental noise. The following seven types of real/toy machines are used in this task:
Fan, gearbox, pump, slide rail, ToyCar, ToyTrain, and valve.
Why focus on domain shift?
The task setup of the 2020 version was ASD under ideal conditions: the training- and testing-phase datasets were generated under the same recording conditions, and enough normal training clips recorded under the test domain were made available. In contrast, real-world cases are more complicated and often involve different machine operating conditions between the training and testing phases. A frequent example is a conveyor transporting products on a production line, whose motor speed varies continuously with the production volume in response to product demand. Since there is infinite variation in rotation speed, the sound will also change with infinite variation. Because demand for many products is seasonal, training data recorded during a limited period captures only the motor speeds of that period (e.g., 200-300 rpm for autumn), limiting the variation in the training data. However, in the test phase, the ASD system must continue to monitor the conveyor through all seasons, so it must be able to handle all possible motor speed conditions, including those that differ from the training data (such as 100-400 rpm). In addition to the conditions of the machine, environmental noise conditions (SNR, sound characteristics, etc.) also fluctuate uncontrollably depending on the seasonal demand. In such a situation, the distribution of the normal state will change (i.e., domain shift).
Definition
First, we define some important terms in this task: "machine type," "section," "source domain," and "target domain."
The machine type means the kind of machine, which can be one of seven in this task: fan, gearbox, pump, slide rail, ToyCar, ToyTrain, and valve. The section is defined as a subset of the dataset for calculating performance metrics and is almost identical to what was called "machine ID" in the 2020 version. In the 2020 version, there was a one-to-one correspondence between machine IDs and products, but in the 2021 version, the same product may appear in different sections. Different products may appear in the same section. The source domain means the condition under which most of the training data was recorded, and the target domain means a different condition under which some of the test data was recorded. The source and target domains differ in terms of operating speed, machine load, viscosity, heating temperature, environmental noise, SNR, etc.
Data
This dataset consists of three sections for each machine type (Section 00, 01, and 02), and each section is a complete set of training and test data. For each section, this dataset provides (i) around 1,000 clips of normal sounds in a source domain for training, (ii) only three clips of normal sounds in a target domain for training, (iii) around 100 clips each of normal and anomalous sounds in the source domain for the test, and (iv) around 100 clips each of normal and anomalous sounds in the target domain for the test.
Recording procedure
Normal/anomalous operating sounds of machines and related equipment were recorded. Anomalous sounds were collected by deliberately damaging machines. To simplify the task, we only used the first channel of the multi-channel recordings; all recordings were regarded as single-channel recordings from a fixed microphone. We mixed a machine sound with environmental noise, and only noisy recordings are provided as training/test data. The environmental noise clips were recorded in several real factory environments. We will publish papers on the dataset to explain the details of the recording procedure by the submission deadline.
Reference labels
The given labels for each training/test clip are machine type, section index, normal/anomaly information, and brief attribute information about conditions other than normal/abnormal. The machine type information is given by the directory name. The section index is given by their respective file names. For the datasets other than the evaluation dataset, the normal/anomaly information is given by their respective file names. For the training data, the attribute information is given by their respective file names.
Baseline system
Two simple baseline systems are available on the Github repository [URL] and [URL]. The baseline systems provide a simple entry-level approach that gives a reasonable performance in the dataset of Task 2. They...
Overview

Deloitte refers to one or more of Deloitte Touche Tohmatsu Limited ("DTTL"), its global network of member firms, and their related entities (collectively, the "Deloitte organization"). DTTL (also referred to as "Deloitte Global") and each of its member firms and related entities are legally separate and independent entities, which cannot obligate or bind each other in respect of third parties. DTTL and each DTTL member firm and related entity is liable only for its own acts and omissions, and not those of each other. DTTL does not provide services to clients. Please see www.deloitte.com/about to learn more.
All the facts and figures that talk to our size and diversity and years of experiences, as notable and important as they may be, are secondary to the truest measure of Deloitte: the impact we make in the world.
So, when people ask, "what's different about Deloitte?", the answer resides in the many specific examples of where we have helped Deloitte member firm clients, our people, and sections of society to achieve remarkable goals, solve complex problems, or make meaningful progress. Deeper still, it's in the beliefs, behaviours, and fundamental sense of purpose that underpin all that we do. Globally, Deloitte has grown in scale and diversity (more than 345,000 people in 150 countries, providing multidisciplinary services), yet our shared culture remains the same.
(C) 2021 Deloitte Touche Tohmatsu India LLP
Dataset Link: https://machinehack-staging.netlify.app/hackathons/deloitte_hackathon_predict_the_loan_defaulter/overview
**The data has been posted here for easy use of Kaggle kernels by competition participants. I do not claim any ownership of the data.**
Challenge Dataset Description
Train.csv - 70,000 rows x 40 columns (includes the target column Loan Status)
Test.csv - 30,000 rows x 39 columns
Sample Submission.csv - Please check the Evaluation section for more details on how to generate a valid submission.

Attributes: ID, Loan Amount, Funded Amount, Funded Amount Investor, Term, Batch Enrolled, Interest Rate, Grade, Sub Grade, CTC, Designation, Employment Duration, Home Ownership, Verification Status, Payment Plan, Loan Purpose, Loan Title, Zip Code, Address State, Debt to Income, Delinquency - two years, Inquires - six months, Open Account, Public Record, Revolving Balance, Revolving Utilities, Total Accounts, Initial List Status, Total Received Interest, Total Received Late Fee, Recoveries, Collection Recovery Fee, Collection 12 months Medical, Application Type, Last week Pay, Accounts Delinquent, Total Collection Amount, Total Current Balance, Total Revolving Credit Limit, Loan Status.

The challenge is to predict the Loan Status.

Knowledge and Skills
Big dataset, underfitting vs overfitting
Optimising log_loss to generalise well on unseen data
Competition Rules:
ONE ACCOUNT PER PARTICIPANT
One account per participant. Submissions from multiple accounts will lead to disqualification. All registered users are eligible to participate in the hackathon. We ask that you respect the spirit of the competition and do not cheat.
NO PRIVATE SHARING OUTSIDE TEAMS No private sharing outside teams. Any discrepancies reported will be taken seriously and can lead to disqualification.
SUBMISSION LIMITS
The submission limit for the hackathon is 3 per day, after which submissions will not be evaluated. All registered users are eligible to participate in the hackathon. We ask that you respect the spirit of the competition and do not cheat.
COMPETITION TIMELINE
Start Date: 26/11/2021 End Date: 13/12/2021
Hackathon Specific Rules
Deadline: This hackathon will expire on 22nd November at 06:00 PM IST.
Disqualification: Analytics India Magazine and Deloitte reserve the right to disqualify any participant if the details provided are found incorrect. Any external dataset usage is strictly prohibited; participants will be disqualified if found using any external dataset.
Evaluation
What is the Metric in this competition? How is the Leaderboard calculated?
The submission will be evaluated using the Log Loss metric. One can use sklearn.metrics.log_loss to calculate it.
This hackathon supports private and public leaderboards.
The public leaderboard is evaluated on 30% of the test data.
The private leaderboard will be made available at the end of the hackathon and will be evaluated on 100% of the test data.
The Final Score represents the score achieved based on the best score on the public leaderboard.

How to Generate a valid Submission File
Sklearn models support the predict() method to generate the predicted values.
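A minimal log-loss check with toy values only; note the metric scores predicted probabilities, not hard labels, so a probability output (e.g. from predict_proba) is what you want to validate:

```python
from sklearn.metrics import log_loss

y_true = [0, 1, 1, 0]            # toy ground-truth loan statuses
y_prob = [0.1, 0.8, 0.65, 0.3]   # toy P(Loan Status = 1) per row
print(log_loss(y_true, y_prob))  # lower is better
```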
You should submit a .csv file with exactly 100,000 rows and 1 column (loan_status). Your submission will return an Invalid Score if you hav...
Welcome back to the 'Weekend Hackathon Edition 2 - The Last Hacker Standing' at MachineHack. In this edition, we will be posing unique problem statements every week that test various aspects of being a Data Scientist. The Weekend Edition will be held over a six-week period, from 30 July 2021 to 9 Sept 2021.
This time it is dedicated to the passion and fervour that a sport creates.

Challenge Name: THE SOCCER FEVER
Soccer, aka football, is the most popular game in the world. It's a religion of its own. If groups of 10 people can stop time and make people watch them in awe and reverence, it's this beautiful game. Also, anybody can play soccer: all it needs is 4 poles, a ground, and a ball, and you can just get started.
In fact, Nelson Mandela very effectively used Football as the unifying factor when he was elected President of South Africa post the Apartheid era. The sport just cuts across all discriminating factors.
An entire ecosystem revolves around this beautiful sport: clubs, merchandise, listed football clubs, fan clubs, and groups of rivals who can get into a fight over the outcome of a game. The amount of currency involved in this game is phenomenal. It impacts millions of people who depend on it for their livelihood and recreation.

Criticality
We live in ambiguity and always need some information just to make a decision. Decisions are made based on possible outcomes: win/loss, pass/fail, etc.
The problem statement below is a classic study in decision-making and in understanding the odds stacked against a particular situation.
Train Dataset: 7443 rows x 21 columns
Target Column: Outcome
Evaluation Metric: Log Loss

Test Dataset: 4008 rows x 20 columns
Submission: 4008 rows x 1 column (Column Name - 'Outcome')

Skills:
Multi-class classification
Optimizing Log Loss
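Since the metric is log loss, class probabilities are what matter. A hedged validation sketch, assuming Train.csv holds the 21 columns described above with numeric features and the Outcome target:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

train = pd.read_csv("Train.csv")  # assumed file name
X = train.drop(columns=["Outcome"])
y = train["Outcome"]

X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=300, random_state=0)
clf.fit(X_tr, y_tr)

# Log loss is computed on predicted class probabilities, not labels.
print(log_loss(y_val, clf.predict_proba(X_val)))
```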
Payment transfer (Max. score: 100)
You are working as a data scientist with the Payments team of a bank. The team is continually responding to emerging threats by building cutting-edge machine-learning-driven models and strategies, working with best-in-class service providers specialized in counter-fraud solutions. In recent years, there has been increased scrutiny of digital payments to check their genuineness.

To help the team deal with this problem, you are provided with payments data and must predict whether the customers themselves made the transfer or not. The payments data contains the attributes that get captured when a payment is initiated by a banking customer.
Task
You are required to build a machine learning model that can predict whether the customer themselves have made the transfer or not.
Dataset description
The dataset folder contains the following files:
The leaderboard score is the area under the precision-recall curve, scaled to 100:

from sklearn import metrics

# precision_recall_curve takes the true labels and the predicted scores.
precision, recall, thresholds = metrics.precision_recall_curve(actual, predicted)
# metrics.auc expects the monotonic axis first, so recall goes on x.
score = max(0, 100 * metrics.auc(recall, precision))
Note: Ensure that your submission file contains the following:
With the rise in the variety of cars with differentiated capabilities and features, such as model, production year, category, brand, fuel type, engine volume, mileage, cylinders, colour, airbags, and many more, we are bringing a car price prediction challenge for all. We all aspire to own a car within budget with the best features available. To solve the price problem, we have created a training dataset of 19237 rows and a test dataset of 8245 rows.
In the heat of a Formula E race, teams need fast access to insights that can help drivers make split-second decisions and cross the finish line first. Can your data-science skills help Envision Racing, one of the founding teams in the championship, take home even more trophies?
To do so, you will have to build a machine learning model that predicts the Envision Racing drivers' lap times for the all-important qualifying sessions that determine what position they start the race in. Winning races involves a combination of both a driver's skills and data analytics. To help the team you'll need to consider several factors that affect performance during a session, including weather, track conditions, and a driver's familiarity with the track.
Genpact, a leading professional services firm that focuses on digital transformation, is collaborating with Envision Racing, a Formula E racing team, and the digital hackathon platform MachineHack, a brainchild of Analytics India Magazine, to launch 'Dare in Reality'. This two-week hackathon allows data science professionals, machine learning engineers, artificial intelligence practitioners, and other tech enthusiasts to showcase their skills, impress the judges, and stand a chance to win exciting cash prizes.
Genpact (NYSE: G) is a global professional services firm that makes business transformation real, driving digital-led innovation and digitally enabled intelligent operations for our clients.
Dataset Description
train.csv - 10276 rows x 25 columns (includes target column LAP_TIME)
test.csv - 420 rows x 25 columns (includes target column LAP_TIME)
submission.csv - Please check the Evaluation section for more details on how to generate a valid submission.

Attributes
NUMBER: Number in sequence
DRIVER_NUMBER: Driver number
LAP_NUMBER: Lap number
LAP_TIME: Lap time in seconds
LAP_IMPROVEMENT: Number of lap improvements
CROSSING_FINISH_LINE_IN_PIT
S1: Sector 1 time in [min:sec.microseconds]
S1_IMPROVEMENT: Improvement in sector 1
S2: Sector 2 time in [min:sec.microseconds]
S2_IMPROVEMENT: Improvement in sector 2
S3: Sector 3 time in [min:sec.microseconds]
S3_IMPROVEMENT: Improvement in sector 3
KPH: Speed in kilometers/hour
ELAPSED: Time elapsed in [min:sec.microseconds]
HOUR: In [min:sec.microseconds]
S1_LARGE: In [min:sec.microseconds]
S2_LARGE: In [min:sec.microseconds]
S3_LARGE: In [min:sec.microseconds]
DRIVER_NAME: Name of the driver
PIT_TIME: Time the car stops in the pits for fuel and other consumables to be renewed or replenished
GROUP: Group of the driver
TEAM: Team name
POWER: Brake horsepower (bhp)
LOCATION: Location of the event
EVENT: Free practice or qualifying

The challenge is to predict the LAP_TIME for the qualifying groups of locations 6, 7, and 8.

Knowledge and Skills
Multivariate regression
Big dataset, underfitting vs overfitting
Optimizing RMSLE to generalize well on unseen data
The hackathon and the dataset were published on MachineHack: https://machinehack.com/hackathons/dare_in_reality_hackathon/overview
The dataset is taken from MachineHack Weekend Hackathon #18.
The filenames are described as follows:
Train.csv: Training set containing 1558 feature columns (indices 0-1557) with information about various attributes collected from the manufacturing machine, plus a final target column (Class).
Test.csv: Test set containing 1558 feature columns (indices 0-1557) with information about various attributes collected from the manufacturing machine.
The submission .csv file will have just one column (Class), which stores the predicted value of the target variable.
Class (0 or 1): Represents the Good/Anomalous class labels for the products.
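A hedged baseline for the good/anomalous classification, assuming the last column of Train.csv is the Class target as described above:

```python
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

train = pd.read_csv("Train.csv")
X, y = train.iloc[:, :-1], train.iloc[:, -1]  # features, then Class target

clf = HistGradientBoostingClassifier(random_state=0)
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())

# One-column submission, as described above.
test = pd.read_csv("Test.csv")
clf.fit(X, y)
pd.DataFrame({"Class": clf.predict(test)}).to_csv("submission.csv", index=False)
```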
Dataset Description
train.csv - 10276 rows x 25 columns (includes target column LAP_TIME)
test.csv - 420 rows x 25 columns (includes target column LAP_TIME)
submission.csv - predict the LAP_TIME for the qualifying groups of locations 6, 7, and 8 for the test data.

Attributes
NUMBER: Number in sequence
DRIVER_NUMBER: Driver number
LAP_NUMBER: Lap number
LAP_TIME: Lap time in seconds
LAP_IMPROVEMENT: Number of lap improvements
CROSSING_FINISH_LINE_IN_PIT
S1: Sector 1 time in [min:sec.microseconds]
S1_IMPROVEMENT: Improvement in sector 1
S2: Sector 2 time in [min:sec.microseconds]
S2_IMPROVEMENT: Improvement in sector 2
S3: Sector 3 time in [min:sec.microseconds]
S3_IMPROVEMENT: Improvement in sector 3
KPH: Speed in kilometers/hour
ELAPSED: Time elapsed in [min:sec.microseconds]
HOUR: In [min:sec.microseconds]
S1_LARGE: In [min:sec.microseconds]
S2_LARGE: In [min:sec.microseconds]
S3_LARGE: In [min:sec.microseconds]
DRIVER_NAME: Name of the driver
PIT_TIME: Time the car stops in the pits for fuel and other consumables to be renewed or replenished
GROUP: Group of the driver
TEAM: Team name
POWER: Brake horsepower (bhp)
LOCATION: Location of the event
EVENT: Free practice or qualifying
Note: This data is from the MachineHack site below, to help Kaggle users make use of Kaggle notebooks for modelling: https://machinehack.com/hackathons/dare_in_reality_hackathon/data