18 datasets found
  1. ML HACK Dataset

    • kaggle.com
    zip
    Updated Nov 17, 2020
    Cite
    Abhinav Padmawar (2020). ML HACK Dataset [Dataset]. https://www.kaggle.com/abhinavpadmawar20/ml-hack-dataset
    Explore at:
    Available download formats: zip (94446 bytes)
    Dataset updated
    Nov 17, 2020
    Authors
    Abhinav Padmawar
    Description

    Given below are three files that you will be using for the challenge. Download all the files. The training file has a labelled data set. However, the test file shall only have the features. Work out your algorithm for the same and make predictions on the test file after which you have to create a submissions.csv file that will be evaluated. You may refer to the sample_submission.csv file in order to understand the overall structure of your submission. The dataset consists of overall stats of players in ODIs only.

    File descriptions:

    train.csv - the training set
    test.csv - the test set
    sampleSubmission.csv - a sample submission file in the correct format

    Data fields:

    id - an anonymous id unique to the player
    Name - Name of the player
    Age - Age
    100s - Number of centuries of the player
    50s - Number of half centuries of the player
    6s - Total number of sixes hit by the player
    Balls - Number of balls bowled by the player
    Bat_Average - Average batting score
    Bowl_Strike_Rate - Average number of balls bowled per wicket taken
    Balls faced - Number of balls faced
    Economy - Average number of runs conceded for each over bowled
    Innings - Number of innings played
    Overs - Number of overs bowled
    Maidens - Overs when no run was conceded
    Runs - Total runs scored by the player
    Wickets - Number of wickets taken
    Ratings - Final rating of the player

  2. submit_ai_hack_2024

    • kaggle.com
    zip
    Updated Aug 14, 2024
    Cite
    SURIYA CHAYATUMMAGOON (2024). submit_ai_hack_2024 [Dataset]. https://www.kaggle.com/suriyachayatummagoon/submit-ai-hack-2024
    Explore at:
    Available download formats: zip (3253 bytes)
    Dataset updated
    Aug 14, 2024
    Authors
    SURIYA CHAYATUMMAGOON
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by SURIYA CHAYATUMMAGOON

    Released under MIT

    Contents

  3. Flower Type Prediction Machine Hack

    • kaggle.com
    zip
    Updated Aug 21, 2020
    Cite
    V.Prasanna Kumar (2020). Flower Type Prediction Machine Hack [Dataset]. https://www.kaggle.com/datasets/vpkprasanna/flower-type-prediction-machine-hack
    Explore at:
    Available download formats: zip (402827 bytes)
    Dataset updated
    Aug 21, 2020
    Authors
    V.Prasanna Kumar
    Description

    Content

    Welcome to another exciting weekend hackathon to flex your machine learning classification skills by classifying flowers into 8 different classes. To recognize the right flower, you will use 6 different attributes to assign each sample to one of the classes (0-7). Computer vision has reached the state of the art for this kind of recognition, but collecting image data takes a lot of human labor to annotate the images with labels or bounding boxes for detection and segmentation tasks. Hence, some generic attributes that can be collected easily from various areas, localities, and regions were captured for various species of flowers.

    In this hackathon, we are challenging the MachineHack community to use classical machine learning classification techniques to build a model that generalizes well on unseen data, given explanatory attributes about the flower species instead of a picture.

    In this competition, you will be learning advanced classification techniques, handling higher cardinality categorical variables, and much more.

    Dataset Description:

    Train.csv - 12666 rows x 7 columns (includes Class as target column)
    Test.csv - 29555 rows x 6 columns
    Sample Submission.csv - Please check the Evaluation section for more details on how to generate a valid submission.
    

    Attributes Description:

    Area_Code - Generic Area code, species were collected from
    Locality_Code - Locality code, species were collected from
    Region_Code - Region code, species were collected from
    Height - Height collected from lab data
    Diameter - Diameter collected from lab data
    Species - Species of the flower
    Class - Target Column (0-7) classes
    

  4. Buyer's Time Prediction Challenge

    • kaggle.com
    zip
    Updated Dec 18, 2020
    Cite
    Mohd Aquib (2020). Buyer's Time Prediction Challenge [Dataset]. https://www.kaggle.com/aquib5559/buyers-time-prediction-challenge
    Explore at:
    Available download formats: zip (296363 bytes)
    Dataset updated
    Dec 18, 2020
    Authors
    Mohd Aquib
    License

    http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    The Dataset is from Machine Hack.

    Buyers spend a significant amount of time surfing an e-commerce store. Since the pandemic, e-commerce has seen a boom in the number of users across domains. In the meantime, store owners are also planning to attract customers using various algorithms that leverage customer behavior patterns.

    Tracking customer activity is also a great way of understanding customer behavior and figuring out what can actually be done to serve them better. Machine learning and AI have already played a significant role in designing various recommendation engines to lure customers by predicting their buying patterns.

    Dataset Description:

    • Train.json - 5429 rows x 9 columns (Includes time_spent Column as Target variable)
    • Test.json - 2327 rows x 8 columns
    • Sample Submission.csv - Please check the Evaluation section for more details on how to generate a valid submission

    Attribute Description:

    • session_id - Unique identifier for every row
    • session_number - Session type identifier
    • client_agent - Client-side software details
    • device_details - Client-side device details
    • date - Datestamp of the session
    • purchased - Binary value for any purchase done
    • added_in_cart - Binary value for cart activity
    • checked_out - Binary value for checking out successfully
    • time_spent - Total time spent in seconds (Target Column)
  5. GitHub Bugs Prediction Challenge (Machine Hack)

    • kaggle.com
    zip
    Updated Oct 8, 2020
    + more versions
    Cite
    ask9 (2020). GitHub Bugs Prediction Challenge (Machine Hack) [Dataset]. https://www.kaggle.com/arbazkhan971/github-bugs-prediction-challenge-machine-hack
    Explore at:
    Available download formats: zip (103105526 bytes)
    Dataset updated
    Oct 8, 2020
    Authors
    ask9
    Description

    Overview

    Foreseeing bugs, features, and questions on GitHub can be fun, especially when one is provided with a colossal dataset containing the GitHub issues. In this hackathon, we are challenging the MachineHack community to come up with an algorithm that can predict the bugs, features, and questions based on GitHub titles and the text body. With text data, there can be a lot of challenges especially when the dataset is big. Analyzing such a dataset requires a lot to be taken into account mainly due to the preprocessing involved to represent raw text and make them machine-understandable. Usually, we stem and lemmatize the raw information and then represent it using TF-IDF, Word Embeddings, etc.

    However, given state-of-the-art NLP models such as Transformer-based BERT models, one can skip manual feature engineering such as TF-IDF and count vectorizers. In this short span of time, we encourage you to leverage the ImageNet moment (transfer learning) in NLP using various pre-trained models.

    In this hackathon, we also have an interesting learning curve for all the machine learning specialists to write some quality code to win the prizes, as the evaluation involves getting a code quality score using the Embold Code Analysis platform.

    Every participant has to register on Embold's platform for free as a mandatory step before proceeding with the hackathon.

    Dataset Description:

    Train.json - 150000 rows x 3 columns (Includes label Column as Target variable)
    Test.json - 30000 rows x 2 columns
    Train_extra.json - 300000 rows x 3 columns (Includes label Column as Target variable). Provided solely for training purposes; can be appended to train.json for training the model
    Sample Submission.csv - Please check the Evaluation section for more details on how to generate a valid submission

    Attribute Description:

    Title - the title of the GitHub bug, feature, question
    Body - the body of the GitHub bug, feature, question
    Label - Represents various classes of Labels
      Bug - 0
      Feature - 1
      Question - 2

    Skills:

    Natural Language Processing
    Feature extraction from raw text using TF-IDF, CountVectorizer
    Using Word Embedding to represent words as vectors
    Using Pretrained models like Transformers, BERT
    Optimizing accuracy score as a metric to generalize well on unseen data
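
    The TF-IDF representation mentioned in the overview can be sketched in plain Python. This is a minimal illustration, not the hackathon's reference pipeline: the tokenizer, the unsmoothed formula (term frequency times log(N / document frequency)), and the sample titles are all assumptions made for this example. Library implementations commonly use smoothed variants, so their values will differ.

```python
import math
from collections import Counter

# Label encoding taken from the attribute description above.
LABELS = {0: "Bug", 1: "Feature", 2: "Question"}


def tokenize(text):
    """Lowercase and split on whitespace (no stemming or lemmatization)."""
    return text.lower().split()


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    tf  = term count / number of tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [tokenize(doc) for doc in documents]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    n_docs = len(tokenized)
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical issue titles, not taken from the dataset.
titles = [
    "app crashes on startup",
    "add dark mode to app",
    "how to configure dark mode",
]
vectors = tf_idf(titles)
```

    Terms that occur in only one title (such as "crashes") receive the highest weight, while terms shared across titles are down-weighted.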

  6. Machine Hack House Price Prediction

    • kaggle.com
    zip
    Updated Oct 2, 2020
    + more versions
    Cite
    Saurav Mishra (2020). Machine Hack House Price Prediction [Dataset]. https://www.kaggle.com/msaurav/machine-hack-house-price-prediction
    Explore at:
    Available download formats: zip (2233190 bytes)
    Dataset updated
    Oct 2, 2020
    Authors
    Saurav Mishra
    Description

    Context

    Welcome to the House Price Prediction Challenge, where you will test your regression skills by designing an algorithm to accurately predict house prices in India. Accurately predicting house prices can be a daunting task. Buyers are not concerned only with the size (square feet) of the house; various other factors play a key role in deciding the price of a house or property. It can be extremely difficult to figure out the right set of attributes that contribute to understanding the buyer's behavior. This dataset has been collected across various property aggregators across India. In this competition, given the 12 influencing factors, your role as a data scientist is to predict the prices as accurately as possible.

    Also, in this competition, you will get a lot of room for feature engineering and mastering advanced regression techniques such as Random Forest, Deep Neural Nets, and various other ensembling techniques.

    Content

    Data Description:

    Train.csv - 29451 rows x 12 columns
    Test.csv - 68720 rows x 11 columns
    Sample Submission - Acceptable submission format. (.csv/.xlsx file with 68720 rows)

    Attributes Description:

    POSTED_BY - Category marking who has listed the property
    UNDER_CONSTRUCTION - Under Construction or Not
    RERA - Rera approved or Not
    BHK_NO - Number of Rooms
    BHK_OR_RK - Type of property
    SQUARE_FT - Total area of the house in square feet
    READY_TO_MOVE - Category marking Ready to move or Not
    RESALE - Category marking Resale or not
    ADDRESS - Address of the property
    LONGITUDE - Longitude of the property
    LATITUDE - Latitude of the property

    Acknowledgements

    This dataset was taken from a MachineHack competition.

  7. Dare In Reality Hackathon 2021: Machine hack

    • kaggle.com
    zip
    Updated Nov 10, 2021
    Cite
    Manish Tripathi (2021). Dare In Reality Hackathon 2021: Machine hack [Dataset]. https://www.kaggle.com/manishtripathi86/dare-in-reality-hackathon-2021-machine-hack
    Explore at:
    Available download formats: zip (370370 bytes)
    Dataset updated
    Nov 10, 2021
    Authors
    Manish Tripathi
    Description

    Dataset Source: https://machinehack.com/hackathons/dare_in_reality_hackathon/overview

    Overview

    In the heat of a Formula E race, teams need fast access to insights that can help drivers make split-second decisions and cross the finish line first. Can your data-science skills help Envision Racing, one of the founding teams in the championship, take home even more trophies?

    To do so, you will have to build a machine learning model that predicts the Envision Racing drivers’ lap times for the all-important qualifying sessions that determine what position they start the race in. Winning races involves a combination of both a driver’s skills and data analytics. To help the team you’ll need to consider several factors that affect performance during a session, including weather, track conditions, and a driver’s familiarity with the track.

    Genpact, a leading professional services firm that focuses on digital transformation, is collaborating with Envision Racing, a Formula E racing team, and the digital hackathon platform MachineHack, a brainchild of Analytics India Magazine, to launch ‘Dare in Reality’. This two-week hackathon allows data science professionals, machine learning engineers, artificial intelligence practitioners, and other tech enthusiasts to showcase their skills, impress the judges, and stand a chance to win exciting cash prizes.

    Genpact (NYSE: G) is a global professional services firm that makes business transformation real, driving digital-led innovation and digitally enabled intelligent operations for our clients.

    Challenge

    Hackathon Starts

    Datasets will be made live on 08th November, at 06:00 PM IST

    Dataset Description

    train.csv - 10276 rows x 25 columns (Includes target column as LAP_TIME)

    Attributes

    NUMBER: Number in sequence
    DRIVER_NUMBER: Driver number
    LAP_NUMBER: Lap number
    LAP_TIME: Lap time in seconds
    LAP_IMPROVEMENT: Number of Lap Improvement
    CROSSING_FINISH_LINE_IN_PIT
    S1: Sector 1 in [min:sec.microseconds]
    S1_IMPROVEMENT: Improvement in sector 1
    S2: Sector 2 in [min:sec.microseconds]
    S2_IMPROVEMENT: Improvement in sector 2
    S3: Sector 3 in [min:sec.microseconds]
    S3_IMPROVEMENT: Improvement in sector 3
    KPH: Speed in kilometer/hour
    ELAPSED: Time elapsed in [min:sec.microseconds]
    HOUR: in [min:sec.microseconds]
    S1_LARGE: in [min:sec.microseconds]
    S2_LARGE: in [min:sec.microseconds]
    S3_LARGE: in [min:sec.microseconds]
    DRIVER_NAME: Name of the driver
    PIT_TIME: Time taken when the car stops in the pits for fuel and other consumables to be renewed or replenished
    GROUP: Group of driver
    TEAM: Team name
    POWER: Brake Horsepower (bhp)
    LOCATION: Location of the event
    EVENT: Free practice or qualifying

    test.csv - 420 rows x 25 columns (Includes target column as LAP_TIME)
    submission.csv - Please check the Evaluation section for more details on how to generate a valid submission.

    The challenge is to predict the LAP_TIME for the qualifying groups of location 6, 7 and 8.

    Knowledge and Skills

    Multivariate Regression
    Big dataset, underfitting vs overfitting
    Optimizing RMSLE to generalize well on unseen data

    Final winners will be notified via an email, based on an aggregate score of their private leaderboard rankings.

    What is the Metric in this competition? How is the Leaderboard Calculated?

    The submission will be evaluated using the RMSLE metric. One can use np.sqrt(mean_squared_log_error(actual, predicted)) to calculate the same.
    This hackathon supports private and public leaderboards.
    The public leaderboard is evaluated on 30% of Test data.
    The private leaderboard will be made available at the end of the hackathon which will be evaluated on 100% of Test data.
    The Final Score represents the score achieved based on the Best Score on the public leaderboard.

    How to Generate a valid Submission File

    Sklearn models support the predict() method to generate the predicted values.

    You should submit a .csv file with exactly 420 rows with 1 column(LAP_TIME). Your submission will return an Invalid Score if you have extra columns or rows.

    The file should have exactly 1 column.

    Note: Do not shuffle the sequence of the test series

    Using Pandas:

    submission_df.to_csv('my_submission_file.csv', index=False)
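
    For readers without scikit-learn or pandas at hand, the RMSLE metric and the single-column submission file described above can be sketched with the standard library alone. This is an illustrative sketch, not the organizers' scoring code: the function names and the sample lap times are hypothetical, and the metric assumes the usual log(1 + x) definition of squared logarithmic error.

```python
import csv
import io
import math


def rmsle(actual, predicted):
    """Root mean squared logarithmic error, using log(1 + x) on both inputs."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2
        for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


def write_submission(lap_times, handle):
    """Write a single LAP_TIME column, keeping the rows in test-set order."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["LAP_TIME"])
    for value in lap_times:
        writer.writerow([value])


# Hypothetical lap times in seconds, for illustration only.
actual = [90.0, 95.5, 101.2]
predicted = [91.0, 94.0, 100.0]
score = rmsle(actual, predicted)

# An in-memory buffer stands in for the real submission file here.
buffer = io.StringIO()
write_submission(predicted, buffer)
submission_text = buffer.getvalue()
```

    Passing an open file instead of the in-memory buffer would write the submission to disk; the row order is never changed, in line with the note about not shuffling the test series.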

  8. Microsoft Professional Capstone DataSet

    • kaggle.com
    zip
    Updated Oct 22, 2017
    Cite
    Harsh Sharma (2017). Microsoft Professional Capstone DataSet [Dataset]. https://www.kaggle.com/sharmaharsh/microsoft-capstone
    Explore at:
    Available download formats: zip (6858551 bytes)
    Dataset updated
    Oct 22, 2017
    Authors
    Harsh Sharma
    Description

    Problem Description

    About the data

    Your goal is to predict a student's earnings a set number of years after they have enrolled in United States institutions of higher education. The data is compiled from a wide range of sources and made publicly available by the United States Department of Education.

    Target Variable

    We're trying to predict the variable income, which represents earnings in thousands of US dollars a set interval from when the student first enrolled.

    Submission Format

    The format for the submission file is two columns with the row_id and the income. The data type of income is a float, so make sure there is a decimal point in your submission. For example 0.0 is a valid float. 0 is not.

    For example, if you predicted...

    row_id  income
    2       0.0
    8       0.0
    9       0.0
    10      0.0
    11      0.0

    The first few lines of the .csv file that you submit would look like:

    row_id,income
    2,0.0
    8,0.0
    9,0.0
    10,0.0
    11,0.0
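
    One way to guarantee the decimal point in the income column is to coerce every prediction to a float before writing it. The sketch below is only an illustration: the helper names are hypothetical, and an in-memory buffer stands in for the real submission file.

```python
import csv
import io


def format_income(value):
    """Coerce to float so whole numbers are written as e.g. 0.0 rather than 0.

    Note: for very large or very small magnitudes, repr() switches to
    exponent notation (such as 1e+20), which has no decimal point.
    """
    return repr(float(value))


def write_submission(rows, handle):
    """Write (row_id, income) pairs under the row_id,income header."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["row_id", "income"])
    for row_id, income in rows:
        writer.writerow([row_id, format_income(income)])


# Predictions matching the example above; integer 0 is coerced to 0.0.
predictions = [(2, 0), (8, 0.0), (9, 0), (10, 0.0), (11, 0)]
buffer = io.StringIO()
write_submission(predictions, buffer)
submission_text = buffer.getvalue()
```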
    

    Performance Metric

    We're predicting a numeric quantity, so this is a regression problem. To measure regression, we'll use a metric called Root-mean-squared error. It is an error metric, so lower value is better (as opposed to an accuracy metric, where a higher value is better).

    \[RMSE = \sqrt{\frac{1}{N}\sum_{n=1}^{N} (\hat{y}_n - y_n)^2 }\]

    Where $\hat{y}_n$ is the predicted earnings and $y_n$ is the actual earnings. The best possible score is 0, but the worst possible score can be infinite.
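
    The formula translates directly into a few lines of Python. The sketch below is a plain standard-library version under the definitions above; the function name and the sample earnings are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root-mean-squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    n = len(actual)
    squared_errors = ((p - a) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(sum(squared_errors) / n)


# Hypothetical earnings in thousands of US dollars.
actual = [30.0, 42.5, 51.0]
predicted = [28.0, 45.5, 51.0]
score = rmse(actual, predicted)  # errors are -2, 3 and 0
```

    A perfect prediction gives 0, and the score grows without bound as predictions move away from the actual values.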

    Features

    There are 297 variables in this dataset. Each row in the dataset represents a United States institution of higher education in a specific year. The dataset we are working with covers four particular years, denoted year_a, year_f, year_w, and year_z in our dataset. An institution may have a row for all, some, or just for one of the years. We don't provide a unique identifier for an individual institution, just a row_id for each row.

    The variables in the dataset have names of the form category_variable, where category is the high-level category of the variable (e.g. academics or students) and variable is what the specific column contains.

    Categories

    • academics

      • program_assoc_agriculture: Associate degree in Agriculture, Agriculture Operations, And Related Sciences.
      • program_assoc_architecture: Associate degree in Architecture And Related Services.
      • program_assoc_biological: Associate degree in Biological And Biomedical Sciences.
      • program_assoc_business_marketing: Associate degree in Business, Management, Marketing, And Related Support Services.
      • program_assoc_communication: Associate degree in Communication, Journalism, And Related Programs.
      • program_assoc_communications_technology: Associate degree in Communications Technologies/Technicians And Support Services.
      • program_assoc_computer: Associate degree in Computer And Information Sciences And Support Services.
      • program_assoc_construction: Associate degree in Construction Trades.
      • program_assoc_education: Associate degree in Education.
      • program_assoc_engineering: Associate degree in Engineering.
      • program_assoc_engineering_technology: Associate degree in Engineering Technologies And Engineering-Related Fields.
      • program_assoc_english: Associate degree in English Language And Literature/Letters.
      • program_assoc_ethnic_cultural_gender: Associate degree in Area, Ethnic, Cultural, Gender, And Group Studies.
      • program_assoc_family_consumer_science: Associate degree in Family And Consumer Sciences/Human Sciences.
      • program_assoc_health: Associate degree in Health Professions And Related Programs.
      • program_assoc_history: Associate degree in History.
      • program_assoc_humanities: Associate degree in Liberal Arts And Sciences, General Studies And Humanities.
      • program_assoc_language: Associate degree in Foreign Languages, Literatures, And Linguistics.
      • program_assoc_legal: Associate degree in Legal Professions And Studies.
      • program_assoc_library: Associate degree in Library Science.
      • program_assoc_mathematics: Associate degree in Mathematics And Statistics.
      • program_assoc_mechanic_repair_technology: Associate degree in Mechanic And Repair Technologies/Technicians.
      • program_assoc_military: Associate degree in Military Technologies And Applied Sciences. ...
  9. The Great Indian Hiring Hackathon

    • kaggle.com
    zip
    Updated Nov 6, 2020
    + more versions
    Cite
    ask9 (2020). The Great Indian Hiring Hackathon [Dataset]. https://www.kaggle.com/arbazkhan971/the-great-indian-hiring-hackathon
    Explore at:
    Available download formats: zip (6815872 bytes)
    Dataset updated
    Nov 6, 2020
    Authors
    ask9
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Overview

    The current pandemic has shrunk the data science job market, and recruiters are also facing difficulties filtering the right talent. To bridge this gap, we bring a chance for the MachineHack community to compete for jobs with some of the key analytics players for a rewarding career in Data Science. In this competition, we are challenging the MachineHack community to come up with an algorithm to predict the price of retail items belonging to different categories. Foretelling the retail price can be a daunting task due to the huge datasets with a variety of attributes ranging from text and numbers (floats, integers) to DateTime. Also, outliers can be a big problem when dealing with unit prices.

    With a key focus on the Data Scientist role in an esteemed organization, this hackathon can help freshers and experienced folks prove their mettle and land a rewarding career.

    By participating in this hackathon, every participant will be eligible for the Data Scientist job role by making sure their MachineHack Information with Resume is up to date.

    Dataset Description:

    Train.csv - 284780 rows x 8 columns (Includes UnitPrice Column as Target)
    Test.csv - 122049 rows x 7 columns
    Sample Submission.csv - Please check the Evaluation section for more details on how to generate a valid submission

    Attribute Description:

    Invoice No - Invoice ID, encoded as Label
    StockCode - Unique code per stock, encoded as Label
    Description - The Description, encoded as Label
    Quantity - Quantity purchased
    InvoiceDate - Date of purchase
    UnitPrice - The target value, price of every product
    CustomerID - Unique Identifier for every customer
    Country - Country of sales, encoded as Label

    Skills:

    Multivariate Regression
    Big dataset, underfitting vs overfitting
    Optimizing RMSE to generalize well on unseen data

    Acknowledgements

    This dataset is taken from Machine Hack challenge https://www.machinehack.com/hackathons/retail_price_prediction_mega_hiring_hackathon/overview

  10. MH Indus PandoraBox

    • kaggle.com
    zip
    Updated Sep 13, 2022
    Cite
    Mohamed Ziauddin (2022). MH Indus PandoraBox [Dataset]. https://www.kaggle.com/datasets/mohamedziauddin/mh-indus-pandorabox
    Explore at:
    Available download formats: zip (138278864 bytes)
    Dataset updated
    Sep 13, 2022
    Authors
    Mohamed Ziauddin
    Description

    All you need to know about the “Quote to Code I: Pandora's Box by Indus OS”:

    You are provided with data about:

    app metadata: Metadata about the app
    user metadata: Metadata about the user
    app installs: apps installed by a user in the last six months
    app usage: apps used by the user in the last one week
    actual set: For participants to validate their model and results
    validation set: UIDs on which participants need to predict top four recommendations from the universe of apps in apps metadata
    data dictionary: check this file for more details about the datasets
    sample submission: schema to submit the final results

    Participants need to predict top four recommendations from the universe of apps in apps metadata.

    Note: The data was uploaded to help Kagglers make predictions for this MachineHack competition (source: https://machinehack.com/hackathons/quote_to_code_i_pandoras_box/data).

  11. Electrical Motor sound data

    • kaggle.com
    zip
    Updated Apr 16, 2024
    Cite
    Afroz (2024). Electrical Motor sound data [Dataset]. https://www.kaggle.com/datasets/pythonafroz/electrical-motor-anomaly-detection-from-sound-data
    Explore at:
    Available download formats: zip (4469982501 bytes)
    Dataset updated
    Apr 16, 2024
    Authors
    Afroz
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description

    This dataset is the "development dataset" for the DCASE 2021 Challenge Task 2 "Unsupervised Anomalous Sound Detection for Machine Condition Monitoring under Domain Shifted Conditions".

    The data consists of the normal/anomalous operating sounds of seven types of real/toy machines. Each recording is a single-channel 10-second audio that includes both a machine's operating sound and environmental noise. The following seven types of real/toy machines are used in this task:

    Fan
    Gearbox
    Pump
    Slide Rail
    ToyCar
    ToyTrain
    Valve

    Why focus on domain shift?

    The task setup of the 2020 version was the ASD under ideal conditions. The training- and testing-phase datasets were generated under the same recording conditions, and enough normal training clips recorded under the test domain were made available. In contrast, real-world cases are more complicated and often involve different machine operating conditions between the training and testing phases. A frequent example of this is when the motor speed continuously varies in a conveyor transporting products on a production line based on the production volume in response to product demand. Since there is infinite variation in rotation speed, the sound will also change with infinite variation. Due to the seasonal demand for many products, a limited period of recording training data limits the motor speed during that period (e.g., 200-300 rpm for autumn) and variations in the training data. However, in the test phase, the ASD system must continue to monitor the conveyor through all seasons, so it must be able to monitor all possible motor speed conditions, including those that differ from the training data (such as 100-400 rpm). In addition to the conditions of the machine, environmental noise conditions (SNR, sound characteristics, etc.) also fluctuate uncontrollably depending on the seasonal demand. In such a situation, the normal state's distribution will be changed (i.e., domain shift).

    Definition

    First, we define some important terms in this task: "machine type," "section," "source domain," and "target domain."

    The machine type means the kind of machine, which can be one of seven in this task: fan, gearbox, pump, slide rail, ToyCar, ToyTrain, and valve. The section is defined as a subset of the dataset for calculating performance metrics and is almost identical to what was called "machine ID" in the 2020 version. In the 2020 version, there was a one-to-one correspondence between machine IDs and products, but in the 2021 version, the same product may appear in different sections. Different products may appear in the same section. The source domain means the condition under which most of the training data was recorded, and the target domain means a different condition under which some of the test data was recorded. The source and target domains differ in terms of operating speed, machine load, viscosity, heating temperature, environmental noise, SNR, etc.

    Data

    This dataset consists of three sections for each machine type (Section 00, 01, and 02), and each section is a complete set of training and test data. For each section, this dataset provides (i) around 1,000 clips of normal sounds in a source domain for training, (ii) only three clips of normal sounds in a target domain for training, (iii) around 100 clips each of normal and anomalous sounds in the source domain for the test, and (iv) around 100 clips each of normal and anomalous sounds in the target domain for the test.

    Recording procedure

    Normal/anomalous operating sounds of machines and related equipment were recorded. Anomalous sounds were collected by deliberately damaging machines. To simplify the task, we only used the first channel of the multi-channel recordings; all recordings were regarded as single-channel recordings from a fixed microphone. We mixed a machine sound with environmental noise, and only noisy recordings are provided as training/test data. The environmental noise clips were recorded in several real factory environments. We will publish papers on the dataset to explain the details of the recording procedure by the submission deadline.

    Reference labels

    The given labels for each training/test clip are machine type, section index, normal/anomaly information, and brief attribute information about conditions other than normal/abnormal. The machine type information is given by the directory name. The section index is given by their respective file names. For the datasets other than the evaluation dataset, the normal/anomaly information is given by their respective file names. For the training data, the attribute information is given by their respective file names.

    Baseline system

    Two simple baseline systems are available in the GitHub repositories [URL] and [URL]. The baseline systems provide a simple entry-level approach that gives reasonable performance on the Task 2 dataset. They...

  12. deloitte hackathon predict the loan defaulter

    • kaggle.com
    zip
    Updated Nov 30, 2021
    Cite
    Manish Tripathi (2021). deloitte hackathon predict the loan defaulter [Dataset]. https://www.kaggle.com/datasets/manishtripathi86/deloitte-hackathon-predict-the-loan-defaulter/discussion
    Explore at:
    zip(9131913 bytes)Available download formats
    Dataset updated
    Nov 30, 2021
    Authors
    Manish Tripathi
    Description

    Overview

    Deloitte refers to one or more of Deloitte Touche Tohmatsu Limited (“DTTL”), its global network of member firms, and their related entities (collectively, the “Deloitte organization”). DTTL (also referred to as “Deloitte Global”) and each of its member firms and related entities are legally separate and independent entities, which cannot obligate or bind each other in respect of third parties. DTTL and each DTTL member firm and related entity is liable only for its own acts and omissions, and not those of each other. DTTL does not provide services to clients. Please see www.deloitte.com/about to learn more.

    All the facts and figures that talk to our size and diversity and years of experiences, as notable and important as they may be, are secondary to the truest measure of Deloitte: the impact we make in the world.

    So, when people ask, “what’s different about Deloitte?” the answer resides in the many specific examples of where we have helped Deloitte member firm clients, our people, and sections of society to achieve remarkable goals, solve complex problems or make meaningful progress. Deeper still, it’s in the beliefs, behaviours and fundamental sense of purpose that underpin all that we do. Deloitte globally has grown in scale and diversity, with more than 345,000 people in 150 countries providing multidisciplinary services, yet our shared culture remains the same.

    (C) 2021 Deloitte Touche Tohmatsu India LLP

    Dataset Link: https://machinehack-staging.netlify.app/hackathons/deloitte_hackathon_predict_the_loan_defaulter/overview

    https://analyticsindiamag.com/deloitte-in-association-with-machine-hack-present-machine-learning-challenge-an-exclusive-online-hackathon-for-data-scientists/

    The data has been posted here for easy use of Kaggle kernels by competition participants. I do not claim any ownership of the data.

    Challenge Dataset Description

    • Train.csv - 70,000 rows x 40 columns (Includes target column as Loan Status)
    • Attributes: ID, Loan Amount, Funded Amount, Funded Amount Investor, Term, Batch Enrolled, Interest Rate, Grade, Sub Grade, CTC, Designation, Employment Duration, Home Ownership, Verification Status, Payment Plan, Loan Purpose, Loan Title, Zip Code, Address State, Debt to Income, Delinquency - two years, Inquires - six months, Open Account, Public Record, Revolving Balance, Revolving Utilities, Total Accounts, Initial List Status, Total Received Interest, Total Received Late Fee, Recoveries, Collection Recovery Fee, Collection 12 months Medical, Application Type, Last week Pay, Accounts Delinquent, Total Collection Amount, Total Current Balance, Total Revolving Credit Limit, Loan Status
    • Test.csv - 30,000 rows x 39 columns (Includes target column as Loan Status)
    • Sample Submission.csv - Please check the Evaluation section for more details on how to generate a valid submission.

    The challenge is to predict the Loan Status.

    Knowledge and Skills

    • Big dataset, underfitting vs overfitting
    • Optimising log_loss to generalise well on unseen data

    Competition Rules:

    ONE ACCOUNT PER PARTICIPANT
    One account per participant. Submissions from multiple accounts will lead to disqualification. All registered users are eligible to participate in the hackathon. We ask that you respect the spirit of the competition and do not cheat.

    NO PRIVATE SHARING OUTSIDE TEAMS
    No private sharing outside teams. Any discrepancies reported will be taken seriously and can lead to disqualification.

    SUBMISSION LIMITS
    The submission limit for the hackathon is 3 per day, after which submissions will not be evaluated. All registered users are eligible to participate in the hackathon. We ask that you respect the spirit of the competition and do not cheat.

    COMPETITION TIMELINE
    Start Date: 26/11/2021
    End Date: 13/12/2021

    Hackathon Specific Rules
    Deadline: This hackathon will expire on 22nd November at 06:00 PM IST.
    Disqualification: Analytics India Magazine and Deloitte reserve the right to disqualify any participant if the details provided are found incorrect. Any external dataset usage is strictly prohibited; participants found using any external dataset will be disqualified.

    Evaluation

    What is the metric in this competition? How is the leaderboard calculated?
    • The submission will be evaluated using the Log Loss metric. One can use sklearn.metrics.log_loss to calculate it.
    • This hackathon supports private and public leaderboards.
    • The public leaderboard is evaluated on 30% of the test data.
    • The private leaderboard will be made available at the end of the hackathon and will be evaluated on 100% of the test data.
    • The Final Score represents the score achieved based on the Best Score on the public leaderboard.

    How to generate a valid submission file
    Sklearn models support the predict() method to generate the predicted values.

    You should submit a .csv file with exactly 100,000 rows with 1 column(loan_status). Your submission will return an Invalid Score if you hav...
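
    The Log Loss metric named in the Evaluation text can be sketched in plain Python as follows. This is a minimal illustration for a binary target, not the organizers' scoring code: the function name, the clipping constant eps, and the 0/1 encoding of Loan Status are assumptions made here.

```python
import math

def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of 0/1 labels under predicted probabilities."""
    total = 0.0
    for label, prob in zip(y_true, y_prob):
        # Clip so that log(0) can never occur (eps is an assumed constant).
        prob = min(max(prob, eps), 1.0 - eps)
        total -= label * math.log(prob) + (1 - label) * math.log(1.0 - prob)
    return total / len(y_true)

# Hypothetical labels and predicted probabilities of the positive class.
labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.6, 0.4]
print(binary_log_loss(labels, probs))
```

    Because log loss scores probabilities rather than hard class labels, a scikit-learn classifier's predict_proba() output is usually the more natural input for this metric than predict().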

  13. Soccer Fever Challenge

    • kaggle.com
    zip
    Updated Aug 21, 2021
    Cite
    Aishik Rakshit (2021). Soccer Fever Challenge [Dataset]. https://www.kaggle.com/datasets/aishikai/soccer-fever-challenge
    Explore at:
    zip(263194 bytes)Available download formats
    Dataset updated
    Aug 21, 2021
    Authors
    Aishik Rakshit
    Description

    Overview

    Welcome back to the ‘Weekend Hackathon Edition 2- The Last Hacker Standing’ at Machine Hack. In this edition, we will be posing unique problem statements every week, which will test you on various aspects of being a Data Scientist. The Weekend Edition will be held for a 6-week period from 30 July 2021 to 9 Sept 2021.

    This time it is dedicated to the passion and fervour that a sport creates. Challenge Name: THE SOCCER FEVER

    Introduction

    Soccer aka Football is the most popular game in the world. It’s a religion of its own. If groups of 10 people can stop time and make people watch them in awe and reverence, it’s this beautiful game. Also, anybody can play soccer- all it needs is 4 poles, a ground and a ball. You can just get started with the play.

    In fact, Nelson Mandela very effectively used Football as the unifying factor when he was elected President of South Africa post the Apartheid era. The sport just cuts across all discriminating factors.

    Relevance

    An entire ecosystem revolves around this beautiful sport. Clubs, Merchandise, listed football clubs, fan clubs and a group of rivals who can just get into a fight based on the outcome of the game. The amount of currency involved in this game is just phenomenal. It impacts millions of people who depend on it for their livelihood and recreation.

    Criticality

    We live in ambiguity and always need some information to just make a decision. Decisions are made based on possible outcomes. Win/ Loss/ Pass / Fail etc.

    The below problem statement is a classic study for decision-making and understanding the odds stacked against a particular situation.

    Train

    Dataset: 7443*21
    Columns: 21
    Target Column: Outcome
    

    Evaluation Metric: Log Loss

    Test

    Dataset: 4008*20
    Columns: 20
    

    Submission Format :

    Dataset: 4008*1( Column Name - ‘Outcome’)
    

    Skills

    Multi-Class Classification
    Optimizing Log Loss
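
    For a multi-class target such as Outcome, log loss averages the negative log of the probability assigned to the true class of each row. The sketch below is illustrative only: the number of classes (three here), their integer encoding, and the function name are assumptions, since the listing does not state them.

```python
import math

def multiclass_log_loss(y_true, prob_rows, eps=1e-15):
    """Average negative log-probability assigned to the true class of each row."""
    total = 0.0
    for label, row in zip(y_true, prob_rows):
        # row[label] is the predicted probability of the true class.
        prob = min(max(row[label], eps), 1.0 - eps)
        total -= math.log(prob)
    return total / len(y_true)

# Hypothetical example with three classes encoded as 0, 1, 2.
true_classes = [0, 2, 1]
predicted = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
]
print(multiclass_log_loss(true_classes, predicted))
```

    A model that spreads its probability evenly over k classes scores ln(k), which gives a simple reference point when judging a leaderboard score.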
    
  14. HSBC ML Hackathon 2023

    • kaggle.com
    zip
    Updated Apr 21, 2023
    Cite
    Ashis Parida (2023). HSBC ML Hackathon 2023 [Dataset]. https://www.kaggle.com/datasets/ashisparida/hsbc-ml-hackathon-2023
    Explore at:
    zip(53780476 bytes)Available download formats
    Dataset updated
    Apr 21, 2023
    Authors
    Ashis Parida
    Description
    Payment transfer

    Max. score: 100

    You are working as a data scientist with the Payments team of the bank. The team continually responds to emerging threats by building cutting-edge machine-learning-driven models and strategies, working with best-in-class service providers specialized in counter-fraud solutions. In recent years, there has been increased scrutiny of digital payments to check their genuineness.
    To help the team deal with this problem, you are provided with payments data to predict whether the customers themselves made the transfer or not. The payments data contains the attributes that are captured when a payment is initiated by a banking customer.

    Task

    You are required to build a machine learning model that can predict whether the customer themselves have made the transfer or not.

    Dataset description

    The dataset folder contains the following files:

    • train.csv: 233633 x 14
    • train_helper.csv: 1231200 x 10
    • test.csv: 215852 x 13
    • test_helper.csv: 1160950 x 10
    • sample_submission.csv: 215852 x 3

    Evaluation metric

    from sklearn import metrics

    # actual holds the true labels, predicted the submitted predictions
    precision, recall, threshold = metrics.precision_recall_curve(actual, predicted)
    score = max(0, 100*metrics.auc(precision, recall))
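
    To make the metric above concrete, here is a dependency-free sketch of an area-under-the-precision-recall-curve score scaled to 0-100. It is an approximation under stated assumptions, not the official scorer: the source snippet passes precision as the first argument to metrics.auc, whereas this sketch integrates precision over recall (the usual convention), and the function name and starting point (recall 0, precision 1) are choices made here.

```python
def pr_auc_score(actual, predicted):
    """Trapezoidal area under the precision-recall curve, scaled to 0-100."""
    # Rank rows from the highest to the lowest predicted score.
    pairs = sorted(zip(predicted, actual), key=lambda pair: -pair[0])
    total_positives = sum(actual)
    points = [(0.0, 1.0)]  # (recall, precision) starting point, by convention
    true_pos = false_pos = 0
    i = 0
    while i < len(pairs):
        score = pairs[i][0]
        # Rows that share a score enter the curve together as one threshold.
        while i < len(pairs) and pairs[i][0] == score:
            if pairs[i][1] == 1:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        points.append((true_pos / total_positives,
                       true_pos / (true_pos + false_pos)))
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2
    return max(0.0, 100 * area)

# Hypothetical labels and scores: every positive is ranked above every negative.
print(pr_auc_score([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))
```

    A ranking that places every genuine positive above every negative reaches the maximum score of 100, which matches the "Max. score: 100" stated for the problem.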

    Result submission guidelines

    • The index is "V2" and the target is the ["Probability","Target"] columns.
    • The submission file must be submitted in .csv format only.
    • The size of this submission file must be 215852 x 3.

    Note: Ensure that your submission file contains the following:

    • Correct index values as per the test file
    • Correct names of columns as provided in the sample_submission.csv file
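
    The submission guidelines above can be sketched as a small CSV builder. The column names "V2", "Probability" and "Target" come from the guidelines; the column order, the 0.5 threshold used to derive Target, and the function name are assumptions for illustration, and the real sample_submission.csv should be treated as authoritative.

```python
import csv
import io

def build_submission(ids, probabilities, threshold=0.5):
    """Return CSV text with one row per test id and three columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["V2", "Probability", "Target"])
    for row_id, prob in zip(ids, probabilities):
        # Target is derived from the probability with an assumed threshold.
        writer.writerow([row_id, prob, int(prob >= threshold)])
    return buffer.getvalue()

# Hypothetical ids and predicted probabilities.
print(build_submission(["id_1", "id_2"], [0.91, 0.08]))
```

    Writing to an in-memory buffer keeps the sketch self-contained; in practice the same rows would be written to a .csv file whose row count matches the test file.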
    
  15. Cars Price Dataset

    • kaggle.com
    zip
    Updated Aug 3, 2021
    + more versions
    Cite
    Rakesh Jarupula (2021). Cars Price Dataset [Dataset]. https://www.kaggle.com/jarupula/machine-hack
    Explore at:
    zip(606314 bytes)Available download formats
    Dataset updated
    Aug 3, 2021
    Authors
    Rakesh Jarupula
    Description

    About Data

    With the rise in the variety of cars with differentiated capabilities and features such as model, production year, category, brand, fuel type, engine volume, mileage, cylinders, colour, airbags and many more, we are bringing a car price prediction challenge for all. We all aspire to own a car within budget with the best features available. To solve the price problem, we have created a dataset with 19237 rows for training and 8245 rows for testing.

    Dataset Description

    • Train.csv - 19237 rows x 18 columns (includes the Price column as target)
    • Attributes
      • ID
      • Price: price of the car (target column)
      • Levy
      • Manufacturer
      • Model
      • Prod. year
      • Category
      • Leather interior
      • Fuel type
      • Engine volume
      • Mileage
      • Cylinders
      • Gear box type
      • Drive wheels
      • Doors
      • Wheel
      • Color
      • Airbags
    • Test.csv - 8245 rows x 17 columns
    • Sample Submission.csv
  16. Dare In Reality

    • kaggle.com
    zip
    Updated Nov 8, 2021
    Cite
    Prakhar Prasad (2021). Dare In Reality [Dataset]. https://www.kaggle.com/datasets/prakharprasad/dare-in-reality
    Explore at:
    zip(370370 bytes)Available download formats
    Dataset updated
    Nov 8, 2021
    Authors
    Prakhar Prasad
    Description

    Context

    In the heat of a Formula E race, teams need fast access to insights that can help drivers make split-second decisions and cross the finish line first. Can your data-science skills help Envision Racing, one of the founding teams in the championship, take home even more trophies?

    Build a machine learning model that predicts the Envision Racing drivers’ lap times for the all-important qualifying sessions that determine what position they start the race in. Winning races involves a combination of both a driver’s skills and data analytics. To help the team you’ll need to consider several factors that affect performance during a session, including weather, track conditions, and a driver’s familiarity with the track.

    Genpact, a leading professional services firm that focuses on digital transformation, in collaboration with Envision Racing, a Formula E racing team, and the digital hackathon platform MachineHack, a brainchild of Analytics India Magazine, is launching ‘Dare in Reality’. This two-week hackathon allows data science professionals, machine learning engineers, artificial intelligence practitioners, and other tech enthusiasts to showcase their skills, impress the judges, and stand a chance to win exciting cash prizes.

    Genpact (NYSE: G) is a global professional services firm that makes business transformation real, driving digital-led innovation and digitally enabled intelligent operations for our clients.

    Content

    Dataset Description

    • train.csv - 10276 rows x 25 columns (Includes target column as LAP_TIME)
    • Attributes
      • NUMBER: Number in sequence
      • DRIVER_NUMBER: Driver number
      • LAP_NUMBER: Lap number
      • LAP_TIME: Lap time in seconds
      • LAP_IMPROVEMENT: Number of Lap Improvement
      • CROSSING_FINISH_LINE_IN_PIT
      • S1: Sector 1 in [min:sec.microseconds]
      • S1_IMPROVEMENT: Improvement in sector 1
      • S2: Sector 2 in [min:sec.microseconds]
      • S2_IMPROVEMENT: Improvement in sector 2
      • S3: Sector 3 in [min:sec.microseconds]
      • S3_IMPROVEMENT: Improvement in sector 3
      • KPH: Speed in kilometer/hour
      • ELAPSED: Time elapsed in [min:sec.microseconds]
      • HOUR: in [min:sec.microseconds]
      • S1_LARGE: in [min:sec.microseconds]
      • S2_LARGE: in [min:sec.microseconds]
      • S3_LARGE: in [min:sec.microseconds]
      • DRIVER_NAME: Name of the driver
      • PIT_TIME: Time the car spends stopped in the pits for fuel and other consumables to be renewed or replenished
      • GROUP: Group of driver
      • TEAM: Team name
      • POWER: Brake Horsepower (bhp)
      • LOCATION: Location of the event
      • EVENT: Free practice or qualifying
    • test.csv - 420 rows x 25 columns (Includes target column as LAP_TIME)
    • submission.csv - Please check the Evaluation section for more details on how to generate a valid submission.

    The challenge is to predict the LAP_TIME for the qualifying groups of location 6, 7 and 8.

    Knowledge and Skills

    • Multivariate Regression
    • Big dataset, underfitting vs overfitting
    • Optimizing RMSLE to generalize well on unseen data
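
    The RMSLE metric mentioned under Knowledge and Skills can be sketched as below. This is a generic root mean squared logarithmic error, assuming the common log(1 + x) form; the function name and the sample lap times are hypothetical, and the hackathon's exact scoring code is not shown in this listing.

```python
import math

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, using log(1 + x) on both sides."""
    squared_errors = [
        (math.log1p(pred) - math.log1p(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical lap times in seconds: actual values versus model predictions.
actual_lap_times = [92.4, 88.1, 101.7]
predicted_lap_times = [90.0, 89.5, 99.2]
print(rmsle(actual_lap_times, predicted_lap_times))
```

    Because the error is taken on a log scale, RMSLE penalizes relative rather than absolute differences, so a 2-second miss on a short lap costs more than the same miss on a long one.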

    Acknowledgements

    The hackathon and the dataset were published on MachineHack: https://machinehack.com/hackathons/dare_in_reality_hackathon/overview

  17. Detecting Anomalies in Wafer Manufacturing

    • kaggle.com
    zip
    Updated Aug 28, 2020
    Cite
    SubhamNagar (2020). Detecting Anomalies in Wafer Manufacturing [Dataset]. https://www.kaggle.com/subham07/detecting-anomalies-in-water-manufacturing
    Explore at:
    zip(124481 bytes)Available download formats
    Dataset updated
    Aug 28, 2020
    Authors
    SubhamNagar
    Description

    Content

    The Dataset is taken from Machine Hack Weekend Hackathon #18

    The filenames are described as follows:

    Train.csv: Training set containing 1559 columns. The 1558 feature columns, with indices 0-1557, give information about various attributes that were collected from the manufacturing machine. The last column is the target variable (Class).

    Test.csv: Test set containing 1558 feature columns, with indices 0-1557, giving information about various attributes that were collected from the manufacturing machine.

    The submission CSV file has just one column (Class), which stores the predicted value of the target variable.

    Class (0 or 1): Represents Good/Anomalous class labels for the products.

    Acknowledgement

    Machine Hack

  18. MH : Dare in Reality 2021

    • kaggle.com
    zip
    Updated Nov 8, 2021
    Cite
    Mohamed Ziauddin (2021). MH : Dare in Reality 2021 [Dataset]. https://www.kaggle.com/mohamedziauddin/mh-dare-in-reality-2021
    Explore at:
    zip(370370 bytes)Available download formats
    Dataset updated
    Nov 8, 2021
    Authors
    Mohamed Ziauddin
    Description

    Dataset Description

    • train.csv - 10276 rows x 25 columns (Includes target column as LAP_TIME)
    • Attributes
      • NUMBER: Number in sequence
      • DRIVER_NUMBER: Driver number
      • LAP_NUMBER: Lap number
      • LAP_TIME: Lap time in seconds
      • LAP_IMPROVEMENT: Number of Lap Improvement
      • CROSSING_FINISH_LINE_IN_PIT
      • S1: Sector 1 in [min:sec.microseconds]
      • S1_IMPROVEMENT: Improvement in sector 1
      • S2: Sector 2 in [min:sec.microseconds]
      • S2_IMPROVEMENT: Improvement in sector 2
      • S3: Sector 3 in [min:sec.microseconds]
      • S3_IMPROVEMENT: Improvement in sector 3
      • KPH: Speed in kilometer/hour
      • ELAPSED: Time elapsed in [min:sec.microseconds]
      • HOUR: in [min:sec.microseconds]
      • S1_LARGE: in [min:sec.microseconds]
      • S2_LARGE: in [min:sec.microseconds]
      • S3_LARGE: in [min:sec.microseconds]
      • DRIVER_NAME: Name of the driver
      • PIT_TIME: Time the car spends stopped in the pits for fuel and other consumables to be renewed or replenished
      • GROUP: Group of driver
      • TEAM: Team name
      • POWER: Brake Horsepower (bhp)
      • LOCATION: Location of the event
      • EVENT: Free practice or qualifying
    • test.csv - 420 rows x 25 columns (Includes target column as LAP_TIME)
    • submission.csv - predict the LAP_TIME for the qualifying groups of location 6, 7 and 8 for the test data.
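
    Since the sector columns S1, S2 and S3 are described as [min:sec.microseconds] text while LAP_TIME is in seconds, a first preprocessing step is often to convert them to a common unit. The sketch below assumes values shaped like "1:23.456" or plain seconds such as "23.456"; the function name and sample values are hypothetical, and the actual files may use a different layout.

```python
def sector_time_to_seconds(value):
    """Convert 'min:sec.fraction' text (or plain seconds) to float seconds."""
    text = value.strip()
    if ":" in text:
        # Split once: the part before the colon is minutes, the rest is seconds.
        minutes, seconds = text.split(":", 1)
        return int(minutes) * 60 + float(seconds)
    return float(text)

# Hypothetical sector values as they might appear in the S1 column.
for raw in ["1:23.456", "23.456", " 0:41.2 "]:
    print(raw, "->", sector_time_to_seconds(raw))
```

    Columns with more than one colon, which the HOUR column might contain, would need a separate parser and are not handled here.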

    Note: this data comes from the MachineHack site below and is posted here so that Kaggle users can use Kaggle notebooks for modelling: https://machinehack.com/hackathons/dare_in_reality_hackathon/data

Abhinav Padmawar (2020). ML HACK Dataset [Dataset]. https://www.kaggle.com/abhinavpadmawar20/ml-hack-dataset

ML HACK Dataset

Dataset used in ML Hack Hackathon

Explore at:
zip(94446 bytes)Available download formats
Dataset updated
Nov 17, 2020
Authors
Abhinav Padmawar
Description

Given below are three files that you will be using for the challenge. Download all the files. The training file has a labelled data set. However, the test file shall only have the features. Work out your algorithm for the same and make predictions on the test file after which you have to create a submissions.csv file that will be evaluated. You may refer to the sample_submission.csv file in order to understand the overall structure of your submission. The dataset consists of overall stats of players in ODIs only.

File descriptions:

train.csv - the training set
test.csv - the test set
sampleSubmission.csv - a sample submission file in the correct format

Data fields

id - an anonymous id unique to the player
Name - Name of the player
Age - Age
100s - Number of centuries of the player
50s - Number of half centuries of the player
6s - Total number of sixes hit by the player
Balls - Number of balls bowled by the player
Bat_Average - Average batting score
Bowl_Strike_Rate - Average number of balls bowled per wicket taken
Balls faced - Number of balls faced
Economy - Average number of runs conceded for each over bowled
Innings - Number of innings played
Overs - Number of overs bowled
Maidens - Overs when no run was conceded
Runs - Total runs scored by the player
Wickets - Number of wickets taken
Ratings - Final rating of the player
