25 datasets found
  1. GMSC Dataset

    • paperswithcode.com
    Updated Jan 24, 2024
    Cite
    (2024). GMSC Dataset [Dataset]. https://paperswithcode.com/dataset/gmsc
    Explore at:
    Dataset updated
    Jan 24, 2024
    Description

    Data for a Kaggle competition

    Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit.

    Credit scoring algorithms, which make a guess at the probability of default, are the method banks use to determine whether or not a loan should be granted. This competition requires participants to improve on the state of the art in credit scoring, by predicting the probability that somebody will experience financial distress in the next two years.

    The goal of this competition is to build a model that borrowers can use to help make the best financial decisions.

    Historical data are provided on 250,000 borrowers and the prize pool is $5,000 ($3,000 for first, $1,500 for second and $500 for third).
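    The task above is a standard probability-of-default exercise. Below is a minimal, hedged modelling sketch; the file name cs-training.csv and the target column SeriousDlqin2yrs follow the usual Kaggle release of this competition, but should be checked against your download.

    ```python
    # Minimal credit-scoring sketch for the GMSC data (assumed file/column names).
    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("cs-training.csv", index_col=0)        # assumed file name
    y = df["SeriousDlqin2yrs"]                              # 1 = financial distress within two years
    X = df.drop(columns=["SeriousDlqin2yrs"])
    X = X.fillna(X.median(numeric_only=True))               # simple median imputation

    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )

    model = GradientBoostingClassifier(random_state=42)
    model.fit(X_train, y_train)

    # Predicted probability of distress, evaluated with ROC AUC on a held-out split.
    proba = model.predict_proba(X_val)[:, 1]
    print("Validation AUC:", roc_auc_score(y_val, proba))
    ```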

  2. Credit Card Behaviour Score

    • kaggle.com
    Updated Jan 10, 2025
    Cite
    Suvradeep (2025). Credit Card Behaviour Score [Dataset]. https://www.kaggle.com/datasets/suvroo/credit-card-behaviour-score/versions/1
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 10, 2025
    Dataset provided by
    Kaggle
    Authors
    Suvradeep
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Bank A issues Credit Cards to eligible customers. The Bank deploys advanced ML models and frameworks to decide on eligibility, limit, and interest rate assignment. The models and frameworks are optimized to manage early risk and ensure profitability. The Bank has now decided to build a robust risk management framework for its existing Credit Card customers, irrespective of when they were acquired. To enable this, the Bank has decided to create a “Behaviour Score”. A Behaviour Score is a predictive model. It is developed on a base of customers whose Credit Cards are open and are not past due. The model predicts the probability of customers defaulting on the Credit Cards going forward. This model will then be used for several portfolio risk management activities.

    Problem statement

    Your objective is to develop the Behaviour Score for Bank A.

    Datasets

    You have been provided with a random sample of 96,806 Credit Card details in “Dev_data_to_be_shared.zip”, along with a flag (bad_flag) – henceforth known as “development data”. This is a historical snapshot of the Credit Card portfolio of Bank A. Credit Cards that have actually defaulted have bad_flag = 1. You have also been provided with several independent variables. These include:

    • On-us attributes like credit limit (variables with names starting with onus_attributes)
    • Transaction-level attributes like the number of transactions / rupee value of transactions at various kinds of merchants (variables with names starting with transaction_attribute)
    • Bureau tradeline-level attributes (like product holdings, historical delinquencies) – variables starting with bureau
    • Bureau enquiry-level attributes (like PL enquiries in the last 3 months) – variables starting with bureau_enquiry

    You have also been provided with another random sample of 41,792 Credit Card details in “validation_data_to_be_shared.zip” with the same set of input variables, but without bad_flag. This will be referred to going forward as “validation data”.

    Requirements

    Using the data provided, you will have to come up with a way to predict the probability that a given Credit Card customer will default. You can use the development data for this purpose. You are then required to use the same logic to predict the probability of all the Credit Cards which are a part of the validation data. Your submission should contain two columns – the Primary key from the validation data (account_number), and the predicted probability against that account. You are also required to submit a detailed documentation of this exercise. A good document should contain details about your approach. In this section, you should include a write up on any algorithms that you use. You should then cover each of the steps that you have followed in as much detail as you can. You should then move on to any key insights or observations that you have come across in the data provided to you. Finally, you should write about what metrics you have used to measure the effectiveness of the approach that you have followed.

    Evaluation

    As detailed in the previous section, you are required to submit the Primary key and predicted probabilities of all the accounts provided to you in the validation data, as well as documentation. We will only evaluate submissions that are complete and pass sanity checks (for example, probability values should be between 0 and 1). Submissions will be evaluated on how close the predicted probabilities are to the actual outcomes. We will also evaluate the documentation on its completeness and accuracy. Extra points will be granted to submissions that include interesting insights / observations on the data provided.
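    As a rough illustration of the required workflow (fit on the development data, score the validation data, and write the two-column submission), here is a hedged sketch. The CSV file names are assumptions about what the zip archives contain; bad_flag and account_number are the columns named in the brief.

    ```python
    # Hedged sketch: train on development data, predict probabilities for the
    # validation data, and write the two-column submission described above.
    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    dev = pd.read_csv("Dev_data_to_be_shared.csv")          # assumed contents of Dev_data_to_be_shared.zip
    val = pd.read_csv("validation_data_to_be_shared.csv")   # assumed contents of validation_data_to_be_shared.zip

    features = [c for c in dev.columns if c not in ("account_number", "bad_flag")]

    clf = make_pipeline(SimpleImputer(strategy="median"),
                        StandardScaler(),
                        LogisticRegression(max_iter=1000))
    clf.fit(dev[features], dev["bad_flag"])

    submission = pd.DataFrame({
        "account_number": val["account_number"],
        "predicted_probability": clf.predict_proba(val[features])[:, 1],  # always in [0, 1]
    })
    submission.to_csv("submission.csv", index=False)
    ```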

  3. Credit Card Fraud Detection

    • kaggle.com
    zip
    Updated Sep 14, 2019
    + more versions
    Cite
    Prasanna Venkatesh (2019). Credit Card Fraud Detection [Dataset]. https://www.kaggle.com/prasy46/credit-card-fraud-detection
    Explore at:
    zip (70,543,178 bytes). Available download formats
    Dataset updated
    Sep 14, 2019
    Authors
    Prasanna Venkatesh
    Description

    Data

    We provide you with a data set in CSV format. The data set contains 2 lakh+ (200,000+) training instances and 56 thousand test instances. There are 31 input features, labeled V1 to V28 and Amount.

    The target variable is labeled Class.

    Task

    Create a Classification model to predict the target variable Class.

    1. A report - a PowerPoint presentation
    2. Any custom code you used
    3. Instructions for me to run your model on a separate data set

    What should be in the report?

    1. List of any assumptions that you made
    2. Description of your methodology and solution path
    3. List of algorithms and techniques you used
    4. List of tools and frameworks you used
    5. Results and evaluation of your models

    How to evaluate the model

    1. Use the F1 score as the primary metric (see the evaluation sketch below)
    2. Add any other evaluation measure that you believe is appropriate, other than accuracy
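    A minimal evaluation sketch for the metrics above; y_true and y_pred are placeholders for your held-out labels and model predictions.

    ```python
    # Evaluation sketch: F1 plus a fuller per-class report (placeholder predictions).
    from sklearn.metrics import classification_report, f1_score

    y_true = [0, 0, 1, 0, 1, 1, 0, 0]   # ground-truth Class labels on a held-out set
    y_pred = [0, 0, 1, 0, 0, 1, 0, 1]   # hard predictions from the model

    print("F1 score:", f1_score(y_true, y_pred))
    print(classification_report(y_true, y_pred, digits=3))
    ```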
  4. ‘💳 CFPB Credit Card History’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Feb 13, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘💳 CFPB Credit Card History’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-cfpb-credit-card-history-e038/da44b0a1/?iid=003-328&v=presentation
    Explore at:
    Dataset updated
    Feb 13, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘💳 CFPB Credit Card History’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/cfpb-credit-card-historye on 13 February 2022.

    --- Dataset description provided by original source is as follows ---

    About this dataset

    Lending levels
    Monitoring developments in overall activity helps us identify new developments in the markets we regulate. These graphs show the number and aggregate dollar volume of new credit cards opened each month. Aggregated monthly originations are displayed along with a seasonally adjusted series, which adjusts for expected seasonal variation in lending activity.

    Year-over-year changes
    These graphs show the percentage change in the number of new credit cards originated in the month, compared to lending activity from one year ago. Positive changes indicate that lending activity is higher than it was last year and negative values indicate that lending has declined.

    Geographic changes
    This map shows the percentage change in the volume of new credit cards originated in each state, compared to lending activity from one year ago. Positive changes mean that the volume of credit cards originated in the state during the month is higher than it was one year ago, and negative values indicate that the volume of credit cards has declined.

    Source: https://www.consumerfinance.gov/data-research/consumer-credit-trends/credit-cards/origination-activity/#anchor_geographic-changes

    This dataset was created by Adam Helsinger and contains around 300 samples along with Group, Month, technical information and other features such as: - Group - Month - and more.

    How to use this dataset

    • Analyze Group in relation to Month
    • Study the influence of Group on Month
    • More datasets

    Acknowledgements

    If you use this dataset in your research, please credit Adam Helsinger


    --- Original source retains full ownership of the source dataset ---

  5. Kaggle-Credit Card Fraud Dataset Dataset

    • paperswithcode.com
    Updated Sep 15, 2013
    + more versions
    Cite
    (2013). Kaggle-Credit Card Fraud Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/kaggle-credit-card-fraud-dataset
    Explore at:
    Dataset updated
    Sep 15, 2013
    Description

    The dataset contains transactions made by credit cards in September 2013 by European cardholders. It presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

    It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data. Features V1, V2, … V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount; it can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable and takes value 1 in case of fraud and 0 otherwise.

    Given the class imbalance ratio, we recommend measuring the accuracy using the Area Under the Precision-Recall Curve (AUPRC). Confusion matrix accuracy is not meaningful for unbalanced classification.
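    A hedged sketch of the recommended evaluation, using average precision as the AUPRC estimate; the file name creditcard.csv matches the usual Kaggle release but is an assumption here.

    ```python
    # AUPRC evaluation sketch for the highly imbalanced Class target (assumed file name).
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import average_precision_score
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv("creditcard.csv")
    X, y = df.drop(columns=["Class"]), df["Class"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0
    )

    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    clf.fit(X_train, y_train)

    proba = clf.predict_proba(X_test)[:, 1]
    # A random scorer would get roughly the positive rate (~0.0017), so compare against that baseline.
    print("AUPRC:", average_precision_score(y_test, proba))
    ```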

  6. ‘Corporate Credit Rating’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Corporate Credit Rating’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-corporate-credit-rating-7978/a5465968/?iid=023-805&v=presentation
    Explore at:
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Corporate Credit Rating’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/agewerc/corporate-credit-rating on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    A corporate credit rating expresses the ability of a firm to repay its debt to creditors. Credit rating agencies are the entities responsible for making the assessment and giving a verdict. When a big corporation from the US or anywhere in the world wants to issue a new bond, it hires a credit agency to make an assessment so that investors can know how trustworthy the company is. The assessment is based mainly on the financial indicators that come from the balance sheet. Some of the most important agencies in the world are Moody's, Fitch and Standard & Poor's.

    Content

    A list of 2029 credit ratings issued by major agencies such as Standard & Poor's to big US firms (traded on NYSE or Nasdaq) from 2010 to 2016. There are 30 features for every company, of which 25 are financial indicators. They can be divided into:

    • Liquidity Measurement Ratios: currentRatio, quickRatio, cashRatio, daysOfSalesOutstanding
    • Profitability Indicator Ratios: grossProfitMargin, operatingProfitMargin, pretaxProfitMargin, netProfitMargin, effectiveTaxRate, returnOnAssets, returnOnEquity, returnOnCapitalEmployed
    • Debt Ratios: debtRatio, debtEquityRatio
    • Operating Performance Ratios: assetTurnover
    • Cash Flow Indicator Ratios: operatingCashFlowPerShare, freeCashFlowPerShare, cashPerShare, operatingCashFlowSalesRatio, freeCashFlowOperatingCashFlowRatio

    For more information about financial indicators visit: https://financialmodelingprep.com/market-indexes-major-markets The additional features are Name, Symbol (for trading), Rating Agency Name, Date and Sector.

    The dataset is unbalanced; here is the frequency of ratings: AAA: 7, AA: 89, A: 398, BBB: 671, BB: 490, B: 302, CCC: 64, CC: 5, C: 2, D: 1.

    Acknowledgements

    This dataset was possible thanks to financialmodelingprep and opendatasoft - the sources of the data. To see how the data was integrated and reshaped check here.

    Inspiration

    Is it possible to forecast the rating an agency will give to a company based on its financials?
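    One way to explore that question is a simple multiclass baseline on the numeric ratio columns. The file name and the Rating column name below are assumptions based on the description.

    ```python
    # Baseline sketch: predict the agency rating from the financial-indicator columns.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    df = pd.read_csv("corporate_rating.csv")        # assumed file name
    y = df["Rating"]                                # assumed target column
    X = df.select_dtypes("number")                  # keep the numeric financial ratios

    clf = RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=0)
    scores = cross_val_score(clf, X, y, cv=5, scoring="balanced_accuracy")
    print("Balanced accuracy per fold:", scores.round(3))
    ```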

    --- Original source retains full ownership of the source dataset ---

  7. Bank Rankings by Total Assets

    • kaggle.com
    Updated Dec 6, 2022
    Cite
    The Devastator (2022). Bank Rankings by Total Assets [Dataset]. https://www.kaggle.com/datasets/thedevastator/global-banking-rankings-by-total-assets-2017-12
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 6, 2022
    Dataset provided by
    Kaggle
    Authors
    The Devastator
    Description

    Bank Rankings by Total Assets

    Tracking the Financial Performance of the Top Banks

    By Arthur Keen [source]

    About this dataset

    This dataset contains the top 100 global banks ranked by total assets on December 31, 2017. With a detailed list of key information for each bank's rank, country, balance sheet and US total assets (in billions), this data will be invaluable for those looking to research and study the current status of some of the world's leading financial organizations. From billion-dollar mega-banks such as JPMorgan Chase to small, local savings & loan institutions like BancorpSouth, this comprehensive overview allows researchers and analysts to gain a better understanding of who holds power in the world economy today.

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    This dataset contains the rank and total asset information of the top 100 global banks as of December 31, 2017. It is a useful resource for researchers who wish to study how key financial institutions' assets relate to each other across countries.

    Using this dataset is relatively straightforward – it consists of three columns: rank (the order in which each bank appears in the list), country (the country in which the bank is located) and total assets in US$ billions. Additionally, there is a fourth column containing balance sheet information for each bank.

    In order to make full use of this dataset, one should analyse it by creating comparison grids based on different factors such as region, size or ownership structures. This can provide an interesting insight into how financial markets are structured within different economies and allow researchers to better understand some banking sector dynamics that are particularly relevant for certain countries or regions. Additionally, one can compare any two banks side-by-side using their respective balance sheets or distribution plot graphs based on size or concentration metrics by leverage or other financial ratios as well.

    Overall, this dataset provides a useful resource that can be put into practice through data visualization, making an interesting reference point for trend analysis and forecasting focused on certain banking activities worldwide.
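    For example, a first comparison grid by country can be built with a simple aggregation; the file and column names follow the Columns section below but should be checked against the actual CSV.

    ```python
    # Aggregate total assets by country for the top-100 list (assumed file/column names).
    import pandas as pd

    banks = pd.read_csv("top100banks2017-12-31.csv")
    by_country = (banks.groupby("country")["total_assets_us_b"]
                       .agg(["count", "sum"])
                       .sort_values("sum", ascending=False))
    print(by_country.head(10))
    ```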

    Research Ideas

    • Analyzing the differences in total assets across countries. By comparing and contrasting data, patterns could be found that give insight into the factors driving differences in banks’ assets between different markets.

    • Using predictive models to identify which banks are more likely to perform better based on their balance sheet data, such as by predicting future profits or cashflows of said banks.

    • Leveraging the information on holdings and investments of “top-ranked” banks as a guide for personal investment decisions, or to inform the investment strategies of large financial institutions or hedge funds.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: Dataset copyright by authors.
    • You are free to:
      • Share - copy and redistribute the material in any medium or format for any purpose, even commercially.
      • Adapt - remix, transform, and build upon the material for any purpose, even commercially.
    • You must:
      • Give appropriate credit - provide a link to the license, and indicate if changes were made.
      • ShareAlike - distribute your contributions under the same license as the original.
      • Keep intact - all notices that refer to this license, including copyright notices.

    Columns

    File: top50banks2017-03-31.csv

    | Column name | Description |
    |:------------------|:-------------------------------------------------------------------------|
    | rank | The rank of the bank globally based on total assets. (Integer) |
    | country | The country where the bank is located. (String) |
    | total_assets_us_b | The total assets of a bank expressed in billions of US dollars. (Float) |
    | balance_sheet | A snapshot of banking activities for a specific date. (Date) |

    File: top100banks2017-12-31.csv | Column name | Description | |:----------------------|:--------------------------------------------...

  8. Loan Approval Dataset

    • kaggle.com
    Updated Dec 9, 2024
    Cite
    Zeyad Mohamad Ezzat (2024). Loan Approval Dataset [Dataset]. https://www.kaggle.com/datasets/zeyadmohamadezzat/loan-approval-dataset
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 9, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Zeyad Mohamad Ezzat
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset provides information about people applying for loans, including details on their personal background, finances, and loan specifics. It's meant to help us better understand how different personal factors impact whether a loan gets approved. The data includes things like the applicant's age, income, home ownership status, job history, and credit score, along with loan details such as the loan amount, interest rate, and purpose. It also shows whether the loan was approved or denied.

    Features in the dataset:

    1. person_age: The applicant's age.
    2. person_income: How much the applicant earns annually, in USD.
    3. person_home_ownership: Whether the person owns, rents, or has a mortgage on their home.
    4. person_emp_length: How long the person has been employed.
    5. loan_intent: The reason for applying for the loan (e.g., EDUCATION, MEDICAL, VENTURE, DEBT CONSOLIDATION, PERSONAL).
    6. loan_grade: The credit grade given to the loan (A, B, C, D, etc.).
    7. loan_amnt: The amount of money the applicant is requesting for the loan, in USD.
    8. loan_int_rate: The interest rate on the loan.
    9. loan_percent_income: The percentage of the applicant’s income being requested as a loan.
    10. cb_person_default_on_file: Whether the applicant has defaulted on any previous loans (Y/N).
    11. cb_person_cred_hist_length: The length of the applicant's credit history, in years.
    12. loan_status: The outcome of the loan application (0 = Rejected, 1 = Accepted); see the modelling sketch below.
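    As a minimal modelling sketch over these columns (the CSV file name is an assumption), one could cross-validate a logistic regression with basic preprocessing:

    ```python
    # Baseline sketch for predicting loan_status from the features listed above.
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    df = pd.read_csv("loan_approval.csv")            # assumed file name
    y = df["loan_status"]
    X = df.drop(columns=["loan_status"])

    categorical = ["person_home_ownership", "loan_intent", "loan_grade", "cb_person_default_on_file"]
    numeric = [c for c in X.columns if c not in categorical]

    pre = ColumnTransformer([
        ("num", make_pipeline(SimpleImputer(strategy="median"), StandardScaler()), numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ])
    model = make_pipeline(pre, LogisticRegression(max_iter=1000))
    print("Cross-validated ROC AUC:", cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean())
    ```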
  9. ‘Amazon Product Reviews Dataset’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Feb 13, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com), ‘Amazon Product Reviews Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-amazon-product-reviews-dataset-7933/latest
    Explore at:
    Dataset updated
    Feb 13, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Amazon Product Reviews Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/amazon-product-reviews-datasete on 13 February 2022.

    --- Dataset description provided by original source is as follows ---

    About this dataset

    This dataset contains 30K records of product reviews from amazon.com.

    This dataset was created by PromptCloud and DataStock

    Content

    This dataset contains the following:

    • Total Records Count: 43729

    • Domain Name: amazon.com

    • Date Range: 01st Jan 2020 - 31st Mar 2020

    • File Extension: CSV

    • Available Fields:
      -- Uniq Id,
      -- Crawl Timestamp,
      -- Billing Uniq Id,
      -- Rating,
      -- Review Title,
      -- Review Rating,
      -- Review Date,
      -- User Id,
      -- Brand,
      -- Category,
      -- Sub Category,
      -- Product Description,
      -- Asin,
      -- Url,
      -- Review Content,
      -- Verified Purchase,
      -- Helpful Review Count,
      -- Manufacturer Response

    Acknowledgements

    We wouldn't be here without the help of our in-house teams at PromptCloud and DataStock, who have put their heart and soul into this project, as they do with every project. We want to provide the best quality data, and we will continue to do so.

    Inspiration

    The inspiration for these datasets came from research. Reviews are important to everybody across the globe, so we decided to come up with this dataset, which shows exactly how user reviews help companies improve their products.

    This dataset was created by PromptCloud and contains around 0 samples along with Billing Uniq Id, Verified Purchase, technical information and other features such as: - Crawl Timestamp - Manufacturer Response - and more.

    How to use this dataset

    • Analyze Helpful Review Count in relation to Sub Category
    • Study the influence of Review Date on Product Description
    • More datasets

    Acknowledgements

    If you use this dataset in your research, please credit PromptCloud


    --- Original source retains full ownership of the source dataset ---

  10. Lending Club loan dataset for granting models

    • produccioncientifica.ucm.es
    • portalcientifico.uah.es
    Updated 2024
    Cite
    Ariza-Garzón, Miller Janny; Sanz-Guerrero, Mario; Arroyo Gallardo, Javier; Lending Club (2024). Lending Club loan dataset for granting models [Dataset]. https://produccioncientifica.ucm.es/documentos/668fc499b9e7c03b01be2366?lang=ca
    Explore at:
    Dataset updated
    2024
    Authors
    Ariza-Garzón, Miller Janny; Sanz-Guerrero, Mario; Arroyo Gallardo, Javier; Lending Club
    Description

    Lending Club offers peer-to-peer (P2P) loans through a technological platform for various personal finance purposes and is today one of the companies that dominate the US P2P lending market. The original dataset is publicly available on Kaggle and corresponds to all the loans issued by Lending Club between 2007 and 2018. The present version of the dataset is for constructing a granting model, that is, a model designed to decide whether to grant a loan based on information available at the time of the loan application. Consequently, our dataset only has a selection of variables from the original one: the variables known at the moment the loan request is made. Furthermore, the target variable of a granting model represents the final status of the loan, which is either "default" or "fully paid". Thus, we filtered out from the original dataset all the loans in transitory states. Our dataset comprises 1,347,681 records or obligations (approximately 60% of the original), and it was also cleaned for completeness and consistency (less than 1% of our dataset was filtered out).

    TARGET VARIABLE

    The dataset includes a target variable based on the final resolution of the credit: the default category corresponds to the event charged off and the non-default category to the event fully paid. It does not consider other values in the loan status variable since this variable represents the state of the loan at the end of the considered time window. Thus, there are no loans in transitory states. The original dataset includes the target variable “loan status”, which contains several categories ('Fully Paid', 'Current', 'Charged Off', 'In Grace Period', 'Late (31-120 days)', 'Late (16-30 days)', 'Default'). However, in our dataset, we just consider loans that are either “Fully Paid” or “Default” and transform this variable into a binary variable called “Default”, with a 0 for fully paid loans and a 1 for defaulted loans.
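    A hedged sketch of this target construction, applied to the raw loan_status column (the raw-file name is an assumption):

    ```python
    # Build the binary "Default" target from loan_status, as described above (assumed file name).
    import pandas as pd

    loans = pd.read_csv("lending_club_raw.csv", low_memory=False)

    # Keep only resolved loans and binarise: 1 = charged off (default), 0 = fully paid.
    resolved = loans[loans["loan_status"].isin(["Fully Paid", "Charged Off"])].copy()
    resolved["Default"] = (resolved["loan_status"] == "Charged Off").astype(int)
    print(resolved["Default"].value_counts(normalize=True))
    ```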

    EXPLANATORY VARIABLES

    The explanatory variables that we use correspond only to the information available at the time of the application. Variables such as the interest rate, grade, or subgrade are generated by the company as a result of a credit risk assessment process, so they were filtered out from the dataset as they must not be considered in risk models to predict the default in granting of credit.

    FULL LIST OF VARIABLES

    Loan identification variables:

    id: Loan id (unique identifier).

    issue_d: Month and year in which the loan was approved.

    Quantitative variables:

    revenue: Borrower's self-declared annual income during registration.

    dti_n: Indebtedness ratio for obligations excluding mortgage. Monthly information. This ratio has been calculated considering the indebtedness of the whole group of applicants. It is estimated as the ratio calculated using the co-borrowers’ total payments on the total debt obligations divided by the co-borrowers’ combined monthly income.

    loan_amnt: Amount of credit requested by the borrower.

    fico_n: Defined between 300 and 850, reported by Fair Isaac Corporation as a risk measure based on historical credit information reported at the time of application. This value has been calculated as the average of the variables “fico_range_low” and “fico_range_high” in the original dataset.

    experience_c: Binary variable that indicates whether the borrower is new to the entity. This variable is constructed from the credit date of the previous obligation in LC and the credit date of the current obligation; if the difference between dates is positive, it is not considered as a new experience with LC.

    Categorical variables:

    emp_length: Categorical variable with the employment length of the borrower (includes the no information category)

    purpose: Credit purpose category for the loan request.

    home_ownership_n: Homeownership status provided by the borrower in the registration process. Categories defined by LC: “mortgage”, “rent”, “own”, “other”, “any”, “none”. We merged the categories “other”, “any” and “none” as “other”.

    addr_state: Borrower's residence state from the USA.

    zip_code: Zip code of the borrower's residence.

    Textual variables

    title: Title of the credit request description provided by the borrower.

    desc: Description of the credit request provided by the borrower.

    We cleaned the textual variables. First, we removed all those descriptions that contained the default description provided by Lending Club on its web form (“Tell your story. What is your loan for?”). Moreover, we removed the prefix “Borrower added on DD/MM/YYYY >” from the descriptions to avoid any temporal background in them. Finally, as these descriptions came from a web form, we substituted all HTML entities with their corresponding characters (e.g. “&amp;” was substituted by “&”, “&lt;” by “<”, etc.).
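    A sketch of these cleaning steps on the desc column (the file name is an assumption; the regex mirrors the prefix format described above):

    ```python
    # Clean the loan descriptions: drop the web-form default text, strip the
    # "Borrower added on DD/MM/YYYY >" prefix, and unescape HTML entities.
    import html
    import pandas as pd

    loans = pd.read_csv("lending_club_raw.csv", low_memory=False)   # assumed file name
    desc = loans["desc"].fillna("")

    default_text = "Tell your story. What is your loan for?"
    desc = desc[~desc.str.contains(default_text, regex=False)]

    desc = desc.str.replace(r"Borrower added on \d{2}/\d{2}/\d{4} >", "", regex=True)
    desc = desc.map(html.unescape).str.strip()                      # e.g. "&amp;" -> "&"
    print(desc.head())
    ```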

    RELATED WORKS

    This dataset has been used in the following academic articles:

    Sanz-Guerrero, M. Arroyo, J. (2024). Credit Risk Meets Large Language Models: Building a Risk Indicator from Loan Descriptions in P2P Lending. arXiv preprint arXiv:2401.16458. https://doi.org/10.48550/arXiv.2401.16458

    Ariza-Garzón, M.J., Arroyo, J., Caparrini, A., Segovia-Vargas, M.J. (2020). Explainability of a machine learning granting scoring model in peer-to-peer lending. IEEE Access 8, 64873 - 64890. https://doi.org/10.1109/ACCESS.2020.2984412

  11. Different Store Sales

    • kaggle.com
    Updated Jan 16, 2024
    Cite
    KZ Data Lover (2024). Different Store Sales [Dataset]. https://www.kaggle.com/datasets/kzmontage/sales-from-different-stores/suggestions
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 16, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    KZ Data Lover
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    This is a fictional time-series dataset created to help data analysts practice exploratory data analysis and data visualization. The dataset has data on orders placed by customers on a grocery delivery application.

    The dataset is designed with the assumption that the orders are placed by customers living in states of the United States.

    Columns Summary

    invoice_no: Invoice number associated with each transaction.

    customer_id: Identifier for each customer.

    gender: Gender of the customer (assumed to be binary: male/female).

    age: Age of the customer.

    category: Product category associated with the transaction.

    quantity: Quantity of products purchased in each transaction.

    selling_price_per_unit: Selling price per unit of the product.

    cost_price_per_unit: Cost price per unit of the product.

    payment_method: Method used for payment in the transaction.

    region: Geographic region associated with the transaction.

    state: State where the transaction took place.

    shopping_mall: Shopping mall where the transaction occurred.
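    A small EDA sketch using these columns, for example revenue and profit per category (the CSV file name is an assumption):

    ```python
    # Compute revenue and profit per transaction and aggregate by category (assumed file name).
    import pandas as pd

    orders = pd.read_csv("store_sales.csv")
    orders["revenue"] = orders["quantity"] * orders["selling_price_per_unit"]
    orders["profit"] = orders["quantity"] * (orders["selling_price_per_unit"]
                                             - orders["cost_price_per_unit"])
    summary = orders.groupby("category")[["revenue", "profit"]].sum().sort_values("profit", ascending=False)
    print(summary)
    ```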

    Please do not reproduce this dataset without giving me credit. If you like this dataset, please consider upvoting.

    Thanks!

  12. ‘🗳 Pollster Ratings’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Aug 4, 2020
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2020). ‘🗳 Pollster Ratings’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-pollster-ratings-3cf7/4aa1e9a4/?iid=017-557&v=presentation
    Explore at:
    Dataset updated
    Aug 4, 2020
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘🗳 Pollster Ratings’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/pollster-ratingse on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    About this dataset

    This dataset contains the data behind FiveThirtyEight's pollster ratings.

    pollster-stats-full contains a spreadsheet with all of the summary data and calculations involved in determining the pollster ratings as well as descriptions for each column.
    pollster-ratings has ratings and calculations for each pollster. A copy of this data and descriptions for each column can also be found in pollster-stats-full.
    raw-polls contains all of the polls analyzed to give each pollster a grade

    Source: https://github.com/fivethirtyeight/data

    License: The data is available under the Creative Commons Attribution 4.0 International License. If you find it useful, please let us know.

    Updated: Pollster-ratings and raw-polls synced from source weekly.

    This dataset was created by FiveThirtyEight and contains around 10000 samples along with Cand2 Id, Pollster, technical information and other features such as: - Samplesize - Partisan - and more.

    How to use this dataset

    • Analyze Cand2 Party in relation to Race Id
    • Study the influence of Margin Poll on Cand1 Actual
    • More datasets

    Acknowledgements

    If you use this dataset in your research, please credit FiveThirtyEight


    --- Original source retains full ownership of the source dataset ---

  13. ‘National Adult Tobacco Survey - United States’ analyzed by Analyst-2

    • analyst-2.ai
    + more versions
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com), ‘National Adult Tobacco Survey - United States’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-national-adult-tobacco-survey-united-states-8007/f0c2e523/?iid=016-993&v=presentation
    Explore at:
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United States
    Description

    Analysis of ‘National Adult Tobacco Survey - United States’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/national-adult-tobacco-survey-natse on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    About this dataset

    2013-2014. The National Adult Tobacco Survey (NATS) was created to assess the prevalence of tobacco use, as well as the factors promoting and impeding tobacco use among adults. NATS also establishes a comprehensive framework for evaluating both the national and state-specific tobacco control programs. NATS was designed as a stratified, national, landline, and cell phone survey of non-institutionalized adults aged 18 years and older residing in the 50 states or D.C. It was developed to yield data representative and comparable at both national and state levels. The sample design also aims to provide national estimates for subgroups defined by gender, age, and race/ethnicity.

    Source: https://catalog.data.gov/dataset/national-adult-tobacco-survey-nats
    Last updated at https://catalog.data.gov/organization/hhs-gov : 2021-04-25

    This dataset was created by US Open Data Portal, data.gov and contains around 100 samples along with Locationdesc, Age, technical information and other features such as: - Topictypeid - Datasource - and more.

    How to use this dataset

    • Analyze Stratificationid4 in relation to Race
    • Study the influence of Stratificationid1 on Low Confidence Limit
    • More datasets

    Acknowledgements

    If you use this dataset in your research, please credit US Open Data Portal, data.gov

    --- Original source retains full ownership of the source dataset ---

  14. Facial Key Point Detection Dataset

    • kaggle.com
    Updated Dec 25, 2020
    Cite
    Prashant Arora (2020). Facial Key Point Detection Dataset [Dataset]. https://www.kaggle.com/prashantarorat/facial-key-point-data/code
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 25, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Prashant Arora
    Description

    About

    A few days ago I was thinking of starting a new project but couldn't find one that looked exciting to me. Then I found out about facial landmarks and started looking for datasets for the task. There were many datasets, but the Flickr dataset turned out to be the best of them, with 70,000 images annotated with 68 landmark coefficients. As the size suggests, the data was also big, around 900 GB, so I decided to build a smaller version of it so that it is at least possible to work on such a task. That is why I created this dataset.

    The objective of creating this dataset is to predict keypoint positions on face images. This can be used as a building block in several applications, such as:

    1. tracking faces in images and video
    2. analysing facial expressions
    3. detecting dysmorphic facial signs for medical diagnosis
    4. biometrics / face recognition

    Detecting facial keypoints is a very challenging problem. Facial features vary greatly from one individual to another, and even for a single individual there is a large amount of variation due to 3D pose, size, position, viewing angle, and illumination conditions. Computer vision research has come a long way in addressing these difficulties, but there remain many opportunities for improvement.

    Some sample images are shown on the dataset page.

    Actual Dataset can be seen at https://github.com/NVlabs/ffhq-dataset

    Content

    This dataset contains 6000 records in two files (see the parsing sketch below):

    1. A JSON file with records of the form {'face_landmarks': [[191.5, 617.5], [210.5, 717.5], ...], 'file_name': '00000.png'}

    2. An images folder containing the cropped images, with the same names as provided in the JSON file.
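    A small parsing sketch for the structure above; the metadata file name and the assumption that it holds a list of such records should be checked against the actual download.

    ```python
    # Read the landmark metadata and pair each record with its cropped image (assumed file names).
    import json
    from pathlib import Path

    with open("face_landmarks.json") as f:       # assumed metadata file name
        records = json.load(f)                   # assumed to be a list of {'face_landmarks': ..., 'file_name': ...}

    images_dir = Path("images")
    for rec in records[:3]:
        landmarks = rec["face_landmarks"]        # list of [x, y] keypoint coordinates
        image_path = images_dir / rec["file_name"]
        print(image_path, "->", len(landmarks), "keypoints")
    ```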

    Licenses

    The individual images were published in Flickr by their respective authors under either Creative Commons BY 2.0, Creative Commons BY-NC 2.0, Public Domain Mark 1.0, Public Domain CC0 1.0, or U.S. Government Works license. All of these licenses allow free use, redistribution, and adaptation for non-commercial purposes. However, some of them require giving appropriate credit to the original author, as well as indicating any changes that were made to the images. The license and original author of each image are indicated in the metadata.

    https://creativecommons.org/licenses/by/2.0/
    https://creativecommons.org/licenses/by-nc/2.0/
    https://creativecommons.org/publicdomain/mark/1.0/
    https://creativecommons.org/publicdomain/zero/1.0/
    http://www.usa.gov/copyright.shtml

    The dataset itself (including JSON metadata, download script, and documentation) is made available under the Creative Commons BY-NC-SA 4.0 license by NVIDIA Corporation. You can use, redistribute, and adapt it for non-commercial purposes, as long as you (a) give appropriate credit by citing our paper, (b) indicate any changes that you've made, and (c) distribute any derivative works under the same license.

    https://creativecommons.org/licenses/by-nc-sa/4.0/

    News Regarding Updates

    It takes a lot of time and resources to generate this dataset in one run, so I need to run it multiple times, generating different subsets; hence it takes a long time to complete.

    Date: 19/12/2020 - currently it has 6000 images and respective metadata.
    Date: 19/12/2020 - currently it has 10000 images and respective metadata.
    Date: 23/12/2020 - updated; it correctly has 5000 images and respective metadata.

  15. Short Jokes Dataset

    • kaggle.com
    Updated Dec 5, 2023
    Cite
    The Devastator (2023). Short Jokes Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/short-jokes-dataset
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 5, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Short Jokes Dataset

    Humorous Short Jokes

    By Fraser Greenlee (From Huggingface) [source]

    About this dataset

    This dataset offers a valuable resource for various applications such as natural language processing, sentiment analysis, joke generation algorithms, or simply for entertainment purposes. Whether you're a data scientist looking to analyze humor patterns or an individual seeking some quick comedic relief, this dataset has got you covered.

    By utilizing this dataset, researchers can explore different aspects of humor and study the linguistic features that make these short jokes amusing. Moreover, it provides an opportunity for developing computer models capable of generating similar humorous content based on learned patterns.

    How to use the dataset

    • Understanding the Columns:

      • text: This column contains the text of the short joke.
      • text: No further information is provided about this column.
    • Exploring the Jokes:

      • Start by exploring the text column, which contains the actual jokes. You can read through them and have a good laugh!
    • Analyzing the Jokes:

      • To gain insights from this dataset, you can perform various analyses:
        • Sentiment Analysis: Use Natural Language Processing techniques to analyze the sentiment of each joke.
        • Categorization: Group jokes based on common themes or subjects, such as animals, professions, etc.
        • Length Distribution: Analyze and visualize the distribution of joke lengths.
    • Creating New Content or Applications: Since this dataset provides a large collection of short jokes, you can utilize it creatively:

      • Generating Random Jokes: Develop an algorithm that generates new jokes based on patterns found in this dataset.
      • Humor Classification: Build a model that predicts if a given piece of text is funny or not using machine learning techniques.
    • Sharing Your Findings: If you make interesting discoveries or create unique applications using this dataset, consider sharing them with others in the Kaggle community.

    Please note that no information regarding dates is available in train.csv; therefore, any temporal analysis or date-based insights won't be feasible with this specific file.
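    As a quick example of the length-distribution analysis suggested above (only the text column is assumed to exist in train.csv):

    ```python
    # Joke-length distribution over the text column of train.csv.
    import pandas as pd

    jokes = pd.read_csv("train.csv")
    lengths = jokes["text"].str.len()
    print(lengths.describe())
    print(lengths.quantile([0.5, 0.9, 0.99]))
    ```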

    Research Ideas

    • Analyzing humor patterns: This dataset can be used to analyze different types of humor and identify patterns or common elements in jokes that make them funny. Researchers and linguists can use this dataset to gain insights into the structure, wordplay, or comedic techniques used in short jokes.
    • Natural language processing: With the text data available in this dataset, it can be used for training models in natural language processing (NLP) tasks such as sentiment analysis, joke generation, or understanding humor from written text. NLP researchers and developers can utilize this dataset to build and improve algorithms for detecting or generating funny content.
    • Social media analysis: Short jokes are popular on social media platforms like Twitter or Reddit, where users frequently share humorous content. This dataset can be valuable for analyzing the reception and impact of these jokes on social media platforms. By examining trends, engagement metrics, or user reactions to specific jokes from the dataset, marketers or social media analysts can gain insights into what type of humor resonates with different online communities. Overall, this dataset provides a rich resource for exploring various aspects of humor analysis and NLP tasks, while offering opportunities for sociocultural studies of online comedy culture.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. No copyright: you can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv

    | Column name | Description |
    |:------------|:-----------------------------------------------|
    | text | The actual content of the short jokes. (Text) |

    Acknowledgements

    If you use this dataset in your research, please credit Fraser Greenlee (from Huggingface).

  16. Industrial Energy End Use in the U.S

    • kaggle.com
    Updated Dec 14, 2022
    Cite
    The Devastator (2022). Industrial Energy End Use in the U.S [Dataset]. https://www.kaggle.com/datasets/thedevastator/unlocking-industrial-energy-end-use-in-the-u-s
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 14, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Industrial Energy End Use in the U.S

    Facility-Level Combustion Energy Data

    By US Open Data Portal, data.gov [source]

    About this dataset

    This dataset contains in-depth facility-level information on industrial combustion energy use in the United States. It provides an essential resource for understanding consumption patterns across different sectors and industries, as reported by large emitters (>25,000 metric tons CO2e per year) under the U.S. EPA's Greenhouse Gas Reporting Program (GHGRP). Our records have been calculated using EPA default emissions factors and contain data on fuel type, location (latitude, longitude), combustion unit type and energy end use classified by manufacturing NAICS code. Additionally, our dataset reveals valuable insight into the thermal spectrum of low-temperature energy use from the 2010 Energy Information Administration Manufacturing Energy Consumption Survey (MECS). This information is critical to assessing industrial trends in energy consumption across manufacturing sectors and can serve as an informative baseline for efficiency or renewable-alternative plans of operation at these facilities. With this dataset you're just a few clicks away from analyzing research questions related to consumption levels across industries, waste issues associated with unconstrained fossil fuel burning practices, and their environmental impacts.

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    This dataset provides detailed information on industrial combustion energy end use in the United States. Knowing how certain industries use fuel can be valuable for those interested in reducing energy consumption and its associated environmental impacts.

    • To make the most of this dataset, users should first become familiar with what's included by looking at the columns and their respective definitions. After becoming familiar with the data, users should start to explore areas of interest such as Fuel Type, Report Year, Primary NAICS Code, Emissions Indicators, etc. The more granular and specific the details you focus on, the stronger the analysis you can build and the better the conclusions you can draw from your data set.

    • Next steps could include filtering your data set down by region or end user type (such as direct related processes or indirect support activities). Segmenting your data set further can allow you to identify trends between fuel type used in different regions or compare emissions indicators between different processes within manufacturing industries etc. By taking a closer look through this lens you may be able to find valuable insights that can help inform better decision making when it comes to reducing energy consumption throughout industry in both public and private sectors alike.

    • If you are less interested in specific industry trends and more in general patterns among large emitters across regions, it may be beneficial to group similar data together and take averages over larger samples that better represent total production across an area or multiple states (the timeline varies depending on needs). This approach opens up possibilities for exploring correlations between economic productivity metrics and industrial energy use over time, which could lead to more formal investigations of where efforts toward improved resource-efficiency standards are being made in certain industries or areas of production compared with less efficient sectors or regions; all from what's already present here!

    By leveraging the information in this dataset, users have many opportunities to find interesting yet practical insights whose impact goes far beyond understanding just another statistic; so happy digging!
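    For instance, a grouping along the lines described above might total reported energy by report year and fuel type; the file and column names below are assumptions and should be checked against the downloaded CSV.

    ```python
    # Group facility-level combustion energy by report year and fuel type (assumed file/column names).
    import pandas as pd

    energy = pd.read_csv("industrial_combustion_energy.csv")
    summary = (energy.groupby(["REPORT_YEAR", "FUEL_TYPE"])["ENERGY_MMBtu"]
                     .sum()
                     .sort_values(ascending=False))
    print(summary.head(15))
    ```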

    Research Ideas

    • Analyzing the trends in combustion energy uses by region across different industries.
    • Predicting the potential of transitioning to clean and renewable sources of energy considering the current end-uses and their magnitude based on this data.
    • Creating an interactive web map application to visualize multiple industrial sites, including their energy sources and emissions data from this dataset combined with other sources (EPA’s GHGRP, MECS survey, etc)

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    **License: [CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication](https://creativecommons...

  17. The Three Hair Types

    • kaggle.com
    Updated Sep 17, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    vyom bhatia (2020). The Three Hair Types [Dataset]. https://www.kaggle.com/vyombhatia/the-three-hair-types/metadata
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 17, 2020
    Dataset provided by
    Kaggle
    Authors
    vyom bhatia
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Okay, so one random day I felt like making a web app / image classifier and putting it up in my Instagram bio for people to play with. It classified hair, and it helped me learn a lot about how training CNNs for real-world applications works.

    Content

    Below are about a thousand images that represent the three most common hair types in the world. Each hair type has 300+ images.

    Acknowledgements

    I scraped all these images from Google Images using a Chrome extension and sorted them out, image by image. I feel bad because I cannot give credit to the owners, and data ethics is something I have to improve on as a person.

    Inspiration

    Fellow data practitioner, the question I put in front of you today is: In what creative ways can you play with this beginner's boring data?

  18. Miss America Titleholders

    • kaggle.com
    Updated Nov 17, 2022
    Cite
    The Devastator (2022). Miss America Titleholders [Dataset]. https://www.kaggle.com/datasets/thedevastator/miss-america-titleholders-a-comprehensive-datase
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 17, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Miss America Titleholders

    Miss America over the years

    About this dataset

    Every year, young women from across the United States compete for the title of Miss America. The competition is open to women between the ages of 17 and 25, and includes a talent portion, an interview, and a swimsuit competition (which was removed in 2018). The winner is crowned by the previous year's titleholder and goes on to tour the nation for about 20,000 miles a month, promoting her particular platform of interest.

    The Miss America dataset contains information on all Miss America titleholders from 1921 to 2022. It includes columns for the year of the pageant, the name of the crowned winner, her state or district represented, awards won, talent performed, and notes about her win.

    How to use the dataset

    This dataset contains information on Miss America titleholders from 1921 to 2022. The data includes the name of the winner, her state or district, the city she represented, her talent, and the year she won

    Research Ideas

    • Miss America could be used to study changes in American culture over time. For example, the decline in the swimsuit competition could be seen as a sign of increasing body positivity in the US.
    • The dataset could be used to study the effect of winning Miss America has on a woman's career. Does winning lead to more opportunities?
    • The dataset could be used to study geographical patterns in Miss America winners. For example, are there any states that have produced more winners than others?

    Acknowledgements

    License

    License: Attribution-ShareAlike 4.0 International (CC BY-SA 4.0). You are free to:
    • Share: copy and redistribute the material in any medium or format for any purpose, even commercially.
    • Adapt: remix, transform, and build upon the material for any purpose, even commercially.
    You must:
    • Give appropriate credit: provide a link to the license, and indicate if changes were made.
    • ShareAlike: distribute your contributions under the same license as the original.

    Columns

    File: miss_america_titleholders.csv

    | Column name       | Description                                                             |
    |:------------------|:------------------------------------------------------------------------|
    | year              | The year the Miss America pageant was held. (Integer)                   |
    | crowned           | The name of the Miss America titleholder. (String)                      |
    | winner            | The name of the Miss America winner. (String)                           |
    | state_or_district | The state or district represented by the Miss America winner. (String)  |
    | city              | The city represented by the Miss America winner. (String)               |
    | awards            | The awards won by the Miss America winner. (String)                     |
    | talent            | The talent performed by the Miss America winner. (String)               |
    | notes             | Notes about the Miss America winner. (String)                           |

    File: eurovision_winners.csv

    | Column name | Description                                                               |
    |:------------|:--------------------------------------------------------------------------|
    | Year        | The year the pageant was held. (Integer)                                  |
    | Date        | The date the pageant was held. (Date)                                     |
    | Host City   | The city where the pageant was held. (String)                             |
    | Winner      | The name of the pageant winner. (String)                                  |
    | Song        | The song performed by the pageant winner. (String)                        |
    | Performer   | The name of the performer of the pageant winner's song. (String)          |
    | Points      | The number of points the pageant winner received. (Integer)               |
    | Margin      | The margin of points between the pageant winner and runner-up. (Integer)  |
    | Runner-up   | The name of the pageant runner-up. (String)                               |
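
    As a quick way to explore the geographical-pattern question above, here is a minimal pandas sketch, assuming the miss_america_titleholders.csv file and the column names documented in the table.

    ```python
    # Minimal sketch: winners per state/district and titleholders per decade.
    # Assumes the file name and columns from the table above.
    import pandas as pd

    df = pd.read_csv("miss_america_titleholders.csv")

    # Which states or districts have produced the most winners?
    by_state = df["state_or_district"].value_counts()
    print(by_state.head(10))

    # Titleholders per decade, for looking at changes over time.
    df["decade"] = (df["year"] // 10) * 10
    print(df.groupby("decade")["winner"].count())
    ```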

  19. Shark Tank US Dataset (1274, 48)

    • kaggle.com
    Updated Jul 1, 2024
    Cite
    Suvradeep (2024). Shark Tank US Dataset (1274, 48) [Dataset]. https://www.kaggle.com/datasets/suvroo/shark-tank-us-dataset-1274-48
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 1, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Suvradeep
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Shark Tank US Dataset Description

    This dataset provides comprehensive information about the American business reality television series, Shark Tank, covering seasons 1 to 14. The dataset includes 50 fields/columns and over 1260 records, capturing various details about each episode, pitch, and deal made on the show. Below is a detailed description of the columns included in the dataset:

    Columns and Descriptions

    • Season Number: The number of the season.
    • Episode Number: The episode number within the season.
    • Pitch Number: The overall pitch number.
    • Original Air Date: The original or first aired date of the episode.
    • Startup Name: The name of the startup company.
    • Industry: The industry name or type.
    • Business Description: A brief description of the business.
    • Pitchers Gender: The gender of the pitchers.
    • Pitchers City: The US city where the pitchers are from.
    • Pitchers State: The US state or country of the pitchers, represented by a two-letter shortcut.
    • Pitchers Average Age: The average age of all pitchers, categorized as <30 (young), 30-50 (middle), or >50 (old).
    • Entrepreneur Names: The names of the pitchers.
    • Company Website: The website of the startup or company.
    • Multiple Entrepreneurs: Indicates whether there are multiple entrepreneurs (1 for yes, 0 for no).
    • US Viewership: The viewership in the US, TRP rating, in millions.
    • Original Ask Amount: The original ask amount in USD.
    • Original Offered Equity: The original offered equity in percentages.
    • Valuation Requested: The valuation requested in USD.
    • Got Deal: Indicates whether the deal was secured (1 for yes, 0 for no).
    • Total Deal Amount: The total deal amount in USD.
    • Total Deal Equity: The total deal equity in percentages.
    • Deal Valuation: The deal valuation in USD.
    • Number of sharks in deal: The number of sharks involved in the deal.
    • Investment Amount Per Shark: The investment amount per shark.
    • Equity Per Shark: The equity received by each shark.
    • Royalty Deal: Indicates whether it is a royalty deal or a deal with advisory shares.
    • Loan: The loan or debt (line of credit) amount given by sharks, in USD.
    • Barbara Corcoran Investment Amount: The amount invested by Barbara Corcoran.
    • Barbara Corcoran Investment Equity: The equity received by Barbara Corcoran.
    • Mark Cuban Investment Amount: The amount invested by Mark Cuban.
    • Mark Cuban Investment Equity: The equity received by Mark Cuban.
    • Lori Greiner Investment Amount: The amount invested by Lori Greiner.
    • Lori Greiner Investment Equity: The equity received by Lori Greiner.
    • Robert Herjavec Investment Amount: The amount invested by Robert Herjavec.
    • Robert Herjavec Investment Equity: The equity received by Robert Herjavec.
    • Daymond John Investment Amount: The amount invested by Daymond John.
    • Daymond John Investment Equity: The equity received by Daymond John.
    • Kevin O'Leary Investment Amount: The amount invested by Kevin O'Leary.
    • Kevin O'Leary Investment Equity: The equity received by Kevin O'Leary.
    • Guest Investment Amount: The amount invested by guest sharks.
    • Guest Investment Equity: The equity received by guest sharks.
    • Guest Name: The name of the guest shark.
    • Barbara Corcoran Present: Indicates whether Barbara Corcoran is present in the episode.
    • Mark Cuban Present: Indicates whether Mark Cuban is present in the episode.
    • Lori Greiner Present: Indicates whether Lori Greiner is present in the episode.
    • Robert Herjavec Present: Indicates whether Robert Herjavec is present in the episode.
    • Daymond John Present: Indicates whether Daymond John is present in the episode.
    • Kevin O'Leary Present: Indicates whether Kevin O'Leary is present in the episode.

    This dataset provides a rich source of information for analyzing the trends, investments, and outcomes of pitches on Shark Tank.
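
    For readers who want to sanity-check the money columns, here is a minimal pandas sketch; the CSV file name is hypothetical, and the column names are taken from the list above.

    ```python
    # Minimal sketch: implied valuation and deal rate by industry.
    # "shark_tank_us.csv" is a placeholder file name; adjust to the actual export.
    import pandas as pd

    df = pd.read_csv("shark_tank_us.csv")

    # The requested valuation should roughly equal ask amount / offered equity fraction.
    implied = df["Original Ask Amount"] / (df["Original Offered Equity"] / 100)
    print((implied - df["Valuation Requested"]).abs().describe())

    # Share of pitches that secured a deal, by industry.
    deal_rate = df.groupby("Industry")["Got Deal"].mean().sort_values(ascending=False)
    print(deal_rate.head(10))
    ```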

  20. Data from: Menstrual Cycle Data

    • kaggle.com
    Updated Aug 17, 2021
    Cite
    Nikita Bisht (2021). Menstrual Cycle Data [Dataset]. https://www.kaggle.com/nikitabisht/menstrual-cycle-data/tasks
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 17, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Nikita Bisht
    Description

    Context

    Periods usually arrive every month in a woman's life, but we are all so busy with mundane work that we tend to forget our period dates. Moreover, many women have such inconsistent cycles that remembering previous dates alone is of little help. This dataset can be used to build apps and websites that predict menstrual days more reliably.

    Content

    The dataset has 80 columns; I downloaded it from https://epublications.marquette.edu/data_nfp/7/. I cleaned it extensively and then built a web app to predict a woman's menstrual cycle and ovulation days. Note, however, that the dataset uploaded here is the original, not my cleaned version.

    Acknowledgements

    Full credit for this dataset goes to the people who created it and published it at https://epublications.marquette.edu/data_nfp/7/. I have simply downloaded the dataset and uploaded it to Kaggle.

    Inspiration

    Before this, I was unable to find a dataset on the menstrual cycle. I searched many sites, including Kaggle, Google datasets, and arXiv, and went through a lot of research papers, but the dataset I was looking for was nowhere to be found. Then I stumbled on this publication, which made it possible for me to build an app that helps women learn more about their bodies. I am uploading it here so that you don't have to dig through thousands of datasets that may not be of much use to you.
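
    As an illustration of the prediction idea described above, here is a minimal sketch of a naive next-period estimate; the 80 real column names are not documented here, so "cycle_start_date" and the file name are hypothetical placeholders.

    ```python
    # Minimal sketch: predict the next cycle start from the mean cycle length.
    # Column and file names are hypothetical; adapt them to the actual dataset.
    import pandas as pd

    df = pd.read_csv("menstrual_cycle_data.csv", parse_dates=["cycle_start_date"])
    starts = df["cycle_start_date"].sort_values()

    # Days between consecutive cycle starts.
    cycle_lengths = starts.diff().dropna().dt.days
    mean_length = cycle_lengths.mean()

    next_start = starts.iloc[-1] + pd.Timedelta(days=round(mean_length))
    print(f"Mean cycle length: {mean_length:.1f} days")
    print(f"Predicted next cycle start: {next_start.date()}")
    ```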
