66 datasets found
  1. similar-loss

    • kaggle.com
    zip
    Updated Jan 20, 2022
    Cite
    KOOK HEEJIN (2022). similar-loss [Dataset]. https://www.kaggle.com/datasets/kookheejin/similarloss
    Explore at:
    Available download formats: zip (1798645210 bytes)
    Dataset updated
    Jan 20, 2022
    Authors
    KOOK HEEJIN
    Description

    Dataset

    This dataset was created by KOOK HEEJIN

    Contents

  2. Critical Habitats Data

    • kaggle.com
    Updated May 13, 2023
    + more versions
    Cite
    Utkarsh Singh (2023). Critical Habitats Data [Dataset]. https://www.kaggle.com/datasets/utkarshx27/critical-habitats-data
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 13, 2023
    Dataset provided by
    Kaggle
    Authors
    Utkarsh Singh
    License

    https://www.usa.gov/government-works/

    Description
    Note: There are 5 files


    Connecticut Critical Habitats is a polygon feature-based layer with a resolution of +/- 10 meters that represents significant natural community types occurring in Connecticut. This layer is a subset of habitat-related vegetation associations, described in Connecticut's Natural Vegetation Classification, that were designated as key habitats for species of Greatest Conservation Need in the Comprehensive Wildlife Conservation Strategy. These habitats are known to host a number of rare species including highly specialized invertebrates with very specific habitat associations. Some key habitats are broken into subtypes based on natural variations in plant species dominance and/or vegetation structure. These differences are apparent in the subtype names. Connecticut Critical Habitats can serve to highlight ecologically significant areas and to target areas of species diversity.

    This layer can be used to perform various spatial analyses that pertain to Critical Habitats, to aid in determining site management and conservation priorities, to prioritize field surveys, and to further document the distribution and abundance of State-listed and/or rare vertebrate and invertebrate species within the significant habitats. Use this layer with data of similar resolution; it is not intended for maps printed at a scale more detailed than 1:2,000.
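    Since this is a polygon feature layer, a natural first step is to load it into a GIS library and inspect its attribute table. Below is a minimal sketch using geopandas; the file name is hypothetical, since the listing only notes that there are 5 files without naming them:

    ```python
    import geopandas as gpd

    # Hypothetical file name -- the listing does not name the actual files.
    habitats = gpd.read_file("critical_habitats.shp")

    # Inspect the attribute columns, then the total mapped area
    # (in squared units of the layer's coordinate reference system).
    print(habitats.columns.tolist())
    print("total area:", habitats.geometry.area.sum())
    ```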

    Purpose

    Connecticut Critical Habitats provides the identification and distribution of a subset of important wildlife habitats identified in the Connecticut Comprehensive Wildlife Conservation Strategy. Connecticut Critical Habitats can be used in conjunction with other environmental and natural resource information to provide a more thorough understanding of the physical characteristics of each habitat. The spatial relationships between these areas and data such as land ownership and past, present and projected land use can be analyzed. The Connecticut Critical Habitats can serve to highlight ecologically significant areas and to target areas of species diversity for land conservation and protection. Biologists may use this data to target further research on associated plant and animal species.

    Use Limitations

    Connecticut Critical Habitats is not a comprehensive map of all critical habitat types in Connecticut. It represents a subset of the key habitats of greatest conservation need identified in Connecticut's Comprehensive Wildlife Conservation Strategy. Sites were mapped according to their known distribution. For some habitats the distribution may not be complete since no state-wide exhaustive surveys have been conducted. Most critical habitat sites were not field visited and publicly available oblique imagery such as the Bing Maps web mapping service was used as a surrogate for field investigation. Caution is advised when using this information without field verifying the habitat delineation and characterization for accuracy. Since many of these areas occur on private property, visiting these sites will require permission from the landowner for access. The recommended scale for viewing Critical Habitats is 1:2,000 to 1:12,000. Displaying Connecticut Critical Habitats at map scales larger and more detailed than 1:2,000 scale may result in minor locational differences and inaccuracies.

  3. FSDKaggle2018

    • zenodo.org
    • opendatalab.com
    • +1more
    zip
    Updated Jan 24, 2020
    + more versions
    Cite
    Eduardo Fonseca; Xavier Favory; Jordi Pons; Frederic Font; Manoj Plakal; Daniel P. W. Ellis; Xavier Serra (2020). FSDKaggle2018 [Dataset]. http://doi.org/10.5281/zenodo.2552860
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Eduardo Fonseca; Xavier Favory; Jordi Pons; Frederic Font; Manoj Plakal; Daniel P. W. Ellis; Xavier Serra
    Description

    FSDKaggle2018 is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology. FSDKaggle2018 has been used for the DCASE Challenge 2018 Task 2, which was run as a Kaggle competition titled Freesound General-Purpose Audio Tagging Challenge.

    Citation

    If you use the FSDKaggle2018 dataset or part of it, please cite our DCASE 2018 paper:

    Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Favory, Jordi Pons, Xavier Serra. "General-purpose Tagging of Freesound Audio with AudioSet Labels: Task Description, Dataset, and Baseline". Proceedings of the DCASE 2018 Workshop (2018)

    You can also consider citing our ISMIR 2017 paper, which describes how we gathered the manual annotations included in FSDKaggle2018.

    Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, "Freesound Datasets: A Platform for the Creation of Open Audio Datasets", In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017

    Contact

    You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.

    About this dataset

    Freesound Dataset Kaggle 2018 (or FSDKaggle2018 for short) is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology [1]. FSDKaggle2018 was used for Task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2018. Please visit the DCASE2018 Challenge Task 2 website for more information. The task was hosted on the Kaggle platform as a competition titled Freesound General-Purpose Audio Tagging Challenge. It was organized by researchers from the Music Technology Group of Universitat Pompeu Fabra and from Google Research's Machine Perception Team.

    The goal of this competition was to build an audio tagging system that can categorize an audio clip as belonging to one of a set of 41 diverse categories drawn from the AudioSet Ontology.

    All audio samples in this dataset are gathered from Freesound [2] and are provided here as uncompressed PCM 16 bit, 44.1 kHz, mono audio files. Note that because Freesound content is collaboratively contributed, recording quality and techniques can vary widely.

    The ground truth data provided in this dataset has been obtained after a data labeling process which is described below in the Data labeling process section. FSDKaggle2018 clips are unequally distributed in the following 41 categories of the AudioSet Ontology:

    "Acoustic_guitar", "Applause", "Bark", "Bass_drum", "Burping_or_eructation", "Bus", "Cello", "Chime", "Clarinet", "Computer_keyboard", "Cough", "Cowbell", "Double_bass", "Drawer_open_or_close", "Electric_piano", "Fart", "Finger_snapping", "Fireworks", "Flute", "Glockenspiel", "Gong", "Gunshot_or_gunfire", "Harmonica", "Hi-hat", "Keys_jangling", "Knock", "Laughter", "Meow", "Microwave_oven", "Oboe", "Saxophone", "Scissors", "Shatter", "Snare_drum", "Squeak", "Tambourine", "Tearing", "Telephone", "Trumpet", "Violin_or_fiddle", "Writing".

    Some other relevant characteristics of FSDKaggle2018:

    • The dataset is split into a train set and a test set.

    • The train set is meant to be for system development and includes ~9.5k samples unequally distributed among 41 categories. The minimum number of audio samples per category in the train set is 94, and the maximum 300. The duration of the audio samples ranges from 300ms to 30s due to the diversity of the sound categories and the preferences of Freesound users when recording sounds. The total duration of the train set is roughly 18h.

    • Out of the ~9.5k samples in the train set, ~3.7k have manually-verified ground truth annotations and ~5.8k have non-verified annotations. The non-verified annotations of the train set have a quality estimate of at least 65-70% in each category. Check out the Data labeling process section below for more information about this aspect.

    • Non-verified annotations in the train set are properly flagged in train.csv so that participants can opt to use this information during the development of their systems (see the sketch after this list).

    • The test set is composed of 1.6k samples with manually-verified annotations and a category distribution similar to that of the train set. The total duration of the test set is roughly 2h.

    • All audio samples in this dataset have a single label (i.e., they are only annotated with one label). Check out the Data labeling process section below for more information about this aspect. A single label should be predicted for each file in the test set.
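    As a minimal sketch of how that flag might be used, the snippet below loads the train metadata with pandas and partitions it on the verification flag. The path and the column names (label, manually_verified) are assumptions based on the description above, not confirmed field names:

    ```python
    import pandas as pd

    # Assumed path and column names (see lead-in); adjust to the actual CSV.
    train = pd.read_csv("FSDKaggle2018.meta/train_post_competition.csv")

    verified = train[train["manually_verified"] == 1]  # human-checked labels
    noisy = train[train["manually_verified"] == 0]     # automatic labels only
    print(f"verified: {len(verified)}, non-verified: {len(noisy)}")
    print(train["label"].value_counts().head())        # per-category clip counts
    ```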

    Data labeling process

    The data labeling process started from a manual mapping between Freesound tags and AudioSet Ontology categories (or labels), which was carried out by researchers at the Music Technology Group, Universitat Pompeu Fabra, Barcelona. Using this mapping, a number of Freesound audio samples were automatically annotated with labels from the AudioSet Ontology. These annotations can be understood as weak labels since they express the presence of a sound category in an audio sample.

    Then, a data validation process was carried out in which a number of participants listened to the annotated sounds and manually assessed the presence or absence of an automatically assigned sound category, according to the AudioSet category description.

    Audio samples in FSDKaggle2018 are only annotated with a single ground truth label (see train.csv). A total of 3,710 annotations included in the train set of FSDKaggle2018 have been manually validated as present and predominant (some with inter-annotator agreement, but not all of them). This means that in most cases there is no additional acoustic material other than the labeled category. In a few cases there may be some additional sound events, but these additional events won't belong to any of the 41 categories of FSDKaggle2018.

    The rest of the annotations have not been manually validated and therefore some of them could be inaccurate. Nonetheless, we have estimated that at least 65-70% of the non-verified annotations per category in the train set are indeed correct. Some of these non-verified audio samples may present several sound sources even though only one label is provided as ground truth. These additional sources are typically outside the set of 41 categories, but in a few cases they may fall within it.

    More details about the data labeling process can be found in [3].

    License

    FSDKaggle2018 has licenses at two different levels, as explained next.

    All sounds in Freesound are released under Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound. For attribution purposes and to facilitate attribution of these files to third parties, we include a relation of the audio clips included in FSDKaggle2018 and their corresponding license. The licenses are specified in the files train_post_competition.csv and test_post_competition_scoring_clips.csv.

    In addition, FSDKaggle2018 as a whole is the result of a curation process and it has an additional license. FSDKaggle2018 is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSDKaggle2018.doc zip file.

    Files

    FSDKaggle2018 can be downloaded as a series of zip files with the following directory structure:

    root
    │
    └───FSDKaggle2018.audio_train/                    Audio clips in the train set
    │
    └───FSDKaggle2018.audio_test/                     Audio clips in the test set
    │
    └───FSDKaggle2018.meta/                           Files for evaluation setup
    │   │
    │   └───train_post_competition.csv                Data split and ground truth for the train set
    │   │
    │   └───test_post_competition_scoring_clips.csv   Ground truth for the test set
    │
    └───FSDKaggle2018.doc/
        │
        └───README.md                                 The dataset description file you are reading
        │
        └───LICENSE-DATASET

  4. Alternative predictor variables.

    • plos.figshare.com
    xls
    Updated May 21, 2024
    + more versions
    Cite
    Rivalani Hlongwane; Kutlwano K. K. M. Ramaboa; Wilson Mongwe (2024). Alternative predictor variables. [Dataset]. http://doi.org/10.1371/journal.pone.0303566.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    May 21, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Rivalani Hlongwane; Kutlwano K. K. M. Ramaboa; Wilson Mongwe
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This study explores the potential of utilizing alternative data sources to enhance the accuracy of credit scoring models, compared to relying solely on traditional data sources, such as credit bureau data. A comprehensive dataset from the Home Credit Group’s home loan portfolio is analysed. The research examines the impact of incorporating alternative predictors that are typically overlooked, such as an applicant’s social network default status, regional economic ratings, and local population characteristics. The modelling approach applies the model-X knockoffs framework for systematic variable selection. By including these alternative data sources, the credit scoring models demonstrate improved predictive performance, achieving an area under the curve metric of 0.79360 on the Kaggle Home Credit default risk competition dataset, outperforming models that relied solely on traditional data sources, such as credit bureau data. The findings highlight the significance of leveraging diverse, non-traditional data sources to augment credit risk assessment capabilities and overall model accuracy.
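    For readers unfamiliar with the reported metric, here is a minimal sketch of how an area under the ROC curve is computed with scikit-learn; the labels and scores are hypothetical toy values, not the study's data:

    ```python
    from sklearn.metrics import roc_auc_score

    # Hypothetical toy values: 1 = default, 0 = repaid.
    y_true = [0, 0, 1, 0, 1, 1, 0, 1]
    scores = [0.10, 0.35, 0.60, 0.70, 0.80, 0.55, 0.40, 0.90]

    # AUC is the probability that a randomly chosen defaulter is scored
    # higher than a randomly chosen non-defaulter.
    print(roc_auc_score(y_true, scores))  # 0.875
    ```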

  5. Dollar street 10 - 64x64x3

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 14, 2024
    + more versions
    Cite
    van der burg, Sven (2024). Dollar street 10 - 64x64x3 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10837089
    Explore at:
    Dataset updated
    Apr 14, 2024
    Dataset authored and provided by
    van der burg, Sven
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The MLCommons Dollar Street Dataset is a collection of images of everyday household items from homes around the world that visually captures socioeconomic diversity of traditionally underrepresented populations. It consists of public domain data, licensed for academic, commercial and non-commercial usage, under CC-BY and CC-BY-SA 4.0. The dataset was developed because similar datasets lack socioeconomic metadata and are not representative of global diversity.

    This is a subset of the original dataset that can be used for multiclass classification with 10 categories. It is designed to be used in teaching, similar to the widely used, but unlicensed CIFAR-10 dataset.

    These are the preprocessing steps that were performed:

    • Only take examples with one imagenet_synonym label
    • Use only examples with the 10 most frequently occurring labels
    • Downscale images to 64 x 64 pixels
    • Split data into train and test sets
    • Store as numpy arrays

    This is the label mapping:

    Category          Label
    day bed           0
    dishrag           1
    plate             2
    running shoe      3
    soap dispenser    4
    street sign       5
    table lamp        6
    tile roof         7
    toilet seat       8
    washing machine   9
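    A minimal sketch of loading the preprocessed arrays and applying the label mapping is shown below; the file names and array shapes are assumptions, since the description only says the split data is stored as numpy arrays:

    ```python
    import numpy as np

    # Hypothetical file names -- the description does not specify them.
    x_train = np.load("x_train.npy")  # expected shape: (n, 64, 64, 3)
    y_train = np.load("y_train.npy")  # expected: integer labels 0-9

    labels = ["day bed", "dishrag", "plate", "running shoe", "soap dispenser",
              "street sign", "table lamp", "tile roof", "toilet seat",
              "washing machine"]
    print(x_train.shape, "->", labels[int(y_train[0])])
    ```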

    Check out this notebook to see how the subset was created.

    The original dataset was downloaded from https://www.kaggle.com/datasets/mlcommons/the-dollar-street-dataset. See https://mlcommons.org/datasets/dollar-street/ for more information.

  6. NYS Alternative Fuel Stations in New York

    • kaggle.com
    zip
    Updated Dec 28, 2020
    + more versions
    Cite
    State of New York (2020). NYS Alternative Fuel Stations in New York [Dataset]. https://www.kaggle.com/new-york-state/nys-alternative-fuel-stations-in-new-york
    Explore at:
    Available download formats: zip (552277 bytes)
    Dataset updated
    Dec 28, 2020
    Dataset authored and provided by
    State of New York
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    New York, New York
    Description

    Content

    Go to https://afdc.energy.gov/stations/#/find/nearest to access the full database of alternative fuel station locations nationwide, collected and maintained by the U.S. Department of Energy National Renewable Energy Laboratory. A station appears as one point in the data and on the map, regardless of the number of fuel dispensers or charging outlets at that location. For EV charging stations, for example, the data includes the number of charging ports available at the specific station.

    How does your organization use this dataset? What other NYSERDA or energy-related datasets would you like to see on Open NY? Let us know by emailing OpenNY@nyserda.ny.gov.

    Context

    This is a dataset hosted by the State of New York. The state has an open data platform found here, and they update their information according to the amount of data that is brought in. Explore New York State using Kaggle and all of the data sources available through the State of New York organization page!

    • Update Frequency: This dataset is updated annually.

    Acknowledgements

    This dataset is maintained using Socrata's API and Kaggle's API. Socrata has assisted countless organizations with hosting their open data and has been an integral part of the process of bringing more data to the public.
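    Since the source data lives on New York State's Socrata portal, a hedged sketch of pulling rows directly with the sodapy client is shown below; the dataset identifier is a placeholder, not the real one for this listing:

    ```python
    from sodapy import Socrata

    # Unauthenticated client for New York State's Socrata portal.
    client = Socrata("data.ny.gov", None)

    # "xxxx-xxxx" is a hypothetical dataset id -- substitute the real one.
    rows = client.get("xxxx-xxxx", limit=100)
    print(len(rows), "rows; first row:", rows[0] if rows else None)
    ```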

  7. ‘Boston House Prices-Advanced Regression Techniques’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Feb 13, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Boston House Prices-Advanced Regression Techniques’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-boston-house-prices-advanced-regression-techniques-bae0/fd606ebf/?iid=003-689&v=presentation
    Explore at:
    Dataset updated
    Feb 13, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Boston
    Description

    Analysis of ‘Boston House Prices-Advanced Regression Techniques’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/fedesoriano/the-boston-houseprice-data on 13 February 2022.

    --- Dataset description provided by original source is as follows ---

    Similar Datasets

    • Gender Pay Gap Dataset: LINK
    • California Housing Prices Data (5 new features!): LINK
    • Company Bankruptcy Prediction: LINK

    Context

    The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978.

    Attribute Information

    Input features in order:
    1) CRIM: per capita crime rate by town
    2) ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
    3) INDUS: proportion of non-retail business acres per town
    4) CHAS: Charles River dummy variable (1 if tract bounds river; 0 otherwise)
    5) NOX: nitric oxides concentration (parts per 10 million) [parts/10M]
    6) RM: average number of rooms per dwelling
    7) AGE: proportion of owner-occupied units built prior to 1940
    8) DIS: weighted distances to five Boston employment centres
    9) RAD: index of accessibility to radial highways
    10) TAX: full-value property-tax rate per $10,000 [$/10k]
    11) PTRATIO: pupil-teacher ratio by town
    12) B: the result of the equation B = 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town
    13) LSTAT: % lower status of the population

    Output variable: 1) MEDV: Median value of owner-occupied homes in $1000's [k$]
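    A minimal baseline sketch using these columns is shown below, assuming the data sits in a CSV with the attribute names listed above (the file name is hypothetical):

    ```python
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # Hypothetical file name; columns assumed to match the attribute list above.
    df = pd.read_csv("boston.csv")
    X, y = df.drop(columns="MEDV"), df["MEDV"]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    model = LinearRegression().fit(X_tr, y_tr)
    print("held-out R^2:", model.score(X_te, y_te))
    ```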

    Source

    StatLib - Carnegie Mellon University

    Relevant Papers

    Harrison, David & Rubinfeld, Daniel. (1978). Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management. 5. 81-102. 10.1016/0095-0696(78)90006-2. LINK

    Belsley, David A. & Kuh, Edwin. & Welsch, Roy E. (1980). Regression diagnostics: identifying influential data and sources of collinearity. New York: Wiley LINK

    --- Original source retains full ownership of the source dataset ---

  8. ‘COVID-19's Impact on Educational Stress’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘COVID-19's Impact on Educational Stress’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-covid-19-s-impact-on-educational-stress-49b5/4f12e21a/?iid=019-227&v=presentation
    Explore at:
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘COVID-19's Impact on Educational Stress’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/bsoyka3/educational-stress-due-to-the-coronavirus-pandemic on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Made by Statistry

    The survey collecting this information is still open for responses here.

    Context

    I just made this public survey because I want someone to be able to do something fun or insightful with the data that's been gathered. You can fill it out too!

    Content

    Each row represents a response to the survey. A few things have been done to sanitize the raw responses:

    • Column names and options have been renamed to make them easier to work with without much loss of meaning.
    • Responses from non-students have been removed.
    • Responses with ages greater than or equal to 22 have been removed.

    Take a look at the column description for each column to see what exactly it represents.

    Acknowledgements

    This dataset wouldn't exist without the help of others. I'd like to thank the following people for their contributions:

    • Every student who responded to the survey with valid responses
    • @radcliff on GitHub for providing the list of countries and abbreviations used in the survey and dataset
    • Giovanna de Vincenzo for providing the list of US states used in the survey and dataset
    • Simon Migaj for providing the image used for the survey and this dataset

    --- Original source retains full ownership of the source dataset ---

  9. ‘WHO national life expectancy’ analyzed by Analyst-2

    • analyst-2.ai
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com), ‘WHO national life expectancy’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-who-national-life-expectancy-c4c7/d31e495e/?iid=008-857&v=presentation
    Explore at:
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘WHO national life expectancy’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/mmattson/who-national-life-expectancy on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    I am developing my data science skills in areas outside of my previous work. An interesting problem for me was to identify which factors influence life expectancy on a national level. There is an existing Kaggle data set that explored this, but that information was corrupted. Part of the problem solving process is to step back periodically and ask "does this make sense?" Without reasonable data, it is harder to notice mistakes in my analysis code (as opposed to unusual behavior due to the data itself). I wanted to make a similar data set, but with reliable information.

    This is my first time exploring life expectancy, so I had to guess which features might be of interest when making the data set. Some were included for comparison with the other Kaggle data set. A number of potentially interesting features (like air pollution) were left off due to limited year or country coverage. Since the data was collected from more than one server, some features are present more than once, to explore the differences.

    Content

    A goal of the World Health Organization (WHO) is to ensure that a billion more people are protected from health emergencies, and provided better health and well-being. They provide public data collected from many sources to identify and monitor factors that are important to reach this goal. This set was primarily made using GHO (Global Health Observatory) and UNESCO (United Nations Educational Scientific and Culture Organization) information. The set covers the years 2000-2016 for 183 countries, in a single CSV file. Missing data is left in place, for the user to decide how to deal with it.
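    Because missing values are deliberately left in place, a sensible first step is to quantify them before modeling. A minimal sketch, assuming a hypothetical file name for the single CSV described above:

    ```python
    import pandas as pd

    # Hypothetical file name for the single CSV described above.
    df = pd.read_csv("who_life_expectancy.csv")

    # Fraction of missing values per column, highest first.
    missing = df.isna().mean().sort_values(ascending=False)
    print(missing.head(10))
    ```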

    Three notebooks are provided for my cursory analysis, a comparison with the other Kaggle set, and a template for creating this data set.

    Inspiration

    There is a lot to explore, if the user is interested. The GHO server alone has over 2000 "indicators".

    • How are the GHO and UNESCO life expectancies calculated, and what is causing the difference? That could also be asked for Gross National Income (GNI) and mortality features.
    • How does the life expectancy after age 60 compare to the life expectancy at birth? Is the relationship with the features in this data set different for those two targets?
    • What other indicators on the servers might be interesting to use? Some of the GHO indicators are different studies with different coverage. Can they be combined to make a more useful and robust data feature?
    • Unraveling the correlations between the features would take significant work.

    --- Original source retains full ownership of the source dataset ---

  10. CT-FAN-21 corpus: A dataset for Fake News Detection

    • zenodo.org
    Updated Oct 23, 2022
    Cite
    Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl (2022). CT-FAN-21 corpus: A dataset for Fake News Detection [Dataset]. http://doi.org/10.5281/zenodo.4714517
    Explore at:
    Dataset updated
    Oct 23, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl
    Description

    Data Access: The data in this research collection may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use them only for research purposes. Due to these restrictions, the collection is not open data. Please download the Agreement at Data Sharing Agreement and send the signed form to fakenewstask@gmail.com.

    Citation

    Please cite our work as

    @article{shahi2021overview,
     title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
     author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
     journal={Working Notes of CLEF},
     year={2021}
    }

    Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English.

    Subtask 3A: Multi-class fake news detection of news articles (English). Subtask 3A frames fake news detection as a four-class classification problem. The training data will be released in batches, roughly 900 articles with their respective labels. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. Our definitions for the categories are as follows:

    • False - The main claim made in an article is untrue.

    • Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.

    • True - This rating indicates that the primary elements of the main claim are demonstrably true.

    • Other - An article that cannot be categorised as true, false, or partially false due to a lack of evidence about its claims. This category includes articles in dispute and unproven articles.

    Subtask 3B: Topical Domain Classification of News Articles (English). Fact-checkers require background expertise to identify the truthfulness of an article. The categorisation will help to automate the sampling process from a stream of data. Given the text of a news article, determine its topical domain (English). This is a classification problem: the task is to categorise fake news articles into six topical categories such as health, election, crime, climate, and education. This task will be offered for a subset of the data of Subtask 3A.

    Input Data

    The data will be provided in the format Id, title, text, rating, domain; the columns are described as follows:

    Task 3a

    • ID- Unique identifier of the news article
    • Title- Title of the news article
    • text- Text mentioned inside the news article
    • our rating - class of the news article as false, partially false, true, other

    Task 3b

    • public_id- Unique identifier of the news article
    • Title- Title of the news article
    • text- Text mentioned inside the news article
    • domain - domain of the given news article (applicable only for Task 3B)

    Output data format

    Task 3a

    • public_id- Unique identifier of the news article
    • predicted_rating- predicted class

    Sample File

    public_id, predicted_rating
    1, false
    2, true

    Task 3b

    • public_id- Unique identifier of the news article
    • predicted_domain- predicted domain

    Sample file

    public_id, predicted_domain
    1, health
    2, crime

    Additional data for Training

    To train your model, participants can use additional data in a similar format; some datasets are available on the web. We don't provide the ground truth for those datasets. For testing, we will not use any articles from other datasets. Some possible sources:

    IMPORTANT!

    1. The fake news articles used for Task 3B are a subset of Task 3A.
    2. We used data from 2010 to 2021; the fake news content covers several topics, such as elections and COVID-19.

    Evaluation Metrics

    This task is evaluated as a classification task. We will use the macro-averaged F1 measure for the ranking of teams. There is a limit of 5 runs (in total, not per day), and only one person from a team is allowed to submit runs.
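    As an illustration of the ranking metric, the sketch below computes macro-averaged F1 with scikit-learn over the four Subtask 3A classes; the gold and predicted labels are hypothetical toy values:

    ```python
    from sklearn.metrics import f1_score

    # Hypothetical toy labels over the four Subtask 3A classes.
    gold = ["false", "true", "partially false", "other", "false", "true"]
    pred = ["false", "true", "partially false", "false", "false", "other"]

    # Macro-F1 averages the per-class F1 scores, weighting each class equally.
    print(f1_score(gold, pred, average="macro"))
    ```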

    Submission Link: https://competitions.codalab.org/competitions/31238

    Related Work

    • Shahi GK. AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. 2020 Oct 1. https://arxiv.org/pdf/2010.00502.pdf
    • G. K. Shahi and D. Nandini, “FakeCovid – a multilingual cross-domain fact check news dataset for covid-19,” in Workshop Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020. http://workshop-proceedings.icwsm.org/abstract?id=2020_14
    • Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of covid-19 misinformation on twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104

  11. ‘Customer Segmentation Classification’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Customer Segmentation Classification’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-customer-segmentation-classification-4965/7267b2f5/?iid=015-403&v=presentation
    Explore at:
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Customer Segmentation Classification’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/kaushiksuresh147/customer-segmentation on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    An automobile company has plans to enter new markets with their existing products (P1, P2, P3, P4, and P5). After intensive market research, they’ve deduced that the behavior of the new market is similar to their existing market.

    In their existing market, the sales team has classified all customers into 4 segments (A, B, C, D). They then performed segmented outreach and communication for each segment of customers. This strategy has worked exceptionally well for them. They plan to use the same strategy for the new markets and have identified 2627 new potential customers.

    You are required to help the manager predict the right group for the new customers.

    Content

    | Variable        | Definition                                                          |
    |-----------------|---------------------------------------------------------------------|
    | ID              | Unique ID                                                           |
    | Gender          | Gender of the customer                                              |
    | Ever_Married    | Marital status of the customer                                      |
    | Age             | Age of the customer                                                 |
    | Graduated       | Is the customer a graduate?                                         |
    | Profession      | Profession of the customer                                          |
    | Work_Experience | Work experience in years                                            |
    | Spending_Score  | Spending score of the customer                                      |
    | Family_Size     | Number of family members for the customer (including the customer) |
    | Var_1           | Anonymised category for the customer                                |
    | Segmentation    | (target) Customer segment of the customer                           |
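    A minimal baseline sketch for this four-class problem is shown below; the file name is hypothetical, and one-hot encoding the categorical fields is just one of several reasonable choices:

    ```python
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    # Hypothetical file name; columns assumed to match the table above.
    train = pd.read_csv("Train.csv")

    # One-hot encode categorical fields; fill remaining gaps with 0.
    X = pd.get_dummies(train.drop(columns=["ID", "Segmentation"])).fillna(0)
    y = train["Segmentation"]  # classes A, B, C, D

    clf = RandomForestClassifier(random_state=0).fit(X, y)
    print(clf.score(X, y))  # training accuracy; use a held-out split in practice
    ```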

    Acknowledgements

    This dataset was acquired from the Analytics Vidhya hackathon.

    --- Original source retains full ownership of the source dataset ---

  12. options-3

    • kaggle.com
    zip
    Updated Oct 21, 2020
    + more versions
    Cite
    _godmode_ (2020). options-3 [Dataset]. https://www.kaggle.com/kaustubh243/options3
    Explore at:
    Available download formats: zip (14923790 bytes)
    Dataset updated
    Oct 21, 2020
    Authors
    _godmode_
    Description

    Dataset

    This dataset was created by godmode

    Contents

  13. ‘Cardano Data’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Nov 13, 2021
    + more versions
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘Cardano Data’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-cardano-data-6b5c/e7fad47b/?iid=003-943&v=presentation
    Explore at:
    Dataset updated
    Nov 13, 2021
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Cardano Data’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/varpit94/cardano-data on 13 November 2021.

    --- Dataset description provided by original source is as follows ---

    What is Cardano?

    Cardano is a public blockchain platform. It is open-source and decentralized, with consensus achieved using proof of stake. It can facilitate peer-to-peer transactions with its internal cryptocurrency, Ada. Cardano was founded in 2015 by Ethereum co-founder Charles Hoskinson. The development of the project is overseen and supervised by the Cardano Foundation based in Zug, Switzerland. It is also the largest cryptocurrency to use a proof-of-stake blockchain, which is seen as a greener alternative to proof-of-work protocols.

    Data Description

    This dataset provides the history of daily prices of Cardano. The data starts from 01-Oct-2017. All the column descriptions are provided. Currency is USD.
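    A minimal sketch of computing daily returns from such a price history is shown below; the file and column names are assumptions in the style of common OHLC exports, not confirmed by the listing:

    ```python
    import pandas as pd

    # Hypothetical file/column names in common OHLC-export style.
    df = pd.read_csv("ADA-USD.csv", parse_dates=["Date"])

    # Daily simple returns from the closing price (USD).
    df["return"] = df["Close"].pct_change()
    print(df["return"].describe())
    ```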

    --- Original source retains full ownership of the source dataset ---

  14. 1000-options-day-data

    • kaggle.com
    zip
    Updated Jul 31, 2023
    Cite
    ChengHB (2023). 1000-options-day-data [Dataset]. https://www.kaggle.com/datasets/chenghb/1000-options-day-data
    Explore at:
    Available download formats: zip (3283565 bytes)
    Dataset updated
    Jul 31, 2023
    Authors
    ChengHB
    Description

    Dataset

    This dataset was created by ChengHB

    Contents

  15. ‘last.fm Music Artist Scrobbles’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Feb 14, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘last.fm Music Artist Scrobbles’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-last-fm-music-artist-scrobbles-b1d2/0776ba62/?iid=000-706&v=presentation
    Explore at:
    Dataset updated
    Feb 14, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘last.fm Music Artist Scrobbles’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/pcbreviglieri/lastfm-music-artist-scrobbles on 14 February 2022.

    --- Dataset description provided by original source is as follows ---

    This dataset is a summarized, sanitized subset of the one released at The 2nd International Workshop on Information Heterogeneity and Fusion in Recommender Systems (HetRec 2011), currently hosted at the GroupLens website (here).

    Sanitization included: (a) artist name misspelling correction and standardization; (b) reassignment of artists referenced with two or more artist ids; (c) removal of artists listed as 'unknown' or through their website addresses.

    The original dataset contains a larger number of files, including tag-related information, in addition to users, artists, and scrobble counts. The author contacted last.fm to ask for a more recent version of this content in a similar format, with no reply as of June 15th, 2020.

    --- Original source retains full ownership of the source dataset ---

  16. Caucasian People - Liveness Detection Dataset

    • kaggle.com
    Updated Apr 16, 2024
    + more versions
    Cite
    Training Data (2024). Caucasian People - Liveness Detection Dataset [Dataset]. https://www.kaggle.com/datasets/trainingdatapro/caucasian-people-liveness-detection-dataset
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 16, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Training Data
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Biometric Attack Dataset, Caucasian People

    A similar dataset that includes all ethnicities: Anti Spoofing Real Dataset

    This dataset for face anti-spoofing and face recognition includes images and videos of Caucasian people. It helps enhance model performance by providing a wider range of data for a specific ethnic group.

    The videos were gathered by capturing the faces of genuine individuals as well as spoofed facial presentations. The dataset supports approaches that learn to detect spoofing techniques by extracting features from genuine facial images, preventing such information from being captured by fake users.

    The dataset contains images and videos of real humans with various resolutions, views, and colors, making it a comprehensive resource for researchers working on anti-spoofing technologies.

    People in the dataset


    Types of files in the dataset:

    • photo - selfie of the person
    • video - real video of the person

    Our dataset also explores the use of neural architectures, such as deep neural networks, to facilitate the identification of distinguishing patterns and textures in different regions of the face, increasing the accuracy and generalizability of the anti-spoofing models.

    💴 For Commercial Usage: The full version of the dataset includes 19,000 files; leave a request on TrainingData to buy the dataset.

    Metadata for the full dataset:

    • assignment_id - unique identifier of the media file
    • worker_id - unique identifier of the person
    • age - age of the person
    • true_gender - gender of the person
    • country - country of the person
    • ethnicity - ethnicity of the person
    • video_extension - video extensions in the dataset
    • video_resolution - video resolution in the dataset
    • video_duration - video duration in the dataset
    • video_fps - frames per second for video in the dataset
    • photo_extension - photo extensions in the dataset
    • photo_resolution - photo resolution in the dataset

    Statistics for the dataset


    💴 Buy the Dataset: This is just an example of the data. Leave a request on https://trainingdata.pro/datasets to learn about the price and buy the dataset

    Content

    The dataset consists of:
    • files - 10 folders, one per person, each including 1 image and 1 video
    • .csv file - contains information about the files and people in the dataset

    File with the extension .csv

    • id: id of the person,
    • selfie_link: link to access the photo,
    • video_link: link to access the video,
    • age: age of the person,
    • country: country of the person,
    • gender: gender of the person,
    • video_extension: video extension,
    • video_resolution: video resolution,
    • video_duration: video duration,
    • video_fps: frames per second for video,
    • photo_extension: photo extension,
    • photo_resolution: photo resolution
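    A minimal sketch of reading that metadata file with pandas, assuming a hypothetical file name and the column schema listed above:

    ```python
    import pandas as pd

    # Hypothetical file name; columns per the schema above.
    meta = pd.read_csv("files.csv")

    print(meta[["id", "age", "country", "gender"]].head())
    print(meta["video_duration"].describe())  # summary of clip lengths
    ```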

    TrainingData provides high-quality data annotation tailored to your needs

    keywords: liveness detection systems, liveness detection dataset, biometric dataset, biometric data dataset, biometric system attacks, anti-spoofing dataset, face liveness detection, deep learning dataset, face spoofing database, face anti-spoofing, ibeta dataset, face anti spoofing, large-scale face anti spoofing, rich annotations anti spoofing dataset

  17. ‘Argentina provincial data’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Argentina provincial data’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-argentina-provincial-data-4425/b5eea614/?iid=001-550&v=presentation
    Explore at:
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Argentina
    Description

    Analysis of ‘Argentina provincial data’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/kingabzpro/argentina-provincial-data on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    With almost 40 million inhabitants and a diverse geography that encompasses the Andes mountains, glacial lakes, and the Pampas grasslands, Argentina is the second largest country (by area) and has one of the largest economies in South America. It is politically organized as a federation of 23 provinces and an autonomous city, Buenos Aires.

    Content

    We will analyze ten economic and social indicators collected for each province. Because these indicators are highly correlated, we will use principal component analysis (PCA) to reduce redundancies and highlight patterns that are not apparent in the raw data. After visualizing the patterns, we will use k-means clustering to partition the provinces into groups with similar development levels.

    These results can be used to plan public policy by helping allocate resources to develop infrastructure, education, and welfare programs.
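    A minimal sketch of that pipeline (standardize, PCA, then k-means) with scikit-learn is shown below; the file name, the use of two components, and the choice of four clusters are illustrative assumptions:

    ```python
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    # Hypothetical file: one row per province, ten numeric indicator columns.
    df = pd.read_csv("argentina_provinces.csv", index_col=0)

    # Standardize the correlated indicators, then project onto two components.
    X = StandardScaler().fit_transform(df)
    components = PCA(n_components=2).fit_transform(X)

    # Partition provinces into groups with similar development levels.
    clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(components)
    print(dict(zip(df.index, clusters)))
    ```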

    Acknowledgements

    DataCamp

    --- Original source retains full ownership of the source dataset ---

  18. Paimon Dataset YOLO Detection Dataset

    • paperswithcode.com
    • gts.ai
    Updated Mar 18, 2025
    + more versions
    Cite
    (2025). Paimon Dataset YOLO Detection Dataset [Dataset]. https://paperswithcode.com/dataset/paimon-dataset-yolo-detection
    Explore at:
    Dataset updated
    Mar 18, 2025
    Description


    This dataset consists of a diverse collection of images featuring Paimon, a popular character from the game Genshin Impact. The images have been sourced from in-game gameplay footage and capture Paimon from various angles and in different sizes (scales), making the dataset suitable for training YOLO object detection models.

    The dataset provides a comprehensive view of Paimon in different lighting conditions, game environments, and positions, ensuring the model can generalize well to similar characters or object detection tasks. While most annotations are accurately labeled, a small number may include minor inaccuracies due to manual labeling errors. The dataset is ideal for researchers and developers working on character recognition, object detection in gaming environments, or other AI vision tasks.


    Dataset Features:

    Image format: .jpg files at 640×320 resolution.

    Annotation format: .txt files in YOLO format, containing bounding box data with the fields below (see the parsing sketch after this list):

    • class_id
    • x_center
    • y_center
    • width
    • height
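    A minimal sketch of parsing one such annotation line and converting the normalized box back to pixel coordinates, assuming the 640×320 image size stated above (the sample line is hypothetical):

    ```python
    # Image size as stated in the dataset features above.
    IMG_W, IMG_H = 640, 320

    line = "0 0.512 0.430 0.210 0.380"  # hypothetical annotation row
    class_id, xc, yc, w, h = line.split()
    xc, yc, w, h = (float(v) for v in (xc, yc, w, h))

    # YOLO stores a normalized center and size; recover the pixel corner box.
    x1, y1 = (xc - w / 2) * IMG_W, (yc - h / 2) * IMG_H
    x2, y2 = (xc + w / 2) * IMG_W, (yc + h / 2) * IMG_H
    print(class_id, (round(x1), round(y1), round(x2), round(y2)))
    ```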

    Use Cases:

    Character Detection in Games: Train YOLO models to detect and identify in-game characters or NPCs.

    Gaming Analytics: Improve recognition of specific game elements for AI-powered game analytics tools.

    Research: Contribute to academic research focused on object detection or computer vision in animated and gaming environments.

    Data Structure:

    Images: High-quality .jpg images captured from multiple perspectives, ensuring robust model training across various orientations and lighting scenarios.

    Annotations: Each image has an associated .txt file that follows the YOLO format. The annotations are structured to include class identification, object location (center coordinates), and bounding box dimensions.

    Key Advantages:

    Varied Angles and Scales: The dataset includes Paimon from multiple perspectives, aiding in creating more versatile and adaptable object detection models.

    Real-World Scenario: Extracted from actual gameplay footage, the dataset simulates real-world detection challenges such as varying backgrounds, motion blur, and changing character scales.

    Training Ready: Suitable for training YOLO models and other deep learning frameworks that require object detection capabilities.

    This dataset is sourced from Kaggle.

  19. finance-alpaca

    • huggingface.co
    Updated Apr 7, 2023
    + more versions
    Cite
    finance-alpaca [Dataset]. https://huggingface.co/datasets/gbharti/finance-alpaca
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 7, 2023
    Authors
    Gaurang Bharti
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset is a combination of Stanford's Alpaca (https://github.com/tatsu-lab/stanford_alpaca) and FiQA (https://sites.google.com/view/fiqa/), with another 1.3k pairs custom-generated using GPT-3.5. A script for tuning through Kaggle's (https://www.kaggle.com) free resources using PEFT/LoRA: https://www.kaggle.com/code/gbhacker23/wealth-alpaca-lora. A GitHub repo with performance analyses, training and data generation scripts, and inference notebooks: https://github.com/gaurangbharti1/wealth-alpaca… See the full description on the dataset page: https://huggingface.co/datasets/gbharti/finance-alpaca.
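    Since the dataset is hosted on the Hugging Face Hub, it can presumably be loaded with the datasets library; a minimal sketch, assuming a single train split and Alpaca-style fields:

    ```python
    from datasets import load_dataset

    # Repo id taken from the citation above; the split name is an assumption.
    ds = load_dataset("gbharti/finance-alpaca", split="train")
    print(ds[0])  # expected Alpaca-style fields, e.g. instruction/input/output
    ```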

  20. Numenta Anomaly Benchmark (NAB)

    • kaggle.com
    Updated Aug 19, 2016
    Cite
    BoltzmannBrain (2016). Numenta Anomaly Benchmark (NAB) [Dataset]. https://www.kaggle.com/datasets/boltzmannbrain/nab
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 19, 2016
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    BoltzmannBrain
    Description

    The Numenta Anomaly Benchmark (NAB) is a novel benchmark for evaluating algorithms for anomaly detection in streaming, online applications. It comprises over 50 labeled real-world and artificial timeseries data files plus a novel scoring mechanism designed for real-time applications. All of the data and code are fully open-source, with extensive documentation and a scoreboard of anomaly detection algorithms: github.com/numenta/NAB. The full dataset is included here, but please go to the repo for details on how to evaluate anomaly detection algorithms on NAB.

    NAB Data Corpus

    The NAB corpus of 58 timeseries data files is designed to provide data for research in streaming anomaly detection. It comprises both real-world and artificial timeseries data containing labeled anomalous periods of behavior. Data are ordered, timestamped, single-valued metrics. All data files contain anomalies, unless otherwise noted.
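    A minimal sketch of reading one of these files and flagging outliers with a rolling z-score is shown below; the column names ("timestamp", "value") and the nyc_taxi.csv path follow the corpus description here but should be verified against the repo:

    ```python
    import pandas as pd

    # Assumed column names for NAB's single-valued, timestamped CSVs.
    df = pd.read_csv("realKnownCause/nyc_taxi.csv", parse_dates=["timestamp"])

    # Crude baseline: flag points more than 3 rolling standard deviations out
    # (window of 48 = one day of 30-minute buckets for this file).
    roll = df["value"].rolling(window=48)
    z = (df["value"] - roll.mean()) / roll.std()
    print(df.loc[z.abs() > 3, "timestamp"].head())
    ```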

    The majority of the data is real-world, from a variety of sources such as AWS server metrics, Twitter volume, advertisement clicking metrics, traffic data, and more. All data is included in the repository, with more details in the data readme. We are in the process of adding more data and actively searching for more. Please contact us at nab@numenta.org if you have similar data (ideally with known anomalies) that you would like to see incorporated into NAB.

    The NAB version will be updated whenever new data (and corresponding labels) is added to the corpus; NAB is currently in v1.0.

    Real data

    • realAWSCloudwatch/

      AWS server metrics as collected by the AmazonCloudwatch service. Example metrics include CPU Utilization, Network Bytes In, and Disk Read Bytes.

    • realAdExchange/

      Online advertisement clicking rates, where the metrics are cost-per-click (CPC) and cost per thousand impressions (CPM). One of the files is normal, without anomalies.

    • realKnownCause/

      This is data for which we know the anomaly causes; no hand labeling.

      • ambient_temperature_system_failure.csv: The ambient temperature in an office setting.
      • cpu_utilization_asg_misconfiguration.csv: From Amazon Web Services (AWS) monitoring CPU usage – i.e. average CPU usage across a given cluster. When usage is high, AWS spins up a new machine, and uses fewer machines when usage is low.
      • ec2_request_latency_system_failure.csv: CPU usage data from a server in Amazon's East Coast datacenter. The dataset ends with a complete system failure resulting from a documented failure of AWS API servers. There's an interesting story behind this data in the Numenta blog (http://numenta.com/blog/anomaly-of-the-week.html).
      • machine_temperature_system_failure.csv: Temperature sensor data of an internal component of a large industrial machine. The first anomaly is a planned shutdown of the machine. The second anomaly is difficult to detect and directly led to the third anomaly, a catastrophic failure of the machine.
      • nyc_taxi.csv: Number of NYC taxi passengers, where the five anomalies occur during the NYC marathon, Thanksgiving, Christmas, New Year's Day, and a snow storm. The raw data is from the NYC Taxi and Limousine Commission. The data file included here aggregates the total number of taxi passengers into 30-minute buckets.
      • rogue_agent_key_hold.csv: Timing the key holds for several users of a computer, where the anomalies represent a change in the user.
      • rogue_agent_key_updown.csv: Timing the key strokes for several users of a computer, where the anomalies represent a change in the user.
    • realTraffic/

      Real time traffic data from the Twin Cities Metro area in Minnesota, collected by the Minnesota Department of Transportation. Included metrics include occupancy, speed, and travel time from specific sensors.

    • realTweets/

      A collection of Twitter mentions of large publicly-traded companies such as Google and IBM. The metric value represents the number of mentions for a given ticker symbol every 5 minutes.

    Artificial data

    • artificialNoAnomaly/

      Artificially-generated data without any anomalies.

    • artificialWithAnomaly/

      Artificially-generated data with varying types of anomalies.

    Acknowledgments

    We encourage you to publish your results on running NAB, and share them with us at nab@numenta.org. Please cite the following publication when referring to NAB:

    Lavin, Alexander and Ahmad, Subutai. "Evaluating Real-time Anomaly Detection Algorithms – the Numenta Anomaly Benchmark", Fourteenth International Conference on Machine Learning and Applications, December 2015. [PDF]
