47 datasets found
  1. issues-kaggle-notebooks

    • huggingface.co
    Cite
    issues-kaggle-notebooks [Dataset]. https://huggingface.co/datasets/HuggingFaceTB/issues-kaggle-notebooks
    Explore at:
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    Hugging Face Smol Models Research
    Description

    GitHub Issues & Kaggle Notebooks

      Description
    

    GitHub Issues & Kaggle Notebooks is a collection of two code datasets intended for language model training, sourced from GitHub issues and from notebooks on the Kaggle platform. These datasets are a modified part of the StarCoder2 model training corpus, specifically the bigcode/StarCoder2-Extras dataset. We reformat the samples to remove StarCoder2's special tokens and use natural text to delimit comments in issues and display… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/issues-kaggle-notebooks.

  2. Testing github actions for upload datasets

    • kaggle.com
    zip
    Updated Oct 12, 2020
    Cite
    Jaime Valero (2020). Testing github actions for upload datasets [Dataset]. https://www.kaggle.com/jaimevalero/my-new-dataset
    Explore at:
    Available download formats: zip (183 bytes)
    Dataset updated
    Oct 12, 2020
    Authors
    Jaime Valero
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Example of a dataset synchronized by GitHub Actions.
    Source: https://github.com/jaimevalero/test-actions and https://github.com/jaimevalero/push-kaggle-dataset

  3. HAIS: sample data

    • kaggle.com
    zip
    Updated Nov 24, 2023
    Cite
    Abderrazak Chahid (2023). HAIS: sample data [Dataset]. https://www.kaggle.com/datasets/abderrazakchahid1/sample-data-hais/code
    Explore at:
    Available download formats: zip (40,407,274 bytes)
    Dataset updated
    Nov 24, 2023
    Authors
    Abderrazak Chahid
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Abderrazak Chahid

    Released under MIT

    Contents

  4. hagrid-sample-250k-384p

    • huggingface.co
    Updated Jul 3, 2023
    + more versions
    Cite
    Christian Mills (2023). hagrid-sample-250k-384p [Dataset]. https://huggingface.co/datasets/cj-mills/hagrid-sample-250k-384p
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Jul 3, 2023
    Authors
    Christian Mills
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset contains 254,661 images from HaGRID (HAnd Gesture Recognition Image Dataset) downscaled to 384p. The original dataset is 716GB and contains 552,992 1080p images. I created this sample for a tutorial so readers can use the dataset in the free tiers of Google Colab and Kaggle Notebooks.

      Original Authors:
    

    Alexander Kapitanov, Andrey Makhlyarchuk, Karina Kvanchiani

      Original Dataset Links
    

    GitHub Kaggle Datasets Page

      Object Classes
    

    ['call'… See the full description on the dataset page: https://huggingface.co/datasets/cj-mills/hagrid-sample-250k-384p.

  5. Developers and programming languages

    • kaggle.com
    Updated Dec 3, 2017
    Cite
    Jaime Valero (2017). Developers and programming languages [Dataset]. https://www.kaggle.com/jaimevalero/developers-and-programming-languages/activity
    Explore at:
    Croissant
    Dataset updated
    Dec 3, 2017
    Dataset provided by
    Kaggle
    Authors
    Jaime Valero
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    A sample of 17,000 github.com developers and the programming languages they know, or want to learn.

    Content

    I acquired the data by listing the 1,000 most-starred repos dataset and taking the first 30 users who starred each repo, then cleaning the duplicates. For each of the 17,000 users, I calculated the frequency of each of the 1,400 technologies in the metadata of the user's own and forked repositories.

    Acknowledgements

    Thanks to Jihye Sofia Seo, whose dataset Top 980 Starred Open Source Projects on GitHub is the source for this dataset.

    Inspiration

    I am using this dataset for my GitHub recommendation engine: I use it to find similar developers and recommend their starred repositories. I also use this dataset to categorize developer types, trying to understand the weight of a developer in a team, especially when a developer leaves the company, so that the talent lost to the team and the company can be estimated.
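    The similar-developer idea above can be sketched with a toy user-technology frequency matrix and cosine similarity. Everything here is illustrative: the user names and technology counts are invented, not rows of the actual dataset.

```python
import math

# Toy user -> technology-frequency vectors (illustrative; the real dataset
# has ~17,000 users and ~1,400 technologies as columns).
users = {
    "alice": {"python": 12, "docker": 3, "javascript": 1},
    "bob":   {"python": 10, "docker": 4},
    "carol": {"java": 8, "kotlin": 5},
}

def cosine(u, v):
    """Cosine similarity between two sparse frequency vectors."""
    keys = set(u) | set(v)
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in keys)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def most_similar(name):
    """Most similar developer, whose starred repos we would recommend."""
    others = [(cosine(users[name], vec), other)
              for other, vec in users.items() if other != name]
    return max(others)[1]

print(most_similar("alice"))  # bob: closest python/docker profile
```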

  6. notMNIST

    • kaggle.com
    • opendatalab.com
    • +3 more
    Updated Feb 14, 2018
    Cite
    jwjohnson314 (2018). notMNIST [Dataset]. https://www.kaggle.com/datasets/jwjohnson314/notmnist/data
    Explore at:
    Croissant
    Dataset updated
    Feb 14, 2018
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    jwjohnson314
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    The MNIST dataset is one of the best-known image classification problems out there, and a veritable classic in the field of machine learning. This dataset is a more challenging version of the same root problem: classifying letters from images. This is a multiclass classification dataset of glyphs of the English letters A-J.

    This dataset is used extensively in the Udacity Deep Learning course, and is available in the Tensorflow Github repo (under Examples). I'm not aware of any license governing the use of this data, so I'm posting it here so that the community can use it with Kaggle kernels.

    Content

    notMNIST_large.zip is a large but dirty version of the dataset with 529,119 images, and notMNIST_small.zip is a small hand-cleaned version with 18,726 images. The dataset was assembled by Yaroslav Bulatov and can be obtained on his blog. According to this blog entry, there is about a 6.5% label error rate on the large uncleaned dataset and a 0.5% label error rate on the small hand-cleaned dataset.

    The two files each contain 28x28 grayscale images of the letters A-J, organized into directories by letter. notMNIST_large.zip contains 529,119 images and notMNIST_small.zip contains 18,726 images.
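    The by-letter directory layout can be indexed with a few lines of Python. This is a minimal sketch: it builds a tiny mimic of an extracted notMNIST_small tree in a temp directory rather than assuming the real archive is present.

```python
import os
import tempfile

def index_by_letter(root):
    """Build (filepath, label) pairs from a tree organized as
    root/<letter>/<image files>, as in the notMNIST archives."""
    samples = []
    for letter in sorted(os.listdir(root)):
        letter_dir = os.path.join(root, letter)
        if not os.path.isdir(letter_dir):
            continue
        for fname in sorted(os.listdir(letter_dir)):
            samples.append((os.path.join(letter_dir, fname), letter))
    return samples

# Mimic a tiny extracted archive: A/ and B/ each holding one image file.
root = tempfile.mkdtemp()
for letter, fname in [("A", "a1.png"), ("B", "b1.png")]:
    os.makedirs(os.path.join(root, letter))
    open(os.path.join(root, letter, fname), "wb").close()

labels = [label for _, label in index_by_letter(root)]
print(labels)  # ['A', 'B']
```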

    Acknowledgements

    Thanks to Yaroslav Bulatov for putting together the dataset.

  7. Google Landmarks Dataset v2

    • github.com
    • paperswithcode.com
    • +2 more
    Updated Sep 27, 2019
    Cite
    Google (2019). Google Landmarks Dataset v2 [Dataset]. https://github.com/cvdfoundation/google-landmark
    Explore at:
    Dataset updated
    Sep 27, 2019
    Dataset provided by
    Google (http://google.com/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the second version of the Google Landmarks dataset (GLDv2), which contains images annotated with labels representing human-made and natural landmarks. The dataset can be used for landmark recognition and retrieval experiments. This version of the dataset contains approximately 5 million images, split into 3 sets of images: train, index and test. The dataset was presented in our CVPR'20 paper. In this repository, we present download links for all dataset files and relevant code for metric computation. This dataset was associated with two Kaggle challenges, on landmark recognition and landmark retrieval. Results were discussed as part of a CVPR'19 workshop. In this repository, we also provide scores for the top 10 teams in the challenges, based on the latest ground-truth version. Please visit the challenge and workshop webpages for more details on the data, tasks and technical solutions from top teams.

  8. ‘My Uber Drives’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Mar 23, 2017
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2017). ‘My Uber Drives’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-my-uber-drives-8b97/latest
    Explore at:
    Dataset updated
    Mar 23, 2017
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘My Uber Drives’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/zusmani/uberdrives on 13 February 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    My Uber Drives (2016)

    Here are the details of my Uber drives of 2016. I am sharing this dataset for the data science community to learn from the behavior of an ordinary Uber customer.

    Content

    Geography: USA, Sri Lanka and Pakistan

    Time period: January - December 2016

    Unit of analysis: Drives

    Total Drives: 1,155

    Total Miles: 12,204

    Dataset: The dataset contains Start Date, End Date, Start Location, End Location, Miles Driven and Purpose of drive (Business, Personal, Meals, Errands, Meetings, Customer Support etc.)

    Acknowledgements & References

    Users are allowed to use, download, copy, distribute and cite the dataset for their pet projects and training. Please cite it as follows: “Zeeshan-ul-hassan Usmani, My Uber Drives Dataset, Kaggle Dataset Repository, March 23, 2017.”

    Past Research

    Uber TLC FOIL Response - The dataset contains over 4.5 million Uber pickups in New York City from April to September 2014, and 14.3 million more Uber pickups from January to June 2015 https://github.com/fivethirtyeight/uber-tlc-foil-response

    1.1 Billion Taxi Pickups from New York - http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/

    What you can do with this data - a good example by Yao-Jen Kuo - https://yaojenkuo.github.io/uber.html

    Inspiration

    Some ideas worth exploring:

    • What is the average length of the trip?

    • Average number of rides per week or per month?

    • Total tax savings based on traveled business miles?

    • Percentage of business vs. personal vs. meals miles?

    • How much money can be saved by a typical customer using Uber, Careem, or Lyft versus regular cab service?

    --- Original source retains full ownership of the source dataset ---
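    Some of the questions above can be answered with a short script once the CSV is loaded. A minimal sketch, assuming columns shaped like the description (Start Date, Miles Driven, Purpose); the three rows are fabricated:

```python
from datetime import datetime
from collections import Counter

# Fabricated drives shaped like the described columns
# (Start Date, Miles Driven, Purpose).
drives = [
    {"start": "2016-01-05 10:00", "miles": 5.1, "purpose": "Business"},
    {"start": "2016-01-20 18:30", "miles": 2.4, "purpose": "Personal"},
    {"start": "2016-02-02 09:15", "miles": 8.0, "purpose": "Business"},
]

# Average trip length in miles.
avg_miles = sum(d["miles"] for d in drives) / len(drives)

# Rides per month, keyed by YYYY-MM.
per_month = Counter(
    datetime.strptime(d["start"], "%Y-%m-%d %H:%M").strftime("%Y-%m")
    for d in drives
)

# Share of miles driven for business.
total_miles = sum(d["miles"] for d in drives)
business_share = sum(
    d["miles"] for d in drives if d["purpose"] == "Business"
) / total_miles

print(round(avg_miles, 2), per_month["2016-01"], round(business_share, 2))
# 5.17 2 0.85
```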

  9. Coughs: ESC-50 and FSDKaggle2018

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jul 27, 2021
    Cite
    Mahmoud Abdelkhalek; Jinyi Qiu; Michelle Hernandez; Alper Bozkurt; Edgar Lobaton (2021). Coughs: ESC-50 and FSDKaggle2018 [Dataset]. http://doi.org/10.5281/zenodo.5136592
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 27, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mahmoud Abdelkhalek; Jinyi Qiu; Michelle Hernandez; Alper Bozkurt; Edgar Lobaton
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This dataset consists of timestamps for coughs contained in files extracted from the ESC-50 and FSDKaggle2018 datasets.

    Citation

    This dataset was generated and used in our paper:

    Mahmoud Abdelkhalek, Jinyi Qiu, Michelle Hernandez, Alper Bozkurt, Edgar Lobaton, “Investigating the Relationship between Cough Detection and Sampling Frequency for Wearable Devices,” in the 43rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 2021.

    Please cite this paper if you use the timestamps.csv file in your work.

    Generation

    The cough timestamps given in the timestamps.csv file were generated using the cough templates given in figures 3 and 4 in the paper:

    A. H. Morice, G. A. Fontana, M. G. Belvisi, S. S. Birring, K. F. Chung, P. V. Dicpinigaitis, J. A. Kastelik, L. P. McGarvey, J. A. Smith, M. Tatar, J. Widdicombe, "ERS guidelines on the assessment of cough", European Respiratory Journal 2007 29: 1256-1276; DOI: 10.1183/09031936.00101006

    More precisely, 40 files labelled as "coughing" in the ESC-50 dataset and 273 files labelled as "Cough" in the FSDKaggle2018 dataset were manually searched using Audacity for segments of audio that closely matched the aforementioned templates, both visually and auditorily. Some files did not contain any coughs at all, while other files contained several coughs. Therefore, only the files that contained at least one cough are included in the coughs directory. In total, the timestamps of 768 cough segments with lengths ranging from 0.2 seconds to 0.9 seconds were extracted.

    Description

    The audio files are presented in wav format in the coughs directory. Files named in the general format of "*-*-*-24.wav" were extracted from the ESC-50 dataset, while all other files were extracted from the FSDKaggle2018 dataset.

    The timestamps.csv file contains the timestamps for the coughs and it consists of four columns:

    file_name,cough_number,start_time,end_time

    Files in the file_name column can be found in the coughs directory. cough_number refers to the index of the cough in the corresponding file. For example, if the file X.wav contains 5 coughs, then X.wav will be repeated 5 times under the file_name column, and for each row, the cough_number will range from 1 to 5. start_time refers to the starting time of a cough segment measured in seconds, while end_time refers to the end time of a cough segment measured in seconds.
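    Reading the four-column timestamps.csv and counting coughs per file is straightforward with the csv module. The rows below are invented stand-ins in the documented format, not real entries from the dataset:

```python
import csv
import io
from collections import defaultdict

# Invented rows in the documented format:
# file_name,cough_number,start_time,end_time
sample = """file_name,cough_number,start_time,end_time
X.wav,1,0.50,0.90
X.wav,2,1.40,1.75
Y.wav,1,0.10,0.42
"""

coughs = defaultdict(list)
for row in csv.DictReader(io.StringIO(sample)):
    start, end = float(row["start_time"]), float(row["end_time"])
    # Segment lengths should fall in the 0.2-0.9 s range described above.
    coughs[row["file_name"]].append((start, end))

for fname, segments in coughs.items():
    print(fname, len(segments))  # X.wav 2, then Y.wav 1
```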

    Licensing

    The ESC-50 dataset as a whole is licensed under the Creative Commons Attribution-NonCommercial license. Individual files in the ESC-50 dataset are licensed under different Creative Commons licenses. For a list of these licenses, see LICENSE. The ESC-50 files in the cough directory are given for convenience only, and have not been modified from their original versions. To download the original files, see the ESC-50 dataset.

    The FSDKaggle2018 dataset as a whole is licensed under the Creative Commons Attribution 4.0 International license. Individual files in the FSDKaggle2018 dataset are licensed under different Creative Commons licenses. For a list of these licenses, see the License section in FSDKaggle2018. The FSDKaggle2018 files in the cough directory are given for convenience only, and have not been modified from their original versions. To download the original files, see the FSDKaggle2018 dataset.

    The timestamps.csv file is licensed under the Creative Commons Attribution-NonCommercial 4.0 International license.

  10. COVID-19 Posteroanterior Chest X-Ray fused (CPCXR) dataset

    • ieee-dataport.org
    Updated Oct 27, 2020
    + more versions
    Cite
    Narinder Singh Punn (2020). COVID-19 Posteroanterior Chest X-Ray fused (CPCXR) dataset [Dataset]. http://doi.org/10.21227/x2r3-xk48
    Explore at:
    Dataset updated
    Oct 27, 2020
    Dataset provided by
    IEEE Dataport
    Authors
    Narinder Singh Punn
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset is generated by the fusion of three publicly available datasets: COVID-19 CXR images (https://github.com/ieee8023/covid-chestxray-dataset), the Radiological Society of North America (RSNA) dataset (https://www.kaggle.com/c/rsna-pneumonia-detection-challenge), and the U.S. National Library of Medicine (USNLM) Montgomery County collection, NLM(MC) (https://lhncbc.nlm.nih.gov/publication/pub9931). These datasets were annotated by expert radiologists. The fused dataset consists of samples of diseases labeled as COVID-19, Tuberculosis, Other pneumonia (SARS, MERS, etc.), and Normal. The dataset can be utilized to train and evaluate deep learning and machine learning models as binary and multi-class classification problems.

  11. ‘Mayweather Marketing Tactics’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Aug 18, 2017
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2017). ‘Mayweather Marketing Tactics’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-mayweather-marketing-tactics-4526/latest
    Explore at:
    Dataset updated
    Aug 18, 2017
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Mayweather Marketing Tactics’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/undefeated-boxerse on 28 January 2022.

    --- Dataset description provided by original source is as follows ---


    About this dataset

    See Readme for more details.
    This repository contains a selection of the data -- and the data-processing scripts -- behind the articles, graphics and interactives at FiveThirtyEight.

    We hope you'll use it to check our work and to create stories and visualizations of your own. The data is available under the Creative Commons Attribution 4.0 International License and the code is available under the MIT License. If you do find it useful, please let us know.

    Source: https://github.com/fivethirtyeight/data

    This dataset was created by FiveThirtyEight and contains around 2,000 samples along with Date, Url, technical information, and other features such as Name, Wins, and more.

    How to use this dataset

    • Analyze Date in relation to Url
    • Study the influence of Name on Wins
    • More datasets

    Acknowledgements

    If you use this dataset in your research, please credit FiveThirtyEight


    --- Original source retains full ownership of the source dataset ---

  12. Pretraining data of SkySense++

    • zenodo.org
    bin
    Updated Mar 18, 2025
    Cite
    Kang Wu; Kang Wu (2025). Pretraining data of SkySense++ [Dataset]. http://doi.org/10.5281/zenodo.14994430
    Explore at:
    binAvailable download formats
    Dataset updated
    Mar 18, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Kang Wu; Kang Wu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Mar 9, 2024
    Description

    This repository contains the data description and processing for the paper titled "SkySense++: A Semantic-Enhanced Multi-Modal Remote Sensing Foundation Model for Earth Observation." The code is available here.

    📢 Latest Updates

    🔥🔥🔥 Last Updated on 2024.03.14 🔥🔥🔥

    Pretrain Data

    RS-Semantic Dataset

    We conduct semantic-enhanced pretraining on the RS-Semantic dataset, which consists of 13 datasets with pixel-level annotations. Below are the specifics of these datasets.

    Dataset             | Modalities       | GSD (m) | Size                  | Categories | Download Link
    Five Billion Pixels | Gaofen-2         | 4       | 6800x7200             | 24         | Download
    Potsdam             | Airborne         | 0.05    | 6000x6000             | 5          | Download
    Vaihingen           | Airborne         | 0.05    | 2494x2064             | 5          | Download
    Deepglobe           | WorldView        | 0.5     | 2448x2448             | 6          | Download
    iSAID               | Multiple Sensors | -       | 800x800 to 4000x13000 | 15         | Download
    LoveDA              | Spaceborne       | 0.3     | 1024x1024             | 7          | Download
    DynamicEarthNet     | WorldView        | 0.3     | 1024x1024             | 7          | Download
                        | Sentinel-2*      | 10      | 32x32                 |            |
                        | Sentinel-1*      | 10      | 32x33                 |            |
    Pastis-MM           | WorldView        | 0.3     | 1024x1024             | 18         | Download
                        | Sentinel-2*      | 10      | 32x32                 |            |
                        | Sentinel-1*      | 10      | 32x33                 |            |
    C2Seg-AB            | Sentinel-2*      | 10      | 128x128               | 13         | Download
                        | Sentinel-1*      | 10      | 128x128               |            |
    FLAIR               | Spot-5           | 0.2     | 512x512               | 12         | Download
                        | Sentinel-2*      | 10      | 40x40                 |            |
    DFC20               | Sentinel-2       | 10      | 256x256               | 9          | Download
                        | Sentinel-1       | 10      | 256x256               |            |
    S2-naip             | NAIP             | 1       | 512x512               | 32         | Download
                        | Sentinel-2*      | 10      | 64x64                 |            |
                        | Sentinel-1*      | 10      | 64x64                 |            |
    JL-16               | Jilin-1          | 0.72    | 512x512               | 16         | Download
                        | Sentinel-1*      | 10      | 40x40                 |            |

    * for time-series data.

    EO Benchmark

    We evaluate our SkySense++ on 12 typical Earth Observation (EO) tasks across 7 domains: agriculture, forestry, oceanography, atmosphere, biology, land surveying, and disaster management. The detailed information about the datasets used for evaluation is as follows.

    Domain              | Task type                   | Dataset               | Modalities             | GSD  | Image size             | Download Link
    Agriculture         | Crop classification         | Germany               | Sentinel-2*            | 10   | 24x24                  | Download
    Forestry            | Tree species classification | TreeSatAI-Time-Series | Airborne               | 0.2  | 304x304                | Download
                        |                             |                       | Sentinel-2*            | 10   | 6x6                    |
                        |                             |                       | Sentinel-1*            | 10   | 6x6                    |
                        | Deforestation segmentation  | Atlantic              | Sentinel-2             | 10   | 512x512                | Download
    Oceanography        | Oil spill segmentation      | SOS                   | Sentinel-1             | 10   | 256x256                | Download
    Atmosphere          | Air pollution regression    | 3pollution            | Sentinel-2             | 10   | 200x200                | Download
                        |                             |                       | Sentinel-5P            | 2600 | 120x120                |
    Biology             | Wildlife detection          | Kenya                 | Airborne               | -    | 3068x4603              | Download
    Land surveying      | LULC mapping                | C2Seg-BW              | Gaofen-6               | 10   | 256x256                | Download
                        |                             |                       | Gaofen-3               | 10   | 256x256                |
                        | Change detection            | dsifn-cd              | GoogleEarth            | 0.3  | 512x512                | Download
    Disaster management | Flood monitoring            | Flood-3i              | Airborne               | 0.05 | 256x256                | Download
                        |                             | C2SMSFloods           | Sentinel-2, Sentinel-1 | 10   | 512x512                | Download
                        | Wildfire monitoring         | CABUAR                | Sentinel-2             | 10   | 5490x5490              | Download
                        | Landslide mapping           | GVLM                  | GoogleEarth            | 0.3  | 1748x1748 ~ 10808x7424 | Download
                        | Building damage assessment  | xBD                   | WorldView              | 0.3  | 1024x1024              | Download

    * for time-series data.

  13. test_sample_wavs

    • kaggle.com
    zip
    Updated Sep 9, 2023
    Cite
    Sas Pav (2023). test_sample_wavs [Dataset]. https://www.kaggle.com/datasets/saspav/sample-wavs
    Explore at:
    Available download formats: zip (2,763,449,369 bytes)
    Dataset updated
    Sep 9, 2023
    Authors
    Sas Pav
    Description

    Dataset

    This dataset was created by Sas Pav

    Contents

  14. Bangla Newspaper Dataset

    • kaggle.com
    • huggingface.co
    Updated Oct 21, 2020
    Cite
    Zabir Al Nazi Nabil (2020). Bangla Newspaper Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/1576225
    Explore at:
    Croissant
    Dataset updated
    Oct 21, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Zabir Al Nazi Nabil
    Description

    Bangla Newspaper Dataset

    400k+ bangla news samples, 25+ categories

    Source

    Data collected from https://www.prothomalo.com/archive [Copyright owned by the actual source]

    Github

    Github repository (Classification with LSTM): https://github.com/zabir-nabil/bangla-news-rnn

    HuggingFace

    https://huggingface.co/datasets/zabir-nabil/bangla_newspaper_dataset

    Inspiration

    The dataset can be used for bangla text classification and generation experiments.

  15. JAFFE (Deprecated, use v.2 instead)

    • zenodo.org
    Updated Mar 20, 2025
    Cite
    Michael Lyons; Miyuki Kamachi; Jiro Gyoba (2025). JAFFE (Deprecated, use v.2 instead) [Dataset]. http://doi.org/10.5281/zenodo.3451524
    Explore at:
    Dataset updated
    Mar 20, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Michael Lyons; Miyuki Kamachi; Jiro Gyoba
    Description

    V.1 is deprecated, use V.2 instead.

    The images are the same: only the README file has been updated.

    https://doi.org/10.5281/zenodo.14974867

    The JAFFE images may be used only for non-commercial scientific research.

    The source and background of the dataset must be acknowledged by citing the following two articles. Users should read both carefully.

    Michael J. Lyons, Miyuki Kamachi, Jiro Gyoba.
    Coding Facial Expressions with Gabor Wavelets (IVC Special Issue)
    arXiv:2009.05938 (2020) https://arxiv.org/pdf/2009.05938.pdf

    Michael J. Lyons
    "Excavating AI" Re-excavated: Debunking a Fallacious Account of the JAFFE Dataset
    arXiv: 2107.13998 (2021) https://arxiv.org/abs/2107.13998

    The following is not allowed:

    • Redistribution of the JAFFE dataset (incl. via Github, Kaggle, Colaboratory, GitCafe, CSDN etc.)
    • Posting JAFFE images on the web and social media
    • Public exhibition of JAFFE images in museums/galleries etc.
    • Broadcast in the mass media (tv shows, films, etc.)

    A few sample images (not more than 10) may be displayed in scientific publications.

  16. COVID19, Pneumonia and Normal Chest X-ray PA Dataset

    • narcis.nl
    Updated Mar 22, 2021
    + more versions
    Cite
    Asraf, A (via Mendeley Data) (2021). COVID19, Pneumonia and Normal Chest X-ray PA Dataset [Dataset]. http://doi.org/10.17632/mxc6vb7svm.1
    Explore at:
    Dataset updated
    Mar 22, 2021
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    Asraf, A (via Mendeley Data)
    Description

    The dataset is organized into 3 folders (covid, pneumonia, normal) containing chest X-ray posteroanterior (PA) images. X-ray samples of COVID-19 were retrieved from different sources owing to the unavailability of a large single dataset. First, a total of 1401 samples of COVID-19 were collected using GitHub repositories [1], [2], the Radiopaedia [3], the Italian Society of Radiology (SIRM) [4], and Figshare data repository websites [5], [6]. Then, 912 augmented images were also collected from Mendeley instead of applying data augmentation techniques explicitly [7]. Finally, 2313 samples of normal and pneumonia cases were obtained from Kaggle [8], [9]. A total of 6939 samples were used in the experiment, with 2313 samples for each case.
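    As a quick sanity check, the sample counts above are internally consistent (a trivial sketch that just restates the numbers in the description):

```python
# COVID-19 samples: 1401 collected from repositories + 912 augmented (Mendeley).
covid = 1401 + 912
assert covid == 2313

# 2313 samples per case (COVID-19, pneumonia, normal) gives the stated total.
total = 3 * covid
assert total == 6939

print(covid, total)  # 2313 6939
```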

  17. SemMdf - Semantic Database for Moksha - Dataset - B2FIND

    • b2find.dkrz.de
    Updated Apr 25, 2023
    Cite
    (2023). SemMdf - Semantic Database for Moksha - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/e6562163-b681-54b2-a92b-6780d27abe72
    Explore at:
    Dataset updated
    Apr 25, 2023
    Description

    This SQLite database contains Moksha lemmas and their frequencies in a large corpus. The lemmas are linked to each other based on the syntactic relations they had in the corpus, and the frequency of each syntactic relation between two words is recorded. This makes it possible to see, for example, how frequently the word for dog appeared in a subject relation with the verb for bark. The database is translated from SemFi using Giellatekno XML dictionaries. For a detailed description of the structure, see https://www.kaggle.com/mikahama/semfi-finnish-semantics-with-syntactic-relations. An easy programmatic interface is provided in UralicNLP: https://github.com/mikahama/uralicNLP/wiki/Semantics-(SemFi,-SemUr). Cite as: Hämäläinen, Mika. (2018). Extracting a Semantic Database with Syntactic Relations for Finnish to Boost Resources for Endangered Uralic Languages. In The Proceedings of Logic and Engineering of Natural Language Semantics 15 (LENLS15).
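    A minimal sketch of the kind of query such a database supports, using an in-memory SQLite stand-in. The table names, columns, word forms, and frequency values here are all hypothetical; the real schema is described on the Kaggle page linked above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Hypothetical schema: lemmas plus weighted syntactic relations between them.
cur.execute("CREATE TABLE lemma (id INTEGER PRIMARY KEY, word TEXT, freq INTEGER)")
cur.execute("""CREATE TABLE relation (
    head INTEGER, dependent INTEGER, rel TEXT, freq INTEGER)""")

# Toy rows standing in for the dog/bark example (illustrative word forms).
cur.execute("INSERT INTO lemma VALUES (1, 'dog_lemma', 120)")
cur.execute("INSERT INTO lemma VALUES (2, 'bark_lemma', 40)")
cur.execute("INSERT INTO relation VALUES (2, 1, 'nsubj', 17)")

# How often does 'dog' appear as the subject of 'bark'?
cur.execute("""SELECT r.freq FROM relation r
    JOIN lemma h ON r.head = h.id
    JOIN lemma d ON r.dependent = d.id
    WHERE h.word = 'bark_lemma' AND d.word = 'dog_lemma' AND r.rel = 'nsubj'""")
freq = cur.fetchone()[0]
print(freq)  # 17
```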

  18. Numenta Anomaly Benchmark (NAB)

    • kaggle.com
    Updated Aug 19, 2016
    Cite
    BoltzmannBrain (2016). Numenta Anomaly Benchmark (NAB) [Dataset]. https://www.kaggle.com/datasets/boltzmannbrain/nab/discussion?sortBy=hot&group=upvoted
    Explore at:
    Croissant
    Dataset updated
    Aug 19, 2016
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    BoltzmannBrain
    Description

    The Numenta Anomaly Benchmark (NAB) is a novel benchmark for evaluating algorithms for anomaly detection in streaming, online applications. It comprises over 50 labeled real-world and artificial timeseries data files plus a novel scoring mechanism designed for real-time applications. All of the data and code is fully open-source, with extensive documentation and a scoreboard of anomaly detection algorithms: github.com/numenta/NAB. The full dataset is included here, but please go to the repo for details on how to evaluate anomaly detection algorithms on NAB.

    NAB Data Corpus

    The NAB corpus of 58 timeseries data files is designed to provide data for research in streaming anomaly detection. It comprises both real-world and artificial timeseries data containing labeled anomalous periods of behavior. Data are ordered, timestamped, single-valued metrics. All data files contain anomalies, unless otherwise noted.

    The majority of the data is real-world from a variety of sources such as AWS server metrics, Twitter volume, advertisement clicking metrics, traffic data, and more. All data is included in the repository, with more details in the data readme. We are in the process of adding more data, and actively searching for more data. Please contact us at nab@numenta.org if you have similar data (ideally with known anomalies) that you would like to see incorporated into NAB.

    The NAB version will be updated whenever new data (and corresponding labels) is added to the corpus; NAB is currently in v1.0.

    Real data

    • realAWSCloudwatch/

      AWS server metrics as collected by the AmazonCloudwatch service. Example metrics include CPU Utilization, Network Bytes In, and Disk Read Bytes.

    • realAdExchange/

      Online advertisement clicking rates, where the metrics are cost-per-click (CPC) and cost per thousand impressions (CPM). One of the files is normal, without anomalies.

    • realKnownCause/

      This is data for which we know the anomaly causes; no hand labeling.

      • ambient_temperature_system_failure.csv: The ambient temperature in an office setting.
      • cpu_utilization_asg_misconfiguration.csv: CPU usage data from Amazon Web Services (AWS) monitoring, i.e. the average CPU usage across a given cluster. When usage is high, AWS spins up a new machine; when usage is low, it uses fewer machines.
      • ec2_request_latency_system_failure.csv: CPU usage data from a server in Amazon's East Coast datacenter. The dataset ends with complete system failure resulting from a documented failure of AWS API servers. There's an interesting story behind this data in the Numenta blog (http://numenta.com/blog/anomaly-of-the-week.html).
      • machine_temperature_system_failure.csv: Temperature sensor data of an internal component of a large, industrial machine. The first anomaly is a planned shutdown of the machine. The second anomaly is difficult to detect and directly led to the third anomaly, a catastrophic failure of the machine.
      • nyc_taxi.csv: Number of NYC taxi passengers, where the five anomalies occur during the NYC marathon, Thanksgiving, Christmas, New Year's Day, and a snow storm. The raw data is from the NYC Taxi and Limousine Commission. The data file included here aggregates the total number of taxi passengers into 30-minute buckets.
      • rogue_agent_key_hold.csv: Timing the key holds for several users of a computer, where the anomalies represent a change in the user.
      • rogue_agent_key_updown.csv: Timing the key strokes for several users of a computer, where the anomalies represent a change in the user.
    • realTraffic/

      Real time traffic data from the Twin Cities Metro area in Minnesota, collected by the Minnesota Department of Transportation. Included metrics include occupancy, speed, and travel time from specific sensors.

    • realTweets/

      A collection of Twitter mentions of large publicly-traded companies such as Google and IBM. The metric value represents the number of mentions for a given ticker symbol every 5 minutes.
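    Every NAB file shares the same simple layout: a CSV with a timestamp column and a single value column. As a minimal sketch of how such a file can be loaded and re-bucketed (the inline sample rows are illustrative, standing in for a real file such as realKnownCause/nyc_taxi.csv; the hourly resampling mirrors the kind of aggregation used to produce the 30-minute buckets):

    ```python
    import io
    import pandas as pd

    # NAB files are CSVs with two columns: a timestamp and a single metric value.
    # Inline sample standing in for a real file like realKnownCause/nyc_taxi.csv.
    sample = io.StringIO(
        "timestamp,value\n"
        "2014-07-01 00:00:00,10844\n"
        "2014-07-01 00:30:00,8127\n"
        "2014-07-01 01:00:00,6210\n"
        "2014-07-01 01:30:00,4656\n"
    )

    # Parse timestamps and use them as the index, the natural form for a
    # single-valued, ordered, timestamped metric.
    df = pd.read_csv(sample, parse_dates=["timestamp"], index_col="timestamp")

    # Re-bucket the series, e.g. hourly totals from the 30-minute buckets.
    hourly = df["value"].resample("1h").sum()
    print(hourly)
    ```

    The same two-column pattern applies across the corpus, so one loader works for every file.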

    Artificial data

    • artificialNoAnomaly/

      Artificially generated data without any anomalies.

    • artificialWithAnomaly/

      Artificially generated data with varying types of anomalies.

    Acknowledgments

    We encourage you to publish your results on running NAB, and share them with us at nab@numenta.org. Please cite the following publication when referring to NAB:

    Lavin, Alexander and Ahmad, Subutai. "Evaluating Real-time Anomaly Detection Algorithms – the Numenta Anomaly Benchmark", Fourteenth International Conference on Machine Learning and Applications, December 2015.

  19. Experimental Data for Question Classification

    • kaggle.com
    Updated May 8, 2020
    Cite
    JunYu (2020). Experimental Data for Question Classification [Dataset]. https://www.kaggle.com/owen1226/textsdata/notebooks
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 8, 2020
    Dataset provided by
    Kaggle
    Authors
    JunYu
    Description

    Context

    This data collection contains all the data used in our learning question classification experiments: question class definitions, the training and testing question sets, examples of preprocessing the questions, feature definition scripts, and examples of semantically related word features.

    Content

    • ABBR - 'abbreviation': abbreviated expressions, etc.
    • DESC - 'description and abstract concepts': manner of an action, description of something, etc.
    • ENTY - 'entities': animals, colors, events, food, etc.
    • HUM - 'human beings': a group or organization of persons, an individual, etc.
    • LOC - 'locations': cities, countries, etc.
    • NUM - 'numeric values': postcodes, dates, speed, temperature, etc.
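    The training and testing question sets in this style of collection commonly follow the TREC layout, in which each line starts with a coarse:fine label followed by the question text. A minimal parsing sketch, assuming that layout (the sample lines are illustrative, not drawn from the actual files):

    ```python
    # Each line starts with a COARSE:fine label, separated from the question
    # text by the first space, e.g. "NUM:date When was ... ?".
    sample_lines = [
        "DESC:manner How did serfdom develop in and then leave Russia ?",
        "NUM:date When was Ozzy Osbourne born ?",
        "LOC:city What city hosted the 1988 Winter Olympics ?",
    ]

    def parse_line(line):
        """Split a TREC-style line into (coarse label, fine label, question)."""
        label, question = line.split(" ", 1)
        coarse, fine = label.split(":")
        return coarse, fine, question

    for coarse, fine, question in map(parse_line, sample_lines):
        print(f"{coarse}/{fine}: {question}")
    ```

    Splitting on the first space only (`split(" ", 1)`) keeps multi-word questions intact.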

    Acknowledgements

    https://cogcomp.seas.upenn.edu/Data/QA/QC/ https://github.com/Tony607/Keras-Text-Transfer-Learning/blob/master/README.md

  20. The Japanese Female Facial Expression (JAFFE) Dataset

    • zenodo.org
    • data.niaid.nih.gov
    txt, zip
    Updated Mar 5, 2025
    Cite
    Michael Lyons; Miyuki Kamachi; Jiro Gyoba (2025). The Japanese Female Facial Expression (JAFFE) Dataset [Dataset]. http://doi.org/10.5281/zenodo.14974867
    Explore at:
    Available download formats: zip, txt
    Dataset updated
    Mar 5, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Michael Lyons; Miyuki Kamachi; Jiro Gyoba
    Time period covered
    1997
    Description

    The JAFFE images may be used only for non-commercial scientific research.

    The source and background of the dataset must be acknowledged by citing the following two articles. Users should read both carefully.

    Michael J. Lyons, Miyuki Kamachi, Jiro Gyoba.
    Coding Facial Expressions with Gabor Wavelets (IVC Special Issue)
    arXiv:2009.05938 (2020) https://arxiv.org/pdf/2009.05938.pdf

    Michael J. Lyons
    "Excavating AI" Re-excavated: Debunking a Fallacious Account of the JAFFE Dataset
    arXiv: 2107.13998 (2021) https://arxiv.org/abs/2107.13998

    The following is not allowed:

    • Redistribution of the JAFFE dataset (incl. via Github, Kaggle, Colaboratory, GitCafe, CSDN etc.)
    • Posting JAFFE images on the web and social media
    • Public exhibition of JAFFE images in museums/galleries etc.
    • Broadcast in the mass media (tv shows, films, etc.)

    A few sample images (not more than 10) may be displayed in scientific publications.
