GitHub Issues & Kaggle Notebooks
Description
GitHub Issues & Kaggle Notebooks is a collection of two code datasets intended for language model training. They are sourced from GitHub issues and from notebooks on the Kaggle platform. These datasets are a modified part of the StarCoder2 model training corpus, specifically the bigcode/StarCoder2-Extras dataset. We reformat the samples to remove StarCoder2's special tokens and use natural text to delimit comments in issues and display… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/issues-kaggle-notebooks.
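The corpus can be streamed with the 🤗 datasets library. Below is a minimal sketch; the subset name passed as the second argument is an assumption, so check the dataset page for the configuration names that actually exist.

```python
# Minimal sketch of streaming the corpus with the `datasets` library.
# The subset name "issues" is an assumption -- see the dataset page for
# the configurations that actually exist.
from datasets import load_dataset

ds = load_dataset(
    "HuggingFaceTB/issues-kaggle-notebooks",
    "issues",          # hypothetical subset name
    split="train",
    streaming=True,    # avoid downloading the full corpus up front
)

for sample in ds.take(3):
    print(sample.keys())
```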
CC0 1.0 (Public Domain) https://creativecommons.org/publicdomain/zero/1.0/
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Abderrazak Chahid
Released under MIT
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains 254,661 images from HaGRID (HAnd Gesture Recognition Image Dataset) downscaled to 384p. The original dataset is 716GB and contains 552,992 1080p images. I created this sample for a tutorial so readers can use the dataset in the free tiers of Google Colab and Kaggle Notebooks.
Original Authors:
Alexander Kapitanov Andrey Makhlyarchuk Karina Kvanchiani
Original Dataset Links
GitHub Kaggle Datasets Page
Object Classes
['call'… See the full description on the dataset page: https://huggingface.co/datasets/cj-mills/hagrid-sample-250k-384p.
CC0 1.0 (Public Domain) https://creativecommons.org/publicdomain/zero/1.0/
Sample of 17,000 github.com developers and the programming languages they know, or want to know.
I acquired the data by listing the 1,000 most-starred repos and taking the first 30 users who starred each repo, then removing duplicates. For each of the 17,000 users, I calculated the frequency of each of 1,400 technologies in the metadata of the user's own and forked repositories.
Thanks to Jihye Sofia Seo, whose dataset Top 980 Starred Open Source Projects on GitHub is the source for this dataset.
I am using this dataset for my GitHub recommendation engine: I use it to find similar developers and to use their starred repositories as recommendations. I also use it to categorize developer types and to estimate the weight of a developer in a team, especially when a developer leaves the company, so that the talent lost to the team and the company can be quantified.
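The "find similar developers" use case can be sketched as cosine similarity over the per-user technology-frequency vectors. The sketch below assumes a hypothetical file name and column layout, so adapt both to the actual CSV.

```python
# Sketch of finding similar developers via cosine similarity over
# technology-frequency vectors. File name and column names are
# hypothetical -- match them to the real dataset layout.
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

df = pd.read_csv("github_users.csv", index_col="user")   # hypothetical file/column
features = df.select_dtypes("number")                    # one column per technology

sim = pd.DataFrame(
    cosine_similarity(features.values),
    index=features.index,
    columns=features.index,
)

# Top 5 developers most similar to a given user (excluding the user itself)
print(sim["some_user"].drop("some_user").nlargest(5))
```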
CC0 1.0 (Public Domain) https://creativecommons.org/publicdomain/zero/1.0/
The MNIST dataset is one of the best known image classification problems out there, and a veritable classic of the field of machine learning. This dataset is a more challenging version of the same root problem: classifying letters from images. This is a multiclass classification dataset of glyphs of the English letters A - J.
This dataset is used extensively in the Udacity Deep Learning course, and is available in the Tensorflow Github repo (under Examples). I'm not aware of any license governing the use of this data, so I'm posting it here so that the community can use it with Kaggle kernels.
notMNIST_large.zip is a large but dirty version of the dataset with 529,119 images, and notMNIST_small.zip is a small hand-cleaned version with 18,726 images. The dataset was assembled by Yaroslav Bulatov and can be obtained on his blog. According to the blog entry, there is about a 6.5% label error rate on the large uncleaned dataset and a 0.5% label error rate on the small hand-cleaned dataset.
The two archives each contain 28x28 grayscale images of the letters A - J, organized into directories by letter: notMNIST_large.zip contains 529,119 images and notMNIST_small.zip contains 18,726 images.
Thanks to Yaroslav Bulatov for putting together the dataset.
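Because both archives unpack into one directory per letter, the data can be loaded with a generic image loader. The snippet below is a minimal sketch using Keras; the unzipped directory path is an assumption.

```python
# Sketch of loading the hand-cleaned split with Keras. The directory path
# is an assumption (whatever notMNIST_small.zip unpacks to).
import tensorflow as tf

train_ds = tf.keras.utils.image_dataset_from_directory(
    "notMNIST_small",        # one subdirectory per letter A-J
    labels="inferred",
    label_mode="int",
    color_mode="grayscale",
    image_size=(28, 28),
    batch_size=64,
)

print(train_ds.class_names)  # expected: ['A', 'B', ..., 'J']
```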
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the second version of the Google Landmarks dataset (GLDv2), which contains images annotated with labels representing human-made and natural landmarks. The dataset can be used for landmark recognition and retrieval experiments. This version of the dataset contains approximately 5 million images, split into 3 sets of images: train, index and test. The dataset was presented in our CVPR'20 paper. In this repository, we present download links for all dataset files and relevant code for metric computation. This dataset was associated with two Kaggle challenges, on landmark recognition and landmark retrieval. Results were discussed as part of a CVPR'19 workshop. In this repository, we also provide scores for the top 10 teams in the challenges, based on the latest ground-truth version. Please visit the challenge and workshop webpages for more details on the data, tasks and technical solutions from top teams.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘My Uber Drives’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/zusmani/uberdrives on 13 February 2022.
--- Dataset description provided by original source is as follows ---
My Uber Drives (2016)
Here are the details of my Uber drives of 2016. I am sharing this dataset so the data science community can learn from the behavior of an ordinary Uber customer.
Geography: USA, Sri Lanka and Pakistan
Time period: January - December 2016
Unit of analysis: Drives
Total Drives: 1,155
Total Miles: 12,204
Dataset: The dataset contains Start Date, End Date, Start Location, End Location, Miles Driven and Purpose of drive (Business, Personal, Meals, Errands, Meetings, Customer Support etc.)
Users are allowed to use, download, copy, distribute and cite the dataset for their pet projects and training. Please cite it as follows: “Zeeshan-ul-hassan Usmani, My Uber Drives Dataset, Kaggle Dataset Repository, March 23, 2017.”
Uber TLC FOIL Response - The dataset contains over 4.5 million Uber pickups in New York City from April to September 2014, and 14.3 million more Uber pickups from January to June 2015 https://github.com/fivethirtyeight/uber-tlc-foil-response
1.1 Billion Taxi Pickups from New York - http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/
What you can do with this data - a good example by Yao-Jen Kuo - https://yaojenkuo.github.io/uber.html
Some ideas worth exploring:
• What is the average length of the trip?
• Average number of rides per week or per month?
• Total tax savings based on traveled business miles?
• Percentage of business miles vs. personal vs. meals?
• How much money can be saved by a typical customer using Uber, Careem, or Lyft versus regular cab service?
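As a starting point for the ideas above, here is a minimal pandas sketch. The file name and column names follow the description above ("Start Date", "Miles Driven", "Purpose") and may differ from the actual CSV header, so adjust them before running.

```python
# Hedged sketch for the exploration ideas above. File name and column
# names are assumptions based on the dataset description.
import pandas as pd

df = pd.read_csv("My Uber Drives - 2016.csv", parse_dates=["Start Date", "End Date"])

print("Average trip length (miles):", df["Miles Driven"].mean())

rides_per_month = df.set_index("Start Date").resample("M").size()
print("Average rides per month:", rides_per_month.mean())

print("Share of miles by purpose:")
print(df.groupby("Purpose")["Miles Driven"].sum() / df["Miles Driven"].sum())
```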
--- Original source retains full ownership of the source dataset ---
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset consists of timestamps for coughs contained in files extracted from the ESC-50 and FSDKaggle2018 datasets.
Citation
This dataset was generated and used in our paper:
Mahmoud Abdelkhalek, Jinyi Qiu, Michelle Hernandez, Alper Bozkurt, Edgar Lobaton, “Investigating the Relationship between Cough Detection and Sampling Frequency for Wearable Devices,” in the 43rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 2021.
Please cite this paper if you use the timestamps.csv file in your work.
Generation
The cough timestamps given in the timestamps.csv file were generated using the cough templates given in figures 3 and 4 in the paper:
A. H. Morice, G. A. Fontana, M. G. Belvisi, S. S. Birring, K. F. Chung, P. V. Dicpinigaitis, J. A. Kastelik, L. P. McGarvey, J. A. Smith, M. Tatar, J. Widdicombe, "ERS guidelines on the assessment of cough", European Respiratory Journal 2007 29: 1256-1276; DOI: 10.1183/09031936.00101006
More precisely, 40 files labelled as "coughing" in the ESC-50 dataset and 273 files labelled as "Cough" in the FSDKaggle2018 dataset were manually searched using Audacity for segments of audio that closely matched the aforementioned templates, both visually and auditorily. Some files did not contain any coughs at all, while other files contained several coughs. Therefore, only the files that contained at least one cough are included in the coughs directory. In total, the timestamps of 768 cough segments with lengths ranging from 0.2 seconds to 0.9 seconds were extracted.
Description
The audio files are presented in wav format in the coughs directory. Files named in the general format of "*-*-*-24.wav" were extracted from the ESC-50 dataset, while all other files were extracted from the FSDKaggle2018 dataset.
The timestamps.csv file contains the timestamps for the coughs and it consists of four columns:
file_name,cough_number,start_time,end_time
Files in the file_name column can be found in the coughs directory. cough_number refers to the index of the cough in the corresponding file. For example, if the file X.wav contains 5 coughs, then X.wav will be repeated 5 times under the file_name column, and for each row, the cough_number will range from 1 to 5. start_time refers to the starting time of a cough segment measured in seconds, while end_time refers to the end time of a cough segment measured in seconds.
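For reference, a short sketch of using timestamps.csv to cut the cough segments out of the corresponding wav files is given below; the paths are assumptions, and the column names follow the description above.

```python
# Sketch: slice cough segments out of the wav files using timestamps.csv.
# Paths are assumptions; columns follow the description above.
import pandas as pd
import soundfile as sf

timestamps = pd.read_csv("timestamps.csv")

for _, row in timestamps.head(5).iterrows():
    audio, sr = sf.read(f"coughs/{row['file_name']}")
    start, end = int(row["start_time"] * sr), int(row["end_time"] * sr)
    segment = audio[start:end]
    name = row["file_name"].removesuffix(".wav")
    sf.write(f"{name}_cough{row['cough_number']}.wav", segment, sr)
```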
Licensing
The ESC-50 dataset as a whole is licensed under the Creative Commons Attribution-NonCommercial license. Individual files in the ESC-50 dataset are licensed under different Creative Commons licenses. For a list of these licenses, see LICENSE. The ESC-50 files in the cough directory are given for convenience only, and have not been modified from their original versions. To download the original files, see the ESC-50 dataset.
The FSDKaggle2018 dataset as a whole is licensed under the Creative Commons Attribution 4.0 International license. Individual files in the FSDKaggle2018 dataset are licensed under different Creative Commons licenses. For a list of these licenses, see the License section in FSDKaggle2018. The FSDKaggle2018 files in the cough directory are given for convenience only, and have not been modified from their original versions. To download the original files, see the FSDKaggle2018 dataset.
The timestamps.csv file is licensed under the Creative Commons Attribution-NonCommercial 4.0 International license.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset is generated by the fusion of three publicly available datasets: COVID-19 cxr image (https://github.com/ieee8023/covid-chestxray-dataset), the Radiological Society of North America (RSNA) (https://www.kaggle.com/c/rsna-pneumonia-detection-challenge), and the U.S. National Library of Medicine (USNLM) collected Montgomery County - NLM(MC) (https://lhncbc.nlm.nih.gov/publication/pub9931). These datasets were annotated by expert radiologists. The fused dataset consists of samples of diseases labeled as COVID-19, Tuberculosis, Other pneumonia (SARS, MERS, etc.), and Normal. The dataset can be used to train and evaluate deep learning and machine learning models as a binary or multi-class classification problem.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Mayweather Marketing Tactics’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/undefeated-boxerse on 28 January 2022.
--- Dataset description provided by original source is as follows ---
See Readme for more details.
This repository contains a selection of the data -- and the data-processing scripts -- behind the articles, graphics and interactives at FiveThirtyEight. We hope you'll use it to check our work and to create stories and visualizations of your own. The data is available under the Creative Commons Attribution 4.0 International License and the code is available under the MIT License. If you do find it useful, please let us know.
Source: https://github.com/fivethirtyeight/data
This dataset was created by FiveThirtyEight and contains around 2,000 samples along with Date, Url, technical information, and other features such as Name, Wins, and more.
- Analyze Date in relation to Url
- Study the influence of Name on Wins
- More datasets
If you use this dataset in your research, please credit FiveThirtyEight
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the data description and processing for the paper titled "SkySense++: A Semantic-Enhanced Multi-Modal Remote Sensing Foundation Model for Earth Observation." The code is available here.
🔥🔥🔥 Last Updated on 2024.03.14 🔥🔥🔥
We conduct semantic-enhanced pretraining on the RS-Semantic dataset, which consists of 13 datasets with pixel-level annotations. Below are the specifics of these datasets.
| Dataset | Modalities | GSD (m) | Size | Categories | Download Link |
|---|---|---|---|---|---|
| Five Billion Pixels | Gaofen-2 | 4 | 6800x7200 | 24 | Download |
| Potsdam | Airborne | 0.05 | 6000x6000 | 5 | Download |
| Vaihingen | Airborne | 0.05 | 2494x2064 | 5 | Download |
| Deepglobe | WorldView | 0.5 | 2448x2448 | 6 | Download |
| iSAID | Multiple Sensors | - | 800x800 to 4000x13000 | 15 | Download |
| LoveDA | Spaceborne | 0.3 | 1024x1024 | 7 | Download |
| DynamicEarthNet | WorldView | 0.3 | 1024x1024 | 7 | Download |
| | Sentinel-2* | 10 | 32x32 | | |
| | Sentinel-1* | 10 | 32x33 | | |
| Pastis-MM | WorldView | 0.3 | 1024x1024 | 18 | Download |
| | Sentinel-2* | 10 | 32x32 | | |
| | Sentinel-1* | 10 | 32x33 | | |
| C2Seg-AB | Sentinel-2* | 10 | 128x128 | 13 | Download |
| | Sentinel-1* | 10 | 128x128 | | |
| FLAIR | Spot-5 | 0.2 | 512x512 | 12 | Download |
| | Sentinel-2* | 10 | 40x40 | | |
| DFC20 | Sentinel-2 | 10 | 256x256 | 9 | Download |
| | Sentinel-1 | 10 | 256x256 | | |
| S2-naip | NAIP | 1 | 512x512 | 32 | Download |
| | Sentinel-2* | 10 | 64x64 | | |
| | Sentinel-1* | 10 | 64x64 | | |
| JL-16 | Jilin-1 | 0.72 | 512x512 | 16 | Download |
| | Sentinel-1* | 10 | 40x40 | | |
* indicates time-series data.
We evaluate our SkySense++ on 12 typical Earth Observation (EO) tasks across 7 domains: agriculture, forestry, oceanography, atmosphere, biology, land surveying, and disaster management. The detailed information about the datasets used for evaluation is as follows.
| Domain | Task type | Dataset | Modalities | GSD | Image size | Download Link | Notes |
|---|---|---|---|---|---|---|---|
| Agriculture | Crop classification | Germany | Sentinel-2* | 10 | 24x24 | Download | |
| Forestry | Tree species classification | TreeSatAI-Time-Series | Airborne | 0.2 | 304x304 | Download | |
| | | | Sentinel-2* | 10 | 6x6 | | |
| | | | Sentinel-1* | 10 | 6x6 | | |
| | Deforestation segmentation | Atlantic | Sentinel-2 | 10 | 512x512 | Download | |
| Oceanography | Oil spill segmentation | SOS | Sentinel-1 | 10 | 256x256 | Download | |
| Atmosphere | Air pollution regression | 3pollution | Sentinel-2 | 10 | 200x200 | Download | |
| | | | Sentinel-5P | 2600 | 120x120 | | |
| Biology | Wildlife detection | Kenya | Airborne | - | 3068x4603 | Download | |
| Land surveying | LULC mapping | C2Seg-BW | Gaofen-6 | 10 | 256x256 | Download | |
| | | | Gaofen-3 | 10 | 256x256 | | |
| | Change detection | dsifn-cd | GoogleEarth | 0.3 | 512x512 | Download | |
| Disaster management | Flood monitoring | Flood-3i | Airborne | 0.05 | 256x256 | Download | |
| | | C2SMSFloods | Sentinel-2, Sentinel-1 | 10 | 512x512 | Download | |
| | Wildfire monitoring | CABUAR | Sentinel-2 | 10 | 5490x5490 | Download | |
| | Landslide mapping | GVLM | GoogleEarth | 0.3 | 1748x1748 ~ 10808x7424 | Download | |
| | Building damage assessment | xBD | WorldView | 0.3 | 1024x1024 | Download | |
* indicates time-series data.
This dataset was created by Sas Pav
400k+ Bangla news samples, 25+ categories
Data collected from https://www.prothomalo.com/archive [Copyright owned by the actual source]
Github repository (Classification with LSTM): https://github.com/zabir-nabil/bangla-news-rnn
https://huggingface.co/datasets/zabir-nabil/bangla_newspaper_dataset
The dataset can be used for Bangla text classification and generation experiments.
The JAFFE images may be used only for non-commercial scientific research.
The source and background of the dataset must be acknowledged by citing the following two articles. Users should read both carefully.
Michael J. Lyons, Miyuki Kamachi, Jiro Gyoba.
Coding Facial Expressions with Gabor Wavelets (IVC Special Issue)
arXiv:2009.05938 (2020) https://arxiv.org/pdf/2009.05938.pdf
Michael J. Lyons
"Excavating AI" Re-excavated: Debunking a Fallacious Account of the JAFFE Dataset
arXiv: 2107.13998 (2021) https://arxiv.org/abs/2107.13998
The following is not allowed:
A few sample images (not more than 10) may be displayed in scientific publications.
The dataset is organized into 3 folders (covid, pneumonia, normal) which contain chest X-ray posteroanterior (PA) images. X-ray samples of COVID-19 were retrieved from different sources because no single large dataset was available. Firstly, a total of 1,401 samples of COVID-19 were collected using GitHub repositories [1], [2], Radiopaedia [3], the Italian Society of Radiology (SIRM) [4], and the Figshare data repository websites [5], [6]. Then, 912 augmented images were also collected from Mendeley instead of using data augmentation techniques explicitly [7]. Finally, 2,313 samples of normal and pneumonia cases were obtained from Kaggle [8], [9]. A total of 6,939 samples were used in the experiment, with 2,313 samples for each case.
This SQLite database contains Moksha lemmas and their frequencies in a large corpus. The lemmas are linked to each other based on the syntactic relations they have had in the corpus, and the frequency of each syntactic relation between two words is also recorded. This means that it is possible to see, for example, how frequently the word for dog has appeared in a subject relation with the verb for bark. This database was translated from SemFi by using Giellatekno XML dictionaries. For a detailed description of the structure, see https://www.kaggle.com/mikahama/semfi-finnish-semantics-with-syntactic-relations
An easy programmatic interface is provided in UralicNLP: https://github.com/mikahama/uralicNLP/wiki/Semantics-(SemFi,-SemUr)
Cite as: Hämäläinen, Mika. (2018). Extracting a Semantic Database with Syntactic Relations for Finnish to Boost Resources for Endangered Uralic Languages. In The Proceedings of Logic and Engineering of Natural Language Semantics 15 (LENLS15).
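The database can also be queried directly with Python's built-in sqlite3 module. The sketch below first lists the actual tables, since the table and column names used in the commented query are hypothetical; the UralicNLP wiki linked above documents the ready-made interface.

```python
# Sketch of exploring the SQLite database directly. The table/column names
# in the commented query are hypothetical -- inspect the schema first, or
# use the UralicNLP interface documented in the wiki linked above.
import sqlite3

conn = sqlite3.connect("semfi_moksha.db")  # hypothetical filename

# List the tables that actually exist before querying
print(conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall())

# Hypothetical query: how often does lemma X occur as the subject of lemma Y?
# rows = conn.execute(
#     "SELECT frequency FROM relations "
#     "WHERE word1 = ? AND word2 = ? AND relation = 'subj'",
#     ("<lemma for dog>", "<lemma for bark>"),
# ).fetchall()
```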
The Numenta Anomaly Benchmark (NAB) is a novel benchmark for evaluating algorithms for anomaly detection in streaming, online applications. It is comprised of over 50 labeled real-world and artificial timeseries data files plus a novel scoring mechanism designed for real-time applications. All of the data and code is fully open-source, with extensive documentation, and a scoreboard of anomaly detection algorithms: github.com/numenta/NAB. The full dataset is included here, but please go to the repo for details on how to evaluate anomaly detection algorithms on NAB.
The NAB corpus of 58 timeseries data files is designed to provide data for research in streaming anomaly detection. It is comprised of both real-world and artificial timeseries data containing labeled anomalous periods of behavior. Data are ordered, timestamped, single-valued metrics. All data files contain anomalies, unless otherwise noted.
The majority of the data is real-world from a variety of sources such as AWS server metrics, Twitter volume, advertisement clicking metrics, traffic data, and more. All data is included in the repository, with more details in the data readme. We are in the process of adding more data, and actively searching for more data. Please contact us at nab@numenta.org if you have similar data (ideally with known anomalies) that you would like to see incorporated into NAB.
The NAB version will be updated whenever new data (and corresponding labels) is added to the corpus; NAB is currently in v1.0.
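Since every file is a single timestamped metric, a simple rolling z-score makes a reasonable first baseline before moving to the full NAB scoring. The sketch below assumes the common NAB column names ("timestamp", "value") and a local file path; check the data readme in the repo before running.

```python
# Hedged baseline on one NAB file: flag points whose rolling z-score is
# large. Column names and path are assumptions -- see the data readme.
import pandas as pd

df = pd.read_csv("realKnownCause/nyc_taxi.csv", parse_dates=["timestamp"])

window = 48  # 30-minute buckets -> roughly one day of history
rolling = df["value"].rolling(window)
zscore = (df["value"] - rolling.mean()) / rolling.std()

anomalies = df[zscore.abs() > 4]
print(anomalies.head())
```

Note that this is only a naive baseline; NAB's own scoring rewards early detection within labeled anomaly windows, so use the repository's evaluation code for comparable results.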
realAWSCloudwatch/
AWS server metrics as collected by the AmazonCloudwatch service. Example metrics include CPU Utilization, Network Bytes In, and Disk Read Bytes.
realAdExchange/
Online advertisement clicking rates, where the metrics are cost-per-click (CPC) and cost per thousand impressions (CPM). One of the files is normal, without anomalies.
realKnownCause/
This is data for which we know the anomaly causes; no hand labeling.
• ambient_temperature_system_failure.csv: The ambient temperature in an office setting.
• cpu_utilization_asg_misconfiguration.csv: From Amazon Web Services (AWS) monitoring of CPU usage – i.e. average CPU usage across a given cluster. When usage is high, AWS spins up a new machine, and uses fewer machines when usage is low.
• ec2_request_latency_system_failure.csv: CPU usage data from a server in Amazon's East Coast datacenter. The dataset ends with complete system failure resulting from a documented failure of AWS API servers. There's an interesting story behind this data in the Numenta blog: http://numenta.com/blog/anomaly-of-the-week.html
• machine_temperature_system_failure.csv: Temperature sensor data of an internal component of a large, industrial machine. The first anomaly is a planned shutdown of the machine. The second anomaly is difficult to detect and directly led to the third anomaly, a catastrophic failure of the machine.
• nyc_taxi.csv: Number of NYC taxi passengers, where the five anomalies occur during the NYC marathon, Thanksgiving, Christmas, New Year's Day, and a snow storm. The raw data is from the NYC Taxi and Limousine Commission. The data file included here aggregates the total number of taxi passengers into 30-minute buckets.
• rogue_agent_key_hold.csv: Timing of key holds for several users of a computer, where the anomalies represent a change in the user.
• rogue_agent_key_updown.csv: Timing of key strokes for several users of a computer, where the anomalies represent a change in the user.
realTraffic/
Real time traffic data from the Twin Cities Metro area in Minnesota, collected by the Minnesota Department of Transportation. Included metrics include occupancy, speed, and travel time from specific sensors.
realTweets/
A collection of Twitter mentions of large publicly-traded companies such as Google and IBM. The metric value represents the number of mentions for a given ticker symbol every 5 minutes.
artificialNoAnomaly/
Artificially-generated data without any anomalies.
artificialWithAnomaly/
Artificially-generated data with varying types of anomalies.
We encourage you to publish your results on running NAB, and share them with us at nab@numenta.org. Please cite the following publication when referring to NAB:
Lavin, Alexander and Ahmad, Subutai. "Evaluating Real-time Anomaly Detection Algorithms – the Numenta Anomaly Benchmark", Fourteenth International Conference on Machine Learning and Applications, December 2015. [PDF]
This data collection contains all the data used in our learning question classification experiments, which has question class definitions, the training and testing question sets, examples of preprocessing the questions, feature definition scripts and examples of semantically related word features.
• ABBR - 'abbreviation': expression abbreviated, etc.
• DESC - 'description and abstract concepts': manner of an action, description of something, etc.
• ENTY - 'entities': animals, colors, events, food, etc.
• HUM - 'human beings': a group or organization of persons, an individual, etc.
• LOC - 'locations': cities, countries, etc.
• NUM - 'numeric values': postcodes, dates, speed, temperature, etc.
https://cogcomp.seas.upenn.edu/Data/QA/QC/ https://github.com/Tony607/Keras-Text-Transfer-Learning/blob/master/README.md
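A minimal baseline for the coarse classes (ABBR, DESC, ENTY, HUM, LOC, NUM) can be sketched with TF-IDF features and logistic regression. The parsing below assumes each line of the label file starts with a "COARSE:fine" label followed by the question text, and the file name is an assumption; adjust both to the files you actually download.

```python
# Hedged baseline: TF-IDF + logistic regression on the coarse question
# classes. File name and the "COARSE:fine question" line format are
# assumptions -- adapt the parsing to the downloaded files.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

labels, questions = [], []
with open("train_questions.label", encoding="latin-1") as f:  # hypothetical file
    for line in f:
        label, question = line.split(" ", 1)
        labels.append(label.split(":")[0])   # keep only the coarse class
        questions.append(question.strip())

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
clf.fit(questions, labels)

print(clf.predict(["How far is it from Denver to Aspen ?"]))
```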