GitHub Issues & Kaggle Notebooks
Description
GitHub Issues & Kaggle Notebooks is a collection of two code datasets intended for language model training. They are sourced from GitHub issues and from notebooks on the Kaggle platform. These datasets are a modified part of the StarCoder2 model training corpus, specifically the bigcode/StarCoder2-Extras dataset. We reformat the samples to remove StarCoder2's special tokens and use natural text to delimit comments in issues and display… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/issues-kaggle-notebooks.
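The corpus can be streamed with the 🤗 datasets library. Below is a minimal sketch; the subset name passed as the second argument is an assumption, so check the dataset page for the configuration names that actually exist.

```python
# Minimal sketch of streaming the corpus with the `datasets` library.
# The subset name "issues" is an assumption -- see the dataset page for
# the configurations that actually exist.
from datasets import load_dataset

ds = load_dataset(
    "HuggingFaceTB/issues-kaggle-notebooks",
    "issues",          # hypothetical subset name
    split="train",
    streaming=True,    # avoid downloading the full corpus up front
)

for sample in ds.take(3):
    print(sample.keys())
```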
CC0 1.0 (Public Domain) https://creativecommons.org/publicdomain/zero/1.0/
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Abderrazak Chahid
Released under MIT
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains 254,661 images from HaGRID (HAnd Gesture Recognition Image Dataset) downscaled to 384p. The original dataset is 716GB and contains 552,992 1080p images. I created this sample for a tutorial so readers can use the dataset in the free tiers of Google Colab and Kaggle Notebooks.
Original Authors:
Alexander Kapitanov Andrey Makhlyarchuk Karina Kvanchiani
Original Dataset Links
GitHub Kaggle Datasets Page
Object Classes
['call'… See the full description on the dataset page: https://huggingface.co/datasets/cj-mills/hagrid-sample-250k-384p.
CC0 1.0 (Public Domain) https://creativecommons.org/publicdomain/zero/1.0/
Sample of 17,000 github.com developers and the programming languages they know, or want to know.
I acquired the data by listing the 1,000 most-starred repos and taking the first 30 users who starred each repo, then removing duplicates. For each of the 17,000 users, I calculated the frequency of each of 1,400 technologies in the metadata of the user's own and forked repositories.
Thanks to Jihye Sofia Seo, whose dataset Top 980 Starred Open Source Projects on GitHub is the source for this dataset.
I am using this dataset for my GitHub recommendation engine: I use it to find similar developers and to use their starred repositories as recommendations. I also use it to categorize developer types and to estimate the weight of a developer in a team, especially when a developer leaves the company, so that the talent lost to the team and the company can be quantified.
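The "find similar developers" use case can be sketched as cosine similarity over the per-user technology-frequency vectors. The sketch below assumes a hypothetical file name and column layout, so adapt both to the actual CSV.

```python
# Sketch of finding similar developers via cosine similarity over
# technology-frequency vectors. File name and column names are
# hypothetical -- match them to the real dataset layout.
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

df = pd.read_csv("github_users.csv", index_col="user")   # hypothetical file/column
features = df.select_dtypes("number")                    # one column per technology

sim = pd.DataFrame(
    cosine_similarity(features.values),
    index=features.index,
    columns=features.index,
)

# Top 5 developers most similar to a given user (excluding the user itself)
print(sim["some_user"].drop("some_user").nlargest(5))
```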
CC0 1.0 (Public Domain) https://creativecommons.org/publicdomain/zero/1.0/
The MNIST dataset is one of the best known image classification problems out there, and a veritable classic of the field of machine learning. This dataset is a more challenging version of the same root problem: classifying letters from images. This is a multiclass classification dataset of glyphs of the English letters A - J.
This dataset is used extensively in the Udacity Deep Learning course, and is available in the Tensorflow Github repo (under Examples). I'm not aware of any license governing the use of this data, so I'm posting it here so that the community can use it with Kaggle kernels.
notMNIST_large.zip is a large but dirty version of the dataset with 529,119 images, and notMNIST_small.zip is a small hand-cleaned version with 18,726 images. The dataset was assembled by Yaroslav Bulatov and can be obtained on his blog. According to the blog entry, there is about a 6.5% label error rate on the large uncleaned dataset and a 0.5% label error rate on the small hand-cleaned dataset.
The two archives each contain 28x28 grayscale images of the letters A - J, organized into directories by letter: notMNIST_large.zip contains 529,119 images and notMNIST_small.zip contains 18,726 images.
Thanks to Yaroslav Bulatov for putting together the dataset.
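Because both archives unpack into one directory per letter, the data can be loaded with a generic image loader. The snippet below is a minimal sketch using Keras; the unzipped directory path is an assumption.

```python
# Sketch of loading the hand-cleaned split with Keras. The directory path
# is an assumption (whatever notMNIST_small.zip unpacks to).
import tensorflow as tf

train_ds = tf.keras.utils.image_dataset_from_directory(
    "notMNIST_small",        # one subdirectory per letter A-J
    labels="inferred",
    label_mode="int",
    color_mode="grayscale",
    image_size=(28, 28),
    batch_size=64,
)

print(train_ds.class_names)  # expected: ['A', 'B', ..., 'J']
```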
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the second version of the Google Landmarks dataset (GLDv2), which contains images annotated with labels representing human-made and natural landmarks. The dataset can be used for landmark recognition and retrieval experiments. This version of the dataset contains approximately 5 million images, split into 3 sets of images: train, index and test. The dataset was presented in our CVPR'20 paper. In this repository, we present download links for all dataset files and relevant code for metric computation. This dataset was associated with two Kaggle challenges, on landmark recognition and landmark retrieval. Results were discussed as part of a CVPR'19 workshop. In this repository, we also provide scores for the top 10 teams in the challenges, based on the latest ground-truth version. Please visit the challenge and workshop webpages for more details on the data, tasks and technical solutions from top teams.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘My Uber Drives’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/zusmani/uberdrives on 13 February 2022.
--- Dataset description provided by original source is as follows ---
My Uber Drives (2016)
Here are the details of my Uber drives of 2016. I am sharing this dataset so the data science community can learn from the behavior of an ordinary Uber customer.
Geography: USA, Sri Lanka and Pakistan
Time period: January - December 2016
Unit of analysis: Drives
Total Drives: 1,155
Total Miles: 12,204
Dataset: The dataset contains Start Date, End Date, Start Location, End Location, Miles Driven and Purpose of drive (Business, Personal, Meals, Errands, Meetings, Customer Support etc.)
Users are allowed to use, download, copy, distribute and cite the dataset for their pet projects and training. Please cite it as follows: “Zeeshan-ul-hassan Usmani, My Uber Drives Dataset, Kaggle Dataset Repository, March 23, 2017.”
Uber TLC FOIL Response - The dataset contains over 4.5 million Uber pickups in New York City from April to September 2014, and 14.3 million more Uber pickups from January to June 2015 https://github.com/fivethirtyeight/uber-tlc-foil-response
1.1 Billion Taxi Pickups from New York - http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/
What you can do with this data - a good example by Yao-Jen Kuo - https://yaojenkuo.github.io/uber.html
Some ideas worth exploring:
• What is the average length of the trip?
• Average number of rides per week or per month?
• Total tax savings based on traveled business miles?
• Percentage of business miles vs. personal vs. meals?
• How much money can be saved by a typical customer using Uber, Careem, or Lyft versus regular cab service?
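As a starting point for the ideas above, here is a minimal pandas sketch. The file name and column names follow the description above ("Start Date", "Miles Driven", "Purpose") and may differ from the actual CSV header, so adjust them before running.

```python
# Hedged sketch for the exploration ideas above. File name and column
# names are assumptions based on the dataset description.
import pandas as pd

df = pd.read_csv("My Uber Drives - 2016.csv", parse_dates=["Start Date", "End Date"])

print("Average trip length (miles):", df["Miles Driven"].mean())

rides_per_month = df.set_index("Start Date").resample("M").size()
print("Average rides per month:", rides_per_month.mean())

print("Share of miles by purpose:")
print(df.groupby("Purpose")["Miles Driven"].sum() / df["Miles Driven"].sum())
```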
--- Original source retains full ownership of the source dataset ---
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset consists of timestamps for coughs contained in files extracted from the ESC-50 and FSDKaggle2018 datasets.
Citation
This dataset was generated and used in our paper:
Mahmoud Abdelkhalek, Jinyi Qiu, Michelle Hernandez, Alper Bozkurt, Edgar Lobaton, “Investigating the Relationship between Cough Detection and Sampling Frequency for Wearable Devices,” in the 43rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 2021.
Please cite this paper if you use the timestamps.csv file in your work.
Generation
The cough timestamps given in the timestamps.csv file were generated using the cough templates given in figures 3 and 4 in the paper:
A. H. Morice, G. A. Fontana, M. G. Belvisi, S. S. Birring, K. F. Chung, P. V. Dicpinigaitis, J. A. Kastelik, L. P. McGarvey, J. A. Smith, M. Tatar, J. Widdicombe, "ERS guidelines on the assessment of cough", European Respiratory Journal 2007 29: 1256-1276; DOI: 10.1183/09031936.00101006
More precisely, 40 files labelled as "coughing" in the ESC-50 dataset and 273 files labelled as "Cough" in the FSDKaggle2018 dataset were manually searched using Audacity for segments of audio that closely matched the aforementioned templates, both visually and auditorily. Some files did not contain any coughs at all, while other files contained several coughs. Therefore, only the files that contained at least one cough are included in the coughs directory. In total, the timestamps of 768 cough segments with lengths ranging from 0.2 seconds to 0.9 seconds were extracted.
Description
The audio files are presented in wav format in the coughs directory. Files named in the general format of "*-*-*-24.wav" were extracted from the ESC-50 dataset, while all other files were extracted from the FSDKaggle2018 dataset.
The timestamps.csv file contains the timestamps for the coughs and it consists of four columns:
file_name,cough_number,start_time,end_time
Files in the file_name column can be found in the coughs directory. cough_number refers to the index of the cough in the corresponding file. For example, if the file X.wav contains 5 coughs, then X.wav will be repeated 5 times under the file_name column, and for each row, the cough_number will range from 1 to 5. start_time refers to the starting time of a cough segment measured in seconds, while end_time refers to the end time of a cough segment measured in seconds.
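For reference, a short sketch of using timestamps.csv to cut the cough segments out of the corresponding wav files is given below; the paths are assumptions, and the column names follow the description above.

```python
# Sketch: slice cough segments out of the wav files using timestamps.csv.
# Paths are assumptions; columns follow the description above.
import pandas as pd
import soundfile as sf

timestamps = pd.read_csv("timestamps.csv")

for _, row in timestamps.head(5).iterrows():
    audio, sr = sf.read(f"coughs/{row['file_name']}")
    start, end = int(row["start_time"] * sr), int(row["end_time"] * sr)
    segment = audio[start:end]
    name = row["file_name"].removesuffix(".wav")
    sf.write(f"{name}_cough{row['cough_number']}.wav", segment, sr)
```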
Licensing
The ESC-50 dataset as a whole is licensed under the Creative Commons Attribution-NonCommercial license. Individual files in the ESC-50 dataset are licensed under different Creative Commons licenses. For a list of these licenses, see LICENSE. The ESC-50 files in the cough directory are given for convenience only, and have not been modified from their original versions. To download the original files, see the ESC-50 dataset.
The FSDKaggle2018 dataset as a whole is licensed under the Creative Commons Attribution 4.0 International license. Individual files in the FSDKaggle2018 dataset are licensed under different Creative Commons licenses. For a list of these licenses, see the License section in FSDKaggle2018. The FSDKaggle2018 files in the cough directory are given for convenience only, and have not been modified from their original versions. To download the original files, see the FSDKaggle2018 dataset.
The timestamps.csv file is licensed under the Creative Commons Attribution-NonCommercial 4.0 International license.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset is generated by the fusion of three publicly available datasets: COVID-19 cxr image (https://github.com/ieee8023/covid-chestxray-dataset), the Radiological Society of North America (RSNA) (https://www.kaggle.com/c/rsna-pneumonia-detection-challenge), and the U.S. National Library of Medicine (USNLM) collected Montgomery County - NLM(MC) (https://lhncbc.nlm.nih.gov/publication/pub9931). These datasets were annotated by expert radiologists. The fused dataset consists of samples of diseases labeled as COVID-19, Tuberculosis, Other pneumonia (SARS, MERS, etc.), and Normal. The dataset can be used to train and evaluate deep learning and machine learning models as a binary or multi-class classification problem.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Mayweather Marketing Tactics’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/undefeated-boxerse on 28 January 2022.
--- Dataset description provided by original source is as follows ---
See Readme for more details.
This repository contains a selection of the data -- and the data-processing scripts -- behind the articles, graphics and interactives at FiveThirtyEight. We hope you'll use it to check our work and to create stories and visualizations of your own. The data is available under the Creative Commons Attribution 4.0 International License and the code is available under the MIT License. If you do find it useful, please let us know.
Source: https://github.com/fivethirtyeight/data
This dataset was created by FiveThirtyEight and contains around 2,000 samples along with Date, Url, technical information, and other features such as Name, Wins, and more.
- Analyze Date in relation to Url
- Study the influence of Name on Wins
- More datasets
If you use this dataset in your research, please credit FiveThirtyEight
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the data description and processing for the paper titled "SkySense++: A Semantic-Enhanced Multi-Modal Remote Sensing Foundation Model for Earth Observation." The code is available here.
🔥🔥🔥 Last Updated on 2024.03.14 🔥🔥🔥
We conduct semantic-enhanced pretraining on the RS-Semantic dataset, which consists of 13 datasets with pixel-level annotations. Below are the specifics of these datasets.
| Dataset | Modalities | GSD (m) | Size | Categories | Download Link |
|---|---|---|---|---|---|
| Five Billion Pixels | Gaofen-2 | 4 | 6800x7200 | 24 | Download |
| Potsdam | Airborne | 0.05 | 6000x6000 | 5 | Download |
| Vaihingen | Airborne | 0.05 | 2494x2064 | 5 | Download |
| Deepglobe | WorldView | 0.5 | 2448x2448 | 6 | Download |
| iSAID | Multiple Sensors | - | 800x800 to 4000x13000 | 15 | Download |
| LoveDA | Spaceborne | 0.3 | 1024x1024 | 7 | Download |
| DynamicEarthNet | WorldView | 0.3 | 1024x1024 | 7 | Download |
| | Sentinel-2* | 10 | 32x32 | | |
| | Sentinel-1* | 10 | 32x33 | | |
| Pastis-MM | WorldView | 0.3 | 1024x1024 | 18 | Download |
| | Sentinel-2* | 10 | 32x32 | | |
| | Sentinel-1* | 10 | 32x33 | | |
| C2Seg-AB | Sentinel-2* | 10 | 128x128 | 13 | Download |
| | Sentinel-1* | 10 | 128x128 | | |
| FLAIR | Spot-5 | 0.2 | 512x512 | 12 | Download |
| | Sentinel-2* | 10 | 40x40 | | |
| DFC20 | Sentinel-2 | 10 | 256x256 | 9 | Download |
| | Sentinel-1 | 10 | 256x256 | | |
| S2-naip | NAIP | 1 | 512x512 | 32 | Download |
| | Sentinel-2* | 10 | 64x64 | | |
| | Sentinel-1* | 10 | 64x64 | | |
| JL-16 | Jilin-1 | 0.72 | 512x512 | 16 | Download |
| | Sentinel-1* | 10 | 40x40 | | |
* indicates time-series data.
We evaluate our SkySense++ on 12 typical Earth Observation (EO) tasks across 7 domains: agriculture, forestry, oceanography, atmosphere, biology, land surveying, and disaster management. The detailed information about the datasets used for evaluation is as follows.
| Domain | Task type | Dataset | Modalities | GSD | Image size | Download Link | Notes |
|---|---|---|---|---|---|---|---|
| Agriculture | Crop classification | Germany | Sentinel-2* | 10 | 24x24 | Download | |
| Forestry | Tree species classification | TreeSatAI-Time-Series | Airborne | 0.2 | 304x304 | Download | |
| | | | Sentinel-2* | 10 | 6x6 | | |
| | | | Sentinel-1* | 10 | 6x6 | | |
| | Deforestation segmentation | Atlantic | Sentinel-2 | 10 | 512x512 | Download | |
| Oceanography | Oil spill segmentation | SOS | Sentinel-1 | 10 | 256x256 | Download | |
| Atmosphere | Air pollution regression | 3pollution | Sentinel-2 | 10 | 200x200 | Download | |
| | | | Sentinel-5P | 2600 | 120x120 | | |
| Biology | Wildlife detection | Kenya | Airborne | - | 3068x4603 | Download | |
| Land surveying | LULC mapping | C2Seg-BW | Gaofen-6 | 10 | 256x256 | Download | |
| | | | Gaofen-3 | 10 | 256x256 | | |
| | Change detection | dsifn-cd | GoogleEarth | 0.3 | 512x512 | Download | |
| Disaster management | Flood monitoring | Flood-3i | Airborne | 0.05 | 256x256 | Download | |
| | | C2SMSFloods | Sentinel-2, Sentinel-1 | 10 | 512x512 | Download | |
| | Wildfire monitoring | CABUAR | Sentinel-2 | 10 | 5490x5490 | Download | |
| | Landslide mapping | GVLM | GoogleEarth | 0.3 | 1748x1748 ~ 10808x7424 | Download | |
| | Building damage assessment | xBD | WorldView | 0.3 | 1024x1024 | Download | |
* indicates time-series data.
This dataset was created by Sas Pav
400k+ Bangla news samples, 25+ categories
Data collected from https://www.prothomalo.com/archive [Copyright owned by the actual source]
Github repository (Classification with LSTM): https://github.com/zabir-nabil/bangla-news-rnn
https://huggingface.co/datasets/zabir-nabil/bangla_newspaper_dataset
The dataset can be used for Bangla text classification and generation experiments.
The JAFFE images may be used only for non-commercial scientific research.
The source and background of the dataset must be acknowledged by citing the following two articles. Users should read both carefully.
Michael J. Lyons, Miyuki Kamachi, Jiro Gyoba.
Coding Facial Expressions with Gabor Wavelets (IVC Special Issue)
arXiv:2009.05938 (2020) https://arxiv.org/pdf/2009.05938.pdf
Michael J. Lyons
"Excavating AI" Re-excavated: Debunking a Fallacious Account of the JAFFE Dataset
arXiv: 2107.13998 (2021) https://arxiv.org/abs/2107.13998
The following is not allowed:
A few sample images (not more than 10) may be displayed in scientific publications.
The dataset is organized into 3 folders (covid, pneumonia, normal) which contain chest X-ray posteroanterior (PA) images. X-ray samples of COVID-19 were retrieved from different sources because no single large dataset was available. Firstly, a total of 1,401 samples of COVID-19 were collected using GitHub repositories [1], [2], Radiopaedia [3], the Italian Society of Radiology (SIRM) [4], and the Figshare data repository websites [5], [6]. Then, 912 augmented images were also collected from Mendeley instead of using data augmentation techniques explicitly [7]. Finally, 2,313 samples of normal and pneumonia cases were obtained from Kaggle [8], [9]. A total of 6,939 samples were used in the experiment, with 2,313 samples for each case.
This SQLite database contains Moksha lemmas and their frequencies in a large corpus. The lemmas are linked to each other based on the syntactic relations they have had in the corpus, and the frequency of each syntactic relation between two words is also recorded. This means that it is possible to see, for example, how frequently the word for dog has appeared in a subject relation with the verb for bark. This database was translated from SemFi by using Giellatekno XML dictionaries. For a detailed description of the structure, see https://www.kaggle.com/mikahama/semfi-finnish-semantics-with-syntactic-relations
An easy programmatic interface is provided in UralicNLP: https://github.com/mikahama/uralicNLP/wiki/Semantics-(SemFi,-SemUr)
Cite as: Hämäläinen, Mika. (2018). Extracting a Semantic Database with Syntactic Relations for Finnish to Boost Resources for Endangered Uralic Languages. In The Proceedings of Logic and Engineering of Natural Language Semantics 15 (LENLS15).
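The database can also be queried directly with Python's built-in sqlite3 module. The sketch below first lists the actual tables, since the table and column names used in the commented query are hypothetical; the UralicNLP wiki linked above documents the ready-made interface.

```python
# Sketch of exploring the SQLite database directly. The table/column names
# in the commented query are hypothetical -- inspect the schema first, or
# use the UralicNLP interface documented in the wiki linked above.
import sqlite3

conn = sqlite3.connect("semfi_moksha.db")  # hypothetical filename

# List the tables that actually exist before querying
print(conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall())

# Hypothetical query: how often does lemma X occur as the subject of lemma Y?
# rows = conn.execute(
#     "SELECT frequency FROM relations "
#     "WHERE word1 = ? AND word2 = ? AND relation = 'subj'",
#     ("<lemma for dog>", "<lemma for bark>"),
# ).fetchall()
```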
The Numenta Anomaly Benchmark (NAB) is a novel benchmark for evaluating algorithms for anomaly detection in streaming, online applications. It is comprised of over 50 labeled real-world and artificial timeseries data files plus a novel scoring mechanism designed for real-time applications. All of the data and code is fully open-source, with extensive documentation, and a scoreboard of anomaly detection algorithms: github.com/numenta/NAB. The full dataset is included here, but please go to the repo for details on how to evaluate anomaly detection algorithms on NAB.
The NAB corpus of 58 timeseries data files is designed to provide data for research in streaming anomaly detection. It is comprised of both real-world and artificial timeseries data containing labeled anomalous periods of behavior. Data are ordered, timestamped, single-valued metrics. All data files contain anomalies, unless otherwise noted.
The majority of the data is real-world from a variety of sources such as AWS server metrics, Twitter volume, advertisement clicking metrics, traffic data, and more. All data is included in the repository, with more details in the data readme. We are in the process of adding more data, and actively searching for more data. Please contact us at nab@numenta.org if you have similar data (ideally with known anomalies) that you would like to see incorporated into NAB.
The NAB version will be updated whenever new data (and corresponding labels) is added to the corpus; NAB is currently in v1.0.
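Since every file is a single timestamped metric, a simple rolling z-score makes a reasonable first baseline before moving to the full NAB scoring. The sketch below assumes the common NAB column names ("timestamp", "value") and a local file path; check the data readme in the repo before running.

```python
# Hedged baseline on one NAB file: flag points whose rolling z-score is
# large. Column names and path are assumptions -- see the data readme.
import pandas as pd

df = pd.read_csv("realKnownCause/nyc_taxi.csv", parse_dates=["timestamp"])

window = 48  # 30-minute buckets -> roughly one day of history
rolling = df["value"].rolling(window)
zscore = (df["value"] - rolling.mean()) / rolling.std()

anomalies = df[zscore.abs() > 4]
print(anomalies.head())
```

Note that this is only a naive baseline; NAB's own scoring rewards early detection within labeled anomaly windows, so use the repository's evaluation code for comparable results.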
realAWSCloudwatch/
AWS server metrics as collected by the AmazonCloudwatch service. Example metrics include CPU Utilization, Network Bytes In, and Disk Read Bytes.
realAdExchange/
Online advertisement clicking rates, where the metrics are cost-per-click (CPC) and cost per thousand impressions (CPM). One of the files is normal, without anomalies.
realKnownCause/
This is data for which we know the anomaly causes; no hand labeling.
• ambient_temperature_system_failure.csv: The ambient temperature in an office setting.
• cpu_utilization_asg_misconfiguration.csv: From Amazon Web Services (AWS) monitoring of CPU usage – i.e. average CPU usage across a given cluster. When usage is high, AWS spins up a new machine, and uses fewer machines when usage is low.
• ec2_request_latency_system_failure.csv: CPU usage data from a server in Amazon's East Coast datacenter. The dataset ends with complete system failure resulting from a documented failure of AWS API servers. There's an interesting story behind this data in the Numenta blog: http://numenta.com/blog/anomaly-of-the-week.html
• machine_temperature_system_failure.csv: Temperature sensor data of an internal component of a large, industrial machine. The first anomaly is a planned shutdown of the machine. The second anomaly is difficult to detect and directly led to the third anomaly, a catastrophic failure of the machine.
• nyc_taxi.csv: Number of NYC taxi passengers, where the five anomalies occur during the NYC marathon, Thanksgiving, Christmas, New Year's Day, and a snow storm. The raw data is from the NYC Taxi and Limousine Commission. The data file included here aggregates the total number of taxi passengers into 30-minute buckets.
• rogue_agent_key_hold.csv: Timing of key holds for several users of a computer, where the anomalies represent a change in the user.
• rogue_agent_key_updown.csv: Timing of key strokes for several users of a computer, where the anomalies represent a change in the user.
realTraffic/
Real time traffic data from the Twin Cities Metro area in Minnesota, collected by the Minnesota Department of Transportation. Included metrics include occupancy, speed, and travel time from specific sensors.
realTweets/
A collection of Twitter mentions of large publicly-traded companies such as Google and IBM. The metric value represents the number of mentions for a given ticker symbol every 5 minutes.
artificialNoAnomaly/
Artificially-generated data without any anomalies.
artificialWithAnomaly/
Artificially-generated data with varying types of anomalies.
We encourage you to publish your results on running NAB, and share them with us at nab@numenta.org. Please cite the following publication when referring to NAB:
Lavin, Alexander and Ahmad, Subutai. "Evaluating Real-time Anomaly Detection Algorithms – the Numenta Anomaly Benchmark", Fourteenth International Conference on Machine Learning and Applications, December 2015. [PDF]
This data collection contains all the data used in our learning question classification experiments, which has question class definitions, the training and testing question sets, examples of preprocessing the questions, feature definition scripts and examples of semantically related word features.
• ABBR - 'abbreviation': expression abbreviated, etc.
• DESC - 'description and abstract concepts': manner of an action, description of something, etc.
• ENTY - 'entities': animals, colors, events, food, etc.
• HUM - 'human beings': a group or organization of persons, an individual, etc.
• LOC - 'locations': cities, countries, etc.
• NUM - 'numeric values': postcodes, dates, speed, temperature, etc.
https://cogcomp.seas.upenn.edu/Data/QA/QC/ https://github.com/Tony607/Keras-Text-Transfer-Learning/blob/master/README.md
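A minimal baseline for the coarse classes (ABBR, DESC, ENTY, HUM, LOC, NUM) can be sketched with TF-IDF features and logistic regression. The parsing below assumes each line of the label file starts with a "COARSE:fine" label followed by the question text, and the file name is an assumption; adjust both to the files you actually download.

```python
# Hedged baseline: TF-IDF + logistic regression on the coarse question
# classes. File name and the "COARSE:fine question" line format are
# assumptions -- adapt the parsing to the downloaded files.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

labels, questions = [], []
with open("train_questions.label", encoding="latin-1") as f:  # hypothetical file
    for line in f:
        label, question = line.split(" ", 1)
        labels.append(label.split(":")[0])   # keep only the coarse class
        questions.append(question.strip())

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
clf.fit(questions, labels)

print(clf.predict(["How far is it from Denver to Aspen ?"]))
```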