100+ datasets found
  1. Real Indian users on Github

    • kaggle.com
    Updated Oct 6, 2024
    Cite
    Archit Tyagi (2024). Real Indian users on Github [Dataset]. https://www.kaggle.com/datasets/archittyagi108/real-indian-users-on-github
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 6, 2024
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Archit Tyagi
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    India
    Description

    📊 GitHub Indian Users Dataset

    Overview

    This dataset provides insights into the Indian developer community on GitHub, one of the world’s largest platforms for developers to collaborate, share, and contribute to open-source projects. Whether you're interested in analyzing trends, understanding community growth, or identifying popular programming languages, this dataset offers a comprehensive look at the profiles of GitHub users from India.

    đŸ§‘â€đŸ’» Dataset Contents

    The dataset includes anonymized profile information for a diverse range of GitHub users based in India. Key features include:
    - Username: Unique identifier for each user (anonymized)
    - Location: City or region within India
    - Programming Languages: Most commonly used languages per user
    - Repositories: Public repositories owned and contributed to
    - Followers and Following: Social network connections within the platform
    - GitHub Join Date: Date the user joined GitHub
    - Organizations: Affiliated organizations (if publicly available)

    🌟 Source and Inspiration

    This dataset is curated from publicly available GitHub profiles with a specific focus on Indian users. It is inspired by the need to understand the growth of the tech ecosystem in India, including the languages, tools, and topics that are currently popular among Indian developers. This dataset aims to provide valuable insights for recruiters, data scientists, and anyone interested in the open-source contributions of Indian developers.

    Potential Use Cases

    1. Trend Analysis: Identify popular programming languages, tech stacks, and frameworks among Indian developers.
    2. Community Growth: Analyze how the Indian developer community has grown over time on GitHub.
    3. Social Network Analysis: Understand the follower and following patterns to uncover influential developers within the Indian tech community.
    4. Regional Insights: Discover which cities or regions in India have the most active GitHub users.
    5. Career Development: Insights for recruiters looking to identify and understand potential talent pools in India.

    💡 Ideal for

    This dataset is perfect for:
    - Data scientists looking to explore and visualize developer trends
    - Recruiters interested in talent scouting within the Indian tech ecosystem
    - Tech enthusiasts who want to explore the dynamics of India's open-source community
    - Students and educators looking for real-world data to practice analysis and modeling
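
    For a quick start, here is a minimal pandas sketch for exploring the dataset; the file name and column labels are assumptions based on the feature list above, so adjust them to the actual CSV:

    import pandas as pd

    # Hypothetical file and column names, inferred from the feature list above
    users = pd.read_csv("indian_github_users.csv")
    print(users["Location"].value_counts().head(10))  # most active cities/regions
    print(users["Programming Languages"].value_counts().head(10))  # most common languages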

  2. Data from: A large synthetic dataset for machine learning applications in...

    • zenodo.org
    csv, json, png, zip
    Updated Mar 25, 2025
    Cite
    Marc Gillioz; Guillaume Dubuis; Philippe Jacquod (2025). A large synthetic dataset for machine learning applications in power transmission grids [Dataset]. http://doi.org/10.5281/zenodo.13378476
    Explore at:
    Available download formats: zip, png, csv, json
    Dataset updated
    Mar 25, 2025
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Marc Gillioz; Guillaume Dubuis; Philippe Jacquod
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    With the ongoing energy transition, power grids are evolving fast. They operate more and more often close to their technical limits, under increasingly volatile conditions. Fast, essentially real-time computational approaches to evaluate their operational safety, stability and reliability are therefore highly desirable. Machine learning methods have been advocated to solve this challenge; however, they are heavy consumers of training and testing data, while historical operational data for real-world power grids are hard, if not impossible, to access.

    This dataset contains long time series for production, consumption, and line flows, amounting to 20 years of data with a time resolution of one hour, for several thousand loads and several hundred generators of various types representing the ultra-high-voltage transmission grid of continental Europe. The synthetic time series have been statistically validated against real-world data.

    Data generation algorithm

    The algorithm is described in a Nature Scientific Data paper. It relies on the PanTaGruEl model of the European transmission network -- the admittance of its lines as well as the location, type and capacity of its power generators -- and aggregated data gathered from the ENTSO-E transparency platform, such as power consumption aggregated at the national level.

    Network

    The network information is encoded in the file europe_network.json. It is given in PowerModels format, which is itself derived from MatPower and compatible with PandaPower. The network features 7822 power lines and 553 transformers connecting 4097 buses, to which are attached 815 generators of various types.
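
    As a rough sketch, the component counts can be checked by loading the file with plain Python; the keys ('bus', 'gen', 'branch') follow the usual PowerModels conventions and are an assumption to verify against the actual file:

    import json

    with open('europe_network.json') as f:
        net = json.load(f)

    # PowerModels-style networks store components in dictionaries keyed by id
    print(len(net['bus']), 'buses,', len(net['gen']), 'generators,', len(net['branch']), 'branches')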

    Time series

    The time series forming the core of this dataset are given in CSV format. Each CSV file is a table with 8736 rows, one for each hourly time step of a 364-day year. All years are truncated to exactly 52 weeks of 7 days, and start on a Monday (the load profiles are typically different during weekdays and weekends). The number of columns depends on the type of table: there are 4097 columns in load files, 815 for generators, and 8375 for lines (including transformers). Each column is described by a header corresponding to the element identifier in the network file. All values are given in per-unit, both in the model file and in the tables, i.e. they are multiples of a base unit taken to be 100 MW.

    There are 20 tables of each type, labeled with a reference year (2016 to 2020) and an index (1 to 4), zipped into archive files arranged by year. This amounts to a total of 20 years of synthetic data. When using load, generator, and line profiles together, it is important to use the same label: for instance, the files loads_2020_1.csv, gens_2020_1.csv, and lines_2020_1.csv represent the same year of the dataset, whereas gens_2020_2.csv is unrelated (it actually shares some features, such as nuclear profiles, but it is based on a dispatch with distinct loads).
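
    For example, a matched year of the dataset can be loaded and converted from per-unit to MW as follows (a sketch assuming the CSV files have been extracted from the yearly archives):

    import pandas as pd

    label = '2020_1'  # reference year and index; use the same label for all three tables
    loads = pd.read_csv(f'loads_{label}.csv') * 100  # per-unit -> MW (base unit: 100 MW)
    gens = pd.read_csv(f'gens_{label}.csv') * 100
    lines = pd.read_csv(f'lines_{label}.csv') * 100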

    Usage

    The time series can be used without reference to the network file, simply using all or a selection of columns of the CSV files, depending on the needs. We show below how to select series from a particular country, and how to aggregate hourly time steps into days or weeks. These examples use Python and the data analysis library pandas, but other frameworks (Matlab, Julia) can be used as well. Since all the yearly time series are periodic, it is always possible to define a coherent time window modulo the length of the series.
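
    For instance, a time window that wraps around the year boundary can be built with indices taken modulo the series length:

    import pandas as pd

    loads = pd.read_csv('loads_2016_1.csv')
    length = 24 * 364  # series length in hours
    # one week starting three days before the end of the year, wrapping around
    idx = [t % length for t in range(length - 24 * 3, length - 24 * 3 + 24 * 7)]
    window = loads.iloc[idx]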

    Selecting a particular country

    This example illustrates how to select generation data for Switzerland in Python. This can be done without parsing the network file, using instead gens_by_country.csv, which contains a list of all generators for each country in the network. We start by importing the pandas library and reading the column of the file corresponding to Switzerland (country code CH):

    import pandas as pd
    CH_gens = pd.read_csv('gens_by_country.csv', usecols=['CH'], dtype=str)

    The object created in this way is a DataFrame with some null values (not all countries have the same number of generators). It can be turned into a list with:

    CH_gens_list = CH_gens.dropna().squeeze().to_list()

    Finally, we can import all the time series of Swiss generators from a given data table with

    pd.read_csv('gens_2016_1.csv', usecols=CH_gens_list)

    The same procedure can be applied to loads using the list contained in the file loads_by_country.csv.

    Averaging over time

    This second example shows how to change the time resolution of the series. Suppose that we are interested in all the loads from a given table, which are given by default with a one-hour resolution:

    hourly_loads = pd.read_csv('loads_2018_3.csv')

    To get a daily average of the loads, we can use:

    daily_loads = hourly_loads.groupby([t // 24 for t in range(24 * 364)]).mean()

    This results in series of length 364. To average further over entire weeks and get series of length 52, we use:

    weekly_loads = hourly_loads.groupby([t // (24 * 7) for t in range(24 * 364)]).mean()

    Source code

    The code used to generate the dataset is freely available at https://github.com/GeeeHesso/PowerData. It consists of two packages and several documentation notebooks. The first package, written in Python, provides functions to handle the data and to generate synthetic series based on historical data. The second package, written in Julia, is used to perform the optimal power flow. The documentation, in the form of Jupyter notebooks, contains numerous examples of how to use both packages. The entire workflow used to create this dataset is also provided, starting from raw ENTSO-E data files and ending with the synthetic dataset given in the repository.

    Funding

    This work was supported by the Cyber-Defence Campus of armasuisse and by an internal research grant of the Engineering and Architecture domain of HES-SO.

  3. Coronavirus COVID-19 Global Cases by the Center for Systems Science and...

    • github.com
    • systems.jhu.edu
    • +1more
    Cite
    Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE), Coronavirus COVID-19 Global Cases by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU) [Dataset]. https://github.com/CSSEGISandData/COVID-19
    Explore at:
    Dataset provided by
    Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE)
    Area covered
    Global
    Description

    2019 Novel Coronavirus COVID-19 (2019-nCoV) Visual Dashboard and Map:
    https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6

    • Confirmed Cases by Country/Region/Sovereignty
    • Confirmed Cases by Province/State/Dependency
    • Deaths
    • Recovered

    Downloadable data:
    https://github.com/CSSEGISandData/COVID-19

    Additional Information about the Visual Dashboard:
    https://systems.jhu.edu/research/public-health/ncov

  4. Coronavirus (Covid-19) Data in the United States

    • nytimes.com
    • openicpsr.org
    • +2more
    Cite
    New York Times, Coronavirus (Covid-19) Data in the United States [Dataset]. https://www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html
    Explore at:
    Dataset provided by
    New York Times
    Description

    The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. We are compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak.

    Since late January, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak.

    We have used this data to power our maps and reporting tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists and government officials who would like access to the data to better understand the outbreak.

    The data begins with the first reported coronavirus case in Washington State on Jan. 21, 2020. We will publish regular updates to the data in this repository.

  5. Data from: Large-scale Ridesharing DARP Instances Based on Real Travel...

    • data.mendeley.com
    • ieee-dataport.org
    Updated Dec 5, 2023
    + more versions
    Cite
    David Fiedler (2023). Large-scale Ridesharing DARP Instances Based on Real Travel Demand [Dataset]. http://doi.org/10.17632/fj6nwvbt48.1
    Explore at:
    Dataset updated
    Dec 5, 2023
    Authors
    David Fiedler
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset presents a set of large-scale Dial-a-Ride Problem (DARP) instances. The instances were created as a standardized set of ridesharing DARP problems for the purpose of benchmarking and comparing different solution methods.

    The instances are based on real demand and realistic travel time data from 3 different US cities: Chicago, New York City, and Washington, DC. The instances consist of real travel requests from the selected period, positions of vehicles with their capacities, and realistic shortest travel times between all pairs of locations in each city.

    The instances and results of two solution methods, the Insertion Heuristic and the optimal Vehicle-group Assignment method, can be found in the dataset.

    📄 Paper: arXiv:2305.18859
    📁 Data: DOI:10.5281/zenodo.7986103
    👩‍💻 Code: https://github.com/aicenter/Ridesharing_DARP_instances

    The dataset was presented at the IEEE International Conference on Intelligent Transportation Systems (ITSC 2023) in Bilbao, Bizkaia, Spain, 24-28 September 2023 (Session CON03).

  6. Data from: Covid19Kerala.info-Data: A collective open dataset of COVID-19...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 6, 2020
    Cite
    Sharadh Manian (2020). Covid19Kerala.info-Data: A collective open dataset of COVID-19 outbreak in the south Indian state of Kerala [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3818096
    Explore at:
    Dataset updated
    Sep 6, 2020
    Dataset provided by
    Shabeesh Balan
    Manoj Karingamadathil
    Akhil Balakrishnan
    Jeevan Uthaman
    Jijo Ulahannan
    Sharadh Manian
    Sreehari Pillai
    Nikhil Narayanan
    Hritwik N Edavalath
    Sreekanth Chaliyeduth
    Musfir Mohammed
    Nishad Thalhath
    Kumar Sujith
    Sooraj P Suresh
    E Rajeevan
    Sindhu Joseph
    Prem Prabhakaran
    Unnikrishnan Sureshkumar
    Neetha Nanoth Vellichirammal
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Area covered
    South India, India, Kerala
    Description

    Covid19Kerala.info-Data is a consolidated multi-source open dataset of metadata from the COVID-19 outbreak in the Indian state of Kerala. It is created and maintained by volunteers of ‘Collective for Open Data Distribution-Keralam’ (CODD-K), a nonprofit consortium of individuals formed for the distribution and longevity of open datasets. Covid19Kerala.info-Data covers a set of correlated temporal and spatial metadata of SARS-CoV-2 infections and prevention measures in Kerala. Static snapshot releases of this dataset are manually produced from a live database maintained as a set of publicly accessible Google sheets. This dataset is made available under the Open Data Commons Attribution License v1.0 (ODC-BY 1.0).

    Schema and data package

    A datapackage with schema definition is accessible at https://codd-k.github.io/covid19kerala.info-data/datapackage.json. The provided datapackage and schema are based on the Frictionless Data Package specification.
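
    As a minimal sketch, the datapackage can be loaded directly from that URL with the frictionless-py package (an assumption; any Data Package-aware tool works):

    from frictionless import Package

    pkg = Package("https://codd-k.github.io/covid19kerala.info-data/datapackage.json")
    print([resource.name for resource in pkg.resources])  # list the available data facets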

    Temporal and Spatial Coverage

    This dataset covers the COVID-19 outbreak and related data from the state of Kerala, India, from January 31, 2020 until the date of publication of this snapshot. The dataset shall be maintained throughout the entirety of the COVID-19 outbreak.

    The spatial coverage of the data lies within the geographical boundaries of the Kerala state, which includes its 14 administrative subdivisions. The state is further divided into Local Self Governing (LSG) Bodies. Reference to this spatial information is included on appropriate data facets. Spatial information on regions outside Kerala is mentioned where available, but only as a reference to the possible origins of infection clusters or the movement of individuals.

    Longevity and Provenance

    The dataset snapshot releases are published and maintained in a designated GitHub repository maintained by the CODD-K team. Periodic snapshots from the live database will be released at regular intervals. The GitHub commit logs for the repository will be maintained as a record of provenance, and an archived repository will be maintained at the end of the project lifecycle for the longevity of the dataset.

    Data Stewardship

    CODD-K expects all administrators, managers, and users of its datasets to manage, access, and utilize them in a manner that is consistent with the consortium’s need for security and confidentiality and with relevant legal frameworks within all geographies, especially Kerala and India. While acting as a responsible steward in maintaining and making this dataset accessible, CODD-K disclaims all liability for any damages caused by inaccuracies in the dataset.

    License

    This dataset is made available by the CODD-K consortium under ODC-BY 1.0 license. The Open Data Commons Attribution License (ODC-By) v1.0 ensures that users of this dataset are free to copy, distribute and use the dataset to produce works and even to modify, transform and build upon the database, as long as they attribute the public use of the database or works produced from the same, as mentioned in the citation below.

    Disclaimer

    Covid19Kerala.info-Data is provided under the ODC-BY 1.0 license as-is. Though every attempt is made to ensure that the data is error-free and up to date, the CODD-K consortium does not bear any responsibility for inaccuracies in the dataset or any losses, monetary or otherwise, that users of this dataset may incur.

  7. Dataset of Automatically Orchestrable GitHub Projects

    • data.europa.eu
    unknown
    Updated Nov 24, 2023
    Cite
    Zenodo (2023). Dataset of Automatically Orchestrable GitHub Projects [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-7194189?locale=no
    Explore at:
    Available download formats: unknown (11388978)
    Dataset updated
    Nov 24, 2023
    Dataset authored and provided by
    Zenodo: http://zenodo.org/
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset accompanies the submission "Generating representative, live network traffic out of millions of code repositories" at HotNets'22: The 21st ACM Workshop on Hot Topics in Networks. Please see the files:
    - list_of_github_repositories.txt for a list of GitHub repositories that we found containing a docker-compose*.yml file
    - list_of_executed_repositories.csv for more detailed information on the success of capturing traffic with specific orchestration files found in ~67% of the repositories

    If you use our dataset, please cite our work as follows: Tobias Bühler, Roland Schmid, Sandro Lutz, and Laurent Vanbever. 2022. Generating representative, live network traffic out of millions of code repositories. In The 21st ACM Workshop on Hot Topics in Networks (HotNets ’22), November 14–15, 2022, Austin, TX, USA. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3563766.3564084

  8. Real-time data from public transport via MQTT & ESP32 XIAO & OpenDATA |...

    • gimi9.com
    Cite
    Real-time data from public transport via MQTT & ESP32 XIAO & OpenDATA | gimi9.com [Dataset]. https://gimi9.com/dataset/eu_bd836e2c-70ac-44cc-8240-e93aa8ea676c/
    Explore at:
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Real-time data from the Vienna lines are processed in Node-RED and sent via MQTT to an ESP32. The content is displayed on a 0.91" OLED display. Instructions for the evaluation in Node-RED and the implementation of the ESP32 code: Part 1: Part 2: Wiring and code can be found at github.com/pixeledi. Have fun replicating!

  9. Data from: RNN-DAS: A New Deep Learning Approach for Detection and Real-Time...

    • zenodo.org
    zip
    Updated Sep 20, 2025
    Cite
    Javier Fernandez-Carabantes; Manuel Titos; Luca D'Auria; Jesús García; Luz García; Carmen Benítez (2025). RNN-DAS: A New Deep Learning Approach for Detection and Real-Time Monitoring of Volcano-Tectonic Events Using Distributed Acoustic Sensing [Dataset]. http://doi.org/10.5281/zenodo.15105596
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 20, 2025
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Javier Fernandez-Carabantes; Manuel Titos; Luca D'Auria; Jesús García; Luz García; Carmen Benítez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    HDAS Data from La Palma - DigiVolCan Project

    This repository contains various co-eruptive VT datasets collected during the 2021 eruption at La Palma, Spain, by an underwater High-fidelity Distributed Acoustic Sensing (HDAS) array. These datasets have been used to train and test the RNN-DAS model, a deep learning framework designed for volcano-seismic event detection using DAS data.

    This data was collected as part of the DigiVolCan project, which is a collaboration between the University of Granada, the Canary Islands Volcanological Institute (INVOLCAN), the Institute of Technological and Renewable Energies (ITER), the University of La Laguna, and Aragón Photonics. It is funded by the Ministry of Science, Innovation, and Universities / State Research Agency (MICIU/AEI) of Spain and the European Union through the Recovery, Transformation, and Resilience Plan, Next Generation EU Funds. The project reference is PLEC2022-009271, funded by MICIU/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR.

    Dataset Description

    The shared dataset contains HDAS data recorded over several periods, with one file per minute. Each file is in .h5 format and follows the structure:

    file_path
    │
    └───"data" (dataset)
      │
      ├───data (2D matrix of strain rate)
      │  └───[channels x time_samples]
      │
      ├───attrs
         │
         ├───"dt_s" (temporal sampling in seconds)
         ├───"dx_m" (spatial sampling in meters)
         └───"begin_time" (start date in 'YYYY-MM-DDTHH:MM:SS.SSS' format)
    

    Five datasets are provided as separate compressed .zip archives due to their size. Each archive contains DAS waveform data in the HDF5 (.h5) format described above, organized in one-minute files. These datasets correspond to figures presented in the RNN-DAS model article and are intended to facilitate reproducibility and further analysis.

    Dataset 1 – Main event and aftershocks (Figure 5)

    This dataset contains a one-hour DAS recording from November 30, between 07:00 and 08:00 UTC, featuring a main seismic event with magnitude Ml = 3.22 along with several aftershocks.

    Dataset 2 – Continuous Test Segment (Figure 6)

    This dataset contains one hour of continuous DAS recordings from October 29, between 04:00 and 05:00 UTC.

    Dataset 3 – Events with Varying SNR and Magnitude (Figure 4)

    This dataset includes three separate 3-minute DAS recordings, each corresponding to a different seismic event with distinct characteristics. The selected events represent a range of conditions, including attenuated signals, low signal-to-noise ratio (SNR), and nearby high-SNR events.

    Dataset 4 – High-Magnitude Event Example (Figure 3)

    This dataset contains a 3-minute DAS recording corresponding to a seismic event with magnitude Ml = 4.23. This example demonstrates the model’s response to a clear, high-magnitude event.

    Dataset 5 – Moderate Events and Noise-Only Sample (Figure 7)

    This dataset includes four separate DAS recordings: three corresponding to moderate seismic events and another containing only seismic noise.

    All .zip archives can be easily decompressed and used directly.
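
    A minimal sketch for reading one of the decompressed one-minute files with h5py, following the structure described above (the file name is a placeholder):

    import h5py

    with h5py.File("example_minute.h5", "r") as f:  # placeholder file name
        dset = f["data"]
        strain_rate = dset[...]  # 2D matrix: [channels x time_samples]
        dt_s = dset.attrs["dt_s"]  # temporal sampling in seconds
        dx_m = dset.attrs["dx_m"]  # spatial sampling in meters
        begin_time = dset.attrs["begin_time"]  # 'YYYY-MM-DDTHH:MM:SS.SSS'
    print(strain_rate.shape, dt_s, dx_m, begin_time)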

    Note: The full HDAS dataset from La Palma used for model training and evaluation is not included due to its large size. It is available upon request from the corresponding author.

    RNN-DAS Model

    The RNN-DAS model is an innovative Deep Learning model based on Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) cells, developed for real-time Volcano-seismic Signal Recognition (VSR) using Distributed Acoustic Sensing (DAS) measurements. The model was trained on a comprehensive dataset of Volcano-Tectonic (VT) events from the 2021 La Palma eruption, recorded by a High-fidelity submarine Distributed Acoustic Sensing array (HDAS) located near the eruption site.

    RNN-DAS can detect VT events, track their temporal evolution, and classify their waveforms with approximately 97% accuracy when tested on a database of over 2 million unique strain waveforms, enabling real-time continuous data predictions. The model has demonstrated excellent generalization capabilities for different time intervals and volcanoes, facilitating continuous, real-time seismic monitoring with minimal computational resources and retraining requirements.

    The model is available in the RNN-DAS GitHub repository:

    https://github.com/Javier-FernandezCarabantes/RNN-DAS

    Fernández-Carabantes, J., Titos, M., D'Auria, L., García, J., García, L., & Benítez, C. (2025). Javier-FernandezCarabantes/RNN-DAS: RNN-DAS v1.1.1 (v1.1.1). Zenodo. https://doi.org/10.5281/zenodo.15858492

    A copy of the repository is also provided here as the RNN-DAS_main.zip file. This archive mirrors the contents of the GitHub repository at the time of submission (v1.0.0). For correct usage, it is recommended to read the included README file. Users are encouraged to refer to the GitHub repository for future updates or changes.

    Usage

    This dataset is provided as a sample data for the RNN-DAS model. It can be used to test and validate our model, as well as for the development of other machine learning approaches.

    Citation

    If you use this dataset in your research, or if you use the RNN-DAS model, proper citation of the related article and this dataset is needed (Fernández-Carabantes et al., 2025):

    Fernández-Carabantes, J., Titos, M., D'Auria, L., García, J., García, L., & Benítez, C. (2025). RNN-DAS: A new deep learning approach for detection and real-time monitoring of volcano-tectonic events using distributed acoustic sensing. Journal of Geophysical Research: Solid Earth, 130, e2025JB031756. https://doi.org/10.1029/2025JB031756

    For further details, please refer to the project documentation or contact the research team (corresponding author email: javierfyc@ugr.es).

  10. US counties COVID 19 dataset

    • kaggle.com
    Updated May 29, 2020
    + more versions
    Cite
    MyrnaMFL (2020). US counties COVID 19 dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/1197018
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 29, 2020
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    MyrnaMFL
    Area covered
    United States
    Description

    From the New York Times GitHub source (CSV: US counties): "The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. We are compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak.

    Since late January, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak.

    We have used this data to power our maps and reporting tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists and government officials who would like access to the data to better understand the outbreak.

    The data begins with the first reported coronavirus case in Washington State on Jan. 21, 2020. We will publish regular updates to the data in this repository.

    United States Data

    Data on cumulative coronavirus cases and deaths can be found in two files for states and counties.

    Each row of data reports cumulative counts based on our best reporting up to the moment we publish an update. We do our best to revise earlier entries in the data when we receive new information."

    The specific data here is the data per US county.

    The CSV link for counties is: https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv
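
    The file can be read straight from that URL with pandas (the repository's county file uses the columns date, county, state, fips, cases, and deaths):

    import pandas as pd

    url = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
    counties = pd.read_csv(url, parse_dates=["date"], dtype={"fips": str})
    # cumulative counts per county on the latest reported date
    latest = counties[counties["date"] == counties["date"].max()]
    print(latest.nlargest(5, "cases")[["county", "state", "cases", "deaths"]])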

  11. Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter...

    • zenodo.org
    application/gzip
    Updated Mar 16, 2021
    + more versions
    Cite
    João Felipe; Leonardo; Vanessa; Juliana (2021). Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks / Understanding and Improving the Quality and Reproducibility of Jupyter Notebooks [Dataset]. http://doi.org/10.5281/zenodo.3519618
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Mar 16, 2021
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    João Felipe; Leonardo; Vanessa; Juliana
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub. Based on the results, we proposed and evaluated Julynter, a linting tool for Jupyter Notebooks.

    Papers:

    This repository contains three files:

    Reproducing the Notebook Study

    The db2020-09-22.dump.gz file contains a PostgreSQL dump of the database, with all the data we extracted from notebooks. For loading it, run:

    gunzip -c db2020-09-22.dump.gz | psql jupyter

    Note that this file contains only the database with the extracted data. The actual repositories are available in a Google Drive folder, which also contains the docker images we used in the reproducibility study. The repositories are stored as content/{hash_dir1}/{hash_dir2}.tar.bz2, where hash_dir1 and hash_dir2 are columns of the repositories table in the database.
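
    A small sketch of how those archive paths can be assembled from the restored database, assuming the psycopg2 driver (the column and table names are taken from the description above):

    import psycopg2

    conn = psycopg2.connect(dbname="jupyter")
    cur = conn.cursor()
    cur.execute("SELECT hash_dir1, hash_dir2 FROM repositories LIMIT 5")
    for hash_dir1, hash_dir2 in cur.fetchall():
        print(f"content/{hash_dir1}/{hash_dir2}.tar.bz2")  # archive path in the Google Drive folder
    conn.close()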

    For scripts, notebooks, and detailed instructions on how to analyze or reproduce the data collection, please check the instructions on the Jupyter Archaeology repository (tag 1.0.0)

    The sample.tar.gz file contains the repositories obtained during the manual sampling.

    Reproducing the Julynter Experiment

    The julynter_reproducibility.tar.gz file contains all the data collected in the Julynter experiment and the analysis notebooks. Reproducing the analysis is straightforward:

    • Uncompress the file: $ tar zxvf julynter_reproducibility.tar.gz
    • Install the dependencies: $ pip install -r julynter/requirements.txt
    • Run the notebooks in order: J1.Data.Collection.ipynb; J2.Recommendations.ipynb; J3.Usability.ipynb.

    The collected data is stored in the julynter/data folder.

    Changelog

    2019/01/14 - Version 1 - Initial version
    2019/01/22 - Version 2 - Update N8.Execution.ipynb to calculate the rate of failure for each reason
    2019/03/13 - Version 3 - Update package for camera ready. Add columns to db to detect duplicates, change notebooks to consider them, and add N1.Skip.Notebook.ipynb and N11.Repository.With.Notebook.Restriction.ipynb.
    2021/03/15 - Version 4 - Add Julynter experiment; Update database dump to include new data collected for the second paper; remove scripts and analysis notebooks from this package (moved to GitHub), add a link to Google Drive with collected repository files

  12. Data from: Beyond Textual Issues: Understanding the Usage and Impact of...

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Jan 21, 2020
    Cite
    Hudson Borges; Rodrigo Brito; Marco Tulio Valente (2020). Beyond Textual Issues: Understanding the Usage and Impact of GitHub Reactions [Dataset]. http://doi.org/10.5281/zenodo.2558596
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Jan 21, 2020
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Hudson Borges; Rodrigo Brito; Marco Tulio Valente
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Recently, GitHub introduced a new social feature, named reactions, which are pictorial characters similar to the emoji symbols widely used nowadays in text-based communications. Particularly, GitHub users can use a set of such symbols to react to issues and pull requests. However, little is known about the real usage and benefits of GitHub reactions. In this paper, we analyze the reactions provided by developers to more than 2.5 million issues and 9.7 million issue comments, in order to answer an extensive list of ten research questions about the usage and adoption of reactions. We show that reactions are being increasingly used by open-source developers. Moreover, we also found that issues with reactions usually take more time to be closed and have longer discussions.

    This dataset contains the data used in the paper "Beyond Textual Issues: Understanding the Usage and Impact of GitHub Reactions", accepted for SBES 2019.

  13. realtime traffic

    • researchdata.edu.au
    • data.act.gov.au
    Updated Jun 23, 2025
    Cite
    ACT Government Open Data (2025). realtime traffic [Dataset]. https://researchdata.edu.au/realtime-traffic/3734239
    Explore at:
    Dataset updated
    Jun 23, 2025
    Dataset provided by
    Data.gov: https://data.gov/
    Authors
    ACT Government Open Data
    Description

    In the ACT, we have Bluetooth detectors placed on certain roads to monitor traffic flow, providing network-wide performance indicators in real time. Details about congestion and travel time can be accessed via the APIs provided in this dataset.

  14. Real-Time Road Conditions

    • data.austintexas.gov
    • datahub.austintexas.gov
    • +2more
    Updated Oct 11, 2025
    + more versions
    Cite
    City of Austin, Texas - data.austintexas.gov (2025). Real-Time Road Conditions [Dataset]. https://data.austintexas.gov/w/ypbq-i42h/7r79-5ncn?cur=DjGplNNa8x8
    Explore at:
    Available download formats: application/geo+json, application/rdfxml, csv, application/rssxml, tsv, kml, kmz, xml
    Dataset updated
    Oct 11, 2025
    Dataset authored and provided by
    City of Austin, Texas - data.austintexas.gov
    Description

    Austin Transportation & Public Works maintains road condition sensors across the city which monitor the temperature and surface condition of roadways. These sensors enable our Mobility Management Center to stay apprised of potential roadway freezing events and intervene when necessary.

    This data is updated continuously every 5 minutes.

    See also the data descriptions from the sensor's instruction manual:

    https://github.com/cityofaustin/atd-road-conditions/blob/production/5433-3X-manual.pdf

  15. GitHub Java Corpus - Function Identifiers

    • live.european-language-grid.eu
    • data.niaid.nih.gov
    Updated Nov 16, 2023
    Cite
    (2023). GitHub Java Corpus - Function Identifiers [Dataset]. https://live.european-language-grid.eu/catalogue/ld/7992
    Explore at:
    Dataset updated
    Nov 16, 2023
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains function identifiers extracted from the GitHub Java Corpus (http://groups.inf.ed.ac.uk/cup/javaGithub/).

    Each line corresponds to a method declaration. A line contains the name of the method declaration followed by the function identifiers (i.e., function calls) contained within the method body.
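
    A minimal parsing sketch under these assumptions (whitespace-separated tokens, one method declaration per line; the file name is a placeholder):

    # Placeholder file name; tokens assumed to be whitespace-separated per the description
    with open("function_identifiers.txt") as f:
        for line in f:
            tokens = line.split()
            if not tokens:
                continue
            method_name, calls = tokens[0], tokens[1:]
            print(method_name, "->", calls)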

    The file embeddings_train.json can be used to train a word/sentence embedding model using the code in the GitHub repository (link below).

    The corpus was used for the experiments in the paper Combining Code Embedding with Static Analysis for Function-Call Completion.

    GitHub repository to replicate the experiments: https://github.com/mweyssow/cse-saner

  16. Social Media Profile Links by Name

    • openwebninja.com
    json
    Updated Feb 2, 2025
    Cite
    OpenWeb Ninja (2025). Social Media Profile Links by Name [Dataset]. https://www.openwebninja.com/api/social-links-search
    Explore at:
    Available download formats: json
    Dataset updated
    Feb 2, 2025
    Dataset authored and provided by
    OpenWeb Ninja
    Area covered
    Worldwide
    Description

    This dataset provides comprehensive social media profile links discovered through real-time web search. It includes profiles from major social networks like Facebook, TikTok, Instagram, Twitter, LinkedIn, YouTube, Pinterest, GitHub and more. The data is gathered through intelligent search algorithms and pattern matching. Users can leverage this dataset for social media research, influencer discovery, social presence analysis, and social media marketing. The API enables efficient discovery of social profiles across multiple platforms. The dataset is delivered in JSON format via a REST API.
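
    A hedged sketch of querying the API with Python's requests library; the query parameter and authentication details are assumptions, so consult the OpenWeb Ninja documentation for the actual interface:

    import requests

    # Hypothetical parameters; the real API may require an API key and different fields
    resp = requests.get(
        "https://www.openwebninja.com/api/social-links-search",
        params={"name": "Jane Doe"},
        timeout=10,
    )
    resp.raise_for_status()
    profiles = resp.json()  # JSON response with discovered profile links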

  17. Three Annotated Anomaly Detection Datasets for Line-Scan Algorithms

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 29, 2024
    Cite
    Garske, Samuel (2024). Three Annotated Anomaly Detection Datasets for Line-Scan Algorithms [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13370799
    Explore at:
    Dataset updated
    Aug 29, 2024
    Dataset provided by
    Mao, Yiwei
    Garske, Samuel
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summary

    This dataset contains two hyperspectral images and one multispectral image for anomaly detection, along with their corresponding binary pixel masks. They were initially used for real-time anomaly detection in line-scanning, but they can be used for any anomaly detection task.

    They are in .npy file format (tiff or geotiff variants will be added in the future), with the image datasets being in the order (height, width, channels). The SNP dataset was collected using sentinelhub, and the Synthetic dataset was collected from AVIRIS. The Python code used to analyse these datasets can be found at: https://github.com/WiseGamgee/HyperAD

    How to Get Started

    All that is needed to load these datasets is Python (preferably 3.8+) and the NumPy package. Example code for loading the beach dataset, assuming you put it in a folder called "data" next to the Python script:

    import numpy as np

    # Load image file
    hsi_array = np.load("data/beach_hsi.npy")
    n_pixels, n_lines, n_bands = hsi_array.shape
    print(f"This dataset has {n_pixels} pixels, {n_lines} lines, and {n_bands} bands.")

    # Load image mask
    mask_array = np.load("data/beach_mask.npy")
    m_pixels, m_lines = mask_array.shape
    print(f"The corresponding anomaly mask is {m_pixels} pixels by {m_lines} lines.")

    Citing the Datasets

    If you use any of these datasets, please cite the following paper:

    @article{garske2024erx,
      title={ERX - a Fast Real-Time Anomaly Detection Algorithm for Hyperspectral Line-Scanning},
      author={Garske, Samuel and Evans, Bradley and Artlett, Christopher and Wong, KC},
      journal={arXiv preprint arXiv:2408.14947},
      year={2024},
    }

    If you use the beach dataset please cite the following paper as well (original source):

    @article{mao2022openhsi,
      title={OpenHSI: A complete open-source hyperspectral imaging solution for everyone},
      author={Mao, Yiwei and Betters, Christopher H and Evans, Bradley and Artlett, Christopher P and Leon-Saval, Sergio G and Garske, Samuel and Cairns, Iver H and Cocks, Terry and Winter, Robert and Dell, Timothy},
      journal={Remote Sensing},
      volume={14},
      number={9},
      pages={2244},
      year={2022},
      publisher={MDPI}
    }

  18. Accurate normalization of real-time quantitative RT-PCR data by geometric...

    • healthdata.gov
    application/rdfxml +5
    Updated Sep 10, 2025
    Cite
    (2025). Accurate normalization of real-time quantitative RT-PCR data by geometric averaging of multiple internal control genes - f4wx-3iqh - Archive Repository [Dataset]. https://healthdata.gov/dataset/Accurate-normalization-of-real-time-quantitative-R/ep29-w9cc
    Explore at:
    Available download formats: tsv, application/rssxml, csv, json, xml, application/rdfxml
    Dataset updated
    Sep 10, 2025
    Description

    This dataset tracks the updates made on the dataset "Accurate normalization of real-time quantitative RT-PCR data by geometric averaging of multiple internal control genes" as a repository for previous versions of the data and metadata.

  19. [Decommissioned] Intellectual Property Government Open Live Data

    • researchdata.edu.au
    Updated Feb 3, 2016
    + more versions
    Cite
    IP Australia (2016). [Decommissioned] Intellectual Property Government Open Live Data [Dataset]. https://researchdata.edu.au/decommissioned-intellectual-property-live-data/2989210
    Explore at:
    Dataset updated
    Feb 3, 2016
    Dataset provided by
    Data.gov: https://data.gov/
    Authors
    IP Australia
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Area covered
    Description

    Important Notice

    This dataset is not being updated currently due to data migration work at IP Australia. We are sorry for the inconvenience and we will update this page once the migration is complete.

    The Intellectual Property Government Open Live Data (IPGOLD) includes over 100 years of Intellectual Property (IP) rights administered by IP Australia, comprising patents, trade marks, designs and plant breeder's rights. The data is highly detailed, including information on each aspect of the application process, from application through to granting of IP rights. We have published a paper to accompany IPGOLD which describes the data and illustrates its use, as well as a technical paper on the firm matching.

    IPGOLD is inherently the same data as the IPGOD data set, with a weekly update instead of the annual snapshot available in IPGOD. Many of the scripts of IPGOLD are still being developed and tested. As such, IPGOLD should be considered a beta release.

  20. rt-me-fMRI: A task and resting state dataset for real-time, multi-echo fMRI...

    • b2find.eudat.eu
    Updated Aug 7, 2024
    + more versions
    Cite
    (2024). rt-me-fMRI: A task and resting state dataset for real-time, multi-echo fMRI methods development and validation - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/2f223bc9-110f-5779-94cc-ca197fe53ceb
    Explore at:
    Dataset updated
    Aug 7, 2024
    Description

    rt-me-fMRI is a multi-echo functional magnetic resonance imaging dataset (N=28 healthy volunteers) with four task-based and two resting state runs. Its main purpose is to advance the development of methods for real-time multi-echo fMRI analysis with applications in neurofeedback, real-time quality control, and adaptive paradigms, although the variety of experimental task paradigms can support multiple use cases. Tasks include finger tapping, emotional face and shape matching, imagined finger tapping and imagined emotion processing. Further information is available at https://github.com/jsheunis/rt-me-fMRI

    IMPORTANT FOR DATASET DOWNLOAD: Due to an issue with the current installation of Dataverse, it is not currently possible to download the full rt-me-fMRI dataset in bulk. This issue is scheduled to be resolved in early 2021. Individual downloads or downloading small sets of files is currently possible, although cumbersome. In order to download the full dataset in bulk, please request access to the dataset on this page. You will then be required to complete and sign the Data Use Agreement, after which you will be provided with a secure download link for the full dataset.
