100+ datasets found
  1. Real Indian users on Github

    • kaggle.com
    Updated Oct 6, 2024
    Cite
    Archit Tyagi (2024). Real Indian users on Github [Dataset]. https://www.kaggle.com/datasets/archittyagi108/real-indian-users-on-github
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 6, 2024
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Archit Tyagi
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    India
    Description

    📊 GitHub Indian Users Dataset

    Overview

    This dataset provides insights into the Indian developer community on GitHub, one of the world’s largest platforms for developers to collaborate, share, and contribute to open-source projects. Whether you're interested in analyzing trends, understanding community growth, or identifying popular programming languages, this dataset offers a comprehensive look at the profiles of GitHub users from India.

    đŸ§‘â€đŸ’» Dataset Contents

    The dataset includes anonymized profile information for a diverse range of GitHub users based in India. Key features include:
    - Username: Unique identifier for each user (anonymized)
    - Location: City or region within India
    - Programming Languages: Most commonly used languages per user
    - Repositories: Public repositories owned and contributed to
    - Followers and Following: Social network connections within the platform
    - GitHub Join Date: Date the user joined GitHub
    - Organizations: Affiliated organizations (if publicly available)

    🌟 Source and Inspiration

    This dataset is curated from publicly available GitHub profiles with a specific focus on Indian users. It is inspired by the need to understand the growth of the tech ecosystem in India, including the languages, tools, and topics that are currently popular among Indian developers. This dataset aims to provide valuable insights for recruiters, data scientists, and anyone interested in the open-source contributions of Indian developers.

    Potential Use Cases

    1. Trend Analysis: Identify popular programming languages, tech stacks, and frameworks among Indian developers.
    2. Community Growth: Analyze how the Indian developer community has grown over time on GitHub.
    3. Social Network Analysis: Understand the follower and following patterns to uncover influential developers within the Indian tech community.
    4. Regional Insights: Discover which cities or regions in India have the most active GitHub users.
    5. Career Development: Insights for recruiters looking to identify and understand potential talent pools in India.

    💡 Ideal for

    This dataset is perfect for:
    - Data scientists looking to explore and visualize developer trends
    - Recruiters interested in talent scouting within the Indian tech ecosystem
    - Tech enthusiasts who want to explore the dynamics of India's open-source community
    - Students and educators looking for real-world data to practice analysis and modeling
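
    For a quick start, here is a minimal pandas sketch for exploring the dataset; the file name and column labels are assumptions based on the feature list above, so adjust them to the actual CSV:

    import pandas as pd

    # Hypothetical file and column names, inferred from the feature list above
    users = pd.read_csv("indian_github_users.csv")
    print(users["Location"].value_counts().head(10))  # most active cities/regions
    print(users["Programming Languages"].value_counts().head(10))  # most common languages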

  2. Data from: A large synthetic dataset for machine learning applications in...

    • zenodo.org
    csv, json, png, zip
    Updated Mar 25, 2025
    Cite
    Marc Gillioz; Guillaume Dubuis; Philippe Jacquod (2025). A large synthetic dataset for machine learning applications in power transmission grids [Dataset]. http://doi.org/10.5281/zenodo.13378476
    Explore at:
    Available download formats: zip, png, csv, json
    Dataset updated
    Mar 25, 2025
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Marc Gillioz; Guillaume Dubuis; Philippe Jacquod
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    With the ongoing energy transition, power grids are evolving fast. They operate more and more often close to their technical limits, under increasingly volatile conditions. Fast, essentially real-time computational approaches to evaluate their operational safety, stability and reliability are therefore highly desirable. Machine learning methods have been advocated to solve this challenge; however, they are heavy consumers of training and testing data, while historical operational data for real-world power grids are hard, if not impossible, to access.

    This dataset contains long time series for production, consumption, and line flows, amounting to 20 years of data with a time resolution of one hour, for several thousand loads and several hundred generators of various types representing the ultra-high-voltage transmission grid of continental Europe. The synthetic time series have been statistically validated against real-world data.

    Data generation algorithm

    The algorithm is described in a Nature Scientific Data paper. It relies on the PanTaGruEl model of the European transmission network -- the admittance of its lines as well as the location, type and capacity of its power generators -- and aggregated data gathered from the ENTSO-E transparency platform, such as power consumption aggregated at the national level.

    Network

    The network information is encoded in the file europe_network.json. It is given in PowerModels format, which is itself derived from MatPower and compatible with PandaPower. The network features 7822 power lines and 553 transformers connecting 4097 buses, to which are attached 815 generators of various types.
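
    As a rough sketch, the component counts can be checked by loading the file with plain Python; the keys ('bus', 'gen', 'branch') follow the usual PowerModels conventions and are an assumption to verify against the actual file:

    import json

    with open('europe_network.json') as f:
        net = json.load(f)

    # PowerModels-style networks store components in dictionaries keyed by id
    print(len(net['bus']), 'buses,', len(net['gen']), 'generators,', len(net['branch']), 'branches')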

    Time series

    The time series forming the core of this dataset are given in CSV format. Each CSV file is a table with 8736 rows, one for each hourly time step of a 364-day year. All years are truncated to exactly 52 weeks of 7 days, and start on a Monday (the load profiles are typically different during weekdays and weekends). The number of columns depends on the type of table: there are 4097 columns in load files, 815 for generators, and 8375 for lines (including transformers). Each column is described by a header corresponding to the element identifier in the network file. All values are given in per-unit, both in the model file and in the tables, i.e. they are multiples of a base unit taken to be 100 MW.

    There are 20 tables of each type, labeled with a reference year (2016 to 2020) and an index (1 to 4), zipped into archive files arranged by year. This amounts to a total of 20 years of synthetic data. When using load, generator, and line profiles together, it is important to use the same label: for instance, the files loads_2020_1.csv, gens_2020_1.csv, and lines_2020_1.csv represent the same year of the dataset, whereas gens_2020_2.csv is unrelated (it actually shares some features, such as nuclear profiles, but it is based on a dispatch with distinct loads).
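
    For example, a matched year of the dataset can be loaded and converted from per-unit to MW as follows (a sketch assuming the CSV files have been extracted from the yearly archives):

    import pandas as pd

    label = '2020_1'  # reference year and index; use the same label for all three tables
    loads = pd.read_csv(f'loads_{label}.csv') * 100  # per-unit -> MW (base unit: 100 MW)
    gens = pd.read_csv(f'gens_{label}.csv') * 100
    lines = pd.read_csv(f'lines_{label}.csv') * 100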

    Usage

    The time series can be used without reference to the network file, simply using all or a selection of columns of the CSV files, depending on the needs. We show below how to select series from a particular country, and how to aggregate hourly time steps into days or weeks. These examples use Python and the data analysis library pandas, but other frameworks (Matlab, Julia) can be used as well. Since all the yearly time series are periodic, it is always possible to define a coherent time window modulo the length of the series.
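
    For instance, a time window that wraps around the year boundary can be built with indices taken modulo the series length:

    import pandas as pd

    loads = pd.read_csv('loads_2016_1.csv')
    length = 24 * 364  # series length in hours
    # one week starting three days before the end of the year, wrapping around
    idx = [t % length for t in range(length - 24 * 3, length - 24 * 3 + 24 * 7)]
    window = loads.iloc[idx]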

    Selecting a particular country

    This example illustrates how to select generation data for Switzerland in Python. This can be done without parsing the network file, using instead gens_by_country.csv, which contains a list of all generators for each country in the network. We start by importing the pandas library and reading the column of the file corresponding to Switzerland (country code CH):

    import pandas as pd
    CH_gens = pd.read_csv('gens_by_country.csv', usecols=['CH'], dtype=str)

    The object created in this way is a DataFrame with some null values (not all countries have the same number of generators). It can be turned into a list with:

    CH_gens_list = CH_gens.dropna().squeeze().to_list()

    Finally, we can import all the time series of Swiss generators from a given data table with

    pd.read_csv('gens_2016_1.csv', usecols=CH_gens_list)

    The same procedure can be applied to loads using the list contained in the file loads_by_country.csv.

    Averaging over time

    This second example shows how to change the time resolution of the series. Suppose that we are interested in all the loads from a given table, which are given by default with a one-hour resolution:

    hourly_loads = pd.read_csv('loads_2018_3.csv')

    To get a daily average of the loads, we can use:

    daily_loads = hourly_loads.groupby([t // 24 for t in range(24 * 364)]).mean()

    This results in series of length 364. To average further over entire weeks and get series of length 52, we use:

    weekly_loads = hourly_loads.groupby([t // (24 * 7) for t in range(24 * 364)]).mean()

    Source code

    The code used to generate the dataset is freely available at https://github.com/GeeeHesso/PowerData. It consists of two packages and several documentation notebooks. The first package, written in Python, provides functions to handle the data and to generate synthetic series based on historical data. The second package, written in Julia, is used to perform the optimal power flow. The documentation, in the form of Jupyter notebooks, contains numerous examples of how to use both packages. The entire workflow used to create this dataset is also provided, starting from raw ENTSO-E data files and ending with the synthetic dataset given in the repository.

    Funding

    This work was supported by the Cyber-Defence Campus of armasuisse and by an internal research grant of the Engineering and Architecture domain of HES-SO.

  3. Coronavirus COVID-19 Global Cases by the Center for Systems Science and...

    • github.com
    • systems.jhu.edu
    • +1more
    Cite
    Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE), Coronavirus COVID-19 Global Cases by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU) [Dataset]. https://github.com/CSSEGISandData/COVID-19
    Explore at:
    Dataset provided by
    Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE)
    Area covered
    Global
    Description

    2019 Novel Coronavirus COVID-19 (2019-nCoV) Visual Dashboard and Map:
    https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6

    • Confirmed Cases by Country/Region/Sovereignty
    • Confirmed Cases by Province/State/Dependency
    • Deaths
    • Recovered

    Downloadable data:
    https://github.com/CSSEGISandData/COVID-19

    Additional Information about the Visual Dashboard:
    https://systems.jhu.edu/research/public-health/ncov

  4. Coronavirus (Covid-19) Data in the United States

    • nytimes.com
    • openicpsr.org
    • +2more
    Cite
    New York Times, Coronavirus (Covid-19) Data in the United States [Dataset]. https://www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html
    Explore at:
    Dataset provided by
    New York Times
    Description

    The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. We are compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak.

    Since late January, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak.

    We have used this data to power our maps and reporting tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists and government officials who would like access to the data to better understand the outbreak.

    The data begins with the first reported coronavirus case in Washington State on Jan. 21, 2020. We will publish regular updates to the data in this repository.

  5. Data from: Large-scale Ridesharing DARP Instances Based on Real Travel...

    • data.mendeley.com
    • ieee-dataport.org
    Updated Dec 5, 2023
    + more versions
    Cite
    David Fiedler (2023). Large-scale Ridesharing DARP Instances Based on Real Travel Demand [Dataset]. http://doi.org/10.17632/fj6nwvbt48.1
    Explore at:
    Dataset updated
    Dec 5, 2023
    Authors
    David Fiedler
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset presents a set of large-scale Dial-a-Ride Problem (DARP) instances. The instances were created as a standardized set of ridesharing DARP problems for the purpose of benchmarking and comparing different solution methods.

    The instances are based on real demand and realistic travel time data from 3 different US cities: Chicago, New York City, and Washington, DC. The instances consist of real travel requests from the selected period, positions of vehicles with their capacities, and realistic shortest travel times between all pairs of locations in each city.

    The instances and results of two solution methods, the Insertion Heuristic and the optimal Vehicle-group Assignment method, can be found in the dataset.

    📄 Paper: arXiv:2305.18859
    📁 Data: DOI:10.5281/zenodo.7986103
    👩‍💻 Code: https://github.com/aicenter/Ridesharing_DARP_instances

    The dataset was presented at the IEEE International Conference on Intelligent Transportation Systems (ITSC 2023) in Bilbao, Bizkaia, Spain, 24-28 September 2023 (Session CON03).

  6. Data from: Covid19Kerala.info-Data: A collective open dataset of COVID-19...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 6, 2020
    Cite
    Sharadh Manian (2020). Covid19Kerala.info-Data: A collective open dataset of COVID-19 outbreak in the south Indian state of Kerala [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3818096
    Explore at:
    Dataset updated
    Sep 6, 2020
    Dataset provided by
    Shabeesh Balan
    Manoj Karingamadathil
    Akhil Balakrishnan
    Jeevan Uthaman
    Jijo Ulahannan
    Sharadh Manian
    Sreehari Pillai
    Nikhil Narayanan
    Hritwik N Edavalath
    Sreekanth Chaliyeduth
    Musfir Mohammed
    Nishad Thalhath
    Kumar Sujith
    Sooraj P Suresh
    E Rajeevan
    Sindhu Joseph
    Prem Prabhakaran
    Unnikrishnan Sureshkumar
    Neetha Nanoth Vellichirammal
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Area covered
    South India, India, Kerala
    Description

    Covid19Kerala.info-Data is a consolidated multi-source open dataset of metadata from the COVID-19 outbreak in the Indian state of Kerala. It is created and maintained by volunteers of ‘Collective for Open Data Distribution-Keralam’ (CODD-K), a nonprofit consortium of individuals formed for the distribution and longevity of open datasets. Covid19Kerala.info-Data covers a set of correlated temporal and spatial metadata of SARS-CoV-2 infections and prevention measures in Kerala. Static snapshot releases of this dataset are manually produced from a live database maintained as a set of publicly accessible Google sheets. This dataset is made available under the Open Data Commons Attribution License v1.0 (ODC-BY 1.0).

    Schema and data package

    A datapackage with schema definition is accessible at https://codd-k.github.io/covid19kerala.info-data/datapackage.json. The provided datapackage and schema are based on the Frictionless Data Package specification.
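
    As a minimal sketch, the datapackage can be loaded directly from that URL with the frictionless-py package (an assumption; any Data Package-aware tool works):

    from frictionless import Package

    pkg = Package("https://codd-k.github.io/covid19kerala.info-data/datapackage.json")
    print([resource.name for resource in pkg.resources])  # list the available data facets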

    Temporal and Spatial Coverage

    This dataset covers the COVID-19 outbreak and related data from the state of Kerala, India, from January 31, 2020 until the date of publication of this snapshot. The dataset shall be maintained throughout the entirety of the COVID-19 outbreak.

    The spatial coverage of the data lies within the geographical boundaries of the Kerala state, which includes its 14 administrative subdivisions. The state is further divided into Local Self Governing (LSG) Bodies. Reference to this spatial information is included on appropriate data facets. Spatial information on regions outside Kerala is mentioned where available, but only as a reference to the possible origins of infection clusters or the movement of individuals.

    Longevity and Provenance

    The dataset snapshot releases are published and maintained in a designated GitHub repository maintained by the CODD-K team. Periodic snapshots from the live database will be released at regular intervals. The GitHub commit logs for the repository will be maintained as a record of provenance, and an archived repository will be maintained at the end of the project lifecycle for the longevity of the dataset.

    Data Stewardship

    CODD-K expects all administrators, managers, and users of its datasets to manage, access, and utilize them in a manner that is consistent with the consortium’s need for security and confidentiality and with relevant legal frameworks within all geographies, especially Kerala and India. While acting as a responsible steward in maintaining and making this dataset accessible, CODD-K disclaims all liability for any damages caused by inaccuracies in the dataset.

    License

    This dataset is made available by the CODD-K consortium under ODC-BY 1.0 license. The Open Data Commons Attribution License (ODC-By) v1.0 ensures that users of this dataset are free to copy, distribute and use the dataset to produce works and even to modify, transform and build upon the database, as long as they attribute the public use of the database or works produced from the same, as mentioned in the citation below.

    Disclaimer

    Covid19Kerala.info-Data is provided under the ODC-BY 1.0 license as-is. Though every attempt is made to ensure that the data is error-free and up to date, the CODD-K consortium does not bear any responsibility for inaccuracies in the dataset or any losses, monetary or otherwise, that users of this dataset may incur.

  7. Dataset of Automatically Orchestrable GitHub Projects

    • data.europa.eu
    unknown
    Updated Nov 24, 2023
    Cite
    Zenodo (2023). Dataset of Automatically Orchestrable GitHub Projects [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-7194189?locale=no
    Explore at:
    Available download formats: unknown (11388978)
    Dataset updated
    Nov 24, 2023
    Dataset authored and provided by
    Zenodo: http://zenodo.org/
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset accompanies the submission "Generating representative, live network traffic out of millions of code repositories" at HotNets'22: The 21st ACM Workshop on Hot Topics in Networks. Please see the files:
    - list_of_github_repositories.txt for a list of GitHub repositories that we found containing a docker-compose*.yml file
    - list_of_executed_repositories.csv for more detailed information on the success of capturing traffic with specific orchestration files found in ~67% of the repositories

    If you use our dataset, please cite our work as follows: Tobias Bühler, Roland Schmid, Sandro Lutz, and Laurent Vanbever. 2022. Generating representative, live network traffic out of millions of code repositories. In The 21st ACM Workshop on Hot Topics in Networks (HotNets ’22), November 14–15, 2022, Austin, TX, USA. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3563766.3564084

  8. Real-time data from public transport via MQTT & ESP32 XIAO & OpenDATA |...

    • gimi9.com
    Cite
    Real-time data from public transport via MQTT & ESP32 XIAO & OpenDATA | gimi9.com [Dataset]. https://gimi9.com/dataset/eu_bd836e2c-70ac-44cc-8240-e93aa8ea676c/
    Explore at:
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Real-time data from the Vienna lines are processed in Node-RED and sent via MQTT to an ESP32. The content is displayed on a 0.91" OLED display. Instructions for the evaluation in Node-RED and the implementation of the ESP32 code: Part 1: Part 2: Wiring and code can be found at github.com/pixeledi. Have fun replicating!

  9. Data from: RNN-DAS: A New Deep Learning Approach for Detection and Real-Time...

    • zenodo.org
    zip
    Updated Sep 20, 2025
    Cite
    Javier Fernandez-Carabantes; Manuel Titos; Luca D'Auria; Jesús García; Luz García; Carmen Benítez (2025). RNN-DAS: A New Deep Learning Approach for Detection and Real-Time Monitoring of Volcano-Tectonic Events Using Distributed Acoustic Sensing [Dataset]. http://doi.org/10.5281/zenodo.15105596
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 20, 2025
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Javier Fernandez-Carabantes; Manuel Titos; Luca D'Auria; Jesús García; Luz García; Carmen Benítez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    HDAS Data from La Palma - DigiVolCan Project

    This repository contains various co-eruptive VT datasets collected during the 2021 eruption at La Palma, Spain, by an underwater High-fidelity Distributed Acoustic Sensing (HDAS) array. These datasets have been used to train and test the RNN-DAS model, a deep learning framework designed for volcano-seismic event detection using DAS data.

    This data was collected as part of the DigiVolCan project, which is a collaboration between the University of Granada, the Canary Islands Volcanological Institute (INVOLCAN), the Institute of Technological and Renewable Energies (ITER), the University of La Laguna, and Aragón Photonics. It is funded by the Ministry of Science, Innovation, and Universities / State Research Agency (MICIU/AEI) of Spain and the European Union through the Recovery, Transformation, and Resilience Plan, Next Generation EU Funds. The project reference is PLEC2022-009271, funded by MICIU/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR.

    Dataset Description

    The shared dataset contains HDAS data recorded over several periods, with one file per minute. Each file is in .h5 format and follows the structure:

    file_path
    │
    └───"data" (dataset)
      │
      ├───data (2D matrix of strain rate)
      │  └───[channels x time_samples]
      │
      ├───attrs
         │
         ├───"dt_s" (temporal sampling in seconds)
         ├───"dx_m" (spatial sampling in meters)
         └───"begin_time" (start date in 'YYYY-MM-DDTHH:MM:SS.SSS' format)
    

    Five datasets are provided as separate compressed .zip archives due to their size. Each archive contains DAS waveform data in the HDF5 (.h5) format described above, organized in one-minute files. These datasets correspond to figures presented in the RNN-DAS model article and are intended to facilitate reproducibility and further analysis.

    Dataset 1 – Main event and aftershocks (Figure 5)

    This dataset contains a one-hour DAS recording from November 30, between 07:00 and 08:00 UTC, featuring a main seismic event with magnitude Ml = 3.22 along with several aftershocks.

    Dataset 2 – Continuous Test Segment (Figure 6)

    This dataset contains one hour of continuous DAS recordings from October 29, between 04:00 and 05:00 UTC.

    Dataset 3 – Events with Varying SNR and Magnitude (Figure 4)

    This dataset includes three separate 3-minute DAS recordings, each corresponding to a different seismic event with distinct characteristics. The selected events represent a range of conditions, including attenuated signals, low signal-to-noise ratio (SNR), and nearby high-SNR events.

    Dataset 4 – High-Magnitude Event Example (Figure 3)

    This dataset contains a 3-minute DAS recording corresponding to a seismic event with magnitude Ml = 4.23. This example demonstrates the model’s response to a clear, high-magnitude event.

    Dataset 5 – Moderate Events and Noise-Only Sample (Figure 7)

    This dataset includes four separate DAS recordings: three corresponding to moderate seismic events and another containing only seismic noise.

    All .zip archives can be easily decompressed and used directly.
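
    A minimal sketch for reading one of the decompressed one-minute files with h5py, following the structure described above (the file name is a placeholder):

    import h5py

    with h5py.File("example_minute.h5", "r") as f:  # placeholder file name
        dset = f["data"]
        strain_rate = dset[...]  # 2D matrix: [channels x time_samples]
        dt_s = dset.attrs["dt_s"]  # temporal sampling in seconds
        dx_m = dset.attrs["dx_m"]  # spatial sampling in meters
        begin_time = dset.attrs["begin_time"]  # 'YYYY-MM-DDTHH:MM:SS.SSS'
    print(strain_rate.shape, dt_s, dx_m, begin_time)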

    Note: The full HDAS dataset from La Palma used for model training and evaluation is not included due to its large size. It is available upon request from the corresponding author.

    RNN-DAS Model

    The RNN-DAS model is an innovative Deep Learning model based on Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) cells, developed for real-time Volcano-seismic Signal Recognition (VSR) using Distributed Acoustic Sensing (DAS) measurements. The model was trained on a comprehensive dataset of Volcano-Tectonic (VT) events from the 2021 La Palma eruption, recorded by a High-fidelity submarine Distributed Acoustic Sensing array (HDAS) located near the eruption site.

    RNN-DAS can detect VT events, track their temporal evolution, and classify their waveforms with approximately 97% accuracy when tested on a database of over 2 million unique strain waveforms, enabling real-time continuous data predictions. The model has demonstrated excellent generalization capabilities for different time intervals and volcanoes, facilitating continuous, real-time seismic monitoring with minimal computational resources and retraining requirements.

    The model is available in the RNN-DAS GitHub repository:

    https://github.com/Javier-FernandezCarabantes/RNN-DAS

    Fernández-Carabantes, J., Titos, M., D'Auria, L., García, J., García, L., & Benítez, C. (2025). Javier-FernandezCarabantes/RNN-DAS: RNN-DAS v1.1.1 (v1.1.1). Zenodo. https://doi.org/10.5281/zenodo.15858492

    A copy of the repository is also provided here as the RNN-DAS_main.zip file. This archive mirrors the contents of the GitHub repository at the time of submission (v1.0.0). For correct usage, it is recommended to read the included README file. Users are encouraged to refer to the GitHub repository for future updates or changes.

    Usage

    This dataset is provided as a sample data for the RNN-DAS model. It can be used to test and validate our model, as well as for the development of other machine learning approaches.

    Citation

    If you use this dataset in your research, or if you use the RNN-DAS model, proper citation of the related article and this dataset is needed (Fernández-Carabantes et al., 2025):

    Fernández-Carabantes, J., Titos, M., D'Auria, L., García, J., García, L., & Benítez, C. (2025). RNN-DAS: A new deep learning approach for detection and real-time monitoring of volcano-tectonic events using distributed acoustic sensing. Journal of Geophysical Research: Solid Earth, 130, e2025JB031756. https://doi.org/10.1029/2025JB031756

    For further details, please refer to the project documentation or contact the research team (corresponding author email: javierfyc@ugr.es).

  10. US counties COVID 19 dataset

    • kaggle.com
    Updated May 29, 2020
    + more versions
    Cite
    MyrnaMFL (2020). US counties COVID 19 dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/1197018
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 29, 2020
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    MyrnaMFL
    Area covered
    United States
    Description

    From the New York Times GitHub source (CSV: US counties): "The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. We are compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak.

    Since late January, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak.

    We have used this data to power our maps and reporting tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists and government officials who would like access to the data to better understand the outbreak.

    The data begins with the first reported coronavirus case in Washington State on Jan. 21, 2020. We will publish regular updates to the data in this repository.

    United States Data

    Data on cumulative coronavirus cases and deaths can be found in two files for states and counties.

    Each row of data reports cumulative counts based on our best reporting up to the moment we publish an update. We do our best to revise earlier entries in the data when we receive new information."

    The specific data here is the data per US county.

    The CSV link for counties is: https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv
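
    The file can be read straight from that URL with pandas (the repository's county file uses the columns date, county, state, fips, cases, and deaths):

    import pandas as pd

    url = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
    counties = pd.read_csv(url, parse_dates=["date"], dtype={"fips": str})
    # cumulative counts per county on the latest reported date
    latest = counties[counties["date"] == counties["date"].max()]
    print(latest.nlargest(5, "cases")[["county", "state", "cases", "deaths"]])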

  11. Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter...

    • zenodo.org
    application/gzip
    Updated Mar 16, 2021
    + more versions
    Cite
    João Felipe; Leonardo; Vanessa; Juliana (2021). Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks / Understanding and Improving the Quality and Reproducibility of Jupyter Notebooks [Dataset]. http://doi.org/10.5281/zenodo.3519618
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Mar 16, 2021
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    João Felipe; Leonardo; Vanessa; Juliana
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub. Based on the results, we proposed and evaluated Julynter, a linting tool for Jupyter Notebooks.

    Papers:

    This repository contains three files:

    Reproducing the Notebook Study

    The db2020-09-22.dump.gz file contains a PostgreSQL dump of the database, with all the data we extracted from notebooks. For loading it, run:

    gunzip -c db2020-09-22.dump.gz | psql jupyter

    Note that this file contains only the database with the extracted data. The actual repositories are available in a Google Drive folder, which also contains the docker images we used in the reproducibility study. The repositories are stored as content/{hash_dir1}/{hash_dir2}.tar.bz2, where hash_dir1 and hash_dir2 are columns of the repositories table in the database.
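
    A small sketch of how those archive paths can be assembled from the restored database, assuming the psycopg2 driver (the column and table names are taken from the description above):

    import psycopg2

    conn = psycopg2.connect(dbname="jupyter")
    cur = conn.cursor()
    cur.execute("SELECT hash_dir1, hash_dir2 FROM repositories LIMIT 5")
    for hash_dir1, hash_dir2 in cur.fetchall():
        print(f"content/{hash_dir1}/{hash_dir2}.tar.bz2")  # archive path in the Google Drive folder
    conn.close()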

    For scripts, notebooks, and detailed instructions on how to analyze or reproduce the data collection, please check the instructions on the Jupyter Archaeology repository (tag 1.0.0)

    The sample.tar.gz file contains the repositories obtained during the manual sampling.

    Reproducing the Julynter Experiment

    The julynter_reproducibility.tar.gz file contains all the data collected in the Julynter experiment and the analysis notebooks. Reproducing the analysis is straightforward:

    • Uncompress the file: $ tar zxvf julynter_reproducibility.tar.gz
    • Install the dependencies: $ pip install -r julynter/requirements.txt
    • Run the notebooks in order: J1.Data.Collection.ipynb; J2.Recommendations.ipynb; J3.Usability.ipynb.

    The collected data is stored in the julynter/data folder.

    Changelog

    2019/01/14 - Version 1 - Initial version
    2019/01/22 - Version 2 - Update N8.Execution.ipynb to calculate the rate of failure for each reason
    2019/03/13 - Version 3 - Update package for camera ready. Add columns to db to detect duplicates, change notebooks to consider them, and add N1.Skip.Notebook.ipynb and N11.Repository.With.Notebook.Restriction.ipynb.
    2021/03/15 - Version 4 - Add Julynter experiment; Update database dump to include new data collected for the second paper; remove scripts and analysis notebooks from this package (moved to GitHub), add a link to Google Drive with collected repository files

  12. Data from: Beyond Textual Issues: Understanding the Usage and Impact of...

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Jan 21, 2020
    Cite
    Hudson Borges; Rodrigo Brito; Marco Tulio Valente (2020). Beyond Textual Issues: Understanding the Usage and Impact of GitHub Reactions [Dataset]. http://doi.org/10.5281/zenodo.2558596
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Jan 21, 2020
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Hudson Borges; Rodrigo Brito; Marco Tulio Valente
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Recently, GitHub introduced a new social feature, named reactions, which are pictorial characters similar to the emoji symbols widely used nowadays in text-based communications. Particularly, GitHub users can use a set of such symbols to react to issues and pull requests. However, little is known about the real usage and benefits of GitHub reactions. In this paper, we analyze the reactions provided by developers to more than 2.5 million issues and 9.7 million issue comments, in order to answer an extensive list of ten research questions about the usage and adoption of reactions. We show that reactions are being increasingly used by open-source developers. Moreover, we also found that issues with reactions usually take more time to be closed and have longer discussions.

    This dataset contains the data used in the paper "Beyond Textual Issues: Understanding the Usage and Impact of GitHub Reactions", accepted for SBES 2019.

  13. realtime traffic

    • researchdata.edu.au
    • data.act.gov.au
    Updated Jun 23, 2025
    Cite
    ACT Government Open Data (2025). realtime traffic [Dataset]. https://researchdata.edu.au/realtime-traffic/3734239
    Explore at:
    Dataset updated
    Jun 23, 2025
    Dataset provided by
    Data.gov: https://data.gov/
    Authors
    ACT Government Open Data
    Description

    In the ACT, we have Bluetooth detectors placed on certain roads to monitor traffic flow, providing network-wide performance indicators in real time. Details about congestion and travel time can be accessed via the APIs provided in this dataset.

  14. Real-Time Road Conditions

    • data.austintexas.gov
    • datahub.austintexas.gov
    • +2more
    Updated Oct 11, 2025
    + more versions
    Cite
    City of Austin, Texas - data.austintexas.gov (2025). Real-Time Road Conditions [Dataset]. https://data.austintexas.gov/w/ypbq-i42h/7r79-5ncn?cur=DjGplNNa8x8
    Explore at:
    Available download formats: application/geo+json, application/rdfxml, csv, application/rssxml, tsv, kml, kmz, xml
    Dataset updated
    Oct 11, 2025
    Dataset authored and provided by
    City of Austin, Texas - data.austintexas.gov
    Description

    Austin Transportation & Public Works maintains road condition sensors across the city which monitor the temperature and surface condition of roadways. These sensors enable our Mobility Management Center to stay apprised of potential roadway freezing events and intervene when necessary.

    This data is updated continuously every 5 minutes.

    See also the data descriptions from the sensor's instruction manual:

    https://github.com/cityofaustin/atd-road-conditions/blob/production/5433-3X-manual.pdf

  15. GitHub Java Corpus - Function Identifiers

    • live.european-language-grid.eu
    • data.niaid.nih.gov
    Updated Nov 16, 2023
    Cite
    (2023). GitHub Java Corpus - Function Identifiers [Dataset]. https://live.european-language-grid.eu/catalogue/ld/7992
    Explore at:
    Dataset updated
    Nov 16, 2023
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains function identifiers extracted from the GitHub Java Corpus (http://groups.inf.ed.ac.uk/cup/javaGithub/).

    Each line corresponds to a method declaration. A line contains the name of the method declaration followed by the function identifiers (i.e., function calls) contained within the method body.
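
    A minimal parsing sketch under these assumptions (whitespace-separated tokens, one method declaration per line; the file name is a placeholder):

    # Placeholder file name; tokens assumed to be whitespace-separated per the description
    with open("function_identifiers.txt") as f:
        for line in f:
            tokens = line.split()
            if not tokens:
                continue
            method_name, calls = tokens[0], tokens[1:]
            print(method_name, "->", calls)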

    The file embeddings_train.json can be used to train a word/sentence embedding model using the code in the GitHub repository (link below).

    The corpus was used for the experiments in the paper Combining Code Embedding with Static Analysis for Function-Call Completion.

    GitHub repository to replicate the experiments: https://github.com/mweyssow/cse-saner

  16. Social Media Profile Links by Name

    • openwebninja.com
    json
    Updated Feb 2, 2025
    Cite
    OpenWeb Ninja (2025). Social Media Profile Links by Name [Dataset]. https://www.openwebninja.com/api/social-links-search
    Explore at:
    Available download formats: json
    Dataset updated
    Feb 2, 2025
    Dataset authored and provided by
    OpenWeb Ninja
    Area covered
    Worldwide
    Description

    This dataset provides comprehensive social media profile links discovered through real-time web search. It includes profiles from major social networks like Facebook, TikTok, Instagram, Twitter, LinkedIn, YouTube, Pinterest, GitHub and more. The data is gathered through intelligent search algorithms and pattern matching. Users can leverage this dataset for social media research, influencer discovery, social presence analysis, and social media marketing. The API enables efficient discovery of social profiles across multiple platforms. The dataset is delivered in JSON format via a REST API.
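
    A hedged sketch of querying the API with Python's requests library; the query parameter and authentication details are assumptions, so consult the OpenWeb Ninja documentation for the actual interface:

    import requests

    # Hypothetical parameters; the real API may require an API key and different fields
    resp = requests.get(
        "https://www.openwebninja.com/api/social-links-search",
        params={"name": "Jane Doe"},
        timeout=10,
    )
    resp.raise_for_status()
    profiles = resp.json()  # JSON response with discovered profile links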

  17. Three Annotated Anomaly Detection Datasets for Line-Scan Algorithms

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 29, 2024
    Cite
    Garske, Samuel (2024). Three Annotated Anomaly Detection Datasets for Line-Scan Algorithms [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13370799
    Explore at:
    Dataset updated
    Aug 29, 2024
    Dataset provided by
    Mao, Yiwei
    Garske, Samuel
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summary

    This dataset contains two hyperspectral images and one multispectral image for anomaly detection, along with their corresponding binary pixel masks. They were initially used for real-time anomaly detection in line-scanning, but they can be used for any anomaly detection task.

    They are in .npy file format (tiff or geotiff variants will be added in the future), with the image datasets being in the order (height, width, channels). The SNP dataset was collected using sentinelhub, and the Synthetic dataset was collected from AVIRIS. The Python code used to analyse these datasets can be found at: https://github.com/WiseGamgee/HyperAD

    How to Get Started

    All that is needed to load these datasets is Python (preferably 3.8+) and the NumPy package. Example code for loading the beach dataset, assuming you put it in a folder called "data" next to the Python script:

    import numpy as np

    # Load image file
    hsi_array = np.load("data/beach_hsi.npy")
    n_pixels, n_lines, n_bands = hsi_array.shape
    print(f"This dataset has {n_pixels} pixels, {n_lines} lines, and {n_bands} bands.")

    # Load image mask
    mask_array = np.load("data/beach_mask.npy")
    m_pixels, m_lines = mask_array.shape
    print(f"The corresponding anomaly mask is {m_pixels} pixels by {m_lines} lines.")

    Citing the Datasets

    If you use any of these datasets, please cite the following paper:

    @article{garske2024erx,
      title={ERX - a Fast Real-Time Anomaly Detection Algorithm for Hyperspectral Line-Scanning},
      author={Garske, Samuel and Evans, Bradley and Artlett, Christopher and Wong, KC},
      journal={arXiv preprint arXiv:2408.14947},
      year={2024},
    }

    If you use the beach dataset please cite the following paper as well (original source):

    @article{mao2022openhsi,
      title={OpenHSI: A complete open-source hyperspectral imaging solution for everyone},
      author={Mao, Yiwei and Betters, Christopher H and Evans, Bradley and Artlett, Christopher P and Leon-Saval, Sergio G and Garske, Samuel and Cairns, Iver H and Cocks, Terry and Winter, Robert and Dell, Timothy},
      journal={Remote Sensing},
      volume={14},
      number={9},
      pages={2244},
      year={2022},
      publisher={MDPI}
    }

  18. Accurate normalization of real-time quantitative RT-PCR data by geometric...

    • healthdata.gov
    application/rdfxml +5
    Updated Sep 10, 2025
    Cite
    (2025). Accurate normalization of real-time quantitative RT-PCR data by geometric averaging of multiple internal control genes - f4wx-3iqh - Archive Repository [Dataset]. https://healthdata.gov/dataset/Accurate-normalization-of-real-time-quantitative-R/ep29-w9cc
    Explore at:
    Available download formats: tsv, application/rssxml, csv, json, xml, application/rdfxml
    Dataset updated
    Sep 10, 2025
    Description

    This dataset tracks the updates made on the dataset "Accurate normalization of real-time quantitative RT-PCR data by geometric averaging of multiple internal control genes" as a repository for previous versions of the data and metadata.

  19. [Decommissioned] Intellectual Property Government Open Live Data

    • researchdata.edu.au
    Updated Feb 3, 2016
    + more versions
    Cite
    IP Australia (2016). [Decommissioned] Intellectual Property Government Open Live Data [Dataset]. https://researchdata.edu.au/decommissioned-intellectual-property-live-data/2989210
    Explore at:
    Dataset updated
    Feb 3, 2016
    Dataset provided by
    Data.gov: https://data.gov/
    Authors
    IP Australia
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Area covered
    Description

    Important Notice

    This dataset is not being updated currently due to data migration work at IP Australia. We are sorry for the inconvenience and we will update this page once the migration is complete.

    The Intellectual Property Government Open Live Data (IPGOLD) includes over 100 years of Intellectual Property (IP) rights administered by IP Australia, comprising patents, trade marks, designs and plant breeder's rights. The data is highly detailed, including information on each aspect of the application process, from application through to granting of IP rights. We have published a paper to accompany IPGOLD which describes the data and illustrates its use, as well as a technical paper on the firm matching.

    IPGOLD is inherently the same data as the IPGOD data set, with a weekly update instead of the annual snapshot available in IPGOD. Many of the scripts of IPGOLD are still being developed and tested. As such, IPGOLD should be considered a beta release.

  20. rt-me-fMRI: A task and resting state dataset for real-time, multi-echo fMRI...

    • b2find.eudat.eu
    Updated Aug 7, 2024
    + more versions
    Cite
    (2024). rt-me-fMRI: A task and resting state dataset for real-time, multi-echo fMRI methods development and validation - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/2f223bc9-110f-5779-94cc-ca197fe53ceb
    Explore at:
    Dataset updated
    Aug 7, 2024
    Description

    rt-me-fMRI is a multi-echo functional magnetic resonance imaging dataset (N=28 healthy volunteers) with four task-based and two resting state runs. Its main purpose is to advance the development of methods for real-time multi-echo fMRI analysis with applications in neurofeedback, real-time quality control, and adaptive paradigms, although the variety of experimental task paradigms can support multiple use cases. Tasks include finger tapping, emotional face and shape matching, imagined finger tapping and imagined emotion processing. Further information is available at https://github.com/jsheunis/rt-me-fMRI

    IMPORTANT FOR DATASET DOWNLOAD: Due to an issue with the current installation of Dataverse, it is not currently possible to download the full rt-me-fMRI dataset in bulk. This issue is scheduled to be resolved in early 2021. Individual downloads or downloading small sets of files is currently possible, although cumbersome. In order to download the full dataset in bulk, please request access to the dataset on this page. You will then be required to complete and sign the Data Use Agreement, after which you will be provided with a secure download link for the full dataset.
