100+ datasets found
  1. videogames-companies-regions

    • kaggle.com
    Updated Dec 23, 2020
    Cite
    AndresHG (2020). videogames-companies-regions [Dataset]. https://www.kaggle.com/datasets/andreshg/videogamescompaniesregions/versions/2
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 23, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    AndresHG
    License

    Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
    License information was derived automatically

    Description

    Context

    There are many developers in the world of video-games. Here they are!

    Content

    This is a short dataset containing information about video-game publishers. The idea behind the data is to provide a bit of background about those publishers.

    Inspiration

    The idea behind this dataset is to complement the video-games-sales-2019 dataset.

  2. Global sought-after database skills for developers 2021

    • statista.com
    Updated Nov 22, 2023
    Cite
    Statista (2023). Global sought-after database skills for developers 2021 [Dataset]. https://www.statista.com/statistics/793854/worldwide-developer-survey-most-wanted-database/
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    May 25, 2021 - Jun 15, 2021
    Area covered
    Worldwide
    Description

    According to the survey, just under 18 percent of respondents identified PostgreSQL as one of the most-wanted database skills. MongoDB ranked second, with 17.89 percent stating that they are not developing with it but want to.

  3. CommitBench

    • zenodo.org
    csv, json
    Updated Feb 14, 2024
    Cite
    Maximilian Schall; Tamara Czinczoll; Gerard de Melo (2024). CommitBench [Dataset]. http://doi.org/10.5281/zenodo.10497442
    Explore at:
    Available download formats: json, csv
    Dataset updated
    Feb 14, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Maximilian Schall; Tamara Czinczoll; Gerard de Melo
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Time period covered
    Dec 15, 2023
    Description

    Data Statement for CommitBench

    - Dataset Title: CommitBench
    - Dataset Curator: Maximilian Schall, Tamara Czinczoll, Gerard de Melo
    - Dataset Version: 1.0, 15.12.2023
    - Data Statement Author: Maximilian Schall, Tamara Czinczoll
    - Data Statement Version: 1.0, 16.01.2023

    EXECUTIVE SUMMARY

    We provide CommitBench as an open-source, reproducible, privacy- and license-aware benchmark for commit message generation. The dataset is gathered from GitHub repositories with licenses that permit redistribution. We cover six programming languages: Java, Python, Go, JavaScript, PHP and Ruby. The commit messages in natural language are restricted to English, as it is the working language in many software development projects. The dataset has 1,664,590 examples that were generated using extensive quality-focused filtering techniques (e.g., excluding bot commits). Additionally, we provide a version with longer sequences for benchmarking models with more extended sequence input, as well as a version with

    CURATION RATIONALE

    We created this dataset due to quality and legal issues with previous commit message generation datasets. Given a git diff displaying code changes between two file versions, the task is to predict the accompanying commit message describing these changes in natural language. We base our GitHub repository selection on that of a previous dataset, CodeSearchNet, but apply a large number of filtering techniques to improve the data quality and eliminate noise. Due to the original repository selection, we are also restricted to the aforementioned programming languages. It was important to us, however, to provide several programming languages to accommodate any changes in the task due to the degree of hardware-relatedness of a language. The dataset is provided as a large CSV file containing all samples. We provide the following fields: Diff, Commit Message, Hash, Project, Split.
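As a sketch of how a CSV with the five fields named above (Diff, Commit Message, Hash, Project, Split) might be consumed, the snippet below streams rows with Python's csv module; the sample row is invented purely for illustration, and in practice one would open the actual CommitBench CSV file instead of the in-memory string.

```python
import csv
import io

# Hypothetical sample in the CommitBench column layout described in the
# data statement; the row contents are invented for illustration.
sample = io.StringIO(
    '"Diff","Commit Message","Hash","Project","Split"\n'
    '"- old\\n+ new","Fix off-by-one in loop bound","abc123","example/repo","train"\n'
)

# Collect (diff, message) pairs for the training split only.
train_pairs = [
    (row["Diff"], row["Commit Message"])
    for row in csv.DictReader(sample)
    if row["Split"] == "train"
]
print(train_pairs[0][1])  # -> Fix off-by-one in loop bound
```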

    DOCUMENTATION FOR SOURCE DATASETS

    Repository selection based on CodeSearchNet, which can be found under https://github.com/github/CodeSearchNet

    LANGUAGE VARIETIES

    Since GitHub hosts software projects from all over the world, there is no single uniform variety of English used across all commit messages. This means that phrasing can be regional or subject to influences from the programmer's native language. It also means that different spelling conventions may co-exist and that different terms may be used for the same concept. Any model trained on this data should take these factors into account. For the number of samples for different programming languages, see the table below:

    Language      Number of Samples
    Java          153,119
    Ruby          233,710
    Go            137,998
    JavaScript    373,598
    Python        472,469
    PHP           294,394

    SPEAKER DEMOGRAPHIC

    Due to the extremely diverse (geographically, but also socio-economically) backgrounds of the software development community, there is no single demographic the data comes from. Of course, this does not entail that there are no biases when it comes to the data origin. Globally, the average software developer tends to be male and has obtained higher education. Due to the anonymous nature of GitHub profiles, gender distribution information cannot be extracted.

    ANNOTATOR DEMOGRAPHIC

    Due to the automated generation of the dataset, no annotators were used.

    SPEECH SITUATION AND CHARACTERISTICS

    The public nature and often business-related creation of the data by the original GitHub users fosters a more neutral, information-focused and formal language. As it is not uncommon for developers to find the writing of commit messages tedious, there can also be commit messages representing the frustration or boredom of the commit author. While our filtering is supposed to catch these types of messages, there can be some instances still in the dataset.

    PREPROCESSING AND DATA FORMATTING

    See paper for all preprocessing steps. We do not provide the un-processed raw data due to privacy concerns, but it can be obtained via CodeSearchNet or requested from the authors.

    CAPTURE QUALITY

    While our dataset is completely reproducible at the time of writing, there are external dependencies that could restrict this. If GitHub shuts down and someone with a software project in the dataset deletes their repository, there can be instances that are non-reproducible.

    LIMITATIONS

    While our filters are meant to ensure a high quality for each data sample in the dataset, we cannot ensure that only low-quality examples were removed. Similarly, we cannot guarantee that our extensive filtering methods catch all low-quality examples; some might remain in the dataset. Another limitation of our dataset is the low number of programming languages (there are many more), as well as our focus on English commit messages. There might be some developers who only write commit messages in their respective languages, e.g., because the organization they work at has established this or because they do not speak English (confidently enough). Perhaps some languages' syntax better aligns with that of programming languages. These effects cannot be investigated with CommitBench.

    Although we anonymize the data as far as possible, the required information for reproducibility, including the organization, project name, and project hash, makes it possible to refer back to the original authoring user account, since this information is freely available in the original repository on GitHub.

    METADATA

    License: Dataset under the CC BY-NC 4.0 license

    DISCLOSURES AND ETHICAL REVIEW

    While we put substantial effort into removing privacy-sensitive information, our solutions cannot find 100% of such cases. This means that researchers and anyone using the data need to incorporate their own safeguards to effectively reduce the amount of personal information that can be exposed.

    ABOUT THIS DOCUMENT

    A data statement is a characterization of a dataset that provides context to allow developers and users to better understand how experimental results might generalize, how software might be appropriately deployed, and what biases might be reflected in systems built on the software.

    This data statement was written based on the template for the Data Statements Version 2 schema. The template was prepared by Angelina McMillan-Major, Emily M. Bender, and Batya Friedman, can be found at https://techpolicylab.uw.edu/data-statements/, and was updated from the community Version 1 Markdown template by Leon Derczynski.

  4. World Development Indicators

    • kaggle.com
    zip
    Updated Apr 10, 2019
    + more versions
    Cite
    World Bank (2019). World Development Indicators [Dataset]. https://www.kaggle.com/theworldbank/world-development-indicators
    Explore at:
    Available download formats: zip (134125679 bytes)
    Dataset updated
    Apr 10, 2019
    Dataset authored and provided by
    World Bank (http://worldbank.org/)
    License

    https://www.worldbank.org/en/about/legal/terms-of-use-for-datasets

    Description

    Content

    The primary World Bank collection of development indicators, compiled from officially-recognized international sources. It presents the most current and accurate global development data available, and includes national, regional and global estimates.

    Context

    This is a dataset hosted by the World Bank. The organization has an open data platform, found here, and they update their information according to the amount of data that is brought in. Explore the World Bank using Kaggle and all of the data sources available through the World Bank organization page!

    • Update Frequency: This dataset is updated daily.

    Acknowledgements

    This dataset is maintained using the World Bank's APIs and Kaggle's API.

    Cover photo by Alex Block on Unsplash
    Unsplash Images are distributed under a unique Unsplash License.

  5. VIIRS Stray Light Corrected Nighttime Day/Night Band Composites Version 1

    • developers.google.com
    Updated May 31, 2017
    + more versions
    Cite
    Earth Observation Group, Payne Institute for Public Policy, Colorado School of Mines (2017). VIIRS Stray Light Corrected Nighttime Day/Night Band Composites Version 1 [Dataset]. https://developers.google.com/earth-engine/datasets/catalog/NOAA_VIIRS_DNB_MONTHLY_V1_VCMSLCFG
    Explore at:
    Dataset updated
    May 31, 2017
    Dataset provided by
    Earth Observation Group, Payne Institute for Public Policy, Colorado School of Mines
    Time period covered
    Jan 1, 2014 - Mar 1, 2025
    Area covered
    Description

    Monthly average radiance composite images using nighttime data from the Visible Infrared Imaging Radiometer Suite (VIIRS) Day/Night Band (DNB). As these data are composited monthly, there are many areas of the globe where it is impossible to get good quality data coverage for that month. This can be due to …

  6. World Bank Quarterly External Debt Statistics

    • kaggle.com
    zip
    Updated May 4, 2019
    + more versions
    Cite
    World Bank (2019). World Bank Quarterly External Debt Statistics [Dataset]. https://www.kaggle.com/theworldbank/world-bank-quarterly-external-debt-statistics
    Explore at:
    Available download formats: zip (11652734 bytes)
    Dataset updated
    May 4, 2019
    Dataset authored and provided by
    World Bank (http://worldbank.org/)
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Content

    More details about each file are in the individual file descriptions.

    Context

    This is a dataset hosted by the World Bank. The organization has an open data platform, found here, and they update their information according to the amount of data that is brought in. Explore the World Bank using Kaggle and all of the data sources available through the World Bank organization page!

    • Update Frequency: This dataset is updated daily.

    Acknowledgements

    This dataset is maintained using the World Bank's APIs and Kaggle's API.

    Cover photo by Markus Spiske on Unsplash
    Unsplash Images are distributed under a unique Unsplash License.

  7. GRIDMET: University of Idaho Gridded Surface Meteorological Dataset

    • developers.google.com
    Updated Aug 15, 2018
    Cite
    University of California Merced (2018). GRIDMET: University of Idaho Gridded Surface Meteorological Dataset [Dataset]. https://developers.google.com/earth-engine/datasets/catalog/IDAHO_EPSCOR_GRIDMET
    Explore at:
    Dataset updated
    Aug 15, 2018
    Dataset provided by
    University of California Merced
    Time period covered
    Jan 1, 1979 - Jul 10, 2025
    Area covered
    Description

    The Gridded Surface Meteorological dataset provides high spatial resolution (~4-km) daily surface fields of temperature, precipitation, winds, humidity and radiation across the contiguous United States from 1979. The dataset blends the high resolution spatial data from PRISM with the high temporal resolution data from the National Land Data Assimilation System (NLDAS) to produce spatially and temporally continuous fields that lend themselves to additional land surface modeling. This dataset contains provisional products that are replaced with updated versions when the complete source data become available. Products can be distinguished by the value of the 'status' property. At first, assets are ingested with status='early'. After several days, they are replaced by assets with status='provisional'. After about 2 months, they are replaced by the final assets with status='permanent'.
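The 'status' lifecycle described above (early, then provisional, then permanent) amounts to a simple metadata filter. The sketch below uses plain Python dictionaries as stand-ins for asset metadata; the asset records are invented for illustration, and in Earth Engine itself one would filter on the 'status' property through the API rather than over local dicts.

```python
# Hypothetical asset metadata records; in practice this information comes
# from the GRIDMET collection's per-asset 'status' property.
assets = [
    {"id": "20250701", "status": "early"},
    {"id": "20250610", "status": "provisional"},
    {"id": "20250401", "status": "permanent"},
]

def finalized(assets):
    """Keep only assets whose data will no longer be replaced."""
    return [a for a in assets if a["status"] == "permanent"]

print([a["id"] for a in finalized(assets)])  # -> ['20250401']
```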

  8. Harmonized Sentinel-2 MSI: MultiSpectral Instrument, Level-1C (TOA)

    • developers.google.com
    Updated Feb 15, 2024
    + more versions
    Cite
    European Union/ESA/Copernicus (2024). Harmonized Sentinel-2 MSI: MultiSpectral Instrument, Level-1C (TOA) [Dataset]. https://developers.google.com/earth-engine/datasets/catalog/COPERNICUS_S2_HARMONIZED
    Explore at:
    Dataset updated
    Feb 15, 2024
    Dataset provided by
    European Space Agency (http://www.esa.int/)
    Time period covered
    Jun 27, 2015 - Jul 13, 2025
    Area covered
    Description

    After 2022-01-25, Sentinel-2 scenes with PROCESSING_BASELINE '04.00' or above have their DN (value) range shifted by 1000. The HARMONIZED collection shifts data in newer scenes to be in the same range as in older scenes.

    Sentinel-2 is a wide-swath, high-resolution, multi-spectral imaging mission supporting Copernicus Land Monitoring studies, including the monitoring of vegetation, soil and water cover, as well as observation of inland waterways and coastal areas. The Sentinel-2 data contain 13 UINT16 spectral bands representing TOA reflectance scaled by 10000. See the Sentinel-2 User Handbook for details.

    QA60 is a bitmask band that contained rasterized cloud mask polygons until Feb 2022, when these polygons stopped being produced. Starting in February 2024, legacy-consistent QA60 bands are constructed from the MSK_CLASSI cloud classification bands. For more details, see the full explanation of how cloud masks are computed.

    Each Sentinel-2 product (zip archive) may contain multiple granules. Each granule becomes a separate Earth Engine asset. EE asset ids for Sentinel-2 assets have the following format: COPERNICUS/S2/20151128T002653_20151128T102149_T56MNN. Here the first numeric part represents the sensing date and time, the second numeric part represents the product generation date and time, and the final 6-character string is a unique granule identifier indicating its UTM grid reference (see MGRS).

    The Level-2 data produced by ESA can be found in the collection COPERNICUS/S2_SR. For datasets to assist with cloud and/or cloud shadow detection, see COPERNICUS/S2_CLOUD_PROBABILITY and GOOGLE/CLOUD_SCORE_PLUS/V1/S2_HARMONIZED. For more details on Sentinel-2 radiometric resolution, see this page.
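The asset-id convention described above (sensing time, generation time, MGRS granule) can be unpacked mechanically. The following sketch parses the example id from the text; the helper name is ours and not part of any Earth Engine API.

```python
from datetime import datetime

def parse_s2_asset_id(asset_id):
    """Split an EE Sentinel-2 asset id into sensing time, product generation
    time, and the MGRS granule identifier (helper name is illustrative)."""
    granule = asset_id.rsplit("/", 1)[-1]
    sensing, generation, tile = granule.split("_")
    return {
        "sensing": datetime.strptime(sensing, "%Y%m%dT%H%M%S"),
        "generation": datetime.strptime(generation, "%Y%m%dT%H%M%S"),
        "mgrs_tile": tile[1:],  # drop the leading 'T'
    }

info = parse_s2_asset_id("COPERNICUS/S2/20151128T002653_20151128T102149_T56MNN")
print(info["mgrs_tile"])  # -> 56MNN
```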

  9. World Bank Millennium Development Goals

    • kaggle.com
    Updated May 16, 2019
    Cite
    World Bank (2019). World Bank Millennium Development Goals [Dataset]. https://www.kaggle.com/theworldbank/world-bank-millennium-development-goals/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 16, 2019
    Dataset provided by
    Kaggle
    Authors
    World Bank
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Content

    More details about each file are in the individual file descriptions.

    Context

    This is a dataset hosted by the World Bank. The organization has an open data platform, found here, and they update their information according to the amount of data that is brought in. Explore the World Bank using Kaggle and all of the data sources available through the World Bank organization page!

    • Update Frequency: This dataset is updated daily.

    Acknowledgements

    This dataset is maintained using the World Bank's APIs and Kaggle's API.

    Cover photo by İrfan Simsar on Unsplash
    Unsplash Images are distributed under a unique Unsplash License.

  10. Most popular database management systems worldwide 2024

    • statista.com
    Updated Jun 30, 2025
    Cite
    Statista (2025). Most popular database management systems worldwide 2024 [Dataset]. https://www.statista.com/statistics/809750/worldwide-popularity-ranking-database-management-systems/
    Explore at:
    Dataset updated
    Jun 30, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Jun 2024
    Area covered
    Worldwide
    Description

    As of June 2024, the most popular database management system (DBMS) worldwide was Oracle, with a ranking score of *******; MySQL and Microsoft SQL Server rounded out the top three. Although the database management industry contains some of the largest companies in the tech industry, such as Microsoft, Oracle, and IBM, a number of free and open-source DBMSs such as PostgreSQL and MariaDB remain competitive.

    Database Management Systems

    As the name implies, DBMSs provide a platform through which developers can organize, update, and control large databases. Given the business world's growing focus on big data and data analytics, knowledge of SQL programming languages has become an important asset for software developers around the world, and database management skills are seen as highly desirable. In addition to providing developers with the tools needed to operate databases, DBMSs are also integral to the way that consumers access information through applications, which further illustrates the importance of the software.

  11. BETA-FOR_SP3_EnvironmentalAttributes_DLM/ESAWC/SRTM_2023

    • zenodo.org
    Updated Feb 18, 2025
    Cite
    Patrick Kacic (2025). BETA-FOR_SP3_EnvironmentalAttributes_DLM/ESAWC/SRTM_2023 [Dataset]. http://doi.org/10.5281/zenodo.14850688
    Explore at:
    Dataset updated
    Feb 18, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Patrick Kacic
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Time period covered
    Oct 24, 2023
    Description
    This dataset provides additional information on environmental attributes (minimum distance to land cover classes, topographic information) based on the dataset "BETA-FOR_SPZ_Patches_2022/2023" (https://zenodo.org/records/14748236) (centroid coordinates: decimalLongitude, decimalLatitude).
    The information on environmental attributes was derived from the following three geospatial datasets:
    - DLM250 = Digital Landscape Model 1:250,000 (vector data; see the attribute notes below)
    - ESA WorldCover = global product on land cover (raster data, https://developers.google.com/earth-engine/datasets/catalog/ESA_WorldCover_v100?hl=en)
    - SRTM = global Digital Elevation Model (raster data, https://developers.google.com/earth-engine/datasets/catalog/USGS_SRTMGL1_003?hl=en)
    The following attributes were added to the "BETA-FOR_SPZ_Patches_2022/2023" table and exported as .csv file (tabular data):
    DLM250:
    - min_dist_sie01_p = minimum distance to urban areas [m]
    - min_dist_ver01_l = minimum distance to technical infrastructure (roads) [m]
    - min_dist_veg01_f = minimum distance to agricultural areas [m]
    - min_dist_gew01_l = minimum distance to waterbodies [m]
    Please consider that the DLM250 is spatially discontinuous vector data where, for example, agricultural areas are incompletely assessed.
    ESA WorldCover (ESAWC):
    - min_dist_esawc_30 = minimum distance to grasslands (land cover class value = 30) [m]
    - min_dist_esawc_40 = minimum distance to cropland (land cover class value = 40) [m]
    SRTM:
    - SRTM_elevation = elevation [m]
    - SRTM_slope = slope [°]
    - SRTM_aspect = aspect; 90° = E, 180° = S, 270° = W, 360°/0° = N [°]
    The original vector and raster data can be made available upon request, e.g. to inspect benefits and limitations of DLM250 and ESA WorldCover.
  12. Data from: Google Earth Engine (GEE)

    • data.amerigeoss.org
    • azgeo-open-data-agic.hub.arcgis.com
    • +3more
    esri rest, html
    Updated Nov 28, 2018
    Cite
    AmeriGEO ArcGIS (2018). Google Earth Engine (GEE) [Dataset]. https://data.amerigeoss.org/de/dataset/google-earth-engine-gee1
    Explore at:
    Available download formats: esri rest, html
    Dataset updated
    Nov 28, 2018
    Dataset provided by
    AmeriGEO ArcGIS
    Description

    Meet Earth Engine

    Google Earth Engine combines a multi-petabyte catalog of satellite imagery and geospatial datasets with planetary-scale analysis capabilities and makes it available for scientists, researchers, and developers to detect changes, map trends, and quantify differences on the Earth's surface.

    Satellite imagery + your algorithms + causes you care about = real-world applications
    GLOBAL-SCALE INSIGHT

    Explore our interactive timelapse viewer to travel back in time and see how the world has changed over the past twenty-nine years. Timelapse is one example of how Earth Engine can help gain insight into petabyte-scale datasets.

    READY-TO-USE DATASETS

    The public data archive includes more than thirty years of historical imagery and scientific datasets, updated and expanded daily. It contains over twenty petabytes of geospatial data instantly available for analysis.

    SIMPLE, YET POWERFUL API

    The Earth Engine API is available in Python and JavaScript, making it easy to harness the power of Google’s cloud for your own geospatial analysis.

    "Google Earth Engine has made it possible for the first time in history to rapidly and accurately process vast amounts of satellite imagery, identifying where and when tree cover change has occurred at high resolution. Global Forest Watch would not exist without it. For those who care about the future of the planet Google Earth Engine is a great blessing!" - Dr. Andrew Steer, President and CEO of the World Resources Institute
    CONVENIENT TOOLS

    Use our web-based code editor for fast, interactive algorithm development with instant access to petabytes of data.

    SCIENTIFIC AND HUMANITARIAN IMPACT

    Scientists and non-profits use Earth Engine for remote sensing research, predicting disease outbreaks, natural resource management, and more.


  13. Performance counter for biometrics authentication

    • figshare.com
    txt
    Updated Oct 30, 2023
    Cite
    Cesar Andrade; Eduardo Souto; Hendrio Bragança (2023). Performance counter for biometrics authentication [Dataset]. http://doi.org/10.6084/m9.figshare.24461230.v3
    Explore at:
    Available download formats: txt
    Dataset updated
    Oct 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Cesar Andrade; Eduardo Souto; Hendrio Bragança
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In the quest for advancing the field of continuous user authentication, we have crafted two comprehensive datasets: COUNT-OS-I and COUNT-OS-II, each with unique characteristics while sharing common ground in their utility and design principles. These datasets encompass performance counters extracted from the Windows operating system, offering data vital for evaluating and refining authentication models in real-world scenarios.

    Both datasets were generated in real-world settings within public organizations in Brazil, ensuring their applicability and relevance to practical scenarios. Volunteers from diverse professional backgrounds participated in the data collection, contributing to the richness and variability of the data. Both datasets were collected at a sample rate of one sample every 5 seconds, providing a dense and detailed view of user interactions and system performance. User confidentiality is preserved across both datasets, with pseudonymization applied to safeguard individual identities while maintaining data integrity and statistical robustness.

    The COUNT-OS-I dataset was generated in a real-world scenario to evaluate our work on continuous user authentication. It consists of performance counters extracted from the Windows operating system of 26 computers, representing 26 individual users. The data were collected on the computers of the Information Technology Department of a public organization in Brazil. The participants in this study were volunteers, aged between 20 and 45, consisting of both males and females. The majority were systems analysts and software developers who performed their routine work activities. There were no specific restrictions imposed on the tasks that the participants were required to perform during the data collection process. The participants used a variety of software applications as part of their regular work activities, including web browsers such as Firefox, Chrome, and Edge, developer tools like Eclipse and SQL Developer, office programs such as Microsoft Office Word, Excel, and PowerPoint, and chat applications like WhatsApp; this list is not exhaustive, and participants were not limited to these applications. The computers in COUNT-OS-I have different characteristics and configurations in terms of hardware, operating system versions, and installed software. This diversity ensures a representative sample of real-world scenarios and allows for a comprehensive evaluation of the authentication model. Each sample was recorded every 5 seconds, capturing system data over a period of approximately 26 hours, on average, per user, which provides sufficient data to analyze user behavior and system performance over an extended period. Each sample corresponds to a feature vector comprising 159 attributes.

    The COUNT-OS-II dataset comprises performance counters extracted from the Windows operating system installed on 37 computers with identical hardware configurations (CPU, memory, network, disk), operating systems, and software installations. The data collection was conducted within various departments of a public organization in Brazil. The participants (37 users) were voluntary administration assistants who performed various administrative tasks as part of their routine work activities. No restrictions were imposed on the specific tasks they were assigned. The participants commonly utilized programs such as the Chrome browser and office applications like Word, Excel, and PowerPoint, in addition to the WhatsApp chat application. The data were collected over six days (approximately 48 hours), with samples collected at a 5-second interval. Each sample corresponds to a feature vector composed of 218 attributes. In this dataset, we also apply pseudonymization to hide users' sensitive information.
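The stated sampling parameters imply sample counts that are easy to check. The arithmetic below uses only numbers given in the description (26 hours average per user for COUNT-OS-I; roughly 48 hours for COUNT-OS-II; one sample every 5 seconds):

```python
# Back-of-envelope check of the stated collection parameters.
SAMPLE_INTERVAL_S = 5

count_os_i_hours = 26   # average collection time per user (COUNT-OS-I)
count_os_ii_hours = 48  # approximate total collection time (COUNT-OS-II)

samples_per_user_i = count_os_i_hours * 3600 // SAMPLE_INTERVAL_S
samples_ii = count_os_ii_hours * 3600 // SAMPLE_INTERVAL_S

print(samples_per_user_i, samples_ii)  # -> 18720 34560
```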

  14. Sentinel2 RGB chips over BENELUX with ESA World Cover for Learning with...

    • data.niaid.nih.gov
    Updated May 17, 2023
    + more versions
    Cite
    Raúl Ramos-Pollan (2023). Sentinel2 RGB chips over BENELUX with ESA World Cover for Learning with Label Proportions [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_7935236
    Explore at:
    Dataset updated
    May 17, 2023
    Dataset provided by
    Fabio A. González
    Raúl Ramos-Pollan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Benelux
    Description

    The Region of Interest (ROI) comprises Belgium, the Netherlands, and Luxembourg.

    We use the communes administrative division, which is standardized across Europe by EUROSTAT at: https://ec.europa.eu/eurostat/web/gisco/geodata/reference-data/administrative-units-statistical-units This is roughly equivalent to the notion of municipalities in most countries.

    From the link above, commune definitions are taken from COMM_RG_01M_2016_4326.shp and country borders are taken from NUTS_RG_01M_2021_3035.shp.

    images: Sentinel2 RGB from 2020-01-01 to 2020-12-31. Pixels with clouds during the observation period were filtered out according to the QA60 band, following the example given on the GEE dataset info page, and the median of the remaining pixels was taken.

      see https://developers.google.com/earth-engine/datasets/catalog/COPERNICUS_S2_SR_HARMONIZED
    
    
      see also https://github.com/rramosp/geetiles/blob/main/geetiles/defs/sentinel2rgbmedian2020.py
    

    labels: ESA WorldCover 10m V100 labels mapped to the interval [1,11] according to the following map: { 0:0, 10:1, 20:2, 30:3, 40:4, 50:5, 60:6, 70:7, 80:8, 90:9, 95:10, 100:11 }. Pixel value zero is reserved for invalid data. See https://developers.google.com/earth-engine/datasets/catalog/ESA_WorldCover_v100

      see also https://github.com/rramosp/geetiles/blob/main/geetiles/defs/esaworldcover.py
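The label mapping above can be applied to a raw WorldCover raster with a small lookup table; a minimal sketch with NumPy (the array contents here are illustrative):

```python
import numpy as np

# ESA WorldCover class codes -> compact labels in [1, 11]; 0 = invalid data
ESA_TO_COMPACT = {0: 0, 10: 1, 20: 2, 30: 3, 40: 4, 50: 5,
                  60: 6, 70: 7, 80: 8, 90: 9, 95: 10, 100: 11}

def remap_worldcover(labels: np.ndarray) -> np.ndarray:
    """Map raw ESA WorldCover codes to the dataset's compact label range."""
    lut = np.zeros(101, dtype=np.uint8)  # raw codes go up to 100
    for code, compact in ESA_TO_COMPACT.items():
        lut[code] = compact
    return lut[labels]

raw = np.array([[10, 95], [100, 0]], dtype=np.uint8)
print(remap_worldcover(raw))  # [[ 1 10] [11  0]]
```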
    

    _aschips.geojson: the image chip geometries along with their label proportions, for easy visualization with QGIS, GeoPandas, etc.

    _communes.geojson: the commune geometries with their label proportions, for easy visualization with QGIS, GeoPandas, etc.

    splits.csv: contains two splits of the image chips into train, test, and val: one using geographical bands at 45° angles in the NW-SE direction, and the same split reorganized so that all chips within the same commune fall within the same split.

    data/: a pickle file for each image chip, containing a dict with the 100x100 RGB Sentinel-2 chip image, the 100x100 chip-level labels, the label proportions of the chip, and the aggregated label proportions of the commune the chip belongs to.
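Reading one of these per-chip pickle files can be sketched as follows (the loader is generic; the dict key names shown in the demo are assumptions for illustration, since the exact field names are not documented above):

```python
import os
import pickle
import tempfile

def load_chip(path: str) -> dict:
    """Load one image-chip pickle file; each file holds a dict as described above."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Round-trip demo with a stand-in dict. The key names here are assumptions
# for illustration only -- inspect one real file under data/ for the actual names.
demo = {
    "chip_image": "stand-in for the 100x100 RGB Sentinel-2 array",
    "chip_labels": "stand-in for the 100x100 label array",
    "chip_label_proportions": {1: 0.4, 5: 0.6},
    "commune_label_proportions": {1: 0.3, 5: 0.7},
}
with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as f:
    pickle.dump(demo, f)
    tmp_path = f.name
loaded = load_chip(tmp_path)
os.remove(tmp_path)
print(loaded == demo)  # True
```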

  15. Global Distribution of Economic Activity - Dataset - waterdata

    • waterdata3.staging.derilinx.com
    Updated Mar 16, 2020
    Cite
    (2020). Global Distribution of Economic Activity - Dataset - waterdata [Dataset]. https://waterdata3.staging.derilinx.com/dataset/natural-endowment-measuring-economic-growth-outer-space
    Explore at:
    Dataset updated
    Mar 16, 2020
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data for replicating The Global Spatial Distribution of Economic Activity: Nature, History, and the Role of Trade (forthcoming 2018; with Vernon Henderson, Tim Squires and David N. Weil), Quarterly Journal of Economics.

    We explore the role of natural characteristics in determining the worldwide spatial distribution of economic activity, as proxied by lights at night, observed across 240,000 grid cells. A parsimonious set of 24 physical geography attributes explains 47% of worldwide variation and 35% of within-country variation in lights. We divide geographic characteristics into two groups, those primarily important for agriculture and those primarily important for trade, and confront a puzzle. In examining within-country variation in lights, among countries that developed early, agricultural variables incrementally explain over 6 times as much variation in lights as do trade variables, while among late developing countries the ratio is only about 1.5, even though the latter group is far more dependent on agriculture. Correspondingly, the marginal effects of agricultural variables as a group on lights are larger in absolute value, and those for trade smaller, for early developers than for late developers. We show that this apparent puzzle is explained by persistence and the differential timing of technological shocks in the two sets of countries. For early developers, structural transformation due to rising agricultural productivity began when transport costs were still high, so cities were localized in agricultural regions. When transport costs fell, these agglomerations persisted. In late-developing countries, transport costs fell before structural transformation. To exploit urban scale economies, manufacturing agglomerated in relatively few, often coastal, locations. Consistent with this explanation, countries that developed earlier are more spatially equal in their distribution of education and economic activity than late developers.
This dataset is part of the Global Research Program on Spatial Development of Cities funded by the Multi-Donor Trust Fund on Sustainable Urbanization of the World Bank and supported by the U.K. Department for International Development.

  16. The ORBIT (Object Recognition for Blind Image Training)-India Dataset

    • zenodo.org
    • data.niaid.nih.gov
    Updated Apr 24, 2025
    Cite
    Gesu India; Gesu India; Martin Grayson; Martin Grayson; Daniela Massiceti; Daniela Massiceti; Cecily Morrison; Cecily Morrison; Simon Robinson; Simon Robinson; Jennifer Pearson; Jennifer Pearson; Matt Jones; Matt Jones (2025). The ORBIT (Object Recognition for Blind Image Training)-India Dataset [Dataset]. http://doi.org/10.5281/zenodo.12608444
    Explore at:
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Gesu India; Gesu India; Martin Grayson; Martin Grayson; Daniela Massiceti; Daniela Massiceti; Cecily Morrison; Cecily Morrison; Simon Robinson; Simon Robinson; Jennifer Pearson; Jennifer Pearson; Matt Jones; Matt Jones
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    India
    Description

    The ORBIT (Object Recognition for Blind Image Training)-India Dataset is a collection of 105,243 images of 76 commonly used objects, collected by 12 individuals in India who are blind or have low vision. This dataset is an "Indian subset" of the original ORBIT dataset [1, 2], which was collected in the UK and Canada. In contrast to the ORBIT dataset, which was created in a Global North, Western, and English-speaking context, the ORBIT-India dataset features images taken in a low-resource, non-English-speaking, Global South context, home to 90% of the world’s population of people with blindness. Since it is easier for blind or low-vision individuals to gather high-quality data by recording videos, this dataset, like the ORBIT dataset, contains images (each sized 224x224) derived from 587 videos. These videos were taken by our data collectors from various parts of India using the Find My Things [3] Android app. Each data collector was asked to record eight videos of at least 10 objects of their choice.

    Collected between July and November 2023, this dataset represents a set of objects commonly used by people who are blind or have low vision in India, including earphones, talking watches, toothbrushes, and typical Indian household items like a belan (rolling pin), and a steel glass. These videos were taken in various settings of the data collectors' homes and workspaces using the Find My Things Android app.

    The image dataset is stored in the ‘Dataset’ folder, organized into one folder per data collector (P1, P2, ..., P12). Each collector's folder includes sub-folders named with the object labels as provided by our data collectors. Within each object folder, there are two subfolders: ‘clean’ for images taken on clean surfaces and ‘clutter’ for images taken in cluttered environments where the objects are typically found. The annotations are saved inside an ‘Annotations’ folder containing a JSON file per video (e.g., P1--coffee mug--clean--231220_084852_coffee mug_224.json) that contains keys corresponding to all frames/images in that video (e.g., "P1--coffee mug--clean--231220_084852_coffee mug_224--000001.jpeg": {"object_not_present_issue": false, "pii_present_issue": false}, "P1--coffee mug--clean--231220_084852_coffee mug_224--000002.jpeg": {"object_not_present_issue": false, "pii_present_issue": false}, ...). The ‘object_not_present_issue’ key is True if the object is not present in the image, and the ‘pii_present_issue’ key is True if personally identifiable information (PII) is present in the image. Note that all PII present in the images has been blurred to protect the identity and privacy of our data collectors. This dataset version was created by cropping images originally sized at 1080 × 1920; therefore, an unscaled version of the dataset will follow soon.
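Given the per-video JSON structure described above, usable frames can be filtered roughly as follows (a sketch using the two flag names from the example; the frame names are shortened for illustration):

```python
def valid_frames(annotation: dict) -> list:
    """Return frame names where the object is present and no PII was flagged."""
    return [frame for frame, flags in annotation.items()
            if not flags["object_not_present_issue"]
            and not flags["pii_present_issue"]]

# Structure mirroring one per-video annotation JSON (shortened names):
ann = {
    "P1--coffee mug--clean--000001.jpeg":
        {"object_not_present_issue": False, "pii_present_issue": False},
    "P1--coffee mug--clean--000002.jpeg":
        {"object_not_present_issue": True, "pii_present_issue": False},
}
print(valid_frames(ann))  # ['P1--coffee mug--clean--000001.jpeg']
```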

    This project was funded by the Engineering and Physical Sciences Research Council (EPSRC) Industrial ICASE Award with Microsoft Research UK Ltd. as the Industrial Project Partner. We would like to acknowledge and express our gratitude to our data collectors for their efforts and time invested in carefully collecting videos to build this dataset for their community. The dataset is designed for developing few-shot learning algorithms, aiming to support researchers and developers in advancing object-recognition systems. We are excited to share this dataset and would love to hear from you if and how you use this dataset. Please feel free to reach out if you have any questions, comments or suggestions.

    REFERENCES:

    1. Daniela Massiceti, Lida Theodorou, Luisa Zintgraf, Matthew Tobias Harris, Simone Stumpf, Cecily Morrison, Edward Cutrell, and Katja Hofmann. 2021. ORBIT: A real-world few-shot dataset for teachable object recognition collected from people who are blind or low vision. DOI: https://doi.org/10.25383/city.14294597

    2. microsoft/ORBIT-Dataset. https://github.com/microsoft/ORBIT-Dataset

    3. Linda Yilin Wen, Cecily Morrison, Martin Grayson, Rita Faia Marques, Daniela Massiceti, Camilla Longden, and Edward Cutrell. 2024. Find My Things: Personalized Accessibility through Teachable AI for People who are Blind or Low Vision. In Extended Abstracts of the 2024 CHI Conference on Human Factors in Computing Systems (CHI EA '24). Association for Computing Machinery, New York, NY, USA, Article 403, 1–6. https://doi.org/10.1145/3613905.3648641

  17. Fruits Dataset for Classification

    • data.mendeley.com
    Updated Feb 11, 2025
    + more versions
    Cite
    GTS GTS (2025). Fruits Dataset for Classification [Dataset]. http://doi.org/10.17632/rg254yr63x.1
    Explore at:
    Dataset updated
    Feb 11, 2025
    Authors
    GTS GTS
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    About Dataset (strawberries, peaches, pomegranates). Photo requirements: (1) white background; (2) .jpg format; (3) image size 300*300. The dataset requires 250 photos of each fruit when it is fresh and 250 photos of each fruit when it is rotten, for a total of 1,500 images.

    Diverse Collection With a diverse collection of Product images, the files provides an excellent foundation for developing and testing machine learning models designed for image recognition and allocation. Each image is captured under different lighting conditions and backgrounds, offering a realistic challenge for algorithms to overcome.

    Real-World Applications The variability in the dataset ensures that models trained on it can generalize well to real-world scenarios, making them robust and reliable. The dataset includes common fruits such as apples, bananas, oranges, and strawberries, among others, allowing for comprehensive training and evaluation.

    Industry Use Cases One of the significant advantages of using the Fruits Dataset for Classification is its applicability in various fields such as agriculture, retail, and the food industry. In agriculture, it can help automate the process of fruit sorting and grading, enhancing efficiency and reducing labor costs. In retail, it can be used to develop automated checkout systems that accurately identify fruits, streamlining the purchasing process.

    Educational Value The dataset is also valuable for educational purposes, providing students and educators with a practical tool to learn and teach machine learning concepts. By working with this dataset, learners can gain hands-on experience in data preprocessing, model training, and evaluation.

    Conclusion The Fruits Dataset for Classification is a versatile and indispensable resource for advancing the field of image classification. Its diverse and high-quality images, coupled with practical applications, make it a go-to dataset for researchers, developers, and educators aiming to improve and innovate in machine learning and computer vision.

    This dataset is sourced from Kaggle.

  18. Gun Dataset YOLO v8

    • kaggle.com
    Updated Oct 3, 2024
    Cite
    Abuzar Khan (2024). Gun Dataset YOLO v8 [Dataset]. https://www.kaggle.com/datasets/abuzarkhaaan/helmetandguntesting
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 3, 2024
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Abuzar Khan
    License

    https://www.reddit.com/wiki/api

    Description

    This dataset contains labeled data for gun detection collected from various videos on YouTube. The dataset has been specifically curated and labeled by me to aid in training machine learning models, particularly for real-time gun detection tasks. It is formatted for easy use with YOLO (You Only Look Once), one of the most popular object detection models.

    Key Features:

    Source: The videos were sourced from YouTube and feature diverse environments, including indoor and outdoor settings, with varying lighting conditions and backgrounds.

    Annotations: The dataset is fully labeled with bounding boxes around guns, following the YOLO format (.txt files for annotations). Each annotation provides the class (gun) and the coordinates of the bounding box.

    YOLO-Compatible: The dataset is ready to be used with any YOLO model (YOLOv3, YOLOv4, YOLOv5, etc.), ensuring seamless integration for object detection training.

    Realistic Scenarios: The dataset includes footage of guns from various perspectives and angles, making it useful for training models that can generalize to real-world detection tasks.

    This dataset is ideal for researchers and developers working on gun detection systems, security applications, or surveillance systems that require fast and accurate detection of firearms.

  19. GitHub Repos

    • kaggle.com
    zip
    Updated Mar 20, 2019
    + more versions
    Cite
    Github (2019). GitHub Repos [Dataset]. https://www.kaggle.com/datasets/github/github-repos
    Explore at:
    zip (0 bytes)
    Dataset updated
    Mar 20, 2019
    Dataset provided by
    GitHubhttps://github.com/
    Authors
    Github
    Description

    GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008.

    This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.

    Querying BigQuery tables

    You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]. Fork this kernel to get started and learn how to safely manage analysis of large BigQuery datasets.
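A minimal query against these tables might look as follows. This is a sketch that only builds and prints the SQL string, since actually running it requires the google-cloud-bigquery client and GCP credentials; the `languages` table and its `language` column follow the public dataset's documented schema:

```python
# Builds the SQL only; running it needs `pip install google-cloud-bigquery`
# and valid GCP credentials.
TABLE = "bigquery-public-data.github_repos.languages"

query = f"""
SELECT lang.name AS language, COUNT(*) AS repo_count
FROM `{TABLE}`, UNNEST(language) AS lang
GROUP BY language
ORDER BY repo_count DESC
LIMIT 10
""".strip()

# from google.cloud import bigquery
# client = bigquery.Client()  # uses application-default credentials
# for row in client.query(query).result():
#     print(row.language, row.repo_count)
print(query)
```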

    Acknowledgements

    This dataset was made available per GitHub's terms of service. This dataset is available via Google Cloud Platform's Marketplace, GitHub Activity Data, as part of GCP Public Datasets.

    Inspiration

    • This is the perfect dataset for fighting language wars.
    • Can you identify any signals that predict which packages or languages will become popular, in advance of their mass adoption?
  20. Paimon Dataset YOLO Detection Dataset

    • paperswithcode.com
    • gts.ai
    Updated Dec 3, 2024
    + more versions
    Cite
    (2024). Paimon Dataset YOLO Detection Dataset [Dataset]. https://paperswithcode.com/dataset/paimon-dataset-yolo-detection
    Explore at:
    Dataset updated
    Dec 3, 2024
    Description

    Description:


    This dataset consists of a diverse collection of images featuring Paimon, a popular character from the game Genshin Impact. The images have been sourced from in-game gameplay footage and capture Paimon from various angles and in different sizes (scales), making the dataset suitable for training YOLO object detection models.

    The dataset provides a comprehensive view of Paimon in different lighting conditions, game environments, and positions, ensuring the model can generalize well to similar characters or object detection tasks. While most annotations are accurately labeled, a small number of annotations may include minor inaccuracies due to manual labeling errors. This is ideal for researchers and developers working on character recognition, object detection in gaming environments, or other AI vision tasks.


    Dataset Features:

    Image Format: .jpg files in 640×320 resolution.

    Annotation Format: .txt files in YOLO format, containing bounding box data with:

    class_id

    x_center

    y_center

    width

    height
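Each annotation line stores these five fields with coordinates normalized to [0, 1]; converting a line back to pixel units for the 640×320 images described above can be sketched as:

```python
def yolo_to_pixels(line: str, img_w: int, img_h: int):
    """Convert one YOLO annotation line to (class_id, x, y, w, h) in pixels."""
    class_id, xc, yc, w, h = line.split()
    return (int(class_id),
            float(xc) * img_w, float(yc) * img_h,
            float(w) * img_w, float(h) * img_h)

# "class_id x_center y_center width height", coordinates normalized to [0, 1]
print(yolo_to_pixels("0 0.5 0.5 0.25 0.5", 640, 320))
# -> (0, 320.0, 160.0, 160.0, 160.0)
```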

    Use Cases:

    Character Detection in Games: Train YOLO models to detect and identify in-game characters or NPCs.

    Gaming Analytics: Improve recognition of specific game elements for AI-powered game analytics tools.

    Research: Contribute to academic research focused on object detection or computer vision in animated and gaming environments.

    Data Structure:

    Images: High-quality .jpg images captured from multiple perspectives, ensuring robust model training across various orientations and lighting scenarios.

    Annotations: Each image has an associated .txt file that follows the YOLO format. The annotations are structured to include class identification, object location (center coordinates), and bounding box dimensions.

    Key Advantages:

    Varied Angles and Scales: The dataset includes Paimon from multiple perspectives, aiding in creating more versatile and adaptable object detection models.

    Real-World Scenario: Extracted from actual gameplay footage, the dataset simulates real-world detection challenges such as varying backgrounds, motion blur, and changing character scales.

    Training Ready: Suitable for training YOLO models and other deep learning frameworks that require object detection capabilities.

    This dataset is sourced from Kaggle.
