100+ datasets found
  1. large-data

    • kaggle.com
    zip
    Updated Aug 13, 2024
    Cite
    AYUSH SINGH331 (2024). large-data [Dataset]. https://www.kaggle.com/datasets/ayushsingh331/large-data/versions/1
    Explore at:
    zip (1203746376 bytes)
    Dataset updated
    Aug 13, 2024
    Authors
    AYUSH SINGH331
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset

    This dataset was created by AYUSH SINGH331

    Released under MIT


  2. DISL

    • huggingface.co
    Updated Jan 15, 2024
    Cite
    ASSERT | Research group at KTH Royal Institute of Technology (2024). DISL [Dataset]. https://huggingface.co/datasets/ASSERT-KTH/DISL
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 15, 2024
    Dataset authored and provided by
    ASSERT | Research group at KTH Royal Institute of Technology
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    DISL

    The DISL dataset features a collection of 514506 unique Solidity files that have been deployed to Ethereum mainnet. It caters to the need for a large and diverse dataset of real-world smart contracts. DISL serves as a resource for developing machine learning systems and for benchmarking software engineering tools designed for smart contracts.

      Content
    

    The raw subset contains the full contract source code and is not deduplicated; it has 3,298,271 smart contracts. The… See the full description on the dataset page: https://huggingface.co/datasets/ASSERT-KTH/DISL.

  3. Network Devices Large Dataset

    • universe.roboflow.com
    zip
    Updated Oct 10, 2022
    Cite
    Usman Mughal (2022). Network Devices Large Dataset [Dataset]. https://universe.roboflow.com/usman-mughal-ky7sf/network-devices-large/dataset/1
    Explore at:
    zip
    Dataset updated
    Oct 10, 2022
    Dataset authored and provided by
    Usman Mughal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Modem Router Wireless Bounding Boxes
    Description

    Network Devices Large

    ## Overview

    Network Devices Large is a dataset for object detection tasks - it contains Modem Router Wireless annotations for 225 images.

    ## Getting Started

    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

    ## License

    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  4. Big-Math-RL-UNVERIFIED

    • huggingface.co
    Updated Apr 16, 2025
    Cite
    SynthLabs (2025). Big-Math-RL-UNVERIFIED [Dataset]. https://huggingface.co/datasets/SynthLabsAI/Big-Math-RL-UNVERIFIED
    Explore at:
    Dataset updated
    Apr 16, 2025
    Dataset authored and provided by
    SynthLabs
    Description

    Big-Math: UNVERIFIED

    [!WARNING] WARNING: This dataset contains ONLY questions whose answers have not been verified to be correct. Use this dataset with caution.

      Dataset Creation
    

    Big-Math-Unverified is created as an offshoot of the Big-Math dataset (HuggingFace Dataset Link). Big-Math-Unverified goes through the same filters as the rest of Big-Math (e.g., removing non-English problems and multiple-choice questions), except that these problems were not solved in any of the… See the full description on the dataset page: https://huggingface.co/datasets/SynthLabsAI/Big-Math-RL-UNVERIFIED.

  5. Refined DataCo Supply Chain Geospatial Dataset

    • kaggle.com
    zip
    Updated Jan 29, 2025
    Cite
    Om Gupta (2025). Refined DataCo Supply Chain Geospatial Dataset [Dataset]. https://www.kaggle.com/datasets/aaumgupta/refined-dataco-supply-chain-geospatial-dataset
    Explore at:
    zip (29010639 bytes)
    Dataset updated
    Jan 29, 2025
    Authors
    Om Gupta
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Refined DataCo Smart Supply Chain Geospatial Dataset

    Optimized for Geospatial and Big Data Analysis

    This dataset is a refined and enhanced version of the original DataCo SMART SUPPLY CHAIN FOR BIG DATA ANALYSIS dataset, specifically designed for advanced geospatial and big data analysis. It incorporates geocoded information, language translations, and cleaned data to enable applications in logistics optimization, supply chain visualization, and performance analytics.

    Key Features

    1. Geocoded Source and Destination Data

    • Accurate latitude and longitude coordinates for both source and destination locations.
    • Facilitates geospatial mapping, route analysis, and distance calculations.

    2. Supplementary GeoJSON Files

    • src_points.geojson: Source point geometries.
    • dest_points.geojson: Destination point geometries.
    • routes.geojson: Line geometries representing source-destination routes.
    • These files are compatible with GIS software and geospatial libraries such as GeoPandas, Folium, and QGIS.

    3. Language Translation

    • Key location fields (countries, states, and cities) are translated into English for consistency and global accessibility.

    4. Cleaned and Consolidated Data

    • Addressed missing values, removed duplicates, and corrected erroneous entries.
    • Ready-to-use dataset for analysis without additional preprocessing.

    5. Routes and Points Geometry

    • Enables the creation of spatial visualizations, hotspot identification, and route efficiency analyses.

    Applications

    1. Logistics Optimization

    • Analyze transportation routes and delivery performance to improve efficiency and reduce costs.

    2. Supply Chain Visualization

    • Create detailed maps to visualize the global flow of goods.

    3. Geospatial Modeling

    • Perform proximity analysis, clustering, and geospatial regression to uncover patterns in supply chain operations.

    4. Business Intelligence

    • Use the dataset for KPI tracking, decision-making, and operational insights.

    Dataset Content

    Files Included

    1. DataCoSupplyChainDatasetRefined.csv

      • The main dataset containing cleaned fields, geospatial coordinates, and English translations.
    2. src_points.geojson

      • GeoJSON file containing the source points for easy visualization and analysis.
    3. dest_points.geojson

      • GeoJSON file containing the destination points.
    4. routes.geojson

      • GeoJSON file with LineStrings representing routes between source and destination points.
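    For readers unfamiliar with the format, here is a minimal sketch of the FeatureCollection structure that a routes file such as routes.geojson typically contains, built with only the Python standard library; the property names and coordinates below are illustrative, not the dataset's actual schema.

```python
import json

# A minimal GeoJSON FeatureCollection with one source-destination route.
# Coordinates are illustrative [longitude, latitude] pairs, and the
# "order_id" property is a made-up field, not the dataset's schema.
routes = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {
                "type": "LineString",
                "coordinates": [[-74.0060, 40.7128], [2.3522, 48.8566]],
            },
            "properties": {"order_id": 1},
        }
    ],
}

# Round-trip through JSON, as GIS tools would when loading the file.
parsed = json.loads(json.dumps(routes))
print(parsed["features"][0]["geometry"]["type"])  # LineString
```

    Tools such as GeoPandas or QGIS read exactly this structure when the file is opened.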

    Attribution

    This dataset is based on the original dataset published by Fabian Constante, Fernando Silva, and António Pereira:
    Constante, Fabian; Silva, Fernando; Pereira, António (2019), “DataCo SMART SUPPLY CHAIN FOR BIG DATA ANALYSIS”, Mendeley Data, V5, doi: 10.17632/8gx2fvg2k6.5.

    Refinements include geospatial processing, translation, and additional cleaning by the uploader to enhance usability and analytical potential.

    Tips for Using the Dataset

    • For geospatial analysis, leverage tools like GeoPandas, QGIS, or Folium to visualize routes and points.
    • Use the GeoJSON files for interactive mapping and spatial queries.
    • Combine this dataset with external datasets (e.g., road networks) for enriched analytics.

    This dataset is designed to empower data scientists, researchers, and business professionals to explore the intersection of geospatial intelligence and supply chain optimization.
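    As a concrete instance of the distance calculations the latitude/longitude coordinates enable, the great-circle (haversine) distance between a source and a destination can be computed directly; the coordinate pair below is illustrative, not a row from the dataset.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Illustrative source/destination pair (New York -> Paris).
d = haversine_km(40.7128, -74.0060, 48.8566, 2.3522)
print(f"{d:.0f} km")  # on the order of 5,800 km
```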

  6. Data from: Additive Hazards Regression Analysis of Massive Interval-Censored...

    • tandf.figshare.com
    • datasetcatalog.nlm.nih.gov
    pdf
    Updated May 12, 2025
    Cite
    Peiyao Huang; Shuwei Li; Xinyuan Song (2025). Additive Hazards Regression Analysis of Massive Interval-Censored Data via Data Splitting [Dataset]. http://doi.org/10.6084/m9.figshare.27103243.v1
    Explore at:
    pdf
    Dataset updated
    May 12, 2025
    Dataset provided by
    Taylor & Francis
    Authors
    Peiyao Huang; Shuwei Li; Xinyuan Song
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    With the rapid development of data acquisition and storage, massive datasets with large sample sizes are increasingly common and make more advanced statistical tools urgently needed. To accommodate such volume, a variety of methods have been proposed for complete or right-censored survival data. However, existing big data methodology has not attended to interval-censored outcomes, which are ubiquitous in cross-sectional or periodic follow-up studies. In this work, we propose an easily implemented divide-and-combine approach for analyzing massive interval-censored survival data under the additive hazards model. We establish the asymptotic properties of the proposed estimator, including consistency and asymptotic normality. In addition, the divide-and-combine estimator is shown to be asymptotically equivalent to the full-data-based estimator obtained from analyzing all data together. Simulation studies suggest that, relative to the full-data-based approach, the proposed divide-and-combine approach has a desirable advantage in computation time, making it more applicable to large-scale data analysis. An application to a set of interval-censored data also demonstrates the practical utility of the proposed method.
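    The divide-and-combine idea described in the abstract can be illustrated with a deliberately simple stand-in: split a large sample into blocks, estimate a parameter on each block separately, and average the block-level estimates. The sketch below uses a plain mean as the parameter; it shows the general principle only, not the authors' additive hazards estimator.

```python
import random

def divide_and_combine_mean(data, n_blocks):
    """Estimate a mean by averaging per-block estimates.

    Toy stand-in for the divide-and-combine principle: each block is
    analysed separately and the block-level estimates are combined.
    """
    size = len(data) // n_blocks
    blocks = [data[i * size:(i + 1) * size] for i in range(n_blocks)]
    block_means = [sum(b) / len(b) for b in blocks]
    return sum(block_means) / n_blocks

random.seed(0)
data = [random.gauss(5.0, 2.0) for _ in range(100_000)]

full = sum(data) / len(data)                       # full-data estimate
combined = divide_and_combine_mean(data, n_blocks=10)

# With equal-sized blocks the two estimates coincide up to float error,
# mirroring the asymptotic equivalence claimed for the real estimator.
print(abs(full - combined) < 1e-9)  # True
```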

  7. Data from: A large dataset of detection and submeter-accurate 3-D...

    • datadryad.org
    zip
    Updated Jul 14, 2021
    Cite
    Jayson Martinez; Tao Fu; Xinya Li; Hongfei Hou; Jingxian Wang; Brad Eppard; Zhiqun Deng (2021). A large dataset of detection and submeter-accurate 3-D trajectories of juvenile Chinook salmon [Dataset]. http://doi.org/10.5061/dryad.tdz08kpzd
    Explore at:
    zip
    Dataset updated
    Jul 14, 2021
    Dataset provided by
    Dryad
    Authors
    Jayson Martinez; Tao Fu; Xinya Li; Hongfei Hou; Jingxian Wang; Brad Eppard; Zhiqun Deng
    Time period covered
    Jun 29, 2021
    Description

    Use of JSATS can generate a large volume of data. To manage and visualize the data, an integrated suite of science-based tools known as the Hydropower Biological Evaluation Toolset (HBET) can be used.

  8. FoundationTactile

    • huggingface.co
    Updated Feb 19, 2025
    Cite
    Alan Zhao (2025). FoundationTactile [Dataset]. https://huggingface.co/datasets/alanz-mit/FoundationTactile
    Explore at:
    Dataset updated
    Feb 19, 2025
    Authors
    Alan Zhao
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Foundation Tactile (FoTa) - a multi-sensor multi-task large dataset for tactile sensing

    This repository stores the FoTa dataset and the pretrained checkpoints of Transferable Tactile Transformers (T3).

    Paper | Code | Colab

    [Project Website] Jialiang (Alan) Zhao, Yuxiang Ma, Lirui Wang, and Edward H. Adelson, MIT CSAIL

    Overview

    FoTa was released with Transferable Tactile Transformers (T3) as a large dataset for tactile… See the full description on the dataset page: https://huggingface.co/datasets/alanz-mit/FoundationTactile.

  9. Major US Open Data Domains

    • catalog.data.gov
    • data.kingcounty.gov
    • +1 more
    Updated Feb 2, 2024
    Cite
    data.kingcounty.gov (2024). Major US Open Data Domains [Dataset]. https://catalog.data.gov/dataset/major-us-open-data-domains
    Explore at:
    Dataset updated
    Feb 2, 2024
    Dataset provided by
    data.kingcounty.gov
    Area covered
    United States
    Description

    An incomplete collection of open data domains throughout the U.S. (intended for comparison with King County open data)

  10. A dataset to investigate ChatGPT for enhancing Students' Learning Experience...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 19, 2024
    Cite
    Schicchi, Daniele; Taibi, Davide (2024). A dataset to investigate ChatGPT for enhancing Students' Learning Experience via Concept Maps [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_12076680
    Explore at:
    Dataset updated
    Jun 19, 2024
    Dataset provided by
    Institute for Educational Technology, National Research Council of Italy
    Authors
    Schicchi, Daniele; Taibi, Davide
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset was compiled to examine the use of ChatGPT 3.5 in educational settings, particularly for creating and personalizing concept maps. The data are organized into three folders: Maps, Texts, and Questionnaires. The Maps folder contains the graphical representations of the concept maps and the PlantUML code for drawing them, in Italian and English. The Texts folder contains the source text used as input for the maps' creation. The Questionnaires folder includes the students' responses to the three administered questionnaires.

  11. DaMuEL 1.0: A Large Multilingual Dataset for Entity Linking

    • live.european-language-grid.eu
    binary format
    Updated Jun 15, 2023
    + more versions
    Cite
    (2023). DaMuEL 1.0: A Large Multilingual Dataset for Entity Linking [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/22959
    Explore at:
    binary format
    Dataset updated
    Jun 15, 2023
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    We present DaMuEL, a large Multilingual Dataset for Entity Linking containing data in 53 languages. DaMuEL consists of two components: a knowledge base that contains language-agnostic information about entities, including their claims from Wikidata and named entity types (PER, ORG, LOC, EVENT, BRAND, WORK_OF_ART, MANUFACTURED); and Wikipedia texts with entity mentions linked to the knowledge base, along with language-specific text from Wikidata such as labels, aliases, and descriptions, stored separately for each language. The Wikidata QID is used as a persistent, language-agnostic identifier, enabling the combination of the knowledge base with language-specific texts and information for each entity. Wikipedia documents deliberately annotate only a single mention for every entity present; we further automatically detect all mentions of named entities linked from each document. The dataset contains 27.9M named entities in the knowledge base and 12.3G tokens from Wikipedia texts. The dataset is published under the CC BY-SA licence.

  12. RBD24 - Risk Activities Dataset 2024

    • zenodo.org
    bin
    Updated Mar 4, 2025
    Cite
    Calvo Albert; Escuder Santiago; Ortiz Nil; Escrig Josep; Compastié Maxime (2025). RBD24 - Risk Activities Dataset 2024 [Dataset]. http://doi.org/10.5281/zenodo.13787591
    Explore at:
    bin
    Dataset updated
    Mar 4, 2025
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Calvo Albert; Escuder Santiago; Ortiz Nil; Escrig Josep; Compastié Maxime
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    This repository contains a selection of behavioral datasets collected using soluble agents and labeled using realistic threat simulation and IDS rules. The collected datasets are anonymized and aggregated using time window representations. The dataset generation pipeline preprocesses the application logs from the corporate network, structures them according to entities and users inventory, and labels them based on the IDS and phishing simulation appliances.

    This repository is associated with the article "RBD24: A labelled dataset with risk activities using log applications data", published in the journal Computers & Security. For more information see https://doi.org/10.1016/j.cose.2024.104290.

    Summary of the Datasets

    The RBD24 dataset comprises various risk activities collected from real entities and users over a period of 15 days, with the samples segmented by Desktop (DE) and Smartphone (SM) devices.

    | Dataset Id | Entity | Observed Behaviour | Groundtruth | Sample Shape |
    | --- | --- | --- | --- | --- |
    | Crypto_desktop.parquet | DE | Miner Checking | IDS | 0: 738/161202, 1: 11/1343 |
    | Crypto_smarphone.parquet | SM | Miner Checking | IDS | 0: 613/180021, 1: 4/956 |
    | OutFlash_desktop.parquet | DE | Outdated software components | IDS | 0: 738/161202, 1: 56/10820 |
    | OutFlash_smartphone.parquet | SM | Outdated software components | IDS | 0: 613/180021, 1: 22/6639 |
    | OutTLS_desktop.parquet | DE | Outdated TLS protocol | IDS | 0: 738/161202, 1: 18/2458 |
    | OutTLS_smartphone.parquet | SM | Outdated TLS protocol | IDS | 0: 613/180021, 1: 11/2930 |
    | P2P_desktop.parquet | DE | P2P Activity | IDS | 0: 738/161202, 1: 177/35892 |
    | P2P_smartphone.parquet | SM | P2P Activity | IDS | 0: 613/180021, 1: 94/21688 |
    | NonEnc_desktop.parquet | DE | Non-encrypted password | IDS | 0: 738/161202, 1: 291/59943 |
    | NonEnc_smaprthone.parquet | SM | Non-encrypted password | IDS | 0: 613/180021, 1: 167/41434 |
    | Phishing_desktop.parquet | DE | Phishing email | Experimental Campaign | 0: 98/13864, 1: 19/3072 |
    | Phishing_smartphone.parquet | SM | Phishing email | Experimental Campaign | 0: 117/34006, 1: 26/8968 |

    Methodology

    To collect the dataset, we have deployed multiple agents and soluble agents within an infrastructure with
    more than 3k entities, comprising laptops, workstations, and smartphone devices. The methods to build
    ground truth are as follows:

    - Simulator: We launch different realistic phishing campaigns, aiming to expose user credentials or defeat access to a service.
    - IDS: We deploy an IDS to collect various alerts associated with behavioral anomalies, such as cryptomining or peer-to-peer traffic.

    For each user exposed to the behaviors listed in the summary table, a time window (TW) is computed, aggregating user behavior within a fixed time interval. These TWs serve as the basis for training various supervised and unsupervised methods.
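    The fixed-interval aggregation described above can be sketched as bucketing log events by user and window start; the event structure below is illustrative, not the dataset's actual log schema.

```python
from collections import defaultdict

def to_time_windows(events, window_seconds):
    """Group (timestamp, user) log events into fixed-width time windows.

    Returns {(user, window_start): event_count} - a toy stand-in for the
    per-user, per-interval feature aggregation described above.
    """
    windows = defaultdict(int)
    for ts, user in events:
        window_start = (ts // window_seconds) * window_seconds
        windows[(user, window_start)] += 1
    return dict(windows)

# Illustrative events: (unix_timestamp, user_hash)
events = [(0, "u1"), (30, "u1"), (70, "u1"), (10, "u2")]
tw = to_time_windows(events, window_seconds=60)
print(tw)  # {('u1', 0): 2, ('u1', 60): 1, ('u2', 0): 1}
```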

    Sample Representation

    The time windows (TW) are a data representation based on aggregated logs from multimodal sources between two timestamps. In this study, logs from HTTP, DNS, SSL, and SMTP are taken into consideration, allowing the construction of rich behavioral profiles. The indicators described in the TW are a set of manually curated, interpretable features designed to describe device-level properties within the specified time frame. The most influential features are described below.

    • User: A unique hash value that identifies a user.
    • Timestamp: The timestamp of the window.
    • Features: The aggregated behavioral indicators described above.
    • Label: 1 if the user exhibits compromised behavior, 0 otherwise; -1 indicates a TW with an unknown label.

    Dataset Format

    Parquet is a columnar storage format, which enhances efficiency and compression, making it suitable for large datasets and complex analytical tasks. It is supported across various tools and languages, including Python: pandas can read and write Parquet files through the `pyarrow` or `fastparquet` libraries. Its efficient data retrieval and fast query execution improve performance over other formats. Compared to row-based formats such as CSV, Parquet's columnar storage greatly reduces read times and storage costs for large datasets. Although binary formats like HDF5 are effective for specific use cases, Parquet provides broader compatibility and optimization. The provided datasets use the Parquet format. Here is an example of how to read the data with pandas; ensure you have the fastparquet library installed:

    ```python
    import pandas as pd

    # Read a Parquet file with the fastparquet engine
    df = pd.read_parquet('path_to_your_file.parquet', engine='fastparquet')
    ```

  13. Data from: USHAP: Big Data Seamless 1 km Ground-level PM2.5 Dataset for the...

    • iro.uiowa.edu
    • data.niaid.nih.gov
    Updated May 1, 2023
    + more versions
    Cite
    Jing Wei; Jun Wang; Zhanqing Li (2023). USHAP: Big Data Seamless 1 km Ground-level PM2.5 Dataset for the United States [Dataset]. https://iro.uiowa.edu/esploro/outputs/dataset/USHAP-Big-Data-Seamless-1-km/9984702835302771
    Explore at:
    Dataset updated
    May 1, 2023
    Dataset provided by
    Zenodo
    Authors
    Jing Wei; Jun Wang; Zhanqing Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    May 1, 2023
    Area covered
    United States
    Description

    USHAP (USHighAirPollutants) is one of a series of long-term, full-coverage, high-resolution, and high-quality datasets of ground-level air pollutants for the United States. It is generated from big data (e.g., ground-based measurements, satellite remote sensing products, atmospheric reanalysis, and model simulations) using artificial intelligence, taking the spatiotemporal heterogeneity of air pollution into account. This is the big data-derived seamless (spatial coverage = 100%) daily, monthly, and yearly 1 km (i.e., D1K, M1K, and Y1K) ground-level PM2.5 dataset in the United States from 2000 to 2020. Our daily PM2.5 estimates agree well with ground measurements, with an average cross-validation coefficient of determination (CV-R2) of 0.82 and a normalized root-mean-square error (NRMSE) of 0.40. All the data will be made public online once our paper is accepted; if you want to use the USHighPM2.5 dataset for related scientific research, please contact us (Email: weijing_rs@163.com; weijing@umd.edu).

    Wei, J., Wang, J., Li, Z., Kondragunta, S., Anenberg, S., Wang, Y., Zhang, H., Diner, D., Hand, J., Lyapustin, A., Kahn, R., Colarco, P., da Silva, A., and Ichoku, C. Long-term mortality burden trends attributed to black carbon and PM2.5 from wildfire emissions across the continental USA from 2000 to 2020: a deep learning modelling study. The Lancet Planetary Health, 2023, 7, e963–e975. https://doi.org/10.1016/S2542-5196(23)00235-8

    More air quality datasets of different air pollutants can be found at: https://weijing-rs.github.io/product.html

  14. Innovating the Data Ecosystem: An Update of the Federal Big Data Research...

    • catalog.data.gov
    • s.cnmilf.com
    • +1 more
    Updated May 14, 2025
    Cite
    NCO NITRD (2025). Innovating the Data Ecosystem: An Update of the Federal Big Data Research and Development Strategic Plan [Dataset]. https://catalog.data.gov/dataset/innovating-the-data-ecosystem-an-update-of-the-federal-big-data-research-and-development-s
    Explore at:
    Dataset updated
    May 14, 2025
    Dataset provided by
    NCO NITRD
    Description

    This document, Innovating the Data Ecosystem: An Update of The Federal Big Data Research and Development Strategic Plan, updates the 2016 Federal Big Data Research and Development Strategic Plan. It updates the vision and strategies for big data research and development laid out in the 2016 plan through six strategy areas (enhance the reusability and integrity of data; enable innovative, user-driven data science; develop and enhance the robustness of the federated ecosystem; prioritize privacy, ethics, and security; develop necessary expertise and diverse talent; and enhance U.S. leadership in the international context) to enhance data value, reusability, and responsiveness to federal policies on data sharing and management.

  15. Gratis, OH Population Breakdown by Gender Dataset: Male and Female...

    • neilsberg.com
    csv, json
    Updated Feb 24, 2025
    + more versions
    Cite
    Neilsberg Research (2025). Gratis, OH Population Breakdown by Gender Dataset: Male and Female Population Distribution // 2025 Edition [Dataset]. https://www.neilsberg.com/research/datasets/b235d8fd-f25d-11ef-8c1b-3860777c1fe6/
    Explore at:
    json, csv
    Dataset updated
    Feb 24, 2025
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Gratis
    Variables measured
    Male Population, Female Population, Male Population as Percent of Total Population, Female Population as Percent of Total Population
    Measurement technique
    The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. To measure the two variables, namely (a) population and (b) population as a percentage of the total population, we initially analyzed and categorized the data for each of the gender classifications (biological sex) reported by the US Census Bureau. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the population of Gratis by gender, including both male and female populations. This dataset can be utilized to understand the population distribution of Gratis across both sexes and to determine which sex constitutes the majority.

    Key observations

    There is a slight majority of female population, with 50.0% of total population being female. Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.

    Content

    When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.

    Scope of gender :

    Please note that the American Community Survey asks a question about the respondent's current sex, but not about gender, sexual orientation, or sex at birth. The question is intended to capture data for biological sex, not gender. Respondents are expected to answer either Male or Female. Our research and this dataset mirror the data reported as Male and Female for gender distribution analysis. No further analysis is done on the data reported by the Census Bureau.

    Variables / Data Columns

    • Gender: This column displays the Gender (Male / Female)
    • Population: The population of the gender in the Gratis is shown in this column.
    • % of Total Population: This column displays the percentage distribution of each gender as a proportion of Gratis's total population. Please note that the percentages may not sum to 100% due to rounding.

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for a research project, report, or presentation, contact our research staff at research@neilsberg.com about the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research team curates, analyzes, and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research's aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

    This dataset is a part of the main dataset for Gratis Population by Race & Ethnicity. You can refer to it here.

  16. Details of dataset information.

    • plos.figshare.com
    xls
    Updated May 10, 2024
    Cite
    Fahmi H. Quradaa; Sara Shahzad; Rashad Saeed; Mubarak M. Sufyan (2024). Details of dataset information. [Dataset]. http://doi.org/10.1371/journal.pone.0302333.t005
    Explore at:
    xls
    Dataset updated
    May 10, 2024
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Fahmi H. Quradaa; Sara Shahzad; Rashad Saeed; Mubarak M. Sufyan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In software development, it’s common to reuse existing source code by copying and pasting, resulting in the proliferation of numerous code clones—similar or identical code fragments—that detrimentally affect software quality and maintainability. Although several techniques for code clone detection exist, many encounter challenges in effectively identifying semantic clones due to their inability to extract syntax and semantics information. Fewer techniques leverage low-level source code representations like bytecode or assembly for clone detection. This work introduces a novel code representation for identifying syntactic and semantic clones in Java source code. It integrates high-level features extracted from the Abstract Syntax Tree with low-level features derived from intermediate representations generated by static analysis tools, like the Soot framework. Leveraging this combined representation, fifteen machine-learning models are trained to effectively detect code clones. Evaluation on a large dataset demonstrates the models’ efficacy in accurately identifying semantic clones. Among these classifiers, ensemble classifiers, such as the LightGBM classifier, exhibit exceptional accuracy. Linearly combining features enhances the effectiveness of the models compared to multiplication and distance combination techniques. The experimental findings indicate that the proposed method can outperform the current clone detection techniques in detecting semantic clones.
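    The feature-combination comparison in the description above can be made concrete with toy versions of the three operators. The definitions below are one plausible reading of "linear", "multiplication", and "distance" combination of two equal-length feature vectors, not the paper's exact formulation, and the feature values are illustrative.

```python
def linear_combine(u, v, alpha=0.5):
    """Weighted sum of two feature vectors (a 'linear' strategy)."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(u, v)]

def multiply_combine(u, v):
    """Element-wise product (a 'multiplication' strategy)."""
    return [a * b for a, b in zip(u, v)]

def distance_combine(u, v):
    """Element-wise absolute difference (a 'distance' strategy)."""
    return [abs(a - b) for a, b in zip(u, v)]

ast_features = [1.0, 0.0, 2.0]  # illustrative high-level (AST) features
ir_features = [0.5, 1.0, 1.0]   # illustrative low-level (IR) features

print(linear_combine(ast_features, ir_features))    # [0.75, 0.5, 1.5]
print(multiply_combine(ast_features, ir_features))  # [0.5, 0.0, 2.0]
print(distance_combine(ast_features, ir_features))  # [0.5, 1.0, 1.0]
```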

  17. Data from: A Toolbox for Surfacing Health Equity Harms and Biases in Large...

    • springernature.figshare.com
    application/csv
    Updated Sep 24, 2024
    Stephen R. Pfohl; Heather Cole-Lewis; Rory Sayres; Darlene Neal; Mercy Asiedu; Awa Dieng; Nenad Tomasev; Qazi Mamunur Rashid; Shekoofeh Azizi; Negar Rostamzadeh; Liam G. McCoy; Leo Anthony Celi; Yun Liu; Mike Schaekermann; Alanna Walton; Alicia Parrish; Chirag Nagpal; Preeti Singh; Akeiylah Dewitt; Philip Mansfield; Sushant Prakash; Katherine Heller; Alan Karthikesalingam; Christopher Semturs; Joëlle K. Barral; Greg Corrado; Yossi Matias; Jamila Smith-Loud; Ivor B. Horn; Karan Singhal (2024). A Toolbox for Surfacing Health Equity Harms and Biases in Large Language Models [Dataset]. http://doi.org/10.6084/m9.figshare.26133973.v1
    Explore at:
    application/csvAvailable download formats
    Dataset updated
    Sep 24, 2024
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Stephen R. Pfohl; Heather Cole-Lewis; Rory Sayres; Darlene Neal; Mercy Asiedu; Awa Dieng; Nenad Tomasev; Qazi Mamunur Rashid; Shekoofeh Azizi; Negar Rostamzadeh; Liam G. McCoy; Leo Anthony Celi; Yun Liu; Mike Schaekermann; Alanna Walton; Alicia Parrish; Chirag Nagpal; Preeti Singh; Akeiylah Dewitt; Philip Mansfield; Sushant Prakash; Katherine Heller; Alan Karthikesalingam; Christopher Semturs; Joëlle K. Barral; Greg Corrado; Yossi Matias; Jamila Smith-Loud; Ivor B. Horn; Karan Singhal
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplementary material and data for Pfohl and Cole-Lewis et al., "A Toolbox for Surfacing Health Equity Harms and Biases in Large Language Models" (2024).

    We include the sets of adversarial questions for each of the seven EquityMedQA datasets (OMAQ, EHAI, FBRT-Manual, FBRT-LLM, TRINDS, CC-Manual, and CC-LLM), the three other non-EquityMedQA datasets used in this work (HealthSearchQA, Mixed MMQA-OMAQ, and Omiye et al.), as well as the data generated as a part of the empirical study, including the generated model outputs (Med-PaLM 2 [1] primarily, with Med-PaLM [2] answers for pairwise analyses) and ratings from human annotators (physicians, health equity experts, and consumers). See the paper for details on all datasets.

    We include the other datasets evaluated in this work: HealthSearchQA [2], Mixed MMQA-OMAQ, and Omiye et al. [3].

    • Mixed MMQA-OMAQ is composed of the 140-question subset of MultiMedQA questions described in [1,2] with an additional 100 questions from OMAQ (described below). The 140 MultiMedQA questions are composed of 100 from HealthSearchQA, 20 from LiveQA [4], and 20 from MedicationQA [5]. In the data presented here, we do not reproduce the text of the questions from LiveQA and MedicationQA. For LiveQA, we instead use identifiers that correspond to those presented in the original dataset. For MedicationQA, we designate "MedicationQA_N" to refer to the N-th row of MedicationQA (0-indexed).
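    The MedicationQA identifier convention above can be expressed as a small helper. This is a sketch; the function name is hypothetical, while the "MedicationQA_N", 0-indexed convention comes from the dataset description.

    ```python
    def medicationqa_id(row_index: int) -> str:
        """Identifier for the row_index-th (0-indexed) row of MedicationQA."""
        if row_index < 0:
            raise ValueError("row index must be non-negative")
        return f"MedicationQA_{row_index}"
    ```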

    A limited number of data elements described in the paper are not included here. The following elements are excluded:

    1. The reference answers written by physicians to HealthSearchQA questions, introduced in [2], and the set of corresponding pairwise ratings. This accounts for 2,122 rated instances.

    2. The free-text comments written by raters during the ratings process.

    3. Demographic information associated with the consumer raters (only age group information is included).

    References

    1. Singhal, K., et al. Towards expert-level medical question answering with large language models. arXiv preprint arXiv:2305.09617 (2023).

    2. Singhal, K., Azizi, S., Tu, T. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023). https://doi.org/10.1038/s41586-023-06291-2

    3. Omiye, J.A., Lester, J.C., Spichak, S. et al. Large language models propagate race-based medicine. npj Digit. Med. 6, 195 (2023). https://doi.org/10.1038/s41746-023-00939-z

    4. Abacha, Asma Ben, et al. "Overview of the medical question answering task at TREC 2017 LiveQA." TREC. 2017.

    5. Abacha, Asma Ben, et al. "Bridging the gap between consumers’ medication questions and trusted answers." MEDINFO 2019: Health and Wellbeing e-Networks for All. IOS Press, 2019. 25-29.

    Description of files and sheets

    1. Independent Ratings [ratings_independent.csv]: Contains ratings of the presence of bias and its dimensions in Med-PaLM 2 outputs using the independent assessment rubric for each of the datasets studied. The primary response regarding the presence of bias is encoded in the column bias_presence with three possible values (No bias, Minor bias, Severe bias). Binary assessments of the dimensions of bias are encoded in separate columns (e.g., inaccuracy_for_some_axes). Instances for the Mixed MMQA-OMAQ dataset are triple-rated for each rater group; other datasets are single-rated. Ratings were missing for five instances in Mixed MMQA-OMAQ and two instances in CC-Manual. This file contains 7,519 rated instances.

    2. Paired Ratings [ratings_pairwise.csv]: Contains comparisons of the presence or degree of bias and its dimensions in Med-PaLM and Med-PaLM 2 outputs for each of the datasets studied. Pairwise responses are encoded in terms of two binary columns corresponding to which of the answers was judged to contain a greater degree of bias (e.g., Med-PaLM-2_answer_more_bias). Dimensions of bias are encoded in the same way as for ratings_independent.csv. Instances for the Mixed MMQA-OMAQ dataset are triple-rated for each rater group; other datasets are single-rated. Four ratings were missing (one for EHAI, two for FBRT-Manual, one for FBRT-LLM). This file contains 6,446 rated instances.

    3. Counterfactual Paired Ratings [ratings_counterfactual.csv]: Contains ratings under the counterfactual rubric for pairs of questions defined in the CC-Manual and CC-LLM datasets. Contains a binary assessment of the presence of bias (bias_presence), columns for each dimension of bias, and categorical columns corresponding to other elements of the rubric (ideal_answers_diff, how_answers_diff). Instances for the CC-Manual dataset are triple-rated; instances for CC-LLM are single-rated. Due to a data processing error, we removed questions that refer to "Natal" from the analysis of the counterfactual rubric on the CC-Manual dataset. This affects three questions (corresponding to 21 pairs) derived from one seed question based on the TRINDS dataset. This file contains 1,012 rated instances.

    4. Open-ended Medical Adversarial Queries (OMAQ) [equitymedqa_omaq.csv]: Contains questions that compose the OMAQ dataset. The OMAQ dataset was first described in [1].

    5. Equity in Health AI (EHAI) [equitymedqa_ehai.csv]: Contains questions that compose the EHAI dataset.

    6. Failure-Based Red Teaming - Manual (FBRT-Manual) [equitymedqa_fbrt_manual.csv]: Contains questions that compose the FBRT-Manual dataset.

    7. Failure-Based Red Teaming - LLM (FBRT-LLM); full [equitymedqa_fbrt_llm.csv]: Contains questions that compose the extended FBRT-LLM dataset.

    8. Failure-Based Red Teaming - LLM (FBRT-LLM) [equitymedqa_fbrt_llm_661_sampled.csv]: Contains questions that compose the sampled FBRT-LLM dataset used in the empirical study.

    9. TRopical and INfectious DiseaseS (TRINDS) [equitymedqa_trinds.csv]: Contains questions that compose the TRINDS dataset.

    10. Counterfactual Context - Manual (CC-Manual) [equitymedqa_cc_manual.csv]: Contains pairs of questions that compose the CC-Manual dataset.

    11. Counterfactual Context - LLM (CC-LLM) [equitymedqa_cc_llm.csv]: Contains pairs of questions that compose the CC-LLM dataset.

    12. HealthSearchQA [other_datasets_healthsearchqa.csv]: Contains questions sampled from the HealthSearchQA dataset [1,2].

    13. Mixed MMQA-OMAQ [other_datasets_mixed_mmqa_omaq]: Contains questions that compose the Mixed MMQA-OMAQ dataset.

    14. Omiye et al. [other datasets_omiye_et_al]: Contains questions proposed in Omiye et al. [3].
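    Given the schema documented above, the rating files can be tallied with the standard library alone. A minimal sketch, assuming the column names given in the file descriptions (bias_presence with values No bias / Minor bias / Severe bias); the inline CSV rows are fabricated purely for illustration.

    ```python
    import csv
    import io
    from collections import Counter

    # Fabricated sample mimicking the documented ratings_independent.csv schema;
    # the real file also carries further per-dimension columns.
    SAMPLE = """\
    dataset,bias_presence,inaccuracy_for_some_axes
    OMAQ,No bias,0
    EHAI,Minor bias,1
    OMAQ,No bias,0
    """

    def bias_counts(csv_text: str) -> Counter:
        """Tally the three-level bias_presence judgment across rated instances."""
        reader = csv.DictReader(io.StringIO(csv_text))
        return Counter(row["bias_presence"].strip() for row in reader)

    counts = bias_counts(SAMPLE)
    ```

    The same pattern applies to ratings_pairwise.csv, counting the binary Med-PaLM-2_answer_more_bias column instead.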

    Version history

    Version 2: Updated to include ratings and generated model outputs. Dataset files were updated to include unique ids associated with each question.

    Version 1: Contained datasets of questions without ratings; consistent with the v1 preprint on arXiv (https://arxiv.org/abs/2403.12025).

    WARNING: These datasets contain adversarial questions designed specifically to probe biases in AI systems. They can include human-written and model-generated language and content that may be inaccurate, misleading, biased, disturbing, sensitive, or offensive.

    NOTE: the content of this research repository (i) is not intended to be a medical device; and (ii) is not intended for clinical use of any kind, including but not limited to diagnosis or prognosis.

  18. 3 Big Data Dataset

    • universe.roboflow.com
    zip
    Updated Oct 3, 2025
    BIG DATA (2025). 3 Big Data Dataset [Dataset]. https://universe.roboflow.com/big-data-db8ne/3-big-data-myxfg/dataset/1
    Explore at:
    zipAvailable download formats
    Dataset updated
    Oct 3, 2025
    Dataset authored and provided by
    BIG DATA
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Cats Bounding Boxes
    Description

    3 BIG DATA

    ## Overview
    
    3 BIG DATA is a dataset for object detection tasks - it contains Cats annotations for 943 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  19. CIFAR-100

    • datasets.activeloop.ai
    • universe.roboflow.com
    • +5more
    deeplake
    Updated Feb 3, 2022
    Alex Krizhevsky (2022). CIFAR-100 [Dataset]. https://datasets.activeloop.ai/docs/ml/datasets/cifar-100-dataset/
    Explore at:
    deeplakeAvailable download formats
    Dataset updated
    Feb 3, 2022
    Authors
    Alex Krizhevsky
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Sep 8, 2009
    Dataset funded by
    University of Toronto
    Description

    The CIFAR-100 dataset is a labeled-image dataset widely used in machine learning and artificial intelligence research. It consists of 60,000 32x32 colour images split into 100 mutually exclusive classes, with 600 images per class; the classes are further grouped into 20 superclasses covering animals, vehicles, and other everyday objects.

  20. Dataset - CQU_TMR

    • acquire.cqu.edu.au
    • researchdata.edu.au
    bin
    Updated May 26, 2024
    Pubudu Sanjeewani Thihagoda Gamage (2024). Dataset - CQU_TMR [Dataset]. http://doi.org/10.25946/21441051.v1
    Explore at:
    binAvailable download formats
    Dataset updated
    May 26, 2024
    Dataset provided by
    CQUniversity
    Authors
    Pubudu Sanjeewani Thihagoda Gamage
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Road safety systems are essential for planning, managing, and improving road infrastructure and for decreasing road accidents. Manual road safety assessments are inefficient, time-consuming, and prone to error. Some automated systems, using sensors, cameras, lidar, and radar to detect nearby obstacles such as vehicles, pedestrians, lane lines, some traffic signs, and parking slots, have been introduced to reduce road fatalities by minimising human error. However, existing road safety systems in industry cannot accurately detect all of the road safety attributes required by the Australian Road Assessment Program (AusRAP), a program that aims to establish a safer road system through inspection of high-risk roads, star ratings, and safer-roads investment plans that reduce the likelihood of accidents. It is therefore important to explore novel techniques and develop better automated systems that can accurately detect and classify all road safety attributes.

    This research focuses on the development of a novel deep learning technique for the analysis of road safety attributes. Various architectures, learning and optimisation techniques have been investigated to develop an appropriate deep learning-based technique that can detect road safety attributes with high accuracy. Firstly, a single-stage segmentation and classification technique to automatically identify AusRAP attributes has been investigated. Secondly, multi-stage segmentation and classification techniques using various classifiers have been investigated. Finally, Genetic Algorithm (GA) and Particle Swarm Optimisation (PSO)-based techniques have been investigated to optimise the proposed deep learning techniques.

    The proposed techniques were evaluated on a real-world dataset of roadside videos provided by the Department of Transport and Main Roads (DTMR), Queensland, Australia, and the Australian Road Research Board (ARRB). Classification accuracy was the primary performance metric, and specificity, sensitivity, and F1-score were used to further validate efficacy. An analysis and comparison with existing techniques show that the proposed single-stage and multi-stage deep learning techniques achieve better classification accuracy and fewer misclassifications than existing state-of-the-art segmentation and classification techniques. Experimentation showed that the proposed single-stage technique avoids retraining the whole model on all training samples, which is otherwise very time-consuming whenever a new attribute is introduced. Moreover, extensive experimentation showed that a large training dataset is not always necessary, and effective solutions were found that eliminate the need to annotate a large number of samples per attribute while still producing accuracy acceptable to industry. Both the single-stage and multi-stage techniques were also validated on uncropped real-world test data, with pixel-wise predictions obtained for each object; because the location of each predicted object is known, the bounding-box localisation problem is avoided. Through the incorporation of the optimisation techniques, optimum parameters suitable for road safety attributes were determined and proved effective in terms of classification accuracy and time to reach minimum error.
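    The GA-based parameter search described above can be illustrated with a toy genetic algorithm. This is a sketch only, not the thesis implementation: the fitness function below is a fabricated stand-in, whereas in the thesis the objective would be the classification accuracy of the segmentation/classification network under a candidate parameter setting.

    ```python
    import random

    def fitness(params):
        # Hypothetical objective with its peak at learning_rate=0.1, batch_exponent=5;
        # a real run would train and evaluate the network here.
        lr, be = params
        return -((lr - 0.1) ** 2 + 0.01 * (be - 5) ** 2)

    def evolve(pop_size=20, generations=30, seed=0):
        rng = random.Random(seed)
        # Each individual is (learning_rate, batch_exponent) -- hypothetical parameters.
        pop = [(rng.uniform(0.0, 1.0), rng.randint(1, 8)) for _ in range(pop_size)]
        for _ in range(generations):
            pop.sort(key=fitness, reverse=True)
            parents = pop[: pop_size // 2]            # truncation selection (keeps the best)
            children = []
            while len(children) < pop_size - len(parents):
                a, b = rng.sample(parents, 2)
                lr = (a[0] + b[0]) / 2                # crossover: average continuous gene
                be = rng.choice([a[1], b[1]])         # crossover: pick discrete gene
                if rng.random() < 0.3:                # mutation, clipped to the valid range
                    lr = min(1.0, max(0.0, lr + rng.gauss(0, 0.05)))
                children.append((lr, be))
            pop = parents + children
        return max(pop, key=fitness)

    best = evolve()
    ```

    Because the parents survive each generation, the best individual never regresses; a PSO variant would instead move a swarm of candidate settings toward the best-known positions.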
