100+ datasets found
  1. Sample Dataset - HR Subject Areas

    • data.niaid.nih.gov
    Updated Jan 18, 2023
    Cite
    Weber, Marc (2023). Sample Dataset - HR Subject Areas [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7447111
    Explore at:
    Dataset updated
    Jan 18, 2023
    Authors
    Weber, Marc
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset created as part of the Master's thesis "Business Intelligence – Automation of Data Marts modeling and its data processing".

    Lucerne University of Applied Sciences and Arts

    Master of Science in Applied Information and Data Science (MScIDS)

    Autumn Semester 2022

    Change log Version 1.1:

    The following SQL scripts were added:

        Index  Type              Name
        1      View              pg.dictionary_table
        2      View              pg.dictionary_column
        3      View              pg.dictionary_relation
        4      View              pg.accesslayer_table
        5      View              pg.accesslayer_column
        6      View              pg.accesslayer_relation
        7      View              pg.accesslayer_fact_candidate
        8      Stored Procedure  pg.get_fact_candidate
        9      Stored Procedure  pg.get_dimension_candidate
        10     Stored Procedure  pg.get_columns

    The scripts are based on Microsoft SQL Server 2017 and are compatible with a data warehouse built with Datavault Builder. The data warehouse object scripts of the sample data warehouse are restricted and cannot be shared.
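
    Since the change log lists plain database objects, they can be inspected from any SQL Server client. As a hedged illustration (not part of the dataset; the server and database names are placeholders, and the assumption that the procedure takes no arguments is unverified), one way to query the version 1.1 objects from R via ODBC:

        # Hypothetical sketch: inspect the pg.* objects from R via ODBC.
        # Server/database names are placeholders; the dataset's SQL scripts
        # define the actual view and procedure schemas.
        library(DBI)

        con <- dbConnect(
          odbc::odbc(),
          Driver   = "ODBC Driver 17 for SQL Server",
          Server   = "localhost",   # placeholder server
          Database = "sample_dwh",  # placeholder database
          Trusted_Connection = "yes"
        )

        # Read one of the data dictionary views added in version 1.1
        dict_cols <- dbGetQuery(con, "SELECT * FROM pg.dictionary_column;")

        # Call a stored procedure (assuming it takes no arguments and returns a result set)
        facts <- dbGetQuery(con, "EXEC pg.get_fact_candidate;")

        dbDisconnect(con)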

  2. sample-construction-dataset

    • huggingface.co
    Updated Jun 9, 2024
    Cite
    Martin Yankov (2024). sample-construction-dataset [Dataset]. https://huggingface.co/datasets/lutherwaves/sample-construction-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 9, 2024
    Authors
    Martin Yankov
    Description

    lutherwaves/sample-construction-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. Synthetic Data for an Imaginary Country, Sample, 2023 - World

    • microdata.worldbank.org
    • nada-demo.ihsn.org
    Updated Jul 7, 2023
    + more versions
    Cite
    Development Data Group, Data Analytics Unit (2023). Synthetic Data for an Imaginary Country, Sample, 2023 - World [Dataset]. https://microdata.worldbank.org/index.php/catalog/5906
    Explore at:
    Dataset updated
    Jul 7, 2023
    Dataset authored and provided by
    Development Data Group, Data Analytics Unit
    Time period covered
    2023
    Area covered
    World
    Description

    Abstract

    The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, asset ownership). The data only include ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for training and simulation purposes and is not intended to be representative of any specific country.

    The full-population dataset (with about 10 million individuals) is also distributed as open data.

    Geographic coverage

    The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.

    Analysis unit

    Household, Individual

    Universe

    The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The sample size was set to 8,000 households, with a fixed 25 households to be selected from each enumeration area. In the first stage, the number of enumeration areas to be selected in each stratum was calculated proportionally to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
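
    As a rough illustration of this two-stage design (a minimal sketch with toy frames; the dataset's own external R script is the authoritative version):

        # Toy two-stage sample: proportional allocation of enumeration areas
        # (EAs) across strata, then 25 households per selected EA. All objects
        # here are synthetic stand-ins.
        set.seed(2023)

        ea_frame <- data.frame(
          ea_id   = 1:1000,
          stratum = paste0("prov", rep(1:5, each = 2), c("_urban", "_rural"))[
            sample(1:10, 1000, replace = TRUE)],
          n_hh    = sample(80:200, 1000, replace = TRUE)  # households per EA
        )

        hh_per_ea <- 25
        n_eas     <- 8000 / hh_per_ea  # 320 EAs yield the 8,000-household sample

        # Stage 1: allocate EAs to strata proportionally to stratum size
        stratum_hh <- tapply(ea_frame$n_hh, ea_frame$stratum, sum)
        alloc      <- round(n_eas * stratum_hh / sum(stratum_hh))  # rounding left unadjusted

        sampled_eas <- unlist(lapply(names(alloc), function(s) {
          sample(ea_frame$ea_id[ea_frame$stratum == s], alloc[[s]])
        }))

        # Stage 2: draw 25 households at random within each selected EA
        sampled_hh <- do.call(rbind, lapply(sampled_eas, function(ea) {
          n <- ea_frame$n_hh[ea_frame$ea_id == ea]
          data.frame(ea_id = ea, hh_index = sample(n, hh_per_ea))
        }))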

    Mode of data collection

    other

    Research instrument

    The dataset is a synthetic dataset. Although the variables it contains are typically collected in sample surveys or population censuses, no questionnaire is available for this dataset. However, a "fake" questionnaire was created for the sample dataset extracted from this dataset, to be used as training material.

    Cleaning operations

    The synthetic data generation process included a set of "validators" (consistency checks based on which synthetic observations were assessed and rejected or replaced when needed). Some post-processing was then applied to produce the distributed data files.

    Response rate

    This is a synthetic dataset; the "response rate" is 100%.

  4. Simulation Data Set

    • catalog.data.gov
    • s.cnmilf.com
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Simulation Data Set [Dataset]. https://catalog.data.gov/dataset/simulation-data-set
    Explore at:
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures in each week by subtracting the median exposure for that week and dividing by the interquartile range (IQR), as in the actual application to the true NC birth records data. The dataset we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given; this further protects the identifiability of the spatial locations used in the analysis.

    This dataset is not publicly accessible because EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). The dataset contains information about human research subjects, and because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. File format: R workspace file, "Simulated_Dataset.RData".

    Metadata (data dictionary):

    • y: vector of binary responses (1: adverse outcome, 0: control)
    • x: matrix of covariates; one row for each simulated individual
    • z: matrix of standardized pollution exposures
    • n: number of simulated individuals
    • m: number of exposure time periods (e.g., weeks of pregnancy)
    • p: number of columns in the covariate design matrix
    • alpha_true: vector of "true" critical window locations/magnitudes (i.e., the ground truth that we want to estimate)

    Code: We provide R statistical software code ("CWVS_LMC.txt") to fit the linear model of coregionalization (LMC) version of the Critical Window Variable Selection (CWVS) method developed in the manuscript, and R code ("Results_Summary.txt") to summarize and plot the estimated critical windows and posterior marginal inclusion probabilities. Both are delivered as .txt files containing R code. Once the "Simulated_Dataset.RData" workspace has been loaded into R, "CWVS_LMC.txt" can be used to identify/estimate critical windows of susceptibility and posterior marginal inclusion probabilities; once it has completed, "Results_Summary.txt" can be used to summarize and plot the identified/estimated critical windows and posterior marginal inclusion probabilities (similar to the plots shown in the manuscript).

    Required R packages:

    • For "CWVS_LMC.txt": msm (sampling from the truncated normal distribution), mnormt (sampling from the multivariate normal distribution), BayesLogit (sampling from the Polya-Gamma distribution)
    • For "Results_Summary.txt": plotrix (plotting the posterior means and credible intervals)

    Reproducibility: The data and code can be used to identify/estimate critical windows from one of the actual simulated datasets generated under setting E4 of the presented simulation study. To use the information: load the "Simulated_Dataset.RData" workspace, run the code contained in "CWVS_LMC.txt", and once that code is complete, run "Results_Summary.txt".

    Data: The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining the confidentiality of any actual pregnant women.

    Availability: Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This also allows the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. Access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics and requires an appropriate data use agreement.

    This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, Oxford, UK, 1-30, (2019).
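
    A minimal sketch of that workflow, assuming access to the restricted files has been granted (the file and object names come from the data dictionary above; everything else is a placeholder):

        # Load the restricted workspace and check the documented objects
        load("Simulated_Dataset.RData")   # provides y, x, z, n, m, p, alpha_true

        table(y)       # binary outcomes: 1 = adverse outcome, 0 = control
        dim(x)         # n rows of covariates, p columns
        dim(z)         # n rows of standardized weekly exposures, m columns

        # Packages listed above as required by the two scripts
        # install.packages(c("msm", "mnormt", "BayesLogit", "plotrix"))

        # The provided scripts are plain R code in .txt files, run in order:
        # source("CWVS_LMC.txt")          # fit the CWVS-LMC model
        # source("Results_Summary.txt")   # summarize/plot critical windows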

  5. Company Datasets for Business Profiling

    • datarade.ai
    Updated Feb 23, 2017
    Cite
    Oxylabs (2017). Company Datasets for Business Profiling [Dataset]. https://datarade.ai/data-products/company-datasets-for-business-profiling-oxylabs
    Explore at:
    Available download formats: .json, .xml, .csv, .xls
    Dataset updated
    Feb 23, 2017
    Dataset authored and provided by
    Oxylabs
    Area covered
    Tunisia, Northern Mariana Islands, Bangladesh, Isle of Man, Canada, Nepal, British Indian Ocean Territory, Moldova (Republic of), Andorra, Taiwan
    Description

    Company Datasets for valuable business insights!

    Discover new business prospects, identify investment opportunities, track competitor performance, and streamline your sales efforts with comprehensive Company Datasets.

    These datasets are sourced from top industry providers, ensuring you have access to high-quality information:

    • Owler: Gain valuable business insights and competitive intelligence.
    • AngelList: Receive fresh startup data transformed into actionable insights.
    • CrunchBase: Access clean, parsed, and ready-to-use business data from private and public companies.
    • Craft.co: Make data-informed business decisions with Craft.co's company datasets.
    • Product Hunt: Harness the Product Hunt dataset, a leader in curating the best new products.

    We provide fresh and ready-to-use company data, eliminating the need for complex scraping and parsing. Our data includes crucial details such as:

    • Company name;
    • Size;
    • Founding date;
    • Location;
    • Industry;
    • Revenue;
    • Employee count;
    • Competitors.

    You can choose your preferred data delivery method, including various storage options, delivery frequency, and input/output formats.

    Receive datasets in CSV, JSON, and other formats, with storage options like AWS S3 and Google Cloud Storage. Opt for one-time, monthly, quarterly, or bi-annual data delivery.

    With Oxylabs Datasets, you can count on:

    • Fresh and accurate data collected and parsed by our expert web scraping team.
    • Time and resource savings, allowing you to focus on data analysis and achieving your business goals.
    • A customized approach tailored to your specific business needs.
    • Legal compliance in line with GDPR and CCPA standards, thanks to our membership in the Ethical Web Data Collection Initiative.

    Pricing Options:

    Standard Datasets: Choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.

    Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.

    Experience a seamless journey with Oxylabs:

    • Understanding your data needs: We work closely to understand your business nature and daily operations, defining your unique data requirements.
    • Developing a customized solution: Our experts create a custom framework to extract public data using our in-house web scraping infrastructure.
    • Delivering data sample: We provide a sample for your feedback on data quality and the entire delivery process.
    • Continuous data delivery: We continuously collect public data and deliver custom datasets per the agreed frequency.

    Unlock the power of data with Oxylabs' Company Datasets and supercharge your business insights today!

  6. Test data ver21

    • kaggle.com
    zip
    Updated Sep 15, 2022
    Cite
    g-dragon (2022). Test data ver21 [Dataset]. https://www.kaggle.com/datasets/ngotrieulong/test-data-ver21
    Explore at:
    Available download formats: zip (802 bytes)
    Dataset updated
    Sep 15, 2022
    Authors
    g-dragon
    License

    CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by g-dragon

    Released under CC0: Public Domain

  7. Example Stata syntax and data construction for negative binomial time series regression

    • data.mendeley.com
    Updated Nov 2, 2022
    + more versions
    Cite
    Sarah Price (2022). Example Stata syntax and data construction for negative binomial time series regression [Dataset]. http://doi.org/10.17632/3mj526hgzx.2
    Explore at:
    Dataset updated
    Nov 2, 2022
    Authors
    Sarah Price
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We include Stata syntax (dummy_dataset_create.do) that creates a panel dataset for negative binomial time series regression analyses, as described in our paper "Examining methodology to identify patterns of consulting in primary care for different groups of patients before a diagnosis of cancer: an exemplar applied to oesophagogastric cancer". We also include a sample dataset for clarity (dummy_dataset.dta), and a sample of that data in a spreadsheet (Appendix 2).

    The variables contained therein are defined as follows:

    case: binary variable for case or control status (takes a value of 0 for controls and 1 for cases).

    patid: a unique patient identifier.

    time_period: A count variable denoting the time period. In this example, 0 denotes 10 months before diagnosis with cancer, and 9 denotes the month of diagnosis with cancer.

    ncons: number of consultations per month.

    period0 to period9: 10 unique inflection point variables (one for each month before diagnosis). These are used to test which aggregation period includes the inflection point.

    burden: binary variable denoting membership of one of two multimorbidity burden groups.

    We also include two Stata do-files for analysing the consultation rate, stratified by burden group, using the maximum likelihood method (1_menbregpaper.do and 2_menbregpaper_bs.do).

    Note: In this example, for demonstration purposes we create a dataset for 10 months leading up to diagnosis. In the paper, we analyse 24 months before diagnosis. Here, we study consultation rates over time, but the method could be used to study any countable event, such as number of prescriptions.
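
    The distributed construction script is Stata (dummy_dataset_create.do). As a purely hypothetical R analogue of the panel structure defined above (variable names taken from the list above; the generating model is invented for illustration):

        # Toy panel matching the documented layout: one row per patient-month
        library(MASS)  # for glm.nb (negative binomial regression)
        set.seed(1)

        n_patients <- 100
        panel <- expand.grid(patid = 1:n_patients, time_period = 0:9)
        panel$case   <- as.integer(panel$patid <= n_patients / 2)  # 0 = control, 1 = case
        panel$burden <- as.integer(panel$patid %% 2 == 0)          # two burden groups
        panel$ncons  <- rpois(nrow(panel), 1 + 0.3 * panel$case * panel$time_period)

        # One indicator per candidate inflection point (period0 ... period9);
        # this cut-point coding is an assumption, not the paper's definition
        for (k in 0:9) panel[[paste0("period", k)]] <- as.integer(panel$time_period >= k)

        # Negative binomial model of consultation rates over time
        fit <- glm.nb(ncons ~ case * factor(time_period), data = panel)
        summary(fit)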

  8. Getty-Images-Sample-Dataset

    • huggingface.co
    Updated Sep 6, 2024
    Cite
    Getty Images (2024). Getty-Images-Sample-Dataset [Dataset]. https://huggingface.co/datasets/GettyImages/Getty-Images-Sample-Dataset
    Explore at:
    Dataset updated
    Sep 6, 2024
    Dataset authored and provided by
    Getty Images (http://gettyimages.com/)
    License

    Other: https://choosealicense.com/licenses/other/

    Description

    Use Getty Images content to build or enhance your machine learning or artificial intelligence capabilities.

    With nearly 30 years of visual expertise, Getty Images is the world’s foremost visual expert. Focused on identifying cultural shifts, spearheading trends, and powering the creative economy, Getty Images can provide you with the data you need to train your models. This sample dataset includes 3,750 images from 15 categories including Abstracts & Backgrounds, Built Environments… See the full description on the dataset page: https://huggingface.co/datasets/GettyImages/Getty-Images-Sample-Dataset.

  9. VA Personal Health Record Sample Data

    • catalog.data.gov
    • datahub.va.gov
    • +4more
    Updated Aug 2, 2025
    Cite
    Department of Veterans Affairs (2025). VA Personal Health Record Sample Data [Dataset]. https://catalog.data.gov/dataset/va-personal-health-record-sample-data
    Explore at:
    Dataset updated
    Aug 2, 2025
    Dataset provided by
    United States Department of Veterans Affairs (http://va.gov/)
    Description

    My HealtheVet (www.myhealth.va.gov) is a Personal Health Record portal designed to improve the delivery of health care services to Veterans, to promote health and wellness, and to engage Veterans as more active participants in their health care. The My HealtheVet portal enables Veterans to create and maintain a web-based PHR that provides access to patient health education information and resources, a comprehensive personal health journal, and electronic services such as online VA prescription refill requests and Secure Messaging. Veterans can visit the My HealtheVet website and self-register to create an account, although registration is not required to view the professionally-sponsored health education resources, including topics of special interest to the Veteran population. Once registered, Veterans can create a customized PHR that is accessible from any computer with Internet access.

  10. Straight Fork Road Bridge Construction Project

    • catalog.data.gov
    • s.cnmilf.com
    Updated Nov 25, 2025
    + more versions
    Cite
    National Park Service (2025). Straight Fork Road Bridge Construction Project [Dataset]. https://catalog.data.gov/dataset/straight-fork-road-bridge-construction-project
    Explore at:
    Dataset updated
    Nov 25, 2025
    Dataset provided by
    National Park Service (http://www.nps.gov/)
    Description

    This data package was created 2024-10-17 17:52:53 by NPSTORET and includes selected project, location, and result data. Data are from monitoring conducted to assess the potential impacts of the Straight Fork Road Bridge Construction Project in Great Smoky Mountains National Park. The Straight Fork road ford was replaced with a bridge in early 2006. The purpose of this project was to determine whether water quality during bridge construction (March 2006 through August 2006) was significantly different from water quality prior to construction (October 2004 through September 2005).

    Data contained in the Great Smoky Mountains National Park NPSTORET back-end file (GRSM_NPSTORET_BE_20240510_20240510_1245.ACCDB) were filtered to include:

    • Organization: GRSM (Great Smoky Mountains National Park)
    • Project: GRSM_SF (Straight Fork Road Bridge Construction Project)
    • Station: include trip QC and all station visit results
    • Value status: Accepted or Certified (exported as Final), or Final

    The data package is organized into five data tables:

    • Projects.csv: describes the purpose and background of the monitoring efforts
    • Locations.csv: documents the attributes of the monitoring locations/stations
    • Results.csv: contains the field measurements, observations, and/or lab analyses for each sample/event/data grouping
    • HUC.csv: enumerates the domain of allowed values for 8-digit and 12-digit hydrologic unit codes utilized by the Locations data table
    • Characteristics.csv: enumerates the domain of characteristics available in NPSTORET to identify what was sampled, measured, or observed in Results

    Period of record for the filtered data is 2004-10-27 to 2004-10-27. This data package is a snapshot in time of one National Park Service project. The most current data for this project, which may be more or less extensive than that in this data package, can be found on the Water Quality Portal at: https://www.waterqualitydata.us/data/Result/search?project=GRSM_SF
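
    Since the package ships as flat CSV tables, a typical first step is joining results to their stations. A hedged sketch (file names from the list above; the join key column is an assumption, as the real field names live in the package's dictionary tables):

        # Read two of the package tables and attach station attributes to results
        locations <- read.csv("Locations.csv")
        results   <- read.csv("Results.csv")

        # Assumed shared key; check the package documentation for the actual
        # column names before relying on this join
        merged <- merge(results, locations, by = "LocationID")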

  11. d

    Replication Data for: Revisiting 'The Rise and Decline' in a Population of...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 22, 2023
    Cite
    TeBlunthuis, Nathan; Aaron Shaw; Benjamin Mako Hill (2023). Replication Data for: Revisiting 'The Rise and Decline' in a Population of Peer Production Projects [Dataset]. http://doi.org/10.7910/DVN/SG3LP1
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    TeBlunthuis, Nathan; Aaron Shaw; Benjamin Mako Hill
    Description

    This archive contains code and data for reproducing the analysis for "Replication Data for Revisiting 'The Rise and Decline' in a Population of Peer Production Projects". Depending on what you hope to do with the data, you probably do not want to download all of the files, and depending on your computational resources you may not be able to run all stages of the analysis. The code for all stages of the analysis, including typesetting the manuscript and running the analysis, is in code.tar. If you only want to run the final analysis or to play with the datasets used in the analysis of the paper, you want intermediate_data.7z or the uncompressed tab and csv files.

    The data files are created in a four-stage process. The first stage uses the program "wikiq" to parse mediawiki xml dumps and create tsv files that have edit data for each wiki. The second stage generates the all.edits.RDS file, which combines these tsvs into a dataset of edits from all the wikis; this file is expensive to generate and, at 1.5GB, pretty big. The third stage builds smaller intermediate files that contain the analytical variables from these tsv files. The fourth stage uses the intermediate files to generate smaller RDS files that contain the results. Finally, knitr and latex typeset the manuscript. A stage will only run if the outputs from the previous stages do not exist, so if the intermediate files exist they will not be regenerated and only the final analysis will run. The exception is that stage 4, fitting models and generating plots, always runs. If you only want to replicate from the second stage onward, you want wikiq_tsvs.7z. If you want to replicate everything, you want wikia_mediawiki_xml_dumps.7z.001, wikia_mediawiki_xml_dumps.7z.002, and wikia_mediawiki_xml_dumps.7z.003.

    These instructions work backwards from building the manuscript using knitr, loading the datasets, and running the analysis, to building the intermediate datasets.

    Building the manuscript using knitr: This requires working latex, latexmk, and knitr installations. Depending on your operating system you might install these packages in different ways; on Debian Linux you can run apt install r-cran-knitr latexmk texlive-latex-extra. Alternatively, you can upload the necessary files to a project on Overleaf.com. Download code.tar, which has everything you need to typeset the manuscript, and unpack it (on a unix system: tar xf code.tar). Navigate to code/paper_source and install the R dependencies: in R, run install.packages(c("data.table","scales","ggplot2","lubridate","texreg")). On a unix system you should then be able to run make to build the manuscript generalizable_wiki.pdf; otherwise, try uploading all of the files (including the tables, figure, and knitr folders) to a new project on Overleaf.com.

    Loading intermediate datasets: The intermediate datasets are found in the intermediate_data.7z archive. They can be extracted on a unix system using the command 7z x intermediate_data.7z; the files are 95MB uncompressed. These are RDS (R data set) files and can be loaded in R using readRDS, for example newcomer.ds <- readRDS("newcomers.RDS"). If you wish to work with these datasets using a tool other than R, you might prefer to work with the .tab files.

    Running the analysis: Fitting the models may not work on machines with less than 32GB of RAM. If you have trouble, you may find the functions in lib-01-sample-datasets.R useful to create stratified samples of data for fitting models; see line 89 of 02_model_newcomer_survival.R for an example. Download code.tar and intermediate_data.7z to your working folder and extract both archives (on a unix system: tar xf code.tar && 7z x intermediate_data.7z). Install the R dependencies: install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2")). On a unix system you can then simply run regen.all.sh to fit the models, build the plots, and create the RDS files.

    Generating datasets, building the intermediate files: The intermediate files are generated from all.edits.RDS. This process requires about 20GB of memory. Download all.edits.RDS, userroles_data.7z, selected.wikis.csv, and code.tar. Unpack code.tar and userroles_data.7z (on a unix system: tar xf code.tar && 7z x userroles_data.7z). Install the R dependencies as above, then run 01_build_datasets.R.

    Building all.edits.RDS: The intermediate RDS files used in the analysis are created from all.edits.RDS. To replicate building all.edits.RDS, you only need to run 01_build_datasets.R when the int... Visit https://dataone.org/datasets/sha256%3Acfa4980c107154267d8eb6dc0753ed0fde655a73a062c0c2f5af33f237da3437 for complete metadata about this dataset.
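
    Condensing the final-stage replication steps above into one R session (script and file names are quoted from the archive's own instructions):

        # Final-stage replication: intermediate data only, no xml dumps needed
        install.packages(c("data.table", "ggplot2", "urltools", "texreg",
                           "optimx", "lme4", "bootstrap", "scales", "effects",
                           "lubridate", "devtools", "roxygen2"))

        # After extracting code.tar and intermediate_data.7z:
        newcomer.ds <- readRDS("newcomers.RDS")   # one intermediate dataset

        # Rebuilding everything instead (needs ~20GB RAM and the raw inputs):
        # source("01_build_datasets.R")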

  12. h

    sample-create-dataset

    • huggingface.co
    Updated Nov 5, 2025
    + more versions
    Cite
    Noel JP (2025). sample-create-dataset [Dataset]. https://huggingface.co/datasets/Noel-997/sample-create-dataset
    Explore at:
    Dataset updated
    Nov 5, 2025
    Authors
    Noel JP
    Description

    Noel-997/sample-create-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  13. OER sample data-set

    • data.uni-hannover.de
    csv
    Updated Jan 20, 2022
    + more versions
    Cite
    L3S (2022). OER sample data-set [Dataset]. https://data.uni-hannover.de/dataset/oer-sample-data-set
    Explore at:
    Available download formats: csv
    Dataset updated
    Jan 20, 2022
    Dataset authored and provided by
    L3S
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    This data-set includes information about a sample of 8,887 Open Educational Resources (OERs) from the SkillsCommons website. It contains the title, description, URL, type, availability date, issued date, subjects, and the availability of the following metadata: level, time_required to finish, and accessibility.

    This data-set has been used to build a metadata scoring and quality prediction model for OERs.

  14. 80K+ Construction Site Images | AI Training Data | Machine Learning (ML) data | Object & Scene Detection | Global Coverage

    • datarade.ai
    + more versions
    Cite
    Data Seeds, 80K+ Construction Site Images | AI Training Data | Machine Learning (ML) data | Object & Scene Detection | Global Coverage [Dataset]. https://datarade.ai/data-products/50k-construction-site-images-ai-training-data-machine-le-data-seeds
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset authored and provided by
    Data Seeds
    Area covered
    Guatemala, Russian Federation, Senegal, United Arab Emirates, Swaziland, Venezuela (Bolivarian Republic of), Kenya, Tunisia, Peru, Grenada
    Description

    This dataset features over 80,000 high-quality images of construction sites sourced from photographers worldwide. Built to support AI and machine learning applications, it delivers richly annotated and visually diverse imagery capturing real-world construction environments, machinery, and processes.

    Key Features:

    1. Comprehensive Metadata: the dataset includes full EXIF data such as aperture, ISO, shutter speed, and focal length. Each image is annotated with construction phase, equipment types, safety indicators, and human activity context, making it ideal for object detection, site monitoring, and workflow analysis. Popularity metrics based on performance on our proprietary platform are also included.

    2. Unique Sourcing Capabilities: images are collected through a proprietary gamified platform, with competitions focused on industrial, construction, and labor themes. Custom datasets can be generated within 72 hours to target specific scenarios, such as building types, stages (excavation, framing, finishing), regions, or safety compliance visuals.

    3. Global Diversity: sourced from contributors in over 100 countries, the dataset reflects a wide range of construction practices, materials, climates, and regulatory environments. It includes residential, commercial, industrial, and infrastructure projects from both urban and rural areas.

    4. High-Quality Imagery: includes a mix of wide-angle site overviews, close-ups of tools and equipment, drone shots, and candid human activity. Resolution varies from standard to ultra-high-definition, supporting both macro and contextual analysis.

    5. Popularity Scores: each image is assigned a popularity score based on its performance in GuruShots competitions. These scores provide insight into visual clarity, engagement value, and human interest, useful for safety-focused or user-facing AI models.

    6. AI-Ready Design: this dataset is structured for training models in real-time object detection (e.g., helmets, machinery), construction progress tracking, material identification, and safety compliance. It’s compatible with standard ML frameworks used in construction tech.

    7. Licensing & Compliance: fully compliant with privacy, labor, and workplace imagery regulations. Licensing is transparent and ready for commercial or research deployment.

    Use Cases:

    1. Training AI for safety compliance monitoring and PPE detection.
    2. Powering progress tracking and material usage analysis tools.
    3. Supporting site mapping, autonomous machinery, and smart construction platforms.
    4. Enhancing augmented reality overlays and digital twin models for construction planning.

    This dataset provides a comprehensive, real-world foundation for AI innovation in construction technology, safety, and operational efficiency. Custom datasets are available on request. Contact us to learn more!

  15. Current Population Survey (CPS)

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Damico, Anthony (2023). Current Population Survey (CPS) [Dataset]. http://doi.org/10.7910/DVN/AK4FDD
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Damico, Anthony
    Description

    analyze the current population survey (cps) annual social and economic supplement (asec) with r. the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948. wow. the us census bureau and the bureau of labor statistics (bls) tag-team on this one. until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents. this provides enough sample to conduct state- and a few large metro area-level analyses. your sample size will vanish if you start investigating subgroups by state - consider pooling multiple years. county-level is a no-no.

    despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency. aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be treated as point-in-time statistics. cps-asec generalizes to the united states non-institutional, non-active duty military population.

    the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person). to import these files into r, the parse.SAScii function uses nber's sas code to determine how to import the fixed-width file, then RSQLite to put everything into a schnazzy database. you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show.

    this new github repository contains three scripts:

    2005-2012 asec - download all microdata.R: download the fixed-width file containing household, family, and person records; import by separating this file into three tables, then merge 'em together at the person-level; download the fixed-width file containing the person-level replicate weights; merge the rectangular person-level file with the replicate weights, then store it in a sql database; create a new variable - one - in the data table.

    2012 asec - analysis examples.R: connect to the sql database created by the 'download all microdata' program; create the complex sample survey object, using the replicate weights; perform a boatload of analysis examples.

    replicate census estimates - 2011.R: connect to the sql database created by the 'download all microdata' program; create the complex sample survey object, using the replicate weights; match the sas output shown in the png file 2011 asec replicate weight sas output.png (statistic and standard error generated from the replicate-weighted example sas script contained in this census-provided person replicate weights usage instructions document).

    click here to view these three scripts. for more detail about the current population survey - annual social and economic supplement (cps-asec), visit: the census bureau's current population survey page, the bureau of labor statistics' current population survey page, and the current population survey's wikipedia article.

    notes: interviews are conducted in march about experiences during the previous year. the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011. when you use the current population survey to talk about america, subtract a year from the data file name. as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research.

    confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D
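
    a hedged sketch of that import pattern (file names here are placeholders, not the actual nber paths; the repository's download-automation script is the real deal):

        # Parse NBER's SAS importation script to learn the fixed-width layout,
        # read the microdata, then stash it in SQLite so RAM stays sane.
        library(SAScii)
        library(RSQLite)

        layout <- parse.SAScii("cpsmar2012.sas")   # column positions/widths from the SAS code
        asec   <- read.SAScii("asec2012_pubuse.dat", "cpsmar2012.sas")

        db <- dbConnect(SQLite(), "cps.asec.db")
        dbWriteTable(db, "asec12", asec)
        dbDisconnect(db)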

  16. Sampling site information, well construction details, and data dictionaries for data sets associated with the National Crude Oil Spill Fate and Natural Attenuation Site near Bemidji, Minnesota (ver. 4.0, September 2025)

    • catalog.data.gov
    • data.usgs.gov
    Updated Nov 21, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Sampling site information, well construction details, and data dictionaries for data sets associated with the National Crude Oil Spill Fate and Natural Attenuation Site near Bemidji, Minnesota (ver. 4.0, September 2025) [Dataset]. https://catalog.data.gov/dataset/sampling-site-information-well-construction-details-and-data-dictionaries-for-data-sets-as
    Explore at:
    Dataset updated
    Nov 21, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Minnesota, Bemidji
    Description

    This U.S. Geological Survey data release provides detailed sampling site information, hole and well construction details, and data dictionaries necessary to interpret historical and future physical, chemical, and biological data sets derived from samples collected and measurements made in association with the National Crude Oil Spill Fate and Natural Attenuation Research Site. In 1979, a high-pressure pipeline carrying crude oil burst near the city of Bemidji, Minnesota and spilled approximately 1.7 million liters (10,700 barrels) of crude oil into glacial outwash deposits (Essaid and others, 2011). Since 1983, scientists with the U.S. Geological Survey, in collaboration with scientists from academic institutions, industry, and the regulatory community have conducted extensive investigations of multiphase flow and transport, volatilization, dissolution, geochemical interactions, microbial populations, and biodegradation with the goal of providing an improved understanding of the natural processes limiting the extent of hydrocarbon contamination. Long-term field studies at Bemidji have illustrated that the fate of hydrocarbons evolves with time, and a snap-shot study of a hydrocarbon plume may not provide information that is of relevance to the long-term behavior of the plume during natural attenuation. The research at the site has been supported primarily by the U.S. Geological Survey's Toxic Substances Hydrology Program.

  17. Building sampling and labeling dataset of UAV images in rural China

    • scidb.cn
    Updated Jun 8, 2022
    Cite
    Liu Yaohui; Yang Xinyue; Li Jiahe; Cheng Hao; Fan Xiwei; Zhang Haoyu; Li Xiaoli; Qi Wenhua; Li Zhiqiang; Nie Gaozhong; Xu Nan; Fu Bo; Yao Guobiao; Yu Mingyang; Cui Jian; Meng Fei; Jin Fengxiang (2022). Building sampling and labeling dataset of UAV images in rural China [Dataset]. http://doi.org/10.11922/sciencedb.j00001.00226
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 8, 2022
    Dataset provided by
    Science Data Bank
    Authors
    Liu Yaohui; Yang Xinyue; Li Jiahe; Cheng Hao; Fan Xiwei; Zhang Haoyu; Li Xiaoli; Qi Wenhua; Li Zhiqiang; Nie Gaozhong; Xu Nan; Fu Bo; Yao Guobiao; Yu Mingyang; Cui Jian; Meng Fei; Jin Fengxiang
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    China
    Description

    Rural buildings are one of the most important means to observe rural land changes and economic development. In an agricultural country like China, timely and accurate extraction of rural buildings from high-resolution remote sensing images is crucial to rural development and rural planning. With recent advancements in computer vision and computing capabilities, deep learning has achieved considerable success in applications such as building extraction, thanks to its automatic feature learning and strong applicability. Deep learning usually requires large amounts of training data. At present, the datasets commonly used in deep learning to identify buildings are mainly international open-source building datasets, including Massachusetts, INRIA, WHU, etc. These datasets are based on foreign buildings and lack building samples that are open-source, high-precision, widely covered, and suited to the architectural styles of rural areas in China. Here, we propose an open-resource dataset named "Building sampling and labeling dataset of UAV images in rural China". This dataset is based on unmanned aerial vehicle (UAV) images collected in Weinan (Shaanxi), Huai’an (Jiangsu), Kangding (Sichuan), Shanwei (Guangdong), Huizhou (Guangdong), Atushi (Xinjiang), Songyuan (Jilin), and other rural areas in China from 2017 to 2020. It has high spatial resolution and can represent the characteristics of buildings in rural China. It can be applied to building extraction using deep learning methods and combined with further research for spatial analysis. Furthermore, it is of great significance for rural development and the Beautiful Countryside Construction initiative in China.

  18. Health and Retirement Study (HRS)

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Damico, Anthony (2023). Health and Retirement Study (HRS) [Dataset]. http://doi.org/10.7910/DVN/ELEKOY
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Damico, Anthony
    Description

    analyze the health and retirement study (hrs) with r. the hrs is the one and only longitudinal survey of american seniors. with a panel starting its third decade, the current pool of respondents includes older folks who have been interviewed every two years as far back as 1992. unlike cross-sectional or shorter panel surveys, respondents keep responding until, well, death do us part. paid for by the national institute on aging and administered by the university of michigan's institute for social research. if you apply for an interviewer job with them, i hope you like werther's original.

    figuring out how to analyze this data set might trigger your fight-or-flight synapses if you just start clicking around on michigan's website. instead, read pages numbered 10-17 (pdf pages 12-19) of this introduction pdf and don't touch the data until you understand figure a-3 on that last page. if you start enjoying yourself, here's the whole book. after that, it's time to register for access to the (free) data. keep your username and password handy, you'll need it for the top of the download automation r script.

    next, look at this data flowchart to get an idea of why the data download page is such a righteous jungle. but wait, good news: umich recently farmed out its data management to the rand corporation, who promptly constructed a giant consolidated file with one record per respondent across the whole panel. oh so beautiful. the rand hrs files make much of the older data and syntax examples obsolete, so when you come across stuff like instructions on how to merge years, you can happily ignore them - rand has done it for you.

    the health and retirement study only includes noninstitutionalized adults when new respondents get added to the panel (as they were in 1992, 1993, 1998, 2004, and 2010) but once they're in, they're in - respondents have a weight of zero for interview waves when they were nursing home residents; but they're still responding and will continue to contribute to your statistics so long as you're generalizing about a population from a previous wave (for example: it's possible to compute "among all americans who were 50+ years old in 1998, x% lived in nursing homes by 2010"). my source for that 411? page 13 of the design doc. wicked.

    this new github repository contains five scripts:

    1992 - 2010 download HRS microdata.R: loop through every year and every file; download, then unzip everything in one big party.

    import longitudinal RAND contributed files.R: create a SQLite database (.db) on the local disk; load the rand, rand-cams, and both rand-family files into the database (.db) in chunks (to prevent overloading ram).

    longitudinal RAND - analysis examples.R: connect to the sql database created by the 'import longitudinal RAND contributed files' program; create two database-backed complex sample survey objects, using a taylor-series linearization design; perform a mountain of analysis examples with wave weights from two different points in the panel.

    import example HRS file.R: load a fixed-width file using only the sas importation script directly into ram with SAScii; parse through the IF block at the bottom of the sas importation script, blank out a number of variables; save the file as an R data file (.rda) for fast loading later.

    replicate 2002 regression.R: connect to the sql database created by the 'import longitudinal RAND contributed files' program; create a database-backed complex sample survey object, using a taylor-series linearization design; exactly match the final regression shown in this document provided by analysts at RAND as an update of the regression on pdf page B76 of this document.

    click here to view these five scripts. for more detail about the health and retirement study (hrs), visit: michigan's hrs homepage, rand's hrs homepage, the hrs wikipedia page, and a running list of publications using hrs.

    notes: exemplary work making it this far. as a reward, here's the detailed codebook for the main rand hrs file. note that rand also creates 'flat files' for every survey wave, but really, most every analysis you can think of is possible using just the four files imported with the rand importation script above. if you must work with the non-rand files, there's an example of how to import a single hrs (umich-created) file, but if you wish to import more than one, you'll have to write some for loops yourself.

    confidential to sas, spss, stata, and sudaan users: a tidal wave is coming. you can get water up your nose and be dragged out to sea, or you can grab a surf board. time to transition to r. :D
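
    a hedged sketch of that chunked-load idea (the rand file is really distributed in sas/spss/stata formats; the csv name and chunk size below are placeholders):

        # Append a big flat file to SQLite in chunks to avoid overloading RAM
        library(RSQLite)

        db  <- dbConnect(SQLite(), "hrs.db")
        con <- file("rand_hrs.csv", "r")                  # assumed CSV export
        header <- strsplit(readLines(con, n = 1), ",")[[1]]

        repeat {
          lines <- readLines(con, n = 50000)              # one chunk of rows
          if (length(lines) == 0) break
          chunk <- read.csv(text = lines, header = FALSE, col.names = header)
          dbWriteTable(db, "rand_hrs", chunk, append = TRUE)
        }

        close(con)
        dbDisconnect(db)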

  19. UCI and OpenML Data Sets for Ordinal Quantification

    • data.niaid.nih.gov
    • zenodo.org
    • +1more
    Updated Jul 25, 2023
    + more versions
    Cite
    Bunse, Mirko; Moreo, Alejandro; Sebastiani, Fabrizio; Senz, Martin (2023). UCI and OpenML Data Sets for Ordinal Quantification [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8177301
    Explore at:
    Dataset updated
    Jul 25, 2023
    Dataset provided by
    TU Dortmund University
    Consiglio Nazionale delle Ricerche
    Authors
    Bunse, Mirko; Moreo, Alejandro; Sebastiani, Fabrizio; Senz, Martin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.

    With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.

    We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.

    Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.

    Usage

    You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.

    Preliminaries: You need a working Julia installation. We used Julia v1.6.5 in our experiments.

    Data Extraction: In your terminal, you can call either

    make

    (recommended), or

    julia --project="." --eval "using Pkg; Pkg.instantiate()"
    julia --project="." extract-oq.jl

    Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.
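
    The authors' tooling is Julia, but replaying a sample from the index files is straightforward in any language. A hedged R sketch (the index file name comes from the description above; the extracted data file name and whether the stored indices are zero-based are assumptions to verify):

        # Draw one evaluation sample exactly as specified by the index files
        data <- read.csv("app_val.csv")              # placeholder name for an extracted CSV
        idx  <- read.csv("app_val_indices.csv", header = FALSE)

        items    <- unlist(idx[1, ])                 # row 1 = first sample
        sample_1 <- data[items + 1, ]                # +1 if indices are zero-based

        # The quantification target: the label distribution of the sample
        prop.table(table(sample_1$class_label))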

    Further Reading

    Implementation of our experiments: https://github.com/mirkobunse/regularized-oq

  20. Sample Power BI Data

    • kaggle.com
    zip
    Updated Oct 2, 2022
    Cite
    AmitRaghav007 (2022). Sample Power BI Data [Dataset]. https://www.kaggle.com/datasets/amitraghav007/us-store-data
    Explore at:
    Available download formats: zip (1031090 bytes)
    Dataset updated
    Oct 2, 2022
    Authors
    AmitRaghav007
    Description

    Dataset

    This dataset was created by AmitRaghav007

    Contents

    E-commerce website data for building reports.
