100+ datasets found

B
Data Cleaning Sample
borealisdata.ca
dataone.org
Updated Jul 13, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.5683/SP3/ZCN177
Dataset updated
Jul 13, 2023
Dataset provided by
Borealis
Authors
Rong Luo
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Sample data for exercises in Further Adventures in Data Cleaning.
Sample Data 1
kaggle.com
zip
Updated Jul 7, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Aizat Mokhtar (2021). Sample Data 1 [Dataset]. https://www.kaggle.com/datasets/aizatmokhtar/sample-data-1/data
Explore at:
zip(309 bytes)Available download formats
Dataset updated
Jul 7, 2021
Authors
Aizat Mokhtar
Description
Dataset

This dataset was created by Aizat Mokhtar

Contents
h
RedPajama-Data-1T-Sample
huggingface.co
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Together, RedPajama-Data-1T-Sample [Dataset]. https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T-Sample
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset authored and provided by
Together
Description
RedPajama is a clean-room, fully open-source implementation of the LLaMa dataset. This is a 1B-token sample of the full dataset.
Genomics examples
redivis.com
Updated Feb 8, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Redivis Demo Organization (2025). Genomics examples [Dataset]. https://redivis.com/datasets/yz1s-d09009dbb
Explore at:
Dataset updated
Feb 8, 2025
Dataset provided by
Redivis Inc.
Authors
Redivis Demo Organization
Time period covered
Jan 30, 2025
Description
This is an auto-generated index table corresponding to a folder of files in this dataset with the same name. This table can be used to extract a subset of files based on their metadata, which can then be used for further analysis. You can view the contents of specific files by navigating to the "cells" tab and clicking on an individual file_id.
UCI and OpenML Data Sets for Ordinal Quantification
zenodo.org
zip
Updated Jul 25, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mirko Bunse; Mirko Bunse; Alejandro Moreo; Alejandro Moreo; Fabrizio Sebastiani; Fabrizio Sebastiani; Martin Senz; Martin Senz (2023). UCI and OpenML Data Sets for Ordinal Quantification [Dataset]. http://doi.org/10.5281/zenodo.8177302
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.8177302
Dataset updated
Jul 25, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Mirko Bunse; Mirko Bunse; Alejandro Moreo; Alejandro Moreo; Fabrizio Sebastiani; Fabrizio Sebastiani; Martin Senz; Martin Senz
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.

With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.

We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.

Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.

Usage

You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.

Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.

Data Extraction: In your terminal, you can call either

make

(recommended), or

julia --project="." --eval "using Pkg; Pkg.instantiate()" julia --project="." extract-oq.jl

Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.

Further Reading

Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
d
Data from: Database for the U.S. Geological Survey Woods Hole Science...
catalog.data.gov
data.usgs.gov
+1more
Updated Jul 6, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2024). Database for the U.S. Geological Survey Woods Hole Science Center's marine sediment samples, including locations, sample data and collection information (SED_ARCHIVE) [Dataset]. https://catalog.data.gov/dataset/database-for-the-u-s-geological-survey-woods-hole-science-centers-marine-sediment-samples-
Explore at:
Dataset updated
Jul 6, 2024
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Area covered
Woods Hole
Description
The U.S. Geological Survey (USGS), Woods Hole Science Center (WHSC) has been an active member of the Woods Hole research community for over 40 years. In that time there have been many sediment collection projects conducted by USGS scientists and technicians for the research and study of seabed environments and processes. These samples are collected at sea or near shore and then brought back to the WHSC for study. While at the Center, samples are stored in ambient temperature, cold or freezing conditions, depending on the best mode of preparation for the study being conducted or the duration of storage planned for the samples. Recently, storage methods and available storage space have become a major concern at the WHSC. The shapefile sed_archive.shp, gives a geographical view of the samples in the WHSC's collections, and where they were collected along with images and hyperlinks to useful resources.
Data sample
kaggle.com
zip
Updated Sep 23, 2017
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
HungDo (2017). Data sample [Dataset]. https://www.kaggle.com/datasets/hungdo1291/hung-data-full/suggestions
Explore at:
zip(2642308 bytes)Available download formats
Dataset updated
Sep 23, 2017
Authors
HungDo
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Dataset

This dataset was created by HungDo

Released under CC0: Public Domain

Contents
H
Political Analysis Using R: Example Code and Data, Plus Data for Practice...
dataverse.harvard.edu
search.dataone.org
Updated Apr 28, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Political Analysis Using R: Example Code and Data, Plus Data for Practice Problems [Dataset]. https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/ARKOTI
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/ARKOTI
Dataset updated
Apr 28, 2020
Dataset provided by
Harvard Dataverse
Authors
Jamie Monogan
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Each R script replicates all of the example code from one chapter from the book. All required data for each script are also uploaded, as are all data used in the practice problems at the end of each chapter. The data are drawn from a wide array of sources, so please cite the original work if you ever use any of these data sets for research purposes.
f
Data_Sheet_1_Raw Data Visualization for Common Factorial Designs Using SPSS:...
frontiersin.figshare.com
zip
Updated Jun 2, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Florian Loffing (2023). Data_Sheet_1_Raw Data Visualization for Common Factorial Designs Using SPSS: A Syntax Collection and Tutorial.ZIP [Dataset]. http://doi.org/10.3389/fpsyg.2022.808469.s001
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.3389/fpsyg.2022.808469.s001
Dataset updated
Jun 2, 2023
Dataset provided by
Frontiers
Authors
Florian Loffing
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Transparency in data visualization is an essential ingredient for scientific communication. The traditional approach of visualizing continuous quantitative data solely in the form of summary statistics (i.e., measures of central tendency and dispersion) has repeatedly been criticized for not revealing the underlying raw data distribution. Remarkably, however, systematic and easy-to-use solutions for raw data visualization using the most commonly reported statistical software package for data analysis, IBM SPSS Statistics, are missing. Here, a comprehensive collection of more than 100 SPSS syntax files and an SPSS dataset template is presented and made freely available that allow the creation of transparent graphs for one-sample designs, for one- and two-factorial between-subject designs, for selected one- and two-factorial within-subject designs as well as for selected two-factorial mixed designs and, with some creativity, even beyond (e.g., three-factorial mixed-designs). Depending on graph type (e.g., pure dot plot, box plot, and line plot), raw data can be displayed along with standard measures of central tendency (arithmetic mean and median) and dispersion (95% CI and SD). The free-to-use syntax can also be modified to match with individual needs. A variety of example applications of syntax are illustrated in a tutorial-like fashion along with fictitious datasets accompanying this contribution. The syntax collection is hoped to provide researchers, students, teachers, and others working with SPSS a valuable tool to move towards more transparency in data visualization.
d
Data from: Sample DataSet.
datadiscoverystudio.org
csv, json, rdf, xml
Updated Feb 4, 2018
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2018). Sample DataSet. [Dataset]. http://datadiscoverystudio.org/geoportal/rest/metadata/item/34b1d00ff032454bb8f6be1ff2acaa5c/html
Explore at:
xml, rdf, csv, jsonAvailable download formats
Dataset updated
Feb 4, 2018
Description
description: For testing purposes only; abstract: For testing purposes only
Historic HPMS Data (Sample) - 1992
data.virginia.gov
data.transportation.gov
+1more
csv, json, rdf, xsl
Updated May 8, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S Department of Transportation (2024). Historic HPMS Data (Sample) - 1992 [Dataset]. https://data.virginia.gov/dataset/historic-hpms-data-sample-1992
Explore at:
xsl, rdf, json, csvAvailable download formats
Dataset updated
May 8, 2024
Dataset provided by
Federal Highway Administrationhttps://highways.dot.gov/
Authors
U.S Department of Transportation
Description
Historic Highway Performance Monitoring System sample data for the year 1992
o
SQLite SpatiaLite sample databases for testing spatial analysis function and...
osf.io
Updated Feb 24, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mateusz Ilba (2020). SQLite SpatiaLite sample databases for testing spatial analysis function and performance of database [Dataset]. http://doi.org/10.17605/OSF.IO/2YM95
Explore at:
Unique identifier
https://doi.org/10.17605/OSF.IO/2YM95
Dataset updated
Feb 24, 2020
Dataset provided by
Center For Open Science
Authors
Mateusz Ilba
Description
Databases (for SQLite SpatiaLite) were created from publicly available OpenStreetMap data for Poland (https://www.openstreetmap.org/copyright). The db_small database comprises data for the area of the city of Kraków in the Małopolskie Province. The db_medium database comprises data from the entire Małopolskie Province. The db_large database, in addition to the Małopolskie Province, covers the Podkarpackie and Dolnośląskie Provinces. The db_v_large database covers the entire country.
f
Data from: Evaluating Supplemental Samples in Longitudinal Research:...
tandf.figshare.com
txt
Updated Feb 9, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Laura K. Taylor; Xin Tong; Scott E. Maxwell (2024). Evaluating Supplemental Samples in Longitudinal Research: Replacement and Refreshment Approaches [Dataset]. http://doi.org/10.6084/m9.figshare.12162072.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.12162072.v1
Dataset updated
Feb 9, 2024
Dataset provided by
Taylor & Francis
Authors
Laura K. Taylor; Xin Tong; Scott E. Maxwell
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Despite the wide application of longitudinal studies, they are often plagued by missing data and attrition. The majority of methodological approaches focus on participant retention or modern missing data analysis procedures. This paper, however, takes a new approach by examining how researchers may supplement the sample with additional participants. First, refreshment samples use the same selection criteria as the initial study. Second, replacement samples identify auxiliary variables that may help explain patterns of missingness and select new participants based on those characteristics. A simulation study compares these two strategies for a linear growth model with five measurement occasions. Overall, the results suggest that refreshment samples lead to less relative bias, greater relative efficiency, and more acceptable coverage rates than replacement samples or not supplementing the missing participants in any way. Refreshment samples also have high statistical power. The comparative strengths of the refreshment approach are further illustrated through a real data example. These findings have implications for assessing change over time when researching at-risk samples with high levels of permanent attrition.
Living Standards Survey III 1991-1992 - World Bank SHIP Harmonized Dataset -...
datacatalog.ihsn.org
dev.ihsn.org
+2more
Updated Mar 29, 2019
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ghana Statistical Service (GSS) (2019). Living Standards Survey III 1991-1992 - World Bank SHIP Harmonized Dataset - Ghana [Dataset]. https://datacatalog.ihsn.org/catalog/2358
Explore at:
Dataset updated
Mar 29, 2019
Dataset provided by
Ghana Statistical Services
Authors
Ghana Statistical Service (GSS)
Time period covered
1991 - 1992
Area covered
Ghana
Description
Abstract

Survey based Harmonized Indicators (SHIP) files are harmonized data files from household surveys that are conducted by countries in Africa. To ensure the quality and transparency of the data, it is critical to document the procedures of compiling consumption aggregation and other indicators so that the results can be duplicated with ease. This process enables consistency and continuity that make temporal and cross-country comparisons consistent and more reliable.

Four harmonized data files are prepared for each survey to generate a set of harmonized variables that have the same variable names. Invariably, in each survey, questions are asked in a slightly different way, which poses challenges on consistent definition of harmonized variables. The harmonized household survey data present the best available variables with harmonized definitions, but not identical variables. The four harmonized data files are

a) Individual level file (Labor force indicators in a separate file): This file has information on basic characteristics of individuals such as age and sex, literacy, education, health, anthropometry and child survival. b) Labor force file: This file has information on labor force including employment/unemployment, earnings, sectors of employment, etc. c) Household level file: This file has information on household expenditure, household head characteristics (age and sex, level of education, employment), housing amenities, assets, and access to infrastructure and services. d) Household Expenditure file: This file has consumption/expenditure aggregates by consumption groups according to Purpose (COICOP) of Household Consumption of the UN.

Geographic coverage

National

Analysis unit

Individual level for datasets with suffix _I and _L

Household level for datasets with suffix _H and _E

Universe

The survey covered all de jure household members (usual residents).

Kind of data

Sample survey data [ssd]

Sampling procedure

A multi-stage sampling technique was used in selecting the GLSS sample. Initially, 4565 households were selected for GLSS3, spread around the country in 407 small clusters; in general, 15 households were taken in an urban cluster and 10 households in a rural cluster. The actual achieved sample was 4552 households. Because of the sample design used, and the very high response rate achieved, the sample can be considered as being selfweighting, though in the case of expenditure data weighting of the expenditure values is required.

Mode of data collection

Face-to-face [f2f]
TELL sample output data
zenodo.org
data.niaid.nih.gov
zip
Updated Mar 10, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Casey D Burleyson; Casey D Burleyson; Casey McGrath; Casey McGrath (2022). TELL sample output data [Dataset]. http://doi.org/10.5281/zenodo.6338472
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.6338472
Dataset updated
Mar 10, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Casey D Burleyson; Casey D Burleyson; Casey McGrath; Casey McGrath
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains sample output data for TELL. The sample dataset includes four years of sample future data (2039, 2059, 2079, and 2099) that comes from IM3's future WRF runs under the RCP 8.5 climate scenario with SSP5 population forcing. Note that the GCAM-USA output used in this simulation is sample data only. As such the quantitative results from this set of sample output should not be considered valid.
7+ Million Company Dataset
kaggle.com
zip
Updated May 10, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
People Data Labs (2019). 7+ Million Company Dataset [Dataset]. https://www.kaggle.com/datasets/peopledatalabssf/free-7-million-company-dataset
Explore at:
zip(291957415 bytes)Available download formats
Dataset updated
May 10, 2019
Authors
People Data Labs
Description
Dataset

This dataset was created by People Data Labs

Contents
f
Dataset for: Analyzing discrete competing risks data with partially...
wiley.figshare.com
txt
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Minjung Lee; Eric J Feuer; Zhuoqiao Wang; Hyunsoon Cho; Joe Zou; Benjamin Hankey; Angela B. Mariotto; Jason P Fine (2023). Dataset for: Analyzing discrete competing risks data with partially overlapping or independent data sources and non-standard sampling schemes, with application to cancer registries [Dataset]. http://doi.org/10.6084/m9.figshare.9795572.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.9795572.v1
Dataset updated
Jun 1, 2023
Dataset provided by
Wiley
Authors
Minjung Lee; Eric J Feuer; Zhuoqiao Wang; Hyunsoon Cho; Joe Zou; Benjamin Hankey; Angela B. Mariotto; Jason P Fine
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This paper demonstrates the flexibility of a general approach for the analysis of discrete time competing risks data that can accommodate complex data structures, different time scales for different causes, and nonstandard sampling schemes. The data may involve a single data source where all individuals contribute to analyses of both cause-specific hazard functions, overlapping datasets where some individuals contribute to the analysis of the cause-specific hazard function of only one cause while other individuals contribute to analyses of both cause-specific hazard functions, or separate data sources where each individual contributes to the analysis of the cause-specific hazard function of only a single cause. The approach is modularized into estimation and prediction. For the estimation step, the parameters and the variance-covariance matrix can be estimated using widely available software. The prediction step utilizes a generic program with plug-in estimates from the estimation step. The approach is illustrated with three prognostic models for stage IV male oral cancer using different data structures. The first model uses only men with stage IV oral cancer from population-based registry data. The second model strategically extends the cohort to improve the efficiency of the estimates. The third model improves the accuracy for those with a lower risk of other causes of death, by bringing in an independent data source collected under a complex sampling design with additional other-cause covariates. These analyses represent novel extensions of existing methodology, broadly applicable for the development of prognostic models capturing both the cancer and non-cancer aspects of a patient's health.
s
Netflow data with sampling for training - Datasets - open.scayle.es
open.scayle.es
Updated Oct 1, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2020). Netflow data with sampling for training - Datasets - open.scayle.es [Dataset]. https://open.scayle.es/dataset/netflow-data-with-sampling-for-training
Explore at:
Dataset updated
Oct 1, 2020
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Netflow traffic generated using DOROTHEA (DOcker-based fRamework fOr gaTHering nEtflow trAffic) NetFlow is a network protocol developed by Cisco for the collection and monitoring of network traffic flow data generated. A flow is defined as a unidirectional sequence of packets with some common properties that pass through a network device. Netflow flows have been captured by sampling at the packet level. A sampling means that 1 out of every X packets is selected to be flow while the rest of the packets are not valued. In the construction of the datasets, different percentages of flows considered attacks and flows considered normal traffic have been used. These datasets have been used to train machine learning models.
d
Water Quality Sampling Data
catalog.data.gov
data.austintexas.gov
+1more
Updated Mar 25, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
data.austintexas.gov (2025). Water Quality Sampling Data [Dataset]. https://catalog.data.gov/dataset/water-quality-sampling-data
Explore at:
Dataset updated
Mar 25, 2025
Dataset provided by
data.austintexas.gov
Description
Data collected to assess water quality conditions in the natural creeks, aquifers and lakes in the Austin area. This is raw data, provided directly from our Water Resources Monitoring database (WRM) and should be considered provisional. Data may or may not have been reviewed by project staff. A map of site locations can be found by searching for LOCATION.WRM_SAMPLE_SITES; you may then use those WRM_SITE_IDs to filter in this dataset using the field SAMPLE_SITE_NO.
d
HuTime Sample Dataset
data.depositar.io
gtm, png, xml
Updated Nov 30, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
ANGIS Data Portal (2021). HuTime Sample Dataset [Dataset]. https://data.depositar.io/dataset/hutime-sample-dataset
Explore at:
xml(404471), xml(404620), png(17258), png(14736), png(16654), gtm(922), xml(4423), xml(10971), png(18462), gtm(3842), png(18461), xml(1580), gtm(910), xml(12018), xml(33534), gtm(888), gtm, gtm(909), gtm(919), png(16959), xml(1752), gtm(3808), png(17058), png(18189), png(25698), xml(22664), gtm(889), png(25284), xml(36523), gtm(4358), gtm(927), png(15002), xml(3759), gtm(914), gtm(4344), png(18968), xml(11367), xml(11783), png(18636)Available download formats
Dataset updated
Nov 30, 2021
Dataset provided by
ANGIS Data Portal
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
HuTime is the Time Information System which was developed by Dr.Tatsuki Sekino. The website of Hutime is http://www.hutime.org.