This is an auto-generated index table corresponding to a folder of files in this dataset with the same name. This table can be used to extract a subset of files based on their metadata, which can then be used for further analysis. You can view the contents of specific files by navigating to the "cells" tab and clicking on an individual file_id.
The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, asset ownership). The data only includes ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for the purpose of training and simulation and is not intended to be representative of any specific country.
The full-population dataset (with about 10 million individuals) is also distributed as open data.
The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.
Household, Individual
The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.
ssd
The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In the first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). In the second stage, 25 households were randomly selected within each selected enumeration area. The R script used to draw the sample is provided as an external resource.
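The authoritative implementation is the R script mentioned above; the following is only a rough Python sketch of the same two-stage logic. The column names ea_id and urban are assumptions (geo_1 follows the stratification described above).

```python
# Rough Python sketch of the two-stage design; the authoritative
# implementation is the R script distributed with the dataset.
# Column names ea_id and urban are assumptions.
import pandas as pd

HH_PER_EA = 25
N_EAS = 8_000 // HH_PER_EA  # 320 enumeration areas in total

def draw_sample(frame: pd.DataFrame, seed: int = 1) -> pd.DataFrame:
    # Stage 1: allocate EAs to strata proportionally to stratum size
    # (stratification by geo_1 and urban/rural).
    sizes = frame.groupby(["geo_1", "urban"]).size()
    alloc = (sizes / sizes.sum() * N_EAS).round().astype(int)
    sampled = []
    for (geo, urb), n_eas in alloc.items():
        stratum = frame[(frame["geo_1"] == geo) & (frame["urban"] == urb)]
        eas = pd.Series(stratum["ea_id"].unique()).sample(n_eas, random_state=seed)
        # Stage 2: 25 households at random within each selected EA
        # (assumes every EA contains at least 25 households).
        for ea in eas:
            sampled.append(stratum[stratum["ea_id"] == ea]
                           .sample(HH_PER_EA, random_state=seed))
    return pd.concat(sampled, ignore_index=True)
```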
other
The dataset is a synthetic dataset. Although the variables it contains are variables typically collected in sample surveys or population censuses, no questionnaire is available for this dataset. However, a "fake" questionnaire was created for the sample dataset extracted from this dataset, to be used as training material.
The synthetic data generation process included a set of "validators" (consistency checks based on which synthetic observations were assessed and rejected/replaced when needed). Some post-processing was also applied to the data to produce the distributed data files.
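The actual validators are not distributed with the data; as an illustration only, a consistency check of this kind might look like the following sketch (all rules and field names are hypothetical).

```python
# Hypothetical consistency checks of the kind described above; the
# rules and field names are illustrative assumptions, not the
# generator's actual validators.
def is_consistent(person: dict) -> bool:
    checks = [
        0 <= person["age"] <= 110,
        # Young children should not report completed schooling.
        person["age"] >= 5 or person["education_level"] == "none",
        # Fertility variables apply only to women of reproductive age.
        person["children_born"] == 0
        or (person["sex"] == "female" and person["age"] >= 12),
    ]
    return all(checks)

# During generation, records failing such checks would be rejected
# and replaced, as described above.
```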
This is a synthetic dataset; the "response rate" is 100%.
This file is an example data set from a drought study in the Central Valley of California, corresponding to "recent non-drought conditions" (Scenario 1 in Petrie et al., in review). In 2014, following an 8-year period with 7 below-normal to critically-dry water years, the bioenergetic model TRUEMET was used to assess the impacts of drought on wintering waterfowl habitat and bioenergetics in the Central Valley of California. The goal of the study was to assess whether available foraging habitats could provide enough food to support waterfowl populations (ducks and geese) under a variety of climate and population-level scenarios. This information could then be used by managers to adapt their waterfowl habitat management plans to drought conditions. The study area spanned the Central Valley and included the Sacramento Valley in the north, the San Joaquin Valley in the south, and the Suisun Marsh and Sacramento-San Joaquin River Delta (Delta) east of San Francisco Bay. The data set consists of two foraging guilds (ducks and geese/swans) and five forage types: harvested corn, rice (flooded), rice (unflooded), wetland invertebrates, and wetland moist soil seeds. For more background on the data set, see Petrie et al. (in review).
Attribution-NonCommercial 3.0 (CC BY-NC 3.0), https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
This dataset includes information about a sample of 8,887 Open Educational Resources (OERs) from the SkillsCommons website. It contains title, description, URL, type, availability date, issued date, subjects, and the availability of the following metadata: level, time_required to finish, and accessibility.
This dataset has been used to build a metadata scoring and quality prediction model for OERs.
Sample data for exercises in Further Adventures in Data Cleaning.
This dataset comprises a collection of example DMPs from a wide array of fields, obtained from a number of different sources outlined below. Data extracted from the examples include the discipline and field of study, author, institutional affiliation and funding information, location, date created, title, research and data type, a description of the project, a link to the DMP, and, where possible, external links to related publications or grant pages. This CSV document serves as the content for a McMaster Data Management Plan (DMP) Database as part of the Research Data Management (RDM) Services website, located at https://u.mcmaster.ca/dmps. Other universities and organizations are encouraged to link to the DMP Database or use this dataset as the content for their own DMP Database. This dataset will be updated regularly to include new additions and will be versioned as such. We are gathering submissions at https://u.mcmaster.ca/submit-a-dmp to continue to expand the collection.
MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for example-dataset
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/CoffeeDoodle/example-dataset/raw/main/pipeline.yaml"
or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/CoffeeDoodle/example-dataset.
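Besides reproducing the pipeline with the CLI, the generated data itself can presumably be loaded with the Hugging Face datasets library (the split name below is an assumption).

```python
from datasets import load_dataset

# Load the generated dataset from the Hugging Face Hub.
ds = load_dataset("CoffeeDoodle/example-dataset", split="train")  # split assumed
print(ds)
```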
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The complete dataset used in the analysis comprises 36 samples, each described by 11 numeric features and 1 target. The attributes considered were caspase 3/7 activity, MitoTracker Red CMXRos area and intensity (3 h and 24 h incubations with both compounds), MitoSOX oxidation (3 h incubation with the referred compounds) and oxidation rate, DCFDA fluorescence (3 h and 24 h incubations with either compound) and oxidation rate, and DQ BSA hydrolysis. The target of each instance corresponds to one of the 9 possible classes (4 samples per class): Control, 6.25, 12.5, 25 and 50 µM for 6-OHDA, and 0.03, 0.06, 0.125 and 0.25 µM for rotenone. The dataset is balanced, contains no missing values, and was standardized across features. The small number of samples prevented a full and robust statistical analysis of the results. Nevertheless, it allowed the identification of relevant hidden patterns and trends.
Exploratory data analysis, information gain, hierarchical clustering, and supervised predictive modeling were performed using Orange Data Mining version 3.25.1 [41]. Hierarchical clustering was performed using the Euclidean distance metric and weighted linkage. Cluster maps were plotted to relate the features with higher mutual information (in rows) to instances (in columns), with the color of each cell representing the normalized level of a particular feature in a specific instance. The information is grouped both in rows and in columns by a two-way hierarchical clustering method using Euclidean distances and average linkage. Stratified cross-validation was used to train the supervised decision tree. A set of preliminary empirical experiments was performed to choose the best parameters for each algorithm, and we verified that, within moderate variations, there were no significant changes in the outcome. The following settings were adopted for the decision tree algorithm: minimum number of samples in leaves: 2; minimum number of samples required to split an internal node: 5; stop splitting when majority reaches: 95%; criterion: gain ratio. The performance of the supervised model was assessed using accuracy, precision, recall, F-measure, and area under the ROC curve (AUC) metrics.
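The analysis was run in Orange Data Mining; as a rough scikit-learn analogue of the decision-tree settings (an approximation, since scikit-learn offers neither gain ratio nor a majority-stopping criterion), one might write the following sketch, with random placeholder data standing in for the real feature matrix.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Approximation of the Orange settings listed above. scikit-learn has
# no gain ratio criterion ("entropy", i.e. information gain, is the
# closest) and no "stop when majority reaches 95%" option.
tree = DecisionTreeClassifier(
    criterion="entropy",   # stand-in for Orange's gain ratio
    min_samples_leaf=2,    # minimum number of samples in leaves
    min_samples_split=5,   # minimum samples required to split a node
    random_state=0,
)

# Random placeholder data with the shape described above; the real
# 36 x 11 standardized matrix would be loaded from the dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(36, 11))
y = np.repeat(np.arange(9), 4)  # 9 balanced classes, 4 samples each

# Stratified cross-validation as in the original analysis (fold count assumed).
scores = cross_val_score(tree, X, y, cv=StratifiedKFold(n_splits=4))
print(scores.mean())
```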
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data repository contains sample datasets of raw DInSAR time series (NSBAS_PARAMS.h5), raw, interpolated GNSS time series maps (GPS_East/North/Up.h5), errors associated with the GNSS data (GPS_East/North/Up_sigma.h5), and integrated DInSAR + GNSS time series (fused.h5). Details about the data are given in the following publication: [Corsa, B. "Integration of DInSAR Time Series and GNSS data for Continuous Volcanic Deformation Monitoring and Eruption Early Warning Applications" Remote Sens. 2022, 14(3), 784; https://doi.org/10.3390/rs14030784]. The raw DInSAR time series spans 245 dates from 2015-11-11 to 2021-04-13 over the Big Island of Hawaii. The current raw GPS data and fused time series used 22 data points between those same dates.
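The HDF5 files can be inspected with h5py; since the internal group and dataset names are not documented above, the sketch below simply enumerates them rather than assuming any.

```python
import h5py

# Walk one of the distributed HDF5 files and print every group and
# dataset path it contains (no internal names are assumed).
with h5py.File("fused.h5", "r") as f:
    f.visit(print)
```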
Dataset Card for cot-example-dataset
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/dvilasuero/cot-example-dataset/raw/main/pipeline.yaml"
or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/dvilasuero/cot-example-dataset.
This data set consists of several files that were created to accompany M.o.R., a Shiny app created by the Surface & Nanostructure Metrology Group in the Engineering Physics Division of the Physical Measurement Laboratory (PML) at the National Institute of Standards and Technology. It was created to simplify model-based metrology. A detailed explanation of proper usage can be found in the M.o.R. documentation.
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data is quite outdated but is used in the openPMD-api unit tests.
The HDF5 data contains particle patches; the ADIOS1 data does not. It is uploaded here for reference, as a download point, and for test reproducibility.
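A minimal sketch of opening the series with the openPMD-api Python bindings; the file name pattern and the presence of particle species are assumptions about this particular upload.

```python
import openpmd_api as io

# Open the series read-only; the file name pattern is an assumption.
series = io.Series("example_data_%T.h5", io.Access.read_only)
for index, it in series.iterations.items():
    for name, species in it.particles.items():
        # The HDF5 flavour of this data carries particle patches;
        # the ADIOS1 flavour does not.
        print(index, name, "patch records:", list(species.particle_patches))
```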
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Example Proteomics Dataset for C-COMPASS
This dataset is an example of proteomics data generated from gradient centrifugation. It is provided exclusively to the reviewers of the manuscript titled "C-COMPASS: A User-Friendly Neural Network Tool for Profiling Cell Compartments at Protein and Lipid Levels."
Until the manuscript is accepted, this dataset remains unpublished and is intended solely for testing the functionality of C-COMPASS. Any further use or distribution of the data is strictly prohibited.
C-COMPASS_session.npy: Ready-to-use session that can be opened in C-COMPASS. All processing steps have already been performed here.
C-COMPASS_ProteomeInput.tsv: Complete dataset including Fractionation and TotalProteome data. This can be used to reproduce the analysis. The data it contains is identical to the dataset used for the research article.
C-COMPASS_MarkerList.txt: A compatible marker list that can be used to analyze the data from C-COMPASS_ProteomeInput.tsv. It is the same marker list that was used for the research article.
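For a quick look at the files outside of C-COMPASS, a minimal sketch (tab-separated text is assumed for the .tsv/.txt files; the internal layout of the pickled .npy session is specific to C-COMPASS and not documented here).

```python
import numpy as np
import pandas as pd

# Proteome input and marker list (tab separator assumed).
proteome = pd.read_csv("C-COMPASS_ProteomeInput.tsv", sep="\t")
markers = pd.read_csv("C-COMPASS_MarkerList.txt", sep="\t")
print(proteome.shape, markers.shape)

# The session file is a pickled Python object whose structure is
# internal to C-COMPASS.
session = np.load("C-COMPASS_session.npy", allow_pickle=True)
```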
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Despite the wide application of longitudinal studies, they are often plagued by missing data and attrition. The majority of methodological approaches focus on participant retention or modern missing data analysis procedures. This paper, however, takes a new approach by examining how researchers may supplement the sample with additional participants. First, refreshment samples use the same selection criteria as the initial study. Second, replacement samples identify auxiliary variables that may help explain patterns of missingness and select new participants based on those characteristics. A simulation study compares these two strategies for a linear growth model with five measurement occasions. Overall, the results suggest that refreshment samples lead to less relative bias, greater relative efficiency, and more acceptable coverage rates than replacement samples or not supplementing the missing participants in any way. Refreshment samples also have high statistical power. The comparative strengths of the refreshment approach are further illustrated through a real data example. These findings have implications for assessing change over time when researching at-risk samples with high levels of permanent attrition.
Dataset Card for example-preference-dataset
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/aspirina765/example-preference-dataset/raw/main/pipeline.yaml"
or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/aspirina765/example-preference-dataset.
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
example data package (local version)
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Information on samples submitted for RNAseq
Rows are individual samples
Columns are: ID; Sample Name; Date sampled; Species; Sex; Tissue; Geographic location; Date extracted; Extracted by; Nanodrop Conc. (ng/µl); 260/280; 260/230; RIN; Plate ID; Position; Index name; Index Seq; Qubit BR kit Conc. (ng/µl); BioAnalyzer Conc. (ng/µl); BioAnalyzer bp (region 200-1200); Submission reference; Date submitted; Conc. (nM); Volume provided; PE/SE; Number of reads; Read length
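As an illustration, the sheet can be loaded and screened with pandas; the file name and the QC thresholds below are assumptions, not part of the original records.

```python
import pandas as pd

# File name is an assumption; columns follow the listing above.
samples = pd.read_csv("rnaseq_samples.csv")

# Illustrative QC screen using common rule-of-thumb thresholds
# (not part of the original submission records).
good = samples[(samples["260/280"] >= 1.8) & (samples["RIN"] >= 7)]
print(f"{len(good)} of {len(samples)} samples pass the illustrative screen")
```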
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Example Proteomics Dataset for C-COMPASS
This dataset is a simulated example of a proteomics dataset containing 8 compartments across 16 fractions over 3 replicates for 2 conditions, including relocalizations. The data is simulated to be highly comparable to a real spatial proteomics dataset based on gradient fractionation, and contains controlled random variations and noise that mimic real experimental variation.
The purpose of this dataset is to test or validate C-COMPASS, or to provide a prepared session that helps with handling the C-COMPASS GUI.
sim_08_16_FractionationData.txt:
Simulated fractionation data for the above-mentioned dataset. It can be imported into C-COMPASS for spatial proteomics analysis.
sim_08_16_MarkerList.txt:
Simulated marker list matching the above-mentioned fractionation dataset. It can be imported into C-COMPASS to assign markers for the 8 simulated compartments, and is necessary for spatial predictions.
sim_08_16_TotalProteomeData.txt:
Simulated total proteome dataset matching the above-mentioned fractionation dataset. It can be imported into C-COMPASS to calculate class-centric changes across conditions.
sim_08_16_CCMPSsession_start.npy:
Prepared C-COMPASS session that can be loaded. This session provides a starting point for a C-COMPASS analysis of the simulated data: the gradient samples are pre-defined, but the subsequent processing steps have not yet been performed.
sim_08_16_CCMPSsession_processed.npy:
Prepared C-COMPASS session that can be loaded. This session provides a fully analyzed state of the simulated data. Results can be exported and plots can be generated. Each processing step can be re-done via the corresponding 'reset' button.
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This data sample (in support of the article "Indicators on firm level innovation activities from web scraped data", https://ssrn.com/abstract=3938767) contains data on companies' innovative behavior, measured at the firm level from web-scraped data on medium-high- and high-technology companies in the European Union and the United Kingdom. The data were retrieved from individual company websites and cover 96,921 companies in total. They provide information on various aspects of innovation, most significantly the research and development orientation of the company at the company and product levels, the company's collaborative activities, its products, and its use of standards. In addition to the web-scraped data, the dataset aggregates a variety of firm-level indicators, including patenting activities. In total, the dataset includes 28 variables with unique identifiers, which enable linking to other databases such as financial data.
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the example data for using SOPiNet, available at https://github.com/RegiusQuant/ESIDLM.
Details of SOPiNet can be found at https://doi.org/10.1016/j.envpol.2023.121509.