Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This dataset was created using LeRobot.
Dataset Structure
meta/info.json: { "codebase_version": "v3.0", "robot_type": "so101_follower", "total_episodes": 40, "total_frames": 10385, "total_tasks": 1, "chunks_size": 1000, "data_files_size_in_mb": 100, "video_files_size_in_mb": 200, "fps": 30, "splits": { "train": "0:40" }, "data_path": "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet", "video_path":… See the full description on the dataset page: https://huggingface.co/datasets/treitz/dataset-pinkball-first-merge.
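To get a quick feel for the on-disk layout, the sketch below reads a single data chunk with the arrow R package. The concrete chunk/file indices are illustrative placeholders filled into the `data_path` template above; this is not code shipped with the dataset.

```r
# Minimal sketch: read one LeRobot data chunk with the arrow package.
# The indices below are illustrative, following the data_path template
# "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet".
library(arrow)

frames <- read_parquet("data/chunk-000/file-000.parquet")
str(frames)    # per-frame records (observations, actions, episode indices)
nrow(frames)   # number of frames stored in this file
```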
CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
By SocialGrep [source]
The stonks movement spawned by this is a very interesting one. It's rare to see an Internet meme have such an effect on the real-world economy - yet here we are.
This dataset contains a collection of posts and comments mentioning GME in their title and body text respectively. The data is procured using SocialGrep. The posts and the comments are labelled with their score.
With this new dataset, it will be interesting to see how this affected stock market prices in the aftermath.
The files contain Reddit posts and comments mentioning GME, along with their scores. They can be used to analyze how sentiment around GME affected its stock price in the aftermath.
- To study how social media affects stock prices
- To study how Reddit affects stock prices
- To study how the sentiment of a subreddit affects stock prices
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. No copyright: you can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission.
File: six-months-of-gme-on-reddit-comments.csv

| Column name | Description |
|:---|:---|
| type | The type of post or comment. (String) |
| subreddit.name | The name of the subreddit. (String) |
| subreddit.nsfw | Whether the subreddit is NSFW. (Boolean) |
| created_utc | The time the post or comment was created. (Timestamp) |
| permalink | The permalink of the post or comment. (String) |
| body | The body of the post or comment. (String) |
| sentiment | The sentiment of the post or comment. (String) |
| score | The score of the post or comment. (Integer) |
File: six-months-of-gme-on-reddit-posts.csv

| Column name | Description |
|:---|:---|
| type | The type of post or comment. (String) |
| subreddit.name | The name of the subreddit. (String) |
| subreddit.nsfw | Whether the subreddit is NSFW. (Boolean) |
| created_utc | The time the post or comment was created. (Timestamp) |
| permalink | The permalink of the post or comment. (String) |
| score | The score of the post or comment. (Integer) |
| domain | The domain of the post or comment. (String) |
| url | The URL of the post or comment. (String) |
| selftext | The selftext of the post or comment. (String) |
| title | The title of the post or comment. (String) |
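As a starting point for the use cases listed above, here is a minimal R sketch that loads the comments file and summarizes scores by sentiment label. It assumes the CSV sits in the working directory and that created_utc is a Unix timestamp; adjust as needed.

```r
# Minimal sketch: summarize GME comment scores by sentiment label.
comments <- read.csv("six-months-of-gme-on-reddit-comments.csv")

# Average score per sentiment label (column names follow the table above)
aggregate(score ~ sentiment, data = comments, FUN = mean)

# Daily comment volume, assuming created_utc is a Unix timestamp
comments$date <- as.Date(as.POSIXct(comments$created_utc,
                                    origin = "1970-01-01", tz = "UTC"))
head(sort(table(comments$date), decreasing = TRUE))
```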
If you use this dataset in your research, please credit SocialGrep as the original author.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Additional file 4. Code to create the plots in this paper, presented as an R Markdown file.
KORUSAQ_Merge_Data are pre-generated merge data files combining various products collected during the KORUS-AQ field campaign. This collection features pre-generated merge files for the DC-8 aircraft. Data collection for this product is complete.

The KORUS-AQ field study was conducted in South Korea during May-June 2016. The study was jointly sponsored by NASA and Korea's National Institute of Environmental Research (NIER). The primary objectives were to investigate the factors controlling air quality in Korea (e.g., local emissions, chemical processes, and transboundary transport) and to assess future air quality observing strategies incorporating geostationary satellite observations. To achieve these science objectives, KORUS-AQ adopted a highly coordinated sampling strategy involving surface and airborne measurements with both in-situ and remote sensing instruments.

Surface observations provided details on ground-level air quality conditions, while airborne sampling provided an assessment of conditions aloft relevant to satellite observations and necessary to understand the role of emissions, chemistry, and dynamics in determining air quality outcomes. The sampling region covers the South Korean peninsula and surrounding waters with a primary focus on the Seoul Metropolitan Area. Airborne sampling was primarily conducted from near surface to about 8 km with extensive profiling to characterize the vertical distribution of pollutants and their precursors. The airborne observational data were collected from three aircraft platforms: the NASA DC-8, NASA B-200, and Hanseo King Air. Surface measurements were conducted from 16 ground sites and 2 ships: R/V Onnuri and R/V Jang Mok.

The major data products collected from both the ground and air include in-situ measurements of trace gases (e.g., ozone, reactive nitrogen species, carbon monoxide and dioxide, methane, non-methane and oxygenated hydrocarbon species), aerosols (e.g., microphysical and optical properties and chemical composition), active remote sensing of ozone and aerosols, and passive remote sensing of NO2, CH2O, and O3 column densities. These data products support research focused on examining the impact of photochemistry and transport on ozone and aerosols, evaluating emissions inventories, and assessing the potential use of satellite observations in air quality studies.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
We collected all available global soil carbon (C) and heterotrophic respiration (RH) maps derived from data-driven estimates, sourcing them from public repositories and supplementary materials of previous studies (Table 1). All spatial datasets were converted to NetCDF format for consistency and ease of use.
Because the maps had varying spatial resolutions (ranging from 0.0083° to 0.5°), we harmonized all datasets to a common resolution of 0.5° (approximately 50 km at the equator). We then merged the processed maps by computing the mean, maximum, and minimum values at each grid cell, resulting in harmonized global maps of soil C (for the top 0–30 cm and 0–100 cm depths) and RH at 0.5° resolution.
Grid cells with fewer than three soil C estimates or fewer than four RH estimates were assigned NA values. Land and water grid cells were automatically distinguished by combining multiple datasets containing soil C and RH information over land.
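The R sketch below illustrates the kind of regridding and per-cell aggregation described above using the terra package. It is not the authors' processing code: the input folder, the bilinear resampling choice, and the 0.5° template grid are assumptions made purely for illustration.

```r
# Illustrative sketch of harmonizing soil C maps to a common 0.5-degree grid
# and computing per-cell mean/max/min (not the authors' actual pipeline).
library(terra)

template <- rast(xmin = -180, xmax = 180, ymin = -90, ymax = 90, resolution = 0.5)

files <- list.files("soc_maps", pattern = "\\.nc$", full.names = TRUE)  # assumed folder
maps  <- lapply(files, function(f) resample(rast(f), template, method = "bilinear"))
stack <- rast(maps)   # one layer per source map

soc_mean <- app(stack, mean, na.rm = TRUE)
soc_max  <- app(stack, max,  na.rm = TRUE)
soc_min  <- app(stack, min,  na.rm = TRUE)

# Mask grid cells with fewer than three soil C estimates, as described above
n_est <- app(stack, function(x) sum(!is.na(x)))
soc_mean[n_est < 3] <- NA
```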
Soil carbon turnover time (years), denoted as τ, was calculated under the assumption of a quasi-equilibrium state using the formula:
τ = CS / RH
where CS is soil carbon stock and RH is the heterotrophic respiration rate. The uncertainty range of τ was estimated for each grid cell using:
τmax = CS+ / RH−
τmin = CS− / RH+
where CS+ and CS− are the maximum and minimum soil C values, and RH+ and RH− are the maximum and minimum RH values, respectively.
To calculate the temperature sensitivity of decomposition (Q10)—the factor by which decomposition rates increase with a 10 °C rise in temperature—we followed the method described in Koven et al. (2017). The uncertainty of Q10 (maximum and minimum values) was derived using τmax and τmin, respectively.
All files are provided in NetCDF format. The SOC file includes the following variables:
· longitude, latitude
· soc: mean soil C stock (kg C m⁻²)
· soc_median: median soil C (kg C m⁻²)
· soc_n: number of estimates per grid cell
· soc_max, soc_min: maximum and minimum soil C (kg C m⁻²)
· soc_max_id, soc_min_id: study IDs corresponding to the maximum and minimum values
· soc_range: range of soil C values
· soc_sd: standard deviation of soil C (kg C m⁻²)
· soc_cv: coefficient of variation (%)
The RH file includes:
· longitude, latitude
· rh: mean RH (g C m⁻² yr⁻¹)
· rh_median, rh_n, rh_max, rh_min: as above
· rh_max_id, rh_min_id: study IDs for max/min
· rh_range, rh_sd, rh_cv: analogous variables for RH
The mean, maximum, and minimum values of soil C turnover time are provided as separate files. The Q10 files contain estimates derived from the mean values of soil C and RH, along with associated uncertainty values.
The harmonized dataset files available in the repository are as follows:
· harmonized-RH-hdg.nc: global soil heterotrophic respiration map
· harmonized-SOC100-hdg.nc: global soil C map for 0–100 cm
· harmonized-SOC30-hdg.nc: global soil C map for 0–30 cm
· Q10.nc: global Q10 map
· Turnover-time_max.nc: global soil C turnover time estimated using maximum soil C and minimum RH
· Turnover-time_min.nc: global soil C turnover time estimated using minimum soil C and maximum RH
· Turnover-time_mean.nc: global soil C turnover time estimated using mean soil C and RH
· Turnover-time30_mean.nc: global soil C turnover time estimated using the soil C map for 0-30 cm
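For users who want to recompute τ and its range directly from the harmonized files, the sketch below uses the ncdf4 R package and the variable names listed above. The g-to-kg conversion for RH is an assumption made so the ratio comes out in years; verify units against the file metadata.

```r
# Minimal sketch: recompute soil C turnover time and its uncertainty range
# from the harmonized NetCDF files (tau = CS / RH).
library(ncdf4)

soc_nc <- nc_open("harmonized-SOC100-hdg.nc")
rh_nc  <- nc_open("harmonized-RH-hdg.nc")

cs     <- ncvar_get(soc_nc, "soc")        # mean soil C stock, kg C m-2
cs_max <- ncvar_get(soc_nc, "soc_max")
cs_min <- ncvar_get(soc_nc, "soc_min")

rh     <- ncvar_get(rh_nc, "rh") / 1000   # g C m-2 yr-1 -> kg C m-2 yr-1 (assumed)
rh_max <- ncvar_get(rh_nc, "rh_max") / 1000
rh_min <- ncvar_get(rh_nc, "rh_min") / 1000

tau     <- cs     / rh       # mean turnover time, years
tau_max <- cs_max / rh_min   # tau_max = CS+ / RH-
tau_min <- cs_min / rh_max   # tau_min = CS- / RH+

nc_close(soc_nc); nc_close(rh_nc)
```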
Version history
Version 1.1: Median values were added; bug fix for SOC30 (the n > 2 filter was inactive in the previous version).
More details are provided in: Hashimoto, S., Ito, A. & Nishina, K. (in revision) Harmonized global soil carbon and respiration datasets with derived turnover time and temperature sensitivity. Scientific Data.
Reference
Koven, C. D., Hugelius, G., Lawrence, D. M. & Wieder, W. R. Higher climatological temperature sensitivity of soil carbon in cold than warm climates. Nat. Clim. Change 7, 817–822 (2017).
Table 1: List of soil carbon and heterotrophic respiration datasets used in this study.

| Dataset | Repository/References (Dataset name) | Depth (cm) | ID in NetCDF file*** |
|:---|:---|:---|:---|
| Global soil C | Global soil data task 2000 (IGBP-DIS)1 | 0–100 | 3,- |
| | Shangguan et al. 2014 (GSDE)2,3 | 0–100, 0–30* | 1,1 |
| | Batjes 2016 (WISE30sec)4,5 | 0–100, 0–30 | 6,7 |
| | Sanderman et al. 2017 (Soil-Carbon-Debt) 6,7 | 0–100, 0–30 | 5,5 |
| | Soilgrids team and Hengl et al. 2017 (SoilGrids)8,9 | 0–30** | -,6 |
| | Hengl and Wheeler 2018 (LandGIS)10 | 0–100, 0–30 | 4,4 |
| | FAO 2022 (GSOC)11 | 0–30 | -,2 |
| | FAO 2023 (HWSD2)12 | 0–100, 0–30 | 2,3 |
| Circumpolar soil C | Hugelius et al. 2013 (NCSCD)13–15 | 0–100, 0–30 | 7,8 |
| Global RH | Hashimoto et al. 2015 16,17 | - | 1 |
| | Warner et al. 2019 (Bond-Lamberty equation based)18,19 | - | 2 |
| | Warner et al. 2019 (Subke equation based)18,19 | - | 3 |
| | Tang et al. 2020 20,21 | - | 4 |
| | Lu et al. 2021 22,23 | - | 5 |
| | Stell et al. 2021 24,25 | - | |
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Dataset and R code associated with the following publication: Badiane et al. (2022), Environmental conditions and male quality traits simultaneously explain variation of multiple colour signals in male lizards. Journal of Animal Ecology, in press.

This dataset includes the following files:
- An Excel file containing the reflectance spectra of all individuals from all the study populations
- An Excel file containing the variables collected at the individual and population levels
- Two R scripts corresponding to the analyses performed in the publication
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Additional file 2. Input files needed to recreate the plots in this paper: Tracer output files for three species.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Additional file 3. Input files needed to recreate the plots in this paper: raw sequence data for alignment.
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
This dataset consists of growth and yield data for each season when soybean [Glycine max (L.) Merr.] was grown for seed at the USDA-ARS Conservation and Production Laboratory (CPRL), Soil and Water Management Research Unit (SWMRU) research weather station, Bushland, Texas (Lat. 35.186714°, Long. -102.094189°, elevation 1170 m above MSL). In the 1995, 2003, 2004, and 2010 seasons, soybean was grown on two large, precision weighing lysimeters, each in the center of a 4.44 ha square field. In 2019, soybean was grown on four large, precision weighing lysimeters and their surrounding 4.4 ha fields. The square fields are themselves arranged in a larger square with four fields in four adjacent quadrants of the larger square. Fields and lysimeters within each field are thus designated northeast (NE), southeast (SE), northwest (NW), and southwest (SW). Soybean was grown on different combinations of fields in different years. Irrigation was by linear move sprinkler system in 1995, 2003, 2004, and 2010, although in 2010 only one irrigation was applied to establish the crop, after which it was grown as a dryland crop. Irrigation protocols described as full were managed to replenish soil water used by the crop on a weekly or more frequent basis as determined by soil profile water content readings made with a neutron probe to 2.4-m depth in the field. Irrigation protocols described as deficit typically involved irrigations to establish the crop early in the season, followed by reduced or absent irrigations later in the season (typically in the later winter and spring). The growth and yield data include plant population density, height, plant row width, leaf area index, growth stage, total above-ground biomass, leaf and stem biomass, head mass (when present), kernel or seed number, and final yield. Data are from replicate samples in the field and non-destructive (except for final harvest) measurements on the weighing lysimeters. In most cases yield data are available from both manual sampling on replicate plots in each field and from machine harvest. Machine harvest yields are commonly smaller than hand harvest yields due to combine losses.

These datasets originate from research aimed at determining crop water use (ET), crop coefficients for use in ET-based irrigation scheduling based on a reference ET, crop growth, yield, harvest index, and crop water productivity as affected by irrigation method, timing, amount (full or some degree of deficit), agronomic practices, cultivar, and weather. Prior publications have focused on soybean ET, crop coefficients, and crop water productivity. Crop coefficients have been used by ET networks. The data have utility for testing simulation models of crop ET, growth, and yield and have been used for testing and calibrating models of ET that use satellite and/or weather data. See the README for descriptions of each data file.

Resources in this dataset:
- Resource Title: 1995 Bushland, TX, west soybean growth and yield data. File Name: 1995 West Soybean_Growth_and_Yield-V2.xlsx
- Resource Title: 2003 Bushland, TX, east soybean growth and yield data. File Name: 2003 East Soybean_Growth_and_Yield-V2.xlsx
- Resource Title: 2004 Bushland, TX, east soybean growth and yield data. File Name: 2004 East Soybean_Growth-and_Yield-V2.xlsx
- Resource Title: 2019 Bushland, TX, east soybean growth and yield data. File Name: 2019 East Soybean_Growth_and_Yield-V2.xlsx
- Resource Title: 2019 Bushland, TX, west soybean growth and yield data. File Name: 2019 West Soybean_Growth_and_Yield-V2.xlsx
- Resource Title: 2010 Bushland, TX, west soybean growth and yield data. File Name: 2010 West_Soybean_Growth_and_Yield-V2.xlsx
- Resource Title: README. File Name: README_Soybean_Growth_and_Yield.txt
CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset provides invaluable insights into the natural history of Monoclonal Gammopathy of Undetermined Significance (MGUS), a crucial precursor condition to various plasma cell disorders, including multiple myeloma. Originating from a landmark longitudinal study at the Mayo Clinic, this meticulously curated dataset has served as a cornerstone in hematologic research for decades.
The data stems from a foundational clinical cohort established at the Mayo Clinic in Rochester, Minnesota, USA. The principal investigator, Dr. Robert A. Kyle of the Mayo Clinic, initiated this study, which involved sequential patients diagnosed with MGUS, followed longitudinally. The initial groundbreaking research was published in the New England Journal of Medicine (NEJM) in 1978, with extended follow-up studies published in 2002 and subsequent updates, ensuring a robust and long-term perspective on the condition. The diligence of the principal investigator ensured no subjects from the initial cohort were lost to follow-up.
Plasma cells are vital for immune defense, producing immunoglobulins. In certain conditions, a single plasma cell clone can proliferate, leading to an abnormal monoclonal protein (M-protein) visible as a "spike" in serum protein electrophoresis. MGUS is defined by the presence of such an M-protein without evidence of overt malignancy, distinguishing it from more serious conditions like multiple myeloma. It is a premalignant plasma cell disorder, and while generally asymptomatic, it carries a lifelong risk of progression to more severe conditions.
The dataset typically comprises two main files, mgus1.csv and mgus2.csv, offering different formats and an extended cohort. The data sets were updated in January 2015 to correct some small errors. For patient confidentiality, the dataset in the survival R package has been slightly perturbed, but the statistical results remain essentially unchanged.
This dataset is foundational in hematology due to its long-term follow-up and the comprehensive clinical and laboratory data collected. It is frequently cited in medical literature and textbooks, such as "Modeling Survival Data: Extending the Cox Model" by Therneau & Grambsch. Researchers utilize this dataset to:
The dataset highlights that approximately 1% of individuals with MGUS progress to a more serious blood cancer or related disorder annually. Key risk factors for progression include higher concentrations of monoclonal immunoglobulin (especially >1.5 g/dL), monoclonal immunoglobulins other than IgG, and an abnormal serum free light-chain ratio.
The data was directly abstracted from clinical and laboratory records at the Mayo Clinic under strict research protocols and verified through continuous follow-up. It is widely distributed and included as a standard dataset in the survival official R package by Terry Therneau (Mayo Clinic).
Original Study (1978): Kyle RA, et al. Prevalence of Monoclonal Gammopathy of Undetermined Significance. NEJM 1978;299:1213-20. https://www.nejm.org/doi/full/10.1056/NEJM197812072992304
Major Follow-up (2002): Kyle RA, et al. Long-Term Follow-up of MGUS. NEJM 2002;346:564-569. https://www.nejm.org/doi/full/10.1056/NEJMoa011332
R Survival Package Documentation: For detailed information on the dataset as included in R. https://stat.ethz.ch/R-manual/R-devel/library/survival/help/mgus.html https://search.r-project.org/CRAN/refmans/eventglm/html/mgus2.html
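Because the data ship with the survival R package, a first look takes only a few lines. The sketch below uses column names as documented for mgus2 (futime/death for overall survival, mspike for the M-protein spike); treat it as a starting point rather than a replication of the published analyses.

```r
# Minimal sketch: overall survival in the MGUS cohort using the copy of the
# data distributed with the survival package (column names per its docs).
library(survival)

data(mgus2)
str(mgus2)

# Kaplan-Meier curves for overall survival by sex
fit <- survfit(Surv(futime, death) ~ sex, data = mgus2)
summary(fit, times = c(60, 120))   # survival at 5 and 10 years (futime in months)

# Cox model relating age and M-protein spike size to mortality
coxph(Surv(futime, death) ~ age + mspike, data = mgus2)
```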
| Column 'mgus1' | Description |
|---|---|
| id | Subject ID |
| age | Age in years at the detection of MGUS |
| sex | Gender (male or female) |
| dxyr | Year of diagnosis |
| pcdx | For subjects who progress to a plasma cell malignancy, the subtype of malignancy: multiple myeloma (MM) is the most common, followed by amyloidosis (AM), macroglobulinemia (MA), and other lymphoproliferative disorders (LP) |
| subtype | The subtype of malignancy (MM = multiple myeloma, AM = amyloidosis... |
Name: Data used to rate the relevance of each dimension necessary for a Holistic Environmental Policy Assessment.
Summary: This dataset contains answers from a panel of experts and the public to rate the relevance of each dimension on a scale of 0 (Not relevant at all) to 100 (Extremely relevant).
License: CC-BY-SA
Acknowledge: These data have been collected in the framework of the DECIPHER project. This project has received funding from the European Union’s Horizon Europe programme under grant agreement No. 101056898.
Disclaimer: Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.
Collection Date: 2024-01 / 2024-04
Publication Date: 22/04/2025
DOI: 10.5281/zenodo.13909413
Other repositories: -
Author: University of Deusto
Objective of collection: This data was originally collected to prioritise the dimensions to be further used for Environmental Policy Assessment and IAMs enlarged scope.
Description:
Data Files (CSV)
decipher-public.csv : Public participants' general survey results in the framework of the Decipher project, including socio demographic characteristics and overall perception of each dimension necessary for a Holistic Environmental Policy Assessment.
decipher-risk.csv : Contains individual survey responses regarding prioritisation of dimensions in risk situations. Includes demographic and opinion data from a targeted sample.
decipher-experts.csv : Experts’ opinions collected on risk topics through surveys in the framework of Decipher Project, targeting professionals in relevant fields.
decipher-modelers.csv: Answers given by the developers of models about the characteristics of the models and dimensions covered by them.
prolific_export_risk.csv : Exported survey data from Prolific, focusing specifically on ratings in risk situations. Includes response times, demographic details, and survey metadata.
prolific_export_public_{1,2}.csv : Public survey exports from Prolific, gathering prioritisation of dimensions necessary for environmental policy assessment.
curated.csv : Final cleaned and harmonized dataset combining multiple survey sources. Designed for direct statistical analysis with standardized variable names.
Scripts files (R)
decipher-modelers.R: Script to assess the answers given by modelers about the characteristics of the models.
joint.R: Script to clean and join the raw answers from the different surveys to retrieve overall perception of each dimension necessary for a Holistic Environmental Policy Assessment.
Report Files
decipher-modelers.pdf: Diagram with the result of the
full-Country.html : Full interactive report showing dimension prioritisation broken down by participant country.
full-Gender.html : Visualization report displaying differences in dimension prioritisation by gender.
full-Education.html : Detailed breakdown of dimension prioritisation results based on education level.
full-Work.html : Report focusing on participant occupational categories and associated dimension prioritisation.
full-Income.html : Analysis report showing how income level correlates with dimension prioritisation.
full-PS.html : Report analyzing Political Sensitivity scores across all participants.
full-type.html : Visualization report comparing participant dimensions prioritisation (public vs experts) in normal and risk situations.
full-joint-Country.html : Joint analysis report integrating multiple dimensions of country-based dimension prioritisation in normal and risk situations. Combines demographic and response patterns.
full-joint-Gender.html : Combined gender-based analysis across datasets, exploring intersections of demographic factors and dimensions prioritisation in normal and risk situations.
full-joint-Education.html : Education-focused report merging various datasets to show consistent or divergent patterns of dimensions prioritisation in normal and risk awareness.
full-joint-Work.html : Cross-dataset analysis of occupational groups and their dimensions prioritisation in normal and risk situations.
full-joint-Income.html : Income-stratified joint analysis, merging public and expert datasets to find common trends and significant differences during dimensions prioritisation in normal and risks situations.
full-joint-PS.html : Comprehensive Political Sensitivity score report from merged datasets, highlighting general patterns and subgroup variations in normal and risk situations.
5 star: ⭐⭐⭐
Preprocessing steps: The data has been re-coded and cleaned using the scripts provided.
Reuse: NA
Update policy: No more updates are planned.
Ethics and legal aspects: Names of the persons involved have been removed.
Technical aspects:
Other:
CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)
Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR and the use of UMI allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies.
Methods
This serves as an overview of the analysis performed on PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al. "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies"
Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005
For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub.
The demultiplexed read collections from the chunked_demux pipeline or CCS read files from datasets which were not indexed (M1567, M004, M005) were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer) and further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub.
The consensus.tar.gz and tagged.tar.gz files were moved from sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each has an .Rmd file with the same name inside which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results.
Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, two fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the python script compare_seqs.py, which identifies any matched sequences that are different between sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program.
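The fragment below gives a rough idea of the first step of that workflow in R: unpacking the consensus archives and pooling the sUMI consensus sequences into one FASTA file. It uses the Biostrings package; the archive locations and the file-name pattern used to pick out sUMI sequences are assumptions, not the exact code from Sequence_Analysis.Rmd.

```r
# Rough sketch of pooling sUMI consensus sequences (assumed paths and patterns;
# see Sequence_Analysis.Rmd for the actual workflow).
library(Biostrings)

archives <- list.files("Pipeline_Outputs",
                       pattern = "^consensus_.*\\.tar\\.gz$", full.names = TRUE)
for (a in archives) untar(a, exdir = "Pipeline_Outputs/consensus")

fastas <- list.files("Pipeline_Outputs/consensus", pattern = "sUMI.*\\.fasta$",
                     recursive = TRUE, full.names = TRUE)

seqs <- do.call(c, lapply(fastas, readDNAStringSet))
writeXStringSet(seqs, "all_sUMI_consensus.fasta")
length(seqs)   # number of pooled consensus sequences
```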
To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed, and for each family with discordant sUMI and dUMI sequences the reads from the UMI1_keeping directory were aligned using Geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Geneious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment and specifically the UMI2 sequences, the site of the discordance and its cause were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper.
Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined and used as input to create a single dataset containing all UMI information from all samples. This file dUMI_df.csv was used as input for Figures.Rmd.
Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for each figure. These were copied into Prism software to create the final figures for the paper.
1. Phenotypic data is crucial for understanding genotype-phenotype relationships, assessing the tree of life, and revealing trends in trait diversity over time. Large-scale description of whole organisms for quantitative analyses (phenomics) presents several challenges, and technological advances in the collection of genomic data outpace those for phenomic data. Reasons for this disparity include the time-consuming and expensive nature of collecting discrete phenotypic data and mining previously-published data on a given species (both often requiring anatomical expertise across taxa), and computational challenges involved with analyzing high-dimensional datasets.
2. One approach to building approximations of organismal phenomes is to combine published datasets of discrete characters assembled for phylogenetic analyses into a phenomic dataset. Despite a wealth of legacy datasets in the literature for many groups, relatively few methods exist for automating the assembly, analysis, and vi...
This dataset contains NASA DC-8 1 Minute Data Merge data collected during the Deep Convective Clouds and Chemistry Experiment (DC3) from 18 May 2012 through 22 June 2012. This dataset contains updated data provided by NASA. In most cases, variable names have been kept identical to those submitted in the raw data files. However, in some cases, names have been changed (e.g., to eliminate duplication). Units have been standardized throughout the merge. In addition, a "grand merge" has been provided. This includes data from all the individual merged flights throughout the mission. This grand merge will follow the following naming convention: "dc3-mrg60-dc8_merge_YYYYMMdd_R5_thruYYYYMMdd.ict" (with the comment "_thruYYYYMMdd" indicating the last flight date included). This dataset is in ICARTT format. Please see the header portion of the data files for details on instruments, parameters, quality assurance, quality control, contact information, and dataset comments. For more information on updates to this dataset, please see the readme file.
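ICARTT (.ict) files are plain text with a self-describing header, so they can be read without special tooling. The helper below is a minimal sketch following the ICARTT 1001 convention (the first header field gives the header line count, and the last header line lists the variable names); the file name is illustrative and missing-data flags are left untouched.

```r
# Minimal ICARTT (format 1001) reader sketch; the merge file name is illustrative.
read_icartt <- function(path) {
  n_header  <- as.integer(strsplit(readLines(path, n = 1), ",")[[1]][1])
  header    <- readLines(path, n = n_header)
  col_names <- trimws(strsplit(header[n_header], ",")[[1]])  # last header line
  read.csv(path, skip = n_header, header = FALSE, col.names = col_names,
           strip.white = TRUE)
}

dc8 <- read_icartt("dc3-mrg60-dc8_merge_20120518_R5_thru20120622.ict")
names(dc8)[1:10]
```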
This data set contains DLR Falcon 1 Minute Data Merge data collected during the Deep Convective Clouds and Chemistry Experiment (DC3) from 29 May 2012 through 14 June 2012. These merges were created using data in the NASA DC3 archive as of September 25, 2013. In most cases, variable names have been kept identical to those submitted in the raw data files. However, in some cases, names have been changed (e.g., to eliminate duplication). Units have been standardized throughout the merge. In addition, a "grand merge" has been provided. This includes data from all the individual merged flights throughout the mission. This grand merge will follow the following naming convention: "dc3-mrg06-falcon_merge_YYYYMMdd_R2_thruYYYYMMdd.ict" (with the comment "_thruYYYYMMdd" indicating the last flight date included). This data set is in ICARTT format. Please see the header portion of the data files for details on instruments, parameters, quality assurance, quality control, contact information, and data set comments.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
📖 Benchmark Datasets for Entity Linking from Tabular Data (Version 2)
This archive provides a benchmark suite for evaluating entity linking algorithms on structured tabular data.
It is organised into two parts:
• Challenge datasets (HTR1, HTR2): From the SemTab Table-to-KG Challenge, widely used in academic evaluations of table-to-KG alignment systems. Each is a dataset (a collection of many tables) provided with ground truth and candidate mappings.
👉 Please also cite the SemTab Challenge when using these resources.
• Real-world tables (Company, Movie, SN):
• Company — one table constructed via SPARQL queries on Wikidata, with both Wikidata and Crunchbase ground truths.
• Movie — one table constructed via SPARQL queries on Wikidata.
• SN (Spend Network) — one procurement table from the enRichMyData (EMD) project, manually annotated and including NIL cases for mentions with no known Wikidata match.
A shared top-level folder (mention_to_qid/) provides JSON files mapping surface mentions to candidate QIDs for these real-world tables.
⸻
📂 Contents
Each dataset or table includes:
• One or more input CSV tables
• Ground truth files mapping mentions/cells to Wikidata QIDs (or NIL)
• Candidate mappings (mention_to_qid/*.json), sometimes multiple variants
• Optional files such as column_classifications.json or cell_to_qid.json
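To give a sense of how the pieces fit together, the sketch below loads one candidate file and one ground-truth file in R and computes candidate recall (the share of gold QIDs that appear among the candidates). The jsonlite package is used; the specific file names and ground-truth column names are illustrative assumptions.

```r
# Illustrative sketch: candidate recall for one table (file and column names assumed).
library(jsonlite)

candidates <- fromJSON("mention_to_qid/company.json")  # named list: mention -> candidate QIDs
gt <- read.csv("Company/ground_truth.csv")             # assumed columns: mention, qid

hit <- mapply(function(m, q) {
  if (q == "NIL") return(NA)   # NIL cases have no gold QID to recover
  q %in% candidates[[m]]
}, gt$mention, gt$qid)

mean(hit, na.rm = TRUE)   # fraction of linkable mentions whose gold QID is a candidate
```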
⸻
📝 Licensing
• HTR1 & HTR2: CC BY 4.0
• Company & Movie: Derived from Wikidata (public domain; CC0 1.0)
• SN: CC BY 4.0 (from the enRichMyData project)
⸻
📌 Citation
If you use these datasets, please cite:
• This Zenodo record (Version 2):
Avogadro, R., & Rauniyar, A. (2025). Benchmark Datasets for Entity Linking from Tabular Data (Version 2). Zenodo. https://doi.org/10.5281/zenodo.15888942
• The SemTab Challenge (for HTR1/HTR2):
SemTab: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (Table-to-KG). (Cite the relevant SemTab overview paper for the year you reference.)
• Wikidata: Data retrieved from Wikidata (public domain; CC0 1.0).
• enRichMyData (for SN / Spend Network): Project resources from enRichMyData, licensed under CC BY 4.0.
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
This dataset consists of growth and yield data for each season when sorghum [Sorghum bicolor (L.)] was grown at the USDA-ARS Conservation and Production Laboratory (CPRL), Soil and Water Management Research Unit (SWMRU) research weather station, Bushland, Texas (Lat. 35.186714°, Long. -102.094189°, elevation 1170 m above MSL). In the 1988, 1991, 1993, 1997, 1998, 1999, 2003 through 2007, 2014, and 2015 seasons (13 years), sorghum was grown on from one to four large, precision weighing lysimeters, each in the center of a 4.44 ha square field also planted to sorghum. The square fields were themselves arranged in a larger square with four fields in four adjacent quadrants of the larger square. Fields and lysimeters within each field were thus designated northeast (NE), southeast (SE), northwest (NW), and southwest (SW). Sorghum was grown on different combinations of fields in different years. When irrigated, irrigation was by linear move sprinkler system in years before 2014, and by both sprinkler and subsurface drip irrigation in 2014 and 2015. Irrigation protocols described as full were managed to replenish soil water used by the crop on a weekly or more frequent basis as determined by soil profile water content readings made with a neutron probe to 2.4-m depth in the field. Irrigation protocols described as deficit typically involved irrigation at rates established as percentages of full irrigation ranging from 33% to 75%, depending on the year.

The growth and yield data include plant population density, height, plant row width, leaf area index, growth stage, total above-ground biomass, leaf and stem biomass, head mass (when present), seed mass, and final yield. Data are from replicate samples in the field and non-destructive (except for final harvest) measurements on the weighing lysimeters. In most cases yield data are available from both manual sampling on replicate plots in each field and from machine harvest. Machine harvest yields are commonly smaller than hand harvest yields due to combine losses.

These datasets originate from research aimed at determining crop water use (ET), crop coefficients for use in ET-based irrigation scheduling based on a reference ET, crop growth, yield, harvest index, and crop water productivity as affected by irrigation method, timing, amount (full or some degree of deficit), agronomic practices, cultivar, and weather. Prior publications have focused on sorghum ET, crop coefficients, crop water productivity, and simulation modeling of crop water use, growth, and yield. Crop coefficients have been used by ET networks. The data have utility for testing simulation models of crop ET, growth, and yield and have been used for testing and calibrating models of ET that use satellite and/or weather data. See the README for descriptions of each data file.
analyze the current population survey (cps) annual social and economic supplement (asec) with r

the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948. wow. the us census bureau and the bureau of labor statistics (bls) tag-team on this one. until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents. this provides enough sample to conduct state- and a few large metro area-level analyses. your sample size will vanish if you start investigating subgroups by state - consider pooling multiple years. county-level is a no-no.

despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency. aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be treated as point-in-time statistics. cps-asec generalizes to the united states non-institutional, non-active duty military population.

the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person). to import these files into r, the parse.SAScii function uses nber's sas code to determine how to import the fixed-width file, then RSQLite to put everything into a schnazzy database. you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show.

this new github repository contains three scripts:

2005-2012 asec - download all microdata.R
- download the fixed-width file containing household, family, and person records
- import by separating this file into three tables, then merge 'em together at the person-level
- download the fixed-width file containing the person-level replicate weights
- merge the rectangular person-level file with the replicate weights, then store it in a sql database
- create a new variable - one - in the data table

2012 asec - analysis examples.R
- connect to the sql database created by the 'download all microdata' program
- create the complex sample survey object, using the replicate weights
- perform a boatload of analysis examples

replicate census estimates - 2011.R
- connect to the sql database created by the 'download all microdata' program
- create the complex sample survey object, using the replicate weights
- match the sas output shown in the png file below

2011 asec replicate weight sas output.png
- statistic and standard error generated from the replicate-weighted example sas script contained in this census-provided person replicate weights usage instructions document.

click here to view these three scripts

for more detail about the current population survey - annual social and economic supplement (cps-asec), visit:
- the census bureau's current population survey page
- the bureau of labor statistics' current population survey page
- the current population survey's wikipedia article

notes: interviews are conducted in march about experiences during the previous year. the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011. when you use the current population survey to talk about america, subtract a year from the data file name. as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research.

confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D
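for readers who want a feel for the 'analysis examples' step, here is a minimal sketch of building the replicate-weight design with the survey package. the database, table, and variable names are placeholders standing in for whatever the download script actually creates; cps-asec replicate weights use fay's method, assumed here with rho = 0.5.

```r
# minimal sketch (not the repository scripts): build a replicate-weight design
# from the sqlite database created by the download step. database, table, and
# variable names below are placeholders.
library(DBI)
library(RSQLite)
library(survey)

con  <- dbConnect(SQLite(), "cps.asec.db")
asec <- dbGetQuery(con, "SELECT * FROM asec12")
dbDisconnect(con)

des <- svrepdesign(
  data             = asec,
  weights          = ~marsupwt,       # person weight (assumed column name)
  repweights       = "pwwgt[0-9]+",   # replicate weight columns (assumed pattern)
  type             = "Fay",
  rho              = 0.5,
  combined.weights = TRUE
)

svymean(~htotval, des, na.rm = TRUE)  # e.g. mean household income (assumed variable)
```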
This dataset is designed for beginners to practice regression problems, particularly in the context of predicting house prices. It contains 1000 rows, with each row representing a house and various attributes that influence its price. The dataset is well-suited for learning basic to intermediate-level regression modeling techniques.
Beginner Regression Projects: This dataset can be used to practice building regression models such as Linear Regression, Decision Trees, or Random Forests. The target variable (house price) is continuous, making this an ideal problem for supervised learning techniques.
Feature Engineering Practice: Learners can create new features by combining existing ones, such as the price per square foot or age of the house, providing an opportunity to experiment with feature transformations.
Exploratory Data Analysis (EDA): You can explore how different features (e.g., square footage, number of bedrooms) correlate with the target variable, making it a great dataset for learning about data visualization and summary statistics.
Model Evaluation: The dataset allows for various model evaluation techniques such as cross-validation, R-squared, and Mean Absolute Error (MAE). These metrics can be used to compare the effectiveness of different models.
The dataset is highly versatile for a range of machine learning tasks. You can apply simple linear models to predict house prices based on one or two features, or use more complex models like Random Forest or Gradient Boosting Machines to understand interactions between variables.
It can also be used for dimensionality reduction techniques like PCA or to practice handling categorical variables (e.g., neighborhood quality) through encoding techniques like one-hot encoding.
This dataset is ideal for anyone wanting to gain practical experience in building regression models while working with real-world features.
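As a minimal starting point, the sketch below fits a simple linear regression in R and reports R-squared and MAE on a held-out split. The file and column names (price, square_footage, bedrooms) are placeholders; substitute the actual column names from the dataset.

```r
# Minimal regression sketch; file and column names are illustrative placeholders.
houses <- read.csv("house_prices.csv")

# Train/test split for honest evaluation
set.seed(42)
idx   <- sample(nrow(houses), 0.8 * nrow(houses))
train <- houses[idx, ]
test  <- houses[-idx, ]

fit <- lm(price ~ square_footage + bedrooms, data = train)
summary(fit)$r.squared

pred <- predict(fit, newdata = test)
mean(abs(pred - test$price))   # Mean Absolute Error on held-out data
```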
Data and code for "Dawson, D.E.; Lau, C.; Pradeep, P.; Sayre, R.R.; Judson, R.S.; Tornero-Velez, R.; Wambaugh, J.F. A Machine Learning Model to Estimate Toxicokinetic Half-Lives of Per- and Polyfluoro-Alkyl Substances (PFAS) in Multiple Species. Toxics 2023, 11, 98. https://doi.org/10.3390/toxics11020098". Includes a link to an R Markdown file allowing the application of the model to novel chemicals. This dataset is associated with the following publication: Dawson, D., C. Lau, P. Pradeep, R. Sayre, R. Judson, R. Tornero-Velez, and J. Wambaugh. A Machine Learning Model to Estimate Toxicokinetic Half-Lives of Per- and Polyfluoro-Alkyl Substances (PFAS) in Multiple Species. Toxics. MDPI, Basel, SWITZERLAND, 11(2): 98, (2023).