92 datasets found
  1. dataset-pinkball-first-merge

    • huggingface.co
    Updated Dec 1, 2025
    Cite
    Thomas R (2025). dataset-pinkball-first-merge [Dataset]. https://huggingface.co/datasets/treitz/dataset-pinkball-first-merge
    Dataset updated
    Dec 1, 2025
    Authors
    Thomas R
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset was created using LeRobot.

      Dataset Structure
    

    meta/info.json: { "codebase_version": "v3.0", "robot_type": "so101_follower", "total_episodes": 40, "total_frames": 10385, "total_tasks": 1, "chunks_size": 1000, "data_files_size_in_mb": 100, "video_files_size_in_mb": 200, "fps": 30, "splits": { "train": "0:40" }, "data_path": "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet", "video_path":… See the full description on the dataset page: https://huggingface.co/datasets/treitz/dataset-pinkball-first-merge.
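
    For orientation, here is a minimal R sketch of how one might read this layout locally: it parses meta/info.json with jsonlite and resolves the parquet path template with sprintf before loading one chunk with arrow. The chunk and file indices are illustrative, not taken from the dataset.

    ```r
    # Minimal sketch, assuming the jsonlite and arrow packages and a local copy
    # of the repository; chunk/file indices below are illustrative.
    library(jsonlite)
    library(arrow)

    info <- fromJSON("meta/info.json")
    info$fps             # 30, per the metadata above
    info$total_episodes  # 40

    # The template "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet"
    # maps directly onto sprintf's %03d formatting.
    path <- sprintf("data/chunk-%03d/file-%03d.parquet", 0, 0)
    frames <- read_parquet(path)
    str(frames)
    ```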

  2. Reddit's /r/Gamestop

    • kaggle.com
    zip
    Updated Nov 28, 2022
    Cite
    The Devastator (2022). Reddit's /r/Gamestop [Dataset]. https://www.kaggle.com/datasets/thedevastator/gamestop-inc-stock-prices-and-social-media-senti
    zip (186464492 bytes)
    Dataset updated
    Nov 28, 2022
    Authors
    The Devastator
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Reddit's /r/Gamestop

    Merge this dataset with GameStop price data to study how the chatter impacted the stock's price.

    By SocialGrep [source]

    About this dataset

    The stonks movement spawned by this is a very interesting one. It's rare to see an Internet meme have such an effect on the real-world economy, yet here we are.

    This dataset contains a collection of Reddit posts and comments mentioning GME in their title and body text, respectively. The data was procured using SocialGrep. The posts and comments are labelled with their score.

    With this new dataset, it will be interesting to see how this activity affected stock prices in the aftermath.

    How to use the dataset

    The files contain posts and comments from Reddit mentioning GME, along with their scores. These can be used to analyze how sentiment on GME affected its stock price in the aftermath.

    Research Ideas

    • To study how social media affects stock prices
    • To study how Reddit affects stock prices
    • To study how the sentiment of a subreddit affects stock prices

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. No copyright: you can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.

    Columns

    File: six-months-of-gme-on-reddit-comments.csv

    | Column name    | Description                                            |
    |:---------------|:-------------------------------------------------------|
    | type           | The type of post or comment. (String)                  |
    | subreddit.name | The name of the subreddit. (String)                    |
    | subreddit.nsfw | Whether the subreddit is NSFW. (Boolean)               |
    | created_utc    | The time the post or comment was created. (Timestamp)  |
    | permalink      | The permalink of the post or comment. (String)         |
    | body           | The body of the post or comment. (String)              |
    | sentiment      | The sentiment of the post or comment. (String)         |
    | score          | The score of the post or comment. (Integer)            |

    File: six-months-of-gme-on-reddit-posts.csv

    | Column name    | Description                                            |
    |:---------------|:-------------------------------------------------------|
    | type           | The type of post or comment. (String)                  |
    | subreddit.name | The name of the subreddit. (String)                    |
    | subreddit.nsfw | Whether the subreddit is NSFW. (Boolean)               |
    | created_utc    | The time the post or comment was created. (Timestamp)  |
    | permalink      | The permalink of the post or comment. (String)         |
    | score          | The score of the post or comment. (Integer)            |
    | domain         | The domain of the post or comment. (String)            |
    | url            | The URL of the post or comment. (String)               |
    | selftext       | The selftext of the post or comment. (String)          |
    | title          | The title of the post or comment. (String)             |
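
    As a concrete starting point for the merge suggested above, the sketch below aggregates daily comment sentiment and joins it to a GameStop price series. It assumes created_utc is a Unix epoch and that sentiment takes values such as "positive"; gme_prices.csv (columns date, close) is a hypothetical external price file you would supply.

    ```r
    # Minimal sketch; gme_prices.csv is a hypothetical external price file,
    # and the created_utc / sentiment encodings are assumptions noted above.
    library(dplyr)
    library(readr)

    comments <- read_csv("six-months-of-gme-on-reddit-comments.csv")
    prices   <- read_csv("gme_prices.csv")   # columns assumed: date, close

    daily <- comments %>%
      mutate(date = as.Date(as.POSIXct(created_utc, origin = "1970-01-01", tz = "UTC"))) %>%
      group_by(date) %>%
      summarise(
        n_comments     = n(),
        mean_score     = mean(score, na.rm = TRUE),
        share_positive = mean(sentiment == "positive", na.rm = TRUE)
      )

    merged <- inner_join(daily, prices, by = "date")
    head(merged)
    ```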

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. If you use this dataset in your research, please credit SocialGrep.

  3. Additional file 4 of mtDNAcombine: tools to combine sequences from multiple...

    • springernature.figshare.com
    txt
    Updated Jun 3, 2023
    Cite
    Eleanor F. Miller; Andrea Manica (2023). Additional file 4 of mtDNAcombine: tools to combine sequences from multiple studies [Dataset]. http://doi.org/10.6084/m9.figshare.14189969.v1
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Eleanor F. Miller; Andrea Manica
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 4. Code to create the plots in this paper, presented as an R Markdown file.

  4. KORUS-AQ Aircraft Merge Data Files - Dataset - NASA Open Data Portal

    • data.nasa.gov
    Updated Apr 1, 2025
    + more versions
    Cite
    nasa.gov (2025). KORUS-AQ Aircraft Merge Data Files - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/korus-aq-aircraft-merge-data-files-9bba5
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    KORUSAQ_Merge_Data are pre-generated merge data files combining various products collected during the KORUS-AQ field campaign. This collection features pre-generated merge files for the DC-8 aircraft. Data collection for this product is complete.

    The KORUS-AQ field study was conducted in South Korea during May-June, 2016. The study was jointly sponsored by NASA and Korea’s National Institute of Environmental Research (NIER). The primary objectives were to investigate the factors controlling air quality in Korea (e.g., local emissions, chemical processes, and transboundary transport) and to assess future air quality observing strategies incorporating geostationary satellite observations. To achieve these science objectives, KORUS-AQ adopted a highly coordinated sampling strategy involving surface and airborne measurements, including both in-situ and remote sensing instruments.

    Surface observations provided details on ground-level air quality conditions, while airborne sampling provided an assessment of conditions aloft relevant to satellite observations and necessary to understand the role of emissions, chemistry, and dynamics in determining air quality outcomes. The sampling region covers the South Korean peninsula and surrounding waters, with a primary focus on the Seoul Metropolitan Area. Airborne sampling was primarily conducted from near surface to about 8 km, with extensive profiling to characterize the vertical distribution of pollutants and their precursors. The airborne observational data were collected from three aircraft platforms: the NASA DC-8, NASA B-200, and Hanseo King Air. Surface measurements were conducted from 16 ground sites and 2 ships: R/V Onnuri and R/V Jang Mok.

    The major data products collected from both the ground and air include in-situ measurements of trace gases (e.g., ozone, reactive nitrogen species, carbon monoxide and dioxide, methane, non-methane and oxygenated hydrocarbon species), aerosols (e.g., microphysical and optical properties and chemical composition), active remote sensing of ozone and aerosols, and passive remote sensing of NO2, CH2O, and O3 column densities. These data products support research focused on examining the impact of photochemistry and transport on ozone and aerosols, evaluating emissions inventories, and assessing the potential use of satellite observations in air quality studies.

  5. Harmonized global datasets of soil carbon and heterotrophic respiration from...

    • zenodo.org
    bin, nc, txt
    Updated Oct 7, 2025
    Cite
    Shoji Hashimoto; Akihiko Ito; Kazuya Nishina (2025). Harmonized global datasets of soil carbon and heterotrophic respiration from data-driven estimates, with derived turnover time and Q10 [Dataset]. http://doi.org/10.5281/zenodo.17282577
    Dataset updated
    Oct 7, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Shoji Hashimoto; Akihiko Ito; Kazuya Nishina
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We collected all available global soil carbon (C) and heterotrophic respiration (RH) maps derived from data-driven estimates, sourcing them from public repositories and supplementary materials of previous studies (Table 1). All spatial datasets were converted to NetCDF format for consistency and ease of use.

    Because the maps had varying spatial resolutions (ranging from 0.0083° to 0.5°), we harmonized all datasets to a common resolution of 0.5° (approximately 50 km at the equator). We then merged the processed maps by computing the mean, maximum, and minimum values at each grid cell, resulting in harmonized global maps of soil C (for the top 0–30 cm and 0–100 cm depths) and RH at 0.5° resolution.

    Grid cells with fewer than three soil C estimates or fewer than four RH estimates were assigned NA values. Land and water grid cells were automatically distinguished by combining multiple datasets containing soil C and RH information over land.

    Soil carbon turnover time (years), denoted as τ, was calculated under the assumption of a quasi-equilibrium state using the formula:

    τ = CS / RH

    where CS is soil carbon stock and RH is the heterotrophic respiration rate. The uncertainty range of τ was estimated for each grid cell using:

    τmax = CS⁺ / RH⁻  and  τmin = CS⁻ / RH⁺

    where CS⁺ and CS⁻ are the maximum and minimum soil C values, and RH⁺ and RH⁻ are the maximum and minimum RH values, respectively.

    To calculate the temperature sensitivity of decomposition (Q10)—the factor by which decomposition rates increase with a 10 °C rise in temperature—we followed the method described in Koven et al. (2017). The uncertainty of Q10 (maximum and minimum values) was derived using τmax and τmin, respectively.

    All files are provided in NetCDF format. The SOC file includes the following variables:
    · longitude, latitude
    · soc: mean soil C stock (kg C m⁻²)
    · soc_median: median soil C (kg C m⁻²)
    · soc_n: number of estimates per grid cell
    · soc_max, soc_min: maximum and minimum soil C (kg C m⁻²)
    · soc_max_id, soc_min_id: study IDs corresponding to the maximum and minimum values
    · soc_range: range of soil C values
    · soc_sd: standard deviation of soil C (kg C m⁻²)
    · soc_cv: coefficient of variation (%)
    The RH file includes:
    · longitude, latitude
    · rh: mean RH (g C m⁻² yr⁻¹)
    · rh_median, rh_n, rh_max, rh_min: as above
    · rh_max_id, rh_min_id: study IDs for max/min
    · rh_range, rh_sd, rh_cv: analogous variables for RH
    The mean, maximum, and minimum values of soil C turnover time are provided as separate files. The Q10 files contain estimates derived from the mean values of soil C and RH, along with associated uncertainty values.

    The harmonized dataset files available in the repository are as follows:

    · harmonized-RH-hdg.nc: global soil heterotrophic respiration map

    · harmonized-SOC100-hdg.nc: global soil C map for 0–100 cm

    · harmonized-SOC30-hdg.nc: global soil C map for 0–30 cm

    · Q10.nc: global Q10 map

    · Turnover-time_max.nc: global soil C turnover time estimated using maximum soil C and minimum RH

    · Turnover-time_min.nc: global soil C turnover time estimated using minimum soil C and maximum RH

    · Turnover-time_mean.nc: global soil C turnover time estimated using mean soil C and RH

    · Turnover-time30_mean.nc: global soil C turnover time estimated using the soil C map for 0-30 cm
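
    As a quick sanity check on the derived products, the sketch below recomputes the mean turnover time τ = CS / RH from the harmonized files listed above, using the ncdf4 package and the soc/rh variable names documented earlier; the gram-to-kilogram conversion is the only step added here.

    ```r
    # Minimal sketch, assuming the ncdf4 package and the files/variables named above.
    library(ncdf4)

    nc_soc <- nc_open("harmonized-SOC100-hdg.nc")
    nc_rh  <- nc_open("harmonized-RH-hdg.nc")

    soc <- ncvar_get(nc_soc, "soc")  # mean soil C stock, kg C m^-2
    rh  <- ncvar_get(nc_rh,  "rh")   # mean RH, g C m^-2 yr^-1

    # Convert RH to kg C m^-2 yr^-1 so that tau = CS / RH comes out in years.
    tau <- soc / (rh / 1000)
    summary(as.vector(tau))

    nc_close(nc_soc)
    nc_close(nc_rh)
    ```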

    Version history
    Version 1.1: Median values were added. Bug fix for SOC30 (the n > 2 filter was inactive in the previous version).


    More details are provided in: Hashimoto, S., Ito, A. & Nishina, K. (in revision) Harmonized global soil carbon and respiration datasets with derived turnover time and temperature sensitivity. Scientific Data.

    Reference

    Koven, C. D., Hugelius, G., Lawrence, D. M. & Wieder, W. R. Higher climatological temperature sensitivity of soil carbon in cold than warm climates. Nat. Clim. Change 7, 817–822 (2017).

    Table 1: List of soil carbon and heterotrophic respiration datasets used in this study.

    | Dataset            | Repository/References (Dataset name)                | Depth (cm)   | ID in NetCDF file*** |
    |:-------------------|:-----------------------------------------------------|:-------------|:---------------------|
    | Global soil C      | Global soil data task 2000 (IGBP-DIS)¹                | 0–100        | 3, -                 |
    |                    | Shangguan et al. 2014 (GSDE)²,³                       | 0–100, 0–30* | 1, 1                 |
    |                    | Batjes 2016 (WISE30sec)⁴,⁵                            | 0–100, 0–30  | 6, 7                 |
    |                    | Sanderman et al. 2017 (Soil-Carbon-Debt)⁶,⁷           | 0–100, 0–30  | 5, 5                 |
    |                    | Soilgrids team and Hengl et al. 2017 (SoilGrids)⁸,⁹   | 0–30**       | -, 6                 |
    |                    | Hengl and Wheeler 2018 (LandGIS)¹⁰                    | 0–100, 0–30  | 4, 4                 |
    |                    | FAO 2022 (GSOC)¹¹                                     | 0–30         | -, 2                 |
    |                    | FAO 2023 (HWSD2)¹²                                    | 0–100, 0–30  | 2, 3                 |
    | Circumpolar soil C | Hugelius et al. 2013 (NCSCD)¹³⁻¹⁵                     | 0–100, 0–30  | 7, 8                 |
    | Global RH          | Hashimoto et al. 2015¹⁶,¹⁷                            | -            | 1                    |
    |                    | Warner et al. 2019 (Bond-Lamberty equation based)¹⁸,¹⁹ | -           | 2                    |
    |                    | Warner et al. 2019 (Subke equation based)¹⁸,¹⁹        | -            | 3                    |
    |                    | Tang et al. 2020²⁰,²¹                                 | -            | 4                    |
    |                    | Lu et al. 2021²²,²³                                   | -            | 5                    |
    |                    | Stell et al. 2021²⁴,²⁵                                | -            |                      |

  6. Dataset: Environmental conditions and male quality traits simultaneously...

    • data.europa.eu
    • data.niaid.nih.gov
    unknown
    Updated Jun 21, 2022
    Cite
    Zenodo (2022). Dataset: Environmental conditions and male quality traits simultaneously explain variation of multiple colour signals in male lizards [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-6683661?locale=de
    unknown (3063441 bytes)
    Dataset updated
    Jun 21, 2022
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset and R code associated with the following publication: Badiane et al. (2022), Environmental conditions and male quality traits simultaneously explain variation of multiple colour signals in male lizards. Journal of Animal Ecology, in press.

    This dataset includes the following files:
    - An Excel file containing the reflectance spectra of all individuals from all the study populations
    - An Excel file containing the variables collected at the individual and population levels
    - Two R scripts corresponding to the analyses performed in the publication

  7. Additional file 2 of mtDNAcombine: tools to combine sequences from multiple...

    • springernature.figshare.com
    zip
    Updated Jun 11, 2023
    Cite
    Eleanor F. Miller; Andrea Manica (2023). Additional file 2 of mtDNAcombine: tools to combine sequences from multiple studies [Dataset]. http://doi.org/10.6084/m9.figshare.14189960.v1
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Eleanor F. Miller; Andrea Manica
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 2. Input files needed to recreate the plots in this paper: Tracer output files for three species.

  8. Additional file 3 of mtDNAcombine: tools to combine sequences from multiple...

    • springernature.figshare.com
    txt
    Updated Jun 1, 2023
    Cite
    Eleanor F. Miller; Andrea Manica (2023). Additional file 3 of mtDNAcombine: tools to combine sequences from multiple studies [Dataset]. http://doi.org/10.6084/m9.figshare.14189963.v1
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Eleanor F. Miller; Andrea Manica
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 3. Input files needed to recreate the plots in this paper: raw sequence data for alignment.

  9. Growth and Yield Data for the Bushland, Texas, Soybean Datasets

    • agdatacommons.nal.usda.gov
    • catalog.data.gov
    xlsx
    Updated Nov 21, 2025
    Cite
    Steven R. Evett; Gary W. Marek; Karen S. Copeland; Terry A. Howell Sr.; Paul D. Colaizzi; David K. Brauer; Brice B. Ruthardt (2025). Growth and Yield Data for the Bushland, Texas, Soybean Datasets [Dataset]. http://doi.org/10.15482/USDA.ADC/1528670
    Dataset updated
    Nov 21, 2025
    Dataset provided by
    Ag Data Commons
    Authors
    Steven R. Evett; Gary W. Marek; Karen S. Copeland; Terry A. Howell Sr.; Paul D. Colaizzi; David K. Brauer; Brice B. Ruthardt
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Bushland, Texas
    Description

    This dataset consists of growth and yield data for each season when soybean [Glycine max (L.) Merr.] was grown for seed at the USDA-ARS Conservation and Production Laboratory (CPRL), Soil and Water Management Research Unit (SWMRU) research weather station, Bushland, Texas (Lat. 35.186714°, Long. -102.094189°, elevation 1170 m above MSL). In the 1995, 2003, 2004, and 2010 seasons, soybean was grown on two large, precision weighing lysimeters, each in the center of a 4.44 ha square field. In 2019, soybean was grown on four large, precision weighing lysimeters and their surrounding 4.4 ha fields. The square fields are themselves arranged in a larger square with four fields in four adjacent quadrants of the larger square. Fields and lysimeters within each field are thus designated northeast (NE), southeast (SE), northwest (NW), and southwest (SW). Soybean was grown on different combinations of fields in different years.

    Irrigation was by linear move sprinkler system in 1995, 2003, 2004, and 2010, although in 2010 only one irrigation was applied to establish the crop, after which it was grown as a dryland crop. Irrigation protocols described as full were managed to replenish soil water used by the crop on a weekly or more frequent basis, as determined by soil profile water content readings made with a neutron probe to 2.4-m depth in the field. Irrigation protocols described as deficit typically involved irrigations to establish the crop early in the season, followed by reduced or absent irrigations later in the season (typically in the later winter and spring).

    The growth and yield data include plant population density, height, plant row width, leaf area index, growth stage, total above-ground biomass, leaf and stem biomass, head mass (when present), kernel or seed number, and final yield. Data are from replicate samples in the field and non-destructive (except for final harvest) measurements on the weighing lysimeters. In most cases yield data are available both from manual sampling on replicate plots in each field and from machine harvest. Machine harvest yields are commonly smaller than hand harvest yields due to combine losses.

    These datasets originate from research aimed at determining crop water use (ET), crop coefficients for use in ET-based irrigation scheduling based on a reference ET, crop growth, yield, harvest index, and crop water productivity as affected by irrigation method, timing, amount (full or some degree of deficit), agronomic practices, cultivar, and weather. Prior publications have focused on soybean ET, crop coefficients, and crop water productivity. Crop coefficients have been used by ET networks. The data have utility for testing simulation models of crop ET, growth, and yield and have been used for testing and calibrating models of ET that use satellite and/or weather data. See the README for descriptions of each data file.

    Resources in this dataset:
    - 1995 Bushland, TX, west soybean growth and yield data. File: 1995 West Soybean_Growth_and_Yield-V2.xlsx
    - 2003 Bushland, TX, east soybean growth and yield data. File: 2003 East Soybean_Growth_and_Yield-V2.xlsx
    - 2004 Bushland, TX, east soybean growth and yield data. File: 2004 East Soybean_Growth-and_Yield-V2.xlsx
    - 2010 Bushland, TX, west soybean growth and yield data. File: 2010 West_Soybean_Growth_and_Yield-V2.xlsx
    - 2019 Bushland, TX, east soybean growth and yield data. File: 2019 East Soybean_Growth_and_Yield-V2.xlsx
    - 2019 Bushland, TX, west soybean growth and yield data. File: 2019 West Soybean_Growth_and_Yield-V2.xlsx
    - README. File: README_Soybean_Growth_and_Yield.txt
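
    Since each resource is an Excel workbook, one season's data can be loaded with the readxl package as sketched below; the file name comes from the resource list above, and the sheet layout should be checked against the README.

    ```r
    # Minimal sketch, assuming the readxl package and a downloaded workbook.
    library(readxl)

    f <- "2019 East Soybean_Growth_and_Yield-V2.xlsx"
    excel_sheets(f)                    # list the sheets before committing to one
    growth <- read_excel(f, sheet = 1) # the sheet layout is described in the README
    str(growth)
    ```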

  10. MGUS: Surveillance & Disease Dynamics

    • kaggle.com
    zip
    Updated Jun 1, 2023
    Cite
    Utkarsh Singh (2023). MGUS: Surveillance & Disease Dynamics [Dataset]. https://www.kaggle.com/datasets/utkarshx27/monoclonal-gammopathy-data
    zip (25967 bytes)
    Dataset updated
    Jun 1, 2023
    Authors
    Utkarsh Singh
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Monoclonal Gammopathy of Undetermined Significance (MGUS)

    This dataset provides invaluable insights into the natural history of Monoclonal Gammopathy of Undetermined Significance (MGUS), a crucial precursor condition to various plasma cell disorders, including multiple myeloma. Originating from a landmark longitudinal study at the Mayo Clinic, this meticulously curated dataset has served as a cornerstone in hematologic research for decades.

    Dataset Origin and Principal Investigator

    The data stems from a foundational clinical cohort established at the Mayo Clinic in Rochester, Minnesota, USA. The principal investigator, Dr. Robert A. Kyle of the Mayo Clinic, initiated this study, which involved sequential patients diagnosed with MGUS, followed longitudinally. The initial groundbreaking research was published in the New England Journal of Medicine (NEJM) in 1978, with extended follow-up studies published in 2002 and subsequent updates, ensuring a robust and long-term perspective on the condition. The diligence of the principal investigator ensured no subjects from the initial cohort were lost to follow-up.

    Understanding Monoclonal Gammopathy of Undetermined Significance (MGUS)

    Plasma cells are vital for immune defense, producing immunoglobulins. In certain conditions, a single plasma cell clone can proliferate, leading to an abnormal monoclonal protein (M-protein) visible as a "spike" in serum protein electrophoresis. MGUS is defined by the presence of such an M-protein without evidence of overt malignancy, distinguishing it from more serious conditions like multiple myeloma. It is a premalignant plasma cell disorder, and while generally asymptomatic, it carries a lifelong risk of progression to more severe conditions.

    Dataset Content

    The dataset typically comprises two main files, mgus1.csv and mgus2.csv, offering different formats and an extended cohort. The data sets were updated in January 2015 to correct some small errors. For patient confidentiality, the dataset in the survival R package has been slightly perturbed, but the statistical results remain essentially unchanged.

    Clinical Significance and Research Applications

    This dataset is foundational in hematology due to its long-term follow-up and the comprehensive clinical and laboratory data collected. It is frequently cited in medical literature and textbooks, such as "Modeling Survival Data: Extending the Cox Model" by Therneau & Grambsch. Researchers utilize this dataset to:

    • Study the natural history and progression of MGUS.
    • Identify risk factors for progression to multiple myeloma and other plasma cell malignancies.
    • Develop and validate prognostic models.
    • Explore associations between MGUS and other diseases.
    • Demonstrate and practice survival analysis techniques.

    The dataset highlights that approximately 1% of individuals with MGUS progress to a more serious blood cancer or related disorder annually. Key risk factors for progression include higher concentrations of monoclonal immunoglobulin (especially >1.5 g/dL), monoclonal immunoglobulins other than IgG, and an abnormal serum free light-chain ratio.

    Provenance and Recommended Use

    The data was directly abstracted from clinical and laboratory records at the Mayo Clinic under strict research protocols and verified through continuous follow-up. It is widely distributed and included as a standard dataset in the survival official R package by Terry Therneau (Mayo Clinic).

    References and Further Reading:

    Original Study (1978): Kyle RA, et al. Prevalence of Monoclonal Gammopathy of Undetermined Significance. NEJM 1978;299:1213-20. https://www.nejm.org/doi/full/10.1056/NEJM197812072992304

    Major Follow-up (2002): Kyle RA, et al. Long-Term Follow-up of MGUS. NEJM 2002;346:564-569. https://www.nejm.org/doi/full/10.1056/NEJMoa011332

    R Survival Package Documentation: For detailed information on the dataset as included in R. https://stat.ethz.ch/R-manual/R-devel/library/survival/help/mgus.html https://search.r-project.org/CRAN/refmans/eventglm/html/mgus2.html

    Table Content:

    | Column (mgus1) | Description |
    |:---------------|:------------|
    | id      | Subject ID |
    | age     | Age in years at the detection of MGUS |
    | sex     | Gender (male or female) |
    | dxyr    | Year of diagnosis |
    | pcdx    | For subjects who progress to a plasma cell malignancy, the subtype of malignancy: multiple myeloma (MM) is the most common, followed by amyloidosis (AM), macroglobulinemia (MA), and other lymphoproliferative disorders (LP) |
    | subtype | The subtype of malignancy (MM = multiple myeloma, AM = amyloidosis, ...) |
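
    Because the (perturbed) data ship with the survival R package, a standard first analysis runs directly against the bundled copies; the sketch below uses the extended mgus2 cohort, where follow-up time (futime) is recorded in months.

    ```r
    # Minimal sketch using the mgus2 data bundled with the survival package.
    library(survival)

    # Overall survival from MGUS detection, stratified by sex.
    fit <- survfit(Surv(futime, death) ~ sex, data = mgus2)
    summary(fit, times = c(60, 120))  # survival at 5 and 10 years (in months)
    plot(fit, xlab = "Months from MGUS detection", ylab = "Survival probability")
    ```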
  11. Data from: RAW data from Towards Holistic Environmental Policy Assessment:...

    • research.science.eus
    • data.europa.eu
    Updated 2024
    Cite
    Borges, Cruz E.; Ferrón, Leandro; Soimu, Oxana; Mugarra, Aitziber (2024). RAW data from Towards Holistic Environmental Policy Assessment: Multi-Criteria Frameworks and recommendations for modelers paper [Dataset]. https://research.science.eus/documentos/685699066364e456d3a65172
    Dataset updated
    2024
    Authors
    Borges, Cruz E.; Ferrón, Leandro; Soimu, Oxana; Mugarra, Aitziber
    Description

    Name: Data used to rate the relevance of each dimension necessary for a Holistic Environmental Policy Assessment.

    Summary: This dataset contains answers from a panel of experts and the public to rate the relevance of each dimension on a scale of 0 (Not relevant at all) to 100 (Extremely relevant).

    License: CC-BY-SA

    Acknowledge: These data have been collected in the framework of the DECIPHER project. This project has received funding from the European Union’s Horizon Europe programme under grant agreement No. 101056898.

    Disclaimer: Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.

    Collection Date: 2024-01 / 2024-04

    Publication Date: 22/04/2025

    DOI: 10.5281/zenodo.13909413

    Other repositories: -

    Author: University of Deusto

    Objective of collection: These data were originally collected to prioritise the dimensions to be further used for Environmental Policy Assessment and the enlarged scope of IAMs.

    Description:

    Data Files (CSV)

    decipher-public.csv : Public participants' general survey results in the framework of the Decipher project, including socio demographic characteristics and overall perception of each dimension necessary for a Holistic Environmental Policy Assessment.

    decipher-risk.csv : Contains individual survey responses regarding prioritisation of dimensions in risk situations. Includes demographic and opinion data from a targeted sample.

    decipher-experts.csv : Experts’ opinions collected on risk topics through surveys in the framework of Decipher Project, targeting professionals in relevant fields.

    decipher-modelers.csv: Answers given by the developers of models about the characteristics of the models and dimensions covered by them.

    prolific_export_risk.csv : Exported survey data from Prolific, focusing specifically on ratings in risk situations. Includes response times, demographic details, and survey metadata.

    prolific_export_public_{1,2}.csv : Public survey exports from Prolific, gathering prioritisation of dimensions necessary for environmental policy assessment.

    curated.csv : Final cleaned and harmonized dataset combining multiple survey sources. Designed for direct statistical analysis with standardized variable names.

    Scripts files (R)

    decipher-modelers.R: Script to assess the answers given by modelers about the characteristics of the models.

    joint.R: Script to clean and join the RAW answers from the different surveys to retrieve the overall perception of each dimension necessary for a Holistic Environmental Policy Assessment.

    Report Files

    decipher-modelers.pdf: Diagram with the result of the

    full-Country.html : Full interactive report showing dimension prioritisation broken down by participant country.

    full-Gender.html : Visualization report displaying differences in dimension prioritisation by gender.

    full-Education.html : Detailed breakdown of dimension prioritisation results based on education level.

    full-Work.html : Report focusing on participant occupational categories and associated dimension prioritisation.

    full-Income.html : Analysis report showing how income level correlates with dimension prioritisation.

    full-PS.html : Report analyzing Political Sensitivity scores across all participants.

    full-type.html : Visualization report comparing participant dimensions prioritisation (public vs experts) in normal and risk situations.

    full-joint-Country.html : Joint analysis report integrating multiple dimensions of country-based dimension prioritisation in normal and risk situations. Combines demographic and response patterns.

    full-joint-Gender.html : Combined gender-based analysis across datasets, exploring intersections of demographic factors and dimensions prioritisation in normal and risk situations.

    full-joint-Education.html : Education-focused report merging various datasets to show consistent or divergent patterns of dimensions prioritisation in normal and risk awareness.

    full-joint-Work.html : Cross-dataset analysis of occupational groups and their dimensions prioritisation in normal and risk situation

    full-joint-Income.html : Income-stratified joint analysis, merging public and expert datasets to find common trends and significant differences during dimensions prioritisation in normal and risks situations.

    full-joint-PS.html : Comprehensive Political Sensitivity score report from merged datasets, highlighting general patterns and subgroup variations in normal and risk situations.

    5 star: ⭐⭐⭐

    Preprocessing steps: The data has been re-coded and cleaned using the scripts provided.

    Reuse: NA

    Update policy: No more updates are planned.

    Ethics and legal aspects: Names of the persons involved have been removed.

    Technical aspects:

    Other:

  12. Data from: Optimized SMRT-UMI protocol produces highly accurate sequence...

    • data.niaid.nih.gov
    • zenodo.org
    • +1 more
    zip
    Updated Dec 7, 2023
    + more versions
    Cite
    Dylan Westfall; James Mullins (2023). Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies [Dataset]. http://doi.org/10.5061/dryad.w3r2280w0
    Dataset updated
    Dec 7, 2023
    Dataset provided by
    HIV Vaccine Trials Network (http://www.hvtn.org/)
    HIV Prevention Trials Network (http://www.hptn.org/)
    National Institute of Allergy and Infectious Diseases (http://www.niaid.nih.gov/)
    PEPFAR
    Authors
    Dylan Westfall; James Mullins
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing, which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR, and the use of UMI allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies.

    Methods

    This serves as an overview of the analysis performed on PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al., "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies".

    Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005.

    For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub.

    The demultiplexed read collections from the chunked_demux pipeline, or CCS read files from datasets which were not indexed (M1567, M004, M005), were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer) and further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub.

    The consensus.tar.gz and tagged.tar.gz files were moved from the sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each has an .Rmd file with the same name inside, which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results.

    Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, two fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the python script compare_seqs.py, which identifies any matched sequences that are different between sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program.

    To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed, and for each family with discordant sUMI and dUMI sequences the reads from the UMI1_keeping directory were aligned using Geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Geneious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment, and specifically the UMI2 sequences, the site of the discordance and its cause were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper.

    Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined, and used as input to create a single dataset containing all UMI information from all samples. This file, dUMI_df.csv, was used as input for Figures.Rmd.

    Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for each figure. These were copied into Prism software to create the final figures for the paper.
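
    To make the first step of Sequence_Analysis.Rmd concrete, the sketch below decompresses the per-dataset consensus archives and concatenates the sUMI fasta files into a single file; the fasta file-name pattern inside the archives is an assumption for illustration.

    ```r
    # Minimal sketch; the "sUMI" fasta naming pattern inside the archives is assumed.
    datasets <- c("M027", "M2199", "M1567", "M004", "M005")

    # Unpack each consensus archive into its own folder.
    for (d in datasets) untar(sprintf("consensus_%s.tar.gz", d), exdir = d)

    # Collect all sUMI fasta files and concatenate them into one combined fasta.
    sumi_files <- list.files(datasets, pattern = "sUMI.*\\.fasta$",
                             recursive = TRUE, full.names = TRUE)
    writeLines(unlist(lapply(sumi_files, readLines)), "all_sUMI.fasta")
    ```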

  13. phenotools: an R package for visualizing and analyzing phenomic datasets

    • search.dataone.org
    • datadryad.org
    Updated Jun 17, 2025
    Cite
    Chad M. Eliason; Scott V. Edwards; Julia A. Clarke (2025). phenotools: an R package for visualizing and analyzing phenomic datasets [Dataset]. http://doi.org/10.5061/dryad.05qm36k
    Dataset updated
    Jun 17, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Chad M. Eliason; Scott V. Edwards; Julia A. Clarke
    Time period covered
    Jan 1, 2019
    Description

    1. Phenotypic data is crucial for understanding genotype–phenotype relationships, assessing the tree of life, and revealing trends in trait diversity over time. Large-scale description of whole organisms for quantitative analyses (phenomics) presents several challenges, and technological advances in the collection of genomic data outpace those for phenomic data. Reasons for this disparity include the time-consuming and expensive nature of collecting discrete phenotypic data and mining previously published data on a given species (both often requiring anatomical expertise across taxa), and computational challenges involved with analyzing high-dimensional datasets.

    2. One approach to building approximations of organismal phenomes is to combine published datasets of discrete characters assembled for phylogenetic analyses into a phenomic dataset. Despite a wealth of legacy datasets in the literature for many groups, relatively few methods exist for automating the assembly, analysis, and vi...

  14. NASA DC-8 1 Minute Data Merge

    • data.ucar.edu
    ascii
    Updated Oct 7, 2025
    + more versions
    Cite
    Gao Chen; Jennifer R. Olson; Michael Shook (2025). NASA DC-8 1 Minute Data Merge [Dataset]. http://doi.org/10.26023/VM9C-1C16-H003
    Dataset updated
    Oct 7, 2025
    Authors
    Gao Chen; Jennifer R. Olson; Michael Shook
    Time period covered
    May 1, 2012 - Jun 30, 2012
    Description

    This dataset contains NASA DC-8 1 Minute Data Merge data collected during the Deep Convective Clouds and Chemistry Experiment (DC3) from 18 May 2012 through 22 June 2012. This dataset contains updated data provided by NASA. In most cases, variable names have been kept identical to those submitted in the raw data files. However, in some cases, names have been changed (e.g., to eliminate duplication). Units have been standardized throughout the merge. In addition, a "grand merge" has been provided, which includes data from all the individual merged flights throughout the mission. The grand merge follows the naming convention "dc3-mrg60-dc8_merge_YYYYMMdd_R5_thruYYYYMMdd.ict" (with the suffix "_thruYYYYMMdd" indicating the last flight date included). This dataset is in ICARTT format. Please see the header portion of the data files for details on instruments, parameters, quality assurance, quality control, contact information, and dataset comments. For more information on updates to this dataset, please see the readme file.
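
    Because the merges are ICARTT (.ict) files, whose first line gives the header length and whose last header line holds the variable names, they can be read with base R as sketched below; the concrete file name is illustrative, following the grand-merge convention above.

    ```r
    # Minimal sketch for reading an ICARTT (.ict) file; the file name is illustrative.
    f <- "dc3-mrg60-dc8_merge_20120518_R5_thru20120622.ict"

    # Line 1 of an ICARTT file is "<number of header lines>, <format index>".
    n_header <- as.integer(strsplit(readLines(f, n = 1), ",")[[1]][1])

    # The variable-name row is the last header line, so skip everything before it.
    merged <- read.csv(f, skip = n_header - 1, header = TRUE)
    str(merged)
    ```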

  15. DLR Falcon 1 Minute Data Merge

    • data.ucar.edu
    • ckanprod.data-commons.k8s.ucar.edu
    ascii
    Updated Oct 7, 2025
    + more versions
    Cite
    Gao Chen; Jennifer R. Olson; Michael Shook (2025). DLR Falcon 1 Minute Data Merge [Dataset]. http://doi.org/10.26023/SZ09-F2G3-7X0V
    Dataset updated
    Oct 7, 2025
    Authors
    Gao Chen; Jennifer R. Olson; Michael Shook
    Time period covered
    May 29, 2012 - Jun 14, 2012
    Description

    This data set contains DLR Falcon 1 Minute Data Merge data collected during the Deep Convective Clouds and Chemistry Experiment (DC3) from 29 May 2012 through 14 June 2012. These merges were created using data in the NASA DC3 archive as of September 25, 2013. In most cases, variable names have been kept identical to those submitted in the raw data files. However, in some cases, names have been changed (e.g., to eliminate duplication). Units have been standardized throughout the merge. In addition, a "grand merge" has been provided, which includes data from all the individual merged flights throughout the mission. The grand merge follows the naming convention "dc3-mrg06-falcon_merge_YYYYMMdd_R2_thruYYYYMMdd.ict" (with the suffix "_thruYYYYMMdd" indicating the last flight date included). This data set is in ICARTT format. Please see the header portion of the data files for details on instruments, parameters, quality assurance, quality control, contact information, and data set comments.

  16. Benchmark Datasets for Entity Linking from Tabular Data

    • zenodo.org
    zip
    Updated Sep 19, 2025
    Cite
    Roberto Avogadro; Roberto Avogadro (2025). Benchmark Datasets for Entity Linking from Tabular Data [Dataset]. http://doi.org/10.5281/zenodo.17160156
    Dataset updated
    Sep 19, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Roberto Avogadro; Roberto Avogadro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    📖 Benchmark Datasets for Entity Linking from Tabular Data (Version 2)


    This archive provides a benchmark suite for evaluating entity linking algorithms on structured tabular data.
    It is organised into two parts:
    • Challenge datasets (HTR1, HTR2): From the SemTab Table-to-KG Challenge, widely used in academic evaluations of table-to-KG alignment systems. Each is a dataset (a collection of many tables) provided with ground truth and candidate mappings.
    👉 Please also cite the SemTab Challenge when using these resources.
    • Real-world tables (Company, Movie, SN):
    • Company — one table constructed via SPARQL queries on Wikidata, with both Wikidata and Crunchbase ground truths.
    • Movie — one table constructed via SPARQL queries on Wikidata.
    • SN (Spend Network) — one procurement table from the enRichMyData (EMD) project, manually annotated and including NIL cases for mentions with no known Wikidata match.


    A shared top-level folder (mention_to_qid/) provides JSON files mapping surface mentions to candidate QIDs for these real-world tables.



    📂 Contents


    Each dataset or table includes:
    • One or more input CSV tables
    • Ground truth files mapping mentions/cells to Wikidata QIDs (or NIL)
    • Candidate mappings (mention_to_qid/*.json), sometimes multiple variants
    • Optional files such as column_classifications.json or cell_to_qid.json
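
    For loading the candidate files programmatically, a minimal R sketch with jsonlite follows; the exact file name under mention_to_qid/ and the {mention: [QID, ...]} shape are assumptions based on the folder description above.

    ```r
    # Minimal sketch; file name and JSON shape are assumptions noted above.
    library(jsonlite)

    candidates <- fromJSON("mention_to_qid/company.json")  # hypothetical file name

    # Expected shape: a named list mapping each surface mention to a character
    # vector of candidate Wikidata QIDs, e.g. candidates[["Acme Corp"]].
    str(head(candidates, 3))
    ```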



    📝 Licensing
    • HTR1 & HTR2: CC BY 4.0
    • Company & Movie: Derived from Wikidata (public domain; CC0 1.0)
    • SN: CC BY 4.0 (from the enRichMyData project)



    📌 Citation


    If you use these datasets, please cite:
    • This Zenodo record (Version 2):
    Avogadro, R., & Rauniyar, A. (2025). Benchmark Datasets for Entity Linking from Tabular Data (Version 2). Zenodo. https://doi.org/10.5281/zenodo.15888942
    • The SemTab Challenge (for HTR1/HTR2):
    SemTab: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (Table-to-KG). (Cite the relevant SemTab overview paper for the year you reference.)
    • Wikidata: Data retrieved from Wikidata (public domain; CC0 1.0).
    • enRichMyData (for SN / Spend Network): Project resources from enRichMyData, licensed under CC BY 4.0.

  17. Growth and Yield Data for the Bushland, Texas, Sorghum Datasets

    • agdatacommons.nal.usda.gov
    • catalog.data.gov
    xlsx
    Updated Nov 21, 2025
    + more versions
    Cite
    Steven R. Evett; Gary W. Marek; Karen S. Copeland; Terry A. Howell Sr.; Paul D. Colaizzi; David K. Brauer; Brice B. Ruthardt (2025). Growth and Yield Data for the Bushland, Texas, Sorghum Datasets [Dataset]. http://doi.org/10.15482/USDA.ADC/1529411
    Dataset updated
    Nov 21, 2025
    Dataset provided by
    Ag Data Commons
    Authors
    Steven R. Evett; Gary W. Marek; Karen S. Copeland; Terry A. Howell Sr.; Paul D. Colaizzi; David K. Brauer; Brice B. Ruthardt
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Bushland, Texas
    Description

    This dataset consists of growth and yield data for each season when sorghum [Sorghum bicolor (L.)] was grown at the USDA-ARS Conservation and Production Laboratory (CPRL), Soil and Water Management Research Unit (SWMRU) research weather station, Bushland, Texas (Lat. 35.186714°, Long. -102.094189°, elevation 1170 m above MSL). In the 1988, 1991, 1993, 1997, 1998, 1999, 2003 through 2007, 2014, and 2015 seasons (13 years), sorghum was grown on from one to four large, precision weighing lysimeters, each in the center of a 4.44 ha square field also planted to sorghum. The square fields were themselves arranged in a larger square with four fields in four adjacent quadrants of the larger square. Fields and lysimeters within each field were thus designated northeast (NE), southeast (SE), northwest (NW), and southwest (SW). Sorghum was grown on different combinations of fields in different years.

    When irrigated, irrigation was by linear move sprinkler system in years before 2014, and by both sprinkler and subsurface drip irrigation in 2014 and 2015. Irrigation protocols described as full were managed to replenish soil water used by the crop on a weekly or more frequent basis, as determined by soil profile water content readings made with a neutron probe to 2.4-m depth in the field. Irrigation protocols described as deficit typically involved irrigation at rates established as percentages of full irrigation, ranging from 33% to 75% depending on the year.

    The growth and yield data include plant population density, height, plant row width, leaf area index, growth stage, total above-ground biomass, leaf and stem biomass, head mass (when present), seed mass, and final yield. Data are from replicate samples in the field and non-destructive (except for final harvest) measurements on the weighing lysimeters. In most cases yield data are available both from manual sampling on replicate plots in each field and from machine harvest. Machine harvest yields are commonly smaller than hand harvest yields due to combine losses.

    These datasets originate from research aimed at determining crop water use (ET), crop coefficients for use in ET-based irrigation scheduling based on a reference ET, crop growth, yield, harvest index, and crop water productivity as affected by irrigation method, timing, amount (full or some degree of deficit), agronomic practices, cultivar, and weather. Prior publications have focused on sorghum ET, crop coefficients, crop water productivity, and simulation modeling of crop water use, growth, and yield. Crop coefficients have been used by ET networks. The data have utility for testing simulation models of crop ET, growth, and yield and have been used for testing and calibrating models of ET that use satellite and/or weather data. See the README for descriptions of each data file.

  18. Current Population Survey (CPS)

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Damico, Anthony (2023). Current Population Survey (CPS) [Dataset]. http://doi.org/10.7910/DVN/AK4FDD
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Damico, Anthony
    Description

    analyze the current population survey (cps) annual social and economic supplement (asec) with r

    the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948. wow. the us census bureau and the bureau of labor statistics (bls) tag-team on this one. until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents. this provides enough sample to conduct state- and a few large metro area-level analyses. your sample size will vanish if you start investigating subgroups by state - consider pooling multiple years. county-level is a no-no.

    despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency. aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be treated as point-in-time statistics. cps-asec generalizes to the united states non-institutional, non-active duty military population.

    the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person). to import these files into r, the parse.SAScii function uses nber's sas code to determine how to import the fixed-width file, then RSQLite to put everything into a schnazzy database. you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show.

    this new github repository contains three scripts:

    2005-2012 asec - download all microdata.R
    - download the fixed-width file containing household, family, and person records
    - import by separating this file into three tables, then merge 'em together at the person-level
    - download the fixed-width file containing the person-level replicate weights
    - merge the rectangular person-level file with the replicate weights, then store it in a sql database
    - create a new variable - one - in the data table

    2012 asec - analysis examples.R
    - connect to the sql database created by the 'download all microdata' program
    - create the complex sample survey object, using the replicate weights
    - perform a boatload of analysis examples

    replicate census estimates - 2011.R
    - connect to the sql database created by the 'download all microdata' program
    - create the complex sample survey object, using the replicate weights
    - match the sas output shown in the png file below

    2011 asec replicate weight sas output.png: statistic and standard error generated from the replicate-weighted example sas script contained in this census-provided person replicate weights usage instructions document.

    click here to view these three scripts

    for more detail about the current population survey - annual social and economic supplement (cps-asec), visit:
    - the census bureau's current population survey page
    - the bureau of labor statistics' current population survey page
    - the current population survey's wikipedia article

    notes: interviews are conducted in march about experiences during the previous year. the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011. when you use the current population survey to talk about america, subtract a year from the data file name. as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research.

    confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D
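
    after the download script builds the person-level table with its replicate weights, the analysis scripts create the complex sample survey object. a minimal sketch with the survey package is below - the weight names (marsupwt, pwwgt1 through pwwgt160) and fay's rho = 0.5 follow the census replicate weight usage instructions mentioned above, so verify them against your file.

    ```r
    # minimal sketch; weight variable names and rho follow the census replicate
    # weight usage instructions referenced above - verify against your file.
    library(survey)

    # cps_asec: person-level data frame built by 'download all microdata.R'
    cps_design <- svrepdesign(
      weights          = ~marsupwt,     # march supplement person weight
      repweights       = "pwwgt[1-9]",  # replicate weight columns
      type             = "Fay", rho = 0.5,
      combined.weights = TRUE,
      data             = cps_asec
    )

    svymean(~htotval, cps_design, na.rm = TRUE)  # e.g. mean total household income
    ```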

  19. House Price Regression Dataset

    • kaggle.com
    zip
    Updated Sep 6, 2024
    Cite
    Prokshitha Polemoni (2024). House Price Regression Dataset [Dataset]. https://www.kaggle.com/datasets/prokshitha/home-value-insights
    zip (27045 bytes)
    Dataset updated
    Sep 6, 2024
    Authors
    Prokshitha Polemoni
    Description

    Home Value Insights: A Beginner's Regression Dataset

    This dataset is designed for beginners to practice regression problems, particularly in the context of predicting house prices. It contains 1000 rows, with each row representing a house and various attributes that influence its price. The dataset is well-suited for learning basic to intermediate-level regression modeling techniques.

    Features:

    1. Square_Footage: The size of the house in square feet. Larger homes typically have higher prices.
    2. Num_Bedrooms: The number of bedrooms in the house. More bedrooms generally increase the value of a home.
    3. Num_Bathrooms: The number of bathrooms in the house. Houses with more bathrooms are typically priced higher.
    4. Year_Built: The year the house was built. Older houses may be priced lower due to wear and tear.
    5. Lot_Size: The size of the lot the house is built on, measured in acres. Larger lots tend to add value to a property.
    6. Garage_Size: The number of cars that can fit in the garage. Houses with larger garages are usually more expensive.
    7. Neighborhood_Quality: A rating of the neighborhood’s quality on a scale of 1-10, where 10 indicates a high-quality neighborhood. Better neighborhoods usually command higher prices.
    8. House_Price (Target Variable): The price of the house, which is the dependent variable you aim to predict.

    Potential Uses:

    1. Beginner Regression Projects: This dataset can be used to practice building regression models such as Linear Regression, Decision Trees, or Random Forests. The target variable (house price) is continuous, making this an ideal problem for supervised learning techniques.

    2. Feature Engineering Practice: Learners can create new features by combining existing ones, such as the price per square foot or age of the house, providing an opportunity to experiment with feature transformations.

    3. Exploratory Data Analysis (EDA): You can explore how different features (e.g., square footage, number of bedrooms) correlate with the target variable, making it a great dataset for learning about data visualization and summary statistics.

    4. Model Evaluation: The dataset allows for various model evaluation techniques such as cross-validation, R-squared, and Mean Absolute Error (MAE). These metrics can be used to compare the effectiveness of different models.

    Versatility:

    • The dataset is highly versatile for a range of machine learning tasks. You can apply simple linear models to predict house prices based on one or two features, or use more complex models like Random Forest or Gradient Boosting Machines to understand interactions between variables.

    • It can also be used for dimensionality reduction techniques like PCA or to practice handling categorical variables (e.g., neighborhood quality) through encoding techniques like one-hot encoding.

    • This dataset is ideal for anyone wanting to gain practical experience in building regression models while working with real-world features.
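
    As a baseline for the uses listed above, the sketch below fits a linear regression on all features with a simple train/test split and reports MAE; the CSV file name is a placeholder for your local copy of the dataset.

    ```r
    # Minimal sketch; the CSV name is a placeholder for your local copy.
    df <- read.csv("home_value_insights.csv")

    set.seed(42)
    idx   <- sample(nrow(df), 0.8 * nrow(df))
    train <- df[idx, ]
    test  <- df[-idx, ]

    fit  <- lm(House_Price ~ ., data = train)  # all listed features as predictors
    pred <- predict(fit, newdata = test)

    mae <- mean(abs(pred - test$House_Price))  # mean absolute error on held-out rows
    summary(fit)$r.squared
    mae
    ```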

  20. Data from: A Machine Learning Model to Estimate Toxicokinetic Half-Lives of...

    • catalog.data.gov
    • datasets.ai
    Updated Apr 30, 2023
    + more versions
    Cite
    U.S. EPA Office of Research and Development (ORD) (2023). A Machine Learning Model to Estimate Toxicokinetic Half-Lives of Per- and Polyfluoro-Alkyl Substances (PFAS) in Multiple Species [Dataset]. https://catalog.data.gov/dataset/a-machine-learning-model-to-estimate-toxicokinetic-half-lives-of-per-and-polyfluoro-alkyl-
    Dataset updated
    Apr 30, 2023
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    Data and code for "Dawson, D.E.; Lau, C.; Pradeep, P.; Sayre, R.R.; Judson, R.S.; Tornero-Velez, R.; Wambaugh, J.F. A Machine Learning Model to Estimate Toxicokinetic Half-Lives of Per- and Polyfluoro-Alkyl Substances (PFAS) in Multiple Species. Toxics 2023, 11, 98. https://doi.org/10.3390/toxics11020098". Includes a link to an R Markdown file allowing application of the model to novel chemicals. This dataset is associated with the following publication: Dawson, D., C. Lau, P. Pradeep, R. Sayre, R. Judson, R. Tornero-Velez, and J. Wambaugh. A Machine Learning Model to Estimate Toxicokinetic Half-Lives of Per- and Polyfluoro-Alkyl Substances (PFAS) in Multiple Species. Toxics. MDPI, Basel, SWITZERLAND, 11(2): 98, (2023).
