92 datasets found
  1. dataset-pinkball-first-merge

    • huggingface.co
    Updated Dec 1, 2025
    Cite
    Thomas R (2025). dataset-pinkball-first-merge [Dataset]. https://huggingface.co/datasets/treitz/dataset-pinkball-first-merge
    Dataset updated
    Dec 1, 2025
    Authors
    Thomas R
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset was created using LeRobot.

      Dataset Structure
    

    meta/info.json: { "codebase_version": "v3.0", "robot_type": "so101_follower", "total_episodes": 40, "total_frames": 10385, "total_tasks": 1, "chunks_size": 1000, "data_files_size_in_mb": 100, "video_files_size_in_mb": 200, "fps": 30, "splits": { "train": "0:40" }, "data_path": "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet", "video_path":… See the full description on the dataset page: https://huggingface.co/datasets/treitz/dataset-pinkball-first-merge.
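
    For orientation, here is a minimal R sketch of how one might read this layout locally: it parses meta/info.json with jsonlite and resolves the parquet path template with sprintf before loading one chunk with arrow. The chunk and file indices are illustrative, not taken from the dataset.

    ```r
    # Minimal sketch, assuming the jsonlite and arrow packages and a local copy
    # of the repository; chunk/file indices below are illustrative.
    library(jsonlite)
    library(arrow)

    info <- fromJSON("meta/info.json")
    info$fps             # 30, per the metadata above
    info$total_episodes  # 40

    # The template "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet"
    # maps directly onto sprintf's %03d formatting.
    path <- sprintf("data/chunk-%03d/file-%03d.parquet", 0, 0)
    frames <- read_parquet(path)
    str(frames)
    ```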

  2. Reddit's /r/Gamestop

    • kaggle.com
    zip
    Updated Nov 28, 2022
    Cite
    The Devastator (2022). Reddit's /r/Gamestop [Dataset]. https://www.kaggle.com/datasets/thedevastator/gamestop-inc-stock-prices-and-social-media-senti
    zip (186464492 bytes)
    Dataset updated
    Nov 28, 2022
    Authors
    The Devastator
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Reddit's /r/Gamestop

    Merge this dataset with GameStop price data to study how the chatter impacted the stock's price.

    By SocialGrep [source]

    About this dataset

    The stonks movement spawned by this is a very interesting one. It's rare to see an Internet meme have such an effect on the real-world economy, yet here we are.

    This dataset contains a collection of Reddit posts and comments mentioning GME in their title and body text, respectively. The data was procured using SocialGrep. The posts and comments are labelled with their score.

    With this new dataset, it will be interesting to see how this activity affected stock prices in the aftermath.

    How to use the dataset

    The files contain posts and comments from Reddit mentioning GME, along with their scores. These can be used to analyze how sentiment on GME affected its stock price in the aftermath.

    Research Ideas

    • To study how social media affects stock prices
    • To study how Reddit affects stock prices
    • To study how the sentiment of a subreddit affects stock prices

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. No copyright: you can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.

    Columns

    File: six-months-of-gme-on-reddit-comments.csv

    | Column name    | Description                                            |
    |:---------------|:-------------------------------------------------------|
    | type           | The type of post or comment. (String)                  |
    | subreddit.name | The name of the subreddit. (String)                    |
    | subreddit.nsfw | Whether the subreddit is NSFW. (Boolean)               |
    | created_utc    | The time the post or comment was created. (Timestamp)  |
    | permalink      | The permalink of the post or comment. (String)         |
    | body           | The body of the post or comment. (String)              |
    | sentiment      | The sentiment of the post or comment. (String)         |
    | score          | The score of the post or comment. (Integer)            |

    File: six-months-of-gme-on-reddit-posts.csv

    | Column name    | Description                                            |
    |:---------------|:-------------------------------------------------------|
    | type           | The type of post or comment. (String)                  |
    | subreddit.name | The name of the subreddit. (String)                    |
    | subreddit.nsfw | Whether the subreddit is NSFW. (Boolean)               |
    | created_utc    | The time the post or comment was created. (Timestamp)  |
    | permalink      | The permalink of the post or comment. (String)         |
    | score          | The score of the post or comment. (Integer)            |
    | domain         | The domain of the post or comment. (String)            |
    | url            | The URL of the post or comment. (String)               |
    | selftext       | The selftext of the post or comment. (String)          |
    | title          | The title of the post or comment. (String)             |
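
    As a concrete starting point for the merge suggested above, the sketch below aggregates daily comment sentiment and joins it to a GameStop price series. It assumes created_utc is a Unix epoch and that sentiment takes values such as "positive"; gme_prices.csv (columns date, close) is a hypothetical external price file you would supply.

    ```r
    # Minimal sketch; gme_prices.csv is a hypothetical external price file,
    # and the created_utc / sentiment encodings are assumptions noted above.
    library(dplyr)
    library(readr)

    comments <- read_csv("six-months-of-gme-on-reddit-comments.csv")
    prices   <- read_csv("gme_prices.csv")   # columns assumed: date, close

    daily <- comments %>%
      mutate(date = as.Date(as.POSIXct(created_utc, origin = "1970-01-01", tz = "UTC"))) %>%
      group_by(date) %>%
      summarise(
        n_comments     = n(),
        mean_score     = mean(score, na.rm = TRUE),
        share_positive = mean(sentiment == "positive", na.rm = TRUE)
      )

    merged <- inner_join(daily, prices, by = "date")
    head(merged)
    ```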

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. If you use this dataset in your research, please credit SocialGrep.

  3. Additional file 4 of mtDNAcombine: tools to combine sequences from multiple...

    • springernature.figshare.com
    txt
    Updated Jun 3, 2023
    Cite
    Eleanor F. Miller; Andrea Manica (2023). Additional file 4 of mtDNAcombine: tools to combine sequences from multiple studies [Dataset]. http://doi.org/10.6084/m9.figshare.14189969.v1
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Eleanor F. Miller; Andrea Manica
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 4. Code to create the plots in this paper, presented as an R Markdown file.

  4. KORUS-AQ Aircraft Merge Data Files - Dataset - NASA Open Data Portal

    • data.nasa.gov
    Updated Apr 1, 2025
    + more versions
    Cite
    nasa.gov (2025). KORUS-AQ Aircraft Merge Data Files - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/korus-aq-aircraft-merge-data-files-9bba5
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    KORUSAQ_Merge_Data are pre-generated merge data files combining various products collected during the KORUS-AQ field campaign. This collection features pre-generated merge files for the DC-8 aircraft. Data collection for this product is complete.

    The KORUS-AQ field study was conducted in South Korea during May-June, 2016. The study was jointly sponsored by NASA and Korea’s National Institute of Environmental Research (NIER). The primary objectives were to investigate the factors controlling air quality in Korea (e.g., local emissions, chemical processes, and transboundary transport) and to assess future air quality observing strategies incorporating geostationary satellite observations. To achieve these science objectives, KORUS-AQ adopted a highly coordinated sampling strategy involving surface and airborne measurements, including both in-situ and remote sensing instruments.

    Surface observations provided details on ground-level air quality conditions, while airborne sampling provided an assessment of conditions aloft relevant to satellite observations and necessary to understand the role of emissions, chemistry, and dynamics in determining air quality outcomes. The sampling region covers the South Korean peninsula and surrounding waters, with a primary focus on the Seoul Metropolitan Area. Airborne sampling was primarily conducted from near surface to about 8 km, with extensive profiling to characterize the vertical distribution of pollutants and their precursors. The airborne observational data were collected from three aircraft platforms: the NASA DC-8, NASA B-200, and Hanseo King Air. Surface measurements were conducted from 16 ground sites and 2 ships: R/V Onnuri and R/V Jang Mok.

    The major data products collected from both the ground and air include in-situ measurements of trace gases (e.g., ozone, reactive nitrogen species, carbon monoxide and dioxide, methane, non-methane and oxygenated hydrocarbon species), aerosols (e.g., microphysical and optical properties and chemical composition), active remote sensing of ozone and aerosols, and passive remote sensing of NO2, CH2O, and O3 column densities. These data products support research focused on examining the impact of photochemistry and transport on ozone and aerosols, evaluating emissions inventories, and assessing the potential use of satellite observations in air quality studies.

  5. Harmonized global datasets of soil carbon and heterotrophic respiration from...

    • zenodo.org
    bin, nc, txt
    Updated Oct 7, 2025
    Cite
    Shoji Hashimoto; Akihiko Ito; Kazuya Nishina (2025). Harmonized global datasets of soil carbon and heterotrophic respiration from data-driven estimates, with derived turnover time and Q10 [Dataset]. http://doi.org/10.5281/zenodo.17282577
    Dataset updated
    Oct 7, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Shoji Hashimoto; Akihiko Ito; Kazuya Nishina
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We collected all available global soil carbon (C) and heterotrophic respiration (RH) maps derived from data-driven estimates, sourcing them from public repositories and supplementary materials of previous studies (Table 1). All spatial datasets were converted to NetCDF format for consistency and ease of use.

    Because the maps had varying spatial resolutions (ranging from 0.0083° to 0.5°), we harmonized all datasets to a common resolution of 0.5° (approximately 50 km at the equator). We then merged the processed maps by computing the mean, maximum, and minimum values at each grid cell, resulting in harmonized global maps of soil C (for the top 0–30 cm and 0–100 cm depths) and RH at 0.5° resolution.

    Grid cells with fewer than three soil C estimates or fewer than four RH estimates were assigned NA values. Land and water grid cells were automatically distinguished by combining multiple datasets containing soil C and RH information over land.

    Soil carbon turnover time (years), denoted as τ, was calculated under the assumption of a quasi-equilibrium state using the formula:

    τ = CS / RH

    where CS is soil carbon stock and RH is the heterotrophic respiration rate. The uncertainty range of τ was estimated for each grid cell using:

    τmax = CS⁺ / RH⁻  and  τmin = CS⁻ / RH⁺

    where CS⁺ and CS⁻ are the maximum and minimum soil C values, and RH⁺ and RH⁻ are the maximum and minimum RH values, respectively.

    To calculate the temperature sensitivity of decomposition (Q10)—the factor by which decomposition rates increase with a 10 °C rise in temperature—we followed the method described in Koven et al. (2017). The uncertainty of Q10 (maximum and minimum values) was derived using τmax and τmin, respectively.

    All files are provided in NetCDF format. The SOC file includes the following variables:
    · longitude, latitude
    · soc: mean soil C stock (kg C m⁻²)
    · soc_median: median soil C (kg C m⁻²)
    · soc_n: number of estimates per grid cell
    · soc_max, soc_min: maximum and minimum soil C (kg C m⁻²)
    · soc_max_id, soc_min_id: study IDs corresponding to the maximum and minimum values
    · soc_range: range of soil C values
    · soc_sd: standard deviation of soil C (kg C m⁻²)
    · soc_cv: coefficient of variation (%)
    The RH file includes:
    · longitude, latitude
    · rh: mean RH (g C m⁻² yr⁻¹)
    · rh_median, rh_n, rh_max, rh_min: as above
    · rh_max_id, rh_min_id: study IDs for max/min
    · rh_range, rh_sd, rh_cv: analogous variables for RH
    The mean, maximum, and minimum values of soil C turnover time are provided as separate files. The Q10 files contain estimates derived from the mean values of soil C and RH, along with associated uncertainty values.

    The harmonized dataset files available in the repository are as follows:

    · harmonized-RH-hdg.nc: global soil heterotrophic respiration map

    · harmonized-SOC100-hdg.nc: global soil C map for 0–100 cm

    · harmonized-SOC30-hdg.nc: global soil C map for 0–30 cm

    · Q10.nc: global Q10 map

    · Turnover-time_max.nc: global soil C turnover time estimated using maximum soil C and minimum RH

    · Turnover-time_min.nc: global soil C turnover time estimated using minimum soil C and maximum RH

    · Turnover-time_mean.nc: global soil C turnover time estimated using mean soil C and RH

    · Turnover-time30_mean.nc: global soil C turnover time estimated using the soil C map for 0-30 cm
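
    As a quick sanity check on the derived products, the sketch below recomputes the mean turnover time τ = CS / RH from the harmonized files listed above, using the ncdf4 package and the soc/rh variable names documented earlier; the gram-to-kilogram conversion is the only step added here.

    ```r
    # Minimal sketch, assuming the ncdf4 package and the files/variables named above.
    library(ncdf4)

    nc_soc <- nc_open("harmonized-SOC100-hdg.nc")
    nc_rh  <- nc_open("harmonized-RH-hdg.nc")

    soc <- ncvar_get(nc_soc, "soc")  # mean soil C stock, kg C m^-2
    rh  <- ncvar_get(nc_rh,  "rh")   # mean RH, g C m^-2 yr^-1

    # Convert RH to kg C m^-2 yr^-1 so that tau = CS / RH comes out in years.
    tau <- soc / (rh / 1000)
    summary(as.vector(tau))

    nc_close(nc_soc)
    nc_close(nc_rh)
    ```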

    Version history
    Version 1.1: Median values were added. Bug fix for SOC30 (the n > 2 filter was inactive in the previous version).


    More details are provided in: Hashimoto, S., Ito, A. & Nishina, K. (in revision) Harmonized global soil carbon and respiration datasets with derived turnover time and temperature sensitivity. Scientific Data.

    Reference

    Koven, C. D., Hugelius, G., Lawrence, D. M. & Wieder, W. R. Higher climatological temperature sensitivity of soil carbon in cold than warm climates. Nat. Clim. Change 7, 817–822 (2017).

    Table 1: List of soil carbon and heterotrophic respiration datasets used in this study.

    | Dataset            | Repository/References (Dataset name)                | Depth (cm)   | ID in NetCDF file*** |
    |:-------------------|:-----------------------------------------------------|:-------------|:---------------------|
    | Global soil C      | Global soil data task 2000 (IGBP-DIS)¹                | 0–100        | 3, -                 |
    |                    | Shangguan et al. 2014 (GSDE)²,³                       | 0–100, 0–30* | 1, 1                 |
    |                    | Batjes 2016 (WISE30sec)⁴,⁵                            | 0–100, 0–30  | 6, 7                 |
    |                    | Sanderman et al. 2017 (Soil-Carbon-Debt)⁶,⁷           | 0–100, 0–30  | 5, 5                 |
    |                    | Soilgrids team and Hengl et al. 2017 (SoilGrids)⁸,⁹   | 0–30**       | -, 6                 |
    |                    | Hengl and Wheeler 2018 (LandGIS)¹⁰                    | 0–100, 0–30  | 4, 4                 |
    |                    | FAO 2022 (GSOC)¹¹                                     | 0–30         | -, 2                 |
    |                    | FAO 2023 (HWSD2)¹²                                    | 0–100, 0–30  | 2, 3                 |
    | Circumpolar soil C | Hugelius et al. 2013 (NCSCD)¹³⁻¹⁵                     | 0–100, 0–30  | 7, 8                 |
    | Global RH          | Hashimoto et al. 2015¹⁶,¹⁷                            | -            | 1                    |
    |                    | Warner et al. 2019 (Bond-Lamberty equation based)¹⁸,¹⁹ | -           | 2                    |
    |                    | Warner et al. 2019 (Subke equation based)¹⁸,¹⁹        | -            | 3                    |
    |                    | Tang et al. 2020²⁰,²¹                                 | -            | 4                    |
    |                    | Lu et al. 2021²²,²³                                   | -            | 5                    |
    |                    | Stell et al. 2021²⁴,²⁵                                | -            |                      |

  6. Dataset: Environmental conditions and male quality traits simultaneously...

    • data.europa.eu
    • data.niaid.nih.gov
    unknown
    Updated Jun 21, 2022
    Cite
    Zenodo (2022). Dataset: Environmental conditions and male quality traits simultaneously explain variation of multiple colour signals in male lizards [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-6683661?locale=de
    unknown (3063441 bytes)
    Dataset updated
    Jun 21, 2022
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset and R code associated with the following publication: Badiane et al. (2022), Environmental conditions and male quality traits simultaneously explain variation of multiple colour signals in male lizards. Journal of Animal Ecology, in press.

    This dataset includes the following files:
    - An Excel file containing the reflectance spectra of all individuals from all the study populations
    - An Excel file containing the variables collected at the individual and population levels
    - Two R scripts corresponding to the analyses performed in the publication

  7. Additional file 2 of mtDNAcombine: tools to combine sequences from multiple...

    • springernature.figshare.com
    zip
    Updated Jun 11, 2023
    Cite
    Eleanor F. Miller; Andrea Manica (2023). Additional file 2 of mtDNAcombine: tools to combine sequences from multiple studies [Dataset]. http://doi.org/10.6084/m9.figshare.14189960.v1
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Eleanor F. Miller; Andrea Manica
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 2. Input files needed to recreate the plots in this paper: Tracer output files for three species.

  8. Additional file 3 of mtDNAcombine: tools to combine sequences from multiple...

    • springernature.figshare.com
    txt
    Updated Jun 1, 2023
    Cite
    Eleanor F. Miller; Andrea Manica (2023). Additional file 3 of mtDNAcombine: tools to combine sequences from multiple studies [Dataset]. http://doi.org/10.6084/m9.figshare.14189963.v1
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Eleanor F. Miller; Andrea Manica
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 3. Input files needed to recreate the plots in this paper: raw sequence data for alignment.

  9. Growth and Yield Data for the Bushland, Texas, Soybean Datasets

    • agdatacommons.nal.usda.gov
    • catalog.data.gov
    xlsx
    Updated Nov 21, 2025
    Cite
    Steven R. Evett; Gary W. Marek; Karen S. Copeland; Terry A. Howell Sr.; Paul D. Colaizzi; David K. Brauer; Brice B. Ruthardt (2025). Growth and Yield Data for the Bushland, Texas, Soybean Datasets [Dataset]. http://doi.org/10.15482/USDA.ADC/1528670
    Dataset updated
    Nov 21, 2025
    Dataset provided by
    Ag Data Commons
    Authors
    Steven R. Evett; Gary W. Marek; Karen S. Copeland; Terry A. Howell Sr.; Paul D. Colaizzi; David K. Brauer; Brice B. Ruthardt
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Bushland, Texas
    Description

    This dataset consists of growth and yield data for each season when soybean [Glycine max (L.) Merr.] was grown for seed at the USDA-ARS Conservation and Production Laboratory (CPRL), Soil and Water Management Research Unit (SWMRU) research weather station, Bushland, Texas (Lat. 35.186714°, Long. -102.094189°, elevation 1170 m above MSL). In the 1995, 2003, 2004, and 2010 seasons, soybean was grown on two large, precision weighing lysimeters, each in the center of a 4.44 ha square field. In 2019, soybean was grown on four large, precision weighing lysimeters and their surrounding 4.4 ha fields. The square fields are themselves arranged in a larger square with four fields in four adjacent quadrants of the larger square. Fields and lysimeters within each field are thus designated northeast (NE), southeast (SE), northwest (NW), and southwest (SW). Soybean was grown on different combinations of fields in different years.

    Irrigation was by linear move sprinkler system in 1995, 2003, 2004, and 2010, although in 2010 only one irrigation was applied to establish the crop, after which it was grown as a dryland crop. Irrigation protocols described as full were managed to replenish soil water used by the crop on a weekly or more frequent basis, as determined by soil profile water content readings made with a neutron probe to 2.4-m depth in the field. Irrigation protocols described as deficit typically involved irrigations to establish the crop early in the season, followed by reduced or absent irrigations later in the season (typically in the later winter and spring).

    The growth and yield data include plant population density, height, plant row width, leaf area index, growth stage, total above-ground biomass, leaf and stem biomass, head mass (when present), kernel or seed number, and final yield. Data are from replicate samples in the field and non-destructive (except for final harvest) measurements on the weighing lysimeters. In most cases yield data are available both from manual sampling on replicate plots in each field and from machine harvest. Machine harvest yields are commonly smaller than hand harvest yields due to combine losses.

    These datasets originate from research aimed at determining crop water use (ET), crop coefficients for use in ET-based irrigation scheduling based on a reference ET, crop growth, yield, harvest index, and crop water productivity as affected by irrigation method, timing, amount (full or some degree of deficit), agronomic practices, cultivar, and weather. Prior publications have focused on soybean ET, crop coefficients, and crop water productivity. Crop coefficients have been used by ET networks. The data have utility for testing simulation models of crop ET, growth, and yield and have been used for testing and calibrating models of ET that use satellite and/or weather data. See the README for descriptions of each data file.

    Resources in this dataset:
    - 1995 Bushland, TX, west soybean growth and yield data. File: 1995 West Soybean_Growth_and_Yield-V2.xlsx
    - 2003 Bushland, TX, east soybean growth and yield data. File: 2003 East Soybean_Growth_and_Yield-V2.xlsx
    - 2004 Bushland, TX, east soybean growth and yield data. File: 2004 East Soybean_Growth-and_Yield-V2.xlsx
    - 2010 Bushland, TX, west soybean growth and yield data. File: 2010 West_Soybean_Growth_and_Yield-V2.xlsx
    - 2019 Bushland, TX, east soybean growth and yield data. File: 2019 East Soybean_Growth_and_Yield-V2.xlsx
    - 2019 Bushland, TX, west soybean growth and yield data. File: 2019 West Soybean_Growth_and_Yield-V2.xlsx
    - README. File: README_Soybean_Growth_and_Yield.txt
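
    Since each resource is an Excel workbook, one season's data can be loaded with the readxl package as sketched below; the file name comes from the resource list above, and the sheet layout should be checked against the README.

    ```r
    # Minimal sketch, assuming the readxl package and a downloaded workbook.
    library(readxl)

    f <- "2019 East Soybean_Growth_and_Yield-V2.xlsx"
    excel_sheets(f)                    # list the sheets before committing to one
    growth <- read_excel(f, sheet = 1) # the sheet layout is described in the README
    str(growth)
    ```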

  10. MGUS: Surveillance & Disease Dynamics

    • kaggle.com
    zip
    Updated Jun 1, 2023
    Cite
    Utkarsh Singh (2023). MGUS: Surveillance & Disease Dynamics [Dataset]. https://www.kaggle.com/datasets/utkarshx27/monoclonal-gammopathy-data
    zip (25967 bytes)
    Dataset updated
    Jun 1, 2023
    Authors
    Utkarsh Singh
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Monoclonal Gammopathy of Undetermined Significance (MGUS)

    This dataset provides invaluable insights into the natural history of Monoclonal Gammopathy of Undetermined Significance (MGUS), a crucial precursor condition to various plasma cell disorders, including multiple myeloma. Originating from a landmark longitudinal study at the Mayo Clinic, this meticulously curated dataset has served as a cornerstone in hematologic research for decades.

    Dataset Origin and Principal Investigator

    The data stems from a foundational clinical cohort established at the Mayo Clinic in Rochester, Minnesota, USA. The principal investigator, Dr. Robert A. Kyle of the Mayo Clinic, initiated this study, which involved sequential patients diagnosed with MGUS, followed longitudinally. The initial groundbreaking research was published in the New England Journal of Medicine (NEJM) in 1978, with extended follow-up studies published in 2002 and subsequent updates, ensuring a robust and long-term perspective on the condition. The diligence of the principal investigator ensured no subjects from the initial cohort were lost to follow-up.

    Understanding Monoclonal Gammopathy of Undetermined Significance (MGUS)

    Plasma cells are vital for immune defense, producing immunoglobulins. In certain conditions, a single plasma cell clone can proliferate, leading to an abnormal monoclonal protein (M-protein) visible as a "spike" in serum protein electrophoresis. MGUS is defined by the presence of such an M-protein without evidence of overt malignancy, distinguishing it from more serious conditions like multiple myeloma. It is a premalignant plasma cell disorder, and while generally asymptomatic, it carries a lifelong risk of progression to more severe conditions.

    Dataset Content

    The dataset typically comprises two main files, mgus1.csv and mgus2.csv, offering different formats and an extended cohort. The data sets were updated in January 2015 to correct some small errors. For patient confidentiality, the dataset in the survival R package has been slightly perturbed, but the statistical results remain essentially unchanged.

    Clinical Significance and Research Applications

    This dataset is foundational in hematology due to its long-term follow-up and the comprehensive clinical and laboratory data collected. It is frequently cited in medical literature and textbooks, such as "Modeling Survival Data: Extending the Cox Model" by Therneau & Grambsch. Researchers utilize this dataset to:

    • Study the natural history and progression of MGUS.
    • Identify risk factors for progression to multiple myeloma and other plasma cell malignancies.
    • Develop and validate prognostic models.
    • Explore associations between MGUS and other diseases.
    • Demonstrate and practice survival analysis techniques.

    The dataset highlights that approximately 1% of individuals with MGUS progress to a more serious blood cancer or related disorder annually. Key risk factors for progression include higher concentrations of monoclonal immunoglobulin (especially >1.5 g/dL), monoclonal immunoglobulins other than IgG, and an abnormal serum free light-chain ratio.

    Provenance and Recommended Use

    The data was directly abstracted from clinical and laboratory records at the Mayo Clinic under strict research protocols and verified through continuous follow-up. It is widely distributed and included as a standard dataset in the survival official R package by Terry Therneau (Mayo Clinic).

    References and Further Reading:

    Original Study (1978): Kyle RA, et al. Prevalence of Monoclonal Gammopathy of Undetermined Significance. NEJM 1978;299:1213-20. https://www.nejm.org/doi/full/10.1056/NEJM197812072992304

    Major Follow-up (2002): Kyle RA, et al. Long-Term Follow-up of MGUS. NEJM 2002;346:564-569. https://www.nejm.org/doi/full/10.1056/NEJMoa011332

    R Survival Package Documentation: For detailed information on the dataset as included in R. https://stat.ethz.ch/R-manual/R-devel/library/survival/help/mgus.html https://search.r-project.org/CRAN/refmans/eventglm/html/mgus2.html

    Table Content:

    | Column (mgus1) | Description |
    |:---------------|:------------|
    | id      | Subject ID |
    | age     | Age in years at the detection of MGUS |
    | sex     | Gender (male or female) |
    | dxyr    | Year of diagnosis |
    | pcdx    | For subjects who progress to a plasma cell malignancy, the subtype of malignancy: multiple myeloma (MM) is the most common, followed by amyloidosis (AM), macroglobulinemia (MA), and other lymphoproliferative disorders (LP) |
    | subtype | The subtype of malignancy (MM = multiple myeloma, AM = amyloidosis, ...) |
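
    Because the (perturbed) data ship with the survival R package, a standard first analysis runs directly against the bundled copies; the sketch below uses the extended mgus2 cohort, where follow-up time (futime) is recorded in months.

    ```r
    # Minimal sketch using the mgus2 data bundled with the survival package.
    library(survival)

    # Overall survival from MGUS detection, stratified by sex.
    fit <- survfit(Surv(futime, death) ~ sex, data = mgus2)
    summary(fit, times = c(60, 120))  # survival at 5 and 10 years (in months)
    plot(fit, xlab = "Months from MGUS detection", ylab = "Survival probability")
    ```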
  11. Data from: RAW data from Towards Holistic Environmental Policy Assessment:...

    • research.science.eus
    • data.europa.eu
    Updated 2024
    Cite
    Borges, Cruz E.; Ferrón, Leandro; Soimu, Oxana; Mugarra, Aitziber (2024). RAW data from Towards Holistic Environmental Policy Assessment: Multi-Criteria Frameworks and recommendations for modelers paper [Dataset]. https://research.science.eus/documentos/685699066364e456d3a65172
    Dataset updated
    2024
    Authors
    Borges, Cruz E.; Ferrón, Leandro; Soimu, Oxana; Mugarra, Aitziber
    Description

    Name: Data used to rate the relevance of each dimension necessary for a Holistic Environmental Policy Assessment.

    Summary: This dataset contains answers from a panel of experts and the public to rate the relevance of each dimension on a scale of 0 (Not relevant at all) to 100 (Extremely relevant).

    License: CC-BY-SA

    Acknowledge: These data have been collected in the framework of the DECIPHER project. This project has received funding from the European Union’s Horizon Europe programme under grant agreement No. 101056898.

    Disclaimer: Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.

    Collection Date: 2024-01 / 2024-04

    Publication Date: 22/04/2025

    DOI: 10.5281/zenodo.13909413

    Other repositories: -

    Author: University of Deusto

    Objective of collection: These data were originally collected to prioritise the dimensions to be further used for Environmental Policy Assessment and the enlarged scope of IAMs.

    Description:

    Data Files (CSV)

    decipher-public.csv : Public participants' general survey results in the framework of the Decipher project, including socio demographic characteristics and overall perception of each dimension necessary for a Holistic Environmental Policy Assessment.

    decipher-risk.csv : Contains individual survey responses regarding prioritisation of dimensions in risk situations. Includes demographic and opinion data from a targeted sample.

    decipher-experts.csv : Experts’ opinions collected on risk topics through surveys in the framework of Decipher Project, targeting professionals in relevant fields.

    decipher-modelers.csv: Answers given by the developers of models about the characteristics of the models and dimensions covered by them.

    prolific_export_risk.csv : Exported survey data from Prolific, focusing specifically on ratings in risk situations. Includes response times, demographic details, and survey metadata.

    prolific_export_public_{1,2}.csv : Public survey exports from Prolific, gathering prioritisation of dimensions necessary for environmental policy assessment.

    curated.csv : Final cleaned and harmonized dataset combining multiple survey sources. Designed for direct statistical analysis with standardized variable names.

    Scripts files (R)

    decipher-modelers.R: Script to assess the answers given by modelers about the characteristics of the models.

    joint.R: Script to clean and join the RAW answers from the different surveys to retrieve the overall perception of each dimension necessary for a Holistic Environmental Policy Assessment.

    Report Files

    decipher-modelers.pdf: Diagram with the result of the

    full-Country.html : Full interactive report showing dimension prioritisation broken down by participant country.

    full-Gender.html : Visualization report displaying differences in dimension prioritisation by gender.

    full-Education.html : Detailed breakdown of dimension prioritisation results based on education level.

    full-Work.html : Report focusing on participant occupational categories and associated dimension prioritisation.

    full-Income.html : Analysis report showing how income level correlates with dimension prioritisation.

    full-PS.html : Report analyzing Political Sensitivity scores across all participants.

    full-type.html : Visualization report comparing participant dimensions prioritisation (public vs experts) in normal and risk situations.

    full-joint-Country.html : Joint analysis report integrating multiple dimensions of country-based dimension prioritisation in normal and risk situations. Combines demographic and response patterns.

    full-joint-Gender.html : Combined gender-based analysis across datasets, exploring intersections of demographic factors and dimensions prioritisation in normal and risk situations.

    full-joint-Education.html : Education-focused report merging various datasets to show consistent or divergent patterns of dimensions prioritisation in normal and risk awareness.

    full-joint-Work.html : Cross-dataset analysis of occupational groups and their dimensions prioritisation in normal and risk situation

    full-joint-Income.html : Income-stratified joint analysis, merging public and expert datasets to find common trends and significant differences during dimensions prioritisation in normal and risks situations.

    full-joint-PS.html : Comprehensive Political Sensitivity score report from merged datasets, highlighting general patterns and subgroup variations in normal and risk situations.

    5 star: ⭐⭐⭐

    Preprocessing steps: The data has been re-coded and cleaned using the scripts provided.

    Reuse: NA

    Update policy: No more updates are planned.

    Ethics and legal aspects: Names of the persons involved have been removed.

    Technical aspects:

    Other:

  12. Data from: Optimized SMRT-UMI protocol produces highly accurate sequence...

    • data.niaid.nih.gov
    • zenodo.org
    • +1 more
    zip
    Updated Dec 7, 2023
    + more versions
    Cite
    Dylan Westfall; James Mullins (2023). Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies [Dataset]. http://doi.org/10.5061/dryad.w3r2280w0
    Dataset updated
    Dec 7, 2023
    Dataset provided by
    HIV Vaccine Trials Network (http://www.hvtn.org/)
    HIV Prevention Trials Network (http://www.hptn.org/)
    National Institute of Allergy and Infectious Diseases (http://www.niaid.nih.gov/)
    PEPFAR
    Authors
    Dylan Westfall; James Mullins
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing, which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR, and the use of UMI allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies.

    Methods

    This serves as an overview of the analysis performed on PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al., "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies".

    Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005.

    For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub.

    The demultiplexed read collections from the chunked_demux pipeline, or CCS read files from datasets which were not indexed (M1567, M004, M005), were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer) and further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub.

    The consensus.tar.gz and tagged.tar.gz files were moved from the sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each has an .Rmd file with the same name inside, which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results.

    Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, two fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the python script compare_seqs.py, which identifies any matched sequences that are different between sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program.

    To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed, and for each family with discordant sUMI and dUMI sequences the reads from the UMI1_keeping directory were aligned using Geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Geneious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment, and specifically the UMI2 sequences, the site of the discordance and its cause were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper.

    Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined, and used as input to create a single dataset containing all UMI information from all samples. This file, dUMI_df.csv, was used as input for Figures.Rmd.

    Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for each figure. These were copied into Prism software to create the final figures for the paper.
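
    To make the first step of Sequence_Analysis.Rmd concrete, the sketch below decompresses the per-dataset consensus archives and concatenates the sUMI fasta files into a single file; the fasta file-name pattern inside the archives is an assumption for illustration.

    ```r
    # Minimal sketch; the "sUMI" fasta naming pattern inside the archives is assumed.
    datasets <- c("M027", "M2199", "M1567", "M004", "M005")

    # Unpack each consensus archive into its own folder.
    for (d in datasets) untar(sprintf("consensus_%s.tar.gz", d), exdir = d)

    # Collect all sUMI fasta files and concatenate them into one combined fasta.
    sumi_files <- list.files(datasets, pattern = "sUMI.*\\.fasta$",
                             recursive = TRUE, full.names = TRUE)
    writeLines(unlist(lapply(sumi_files, readLines)), "all_sUMI.fasta")
    ```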

  13. phenotools: an R package for visualizing and analyzing phenomic datasets

    • search.dataone.org
    • datadryad.org
    Updated Jun 17, 2025
    Cite
    Chad M. Eliason; Scott V. Edwards; Julia A. Clarke (2025). phenotools: an R package for visualizing and analyzing phenomic datasets [Dataset]. http://doi.org/10.5061/dryad.05qm36k
    Dataset updated
    Jun 17, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Chad M. Eliason; Scott V. Edwards; Julia A. Clarke
    Time period covered
    Jan 1, 2019
    Description

    1. Phenotypic data is crucial for understanding genotype–phenotype relationships, assessing the tree of life, and revealing trends in trait diversity over time. Large-scale description of whole organisms for quantitative analyses (phenomics) presents several challenges, and technological advances in the collection of genomic data outpace those for phenomic data. Reasons for this disparity include the time-consuming and expensive nature of collecting discrete phenotypic data and mining previously published data on a given species (both often requiring anatomical expertise across taxa), and computational challenges involved with analyzing high-dimensional datasets.

    2. One approach to building approximations of organismal phenomes is to combine published datasets of discrete characters assembled for phylogenetic analyses into a phenomic dataset. Despite a wealth of legacy datasets in the literature for many groups, relatively few methods exist for automating the assembly, analysis, and vi...

  14. NASA DC-8 1 Minute Data Merge

    • data.ucar.edu
    ascii
    Updated Oct 7, 2025
    + more versions
    Cite
    Gao Chen; Jennifer R. Olson; Michael Shook (2025). NASA DC-8 1 Minute Data Merge [Dataset]. http://doi.org/10.26023/VM9C-1C16-H003
    Dataset updated
    Oct 7, 2025
    Authors
    Gao Chen; Jennifer R. Olson; Michael Shook
    Time period covered
    May 1, 2012 - Jun 30, 2012
    Description

    This dataset contains NASA DC-8 1 Minute Data Merge data collected during the Deep Convective Clouds and Chemistry Experiment (DC3) from 18 May 2012 through 22 June 2012. This dataset contains updated data provided by NASA. In most cases, variable names have been kept identical to those submitted in the raw data files. However, in some cases, names have been changed (e.g., to eliminate duplication). Units have been standardized throughout the merge. In addition, a "grand merge" has been provided, which includes data from all the individual merged flights throughout the mission. The grand merge follows the naming convention "dc3-mrg60-dc8_merge_YYYYMMdd_R5_thruYYYYMMdd.ict" (with the suffix "_thruYYYYMMdd" indicating the last flight date included). This dataset is in ICARTT format. Please see the header portion of the data files for details on instruments, parameters, quality assurance, quality control, contact information, and dataset comments. For more information on updates to this dataset, please see the readme file.
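
    Because the merges are ICARTT (.ict) files, whose first line gives the header length and whose last header line holds the variable names, they can be read with base R as sketched below; the concrete file name is illustrative, following the grand-merge convention above.

    ```r
    # Minimal sketch for reading an ICARTT (.ict) file; the file name is illustrative.
    f <- "dc3-mrg60-dc8_merge_20120518_R5_thru20120622.ict"

    # Line 1 of an ICARTT file is "<number of header lines>, <format index>".
    n_header <- as.integer(strsplit(readLines(f, n = 1), ",")[[1]][1])

    # The variable-name row is the last header line, so skip everything before it.
    merged <- read.csv(f, skip = n_header - 1, header = TRUE)
    str(merged)
    ```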

  15. DLR Falcon 1 Minute Data Merge

    • data.ucar.edu
    • ckanprod.data-commons.k8s.ucar.edu
    ascii
    Updated Oct 7, 2025
    + more versions
    Cite
    Gao Chen; Jennifer R. Olson; Michael Shook (2025). DLR Falcon 1 Minute Data Merge [Dataset]. http://doi.org/10.26023/SZ09-F2G3-7X0V
    Dataset updated
    Oct 7, 2025
    Authors
    Gao Chen; Jennifer R. Olson; Michael Shook
    Time period covered
    May 29, 2012 - Jun 14, 2012
    Description

    This data set contains DLR Falcon 1 Minute Data Merge data collected during the Deep Convective Clouds and Chemistry Experiment (DC3) from 29 May 2012 through 14 June 2012. These merges were created using data in the NASA DC3 archive as of September 25, 2013. In most cases, variable names have been kept identical to those submitted in the raw data files. However, in some cases, names have been changed (e.g., to eliminate duplication). Units have been standardized throughout the merge. In addition, a "grand merge" has been provided, which includes data from all the individual merged flights throughout the mission. The grand merge follows the naming convention "dc3-mrg06-falcon_merge_YYYYMMdd_R2_thruYYYYMMdd.ict" (with the suffix "_thruYYYYMMdd" indicating the last flight date included). This data set is in ICARTT format. Please see the header portion of the data files for details on instruments, parameters, quality assurance, quality control, contact information, and data set comments.

  16. Benchmark Datasets for Entity Linking from Tabular Data

    • zenodo.org
    zip
    Updated Sep 19, 2025
    Cite
    Roberto Avogadro; Roberto Avogadro (2025). Benchmark Datasets for Entity Linking from Tabular Data [Dataset]. http://doi.org/10.5281/zenodo.17160156
    Dataset updated
    Sep 19, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Roberto Avogadro; Roberto Avogadro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    📖 Benchmark Datasets for Entity Linking from Tabular Data (Version 2)


    This archive provides a benchmark suite for evaluating entity linking algorithms on structured tabular data.
    It is organised into two parts:
    • Challenge datasets (HTR1, HTR2): From the SemTab Table-to-KG Challenge, widely used in academic evaluations of table-to-KG alignment systems. Each is a dataset (a collection of many tables) provided with ground truth and candidate mappings.
    👉 Please also cite the SemTab Challenge when using these resources.
    • Real-world tables (Company, Movie, SN):
    • Company — one table constructed via SPARQL queries on Wikidata, with both Wikidata and Crunchbase ground truths.
    • Movie — one table constructed via SPARQL queries on Wikidata.
    • SN (Spend Network) — one procurement table from the enRichMyData (EMD) project, manually annotated and including NIL cases for mentions with no known Wikidata match.


    A shared top-level folder (mention_to_qid/) provides JSON files mapping surface mentions to candidate QIDs for these real-world tables.



    📂 Contents


    Each dataset or table includes:
    • One or more input CSV tables
    • Ground truth files mapping mentions/cells to Wikidata QIDs (or NIL)
    • Candidate mappings (mention_to_qid/*.json), sometimes multiple variants
    • Optional files such as column_classifications.json or cell_to_qid.json
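
    For loading the candidate files programmatically, a minimal R sketch with jsonlite follows; the exact file name under mention_to_qid/ and the {mention: [QID, ...]} shape are assumptions based on the folder description above.

    ```r
    # Minimal sketch; file name and JSON shape are assumptions noted above.
    library(jsonlite)

    candidates <- fromJSON("mention_to_qid/company.json")  # hypothetical file name

    # Expected shape: a named list mapping each surface mention to a character
    # vector of candidate Wikidata QIDs, e.g. candidates[["Acme Corp"]].
    str(head(candidates, 3))
    ```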



    📝 Licensing
    • HTR1 & HTR2: CC BY 4.0
    • Company & Movie: Derived from Wikidata (public domain; CC0 1.0)
    • SN: CC BY 4.0 (from the enRichMyData project)



    📌 Citation


    If you use these datasets, please cite:
    • This Zenodo record (Version 2):
    Avogadro, R., & Rauniyar, A. (2025). Benchmark Datasets for Entity Linking from Tabular Data (Version 2). Zenodo. https://doi.org/10.5281/zenodo.15888942
    • The SemTab Challenge (for HTR1/HTR2):
    SemTab: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (Table-to-KG). (Cite the relevant SemTab overview paper for the year you reference.)
    • Wikidata: Data retrieved from Wikidata (public domain; CC0 1.0).
    • enRichMyData (for SN / Spend Network): Project resources from enRichMyData, licensed under CC BY 4.0.

  17. Growth and Yield Data for the Bushland, Texas, Sorghum Datasets

    • agdatacommons.nal.usda.gov
    • catalog.data.gov
    xlsx
    Updated Nov 21, 2025
    + more versions
    Cite
    Steven R. Evett; Gary W. Marek; Karen S. Copeland; Terry A. Howell Sr.; Paul D. Colaizzi; David K. Brauer; Brice B. Ruthardt (2025). Growth and Yield Data for the Bushland, Texas, Sorghum Datasets [Dataset]. http://doi.org/10.15482/USDA.ADC/1529411
    Dataset updated
    Nov 21, 2025
    Dataset provided by
    Ag Data Commons
    Authors
    Steven R. Evett; Gary W. Marek; Karen S. Copeland; Terry A. Howell Sr.; Paul D. Colaizzi; David K. Brauer; Brice B. Ruthardt
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Bushland, Texas
    Description

    This dataset consists of growth and yield data for each season when sorghum [Sorghum bicolor (L.)] was grown at the USDA-ARS Conservation and Production Laboratory (CPRL), Soil and Water Management Research Unit (SWMRU) research weather station, Bushland, Texas (Lat. 35.186714°, Long. -102.094189°, elevation 1170 m above MSL). In the 1988, 1991, 1993, 1997, 1998, 1999, 2003 through 2007, 2014, and 2015 seasons (13 years), sorghum was grown on from one to four large, precision weighing lysimeters, each in the center of a 4.44 ha square field also planted to sorghum. The square fields were themselves arranged in a larger square with four fields in four adjacent quadrants of the larger square. Fields and lysimeters within each field were thus designated northeast (NE), southeast (SE), northwest (NW), and southwest (SW). Sorghum was grown on different combinations of fields in different years.

    When irrigated, irrigation was by linear move sprinkler system in years before 2014, and by both sprinkler and subsurface drip irrigation in 2014 and 2015. Irrigation protocols described as full were managed to replenish soil water used by the crop on a weekly or more frequent basis, as determined by soil profile water content readings made with a neutron probe to 2.4-m depth in the field. Irrigation protocols described as deficit typically involved irrigation at rates established as percentages of full irrigation, ranging from 33% to 75% depending on the year.

    The growth and yield data include plant population density, height, plant row width, leaf area index, growth stage, total above-ground biomass, leaf and stem biomass, head mass (when present), seed mass, and final yield. Data are from replicate samples in the field and non-destructive (except for final harvest) measurements on the weighing lysimeters. In most cases yield data are available both from manual sampling on replicate plots in each field and from machine harvest. Machine harvest yields are commonly smaller than hand harvest yields due to combine losses.

    These datasets originate from research aimed at determining crop water use (ET), crop coefficients for use in ET-based irrigation scheduling based on a reference ET, crop growth, yield, harvest index, and crop water productivity as affected by irrigation method, timing, amount (full or some degree of deficit), agronomic practices, cultivar, and weather. Prior publications have focused on sorghum ET, crop coefficients, crop water productivity, and simulation modeling of crop water use, growth, and yield. Crop coefficients have been used by ET networks. The data have utility for testing simulation models of crop ET, growth, and yield and have been used for testing and calibrating models of ET that use satellite and/or weather data. See the README for descriptions of each data file.

  18. Current Population Survey (CPS)

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Damico, Anthony (2023). Current Population Survey (CPS) [Dataset]. http://doi.org/10.7910/DVN/AK4FDD
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Damico, Anthony
    Description

    analyze the current population survey (cps) annual social and economic supplement (asec) with r

    the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948. wow. the us census bureau and the bureau of labor statistics (bls) tag-team on this one. until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents. this provides enough sample to conduct state- and a few large metro area-level analyses. your sample size will vanish if you start investigating subgroups by state - consider pooling multiple years. county-level is a no-no.

    despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency. aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be treated as point-in-time statistics. cps-asec generalizes to the united states non-institutional, non-active duty military population.

    the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person). to import these files into r, the parse.SAScii function uses nber's sas code to determine how to import the fixed-width file, then RSQLite to put everything into a schnazzy database. you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show.

    this new github repository contains three scripts:

    2005-2012 asec - download all microdata.R
    - download the fixed-width file containing household, family, and person records
    - import by separating this file into three tables, then merge 'em together at the person-level
    - download the fixed-width file containing the person-level replicate weights
    - merge the rectangular person-level file with the replicate weights, then store it in a sql database
    - create a new variable - one - in the data table

    2012 asec - analysis examples.R
    - connect to the sql database created by the 'download all microdata' program
    - create the complex sample survey object, using the replicate weights
    - perform a boatload of analysis examples

    replicate census estimates - 2011.R
    - connect to the sql database created by the 'download all microdata' program
    - create the complex sample survey object, using the replicate weights
    - match the sas output shown in the png file below

    2011 asec replicate weight sas output.png: statistic and standard error generated from the replicate-weighted example sas script contained in this census-provided person replicate weights usage instructions document.

    click here to view these three scripts

    for more detail about the current population survey - annual social and economic supplement (cps-asec), visit:
    - the census bureau's current population survey page
    - the bureau of labor statistics' current population survey page
    - the current population survey's wikipedia article

    notes: interviews are conducted in march about experiences during the previous year. the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011. when you use the current population survey to talk about america, subtract a year from the data file name. as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research.

    confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D
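
    after the download script builds the person-level table with its replicate weights, the analysis scripts create the complex sample survey object. a minimal sketch with the survey package is below - the weight names (marsupwt, pwwgt1 through pwwgt160) and fay's rho = 0.5 follow the census replicate weight usage instructions mentioned above, so verify them against your file.

    ```r
    # minimal sketch; weight variable names and rho follow the census replicate
    # weight usage instructions referenced above - verify against your file.
    library(survey)

    # cps_asec: person-level data frame built by 'download all microdata.R'
    cps_design <- svrepdesign(
      weights          = ~marsupwt,     # march supplement person weight
      repweights       = "pwwgt[1-9]",  # replicate weight columns
      type             = "Fay", rho = 0.5,
      combined.weights = TRUE,
      data             = cps_asec
    )

    svymean(~htotval, cps_design, na.rm = TRUE)  # e.g. mean total household income
    ```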

  19. House Price Regression Dataset

    • kaggle.com
    zip
    Updated Sep 6, 2024
    Cite
    Prokshitha Polemoni (2024). House Price Regression Dataset [Dataset]. https://www.kaggle.com/datasets/prokshitha/home-value-insights
    zip (27045 bytes)
    Dataset updated
    Sep 6, 2024
    Authors
    Prokshitha Polemoni
    Description

    Home Value Insights: A Beginner's Regression Dataset

    This dataset is designed for beginners to practice regression problems, particularly in the context of predicting house prices. It contains 1000 rows, with each row representing a house and various attributes that influence its price. The dataset is well-suited for learning basic to intermediate-level regression modeling techniques.

    Features:

    1. Square_Footage: The size of the house in square feet. Larger homes typically have higher prices.
    2. Num_Bedrooms: The number of bedrooms in the house. More bedrooms generally increase the value of a home.
    3. Num_Bathrooms: The number of bathrooms in the house. Houses with more bathrooms are typically priced higher.
    4. Year_Built: The year the house was built. Older houses may be priced lower due to wear and tear.
    5. Lot_Size: The size of the lot the house is built on, measured in acres. Larger lots tend to add value to a property.
    6. Garage_Size: The number of cars that can fit in the garage. Houses with larger garages are usually more expensive.
    7. Neighborhood_Quality: A rating of the neighborhood’s quality on a scale of 1-10, where 10 indicates a high-quality neighborhood. Better neighborhoods usually command higher prices.
    8. House_Price (Target Variable): The price of the house, which is the dependent variable you aim to predict.

    Potential Uses:

    1. Beginner Regression Projects: This dataset can be used to practice building regression models such as Linear Regression, Decision Trees, or Random Forests. The target variable (house price) is continuous, making this an ideal problem for supervised learning techniques.

    2. Feature Engineering Practice: Learners can create new features by combining existing ones, such as the price per square foot or age of the house, providing an opportunity to experiment with feature transformations.

    3. Exploratory Data Analysis (EDA): You can explore how different features (e.g., square footage, number of bedrooms) correlate with the target variable, making it a great dataset for learning about data visualization and summary statistics.

    4. Model Evaluation: The dataset allows for various model evaluation techniques such as cross-validation, R-squared, and Mean Absolute Error (MAE). These metrics can be used to compare the effectiveness of different models.

    Versatility:

    • The dataset is highly versatile for a range of machine learning tasks. You can apply simple linear models to predict house prices based on one or two features, or use more complex models like Random Forest or Gradient Boosting Machines to understand interactions between variables.

    • It can also be used for dimensionality reduction techniques like PCA or to practice handling categorical variables (e.g., neighborhood quality) through encoding techniques like one-hot encoding.

    • This dataset is ideal for anyone wanting to gain practical experience in building regression models while working with real-world features.
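
    As a baseline for the uses listed above, the sketch below fits a linear regression on all features with a simple train/test split and reports MAE; the CSV file name is a placeholder for your local copy of the dataset.

    ```r
    # Minimal sketch; the CSV name is a placeholder for your local copy.
    df <- read.csv("home_value_insights.csv")

    set.seed(42)
    idx   <- sample(nrow(df), 0.8 * nrow(df))
    train <- df[idx, ]
    test  <- df[-idx, ]

    fit  <- lm(House_Price ~ ., data = train)  # all listed features as predictors
    pred <- predict(fit, newdata = test)

    mae <- mean(abs(pred - test$House_Price))  # mean absolute error on held-out rows
    summary(fit)$r.squared
    mae
    ```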

  20. Data from: A Machine Learning Model to Estimate Toxicokinetic Half-Lives of...

    • catalog.data.gov
    • datasets.ai
    Updated Apr 30, 2023
    + more versions
    Cite
    U.S. EPA Office of Research and Development (ORD) (2023). A Machine Learning Model to Estimate Toxicokinetic Half-Lives of Per- and Polyfluoro-Alkyl Substances (PFAS) in Multiple Species [Dataset]. https://catalog.data.gov/dataset/a-machine-learning-model-to-estimate-toxicokinetic-half-lives-of-per-and-polyfluoro-alkyl-
    Dataset updated
    Apr 30, 2023
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    Data and code for "Dawson, D.E.; Lau, C.; Pradeep, P.; Sayre, R.R.; Judson, R.S.; Tornero-Velez, R.; Wambaugh, J.F. A Machine Learning Model to Estimate Toxicokinetic Half-Lives of Per- and Polyfluoro-Alkyl Substances (PFAS) in Multiple Species. Toxics 2023, 11, 98. https://doi.org/10.3390/toxics11020098". Includes a link to an R Markdown file allowing application of the model to novel chemicals. This dataset is associated with the following publication: Dawson, D., C. Lau, P. Pradeep, R. Sayre, R. Judson, R. Tornero-Velez, and J. Wambaugh. A Machine Learning Model to Estimate Toxicokinetic Half-Lives of Per- and Polyfluoro-Alkyl Substances (PFAS) in Multiple Species. Toxics. MDPI, Basel, SWITZERLAND, 11(2): 98, (2023).
