100+ datasets found
  1. Large Datasets in R - Plant Phenology & Temperature Data from NEON

    • qubeshub.org
    Updated May 10, 2018
    Cite
    Megan Jones Patterson; Lee Stanish; Natalie Robinson; Katherine Jones; Cody Flagg (2018). Large Datasets in R - Plant Phenology & Temperature Data from NEON [Dataset]. http://doi.org/10.25334/Q4DQ3F
    Dataset updated
    May 10, 2018
    Dataset provided by
    QUBES
    Authors
    Megan Jones Patterson; Lee Stanish; Natalie Robinson; Katherine Jones; Cody Flagg
    Description

    This module series covers how to import, manipulate, format, and plot time series data stored in .csv format in R. Originally designed to teach researchers to use NEON plant phenology and air temperature data, it has also been used in undergraduate classrooms.
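    A minimal sketch of the kind of workflow the module teaches; the file name and column names ("date", "airt") are placeholders, not the module's actual data files.

```r
# Import a .csv of daily air temperature, parse the date column, and plot
# the series. File name and column names are illustrative placeholders.
library(ggplot2)

temps <- read.csv("NEON-airtemp-daily.csv", stringsAsFactors = FALSE)
temps$date <- as.Date(temps$date)

ggplot(temps, aes(x = date, y = airt)) +
  geom_line() +
  labs(x = "Date", y = "Mean daily air temperature (C)")
```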

  2. Data from: Optimized SMRT-UMI protocol produces highly accurate sequence...

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    application/gzip, bin
    Updated Dec 8, 2023
    + more versions
    Cite
    Dylan Westfall; Dylan Westfall; Mullins James; Mullins James (2023). Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies [Dataset]. http://doi.org/10.5061/dryad.w3r2280w0
    Available download formats: application/gzip, bin
    Dataset updated
    Dec 8, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Dylan Westfall; Dylan Westfall; Mullins James; Mullins James
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Measurement technique
    This serves as an overview of the analysis performed on PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al., "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies".

    Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005.

    For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub.

    The demultiplexed read collections from the chunked_demux pipeline, or CCS read files from the datasets which were not indexed (M1567, M004, M005), were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer) and further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub.

    The consensus.tar.gz and tagged.tar.gz files were moved from the sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each has an `.Rmd` file with the same name inside, which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results.

    `Sequence_Analysis.Rmd` has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for `Indentifying_Recombinant_Reads.Rmd` and `Figures.Rmd`. Next, two fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the python script compare_seqs.py, which identifies any matched sequences that are different between the sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program.

    To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed and, for each family with discordant sUMI and dUMI sequences, the reads from the UMI1_keeping directory were aligned using Geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Geneious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment, and specifically the UMI2 sequences, the site of the discordance and its cause were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper.

    Using `Identifying_Recombinant_Reads.Rmd`, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined, and used as input to create a single dataset containing all UMI information from all samples. This file, dUMI_df.csv, was used as input for `Figures.Rmd`.

    `Figures.Rmd` used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for each figure. These were copied into Prism software to create the final figures for the paper.
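    For orientation, the "decompress and combine" step described above might look roughly like the following in R. This is a hedged sketch, not the authors' `Sequence_Analysis.Rmd`; the folder and file name patterns are assumptions.

```r
# Hedged sketch of decompressing the consensus archives and combining the
# sUMI consensus sequences into one FASTA file. Not the authors'
# Sequence_Analysis.Rmd; folder and file name patterns are assumptions.
library(Biostrings)  # Bioconductor package for reading/writing FASTA

archives <- list.files("Pipeline_Outputs",
                       pattern = "^consensus_.*\\.tar\\.gz$", full.names = TRUE)
for (a in archives) untar(a, exdir = "consensus_unpacked")

sumi_files <- list.files("consensus_unpacked", pattern = "sUMI.*\\.fasta$",
                         recursive = TRUE, full.names = TRUE)
sumi_seqs <- do.call(c, lapply(sumi_files, readDNAStringSet))

writeXStringSet(sumi_seqs, "all_sUMI_consensus.fasta")
```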
    Description

    Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR and the use of UMI allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies.

  3. Data_Sheet_4_“R” U ready?: a case study using R to analyze changes in gene...

    • frontiersin.figshare.com
    docx
    Updated Mar 22, 2024
    + more versions
    Cite
    Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder (2024). Data_Sheet_4_“R” U ready?: a case study using R to analyze changes in gene expression during evolution.docx [Dataset]. http://doi.org/10.3389/feduc.2024.1379910.s004
    Available download formats: docx
    Dataset updated
    Mar 22, 2024
    Dataset provided by
    Frontiers
    Authors
    Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    As high-throughput methods become more common, training undergraduates to analyze data must include having them generate informative summaries of large datasets. This flexible case study provides an opportunity for undergraduate students to become familiar with the capabilities of R programming in the context of high-throughput evolutionary data collected using macroarrays. The story line introduces a recent graduate hired at a biotech firm and tasked with analysis and visualization of changes in gene expression from 20,000 generations of the Lenski Lab’s Long-Term Evolution Experiment (LTEE). Our main character is not familiar with R and is guided by a coworker to learn about this platform. Initially this involves a step-by-step analysis of the small Iris dataset built into R which includes sepal and petal length of three species of irises. Practice calculating summary statistics and correlations, and making histograms and scatter plots, prepares the protagonist to perform similar analyses with the LTEE dataset. In the LTEE module, students analyze gene expression data from the long-term evolutionary experiments, developing their skills in manipulating and interpreting large scientific datasets through visualizations and statistical analysis. Prerequisite knowledge is basic statistics, the Central Dogma, and basic evolutionary principles. The Iris module provides hands-on experience using R programming to explore and visualize a simple dataset; it can be used independently as an introduction to R for biological data or skipped if students already have some experience with R. Both modules emphasize understanding the utility of R, rather than creation of original code. Pilot testing showed the case study was well-received by students and faculty, who described it as a clear introduction to R and appreciated the value of R for visualizing and analyzing large datasets.
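    The Iris module's introductory steps correspond to standard base R commands along these lines (illustrative, not the case study's own code):

```r
# Summary statistics, a correlation, a histogram, and a scatter plot on the
# built-in iris data - the kind of steps the Iris module walks through
# (illustrative, not the case study's own code).
data(iris)

summary(iris)                                   # summary statistics
cor(iris$Sepal.Length, iris$Petal.Length)       # correlation
hist(iris$Sepal.Length, main = "Sepal length")  # histogram
plot(Petal.Length ~ Sepal.Length, data = iris,
     col = iris$Species)                        # scatter plot, colored by species
```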

  4. Data Scientists vs Size of Datasets

    • kaggle.com
    zip
    Updated Oct 18, 2016
    Cite
    Laurae (2016). Data Scientists vs Size of Datasets [Dataset]. https://www.kaggle.com/laurae2/data-scientists-vs-size-of-datasets
    Available download formats: zip (1,191 bytes)
    Dataset updated
    Oct 18, 2016
    Authors
    Laurae
    Description

    This research study was conducted to analyze the (potential) relationship between hardware and data set sizes. One hundred data scientists from France were interviewed between January 2016 and August 2016 to obtain exploitable data; this sample might therefore not be representative of the true population.

    What can you do with the data?

    • Look up whether Kagglers have "stronger" hardware than non-Kagglers
    • Whether there is a correlation between a preferred data set size and hardware
    • Is proficiency a predictor of specific preferences?
    • Are data scientists more Intel or AMD?
    • How widespread is GPU computing, and is there any relationship with Kaggling?
    • Are you able to predict the amount of euros a data scientist might invest, provided their current workstation details?

    I did not find any past research on a similar scale. You are free to play with this data set. For re-usage of this data set out of Kaggle, please contact the author directly on Kaggle (use "Contact User"). Please mention:

    • Your intended usage (research? business use? blogging?...)
    • Your first/last name

    Arbitrarily, we chose characteristics to describe Data Scientists and data set sizes.

    Data set size:

    • Small: under 1 million values
    • Medium: between 1 million and 1 billion values
    • Large: over 1 billion values

    For the data, it uses the following fields (DS = Data Scientist, W = Workstation):

    • DS_1 = Are you working with "large" data sets at work? (large = over 1 billion values) => Yes or No
    • DS_2 = Do you enjoy working with large data sets? => Yes or No
    • DS_3 = Would you rather have small, medium, or large data sets for work? => Small, Medium, or Large
    • DS_4 = Do you have any presence at Kaggle or any other Data Science platforms? => Yes or No
    • DS_5 = Do you view yourself proficient at working in Data Science? => Yes, A bit, or No
    • W_1 = What is your CPU brand? => Intel or AMD
    • W_2 = Do you have access to a remote server to perform large workloads? => Yes or No
    • W_3 = How many euros would you invest in brand-new Data Science hardware? => numeric output, rounded to the nearest 100
    • W_4 = How many cores do you have to work with data sets? => numeric output
    • W_5 = How much RAM (in GB) do you have to work with data sets? => numeric output
    • W_6 = Do you do GPU computing? => Yes or No
    • W_7 = What programming languages do you use for Data Science? => R or Python (any other answer accepted)
    • W_8 = What programming languages do you use for pure statistical analysis? => R or Python (any other answer accepted)
    • W_9 = What programming languages do you use for training models? => R or Python (any other answer accepted)

    You should expect potential noise in the data set. As with all research, it might not be free of internal contradictions.
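    Assuming the CSV exposes the field names listed above (the file name below is a placeholder), a first look in R could be:

```r
# Quick exploration of the survey, assuming columns named as in the field
# list above; the file name is a placeholder.
ds <- read.csv("data-scientists-vs-size-of-datasets.csv", stringsAsFactors = FALSE)

# Do respondents with a Kaggle/Data Science platform presence (DS_4)
# report more RAM (W_5) than those without?
aggregate(W_5 ~ DS_4, data = ds, FUN = median)

# Is preferred data set size (DS_3) associated with remote-server access (W_2)?
table(ds$DS_3, ds$W_2)
```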

  5. Glassdoor Job Reviews 2

    • kaggle.com
    zip
    Updated Oct 15, 2024
    Cite
    DG (2024). Glassdoor Job Reviews 2 [Dataset]. https://www.kaggle.com/datasets/davidgauthier/glassdoor-job-reviews-2/code
    Available download formats: zip (1,074,079,348 bytes)
    Dataset updated
    Oct 15, 2024
    Authors
    DG
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This large dataset (over 8 million observations) contains job descriptions and ratings across various criteria such as work-life balance, income, culture, etc.

    This data set complements the Glassdoor dataset located [here](https://www.kaggle.com/datasets/davidgauthier/glassdoor-job-reviews).

    Please cite as: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=FQstpaoAAAAJ&citation_for_view=FQstpaoAAAAJ:UebtZRa9Y70C

    Glassdoor Reviews

    Glassdoor produces reports based upon the data collected from its users, on topics including work–life balance, CEO pay ratios, lists of the best office places and cultures, and the accuracy of corporate job searching maxims. Data from Glassdoor has also been used by outside sources to produce estimates on the effects of salary trends and changes on corporate revenues. Glassdoor also puts the conclusions of its research of other companies towards its company policies. In 2015, Tom Lakin produced the first study of Glassdoor in the United Kingdom, concluding that Glassdoor is regarded by users as a more trustworthy source of information than career guides or official company documents.

    Features

    The columns correspond to the date of the review, the job name, the job location, the status of the reviewers, and the reviews. Reviews are divided into the sub-categories Career Opportunities, Comp & Benefits, Culture & Values, Senior Management, and Work/Life Balance. In addition, employees can add recommendations on the firm, the CEO, and the outlook.

    Other information

    Ranking for the recommendation of the firm, CEO approval, and outlook are allocated categories v, r, x, and o, with the following meanings: v - Positive, r - Mild, x - Negative, o - No opinion.
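    These letter codes can be recoded into readable labels along these lines; this is a sketch, and the file and column names ("recommend") are assumptions rather than the dataset's documented schema.

```r
# Recode the v/r/x/o categories described above into readable labels.
# File and column names are assumptions, not the documented schema.
reviews <- read.csv("glassdoor_reviews_2.csv", stringsAsFactors = FALSE)

code_labels <- c(v = "Positive", r = "Mild", x = "Negative", o = "No opinion")
reviews$recommend_label <- unname(code_labels[reviews$recommend])

table(reviews$recommend_label, useNA = "ifany")
```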

    Some examples of the textual data entries

    MCDONALD-S I don't like working here,don't work here Headline: I don't like working here,don't work here Pros: Some people are nice,some free food,some of the managers are nice about 95% of the time Cons: 95% of people are mean to employees/customers,its not a clean place,people barely clean their hands of what i see,managers are mean,i got a stress rash because of this i can't get rid of it,they don't give me a little raise even though i do alot of crap there for them Rating: 1.0

    KPMG Quit working people to death Headline: Quit working people to death Pros: Lots of PTO, Good company training Cons: long hours, clear disconnect between management and staff, as corporate as it gets Rating: 2.0

    PRIMARK Sales assistant Headline: Sales assistant Pros: Lovely staff, managers are also very nice Cons: Hardwork, often rude customers, underpaid for u18 Rating: 3.0

    J-P-MORGAN Life in JPM, Bangalore Headline: Life in JPM, Bangalore Pros: Good place to start, lots of opportunity. Cons: Be ready to put in a lot of effort not a place to chill out. Rating: 4.0

    VODAFONE Good to be here Headline: Good to be here Pros: Fast moving with technology. Leading Cons: There are areas you may want to avoid Rating: 5.0

  6. Current Population Survey (CPS)

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Damico, Anthony (2023). Current Population Survey (CPS) [Dataset]. http://doi.org/10.7910/DVN/AK4FDD
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Damico, Anthony
    Description

    analyze the current population survey (cps) annual social and economic supplement (asec) with r. the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948. wow. the us census bureau and the bureau of labor statistics (bls) tag-team on this one. until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents. this provides enough sample to conduct state- and a few large metro area-level analyses. your sample size will vanish if you start investigating subgroups by state - consider pooling multiple years. county-level is a no-no. despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency. aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be treated as point-in-time statistics. cps-asec generalizes to the united states non-institutional, non-active duty military population. the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person). to import these files into r, the parse.SAScii function uses nber's sas code to determine how to import the fixed-width file, then RSQLite to put everything into a schnazzy database. you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show. this new github repository contains three scripts:

    2005-2012 asec - download all microdata.R: download the fixed-width file containing household, family, and person records; import by separating this file into three tables, then merge 'em together at the person-level; download the fixed-width file containing the person-level replicate weights; merge the rectangular person-level file with the replicate weights, then store it in a sql database; create a new variable - one - in the data table.

    2012 asec - analysis examples.R: connect to the sql database created by the 'download all microdata' program; create the complex sample survey object, using the replicate weights; perform a boatload of analysis examples.

    replicate census estimates - 2011.R: connect to the sql database created by the 'download all microdata' program; create the complex sample survey object, using the replicate weights; match the sas output shown in the png file 2011 asec replicate weight sas output.png (statistic and standard error generated from the replicate-weighted example sas script contained in this census-provided person replicate weights usage instructions document). click here to view these three scripts.

    for more detail about the current population survey - annual social and economic supplement (cps-asec), visit: the census bureau's current population survey page; the bureau of labor statistics' current population survey page; the current population survey's wikipedia article.

    notes: interviews are conducted in march about experiences during the previous year. the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011. when you use the current population survey to talk about america, subtract a year from the data file name. as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research. confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D
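    A hedged sketch of the import-and-analyze approach described above; the file names and the weight variable names are placeholders, and the repository's actual scripts differ in detail.

```r
# Hedged sketch of the approach described above: read the fixed-width ASEC
# file using NBER's SAS input statements, then build a replicate-weight
# survey design. File names and variable names are placeholders; the
# repository's own scripts differ in detail.
library(SAScii)   # read fixed-width files using SAS INPUT code
library(survey)   # complex-sample analysis with replicate weights

sas_script <- "cpsmar2012.sas"       # NBER SAS importation script (placeholder)
asec_file  <- "asec2012_pubuse.dat"  # fixed-width microdata file (placeholder)

asec <- read.SAScii(asec_file, sas_script)

asec_design <- svrepdesign(
  weights    = ~marsupwt,            # person weight (assumed variable name)
  repweights = "pwwgt[0-9]+",        # replicate weights (assumed name pattern)
  type       = "Fay",
  rho        = 0.5,
  data       = asec
)

svymean(~htotval, asec_design, na.rm = TRUE)  # e.g. mean household income
```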

  7. Glassdoor Job Reviews

    • kaggle.com
    zip
    Updated Oct 15, 2024
    Cite
    DG (2024). Glassdoor Job Reviews [Dataset]. https://www.kaggle.com/datasets/davidgauthier/glassdoor-job-reviews/code
    Available download formats: zip (87,995,333 bytes)
    Dataset updated
    Oct 15, 2024
    Authors
    DG
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This large dataset contains job descriptions and ratings across various criteria such as work-life balance, income, culture, etc. The data covers various industries in the UK. It is a great dataset for multidimensional sentiment analysis.

    This data set complements the Glassdoor dataset located [here](https://www.kaggle.com/datasets/davidgauthier/glassdoor-job-reviews-2).

    Please cite as: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=FQstpaoAAAAJ&citation_for_view=FQstpaoAAAAJ:UebtZRa9Y70C

    Glassdoor Reviews

    Glassdoor produces reports based upon the data collected from its users, on topics including work–life balance, CEO pay ratios, lists of the best office places and cultures, and the accuracy of corporate job searching maxims. Data from Glassdoor has also been used by outside sources to produce estimates on the effects of salary trends and changes on corporate revenues. Glassdoor also puts the conclusions of its research of other companies towards its own company policies. In 2015, Tom Lakin produced the first study of Glassdoor in the United Kingdom, concluding that Glassdoor is regarded by users as a more trustworthy source of information than career guides or official company documents.

    Features

    The columns correspond to the date of the review, the job name, the job location, the status of the reviewers, and the reviews. Reviews are divided into the sub-categories Career Opportunities, Comp & Benefits, Culture & Values, Senior Management, and Work/Life Balance. In addition, employees can add recommendations on the firm, the CEO, and the outlook.

    Other information

    Ranking for the recommendation of the firm, CEO approval, and outlook are allocated categories v, r, x, and o, with the following meanings: v - Positive, r - Mild, x - Negative, o - No opinion

    Some examples of the textual data entries

    MCDONALD-S I don't like working here,don't work here Headline: I don't like working here,don't work here Pros: Some people are nice,some free food,some of the managers are nice about 95% of the time Cons: 95% of people are mean to employees/customers,its not a clean place,people barely clean their hands of what i see,managers are mean,i got a stress rash because of this i can't get rid of it,they don't give me a little raise even though i do alot of crap there for them Rating: 1.0

    KPMG Quit working people to death Headline: Quit working people to death Pros: Lots of PTO, Good company training Cons: long hours, clear disconnect between management and staff, as corporate as it gets Rating: 2.0

    PRIMARK Sales assistant Headline: Sales assistant Pros: Lovely staff, managers are also very nice Cons: Hardwork, often rude customers, underpaid for u18 Rating: 3.0

    J-P-MORGAN Life in JPM, Bangalore Headline: Life in JPM, Bangalore Pros: Good place to start, lots of opportunity. Cons: Be ready to put in a lot of efforts not a place to chill out. Rating: 4.0

    VODAFONE Good to be here Headline: Good to be here Pros: Fast moving with technology. Leading Cons: There are areas you may want to avoid Rating: 5.0

  8. Dataset of books by R. Brad Long

    • workwithdata.com
    Updated Apr 17, 2025
    + more versions
    Cite
    Work With Data (2025). Dataset of books by R. Brad Long [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=author&fop0=%3D&fval0=R.+Brad+Long
    Dataset updated
    Apr 17, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about books. It has 2 rows and is filtered where the author is R. Brad Long. It features 7 columns including author, publication date, language, and book publisher.

  9. Data from: phyloraster: an R package to calculate measures of endemism and...

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Aug 15, 2023
    Cite
    Gabriela Alves-Ferreira; Flávio Mota; Daniela Talora; Cynthia Oliveira; Mirco Solé; Neander Heming (2023). phyloraster: an R package to calculate measures of endemism and evolutionary diversity for rasters [Dataset]. http://doi.org/10.7910/DVN/QSNTSG
    Available download formats: Croissant (a machine-learning dataset format; see mlcommons.org/croissant)
    Dataset updated
    Aug 15, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Gabriela Alves-Ferreira; Flávio Mota; Daniela Talora; Cynthia Oliveira; Mirco Solé; Neander Heming
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    With the increasing size and complexity of biogeographical and phylogenetic data, it is important to provide functions that allow the manipulation of these datasets efficiently and rapidly. The phylogrid package aims to facilitate the spatial analysis of measures of diversity and endemism. Our package provides a set of functions to join the results derived from species distribution models (SDMs) with phylogenies and endemism data. The functions are focused on pre-processing, processing and post-processing of macroecological and phylogenetic data. The pre-processing step offers basic functions for preparing the data before running the analyses. The processing step brings together functions to calculate Faith’s phylogenetic diversity, phylogenetic endemism, weighted endemism and evolutionary distinctiveness. This step also provides functions to calculate the standardized effect size for each metric through different methods of spatial and phylogenetic randomization, aiming to control for richness effects. The post-processing stage includes functions to calculate the delta of metrics between different times (e.g. present and future). We have shown that the package has a slightly longer computation time than comparable packages, but takes up a considerably smaller portion of RAM, which will allow users to work with high-resolution datasets from local to global scales. This enhances the application of the package by enabling users to work with large datasets on computers with less RAM available.

  10. eBird - Dataset - INTAROS Data Catalogue

    • catalog-intaros.nersc.no
    Updated Nov 5, 2020
    + more versions
    Cite
    (2020). eBird - Dataset - INTAROS Data Catalogue [Dataset]. https://catalog-intaros.nersc.no/dataset/ebird
    Dataset updated
    Nov 5, 2020
    Description

    eBird is among the world’s largest biodiversity-related science projects, with more than 100 million bird sightings contributed annually by eBirders around the world and an average participation growth rate of approximately 20% year over year. eBird is managed by the Cornell Lab of Ornithology. Some data has been contributed under INTAROS WP4. eBird provides open data access in several formats to logged-in users, ranging from raw data to processed datasets geared toward more rigorous scientific modeling.

    eBird Basic Dataset (EBD)

    The EBD is the core dataset for accessing all raw eBird observations and associated metadata. The EBD is updated monthly (15th of each month), and is available by direct download through eBird to any logged-in user after completion of a data request form. The data request form allows us to gain some understanding of how the data will be used. Requests are typically approved within 7 days. Data are provided with documentation in spreadsheet format, which can be read by a variety of programs. Although Excel or similar programs work for basic analyses, for larger datasets (>1 million rows) or more sophisticated analyses, we recommend using programs like R. There are several R packages available for summarizing data, including one that is managed here at the Cornell Lab specifically for working with the EBD dataset. The data collection may enable a better understanding of bird population dynamics and the status of bird species including bird conservation management requirements.
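    Since the EBD ships as tab-delimited text, a small extract can be read directly in base R; the file and column names below are assumptions, and the Cornell-maintained R package mentioned above is the more robust route for full-size files.

```r
# Read a small extract of the eBird Basic Dataset (tab-delimited text).
# File and column names are assumptions; for full-size EBD files the
# Cornell-maintained R package mentioned above is the better route.
ebd <- read.delim("ebd_sample.txt", sep = "\t", quote = "",
                  stringsAsFactors = FALSE)

length(unique(ebd$SAMPLING.EVENT.IDENTIFIER))           # number of checklists (assumed column)
head(sort(table(ebd$COMMON.NAME), decreasing = TRUE))   # most-reported species (assumed column)
```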

  11. R-HORIZON-training-data

    • huggingface.co
    Updated Oct 22, 2025
    + more versions
    Cite
    LongCat (2025). R-HORIZON-training-data [Dataset]. https://huggingface.co/datasets/meituan-longcat/R-HORIZON-training-data
    Dataset updated
    Oct 22, 2025
    Dataset authored and provided by
    LongCat
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    R-HORIZON

    How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?

    📃 Paper • 🌐 Project Page • 🤗 Dataset

    R-HORIZON is a novel method designed to stimulate long-horizon reasoning behaviors in Large Reasoning Models (LRMs) through query composition. We transform isolated problems into complex multi-step reasoning scenarios, revealing that even the most advanced LRMs suffer significant performance degradation when facing interdependent problems that span… See the full description on the dataset page: https://huggingface.co/datasets/meituan-longcat/R-HORIZON-training-data.

  12. [Superseded] Intellectual Property Government Open Data 2019

    • researchdata.edu.au
    • data.gov.au
    Updated Jun 6, 2019
    + more versions
    Cite
    IP Australia (2019). [Superseded] Intellectual Property Government Open Data 2019 [Dataset]. https://researchdata.edu.au/superseded-intellectual-property-data-2019/2994670
    Dataset updated
    Jun 6, 2019
    Dataset provided by
    Data.gov (https://data.gov/)
    Authors
    IP Australia
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    What is IPGOD?

    The Intellectual Property Government Open Data (IPGOD) includes over 100 years of registry data on all intellectual property (IP) rights administered by IP Australia. It also has derived information about the applicants who filed these IP rights, to allow for research and analysis at the regional, business and individual level. This is the 2019 release of IPGOD.

    How do I use IPGOD?

    IPGOD is large, with millions of data points across up to 40 tables, making them too large to open with Microsoft Excel. Furthermore, analysis often requires information from separate tables, which would need specialised software for merging. We recommend that advanced users interact with the IPGOD data using the right tools with enough memory and compute power. This includes a wide range of programming and statistical software such as Tableau, Power BI, Stata, SAS, R, Python, and Scala.
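    As an illustration of working with tables of this size in R rather than Excel, a hedged sketch; the table file names and the join key are placeholders, not the actual IPGOD schema, so check the data dictionary first.

```r
# Hedged sketch of loading and merging two large IPGOD tables in R.
# File names and the join key are placeholders, not the actual IPGOD
# schema; consult the data dictionary for real table and column names.
library(data.table)

applications <- fread("ipgod_table_A.csv")  # placeholder file name
applicants   <- fread("ipgod_table_B.csv")  # placeholder file name

merged <- merge(applications, applicants,
                by = "application_id",      # placeholder join key
                all.x = TRUE)

nrow(merged)
```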

    IP Data Platform

    IP Australia is also providing free trials of a cloud-based analytics platform with the capabilities to enable working with large intellectual property datasets, such as the IPGOD, through the web browser, without any installation of software.

    References

    The following pages can help you gain an understanding of intellectual property administration and processes in Australia to support your analysis of the dataset.

    • Patents
    • Trade Marks
    • Designs
    • Plant Breeder’s Rights

    Updates

    Tables and columns

    Due to changes in our systems, some tables have been affected.

    • We have added IPGOD 225 and IPGOD 325 to the dataset!
    • The IPGOD 206 table is not available this year.
    • Many tables have been re-built, and as a result may have different columns or different possible values. Please check the data dictionary for each table before use.

    Data quality improvements

    Data quality has been improved across all tables.

    • Null values are simply empty rather than '31/12/9999'.
    • All date columns are now in ISO format 'yyyy-mm-dd'.
    • All indicator columns have been converted to Boolean data type (True/False) rather than Yes/No, Y/N, or 1/0.
    • All tables are encoded in UTF-8.
    • All tables use the backslash \ as the escape character.
    • The applicant name cleaning and matching algorithms have been updated. We believe that this year's method improves the accuracy of the matches. Please note that the "ipa_id" generated in IPGOD 2019 will not match with those in previous releases of IPGOD.

  13. bellabeat_dataset

    • kaggle.com
    zip
    Updated Nov 24, 2025
    Cite
    Reiner Serrano (2025). bellabeat_dataset [Dataset]. https://www.kaggle.com/datasets/reinerserrano/bellabeat-dataset
    Available download formats: zip (727,100 bytes)
    Dataset updated
    Nov 24, 2025
    Authors
    Reiner Serrano
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Hi everyone! Iʼm currently in the 8th course of the Data Analysis program on Coursera, and this is my first case study completed entirely on my own (following the steps provided throughout the course). I chose to work with R because it was one of the tools that interested me the most during my learning journey. In my opinion, itʼs incredibly powerful — you can handle large datasets and create visualizations with just a few commands. Iʼm really happy with the final result and I feel I learned a lot throughout the process. Iʼve truly fallen in love with data and its power to create change. Thanks for reading, and Iʼd be happy to receive any feedback!

  14. Reddit: /r/technology (Submissions & Comments)

    • kaggle.com
    Updated Dec 18, 2022
    Cite
    The Devastator (2022). Reddit: /r/technology (Submissions & Comments) [Dataset]. https://www.kaggle.com/datasets/thedevastator/uncovering-technology-insights-through-reddit-di
    Available download formats: Croissant (a machine-learning dataset format; see mlcommons.org/croissant)
    Dataset updated
    Dec 18, 2022
    Dataset provided by
    Kaggle
    Authors
    The Devastator
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Reddit: /r/technology (Submissions & Comments)

    Title, Score, ID, URL, Comment Number, and Timestamp

    By Reddit [source]

    About this dataset

    This dataset, labeled as Reddit Technology Data, provides insight into the conversations and interactions around technology-related topics shared on Reddit, a well-known Internet discussion forum. It contains the titles of discussions, scores as contributed by Reddit users, the unique IDs attributed to different discussions, the URLs associated with those discussions (if any), comment counts in each discussion thread, and timestamps of when those conversations were initiated. As such, this data is valuable for tech-savvy people wanting to stay up to date with new developments in their field, or for professionals looking to keep abreast of industry trends. In short, it is a repository that helps people make sense of what’s happening in the technology world at large, inspiring action on their part or simply educating them about forthcoming changes.


    How to use the dataset

    The dataset includes six columns: the title, the score, the URL of the discussion page on Reddit, the comment count, the created timestamp (when the post was made), and the body containing the actual text of the post or discussion. Analyzing each column separately shows what kind of information users engage with across different aspects of technology-related discussion. You can also form hypotheses about correlations between factors such as score and comment count - for example, what people mostly comment on or react to, and whether highly rated posts always come with very long comment threads. Explored this way, a social platform like Reddit yields a large amount of rich information about users' interest in technology topics, and similar analyses of reactions to information shared through other public forums (Stack Overflow threads, Facebook posts, etc.) can add further insight. These findings can support research as well as potential business opportunities if monitored over time.

    Research Ideas

    • Companies can use this dataset to create targeted online marketing campaigns directed towards Reddit users interested in specific areas of technology.
    • Academic researchers can use the data to track and analyze trends in conversations related to technology on Reddit over time.
    • Technology professionals can utilize the comments and discussions on this dataset as a way of gauging public opinion and consumer sentiment towards certain technological advancements or products

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: technology.csv

    | Column name | Description |
    |:------------|:------------|
    | title | The title of the discussion. (String) |
    | score | The score of the discussion as measured by Reddit contributors. (Integer) |
    | url | The website URL associated with the discussion. (String) |
    | comms_num | The number of comments associated with the discussion. (Integer) |
    | created | The date and time the discussion was created. (DateTime) |
    | body | The body content of the discussion. (String) |
    | timestamp | The timestamp of the discussion. (Integer) |
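    Given the columns above, a quick start in R could look like this; it assumes `timestamp` is a Unix epoch, which is typical for Reddit exports but not confirmed here.

```r
# Read technology.csv and convert the integer timestamp column, assuming it
# is a Unix epoch (typical for Reddit exports but not confirmed here).
tech <- read.csv("technology.csv", stringsAsFactors = FALSE)

tech$timestamp <- as.POSIXct(tech$timestamp, origin = "1970-01-01", tz = "UTC")

# Top 10 submissions by score
head(tech[order(-tech$score), c("title", "score", "comms_num")], 10)
```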

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Reddit.

  15. Film Circulation dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, png
    Updated Jul 12, 2024
    Cite
    Skadi Loist; Skadi Loist; Evgenia (Zhenya) Samoilova; Evgenia (Zhenya) Samoilova (2024). Film Circulation dataset [Dataset]. http://doi.org/10.5281/zenodo.7887672
    Available download formats: csv, png, bin
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Skadi Loist; Skadi Loist; Evgenia (Zhenya) Samoilova; Evgenia (Zhenya) Samoilova
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

    A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies, an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org

    Please cite this when using the dataset.


    Detailed description of the dataset:

    1 Film Dataset: Festival Programs

    The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

    The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

    The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.


    2 Survey Dataset

    The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

    The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

    The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.


    3 IMDb & Scripts

    The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

    The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

    The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

    The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

    The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

    The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

    The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

    The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

    The csv file “3_imdb-dataset_release-info_long” contains data about non-festival release (e.g., theatrical, digital, tv, dvd/blueray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

    The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

    The dataset includes 8 text files containing the script for webscraping. They were written using the R-3.6.3 version for Windows.

    The R script “r_1_unite_data” demonstrates the structure of the dataset, that we use in the following steps to identify, scrape, and match the film data.

    The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then potentially using an alternative title and a basic search if no matches are found in the advanced search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records from the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach using two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.
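    The two fuzzy-matching methods named above are available in the stringdist R package; the snippet below is a hedged illustration, not the scripts' actual implementation, and the titles are made up.

```r
# Hedged illustration of the two fuzzy matching methods named above, using
# the stringdist package; not the actual implementation in r_2_scrape_matches.
library(stringdist)

core_title  <- "The Souvenir"
imdb_titles <- c("The Souvenir", "The Souvenir: Part II", "Souvenirs")

# Cosine similarity on character 2-grams: values near 1 mean very similar titles
stringsim(core_title, imdb_titles, method = "cosine", q = 2)

# OSA (optimal string alignment) distance: small values tolerate typos/minor edits
stringdist(core_title, imdb_titles, method = "osa")
```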

    The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of the following five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.

    The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

    The script “r_5a_extracting_info_sample” uses the function defined in the “r_4_scraping_functions”, in order to scrape the IMDb data for the identified matches. This script does that for the first 100 films, to check, if everything works. Scraping for the entire dataset took a few hours. Therefore, a test with a subsample of 100 films is advisable.

    The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

    The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

    The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.


    4 Festival Library Dataset

    The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

    The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories,

  16. Replication Data for: kluster: An Efficient Scalable Procedure for...

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Apr 15, 2018
    + more versions
    Cite
    Hossein Estiri (2018). Replication Data for: kluster: An Efficient Scalable Procedure for Approximating the Number of Clusters in Unsupervised Learning [Dataset]. http://doi.org/10.7910/DVN/LLIOHM
    Available download formats: Croissant (a machine-learning dataset format; see mlcommons.org/croissant)
    Dataset updated
    Apr 15, 2018
    Dataset provided by
    Harvard Dataverse
    Authors
    Hossein Estiri
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    182 simulated datasets (the first set contains small datasets and the second set contains large datasets) with different cluster compositions – i.e., different numbers of clusters and separation values – generated using the clusterGeneration package in R. Each set of simulation datasets consists of 91 datasets in comma separated values (csv) format (a total of 182 csv files) with 3-15 clusters and 0.1 to 0.7 separation values. Separation values can range between (−0.999, 0.999), where a higher separation value indicates a cluster structure with more separable clusters. The size of the dataset, the number of clusters, and the separation value of the clusters are encoded in the file name, size_X_n_Y_sepval_Z.csv, where X is the size of the dataset, Y is the number of clusters, and Z is the separation value.
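    Similar simulation data can be produced with clusterGeneration's genRandomClust(); the parameters below are illustrative, not the exact settings used to build these 182 files.

```r
# Hedged sketch of generating one simulated clustering dataset with the
# clusterGeneration package; parameters are illustrative, not the exact
# settings used to build the 182 datasets described above.
library(clusterGeneration)

set.seed(1)
sim <- genRandomClust(numClust      = 5,    # number of clusters (3-15 in the data)
                      sepVal        = 0.3,  # separation value (0.1-0.7 in the data)
                      numNonNoisy   = 4,    # informative variables
                      numReplicate  = 1,
                      outputDatFlag = FALSE,
                      outputLogFlag = FALSE)

dat  <- sim$datList[[1]]  # simulated observations
memb <- sim$memList[[1]]  # true cluster memberships
table(memb)
```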

  17. Multivariate Time Series Search

    • catalog.data.gov
    • s.cnmilf.com
    • +1more
    Updated Nov 14, 2025
    + more versions
    Cite
    Dashlink (2025). Multivariate Time Series Search [Dataset]. https://catalog.data.gov/dataset/multivariate-time-series-search
    Dataset updated
    Nov 14, 2025
    Dataset provided by
    Dashlink
    Description

    Multivariate Time-Series (MTS) are ubiquitous, and are generated in areas as disparate as sensor recordings in aerospace systems, music and video streams, medical monitoring, and financial systems. Domain experts are often interested in searching for interesting multivariate patterns from these MTS databases which can contain up to several gigabytes of data. Surprisingly, research on MTS search is very limited. Most existing work only supports queries with the same length of data, or queries on a fixed set of variables. In this paper, we propose an efficient and flexible subsequence search framework for massive MTS databases, that, for the first time, enables querying on any subset of variables with arbitrary time delays between them. We propose two provably correct algorithms to solve this problem — (1) an R-tree Based Search (RBS) which uses Minimum Bounding Rectangles (MBR) to organize the subsequences, and (2) a List Based Search (LBS) algorithm which uses sorted lists for indexing. We demonstrate the performance of these algorithms using two large MTS databases from the aviation domain, each containing several millions of observations. Both these tests show that our algorithms have very high prune rates (>95%) thus needing actual disk access for only less than 5% of the observations. To the best of our knowledge, this is the first flexible MTS search algorithm capable of subsequence search on any subset of variables. Moreover, MTS subsequence search has never been attempted on datasets of the size we have used in this paper.

  18. h

    fib

    • huggingface.co
    Updated Aug 19, 2023
    Cite
    r-three (2023). fib [Dataset]. https://huggingface.co/datasets/r-three/fib
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 19, 2023
    Dataset authored and provided by
    r-three
    Description

    Dataset Card for FIB

      Dataset Summary
    

    The FIB benchmark consists of 3579 examples for evaluating the factual inconsistency of large language models. Each example consists of a document and a pair of summaries: a factually consistent one and a factually inconsistent one. It is based on documents and summaries from XSum and CNN/DM. Since this dataset is intended to evaluate the factual inconsistency of large language models, there is only a test split.
    Accuracies should be… See the full description on the dataset page: https://huggingface.co/datasets/r-three/fib.
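    Since each example pairs a factually consistent and a factually inconsistent summary, one natural way to report accuracy is the fraction of examples for which the model prefers the consistent summary. Below is a minimal R sketch of that computation using made-up scores; the data frame, its column names and the scoring method are hypothetical and not the benchmark's official evaluation code:

    # hypothetical model scores (e.g. length-normalised log-likelihoods), one row per example
    scores <- data.frame(
      id                 = 1:5,
      score_consistent   = c(-12.1, -8.4, -15.0, -9.9, -11.2),
      score_inconsistent = c(-13.0, -8.9, -14.2, -10.5, -12.0)
    )
    # binary-choice accuracy: how often the factually consistent summary scores higher
    mean(scores$score_consistent > scores$score_inconsistent)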

  19. Z

    KNMI-LENTIS large ensemble time slice dataset description

    • nde-dev.biothings.io
    Updated Sep 29, 2023
    Cite
    Bintanja, Richard (2023). KNMI-LENTIS large ensemble time slice dataset description [Dataset]. https://nde-dev.biothings.io/resources?id=zenodo_7573136
    Explore at:
    Dataset updated
    Sep 29, 2023
    Dataset provided by
    Reerink, Thomas
    Bintanja, Richard
    Muntjewerf, Laura
    Van der Wiel, Karin
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description
    1. Contents

    Available variables in KNMI-LENTIS: request-overview-CMIP-historical-including-EC-EARTH-AOGCM-preferences.txt

    Where the data is deposited on the ECMWF's tape storage (section 4): LENTIS_on_ECFS.zip

    Data of all variables for 1 year for 1 ensemble member (section 5): tree_of_files_one_member_all_data.txt and {AERmon,Amon,Emon,LImon,Lmon,Ofx,Omon,SImon,fx,Eday,Oday,day,CFday,3hr,6hrPlev,6hrPlevPt}.zip

    2. Description of this Zenodo dataset

    This Zenodo dataset pertains to the full KNMI-LENTIS dataset: a large ensemble of simulations with the Global Climate Model EC-Earth3. The two simulated periods are the present-day period (2000-2009) and a future +2K period (2075-2084, following SSP2-4.5). KNMI-LENTIS has 1600 simulated years for each of the two climates. This level of sampled climate variability allows for robust and in-depth research into extreme events. The available variables are listed in the file request-overview-CMIP-historical-including-EC-EARTH-AOGCM-preferences.txt. All variables are cmorised following the CMIP6 data format convention. Further details on the variables and their output dimensions are available via the following search tool. The total size of KNMI-LENTIS is 128 TB. KNMI-LENTIS is stored on the high-performance storage system of the ECMWF (ECFS).

    The Global Climate Model that is used for generating this Large Ensemble is EC-Earth3 - VAREX project branch https://svn.ec-earth.org/ecearth3/branches/projects/varex (access restricted to ECMWF members).

    The goals of this Zenodo dataset are:

    to provide an accurate description and example of how the KNMI-LENTIS dataset is organised.

    to describe in which servers the data are deposited and how to gain access to the data for future users

    to provide links to related git repositories and other content relating to the KNMI-LENTIS production

    3. How KNMI-LENTIS is organised

    KNMI-LENTIS consists of 2 times 160 runs of 10 years. All simulations have a unique ensemble member label that reflects the forcing and how the initial conditions are generated. The initial conditions have two aspects: the parent simulation from which the run is branched (macro perturbation; there are 16), and the seed relating to a particular micro-perturbation in the initial three-dimensional atmosphere temperature field (there are 10). The ensemble member label is thus a combination of the following (see the sketch after this list):

    forcing (h for present-day/historical and s for +2K/SSP2-4.5)

    parent ID (number between 1 and 16)

    micro perturbation ID (number between 0 and 9)
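    A hedged R sketch of how these three components combine into member labels, assuming labels follow the pattern seen in member h010 (forcing letter, two-digit parent ID, one-digit seed); the data frame and its column names are illustrative only:

    members <- expand.grid(forcing = c("h", "s"), parent = 1:16, seed = 0:9,
                           stringsAsFactors = FALSE)
    members$label <- sprintf("%s%02d%d", members$forcing, members$parent, members$seed)
    nrow(members)              # 320 members: 2 forcings x 16 parents x 10 seeds, i.e. 160 runs per climate
    head(sort(members$label))  # "h010" (forcing h, parent 1, seed 0) is the member published here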

    In this Zenodo dataset we publish 1 year from 1 member to give insight into the type of data and metadata that is representative of the full KNMI-LENTIS dataset. The published data is year 2000 from member h010. See Section 4.

    Further, all KNMI-LENTIS simulations are labelled per the CMIP6 convention of variant labelling. A variant label is made from four components: the realization index r, the initialization index i, the physics index p and the forcing index f. Further details on CMIP6 variant labelling can be found in The CMIP6 Participation Guidance for Modelers. In the KNMI-LENTIS data set, the forcing is reflected in the first digit of the realization index r of the variant label. For the historical simulations, the one thousands (r1000-r1999) have been reserved. For the SSP2-4.5 simulations, the five thousands (r5000-r5999) have been reserved. The parent is reflected in the second and third digits of the realization index r of the variant label (r?01?-r?16?). The seed is reflected in the fourth digit of the realization index r (r???0-r???9). The seed is also reflected in the initialization index i of the variant label (i0-i9), so this is duplicate information. The physics index p5 has been reserved for the ECE3p5 version: all KNMI-LENTIS simulations have the p5 label. The forcing index f of the variant label is kept at 1 for all KNMI-LENTIS simulations. As an example, variant label r5119i9p5f1 refers to the +2K time slice with parent 11 and randomizing seed number 9. The physics index is 5, meaning the run is done with the ECE3p5 version of EC-Earth3.
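    A small R helper, sketched here only to make the label construction above concrete; the function name and argument checks are not part of the dataset:

    variant_label <- function(forcing, parent, seed) {
      stopifnot(forcing %in% c("h", "s"), parent %in% 1:16, seed %in% 0:9)
      base <- if (forcing == "h") 1000 else 5000  # historical -> r1???, +2K/SSP2-4.5 -> r5???
      r <- base + parent * 10 + seed              # parent in 2nd/3rd digits, seed in 4th digit
      sprintf("r%di%dp5f1", r, seed)              # physics index fixed at p5, forcing index at f1
    }
    variant_label("s", 11, 9)  # "r5119i9p5f1", the example given above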

    4. Where is the data deposited on the ECMWF's tape storage

    In this Zenodo folder, there are several text files and several netcdf files. The text files describe where the KNMI-LENTIS data are deposited on the ECMWF tape storage; the netcdf files contain the one-member, one-year example data described in section 5.

    Data from KNMI-LENTIS is deposited in the ECMWF ECFS tape storage system. Data can be freely downloaded by those who have access to the ECMWF ECFS. Otherwise, the data can be made available by the authors upon request.

    The way the dataset is organised is detailed in LENTIS_on_ECFS.zip. This archive contains details on all available KNMI-LENTIS files, in particular how these are filed in ECFS. The files on ECFS are tar-zipped per ensemble member and variable: each tar file contains 10 years of ensemble member data (10 separate netcdf files). The location on ECFS of the tar-zipped files that are listed in the various text files in this Zenodo dataset is

    ec:/nklm/LENTIS/ec-earth/cmorised_by_var/

    #!/bin/bash
    # -------------------
    # script to write out LENTIS details on ECFS
    # -------------------
    for freq in AERmon Amon Emon LImon Lmon Ofx Omon SImon fx Eday Oday day CFday 3hr 6hrPlev 6hrPlevPt; do
      for scen in hxxx sxxx; do
        els -l ec:/nklm/LENTIS/ec-earth/cmorised_by_var/${scen}/${freq}/* >> LENTIS_on_ECFS_${scen}_${freq}.txt
      done
    done

    Further, part of the data will be made publicly available from the Earth System Grid Federation (ESGF) data portal. We aim to upload most of the monthly variables for the full ensemble. To locate the KNMI-LENTIS data there, use EC-Earth as the model search term and p5 as the physics index.

    5. Data of all variables for 1 year for 1 ensemble member

    The netcdf files with the data of 1 year (2000) from member h010 are published here to give insight into the type of data and metadata that is representative of the full KNMI-LENTIS dataset. The data are in zipped folders per output frequency: AERmon, Amon, Emon, LImon, Lmon, Ofx, Omon, SImon, fx, Eday, Oday, day, CFday, 3hr, 6hrPlev, 6hrPlevPt. The text file request-overview-CMIP-historical-including-EC-EARTH-AOGCM-preferences.txt gives an overview of the variables available per output frequency. The text file tree_of_files_one_member_all_data.txt gives an overview of the files in the zipped folders.

    6. Related links

    The production of the KNMI-LENTIS ensemble was funded by the KNMI (Royal Dutch Meteorological Institute) multi-year strategic research fund KNMI MSO Climate Variability And Extremes (VAREX)

    GitHub repository corresponding to this Zenodo dataset: https://github.com/lmuntjewerf/KNMI-LENTIS_dataset_description.git

    Github repository for KNMI-LENTIS production code: https://github.com/lmuntjewerf/KNMI-LENTIS_production_script_train.git

  20. r

    R codes and dataset for Visualisation of Diachronic Constructional Change...

    • researchdata.edu.au
    • bridges.monash.edu
    Updated Apr 1, 2019
    Cite
    Gede Primahadi Wijaya Rajeg; Gede Primahadi Wijaya Rajeg (2019). R codes and dataset for Visualisation of Diachronic Constructional Change using Motion Chart [Dataset]. http://doi.org/10.26180/5c844c7a81768
    Explore at:
    Dataset updated
    Apr 1, 2019
    Dataset provided by
    Monash University
    Authors
    Gede Primahadi Wijaya Rajeg; Gede Primahadi Wijaya Rajeg
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description

    Publication


    Primahadi Wijaya R., Gede. 2014. Visualisation of diachronic constructional change using Motion Chart. In Zane Goebel, J. Herudjati Purwoko, Suharno, M. Suryadi & Yusuf Al Aried (eds.). Proceedings: International Seminar on Language Maintenance and Shift IV (LAMAS IV), 267-270. Semarang: Universitas Diponegoro. doi: https://doi.org/10.4225/03/58f5c23dd8387

    Description of R codes and data files in the repository

    This repository is imported from its GitHub repo. Versioning of this figshare repository is tied to the GitHub repo's Releases, so check the Releases page for updates (the next version will include a unified, tidyverse-based revision of the codes from the first release).

    The raw input data consists of two files (i.e. will_INF.txt and go_INF.txt). They represent the co-occurrence frequencies of the top-200 infinitival collocates of will and be going to, respectively, across the twenty decades of the Corpus of Historical American English (from the 1810s to the 2000s).

    These two input files are used in the R code file 1-script-create-input-data-raw.r. The codes preprocess and combine the two files into a long-format data frame consisting of the following columns: (i) decade, (ii) coll (for "collocate"), (iii) BE going to (for the frequency of the collocates with be going to) and (iv) will (for the frequency of the collocates with will); the result is available in input_data_raw.txt.

    Then, the script 2-script-create-motion-chart-input-data.R processes input_data_raw.txt to normalise the co-occurrence frequency of the collocates per million words (the COHA size and normalising base frequency are available in coha_size.txt). The output from the second script is input_data_futurate.txt.
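    For readers who want the gist of this step without opening the scripts, here is a hedged sketch of the join-and-normalise logic; it is not the authors' code, and it assumes the .txt files are tab-separated and that coha_size.txt has a per-decade word-count column named size:

    library(dplyr)
    library(readr)

    raw  <- read_tsv("input_data_raw.txt")   # decade, coll, `BE going to`, will
    coha <- read_tsv("coha_size.txt")        # assumed: decade plus corpus size in words ("size")

    futurate <- raw %>%
      left_join(coha, by = "decade") %>%
      mutate(across(c(`BE going to`, will), ~ .x / size * 1e6))  # frequency per million words

    write_tsv(futurate, "input_data_futurate.txt")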

    Next, input_data_futurate.txt contains the relevant input data for generating (i) the static motion chart as an image plot in the publication (using the script 3-script-create-motion-chart-plot.R), and (ii) the dynamic motion chart (using the script 4-script-motion-chart-dynamic.R).

    The repository adopts the project-oriented workflow in RStudio; double-click on the Future Constructions.Rproj file to open an RStudio session whose working directory is associated with the contents of this repository.
