100+ datasets found

D
Data set for reproducing plots showing stable water isotopologue transport...
darus.uni-stuttgart.de
Updated Oct 6, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Data set for reproducing plots showing stable water isotopologue transport and fractionation [Dataset]. https://darus.uni-stuttgart.de/dataset.xhtml?persistentId=doi:10.18419/darus-3108
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.18419/DARUS-3108
Dataset updated
Oct 6, 2022
Dataset provided by
DaRUS
Authors
Stefanie Kiemle; Katharina Heck
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset funded by
DFG
Description
This data set includes the *.csv data and the used scripts to reproduce the plots of the three different scenarios presented in S. Kiemle, K. Heck, E. Coltman, R. Helmig (2022) Stable water isotopologue fractionation during soil-water evaporation: Analysis using a coupled soil-atmosphere model. (Under review) Water Resources Research. *.csv files The isotope distribution has been analyzed in the vertical and in horizontal direction of a soil column for all scenarios. Therefore, we provide *.csv files generated using the ParaView Tools "plot over line" or "plot over time". Each *.csv file contains information about the saturation, temperature, and component composition for each phase in mole fraction or in the isotopic-specific delta notation. Additionally, information about the evaporation rate is given in a separate file *.txt file. python scripts For each scenario, we provide scripts to reproduce the presented plots. Scenarios We used different free flow conditions to analyze the fractionation processes inside the porous medium. Scenario 1. laminar flow, Scenario 2. laminar flow, but with isolation of parameter affecting the fractionation process, Scenario 3. turbulent flow. Please find below a detailed description of the data labeling and needed scripts to reproduce a certain plot for each scenario. Scenario: The spatial distribution of stable water isotopologues in horizontal (-0.01 m depth) and vertical (at 0.05 m width) inside a soil column at selected days (DoE (Day of Experiment)): Use the python scripts plot_concentration_horizontal_all.py (horizontal direction) and plot_concentration_spatial_all.py (vertical direction) to create the specific plots. In the folder IsotopeProfile_Horizontal and IsotopeProfile_Vertical the belonging *.csv can be found. The *.csv files are named after the selected day (e.g. DoE_80 refers to day 80 of the virtual experiment). The influence of the evaporation rate on isotopic fractionation processes in various depths (-0.001, -0.005, -0.009, and -0.018 m ) during the whole virtual experiment time: Use the python script plot_evap_isotopes_v2.py to create the plots. The data for the isotopologues distribution and the saturation can be found in the folder PlotOverTime. All data is named as PlotOverTime_xxxxm with xxxx representing the respective depth (e.g. PlotOverTime_0.001m refers to -0.001 m depth). The data for the evaporation rate can be found in the folder EvaporationRate. Note, the evaporation rate data is available as a .txt because we extract the information about the evaporation directly during the simulation and do not derive it through any post-processing. Scenario: Process behavior of isolated parameters that influences the isotopic fractionation: Use plot_concentration.py to reproduce the plots either represented in the isotopic-specific delta notation or in mole fraction. The corresponding data can be found in the folder IsotopeProfile_Vertical. The data labeling refers to the single cases (1- no fractionation; 2 - only equilibrium fractionation; 3 - only kinetic fractionation; 4 - only liquid diffusion; 5 - Reference). Scenario: Evaporation rate during the virtual experiment for different flow cases: With plot_evap.py and the .txt files which can be found in the folder EvaporationRate, the evaporation progression can be plotted. The labeling of the .txt files refers to the different flow cases (1 - 0.1 m/s (laminar); 2 - 0.13 m/s (laminar); 3 - 0.5 m/s (turbulent); 4 - 1 m/s (turbulent); 5 - 3 m/s (turbulent)). The isotope profiles in the vertical and horizontal direction of the soil column (similar to Scenario 1) for selected days: With plot_cocentration_horizontal_all.py and plot_concentration_spatial_all.py the plots for the horizontal and vertical distribution of isotopologues can be generated. The corresponding data can be found in the folders IsotopeProfile_Horizontal and IsotopeProfile_Vertical. These folders are structured with subfolders containing the data of selected days of the virtual experiments (DoE - Day of Experiments), in this case, day 2, 10, and 35. The data labeling remains similar to Scenario 3a).
f
Petre_Slide_CategoricalScatterplotFigShare.pptx
figshare.com
pptx
Updated Sep 19, 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Benj Petre; Aurore Coince; Sophien Kamoun (2016). Petre_Slide_CategoricalScatterplotFigShare.pptx [Dataset]. http://doi.org/10.6084/m9.figshare.3840102.v1
Explore at:
pptxAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.3840102.v1
Dataset updated
Sep 19, 2016
Dataset provided by
figshare
Authors
Benj Petre; Aurore Coince; Sophien Kamoun
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Categorical scatterplots with R for biologists: a step-by-step guide

Benjamin Petre1, Aurore Coince2, Sophien Kamoun1

1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK

Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.

Protocol

• Step 1: format the data set as a .csv file. Store the data in a three-column excel file as shown in Powerpoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed is indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import in R.

• Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in Powerpoint slide and paste it in the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.

• Step 3: save the graph as a .pdf file. Shape the window at your convenience and save the graph as a .pdf file (File -> Save as). See Powerpoint slide for an example.

Notes

• Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.

• Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.

7 Display the graph in a separate window. Dot colors indicate

replicates

graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()

References

Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.

Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035

Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128

https://cran.r-project.org/

http://ggplot2.org/
m
R codes and dataset for Visualisation of Diachronic Constructional Change...
bridges.monash.edu
researchdata.edu.au
zip
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gede Primahadi Wijaya Rajeg (2023). R codes and dataset for Visualisation of Diachronic Constructional Change using Motion Chart [Dataset]. http://doi.org/10.26180/5c844c7a81768
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.26180/5c844c7a81768
Dataset updated
May 30, 2023
Dataset provided by
Monash University
Authors
Gede Primahadi Wijaya Rajeg
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
PublicationPrimahadi Wijaya R., Gede. 2014. Visualisation of diachronic constructional change using Motion Chart. In Zane Goebel, J. Herudjati Purwoko, Suharno, M. Suryadi & Yusuf Al Aried (eds.). Proceedings: International Seminar on Language Maintenance and Shift IV (LAMAS IV), 267-270. Semarang: Universitas Diponegoro. doi: https://doi.org/10.4225/03/58f5c23dd8387Description of R codes and data files in the repositoryThis repository is imported from its GitHub repo. Versioning of this figshare repository is associated with the GitHub repo's Release. So, check the Releases page for updates (the next version is to include the unified version of the codes in the first release with the tidyverse).The raw input data consists of two files (i.e. will_INF.txt and go_INF.txt). They represent the co-occurrence frequency of top-200 infinitival collocates for will and be going to respectively across the twenty decades of Corpus of Historical American English (from the 1810s to the 2000s).These two input files are used in the R code file 1-script-create-input-data-raw.r. The codes preprocess and combine the two files into a long format data frame consisting of the following columns: (i) decade, (ii) coll (for "collocate"), (iii) BE going to (for frequency of the collocates with be going to) and (iv) will (for frequency of the collocates with will); it is available in the input_data_raw.txt. Then, the script 2-script-create-motion-chart-input-data.R processes the input_data_raw.txt for normalising the co-occurrence frequency of the collocates per million words (the COHA size and normalising base frequency are available in coha_size.txt). The output from the second script is input_data_futurate.txt.Next, input_data_futurate.txt contains the relevant input data for generating (i) the static motion chart as an image plot in the publication (using the script 3-script-create-motion-chart-plot.R), and (ii) the dynamic motion chart (using the script 4-script-motion-chart-dynamic.R).The repository adopts the project-oriented workflow in RStudio; double-click on the Future Constructions.Rproj file to open an RStudio session whose working directory is associated with the contents of this repository.
b
Video Plankton Recorder data (formatted with taxa displayed in single...
datacart.bco-dmo.org
bco-dmo.org
+1more
csv
Updated Jul 31, 2012
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Carin J. Ashjian (2012). Video Plankton Recorder data (formatted with taxa displayed in single column); from R/V Columbus Iselin and R/V Endeavor cruises CI9407, EN259, EN262 in the Gulf of Maine and Georges Bank from 1994-1995 [Dataset]. https://datacart.bco-dmo.org/dataset/3685
Explore at:
csv(370.26 MB)Available download formats
Dataset updated
Jul 31, 2012
Dataset provided by
Biological and Chemical Data Management Office
Authors
Carin J. Ashjian
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Gulf of Maine, Georges Bank
Variables measured
lat, lon, sal, temp, year, fluor, press, taxon, flvolt, abund_L, and 9 more
Measurement technique
Video Plankton Recorder
Description
This dataset includes ALL the abundance values, zero and non-zero. Taxonomic groups are diplayed in the 'taxon' column, rather than in separate columns, with abundances in the 'abund_L' column. For the original presentation of the data, see VPR_ashjian_orig. For a version of the data with only non-zero data, see VPR_ashjian_nonzero. In the 'nonzero' dataset, values of 0 in the abund_L column (taxon abundance) have been removed.

Methodology
The following information was extracted from C.J. Ashjian et al., Deep- Sea Research II 48(2001) 245-282 . An in-depth discussion of the data and sampling methods can be found there.

The Video Plankton Recorder was towed at 2 m/s, collecting data from the surface to the bottom (towyo). The VPR was equipped with 2-4 cameras, temperature and conductivity probes, fluorometer and transmissometer. Environmental data was collected at 0.25 Hz (CI9407) or 0.5 Hz (EN259, EN262). Video images were recorded at 60 fields per second (fps).

Video tapes were analyzed for plankton abundances using a semi-automated method discussed in Davis, C.S. et al., Deep-Sea Research II 43 (1996) 1946-1970. In-focus images were extracted from the video tapes and identified by hand to particle type, taxon, or species. Plankton and particle observations were merged with environmental and navigational data by binning the observations for each category into the time intervals at which the environmental data were collected (again see above Davis citation). Concentrations were calculated utilizing the total volume (liters) imaged during that period. For less-abundant categories, usually only a single organism was observed during each time interval so that the resulting concentrations are close to presence or absence data rather than covering a range of values.
[Superseded] Intellectual Property Government Open Data 2019
researchdata.edu.au
data.gov.au
Updated Jun 6, 2019
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
IP Australia (2019). [Superseded] Intellectual Property Government Open Data 2019 [Dataset]. https://researchdata.edu.au/superseded-intellectual-property-data-2019/2994670
Explore at:
Dataset updated
Jun 6, 2019
Dataset provided by
Data.govhttps://data.gov/
Authors
IP Australia
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
What is IPGOD?\r

The Intellectual Property Government Open Data (IPGOD) includes over 100 years of registry data on all intellectual property (IP) rights administered by IP Australia. It also has derived information about the applicants who filed these IP rights, to allow for research and analysis at the regional, business and individual level. This is the 2019 release of IPGOD.\r \r \r

How do I use IPGOD?\r

IPGOD is large, with millions of data points across up to 40 tables, making them too large to open with Microsoft Excel. Furthermore, analysis often requires information from separate tables which would need specialised software for merging. We recommend that advanced users interact with the IPGOD data using the right tools with enough memory and compute power. This includes a wide range of programming and statistical software such as Tableau, Power BI, Stata, SAS, R, Python, and Scalar.\r \r \r

IP Data Platform\r

IP Australia is also providing free trials to a cloud-based analytics platform with the capabilities to enable working with large intellectual property datasets, such as the IPGOD, through the web browser, without any installation of software. IP Data Platform\r \r

References\r

\r The following pages can help you gain the understanding of the intellectual property administration and processes in Australia to help your analysis on the dataset.\r \r * Patents\r * Trade Marks\r * Designs\r * Plant Breeder’s Rights\r \r \r

Updates\r

\r

Tables and columns\r

\r Due to the changes in our systems, some tables have been affected.\r \r * We have added IPGOD 225 and IPGOD 325 to the dataset!\r * The IPGOD 206 table is not available this year.\r * Many tables have been re-built, and as a result may have different columns or different possible values. Please check the data dictionary for each table before use.\r \r

Data quality improvements\r

\r Data quality has been improved across all tables.\r \r * Null values are simply empty rather than '31/12/9999'.\r * All date columns are now in ISO format 'yyyy-mm-dd'.\r * All indicator columns have been converted to Boolean data type (True/False) rather than Yes/No, Y/N, or 1/0.\r * All tables are encoded in UTF-8.\r * All tables use the backslash \ as the escape character.\r * The applicant name cleaning and matching algorithms have been updated. We believe that this year's method improves the accuracy of the matches. Please note that the "ipa_id" generated in IPGOD 2019 will not match with those in previous releases of IPGOD.
w
Randomized Hourly Load Data for use with Taxonomy Distribution Feeders
data.wu.ac.at
datadiscoverystudio.org
application/unknown
Updated Aug 29, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Department of Energy (2017). Randomized Hourly Load Data for use with Taxonomy Distribution Feeders [Dataset]. https://data.wu.ac.at/schema/data_gov/NWYwYmFmYTItOWRkMC00OWM0LTk3OGYtZDcyYzZiOWY5N2Ez
Explore at:
application/unknownAvailable download formats
Dataset updated
Aug 29, 2017
Dataset provided by
Department of Energy
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This dataset was developed by NREL's distributed energy systems integration group as part of a study on high penetrations of distributed solar PV [1]. It consists of hourly load data in CSV format for use with the PNNL taxonomy of distribution feeders [2]. These feeders were developed in the open source GridLAB-D modelling language [3]. In this dataset each of the load points in the taxonomy feeders is populated with hourly averaged load data from a utility in the feeder’s geographical region, scaled and randomized to emulate real load profiles. For more information on the scaling and randomization process, see [1].

The taxonomy feeders are statistically representative of the various types of distribution feeders found in five geographical regions of the U.S. Efforts are underway (possibly complete) to translate these feeders into the OpenDSS modelling language.

This data set consists of one large CSV file for each feeder. Within each CSV, each column represents one load bus on the feeder. The header row lists the name of the load bus. The subsequent 8760 rows represent the loads for each hour of the year. The loads were scaled and randomized using a Python script, so each load series represents only one of many possible randomizations. In the header row, "rl" = residential load and "cl" = commercial load. Commercial loads are followed by a phase letter (A, B, or C). For regions 1-3, the data is from 2009. For regions 4-5, the data is from 2000.

For use in GridLAB-D, each column will need to be separated into its own CSV file without a header. The load value goes in the second column, and corresponding datetime values go in the first column, as shown in the sample file, sample_individual_load_file.csv. Only the first value in the time column needs to written as an absolute time; subsequent times may be written in relative format (i.e. "+1h", as in the sample). The load should be written in P+Qj format, as seen in the sample CSV, in units of Watts (W) and Volt-amps reactive (VAr). This dataset was derived from metered load data and hence includes only real power; reactive power can be generated by assuming an appropriate power factor. These loads were used with GridLAB-D version 2.2.

Browse files in this dataset, accessible as individual files and as a single ZIP file. This dataset is approximately 242MB compressed or 475MB uncompressed.

For questions about this dataset, contact andy.hoke@nrel.gov.

If you find this dataset useful, please mention NREL and cite [1] in your work.

References:

[1] A. Hoke, R. Butler, J. Hambrick, and B. Kroposki, “Steady-State Analysis of Maximum Photovoltaic Penetration Levels on Typical Distribution Feeders,” IEEE Transactions on Sustainable Energy, April 2013, available at http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6357275 .

[2] K. Schneider, D. P. Chassin, R. Pratt, D. Engel, and S. Thompson, “Modern Grid Initiative Distribution Taxonomy Final Report”, PNNL, Nov. 2008. Accessed April 27, 2012: http://www.gridlabd.org/models/feeders/taxonomy of prototypical feeders.pdf

[3] K. Schneider, D. Chassin, Y. Pratt, and J. C. Fuller, “Distribution power flow for smart grid technologies”, IEEE/PES Power Systems Conference and Exposition, Seattle, WA, Mar. 2009, pp. 1-7, 15-18.
m
Reddit r/AskScience Flair Dataset
data.mendeley.com
Updated May 23, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sumit Mishra (2022). Reddit r/AskScience Flair Dataset [Dataset]. http://doi.org/10.17632/k9r2d9z999.3
Explore at:
Unique identifier
https://doi.org/10.17632/k9r2d9z999.3
Dataset updated
May 23, 2022
Authors
Sumit Mishra
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Reddit is a social news, content rating and discussion website. It's one of the most popular sites on the internet. Reddit has 52 million daily active users and approximately 430 million users who use it once a month. Reddit has different subreddits and here We'll use the r/AskScience Subreddit.

The dataset is extracted from the subreddit /r/AskScience from Reddit. The data was collected between 01-01-2016 and 20-05-2022. It contains 612,668 Datapoints and 25 Columns. The database contains a number of information about the questions asked on the subreddit, the description of the submission, the flair of the question, NSFW or SFW status, the year of the submission, and more. The data is extracted using python and Pushshift's API. A little bit of cleaning is done using NumPy and pandas as well. (see the descriptions of individual columns below).

The dataset contains the following columns and descriptions: author - Redditor Name author_fullname - Redditor Full name contest_mode - Contest mode [implement obscured scores and randomized sorting]. created_utc - Time the submission was created, represented in Unix Time. domain - Domain of submission. edited - If the post is edited or not. full_link - Link of the post on the subreddit. id - ID of the submission. is_self - Whether or not the submission is a self post (text-only). link_flair_css_class - CSS Class used to identify the flair. link_flair_text - Flair on the post or The link flair’s text content. locked - Whether or not the submission has been locked. num_comments - The number of comments on the submission. over_18 - Whether or not the submission has been marked as NSFW. permalink - A permalink for the submission. retrieved_on - time ingested. score - The number of upvotes for the submission. description - Description of the Submission. spoiler - Whether or not the submission has been marked as a spoiler. stickied - Whether or not the submission is stickied. thumbnail - Thumbnail of Submission. question - Question Asked in the Submission. url - The URL the submission links to, or the permalink if a self post. year - Year of the Submission. banned - Banned by the moderator or not.

This dataset can be used for Flair Prediction, NSFW Classification, and different Text Mining/NLP tasks. Exploratory Data Analysis can also be done to get the insights and see the trend and patterns over the years.
Reddit AskScience Flair Analysis Dataset
kaggle.com
Updated Feb 15, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sumit Mishra (2025). Reddit AskScience Flair Analysis Dataset [Dataset]. https://www.kaggle.com/datasets/sumitm004/reddit-raskscience-flair-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 15, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Sumit Mishra
License
Open Data Commons Attribution License (ODC-By) v1.0https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
Description
Context

Reddit is a massive platform for news, content, and discussions, hosting millions of active users daily. Among its vast number of subreddits, we focus on the r/AskScience community, where users engage in science-related discussions and questions.

Content

This dataset is derived from the r/AskScience subreddit, collected between January 1, 2016, and May 20, 2022. It includes 612,668 datapoints across 22 columns, featuring diverse information such as the content of the questions, submission descriptions, associated flairs, NSFW/SFW status, year of submission, and more. The data was extracted using Python and Pushshift's API, followed by some cleaning with NumPy and pandas. Detailed column descriptions are available for clarity.

Mendeley Data

Ideas for Usage

Flair Prediction:Train models to predict post flairs (e.g., 'Science', 'Ask', 'Discussion') to automate content categorization for platforms like Reddit.

NSFW Classification: Classify posts as SFW or NSFW based on textual content, enabling content moderation tools for online forums.

Text Mining / NLP Tasks: Apply NLP techniques like Sentiment Analysis, Topic Modeling, and Text Classification to explore the content and themes of science-related discussions.

Community Engagement Analysis: Investigate which post types or flairs generate more engagement (e.g., upvotes or comments), offering insights into user interaction.

Trend Detection in Science Topics: Identify emerging science topics and analyze shifts in interest areas, which can help predict future trends in scientific discussions.
f
ukbtools: An R package to manage and query UK Biobank data
plos.figshare.com
pdf
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ken B. Hanscombe; Jonathan R. I. Coleman; Matthew Traylor; Cathryn M. Lewis (2023). ukbtools: An R package to manage and query UK Biobank data [Dataset]. http://doi.org/10.1371/journal.pone.0214311
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0214311
Dataset updated
May 31, 2023
Dataset provided by
PLOS ONE
Authors
Ken B. Hanscombe; Jonathan R. I. Coleman; Matthew Traylor; Cathryn M. Lewis
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
IntroductionThe UK Biobank (UKB) is a resource that includes detailed health-related data on about 500,000 individuals and is available to the research community. However, several obstacles limit immediate analysis of the data: data files vary in format, may be very large, and have numerical codes for column names.Resultsukbtools removes all the upfront data wrangling required to get a single dataset for statistical analysis. All associated data files are merged into a single dataset with descriptive column names. The package also provides tools to assist in quality control by exploring the primary demographics of subsets of participants; query of disease diagnoses for one or more individuals, and estimating disease frequency relative to a reference variable; and to retrieve genetic metadata.ConclusionHaving a dataset with meaningful variable names, a set of UKB-specific exploratory data analysis tools, disease query functions, and a set of helper functions to explore and write genetic metadata to file, will rapidly enable UKB users to undertake their research.
Z
Data from: Dataset from : Browsing is a strong filter for savanna tree...
data.niaid.nih.gov
Updated Oct 1, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Archibald, Sally (2021). Dataset from : Browsing is a strong filter for savanna tree seedlings in their first growing season [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4972083
Explore at:
Dataset updated
Oct 1, 2021
Dataset provided by
Archibald, Sally
Craddock Mthabini
Wayne Twine
Nicola Stevens
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The data presented here were used to produce the following paper:

Archibald, Twine, Mthabini, Stevens (2021) Browsing is a strong filter for savanna tree seedlings in their first growing season. J. Ecology.

The project under which these data were collected is: Mechanisms Controlling Species Limits in a Changing World. NRF/SASSCAL Grant number 118588

For information on the data or analysis please contact Sally Archibald: sally.archibald@wits.ac.za

Description of file(s):

File 1: cleanedData_forAnalysis.csv (required to run the R code: "finalAnalysis_PostClipResponses_Feb2021_requires_cleanData_forAnalysis_.R"

The data represent monthly survival and growth data for ~740 seedlings from 10 species under various levels of clipping.

The data consist of one .csv file with the following column names:

treatment Clipping treatment (1 - 5 months clip plus control unclipped) plot_rep One of three randomised plots per treatment matrix_no Where in the plot the individual was placed species_code First three letters of the genus name, and first three letters of the species name uniquely identifies the species species Full species name sample_period Classification of sampling period into time since clip. status Alive or Dead standing.height Vertical height above ground (in mm) height.mm Length of the longest branch (in mm) total.branch.length Total length of all the branches (in mm) stemdiam.mm Basal stem diameter (in mm) maxSpineLength.mm Length of the longest spine postclipStemNo Number of resprouting stems (only recorded AFTER clipping) date.clipped date.clipped date.measured date.measured date.germinated date.germinated Age.of.plant Date measured - Date germinated newtreat Treatment as a numeric variable, with 8 being the control plot (for plotting purposes)

File 2: Herbivory_SurvivalEndofSeason_march2017.csv (required to run the R code: "FinalAnalysisResultsSurvival_requires_Herbivory_SurvivalEndofSeason_march2017.R"

The data consist of one .csv file with the following column names:

treatment Clipping treatment (1 - 5 months clip plus control unclipped) plot_rep One of three randomised plots per treatment matrix_no Where in the plot the individual was placed species_code First three letters of the genus name, and first three letters of the species name uniquely identifies the species species Full species name sample_period Classification of sampling period into time since clip. status Alive or Dead standing.height Vertical height above ground (in mm) height.mm Length of the longest branch (in mm) total.branch.length Total length of all the branches (in mm) stemdiam.mm Basal stem diameter (in mm) maxSpineLength.mm Length of the longest spine postclipStemNo Number of resprouting stems (only recorded AFTER clipping) date.clipped date.clipped date.measured date.measured date.germinated date.germinated Age.of.plant Date measured - Date germinated newtreat Treatment as a numeric variable, with 8 being the control plot (for plotting purposes) genus Genus MAR Mean Annual Rainfall for that Species distribution (mm) rainclass High/medium/low

File 3: allModelParameters_byAge.csv (required to run the R code: "FinalModelSeedlingSurvival_June2021_.R"

Consists of a .csv file with the following column headings

Age.of.plant Age in days species_code Species pred_SD_mm Predicted stem diameter in mm pred_SD_up top 75th quantile of stem diameter in mm pred_SD_low bottom 25th quantile of stem diameter in mm treatdate date when clipped pred_surv Predicted survival probability pred_surv_low Predicted 25th quantile survival probability pred_surv_high Predicted 75th quantile survival probability species_code species code Bite.probability Daily probability of being eaten max_bite_diam_duiker_mm Maximum bite diameter of a duiker for this species duiker_sd standard deviation of bite diameter for a duiker for this species max_bite_diameter_kudu_mm Maximum bite diameer of a kudu for this species kudu_sd standard deviation of bite diameter for a kudu for this species mean_bite_diam_duiker_mm mean etc duiker_mean_sd standard devaition etc mean_bite_diameter_kudu_mm mean etc kudu_mean_sd standard deviation etc genus genus rainclass low/med/high

File 4: EatProbParameters_June2020.csv (required to run the R code: "FinalModelSeedlingSurvival_June2021_.R"

Consists of a .csv file with the following column headings

shtspec species name species_code species code genus genus rainclass low/medium/high seed mass mass of seed (g per 1000seeds)
Surv_intercept coefficient of the model predicting survival from age of clip for this species Surv_slope coefficient of the model predicting survival from age of clip for this species GR_intercept coefficient of the model predicting stem diameter from seedling age for this species GR_slope coefficient of the model predicting stem diameter from seedling age for this species species_code species code max_bite_diam_duiker_mm Maximum bite diameter of a duiker for this species duiker_sd standard deviation of bite diameter for a duiker for this species max_bite_diameter_kudu_mm Maximum bite diameer of a kudu for this species kudu_sd standard deviation of bite diameter for a kudu for this species mean_bite_diam_duiker_mm mean etc duiker_mean_sd standard devaition etc mean_bite_diameter_kudu_mm mean etc kudu_mean_sd standard deviation etc AgeAtEscape_duiker[t] age of plant when its stem diameter is larger than a mean duiker bite AgeAtEscape_duiker_min[t] age of plant when its stem diameter is larger than a min duiker bite AgeAtEscape_duiker_max[t] age of plant when its stem diameter is larger than a max duiker bite AgeAtEscape_kudu[t] age of plant when its stem diameter is larger than a mean kudu bite AgeAtEscape_kudu_min[t] age of plant when its stem diameter is larger than a min kudu bite AgeAtEscape_kudu_max[t] age of plant when its stem diameter is larger than a max kudu bite
o
University SET data, with faculty and courses characteristics
openicpsr.org
Updated Sep 12, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Under blind review in refereed journal (2021). University SET data, with faculty and courses characteristics [Dataset]. http://doi.org/10.3886/E149801V1
Explore at:
Unique identifier
https://doi.org/10.3886/E149801V1
Dataset updated
Sep 12, 2021
Authors
Under blind review in refereed journal
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This paper explores a unique dataset of all the SET ratings provided by students of one university in Poland at the end of the winter semester of the 2020/2021 academic year. The SET questionnaire used by this university is presented in Appendix 1. The dataset is unique for several reasons. It covers all SET surveys filled by students in all fields and levels of study offered by the university. In the period analysed, the university was entirely in the online regime amid the Covid-19 pandemic. While the expected learning outcomes formally have not been changed, the online mode of study could have affected the grading policy and could have implications for some of the studied SET biases. This Covid-19 effect is captured by econometric models and discussed in the paper. The average SET scores were matched with the characteristics of the teacher for degree, seniority, gender, and SET scores in the past six semesters; the course characteristics for time of day, day of the week, course type, course breadth, class duration, and class size; the attributes of the SET survey responses as the percentage of students providing SET feedback; and the grades of the course for the mean, standard deviation, and percentage failed. Data on course grades are also available for the previous six semesters. This rich dataset allows many of the biases reported in the literature to be tested for and new hypotheses to be formulated, as presented in the introduction section. The unit of observation or the single row in the data set is identified by three parameters: teacher unique id (j), course unique id (k) and the question number in the SET questionnaire (n ϵ {1, 2, 3, 4, 5, 6, 7, 8, 9} ). It means that for each pair (j,k), we have nine rows, one for each SET survey question, or sometimes less when students did not answer one of the SET questions at all. For example, the dependent variable SET_score_avg(j,k,n) for the triplet (j=Calculus, k=John Smith, n=2) is calculated as the average of all Likert-scale answers to question nr 2 in the SET survey distributed to all students that took the Calculus course taught by John Smith. The data set has 8,015 such observations or rows. The full list of variables or columns in the data set included in the analysis is presented in the attached filesection. Their description refers to the triplet (teacher id = j, course id = k, question number = n). When the last value of the triplet (n) is dropped, it means that the variable takes the same values for all n ϵ {1, 2, 3, 4, 5, 6, 7, 8, 9}.Two attachments:- word file with variables description- Rdata file with the data set (for R language).Appendix 1. Appendix 1. The SET questionnaire was used for this paper. Evaluation survey of the teaching staff of [university name] Please, complete the following evaluation form, which aims to assess the lecturer’s performance. Only one answer should be indicated for each question. The answers are coded in the following way: 5- I strongly agree; 4- I agree; 3- Neutral; 2- I don’t agree; 1- I strongly don’t agree. Questions 1 2 3 4 5 I learnt a lot during the course. ○ ○ ○ ○ ○ I think that the knowledge acquired during the course is very useful. ○ ○ ○ ○ ○ The professor used activities to make the class more engaging. ○ ○ ○ ○ ○ If it was possible, I would enroll for the course conducted by this lecturer again. ○ ○ ○ ○ ○ The classes started on time. ○ ○ ○ ○ ○ The lecturer always used time efficiently. ○ ○ ○ ○ ○ The lecturer delivered the class content in an understandable and efficient way. ○ ○ ○ ○ ○ The lecturer was available when we had doubts. ○ ○ ○ ○ ○ The lecturer treated all students equally regardless of their race, background and ethnicity. ○ ○
Data from: Thirteen-year Stover Harvest and Tillage Effects on Corn...
catalog.data.gov
agdatacommons.nal.usda.gov
Updated Apr 21, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Agricultural Research Service (2025). Thirteen-year Stover Harvest and Tillage Effects on Corn Agroecosystem Sustainability in Iowa [Dataset]. https://catalog.data.gov/dataset/thirteen-year-stover-harvest-and-tillage-effects-on-corn-agroecosystem-sustainability-in-i-be5ae
Explore at:
Dataset updated
Apr 21, 2025
Dataset provided by
Agricultural Research Servicehttps://www.ars.usda.gov/
Description
This dataset includes soil health, crop biomass, and crop yield data for a 13-year corn stover harvest trial in central Iowa. Following the release in 2005 of the Billion Ton Study assessment of biofuel sources, several soil health assessments associated with harvesting corn stover were initiated across ARS locations to help provide industry guidelines for sustainable stover harvest. This dataset is from a trial conducted by the National Laboratory for Agriculture and Environment from 2007-2021 at the Iowa State University Ag Engineering and Agronomy farm. Management factors evaluated in the trial included the following. Stover harvest rate at three levels: No, moderate (3.5 ± 1.1 Mg ha-1 yr-1), or high (5.0 ± 1.7 Mg ha-1 yr-1) stover harvest rates. No-till versus chisel-plow tillage. Originally, the 3 stover harvest rates were evaluated in a complete factorial design with tillage system. However, the no-till, no-harvest system performed poorly in continuous corn and was discontinued in 2012 due to lack of producer interest. Cropping sequence. In addition to evaluating continuous corn for all stover harvest rates and tillage systems, a corn-alfalfa rotation, and a corn-soybean-wheat rotation with winter cover crops were evaluated in a subset of the tillage and stover harvest rate treatments. One-time additions of biochar in 2013 at rates of either 9 Mg/ha or 30 Mg/ha were evaluated in a continuous corn cropping system. The dataset includes: 1) Crop biomass and yields for all crop phases in every year. 2) Soil organic carbon, total carbon, total nitrogen, and pH to 120 cm depth in 2012, 2016, and 2017. Soil cores from 2005 (pre-study) were also sampled to 90 cm depth. 3) Soil chemistry sampled to 15 cm depth every 1-2 years from 2007 to 2017. 4) Soil strength and compaction was assessed to 60 cm depth in April 2021. These data have been presented in several manuscripts, including Phillips et al. (in review), O'Brien et al. (2020), and Obrycki et al. (2018). Resources in this dataset:Resource Title: R Script for Phillips et al. 2022. File Name: Field 70-71 Analysis Script_AgDataCommons.RResource Description: This R script includes analysis and figures for Phillips et al. "Thirteen-year Stover Harvest and Tillage Effects on Soil Compaction in Iowa". It focuses primarily on the soil compaction and strength data found in "Field 70-71 ConeIndex_BulkDensityDepths_2021". It also includes analysis of corn yields from "Field 70-71 CornYield_2008-2021" and weather conditions from "PRISM_MayTemps" and "Rainfall_AEA".Resource Software Recommended: R version 4.1.3 or higher,url: https://cran.r-project.org/bin/windows/base/ Resource Title: Field 70-71 ConeIndex_BulkDensityDepths_2021. File Name: Field 70-71 ConeIndex_BulkDensityDepths_2021.csvResource Description: This dataset provides an assessment of soil strength (penetration resistance) and soil compaction (bulk density) to 60 cm depth, in continuous corn plots. Penetration resistance was measured in most-trafficked and least-trafficked areas of the plots to assess compaction from increased traffic associated with stover harvest. This spreadsheet also has associated data, including soil water, carbon, and organic matter content. Data were collected in April 2021 and are described in Phillips et al. (in review, 2022).Resource Title: Field 70-71 CornYield_2008-2021. File Name: Field 70-71 CornYield_2008-2021_ForR.csvResource Description: This dataset provides corn stover biomass and grain yields from 2008-2021. Note that this dataset is just for corn, which were presented in Phillips et al., 2022. Yields for all crop phases, including soybeans, wheat, alfalfa, and winter cover crops, are in the file "Field 70-71 Crop Yield File 2008-2020".Resource Title: PRISM_MayTemps. File Name: PRISM_MayTemps.csvResource Description: Average May temperatures during the study period, obtained from interpolation of regional weather stations using the PRISM climate model (https://prism.oregonstate.edu/). These data were used to evaluate how spring temperatures may have impacted corn establishment.Resource Title: Rainfall_AEA. File Name: Rainfall_AEA.csvResource Description: Daily rainfall for the study location, 2008-2021. Data were obtained from the Iowa Environmental Mesonet (https://mesonet.agron.iastate.edu/rainfall/). Title: Field 70-71 Plot Status 2007-2021. File Name: Field 70-71 Plot Status 2007-2021.xlsxResource Description: This file contains descriptions of experimental treatments and diagrams of plot layouts as they were modified through several phases of the trial. Also includes an image of plot locations relative to NRCS soil survey map units.Resource Title: Field 70-71 Deep Soil Cores 2012-2017. File Name: Field 70-71 Deep Soil Cores 2012-2017.xlsxResource Description: Soil carbon, nitrogen, organic matter, and pH to 120 cm depth in 2012, 2016, and 2017.Resource Title: Field 70-71 Baseline Deep Soil Cores 2005. File Name: Field 70-71 Baseline Deep Soil Cores 2005.csvResource Description: Baseline soil carbon, nitrogen, and pH data from an earlier trial in 2005, prior to stover trial establishment.Resource Title: Field 70-71 Crop Yield File 2008-2020. File Name: Field 70-71 Crop Yield File 2008-2020.xlsxResource Description: Yields for all crops in all cropping sequences, 2008-2020. Some of the crop sequences have not been summarized in publications.Resource Title: Field 70-71 Surface Soil Test Data 2007-2021. File Name: Field 70-71 Surface Soil Test Data 2007-2021.xlsxResource Description: Soil chemistry data, 0-15 cm, collect near-annually from 2007 to 2021. Most analyses were performed by Harris Laboratories (now AgSource) in Lincoln, Nebraska, USA. Resource Title: Iowa Stover Harvest Trial Data Dictionary. File Name: Field 70-71 Data Dictionary.xlsxResource Description: Data dictionary for all data files.
f
Data and tools for studying isograms
figshare.com
Updated Jul 31, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Florian Breit (2017). Data and tools for studying isograms [Dataset]. http://doi.org/10.6084/m9.figshare.5245810.v1
Explore at:
application/x-sqlite3Available download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.5245810.v1
Dataset updated
Jul 31, 2017
Dataset provided by
figshare
Authors
Florian Breit
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A collection of datasets and python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC).Below follows a brief description, first, of the included datasets and, second, of the included scripts.1. DatasetsThe data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.1.1 CSV formatThe CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name.The CSV files contain one row per data point, with the colums separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure, see section below):

Label Data type Description

isogramy int The order of isogramy, e.g. "2" is a second order isogram

length int The length of the word in letters

word text The actual word/isogram in ASCII

source_pos text The Part of Speech tag from the original corpus

count int Token count (total number of occurences)

vol_count int Volume count (number of different sources which contain the word)

count_per_million int Token count per million words

vol_count_as_percent int Volume count as percentage of the total number of volumes

is_palindrome bool Whether the word is a palindrome (1) or not (0)

is_tautonym bool Whether the word is a tautonym (1) or not (0)

The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:

Label

Data type

Description

!total_1grams

int

The total number of words in the corpus

!total_volumes

int

The total number of volumes (individual sources) in the corpus

!total_isograms

int

The total number of isograms found in the corpus (before compacting)

!total_palindromes

int

How many of the isograms found are palindromes

!total_tautonyms

int

How many of the isograms found are tautonyms

The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.1.2 SQLite database formatOn the other hand, the SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:• Compacted versions of each dataset, where identical headwords are combined into a single entry.• A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.• An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.The intersected dataset is by far the least noisy, but is missing some real isograms, too.The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above.To get an idea of the various ways the database can be queried for various bits of data see the R script described below, which computes statistics based on the SQLite database.2. ScriptsThere are three scripts: one for tiding Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second script can be run using SQLite 3 from the command line, and the third script can be run in R/RStudio (R version 3).2.1 Source dataThe scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and [https://www.kilgarriff.co.uk/bnc-readme.html], (download all.al.gz).For Ngram the script expects the path to the directory containing the various files, for BNC the direct path to the *.gz file.2.2 Data preparationBefore processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format.Tidying and reformatting can be done by running one of the following commands:python isograms.py --ngrams --indir=INDIR --outfile=OUTFILEpython isograms.py --bnc --indir=INFILE --outfile=OUTFILEReplace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.2.3 Isogram ExtractionAfter preparing the data as above, isograms can be extracted from by running the following command on the reformatted and tidied files:python isograms.py --batch --infile=INFILE --outfile=OUTFILEHere INFILE should refer the the output from the previosu data cleaning process. Please note that the script will actually write two output files, one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.2.4 Creating a SQLite3 databaseThe output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:1. Make sure the files with the Ngrams and BNC data are named “ngrams-isograms.csv” and “bnc-isograms.csv” respectively. (The script assumes you have both of them, if you only want to load one, just create an empty file for the other one).2. Copy the “create-database.sql” script into the same directory as the two data files.3. On the command line, go to the directory where the files and the SQL script are. 4. Type: sqlite3 isograms.db 5. This will create a database called “isograms.db”.See the section 1 for a basic descript of the output data and how to work with the database.2.5 Statistical processingThe repository includes an R script (R version 3) named “statistics.r” that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
Tennessee Eastman Process Simulation Dataset
kaggle.com
zip
Updated Feb 9, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sergei Averkiev (2020). Tennessee Eastman Process Simulation Dataset [Dataset]. https://www.kaggle.com/averkij/tennessee-eastman-process-simulation-dataset
Explore at:
zip(1370814903 bytes)Available download formats
Dataset updated
Feb 9, 2020
Authors
Sergei Averkiev
Description
Intro

This dataverse contains the data referenced in Rieth et al. (2017). Issues and Advances in Anomaly Detection Evaluation for Joint Human-Automated Systems. To be presented at Applied Human Factors and Ergonomics 2017.

Content

Each .RData file is an external representation of an R dataframe that can be read into an R environment with the 'load' function. The variables loaded are named ‘fault_free_training’, ‘fault_free_testing’, ‘faulty_testing’, and ‘faulty_training’, corresponding to the RData files.

Each dataframe contains 55 columns:

Column 1 ('faultNumber') ranges from 1 to 20 in the “Faulty” datasets and represents the fault type in the TEP. The “FaultFree” datasets only contain fault 0 (i.e. normal operating conditions).

Column 2 ('simulationRun') ranges from 1 to 500 and represents a different random number generator state from which a full TEP dataset was generated (Note: the actual seeds used to generate training and testing datasets were non-overlapping).

Column 3 ('sample') ranges either from 1 to 500 (“Training” datasets) or 1 to 960 (“Testing” datasets). The TEP variables (columns 4 to 55) were sampled every 3 minutes for a total duration of 25 hours and 48 hours respectively. Note that the faults were introduced 1 and 8 hours into the Faulty Training and Faulty Testing datasets, respectively.

Columns 4 to 55 contain the process variables; the column names retain the original variable names.

Acknowledgements

This work was sponsored by the Office of Naval Research, Human & Bioengineered Systems (ONR 341), program officer Dr. Jeffrey G. Morrison under contract N00014-15-C-5003. The views expressed are those of the authors and do not reflect the official policy or position of the Office of Naval Research, Department of Defense, or US Government.

User Agreement

By accessing or downloading the data or work provided here, you, the User, agree that you have read this agreement in full and agree to its terms.

The person who owns, created, or contributed a work to the data or work provided here dedicated the work to the public domain and has waived his or her rights to the work worldwide under copyright law. You can copy, modify, distribute, and perform the work, for any lawful purpose, without asking permission.

In no way are the patent or trademark rights of any person affected by this agreement, nor are the rights that any other person may have in the work or in how the work is used, such as publicity or privacy rights.

Pacific Science & Engineering Group, Inc., its agents and assigns, make no warranties about the work and disclaim all liability for all uses of the work, to the fullest extent permitted by law.

When you use or cite the work, you shall not imply endorsement by Pacific Science & Engineering Group, Inc., its agents or assigns, or by another author or affirmer of the work.

This Agreement may be amended, and the use of the data or work shall be governed by the terms of the Agreement at the time that you access or download the data or work from this Website.
o
Movie Rationales (Rationales For Movie Reviews)
opendatabay.com
.undefined
Updated Jun 26, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Datasimple (2025). Movie Rationales (Rationales For Movie Reviews) [Dataset]. https://www.opendatabay.com/data/ai-ml/056ebe3b-4213-4643-b69d-3933e0cfa443
Explore at:
.undefinedAvailable download formats
Dataset updated
Jun 26, 2025
Dataset authored and provided by
Datasimple
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Entertainment & Media Consumption
Description
This dataset was created to allow researchers to gain an in-depth understanding of the inner workings of human-generated movie reviews. With these train, test, and validation sets, researchers can explore different aspects of movie reviews, such as sentiment labels or rationales behind them. By analyzing this information and finding patterns and correlations, insightful ideas can be discovered that can lead to developing models powerful enough to uncover importance of the unique human perspectives when interpreting movie reviews. Any data scientist or researcher interested in AI applications is encouraged to take advantage of this dataset which may potentially provide useful insights into better understanding user intent when reviewing movies

More Datasets For more datasets, click here.

Featured Notebooks 🚨 Your notebook can be here! 🚨! How to use the dataset This dataset is intended to enable researchers and developers to uncover the rationales behind movie reviews. To use it effectively, you must understand the data format and how each column in the dataset works.

What does each column mean? review: The text of the movie review. (String)

label: The sentiment label of the review (Positive, Negative, or Neutral). (String)

validation.csv: The validation set which contains reviews, labels, and evidence which can be used to validate models developed for understanding human perspective on movie reviews.

train.csv: The train set which contains reviews, labels as well as evidence used for training a model based on human annotations of movie reviews.

test.csv: The test set which contains reviews, labels and evidence that can be used to evaluate models on unseen data related to understanding perspectives of humans when it comes to movie reviews..

How do I use this dataset? To get started with this dataset you need a working environment such as Python or R where you have access library’s needed for natural language processing(NLP). After setting up an environment with libraries that support NLP tasks execute following steps :

Import csv files into your workspace using appropriate functions provided by specified language libraries e,.g., for Python use pandas read_csv() method .

Preprocess your text data in 'review' & 'label' columns by standardizing them like removing stopwords from sentences & converting words into lowercase etc .Following link link provides best possible preprocessing libraries available in Python .

Train&Test ML algorithms using appropriate feature extraction techniques related to NLP( Bag Of Words , TF-IDF , Word2Vec ) eines are some examples in many more are available Refer link

Measure performance accuracy after running experiments on datasets provided validation & test sets we have also included precision recall curves along famous metrics like F1 score & accuracy score so you could easily analyze hyperparameter tuning & algorithm efficiency according their outputs values you get while testing your ML algorithm

Recommendation systems are always fun! build a simple machine learning reccomendation system by collecting user visits logs post hand writting new featuers might

Research Ideas Developing an automated movie review summarizer based on user ratings, that can accurately capture the salient points of a review and summarize it for moviegoers. Training a model to predict the sentiment of a review, by combining machine learning models with human-annotated rationales from this dataset. Building an AI system that can detect linguistic markers of deception in reviews (e.g., 'fake news', thin reviews etc) and issue warnings on possible fraudulent purchases or online reviews

Columns File: validation.csv

Column name Description review Text from the movie review. (String) label Indicates whether a particular review’s sentiment can be classified as Positive (1), Negative (-1) or Neutral (0). (Integer) File: train.csv

Column name Description review Text from the movie review. (String) label Indicates whether a particular review’s sentiment can be classified as Positive (1), Negative (-1) or Neutral (0). (Integer) File: test.csv

Column name Description review Text from the movie review. (String) label Indicates whether a particular review’s sentiment can be classified as Positive (1), Negative (-1) or Neutral (0). (Integer) Acknowledgements If you use this dataset in your research, please credit the original authors. If you use this dataset in your research, please credit Huggingface Hub.

License

CC0

Original Data Source: Movie Rationales (Rationales For Movie Reviews)
o
SocialGrep Reddit Comment & Sentiment
opendatabay.com
.undefined
Updated Jul 5, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Datasimple (2025). SocialGrep Reddit Comment & Sentiment [Dataset]. https://www.opendatabay.com/data/ai-ml/1ed11deb-f713-4db9-9e9a-55c0f9107164
Explore at:
.undefinedAvailable download formats
Dataset updated
Jul 5, 2025
Dataset authored and provided by
Datasimple
Area covered
Data Science and Analytics
Description
This dataset provides an in-depth corpus of posts and comments from the Reddit board /r/datasets, covering its entire history up to 1st March 2022. Its primary purpose is to serve as a collection of datasets related to Reddit content, enabling analysts and data scientists to explore online community data. The data was acquired using SocialGrep. To safeguard user privacy, usernames have been excluded from this dataset, preventing targeted harassment and preserving anonymity. It includes details such as comment body text, sentiment analysis, and comment scores, offering a rich resource for various analytical tasks.

Columns

type: Denotes the type of the data point.

id: A unique Base-36 identifier for each comment.

subreddit.id: A unique Base-36 identifier for the subreddit where the comment was posted.

subreddit.name: The human-readable name of the subreddit.

subreddit.nsfw: Indicates whether the comment's subreddit is Not Safe For Work (NSFW).

created_utc: The timestamp in Coordinated Universal Time (UTC) when the comment was created.

permalink: The permanent link to the comment on Reddit.

body: The main text content of the comment.

sentiment: The analysed sentiment score for the comment's body text.

score: The numerical score assigned to the comment.

Distribution

The dataset is structured as a table containing all comments. While the specific file format is typically CSV, the total number of values for key columns such as id, subreddit.id, created_utc, permalink, body, sentiment, and score is 54,848 records. For the subreddit.nsfw column, all 54,848 values indicate 'false', meaning no NSFW subreddits are included in this specific count. The body column shows that 5% of comments are '[deleted]', 2% are '[removed]', and the remaining 93% consist of other content. Sentiment scores range from -1.00 to 1.00, with varying distributions across different ranges. Comment scores range from -65 to 195, also with varying frequencies across score bands.

Usage

This dataset is ideally suited for data science and analytics projects. It can be used for: * Natural Language Processing (NLP) tasks, such as text analysis and sentiment classification. * Studying the dynamics of online communities and social networks. * Analyzing user sentiment towards various topics discussed on Reddit. * Exploring the factors influencing comment scores and engagement. * Developing models for content moderation or recommendation based on Reddit data.

Coverage

The dataset spans a significant time range, including all posts and comments from the inception of the /r/datasets board up to 1st March 2022. Its geographic scope is global, representing activity across Reddit's platform without specific regional limitations. The demographic scope primarily focuses on the users interacting within the /r/datasets community on Reddit. As mentioned, usernames are specifically excluded to ensure user anonymity.

License

CC-BY

Who Can Use It

This dataset is valuable for a wide range of users, including: * Data scientists and analysts looking for real-world social media data for their projects. * Researchers in fields such as computer science, social networks, and linguistics, for studying online behaviour and communication patterns. * Developers creating applications that involve text analysis or sentiment prediction. * Anyone interested in gaining insights into Reddit communities and their discussions.

Dataset Name Suggestions

Reddit /r/datasets Comment Log

Analysed Reddit Community Posts

SocialGrep Reddit Comment & Sentiment

Reddit Data Science Discussions

Online Community Text Data

Attributes

Original Data Source: The Reddit Dataset Dataset
g
EM2040 Water Column Sonar Data Collected During H13177
gimi9.com
datasets.ai
+2more
Updated Sep 30, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2019). EM2040 Water Column Sonar Data Collected During H13177 [Dataset]. https://gimi9.com/dataset/data-gov_em2040-water-column-sonar-data-collected-during-h13177
Explore at:
Dataset updated
Sep 30, 2019
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Sea Scout Hydrographic Survey, H13177 (EM2040). Mainline coverage within the survey area consisted of Complete Coverage (100% side scan sonar with concurrent multibeam data) acquisition. The assigned Fish Haven area and associated debris area were surveyed with Object Detection MBES coverage. Bathymetric and water column data were acquired with a Kongsberg EM2040C multibeam echo sounder aboard the R/V Sea Scout and bathymetry data was acquired with a Kongsberg EM3002 multibeam echo sounder aboard the R/V C-Wolf. Side scan sonar acoustic imagery was collected with a Klein 5000 V2 system aboard the R/V Sea Scout and an EdgeTech 4200 aboard the R/V C-Wolf.
Data from: A FAIR and modular image-based workflow for knowledge discovery...
zenodo.org
data.niaid.nih.gov
bin, csv, png, txt +1
Updated Jul 11, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Meghan Balk; Meghan Balk; Thibault Tabarin; Thibault Tabarin; John Bradley; John Bradley; Hilmar Lapp; Hilmar Lapp (2024). Data from: A FAIR and modular image-based workflow for knowledge discovery in the emerging field of imageomics [Dataset]. http://doi.org/10.5281/zenodo.8233380
Explore at:
csv, png, xml, txt, binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.8233380
Dataset updated
Jul 11, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Meghan Balk; Meghan Balk; Thibault Tabarin; Thibault Tabarin; John Bradley; John Bradley; Hilmar Lapp; Hilmar Lapp
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data and results from the Imageomics Workflow. These include data files from the Fish-AIR repository (https://fishair.org/) for purposes of reproducibility and outputs from the application-specific imageomics workflow contained in the Minnow_Segmented_Traits repository (https://github.com/hdr-bgnn/Minnow_Segmented_Traits).

Fish-AIR:
This is the dataset downloaded from Fish-AIR, filtering for Cyprinidae and the Great Lakes Invasive Network (GLIN) from the Illinois Natural History Survey (INHS) dataset. These files contain information about fish images, fish image quality, and path for downloading the images. The data download ARK ID is dtspz368c00q. (2023-04-05). The following files are unaltered from the Fish-AIR download. We use the following files:

extendedImageMetadata.csv: A CSV file containing information about each image file. It has the following columns: ARKID, fileNameAsDelivered, format, createDate, metadataDate, size, width, height, license, publisher, ownerInstitutionCode. Column definitions are defined https://fishair.org/vocabulary.html and the persistent column identifiers are in the meta.xml file.

imageQualityMetadata.csv: A CSV file containing information about the quality of each image. It has the following columns: ARKID, license, publisher, ownerInstitutionCode, createDate, metadataDate, specimenQuantity, containsScaleBar, containsLabel, accessionNumberValidity, containsBarcode, containsColorBar, nonSpecimenObjects, partsOverlapping, specimenAngle, specimenView, specimenCurved, partsMissing, allPartsVisible, partsFolded, brightness,
uniformBackground, onFocus, colorIssue, quality, resourceCreationTechnique. Column definitions are defined https://fishair.org/vocabulary.html and the persistent column identifiers are in the meta.xml file.

multimedia.csv: A CSV file containing information about image downloads. It has the following columns: ARKID, parentARKID, accessURI, createDate, modifyDate, fileNameAsDelivered, format, scientificName, genus, family, batchARKID, batchName, license, source, ownerInstitutionCode. Column definitions are defined https://fishair.org/vocabulary.html and the persistent column identifiers are in the meta.xml file.

meta.xml: A XML file with the metadata about the column indices and URIs for each file contained in the original downloaded zip file. This file is used in the fish-air.R script to extract the indices for column headers.

The outputs from the Minnow_Segmented_Traits workflow are:

sampling.df.seg.csv: Table with tallies of the sampling of image data per species during the data cleaning and data analysis. This is used in Table S1 in Balk et al.

presence.absence.matrix.csv: The Presence-Absence matrix from segmentation, not cleaned. This is the result of the combined outputs from the presence.json files created by the rule “create_morphological_analysis”. The cleaned version of this matrix is shown as Table S3 in Balk et al.

heatmap.avg.blob.png and heatmap.sd.blob.png: Heatmaps of average area of biggest blob per trait (heatmap.avg.blob.png) and standard deviation of area of biggest blob per trait (heatmap.sd.blob.png). These images are also in Figure S3 of Balk et al.

minnow.filtered.from.iqm.csv: Filtered fish image data set after filtering (see methods in Balk et al. for filter categories).

burress.minnow.sp.filtered.from.iqm.csv: Fish image data set after filtering and selecting species from Burress et al. 2017.
Z
Data from: Lower complexity of motor primitives ensures robust control of...
data.niaid.nih.gov
Updated Jun 18, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Arampatzis, Adamantios (2022). Lower complexity of motor primitives ensures robust control of high-speed human locomotion [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3764760
Explore at:
Dataset updated
Jun 18, 2022
Dataset provided by
Kunimasa, Yoko
Kijima, Kota
Ishikawa, Masaki
Santuz, Alessandro
Ekizos, Antonis
Arampatzis, Adamantios
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Walking and running are mechanically and energetically different locomotion modes. For selecting one or another, speed is a parameter of paramount importance. Yet, both are likely controlled by similar low-dimensional neuronal networks that reflect in patterned muscle activations called muscle synergies. Here, we investigated how humans synergistically activate muscles during locomotion at different submaximal and maximal speeds. We analysed the duration and complexity (or irregularity) over time of motor primitives, the temporal components of muscle synergies. We found that the challenge imposed by controlling high-speed locomotion forces the central nervous system to produce muscle activation patterns that are wider and less complex relative to the duration of the gait cycle. The motor modules, or time-independent coefficients, were redistributed as locomotion speed changed. These outcomes show that robust locomotion control at challenging speeds is achieved by modulating the relative contribution of muscle activations and producing less complex and wider control signals, whereas slow speeds allow for more irregular control.

In this supplementary data set we made available: a) the metadata with anonymized participant information, b) the raw EMG, c) the touchdown and lift-off timings of the recorded limb, d) the filtered and time-normalized EMG, e) the muscle synergies extracted via NMF and f) the code to process the data, including the scripts to calculate the Higuchi's fractal dimension (HFD) of motor primitives. In total, 180 trials from 30 participants are included in the supplementary data set.

The file “metadata.dat” is available in ASCII and RData format and contains:

Code: the participant’s code

Group: the experimental group in which the participant was involved (G1 = walking and submaximal running; G2 = submaximal and maximal running)

Sex: the participant’s sex (M or F)

Speeds: the type of locomotion (W for walking or R for running) and speed at which the recordings were conducted in 10*[m/s]

Age: the participant’s age in years

Height: the participant’s height in [cm]

Mass: the participant’s body mass in [kg]

PB: 100 m-personal best time (for G2).

The "RAW_DATA.RData" R list consists of elements of S3 class "EMG", each of which is a human locomotion trial containing cycle segmentation timings and raw electromyographic (EMG) data from 13 muscles of the right-side leg. Cycle times are structured as data frames containing two columns that correspond to touchdown (first column) and lift-off (second column). Raw EMG data sets are also structured as data frames with one row for each recorded data point and 14 columns. The first column contains the incremental time in seconds. The remaining 13 columns contain the raw EMG data, named with the following muscle abbreviations: ME = gluteus medius, MA = gluteus maximus, FL = tensor fasciæ latæ, RF = rectus femoris, VM = vastus medialis, VL = vastus lateralis, ST = semitendinosus, BF = biceps femoris, TA = tibialis anterior, PL = peroneus longus, GM = gastrocnemius medialis, GL = gastrocnemius lateralis, SO = soleus. Please note that the following trials include less than 30 gait cycles (the actual number shown between parentheses): P16_R_83 (20), P16_R_95 (25), P17_R_28 (28), P17_R_83 (24), P17_R_95 (13), P18_R_95 (23), P19_R_95 (18), P20_R_28 (25), P20_R_42 (27), P20_R_95 (25), P22_R_28 (23), P23_R_28(29), P24_R_28 (28), P24_R_42 (29), P25_R_28 (29), P25_R_95 (28), P26_R_28 (29), P26_R_95 (28), P27_R_28 (28), P27_R_42 (29), P27_R_95 (24), P28_R_28 (29), P29_R_95 (17). All the other trials consist of 30 gait cycles. Trials are named like “P20_R_20,” where the characters “P20” indicate the participant number (in this example the 20th), the character “R” indicate the locomotion type (W=walking, R=running), and the numbers “20” indicate the locomotion speed in 10*m/s (in this case the speed is 2.0 m/s). The filtered and time-normalized emg data is named, following the same rules, like “FILT_EMG_P03_R_30”.

Old versions not compatible with the R package musclesyneRgies

The files containing the gait cycle breakdown are available in RData format, in the file named “CYCLE_TIMES.RData”. The files are structured as data frames with as many rows as the available number of gait cycles and two columns. The first column named “touchdown” contains the touchdown incremental times in seconds. The second column named “stance” contains the duration of each stance phase of the right foot in seconds. Each trial is saved as an element of a single R list. Trials are named like “CYCLE_TIMES_P20_R_20,” where the characters “CYCLE_TIMES” indicate that the trial contains the gait cycle breakdown times, the characters “P20” indicate the participant number (in this example the 20th), the character “R” indicate the locomotion type (W=walking, R=running), and the numbers “20” indicate the locomotion speed in 10*m/s (in this case the speed is 2.0 m/s). Please note that the following trials include less than 30 gait cycles (the actual number shown between parentheses): P16_R_83 (20), P16_R_95 (25), P17_R_28 (28), P17_R_83 (24), P17_R_95 (13), P18_R_95 (23), P19_R_95 (18), P20_R_28 (25), P20_R_42 (27), P20_R_95 (25), P22_R_28 (23), P23_R_28(29), P24_R_28 (28), P24_R_42 (29), P25_R_28 (29), P25_R_95 (28), P26_R_28 (29), P26_R_95 (28), P27_R_28 (28), P27_R_42 (29), P27_R_95 (24), P28_R_28 (29), P29_R_95 (17).

The files containing the raw, filtered and the normalized EMG data are available in RData format, in the files named “RAW_EMG.RData” and “FILT_EMG.RData”. The raw EMG files are structured as data frames with as many rows as the amount of recorded data points and 13 columns. The first column named “time” contains the incremental time in seconds. The remaining 12 columns contain the raw EMG data, named with muscle abbreviations that follow those reported above. Each trial is saved as an element of a single R list. Trials are named like “RAW_EMG_P03_R_30”, where the characters “RAW_EMG” indicate that the trial contains raw emg data, the characters “P03” indicate the participant number (in this example the 3rd), the character “R” indicate the locomotion type (see above), and the numbers “30” indicate the locomotion speed (see above). The filtered and time-normalized emg data is named, following the same rules, like “FILT_EMG_P03_R_30”.

The files containing the muscle synergies extracted from the filtered and normalized EMG data are available in RData format, in the files named “SYNS_H.RData” and “SYNS_W.RData”. The muscle synergies files are divided in motor primitives and motor modules and are presented as direct output of the factorisation and not in any functional order. Motor primitives are data frames with 6000 rows and a number of columns equal to the number of synergies (which might differ from trial to trial) plus one. The rows contain the time-dependent coefficients (motor primitives), one column for each synergy plus the time points (columns are named e.g. “time, Syn1, Syn2, Syn3”, where “Syn” is the abbreviation for “synergy”). Each gait cycle contains 200 data points, 100 for the stance and 100 for the swing phase which, multiplied by the 30 recorded cycles, result in 6000 data points distributed in as many rows. This output is transposed as compared to the one discussed in the methods section to improve user readability. Each set of motor primitives is saved as an element of a single R list. Trials are named like “SYNS_H_P12_W_07”, where the characters “SYNS_H” indicate that the trial contains motor primitive data, the characters “P12” indicate the participant number (in this example the 12th), the character “W” indicate the locomotion type (see above), and the numbers “07” indicate the speed (see above). Motor modules are data frames with 12 rows (number of recorded muscles) and a number of columns equal to the number of synergies (which might differ from trial to trial). The rows, named with muscle abbreviations that follow those reported above, contain the time-independent coefficients (motor modules), one for each synergy and for each muscle. Each set of motor modules relative to one synergy is saved as an element of a single R list. Trials are named like “SYNS_W_P22_R_20”, where the characters “SYNS_W” indicate that the trial contains motor module data, the characters “P22” indicate the participant number (in this example the 22nd), the character “W” indicates the locomotion type (see above), and the numbers “20” indicate the speed (see above). Given the nature of the NMF algorithm for the extraction of muscle synergies, the supplementary data set might show non-significant differences as compared to the one used for obtaining the results of this paper.

The files containing the HFD calculated from motor primitives are available in RData format, in the file named “HFD.RData”. HFD results are presented in a list of lists containing, for each trial, 1) the HFD, and 2) the interval time k used for the calculations. HFDs are presented as one number (mean HFD of the primitives for that trial), as are the interval times k. Trials are named like “HFD_P01_R_95”, where the characters “HFD” indicate that the trial contains HFD data, the characters “P01” indicate the participant number (in this example the 1st), the character “R” indicates the locomotion type (see above), and the numbers “95” indicate the speed (see above).

All the code used for the pre-processing of EMG data, the extraction of muscle synergies and the calculation of HFD is available in R format. Explanatory comments are profusely present throughout the script “muscle_synergies.R”.
w
Dataset of list of books by John R. Salter
workwithdata.com
Updated Apr 17, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Work With Data (2025). Dataset of list of books by John R. Salter [Dataset]. https://www.workwithdata.com/datasets/books?col=book&f=1&fcol0=author&fop0=%3D&fval0=John+R.+Salter
Explore at:
Dataset updated
Apr 17, 2025
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is about books. It has 6 rows and is filtered where the author is John R. Salter. It features one column called book.

Facebook

Twitter

Click to copy link

Link copied

Cite

Data set for reproducing plots showing stable water isotopologue transport and fractionation [Dataset]. https://darus.uni-stuttgart.de/dataset.xhtml?persistentId=doi:10.18419/darus-3108

Data set for reproducing plots showing stable water isotopologue transport and fractionation

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Unique identifier

https://doi.org/10.18419/DARUS-3108

Dataset updated

Oct 6, 2022

Dataset provided by

DaRUS

Authors

Stefanie Kiemle; Katharina Heck

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Dataset funded by

DFG

Description

This data set includes the *.csv data and the used scripts to reproduce the plots of the three different scenarios presented in S. Kiemle, K. Heck, E. Coltman, R. Helmig (2022) Stable water isotopologue fractionation during soil-water evaporation: Analysis using a coupled soil-atmosphere model. (Under review) Water Resources Research. *.csv files The isotope distribution has been analyzed in the vertical and in horizontal direction of a soil column for all scenarios. Therefore, we provide *.csv files generated using the ParaView Tools "plot over line" or "plot over time". Each *.csv file contains information about the saturation, temperature, and component composition for each phase in mole fraction or in the isotopic-specific delta notation. Additionally, information about the evaporation rate is given in a separate file *.txt file. python scripts For each scenario, we provide scripts to reproduce the presented plots. Scenarios We used different free flow conditions to analyze the fractionation processes inside the porous medium. Scenario 1. laminar flow, Scenario 2. laminar flow, but with isolation of parameter affecting the fractionation process, Scenario 3. turbulent flow. Please find below a detailed description of the data labeling and needed scripts to reproduce a certain plot for each scenario. Scenario: The spatial distribution of stable water isotopologues in horizontal (-0.01 m depth) and vertical (at 0.05 m width) inside a soil column at selected days (DoE (Day of Experiment)): Use the python scripts plot_concentration_horizontal_all.py (horizontal direction) and plot_concentration_spatial_all.py (vertical direction) to create the specific plots. In the folder IsotopeProfile_Horizontal and IsotopeProfile_Vertical the belonging *.csv can be found. The *.csv files are named after the selected day (e.g. DoE_80 refers to day 80 of the virtual experiment). The influence of the evaporation rate on isotopic fractionation processes in various depths (-0.001, -0.005, -0.009, and -0.018 m ) during the whole virtual experiment time: Use the python script plot_evap_isotopes_v2.py to create the plots. The data for the isotopologues distribution and the saturation can be found in the folder PlotOverTime. All data is named as PlotOverTime_xxxxm with xxxx representing the respective depth (e.g. PlotOverTime_0.001m refers to -0.001 m depth). The data for the evaporation rate can be found in the folder EvaporationRate. Note, the evaporation rate data is available as a .txt because we extract the information about the evaporation directly during the simulation and do not derive it through any post-processing. Scenario: Process behavior of isolated parameters that influences the isotopic fractionation: Use plot_concentration.py to reproduce the plots either represented in the isotopic-specific delta notation or in mole fraction. The corresponding data can be found in the folder IsotopeProfile_Vertical. The data labeling refers to the single cases (1- no fractionation; 2 - only equilibrium fractionation; 3 - only kinetic fractionation; 4 - only liquid diffusion; 5 - Reference). Scenario: Evaporation rate during the virtual experiment for different flow cases: With plot_evap.py and the .txt files which can be found in the folder EvaporationRate, the evaporation progression can be plotted. The labeling of the .txt files refers to the different flow cases (1 - 0.1 m/s (laminar); 2 - 0.13 m/s (laminar); 3 - 0.5 m/s (turbulent); 4 - 1 m/s (turbulent); 5 - 3 m/s (turbulent)). The isotope profiles in the vertical and horizontal direction of the soil column (similar to Scenario 1) for selected days: With plot_cocentration_horizontal_all.py and plot_concentration_spatial_all.py the plots for the horizontal and vertical distribution of isotopologues can be generated. The corresponding data can be found in the folders IsotopeProfile_Horizontal and IsotopeProfile_Vertical. These folders are structured with subfolders containing the data of selected days of the virtual experiments (DoE - Day of Experiments), in this case, day 2, 10, and 35. The data labeling remains similar to Scenario 3a).

Clear search

Close search

Google apps

Main menu

Data set for reproducing plots showing stable water isotopologue transport...

Petre_Slide_CategoricalScatterplotFigShare.pptx

7 Display the graph in a separate window. Dot colors indicate

R codes and dataset for Visualisation of Diachronic Constructional Change...

Video Plankton Recorder data (formatted with taxa displayed in single...

[Superseded] Intellectual Property Government Open Data 2019

What is IPGOD?\r

How do I use IPGOD?\r

IP Data Platform\r

References\r

Updates\r

Tables and columns\r

Data quality improvements\r

Randomized Hourly Load Data for use with Taxonomy Distribution Feeders

Reddit r/AskScience Flair Dataset

Reddit AskScience Flair Analysis Dataset

Context

Content

Ideas for Usage

ukbtools: An R package to manage and query UK Biobank data

Data from: Dataset from : Browsing is a strong filter for savanna tree...

University SET data, with faculty and courses characteristics

Data from: Thirteen-year Stover Harvest and Tillage Effects on Corn...

Data and tools for studying isograms

Tennessee Eastman Process Simulation Dataset

Intro

Content

Acknowledgements

User Agreement

Movie Rationales (Rationales For Movie Reviews)

License

SocialGrep Reddit Comment & Sentiment

Columns

Distribution

Usage

Coverage

License

Who Can Use It

Dataset Name Suggestions

Attributes

EM2040 Water Column Sonar Data Collected During H13177

Data from: A FAIR and modular image-based workflow for knowledge discovery...

Data from: Lower complexity of motor primitives ensures robust control of...

Dataset of list of books by John R. Salter

Data set for reproducing plots showing stable water isotopologue transport and fractionation