37 datasets found
  1. R codes and dataset for Visualisation of Diachronic Constructional Change...

    • bridges.monash.edu
    • researchdata.edu.au
    zip
    Updated May 30, 2023
    Cite
    Gede Primahadi Wijaya Rajeg (2023). R codes and dataset for Visualisation of Diachronic Constructional Change using Motion Chart [Dataset]. http://doi.org/10.26180/5c844c7a81768
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    Monash University
    Authors
    Gede Primahadi Wijaya Rajeg
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Publication: Primahadi Wijaya R., Gede. 2014. Visualisation of diachronic constructional change using Motion Chart. In Zane Goebel, J. Herudjati Purwoko, Suharno, M. Suryadi & Yusuf Al Aried (eds.). Proceedings: International Seminar on Language Maintenance and Shift IV (LAMAS IV), 267-270. Semarang: Universitas Diponegoro. doi: https://doi.org/10.4225/03/58f5c23dd8387

    Description of R codes and data files in the repository

    This repository is imported from its GitHub repo. Versioning of this figshare repository is associated with the GitHub repo's Releases, so check the Releases page for updates (the next version is to include the unified version of the codes in the first release with the tidyverse).

    The raw input data consist of two files (will_INF.txt and go_INF.txt). They represent the co-occurrence frequency of the top-200 infinitival collocates for will and be going to respectively across the twenty decades of the Corpus of Historical American English (from the 1810s to the 2000s).

    These two input files are used in the R code file 1-script-create-input-data-raw.r. The code preprocesses and combines the two files into a long-format data frame with the following columns: (i) decade, (ii) coll (for "collocate"), (iii) BE going to (frequency of the collocates with be going to) and (iv) will (frequency of the collocates with will); the result is available in input_data_raw.txt. The script 2-script-create-motion-chart-input-data.R then processes input_data_raw.txt to normalise the co-occurrence frequency of the collocates per million words (the COHA size and normalising base frequency are available in coha_size.txt). The output of the second script is input_data_futurate.txt.

    Next, input_data_futurate.txt contains the relevant input data for generating (i) the static motion chart included as an image plot in the publication (using the script 3-script-create-motion-chart-plot.R), and (ii) the dynamic motion chart (using the script 4-script-motion-chart-dynamic.R).

    The repository adopts the project-oriented workflow in RStudio; double-click the Future Constructions.Rproj file to open an RStudio session whose working directory is associated with the contents of this repository.
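    As a rough illustration of the first preprocessing step described above (this is not the repository's own script, and the raw-file layout is assumed here to be a tab-separated table with decade, coll and freq columns), the combination into the long-format data frame could look like this in R:

    # Illustrative sketch only; 1-script-create-input-data-raw.r is the
    # authoritative version. Assumed raw layout: decade, coll, freq.
    library(tidyverse)

    will  <- read_tsv("will_INF.txt")   # collocate frequencies for will
    going <- read_tsv("go_INF.txt")     # collocate frequencies for be going to

    input_data_raw <- full_join(
      going %>% rename(`BE going to` = freq),
      will  %>% rename(will = freq),
      by = c("decade", "coll")
    )

    write_tsv(input_data_raw, "input_data_raw.txt")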

  2. LPH Marks et al. Publicly Available Dataset

    • data.mendeley.com
    Updated Mar 16, 2021
    Cite
    Charles Marks (2021). LPH Marks et al. Publicly Available Dataset [Dataset]. http://doi.org/10.17632/t9wbtt3mt2.1
    Dataset updated
    Mar 16, 2021
    Authors
    Charles Marks
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset and R code accompany a manuscript submitted by Marks et al. to Lancet Public Health entitled "Identifying Counties at Risk of High Overdose Mortality Burden Throughout the Emerging Fentanyl Epidemic in the United States: A Predictive Statistical Modeling Study". The analyses and results are available in the manuscript. All publicly available data used in the study are included in this dataset, along with several additional variables. Since the study used restricted mortality records from the CDC, we have censored all variables derived from this restricted data. Given access to the restricted data, researchers can add these variables to this dataset in the indicated columns. The accompanying R code was used for the analysis.

  3. Petre_Slide_CategoricalScatterplotFigShare.pptx

    • figshare.com
    pptx
    Updated Sep 19, 2016
    Cite
    Benj Petre; Aurore Coince; Sophien Kamoun (2016). Petre_Slide_CategoricalScatterplotFigShare.pptx [Dataset]. http://doi.org/10.6084/m9.figshare.3840102.v1
    Available download formats: pptx
    Dataset updated
    Sep 19, 2016
    Dataset provided by
    figshare
    Authors
    Benj Petre; Aurore Coince; Sophien Kamoun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Categorical scatterplots with R for biologists: a step-by-step guide

    Benjamin Petre1, Aurore Coince2, Sophien Kamoun1

    1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK

    Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.

    Protocol

    • Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the PowerPoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed is indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains the continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import into R.

    • Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the PowerPoint slide and paste it into the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.

    • Step 3: save the graph as a .pdf file. Shape the window at your convenience and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.

    Notes

    • Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.

    • Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.

    # 7 Display the graph in a separate window. Dot colors indicate replicates
    graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()
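    For orientation, a minimal sketch of steps 2-3 follows. This is not the script distributed on the PowerPoint slide; it only assumes the three-column .csv from step 1 (Replicate, Condition, Value) and the ggplot2 package from Note 1.

    # Minimal sketch of steps 2-3 (not the distributed script)
    # install.packages("ggplot2")      # see Note 1
    library(ggplot2)

    data <- read.csv(file.choose())    # step 2: select the .csv file from step 1

    graph <- ggplot(data, aes(x = Condition, y = Value))
    graph + geom_boxplot(outlier.colour = "black", colour = "black") +
      geom_jitter(aes(col = Replicate)) +
      theme_bw()

    ggsave("categorical_scatterplot.pdf")   # step 3: saves the last plot displayed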

    References

    Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.

    Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035

    Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128

    https://cran.r-project.org/

    http://ggplot2.org/

  4. Uniform Crime Reporting (UCR) Program Data: Arrests by Age, Sex, and Race,...

    • search.datacite.org
    • openicpsr.org
    Updated 2018
    Cite
    Jacob Kaplan (2018). Uniform Crime Reporting (UCR) Program Data: Arrests by Age, Sex, and Race, 1980-2016 [Dataset]. http://doi.org/10.3886/e102263v5-10021
    Dataset updated
    2018
    Dataset provided by
    Inter-university Consortium for Political and Social Research (https://www.icpsr.umich.edu/web/pages/)
    DataCite (https://www.datacite.org/)
    Authors
    Jacob Kaplan
    Description

    Version 5 release notes:
    • Removes support for SPSS and Excel data.
    • Changes the crimes that are stored in each file. There are more files now with fewer crimes per file. The files and their included crimes have been updated below.
    • Adds in agencies that report 0 months of the year.
    • Adds a column that indicates the number of months reported. This is generated by summing up the number of unique months an agency reports data for. Note that this indicates the number of months an agency reported arrests for ANY crime; they may not necessarily report every crime every month. Agencies that did not report a crime will have a value of NA for every arrest column for that crime.
    • Removes data on runaways.
    Version 4 release notes:
    • Changes column names from "poss_coke" and "sale_coke" to "poss_heroin_coke" and "sale_heroin_coke" to clearly indicate that these columns include the sale of heroin as well as similar opiates such as morphine, codeine, and opium. Also changes column names for the narcotic columns to indicate that they are only for synthetic narcotics.
    Version 3 release notes:
    • Adds data for 2016.
    • Orders rows by year (descending) and ORI.
    Version 2 release notes:
    • Fixes a bug where the Philadelphia Police Department had an incorrect FIPS county code.
    The Arrests by Age, Sex, and Race data is an FBI data set that is part of the annual Uniform Crime Reporting (UCR) Program data. It contains highly granular data on the number of people arrested for a variety of crimes (see below for a full list of included crimes). The data sets here combine data from the years 1980-2015 into a single file. These files are quite large and may take some time to load.
    All the data was downloaded from NACJD as ASCII+SPSS Setup files and read into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see https://github.com/jacobkap/crime_data. If you have any questions, comments, or suggestions please contact me at jkkaplan6@gmail.com.

    I did not make any changes to the data other than the following. When an arrest column has a value of "None/not reported", I change that value to zero. This makes the (possibly incorrect) assumption that these values represent zero crimes reported. The original data does not have a value when the agency reports zero arrests other than "None/not reported." In other words, this data does not differentiate between real zeros and missing values. Some agencies also incorrectly report the following numbers of arrests, which I change to NA: 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99999, 99998.
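    As an illustration only (this is not the author's cleaning code; the data frame ucr and the column murder_tot_arrests are hypothetical stand-ins), the recoding described above could be written in R as:

    # Hypothetical example of the recoding described above
    bad_values <- c(10000, 20000, 30000, 40000, 50000, 60000,
                    70000, 80000, 90000, 100000, 99999, 99998)

    ucr$murder_tot_arrests[ucr$murder_tot_arrests == "None/not reported"] <- 0
    ucr$murder_tot_arrests <- as.numeric(ucr$murder_tot_arrests)
    ucr$murder_tot_arrests[ucr$murder_tot_arrests %in% bad_values] <- NA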

    To reduce file size and make the data more manageable, all of the data is aggregated yearly. All of the data is in agency-year units such that every row indicates an agency in a given year. Columns are crime-arrest category units. For example, if you choose the data set that includes murder, you would have rows for each agency-year and columns with the number of people arrested for murder. The ASR data breaks down arrests by age and gender (e.g. Male aged 15, Male aged 18). They also provide the number of adults or juveniles arrested by race. Because most agencies and years do not report the arrestee's ethnicity (Hispanic or not Hispanic) or juvenile outcomes (e.g. referred to adult court, referred to welfare agency), I do not include these columns.

    To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS (state, county, and place) and agency type/subtype. Please note that some of the FIPS codes have leading zeros and if you open it in Excel it will automatically delete those leading zeros.

    I created 9 arrest categories myself. The categories are:
    • Total Male Juvenile
    • Total Female Juvenile
    • Total Male Adult
    • Total Female Adult
    • Total Male
    • Total Female
    • Total Juvenile
    • Total Adult
    • Total Arrests

    All of these categories are based on the sums of the sex-age categories (e.g. Male under 10, Female aged 22) rather than using the provided age-race categories (e.g. adult Black, juvenile Asian). As not all agencies report the race data, my method is more accurate. These categories also make up the data in the "simple" version of the data. The "simple" file only includes the above 9 columns as the arrest data (all other columns in the data are just agency identifier columns). Because this "simple" data set needs fewer columns, I include all offenses.

    As the arrest data is very granular, and each category of arrest is its own column, there are dozens of columns per crime. To keep the data somewhat manageable, there are nine different files: eight of which contain different crimes, plus the "simple" file. Each file contains the data for all years. The eight categories each have crimes belonging to a major crime category and do not overlap in crimes other than with the index offenses. Please note that the crime names provided below are not the same as the column names in the data. Due to Stata limiting column names to a maximum of 32 characters, I have abbreviated the crime names in the data. The files and their included crimes are:

    • Index Crimes: Murder, Rape, Robbery, Aggravated Assault, Burglary, Theft, Motor Vehicle Theft, Arson
    • Alcohol Crimes: DUI, Drunkenness, Liquor
    • Drug Crimes: Total Drug, Total Drug Sales, Total Drug Possession, Cannabis Possession, Cannabis Sales, Heroin or Cocaine Possession, Heroin or Cocaine Sales, Other Drug Possession, Other Drug Sales, Synthetic Narcotic Possession, Synthetic Narcotic Sales
    • Grey Collar and Property Crimes: Forgery, Fraud, Stolen Property
    • Financial Crimes: Embezzlement, Total Gambling, Other Gambling, Bookmaking, Numbers Lottery
    • Sex or Family Crimes: Offenses Against the Family and Children, Other Sex Offenses, Prostitution, Rape
    • Violent Crimes: Aggravated Assault, Murder, Negligent Manslaughter, Robbery, Weapon Offenses
    • Other Crimes: Curfew, Disorderly Conduct, Other Non-traffic, Suspicion, Vandalism, Vagrancy
    • Simple: This data set has every crime and only the arrest categories that I created (see above).
    If you have any questions, comments, or suggestions please contact me at jkkaplan6@gmail.com.

  5. KORUS-AQ Pandora Column Observations - Dataset - NASA Open Data Portal

    • data.nasa.gov
    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    Updated Apr 1, 2025
    Cite
    nasa.gov (2025). KORUS-AQ Pandora Column Observations - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/korus-aq-pandora-column-observations-dda12
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    KORUSAQ_Ground_Pandora_Data contains all of the Pandora instrumentation data collected during the KORUS-AQ field study. Contained in this dataset are column measurements of NO2, O3, and HCHO. Pandoras were situated at various ground sites across the study area, including NIER-Taehwa, NIER-Olympic Park, NIER-Gwangju, NIER-Anmyeon, Busan, Yonsei University, Songchon, and Yeoju. Data collection for this product is complete.

    The KORUS-AQ field study was conducted in South Korea during May-June 2016. The study was jointly sponsored by NASA and Korea's National Institute of Environmental Research (NIER). The primary objectives were to investigate the factors controlling air quality in Korea (e.g., local emissions, chemical processes, and transboundary transport) and to assess future air quality observing strategies incorporating geostationary satellite observations. To achieve these science objectives, KORUS-AQ adopted a highly coordinated sampling strategy involving surface and airborne measurements with both in-situ and remote sensing instruments.

    Surface observations provided details on ground-level air quality conditions, while airborne sampling provided an assessment of conditions aloft relevant to satellite observations and necessary to understand the role of emissions, chemistry, and dynamics in determining air quality outcomes. The sampling region covers the South Korean peninsula and surrounding waters, with a primary focus on the Seoul Metropolitan Area. Airborne sampling was primarily conducted from near the surface to about 8 km, with extensive profiling to characterize the vertical distribution of pollutants and their precursors. The airborne observational data were collected from three aircraft platforms: the NASA DC-8, NASA B-200, and Hanseo King Air. Surface measurements were conducted from 16 ground sites and 2 ships: R/V Onnuri and R/V Jang Mok.

    The major data products collected from both the ground and air include in-situ measurements of trace gases (e.g., ozone, reactive nitrogen species, carbon monoxide and dioxide, methane, non-methane and oxygenated hydrocarbon species), aerosols (e.g., microphysical and optical properties and chemical composition), active remote sensing of ozone and aerosols, and passive remote sensing of NO2, CH2O, and O3 column densities. These data products support research focused on examining the impact of photochemistry and transport on ozone and aerosols, evaluating emissions inventories, and assessing the potential use of satellite observations in air quality studies.

  6. Test datasets for evaluating automated transcription of primary specimen...

    • figshare.unimelb.edu.au
    csv
    Updated Jun 18, 2025
    Cite
    Robert Turnbull; Emily Fitzgerald; Karen Thompson; JOANNE BIRCH (2025). Test datasets for evaluating automated transcription of primary specimen labels on herbarium specimen sheets [Dataset]. http://doi.org/10.26188/25648902.v4
    Available download formats: csv
    Dataset updated
    Jun 18, 2025
    Dataset provided by
    The University of Melbourne
    Authors
    Robert Turnbull; Emily Fitzgerald; Karen Thompson; JOANNE BIRCH
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This contains three datasets to evaluate the automated transcription of primary specimen labels (also known as 'institutional labels') on herbarium specimen sheets.

    Two datasets are derived from the herbarium at the University of Melbourne (MELU), one with printed or typed institutional labels (MELU-T) and the other with handwritten labels (MELU-H). The other dataset (DILLEN) is derived from: Mathias Dillen. (2018). A benchmark dataset of herbarium specimen images with label data: Summary [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6372393

    Each dataset is in CSV format and has 100 rows, each relating to an image of an individual herbarium specimen sheet. There is a column in each CSV for the URL of the image. The Dillen dataset has an additional column with the DOI for each image. There is a column for the label_classification, which indicates the type of text found in the institutional label, in one of the following four categories: handwritten, typewriter, printed, mixed.

    There are also columns for the following twelve text fields: family, genus, species, infrasp_taxon, authority, collector_number, collector, locality, geolocation, year, month, day. If a text field is not present on the label, then the corresponding cell is left empty. The text fields in the dataset are designed to come from the primary specimen label only and may not agree with other information on the specimen sheet. In some cases the text on the labels may be ambiguous, and human annotators could arrive at different encodings.

    Evaluation script

    We provide a Python script to evaluate the output of an automated pipeline with these datasets. The script requires typer, pandas, plotly and kaleido. You can install these dependencies in a virtual environment as follows:

    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt

    To evaluate your pipeline, produce another CSV file with the same columns and with the output of the pipeline in the same order as one of the datasets. For example, if the CSV of your pipeline is called hespi-dillen.csv, then you can evaluate it like this:

    python3 ./evaluate.py DILLEN.csv hespi-dillen.csv --output hespi-dillen.pdf

    This will produce an output image called hespi-dillen.pdf with a plot of the similarity of each field with the test set in DILLEN.csv. The file format for the plot can also be svg, png or jpg.

    The similarity measure uses the Gestalt (Ratcliff/Obershelp) approach and is a percentage similarity between each pair of strings. Only fields where text is provided in either the test dataset or the predictions are included in the results. If a field is present in either the test dataset or the predictions but not in the other, then the similarity is given as zero. All non-ASCII characters are removed. By default the results are not case-sensitive. If you wish to evaluate with case-sensitive comparison, then use the --case-sensitive option on the command line. The output of the script will also report the accuracy of the label classification and whether or not any particular field should be empty.

    Options for the script can be found by running:

    python3 ./evaluate.py --help

    Credit

    Robert Turnbull, Emily Fitzgerald, Karen Thompson and Joanne Birch from the University of Melbourne. If you use this dataset, please cite it and the corresponding Hespi paper. More information at https://github.com/rbturnbull/hespi. This dataset is available on GitHub here: https://github.com/rbturnbull/hespi-test-data
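    For readers working in R rather than Python, a rough sketch of the per-field comparison logic described above follows. It is an illustration only: the released evaluate.py uses the Gestalt (Ratcliff/Obershelp) measure, which is approximated here with a normalised Levenshtein similarity via adist().

    # Rough R sketch of the per-field comparison (illustration only)
    test <- read.csv("DILLEN.csv", stringsAsFactors = FALSE)
    pred <- read.csv("hespi-dillen.csv", stringsAsFactors = FALSE)

    field_similarity <- function(a, b) {
      a <- tolower(trimws(ifelse(is.na(a), "", a)))
      b <- tolower(trimws(ifelse(is.na(b), "", b)))
      if (a == "" && b == "") return(NA)    # empty in both: not scored
      if (a == "" || b == "") return(0)     # present on one side only: zero
      1 - adist(a, b)[1, 1] / max(nchar(a), nchar(b))
    }

    mean(mapply(field_similarity, test$species, pred$species), na.rm = TRUE)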

  7. Data from: Bike Sharing Dataset

    • kaggle.com
    Updated Sep 10, 2024
    Cite
    Ram Vishnu R (2024). Bike Sharing Dataset [Dataset]. https://www.kaggle.com/datasets/ramvishnur/bike-sharing-dataset
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 10, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ram Vishnu R
    Description

    Problem Statement:

    A bike-sharing system is a service in which bikes are made available for shared use to individuals on a short term basis for a price or free. Many bike share systems allow people to borrow a bike from a "dock" which is usually computer-controlled wherein the user enters the payment information, and the system unlocks it. This bike can then be returned to another dock belonging to the same system.

    A US bike-sharing provider, BoomBikes, has recently suffered a considerable dip in its revenue due to the COVID-19 pandemic. The company is finding it very difficult to sustain itself in the current market scenario. So, it has decided to come up with a mindful business plan to be able to accelerate its revenue.

    To that end, BoomBikes aspires to understand the demand for shared bikes among the people, so that it can prepare itself to cater to people's needs once the situation improves, stand out from other service providers, and make large profits.

    They have contracted a consulting company to understand the factors on which the demand for these shared bikes depends. Specifically, they want to understand the factors affecting the demand for these shared bikes in the American market. The company wants to know:

    • Which variables are significant in predicting the demand for shared bikes.
    • How well those variables describe the bike demands

    Based on various meteorological surveys and people's styles, the service provider firm has gathered a large dataset on daily bike demands across the American market based on some factors.

    Business Goal:

    You are required to model the demand for shared bikes with the available independent variables. It will be used by the management to understand how exactly the demands vary with different features. They can accordingly manipulate the business strategy to meet the demand levels and meet the customer's expectations. Further, the model will be a good way for management to understand the demand dynamics of a new market.

    Data Preparation:

    1. You can observe in the dataset that some of the variables like 'weathersit' and 'season' have values 1, 2, 3, 4, which have specific labels associated with them (as can be seen in the data dictionary). These numeric values may suggest that there is some order to them, which is actually not the case (check the data dictionary and think about why). So, it is advisable to convert such feature values into categorical string values before proceeding with model building. Please refer to the data dictionary to get a better understanding of all the independent variables.
    2. You might notice the column 'yr' with two values, 0 and 1, indicating the years 2018 and 2019 respectively. At first instinct, you might think it is a good idea to drop this column as it only has two values and so might not add value to the model. But in reality, since these bike-sharing systems are slowly gaining popularity, the demand for these bikes is increasing every year, which suggests that the column 'yr' might be a good variable for prediction. So think twice before dropping it.

    Model Building:

    In the dataset provided, you will notice that there are three columns named 'casual', 'registered', and 'cnt'. The variable 'casual' indicates the number of casual users who have made a rental. The variable 'registered', on the other hand, shows the total number of registered users who have made a booking on a given day. Finally, the 'cnt' variable indicates the total number of bike rentals, including both casual and registered. The model should be built taking 'cnt' as the target variable.

    Model Evaluation:

    When you're done with model building and residual analysis and have made predictions on the test set, make sure you use the following two lines of Python code to calculate the R-squared score on the test set:

    from sklearn.metrics import r2_score
    r2_score(y_test, y_pred)

    Here y_test is the test data for the target variable, and y_pred is the variable containing the predicted values of the target variable on the test set. Please perform this step, as the R-squared score on the test set serves as a benchmark for your model.

  8. Virtual Reality Balance Disturbance Dataset

    • zenodo.org
    bin
    Updated Oct 31, 2024
    Cite
    Nuno Ferrete Ribeiro; Nuno Ferrete Ribeiro; Henrique Pires; Cristina P. Santos; Cristina P. Santos; Henrique Pires (2024). Virtual Reality Balance Disturbance Dataset [Dataset]. http://doi.org/10.5281/zenodo.14013468
    Available download formats: bin
    Dataset updated
    Oct 31, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nuno Ferrete Ribeiro; Nuno Ferrete Ribeiro; Henrique Pires; Cristina P. Santos; Cristina P. Santos; Henrique Pires
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background and Purpose:

    There are very few publicly available datasets on real-world falls in scientific literature due to the lack of natural falls and the inherent difficulties in gathering biomechanical and physiological data from young subjects or older adults residing in their communities in a non-intrusive and user-friendly manner. This data gap hindered research on fall prevention strategies. Immersive Virtual Reality (VR) environments provide a unique solution.

    This dataset supports research in fall prevention by providing an immersive VR setup that simulates diverse ecological environments and randomized visual disturbances, aimed at triggering and analyzing balance-compensatory reactions. The dataset is a unique tool for studying human balance responses to VR-induced perturbations, facilitating research that could inform training programs, wearable assistive technologies, and VR-based rehabilitation methods.

    Dataset Content:
    The dataset includes:

    • Kinematic Data: Captured using a full-body Xsens MVN Awinda inertial measurement system, providing detailed movement data at 60 Hz.
    • Muscle Activity (EMG): Recorded at 1111 Hz using Delsys Trigno for tracking muscle contractions.
    • Electrodermal Activity (EDA)*: Captured at 100.21 Hz with a Shimmer GSR device on the dominant forearm to record physiological responses to perturbations.
    • Metadata: Includes participant demographics (age, height, weight, gender, dominant hand and foot), trial conditions, and perturbation characteristics (timing and type).

    The files are named in the format "ParticipantX_labelled", where X represents the participant's number. Each file is provided in a .mat format, with data already synchronized across different sensor sources. The structure of each file is organized into the following columns:

    • Column 1: Label indicating the visual perturbation applied. 0 means no visual perturbation.
    • Column 2: Timestamp, providing the precise timing of each recorded data point.
    • Column 3: Frame identifier, which can be cross-referenced with the MVN file for detailed motion analysis.
    • Columns 4 to 985: Xsens motion capture features, exported directly from the MVN file.
    • Columns 986 to 993: EMG data - Tibialis Anterior (R&L), Gastrocnemius Medial Head (R&L), Rectus Femoris (R), Semitendinosus (R), External Oblique (R), Sternocleidomastoid (R).
    • Columns 994 to 1008: Shimmer data: Accelerometer (x,y,z), Gyroscope (x,y,z), Magnetometer (x,y,z), GSR Range, Skin Conductance, Skin Resistance, PPG, Pressure, Temperature.

    In addition, we are also releasing the .MVN and .MVNA files for each participant (1 to 10), which provide comprehensive motion capture data and include the participants' body measurements, respectively. This additional data enables precise body modeling and further in-depth biomechanical analysis.
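    A minimal R sketch of loading one participant file and splitting the column blocks described above follows. It is an illustration only: it assumes the .mat files can be read by the R.matlab package (i.e. they are not in MATLAB v7.3/HDF5 format) and that the synchronized matrix is the first element returned, since the variable name inside the file is not documented here.

    # Illustration only; the variable name inside the .mat file is assumed
    library(R.matlab)   # install.packages("R.matlab")

    mat  <- readMat("Participant1_labelled.mat")
    sync <- mat[[1]]    # assumed: the single synchronized data matrix

    perturbation_label <- sync[, 1]        # 0 = no visual perturbation
    timestamp          <- sync[, 2]
    frame_id           <- sync[, 3]        # cross-reference with the MVN file
    xsens              <- sync[, 4:985]    # Xsens motion capture features
    emg                <- sync[, 986:993]  # Delsys Trigno EMG channels
    shimmer            <- sync[, 994:1008] # Shimmer GSR/IMU/EDA channels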

    Participants & VR Headset:

    Twelve healthy young adults (average age: 25.09 ± 2.81 years; height: 167.82 ± 8.40 cm; weight: 64.83 ± 7.77 kg; 6 males, 6 females) participated in this study (Table 1). Participants met the following criteria: i) healthy locomotion, ii) stable postural balance, iii) age ≥ 18 years, and iv) body weight < 135 kg.

    Participants were excluded if they: i) had any condition affecting locomotion, ii) had epilepsy, vestibular disorders, or other neurological conditions impacting stability, iii) had undergone recent surgeries impacting mobility, iv) were involved in other experimental studies, v) were under judicial protection or guardianship, or vi) experienced complications using VR headsets (e.g., motion sickness).

    All participants provided written informed consent, adhering to the ethical guidelines set by the University of Minho Ethics Committee (CEICVS 063/2021), in compliance with the Declaration of Helsinki and the Oviedo Convention.

    To ensure unbiased reactions, participants were kept unaware of the specific protocol details. Visual disturbances were introduced in a random sequence and at various locations, enhancing the unpredictability of the experiment and simulating a naturalistic response.

    The VR setup involved an HTC Vive Pro headset with two wirelessly synchronized base stations that tracked participants’ head movements within a 5m x 2.5m area. The base stations adjusted the VR environment’s perspective according to head movements, while controllers were used solely for setup purposes.

    Table 1 - Participants' demographic information

    Participant | Height (cm) | Weight (kg) | Age | Gender | Dom. Hand | Dom. Foot
    1 | 159 | 56.5 | 23 | F | Right | Right
    2 | 157 | 55.3 | 28 | F | Right | Right
    3 | 174 | 67.1 | 31 | M | Right | Right
    4 | 176 | 73.8 | 23 | M | Right | Right
    5 | 158 | 57.3 | 23 | F | Right | Right
    6 | 181 | 70.9 | 27 | M | Right | Right
    7 | 171 | 73.3 | 23 | M | Right | Right
    8 | 159 | 69.2 | 28 | F | Right | Right
    9 | 177 | 57.3 | 22 | M | Right | Right
    10 | 171 | 75.5 | 25 | M | Right | Right
    11 | 163 | 58.1 | 23 | F | Right | Right
    12 | 168 | 63.7 | 25 | F | Right | Right

    Data Collection Methodology:

    The experimental protocol was designed to integrate four essential components: (i) precise control over stimuli, (ii) high reproducibility of the experimental conditions, (iii) preservation of ecological validity, and (iv) promotion of real-world learning transfer.

    • Participant Instructions and Familiarization Trial: Before starting, participants were given specific instructions to (i) seek assistance if they experienced motion sickness, (ii) adjust the VR headset for comfort by modifying the lens distance and headset fit, (iii) stay within the defined virtual play area demarcated by a blue boundary, and (iv) complete a familiarization trial. During this trial, participants were encouraged to explore various virtual environments while performing a sequence of three key movements—walking forward, turning around, and returning to the initial location—without any visual perturbations. This familiarization phase helped participants acclimate to the virtual space in a controlled setting.
    • Experimental Protocol and Visual Perturbations: Participants were exposed to 11 different types of visual perturbations as outlined in Table 2, applied across a total of 35 unique perturbation variants (Table 3). Each variant involved the same type of perturbation, such as a clockwise Roll Axis Tilt, but varied in intensity (e.g., rotation speed) and was presented in randomized virtual locations. The selection of perturbation types was grounded in existing literature on visual disturbances. This design ensured that participants experienced a diverse range of visual effects in a manner that maintained ecological validity, supporting the potential for generalization to real-world scenarios where visual perturbations might occur spontaneously.
    • Protocol Flow and Randomized Presentation: Throughout the experimental protocol, each visual perturbation variant was presented three times, and participants engaged repeatedly in the familiarization activities over a nearly one-hour period. These activities—walking forward, turning around, and returning to the starting point—took place in a 5m x 2.5m physical space mirrored in VR, allowing participants to take 7–10 steps before turning. Participants were not informed of the timing or nature of any perturbations, which could occur unpredictably during their forward walk, adding a realistic element of surprise. After each return to the starting point, participants were relocated to a random position within the virtual environment, with the sequence of positions determined by a randomized, computer-generated order.

    Table 2 - Visual perturbations' name and parameters (L - Lateral; B - Backward; F - Forward; S - Slip; T - Trip; CW- Clockwise; CCW - Counter-Clockwise)

    Perturbation [Fall Category] | Parameters
    Roll Axis Tilt - CW [L] | [10º, 20º, 30º] during 0.5s
    Roll Axis Tilt – CCW [L] | [10º, 20º, 30º] during 0.5s
    Support Surface ML Axis Translation - Bidirectional [L] | Discrete Movement (static pauses between movements) – 1

  9. Respiration_chambers/raw_log_files and combined datasets of biomass and...

    • cmr.earthdata.nasa.gov
    • researchdata.edu.au
    • +1 more
    Updated Dec 18, 2018
    Cite
    (2018). Respiration_chambers/raw_log_files and combined datasets of biomass and chamber data, and physical parameters [Dataset]. http://doi.org/10.26179/5c1827d5d6711
    Dataset updated
    Dec 18, 2018
    Time period covered
    Jan 27, 2015 - Feb 23, 2015
    Description

    General overview

    The following datasets are described by this metadata record, and are available for download from the provided URL.

    • Raw log files, physical parameters raw log files
    • Raw excel files, respiration/PAM chamber raw excel spreadsheets
    • Processed and cleaned excel files, respiration chamber biomass data
    • Raw rapid light curve excel files (this is duplicated from Raw log files), combined dataset pH, temperature, oxygen, salinity, velocity for experiment
    • Associated R script file for pump cycles of respiration chambers

    ####

    Physical parameters raw log files

    Raw log files:
    1) DATE=
    2) Time= UTC+11
    3) PROG= Automated program to control sensors and collect data
    4) BAT= Amount of battery remaining
    5) STEP= check aquation manual
    6) SPIES= check aquation manual
    7) PAR= Photoactive radiation
    8) Levels= check aquation manual
    9) Pumps= program for pumps
    10) WQM= check aquation manual

    ####

    Respiration/PAM chamber raw excel spreadsheets

    Abbreviations in headers of datasets. Note: two data sets are provided in different formats, raw and cleaned (adj). These are the same data with the PAR column moved over to PAR.all for analysis. All headers are the same. The cleaned (adj) data frame will work with the R syntax below; alternatively, add code to do the cleaning in R.

    Date: ISO 1986 - Check
    Time: UTC+11 unless otherwise stated
    DATETIME: UTC+11 unless otherwise stated
    ID (of instrument in respiration chambers):
    ID43= Pulse amplitude fluorescence measurement of control
    ID44= Pulse amplitude fluorescence measurement of acidified chamber
    ID=1 Dissolved oxygen
    ID=2 Dissolved oxygen
    ID3= PAR
    ID4= PAR
    PAR= Photoactive radiation (umols)
    F0= minimal fluorescence from PAM
    Fm= Maximum fluorescence from PAM
    Yield= (F0 – Fm)/Fm
    rChl= an estimate of chlorophyll (note: this is uncalibrated and is an estimate only)
    Temp= Temperature degrees C
    PAR= Photoactive radiation
    PAR2= Photoactive radiation 2
    DO= Dissolved oxygen
    %Sat= Saturation of dissolved oxygen
    Notes= the program of the underwater submersible logger, with the following abbreviations:
    Notes-1) PAM=
    Notes-2) PAM= Gain level set (see aquation manual for more detail)
    Notes-3) Acclimatisation= Program of slowly introducing treatment water into chamber
    Notes-4) Shutter start up 2 sensors+sample…= Shutter PAMs automatic set up procedure (see aquation manual)
    Notes-5) Yield step 2= PAM yield measurement and calculation of control
    Notes-6) Yield step 5= PAM yield measurement and calculation of acidified
    Notes-7) Abatus respiration DO and PAR step 1= Program to measure dissolved oxygen and PAR (see aquation manual). Steps 1-4 are different stages of this program including pump cycles, DO and PAR measurements.

    8) Rapid light curve data:
    Pre LC: A yield measurement prior to the following measurement
    After 10.0 sec at 0.5% to 8%: Level of each of the 8 steps of the rapid light curve
    Odessey PAR (only in some deployments): An extra measure of PAR (umols) using an Odessey data logger
    Dataflow PAR: An extra measure of PAR (umols) using a Dataflow sensor
    PAM PAR: This is copied from the PAR or PAR2 column
    PAR all: This is the complete PAR file and should be used
    Deployment: Identifying which deployment the data came from

    ####

    Respiration chamber biomass data

    The data is chlorophyll a biomass from cores from the respiration chambers. The headers are:
    Depth (mm)
    Treat (Acidified or control)
    Chl a (pigment and indicator of biomass)
    Core (5 cores were collected from each chamber; three were analysed for chl a). These are pseudoreplicates/subsamples from the chambers and should not be treated as replicates.

    ####

    Associated R script file for pump cycles of respiration chambers

    Associated respiration chamber data to determine the times when respiration chamber pumps delivered treatment water to chambers. Determined from Aquation log files (see associated files). Use the chamber cut times to determine net production rates. Note: Users need to avoid the times when the respiration chambers are delivering water as this will give incorrect results. The headers that get used in the attached/associated R file are start regression and end regression. The remaining headers are not used unless called for in the associated R script. The last columns of these datasets (intercept, ElapsedTimeMincoef) are determined from the linear regressions described below.

    To determine the rate of change of net production, coefficients of the regression of oxygen consumption in discrete 180 minute data blocks were determined. R squared values for fitted regressions of these coefficients were consistently high (greater than 0.9). We make two assumptions with calculation of net production rates: the first is that heterotrophic community members do not change their metabolism under OA; and the second is that the heterotrophic communities are similar between treatments.
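    As an illustration of the block-wise regressions described above (this is not the associated R script; chamber_data, DO and ElapsedTimeMin are assumed names, and the window limits correspond to the start regression and end regression columns):

    # Illustration only: fit one 180-minute block and extract the quantities
    # corresponding to the dataset's 'intercept' and 'ElapsedTimeMincoef' columns.
    block <- subset(chamber_data,
                    ElapsedTimeMin >= start_regression &
                    ElapsedTimeMin <= end_regression)

    fit <- lm(DO ~ ElapsedTimeMin, data = block)

    coef(fit)[["(Intercept)"]]      # 'intercept'
    coef(fit)[["ElapsedTimeMin"]]   # 'ElapsedTimeMincoef' (net production rate)
    summary(fit)$r.squared          # reported to be consistently > 0.9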

    ####

    Combined dataset pH, temperature, oxygen, salinity, velocity for experiment

    This data is rapid light curve data generated from a Shutter PAM fluorimeter. There are eight steps in each rapid light curve. Note: The software component of the Shutter PAM fluorimeter for sensor 44 appeared to be damaged and would not cycle through the PAR cycles. Therefore the rapid light curves and recovery curves should only be used for the control chambers (sensor ID43).

    The headers are:
    PAR: Photoactive radiation
    relETR: F0/Fm x PAR
    Notes: Stage/step of light curve
    Treatment: Acidified or control

    The associated light treatments in each stage are listed below. Each actinic light intensity is held for 10 seconds, then a saturating pulse is taken (see PAM methods).

    After 10.0 sec at 0.5% = 1 umols PAR
    After 10.0 sec at 0.7% = 1 umols PAR
    After 10.0 sec at 1.1% = 0.96 umols PAR
    After 10.0 sec at 1.6% = 4.32 umols PAR
    After 10.0 sec at 2.4% = 4.32 umols PAR
    After 10.0 sec at 3.6% = 8.31 umols PAR
    After 10.0 sec at 5.3% = 15.78 umols PAR
    After 10.0 sec at 8.0% = 25.75 umols PAR

    Note: this dataset appears to be missing data; the D5 rows may not contain usable information.

    See the word document in the download file for more information.

  10. Case Study: Cyclist

    • kaggle.com
    Updated Jul 27, 2021
    Cite
    PatrickRCampbell (2021). Case Study: Cyclist [Dataset]. https://www.kaggle.com/patrickrcampbell/case-study-cyclist/discussion
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 27, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    PatrickRCampbell
    Description

    Phase 1: ASK

    Key Objectives:

    1. Business Task * Cyclist is looking to increase their earnings, and wants to know if creating a social media campaign can influence "Casual" users to become "Annual" members.

    2. Key Stakeholders: * The main stakeholder from Cyclist is Lily Moreno, who is the Director of Marketing and responsible for the development of campaigns and initiatives to promote their bike-share program. The other teams involved with this project will be Marketing & Analytics, and the Executive Team.

    3. Business Task: * Comparing the two kinds of users and defining how they use the platform, what variables they have in common, what variables are different, and how they can get Casual users to become Annual members.

    Phase 2: PREPARE:

    Key Objectives:

    1. Determine Data Credibility * Cyclist provided data from years 2013-2021 (through March 2021), all of which is first-hand data collected by the company.

    2. Sort & Filter Data: * The stakeholders want to know how the current users are using their service, so I am focusing on using the data from 2020-2021 since this is the most relevant period of time to answer the business task.

    #Installing packages
    install.packages("tidyverse", repos = "http://cran.us.r-project.org")
    install.packages("readr", repos = "http://cran.us.r-project.org")
    install.packages("janitor", repos = "http://cran.us.r-project.org")
    install.packages("geosphere", repos = "http://cran.us.r-project.org")
    install.packages("gridExtra", repos = "http://cran.us.r-project.org")
    
    library(tidyverse)
    library(readr)
    library(janitor)
    library(geosphere)
    library(gridExtra)
    library(lubridate) #for wday(); installed with tidyverse but not attached by library(tidyverse)
    
    #Importing data & verifying the information within the dataset
    all_tripdata_clean <- read.csv("/Data Projects/cyclist/cyclist_data_cleaned.csv")
    
    glimpse(all_tripdata_clean)
    
    summary(all_tripdata_clean)
    
    

    Phase 3: PROCESS

    Key Objectives:

    1. Cleaning Data & Preparing for Analysis: * Once the data has been placed into one dataset and checked for errors, we begin cleaning the data. * Eliminating data that corresponds to the company servicing the bikes, and any ride with a traveled distance of zero. * New columns will be added to assist in the analysis, and to provide accurate assessments of who is using the bikes.

    #Eliminating any data that represents the company performing maintenance, and trips without any measurable distance
    #(note: ride_length is created in the ride-length calculation step below and must exist before this filter is applied)
    all_tripdata_clean <- all_tripdata_clean[!(all_tripdata_clean$start_station_name == "HQ QR" | all_tripdata_clean$ride_length<0),] 
    
    #Creating columns for the individual date components (the date column must be created first)
    all_tripdata_clean$date <- as.Date(all_tripdata_clean$started_at)
    all_tripdata_clean$day_of_week <- format(as.Date(all_tripdata_clean$date), "%A")
    all_tripdata_clean$day <- format(as.Date(all_tripdata_clean$date), "%d")
    all_tripdata_clean$month <- format(as.Date(all_tripdata_clean$date), "%m")
    all_tripdata_clean$year <- format(as.Date(all_tripdata_clean$date), "%Y")
    
    

    Now I will begin calculating the length of rides being taken, distance traveled, and the mean amount of time & distance.

    #Calculating the ride length in miles & minutes
    all_tripdata_clean$ride_length <- difftime(all_tripdata_clean$ended_at,all_tripdata_clean$started_at,units = "mins")
    
    all_tripdata_clean$ride_distance <- distGeo(matrix(c(all_tripdata_clean$start_lng, all_tripdata_clean$start_lat), ncol = 2), matrix(c(all_tripdata_clean$end_lng, all_tripdata_clean$end_lat), ncol = 2))
    all_tripdata_clean$ride_distance = all_tripdata_clean$ride_distance/1609.34 #converting to miles
    
    #Calculating the mean time and distance based on the user groups
    userType_means <- all_tripdata_clean %>% group_by(member_casual) %>% summarise(mean_time = mean(ride_length))
    
    
    userType_means <- all_tripdata_clean %>% 
     group_by(member_casual) %>% 
     summarise(mean_time = mean(ride_length),mean_distance = mean(ride_distance))
    

    Adding in calculations that will differentiate between bike types and which type of user is using each specific bike type.

    #Calculations
    
    with_bike_type <- all_tripdata_clean %>% filter(rideable_type=="classic_bike" | rideable_type=="electric_bike")
    
    with_bike_type %>%
     mutate(weekday = wday(started_at, label = TRUE)) %>% 
     group_by(member_casual,rideable_type,weekday) %>%
     summarise(totals=n(), .groups="drop")
     
    with_bike_type %>%
     group_by(member_casual,rideable_type) %>%
     summarise(totals=n(), .groups="drop")
    
     #Calculating the ride differential
     
     all_tripdata_clean %>% 
     mutate(weekday = wday(started_at, label = TRUE)) %>% 
     group_by(member_casual, weekday) %>% 
     summarise(number_of_rides = n()
          ,average_duration = mean(ride_length),.groups = 'drop') %>% 
     arrange(me...
    
  11. Twigstats scripts and example dataset

    • zenodo.org
    application/gzip, pdf +1
    Updated Oct 2, 2024
    Cite
    Leo Speidel; Leo Speidel (2024). Twigstats scripts and example dataset [Dataset]. http://doi.org/10.5281/zenodo.13880459
    Available download formats: application/gzip, sh, pdf
    Dataset updated
    Oct 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Leo Speidel; Leo Speidel
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This repository provides all scripts to run Relate and Twigstats on imputed ancient genomes. We also provide a complete, self-contained example dataset, but you should be able to use the exact same scripts on your own datasets as well.

    Installation

    Download

    To run this on your own dataset please download scripts.tgz and Relate_input_files.tgz.

    To run the provided example, please additionally download example_data_chr1.tgz or example_data.tgz.

    All output files that are generated by run_wg.sh are stored under results/.

    Running the scripts

    Please extract tar balls, e.g. using tar -xzvf scripts.tgz.

    The script run.sh shows how to run everything 'in order' for chromosome 1. The script run_wg.sh runs everything for the whole genome.
    You can find the individual scripts that are being called under scripts/.

    Input files

    The directory example_data_chr1 stores files for only chromosome 1, whereas example_data stores files for the whole genome.

    Under example_data/ and example_data_chr1/ you will find the following files:

    • GLIMPSE imputed vcf, here named ancients_glimpse2_chr1.bcf.
    • Modern vcf (e.g. 1000G), here named 1000GP_sub_chr1.bcf.
    • A poplabels file listing population labels for each individual. Individuals have to appear in the same order as in the merged vcf file. The file should contain four columns: ID POP GROUP SEX. The second column is used for population assignment.
    • A second poplabels file used for the MDS analysis. The second column should now list IDs of all individuals plotted in the MDS (i.e. should be identical to first column). The outgroup should be grouped together into one population.
    • File containing sample ages in generations, two lines per sample (diploid), e.g. for 3 samples of ages 0, 10, and 100 generations:
      0
      0
      10
      10
      100
      100
    • We provide all the other required Relate input files under Relate_input_files/. You can reuse these in your analysis.

    In this example, we are using data from the 1000 Genomes Project dataset (Nature 2015). We additionally use low coverage shotgun genomes from Anglo-Saxon contexts, British Iron/Roman Age, Irish Bronze Age, and the Scandinavian Early Iron Age (Cassidy et al, PNAS 2016; Martiniano et al, Nature Communications 2016; Anastasiadou et al, Communications Biology 2023; Schiffels et al Nature Communications 2016; Gretzinger et al Nature 2022; Rodriguez-Varela et al Cell 2023). These were imputed using GLIMPSE (https://odelaneau.github.io/GLIMPSE).

    Step by step guide

    Please follow run.sh (chromosome 1 only). The script run_wg.sh will run the whole genome.

    These scripts will

    1. Run scripts/1_prep_vcf.sh to filter the imputed genotypes.
    2. Then run scripts/2_prep_Relate.sh to prepare Relate input files
    3. Finally run scripts/3_run_Relate.sh to estimate genealogies

    We can use these Relate files for various analyses:

    • You can run Twigstats and infer admixture proportions using Rscript scripts/4_run_Twigstats.R.
    • You can estimate coalescence rates and population sizes using Rscript scripts/5_plot_popsize.R.
    • You can run an MDS using Rscript scripts/6_plot_MDS.R.

    To see the arguments required in each script, you can execute the script without arguments, e.g. by executing scripts/1_prep_vcf.sh or Rscript scripts/4_run_Twigstats.R.

    The expected output is shown in the attached pdf.

  12. Data from: Reference transcriptomics of porcine peripheral immune cells...

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    • +3 more
    Updated Jun 5, 2025
    Cite
    Agricultural Research Service (2025). Data from: Reference transcriptomics of porcine peripheral immune cells created through bulk and single-cell RNA sequencing [Dataset]. https://catalog.data.gov/dataset/data-from-reference-transcriptomics-of-porcine-peripheral-immune-cells-created-through-bul-e667c
    Dataset updated
    Jun 5, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    This dataset contains files reconstructing single-cell data presented in 'Reference transcriptomics of porcine peripheral immune cells created through bulk and single-cell RNA sequencing' by Herrera-Uribe & Wiarda et al. 2021. Samples of peripheral blood mononuclear cells (PBMCs) were collected from seven pigs and processed for single-cell RNA sequencing (scRNA-seq) in order to provide a reference annotation of porcine immune cell transcriptomics at enhanced, single-cell resolution. Analysis of single-cell data allowed identification of 36 cell clusters that were further classified into 13 cell types, including monocytes, dendritic cells, B cells, antibody-secreting cells, numerous populations of T cells, NK cells, and erythrocytes. Files may be used to reconstruct the data as presented in the manuscript, allowing for individual query by other users. Scripts for original data analysis are available at https://github.com/USDA-FSEPRU/PorcinePBMCs_bulkRNAseq_scRNAseq. Raw data are available at https://www.ebi.ac.uk/ena/browser/view/PRJEB43826. Funding for this dataset was also provided by NRSP8: National Animal Genome Research Program (https://www.nimss.org/projects/view/mrp/outline/18464).

    Resources in this dataset:

    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells 10X Format. File Name: PBMC7_AllCells.zip. Resource Description: Zipped folder containing PBMC counts matrix, gene names, and cell IDs. Files are as follows: matrix of gene counts* (matrix.mtx.gx), gene names (features.tsv.gz), cell IDs (barcodes.tsv.gz). *The ‘raw’ count matrix is actually gene counts obtained following ambient RNA removal. During ambient RNA removal, we specified to calculate non-integer count estimations, so most gene counts are actually non-integer values in this matrix but should still be treated as raw/unnormalized data that requires further normalization/transformation. Data can be read into R using the function Read10X().
    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells Metadata. File Name: PBMC7_AllCells_meta.csv. Resource Description: .csv file containing metadata for cells included in the final dataset. Metadata columns include: nCount_RNA = the number of transcripts detected in a cell; nFeature_RNA = the number of genes detected in a cell; Loupe = cell barcodes, corresponding to the cell IDs found in the .h5Seurat and 10X formatted objects for all cells; prcntMito = percent mitochondrial reads in a cell; Scrublet = doublet probability score assigned to a cell; seurat_clusters = cluster ID assigned to a cell; PaperIDs = sample ID for a cell; celltypes = cell type ID assigned to a cell.
    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells PCA Coordinates. File Name: PBMC7_AllCells_PCAcoord.csv. Resource Description: .csv file containing the first 100 PCA coordinates for cells.
    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells t-SNE Coordinates. File Name: PBMC7_AllCells_tSNEcoord.csv. Resource Description: .csv file containing t-SNE coordinates for all cells.
    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells UMAP Coordinates. File Name: PBMC7_AllCells_UMAPcoord.csv. Resource Description: .csv file containing UMAP coordinates for all cells.
    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - CD4 T Cells t-SNE Coordinates. File Name: PBMC7_CD4only_tSNEcoord.csv. Resource Description: .csv file containing t-SNE coordinates for only CD4 T cells (clusters 0, 3, 4, 28). A dataset of only CD4 T cells can be re-created from the PBMC7_AllCells.h5Seurat, and the t-SNE coordinates used in the publication can be re-assigned using this .csv file.
    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - CD4 T Cells UMAP Coordinates. File Name: PBMC7_CD4only_UMAPcoord.csv. Resource Description: .csv file containing UMAP coordinates for only CD4 T cells (clusters 0, 3, 4, 28). A dataset of only CD4 T cells can be re-created from the PBMC7_AllCells.h5Seurat, and the UMAP coordinates used in the publication can be re-assigned using this .csv file.
    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gamma Delta T Cells UMAP Coordinates. File Name: PBMC7_GDonly_UMAPcoord.csv. Resource Description: .csv file containing UMAP coordinates for only gamma delta T cells (clusters 6, 21, 24, 31). A dataset of only gamma delta T cells can be re-created from the PBMC7_AllCells.h5Seurat, and the UMAP coordinates used in the publication can be re-assigned using this .csv file.
    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gamma Delta T Cells t-SNE Coordinates. File Name: PBMC7_GDonly_tSNEcoord.csv. Resource Description: .csv file containing t-SNE coordinates for only gamma delta T cells (clusters 6, 21, 24, 31). A dataset of only gamma delta T cells can be re-created from the PBMC7_AllCells.h5Seurat, and the t-SNE coordinates used in the publication can be re-assigned using this .csv file.
    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gene Annotation Information. File Name: UnfilteredGeneInfo.txt. Resource Description: .txt file containing gene nomenclature information used to assign gene names in the dataset. The 'Name' column corresponds to the name assigned to a feature in the dataset.
    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells H5Seurat. File Name: PBMC7.tar. Resource Description: .h5Seurat object of all cells in the PBMC dataset. The file needs to be untarred, then read into R using the function LoadH5Seurat().
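
    For working with these files in R, a minimal sketch is shown below. It is not the authors' code: the local paths, the name of the .h5Seurat file inside PBMC7.tar, and the layout of the coordinate files are assumptions, and the Seurat and SeuratDisk packages are required.

        # Minimal sketch, not the authors' code: load the 10X-formatted counts and
        # the published per-cell metadata with Seurat/SeuratDisk. Local paths are
        # assumptions (files unzipped/untarred into the working directory).
        library(Seurat)
        library(SeuratDisk)

        # 10X-formatted counts from the unzipped PBMC7_AllCells.zip
        counts <- Read10X(data.dir = "PBMC7_AllCells/")

        # Per-cell metadata; the 'Loupe' column holds the cell barcodes
        meta <- read.csv("PBMC7_AllCells_meta.csv", row.names = "Loupe")
        pbmc <- CreateSeuratObject(counts = counts, meta.data = meta)

        # Attach the published t-SNE coordinates (first column assumed to hold the barcode)
        tsne <- read.csv("PBMC7_AllCells_tSNEcoord.csv", row.names = 1)
        colnames(tsne) <- c("tSNE_1", "tSNE_2")
        pbmc[["tsne"]] <- CreateDimReducObject(
          embeddings = as.matrix(tsne[colnames(pbmc), ]),
          key = "tSNE_", assay = DefaultAssay(pbmc))

        # Or load the fully processed object from the untarred .h5Seurat file
        # (the file name inside PBMC7.tar is an assumption):
        # pbmc <- LoadH5Seurat("PBMC7.h5Seurat")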

  13. Water Temperature of Lakes in the Conterminous U.S. Using the Landsat 8...

    • gimi9.com
    • data.usgs.gov
    • +1more
    Updated Feb 22, 2025
    + more versions
    Cite
    (2025). Water Temperature of Lakes in the Conterminous U.S. Using the Landsat 8 Analysis Ready Dataset Raster Images from 2013-2023 [Dataset]. https://gimi9.com/dataset/data-gov_water-temperature-of-lakes-in-the-conterminous-u-s-using-the-landsat-8-analysis-ready-2013
    Explore at:
    Dataset updated
    Feb 22, 2025
    Area covered
    Contiguous United States
    Description

    This data release contains lake and reservoir water surface temperature summary statistics calculated from Landsat 8 Analysis Ready Dataset (ARD) images available within the Conterminous United States (CONUS) from 2013-2023. All zip files within this data release contain nested directories using .parquet files to store the data. The file example_script_for_using_parquet.R contains example code for using the R arrow package (Richardson and others, 2024) to open and query the nested .parquet files.

    Limitations with this dataset include:

    • All biases inherent to the Landsat Surface Temperature product are retained in this dataset, which can produce unrealistically high or low estimates of water temperature. This is observed to happen, for example, in cases with partial cloud coverage over a waterbody.
    • Some waterbodies are split between multiple Landsat Analysis Ready Data tiles or orbit footprints. In these cases, multiple waterbody-wide statistics may be reported - one for each data tile. The deepest point values will be extracted and reported for the tile covering the deepest point. A total of 947 waterbodies are split between multiple tiles (see the multiple_tiles = “yes” column of site_id_tile_hv_crosswalk.csv).
    • Temperature data were not extracted from satellite images with more than 90% cloud cover.
    • Temperature data represent skin temperature at the water surface and may differ from temperature observations from below the water surface.

    Potential methods for addressing limitations with this dataset:

    • Identifying and removing unrealistic temperature estimates:
      • Calculate the total percentage of cloud pixels over a given waterbody as percent_cloud_pixels = wb_dswe9_pixels/(wb_dswe9_pixels + wb_dswe1_pixels), and filter percent_cloud_pixels by a desired percentage of cloud coverage.
      • Remove lakes with a limited number of water pixel values available (wb_dswe1_pixels < 10).
      • Filter waterbodies where the deepest point is identified as water (dp_dswe = 1).
    • Handling waterbodies split between multiple tiles:
      • These waterbodies can be identified using the "site_id_tile_hv_crosswalk.csv" file (column multiple_tiles = “yes”). A user could combine sections of the same waterbody by spatially weighting the values using the number of water pixels available within each section (wb_dswe1_pixels). This should be done with caution, as some sections of the waterbody may have data available on different dates.

    All zip files within this data release contain nested directories using .parquet files to store the data. The example_script_for_using_parquet.R contains example code for using the R arrow package to open and query the nested .parquet files.

    • "year_byscene=XXXX.zip" – includes temperature summary statistics for individual waterbodies and the deepest points (the furthest point from land within a waterbody) within each waterbody by the scene_date (when the satellite passed over). Individual waterbodies are identified by the National Hydrography Dataset (NHD) permanent_identifier included within the site_id column. Some of the .parquet files within the _byscene datasets may only include one dummy row of data (identified by tile_hv="000-000"). This happens when no tabular data are extracted from the raster images because of clouds obscuring the image, a tile that covers mostly ocean with a very small amount of land, or other possible reasons. An example file path for this dataset follows: year_byscene=2023/tile_hv=002-001/part-0.parquet
    • "year=XXXX.zip" – includes the summary statistics for individual waterbodies and the deepest points within each waterbody by the year (dataset=annual), month (year=0, dataset=monthly), and year-month (dataset=yrmon). The year_byscene=XXXX data are used as input for generating these summary tables that aggregate temperature data by year, month, and year-month. Aggregated data are not available for the following tiles: 001-004, 001-010, 002-012, 028-013, and 029-012, because these tiles primarily cover ocean with limited land, and no output data were generated. An example file path for this dataset follows: year=2023/dataset=lakes_annual/tile_hv=002-001/part-0.parquet
    • "example_script_for_using_parquet.R" – This script includes code to download zip files directly from ScienceBase, identify HUC04 basins within a desired Landsat ARD grid tile, download NHDPlus High Resolution data for visualizing, use the R arrow package to compile .parquet files in nested directories, and create example static and interactive maps.
    • "nhd_HUC04s_ingrid.csv" – This cross-walk file identifies the HUC04 watersheds within each Landsat ARD Tile grid.
    • "site_id_tile_hv_crosswalk.csv" – This cross-walk file identifies the site_id (nhdhr_{permanent_identifier}) within each Landsat ARD Tile grid. This file also includes a column (multiple_tiles) to identify site_id's that fall within multiple Landsat ARD Tile grids.
    • "lst_grid.png" – a map of the Landsat grid tiles labelled by the horizontal-vertical ID.
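
    The release's own example_script_for_using_parquet.R is the reference for working with these files; purely as an illustration of the filters described above, a minimal R sketch using the arrow and dplyr packages follows (paths assume year_byscene=2023.zip has been unzipped into the working directory, and the cloud threshold is arbitrary).

        # Minimal sketch (not the release's example script): open the nested
        # .parquet directories with arrow and apply the cloud/water-pixel filters
        # described above.
        library(arrow)
        library(dplyr)

        lst <- open_dataset("year_byscene=2023/")

        clean <- lst |>
          mutate(percent_cloud_pixels = wb_dswe9_pixels /
                   (wb_dswe9_pixels + wb_dswe1_pixels)) |>
          filter(percent_cloud_pixels < 0.5,  # drop mostly cloudy scenes (illustrative threshold)
                 wb_dswe1_pixels >= 10,       # require at least 10 water pixels
                 dp_dswe == 1) |>             # deepest point classified as water
          collect()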

  14. Eye Image Dataset

    • kaggle.com
    Updated Apr 1, 2025
    Cite
    Sumit R Washimkar (2025). Eye Image Dataset [Dataset]. https://www.kaggle.com/datasets/sumit17125/eye-image-dataset
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sumit R Washimkar
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Right Eye Disease Classification Dataset

    Introduction

    This dataset consists of right eye images along with a CSV file containing image names and corresponding disease labels. It is designed for disease classification tasks using deep learning and computer vision techniques.

    Dataset Information

    • The dataset contains right eye images captured from various individuals.
    • The accompanying CSV file includes the image filename and the disease label.
    • Additional columns provide relevant metadata or medical attributes.

    CSV File Columns

    • Image Name: The filename of the corresponding right eye image.
    • Disease Labels:
      • N: Normal (No Disease)
      • D: Diabetic Retinopathy
      • G: Glaucoma
      • C: Cataract
      • A: Age-Related Macular Degeneration
      • H: Hypertensive Retinopathy
      • M: Myopia
      • O: Other Eye Diseases
    • Additional columns may include patient details (if available), image capture conditions, or severity levels.
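
    For orientation, a minimal R sketch for loading the label CSV and tallying the disease codes above is given below; the CSV file name, the exact column headers, and the one-indicator-column-per-code layout are assumptions, not part of the dataset description.

        # Minimal sketch; file name, column headers, and 0/1 indicator layout are assumptions.
        labels <- read.csv("right_eye_labels.csv", stringsAsFactors = FALSE)

        # Count images flagged for each disease code (N, D, G, C, A, H, M, O).
        disease_cols <- intersect(c("N", "D", "G", "C", "A", "H", "M", "O"), names(labels))
        colSums(labels[disease_cols])

        # Build full image paths for model training; the image folder name is assumed.
        labels$path <- file.path("images", labels[["Image.Name"]])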

    Possible Use Cases

    • Deep Learning for Medical Imaging: Training CNN models for automated disease classification.
    • Image Processing & Feature Extraction: Analyzing retinal features for disease detection.
    • Transfer Learning & Fine-Tuning: Using pre-trained models (e.g., ResNet, VGG) for improving classification performance.
    • Medical AI Research: Developing AI-driven solutions for ophthalmology.

    Acknowledgments

    This dataset is designed for medical AI research and educational purposes. Proper handling of medical data is advised.

  15. Cryptocurrency Discussion Sentiment Dataset

    • opendatabay.com
    Updated Jul 8, 2025
    Cite
    Datasimple (2025). Cryptocurrency Discussion Sentiment Dataset [Dataset]. https://www.opendatabay.com/data/financial/a778f43f-65e2-4c2e-9e10-ab4ed0e47518
    Explore at:
    Available download formats
    Dataset updated
    Jul 8, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Area covered
    Data Science and Analytics
    Description

    This dataset provides insights into public opinion regarding Bitcoin, derived from comments posted on the /r/Bitcoin subreddit during June 2022 [1, 2]. It is designed to help users track current trends and developments within the cryptocurrency world [2]. The data includes the actual body text of the comments, alongside their assigned sentiment, making it a valuable resource for understanding the evolving landscape of Bitcoin [1, 2].

    Columns

    The dataset includes several key columns for each comment:

    • type: Describes the type of post, stored as a String [1-3].
    • subreddit.name: The name of the subreddit, which is "/r/Bitcoin" in this case, stored as a String [1-3].
    • subreddit.nsfw: Indicates whether the subreddit is Not Safe For Work (NSFW), a Boolean value [1-4]. The sources indicate that almost all entries (170,032 out of 170,036) are marked as 'false' for NSFW [4].
    • created_utc: The timestamp when the post was created, allowing for chronological analysis [1-8].
    • permalink: The permanent link to the original post or comment on Reddit, a String [1-3].
    • score: The score of the post, an Integer value, typically reflecting upvotes or downvotes [1, 2].
    • body: The main text content of the comment, stored as a String [1-3]. Notably, about 7% of comments are "[removed]" and 3% are "[deleted]" [8].
    • sentiment: The assigned sentiment of the post, a String. This column also appears to have numerical values ranging from -1.00 (most negative) to 1.00 (most positive), with detailed label counts across various ranges [1, 3, 8-10]. A significant portion of comments, 32,903, fall into the -0.04 to 0.00 sentiment range [9].
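
    As a quick illustration of working with these columns, a minimal R sketch follows (the CSV file name is an assumption); it drops removed/deleted comments, parses the UNIX timestamps, and summarises sentiment by day.

        # Minimal sketch; the file name is an assumption, column names follow the
        # description above (body, created_utc, sentiment).
        library(dplyr)

        comments <- read.csv("bitcoin_comments_june2022.csv", stringsAsFactors = FALSE)

        daily <- comments |>
          filter(!body %in% c("[removed]", "[deleted]")) |>
          mutate(date = as.Date(as.POSIXct(created_utc, origin = "1970-01-01", tz = "UTC"))) |>
          group_by(date) |>
          summarise(n_comments = n(),
                    mean_sentiment = mean(as.numeric(sentiment), na.rm = TRUE))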

    Distribution

    This dataset focuses on comments from the /r/Bitcoin subreddit from June 2022 [1, 2]. It contains approximately 170,035 unique comment entries [4]. The timestamps for created_utc are distributed across June 2022, with varying numbers of comments per time interval, for example, 12,392 comments were recorded between 1655544958.04 and 1655596797.80 [6]. The sentiment analysis is detailed across numerous bins, showing a wide spread of positive, negative, and neutral sentiments [8-10].

    Usage

    This dataset is ideal for data science and analytics [2]. Potential uses include:

    • Tracking cryptocurrency trends: Staying up-to-date with the latest developments in Bitcoin [2].
    • Sentiment analysis: Analysing public opinion and sentiment towards Bitcoin over time [1].
    • Natural Language Processing (NLP) research: Utilising the comment body text for linguistic analysis [2].
    • Market research: Understanding community discussions and concerns related to Bitcoin.
    • Time-series analysis: Observing how sentiment and discussion volume change over the month of June 2022.

    Coverage

    The dataset covers content from the Reddit /r/Bitcoin subreddit [1, 2].

    • Time Range: Specifically the month of June 2022 [1, 2].
    • Geographic Scope: While Reddit is global, the specific geographic origin of users is not detailed in the dataset columns. However, it can be considered a global snapshot of online discussion [11].
    • Demographic Scope: Reflects the opinions and discussions of Reddit users who actively participate in the /r/Bitcoin subreddit.

    License

    CC0

    Who Can Use It

    • Data Scientists and Analysts: For conducting sentiment analysis, trend tracking, and NLP projects [2].
    • Researchers: Studying online communities, cryptocurrency market dynamics, and public discourse.
    • Cryptocurrency Enthusiasts and Investors: To gain insights into community perception and market sentiment.
    • Developers: To train and test NLP models related to financial or cryptocurrency text.

    Dataset Name Suggestions

    • Bitcoin Subreddit Comments: June 2022 Sentiment Analysis
    • Reddit r/Bitcoin Public Opinion Data (June 2022)
    • Cryptocurrency Discussion Sentiment Dataset

    Attributes

    Original Data Source: Viral Fads and Cryptocurrency

  16. Reddit Conversations

    • kaggle.com
    Updated Mar 4, 2020
    Cite
    Jerry Qu (2020). Reddit Conversations [Dataset]. https://www.kaggle.com/jerryqu/reddit-conversations/kernels
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 4, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Jerry Qu
    Description

    Context

    I've been looking for an open-domain Conversational dataset for training chatbots. I was inspired by the work done by Google Brain in 'Towards a Human-like Open-Domain Chatbot'. While Transformers/BERT are trained on all of Wikipedia, chatbots need a dataset based on conversations.

    Content

    This data came from Reddit posts/comments under the r/CasualConversation subreddit. The conversations under this subreddit were significantly more 'conversation like' when compared to other subreddits (Ex. r/AskReddit). I'm currently looking for other subreddits to scrape.

    This dataset consists of 3 columns, where each row is a Length-3 conversation. For example:

    0 - What kind of phone(s) do you guys have?
    1 - I have a pixel. It's pretty great. Much better than what I had before.
    2 - Does it really charge all the way in 15 min?

    This data was collected between 2016-12-29 and 2019-12-31

    Furthermore, I have the full comment trees (stored as Python dictionaries), which were an intermediate step in creating this dataset. I plan to add more data in the future. (Ex. Longer sequence lengths, other subreddits)
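
    A minimal R sketch for loading the table and printing one three-turn conversation is shown below; the CSV file name and column order are assumptions.

        # Minimal sketch; file name and column order are assumptions.
        convos <- read.csv("casual_conversation_3turn.csv", stringsAsFactors = FALSE)

        # Each row holds one three-turn conversation; print the first row as turns 0-2.
        cat(paste0(0:2, " - ", unlist(convos[1, 1:3])), sep = "\n")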

    Acknowledgements / License

    Data was collected using Pushshift's API. https://pushshift.io/

    Licensing is currently unclear. Reddit does not appear to state a clear licensing agreement, and Pushshift does not appear to apply one either.

    Inspiration

    1. Create an open-domain chatbot (Ex. Meena)
    2. I'd love to see how you can represent types of conversations and cluster them. This would be monumentally helpful in collecting more data. (Ex. AskReddit conversations don't resemble typical Person-To-Person conversations. How would you identify Person-To-Person-esque conversations? Perhaps cosine similarity between word-embeddings? Or between sentence-embeddings of POS tags may be very interesting.)
  17. Data on how honeybee host brood traits influence Varroa destructor...

    • researchdata.se
    Updated Mar 26, 2024
    Cite
    Nicholas Scaramella; Ashley Burke; Melissa Oddie; Barbara Locke (2024). Data on how honeybee host brood traits influence Varroa destructor reproduction [Dataset]. http://doi.org/10.5878/znc2-9b12
    Explore at:
    (4974), (7698), (6945), (3942), (8397), (2887), (3139), (3690), (371), (3724). Available download formats
    Dataset updated
    Mar 26, 2024
    Dataset provided by
    Swedish University of Agricultural Sciences
    Authors
    Nicholas Scaramella; Ashley Burke; Melissa Oddie; Barbara Locke
    Time period covered
    Jun 2019 - Sep 2021
    Description

    The data set was collected in Uppsala, Sweden between 2019 and 2021. Hives were established using varroa-resistant queens from Oslo, Norway (n = 3), Gotland, Sweden (n = 5), and Avignon, France (n = 4), with a varroa-susceptible population from Uppsala, Sweden (n = 5) as control. All hives were located at the SLU Lövsta research station (GPS coordinates: 59° 50’ 2.544”N, 17° 48’ 47.447”E). Varroa destructor mite reproductive success was measured on frames with adult honeybee workers exposed to, and excluded from, access to honeybee larvae. Excluders were added directly after brood capping, and frames were dissected nine days later. Cell caps were removed using a scalpel, with the pupae and mite families carefully removed from the cell using forceps and a fine paint brush. Mite reproductive success was calculated by counting successful reproduction attempts, defined as a mite that successfully produced one male and at least one female offspring. If a mite did not meet this requirement, it was considered a failed reproduction attempt and the reason for failure was documented. All data were analyzed in R version 4.0.1 using RStudio 1.3.959. A linear mixed-effect model was used with mite reproductive success as the response variable, population origin and excluder treatment as independent variables, and colony and year as random-effect variables to compare treatments within each population as well as fecundity. Least-square means of the model were used to compare treatments between individual populations.
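
    A minimal sketch of the kind of model described above, using the lme4 and emmeans packages, is given below; the column names are assumptions, and the authors' actual analysis is the attached Scaramella_et_al_2023_Analysis_Code.R.

        # Minimal sketch only; the published analysis is in
        # Scaramella_et_al_2023_Analysis_Code.R. Column names here are assumptions.
        library(lme4)
        library(emmeans)

        dat <- read.delim("Scaramella_et_al_2023_Data.tsv")

        # Mite reproductive success against population origin and excluder treatment,
        # with colony and year as random intercepts.
        m <- lmer(reproductive_success ~ population * treatment +
                    (1 | colony) + (1 | year), data = dat)

        # Least-square (estimated marginal) means comparing treatments within populations.
        emmeans(m, pairwise ~ treatment | population)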

    Scaramella_et_al_2023_Data.tsv - Data set consisting of 34 rows and 21 columns. Colony demographics and designated treatment are listed. All data collected are count data and are explained in more detail in the read-me file. The R script used in the analysis is attached. It is split into two sections: the first is used for the statistical analysis and the second for creating the plots used in the paper. The sections are delineated by the titles SECTION 1 - ANALYSIS and SECTION 2 - PLOTS.

    Provided that the script is in the same directory as the data files and the needed R packages are installed (see sessionInfo.txt), the output Scaramella_et_al_2023_Analysis_Code_log.txt and the plot file Rplots.pdf can be reproduced by running: Rscript Scaramella_et_al_2023_Analysis_Code.R > Scaramella_et_al_2023_Analysis_Code_log.txt

    Scaramella_et_al_2023_Bar_Graph_Data.tsv - Data set consisting of 8 rows and 5 columns. Colony demographics and designated treatment are listed. All data are generated from the count data in Scaramella_et_al_2023_Data.tsv and are explained in more detail in the read-me file.

    Scaramella_et_al_2023_Stacked_Bar_Graph_Data.tsv - Data set consisting of 102 rows and 8 columns. Colony demographics and designated treatment are listed. The data are Scaramella_et_al_2023_Data.tsv restructured to include the reason for failure as a column. The data are explained in more detail in the read-me file.

  18. Examples of CARE-related Activities Carried out by Repositories, in...

    • portal.edirepository.org
    csv, pdf
    Updated Mar 13, 2024
    + more versions
    Cite
    Ruth Duerr (2024). Examples of CARE-related Activities Carried out by Repositories, in Sequences or Groups [Dataset]. http://doi.org/10.6073/pasta/1b812b3bd296d23c4c7c54eb022774fc
    Explore at:
    pdf (63891 bytes), csv (7273 bytes). Available download formats
    Dataset updated
    Mar 13, 2024
    Dataset provided by
    EDI
    Authors
    Ruth Duerr
    Time period covered
    2020 - 2023
    Variables measured
    Trigger, Outreach, Technical, Repository Protocols, Situational Awareness
    Description

    This dataset is designed to accompany the paper submitted to Data Science Journal: O'Brien et al, "Earth Science Data Repositories: Implementing the CARE Principles". This dataset shows examples of activities that data repositories are likely to undertake as they implement the CARE principles. These examples were constructed as part of a discussion about the challenges faced by data repositories when acquiring, curating, and disseminating data and other information about Indigenous Peoples, communities, and lands. For clarity, individual repository activities were very specific. However, in practice, repository activities are not carried out singly, but are more likely to be performed in groups or in sequence. This dataset shows examples of how activities are likely to be combined in response to certain triggers. See related dataset O'Brien, M., R. Duerr, R. Taitingfong, A. Martinez, L. Vera, L. Jennings, R. Downs, E. Antognoli, T. ten Brink, N. Halmai, S.R. Carroll, D. David-Chavez, M. Hudson, and P. Buttigieg. 2024. Alignment between CARE Principles and Data Repository Activities. Environmental Data Initiative. https://doi.org/10.6073/pasta/23e699ad00f74a178031904129e78e93 (Accessed 2024-03-13), and the paper for more information about development of the activities and their categorization, raw data of relationships between specific activities and a discussion of the implementation of CARE Principles by data repositories.

    Data in this table are organized into groups delineated by a triggering event in the first column. For example, the first group consists of 9 rows, while the second group has 7 rows. The first row of each group contains the event that triggers the set of actions described in the last 4 columns of the spreadsheet. Within each group, the associated rows in each column are given in numerical, not temporal, order, since activities will likely vary widely from repository to repository.

    For example, the first group of rows is about what likely needs to happen if a repository discovers that it holds Indigenous data (O6). Clearly, it will need to develop processes to identify communities to engage (R6) as well as processes for contacting those communities (R7) (if it doesn't already have them). It will also probably need to review and possibly update its data management policies to ensure that they are justifiable (R2). Based on these actions, it is likely that the repository's outreach group needs to prepare for working with more communities (O3), including ensuring that the repository's governance protocols are up-to-date and publicized (O5) and that the repository practices are transparent (O4). If initial contacts go well, it is likely that the repository will need ongoing engagement with the community or communities (S1). This may include adding representation to the repository's advisory board (O2); clarifying data usage with the communities (O9); facilitating relationships between data providers and communities (O1); working with the community to identify educational opportunities (O10); and sharing data with them (O8). It may also become necessary to liaise with whomever is maintaining the vocabularies in use at the repository (O7).
    
  19. HyG: A hydraulic geometry dataset derived from historical stream gage...

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Feb 26, 2024
    Cite
    Thomas L. Enzminger; J. Toby Minear; Ben Livneh; Thomas L. Enzminger; J. Toby Minear; Ben Livneh (2024). HyG: A hydraulic geometry dataset derived from historical stream gage measurements across the conterminous United States [Dataset]. http://doi.org/10.5281/zenodo.10425392
    Explore at:
    csv. Available download formats
    Dataset updated
    Feb 26, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Thomas L. Enzminger; J. Toby Minear; Ben Livneh; Thomas L. Enzminger; J. Toby Minear; Ben Livneh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United States, Contiguous United States
    Description

    Regional- and continental-scale models predicting variations in the magnitude and timing of streamflow are important tools for forecasting water availability as well as flood inundation extent and associated damages. Such models must define the geometry of stream channels through which flow is routed. These channel parameters, such as width, depth, and hydraulic resistance, exhibit substantial variability in natural systems. While hydraulic geometry relationships have been extensively studied in the United States, they remain unquantified for thousands of stream reaches across the country. Consequently, large-scale hydraulic models frequently take simplistic approaches to channel geometry parameterization. Over-simplification of channel geometries directly impacts the accuracy of streamflow estimates, with knock-on effects for water resource and hazard prediction.

    Here, we present a hydraulic geometry dataset derived from long-term measurements at U.S. Geological Survey (USGS) stream gages across the conterminous United States (CONUS). This dataset includes (a) at-a-station hydraulic geometry parameters following the methods of Leopold and Maddock (1953), (b) at-a-station Manning's n calculated from the Manning equation, (c) daily discharge percentiles, and (d) downstream hydraulic geometry regionalization parameters based on HUC4 (Hydrologic Unit Code 4). This dataset is referenced in Heldmyer et al. (2022); further details and implications for CONUS-scale hydrologic modeling are available in that article (https://doi.org/10.5194/hess-26-6121-2022).

    At-a-station Hydraulic Geometry

    We calculated hydraulic geometry parameters using historical USGS field measurements at individual station locations. Leopold and Maddock (1953) derived the following power law relationships:

    \(w={aQ^b}\)

    \(d=cQ^f\)

    \(v=kQ^m\)

    where Q is discharge, w is width, d is depth, v is velocity, and a, b, c, f, k, and m are at-a-station hydraulic geometry (AHG) parameters. We downloaded the complete record of USGS field measurements from the USGS NWIS portal (https://waterdata.usgs.gov/nwis/measurements). This raw dataset includes 4,051,682 individual measurements from a total of 66,841 stream gages within CONUS. Quantities of interest in AHG derivations are Q, w, d, and v. USGS field measurements do not include d--we therefore calculated d using d=A/w, where A is measured channel area. We applied the following quality control (QC) procedures in order to ensure the robustness of AHG parameters derived from the field data:

    1. We considered only measurements which reported Q, v, w and A.
    2. For each gage, we excluded measurements older than the most recent five years, so as to minimize the effects of long-term channel evolution on observed hydraulic geometry relationships.
    3. We excluded gages for which measured Q disagreed with the product of measured velocity and measured area by more than 5%. Gages for which \(Q \neq vA\) are often tidally influenced and therefore may not conform to expected channel geometry relationships.
    4. Q, v, w, and d from field measurements at each gage were log-transformed. We performed robust linear regressions on the relationships between log(Q) and log(w), log(v), and log(d). AHG parameters were derived from the regressed explanatory variables.
      1. We applied an iterative outlier detection procedure to the linear regression residuals. Values of log-transformed w, v, and d residuals falling outside a three median absolute deviation (MAD) envelope were excluded. Regression coefficients were recalculated and the outlier detection procedure was reapplied until no new outliers were detected.
      2. Gages for which one or more regression had p-values >0.05 were excluded, as the relationships between log-transformed Q and w, v, or d lacked statistical significance.
      3. Gages were omitted if regressed AHG parameters did not fulfill two additional relationships derived by Leopold and Maddock: \(b+f+m=1 \pm 0.1\) and \(a \times c \times k = 1 \pm 0.1\).
    5. If the number of field measurements for a given gage was less than 10, either initially or after individual measurements were removed via steps 1-4, the gage was excluded from further analysis.

    Application of the QC procedures described above removed 55,328 stream gages, many of which were short-term campaign gages at which very few field measurements had been recorded. We derived AHG parameters for the remaining 11,513 gages which passed our QC.
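
    As a rough illustration of step 4 above for a single gage, a minimal R sketch of the log-log fit with iterative three-MAD outlier screening follows; an ordinary lm() stands in for the robust regression used in the actual analysis, and the input column names are assumptions.

        # Minimal sketch of step 4, shown for channel width; the same pattern applies
        # to depth and velocity. Not the authors' code.
        fit_ahg_width <- function(Q, w) {
          d <- data.frame(logQ = log(Q), logw = log(w))
          repeat {
            fit <- lm(logw ~ logQ, data = d)
            res <- residuals(fit)
            out <- abs(res - median(res)) > 3 * mad(res, constant = 1)  # 3-MAD envelope
            if (!any(out)) break
            d <- d[!out, ]                                              # drop outliers and refit
          }
          c(a = exp(coef(fit)[[1]]), b = coef(fit)[[2]])                # w = a * Q^b
        }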

    At-a-station Manning's n

    We calculated hydraulic resistance at each gage location by solving Manning's equation for Manning's n, given by

    \(n = {{R^{2/3}S^{1/2}} \over v}\)

    where v is velocity, R is hydraulic radius and S is longitudinal slope. We used smoothed reach-scale longitudinal slopes from the NHDPlusv2 (National Hydrography Dataset Plus, version 2) ElevSlope data product. We note that NHDPlusv2 contains a minimum slope constraint of \(10^{-5}\) m/m--no reach may have a slope less than this value. Furthermore, NHDPlusv2 lacks slope values for certain reaches. As such, we could not calculate Manning's n for every gage, and some Manning's n values we report may be inaccurate due to the NHDPlusv2 minimum slope constraint. We report two Manning's n values, both of which take stream depth as an approximation for R. The first takes the median stream depth and velocity measurements from the USGS's database of manual flow measurements for each gage. The second uses stream depth and velocity calculated for a 50th percentile discharge (Q50; see below). Approximating R as stream depth is an assumption which is generally considered valid if the width-to-depth ratio of the stream is greater than 10, which was the case for the vast majority of field measurements. Thus, we report two Manning's n values for each gage, which are each intended to approximately represent median flow conditions.
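
    The calculation itself is a direct transcription of the equation; a minimal R version follows (inputs assumed in SI units, with depth standing in for R as described).

        # Manning's n from Manning's equation, taking stream depth d as the hydraulic
        # radius R (reasonable when width/depth > 10, as noted above).
        # Assumed SI units: d [m], S [m/m], v [m/s].
        mannings_n <- function(d, S, v) d^(2 / 3) * sqrt(S) / v

        mannings_n(d = 1.2, S = 5e-4, v = 0.8)  # illustrative values only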

    Daily discharge percentiles

    We downloaded full daily discharge records from 16,947 USGS stream gages through the NWIS online portal. The data includes records from both operational and retired gages. Records for operational gages were truncated at the end of the 2018 water year (September 30, 2018) in order to avoid use of preliminary data. To ensure the robustness of daily discharge percentiles, we applied the following QC:

    1. For a given gage, we removed blocks of missing discharge values longer than 6 months. These long blocks of missing data generally correspond to intervals in which a gage was temporarily decommissioned for maintenance.
    2. A gage was omitted from further analysis if its discharge record was less than 10 years (3,652 days) long, and/or less than 90% complete (>10% missing values after removal of long blocks in step 1).

    We calculated discharge percentiles for each of the 10,871 gages which passed QC. Discharge percentiles were calculated at increments of 1% between Q1 and Q5, increments of 5% (e.g. Q10, Q15, Q20, etc.) between Q5 and Q95, increments of 1% between Q95 and Q99, and increments of 0.1% between Q99 and Q100 in order to provide higher resolution at the lowest and highest flows, which occur much less frequently.
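
    The percentile grid described above can be written out directly in R; the sketch below uses placeholder discharge data for illustration only.

        # Percentile grid: 1% steps for Q1-Q5 and Q95-Q99, 5% steps for Q5-Q95,
        # and 0.1% steps for Q99-Q100, matching the increments described above.
        probs <- sort(unique(round(c(seq(0.01, 0.05, by = 0.01),
                                     seq(0.05, 0.95, by = 0.05),
                                     seq(0.95, 0.99, by = 0.01),
                                     seq(0.99, 1.00, by = 0.001)), 3)))

        # q_daily stands in for one gage's QC-passed daily discharge record.
        q_daily <- rlnorm(4000, meanlog = 3, sdlog = 1)  # placeholder data, illustrative only
        quantile(q_daily, probs = probs, na.rm = TRUE)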

    HG Regionalization

    We regionalized AHG parameters from gage locations to all stream reaches in the conterminous United States. This downstream hydraulic geometry regionalization was performed using all gages with AHG parameters in each HUC4, as opposed to traditional downstream hydraulic geometry--which involves interpolation of parameters of interest to ungaged reaches on individual streams. We performed linear regressions on log-transformed drainage area and Q at a number of flow percentiles as follows:

    \(\log(Q_i) = \beta_1 \log(DA) + \beta_0\)

    where Qi is streamflow at percentile i, DA is drainage area and \(\beta_1\) and \(\beta_0\) are regression parameters. We report \(\beta_1\), \(\beta_0\) , and the r2 value of the regression relationship for Q percentiles Q10, Q25, Q50, Q75, Q90, Q95, Q99, and Q99.9. Further discussion and additional analysis of HG regionalization are presented in Heldmyer et al. (2022).
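
    For a single HUC4 and flow percentile, the regression above reduces to a one-line fit in R; the gage table below is illustrative placeholder data, not values from the dataset.

        # Regionalization fit for one HUC4 and one percentile (Q50 shown).
        gages <- data.frame(drainage_area = c(120, 540, 2300, 8800),  # placeholder values
                            Q50           = c(1.1, 4.2, 16.0, 55.0))

        fit <- lm(log(Q50) ~ log(drainage_area), data = gages)
        coef(fit)               # beta_0 (intercept) and beta_1 (slope)
        summary(fit)$r.squared  # r2 as reported in the dataset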

    Dataset description

    We present the HyG dataset in a comma-separated value (csv) format. Each row corresponds to a different USGS stream gage. Information in the dataset includes gage ID (column 1), gage location in latitude and longitude (columns 2-3), gage drainage area (from USGS; column 4), longitudinal slope of the gage's stream reach (from NHDPlusv2; column 5), AHG parameters derived from field measurements (columns 6-11), Manning's n calculated from median measured flow conditions (column 12), Manning's n calculated from Q50 (column 13), Q percentiles (columns 14-51), HG regionalization parameters and r2 values (columns 52-75), and geospatial information for the HUC4 in which the gage is located (from USGS; columns 76-87). Users are advised to exercise caution when opening the dataset. Certain software, including Microsoft Excel and Python, may drop the leading zeros in USGS gage IDs and HUC4 IDs if these columns are not explicitly imported as strings.
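
    In R, the leading-zero issue can be avoided by forcing every column to character on import and converting numeric columns afterwards; the file and column names below are assumptions.

        # Read HyG keeping USGS gage IDs and HUC4 IDs as character strings so that
        # leading zeros are preserved; file and column names are assumptions.
        hyg <- read.csv("HyG.csv", colClasses = "character")
        hyg$drain_area <- as.numeric(hyg$drain_area)  # convert numeric columns back as needed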

    Errata

    In version 1, drainage area was mistakenly reported in cubic meters but labeled in cubic kilometers. This error has been corrected in version 2.

  20. Data from: A FAIR and modular image-based workflow for knowledge discovery...

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, png, txt +1
    Updated Jul 11, 2024
    Cite
    Meghan Balk; Meghan Balk; Thibault Tabarin; Thibault Tabarin; John Bradley; John Bradley; Hilmar Lapp; Hilmar Lapp (2024). Data from: A FAIR and modular image-based workflow for knowledge discovery in the emerging field of imageomics [Dataset]. http://doi.org/10.5281/zenodo.8233380
    Explore at:
    csv, png, xml, txt, bin. Available download formats
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Meghan Balk; Meghan Balk; Thibault Tabarin; Thibault Tabarin; John Bradley; John Bradley; Hilmar Lapp; Hilmar Lapp
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data and results from the Imageomics Workflow. These include data files from the Fish-AIR repository (https://fishair.org/) for purposes of reproducibility and outputs from the application-specific imageomics workflow contained in the Minnow_Segmented_Traits repository (https://github.com/hdr-bgnn/Minnow_Segmented_Traits).

    Fish-AIR:
    This is the dataset downloaded from Fish-AIR, filtering for Cyprinidae and the Great Lakes Invasive Network (GLIN) from the Illinois Natural History Survey (INHS) dataset. These files contain information about fish images, fish image quality, and path for downloading the images. The data download ARK ID is dtspz368c00q. (2023-04-05). The following files are unaltered from the Fish-AIR download. We use the following files:

    extendedImageMetadata.csv: A CSV file containing information about each image file. It has the following columns: ARKID, fileNameAsDelivered, format, createDate, metadataDate, size, width, height, license, publisher, ownerInstitutionCode. Column definitions are defined https://fishair.org/vocabulary.html and the persistent column identifiers are in the meta.xml file.

    imageQualityMetadata.csv: A CSV file containing information about the quality of each image. It has the following columns: ARKID, license, publisher, ownerInstitutionCode, createDate, metadataDate, specimenQuantity, containsScaleBar, containsLabel, accessionNumberValidity, containsBarcode, containsColorBar, nonSpecimenObjects, partsOverlapping, specimenAngle, specimenView, specimenCurved, partsMissing, allPartsVisible, partsFolded, brightness, uniformBackground, onFocus, colorIssue, quality, resourceCreationTechnique. Column definitions are defined https://fishair.org/vocabulary.html and the persistent column identifiers are in the meta.xml file.

    multimedia.csv: A CSV file containing information about image downloads. It has the following columns: ARKID, parentARKID, accessURI, createDate, modifyDate, fileNameAsDelivered, format, scientificName, genus, family, batchARKID, batchName, license, source, ownerInstitutionCode. Column definitions are defined https://fishair.org/vocabulary.html and the persistent column identifiers are in the meta.xml file.

    meta.xml: An XML file with the metadata about the column indices and URIs for each file contained in the original downloaded zip file. This file is used in the fish-air.R script to extract the indices for column headers.
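
    As a rough sketch of how these tables relate (this is not the repository's fish-air.R script), the three CSVs can be joined on ARKID in R:

        # Minimal sketch: join the Fish-AIR tables on ARKID to pair download paths
        # with image and quality metadata.
        library(dplyr)

        multimedia <- read.csv("multimedia.csv", stringsAsFactors = FALSE)
        quality    <- read.csv("imageQualityMetadata.csv", stringsAsFactors = FALSE)
        extended   <- read.csv("extendedImageMetadata.csv", stringsAsFactors = FALSE)

        images <- multimedia |>
          left_join(extended, by = "ARKID", suffix = c("", ".image")) |>
          left_join(quality,  by = "ARKID", suffix = c("", ".quality"))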

    The outputs from the Minnow_Segmented_Traits workflow are:

    sampling.df.seg.csv: Table with tallies of the sampling of image data per species during the data cleaning and data analysis. This is used in Table S1 in Balk et al.

    presence.absence.matrix.csv: The Presence-Absence matrix from segmentation, not cleaned. This is the result of the combined outputs from the presence.json files created by the rule “create_morphological_analysis”. The cleaned version of this matrix is shown as Table S3 in Balk et al.

    heatmap.avg.blob.png and heatmap.sd.blob.png: Heatmaps of average area of biggest blob per trait (heatmap.avg.blob.png) and standard deviation of area of biggest blob per trait (heatmap.sd.blob.png). These images are also in Figure S3 of Balk et al.

    minnow.filtered.from.iqm.csv: Fish image data set after filtering (see methods in Balk et al. for filter categories).

    burress.minnow.sp.filtered.from.iqm.csv: Fish image data set after filtering and selecting species from Burress et al. 2017.
