The USDA Agricultural Research Service (ARS) recently established SCINet, which consists of a shared high-performance computing resource, Ceres, and the dedicated high-speed Internet2 network used to access Ceres. Current and potential SCINet users are using and generating very large datasets, so SCINet needs to be provisioned with adequate data storage for their active computing; it is not designed to hold data beyond active research phases. At the same time, the National Agricultural Library has been developing the Ag Data Commons, a research data catalog and repository designed for public data release and professional data curation. Ag Data Commons needs to anticipate the size and nature of data it will be tasked with handling.

The ARS Web-enabled Databases Working Group, organized under the SCINet initiative, conducted a study to establish baseline data storage needs and practices, and to make projections that could inform future infrastructure design, purchases, and policies. The working group helped develop the survey on which an internal report is based. While the report was for internal use, the survey and resulting data may be generally useful and are being released publicly.

From October 24 to November 8, 2016, we administered a 17-question survey (Appendix A) by emailing a Survey Monkey link to all ARS Research Leaders, intending to cover the data storage needs of all 1,675 SY (Category 1 and Category 4) scientists. We designed the survey to accommodate either individual researcher responses or group responses. Research Leaders could decide, based on their unit's practices or their management preferences, whether to delegate the response to a data management expert in their unit, to forward it to all members of their unit, or to collate responses from their unit themselves before reporting in the survey.
Larger storage ranges cover vastly different amounts of data, so the implications here could be significant depending on whether the true amount is at the lower or higher end of the range. We therefore requested more detail from "Big Data users," the 47 respondents who indicated they had more than 10 to 100 TB or over 100 TB of total current data (Q5); all other respondents are called "Small Data users." Because not all of these follow-up requests were successful, we used the actual follow-up responses to estimate likely responses for those who did not respond. We defined active data as data that would be used within the next six months; all other data were considered inactive, or archival. To calculate per-person storage needs we used the high end of the reported range divided by 1 for an individual response, or by G, the number of individuals in a group response. For Big Data users we used the actual reported values or estimated likely values.

Resources in this dataset:

Resource Title: Appendix A: ARS data storage survey questions.
File Name: Appendix A.pdf
Resource Description: The full list of questions asked, with the possible responses. The survey was not administered using this PDF; the PDF was generated directly from the administered survey using the Print option under Design Survey. Asterisked questions were required. A list of Research Units and their associated codes was provided in a drop-down not shown here.
Resource Software Recommended: Adobe Acrobat, url: https://get.adobe.com/reader/

Resource Title: CSV of Responses from ARS Researcher Data Storage Survey.
File Name: Machine-readable survey response data.csv
Resource Description: CSV file of raw responses from the administered survey, as downloaded unfiltered from Survey Monkey, including incomplete responses. Also includes additional classification and calculations to support analysis. Individual email addresses and IP addresses have been removed. This is the same data as in the Excel spreadsheet (also provided).

Resource Title: Responses from ARS Researcher Data Storage Survey.
File Name: Data Storage Survey Data for public release.xlsx
Resource Description: MS Excel worksheet of raw responses from the administered survey, as downloaded unfiltered from Survey Monkey, including incomplete responses. Also includes additional classification and calculations to support analysis. Individual email addresses and IP addresses have been removed.
Resource Software Recommended: Microsoft Excel, url: https://products.office.com/en-us/excel
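The per-person storage estimate described in the methods (high end of the reported range, divided by 1 for an individual response or by G for a group response) can be sketched as a quick calculation. The numbers below are hypothetical examples, not survey values:

```python
def per_person_tb(range_high_tb, group_size=1):
    """Per-person storage estimate in TB: high end of reported range / group size.

    group_size is 1 for an individual response, or G for a group response.
    """
    return range_high_tb / group_size

# Hypothetical example: a group response covering 5 scientists who reported
# the "more than 10 to 100 TB" range (high end = 100 TB).
print(per_person_tb(100, 5))   # 20.0 TB per person
print(per_person_tb(100))      # 100.0 TB for an individual response
```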
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Canada Trademarks Dataset
18 Journal of Empirical Legal Studies 908 (2021), prepublication draft available at https://papers.ssrn.com/abstract=3782655, published version available at https://onlinelibrary.wiley.com/share/author/CHG3HC6GTFMMRU8UJFRR?target=10.1111/jels.12303
Dataset Selection and Arrangement (c) 2021 Jeremy Sheff
Python and Stata Scripts (c) 2021 Jeremy Sheff
Contains data licensed by Her Majesty the Queen in right of Canada, as represented by the Minister of Industry, the minister responsible for the administration of the Canadian Intellectual Property Office.
This individual-application-level dataset includes records of all applications for registered trademarks in Canada since approximately 1980, and of many preserved applications and registrations dating back to the beginning of Canada’s trademark registry in 1865, totaling over 1.6 million application records. It includes comprehensive bibliographic and lifecycle data; trademark characteristics; goods and services claims; identification of applicants, attorneys, and other interested parties (including address data); detailed prosecution history event data; and data on application, registration, and use claims in countries other than Canada. The dataset has been constructed from public records made available by the Canadian Intellectual Property Office. Both the dataset and the code used to build and analyze it are presented for public use on open-access terms.
Scripts are licensed for reuse subject to the Creative Commons Attribution License 4.0 (CC-BY-4.0), https://creativecommons.org/licenses/by/4.0/. Data files are licensed for reuse subject to the Creative Commons Attribution License 4.0 (CC-BY-4.0), https://creativecommons.org/licenses/by/4.0/, and also subject to additional conditions imposed by the Canadian Intellectual Property Office (CIPO) as described below.
Terms of Use:
As per the terms of use of CIPO's government data, all users are required to include the above-quoted attribution to CIPO in any reproductions of this dataset. They are further required to cease using any record within the datasets that has been modified by CIPO and for which CIPO has issued a notice on its website in accordance with its Terms and Conditions, and to use the datasets in compliance with applicable laws. These requirements are in addition to the terms of the CC-BY-4.0 license, which require attribution to the author (among other terms). For further information on CIPO’s terms and conditions, see https://www.ic.gc.ca/eic/site/cipointernet-internetopic.nsf/eng/wr01935.html. For further information on the CC-BY-4.0 license, see https://creativecommons.org/licenses/by/4.0/.
The following attribution statement, if included by users of this dataset, is satisfactory to the author, but the author makes no representations as to whether it may be satisfactory to CIPO:
The Canada Trademarks Dataset is (c) 2021 by Jeremy Sheff and licensed under a CC-BY-4.0 license, subject to additional terms imposed by the Canadian Intellectual Property Office. It contains data licensed by Her Majesty the Queen in right of Canada, as represented by the Minister of Industry, the minister responsible for the administration of the Canadian Intellectual Property Office. For further information, see https://creativecommons.org/licenses/by/4.0/ and https://www.ic.gc.ca/eic/site/cipointernet-internetopic.nsf/eng/wr01935.html.
Details of Repository Contents:
This repository includes a number of .zip archives that expand into folders containing either scripts for construction and analysis of the dataset or data files comprising the dataset itself; these folders are described below.
If users wish to construct rather than download the datafiles, the first script that they should run is /py/sftp_secure.py. This script will prompt the user to enter their IP Horizons SFTP credentials; these can be obtained by registering with CIPO at https://ised-isde.survey-sondage.ca/f/s.aspx?s=59f3b3a4-2fb5-49a4-b064-645a5e3a752d&lang=EN&ds=SFTP. The script will also prompt the user to identify a target directory for the data downloads. Because the data archives are quite large, users are advised to create a target directory in advance and ensure they have at least 70GB of available storage on the media in which the directory is located.
The sftp_secure.py script will generate a new subfolder in the user’s target directory called /XML_raw. Users should note the full path of this directory, which they will be prompted to provide when running the remaining python scripts. Each of the remaining scripts, the filenames of which begin with “iterparse”, corresponds to one of the data files in the dataset, as indicated in the script’s filename. After running one of these scripts, the user’s target directory should include a /csv subdirectory containing the data file corresponding to the script; after running all the iterparse scripts the user’s /csv directory should be identical to the /csv directory in this repository. Users are invited to modify these scripts as they see fit, subject to the terms of the licenses set forth above.
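As a sketch of the streaming XML-to-CSV approach that the "iterparse" scripts take: the element and field names below are hypothetical stand-ins, not CIPO's actual schema, and the real scripts should be consulted for the true field mappings.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Toy stand-in for one of the CIPO XML archives (hypothetical tags).
xml_data = io.BytesIO(b"""<root>
  <application><number>1</number><mark>ACME</mark></application>
  <application><number>2</number><mark>WIDGETCO</mark></application>
</root>""")

with open("applications.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["number", "mark"])
    # iterparse streams the document, so memory use stays flat even for
    # multi-gigabyte archives; each element is cleared once written out.
    for event, elem in ET.iterparse(xml_data, events=("end",)):
        if elem.tag == "application":
            writer.writerow([elem.findtext("number"), elem.findtext("mark")])
            elem.clear()
```

The key design point is that `ET.iterparse` never builds the full tree, which is what makes parsing the large IP Horizons archives feasible on ordinary hardware.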
With respect to the Stata do-files, only one of them is relevant to construction of the dataset itself. This is /do/CA_TM_csv_cleanup.do, which converts the .csv versions of the data files to .dta format, and uses Stata's labeling functionality to reduce the size of the resulting files while preserving information. The other do-files generate the analyses and graphics presented in the paper describing the dataset (Jeremy N. Sheff, The Canada Trademarks Dataset, 18 J. Empirical Legal Studies 908 (2021), prepublication draft available at https://papers.ssrn.com/abstract=3782655). These do-files are also licensed for reuse subject to the terms of the CC-BY-4.0 license, and users are invited to adapt the scripts to their needs.
The python and Stata scripts included in this repository are separately maintained and updated on Github at https://github.com/jnsheff/CanadaTM.
This repository also includes a copy of the current version of CIPO's data dictionary for its historical XML trademarks archive as of the date of construction of this dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
=====================================================================
Authors: Trung-Nghia Le (1), Khanh-Duy Nguyen (2), Huy H. Nguyen (1), Junichi Yamagishi (1), Isao Echizen (1)
Affiliations: (1)National Institute of Informatics, Japan (2)University of Information Technology-VNUHCM, Vietnam
National Institute of Informatics Copyright (c) 2021
Emails: {ltnghia, nhhuy, jyamagis, iechizen}@nii.ac.jp, {khanhd}@uit.edu.vn
arXiv: https://arxiv.org/abs/2111.12888
NII Face Mask Dataset v1.0: https://zenodo.org/record/5761725
=============================== INTRODUCTION ===============================
The NII Face Mask Dataset is the first large-scale dataset targeting mask-wearing ratio estimation in street cameras. This dataset contains 581,108 face annotations extracted from 18,088 video frames (1920x1080 pixels) in 17 street-view videos obtained from Rambalac's YouTube channel.
The videos were taken in multiple places, at various times, before and during the COVID-19 pandemic. The total length of the videos is approximately 56 hours.
=============================== REFERENCES ===============================
If you publish using any of the data in this dataset, please cite the following papers:
@article{Nguyen202112888,
  title={Effectiveness of Detection-based and Regression-based Approaches for Estimating Mask-Wearing Ratio},
  author={Nguyen, Khanh-Duy and Nguyen, Huy H and Le, Trung-Nghia and Yamagishi, Junichi and Echizen, Isao},
  archivePrefix={arXiv},
  arxivId={2111.12888},
  url={https://arxiv.org/abs/2111.12888},
  year={2021}
}
@INPROCEEDINGS{Nguyen2021EstMaskWearing,
  author={Nguyen, Khanh-Duy and Nguyen, Huy H. and Le, Trung-Nghia and Yamagishi, Junichi and Echizen, Isao},
  booktitle={2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021)},
  title={Effectiveness of Detection-based and Regression-based Approaches for Estimating Mask-Wearing Ratio},
  year={2021},
  pages={1-8},
  url={https://ieeexplore.ieee.org/document/9667046},
  doi={10.1109/FG52635.2021.9667046}
}
======================== DATA STRUCTURE ==================================
./NFM
├── dataset
│   ├── train.csv: annotations for the train set
│   ├── test.csv: annotations for the test set
└── README_v1.0.md
We use the same structure for the two CSV files (train.csv and test.csv). Both have the same columns:

1st column: video_id (the source video can be found by appending this value to the link https://www.youtube.com/watch?v=)
2nd column: frame_id (the index of the frame extracted from the source video)
3rd column: timestamp in milliseconds (the timestamp of the frame extracted from the source video)
4th column: label (for each annotated face, one of three labels attached to its bounding box: 'Mask'/'No-Mask'/'Unknown')
5th column: left
6th column: top
7th column: right
8th column: bottom

The four coordinates (left, top, right, bottom) denote a face's bounding box.
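The column layout above can be consumed with the standard csv module. The snippet below uses a tiny inline stand-in for train.csv; note that whether the real files carry a header row is an assumption here, as is the sample data:

```python
import csv
import io

# Inline stand-in for train.csv, following the documented column order.
sample = io.StringIO(
    "video_id,frame_id,timestamp,label,left,top,right,bottom\n"
    "abc123,10,4000,Mask,100,50,180,140\n"
)

faces = []
for row in csv.DictReader(sample):
    # Reconstruct the source video URL from video_id, per the README.
    url = "https://www.youtube.com/watch?v=" + row["video_id"]
    # Bounding box width/height from the four coordinates.
    w = int(row["right"]) - int(row["left"])
    h = int(row["bottom"]) - int(row["top"])
    faces.append((url, row["label"], w, h))
print(faces)
```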
============================== COPYING ================================
This repository is made available under the Creative Commons Attribution License (CC BY).
For the full Attribution 4.0 International (CC BY 4.0) license, please see https://creativecommons.org/licenses/by/4.0/
THIS DATABASE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS DATABASE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
====================== ACKNOWLEDGEMENTS ================================
This research was partly supported by JSPS KAKENHI Grants (JP16H06302, JP18H04120, JP21H04907, JP20K23355, JP21K18023), and JST CREST Grants (JPMJCR20D3, JPMJCR18A6), Japan.
This dataset is based on videos from Rambalac's YouTube channel: https://www.youtube.com/c/Rambalac
This metadata record describes a series of tabular datasets containing metrics used to characterize periods of anomalously low storage for select large reservoirs across the conterminous United States for the climate years (April 1 – March 31) 1981 to 2020. These data support the associated Simeone and others (2024) publication. The reservoirs in this dataset are those included in the ResOpsUS dataset with sufficient data during the period of interest. The metrics include reservoir storage percentiles, identified low-storage anomaly events, annual low storages, low-storage anomaly statistics for each event, and trends in reservoir metrics.

This data release contains the following files. One version of each of the first three files exists per low-storage anomaly method: the variable threshold method (weibull_jd), the fixed threshold method (weibull_site), and the operating-curve method (operating curve); substitute the corresponding method string for "weibull_jd" in the file names below.

1) percentiles_1981_2020.zip: percentile zip files for the period 1981-2020. This zip file contains the reservoir storage, inflow, and outflow percentiles as individual csv files for each reservoir (for example, res_ops_XXX.csv, where XXX is the ResOpsUS reservoir identification number), including percentiles from each method. The metadata contains column details for an example version of these files.

2) reservoir_1981_2020_weibull_jddrought_properties.csv: a csv file containing summaries of each low-storage anomaly event for each reservoir and threshold. One of these files exists for each of the three low-storage anomaly methods; the three files carry different labels corresponding to the three methods.

3) reservoir_1981_2020weibull_jdcomplete_annual_stats.csv: a csv file containing the annual low-storage anomaly statistics for each climate year, threshold, and reservoir. One of these files exists for each of the three low-storage anomaly methods; the three files carry different labels corresponding to the three methods.

4) reservoir_1981_2020weibull_jd*_trends.csv: a csv file containing data on trends in low-storage anomaly characteristics for selected reservoirs during the primary period of interest, 1981 to 2020.

5) reservoir_metadata.csv: a csv file containing metadata for each reservoir.
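A minimal sketch of unpacking the per-reservoir percentile files: the zip layout follows the res_ops_XXX.csv naming described above, but the column names used here are illustrative assumptions only (the real columns are documented in the release metadata), and the toy archive is built in memory for demonstration.

```python
import csv
import io
import zipfile

# Build a toy archive mimicking percentiles_1981_2020.zip, one CSV per
# reservoir (res_ops_XXX.csv, XXX = ResOpsUS id). Columns are hypothetical.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("res_ops_101.csv", "date,storage_percentile\n1981-04-01,12.5\n")
    z.writestr("res_ops_102.csv", "date,storage_percentile\n1981-04-01,63.0\n")

# Read every per-reservoir file into a dict keyed by reservoir id.
tables = {}
with zipfile.ZipFile(buf) as z:
    for name in z.namelist():
        res_id = name.removeprefix("res_ops_").removesuffix(".csv")
        with z.open(name) as f:
            tables[res_id] = list(csv.DictReader(io.TextIOWrapper(f)))
print(sorted(tables))
```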
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Global Groundwater Withdrawals Peak Over the 21st Century
The large ensemble dataset contains groundwater related model outputs from 900 scenarios modeled using Global Change Analysis Model (GCAM). The scenario ensemble members include five Shared Socioeconomic Pathways (SSPs), four Representative Concentration Pathways (RCPs), five global climate model outputs, three groundwater depletion limits, two surface water storage expansion regimes, and two historical groundwater depletion trends.
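As a consistency check on the ensemble size implied by the dimensions above: the full cross product of the listed factors exceeds 900, so not every SSP-RCP pairing can be modeled. The figure of 15 valid SSP-RCP pairs is an inference from the arithmetic, not a statement in this record:

```python
# Full cross product of the listed scenario dimensions:
# 5 SSPs x 4 RCPs x 5 GCMs x 3 depletion limits x 2 storage regimes x 2 trends
full = 5 * 4 * 5 * 3 * 2 * 2

# The stated 900 runs are consistent with 15 (of the 20 possible) SSP-RCP
# pairs being modeled -- an inference, not stated in the record.
consistent = 15 * 5 * 3 * 2 * 2

print(full, consistent)  # 1200 900
```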
Journal Article
Niazi, H., Wild, T.B., Turner, S.W.D., Graham, N.T., Hejazi, M., Msangi, S., Kim, S., Lamontagne, J.R., & Zhao, M. (2024). Global peak water limit of future groundwater withdrawals. Nature Sustainability, 7(4), 413–422. https://doi.org/10.1038/s41893-024-01306-w
Read full-text here: https://rdcu.be/dFpb5
Data Repository
This data repository is to be used in combination with the main meta-repository containing all scripts and files for reproducing the experiment as well as the analysis and post-processing of the model outputs.
Scripts and smaller files are provided in the GitHub meta-repository whereas larger files are provided in this data repository. Please complete the repository by placing the files as described hereunder. Please find the GitHub meta-repository here: https://github.com/JGCRI/niazi-etal_2024_nature-sustainability
Descriptions of files:
gcam-5.7z contains the GCAM version used to simulate 900 scenarios of plausible futures. The model folder contains all necessary input files to reproduce the simulations.
The model is to be used in combination with the meta-repository to set up batch runs on a cluster.
Please navigate to the model/ folder for other scenario-specific and model setup folders and files. gcam-5 is to be extracted in the same directory (./model/gcam-5/).
First-time users of GCAM should follow the guidance on the GCAM wiki to set up the model and for background knowledge.
crop_yeild.7z: This file contains inputs related to climate impacts on crop yields. This is to be downloaded and extracted in the model/combined_impacts/ folder.
outputs-all.7z: Key model outputs queried and collated from the 900 GCAM runs, explained hereunder. The files can be downloaded individually (.csv files) or all at once in .7z format (outputs-all.7z). These files are to be placed in the model/outputs folder of the meta-repository.
ag_prod_all_GW_scenarios.csv - Agricultural production across all scenarios for 2050 and 2100 (tonnes)
prices_water_withdrawal_all.csv - Water prices across all scenarios and years ($/km3)
global_irrigated_prod_by_crop.csv - All irrigated agricultural production for each crop across all scenarios and years (tonnes)
surface_water_production_all.csv - Runoff across all scenarios and years (km3)
groundwater_production_FINAL.csv - Groundwater withdrawals across all scenarios and years (km3)
water_withdrawals_desal_all.csv - Water withdrawals from desalination plants across all scenarios and years (km3)
Short introduction to the study
Using 900 GCAM runs, this study finds that global groundwater withdrawals are expected to peak around mid-century and then decline through the rest of the 21st century, exposing about half of the global population, living in one-third of basins, to groundwater stress; the cost and availability of surface water storage is the most significant driver of future groundwater withdrawals. This first robust, quantitative confirmation of a peak-and-decline pattern for groundwater (previously known only for fossil fuels and minerals) raises concerns for basins heavily dependent on groundwater.
Contact
Please reach out to Hassan Niazi at hassan.niazi@pnnl.gov for any questions.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
DATA & FILE OVERVIEW
File List:
/BiomassData.csv: The yield, soil carbon sequestration, field area, and locations for fields identified as historically abandoned marginal land, with crop growth simulated using the SALUS biogeochemical crop model.
/ParametersWithSources.xlsx: Economic, environmental, and efficiency parameters used to parametrize the MILP model used in the study; source data for Supplementary Table 1.
/RefineryData.csv: The potential locations and CO2 transportation costs for biorefineries.
/DepotData.csv: The potential locations for preprocessing depots.
/figdata_Fig3a.csv: The results used to generate Figure 3.
/figdata_Fig3b_1.csv: The results used to generate Figure 3.
/figdata_Fig3b_2.csv: The results used to generate Figure 3.
/figdata_Fig4.csv: The results used to generate Figure 4.
/figdata_Fig5.csv: The technology matrix used to generate Figure 5.
/figdata_Fig6.csv: The results used to generate Figure 6.
/*_map.pdf: Map files for supplementary figures.
/*_geopdf.pdf: Map files with georeferenced information.
/SupplementaryFigureData.xlsx: Tabular data for supplementary plots.
/fields.gpkg: The shape file used to plot the map files.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
The dataset includes images featuring crowds of 0 to 5,000 people, captured across a diverse range of scenes and settings. Each image in the dataset is accompanied by a corresponding JSON file containing detailed labeling information for each person in the crowd, supporting crowd counting and classification.
Types of crowds in the dataset: 0-1000, 1000-2000, 2000-3000, 3000-4000 and 4000-5000
This dataset provides a valuable resource for researchers and developers working on crowd counting technology, enabling them to train and evaluate their algorithms with a wide range of crowd sizes and scenarios. It can also be used for benchmarking and comparison of different crowd counting algorithms, as well as for real-world applications such as public safety and security, urban planning, and retail analytics.
Leave a request on https://trainingdata.pro/datasets to learn about the price and buy the dataset
keywords: crowd counting, crowd density estimation, people counting, crowd analysis, image annotation, computer vision, deep learning, object detection, object counting, image classification, dense regression, crowd behavior analysis, crowd tracking, head detection, crowd segmentation, crowd motion analysis, image processing, machine learning, artificial intelligence, ai, human detection, crowd sensing, image dataset, public safety, crowd management, urban planning, event planning, traffic management
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Apples are typically stored under low temperature and controlled atmosphere conditions to ensure a year-round supply of high-quality fruit for the consumer. During storage, losses in quality and quantity occur due to spoilage by postharvest pathogens. One important postharvest pathogen of apple is Botrytis cinerea, a broad-host necrotroph with a large arsenal of infection strategies, able to infect over 1,400 different plant species. We studied the apple-B. cinerea interaction to get a better understanding of the defense response in apple. We conducted an RNAseq experiment in which the transcriptome of inoculated and non-inoculated (control and mock) apples was analyzed at 0, 1, 12, and 28 h post inoculation. Our results show extensive reprogramming of the apple's transcriptome, with about 28.9% of expressed genes exhibiting significant differential regulation in the inoculated samples. We demonstrate the transcriptional activation of pathogen-triggered immunity and a reprogramming of the fruit's metabolism, including a clear transcriptional activation of secondary metabolism and a correlation between the early transcriptional activation of the mevalonate pathway and reduced susceptibility, expressed as a reduction in resulting lesion diameters. This pathway produces the building blocks for terpenoids, a large class of compounds with diverging functions including defense. 1-MCP and hot water dip treatments were used to further evidence the key role of terpenoids in the defense and to demonstrate that ethylene modulates this response.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Labelling strategies in mass spectrometry (MS)-based proteomics enhance sample throughput by enabling the acquisition of multiplexed samples within a single run. However, contemporary experiments often involve increasingly complex designs, where the number of samples exceeds the capacity of a single run, resulting in a complex correlation structure that must be addressed for accurate statistical inference and reliable biomarker discovery. To this end, we introduce msqrob2TMT, a suite of mixed model-based workflows specifically designed for differential abundance analysis in labelled MS-based proteomics data. msqrob2TMT accommodates both sample-specific and feature-specific (e.g., peptide or protein) covariates, facilitating inference in experiments with arbitrarily complex designs and allowing for explicit correction of feature-specific covariates. We benchmark our innovative workflows against state-of-the-art tools, including DEqMS, MSstatsTMT, and msTrawler, using two spike-in studies. Our findings demonstrate that msqrob2TMT offers greater flexibility, improved modularity, and enhanced performance, particularly through the application of robust ridge regression. Finally, we demonstrate the practical relevance of msqrob2TMT in a real mouse study, highlighting its capacity to effectively account for the complex correlation structure in the data.
Vandenbulcke S, Vanderaa C, Crook O, Martens L, Clement L. Msqrob2TMT: Robust linear mixed models for inferring differential abundant proteins in labeled experiments with arbitrarily complex design. Mol Cell Proteomics. 2025;24(7):101002.
Also available as a preprint
Vandenbulcke, S., Vanderaa, C., Crook, O., Martens, L. & Clement, L. msqrob2TMT: robust linear mixed models for inferring differential abundant proteins in labelled experiments with arbitrarily complex design. bioRxiv 2024.03.29.587218 (2024) doi:10.1101/2024.03.29.587218
This repository provides the data required to reproduce the results shown in the msqrob2TMT study. Data are organised in two main parts: input data and processed data.
The input data consist of data generated by others that we used for our analyses. Files are organised using three prefixes, one for each data set.
This data set has been published by Huang et al. 2020 and has been downloaded from the MassIVE repository (RMSV000000265). It contains 2 files:

spikein1_psms.txt: a table with identified and quantified peptide-to-spectrum matches (FTP link: ftp://massive.ucsd.edu/x01/RMSV000000265/2020-06-08_huang704_4336d436/quant/161117_SILAC_HeLa_UPS1_TMT10_5Mixtures_3TechRep_UPSdB_Multiconsensus_PD22_Intensity_03_with_FDR_control_PSMs.txt)
spikein1_annotations.csv: the associated sample annotations (FTP link: ftp://massive.ucsd.edu/v02/MSV000084264/metadata/SpikeIn5mix_PD_annotation.csv)

This data set has been published by O'Brien et al. 2024 and has been downloaded from a private Google Cloud Storage. It contains 3 files:

spikein2_psms.csv: a table with identified and quantified peptide-to-spectrum matches (link)
spikein2_annotations.csv: a table with the associated sample annotations (link)
spilein2_covariateFile.csv: a file required to run the msTrawler method (link)

The data for the mouse study has been published by Plubell et al. 2017 and has been downloaded from the MassIVE RMSV000000264.7 reanalysis repository:

mouse_psms.txt: a table with identified and quantified peptide-to-spectrum matches (FTP link: ftp://massive.ucsd.edu/x01/RMSV000000264/2020-06-07_huang704_518429df/181017_Plubell_mouse_sh_lo_LF_HF_diet_adipocytes_3TMT10_HpH_Fusion_PD22_multi_01_PSMs.txt)
mouse_annotations.csv: the associated sample annotations (FTP link: ftp://massive.ucsd.edu/x01/RMSV000000264/2020-06-07_huang704_518429df/metadata/mouse3mix_PD_annotation.csv)

We generated the processed data during our analyses; they are provided in the processed.zip file. Each file is prefixed with the name of the data set it relates to. Here is a comprehensive list:
mouse_model_MsstatsTMT.rds: a data.frame containing the MSstatsTMT statistical inference results for the mouse dataset.
mouse_model_msqrob2tmt.rds: a data.frame containing the msqrob2TMT statistical inference results for the mouse dataset, where proteins were summarised within fraction.
mouse_model_msqrob2tmt_mixture.rds: a data.frame containing the msqrob2TMT statistical inference results for the mouse dataset, where proteins were summarised within mixture.
spikein1_input_deqms.rds: a data.frame containing the spikein1 data after PSM filtering, ready for analysis by DEqMS.
spikein1_input_msTrawler.txt: a tabular text file containing the spikein1 data after PSM filtering, ready for analysis by msTrawler.
spikein1_input_msqrob2tmt.rds: a QFeatures object containing the spikein1 data after PSM filtering, ready for analysis by msqrob2.
spikein1_input_msstatstmt.rds: a data.frame containing the spikein1 data after PSM filtering, ready for analysis by MSstatsTMT.
spikein1_model_DEqMS.rds: a data.frame containing the DEqMS statistical inference results for the spikein1 dataset.
spikein1_model_MsstatsTMT.rds: a data.frame containing the MSstatsTMT statistical inference results for the spikein1 dataset.
spikein1_model_compare_preprocessing.rds: a data.frame containing MSstatsTMT and msqrob2TMT statistical inference results for the spikein1 dataset under different processing workflows carried out by MSstatsTMT.
spikein1_model_msTrawler.rds: a data.frame containing the msTrawler statistical inference results for the spikein1 dataset.
spikein1_model_msqrob2tmt.rds: a data.frame containing the msqrob2TMT statistical inference results for the spikein1 dataset.
spikein2_input.rds: a data.frame containing the spikein2 data after running the custom preprocessing pipeline by O'Brien et al. 2024.
spikein2_input_preprocessed.rds: a data.frame containing the spikein2 data after running the custom preprocessing workflow by O'Brien et al. 2024 and the preprocessing workflow by msTrawler.
spikein2_model_DEqMS.rds: a data.frame containing the DEqMS statistical inference results for the spikein2 dataset.
spikein2_model_msqrob2tmt.rds: a data.frame containing the msqrob2TMT statistical inference results for the spikein2 dataset.
spikein2_model_MSstatsTMT.rds: a data.frame containing the MSstatsTMT statistical inference results for the spikein2 dataset.
spikein2_model_msTrawler.rds: a data.frame containing the msTrawler statistical inference results for the spikein2 dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Documented March 19, 2023
!!!NEW!!!
GeoDAR reservoirs were registered to the drainage network! Please see the auxiliary data "GeoDAR-TopoCat" at https://zenodo.org/records/7750736. "GeoDAR-TopoCat" contains the drainage topology (reaches and upstream/downstream relationships) and catchment boundary for each reservoir in GeoDAR, based on the algorithm used for Lake-TopoCat (doi:10.5194/essd-15-3483-2023).
Documented April 1, 2022
Citation
Wang, J., Walter, B. A., Yao, F., Song, C., Ding, M., Maroof, A. S., Zhu, J., Fan, C., McAlister, J. M., Sikder, M. S., Sheng, Y., Allen, G. H., Crétaux, J.-F., and Wada, Y.: GeoDAR: georeferenced global dams and reservoirs database for bridging attributes and geolocations. Earth System Science Data, 14, 1869–1899, 2022, https://doi.org/10.5194/essd-14-1869-2022.
Please cite the reference above (which was fully peer-reviewed), NOT the preprint version. Thank you.
Contact
Dr. Jida Wang, jidawang@ksu.edu, gdbruins@ucla.edu
Data description and components
Data folder “GeoDAR_v10_v11” (.zip) contains two consecutive, peer-reviewed versions (v1.0 and v1.1) of the Georeferenced global Dams And Reservoirs (GeoDAR) dataset:
As by-products of GeoDAR harmonization, folder “GeoDAR_v10_v11” also contains:
Attribute description
Attribute | Description and values

v1.0 dams (file name: GeoDAR_v10_dams; format: comma-separated values (csv) and point shapefile)

id_v10 | Dam ID for GeoDAR version 1.0 (type: integer). Note this is not the same as the International Code in ICOLD WRD but is linked to the International Code via encryption.
lat | Latitude of the dam point in decimal degree (type: float) based on datum World Geodetic System (WGS) 1984.
lon | Longitude of the dam point in decimal degree (type: float) on WGS 1984.
geo_mtd | Georeferencing method (type: text). Unique values include “geo-matching CanVec”, “geo-matching LRD”, “geo-matching MARS”, “geo-matching NID”, “geo-matching ODC”, “geo-matching ODM”, “geo-matching RSB”, “geocoding (Google Maps)”, and “Wada et al. (2017)”. Refer to Table 2 in Wang et al. (2022) for abbreviations.
qa_rank | Quality assurance (QA) ranking (type: text). Unique values include “M1”, “M2”, “M3”, “C1”, “C2”, “C3”, “C4”, and “C5”. The QA ranking provides a general measure for our georeferencing quality. Refer to Supplementary Tables S1 and S3 in Wang et al. (2022) for more explanation.
rv_mcm | Reservoir storage capacity in million cubic meters (type: float). Values are only available for large dams in Wada et al. (2017). Capacity values of other WRD records are not released due to ICOLD’s proprietary restriction. Also see Table S4 in Wang et al. (2022).
val_scn | Validation result (type: text). Unique values include “correct”, “register”, “mismatch”, “misplacement”, and “Google Maps”. Refer to Table 4 in Wang et al. (2022) for explanation.
val_src | Primary validation source (type: text). Values include “CanVec”, “Google Maps”, “JDF”, “LRD”, “MARS”, “NID”, “NPCGIS”, “NRLD”, “ODC”, “ODM”, “RSB”, and “Wada et al. (2017)”. Refer to Table 2 in Wang et al. (2022) for abbreviations.
qc | Roles and name initials of co-authors/participants during data quality control (QC) and validation. Name initials are given to each assigned dam or region and are listed generally in chronological order for each role. Collation and harmonization of large dams in Wada et al. (2017) (see Table S4 in Wang et al. (2022)) were performed by JW, and this information is not repeated in the qc attribute for a reduced file size. Although we tried to track the name initials thoroughly, the lists may not be always exhaustive, and other undocumented adjustments and corrections were most likely performed by JW.

v1.1 dams (file name: GeoDAR_v11_dams; format: comma-separated values (csv) and point shapefile)

id_v11 | Dam ID for GeoDAR version 1.1 (type: integer). Note this is not the same as the International Code in ICOLD WRD but is linked to the International Code via encryption.
id_v10 | v1.0 ID of this dam/reservoir (as in id_v10) if it is also included in v1.0 (type: integer).
id_grd_v13 | GRanD ID of this dam if also included in GRanD v1.3 (type: integer).
lat | Latitude of the dam point in decimal degree (type: float) on WGS 1984. Value may be different from that in v1.0.
lon | Longitude of the dam point in decimal degree (type: float) on WGS 1984. Value may be different from that in v1.0.
geo_mtd | Same as the value of geo_mtd in v1.0 if this dam is included in v1.0.
qa_rank | Same as the value of qa_rank in v1.0 if this dam is included in v1.0.
val_scn | Same as the value of val_scn in v1.0 if this dam is included in v1.0.
val_src | Same as the value of val_src in v1.0 if this dam is included in v1.0.
rv_mcm_v10 | Same as the value of rv_mcm in v1.0 if this dam is included in v1.0.
rv_mcm_v11 | Reservoir storage capacity in million cubic meters (type: float). Due to ICOLD’s proprietary restriction, provided values are limited to dams in Wada et al. (2017) and GRanD v1.3. If a dam is in both Wada et al. (2017) and GRanD v1.3, the value from the latter (if valid) takes precedence.
har_src | Source(s) to harmonize the dam points. Unique values include “GeoDAR v1.0 alone”, “GRanD v1.3 and GeoDAR 1.0”, “GRanD v1.3 and other ICOLD”, and “GRanD v1.3 alone”. Refer to Table 1 in Wang et al. (2022) for more details.
pnt_src | Source(s) of the dam point spatial coordinates. Unique values include “GeoDAR v1.0”, “original
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Corresponding peer-reviewed publication
This dataset corresponds to all input and output files that were used in the study reported in:
Wade, J., David, C.H., Collins, E.L., Denbina, M., Cerbelaud, A., Tom, M., Reager, J.T., Frasson, R.P.M., Famiglietti, J.S., Lee, T., Gierach, M.M. (In Review), Intrinsic spatial scales of river stores and fluxes and their relative contributions to the global water cycle.
When making use of any of the files in this dataset, please cite both the aforementioned article and the dataset herein.
Summary
The Earth’s rivers vary in size across several orders of magnitude. Yet, the relative significance of small upstream reaches compared to large downstream rivers in the global water cycle remains unclear, challenging the determination of adequate spatial resolution for observations. Using monthly simulations of river stores and fluxes from the MeanDRS river routing dataset, we sample global rivers by a range of estimated river width thresholds to investigate the intrinsic spatial scales of the global river water cycle. We frame these scale-dependent river dynamics in terms of observational capabilities, assessing how the size of rivers that can be resolved influences our ability to capture key global hydrologic stores and fluxes.
We aim to answer two questions:
What is the intrinsic spatial resolution of global river dynamics?
How can the spatial scale of river processes be used to inform efficient monitoring and modeling strategies of global river stores and fluxes?
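The width-threshold sampling behind these questions can be sketched numerically: given per-reach widths and discharges to the ocean, compute the fraction of total discharge captured when only reaches wider than an observable threshold are resolved. This is a minimal illustration with made-up numbers, not MeanDRS code:

```python
def fraction_captured(widths_m, q_km3yr, threshold_m):
    """Fraction of total discharge to the ocean carried by reaches
    at least as wide as the observable width threshold."""
    total = sum(q_km3yr)
    resolved = sum(q for w, q in zip(widths_m, q_km3yr) if w >= threshold_m)
    return resolved / total

# Three hypothetical coastal reaches: most discharge exits via wide rivers.
widths = [50, 150, 400]   # estimated widths (m)
q_out = [1.0, 2.0, 7.0]   # discharge to the ocean (km3/yr)
share = fraction_captured(widths, q_out, threshold_m=100)  # 0.9
```

Sweeping `threshold_m` over a range of widths and plotting `share` yields the kind of scale-dependence curve the study aggregates per region.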
Data sources
The following sources were used to produce files in this dataset:
Mean Discharge Runoff and Storage (MeanDRS) dataset (version v0.4) available under a CC BY-NC-SA 4.0 license. https://zenodo.org/records/10013744. DOI: 10.5281/zenodo.10013744; 10.1038/s41561-024-01421-5
MERIT-Basins (version 1.0) derived from MERIT-Hydro (version 0.7) available under a CC BY-NC-SA 4.0 license. https://www.reachhydro.org/home/params/merit-basins
Software
The software that was used to produce files in this dataset is available at https://github.com/jswade/meandrs-width-sampling.
Data Products
The following files represent the primary outputs of the analysis. Each file class generally has 61 files, corresponding to the 61 global hydrologic regions (region ii).
Riv_coast.zip contains shapefiles of corrected and uncorrected MeanDRS river reaches that intersect with the global coast and are inferred to drain to the ocean.
· riv_coast.zip
o cor: riv_coast_pfaf_ii_COR.shp
o uncor: riv_coast_pfaf_ii_UNCOR.shp
Qout_rivwidth.zip contains csv files of the aggregate river discharge to the ocean (km3/yr) under each tested river width sampling scenario for each of the 61 global hydrologic regions.
· Qout_rivwidth.zip: Qout_pfaf_ii_rivwidth.csv
V_rivwidth_low.zip contains csv files of the aggregate river storage (km3) for the low residence time scenario under each tested river width sampling scenario for each of the 61 global hydrologic regions.
· V_rivwidth_low.zip: V_pfaf_ii_rivwidth_low.csv
V_rivwidth_nrm.zip contains csv files of the aggregate river storage (km3) for the normal (medium) residence time scenario under each tested river width sampling scenario for each of the 61 global hydrologic regions.
· V_rivwidth_nrm.zip: V_pfaf_ii_rivwidth_nrm.csv
V_rivwidth_hig.zip contains csv files of the aggregate river storage (km3) for the high residence time scenario under each tested river width sampling scenario for each of the 61 global hydrologic regions.
· V_rivwidth_hig.zip: V_pfaf_ii_rivwidth_hig.csv
Largest_rivs.zip contains files related to our analysis of the relative contributions of discharge to the ocean from the 10 largest global river basins.
· largest_rivs.zip
o cat: cat_dis_top10_nxx.shp – dissolved catchments of reaches draining from the 10 largest basins
o csv: Q_df_top10.csv – total discharge contributed by each basin
o riv: riv_top10_nxx.shp – river reaches that drain the 10 largest basins
Smallest_rivs.zip contains files related to our analysis of the relative contributions of discharge to the ocean from global rivers narrower than 100 m.
· smallest_rivs.zip
o cat: cat_pfaf_pfaf_ii_small_100m.shp – dissolved catchments of narrow reaches draining to the ocean for each region ii
o csv: Q_df_top10.csv – total discharge to the ocean from each narrow river reach
o riv: riv_pfaf_ii_small_100m.shp – river reaches narrower than 100 m that drain to the ocean for each region ii
Global_summary.zip contains files related to the global aggregation of our region-specific river width sampling estimates for discharge to the ocean and river storage.
· global_summary.zip
o Qout_rivwidth: global summary files for discharge to the ocean (km3/yr) under river width sampling
o V_rivwidth_low: global summary files for total river storage (km3) for the low residence time scenario under river width sampling
o V_rivwidth_nrm: global summary files for total river storage (km3) for the normal (medium) residence time scenario under river width sampling
o V_rivwidth_hig: global summary files for total river storage (km3) for the high residence time scenario under river width sampling
o cat_small_gl: cat_dis_global_small_100m.shp – global dissolved catchments contributing to all rivers narrower than 100 m that drain to the ocean
Rivwidth_sens.zip contains files related to our supplemental analysis of the sensitivity of our width estimation approach to choice of input discharge dataset. Here, we compute estimated river widths using 3 versions of MeanDRS discharge outputs (VIC, CLSM, NOAH) and compare the results of river width sampling from those runs to that of the primary analysis. The file formats and explanations follow those presented above, with added information for the land surface model used to generate those discharge simulations.
· Rivwidth_sens.zip
o riv_coast
o Qout_rivwidth_VIC
o Qout_rivwidth_CLSM
o Qout_rivwidth_NOAH
o V_rivwidth_low_VIC
o V_rivwidth_nrm_VIC
o V_rivwidth_hig_VIC
o V_rivwidth_low_CLSM
o V_rivwidth_nrm_CLSM
o V_rivwidth_hig_CLSM
o V_rivwidth_low_NOAH
o V_rivwidth_nrm_NOAH
o V_rivwidth_hig_NOAH
o global_summary_VIC
o global_summary_CLSM
o global_summary_NOAH
Cor_sens.zip contains files related to our supplemental analysis of the sensitivity to using corrected ensemble MeanDRS discharge and volume simulations as opposed to uncorrected ensemble simulations. Here, we repeat our primary analysis using only uncorrected simulations throughout, rather than performing river width sampling using corrected simulations. The file formats and explanations follow those presented above, with the files using uncorrected ensemble (ENS) discharge and storage values in contrast to the primary analysis.
· Cor_sens.zip
o Qout_rivwidth_ENS
o V_rivwidth_low_ENS
o V_rivwidth_nrm_ENS
o V_rivwidth_hig_ENS
o global_summary_ENS
Width_val.zip contains files related to our supplemental validation of river widths estimated from MeanDRS discharge simulations through comparison with optical measurements of widths from the Global River Widths from Landsat (GRWL) Database (Allen & Pavelsky, 2018).
· Width_val.zip: width_validation_pfaf_ii.csv
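Because each file class is split into 61 per-region files (region ii), a global summary is essentially a sum over regions at each width threshold. The sketch below mimics that aggregation with two tiny in-memory "regional CSVs"; the column names are assumptions for illustration, not the dataset's actual headers:

```python
import csv
import io
from collections import defaultdict

# Hypothetical miniature of two regional Qout_pfaf_ii_rivwidth.csv files:
# one row per sampled width threshold, aggregate discharge for that region.
pfaf_11 = "width_threshold_m,Qout_km3yr\n0,100\n100,80\n"
pfaf_12 = "width_threshold_m,Qout_km3yr\n0,50\n100,30\n"

# Sum regional discharge at each threshold to form a global summary.
global_qout = defaultdict(float)
for text in (pfaf_11, pfaf_12):
    for row in csv.DictReader(io.StringIO(text)):
        global_qout[int(row["width_threshold_m"])] += float(row["Qout_km3yr"])
# global_qout holds {0: 150.0, 100: 110.0}
```

The same pattern applies to the V_rivwidth_* storage files, swapping the discharge column for a storage column.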
Known bugs in this dataset or the associated manuscript
No bugs have been identified at this time.
References
Allen, G. H., & Pavelsky, T. M. (2018). Global extent of rivers and streams. Science, 361(6402), 585-588. https://doi.org/10.1126/science.aat0636
Collins, E. L., David, C. H., Riggs, R., Allen, G. H., Pavelsky, T. M., Lin, P., Pan, M., Yamazaki, D., Meentemeyer, R. K., & Sanchez, G. M. (2024). Global patterns in river water storage dependent on residence time. Nature Geoscience, 1–7. https://doi.org/10.1038/s41561-024-01421-5
Lin, P., Pan, M., Beck, H. E., Yang, Y., Yamazaki, D., Frasson, R., David, C. H., Durand, M., Pavelsky, T. M., Allen, G. H., Gleason, C. J., & Wood, E. F. (2019). Global Reconstruction of Naturalized River Flows at 2.94 Million Reaches. Water Resources Research, 55(8), 6499–6516. https://doi.org/10.1029/2019WR025287
Yang, Y., Pan, M., Lin, P., Beck, H. E., Zeng, Z., Yamazaki, D., David, C. H., Lu, H., Yang, K., Hong, Y., & Wood, E. F. (2021). Global Reach-Level 3-Hourly River Flood Reanalysis (1980–2019). Bulletin of the American Meteorological Society, 102(11), E2086–E2105. https://doi.org/10.1175/BAMS-D-20-0057.1
The datapreview extension for CKAN enhances data accessibility by providing a proxy to retrieve and format data from local storage or remote URLs for previewing in applications like Recline. It addresses performance and file size limitations found in similar solutions, offering a streamlined way to preview CSV and XLS files within the CKAN environment by leveraging the ckanext-archiver extension. This extension provides a local implementation of data proxy functionality, aiming to improve the efficiency of data previewing, especially for larger datasets.
Key Features:
Data Proxy Functionality: Serves as a proxy for retrieving data from local or remote sources, formatting it into a JSON dictionary suitable for data preview tools.
CSV/XLS Parsing: Parses CSV and XLS files to extract data for preview, enabling users to quickly inspect data content without downloading the entire file.
File Size Limit Configuration: Allows administrators to configure a maximum file size limit for remote downloads and in-memory processing, preventing server overload when handling large datasets.
Local Archive Cache Utilization: Integrates with ckanext-archiver to prioritize retrieving data from the local archive cache, reducing reliance on remote sources and improving retrieval speed if files have already been archived.
Technical Integration: The datapreview extension integrates with CKAN by adding a new controller that handles data proxy requests. It relies on the resource ID rather than a URL, which differs from the original dataproxy implementation. The extension also depends on ckanext-archiver for accessing cached resources and messytables for handling CSV and Excel file parsing. To enable the extension, it must be added to the ckan.plugins property in the CKAN configuration file.
Benefits & Impact: The datapreview extension improves the performance and scalability of data previewing within CKAN. By using a local archive cache and allowing configuration of file size limits, it addresses the limitations of the original dataproxy implementation. It also enables the previewing of larger files than might otherwise be possible. On data.gov.uk, the extension helps users quickly view data before deciding to download it, which enhances the overall user experience.
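Enabling the extension follows the standard CKAN plugin mechanism via ckan.plugins. A minimal configuration sketch is shown below; the plugin name and the size-limit option name are assumptions for illustration, not taken from this description (check the extension's own README for the real names):

```ini
[app:main]
; Enable the preview proxy alongside the archiver extension it depends on
; (plugin names here are assumed).
ckan.plugins = datapreview archiver
; Hypothetical option: maximum file size (bytes) for remote download
; and in-memory parsing.
ckan.datapreview.max_file_size = 5242880
```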
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The TBX11K dataset is a large dataset containing 11000 chest x-ray images. It's the only TB dataset that I know of that includes TB bounding boxes. This allows both classification and detection models to be trained.
However, it can be mentally tiring to get started with this dataset. It includes many xml, json and txt files that you need to sift through to try to understand what everything means, how it all fits together and how to extract the bounding box coordinates.
Here I've simplified the dataset. Now there's just one csv file, one folder containing the training images and one folder containing the test images.
Paper: Rethinking Computer-aided Tuberculosis Diagnosis
Original TBX11K dataset on Kaggle
1- Please start by reading the paper. It will help you understand what everything means.
2- The original dataset was split into train and validation sets. This split is shown in the 'source' column in the data.csv file.
3- The test images are stored in the folder called "test". There are no labels for these images and I've not included them in data.csv.
4- Each bounding box is on a separate row. Therefore, the file names in the "fname" column are not unique. For example, if an image has two bounding boxes then the file name for that image will appear twice in the "fname" column.
5- The original dataset has a folder named "extra" that contains data from other TB datasets. I've not included that folder here.
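Since each bounding box sits on its own row, collecting all boxes for one image is a simple group-by on "fname". A stdlib sketch with made-up rows; the bounding-box column names are assumptions, not the dataset's actual header:

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows mimicking data.csv: one bounding box per row, so an
# image with two boxes appears twice in the "fname" column.
rows = io.StringIO(
    "fname,xmin,ymin,xmax,ymax\n"
    "tb0001.png,10,20,110,220\n"
    "tb0001.png,30,40,90,160\n"
    "tb0002.png,5,5,50,50\n"
)

# Group boxes by image so each file name maps to its list of boxes.
boxes_by_image = defaultdict(list)
for r in csv.DictReader(rows):
    boxes_by_image[r["fname"]].append(
        tuple(int(r[k]) for k in ("xmin", "ymin", "xmax", "ymax")))
# tb0001.png maps to two boxes; tb0002.png to one
```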
Many thanks to the team that created the TBX11K dataset and generously made it publicly available.
# TBX11K dataset
@inproceedings{liu2020rethinking,
title={Rethinking computer-aided tuberculosis diagnosis},
author={Liu, Yun and Wu, Yu-Huan and Ban, Yunfeng and Wang, Huifang and Cheng, Ming-Ming},
booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={2646--2655},
year={2020}
}
The U.S. Geological Survey (USGS) Water Resources Mission Area (WMA) is working to address the need to understand where the Nation is experiencing water shortages or surpluses relative to the demand for water by delivering routine assessments of water supply and demand and an understanding of the natural and human factors affecting the balance between supply and demand. A key part of these national assessments is identifying long-term trends in water availability, including groundwater and surface water quantity, quality, and use. This data release contains Mann-Kendall monotonic trend analyses for 18 observed annual and monthly streamflow metrics at 6,347 U.S. Geological Survey streamgages located in the conterminous United States, Alaska, Hawaii, and Puerto Rico. Streamflow metrics include annual mean flow, maximum 1-day and 7-day flows, minimum 7-day and 30-day flows, and the date of the center of volume (the date on which 50% of the annual flow has passed by a gage), along with the mean flow for each month of the year. Annual streamflow metrics are computed from mean daily discharge records at U.S. Geological Survey streamgages that are publicly available from the National Water Information System (NWIS). Trend analyses are computed using annual streamflow metrics computed through climate year 2022 (April 2022 - March 2023) for low-flow metrics and water year 2022 (October 2021 - September 2022) for all other metrics. Trends at each site are available for up to four different periods: (i) the longest possible period that meets completeness criteria at each site, (ii) 1980-2020, (iii) 1990-2020, (iv) 2000-2020. Annual metric time series analyzed for trends must have 80 percent complete records during fixed periods. In addition, each of these time series must have 80 percent complete records during their first and last decades. 
All longest possible period time series must be at least 10 years long and have annual metric values for at least 80% of the years running from 2013 to 2022. This data release provides the following five CSV output files along with a model archive: (1) streamflow_trend_results.csv - contains test results of all trend analyses with each row representing one unique combination of (i) NWIS streamgage identifiers, (ii) metric (computed using Oct 1 - Sep 30 water years except for low-flow metrics computed using climate years (Apr 1 - Mar 31)), (iii) trend periods of interest (longest possible period through 2022, 1980-2020, 1990-2020, 2000-2020) and (iv) records containing either the full trend period or only a portion of the trend period following substantial increases in cumulative upstream reservoir storage capacity. This is an output from the final process step (#5) of the workflow. (2) streamflow_trend_trajectories_with_confidence_bands.csv - contains annual trend trajectories estimated using Theil-Sen regression, which estimates the median of the probability distribution of a metric for a given year, along with 90 percent confidence intervals (5th and 95th percentile values). This is an output from the final process step (#5) of the workflow. (3) streamflow_trend_screening_all_steps.csv - contains the screening results of all 7,873 streamgages initially considered as candidate sites for trend analysis and identifies the screens that prevented some sites from being included in the Mann-Kendall trend analysis. (4) all_site_year_metrics.csv - contains annual time series values of streamflow metrics computed from mean daily discharge data at 7,873 candidate sites. This is an output of Process Step 1 in the workflow. (5) all_site_year_filters.csv - contains information about the completeness and quality of daily mean discharge at each streamgage during each year (water year, climate year, and calendar year). 
This is also an output of Process Step 1 in the workflow and is combined with all_site_year_metrics.csv in Process Step 2. In addition, a .zip file contains a model archive for reproducing the trend results using R 4.4.1 statistical software. See the README file contained in the model archive for more information. Caution must be exercised when utilizing monotonic trend analyses conducted over periods of up to several decades (and in some places longer ones) due to the potential for confounding deterministic gradual trends with multi-decadal climatic fluctuations. In addition, trend results are available for post-reservoir construction periods within the four trend periods described above to avoid including abrupt changes arising from the construction of larger reservoirs in periods for which gradual monotonic trends are computed. Other abrupt changes, such as changes to water withdrawals and wastewater return flows, or episodic disturbances with multi-year recovery periods, such as wildfires, are not evaluated. Sites with pronounced abrupt changes or other non-monotonic trajectories of change may require more sophisticated trend analyses than those presented in this data release.
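The two estimators named above can be sketched compactly. This is a minimal illustration on a toy annual series, not the USGS model archive: the real workflow (R 4.4.1) additionally handles ties, serial correlation, and significance testing:

```python
from itertools import combinations

def mann_kendall_s(series):
    """Mann-Kendall S statistic: number of increasing pairs minus number
    of decreasing pairs; S > 0 suggests an upward monotonic trend."""
    return sum((b > a) - (b < a) for a, b in combinations(series, 2))

def theil_sen_slope(years, values):
    """Theil-Sen estimator: the median of all pairwise slopes."""
    slopes = sorted((v2 - v1) / (y2 - y1)
                    for (y1, v1), (y2, v2) in combinations(zip(years, values), 2))
    n = len(slopes)
    return slopes[n // 2] if n % 2 else (slopes[n // 2 - 1] + slopes[n // 2]) / 2

# Toy annual-mean-flow series rising steadily by 2 units per year.
years = [2000, 2001, 2002, 2003, 2004]
flows = [10, 12, 14, 16, 18]
s = mann_kendall_s(flows)              # 10: all 10 pairs are increasing
slope = theil_sen_slope(years, flows)  # 2.0
```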
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Seagrass ecosystems provide an array of ecosystem services ranging from habitat provision to erosion control. From a climate change and eutrophication mitigation perspective, the ecosystem services include burial and storage of carbon and nutrients in the sediments. Eelgrass (Zostera marina) is the most abundant seagrass species along the Danish coasts, and while its function as a carbon and nutrient sink has been documented in some areas, the spatial variability of these functions, and the drivers behind them, are not well understood. Here we present the first nationwide study on eelgrass sediment stock of carbon (Cstock), nitrogen (Nstock), and phosphorus (Pstock). Stocks were measured in the top 10 cm of eelgrass meadows spanning semi-enclosed estuaries (inner and outer fjords) to open coasts. Further, we assessed environmental factors (level of exposure, sediment properties, level of eutrophication) from each area to evaluate their relative importance as drivers of the spatial pattern in the respective stocks. We found large spatial variability in sediment stocks, representing 155–4413 g C m-2, 24–448 g N m-2, and 7–34 g P m-2. Cstock and Nstock were significantly higher in inner fjords compared to outer fjords and open coasts. Cstock, Nstock, and Pstock showed a significantly positive relationship with the silt-clay content in the sediments. Moreover, Cstock was also significantly higher in more eutrophied areas with high concentrations of nutrients and chlorophyll a (chl a) in the water column. Conversely, silt-clay content was not related to nutrients or chl a, suggesting a spatial dependence of the importance of these factors in driving stock sizes and implying that local differences in sediment properties and eutrophication level should be included when evaluating the storage capacity of carbon, nitrogen, and phosphorus in Danish eelgrass meadows. 
These insights provide guidance to managers in selecting priority areas for carbon and nutrient storage for climate- and eutrophication mitigation initiatives.
Board-based software tools for managing collaborative work (e.g. Trello or Microsoft Planner) are highly configurable information systems. Their structure is based on boards that contain cards organized in lists. This structure allows users to organize a wide variety of formal or informal information and work processes in a very flexible way. However, this flexibility means that in every situation the user is required to make decisions to design a new board from scratch, which is not a straightforward task, especially if performed by non-technical users.
We have carried out a study following an inductive approach consisting of analyzing 91 Trello board designs from board templates proposed by Trello users (see trello-scrapping.csv), which cover a wide variety of domains and use cases. From this analysis we characterize the following 8 patterns that are commonly used in board designs and are applicable to all board-based tools:
Information or resources lifecycle
Ordered Information
Kanban
Process Tasks
Assigned Information
Categorized Information
Assigned Tasks
Categorized Tasks
About the analysis performed
For the sake of the verifiability of the analysis performed, the sources used for the analysis and its details are also available at this repository:
trello-scrapping.csv: This csv file contains the actual scrape of www.trello.com/templates on the date the paper was submitted (4th December 2020). This scrape returns the whole list of templates that can be used in the cited workstream collaborative tool. The 91 templates analyzed in the paper were obtained with a similar scrape about a year earlier (approximately February 2019). In this file you can examine the full set of 230 templates created in Trello (including the previous 91), with their names, links and descriptions among other elements. We have added two new columns (not obtained from the scrape) in which we have classified the templates into their corresponding pattern, as we did in "trello templates clasification.xlsx" with only the first 91 templates when writing the paper. In these two columns we distinguish between the 91 templates considered in the paper (previously classified and isolated in "trello templates clasification.xlsx") and the new templates obtained in the second scrape after submitting the paper.
trello templates clasification.xlsx: This file contains the classification of the 91 analyzed Trello templates, divided into three sheets. The first sheet shows the raw classification, with one row for each template and an "X" in the column of the pattern(s) in which it is classified. The other sheets contain summary tables showing the distribution of templates across our proposed patterns (how many templates there are for each pattern) and the multiple pattern combinations (when a template contains more than one pattern).
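The summary tables described above amount to two counts over the raw "X"-matrix: templates per pattern, and templates combining more than one pattern. A stdlib sketch with made-up template names and marks (not the actual classification data):

```python
# Hypothetical miniature of the raw-classification sheet: one row per
# template, an "X" in each pattern column it belongs to.
patterns = ["Kanban", "Ordered Information", "Assigned Tasks"]
raw = [
    {"template": "Sprint board", "Kanban": "X", "Ordered Information": "", "Assigned Tasks": "X"},
    {"template": "Editorial calendar", "Kanban": "", "Ordered Information": "X", "Assigned Tasks": ""},
    {"template": "CRM pipeline", "Kanban": "X", "Ordered Information": "", "Assigned Tasks": ""},
]

# Templates per pattern (a template with several marks counts in each).
per_pattern = {p: sum(row[p] == "X" for row in raw) for p in patterns}
# Templates that combine more than one pattern.
multi = sum(sum(row[p] == "X" for p in patterns) > 1 for row in raw)
# per_pattern is {"Kanban": 2, "Ordered Information": 1, "Assigned Tasks": 1}; multi is 1
```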
Various population statistics, including structured demographics data.
This dataset contains detailed information on a wide variety of vegetables from different regions across the world. Each entry includes data on the vegetable's category, color, seasonality, origin, nutritional value, pricing, availability, shelf life, storage requirements, growing conditions, health benefits, and common varieties. The dataset is structured to facilitate research and data analysis, offering insights into agricultural trends, nutritional science, and market dynamics. Ideal for use in academic research, market analysis, and agricultural studies.
Various economic indicators.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
One of the cognitive abilities most affected by substance abuse is decision-making. Behavioral tasks such as the Iowa Gambling Task (IGT) provide a means to measure the learning process involved in decision-making. To comprehend this process, three hypotheses have emerged: (1) participants prioritize gains over losses, (2) they exhibit insensitivity to losses, and (3) the capacity of operational storage or working memory comes into play. A dynamic model was developed to examine these hypotheses, simulating sensitivity to gains and losses. The Linear Operator model served as the learning rule, wherein net gains depend on the ratio of gains to losses, weighted by the sensitivity to both. The study further proposes a comparison between the performance of simulated agents and that of substance abusers (n = 20) and control adults (n = 20). The findings indicate that as the memory factor increases, along with high sensitivity to losses and low sensitivity to gains, agents prefer advantageous alternatives, particularly those with a lower frequency of punishments. Conversely, when sensitivity to gains increases and the memory factor decreases, agents prefer disadvantageous alternatives, especially those that result in larger losses. Human participants confirmed the agents’ performance, particularly when contrasting optimal and sub-optimal outcomes. In conclusion, we emphasize the importance of evaluating the parameters of the linear operator model across diverse clinical and community samples.
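The linear-operator learning rule described above can be sketched in a few lines. The memory-weighted update is the standard linear-operator form; the net-gain function is one plausible parameterization of "the ratio of gains to losses, weighted by the sensitivity to both" (the exact functional form used in the paper is an assumption here):

```python
def linear_operator_update(value, outcome, memory):
    """Linear-operator rule: the updated value estimate is a
    memory-weighted average of the old estimate and the new outcome."""
    return memory * value + (1.0 - memory) * outcome

def net_gain(gain, loss, w_gain, w_loss):
    """Hypothetical net outcome in [0, 1]: the sensitivity-weighted
    gain as a share of total weighted gains plus losses."""
    weighted_g = w_gain * gain
    weighted_l = w_loss * loss
    return weighted_g / (weighted_g + weighted_l)

# High memory plus high loss sensitivity damps the impact of a large win,
# mirroring the agents that came to prefer advantageous alternatives.
v = linear_operator_update(0.5, net_gain(100, 50, w_gain=0.2, w_loss=0.8), 0.9)
```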
The USDA Agricultural Research Service (ARS) recently established SCINet, which consists of a shared high-performance computing resource, Ceres, and the dedicated high-speed Internet2 network used to access Ceres. Current and potential SCINet users are using and generating very large datasets, so SCINet needs to be provisioned with adequate data storage for their active computing; it is not designed to hold data beyond active research phases. At the same time, the National Agricultural Library has been developing the Ag Data Commons, a research data catalog and repository designed for public data release and professional data curation. The Ag Data Commons needs to anticipate the size and nature of the data it will be tasked with handling. The ARS Web-enabled Databases Working Group, organized under the SCINet initiative, conducted a study to establish baseline data storage needs and practices and to make projections that could inform future infrastructure design, purchases, and policies. The working group helped develop the survey on which an internal report is based. While the report was for internal use, the survey and the resulting data may be generally useful and are being released publicly. From October 24 to November 8, 2016, we administered a 17-question survey (Appendix A) by emailing a Survey Monkey link to all ARS Research Leaders, intending to cover the data storage needs of all 1,675 SY (Category 1 and Category 4) scientists. We designed the survey to accommodate either individual researcher responses or group responses. Research Leaders could decide, based on their unit's practices or their management preferences, whether to delegate the response to a data management expert in their unit, to ask all members of their unit to respond individually, or to collate responses from their unit themselves before reporting in the survey.
Larger storage ranges cover vastly different amounts of data, so the implications could be significant depending on whether the true amount is at the lower or higher end of the range. Therefore, we requested more detail from "Big Data users," the 47 respondents who indicated they had more than 10 to 100 TB, or over 100 TB, of total current data (Q5). All other respondents are called "Small Data users." Because not all of these follow-up requests were successful, we used the actual follow-up responses to estimate likely responses for those who did not respond. We defined active data as data that would be used within the next six months; all other data were considered inactive, or archival. To calculate per-person storage needs, we used the high end of the reported range divided by 1 for an individual response, or by G, the number of individuals covered by a group response. For Big Data users we used the actual reported values or estimated likely values.

Resources in this dataset:

Resource Title: Appendix A: ARS data storage survey questions.
File Name: Appendix A.pdf
Resource Description: The full list of questions asked, with the possible responses. The survey was not administered using this PDF; the PDF was generated directly from the administered survey using the Print option under Design Survey. Asterisked questions were required. A list of Research Units and their associated codes was provided in a drop-down not shown here.
Resource Software Recommended: Adobe Acrobat, url: https://get.adobe.com/reader/

Resource Title: CSV of Responses from ARS Researcher Data Storage Survey.
File Name: Machine-readable survey response data.csv
Resource Description: CSV file that includes the raw responses from the administered survey, as downloaded unfiltered from Survey Monkey, including incomplete responses. Also includes additional classification and calculations to support analysis. Individual email addresses and IP addresses have been removed.
This is the same data as in the Excel spreadsheet (also provided).

Resource Title: Responses from ARS Researcher Data Storage Survey.
File Name: Data Storage Survey Data for public release.xlsx
Resource Description: MS Excel worksheet that includes the raw responses from the administered survey, as downloaded unfiltered from Survey Monkey, including incomplete responses. Also includes additional classification and calculations to support analysis. Individual email addresses and IP addresses have been removed.
Resource Software Recommended: Microsoft Excel, url: https://products.office.com/en-us/excel
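The per-person storage calculation described above (high end of the reported range, divided by 1 for an individual response or by G for a group response) can be sketched as follows. This is an illustrative sketch only: the function name and the example range values are assumptions for demonstration, not figures taken from the survey data.

```python
# Sketch of the per-person storage calculation described above:
# take the high end of the reported storage range (in TB) and divide
# by the number of individuals covered by the response (G = 1 for an
# individual response). Example values below are illustrative.

def per_person_tb(range_high_tb: float, group_size: int = 1) -> float:
    """High end of the reported range divided by the group size G."""
    if group_size < 1:
        raise ValueError("group size must be at least 1")
    return range_high_tb / group_size

# A group response covering 4 scientists with a range topping out at
# 100 TB is treated as 100 / 4 = 25 TB per person; an individual
# response uses the high end of the range directly.
print(per_person_tb(100, 4))  # 25.0
print(per_person_tb(10))      # 10.0
```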