22 datasets found

Randomized Hourly Load Data for use with Taxonomy Distribution Feeders
data.wu.ac.at
datadiscoverystudio.org
application/unknown
Updated Aug 29, 2017
Cite
Department of Energy (2017). Randomized Hourly Load Data for use with Taxonomy Distribution Feeders [Dataset]. https://data.wu.ac.at/schema/data_gov/NWYwYmFmYTItOWRkMC00OWM0LTk3OGYtZDcyYzZiOWY5N2Ez
Dataset provided by
Department of Energy
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
This dataset was developed by NREL's distributed energy systems integration group as part of a study on high penetrations of distributed solar PV [1]. It consists of hourly load data in CSV format for use with the PNNL taxonomy of distribution feeders [2]. These feeders were developed in the open source GridLAB-D modelling language [3]. In this dataset each of the load points in the taxonomy feeders is populated with hourly averaged load data from a utility in the feeder’s geographical region, scaled and randomized to emulate real load profiles. For more information on the scaling and randomization process, see [1].

The taxonomy feeders are statistically representative of the various types of distribution feeders found in five geographical regions of the U.S. Efforts are underway (possibly complete) to translate these feeders into the OpenDSS modelling language.

This data set consists of one large CSV file for each feeder. Within each CSV, each column represents one load bus on the feeder. The header row lists the name of the load bus. The subsequent 8760 rows represent the loads for each hour of the year. The loads were scaled and randomized using a Python script, so each load series represents only one of many possible randomizations. In the header row, "rl" = residential load and "cl" = commercial load. Commercial loads are followed by a phase letter (A, B, or C). For regions 1-3, the data is from 2009. For regions 4-5, the data is from 2000.

For use in GridLAB-D, each column must be separated into its own CSV file without a header. The datetime values go in the first column and the load values in the second, as shown in the sample file, sample_individual_load_file.csv. Only the first value in the time column needs to be written as an absolute time; subsequent times may be written in relative format (e.g. "+1h", as in the sample). The load should be written in P+Qj format, as seen in the sample CSV, in units of watts (W) and volt-amperes reactive (VAr). This dataset was derived from metered load data and hence includes only real power; reactive power can be generated by assuming an appropriate power factor. These loads were used with GridLAB-D version 2.2.
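The per-load conversion described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the script used to build the dataset: the bus names `rl_1` and `cl_2A`, the timestamp string, the two-decimal formatting, and the 0.95 lagging power factor are all hypothetical, and the exact layout should be checked against sample_individual_load_file.csv.

```python
import csv
import io
import math


def to_complex_load(p_watts, power_factor=0.95):
    """Return a 'P+Qj' string, deriving Q from an assumed lagging power factor."""
    q_var = p_watts * math.tan(math.acos(power_factor))
    return f"{p_watts:.2f}+{q_var:.2f}j"


def split_feeder_csv(feeder_csv_text, start_time, power_factor=0.95):
    """Map each load-bus name to the rows of its own headerless two-column file.

    The first row carries the absolute start time; later rows use '+1h'.
    """
    reader = csv.reader(io.StringIO(feeder_csv_text))
    header = next(reader)  # one load-bus name per column
    files = {name: [] for name in header}
    for row_index, row in enumerate(reader):
        timestamp = start_time if row_index == 0 else "+1h"
        for name, value in zip(header, row):
            load = to_complex_load(float(value), power_factor)
            files[name].append((timestamp, load))
    return files


# Tiny illustrative input: two hypothetical load buses, three hours of data.
sample = "rl_1,cl_2A\n1000,2000\n1100,2100\n900,1900\n"
per_load = split_feeder_csv(sample, start_time="2009-01-01 00:00:00")
for timestamp, load in per_load["rl_1"]:
    print(f"{timestamp},{load}")
```

For a real feeder file the same function would be fed the file contents, and each entry of the returned dictionary written out as its own CSV. A single power factor for every load is a simplification; residential and commercial loads may warrant different values.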

Browse files in this dataset, accessible as individual files and as a single ZIP file. This dataset is approximately 242MB compressed or 475MB uncompressed.

For questions about this dataset, contact andy.hoke@nrel.gov.

If you find this dataset useful, please mention NREL and cite [1] in your work.

References:

[1] A. Hoke, R. Butler, J. Hambrick, and B. Kroposki, “Steady-State Analysis of Maximum Photovoltaic Penetration Levels on Typical Distribution Feeders,” IEEE Transactions on Sustainable Energy, April 2013, available at http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6357275 .

[2] K. Schneider, D. P. Chassin, R. Pratt, D. Engel, and S. Thompson, “Modern Grid Initiative Distribution Taxonomy Final Report”, PNNL, Nov. 2008. Accessed April 27, 2012: http://www.gridlabd.org/models/feeders/taxonomy of prototypical feeders.pdf

[3] K. Schneider, D. Chassin, Y. Pratt, and J. C. Fuller, “Distribution power flow for smart grid technologies”, IEEE/PES Power Systems Conference and Exposition, Seattle, WA, Mar. 2009, pp. 1-7, 15-18.
Large Datasets in R - Plant Phenology & Temperature Data from NEON
qubeshub.org
Updated May 10, 2018
Cite
Megan Jones Patterson; Lee Stanish; Natalie Robinson; Katherine Jones; Cody Flagg (2018). Large Datasets in R - Plant Phenology & Temperature Data from NEON [Dataset]. http://doi.org/10.25334/Q4DQ3F
Unique identifier
https://doi.org/10.25334/Q4DQ3F
Dataset provided by
QUBES
Authors
Megan Jones Patterson; Lee Stanish; Natalie Robinson; Katherine Jones; Cody Flagg
Description
This module series covers how to import, manipulate, format, and plot time series data stored in .csv format in R. It was originally designed to teach researchers to use NEON plant phenology and air temperature data, and it has also been used in undergraduate classrooms.
Petre_Slide_CategoricalScatterplotFigShare.pptx
figshare.com
pptx
Updated Sep 19, 2016
Cite
Benj Petre; Aurore Coince; Sophien Kamoun (2016). Petre_Slide_CategoricalScatterplotFigShare.pptx [Dataset]. http://doi.org/10.6084/m9.figshare.3840102.v1
Unique identifier
https://doi.org/10.6084/m9.figshare.3840102.v1
Dataset provided by
figshare
Authors
Benj Petre; Aurore Coince; Sophien Kamoun
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Categorical scatterplots with R for biologists: a step-by-step guide

Benjamin Petre1, Aurore Coince2, Sophien Kamoun1

1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK

Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.

Protocol

• Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the PowerPoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed are indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import in R.

• Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the PowerPoint slide and paste it in the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.

• Step 3: save the graph as a .pdf file. Shape the window at your convenience and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.

Notes

• Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.

• Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.

# 7 Display the graph in a separate window. Dot colors indicate replicates

graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()

References

Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.

Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035

Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128

https://cran.r-project.org/

http://ggplot2.org/
96 wells fluorescence reading and R code statistic for analysis
zenodo.org
bin, csv, doc, pdf
Updated Aug 2, 2024
Cite
JVD Molino; JVD Molino (2024). 96 wells fluorescence reading and R code statistic for analysis [Dataset]. http://doi.org/10.5281/zenodo.1119285
Unique identifier
https://doi.org/10.5281/zenodo.1119285
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
JVD Molino; JVD Molino
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Overview

Data points present in this dataset were obtained through the following steps. To assess the secretion efficiency of the constructs, 96 colonies from the selection plates were evaluated using the workflow presented in Figure Workflow. We picked transformed colonies and cultured them in 400 μL TAP medium for 7 days in deep-well plates (Corning Axygen®, No.: PDW500CS, Thermo Fisher Scientific Inc., Waltham, MA), covered with Breathe-Easy® (Sigma-Aldrich®). Cultivation was performed on a rotary shaker set to 150 rpm under constant illumination (50 μmol photons/m²s). Then 100 μL samples were transferred to a clear-bottom 96-well plate (Corning Costar, Tewksbury, MA, USA), and fluorescence was measured using an Infinite® M200 PRO plate reader (Tecan, Männedorf, Switzerland). Fluorescence was measured at excitation 575/9 nm and emission 608/20 nm. Supernatant samples were obtained by spinning the deep-well plates at 3000 × g for 10 min and transferring 100 μL from each well to a clear-bottom 96-well plate (Corning Costar, Tewksbury, MA, USA), followed by fluorescence measurement. To compare the constructs, R version 3.3.3 was used to perform one-way ANOVA (with Tukey's test); the significance level for hypothesis tests was set at 0.05. Graphs were generated in RStudio v1.0.136. The code is deposited herein.
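To make the comparison step concrete, the sketch below shows how the one-way ANOVA F statistic is formed from between-group and within-group variation. It is written in Python for illustration only; the deposited analysis scripts are in R, the fluorescence values here are made up, and the p-value and Tukey post hoc test are deliberately left out.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F statistic, between-group df, within-group df)."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual values around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical readings for three constructs, three colonies each.
constructs = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_anova_f(constructs)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

The resulting F value would then be compared against the F distribution with the two degrees of freedom to decide significance at the 0.05 level.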

Info

ANOVA_Turkey_Sub.R -> code for ANOVA analysis in R statistic 3.3.3

barplot_R.R -> code to generate bar plot in R statistic 3.3.3

boxplotv2.R -> code to generate boxplot in R statistic 3.3.3

pRFU_+_bk.csv -> relative supernatant mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cell of Chlamydomonas reinhardtii

sup_+_bl.csv -> supernatant mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cell of Chlamydomonas reinhardtii

sup_raw.csv -> supernatant mCherry fluorescence dataset of 96 colonies for each construct.

who_+_bl2.csv -> whole culture mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cell of Chlamydomonas reinhardtii

who_raw.csv -> whole culture mCherry fluorescence dataset of 96 colonies for each construct.

who_+_Chlo.csv -> whole culture chlorophyll fluorescence dataset of 96 colonies for each construct.

Anova_Output_Summary_Guide.pdf -> explains the content of the ANOVA files

ANOVA_pRFU_+_bk.doc -> ANOVA of relative supernatant mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cell of Chlamydomonas reinhardtii

ANOVA_sup_+_bk.doc -> ANOVA of supernatant mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cell of Chlamydomonas reinhardtii

ANOVA_who_+_bk.doc -> ANOVA of whole culture mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cell of Chlamydomonas reinhardtii

ANOVA_Chlo.doc -> ANOVA of whole culture chlorophyll fluorescence of all constructs, plus average and standard deviation values.

Consider citing our work.

Molino JVD, de Carvalho JCM, Mayfield SP (2018) Comparison of secretory signal peptides for heterologous protein expression in microalgae: Expanding the secretion portfolio for Chlamydomonas reinhardtii. PLoS ONE 13(2): e0192433. https://doi.org/10.1371/journal.pone.0192433
titanic5 Dataset Dataset
paperswithcode.com
Cite
titanic5 Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/titanic5-dataset
Description
titanic5 Dataset Created by David Beltran del Rio March 2016.

Notes This is the final (for now) version of my update to the Titanic data. I think it’s finally ready for publishing if you’d like. What I did was to strip all the passenger and crew data from the Encyclopedia Titanica (ET) web pages (excluding channel crossing passengers), create a unique ID for each passenger and crew member (Name_ID), then (painstakingly and hopefully 100% correctly) match to your earlier titanic3 dataset, in order to compare the two and to get your sibsp and parch variables. Since the ET is updated occasionally the work put into the ID and matching can be reused and refined later. I did eventually hear back from the ET people, they are willing to make the underlying database available in the future, I have not yet taken them up on it.

The two datasets line up nicely. Most of the differences in the newer titanic5 dataset are in the age variable, as I had mentioned before: the new set has fewer missing ages, 51 missing (vs 263) out of 1309.

I am in the process of refining my analysis of the data as well, based on your comments below and your Regression Modeling Strategies example.

titanic3_wID data can be matched to titanic5 using the Name_ID variable. Tab titanic5 Metadata has the variable descriptions and allowable values for Class and Class/Dept.
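The matching on Name_ID described above amounts to a keyed join. The sketch below illustrates it in plain Python; the variable names Name_ID, Age_F, sibsp, and parch come from the description, while the records themselves and the helper name `merge_on_name_id` are hypothetical.

```python
def merge_on_name_id(titanic5_rows, titanic3_rows):
    """Left-join titanic5 rows with sibsp/parch from titanic3_wID via Name_ID."""
    lookup = {row["Name_ID"]: row for row in titanic3_rows}
    merged = []
    for row in titanic5_rows:
        match = lookup.get(row["Name_ID"], {})
        # Rows without a titanic3 match keep None for the borrowed variables.
        merged.append({**row, "sibsp": match.get("sibsp"), "parch": match.get("parch")})
    return merged


# Made-up records for illustration only.
titanic5 = [{"Name_ID": 1, "Age_F": 29.0}, {"Name_ID": 2, "Age_F": None}]
titanic3_wid = [{"Name_ID": 1, "sibsp": 0, "parch": 0}]

merged = merge_on_name_id(titanic5, titanic3_wid)
missing_ages = sum(1 for row in merged if row["Age_F"] is None)
print(merged)
print("missing ages:", missing_ages)
```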

A note about the ages - instead of using the add 0.5 trick to indicate estimated birth day / date I have a flag that indicates how the “final” age (Age_F) was arrived at. It’s the Age_F_Code variable - the allowable values are in the Titanic5_metadata tab in the attached excel. The reason for this is that I already had some fractional ages for infants where I had age in months instead of years and I wanted to avoid confusion for 6 month old infants, although I don’t think there are any in the data! Also, I was thinking to make fractional ages or age in days for all passengers for whom I have DoB, but I have not yet done so.

Here’s what the tabs are:

Titanic5_all - all (mostly cleaned) Titanic passenger and crew records
Titanic5_work - working dataset, crew removed, unnecessary variables removed - this is the one I import into SAS / R to work on
Titanic5_metadata - Variable descriptions and allowable values
titanic3_wID - Original Titanic3 dataset with Name_ID added for merging to Titanic5

I have a csv, R dataset, and SAS dataset, but the variable names are an older version, so I won’t send those along for now to avoid confusion.

If it helps send my contact info along to your student in case any questions arise. Gmail address probably best, on weekends for sure: davebdr@gmail.com

The tabs in titanic5.xls are

Titanic5_all
Titanic5_passenger (the one to be used for analysis)
Titanic5_metadata (used during analysis file creation)
Titanic3_wID
R-LOADEST files to produce results in the Heart River Basin, North Dakota,...
catalog.data.gov
data.usgs.gov
Updated Jul 6, 2024
Cite
U.S. Geological Survey (2024). R-LOADEST files to produce results in the Heart River Basin, North Dakota, 1970-2020 [Dataset]. https://catalog.data.gov/dataset/r-loadest-files-to-produce-results-in-the-heart-river-basin-north-dakota-1970-2020
Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
Area covered
North Dakota, Heart River
Description
This child page contains a zipped folder which contains all of the items necessary to run load estimation using R-LOADEST to produce results that are published in U.S. Geological Survey Investigations Report 2021-XXXX [Tatge, W.S., Nustad, R.A., and Galloway, J.M., 2021, Evaluation of Salinity and Nutrient Conditions in the Heart River Basin, North Dakota, 1970-2020: U.S. Geological Survey Scientific Investigations Report 2021-XXXX, XX p]. The folder contains an allsiteinfo.table.csv file, a "datain" folder, and a "scripts" folder.

The allsiteinfo.table.csv file can be used to cross reference the sites with the main report (Tatge and others, 2021). The "datain" folder contains all the input data necessary to reproduce the load estimation results. The naming convention in the "datain" folder is site_MI_rloadest or site_NUT_rloadest for either the major ion loads or the nutrient loads. The .Rdata files are used in the scripts to run the estimations and the .csv files can be used to look at the data. The "scripts" folder contains the written R scripts to produce the results of the load estimation from the main report.

R-LOADEST is a software package for analyzing loads in streams and an accompanying report (Runkel and others, 2004) serves as the formal documentation for R-LOADEST. The package is a collection of functions written in R (R Development Core Team, 2019), an open source language and a general environment for statistical computing and graphics.

The following system requirements are necessary for producing results:

Windows 10 operating system
R (version 3.4 or later; 64-bit recommended)
RStudio (version 1.1.456 or later)
R-LOADEST program (available at https://github.com/USGS-R/rloadest)

References:

Runkel, R.L., Crawford, C.G., and Cohn, T.A., 2004, Load Estimator (LOADEST): A FORTRAN Program for Estimating Constituent Loads in Streams and Rivers: U.S. Geological Survey Techniques and Methods Book 4, Chapter A5, 69 p. [Also available at https://pubs.usgs.gov/tm/2005/tm4A5/pdf/508final.pdf.]

R Development Core Team, 2019, R: A language and environment for statistical computing: Vienna, Austria, R Foundation for Statistical Computing, accessed December 7, 2020, at https://www.r-project.org.
case study 1 bike share
kaggle.com
Updated Oct 8, 2022
Cite
mohamed osama (2022). case study 1 bike share [Dataset]. https://www.kaggle.com/ososmm/case-study-1-bike-share/discussion
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
mohamed osama
Description
Cyclistic: Google Data Analytics Capstone Project

Cyclistic - Google Data Analytics Certification Capstone Project
Moirangthem Arup Singh

How Does a Bike-Share Navigate Speedy Success?

Background: This project is for the Google Data Analytics Certification capstone project. I am wearing the hat of a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. Cyclistic is a bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, my team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, my team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve the recommendations, so they must be backed up with compelling data insights and professional data visualizations.

This project will be completed by using the 6 Data Analytics stages:

Ask: Identify the business task and determine the key stakeholders.
Prepare: Collect the data, identify how it’s organized, determine the credibility of the data.
Process: Select the tool for data cleaning, check for errors and document the cleaning process.
Analyze: Organize and format the data, aggregate the data so that it’s useful, perform calculations and identify trends and relationships.
Share: Use design thinking principles and a data-driven storytelling approach, present the findings with effective visualization. Ensure the analysis has answered the business task.
Act: Share the final conclusion and the recommendations.

Ask:

Business Task: Recommend marketing strategies aimed at converting casual riders into annual members by better understanding how annual members and casual riders use Cyclistic bikes differently.

Stakeholders:
Lily Moreno: The director of marketing and my manager.
Cyclistic executive team: A detail-oriented executive team who will decide whether to approve the recommended marketing program.
Cyclistic marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Cyclistic’s marketing strategy.

Prepare: For this project, I will use the public data of Cyclistic’s historical trip data to analyze and identify trends. The data has been made available by Motivate International Inc. under the license. I downloaded the ZIP files containing the csv files from the above link, but while uploading the files to Kaggle (as I am using a Kaggle notebook), it gave me a warning that the dataset is already available in Kaggle. So I will be using the cyclictic-bike-share dataset from Kaggle. The dataset has 13 csv files from April 2020 to April 2021. For the purpose of my analysis I will use the csv files from April 2020 to March 2021. The source csv files are in Kaggle, so I can rely on their integrity. I am using Microsoft Excel to get a glimpse of the data. There is one csv file for each month, and each has information about the bike rides, including the ride id, rideable type, start and end time, start and end station, and latitude and longitude of the start and end stations.

Process: I will use R as the language in Kaggle to import the dataset and check how it’s organized, whether all the columns have an appropriate data type, whether there are outliers, and whether any of the data have sampling bias.

I will be using the R libraries below.

Load the tidyverse, lubridate, ggplot2, sqldf and psych libraries

library(tidyverse)
library(lubridate)
library(ggplot2)
library(plotrix)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.5 ✔ purrr 0.3.4
✔ tibble 3.1.4 ✔ dplyr 1.0.7
✔ tidyr 1.1.3 ✔ stringr 1.4.0
✔ readr 2.0.1 ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()

Attaching package: ‘lubridate’

The following objects are masked from ‘package:base’:

date, intersect, setdiff, union

Set the working directory

setwd("/kaggle/input/cyclistic-bike-share")

Import the csv files

r_202004 <- read.csv("202004-divvy-tripdata.csv")
r_202005 <- read.csv("20...
Movie Rationales (Rationales For Movie Reviews)
opendatabay.com
Updated Jun 26, 2025
Cite
Datasimple (2025). Movie Rationales (Rationales For Movie Reviews) [Dataset]. https://www.opendatabay.com/data/ai-ml/056ebe3b-4213-4643-b69d-3933e0cfa443
Dataset authored and provided by
Datasimple
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Area covered
Entertainment & Media Consumption
Description
This dataset was created to allow researchers to gain an in-depth understanding of the inner workings of human-generated movie reviews. With these train, test, and validation sets, researchers can explore different aspects of movie reviews, such as sentiment labels or the rationales behind them. Analyzing this information for patterns and correlations can yield ideas for models that capture the importance of unique human perspectives when interpreting movie reviews. Any data scientist or researcher interested in AI applications is encouraged to take advantage of this dataset, which may provide useful insights into user intent when reviewing movies.

How to use the dataset

This dataset is intended to enable researchers and developers to uncover the rationales behind movie reviews. To use it effectively, you must understand the data format and how each column in the dataset works.

What does each column mean?

review: The text of the movie review. (String)

label: The sentiment label of the review (Positive, Negative, or Neutral). (String)

validation.csv: The validation set which contains reviews, labels, and evidence which can be used to validate models developed for understanding human perspective on movie reviews.

train.csv: The train set which contains reviews, labels as well as evidence used for training a model based on human annotations of movie reviews.

test.csv: The test set, which contains reviews, labels, and evidence that can be used to evaluate models on unseen data related to understanding human perspectives on movie reviews.

How do I use this dataset? To get started with this dataset, you need a working environment such as Python or R with access to the libraries needed for natural language processing (NLP). After setting up an environment with libraries that support NLP tasks, follow these steps:

Import the csv files into your workspace using the functions provided by your language's libraries, e.g., for Python use the pandas read_csv() function.

Preprocess the text data in the 'review' and 'label' columns by standardizing it, for example by removing stopwords from sentences and converting words to lowercase.

Train and test ML algorithms using feature extraction techniques suited to NLP (Bag of Words, TF-IDF, and Word2Vec are some examples; many more are available).

Measure performance on the provided validation and test sets. Precision-recall curves are also included, along with well-known metrics such as F1 score and accuracy, so you can analyze hyperparameter tuning and algorithm efficiency from the output values you get while testing your ML algorithm.

Recommendation systems are always fun! Build a simple machine learning recommendation system by collecting user visit logs and hand-writing new features.
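The import, preprocessing, and feature-extraction steps above can be sketched with the Python standard library alone. This is only an illustration: the two reviews are invented, the stopword list is a tiny hypothetical stand-in for a real one, and the integer labels follow the Positive (1) / Negative (-1) / Neutral (0) coding given in the column tables.

```python
import csv
import io
import re
from collections import Counter

# Tiny illustrative stopword list; a real project would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "is", "was", "it", "this", "of"}


def preprocess(text):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def bag_of_words(tokens):
    """Count how often each token occurs (a minimal Bag of Words feature)."""
    return Counter(tokens)


# Invented stand-in for train.csv with the 'review' and 'label' columns.
sample_csv = (
    "review,label\n"
    '"The plot was great and the acting is great",1\n'
    '"This movie was a bore",-1\n'
)
rows = list(csv.DictReader(io.StringIO(sample_csv)))
features = [bag_of_words(preprocess(r["review"])) for r in rows]
labels = [int(r["label"]) for r in rows]
print(features[0], labels[0])
```

The resulting counts and labels could then be passed to any classifier; TF-IDF or word embeddings would replace `bag_of_words` without changing the surrounding steps.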

Research Ideas

Developing an automated movie review summarizer based on user ratings that can accurately capture the salient points of a review and summarize it for moviegoers.

Training a model to predict the sentiment of a review, by combining machine learning models with human-annotated rationales from this dataset.

Building an AI system that can detect linguistic markers of deception in reviews (e.g., 'fake news', thin reviews etc.) and issue warnings on possible fraudulent purchases or online reviews.

Columns

File: validation.csv

review: Text from the movie review. (String)
label: Indicates whether a particular review’s sentiment can be classified as Positive (1), Negative (-1) or Neutral (0). (Integer)

File: train.csv

review: Text from the movie review. (String)
label: Indicates whether a particular review’s sentiment can be classified as Positive (1), Negative (-1) or Neutral (0). (Integer)

File: test.csv

review: Text from the movie review. (String)
label: Indicates whether a particular review’s sentiment can be classified as Positive (1), Negative (-1) or Neutral (0). (Integer)

Acknowledgements

If you use this dataset in your research, please credit the original authors. If you use this dataset in your research, please credit Huggingface Hub.

License

CC0

Original Data Source: Movie Rationales (Rationales For Movie Reviews)
Market Basket Analysis
kaggle.com
Updated Dec 9, 2021
Cite
Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Aslan Ahmedov
Description
Market Basket Analysis

Market basket analysis with Apriori algorithm

The retailer wants to target customers with suggestions for the itemsets they are most likely to purchase. The dataset I was given contains a retailer's transaction data, covering all the transactions that happened over a period of time. The retailer will use the results to grow its business and to offer customers itemset suggestions, which should increase customer engagement, improve the customer experience, and help identify customer behavior. I will solve this problem using association rules, a type of unsupervised learning technique that checks for the dependency of one data item on another data item.

Introduction

Association rules are most often used when you want to find associations between different objects in a set, such as frequent patterns in a transaction database. They can tell you which items customers frequently buy together, and they allow the retailer to identify relationships between the items.

An Example of Association Rules

Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat":

- support = P(mouse & mat) = 8/100 = 0.08
- confidence = support/P(mouse) = 0.08/0.10 = 0.80
- lift = confidence/P(mat) = 0.80/0.09 = 8.9

This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
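The three measures can be checked with a few lines of Python. The sketch uses the standard definitions for a rule A => B: support = P(A and B), confidence = P(A and B) / P(A), and lift = confidence / P(B). The function name `rule_metrics` is hypothetical; the counts are the ones from the example (100 customers, 10 mouse buyers, 9 mat buyers, 8 who bought both).

```python
def rule_metrics(n_total, n_antecedent, n_consequent, n_both):
    """Return (support, confidence, lift) for the rule antecedent => consequent."""
    support = n_both / n_total
    confidence = n_both / n_antecedent          # P(A and B) / P(A)
    lift = confidence / (n_consequent / n_total)  # confidence / P(B)
    return support, confidence, lift


# Rule: bought computer mouse => bought mouse mat.
support, confidence, lift = rule_metrics(
    n_total=100, n_antecedent=10, n_consequent=9, n_both=8
)
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift well above 1 indicates that the two items are bought together far more often than would be expected if the purchases were independent.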

Strategy

Data Import

Data Understanding and Exploration

Transformation of the data so that it is ready to be consumed by the association rules algorithm

Running association rules

Exploring the rules generated

Filtering the generated rules

Visualization of Rule
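The transformation step in the strategy above can be illustrated with a short Python sketch (the analysis itself is carried out in R with the arules package). It groups item rows into one basket per bill and counts how often item pairs occur together, which is only the first building block of a frequent-itemset search, not the Apriori algorithm itself. The column names BillNo and Itemname come from the dataset description; the rows and helper names are made up.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_baskets(rows):
    """Group item names by bill number, giving one set of items per transaction."""
    baskets = defaultdict(set)
    for row in rows:
        baskets[row["BillNo"]].add(row["Itemname"])
    return baskets


def pair_counts(baskets):
    """Count the number of baskets in which each pair of items appears together."""
    counts = Counter()
    for items in baskets.values():
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts


# Made-up rows in the long format of the retail data (one row per item per bill).
rows = [
    {"BillNo": "536365", "Itemname": "MUG"},
    {"BillNo": "536365", "Itemname": "PLATE"},
    {"BillNo": "536366", "Itemname": "MUG"},
    {"BillNo": "536366", "Itemname": "PLATE"},
    {"BillNo": "536366", "Itemname": "SPOON"},
]
baskets = build_baskets(rows)
counts = pair_counts(baskets)
print(counts.most_common(1))
```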

Dataset Description

File name: Assignment-1_Data

List name: retaildata

File format: .xlsx

Number of Row: 522065

Number of Attributes: 7

BillNo: 6-digit number assigned to each transaction. Nominal.

Itemname: Product name. Nominal.

Quantity: The quantities of each product per transaction. Numeric.

Date: The day and time when each transaction was generated. Numeric.

Price: Product price. Numeric.

CustomerID: 5-digit number assigned to each customer. Nominal.

Country: Name of the country where each customer resides. Nominal.


Libraries in R

First, we need to load the required libraries. Each library is described briefly below.

arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).

arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.

tidyverse - The tidyverse is an opinionated collection of R packages designed for data science.

readxl - Read Excel Files in R.

plyr - Tools for Splitting, Applying and Combining Data.

ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.

knitr - Dynamic Report generation in R.

magrittr - Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.

dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.

tidyverse - This package is designed to make it easy to install and load multiple 'tidyverse' packages in a single step.


Data Pre-processing

Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.


After that, we clean the data frame by removing missing values.


To apply association rule mining, we need to convert the data frame into transaction data, so that all items that are bought together in one invoice will be in ...
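The write-up performs these steps in R with arules. As a rough equivalent, the sketch below uses pandas instead (a swapped-in tool, not the author's code) to drop rows with missing values and group item names by invoice. The column names BillNo and Itemname come from the dataset description; the sample rows are invented.

```python
import pandas as pd

# Toy stand-in for the retail data; only the two columns needed here
retaildata = pd.DataFrame({
    "BillNo": ["536365", "536365", "536366", None],
    "Itemname": ["MUG", "TEAPOT", "MUG", "LAMP"],
})

# Remove rows where the invoice number or item name is missing
clean = retaildata.dropna(subset=["BillNo", "Itemname"])

# One transaction per invoice: the list of items bought together
transactions = clean.groupby("BillNo")["Itemname"].apply(list).tolist()
print(transactions)
```

Each inner list is one basket, which is the shape most association-rule tools expect as input.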
Data from: United States wildlife and wildlife product imports from...
agdatacommons.nal.usda.gov
bin
Updated May 6, 2025
Cite
Evan A. Eskew; Allison M. White; Naom Ross; Kristine M. Smith; Katherine F. Smith; Jon Paul Rodríguez; Carlos Zambrana-Torrelio; William B. Karesh; Peter Daszak (2025). Data from: United States wildlife and wildlife product imports from 2000–2014 [Dataset]. https://agdatacommons.nal.usda.gov/articles/dataset/Data_from_United_States_wildlife_and_wildlife_product_imports_from_2000_2014/24853503
Explore at:
Available download formats: bin
Dataset updated
May 6, 2025
Dataset provided by
Scientific Data
Authors
Evan A. Eskew; Allison M. White; Naom Ross; Kristine M. Smith; Katherine F. Smith; Jon Paul Rodríguez; Carlos Zambrana-Torrelio; William B. Karesh; Peter Daszak
License
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
Description
The global wildlife trade network is a massive system that has been shown to threaten biodiversity, introduce non-native species and pathogens, and cause chronic animal welfare concerns. Despite its scale and impact, comprehensive characterization of the global wildlife trade is hampered by data that are limited in their temporal or taxonomic scope and detail. To help fill this gap, we present data on 15 years of the importation of wildlife and their derived products into the United States (2000–2014), originally collected by the United States Fish and Wildlife Service. We curated and cleaned the data and added taxonomic information to improve data usability. These data include >2 million wildlife or wildlife product shipments, representing >60 biological classes and >3.2 billion live organisms. Further, the majority of species in the dataset are not currently reported on by CITES parties. These data will be broadly useful to both scientists and policymakers seeking to better understand the volume, sources, biological composition, and potential risks of the global wildlife trade.

Resources in this dataset:

Resource Title: United States LEMIS wildlife trade data curated by EcoHealth Alliance (Version 1.1.0) - Zenodo.

File Name: Web Page, url: https://doi.org/10.5281/zenodo.3565869

Over 5.5 million USFWS LEMIS wildlife or wildlife product records spanning 15 years and 28 data fields. These records were derived from >2 million unique shipments processed by USFWS during the time period and represent >3.2 billion live organisms. We provide the final cleaned data as a single comma-separated value file. Original raw data as provided by the USFWS are also available. Although relatively large (~1 gigabyte), the cleaned data file can be imported into a software environment of choice for data analysis. Alternatively, the associated R package provides access to a release of the same cleaned dataset but with a data download and manipulation framework that is designed to work well with this large dataset. Both the Zenodo data repository and the R package contain a metadata file describing each of the data fields as well as a lookup table to retrieve full values for the abbreviated codes used throughout the dataset.

Contents:

lemis_2000_2014_cleaned.csv: This file represents the compiled, cleaned LEMIS data from 2000-2014. This data is identical to the version 1.1.0 dataset available through the lemis R package.

lemis_codes.csv: Full values for all coded values used in the LEMIS data. Identical to the output from the lemis R package function "lemis_codes()".

lemis_metadata.csv: Data fields and field descriptions for all variables in the LEMIS data. Identical to the output from the lemis R package function "lemis_metadata()".

raw_data.zip: This archive contains all of the raw LEMIS data files that are processed and cleaned with the code contained in the 'data-raw' subdirectory of the lemis R package repository.

Resource Software Recommended: R package, url: https://github.com/ecohealthalliance/lemis
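For readers who prefer to work with the ~1 gigabyte CSV directly rather than through the R package, one option is to read it in chunks. The sketch below shows the pattern with pandas on a tiny in-memory stand-in; the column names "class" and "quantity" and all values are assumptions made for illustration, so check lemis_metadata.csv for the real field names.

```python
import io
from collections import Counter

import pandas as pd

# Tiny stand-in for the large cleaned CSV (hypothetical columns and values)
sample_csv = io.StringIO("class,quantity\nAves,10\nReptilia,4\nAves,6\n")

totals = Counter()
# chunksize keeps only a few rows in memory at a time
for chunk in pd.read_csv(sample_csv, chunksize=2):
    per_class = chunk.groupby("class")["quantity"].sum()
    for name, quantity in per_class.items():
        totals[name] += int(quantity)

print(dict(totals))
```

With the real file, the StringIO object would be replaced by the file path and the chunk size raised to something like a few hundred thousand rows.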
Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic...
zenodo.org
data.niaid.nih.gov
bin, csv, zip
Updated Dec 24, 2022
Cite
Alexander R. Hartloper; Selimcan Ozden; Albano de Castro e Sousa; Dimitrios G. Lignos (2022). Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials [Dataset]. http://doi.org/10.5281/zenodo.6965147
Explore at:
Available download formats: bin, zip, csv
Unique identifier
https://doi.org/10.5281/zenodo.6965147
Dataset updated
Dec 24, 2022
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Alexander R. Hartloper; Selimcan Ozden; Albano de Castro e Sousa; Dimitrios G. Lignos
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials

Background

This dataset contains data from monotonic and cyclic loading experiments on structural metallic materials. The materials are primarily structural steels and one iron-based shape memory alloy is also included. Summary files are included that provide an overview of the database and data from the individual experiments is also included.

The files included in the database are outlined below and the format of the files is briefly described. Additional information regarding the formatting can be found through the post-processing library (https://github.com/ahartloper/rlmtp/tree/master/protocols).

Usage

The data is licensed under the Creative Commons Attribution 4.0 International license.

If you have used our data and are publishing your work, we ask that you please reference both:

this database through its DOI, and

any publication that is associated with the experiments. See the Overall_Summary and Database_References files for the associated publication references.

Included Files

Overall_Summary_2022-08-25_v1-0-0.csv: summarises the specimen information for all experiments in the database.

Summarized_Mechanical_Props_Campaign_2022-08-25_v1-0-0.csv: summarises the average initial yield stress and average initial elastic modulus per campaign.

Unreduced_Data-#_v1-0-0.zip: contains the original (not downsampled) data

Where # is one of: 1, 2, 3, 4, 5, 6. The unreduced data is broken into separate archives because of upload limitations to Zenodo. Together they provide all the experimental data.

We recommend that you unzip all the folders and place them in one "Unreduced_Data" directory, similar to the "Clean_Data" directory.

The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.

There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the unreduced data.

The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.

Clean_Data_v1-0-0.zip: contains all the downsampled data

The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.

There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the clean data.

The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.

Database_References_v1-0-0.bib

Contains a bibtex reference for many of the experiments in the database. Corresponds to the "citekey" entry in the summary files.

File Format: Downsampled Data

These are the "LP_

The header of the first column is empty: the first column corresponds to the index of the sample point in the original (unreduced) data

Time[s]: time in seconds since the start of the test

e_true: true strain

Sigma_true: true stress in MPa

(optional) Temperature[C]: the surface temperature in degC

These data files can be easily loaded using the pandas library in Python through:

import pandas
data = pandas.read_csv(data_file, index_col=0)

The data is formatted so it can be used directly in RESSPyLab (https://github.com/AlbanoCastroSousa/RESSPyLab). Note that the column names "e_true" and "Sigma_true" were kept for backwards compatibility reasons with RESSPyLab.
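As a small illustration of the column layout described above, the sketch below reads an in-memory CSV whose first column has an empty header and is used as the index. The sample values are invented and do not come from the database.

```python
import io

import pandas

# Invented sample in the downsampled layout: empty first header, then the named columns
sample_csv = io.StringIO(
    ",Time[s],e_true,Sigma_true\n"
    "0,0.0,0.0,0.0\n"
    "25,1.5,0.001,205.0\n"
    "50,3.0,0.002,355.0\n"
)

# index_col=0 turns the unnamed first column into the row index
data = pandas.read_csv(sample_csv, index_col=0)

peak_stress = data["Sigma_true"].max()
print(data.index.tolist(), peak_stress)
```

The row index then holds the position of each sample in the original (unreduced) record, which makes it possible to look the same point up in the unreduced file.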

File Format: Unreduced Data

These are the "LP_

The first column is the index of each data point

S/No: sample number recorded by the DAQ

System Date: Date and time of sample

Time[s]: time in seconds since the start of the test

C_1_Force[kN]: load cell force

C_1_Déform1[mm]: extensometer displacement

C_1_Déplacement[mm]: cross-head displacement

Eng_Stress[MPa]: engineering stress

Eng_Strain[]: engineering strain

e_true: true strain

Sigma_true: true stress in MPa

(optional) Temperature[C]: specimen surface temperature in degC

The data can be loaded and used similarly to the downsampled data.
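The unreduced files carry both engineering and true quantities. For orientation, the sketch below applies the textbook conversion, which holds while deformation is uniform (before necking): true strain is ln(1 + engineering strain) and true stress is engineering stress times (1 + engineering strain). Whether the database columns were produced with exactly this conversion is not stated here, so treat it as an assumption; the input numbers are invented.

```python
import numpy as np


def engineering_to_true(eng_stress_mpa, eng_strain):
    """Textbook conversion, valid for uniform deformation only."""
    eng_stress = np.asarray(eng_stress_mpa, dtype=float)
    eng_strain = np.asarray(eng_strain, dtype=float)
    true_strain = np.log1p(eng_strain)            # ln(1 + e)
    true_stress = eng_stress * (1.0 + eng_strain)  # s * (1 + e)
    return true_stress, true_strain


true_stress, true_strain = engineering_to_true([0.0, 355.0, 400.0], [0.0, 0.002, 0.05])
print(true_stress, true_strain)
```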

File Format: Overall_Summary

The overall summary file provides data on all the test specimens in the database. The columns include:

hidden_index: internal reference ID

grade: material grade

spec: specifications for the material

source: base material for the test specimen

id: internal name for the specimen

lp: load protocol

size: type of specimen (M8, M12, M20)

gage_length_mm_: unreduced section length in mm

avg_reduced_dia_mm_: average measured diameter for the reduced section in mm

avg_fractured_dia_top_mm_: average measured diameter of the top fracture surface in mm

avg_fractured_dia_bot_mm_: average measured diameter of the bottom fracture surface in mm

fy_n_mpa_: nominal yield stress

fu_n_mpa_: nominal ultimate stress

t_a_deg_c_: ambient temperature in degC

date: date of test

investigator: person(s) who conducted the test

location: laboratory where test was conducted

machine: setup used to conduct test

pid_force_k_p, pid_force_t_i, pid_force_t_d: PID parameters for force control

pid_disp_k_p, pid_disp_t_i, pid_disp_t_d: PID parameters for displacement control

pid_extenso_k_p, pid_extenso_t_i, pid_extenso_t_d: PID parameters for extensometer control

citekey: reference corresponding to the Database_References.bib file

yield_stress_mpa_: computed yield stress in MPa

elastic_modulus_mpa_: computed elastic modulus in MPa

fracture_strain: computed average true strain across the fracture surface

c,si,mn,p,s,n,cu,mo,ni,cr,v,nb,ti,al,b,zr,sn,ca,h,fe: chemical compositions in units of %mass

file: file name of corresponding clean (downsampled) stress-strain data

File Format: Summarized_Mechanical_Props_Campaign

Meant to be loaded in Python as a pandas DataFrame with multi-indexing, e.g.,

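import pandas as pd
# date and version must match the file name, e.g. date = '2022-08-25', version = '_v1-0-0'
tab1 = pd.read_csv('Summarized_Mechanical_Props_Campaign_' + date + version + '.csv', index_col=[0, 1, 2, 3], skipinitialspace=True, header=[0, 1], keep_default_na=False, na_values='')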

citekey: reference in "Campaign_References.bib".

Grade: material grade.

Spec.: specifications (e.g., J2+N).

Yield Stress [MPa]: initial yield stress in MPa

size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign

Elastic Modulus [MPa]: initial elastic modulus in MPa

size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign
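To show what the multi-indexed layout looks like once loaded, the sketch below builds a small table with four index levels and two header rows, writes it to an in-memory CSV, and reads it back with the same read_csv arguments. All values and the first index level name ("campaign") are invented for illustration; the real file's index levels may be named differently.

```python
import io

import pandas as pd

# Invented rows: four index levels, as implied by index_col=[0, 1, 2, 3]
index = pd.MultiIndex.from_tuples(
    [("campaign_a", "ref_a", "S355", "J2+N"), ("campaign_b", "ref_b", "S460", "NL")],
    names=["campaign", "citekey", "Grade", "Spec."],
)
# Two header rows: property name on top, statistic underneath
columns = pd.MultiIndex.from_product(
    [["Yield Stress [MPa]", "Elastic Modulus [MPa]"], ["size", "count", "mean", "coefvar"]]
)
values = [
    [3, 3, 350.0, 0.02, 3, 3, 200000.0, 0.01],
    [4, 4, 480.0, 0.03, 4, 4, 205000.0, 0.02],
]
frame = pd.DataFrame(values, index=index, columns=columns)

# Round trip through CSV text using the arguments from the description
buffer = io.StringIO()
frame.to_csv(buffer)
buffer.seek(0)
tab1 = pd.read_csv(buffer, index_col=[0, 1, 2, 3], skipinitialspace=True,
                   header=[0, 1], keep_default_na=False, na_values='')

mean_yield = tab1[("Yield Stress [MPa]", "mean")].tolist()
print(mean_yield)
```

Selecting a column then takes a (property, statistic) pair, as in the last lines of the sketch.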

Caveats

The files in the following directories were tested before the protocol was established. Therefore, only the true stress-strain is available for each:

A500

A992_Gr50

BCP325

BCR295

HYP400

S460NL

S690QL/25mm

S355J2_Plates/S355J2_N_25mm and S355J2_N_50mm
Data files for: Huston, D.C. et al. 2021. Stable isotope signatures of an...
zenodo.org
bin, csv
Updated Sep 20, 2021
Cite
Daniel Colgan Huston (2021). Data files for: Huston, D.C. et al. 2021. Stable isotope signatures of an acanthocephalan and trematode from the herbivorous marine fish Kyphosus bigibbus (Perciformes: Kyphosidae). Journal of Parasitology. 107: 726–730 [Dataset]. http://doi.org/10.5281/zenodo.4886698
Explore at:
Available download formats: csv, bin
Unique identifier
https://doi.org/10.5281/zenodo.4886698
Dataset updated
Sep 20, 2021
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Daniel Colgan Huston
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data files for the paper: Huston, D.C. et al. 2021. Stable isotope signatures of an acanthocephalan and trematode from the herbivorous marine fish Kyphosus bigibbus (Perciformes: Kyphosidae). Journal of Parasitology. 107(5) 726–730

Includes raw data, .csv files for importing the data into R, the R script file, and the Excel spreadsheet file used to create Figure 1.
Data from: [Dataset] Stroke Caregiver Burden in East Coast Peninsular...
zenodo.org
data.niaid.nih.gov
Updated Jul 16, 2024
Cite
Mohd Azmi Bin Suliman; Kamarul Imran Musa (2024). [Dataset] Stroke Caregiver Burden in East Coast Peninsular Malaysia, A Short-term Longitudinal Study [Dataset]. http://doi.org/10.5281/zenodo.6998141
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.6998141
Dataset updated
Jul 16, 2024
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Mohd Azmi Bin Suliman; Kamarul Imran Musa
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Peninsular Malaysia, Malaysia
Description
Raw dataset for the study entitled "INFORMAL CAREGIVERS BURDEN AMONG STROKE PATIENTS IN EAST-COAST MALAYSIA: A SHORT-TERM LONGITUDINAL STUDY"

This study is part of a study funded by the Newton Ungku Omar Fund (2020-2021) under the grant for “A Scalable Solution for Supporting Informal Stroke Caregivers in Malaysia: Systematic Development and Feasibility Study”: Malaysian Ministry of Education (203.PPSP.678003) and Medical Research Council, United Kingdom (MR/T018968/1).

Please note that this data is in raw CSV form, imported from REDCap. Because of the REDCap system, the raw file needs to be relabelled and relevelled to reflect the original scores or responses.

A data dictionary is provided for relabelling and relevelling.

An R script is also available to convert the raw CSV into a dataset with appropriate labels and levels.
Data and Code for "Does Organic Farming Jeopardize Food Security of Farm...
data.niaid.nih.gov
zenodo.org
Updated Apr 30, 2024
Cite
Aïhounton, Ghislain Boris Dossou (2024). Data and Code for "Does Organic Farming Jeopardize Food Security of Farm Households in Benin?" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10899544
Explore at:
Dataset updated
Apr 30, 2024
Dataset provided by
Henningsen, Arne
Aïhounton, Ghislain Boris Dossou
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Benin
Description
This data and code archive provides all the data and code for replicating the empirical analysis that is presented in the journal article "Does Organic Farming Jeopardize Food Security of Farm Households in Benin?" authored by Ghislain B.D. Aïhounton and Arne Henningsen and published in the journal Food Policy (Volume 124, April 2024, 102622, DOI: 10.1016/j.foodpol.2024.102622).

We conducted the empirical analysis with the "R" statistical software (version 4.3.3) using the add-on packages "AER" (version 1.2.12), "DescTools" (version 0.99.54), "lmtest" (version 0.9.40), "moments" (version 0.14.1), "sandwich" (version 3.1.0), "stargazer" (version 5.2.3), and "xtable" (version 1.8.4) that are all available at CRAN.

This replication package contains the following files:

README: This file.

R/dataBenin.csv: A CSV file that contains the (unprepared) data set. The variables in this file are described in file R/Variables.csv. This CSV file is imported by R script PrepareDataFoodNutrition.R.

R/Variables.csv: A CSV file that describes the variables in the (unprepared) data set (file R/dataBenin.csv).

R/PrepareData.R: An R script that imports the (unprepared) data set (file R/dataBenin.csv), calculates additional variables and adds these variables to the data set, removes observations that should not be used in the empirical analysis, and saves the prepared data set as CSV file (R/dataFoodNutrition.csv).

R/dataPrepared.csv: A CSV file that contains the (prepared) data set used in the empirical analysis. This CSV file is created by the R script R/PrepareDataFoodNutrition.R. It is imported by the R scripts R/DescriptiveTab.R, FoodNutritionImpact.R, and GridSearchFoodSecurity.R.

R/DescriptiveTab.R: An R script that imports the prepared data set (file R/dataFoodNutrition.R) and creates Table 1 of the paper ("Descriptive statistics", file paper/tables/DescriptiveStat.tex) as LaTeX file.

R/Estimations.R: An R script that imports the prepared data set (file R/dataFoodNutrition.R), conducts all the analyses presented in the paper, creates Tables 2 and 3 of the paper ("OLS and IV regression results of the conditional associations between organic farming and outcomes" and "OLS and IV regression results of the conditional associations between organic farming and mediating outcomes", LaTeX files paper/tables/estMainReg.tex and paper/tables/estMedReg.tex), creates Figures 1 and 2 of the paper ("Estimated conditional associations of organic farming with outcomes" and "Estimated conditional associations of organic farming with mediating outcomes", 12 PDF files paper/figures/*.pdf), and 45 Tables that are included in the Supplementary Information: 36 tables with detailed regression results (LaTeX files paper/tables/tabels/est*.tex), one table with results of the first-stage probit regression (LaTeX file paper/tables/tabels/estProbit.tex), 6 tables with detailed regression results of estimations for testing the exogeneity of the instrument as suggested by Di Falco et al. (2011) (LaTeX files paper/tables/tabels/estOLS*Falco.tex), and 2 tables with coefficient bounds obtained as suggested by Oster (2019) (LaTeX files paper/tables/tabels/Oster*.tex).

R/GridSearch.R: An R script that re-runs our regression analyses with different units of measurement of IHS-transformed variables and calculates various indicators that can be used to assess the appropriateness of different units of measurement as suggested by Aihounton and Henningsen (2021) and that creates 28 Tables that are included in the Supplementary Information (LaTeX files paper/tables/tabels/grid*.tex).

R/functions/calcOsterBounds.R: An R script that defines the R function calcOsterBounds() that calculates coefficient bounds using the method suggested by Oster (2019). This function is used by the R script R/FoodNutritionImpact.R.

R/functions/calcSemiElaOrg.R: An R script that defines the R function calcSemiElaOrg() that calculates the semi-elasticity of various log-transformed or IHS-transformed variables with respect to the dummy variable for organic farming. This function is used by the R scripts R/FoodNutritionImpact.R and R/GridSearchFoodSecurity.R.

R/functions/createFormula.R: An R script that defines the R function createFormula() that creates the regression formulas for the various empirical analyses that are presented in the paper. This function is used by the R scripts R/FoodNutritionImpact.R and R/GridSearchFoodSecurity.R.

R/functions/functionsTables.R: An R script that defines various R functions that are used to create tables in LaTeX format. These functions are used by the R scripts R/FoodNutritionImpact.R and R/GridSearchFoodSecurity.R.

R/functions/predR2.R: An R script that defines the R function predR2() that calculates the predictive R-squared value. This R script has been obtained from the replication package of the article: Aïhounton, G. B. D. and Henningsen, A. (2021). Units of measurement and the inverse hyperbolic sine transformation. The Econometrics Journal, 24(2):334–351. https://doi.org/10.1093/ectj/utaa032 The function consists of a slightly modified version of the code that is available at: https://tomhopper.me/2014/05/16/can-we-do-better-than-r-squared/ This function is used by the R script R/GridSearchFoodSecurity.R.

paper/figures/*.pdf: 12 PDF files that are the (sub)figures in Figures 1 and 2 of the paper ("Estimated conditional associations of organic farming with outcomes" and "Estimated conditional associations of organic farming with mediating outcomes"). These 12 files are created by the R script R/FoodNutritionImpact.R.

paper/tables/DescriptiveStat.tex: A LaTeX file that creates Table 1 of the paper ("Descriptive statistics"). This file is created by the R script R/DescriptiveTab.R.

paper/tables/estMainReg.tex: A LaTeX file that creates Table 2 of the paper ("OLS and IV regression results of the conditional associations between organic farming and outcomes"). This file is created by the R script R/FoodNutritionImpact.R.

paper/tables/estMedReg.tex: A LaTeX file that creates Table 3 of the paper ("OLS and IV regression results of the conditional associations between organic farming and mediating outcomes"). This file is created by the R script R/FoodNutritionImpact.R.

paper/tables/tabels/est*.tex: 36 LaTeX files that create 36 tables that are included in the Supplementary Information and present detailed regression results. These 36 files are created by the R script R/FoodNutritionImpact.R.

paper/tables/tabels/estProbit.tex: A LaTeX file that creates a table that is included in the Supplementary Information and presents the results of the first-stage probit regression. This file is created by the R script R/FoodNutritionImpact.R.

paper/tables/tabels/estOLS*Falco.tex: 6 LaTeX files that create 6 tables that are included in the Supplementary Information and present detailed regression results for testing the exogeneity of the instrument as suggested by Di Falco et al. (2011). These 6 files are created by the R script R/FoodNutritionImpact.R.

paper/tables/tabels/Oster*.tex: 2 LaTeX files that create 2 tables that are included in the Supplementary Information and present coefficient bounds obtained as suggested by Oster (2019). These 2 files are created by the R script R/FoodNutritionImpact.R.

paper/tables/tabels/grid*.tex: 28 LaTeX files that create 28 tables that are included in the Supplementary Information and present various indicators for assessing the appropriateness of different units of measurement of IHS-transformed variables as suggested by Aihounton and Henningsen (2021). These 28 files are created by the R script R/GridSearchFoodSecurity.R.
Supplement 1. Example data and R code.
wiley.figshare.com
html
Updated Jun 4, 2023
Cite
Michael L. Collyer; Dean C. Adams (2023). Supplement 1. Example data and R code. [Dataset]. http://doi.org/10.6084/m9.figshare.3527483.v1
Explore at:
Available download formats: html
Unique identifier
https://doi.org/10.6084/m9.figshare.3527483.v1
Dataset updated
Jun 4, 2023
Dataset provided by
Wiley
Authors
Michael L. Collyer; Dean C. Adams
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
File List

collyer_adams_Rcode.txt -- R code for running analysis

collyer_adams_example_data.csv -- example data to input into R routine

collyer_adams_example_xmat.csv -- coding for the design matrix used

collyer_adams_ESA_supplement.zip -- all files at once

Description

The collyer_adams_Rcode.txt file contains a procedure for performing the analysis described in Appendix A, using R. The procedure imports data and a design matrix (collyer_adams_example_data.csv and collyer_adams_example_xmat.csv are provided, and correspond to the example in Appendix A). The default number of permutations is 999, but can be changed. A matrix of random values (distances, contrasts, angles) and a results summary are created from the program. Users should be aware that importing different data sets will require altering some of the R code to accommodate their data (e.g., matrix dimensions would need to be changed).
MovieLens ratings
kaggle.com
zip
Updated Apr 17, 2024
Cite
Khoa Hoàngg (2024). MovieLens ratings [Dataset]. https://www.kaggle.com/datasets/khoahongg/movielens-ratings/code
Explore at:
Available download formats: zip (0 bytes)
Dataset updated
Apr 17, 2024
Authors
Khoa Hoàngg
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Each folder contains the following files:

- train.csv: A csv file that contains the training data.
- test.csv: A csv file that contains the testing data.
- movie_to_index.pkl: A Python pickle file that contains a dictionary. The dictionary maps a movie_id to its corresponding index in the similarity matrix.
- user_to_index.pkl: A Python pickle file that contains a dictionary. The dictionary maps a user_id to an index.
- rating_matrix.npy: A npy file that contains the rating matrix in the training data: \( \text{rating\_matrix}[u, i] = r_{\text{user\_at\_index\_u},\ \text{movie\_at\_index\_i}} \)
- similarity_matrix.npy: A npy file that contains a precomputed similarity matrix between movies in the training data: \( \text{similarity\_matrix}[i, j] = \text{purecosine}(R_{\text{movie\_at\_index\_i}}, R_{\text{movie\_at\_index\_j}}) \)
- qtus.pkl: A Python pickle file that contains a dictionary.
  + Keys: Pair of user_index, movie_index (u, t).
  + Values: Indexes of movies rated by u, sorted by similarity in DESCENDING ORDER. For neighborhood_size = k: \( \text{qtus}[(u,t)][:k] = Q_t(u) \)

Loading the Similarity Matrix

import numpy as np

# Load the similarity matrix
similarity_matrix = np.load('path_to_your_folder/similarity_matrix.npy')

Loading the Dictionaries

import pickle

# Load the movie_to_index dictionary
with open('path_to_your_folder/movie_to_index.pkl', 'rb') as f:
    movie_to_index = pickle.load(f)

# Load the user_to_index dictionary
with open('path_to_your_folder/user_to_index.pkl', 'rb') as f:
    user_to_index = pickle.load(f)
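Once loaded, these pieces fit together roughly as follows. The sketch below predicts one rating as a similarity-weighted average over the k nearest rated movies, which is a common item-based neighbourhood rule; the dataset description does not say which prediction rule it is meant for, so the formula is an assumption. The arrays are tiny invented stand-ins for the real files.

```python
import numpy as np

# Invented stand-ins for rating_matrix.npy, similarity_matrix.npy and qtus.pkl
rating_matrix = np.array([[5.0, 3.0, 0.0],
                          [4.0, 0.0, 2.0]])
similarity_matrix = np.array([[1.0, 0.8, 0.1],
                              [0.8, 1.0, 0.3],
                              [0.1, 0.3, 1.0]])
# Movies rated by user 0, sorted by similarity to movie 2 in descending order
qtus = {(0, 2): [1, 0]}


def predict_rating(u, t, k):
    """Similarity-weighted average of u's ratings on the k nearest neighbours of t."""
    neighbours = qtus[(u, t)][:k]            # Q_t(u)
    weights = similarity_matrix[t, neighbours]
    ratings = rating_matrix[u, neighbours]
    return float(np.dot(weights, ratings) / np.sum(np.abs(weights)))


prediction = predict_rating(0, 2, k=2)
print(prediction)
```

Because qtus is already sorted, changing the neighbourhood size only changes the slice length and needs no re-sorting.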
Replication Data for: Lameness during the dry period: epidemiology and...
borealisdata.ca
open.library.ubc.ca
Updated Sep 9, 2019
Cite
Ruan R. Daros; Hanna K. Eriksson; Daniel M. Weary; Marina A. G. von Keyserlingk (2019). Replication Data for: Lameness during the dry period: epidemiology and associated factors [Dataset]. http://doi.org/10.5683/SP2/YTZMKX
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.5683/SP2/YTZMKX
Dataset updated
Sep 9, 2019
Dataset provided by
Borealis
Authors
Ruan R. Daros; Hanna K. Eriksson; Daniel M. Weary; Marina A. G. von Keyserlingk
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Original data, R script (code) and code output for the paper published in the Journal of Dairy Science. For best use, replicate the analysis using R. Importing data from the .csv file may cause some variables (columns of the spreadsheet) to be imported with the wrong format. If you run into any issues, do not hesitate to get in contact. Happy coding!
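The format problem mentioned above is typical of CSV imports in any tool: identifiers lose leading zeros and dates arrive as plain text unless types are stated explicitly. The sketch below shows one way to guard against this with pandas; the column names and values are hypothetical and are not taken from the dataset.

```python
import io

import pandas as pd

# Hypothetical columns: an ID with leading zeros, a date, and a 0/1 flag
sample_csv = io.StringIO(
    "cow_id,exam_date,lame\n"
    "0042,2019-01-15,1\n"
    "0107,2019-02-03,0\n"
)

data = pd.read_csv(
    sample_csv,
    dtype={"cow_id": str},        # keep IDs as text so leading zeros survive
    parse_dates=["exam_date"],    # parse dates instead of leaving them as strings
)
print(data.dtypes)
```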
Replication Data for: Crossing Over: Gendered Reading Formations at the...
dataverse.harvard.edu
zip
Updated Mar 22, 2018
Cite
Harvard Dataverse (2018). Replication Data for: Crossing Over: Gendered Reading Formations at the Muncie Public Library, 1891-1902 [Dataset]. http://doi.org/10.7910/DVN/QOFLEZ
Explore at:
Available download formats: zip(348597), zip(84034585), zip(7197810), zip(28844444)
Unique identifier
https://doi.org/10.7910/DVN/QOFLEZ
Dataset updated
Mar 22, 2018
Dataset provided by
Harvard Dataverse
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Text files, Python and R scripts, LIWC results, and CSV files with checkout data to support the essay.
Datasets (.csv format) for "Evaluation of acaricide treatments to...
figshare.com
txt
Updated Oct 7, 2024
Cite
Florent Déry; Delphine De Pierre; Anthony Asselin; Patrick A. Leighton; Steeve D. Côté; Jean-Pierre Tremblay (2024). Datasets (.csv format) for "Evaluation of acaricide treatments to experimentally reduce winter tick load on moose" by De Pierre, Déry et al., 2024. [Dataset]. http://doi.org/10.6084/m9.figshare.24077919.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.24077919.v1
Dataset updated
Oct 7, 2024
Dataset provided by
figshare
Authors
Florent Déry; Delphine De Pierre; Anthony Asselin; Patrick A. Leighton; Steeve D. Côté; Jean-Pierre Tremblay
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This dataset is necessary to reproduce the results in the manuscript "Evaluation of acaricide treatments to experimentally reduce winter tick load on moose" by De Pierre, Déry, Asselin, Leighton, Côté & Tremblay, 2024. See read_me.csv for variable names, details, and metadata. Additional details are available in the manuscript and in the R script.
Data Sheet 1_An investigation of the load-velocity relationship between...
frontiersin.figshare.com
figshare.com
csv
Updated May 30, 2025
Cite
Ziwei Zhu; Jiayong Chen; Ruize Sun; Renchen Wang; Jiaxin He; Wenfeng Zhang; Weilong Lin; Duanying Li (2025). Data Sheet 1_An investigation of the load-velocity relationship between flywheel eccentric and barbell training methods.csv [Dataset]. http://doi.org/10.3389/fpubh.2025.1579291.s001
Explore at:
Available download formats: csv
Unique identifier
https://doi.org/10.3389/fpubh.2025.1579291.s001
Dataset updated
May 30, 2025
Dataset provided by
Frontiers
Authors
Ziwei Zhu; Jiayong Chen; Ruize Sun; Renchen Wang; Jiaxin He; Wenfeng Zhang; Weilong Lin; Duanying Li
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Objective: Flywheel resistance training (FRT) is a training modality for developing lower limb athletic performance. The relationship between FRT load parameters and barbell squat loading remains ambiguous in practice, resulting in experience-driven load selection during training. Therefore, this study investigates optimal FRT loading for specific training goals (maximal strength, power, muscular endurance) by analyzing concentric velocity at varying barbell 1RM percentages (%1RM), establishes correlations between flywheel load, velocity, and %1RM, and integrates force-velocity profiling to develop evidence-based guidelines for individualized load prescription.
Methods: Thirty-nine participants completed 1RM barbell squats to establish submaximal loads (20–90%1RM). Concentric velocities were monitored via linear-position transducer (Gymaware) for FRT inertial load quantification, with test–retest measurements confirming protocol reliability. Simple and multiple linear regression modeled load-velocity interactions and multivariable relationships, while Pearson’s r and R² quantified correlations and model fit. Predictive equations estimated inertial loads (kg·m²), supported by ICC (2, 1) and CV assessments of relative/absolute reliability.
Results: A strong inverse correlation (r = −0.88) and high linearity (R² = 0.78) emerged between rotational inertia and velocity. The multivariate model demonstrated excellent fit (R² = 0.81) and robust correlation (r = 0.90), yielding the predictive equation: y = 0.769 − 0.846v + 0.002 kg.
Conclusion: The strong linear inertial load-velocity relationship enables individualized load prescription through regression equations incorporating velocity and strength parameters. While FRT demonstrates limited efficacy for developing speed-strength, its longitudinal periodization effects require further investigation.
Optimal FRT loading ranges were identified: 40–60%1RM for strength-speed, 60–80%1RM for power development, and 80–100% + 1RM for maximal strength adaptations.

Cite

Department of Energy (2017). Randomized Hourly Load Data for use with Taxonomy Distribution Feeders [Dataset]. https://data.wu.ac.at/schema/data_gov/NWYwYmFmYTItOWRkMC00OWM0LTk3OGYtZDcyYzZiOWY5N2Ez

Randomized Hourly Load Data for use with Taxonomy Distribution Feeders

Explore at:

4 scholarly articles cite this dataset (View in Google Scholar)

Available download formats: application/unknown

Dataset updated

Aug 29, 2017

Dataset provided by

Department of Energy

License

CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically

Description

This dataset was developed by NREL's distributed energy systems integration group as part of a study on high penetrations of distributed solar PV [1]. It consists of hourly load data in CSV format for use with the PNNL taxonomy of distribution feeders [2]. These feeders were developed in the open source GridLAB-D modelling language [3]. In this dataset each of the load points in the taxonomy feeders is populated with hourly averaged load data from a utility in the feeder’s geographical region, scaled and randomized to emulate real load profiles. For more information on the scaling and randomization process, see [1].

The taxonomy feeders are statistically representative of the various types of distribution feeders found in five geographical regions of the U.S. Efforts are underway (possibly complete) to translate these feeders into the OpenDSS modelling language.

This data set consists of one large CSV file for each feeder. Within each CSV, each column represents one load bus on the feeder. The header row lists the name of the load bus. The subsequent 8760 rows represent the loads for each hour of the year. The loads were scaled and randomized using a Python script, so each load series represents only one of many possible randomizations. In the header row, "rl" = residential load and "cl" = commercial load. Commercial loads are followed by a phase letter (A, B, or C). For regions 1-3, the data is from 2009. For regions 4-5, the data is from 2000.
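
The file layout described above (one header row of load bus names followed by 8760 hourly rows) can be read with a short script. The following is only a minimal sketch using Python's standard csv module; the function names, the bus names "rl_1" and "cl_1_A", and the sample values are hypothetical and are not taken from the dataset.

```python
import csv
import io

EXPECTED_ROWS = 8760  # hourly rows per feeder file, per the description


def read_feeder_csv(handle):
    """Return {load_bus_name: [hourly loads as floats]} from an open CSV file."""
    reader = csv.reader(handle)
    header = [name.strip() for name in next(reader)]
    columns = {name: [] for name in header}
    for row in reader:
        if not row:  # skip blank lines
            continue
        for name, value in zip(header, row):
            columns[name].append(float(value))
    return columns


def check_row_count(columns, expected=EXPECTED_ROWS):
    """Return the names of load buses whose series length differs from expected."""
    return [name for name, series in columns.items() if len(series) != expected]


# Tiny synthetic stand-in for a feeder file (hypothetical bus names and values).
sample = io.StringIO("rl_1,cl_1_A\n1200.5,3400.0\n1100.0,3300.0\n")
columns = read_feeder_csv(sample)
short = check_row_count(columns)  # both buses are flagged: 2 rows, not 8760
```

For a real feeder file, the same function could be called on a handle from open(path, newline=""), and an empty list from check_row_count would indicate that every load bus has the full year of hourly values.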

For use in GridLAB-D, each column must be separated into its own CSV file without a header. The load value goes in the second column and the corresponding datetime value in the first column, as shown in the sample file, sample_individual_load_file.csv. Only the first value in the time column needs to be written as an absolute time; subsequent times may be written in relative format (e.g. "+1h", as in the sample). The load should be written in P+Qj format, as seen in the sample CSV, in units of watts (W) and volt-amperes reactive (VAr). This dataset was derived from metered load data and hence includes only real power; reactive power can be generated by assuming an appropriate power factor. These loads were used with GridLAB-D version 2.2.
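
The conversion described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the script used to build the dataset: the function name, the start timestamp format, the default power factor of 0.95, and the assumption that the input values are already in watts are all hypothetical, and the exact layout should be checked against sample_individual_load_file.csv. Reactive power is derived from real power as Q = P * tan(arccos(pf)).

```python
import math


def to_player_lines(loads_w, start="2009-01-01 00:00:00", power_factor=0.95):
    """Format one load series as headerless 'time,P+Qj' lines.

    loads_w      -- real power values, assumed to be in watts (assumption)
    start        -- absolute time for the first row (format is an assumption)
    power_factor -- assumed lagging power factor used to derive Q from P
    """
    if not 0.0 < power_factor <= 1.0:
        raise ValueError("power_factor must be in the interval (0, 1]")
    q_per_p = math.tan(math.acos(power_factor))  # Q = P * tan(arccos(pf))
    lines = []
    for i, p in enumerate(loads_w):
        q = p * q_per_p
        stamp = start if i == 0 else "+1h"  # only the first time is absolute
        lines.append(f"{stamp},{p:.2f}{q:+.2f}j")
    return lines


# Two hours of a hypothetical load at an assumed power factor of 0.8.
example = to_player_lines([1000.0, 2000.0], power_factor=0.8)
```

Writing "\n".join(example) to one file per load bus would give the per-column, headerless files that the description calls for.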

Browse files in this dataset, accessible as individual files and as a single ZIP file. This dataset is approximately 242 MB compressed or 475 MB uncompressed.

For questions about this dataset, contact andy.hoke@nrel.gov.

If you find this dataset useful, please mention NREL and cite [1] in your work.

References:

[1] A. Hoke, R. Butler, J. Hambrick, and B. Kroposki, “Steady-State Analysis of Maximum Photovoltaic Penetration Levels on Typical Distribution Feeders,” IEEE Transactions on Sustainable Energy, April 2013, available at http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6357275.

[2] K. Schneider, D. P. Chassin, R. Pratt, D. Engel, and S. Thompson, “Modern Grid Initiative Distribution Taxonomy Final Report,” PNNL, Nov. 2008. Accessed April 27, 2012: http://www.gridlabd.org/models/feeders/taxonomy of prototypical feeders.pdf

[3] K. Schneider, D. Chassin, Y. Pratt, and J. C. Fuller, “Distribution power flow for smart grid technologies,” IEEE/PES Power Systems Conference and Exposition, Seattle, WA, Mar. 2009, pp. 1-7, 15-18.
