10 datasets found

Petre_Slide_CategoricalScatterplotFigShare.pptx
figshare.com
pptx
Updated Sep 19, 2016
Cite
Benj Petre; Aurore Coince; Sophien Kamoun (2016). Petre_Slide_CategoricalScatterplotFigShare.pptx [Dataset]. http://doi.org/10.6084/m9.figshare.3840102.v1
Available download formats: pptx
Unique identifier
https://doi.org/10.6084/m9.figshare.3840102.v1
Dataset updated
Sep 19, 2016
Dataset provided by
figshare
Authors
Benj Petre; Aurore Coince; Sophien Kamoun
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Categorical scatterplots with R for biologists: a step-by-step guide

Benjamin Petre1, Aurore Coince2, Sophien Kamoun1

1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK

Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.

Protocol

• Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the PowerPoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed are indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import into R.

• Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the PowerPoint slide and paste it into the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.

• Step 3: save the graph as a .pdf file. Resize the window as needed and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.
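
The three-column layout from Step 1 can also be produced without Excel. The following is only a minimal sketch in Python, not part of the authors' protocol: the column names come from Step 1, while the replicate labels, condition names, and values are hypothetical placeholders.

```python
import csv
import io

# Column names follow Step 1 of the protocol.
COLUMNS = ["Replicate", "Condition", "Value"]

# Hypothetical example rows (not taken from the authors' data).
ROWS = [
    {"Replicate": "Jan2016", "Condition": "WT", "Value": 1.2},
    {"Replicate": "Jan2016", "Condition": "MutantA", "Value": 0.7},
    {"Replicate": "Feb2016", "Condition": "WT", "Value": 1.4},
    {"Replicate": "Feb2016", "Condition": "MutantB", "Value": 2.1},
]


def to_csv_text(rows):
    """Serialize the rows as comma-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv_text(ROWS)
print(csv_text)
```

Writing `csv_text` to a file gives an input with one header line and one line per measurement, which is the shape Step 1 describes.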

Notes

• Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.

• Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.

#7 Display the graph in a separate window. Dot colors indicate replicates
graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()

References

Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.

Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035

Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128

https://cran.r-project.org/

http://ggplot2.org/
Data from: Dynamic Technical and Environmental Efficiency Performance of...
dataverse.fgcu.edu
data.mendeley.com
zip
Updated Aug 2, 2024
Cite
Isaiah Magambo (2024). Dynamic Technical and Environmental Efficiency Performance of Large Gold Mines in Developing Countries [Dataset]. http://doi.org/10.17632/pp3g267hny.1
Available download formats: zip (322671)
Unique identifier
https://doi.org/10.17632/pp3g267hny.1
Dataset updated
Aug 2, 2024
Dataset provided by
FGCU Data Repository
Authors
Isaiah Magambo
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Firm-level data from 2009 to 2018 for 34 large gold mines in developing countries. The data are used to compute the deterministic, dynamic environmental and technical efficiencies of large gold mines in developing countries.
Steps to reproduce:
1. Run the R command to generate dynamic technical and dynamic inefficiencies for every two subsequent periods (i.e., periods t and t+1).
2. Combine the results files of inefficiencies per period generated in R into a panel (see the Excel files in the results folder).
3. Import the Excel folder into Stata and generate the final results indicated in the paper.
Data files for: Huston, D.C. et al. 2021. Stable isotope signatures of an...
zenodo.org
bin, csv
Updated Sep 20, 2021
Cite
Daniel Colgan Huston (2021). Data files for: Huston, D.C. et al. 2021. Stable isotope signatures of an acanthocephalan and trematode from the herbivorous marine fish Kyphosus bigibbus (Perciformes: Kyphosidae). Journal of Parasitology. 107: 726–730 [Dataset]. http://doi.org/10.5281/zenodo.4886698
Available download formats: csv, bin
Unique identifier
https://doi.org/10.5281/zenodo.4886698
Dataset updated
Sep 20, 2021
Dataset provided by
Zenodo, http://zenodo.org/
Authors
Daniel Colgan Huston
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data files for the paper: Huston, D.C. et al. 2021. Stable isotope signatures of an acanthocephalan and trematode from the herbivorous marine fish Kyphosus bigibbus (Perciformes: Kyphosidae). Journal of Parasitology. 107(5) 726–730

Includes raw data, .csv files for importing the data into R, the R script file, and the Excel spreadsheet file used to create Figure 1.
Market Basket Analysis
kaggle.com
Updated Dec 9, 2021
Cite
Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
Dataset updated
Dec 9, 2021
Dataset provided by
Kaggle, http://kaggle.com/
Authors
Aslan Ahmedov
Description
Market Basket Analysis

Market basket analysis with Apriori algorithm

The retailer wants to target customers with suggestions on the itemset that a customer is most likely to purchase. I was given a dataset containing a retailer's data; the transaction data covers all the transactions that have happened over a period of time. The retailer will use the results to grow in its industry and to offer customers itemset suggestions, so that we can increase customer engagement, improve customer experience, and identify customer behavior. I will solve this problem using association rules, a type of unsupervised learning technique that checks for the dependency of one data item on another data item.

Introduction

Association rules are most often used when you want to find associations between different objects in a set, for example frequent patterns in a transaction database. They can tell you which items customers frequently buy together, and they allow the retailer to identify relationships between the items.

An Example of Association Rules

Assume there are 100 customers: 10 of them bought a Computer Mouse, 9 bought a Mat for Mouse, and 8 bought both. For the rule bought Computer Mouse => bought Mat for Mouse:
- support = P(Mouse & Mat) = 8/100 = 0.08
- confidence = support/P(Computer Mouse) = 0.08/0.10 = 0.80
- lift = confidence/P(Mat for Mouse) = 0.80/0.09 = 8.89
This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
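
The arithmetic for a rule A => B can be checked with a short sketch. It assumes the standard definitions (support = P(A and B), confidence = P(A and B)/P(A), lift = confidence/P(B)); the function name `rule_metrics` is hypothetical and is not part of the arules package. With the counts from the example (100 customers, 10 mouse buyers, 9 mat buyers, 8 who bought both), these definitions give a confidence of 0.80 and a lift of about 8.89.

```python
def rule_metrics(n_total, n_antecedent, n_consequent, n_both):
    """Return (support, confidence, lift) for the rule antecedent => consequent.

    All arguments are transaction counts.
    """
    support = n_both / n_total                    # P(A and B)
    confidence = n_both / n_antecedent            # P(B | A)
    lift = confidence / (n_consequent / n_total)  # P(B | A) / P(B)
    return support, confidence, lift


# Counts from the example: 100 customers, 10 bought a mouse,
# 9 bought a mat, 8 bought both.
support, confidence, lift = rule_metrics(100, 10, 9, 8)
print(round(support, 2), round(confidence, 2), round(lift, 2))
```

A lift well above 1 indicates that the two items are bought together far more often than would be expected if the purchases were independent.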

Strategy

Data Import

Data Understanding and Exploration

Transformation of the data – so that is ready to be consumed by the association rules algorithm

Running association rules

Exploring the rules generated

Filtering the generated rules

Visualization of Rule

Dataset Description

File name: Assignment-1_Data

List name: retaildata

File format: .xlsx

Number of Rows: 522065

Number of Attributes: 7

BillNo: 6-digit number assigned to each transaction. Nominal.

Itemname: Product name. Nominal.

Quantity: The quantities of each product per transaction. Numeric.

Date: The day and time when each transaction was generated. Numeric.

Price: Product price. Numeric.

CustomerID: 5-digit number assigned to each customer. Nominal.

Country: Name of the country where each customer resides. Nominal.

Libraries in R

First, we need to load the required libraries. Each library is briefly described below.

arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).

arulesViz - Extends package 'arules' with various visualization techniques for association rules and item-sets. The package also includes several interactive visualizations for rule exploration.

tidyverse - The tidyverse is an opinionated collection of R packages designed for data science.

readxl - Read Excel Files in R.

plyr - Tools for Splitting, Applying and Combining Data.

ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.

knitr - Dynamic Report generation in R.

magrittr - Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.

dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.

tidyverse - This package is designed to make it easy to install and load multiple 'tidyverse' packages in a single step.

Data Pre-processing

Next, we need to upload Assignment-1_Data.xlsx to R to read the dataset. Now we can see our data in R.

Next, we clean the data frame by removing missing values.

To apply association rule mining, we need to convert the data frame into transaction data, so that all items that are bought together in one invoice will be in ...
titanic5 Dataset Dataset
paperswithcode.com
Cite
titanic5 Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/titanic5-dataset
Description
titanic5 Dataset Created by David Beltran del Rio March 2016.

Notes This is the final (for now) version of my update to the Titanic data. I think it’s finally ready for publishing if you’d like. What I did was to strip all the passenger and crew data from the Encyclopedia Titanica (ET) web pages (excluding channel crossing passengers), create a unique ID for each passenger and crew member (Name_ID), then (painstakingly and hopefully 100% correctly) match to your earlier titanic3 dataset, in order to compare the two and to get your sibsp and parch variables. Since the ET is updated occasionally the work put into the ID and matching can be reused and refined later. I did eventually hear back from the ET people, they are willing to make the underlying database available in the future, I have not yet taken them up on it.

The two datasets line up nicely, most of the differences in the newer titanic5 dataset are in the age variable, as I had mentioned before - the new set has less missing ages - 51 missing (vs 263) out of 1309.

I am in the process of refining my analysis of the data as well, based on your comments below and your Regression Modeling Strategies example.

titanic3_wID data can be matched to titanic5 using the Name_ID variable. Tab titanic5 Metadata has the variable descriptions and allowable values for Class and Class/Dept.

A note about the ages - instead of using the add 0.5 trick to indicate estimated birth day / date I have a flag that indicates how the “final” age (Age_F) was arrived at. It’s the Age_F_Code variable - the allowable values are in the Titanic5_metadata tab in the attached excel. The reason for this is that I already had some fractional ages for infants where I had age in months instead of years and I wanted to avoid confusion for 6 month old infants, although I don’t think there are any in the data! Also, I was thinking to make fractional ages or age in days for all passengers for whom I have DoB, but I have not yet done so.

Here’s what the tabs are:

Titanic5_all - all (mostly cleaned) Titanic passenger and crew records
Titanic5_work - working dataset, crew removed, unnecessary variables removed - this is the one I import into SAS / R to work on
Titanic5_metadata - Variable descriptions and allowable values
titanic3_wID - Original Titanic3 dataset with Name_ID added for merging to Titanic5

I have a csv, R dataset, and SAS dataset, but the variable names are an older version, so I won’t send those along for now to avoid confusion.

If it helps send my contact info along to your student in case any questions arise. Gmail address probably best, on weekends for sure: davebdr@gmail.com

The tabs in titanic5.xls are

Titanic5_all
Titanic5_passenger (the one to be used for analysis)
Titanic5_metadata (used during analysis file creation)
Titanic3_wID
Students Test Data
kaggle.com
Updated Sep 12, 2023
Cite
ATHARV BHARASKAR (2023). Students Test Data [Dataset]. https://www.kaggle.com/datasets/atharvbharaskar/students-test-data
Dataset updated
Sep 12, 2023
Dataset provided by
Kaggle, http://kaggle.com/
Authors
ATHARV BHARASKAR
License
ODC Public Domain Dedication and Licence (PDDL) v1.0, http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
Description
Dataset Overview: This dataset pertains to the examination results of students who participated in a series of academic assessments at a fictitious educational institution named "University of Exampleville." The assessments were administered across various courses and academic levels, with a focus on evaluating students' performance in general management and domain-specific topics.

Columns: The dataset comprises 12 columns, each representing specific attributes and performance indicators of the students. These columns encompass information such as the students' names (which have been anonymized), their respective universities, academic program names (including BBA and MBA), specializations, the semester of the assessment, the type of examination domain (general management or domain-specific), general management scores (out of 50), domain-specific scores (out of 50), total scores (out of 100), student ranks, and percentiles.

Data Collection: The examination data was collected during a standardized assessment process conducted by the University of Exampleville. The exams were designed to assess students' knowledge and skills in general management and their chosen domain-specific subjects. It involved students from both BBA and MBA programs who were in their final year of study.

Data Format: The dataset is available in a structured format, typically as a CSV file. Each row represents a unique student's performance in the examination, while columns contain specific information about their results and academic details.

Data Usage: This dataset is valuable for analyzing and gaining insights into the academic performance of students pursuing BBA and MBA degrees. It can be used for various purposes, including statistical analysis, performance trend identification, program assessment, and comparison of scores across domains and specializations. Furthermore, it can be employed in predictive modeling or decision-making related to curriculum development and student support.

Data Quality: The dataset has undergone preprocessing and anonymization to protect the privacy of individual students. Nevertheless, it is essential to use the data responsibly and in compliance with relevant data protection regulations when conducting any analysis or research.

Data Format: The exam data is typically provided in a structured format, commonly as a CSV (Comma-Separated Values) file. Each row in the dataset represents a unique student's examination performance, and each column contains specific attributes and scores related to the examination. The CSV format allows for easy import and analysis using various data analysis tools and programming languages like Python, R, or spreadsheet software like Microsoft Excel.

Here's a column-wise description of the dataset:

Name OF THE STUDENT: The full name of the student who took the exam. (Anonymized)

UNIVERSITY: The university where the student is enrolled.

PROGRAM NAME: The name of the academic program in which the student is enrolled (BBA or MBA).

Specialization: If applicable, the specific area of specialization or major that the student has chosen within their program.

Semester: The semester or academic term in which the student took the exam.

Domain: Indicates whether the exam was divided into two parts: general management and domain-specific.

GENERAL MANAGEMENT SCORE (OUT of 50): The score obtained by the student in the general management part of the exam, out of a maximum possible score of 50.

Domain-Specific Score (Out of 50): The score obtained by the student in the domain-specific part of the exam, also out of a maximum possible score of 50.

TOTAL SCORE (OUT of 100): The total score obtained by adding the scores from the general management and domain-specific parts, out of a maximum possible score of 100.
Syria town database
dataverse.harvard.edu
csv, pdf, tsv, xls
Updated Nov 22, 2018
Cite
Harvard Dataverse (2018). Syria town database [Dataset]. http://doi.org/10.7910/DVN/YQQ07L
Available download formats: pdf (132327), tsv (4919869), tsv (298930), tsv (636235), xls (1587712), csv (5978293)
Unique identifier
https://doi.org/10.7910/DVN/YQQ07L
Dataset updated
Nov 22, 2018
Dataset provided by
Harvard Dataverse
License
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Syria
Description
The purpose of this dataset is to provide a detailed picture of the characteristics of Syrian towns in the years preceding the 2011 Syrian uprising and ensuing civil war. It incorporates the 2004 national census, the last before the uprising, and a newly collected set of data on ethnic identity. The level of analysis is the town (the Syrian Census Bureau’s fourth administrative level).

TECHNICAL NOTE: The .csv files in this data package contain both Arabic and English, so are encoded in UTF-8. The Arabic script should render if opened directly in Open Office, Numbers, Google Drive, or R statistical software. To read the Arabic in Excel, you can open the .csv file in any of these applications and save it as an .xlsx file, or open it through Excel using the following steps:
(1) open a blank Excel document
(2) import the data using “Data -> Get External Data -> Import text file”
(3) select “File Origin: Unicode (UTF-8)”
(4) select “Delimiters: comma”
(5) select the top left cell to place the data
See the following post for further details: https://stackoverflow.com/questions/6002256/is-it-possible-to-force-excel-recognize-utf-8-csv-files-automatically
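
Outside Excel, the encoding can simply be stated when the file is opened. The sketch below is one way to do this in Python, under stated assumptions: the helper name `read_utf8_csv`, the column names, and the sample row are hypothetical and are not taken from the data package.

```python
import csv
import os
import tempfile


def read_utf8_csv(path):
    """Read a comma-delimited UTF-8 file into a list of dicts keyed by header."""
    with open(path, encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))


# Hypothetical two-column sample with English and Arabic text.
sample = "name_en,name_ar\nDamascus,دمشق\n"
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".csv", delete=False, newline=""
) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

rows = read_utf8_csv(sample_path)
os.remove(sample_path)
print(rows[0]["name_ar"])
```

Naming the encoding explicitly avoids relying on the platform default, which is the same problem the Excel import steps work around.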
Data from: Composition of Foods Raw, Processed, Prepared USDA National...
agdatacommons.nal.usda.gov
s.cnmilf.com
pdf
Updated Apr 30, 2025
Cite
David B. Haytowitz; Jaspreet K.C. Ahuja; Bethany Showell; Meena Somanchi; Melissa Nickle; Quynh Anh Nguyen; Juhi R. Williams; Janet M. Roseland; Mona Khan; Kristine Y. Patterson; Jacob Exler; Shirley Wasswa-Kintu; Robin Thomas; Pamela R. Pehrsson (2025). Composition of Foods Raw, Processed, Prepared USDA National Nutrient Database for Standard Reference, Release 28 [Dataset]. http://doi.org/10.15482/USDA.ADC/1324304
Available download formats: pdf
Unique identifier
https://doi.org/10.15482/USDA.ADC/1324304
Dataset updated
Apr 30, 2025
Dataset provided by
Agricultural Research Service, https://www.ars.usda.gov/
Authors
David B. Haytowitz; Jaspreet K.C. Ahuja; Bethany Showell; Meena Somanchi; Melissa Nickle; Quynh Anh Nguyen; Juhi R. Williams; Janet M. Roseland; Mona Khan; Kristine Y. Patterson; Jacob Exler; Shirley Wasswa-Kintu; Robin Thomas; Pamela R. Pehrsson
License
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
[Note: Integrated as part of FoodData Central, April 2019.] The database consists of several sets of data: food descriptions, nutrients, weights and measures, footnotes, and sources of data. The Nutrient Data file contains mean nutrient values per 100 g of the edible portion of food, along with fields to further describe the mean value. Information is provided on household measures for food items. Weights are given for edible material without refuse. Footnotes are provided for a few items where information about food description, weights and measures, or nutrient values could not be accommodated in existing fields. Data have been compiled from published and unpublished sources. Published data sources include the scientific literature. Unpublished data include those obtained from the food industry, other government agencies, and research conducted under contracts initiated by USDA’s Agricultural Research Service (ARS). Updated data have been published electronically on the USDA Nutrient Data Laboratory (NDL) web site since 1992. Standard Reference (SR) 28 includes composition data for all the food groups and nutrients published in the 21 volumes of "Agriculture Handbook 8" (US Department of Agriculture 1976-92), and its four supplements (US Department of Agriculture 1990-93), which superseded the 1963 edition (Watt and Merrill, 1963). SR28 supersedes all previous releases, including the printed versions, in the event of any differences.

Attribution for photos:
Photo 1: k7246-9 Copyright free, public domain photo by Scott Bauer
Photo 2: k8234-2 Copyright free, public domain photo by Scott Bauer

Resources in this dataset:

Resource Title: READ ME - Documentation and User Guide - Composition of Foods Raw, Processed, Prepared - USDA National Nutrient Database for Standard Reference, Release 28.
File Name: sr28_doc.pdf
Resource Software Recommended: Adobe Acrobat Reader, url: http://www.adobe.com/prodindex/acrobat/readstep.html

Resource Title: ASCII (6.0Mb; ISO/IEC 8859-1).
File Name: sr28asc.zip
Resource Description: Delimited file suitable for importing into many programs. The tables are organized in a relational format, and can be used with a relational database management system (RDBMS), which will allow you to form your own queries and generate custom reports.

Resource Title: ACCESS (25.2Mb).
File Name: sr28db.zip
Resource Description: This file contains the SR28 data imported into a Microsoft Access (2007 or later) database. It includes relationships between files and a few sample queries and reports.

Resource Title: ASCII (Abbreviated; 1.1Mb; ISO/IEC 8859-1).
File Name: sr28abbr.zip
Resource Description: Delimited file suitable for importing into many programs. This file contains data for all food items in SR28, but not all nutrient values--starch, fluoride, betaine, vitamin D2 and D3, added vitamin E, added vitamin B12, alcohol, caffeine, theobromine, phytosterols, individual amino acids, individual fatty acids, or individual sugars are not included. These data are presented per 100 grams, edible portion. Up to two household measures are also provided, allowing the user to calculate the values per household measure, if desired.

Resource Title: Excel (Abbreviated; 2.9Mb).
File Name: sr28abxl.zip
Resource Description: For use with Microsoft Excel (2007 or later), but can also be used by many other spreadsheet programs. This file contains data for all food items in SR28, but not all nutrient values--starch, fluoride, betaine, vitamin D2 and D3, added vitamin E, added vitamin B12, alcohol, caffeine, theobromine, phytosterols, individual amino acids, individual fatty acids, or individual sugars are not included. These data are presented per 100 grams, edible portion. Up to two household measures are also provided, allowing the user to calculate the values per household measure, if desired.
Resource Software Recommended: Microsoft Excel, url: https://www.microsoft.com/

Resource Title: ASCII (Update Files; 1.1Mb; ISO/IEC 8859-1).
File Name: sr28upd.zip
Resource Description: Update Files - Contains updates for those users who have loaded Release 27 into their own programs and wish to do their own updates. These files contain the updates between SR27 and SR28. Delimited file suitable for import into many programs.
Dodd Frank financial reform at the Commodity Futures Trading Commission...
dataone.org
search.dataone.org
Updated Mar 7, 2024
Cite
Konrad Posch; Thomas Nath; J. Nicholas Ziegler (2024). Dodd Frank financial reform at the Commodity Futures Trading Commission (CFTC): Public comments, January 14th, 2010 – July 16th, 2014 [Dataset]. http://doi.org/10.6078/D1610G
Unique identifier
https://doi.org/10.6078/D1610G
Dataset updated
Mar 7, 2024
Dataset provided by
Dryad Digital Repository
Authors
Konrad Posch; Thomas Nath; J. Nicholas Ziegler
Time period covered
Jan 10, 2024
Description
This dataset includes a complete record of the 36,066 public comments submitted to the Commodity Futures Trading Commission (CFTC) in response to notices of proposed rule-making (NPRMs) implementing the Dodd-Frank Act over a 42-month period (January 14, 2010 to July 16, 2014). The data was exported from the agency’s internal database by the CFTC and provided to the authors by email correspondence following a cold call to the CFTC public relations department. The source internal database is maintained by the CFTC as part of its internal compliance with the Administrative Procedures Act (APA) and includes all rule-making notices that appear in the Federal Register. Owing to the salience and publicity of the Dodd-Frank Act, the CFTC made a special tag in its database for all comments submitted in response to rules proposed under the authority of the Dodd-Frank Act. This database thus includes all comments which the CFTC considers relevant to the Dodd-Frank reform. In short, the CFTC gave t...

This dataset was exported by the CFTC from their internal database of public comments in response to NPRMs. The uploaded file is the exact raw data generated by the CFTC and provided to the authors. An updated version of the data file including the authors' classifications based on the organization value will be uploaded when the related work is accepted for publication.

Dodd Frank Financial Reform at the CFTC - Public Comments, January 14th, 2010 to July 16th, 2014

Description of the data and file structure

NOTE: The Comment Text ( and variables) are longer than the maximum character count of Microsoft Excel cells (32,767 characters). All analysis should take this into account and import the .txt file directly into your analysis program (R, Stata, etc.) rather than attempt to edit or modify the data in Excel before using computational analysis.
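
Whether a given field would be truncated by Excel can be checked after the file has been parsed. The sketch below is illustrative only: the delimiter and column layout of the .txt file are not stated here, so it works on rows that have already been split into fields, and the function name `fields_over_limit` and the sample rows are hypothetical.

```python
# Maximum number of characters a single Excel cell can hold (from the note above).
EXCEL_CELL_LIMIT = 32767


def fields_over_limit(rows, limit=EXCEL_CELL_LIMIT):
    """Return (row_index, column_index, length) for every field longer than limit."""
    return [
        (r, c, len(value))
        for r, row in enumerate(rows)
        for c, value in enumerate(row)
        if len(value) > limit
    ]


# Hypothetical parsed rows: an ID column and a comment-text column.
rows = [
    ["1", "short comment"],
    ["2", "x" * 40000],
]
flagged = fields_over_limit(rows)
print(flagged)
```

Any row that appears in the result would lose text if the data were edited and saved through Excel.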

There are two files provided:

DoddFrankCommentsAll(uncompressed).txt - the full raw data file from the CFTC internal database of all 36,066 comments on NPRMs

(2014-07-30) CFTC Original Codebook.xlsx - the codebook provided by the CFTC with the raw data. Originally provided as email text, formatted in Excel by authors.

Codebook:

| Variable | Explanation ...

Replication Package for the Paper: "An Insight into Security Code Review...

zenodo.org

zip

Updated Jun 2, 2025

Cite

Jiaxin Yu; Peng Liang; Yujia Fu; Amjed Tahir; Mojtaba Shahin; Chong Wang; Yangxiao Cai (2025). Replication Package for the Paper: "An Insight into Security Code Review with LLMs: Capabilities, Obstacles, and Influential Factors". [Dataset]. http://doi.org/10.5281/zenodo.15572151

Available download formats: zip

Unique identifier

https://doi.org/10.5281/zenodo.15572151

Dataset updated

Jun 2, 2025

Dataset provided by

Zenodo, http://zenodo.org/

Authors

Jiaxin Yu; Peng Liang; Yujia Fu; Amjed Tahir; Mojtaba Shahin; Chong Wang; Yangxiao Cai

License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This is the replication package for the paper: "An Insight into Security Code Review with LLMs: Capabilities, Obstacles, and Influential Factors".

The replication package is organized into three folders:

1. RQ1 Performance of LLMs

- Five prompt templates.pdf
This PDF demonstrates the detailed structures of the five prompt templates designed in Section 3.3.2 of our paper.

- source code of the Python and C/C++ datasets
This folder contains the source code of the Python and C/C++ datasets, used to construct prompts and apply the baseline tools for static analysis.

- prompts for the Python and C/C++ datasets
This folder contains the prompts constructed from the source code of the Python and C/C++ datasets based on the five prompt templates.

- responses of LLMs and baselines
This folder contains the responses generated by LLMs for each prompt and the analysis results of baseline tools. For CodeQL, you need to upload results.sarif to GitHub (https://docs.github.com/en/code-security/code-scanning/integrating-with-code-scanning/uploading-a-sarif-file-to-github) to view the analysis results. For SonarQube, you need to import the export file into an Enterprise Edition or higher instance of the same version (v10.5 in our work) and similar configuration (default configuration in our work) to view the analysis results.

- entropy_calculation.py
This Python script calculates the average entropy of each LLM-prompt combination to measure the consistency of LLM responses across the three repeated experiments.

- Data Labelling for the C/C++ Dataset.xlsx
- Data Labelling for the Python Dataset.xlsx
The two Microsoft (MS) Excel files contain the labelling results for LLMs and baselines in the C/C++ and Python datasets, including the category of each response generated by an LLM for each prompt, as well as the category of each analysis result generated by a baseline tool for each code file. The four categories (i.e., Instrumental, Helpful, Misleading and Uncertain) are defined in Section 3.3.3 of our paper as the labelling criteria.

How to Read the MS Excel files:
Both MS Excel files contain 5 sheets. The first sheet ('all_c++_data' or 'all_python_data') includes information on all data in each dataset. The sheets 'first round', 'second round' and 'third round' contain the labelling results for LLMs under the five prompts in the three repeated experiments. The sheet 'Baselines' includes the labelling results for the baseline tools.

Column	Description
File ID	the identifier of each code file in our dataset.
Security Defect	the security defect(s) that the code file contains.
Project	the source project of the code file.
Suffix	the suffix of the code file.
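
The consistency measure that entropy_calculation.py is described as computing can be illustrated with a minimal sketch. This is not the authors' script: it assumes that consistency is measured as the Shannon entropy of the category labels one code file receives across the three rounds, averaged over all files for a given LLM-prompt combination. The function names and the sample labels below are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (in bits) of a sequence of categorical labels."""
    counts = Counter(labels)
    total = len(labels)
    # Sum p * log2(1/p) over the observed categories.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def average_entropy(rounds):
    """Mean per-file entropy across rounds.

    `rounds` is a list of equal-length label lists, one list per round
    (assumed layout; the real script may organise its data differently).
    """
    per_file = [shannon_entropy(file_labels) for file_labels in zip(*rounds)]
    return sum(per_file) / len(per_file)


# Hypothetical labels for two code files over three rounds.
first_round = ["Instrumental", "Helpful"]
second_round = ["Instrumental", "Misleading"]
third_round = ["Instrumental", "Helpful"]

avg = average_entropy([first_round, second_round, third_round])
print(f"average entropy: {avg:.4f} bits")
```

Under this assumption, an entropy of 0 means a file received the same label in all three rounds, and higher values mean less consistent responses.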

2. RQ2 Quality Problem in Responses
- data_analysis_first_round.mx22
- data_analysis_second_round.mx22
- data_analysis_third_round.mx22

These three MAXQDA project files contain the results of data extraction for quality problems present in responses generated by the best-performing LLM-prompt combination across the three repeated experiments. The files can be opened with MAXQDA 2022 or later, which is available for download at https://www.maxqda.com/. You may also use the free 14-day trial version of MAXQDA 2024, which is available for download at https://www.maxqda.com/trial.

3. RQ3 Factor influencing LLMs
This folder contains two sub-folders:

- Step 1 - correlation analysis
Files in this subfolder are for conducting correlation analysis for explanatory variables through a Python script.

- Step 2 - redundancy analysis and model fitting
Files in this subfolder are for conducting redundancy analysis, allocation of degrees of freedom, model fitting and evaluation through an R script. Detailed instructions for running the R script can be found in readme.md in this subfolder.
