41 datasets found
  1. Tennessee Eastman Process Simulation Dataset

    • kaggle.com
    zip
    Updated Feb 9, 2020
    Cite
    Sergei Averkiev (2020). Tennessee Eastman Process Simulation Dataset [Dataset]. https://www.kaggle.com/averkij/tennessee-eastman-process-simulation-dataset
    Explore at:
    zip (1370814903 bytes)
    Dataset updated
    Feb 9, 2020
    Authors
    Sergei Averkiev
    Description

    Intro

    This dataverse contains the data referenced in Rieth et al. (2017). Issues and Advances in Anomaly Detection Evaluation for Joint Human-Automated Systems. To be presented at Applied Human Factors and Ergonomics 2017.

    Content

    Each .RData file is an external representation of an R dataframe that can be read into an R environment with the 'load' function. The variables loaded are named ‘fault_free_training’, ‘fault_free_testing’, ‘faulty_testing’, and ‘faulty_training’, corresponding to the RData files.

    Each dataframe contains 55 columns:

    Column 1 ('faultNumber') ranges from 1 to 20 in the “Faulty” datasets and represents the fault type in the TEP. The “FaultFree” datasets only contain fault 0 (i.e. normal operating conditions).

    Column 2 ('simulationRun') ranges from 1 to 500 and represents a different random number generator state from which a full TEP dataset was generated (Note: the actual seeds used to generate training and testing datasets were non-overlapping).

    Column 3 ('sample') ranges either from 1 to 500 (“Training” datasets) or 1 to 960 (“Testing” datasets). The TEP variables (columns 4 to 55) were sampled every 3 minutes for a total duration of 25 hours and 48 hours respectively. Note that the faults were introduced 1 and 8 hours into the Faulty Training and Faulty Testing datasets, respectively.

    Columns 4 to 55 contain the process variables; the column names retain the original variable names.
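    As a quick illustration, here is a minimal R sketch for working with these files (the file name is a placeholder for whichever .RData file you downloaded):

    # Load one of the .RData files; this creates the data frame named above,
    # e.g. 'faulty_training' for the faulty training file.
    load("faulty_training.RData")

    # All samples for fault 1, simulation run 1, in time order
    run1 <- subset(faulty_training, faultNumber == 1 & simulationRun == 1)
    run1 <- run1[order(run1$sample), ]
    dim(run1)   # a "Training" file should give 500 rows x 55 columns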

    Acknowledgements

    This work was sponsored by the Office of Naval Research, Human & Bioengineered Systems (ONR 341), program officer Dr. Jeffrey G. Morrison under contract N00014-15-C-5003. The views expressed are those of the authors and do not reflect the official policy or position of the Office of Naval Research, Department of Defense, or US Government.

    User Agreement

    By accessing or downloading the data or work provided here, you, the User, agree that you have read this agreement in full and agree to its terms.

    The person who owns, created, or contributed a work to the data or work provided here dedicated the work to the public domain and has waived his or her rights to the work worldwide under copyright law. You can copy, modify, distribute, and perform the work, for any lawful purpose, without asking permission.

    In no way are the patent or trademark rights of any person affected by this agreement, nor are the rights that any other person may have in the work or in how the work is used, such as publicity or privacy rights.

    Pacific Science & Engineering Group, Inc., its agents and assigns, make no warranties about the work and disclaim all liability for all uses of the work, to the fullest extent permitted by law.

    When you use or cite the work, you shall not imply endorsement by Pacific Science & Engineering Group, Inc., its agents or assigns, or by another author or affirmer of the work.

    This Agreement may be amended, and the use of the data or work shall be governed by the terms of the Agreement at the time that you access or download the data or work from this Website.

  2. Additional Tennessee Eastman Process Simulation Data for Anomaly Detection Evaluation

    • dataverse.harvard.edu
    • dataone.org
    Updated Jul 6, 2017
    + more versions
    Cite
    Cory A. Rieth; Ben D. Amsel; Randy Tran; Maia B. Cook (2017). Additional Tennessee Eastman Process Simulation Data for Anomaly Detection Evaluation [Dataset]. http://doi.org/10.7910/DVN/6C3JR1
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 6, 2017
    Dataset provided by
    Harvard Dataverse
    Authors
    Cory A. Rieth; Ben D. Amsel; Randy Tran; Maia B. Cook
    License

    https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/6C3JR1

    Description

    User Agreement, Public Domain Dedication, and Disclaimer of Liability. By accessing or downloading the data or work provided here, you, the User, agree that you have read this agreement in full and agree to its terms.

    The person who owns, created, or contributed a work to the data or work provided here dedicated the work to the public domain and has waived his or her rights to the work worldwide under copyright law. You can copy, modify, distribute, and perform the work, for any lawful purpose, without asking permission.

    In no way are the patent or trademark rights of any person affected by this agreement, nor are the rights that any other person may have in the work or in how the work is used, such as publicity or privacy rights. Pacific Science & Engineering Group, Inc., its agents and assigns, make no warranties about the work and disclaim all liability for all uses of the work, to the fullest extent permitted by law. When you use or cite the work, you shall not imply endorsement by Pacific Science & Engineering Group, Inc., its agents or assigns, or by another author or affirmer of the work. This Agreement may be amended, and the use of the data or work shall be governed by the terms of the Agreement at the time that you access or download the data or work from this Website.

    Description

    This dataverse contains the data referenced in Rieth et al. (2017). Issues and Advances in Anomaly Detection Evaluation for Joint Human-Automated Systems. To be presented at Applied Human Factors and Ergonomics 2017.

    Each .RData file is an external representation of an R dataframe that can be read into an R environment with the 'load' function. The variables loaded are named ‘fault_free_training’, ‘fault_free_testing’, ‘faulty_testing’, and ‘faulty_training’, corresponding to the RData files.

    Each dataframe contains 55 columns:

    Column 1 ('faultNumber') ranges from 1 to 20 in the “Faulty” datasets and represents the fault type in the TEP. The “FaultFree” datasets only contain fault 0 (i.e. normal operating conditions).

    Column 2 ('simulationRun') ranges from 1 to 500 and represents a different random number generator state from which a full TEP dataset was generated (Note: the actual seeds used to generate training and testing datasets were non-overlapping).

    Column 3 ('sample') ranges either from 1 to 500 (“Training” datasets) or 1 to 960 (“Testing” datasets). The TEP variables (columns 4 to 55) were sampled every 3 minutes for a total duration of 25 hours and 48 hours respectively. Note that the faults were introduced 1 and 8 hours into the Faulty Training and Faulty Testing datasets, respectively.

    Columns 4 to 55 contain the process variables; the column names retain the original variable names.

    Acknowledgments. This work was sponsored by the Office of Naval Research, Human & Bioengineered Systems (ONR 341), program officer Dr. Jeffrey G. Morrison under contract N00014-15-C-5003. The views expressed are those of the authors and do not reflect the official policy or position of the Office of Naval Research, Department of Defense, or US Government.

  3. Data Mining Project - Boston

    • kaggle.com
    zip
    Updated Nov 25, 2019
    Cite
    SophieLiu (2019). Data Mining Project - Boston [Dataset]. https://www.kaggle.com/sliu65/data-mining-project-boston
    Explore at:
    zip (59313797 bytes)
    Dataset updated
    Nov 25, 2019
    Authors
    SophieLiu
    Area covered
    Boston
    Description

    Context

    To make this a seamless process, I cleaned the data and deleted many variables that I thought were not important to our dataset. I then uploaded all of those files to Kaggle for each of you to download. The rideshare_data file has both Lyft and Uber, but it is still a cleaned version of the dataset we downloaded from Kaggle.

    Use of Data Files

    You can easily subset the data into the car types that you will be modeling by first loading the csv into R, here is the code for how you do this:

    This loads the file into R

    df<-read.csv('uber.csv')

    The next code subsets the data into specific car types. The example below keeps only the Uber 'Black' car type.

    df_black<-subset(df, df$name == 'Black')

    To reuse this subset later, write the dataframe to a csv file on your computer:

    write.csv(df_black, "nameofthefileyouwanttosaveas.csv")

    The file will appear in your working directory. If you are not sure where your working directory is, run this code:

    getwd()

    The output will be the file path to your working directory. You will find the file you just created in that folder.
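    Putting the pieces together, a hedged end-to-end sketch of the steps above (file names are placeholders; adjust them to wherever you saved the Kaggle csv files):

    df <- read.csv("uber.csv", stringsAsFactors = FALSE)      # load the Uber rides
    df_black <- subset(df, df$name == "Black")                # keep only the 'Black' car type
    write.csv(df_black, "uber_black.csv", row.names = FALSE)  # save the subset for later use
    getwd()                                                   # shows the folder the csv was written to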

    Inspiration

    Your data will be in front of the world's largest data science community. What questions do you want to see answered?

  4. Need for speed: Short lifespan selects for increased learning ability - Data

    • datadryad.org
    • search.dataone.org
    zip
    Updated Oct 25, 2019
    Cite
    Jannis Liedtke; Lutz Fromhage (2019). Need for speed: Short lifespan selects for increased learning ability - Data [Dataset]. http://doi.org/10.5061/dryad.k0p2ngf43
    Explore at:
    zip
    Dataset updated
    Oct 25, 2019
    Dataset provided by
    Dryad
    Authors
    Jannis Liedtke; Lutz Fromhage
    Time period covered
    Oct 9, 2019
    Description

    The first dataset provides the R code for the simulation.

    The other datasets give trait values of "learning speed" ("L") for each individual in each generation (1-200) for different lifespans (season lengths), from season length 1 to 800.

    One dataframe ("Metapop") provides the results of all 10 runs and gives the mean trait value ("L") for a given sl (1-800) for each run.

    One dataframe ("Meta118") provides the results of all 10 runs and gives the individual trait values ("L", "Picky") and the individual scores for the collected number of resource items ("Colsum") and the summed value of all collected resources ("sumS") for a given sl = 118 for each run.

    The dataframes can be loaded into R with, e.g.:

    df<-read.csv(".../dfL_1")
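    A hedged sketch of how the "Metapop" dataframe could be summarised once read in (the file name and the column names "sl" and "L" are assumptions based on the description above; adjust to the actual files in the archive):

    metapop <- read.csv("Metapop.csv")                   # mean trait value per run and season length
    mean_L  <- aggregate(L ~ sl, data = metapop, mean)   # average over the 10 runs
    plot(mean_L$sl, mean_L$L, type = "l",
         xlab = "Season length (sl)", ylab = "Mean learning speed (L)")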

  5. Dispa-SET Output files for the JRC report "Power System Flexibility in a variable climate"

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jul 19, 2024
    Cite
    De Felice, Matteo (2024). Dispa-SET Output files for the JRC report "Power System Flexibility in a variable climate" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3778132
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    JRC
    Authors
    De Felice, Matteo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Here you can find the model results of the report:

    De Felice, M., Busch, S., Kanellopoulos, K., Kavvadias, K. and Hidalgo Gonzalez, I., Power system flexibility in a variable climate, EUR 30184 EN, Publications Office of the European Union, Luxembourg, 2020, ISBN 978-92-76-18183-5 (online), doi:10.2760/75312 (online), JRC120338.

    This dataset contains the raw GDX files generated by the GAMS optimiser for the Dispa-SET model. Details on the output format and the names of the variables can be found in the Dispa-SET documentation. A markdown notebook in R (and the rendered PDF) contains an example of how to read the GDX files in R.

    We also include in this dataset a data frame saved in the Apache Parquet format that can be read both in R and Python.
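    For the Parquet data frame, a minimal R sketch (the file name is a placeholder; in Python the same file can be read with pandas.read_parquet):

    library(arrow)                                     # reads Apache Parquet files in R
    results <- read_parquet("dispaset_output.parquet")
    str(results)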

    A description of the methodology and the data sources, with references, can be found in the report.

    Linked resources

    Input files: https://zenodo.org/record/3775569#.XqqY3JpS-fc

    Source code for the figures: https://github.com/energy-modelling-toolkit/figures-JRC-report-power-system-and-climate-variability

    Update

    [29/06/2020] Uploaded a new version of the Parquet file with the correct data in the climate_year column

  6. R codes and dataset for Visualisation of Diachronic Constructional Change using Motion Chart

    • bridges.monash.edu
    • researchdata.edu.au
    zip
    Updated May 30, 2023
    Cite
    Gede Primahadi Wijaya Rajeg (2023). R codes and dataset for Visualisation of Diachronic Constructional Change using Motion Chart [Dataset]. http://doi.org/10.26180/5c844c7a81768
    Explore at:
    zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    Monash University
    Authors
    Gede Primahadi Wijaya Rajeg
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Publication

    Primahadi Wijaya R., Gede. 2014. Visualisation of diachronic constructional change using Motion Chart. In Zane Goebel, J. Herudjati Purwoko, Suharno, M. Suryadi & Yusuf Al Aried (eds.). Proceedings: International Seminar on Language Maintenance and Shift IV (LAMAS IV), 267-270. Semarang: Universitas Diponegoro. doi: https://doi.org/10.4225/03/58f5c23dd8387

    Description of R codes and data files in the repository

    This repository is imported from its GitHub repo. Versioning of this figshare repository is associated with the GitHub repo's Releases, so check the Releases page for updates (the next version is to include the unified version of the codes in the first release with the tidyverse).

    The raw input data consists of two files (i.e. will_INF.txt and go_INF.txt). They represent the co-occurrence frequency of the top-200 infinitival collocates for will and be going to respectively across the twenty decades of the Corpus of Historical American English (from the 1810s to the 2000s).

    These two input files are used in the R code file 1-script-create-input-data-raw.r. The code preprocesses and combines the two files into a long-format data frame consisting of the following columns: (i) decade, (ii) coll (for "collocate"), (iii) BE going to (for frequency of the collocates with be going to) and (iv) will (for frequency of the collocates with will); it is available in input_data_raw.txt. Then, the script 2-script-create-motion-chart-input-data.R processes input_data_raw.txt to normalise the co-occurrence frequency of the collocates per million words (the COHA size and normalising base frequency are available in coha_size.txt). The output of the second script is input_data_futurate.txt.

    Next, input_data_futurate.txt contains the relevant input data for generating (i) the static motion chart as an image plot in the publication (using the script 3-script-create-motion-chart-plot.R), and (ii) the dynamic motion chart (using the script 4-script-motion-chart-dynamic.R).

    The repository adopts the project-oriented workflow in RStudio; double-click the Future Constructions.Rproj file to open an RStudio session whose working directory is associated with the contents of this repository.
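    As a hedged illustration of the per-million-words normalisation performed by the second script (column names such as "decade" and the word-count column in coha_size.txt are assumptions; the repository's own scripts are authoritative):

    raw  <- read.delim("input_data_raw.txt")
    size <- read.delim("coha_size.txt")                   # assumed layout: decade + corpus word count
    raw  <- merge(raw, size, by = "decade")
    raw$will_pmw  <- raw$will / raw$word_count * 1e6      # frequency per million words
    raw$going_pmw <- raw$BE.going.to / raw$word_count * 1e6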

  7. Respiration_chambers/raw_log_files and combined datasets of biomass and chamber data, and physical parameters

    • researchdata.edu.au
    • data.aad.gov.au
    Updated Dec 3, 2018
    + more versions
    Cite
    BLACK, JAMES GEOFFREY; Black, J.G.; BLACK, JAMES GEOFFREY; BLACK, JAMES GEOFFREY (2018). Respiration_chambers/raw_log_files and combined datasets of biomass and chamber data, and physical parameters [Dataset]. https://researchdata.edu.au/respirationchambersrawlogfiles-combined-datasets-physical-parameters/1360456
    Explore at:
    Dataset updated
    Dec 3, 2018
    Dataset provided by
    Australian Antarctic Data Centre
    Australian Antarctic Division
    Authors
    BLACK, JAMES GEOFFREY; Black, J.G.; BLACK, JAMES GEOFFREY; BLACK, JAMES GEOFFREY
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 27, 2015 - Feb 23, 2015
    Area covered
    Description

    General overview
    The following datasets are described by this metadata record, and are available for download from the provided URL.

    - Raw log files, physical parameters raw log files
    - Raw excel files, respiration/PAM chamber raw excel spreadsheets
    - Processed and cleaned excel files, respiration chamber biomass data
    - Raw rapid light curve excel files (this is duplicated from Raw log files), combined dataset pH, temperature, oxygen, salinity, velocity for experiment
    - Associated R script file for pump cycles of respiration chambers

    ####

    Physical parameters raw log files

    Raw log files
    1) DATE=
    2) Time= UTC+11
    3) PROG=Automated program to control sensors and collect data
    4) BAT=Amount of battery remaining
    5) STEP=check aquation manual
    6) SPIES=check aquation manual
    7) PAR=Photoactive radiation
    8) Levels=check aquation manual
    9) Pumps= program for pumps
    10) WQM=check aquation manual

    ####

    Respiration/PAM chamber raw excel spreadsheets

    Abbreviations in headers of datasets
    Note: Two datasets are provided in different formats, raw and cleaned (adj). These are the same data with the PAR column moved over to PAR.all for analysis. All headers are the same. The cleaned (adj) dataframe will work with the R syntax below; alternatively, add code to do the cleaning in R.

    Date: ISO 1986 - Check
    Time:UTC+11 unless otherwise stated
    DATETIME: UTC+11 unless otherwise stated
    ID (of instrument in respiration chambers)
    ID43=Pulse amplitude fluorescence measurement of control
    ID44=Pulse amplitude fluorescence measurement of acidified chamber
    ID=1 Dissolved oxygen
    ID=2 Dissolved oxygen
    ID3= PAR
    ID4= PAR
    PAR=Photo active radiation umols
    F0=minimal fluorescence from PAM
    Fm=Maximum fluorescence from PAM
    Yield=(F0 – Fm)/Fm
    rChl=an estimate of chlorophyll (Note this is uncalibrated and is an estimate only)
    Temp=Temperature degrees C
    PAR=Photo active radiation
    PAR2= Photo active radiation2
    DO=Dissolved oxygen
    %Sat= Saturation of dissolved oxygen
    Notes=This is the program of the underwater submersible logger with the following abbreviations:
    Notes-1) PAM=
    Notes-2) PAM=Gain level set (see aquation manual for more detail)
    Notes-3) Acclimatisation= Program of slowly introducing treatment water into chamber
    Notes-4) Shutter start up 2 sensors+sample…= Shutter PAMs automatic set up procedure (see aquation manual)
    Notes-5) Yield step 2=PAM yield measurement and calculation of control
    Notes-6) Yield step 5= PAM yield measurement and calculation of acidified
    Notes-7) Abatus respiration DO and PAR step 1= Program to measure dissolved oxygen and PAR (see aquation manual). Steps 1-4 are different stages of this program including pump cycles, DO and PAR measurements.

    8) Rapid light curve data
    Pre LC: A yield measurement prior to the following measurement
    After 10.0 sec at 0.5% to 8%: Level of each of the 8 steps of the rapid light curve
    Odyssey PAR (only in some deployments): An extra measure of PAR (umols) using an Odyssey data logger
    Dataflow PAR: An extra measure of PAR (umols) using a Dataflow sensor.
    PAM PAR: This is copied from the PAR or PAR2 column
    PAR all: This is the complete PAR file and should be used
    Deployment: Identifying which deployment the data came from

    ####

    Respiration chamber biomass data

    The data is chlorophyll a biomass from cores from the respiration chambers. The headers are: Depth (mm), Treat (acidified or control), Chl a (pigment and indicator of biomass), and Core (5 cores were collected from each chamber; three were analysed for chl a). These are pseudoreplicates/subsamples from the chambers and should not be treated as replicates.

    ####

    Associated R script file for pump cycles of respiration chambers

    Associated respiration chamber data to determine the times when respiration chamber pumps delivered treatment water to chambers. Determined from Aquation log files (see associated files). Use the chamber cut times to determine net production rates. Note: Users need to avoid the times when the respiration chambers are delivering water as this will give incorrect results. The headers that get used in the attached/associated R file are start regression and end regression. The remaining headers are not used unless called for in the associated R script. The last columns of these datasets (intercept, ElapsedTimeMincoef) are determined from the linear regressions described below.

    To determine the rate of change of net production, coefficients of the regression of oxygen consumption in discrete 180 minute data blocks were determined. R squared values for fitted regressions of these coefficients were consistently high (greater than 0.9). We make two assumptions with calculation of net production rates: the first is that heterotrophic community members do not change their metabolism under OA; and the second is that the heterotrophic communities are similar between treatments.
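    A hedged sketch of these block-wise regressions (column names such as ElapsedTimeMin and DO are assumptions; substitute the actual headers from the combined dataset):

    library(dplyr)
    chamber_data <- read.csv("chamber_oxygen_log.csv")            # placeholder file name
    rates <- chamber_data %>%
      mutate(block = floor(ElapsedTimeMin / 180)) %>%             # discrete 180-minute blocks
      group_by(block) %>%
      summarise(slope = coef(lm(DO ~ ElapsedTimeMin))[2],         # oxygen change per minute
                r2    = summary(lm(DO ~ ElapsedTimeMin))$r.squared)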

    ####

    Combined dataset pH, temperature, oxygen, salinity, velocity for experiment

    This data is rapid light curve data generated from a Shutter PAM fluorimeter. There are eight steps in each rapid light curve. Note: The software component of the Shutter PAM fluorimeter for sensor 44 appeared to be damaged and would not cycle through the PAR cycles. Therefore the rapid light curves and recovery curves should only be used for the control chambers (sensor ID43).

    The headers are
    PAR: Photoactive radiation
    relETR: F0/Fm x PAR
    Notes: Stage/step of light curve
    Treatment: Acidified or control


    The associated light treatments in each stage. Each actinic light intensity is held for 10 seconds, then a saturating pulse is taken (see PAM methods).

    After 10.0 sec at 0.5% = 1 umols PAR
    After 10.0 sec at 0.7% = 1 umols PAR
    After 10.0 sec at 1.1% = 0.96 umols PAR
    After 10.0 sec at 1.6% = 4.32 umols PAR
    After 10.0 sec at 2.4% = 4.32 umols PAR
    After 10.0 sec at 3.6% = 8.31 umols PAR
    After 10.0 sec at 5.3% =15.78 umols PAR
    After 10.0 sec at 8.0% = 25.75 umols PAR

    Note: this dataset appears to be missing data; rows for deployment D5 are potentially not usable.

    See the word document in the download file for more information.

  8. Google Data Analytics Case Study Cyclistic

    • kaggle.com
    zip
    Updated Sep 27, 2022
    + more versions
    Cite
    Udayakumar19 (2022). Google Data Analytics Case Study Cyclistic [Dataset]. https://www.kaggle.com/datasets/udayakumar19/google-data-analytics-case-study-cyclistic/suggestions
    Explore at:
    zip (1299 bytes)
    Dataset updated
    Sep 27, 2022
    Authors
    Udayakumar19
    Description

    Introduction

    Welcome to the Cyclistic bike-share analysis case study! In this case study, you will perform many real-world tasks of a junior data analyst. You will work for a fictional company, Cyclistic, and meet different characters and team members. In order to answer the key business questions, you will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act. Along the way, the Case Study Roadmap tables — including guiding questions and key tasks — will help you stay on the right path.

    Scenario

    You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

    Ask

    How do annual members and casual riders use Cyclistic bikes differently?

    Guiding Question:

    What is the problem you are trying to solve?
      How do annual members and casual riders use Cyclistic bikes differently?
    How can your insights drive business decisions?
      The insights will help the marketing team design a strategy aimed at casual riders.
    

    Prepare

    Guiding Question:

    Where is your data located?
      The data is located in Cyclistic's own organizational data store.

    How is data organized?
      The datasets are in CSV format, one file per month, from financial year 22.

    Are there issues with bias or credibility in this data? Does your data ROCCC?
      Yes, it is ROCCC because the data was collected by the Cyclistic organization itself.

    How are you addressing licensing, privacy, security, and accessibility?
      The company has its own license over the dataset, and the dataset does not contain any personal information about the riders.

    How did you verify the data’s integrity?
      All the files have consistent columns and each column has the correct type of data.

    How does it help you answer your questions?
      Insights are hidden in the data; we have to interpret the data to find them.

    Are there any problems with the data?
      Yes, the starting station name and ending station name columns have null values.
    

    Process

    Guiding Question:

    What tools are you choosing and why?
      I used RStudio to clean and transform the data for the analysis phase, because the dataset is large and I wanted to gain experience with the language.

    Have you ensured the data’s integrity?
      Yes, the data is consistent throughout the columns.

    What steps have you taken to ensure that your data is clean?
      First, duplicates and null values were removed; then new columns were added for analysis.

    How can you verify that your data is clean and ready to analyze?
      Make sure the column names are consistent across all datasets, then combine them with the bind_rows() function.

    Make sure the column data types are consistent across all datasets by using compare_df_cols() from the janitor package.
    Combine all the datasets into a single data frame so the analysis is consistent throughout.
    Remove the columns start_lat, start_lng, end_lat and end_lng from the dataframe because they are not required for the analysis.
    Create new columns day, date, month and year from the started_at column; this provides additional ways to aggregate the data.
    Create the ride_length column from the started_at and ended_at columns to find the average ride duration.
    Remove the null rows from the dataset using the na.omit() function (a code sketch of these steps follows this list).
    Have you documented your cleaning process so you can review and share those results?
      Yes, the cleaning process is documented clearly.
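    A minimal R sketch of these cleaning steps, under assumed file names and the column names used by the Cyclistic/Divvy trip files:

    library(dplyr); library(janitor); library(lubridate)

    files <- list.files(pattern = "tripdata.*\\.csv$")          # one csv per month (placeholder pattern)
    trips <- lapply(files, read.csv)
    compare_df_cols(trips)                                      # check column types match across files

    all_trips <- bind_rows(trips) %>%
      select(-start_lat, -start_lng, -end_lat, -end_lng) %>%    # drop columns not needed for analysis
      mutate(started_at = ymd_hms(started_at),
             ended_at   = ymd_hms(ended_at),
             date  = as.Date(started_at),
             day   = format(date, "%d"),
             month = format(date, "%m"),
             year  = format(date, "%Y"),
             ride_length = difftime(ended_at, started_at, units = "mins")) %>%
      na.omit()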
    

    Analyze Phase:

    Guiding Questions:

    How should you organize your data to perform analysis on it?
      The data has been organized into one single dataframe by using the read.csv function in R.
    Has your data been properly formatted?
      Yes, all the columns have their correct data type.

    What surprises did you discover in the data?
      Casual members' ride duration is higher than annual members'.
      Casual members use docked bikes far more than annual members.
    What trends or relationships did you find in the data?
      Annual members ride mainly for commuting.
      Casual members prefer docked bikes.
      Annual members prefer electric or classic bikes.
    How will these insights help answer your business questions?
      These insights help to build a profile for each member type.
    

    Share

    Guiding Questions:

    Were you able to answer the question of how ...
    
  9. Replication Data for: Modelling Policy Action Using Natural Language Processing: Evidence for a Long-Run Increase in Policy Activism in the UK

    • search.dataone.org
    Updated Sep 24, 2024
    Cite
    Popa, Mircea (2024). Replication Data for: Modelling Policy Action Using Natural Language Processing: Evidence for a Long-Run Increase in Policy Activism in the UK [Dataset]. http://doi.org/10.7910/DVN/F7CDMQ
    Explore at:
    Dataset updated
    Sep 24, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Popa, Mircea
    Description

    DFM dataframe and R code for replication. Note that lines 1-219 require access to the original pdf/html documents.

  10. Data and R code for "Dolphin social phenotypes vary in response to food...

    • figshare.com
    txt
    Updated Oct 14, 2024
    Cite
    David Fisher; Barbara Cheney (2024). Data and R code for "Dolphin social phenotypes vary in response to food availability but not the North Atlantic Oscillation index" - post correction [Dataset]. http://doi.org/10.6084/m9.figshare.23256845.v3
    Explore at:
    txt
    Dataset updated
    Oct 14, 2024
    Dataset provided by
    figshare
    Authors
    David Fisher; Barbara Cheney
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Two text files containing data and one .R file with R code. These files are sufficient to recreate the analysis found in the manuscript "Dolphin social phenotypes vary in response to food availability but not the North Atlantic Oscillation index", published in Proceedings of the Royal Society B: Biological Sciences in October 2023 and corrected in October 2024 (see below).

    In brief, the data are based on regular observations of bottlenose dolphins (Tursiops truncatus) off the north east coast of Scotland between 1990 and 2021 inclusive. Regular observations of dolphins co-occurring in groups allowed us to infer social associations and to build social networks. We built social networks for each month and each year there were sufficient observations. From each network we calculated three social network measures (strength, weighted clustering coefficient, and closeness), and we then analysed how these traits vary at both the yearly and monthly scale in response to variation in the North Atlantic Oscillation index and to salmon abundance (data obtained from other sources). We upload the dataset both before filtering (suffix "raw", including individuals of unknown sex and with only a few observations per year/month) and after filtering, which is used for the analyses in the paper.

    The correction revolves around the calculation of the social network measure "closeness" using the R package igraph. We determined that this function treats the interaction strengths between individuals as distances or costs, where higher values mean more distant/less well-connected. This interpretation of interaction strengths is opposite to how they are interpreted for most other social network metrics, where higher values indicate closer and more well-connected individuals. The consequence is that the closeness values we analysed in the original version of the article are incorrect, and so the results and conclusions around closeness are erroneous. We then re-calculated closeness using a different R package, tnet, which treats interaction strengths in the manner expected (i.e., higher values mean closer together) and re-ran all analyses involving closeness. See the supporting documentation of the paper for a description of the changes to the results in full.

    "Dol Soc by Env Yearly data tC.txt" is the data frame for the yearly scale analysis, with network metrics per individual per year and environmental variables per year. Columns are:
    dol_name - the unique ID of the dolphin
    year - the year of observation
    sex - sex of the dolphin, 1 = male, 2 = female
    year_nao - the North Atlantic Oscillation index record for that year
    year_fish - the yearly salmon abundance measure
    indiv_str - the individual's strength in that year
    indiv_cc - the individual's weighted clustering coefficient in that year
    indiv_close - the individual's closeness in that year

    "Dol Soc by Env Monthly data tC.txt" is the data frame for the monthly scale analysis, with network metrics per individual per month and environmental variables per month. Columns are:
    dol_name - the unique ID of the dolphin
    year - the year of observation
    month - the month of observation, coded numerically (i.e., April = 4)
    sex - sex of the dolphin, 1 = male, 2 = female
    month_year_nao - the North Atlantic Oscillation index record for that month
    month_year_fish - the monthly salmon abundance measure
    indiv_str - the individual's strength in that month
    indiv_cc - the individual's weighted clustering coefficient in that month
    indiv_close - the individual's closeness in that month

    "Dol Soc by Env Monthly data tC raw.txt" and "Dol Soc by Env Yearly data tC raw.txt" are the above datasets but prior to filtering (see R code).

    "Fisher & Cheney code Dol Soc by Env tC.R" is the R code file to recreate the analyses found in the manuscript (a series of mixed-effect models). We used R version 4.3.1 for the analysis. Note that it requires the packages "glmmTMB" (version 1.1.7) and "car" (version 3.1-2), so they must be installed first. Additionally, you will need to save the following R script: https://github.com/hschielzeth/RandomSlopeR2/blob/master/condR.R and refer to it with the source() command to enable the calculation of conditional repeatabilities.
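    A hedged sketch of reading the yearly file and fitting one mixed-effect model of the kind described above (the file separator and the exact model structure should be taken from the accompanying R script):

    library(glmmTMB)
    yearly <- read.table("Dol Soc by Env Yearly data tC.txt", header = TRUE)
    m_str  <- glmmTMB(indiv_str ~ year_fish + year_nao + sex + (1 | dol_name) + (1 | year),
                      data = yearly)
    summary(m_str)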

  11. Data from: Constraints on trait combinations explain climatic drivers of biodiversity: the importance of trait covariance in community assembly

    • datadryad.org
    • search.dataone.org
    zip
    Updated Apr 27, 2018
    Cite
    John M. Dwyer; Daniel C. Laughlin (2018). Constraints on trait combinations explain climatic drivers of biodiversity: the importance of trait covariance in community assembly [Dataset]. http://doi.org/10.5061/dryad.76kt8
    Explore at:
    zip
    Dataset updated
    Apr 27, 2018
    Dataset provided by
    Dryad
    Authors
    John M. Dwyer; Daniel C. Laughlin
    Time period covered
    Apr 27, 2017
    Description

    quadrat.scale.data: Refer to the R script ("Dwyer_&_Laughlin_2017_Trait_covariance_script.r") for information about this dataframe.

    species.in.quadrat.scale.data: Refer to the R script ("Dwyer_&_Laughlin_2017_Trait_covariance_script.r") for information about this dataframe.

    Dwyer_&_Laughlin_2017_Trait_covariance_script: This script reads in the two dataframes of "raw" data, calculates diversity and trait metrics, and runs the major analyses presented in Dwyer & Laughlin 2017.

  12. FacialRecognition

    • kaggle.com
    zip
    Updated Dec 1, 2016
    Cite
    TheNicelander (2016). FacialRecognition [Dataset]. https://www.kaggle.com/petein/facialrecognition
    Explore at:
    zip (121674455 bytes)
    Dataset updated
    Dec 1, 2016
    Authors
    TheNicelander
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    #https://www.kaggle.com/c/facial-keypoints-detection/details/getting-started-with-r #################################

    ###Variables for downloaded files
    data.dir <- ' '
    train.file <- paste0(data.dir, 'training.csv')
    test.file <- paste0(data.dir, 'test.csv')
    #################################

    ###Load csv -- creates a data.frame matrix where each column can have a different type.
    d.train <- read.csv(train.file, stringsAsFactors = F)
    d.test <- read.csv(test.file, stringsAsFactors = F)

    ###In training.csv, we have 7049 rows, each one with 31 columns.
    ###The first 30 columns are keypoint locations, which R correctly identified as numbers.
    ###The last one is a string representation of the image, identified as a string.

    ###To look at samples of the data, uncomment this line:

    head(d.train)

    ###Let's save the Image column as another variable, and remove it from d.train:
    ###d.train is our dataframe, and we want the column called Image.
    ###Assigning NULL to a column removes it from the dataframe.

    im.train <- d.train$Image
    d.train$Image <- NULL   #removes 'Image' from the dataframe

    im.test <- d.test$Image
    d.test$Image <- NULL    #removes 'Image' from the dataframe

    #################################
    #The image is represented as a series of numbers, stored as a string.
    #Convert these strings to integers by splitting them and converting the result to integer.

    #strsplit splits the string, unlist simplifies its output to a vector of strings,
    #and as.integer converts it to a vector of integers.
    as.integer(unlist(strsplit(im.train[1], " ")))
    as.integer(unlist(strsplit(im.test[1], " ")))

    ###Install and activate appropriate libraries.
    ###The tutorial is meant for Linux and OSX, where they use a different library, so:
    ###Replace all instances of %dopar% with %do%.

    install.packages('foreach')

    library("foreach", lib.loc="~/R/win-library/3.3")

    ###Convert each image string into a row of integers
    im.train <- foreach(im = im.train, .combine=rbind) %do% {
      as.integer(unlist(strsplit(im, " ")))
    }
    im.test <- foreach(im = im.test, .combine=rbind) %do% {
      as.integer(unlist(strsplit(im, " ")))
    }
    #The foreach loop evaluates the inner command for each element of im.train and combines the results with rbind (combine by rows).
    #%do% runs the evaluations sequentially; the original %dopar% would run them in parallel given a registered backend.
    #im.train is now a matrix with 7049 rows (one for each image) and 9216 columns (one for each pixel):

    ###Save all four variables in data.Rd file ###Can reload them at anytime with load('data.Rd')

    save(d.train, im.train, d.test, im.test, file='data.Rd')

    load('data.Rd')

    #Each image is a vector of 96*96 pixels (96*96 = 9216).
    #Convert these 9216 integers into a 96x96 matrix:
    im <- matrix(data=rev(im.train[1,]), nrow=96, ncol=96)

    #im.train[1,] returns the first row of im.train, which corresponds to the first training image.
    #rev reverses the resulting vector to match the interpretation of R's image function
    #(which expects the origin to be in the lower left corner).

    #To visualize the image we use R's image function:
    image(1:96, 1:96, im, col=gray((0:255)/255))

    #Let's color the coordinates for the eyes and nose:
    points(96-d.train$nose_tip_x[1], 96-d.train$nose_tip_y[1], col="red")
    points(96-d.train$left_eye_center_x[1], 96-d.train$left_eye_center_y[1], col="blue")
    points(96-d.train$right_eye_center_x[1], 96-d.train$right_eye_center_y[1], col="green")

    #Another good check is to see how variable our data is.
    #For example, where are the centers of each nose in the 7049 images? (this takes a while to run):
    for(i in 1:nrow(d.train)) {
      points(96-d.train$nose_tip_x[i], 96-d.train$nose_tip_y[i], col="red")
    }

    #There are quite a few outliers -- they could be labeling errors. Looking at one extreme example:
    #In this case there's no labeling error, but this shows that not all faces are centralized.
    idx <- which.max(d.train$nose_tip_x)
    im <- matrix(data=rev(im.train[idx,]), nrow=96, ncol=96)
    image(1:96, 1:96, im, col=gray((0:255)/255))
    points(96-d.train$nose_tip_x[idx], 96-d.train$nose_tip_y[idx], col="red")

    #One of the simplest things to try is to compute the mean of the coordinates of each keypoint
    #in the training set and use that as a prediction for all images:
    colMeans(d.train, na.rm=T)

    #To build a submission file we need to apply these computed coordinates to the test instances:
    p <- matrix(data=colMeans(d.train, na.rm=T), nrow=nrow(d.test), ncol=ncol(d.train), byrow=T)
    colnames(p) <- names(d.train)
    predictions <- data.frame(ImageId = 1:nrow(d.test), p)
    head(predictions)

    #The expected submission format has one keypoint per row, but we can easily get that with the help of the reshape2 library:

    install.packages('reshape2')

    library(...

  13. Time Series Forecasting Using Prophet in R

    • kaggle.com
    zip
    Updated Jul 25, 2023
    Cite
    vikram amin (2023). Time Series Forecasting Using Prophet in R [Dataset]. https://www.kaggle.com/datasets/vikramamin/time-series-forecasting-using-prophet-in-r
    Explore at:
    zip (9000 bytes)
    Dataset updated
    Jul 25, 2023
    Authors
    vikram amin
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description
    • Main objective : To forecast the page visits of a website
    • Tool : Time Series Forecasting using Prophet in R.
    • Steps:
    • Read the data
    • Data Cleaning: checking data types, date formats and missing data
    • Run libraries (dplyr, ggplot2, tidyverse, lubridate, prophet, forecast)
    • Change the Date column from character vector to date and change data format using lubridate package
    • Rename the column "Date" to "ds" and "Visits" to "y".
    • Treat "Christmas" and "Black.Friday" as holiday events. As the data ranges from 2016 to 2020, there will be 5 Christmas and 5 Black Friday days.
    • We will look at the impact on "Visits" from 3 days before to 3 days after Christmas, and from 3 days before to 1 day after Black Friday.
    • We create two data frames called Christmas and Black.Friday and merge the two into a data frame called "holidays".
    • We create train and test data. In both, we select only 3 variables, namely ds, y and Easter. The train data contains dates before 2020-12-01 and the test data contains dates on and after 2020-12-01 (31 days).
    • Train Data
    • Test Data
    • Use the prophet model, which accepts multiple parameters; we go with the defaults. Thereafter, we add the external regressor "Easter".
    • We create the future data frame for forecasting and name it "future". It is built from the model "m" plus the 31 days of the test data. We then predict over this future data frame and create a new data frame called "forecast".
    • The forecast data frame consists of 1827 rows and 34 variables. The external regressor (Easter) value is 0 through the entire time period, which shows that "Easter" has no impact on "Visits".
    • yhat stands for the predicted value (predicted visits).
    • We try to understand the impact of Holiday events "Christmas" and "Black.Friday"
    • We plot the forecast.
    • We plot the forecast: plot(m, forecast)
    • Blue is the predicted value (yhat), black is the actual value (y), and the blue shaded region spans the yhat_lower and yhat_upper values.
    • prophet_plot_components(m, forecast)
    • Trend indicates that page visits remained constant from Jan 2016 to mid-2017, with an upswing from mid-2019 to the end of 2020.
    • From Holidays, we can see that Christmas had a negative effect on page visits, whereas Black Friday had a positive effect.
    • Weekly seasonality indicates that page visits are highest from Monday to Thursday and start going down thereafter.
    • Yearly seasonality indicates that page visits are highest in April and then decline, reaching the bottom in October.
    • The external regressor "Easter" has no impact on page visits.
    • plot(m,forecast) + add_changepoints_to_plot(m)
    • The trend, indicated by the red line, starts moving upwards from mid-2019 onwards.
    • We check for acc...
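    A hedged R sketch of the workflow described above (the input file, date format and holiday dates are placeholders; the dataset's own screenshots and notebook are authoritative):

    library(prophet)
    library(dplyr)

    visits <- read.csv("website_visits.csv") %>%                 # placeholder file name
      rename(ds = Date, y = Visits) %>%
      mutate(ds = as.Date(ds))                                   # adjust to the file's date format

    christmas <- data.frame(holiday = "Christmas",
                            ds = as.Date(paste0(2016:2020, "-12-25")),
                            lower_window = -3, upper_window = 3)
    black_friday <- data.frame(holiday = "Black.Friday",
                               ds = as.Date(c("2016-11-25", "2017-11-24", "2018-11-23",
                                              "2019-11-29", "2020-11-27")),
                               lower_window = -3, upper_window = 1)
    holidays <- bind_rows(christmas, black_friday)

    train <- filter(visits, ds < as.Date("2020-12-01"))          # train must also carry an 'Easter' column
    m <- prophet(holidays = holidays)
    m <- add_regressor(m, "Easter")
    m <- fit.prophet(m, train)

    future <- make_future_dataframe(m, periods = 31)             # 31 days of test data
    # future$Easter <- ...  supply the regressor values for the future frame as well
    forecast <- predict(m, future)
    plot(m, forecast)
    prophet_plot_components(m, forecast)
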
  14. Market Basket Analysis

    • kaggle.com
    zip
    Updated Dec 9, 2021
    Cite
    Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
    Explore at:
    zip (23875170 bytes)
    Dataset updated
    Dec 9, 2021
    Authors
    Aslan Ahmedov
    Description

    Market Basket Analysis

    Market basket analysis with Apriori algorithm

    The retailer wants to target customers with suggestions for itemsets they are most likely to purchase. I was given a dataset containing a retailer's transaction data, which covers all transactions over a period of time. The retailer will use the results to grow the business and to make itemset suggestions to customers, so that we can increase customer engagement, improve customer experience and identify customer behavior. I will solve this problem using association rules, an unsupervised learning technique that checks for the dependency of one data item on another data item.

    Introduction

    Association rules are most used when you want to find associations between objects in a set, i.e., frequent patterns in a transaction database. They can tell you which items customers frequently buy together, and they allow the retailer to identify relationships between the items.

    An Example of Association Rules

    Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat": - support = P(mouse & mat) = 8/100 = 0.08 - confidence = support/P(computer mouse) = 0.08/0.10 = 0.8 - lift = confidence/P(mouse mat) = 0.8/0.09 ≈ 8.9. This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.

    Strategy

    • Data Import
    • Data Understanding and Exploration
    • Transformation of the data, so that it is ready to be consumed by the association rules algorithm
    • Running association rules
    • Exploring the rules generated
    • Filtering the generated rules
    • Visualization of Rule

    Dataset Description

    • File name: Assignment-1_Data
    • List name: retaildata
    • File format: . xlsx
    • Number of Row: 522065
    • Number of Attributes: 7

      • BillNo: 6-digit number assigned to each transaction. Nominal.
      • Itemname: Product name. Nominal.
      • Quantity: The quantities of each product per transaction. Numeric.
      • Date: The day and time when each transaction was generated. Numeric.
      • Price: Product price. Numeric.
      • CustomerID: 5-digit number assigned to each customer. Nominal.
      • Country: Name of the country where each customer resides. Nominal.


    Libraries in R

    First, we need to load the required libraries. Each library is briefly described below.

    • arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
    • arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.
    • tidyverse - The tidyverse is an opinionated collection of R packages designed for data science.
    • readxl - Read Excel Files in R.
    • plyr - Tools for Splitting, Applying and Combining Data.
    • ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
    • knitr - Dynamic Report generation in R.
    • magrittr- Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.
    • dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.
    • tidyverse - This package is designed to make it easy to install and load multiple 'tidyverse' packages in a single step.


    Data Pre-processing

    Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.


    Next we clean our data frame and remove missing values.


    To apply association rule mining, we need to convert the dataframe into transaction data, so that all items bought together in one invoice will be in ...
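    A hedged sketch of this pre-processing and of running Apriori with arules (the aggregation key and the thresholds are illustrative, not the exact values used in the write-up):

    library(readxl)
    library(dplyr)
    library(arules)

    retail <- read_excel("Assignment-1_Data.xlsx") %>%
      filter(!is.na(Itemname), !is.na(BillNo))            # drop rows with missing values

    baskets <- split(retail$Itemname, retail$BillNo)      # items bought together in one invoice
    trans   <- as(baskets, "transactions")

    rules <- apriori(trans, parameter = list(supp = 0.01, conf = 0.5, minlen = 2))
    inspect(head(sort(rules, by = "lift"), 10))           # strongest rules by lift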

  15. University SET data with faculty and course characteristics from a university in Poland

    • openicpsr.org
    Updated Mar 27, 2022
    + more versions
    Cite
    Krzysztof Rybinski (2022). University SET data with faculty and course characteristics from a university in Poland [Dataset]. http://doi.org/10.3886/E166061V1
    Explore at:
    Dataset updated
    Mar 27, 2022
    Dataset provided by
    Vistula University
    Authors
    Krzysztof Rybinski
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Poland
    Description

    This is a unique dataset of all the SET ratings provided by students of one university in Poland at the end of the winter semester of the 2020/2021 academic year. The SET questionnaire used by this university and the variables' descriptions are provided in the Data_description Word file. The data is aggregated at the teacher/course level, with 1,021 data points and 29 variables. The data file is in the Rdata format; use the R load() function. The name of the dataframe is "dat".
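    A minimal sketch for loading the file (the file name is a placeholder; the data frame is called "dat" as stated above):

    load("SET_Poland.Rdata")
    str(dat)    # 1,021 rows, 29 variables at the teacher/course level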

  16. Lightning NOx Emissions in CMAQ Data

    • catalog.data.gov
    • s.cnmilf.com
    Updated Sep 9, 2023
    Cite
    U.S. EPA Office of Research and Development (ORD) (2023). Lightning NOx Emissions in CMAQ Data [Dataset]. https://catalog.data.gov/dataset/lightning-nox-emissions-in-cmaq-data
    Explore at:
    Dataset updated
    Sep 9, 2023
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    Metadata for the dataset for “Assessing the Impact of Lightning NOx Emissions in CMAQ Using Lightning Flash Data from WWLLN over the Contiguous United States”.

    Figure 2: ThreeYear_NLDN2WWLLN_byNOAAcr_Region_anal.xlsx. The variable names are self-explanatory and the original figure is included.

    Figure 3: NLDN_flash_Monthly_mean_2016_07.ncf.gz, WWLLN_flash_Monthly_mean_2016_07.ncf.gz, WWLLNs_flash_Monthly_mean_2016_07.ncf.gz. These netCDF files contain the monthly mean values of gridded lightning flash rate for all the cases; the figure can be created using any netCDF visualization tool (such as VERDI) or statistical package (such as R).

    Figures 4, 5, 6: CMAQ_*_.rds.gz files. These files contain the paired observation-model O3 concentrations from all the model cases for hourly, daily max-8hr, and other statistics. The rds datasets can be read into R as data frames to make these figures.

    Figures 7 & 8: CCTM_CONC*.nc.gz. The vertical profiles (CONC) contain the model data to make Figures 7 and 8, while the observation data are available publicly.

    Figure 9: NADP_v532_intel18_0_2016_CONUS_.csv.

    Figure 10: avg_DEP_concentrations.nc.gz. These files contain the monthly mean wet deposition of NO3.

    Figure 11: NADP_v532_intel18_0_2016_CONUS_.csv.

    Figure 12: DDEP_TNO3_.nc.gz. These files contain hourly dry deposition of TNO3 over the CONUS domain.
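    A hedged sketch for reading two of the file types above in R (file names are placeholders; decompress the .gz archives first):

    paired <- readRDS("CMAQ_hourly.rds")         # paired observation-model O3 as a data frame
    head(paired)

    library(ncdf4)                               # or inspect the netCDF files with VERDI
    nc    <- nc_open("WWLLN_flash_Monthly_mean_2016_07.ncf")
    flash <- ncvar_get(nc, names(nc$var)[1])     # gridded monthly-mean lightning flash rate
    nc_close(nc)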

  17. Effect of data source on estimates of regional bird richness in northeastern United States

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated May 4, 2021
    Cite
    Roi Ankori-Karlinsky; Ronen Kadmon; Michael Kalyuzhny; Katherine F. Barnes; Andrew M. Wilson; Curtis Flather; Rosalind Renfrew; Joan Walsh; Edna Guk (2021). Effect of data source on estimates of regional bird richness in northeastern United States [Dataset]. http://doi.org/10.5061/dryad.m905qfv0h
    Explore at:
    zip
    Dataset updated
    May 4, 2021
    Dataset provided by
    Columbia University
    Gettysburg College
    Massachusetts Audubon Society
    Agricultural Research Service
    New York State Department of Environmental Conservation
    University of Vermont
    Hebrew University of Jerusalem
    University of Michigan
    Authors
    Roi Ankori-Karlinsky; Ronen Kadmon; Michael Kalyuzhny; Katherine F. Barnes; Andrew M. Wilson; Curtis Flather; Rosalind Renfrew; Joan Walsh; Edna Guk
    License

    https://spdx.org/licenses/CC0-1.0.html

    Area covered
    Northeastern United States, United States
    Description

    Standardized data on large-scale and long-term patterns of species richness are critical for understanding the consequences of natural and anthropogenic changes in the environment. The North American Breeding Bird Survey (BBS) is one of the largest and most widely used sources of such data, but so far, little is known about the degree to which BBS data provide accurate estimates of regional richness. Here we test this question by comparing estimates of regional richness based on BBS data with spatially and temporally matched estimates based on state Breeding Bird Atlases (BBA). We expected that estimates based on BBA data would provide a more complete (and therefore, more accurate) representation of regional richness due to their larger number of observation units and higher sampling effort within the observation units. Our results were only partially consistent with these predictions: while estimates of regional richness based on BBA data were higher than those based on BBS data, estimates of local richness (number of species per observation unit) were higher in BBS data. The latter result is attributed to higher land-cover heterogeneity in BBS units and higher effectiveness of bird detection (more species are detected per unit time). Interestingly, estimates of regional richness based on BBA blocks were higher than those based on BBS data even when differences in the number of observation units were controlled for. Our analysis indicates that this difference was due to higher compositional turnover between BBA units, probably due to larger differences in habitat conditions between BBA units and a larger number of geographically restricted species. Our overall results indicate that estimates of regional richness based on BBS data suffer from incomplete detection of a large number of rare species, and that corrections of these estimates based on standard extrapolation techniques are not sufficient to remove this bias. Future applications of BBS data in ecology and conservation, and in particular, applications in which the representation of rare species is important (e.g., those focusing on biodiversity conservation), should be aware of this bias, and should integrate BBA data whenever possible.

    Methods Overview

    This is a compilation of second-generation breeding bird atlas (BBA) data and corresponding breeding bird survey (BBS) data. It contains presence-absence breeding bird observations in 5 U.S. states (MA, MI, NY, PA, VT), sampling effort per sampling unit, geographic location of sampling units, and environmental variables per sampling unit: elevation and elevation range (from SRTM), mean annual precipitation and mean summer temperature (from PRISM), and NLCD 2006 land-use data.

    Each row contains all observations for one sampling unit. Additional tables contain information on the effect of sampling effort on richness, a species rareness table per dataset, and two summary tables covering bird diversity and environmental variables.

    The methods for compilation are contained in the supplementary information of the manuscript but also here:

    Bird data

    For BBA data, shapefiles for blocks and the data on species presences and sampling effort in blocks were received from the atlas coordinators. For BBS data, shapefiles for routes and raw species data were obtained from the Patuxent Wildlife Research Center (https://databasin.org/datasets/02fe0ebbb1b04111b0ba1579b89b7420 and https://www.pwrc.usgs.gov/BBS/RawData).

    Using ArcGIS Pro© 10.0, species observations were joined to respective BBS and BBA observation units shapefiles using the Join Table tool. For both BBA and BBS, a species was coded as either present (1) or absent (0). Presence in a sampling unit was based on codes 2, 3, or 4 in the original volunteer birding checklist codes (possible breeder, probable breeder, and confirmed breeder, respectively), and absence was based on codes 0 or 1 (not observed and observed but not likely breeding). Spelling inconsistencies of species names between BBA and BBS datasets were fixed. Species that needed spelling fixes included Brewer’s Blackbird, Cooper’s Hawk, Henslow’s Sparrow, Kirtland’s Warbler, LeConte’s Sparrow, Lincoln’s Sparrow, Swainson’s Thrush, Wilson’s Snipe, and Wilson’s Warbler. In addition, naming conventions were matched between BBS and BBA data. The Alder and Willow Flycatchers were lumped into Traill’s Flycatcher and regional races were lumped into a single species column: Dark-eyed Junco regional types were lumped together into one Dark-eyed Junco, Yellow-shafted Flicker was lumped into Northern Flicker, Saltmarsh Sparrow and the Saltmarsh Sharp-tailed Sparrow were lumped into Saltmarsh Sparrow, and the Yellow-rumped Myrtle Warbler was lumped into Myrtle Warbler (currently named Yellow-rumped Warbler). Three hybrid species were removed: Brewster's and Lawrence's Warblers and the Mallard x Black Duck hybrid. Established “exotic” species were included in the analysis since we were concerned only with detection of richness and not of specific species.

    The resultant species tables with sampling effort were pivoted horizontally so that every row was a sampling unit and each species observation was a column. This was done for each state using R version 3.6.2 (R Foundation for Statistical Computing, 2019), and all state tables were merged to yield one BBA and one BBS dataset. Following the joining of environmental variables to these datasets (see below), BBS and BBA data were joined using rbind.data.frame in R to yield a final dataset with all species observations and environmental variables for each observation unit.
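
    The dataset description does not name the R functions used for this reshaping step; a minimal sketch under the assumption that tidyr is used could look like this:

        library(tidyr)
        library(dplyr)

        # Hypothetical long-format observations: one row per (sampling unit, species).
        obs <- data.frame(
          unit_id = c("NY_001", "NY_001", "NY_002"),
          species = c("Northern Flicker", "Wood Thrush", "Northern Flicker"),
          present = 1
        )

        # Pivot so every row is a sampling unit and every species is a 0/1 column.
        wide <- obs %>%
          pivot_wider(names_from = species, values_from = present, values_fill = 0)

        # State tables built this way can then be stacked, e.g.
        # all_bba <- rbind.data.frame(wide_NY, wide_PA, wide_MA, wide_VT, wide_MI)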

    Environmental data

    Using ArcGIS Pro© 10.0, all environmental raster layers, BBA and BBS shapefiles, and the species observations were integrated in a common coordinate system (North_America Equidistant_Conic) using the Project tool. For BBS routes, 400m buffers were drawn around each route using the Buffer tool. The observation unit shapefiles for all states were merged (separately for BBA blocks and BBS routes and 400m buffers) using the Merge tool to create a study-wide shapefile for each data source. Whether or not a BBA block was adjacent to a BBS route was determined using the Intersect tool based on a radius of 30m around the route buffer (to fit the NLCD map resolution). Area and length of the BBS route inside the proximate BBA block were also calculated. Mean values for annual precipitation and summer temperature, and mean and range for elevation, were extracted for every BBA block and 400m buffer BBS route using Zonal Statistics as Table tool. The area of each land-cover type in each observation unit (BBA block and BBS buffer) was calculated from the NLCD layer using the Zonal Histogram tool.
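
    The zonal statistics were computed in ArcGIS; an equivalent step can be sketched in R with the terra package (the layer and shapefile names below are illustrative, not files from this dataset):

        library(terra)

        elev   <- rast("srtm_elevation.tif")      # illustrative elevation raster
        blocks <- vect("bba_blocks.shp")          # illustrative BBA block shapefile
        blocks <- project(blocks, crs(elev))      # bring both layers into a common CRS

        # Mean elevation per BBA block (analogous to the Zonal Statistics as Table tool).
        zstats <- extract(elev, blocks, fun = mean, na.rm = TRUE)
        head(zstats)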

  18. Data from: HomeRange: A global database of mammalian home ranges

    • search.dataone.org
    • data.niaid.nih.gov
    • +1more
    Updated Jul 15, 2025
    Cite
    Maarten Broekman; Selwyn Hoeks; Rosa Freriks; Merel Langendoen; Katharina Runge; Ecaterina Savenco; Ruben ter Harmsel; Mark Huijbregts; Marlee Tucker (2025). HomeRange: A global database of mammalian home ranges [Dataset]. http://doi.org/10.5061/dryad.d2547d85x
    Explore at:
    Dataset updated
    Jul 15, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Maarten Broekman; Selwyn Hoeks; Rosa Freriks; Merel Langendoen; Katharina Runge; Ecaterina Savenco; Ruben ter Harmsel; Mark Huijbregts; Marlee Tucker
    Time period covered
    Jan 1, 2022
    Description

    Motivation: Home range is a common measure of animal space use as it provides ecological information that is useful for conservation applications. In macroecological studies, values are typically aggregated to species means to examine general patterns of animal space use. However, this ignores the environmental context in which the home range was estimated and does not account for intraspecific variation in home range size. In addition, the focus of macroecological studies on home ranges has been historically biased toward terrestrial mammals. The use of aggregated numbers and terrestrial focus limits our ability to examine home range patterns across different environments, variation in time and between different levels of organisation. Here we introduce HomeRange, a global database with 75,611 home-range values across 960 different mammal species, including terrestrial, as well as aquatic and aerial species. Main types of variable contained: The dataset contains mammal home-range estim...

    Mammalian home range papers were compiled via an extensive literature search. All home range values were extracted from the literature, including individual, group and population-level home range values. Associated values were also compiled, including species names, methodological information on data collection, home-range estimation method, period of data collection, study coordinates and name of location, as well as species traits derived from the studies, such as body mass, life stage, reproductive status and locomotor habit. Here we include the database, associated metadata and a reference list of all sources from which the home range data were extracted.

    We also provide an R package, which can be installed from https://github.com/SHoeks/HomeRange. The HomeRange R package provides functions for downloading the latest version of the HomeRange database and loading it as a standard dataframe into R, plotting several statistics of the database, and finally attaching species traits (e.g. species average body mass, trophic level) from the CO...
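
    A minimal installation sketch (the data-loading call below is a placeholder; the actual function names are documented in the package README at the GitHub link above):

        # Install the HomeRange package from GitHub.
        install.packages("remotes")
        remotes::install_github("SHoeks/HomeRange")
        library(HomeRange)

        # Placeholder call: the package exposes a function that downloads the latest
        # database release and returns it as a data frame -- see the README for its name.
        # hr <- GetHomeRangeData()
        # str(hr)   # ~75,611 home-range values across 960 mammal species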

  19. Reddit Machine Learning

    • kaggle.com
    zip
    Updated Oct 16, 2020
    Cite
    Tyler Landowski (2020). Reddit Machine Learning [Dataset]. https://www.kaggle.com/fishboi/redditmachinelearning
    Explore at:
    zip(145094611 bytes)Available download formats
    Dataset updated
    Oct 16, 2020
    Authors
    Tyler Landowski
    License

    https://www.reddit.com/wiki/api

    Description

    What is it?

    This is a collection of scraped data from the MachineLearning subreddit:

    • From the subreddit's beginning in 2009 until the end of February 2020
    • All submissions
    • All comments
    • No images

    All data was collected using the Pushshift API. Not all features of comments and submissions are included, only the ones most likely to be useful; view a JSON file to see what's included.

    Some .pickle files are included in case you have a use for them.

    .pickle Files

    These are Pandas DataFrame pickle exports (protocol 3). Unlike the .json files, posts that are [deleted] or [removed] have been dropped. In addition, the comments dataframe includes the number of direct replies and the number of total replies (all comments in the subtree of a comment).
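
    A minimal sketch, assuming the reticulate package and a Python environment with pandas, for reading one of the pickles from R (the file name is illustrative):

        library(reticulate)

        pd <- import("pandas")
        comments <- pd$read_pickle("comments.pickle")   # reticulate converts the DataFrame to an R data frame
        head(comments)
        # The reply counts described above appear as ordinary columns of this data frame.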

    Bonus Files

    Additionally, some feature engineering was applied to produce submissions_fe and comments_fe. These may or may not be useful, but feel free to experiment with them; the feature-engineered files include columns counting occurrences of important machine-learning terms found in the text.

  20. Replication files for "Integrating biodiversity: A longitudinal and cross-sectoral analysis of Swiss politics"

    • data.europa.eu
    • envidat.ch
    unknown
    Updated Mar 22, 2022
    Cite
    EnviDat (2022). Replication files for "Integrating biodiversity: A longitudinal and cross-sectoral analysis of Swiss politics" [Dataset]. https://data.europa.eu/data/datasets/9a78b620-82fd-4b38-a1c5-c04e95913cc1-envidat?locale=sl
    Explore at:
    unknown(1830842700)Available download formats
    Dataset updated
    Mar 22, 2022
    Dataset authored and provided by
    EnviDat
    License

    http://dcat-ap.ch/vocabulary/licenses/terms_by

    Area covered
    Switzerland
    Description

    Introduction

    The ZIP file contains all data and code to replicate the analyses reported in the following paper.

    Reber, U., Fischer, M., Ingold, K., Kienast, F., Hersperger, A. M., Grütter, R., & Benz, R. (2022). Integrating biodiversity: A longitudinal and cross-sectoral analysis of Swiss politics. Policy Sciences. https://doi.org/10.1007/s11077-022-09456-4

    If you use any of the material included in this repository, please refer to the paper. If you use (parts of) the text corpus, please also refer to the sources used for its compilation listed below. The content of the texts may not be changed.

    Data folder

    The data folder contains the following files.

    • corpus.parquet: Text corpus of Swiss policy documents
    • dict_de.csv: Biodiversity dictionary (German)
    • dict_fr.csv: Biodiversity dictionary (French)
    • dict_it.csv: Biodiversity dictionary (Italian)
    • topic_labels.csv: Labels/codes for policy sectors
    • topics.csv: Labels/codes for policy sectors

    The corpus and the dictionary were compiled by the authors specifically for this project. The labels/codes for policy sectors are based on the coding scheme of the Swiss Parliament.

    Text corpus

    The text corpus consists of 439,984 Swiss policy documents in German, French, and Italian from 1999 to 2018. The corpus was compiled from the following sources between 2020-10-01 and 2021-01-31.

    • Transcripts and parliamentary businesses (e.g. questions, motions, parliamentary initiatives) via the Web Services (WS) provided by the Swiss Parliament
    • The official compilation of federal legislation ("Amtliche Sammlung", AS) via opendata.swiss provided by the Swiss Federal Archives (SFA)
    • The federal gazette ("Bundesblatt") via fedlex.admin.ch
    • Decisions of federal courts via entscheidsuche.ch (ES)

    The corpus is stored as a Parquet file (corpus.parquet) containing a single data frame for use with R. The data frame has the following structure; a minimal loading sketch follows the list.

    • text_id: Unique identifier for each text (source information as prefix, e.g. "t_")
    • doc_type: Document type (see coding scheme below)
    • branch: Government branch (1 legislative, 2 executive, 3 judicial)
    • stage: Stage of policy process (1 drafting, 2 introduction, 3 interpretation)
    • year: Year of publication
    • topic: Policy sector (coding scheme in separate file in data folder)
    • lang: Language (de, fr, it)
    • text: Text
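
    A minimal loading sketch, assuming the arrow and dplyr packages, for reading the corpus and summarizing it by sector and language:

        library(arrow)
        library(dplyr)

        corpus <- read_parquet("corpus.parquet")

        corpus %>% count(topic, lang)   # documents per policy sector and language
        table(corpus$doc_type)          # distribution over the coding scheme below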

    The following list contains the coding scheme for the doc_type variable.

    • 101: Federal gazette // Draft for public consultation ("Vernehmlassungsverfahren")
    • 102: Federal gazette // Explanation of draft for parliament ("Botschaft")
    • 103: Federal gazette // Strategy, action plan
    • 104: Federal gazette // Federal council decree ("Bundesratsbeschluss")
    • 105: Federal gazette // (Simple) Federal decree ("(Einfacher) Bundesbeschluss")
    • 106: Federal gazette // General decree ("Allgemeinverfügung")
    • 107: Federal gazette // Treaty ("Übereinkommen")
    • 108: Federal gazette // Treaty ("Abkommen")
    • 109: Federal gazette // Draft for parliament ("Entwurf")
    • 110: Federal gazette // Report ("Bericht")
    • 111: Federal gazette // Report of parliamentary commission ("Bericht")
    • 112: Federal gazette // Report of federal council ("Bericht")
    • 201: Parl. businesses // Submitted text
    • 202: Parl. businesses // Reason text
    • 203: Parl. businesses // Federal council response
    • 204: Parl. businesses // Initial situation
    • 205: Parl. businesses // Proceedings
    • 301: Parl. transcripts // Speech of MP
    • 302: Parl. transcripts // Speech of federal council
    • 401: Federal legislation // Legal text of the official compilation (law, ordinances, etc.)
    • 501: Court decisions // Federal Supreme Court
    • 502: Court decisions // Federal Criminal Court
    • 503: Court decisions // Federal Administrative Court

    Code folder

    The code folder contains all R code for the analyses. The files are numbered chronologically.

    • 1_classifier_training.R: Training of classifiers for the classification of policy sectors
    • 2_classifier_application.R: Classification of documents in the corpus
    • 3_dictionary_application.R: Biodiversity indexing of documents in the corpus
    • 4_stm_truncation.R: Truncation of indexed documents to keep only relevant parts
    • 5_stm_translation.R: Translation of FR and IT documents to DE
    • 6_stm_model.R: Preprocessing and structural topic model
    • 7_plots.R: Plots and numbers as included in the paper

    The code/functions folder contains custom functions used in the scripts, e.g. to support topic model interpretation.

    Package versions and setup details are noted in the code files.

    Contact

    Please direct any questions to the dataset authors.
