41 datasets found
  1. Tennessee Eastman Process Simulation Dataset

    • kaggle.com
    zip
    Updated Feb 9, 2020
    Cite
    Sergei Averkiev (2020). Tennessee Eastman Process Simulation Dataset [Dataset]. https://www.kaggle.com/averkij/tennessee-eastman-process-simulation-dataset
    Explore at:
    zip (1370814903 bytes)
    Dataset updated
    Feb 9, 2020
    Authors
    Sergei Averkiev
    Description

    Intro

    This dataverse contains the data referenced in Rieth et al. (2017). Issues and Advances in Anomaly Detection Evaluation for Joint Human-Automated Systems. To be presented at Applied Human Factors and Ergonomics 2017.

    Content

    Each .RData file is an external representation of an R dataframe that can be read into an R environment with the 'load' function. The variables loaded are named ‘fault_free_training’, ‘fault_free_testing’, ‘faulty_testing’, and ‘faulty_training’, corresponding to the RData files.

    Each dataframe contains 55 columns:

    Column 1 ('faultNumber') ranges from 1 to 20 in the “Faulty” datasets and represents the fault type in the TEP. The “FaultFree” datasets only contain fault 0 (i.e. normal operating conditions).

    Column 2 ('simulationRun') ranges from 1 to 500 and represents a different random number generator state from which a full TEP dataset was generated (Note: the actual seeds used to generate training and testing datasets were non-overlapping).

    Column 3 ('sample') ranges either from 1 to 500 (“Training” datasets) or 1 to 960 (“Testing” datasets). The TEP variables (columns 4 to 55) were sampled every 3 minutes for a total duration of 25 hours and 48 hours respectively. Note that the faults were introduced 1 and 8 hours into the Faulty Training and Faulty Testing datasets, respectively.

    Columns 4 to 55 contain the process variables; the column names retain the original variable names.
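    As a quick illustration, here is a minimal R sketch for working with these files (the file name is a placeholder for whichever .RData file you downloaded):

    # Load one of the .RData files; this creates the data frame named above,
    # e.g. 'faulty_training' for the faulty training file.
    load("faulty_training.RData")

    # All samples for fault 1, simulation run 1, in time order
    run1 <- subset(faulty_training, faultNumber == 1 & simulationRun == 1)
    run1 <- run1[order(run1$sample), ]
    dim(run1)   # a "Training" file should give 500 rows x 55 columns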

    Acknowledgements

    This work was sponsored by the Office of Naval Research, Human & Bioengineered Systems (ONR 341), program officer Dr. Jeffrey G. Morrison under contract N00014-15-C-5003. The views expressed are those of the authors and do not reflect the official policy or position of the Office of Naval Research, Department of Defense, or US Government.

    User Agreement

    By accessing or downloading the data or work provided here, you, the User, agree that you have read this agreement in full and agree to its terms.

    The person who owns, created, or contributed a work to the data or work provided here dedicated the work to the public domain and has waived his or her rights to the work worldwide under copyright law. You can copy, modify, distribute, and perform the work, for any lawful purpose, without asking permission.

    In no way are the patent or trademark rights of any person affected by this agreement, nor are the rights that any other person may have in the work or in how the work is used, such as publicity or privacy rights.

    Pacific Science & Engineering Group, Inc., its agents and assigns, make no warranties about the work and disclaim all liability for all uses of the work, to the fullest extent permitted by law.

    When you use or cite the work, you shall not imply endorsement by Pacific Science & Engineering Group, Inc., its agents or assigns, or by another author or affirmer of the work.

    This Agreement may be amended, and the use of the data or work shall be governed by the terms of the Agreement at the time that you access or download the data or work from this Website.

  2. Additional Tennessee Eastman Process Simulation Data for Anomaly Detection Evaluation

    • dataverse.harvard.edu
    • dataone.org
    Updated Jul 6, 2017
    + more versions
    Cite
    Cory A. Rieth; Ben D. Amsel; Randy Tran; Maia B. Cook (2017). Additional Tennessee Eastman Process Simulation Data for Anomaly Detection Evaluation [Dataset]. http://doi.org/10.7910/DVN/6C3JR1
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 6, 2017
    Dataset provided by
    Harvard Dataverse
    Authors
    Cory A. Rieth; Ben D. Amsel; Randy Tran; Maia B. Cook
    License

    https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/6C3JR1

    Description

    User Agreement, Public Domain Dedication, and Disclaimer of Liability. By accessing or downloading the data or work provided here, you, the User, agree that you have read this agreement in full and agree to its terms.

    The person who owns, created, or contributed a work to the data or work provided here dedicated the work to the public domain and has waived his or her rights to the work worldwide under copyright law. You can copy, modify, distribute, and perform the work, for any lawful purpose, without asking permission.

    In no way are the patent or trademark rights of any person affected by this agreement, nor are the rights that any other person may have in the work or in how the work is used, such as publicity or privacy rights. Pacific Science & Engineering Group, Inc., its agents and assigns, make no warranties about the work and disclaim all liability for all uses of the work, to the fullest extent permitted by law. When you use or cite the work, you shall not imply endorsement by Pacific Science & Engineering Group, Inc., its agents or assigns, or by another author or affirmer of the work. This Agreement may be amended, and the use of the data or work shall be governed by the terms of the Agreement at the time that you access or download the data or work from this Website.

    Description

    This dataverse contains the data referenced in Rieth et al. (2017). Issues and Advances in Anomaly Detection Evaluation for Joint Human-Automated Systems. To be presented at Applied Human Factors and Ergonomics 2017.

    Each .RData file is an external representation of an R dataframe that can be read into an R environment with the 'load' function. The variables loaded are named ‘fault_free_training’, ‘fault_free_testing’, ‘faulty_testing’, and ‘faulty_training’, corresponding to the RData files.

    Each dataframe contains 55 columns:

    Column 1 ('faultNumber') ranges from 1 to 20 in the “Faulty” datasets and represents the fault type in the TEP. The “FaultFree” datasets only contain fault 0 (i.e. normal operating conditions).

    Column 2 ('simulationRun') ranges from 1 to 500 and represents a different random number generator state from which a full TEP dataset was generated (Note: the actual seeds used to generate training and testing datasets were non-overlapping).

    Column 3 ('sample') ranges either from 1 to 500 (“Training” datasets) or 1 to 960 (“Testing” datasets). The TEP variables (columns 4 to 55) were sampled every 3 minutes for a total duration of 25 hours and 48 hours respectively. Note that the faults were introduced 1 and 8 hours into the Faulty Training and Faulty Testing datasets, respectively.

    Columns 4 to 55 contain the process variables; the column names retain the original variable names.

    Acknowledgments. This work was sponsored by the Office of Naval Research, Human & Bioengineered Systems (ONR 341), program officer Dr. Jeffrey G. Morrison under contract N00014-15-C-5003. The views expressed are those of the authors and do not reflect the official policy or position of the Office of Naval Research, Department of Defense, or US Government.

  3. Data Mining Project - Boston

    • kaggle.com
    zip
    Updated Nov 25, 2019
    Cite
    SophieLiu (2019). Data Mining Project - Boston [Dataset]. https://www.kaggle.com/sliu65/data-mining-project-boston
    Explore at:
    zip (59313797 bytes)
    Dataset updated
    Nov 25, 2019
    Authors
    SophieLiu
    Area covered
    Boston
    Description

    Context

    To make this a seamless process, I cleaned the data and deleted many variables that I thought were not important to our dataset. I then uploaded all of those files to Kaggle for each of you to download. The rideshare_data file has both Lyft and Uber, but it is still a cleaned version of the dataset we downloaded from Kaggle.

    Use of Data Files

    You can easily subset the data into the car types that you will be modeling by first loading the csv into R, here is the code for how you do this:

    This loads the file into R

    df<-read.csv('uber.csv')

    The next code subsets the data into specific car types. The example below keeps only the Uber 'Black' car type.

    df_black<-subset(df, df$name == 'Black')

    To reuse this subset later, write the dataframe to a csv file on your computer:

    write.csv(df_black, "nameofthefileyouwanttosaveas.csv")

    The file will appear in your working directory. If you are not sure where your working directory is, run this code:

    getwd()

    The output will be the file path to your working directory. You will find the file you just created in that folder.
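    Putting the pieces together, a hedged end-to-end sketch of the steps above (file names are placeholders; adjust them to wherever you saved the Kaggle csv files):

    df <- read.csv("uber.csv", stringsAsFactors = FALSE)      # load the Uber rides
    df_black <- subset(df, df$name == "Black")                # keep only the 'Black' car type
    write.csv(df_black, "uber_black.csv", row.names = FALSE)  # save the subset for later use
    getwd()                                                   # shows the folder the csv was written to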

    Inspiration

    Your data will be in front of the world's largest data science community. What questions do you want to see answered?

  4. Need for speed: Short lifespan selects for increased learning ability - Data

    • datadryad.org
    • search.dataone.org
    zip
    Updated Oct 25, 2019
    Cite
    Jannis Liedtke; Lutz Fromhage (2019). Need for speed: Short lifespan selects for increased learning ability - Data [Dataset]. http://doi.org/10.5061/dryad.k0p2ngf43
    Explore at:
    zip
    Dataset updated
    Oct 25, 2019
    Dataset provided by
    Dryad
    Authors
    Jannis Liedtke; Lutz Fromhage
    Time period covered
    Oct 9, 2019
    Description

    The first dataset provides the R code for the simulation.

    The other datasets give trait values of "learning speed" ("L") for each individual in each generation (1-200) for different lifespans (season lengths), from season length 1 to 800.

    One dataframe ("Metapop") provides the results of all 10 runs and gives the mean trait value ("L") for a given sl (1-800) for each run.

    One dataframe ("Meta118") provides the results of all 10 runs and gives the individual trait values ("L", "Picky") and the individual scores for the collected number of resource items ("Colsum") and the summed value of all collected resources ("sumS") for a given sl = 118 for each run.

    The dataframes can be loaded into R with, e.g.:

    df<-read.csv(".../dfL_1")
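    A hedged sketch of how the "Metapop" dataframe could be summarised once read in (the file name and the column names "sl" and "L" are assumptions based on the description above; adjust to the actual files in the archive):

    metapop <- read.csv("Metapop.csv")                   # mean trait value per run and season length
    mean_L  <- aggregate(L ~ sl, data = metapop, mean)   # average over the 10 runs
    plot(mean_L$sl, mean_L$L, type = "l",
         xlab = "Season length (sl)", ylab = "Mean learning speed (L)")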

  5. Dispa-SET Output files for the JRC report "Power System Flexibility in a variable climate"

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jul 19, 2024
    Cite
    De Felice, Matteo (2024). Dispa-SET Output files for the JRC report "Power System Flexibility in a variable climate" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3778132
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    JRC
    Authors
    De Felice, Matteo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Here you can find the model results of the report:

    De Felice, M., Busch, S., Kanellopoulos, K., Kavvadias, K. and Hidalgo Gonzalez, I., Power system flexibility in a variable climate, EUR 30184 EN, Publications Office of the European Union, Luxembourg, 2020, ISBN 978-92-76-18183-5 (online), doi:10.2760/75312 (online), JRC120338.

    This dataset contains the raw GDX files generated by the GAMS optimiser for the Dispa-SET model. Details on the output format and the names of the variables can be found in the Dispa-SET documentation. A markdown notebook in R (and the rendered PDF) contains an example of how to read the GDX files in R.

    We also include in this dataset a data frame saved in the Apache Parquet format that can be read both in R and Python.
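    For the Parquet data frame, a minimal R sketch (the file name is a placeholder; in Python the same file can be read with pandas.read_parquet):

    library(arrow)                                     # reads Apache Parquet files in R
    results <- read_parquet("dispaset_output.parquet")
    str(results)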

    A description of the methodology and the data sources, with references, can be found in the report.

    Linked resources

    Input files: https://zenodo.org/record/3775569#.XqqY3JpS-fc

    Source code for the figures: https://github.com/energy-modelling-toolkit/figures-JRC-report-power-system-and-climate-variability

    Update

    [29/06/2020] Uploaded a new version of the Parquet file with the correct data in the climate_year column

  6. R codes and dataset for Visualisation of Diachronic Constructional Change using Motion Chart

    • bridges.monash.edu
    • researchdata.edu.au
    zip
    Updated May 30, 2023
    Cite
    Gede Primahadi Wijaya Rajeg (2023). R codes and dataset for Visualisation of Diachronic Constructional Change using Motion Chart [Dataset]. http://doi.org/10.26180/5c844c7a81768
    Explore at:
    zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    Monash University
    Authors
    Gede Primahadi Wijaya Rajeg
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Publication

    Primahadi Wijaya R., Gede. 2014. Visualisation of diachronic constructional change using Motion Chart. In Zane Goebel, J. Herudjati Purwoko, Suharno, M. Suryadi & Yusuf Al Aried (eds.). Proceedings: International Seminar on Language Maintenance and Shift IV (LAMAS IV), 267-270. Semarang: Universitas Diponegoro. doi: https://doi.org/10.4225/03/58f5c23dd8387

    Description of R codes and data files in the repository

    This repository is imported from its GitHub repo. Versioning of this figshare repository is associated with the GitHub repo's Releases, so check the Releases page for updates (the next version is to include the unified version of the codes in the first release with the tidyverse).

    The raw input data consists of two files (i.e. will_INF.txt and go_INF.txt). They represent the co-occurrence frequency of the top-200 infinitival collocates for will and be going to respectively across the twenty decades of the Corpus of Historical American English (from the 1810s to the 2000s).

    These two input files are used in the R code file 1-script-create-input-data-raw.r. The code preprocesses and combines the two files into a long-format data frame consisting of the following columns: (i) decade, (ii) coll (for "collocate"), (iii) BE going to (for frequency of the collocates with be going to) and (iv) will (for frequency of the collocates with will); it is available in input_data_raw.txt. Then, the script 2-script-create-motion-chart-input-data.R processes input_data_raw.txt to normalise the co-occurrence frequency of the collocates per million words (the COHA size and normalising base frequency are available in coha_size.txt). The output of the second script is input_data_futurate.txt.

    Next, input_data_futurate.txt contains the relevant input data for generating (i) the static motion chart as an image plot in the publication (using the script 3-script-create-motion-chart-plot.R), and (ii) the dynamic motion chart (using the script 4-script-motion-chart-dynamic.R).

    The repository adopts the project-oriented workflow in RStudio; double-click the Future Constructions.Rproj file to open an RStudio session whose working directory is associated with the contents of this repository.
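    As a hedged illustration of the per-million-words normalisation performed by the second script (column names such as "decade" and the word-count column in coha_size.txt are assumptions; the repository's own scripts are authoritative):

    raw  <- read.delim("input_data_raw.txt")
    size <- read.delim("coha_size.txt")                   # assumed layout: decade + corpus word count
    raw  <- merge(raw, size, by = "decade")
    raw$will_pmw  <- raw$will / raw$word_count * 1e6      # frequency per million words
    raw$going_pmw <- raw$BE.going.to / raw$word_count * 1e6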

  7. Respiration_chambers/raw_log_files and combined datasets of biomass and chamber data, and physical parameters

    • researchdata.edu.au
    • data.aad.gov.au
    Updated Dec 3, 2018
    + more versions
    Cite
    BLACK, JAMES GEOFFREY; Black, J.G.; BLACK, JAMES GEOFFREY; BLACK, JAMES GEOFFREY (2018). Respiration_chambers/raw_log_files and combined datasets of biomass and chamber data, and physical parameters [Dataset]. https://researchdata.edu.au/respirationchambersrawlogfiles-combined-datasets-physical-parameters/1360456
    Explore at:
    Dataset updated
    Dec 3, 2018
    Dataset provided by
    Australian Antarctic Data Centre
    Australian Antarctic Division
    Authors
    BLACK, JAMES GEOFFREY; Black, J.G.; BLACK, JAMES GEOFFREY; BLACK, JAMES GEOFFREY
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 27, 2015 - Feb 23, 2015
    Area covered
    Description

    General overview
    The following datasets are described by this metadata record, and are available for download from the provided URL.

    - Raw log files, physical parameters raw log files
    - Raw excel files, respiration/PAM chamber raw excel spreadsheets
    - Processed and cleaned excel files, respiration chamber biomass data
    - Raw rapid light curve excel files (this is duplicated from Raw log files), combined dataset pH, temperature, oxygen, salinity, velocity for experiment
    - Associated R script file for pump cycles of respiration chambers

    ####

    Physical parameters raw log files

    Raw log files
    1) DATE=
    2) Time= UTC+11
    3) PROG=Automated program to control sensors and collect data
    4) BAT=Amount of battery remaining
    5) STEP=check aquation manual
    6) SPIES=check aquation manual
    7) PAR=Photoactive radiation
    8) Levels=check aquation manual
    9) Pumps= program for pumps
    10) WQM=check aquation manual

    ####

    Respiration/PAM chamber raw excel spreadsheets

    Abbreviations in headers of datasets
    Note: Two datasets are provided in different formats, raw and cleaned (adj). These are the same data with the PAR column moved over to PAR.all for analysis. All headers are the same. The cleaned (adj) dataframe will work with the R syntax below; alternatively, add code to do the cleaning in R.

    Date: ISO 1986 - Check
    Time:UTC+11 unless otherwise stated
    DATETIME: UTC+11 unless otherwise stated
    ID (of instrument in respiration chambers)
    ID43=Pulse amplitude fluorescence measurement of control
    ID44=Pulse amplitude fluorescence measurement of acidified chamber
    ID=1 Dissolved oxygen
    ID=2 Dissolved oxygen
    ID3= PAR
    ID4= PAR
    PAR=Photo active radiation umols
    F0=minimal fluorescence from PAM
    Fm=Maximum fluorescence from PAM
    Yield=(F0 – Fm)/Fm
    rChl=an estimate of chlorophyll (Note this is uncalibrated and is an estimate only)
    Temp=Temperature degrees C
    PAR=Photo active radiation
    PAR2= Photo active radiation2
    DO=Dissolved oxygen
    %Sat= Saturation of dissolved oxygen
    Notes=This is the program of the underwater submersible logger with the following abbreviations:
    Notes-1) PAM=
    Notes-2) PAM=Gain level set (see aquation manual for more detail)
    Notes-3) Acclimatisation= Program of slowly introducing treatment water into chamber
    Notes-4) Shutter start up 2 sensors+sample…= Shutter PAMs automatic set up procedure (see aquation manual)
    Notes-5) Yield step 2=PAM yield measurement and calculation of control
    Notes-6) Yield step 5= PAM yield measurement and calculation of acidified
    Notes-7) Abatus respiration DO and PAR step 1= Program to measure dissolved oxygen and PAR (see aquation manual). Steps 1-4 are different stages of this program including pump cycles, DO and PAR measurements.

    8) Rapid light curve data
    Pre LC: A yield measurement prior to the following measurement
    After 10.0 sec at 0.5% to 8%: Level of each of the 8 steps of the rapid light curve
    Odyssey PAR (only in some deployments): An extra measure of PAR (umols) using an Odyssey data logger
    Dataflow PAR: An extra measure of PAR (umols) using a Dataflow sensor.
    PAM PAR: This is copied from the PAR or PAR2 column
    PAR all: This is the complete PAR file and should be used
    Deployment: Identifying which deployment the data came from

    ####

    Respiration chamber biomass data

    The data is chlorophyll a biomass from cores from the respiration chambers. The headers are: Depth (mm), Treat (acidified or control), Chl a (pigment and indicator of biomass), and Core (5 cores were collected from each chamber; three were analysed for chl a). These are pseudoreplicates/subsamples from the chambers and should not be treated as replicates.

    ####

    Associated R script file for pump cycles of respiration chambers

    Associated respiration chamber data to determine the times when respiration chamber pumps delivered treatment water to chambers. Determined from Aquation log files (see associated files). Use the chamber cut times to determine net production rates. Note: Users need to avoid the times when the respiration chambers are delivering water as this will give incorrect results. The headers that get used in the attached/associated R file are start regression and end regression. The remaining headers are not used unless called for in the associated R script. The last columns of these datasets (intercept, ElapsedTimeMincoef) are determined from the linear regressions described below.

    To determine the rate of change of net production, coefficients of the regression of oxygen consumption in discrete 180 minute data blocks were determined. R squared values for fitted regressions of these coefficients were consistently high (greater than 0.9). We make two assumptions with calculation of net production rates: the first is that heterotrophic community members do not change their metabolism under OA; and the second is that the heterotrophic communities are similar between treatments.
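    A hedged sketch of these block-wise regressions (column names such as ElapsedTimeMin and DO are assumptions; substitute the actual headers from the combined dataset):

    library(dplyr)
    chamber_data <- read.csv("chamber_oxygen_log.csv")            # placeholder file name
    rates <- chamber_data %>%
      mutate(block = floor(ElapsedTimeMin / 180)) %>%             # discrete 180-minute blocks
      group_by(block) %>%
      summarise(slope = coef(lm(DO ~ ElapsedTimeMin))[2],         # oxygen change per minute
                r2    = summary(lm(DO ~ ElapsedTimeMin))$r.squared)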

    ####

    Combined dataset pH, temperature, oxygen, salinity, velocity for experiment

    This data is rapid light curve data generated from a Shutter PAM fluorimeter. There are eight steps in each rapid light curve. Note: The software component of the Shutter PAM fluorimeter for sensor 44 appeared to be damaged and would not cycle through the PAR cycles. Therefore the rapid light curves and recovery curves should only be used for the control chambers (sensor ID43).

    The headers are
    PAR: Photoactive radiation
    relETR: F0/Fm x PAR
    Notes: Stage/step of light curve
    Treatment: Acidified or control


    The associated light treatments in each stage. Each actinic light intensity is held for 10 seconds, then a saturating pulse is taken (see PAM methods).

    After 10.0 sec at 0.5% = 1 umols PAR
    After 10.0 sec at 0.7% = 1 umols PAR
    After 10.0 sec at 1.1% = 0.96 umols PAR
    After 10.0 sec at 1.6% = 4.32 umols PAR
    After 10.0 sec at 2.4% = 4.32 umols PAR
    After 10.0 sec at 3.6% = 8.31 umols PAR
    After 10.0 sec at 5.3% =15.78 umols PAR
    After 10.0 sec at 8.0% = 25.75 umols PAR

    Note: this dataset appears to be missing data; rows for deployment D5 are potentially not usable.

    See the word document in the download file for more information.

  8. Google Data Analytics Case Study Cyclistic

    • kaggle.com
    zip
    Updated Sep 27, 2022
    + more versions
    Cite
    Udayakumar19 (2022). Google Data Analytics Case Study Cyclistic [Dataset]. https://www.kaggle.com/datasets/udayakumar19/google-data-analytics-case-study-cyclistic/suggestions
    Explore at:
    zip (1299 bytes)
    Dataset updated
    Sep 27, 2022
    Authors
    Udayakumar19
    Description

    Introduction

    Welcome to the Cyclistic bike-share analysis case study! In this case study, you will perform many real-world tasks of a junior data analyst. You will work for a fictional company, Cyclistic, and meet different characters and team members. In order to answer the key business questions, you will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act. Along the way, the Case Study Roadmap tables — including guiding questions and key tasks — will help you stay on the right path.

    Scenario

    You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

    Ask

    How do annual members and casual riders use Cyclistic bikes differently?

    Guiding Question:

    What is the problem you are trying to solve?
      How do annual members and casual riders use Cyclistic bikes differently?
    How can your insights drive business decisions?
      The insights will help the marketing team design a strategy aimed at casual riders.
    

    Prepare

    Guiding Question:

    Where is your data located?
      The data is located in Cyclistic's own organizational data store.

    How is data organized?
      The datasets are in CSV format, one file per month, from financial year 22.

    Are there issues with bias or credibility in this data? Does your data ROCCC?
      Yes, it is ROCCC because the data was collected by the Cyclistic organization itself.

    How are you addressing licensing, privacy, security, and accessibility?
      The company has its own license over the dataset, and the dataset does not contain any personal information about the riders.

    How did you verify the data’s integrity?
      All the files have consistent columns and each column has the correct type of data.

    How does it help you answer your questions?
      Insights are hidden in the data; we have to interpret the data to find them.

    Are there any problems with the data?
      Yes, the starting station name and ending station name columns have null values.
    

    Process

    Guiding Question:

    What tools are you choosing and why?
      I used RStudio to clean and transform the data for the analysis phase, because the dataset is large and I wanted to gain experience with the language.

    Have you ensured the data’s integrity?
      Yes, the data is consistent throughout the columns.

    What steps have you taken to ensure that your data is clean?
      First, duplicates and null values were removed; then new columns were added for analysis.

    How can you verify that your data is clean and ready to analyze?
      Make sure the column names are consistent across all datasets, then combine them with the bind_rows() function.

    Make sure the column data types are consistent across all datasets by using compare_df_cols() from the janitor package.
    Combine all the datasets into a single data frame so the analysis is consistent throughout.
    Remove the columns start_lat, start_lng, end_lat and end_lng from the dataframe because they are not required for the analysis.
    Create new columns day, date, month and year from the started_at column; this provides additional ways to aggregate the data.
    Create the ride_length column from the started_at and ended_at columns to find the average ride duration.
    Remove the null rows from the dataset using the na.omit() function (a code sketch of these steps follows this list).
    Have you documented your cleaning process so you can review and share those results?
      Yes, the cleaning process is documented clearly.
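    A minimal R sketch of these cleaning steps, under assumed file names and the column names used by the Cyclistic/Divvy trip files:

    library(dplyr); library(janitor); library(lubridate)

    files <- list.files(pattern = "tripdata.*\\.csv$")          # one csv per month (placeholder pattern)
    trips <- lapply(files, read.csv)
    compare_df_cols(trips)                                      # check column types match across files

    all_trips <- bind_rows(trips) %>%
      select(-start_lat, -start_lng, -end_lat, -end_lng) %>%    # drop columns not needed for analysis
      mutate(started_at = ymd_hms(started_at),
             ended_at   = ymd_hms(ended_at),
             date  = as.Date(started_at),
             day   = format(date, "%d"),
             month = format(date, "%m"),
             year  = format(date, "%Y"),
             ride_length = difftime(ended_at, started_at, units = "mins")) %>%
      na.omit()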
    

    Analyze Phase:

    Guiding Questions:

    How should you organize your data to perform analysis on it?
      The data has been organized into one single dataframe by using the read.csv function in R.
    Has your data been properly formatted?
      Yes, all the columns have their correct data type.

    What surprises did you discover in the data?
      Casual members' ride duration is higher than annual members'.
      Casual members use docked bikes far more than annual members.
    What trends or relationships did you find in the data?
      Annual members ride mainly for commuting.
      Casual members prefer docked bikes.
      Annual members prefer electric or classic bikes.
    How will these insights help answer your business questions?
      These insights help to build a profile for each member type.
    

    Share

    Guiding Questions:

    Were you able to answer the question of how ...
    
  9. Replication Data for: Modelling Policy Action Using Natural Language Processing: Evidence for a Long-Run Increase in Policy Activism in the UK

    • search.dataone.org
    Updated Sep 24, 2024
    Cite
    Popa, Mircea (2024). Replication Data for: Modelling Policy Action Using Natural Language Processing: Evidence for a Long-Run Increase in Policy Activism in the UK [Dataset]. http://doi.org/10.7910/DVN/F7CDMQ
    Explore at:
    Dataset updated
    Sep 24, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Popa, Mircea
    Description

    DFM dataframe and R code for replication. Note that lines 1-219 require access to the original pdf/html documents.

  10. Data and R code for "Dolphin social phenotypes vary in response to food...

    • figshare.com
    txt
    Updated Oct 14, 2024
    Cite
    David Fisher; Barbara Cheney (2024). Data and R code for "Dolphin social phenotypes vary in response to food availability but not the North Atlantic Oscillation index" - post correction [Dataset]. http://doi.org/10.6084/m9.figshare.23256845.v3
    Explore at:
    txt
    Dataset updated
    Oct 14, 2024
    Dataset provided by
    figshare
    Authors
    David Fisher; Barbara Cheney
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Two text files containing data and one .R file with R code. These files are sufficient to recreate the analysis found in the manuscript "Dolphin social phenotypes vary in response to food availability but not the North Atlantic Oscillation index", published in Proceedings of the Royal Society B: Biological Sciences in October 2023 and corrected in October 2024 (see below).

    In brief, the data are based on regular observations of bottlenose dolphins (Tursiops truncatus) off the north east coast of Scotland between 1990 and 2021 inclusive. Regular observations of dolphins co-occurring in groups allowed us to infer social associations and to build social networks. We built social networks for each month and each year there were sufficient observations. From each network we calculated three social network measures (strength, weighted clustering coefficient, and closeness), and we then analysed how these traits vary at both the yearly and monthly scale in response to variation in the North Atlantic Oscillation index and to salmon abundance (data obtained from other sources). We upload the dataset both before filtering (suffix "raw", including individuals of unknown sex and with only a few observations per year/month) and after filtering, which is used for the analyses in the paper.

    The correction revolves around the calculation of the social network measure "closeness" using the R package igraph. We determined that this function treats the interaction strengths between individuals as distances or costs, where higher values mean more distant/less well-connected. This interpretation of interaction strengths is opposite to how they are interpreted for most other social network metrics, where higher values indicate closer and more well-connected individuals. The consequence is that the closeness values we analysed in the original version of the article are incorrect, and so the results and conclusions around closeness are erroneous. We then re-calculated closeness using a different R package, tnet, which treats interaction strengths in the manner expected (i.e., higher values mean closer together) and re-ran all analyses involving closeness. See the supporting documentation of the paper for a description of the changes to the results in full.

    "Dol Soc by Env Yearly data tC.txt" is the data frame for the yearly scale analysis, with network metrics per individual per year and environmental variables per year. Columns are:
    dol_name - the unique ID of the dolphin
    year - the year of observation
    sex - sex of the dolphin, 1 = male, 2 = female
    year_nao - the North Atlantic Oscillation index record for that year
    year_fish - the yearly salmon abundance measure
    indiv_str - the individual's strength in that year
    indiv_cc - the individual's weighted clustering coefficient in that year
    indiv_close - the individual's closeness in that year

    "Dol Soc by Env Monthly data tC.txt" is the data frame for the monthly scale analysis, with network metrics per individual per month and environmental variables per month. Columns are:
    dol_name - the unique ID of the dolphin
    year - the year of observation
    month - the month of observation, coded numerically (i.e., April = 4)
    sex - sex of the dolphin, 1 = male, 2 = female
    month_year_nao - the North Atlantic Oscillation index record for that month
    month_year_fish - the monthly salmon abundance measure
    indiv_str - the individual's strength in that month
    indiv_cc - the individual's weighted clustering coefficient in that month
    indiv_close - the individual's closeness in that month

    "Dol Soc by Env Monthly data tC raw.txt" and "Dol Soc by Env Yearly data tC raw.txt" are the above datasets but prior to filtering (see R code).

    "Fisher & Cheney code Dol Soc by Env tC.R" is the R code file to recreate the analyses found in the manuscript (a series of mixed-effect models). We used R version 4.3.1 for the analysis. Note that it requires the packages "glmmTMB" (version 1.1.7) and "car" (version 3.1-2), so they must be installed first. Additionally, you will need to save the following R script: https://github.com/hschielzeth/RandomSlopeR2/blob/master/condR.R and refer to it with the source() command to enable the calculation of conditional repeatabilities.
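    A hedged sketch of reading the yearly file and fitting one mixed-effect model of the kind described above (the file separator and the exact model structure should be taken from the accompanying R script):

    library(glmmTMB)
    yearly <- read.table("Dol Soc by Env Yearly data tC.txt", header = TRUE)
    m_str  <- glmmTMB(indiv_str ~ year_fish + year_nao + sex + (1 | dol_name) + (1 | year),
                      data = yearly)
    summary(m_str)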

  11. Data from: Constraints on trait combinations explain climatic drivers of biodiversity: the importance of trait covariance in community assembly

    • datadryad.org
    • search.dataone.org
    zip
    Updated Apr 27, 2018
    Cite
    John M. Dwyer; Daniel C. Laughlin (2018). Constraints on trait combinations explain climatic drivers of biodiversity: the importance of trait covariance in community assembly [Dataset]. http://doi.org/10.5061/dryad.76kt8
    Explore at:
    zip
    Dataset updated
    Apr 27, 2018
    Dataset provided by
    Dryad
    Authors
    John M. Dwyer; Daniel C. Laughlin
    Time period covered
    Apr 27, 2017
    Description

    quadrat.scale.data: Refer to the R script ("Dwyer_&_Laughlin_2017_Trait_covariance_script.r") for information about this dataframe.

    species.in.quadrat.scale.data: Refer to the R script ("Dwyer_&_Laughlin_2017_Trait_covariance_script.r") for information about this dataframe.

    Dwyer_&_Laughlin_2017_Trait_covariance_script: This script reads in the two dataframes of "raw" data, calculates diversity and trait metrics, and runs the major analyses presented in Dwyer & Laughlin 2017.

  12. FacialRecognition

    • kaggle.com
    zip
    Updated Dec 1, 2016
    Cite
    TheNicelander (2016). FacialRecognition [Dataset]. https://www.kaggle.com/petein/facialrecognition
    Explore at:
    zip (121674455 bytes)
    Dataset updated
    Dec 1, 2016
    Authors
    TheNicelander
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    #https://www.kaggle.com/c/facial-keypoints-detection/details/getting-started-with-r #################################

    ###Variables for downloaded files
    data.dir <- ' '
    train.file <- paste0(data.dir, 'training.csv')
    test.file <- paste0(data.dir, 'test.csv')
    #################################

    ###Load csv -- creates a data.frame matrix where each column can have a different type.
    d.train <- read.csv(train.file, stringsAsFactors = F)
    d.test <- read.csv(test.file, stringsAsFactors = F)

    ###In training.csv, we have 7049 rows, each one with 31 columns.
    ###The first 30 columns are keypoint locations, which R correctly identified as numbers.
    ###The last one is a string representation of the image, identified as a string.

    ###To look at samples of the data, uncomment this line:

    head(d.train)

    ###Let's save the Image column as another variable, and remove it from d.train:
    ###d.train is our dataframe, and we want the column called Image.
    ###Assigning NULL to a column removes it from the dataframe.

    im.train <- d.train$Image
    d.train$Image <- NULL   #removes 'Image' from the dataframe

    im.test <- d.test$Image
    d.test$Image <- NULL    #removes 'Image' from the dataframe

    #################################
    #The image is represented as a series of numbers, stored as a string.
    #Convert these strings to integers by splitting them and converting the result to integer.

    #strsplit splits the string, unlist simplifies its output to a vector of strings,
    #and as.integer converts it to a vector of integers.
    as.integer(unlist(strsplit(im.train[1], " ")))
    as.integer(unlist(strsplit(im.test[1], " ")))

    ###Install and activate appropriate libraries.
    ###The tutorial is meant for Linux and OSX, where they use a different library, so:
    ###Replace all instances of %dopar% with %do%.

    install.packages('foreach')

    library("foreach", lib.loc="~/R/win-library/3.3")

    ###Convert each image string into a row of integers
    im.train <- foreach(im = im.train, .combine=rbind) %do% {
      as.integer(unlist(strsplit(im, " ")))
    }
    im.test <- foreach(im = im.test, .combine=rbind) %do% {
      as.integer(unlist(strsplit(im, " ")))
    }
    #The foreach loop evaluates the inner command for each element of im.train and combines the results with rbind (combine by rows).
    #%do% runs the evaluations sequentially; the original %dopar% would run them in parallel given a registered backend.
    #im.train is now a matrix with 7049 rows (one for each image) and 9216 columns (one for each pixel):

    ###Save all four variables in data.Rd file ###Can reload them at anytime with load('data.Rd')

    save(d.train, im.train, d.test, im.test, file='data.Rd')

    load('data.Rd')

    #Each image is a vector of 96*96 pixels (96*96 = 9216).
    #Convert these 9216 integers into a 96x96 matrix:
    im <- matrix(data=rev(im.train[1,]), nrow=96, ncol=96)

    #im.train[1,] returns the first row of im.train, which corresponds to the first training image.
    #rev reverses the resulting vector to match the interpretation of R's image function
    #(which expects the origin to be in the lower left corner).

    #To visualize the image we use R's image function:
    image(1:96, 1:96, im, col=gray((0:255)/255))

    #Let's color the coordinates for the eyes and nose:
    points(96-d.train$nose_tip_x[1], 96-d.train$nose_tip_y[1], col="red")
    points(96-d.train$left_eye_center_x[1], 96-d.train$left_eye_center_y[1], col="blue")
    points(96-d.train$right_eye_center_x[1], 96-d.train$right_eye_center_y[1], col="green")

    #Another good check is to see how variable our data is.
    #For example, where are the centers of each nose in the 7049 images? (this takes a while to run):
    for(i in 1:nrow(d.train)) {
      points(96-d.train$nose_tip_x[i], 96-d.train$nose_tip_y[i], col="red")
    }

    #There are quite a few outliers -- they could be labeling errors. Looking at one extreme example:
    #In this case there's no labeling error, but this shows that not all faces are centralized.
    idx <- which.max(d.train$nose_tip_x)
    im <- matrix(data=rev(im.train[idx,]), nrow=96, ncol=96)
    image(1:96, 1:96, im, col=gray((0:255)/255))
    points(96-d.train$nose_tip_x[idx], 96-d.train$nose_tip_y[idx], col="red")

    #One of the simplest things to try is to compute the mean of the coordinates of each keypoint
    #in the training set and use that as a prediction for all images:
    colMeans(d.train, na.rm=T)

    #To build a submission file we need to apply these computed coordinates to the test instances:
    p <- matrix(data=colMeans(d.train, na.rm=T), nrow=nrow(d.test), ncol=ncol(d.train), byrow=T)
    colnames(p) <- names(d.train)
    predictions <- data.frame(ImageId = 1:nrow(d.test), p)
    head(predictions)

    #The expected submission format has one keypoint per row, but we can easily get that with the help of the reshape2 library:

    install.packages('reshape2')

    library(...

  13. Time Series Forecasting Using Prophet in R

    • kaggle.com
    zip
    Updated Jul 25, 2023
    Cite
    vikram amin (2023). Time Series Forecasting Using Prophet in R [Dataset]. https://www.kaggle.com/datasets/vikramamin/time-series-forecasting-using-prophet-in-r
    Explore at:
    zip (9000 bytes)
    Dataset updated
    Jul 25, 2023
    Authors
    vikram amin
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description
    • Main objective : To forecast the page visits of a website
    • Tool : Time Series Forecasting using Prophet in R.
    • Steps:
    • Read the data
    • Data Cleaning: checking data types, date formats and missing data
    • Run libraries (dplyr, ggplot2, tidyverse, lubridate, prophet, forecast)
    • Change the Date column from character vector to date and change data format using lubridate package
    • Rename the column "Date" to "ds" and "Visits" to "y".
    • Treat "Christmas" and "Black.Friday" as holiday events. As the data ranges from 2016 to 2020, there will be 5 Christmas and 5 Black Friday days.
    • We will look at the impact on "Visits" from 3 days before to 3 days after Christmas, and from 3 days before to 1 day after Black Friday.
    • We create two data frames called Christmas and Black.Friday and merge the two into a data frame called "holidays".
    • We create train and test data. In both, we select only 3 variables, namely ds, y and Easter. The train data contains dates before 2020-12-01 and the test data contains dates on and after 2020-12-01 (31 days).
    • Train Data
    • Test Data
    • Use the prophet model, which accepts multiple parameters; we go with the defaults. Thereafter, we add the external regressor "Easter".
    • We create the future data frame for forecasting and name it "future". It is built from the model "m" plus the 31 days of the test data. We then predict over this future data frame and create a new data frame called "forecast".
    • The forecast data frame consists of 1827 rows and 34 variables. The external regressor (Easter) value is 0 through the entire time period, which shows that "Easter" has no impact on "Visits".
    • yhat stands for the predicted value (predicted visits).
    • We try to understand the impact of Holiday events "Christmas" and "Black.Friday"
    • We plot the forecast.
    • We plot the forecast: plot(m, forecast)
    • Blue is the predicted value (yhat), black is the actual value (y), and the blue shaded region spans the yhat_lower and yhat_upper values.
    • prophet_plot_components(m, forecast)
    • Trend indicates that page visits remained constant from Jan 2016 to mid-2017, with an upswing from mid-2019 to the end of 2020.
    • From Holidays, we can see that Christmas had a negative effect on page visits, whereas Black Friday had a positive effect.
    • Weekly seasonality indicates that page visits are highest from Monday to Thursday and start going down thereafter.
    • Yearly seasonality indicates that page visits are highest in April and then decline, reaching the bottom in October.
    • The external regressor "Easter" has no impact on page visits.
    • plot(m,forecast) + add_changepoints_to_plot(m)
    • The trend, indicated by the red line, starts moving upwards from mid-2019 onwards.
    • We check for acc...
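    A hedged R sketch of the workflow described above (the input file, date format and holiday dates are placeholders; the dataset's own screenshots and notebook are authoritative):

    library(prophet)
    library(dplyr)

    visits <- read.csv("website_visits.csv") %>%                 # placeholder file name
      rename(ds = Date, y = Visits) %>%
      mutate(ds = as.Date(ds))                                   # adjust to the file's date format

    christmas <- data.frame(holiday = "Christmas",
                            ds = as.Date(paste0(2016:2020, "-12-25")),
                            lower_window = -3, upper_window = 3)
    black_friday <- data.frame(holiday = "Black.Friday",
                               ds = as.Date(c("2016-11-25", "2017-11-24", "2018-11-23",
                                              "2019-11-29", "2020-11-27")),
                               lower_window = -3, upper_window = 1)
    holidays <- bind_rows(christmas, black_friday)

    train <- filter(visits, ds < as.Date("2020-12-01"))          # train must also carry an 'Easter' column
    m <- prophet(holidays = holidays)
    m <- add_regressor(m, "Easter")
    m <- fit.prophet(m, train)

    future <- make_future_dataframe(m, periods = 31)             # 31 days of test data
    # future$Easter <- ...  supply the regressor values for the future frame as well
    forecast <- predict(m, future)
    plot(m, forecast)
    prophet_plot_components(m, forecast)
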
  14. Market Basket Analysis

    • kaggle.com
    zip
    Updated Dec 9, 2021
    Cite
    Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
    Explore at:
    zip (23875170 bytes)
    Dataset updated
    Dec 9, 2021
    Authors
    Aslan Ahmedov
    Description

    Market Basket Analysis

    Market basket analysis with Apriori algorithm

    The retailer wants to target customers with suggestions for itemsets they are most likely to purchase. I was given a dataset containing a retailer's transaction data, which covers all transactions over a period of time. The retailer will use the results to grow the business and to make itemset suggestions to customers, so that we can increase customer engagement, improve customer experience and identify customer behavior. I will solve this problem using association rules, an unsupervised learning technique that checks for the dependency of one data item on another data item.

    Introduction

    Association rules are most used when you want to find associations between objects in a set, i.e., frequent patterns in a transaction database. They can tell you which items customers frequently buy together, and they allow the retailer to identify relationships between the items.

    An Example of Association Rules

    Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat": - support = P(mouse & mat) = 8/100 = 0.08 - confidence = support/P(computer mouse) = 0.08/0.10 = 0.8 - lift = confidence/P(mouse mat) = 0.8/0.09 ≈ 8.9. This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.

    Strategy

    • Data Import
    • Data Understanding and Exploration
    • Transformation of the data, so that it is ready to be consumed by the association rules algorithm
    • Running association rules
    • Exploring the rules generated
    • Filtering the generated rules
    • Visualization of Rule

    Dataset Description

    • File name: Assignment-1_Data
    • List name: retaildata
    • File format: . xlsx
    • Number of Row: 522065
    • Number of Attributes: 7

      • BillNo: 6-digit number assigned to each transaction. Nominal.
      • Itemname: Product name. Nominal.
      • Quantity: The quantities of each product per transaction. Numeric.
      • Date: The day and time when each transaction was generated. Numeric.
      • Price: Product price. Numeric.
      • CustomerID: 5-digit number assigned to each customer. Nominal.
      • Country: Name of the country where each customer resides. Nominal.


    Libraries in R

    First, we need to load the required libraries. Each library is briefly described below.

    • arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
    • arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.
    • tidyverse - The tidyverse is an opinionated collection of R packages designed for data science.
    • readxl - Read Excel Files in R.
    • plyr - Tools for Splitting, Applying and Combining Data.
    • ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
    • knitr - Dynamic Report generation in R.
    • magrittr- Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.
    • dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.
    • tidyverse - This package is designed to make it easy to install and load multiple 'tidyverse' packages in a single step.


    Data Pre-processing

    Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.


    Next we clean our data frame and remove missing values.


    To apply association rule mining, we need to convert the dataframe into transaction data, so that all items bought together in one invoice will be in ...
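    A hedged sketch of this pre-processing and of running Apriori with arules (the aggregation key and the thresholds are illustrative, not the exact values used in the write-up):

    library(readxl)
    library(dplyr)
    library(arules)

    retail <- read_excel("Assignment-1_Data.xlsx") %>%
      filter(!is.na(Itemname), !is.na(BillNo))            # drop rows with missing values

    baskets <- split(retail$Itemname, retail$BillNo)      # items bought together in one invoice
    trans   <- as(baskets, "transactions")

    rules <- apriori(trans, parameter = list(supp = 0.01, conf = 0.5, minlen = 2))
    inspect(head(sort(rules, by = "lift"), 10))           # strongest rules by lift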

  15. University SET data with faculty and course characteristics from a university in Poland

    • openicpsr.org
    Updated Mar 27, 2022
    + more versions
    Cite
    Krzysztof Rybinski (2022). University SET data with faculty and course characteristics from a university in Poland [Dataset]. http://doi.org/10.3886/E166061V1
    Explore at:
    Dataset updated
    Mar 27, 2022
    Dataset provided by
    Vistula University
    Authors
    Krzysztof Rybinski
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Poland
    Description

    This is a unique dataset of all the SET ratings provided by students of one university in Poland at the end of the winter semester of the 2020/2021 academic year. The SET questionnaire used by this university and the variables' descriptions are provided in the Data_description Word file. The data is aggregated at the teacher/course level, with 1,021 data points and 29 variables. The data file is in the Rdata format; use the R load() function. The name of the dataframe is "dat".
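    A minimal sketch for loading the file (the file name is a placeholder; the data frame is called "dat" as stated above):

    load("SET_Poland.Rdata")
    str(dat)    # 1,021 rows, 29 variables at the teacher/course level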

  16. Lightning NOx Emissions in CMAQ Data

    • catalog.data.gov
    • s.cnmilf.com
    Updated Sep 9, 2023
    Cite
    U.S. EPA Office of Research and Development (ORD) (2023). Lightning NOx Emissions in CMAQ Data [Dataset]. https://catalog.data.gov/dataset/lightning-nox-emissions-in-cmaq-data
    Explore at:
    Dataset updated
    Sep 9, 2023
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    Metadata for the dataset for “Assessing the Impact of Lightning NOx Emissions in CMAQ Using Lightning Flash Data from WWLLN over the Contiguous United States”.

    Figure 2: ThreeYear_NLDN2WWLLN_byNOAAcr_Region_anal.xlsx. The variable names are self-explanatory and the original figure is included.

    Figure 3: NLDN_flash_Monthly_mean_2016_07.ncf.gz, WWLLN_flash_Monthly_mean_2016_07.ncf.gz, WWLLNs_flash_Monthly_mean_2016_07.ncf.gz. These netCDF files contain the monthly mean values of gridded lightning flash rate for all the cases; the figure can be created using any netCDF visualization tool (such as VERDI) or statistical package (such as R).

    Figures 4, 5, 6: CMAQ_*_.rds.gz files. These files contain the paired observation-model O3 concentrations from all the model cases for hourly, daily max-8hr, and other statistics. The rds datasets can be read into R as data frames to make these figures.

    Figures 7 & 8: CCTM_CONC*.nc.gz. The vertical profiles (CONC) contain the model data to make Figures 7 and 8, while the observation data are available publicly.

    Figure 9: NADP_v532_intel18_0_2016_CONUS_.csv.

    Figure 10: avg_DEP_concentrations.nc.gz. These files contain the monthly mean wet deposition of NO3.

    Figure 11: NADP_v532_intel18_0_2016_CONUS_.csv.

    Figure 12: DDEP_TNO3_.nc.gz. These files contain hourly dry deposition of TNO3 over the CONUS domain.
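    A hedged sketch for reading two of the file types above in R (file names are placeholders; decompress the .gz archives first):

    paired <- readRDS("CMAQ_hourly.rds")         # paired observation-model O3 as a data frame
    head(paired)

    library(ncdf4)                               # or inspect the netCDF files with VERDI
    nc    <- nc_open("WWLLN_flash_Monthly_mean_2016_07.ncf")
    flash <- ncvar_get(nc, names(nc$var)[1])     # gridded monthly-mean lightning flash rate
    nc_close(nc)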

  17. Effect of data source on estimates of regional bird richness in northeastern United States

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated May 4, 2021
    Cite
    Roi Ankori-Karlinsky; Ronen Kadmon; Michael Kalyuzhny; Katherine F. Barnes; Andrew M. Wilson; Curtis Flather; Rosalind Renfrew; Joan Walsh; Edna Guk (2021). Effect of data source on estimates of regional bird richness in northeastern United States [Dataset]. http://doi.org/10.5061/dryad.m905qfv0h
    Explore at:
    zip
    Dataset updated
    May 4, 2021
    Dataset provided by
    Columbia University
    Gettysburg College
    Massachusetts Audubon Society
    Agricultural Research Service
    New York State Department of Environmental Conservation
    University of Vermont
    Hebrew University of Jerusalem
    University of Michigan
    Authors
    Roi Ankori-Karlinsky; Ronen Kadmon; Michael Kalyuzhny; Katherine F. Barnes; Andrew M. Wilson; Curtis Flather; Rosalind Renfrew; Joan Walsh; Edna Guk
    License

    https://spdx.org/licenses/CC0-1.0.html

    Area covered
    Northeastern United States, United States
    Description

    Standardized data on large-scale and long-term patterns of species richness are critical for understanding the consequences of natural and anthropogenic changes in the environment. The North American Breeding Bird Survey (BBS) is one of the largest and most widely used sources of such data, but so far, little is known about the degree to which BBS data provide accurate estimates of regional richness. Here we test this question by comparing estimates of regional richness based on BBS data with spatially and temporally matched estimates based on state Breeding Bird Atlases (BBA). We expected that estimates based on BBA data would provide a more complete (and therefore, more accurate) representation of regional richness due to their larger number of observation units and higher sampling effort within the observation units. Our results were only partially consistent with these predictions: while estimates of regional richness based on BBA data were higher than those based on BBS data, estimates of local richness (number of species per observation unit) were higher in BBS data. The latter result is attributed to higher land-cover heterogeneity in BBS units and higher effectiveness of bird detection (more species are detected per unit time). Interestingly, estimates of regional richness based on BBA blocks were higher than those based on BBS data even when differences in the number of observation units were controlled for. Our analysis indicates that this difference was due to higher compositional turnover between BBA units, probably due to larger differences in habitat conditions between BBA units and a larger number of geographically restricted species. Our overall results indicate that estimates of regional richness based on BBS data suffer from incomplete detection of a large number of rare species, and that corrections of these estimates based on standard extrapolation techniques are not sufficient to remove this bias. Future applications of BBS data in ecology and conservation, and in particular, applications in which the representation of rare species is important (e.g., those focusing on biodiversity conservation), should be aware of this bias, and should integrate BBA data whenever possible.

    Methods Overview

    This is a compilation of second-generation breeding bird atlas (BBA) data and corresponding breeding bird survey (BBS) data. It contains presence-absence breeding bird observations in 5 U.S. states (MA, MI, NY, PA, VT), sampling effort per sampling unit, geographic location of sampling units, and environmental variables per sampling unit: elevation and elevation range (from SRTM), mean annual precipitation and mean summer temperature (from PRISM), and NLCD 2006 land-use data.

    Each row contains all observations for one sampling unit. Additional tables contain information on the effect of sampling effort on richness, a species rareness table per dataset, and two summary tables covering bird diversity and environmental variables.

    The methods for compilation are contained in the supplementary information of the manuscript but also here:

    Bird data

    For BBA data, shapefiles for blocks and the data on species presences and sampling effort in blocks were received from the atlas coordinators. For BBS data, shapefiles for routes and raw species data were obtained from the Patuxent Wildlife Research Center (https://databasin.org/datasets/02fe0ebbb1b04111b0ba1579b89b7420 and https://www.pwrc.usgs.gov/BBS/RawData).

    Using ArcGIS Pro© 10.0, species observations were joined to respective BBS and BBA observation units shapefiles using the Join Table tool. For both BBA and BBS, a species was coded as either present (1) or absent (0). Presence in a sampling unit was based on codes 2, 3, or 4 in the original volunteer birding checklist codes (possible breeder, probable breeder, and confirmed breeder, respectively), and absence was based on codes 0 or 1 (not observed and observed but not likely breeding). Spelling inconsistencies of species names between BBA and BBS datasets were fixed. Species that needed spelling fixes included Brewer’s Blackbird, Cooper’s Hawk, Henslow’s Sparrow, Kirtland’s Warbler, LeConte’s Sparrow, Lincoln’s Sparrow, Swainson’s Thrush, Wilson’s Snipe, and Wilson’s Warbler. In addition, naming conventions were matched between BBS and BBA data. The Alder and Willow Flycatchers were lumped into Traill’s Flycatcher and regional races were lumped into a single species column: Dark-eyed Junco regional types were lumped together into one Dark-eyed Junco, Yellow-shafted Flicker was lumped into Northern Flicker, Saltmarsh Sparrow and the Saltmarsh Sharp-tailed Sparrow were lumped into Saltmarsh Sparrow, and the Yellow-rumped Myrtle Warbler was lumped into Myrtle Warbler (currently named Yellow-rumped Warbler). Three hybrid species were removed: Brewster's and Lawrence's Warblers and the Mallard x Black Duck hybrid. Established “exotic” species were included in the analysis since we were concerned only with detection of richness and not of specific species.

    The resultant species tables with sampling effort were pivoted horizontally so that every row was a sampling unit and each species observation was a column. This was done for each state using R version 3.6.2 (R Foundation for Statistical Computing, 2019), and all state tables were merged to yield one BBA and one BBS dataset. Following the joining of environmental variables to these datasets (see below), BBS and BBA data were joined using rbind.data.frame in R to yield a final dataset with all species observations and environmental variables for each observation unit.
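
    The dataset description does not name the R functions used for this reshaping step; a minimal sketch under the assumption that tidyr is used could look like this:

        library(tidyr)
        library(dplyr)

        # Hypothetical long-format observations: one row per (sampling unit, species).
        obs <- data.frame(
          unit_id = c("NY_001", "NY_001", "NY_002"),
          species = c("Northern Flicker", "Wood Thrush", "Northern Flicker"),
          present = 1
        )

        # Pivot so every row is a sampling unit and every species is a 0/1 column.
        wide <- obs %>%
          pivot_wider(names_from = species, values_from = present, values_fill = 0)

        # State tables built this way can then be stacked, e.g.
        # all_bba <- rbind.data.frame(wide_NY, wide_PA, wide_MA, wide_VT, wide_MI)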

    Environmental data

    Using ArcGIS Pro© 10.0, all environmental raster layers, BBA and BBS shapefiles, and the species observations were integrated in a common coordinate system (North_America Equidistant_Conic) using the Project tool. For BBS routes, 400m buffers were drawn around each route using the Buffer tool. The observation unit shapefiles for all states were merged (separately for BBA blocks and BBS routes and 400m buffers) using the Merge tool to create a study-wide shapefile for each data source. Whether or not a BBA block was adjacent to a BBS route was determined using the Intersect tool based on a radius of 30m around the route buffer (to fit the NLCD map resolution). Area and length of the BBS route inside the proximate BBA block were also calculated. Mean values for annual precipitation and summer temperature, and mean and range for elevation, were extracted for every BBA block and 400m buffer BBS route using Zonal Statistics as Table tool. The area of each land-cover type in each observation unit (BBA block and BBS buffer) was calculated from the NLCD layer using the Zonal Histogram tool.
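
    The zonal statistics were computed in ArcGIS; an equivalent step can be sketched in R with the terra package (the layer and shapefile names below are illustrative, not files from this dataset):

        library(terra)

        elev   <- rast("srtm_elevation.tif")      # illustrative elevation raster
        blocks <- vect("bba_blocks.shp")          # illustrative BBA block shapefile
        blocks <- project(blocks, crs(elev))      # bring both layers into a common CRS

        # Mean elevation per BBA block (analogous to the Zonal Statistics as Table tool).
        zstats <- extract(elev, blocks, fun = mean, na.rm = TRUE)
        head(zstats)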

  18. Data from: HomeRange: A global database of mammalian home ranges

    • search.dataone.org
    • data.niaid.nih.gov
    • +1more
    Updated Jul 15, 2025
    Cite
    Maarten Broekman; Selwyn Hoeks; Rosa Freriks; Merel Langendoen; Katharina Runge; Ecaterina Savenco; Ruben ter Harmsel; Mark Huijbregts; Marlee Tucker (2025). HomeRange: A global database of mammalian home ranges [Dataset]. http://doi.org/10.5061/dryad.d2547d85x
    Explore at:
    Dataset updated
    Jul 15, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Maarten Broekman; Selwyn Hoeks; Rosa Freriks; Merel Langendoen; Katharina Runge; Ecaterina Savenco; Ruben ter Harmsel; Mark Huijbregts; Marlee Tucker
    Time period covered
    Jan 1, 2022
    Description

    Motivation: Home range is a common measure of animal space use as it provides ecological information that is useful for conservation applications. In macroecological studies, values are typically aggregated to species means to examine general patterns of animal space use. However, this ignores the environmental context in which the home range was estimated and does not account for intraspecific variation in home range size. In addition, the focus of macroecological studies on home ranges has been historically biased toward terrestrial mammals. The use of aggregated numbers and terrestrial focus limits our ability to examine home range patterns across different environments, variation in time and between different levels of organisation. Here we introduce HomeRange, a global database with 75,611 home-range values across 960 different mammal species, including terrestrial, as well as aquatic and aerial species. Main types of variable contained: The dataset contains mammal home-range estim...

    Mammalian home range papers were compiled via an extensive literature search. All home range values were extracted from the literature, including individual, group and population-level home range values. Associated values were also compiled, including species names, methodological information on data collection, home-range estimation method, period of data collection, study coordinates and name of location, as well as species traits derived from the studies, such as body mass, life stage, reproductive status and locomotor habit. Here we include the database, associated metadata and a reference list of all sources from which the home range data were extracted.

    We also provide an R package, which can be installed from https://github.com/SHoeks/HomeRange. The HomeRange R package provides functions for downloading the latest version of the HomeRange database and loading it as a standard dataframe into R, plotting several statistics of the database, and finally attaching species traits (e.g. species average body mass, trophic level) from the CO...
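
    A minimal installation sketch (the data-loading call below is a placeholder; the actual function names are documented in the package README at the GitHub link above):

        # Install the HomeRange package from GitHub.
        install.packages("remotes")
        remotes::install_github("SHoeks/HomeRange")
        library(HomeRange)

        # Placeholder call: the package exposes a function that downloads the latest
        # database release and returns it as a data frame -- see the README for its name.
        # hr <- GetHomeRangeData()
        # str(hr)   # ~75,611 home-range values across 960 mammal species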

  19. Reddit Machine Learning

    • kaggle.com
    zip
    Updated Oct 16, 2020
    Cite
    Tyler Landowski (2020). Reddit Machine Learning [Dataset]. https://www.kaggle.com/fishboi/redditmachinelearning
    Explore at:
    zip(145094611 bytes)Available download formats
    Dataset updated
    Oct 16, 2020
    Authors
    Tyler Landowski
    License

    https://www.reddit.com/wiki/api

    Description

    What is it?

    This is a collection of scraped data from the MachineLearning subreddit:

    • From the subreddit's beginning in 2009 until the end of February 2020
    • All submissions
    • All comments
    • No images

    All data was collected using the Pushshift API. Not all features of comments and submissions are included, only the ones most likely to be useful; view a JSON file to see what's included.

    Some .pickle files are included in case you have a use for them.

    .pickle Files

    These are Pandas DataFrame pickle exports (protocol 3). Unlike the .json files, posts that are [deleted] or [removed] have been dropped. In addition, the comments dataframe includes the number of direct replies and the number of total replies (all comments in the subtree of a comment).
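
    A minimal sketch, assuming the reticulate package and a Python environment with pandas, for reading one of the pickles from R (the file name is illustrative):

        library(reticulate)

        pd <- import("pandas")
        comments <- pd$read_pickle("comments.pickle")   # reticulate converts the DataFrame to an R data frame
        head(comments)
        # The reply counts described above appear as ordinary columns of this data frame.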

    Bonus Files

    Additionally, some feature engineering was applied to produce submissions_fe and comments_fe. These may or may not be useful, but feel free to experiment with them; the feature-engineered files include columns counting occurrences of important machine-learning terms found in the text.

  20. Replication files for "Integrating biodiversity: A longitudinal and cross-sectoral analysis of Swiss politics"

    • data.europa.eu
    • envidat.ch
    unknown
    Updated Mar 22, 2022
    Cite
    EnviDat (2022). Replication files for "Integrating biodiversity: A longitudinal and cross-sectoral analysis of Swiss politics" [Dataset]. https://data.europa.eu/data/datasets/9a78b620-82fd-4b38-a1c5-c04e95913cc1-envidat?locale=sl
    Explore at:
    unknown(1830842700)Available download formats
    Dataset updated
    Mar 22, 2022
    Dataset authored and provided by
    EnviDat
    License

    http://dcat-ap.ch/vocabulary/licenses/terms_by

    Area covered
    Switzerland
    Description

    Introduction

    The ZIP file contains all data and code to replicate the analyses reported in the following paper.

    Reber, U., Fischer, M., Ingold, K., Kienast, F., Hersperger, A. M., Grütter, R., & Benz, R. (2022). Integrating biodiversity: A longitudinal and cross-sectoral analysis of Swiss politics. Policy Sciences. https://doi.org/10.1007/s11077-022-09456-4

    If you use any of the material included in this repository, please refer to the paper. If you use (parts of) the text corpus, please also refer to the sources used for its compilation listed below. The content of the texts may not be changed.

    Data folder

    The data folder contains the following files.

    • corpus.parquet: Text corpus of Swiss policy documents
    • dict_de.csv: Biodiversity dictionary (German)
    • dict_fr.csv: Biodiversity dictionary (French)
    • dict_it.csv: Biodiversity dictionary (Italian)
    • topic_labels.csv: Labels/codes for policy sectors
    • topics.csv: Labels/codes for policy sectors

    The corpus and the dictionary were compiled by the authors specifically for this project. The labels/codes for policy sectors are based on the coding scheme of the Swiss Parliament.

    Text corpus

    The text corpus consists of 439,984 Swiss policy documents in German, French, and Italian from 1999 to 2018. The corpus was compiled from the following sources between 2020-10-01 and 2021-01-31.

    • Transcripts and parliamentary businesses (e.g. questions, motions, parliamentary initiatives) via the Web Services (WS) provided by the Swiss Parliament
    • The official compilation of federal legislation ("Amtliche Sammlung", AS) via opendata.swiss provided by the Swiss Federal Archives (SFA)
    • The federal gazette ("Bundesblatt") via fedlex.admin.ch
    • Decisions of federal courts via entscheidsuche.ch (ES)

    The corpus is stored as a Parquet file (corpus.parquet) containing a single data frame for use with R. The data frame has the following structure; a minimal loading sketch follows the list.

    • text_id: Unique identifier for each text (source information as prefix, e.g. "t_")
    • doc_type: Document type (see coding scheme below)
    • branch: Government branch (1 legislative, 2 executive, 3 judicial)
    • stage: Stage of policy process (1 drafting, 2 introduction, 3 interpretation)
    • year: Year of publication
    • topic: Policy sector (coding scheme in separate file in data folder)
    • lang: Language (de, fr, it)
    • text: Text
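
    A minimal loading sketch, assuming the arrow and dplyr packages, for reading the corpus and summarizing it by sector and language:

        library(arrow)
        library(dplyr)

        corpus <- read_parquet("corpus.parquet")

        corpus %>% count(topic, lang)   # documents per policy sector and language
        table(corpus$doc_type)          # distribution over the coding scheme below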

    The following list contains the coding scheme for the doc_type variable.

    • 101: Federal gazette // Draft for public consultation ("Vernehmlassungsverfahren")
    • 102: Federal gazette // Explanation of draft for parliament ("Botschaft")
    • 103: Federal gazette // Strategy, action plan
    • 104: Federal gazette // Federal council decree ("Bundesratsbeschluss")
    • 105: Federal gazette // (Simple) Federal decree ("(Einfacher) Bundesbeschluss")
    • 106: Federal gazette // General decree ("Allgemeinverfügung")
    • 107: Federal gazette // Treaty ("Übereinkommen")
    • 108: Federal gazette // Treaty ("Abkommen")
    • 109: Federal gazette // Draft for parliament ("Entwurf")
    • 110: Federal gazette // Report ("Bericht")
    • 111: Federal gazette // Report of parliamentary commission ("Bericht")
    • 112: Federal gazette // Report of federal council ("Bericht")
    • 201: Parl. businesses // Submitted text
    • 202: Parl. businesses // Reason text
    • 203: Parl. businesses // Federal council response
    • 204: Parl. businesses // Initial situation
    • 205: Parl. businesses // Proceedings
    • 301: Parl. transcripts // Speech of MP
    • 302: Parl. transcripts // Speech of federal council
    • 401: Federal legislation // Legal text of the official compilation (law, ordinances, etc.)
    • 501: Court decisions // Federal Supreme Court
    • 502: Court decisions // Federal Criminal Court
    • 503: Court decisions // Federal Administrative Court

    Code folder

    The code folder contains all R code for the analyses. The files are numbered chronologically.

    • 1_classifier_training.R: Training of classifiers for the classification of policy sectors
    • 2_classifier_application.R: Classification of documents in the corpus
    • 3_dictionary_application.R: Biodiversity indexing of documents in the corpus
    • 4_stm_truncation.R: Truncation of indexed documents to keep only relevant parts
    • 5_stm_translation.R: Translation of FR and IT documents to DE
    • 6_stm_model.R: Preprocessing and structural topic model
    • 7_plots.R: Plots and numbers as included in the paper

    The code/functions folder contains custom functions used in the scripts, e.g. to support topic model interpretation.

    Package versions and setup details are noted in the code files.

    Contact

    Please direct any questions to the dataset authors.
