41 datasets found
  1. Google Data Analytics Case Study Cyclistic

    • kaggle.com
    zip
    Updated Sep 27, 2022
    + more versions
    Cite
    Udayakumar19 (2022). Google Data Analytics Case Study Cyclistic [Dataset]. https://www.kaggle.com/datasets/udayakumar19/google-data-analytics-case-study-cyclistic/suggestions
    Explore at:
    Available download formats: zip (1299 bytes)
    Dataset updated
    Sep 27, 2022
    Authors
    Udayakumar19
    Description

    Introduction

    Welcome to the Cyclistic bike-share analysis case study! In this case study, you will perform many real-world tasks of a junior data analyst. You will work for a fictional company, Cyclistic, and meet different characters and team members. In order to answer the key business questions, you will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act. Along the way, the Case Study Roadmap tables — including guiding questions and key tasks — will help you stay on the right path.

    Scenario

    You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

    Ask

    How do annual members and casual riders use Cyclistic bikes differently?

    Guiding Question:

    What is the problem you are trying to solve?
      How do annual members and casual riders use Cyclistic bikes differently?
    How can your insights drive business decisions?
      The insights will help the marketing team design a strategy to convert casual riders into annual members.
    

    Prepare

    Guiding Question:

    Where is your data located?
      The data is located in Cyclistic's own organizational data.
    
    How is data organized?
      The datasets are in CSV format, one file per month, covering financial year 2022.
    
    Are there issues with bias or credibility in this data? Does your data ROCCC? 
      Yes, the data is ROCCC (reliable, original, comprehensive, current, and cited) because it was collected by the Cyclistic organization itself.
    
    How are you addressing licensing, privacy, security, and accessibility?
      The company holds its own license over the dataset, and the dataset does not contain any personal information about the riders.
    
    How did you verify the data’s integrity?
      All the files have consistent columns and each column has the correct type of data.
    
    How does it help you answer your questions?
      Insights are hidden in the data; we have to interpret the data to find them.
    
    Are there any problems with the data?
      Yes, the start station name and end station name columns contain null values.
    

    Process

    Guiding Question:

    What tools are you choosing and why?
      I used RStudio to clean and transform the data for the analysis phase, both because the dataset is large and to gain experience with the language.
    
    Have you ensured the data’s integrity?
     Yes, the data is consistent throughout the columns.
    
    What steps have you taken to ensure that your data is clean?
      First, duplicates and null values were removed; then new columns were added for analysis.
    
    How can you verify that your data is clean and ready to analyze?
      Make sure the column names are consistent throughout all datasets before combining them with the "bind_rows" function.

    Make sure the column data types are consistent throughout all the datasets by using "compare_df_cols" from the "janitor" package.
    Combine all the datasets into a single data frame so the analysis stays consistent.
    Remove the columns start_lat, start_lng, end_lat, and end_lng from the data frame because they are not required for the analysis.
    Create new columns day, date, month, and year from the started_at column; these provide additional opportunities to aggregate the data.
    Create a "ride_length" column from the started_at and ended_at columns to find the average ride duration for each rider type.
    Remove rows with null values from the dataset using the "na.omit" function. (A sketch of these steps in R is given below.)
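    A minimal sketch of these cleaning steps in R, assuming twelve monthly CSV files in a data/ folder and the column names used by the public Cyclistic/Divvy trip exports (started_at, ended_at, start_lat, and so on); file paths are placeholders.

```r
# Minimal sketch of the cleaning steps above (file paths and schema assumed).
library(dplyr)
library(readr)
library(lubridate)
library(janitor)

files   <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
monthly <- lapply(files, read_csv)

# Check that column names and types agree across the monthly files
compare_df_cols(monthly, return = "mismatch")

trips <- bind_rows(monthly) %>%
  distinct() %>%                                          # drop duplicate rows
  select(-start_lat, -start_lng, -end_lat, -end_lng) %>%  # not needed for analysis
  mutate(
    date        = as_date(started_at),
    day         = wday(started_at, label = TRUE),
    month       = month(started_at, label = TRUE),
    year        = year(started_at),
    ride_length = as.numeric(difftime(ended_at, started_at, units = "mins"))
  ) %>%
  na.omit()                                               # remove rows with nulls
```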
    Have you documented your cleaning process so you can review and share those results? 
      Yes, the cleaning process is documented clearly.
    

    Analyze Phase:

    Guiding Questions:

    How should you organize your data to perform analysis on it?
      The data has been organized into a single data frame using the read_csv function in R.
    Has your data been properly formatted?
      Yes, all the columns have their correct data type.

    What surprises did you discover in the data?
      Casual riders' ride durations are longer than those of annual members.
      Casual riders use docked bikes far more than annual members do.
    What trends or relationships did you find in the data?
      Annual members ride mainly for commuting purposes.
      Casual riders prefer docked bikes.
      Annual members prefer electric or classic bikes.
    How will these insights help answer your business questions?
      These insights help build a profile for each rider type.
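    A hedged sketch of one such aggregation, assuming the cleaned trips data frame built in the Process phase (member_casual distinguishes casual riders from annual members in the Cyclistic/Divvy schema).

```r
# Sketch: average ride length by rider type and weekday (assumed `trips` data frame).
library(dplyr)
library(ggplot2)

trips %>%
  group_by(member_casual, day) %>%
  summarise(rides = n(),
            avg_ride_length = mean(ride_length),
            .groups = "drop") %>%
  ggplot(aes(x = day, y = avg_ride_length, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(title = "Average ride length by rider type and weekday",
       y = "Average ride length (minutes)")
```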
    

    Share

    Guiding Questions:

    Were you able to answer the question of how ...
    
  2. Housing Price Prediction using DT and RF in R

    • kaggle.com
    zip
    Updated Aug 31, 2023
    Cite
    vikram amin (2023). Housing Price Prediction using DT and RF in R [Dataset]. https://www.kaggle.com/datasets/vikramamin/housing-price-prediction-using-dt-and-rf-in-r
    Explore at:
    Available download formats: zip (629100 bytes)
    Dataset updated
    Aug 31, 2023
    Authors
    vikram amin
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description
    • Objective: To predict the prices of houses in the City of Melbourne
    • Approach: Using Decision Tree and Random Forest
    • Data Cleaning:
    • The Date column is read as a character vector and is converted into a date vector using the ‘lubridate’ library.
    • We create a new column called ‘age’ to capture the age of the house, since it can be a factor in its price. We extract the year from the ‘Date’ column and subtract the ‘Year Built’ column from it.
    • We remove 11566 records which have missing values
    • We drop columns which are not significant such as ‘X’, ‘suburb’, ‘address’, (we have kept zipcode as it serves the purpose in place of suburb and address), ‘type’, ‘method’, ‘SellerG’, ‘date’, ‘Car’, ‘year built’, ‘Council Area’, ‘Region Name’
    • We split the data into ‘train’ and ‘test’ in 80/20 ratio using the sample function
    • Load the libraries ‘rpart’, ‘rpart.plot’, ‘rattle’, ‘RColorBrewer’.
    • Run the decision tree using the rpart function, with ‘Price’ as the dependent variable.
    • The average price for 5464 houses is $1,084,349.
    • Where building area is less than 200.5, the average price for 4582 houses is $931,445. Where building area is less than 200.5 and the age of the building is less than 67.5 years, the average price for 3385 houses is $799,299.6.
    • The highest average price, $4,801,538, is for 13 houses where distance is lower than 5.35 and building area is greater than 280.5.
    • We use the caret package to tune the complexity parameter; the optimal value is 0.01 with an RMSE of 445,197.9.
    • We use the Metrics library to compute the RMSE ($392,107), MAPE (0.297, i.e. an average error of about 29.7%), and MAE ($272,015.4).
    • The variables ‘postcode’, ‘longitude’, and ‘building area’ are the most important.
    • test$Price holds the actual price and test$predicted holds the predicted price for six particular houses.
    • We fit a random forest with default parameters on the train data.
    • The variable importance plot indicates that ‘Building Area’, ‘Age of the house’, and ‘Distance’ are the variables that most affect the price of a house.
    • Based on the default parameters, RMSE is $250,426.2, MAPE is 0.147 (about 14.7% average error), and MAE is $151,657.7.
    • The error levels off between 100 and 200 trees, with almost no further reduction, so we can choose ntree = 200.
    • We tune the model and find that mtry = 3 has the lowest out-of-bag error.
    • We use the caret package with 5-fold cross-validation.
    • RMSE is $252,216.10, MAPE is 0.146 (about 14.6% average error), and MAE is $151,669.4.
    • We can conclude that Random Forest gives more accurate results than the Decision Tree.
    • In Random Forest, the default parameters (ntree = 500) give a lower RMSE and MAPE than ntree = 200, so we proceed with those parameters. (An R sketch of this workflow is given below.)
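    The list above maps onto a fairly standard rpart/randomForest workflow. The sketch below is illustrative only: the housing data frame and its retained columns (Price plus the predictors kept after cleaning) are assumptions standing in for the cleaned Melbourne data, and the caret tuning details are omitted.

```r
# Illustrative sketch of the decision tree / random forest workflow (assumed data).
library(rpart)
library(rpart.plot)
library(randomForest)
library(Metrics)

set.seed(42)
idx   <- sample(nrow(housing), floor(0.8 * nrow(housing)))  # 80/20 split
train <- housing[idx, ]
test  <- housing[-idx, ]

# Decision tree with Price as the dependent variable
tree_fit <- rpart(Price ~ ., data = train)
rpart.plot(tree_fit)
test$pred_tree <- predict(tree_fit, newdata = test)
rmse(test$Price, test$pred_tree)
mape(test$Price, test$pred_tree)

# Random forest with default parameters (ntree = 500), then variable importance
rf_fit <- randomForest(Price ~ ., data = train)
varImpPlot(rf_fit)
test$pred_rf <- predict(rf_fit, newdata = test)
rmse(test$Price, test$pred_rf)
mape(test$Price, test$pred_rf)
```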
  3. Uniform Crime Reporting (UCR) Program Data: Arrests by Age, Sex, and Race,...

    • search.datacite.org
    • doi.org
    • +1more
    Updated 2018
    Cite
    Jacob Kaplan (2018). Uniform Crime Reporting (UCR) Program Data: Arrests by Age, Sex, and Race, 1980-2016 [Dataset]. http://doi.org/10.3886/e102263v5-10021
    Explore at:
    Dataset updated
    2018
    Dataset provided by
    Inter-university Consortium for Political and Social Research (https://www.icpsr.umich.edu/web/pages/)
    DataCite (https://www.datacite.org/)
    Authors
    Jacob Kaplan
    Description

    Version 5 release notes:
    Removes support for SPSS and Excel data. Changes the crimes that are stored in each file: there are more files now with fewer crimes per file. The files and their included crimes have been updated below.
    Adds in agencies that report 0 months of the year. Adds a column that indicates the number of months reported; this is generated by summing the number of unique months an agency reports data for. Note that this indicates the number of months an agency reported arrests for ANY crime; they may not necessarily report every crime every month. Agencies that did not report a crime will have a value of NA for every arrest column for that crime. Removes data on runaways.
    Version 4 release notes:
    Changes column names from "poss_coke" and "sale_coke" to "poss_heroin_coke" and "sale_heroin_coke" to clearly indicate that these columns include the sale of heroin as well as similar opiates such as morphine, codeine, and opium. Also changes column names for the narcotic columns to indicate that they are only for synthetic narcotics.
    Version 3 release notes:
    Adds data for 2016. Orders rows by year (descending) and ORI.
    Version 2 release notes:
    Fixes a bug where the Philadelphia Police Department had an incorrect FIPS county code.
    The Arrests by Age, Sex, and Race data is an FBI data set that is part of the annual Uniform Crime Reporting (UCR) Program data. This data contains highly granular data on the number of people arrested for a variety of crimes (see below for a full list of included crimes). The data sets here combine data from the years 1980-2015 into a single file. These files are quite large and may take some time to load.
    All the data was downloaded from NACJD as ASCII+SPSS Setup files and read into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see here. https://github.com/jacobkap/crime_data. If you have any questions, comments, or suggestions please contact me at jkkaplan6@gmail.com.

    I did not make any changes to the data other than the following. When an arrest column has a value of "None/not reported", I change that value to zero. This makes the (possibly incorrect) assumption that these values represent zero crimes reported. The original data does not have a value when the agency reports zero arrests other than "None/not reported." In other words, this data does not differentiate between real zeros and missing values. Some agencies also incorrectly report the following numbers of arrests, which I change to NA: 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99999, 99998.
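    A hedged sketch of this recoding for one hypothetical arrest column; the data frame name arrests and the column name agg_assault_tot_arrests are placeholders, not the actual names used in the files.

```r
# Sketch of the recoding described above (placeholder data frame and column name).
library(dplyr)

impossible <- c(seq(10000, 100000, by = 10000), 99999, 99998)

arrests <- arrests %>%
  mutate(
    # "None/not reported" is treated as zero arrests (a possibly incorrect assumption)
    agg_assault_tot_arrests = if_else(agg_assault_tot_arrests == "None/not reported",
                                      "0", agg_assault_tot_arrests),
    agg_assault_tot_arrests = as.numeric(agg_assault_tot_arrests),
    # Implausibly large reported counts are recoded to NA
    agg_assault_tot_arrests = if_else(agg_assault_tot_arrests %in% impossible,
                                      NA_real_, agg_assault_tot_arrests)
  )
```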

    To reduce file size and make the data more manageable, all of the data is aggregated yearly. All of the data is in agency-year units such that every row indicates an agency in a given year. Columns are crime-arrest category units. For example, if you choose the data set that includes murder, you would have rows for each agency-year and columns with the number of people arrested for murder. The ASR data breaks down arrests by age and gender (e.g. Male aged 15, Male aged 18). They also provide the number of adults or juveniles arrested by race. Because most agencies and years do not report the arrestee's ethnicity (Hispanic or not Hispanic) or juvenile outcomes (e.g. referred to adult court, referred to welfare agency), I do not include these columns.

    To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS (state, county, and place) and agency type/subtype. Please note that some of the FIPS codes have leading zeros and if you open it in Excel it will automatically delete those leading zeros.

    I created 9 arrest categories myself. The categories are:
    Total Male Juvenile, Total Female Juvenile, Total Male Adult, Total Female Adult, Total Male, Total Female, Total Juvenile, Total Adult, Total Arrests.
    All of these categories are based on the sums of the sex-age categories (e.g. Male under 10, Female aged 22) rather than using the provided age-race categories (e.g. adult Black, juvenile Asian). As not all agencies report the race data, my method is more accurate. These categories also make up the data in the "simple" version of the data. The "simple" file only includes the above 9 columns as the arrest data (all other columns in the data are just agency identifier columns). Because this "simple" data set needs fewer columns, I include all offenses.

    As the arrest data is very granular, and each category of arrest is its own column, there are dozens of columns per crime. To keep the data somewhat manageable, there are nine different files: eight that contain different crimes and the "simple" file. Each file contains the data for all years. The eight categories each have crimes belonging to a major crime category and do not overlap in crimes other than with the index offenses. Please note that the crime names provided below are not the same as the column names in the data. Due to Stata limiting column names to 32 characters maximum, I have abbreviated the crime names in the data. The files and their included crimes are:

    Index Crimes: Murder, Rape, Robbery, Aggravated Assault, Burglary, Theft, Motor Vehicle Theft, Arson
    Alcohol Crimes: DUI, Drunkenness, Liquor
    Drug Crimes: Total Drug, Total Drug Sales, Total Drug Possession, Cannabis Possession, Cannabis Sales, Heroin or Cocaine Possession, Heroin or Cocaine Sales, Other Drug Possession, Other Drug Sales, Synthetic Narcotic Possession, Synthetic Narcotic Sales
    Grey Collar and Property Crimes: Forgery, Fraud, Stolen Property
    Financial Crimes: Embezzlement, Total Gambling, Other Gambling, Bookmaking, Numbers Lottery
    Sex or Family Crimes: Offenses Against the Family and Children, Other Sex Offenses, Prostitution, Rape
    Violent Crimes: Aggravated Assault, Murder, Negligent Manslaughter, Robbery, Weapon Offenses
    Other Crimes: Curfew, Disorderly Conduct, Other Non-traffic, Suspicion, Vandalism, Vagrancy
    Simple: This data set has every crime and only the arrest categories that I created (see above).
    If you have any questions, comments, or suggestions please contact me at jkkaplan6@gmail.com.

  4. Students Performance EDA in R

    • kaggle.com
    zip
    Updated Sep 6, 2023
    Cite
    vikram amin (2023). Students Performance EDA in R [Dataset]. https://www.kaggle.com/datasets/vikramamin/students-performance
    Explore at:
    Available download formats: zip (7847 bytes)
    Dataset updated
    Sep 6, 2023
    Authors
    vikram amin
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    We will be doing Exploratory Data Analysis on the Dataset.

    • Set the working directory and read the data
    • Check the summary of the data.
    • Data Cleaning: No missing or duplicated values were found. Data types for 5 columns needed to be changed from character vectors to factor vectors.
    • EDA: Renamed columns ‘race.ethnicity’ to ‘race’, ‘parental.level.of.education’ to ‘parents_edu’, ‘test.preparation.course’ to ‘test_prep’. Created new column ‘avg_score’ by taking the average score of columns ‘math.score’, ‘reading.score’, ‘writing.score’.
    • Load the libraries for data visualisation: ‘dplyr’, ‘ggplot2’, ‘corrplot’, ‘tidyr’.
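    A hedged sketch of these EDA steps, assuming the Kaggle StudentsPerformance.csv file; its column names become race.ethnicity, math.score, etc. when read with read.csv.

```r
# Sketch of the EDA steps above (file name and schema assumed).
library(dplyr)
library(ggplot2)
library(corrplot)

students <- read.csv("StudentsPerformance.csv") %>%
  mutate(across(where(is.character), as.factor)) %>%   # character -> factor columns
  rename(race        = race.ethnicity,
         parents_edu = parental.level.of.education,
         test_prep   = test.preparation.course) %>%
  mutate(avg_score = (math.score + reading.score + writing.score) / 3)

# Correlation between the three scores (writing vs. reading is the highest)
corrplot(cor(select(students, math.score, reading.score, writing.score)))

# Average score by gender
ggplot(students, aes(x = gender, y = avg_score)) + geom_boxplot()
```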


    • Conclusion:
    • Female students (518) outnumber male students (482), out of 1000 students in total.
    • 58% of students belong to the Group C race (180 females and 139 males) and the Group D race (129 females and 133 males), while the fewest students belong to the Group A race (53 females and 36 males, 89 in total). 22.6% of students' parents have some college education, followed closely by an associate's degree (22.2%); 5.9% of students' parents have a master's degree.
    • 35.5% of students have free or reduced lunch versus 64.5% who get standard lunch. Within this, 18.9% of female students and 16.6% of male students get free or reduced lunch, versus 32.9% of female students and 31.6% of male students who get standard lunch.
    • Female students' total average score is higher than that of male students. This could also be due to the higher proportion of female students.
    • 35.8% of students had completed the test preparation versus 64.2% who had not. Within this, 18.4% of female students and 17.4% of male students had completed the test preparation, versus 33.4% of female and 30.8% of male students who had not.
    • The highest correlation is between writing score and reading score, i.e. 0.95.
  5. Jacob Kaplan's Concatenated Files: Uniform Crime Reporting Program Data: Law...

    • openicpsr.org
    • search.gesis.org
    Updated Mar 25, 2018
    + more versions
    Cite
    Jacob Kaplan (2018). Jacob Kaplan's Concatenated Files: Uniform Crime Reporting Program Data: Law Enforcement Officers Killed and Assaulted (LEOKA) 1960-2018 [Dataset]. http://doi.org/10.3886/E102180V7
    Explore at:
    Dataset updated
    Mar 25, 2018
    Dataset provided by
    University of Pennsylvania
    Authors
    Jacob Kaplan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    1960 - 2018
    Area covered
    United States
    Description

    For any questions about this data please email me at jacob@crimedatatool.com. If you use this data, please cite it.

    Version 7 release notes: Adds data from 2018.
    Version 6 release notes: Adds data in the following formats: SPSS and Excel. Changes the project name to avoid confusing this data for the ones done by NACJD.
    Version 5 release notes: Adds data for 1960-1974 and 2017. Note: many columns (including number of female officers) will always have a value of 0 for years prior to 1971. Removes support for .csv and .sav files. Adds a number_of_months_reported variable for each agency-year; a month is considered reported if the month_indicator column for that month has a value of "normal update" or "reported, not data." The formatting of the monthly data has changed from wide to long. This means that each agency-month has a single row. The old data had each agency as a single row, with each month-category (e.g. jan_officers_killed_by_felony) being a column. Now there is a single column for each category (e.g. officers_killed_by_felony) and the month can be identified in the month column. This also results in most column names changing. As such, be careful when aggregating the monthly data, since some variables are the same every month (e.g. the number of officers employed is measured annually), so aggregating will give values 12 times as high as the real value for those variables (see the sketch below). Adds a date column. This date column is always set to the first of the month. It is NOT the date that a crime occurred or was reported; it is only there to make it easier to create time-series graphs that require a date input. All the data in this version was acquired from the FBI as text/DAT files and read into R using the package asciiSetupReader. The FBI also provided a PDF file explaining how to create the setup file to read the data. Both the FBI's PDF and the setup file I made are included in the zip files. The data is the same as from NACJD, but using all FBI files makes cleaning easier as all column names are already identical.
    Version 4 release notes: Adds data for 2016. Orders rows by year (descending) and ORI.
    Version 3 release notes: Fixes a bug where the Philadelphia Police Department had an incorrect FIPS county code.

    The LEOKA data sets contain highly detailed data about the number of officers/civilians employed by an agency and how many officers were killed or assaulted. All the data was acquired from the FBI as text/DAT files and read into R using the package asciiSetupReader. The FBI also provided a PDF file explaining how to create the setup file to read the data; both the FBI's PDF and the setup file I made are included in the zip files. About 7% of all agencies in the data report more officers or civilians than population, so I removed the officers/civilians per 1,000 population variables. You should exercise caution if deciding to generate and use these variables yourself. Several agencies had impossibly large (>15) officer deaths in a single month; for those months I changed the value to NA. See the R code for a complete list. For the R code used to clean this data, see here: https://github.com/jacobkap/crime_data.

    The UCR Handbook (https://ucr.fbi.gov/additional-ucr-publications/ucr_handbook.pdf/view) describes the LEOKA data as follows: "The UCR Program collects data from all contributing agencies ... on officer line-of-duty deaths and assaults. Reporting agencies must submit data on ... their own duly sworn officers feloniously or accidentally killed or assaulted in the line of duty. The purpose of this data collection is to identify situations in which officers are killed or assaulted, describe the incidents statistically, and publish the data to aid agencies in developing policies to improve officer safety. ... agencies must record assaults on sworn officers. Reporting agencies must count all assaults that resulted in serious injury or assaults in which a weapon was used that could have caused serious injury or death. They must include other assaults not causing injury if the assault involved more than mere verbal abuse or minor resistance to an arrest. In other words, agencies must include in this section all assaults on officers, whether or not the officers sustained injuries."
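    A hedged sketch of the agency-year aggregation caveat described above; the data frame name leoka and the specific column names (ori, year, officers_killed_by_felony, total_employees_officers) are assumptions used only to illustrate summing monthly counts versus keeping annually measured values.

```r
# Sketch: aggregate long-format monthly rows to agency-year units (assumed names).
library(dplyr)

leoka_yearly <- leoka %>%
  group_by(ori, year) %>%
  summarise(
    # Monthly counts: sum across the 12 agency-month rows
    officers_killed_by_felony = sum(officers_killed_by_felony, na.rm = TRUE),
    # Annually measured values repeat every month: take one value, do not sum
    total_employees_officers  = first(total_employees_officers),
    .groups = "drop"
  )
```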

  6. Data from: Data and code from: Stem borer herbivory dependent on...

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    • +2more
    Updated Sep 2, 2025
    Cite
    Agricultural Research Service (2025). Data and code from: Stem borer herbivory dependent on interactions of sugarcane variety, associated traits, and presence of prior borer damage [Dataset]. https://catalog.data.gov/dataset/data-and-code-from-stem-borer-herbivory-dependent-on-interactions-of-sugarcane-variety-ass-1e076
    Explore at:
    Dataset updated
    Sep 2, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    This dataset contains all the data and code needed to reproduce the analyses in the manuscript: Penn, H. J., & Read, Q. D. (2023). Stem borer herbivory dependent on interactions of sugarcane variety, associated traits, and presence of prior borer damage. Pest Management Science. https://doi.org/10.1002/ps.7843

    Included are two .Rmd notebooks containing all code required to reproduce the analyses in the manuscript, two .html files of rendered notebook output, three .csv data files that are loaded and analyzed, and a .zip file of intermediate R objects that are generated during the model fitting and variable selection process.

    Notebook files
    01_boring_analysis.Rmd: This RMarkdown notebook contains R code to read and process the raw data, create exploratory data visualizations and tables, fit a Bayesian generalized linear mixed model, extract output from the statistical model, and create graphs and tables summarizing the model output, including marginal means for different varieties and contrasts between crop years.
    02_trait_covariate_analysis.Rmd: This RMarkdown notebook contains R code to read raw variety-level trait data, perform feature selection based on correlations between traits, fit another generalized linear mixed model using traits as predictors, and create graphs and tables from that model output, including marginal means by categorical trait and marginal trends by continuous trait.

    HTML files
    These HTML files contain the rendered output of the two RMarkdown notebooks. They were generated by Quentin Read on 2023-08-30 and 2023-08-15.
    01_boring_analysis.html
    02_trait_covariate_analysis.html

    CSV data files
    These files contain the raw data. To recreate the notebook output, the CSV files should be at the file path project/data/ relative to where the notebook is run. Columns are described below.
    BoredInternodes_26April2022_no format.csv: primary data file with sugarcane borer (SCB) damage.
    Columns A-C are the year, date, and location. All location values are the same.
    Column D identifies which experiment the data point was collected from.
    Column E, Stubble, indicates the crop year (plant cane or first stubble).
    Column F indicates the variety.
    Column G indicates the plot (integer ID).
    Column H indicates the stalk within each plot (integer ID).
    Column I, # Internodes, indicates how many internodes were on the stalk.
    Columns J-AM are numbered 1-30 and indicate whether SCB damage was observed on that internode (0 if no, 1 if yes, blank cell if that internode was not present on the stalk).
    Column AN indicates the experimental treatment for those rows that are part of a manipulative experiment.
    Column AO contains notes.
    variety_lookup.csv: summary information for the 16 varieties analyzed in this study.
    Column A is the variety name.
    Column B is the total number of stalks assessed for SCB damage for that variety across all years.
    Column C is the number of years that variety is present in the data.
    Column D, Stubble, indicates which crop years were sampled for that variety ("PC" if only plant cane, "PC, 1S" if there are data for both plant cane and first stubble crop years).
    Column E, SCB resistance, is a categorical designation with four values: susceptible, moderately susceptible, moderately resistant, resistant.
    Column F is the literature reference for the SCB resistance value.
    Select_variety_traits_12Dec2022.csv: variety-level traits for the 16 varieties analyzed in this study.
    Column A is the variety name.
    Column B is the SCB resistance designation as an integer.
    Column C is the categorical SCB resistance designation (see above).
    Columns D-I are continuous traits from year 1 (plant cane), including sugar (Mg/ha), biomass or aboveground cane production (Mg/ha), TRS or theoretically recoverable sugar (g/kg), stalk weight of individual stalks (kg), stalk population density (stalks/ha), and fiber content of stalk (percent).
    Columns J-O are the same continuous traits from year 2 (first stubble).
    Columns P-V are categorical traits (in some cases continuous traits binned into categories): maturity timing, amount of stalk wax, amount of leaf sheath wax, amount of leaf sheath hair, tightness of leaf sheath, whether leaf sheath becomes necrotic with age, and amount of collar hair.

    ZIP file of intermediate R objects
    To recreate the notebook output without having to run computationally intensive steps, unzip the archive. The fitted model objects should be at the file path project/ relative to where the notebook is run.
    intermediate_R_objects.zip: This file contains intermediate R objects that are generated during the model fitting and variable selection process. You may use the R objects in the .zip file if you would like to reproduce final output, including figures and tables, without having to refit the computationally intensive statistical models (see the sketch below).
    binom_fit_intxns_updated_only5yrs.rds: fitted brms model object for the main statistical model
    binom_fit_reduced.rds: fitted brms model object for the trait covariate analysis
    marginal_trends.RData: calculated values of the estimated marginal trends with respect to year and previous damage
    marginal_trend_trs.rds: calculated values of the estimated marginal trend with respect to TRS
    marginal_trend_fib.rds: calculated values of the estimated marginal trend with respect to fiber content

    Resources in this dataset:
    Resource Title: Sugarcane borer damage data by internode, 1993-2021. File Name: BoredInternodes_26April2022_no format.csv
    Resource Title: Summary information for the 16 sugarcane varieties analyzed. File Name: variety_lookup.csv
    Resource Title: Variety-level traits for the 16 sugarcane varieties analyzed. File Name: Select_variety_traits_12Dec2022.csv
    Resource Title: RMarkdown notebook 2: trait covariate analysis. File Name: 02_trait_covariate_analysis.Rmd
    Resource Title: Rendered HTML output of notebook 2. File Name: 02_trait_covariate_analysis.html
    Resource Title: RMarkdown notebook 1: main analysis. File Name: 01_boring_analysis.Rmd
    Resource Title: Rendered HTML output of notebook 1. File Name: 01_boring_analysis.html
    Resource Title: Intermediate R objects. File Name: intermediate_R_objects.zip
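    A minimal sketch of reusing the intermediate objects without refitting; the file names are as listed above, and placing them under project/ (as described) plus having brms installed are assumptions.

```r
# Load the fitted brms models and pre-computed marginal trends (paths assumed).
library(brms)

binom_fit <- readRDS("project/binom_fit_intxns_updated_only5yrs.rds")  # main model
summary(binom_fit)

load("project/marginal_trends.RData")                            # trends: year, prior damage
marginal_trend_trs <- readRDS("project/marginal_trend_trs.rds")  # trend w.r.t. TRS
marginal_trend_fib <- readRDS("project/marginal_trend_fib.rds")  # trend w.r.t. fiber content
```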

  7. Data from: BEING A TREE CROP INCREASES THE ODDS OF EXPERIENCING YIELD...

    • zenodo.org
    bin, zip
    Updated Aug 8, 2023
    Cite
    Marcelo Adrián Aizen; Marcelo Adrián Aizen; Gabriela Gleiser; Gabriela Gleiser; Thomas Kitzberger; Thomas Kitzberger; Rubén Milla; Rubén Milla (2023). BEING A TREE CROP INCREASES THE ODDS OF EXPERIENCING YIELD DECLINES IRRESPECTIVE OF POLLINATOR DEPENDENCE [Dataset]. http://doi.org/10.5281/zenodo.7863825
    Explore at:
    Available download formats: zip, bin
    Dataset updated
    Aug 8, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marcelo Adrián Aizen; Gabriela Gleiser; Thomas Kitzberger; Rubén Milla
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Marcelo A. Aizen, Gabriela R. Gleiser, Thomas Kitzberger, Ruben Milla. Being a tree crop increases the odds of experiencing yield declines irrespective of pollinator dependence (to be submitted to PCI)

    Data and R scripts to reproduce the analyses and the figures shown in the paper. All analyses were performed using R 4.0.2.

    Data

    1. FAOdata_21-12-2021.csv

    This file includes yearly data (1961-2020, column 8) on yield and cultivated area (columns 6 and 10) at the country, sub-regional, and regional levels (column 2) for each crop (column 4) drawn from the United Nations Food and Agriculture Organization database (data available at http://www.fao.org/faostat/en; accessed July 21-12-2021). [Used in Script 1 to generate the synthesis dataset]

    2. countries.csv

    This file provides information on the region (column 2) to which each country (column 1) belongs. [Used in Script 1 to generate the synthesis dataset]

    3. dependence.csv

    This file provides information on the pollinator dependence category (column 2) of each crop (column 1).

    4. traits.csv

    This file provides information on the traits of each crop other than pollinator dependence, including, besides the crop name (column 1), the variables type of harvested organ (column 5) and growth form (column 6). [Used in Script 1 to generate the synthesis dataset]

    5. dataset.csv

    The synthesis dataset generated by Script 1.

    6. growth.csv

    The yield growth dataset generated by Script 1 and used as input by Scripts 2 and 3.

    7. phylonames.csv

    This file lists all the crops (column 1) and their equivalent tip names in the crop phylogeny (column 2). [Used in Script 2 for the phylogenetically-controlled analyses]

    8. phylo137.tre

    File containing the phylogenetic tree.

    Scripts

    1. dataset

    This R script curates and merges all the individual datasets mentioned above into a single dataset, estimating and adding to this single dataset the growth rate for each crop and country, and the (log) cumulative harvested area per crop and country over the period 1961-2020.
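    As an illustration of the kind of per-crop, per-country quantity Script 1 produces, the sketch below fits a simple log-linear growth rate to FAOSTAT yields; the column names (Area, Item, Element, Year, Value) follow a generic FAOSTAT export and are assumptions, and the actual estimation method is the one defined in Script 1.

```r
# Illustration only: log-linear yield growth rate per crop and country (assumed schema).
library(dplyr)

fao <- read.csv("FAOdata_21-12-2021.csv")

growth_sketch <- fao %>%
  filter(Element == "Yield") %>%
  group_by(Area, Item) %>%
  summarise(growth_rate = coef(lm(log(Value) ~ Year))[["Year"]],
            .groups = "drop")
```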

    2. analyses

    This R script includes all the analyses described in the article’s main text.

    3. figures

    This R script creates all the main and supplementary figures of this article.

    4. lme4_phylo_setup

    R function written by Li and Bolker (2019) to carry out phylogenetically-controlled generalized linear mixed-effects models as described in the main text of the article.

    References

    Li, M., and B. Bolker. 2019. wzmli/phyloglmm: First release of phylogenetic comparative analysis in lme4- verse. Zenodo. https://doi.org/10.5281/zenodo.2639887.

  8. Data from: Data corresponding to the paper "Traveling Bubbles and Vortex...

    • portalcientifico.uvigo.gal
    Updated 2025
    Cite
    Michinel, Humberto; Michinel, Humberto (2025). Data corresponding to the paper "Traveling Bubbles and Vortex Pairs within Symmetric 2D Quantum Droplets" [Dataset]. https://portalcientifico.uvigo.gal/documentos/682afb714c44bf76b287f3ae
    Explore at:
    Dataset updated
    2025
    Authors
    Michinel, Humberto
    Description

    Datasets generated for the Physical Review E article with title: "Traveling Bubbles and Vortex Pairs within Symmetric 2D Quantum Droplets" by Paredes, Guerra-Carmenate, Salgueiro, Tommasini and Michinel. In particular, we provide the data needed to generate the figures in the publication, which illustrate the numerical results found during this work.

    We also include Python code in the file "plot_from_data_for_repository.py" that generates a version of the figures of the paper from the .pt data sets. Data can be read and plots can be produced with a simple modification of the Python code.

    Figure 1: Data are in fig1.csv

    The csv file has four columns separated by commas. The four columns correspond to values of r (first column) and the function psi(r) for the three cases depicted in the figure (columns 2-4).

    Figures 2 and 4: Data are in data_figs_2_and_4.pt

    This is a data file generated with the torch module of python. It includes eight torch tensors for the spatial grid "x" and "y" and for the complex values of psi for the six eigenstates depicted in figures 2 and 4 ("psia", "psib", "psic", "psid", "psie", "psif"). Notice that figure 2 is the square of the modulus and figure 4 is the argument; both are obtained from the same data sets.

    Figure 3: Data are in fig3.csv

    The csv file has three columns separated by commas. The three columns correspond to values of momentum p (first column), energy E (second column) and velocity U (third column).

    Figure 5: Data are in fig5.csv

    The csv file has three columns separated by commas. The three columns correspond to values of momentum p (first column), the minimum value of |psi|^2 (second column) and the value of |psi|^2 at the center (third column).

    Figure 6: Data are in data_fig_6.pt

    This is a data file generated with the torch module of python. It includes six torch tensors for the spatial grid "x" and "y" and for the complex values of psi for the four instants of time depicted in figure 6 ("psia", "psib", "psic", "psid").

    Figure 7: Data are in data_fig_7.pt

    This is a data file generated with the torch module of python. It includes six torch tensors for the spatial grid "x" and "y" and for the complex values of psi for the four instants of time depicted in figure 7 ("psia", "psib", "psic", "psid").

    Figures 8 and 10: Data are in data_figs_8_and_10.pt

    This is a data file generated with the torch module of python. It includes eight torch tensors for the spatial grid "x" and "y" and for the complex values of psi for the six eigenstates depicted in figures 8 and 10 ("psia", "psib", "psic", "psid", "psie", "psif"). Notice that figure 8 is the square of the modulus and figure 10 is the argument; both are obtained from the same data sets.

    Figure 9: Data are in fig9.csv

    The csv file has two columns separated by commas. The two columns correspond to values of momentum p (first column) and energy (second column).

    Figure 11: Data are in data_fig_11.pt

    This is a data file generated with the torch module of python. It includes ten torch tensors for the spatial grid "x" and "y" and for the complex values of psi for the two cases, four instants of time for each case, depicted in figure 11 ("psia", "psib", "psic", "psid", "psie", "psif", "psig", "psih").

    Figure 12: Data are in data_fig_12.pt

    This is a data file generated with the torch module of python. It includes eight torch tensors for the spatial grid "x" and "y" and for the complex values of psi for the six instants of time depicted in figure 12 ("psia", "psib", "psic", "psid", "psie", "psif").

    Figure 13: Data are in data_fig_13.pt

    This is a data file generated with the torch module of python. It includes ten torch tensors for the spatial grid "x" and "y" and for the complex values of psi for the eight instants of time depicted in figure 13 ("psia", "psib", "psic", "psid", "psie", "psif", "psig", "psih").

  9. Data and Scripts Associated with the Manuscript “Water Column Respiration in...

    • osti.gov
    • search.dataone.org
    Updated Jan 16, 2024
    Cite
    River Corridor Hydro-biogeochemistry from Molecular to Multi-Basin Scales SFA (2024). Data and Scripts Associated with the Manuscript “Water Column Respiration in the Yakima River Basin is Explained by Temperature, Nutrients and Suspended Solids” [Dataset]. http://doi.org/10.15485/2283171
    Explore at:
    Dataset updated
    Jan 16, 2024
    Dataset provided by
    Office of Science (http://www.er.doe.gov/)
    River Corridor Hydro-biogeochemistry from Molecular to Multi-Basin Scales SFA
    Area covered
    Yakima River
    Description

    This data package is associated with the publication “Water Column Respiration in the Yakima River Basin is Explained by Temperature, Nutrients and Suspended Solids” submitted to EGU Biogeochemistry (Laan et al. 2025). In this research, water column respiration (ERwc) data, surface water chemistry data, organic matter (OM) chemistry data, and publicly available geospatial data were used to evaluate the variability in ERwc at 47 sites across the Yakima River basin in Washington, USA.

    In addition to this readme, this data package also includes a file-level metadata (FLMD) file that describes each file and a data dictionary (DD) that describes all column/row headers and variable definitions.

    The data package includes the data inputs and outputs, and the R scripts to reproduce all the analyses performed in the manuscript and create the manuscript figures. The data package is comprised of three main folders (Code, Data, and Figures). The Code folder is comprised of four scripts and three analysis-specific subfolders that contain the R scripts to perform the analyses described in the publication and create publication figures. The Data folder is comprised of two “.csv” files and four subfolders that contain data input and output files. The Published_Data folder contains a readme that directs the user to download the appropriate files and add them to this folder when using the scripts. The Figures folder includes figures from the manuscript in “.pdf” and “.png” formats and a folder with intermediate figure files. This data package is associated with a GitHub repository, which can be found at https://github.com/river-corridors-sfa/rcsfa-RC2-SPS-ERwc.

  10. Data from: Posterior predictive checks of coalescent models: P2C2M, an R...

    • datadryad.org
    • search.dataone.org
    • +1more
    zip
    Updated May 28, 2015
    Cite
    Michael Gruenstaeudl; Noah M. Reid; Gregory L. Wheeler; Bryan C. Carstens (2015). Posterior predictive checks of coalescent models: P2C2M, an R package [Dataset]. http://doi.org/10.5061/dryad.n715n
    Explore at:
    Available download formats: zip
    Dataset updated
    May 28, 2015
    Dataset provided by
    Dryad
    Authors
    Michael Gruenstaeudl; Noah M. Reid; Gregory L. Wheeler; Bryan C. Carstens
    Time period covered
    Nov 28, 2014
    Description

    Bayesian inference operates under the assumption that the empirical data are a good statistical fit to the analytical model, but this assumption can be challenging to evaluate. Here, we introduce a novel R package that utilizes posterior predictive simulation to evaluate the fit of the multispecies coalescent model used to estimate species trees. We conduct a simulation study to evaluate the consistency of different summary statistics in comparing posterior and posterior predictive distributions, the use of simulation replication in reducing error rates and the utility of parallel process invocation towards improving computation times. We also test P2C2M on two empirical data sets in which hybridization and gene flow are suspected of contributing to shared polymorphism, which is in violation of the coalescent model: Tamias chipmunks and Myotis bats. Our results indicate that (i) probability-based summary statistics display the lowest error rates, (ii) the implementation of simulation ...

  11. Additional Tennessee Eastman Process Simulation Data for Anomaly Detection...

    • dataverse.harvard.edu
    • dataone.org
    Updated Jul 6, 2017
    + more versions
    Cite
    Cory A. Rieth; Ben D. Amsel; Randy Tran; Maia B. Cook (2017). Additional Tennessee Eastman Process Simulation Data for Anomaly Detection Evaluation [Dataset]. http://doi.org/10.7910/DVN/6C3JR1
    Explore at:
    Croissant (a format for machine-learning datasets; learn more about this at mlcommons.org/croissant)
    Dataset updated
    Jul 6, 2017
    Dataset provided by
    Harvard Dataverse
    Authors
    Cory A. Rieth; Ben D. Amsel; Randy Tran; Maia B. Cook
    License

    https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/6C3JR1

    Description

    User Agreement, Public Domain Dedication, and Disclaimer of Liability. By accessing or downloading the data or work provided here, you, the User, agree that you have read this agreement in full and agree to its terms. The person who owns, created, or contributed a work to the data or work provided here dedicated the work to the public domain and has waived his or her rights to the work worldwide under copyright law. You can copy, modify, distribute, and perform the work, for any lawful purpose, without asking permission. In no way are the patent or trademark rights of any person affected by this agreement, nor are the rights that any other person may have in the work or in how the work is used, such as publicity or privacy rights. Pacific Science & Engineering Group, Inc., its agents and assigns, make no warranties about the work and disclaim all liability for all uses of the work, to the fullest extent permitted by law. When you use or cite the work, you shall not imply endorsement by Pacific Science & Engineering Group, Inc., its agents or assigns, or by another author or affirmer of the work. This Agreement may be amended, and the use of the data or work shall be governed by the terms of the Agreement at the time that you access or download the data or work from this Website.

    Description
    This dataverse contains the data referenced in Rieth et al. (2017), "Issues and Advances in Anomaly Detection Evaluation for Joint Human-Automated Systems," to be presented at Applied Human Factors and Ergonomics 2017.
    Each .RData file is an external representation of an R dataframe that can be read into an R environment with the 'load' function. The variables loaded are named ‘fault_free_training’, ‘fault_free_testing’, ‘faulty_testing’, and ‘faulty_training’, corresponding to the RData files. Each dataframe contains 55 columns:
    Column 1 ('faultNumber') ranges from 1 to 20 in the “Faulty” datasets and represents the fault type in the TEP. The “FaultFree” datasets only contain fault 0 (i.e. normal operating conditions).
    Column 2 ('simulationRun') ranges from 1 to 500 and represents a different random number generator state from which a full TEP dataset was generated (Note: the actual seeds used to generate training and testing datasets were non-overlapping).
    Column 3 ('sample') ranges either from 1 to 500 (“Training” datasets) or 1 to 960 (“Testing” datasets). The TEP variables (columns 4 to 55) were sampled every 3 minutes for a total duration of 25 hours and 48 hours respectively. Note that the faults were introduced 1 and 8 hours into the Faulty Training and Faulty Testing datasets, respectively.
    Columns 4 to 55 contain the process variables; the column names retain the original variable names.

    Acknowledgments. This work was sponsored by the Office of Naval Research, Human & Bioengineered Systems (ONR 341), program officer Dr. Jeffrey G. Morrison under contract N00014-15-C-5003. The views expressed are those of the authors and do not reflect the official policy or position of the Office of Naval Research, Department of Defense, or US Government.
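    A hedged sketch of working with these dataframes in R; the .RData file name used here is a placeholder, while the variable and column names are the ones described above.

```r
# Load one of the .RData files (file name assumed) and slice out a single run.
load("TEP_FaultFree_Training.RData")   # creates `fault_free_training` in the workspace

str(fault_free_training[, 1:5])        # faultNumber, simulationRun, sample, first process variables

# One complete training run under normal operating conditions
run1 <- subset(fault_free_training, simulationRun == 1)
nrow(run1)                             # 500 samples, one every 3 minutes over 25 hours
```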

  12. R script to obtain daily averages of ancillary water data (turbidity,...

    • search.dataone.org
    • hydroshare.org
    Updated Dec 5, 2021
    + more versions
    Cite
    Sandra Villamizar (2021). R script to obtain daily averages of ancillary water data (turbidity, specific conductivity, etc.) [Dataset]. https://search.dataone.org/view/sha256%3A6dbf91d2a1bd6f1df3fe6341ec3c473944b65c26fadac7238e0f68e920a4eb33
    Explore at:
    Dataset updated
    Dec 5, 2021
    Dataset provided by
    Hydroshare
    Authors
    Sandra Villamizar
    Description

    We present the procedure to obtain daily averages of ancillary water data (e.g., turbidity and chlorophyll) that may be used to support the interpretation of the daily metabolic rate estimates. The Ancillary_DarilyAvg.R script needs to be executed as many times as the number of available ancillary water parameters (twice in our case). For the SJR restoration reach, the data can be accessed by querying the CDEC website, indicating the sensor number (28 for chlorophyll, 27 for turbidity), the interval of the data (event or hourly), and the starting and ending dates of the period of interest. A comma-separated values file is produced for each request, and each of these files is the input to this R script. Section 1 of the script sets up the working directory; the ancillary water data input file is read in section 2, and particular configuration parameters are specified in section 3. According to these parameters, a new table (‘table1’) with only the columns of interest is created in section 4; section 5 converts into NA's all the cell values that report missing data (“m”) or errors (“-9998” or “-9997”). The actual daily averages are calculated in section 6 and stored in the variable ‘table4’. Section 7 transforms the date column into an actual date format in order to identify whether there are missing days within the time series. The output of this section (dates and intervals between dates), as well as the daily averages contained in ‘table4’, are merged into a new variable ‘table5’. Finally, section 8 prints the output of this script as a tab-separated text file. DOI: 10.6084/m9.figshare.3413023
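    A hedged sketch of the same daily-averaging logic in base R; the input file name and its column names (DATE.TIME, VALUE) are assumptions about the CDEC export, not the names used in the actual script.

```r
# Sketch: daily averages of one ancillary parameter from a CDEC export (assumed schema).
raw <- read.csv("turbidity_event_data.csv", stringsAsFactors = FALSE)

# Recode missing-data flags ("m") and error codes to NA (cf. section 5)
raw$VALUE[raw$VALUE %in% c("m", "-9998", "-9997")] <- NA
raw$VALUE <- as.numeric(raw$VALUE)

# Daily averages (cf. section 6)
raw$date <- as.Date(raw$DATE.TIME)
daily    <- aggregate(VALUE ~ date, data = raw, FUN = mean)

# Tab-separated output (cf. section 8)
write.table(daily, "turbidity_daily_avg.txt", sep = "\t", row.names = FALSE)
```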

  13. An Index to Roscher's Lexicon of Mythology

    • zenodo.org
    txt
    Updated Aug 11, 2024
    Cite
    Jonathan Groß; Jonathan Groß (2024). An Index to Roscher's Lexicon of Mythology [Dataset]. http://doi.org/10.5281/zenodo.13294014
    Explore at:
    Available download formats: txt
    Dataset updated
    Aug 11, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jonathan Groß
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Time period covered
    May 7, 2024
    Description

    compiled by Jonathan Groß
    ORCID 0000-0002-2564-9530
    jgross85 [AT] gmail [DOT] com

    1. Introduction

    1.1. General Disclaimer

    This index file was created as a private research project with the goal to make the wealth of information in Wilhelm Heinrich Roscher's "Detailed Lexicon of Greek and Roman Mythology" (Ausführliches Lexikon der griechischen und römischen Mythologie) more accessible to everybody.

    Roscher's Lexicon, originally published by B. G. Teubner in Leipzig from 1884 to 1937, is the most complete resource on Greek and Roman mythological names to date and also encompasses mythological (and religious) subjects from Sumerian, Akkadian, Babylonian, Hittite, Egyptian, Celtic, Germanic and other neighbouring cultures.

    The Lexicon was reprinted three times in the latter half of the 20th century (by Georg Olms in Hildesheim), even after its pictorial content had been superseded by the Lexicon Iconographicum Mythologiae Classicae (1981–1999, 2009). Unfortunately, since the last reprint of 1992/1993, Roscher's Lexicon has been out of stock at both publishers (Olms and De Gruyter Brill).

    Since the late 2000s, Roscher's Lexicon (the 6 main volumes and 4 supplements) was digitised by initiatives such as Google Books and the Internet Archive, and its contents can now be viewed there (with OCR text). One prominent use case of these scans is the German Wikipedia, where more than 2,500 pages use Roscher's Lexicon as a reference with a link to a scanned page in the Internet Archive.

    1.2. Licensing

    This dataset is released under the CC0 1.0 Universal License (https://creativecommons.org/publicdomain/zero/1.0/deed.en). I chose this license in order to maximise the usefulness of the data to everybody.

    Use and reuse of this data is strongly encouraged, and one use case has already been initiated by the author:
    - https://mythogram.wikibase.cloud/wiki/Project:Roscher%27s_Lexicon_of_Mythology (presentation of the information from the index file as Linked Open Data, finished on 28 May 2024 with emendations until 5 August 2024)

    Although not technically required by the licensing agreement, the author would appreciate being informed about other uses of the data.

    The contents of the Lexicon themselves are mostly in the Public Domain as of 2024. Additionally, many of the smaller entries do not reach the threshold of originality. This includes most of the cover addenda.

    2. Description of Data

    The index file is formatted as tabular data. This file was created with LibreOffice Calc (originally in 7.6.6.3, in LibreOffice Calc 24.2 as of version 1.1 of this file) and is stored in its native .ods format. For convenience, an .xlsx version is also provided. Both files are practically identical, but the .ods file is to be regarded as the 'original'.

    Data is stored in several tabs:
    (A) 'main alphabet' with the headwords of the main work (excluding addenda and corrigenda from the covers; for these see below).
    (B) 'cover addenda' with the additional entries
    (C) 'authors' with information on the authors
    (D) 'fascicles' with information on the individual issues of the Lexicon

    The tabs are available separately as .csv files (with tab separation, so strictly speaking it should be .tsv); a short sketch of reading one of them in R follows the tab descriptions below.

    Tabs A and B are almost identical in structure, with the columns:
    A id = unique entry ID (not authoritative, just a means to identify individual entries)
    B headword = lemma of the entry as stated by the Lexicon
    C subject_type = classification scheme for the subject matter of the article (again, not authoritative and in places even contentious)
    D vol = volume number
    E fascicle = issue number (not found in most exemplars, assigned according to my own research)
    F date = publication date of the entry (inferred from the issue date of the fascicle)
    G–H col1,2 = start and end column
    I colspan = span of columns
    J–M author1,2,3,4 = author of the entry (please refer to Tab C, column A)
    N entry_type = classification of entry (article, cross-reference, addendum, correction)
    O scan = URL to a scan of the start column in the Internet Archive
    P Wikidata = ID of the Wikidata item representing the subject (incomplete as of Version 1.1)
    Q FactGrid = ID of the FactGrid item representing the subject (mostly missing as of Version 1.1)
    R Mythogram = ID of the (bibliographic) Mythogram item representing the Lexicon entry
    S redirect_target = target headword as stated, if the entry is a cross-reference
    T remarks = remarks on the entry or subject (such as 'non-entity', 'duplicate', 'double lemma')
    U PD = whether the entry is in the Public Domain (either 'yes' or the year in which it enters the PD)

    Tab B has two additional columns, which are mostly empty as of version 1.1:
    V referring_to = target entry (in the main alphabet) of the correction or addenda
    W excerpt = textual excerpt from the entry

    Tab C has information on the authors:
    A short_name = for sorting reasons
    B full_name = full name
    C Wikidata = Wikidata item
    D FactGrid = FactGrid item
    E Mythogram = Mythogram item
    F yob = year of birth
    G yod = year of death
    H vols = volumes contributed to
    I article_count = number of articles written (not counting corrections and addenda from Tab B)
    J–L namestring1,2,3 = name as written in the Lexicon
    M remarks = remarks on completeness and certainty of data

    Tab D informs about the individual fascicles of the Lexicon as they appeared from 1884 to 1937:
    A no. = fascicle number
    B vol = volume(s) the fascicle belongs to
    C colspan = column span of the fascicle
    D headwords = headwords contained in the fascicle as advertised on the cover page
    E issue_date = date of publication of the fascicle as stated on the cover page
    F quires = quire numbers of the fascicle
    G quire_count = quire count of the fascicle (calculated from column numbers: in some cases, at the end of a volume, quires were shortened, returning rational numbers here)
    H remarks = remarks (in German)

    3. Version History and Change Log

    --------------------------------------------------
    Version 1.1 (August 11th, 2024)
    -Tab A, column P (Wikidata Q-ids): added 2,380 out of 15,489 (= 15.4%)
    -Tab A, column R (Mythogram Q-ids): completed
    -Tab C: added data for author Wilhelm Windisch (translated Cumont's article on 'Mithras')
    -minor corrections to some entries (typos)
    -volume number changed from 3.2 to 3.1 for 125 entries (Pasikrateia–Peirithoos)
    -fascicle number changed from 104/105 to 106/107 for the last 12 entries (Tameobrigus–Kerberos [Nachtrag])
    -addition of 11 missed entries, values in column A renumbered accordingly

    --------------------------------------------------
    Version 1.0 (May 4th, 2024)
    -Tabs A–B with complete and checked data for columns A–N and S
    -Tab A with complete data for column O
    -Tabs C and D with complete data

    --------------------------------------------------
    Prior to publication:

    -collection and checking of data (roughly 376 hours of work, started in July 2023 and finished on Star Wars Day 2024)

  14. Petre_Slide_CategoricalScatterplotFigShare.pptx

    • figshare.com
    pptx
    Updated Sep 19, 2016
    Cite
    Benj Petre; Aurore Coince; Sophien Kamoun (2016). Petre_Slide_CategoricalScatterplotFigShare.pptx [Dataset]. http://doi.org/10.6084/m9.figshare.3840102.v1
    Explore at:
    pptx
    Available download formats
    Dataset updated
    Sep 19, 2016
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Benj Petre; Aurore Coince; Sophien Kamoun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Categorical scatterplots with R for biologists: a step-by-step guide

    Benjamin Petre1, Aurore Coince2, Sophien Kamoun1

    1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK

    Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.

    Protocol

    • Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the PowerPoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed is indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import into R.

    • Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the PowerPoint slide and paste it into the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers. (A sketch of such a script is given after this protocol.)

    • Step 3: save the graph as a .pdf file. Shape the window at your convenience and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.
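    The script itself is provided on the PowerPoint slide. For orientation only, a script of this kind might look like the minimal sketch below; it is not the authors' original, and it assumes the three-column layout from Step 1 and that ggplot2 is installed (see Note 1):

    # install.packages('ggplot2')   # one-time installation, equivalent to the GUI route in Note 1
    library(ggplot2)

    # Import the .csv file from Step 1 via a file dialog; expected columns: Replicate, Condition, Value
    data <- read.csv(file.choose())
    data$Replicate <- as.factor(data$Replicate)

    # Boxplots per condition with jittered dots coloured by replicate
    graph <- ggplot(data, aes(x = Condition, y = Value))
    graph + geom_boxplot(outlier.colour = 'black', colour = 'black') +
      geom_jitter(aes(col = Replicate)) + theme_bw()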

    Notes

    • Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.

    • Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.

    7 Display the graph in a separate window. Dot colors indicate replicates
    graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()

    References

    Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.

    Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035

    Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128

    https://cran.r-project.org/

    http://ggplot2.org/

  15. HIPPO Pressure-Weighted Mean Total, 10-km, and 100-m Interval Column...

    • data.ucar.edu
    • ckanprod.data-commons.k8s.ucar.edu
    archive
    Updated Oct 7, 2025
    Cite
    Andrew Watt; Anne E. Perring; Benjamin R. Miller; Bin Xiang; Bradley Hall; Britton B. Stephens; Bruce C. Daube; Christopher A. Pickett-Heaps; Colm Sweeney; Dale Hurst; Daniel J. Jacob; David C. Rogers; David Nance; David W. Fahey; Elliot Atlas; Eric A. Kort; Eric A. Ray; Eric J. Hintsa; Fred Moore; Geoff S. Dutton; Greg Santoni; Huiqun Wang; J. Ryan Spackman; James W. Elkins; Jasna V. Pittman; Jenny A. Fisher; Jonathan Bent; Joshua P. Schwarz; Julie Haggerty; Karen H. Rosenlof; Kevin J. Wecht; Laurel A. Watts; Mark Zondlo; Michael J. Mahoney; Minghui Diao; Pavel Romashkin; Qiaoqiao Wang; Ralph F. Keeling; Richard Lueb; Rodrigo Jimenez-Pizarro; Roger Hendershot; Roisin Commane; Ru-Shan Gao; Samuel J. Oltmans; Stephen A. Montzka; Stephen R. Shertz; Steven C. Wofsy; Stuart Beaton; Sunyoung Park; Teresa Campos; William A. Cooper (2025). HIPPO Pressure-Weighted Mean Total, 10-km, and 100-m Interval Column Concentrations [Dataset]. http://doi.org/10.3334/CDIAC/HIPPO_011
    Explore at:
    archive
    Available download formats
    Dataset updated
    Oct 7, 2025
    Authors
    Andrew Watt; Anne E. Perring; Benjamin R. Miller; Bin Xiang; Bradley Hall; Britton B. Stephens; Bruce C. Daube; Christopher A. Pickett-Heaps; Colm Sweeney; Dale Hurst; Daniel J. Jacob; David C. Rogers; David Nance; David W. Fahey; Elliot Atlas; Eric A. Kort; Eric A. Ray; Eric J. Hintsa; Fred Moore; Geoff S. Dutton; Greg Santoni; Huiqun Wang; J. Ryan Spackman; James W. Elkins; Jasna V. Pittman; Jenny A. Fisher; Jonathan Bent; Joshua P. Schwarz; Julie Haggerty; Karen H. Rosenlof; Kevin J. Wecht; Laurel A. Watts; Mark Zondlo; Michael J. Mahoney; Minghui Diao; Pavel Romashkin; Qiaoqiao Wang; Ralph F. Keeling; Richard Lueb; Rodrigo Jimenez-Pizarro; Roger Hendershot; Roisin Commane; Ru-Shan Gao; Samuel J. Oltmans; Stephen A. Montzka; Stephen R. Shertz; Steven C. Wofsy; Stuart Beaton; Sunyoung Park; Teresa Campos; William A. Cooper
    Time period covered
    Jan 8, 2009 - Sep 8, 2011
    Area covered
    Description

    This dataset contains the total column and vertical profile data for all Missions, 1 through 5, of the HIAPER Pole-to-Pole Observations (HIPPO) study of carbon cycle and greenhouse gases. The pressure-weighted mean column concentrations of parameters reported in this data set are estimates of the quantities that would be observed from a total column instrument at the top of each profile, i.e., from an airplane looking down or from a satellite. The Missions took place from 08 January 2009 to 08 September 2011. There are five space-delimited ASCII files included with this data set that have been compressed into one *.zip file for convenient download. Please refer to the readme for more information. The EOL Version 1.0 data set was created in 2012 and previously served as R. 20121129 by ORNL.
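    For orientation only, a pressure-weighted mean over a profile is conventionally defined as

    \bar{X} = \frac{\sum_i X_i \, \Delta p_i}{\sum_i \Delta p_i}

    where X_i is the measured mixing ratio in layer i and \Delta p_i is the pressure thickness of that layer. This generic definition is given here for context and is not necessarily the exact averaging scheme used to produce this dataset.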

  16. Data from: A FAIR and modular image-based workflow for knowledge discovery...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 7, 2024
    Cite
    Meghan Balk; Thibault Tabarin; John Bradley; Hilmar Lapp (2024). Data from: A FAIR and modular image-based workflow for knowledge discovery in the emerging field of imageomics [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8233379
    Explore at:
    Dataset updated
    Jul 7, 2024
    Dataset provided by
    National Ecological Observatory Network
    Duke University School of Medicine
    Authors
    Meghan Balk; Thibault Tabarin; John Bradley; Hilmar Lapp
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data and results from the Imageomics Workflow. These include data files from the Fish-AIR repository (https://fishair.org/) for purposes of reproducibility and outputs from the application-specific imageomics workflow contained in the Minnow_Segmented_Traits repository (https://github.com/hdr-bgnn/Minnow_Segmented_Traits).

    Fish-AIR: This is the dataset downloaded from Fish-AIR, filtering for Cyprinidae and the Great Lakes Invasive Network (GLIN) from the Illinois Natural History Survey (INHS) dataset. These files contain information about fish images, fish image quality, and path for downloading the images. The data download ARK ID is dtspz368c00q. (2023-04-05). The following files are unaltered from the Fish-AIR download. We use the following files:

    extendedImageMetadata.csv: A CSV file containing information about each image file. It has the following columns: ARKID, fileNameAsDelivered, format, createDate, metadataDate, size, width, height, license, publisher, ownerInstitutionCode. Column definitions are given at https://fishair.org/vocabulary.html and the persistent column identifiers are in the meta.xml file.

    imageQualityMetadata.csv: A CSV file containing information about the quality of each image. It has the following columns: ARKID, license, publisher, ownerInstitutionCode, createDate, metadataDate, specimenQuantity, containsScaleBar, containsLabel, accessionNumberValidity, containsBarcode, containsColorBar, nonSpecimenObjects, partsOverlapping, specimenAngle, specimenView, specimenCurved, partsMissing, allPartsVisible, partsFolded, brightness, uniformBackground, onFocus, colorIssue, quality, resourceCreationTechnique. Column definitions are given at https://fishair.org/vocabulary.html and the persistent column identifiers are in the meta.xml file.

    multimedia.csv: A CSV file containing information about image downloads. It has the following columns: ARKID, parentARKID, accessURI, createDate, modifyDate, fileNameAsDelivered, format, scientificName, genus, family, batchARKID, batchName, license, source, ownerInstitutionCode. Column definitions are given at https://fishair.org/vocabulary.html and the persistent column identifiers are in the meta.xml file.

    meta.xml: An XML file with the metadata about the column indices and URIs for each file contained in the original downloaded zip file. This file is used in the fish-air.R script to extract the indices for column headers.
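    As a rough sketch of how such a metadata file can be read from R (this assumes a Darwin Core Archive-style layout with <field index="..." term="..."/> elements; the actual structure may differ, and the fish-air.R script remains the authoritative reference):

    # Extract column indices and term URIs from meta.xml with the xml2 package
    library(xml2)

    meta <- read_xml("meta.xml")
    xml_ns_strip(meta)                       # drop namespaces so plain XPath works
    fields <- xml_find_all(meta, ".//field")
    data.frame(
      index = xml_attr(fields, "index"),
      term  = xml_attr(fields, "term")
    )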

    The outputs from the Minnow_Segmented_Traits workflow are:

    sampling.df.seg.csv: Table with tallies of the sampling of image data per species during the data cleaning and data analysis. This is used in Table S1 in Balk et al.

    presence.absence.matrix.csv: The Presence-Absence matrix from segmentation, not cleaned. This is the result of the combined outputs from the presence.json files created by the rule “create_morphological_analysis”. The cleaned version of this matrix is shown as Table S3 in Balk et al.

    heatmap.avg.blob.png and heatmap.sd.blob.png: Heatmaps of average area of biggest blob per trait (heatmap.avg.blob.png) and standard deviation of area of biggest blob per trait (heatmap.sd.blob.png). These images are also in Figure S3 of Balk et al.

    minnow.filtered.from.iqm.csv: Fish image data set after filtering (see methods in Balk et al. for filter categories).

    burress.minnow.sp.filtered.from.iqm.csv: Fish image data set after filtering and selecting species from Burress et al. 2017.

  17. Divvy Bikeshare Data | April 2020 - May 2021

    • kaggle.com
    Updated Aug 21, 2021
    Cite
    Antoni K Pestka (2021). Divvy Bikeshare Data | April 2020 - May 2021 [Dataset]. https://www.kaggle.com/antonikpestka/divvy-bikeshare-data-april-2020-may-2021/code
    Explore at:
    Croissant
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 21, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Antoni K Pestka
    Description

    Original Divvy Bikeshare Data obtained from here

    City of Chicago Zip Code Boundary Data obtained from here

    Tableau Dashboard Viz can be seen here

    R code can be found here

    Context

    This is my first-ever project after recently completing the Google Data Analytics Certificate on Coursera.

    The goals of the project are to answer the following questions:
    1. How do annual riders and casual riders use Divvy bikeshare differently?
    2. Why would casual riders buy annual memberships?
    3. How can Divvy use digital media to influence casual riders to become members?

    Casual riders are defined as those who do not have an annual membership, and instead use the service on a pay-per-ride basis.

    Content

    Original Divvy Bikeshare Data obtained from here

    The original datasets included the following columns:
    • Ride ID #
    • Rideable Type (electric, docked bike, classic)
    • Started At Date/Time
    • Ended At Date/Time
    • Start Station Address
    • Start Station ID
    • End Station Address
    • End Station ID
    • Start Longitude
    • Start Latitude
    • End Longitude
    • End Latitude
    • Member Type (member, casual)

    City of Chicago Zip Code Boundary Data obtained from here

    The zip code boundary geospatial files were used to calculate the zip code of trip origin for each trip based on start longitude and start latitude.

    Caveats and Assumptions

    1. Divvy utilizes two types of bicycles: electric bicycles and classic bicycles. For the column labeled "rideable_type", three values existed: docked_bike, electric_bike, and classic. Docked_bike and classic were aggregated into the same category. Therefore, they are labeled as "other" on the visualization.

    2. Negative ride lengths and ride lengths under 90 seconds were not included in the calculation of average ride length.
    -Negative ride lengths exist because the end time and date were recorded as occurring BEFORE the start time and date on certain data entries.
    -Ride lengths of 90 seconds and less were ruled out due to the possibility of bikes failing to dock properly or being checked out briefly for maintenance checks.
    -This removed 90,842 records from the calculations for average ride length.

    The process

    R programming language was used for the following:

    1. Create a new column for the zip code of each trip origin based on the start longitude and start latitude
    2. Calculate the ride length in seconds for each trip
    3. Remove unnecessary columns
    4. Rename "electric_bike" to EL and "docked_bike" to DB

    The R code I utilized is found here
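    As a rough illustration of steps 2 to 4 (this is not the linked code; the column names started_at, ended_at and rideable_type are assumed from the original Divvy schema, the input file name is hypothetical, and the zip-code spatial join from step 1 is omitted):

    library(dplyr)

    trips <- read.csv("divvy_merged.csv")              # hypothetical merged input file

    trips <- trips %>%
      mutate(
        started_at    = as.POSIXct(started_at),
        ended_at      = as.POSIXct(ended_at),
        ride_length   = as.numeric(difftime(ended_at, started_at, units = "secs")),
        rideable_type = recode(rideable_type, electric_bike = "EL", docked_bike = "DB")
      ) %>%
      select(-c(start_station_id, end_station_id))     # example of dropping unneeded columns

    # Mirror caveat 2: exclude negative rides and rides of 90 seconds or less when averaging
    mean(trips$ride_length[trips$ride_length > 90])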

    Excel was used for the following:

    1. Deletion of header rows for all dataset files except for the first file (April 2020)
    2. Deletion of the geometry information to save file space

    A .bat file using the DOS command line was used to merge all the cleaned CSV files into a single file.

    Finally, the cleaned and merged dataset was connected to Tableau for analysis and visualization. A link to the dashboard can be found here

    Data Analysis Overview

    Zip Code with highest quantity of trips: 60614 (615,010)
    Total Quantity of Zip Codes: 56
    Trip Quantity of Top 9 Zip Codes: 60.35% (2,630,330)
    Trip Quantity of the Remaining 47 Zip Codes: 39.65% (1,728,281)

    Total Quantity of Trips: 4,358,611
    Quantity of Trips by Annual Members: 58.15% (2,534,718)
    Quantity of Trips by Casual Members: 41.85% (1,823,893)

    Average Ride Length with Electric Bicycle:
      Annual Members: 13.8 minutes
      Casual Members: 22.3 minutes

    Average Ride Length with Classic Bicycle:
      Annual Members: 16.8 minutes
      Casual Members: 49.7 minutes

    Average Ride Length Overall:
      Annual Members: 16.2 minutes
      Casual Members: 44.2 minutes

    Peak Day of the Week for Overall Trip Quantity:
      Annual Members: Saturday
      Casual Members: Saturday

    Slowest Day of the Week for Overall Trip Quantity: Tuesday
      Annual Members: Sunday
      Casual Members: Tuesday

    Peak Day of the Week for Electric Bikes: Saturday
      Annual Members: Saturday
      Casual Members: Saturday

    Slowest Day of the Week for Electric Bikes: Tuesday
      Annual Members: Sunday
      Casual Members: Tuesday

    Peak day of the Week for Classic Bikes: Saturday Ann...

  18. Datasets for Sentiment Analysis

    • zenodo.org
    csv
    Updated Dec 10, 2023
    Cite
    Julie R. Repository creator - Campos Arias; Julie R. Repository creator - Campos Arias (2023). Datasets for Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.10157504
    Explore at:
    csv
    Available download formats
    Dataset updated
    Dec 10, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Julie R. Repository creator - Campos Arias; Julie R. Repository creator - Campos Arias
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.

    The datasets are listed below, along with the details of their references, authors, and download sources.

    ----------- STS-Gold Dataset ----------------

    The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.

    Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.

    File name: sts_gold_tweet.csv

    ----------- Amazon Sales Dataset ----------------

    This dataset contains ratings and reviews for 1K+ Amazon products, as per their details listed on the official website of Amazon. The data was scraped in the month of January 2023 from the official website of Amazon.

    Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)

    Features:

    • product_id - Product ID
    • product_name - Name of the Product
    • category - Category of the Product
    • discounted_price - Discounted Price of the Product
    • actual_price - Actual Price of the Product
    • discount_percentage - Percentage of Discount for the Product
    • rating - Rating of the Product
    • rating_count - Number of people who voted for the Amazon rating
    • about_product - Description about the Product
    • user_id - ID of the user who wrote review for the Product
    • user_name - Name of the user who wrote review for the Product
    • review_id - ID of the user review
    • review_title - Short review
    • review_content - Long review
    • img_link - Image Link of the Product
    • product_link - Official Website Link of the Product

    License: CC BY-NC-SA 4.0

    File name: amazon.csv

    ----------- Rotten Tomatoes Reviews Dataset ----------------

    This rating inference dataset is a sentiment classification dataset, containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5,331 rows contain only negative samples and the last 5,331 rows contain only positive samples, so the data should be shuffled before use.

    This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).

    Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics

    File name: data_rt.csv

    ----------- Preprocessed Dataset Sentiment Analysis ----------------

    Preprocessed Amazon product review data of Gen3EcoDot (Alexa) scraped entirely from amazon.in
    Stemmed and lemmatized using nltk.
    Sentiment labels are generated using TextBlob polarity scores.

    The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).

    DOI: 10.34740/kaggle/dsv/3877817

    Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }

    This dataset was used in the experimental phase of my research.

    File name: EcoPreprocessed.csv

    ----------- Amazon Earphones Reviews ----------------

    This dataset consists of 9,930 Amazon reviews and star ratings for the 10 latest (as of mid-2019) Bluetooth earphone devices, for learning how to train machine-learning models for sentiment analysis.

    This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

    The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)

    License: U.S. Government Works

    Source: www.amazon.in

    File name (original): AllProductReviews.csv (contains 14337 reviews)

    File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)

    ----------- Amazon Musical Instruments Reviews ----------------

    This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.

    This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

    The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review, raw) and division (manually added - categorical label generated using overall score).

    Source: http://jmcauley.ucsd.edu/data/amazon/

    File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)

    File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)

  19. Data from: SynchroSep-MS: Parallel LC Separations for Multiplexed Proteomics...

    • acs.figshare.com
    xlsx
    Updated Jul 30, 2025
    Cite
    Noah M. Lancaster; Li-Yu Chen; Bingnan Zhao; Benton J. Anderson; Mitchell D. Probasco; Vadim Demichev; Daniel A. Polasky; Alexey I. Nesvizhskii; Katherine A. Overmyer; Scott T. Quarmby; Joshua J. Coon (2025). SynchroSep-MS: Parallel LC Separations for Multiplexed Proteomics [Dataset]. http://doi.org/10.1021/jasms.5c00207.s002
    Explore at:
    xlsx
    Available download formats
    Dataset updated
    Jul 30, 2025
    Dataset provided by
    ACS Publications
    Authors
    Noah M. Lancaster; Li-Yu Chen; Bingnan Zhao; Benton J. Anderson; Mitchell D. Probasco; Vadim Demichev; Daniel A. Polasky; Alexey I. Nesvizhskii; Katherine A. Overmyer; Scott T. Quarmby; Joshua J. Coon
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Achieving high throughput remains a challenge in MS-based proteomics for large-scale applications. We introduce SynchroSep-MS, a novel method for parallelized, label-free proteome analysis that leverages the rapid acquisition speed of modern mass spectrometers. This approach employs multiple liquid chromatography columns, each with an independent sample, simultaneously introduced into a single mass spectrometer inlet. A precisely controlled retention time offset between sample injections creates distinct elution profiles, facilitating unambiguous analyte assignment. We modified the DIA-NN workflow to effectively process these unique parallelized data, accounting for retention time offsets. Using a dual-column setup with mouse brain peptides, SynchroSep-MS detected approximately 16,700 unique protein groups, nearly doubling the peptide information obtained from a conventional single proteome analysis. The method demonstrated excellent precision and reproducibility (median protein %RSDs less than 4%) and high quantitative linearity (median R2 greater than 0.96) with minimal matrix interference. SynchroSep-MS represents a new paradigm for data collection and the first example of label-free multiplexed proteome analysis via parallel LC separations, offering a direct strategy to accelerate throughput for demanding applications such as large-scale clinical cohorts and single-cell analyses without compromising peak capacity or causing ionization suppression.

  20. Comparison table across software for entropic and complexity timeseries...

    • plos.figshare.com
    xls
    Updated Jun 13, 2025
    Cite
    George Datseris; Kristian Agasøster Haaga (2025). Comparison table across software for entropic and complexity timeseries analysis. The symbols mean: ✓ = has aspect, ✗ = does not have aspect, ◗ = partially has aspect. The numeric superscripts in the first column correspond to more extensive descriptions that we provide in the main text of Sect 4. The codebase that produced the benchmarks can be found online in [66]. Benchmarks were run on a laptop with 11th Gen Intel(R) Core(TM) i7-1165G7 at 2.80 GHz. [Dataset]. http://doi.org/10.1371/journal.pone.0324431.t001
    Explore at:
    xls
    Available download formats
    Dataset updated
    Jun 13, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    George Datseris; Kristian Agasøster Haaga
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comparison table across software for entropic and complexity timeseries analysis. The symbols mean: ✓ = has aspect, ✗ = does not have aspect, ◗ = partially has aspect. The numeric superscripts in the first column correspond to more extensive descriptions that we provide in the main text of Sect 4. The codebase that produced the benchmarks can be found online in [66]. Benchmarks were run on a laptop with 11th Gen Intel(R) Core(TM) i7-1165G7 at 2.80 GHz.

Cite
Udayakumar19 (2022). Google Data Analytics Case Study Cyclistic [Dataset]. https://www.kaggle.com/datasets/udayakumar19/google-data-analytics-case-study-cyclistic/suggestions

Google Data Analytics Case Study Cyclistic

Difference between Casual vs Member in Cyclistic Riders

Explore at:
zip(1299 bytes)Available download formats
Dataset updated
Sep 27, 2022
Authors
Udayakumar19
Description

Introduction

Welcome to the Cyclistic bike-share analysis case study! In this case study, you will perform many real-world tasks of a junior data analyst. You will work for a fictional company, Cyclistic, and meet different characters and team members. In order to answer the key business questions, you will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act. Along the way, the Case Study Roadmap tables — including guiding questions and key tasks — will help you stay on the right path.

Scenario

You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

Ask

How do annual members and casual riders use Cyclistic bikes differently?

Guiding Question:

What is the problem you are trying to solve?
  How do annual members and casual riders use Cyclistic bikes differently?
How can your insights drive business decisions?
  The insight will help the marketing team to make a strategy for casual riders

Prepare

Guiding Question:

Where is your data located?
  Data located in Cyclistic organization data.

How is data organized?
  Dataset are in csv format for each month wise from Financial year 22.

Are there issues with bias or credibility in this data? Does your data ROCCC? 
  It is good it is ROCCC because data collected in from Cyclistic organization.

How are you addressing licensing, privacy, security, and accessibility?
  The company has their own license over the dataset. Dataset does not have any personal information about the riders.

How did you verify the data’s integrity?
  All the files have consistent columns and each column has the correct type of data.

How does it help you answer your questions?
  Insights always hidden in the data. We have the interpret with data to find the insights.

Are there any problems with the data?
  Yes, starting station names, ending station names have null values.

Process

Guiding Question:

What tools are you choosing and why?
  I used R studio for the cleaning and transforming the data for analysis phase because of large dataset and to gather experience in the language.

Have you ensured the data’s integrity?
 Yes, the data is consistent throughout the columns.

What steps have you taken to ensure that your data is clean?
  First, duplicates and null values are removed; then new columns are added for analysis.

How can you verify that your data is clean and ready to analyze? 
 Make sure the column names are consistent throughout all data sets by using the bind_rows() function.

Made sure column data types are consistent throughout all the datasets by using compare_df_cols() from the janitor package.
Combined all the datasets into a single data frame so they stay consistent throughout the analysis.
Removed the columns start_lat, start_lng, end_lat, end_lng from the data frame because those columns are not required for the analysis.
Created new columns day, date, month, and year from the started_at column; this provides additional opportunities to aggregate the data.
Created the ride_length column from the started_at and ended_at columns to find the average ride duration of the riders.
Removed the null rows from the dataset by using the na.omit() function (see the sketch below).
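A minimal sketch of these cleaning steps (the monthly file names are hypothetical, the column names are assumed to follow the Cyclistic/Divvy trip schema, and this is an illustration rather than the documented cleaning code):

library(dplyr)
library(janitor)

m1 <- read.csv("202104-tripdata.csv")     # hypothetical monthly extracts
m2 <- read.csv("202105-tripdata.csv")

compare_df_cols(m1, m2)                   # check that column names and types agree
all_trips <- bind_rows(m1, m2)            # combine into a single data frame

all_trips <- all_trips %>%
  select(-c(start_lat, start_lng, end_lat, end_lng)) %>%
  mutate(
    started_at  = as.POSIXct(started_at),
    ended_at    = as.POSIXct(ended_at),
    date        = as.Date(started_at),
    day         = format(date, "%d"),
    month       = format(date, "%m"),
    year        = format(date, "%Y"),
    ride_length = as.numeric(difftime(ended_at, started_at, units = "mins"))
  ) %>%
  na.omit()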
Have you documented your cleaning process so you can review and share those results? 
  Yes, the cleaning process is documented clearly.

Analyze Phase:

Guiding Questions:

How should you organize your data to perform analysis on it?
  The data has been organized into one single data frame by using the read csv function in R.
Has your data been properly formatted?
  Yes, all the columns have their correct data type.

What surprises did you discover in the data?
  Casual members' ride durations are higher than annual members'.
  Casual members use docked bikes more widely than annual members.
What trends or relationships did you find in the data?
  Annual members mainly use the bikes for commuting.
  Casual members prefer the docked bikes.
  Annual members prefer the electric or classic bikes.
How will these insights help answer your business questions?
  These insights help to build a profile for each rider type.

Share

Guiding Questions:

Were you able to answer the question of how ...