9 datasets found
  1. m

    R codes and dataset for Visualisation of Diachronic Constructional Change...

    • bridges.monash.edu
    • researchdata.edu.au
    zip
    Updated May 30, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Gede Primahadi Wijaya Rajeg (2023). R codes and dataset for Visualisation of Diachronic Constructional Change using Motion Chart [Dataset]. http://doi.org/10.26180/5c844c7a81768
    Explore at:
    zipAvailable download formats
    Dataset updated
    May 30, 2023
    Dataset provided by
    Monash University
    Authors
    Gede Primahadi Wijaya Rajeg
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    PublicationPrimahadi Wijaya R., Gede. 2014. Visualisation of diachronic constructional change using Motion Chart. In Zane Goebel, J. Herudjati Purwoko, Suharno, M. Suryadi & Yusuf Al Aried (eds.). Proceedings: International Seminar on Language Maintenance and Shift IV (LAMAS IV), 267-270. Semarang: Universitas Diponegoro. doi: https://doi.org/10.4225/03/58f5c23dd8387Description of R codes and data files in the repositoryThis repository is imported from its GitHub repo. Versioning of this figshare repository is associated with the GitHub repo's Release. So, check the Releases page for updates (the next version is to include the unified version of the codes in the first release with the tidyverse).The raw input data consists of two files (i.e. will_INF.txt and go_INF.txt). They represent the co-occurrence frequency of top-200 infinitival collocates for will and be going to respectively across the twenty decades of Corpus of Historical American English (from the 1810s to the 2000s).These two input files are used in the R code file 1-script-create-input-data-raw.r. The codes preprocess and combine the two files into a long format data frame consisting of the following columns: (i) decade, (ii) coll (for "collocate"), (iii) BE going to (for frequency of the collocates with be going to) and (iv) will (for frequency of the collocates with will); it is available in the input_data_raw.txt. Then, the script 2-script-create-motion-chart-input-data.R processes the input_data_raw.txt for normalising the co-occurrence frequency of the collocates per million words (the COHA size and normalising base frequency are available in coha_size.txt). The output from the second script is input_data_futurate.txt.Next, input_data_futurate.txt contains the relevant input data for generating (i) the static motion chart as an image plot in the publication (using the script 3-script-create-motion-chart-plot.R), and (ii) the dynamic motion chart (using the script 4-script-motion-chart-dynamic.R).The repository adopts the project-oriented workflow in RStudio; double-click on the Future Constructions.Rproj file to open an RStudio session whose working directory is associated with the contents of this repository.

  2. FacialRecognition

    • kaggle.com
    zip
    Updated Dec 1, 2016
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    TheNicelander (2016). FacialRecognition [Dataset]. https://www.kaggle.com/petein/facialrecognition
    Explore at:
    zip(121674455 bytes)Available download formats
    Dataset updated
    Dec 1, 2016
    Authors
    TheNicelander
    License

    http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    #https://www.kaggle.com/c/facial-keypoints-detection/details/getting-started-with-r #################################

    ###Variables for downloaded files data.dir <- ' ' train.file <- paste0(data.dir, 'training.csv') test.file <- paste0(data.dir, 'test.csv') #################################

    ###Load csv -- creates a data.frame matrix where each column can have a different type. d.train <- read.csv(train.file, stringsAsFactors = F) d.test <- read.csv(test.file, stringsAsFactors = F)

    ###In training.csv, we have 7049 rows, each one with 31 columns. ###The first 30 columns are keypoint locations, which R correctly identified as numbers. ###The last one is a string representation of the image, identified as a string.

    ###To look at samples of the data, uncomment this line:

    head(d.train)

    ###Let's save the first column as another variable, and remove it from d.train: ###d.train is our dataframe, and we want the column called Image. ###Assigning NULL to a column removes it from the dataframe

    im.train <- d.train$Image d.train$Image <- NULL #removes 'image' from the dataframe

    im.test <- d.test$Image d.test$Image <- NULL #removes 'image' from the dataframe

    ################################# #The image is represented as a series of numbers, stored as a string #Convert these strings to integers by splitting them and converting the result to integer

    #strsplit splits the string #unlist simplifies its output to a vector of strings #as.integer converts it to a vector of integers. as.integer(unlist(strsplit(im.train[1], " "))) as.integer(unlist(strsplit(im.test[1], " ")))

    ###Install and activate appropriate libraries ###The tutorial is meant for Linux and OSx, where they use a different library, so: ###Replace all instances of %dopar% with %do%.

    install.packages('foreach')

    library("foreach", lib.loc="~/R/win-library/3.3")

    ###implement parallelization im.train <- foreach(im = im.train, .combine=rbind) %do% { as.integer(unlist(strsplit(im, " "))) } im.test <- foreach(im = im.test, .combine=rbind) %do% { as.integer(unlist(strsplit(im, " "))) } #The foreach loop will evaluate the inner command for each row in im.train, and combine the results with rbind (combine by rows). #%do% instructs R to do all evaluations in parallel. #im.train is now a matrix with 7049 rows (one for each image) and 9216 columns (one for each pixel):

    ###Save all four variables in data.Rd file ###Can reload them at anytime with load('data.Rd')

    save(d.train, im.train, d.test, im.test, file='data.Rd')

    load('data.Rd')

    #each image is a vector of 96*96 pixels (96*96 = 9216). #convert these 9216 integers into a 96x96 matrix: im <- matrix(data=rev(im.train[1,]), nrow=96, ncol=96)

    #im.train[1,] returns the first row of im.train, which corresponds to the first training image. #rev reverse the resulting vector to match the interpretation of R's image function #(which expects the origin to be in the lower left corner).

    #To visualize the image we use R's image function: image(1:96, 1:96, im, col=gray((0:255)/255))

    #Let’s color the coordinates for the eyes and nose points(96-d.train$nose_tip_x[1], 96-d.train$nose_tip_y[1], col="red") points(96-d.train$left_eye_center_x[1], 96-d.train$left_eye_center_y[1], col="blue") points(96-d.train$right_eye_center_x[1], 96-d.train$right_eye_center_y[1], col="green")

    #Another good check is to see how variable is our data. #For example, where are the centers of each nose in the 7049 images? (this takes a while to run): for(i in 1:nrow(d.train)) { points(96-d.train$nose_tip_x[i], 96-d.train$nose_tip_y[i], col="red") }

    #there are quite a few outliers -- they could be labeling errors. Looking at one extreme example we get this: #In this case there's no labeling error, but this shows that not all faces are centralized idx <- which.max(d.train$nose_tip_x) im <- matrix(data=rev(im.train[idx,]), nrow=96, ncol=96) image(1:96, 1:96, im, col=gray((0:255)/255)) points(96-d.train$nose_tip_x[idx], 96-d.train$nose_tip_y[idx], col="red")

    #One of the simplest things to try is to compute the mean of the coordinates of each keypoint in the training set and use that as a prediction for all images colMeans(d.train, na.rm=T)

    #To build a submission file we need to apply these computed coordinates to the test instances: p <- matrix(data=colMeans(d.train, na.rm=T), nrow=nrow(d.test), ncol=ncol(d.train), byrow=T) colnames(p) <- names(d.train) predictions <- data.frame(ImageId = 1:nrow(d.test), p) head(predictions)

    #The expected submission format has one one keypoint per row, but we can easily get that with the help of the reshape2 library:

    install.packages('reshape2')

    library(...

  3. Google Data Analytics Case Study Cyclistic

    • kaggle.com
    zip
    Updated Sep 27, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Udayakumar19 (2022). Google Data Analytics Case Study Cyclistic [Dataset]. https://www.kaggle.com/datasets/udayakumar19/google-data-analytics-case-study-cyclistic/suggestions
    Explore at:
    zip(1299 bytes)Available download formats
    Dataset updated
    Sep 27, 2022
    Authors
    Udayakumar19
    Description

    Introduction

    Welcome to the Cyclistic bike-share analysis case study! In this case study, you will perform many real-world tasks of a junior data analyst. You will work for a fictional company, Cyclistic, and meet different characters and team members. In order to answer the key business questions, you will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act. Along the way, the Case Study Roadmap tables — including guiding questions and key tasks — will help you stay on the right path.

    Scenario

    You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

    Ask

    How do annual members and casual riders use Cyclistic bikes differently?

    Guiding Question:

    What is the problem you are trying to solve?
      How do annual members and casual riders use Cyclistic bikes differently?
    How can your insights drive business decisions?
      The insight will help the marketing team to make a strategy for casual riders
    

    Prepare

    Guiding Question:

    Where is your data located?
      Data located in Cyclistic organization data.
    
    How is data organized?
      Dataset are in csv format for each month wise from Financial year 22.
    
    Are there issues with bias or credibility in this data? Does your data ROCCC? 
      It is good it is ROCCC because data collected in from Cyclistic organization.
    
    How are you addressing licensing, privacy, security, and accessibility?
      The company has their own license over the dataset. Dataset does not have any personal information about the riders.
    
    How did you verify the data’s integrity?
      All the files have consistent columns and each column has the correct type of data.
    
    How does it help you answer your questions?
      Insights always hidden in the data. We have the interpret with data to find the insights.
    
    Are there any problems with the data?
      Yes, starting station names, ending station names have null values.
    

    Process

    Guiding Question:

    What tools are you choosing and why?
      I used R studio for the cleaning and transforming the data for analysis phase because of large dataset and to gather experience in the language.
    
    Have you ensured the data’s integrity?
     Yes, the data is consistent throughout the columns.
    
    What steps have you taken to ensure that your data is clean?
      First duplicates, null values are removed then added new columns for analysis.
    
    How can you verify that your data is clean and ready to analyze? 
     Make sure the column names are consistent thorough out all data sets by using the “bind row” function.
    
    Make sure column data types are consistent throughout all the dataset by using the “compare_df_col” from the “janitor” package.
    Combine the all dataset into single data frame to make consistent throught the analysis.
    Removed the column start_lat, start_lng, end_lat, end_lng from the dataframe because those columns not required for analysis.
    Create new columns day, date, month, year, from the started_at column this will provide additional opportunities to aggregate the data
    Create the “ride_length” column from the started_at and ended_at column to find the average duration of the ride by the riders.
    Removed the null rows from the dataset by using the “na.omit function”
    Have you documented your cleaning process so you can review and share those results? 
      Yes, the cleaning process is documented clearly.
    

    Analyze Phase:

    Guiding Questions:

    How should you organize your data to perform analysis on it? The data has been organized in one single dataframe by using the read csv function in R Has your data been properly formatted? Yes, all the columns have their correct data type.

    What surprises did you discover in the data?
      Casual member ride duration is higher than the annual members
      Causal member widely uses docked bike than the annual members
    What trends or relationships did you find in the data?
      Annual members are used mainly for commute purpose
      Casual member are preferred the docked bikes
      Annual members are preferred the electric or classic bikes
    How will these insights help answer your business questions?
      This insights helps to build a profile for members
    

    Share

    Guiding Quesions:

    Were you able to answer the question of how ...
    
  4. Time Series Forecasting Using Prophet in R

    • kaggle.com
    zip
    Updated Jul 25, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    vikram amin (2023). Time Series Forecasting Using Prophet in R [Dataset]. https://www.kaggle.com/datasets/vikramamin/time-series-forecasting-using-prophet-in-r
    Explore at:
    zip(9000 bytes)Available download formats
    Dataset updated
    Jul 25, 2023
    Authors
    vikram amin
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description
    • Main objective : To forecast the page visits of a website
    • Tool : Time Series Forecasting using Prophet in R.
    • Steps:
    • Read the data
    • Data Cleaning: Checking data types, date formats and missing data https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F56d7b1edf4f51157804e81b02c032e4d%2FPicture1.png?generation=1690271521103777&alt=media" alt="">
    • Run libraries (dplyr, ggplot2, tidyverse, lubridate, prophet, forecast)
    • Change the Date column from character vector to date and change data format using lubridate package
    • Rename the column "Date" to "ds" and "Visits" to "y".
    • Treat "Christmas" and "Black.Friday" as holiday events. As the data ranges from 2016 to 2020, there will be 5 Christmas and 5 Black Friday days.
    • We will look at the impact of Christmas 3 days prior and 3 days later from Christmas date on "Visits" and 3 days prior and 1 day later for Black Friday
    • We create two data frames called Christmas and Black.Friday and merge the two into a data frame called "holidays". https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2Fd07b366be2050fefe6a62563b6abac0c%2FPicture2.png?generation=1690272066356516&alt=media" alt="">
    • We create train and test data. In train data & test data, we select only 3 variables namely ds, y , Easter. In train data, ds contains data before 2020-12-01 and test data contains data equal to and after 2020-12-01 (31 days) data
    • Train Data
    • https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F8f3f58fe40b29b276bb7103cb1dfdde1%2FPicture3.png?generation=1690272272038405&alt=media" alt="">
    • Test Data
    • https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2Fb4362117f46aeb210dad23f07d3ecb39%2FPicture4.png?generation=1690272400355824&alt=media" alt="">
    • Use prophet model which will include multiple parameter. We are going with the default parameters. Thereafter, we add the external regressor "Easter".
    • https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F7325be63d887372cc5764ddf29a94310%2FPicture5.png?generation=1690272892963939&alt=media" alt="">
    • We create the future data frame for forecasting and name the data frame "future". It will include "m" and 31 days of the test data. We then predict this future data frame and create a new data frame called "forecast".
    • Forecast data frame consists of 1827 rows and 34 variables. This shows the external Regressor (Easter) value is 0 through the entire time period. This shows that "Easter" has no impact or effect on "Visits".
    • yhat stands for the predicted value (predicted visits).
    • https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2Fae5c9414d1b1bbb2670b372a326970a5%2FPicture6.png?generation=1690273558489681&alt=media" alt="">
    • We try to understand the impact of Holiday events "Christmas" and "Black.Friday"
    • https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F5a36cc5308f9e46f0b63fa8e37c4b932%2FPicture7.png?generation=1690273814760538&alt=media" alt="">
    • https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F8cc3dd0581db1e8b9d542d9a524abd39%2FPicture8.png?generation=1690273879506571&alt=media" alt="">
    • We plot the forecast.
    • plot(m,forecast) https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2Fa7968ff05abdd5b4e789f3723b41c4ed%2FPicture9.png?generation=1690274020880594&alt=media" alt="">
    • blue is predicted value(yhat) and black is actual value(y) and blue shaded regions are the yhat_upper and yhat_lower values
    • prophet_plot_components(m,forecast) https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F52408afb8c71118ef6729420085875e8%2FPicture10.png?generation=1690274184325240&alt=media" alt="">
    • Trend indicates that the page visits remained constant from Jan'16 to Mid'17 and thereafter there was an upswing from Mid'19 to End of 2020
    • From Holidays, we can make out that Christmas had a negative effect on page visits whereas Black Friday had a positive effect on page visits
    • Weekly seasonality indicates that page visits tend to remain the highest from Monday to Thursday and starts going down thereafter
    • Yearly seasonality indicates that page visits are the highest in Apr and then starts going down thereafter with
    • Oct having reaching the bottom point
    • External regressor "Easter" has no impact on page visits
    • plot(m,forecast) + add_changepoints_to_plot(m)
    • https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F1253a0e381ae04d3156a4b098dafb2ca%2FPicture11.png?generation=1690274373570449&alt=media" alt="">
    • Trend which is indicated by the red line starts moving upwards from Mid 2019 to 2020 onwards
    • We check for acc...
  5. Rcode – Custom code written the R programming language that will translate...

    • plos.figshare.com
    txt
    Updated Nov 19, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Anthony Nearman; Alriana Buller-Jarrett; Dawn Boncristiani; Eugene Ryabov; Yanping Chen; Jay D. Evans (2025). Rcode – Custom code written the R programming language that will translate an open reading frame for an existing sequence, then compare it to a data frame of nucleotide polymorphisms at specific locations, and retranslate the amino acid changes into a new data frame. [Dataset]. http://doi.org/10.1371/journal.pone.0337191.s009
    Explore at:
    txtAvailable download formats
    Dataset updated
    Nov 19, 2025
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Anthony Nearman; Alriana Buller-Jarrett; Dawn Boncristiani; Eugene Ryabov; Yanping Chen; Jay D. Evans
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Rcode – Custom code written the R programming language that will translate an open reading frame for an existing sequence, then compare it to a data frame of nucleotide polymorphisms at specific locations, and retranslate the amino acid changes into a new data frame.

  6. Kickastarter Campaigns

    • kaggle.com
    zip
    Updated Jan 25, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Alessio Cantara (2024). Kickastarter Campaigns [Dataset]. https://www.kaggle.com/datasets/alessiocantara/kickastarter-project/discussion
    Explore at:
    zip(2233314 bytes)Available download formats
    Dataset updated
    Jan 25, 2024
    Authors
    Alessio Cantara
    Description

    Welcome to my Kickstarter case study! In this project I’m trying to understand what the success’s factors for a Kickstarter campaign are, analyzing an available public dataset from Web Robots. The process of analysis will follow the data analysis roadmap: ASK, PREPARE, PROCESS, ANALYZE, SHARE and ACT.

    ASK

    Different questions will guide my analysis: 1. Is the campaign duration influencing the success of the project? 2. Is it the chosen funding budget? 3. Which category of campaign is the most likely to be successful?

    PREPARE

    I’m using the Kickstarter Datasets publicly available on Web Robots. Data are scraped using a bot which collects the data in CSV format once a month and all the data are divided into CSV files. Each table contains: - backers_count : number of people that contributed to the campaign - blurb : a captivating text description of the project - category : the label categorizing the campaign (technology, art, etc) - country - created_at : day and time of campaign creation - deadline : day and time of campaign max end - goal : amount to be collected - launched_at : date and time of campaign launch - name : name of campaign - pledged : amount of money collected - state : success or failure of the campaign

    Each month scraping produce a huge amount of CSVs, so for an initial analysis I decided to focus on three months: November and December 2023, and January 2024. I’ve downloaded zipped files which once unzipped contained respectively: 7 CSVs (November 2023), 8 CSVs (December 2023), 8 CSVs (January 2024). Each month was divided into a specific folder.

    Having a first look at the spreadsheets, it’s clear that there is some need for cleaning and modification: for example, dates and times are shown in Unix code, there are multiple columns that are not helpful for the scope of my analysis, currencies need to be uniformed (some are US$, some GB£, etc). In general, I have all the data that I need to answer my initial questions, identify trends, and make predictions.

    PROCESS

    I decided to use R to clean and process the data. For each month I started setting a new working environment in its own folder. After loading the necessary libraries: R library(tidyverse) library(lubridate) library(ggplot2) library(dplyr) library(tidyr) I scripted a general R code that searches for CSVs files in the folder, open them as separate variable and into a single data frame:

    csv_files <- list.files(pattern = "\\.csv$")
    data_frames <- list()
    
    for (file in csv_files) {
     variable_name <- sub("\\.csv$", "", file)
     assign(variable_name, read.csv(file))
     data_frames[[variable_name]] <- get(variable_name)
    }
    

    Next, I converted some columns in numeric values because I was running into types error when trying to merge all the CSVs into a single comprehensive file.

    data_frames <- lapply(data_frames, function(df) {
     df$converted_pledged_amount <- as.numeric(df$converted_pledged_amount)
     return(df)
    })
    data_frames <- lapply(data_frames, function(df) {
     df$usd_exchange_rate <- as.numeric(df$usd_exchange_rate)
     return(df)
    })
    data_frames <- lapply(data_frames, function(df) {
     df$usd_pledged <- as.numeric(df$usd_pledged)
     return(df)
    })
    

    In each folder I then ran a command to merge the CSVs in a single file (one for November 2023, one for December 2023 and one for January 2024):

    all_nov_2023 = bind_rows(data_frames)
    all_dec_2023 = bind_rows(data_frames)
    all_jan_2024 = bind_rows(data_frames)`
    

    After merging I converted the UNIX code datestamp into a readable datetime for the columns “created”, “launched”, “deadline” and deleted all the columns that had these data set to 0. I also filtered the values into the “slug” columns to show only the category of the campaign, without unnecessary information for the scope of my analysis. The final table was then saved.

    filtered_dec_2023 <- all_dec_2023 %>% #this was modified according to the considered month
     select(blurb, backers_count, category, country, created_at, launched_at, deadline,currency, usd_exchange_rate, goal, pledged, state) %>%
     filter(created_at != 0 & deadline != 0 & launched_at != 0) %>% 
     mutate(category_slug = sub('.*?"slug":"(.*?)".*', '\\1', category)) %>% 
     mutate(created = as.POSIXct(created_at, origin = "1970-01-01")) %>% 
     mutate(launched = as.POSIXct(launched_at, origin = "1970-01-01")) %>% 
     mutate(setted_deadline = as.POSIXct(deadline, origin = "1970-01-01")) %>% 
     select(-category, -deadline, -launched_at, -created_at) %>% 
     relocate(created, launched, setted_deadline, .before = goal)
    
    write.csv(filtered_dec_2023, "filtered_dec_2023.csv", row.names = FALSE)
    
    

    The three generated files were then merged into one comprehensive CSV called "kickstarter_cleaned" which was further modified, converting a...

  7. Data for analysis in Barrie et al. (2025)

    • figshare.com
    csv
    Updated May 21, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Eleanor Barrie; Luke L. Powell; Billi Krochuk; Patricia F Rodrigues; Jared D Wolfe; Crinan Jarrett; Diogo F Ferreira; Kristin E Brzeski; Jacob C Cooper; Susana Lin Mufumu; Silvestre Esteban Malanza; Agustin Ebana Nsue Akele; Cayetano Ebana Ebana Alene (2025). Data for analysis in Barrie et al. (2025) [Dataset]. http://doi.org/10.6084/m9.figshare.29114960.v1
    Explore at:
    csvAvailable download formats
    Dataset updated
    May 21, 2025
    Dataset provided by
    figshare
    Figsharehttp://figshare.com/
    Authors
    Eleanor Barrie; Luke L. Powell; Billi Krochuk; Patricia F Rodrigues; Jared D Wolfe; Crinan Jarrett; Diogo F Ferreira; Kristin E Brzeski; Jacob C Cooper; Susana Lin Mufumu; Silvestre Esteban Malanza; Agustin Ebana Nsue Akele; Cayetano Ebana Ebana Alene
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Barrie
    Description

    These files contain the data used for analysis in: Barrie EM, Krochuk BA, Jarrett C, Ferreira DF, Rodrigues P, Mufumu SL, Malanza SE, Akele AEN, Alene CEE, Brzeski KE, Cooper JC, Wolfe JD and Powell LL (2025) Specialized insectivores drive differences in avian community composition between primary and secondary forest in Central Africa. Front. Conserv. Sci. 6:1504350. doi: 10.3389/fcosc.2025.1504350At a long-term bird banding station on mainland Equatorial Guinea, we captured over 3200 birds across 6 field seasons in selectively logged secondary forest and in largely undisturbed primary forest. Our objective was to understand how community composition changed with human disturbance—with particular interest in the guilds and species that indicate primary rainforest.banding_data.csv consists of the raw banding/capture data from mist-netting and ringing in the field, including info on time and date of capture, net lane and net number, species, ring number, and recaptures.buffers.csv lists (for each net lane) the amount of overlap with other nearby net lanes and the proportion used for the offset in statistical analysis. See Barrie et al. (2025) for methodology.days.csv lists all combinations of net lanes and dates run and whether these were "Day 1" or "Day 2" (all net lanes were run for two consecutive days per year.effort.csv contains data on effort in terms of mist net hours, with the opening and closing times and duration open for every net run.forest_type.csv lists each net lane and whether it was in primary or secondary forestguilds.csv contains data on the dietary guild classifications of all focal species analysed in Barrie et al. (2025), which is needed to merge with banding_data.csv in R and create the data frame for analysis

  8. Z

    Data from: Code and Data Schimmelradar manuscript 1.1

    • data-staging.niaid.nih.gov
    Updated Apr 3, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Kortenbosch, Hylke (2025). Code and Data Schimmelradar manuscript 1.1 [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_14851614
    Explore at:
    Dataset updated
    Apr 3, 2025
    Dataset provided by
    Wageningen University & Research
    Authors
    Kortenbosch, Hylke
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Read me – Schimmelradar manuscript

    The code in this repository was written to analyse the data and generate figures for the manuscript “Land use drives spatial structure of drug resistance in a fungal pathogen”.

    This repository consists of two original .csv raw data files, 2 .tif files that are minimally reformatted after being downloaded from LGN.nl and www.pdok.nl/introductie/-/article/basisregistratie-gewaspercelen-brp-, and 9 scripts using the R language. The remaining files include intermediate .tif and .csv files to skip more computationally heavy steps of the analysis and facilitate the reproduction of the analysis.

    Data files:§1

    Schimmelradar_360_submission.csv: The raw phenotypic resistance spatial data from the air sample

    • Sample: an arbitrary sample code given to each of the participants

    • Area: A random number assigned to each of the 100 areas the Netherlands was split up into to facilitate an even spread of samples across the country during participant selection.

    • Logistics status: Variable used to indicate whether the sample was returned in good order, not otherwise used in the analysis.

    • Arrived back on: The date by which the sample arrived back at Wageningen University

    • Quality seals: quality of the seals upon sample return, only samples of a quality designated as good seals were processed. (also see Supplement file – section A).

    • Start sampling: The date on which the trap was deployed and the stickers exposed to the air, recorded by the participant

    • End sampling: The date on which the trap was taken down and the stickers were re-covered and no longer exposed to the air, recorded by the participant

    • 3 back in area?: Binary indicating whether at least three samples have been returned in the respective area (see Area)

    • Batch: The date on which processing of the sample was started. To be more specific, the date at which Flamingo medium was poured over the seals of the sample and incubation was started.

    • Lab processing: Binary indication completion of lab processing

    • Tot ITR: A. fumigatus CFU count in the permissive layer of the itraconazole-treated plate

    • RES ITR: CFU count of colonies that had breached the surface of the itraconazole-treated layer after incubation and were visually (with the unaided eye) sporulating.

    • RF ITR: The itraconazole (~4 mg/L) resistance fraction = RES ITR/Tot ITR

    • Muccor ITR: Indication of the presence of Mucorales spp. growth on the itraconazole treatment plate

    • Tot VOR: A. fumigatus CFU count in the permissive layer of the voriconazole-treated plate

    • RES VOR: CFU count of colonies that had breached the surface of the voriconazole-treated layer after incubation and were visually (with the unaided eye) sporulating.

    • RF VOR: The voriconazole (~2 mg/L) resistance fraction = RES VOR/Tot VOR

    • Muccor VOR: Indication of the presence of Mucorales spp. growth on the voriconazole treatment plate

    • Tot CON: CFU count on the untreated growth control plate Note: note on the sample based on either information given by the participant or observations in the lab. The exclude label was given if the sample had either too little (<25) or too many (>300) CFUs on one or more of the plates (also see Supplement file – section A).

    • Lat: Exact latitude of the address where the sample was taken. Not used in the published version of the code and hidden for privacy reasons.

    • Long: Exact longitude of the address where the sample was taken. Not used in the published version of the code and hidden for privacy reasons.

    • Round_Lat: Rounded latitude of the address where the sample was taken. Rounded down to two decimals (the equivalent of a 1 km2 area), so they could not be linked to a specific address. Used in the published version of the code.

    • Round_Long: Rounded longitude of the address where the sample was taken. Rounded down to two decimals (the equivalent of a 1 km2 area), so they could not be linked to a specific address. Used in the published version of the code.

    Analysis_genotypic_schimmelradar_TR_types.csv: The genotype data inferred from gel electrophoresis for all resistant isolates

    • TR type: Indicates the length of the tandem repeats in bp, as judged from a gel. 34 bp, 46 bp, or multiples of 46.

    • Plate: 96-well plate on which the isolate was cultured

    • 96-well: well in which the isolate was cultured

    • Azole: Azole on which the isolate was grown and resistant to. Itraconazole (ITRA) or Voriconazole (VORI).

    • Sample: The air sample the isolate was taken from, corresponds to “Sample” in “Schimmelradar_360_submission.csv”.

    • Strata: The number that equates to “Area” in “Schimmelradar_360_submission.csv”.

    • WT: A binary that indicates whether an isolate had a wildtype cyp51a promotor.

    • TR34: A binary that indicates whether an isolate had a TR34 cyp51a promotor.

    • TR46: A binary that indicates whether an isolate had a TR46 cyp51a promotor.

    • TR46_3: A binary that indicates whether an isolate had a TR46*3 cyp51a promotor.

    • TR46_4: A binary that indicates whether an isolate had a TR46*4 cyp51a promotor.

    Script 1 - generation_100_equisized_areas_NL

    NOTE: Running this code is not necessary for the other analyses, it was used primarily for sample selection. The area distribution was used during the analysis in script 2B, yet each sample is already linked to an area in “Schimmelradar_360_submission.csv". This script was written to generate a spatial polygons data frame of 100 equisized areas of the Netherlands. The registrations for the citizen science project Schimmelradar were binned into these areas to facilitate a relatively even distribution of samples throughout the country which can be seen in Figure S1. The spatial polygons data frame can be opened and displayed in open-source software such as QGIS. The package “spcosa” used to generate the areas has RJava as a dependency, so having Java installed is required to run this script. The R script uses a shapefile of the Netherlands from the tmap package to generate the areas within the Netherlands. Generating a similar distribution for other countries will require different shape files!

    Script 2 - Spatial_data_integration_fungalradar

    This script produces 4 data files that describe land use in the Netherlands: The three focal.RData files with land use and resistant/colony counts, as well as the “Predictor_raster_NL.tif” land use file.

    In this script, both the phenotypic and genotypic resistance spatial data from the air samples taken during the Fungal radar citizen science project are integrated with the land use and weather data used to model them. It is not recommended to run this code because the data extraction is fairly computationally demanding and it does not itself contain key statistical analyses. Rather it is used to generate the objects used for modelling and spatial predictions that are also included in this repository.

    The phenotypic resistance is summarised in Table 1, which is generated in this script. Subsequently, the spatial data from the LNG22 and BRP datasets are integrated into the data. These dataset can be loaded from the "LGN2022.tif" and "Gewas22rast.tiff" raster files, respectively. Link to webpages where these files can be downloaded can found in the code.

    Once the raster files are loaded, the code generates heatmaps and calculates the proportions of all the land use classes in both a 5 and 10-km radius around every sample and across the country to make spatial predictions. Only the 10 km radius data are used in the later analysis, but the 5 km radius was generated to test if that radius would be more appropriate, during an earlier stage of the analyses, and was left in for completeness. For documentation of the LGN22 data set, we refer to https://lgn.nl/documentatie and for BRP to https://nationaalgeoregister.nl/geonetwork/srv/dut/catalog.search#/metadata/44e6d4d3-8fc5-47d6-8712-33dd6d244eef, both of these online resources are in Dutch but can be readily translated. A list of the variables that were included from these datasets during model selection can be found in Table S3. Alongside land-use data, the code extracts weather data from datafiles that can be downloaded from https://cds.climate.copernicus.eu/datasets/sis-agrometeorological-indicators?tab=download for the Netherlands during the sampling window, dates and dimensions are listed within the code. The Weather_schimmelradar folder contains four subfolders for each weather variable that was considered during modelling: temperature, wind speed, precipitation and humidity. Each of these subfolders contains 44 .nc files that each cover the daily mean of the respective weather variable across the Netherlands for each of the 44 days of the sampling window the citizen scientists were given.

    All spatial objects weather + land use are eventually merged into one predictor raster "Predictor_raster_NL.tif". The land use fractions and weather data are subsequently integrated with the air sample data into a single spatial data frame along with the resistance data and saved into an R object "Schimmelradar360spat_focal.RData". The script concludes by merging the cyp51a haplotype data with this object as well, to create two different objects: "Schimmelradar360spat_focal_TR_VORI.RData" for the haplotype data of the voriconazole resistant isolates and "Schimmelradar360spat_focal_TR_ITRA.RData" including the haplotype data of itraconazole resistant isolates. These two datasets are modeled separately in scripts 5,9 and 6,8, respectively. This final section of the script also generates summary table S2, which summarises the frequency of the cyp51a haplotypes per azole treatment.

    If the relevant objects are loaded

  9. Z

    VISIONE Feature Repository for VBS: Multi-Modal Features and Detected...

    • data.niaid.nih.gov
    • zenodo.org
    • +1more
    Updated Feb 12, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Giuseppe Amato; Paolo Bolettieri; Fabio Carrara; Fabrizio Falchi; Claudio Gennaro; Nicola Messina; Lucia Vadicamo; Claudio Vairo (2024). VISIONE Feature Repository for VBS: Multi-Modal Features and Detected Objects from V3C1+V3C2 Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8188569
    Explore at:
    Dataset updated
    Feb 12, 2024
    Dataset provided by
    CNR-ISTI
    Authors
    Giuseppe Amato; Paolo Bolettieri; Fabio Carrara; Fabrizio Falchi; Claudio Gennaro; Nicola Messina; Lucia Vadicamo; Claudio Vairo
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains a diverse set of features extracted from the V3C1+V3C2 dataset, sourced from the Vimeo Creative Commons Collection. These features were utilized in the VISIONE system [Amato et al. 2023, Amato et al. 2022] during the latest editions of the Video Browser Showdown (VBS) competition (https://www.videobrowsershowdown.org/).

    The original V3C1+V3C2 dataset, provided by NIST, can be downloaded using the instructions provided at https://videobrowsershowdown.org/about-vbs/existing-data-and-tools/.

    It comprises 7,235 video files, amounting for 2,300h of video content and encompassing 2,508,113 predefined video segments.

    We subdivided the predefined video segments longer than 10 seconds into multiple segments, with each segment spanning no longer than 16 seconds. As a result, we obtained a total of 2,648,219 segments. For each segment, we extracted one frame, specifically the middle one, and computed several features, which are described in detail below.

    This repository is released under a Creative Commons Attribution license. If you use it in any form for your work, please cite the following paper:

    @inproceedings{amato2023visione, title={VISIONE at Video Browser Showdown 2023}, author={Amato, Giuseppe and Bolettieri, Paolo and Carrara, Fabio and Falchi, Fabrizio and Gennaro, Claudio and Messina, Nicola and Vadicamo, Lucia and Vairo, Claudio}, booktitle={International Conference on Multimedia Modeling}, pages={615--621}, year={2023}, organization={Springer} }

    This repository comprises the following files:

    msb.tar.gz contains tab-separated files (.tsv) for each video. Each tsv file reports, for each video segment, the timestamp and frame number marking the start/end of the video segment, along with the timestamp of the extracted middle frame and the associated identifier ("id_visione").

    extract-keyframes-from-msb.tar.gz contains a Python script designed to extract the middle frame of each video segment from the MSB files. To run the script successfully, please ensure that you have the original V3C videos available.

    features-aladin.tar.gz† contains ALADIN [Messina N. et al. 2022] features extracted for all the segment's middle frames.

    features-clip-laion.tar.gz† contains CLIP ViT-H/14 - LAION-2B [Schuhmann et al. 2022] features extracted for all the segment's middle frames.

    features-clip-openai.tar.gz† contains CLIP ViT-L/14 [Radford et al. 2021] features extracted for all the segment's middle frames.

    features-clip2video.tar.gz† contains CLIP2Video [Fang H. et al. 2021] extracted for all the video segments. In particular 1) we concatenate consecutive short segments so to create segments at least 3 seconds long; 2) we downsample the obtained segments to 2.5 fps; 3) we feed the network with the first min(36, n) frames, where n is the number of frames of the segment. Notice that the minimum processed length consists of 7 frames, given that the segment is no shorter than 3s.

    objects-frcnn-oiv4.tar.gz* contains the objects detected using Faster R-CNN+Inception ResNet (trained on the Open Images V4 [Kuznetsova et al. 2020]).

    objects-mrcnn-lvis.tar.gz* contains the objects detected using Mask R-CNN He et al. 2017.

    objects-vfnet64-coco.tar.gz* contains the objects detected using VfNet Zhang et al. 2021.

    *Please be sure to use the v2 version of this repository, since v1 feature files may contain inconsistencies that have now been corrected

    *Note on the object annotations: Within an object archive, there is a jsonl file for each video, where each row contains a record of a video segment (the "_id" corresponds to the "id_visione" used in the msb.tar.gz) . Additionally, there are three arrays representing the objects detected, the corresponding scores, and the bounding boxes. The format of these arrays is as follows:

    "object_class_names": vector with the class name of each detected object.

    "object_scores": scores corresponding to each detected object.

    "object_boxes_yxyx": bounding boxes of the detected objects in the format (ymin, xmin, ymax, xmax).

    †Note on the cross-modal features: The extracted multi-modal features (ALADIN, CLIPs, CLIP2Video) enable internal searches within the V3C1+V3C2 dataset using the query-by-image approach (features can be compared with the dot product). However, to perform searches based on free text, the text needs to be transformed into the joint embedding space according to the specific network being used. Please be aware that the service for transforming text into features is not provided within this repository and should be developed independently using the original feature repositories linked above.

    We have plans to release the code in the future, allowing the reproduction of the VISIONE system, including the instantiation of all the services to transform text into cross-modal features. However, this work is still in progress, and the code is not currently available.

    References:

    [Amato et al. 2023] Amato, G.et al., 2023, January. VISIONE at Video Browser Showdown 2023. In International Conference on Multimedia Modeling (pp. 615-621). Cham: Springer International Publishing.

    [Amato et al. 2022] Amato, G. et al. (2022). VISIONE at Video Browser Showdown 2022. In: , et al. MultiMedia Modeling. MMM 2022. Lecture Notes in Computer Science, vol 13142. Springer, Cham.

    [Fang H. et al. 2021] Fang H. et al., 2021. Clip2video: Mastering video-text retrieval via image clip. arXiv preprint arXiv:2106.11097.

    [He et al. 2017] He, K., Gkioxari, G., Dollár, P. and Girshick, R., 2017. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision (pp. 2961-2969).

    [Kuznetsova et al. 2020] Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M., Kolesnikov, A. and Duerig, T., 2020. The open images dataset v4. International Journal of Computer Vision, 128(7), pp.1956-1981.

    [Lin et al. 2014] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P. and Zitnick, C.L., 2014, September. Microsoft coco: Common objects in context. In European conference on computer vision (pp. 740-755). Springer, Cham.

    [Messina et al. 2022] Messina N. et al., 2022, September. Aladin: distilling fine-grained alignment scores for efficient image-text matching and retrieval. In Proceedings of the 19th International Conference on Content-based Multimedia Indexing (pp. 64-70).

    [Radford et al. 2021] Radford A. et al., 2021, July. Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748-8763). PMLR.

    [Schuhmann et al. 2022] Schuhmann C. et al., 2022. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35, pp.25278-25294.

    [Zhang et al. 2021] Zhang, H., Wang, Y., Dayoub, F. and Sunderhauf, N., 2021. Varifocalnet: An iou-aware dense object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8514-8523).

  10. Not seeing a result you expected?
    Learn how you can add new datasets to our index.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Gede Primahadi Wijaya Rajeg (2023). R codes and dataset for Visualisation of Diachronic Constructional Change using Motion Chart [Dataset]. http://doi.org/10.26180/5c844c7a81768

R codes and dataset for Visualisation of Diachronic Constructional Change using Motion Chart

Explore at:
zipAvailable download formats
Dataset updated
May 30, 2023
Dataset provided by
Monash University
Authors
Gede Primahadi Wijaya Rajeg
License

Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically

Description

PublicationPrimahadi Wijaya R., Gede. 2014. Visualisation of diachronic constructional change using Motion Chart. In Zane Goebel, J. Herudjati Purwoko, Suharno, M. Suryadi & Yusuf Al Aried (eds.). Proceedings: International Seminar on Language Maintenance and Shift IV (LAMAS IV), 267-270. Semarang: Universitas Diponegoro. doi: https://doi.org/10.4225/03/58f5c23dd8387Description of R codes and data files in the repositoryThis repository is imported from its GitHub repo. Versioning of this figshare repository is associated with the GitHub repo's Release. So, check the Releases page for updates (the next version is to include the unified version of the codes in the first release with the tidyverse).The raw input data consists of two files (i.e. will_INF.txt and go_INF.txt). They represent the co-occurrence frequency of top-200 infinitival collocates for will and be going to respectively across the twenty decades of Corpus of Historical American English (from the 1810s to the 2000s).These two input files are used in the R code file 1-script-create-input-data-raw.r. The codes preprocess and combine the two files into a long format data frame consisting of the following columns: (i) decade, (ii) coll (for "collocate"), (iii) BE going to (for frequency of the collocates with be going to) and (iv) will (for frequency of the collocates with will); it is available in the input_data_raw.txt. Then, the script 2-script-create-motion-chart-input-data.R processes the input_data_raw.txt for normalising the co-occurrence frequency of the collocates per million words (the COHA size and normalising base frequency are available in coha_size.txt). The output from the second script is input_data_futurate.txt.Next, input_data_futurate.txt contains the relevant input data for generating (i) the static motion chart as an image plot in the publication (using the script 3-script-create-motion-chart-plot.R), and (ii) the dynamic motion chart (using the script 4-script-motion-chart-dynamic.R).The repository adopts the project-oriented workflow in RStudio; double-click on the Future Constructions.Rproj file to open an RStudio session whose working directory is associated with the contents of this repository.

Search
Clear search
Close search
Google apps
Main menu