License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”
A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies, an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org
Please cite this when using the dataset.
Detailed description of the dataset:
1 Film Dataset: Festival Programs
The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.
The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.
The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
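As an illustration, the two tables can be loaded and cross-checked in R (a minimal sketch; the file paths and the name of the ID column are assumptions, not taken from the codebook):
library(utils)
films_long <- read.csv("1_film-dataset_festival-program_long.csv", stringsAsFactors = FALSE)
films_wide <- read.csv("1_film-dataset_festival-program_wide.csv", stringsAsFactors = FALSE)
nrow(films_wide)               # expected: 9,348 unique films
length(unique(films_long$ID))  # assumed ID column; should match the number of rows in the wide table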
2 Survey Dataset
The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.
The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.
The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.
3 IMDb & Scripts
The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.
The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.
The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.
The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.
The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.
The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to one crew member of a given film. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.
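For context, first-name-based gender assignment of this kind can be reproduced with the public genderize.io API (which the GenderizeR package wraps); a minimal sketch, not the authors' original code, using a made-up name:
library(httr)
library(jsonlite)
resp <- GET("https://api.genderize.io", query = list(name = "agnes"))  # hypothetical first name
info <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
info$gender       # predicted binary gender
info$probability  # confidence of the prediction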
The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.
The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.
The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, TV, DVD/Blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.
The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.
The dataset includes 8 text files containing the scripts for web scraping. They were written using R 3.6.3 for Windows.
The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.
The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then, if no matches are found in the advanced search, using an alternative title and a basic search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records on the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with the suggested films on the IMDb search page. Matching was done on directors, production year (+/- one year), and title, using a fuzzy matching approach with two methods, “cosine” and “osa”: the cosine similarity is used to match titles with a high degree of similarity, and the OSA (optimal string alignment) algorithm is used to match titles that may contain typos or minor variations.
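For illustration, this kind of fuzzy title matching can be sketched with the stringdist package (the titles below are made up; this is not the original matching code):
library(stringdist)
core_title <- "The Wind Rises"
imdb_titles <- c("The Wind Rises", "The Winds Rise", "Rising Wind")
stringsim(core_title, imdb_titles, method = "cosine", q = 2)  # cosine similarity on character q-grams
stringsim(core_title, imdb_titles, method = "osa")            # OSA, tolerant of typos and small variations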
The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.
The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.
The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does so only for the first 100 films, to check that everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.
The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.
The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.
The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the number of missing values and errors.
4 Festival Library Dataset
The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.
The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories,
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Clinical Practice Research Datalink (CPRD) is a large and widely used resource of electronic health records from the UK, linking primary care data to hospital data, death registration data, cancer registry data, deprivation data and mental health services data. Extraction and management of CPRD data is a computationally demanding process and requires a significant amount of work, in particular when using R. The rcprd package simplifies the process of extracting and processing CPRD data in order to build datasets ready for statistical analysis. Raw CPRD data is provided in thousands of .txt files, making querying this data cumbersome and inefficient. rcprd saves the relevant information into an SQLite database stored on the hard drive, which can then be queried efficiently to extract required information about individuals. rcprd follows a four-stage process: 1) Definition of a cohort, 2) Read in medical/prescription data and save into an SQLite database, 3) Query this SQLite database for specific codes and tests to create variables for each individual in the cohort, 4) Combine extracted variables into a dataset ready for statistical analysis. Functions are available to extract common variable types (e.g., history of a condition, or time until an event occurs, relative to an index date), and more general functions for database queries, allowing users to define their own variables for extraction. The entire process can be done from within R, with no knowledge of SQL required. This manuscript showcases the functionality of rcprd by running through an example using simulated CPRD Aurum data. rcprd will reduce the duplication of time and effort among those using CPRD data for research, allowing more time to be focused on other aspects of research projects.
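The general save-to-SQLite-and-query pattern that rcprd automates can be illustrated with plain DBI/RSQLite code (a rough sketch only; these are not rcprd functions, and the file and column names are assumptions modelled on CPRD Aurum):
library(DBI)
library(RSQLite)
con <- dbConnect(RSQLite::SQLite(), "cprd_extract.sqlite")
obs <- read.delim("observation_part001.txt")           # hypothetical raw CPRD .txt file
dbWriteTable(con, "observation", obs, append = TRUE)   # append each raw file into one table
hx <- dbGetQuery(con,
  "SELECT patid, MIN(obsdate) AS first_obs
     FROM observation
    WHERE medcodeid IN ('101', '102')
    GROUP BY patid")                                    # one row per patient with a matching code
dbDisconnect(con)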
This is a pre-release of Seshat-NLP, a dataset of labelled text segments derived from the Seshat Databank. These text segments were originally used in the Seshat Databank to justify the coding of historical "facts". A data point in the Seshat Databank describes a property of a past society at a certain time (or time range). We use these data points with their textual justifications to extract an NLP dataset of text segments accompanied by topic labels.
The dataset is organised around unique text segments (i.e. one unique segment per row); these segments are connected with labels that designate the historical information contained within the text. Each segment has at least one 4-tuple of labels associated with it but can have more. The labels are ("variable_name", "variable_id", "value", and "polity_id").
Below is a simplified example row in our dataset (exemplary data!):
| Description | Labels ("variable", "var_id", "value", "polity") | Reference |
| --- | --- | --- |
| Thebes was the capital … | [("Capital", "…", "Thebes", "Egypt Middle Kingdom"), …] | {"Title": "The Oxford Encyclopedia of …", "Author": "…", "DOI": "…", …} |
Our dataset partially consists of segments taken from scientific literature on history, and we pair these segments with labels that denote their content. We are currently looking into the legal considerations of releasing such data. In the meantime, we have added information to our dataset that allows the identification of the documents pertaining to each description.
This file is a PostgreSQL dump that can be used to instantiate the PostgreSQL table with all the data.
The table zenodoexport has the following columns:
| Column Name | Column Description |
| id | row identifier |
| description | textual justification of coded value |
| labels | labels for description |
| reference_information | information required to retrieve documents |
| description_hash | utility column |
| zodero_id | utility column |
The hierarchy_graph.gexf file is an XML-based export of the hierarchy graph that can be used to tie variables to their hierarchical position in the Seshat codebook.
The labels column contains a list of 4-tuples which in order denote "variable_name", "variable_id", "value", and "polity_id".
We use this structure to allow a single segment/description to have multiple 4-tuples of labels; this is useful when the same description has been used to justify multiple "facts" in the original Seshat Databank.
The variable_ids can be used to tie variable labels to nodes in the hierarchy of the Seshat codebook.
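Once the dump has been restored, the zenodoexport table can be queried from R via DBI; a minimal sketch (the database name and connection details are assumptions):
library(DBI)
con <- dbConnect(RPostgres::Postgres(), dbname = "seshat")  # assumes the dump was restored into a database named "seshat"
segments <- dbGetQuery(con, "SELECT id, description, labels FROM zenodoexport LIMIT 5")
str(segments)
dbDisconnect(con)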
Market basket analysis with Apriori algorithm
The retailer wants to target customers with suggestions for itemsets that a customer is most likely to purchase. I was given a dataset containing a retailer's transaction data, which covers all the transactions that happened over a period of time. The retailer will use the results to grow in their industry and to provide customers with itemset suggestions, so that we are able to increase customer engagement, improve customer experience, and identify customer behavior. I will solve this problem using Association Rules, a type of unsupervised learning technique that checks for the dependency of one data item on another data item.
Association Rule mining is most often used when you want to discover associations between different objects in a set, and it works well when you want to find frequent patterns in a transaction database. It can tell you which items customers frequently buy together, and it allows the retailer to identify relationships between the items.
Assume there are 100 customers; 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought mouse mat => bought computer mouse": support = P(mouse & mat) = 8/100 = 0.08; confidence = support / P(mouse mat) = 0.08/0.09 = 0.89; lift = confidence / P(computer mouse) = 0.89/0.10 = 8.9. This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
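The same numbers can be reproduced in a few lines of R:
n <- 100; n_mouse <- 10; n_mat <- 9; n_both <- 8
support    <- n_both / n                  # 0.08
confidence <- support / (n_mat / n)       # ≈ 0.89  (rule: mouse mat => computer mouse)
lift       <- confidence / (n_mouse / n)  # ≈ 8.9
c(support = support, confidence = confidence, lift = lift)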
Number of Attributes: 7
First, we need to load the required libraries.
Next, we need to upload Assignment-1_Data.xlsx to R to read the dataset. Now we can see our data in R.
Next, we will clean our data frame and remove the missing values.
To apply Association Rule mining, we need to convert the dataframe into transaction data, so that all items bought together in one invoice will be in ...
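The description breaks off here; the remaining steps can be sketched as follows (a rough sketch only: the column names "BillNo" and "Itemname" and the support/confidence thresholds are assumptions, not taken from the original assignment):
library(readxl)
library(arules)
retail <- read_excel("Assignment-1_Data.xlsx")
retail <- retail[complete.cases(retail), ]        # drop rows with missing values
baskets <- split(retail$Itemname, retail$BillNo)  # group items by invoice
tr <- as(baskets, "transactions")                 # coerce to the transactions class used by arules
rules <- apriori(tr, parameter = list(supp = 0.01, conf = 0.5))  # mine association rules
inspect(head(sort(rules, by = "lift"), 5))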
License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
11K rows of r/DirtyWritingPrompts, including score (which is upvotes minus downvotes). You probably want to remove the last 1K rows, because the data is sorted from most upvoted to least and a few hundred of the last samples have negative upvotes lmao. We also removed any story under 400 characters.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Original dataset
The original year-2019 dataset was downloaded from the World Bank Databank by the following approach on July 23, 2022.
Database: "World Development Indicators"
Country: 266 (all available)
Series: "CO2 emissions (kt)", "GDP (current US$)", "GNI, Atlas method (current US$)", and "Population, total"
Time: 1960, 1970, 1980, 1990, 2000, 2010, 2017, 2018, 2019, 2020, 2021
Layout: Custom -> Time: Column, Country: Row, Series: Column
Download options: Excel
Preprocessing
With LibreOffice:
remove non-country entries (the unnecessary rows after the line for Zimbabwe), and shorten column names for easier processing: Country Name -> Country, Country Code -> Code, "XXXX ... GNI ..." -> GNI_1990, etc. (note '_', not '-', for R).
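The same preprocessing can also be done in R instead of LibreOffice; a minimal sketch (the downloaded file name is hypothetical and the header strings are assumptions):
library(readxl)
wdi <- read_excel("world_bank_download.xlsx")           # hypothetical name of the Databank export
names(wdi)[names(wdi) == "Country Name"] <- "Country"   # shorten column names
names(wdi)[names(wdi) == "Country Code"] <- "Code"
wdi <- wdi[seq_len(which(wdi$Country == "Zimbabwe")), ] # drop the aggregate rows that follow Zimbabwe
write.csv(wdi, "world_bank_2019_clean.csv", row.names = FALSE)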
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This is the supplementary material accompanying the manuscript "Daily life in the Open Biologist’s second job, as a Data Curator", published in Wellcome Open Research.
It contains:
- Python_scripts.zip: Python scripts used for data cleaning and organization:
  - add_headers.py: adds specified headers automatically to a list of csv files, creating new output files containing a "_with_headers" suffix.
  - count_NaN_values.py: counts the total number of rows containing null values in a csv file and prints the location of null values in the (row, column) format.
  - remove_rowsNaN_file.py: removes rows containing null values in a single csv file and saves the modified file with a "_dropNaN" suffix.
  - remove_rowsNaN_list.py: removes rows containing null values in a list of csv files and saves the modified files with a "_dropNaN" suffix.
- README_template.txt: a template for a README file to be used to describe and accompany a dataset.
- template_for_source_data_information.xlsx: a spreadsheet to help manuscript authors to keep track of data used for each figure (e.g., information about data location and links to dataset description).
- Supplementary_Figure_1.tif: Example of a dataset shared by us on Zenodo. The elements that make the dataset FAIR are indicated by the respective letters. Findability (F) is achieved by the dataset's unique and persistent identifier (DOI), as well as by the related identifiers for the publication and dataset on GitHub. Additionally, the dataset is described with rich metadata (e.g., keywords). Accessibility (A) is achieved by the ease of visualization and downloading using a standardised communications protocol (https). Also, the metadata are publicly accessible and licensed under the public domain. Interoperability (I) is achieved by the open formats used (CSV; R), and metadata are harvestable using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), a low-barrier mechanism for repository interoperability. Reusability (R) is achieved by the complete description of the data with metadata in README files and links to the related publication (which contains more detailed information, as well as links to protocols on protocols.io). The dataset has a clear and accessible data usage license (CC-BY 4.0).
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Elapsed computation time, peak RAM usage, and total RAM usage for each approach.
This is the final occurrence record dataset produced for the manuscript "Depth Matters for Marine Biodiversity". Detailed methods for the creation of the dataset, below, have been excerpted from Appendix I: Extended Methods. Detailed citations for the occurrence datasets from which these data were derived can also be found in Appendix I of the manuscript. We assembled a list of all recognized species of fishes from the orders Scombriformes (sensu Betancur-R et al., 2017), Gadiformes, and Beloniformes by accessing FishBase (Boettiger et al., 2012; Froese & Pauly, 2017) and the Ocean Biodiversity Information System (OBIS; OBIS, 2022; Provoost & Bosch, 2019) through queries in R (R Core Team, 2021). Species were considered Atlantic if their FishBase distribution or occurrence records on OBIS included any area within the Atlantic or Mediterranean major fishing regions as defined by the Food and Agriculture Organization of the United Nations (FAO Regions 21, 27, 31, 34, 37, 41, 47, and 48; FAO, 2020). The database query script can be found on the project code repository (https://github.com/hannahlowens/3DFishRichness/blob/main/1_OccurrenceSearch.R). We then curated the list of names to resolve discrepancies in taxonomy and known distributions through comparison with the Eschmeyer Catalog of Fishes (Eschmeyer & Fricke, 2015), accessed in September of 2020, as our ultimate taxonomic authority. The resulting list of species was then mapped onto the Global Biodiversity Information Facility’s backbone taxonomy (Chamberlain et al., 2021; GBIF, 2020a) to ensure taxonomic concurrence across databases (Appendix I Table 1). The final taxonomic list was used to download occurrence records from OBIS (OBIS, 2022) and GBIF (GBIF, 2020b) in R through robis and occCite (Chamberlain et al., 2020; Provoost & Bosch, 2019; Owens et al., 2021). Once the resulting data were mapped and curated to remove records with putatively spurious coordinates, under-sampled regions and species were augmented with data from publicly available digital museum collection databases not served through OBIS or GBIF, as well as a literature search. For each species, duplicate points were removed from two- and three-dimensional species occurrence datasets separately, and inaccurate depth records were removed from 3D datasets. Inaccuracy was determined based on extreme statistical outliers (values greater than 2 or less than -2 when occurrence depths were centered and scaled), depth ranges that exceeded bathymetry at occurrence coordinates, and occurrences far outside known depth ranges compared to information from FishBase, Eschmeyer’s Catalog of Fishes, and congeneric depth ranges in the dataset. Finally, for datasets with more than 20 points remaining after cleaning, occurrence data were downsampled to the resolution of the environmental data; that is, to 1 point per 1 degree grid cell in the 2D dataset, and to one point per depth slice per 1 degree grid cell in the 3D dataset. Counts of raw and cleaned records for each species can be found in Appendix 1 Table 1.
References:
Betancur-R, R., Wiley, E. O., Arratia, G., Acero, A., Bailly, N., Miya, M., Lecointre, G., & Ortí, G. (2017). Phylogenetic classification of bony fishes. BMC Evolutionary Biology, 17(1), 162. https://doi.org/10.1186/s12862-017-0958-3
Boettiger, C., Lang, D. T., & Wainwright, P. C. (2012). rfishbase: exploring, manipulating and visualizing FishBase data from R. Journal of Fish Biology, 81(6), 2030–2039. https://doi.org/10.1111/j.1095-8649.2012.03464.x
Chamberlain, S., Barve, V., McGlinn, D., Oldoni, D., Desmet, P., Geffert, L., & Ram, K. (2021). rgbif: Interface to the Global Biodiversity Information Facility API. https://CRAN.R-project.org/package=rgbif
Eschmeyer, & Fricke, W. N. (2015). Taxonomic checklist of fish species listed in the CITES Appendices and EC Regulation 338/97 (Elasmobranchii, Actinopteri, Coelacanthi, and Dipneusti, except the genus Hippocampus). Catalog of Fishes, Electronic Version. Accessed September, 2020. https://www.calacademy.org/scientists/projects/eschmeyers-catalog-of-fishes
FAO. (2020). FAO Major Fishing Areas. United Nations Fisheries and Aquaculture Division. https://www.fao.org/fishery/en/collection/area
Froese, R., & Pauly, D. (2017). FishBase. Accessed September, 2022. www.fishbase.org
GBIF.org. (2020a). GBIF Backbone Taxonomy. Accessed September, 2020. GBIF.org
GBIF.org. (2020b). GBIF Occurrence Download. Accessed November, 2020. https://doi.org/10.15468
OBIS. (2020). Ocean Biodiversity Information System. Intergovernmental Oceanographic Commission of UNESCO. Accessed November, 2020. www.obis.org
Owens, H. L., Merow, C., Maitner, B. S., Kass, J. M., Barve, V., & Guralnick, R. P. (2021). occCite: Tools for querying and managing large biodiversity occurrence datasets. Ecography, 44(8), 1228–1235. https://doi.org/10.1111/ecog.05618
Provoost, P., & Bosch, S. (2019). robis: R Client to access data from the OBIS API. https://cran.r-project.org/package=robis
R Core Team. (2021). R: A Language and Environment for Statistical Computing. https://www.R-project.org/
This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days. Data is extracted from the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) system. In order to protect the privacy of crime victims, addresses are shown at the block level only and specific locations are not identified. Should you have questions about this dataset, you may contact the Research & Development Division of the Chicago Police Department at 312.745.6071 or RandD@chicagopolice.org. Disclaimer: These crimes may be based upon preliminary information supplied to the Police Department by the reporting parties that have not been verified. The preliminary crime classifications may be changed at a later date based upon additional investigation and there is always the possibility of mechanical or human error. Therefore, the Chicago Police Department does not guarantee (either expressed or implied) the accuracy, completeness, timeliness, or correct sequencing of the information and the information should not be used for comparison purposes over time. The Chicago Police Department will not be responsible for any error or omission, or for the use of, or the results obtained from the use of this information. All data visualizations on maps should be considered approximate and attempts to derive specific addresses are strictly prohibited. The Chicago Police Department is not responsible for the content of any off-site pages that are referenced by or that reference this web page other than an official City of Chicago or Chicago Police Department web page. The user specifically acknowledges that the Chicago Police Department is not responsible for any defamatory, offensive, misleading, or illegal conduct of other users, links, or third parties and that the risk of injury from the foregoing rests entirely with the user. The unauthorized use of the words "Chicago Police Department," "Chicago Police," or any colorable imitation of these words or the unauthorized use of the Chicago Police Department logo is unlawful. This web page does not, in any way, authorize such use. Data is updated daily Tuesday through Sunday. The dataset contains more than 65,000 records/rows of data and cannot be viewed in full in Microsoft Excel. Therefore, when downloading the file, select CSV from the Export menu. Open the file in an ASCII text editor, such as Wordpad, to view and search. To access a list of Chicago Police Department - Illinois Uniform Crime Reporting (IUCR) codes, go to http://data.cityofchicago.org/Public-Safety/Chicago-Police-Department-Illinois-Uniform-Crime-R/c7ck-438e
License: Database Contents License (DbCL) v1.0, http://opendatacommons.org/licenses/dbcl/1.0/
#https://www.kaggle.com/c/facial-keypoints-detection/details/getting-started-with-r #################################
###Variables for downloaded files
data.dir <- ' '
train.file <- paste0(data.dir, 'training.csv')
test.file <- paste0(data.dir, 'test.csv')
#################################
###Load csv -- creates a data.frame matrix where each column can have a different type.
d.train <- read.csv(train.file, stringsAsFactors = F)
d.test <- read.csv(test.file, stringsAsFactors = F)
###In training.csv, we have 7049 rows, each one with 31 columns. ###The first 30 columns are keypoint locations, which R correctly identified as numbers. ###The last one is a string representation of the image, identified as a string.
###To look at samples of the data, uncomment this line:
###Let's save the first column as another variable, and remove it from d.train: ###d.train is our dataframe, and we want the column called Image. ###Assigning NULL to a column removes it from the dataframe
im.train <- d.train$Image
d.train$Image <- NULL #removes 'Image' from the dataframe
im.test <- d.test$Image
d.test$Image <- NULL #removes 'Image' from the dataframe
#################################
#The image is represented as a series of numbers, stored as a string
#Convert these strings to integers by splitting them and converting the result to integer
#strsplit splits the string
#unlist simplifies its output to a vector of strings
#as.integer converts it to a vector of integers.
as.integer(unlist(strsplit(im.train[1], " ")))
as.integer(unlist(strsplit(im.test[1], " ")))
###Install and activate appropriate libraries ###The tutorial is meant for Linux and OSx, where they use a different library, so: ###Replace all instances of %dopar% with %do%.
library("foreach", lib.loc="~/R/win-library/3.3")
###Convert each image string into a vector of integers
im.train <- foreach(im = im.train, .combine=rbind) %do% { as.integer(unlist(strsplit(im, " "))) }
im.test <- foreach(im = im.test, .combine=rbind) %do% { as.integer(unlist(strsplit(im, " "))) }
#The foreach loop will evaluate the inner command for each row in im.train, and combine the results with rbind (combine by rows).
#%do% evaluates the iterations sequentially; on Linux/OS X the original tutorial uses %dopar% with a parallel backend to run them in parallel.
#im.train is now a matrix with 7049 rows (one for each image) and 9216 columns (one for each pixel):
###Save all four variables in a data.Rd file
###Can reload them at any time with load('data.Rd')
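###One possible save call (assumed; the original line was not included in this excerpt):
save(d.train, d.test, im.train, im.test, file = 'data.Rd')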
#each image is a vector of 96*96 pixels (96*96 = 9216).
#convert these 9216 integers into a 96x96 matrix:
im <- matrix(data=rev(im.train[1,]), nrow=96, ncol=96)
#im.train[1,] returns the first row of im.train, which corresponds to the first training image.
#rev reverses the resulting vector to match the interpretation of R's image function
#(which expects the origin to be in the lower left corner).
#To visualize the image we use R's image function:
image(1:96, 1:96, im, col=gray((0:255)/255))
#Let’s color the coordinates for the eyes and nose:
points(96-d.train$nose_tip_x[1], 96-d.train$nose_tip_y[1], col="red")
points(96-d.train$left_eye_center_x[1], 96-d.train$left_eye_center_y[1], col="blue")
points(96-d.train$right_eye_center_x[1], 96-d.train$right_eye_center_y[1], col="green")
#Another good check is to see how variable our data is.
#For example, where are the centers of each nose in the 7049 images? (this takes a while to run):
for(i in 1:nrow(d.train)) { points(96-d.train$nose_tip_x[i], 96-d.train$nose_tip_y[i], col="red") }
#There are quite a few outliers -- they could be labeling errors. Looking at one extreme example we get this:
#In this case there's no labeling error, but this shows that not all faces are centralized
idx <- which.max(d.train$nose_tip_x)
im <- matrix(data=rev(im.train[idx,]), nrow=96, ncol=96)
image(1:96, 1:96, im, col=gray((0:255)/255))
points(96-d.train$nose_tip_x[idx], 96-d.train$nose_tip_y[idx], col="red")
#One of the simplest things to try is to compute the mean of the coordinates of each keypoint in the training set and use that as a prediction for all images
colMeans(d.train, na.rm=T)
#To build a submission file we need to apply these computed coordinates to the test instances:
p <- matrix(data=colMeans(d.train, na.rm=T), nrow=nrow(d.test), ncol=ncol(d.train), byrow=T)
colnames(p) <- names(d.train)
predictions <- data.frame(ImageId = 1:nrow(d.test), p)
head(predictions)
#The expected submission format has one keypoint per row, but we can easily get that with the help of the reshape2 library:
library(...
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background
Is the control of movement less stable when we walk or run in challenging settings? Intuitively, one might answer that it is, given that adding constraints to locomotion (e.g. rough terrain, age-related impairments, etc.) makes movements less stable. Here, we investigated how young and old humans synergistically activate muscles during locomotion when different perturbation levels are introduced. Of these control signals, called muscle synergies, we analyzed the stability over time. Surprisingly, we found that perturbations and older age force the central nervous system to produce muscle activation patterns that are more stable. These outcomes show that robust locomotion in challenging settings is achieved by increasing the stability of control signals, whereas easier tasks allow for more unstable control.
How to use the data set
This supplementary data set contains: a) the metadata with anonymized participant information, b) the raw electromyographic (EMG) data acquired during locomotion, c) the touchdown and lift-off timings of the recorded limb, d) the filtered and time-normalized EMG, e) the muscle synergies extracted via non-negative matrix factorization and f) the code written in R (R Found. for Stat. Comp.) to process the data, including the scripts to calculate the Maximum Lyapunov Exponents of motor primitives. In total, 476 trials from 86 participants are included in the supplementary data set.
The file “participant_data.dat” is available in ASCII and RData (R Found. for Stat. Comp.) format and contains:
The files containing the gait cycle breakdown are available in RData (R Found. for Stat. Comp.) format, in the file named “CYCLE_TIMES.RData”. The files are structured as data frames with 30 rows (one for each gait cycle) and two columns. The first column contains the touchdown incremental times in seconds. The second column contains the duration of each stance phase in seconds. Each trial is saved as an element of a single R list. Trials are named like “CYCLE_TIMES_P0020,” where the characters “CYCLE_TIMES” indicate that the trial contains the gait cycle breakdown times and the characters “P0020” indicate the participant number (in this example the 20th). Please note that the overground trials of participants P0001 and P0009 and the second uneven-surface running trial of participant P0048 only contain 22, 27 and 23 cycles, respectively.
The files containing the raw, filtered and the normalized EMG data are available in RData (R Found. for Stat. Comp.) format, in the files named “RAW_EMG.RData” and “FILT_EMG.RData”. The raw EMG files are structured as data frames with 30000 rows (one for each recorded data point) and 14 columns. The first column contains the incremental time in seconds. The remaining thirteen columns contain the raw EMG data, named with muscle abbreviations that follow those reported in the Materials and Methods section of this Supplementary Materials file. Each trial is saved as an element of a single R list. Trials are named like “RAW_EMG_P0053_OG_02”, where the characters “RAW_EMG” indicate that the trial contains raw emg data, the characters “P0053” indicate the participant number (in this example the 53rd), the characters “OW” indicate the locomotion type (E1: OW=overground walking, OR=overground running, TW=treadmill walking, TR=treadmill running; E2: EW=even-surface walking, ER=even-surface running, UW=uneven-surface walking, UR=uneven-surface running; E3: NW=normal walking, PW=perturbed walking), and the numbers “02” indicate the trial number (in this case the 2nd). The 10 trials per participant recorded for each overground session (i.e. 10 for walking and 10 for running) were concatenated into one. The filtered and time-normalized emg data is named, following the same rules, like “FILT_EMG_P0053_OG_02”.
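A single trial can be inspected in R, for example (a minimal sketch; the name of the list object stored inside the RData file is an assumption):
load("RAW_EMG.RData")                       # loads a list with one element per trial (object name assumed to be RAW_EMG)
trial <- RAW_EMG[["RAW_EMG_P0053_OG_02"]]   # 30000 rows x 14 columns: time plus 13 muscles
str(trial[, 1:4])
plot(trial[[1]], trial[[2]], type = "l",
     xlab = "Time (s)", ylab = "Raw EMG (first muscle)")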
The files containing the muscle synergies extracted from the filtered and normalized EMG data are available in RData (R Found. for Stat. Comp.) format, in the files named “SYNS_H.RData” and “SYNS_W.RData”. The muscle synergies files are divided into motor primitives and motor modules and are presented as direct output of the factorization and not in any functional order. Motor primitives are data frames with 6000 rows and a number of columns equal to the number of synergies (which might differ from trial to trial) plus one. The rows contain the time-dependent coefficients (motor primitives), one column for each synergy plus the time points (columns are named e.g. “Time, Syn1, Syn2, Syn3”, where “Syn” is the abbreviation for “synergy”). Each gait cycle contains 200 data points, 100 for the stance and 100 for the swing phase which, multiplied by the 30 recorded cycles, result in 6000 data points distributed in as many rows. This output is transposed as compared to the one discussed above to improve user readability. Each set of motor primitives is saved as an element of a single R list. Trials are named like “SYNS_H_P0012_PW_02”, where the characters “SYNS_H” indicate that the trial contains motor primitive data, the characters “P0012” indicate the participant number (in this example the 12th), the characters “PW” indicate the locomotion type (see above), and the numbers “02” indicate the trial number (in this case the 2nd). Motor modules are data frames with 13 rows (number of recorded muscles) and a number of columns equal to the number of synergies (which might differ from trial to trial). The rows, named with muscle abbreviations that follow those reported in the Materials and Methods section of this Supplementary Materials file, contain the time-independent coefficients (motor modules), one for each synergy and for each muscle. Each set of motor modules relative to one synergy is saved as an element of a single R list. Trials are named like “SYNS_W_P0082_PW_02”, where the characters “SYNS_W” indicate that the trial contains motor module data, the characters “P0082” indicate the participant number (in this example the 82nd), the characters “PW” indicate the locomotion type (see above), and the numbers “02” indicate the trial number (in this case the 2nd). Given the nature of the NMF algorithm for the extraction of muscle synergies, the supplementary data set might show non-significant differences as compared to the one used for obtaining the results of this paper.
The files containing the MLE calculated from motor primitives are available in RData (R Found. for Stat. Comp.) format, in the file named “MLE.RData”. MLE results are presented in a list of lists containing, for each trial, 1) the divergences, 2) the MLE, and 3) the value of the R2 between the divergence curve and its linear interpolation made using the specified number of points. The divergences are presented as a one-dimensional vector. The MLE is a single number, as is the R2 value. Trials are named like “MLE_P0081_EW_01”, where the characters “MLE” indicate that the trial contains MLE data, the characters “P0081” indicate the participant number (in this example the 81st), the characters “EW” indicate the locomotion type (see above), and the numbers “01” indicate the trial number (in this case the 1st).
All the code used for the preprocessing of EMG data, the extraction of muscle synergies and the calculation of MLE is available in R (R Found. for Stat. Comp.) format. Explanatory comments are profusely present throughout the scripts (“SYNS.R”, which is the script to extract synergies, “fun_NMF.R”, which contains the NMF function, “MLE.R”, which is the script to calculate the MLE of motor primitives and “fun_MLE.R”, which contains the MLE function).
The Kaggle Movies dataset is available in CSV format and consists of one file: "movies.csv".
The file contains data on over 10,000 movies and includes fields such as title, release date, director, cast, genre, language, budget, revenue, and rating. The file is approximately 3 MB in size and can be easily imported into popular data analysis tools such as Excel, Python, R, and Tableau.
The data is organized into rows and columns, with each row representing a single movie and each column representing a specific attribute of the movie. The file contains a header row that provides a description of each column.
The file has been cleaned and processed to remove any duplicates or inconsistencies. However, the data is provided as-is, without any warranties or guarantees of accuracy or completeness.
The "movies.csv" file in the Kaggle Movies dataset includes the following columns:
id: The unique identifier for each movie.
title: The title of the movie.
overview: A brief summary of the movie.
release_date: The date when the movie was released (in YYYY-MM-DD format).
Popularity: A numerical score indicating the relative popularity of each movie, based on factors such as user ratings, social media mentions, and box office performance.
Vote Average: The average rating given to the movie by users of the IMDb website (on a scale of 0-10).
Vote Count: The number of ratings given to the movie by users of the IMDb website.
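A quick way to explore the file in R (a minimal sketch; the exact column-name spellings are assumptions based on the list above):
movies <- read.csv("movies.csv", stringsAsFactors = FALSE)
str(movies)
# ten highest-rated movies (assumes columns named "title", "release_date" and "vote_average")
head(movies[order(-movies$vote_average), c("title", "release_date", "vote_average")], 10)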
Which countries have the most social contacts in the world? In particular, do countries with more social contacts among the elderly report more deaths in a pandemic caused by a respiratory virus?
With the emergence of the COVID-19 pandemic, reports have shown that the elderly are at a higher risk of dying than any other age group. 8 out of 10 deaths reported in the U.S. have been in adults 65 years old and older. Countries have also begun to enforce 2-metre social distancing to contain the pandemic.
To this end, I wanted to explore the relationship between social contacts among the elderly and its relationship with the number of COVID-19 deaths across countries.
This dataset includes a subset of the projected social contact matrices in 152 countries from the surveys in Prem et al. 2020. It was based on the POLYMOD study, where information on social contacts was obtained using cross-sectional surveys in Belgium (BE), Germany (DE), Finland (FI), Great Britain (GB), Italy (IT), Luxembourg (LU), The Netherlands (NL), and Poland (PL) between May 2005 and September 2006.
This dataset includes contact rates from study participants ages 65+ for all countries from all sources of contact (work, home, school and others).
I used this R code to extract this data:
load('../input/contacts.Rdata') # https://github.com/kieshaprem/covid19-agestructureSEIR-wuhan-social-distancing/blob/master/data/contacts.Rdata
View(contacts)
contacts[["ALB"]][["home"]]
contacts[["ITA"]][["all"]]
rowSums(contacts[["ALB"]][["all"]])
out1 = data.frame(); for (n in names(contacts)) { x = (contacts[[n]][["all"]])[16,]; out1 <- rbind(out1, data.frame(x)) }
out2 = data.frame(); for (n in names(contacts)) { x = (contacts[[n]][["all"]])[15,]; out2 <- rbind(out2, data.frame(x)) }
out3 = data.frame(); for (n in names(contacts)) { x = (contacts[[n]][["all"]])[14,]; out3 <- rbind(out3, data.frame(x)) }
m1 = data.frame(t(matrix(unlist(out1), nrow=16)))
m2 = data.frame(t(matrix(unlist(out2), nrow=16)))
m3 = data.frame(t(matrix(unlist(out3), nrow=16)))
rownames(m1) = names(contacts)
colnames(m1) = c("00_04", "05_09", "10_14", "15_19", "20_24", "25_29", "30_34", "35_39", "40_44", "45_49", "50_54", "55_59", "60_64", "65_69", "70_74", "75_79")
rownames(m2) = rownames(m1)
rownames(m3) = rownames(m1)
colnames(m2) = colnames(m1)
colnames(m3) = colnames(m1)
write.csv(zapsmall(m1),"contacts_75_79.csv", row.names = TRUE)
write.csv(zapsmall(m2),"contacts_70_74.csv", row.names = TRUE)
write.csv(zapsmall(m3),"contacts_65_69.csv", row.names = TRUE)
Row names correspond to the 3-letter country ISO code, e.g. ITA represents Italy. Column names are the age groups of the individuals contacted, in 5-year intervals from 0 to 80 years old. Cell values are the projected mean social contact rate.
[Figure: contact_distribution.png]
Thanks goes to Dr. Kiesha Prem for her correspondence and her team for publishing their work on social contact matrices.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset includes the following components:
This data is aligned with the rest of the UAV Canyelles Vineyard Datasets uploaded by Noumena (UC1), so that different orthomosaics / point clouds / DEMs from different dates can be analyzed jointly for cropped and uncropped files.
Data collection took place on May 10th, 2024, in Canyelles, Catalonia, Spain. The UAV was automatically flown at an altitude of 12 meters, ensuring sufficient frontal and side overlap between images.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains a diverse set of features extracted from the V3C1+V3C2 dataset, sourced from the Vimeo Creative Commons Collection. These features were utilized in the VISIONE system [Amato et al. 2023, Amato et al. 2022] during the latest editions of the Video Browser Showdown (VBS) competition (https://www.videobrowsershowdown.org/).
The original V3C1+V3C2 dataset, provided by NIST, can be downloaded using the instructions provided at https://videobrowsershowdown.org/about-vbs/existing-data-and-tools/.
It comprises 7,235 video files, amounting to 2,300 hours of video content and encompassing 2,508,113 predefined video segments.
We subdivided the predefined video segments longer than 10 seconds into multiple segments, with each segment spanning no longer than 16 seconds. As a result, we obtained a total of 2,648,219 segments. For each segment, we extracted one frame, specifically the middle one, and computed several features, which are described in detail below.
This repository is released under a Creative Commons Attribution license. If you use it in any form for your work, please cite the following paper:
@inproceedings{amato2023visione,
title={VISIONE at Video Browser Showdown 2023},
author={Amato, Giuseppe and Bolettieri, Paolo and Carrara, Fabio and Falchi, Fabrizio and Gennaro, Claudio and Messina, Nicola and Vadicamo, Lucia and Vairo, Claudio},
booktitle={International Conference on Multimedia Modeling},
pages={615--621},
year={2023},
organization={Springer}
}
This repository comprises the following files:
*Please be sure to use the v2 version of this repository, since v1 feature files may contain inconsistencies that have now been corrected.
*Note on the object annotations: Within an object archive, there is a jsonl file for each video, where each row contains a record of a video segment (the "_id" corresponds to the "id_visione" used in the msb.tar.gz). Additionally, there are three arrays representing the objects detected, the corresponding scores, and the bounding boxes. The format of these arrays is as follows:
†Note on the cross-modal features: The extracted multi-modal features (ALADIN, CLIPs, CLIP2Video) enable internal searches within the V3C1+V3C2 dataset using the query-by-image approach (features can be compared with the dot product). However, to perform searches based on free text, the text needs to be transformed into the joint embedding space according to the specific network being used. Please be aware that the service for transforming text into features is not provided within this repository and should be developed independently using the original feature repositories linked above.
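For query-by-image, the comparison mentioned above amounts to a dot product between a query feature vector and the stored feature matrix; a minimal, self-contained sketch with simulated features (the dimensions are illustrative, not those of the released feature files):
set.seed(1)
features <- matrix(rnorm(1000 * 512), nrow = 1000)  # simulated: 1000 segments x 512-dim features
query    <- rnorm(512)                              # simulated query-image feature vector
scores  <- as.vector(features %*% query)            # dot-product similarity per segment
ranking <- order(scores, decreasing = TRUE)
head(ranking, 10)                                   # indices of the 10 most similar segments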
We have plans to release the code in the future, allowing the reproduction of the VISIONE system, including the instantiation of all the services to transform text into cross-modal features. However, this work is still in progress, and the code is not currently available.
References:
[Amato et al. 2023] Amato, G.et al., 2023, January. VISIONE at Video Browser Showdown 2023. In International Conference on Multimedia Modeling (pp. 615-621). Cham: Springer International Publishing.
[Amato et al. 2022] Amato, G. et al. (2022). VISIONE at Video Browser Showdown 2022. In: , et al. MultiMedia Modeling. MMM 2022. Lecture Notes in Computer Science, vol 13142. Springer, Cham.
[Fang H. et al. 2021] Fang H. et al., 2021. Clip2video: Mastering video-text retrieval via image clip. arXiv preprint arXiv:2106.11097.
[He et al. 2017] He, K., Gkioxari, G., Dollár, P. and Girshick, R., 2017. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision (pp. 2961-2969).
[Kuznetsova et al. 2020] Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M., Kolesnikov, A. and Duerig, T., 2020. The open images dataset v4. International Journal of Computer Vision, 128(7), pp.1956-1981.
[Lin et al. 2014] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P. and Zitnick, C.L., 2014, September. Microsoft coco: Common objects in context. In European conference on computer vision (pp. 740-755). Springer, Cham.
[Messina et al. 2022] Messina N. et al., 2022, September. Aladin: distilling fine-grained alignment scores for efficient image-text matching and retrieval. In Proceedings of the 19th International Conference on Content-based Multimedia Indexing (pp. 64-70).
[Radford et al. 2021] Radford A. et al., 2021, July. Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748-8763). PMLR.
[Schuhmann et al. 2022] Schuhmann C. et al., 2022. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35, pp.25278-25294.
[Zhang et al. 2021] Zhang, H., Wang, Y., Dayoub, F. and Sunderhauf, N., 2021. Varifocalnet: An iou-aware dense object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8514-8523).