17 datasets found
  1. Film Circulation dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, png
    Updated Jul 12, 2024
    Cite
    Skadi Loist; Evgenia (Zhenya) Samoilova (2024). Film Circulation dataset [Dataset]. http://doi.org/10.5281/zenodo.7887672
    Explore at:
    Available download formats: csv, png, bin
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Skadi Loist; Evgenia (Zhenya) Samoilova
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

    A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies, an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org.

    Please cite this when using the dataset.


    Detailed description of the dataset:

    1 Film Dataset: Festival Programs

    The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

    The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

    The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
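
    A minimal R sketch of how the two tables could be read and cross-checked, assuming they are plain csv files and that the unique film ID column is named "film_id" (the actual column name is defined in the codebook):

        # Read the long and wide festival-program tables listed above
        film_long <- read.csv("1_film-dataset_festival-program_long.csv", stringsAsFactors = FALSE)
        film_wide <- read.csv("1_film-dataset_festival-program_wide.csv", stringsAsFactors = FALSE)

        # In the long table the same film can appear once per sampled festival;
        # in the wide table each film appears exactly once.
        length(unique(film_long$film_id))   # should equal nrow(film_wide), i.e. 9,348
        table(table(film_long$film_id))     # how many films appear in 1, 2, ... festivals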


    2 Survey Dataset

    The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

    The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

    The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.


    3 IMDb & Scripts

    The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

    The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

    The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

    The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

    The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

    The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to one crew member of a given film. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

    The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

    The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

    The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, TV, DVD/Blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

    The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

    The dataset includes 8 text files containing the scripts for web scraping. They were written in R 3.6.3 for Windows.

    The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.

    The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then, if no matches are found, using an alternative title and a basic search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of suggested records on the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with the suggested films on the IMDb search page. Matching was done on directors, production year (+/- one year), and title, using a fuzzy matching approach with two methods: “cosine” and “osa”. Cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.
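
    As an illustration of the matching logic (not the authors' actual code), title similarity under the two methods can be computed in R with the stringdist package; the threshold used here is an arbitrary example:

        library(stringdist)

        core_title      <- "The Souvenir"
        candidate_title <- "The Souvenir: Part II"

        # Cosine similarity on character q-grams rewards strongly overlapping titles
        sim_cosine <- stringsim(core_title, candidate_title, method = "cosine", q = 2)

        # Optimal string alignment (OSA) tolerates typos and small edits
        sim_osa <- stringsim(core_title, candidate_title, method = "osa")

        # Example decision rule: flag the candidate as a possible match if either
        # similarity is high enough; real thresholds would be tuned and checked manually
        possible_match <- max(sim_cosine, sim_osa) >= 0.85
        c(cosine = sim_cosine, osa = sim_osa, match = possible_match)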

    The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible duplicates in the dataset and flags them for a manual check.

    The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

    The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. The script does this for the first 100 films only, to check that everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.

    The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

    The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

    The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It reports the number of missing values and errors.


    4 Festival Library Dataset

    The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

    The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories,

  2. Large Datasets in R - Plant Phenology & Temperature Data from NEON

    • qubeshub.org
    Updated May 10, 2018
    Cite
    Megan Jones Patterson; Lee Stanish; Natalie Robinson; Katherine Jones; Cody Flagg (2018). Large Datasets in R - Plant Phenology & Temperature Data from NEON [Dataset]. http://doi.org/10.25334/Q4DQ3F
    Explore at:
    Dataset updated
    May 10, 2018
    Dataset provided by
    QUBES
    Authors
    Megan Jones Patterson; Lee Stanish; Natalie Robinson; Katherine Jones; Cody Flagg
    Description

    This module series covers how to import, manipulate, format and plot time series data stored in .csv format in R. Originally designed to teach researchers to use NEON plant phenology and air temperature data, it has also been used in undergraduate classrooms.
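
    A minimal sketch of the kind of workflow the module covers, assuming a csv with a date column and a daily mean temperature column (the file and column names here are hypothetical):

        library(ggplot2)

        # Read a time series csv and parse the date column
        temp <- read.csv("NEON_temp_daily.csv", stringsAsFactors = FALSE)
        temp$date <- as.Date(temp$date)

        # Plot daily mean air temperature over time
        ggplot(temp, aes(x = date, y = mean_temp)) +
          geom_line() +
          labs(x = "Date", y = "Mean air temperature (C)")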

  3. Synthetic datasets of the UK Biobank cohort

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, pdf, zip
    Updated Feb 6, 2025
    Cite
    Antonio Gasparrini; Jacopo Vanoli (2025). Synthetic datasets of the UK Biobank cohort [Dataset]. http://doi.org/10.5281/zenodo.13983170
    Explore at:
    Available download formats: bin, csv, zip, pdf
    Dataset updated
    Feb 6, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Antonio Gasparrini; Jacopo Vanoli
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository stores synthetic datasets derived from the database of the UK Biobank (UKB) cohort.

    The datasets were generated for illustrative purposes, in particular for reproducing specific analyses on the health risks associated with long-term exposure to air pollution using the UKB cohort. The code used to create the synthetic datasets is available and documented in a related GitHub repo, with details provided in the section below. These datasets can be freely used for code testing and for illustrating other examples of analyses on the UKB cohort.

    Note: while the synthetic versions of the datasets resemble the real ones in several aspects, the users should be aware that these data are fake and must not be used for testing and making inferences on specific research hypotheses. Even more importantly, these data cannot be considered a reliable description of the original UKB data, and they must not be presented as such.

    The original datasets are described in the article by Vanoli et al in Epidemiology (2024) (DOI: 10.1097/EDE.0000000000001796) [freely available here], which also provides information about the data sources.

    The work was supported by the Medical Research Council-UK (Grant ID: MR/Y003330/1).

    Content

    The series of synthetic datasets (stored in two versions with csv and RDS formats) are the following:

    • synthbdcohortinfo: basic cohort information regarding the follow-up period and birth/death dates for 502,360 participants.
    • synthbdbasevar: baseline variables, mostly collected at recruitment.
    • synthpmdata: annual average exposure to PM2.5 for each participant reconstructed using their residential history.
    • synthoutdeath: death records that occurred during the follow-up with date and ICD-10 code.

    In addition, this repository provides these additional files:

    • codebook: a pdf file with a codebook for the variables of the various datasets, including references to the fields of the original UKB database.
    • asscentre: a csv file with information on the assessment centres used for recruitment of the UKB participants, including code, names, and location (as northing/easting coordinates of the British National Grid).
    • Countries_December_2022_GB_BUC: a zip file including the shapefile defining the boundaries of the countries in Great Britain (England, Wales, and Scotland), used for mapping purposes [source].

    Generation of the synthetic data

    The datasets resemble the real data used in the analysis, and they were generated using the R package synthpop (www.synthpop.org.uk). The generation process involves two steps, namely the synthesis of the main data (cohort info, baseline variables, annual PM2.5 exposure) and then the sampling of death events. The R scripts for performing the data synthesis are provided in the GitHub repo (subfolder Rcode/synthcode).

    The first part merges all the data including the annual PM2.5 levels in a single wide-format dataset (with a row for each subject), generates a synthetic version, adds fake IDs, and then extracts (and reshapes) the single datasets. In the second part, a Cox proportional hazard model is fitted on the original data to estimate risks associated with various predictors (including the main exposure represented by PM2.5), and then these relationships are used to simulate death events in each year. Details on the modelling aspects are provided in the article.
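
    The actual scripts are in the GitHub repo; as a rough illustration of the two steps (synthesis, then survival-model-based sampling of deaths), here is a toy R sketch on invented data using the synthpop and survival packages (variable names and values are made up):

        library(synthpop)
        library(survival)

        # Step 1: synthesise a wide-format dataset of baseline variables and exposure
        set.seed(1)
        orig <- data.frame(
          age  = rnorm(500, 55, 8),
          sex  = factor(sample(c("F", "M"), 500, replace = TRUE)),
          pm25 = rnorm(500, 10, 2)
        )
        synth <- syn(orig)          # returns a list; the synthetic data is in synth$syn

        # Step 2: fit a Cox model on (here, simulated) survival outcomes; the estimated
        # hazard ratios can then be used to simulate death events for the synthetic subjects
        orig$time   <- rexp(500, rate = 0.01)
        orig$status <- rbinom(500, 1, 0.3)
        fit <- coxph(Surv(time, status) ~ age + sex + pm25, data = orig)
        summary(fit)$coefficients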

    This process guarantees that the synthetic data do not hold specific information about the original records, thus preserving confidentiality. At the same time, the multivariate distribution and correlation across variables as well as the mortality risks resemble those of the original data, so the results of descriptive and inferential analyses are similar to those in the original assessments. However, as noted above, the data are used only for illustrative purposes, and they must not be used to test other research hypotheses.

  4. Protocol data (R version)

    • figshare.com
    application/gzip
    Updated Oct 16, 2020
    Cite
    Jesse Gillis (2020). Protocol data (R version) [Dataset]. http://doi.org/10.6084/m9.figshare.13020569.v2
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Oct 16, 2020
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Jesse Gillis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We published 3 protocols illustrating how MetaNeighbor can be used to quantify cell type replicability across single cell transcriptomic datasets. The data files included here are needed to run the R version of the protocols, available on Github (https://github.com/gillislab/MetaNeighbor-Protocol) in RMarkdown (.Rmd) and Jupyter (.ipynb) notebook format. To run the protocols, download the protocols on Github, download the data on Figshare, place the data and protocol files in the same directory, then run the notebooks in Rstudio or Jupyter. The scripts used to generate the data are included in the Github directory. Briefly:

    • full_biccn_hvg.rds contains a single cell transcriptomic dataset published by the Brain Initiative Cell Census Network (in SingleCellExperiment format). It combines data from 7 datasets obtained in the mouse primary motor cortex (https://www.biorxiv.org/content/10.1101/2020.02.29.970558v2). Note that this dataset only contains highly variable genes.
    • biccn_hvgs.txt: highly variable genes from the BICCN dataset described above (computed with the MetaNeighbor library).
    • biccn_gaba.rds: same dataset as full_biccn_hvg.rds, but restricted to GABAergic neurons. The dataset contains all genes common to the 7 BICCN datasets (not just highly variable genes).
    • go_mouse.rds: gene ontology annotations, stored as a list of gene symbols (one element per gene set).
    • functional_aurocs.txt: results of the MetaNeighbor functional analysis in protocol 3.
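
    A minimal R sketch of loading these files outside the notebooks; the MetaNeighborUS call is only indicative, and the colData column names used here are placeholders for the metadata fields defined in the protocols:

        library(SingleCellExperiment)
        library(MetaNeighbor)

        # Load the BICCN data (SingleCellExperiment) and the highly variable genes
        biccn   <- readRDS("full_biccn_hvg.rds")
        hvgs    <- scan("biccn_hvgs.txt", what = character())
        go_sets <- readRDS("go_mouse.rds")   # list of gene sets (gene symbols)

        # Indicative cell type replicability analysis; "study_id" and "cell_type"
        # are placeholder names for the colData fields used in the protocols
        aurocs <- MetaNeighborUS(var_genes = hvgs,
                                 dat       = biccn,
                                 study_id  = colData(biccn)$study_id,
                                 cell_type = colData(biccn)$cell_type)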

  5. RD Dataset

    • figshare.com
    zip
    Updated Sep 16, 2022
    Cite
    Seung Seog Han (2022). RD Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.15170853.v5
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 16, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Seung Seog Han
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ** RD DATASET ** The RD dataset was created from images posted to the melanoma community on the internet (https://reddit.com/r/melanoma). Consecutive images were included, using a python library (https://github.com/aliparlakci/bulk-downloader-for-reddit), from Jan 25, 2020, to July 30, 2021. The ground truth was decided by the vote of four dermatologists and one plastic surgeon while referring to the chief complaint and brief history. A total of 1,282 images (1,201 cases) were finally included. Because some cases were deleted by their users, the links of only 860 cases were still valid as of July 2021.

    1. RD_RAW.xlsx The download links and ground truth of the RD dataset are included in this Excel file. In addition, the raw data of the AI (Model Dermatology Build2021 - https://modelderm.com) and of 32 laypersons are also included.

    2. v1_public.zip This archive includes the 1,282 lesional images (full-size). The 24 images that were excluded from the study are also available.

    3. v1_private.zip is not available here, and wide-field images are not available here either. If the archive is needed for research purposes, please email Dr. Han Seung Seog (whria78@gmail.com) or Dr. Cristian Navarrete-Dechent (ctnavarr@gmail.com).

    References - The Degradation of Performance of a State-of-the-art Skin Image Classifier When Applied to Patient-driven Internet Search - Scientific Reports (in press)

    ** Background normal test with the ISIC images ** The ISIC dataset (https://www.isic-archive.com; Gallery -> 2018 JID Editorial images; 99 images; ISIC_0024262 and ISIC_0024261 are identical images, so ISIC_0024262 was skipped) was used for the background normal test. We defined a 10% area rectangle crop as a “specialist-size crop” and a 5% area rectangle crop as a “layperson-size crop”.

    a) S-crops.zip: specialist-size crops. Format: CROPNO_AGE(0~99)_GENDER(1=male,0=female)[m]_FILENAME.png
    b) L-crops.zip: layperson-size crops. Format: CROPNO_AGE(0~99)_GENDER(1=male,0=female)[m]_FILENAME.png
    c) result_S.zip: background normal test result using the specialist-size crops
    d) result_L.zip: background normal test result using the layperson-size crops

    Reference - Automated Dermatological Diagnosis: Hype or Reality? - https://doi.org/10.1016/j.jid.2018.04.040 - Multiclass Artificial Intelligence in Dermatology: Progress but Still Room for Improvement - https://doi.org/10.1016/j.jid.2020.06.040

  6. [Superseded] Intellectual Property Government Open Data 2019

    • researchdata.edu.au
    • data.gov.au
    Updated Jun 6, 2019
    Cite
    IP Australia (2019). [Superseded] Intellectual Property Government Open Data 2019 [Dataset]. https://researchdata.edu.au/superseded-intellectual-property-data-2019/2994670
    Explore at:
    Dataset updated
    Jun 6, 2019
    Dataset provided by
    Data.gov (https://data.gov/)
    Authors
    IP Australia
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    What is IPGOD?

    The Intellectual Property Government Open Data (IPGOD) includes over 100 years of registry data on all intellectual property (IP) rights administered by IP Australia. It also has derived information about the applicants who filed these IP rights, to allow for research and analysis at the regional, business and individual level. This is the 2019 release of IPGOD.

    How do I use IPGOD?

    IPGOD is large, with millions of data points across up to 40 tables, making them too large to open with Microsoft Excel. Furthermore, analysis often requires information from separate tables, which would need specialised software for merging. We recommend that advanced users interact with the IPGOD data using the right tools with enough memory and compute power. This includes a wide range of programming and statistical software such as Tableau, Power BI, Stata, SAS, R, Python, and Scala.
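
    As a rough illustration (not official IP Australia code), large IPGOD tables can be read and joined in R with the data.table package; the file and column names below are invented placeholders, and the real ones are given in the data dictionaries:

        library(data.table)

        # Read two (hypothetically named) IPGOD tables without loading them into Excel
        applications <- fread("ipgod101_patent_applications.csv")
        applicants   <- fread("ipgod102_patent_applicants.csv")

        # Join on a shared key such as the application number (column name assumed)
        patents <- merge(applications, applicants, by = "application_number")

        # Dates are ISO 'yyyy-mm-dd', so they parse directly
        patents[, filing_year := year(as.IDate(filing_date))]
        patents[, .N, by = filing_year][order(filing_year)]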

    IP Data Platform

    IP Australia is also providing free trials of a cloud-based analytics platform with the capabilities to enable working with large intellectual property datasets, such as the IPGOD, through the web browser, without any installation of software. IP Data Platform

    References

    The following pages can help you gain an understanding of the intellectual property administration and processes in Australia to help your analysis of the dataset.

    • Patents
    • Trade Marks
    • Designs
    • Plant Breeder’s Rights

    Updates

    Tables and columns

    Due to the changes in our systems, some tables have been affected.

    • We have added IPGOD 225 and IPGOD 325 to the dataset!
    • The IPGOD 206 table is not available this year.
    • Many tables have been re-built, and as a result may have different columns or different possible values. Please check the data dictionary for each table before use.

    Data quality improvements

    Data quality has been improved across all tables.

    • Null values are simply empty rather than '31/12/9999'.
    • All date columns are now in ISO format 'yyyy-mm-dd'.
    • All indicator columns have been converted to Boolean data type (True/False) rather than Yes/No, Y/N, or 1/0.
    • All tables are encoded in UTF-8.
    • All tables use the backslash \ as the escape character.
    • The applicant name cleaning and matching algorithms have been updated. We believe that this year's method improves the accuracy of the matches. Please note that the "ipa_id" generated in IPGOD 2019 will not match with those in previous releases of IPGOD.

  7. Data from: A Phanerozoic gridded dataset for palaeogeographic...

    • zenodo.org
    • portalcientifico.uvigo.gal
    • +1more
    zip
    Updated May 29, 2024
    Cite
    Lewis A. Jones; Mathew Domeier (2024). A Phanerozoic gridded dataset for palaeogeographic reconstructions [Dataset]. http://doi.org/10.5281/zenodo.11384745
    Explore at:
    Available download formats: zip
    Dataset updated
    May 29, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lewis A. Jones; Mathew Domeier
    License

    GNU General Public License v3.0: https://www.gnu.org/licenses/gpl-3.0-standalone.html

    Time period covered
    May 29, 2024
    Description

    This repository provides access to five pre-computed reconstruction files as well as the static polygons and rotation files used to generate them. This set of palaeogeographic reconstruction files provide palaeocoordinates for three global grids at H3 resolutions 2, 3, and 4, which have an average cell spacing of ~316 km, ~119 km, and ~45 km, respectively. Grids were reconstructed at a temporal resolution of one million years throughout the entire Phanerozoic (540–0 Ma). The reconstruction files are stored as comma-separated-value (CSV) files which can be easily read by almost any spreadsheet program (e.g. Microsoft Excel and Google Sheets) or programming language (e.g. Python, Julia, and R). In addition, R Data Serialization (RDS) files—a common format for saving R objects—are also provided as lighter (and compressed) alternatives to the CSV files. The structure of the reconstruction files follows a wide-form data frame structure to ease indexing. Each file consists of three initial index columns relating to the H3 cell index (i.e. the 'H3 address'), present-day longitude of the cell centroid, and the present-day latitude of the cell centroid. The subsequent columns provide the reconstructed longitudinal and latitudinal coordinate pairs for their respective age of reconstruction in ascending order, indicated by a numerical suffix. Each row contains a unique spatial point on the Earth's continental surface reconstructed through time. NA values within the reconstruction files indicate points which are not defined in deeper time (i.e. either the static polygon does not exist at that time, or it is outside the temporal coverage as defined by the rotation file).
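
    As an illustration of this layout, a hedged R sketch for pulling the palaeocoordinates of one age slice from one reconstruction file (the file name and exact column names here are assumptions; the real ones are defined in the files themselves):

        # Read one reconstruction table (the RDS version is lighter than the csv)
        recon <- readRDS("WR13_res3.rds")   # hypothetical file name

        # First three columns: H3 cell address, present-day longitude and latitude;
        # remaining columns hold reconstructed lon/lat pairs per age (numerical suffix)
        head(names(recon))

        # Extract the reconstructed coordinates for, e.g., 100 Ma (column names assumed)
        palaeo_100 <- recon[, c("h3_address", "lng_100", "lat_100")]
        palaeo_100 <- palaeo_100[!is.na(palaeo_100$lng_100), ]   # drop points undefined at 100 Ma
        head(palaeo_100)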

    The following five Global Plate Models are provided (abbreviation, temporal coverage, reference) within the GPMs folder:

    • WR13, 0–550 Ma, (Wright et al., 2013)
    • MA16, 0–410 Ma, (Matthews et al., 2016)
    • TC16, 0–540 Ma, (Torsvik and Cocks, 2016)
    • SC16, 0–1100 Ma, (Scotese, 2016)
    • ME21, 0–1000 Ma, (Merdith et al., 2021)

    In addition, the H3 grids for resolutions 2, 3, and 4 are provided within the grids folder. Finally, we also provide two scripts (python and R) within the code folder which can be used to generate reconstructed coordinates for user data from the reconstruction files.

    For access to the code used to generate these files:

    https://github.com/LewisAJones/PhanGrids

    For more information, please refer to the article describing the data:

    Jones, L.A. and Domeier, M.M. 2024. A Phanerozoic gridded dataset for palaeogeographic reconstructions.

    For any additional queries, contact:

    Lewis A. Jones (lewisa.jones@outlook.com) or Mathew M. Domeier (mathewd@uio.no)

    If you use these files, please cite:

    Jones, L.A. and Domeier, M.M. 2024. A Phanerozoic gridded dataset for palaeogeographic reconstructions. DOI: 10.5281/zenodo.10069221

    References

    1. Matthews, K. J., Maloney, K. T., Zahirovic, S., Williams, S. E., Seton, M., & Müller, R. D. (2016). Global plate boundary evolution and kinematics since the late Paleozoic. Global and Planetary Change, 146, 226–250. https://doi.org/10.1016/j.gloplacha.2016.10.002.
    2. Merdith, A. S., Williams, S. E., Collins, A. S., Tetley, M. G., Mulder, J. A., Blades, M. L., Young, A., Armistead, S. E., Cannon, J., Zahirovic, S., & Müller, R. D. (2021). Extending full-plate tectonic models into deep time: Linking the Neoproterozoic and the Phanerozoic. Earth-Science Reviews, 214, 103477. https://doi.org/10.1016/j.earscirev.2020.103477.
    3. Scotese, C. R. (2016). Tutorial: PALEOMAP paleoAtlas for GPlates and the paleoData plotter program: PALEOMAP Project, Technical Report.
    4. Torsvik, T. H., & Cocks, L. R. M. (2017). Earth history and palaeogeography. Cambridge University Press. https://doi.org/10.1017/9781316225523.
    5. Wright, N., Zahirovic, S., Müller, R. D., & Seton, M. (2013). Towards community-driven paleogeographic reconstructions: Integrating open-access paleogeographic and paleobiology data with plate tectonics. Biogeosciences, 10, 1529–1541. https://doi.org/10.5194/bg-10-1529-2013.
  8. Dataset for "Paleo-biome dynamics shaped a large Gondwanan plant radiation"

    • zenodo.org
    Updated Jan 27, 2025
    Cite
    Alexander Skeels; Hervé Sauquet; Austin Mast; Peter Weston; Peter Olde; Greg Jordan; Raymond Carpenter; Jessica Fenker; Zoe Reynolds; Alan Lemmon; Emily Moriarty Lemmon; Fritz Jose Pichardo Marcano; Marcel Cardillo (2025). Dataset for "Paleo-biome dynamics shaped a large Gondwanan plant radiation" [Dataset]. http://doi.org/10.5281/zenodo.14743850
    Explore at:
    Dataset updated
    Jan 27, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alexander Skeels; Hervé Sauquet; Austin Mast; Peter Weston; Peter Olde; Greg Jordan; Raymond Carpenter; Jessica Fenker; Zoe Reynolds; Alan Lemmon; Emily Moriarty Lemmon; Fritz Jose Pichardo Marcano; Marcel Cardillo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 25, 2025
    Description

    Datasets associated with Skeels et al., in review "Paleo-biome dynamics shaped a large Gondwanan plant radiation"

    Dataset S1 (separate file). Phylogeny of Grevilleoideae and selected outgroups estimated using a concatenated super-matrix of 458 genomic loci with IQ-Tree (.TREE format).

    Dataset S2 (separate file). Phylogeny of Grevilleoideae and selected outgroups estimated using a short-cut coalescent approach with ASTRAL-III based on 458 gene trees estimated across genomic loci with IQ-Tree (.TREE format).

    Dataset S3 (separate file):

    • Sheet 1. Sample IDs and herbarium accession numbers.
    • Sheet 2. Fossil calibration table including information on the fossil taxon name, locality, phylogenetic placement, age, stratigraphy, and associated references in support of each. We also estimate the best practice score (.csv format).
    • Sheet 3. Summary statistics for the amount of missing data, invariant sites, taxa sampled, parsimony informative sites, alignment length, and the coefficient of variation in the root-to-tip distance (cvr2t) derived from cleaned DNA alignments. The table also includes information on locus filtering for completeness, clock-likeness and protein coding information, and whether the locus was used in the divergence dating analysis.
    • Sheet 4. Number of outliers (“rogue taxa”) detected at each locus using the TreeShrink and PhylteR algorithms.
    • Sheet 5. Cleaned occurrence records for all species of Grevilleoideae used in this study. Data originally from the Atlas of Living Australia (ALA) and Global Biodiversity Information Facility (GBIF) and cleaned using the R package CoordinateCleaner.
    • Sheet 6. Cleaned occurrence records for all species of Grevilleoideae used in this study. Data originally from the Atlas of Living Australia (ALA) and Global Biodiversity Information Facility (GBIF) and cleaned using the R package CoordinateCleaner.
    • Sheet 7. Biome occupancy table for all Grevilleoideae species derived from occurrence records and the modified Koppen-Geiger biome classification (.csv format). Values refer to the proportion of unique cells (0.1 x 0.1 degree) in which occurrence records are found in each biome or region.
    • Sheet 8. Scaled pairwise environmental distances between biomes. Values scaled between 0 and 10. a = tropical, b = subtropical and temperate, c = Mediterranean, d = semi-arid, e = arid, f = South America, g = Madagascar, h = Cape of South Africa, i = Tropical Asia, j = New Caledonia, k = New Zealand.
    • Sheet 9. Time-stratified, pairwise geographic distances between biomes. Values scaled between 0 and 10. Times from top to bottom (20-0 Ma, 40-20 Ma, 60-40 Ma, 80-60 Ma, 100-80 Ma). Biome codes as in Sheet 8.
    • Sheet 10. Time-stratified, pairwise connectivity between biomes. Values scaled between 0 and 10. Times from top to bottom (20-0 Ma, 40-20 Ma, 60-40 Ma, 80-60 Ma, 100-80 Ma). Biome codes as in Sheet 8.
    • Sheet 11. Branch-specific estimates of diversification rate from ClaDS, geographic states from BioGeoBEARS, and predictor variables for the phylogenetic generalised linear mixed model (PGLMM), including biome area, time since first biome occupation, standing biome diversity, and biome shifting.
    • Sheet 12. The age of the first appearance of climate variables in climate space, based on a principal component analysis of mean annual temperature, mean annual precipitation, temperature seasonality, and precipitation seasonality of paleotemperature and precipitation from Valdes et al. (2019). (.xlsx format)

    Dataset S4 (separate file). Dated Phylogeny of Grevilleoideae and selected outgroups estimated using MCMCTree and the ASTRAL-III topology (.TREE format).

    Dataset S5 (separate file). Dated Phylogeny of Grevilleoideae and selected outgroups estimated using MCMCTree and the IQ-Tree topology (.TREE format).

    Dataset S6 (separate file). Modified Koppen-Geiger biomes estimated from monthly temperature and precipitation values at 20 Ma intervals from 120 Ma to 20 Ma. The interval for 20-0 is given by modified Koppen-Geiger biomes from the present-day (.ncf format). Values legend: 1 = tropical, 2 = subtropical and temperate, 3 = Mediterranean, 4 = semi-arid, 5 = arid, 6 = Polar.
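
    If useful, a hedged R sketch for inspecting such a NetCDF biome file with the terra package (the file name below is a placeholder, since the actual name of Dataset S6 is not listed here; the layer ordering depends on the file):

        library(terra)

        # Load the modified Koppen-Geiger biome layers (hypothetical file name)
        biomes <- rast("dataset_S6_paleobiomes.nc")

        # Map the integer codes to the legend given above
        biome_labels <- c("1" = "tropical", "2" = "subtropical and temperate",
                          "3" = "Mediterranean", "4" = "semi-arid",
                          "5" = "arid", "6" = "Polar")

        # Plot one time slice
        plot(biomes[[1]], main = "Modified Koppen-Geiger biomes (one 20 Ma time slice)")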

    Dataset S7 (separate file). Raw assembled DNA alignments (.fasta format) from 458 Anchored Hybrid Enrichment loci for all samples of Grevilleoideae and selected outgroups used in this study.

    Dataset S8 (separate file). Cleaned DNA alignments (.fasta format) from 458 Anchored Hybrid Enrichment loci for all samples of Grevilleoideae and selected outgroups used in this study. Cleaning used a pipeline which removed sequences or sites with high missing data or potentially erroneous or misaligned locations using TAPER.

    Dataset S9 (separate file). Gene tree phylogenies of Grevilleoideae and selected outgroups for 458 genomic loci with IQ-Tree (.TREE format).

    Dataset S10 (separate file). Dated phylogeny of Grevilleoideae pruned to species level (.TRE format).

    Dataset S11 (separate file). Stochastic maps from best-fitting dispersal-extirpation-cladogenesis (DEC) model from BioGeoBEARS software (DEC+w+n with alternated state-space based on paleobiome reconstruction) in simmap format from the R package phytools (.rds format).

    Dataset S12 (separate file). Monthly temperature and precipitation values at 20 Ma intervals from 120 Ma to 0 Ma from Valdes et al. 2019 (.nc format).

  9. S3DIS Dataset

    • paperswithcode.com
    Updated Feb 2, 2021
    Cite
    Iro Armeni; Ozan Sener; Amir R. Zamir; Helen Jiang; Ioannis Brilakis; Martin Fischer; Silvio Savarese (2021). S3DIS Dataset [Dataset]. https://paperswithcode.com/dataset/s3dis
    Explore at:
    Dataset updated
    Feb 2, 2021
    Authors
    Iro Armeni; Ozan Sener; Amir R. Zamir; Helen Jiang; Ioannis Brilakis; Martin Fischer; Silvio Savarese
    Description

    The Stanford 3D Indoor Scene Dataset (S3DIS) contains 6 large-scale indoor areas with 271 rooms. Each point in the scene point cloud is annotated with one of 13 semantic categories.

  10. ABC Dataset

    • paperswithcode.com
    Updated Feb 3, 2022
    Cite
    Sebastian Koch; Albert Matveev; Zhongshi Jiang; Francis Williams; Alexey Artemov; Evgeny Burnaev; Marc Alexa; Denis Zorin; Daniele Panozzo (2022). ABC Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/abc-dataset-1
    Explore at:
    Dataset updated
    Feb 3, 2022
    Authors
    Sebastian Koch; Albert Matveev; Zhongshi Jiang; Francis Williams; Alexey Artemov; Evgeny Burnaev; Marc Alexa; Denis Zorin; Daniele Panozzo
    Description

    The ABC Dataset is a collection of one million Computer-Aided Design (CAD) models for research of geometric deep learning methods and applications. Each model is a collection of explicitly parametrized curves and surfaces, providing ground truth for differential quantities, patch segmentation, geometric feature detection, and shape reconstruction. Sampling the parametric descriptions of surfaces and curves allows generating data in different formats and resolutions, enabling fair comparisons for a wide range of geometric learning algorithms.

  11. Personal Protective Equipment Dataset (PPED)

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 17, 2022
    Cite
    Anonymous (2022). Personal Protective Equipment Dataset (PPED) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6551757
    Explore at:
    Dataset updated
    May 17, 2022
    Dataset authored and provided by
    Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Personal Protective Equipment Dataset (PPED)

    This dataset serves as a benchmark for PPE in chemical plants. We provide datasets and experimental results.

    1. The dataset

    We produced a data set based on the actual needs and relevant regulations in chemical plants. The standard GB 39800.1-2020 formulated by the Ministry of Emergency Management of the People’s Republic of China defines the protective requirements for plants and chemical laboratories. The complete dataset is contained in the folder PPED/data.

    1.1. Image collection

    We took more than 3,300 pictures, varying the following characteristics: environment, distance, lighting conditions, angle, and the number of people photographed.

    Backgrounds: There are 4 backgrounds, including office, near machines, factory and regular outdoor scenes.

    Scale: By taking pictures from different distances, the captured PPEs are classified in small, medium and large scales.

    Light: Good lighting conditions and poor lighting conditions were studied.

    Diversity: Some images contain a single person, and some contain multiple people.

    Angle: The pictures we took can be divided into front and side.

    A total of more than 3300 photos were taken in the raw data under all conditions. All images are located in the folder “PPED/data/JPEGImages”.

    1.2. Label

    We use LabelImg as the labeling tool, with the PASCAL VOC annotation format. YOLO uses the txt format; trans_voc2yolo.py can be used to convert the XML files in PASCAL VOC format to txt files. Annotations are stored in the folder PPED/data/Annotations.

    1.3. Dataset Features

    The pictures were taken by us under the different conditions mentioned above. The file PPED/data/feature.csv is a CSV file which records the features of every image, including lighting conditions, angle, background, number of people and scale.
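
    A small R sketch of how feature.csv could be used to summarise the image conditions (the column names used here are assumptions; the actual header is defined in the file):

        # Read the per-image feature table
        features <- read.csv("PPED/data/feature.csv", stringsAsFactors = FALSE)

        # Cross-tabulate assumed condition columns, e.g. lighting vs. scale,
        # to see how the ~3,300 images are distributed over the conditions
        table(features$light, features$scale)

        # Select image names for one condition, e.g. poor lighting and small scale
        subset(features, light == "poor" & scale == "small")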

    1.4. Dataset Division

    The dataset is divided into training and test sets at a ratio of 9:1.

    2. Baseline Experiments

    We provide baseline results with five models, namely Faster R-CNN (R), Faster R-CNN (M), SSD, YOLOv3-spp, and YOLOv5. All code and results are given in folder PPED/experiment.

    2.1. Environment and Configuration:

    Intel Core i7-8700 CPU

    NVIDIA GTX1060 GPU

    16 GB of RAM

    Python: 3.8.10

    pytorch: 1.9.0

    pycocotools: pycocotools-win

    Windows 10

    2.2. Applied Models

    The source codes and results of the applied models is given in folder PPED/experiment with sub-folders corresponding to the model names.

    2.2.1. Faster R-CNN

    Faster R-CNN

    backbone: resnet50+fpn

    We downloaded the pre-training weights from https://download.pytorch.org/models/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth.

    We modified the dataset path, training classes and training parameters including batch size.

    We run train_res50_fpn.py to start training.

    Then, the weights are trained by the training set.

    Finally, we validate the results on the test set.

    backbone: mobilenetv2

    The same training method as for resnet50+fpn was applied, but the results were not as good as with resnet50+fpn, so this backbone was discarded.

    The Faster R-CNN source code used in our experiment is given in folder PPED/experiment/Faster R-CNN. The weights of the fully-trained Faster R-CNN (R) and Faster R-CNN (M) models are stored in the files PPED/experiment/trained_models/resNetFpn-model-19.pth and mobile-model.pth. The performance measurements of Faster R-CNN (R) and Faster R-CNN (M) are stored in the folders PPED/experiment/results/Faster RCNN(R) and Faster RCNN(M).

    2.2.2. SSD

    backbone: resnet50

    We downloaded pre-training weights from https://download.pytorch.org/models/resnet50-19c8e357.pth.

    The same training method as Faster R-CNN is applied.

    The SSD source code used in our experiment is given in folder PPED/experiment/ssd. The weights of the fully-trained SSD model are stored in file PPED/experiment/trained_models/SSD_19.pth. The performance measurements of SSD are stored in folder PPED/experiment/results/SSD.

    2.2.3. YOLOv3-spp

    backbone: DarkNet53

    We modified the type information of the XML file to match our application.

    We run trans_voc2yolo.py to convert the XML file in VOC format to a txt file.

    The weights used are: yolov3-spp-ultralytics-608.pt.

    The YOLOv3-spp source code used in our experiment is given in folder PPED/experiment/YOLOv3-spp. The weights of the fully-trained YOLOv3-spp model are stored in file PPED/experiment/trained_models/YOLOvspp-19.pt. The performance measurements of YOLOv3-spp are stored in folder PPED/experiment/results/YOLOv3-spp.

    2.2.4. YOLOv5

    backbone: CSP_DarkNet

    We modified the type information of the XML file to match our application.

    We run trans_voc2yolo.py to convert the XML file in VOC format to a txt file.

    The weights used are: yolov5s.

    The YOLOv5 source code used in our experiment is given in folder PPED/experiment/yolov5. The weights of the fully-trained YOLOv5 model are stored in file PPED/experiment/trained_models/YOLOv5.pt. The performance measurements of YOLOv5 are stored in folder PPED/experiment/results/YOLOv5.

    2.3. Evaluation

    The computed evaluation metrics as well as the code needed to compute them from our dataset are provided in the folder PPED/experiment/eval.

    3. Code Sources

    Faster R-CNN (R and M)

    https://github.com/WZMIAOMIAO/deep-learning-for-image-processing/tree/master/pytorch_object_detection/faster_rcnn

    official code: https://github.com/pytorch/vision/blob/main/torchvision/models/detection/faster_rcnn.py

    SSD

    https://github.com/WZMIAOMIAO/deep-learning-for-image-processing/tree/master/pytorch_object_detection/ssd

    official code: https://github.com/pytorch/vision/blob/main/torchvision/models/detection/ssd.py

    YOLOv3-spp

    https://github.com/WZMIAOMIAO/deep-learning-for-image-processing/tree/master/pytorch_object_detection/yolov3-spp

    YOLOv5

    https://github.com/ultralytics/yolov5

  12. Data from: Earth surface evolution: a Phanerozoic gridded dataset of Global...

    • zenodo.org
    zip
    Updated Nov 3, 2023
    Cite
    Lewis A. Jones; Mathew Domeier (2023). Earth surface evolution: a Phanerozoic gridded dataset of Global Plate Model reconstructions [Dataset]. http://doi.org/10.5281/zenodo.10069222
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 3, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lewis A. Jones; Mathew Domeier
    License

    GNU General Public License v3.0: https://www.gnu.org/licenses/gpl-3.0-standalone.html

    Area covered
    Earth
    Description

    This repository provides access to five reconstruction files as well as the code and the static polygons and rotation files used to generate them. This set of palaeogeographic reconstruction files provide palaeocoordinates for three global grids at H3 resolutions 2, 3, and 4, which have an average cell spacing of ~316 km, ~119 km, and ~45 km. Grids were reconstructed at a temporal resolution of one million years throughout the entire Phanerozoic (540–0 Ma). The reconstruction files are stored as comma-separated-value (CSV) files which can be easily read by almost any spreadsheet program (e.g. Microsoft Excel and Google Sheets) or programming language (e.g. Python, Julia, and R). In addition, R Data Serialization (RDS) files—a common format for saving R objects—are also provided as lighter (and compressed) alternatives to the CSV files. The structure of the reconstruction files follows a wide-form data frame structure to ease indexing. Each file consists of three initial index columns relating to the H3 cell index (i.e. the 'H3 address'), present-day longitude of the cell centroid, and the present-day latitude of the cell centroid. The subsequent columns provide the reconstructed longitudinal and latitudinal coordinate pairs for their respective age of reconstruction in ascending order, indicated by a numerical suffix. Each row contains a unique spatial point on the Earth's continental surface reconstructed through time. NA values within the reconstruction files indicate points which are not defined in deeper time (i.e. either the static polygon does not exist at that time, or it is outside the temporal coverage as defined by the rotation file).

    The following five Global Plate Models are provided (abbreviation, temporal coverage, reference):

    • WR13, 0–550 Ma, (Wright et al., 2013)
    • MA16, 0–410 Ma, (Matthews et al., 2016)
    • TC16, 0–540 Ma, (Torsvik and Cocks, 2016)
    • SC16, 0–1100 Ma, (Scotese, 2016)
    • ME21, 0–1000 Ma, (Merdith et al., 2021)

    In addition, the H3 grids for resolutions 2, 3, and 4 are provided.

    For more information, please refer to the article describing the data:

    Jones, L.A. and Domeier, M.M. 2023. Earth surface evolution: a Phanerozoic gridded dataset of Global Plate Model reconstructions. (TBC).

    For any additional queries, contact:

    Mathew M. Domeier (mathewd@uio.no) or Lewis A. Jones (lewisa.jones@outlook.com)

    If you use these files, please cite:

    Jones, L.A. and Domeier, M.M. 2023. Earth surface evolution: a Phanerozoic gridded dataset of Global Plate Model reconstructions. Zenodo data repository. DOI:10.5281/zenodo.10069222

  13. CODE dataset

    • figshare.scilifelab.se
    • researchdata.se
    Updated Feb 27, 2025
    Cite
    Antonio H. Ribeiro; Manoel Horta Ribeiro; Gabriela M. Paixão; Derick M. Oliveira; Paulo R. Gomes; Jéssica A. Canazart; Milton P. Ferreira; Carl R. Andersson; Peter W. Macfarlane; Wagner Meira Jr.; Thomas B. Schön; Antonio Luiz P. Ribeiro (2025). CODE dataset [Dataset]. http://doi.org/10.17044/scilifelab.15169716.v1
    Explore at:
    Dataset updated
    Feb 27, 2025
    Dataset provided by
    Uppsala University & UFMG
    Authors
    Antonio H. Ribeiro; Manoel Horta Ribeiro; Gabriela M. Paixão; Derick M. Oliveira; Paulo R. Gomes; Jéssica A. Canazart; Milton P. Ferreira; Carl R. Andersson; Peter W. Macfarlane; Wagner Meira Jr.; Thomas B. Schön; Antonio Luiz P. Ribeiro
    License

    Restricted access: https://www.scilifelab.se/data/restricted-access/

    Description

    Dataset with annotated 12-lead ECG records. The exams were taken in 811 counties in the state of Minas Gerais/Brazil by the Telehealth Network of Minas Gerais (TNMG) between 2010 and 2016, and organized by the CODE (Clinical Outcomes in Digital Electrocardiography) group.

    Requesting access

    Researchers affiliated with educational or research institutions may request access to this dataset. Requests will be analyzed on an individual basis and should contain: name of PI and host organisation; contact details (including your name and email); and the scientific purpose of the data access request. If approved, a data user agreement will be forwarded to the researcher who made the request (through the email that was provided). After the agreement has been signed (by the researcher or by the research institution), access to the dataset will be granted.

    Openly available subset

    A subset of this dataset (with 15% of the patients) is openly available. See: "CODE-15%: a large scale annotated dataset of 12-lead ECGs", https://doi.org/10.5281/zenodo.4916206.

    Content

    The folder contains: a column-separated file containing basic patient attributes, and the ECG waveforms in the wfdb format.

    Additional references

    The dataset is described in the paper "Automatic diagnosis of the 12-lead ECG using a deep neural network", https://www.nature.com/articles/s41467-020-15432-4. Related publications also using this dataset are:

    • [1] G. Paixao et al., “Validation of a Deep Neural Network Electrocardiographic-Age as a Mortality Predictor: The CODE Study,” Circulation, vol. 142, no. Suppl_3, pp. A16883–A16883, Nov. 2020, doi: 10.1161/circ.142.suppl_3.16883.
    • [2] A. L. P. Ribeiro et al., “Tele-electrocardiography and big data: The CODE (Clinical Outcomes in Digital Electrocardiography) study,” Journal of Electrocardiology, Sep. 2019, doi: 10/gf7pwg.
    • [3] D. M. Oliveira, A. H. Ribeiro, J. A. O. Pedrosa, G. M. M. Paixao, A. L. P. Ribeiro, and W. Meira Jr, “Explaining end-to-end ECG automated diagnosis using contextual features,” in Machine Learning and Knowledge Discovery in Databases (ECML-PKDD), Ghent, Belgium, Sep. 2020, vol. 12461, pp. 204-219, doi: 10.1007/978-3-030-67670-4_13.
    • [4] D. M. Oliveira, A. H. Ribeiro, J. A. O. Pedrosa, G. M. M. Paixao, A. L. Ribeiro, and W. Meira Jr, “Explaining black-box automated electrocardiogram classification to cardiologists,” in 2020 Computing in Cardiology (CinC), 2020, vol. 47, doi: 10.22489/CinC.2020.452.
    • [5] G. M. M. Paixão et al., “Evaluation of mortality in bundle branch block patients from an electronic cohort: Clinical Outcomes in Digital Electrocardiography (CODE) study,” Journal of Electrocardiology, Sep. 2019, doi: 10/dcgk.
    • [6] G. M. M. Paixão et al., “Evaluation of Mortality in Atrial Fibrillation: Clinical Outcomes in Digital Electrocardiography (CODE) Study,” Global Heart, vol. 15, no. 1, p. 48, Jul. 2020, doi: 10.5334/gh.772.
    • [7] G. M. M. Paixão et al., “Electrocardiographic Predictors of Mortality: Data from a Primary Care Tele-Electrocardiography Cohort of Brazilian Patients,” Hearts, vol. 2, no. 4, Dec. 2021, doi: 10.3390/hearts2040035.
    • [8] G. M. Paixão et al., “ECG-age from artificial intelligence: a new predictor for mortality? The CODE (Clinical Outcomes in Digital Electrocardiography) study,” Journal of the American College of Cardiology, vol. 75, no. 11, Supplement 1, p. 3672, 2020, doi: 10.1016/S0735-1097(20)34299-6.
    • [9] E. M. Lima et al., “Deep neural network estimated electrocardiographic-age as a mortality predictor,” Nature Communications, vol. 12, 2021, doi: 10.1038/s41467-021-25351-7.
    • [10] W. Meira Jr, A. L. P. Ribeiro, D. M. Oliveira, and A. H. Ribeiro, “Contextualized Interpretable Machine Learning for Medical Diagnosis,” Communications of the ACM, 2020, doi: 10.1145/3416965.
    • [11] A. H. Ribeiro et al., “Automatic diagnosis of the 12-lead ECG using a deep neural network,” Nature Communications, vol. 11, no. 1, p. 1760, 2020, doi: 10/drkd.
    • [12] A. H. Ribeiro et al., “Automatic Diagnosis of Short-Duration 12-Lead ECG using a Deep Convolutional Network,” Machine Learning for Health (ML4H) Workshop at NeurIPS, 2018.
    • [13] A. H. Ribeiro et al., “Automatic 12-lead ECG classification using a convolutional network ensemble,” 2020, doi: 10.22489/CinC.2020.130.
    • [14] V. Sangha et al., “Automated Multilabel Diagnosis on Electrocardiographic Images and Signals,” medRxiv, Sep. 2021, doi: 10.1101/2021.09.22.21263926.
    • [15] S. Biton et al., “Atrial fibrillation risk prediction from the 12-lead ECG using digital biomarkers and deep representation learning,” European Heart Journal - Digital Health, 2021, doi: 10.1093/ehjdh/ztab071.

    Code

    The following GitHub repositories perform analyses that use this dataset:

    • https://github.com/antonior92/automatic-ecg-diagnosis
    • https://github.com/antonior92/ecg-age-prediction

    Related datasets

    • CODE-test: An annotated 12-lead ECG dataset (https://doi.org/10.5281/zenodo.3765780)
    • CODE-15%: a large scale annotated dataset of 12-lead ECGs (https://doi.org/10.5281/zenodo.4916206)
    • Sami-Trop: 12-lead ECG traces with age and mortality annotations (https://doi.org/10.5281/zenodo.4905618)

    Ethics declarations

    The CODE Study was approved by the Research Ethics Committee of the Universidade Federal de Minas Gerais, protocol 49368496317.7.0000.5149.

  14. Reddit Data Science Community Conversations

    • opendatabay.com
    Updated Jul 6, 2025
    Cite
    Datasimple (2025). Reddit Data Science Community Conversations [Dataset]. https://www.opendatabay.com/data/ai-ml/a27d0e5e-f087-4294-ba4d-f03598447dda
    Explore at:
    Dataset updated
    Jul 6, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Social Media and Networking
    Description

    This dataset contains posts and comments extracted from the r/datascience subreddit, a highly active discussion forum on Reddit with over 600,000 contributors. It offers valuable insights into the conversations and trends within the data science community, providing raw material for various analytical endeavours. The content is directly generated by the subreddit's contributors, reflecting authentic community engagement.

    Columns

    • title: The textual title of a Reddit post.
    • score: The score or upvote count for a post or comment, indicating its popularity or agreement.
    • id: A unique identifier assigned to each post or comment.
    • url: The web address for the Reddit post or an associated external link.
    • comms_num: The total number of comments associated with a specific post.
    • created: The Unix timestamp indicating when the post or comment was created.
    • body: The main textual content of a Reddit post or comment.
    • timestamp: Another timestamp field, likely similar to 'created', marking the time of creation.
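
    For illustration, a minimal sketch of loading the table and decoding the 'created' column (assuming pandas; the CSV file name is hypothetical):

    # Minimal sketch: load the posts/comments table and parse the Unix timestamp.
    # The file name is hypothetical.
    import pandas as pd

    df = pd.read_csv("r_datascience_posts.csv")
    # 'created' is a Unix timestamp; convert it to a timezone-aware datetime
    df["created_dt"] = pd.to_datetime(df["created"], unit="s", utc=True)
    print(df[["title", "score", "comms_num", "created_dt"]].head())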

    Distribution

    The dataset is typically provided in CSV format.

    • Score Distribution: Scores vary significantly, ranging from -91 to 2952. A large proportion of entries, specifically 20,526, fall within the -91.00 to 61.15 score range; another view indicates 20,762 entries in the 0.00 to 31.75 score range. There are 21,095 unique score values.
    • Time Coverage Distribution: The data covers the period from December 9, 2021, to April 22, 2022, with 20,573 unique timestamp values. Activity peaks in late March 2022, with up to 2,830 entries in a single week.

    Usage

    This dataset is ideal for:

    • Analysing discussion topics prevalent within the r/datascience subreddit.
    • Understanding the tone of conversations among data science professionals and enthusiasts.
    • Identifying the dominant sentiment expressed in posts and comments.
    • Exploring the lexical particularities unique to the data science community's discussions.
    • Tracking trends and shifts in popular topics and opinions over time.

    Coverage

    The dataset offers global coverage regarding the community discussions. It spans a distinct time range from December 9, 2021, to April 22, 2022. The content reflects the diverse perspectives of over 600,000 contributors to the r/datascience subreddit, providing a wide demographic scope of individuals interested in data science.

    License

    CC0

    Who Can Use It

    • Data scientists and machine learning engineers for natural language processing (NLP) tasks such as topic modeling, sentiment analysis, or text classification.
    • Social media analysts and researchers studying online community behaviour, trends, and user engagement patterns.
    • Linguists and computational linguists examining the specific language usage within professional online forums.
    • Academic researchers interested in the evolution of discussions within the data science field.

    Dataset Name Suggestions

    • Reddit Data Science Community Conversations
    • r/datascience Subreddit Activity Log
    • Data Science Forum Discussions Archive
    • Reddit Data Science Posts and Comments

    Attributes

    Original Data Source: Data Science on Reddit

  15. IP Australia - [Superseded] Intellectual Property Government Open Data 2019

    • gimi9.com
    Updated Jul 20, 2018
    Cite
    (2018). IP Australia - [Superseded] Intellectual Property Government Open Data 2019 | gimi9.com [Dataset]. https://gimi9.com/dataset/au_intellectual-property-government-open-data-2019
    Explore at:
    Dataset updated
    Jul 20, 2018
    Area covered
    Australia
    Description

    What is IPGOD?

    The Intellectual Property Government Open Data (IPGOD) includes over 100 years of registry data on all intellectual property (IP) rights administered by IP Australia. It also has derived information about the applicants who filed these IP rights, to allow for research and analysis at the regional, business and individual level. This is the 2019 release of IPGOD.

    How do I use IPGOD?

    IPGOD is large, with millions of data points across up to 40 tables, making it too large to open with Microsoft Excel. Furthermore, analysis often requires information from separate tables, which needs specialised software for merging. We recommend that advanced users interact with the IPGOD data using tools with enough memory and compute power, such as Tableau, Power BI, Stata, SAS, R, Python, or Scala; a sketch of a typical table merge follows this description.

    IP Data Platform

    IP Australia is also providing free trials of a cloud-based analytics platform with the capabilities to enable working with large intellectual property datasets, such as IPGOD, through the web browser, without any installation of software.

    References

    The following pages can help you gain an understanding of intellectual property administration and processes in Australia to support your analysis of the dataset:

    • Patents
    • Trade Marks
    • Designs
    • Plant Breeder’s Rights

    Updates

    Tables and columns: Due to changes in our systems, some tables have been affected.

    • We have added IPGOD 225 and IPGOD 325 to the dataset!
    • The IPGOD 206 table is not available this year.
    • Many tables have been re-built, and as a result may have different columns or different possible values. Please check the data dictionary for each table before use.

    Data quality improvements: Data quality has been improved across all tables.

    • Null values are simply empty rather than '31/12/9999'.
    • All date columns are now in ISO format 'yyyy-mm-dd'.
    • All indicator columns have been converted to Boolean data type (True/False) rather than Yes/No, Y/N, or 1/0.
    • All tables are encoded in UTF-8.
    • All tables use the backslash \ as the escape character.
    • The applicant name cleaning and matching algorithms have been updated. We believe that this year's method improves the accuracy of the matches. Please note that the "ipa_id" generated in IPGOD 2019 will not match those in previous releases of IPGOD.
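
    As flagged above, a minimal sketch of such a table merge (assuming pandas; the file names and the join key are hypothetical, so consult the data dictionary for the real schemas):

    # Minimal sketch: join two IPGOD tables on a shared application key.
    # Table file names and the key column are hypothetical.
    import pandas as pd

    applications = pd.read_csv("ipgod101.csv")  # e.g. one row per application
    applicants = pd.read_csv("ipgod102.csv")    # e.g. applicant details

    merged = applications.merge(applicants, on="australian_appl_no", how="left")
    print(merged.shape)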

  16. Smooth numbers in large prime gaps

    • zenodo.org
    application/gzip
    Updated Feb 24, 2023
    Cite
    Robert M. Guralnick; John Shareshian; Russ Woodroofe (2023). Smooth numbers in large prime gaps [Dataset]. http://doi.org/10.5281/zenodo.5914768
    Explore at:
    application/gzip (available download formats)
    Dataset updated
    Feb 24, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Robert M. Guralnick; John Shareshian; Russ Woodroofe
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    This dataset contains numbers from 25 up to 1 quadrillion (10^15) that are smooth relative to the gap to the preceding prime. More precisely, we list all numbers n so that

    r + p^a ≤ n

    where r is the largest prime smaller than n - 1, and p^a is the largest prime-power divisor of n. The dataset is the result of a 10-day computation using 15 cores on an Intel Xeon system, running code hosted at GitHub (see "Related identifiers"). The GitHub code checks additional conditions when r is n - 2 and n - 1 is a power of 2, but it is easy and quick to check that when (up to 10^15) n = 2^k + 1, the second largest prime r_2 satisfies r_2 + p^a > n. Thus, this additional check makes no difference in the output.

    Our motivations for computing this data are described in our paper On invariable generation of alternating groups by elements of prime and prime power order (arXiv:2201.12371). Any number n in the range which is not of the given form has the associated alternating group A_n generated by any element of order r together with any element having a certain cycle structure (and of order p^a).

    Description / specification

    The data is stored as compressed text-based input to a computer algebra system, specifically in gzipped GAP format. The file out-k.g.gz holds numbers in the range from (k - 1)⋅10^12 to k⋅10^12. The first line of each file sets the variable invgen_oversmooth_range to be the range (thus, [(k - 1)⋅10^12 .. k⋅10^12]). The subsequent lines set invgen_oversmooth to a list of pairs of numbers [n, p^a], where n is a smooth number as described above, and p^a is the largest prime-power divisor of n. The largest prime preceding n - 1 is given in a GAP comment.

    Thus, the first few lines of out-0.g.gz (when uncompressed) appear as

    invgen_oversmooth_range:=[25..1000000000000];
    invgen_oversmooth := [
     [ 30, 5 ], # bp 23 
     [ 60, 5 ], # bp 53 
     [ 126, 9 ], # bp 113 
     [ 210, 7 ], # bp 199 
     [ 252, 9 ], # bp 241 
     [ 308, 11 ], # bp 293 
     [ 330, 11 ], # bp 317 
     [ 420, 7 ], # bp 409 
    ...

    where [25 .. 1000000000000] is the range considered, and for example "[ 30, 5 ], # bp 23" represents that 23 is the largest prime preceding 30 - 1, 5 is the largest prime-power divisor of 30, and 23 + 5 ≤ 30.
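
    For a concrete check of the defining condition, here is a minimal sketch (assuming the sympy package) that recomputes the first few entries; the dataset itself was produced by the GitHub code cited above, not by this snippet.

    # Minimal sketch: re-derive small entries from the condition r + p^a <= n.
    from sympy import prevprime, factorint

    def largest_prime_power_divisor(n):
        # e.g. factorint(126) == {2: 1, 3: 2, 7: 1} -> max(2, 9, 7) == 9
        return max(p ** a for p, a in factorint(n).items())

    def oversmooth(limit):
        for n in range(25, limit + 1):
            r = prevprime(n - 1)   # largest prime strictly below n - 1
            pa = largest_prime_power_divisor(n)
            if r + pa <= n:
                yield n, pa, r

    # Matches the first lines of out-0.g.gz:
    # (30, 5, 23), (60, 5, 53), (126, 9, 113), (210, 7, 199), ...
    print(list(oversmooth(350)))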

    We created the data in GAP files for ease of input into a GAP program in our own use of the data. It is easy to convert the GAP files to another format via standard techniques such as regular-expression search and replace. For example, on macOS or Linux, the following command will convert the list in out-0.g.gz to a CSV file, which it will display on the terminal.

    zcat out_quadrillion/out-0.g.gz | sed -En 's/ \[ ([0-9]+), ([0-9]+) \], # bp ([0-9]+)/\1,\2,\3/gp' | less

  17. Global map of tree density

    • figshare.com
    zip
    Updated May 31, 2023
    Cite
    Crowther, T. W.; Glick, H. B.; Covey, K. R.; Bettigole, C.; Maynard, D. S.; Thomas, S. M.; Smith, J. R.; Hintler, G.; Duguid, M. C.; Amatulli, G.; Tuanmu, M. N.; Jetz, W.; Salas, C.; Stam, C.; Piotto, D.; Tavani, R.; Green, S.; Bruce, G.; Williams, S. J.; Wiser, S. K.; Huber, M. O.; Hengeveld, G. M.; Nabuurs, G. J.; Tikhonova, E.; Borchardt, P.; Li, C. F.; Powrie, L. W.; Fischer, M.; Hemp, A.; Homeier, J.; Cho, P.; Vibrans, A. C.; Umunay, P. M.; Piao, S. L.; Rowe, C. W.; Ashton, M. S.; Crane, P. R.; Bradford, M. A. (2023). Global map of tree density [Dataset]. http://doi.org/10.6084/m9.figshare.3179986.v2
    Explore at:
    zip (available download formats)
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Crowther, T. W.; Glick, H. B.; Covey, K. R.; Bettigole, C.; Maynard, D. S.; Thomas, S. M.; Smith, J. R.; Hintler, G.; Duguid, M. C.; Amatulli, G.; Tuanmu, M. N.; Jetz, W.; Salas, C.; Stam, C.; Piotto, D.; Tavani, R.; Green, S.; Bruce, G.; Williams, S. J.; Wiser, S. K.; Huber, M. O.; Hengeveld, G. M.; Nabuurs, G. J.; Tikhonova, E.; Borchardt, P.; Li, C. F.; Powrie, L. W.; Fischer, M.; Hemp, A.; Homeier, J.; Cho, P.; Vibrans, A. C.; Umunay, P. M.; Piao, S. L.; Rowe, C. W.; Ashton, M. S.; Crane, P. R.; Bradford, M. A.
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Crowther_Nature_Files.zip

    This description pertains to the original download. Details on revised (newer) versions of the datasets are listed below. When more than one version of a file exists in Figshare, the original DOI will take users to the latest version, though each version technically has its own DOI.

    Two global maps (raster files) of tree density. These maps highlight how the number of trees varies across the world. One map was generated using biome-level models of tree density, and applied at the biome scale. The other map was generated using ecoregion-level models of tree density, and applied at the ecoregion scale. For this reason, transitions between biomes or between ecoregions may be unrealistically harsh, but large-scale estimates are robust (see Crowther et al 2015 and Glick et al 2016). At the outset, this study was intended to generate reliable estimates at broad spatial scales, which inherently comes at the cost of fine-scale precision. For this reason, country-scale (or larger) estimates are generally more robust than individual pixel-level estimates. Additionally, due to data limitations, estimates for Mangroves and Tropical coniferous forest (as identified by WWF and TNC) were generated using models constructed from Tropical moist broadleaf forest data and Temperate coniferous forest data, respectively. Because we used ecological analogy, the estimates for these two biomes should be considered less reliable than those of other biomes. These two maps initially appeared in Crowther et al (2015), with the biome map being featured more prominently. Explicit publication of the data is associated with Glick et al (2016). As they are produced, updated versions of these datasets, as well as alternative formats, will be made available under Additional Versions (see below).

    Methods: We collected over 420,000 ground-sourced estimates of tree density from around the world. We then constructed linear regression models using vegetative, climatic, topographic, and anthropogenic variables to produce forest tree density estimates for all locations globally. All modeling was done in R. Mapping was done using R and ArcGIS 10.1.

    Viewing Instructions: Load the files into an appropriate geographic information system (GIS). For the original download (ArcGIS geodatabase files), load the files into ArcGIS to view or export the data to other formats. Because these datasets are large and have a unique coordinate system that is not read by many GIS, we suggest loading them into an ArcGIS dataframe whose coordinate system matches that of the data (see File Format). For GeoTiff files (see Additional Versions), load them into any compatible GIS or image management program.

    Comments: The original download provides a zipped folder that contains (1) an ArcGIS File Geodatabase (.gdb) containing one raster file for each of the two global models of tree density – one based on biomes and one based on ecoregions; (2) a layer file (.lyr) for each of the global models with the symbology used for each respective model in Crowther et al (2015); and (3) an ArcGIS Map Document (.mxd) that contains the layers and symbology for each map in the paper. The data is delivered in the Goode homolosine interrupted projected coordinate system that was used to compute biome, ecoregion, and global estimates of the number and density of trees presented in Crowther et al (2015). To obtain maps like those presented in the official publication, raster files will need to be reprojected to the Eckert III projected coordinate system. Details on subsequent revisions and alternative file formats are listed below under Additional Versions.

    Additional Versions: Crowther_Nature_Files_Revision_01.zip contains tree density predictions for small islands that are not included in the data available in the original dataset. These predictions were not taken into consideration in production of maps and figures presented in Crowther et al (2015), with the exception of the values presented in Supplemental Table 2. The file structure follows that of the original data and includes both biome- and ecoregion-level models.

    Crowther_Nature_Files_Revision_01_WGS84_GeoTiff.zip contains Revision_01 of the biome-level model, but stored in WGS84 and GeoTiff format. This file was produced by reprojecting the original Goode homolosine files to WGS84 using nearest neighbor resampling in ArcMap. All areal computations presented in the manuscript were computed using the Goode homolosine projection. This means that comparable computations made with projected versions of this WGS84 data are likely to differ (substantially at greater latitudes) as a product of the resampling. Included in this .zip file are the primary .tif and its visualization support files.
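
    For the GeoTiff versions, a minimal sketch of inspecting a raster outside a desktop GIS (assuming the rasterio package; the file name is hypothetical):

    # Minimal sketch: open one of the GeoTiff rasters and read band 1.
    # The file name is hypothetical.
    import rasterio

    with rasterio.open("tree_density_biome_wgs84.tif") as src:
        print(src.crs, src.width, src.height, src.bounds)
        density = src.read(1)  # band 1: estimated tree density per pixel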

    References:

    Crowther, T. W., Glick, H. B., Covey, K. R., Bettigole, C., Maynard, D. S., Thomas, S. M., Smith, J. R., Hintler, G., Duguid, M. C., Amatulli, G., Tuanmu, M. N., Jetz, W., Salas, C., Stam, C., Piotto, D., Tavani, R., Green, S., Bruce, G., Williams, S. J., Wiser, S. K., Huber, M. O., Hengeveld, G. M., Nabuurs, G. J., Tikhonova, E., Borchardt, P., Li, C. F., Powrie, L. W., Fischer, M., Hemp, A., Homeier, J., Cho, P., Vibrans, A. C., Umunay, P. M., Piao, S. L., Rowe, C. W., Ashton, M. S., Crane, P. R., and Bradford, M. A. 2015. Mapping tree density at a global scale. Nature, 525(7568): 201-205. DOI: http://doi.org/10.1038/nature14967

    Glick, H. B., Bettigole, C. B., Maynard, D. S., Covey, K. R., Smith, J. R., and Crowther, T. W. 2016. Spatially explicit models of global tree density. Scientific Data, 3(160069). DOI: http://doi.org/10.1038/sdata.2016.69


Cite
Skadi Loist; Evgenia (Zhenya) Samoilova (2024). Film Circulation dataset [Dataset]. http://doi.org/10.5281/zenodo.7887672

Film Circulation dataset

Explore at:
3 scholarly articles cite this dataset (View in Google Scholar)
csv, png, bin (available download formats)
Dataset updated
Jul 12, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Skadi Loist; Evgenia (Zhenya) Samoilova
License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

A peer-reviewed data paper for this dataset is in review to be published in NECSUS_European Journal of Media Studies - an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org

Please cite this when using the dataset.


Detailed description of the dataset:

1 Film Dataset: Festival Programs

The Film Dataset consists a data scheme image file, a codebook and two dataset tables in csv format.

The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.


2 Survey Dataset

The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.


3 IMDb & Scripts

The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, TV, DVD/Blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

The dataset includes 8 text files containing the script for webscraping. They were written using the R-3.6.3 version for Windows.

The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.

The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then, if no matches are found, using an alternative title and a basic search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records on the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach based on two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA (optimal string alignment) algorithm is used to match titles that may have typos or minor variations.
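
For illustration only, here is a minimal Python sketch of the two string measures named above (the project's own scripts are in R; this is not their code):

    # Minimal sketch: character-bigram cosine similarity and OSA distance.
    from collections import Counter
    from math import sqrt

    def cosine_similarity(s, t, n=2):
        # Cosine similarity between character n-gram count vectors.
        def grams(x):
            return Counter(x[i:i + n] for i in range(len(x) - n + 1))
        a, b = grams(s.lower()), grams(t.lower())
        dot = sum(a[g] * b[g] for g in a)
        norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def osa_distance(s, t):
        # Optimal string alignment (restricted Damerau-Levenshtein) distance.
        m, n = len(s), len(t)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if s[i - 1] == t[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + cost)   # substitution
                if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
        return d[m][n]

    print(cosine_similarity("The Farewell", "Farewell, The"))  # about 0.78
    print(osa_distance("Parasite", "Parastie"))                # 1 (one transposition)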

The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of five categories: a) 100% match, i.e. a perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.

The script “r_4_scraping_functions” creates functions for scraping the data from the identified matches (based on the scripts described above and the manual checks). These functions are used for scraping the data in the next script.

The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” in order to scrape the IMDb data for the identified matches. It does this for the first 100 films only, to check that everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.

The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

The script “r_check_logs” is used for troubleshooting and for tracking the progress of all of the R scripts used. It gives information on the number of missing values and errors.


4 Festival Library Dataset

The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories,
