53 datasets found
  1. Film Circulation dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 12, 2024
    Cite
    Loist, Skadi; Samoilova, Evgenia (Zhenya) (2024). Film Circulation dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7887671
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Film University Babelsberg KONRAD WOLF
    Authors
    Loist, Skadi; Samoilova, Evgenia (Zhenya)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

    A peer-reviewed data paper for this dataset is under review for publication in NECSUS – European Journal of Media Studies, an open-access journal that aims to enhance data transparency and reusability; it will be available from https://necsus-ejms.org/ and https://mediarep.org

    Please cite this when using the dataset.

    Detailed description of the dataset:

    1 Film Dataset: Festival Programs

    The Film Dataset consists of a data scheme image file, a codebook, and two dataset tables in csv format.

    The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definitions of the variables, it lists explanations of the units of measurement, data sources, coding, and information on missing data.

    The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

    The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
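
    For orientation, here is a minimal pandas sketch of collapsing the long table to one row per film, mirroring how the wide table is described; the column names "film_id" and "year" are hypothetical stand-ins (only "fest" is named above), so check them against the codebook:

      import pandas as pd

      long_df = pd.read_csv("1_film-dataset_festival-program_long.csv")

      # Keep the first sampled festival appearance per unique film,
      # as in the construction of the wide table described above.
      wide_like = (
          long_df.sort_values("year")                 # hypothetical edition-year column
                 .groupby("film_id", as_index=False)  # hypothetical unique-ID column
                 .first()
      )
      print(len(wide_like))  # should approach the documented n=9,348 unique films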

    2 Survey Dataset

    The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

    The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

    The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is in the wide format, i.e. the information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on the number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide-format dataset.

    3 IMDb & Scripts

    The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

    The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

    The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

    The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

    The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

    The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to one crew member of a given film. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

    The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

    The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

    The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, TV, DVD/Blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

    The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

    The dataset includes 8 text files containing the scripts for web scraping. They were written using R version 3.6.3 for Windows.

    The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.

    The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then, if no matches are found in the advanced search, potentially using an alternative title and a basic search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of suggested records on the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach using two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.
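
    For illustration, here is a self-contained sketch of the two string-distance methods named above (cosine similarity over character bigrams and optimal string alignment); this is not the authors' script, and the example titles are hypothetical:

      from collections import Counter
      import math

      def cosine_sim(a: str, b: str, n: int = 2) -> float:
          # Cosine similarity between character n-gram count vectors.
          grams = lambda s: Counter(s[i:i + n] for i in range(len(s) - n + 1))
          ca, cb = grams(a.lower()), grams(b.lower())
          dot = sum(ca[g] * cb[g] for g in ca)
          norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
          return dot / norm if norm else 0.0

      def osa_dist(a: str, b: str) -> int:
          # Optimal string alignment: Levenshtein plus adjacent transpositions.
          d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
          for i in range(len(a) + 1):
              d[i][0] = i
          for j in range(len(b) + 1):
              d[0][j] = j
          for i in range(1, len(a) + 1):
              for j in range(1, len(b) + 1):
                  cost = 0 if a[i - 1] == b[j - 1] else 1
                  d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
                  if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                      d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
          return d[len(a)][len(b)]

      print(cosine_sim("the square", "square, the"))  # high bigram overlap
      print(osa_dist("parasite", "parastie"))         # 1: one adjacent transposition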

    The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of five categories: a) 100% match (perfect match on title, year, and director); b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.

    The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

    The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does so for the first 100 films as a check that everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.

    The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

    The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

    The script “r_check_logs” is used for troubleshooting and for tracking the progress of all of the R scripts used. It gives information on the number of missing values and errors.

    4 Festival Library Dataset

    The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

    The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definitions of variables, such as location, festival name, and festival categories, along with units of measurement, data sources, coding, and information on missing data.

    The csv file “4_festival-library_dataset_imdb-and-survey” contains data on all unique festivals collected from both IMDb and survey sources. This dataset is in the wide format; all information for each festival is listed in one row.

  2. Street Network Database SND

    • catalog.data.gov
    • data.seattle.gov
    Updated Oct 4, 2025
    Cite
    City of Seattle ArcGIS Online (2025). Street Network Database SND [Dataset]. https://catalog.data.gov/dataset/street-network-database-snd-1712b
    Dataset updated
    Oct 4, 2025
    Dataset provided by
    City of Seattle ArcGIS Online
    Description

    The pathway representation consists of segments and intersection elements. A segment is a linear graphic element that represents a continuous physical travel path terminated by a path end (dead end) or a physical intersection with other travel paths. Segments have one street name, one address range, and one set of segment characteristics. A segment may have no or multiple alias street names. Segment types included are Freeways, Highways, Streets, Alleys (named only), Railroads, Walkways, and Bike lanes. SNDSEG_PV is a linear feature class representing the SND Segment Feature, with attributes for Street name, Address Range, Alias Street name, and segment Characteristics objects. Part of the Address Range and all of the Street name objects are logically shared with the Discrete Address Point-Master Address File layer. Appropriate uses include:

    • Cartography - Used to depict the City's transportation network location and connections, typically on smaller-scale maps or images where a single-line representation is appropriate. Used to depict specific classifications of roadway use, also typically at smaller scales. Used to label transportation network feature names, typically on larger-scale maps. Used to label address ranges associated with transportation network features, typically on larger-scale maps.
    • Geocode reference - Used as a source for derived reference data for address validation and theoretical address location.
    • Address Range data repository - This data store is the City's address range repository, defining address ranges in association with transportation network features.
    • Polygon boundary reference - Used to define various area boundaries in other feature classes where coincident with the transportation network. Does not contain polygon features.
    • Address based extracts - Used to create flat-file extracts, typically indexed by address, with reference to business data typically associated with transportation network features.
    • Thematic linear location reference - By providing unique, stable identifiers for each linear feature, thematic data is associated with specific transportation network features via these identifiers.
    • Thematic intersection location reference - By providing unique, stable identifiers for each intersection feature, thematic data is associated with specific transportation network features via these identifiers.
    • Network route tracing - Used as a source for derived reference data used to determine point-to-point travel paths or optimal stop allocation along a travel path.
    • Topological connections with segments - Used to provide a specific definition of location for each transportation network feature, as well as a specific definition of connection between transportation network features (defines where the streets are and the relationships between them, i.e., 4th Ave is west of 5th Ave and 4th Ave does intersect with Cherry St).
    • Event location reference - Used as a source for derived reference data used to locate events via linear referencing.

    Data source is TRANSPO.SNDSEG_PV. Updated weekly.
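
    The identifier-based linking described in the list above can be sketched in a few lines; the ID and attribute names here are hypothetical, not the actual SND schema:

      import pandas as pd

      # Stable segment identifiers let thematic tables attach to network features.
      segments = pd.DataFrame({"snd_id": [101, 102], "street": ["4TH AVE", "5TH AVE"]})
      pavement = pd.DataFrame({"snd_id": [101], "condition": ["GOOD"]})  # thematic data

      # Left join keeps every segment and attaches thematic attributes where present.
      print(segments.merge(pavement, on="snd_id", how="left"))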

  3. ECMWF Reanalysis v5

    • ecmwf.int
    application/x-grib
    Updated Dec 31, 1969
    Cite
    European Centre for Medium-Range Weather Forecasts (1969). ECMWF Reanalysis v5 [Dataset]. https://www.ecmwf.int/en/forecasts/dataset/ecmwf-reanalysis-v5
    Available download formats: application/x-grib (1 dataset)
    Dataset updated
    Dec 31, 1969
    Dataset authored and provided by
    European Centre for Medium-Range Weather Forecasts (http://ecmwf.int/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ERA5 provides hourly estimates of a large number of atmospheric, land, and oceanic climate variables. The data cover the Earth on a 31 km grid and resolve the atmosphere using 137 levels from the surface up to a height of 80 km. ERA5 includes information about uncertainties for all variables at reduced spatial and temporal resolutions.
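
    A minimal sketch (assuming xarray with the cfgrib engine and a hypothetical local file name) of opening an ERA5 GRIB download:

      import xarray as xr

      # "era5_sample.grib" is a hypothetical file downloaded from the ECMWF/CDS services.
      ds = xr.open_dataset("era5_sample.grib", engine="cfgrib")
      print(ds.data_vars)   # the climate variables present in this download
      print(ds.sizes)       # lat/lon dimensions on the ~31 km (0.25 degree) grid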

  4. Dictionary of English Words and Definitions

    • kaggle.com
    zip
    Updated Sep 22, 2024
    Cite
    AnthonyTherrien (2024). Dictionary of English Words and Definitions [Dataset]. https://www.kaggle.com/datasets/anthonytherrien/dictionary-of-english-words-and-definitions
    Available download formats: zip (6401928 bytes)
    Dataset updated
    Sep 22, 2024
    Authors
    AnthonyTherrien
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Overview

    This dataset consists of 42,052 English words and their corresponding definitions. It is a comprehensive collection of words ranging from common terms to more obscure vocabulary. The dataset is ideal for Natural Language Processing (NLP) tasks, educational tools, and various language-related applications.

    Key Features:

    • Words: A diverse set of English words, including both rare and frequently used terms.
    • Definitions: Each word is accompanied by a detailed definition that explains its meaning and contextual usage.

    Total Number of Words: 42,052

    Applications

    This dataset is well-suited for a range of use cases, including:

    • Natural Language Processing (NLP): Enhance text understanding models by providing contextual meaning and word associations.
    • Vocabulary Building: Create educational tools or games that help users expand their vocabulary.
    • Lexical Studies: Perform academic research on word usage, trends, and lexical semantics.
    • Dictionary and Thesaurus Development: Serve as a resource for building dictionary or thesaurus applications, where users can search for words and definitions.

    Data Structure

    • Word: The column containing the English word.
    • Definition: The column providing a comprehensive definition of the word.
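
    A minimal sketch (assuming pandas and a hypothetical CSV file name inside the download) of loading the two documented columns and building a lookup table:

      import pandas as pd

      df = pd.read_csv("dictionary.csv")  # hypothetical file name from the zip
      lookup = dict(zip(df["Word"].str.lower(), df["Definition"]))
      print(len(lookup))                  # expected around 42,052 entries
      print(lookup.get("serendipity", "not found"))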

    Potential Use Cases

    • Language Learning: This dataset can be used to develop applications or tools aimed at enhancing vocabulary acquisition for language learners.
    • NLP Model Training: Useful for tasks such as word embeddings, definition generation, and contextual learning.
    • Research: Analyze word patterns, rare vocabulary, and trends in the English language.


  5. HSIP Law Enforcement Locations in New Mexico

    • catalog.data.gov
    • gstore.unm.edu
    Updated Dec 2, 2020
    Cite
    (Point of Contact) (2020). HSIP Law Enforcement Locations in New Mexico [Dataset]. https://catalog.data.gov/dataset/hsip-law-enforcement-locations-in-new-mexico
    Dataset updated
    Dec 2, 2020
    Dataset provided by
    (Point of Contact)
    Area covered
    New Mexico
    Description

    Law Enforcement Locations: any location where sworn officers of a law enforcement agency are regularly based or stationed. Law enforcement agencies "are publicly funded and employ at least one full-time or part-time sworn officer with general arrest powers". This is the definition used by the US Department of Justice - Bureau of Justice Statistics (DOJ-BJS) for their Law Enforcement Management and Administrative Statistics (LEMAS) survey. Although LEMAS only includes non-federal agencies, this dataset includes locations for federal, state, local, and special-jurisdiction law enforcement agencies. Law enforcement agencies include, but are not limited to, municipal police, county sheriffs, state police, school police, park police, railroad police, federal law enforcement agencies, departments within non-law-enforcement federal agencies charged with law enforcement (e.g., US Postal Inspectors), and cross-jurisdictional authorities (e.g., Port Authority Police).

    In general, the requirements and training for becoming a sworn law enforcement officer are set by each state. Law enforcement agencies themselves are not chartered or licensed by their state. County, city, and other government authorities within each state are usually empowered by state law to set up or disband law enforcement agencies. Generally, sworn law enforcement officers must report which agency employs them to the state. Although TGS's intention is to include only locations associated with agencies that meet the above definition, TGS has discovered a few locations associated with agencies that are not publicly funded. TGS deleted these locations as it became aware of them, but some may still exist in this dataset. Personal homes, administrative offices, and temporary locations are intended to be excluded from this dataset; however, some personal homes are included because the New Mexico Mounted Police work out of their homes.

    TGS has made a concerted effort to include all local police; county sheriffs; state police and/or highway patrol; Bureau of Indian Affairs; Bureau of Land Management; Bureau of Reclamation; U.S. Park Police; Bureau of Alcohol, Tobacco, Firearms, and Explosives; U.S. Marshals Service; U.S. Fish and Wildlife Service; National Park Service; U.S. Immigration and Customs Enforcement; and U.S. Customs and Border Protection. This dataset is comprised entirely of license-free data. FBI entities are intended to be excluded from this dataset, but a few may be included.

    The Law Enforcement dataset and the Correctional Institutions dataset were merged into one working file; TGS processed them as one file and then separated them for delivery purposes. With the merge of the Law Enforcement and Correctional Institutions datasets, the NAICS Codes & Descriptions were assigned based on each facility's main function, which was determined by the entity's name, facility type, web research, and state-supplied data. In instances where an entity's primary function is both law enforcement and corrections, the NAICS Codes and Descriptions are assigned based on the dataset in which the record is located (i.e., a facility that serves as both a Sheriff's Office and a jail is designated as [NAICSDESCR]="SHERIFFS' OFFICES (EXCEPT COURT FUNCTIONS ONLY)" in the Law Enforcement layer and as [NAICSDESCR]="JAILS (EXCEPT PRIVATE OPERATION OF)" in the Correctional Institutions layer).

    Records with "-DOD" appended to the end of the [NAME] value are located on a military base, as defined by the Defense Installation Spatial Data Infrastructure (DISDI) military installations and military range boundaries. "#" and "*" characters were automatically removed from standard fields that TGS populated. Double spaces were replaced by single spaces in these same fields. Text fields in this dataset have been set to all upper case to facilitate consistent database-engine search results. All diacritics (e.g., the German umlaut or the Spanish tilde) have been replaced with their closest equivalent English character to facilitate use with database systems that may not support diacritics. The currentness of this dataset is indicated by the [CONTDATE] field. Based on the values in this field, the oldest record dates from 08/14/2006 and the newest from 10/23/2009.
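
    The field clean-up rules in the preceding paragraph are straightforward to reproduce; here is a minimal Python sketch (the diacritic step uses Unicode decomposition as an approximation of "closest equivalent English character", which does not cover conventions like German ü to ue):

      import unicodedata

      def normalize_field(value: str) -> str:
          # Remove "#" and "*" characters, as described above.
          value = value.replace("#", "").replace("*", "")
          # Collapse double spaces to single spaces.
          while "  " in value:
              value = value.replace("  ", " ")
          # Upper-case for consistent database-engine search results.
          value = value.upper()
          # Drop combining marks so e.g. the umlaut or tilde maps to its base letter.
          decomposed = unicodedata.normalize("NFKD", value)
          return "".join(c for c in decomposed if not unicodedata.combining(c))

      print(normalize_field("Peña  Blanca* Station #3"))  # -> "PENA BLANCA STATION 3"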

  6. Public Health Portfolio (Directly Funded Research - Programmes and Training...

    • nihr.opendatasoft.com
    • nihr.aws-ec2-eu-central-1.opendatasoft.com
    csv, excel, json
    Updated Nov 4, 2025
    Cite
    (2025). Public Health Portfolio (Directly Funded Research - Programmes and Training Awards) [Dataset]. https://nihr.opendatasoft.com/explore/dataset/phof-datase/
    Available download formats: excel, json, csv
    Dataset updated
    Nov 4, 2025
    License

    Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Description

    This Public Health Portfolio (Directly Funded Research - Programmes and Training Awards) dataset contains NIHR directly funded research awards where the funding is allocated to an award holder or host organisation to carry out a specific piece of research or complete a training award. The NIHR also invests significantly in centres of excellence, collaborations, services and facilities to support research in England; collectively these form NIHR infrastructure support. NIHR infrastructure-supported projects are available in the separate Public Health Portfolio (Infrastructure Support) dataset.

    NIHR directly funded research awards (Programmes and Training Awards) that were funded between January 2006 and the present extraction date are eligible for inclusion in this dataset. An agreed inclusion/exclusion criterion is used to categorise awards as public health awards (see below). Following inclusion in the dataset, public health awards are second-level coded to one of the four Public Health Outcomes Framework domains: (1) wider determinants, (2) health improvement, (3) health protection, and (4) healthcare and premature mortality.

    This dataset is updated quarterly to include new NIHR awards categorised as public health awards. Please note that for those Public Health Research Programme projects showing an Award Budget of £0.00, the project is undertaken by an on-call team, for example PHIRST, the Public Health Review Team, or the Knowledge Mobilisation Team, as part of an ongoing programme of work.

    Inclusion Criteria

    The NIHR Public Health Overview project team worked with colleagues across NIHR public health research to define the inclusion criteria for NIHR public health research. NIHR directly funded research awards are categorised as public health if they are determined to be ‘investigations of interventions in, or studies of, populations that are anticipated to have an effect on health or on health inequity at a population level.’ This definition of public health is intentionally broad, to capture the wide range of NIHR public health research across prevention, health improvement, health protection, and healthcare services (both within and outside of NHS settings). This dataset does not reflect the NIHR’s total investment in public health research; the intention is to showcase a subset of the wider NIHR public health portfolio. This dataset includes NIHR directly funded research awards categorised as public health awards. It does not include public health awards or projects funded by any of the three NIHR Research Schools or the NIHR Health Protection Research Units.

    Disclaimers

    Users of this dataset should acknowledge the broad definition of public health that has been used to develop the inclusion criteria for this dataset. Please note that this dataset is currently subject to a limited data quality review; we are working to improve our data collection methodologies. Some awards may also appear in other NIHR curated datasets.

    Further Information

    Further information on the individual awards shown in the dataset can be found on the NIHR’s Funding & Awards website, and further information on individual NIHR Research Programmes’ decision-making processes for funding health and social care research is available from the NIHR. The NIHR is one of the main funders of public health research in the UK; public health research falls within the remit of a range of NIHR Directly Funded Research (Programmes and Training Awards) and NIHR Infrastructure Support, including the NIHR School for Public Health, the NIHR Public Health Policy Research Unit, the NIHR Health Protection Research Units, the NIHR Public Health Research Programme Health Determinants Research Collaborations (HDRC), and the NIHR Public Health Research Programme Public Health Intervention Responsive Studies Teams (PHIRST).

  7. MatSeg: Material State Segmentation Dataset and Benchmark

    • zenodo.org
    zip
    Updated May 22, 2025
    Cite
    Zenodo (2025). MatSeg: Material State Segmentation Dataset and Benchmark [Dataset]. http://doi.org/10.5281/zenodo.11331618
    Available download formats: zip
    Dataset updated
    May 22, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    MatSeg Dataset and benchmark for zero-shot material state segmentation.

    The MatSeg benchmark, containing 1220 real-world images and their annotations, is available at MatSeg_Benchmark.zip; the file contains documentation and Python readers.

    The MatSeg dataset, containing synthetic images infused with natural-image patterns, is available at MatSeg3D_part_*.zip and MatSeg2D_part_*.zip (* stands for a number).

    MatSeg3D_part_*.zip: contains synthetic 3D scenes.

    MatSeg2D_part_*.zip: contains synthetic 2D scenes.

    Readers and documentation for the synthetic data are available at: Dataset_Documentation_And_Readers.zip

    Readers and documentation for the real-images benchmark are available at: MatSeg_Benchmark.zip

    The Code used to generate the MatSeg Dataset is available at: https://zenodo.org/records/11401072

    Additional permanent sources for downloading the dataset and metadata: 1, 2

    Evaluation scripts for the Benchmark are now available at:

    https://zenodo.org/records/13402003 and https://e.pcloud.link/publink/show?code=XZsP8PZbT7AJzG98tV1gnVoEsxKRbBl8awX

    Description

    Materials and their states form a vast array of patterns and textures that define the physical and visual world. Minerals in rocks, sediment in soil, dust on surfaces, infection on leaves, stains on fruits, and foam in liquids are some of these almost infinite numbers of states and patterns.

    Image segmentation of materials and their states is fundamental to the understanding of the world and is essential for a wide range of tasks, from cooking and cleaning to construction, agriculture, and chemistry laboratory work.

    The MatSeg dataset focuses on zero-shot segmentation of materials and their states, meaning identifying the region of an image belonging to a specific material type of state, without previous knowledge or training of the material type, states, or environment.

    The dataset contains a large set of (100k) synthetic images and benchmarks of 1220 real-world images for testing.

    Benchmark

    The benchmark contains 1220 real-world images with a wide range of material states and settings, for example food states (cooked/burned), plants (infected/dry), rocks/soil (minerals/sediment), construction/metals (rusted/worn), and liquids (foam/sediment), among many other states, without being limited to a set of classes or environments. The goal is to evaluate the segmentation of materials without knowledge of, or pretraining on, the material or setting. The focus is on materials with complex scattered boundaries and gradual transitions (like the level of wetness of a surface).

    Evaluation scripts for the Benchmark are now available at: 1 and 2.

    Synthetic Dataset

    The synthetic dataset is composed of synthetic scenes rendered in 2D and 3D using Blender. The synthetic data is infused with patterns, materials, and textures automatically extracted from real images, allowing it to capture the complexity and diversity of the real world while maintaining the precision and scale of synthetic data. 100k images and their annotations are available to download.

    License

    This dataset, including all its components, is released under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. To the extent possible under law, the authors have dedicated all copyright and related and neighboring rights to this dataset to the public domain worldwide. This dedication applies to the dataset and all derivative works.

    The MatSeg 2D and 3D synthetic data were generated using the Open Images dataset, which is licensed under the Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0). For these components, you must comply with the terms of the Apache License. In addition, the MatSeg3D dataset uses ShapeNet 3D assets under a GNU license.

    Example Usage:

    An example of training and evaluation code for a net trained on the dataset and evaluated on the benchmark is given at these URLs: 1, 2. This includes an evaluation script for the MatSeg benchmark, a training script using the MatSeg dataset, and the weights of a trained model.

    Paper:

    More detail on the work can be found in the paper "Infusing Synthetic Data with Real-World Patterns for Zero-Shot Material State Segmentation".

    Croissant metadata and additional sources for downloading the dataset are available at 1,2

  8. Census_sum_15

    • catalog.data.gov
    • datasets.ai
    Updated Nov 27, 2025
    Cite
    U.S. Geological Survey (2025). Census_sum_15 [Dataset]. https://catalog.data.gov/dataset/census-sum-15
    Dataset updated
    Nov 27, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    The GIS layer "Census_sum_15" provides a standardized tool for examining spatial patterns in abundance and demographic trends of the southern sea otter (Enhydra lutris nereis), based on data collected during the spring 2015 range-wide census. The USGS range-wide sea otter census has been undertaken twice a year since 1982, once in May and once in October, using consistent methodology involving both ground-based and aerial-based counts. The spring census is considered more accurate than the fall count, and provides the primary basis for gauging population trends by State and Federal management agencies.

    This Shape file includes a series of summary statistics derived from the raw census data, including sea otter density (otters per square km of habitat), linear density (otters per km of coastline), relative pup abundance (ratio of pups to independent animals) and 5-year population trend (calculated as exponential rate of change). All statistics are calculated and plotted for small sections of habitat in order to illustrate local variation in these statistics across the entire mainland distribution of sea otters in California (as of 2015).

    Sea otter habitat is considered to extend offshore from the mean low tide line and out to the 60m isobath: this depth range includes over 99% of sea otter feeding dives, based on dive-depth data from radio tagged sea otters (Tinker et al 2006, 2007). Sea otter distribution in California (the mainland range) is considered to comprise this band of potential habitat stretching along the coast of California, and bounded to the north and south by range limits defined as "the points farthest from the range center at which 5 or more otters are counted within a 10km contiguous stretch of coastline (as measured along the 10m bathymetric contour) during the two most recent spring censuses, or at which these same criteria were met in the previous year". The polygon corresponding to the range definition was then sub-divided into onshore/offshore strips roughly 500 meters in width. The boundaries between these strips correspond to ATOS (As-The-Otter-Swims) points, which are arbitrary locations established approximately every 500 meters along a smoothed 5 fathom bathymetric contour (line) offshore of the State of California.
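
    The summary statistics named above follow directly from counts and habitat geometry; here is a small sketch with hypothetical numbers, where the 5-year trend is computed as an exponential rate of change, r = ln(N_t / N_0) / t:

      import math

      # Hypothetical counts and geometry for one habitat section.
      otters, pups = 125.0, 20.0          # independent animals and pups
      habitat_km2, coast_km = 42.0, 18.0  # habitat area and coastline length

      density = otters / habitat_km2      # otters per square km of habitat
      linear_density = otters / coast_km  # otters per km of coastline
      pup_ratio = pups / otters           # relative pup abundance

      # 5-year trend between two hypothetical spring census counts.
      n_2010, n_2015, years = 98.0, 125.0, 5.0
      trend = math.log(n_2015 / n_2010) / years
      print(density, linear_density, pup_ratio, trend)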

  9. Data set on Task unpacking effects in time estimation: The role of future...

    • scidb.cn
    Updated Dec 1, 2023
    Cite
    Shizifu; xia bi qi; Liu Xin (2023). Data set on Task unpacking effects in time estimation: The role of future boundaries and thought focus [Dataset]. http://doi.org/10.57760/sciencedb.j00052.00202
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Dec 1, 2023
    Dataset provided by
    Science Data Bank
    Authors
    Shizifu; xia bi qi; Liu Xin
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset is for the study of task unpacking effects in time estimation (the role of future boundaries and thought focus), with supplementary materials. Previous research on the impact of task unpacking on time estimation often overlooked the role of time factors; for example, given the same unpacking, people subjectively set different time boundaries when facing difficult versus easy tasks. Taking the time factor into account should therefore refine and integrate the research conclusions on unpacking effects. On this basis, we studied the impact of task unpacking and future boundaries on time estimation.

    Experiment 1 used a 2 (task unpacking: unpacked/not unpacked) × 2 (future boundary: present/absent) between-subjects design, using the prospective paradigm to measure participants' time estimates. Experiment 2 further manipulated the time range of the future boundary, using a 2 (task unpacking: unpacked/not unpacked) × 3 (future boundary range: longer/medium/shorter) between-subjects design, again measuring time estimates with the prospective paradigm. Experiment 3 further verified the mechanism by which the time range of the future boundary influences time estimation under unpacking conditions: in a single-factor between-subjects design, a thought focus scale measured participants' thought focus under longer and shorter boundary conditions. The above experiments and measurements produced the following dataset.

    Experiment 1 table, data column labels: task unpacking is a grouping variable (0 = unpacked; 1 = not unpacked); future boundary is a grouping variable (0 = present; 1 = absent); Zsco01 is the standard score of the estimated total task time; the logarithm column is the logarithmic value of the estimated time for all tasks.

    Experiment 2 table, data column labels: future boundary is a grouping variable (7 = shorter, 8 = medium, 9 = longer); the remaining data labels are the same as in Experiment 1.

    Experiment 3 table, data column labels: Zplan is the standard score of the thought-focus plan score; Zbar is the standard score of attention to barriers; future boundary is a grouping variable (0 = shorter, 1 = longer).
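
    The derived columns are standard transforms; a small pandas sketch (toy values and a hypothetical column name, not the actual table schema) of computing a Zsco01-style standard score and the log values:

      import numpy as np
      import pandas as pd

      df = pd.DataFrame({"total_estimate_s": [120.0, 300.0, 240.0, 600.0, 90.0]})  # toy data
      x = df["total_estimate_s"]
      df["Zsco01"] = (x - x.mean()) / x.std(ddof=0)  # standard (z) score
      df["log_estimate"] = np.log(x)                 # logarithm of the estimated time
      print(df)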

  10. Danish Similarity Data Set

    • sprogteknologi.dk
    Updated Sep 5, 2024
    Cite
    Centre for Language Technology, NorS, University of Copenhagen (2024). Danish Similarity Data Set [Dataset]. https://sprogteknologi.dk/dataset/danish-similarity-data-set
    Available download formats: csv
    Dataset updated
    Sep 5, 2024
    Dataset provided by
    Center for Sprogteknologi
    Authors
    Centre for Language Technology, NorS, University of Copenhagen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Denmark
    Description

    The Danish similarity dataset is a gold standard resource for evaluation of Danish word embedding models. The dataset consists of 99 word pairs rated by 38 human judges according to their semantic similarity, i.e. the extent to which the two words are similar in meaning, in a normalized 0-1 range. Note that this dataset provides a way of measuring similarity rather than relatedness/association. Description of files included in this material (in both files, rows correspond to items, i.e. word pairs, and columns to properties of each item): All_sims_da.csv contains the non-normalized mean similarity scores over all judges, along with the non-normalized scores given by each of the 38 judges on the scale 0-6, where 0 is given to the most dissimilar items and 6 to the most similar items. Gold_sims_da.csv contains the similarity gold standard for each item, which is the normalized mean similarity score for a given item over all judges; scores are normalized to a 0-1 range, where 0 denotes the minimum degree of similarity and 1 denotes the maximum degree of similarity.
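
    A minimal pandas sketch of deriving gold-style scores from the raw ratings; the judge column names are hypothetical, and dividing the 0-6 mean by the scale maximum is an assumption consistent with, but not stated in, the description:

      import pandas as pd

      raw = pd.read_csv("All_sims_da.csv")
      judge_cols = [c for c in raw.columns if c.lower().startswith("judge")]  # hypothetical names
      gold = raw[judge_cols].mean(axis=1) / 6.0  # rescale the 0-6 mean to the 0-1 range
      print(gold.describe())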

  11. Data from: Multi-Profile Ultra High Definition (UHD) AVC and HEVC 4K DASH...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    Cite
    Quinlan, Jason; Sreenan, Cormac (2020). Multi-Profile Ultra High Definition (UHD) AVC and HEVC 4K DASH Datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1219787
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    University College Cork
    Authors
    Quinlan, Jason; Sreenan, Cormac
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present a Multi-Profile Ultra High Definition (UHD) DASH dataset composed of both AVC (H.264) and HEVC (H.265) video content, generated from three well-known open-source 4K video clips. The representation rates and resolutions of our dataset range from 40 Mbps in 4K down to 235 kbps in 320x240, and are comparable to rates utilised by on-demand services such as Netflix, YouTube and Amazon Prime. We provide our dataset for both real-time testbed evaluation and trace-based simulation. The real-time testbed content provides a means of evaluating DASH adaptation techniques on physical hardware, while our trace-based content offers simulation over frameworks such as ns-2 and ns-3. We also provide the original pre-DASH MP4 files and our associated DASH generation scripts, so as to provide researchers with a mechanism to create their own DASH profile content locally. This improves the reproducibility of results and removes re-buffering issues caused by delay/jitter/losses in the Internet.

    The primary goal of our dataset is to provide the wide range of video content required for validating DASH Quality of Experience (QoE) delivery over networks, ranging from constrained cellular and satellite systems to future high-speed architectures such as the proposed 5G mmWave technology.
    
  12. Trace-Share Dataset for Evaluation of Trace Meaning Preservation

    • data.niaid.nih.gov
    Updated May 7, 2020
    Cite
    Cermak, Milan; Madeja, Tomas (2020). Trace-Share Dataset for Evaluation of Trace Meaning Preservation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3547527
    Dataset updated
    May 7, 2020
    Dataset provided by
    Institute of Computer Science, Masaryk University, Brno, Czech Republic
    Authors
    Cermak, Milan; Madeja, Tomas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains all data used during the evaluation of trace meaning preservation. Archives are protected by password "trace-share" to avoid false detection by antivirus software.

    For more information, see the project repository at https://github.com/Trace-Share.

    Selected Attack Traces

    The following list contains trace datasets used for evaluation. Each attack was chosen to have not only a different meaning but also different statistical properties.

    dos_http_flood — the capture of GET and POST requests sent to one server by one attacker (HTTP traffic);

    ftp_bruteforce — short and unsuccessful attempt to guess a user’s password for FTP service (FTP traffic);

    ponyloader_botnet — a Pony Loader botnet used for stealing credentials from 3 target devices reporting to a single IP with a large number of intermediate addresses (DNS and HTTP traffic);

    scan — the capture of the nmap tool scanning a given subnet using ICMP echo and TCP SYN requests (consists of ARP, ICMP, and TCP traffic);

    wannacry_ransomware — the capture of WannaCry ransomware spreading in a domain with three workstations, a domain controller, and a file-sharing server (SMB and SMBv2 traffic).

    Background Traffic Data

    Publicly available dataset CSE-CIC-IDS-2018 was used as a background traffic data. The evaluation uses data from the day Thursday-01-03-2018 containing a sufficient proportion of regular traffic without any statistically significant attacks. Only traffic aimed at victim machines (range 172.31.69.0/24) is used to reduce less significant traffic.

    Evaluation Results and Dataset Structure

    Traces variants (traces.zip)

    ./traces-original/ — trace PCAP files and crawled details in YAML format;

    ./traces-normalized/ — normalized PCAP files and details in YAML format;

    ./traces-adjusted/ — adjusted PCAP files using various timestamp generation settings, combination configuration in YAML format, and labels provided by ID2T in XML format.

    Extracted alerts (alerts.zip)

    ./alerts-original/ — extracted Suricata alerts, Suricata log, and full Suricata output for all original trace files;

    ./alerts-normalized/ — extracted Suricata alerts, Suricata log, and full Suricata output for all normalized trace files;

    ./alerts-adjusted/ — extracted Suricata alerts, Suricata log, and full Suricata output for all adjusted trace files.

    Evaluation results

    *.csv files in the root directory — contain the extracted alert signatures and their counts for each trace variant.
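
    A minimal pandas sketch (hypothetical file and column names, not the actual CSV schema) of comparing alert-signature counts across trace variants from these *.csv files:

      import pandas as pd

      counts = pd.read_csv("scan.csv")  # hypothetical per-trace results file
      pivot = counts.pivot_table(index="signature", columns="variant",
                                 values="count", aggfunc="sum", fill_value=0)
      # Meaning is preserved when each signature fires equally often per variant.
      print(pivot.head())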

  13. HLY-08-03 Raw 150 KHz ADCP Data [Sambrotto/LDEO]

    • data.ucar.edu
    • ckanprod.data-commons.k8s.ucar.edu
    archive
    Updated Oct 7, 2025
    Cite
    Raymond Sambrotto (2025). HLY-08-03 Raw 150 KHz ADCP Data [Sambrotto/LDEO] [Dataset]. http://doi.org/10.5065/D6959FKV
    Available download formats: archive
    Dataset updated
    Oct 7, 2025
    Authors
    Raymond Sambrotto
    Time period covered
    Jul 3, 2008 - Jul 31, 2008
    Description

    This dataset includes data from the 150 kHz ADCP system onboard the US Coast Guard Cutter Healy during the Bering Sea Ecosystem Study-Bering Sea Integrated Ecosystem Research Program (BEST-BSIERP) 2008 0803 (summer) cruise; BEST and BSIERP together are the Bering Sea project. The ADCP system measures currents in the depth range from about 30 to 300 m in good weather. In bad weather or in ice, the range is less, and sometimes no valid measurements are made. The individual data files have been collected into tar files by day according to the sequence number. The following list gives the file extensions of the data files contained in the tar files and their meanings: ENR - Raw Binary ADCP Data; ENS - Binary ADCP Data; ENX - Binary Ensemble Data; STA - Short Term Averaged Data; LTA - Long Term Averaged Data; N1R - Raw NMEA ASCII Data; N2R - Raw NMEA ASCII Data; NMS - Averaged Nav Data.
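
    A short sketch (with a hypothetical archive name) of opening one day's tar file and grouping the members by the extensions listed above:

      import tarfile
      from collections import defaultdict

      by_ext = defaultdict(list)
      with tarfile.open("hly0803_day185.tar") as tar:  # hypothetical daily archive
          for member in tar.getmembers():
              ext = member.name.rsplit(".", 1)[-1].upper()
              by_ext[ext].append(member.name)

      # Count files per data type, e.g. raw binary vs. averaged products.
      for ext in ("ENR", "STA", "LTA", "N1R"):
          print(ext, len(by_ext.get(ext, [])))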

  14. Centerlines

    • data.saccounty.gov
    • data.sacog.org
    Updated Mar 15, 2018
    Cite
    Sacramento County GIS (2018). Centerlines [Dataset]. https://data.saccounty.gov/datasets/centerlines
    Dataset updated
    Mar 15, 2018
    Dataset authored and provided by
    Sacramento County GIS
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the official Street Centerline dataset for the County of Sacramento and the incorporated cities within. The Street Range Index table is a distinct list of street names within the Centerline dataset along with the existing address range for each street by zip code. The Street Name Index table is a distinct list of street names within the Centerline dataset.

  15. Data from: Gradient Boosted Machine Learning Model to Predict H2, CH4, and...

    • figshare.com
    zip
    Updated Jul 18, 2023
    Cite
    Tom Bailey; Adam Jackson; Razvan-Antonio Berbece; Kejun Wu; Nicole Hondow; Elaine Martin (2023). Gradient Boosted Machine Learning Model to Predict H2, CH4, and CO2 Uptake in Metal–Organic Frameworks Using Experimental Data [Dataset]. http://doi.org/10.1021/acs.jcim.3c00135.s002
    Available download formats: zip
    Dataset updated
    Jul 18, 2023
    Dataset provided by
    ACS Publications
    Authors
    Tom Bailey; Adam Jackson; Razvan-Antonio Berbece; Kejun Wu; Nicole Hondow; Elaine Martin
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Predictive screening of metal–organic framework (MOF) materials for their gas uptake properties has been previously limited by using data from a range of simulated sources, meaning the final predictions are dependent on the performance of these original models. In this work, experimental gas uptake data has been used to create a Gradient Boosted Tree model for the prediction of H2, CH4, and CO2 uptake over a range of temperatures and pressures in MOF materials. The descriptors used in this database were obtained from the literature, with no computational modeling needed. This model was repeated 10 times, showing an average R2 of 0.86 and a mean absolute error (MAE) of ±2.88 wt % across the runs. This model will provide gas uptake predictions for a range of gases, temperatures, and pressures as a one-stop solution, with the data provided being based on previous experimental observations in the literature, rather than simulations, which may differ from their real-world results. The objective of this work is to create a machine learning model for the inference of gas uptake in MOFs. The basis of model development is experimental as opposed to simulated data to realize its applications by practitioners. The real-world nature of this research materializes in a focus on the application of algorithms as opposed to the detailed assessment of the algorithms.
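
    As an illustration of the modelling recipe described above (a gradient-boosted tree regressor evaluated over 10 repeated runs with averaged R2 and MAE), here is a sketch on synthetic stand-in data, not the authors' MOF descriptors:

      import numpy as np
      from sklearn.datasets import make_regression
      from sklearn.ensemble import GradientBoostingRegressor
      from sklearn.metrics import mean_absolute_error, r2_score
      from sklearn.model_selection import train_test_split

      # Synthetic stand-in for the literature-derived descriptor table.
      X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

      r2s, maes = [], []
      for seed in range(10):  # the paper reports averages over 10 repeated runs
          X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
          model = GradientBoostingRegressor(random_state=seed).fit(X_tr, y_tr)
          pred = model.predict(X_te)
          r2s.append(r2_score(y_te, pred))
          maes.append(mean_absolute_error(y_te, pred))
      print(f"mean R2 = {np.mean(r2s):.2f}, mean MAE = {np.mean(maes):.2f}")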

  16. Landsat Level-1 Collection 2

    • catalog.data.gov
    • s.cnmilf.com
    Updated Sep 19, 2025
    Cite
    Not provided (2025). Landsat Level-1 Collection 2 [Dataset]. https://catalog.data.gov/dataset/landsat-level-1-collection-2
    Dataset updated
    Sep 19, 2025
    Dataset provided by
    Not provided
    Description

    The Landsat Level-1 Collection 2 products, produced by the U.S. Geological Survey (USGS), provide high-quality, calibrated, and georeferenced satellite data that support a wide range of remote sensing applications. This dataset encompasses multispectral data captured by the Landsat 1-9 missions, spanning the period from 1972 to the present, and offers consistent and accurate measurements of Earth's land surface characteristics. These data are processed to established, precise standards, meaning they are radiometrically and geometrically calibrated to ensure high accuracy for scientific analyses. The Level-1 products include per-pixel quality assessment information that provides essential metadata to help users select the most suitable data for their needs.
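
    A sketch of using the per-pixel quality assessment band; it assumes the Collection 2 QA_PIXEL bit layout (bit 3 flags cloud, bit 4 cloud shadow), which should be verified against the USGS product documentation before use:

      import numpy as np

      # Toy QA_PIXEL values; in practice, read the band from the product with e.g. rasterio.
      qa = np.array([[21824, 22280], [21824, 24344]], dtype=np.uint16)

      cloud = (qa >> 3) & 1   # assumed bit 3: cloud
      shadow = (qa >> 4) & 1  # assumed bit 4: cloud shadow
      clear = (cloud == 0) & (shadow == 0)
      print(clear)            # boolean mask of usable pixels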

  17. Street Network Database SND | gimi9.com

    • gimi9.com
    Updated Jun 23, 2023
    Cite
    (2023). Street Network Database SND | gimi9.com [Dataset]. https://gimi9.com/dataset/data-gov_street-network-database-snd-1712b
    Dataset updated
    Jun 23, 2023
    Description

    The description is identical to that of dataset 2, Street Network Database SND, above.

  18. Datasets of pushing complexometric titrations in the nanomolar range thanks...

    • data.niaid.nih.gov
    Updated Aug 23, 2024
    Cite
    Noclain, Angelina; Charron, Gaëlle (2024). Datasets of pushing complexometric titrations in the nanomolar range thanks to SERS [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13364975
    Dataset updated
    Aug 23, 2024
    Dataset provided by
    Laboratoire Matière et Systèmes Complexes
    Authors
    Noclain, Angelina; Charron, Gaëlle
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These datasets contain the spectral information used in a publication on the implementation of complexometric titrations of copper monitored using Surface Enhanced Raman Spectroscopy (SERS). In this study, the sensitivity of a classic complexometric titration system for Cu2+ is pushed into the nanomolar regime thanks to SERS monitoring of the endpoint.

    All the data are organised as column vectors.

    The first set of data consists of three csv files with semicolon separators.

    “AN334_and_AN338_laser_532nm_1microM_copper_variation_of_PAN_all_data_corrected_serie_11_points” corresponds to spectral data acquired under 532 nm irradiation with the PAN concentration varied (100, 50 and 25 nM) for 1 µM of copper.

    “AN335_and_AN341_laser_532nm_500nM_copper_variation_of_PAN_all_data_corrected_serie_11_points” corresponds to spectral data acquired under 532 nm irradiation with the PAN concentration varied (100, 50 and 25 nM) for 500 nM of copper.

    “AN336_and_AN344_laser_532nm_250nM_copper_variation_of_PAN_all_data_corrected_serie_11_points” corresponds to spectral data acquired under 532 nm irradiation with the PAN concentration varied (100, 50 and 25 nM) for 250 nM of copper.

    Each dataset has 90,685 rows, corresponding to 1 irradiation wavelength (532 nm), 2 experimental replicate titration series (each comprising 1 titration + 1 blank titration), 3 different PAN concentrations, and 11 titration steps per titration series; that is, 2 × 2 × 3 × 11 = 132 samples per file. Across the 3 copper concentrations this yields 396 samples.

    The column vectors consist of a first sub-vector of spectral descriptors (identity of spectra, identity and composition of measurement samples, and conditions of spectral acquisition), followed by a sub-vector of baseline-subtracted spectral intensities.

    The meaning of each column is given in the .docx file titled “Structure of AN334-AN338 to AN336-AN344 datasets”.

    The other two csv files with semicolon separators, “AN339_AN342_AN345_laser_638nm_variation_of_copper_and_PAN_all_data_corrected_serie_11_points” and “AN340_AN343_AN346_laser_785nm_variation_of_copper_and_PAN_all_data_corrected_serie_11_points”, correspond to spectral data acquired under 638 and 785 nm irradiation, respectively. In both files, the data correspond to the 11-point experiments with PAN variation (100, 50 and 25 nM) and copper variation (1 µM, 500 and 250 nM).

    Finally, the data include one csv file with semicolon separators.

    “AN384_to_AN386_Laser_532_638_785nm_all_data_corrected_titration_serie_20_points” corresponds to spectral data acquired under 532, 638 and 785 nm irradiation.

    The dataset has 111,241 rows, corresponding to 3 irradiation wavelengths (532, 638 and 785 nm), 2 replicate titration series + 1 blank titration, and 20 titration steps per titration series.

    The column vectors consist of a first sub-vector of spectral descriptors (identity of spectra, identity and composition of measurement samples, and conditions of spectral acquisition), followed by a sub-vector of baseline-subtracted spectral intensities.

    The meaning of each column is given in the .docx file titled “Structure of AN384 to AN386 datasets”.
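    A minimal Python sketch of loading one of these semicolon-separated files is below. The split between descriptor rows and intensity rows is an assumption (the row counts differ between files); consult the accompanying .docx structure files for the actual layout.

        # Minimal sketch: read one semicolon-separated file, where each
        # column is one spectrum preceded by descriptor rows.
        import pandas as pd

        path = ("AN334_and_AN338_laser_532nm_1microM_copper_variation_"
                "of_PAN_all_data_corrected_serie_11_points.csv")
        raw = pd.read_csv(path, sep=";", header=None)

        N_DESCRIPTOR_ROWS = 10  # hypothetical; see the .docx structure file

        descriptors = raw.iloc[:N_DESCRIPTOR_ROWS]                # spectrum identity, sample composition, acquisition conditions
        intensities = raw.iloc[N_DESCRIPTOR_ROWS:].astype(float)  # baseline-subtracted intensities

        print(f"{intensities.shape[1]} spectra of {intensities.shape[0]} points each")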

  19. Marine Hard Substrate Dataset

    • data-search.nerc.ac.uk
    • metadata.bgs.ac.uk
    • +1more
    Updated Nov 13, 2025
    Cite
    British Geological Survey (2025). Marine Hard Substrate Dataset [Dataset]. https://data-search.nerc.ac.uk/geonetwork/srv/api/records/9e32312c-b028-521b-e044-0003ba9b0d98
    Explore at:
    Dataset updated
    Nov 13, 2025
    Dataset authored and provided by
    British Geological Survey: https://www.bgs.ac.uk/
    License

    http://inspire.ec.europa.eu/metadata-codelist/LimitationsOnPublicAccess/noLimitations

    Time period covered
    2011
    Area covered
    Description

    The Marine Hard Substrate dataset maps areas of rock or hard substrate outcropping at, or within 0.5 m of, the sea-bed. For the purpose of this dataset, hard substrate was defined as the presence of either rock or clasts >64 mm (boulders or cobbles) within 0.5 m of the seabed. This definition includes sediment veneer overlying hard substrate in some areas; it is used in order to capture both infaunal and epifaunal communities and is considered beneficial for habitat mappers. The interpretation was based on a variety of data sourced from within the British Geological Survey and externally. Data consulted include archive sample and seismic records, side scan sonar, multibeam bathymetry and Olex datasets. The distribution of hard substrate at, or within 0.5 m of, the seabed is important in dictating the benthic assemblages found in certain areas. Therefore, an understanding of the distribution of these substrates is of primary importance in marine planning and in the designation of Marine Conservation Zones (MCZs) under the Marine and Coastal Access Act, 2009. In addition, a number of other users will value these data, including marine renewable companies, aggregate companies, the fishing industry, and the oil and gas industries. In order to address this issue it was necessary to update British Geological Survey sea-bed mapping to delineate areas where rock, boulders or cobbles are present at, or within 0.5 m of, the sea-bed surface. A polygon shapefile showing areas of rock or hard substrate at, or within 0.5 m of, the sea-bed has been developed. The dataset has been created as vector polygons and is available in a range of GIS formats, including ESRI shapefile (.shp) and OGC GeoPackage (.gpkg). More specialised formats may be available but may incur additional processing costs. This dataset has been developed in collaboration with external partners, and the methodology used is detailed in the report MB0103 for DEFRA: Developing the necessary data layers for Marine Conservation Zone selection - Distribution of rock/hard substrate on the UK Continental Shelf MB0103 (Gafeira et al., 2010). This dataset was produced for use at 1:250 000 scale. However, in many cases the detail of the mapping is far greater than expected for this scale, as hard substrate delineation was done based on the best available data. These data should not be relied on for local or site-specific geology. Contact BGS Digital Data (digitaldata@bgs.ac.uk) for more information on this dataset.
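    As a usage illustration, the sketch below loads the polygons with Python and clips them to an area of interest. The file name is hypothetical; use the names supplied with the BGS data delivery.

        # Minimal sketch: clip the hard-substrate polygons to an AOI.
        import geopandas as gpd
        from shapely.geometry import box

        hard_substrate = gpd.read_file("marine_hard_substrate.gpkg")  # hypothetical file name

        # Bounding box (lon/lat) around a candidate survey area.
        aoi = gpd.GeoDataFrame(geometry=[box(-5.0, 54.0, -4.0, 55.0)], crs="EPSG:4326")
        aoi = aoi.to_crs(hard_substrate.crs)  # match the dataset's CRS before clipping

        local_rock = gpd.clip(hard_substrate, aoi)
        print(f"{len(local_rock)} hard-substrate polygons intersect the AOI")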

  20. d

    HSIP Correctional Institutions in New Mexico

    • catalog.data.gov
    • s.cnmilf.com
    • +1more
    Updated Dec 2, 2020
    Cite
    (Point of Contact) (2020). HSIP Correctional Institutions in New Mexico [Dataset]. https://catalog.data.gov/dataset/hsip-correctional-institutions-in-new-mexico
    Explore at:
    Dataset updated
    Dec 2, 2020
    Dataset provided by
    (Point of Contact)
    Area covered
    New Mexico
    Description

    Jails and Prisons (Correctional Institutions). The Jails and Prisons sub-layer is part of the Emergency Law Enforcement Sector and the Critical Infrastructure Category. A Jail or Prison consists of any facility or location where individuals are regularly and lawfully detained against their will. This includes Federal and State prisons, local jails, and juvenile detention facilities, as well as law enforcement temporary holding facilities. Work camps, including camps operated seasonally, are included if they otherwise meet the definition. A Federal Prison is a facility operated by the Federal Bureau of Prisons for the incarceration of individuals. A State Prison is a facility operated by a state, commonwealth, or territory of the US for the incarceration of individuals for a term usually longer than 1 year. A Juvenile Detention Facility is a facility for the incarceration of those who have not yet reached the age of majority (usually 18 years). A Local Jail is a locally administered facility that holds inmates beyond arraignment (usually 72 hours) and is staffed by municipal or county employees. A temporary holding facility, sometimes referred to as a "police lock up" or "drunk tank", is a facility used to detain people prior to arraignment. Locations that are administrative offices only are excluded from the dataset. This definition of Jails is consistent with that used by the Department of Justice (DOJ) in their "National Jail Census", with the exception of "temporary holding facilities", which the DOJ excludes. Locations which function primarily as law enforcement offices are included in this dataset if they have holding cells. If the facility is enclosed with a fence, wall, or structure with a gate around the buildings only, the locations were depicted as "on entity" at the center of the facility. If the facility's buildings are not enclosed, the locations were depicted as "on entity" on the main building or "block face" on the correct street segment. Personal homes, administrative offices, and temporary locations are intended to be excluded from this dataset. TGS has made a concerted effort to include all correctional institutions. This dataset includes non-license-restricted data from the following federal agencies: Bureau of Indian Affairs; Bureau of Reclamation; U.S. Park Police; Federal Bureau of Prisons; Bureau of Alcohol, Tobacco, Firearms and Explosives; U.S. Marshals Service; U.S. Fish and Wildlife Service; National Park Service; U.S. Immigration and Customs Enforcement; and U.S. Customs and Border Protection. This dataset is composed entirely of license-free data. The Law Enforcement dataset and the Correctional Institutions dataset were merged into one working file; TGS processed them as one file and then separated them for delivery purposes. With the merge of the Law Enforcement and the Correctional Institutions datasets, NAICS Codes & Descriptions were assigned based on each facility's main function, which was determined by the entity's name, facility type, web research, and state-supplied data. In instances where the entity's primary function is both law enforcement and corrections, the NAICS Codes and Descriptions are assigned based on the dataset in which the record is located (i.e., a facility that serves as both a Sheriff's Office and as a jail is designated as [NAICSDESCR]="SHERIFFS' OFFICES (EXCEPT COURT FUNCTIONS ONLY)" in the Law Enforcement layer and as [NAICSDESCR]="JAILS (EXCEPT PRIVATE OPERATION OF)" in the Correctional Institutions layer).
Records with "-DOD" appended to the end of the [NAME] value are located on a military base, as defined by the Defense Installation Spatial Data Infrastructure (DISDI) military installations and military range boundaries. "#" and "*" characters were automatically removed from standard fields that TGS populated. Double spaces were replaced by single spaces in these same fields. Text fields in this dataset have been set to all upper case to facilitate consistent database engine search results. All diacritics (e.g., the German umlaut or the Spanish tilde) have been replaced with their closest equivalent English character to facilitate use with database systems that may not support diacritics. The currentness of this dataset is indicated by the [CONTDATE] field; based on the values in this field, the oldest record dates from 12/27/2004 and the newest record dates from 09/08/2009.
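The field normalization described above (removing "#" and "*", collapsing double spaces, upper-casing, and stripping diacritics) can be reproduced with a short sketch. The NFKD decomposition used here is only an approximation of "closest equivalent English character" (e.g., it maps the umlaut ü to U rather than UE).

    # Minimal sketch of the TGS-style field normalization described above.
    import unicodedata

    def normalize_field(value: str) -> str:
        value = value.replace("#", "").replace("*", "")
        while "  " in value:                      # collapse runs of spaces
            value = value.replace("  ", " ")
        # Decompose accented characters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFKD", value)
        ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
        return ascii_only.upper().strip()

    print(normalize_field("Peñasco  Correctional* Facility"))
    # -> "PENASCO CORRECTIONAL FACILITY"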

Film Circulation dataset (entry 1 above), detailed description continued:

The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is in the wide format, i.e. all information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on the number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no festival run data (individual screenings) is included in the wide format dataset.

3 IMDb & Scripts

The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to a crew member of a given film. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, TV, DVD/Blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.
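A minimal Python sketch of combining two of these files on the film ID is below. The column name ("film_id") is hypothetical; see the codebook “3_codebook_imdb-dataset” for the actual variable names.

    # Minimal sketch: attach a per-film award count from the long awards
    # file to the wide general-info table. Column names are hypothetical.
    import pandas as pd

    general = pd.read_csv("3_imdb-dataset_general-info_wide.csv")
    awards = pd.read_csv("3_imdb-dataset_awards_long.csv")

    award_counts = (awards.groupby("film_id").size()
                          .rename("n_awards").reset_index())
    general = general.merge(award_counts, on="film_id", how="left")
    general["n_awards"] = general["n_awards"].fillna(0).astype(int)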

The dataset includes 8 text files containing the scripts for web scraping. They were written using R-3.6.3 for Windows.

The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.

The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records, first conducting an advanced search based on the movie title and year, and then, if no matches are found, falling back to an alternative title and a basic search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records on the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with the suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach using two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA (optimal string alignment) algorithm is used to match titles that may have typos or minor variations.
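The two string measures named above can be illustrated with a short Python sketch (the original scripts are in R, where the stringdist package provides both as method="cosine" and method="osa"); the implementations below are standard textbook versions, not the authors' code.

    # Cosine similarity over character q-grams and the optimal string
    # alignment (OSA) distance, as used for fuzzy title matching.
    from collections import Counter
    from math import sqrt

    def cosine_sim(a: str, b: str, q: int = 3) -> float:
        ga = Counter(a[i:i + q] for i in range(len(a) - q + 1))
        gb = Counter(b[i:i + q] for i in range(len(b) - q + 1))
        dot = sum(ga[g] * gb[g] for g in ga)
        norm = sqrt(sum(v * v for v in ga.values())) * sqrt(sum(v * v for v in gb.values()))
        return dot / norm if norm else 0.0

    def osa_dist(a: str, b: str) -> int:
        # Edit distance allowing insert, delete, substitute, and
        # transposition of adjacent characters.
        d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            d[i][0] = i
        for j in range(len(b) + 1):
            d[0][j] = j
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
                if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
        return d[len(a)][len(b)]

    print(round(cosine_sim("The Matrix", "Matrix, The"), 2))  # substantial q-gram overlap despite reordering
    print(osa_dist("Amelie", "Aemlie"))                       # 1: one adjacent transposition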

The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of five categories: a) 100% match (perfect match on title, year, and director); b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.

The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does so for the first 100 films only, as a check that everything works. Scraping the entire dataset took a few hours, so a test run on a subsample of 100 films is advisable.

The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

The script “r_check_logs” is used for troubleshooting and for tracking the progress of all of the R scripts. It gives information on the number of missing values and errors.

4 Festival Library Dataset

The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definitions of variables (such as location, festival name, and festival categories), units of measurement, data sources, coding, and information on missing data.

The csv file “4_festival-library_dataset_imdb-and-survey” contains data on all unique festivals collected from both IMDb and survey sources. This dataset appears in wide format, i.e. all information for each festival is listed in one row.
