72 datasets found
  1. Net Zero Use Cases and Data Requirements

    • ukpowernetworks.opendatasoft.com
    csv, excel, json
    Updated Jun 8, 2025
    Cite
    (2025). Net Zero Use Cases and Data Requirements [Dataset]. https://ukpowernetworks.opendatasoft.com/explore/dataset/top-30-use-cases/
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction
    Following the identification of Local Area Energy Planning (LAEP) use cases, this dataset lists the data sources and/or information that could help facilitate this research. View our dedicated page to find out how we derived this list: Local Area Energy Plan — UK Power Networks (opendatasoft.com)

    Methodological Approach
    Data upload: a list of datasets and ancillary details is uploaded into a static Excel file before being uploaded onto the Open Data Portal.

    Quality Control Statement

    Quality Control Measures include:
    Manual review and correction of data inconsistencies
    Use of additional verification steps to ensure accuracy in the methodology

    Assurance Statement
    The Open Data Team and Local Net Zero Team worked together to ensure data accuracy and consistency.

    Other
    Download dataset information: Metadata (JSON)

    Definitions of key terms related to this dataset can be found in the Open Data Portal Glossary: https://ukpowernetworks.opendatasoft.com/pages/glossary/

    Please note that the "number of records" shown in the top left corner is higher than the number of datasets available: many datasets are indexed against multiple use cases, so each such dataset is counted as multiple records.
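    Because of this one-row-per-use-case structure, counting unique dataset names gives the true dataset count. A minimal sketch, assuming the table has been exported as CSV; the file name, the ";" separator, and the "dataset_name" column are illustrative assumptions, not the portal's documented schema.

    ```python
    import pandas as pd

    # Hypothetical export of this table; file name, separator, and the
    # "dataset_name" column are assumptions for illustration.
    df = pd.read_csv("top-30-use-cases.csv", sep=";")

    print("records:", len(df))                        # one row per (dataset, use case)
    print("datasets:", df["dataset_name"].nunique())  # each dataset counted once
    ```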

  2. National Land Cover Database (NLCD) 2016 Accuracy Assessment Points...

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). National Land Cover Database (NLCD) 2016 Accuracy Assessment Points Conterminous United States [Dataset]. https://catalog.data.gov/dataset/national-land-cover-database-nlcd-2016-accuracy-assessment-points-conterminous-united-stat
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    United States, Contiguous United States
    Description

    The National Land Cover Database (NLCD) is a land cover monitoring program providing land cover information for the United States. NLCD2016 extended temporal coverage to 15 years (2001–2016). We collected land cover reference data for the 2011 and 2016 nominal dates to report land cover accuracy for the NLCD2016 database 2011 and 2016 land cover components. We measured land cover accuracy at Level II and Level I, and change accuracy at Level I. For both the 2011 and 2016 land cover components, single-date Level II overall accuracies (OA) were 72% (standard error of ± 0.9%) when agreement was defined as a match between the map label and primary reference label only, and 86% (± 0.7%) when agreement also included the alternate reference label. The corresponding Level I OA for both dates were 79% (± 0.9%) and 91% (± 1.0%). For land cover change, the 2011–2016 user’s and producer’s accuracies (UA and PA) were ~75% for forest loss. PA for water loss, grassland loss, and grass gain were >70% when agreement included a match between the map label and either the primary or alternate reference label. Depending on agreement definition and level of the classification hierarchy, OA for the 2011 land cover component of the NLCD2016 database was about 4% to 7% higher than OA for the 2011 land cover component of the NLCD2011 database, suggesting that the changes in mapping methodologies initiated for production of the NLCD2016 database have led to improved product quality.
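    To make the two agreement definitions concrete, the sketch below computes overall accuracy both ways; the reference points are invented toy records, not NLCD data.

    ```python
    # Toy reference points: map label plus primary and alternate reference labels.
    points = [
        {"map": "forest", "primary": "forest", "alternate": None},
        {"map": "grass",  "primary": "shrub",  "alternate": "grass"},
        {"map": "water",  "primary": "forest", "alternate": None},
    ]

    # Agreement definition 1: map label matches the primary reference label only.
    oa_primary = sum(p["map"] == p["primary"] for p in points) / len(points)

    # Agreement definition 2: a match with the alternate label also counts.
    oa_either = sum(p["map"] in (p["primary"], p["alternate"]) for p in points) / len(points)

    print(oa_primary, oa_either)  # 0.33... 0.66...
    ```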

  3. i07 Water Shortage Vulnerability Sections

    • data.cnra.ca.gov
    • data.ca.gov
    • +4more
    Updated May 29, 2025
    Cite
    California Department of Water Resources (2025). i07 Water Shortage Vulnerability Sections [Dataset]. https://data.cnra.ca.gov/dataset/i07-water-shortage-vulnerability-sections
    Explore at:
    html, kml, csv, arcgis geoservices rest api, geojson, zip
    Dataset authored and provided by
    California Department of Water Resources (http://www.water.ca.gov/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset represents a water shortage vulnerability analysis performed by DWR using modified PLSS sections pulled from the Well Completion Report PLSS Section Summaries. The attribute table includes water shortage vulnerability indicators and scores from an analysis done by CA Department of Water Resources, joined to modified PLSS sections. Several relevant summary statistics from the Well Completion Reports are included in this table as well. This data is from the 2024 analysis.

    Water Code Division 6 Part 2.55 Section 8 Chapter 10 (Assembly Bill 1668) effectively requires the California Department of Water Resources (DWR), in consultation with other agencies and an advisory group, to identify small water suppliers and “rural communities” that are at risk of drought and water shortage. Following legislation passed in 2021 and signed by Governor Gavin Newsom, Water Code Division 6, Sections 10609.50 through 10609.80 (Senate Bill 552 of 2021) effectively requires DWR to update the scoring and tool periodically in partnership with the State Water Board and other state agencies. This document describes the indicators, datasets, and methods used to construct this deliverable. This is a statewide effort to systematically and holistically consider the water shortage vulnerability of rural communities, focusing on domestic wells and state small water systems serving between 4 and 14 connections. The indicators and scoring methodology will be revised as better data become available and stakeholders evaluate the performance of the indicators, the datasets used, and the method used to aggregate and rank vulnerability scores. Additionally, the scoring system should be adaptive, meaning that our understanding of what contributes to risk and vulnerability of drought and water shortage may evolve. This understanding may especially be informed by experiences gained while navigating responses to future droughts.

    A spatial analysis was performed on the 2020 Census Block Groups, modified PLSS sections, and small water system service areas using a variety of input datasets related to drought vulnerability and water shortage risk and vulnerability. These indicator values were subsequently rescaled and summed for a final vulnerability score for the sections and small water system service areas. The 2020 Census Block Groups were joined with ACS data to represent the social vulnerability of communities, which is relevant to drought risk tolerance and resources. These three feature datasets contain the units of analysis (modified PLSS sections, block groups, small water systems service areas) with the model indicators for vulnerability in the attribute table. Model indicators are calculated for each unit of analysis according to the Vulnerability Scoring documents provided by Julia Ekstrom (Division of Regional Assistance).
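    A minimal sketch of the rescale-and-sum scoring described above; the indicator names are placeholders, and min-max rescaling is an assumption about the "rescaled" step (the DWR scoring documentation defines the actual method).

    ```python
    import pandas as pd

    # Toy units of analysis (e.g., modified PLSS sections) with two
    # placeholder indicators; names and values are invented.
    units = pd.DataFrame({
        "well_density":     [2.0, 5.0, 9.0],
        "drought_exposure": [0.1, 0.7, 0.4],
    })

    # Rescale each indicator to 0-1 (min-max is an assumed choice), then
    # sum across indicators for a final vulnerability score per unit.
    rescaled = (units - units.min()) / (units.max() - units.min())
    units["vulnerability_score"] = rescaled.sum(axis=1)
    print(units)
    ```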

    All three feature classes are DWR analysis zones that are based on existing GIS datasets. The spatial data for the sections feature class is extracted from the Well Completion Reports PLSS sections to be aligned with the work and analysis that SGMA is doing. These are not true PLSS sections, but a version of the projected section lines in areas where there are gaps in PLSS. The spatial data for the Census block group feature class is downloaded from the Census. ACS (American Community Survey) data is joined by block group, and statistics calculated by DWR have been added to the attribute table. The spatial data for the small water systems feature class was extracted from the State Water Resources Control Board (SWRCB) SABL dataset, using a definition query to filter for active water systems with 3000 connections or fewer. None of these datasets is intended to be the authoritative dataset for representing PLSS sections, Census block groups, or water service areas. The spatial data of these feature classes is used as units of analysis for the spatial analysis performed by DWR.

    These datasets are intended to be the authoritative datasets for the scoring tools required of DWR by Senate Bill 552. Please refer to the Drought and Water Shortage Vulnerability Scoring: California's Domestic Wells and State Smalls Systems documentation for more information on indicators and scoring. These estimated indicator scores may sometimes be calculated in several different ways, or may have been calculated from data that has since been updated. Counts of domestic wells may be calculated in different ways. In order to align with the California Groundwater Live dashboards of DWR's State Groundwater Management Office (SGMO), domestic wells were calculated using the same query (sketched below). This includes all domestic wells in the Well Completion Reports dataset that were completed after 12/31/1976 and have a 'RecordType' of 'WellCompletion/New/Production or Monitoring/NA'.
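    A sketch of that domestic-well query; the toy table and the "use" / "date_work_ended" column names are assumptions for illustration, while the date cut-off and 'RecordType' value are the ones stated above.

    ```python
    import pandas as pd

    # Toy stand-in for the Well Completion Reports table.
    wcr = pd.DataFrame({
        "use": ["Domestic", "Domestic", "Agricultural"],
        "date_work_ended": ["1980-05-01", "1975-03-10", "1990-07-22"],
        "RecordType": ["WellCompletion/New/Production or Monitoring/NA"] * 3,
    })

    domestic = wcr[
        (wcr["use"] == "Domestic")
        & (pd.to_datetime(wcr["date_work_ended"]) > pd.Timestamp("1976-12-31"))
        & (wcr["RecordType"] == "WellCompletion/New/Production or Monitoring/NA")
    ]
    print(len(domestic))  # 1: domestic and completed after 12/31/1976
    ```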

    Please refer to the Well Completion Reports metadata for more information. The associated data are considered DWR enterprise GIS data, which meet all appropriate requirements of the DWR Spatial Data Standards, specifically the DWR Spatial Data Standard version 3.4, dated September 14, 2022. DWR makes no warranties or guarantees, either expressed or implied, as to the completeness, accuracy, or correctness of the data.

    DWR neither accepts nor assumes liability arising from or for any incorrect, incomplete, or misleading subject data. Comments, problems, improvements, updates, or suggestions should be forwarded to GIS@water.ca.gov.

  4. Film Circulation dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, png
    Updated Jul 12, 2024
    Cite
    Skadi Loist; Evgenia (Zhenya) Samoilova (2024). Film Circulation dataset [Dataset]. http://doi.org/10.5281/zenodo.7887672
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Skadi Loist; Evgenia (Zhenya) Samoilova
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

    A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies, an open-access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org

    Please cite this when using the dataset.


    Detailed description of the dataset:

    1 Film Dataset: Festival Programs

    The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

    The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

    The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
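    A sketch of how the wide table relates to the long one, as described above; the "film_id" and "festival" column names are illustrative assumptions, and the rows are assumed to already be ordered by festival appearance.

    ```python
    import pandas as pd

    # Toy long-format rows: one row per (film, sampled festival).
    long_df = pd.DataFrame({
        "film_id":  [101, 101, 102],
        "festival": ["Berlinale", "Frameline", "Sundance"],
    })

    # The wide format keeps one row per unique film; per the description,
    # the festival kept is the first sample festival where the film appeared.
    wide_df = long_df.drop_duplicates(subset="film_id", keep="first")
    print(wide_df)  # film 101 -> Berlinale, film 102 -> Sundance
    ```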


    2 Survey Dataset

    The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

    The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

    The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.


    3 IMDb & Scripts

    The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

    The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

    The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

    The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

    The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

    The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

    The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

    The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

    The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, tv, dvd/blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

    The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

    The dataset includes 8 text files containing the scripts for web scraping. They were written using R version 3.6.3 for Windows.

    The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.

    The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then potentially using an alternative title and a basic search if no matches are found in the advanced search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records from the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, a fuzzy matching approach with two methods: “cosine” and “osa”, where the cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.
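    The scripts themselves are in R; as a language-neutral illustration of one of the two string-distance methods named above, here is a minimal Python sketch of cosine similarity over character trigrams (OSA, the other method, additionally tolerates transpositions and small edits).

    ```python
    from collections import Counter
    from math import sqrt

    def trigrams(title: str) -> Counter:
        """Character-trigram counts of a lowercased, padded title."""
        s = f"  {title.lower()}  "
        return Counter(s[i:i + 3] for i in range(len(s) - 2))

    def cosine_similarity(a: str, b: str) -> float:
        """Cosine similarity between the trigram count vectors of two titles."""
        va, vb = trigrams(a), trigrams(b)
        dot = sum(va[g] * vb[g] for g in va)
        norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
        return dot / norm if norm else 0.0

    # Small spelling variations still score high.
    print(cosine_similarity("House of Flying Dagers", "House of Flying Daggers"))
    ```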

    The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.

    The script “r_4_scraping_functions” creates functions for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

    The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does that for the first 100 films only, to check that everything works. Scraping for the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.

    The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

    The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

    The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.


    4 Festival Library Dataset

    The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

    The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories,

  5. Best Books Ever Dataset

    • zenodo.org
    csv
    Updated Nov 10, 2020
    Cite
    Lorena Casanova Lozano; Sergio Costa Planells (2020). Best Books Ever Dataset [Dataset]. http://doi.org/10.5281/zenodo.4265096
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lorena Casanova Lozano; Sergio Costa Planells
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The dataset has been collected as part of Prac1 of the subject Typology and Data Life Cycle of the Master's Degree in Data Science of the Universitat Oberta de Catalunya (UOC).

    The dataset contains 25 variables and 52,478 records corresponding to books on the GoodReads Best Books Ever list (the largest list on the site).

    Original code used to retrieve the dataset can be found on github repository: github.com/scostap/goodreads_bbe_dataset

    The data was retrieved in two sets, the first 30,000 books and then the remaining 22,478. Dates were not parsed and reformatted on the second chunk, so publishDate and firstPublishDate are represented in a mm/dd/yyyy format for the first 30,000 records and Month Day Year for the rest.
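    A sketch for normalizing the two date representations when loading the file; the CSV file name and the exact format strings are assumptions (rows whose dates match neither format are simply left as NaT).

    ```python
    import pandas as pd

    # File name is an assumption; publishDate/firstPublishDate are the
    # two date columns described above.
    df = pd.read_csv("books.csv")

    def parse_mixed_date(value):
        """Try 'mm/dd/yyyy' (first chunk), then 'Month Day Year' (second
        chunk); return NaT when neither format matches."""
        for fmt in ("%m/%d/%Y", "%B %d %Y"):
            try:
                return pd.to_datetime(value, format=fmt)
            except (ValueError, TypeError):
                continue
        return pd.NaT

    for col in ("publishDate", "firstPublishDate"):
        df[col] = df[col].map(parse_mixed_date)
    ```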

    Book cover images can be optionally downloaded from the url in the 'coverImg' field. Python code for doing so and an example can be found on the github repo.

    The 25 fields of the dataset are:

    | Attributes | Definition | Completeness (%) |
    | ------------- | ------------- | ------------- | 
    | bookId | Book Identifier as in goodreads.com | 100 |
    | title | Book title | 100 |
    | series | Series Name | 45 |
    | author | Book's Author | 100 |
    | rating | Global goodreads rating | 100 |
    | description | Book's description | 97 |
    | language | Book's language | 93 |
    | isbn | Book's ISBN | 92 |
    | genres | Book's genres | 91 |
    | characters | Main characters | 26 |
    | bookFormat | Type of binding | 97 |
    | edition | Type of edition (ex. Anniversary Edition) | 9 |
    | pages | Number of pages | 96 |
    | publisher | Editorial | 93 |
    | publishDate | publication date | 98 |
    | firstPublishDate | Publication date of first edition | 59 |
    | awards | List of awards | 20 |
    | numRatings | Number of total ratings | 100 |
    | ratingsByStars | Number of ratings by stars | 97 |
    | likedPercent | Derived field, percent of ratings over 2 stars (as in GoodReads) | 99 |
    | setting | Story setting | 22 |
    | coverImg | URL to cover image | 99 |
    | bbeScore | Score in Best Books Ever list | 100 |
    | bbeVotes | Number of votes in Best Books Ever list | 100 |
    | price | Book's price (extracted from Iberlibro) | 73 |

  6. GPQA Dataset

    • paperswithcode.com
    Updated Jan 30, 2025
    Cite
    David Rein; Betty Li Hou; Asa Cooper Stickland; Jackson Petty; Richard Yuanzhe Pang; Julien Dirani; Julian Michael; Samuel R. Bowman (2025). GPQA Dataset [Dataset]. https://paperswithcode.com/dataset/gpqa
    Authors
    David Rein; Betty Li Hou; Asa Cooper Stickland; Jackson Petty; Richard Yuanzhe Pang; Julien Dirani; Julian Michael; Samuel R. Bowman
    Description

    GPQA stands for Graduate-Level Google-Proof Q&A Benchmark. It is a challenging dataset designed to evaluate the capabilities of Large Language Models (LLMs) and scalable oversight mechanisms.

    Description: GPQA consists of 448 multiple-choice questions meticulously crafted by domain experts in biology, physics, and chemistry. These questions are intentionally designed to be high-quality and extremely difficult.

    Expert Accuracy: Even experts who hold or are pursuing PhDs in the corresponding domains achieve only 65% accuracy on these questions (or 74% when excluding clear mistakes identified in retrospect).

    Google-Proof: The questions are "Google-proof," meaning that even with unrestricted access to the web, highly skilled non-expert validators only reach an accuracy of 34% despite spending over 30 minutes searching for answers.

    AI Systems Difficulty: State-of-the-art AI systems, including the authors' strongest GPT-4 based baseline, achieve only 39% accuracy on this challenging dataset.

    The difficulty of GPQA for both skilled non-experts and cutting-edge AI systems makes it an excellent resource for conducting realistic scalable oversight experiments. These experiments aim to explore ways for human experts to reliably obtain truthful information from AI systems that surpass human capabilities [1][3].

    In summary, GPQA serves as a valuable benchmark for assessing the robustness and limitations of language models, especially when faced with complex and nuanced questions. Its difficulty level encourages research into effective oversight methods, bridging the gap between AI and human expertise.

    References:
    (1) GPQA: A Graduate-Level Google-Proof Q&A Benchmark. arXiv. https://arxiv.org/abs/2311.12022
    (2) GPQA: A Graduate-Level Google-Proof Q&A Benchmark — Klu. https://klu.ai/glossary/gpqa-eval
    (3) GPA Dataset (Spring 2010 through Spring 2020) - Data Science Discovery. https://discovery.cs.illinois.edu/dataset/gpa/
    (4) GPQA: A Graduate-Level Google-Proof Q&A Benchmark - GitHub. https://github.com/idavidrein/gpqa
    (5) Data Sets - OpenIntro. https://www.openintro.org/data/index.php?data=satgpa

  7. Oceanic Life Dataset Dataset

    • paperswithcode.com
    Updated Mar 12, 2025
    + more versions
    Cite
    (2025). Oceanic Life Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/oceanic-life-dataset
    Description

    Description:

    Welcome to the Oceanic_Life-Dataset, an extraordinary collection of 7,990 high-resolution images that reveal the vast and diverse beauty of the ocean’s inhabitants. This meticulously curated dataset highlights the vibrant and intricate life forms thriving beneath the waves, from vibrant corals to swift fish, stealthy octopuses, and apex predators like sharks. The Oceanic_Life-Dataset offers a comprehensive visual resource that allows researchers, educators, and enthusiasts to dive deep into the ocean’s mysterious ecosystems.


    Key Features:

    Expansive Image Database: With a vast array of 7,990 professionally curated images, this dataset presents a detailed exploration of marine life, providing a broad perspective on the species populating our oceans.

    Diverse Marine Species: This dataset captures an impressive variety of aquatic creatures, from the delicate corals forming vibrant underwater landscapes to graceful fish, majestic sharks, intricate cephalopods, and other remarkable ocean inhabitants. It encapsulates the stunning biodiversity of marine ecosystems in unprecedented detail.

    Exceptional Image Quality: Each image has been selected with care to ensure optimal clarity and high-definition visual accuracy. Whether you’re studying marine biology or conducting research on marine habitats, these visuals provide the intricate details necessary for an in-depth analysis of marine species.

    Broad Taxonomic Coverage: Spanning a wide range of oceanic life, the dataset includes images of various marine species, from tropical coral reefs to deep-sea organisms. It serves as a critical resource for biodiversity research, enabling the study of species interactions, population dynamics, and ecological roles.

    Ideal for Research, Education, and Conservation:

    This dataset is designed to be a powerful tool for scientific research, educational purposes, and conservation efforts. Researchers can leverage this data for machine learning models aimed at identifying species, while educators can use the images to engage students in marine biology lessons. Conservationists can also use the dataset to bring awareness to the rich diversity found in our oceans and the importance of protecting it.

    Enhanced Conservation Efforts: By visually capturing the beauty and complexity of marine life, the Oceanic_Life-Dataset encourages a deeper appreciation of the underwater world. It can serve as a strong foundation for campaigns that promote marine conservation, sustainability, and environmental stewardship.

    Enriched Ecological Insights: Researchers can use this dataset to explore ecological relationships, study the impact of human activity on oceanic species, and develop data-driven solutions to preserve fragile marine ecosystems. The diversity within the dataset makes it suitable for AI-based research, image classification, and the development of conservation strategies.

    Applications:

    Marine Biology & Ecology: This dataset supports the study of ocean ecosystems, species distribution, and habitat interactions.

    Artificial Intelligence & Machine Learning: Ideal for training computer vision models to recognize and categorize marine species.

    Environmental Monitoring: A valuable resource for assessing the health of marine ecosystems and identifying changes in biodiversity due to climate change or pollution.

    Education & Outreach: Engages audiences of all ages by providing captivating visuals that highlight the need for ocean conservation.

    This dataset is sourced from Kaggle.

  8. Evaluation results for play detection.

    • plos.figshare.com
    xls
    Updated Apr 18, 2024
    + more versions
    Cite
    Jonas Bischofberger; Arnold Baca; Erich Schikuta (2024). Evaluation results for play detection. [Dataset]. http://doi.org/10.1371/journal.pone.0298107.t003
    Dataset provided by
    PLOS ONE
    Authors
    Jonas Bischofberger; Arnold Baca; Erich Schikuta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    With recent technological advancements, quantitative analysis has become an increasingly important area within professional sports. However, the manual process of collecting data on relevant match events like passes, goals and tackles comes with considerable costs and limited consistency across providers, affecting both research and practice. In football, while automatic detection of events from positional data of the players and the ball could alleviate these issues, it is not entirely clear what accuracy current state-of-the-art methods realistically achieve because there is a lack of high-quality validations on realistic and diverse data sets. This paper adds context to existing research by validating a two-step rule-based pass and shot detection algorithm on four different data sets using a comprehensive validation routine that accounts for the temporal, hierarchical and imbalanced nature of the task. Our evaluation shows that pass and shot detection performance is highly dependent on the specifics of the data set. In accordance with previous studies, we achieve F-scores of up to 0.92 for passes, but only when there is an inherent dependency between event and positional data. We find a significantly lower accuracy, with F-scores of 0.71 for passes and 0.65 for shots, if event and positional data are independent. This result, together with a critical evaluation of existing methodologies, suggests that the accuracy of current football event detection algorithms operating on positional data is currently overestimated. Further analysis reveals that the temporal extraction of passes and shots from positional data poses the main challenge for rule-based approaches. Our results further indicate that the classification of plays into shots and passes is a relatively straightforward task, achieving F-scores between 0.83 and 0.91 for rule-based classifiers and up to 0.95 for machine learning classifiers. We show that there exist simple classifiers that accurately differentiate shots from passes in different data sets using a low number of human-understandable rules. Operating on basic spatial features, our classifiers provide a simple, objective event definition that can be used as a foundation for more reliable event-based match analysis.
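    For reference, the F-score quoted throughout is the harmonic mean of precision and recall; a minimal computation with invented detection counts:

    ```python
    # Invented counts, chosen to land near the 0.71 figure quoted above.
    tp, fp, fn = 71, 29, 29  # true positives, false positives, false negatives

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    print(round(f1, 2))  # 0.71
    ```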

  9. Long Term Development Statement (LTDS) Table 8 >95% Fault Data

    • ukpowernetworks.opendatasoft.com
    Updated May 30, 2025
    Cite
    (2025). Long Term Development Statement (LTDS) Table 8 >95% Fault Data [Dataset]. https://ukpowernetworks.opendatasoft.com/explore/dataset/ltds-table-8-gt-95-perc-fault-data/
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction
    The Long Term Development Statements (LTDS) report on a 0-5 year period, describing a forecast of load on the network and envisioned network developments. The LTDS is published at the end of May and November each year.

    Long Term Development Statement Table 8 indicates any Fault Level restrictions or mitigations in place at our Grid and Primary substations. Published 30 May 2025.

    More information and full reports are available from the landing page below: Long Term Development Statement and Network Development Plan Landing Page

    Methodological Approach

    Site Functional Locations (FLOCs) are used to associate each substation with the Key characteristics of active Grid and Primary sites dataset — UK Power Networks
    An ID field is added to identify the row number for reference purposes

    Quality Control Statement
    Quality Control Measures include:

    Verification steps to match features only with confirmed functional locations.
    Manual review and correction of data inconsistencies.
    Use of additional verification steps to ensure accuracy in the methodology.
    

    Assurance Statement
    The Open Data Team and Network Insights Team worked together to ensure data accuracy and consistency.

    Other
    Download dataset information: Metadata (JSON)

    Definitions of key terms related to this dataset can be found in the Open Data Portal Glossary: https://ukpowernetworks.opendatasoft.com/pages/glossary/

  10. Long Term Development Statement (LTDS) Table 1 Circuit data

    • ukpowernetworks.opendatasoft.com
    Updated May 30, 2025
    Cite
    (2025). Long Term Development Statement (LTDS) Table 1 Circuit data [Dataset]. https://ukpowernetworks.opendatasoft.com/explore/dataset/ltds-table-1-circuit-data/
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction
    The Long Term Development Statement (LTDS) reports on a 0-5 year period, describing a forecast of load on the network and envisioned network developments. The LTDS is published at the end of May and November each year. This is Table 1 from our current LTDS report (published 30 May 2025), showing the circuit associated with each Grid and Primary substation. More information and full reports are available from the landing page below: Long Term Development Statement and Network Development Plan Landing Page — UK Power Networks (opendatasoft.com)

    Methodological Approach

    Site Functional Locations (FLOCs) are used to associate the From Substation with the Key characteristics of active Grid and Primary sites dataset — UK Power Networks
    An ID field is added to identify the row number for reference purposes

    Quality Control Statement
    Quality Control Measures include:

    Verification steps to match features only with confirmed functional locations.
    Manual review and correction of data inconsistencies.
    Use of additional verification steps to ensure accuracy in the methodology.

    Assurance Statement
    The Open Data Team and Network Insights Team worked together to ensure data accuracy and consistency.

    Other
    Download dataset information: Metadata (JSON)

    Definitions of key terms related to this dataset can be found in the Open Data Portal Glossary.

  11. ‘COVID-19 Cases by Population Characteristics Over Time’ analyzed by...

    • analyst-2.ai
    Updated Feb 15, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘COVID-19 Cases by Population Characteristics Over Time’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/data-gov-covid-19-cases-by-population-characteristics-over-time-097d/6c8f14dd/?iid=004-510&v=presentation
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘COVID-19 Cases by Population Characteristics Over Time’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://catalog.data.gov/dataset/a3291d85-0076-43c5-a59c-df49480cdc6d on 13 February 2022.

    --- Dataset description provided by original source is as follows ---

    Note: On January 22, 2022, system updates to improve the timeliness and accuracy of San Francisco COVID-19 cases and deaths data were implemented. You might see some fluctuations in historic data as a result of this change. Due to the changes, starting on January 22, 2022, the number of new cases reported daily will be higher than under the old system as cases that would have taken longer to process will be reported earlier.

    A. SUMMARY This dataset shows San Francisco COVID-19 cases by population characteristics and by specimen collection date. Cases are included on the date the positive test was collected.

    Population characteristics are subgroups, or demographic cross-sections, like age, race, or gender. The City tracks how cases have been distributed among different subgroups. This information can reveal trends and disparities among groups.

    Data is lagged by five days, meaning the most recent specimen collection date included is 5 days prior to today. Tests take time to process and report, so more recent data is less reliable.

    B. HOW THE DATASET IS CREATED Data on the population characteristics of COVID-19 cases and deaths are from:
    • Case interviews
    • Laboratories
    • Medical providers

    These multiple streams of data are merged, deduplicated, and undergo data verification processes. This data may not be immediately available for recently reported cases because of the time needed to process tests and validate cases. Daily case totals on previous days may increase or decrease. Learn more.

    Data are continually updated to maximize completeness of information and reporting on San Francisco residents with COVID-19.

    Data notes on each population characteristic type are listed below.

    Race/ethnicity
    • We include all race/ethnicity categories that are collected for COVID-19 cases.
    • The population estimates for the "Other" or “Multi-racial” groups should be considered with caution. The Census definition is likely not exactly aligned with how the City collects this data. For that reason, we do not recommend calculating population rates for these groups.

    Sexual orientation
    • Sexual orientation data is collected from individuals who are 18 years old or older. These individuals can choose whether to provide this information during case interviews. Learn more about our data collection guidelines.
    • The City began asking for this information on April 28, 2020.

    Gender
    • The City collects information on gender identity using these guidelines.

    Comorbidities
    • Underlying conditions are reported when a person has one or more underlying health conditions at the time of diagnosis or death.

    Transmission type
    • Information on transmission of COVID-19 is based on case interviews with individuals who have a confirmed positive test. Individuals are asked if they have been in close contact with a known COVID-19 case. If they answer yes, transmission category is recorded as contact with a known case. If they report no contact with a known case, transmission category is recorded as community transmission. If the case is not interviewed or was not asked the question, they are counted as unknown.

    Homelessness
    Persons are identified as homeless based on several data sources:
    • self-reported living situation
    • the location at the time of testing
    • Department of Public Health homelessness and health databases
    Residents in Single-Room Occupancy hotels are not included in these figures. These methods serve as an estimate of persons experiencing homelessness. They may not meet other homelessness definitions.

    Skilled Nursing Facility (SNF) occupancy
    • A Skilled Nursing

    --- Original source retains full ownership of the source dataset ---

  12. Data Roadmap and Tracker

    • ukpowernetworks.opendatasoft.com
    csv, excel, json
    Updated Jun 24, 2025
    Cite
    (2025). Data Roadmap and Tracker [Dataset]. https://ukpowernetworks.opendatasoft.com/explore/dataset/ukpn-external-facing-tracker/
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction
    This dataset provides visibility of which datasets are under review and when they may be published on the Portal. The datasets are all triaged and have the following triage ratings:

    1 - In review
    2 - Published
    3 - Published with restricted access
    4 - Rejected

    Methodological Approach
    This dataset is fed by a static dataset via SharePoint, and is updated as and when there is an update.

    Quality Control Statement
    Quality Control Measures include:

    Manual review and correction of data inconsistencies
    Use of additional verification steps to ensure accuracy in the methodology

    Assurance Statement
    The Open Data Team has checked to ensure data accuracy and consistency.

    Other
    Download dataset information: Metadata (JSON)

    Definitions of key terms related to this dataset can be found in the Open Data Portal Glossary.

  13. LScDC (Leicester Scientific Dictionary-Core)

    • figshare.le.ac.uk
    docx
    Updated Apr 15, 2020
    + more versions
    Cite
    Neslihan Suzen (2020). LScDC (Leicester Scientific Dictionary-Core) [Dataset]. http://doi.org/10.25392/leicester.data.9896579.v3
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

    The LScDC (Leicester Scientific Dictionary-Core), April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

    [Version 3] The third version of LScDC (Leicester Scientific Dictionary-Core) is formed using the updated LScD (Leicester Scientific Dictionary) - Version 3*. All steps applied to build the new version of the core dictionary are the same as in Version 2** and can be found in the description of Version 2 below; we did not repeat the explanation. The files provided with this description are also the same as described for LScDC Version 2. The numbers of words in the 3rd versions of LScD and LScDC are summarized below.

    | Dictionary | # of words |
    | ------------- | ------------- |
    | LScD (v3) | 972,060 |
    | LScDC (v3) | 103,998 |

    * Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v3
    ** Suzen, Neslihan (2019): LScDC (Leicester Scientific Dictionary-Core). figshare. Dataset. https://doi.org/10.25392/leicester.data.9896579.v2

    [Version 2] Getting Started
    This file describes a sorted and cleaned list of words from the LScD (Leicester Scientific Dictionary), explains the steps for sub-setting the LScD, and gives basic statistics of words in the LSC (Leicester Scientific Corpus), to be found in [1, 2]. The LScDC (Leicester Scientific Dictionary-Core) is a list of words ordered by the number of documents containing them, and is available in the published CSV file. There are 104,223 unique words (lemmas) in the LScDC. This dictionary was created to be used in future work on the quantification of the sense of research texts.

    The objective of sub-setting the LScD is to discard words which appear too rarely in the corpus. In text mining algorithms, use of an enormous number of text data challenges the performance and accuracy of data mining applications. The performance and accuracy of models depend heavily on the type of words (such as stop words and content words) and the number of words in the corpus. Rare occurrence of words in a collection is not useful for discriminating texts in large corpora, as rare words are likely to be non-informative signals (or noise) and redundant in the collection of texts. The selection of relevant words also holds out the possibility of more effective and faster operation of text mining algorithms.

    To build the LScDC, we decided the following process on LScD: removing words that appear in no more than 10 documents (
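    A minimal sketch of that rare-word filtering step; the toy corpus is invented, while the threshold of 10 documents is the one stated above (with the tiny corpus here, the resulting core set is of course empty).

    ```python
    from collections import Counter

    # Toy corpus: each document is a list of lemmas.
    docs = [
        ["cell", "protein", "assay"],
        ["protein", "model"],
        ["model", "assay", "protein"],
    ]

    # Document frequency: the number of documents containing each word.
    doc_freq = Counter(word for doc in docs for word in set(doc))

    # Core dictionary: discard words appearing in no more than 10 documents.
    MIN_DOCS = 10
    core = {word for word, n in doc_freq.items() if n > MIN_DOCS}
    print(core)  # empty for this toy corpus; the real LScD spans ~1M words
    ```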

  14. National Land Cover Database (NLCD) 2019 Accuracy Assessment Points...

    • gimi9.com
    • s.cnmilf.com
    • +1more
    Updated Mar 1, 2023
    Cite
    (2023). National Land Cover Database (NLCD) 2019 Accuracy Assessment Points Conterminous United States [Dataset]. https://gimi9.com/dataset/data-gov_national-land-cover-database-nlcd-2019-accuracy-assessment-points-conterminous-united-stat/
    Area covered
    United States
    Description

    The National Land Cover Database (NLCD), a product suite produced through the Multi-resolution Land Characteristics (MRLC) consortium, is an operational land cover monitoring program. The release of NLCD2019 extends the database to 18 years. We collected land cover reference data for the 2016 and 2019 components of the NLCD2019 database at Level II and Level I of the classification hierarchy. For both dates, Level II land cover overall accuracies (OA) were 77.5% ± 1% (± value is the standard error) when agreement was defined as a match between the map label and primary reference label only and increased to 87.1% ± 0.7% when agreement was defined as a match between the map label and either the primary or alternate reference label. At Level I of the classification hierarchy, land cover OA was 83.1% ± 0.9% for both 2016 and 2019 when agreement was defined as a match between the map label and primary reference label only and increased to 90.3% ± 0.7% when agreement also included the alternate reference label. The Level II and Level I OA for the 2016 land cover in the NLCD2019 database were 5% higher compared to the 2016 land cover component of the NLCD2016 database when agreement was defined as a match between the map label and primary reference label only. No improvement was realized by the NLCD2019 database when agreement also included the alternate reference label. User’s accuracies (UA) for forest loss and grass gain were 70% when agreement included either the primary or alternate label, and UA was generally 50% for all other change themes. Producer’s accuracies (PA) were 70% for grass loss and gain and water gain and generally 50% for the other change themes.

  15. ARCHIVED: COVID-19 Cases by Geography Over Time

    • data.sfgov.org
    application/rdfxml +5
    Updated Jul 17, 2020
    + more versions
    Cite
    Department of Public Health - Population Health Division (2020). ARCHIVED: COVID-19 Cases by Geography Over Time [Dataset]. https://data.sfgov.org/w/d2ef-idww/ikek-yizv?cur=6pe39zMjfCR&from=f5tFBDuJcU8
    Explore at:
    xml, application/rdfxml, json, tsv, csv, application/rssxml
    Dataset authored and provided by
    Department of Public Health - Population Health Division
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Description

    A. SUMMARY This dataset contains COVID-19 positive confirmed cases aggregated by several different geographic areas and by day. COVID-19 cases are mapped to the residence of the individual and shown on the date the positive test was collected. In addition, 2016-2020 American Community Survey (ACS) population estimates are included to calculate the cumulative rate per 10,000 residents.

    Dataset covers cases going back to 3/2/2020, when testing began. This data may not be immediately available for recently reported cases, and the data will change as information becomes available. Data updated daily.

    Geographic areas summarized are:
    1. Analysis Neighborhoods
    2. Census Tracts
    3. Census Zip Code Tabulation Areas

    B. HOW THE DATASET IS CREATED Addresses from the COVID-19 case data are geocoded by the San Francisco Department of Public Health (SFDPH). Those addresses are spatially joined to the geographic areas. Counts are generated based on the number of address points that match each geographic area for a given date.

    The 2016-2020 American Community Survey (ACS) population estimates provided by the Census are used to create a cumulative rate, equal to ([cumulative count up to that date] / [acs_population]) * 10,000, representing the number of total cases per 10,000 residents (as of the specified date).
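    A sketch of that cumulative-rate calculation; the toy frame and the "new_cases" column name are assumptions, while "acs_population" is the column named in this description.

    ```python
    import pandas as pd

    # Toy daily counts for one geographic area.
    df = pd.DataFrame({
        "new_cases": [3, 5, 2],
        "acs_population": [25_000, 25_000, 25_000],
    })

    # Cumulative rate = (cumulative count to date / acs_population) * 10,000.
    df["cumulative_cases"] = df["new_cases"].cumsum()
    df["cumulative_rate"] = df["cumulative_cases"] / df["acs_population"] * 10_000
    print(df)
    ```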

    COVID-19 case data undergo quality assurance and other data verification processes and are continually updated to maximize completeness and accuracy of information. This means data may change for previous days as information is updated.

    C. UPDATE PROCESS Geographic analysis is scripted by SFDPH staff and synced to this dataset daily at 05:00 Pacific Time.

    D. HOW TO USE THIS DATASET San Francisco population estimates for geographic regions can be found in a view based on the San Francisco Population and Demographic Census dataset. These population estimates are from the 2016-2020 5-year American Community Survey (ACS).

    This dataset can be used to track the spread of COVID-19 throughout the city, in a variety of geographic areas. Note that the new cases column in the data represents the number of new cases confirmed in a certain area on the specified day, while the cumulative cases column is the cumulative total of cases in a certain area as of the specified date.

    Privacy rules in effect
    To protect privacy, certain rules are in effect:
    1. Any area with a cumulative case count less than 10 is dropped for all days the cumulative count was less than 10. These will be null values.
    2. Once an area has a cumulative case count of 10 or greater, that area will have a new row of case data every day following.
    3. Cases are dropped altogether for areas where acs_population < 1000.
    4. Deaths data are not included in this dataset for privacy reasons. The low COVID-19 death rate in San Francisco, along with other publicly available information on deaths, means that death data by geography and day are too granular and potentially risky. Read more in our privacy guidelines.

    Rate suppression in effect where counts are lower than 20
    Rates are not calculated unless the cumulative case count is greater than or equal to 20. Rates are generally unstable at small numbers, so we avoid calculating them directly. We advise you to apply the same approach, as this is best practice in epidemiology.
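    A sketch of how the null rule and the rate suppression combine; the thresholds are the stated ones, the toy frame is invented.

    ```python
    import pandas as pd

    df = pd.DataFrame({
        "cumulative_cases": [8, 15, 24],
        "acs_population": [25_000, 25_000, 25_000],
    })

    # Privacy rule 1: counts are null until the cumulative count reaches 10.
    df.loc[df["cumulative_cases"] < 10, "cumulative_cases"] = None

    # Rate suppression: rates only where the cumulative count is >= 20.
    df["cumulative_rate"] = (
        df["cumulative_cases"] / df["acs_population"] * 10_000
    ).where(df["cumulative_cases"] >= 20)
    print(df)  # rate present only in the last row
    ```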

    A note on Census ZIP Code Tabulation Areas (ZCTAs)
    ZIP Code Tabulation Areas are special boundaries created by the U.S. Census based on ZIP Codes developed by the USPS. They are not, however, the same thing. ZCTAs are areal representations of routes. Read how the Census develops ZCTAs on their website.

    Rows included for Citywide case counts
    Rows are included for the Citywide case counts and incidence rate every day. These Citywide rows can be used for comparisons. Citywide will capture all cases regardless of address quality. While some cases cannot be mapped to sub-areas like Census Tracts, ongoing data quality efforts result in improved mapping on a rolling basis.

    Related dataset
    See the dataset of the most recent cumulative counts for all geographic areas here: https://data.sfgov.org/COVID-19/COVID-19-Cases-and-Deaths-Summarized-by-Geography/tpyr-dvnc

    E. CHANGE LOG

    • 9/11/2023 - data on COVID-19 cases by geography over time are no longer being updated. This data is currently through 9/6/2023 and will not include any new data after this date.
    • 4/6/2023 - the State implemented system updates to improve the integrity of historical data.
    • 2/21/2023 - system updates to improve reliability and accuracy of cases data were implemented.
    • 1/31/2023 - updated “acs_population” column to reflect the 2020 Census Bureau American Community Survey (ACS) San Francisco Population estimates.
    • 1/31/2023 - implemented system updates to streamline and improve our geo-coded data, resulting in small shifts in our case data by geography.
    • 1/31/2023 - renamed column “last_updated_at” to “data_as_of”.
    • 1/31/2023 - removed the “multipolygon” column. To access the multipolygon geometry column for each geography unit, refer to COVID-19 Cases and Deaths Summarized by Geography.
    • 1/22/2022 - system updates to improve timeliness and accuracy of cases and deaths data were implemented.
    • 4/16/2021 - dataset updated to refresh with a five-day data lag.

  16. Above and below ground biomass carbon and soil organic carbon to 1m depth...

    • rwanda.africageoportal.com
    • climate.esri.ca
    • +8more
    Updated May 4, 2021
    Cite
    UN Environment World Conservation Monitoring Centre (2021). Above and below ground biomass carbon and soil organic carbon to 1m depth (tonnes/ha) [Dataset]. https://rwanda.africageoportal.com/datasets/e04ea4f7ecbb4e7fb0325dfa24af6969
    Explore at:
    Dataset updated
    May 4, 2021
    Dataset provided by
    World Conservation Monitoring Centre (http://www.unep-wcmc.org/)
    Authors
    UN Environment World Conservation Monitoring Centre
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Area covered
    Description

    This dataset represents above- and below-ground terrestrial carbon storage (tonnes (t) of C per hectare (ha)) for circa 2010. This layer is intended to support analysis; if needed, a direct download of the data can be accessed here.

    The dataset was constructed by combining the most reliable publicly available datasets and overlying them with the ESA CCI landcover map for the year 2010 [ESA, 2017], assigning to each grid cell the corresponding above-ground biomass value from the biomass map that was most appropriate for the grid cell’s landcover type.

    Input carbon datasets were identified through a literature review of existing datasets on biomass carbon in terrestrial ecosystems published in peer-reviewed literature. To determine which datasets to combine to produce the global carbon density map, identified datasets were evaluated based on resolution, accuracy, biomass definition and reference date (see table 1 for further information on datasets selected).

    Table 1. Input carbon datasets selected.

    | Dataset | Scope | Year | Resolution | Definition |
    |---|---|---|---|---|
    | Santoro et al. 2018 | Global | 2010 | 100 m | Above-ground woody biomass for trees >10 cm diameter-at-breast-height, masked to Landsat-derived canopy cover for 2010; biomass is expressed as oven-dry weight of the woody parts (stem, bark, branches and twigs) of all living trees, excluding stump and roots. |
    | Xia et al. 2014 | Global | 1982-2006 | 8 km | Above-ground grassland biomass. |
    | Bouvet et al. 2018 | Africa | 2010 | 25 m | Above-ground woodland and savannah biomass; covers low woody biomass areas only, therefore excluding dense forests and deserts. |
    | Spawn et al. 2017 | Global | 2010 | 300 m | Synthetic, global above- and below-ground biomass maps that combine recently released satellite-based data of standing forest biomass with novel estimates for non-forest biomass stocks. |

    After aggregating each selected dataset to a nominal scale of 300 m resolution, forest categories in the CCI ESA 2010 landcover dataset were used to extract above-ground biomass from Santoro et al. 2018 for forest areas. Woodland and savanna biomass were then incorporated for Africa from Bouvet et al. 2018, and from Santoro et al. 2018 for areas outside of Africa and outside of forest. Biomass for the cropland, sparse vegetation and grassland landcover classes from CCI ESA, as well as for shrubland areas outside Africa missing from Santoro et al. 2018, was extracted from Xia et al. 2014 and Spawn et al. 2017, averaged by ecological zone for each landcover type.
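    As a toy illustration of this per-landcover-class selection (not the production workflow, which would operate on full rasters with tools such as GDAL or rasterio), small numpy arrays can stand in for the 300 m grids; the class codes and biomass values below are invented:

    ```python
    import numpy as np

    # Toy stand-ins for the 300 m rasters; codes and values are invented.
    landcover = np.array([[50, 60], [130, 10]])          # CCI-style class codes
    forest_agb = np.array([[210.0, 180.0], [0.0, 0.0]])  # e.g. from Santoro et al.
    grass_agb = np.array([[0.0, 0.0], [6.5, 3.0]])       # e.g. from Xia et al.

    FOREST_CLASSES = [50, 60, 70, 80, 90]  # assumed subset of CCI forest codes

    # Pick the forest biomass map inside forest classes, the grassland map
    # elsewhere, cell by cell.
    is_forest = np.isin(landcover, FOREST_CLASSES)
    agb = np.where(is_forest, forest_agb, grass_agb)
    print(agb)
    ```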

    Below-ground biomass was added using root-to-shoot ratios from the 2006 IPCC Guidelines for National Greenhouse Gas Inventories (IPCC, 2006). No below-ground values were assigned to croplands, as ratios were unavailable. Above- and below-ground biomass were then summed and multiplied by 0.5 to convert to carbon, generating a single above- and below-ground biomass carbon layer.
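    As a worked illustration of that conversion (not the authors' code), the snippet below applies a hypothetical root-to-shoot ratio to one grid cell's above-ground biomass and converts the total to carbon:

    ```python
    # Illustrative only: the root-to-shoot ratio here is a placeholder, not a
    # specific IPCC (2006) value; the 0.5 carbon fraction comes from the text.
    agb_t_per_ha = 120.0    # above-ground biomass for one grid cell (t/ha)
    root_to_shoot = 0.25    # hypothetical ratio for the cell's landcover class
    carbon_fraction = 0.5   # biomass-to-carbon conversion factor

    bgb_t_per_ha = agb_t_per_ha * root_to_shoot             # below-ground biomass
    carbon_t_per_ha = (agb_t_per_ha + bgb_t_per_ha) * carbon_fraction
    print(f"{carbon_t_per_ha:.1f} t C/ha")                  # 75.0 for these inputs
    ```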

  17. f

    Data from "Obstacles to the Reuse of Study Metadata in ClinicalTrials.gov"

    • figshare.com
    zip
    Updated Jun 1, 2023
    Cite
    Laura Miron; Rafael Gonçalves; Mark A. Musen (2023). Data from "Obstacles to the Reuse of Study Metadata in ClinicalTrials.gov" [Dataset]. http://doi.org/10.6084/m9.figshare.12743939.v2
    Explore at:
    zip — available download formats
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    figshare
    Authors
    Laura Miron; Rafael Gonçalves; Mark A. Musen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This fileset provides supporting data and corpora for the empirical study described in: Laura Miron, Rafael S. Goncalves and Mark A. Musen, "Obstacles to the Reuse of Study Metadata in ClinicalTrials.gov".

    Description of files

    Original data files:
    - AllPublicXml.zip contains the set of all public XML records in ClinicalTrials.gov (protocols and summary results information), on which all remaining analyses are based. The set contains 302,091 records downloaded on April 3, 2019.
    - public.xsd is the XML schema downloaded from ClinicalTrials.gov on April 3, 2019, used to validate records in AllPublicXML.

    BioPortal API query results:
    - condition_matches.csv contains the results of querying the BioPortal API for all ontology terms that are an 'exact match' to each condition string scraped from the ClinicalTrials.gov XML. Columns = {filename, condition, url, bioportal term, cuis, tuis}.
    - intervention_matches.csv contains BioPortal API query results for all interventions scraped from the ClinicalTrials.gov XML. Columns = {filename, intervention, url, bioportal term, cuis, tuis}.

    Data element definitions:
    - supplementary_table_1.xlsx maps element names, element types, and whether elements are required across the ClinicalTrials.gov data dictionaries, the ClinicalTrials.gov XML schema declaration for records (public.XSD), the Protocol Registration System (PRS), FDAAA801, and the WHO required data elements for clinical trial registrations.

    Column and value definitions:
    - CT.gov Data Dictionary Section: section heading for a group of data elements in the ClinicalTrials.gov data dictionary (https://prsinfo.clinicaltrials.gov/definitions.html).
    - CT.gov Data Dictionary Element Name: name of an element/field according to the ClinicalTrials.gov data dictionaries (https://prsinfo.clinicaltrials.gov/definitions.html and https://prsinfo.clinicaltrials.gov/expanded_access_definitions.html).
    - CT.gov Data Dictionary Element Type: "Data" if the element is a field for which the user provides a value; "Group Heading" if the element is a group heading for several sub-fields but is not itself associated with a user-provided value.
    - Required in CT.gov for Interventional Records: "Required" if the element is required for interventional records according to the data dictionary; "CR" if conditionally required; "Jan 2017" if required for studies starting on or after January 18, 2017, the effective date of the FDAAA801 Final Rule; "-" if the element is not applicable to interventional records (only observational or expanded access).
    - Required in CT.gov for Observational Records: as above, for observational records ("-" if the element applies only to interventional or expanded access records).
    - Required in CT.gov for Expanded Access Records: as above, for expanded access records ("-" if the element applies only to interventional or observational records).
    - CT.gov XSD Element Definition: abbreviated XPath to the corresponding element in the ClinicalTrials.gov XSD (public.XSD). The full XPath prefixes every element with 'clinical_study/'; there is a single top-level element called "clinical_study" for all other elements.
    - Required in XSD?: "Yes" if the element is required according to public.XSD; "No" if optional; "-" if the element is not made public or included in the XSD.
    - Type in XSD: "text" if the XSD type was "xs:string" or "textblock"; the name of the enum if the type was an enum; "integer" if the type was "xs:integer" or "xs:integer" extended with the "type" attribute; "struct" if the type was a struct defined in the XSD.
    - PRS Element Name: name of the corresponding entry field in the PRS system.
    - PRS Entry Type: entry type in the PRS system. This column contains some free-text explanations/observations.
    - FDAAA801 Final Rule Field Name: name of the corresponding required field in the FDAAA801 Final Rule (https://www.federalregister.gov/documents/2016/09/21/2016-22129/clinical-trials-registration-and-results-information-submission). This column contains many empty values where elements in ClinicalTrials.gov do not correspond to a field required by the FDA.
    - WHO Field Name: name of the corresponding field required by the WHO Trial Registration Data Set (v 1.3.1) (https://prsinfo.clinicaltrials.gov/trainTrainer/WHO-ICMJE-ClinTrialsgov-Cross-Ref.pdf).

    Analytical results:
    - EC_human_review.csv contains the results of a manual review of a random sample of eligibility criteria from 400 CT.gov records. The table gives filename, criteria, and whether manual review determined the criteria to contain criteria for "multiple subgroups" of participants.
    - completeness.xlsx contains counts and percentages of interventional records missing fields required by FDAAA801 and its Final Rule.
    - industry_completeness.xlsx contains percentages of interventional records missing required fields, broken up by agency class of the trial's lead sponsor ("NIH", "US Fed", "Industry", or "Other"), and before and after the effective date of the Final Rule.
    - location_completeness.xlsx contains percentages of interventional records missing required fields, broken up by whether the record listed at least one location in the United States or only international locations (excluding trials with no listed location), and before and after the effective date of the Final Rule.

    Intermediate results:
    - cache.zip contains pickle and csv files of pandas dataframes with values scraped from the XML records in AllPublicXML. Downloading these files greatly speeds up running analysis steps from the Jupyter notebooks in our GitHub repository.
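    As a small usage sketch (assuming the column names listed above and that the file sits in the working directory), the BioPortal match tables can be loaded and summarized with pandas:

    ```python
    import pandas as pd

    # Column names are taken from the file description above; the path is
    # an assumption.
    matches = pd.read_csv("condition_matches.csv")

    # How many distinct condition strings had at least one exact BioPortal match?
    print(matches["condition"].nunique(), "distinct matched conditions")
    ```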

  18. g

    Traffic-loop detector accuracy ('spire') — year 2024 | gimi9.com

    • gimi9.com
    Updated Dec 14, 2024
    + more versions
    Cite
    (2024). Traffic-loop detector accuracy ('spire') — year 2024 [Dataset]. https://gimi9.com/dataset/eu_c_a944-accuratezza-spire-anno-2024/
    Explore at:
    Dataset updated
    Dec 14, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset shows the accuracy, as a percentage, of traffic data collected from induction-loop coils ('spire') for the year 2024. A value of 100% means the coil correctly detected data throughout the reference time slot; 0% means the coil detected no data, for any reason, for the entire duration of the time slot. Intermediate percentages indicate partial detection within the reference time slot. The dataset must be used in parallel with the dataset "Vehicle detection through loops — year 2024", which contains the traffic data themselves. To identify the accuracy of a given traffic reading, match on the combination of date, coil code, and the specific time slot in this dataset.
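    A hedged sketch of that pairing, assuming both datasets are exported as CSV with shared date, coil-code, and time-slot columns (all filenames and column names below are hypothetical):

    ```python
    import pandas as pd

    # Filenames and column names are assumptions for illustration.
    traffic = pd.read_csv("vehicle_detection_loops_2024.csv")
    accuracy = pd.read_csv("loop_accuracy_2024.csv")

    # Pair each traffic reading with its accuracy value.
    merged = traffic.merge(accuracy, on=["date", "code", "time_slot"], how="left")

    # Keep only readings from time slots where the coil detected data
    # correctly for the whole slot (accuracy == 100%).
    reliable = merged[merged["accuracy_pct"] == 100]
    print(len(reliable), "fully reliable readings")
    ```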

  19. d

    Aboriginal Settlements - Cultural Text (DPLH-016) - Datasets - data.wa.gov.au

    • catalogue.data.wa.gov.au
    Updated Dec 12, 2019
    Cite
    (2019). Aboriginal Settlements - Cultural Text (DPLH-016) - Datasets - data.wa.gov.au [Dataset]. https://catalogue.data.wa.gov.au/dataset/caboriginal-settlements-cultural-text-dop-049
    Explore at:
    Dataset updated
    Dec 12, 2019
    Area covered
    Western Australia
    Description

    Data Downloads: GeoJSON, File Geodatabase (FGDB), GeoPackage, and Shapefile (SHP). Each resource provides the latest snapshot of the dataset, and every download format is subject to the same Data Licensing Agreement, reproduced once below (the source page repeats it verbatim for each format).

    Data Licensing Agreement — for the use of digital information acquired from DataWA. This agreement is made this day between the Department of Planning, Lands and Heritage of 140 William Street, Perth, Western Australia (the Licensor) and the user of DataWA (data.wa.gov.au) (the Licensee).

    DEFINITIONS. In this agreement the following definitions apply: Information means the data, datasets and information that are on the website of DataWA (being data.wa.gov.au). Permitted Purpose means use for internal business or personal purposes only, and not for any external or further display, distribution, sale, licence, hire, let or trade to a third party, whether for charge or not.

    LICENCE CONDITIONS. The Licensor grants the Licensee a licence to use the Information on the terms and conditions set out in this agreement. The Licensee may use the Information only for the Permitted Purpose; the Information shall not be used for any other purpose or be dispatched to any other user or agent. The Information shall at all times remain the property of the Licensor. All products produced by the Licensee from the use of the Information shall bear a logo and text acknowledging the Licensor as the source of the Information. The Licensor and all of its respective servants, agents and officers shall not be held liable for any action, proceeding, claim, suit or demand arising from or otherwise relating to the interpretation, accuracy or use of the Information by the Licensee, and the Licensee will indemnify and keep indemnified the Licensor and all of its respective servants, agents and officers from and against all such actions, proceedings, claims, suits or demands. The Licensee acknowledges that the Licensor has, in good faith, made every effort to ensure that the Information is complete, current and reliable; however, the Licensor makes no warranty or representation about the accuracy, adequacy or completeness of the Information, and before relying on the Information in any important matter the Licensee should carefully evaluate its accuracy, completeness and relevance for its purposes and should obtain appropriate professional advice relevant to its particular circumstances.

  20. H

    Dictionary of Titles

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Apr 6, 2022
    Cite
    Shahad Althobaiti; Ahmad Alabdulkareem; Judy Hanwen Shen; Iyad Rahwan; Esteban Moro; Alex Rutherford (2022). Dictionary of Titles [Dataset]. http://doi.org/10.7910/DVN/DQW8IP
    Explore at:
    Croissant — a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Apr 6, 2022
    Dataset provided by
    Harvard Dataverse
    Authors
    Shahad Althobaiti; Ahmad Alabdulkareem; Judy Hanwen Shen; Iyad Rahwan; Esteban Moro; Alex Rutherford
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Hand-transcribed content from the United States Bureau of Labor Statistics Dictionary of Titles (DoT). The DoT is a record of occupations and a description of the tasks performed. Five editions exist, from 1939, 1949, 1965, 1977 and 1991. The DoT was replaced by O*NET, structured data on jobs, workers and their characteristics. However, apart from the 1991 data, the data in the DoT is not easily ingestible, existing only in scanned PDF documents, and attempts at Optical Character Recognition led to low accuracy. For that reason we present here hand-transcribed textual data from these documents. Various data are available for each occupation, e.g. numerical codes, references to other occupations, and the free-text description. The data for each edition is therefore presented in 'long' format with a variable number of lines per occupation and a blank line between occupations; consult the transcription instructions for more details. Structured metadata (see here) on occupations is also available for the 1965, 1977 and 1991 editions; for these editions the metadata can be extracted from the numerical codes within the occupational entries, and the key for these codes is found in separate tables in the 1965 edition, which were also transcribed. The instructions provided to transcribers for this edition are also added to the repository. The original documents are freely available in PDF format (e.g. here). This data accompanies the paper 'Longitudinal Complex Dynamics of Labour Markets Reveal Increasing Polarisation' by Althobaiti et al.
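    Given the 'long' layout described above, a minimal parsing sketch (assuming a UTF-8 text export per edition with occupations separated by blank lines; the filename is hypothetical):

    ```python
    # Split one edition's transcription into per-occupation blocks.
    with open("dot_1977.txt", encoding="utf-8") as f:  # hypothetical filename
        text = f.read()

    # Occupations are separated by a blank line; each block has a variable
    # number of lines (codes, cross-references, free-text description).
    occupations = [
        block.splitlines() for block in text.split("\n\n") if block.strip()
    ]
    print(len(occupations), "occupation entries parsed")
    ```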
