100+ datasets found
  1. Dataset metadata of known Dataverse installations, August 2023

    • dataverse.harvard.edu
    Updated Aug 30, 2024
    + more versions
    Cite
    Julian Gautier (2024). Dataset metadata of known Dataverse installations, August 2023 [Dataset]. http://doi.org/10.7910/DVN/8FEGUV
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 30, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Julian Gautier
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains the metadata of the datasets published in 85 Dataverse installations and information about each installation's metadata blocks. It also includes the lists of pre-defined licenses or terms of use that dataset depositors can apply to the datasets they publish in the 58 installations that were running versions of the Dataverse software that include that feature. The data is useful for reporting on the quality of dataset- and file-level metadata within and across Dataverse installations, and for improving understanding of how certain Dataverse features and metadata fields are used. Curators and other researchers can use this dataset to explore how well the Dataverse software, and the repositories using it, help depositors describe data.

    How the metadata was downloaded

    The dataset metadata and metadata block JSON files were downloaded from each installation between August 22 and August 28, 2023 using a Python script kept in a GitHub repo at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_dataset_metadata_of_all_installations.py. In order to get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one named "hostname" listing each installation URL in which I was able to create an account, and another named "apikey" listing my accounts' API tokens. The Python script expects the CSV file and the listed API tokens in order to get metadata and other information from installations that require API tokens.

    How the files are organized

    ├── csv_files_with_metadata_from_most_known_dataverse_installations
    │   ├── author(citation)_2023.08.22-2023.08.28.csv
    │   ├── contributor(citation)_2023.08.22-2023.08.28.csv
    │   ├── data_source(citation)_2023.08.22-2023.08.28.csv
    │   ├── ...
    │   └── topic_classification(citation)_2023.08.22-2023.08.28.csv
    ├── dataverse_json_metadata_from_each_known_dataverse_installation
    │   ├── Abacus_2023.08.27_12.59.59.zip
    │   │   ├── dataset_pids_Abacus_2023.08.27_12.59.59.csv
    │   │   ├── Dataverse_JSON_metadata_2023.08.27_12.59.59
    │   │   │   ├── hdl_11272.1_AB2_0AQZNT_v1.0(latest_version).json
    │   │   │   └── ...
    │   │   └── metadatablocks_v5.6
    │   │       ├── astrophysics_v5.6.json
    │   │       ├── biomedical_v5.6.json
    │   │       ├── citation_v5.6.json
    │   │       ├── ...
    │   │       └── socialscience_v5.6.json
    │   ├── ACSS_Dataverse_2023.08.26_22.14.04.zip
    │   ├── ADA_Dataverse_2023.08.27_13.16.20.zip
    │   ├── Arca_Dados_2023.08.27_13.34.09.zip
    │   ├── ...
    │   └── World_Agroforestry_-_Research_Data_Repository_2023.08.27_19.24.15.zip
    ├── dataverse_installations_summary_2023.08.28.csv
    ├── dataset_pids_from_most_known_dataverse_installations_2023.08.csv
    ├── license_options_for_each_dataverse_installation_2023.09.05.csv
    └── metadatablocks_from_most_known_dataverse_installations_2023.09.05.csv

    This dataset contains two directories and four CSV files not in a directory.

    One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 20 CSV files that list the values of many of the metadata fields in the citation metadata block and geospatial metadata block of datasets in the 85 Dataverse installations. For example, author(citation)_2023.08.22-2023.08.28.csv contains the "Author" metadata for the latest versions of all published, non-deaccessioned datasets in the 85 installations, where there's a row for author names, affiliations, identifier types and identifiers.

    The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 85 zipped files, one for each of the 85 Dataverse installations whose dataset metadata I was able to download. Each zip file contains a CSV file and two sub-directories:

    • The CSV file contains the persistent IDs and URLs of each published dataset in the Dataverse installation, as well as a column indicating whether the Python script was able to download the Dataverse JSON metadata for each dataset. It also includes the alias/identifier and category of the Dataverse collection that the dataset is in.

    • One sub-directory contains a JSON file for each of the installation's published, non-deaccessioned dataset versions. The JSON files contain the metadata in the "Dataverse JSON" metadata schema. The Dataverse JSON export of the latest version of each dataset includes "(latest_version)" in the file name, which should help those who are interested in the metadata of only the latest version of each dataset.

    • The other sub-directory contains information about the metadata models (the "metadata blocks" in JSON files) that the installation was using when the dataset metadata was downloaded. I included them so that they can be used when extracting metadata from the datasets' Dataverse JSON exports.

    The dataverse_installations_summary_2023.08.28.csv file contains information about each installation, including its name, URL, Dataverse software version, and counts of dataset metadata...
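    A minimal sketch of how such a "hostname"/"apikey" CSV can drive the Dataverse Search API to list dataset persistent IDs (this is not the author's script; the file name installations.csv is hypothetical, and the endpoint and header names follow the Dataverse API documentation):

    import csv
    import requests

    INSTALLATIONS_CSV = "installations.csv"  # hypothetical; columns: hostname, apikey

    def list_dataset_pids(hostname, apikey=None, per_page=100):
        """Yield persistent IDs of published datasets via the Dataverse Search API."""
        headers = {"X-Dataverse-key": apikey} if apikey else {}
        start = 0
        while True:
            resp = requests.get(
                f"{hostname}/api/search",
                params={"q": "*", "type": "dataset", "per_page": per_page, "start": start},
                headers=headers,
                timeout=60,
            )
            resp.raise_for_status()
            data = resp.json()["data"]
            for item in data["items"]:
                yield item.get("global_id")
            start += per_page
            if start >= data["total_count"]:
                break

    with open(INSTALLATIONS_CSV, newline="") as f:
        for row in csv.DictReader(f):
            pids = list(list_dataset_pids(row["hostname"], row.get("apikey") or None))
            print(row["hostname"], len(pids))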

  2. Research Data Management metadata-driven studies survey corpus and coding

    • rdm.inesctec.pt
    Updated Jan 9, 2020
    Cite
    (2020). Research Data Management metadata-driven studies survey corpus and coding - Dataset - CKAN [Dataset]. https://rdm.inesctec.pt/dataset/cs-2020-003
    Explore at:
    Dataset updated
    Jan 9, 2020
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset resulted from a survey of research data management (RDM) metadata-driven studies. A summary was outlined of RDM studies that use at least one technique to engage researchers in the development of tools, or to improve and assess their metadata practices. To this end, the Scopus database was searched in January 2019, and 219 RDM entries that feature the concept “metadata” in the title or as a keyword were obtained. For broader coverage of publications, the list of 301 publications provided by the Perrier et al. scoping review was also assessed. The final corpus of analysis consisted of 14 studies that were coded according to their RDM context, motivation, metadata context, participants' domain, methodological approach, metadata practices of participants, their main findings and recommendations. The publications with the labels LabTrove, Archaeological_Digging and Publishing_Pushing were collected through the Perrier et al. scoping review dataset (https://doi.org/10.1371/journal.pone.0178261.s003).

  3. Dataset relating a study on Geospatial Open Data usage and metadata quality

    • zenodo.org
    csv
    Updated Jun 19, 2023
    + more versions
    Cite
    Alfonso Quarati; Alfonso Quarati (2023). Dataset relating a study on Geospatial Open Data usage and metadata quality [Dataset]. http://doi.org/10.5281/zenodo.4584542
    Explore at:
    Available download formats: csv
    Dataset updated
    Jun 19, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alfonso Quarati; Alfonso Quarati
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Open Government Data (OGD) portals, thanks to the presence of thousands of geo-referenced datasets containing spatial information, are of great interest for any analysis or process relating to the territory. For this to happen, users must be able to access these datasets and reuse them. An element often considered to hinder the full dissemination of OGD is the quality of its metadata. Starting from an experimental investigation conducted on over 160,000 geospatial datasets belonging to six national and international OGD portals, this work's first objective is to provide an overview of the usage of these portals, measured in terms of dataset views and downloads. Furthermore, to assess the possible influence of metadata quality on the use of geospatial datasets, the metadata of each dataset was assessed and the correlation between these two variables was measured. The results show a significant underutilization of geospatial datasets and a generally poor quality of their metadata. Moreover, only a weak correlation was found between usage and metadata quality, not strong enough to assert with certainty that the latter is a determining factor of the former.

    The dataset consists of six zipped CSV files, containing the collected datasets' usage data, full metadata, and computed quality values, for about 160,000 geospatial datasets belonging to the three national and three international portals considered in the study, i.e. US (catalog.data.gov), Colombia (datos.gov.co), Ireland (data.gov.ie), HDX (data.humdata.org), EUODP (data.europa.eu), and NASA (data.nasa.gov).

    Data collection occurred in the period: 2019-12-19 -- 2019-12-23.

    The header for each CSV file is:

    [ ,portalid,id,downloaddate,metadata,overallq,qvalues,assessdate,dviews,downloads,engine,admindomain]

    where, for each row (a portal's dataset), the fields are defined as follows:

    • portalid: portal identifier
    • id: dataset identifier
    • downloaddate: date of data collection
    • overallq: overall quality values computed by applying the methodology presented in [1]
    • qvalues: json object containing the quality values computed for the 17 metrics presented in [1]
    • assessdate: date of quality assessment
    • dviews: number of total views for the dataset
    • downloads: number of total downloads for the dataset (made available only by the Colombia, HDX, and NASA portals)
    • engine: identifier of the supporting portal platform: 1(CKAN), 2 (Socrata)
    • admindomain: 1 (national), 3 (international)
    • metadata: the overall dataset's metadata downloaded via API from the portal according to the supporting platform schema

    [1] Neumaier, S.; Umbrich, J.; Polleres, A. Automated Quality Assessment of Metadata Across Open Data Portals. J. Data and Information Quality 2016, 8, 2:1–2:29. doi:10.1145/2964909
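    A minimal sketch of loading one of the zipped CSV files with pandas and expanding the qvalues JSON column into per-metric scores (the file name us_datasets.csv.zip is hypothetical; the column names are those listed above):

    import json
    import pandas as pd

    # Hypothetical file name; the dataset ships one zipped CSV per portal.
    df = pd.read_csv("us_datasets.csv.zip")

    # 'qvalues' holds a JSON object with the 17 per-metric quality scores.
    metrics = pd.json_normalize(df["qvalues"].apply(json.loads).tolist())
    summary = pd.concat([df[["portalid", "id", "overallq", "dviews"]], metrics], axis=1)
    print(summary.describe())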

  4. Data from: LCA Domain Metadata Schema Inventory

    • catalog.data.gov
    • datasets.ai
    • +1more
    Updated Apr 21, 2025
    Cite
    Agricultural Research Service (2025). LCA Domain Metadata Schema Inventory [Dataset]. https://catalog.data.gov/dataset/lca-domain-metadata-schema-inventory-7e1b6
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service (https://www.ars.usda.gov/)
    Description

    This Excel workbook is a compilation of the major metadata schemas for life cycle assessment. Resources in this dataset: Resource Title: LCADomain_MetadataSchema_Inventory_v1_0_2. File Name: LCADomain_MetadataSchema_Inventory_v1_0_2.xlsm

  5. Dataset relating to the study "Open government data: usage trends and metadata quality"

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 8, 2021
    Cite
    Quarati, Alfonso (2021). Dataset relating to the study "Open government data: usage trends and metadata quality" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4054742
    Explore at:
    Dataset updated
    Oct 8, 2021
    Dataset authored and provided by
    Quarati, Alfonso
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Open Government Data (OGD) has the potential to support social and economic progress. However, this potential can be frustrated if the data remains unused. Although the literature suggests that the metadata quality of OGD datasets is one of the main factors affecting their use, to the best of our knowledge no quantitative study has provided evidence of this relationship. Considering about 400,000 datasets from 28 national, municipal, and international OGD portals, we have programmatically analyzed their usage, their metadata quality, and the relationship between the two. Our analysis highlighted three main findings. First, regardless of their size, the software platform adopted, and their administrative and territorial coverage, most OGD datasets are underutilized. Second, OGD portals pay varying attention to the quality of their datasets' metadata. Third, we did not find clear evidence that dataset usage is positively correlated with better metadata publishing practices. Finally, we considered other factors, such as dataset category and some demographic characteristics of the OGD portals, and analyzed their relationship with dataset usage, obtaining partially affirmative answers.

    The dataset consists of three zipped CSV files, containing the collected datasets' usage data, full metadata, and computed quality values, for about 400,000 datasets belonging to the 8 national, 4 international, and 16 US municipalities OGD portals considered in the study.

    Data collection occurred in the period: 2019-12-19 -- 2019-12-23.

    Portal                 #Datasets   Platform
    US                     261,514     CKAN
    France                 39,412      Other
    Colombia               9,795       Socrata
    IE                     9,598       CKAN
    Slovenia               4,892       CKAN
    Poland                 1,032       Other
    Latvia                 336         CKAN
    Puerto Rico            178         Socrata
    New York, NY           2,771       Socrata
    Baltimore, MD          2,617       Socrata
    Austin, TX             2,353       Socrata
    Chicago, IL            1,368       Socrata
    San Francisco, CA      1,001       Socrata
    Dallas, TX             1,001       Socrata
    Los Angeles, CA        943         Socrata
    Seattle, WA            718         Socrata
    Providence, RI         288         Socrata
    Honolulu, HI           244         Socrata
    New Orleans, LA        215         Socrata
    Buffalo, NY            213         Socrata
    Nashville, TN          172         Socrata
    Boston, MA             170         CKAN
    Albuquerque, NM        60          CKAN
    Albany, NY             50          Socrata
    HDX                    17,325      CKAN
    EUODP                  14,058      CKAN
    NASA                   9,664       Socrata
    World Bank Finances    2,177       Socrata

    The three datasets share the same table structure:

    Table Fields

    • portalid: portal identifier
    • id: dataset identifier
    • engine: identifier of the supporting portal platform: 1 (CKAN), 2 (Socrata)
    • admindomain: 1 (National), 2 (US), 3 (International)
    • downloaddate: date of data collection
    • views: number of total views for the dataset
    • downloads: number of total downloads for the dataset
    • overallq: overall quality values computed by applying the methodology presented by Neumaier et al. in [1]
    • qvalues: JSON object containing the quality values computed for the 17 metrics presented by Neumaier et al. in [1]
    • assessdate: date of quality assessment
    • metadata: the overall dataset's metadata, downloaded via API from the portal according to the supporting platform schema

    [1] Neumaier, S.; Umbrich, J.; Polleres, A. Automated Quality Assessment of Metadata Across Open Data Portals. J. Data and Information Quality 2016, 8, 2:1–2:29. doi:10.1145/2964909
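    A minimal sketch of the usage-versus-quality correlation analysis described above, assuming one of the zipped CSVs has been loaded (the file name national_portals.csv.zip is hypothetical; the column names are those listed above):

    import pandas as pd
    from scipy.stats import spearmanr

    # Hypothetical file name; the three zipped CSVs share the structure listed above.
    df = pd.read_csv("national_portals.csv.zip")

    # Rank correlation between dataset usage (views) and overall metadata quality.
    sub = df[["views", "overallq"]].dropna()
    rho, p = spearmanr(sub["views"], sub["overallq"])
    print(f"Spearman rho = {rho:.3f} (p = {p:.3g}) over {len(sub)} datasets")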

  6. Dataset for "Good practice versus reality: A landscape analysis of Research Software metadata adoption in European Open Science Clusters"

    • data.niaid.nih.gov
    • zenodo.org
    Updated Feb 5, 2025
    Cite
    El Hounsri, Anas (2025). Dataset for "Good practice versus reality: A landscape analysis of Research Software metadata adoption in European Open Science Clusters" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14770577
    Explore at:
    Dataset updated
    Feb 5, 2025
    Dataset provided by
    El Hounsri, Anas
    Garijo, Daniel
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was collected using the GitHub repository Repositories-Extraction, which gathered the links to the repositories from each scientific cluster, and the GitHub repository Metadata-Extraction, which extracted the relevant information needed to answer our research questions (RQ):

    RQ1: How do communities describe Research Software metadata in their code repositories?

    RQ2: What is the adoption of archival infrastructures across disciplines?

    RQ3: How do software projects adopt versioning?

    RQ4: How comprehensive is the metadata provided in code repositories? Specifically:

    What is the adoption of open licenses?

    Do research projects include a description?

    How well documented are research projects (i.e., in terms of installation instructions, requirements and documentation availability)?

    RQ5: What are the most common citation practices among communities?

    The dataset has two types of files per cluster and research question. For example, for the cluster "ENVRI" and each RQ you will find "analysis_envri_rq1.json", which contains the information extracted using SOMEF and processed to keep the relevant fields, and "results_envri_rq1.json", which contains the calculated percentages of the files relevant to that RQ.

  7. MAGGOT: Metadata Management Tool for Data Storage Spaces

    • b2find.eudat.eu
    Cite
    MAGGOT : Metadata Management Tool for Data Storage Spaces - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/6ecfa518-cc74-52b4-bd07-0f545d477739
    Explore at:
    Description

    The main functionalities of Maggot were established according to a well-defined need (see Background):

    • Document your datasets with metadata, produced within a collective of people, making it possible to answer certain questions of the Data Management Plan (DMP) concerning the organization, documentation, storage and sharing of data in the data storage space, and to meet certain data and metadata requirements, listed for example by Open Research Europe in accordance with the FAIR principles.

    • Search datasets by their metadata: the descriptive metadata thus produced can be associated with the corresponding data directly in the storage space, and it is then possible to search the metadata in order to find one or more datasets. Only descriptive metadata is accessible by default.

    • Publish the metadata of datasets along with their data files into a Europe-approved repository.

    Component versions: PHP 7.4.33, MongoDB 6.0.14, Python 3.8.10, Docker 20.10.12.

  8. Open Data Portal Catalogue

    • open.canada.ca
    • datasets.ai
    • +1more
    csv, json, jsonl, png +2
    Updated Jul 27, 2025
    Cite
    Treasury Board of Canada Secretariat (2025). Open Data Portal Catalogue [Dataset]. https://open.canada.ca/data/en/dataset/c4c5c7f1-bfa6-4ff6-b4a0-c164cb2060f7
    Explore at:
    Available download formats: csv, sqlite, json, png, jsonl, xlsx
    Dataset updated
    Jul 27, 2025
    Dataset provided by
    Treasury Board of Canada Secretariat (http://www.tbs-sct.gc.ca/)
    Treasury Board of Canada (https://www.canada.ca/en/treasury-board-secretariat/corporate/about-treasury-board.html)
    License

    Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
    License information was derived automatically

    Description

    The open data portal catalogue is a downloadable dataset containing some key metadata for the general datasets available on the Government of Canada's Open Data portal. Resource 1 is generated using the ckanapi tool (external link). Resources 2 - 8 are generated using the Flatterer (external link) utility.

    Description of resources:

    1. Dataset is a JSON Lines (external link) file where the metadata of each Dataset/Open Information Record is one line of JSON. The file is compressed with GZip. The file is heavily nested and recommended for users familiar with working with nested JSON.
    2. Catalogue is an XLSX workbook where the nested metadata of each Dataset/Open Information Record is flattened into worksheets for each type of metadata.
    3. datasets metadata contains metadata at the dataset level. This is also referred to as the package in some CKAN documentation. This is the main table/worksheet in the SQLite database and XLSX output.
    4. Resources Metadata contains the metadata for the resources contained within each dataset.
    5. resource views metadata contains the metadata for the views applied to each resource, if a resource has a view configured.
    6. datastore fields metadata contains the DataStore information for CSV datasets that have been loaded into the DataStore. This information is displayed in the Data Dictionary for DataStore-enabled CSVs.
    7. Data Package Fields contains a description of the fields available in each of the tables within the Catalogue, as well as the count of the number of records each table contains.
    8. data package entity relation diagram displays the title and format of each column, in each table in the Data Package, in the form of an ERD diagram. The Data Package resource offers a text-based version.
    9. SQLite Database is a .db database, similar in structure to Catalogue. This can be queried with database or analytical software tools for doing analysis.
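    A minimal sketch of reading Resource 1, the GZip-compressed JSON Lines export (the local file name is hypothetical; each line is one Dataset/Open Information Record, as described above):

    import gzip
    import json

    # Hypothetical local file name for the downloaded Resource 1.
    with gzip.open("od-do-canada.jsonl.gz", "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)  # one Dataset / Open Information Record per line
            print(record.get("id"), record.get("title"))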

  9. Quality and completeness scores for curated and non-curated datasets

    • springernature.figshare.com
    xlsx
    Updated Jun 1, 2023
    Cite
    Graham Smith; Iain Hrynaszkiewicz; Rebecca Grant (2023). Quality and completeness scores for curated and non-curated datasets [Dataset]. http://doi.org/10.6084/m9.figshare.6200357.v1
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Graham Smith; Iain Hrynaszkiewicz; Rebecca Grant
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These data contain aggregated survey responses assessing the quality and completeness of metadata for datasets deposited in public repositories, and for the same datasets after professional curation.

    Responses were provided by 10 professional editors representing the life, social and physical sciences. Each was randomly assigned four datasets to assess, half (20) of which had been curated according to the standards of Springer Nature's Research Data Support service and half (20) of which had not. Curated datasets were shared privately with research participants. The versions that did not receive curation via Springer Nature's Research Data Support are openly accessible.

    Single-blind testing was employed: the researchers were not made aware which datasets had been curated and which had not, and it was ensured that no participant assessed the same dataset before and after curation. Responses were collected via an online survey. The relevant question and scoring are provided below:

    Rate the overall quality and completeness of the metadata for the dataset (with regards to finding, accessing and citing the data, not reusing the data). 1 = not complete, 5 = very complete

  10. Meta updated stocks complete dataset

    • kaggle.com
    Updated Mar 15, 2025
    Cite
    M Atif Latif (2025). Meta updated stocks complete dataset [Dataset]. https://www.kaggle.com/datasets/matiflatif/meta-stocks-complete-data-set
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 15, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    M Atif Latif
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This dataset contains daily stock data for Meta Platforms, Inc. (META), formerly Facebook Inc., from May 19, 2012, to January 20, 2025. It offers a comprehensive view of Meta’s stock performance and market fluctuations during a period of significant growth, acquisitions, and technological advancements. This dataset is valuable for financial analysis, market prediction, machine learning projects, and evaluating the impact of Meta’s business decisions on its stock price.

    Content

    The dataset includes the following key features:

    • Open: Stock price at the start of the trading day.
    • High: Highest stock price during the trading day.
    • Low: Lowest stock price during the trading day.
    • Close: Stock price at the end of the trading day.
    • Adj Close: Adjusted closing price, accounting for corporate actions like stock splits, dividends, and other financial adjustments.
    • Volume: Total number of shares traded during the trading day.

    Variables

    • Date: The date of the trading day, formatted as YYYY-MM-DD.
    • Open: The stock price at the start of the trading day.
    • High: The highest price reached by the stock during the trading day.
    • Low: The lowest price reached by the stock during the trading day.
    • Close: The stock price at the end of the trading day.
    • Adj Close: The adjusted closing price, which reflects corporate actions like stock splits and dividend payouts.
    • Volume: The total number of shares traded on that specific day.
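    A minimal sketch of loading the CSV with pandas and deriving daily returns from the adjusted close (the file name META.csv is hypothetical; the column names are the variables listed above):

    import pandas as pd

    # Hypothetical file name for the downloaded Kaggle CSV.
    meta = pd.read_csv("META.csv", parse_dates=["Date"]).set_index("Date").sort_index()

    # Daily return computed from the adjusted close.
    meta["Return"] = meta["Adj Close"].pct_change()
    print(meta[["Adj Close", "Return", "Volume"]].tail())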

    Acknowledgements

    This dataset was sourced from reliable public APIs such as Yahoo Finance or Alpha Vantage. It is provided for educational and research purposes and is not affiliated with Meta Platforms, Inc. Users are encouraged to adhere to the terms of use of the original data provider.

  11. US Restaurant POI dataset with metadata

    • datarade.ai
    .csv
    Updated Jul 30, 2022
    Cite
    Geolytica (2022). US Restaurant POI dataset with metadata [Dataset]. https://datarade.ai/data-products/us-restaurant-poi-dataset-with-metadata-geolytica
    Explore at:
    Available download formats: .csv
    Dataset updated
    Jul 30, 2022
    Dataset authored and provided by
    Geolytica
    Area covered
    United States of America
    Description

    Point of Interest (POI) is defined as an entity (such as a business) at a ground location (point) which may be (of interest). We provide high-quality POI data that is fresh, consistent, customizable, easy to use and with high-density coverage for all countries of the world.

    This is our process flow:

    Our machine learning systems continuously crawl for new POI data
    Our geoparsing and geocoding calculates their geo locations
    Our categorization systems cleanup and standardize the datasets
    Our data pipeline API publishes the datasets on our data store
    

    A new POI comes into existence. It could be a bar, a stadium, a museum, a restaurant, a cinema, or a store. In today's interconnected world, its information will appear very quickly in social media, pictures, websites, and press releases. Soon after that, our systems will pick it up.

    POI data is in constant flux. Every minute, worldwide, over 200 businesses will move, over 600 new businesses will open their doors and over 400 businesses will cease to exist. Over 94% of all businesses have a public online presence of some kind, which allows such changes to be tracked: when a business changes, its website and social media presence change too. We'll then extract and merge the new information, thus creating the most accurate and up-to-date business information dataset across the globe.

    We offer our customers perpetual data licenses for any dataset representing this ever changing information, downloaded at any given point in time. This makes our company's licensing model unique in the current Data as a Service - DaaS Industry. Our customers don't have to delete our data after the expiration of a certain "Term", regardless of whether the data was purchased as a one time snapshot, or via our data update pipeline.

    Customers requiring regularly updated datasets may subscribe to our Annual subscription plans. Our data is continuously being refreshed, therefore subscription plans are recommended for those who need the most up to date data. The main differentiators between us vs the competition are our flexible licensing terms and our data freshness.

    Data samples may be downloaded at https://store.poidata.xyz/us

  12. Dataset of Pairs of an Image and Tags for Cataloging Image-based Records

    • narcis.nl
    • data.mendeley.com
    Updated Apr 19, 2022
    Cite
    Suzuki, T (via Mendeley Data) (2022). Dataset of Pairs of an Image and Tags for Cataloging Image-based Records [Dataset]. http://doi.org/10.17632/msyc6mzvhg.2
    Explore at:
    Dataset updated
    Apr 19, 2022
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    Suzuki, T (via Mendeley Data)
    Description

    Brief Explanation

    This dataset was created to develop and evaluate a cataloging system which assigns appropriate metadata to an image record for database management in digital libraries. It is intended for evaluating a task in which, given an image and its assigned tags, an appropriate Wikipedia page is selected for each of the given tags. A main characteristic of the dataset is that it includes ambiguous tags, so the visual contents of images are not unique to their tags. For example, it includes the tag 'mouse', which can mean either a mammal or a computer pointing device. The annotations are the corresponding Wikipedia articles for tags, judged as correct entities by humans.

    The dataset offers both data and programs that reproduce experiments on the above-mentioned task. Its data consist of image sources and annotations. The image sources are URLs of 420 images uploaded to Flickr. The annotations are a total of 2,464 relevant Wikipedia pages manually judged for the tags of the images. The dataset also provides programs in a Jupyter notebook (scripts.ipynb) to run some baseline methods for the designated task and evaluate the results.

    Structure of the Dataset

    1. data directory
       1.1. image_URL.txt: This file lists URLs of image files.
       1.2. rels.txt: This file lists the correct Wikipedia pages for each topic in topics.txt.
       1.3. topics.txt: This file lists the target pairs, called topics in this dataset, of an image and a tag to be disambiguated.
       1.4. enwiki_20171001.xml: This file contains texts extracted from the title and body parts of English Wikipedia articles as of 1 October 2017. It is modified from the Wikipedia dump data (https://archive.org/download/enwiki-20171001).
    2. img directory: A placeholder directory into which image files are downloaded.
    3. results directory: A placeholder directory to store result files for evaluation. It maintains three results of baseline methods in sub-directories. They contain JSON files, each of which is the result for one topic, and are ready to be evaluated using the evaluation scripts in scripts.ipynb, for reference of both usage and performance.
    4. scripts.ipynb: The scripts for running baseline methods and evaluation are provided in this Jupyter notebook file.
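    A minimal sketch of fetching the listed Flickr images into the img/ placeholder directory (how the accompanying notebook names the downloaded files is an assumption here):

    import os
    import urllib.request

    os.makedirs("img", exist_ok=True)
    with open("data/image_URL.txt") as f:
        urls = [line.strip() for line in f if line.strip()]

    for i, url in enumerate(urls):
        # File naming is illustrative; adjust to whatever scripts.ipynb expects.
        urllib.request.urlretrieve(url, os.path.join("img", f"{i:04d}.jpg"))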

  13. Wirestock's AI/ML Image Training Data, 4.5M Files with Metadata

    • datarade.ai
    .csv
    Updated Jul 18, 2023
    Cite
    WIRESTOCK (2023). Wirestock's AI/ML Image Training Data, 4.5M Files with Metadata [Dataset]. https://datarade.ai/data-products/wirestock-s-ai-ml-image-training-data-4-5m-files-with-metadata-wirestock
    Explore at:
    Available download formats: .csv
    Dataset updated
    Jul 18, 2023
    Dataset provided by
    Wirestock, Inc.
    Authors
    WIRESTOCK
    Area covered
    Estonia, Chile, Swaziland, Sudan, Peru, Pakistan, New Caledonia, Jersey, Georgia, Belarus
    Description

    Wirestock's AI/ML Image Training Data, 4.5M Files with Metadata: This data product is a unique offering in the realm of AI/ML training data. What sets it apart is the sheer volume and diversity of the dataset, which includes 4.5 million files spanning across 20 different categories. These categories range from Animals/Wildlife and The Arts to Technology and Transportation, providing a rich and varied dataset for AI/ML applications.

    The data is sourced from Wirestock's platform, where creators upload and sell their photos, videos, and AI art online. This means that the data is not only vast but also constantly updated, ensuring a fresh and relevant dataset for your AI/ML needs. The data is collected in a GDPR-compliant manner, ensuring the privacy and rights of the creators are respected.

    The primary use-cases for this data product are numerous. It is ideal for training machine learning models for image recognition, improving computer vision algorithms, and enhancing AI applications in various industries such as retail, healthcare, and transportation. The diversity of the dataset also means it can be used for more niche applications, such as training AI to recognize specific objects or scenes.

    This data product fits into Wirestock's broader data offering as a key resource for AI/ML training. Wirestock is a platform for creators to sell their work, and this dataset is a collection of that work. It represents the breadth and depth of content available on Wirestock, making it a valuable resource for any company working with AI/ML.

    The core benefits of this dataset are its volume, diversity, and quality. With 4.5 million files, it provides a vast resource for AI training. The diversity of the dataset, spanning 20 categories, ensures a wide range of images for training purposes. The quality of the images is also high, as they are sourced from creators selling their work on Wirestock.

    In terms of how the data is collected, creators upload their work to Wirestock, where it is then sold on various marketplaces. This means the data is sourced directly from creators, ensuring a diverse and unique dataset. The data includes both the images themselves and associated metadata, providing additional context for each image.

    The different image categories included in this dataset are Animals/Wildlife, The Arts, Backgrounds/Textures, Beauty/Fashion, Buildings/Landmarks, Business/Finance, Celebrities, Education, Emotions, Food Drinks, Holidays, Industrial, Interiors, Nature Parks/Outdoor, People, Religion, Science, Signs/Symbols, Sports/Recreation, Technology, Transportation, Vintage, Healthcare/Medical, Objects, and Miscellaneous. This wide range of categories ensures a diverse dataset that can cater to a variety of AI/ML applications.

  14. Large books metadata dataset ~50 mill entries

    • kaggle.com
    Updated Jan 12, 2022
    Cite
    Opal Skies (2022). Large books metadata dataset ~50 mill entries [Dataset]. https://www.kaggle.com/opalskies/large-books-metadata-dataset-50-mill-entries/metadata
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 12, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Opal Skies
    Description

    Content

    The main files are books.json, works.json, and authors.json. The identifiers can be used to locate relevant data across files, and the data contained in these files complement each other.

  15. Meta-analysis dataset metadata

    • figshare.com
    docx
    Updated Nov 22, 2023
    + more versions
    Cite
    Maggie Jones (2023). Meta-analysis dataset metadata [Dataset]. http://doi.org/10.6084/m9.figshare.24605280.v1
    Explore at:
    Available download formats: docx
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    figshare
    Authors
    Maggie Jones
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Descriptions of data derived from previously published studies and used in a meta-analysis, as described in "Prey responses to direct and indirect predation risk cues reveal the importance of multiple information sources."

  16. Rotten Tomatoes Movie Dataset – Clean Movie Metadata

    • crawlfeeds.com
    csv, zip
    Updated Jul 21, 2025
    Cite
    Crawl Feeds (2025). Rotten Tomatoes Movie Dataset – Clean Movie Metadata [Dataset]. https://crawlfeeds.com/datasets/rotten-tomatoes-movie-dataset-clean-movie-metadata
    Explore at:
    Available download formats: csv, zip
    Dataset updated
    Jul 21, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    We provide a high-quality Rotten Tomatoes movie dataset that includes key metadata for thousands of movies. This dataset is ideal for anyone working with movie-related platforms, entertainment analytics, content curation, or movie discovery tools.

    Our collection is structured, clean, and designed to support real-time apps, dashboards, and research use cases.

    What the Dataset Includes

    Each record in the dataset contains core information pulled directly from Rotten Tomatoes, including:

    • Movie Name – The official title of the movie.

    • Poster URL – High-resolution image link to the movie poster.

    • Trailer URL – Direct link to the official trailer (when available).

    • Genre – One or more genres associated with the movie, such as Action, Drama, Comedy, or Horror.

    • Release Date – The date the movie was released to the public.

    • Actors – Main cast members listed on Rotten Tomatoes.

    • Directors – Director(s) responsible for the movie.

    • Rating – Audience or critic scores, where available.

    Broad Coverage

    This dataset spans a wide range of movies across all major genres and decades. From modern releases to timeless classics, from Hollywood blockbusters to independent films — we’ve included movies of all types with relevant data points.

    You can expect data on:

    • U.S. theatrical releases

    • Netflix, Amazon, and other streaming exclusives

    • Festival films and limited releases

    • Animated and documentary films

    Use Cases

    Here are just a few ways this dataset can be useful:

    • Movie Recommendation Engines – Use metadata and genre info to power personalized movie suggestions.

    • Entertainment Search Tools – Build searchable movie listings with visual poster previews and trailer links.

    • Data Visualization Projects – Create dashboards showing trends by genre, release periods, or actor participation.

    • AI/ML Training – Use metadata to train classification models or sentiment prediction tools.

    • Research & Academic Use – Analyze patterns in movie releases, cast dynamics, and genre evolution.

    Why Use Our Dataset?

    • Clean & ready-to-use: No raw HTML, just clean structured data.

    • Minimal but meaningful fields: Focused on useful movie attributes without clutter.

    • Updated info: Covers both classic and current titles.

    • Simple integration: Easy to use for developers, analysts, and product teams.

    If you're working on a movie-based product or looking for reliable film metadata for your project, this dataset offers an ideal foundation.

    Let us know if you’d like to explore it further.

  17. Metadata of a Large Sonar and Stereo Camera Dataset Suitable for Sonar-to-RGB Image Translation

    • data.niaid.nih.gov
    Updated Jul 8, 2024
    Cite
    Wehbe, Bilal (2024). Metadata of a Large Sonar and Stereo Camera Dataset Suitable for Sonar-to-RGB Image Translation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10373153
    Explore at:
    Dataset updated
    Jul 8, 2024
    Dataset provided by
    Bande, Miguel
    Cesar, Diego
    Pribbernow, Max
    Wehbe, Bilal
    Shah, Nimish
    Backe, Christian
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Metadata of a Large Sonar and Stereo Camera Dataset Suitable for Sonar-to-RGB Image Translation

    Introduction

    This is a set of metadata describing a large dataset of synchronized sonar and stereo camera recordings that were captured between August 2021 and September 2023 during the project DeeperSense (https://robotik.dfki-bremen.de/en/research/projects/deepersense/), as training data for Sonar-to-RGB image translation. Parts of the sensor data have been published (https://zenodo.org/records/7728089, https://zenodo.org/records/10220989). Due to the size of the sensor data corpus, it is currently impractical to make the entire corpus accessible online. Instead, this metadatabase serves as a relatively compact representation, allowing interested researchers to inspect the data and select relevant portions for their particular use case, which will be made available on demand. This is an effort to comply with the FAIR principle A2 (https://www.go-fair.org/fair-principles/) that metadata shall be accessible even when the base data is not immediately available.

    Locations and sensors

    The sensor data was captured at four different locations, including one laboratory (Maritime Exploration Hall at DFKI RIC Bremen) and three field locations (Chalk Lake Hemmoor, Tank Wash Basin Neu-Ulm, Lake Starnberg). At all locations, a ZED camera and a Blueprint Oculus M1200d sonar were used. Additionally, a SeaVision camera was used at the Maritime Exploration Hall at DFKI RIC Bremen and at the Chalk Lake Hemmoor. The examples/ directory holds a typical output image for each sensor at each available location.

    Data volume per session

    Six data collection sessions were conducted. The table below presents an overview of the amount of data captured in each session:

    Session dates             Location                                       Number of datasets   Total duration of datasets [h]   Total logfile size [GB]   Number of images   Total image size [GB]
    2021-08-09 - 2021-08-12   Maritime Exploration Hall at DFKI RIC Bremen   52                   10.8                             28.8                      389’047            88.1
    2022-02-07 - 2022-02-08   Maritime Exploration Hall at DFKI RIC Bremen   35                   4.4                              54.1                      629’626            62.3
    2022-04-26 - 2022-04-28   Chalk Lake Hemmoor                             52                   8.1                              133.6                     1’114’281          97.8
    2022-06-28 - 2022-06-29   Tank Wash Basin Neu-Ulm                        42                   6.7                              144.2                     824’969            26.9
    2023-04-26 - 2023-04-27   Maritime Exploration Hall at DFKI RIC Bremen   55                   7.4                              141.9                     739’613            9.6
    2023-09-01 - 2023-09-02   Lake Starnberg                                 19                   2.9                              40.1                      217’385            2.3
    Total                                                                    255                  40.3                             542.7                     3’914’921          287.0

    Data and metadata structure

    Sensor data corpus

    The sensor data corpus comprises two processing stages:

    raw data streams stored in ROS bagfiles (aka logfiles),

    camera and sonar images (aka datafiles) extracted from the logfiles.

    The files are stored in a file tree hierarchy which groups them by session, dataset, and modality:

    ${session_key}/
        ${dataset_key}/
            ${logfile_name}
            ${modality_key}/
                ${datafile_name}

    A typical logfile path has this form:

    2023-09_starnberg_lake/
        2023-09-02-15-06_hydraulic_drill/
            stereo_camera-zed-2023-09-02-15-06-07.bag

    A typical datafile path has this form:

    2023-09_starnberg_lake/
        2023-09-02-15-06_hydraulic_drill/
            zed_right/
                1693660038_368077993.jpg

    All directory and file names, and their particles, are designed to serve as identifiers in the metadatabase. Their formatting, as well as the definitions of all terms, are documented in the file entities.json.

    Metadatabase

    The metadatabase is provided in two equivalent forms:

    as a standalone SQLite (https://www.sqlite.org/index.html) database file metadata.sqlite for users familiar with SQLite,

    as a collection of CSV files in the csv/ directory for users who prefer other tools.

    The database file has been generated from the CSV files, so each database table holds the same information as the corresponding CSV file. In addition, the metadatabase contains a series of convenience views that facilitate access to certain aggregate information.

    An entity relationship diagram of the metadatabase tables is stored in the file entity_relationship_diagram.png. Each entity, its attributes, and its relations are documented in detail in the file entities.json.

    Some general design remarks:

    For convenience, timestamps are always given in both a human-readable form (ISO 8601 formatted datetime strings with explicit local time zone), and as seconds since the UNIX epoch.

    In practice, each logfile always contains a single stream, and each stream is stored always in a single logfile. Per database schema however, the entities stream and logfile are modeled separately, with a “many-streams-to-one-logfile” relationship. This design was chosen to be compatible with, and open for, data collections where a single logfile contains multiple streams.

    A modality is not an attribute of a sensor alone, but of a datafile: Because a sensor is an attribute of a stream, and a single stream may be the source of multiple modalities (e.g. RGB vs. grayscale images from the same camera, or cartesian vs. polar projection of the same sonar output). Conversely, the same modality may originate from different sensors.

    As a usage example, the data volume per session which is tabulated at the top of this document, can be extracted from the metadatabase with the following SQL query:

    SELECT
        PRINTF('%s - %s', SUBSTR(session_start, 1, 10), SUBSTR(session_end, 1, 10)) AS 'Session dates',
        location_name_english AS Location,
        number_of_datasets AS 'Number of datasets',
        total_duration_of_datasets_h AS 'Total duration of datasets [h]',
        total_logfile_size_gb AS 'Total logfile size [GB]',
        number_of_images AS 'Number of images',
        total_image_size_gb AS 'Total image size [GB]'
    FROM location
    JOIN session USING (location_id)
    JOIN (
        SELECT
            session_id,
            COUNT(dataset_id) AS number_of_datasets,
            ROUND(SUM(dataset_duration) / 3600, 1) AS total_duration_of_datasets_h,
            ROUND(SUM(total_logfile_size) / 10e9, 1) AS total_logfile_size_gb
        FROM location
        JOIN session USING (location_id)
        JOIN dataset USING (session_id)
        JOIN view_dataset_total_logfile_size USING (dataset_id)
        GROUP BY session_id
    ) USING (session_id)
    JOIN (
        SELECT
            session_id,
            COUNT(datafile_id) AS number_of_images,
            ROUND(SUM(datafile_size) / 10e9, 1) AS total_image_size_gb
        FROM session
        JOIN dataset USING (session_id)
        JOIN stream USING (dataset_id)
        JOIN datafile USING (stream_id)
        GROUP BY session_id
    ) USING (session_id)
    ORDER BY session_id;
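    For querying the same database from Python, a minimal sqlite3 sketch that counts datasets per session (the table and column names are taken from the query above and from entities.json):

    import sqlite3

    con = sqlite3.connect("metadata.sqlite")
    con.row_factory = sqlite3.Row

    rows = con.execute("""
        SELECT session.session_id,
               location.location_name_english AS location,
               COUNT(dataset.dataset_id)      AS n_datasets
        FROM session
        JOIN location USING (location_id)
        JOIN dataset  USING (session_id)
        GROUP BY session.session_id
        ORDER BY session.session_id
    """).fetchall()

    for r in rows:
        print(r["session_id"], r["location"], r["n_datasets"])
    con.close()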

  18. Enhancing MovieLens Dataset: Enriching Recommendations with Audio Information, Transcriptions, and Metadata

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 16, 2023
    Cite
    Vanessa Moscardo (2023). Enhancing MovieLens Dataset: Enriching Recommendations with Audio Information, Transcriptions, and Metadata [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8037432
    Explore at:
    Dataset updated
    Jun 16, 2023
    Dataset provided by
    Laura Sebastia
    Vanessa Moscardo
    Victor Botti-Cebria
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Nowadays, there are lots of datasets available for training and experimentation in the field of recommender systems. Specifically, in the recommendation of audiovisual content, the MovieLens dataset is a prominent example. It is focused on the user-item relationship, providing actual interaction data between users and movies. However, although movies can be described with several characteristics, this dataset only offers limited information about the movie genres.

    In this work, we propose enriching the MovieLens dataset by incorporating metadata available on the web (such as cast, description, keywords, etc.) and movie trailers. By leveraging the trailers, we extract audio information and generate transcriptions for each trailer, introducing a crucial textual dimension to the dataset. The audio information was extracted through waveform and frequency analysis, followed by the application of dimensionality reduction techniques. For the transcription generation, the deep learning model Whisper was used. Finally, metadata was obtained from TMDB, and the BERT model was applied to extract embeddings.

    These additional attributes enrich the original dataset, providing deeper and more precise analysis. Then, the use of this extended and enhanced dataset could drive significant advancements in recommendation systems, enhancing user experiences by providing more relevant and tailored movie recommendations based on their tastes and preferences.
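    A minimal sketch of the kind of pipeline described above, transcribing a trailer's audio with Whisper and embedding the text with BERT (model sizes, pooling and file names are illustrative assumptions, not the authors' exact configuration):

    import torch
    import whisper
    from transformers import AutoModel, AutoTokenizer

    audio_path = "trailer_0001.mp3"  # hypothetical trailer audio file

    # 1) Transcribe the trailer audio with Whisper.
    asr = whisper.load_model("base")
    transcript = asr.transcribe(audio_path)["text"]

    # 2) Embed the transcript (or TMDB metadata text) with BERT, mean-pooling the last layer.
    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    bert = AutoModel.from_pretrained("bert-base-uncased")
    inputs = tok(transcript, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        embedding = bert(**inputs).last_hidden_state.mean(dim=1)  # shape (1, 768)
    print(embedding.shape)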

  19. DCAT-AP API endpoints for data.public.lu

    • data.public.lu
    html, rdf, xlsx
    Updated May 27, 2024
    Cite
    Open Data Lëtzebuerg (2024). DCAT-AP API endpoints for data.public.lu [Dataset]. https://data.public.lu/en/datasets/dcat-ap-api-endpoints-for-data-public-lu/
    Explore at:
    Available download formats: rdf, html, xlsx (16280)
    Dataset updated
    May 27, 2024
    Dataset authored and provided by
    Open Data Lëtzebuerg
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Data.public.lu provides all its metadata in the DCAT and DCAT-AP formats, i.e. all data about the data stored or referenced on data.public.lu. DCAT (Data Catalog Vocabulary) is a specification designed to facilitate interoperability between data catalogs published on the Web. This specification has been extended via the DCAT-AP (DCAT Application Profile for data portals in Europe) standard, specifically for data portals in Europe. The serialisation of those vocabularies is mainly done in RDF (Resource Description Framework). The implementation of data.public.lu is based on that of the open source udata platform. This API enables the federation of multiple data portals; for example, all the datasets published on data.public.lu are also published on data.europa.eu. The DCAT API from data.public.lu is used by the European data portal to federate its metadata. The DCAT standard is thus very important to guarantee interoperability between all data portals in Europe.

    Usage

    Full catalog. You can find here a few examples using the curl command line tool. To get all the metadata from the whole catalog hosted on data.public.lu:

    curl https://data.public.lu/catalog.rdf

    Metadata for an organization. To get the metadata of a specific organization, you first need to find its ID. The ID of an organization is the last part of its URL. For the organization "Open data Lëtzebuerg", the URL is https://data.public.lu/fr/organizations/open-data-letzebuerg/ and the ID is open-data-letzebuerg. To get all the metadata for a given organization, call the following URL, where {id} has been replaced by the correct ID: https://data.public.lu/api/1/organizations/{id}/catalog.rdf. Example:

    curl https://data.public.lu/api/1/organizations/open-data-letzebuerg/catalog.rdf

    Metadata for a dataset. To get the metadata of a specific dataset, you first need to find its ID. The ID of a dataset is the last part of its URL. For the dataset "Digital accessibility monitoring report - 2020-2021", the URL is https://data.public.lu/fr/datasets/digital-accessibility-monitoring-report-2020-2021/ and the ID is digital-accessibility-monitoring-report-2020-2021. To get all the metadata for a given dataset, call the following URL, where {id} has been replaced by the correct ID: https://data.public.lu/api/1/datasets/{id}/rdf. Example:

    curl https://data.public.lu/api/1/datasets/digital-accessibility-monitoring-report-2020-2021/rdf

    Compatibility with DCAT-AP 2.1.1

    The DCAT-AP standard is in constant evolution, so the compatibility of the implementation should be regularly compared with the standard and adapted accordingly. In May 2023, we performed this comparison, and the result is available in the resources below (see the document named 'udata 6 dcat-ap implementation status'). In the DCAT-AP model, classes and properties have a priority level which should be respected in every implementation: mandatory, recommended and optional. Our goal is to implement all mandatory classes and properties, and if possible all recommended classes and properties which make sense in the context of our open data portal.
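    Beyond curl, the same endpoints can be consumed programmatically; a minimal sketch with rdflib that fetches one organization's DCAT-AP catalog and counts the dcat:Dataset entries in the returned page (rdflib usage is an assumption here, not part of the portal's documentation):

    from rdflib import Graph
    from rdflib.namespace import DCAT, RDF

    g = Graph()
    g.parse(
        "https://data.public.lu/api/1/organizations/open-data-letzebuerg/catalog.rdf",
        format="xml",
    )
    n_datasets = sum(1 for _ in g.subjects(RDF.type, DCAT.Dataset))
    print(n_datasets, "dcat:Dataset entries in this page of the catalog")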

  20. Data warehouse and metadata holdings relevant to Australias North West Shelf

    • gimi9.com
    Updated Sep 4, 2024
    + more versions
    Cite
    (2024). Data warehouse and metadata holdings relevant to Australias North West Shelf | gimi9.com [Dataset]. https://gimi9.com/dataset/au_data-warehouse-and-metadata-holdings-relevant-to-australias-north-west-shelf/
    Explore at:
    Dataset updated
    Sep 4, 2024
    Area covered
    Australia
    Description

    From the earliest stages of planning the North West Shelf Joint Environmental Management Study, it was evident that good management of the scientific data to be used in the research would be important for the success of the Study. A comprehensive review of data sets and other information relevant to the marine ecosystems, the geology, infrastructure and industries of the North West Shelf area had been completed (Heyward et al. 2006). The Data Management Project was established to source and prepare existing data sets for use, requiring the development and use of a range of tools: metadata systems, data visualisation and data delivery applications. These were made available to collaborators to allow easy access to data obtained and generated by the Study. The CMAR MarLIN metadata system was used to document the 285 data sets that were identified as potentially useful for the Study, as well as the software and information products generated by and for the Study. This report represents a hard-copy atlas of all NWSJEMS data products and the existing data sets identified for potential use as inputs to the Study. It comprises summary metadata elements describing the data sets, their custodianship and how the data sets might be obtained. The identifiers of each data set can be used to refer to the full metadata records in the on-line MarLIN system.
