This dataset contains the metadata of the datasets published in 77 Dataverse installations, information about each installation's metadata blocks, and the list of standard licenses that dataset depositors can apply to the datasets they publish in the 36 installations running more recent versions of the Dataverse software. The data is useful for reporting on the quality of dataset- and file-level metadata within and across Dataverse installations. Curators and other researchers can use this dataset to explore how well Dataverse software and the repositories using the software help depositors describe data.

How the metadata was downloaded

The dataset metadata and metadata block JSON files were downloaded from each installation on October 2 and October 3, 2022 using a Python script kept in a GitHub repo at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_dataset_metadata_of_all_installations.py. To get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one named "hostname", listing each installation URL at which I was able to create an account, and another named "apikey", listing my accounts' API tokens. The Python script expects and uses the API tokens in this CSV file to get metadata and other information from installations that require API tokens.

How the files are organized

├── csv_files_with_metadata_from_most_known_dataverse_installations
│   ├── author(citation).csv
│   ├── basic.csv
│   ├── contributor(citation).csv
│   ├── ...
│   └── topic_classification(citation).csv
├── dataverse_json_metadata_from_each_known_dataverse_installation
│   ├── Abacus_2022.10.02_17.11.19.zip
│   │   ├── dataset_pids_Abacus_2022.10.02_17.11.19.csv
│   │   ├── Dataverse_JSON_metadata_2022.10.02_17.11.19
│   │   │   ├── hdl_11272.1_AB2_0AQZNT_v1.0.json
│   │   │   └── ...
│   │   └── metadatablocks_v5.6
│   │       ├── astrophysics_v5.6.json
│   │       ├── biomedical_v5.6.json
│   │       ├── citation_v5.6.json
│   │       ├── ...
│   │       └── socialscience_v5.6.json
│   ├── ACSS_Dataverse_2022.10.02_17.26.19.zip
│   ├── ADA_Dataverse_2022.10.02_17.26.57.zip
│   ├── Arca_Dados_2022.10.02_17.44.35.zip
│   ├── ...
│   └── World_Agroforestry_-_Research_Data_Repository_2022.10.02_22.59.36.zip
├── dataset_pids_from_most_known_dataverse_installations.csv
├── licenses_used_by_dataverse_installations.csv
└── metadatablocks_from_most_known_dataverse_installations.csv

This dataset contains two directories and three CSV files not in a directory.

One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 18 CSV files that contain the values from common metadata fields of all 77 Dataverse installations. For example, author(citation)_2022.10.02-2022.10.03.csv contains the "Author" metadata for all published, non-deaccessioned versions of all datasets in the 77 installations, with a row for each author name, affiliation, identifier type and identifier.

The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 77 zipped files, one for each of the 77 Dataverse installations whose dataset metadata I was able to download using Dataverse APIs. Each zip file contains a CSV file and two sub-directories:

The CSV file contains the persistent IDs and URLs of each published dataset in the Dataverse installation, as well as a column indicating whether or not the Python script was able to download the Dataverse JSON metadata for each dataset. For Dataverse installations using Dataverse software versions whose Search APIs include each dataset's owning Dataverse collection name and alias, the CSV files also include which Dataverse collection (within the installation) each dataset was published in.

One sub-directory contains a JSON file for each of the installation's published, non-deaccessioned dataset versions. The JSON files contain the metadata in the "Dataverse JSON" metadata schema.

The other sub-directory contains information about the metadata models (the "metadata blocks" in JSON files) that the installation was using when the dataset metadata was downloaded. I saved them so that they can be used when extracting metadata from the Dataverse JSON files.

The dataset_pids_from_most_known_dataverse_installations.csv file contains the dataset PIDs of all published datasets in the 77 Dataverse installations, with a column to indicate if the Python script was able to download the dataset's metadata. It's a union of all of the "dataset_pids_..." files in each of the 77 zip files.

The licenses_used_by_dataverse_installations.csv file contains information about the licenses that a number of the installations let depositors choose when creating datasets. When I collected ...

Visit https://dataone.org/datasets/sha256%3Ad27d528dae8cf01e3ea915f450426c38fd6320e8c11d3e901c43580f997a3146 for complete metadata about this dataset.
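The published script handles many installation-specific details; purely as a simplified illustration of the general approach (not the author's exact code), the sketch below uses the Dataverse Search API to list dataset PIDs and the metadata export API to fetch each dataset's Dataverse JSON. The hostname/apikey CSV layout follows the description above; paging and error handling are reduced to a minimum.

# Sketch: enumerate datasets in one Dataverse installation and export their
# Dataverse JSON metadata. Simplified; the real script adds retries and
# per-installation quirks.
import csv
import json
import requests

def read_api_keys(path="installation_api_keys.csv"):
    # CSV with "hostname" and "apikey" columns, as described above
    with open(path, newline="") as f:
        return {row["hostname"]: row["apikey"] for row in csv.DictReader(f)}

def list_dataset_pids(hostname, api_key=None, per_page=100):
    headers = {"X-Dataverse-key": api_key} if api_key else {}
    start, pids = 0, []
    while True:
        r = requests.get(
            f"{hostname}/api/search",
            params={"q": "*", "type": "dataset", "per_page": per_page, "start": start},
            headers=headers, timeout=60,
        )
        r.raise_for_status()
        items = r.json()["data"]["items"]
        if not items:
            return pids
        pids.extend(item["global_id"] for item in items)
        start += per_page

def export_dataverse_json(hostname, pid, api_key=None):
    headers = {"X-Dataverse-key": api_key} if api_key else {}
    r = requests.get(
        f"{hostname}/api/datasets/export",
        params={"exporter": "dataverse_json", "persistentId": pid},
        headers=headers, timeout=60,
    )
    r.raise_for_status()
    return r.json()

if __name__ == "__main__":
    keys = read_api_keys()
    hostname = "https://demo.dataverse.org"  # example installation
    for pid in list_dataset_pids(hostname, keys.get(hostname)):
        metadata = export_dataverse_json(hostname, pid, keys.get(hostname))
        with open(pid.replace(":", "_").replace("/", "_") + ".json", "w") as out:
            json.dump(metadata, out)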
Open Government Data (OGD) portals, thanks to the thousands of geo-referenced datasets they contain, are of great interest for any analysis or process relating to the territory. For this potential to be realized, users must be able to access these datasets and reuse them. An element often considered to hinder the full dissemination of OGD is the quality of its metadata. Starting from an experimental investigation conducted on over 160,000 geospatial datasets belonging to six national and international OGD portals, the first objective of this work is to provide an overview of the usage of these portals, measured in terms of dataset views and downloads. Furthermore, to assess the possible influence of metadata quality on the use of geospatial datasets, the metadata of each dataset was assessed and the correlation between these two variables was measured. The results show a significant underutilization of geospatial datasets and a generally poor quality of their metadata. In addition, only a weak correlation was found between use and metadata quality, not strong enough to assert with certainty that the latter is a determining factor of the former.
The dataset consists of six zipped CSV files, containing the collected datasets' usage data, full metadata, and computed quality values, for about 160,000 geospatial datasets belonging to the three national and three international portals considered in the study, i.e. US (catalog.data.gov), Colombia (datos.gov.co), Ireland (data.gov.ie), HDX (data.humdata.org), EUODP (data.europa.eu), and NASA (data.nasa.gov).
Data collection occurred in the period: 2019-12-19 -- 2019-12-23.
The header for each CSV file is:
[ ,portalid,id,downloaddate,metadata,overallq,qvalues,assessdate,dviews,downloads,engine,admindomain]
where, for each row (one portal's dataset), the fields are defined as follows (a short loading sketch follows reference [1] below):
portalid: portal identifier
id: dataset identifier
downloaddate: date of data collection
metadata: the dataset's full metadata, downloaded via API from the portal according to the supporting platform's schema
overallq: overall quality values computed by applying the methodology presented in [1]
qvalues: JSON object containing the quality values computed for the 17 metrics presented in [1]
assessdate: date of quality assessment
dviews: number of total views for the dataset
downloads: number of total downloads for the dataset (made available only by the Colombia, HDX, and NASA portals)
engine: identifier of the supporting portal platform: 1 (CKAN), 2 (Socrata)
admindomain: 1 (national), 2 (international)
[1] Neumaier, S.; Umbrich, J.; Polleres, A. Automated Quality Assessment of Metadata Across Open Data Portals. J. Data and Information Quality 2016, 8, 2:1–2:29. doi:10.1145/2964909
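As a minimal illustration of how these files can be loaded (the column names are those listed above; the file name and the treatment of the leading unnamed column as an index are assumptions), one of the zipped CSVs can be read and its qvalues column parsed as JSON:

# Sketch: load one portal's CSV and expand the per-metric quality values.
# The file name is illustrative; the first, unnamed column is treated as an index.
import json
import pandas as pd

df = pd.read_csv("us_datasets.csv.zip", index_col=0)  # pandas reads zipped CSVs directly

# qvalues holds a JSON object with the 17 per-metric quality scores from [1]
qvalues = df["qvalues"].apply(json.loads).apply(pd.Series)
df = df.join(qvalues)

print(df[["portalid", "id", "overallq", "dviews", "downloads"]].head())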
Wirestock's AI/ML Image Training Data, 4.5M Files with Metadata: This data product is a unique offering in the realm of AI/ML training data. What sets it apart is the sheer volume and diversity of the dataset, which includes 4.5 million files spanning 20 different categories. These categories range from Animals/Wildlife and The Arts to Technology and Transportation, providing a rich and varied dataset for AI/ML applications.
The data is sourced from Wirestock's platform, where creators upload and sell their photos, videos, and AI art online. This means that the data is not only vast but also constantly updated, ensuring a fresh and relevant dataset for your AI/ML needs. The data is collected in a GDPR-compliant manner, ensuring the privacy and rights of the creators are respected.
The primary use-cases for this data product are numerous. It is ideal for training machine learning models for image recognition, improving computer vision algorithms, and enhancing AI applications in various industries such as retail, healthcare, and transportation. The diversity of the dataset also means it can be used for more niche applications, such as training AI to recognize specific objects or scenes.
This data product fits into Wirestock's broader data offering as a key resource for AI/ML training. Wirestock is a platform for creators to sell their work, and this dataset is a collection of that work. It represents the breadth and depth of content available on Wirestock, making it a valuable resource for any company working with AI/ML.
The core benefits of this dataset are its volume, diversity, and quality. With 4.5 million files, it provides a vast resource for AI training. The diversity of the dataset, spanning 20 categories, ensures a wide range of images for training purposes. The quality of the images is also high, as they are sourced from creators selling their work on Wirestock.
In terms of how the data is collected, creators upload their work to Wirestock, where it is then sold on various marketplaces. This means the data is sourced directly from creators, ensuring a diverse and unique dataset. The data includes both the images themselves and associated metadata, providing additional context for each image.
The different image categories included in this dataset are Animals/Wildlife, The Arts, Backgrounds/Textures, Beauty/Fashion, Buildings/Landmarks, Business/Finance, Celebrities, Education, Emotions, Food Drinks, Holidays, Industrial, Interiors, Nature Parks/Outdoor, People, Religion, Science, Signs/Symbols, Sports/Recreation, Technology, Transportation, Vintage, Healthcare/Medical, Objects, and Miscellaneous. This wide range of categories ensures a diverse dataset that can cater to a variety of AI/ML applications.
MIT License: https://opensource.org/licenses/MIT
This dataset essentially consists of the metadata of 164 datasets. Each line describes one dataset, from which 22 features have been extracted; these features are used to classify the dataset into one of the categories 0-Unmanaged, 2-INV, 3-SI or 4-NOA (DatasetType).
The dataset consists of 164 rows, each holding the metadata of a different dataset. The target column is datasetType, which takes four values indicating the dataset type:
2 - Invoice detail (INV): This dataset type is a special report (usually called a Detailed Sales Statement) produced by company accounting or Enterprise Resource Planning (ERP) software. Using an INV-type dataset directly for ARM is extremely convenient for users, as it relieves them from the tedious work of transforming data into another, more suitable form. INV-type data input typically includes a header, but only two of its attributes are essential for data mining. The first attribute serves as the grouping identifier creating a unique transaction (e.g., Invoice ID, Order Number), while the second attribute contains the items utilized for data mining (e.g., Product Code, Product Name, Product ID).
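As a minimal illustration of the transformation described above (the column names "InvoiceID" and "ProductCode" are hypothetical), an INV-type table can be grouped into transactions suitable for ARM:

# Sketch: turn an INV-type detail table into a list of transactions for ARM.
# Only two columns matter: the grouping identifier and the item column.
import pandas as pd

inv = pd.DataFrame({
    "InvoiceID":   [1001, 1001, 1002, 1002, 1002],       # hypothetical grouping column
    "ProductCode": ["A10", "B20", "A10", "C30", "B20"],  # hypothetical item column
})

# One transaction per invoice: the set of items that appear on it
transactions = inv.groupby("InvoiceID")["ProductCode"].apply(lambda s: sorted(set(s))).tolist()
print(transactions)  # [['A10', 'B20'], ['A10', 'B20', 'C30']]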
3 - Sparse Item (SI): This type is widespread in Association Rules Mining (ARM). It involves a header and a fixed number of columns. Each item corresponds to a column. Each row represents a transaction. The typical cell stores a value, usually one character in length, that depicts the presence or absence of the item in the corresponding transaction. The absence character must be identified or declared before the Association Rules Mining process takes place.
4 - Nominal Attributes (NOA): This type is commonly used in Machine Learning and Data Mining tasks. It involves a fixed number of columns. Each column registers nominal/categorical values. The presence of a header row is optional. However, in cases where no header is provided, there is a risk of extracting incorrect rules if similar values exist in different attributes of the dataset. The potential values for each attribute can vary.
0 - Unmanaged for ARM: On the other hand, not all datasets are suitable for extracting useful association rules or frequent item sets. For instance, datasets characterized predominantly by numerical features with arbitrary values, or datasets that involve fragmented or mixed data types. For such datasets, ARM processing becomes possible only by introducing a data discretization stage, which in turn introduces information loss. Such datasets are not considered in the present treatise and are termed (0) Unmanaged in the sequel.
Determining the dataset type is crucial for ARM, and the current dataset is used to classify a dataset's type using a supervised machine learning model.
There is also another dataset type, 1 - Market Basket List (MBL), where each dataset row is a transaction. A transaction involves a variable number of items; because of this characteristic, such datasets can be easily categorized using procedural programming, and DoD does not include instances of them. For more details about dataset types, please refer to the article "WebApriori: a web application for association rules mining": https://link.springer.com/chapter/10.1007/978-3-030-49663-0_44
Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
The open data portal catalogue is a downloadable dataset containing some key metadata for the general datasets available on the Government of Canada's Open Data portal. Resource 1 is generated using the ckanapi tool; Resources 2 - 8 are generated using the Flatterer utility.

Description of resources:
1. Dataset is a JSON Lines file where the metadata of each Dataset/Open Information Record is one line of JSON. The file is compressed with GZip. It is heavily nested and recommended for users familiar with working with nested JSON.
2. Catalogue is an XLSX workbook where the nested metadata of each Dataset/Open Information Record is flattened into worksheets for each type of metadata.
3. datasets metadata contains metadata at the dataset level. This is also referred to as the package in some CKAN documentation. This is the main table/worksheet in the SQLite database and XLSX output.
4. Resources Metadata contains the metadata for the resources contained within each dataset.
5. resource views metadata contains the metadata for the views applied to each resource, if a resource has a view configured.
6. datastore fields metadata contains the DataStore information for CSV datasets that have been loaded into the DataStore. This information is displayed in the Data Dictionary for DataStore-enabled CSVs.
7. Data Package Fields contains a description of the fields available in each of the tables within the Catalogue, as well as the count of the number of records each table contains.
8. data package entity relation diagram displays the title and format of each column, in each table in the Data Package, in the form of an ERD diagram. The Data Package resource offers a text-based version.
9. SQLite Database is a .db database, similar in structure to Catalogue. This can be queried with database or analytical software tools.
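As a minimal sketch of how Resource 1 can be consumed (the local file name is an assumption; each line is one record's JSON, per the description above, and CKAN package records typically expose "name" and "title" keys):

# Sketch: stream the gzipped JSON Lines catalogue and print a few top-level fields.
import gzip
import json

with gzip.open("open_data_catalogue.jsonl.gz", "rt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        record = json.loads(line)          # one Dataset/Open Information Record per line
        print(record.get("name"), "-", record.get("title"))
        if i == 4:                         # stop after a handful of records
            break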
U.S. Government Works: https://www.usa.gov/government-works
Note: Please use the following view to be able to see the entire Dataset Description: https://data.ct.gov/Environment-and-Natural-Resources/Hazardous-Waste-Portal-Manifest-Metadata/x2z6-swxe
Dataset Description Outline (5 sections)
• INTRODUCTION
• WHY USE THE CONNECTICUT OPEN DATA PORTAL MANIFEST METADATA DATASET INSTEAD OF THE DEEP DOCUMENT ONLINE SEARCH PORTAL ITSELF?
• WHAT MANIFESTS ARE INCLUDED IN DEEP’S MANIFEST PERMANENT RECORDS, WHICH ARE ALSO AVAILABLE VIA THE DEEP DOCUMENT SEARCH PORTAL AND CT OPEN DATA?
• HOW DOES THE PORTAL MANIFEST METADATA DATASET RELATE TO THE OTHER TWO MANIFEST DATASETS PUBLISHED IN CT OPEN DATA?
• IMPORTANT NOTES
INTRODUCTION
• All of DEEP’s paper hazardous waste manifest records were recently scanned and “indexed”.
• Indexing consisted of 6 basic pieces of information or “metadata” taken from each manifest about the Generator and stored with the scanned image. The metadata enables searches by: Site Town, Site Address, Generator Name, Generator ID Number, Manifest ID Number and Date of Shipment.
• All of the metadata and scanned images are available electronically via DEEP’s Document Online Search Portal at: https://filings.deep.ct.gov/DEEPDocumentSearchPortal/
• Therefore, it is no longer necessary to visit the DEEP Records Center in Hartford for manifest records or information.
• This CT Data dataset “Hazardous Waste Portal Manifest Metadata” (or “Portal Manifest Metadata”) was copied from the DEEP Document Online Search Portal, and includes only the metadata – no images.
WHY USE THE CONNECTICUT OPEN DATA PORTAL MANIFEST METADATA DATASET INSTEAD OF THE DEEP DOCUMENT ONLINE SEARCH PORTAL ITSELF?
The Portal Manifest Metadata is a good search tool to use along with the Portal. Searching the Portal Manifest Metadata can provide the following advantages over searching the Portal:
• faster searches, especially for “large searches” - those with a large number of search returns;
• unlimited number of search returns (Portal is limited to 500);
• larger display of search returns;
• search returns can be sorted and filtered online in CT Data;
• search returns and the entire dataset can be downloaded from CT Data and used offline (e.g. download to Excel format); and
• metadata from searches can be copied from CT Data and pasted into the Portal search fields to quickly find single scanned images.
The main advantages of the Portal are:
• it provides access to scanned images of manifest documents (CT Data does not); and
• images can be downloaded one or multiple at a time.
WHAT MANIFESTS ARE INCLUDED IN DEEP’S MANIFEST PERMANENT RECORDS, WHICH ARE ALSO AVAILABLE VIA THE DEEP DOCUMENT SEARCH PORTAL AND CT OPEN DATA?
All hazardous waste manifest records received and maintained by the DEEP Manifest Program, including:
• manifests originating from a Connecticut Generator or sent to a Connecticut Destination Facility, including manifests accompanying an exported shipment;
• manifests with RCRA hazardous waste listed on them (such manifests may also have non-RCRA hazardous waste listed);
• manifests from a Generator with a Connecticut Generator ID number (permanent or temporary number);
• manifests with sufficient quantities of RCRA hazardous waste listed for DEEP to consider the Generator to be a Small or Large Quantity Generator; and
• manifests with PCBs listed on them from 2016 to 6-29-2018.
Note: manifests sent to a CT Destination Facility were indexed by the Connecticut or Out of State Generator. Searches by CT Designated Facility are not possible unless such facility is the Generator for the purposes of manifesting.
All other manifests were considered “non-hazardous” manifests and not scanned. They were discarded after 2 years in accord with the DEEP records retention schedule. Non-hazardous manifests include:
• manifests with only non-RCRA hazardous waste listed; and
• manifests from generators that did not have a permanent or temporary Generator ID number.
• Sometimes non-hazardous manifests were considered “Hazardous Manifests” and kept on file if DEEP had reason to believe the generator should have had a permanent or temporary Generator ID number. These manifests were scanned and included in the Portal.
Dates included: manifests with shipment dates from 1980 to present.
• States were the primary keepers of manifest records until June 29, 2018. Any manifest regarding a Connecticut Generator or Destination Facility should have been sent to DEEP, and should be present in the Portal and CT Data.
• June 30, 2018 was the start of the EPA e-Manifest program. Most manifests with a shipment date on or after this date are sent to, and maintained by, the EPA.
• For information from EPA regarding these newer manifests:
• Overview: https://rcrapublic.epa.gov/rcrainfoweb/action/modules/em/emoverview
• To search by site, use EPA’s Sites List: https://rcrapublic.epa.gov/rcrainfoweb/action/modules/hd/handlerindex (Tip: Change the Location field from “National” to “Connecticut”)
• Manifests still sent to DEEP on or after 6-30-2018 include:
• manifests from exported shipments; and
• manifest copies submitted pursuant to discrepancy reports and unmanifested shipments.
HOW DOES THE PORTAL MANIFEST METADATA RELATE TO THE OTHER TWO MANIFEST DATASETS PUBLISHED IN CT DATA?
• DEEP has posted in CT Data two other datasets about the same hazardous waste documents which are the subject of the Portal and the Portal Manifest Metadata Copy.
• There are likely some differences in the metadata between the Portal Manifest Metadata and the two others. DEEP recommends using all data sources for a complete search.
• These two datasets were the best search tool DEEP had available to the public prior to the Portal and the Metadata Copy:
• “Hazardous Waste Manifest Data (CT) 1984 – 2008”
https://data.ct.gov/Environment-and-Natural-Resources/Hazardous-Waste-Manifest-Data-CT-1984-2008/h6d8-qiar; and
• “Hazardous Waste Manifest Data (CT) 1984 – 2008: Generator Summary View”
https://data.ct.gov/Environment-and-Natural-Resources/Hazardous-Waste-Manifest-Data-CT-1984-2008-Generat/72mi-3f82.
• The only difference between these two datasets is:
• the first dataset includes all of the metadata transcribed from the manifests.
• the second “Generator Summary View” dataset is a smaller subset of the first, requested for convenience by the public.
Both of these datasets:
• Are copies of metadata from a manifest database maintained by DEEP. No scanned images are available as a companion to these datasets.
• The date range of the manifests for these datasets is 1984 to approximately 2008.
IMPORTANT NOTES (4):
NOTE 1: Some manifest images are effectively unavailable via the Portal and the Portal Metadata due to incomplete or incorrect metadata. Such errors may be the result of unintentional data entry error, errors on the manifests, or illegible manifests.
• Incomplete or incorrect metadata may prevent a manifest from being found by a search. DEEP is currently working to complete the metadata as best it can.
• Please report errors to the DEEP Manifest Program at deep.manifests@ct.gov.
• DEEP will publish updates regarding this work here and through the DEEP Hazardous Waste Advisory Committee listserv. To sign up for this listserv, visit this webpage: https://portal.ct.gov/DEEP/Waste-Management-and-Disposal/Hazardous-Waste-Advisory-Committee/HWAC-Home.
NOTE 2: This dataset does not replace the potential need for a full review of other files publicly available either on-line and/or at CT DEEP’s Records Center. For a complete review of agency records for this or other agency programs, you can perform your own search in our DEEP public file room located at 79 Elm Street, Hartford CT or at our DEEP Online Search Portal at: https://filings.deep.ct.gov/DEEPDocumentSearchPortal/Home.
NOTE 3: Other DEEP programs or state and federal agencies may maintain manifest records (e.g., DEEP Emergency Response, US Environmental Protection Agency, etc.). These other manifests were not scanned along with those from the Manifest Program files. However, most likely these other manifests are duplicate copies of manifests available via the Portal.
NOTE 4: Search tips for using the Portal and CT Data:
• If your search will yield a small number of search returns, try using the Portal for your search. “Small” means fewer than the 500 maximum search returns allowed by the Portal.
• Start your search as broadly as possible – try entering just the town and the street name, or a portion of the street name that is likely to be spelled correctly.
• For searches yielding a large number of search returns, try first using the Portal Manifest Metadata in CT Data.
• Try downloading the metadata and sorting, filtering, etc. the data to look for related spellings, etc.
• Once you narrow down your research, copy the manifest number of a manifest you are interested in, and paste it into the Agency ID field of the Portal search page.
• If you are using information from older information sources, for consistency you may want to search the two datasets copied from the older DEEP Manifest Database.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset accompanies our work that introduces a metadata schema for TREC run files based on the PRIMAD model. PRIMAD considers essential components of computational experiments that possibly can affect reproducibility on a conceptual level. We propose to align the metadata annotations to the PRIMAD components. In order to demonstrate the potential of metadata annotations, we curated a dataset with run files derived from experiments with different instantiations of PRIMAD components and annotated these with the corresponding metadata. With this work, we hope to stimulate IR researchers to annotate run files and improve the reuse value of experimental artifacts even further.
This archive contains the following data:
demo.tar.xz : Selected annotated run files that are used in the Colab demonstration.
metadata.zip : YAML files containing only the metadata annotations for each run.
runs.zip : The entire set of run files with annotations.
The annotated runs result from the following experiments:
Grossman and Cormack @ TREC Common Core 2017 Paper | Source
Grossman and Cormack @ TREC Common Core 2018 Paper | Source
Yu et al. @ TREC Common Core 2018 Paper | Source
Yu et al. @ ECIR 2019 Paper | Source
Breuer et al. @ SIGIR 2020 Paper | Source
Breuer et al. @ CLEF 2021 Paper | Source
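As a minimal sketch of how the metadata annotations might be inspected (the archive layout and the YAML keys are assumptions, not the published annotation schema):

# Sketch: list the per-run YAML annotations inside metadata.zip and load one.
# File names and top-level keys are assumptions; adjust to the actual archive.
import io
import zipfile

import yaml  # PyYAML

with zipfile.ZipFile("metadata.zip") as zf:
    yaml_names = [n for n in zf.namelist() if n.endswith((".yml", ".yaml"))]
    print(f"{len(yaml_names)} annotation files found")

    with zf.open(yaml_names[0]) as f:
        annotation = yaml.safe_load(io.TextIOWrapper(f, encoding="utf-8"))
    # PRIMAD-aligned components would appear as top-level entries here
    print(annotation)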
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Update May 2024: Fixed a data type issue with the "id" column that prevented Twitter ids from rendering correctly.
Recent progress in generative artificial intelligence (gen-AI) has enabled the generation of photo-realistic and artistically-inspiring photos at a single click, catering to millions of users online. To explore how people use gen-AI models such as DALLE and StableDiffusion, it is critical to understand the themes, contents, and variations present in the AI-generated photos. In this work, we introduce TWIGMA (TWItter Generative-ai images with MetadatA), a comprehensive dataset encompassing 800,000 gen-AI images collected from Jan 2021 to March 2023 on Twitter, with associated metadata (e.g., tweet text, creation date, number of likes).
Through a comparative analysis of TWIGMA with natural images and human artwork, we find that gen-AI images possess distinctive characteristics and exhibit, on average, lower variability when compared to their non-gen-AI counterparts. Additionally, we find that the similarity between a gen-AI image and human images (i) is correlated with the number of likes; and (ii) can be used to identify human images that served as inspiration for the gen-AI creations. Finally, we observe a longitudinal shift in the themes of AI-generated images on Twitter, with users increasingly sharing artistically sophisticated content such as intricate human portraits, whereas their interest in simple subjects such as natural scenes and animals has decreased. Our analyses and findings underscore the significance of TWIGMA as a unique data resource for studying AI-generated images.
Note that, in accordance with Twitter's privacy and control policy, NO raw content from Twitter is included in this dataset; users can and must retrieve the original Twitter content used for analysis via the Twitter id. In addition, users who want to access Twitter data should consult and closely follow the rules and regulations in the official Twitter developer policy at https://developer.twitter.com/en/developer-terms/policy.
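As a rough sketch of what hydrating tweet ids can look like (using the tweepy library's v2 client; the bearer token, the metadata file name and the batching are assumptions, and current Twitter/X API terms and rate limits should be checked before use):

# Sketch: hydrate tweet ids from the TWIGMA metadata with tweepy's v2 client.
# Requires a valid bearer token; get_tweets accepts at most 100 ids per call.
import pandas as pd
import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

ids = pd.read_csv("twigma_metadata.csv", dtype={"id": str})["id"].tolist()  # file name assumed

for start in range(0, len(ids), 100):
    batch = ids[start:start + 100]
    resp = client.get_tweets(batch, tweet_fields=["created_at", "public_metrics"])
    for tweet in resp.data or []:
        print(tweet.id, tweet.created_at, tweet.public_metrics["like_count"])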
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The dataset consists of vessel tracking data in the form of AIS observations in the Baltic Sea during years 2017-19. The AIS observations have been enriched with vessel metadata such as power
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset is related to the MSc work "Descrição de dados de investigação: requisitos de investigadores para modelos de metadados na Psicologia e Ciências da Educação" (Description of research data: researchers' requirements for metadata models in Psychology and Education Sciences). By observing practices, identifying needs and expectations, learning about work processes, and drawing on researchers' knowledge of the importance of research data in their activity, we analyzed procedures for accessing, storing and preserving research data. We also studied models and workflows for managing datasets of completed projects. Following the recommendations of Data Curation, we proposed a description model that was later expanded with specific examples of controlled vocabularies used in the Psychology scientific domain.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
The TU Delft Repository is a digital repository, including peer-reviewed articles, technical reports, working papers, bachelor and master theses, and more. Managed by TU Delft Library, the purpose of the repository is to provide a sustainable, stable, central location for files and metadata.
The repository adheres to Open Access principles and promotes accessibility, transparency, and collaboration in academic research. To uphold these values, we submit the first dataset of metadata from our research repository (to be updated twice annually hereafter) under a CC0 license.
This research dataset was exported on 14.05.2025 and contains approximately 355,769 records, relating to 238,423 files from 566,150 contributors.
Magnetotellurics (MT) is an electromagnetic geophysical method that is sensitive to variations in subsurface electrical resistivity. Measurements of natural electric and magnetic fields are made in the time domain, where instruments can record for a couple of hours up to multiple months, resulting in data sets on the order of gigabytes. The principles of findability, accessibility, interoperability, and reuse of digital assets (FAIR) require standardized metadata. Unfortunately, the MT community has never had a metadata standard for time series data. In 2019, the Working Group for Magnetotelluric Data Handling and Software (https://www.iris.edu/hq/about_iris/governance/mt_soft) was assembled by the Incorporated Research Institutions for Seismology (IRIS) to develop a metadata standard for time series data. This product describes the metadata definitions.

Metadata Hierarchy: Survey -> Station -> Run -> Channel

The hierarchy and structure of the MT metadata logically follow how MT time series data are collected. The highest level is "survey", which contains metadata for data collected over a certain time interval in a given geographic region. This may include multiple principal investigators or multiple data collection episodes but should be confined to a specific project. Next, a "station" contains metadata for a single location over a certain time interval. If the location changes during a run, then a new station should be created and subsequently a new run under the new station. If the sensors, cables, data logger, battery, etc. are replaced during a run but the station remains in the same location, then this can be recorded in the "run" metadata but does not require a new station entry. A "run" contains metadata for continuous data collected at a single sample rate. If channel parameters are changed between runs, this requires creating a new run. If the station is relocated, then a new station should be created. If a run has channels that drop out, the start and end period will be the minimum time and maximum time for all channels recorded. Finally, a "channel" contains metadata for a single channel during a single run, where "electric", "magnetic", and "auxiliary" channels have somewhat different metadata to uniquely describe the physical measurement.
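As a schematic illustration of the Survey -> Station -> Run -> Channel nesting (the key names below are illustrative placeholders, not the standard's exact attribute names):

# Sketch: the MT metadata hierarchy as nested dictionaries.
# Keys are illustrative only; consult the published standard for real attribute names.
survey = {
    "id": "EXAMPLE_SURVEY_2020",
    "region": "example geographic region",
    "stations": [
        {
            "id": "MT001",
            "location": {"latitude": 40.0, "longitude": -112.0, "elevation_m": 1500.0},
            "runs": [
                {
                    "id": "MT001a",
                    "sample_rate_hz": 256.0,
                    "time_period": {"start": "2020-06-01T00:00:00", "end": "2020-06-15T00:00:00"},
                    "channels": [
                        {"type": "electric", "component": "Ex", "dipole_length_m": 100.0},
                        {"type": "magnetic", "component": "Hx", "sensor": "induction coil"},
                        {"type": "auxiliary", "component": "temperature"},
                    ],
                },
            ],
        },
    ],
}

# A new location means a new station; a changed sample rate or channel setup means a new run.
run = survey["stations"][0]["runs"][0]
print(len(run["channels"]), "channels in run", run["id"])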
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Searchable Index of Metadata Aggregators is a database that stores general information about metadata aggregators. This database is accompanied by the guidance document “A WDS Guide to Metadata Aggregators for Repository Managers”. The Searchable Index of Metadata Aggregators is an up-to-date catalogue of Dataset Metadata Aggregators (DMAs), implemented as an Access database. It was designed to fill a gap identified by the Harvestable Metadata Services Working Group (HMetS-WG) members of the World Data System’s International Technology Office (WDS-ITO): the lack of up-to-date resources giving an overview of current infrastructures used to syndicate dataset metadata. The database contains information on each DMA's supported metadata standards and software interfaces, as well as documentation on how to be aggregated by each.
The WDS Guide to Metadata Aggregators is a guidance document for the associated Searchable Index of Metadata Aggregators. We have defined DMAs as federated service infrastructures that foster the findability and accessibility of data products by enabling access to multiple, distributed metadata records via a single search interface. This guide gives a description of this catalogue and general guidance on how to use it. In the sections that follow, we give a short background to the Harvestable Metadata Services-Working Group project. Then, we outline the project's research methodology and the properties of the searchable index. Finally, we discuss this project's limitations, as well as its future development. Providing metadata to aggregators can significantly improve the findability of research data products.
Together, this guidance document and dataset package are designed to provide research data repository managers with options for participation in federated research data systems, and support institutional repositories' harvestable metadata service implementation strategies. In addition, as developers in the global research data management community seek to create pathways and workflows across data, software and compute resources, we anticipate that they're likely to prioritize connecting sites, organizations and services that have already done a lot of work harmonizing content from disparate providers. In this context, this resource will be helpful for creating roadmaps and implementation plans for integration across science clouds.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset is about books. It has 1 row and is filtered to the book "An emergent theory of digital library metadata: enrich then filter". It features 7 columns including author, publication date, language, and book publisher.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
General description
SAPFLUXNET contains a global database of sap flow and environmental data, together with metadata at different levels. SAPFLUXNET is a harmonised database, compiled from contributions from researchers worldwide.
The SAPFLUXNET version 0.1.5 database harbours 202 globally distributed datasets, from 121 geographical locations. SAPFLUXNET contains sap flow data for 2714 individual plants (1584 angiosperms and 1130 gymnosperms), belonging to 174 species (141 angiosperms and 33 gymnosperms), 95 different genera and 45 different families. More information on the database coverage can be found here: http://sapfluxnet.creaf.cat/shiny/sfn_progress_dashboard/.
The SAPFLUXNET project has been developed by researchers at CREAF and other institutions (http://sapfluxnet.creaf.cat/#team), coordinated by Rafael Poyatos (CREAF, http://www.creaf.cat/staff/rafael-poyatos-lopez), and funded by two Spanish Young Researcher's Grants (SAPFLUXNET, CGL2014-55883-JIN; DATAFORUSE, RTI2018-095297-J-I00) and an Alexander von Humboldt Research Fellowship for Experienced Researchers.
Changelog
Compared to version 0.1.4, this version includes some changes in the metadata, but all time series data (sap flow, environmental) remain the same.
For all datasets, climate metadata (temperature and precipitation, ‘si_mat’ and ‘si_map’) have been extracted from CHELSA (https://chelsa-climate.org/), replacing the previous climate data obtained with WorldClim. This change has modified the biome classification of the datasets in ‘si_biome’.
In ‘species’ metadata, the percentage of basal area with sap flow measurements for each species (‘sp_basal_area_perc’) is now assigned a value of 0 if species are in the understorey. This affects two datasets: AUS_MAR_UBD and AUS_MAR_UBW, where, previously, the sum of species basal area percentages could add up to more than 100%.
In ‘species’ metadata, the percentage of basal area with sap flow measurements for each species (‘sp_basal_area_perc’) has been corrected for datasets USA_SIL_OAK_POS, USA_SIL_OAK_1PR, USA_SIL_OAK_2PR.
In ‘site’ metadata, the vegetation type (‘si_igbp’) has been changed to SAV for datasets CHN_ARG_GWD and CHN_ARG_GWS.
Variables and units
SAPFLUXNET contains whole-plant sap flow and environmental variables at sub-daily temporal resolution. Both sap flow and environmental time series have accompanying flags in a data frame, one for sap flow and another for environmental variables. These flags store quality issues detected during the quality control process and can be used to add further quality flags.
Metadata contain relevant variables informing about site conditions, stand characteristics, tree and species attributes, sap flow methodology and details on environmental measurements. The description and units of all data and metadata variables can be found here: Metadata and data units.
To learn more about variables, units and data flags please use the functionalities implemented in the sapfluxnetr package (https://github.com/sapfluxnet/sapfluxnetr). In particular, have a look at the package vignettes using R:
library(sapfluxnetr)
vignette(package='sapfluxnetr')
vignette('metadata-and-data-units', package='sapfluxnetr')
vignette('data-flags', package='sapfluxnetr')
Data formats
SAPFLUXNET data can be found in two formats: 1) RData files belonging to the custom-built 'sfn_data' class and 2) Text files in .csv format. We recommend using the sfn_data objects together with the sapfluxnetr package, although we also provide the text files for convenience. For each dataset, text files are structured in the same way as the slots of sfn_data objects; if working with text files, we recommend that you check the data structure of 'sfn_data' objects in the corresponding vignette.
Working with sfn_data files
To work with SAPFLUXNET data, they first have to be downloaded from Zenodo, maintaining the folder structure. The first level in the folder hierarchy corresponds to file format, either RData files or CSVs. The second level corresponds to how sap flow is expressed: per plant, per sapwood area or per leaf area. Please note that interconversions among the magnitudes have been performed whenever possible. Below this level, data have been organised per dataset. In the case of RData files, each dataset is contained in an sfn_data object, which stores all data and metadata in different slots (see the vignette 'sfn-data-classes'). In the case of CSV files, each dataset has 9 individual files, corresponding to metadata (5), sap flow and environmental data (2) and their corresponding data flags (2).
After downloading the entire database, the sapfluxnetr package can be used to:
- Work with data from a single site: data access, plotting and time aggregation.
- Select the subset of datasets to work with.
- Work with data from multiple sites: data access, plotting and time aggregation.
Please check the following package vignettes to learn more about how to work with sfn_data files:
Quick guide
Metadata and data units
sfn_data classes
Custom aggregation
Memory and parallelization
Working with text files
We recommend working with sfn_data objects using R and the sapfluxnetr package; we do not currently provide code to work with text files.
Data issues and reporting
Please report any issue you may find in the database by sending us an email: sapfluxnet@creaf.uab.cat.
Temporary data fixes, detected but not yet included in released versions, will be published on the SAPFLUXNET main web page ('Known data errors').
Data access, use and citation
This version of the SAPFLUXNET database is open access and corresponds to the data paper submitted to Earth System Science Data in August 2020.
When using SAPFLUXNET data in an academic work, please cite the data paper, when available, or alternatively, the Zenodo dataset (see the ‘Cite as’ section on the right panels of this web page).
https://www.crossref.org/documentation/retrieve-metadata/rest-api/rest-api-metadata-license-information/
Note that this Crossref metadata is always openly available. The difference here is that we've done the time-saving work of putting all of the records registered through April 2024 into one file for download. To keep this metadata current, you can access new records via our public API. If you do use the API, we encourage you to read the section of the documentation on "etiquette": that is, how to use the API without making it impossible for others to use.
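As a minimal sketch of polite use of the public REST API (the /works endpoint and the mailto parameter are standard Crossref etiquette; the filter value and contact address below are placeholders):

# Sketch: fetch a page of recently indexed records from the Crossref REST API.
# Including a mailto identifies you and routes requests to the "polite" pool.
import requests

resp = requests.get(
    "https://api.crossref.org/works",
    params={
        "filter": "from-index-date:2024-05-01",  # placeholder: records indexed after the bulk file cutoff
        "rows": 100,
        "mailto": "you@example.org",             # placeholder contact address
    },
    timeout=60,
)
resp.raise_for_status()

for item in resp.json()["message"]["items"]:
    print(item.get("DOI"), "-", (item.get("title") or ["(no title)"])[0])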
This dataset includes data from 713 members of the early childhood workforce who completed a survey in March and April 2021. This survey included the collection of information on personal and workplace characteristics, workplace culture and climate (co-worker and supervisor relations, teamwork, organisational climate, autonomy, impact on decision making) and work-related wellbeing (personal accomplishment, emotional exhaustion, professional respect, pay and benefits, intention to leave the profession).
Description of each data variable is provided in the attached metadata file named 'intention to leave metadata'.
The data is provided in .csv format.
This study was approved by the researchers’ university Human Research Ethics Committee, and was conducted in accordance with the Australian National Statement on Ethical Conduct in Human Research (National Health and Medical Research Council, 2018). No information was collected in the study that could identify participants or services. An information statement was provided to all participants before they commenced the survey, and participants gave informed consent by starting the survey. Participants were free to complete as much or as little of the survey as they chose.
The research and survey link were emailed to each service listed on the ACECQA publicly available database of ECEC services. Centre directors then sent the survey information and link to their service’s educators if they chose to. A reminder email was sent to services two weeks after the initial email. The survey remained open for four weeks.
NASA's Wide-field Infrared Survey Explorer (WISE; Wright et al. 2010) mapped the sky at 3.4, 4.6, 12, and 22 μm (W1, W2, W3, W4) in 2010 with an angular resolution of 6.1", 6.4", 6.5", & 12.0" in the four bands. WISE achieved 5σ point source sensitivities better than 0.08, 0.11, 1 and 6 mJy in unconfused regions on the ecliptic in the four bands. Sensitivity improves toward the ecliptic poles due to denser coverage and lower zodiacal background. The All-Sky Release includes all data taken during the WISE full cryogenic mission phase, 7 January 2010 to 6 August 2010, that were processed with improved calibrations and reduction algorithms. Release data products include an Atlas of 18,240 match-filtered, calibrated and coadded image sets, a Source Catalog containing positional and photometric information for over 563 million objects detected on the WISE images, and an Explanatory Supplement that is a guide to the format, content, characteristics and cautionary notes for the WISE All-Sky Release products. The WISE All-Sky Data Release Single-exposure Source Working Database contains positions and brightness information, uncertainties, time of observation and assorted quality flags for 9,479,433,101 "sources" detected on the individual WISE 7.7s (W1 and W2) and 8.8s (W3 and W4) Single-exposure images. Because WISE scanned every point on the sky multiple times, the Single-exposure Database contains multiple, independent measurements of objects on the sky. Entries in the Single-exposure Source Table include detections of real astrophysical objects, as well as spurious detections of low SNR noise excursions, transient events such as hot pixels, charged particle strikes and satellite streaks, and image artifacts from light from bright sources, including the moon. Many of the unreliable detections are flagged in the Single-exposure Table, but they have not been filtered out as they were for the Source Catalog. Therefore, the Table must be used with caution. Users are strongly encouraged to read the Cautionary Notes before using the Table.
The AllWISE program builds upon the work of the successful Wide-field Infrared Survey Explorer mission (WISE; Wright et al. 2010) by combining data from the WISE cryogenic and NEOWISE (Mainzer et al. 2011 ApJ, 731, 53) post-cryogenic survey phases to form the most comprehensive view of the full mid-infrared sky currently available. By combining the data from two complete sky coverage epochs using an advanced data processing system, AllWISE has generated new products that have enhanced photometric sensitivity and accuracy, and improved astrometric precision compared to the 2012 WISE All-Sky Data Release. Exploiting the 6 to 12 month baseline between the WISE sky coverage epochs enables AllWISE to measure source motions for the first time, and to compute improved flux variability statistics. The AllWISE Atlas Metadata Table contains brief descriptions of all metadata information that is relevant to the production of the Atlas images and Source Catalog. The table contains the (RA, DEC) of the center of the Tile. Much of the information in this table is processing-specific and may not be of interest to general users (e.g., flags indicating whether frames have been processed successfully or not, and the date and time of the start of the pipeline processing, etc.). The metadata table also contains some characterization and derived statistics of the coadd image Tile, basic photometric parameters used for photometry and derived statistics for extracted sources and artifacts.
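As a hedged sketch of one common way to query these catalogs programmatically (via astroquery's IRSA module; the catalog identifier "allwise_p3as_psd" for the AllWISE Source Catalog is an assumption to verify against IRSA's catalog listing, e.g. with Irsa.list_catalogs()):

# Sketch: cone-search the AllWISE Source Catalog around a position via IRSA.
import astropy.units as u
from astropy.coordinates import SkyCoord
from astroquery.ipac.irsa import Irsa

position = SkyCoord(ra=10.684 * u.deg, dec=41.269 * u.deg, frame="icrs")  # example position

table = Irsa.query_region(
    position,
    catalog="allwise_p3as_psd",   # assumed AllWISE Source Catalog identifier
    spatial="Cone",
    radius=30 * u.arcsec,
)
print(table["designation", "ra", "dec", "w1mpro", "w2mpro"])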
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset is for the publication "Exploring ORCID adoption and metadata presence in Spain's research landscape" and is composed of the following files:
norm_aff.tsv (2795x5) - Curated list of role titles
openalex_all_spanish.csv (811062x4) - List of Spanish authors retrieved from OpenAlex
orcid_affiliation.csv (377377x23) - Activity employment affiliation of Spanish ORCID records
orcid_full_maps.csv (15182292x6) - Topic classification of Spanish ORCID records
orcid_metrics.csv (226512x39) - Metadata availability in Spanish ORCID records