CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains the metadata of the datasets published in 85 Dataverse installations and information about each installation's metadata blocks. It also includes the lists of pre-defined licenses or terms of use that dataset depositors can apply to the datasets they publish in the 58 installations that were running versions of the Dataverse software that include that feature. The data is useful for reporting on the quality of dataset- and file-level metadata within and across Dataverse installations and for improving understanding of how certain Dataverse features and metadata fields are used. Curators and other researchers can use this dataset to explore how well Dataverse software and the repositories using the software help depositors describe data.

How the metadata was downloaded

The dataset metadata and metadata block JSON files were downloaded from each installation between August 22 and August 28, 2023 using a Python script kept in a GitHub repo at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_dataset_metadata_of_all_installations.py. To get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing the URL of each installation where I was able to create an account, and another column named "apikey" listing my accounts' API tokens. The Python script uses the CSV file and the listed API tokens to get metadata and other information from installations that require API tokens.

How the files are organized

├── csv_files_with_metadata_from_most_known_dataverse_installations
│   ├── author(citation)_2023.08.22-2023.08.28.csv
│   ├── contributor(citation)_2023.08.22-2023.08.28.csv
│   ├── data_source(citation)_2023.08.22-2023.08.28.csv
│   ├── ...
│   └── topic_classification(citation)_2023.08.22-2023.08.28.csv
├── dataverse_json_metadata_from_each_known_dataverse_installation
│   ├── Abacus_2023.08.27_12.59.59.zip
│   │   ├── dataset_pids_Abacus_2023.08.27_12.59.59.csv
│   │   ├── Dataverse_JSON_metadata_2023.08.27_12.59.59
│   │   │   ├── hdl_11272.1_AB2_0AQZNT_v1.0(latest_version).json
│   │   │   └── ...
│   │   └── metadatablocks_v5.6
│   │       ├── astrophysics_v5.6.json
│   │       ├── biomedical_v5.6.json
│   │       ├── citation_v5.6.json
│   │       ├── ...
│   │       └── socialscience_v5.6.json
│   ├── ACSS_Dataverse_2023.08.26_22.14.04.zip
│   ├── ADA_Dataverse_2023.08.27_13.16.20.zip
│   ├── Arca_Dados_2023.08.27_13.34.09.zip
│   ├── ...
│   └── World_Agroforestry_-_Research_Data_Repository_2023.08.27_19.24.15.zip
├── dataverse_installations_summary_2023.08.28.csv
├── dataset_pids_from_most_known_dataverse_installations_2023.08.csv
├── license_options_for_each_dataverse_installation_2023.09.05.csv
└── metadatablocks_from_most_known_dataverse_installations_2023.09.05.csv

This dataset contains two directories and four CSV files not in a directory. One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 20 CSV files that list the values of many of the metadata fields in the citation metadata block and geospatial metadata block of datasets in the 85 Dataverse installations. For example, author(citation)_2023.08.22-2023.08.28.csv contains the "Author" metadata for the latest versions of all published, non-deaccessioned datasets in the 85 installations, with a row for each author name, affiliation, identifier type and identifier.
The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 85 zipped files, one for each of the 85 Dataverse installations whose dataset metadata I was able to download. Each zip file contains a CSV file and two sub-directories. The CSV file contains the persistent IDs and URLs of each published dataset in the Dataverse installation, as well as a column indicating whether the Python script was able to download the Dataverse JSON metadata for each dataset. It also includes the alias/identifier and category of the Dataverse collection that the dataset is in. One sub-directory contains a JSON file for each of the installation's published, non-deaccessioned dataset versions. The JSON files contain the metadata in the "Dataverse JSON" metadata schema. The Dataverse JSON export of the latest version of each dataset includes "(latest_version)" in the file name, which should help those who are interested in the metadata of only the latest version of each dataset. The other sub-directory contains information about the metadata models (the "metadata blocks" in JSON files) that the installation was using when the dataset metadata was downloaded. I included them so that they can be used when extracting metadata from the datasets' Dataverse JSON exports. The dataverse_installations_summary_2023.08.28.csv file contains information about each installation, including its name, URL, Dataverse software version, and counts of dataset metadata...
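For readers who want to reproduce a small part of this workflow, the following is a minimal Python sketch (not the author's script, which is linked above) that pages through an installation's Search API and downloads each dataset's Dataverse JSON export. The CSV file name is hypothetical, its hostname/apikey columns follow the description above, and error handling, retries and rate limiting are omitted.

import csv
import requests

def harvest(base_url, api_key=None, per_page=100):
    """Yield (persistent_id, dataverse_json_export) for every published dataset found."""
    headers = {"X-Dataverse-key": api_key} if api_key else {}
    start = 0
    while True:
        search = requests.get(
            f"{base_url}/api/search",
            params={"q": "*", "type": "dataset", "per_page": per_page, "start": start},
            headers=headers,
        )
        search.raise_for_status()
        items = search.json()["data"]["items"]
        if not items:
            break
        for item in items:
            pid = item["global_id"]
            export = requests.get(
                f"{base_url}/api/datasets/export",
                params={"exporter": "dataverse_json", "persistentId": pid},
                headers=headers,
            )
            if export.ok:
                yield pid, export.json()
        start += per_page

# The CSV layout follows the description above; the file name is hypothetical.
with open("dataverse_installations_api_tokens.csv", newline="") as f:
    for row in csv.DictReader(f):  # columns: hostname, apikey
        for pid, metadata in harvest(row["hostname"], row["apikey"]):
            print(row["hostname"], pid)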
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset resulted from a survey of metadata-driven research data management (RDM) studies. It outlines RDM studies that use at least one technique to engage researchers in the development of tools, or to improve and assess their metadata practices. To this end, the Scopus database was searched in January 2019, and 219 RDM entries that feature the concept "metadata" in the title or as a keyword were obtained. For broader coverage of publications, the list of 301 publications provided by the Perrier et al. scoping review was also assessed. The final corpus of analysis consisted of 14 studies that were coded according to their RDM context, motivation, metadata context, participants' domain, methodological approach, metadata practices of participants, their main findings and recommendations. The publications with the labels LabTrove, Archaeological_Digging and Publishing_Pushing were collected through the Perrier et al. scoping review dataset (https://doi.org/10.1371/journal.pone.0178261.s003).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Thanks to the presence of thousands of geo-referenced datasets containing spatial information, Open Government Data (OGD) portals are of great interest for any analysis or process relating to the territory. For this to happen, users must be able to access these datasets and reuse them. An element often considered to hinder the full dissemination of OGD is the quality of its metadata. Starting from an experimental investigation conducted on over 160,000 geospatial datasets belonging to six national and international OGD portals, this work has as its first objective to provide an overview of the usage of these portals, measured in terms of dataset views and downloads. Furthermore, to assess the possible influence of metadata quality on the use of geospatial datasets, an assessment of the metadata for each dataset was carried out, and the correlation between these two variables was measured. The results obtained showed a significant underutilization of geospatial datasets and a generally poor quality of their metadata. Besides, a weak correlation was found between the use and the quality of the metadata, not such as to assert with certainty that the latter is a determining factor of the former.
The dataset consists of six zipped CSV files, containing the collected datasets' usage data, full metadata, and computed quality values, for about 160,000 geospatial datasets belonging to the three national and three international portals considered in the study, i.e. US (catalog.data.gov), Colombia (datos.gov.co), Ireland (data.gov.ie), HDX (data.humdata.org), EUODP (data.europa.eu), and NASA (data.nasa.gov).
Data collection occurred in the period: 2019-12-19 -- 2019-12-23.
The header for each CSV file is:
[ ,portalid,id,downloaddate,metadata,overallq,qvalues,assessdate,dviews,downloads,engine,admindomain]
where each row corresponds to a single dataset from one of the portals. The fields are analogous to those defined in the "Table Fields" list of the related collection of about 400,000 datasets further below.
[1] Neumaier, S.; Umbrich, J.; Polleres, A. Automated Quality Assessment of Metadata Across Open Data Portals. J. Data and Information Quality 2016, 8, 2:1–2:29. doi:10.1145/2964909
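As a usage illustration, the following Python sketch loads one of the zipped CSV files (the file name here is hypothetical) using the column names listed above and measures the correlation between overall metadata quality and usage. Spearman rank correlation is an illustrative choice, since the abstract does not state which correlation measure was used.

import pandas as pd

# Hypothetical file name; pandas can read a zip archive that contains a single CSV.
df = pd.read_csv("us_geospatial_datasets.csv.zip")

print(df[["overallq", "dviews", "downloads"]].describe())

# Correlation between overall metadata quality and usage (views, downloads).
print(df["overallq"].corr(df["dviews"], method="spearman"))
print(df["overallq"].corr(df["downloads"], method="spearman"))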
This Excel workbook is a compilation of the major metadata schemas for life cycle assessment. Resources in this dataset: Resource Title: LCADomain_MetadataSchema_Inventory_v1_0_2. File Name: LCADomain_MetadataSchema_Inventory_v1_0_2.xlsm
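A minimal Python sketch for opening the workbook named above; the sheet names are not documented here, so every worksheet is loaded (assumes pandas with the openpyxl engine installed).

import pandas as pd

# sheet_name=None loads every worksheet into a dict of DataFrames.
sheets = pd.read_excel(
    "LCADomain_MetadataSchema_Inventory_v1_0_2.xlsm",
    sheet_name=None,
    engine="openpyxl",
)
for name, frame in sheets.items():
    print(name, frame.shape)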
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Open Government Data (OGD) has the potential to support social and economic progress. However, this potential can be frustrated if the data remains unused. Although the literature suggests that the metadata quality of OGD datasets is one of the main factors affecting their use, to the best of our knowledge no quantitative study has provided evidence of this relationship. Considering about 400,000 datasets from 28 national, municipal, and international OGD portals, we have programmatically analyzed their usage, their metadata quality, and the relationship between the two. Our analysis highlighted three main findings. First, regardless of their size, the software platform adopted, and their administrative and territorial coverage, most OGD datasets are underutilized. Second, OGD portals pay varying attention to the quality of their datasets' metadata. Third, we did not find clear evidence that dataset usage is positively correlated with better metadata publishing practices. Finally, we considered other factors, such as dataset category and some demographic characteristics of the OGD portals, and analyzed their relationship with dataset usage, obtaining partially affirmative answers.
The dataset consists of three zipped CSV files, containing the collected datasets' usage data, full metadata, and computed quality values, for about 400,000 datasets belonging to the 8 national, 4 international, and 16 US municipal OGD portals considered in the study.
Data collection occurred in the period: 2019-12-19 -- 2019-12-23.
Portal #Datasets Platform
US 261,514 CKAN
France 39,412 Other
Colombia 9,795 Socrata
IE 9,598 CKAN
Slovenia 4,892 CKAN
Poland 1,032 Other
Latvia 336 CKAN
Puerto Rico 178 Socrata
New York, NY 2,771 Socrata
Baltimore, MD 2,617 Socrata
Austin, TX 2,353 Socrata
Chicago, IL 1,368 Socrata
San Francisco, CA 1,001 Socrata
Dallas, TX 1,001 Socrata
Los Angeles, CA 943 Socrata
Seattle, WA 718 Socrata
Providence, RI 288 Socrata
Honolulu, HI 244 Socrata
New Orleans, LA 215 Socrata
Buffalo, NY 213 Socrata
Nashville, TN 172 Socrata
Boston, MA 170 CKAN
Albuquerque, NM 60 CKAN
Albany, NY 50 Socrata
HDX 17,325 CKAN
EUODP 14,058 CKAN
NASA 9,664 Socrata
World Bank Finances 2,177 Socrata
The three datasets share the same table structure:
Table Fields
portalid: portal identifier
id: dataset identifier
engine: identifier of the supporting portal platform: 1 (CKAN), 2 (Socrata)
admindomain: 1 (National), 2 (US), 3 (International)
downloaddate: date of data collection
views: number of total views for the dataset
downloads: number of total downloads for the dataset
overallq: overall quality values computed by applying the methodology presented by Neumaier et al. in [1]
qvalues: JSON object containing the quality values computed for the 17 metrics presented by Neumaier et al. [1]
assessdate: date of quality assessment
metadata: the overall dataset's metadata downloaded via API from the portal according to the supporting platform schema
[1] Neumaier, S.; Umbrich, J.; Polleres, A. Automated Quality Assessment of Metadata Across Open Data Portals. J. Data and Information Quality 2016, 8, 2:1–2:29. doi:10.1145/2964909
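As an illustration of working with this table structure, the Python sketch below (hypothetical file name) expands the per-metric quality scores stored as JSON in the qvalues column into separate columns and aggregates usage and quality per portal.

import json
import pandas as pd

# Hypothetical file name; the column names are those documented above.
df = pd.read_csv("national_portals.csv.zip")

# One column per Neumaier et al. quality metric, taken from the "qvalues" JSON object.
metrics = df["qvalues"].apply(json.loads).apply(pd.Series)
df = pd.concat([df.drop(columns=["qvalues"]), metrics], axis=1)

# Average overall quality and views per portal.
print(df.groupby("portalid")[["overallq", "views"]].mean())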
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was collected using the GitHub repository Repositories-Extraction to gather the links to the repositories of each scientific cluster, and the GitHub repository Metadata-Extraction to extract the relevant information needed to answer our research questions (RQs):
RQ1: How do communities describe Research Software metadata in their code repositories?
RQ2: What is the adoption of archival infrastructures across disciplines?
RQ3: How do software projects adopt versioning?
RQ4: How comprehensive is the metadata provided in code repositories? Specifically:
What is the adoption of open licenses?
Do research projects include a description?
How well documented are research projects? (i.e., in terms of installation instructions, requirements and documentation availability)
RQ5: What are the most common citation practices among communities?
The dataset contains two types of files per cluster and research question. Taking the "ENVRI" cluster as an example, for each RQ you will find "analysis_envri_rq1.json", which contains the information extracted using SOMEF and processed to keep what is relevant, and "results_envri_rq1.json", which contains the percentages calculated from the files relevant to that RQ.
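A minimal Python sketch for browsing one of the per-cluster result files named above; the internal layout of the JSON is not documented here, so the sketch simply walks the structure and prints any numeric values (such as the computed percentages) together with their paths.

import json

with open("results_envri_rq1.json") as f:
    results = json.load(f)

def walk(node, path=""):
    """Print every numeric value in the JSON together with its path."""
    if isinstance(node, dict):
        for key, value in node.items():
            walk(value, f"{path}/{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            walk(value, f"{path}[{index}]")
    elif isinstance(node, (int, float)):
        print(f"{path}: {node}")

walk(results)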
The main functionalities of Maggot were established according to a well-defined need (see Background):
Document your datasets with metadata within a collective of people, making it possible to answer certain questions of the Data Management Plan (DMP) concerning the organization, documentation, storage and sharing of data in the data storage space, and to meet certain data and metadata requirements, listed for example by Open Research Europe in accordance with the FAIR principles.
Search datasets by their metadata: the descriptive metadata thus produced can be associated with the corresponding data directly in the storage space, and it is then possible to perform a search on the metadata in order to find one or more datasets. Only descriptive metadata is accessible by default.
Publish the metadata of datasets along with their data files into a Europe-approved repository.
Software versions: PHP 7.4.33, MongoDB 6.0.14, Python 3.8.10, Docker 20.10.12.
Open Government Licence - Canada 2.0 https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
The open data portal catalogue is a downloadable dataset containing some key metadata for the general datasets available on the Government of Canada's Open Data portal. Resource 1 is generated using the ckanapi tool (external link). Resources 2 - 8 are generated using the Flatterer (external link) utility.
Description of resources:
1. Dataset is a JSON Lines (external link) file where the metadata of each Dataset/Open Information Record is one line of JSON. The file is compressed with GZip. The file is heavily nested and recommended for users familiar with working with nested JSON.
2. Catalogue is an XLSX workbook where the nested metadata of each Dataset/Open Information Record is flattened into worksheets for each type of metadata.
3. Datasets Metadata contains metadata at the dataset level. This is also referred to as the package in some CKAN documentation. This is the main table/worksheet in the SQLite database and XLSX output.
4. Resources Metadata contains the metadata for the resources contained within each dataset.
5. Resource Views Metadata contains the metadata for the views applied to each resource, if a resource has a view configured.
6. Datastore Fields Metadata contains the DataStore information for CSV datasets that have been loaded into the DataStore. This information is displayed in the Data Dictionary for DataStore-enabled CSVs.
7. Data Package Fields contains a description of the fields available in each of the tables within the Catalogue, as well as the count of the number of records each table contains.
8. Data Package Entity Relation Diagram displays the title and format for each column, in each table in the Data Package, in the form of an ERD diagram. The Data Package resource offers a text-based version.
9. SQLite Database is a .db database, similar in structure to Catalogue. This can be queried with database or analytical software tools for doing analysis.
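A minimal Python sketch for reading resource 1, the GZip-compressed JSON Lines file, where each line is the full CKAN metadata of one Dataset/Open Information Record; the file name is hypothetical, and the printed fields ("title", "resources") are standard CKAN package keys.

import gzip
import json

# Hypothetical file name: use the name of the downloaded JSON Lines resource.
with gzip.open("open_data_catalogue.jsonl.gz", "rt", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        # "title" and "resources" are standard CKAN package fields.
        print(record.get("title"), len(record.get("resources", [])))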
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These data contain aggregated survey responses assessing the quality and completeness of metadata for datasets deposited in public repositories and for the same datasets after professional curation. Responses were provided by 10 professional editors representing the life, social and physical sciences. Each was randomly assigned four datasets to assess; half (20) of the assessed datasets had been curated according to the standards of Springer Nature's Research Data Support service and half (20) had not. Curated datasets were shared privately with research participants. The versions that did not receive curation via Springer Nature's Research Data Support are openly accessible. Single-blind testing was employed; the researchers were not made aware which datasets had been curated and which had not, and it was ensured that no participant assessed the same dataset before and after curation. Responses were collected via an online survey. The relevant question and scoring are provided below: "Rate the overall quality and completeness of the metadata for the dataset (with regards to finding, accessing and citing the data, not reusing the data)." 1 = not complete, 5 = very complete.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains daily stock data for Meta Platforms, Inc. (META), formerly Facebook Inc., from May 19, 2012, to January 20, 2025. It offers a comprehensive view of Meta’s stock performance and market fluctuations during a period of significant growth, acquisitions, and technological advancements. This dataset is valuable for financial analysis, market prediction, machine learning projects, and evaluating the impact of Meta’s business decisions on its stock price.
The dataset includes the following key features:
Date: The date of the trading day, formatted as YYYY-MM-DD.
Open: The stock price at the start of the trading day.
High: The highest price reached by the stock during the trading day.
Low: The lowest price reached by the stock during the trading day.
Close: The stock price at the end of the trading day.
Adj Close: The adjusted closing price, accounting for corporate actions such as stock splits, dividends, and other financial adjustments.
Volume: The total number of shares traded during the trading day.
This dataset was sourced from reliable public APIs such as Yahoo Finance or Alpha Vantage. It is provided for educational and research purposes and is not affiliated with Meta Platforms, Inc. Users are encouraged to adhere to the terms of use of the original data provider.
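As a usage illustration, the Python sketch below (hypothetical file name) loads the columns described above with pandas and derives daily returns from the adjusted close.

import pandas as pd

# Hypothetical file name for the CSV described above.
df = pd.read_csv("meta_stock_data.csv", parse_dates=["Date"]).set_index("Date")

# Daily return derived from the adjusted closing price.
df["Daily Return"] = df["Adj Close"].pct_change()

print(df[["Open", "High", "Low", "Close", "Adj Close", "Volume", "Daily Return"]].tail())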
Point of Interest (POI) is defined as an entity (such as a business) at a ground location (point) which may be (of interest). We provide high-quality POI data that is fresh, consistent, customizable, easy to use and with high-density coverage for all countries of the world.
This is our process flow:
Our machine learning systems continuously crawl for new POI data
Our geoparsing and geocoding calculates their geo locations
Our categorization systems cleanup and standardize the datasets
Our data pipeline API publishes the datasets on our data store
A new POI comes into existence. It could be a bar, a stadium, a museum, a restaurant, a cinema, a store, etc. In today's interconnected world, its information will appear very quickly in social media, pictures, websites, and press releases. Soon after that, our systems will pick it up.
POI data is in constant flux. Every minute worldwide, over 200 businesses will move, over 600 new businesses will open their doors and over 400 businesses will cease to exist. Over 94% of all businesses have a public online presence of some kind, which makes such changes trackable. When a business changes, its website and social media presence will change too. We'll then extract and merge the new information, thus creating the most accurate and up-to-date business information dataset across the globe.
We offer our customers perpetual data licenses for any dataset representing this ever changing information, downloaded at any given point in time. This makes our company's licensing model unique in the current Data as a Service - DaaS Industry. Our customers don't have to delete our data after the expiration of a certain "Term", regardless of whether the data was purchased as a one time snapshot, or via our data update pipeline.
Customers requiring regularly updated datasets may subscribe to our annual subscription plans. Our data is continuously refreshed, so subscription plans are recommended for those who need the most up-to-date data. The main differentiators between us and the competition are our flexible licensing terms and our data freshness.
Data samples may be downloaded at https://store.poidata.xyz/us
Wirestock's AI/ML Image Training Data, 4.5M Files with Metadata: This data product is a unique offering in the realm of AI/ML training data. What sets it apart is the sheer volume and diversity of the dataset, which includes 4.5 million files spanning across 20 different categories. These categories range from Animals/Wildlife and The Arts to Technology and Transportation, providing a rich and varied dataset for AI/ML applications.
The data is sourced from Wirestock's platform, where creators upload and sell their photos, videos, and AI art online. This means that the data is not only vast but also constantly updated, ensuring a fresh and relevant dataset for your AI/ML needs. The data is collected in a GDPR-compliant manner, ensuring the privacy and rights of the creators are respected.
The primary use-cases for this data product are numerous. It is ideal for training machine learning models for image recognition, improving computer vision algorithms, and enhancing AI applications in various industries such as retail, healthcare, and transportation. The diversity of the dataset also means it can be used for more niche applications, such as training AI to recognize specific objects or scenes.
This data product fits into Wirestock's broader data offering as a key resource for AI/ML training. Wirestock is a platform for creators to sell their work, and this dataset is a collection of that work. It represents the breadth and depth of content available on Wirestock, making it a valuable resource for any company working with AI/ML.
The core benefits of this dataset are its volume, diversity, and quality. With 4.5 million files, it provides a vast resource for AI training. The diversity of the dataset, spanning 20 categories, ensures a wide range of images for training purposes. The quality of the images is also high, as they are sourced from creators selling their work on Wirestock.
In terms of how the data is collected, creators upload their work to Wirestock, where it is then sold on various marketplaces. This means the data is sourced directly from creators, ensuring a diverse and unique dataset. The data includes both the images themselves and associated metadata, providing additional context for each image.
The different image categories included in this dataset are Animals/Wildlife, The Arts, Backgrounds/Textures, Beauty/Fashion, Buildings/Landmarks, Business/Finance, Celebrities, Education, Emotions, Food Drinks, Holidays, Industrial, Interiors, Nature Parks/Outdoor, People, Religion, Science, Signs/Symbols, Sports/Recreation, Technology, Transportation, Vintage, Healthcare/Medical, Objects, and Miscellaneous. This wide range of categories ensures a diverse dataset that can cater to a variety of AI/ML applications.
The main files are books.json, works.json, and authors.json. The identifiers can be used to locate relevant data across files and the data contained in these files complement each other.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Descriptions of data derived from previously published studies and used in a meta-analysis as described in "Prey responses to direct and indirect predation risk cues reveal the importance of multiple information sources."
https://crawlfeeds.com/privacy_policy
We provide a high-quality Rotten Tomatoes movie dataset that includes key metadata for thousands of movies. This dataset is ideal for anyone working with movie-related platforms, entertainment analytics, content curation, or movie discovery tools.
Our collection is structured, clean, and designed to support real-time apps, dashboards, and research use cases.
Each record in the dataset contains core information pulled directly from Rotten Tomatoes, including:
Movie Name – The official title of the movie.
Poster URL – High-resolution image link to the movie poster.
Trailer URL – Direct link to the official trailer (when available).
Genre – One or more genres associated with the movie, such as Action, Drama, Comedy, or Horror.
Release Date – The date the movie was released to the public.
Actors – Main cast members listed on Rotten Tomatoes.
Directors – Director(s) responsible for the movie.
Rating – Audience or critic scores, where available.
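As a rough illustration of the record structure, the Python sketch below models one movie with the fields listed above; the field names, types and optionality are assumptions based on this description, not a schema published with the dataset.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Movie:
    movie_name: str
    poster_url: Optional[str] = None
    trailer_url: Optional[str] = None            # may be missing when no trailer is available
    genres: List[str] = field(default_factory=list)
    release_date: Optional[str] = None           # e.g. "1999-03-31"
    actors: List[str] = field(default_factory=list)
    directors: List[str] = field(default_factory=list)
    rating: Optional[float] = None               # audience or critic score, where available

example = Movie(movie_name="Example Film", genres=["Drama"])
print(example)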
This dataset spans a wide range of movies across all major genres and decades. From modern releases to timeless classics, from Hollywood blockbusters to independent films — we’ve included movies of all types with relevant data points.
You can expect data on:
U.S. theatrical releases
Netflix, Amazon, and other streaming exclusives
Festival films and limited releases
Animated and documentary films
Here are just a few ways this dataset can be useful:
Movie Recommendation Engines – Use metadata and genre info to power personalized movie suggestions.
Entertainment Search Tools – Build searchable movie listings with visual poster previews and trailer links.
Data Visualization Projects – Create dashboards showing trends by genre, release periods, or actor participation.
AI/ML Training – Use metadata to train classification models or sentiment prediction tools.
Research & Academic Use – Analyze patterns in movie releases, cast dynamics, and genre evolution.
Clean & ready-to-use: No raw HTML, just clean structured data.
Minimal but meaningful fields: Focused on useful movie attributes without clutter.
Updated info: Covers both classic and current titles.
Simple integration: Easy to use for developers, analysts, and product teams.
If you're working on a movie-based product or looking for reliable film metadata for your project, this dataset offers an ideal foundation.
Let us know if you’d like to explore it further.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Metadata of a Large Sonar and Stereo Camera Dataset Suitable for Sonar-to-RGB Image Translation
Introduction
This is a set of metadata describing a large dataset of synchronized sonar and stereo camera recordings that were captured between August 2021 and September 2023 during the project DeeperSense (https://robotik.dfki-bremen.de/en/research/projects/deepersense/), as training data for Sonar-to-RGB image translation. Parts of the sensor data have been published (https://zenodo.org/records/7728089, https://zenodo.org/records/10220989). Due to the size of the sensor data corpus, it is currently impractical to make the entire corpus accessible online. Instead, this metadatabase serves as a relatively compact representation, allowing interested researchers to inspect the data and select relevant portions for their particular use case, which will be made available on demand. This is an effort to comply with the FAIR principle A2 (https://www.go-fair.org/fair-principles/), i.e. that metadata shall be accessible even when the base data is not immediately accessible.
Locations and sensors
The sensor data was captured at four different locations, including one laboratory (Maritime Exploration Hall at DFKI RIC Bremen) and three field locations (Chalk Lake Hemmoor, Tank Wash Basin Neu-Ulm, Lake Starnberg). At all locations, a ZED camera and a Blueprint Oculus M1200d sonar were used. Additionally, a SeaVision camera was used at the Maritime Exploration Hall at DFKI RIC Bremen and at the Chalk Lake Hemmoor. The examples/ directory holds a typical output image for each sensor at each available location.
Data volume per session
Six data collection sessions were conducted. The table below presents an overview of the amount of data captured in each session:
Session dates | Location | Number of datasets | Total duration of datasets [h] | Total logfile size [GB] | Number of images | Total image size [GB]
2021-08-09 - 2021-08-12 | Maritime Exploration Hall at DFKI RIC Bremen | 52 | 10.8 | 28.8 | 389’047 | 88.1
2022-02-07 - 2022-02-08 | Maritime Exploration Hall at DFKI RIC Bremen | 35 | 4.4 | 54.1 | 629’626 | 62.3
2022-04-26 - 2022-04-28 | Chalk Lake Hemmoor | 52 | 8.1 | 133.6 | 1’114’281 | 97.8
2022-06-28 - 2022-06-29 | Tank Wash Basin Neu-Ulm | 42 | 6.7 | 144.2 | 824’969 | 26.9
2023-04-26 - 2023-04-27 | Maritime Exploration Hall at DFKI RIC Bremen | 55 | 7.4 | 141.9 | 739’613 | 9.6
2023-09-01 - 2023-09-02 | Lake Starnberg | 19 | 2.9 | 40.1 | 217’385 | 2.3
Total | | 255 | 40.3 | 542.7 | 3’914’921 | 287.0
Data and metadata structure
Sensor data corpus
The sensor data corpus comprises two processing stages:
raw data streams stored in ROS bagfiles (aka logfiles),
camera and sonar images (aka datafiles) extracted from the logfiles.
The files are stored in a file tree hierarchy which groups them by session, dataset, and modality:
${session_key}/
    ${dataset_key}/
        ${logfile_name}
        ${modality_key}/
            ${datafile_name}
A typical logfile path has this form:
2023-09_starnberg_lake/2023-09-02-15-06_hydraulic_drill/stereo_camera-zed-2023-09-02-15-06-07.bag
A typical datafile path has this form:
2023-09_starnberg_lake/2023-09-02-15-06_hydraulic_drill/zed_right/1693660038_368077993.jpg
All directory and file names, and their particles, are designed to serve as identifiers in the metadatabase. Their formatting, as well as the definitions of all terms, are documented in the file entities.json.
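As an illustration of the naming scheme, the Python sketch below decomposes a typical datafile path into the identifiers used by the metadatabase and converts the file name into a timestamp; the exact formatting rules are defined in entities.json, and the seconds/nanoseconds interpretation of the file name is an assumption.

from datetime import datetime, timezone
from pathlib import PurePosixPath

path = PurePosixPath(
    "2023-09_starnberg_lake/2023-09-02-15-06_hydraulic_drill/zed_right/1693660038_368077993.jpg"
)
session_key, dataset_key, modality_key, datafile_name = path.parts

# Assumption: the datafile name encodes seconds and nanoseconds since the UNIX epoch.
seconds, nanoseconds = datafile_name.rsplit(".", 1)[0].split("_")
timestamp = datetime.fromtimestamp(int(seconds) + int(nanoseconds) / 1e9, tz=timezone.utc)

print(session_key, dataset_key, modality_key, timestamp.isoformat())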
Metadatabase
The metadatabase is provided in two equivalent forms:
as a standalone SQLite (https://www.sqlite.org/index.html) database file metadata.sqlite for users familiar with SQLite,
as a collection of CSV files in the csv/ directory for users who prefer other tools.
The database file has been generated from the CSV files, so each database table holds the same information as the corresponding CSV file. In addition, the metadatabase contains a series of convenience views that facilitate access to certain aggregate information.
An entity relationship diagram of the metadatabase tables is stored in the file entity_relationship_diagram.png. Each entity, its attributes, and relations are documented in detail in the file entities.json.
Some general design remarks:
For convenience, timestamps are always given in both a human-readable form (ISO 8601 formatted datetime strings with explicit local time zone), and as seconds since the UNIX epoch.
In practice, each logfile always contains a single stream, and each stream is stored always in a single logfile. Per database schema however, the entities stream and logfile are modeled separately, with a “many-streams-to-one-logfile” relationship. This design was chosen to be compatible with, and open for, data collections where a single logfile contains multiple streams.
A modality is not an attribute of a sensor alone, but of a datafile: Because a sensor is an attribute of a stream, and a single stream may be the source of multiple modalities (e.g. RGB vs. grayscale images from the same camera, or cartesian vs. polar projection of the same sonar output). Conversely, the same modality may originate from different sensors.
As a usage example, the data volume per session which is tabulated at the top of this document, can be extracted from the metadatabase with the following SQL query:
SELECT
    PRINTF('%s - %s', SUBSTR(session_start, 1, 10), SUBSTR(session_end, 1, 10)) AS 'Session dates',
    location_name_english AS Location,
    number_of_datasets AS 'Number of datasets',
    total_duration_of_datasets_h AS 'Total duration of datasets [h]',
    total_logfile_size_gb AS 'Total logfile size [GB]',
    number_of_images AS 'Number of images',
    total_image_size_gb AS 'Total image size [GB]'
FROM location
JOIN session USING (location_id)
JOIN (
    SELECT
        session_id,
        COUNT(dataset_id) AS number_of_datasets,
        ROUND(SUM(dataset_duration) / 3600, 1) AS total_duration_of_datasets_h,
        ROUND(SUM(total_logfile_size) / 10e9, 1) AS total_logfile_size_gb
    FROM location
    JOIN session USING (location_id)
    JOIN dataset USING (session_id)
    JOIN view_dataset_total_logfile_size USING (dataset_id)
    GROUP BY session_id
) USING (session_id)
JOIN (
    SELECT
        session_id,
        COUNT(datafile_id) AS number_of_images,
        ROUND(SUM(datafile_size) / 10e9, 1) AS total_image_size_gb
    FROM session
    JOIN dataset USING (session_id)
    JOIN stream USING (dataset_id)
    JOIN datafile USING (stream_id)
    GROUP BY session_id
) USING (session_id)
ORDER BY session_id;
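The metadatabase can also be queried from Python with the standard sqlite3 module; the short sketch below uses only tables and columns that appear in the SQL example above (location, session, dataset and their keys) to count datasets per location.

import sqlite3

con = sqlite3.connect("metadata.sqlite")
query = """
    SELECT location_name_english AS location,
           COUNT(dataset_id)     AS number_of_datasets
    FROM location
    JOIN session USING (location_id)
    JOIN dataset USING (session_id)
    GROUP BY location_id
    ORDER BY number_of_datasets DESC;
"""
for location, number_of_datasets in con.execute(query):
    print(f"{location}: {number_of_datasets}")
con.close()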
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Nowadays, there are lots of datasets available for training and experimentation in the field of recommender systems. Specifically, in the recommendation of audiovisual content, the MovieLens dataset is a prominent example. It is focused on the user-item relationship, providing actual interaction data between users and movies. However, although movies can be described with several characteristics, this dataset only offers limited information about the movie genres.
In this work, we propose enriching the MovieLens dataset by incorporating metadata available on the web (such as cast, description, keywords, etc.) and movie trailers. By leveraging the trailers, we extract audio information and generate transcriptions for each trailer, introducing a crucial textual dimension to the dataset. The audio information was extracted through waveform and frequency analysis, followed by the application of dimensionality reduction techniques. For the transcription generation, the deep learning model Whisper was used. Finally, metadata was obtained from TMDB, and the BERT model was applied to extract embeddings.
These additional attributes enrich the original dataset, providing deeper and more precise analysis. Then, the use of this extended and enhanced dataset could drive significant advancements in recommendation systems, enhancing user experiences by providing more relevant and tailored movie recommendations based on their tastes and preferences.
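For orientation, the following Python sketch outlines the kind of pipeline described above, transcribing a trailer's audio with Whisper and embedding the resulting text with BERT; the model sizes, pooling strategy and file name are illustrative assumptions, not the authors' exact configuration.

import torch
import whisper
from transformers import AutoModel, AutoTokenizer

# 1) Transcribe the trailer's audio track (hypothetical file name).
asr = whisper.load_model("base")
transcription = asr.transcribe("trailer_audio.mp3")["text"]

# 2) Embed the transcription (or TMDB metadata text) with BERT.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
inputs = tokenizer(transcription, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    hidden = bert(**inputs).last_hidden_state
embedding = hidden.mean(dim=1).squeeze(0)  # simple mean pooling over tokens

print(embedding.shape)  # 768-dimensional vector for bert-base models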
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Data.public.lu provides all its metadata in the DCAT and DCAT-AP formats, i.e. all data about the data stored or referenced on data.public.lu. DCAT (Data Catalog Vocabulary) is a specification designed to facilitate interoperability between data catalogs published on the Web. This specification has been extended via the DCAT-AP (DCAT Application Profile for data portals in Europe) standard, specifically for data portals in Europe. The serialisation of those vocabularies is mainly done in RDF (Resource Description Framework). The implementation of data.public.lu is based on that of the open source udata platform. This API enables the federation of multiple data portals together; for example, all the datasets published on data.public.lu are also published on data.europa.eu. The DCAT API from data.public.lu is used by the European data portal to federate its metadata. The DCAT standard is thus very important to guarantee the interoperability between all data portals in Europe.

Usage

Full catalog
You can find here a few examples using the curl command line tool. To get all the metadata from the whole catalog hosted on data.public.lu:
curl https://data.public.lu/catalog.rdf

Metadata for an organization
To get the metadata of a specific organization, you first need to find its ID. The ID of an organization is the last part of its URL. For the organization "Open data Lëtzebuerg", its URL is https://data.public.lu/fr/organizations/open-data-letzebuerg/ and its ID is open-data-letzebuerg. To get all the metadata for a given organization, call the following URL, where {id} has been replaced by the correct ID: https://data.public.lu/api/1/organizations/{id}/catalog.rdf
Example: curl https://data.public.lu/api/1/organizations/open-data-letzebuerg/catalog.rdf

Metadata for a dataset
To get the metadata of a specific dataset, you first need to find its ID. The ID of a dataset is the last part of its URL. For the dataset "Digital accessibility monitoring report - 2020-2021", its URL is https://data.public.lu/fr/datasets/digital-accessibility-monitoring-report-2020-2021/ and its ID is digital-accessibility-monitoring-report-2020-2021. To get all the metadata for a given dataset, call the following URL, where {id} has been replaced by the correct ID: https://data.public.lu/api/1/datasets/{id}/rdf
Example: curl https://data.public.lu/api/1/datasets/digital-accessibility-monitoring-report-2020-2021/rdf

Compatibility with DCAT-AP 2.1.1
The DCAT-AP standard is in constant evolution, so the compatibility of the implementation should be regularly compared with the standard and adapted accordingly. In May 2023 we performed this comparison, and the result is available in the resources below (see the document named "udata 6 dcat-ap implementation status"). In the DCAT-AP model, classes and properties have a priority level which should be respected in every implementation: mandatory, recommended and optional. Our goal is to implement all mandatory classes and properties, and if possible all recommended classes and properties which make sense in the context of our open data portal.
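In addition to curl, the same endpoints can be queried from Python; the sketch below fetches the DCAT metadata of the example dataset mentioned above and lists its dct:title values with rdflib (the assumption that the endpoint serves RDF/XML is noted in the code).

import requests
from rdflib import Graph
from rdflib.namespace import DCTERMS

dataset_id = "digital-accessibility-monitoring-report-2020-2021"
response = requests.get(f"https://data.public.lu/api/1/datasets/{dataset_id}/rdf")
response.raise_for_status()

g = Graph()
g.parse(data=response.text, format="xml")  # assumption: the endpoint returns RDF/XML by default

for subject, _, title in g.triples((None, DCTERMS.title, None)):
    print(subject, title)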
From the earliest stages of planning the North West Shelf Joint Environmental Management Study it was evident that good management of the scientific data to be used in the research would be important for the success of the Study. A comprehensive review of data sets and other information relevant to the marine ecosystems, the geology, infrastructure and industries of the North West Shelf area had been completed (Heyward et al. 2006). The Data Management Project was established to source and prepare existing data sets for use, requiring the development and use of a range of tools: metadata systems, data visualisation and data delivery applications. These were made available to collaborators to allow easy access to data obtained and generated by the Study. The CMAR MarLIN metadata system was used to document the 285 data sets that were identified as potentially useful for the Study, as well as the software and information products generated by and for the Study. This report represents a hard-copy atlas of all NWSJEMS data products and the existing data sets identified for potential use as inputs to the Study. It comprises summary metadata elements describing the data sets, their custodianship and how the data sets might be obtained. The identifiers of each data set can be used to refer to the full metadata records in the on-line MarLIN system.