bs-modeling-metadata/c4-en-html-with-metadata dataset hosted on Hugging Face and contributed by the HF Datasets community
https://webtechsurvey.com/terms
A complete list of live websites using the Social Page Metadata technology, compiled through global website indexing conducted by WebTechSurvey.
This data dictionary describes relevant fields from secondary data sources that can assist with modeling the conditions of use for a chemical when performing a chemical assessment. Information on how to access the secondary data sources is included. This dataset is associated with the following publication: Chea, J.D., D.E. Meyer, R.L. Smith, S. Takkellapati, and G.J. Ruiz-Mercado. Exploring automated tracking of chemicals through their conditions of use to support life cycle chemical assessment. JOURNAL OF INDUSTRIAL ECOLOGY. Berkeley Electronic Press, Berkeley, CA, USA, 29(2): 413-616, (2025).
Attribution 2.5 (CC BY 2.5): https://creativecommons.org/licenses/by/2.5/
License information was derived automatically
This dataset is an extension of the rag-mini-bioasq dataset. It differs in the text-corpus part of that set, where metadata has been added for each passage. The metadata comprises six separate categories, each in a dedicated column (a minimal loading sketch follows the list):
Year of the publication (publish_year)
Type of the publication (publish_type)
Country of the publication, often correlated with the home country of the authors (country)
Number of pages (no_pages)
Authors (authors)
Keywords (keywords)
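The corpus can be inspected with the Hugging Face datasets library. The snippet below is a minimal sketch; the repository id is a placeholder for this extended dataset, and the configuration name follows the original rag-mini-bioasq layout, so both may need adjusting.

from datasets import load_dataset

# Placeholder repository id; substitute the actual Hugging Face repo of this
# extended rag-mini-bioasq dataset.
corpus = load_dataset("your-org/rag-mini-bioasq-with-metadata", "text-corpus")

# Prints the available splits and columns, which should include the metadata
# fields publish_year, publish_type, country, no_pages, authors, and keywords.
print(corpus)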
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Dublin Core Metadata Element Set is a vocabulary of fifteen properties for use in resource description. The name "Dublin" is due to its origin at a 1995 invitational workshop in Dublin, Ohio; "core" because its elements are broad and generic, usable for describing a wide range of resources.
The fifteen-element "Dublin Core" described in this standard is part of a larger set of metadata vocabularies and technical specifications maintained by the Dublin Core Metadata Initiative (DCMI). The full set of vocabularies, DCMI Metadata Terms, also includes sets of resource classes (including the DCMI Type Vocabulary), vocabulary encoding schemes, and syntax encoding schemes. The terms in DCMI vocabularies are intended to be used in combination with terms from other, compatible vocabularies in the context of application profiles and on the basis of the DCMI Abstract Model.
All changes made to terms of the Dublin Core Metadata Element Set since 2001 have been reviewed by a DCMI Usage Board in the context of a DCMI Namespace Policy. The namespace policy describes how DCMI terms are assigned Uniform Resource Identifiers (URIs) and sets limits on the range of editorial changes that may allowably be made to the labels, definitions, and usage comments associated with existing DCMI terms.
This document, an excerpt from the more comprehensive document DCMI Metadata Terms, provides an abbreviated reference version of the fifteen element descriptions that have been formally endorsed in standards including ISO 15836, ANSI/NISO Z39.85, and IETF RFC 5013.
Since 1998, when these fifteen elements entered into a standardization track, notions of best practice in the Semantic Web have evolved to include the assignment of formal domains and ranges in addition to definitions in natural language. Domains and ranges specify what kind of described resources and value resources are associated with a given property. Domains and ranges express the meanings implicit in natural-language definitions in an explicit form that is usable for the automatic processing of logical inferences. When a given property is encountered, an inferencing application may use information about the domains and ranges assigned to a property in order to make inferences about the resources described thereby.
Since January 2008, therefore, DCMI includes formal domains and ranges in the definitions of its properties. So as not to affect the conformance of existing implementations of "simple Dublin Core" in RDF, domains and ranges have not been specified for the fifteen properties of the dce: namespace (http://purl.org/dc/elements/1.1/). Rather, fifteen new properties with "names" identical to those of the Dublin Core Metadata Element Set Version 1.1 have been created in the dct: namespace (http://purl.org/dc/terms/). These fifteen new properties have been defined as sub-properties of the corresponding properties of DCMES Version 1.1 and assigned domains and ranges as specified in the more comprehensive document DCMI Metadata Terms.
Implementers may freely choose to use these fifteen properties either in their legacy dce: variant (e.g., http://purl.org/dc/elements/1.1/creator) or in the dct: variant (e.g., http://purl.org/dc/terms/creator) depending on application requirements. The RDF schemas of the DCMI namespaces describe the subproperty relation of dct:creator to dce:creator for use by Semantic Web-aware applications. Over time, however, implementers are encouraged to use the semantically more precise dct: properties, as they more fully follow emerging notions of best practice for machine-processable metadata.
Homepage: https://www.dublincore.org/specifications/dublin-core/dces/
Namespace: http://purl.org/dc/elements/1.1/
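To make the namespace distinction concrete, here is a small sketch using the Python rdflib library (the library choice and the example resource are illustrative, not part of the DCMI specification); it describes one resource with the legacy dce:creator property and with dct:creator.

from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DC, DCTERMS  # dce: and dct: namespaces

g = Graph()
doc = URIRef("http://example.org/document/1")  # hypothetical resource

# Legacy "simple Dublin Core" property from http://purl.org/dc/elements/1.1/
g.add((doc, DC.creator, Literal("Jane Smith")))

# dct:creator (http://purl.org/dc/terms/) is defined as a sub-property of
# dce:creator and carries a formal range, so a resource is preferred here.
g.add((doc, DCTERMS.creator, URIRef("http://example.org/person/jane-smith")))

print(g.serialize(format="turtle"))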
The OpenScience Slovenia metadata dataset contains metadata entries for Slovenian public domain academic documents, including undergraduate and postgraduate theses, research and professional articles, and other academic document types. The data was collected as part of the establishment of the Slovenian Open-Access Infrastructure, which defined a unified document collection and cataloguing process for Slovenian universities within the infrastructure repositories. The data was gathered from several already established but separate library systems in Slovenia and merged into a single metadata scheme using metadata deduplication and merging techniques. It consists of text and numerical fields representing attributes that describe documents, including document titles, keywords, abstracts, typologies, authors, issue years, and other identifiers such as URL and UDC. The dataset is particularly suited to text mining and text classification tasks and can also be used for developing or benchmarking content-based recommender systems on real-world data.
https://crawlfeeds.com/privacy_policy
This comprehensive dataset features detailed metadata for over 190,000 movies and TV shows, with a strong concentration in the Horror genre. It is ideal for entertainment research, machine learning models, genre-specific trend analysis, and content recommendation systems.
Each record contains rich information, making it perfect for streaming platforms, film industry analysts, or academic media researchers.
Primary Genre Focus: Horror
Build movie recommendation systems or genre classifiers
Train NLP models on movie descriptions
Analyze Horror content trends over time
Explore box office vs. rating correlations
Enrich entertainment datasets with directorial and cast metadata
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains a collection of human-written and AI-generated texts, along with metadata such as text length, word count, and source type. The content ranges from case studies, essays, reflections, and personal narratives to AI reviews and feedback. It is useful for tasks such as text classification, authorship attribution, NLP benchmarking, and AI vs. human text analysis.
Dataset Structure
Each record is stored in JSON format with the following fields (an illustrative record follows the list):
text (string) – The full text content (essay, case study, review, or reflection).
source (string) – Indicates whether the text was written by a Human or generated by an AI/Assistant.
prompt_id (integer) – Identifier linking the text to a given prompt or task.
text_length (integer) – The number of characters in the text.
word_count (integer) – The number of words in the text.
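A hypothetical example record, built in Python so the length fields stay consistent with the text; all values are invented for illustration.

import json

text = "This reflection examines how metadata improves dataset discoverability."
record = {
    "text": text,
    "source": "Human",                # or "AI" / "Assistant"
    "prompt_id": 42,                  # hypothetical prompt identifier
    "text_length": len(text),         # number of characters in the text
    "word_count": len(text.split()),  # number of words in the text
}
print(json.dumps(record, indent=2))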
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Automated classification of research data metadata by field of study can be used in scientometric research, by repository service providers, and in the context of research data aggregation services. To evaluate different machine learning approaches, data from the DataCite index were downloaded in May 2019 with a GeRDI harvester (filtering out any metadata without a qualified subject, i.e. a subject with either a subjectName or a subjectURI). This is the resulting raw data set.
This dataset contains the metadata of the datasets published in 77 Dataverse installations, information about each installation's metadata blocks, and the list of standard licenses that dataset depositors can apply to the datasets they publish in the 36 installations running more recent versions of the Dataverse software. The data is useful for reporting on the quality of dataset and file-level metadata within and across Dataverse installations. Curators and other researchers can use this dataset to explore how well Dataverse software and the repositories using the software help depositors describe data.

How the metadata was downloaded
The dataset metadata and metadata block JSON files were downloaded from each installation on October 2 and October 3, 2022 using a Python script kept in a GitHub repo at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_dataset_metadata_of_all_installations.py. In order to get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL in which I was able to create an account and another named "apikey" listing my accounts' API tokens. The Python script expects and uses the API tokens in this CSV file to get metadata and other information from installations that require API tokens.

How the files are organized
├── csv_files_with_metadata_from_most_known_dataverse_installations
│   ├── author(citation).csv
│   ├── basic.csv
│   ├── contributor(citation).csv
│   ├── ...
│   └── topic_classification(citation).csv
├── dataverse_json_metadata_from_each_known_dataverse_installation
│   ├── Abacus_2022.10.02_17.11.19.zip
│   │   ├── dataset_pids_Abacus_2022.10.02_17.11.19.csv
│   │   ├── Dataverse_JSON_metadata_2022.10.02_17.11.19
│   │   │   ├── hdl_11272.1_AB2_0AQZNT_v1.0.json
│   │   │   ├── ...
│   │   └── metadatablocks_v5.6
│   │       ├── astrophysics_v5.6.json
│   │       ├── biomedical_v5.6.json
│   │       ├── citation_v5.6.json
│   │       ├── ...
│   │       └── socialscience_v5.6.json
│   ├── ACSS_Dataverse_2022.10.02_17.26.19.zip
│   ├── ADA_Dataverse_2022.10.02_17.26.57.zip
│   ├── Arca_Dados_2022.10.02_17.44.35.zip
│   ├── ...
│   └── World_Agroforestry_-_Research_Data_Repository_2022.10.02_22.59.36.zip
├── dataset_pids_from_most_known_dataverse_installations.csv
├── licenses_used_by_dataverse_installations.csv
└── metadatablocks_from_most_known_dataverse_installations.csv

This dataset contains two directories and three CSV files not in a directory. One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 18 CSV files that contain the values from common metadata fields of all 77 Dataverse installations. For example, author(citation)_2022.10.02-2022.10.03.csv contains the "Author" metadata for all published, non-deaccessioned versions of all datasets in the 77 installations, where there's a row for each author name, affiliation, identifier type and identifier. The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 77 zipped files, one for each of the 77 Dataverse installations whose dataset metadata I was able to download using Dataverse APIs. Each zip file contains a CSV file and two sub-directories: The CSV file contains the persistent IDs and URLs of each published dataset in the Dataverse installation as well as a column to indicate whether or not the Python script was able to download the Dataverse JSON metadata for each dataset.
For Dataverse installations using Dataverse software versions whose Search APIs include each dataset's owning Dataverse collection name and alias, the CSV files also include which Dataverse collection (within the installation) that dataset was published in. One sub-directory contains a JSON file for each of the installation's published, non-deaccessioned dataset versions. The JSON files contain the metadata in the "Dataverse JSON" metadata schema. The other sub-directory contains information about the metadata models (the "metadata blocks" in JSON files) that the installation was using when the dataset metadata was downloaded. I saved them so that they can be used when extracting metadata from the Dataverse JSON files.

The dataset_pids_from_most_known_dataverse_installations.csv file contains the dataset PIDs of all published datasets in the 77 Dataverse installations, with a column to indicate if the Python script was able to download the dataset's metadata. It's a union of all of the "dataset_pids_..." files in each of the 77 zip files.

The licenses_used_by_dataverse_installations.csv file contains information about the licenses that a number of the installations let depositors choose when creating datasets. When I collected this data, 36 installations were running versions of the Dataverse software that allow depositors to choose a license or data use agreement from a dropdown menu in the dataset deposit form. For more information, see https://guides.dataverse.org/en/5.11.1/user/dataset-management.html#choosing-a-license.

The metadatablocks_from_most_known_dataverse_installations.csv file contains the metadata block names, field names and child field names (if the field is a compound field) of the 77 Dataverse installations' metadata blocks. It is useful for comparing each installation's dataset metadata model (the metadata fields and the metadata blocks that each installation uses). The CSV file was created using a Python script at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_csv_file_with_metadata_block_fields_of_all_installations.py, which takes as inputs the directories and files created by the get_dataset_metadata_of_all_installations.py script.

Known errors
The metadata of two datasets from one of the known installations could not be downloaded because the datasets' pages and metadata could not be accessed with the Dataverse APIs.

About metadata blocks
Read about the Dataverse software's metadata blocks system at http://guides.dataverse.org/en/latest/admin/metadatacustomization.html
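As a usage sketch, the union CSV of dataset PIDs can be summarized with pandas. The column names used below ("installation" and "metadata_downloaded") are assumptions for illustration; check the actual CSV header and adjust.

import pandas as pd

pids = pd.read_csv("dataset_pids_from_most_known_dataverse_installations.csv")
print(pids.columns.tolist())  # inspect the real column names first

# Assumed columns: "installation" (installation name or hostname) and
# "metadata_downloaded" (True/False flag written by the download script).
summary = (
    pids.groupby("installation")["metadata_downloaded"]
        .agg(datasets="count", downloaded="sum")
        .sort_values("datasets", ascending=False)
)
print(summary.head())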
https://www.cognitivemarketresearch.com/privacy-policy
According to Cognitive Market Research, the global enterprise metadata management market size is USD 7.85 billion in 2024 and will expand at a compound annual growth rate (CAGR) of 24.1% from 2024 to 2031.
Market Dynamics of Enterprise Metadata Management Market
Key Drivers for Enterprise Metadata Management Market
Rapidly expanding data sets: Market growth is fueled by the rapid expansion of enterprise data. Enterprises need to manage and understand their massive and varied datasets as the amount of data they generate continues to grow exponentially. Managing structured and unstructured data becomes more complicated as organizations gather large volumes of data from many sources. Enterprise metadata management is crucial for understanding data context, relationships, and usage, as it offers a framework for organizing, describing, and controlling data through metadata. Well-managed metadata also yields improved data quality, easier data integration, and system-wide consistency. Firms that adopt enterprise metadata management can achieve better decision-making and operational efficiency because it increases data discoverability, streamlines data processes, and supports advanced analytics.
Demand for enterprise metadata management is also being driven by the growing popularity of big data and advanced analytics tools.
Key Restraints for Enterprise Metadata Management Market
The enterprise metadata management market is restrained by high implementation costs.
The implementation and maintenance of enterprise metadata management solutions can be impeded by a lack of trained specialists in this industry.
Introduction of the Enterprise Metadata Management Market
Enterprise metadata management is the process of managing an organization's metadata, the information about other data that gives it organization, meaning, and context. It supports better data management, regulatory compliance, and decision-making by ensuring that data is correctly defined and easy to find. The global enterprise metadata management market is driven primarily by the need for improved data governance and strict adherence to regulations. Demand is also propelled by the increasingly digital landscape and the widespread use of advanced analytics. In addition, blockchain technology is gaining traction across many industries because it helps manage and secure the metadata created and stored, opening up significant opportunities for enterprise metadata management and pointing to strong growth in the industry. Issues with data consistency across numerous channels remain a challenge for both business users and IT departments in the enterprise metadata management market.
The ESS-DIVE sample identifiers and metadata reporting format primarily follows the System for Earth Sample Registration (SESAR) Global Sample Number (IGSN) guide and template, with modifications to address Environmental Systems Science (ESS) sample needs and practicalities (IGSN-ESS). IGSNs are associated with standardized metadata to characterize a variety of different sample types (e.g. object type, material) and describe sample collection details (e.g. latitude, longitude, environmental context, date, collection method). Globally unique sample identifiers, particularly IGSNs, facilitate sample discovery, tracking, and reuse; they are especially useful when sample data is shared with collaborators, sent to different laboratories or user facilities for analyses, or distributed in different data files, datasets, and/or publications. To develop recommendations for multidisciplinary ecosystem and environmental sciences, we first conducted research on related sample standards and templates. We provide a comparison of existing sample reporting conventions, which includes mapping metadata elements across existing standards and Environment Ontology (ENVO) terms for sample object types and environmental materials. We worked with eight U.S. Department of Energy (DOE) funded projects, including those from Terrestrial Ecosystem Science and Subsurface Biogeochemical Research Scientific Focus Areas. Project scientists tested the process of registering samples for IGSNs and associated metadata in workflows for multidisciplinary ecosystem sciences. We provide modified IGSN metadata guidelines to account for needs of a variety of related biological and environmental samples. While generally following the IGSN core descriptive metadata schema, we provide recommendations for extending sample type terms, and connecting to related templates geared towards biodiversity (Darwin Core) and genomic (Minimum Information about any Sequence, MIxS) samples and specimens. ESS-DIVE recommends registering samples for IGSNs through SESAR, and we include instructions for registration using the IGSN-ESS guidelines. Our resulting sample reporting guidelines, template (IGSN-ESS), and identifier approach can be used by any researcher with sample data for ecosystem sciences.
World Imagery provides one meter or better satellite and aerial imagery in many parts of the world and lower resolution satellite imagery worldwide. The map includes 15m TerraColor imagery at small and mid-scales (~1:591M down to ~1:72k) and 2.5m SPOT Imagery (~1:288k to ~1:72k) for the world. The map features 0.5m resolution imagery in the continental United States and parts of Western Europe from Vantor. Additional Vantor sub-meter imagery is featured in many parts of the world. In the United States, 1 meter or better resolution NAIP imagery is available in some areas. In other parts of the world, imagery at different resolutions has been contributed by the GIS User Community. In select communities, very high resolution imagery (down to 0.03m) is available down to ~1:280 scale. You can contribute your imagery to this map and have it served by Esri via the Community Maps Program. View the list of Contributors for the World Imagery Map. See World Imagery for more information on this map. Metadata: Point and click on the map to see the resolution, collection date, and source of the imagery. Values of "99999" mean that metadata is not available for that field. The metadata applies only to the best available imagery at that location. You may need to zoom in to view the best available imagery. Feedback: Have you ever seen a problem in the Esri World Imagery Map that you wanted to see fixed? You can use the Imagery Map Feedback web map to provide feedback on issues or errors that you see. The feedback will be reviewed by the ArcGIS Online team and considered for one of our updates. Need Newer Imagery?: If you need to access more recent or higher resolution imagery, you can find and order that in the Content Store for ArcGIS app.
The dataset consists of public domain acute and chronic toxicity and chemistry data for algal species. Data are accessible at: https://envirotoxdatabase.org/ Data include algal species, chemical identification, and the concentrations that do and do not affect algal growth.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset outlines a proposed set of core, minimal metadata elements that can be used to describe biomedical datasets, such as those resulting from research funded by the National Institutes of Health. It can inform efforts to better catalog or index such data to improve discoverability. The proposed metadata elements are based on an analysis of the metadata schemas used in a set of NIH-supported data sharing repositories. Common elements from these data repositories were identified, mapped to existing data-specific metadata standards from the multidisciplinary data repositories DataCite and Dryad, and compared with metadata used in MEDLINE records to establish a sustainable and integrated metadata schema. From the mappings, we developed a preliminary set of minimal metadata elements that can be used to describe NIH-funded datasets. Please see the readme file for more details about the individual sheets within the spreadsheet.
Sraghvi/subset-0-with-metadata dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Metadata of a Large Sonar and Stereo Camera Dataset Suitable for Sonar-to-RGB Image Translation
Introduction
This is a set of metadata describing a large dataset of synchronized sonar and stereo camera recordings that were captured between August 2021 and September 2023 during the project DeeperSense (https://robotik.dfki-bremen.de/en/research/projects/deepersense/) as training data for Sonar-to-RGB image translation. Parts of the sensor data have been published (https://zenodo.org/records/7728089, https://zenodo.org/records/10220989). Due to the size of the sensor data corpus, it is currently impractical to make the entire corpus accessible online. Instead, this metadatabase serves as a relatively compact representation, allowing interested researchers to inspect the data and select relevant portions for their particular use case, which will be made available on demand. This is an effort to comply with the FAIR principle A2 (https://www.go-fair.org/fair-principles/) that metadata shall be accessible even when the base data is not immediately available.
Locations and sensors
The sensor data was captured at four different locations, including one laboratory (Maritime Exploration Hall at DFKI RIC Bremen) and three field locations (Chalk Lake Hemmoor, Tank Wash Basin Neu-Ulm, Lake Starnberg). At all locations, a ZED camera and a Blueprint Oculus M1200d sonar were used. Additionally, a SeaVision camera was used at the Maritime Exploration Hall at DFKI RIC Bremen and at the Chalk Lake Hemmoor. The examples/ directory holds a typical output image for each sensor at each available location.
Data volume per session
Six data collection sessions were conducted. The table below presents an overview of the amount of data captured in each session:
Session dates Location Number of datasets Total duration of datasets [h] Total logfile size [GB] Number of images Total image size [GB]
2021-08-09 - 2021-08-12 Maritime Exploration Hall at DFKI RIC Bremen 52 10.8 28.8 389’047 88.1
2022-02-07 - 2022-02-08 Maritime Exploration Hall at DFKI RIC Bremen 35 4.4 54.1 629’626 62.3
2022-04-26 - 2022-04-28 Chalk Lake Hemmoor 52 8.1 133.6 1’114’281 97.8
2022-06-28 - 2022-06-29 Tank Wash Basin Neu-Ulm 42 6.7 144.2 824’969 26.9
2023-04-26 - 2023-04-27 Maritime Exploration Hall at DFKI RIC Bremen 55 7.4 141.9 739’613 9.6
2023-09-01 - 2023-09-02 Lake Starnberg 19 2.9 40.1 217’385 2.3
Total 255 40.3 542.7 3’914’921 287.0
Data and metadata structure
Sensor data corpus
The sensor data corpus comprises two processing stages:
raw data streams stored in ROS bagfiles (aka logfiles),
camera and sonar images (aka datafiles) extracted from the logfiles.
The files are stored in a file tree hierarchy which groups them by session, dataset, and modality:
${session_key}/
    ${dataset_key}/
        ${logfile_name}
        ${modality_key}/
            ${datafile_name}
A typical logfile path has this form:
2023-09_starnberg_lake/2023-09-02-15-06_hydraulic_drill/stereo_camera-zed-2023-09-02-15-06-07.bag
A typical datafile path has this form:
2023-09_starnberg_lake/2023-09-02-15-06_hydraulic_drill/zed_right/1693660038_368077993.jpg
All directory and file names, and their component parts, are designed to serve as identifiers in the metadatabase. Their formatting, as well as the definitions of all terms, are documented in the file entities.json.
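For illustration, a datafile path can be split back into the identifiers used by the metadatabase. This is a minimal sketch; treating the file stem as the seconds and sub-second parts of a UNIX timestamp is an assumption based on the example path above, not something documented here.

from pathlib import PurePosixPath

path = PurePosixPath(
    "2023-09_starnberg_lake/2023-09-02-15-06_hydraulic_drill/zed_right/1693660038_368077993.jpg"
)
session_key, dataset_key, modality_key, datafile_name = path.parts

# Assumption: the datafile stem encodes seconds since the UNIX epoch plus a
# sub-second remainder, separated by an underscore.
seconds, fraction = path.stem.split("_")
print(session_key, dataset_key, modality_key, datafile_name)
print(seconds, fraction)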
Metadatabase
The metadatabase is provided in two equivalent forms:
as a standalone SQLite (https://www.sqlite.org/index.html) database file metadata.sqlite for users familiar with SQLite,
as a collection of CSV files in the csv/ directory for users who prefer other tools.
The database file has been generated from the CSV files, so each database table holds the same information as the corresponding CSV file. In addition, the metadatabase contains a series of convenience views that facilitate access to certain aggregate information.
An entity relationship diagram of the metadatabase tables is stored in the file entity_relationship_diagram.png. Each entity, its attributes, and relations are documented in detail in the file entities.json.
Some general design remarks:
For convenience, timestamps are always given in both a human-readable form (ISO 8601 formatted datetime strings with explicit local time zone), and as seconds since the UNIX epoch.
In practice, each logfile always contains a single stream, and each stream is always stored in a single logfile. Per the database schema, however, the entities stream and logfile are modeled separately, with a “many-streams-to-one-logfile” relationship. This design was chosen to be compatible with, and open for, data collections where a single logfile contains multiple streams.
A modality is an attribute of a datafile, not of a sensor alone: a sensor is an attribute of a stream, and a single stream may be the source of multiple modalities (e.g. RGB vs. grayscale images from the same camera, or Cartesian vs. polar projections of the same sonar output). Conversely, the same modality may originate from different sensors.
As a usage example, the data volume per session, which is tabulated at the top of this document, can be extracted from the metadatabase with the following SQL query (a Python sketch for running queries against metadata.sqlite follows the query):
SELECT
    PRINTF(
        '%s - %s',
        SUBSTR(session_start, 1, 10),
        SUBSTR(session_end, 1, 10)) AS 'Session dates',
    location_name_english AS Location,
    number_of_datasets AS 'Number of datasets',
    total_duration_of_datasets_h AS 'Total duration of datasets [h]',
    total_logfile_size_gb AS 'Total logfile size [GB]',
    number_of_images AS 'Number of images',
    total_image_size_gb AS 'Total image size [GB]'
FROM location
JOIN session USING (location_id)
JOIN (
    SELECT
        session_id,
        COUNT(dataset_id) AS number_of_datasets,
        ROUND(SUM(dataset_duration) / 3600, 1) AS total_duration_of_datasets_h,
        ROUND(SUM(total_logfile_size) / 10e9, 1) AS total_logfile_size_gb
    FROM location
    JOIN session USING (location_id)
    JOIN dataset USING (session_id)
    JOIN view_dataset_total_logfile_size USING (dataset_id)
    GROUP BY session_id
) USING (session_id)
JOIN (
    SELECT
        session_id,
        COUNT(datafile_id) AS number_of_images,
        ROUND(SUM(datafile_size) / 10e9, 1) AS total_image_size_gb
    FROM session
    JOIN dataset USING (session_id)
    JOIN stream USING (dataset_id)
    JOIN datafile USING (stream_id)
    GROUP BY session_id
) USING (session_id)
ORDER BY session_id;
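A minimal Python sketch using the standard sqlite3 module to open metadata.sqlite and run queries. The table and column names come from the SQL above; the per-location dataset count is only an illustrative aggregate, not one of the shipped convenience views.

import sqlite3

con = sqlite3.connect("metadata.sqlite")
con.row_factory = sqlite3.Row

# List the tables and convenience views contained in the metadatabase.
for obj in con.execute(
    "SELECT name, type FROM sqlite_master WHERE type IN ('table', 'view') ORDER BY name"
):
    print(obj["type"], obj["name"])

# Illustrative aggregate: number of datasets per location.
rows = con.execute(
    "SELECT location_name_english, COUNT(dataset_id) AS number_of_datasets "
    "FROM location JOIN session USING (location_id) JOIN dataset USING (session_id) "
    "GROUP BY location_id ORDER BY location_id"
).fetchall()
for row in rows:
    print(row["location_name_english"], row["number_of_datasets"])

con.close()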
Metadata extracted from the Rotten Tomatoes website using web scraping techniques. All the code used to do that can be seen in https://github.com/rafaelstjf/Tomato_Brusher
I wanted to use machine learning approaches to see if it was possible to predict the user score of a movie using only features like genre, rating, critic score, cast, and crew.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Title: IMDB & TMDB Movie Metadata Big Dataset (>1M)
Subtitle: A Comprehensive Dataset Featuring Detailed Metadata of Movies (IMDB, TMDB). Over 1M Rows & 42 Features: Metadata, Ratings, Genres, Cast, Crew, Sentiment Analysis and many more...
Detailed Description:
Overview: This comprehensive dataset merges the extensive film data available from both IMDB and TMDB, offering a rich resource for movie enthusiasts, data scientists, and researchers. With over 1 million rows and 42 detailed features, this dataset provides in-depth information about a wide variety of movies, spanning different genres, periods, and production backgrounds.
File Information:
1. File Size: ≈ 1 GB
2. Format: CSV (Comma-Separated Values)
Column Descriptors/Key Features (an exploratory sketch follows this list):
1. ID: Unique identifier for each movie.
2. Title: The official title of the movie.
3. Vote Average: Average rating received by the movie.
4. Vote Count: Number of votes the movie has received.
5. Status: Current status of the movie (e.g., Released, Post-Production).
6. Release Date: Official release date of the movie.
7. Revenue: Box office revenue generated by the movie.
8. Runtime: Duration of the movie in minutes.
9. Adult: Indicates if the movie is for adults.
10. Genres: List of genres the movie belongs to.
11. Overview Sentiment: Sentiment analysis of the movie's overview text.
12. Cast: List of main actors in the movie.
13. Crew: List of key crew members, including directors, producers, and writers.
14. Genres List: Detailed genres in list format.
15. Keywords: List of relevant keywords associated with the movie.
16. Director of Photography: Name of the cinematographer.
17. Producers: Names of the producers.
18. Music Composer: Name of the music composer.
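A minimal exploration sketch with pandas, assuming a hypothetical CSV file name and snake_case column labels corresponding to the descriptors above (e.g. title, vote_average, vote_count, release_date); adjust both to the actual header.

import pandas as pd

df = pd.read_csv("imdb_tmdb_movie_metadata.csv")  # hypothetical file name
print(df.shape)  # expected on the order of 1,000,000+ rows and 42 columns

# Highest-rated titles with a minimum number of votes (column names assumed).
top = (
    df[df["vote_count"] >= 1000]
      .sort_values("vote_average", ascending=False)
      .loc[:, ["title", "vote_average", "vote_count", "release_date"]]
      .head(10)
)
print(top)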
Additional Features:
Potential Use Cases:
- Sentiment Analysis: Analyze audience sentiment towards movies based on reviews and ratings.
- Recommendation Systems: Build models to recommend movies based on user preferences and viewing history.
- Market Analysis: Study trends in the movie industry, including genre popularity and revenue patterns.
- Content Analysis: Investigate the thematic content and diversity of movies over time.
- Data Visualization: Create visual representations of movie data to uncover hidden insights.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
GSC/BRC Metadata Standards Project Setup: Example CSV file for setting up project registration and update events for GSC/BRC metadata standards. Users can load the setup file using the CLI interface or set up a project using the metadata setup GUI. (CSV 14 kb)