100+ datasets found
  1. Meta Kaggle

    • kaggle.com
    zip
    Updated Feb 1, 2026
    Cite
    Kaggle (2026). Meta Kaggle [Dataset]. https://www.kaggle.com/datasets/kaggle/meta-kaggle
    Explore at:
    zip (10313419305 bytes)
    Available download formats
    Dataset updated
    Feb 1, 2026
    Dataset authored and provided by
    Kaggle (http://kaggle.com/)
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Meta Kaggle

    Explore our public data on competitions, datasets, kernels (code / notebooks) and more

    Meta Kaggle may not be the Rosetta Stone of data science, but we do think there's a lot to learn (and plenty of fun to be had) from this collection of rich data about Kaggle’s community and activity.

    Strategizing to become a Competitions Grandmaster? Wondering who, where, and what goes into a winning team? Choosing evaluation metrics for your next data science project? The kernels published using this data can help. We also hope they'll spark some lively Kaggler conversations and be a useful resource for the larger data science community.

    [Image: Kaggle Leaderboard Performance (https://imgur.com/2Egeb8R.png)]

    This dataset is made available as CSV files through Kaggle Kernels. It contains tables on public activity from Competitions, Datasets, Kernels, Discussions, and more. The tables are updated daily.

    Please note: This data is not a complete dump of our database. Rows, columns, and tables have been filtered out and transformed.
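
    As a quick-start illustration, a minimal sketch of loading two of these tables inside a Kaggle notebook (the file names follow the table categories listed above; check the dataset's file listing for the exact names):

        import pandas as pd

        # Datasets attached to a Kaggle notebook are mounted under /kaggle/input/<slug>/.
        BASE = "/kaggle/input/meta-kaggle"

        # Two of the public-activity tables described above (file names assumed).
        competitions = pd.read_csv(f"{BASE}/Competitions.csv")
        datasets = pd.read_csv(f"{BASE}/Datasets.csv")

        print(competitions.shape, datasets.shape)
        print(competitions.head())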

    August 2023 update

    In August 2023, we released Meta Kaggle for Code, a companion to Meta Kaggle containing public, Apache 2.0 licensed notebook data. View the dataset and instructions for how to join it with Meta Kaggle here: https://www.kaggle.com/datasets/kaggle/meta-kaggle-code

    We also updated the license on Meta Kaggle from CC-BY-NC-SA to Apache 2.0.

  2. the-stack-metadata

    • huggingface.co
    Updated Nov 20, 2022
    Cite
    BigCode (2022). the-stack-metadata [Dataset]. https://huggingface.co/datasets/bigcode/the-stack-metadata
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 20, 2022
    Dataset authored and provided by
    BigCode
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for The Stack Metadata

      Changelog
    

    | Release | Description |
    |:---|:---|
    | v1.1 | This is the first release of the metadata. It is for The Stack v1.1 |
    | v1.2 | Metadata dataset matching The Stack v1.2 |

      Dataset Summary
    

    This is a set of additional information for the repositories used for The Stack. It contains file paths, detected licenses, and some other information for the repositories.

      Supported Tasks and Leaderboards
    

    The main task is to recreate… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack-metadata.

  3. Dataset metadata of known Dataverse installations

    • search.datacite.org
    Updated 2019
    + more versions
    Cite
    Julian Gautier (2019). Dataset metadata of known Dataverse installations [Dataset]. http://doi.org/10.7910/dvn/dcdkzq
    Explore at:
    Dataset updated
    2019
    Dataset provided by
    DataCite
    Harvard Dataverse
    Authors
    Julian Gautier
    Description

    This dataset contains the metadata of the datasets published in 77 Dataverse installations, information about each installation's metadata blocks, and the list of standard licenses that dataset depositors can apply to the datasets they publish in the 36 installations running more recent versions of the Dataverse software. The data is useful for reporting on the quality of dataset and file-level metadata within and across Dataverse installations. Curators and other researchers can use this dataset to explore how well Dataverse software and the repositories using the software help depositors describe data.

    How the metadata was downloaded

    The dataset metadata and metadata block JSON files were downloaded from each installation on October 2 and October 3, 2022 using a Python script kept in a GitHub repo at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_dataset_metadata_of_all_installations.py. In order to get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL in which I was able to create an account and another named "apikey" listing my accounts' API tokens. The Python script expects and uses the API tokens in this CSV file to get metadata and other information from installations that require API tokens.

    How the files are organized

    ├── csv_files_with_metadata_from_most_known_dataverse_installations
    │   ├── author(citation).csv
    │   ├── basic.csv
    │   ├── contributor(citation).csv
    │   ├── ...
    │   └── topic_classification(citation).csv
    ├── dataverse_json_metadata_from_each_known_dataverse_installation
    │   ├── Abacus_2022.10.02_17.11.19.zip
    │   ├── dataset_pids_Abacus_2022.10.02_17.11.19.csv
    │   ├── Dataverse_JSON_metadata_2022.10.02_17.11.19
    │   ├── hdl_11272.1_AB2_0AQZNT_v1.0.json
    │   ├── ...
    │   ├── metadatablocks_v5.6
    │   ├── astrophysics_v5.6.json
    │   ├── biomedical_v5.6.json
    │   ├── citation_v5.6.json
    │   ├── ...
    │   ├── socialscience_v5.6.json
    │   ├── ACSS_Dataverse_2022.10.02_17.26.19.zip
    │   ├── ADA_Dataverse_2022.10.02_17.26.57.zip
    │   ├── Arca_Dados_2022.10.02_17.44.35.zip
    │   ├── ...
    │   └── World_Agroforestry_-_Research_Data_Repository_2022.10.02_22.59.36.zip
    ├── dataset_pids_from_most_known_dataverse_installations.csv
    ├── licenses_used_by_dataverse_installations.csv
    └── metadatablocks_from_most_known_dataverse_installations.csv

    This dataset contains two directories and three CSV files not in a directory.

    One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 18 CSV files that contain the values from common metadata fields of all 77 Dataverse installations. For example, author(citation)_2022.10.02-2022.10.03.csv contains the "Author" metadata for all published, non-deaccessioned, versions of all datasets in the 77 installations, where there's a row for each author name, affiliation, identifier type and identifier.

    The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 77 zipped files, one for each of the 77 Dataverse installations whose dataset metadata I was able to download using Dataverse APIs. Each zip file contains a CSV file and two sub-directories. The CSV file contains the persistent IDs and URLs of each published dataset in the Dataverse installation as well as a column to indicate whether or not the Python script was able to download the Dataverse JSON metadata for each dataset. For Dataverse installations using Dataverse software versions whose Search APIs include each dataset's owning Dataverse collection name and alias, the CSV files also include which Dataverse collection (within the installation) that dataset was published in. One sub-directory contains a JSON file for each of the installation's published, non-deaccessioned dataset versions. The JSON files contain the metadata in the "Dataverse JSON" metadata schema. The other sub-directory contains information about the metadata models (the "metadata blocks" in JSON files) that the installation was using when the dataset metadata was downloaded. I saved them so that they can be used when extracting metadata from the Dataverse JSON files.

    The dataset_pids_from_most_known_dataverse_installations.csv file contains the dataset PIDs of all published datasets in the 77 Dataverse installations, with a column to indicate if the Python script was able to download the dataset's metadata. It's a union of all of the "dataset_pids_..." files in each of the 77 zip files.

    The licenses_used_by_dataverse_installations.csv file contains information about the licenses that a number of the installations let depositors choose when creating datasets. When I collected this data, 36 installations were running versions of the Dataverse software that allow depositors to choose a license or data use agreement from a dropdown menu in the dataset deposit form. For more information, see https://guides.dataverse.org/en/5.11.1/user/dataset-management.html#choosing-a-license.

    The metadatablocks_from_most_known_dataverse_installations.csv file contains the metadata block names, field names and child field names (if the field is a compound field) of the 77 Dataverse installations' metadata blocks. The metadatablocks_from_most_known_dataverse_installations.csv file is useful for comparing each installation's dataset metadata model (the metadata fields and the metadata blocks that each installation uses). The CSV file was created using a Python script at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_csv_file_with_metadata_block_fields_of_all_installations.py, which takes as inputs the directories and files created by the get_dataset_metadata_of_all_installations.py script.

    Known errors

    The metadata of two datasets from one of the known installations could not be downloaded because the datasets' pages and metadata could not be accessed with the Dataverse APIs.

    About metadata blocks

    Read about the Dataverse software's metadata blocks system at http://guides.dataverse.org/en/latest/admin/metadatacustomization.html
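
    For reference, a minimal sketch of the token-file convention described above (this is an illustration, not the author's script; it assumes the standard Dataverse Search API and the X-Dataverse-key header, and uses a placeholder CSV file name):

        import csv
        import requests

        # CSV with the two columns described above: "hostname" (installation URL)
        # and "apikey" (the account's API token). File name is a placeholder.
        with open("installation_api_tokens.csv", newline="") as f:
            for row in csv.DictReader(f):
                hostname, api_key = row["hostname"], row["apikey"]
                # Ask the installation's Search API for a few datasets, passing
                # the account token in the X-Dataverse-key header.
                resp = requests.get(
                    f"{hostname}/api/search",
                    params={"q": "*", "type": "dataset", "per_page": 10},
                    headers={"X-Dataverse-key": api_key},
                    timeout=30,
                )
                resp.raise_for_status()
                items = resp.json()["data"]["items"]
                print(hostname, len(items), "datasets returned")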

  4. metadata

    • catalog.data.gov
    Updated Nov 12, 2020
    + more versions
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). metadata [Dataset]. https://catalog.data.gov/dataset/metadata-f2500
    Explore at:
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    The dataset consists of public domain acute and chronic toxicity and chemistry data for algal species. Data are accessible at: https://envirotoxdatabase.org/ Data include algal species, chemical identification, and the concentrations that do and do not affect algal growth.

  5. Movies & TV Shows Metadata Dataset (190K+ Records, Horror-Heavy Collection)

    • crawlfeeds.com
    csv, zip
    Updated Aug 23, 2025
    Cite
    Crawl Feeds (2025). Movies & TV Shows Metadata Dataset (190K+ Records, Horror-Heavy Collection) [Dataset]. https://crawlfeeds.com/datasets/movies-tv-shows-metadata-dataset-190k-records-horror-heavy-collection
    Explore at:
    zip, csv
    Available download formats
    Dataset updated
    Aug 23, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    This comprehensive dataset features detailed metadata for over 190,000 movies and TV shows, with a strong concentration in the Horror genre. It is ideal for entertainment research, machine learning models, genre-specific trend analysis, and content recommendation systems.

    Each record contains rich information, making it perfect for streaming platforms, film industry analysts, or academic media researchers.

    Primary Genre Focus: Horror

    Use Cases:

    • Build movie recommendation systems or genre classifiers

    • Train NLP models on movie descriptions (see the sketch after this list)

    • Analyze Horror content trends over time

    • Explore box office vs. rating correlations

    • Enrich entertainment datasets with directorial and cast metadata
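
    For the genre-classification and NLP use cases above, a minimal scikit-learn sketch (the file name and the description/genre column names are assumptions; adjust them to the actual CSV header after download):

        import pandas as pd
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split
        from sklearn.pipeline import make_pipeline

        # Hypothetical file and column names, for illustration only.
        df = pd.read_csv("movies_tv_shows_metadata.csv").dropna(subset=["description", "genre"])

        X_train, X_test, y_train, y_test = train_test_split(
            df["description"], df["genre"], test_size=0.2, random_state=42
        )

        # TF-IDF features over the plot descriptions, then a linear classifier.
        clf = make_pipeline(
            TfidfVectorizer(max_features=50_000, stop_words="english"),
            LogisticRegression(max_iter=1000),
        )
        clf.fit(X_train, y_train)
        print("held-out accuracy:", clf.score(X_test, y_test))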

  6. Metadata of a Large Sonar and Stereo Camera Dataset Suitable for...

    • data.niaid.nih.gov
    Updated Jul 8, 2024
    Cite
    Backe, Christian; Wehbe, Bilal; Bande, Miguel; Shah, Nimish; Cesar, Diego; Pribbernow, Max (2024). Metadata of a Large Sonar and Stereo Camera Dataset Suitable for Sonar-to-RGB Image Translation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10373153
    Explore at:
    Dataset updated
    Jul 8, 2024
    Dataset provided by
    German Research Center for Artificial Intelligence (DFKI)
    Kraken Robotik GmbH
    Authors
    Backe, Christian; Wehbe, Bilal; Bande, Miguel; Shah, Nimish; Cesar, Diego; Pribbernow, Max
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Metadata of a Large Sonar and Stereo Camera Dataset Suitable for Sonar-to-RGB Image Translation

    Introduction

    This is a set of metadata describing a large dataset of synchronized sonar and stereo camera recordings that were captured between August 2021 and September 2023 during the project DeeperSense (https://robotik.dfki-bremen.de/en/research/projects/deepersense/), as training data for Sonar-to-RGB image translation. Parts of the sensor data have been published (https://zenodo.org/records/7728089, https://zenodo.org/records/10220989). Due to the size of the sensor data corpus, it is currently impractical to make the entire corpus accessible online. Instead, this metadatabase serves as a relatively compact representation, allowing interested researchers to inspect the data and select relevant portions for their particular use case, which will be made available on demand. This is an effort to comply with the FAIR principle A2 (https://www.go-fair.org/fair-principles/) that metadata shall be accessible, even when the base data is not immediately available.

    Locations and sensors

    The sensor data was captured at four different locations, including one laboratory (Maritime Exploration Hall at DFKI RIC Bremen) and three field locations (Chalk Lake Hemmoor, Tank Wash Basin Neu-Ulm, Lake Starnberg). At all locations, a ZED camera and a Blueprint Oculus M1200d sonar were used. Additionally, a SeaVision camera was used at the Maritime Exploration Hall at DFKI RIC Bremen and at the Chalk Lake Hemmoor. The examples/ directory holds a typical output image for each sensor at each available location.

    Data volume per session

    Six data collection sessions were conducted. The table below presents an overview of the amount of data captured in each session:

    | Session dates | Location | Number of datasets | Total duration of datasets [h] | Total logfile size [GB] | Number of images | Total image size [GB] |
    |:---|:---|:---|:---|:---|:---|:---|
    | 2021-08-09 - 2021-08-12 | Maritime Exploration Hall at DFKI RIC Bremen | 52 | 10.8 | 28.8 | 389’047 | 88.1 |
    | 2022-02-07 - 2022-02-08 | Maritime Exploration Hall at DFKI RIC Bremen | 35 | 4.4 | 54.1 | 629’626 | 62.3 |
    | 2022-04-26 - 2022-04-28 | Chalk Lake Hemmoor | 52 | 8.1 | 133.6 | 1’114’281 | 97.8 |
    | 2022-06-28 - 2022-06-29 | Tank Wash Basin Neu-Ulm | 42 | 6.7 | 144.2 | 824’969 | 26.9 |
    | 2023-04-26 - 2023-04-27 | Maritime Exploration Hall at DFKI RIC Bremen | 55 | 7.4 | 141.9 | 739’613 | 9.6 |
    | 2023-09-01 - 2023-09-02 | Lake Starnberg | 19 | 2.9 | 40.1 | 217’385 | 2.3 |
    | Total | | 255 | 40.3 | 542.7 | 3’914’921 | 287.0 |

    Data and metadata structure

    Sensor data corpus

    The sensor data corpus comprises two processing stages:

    raw data streams stored in ROS bagfiles (aka logfiles),

    camera and sonar images (aka datafiles) extracted from the logfiles.

    The files are stored in a file tree hierarchy which groups them by session, dataset, and modality:

    ${session_key}/
        ${dataset_key}/
            ${logfile_name}
            ${modality_key}/
                ${datafile_name}

    A typical logfile path has this form:

    2023-09_starnberg_lake/
        2023-09-02-15-06_hydraulic_drill/
            stereo_camera-zed-2023-09-02-15-06-07.bag

    A typical datafile path has this form:

    2023-09_starnberg_lake/
        2023-09-02-15-06_hydraulic_drill/
            zed_right/
                1693660038_368077993.jpg

    All directory and file names, and their particles, are designed to serve as identifiers in the metadatabase. Their formatting, as well as the definitions of all terms, are documented in the file entities.json.

    Metadatabase

    The metadatabase is provided in two equivalent forms:

    as a standalone SQLite (https://www.sqlite.org/index.html) database file metadata.sqlite for users familiar with SQLite,

    as a collection of CSV files in the csv/ directory for users who prefer other tools.

    The database file has been generated from the CSV files, so each database table holds the same information as the corresponding CSV file. In addition, the metadatabase contains a series of convenience views that facilitate access to certain aggregate information.

    An entity relationship diagram of the metadatabase tables is stored in the file entity_relationship_diagram.png. Each entity, its attributes, and relations are documented in detail in the file entities.json.

    Some general design remarks:

    For convenience, timestamps are always given in both a human-readable form (ISO 8601 formatted datetime strings with explicit local time zone), and as seconds since the UNIX epoch.

    In practice, each logfile always contains a single stream, and each stream is always stored in a single logfile. Per the database schema, however, the entities stream and logfile are modeled separately, with a “many-streams-to-one-logfile” relationship. This design was chosen to be compatible with, and open for, data collections where a single logfile contains multiple streams.

    A modality is not an attribute of a sensor alone, but of a datafile: a sensor is an attribute of a stream, and a single stream may be the source of multiple modalities (e.g. RGB vs. grayscale images from the same camera, or cartesian vs. polar projection of the same sonar output). Conversely, the same modality may originate from different sensors.

    As a usage example, the data volume per session which is tabulated at the top of this document, can be extracted from the metadatabase with the following SQL query:

    SELECT
        PRINTF('%s - %s', SUBSTR(session_start, 1, 10), SUBSTR(session_end, 1, 10)) AS 'Session dates',
        location_name_english AS Location,
        number_of_datasets AS 'Number of datasets',
        total_duration_of_datasets_h AS 'Total duration of datasets [h]',
        total_logfile_size_gb AS 'Total logfile size [GB]',
        number_of_images AS 'Number of images',
        total_image_size_gb AS 'Total image size [GB]'
    FROM location
    JOIN session USING (location_id)
    JOIN (
        SELECT
            session_id,
            COUNT(dataset_id) AS number_of_datasets,
            ROUND(SUM(dataset_duration) / 3600, 1) AS total_duration_of_datasets_h,
            ROUND(SUM(total_logfile_size) / 10e9, 1) AS total_logfile_size_gb
        FROM location
        JOIN session USING (location_id)
        JOIN dataset USING (session_id)
        JOIN view_dataset_total_logfile_size USING (dataset_id)
        GROUP BY session_id
    ) USING (session_id)
    JOIN (
        SELECT
            session_id,
            COUNT(datafile_id) AS number_of_images,
            ROUND(SUM(datafile_size) / 10e9, 1) AS total_image_size_gb
        FROM session
        JOIN dataset USING (session_id)
        JOIN stream USING (dataset_id)
        JOIN datafile USING (stream_id)
        GROUP BY session_id
    ) USING (session_id)
    ORDER BY session_id;
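
    The same metadatabase can also be queried from Python with the standard-library sqlite3 module; a minimal sketch (table and column names as documented in entities.json above):

        import sqlite3

        conn = sqlite3.connect("metadata.sqlite")
        conn.row_factory = sqlite3.Row

        # Number of datasets per session, a small cut of the summary query above.
        rows = conn.execute(
            "SELECT session_id, COUNT(dataset_id) AS number_of_datasets "
            "FROM session JOIN dataset USING (session_id) "
            "GROUP BY session_id ORDER BY session_id"
        ).fetchall()

        for row in rows:
            print(row["session_id"], row["number_of_datasets"])
        conn.close()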

  7. Realistic Email Categorization Dataset (Synthetic)

    • kaggle.com
    zip
    Updated Jan 4, 2025
    Cite
    Fenil Sonani (2025). Realistic Email Categorization Dataset (Synthetic) [Dataset]. https://www.kaggle.com/datasets/fenilsonani/email-data-for-email-classification
    Explore at:
    zip (2746947 bytes)
    Available download formats
    Dataset updated
    Jan 4, 2025
    Authors
    Fenil Sonani
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset, titled "Realistic Email Categorization Dataset for BERT (Synthetic)," contains 20,000 entries of diverse and realistic email addresses generated using a Python script. The dataset is meticulously crafted to mimic real-world email categorization scenarios, making it an excellent resource for training and evaluating machine learning models, particularly transformer-based models like BERT.

    Features:

    • Email Address: The complete email address (e.g., john.doe@example.com).
    • Category: Broad classification of the email (e.g., Sales, Support, Marketing).
    • Subcategory: Granular classification within the main category (e.g., Technical Support, Domestic Sales).
    • Local Part: The part of the email address before the @ symbol.
    • Domain: The part of the email address after the @ symbol.
    • Length: Total character count of the email address.
    • Character Bi-grams & Tri-grams: Sequences of two and three consecutive characters extracted from the local part.
    • Email Content: Randomly generated textual snippets associated with the email.
    • Timestamp: Simulated creation or usage timestamp within the past two years.
    • Disposable & Spam Indicators: Boolean flags indicating whether the email is disposable or marked as spam.
    • Country & Language: Geographical and linguistic metadata derived from the domain.

    Key Highlights:

    • The data is entirely synthetic and generated using the Faker Python library and additional randomization techniques (see the sketch after this list).
    • Designed to simulate realistic email structures, including name-based, role-based, and department-specific addresses.
    • Features a diverse range of domains, subdomains, TLDs, and email patterns, ensuring applicability across a wide range of machine learning and natural language processing tasks.
    • Includes advanced annotations like email content, spam indicators, and geographical metadata, providing rich contextual information for model training.
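
    A minimal sketch of how such records can be produced with Faker (an illustration of the approach described above, not the dataset's actual generation script; the category lists below are partly placeholders):

        import random
        from datetime import datetime, timedelta
        from faker import Faker

        fake = Faker()

        # "Domestic Sales" and "Technical Support" are named in the description;
        # the remaining subcategory is a placeholder.
        CATEGORIES = {
            "Sales": ["Domestic Sales"],
            "Support": ["Technical Support"],
            "Marketing": ["General Marketing"],
        }

        def make_record():
            category = random.choice(list(CATEGORIES))
            email = fake.email()
            local, domain = email.split("@", 1)
            return {
                "email_address": email,
                "category": category,
                "subcategory": random.choice(CATEGORIES[category]),
                "local_part": local,
                "domain": domain,
                "length": len(email),
                "bigrams": [local[i:i + 2] for i in range(len(local) - 1)],
                "trigrams": [local[i:i + 3] for i in range(len(local) - 2)],
                "email_content": fake.sentence(),
                "timestamp": (datetime.now() - timedelta(days=random.randint(0, 730))).isoformat(),
                "is_disposable": random.random() < 0.05,
                "is_spam": random.random() < 0.10,
            }

        print(make_record())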

    Applications:

    • Email Classification: Train models to categorize emails into predefined categories.
    • Spam Detection: Use the spam indicators to train or evaluate anti-spam algorithms.
    • Feature Engineering: Explore the impact of local part, domain, and character n-grams on machine learning models.
    • BERT Fine-Tuning: Leverage the email content and category labels to fine-tune transformer models for NLP tasks.
  8. TMDB Top 8550 Movies Metadata 2025

    • kaggle.com
    zip
    Updated May 24, 2025
    Cite
    Sufi Inam Ul Hassan (2025). TMDB Top 8550 Movies Metadata 2025 [Dataset]. https://www.kaggle.com/datasets/sufiinamulhassan/tmdb-top-8550-movies-metadata
    Explore at:
    zip (1239492 bytes)
    Available download formats
    Dataset updated
    May 24, 2025
    Authors
    Sufi Inam Ul Hassan
    Description

    📄 Description

    This dataset contains metadata for the top 8,550 movies listed on The Movie Database (TMDB). Each entry includes valuable information such as:

    • 🎬 Title
    • 📅 Release Date
    • 🎭 Genres
    • 🌐 Original Language
    • Average Rating
    • 📈 Popularity Score
    • 🗳️ Vote Count
    • 🧾 Overview / Synopsis

    ✅ The dataset is ideal for:

    1. Exploratory Data Analysis (EDA)
    2. Building Recommendation Systems
    3. Popularity Trend Analysis
    4. Sentiment or Genre-based Analysis
    5. Predictive Modeling & Machine Learning

    It serves as a great resource for data scientists, analysts, machine learning practitioners, and film enthusiasts interested in movie metadata.

    📚 Use Cases

    Here are a few ideas for how to use this dataset:

    • 📌 Build a Movie Recommender System
    • 📌 Compare Trends Over Time (Genres, Ratings, etc.)
    • 📌 Visualize Rating Distributions by Year
    • 📌 Cluster Movies Based on Metadata
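
    As a quick start for the EDA and trend ideas above, a minimal pandas sketch (the file name and column names are assumptions modelled on the fields listed in the description; adjust them to the actual CSV header):

        import pandas as pd

        # Hypothetical file name; columns mirror the fields listed above.
        df = pd.read_csv("tmdb_top_8550_movies.csv", parse_dates=["release_date"])

        # Number of titles and average rating per release year.
        by_year = (
            df.assign(year=df["release_date"].dt.year)
              .groupby("year")
              .agg(titles=("title", "count"), mean_rating=("vote_average", "mean"))
        )
        print(by_year.tail(10))

        # Most common genres (assuming a comma-separated genre string).
        print(df["genres"].str.split(", ").explode().value_counts().head(10))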

    🔍 Source

    All data is sourced from the TMDB API and reflects the top-rated or most popular movies available at the time of collection.

    📢 Disclaimer

    This dataset is intended for educational and research purposes only. All movie data and assets belong to their respective copyright holders and TMDB.

  9. Common Metadata Elements for Cataloging Biomedical Datasets

    • figshare.com
    xlsx
    Updated Jan 20, 2016
    Cite
    Kevin Read (2016). Common Metadata Elements for Cataloging Biomedical Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.1496573.v1
    Explore at:
    xlsx
    Available download formats
    Dataset updated
    Jan 20, 2016
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Kevin Read
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset outlines a proposed set of core, minimal metadata elements that can be used to describe biomedical datasets, such as those resulting from research funded by the National Institutes of Health. It can inform efforts to better catalog or index such data to improve discoverability. The proposed metadata elements are based on an analysis of the metadata schemas used in a set of NIH-supported data sharing repositories. Common elements from these data repositories were identified, mapped to existing data-specific metadata standards from two existing multidisciplinary data repositories, DataCite and Dryad, and compared with metadata used in MEDLINE records to establish a sustainable and integrated metadata schema. From the mappings, we developed a preliminary set of minimal metadata elements that can be used to describe NIH-funded datasets. Please see the readme file for more details about the individual sheets within the spreadsheet.

  10. Government Data Open Platform Get Dataset Metadata API

    • data.nat.gov.tw
    Updated Jan 29, 2026
    Cite
    (2026). Government Data Open Platform Get Dataset Metadata API [Dataset]. https://data.nat.gov.tw/en/datasets/156634
    Explore at:
    Dataset updated
    Jan 29, 2026
    License

    https://data.nat.gov.tw/license

    Description

    Obtain Dataset Metadata API.

  11. Amazon Books Dataset (20K Books + 727K Reviews)

    • kaggle.com
    zip
    Updated Oct 21, 2025
    Cite
    Hadi Fariborzi (2025). Amazon Books Dataset (20K Books + 727K Reviews) [Dataset]. https://www.kaggle.com/datasets/hadifariborzi/amazon-books-dataset-20k-books-727k-reviews
    Explore at:
    zip (233373889 bytes)
    Available download formats
    Dataset updated
    Oct 21, 2025
    Authors
    Hadi Fariborzi
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    A comprehensive Amazon books dataset featuring 20,000 books and 727,876 reviews spanning 26 years (1997-2023), paired with a complete step-by-step data science tutorial. Perfect for learning data analytics from scratch or conducting advanced book market analysis.

    What's Included:

    • Raw Data: 20K book metadata (titles, authors, prices, ratings, descriptions) + 727K detailed reviews
    • Complete Tutorial Series: 4 progressive Python scripts covering data loading, cleaning, exploratory analysis, and visualization (see the loading sketch below)
    • Ready-to-Run Code: Fully documented scripts with practice exercises
    • Educational Focus: Designed for ENTR 3901 coursework but suitable for all skill levels

    Key Features:

    • Real-world e-commerce data (pre-filtered for quality: 200+ reviews, $5+ price)
    • Comprehensive documentation and setup instructions
    • Generates 6+ professional visualizations
    • Includes bonus analysis challenges (sentiment analysis, price optimization, time patterns)
    • Perfect for business analytics, market research, and data science education

    Use Cases:

    • Learning data analytics fundamentals
    • Book market analysis and trends
    • Customer behavior insights
    • Price optimization studies
    • Review sentiment analysis
    • Academic coursework and projects

    This dataset bridges the gap between raw data and practical learning, making it ideal for both beginners and experienced analysts looking to explore e-commerce patterns in the publishing industry.
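
    A minimal loading-and-filtering sketch in the spirit of the tutorial scripts (file names, column names, and the join key are assumptions; use the names that ship with the download):

        import pandas as pd

        # Hypothetical file names for the two raw tables described above.
        books = pd.read_csv("books.csv")
        reviews = pd.read_csv("reviews.csv")

        # Mirror the stated quality filters (200+ reviews, $5+ price); the
        # review_count and price column names are assumptions.
        books = books.dropna(subset=["title", "price"])
        books = books[(books["review_count"] >= 200) & (books["price"] >= 5)]

        # Join reviews to book metadata for exploratory analysis (assumed key).
        merged = reviews.merge(books, on="asin", how="inner", suffixes=("_review", "_book"))
        print(merged.shape)
        print(merged.head())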

  12. Metadata Mahoenui giant wētā - Dataset - data.govt.nz - discover and use...

    • catalogue.data.govt.nz
    Updated Apr 23, 2025
    Cite
    (2025). Metadata Mahoenui giant wētā - Dataset - data.govt.nz - discover and use data [Dataset]. https://catalogue.data.govt.nz/dataset/metadata-mahoenui-giant-weta
    Explore at:
    Dataset updated
    Apr 23, 2025
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Area covered
    New Zealand
    Description

    Metadata to support the publication: 'Population genomic analysis of Mahoenui giant wētā (Deinacrida mahoenui) reveals minimal reduction in genomic diversity following translocation'. Insect Conservation and Diversity. https://doi.org/10.1111/icad.12810. Mahoenui giant wētā were sampled at the source population in the Mahoenui giant wētā Scientific Reserve, and a translocated population on Mahurangi Island, and sequenced via genotyping-by-sequencing to assess population diversity and differentiation. Here we provide the associated metadata.

  13. A Dataset of Metadata of Articles Citing Retracted Articles

    • zenodo.org
    csv
    Updated Aug 31, 2024
    Cite
    Yagmur Ozturk (2024). A Dataset of Metadata of Articles Citing Retracted Articles [Dataset]. http://doi.org/10.5281/zenodo.13621503
    Explore at:
    csv
    Available download formats
    Dataset updated
    Aug 31, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Yagmur Ozturk
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset comprises metadata of articles citing retracted publications. We originally obtained the DOIs from the Feet of Clay Detector (FoCD) of the Problematic Paper Screener (PPS), which flags publications that cite retracted articles. Additional columns that were not provided in PPS were added using the Crossref & Retraction Watch Database (CRxRW) and Dimensions API services.

    By querying the Dimensions API with the DOIs of the FoC articles, we acquired information such as more detailed document types (editorial, review article, research article), open access status (we only kept open access FoC articles in the dataset since we want to access the full texts in the future), and research fields (classified according to the Australian and New Zealand Standard Research Classification (ANZSRC) Fields of Research (FoR), comprising 23 main fields such as biological sciences and education).

    To get further information about the cited retracted articles in the dataset, we used the joint release of CRxRW. Using this dataset, we added the retraction reasons and retraction years.

    The original dataset was obtained from the PPS FoCD in December 2023. At this time there were 22558 total articles flagged in FoCD. Using the data filtering feature in PPS, we had a preliminary selection before downloading the first version of the dataset. We applied a filter to obtain:

    • non-retracted citing articles at the time of data curation*
    • open-access citing articles since we need the whole text to go forward with natural language processing tasks
    • cited retracted articles with at least one scientific content related reason of retraction
    • only articles (not monographs, chapters) to retain a unified text type

    More information about the usage of this dataset will be updated.

    *Current retraction status of the citing articles can be different since this is a static dataset and scientific literature is dynamic.

  14. The Red Queen in the Repository: metadata quality in an ever-changing...

    • zenodo.org
    bin, csv, zip
    Updated Jul 25, 2024
    Cite
    Joakim Philipson (2024). The Red Queen in the Repository: metadata quality in an ever-changing environment (preprint of paper, presentation slides and dataset collection with validation schemas to IDCC2019 conference paper) [Dataset]. http://doi.org/10.5281/zenodo.2276777
    Explore at:
    zip, bin, csv
    Available download formats
    Dataset updated
    Jul 25, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Joakim Philipson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This fileset contains a preprint version of the conference paper (.pdf), presentation slides (as .pptx) and the dataset(s) and validation schema(s) for the IDCC 2019 (Melbourne) conference paper: The Red Queen in the Repository: metadata quality in an ever-changing environment. Datasets and schemas are in .xml, .xsd , Excel (.xlsx) and .csv (two files representing two different sheets in the .xslx -file). The validationSchemas.zip holds the additional validation schemas (.xsd), that were not found in the schemaLocations of the metadata xml-files to be validated. The schemas must all be placed in the same folder, and are to be used for validating the Dataverse dcterms records (with metadataDCT.xsd) and the Zenodo oai_datacite feeds respectively (schema.datacite.org_oai_oai-1.0_oai.xsd). In the latter case, a simpler way of doing it might be to replace the incorrect URL "http://schema.datacite.org/oai/oai-1.0/ oai_datacite.xsd" in the schemaLocation of these xml-files by the CORRECT: schemaLocation="http://schema.datacite.org/oai/oai-1.0/ http://schema.datacite.org/oai/oai-1.0/oai.xsd" as has been done already in the sample files here. The sample file folders testDVNcoll.zip (Dataverse), testFigColl.zip (Figshare) and testZenColl.zip (Zenodo) contain all the metadata files tested and validated that are registered in the spreadsheet with objectIDs.
    In the case of Zenodo, one original file feed,
    zen2018oai_datacite3orig-https%20_zenodo.org_oai2d%20verb=ListRecords%26metadata
    Prefix=oai_datacite%26from=2018-11-29%26until=2018-11-30.xml
    ,
    is also supplied to show what was necessary to change in order to perform validation as indicated in the paper.

    For Dataverse, a corrected version of a file,
    dvn2014ddi-27595Corr_https%20_dataverse.harvard.edu_api_datasets_export%20
    exporter=ddi%26persistentId=doi%253A10.7910_DVN_27595Corr.xml
    ,
    is also supplied in order to show the changes it would take to make the file validate without error.
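
    A minimal validation sketch with lxml, following the corrected schemaLocation approach described above (the record file name is a placeholder; the schema file name is the one supplied in validationSchemas.zip):

        from lxml import etree

        # Load the DataCite OAI schema supplied with this fileset.
        schema = etree.XMLSchema(etree.parse("schema.datacite.org_oai_oai-1.0_oai.xsd"))

        # Parse one harvested metadata record (placeholder file name).
        doc = etree.parse("zenodo_oai_datacite_record.xml")

        if schema.validate(doc):
            print("valid")
        else:
            for error in schema.error_log:
                print(error.line, error.message)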

  15. SAS code used to analyze data and a datafile with metadata glossary |...

    • gimi9.com
    Updated Dec 28, 2016
    + more versions
    Cite
    (2016). SAS code used to analyze data and a datafile with metadata glossary | gimi9.com [Dataset]. https://gimi9.com/dataset/data-gov_sas-code-used-to-analyze-data-and-a-datafile-with-metadata-glossary
    Explore at:
    Dataset updated
    Dec 28, 2016
    Description

    We compiled macroinvertebrate assemblage data collected from 1995 to 2014 from the St. Louis River Area of Concern (AOC) of western Lake Superior. Our objective was to define depth-adjusted cutoff values for benthos condition classes (poor, fair, reference) to provide a tool useful for assessing progress toward achieving removal targets for the degraded benthos beneficial use impairment in the AOC.

    The relationship between depth and benthos metrics was wedge-shaped. We therefore used quantile regression to model the limiting effect of depth on selected benthos metrics, including taxa richness, percent non-oligochaete individuals, combined percent Ephemeroptera, Trichoptera, and Odonata individuals, and density of ephemerid mayfly nymphs (Hexagenia). We created a scaled trimetric index from the first three metrics. Metric values at or above the 90th percentile quantile regression model prediction were defined as reference condition for that depth. We set the cutoff between poor and fair condition as the 50th percentile model prediction. We examined sampler type, exposure, geographic zone of the AOC, and substrate type for confounding effects. Based on these analyses we combined data across sampler type and exposure classes and created separate models for each geographic zone. We used the resulting condition class cutoff values to assess the relative benthic condition for three habitat restoration project areas. The depth-limited pattern of ephemerid abundance we observed in the St. Louis River AOC also occurred elsewhere in the Great Lakes. We provide tabulated model predictions for application of our depth-adjusted condition class cutoff values to new sample data.

    This dataset is associated with the following publication: Angradi, T., W. Bartsch, A. Trebitz, V. Brady, and J. Launspach. A depth-adjusted ambient distribution approach for setting numeric removal targets for a Great Lakes Area of Concern beneficial use impairment: Degraded benthos. JOURNAL OF GREAT LAKES RESEARCH. International Association for Great Lakes Research, Ann Arbor, MI, USA, 43(1): 108-120, (2017).
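
    The attached analysis code is SAS, but the quantile-regression step translates directly to other tools; a rough Python sketch under assumed file and column names:

        import pandas as pd
        import statsmodels.formula.api as smf

        # Hypothetical file and column names (depth plus one benthos metric).
        df = pd.read_csv("benthos_samples.csv")

        model = smf.quantreg("taxa_richness ~ depth", data=df)
        reference = model.fit(q=0.90)  # reference-condition cutoff (90th percentile)
        poor_fair = model.fit(q=0.50)  # poor/fair cutoff (50th percentile)

        # Depth-adjusted cutoff values over a grid of depths.
        grid = pd.DataFrame({"depth": range(1, 31)})
        grid["reference_cutoff"] = reference.predict(grid)
        grid["poor_fair_cutoff"] = poor_fair.predict(grid)
        print(grid.head())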

  16. Environmental Sensor Metadata Survey.csv.zip

    • figshare.com
    zip
    Updated Jan 30, 2018
    Cite
    Connor Scully-Allison (2018). Environmental Sensor Metadata Survey.csv.zip [Dataset]. http://doi.org/10.6084/m9.figshare.5833818.v1
    Explore at:
    zip
    Available download formats
    Dataset updated
    Jan 30, 2018
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Connor Scully-Allison
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The following dataset is a collection of twelve anonymously gathered responses from scientists and technicians working with environmental science sensor networks collecting in-situ time series data. Specifically, the survey which produced this dataset was distributed to two working groups associated with the organization Earth Science Information Partners: the Envirosensing Cluster and the Documentation Cluster. This survey was crafted to provide a picture of what metadata management and creation looks like for professionals working with environmental sensor data.

  17. YouTube Videos and Channels Metadata

    • kaggle.com
    zip
    Updated Dec 14, 2022
    Cite
    The Devastator (2022). YouTube Videos and Channels Metadata [Dataset]. https://www.kaggle.com/datasets/thedevastator/revealing-insights-from-youtube-video-and-channe/code
    Explore at:
    zip (85613002 bytes)
    Available download formats
    Dataset updated
    Dec 14, 2022
    Authors
    The Devastator
    Area covered
    YouTube
    Description

    YouTube Videos and Channels Metadata

    Analyze the statistical relation between videos and form a topic tree

    By VISHWANATH SESHAGIRI [source]

    About this dataset

    This dataset contains YouTube video and channel metadata to analyze the statistical relation between videos and form a topic tree. With 9 direct features and 13 more indirect features, it has all that you need to build a deep understanding of how videos are related, including information like total views per unit time, channel views, likes/subscribers ratio, comments/views ratio, dislikes/subscribers ratio, etc. This data provides us with a unique opportunity to gain insights on topics such as subscriber count trends over time or the impact of trends on subscriber engagement. We can develop powerful models that show us how different types of content drive viewership and identify the most popular styles or topics within YouTube's vast catalogue. Additionally, this data offers an intriguing look into consumer behaviour, as we can explore what drives people to watch specific videos at certain times or appreciate certain channels more than others, for example by analyzing likes-per-subscriber and dislikes-per-view ratios. Finally, this dataset is completely open source with an easy-to-understand GitHub repo, making it an invaluable resource for anyone looking to gain better insights into how their audience interacts with their content and how they might improve it in the future.


    How to use the dataset

    How to Use This Dataset

    In general, it is important to understand each parameter in the data set before proceeding with analysis. The parameters included are totalviews/channelelapsedtime, channelViewCount, likes/subscriber, views/subscribers, subscriberCount, dislikes/views, comments/subscriber, channelCommentCount, likes/dislikes, comments/views, dislikes/subscribers, totviews/totsubs, and views/elapsedtime.

    To use this dataset for your own analysis:

    1) Review each parameter’s meaning and purpose in our dataset;
    2) Get familiar with basic descriptive statistics such as mean, median, mode, and range;
    3) Create visualizations or tables based on subsets of our data;
    4) Understand correlations between different sets of variables or parameters;
    5) Generate meaningful conclusions about specific channels or topics based on organized graph hierarchies or tables;
    6) Analyze trends over time for individual parameters as well as the aggregate reaction from all users when videos are released.

    Research Ideas

    • Predicting the Relative Popularity of Videos: This dataset can be used to build a statistical model that can predict the relative popularity of videos based on various factors such as total views, channel viewers, likes/dislikes ratio, and comments/views ratio. This model could then be used to make recommendations and predict which videos are likely to become popular or go viral.

    • Creating Topic Trees: The dataset can also be used to create topic trees or taxonomies by analyzing the content of videos and looking at what topics they cover. For example, one could analyze the most popular YouTube channels in a specific subject area, group together those that discuss similar topics, and then build an organized tree structure around those topics in order to better understand viewer interests in that area.

    • Viewer Engagement Analysis: This dataset could also be used for viewer engagement analysis purposes by analyzing factors such as subscriber count, average time spent watching a video per user (elapsed time), comments made per view etc., so as to gain insights into how engaged viewers are with specific content or channels on YouTube. From this information it would be possible to optimize content strategy accordingly in order improve overall engagement rates across various types of video content and channel types

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    Data Source

    License

    Unknown License - Please check the dataset description for more information.

    Columns

    File: YouTubeDataset_withChannelElapsed.csv

    | Column name | Description |
    |:---|:---|
    | totalviews/channelelapsedtime | Ratio of total views to channel elapsed time. (Ratio) |
    | channelViewCount | Total number of views for the channel. (Integer) |
    | likes/subscriber | ... |
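
    A minimal sketch for working with the file named above (only the two column names shown in the table are used; the rest of the header may differ):

        import pandas as pd

        df = pd.read_csv("YouTubeDataset_withChannelElapsed.csv")

        # Quick look at two of the engagement metrics described above.
        cols = ["totalviews/channelelapsedtime", "channelViewCount"]
        print(df[cols].describe())

        # Correlation between channel size and views per unit of elapsed time.
        print(df["channelViewCount"].corr(df["totalviews/channelelapsedtime"]))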

  18. OpenScience Slovenia document metadata dataset

    • narcis.nl
    Updated Mar 9, 2021
    Cite
    Borovič, M (via Mendeley Data) (2021). OpenScience Slovenia document metadata dataset [Dataset]. http://doi.org/10.17632/7wh9xvvmgk.3
    Explore at:
    Dataset updated
    Mar 9, 2021
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    Borovič, M (via Mendeley Data)
    Area covered
    Slovenia
    Description

    The OpenScience Slovenia metadata dataset contains metadata entries for Slovenian public domain academic documents which include undergraduate and postgraduate theses, research and professional articles, along with other academic document types. The data within the dataset was collected as a part of the establishment of the Slovenian Open-Access Infrastructure which defined a unified document collection process and cataloguing for universities in Slovenia within the infrastructure repositories. The data was collected from several already established but separate library systems in Slovenia and merged into a single metadata scheme using metadata deduplication and merging techniques. It consists of text and numerical fields, representing attributes that describe documents. These attributes include document titles, keywords, abstracts, typologies, authors, issue years and other identifiers such as URL and UDC. The potential of this dataset lies especially in text mining and text classification tasks and can also be used in development or benchmarking of content-based recommender systems on real-world data.

  19. JPL Physical Oceanography Distributed Active Archive Center (PODAAC) Dataset...

    • catalog.data.gov
    Updated Aug 22, 2025
    + more versions
    Cite
    National Aeronautics and Space Administration (2025). JPL Physical Oceanography Distributed Active Archive Center (PODAAC) Dataset Metadata API [Dataset]. https://catalog.data.gov/dataset/jpl-physical-oceanography-distributed-active-archive-center-podaac-dataset-metadata-api
    Explore at:
    Dataset updated
    Aug 22, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    PO.DAAC provides several ways to discover and access physical oceanography data, from the PO.DAAC Web Portal to FTP access to front-end user interfaces (see http://podaac.jpl.nasa.gov). That same data can also be discovered and accessed through PO.DAAC Web Services, enabling efficient machine-to-machine communication and data transfers.

  20. Version values for DataCite dataset records

    • databank.illinois.edu
    + more versions
    Cite
    Elizabeth Wickes, Version values for DataCite dataset records [Dataset]. http://doi.org/10.13012/B2IDB-4803136_V1
    Explore at:
    Authors
    Elizabeth Wickes
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset was extracted from a set of metadata files harvested from the DataCite metadata store (https://search.datacite.org/ui) during December 2015. Metadata records for items with a resourceType of dataset were collected; 1,647,949 records were collected in total. This dataset contains three files:

    1) readme.txt: A readme file.
    2) version-results.csv: A CSV file containing three columns: DOI, DOI prefix, and version text contents.
    3) version-counts.csv: A CSV file containing counts for unique version text content values.
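
    A minimal sketch reproducing the counts file from the results file (the name of the version-text column is an assumption; check the CSV header in the download):

        import pandas as pd

        # version-results.csv: DOI, DOI prefix, and version text contents.
        results = pd.read_csv("version-results.csv")

        # Count unique version text values, mirroring version-counts.csv.
        counts = results["version"].fillna("(empty)").value_counts()
        counts.to_csv("version-counts-recomputed.csv", header=["count"])
        print(counts.head(20))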
