100+ datasets found
  1. Harvard Common Data Set

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Sep 27, 2016
    Cite
    Office of Institutional Research (2016). Harvard Common Data Set [Dataset]. http://doi.org/10.7910/DVN/AOD2ZV
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 27, 2016
    Dataset provided by
    Harvard Dataverse
    Authors
    Office of Institutional Research
    License

    https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/AOD2ZV

    Description

    This represents Harvard's responses to the Common Data Initiative. The Common Data Set (CDS) initiative is a collaborative effort among data providers in the higher education community and publishers as represented by the College Board, Peterson's, and U.S. News & World Report. The combined goal of this collaboration is to improve the quality and accuracy of information provided to all involved in a student's transition into higher education, as well as to reduce the reporting burden on data providers. This goal is attained by the development of clear, standard data items and definitions in order to determine a specific cohort relevant to each item. Data items and definitions used by the U.S. Department of Education in its higher education surveys often serve as a guide in the continued development of the CDS. Common Data Set items undergo broad review by the CDS Advisory Board as well as by data providers representing secondary schools and two- and four-year colleges. Feedback from those who utilize the CDS also is considered throughout the annual review process.

  2. NINDS Common Data Elements

    • scicrunch.org
    • dknet.org
    • +2more
    Updated Mar 15, 2018
    Cite
    (2018). NINDS Common Data Elements [Dataset]. http://identifiers.org/RRID:SCR_006577
    Explore at:
    Dataset updated
    Mar 15, 2018
    Description

    The purpose of the NINDS Common Data Elements (CDEs) Project is to standardize the collection of investigational data in order to facilitate comparison of results across studies and more effectively aggregate information into significant metadata results. The goal of the National Institute of Neurological Disorders and Stroke (NINDS) CDE Project specifically is to develop data standards for clinical research within the neurological community. Central to this Project is the creation of common definitions and data sets so that information (data) is consistently captured and recorded across studies. To harmonize data collected from clinical studies, the NINDS Office of Clinical Research is spearheading the effort to develop CDEs in neuroscience. This Web site outlines these data standards and provides accompanying tools to help investigators and research teams collect and record standardized clinical data. The Institute still encourages creativity and uniqueness by allowing investigators to independently identify and add their own critical variables. The CDEs have been identified through review of the documentation of numerous studies funded by NINDS, review of the literature and regulatory requirements, and review of other Institutes' common data efforts. Other data standards, such as those of the Clinical Data Interchange Standards Consortium (CDISC), the Clinical Data Acquisition Standards Harmonization (CDASH) Initiative, ClinicalTrials.gov, the NINDS Genetics Repository, and the NIH Roadmap efforts, have also been followed to ensure that the NINDS CDEs are comprehensive and as compatible as possible with those standards.

    CDEs now available:

    • General (CDEs that cross diseases) – updated Feb. 2011
    • Congenital Muscular Dystrophy
    • Epilepsy (updated Sept 2011)
    • Friedreich's Ataxia
    • Parkinson's Disease
    • Spinal Cord Injury
    • Stroke
    • Traumatic Brain Injury

    CDEs in development:

    • Amyotrophic Lateral Sclerosis (public review Sept 15 through Nov 15)
    • Frontotemporal Dementia
    • Headache
    • Huntington's Disease
    • Multiple Sclerosis
    • Neuromuscular Diseases – adult and pediatric working groups are being finalized and will focus on Duchenne Muscular Dystrophy, Facioscapulohumeral Muscular Dystrophy, Myasthenia Gravis, Myotonic Dystrophy, and Spinal Muscular Atrophy

    The following tools are available through this portal:

    • CDE Catalog – includes the universe of all CDEs. Users are able to search the full universe to isolate a subset of the CDEs (e.g., all stroke-specific CDEs, all pediatric epilepsy CDEs) and download details about those CDEs.
    • CRF Library (a.k.a. Library of Case Report Form Modules and Guidelines) – contains all the CRF Modules that have been created through the NINDS CDE Project as well as various guideline documents. Users are able to search the library to find CRF Modules and Guidelines of interest.
    • Form Builder – enables users to start the process of assembling a CRF or form by allowing them to choose the CDEs they would like to include on the form. This tool is intended to assist data managers and database developers in creating data dictionaries for their study forms.

  3. An ontology-based rare disease common data model harmonising international...

    • figshare.com
    csv
    Updated Jan 23, 2025
    Cite
    Adam S.L. Graefe; Sophie AI Klopfenstein; Daniel Danis; Peter N. Robinson; Jana Zschüntzsch; Susanna Wiegand; Peter Kühnen; Oya Beyan; Sylvia Thun; Elisabeth Félicité Nyoungui; Filip Rehburg (2025). An ontology-based rare disease common data model harmonising international registries, FHIR, and Phenopackets [Dataset]. http://doi.org/10.6084/m9.figshare.26509150.v7
    Explore at:
    csv
    Dataset updated
    Jan 23, 2025
    Dataset provided by
    figshare
    Authors
    Adam S.L. Graefe; Sophie AI Klopfenstein; Daniel Danis; Peter N. Robinson; Jana Zschüntzsch; Susanna Wiegand; Peter Kühnen; Oya Beyan; Sylvia Thun; Elisabeth Félicité Nyoungui; Filip Rehburg
    License

    MIT License – https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Please see our GitHub repository here: https://github.com/BIH-CEI/rd-cdm/ Please see our RD CDM documentation here: https://rd-cdm.readthedocs.io/en/latest/index.html/

    Attention: The RD CDM paper is currently under review (version 2.0.0.dev0). As soon as the paper is accepted, we will publish v2.0.0. For more information please see our ChangeLog: https://rd-cdm.readthedocs.io/en/latest/changelog.html

    We introduce our RD CDM v2.0.0, a common data model specifically designed for rare diseases. This RD CDM simplifies the capture, storage, and exchange of complex clinical data, enabling researchers and healthcare providers to work with harmonized datasets across different institutions and countries. The RD CDM is based on the ERDRI-CDS, a common data set developed by the European Rare Disease Research Infrastructure (ERDRI) to support the collection of harmonized data for rare disease research. By extending the ERDRI-CDS with additional concepts and relationships based on HL7 FHIR v4.0.1 and the GA4GH Phenopacket Schema v2.0, the RD CDM provides a comprehensive model for capturing detailed clinical information alongside precise genetic data on rare diseases.

    Background: Rare diseases (RDs), though individually rare, collectively impact over 260 million people worldwide, with over 17 million affected in Europe. These conditions, defined by their low prevalence of fewer than 5 in 10,000 individuals, are often genetically driven, with over 70% of cases suspected to have a genetic cause. Despite significant advances in medical research, RD patients still face lengthy diagnostic delays, often due to a lack of awareness in general healthcare settings and the rarity of RD-specific knowledge among clinicians. Misdiagnosis and underrepresentation in routine care further compound the challenges, leaving many patients without timely and accurate diagnoses.

    Interoperability plays a critical role in addressing these challenges, ensuring the seamless exchange and interpretation of medical data through the use of internationally agreed standards. In the field of rare diseases, where data is often scarce and scattered, the importance of structured, standardized, and reusable medical records cannot be overstated. Interoperable data formats allow for more efficient research, better care coordination, and a clearer understanding of complex clinical cases. However, existing medical systems often fail to support the depth of phenotypic and genotypic data required for rare disease research and treatment, making interoperability a crucial enabler for improving outcomes in RD care.

  4. Popular Links in Newsletters

    • kaggle.com
    zip
    Updated Jan 22, 2023
    Cite
    The Devastator (2023). Popular Links in Newsletters [Dataset]. https://www.kaggle.com/datasets/thedevastator/popular-links-in-newsletters
    Explore at:
    zip (4320707 bytes)
    Dataset updated
    Jan 22, 2023
    Authors
    The Devastator
    Description

    Popular Links in Newsletters

    Trends in Content Curation in 2020

    By Amber Thomas [source]

    About this dataset

    Welcome to the Monthly Popular Links Shared in Newsletters dataset! Here you will find the most popular links shared not just in one newsletter but across multiple newsletters that come out daily to bi-weekly. This data was collected through an experimental newsletter by The Pudding called Winning the Internet, which aims to curate what's good on the internet amongst a sea of content.

    The newsletters we subscribe to must meet certain criteria, such as having the primary purpose of sharing links, having most of their content link to other places, and covering general-interest topics instead of being too niche. The dataset is compiled by automated scripts parsing emails and denylisting any self-promotion or repetitive links from different sources. All data used for these charts are then computed from these sources, including rolling averages for how often a link appears in each newsletter and the nouns chosen for the subject lines of each email.

    So check out this eye-catching Monthly Popular Links Shared in Newsletters dataset, complete with all the popular links as well as computation records from some of your favorite newsletters. You won't want to miss out on what's winning the internet!

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨

    How to use the dataset

    How to Use This Dataset

    This dataset contains the most popular links shared in newsletters, allowing you to explore and visualize internet trends. With this data, you can analyze what topics are becoming more or less popular over time, as well as identify patterns in the type of content people are consuming online.

    Using this dataset is a great way to uncover insights into what people are interested in, making it perfect for marketers who want to learn more about their target audience’s preferences and interests.

    To get started with this dataset:

    1. Download the CSV file containing the data.
    2. Open the CSV file and inspect its columns: 'url' (the URL of the link shared in the newsletter), 'date' (when it was shared), and 'flag' (a flag indicating whether or not it was shared in multiple newsletters).
    3. Analyse each column individually, or combine them for further analysis by sorting according to different parameters, e.g. sorting by the 'flag' column to identify frequently appearing links or by the 'url' column for unique URLs.
    4. Look at each piece of data carefully in order to understand which URLs are being shared across newsletters. Look for similarities and differences between columns so that you can gain insights into how people's interests change over time and/or cross-reference between different newsletters as appropriate.
    5. Create visualizations, such as scatter plots or line graphs, from your findings in step 4 that show correlations between variables like the frequency of URLs mentioned over time.
    6. Share your findings via a blog post or presentation showcasing your newfound knowledge, which helps other marketers better target similar audiences.

    Research Ideas

    • Analyzing the most effective newsletters and analyzing the effectiveness of their content.
    • Evaluating and comparing newsletters based on metrics such as the quality of links shared and/or readership growth within a certain timeframe.
    • Visualizing time trends in individual newsletter popularity, or tracking which topics are proving to be favorites among many newsletters over time

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    See the dataset description for more information.

    Columns

    File: dump-2020-12-15.csv

    | Column name | Description |
    |:------------|:------------|
    | url | The URL of the link shared in the newsletter. (String) |
    | date | The date the link was shared in the newsletter. (Date) |
    | flag | A flag indicating if the link was shared in multiple newsletters. (Boolean) |
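
    For a quick first look, a minimal pandas sketch; the column names follow the table above, and the filename is the dump file named there:

    import pandas as pd

    # Load the monthly dump described in the Columns section.
    df = pd.read_csv("dump-2020-12-15.csv", parse_dates=["date"])

    # Links flagged as appearing in multiple newsletters, ranked by how often they occur.
    repeated = (
        df[df["flag"].astype(bool)]   # 'flag' is assumed to be boolean or 0/1
        .groupby("url")
        .size()
        .sort_values(ascending=False)
        .head(20)
    )
    print(repeated)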

    Acknowledgements

    If you use this dataset in your research, please credit the original author, Amber Thomas.

  5. Popular TV Shows data set

    • kaggle.com
    zip
    Updated Jan 6, 2023
    Cite
    Sudhanshu Yadav (2023). Popular TV Shows data set [Dataset]. https://www.kaggle.com/datasets/sudhanshuy17/popular-tv-shows-data-set
    Explore at:
    zip (1677965 bytes)
    Dataset updated
    Jan 6, 2023
    Authors
    Sudhanshu Yadav
    Description

    "This dataset contains information on a selection of popular TV shows from various networks and countries. It includes information on the show's title, year of release, genre, main cast, and a "This dataset contains information on a selection of popular TV shows from various networks and countries. It includes information on the show's title, year of release, genre, main cast, and a brief summary of the plot. With this dataset, one can analyze trends in TV show popularity and content over time, as well as investigate the characteristics of successful TV shows. The data is suitable for a variety of applications, such as content recommendation systems, media research, and entertainment industry analysis." You can find the dataset in the Tv_shows.csv file.

  6. Open Data Portal Most Popular Data Sets

    • data.europa.eu
    csv, json
    Updated Jun 21, 2024
    Cite
    Zentrales IT-Management Schleswig-Holstein (2024). Open Data Portal Most Popular Data Sets [Dataset]. https://data.europa.eu/data/datasets/d663a020-7556-4fab-92fe-733dbae31912
    Explore at:
    json (568), csv (26368)
    Dataset updated
    Jun 21, 2024
    Dataset authored and provided by
    Zentrales IT-Management Schleswig-Holstein
    License

    Public Domain Mark 1.0 – https://creativecommons.org/publicdomain/mark/1.0/
    License information was derived automatically

    Description

    List of the ten most visited datasets in the Schleswig-Holstein open data portal, per month.

    Only calls to the record metadata (the description page) made via the web interface are counted. API calls and downloads of the associated files are not counted. It can therefore happen that a record description is retrieved only once while the underlying data file is then accessed regularly; those accesses would not be included in these figures.

    Views of records belonging to a time series were summed into a single entry.

    The following fields are available:

    • month – in the format yyyy-mm
    • Number – number of views
    • URL – address of the record in the open data portal

    The column separator is a comma.
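
    A minimal sketch of reading the file with pandas, assuming the column names match the field list above (the local filename is a placeholder):

    import pandas as pd

    # Placeholder filename; download the CSV distribution from the portal first.
    df = pd.read_csv("most-popular-datasets.csv")

    # Most-viewed dataset per month (columns: month, Number, URL).
    top_per_month = df.sort_values("Number", ascending=False).groupby("month").head(1)
    print(top_per_month[["month", "Number", "URL"]].sort_values("month"))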

  7. Data from: Imbalanced dataset for benchmarking

    • dataverse.csuc.cat
    application/gzip, txt
    Updated Jul 27, 2023
    Cite
    Guillaume Lemaitre; Fernando Nogueira; Christos K. Aridas; Dayvid V. R. Oliveira (2023). Imbalanced dataset for benchmarking [Dataset]. http://doi.org/10.34810/data656
    Explore at:
    txt (1592), application/gzip (42530536)
    Dataset updated
    Jul 27, 2023
    Dataset provided by
    CORA.Repositori de Dades de Recerca
    Authors
    Guillaume Lemaitre; Fernando Nogueira; Christos K. Aridas; Dayvid V. R. Oliveira
    License

    https://dataverse.csuc.cat/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.34810/data656

    Description

    The different algorithms of the "imbalanced-learn" toolbox are evaluated on a set of common datasets, which are more or less balanced. These benchmarks were proposed in Ding, Zejin, "Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics." Dissertation, Georgia State University, (2011).
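
    If you prefer to pull the same benchmark programmatically, imbalanced-learn ships a fetcher for this collection; a minimal sketch (downloads the data from Zenodo on first use):

    from collections import Counter
    from imblearn.datasets import fetch_datasets

    # Fetch a single benchmark dataset from the Ding (2011) collection.
    datasets = fetch_datasets(filter_data=("ecoli",))
    ecoli = datasets["ecoli"]

    X, y = ecoli.data, ecoli.target
    print(X.shape, Counter(y))  # shows the class imbalance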

  8. Hulu Popular Shows Dataset

    • kaggle.com
    zip
    Updated Dec 3, 2023
    Cite
    The Devastator (2023). Hulu Popular Shows Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/hulu-popular-shows-dataset
    Explore at:
    zip (359148 bytes)
    Dataset updated
    Dec 3, 2023
    Authors
    The Devastator
    Description

    Hulu Popular Shows Dataset

    Dataset containing information on the top 1,000 most popular shows on Hulu

    By Chase Willden [source]

    About this dataset

    The Hulu Shows dataset is a comprehensive collection of information on the top 1,000 most popular shows available on the streaming platform Hulu. This dataset provides detailed insights into each show, including key details, availability, ratings, and other relevant information.

    The dataset aims to provide an objective analysis of Hulu's show offerings by offering a wide range of data points. It allows users to understand the diversity and popularity of shows available on Hulu and make informed decisions based on their preferences.

    Each entry in the dataset includes essential details about the shows like title, genre(s), runtime, release year, language(s), country of origin, description or summary of the plot. Additionally, it provides information about key cast members involved in each show.

    The dataset also covers availability and accessibility for users interested in watching these shows on Hulu; this includes details such as whether a show is still ongoing or has ended its run, and whether all seasons are available for streaming or only selected ones.

    Ratings play an important role when choosing what show to watch; therefore this dataset includes various rating metrics like IMDb rating (based on user ratings), Rotten Tomatoes critics score (based on professional reviews), Rotten Tomatoes audience score (based on viewer feedback), and Metacritic score (aggregated from multiple sources).

    To help understand viewership trends more comprehensively, information about how many episodes are available for each show, along with episode durations, is included. This can give insights into binge-watching potential, or help evaluate whether shorter episodes might be preferred over longer ones.

    Furthermore, since streaming quality matters to the user experience, the video resolution options (e.g., SD or HD) that Hulu provides for each series have been recorded as well.

    Lastly, when selecting which shows to invest time in, it is worth knowing whether a program carries parental warnings for explicit content. Similarly, noting whether subtitles are available can help users with hearing impairments find suitable, accessible content.

    The Hulu Shows dataset has been meticulously collated and organized to provide a comprehensive overview of the most popular shows on Hulu. It can serve as a valuable resource for users, researchers, or analysts looking to evaluate the streaming platform's offerings and make informed decisions about their entertainment choices.

    How to use the dataset

    Understanding the Columns

    Before diving into any analysis, it's crucial to understand the meaning of each column in the dataset. Here's a brief explanation of each column:

    • Show Name: The name/title of the show.
    • Genre(s): The genre(s) or category/categories to which the show belongs.
    • Run Time: The duration in minutes for each episode or average duration across episodes.
    • Number of Seasons: Total number of seasons available for the show.
    • Rating: Average viewer rating for the show ranging from 0-10 (provided by users).
    • Description: Brief summary or synopsis describing what the show is about.
    • Episodes: Numbered list containing episode names along with their respective release dates (if available).
    • Year Released: Year when the series was initially released.
    • IMDB Rating: Ratings provided by IMDB users on a scale from 0-10.
    • Hulu Link, Poster Link, IMDB Link, IMDB Poster Link: URL links providing access to additional information about each specific show.

    Exploring Different Genres

    One interesting aspect that can be explored using this dataset is analyzing different genres and their popularity on Hulu. You can create visualizations showing which genres have more shows available compared to others.

    For example:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Load the dataset
    df = pd.read_csv('hulu_popular_shows_dataset.csv')

    # Count the number of shows per genre
    genre_counts = df['Genre(s)'].value_counts().sort_values(ascending=False)

    # Plot a bar chart to visualize the counts by genre
    plt.figure(figsize=(12, 6))
    genre_counts.plot(kind='bar')
    plt.title('Number of Shows per Genre on Hulu')
    plt.xlabel('Genre')
    plt.ylabel('Number of Shows')
    plt.tight_layout()
    plt.show()
    
  9. comma_v0.1_training_dataset

    • huggingface.co
    Updated Apr 11, 2025
    Cite
    Common Pile (2025). comma_v0.1_training_dataset [Dataset]. https://huggingface.co/datasets/common-pile/comma_v0.1_training_dataset
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset authored and provided by
    Common Pile
    Description

    Comma v0.1 dataset

    This repository contains the dataset used to train Comma v0.1-1T and Comma v0.1-2T. It is a slightly modified and consolidated version of the Common Pile v0.1 "filtered" data. If you are looking for the raw Common Pile v0.1 data, please see this collection. You can learn more about Common Pile in our paper.

      Mixing rates and token counts
    

    The Comma v0.1 models were trained in two stages, a "main" stage and a "cooldown" stage. During each stage, we… See the full description on the dataset page: https://huggingface.co/datasets/common-pile/comma_v0.1_training_dataset.
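
    A minimal sketch for loading the data with the Hugging Face datasets library; streaming avoids downloading the full corpus up front, and the "train" split name is an assumption to verify on the dataset page:

    from datasets import load_dataset

    # Stream records instead of materializing the whole corpus locally.
    ds = load_dataset(
        "common-pile/comma_v0.1_training_dataset",
        split="train",          # assumed split name; check the dataset viewer
        streaming=True,
    )

    for example in ds.take(1):
        print(example.keys())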

  10. 🫀 Heart Disease Dataset

    • kaggle.com
    zip
    Updated Apr 8, 2024
    + more versions
    Cite
    mexwell (2024). 🫀 Heart Disease Dataset [Dataset]. https://www.kaggle.com/datasets/mexwell/heart-disease-dataset/data
    Explore at:
    zip (408466 bytes)
    Dataset updated
    Apr 8, 2024
    Authors
    mexwell
    License

    Attribution 4.0 (CC BY 4.0) – https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This heart disease dataset is curated by combining 5 popular heart disease datasets already available independently but not combined before. In this dataset, 5 heart datasets are combined over 11 common features, which makes it the largest heart disease dataset available so far for research purposes. The five datasets used for its curation are:

    • Cleveland
    • Hungarian
    • Switzerland
    • Long Beach VA
    • Statlog (Heart) Data Set.

    This dataset consists of 1190 instances with 11 features. These datasets were collected and combined in one place to help advance research on CAD-related machine learning and data mining algorithms, and hopefully to ultimately advance clinical diagnosis and early treatment.
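
    As a starting point for the machine-learning use case, a minimal scikit-learn baseline; the CSV filename and the label column name ("target") are assumptions to adjust to the files shipped with the dataset:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv("heart.csv")                      # placeholder filename
    X, y = df.drop(columns=["target"]), df["target"]   # "target" label column is assumed

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0
    )
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    print("held-out accuracy:", model.score(X_test, y_test))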

    Acknowledgement

    Photo by Kenny Eliason on Unsplash

  11. U.S. Geological Survey Oceanographic Time Series Data Collection

    • catalog.data.gov
    • data.usgs.gov
    • +4more
    Updated Oct 30, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). U.S. Geological Survey Oceanographic Time Series Data Collection [Dataset]. https://catalog.data.gov/dataset/u-s-geological-survey-oceanographic-time-series-data-collection
    Explore at:
    Dataset updated
    Oct 30, 2025
    Dataset provided by
    United States Geological Survey – http://www.usgs.gov/
    Description

    The oceanographic time series data collected by U.S. Geological Survey scientists and collaborators are served in an online database at http://stellwagen.er.usgs.gov/index.html. These data were collected as part of research experiments investigating circulation and sediment transport in the coastal ocean. The experiments (projects, research programs) are typically one month to several years long and have been carried out since 1975. New experiments will be conducted, and the data from them will be added to the collection. As of 2016, all but one of the experiments were conducted in waters abutting the U.S. coast; the exception was conducted in the Adriatic Sea. Measurements acquired vary by site and experiment; they usually include current velocity, wave statistics, water temperature, salinity, pressure, turbidity, and light transmission from one or more depths over a time period. The measurements are concentrated near the sea floor but may also include data from the water column. The user interface provides an interactive map, a tabular summary of the experiments, and a separate page for each experiment. Each experiment page has documentation and maps that provide details of what data were collected at each site. Links to related publications with additional information about the research are also provided. The data are stored in Network Common Data Format (netCDF) files using the Equatorial Pacific Information Collection (EPIC) conventions defined by the National Oceanic and Atmospheric Administration (NOAA) Pacific Marine Environmental Laboratory. NetCDF is a general, self-documenting, machine-independent, open source data format created and supported by the University Corporation for Atmospheric Research (UCAR). EPIC is an early set of standards designed to allow researchers from different organizations to share oceanographic data. The files may be downloaded or accessed online using the Open-source Project for a Network Data Access Protocol (OPeNDAP). The OPeNDAP framework allows users to access data from anywhere on the Internet using a variety of Web services including Thematic Realtime Environmental Distributed Data Services (THREDDS). A subset of the data compliant with the Climate and Forecast convention (CF, currently version 1.6) is also available.
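
    Because the files are served over OPeNDAP, they can be opened remotely; a minimal xarray sketch, where the URL is a placeholder for an experiment file listed in the online database:

    import xarray as xr

    # Placeholder OPeNDAP endpoint; real file URLs are listed on each experiment page.
    url = "http://stellwagen.er.usgs.gov/opendap/EXPERIMENT/FILE.nc"

    ds = xr.open_dataset(url)   # lazily opens the remote NetCDF (EPIC) file
    print(ds)                   # dimensions, coordinates, and variables
    # Individual variables (e.g., temperature or current velocity) can then be
    # selected by their EPIC variable names and converted with .to_dataframe().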

  12. england-phoneme-dataset

    • huggingface.co
    Updated Dec 13, 2024
    Cite
    shiyuan zhao (2024). england-phoneme-dataset [Dataset]. https://huggingface.co/datasets/zdm-code/england-phoneme-dataset
    Explore at:
    Dataset updated
    Dec 13, 2024
    Authors
    shiyuan zhao
    Description

    British English Phonetic Dataset

      Introduction
    

    This dataset is an extension of Common Voice, from which 6 subsets were selected (Common Voice Corpus 1, Common Voice Corpus 2, Common Voice Corpus 3, Common Voice Corpus 4, Common Voice Corpus 18.0, Common Voice Corpus 19.0). All data containing the England accent from these 6 subsets were extracted and phonetically annotated accordingly.

      Description
    

    Key fields explanation:

    sentence: The English sentence… See the full description on the dataset page: https://huggingface.co/datasets/zdm-code/england-phoneme-dataset.

  13. Deep Learning Tutor Dataset

    • kaggle.com
    zip
    Updated Aug 12, 2025
    Cite
    monkwarrior08 (2025). Deep Learning Tutor Dataset [Dataset]. https://www.kaggle.com/datasets/monkwarrior08/deep-learning-tutor-dataset
    Explore at:
    zip (120655 bytes)
    Dataset updated
    Aug 12, 2025
    Authors
    monkwarrior08
    License

    MIT License – https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dive into the future of education with the Deep Learning Tutor Dataset – a pioneering resource designed to empower the creation of sophisticated, adaptive AI tutors. This dataset is meticulously curated to facilitate the fine-tuning of advanced large language models like GPT-4o, enabling them to internalize specialized pedagogical conversation patterns and expert teaching methodologies.

    This collection represents a significant step towards developing intelligent educational systems that can truly adapt to individual student needs, provide nuanced feedback, and foster deeper understanding. By leveraging the power of deep learning and state-of-the-art LLMs, this dataset paves the way for a new generation of personalized learning experiences.

    Key Features & Contents:

    • Specialized Pedagogical Conversation Data: An extensive collection of educational dialogue, carefully structured to represent effective tutoring interactions. This includes examples of:
      • Expert Explanations: Clear, concise, and multi-faceted explanations of complex concepts.
      • Adaptive Feedback: Responses tailored to student understanding levels, common misconceptions, and learning styles.
      • Guided Inquiry: Dialogue patterns that encourage critical thinking and problem-solving.
      • Conceptual Clarification: Interactions focused on identifying and addressing misunderstandings.
      • Motivational Prompts: Examples of how to engage and encourage learners.
    • Structured for Fine-tuning GPT-4o: The dataset is provided in a format optimized for fine-tuning OpenAI's GPT-4o, allowing the model to go beyond general knowledge and adopt a truly pedagogical persona.
    • Foundational for Adaptive Tutoring Systems: This data is the bedrock for training AI systems that can dynamically adjust their teaching approach based on student performance, engagement, and learning progress.

    Applications:

    • Building Next-Generation AI Tutors: Develop intelligent tutors capable of empathetic, effective, and adaptive teaching.
    • Research in AI in Education (AIEd): A valuable resource for researchers exploring the application of LLMs in educational contexts, dialogue systems, and personalized learning.
    • Enhancing E-Learning Platforms: Integrate AI-driven tutoring capabilities into existing or new online learning environments.
    • Developing Conversational AI for Learning: Train models to understand and generate educational dialogues that mimic expert human tutors.
    • Personalized Learning Initiatives: Contribute to systems that offer highly individualized learning paths and support.

    How to Leverage This Dataset: Fine-tuning Your AI Tutor

    The primary utility of this dataset is to fine-tune a powerful LLM like GPT-4o, imbuing it with the specific conversational and pedagogical skills required for adaptive tutoring.

    Prerequisites:

    • An OpenAI account with API access.
    • Familiarity with the OpenAI Platform and fine-tuning concepts.

    Step 1: Download the Dataset Download the educational_conversation_data.jsonl file from this Kaggle dataset.

    Step 2: Initiate GPT-4o Fine-tuning
    This process will train GPT-4o to emulate the expert teaching methodologies embedded within the dataset.

    1. Upload Data: Navigate to the "Fine-tuning" section in your OpenAI Platform. Upload the educational_conversation_data.jsonl file.
    2. Create Fine-tuning Job:
       • Base Model: gpt-4o (or gpt-4o-mini for more cost-effective experimentation).
       • Epochs: 3 (a common starting point; adjust based on dataset size and desired performance).
       • Learning Rate Multiplier: 2 (a good initial value; can be tuned).
       • Batch Size: 1 (often effective for pedagogical data, but can be adjusted).
       • Note: These parameters are recommendations. Experimentation may be required to achieve optimal results for your specific application.
    3. Start Job: Initiate the fine-tuning process. Once complete, you will receive a new custom model ID, representing your fine-tuned pedagogical AI.

    Step 3: Integrate Your Fine-tuned Model
    The fine-tuned model ID can now be used with OpenAI's API to power your adaptive AI tutor. You can integrate it into:

    • A custom chat interface.
    • An existing educational platform.
    • A research prototype for conversational AI in education.
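
    The same job can also be started from the OpenAI Python SDK instead of the web UI; a minimal sketch (the model snapshot name is an assumption, and availability of gpt-4o fine-tuning depends on your account):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Upload the training file, then create the fine-tuning job with the
    # hyperparameters recommended above.
    training_file = client.files.create(
        file=open("educational_conversation_data.jsonl", "rb"),
        purpose="fine-tune",
    )

    job = client.fine_tuning.jobs.create(
        training_file=training_file.id,
        model="gpt-4o-mini-2024-07-18",  # assumed fine-tunable snapshot; swap in a gpt-4o snapshot if available to you
        hyperparameters={"n_epochs": 3, "batch_size": 1, "learning_rate_multiplier": 2},
    )
    print(job.id)  # poll the job until it reports the fine-tuned model ID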

    Files in This Dataset:

    • educational_conversation_data.jsonl: The core dataset containing the specialized pedagogical conversation patterns and expert teaching methodologies, formatted for OpenAI fine-tuning.
    • README.md: (Optional, but good practice) A brief overview of the dataset and usage.
  14. Global Roads Open Access Data Set, Version 1 (gROADSv1)-Copy

    • hub.arcgis.com
    • arcgis.com
    • +2more
    Updated May 19, 2022
    + more versions
    Cite
    New Mexico Community Data Collaborative (2022). Global Roads Open Access Data Set, Version 1 (gROADSv1)-Copy [Dataset]. https://hub.arcgis.com/maps/e4e59bdbebc44208964aa1fb677416ec
    Explore at:
    Dataset updated
    May 19, 2022
    Dataset authored and provided by
    New Mexico Community Data Collaborative
    Area covered
    Description

    The data set combines the best available roads data by country into a global roads coverage, using the UN Spatial Data Infrastructure Transport (UNSDI-T) version 2 as a common data model. The purpose is to provide an open access, well documented global data set of roads between settlements using a consistent data model (UNSDI-T v.2) which is, to the extent possible, topologically integrated.

    Dataset Summary

    The Global Roads Open Access Data Set, Version 1 (gROADSv1) was developed under the auspices of the CODATA Global Roads Data Development Task Group. The data set combines the best available roads data by country into a global roads coverage, using the UN Spatial Data Infrastructure Transport (UNSDI-T) version 2 as a common data model. All country road networks have been joined topologically at the borders, and many countries have been edited for internal topology. Source data for each country are provided in the documentation, and users are encouraged to refer to the readme file for use constraints that apply to a small number of countries. Because the data are compiled from multiple sources, the date range for road network representations ranges from the 1980s to 2010 depending on the country (most countries have no confirmed date), and spatial accuracy varies. The baseline global data set was compiled by the Information Technology Outreach Services (ITOS) of the University of Georgia. Updated data for 27 countries and 6 smaller geographic entities were assembled by Columbia University's Center for International Earth Science Information Network (CIESIN), with a focus largely on developing countries with the poorest data coverage.

    Documentation for the Global Roads Open Access Data Set, Version 1 (gROADSv1)

    Recommended Citation

    Center for International Earth Science Information Network - CIESIN - Columbia University, and Information Technology Outreach Services - ITOS - University of Georgia. 2013. Global Roads Open Access Data Set, Version 1 (gROADSv1). Palisades, NY: NASA Socioeconomic Data and Applications Center (SEDAC). http://dx.doi.org/10.7927/H4VD6WCT. Accessed DAY MONTH YEAR.

  15. Sydney Harbour Environmental Data Facility Sydney Harbour Model Data 11046

    • researchdata.edu.au
    Updated Sep 6, 2013
    + more versions
    Cite
    The University of Sydney (2013). Sydney Harbour Environmental Data Facility Sydney Harbour Model Data 11046 [Dataset]. https://researchdata.edu.au/sydney-harbour-environmental-model-11046/189582
    Explore at:
    Dataset updated
    Sep 6, 2013
    Dataset provided by
    The University of Sydney
    License

    Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0) – https://creativecommons.org/licenses/by-nc-sa/3.0/
    License information was derived automatically

    Time period covered
    Sep 5, 2013 - May 13, 2014
    Area covered
    Description

    This data collection contains Hydrodynamic Model output data produced by the Sydney Harbour Hydrodynamic Model.

    The Sydney Harbour (real-time) model collates observations from the Bureau of Meteorology, Macquarie University, Sydney Ports Authority and the Manly Hydraulics Laboratory offshore buoy. The Sydney Harbour Model is contained within the Sydney Harbour Observatory (SHO) system.

    The Sydney Harbour Hydrodynamic Model divides the Harbour water into a number of boxes or voxels. Each voxel is less than 60m x 60m x 1m in depth. In narrow parts of the Harbour, or in shallower regions, the voxels are smaller. Layers are numbered - so the sea floor is number 1 and the surface is number 24.

    The model is driven by the conditions on the boundaries. It uses rainfall rates at 13 sites in the Sydney catchment, the wind speed, tide height, the solar radiation and astronomical tides. Every hour the display is refreshed.

    The model utilizes the following environmental data inputs:

    • Dr Serena Lee provides the following: the 24-layer grid of the Sydney Harbour Estuary, bathymetry inputs, and the run-off coefficient formula used to convert rainfall readings provided by the Bureau of Meteorology into boundary input data.
    • The Bureau of Meteorology provides the following model inputs: rainfall from 13 individual rain gauges, air temperature, humidity, barometric pressure, cloud cover, evaporation, wind speed, wind direction, and forecast data.
    • Sydney Ports Authority provides tidal input data.
    • The Office of Environment and Heritage and the Manly Hydraulics Laboratory provide ocean boundary temperature input data.
    • Macquarie University provides solar radiation input data.

    The hydrodynamic modeling system models the following environmental variables:

    • Salinity
    • Temperature
    • Depth average salinity
    • Horizontal water velocity
    • Vertical water velocity
    • Depth average north velocity
    • Depth average east velocity
    • Water elevation

    This dataset is available in Network Common Data Form – Climate and Forecast (NetCDF-CF) format.

  16. University of Cape Town Student Admissions Data 2015-2019 - South Africa

    • datafirst.uct.ac.za
    Updated Jul 28, 2020
    + more versions
    Cite
    UCT Student Administration (2020). University of Cape Town Student Admissions Data 2015-2019 - South Africa [Dataset]. https://www.datafirst.uct.ac.za/dataportal/index.php/catalog/787
    Explore at:
    Dataset updated
    Jul 28, 2020
    Dataset authored and provided by
    UCT Student Administration
    Time period covered
    2015 - 2019
    Area covered
    South Africa
    Description

    Abstract

    The dataset was generated from a set of Excel spreadsheets extracted from an Information and Communication Technology Services (ICTS) administrative database on student applications to the University of Cape Town (UCT). The data in this second part of the series contain information on applications to UCT made between January 2015 and September 2019.

    In the original form received by DataFirst the data were ill suited to research purposes. The series represents an attempt at cleaning and organizing the data into a more tractable format.

    Analysis unit

    Individuals, applications

    Universe

    All applications to study at the University of Cape Town

    Kind of data

    Administrative records data

    Mode of data collection

    Other [oth]

    Cleaning operations

    In order to lessen computation times the main applications file was split by year - this part contains the years 2014-2019. Note however that the other 3 files released with the application file (that can be merged into it for additional detail) did not need to be split. As such, the four files can be used to produce a series for 2014-2019 and are labelled as such, even though the person, secondary schooling and tertiary education files all span a longer time period.

    Here is additional information about the files:

    1. Application file: the "finest" or most disaggregated unit of analysis. Individuals may have multiple applications. Uniquely identified by an application ID variable. There are a total of 1,540,129 applications between 2015 and 2019. As mentioned, it was this application file that was split to reduce computation times. It was not necessary or logical to split the other files.
    2. Person file: Each individual is uniquely identified by an individual ID variable. Each individual is associated with information on "key subjects" from a separate data file also contained in the database. These key subjects are all separate variables in the individual level data file. It is important to note that because individuals may have multiple applications, potentially spanning over many years, it was decided not to split the person level datafile. Rather, the person file spans the full data range from 2006 to 2019.
    3. Secondary Education Information: Individuals can also be associated with row entries for each subject. This data file does not have a unique identifier. Instead, each row entry represents a specific secondary school subject for a specific individual. These subjects are quite specific and the data allows the user to distinguish between, for example, higher grade accounting and standard grade accounting. It also allows the user to identify the educational authority issuing the qualification e.g. Cambridge Internal Examinations (CIE) versus National Senior Certificate (NSC). This file spans 2006 to 2019.
    4. Tertiary Education Information: the smallest of the four data files. There are multiple entries for each individual in this dataset. Each row entry contains information on the year, institution, and transcript, and can be associated with individuals. This file spans 2006 to 2019.

    Further information on the processing of the original data files is summarised in a document entitled "Notes on preparing the UCT Student Admissions Data" accompanying the data.
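
    A minimal sketch of attaching person-level attributes to the application file with pandas; the filenames and key column names here are illustrative placeholders, not the actual variable names in the data:

    import pandas as pd

    applications = pd.read_csv("applications_2015_2019.csv")  # hypothetical filename
    persons = pd.read_csv("persons.csv")                      # hypothetical filename

    # Individuals can have many applications, so this is a many-to-one merge
    # on the individual identifier ("person_id" is a placeholder column name).
    merged = applications.merge(persons, on="person_id", how="left", validate="m:1")
    print(merged.shape)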

  17. DermaEvolve - Original Unprocessed

    • kaggle.com
    zip
    Updated Mar 11, 2025
    Cite
    Lokesh Bhaskar (2025). DermaEvolve - Original Unprocessed [Dataset]. https://www.kaggle.com/datasets/lokeshbhaskarnr/dermaevolve-original-unprocessed
    Explore at:
    zip (3287235366 bytes)
    Dataset updated
    Mar 11, 2025
    Authors
    Lokesh Bhaskar
    License

    Apache License, v2.0 – https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    DermaEvolve Dataset

    Overview

    The DermaEvolve dataset is a comprehensive collection of skin lesion images, sourced from publicly available datasets and extended with additional rare diseases. This dataset aims to aid in the development and evaluation of machine learning models for dermatological diagnosis.

    Sources

    The dataset is primarily derived from:

    • HAM10000 (Kaggle link) – A collection of dermatoscopic images with various skin lesion types.
    • ISIC Archive (Kaggle link) – A dataset of skin cancer images categorized into multiple classes.
    • Dermnet NZ (https://dermnetnz.org/) – Used to source additional rare diseases for dataset extension.
    • Google Database – Images.

    Categories

    The dataset includes images of the following skin conditions:

    Common Categories:

    • Basal Cell Carcinoma
    • Squamous Cell Carcinoma
    • Melanoma
    • Actinic Keratosis
    • Pigmented Benign Keratosis
    • Seborrheic Keratosis
    • Vascular Lesion
    • Melanocytic Nevus
    • Dermatofibroma

    Rare Diseases (Extended):

    To enhance diversity, the following rare skin conditions were added from Dermnet NZ:

    • Elastosis Perforans Serpiginosa
    • Lentigo Maligna
    • Nevus Sebaceus
    • Blue Naevus

    [Image: original dataset distribution]

    Dataset Characteristics

    • Unprocessed: The dataset consists of raw, unprocessed images.
    • Variable Image Sizes: Image dimensions vary as they have not been standardized.

    Acknowledgements

    Special thanks to the authors of the original datasets:

    • HAM10000 – Tschandl P, Rosendahl C, Kittler H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions.
    • ISIC Archive – International Skin Imaging Collaboration (ISIC), a repository for dermatology imaging.
    • Dermnet NZ – A valuable resource for dermatological images.

    Usage

    This dataset can be used for:

    • Training deep learning models for skin lesion classification (see the loading sketch below).
    • Research on dermatological image analysis.
    • Development of computer-aided diagnostic tools.
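
    For the classification use case, a minimal PyTorch loading sketch; it assumes the archive unpacks into one folder per category, and resizes because the raw images vary in size:

    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    # Resize the unprocessed, variably sized images before batching.
    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])

    # "dermaevolve/" is a placeholder root; one sub-folder per skin condition is assumed.
    dataset = datasets.ImageFolder("dermaevolve/", transform=transform)
    loader = DataLoader(dataset, batch_size=32, shuffle=True)

    images, labels = next(iter(loader))
    print(images.shape, dataset.classes[:3])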

    Please cite the original datasets if you use this resource in your work.

    NOTE :

    Check out the github repository for the streamlit application that focuses on skin disease prediction --> https://github.com/LokeshBhaskarNR/DermaEvolve---An-Advanced-Skin-Disease-Predictor.git

    Streamlit Application Link : https://dermaevolve.streamlit.app/

    Kindly check out my notebooks for the processed models and code, trained on multiple models using this dataset.

  18. Data from: Voice Conversion Challenge 2020 Listening Test Data

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    • +1more
    Updated Jan 15, 2021
    Cite
    Zhao Yi; Wen-Chin Huang; Xiaohai Tian; Junichi Yamagishi; Rohan Kumar Das; Tomi Kinnunen; Zhenhua Ling; Tomoki Toda (2021). Voice Conversion Challenge 2020 Listening Test Data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4345997
    Explore at:
    Dataset updated
    Jan 15, 2021
    Dataset provided by
    National University of Singapore, Singapore
    University of Science and Technology of China, P.R.China
    University of Eastern Finland, Finland
    Nagoya University, Japan
    National Institute of Informatics, Japan
    Authors
    Zhao Yi; Wen-Chin Huang; Xiaohai Tian; Junichi Yamagishi; Rohan Kumar Das; Tomi Kinnunen; Zhenhua Ling; Tomoki Toda
    Description

    Voice conversion (VC) is a technique to transform a speaker identity included in a source speech waveform into a different one while preserving linguistic information of the source speech waveform.

    In 2016, we launched the Voice Conversion Challenge (VCC) 2016 [1][2] at Interspeech 2016. The objective of the 2016 challenge was to better understand different VC techniques built on a freely-available common dataset to look at a common goal, and to share views about unsolved problems and challenges faced by the current VC techniques. The VCC 2016 focused on the most basic VC task, that is, the construction of VC models that automatically transform the voice identity of a source speaker into that of a target speaker using a parallel clean training database where source and target speakers read out the same set of utterances in a professional recording studio. 17 research groups participated in the 2016 challenge. The challenge was successful and it established new standard evaluation methodology and protocols for bench-marking the performance of VC systems.

    In 2018, we launched the second edition of VCC, the VCC 2018 [3]. In the second edition, we revised three aspects of the challenge. First, we reduced the amount of speech data used for the construction of participants' VC systems to half. This is based on feedback from participants in the previous challenge and this is also essential for practical applications. Second, we introduced a more challenging task, referred to as a Spoke task, in addition to a task similar to the 1st edition, which we call a Hub task. In the Spoke task, participants need to build their VC systems using a non-parallel database in which source and target speakers read out different sets of utterances. We then evaluate both parallel and non-parallel voice conversion systems via the same large-scale crowdsourcing listening test. Third, we also attempted to bridge the gap between the ASV and VC communities. Since new VC systems developed for the VCC 2018 may be strong candidates for enhancing the ASVspoof 2015 database, we also assess spoofing performance of the VC systems based on anti-spoofing scores.

    In 2020, we launched the third edition of VCC, the VCC 2020 [4][5]. In this third edition, we constructed and distributed a new database for two tasks, intra-lingual semi-parallel and cross-lingual VC. The dataset for intra-lingual VC consists of a smaller parallel corpus and a larger nonparallel corpus, where both of them are of the same language. The dataset for cross-lingual VC consists of a corpus of the source speakers speaking in the source language and another corpus of the target speakers speaking in the target language. As a more challenging task than the previous ones, we focused on cross-lingual VC, in which the speaker identity is transformed between two speakers uttering different languages, which requires handling completely nonparallel training over different languages.

    As for the listening test, we subcontracted the crowd-sourced perceptual evaluation with English and Japanese listeners to Lionbridge Technologies Inc. and Koto Ltd., respectively. Given the extremely large costs required for the perceptual evaluation, we selected only 5 utterances (E30001, E30002, E30003, E30004, E30005) from each speaker of each team. To evaluate the speaker similarity of the cross-lingual task, we used audio in both the English language and in the target speaker's L2 language as reference. For each source-target speaker pair, we selected three English recordings and two L2 language recordings as the natural reference for the converted five utterances.

    This data repository includes the audio files used for the crowd-sourced perceptual evaluation and raw listening test scores.

    [1] Tomoki Toda, Ling-Hui Chen, Daisuke Saito, Fernando Villavicencio, Mirjam Wester, Zhizheng Wu, Junichi Yamagishi "The Voice Conversion Challenge 2016" in Proc. of Interspeech, San Francisco.

    [2] Mirjam Wester, Zhizheng Wu, Junichi Yamagishi "Analysis of the Voice Conversion Challenge 2016 Evaluation Results" in Proc. of Interspeech 2016.

    [3] Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki Toda, Daisuke Saito, Fernando Villavicencio, Tomi Kinnunen, Zhenhua Ling, "The Voice Conversion Challenge 2018: Promoting Development of Parallel and Nonparallel Methods", Proc Speaker Odyssey 2018, June 2018.

    [4] Yi Zhao, Wen-Chin Huang, Xiaohai Tian, Junichi Yamagishi, Rohan Kumar Das, Tomi Kinnunen, Zhenhua Ling, and Tomoki Toda. "Voice conversion challenge 2020: Intra-lingual semi-parallel and cross-lingual voice conversion" Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, 80-98, DOI: 10.21437/VCC_BC.2020-14.

    [5] Rohan Kumar Das, Tomi Kinnunen, Wen-Chin Huang, Zhenhua Ling, Junichi Yamagishi, Yi Zhao, Xiaohai Tian, and Tomoki Toda. "Predictions of subjective ratings and spoofing assessments of voice conversion challenge 2020 submissions." Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, 99-120, DOI: 10.21437/VCC_BC.2020-15.

  19. Data from: Daily Weather Data (Precipitation, Minimum and Maximum Air...

    • catalog.data.gov
    • data.usgs.gov
    • +3more
    Updated Oct 2, 2025
    Cite
    U.S. Geological Survey (2025). Daily Weather Data (Precipitation, Minimum and Maximum Air Temperatures) of Florida, Parts of Georgia, Alabama, and South Carolina, 1895-1915 [Dataset]. https://catalog.data.gov/dataset/daily-weather-data-precipitation-minimum-and-maximum-air-temperatures-of-florida-part-1895
    Explore at:
    Dataset updated
    Oct 2, 2025
    Dataset provided by
    U.S. Geological Survey
    Area covered
    Florida
    Description

    This data release consists of Network Common Data Form (NetCDF) data sets of daily total-precipitation and minimum and maximum air temperatures for the time period from January 1, 1895 to December 31, 1915. These data sets are based on individual station data obtained for 153 National Oceanic and Atmospheric Administration (NOAA) weather stations in Florida and parts of Georgia, Alabama, and South Carolina (available at http://www.ncdc.noaa.gov/cdo-web/results). Weather station data were used to produce a total of 23,007 daily raster surfaces (7,669 daily raster surfaces for each of the 3 data sets) using a thin-plate-spline method of interpolation. The geographic extent of the weather station data coincides with the geographic extent of the Floridan aquifer system, with the exception of a small portion of southeast Mississippi where the Floridan aquifer system is saline and was not used.

  20. Spotify Most Popular Songs Dataset

    • kaggle.com
    zip
    Updated Feb 21, 2025
    Cite
    RishabhPancholi1302 (2025). Spotify Most Popular Songs Dataset [Dataset]. https://www.kaggle.com/datasets/rishabhpancholi1302/spotify-most-popular-songs-dataset
    Explore at:
    zip (3707341 bytes)
    Dataset updated
    Feb 21, 2025
    Authors
    RishabhPancholi1302
    License

    Apache License, v2.0 – https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Spotify Most Popular Songs Dataset 🎵

    Overview:

    This dataset contains a collection of the most popular songs on Spotify, along with various attributes that can be used for music analysis and recommendation systems. It includes audio features, lyrical details, and general metadata about each track, making it an excellent resource for machine learning, data science, and music analytics projects.

    Each song in the dataset includes the following features:

    🎧 Audio Features (Extracted from Spotify API):

    1. Danceability – How suitable a track is for dancing (0.0 – 1.0).
    2. Energy – Intensity and activity level of a song (0.0 – 1.0).
    3. Loudness – Overall loudness in decibels (dB).
    4. Speechiness – Presence of spoken words in the track (0.0 – 1.0).
    5. Acousticness – Probability that a track is acoustic (0.0 – 1.0).
    6. Instrumentalness – Predicts if a track is instrumental (0.0 – 1.0).
    7. Liveness – Probability of a live audience (0.0 – 1.0).
    8. Valence – Musical positivity or happiness (0.0 – 1.0).
    9. Tempo – Beats per minute (BPM) of the track.
    10. Key & Mode – Musical key and mode (major/minor).

    📝 Lyrics-Based Features:

    1. Lyrics Text – Full lyrics of the song (if available).

    🎶 General Song Information:

    1. Track Name – Name of the song.
    2. Artist(s) – Performing artist(s).
    3. Album Name – Album the track belongs to.
    4. Release Year – Year when the song was released.
    5. Genre – Song's primary genre classification.
    6. Popularity Score – Spotify popularity metric (0 – 1).
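
    As an illustration of a content-based recommender over the audio features above, a minimal sketch; the CSV filename and exact column spellings are assumptions to match to the actual file:

    import pandas as pd
    from sklearn.metrics.pairwise import cosine_similarity
    from sklearn.preprocessing import StandardScaler

    # Feature names follow the list above; adjust to the column headers in the CSV.
    audio_features = ["Danceability", "Energy", "Loudness", "Speechiness", "Acousticness",
                      "Instrumentalness", "Liveness", "Valence", "Tempo"]

    df = pd.read_csv("spotify_most_popular_songs.csv")   # placeholder filename
    X = StandardScaler().fit_transform(df[audio_features])

    # Tracks most similar to the first song, by cosine similarity of audio features.
    sims = cosine_similarity(X[:1], X).ravel()
    print(df["Track Name"].iloc[sims.argsort()[::-1][1:6]])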

    Use Cases 🚀:

    This dataset is ideal for:

    1. Music Recommendation Systems – Build collaborative or content-based recommenders.
    2. Audio Feature Analysis – Discover trends in song characteristics.
    3. Sentiment Analysis – Study how song lyrics relate to emotions.
    4. Hit Song Prediction – Use machine learning to predict song popularity.
    5. Music Genre Classification – Train classifiers to categorize music.

    Acknowledgments:

    Data collected using the Spotify API and other sources. If you use this dataset, consider crediting it in your projects!
