https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/AOD2ZV
This represents Harvard's responses to the Common Data Initiative. The Common Data Set (CDS) initiative is a collaborative effort among data providers in the higher education community and publishers as represented by the College Board, Peterson's, and U.S. News & World Report. The combined goal of this collaboration is to improve the quality and accuracy of information provided to all involved in a student's transition into higher education, as well as to reduce the reporting burden on data providers. This goal is attained by the development of clear, standard data items and definitions in order to determine a specific cohort relevant to each item. Data items and definitions used by the U.S. Department of Education in its higher education surveys often serve as a guide in the continued development of the CDS. Common Data Set items undergo broad review by the CDS Advisory Board as well as by data providers representing secondary schools and two- and four-year colleges. Feedback from those who utilize the CDS also is considered throughout the annual review process.
The purpose of the NINDS Common Data Elements (CDEs) Project is to standardize the collection of investigational data in order to facilitate comparison of results across studies and more effectively aggregate information into significant metadata results. The goal of the National Institute of Neurological Disorders and Stroke (NINDS) CDE Project specifically is to develop data standards for clinical research within the neurological community. Central to this Project is the creation of common definitions and data sets so that information (data) is consistently captured and recorded across studies. To harmonize data collected from clinical studies, the NINDS Office of Clinical Research is spearheading the effort to develop CDEs in neuroscience. This Web site outlines these data standards and provides accompanying tools to help investigators and research teams collect and record standardized clinical data. The Institute still encourages creativity and uniqueness by allowing investigators to independently identify and add their own critical variables. The CDEs have been identified through review of the documentation of numerous studies funded by NINDS, review of the literature and regulatory requirements, and review of other Institutes' common data efforts. Other data standards, such as those of the Clinical Data Interchange Standards Consortium (CDISC), the Clinical Data Acquisition Standards Harmonization (CDASH) Initiative, ClinicalTrials.gov, the NINDS Genetics Repository, and the NIH Roadmap efforts, have also been followed to ensure that the NINDS CDEs are comprehensive and as compatible as possible with those standards.
CDEs now available:
* General (CDEs that cross diseases) - Updated Feb. 2011
* Congenital Muscular Dystrophy
* Epilepsy (Updated Sept. 2011)
* Friedreich's Ataxia
* Parkinson's Disease
* Spinal Cord Injury
* Stroke
* Traumatic Brain Injury
CDEs in development:
* Amyotrophic Lateral Sclerosis (public review Sept. 15 through Nov. 15)
* Frontotemporal Dementia
* Headache
* Huntington's Disease
* Multiple Sclerosis
* Neuromuscular Diseases - adult and pediatric working groups are being finalized and will focus on Duchenne Muscular Dystrophy, Facioscapulohumeral Muscular Dystrophy, Myasthenia Gravis, Myotonic Dystrophy, and Spinal Muscular Atrophy
The following tools are available through this portal:
* CDE Catalog - includes the universe of all CDEs. Users are able to search the full universe to isolate a subset of the CDEs (e.g., all stroke-specific CDEs, all pediatric epilepsy CDEs, etc.) and download details about those CDEs.
* CRF Library (a.k.a. Library of Case Report Form Modules and Guidelines) - contains all the CRF Modules that have been created through the NINDS CDE Project as well as various guideline documents. Users are able to search the library to find CRF Modules and Guidelines of interest.
* Form Builder - enables users to start the process of assembling a CRF or form by choosing the CDEs they would like to include on it. This tool is intended to assist data managers and database developers in creating data dictionaries for their study forms.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Please see our GitHub repository here: https://github.com/BIH-CEI/rd-cdm/
Please see our RD CDM documentation here: https://rd-cdm.readthedocs.io/en/latest/index.html/
Attention: The RD CDM paper is currently under review (version 2.0.0.dev0). As soon as the paper is accepted, we will publish v2.0.0. For more information please see our ChangeLog: https://rd-cdm.readthedocs.io/en/latest/changelog.html
We introduce our RD CDM v2.0.0, a common data model specifically designed for rare diseases. This RD CDM simplifies the capture, storage, and exchange of complex clinical data, enabling researchers and healthcare providers to work with harmonized datasets across different institutions and countries. The RD CDM is based on the ERDRI-CDS, a common data set developed by the European Rare Disease Research Infrastructure (ERDRI) to support the collection of harmonized data for rare disease research. By extending the ERDRI-CDS with additional concepts and relationships based on HL7 FHIR v4.0.1 and the GA4GH Phenopacket Schema v2.0, the RD CDM provides a comprehensive model for capturing detailed clinical information alongside precise genetic data on rare diseases.
Background: Rare diseases (RDs), though individually rare, collectively impact over 260 million people worldwide, with over 17 million affected in Europe. These conditions, defined by a prevalence of fewer than 5 in 10,000 individuals, are often genetically driven, with over 70% of cases suspected to have a genetic cause. Despite significant advances in medical research, RD patients still face lengthy diagnostic delays, often due to a lack of awareness in general healthcare settings and the rarity of RD-specific knowledge among clinicians. Misdiagnosis and underrepresentation in routine care further compound the challenges, leaving many patients without timely and accurate diagnoses.
Interoperability plays a critical role in addressing these challenges, ensuring the seamless exchange and interpretation of medical data through the use of internationally agreed standards. In the field of rare diseases, where data are often scarce and scattered, the importance of structured, standardized, and reusable medical records cannot be overstated. Interoperable data formats allow for more efficient research, better care coordination, and a clearer understanding of complex clinical cases. However, existing medical systems often fail to support the depth of phenotypic and genotypic data required for rare disease research and treatment, making interoperability a crucial enabler for improving outcomes in RD care.
By Amber Thomas [source]
Welcome to the Monthly Popular Links Shared in Newsletters dataset! Here you will find the most popular links shared not just in one newsletter but across multiple newsletters that come out daily to bi-weekly. This data was collected through an experimental newsletter by The Pudding called Winning the Internet, which aims to curate what's good on the internet amongst a sea of content.
The newsletters we subscribe to must meet certain criteria, such as having the primary purpose of sharing links, pointing mostly to other places, and covering general-interest topics rather than being too niche. The dataset is compiled by automated scripts that parse the emails and denylist any self-promotion or repetitive links from different sources. All data used for these charts are then computed from these sources, including rolling averages for how often a link appears in each newsletter and the nouns chosen for each email's subject line.
So check out this eye-catching Monthly Popular Links Shared in Newsletters dataset, complete with all the popular links as well as computation records from some of your favorite newsletters. You won't want to miss out on what's winning the internet!
How to Use This Dataset
This dataset contains the most popular links shared in newsletters, allowing you to explore and visualize internet trends. With this data, you can analyze what topics are becoming more or less popular over time, as well as identify patterns in the type of content people are consuming online.
Using this dataset is a great way to uncover insights into what people are interested in, making it perfect for marketers who want to learn more about their target audience's preferences and interests.
To get started with this dataset:
1. Download the CSV file containing the data.
2. Open the CSV file and inspect its columns: 'url' (the URL of the link shared in the newsletter), 'date' (the date the link was shared), and 'flag' (a flag indicating whether or not the link was shared in multiple newsletters).
3. Analyse each column individually, or combine columns for further analysis by sorting them according to different parameters, e.g., sort by the 'flag' column to identify frequently appearing links or by the 'url' column to find unique URLs.
4. Look at each piece of data carefully to understand which URLs are being shared across newsletters. Look for similarities and differences between columns so that you can gain insights into how people's interests change over time, cross-referencing between different newsletters as appropriate.
5. Create visualizations, such as scatter plots and line graphs, of your findings from step 4 that show correlations between variables, for example how often URLs are mentioned over time.
6. Share your findings via a blog post or presentation showcasing your newfound knowledge, helping other marketers better target similar audiences.
- Analyzing the most effective newsletters and the effectiveness of their content.
- Evaluating and comparing newsletters based on metrics such as the quality of links shared and/or readership growth within a certain timeframe.
- Visualizing time trends in individual newsletter popularity, or tracking which topics are proving to be favorites across many newsletters over time.
If you use this dataset in your research, please credit the original authors.
Data Source: see the dataset description for more information.
File: dump-2020-12-15.csv

| Column name | Description |
|:------------|:------------|
| url | The URL of the link shared in the newsletter. (String) |
| date | The date the link was shared in the newsletter. (Date) |
| flag | A flag indicating if the link was shared in multiple newsletters. (Boolean) |
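As a quick illustration of the steps above, here is a minimal, hedged pandas sketch for loading the file and counting how often each URL appears. The column names follow the table above, but the file path and the assumption that the flag column is stored as a boolean are not confirmed by the dataset description.

```python
import pandas as pd

# Path is an assumption; adjust to wherever you saved the download.
df = pd.read_csv("dump-2020-12-15.csv", parse_dates=["date"])

# Most frequently shared URLs across all newsletters.
top_links = df["url"].value_counts().head(10)
print(top_links)

# Links flagged as appearing in multiple newsletters, counted per month
# (assumes the flag column is a boolean; adjust the filter if it is stored as text).
multi = df[df["flag"] == True]
monthly_counts = multi.groupby(multi["date"].dt.to_period("M")).size()
print(monthly_counts)
```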
If you use this dataset in your research, please credit Amber Thomas.
"This dataset contains information on a selection of popular TV shows from various networks and countries. It includes information on the show's title, year of release, genre, main cast, and a brief summary of the plot. With this dataset, one can analyze trends in TV show popularity and content over time, as well as investigate the characteristics of successful TV shows. The data is suitable for a variety of applications, such as content recommendation systems, media research, and entertainment industry analysis." You can find the dataset in the Tv_shows.csv file.
Public Domain Mark 1.0: https://creativecommons.org/publicdomain/mark/1.0/
License information was derived automatically
List of the ten most visited datasets in the open data portal Schleswig-Holstein per month.
Only calls to the record metadata (the description page) via the web interface are counted. API calls and downloads of the associated files are not counted. It can therefore happen that a record description is retrieved only once while the underlying data file is then accessed regularly; such accesses are not included in these figures.
Views of records belonging to a time series were summed into a single entry.
The following fields are available:
β βmonthβ β in the format βyyyy-mmβ β βNumberβ β Number of views β βURLβ β Address of the record in the open data portal
The column separator is a comma.
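For orientation, a minimal sketch of reading such a monthly export with pandas; the file name is a placeholder, and the column labels follow the field list above.

```python
import pandas as pd

# File name is a placeholder; download the CSV from the open data portal first.
df = pd.read_csv("top10-datasets-per-month.csv", sep=",")

# Ten most viewed record pages for the most recent month in the file.
latest = df[df["month"] == df["month"].max()]
print(latest.sort_values("Number", ascending=False)[["month", "Number", "URL"]].head(10))
```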
https://dataverse.csuc.cat/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.34810/data656
The different algorithms of the "imbalanced-learn" toolbox are evaluated on a set of common datasets with varying degrees of class imbalance. These benchmarks were proposed in Ding, Zejin, "Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics." Dissertation, Georgia State University (2011).
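As context for how such benchmarks are typically used, here is a hedged sketch with the imbalanced-learn API. It fetches the benchmark collection bundled with the library (assumed to correspond to the datasets described above, though this deposit may be a separate copy) and applies one resampler as an example.

```python
from collections import Counter

from imblearn.datasets import fetch_datasets
from imblearn.over_sampling import SMOTE

# Fetch the imbalanced benchmark collection distributed with imbalanced-learn.
datasets = fetch_datasets()
ecoli = datasets["ecoli"]
X, y = ecoli.data, ecoli.target
print("original class distribution:", Counter(y))

# Rebalance the minority class with SMOTE as one example algorithm.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("resampled class distribution:", Counter(y_res))
```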
By Chase Willden [source]
The Hulu Shows dataset is a comprehensive collection of information on the top 1,000 most popular shows available on the streaming platform Hulu. This dataset provides detailed insights into each show, including key details, availability, ratings, and other relevant information.
The dataset aims to provide an objective analysis of Hulu's show offerings by offering a wide range of data points. It allows users to understand the diversity and popularity of shows available on Hulu and make informed decisions based on their preferences.
Each entry in the dataset includes essential details about the shows like title, genre(s), runtime, release year, language(s), country of origin, description or summary of the plot. Additionally, it provides information about key cast members involved in each show.
Availability and accessibility for users interested in watching these shows on Hulu's platform are covered as well; this includes details such as whether a show is still ongoing or has ended its run. It also specifies whether all seasons are available for streaming or only selected seasons.
Ratings play an important role when choosing what show to watch; therefore this dataset includes various rating metrics like IMDb rating (based on user ratings), Rotten Tomatoes critics score (based on professional reviews), Rotten Tomatoes audience score (based on viewer feedback), and Metacritic score (aggregated from multiple sources).
To capture viewership trends more comprehensively, information about how many episodes are available for each show, along with episode durations, is included. This can give insights into binge-watching potential or help evaluate whether shorter episodes might be preferred over longer ones.
Furthermore, since streaming quality matters to the user experience, data on the video resolution options (e.g., SD or HD) that Hulu provides for each specific series has been recorded too.
Lastly, additional aspects worth considering when selecting which shows to invest time in include whether there are any parental warnings due to explicit content in certain programs, and whether subtitles are available, which can encourage users with hearing impairments to explore the dataset and find suitable, accessible content.
The Hulu Shows dataset has been meticulously collated and organized to provide a comprehensive overview of the most popular shows on Hulu. It can serve as a valuable resource for users, researchers, or analysts looking to evaluate the streaming platform's offerings and make informed decisions about their entertainment choices.
Understanding the Columns
Before diving into any analysis, it's crucial to understand the meaning of each column in the dataset. Here's a brief explanation of each column:
- Show Name: The name/title of the show.
- Genre(s): The genre(s) or category/categories to which the show belongs.
- Run Time: The duration in minutes for each episode or average duration across episodes.
- Number of Seasons: Total number of seasons available for the show.
- Rating: Average viewer rating for the show ranging from 0-10 (provided by users).
- Description: Brief summary or synopsis describing what the show is about.
- Episodes: Numbered list containing episode names along with their respective release dates (if available).
- Year Released: Year when the series was initially released.
- IMDB Rating: Ratings provided by IMDB users on a scale from 0-10.
- Hulu Link, Poster Link, IMDB Link, IMDB Poster Link: URL links providing access to additional information about each specific show.
Exploring Different Genres
One interesting aspect that can be explored using this dataset is analyzing different genres and their popularity on Hulu. You can create visualizations showing which genres have more shows available compared to others.
For example:
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('hulu_popular_shows_dataset.csv')

# Count the number of shows per genre
genre_counts = df['Genre(s)'].value_counts().sort_values(ascending=False)

# Plot a bar chart to visualize the counts by genre
plt.figure(figsize=(12, 6))
genre_counts.plot(kind='bar')
plt.title('Number of Shows per Genre on Hulu')
plt.xlabel('Genre')
plt.ylabel('Number of Shows')
plt.tight_layout()
plt.show()
Comma v0.1 dataset
This repository contains the dataset used to train Comma v0.1-1T and Comma v0.1-2T. It is a slightly modified and consolidated version of the Common Pile v0.1 "filtered" data. If you are looking for the raw Common Pile v0.1 data, please see this collection. You can learn more about Common Pile in our paper.
Mixing rates and token counts
The Comma v0.1 models were trained in two stages, a "main" stage and a "cooldown" stage. During each stage, we… See the full description on the dataset page: https://huggingface.co/datasets/common-pile/comma_v0.1_training_dataset.
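For anyone who wants to peek at the data, a minimal sketch using the Hugging Face datasets library; it assumes the repository ID in the URL above loads with the standard loader and has a "train" split, and it streams rather than downloading the full corpus.

```python
from datasets import load_dataset

# Stream the corpus rather than downloading it in full (it is trillion-token scale).
ds = load_dataset(
    "common-pile/comma_v0.1_training_dataset",
    split="train",          # split name is an assumption; check the dataset card
    streaming=True,
)

# Inspect the first record's fields (field names are whatever the repo defines).
first = next(iter(ds))
print(first.keys())
```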
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This heart disease dataset is curated by combining 5 popular heart disease datasets already available independently but not combined before. In this dataset, 5 heart datasets are combined over 11 common features which makes it the largest heart disease dataset available so far for research purposes. The five datasets used for its curation are:
This dataset consists of 1,190 instances with 11 features. These datasets were collected and combined in one place to help advance research on CAD-related machine learning and data mining algorithms and, hopefully, to ultimately advance clinical diagnosis and early treatment.
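As an illustration of a typical workflow on such a table, a hedged scikit-learn sketch follows; the file name, the target column name, and the assumption that all 11 features are numeric are placeholders, not details confirmed by the dataset description.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# File and column names are placeholders; adjust to the actual download.
df = pd.read_csv("heart_disease_combined.csv")
X = df.drop(columns=["target"])   # assumed label column
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```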
Photo by Kenny Eliason on Unsplash
The oceanographic time series data collected by U.S. Geological Survey scientists and collaborators are served in an online database at http://stellwagen.er.usgs.gov/index.html. These data were collected as part of research experiments investigating circulation and sediment transport in the coastal ocean. The experiments (projects, research programs) are typically one month to several years long and have been carried out since 1975. New experiments will be conducted, and the data from them will be added to the collection. As of 2016, all but one of the experiments were conducted in waters abutting the U.S. coast; the exception was conducted in the Adriatic Sea. Measurements acquired vary by site and experiment; they usually include current velocity, wave statistics, water temperature, salinity, pressure, turbidity, and light transmission from one or more depths over a time period. The measurements are concentrated near the sea floor but may also include data from the water column. The user interface provides an interactive map, a tabular summary of the experiments, and a separate page for each experiment. Each experiment page has documentation and maps that provide details of what data were collected at each site. Links to related publications with additional information about the research are also provided. The data are stored in Network Common Data Format (netCDF) files using the Equatorial Pacific Information Collection (EPIC) conventions defined by the National Oceanic and Atmospheric Administration (NOAA) Pacific Marine Environmental Laboratory. NetCDF is a general, self-documenting, machine-independent, open source data format created and supported by the University Corporation for Atmospheric Research (UCAR). EPIC is an early set of standards designed to allow researchers from different organizations to share oceanographic data. The files may be downloaded or accessed online using the Open-source Project for a Network Data Access Protocol (OPeNDAP). The OPeNDAP framework allows users to access data from anywhere on the Internet using a variety of Web services including Thematic Realtime Environmental Distributed Data Services (THREDDS). A subset of the data compliant with the Climate and Forecast convention (CF, currently version 1.6) is also available.
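To give a sense of how such netCDF/EPIC files are typically accessed, here is a hedged xarray sketch. The OPeNDAP URL and the variable name below are placeholders, not endpoints or codes taken from the USGS catalog, and remote access assumes your netCDF stack was built with DAP support.

```python
import xarray as xr

# Placeholder OPeNDAP endpoint; substitute a real URL from the experiment pages.
url = "https://example.usgs.gov/thredds/dodsC/some_experiment/some_mooring.nc"

# Open the remote dataset lazily and inspect its variables and EPIC/CF metadata.
ds = xr.open_dataset(url)
print(ds)

# Example: pull a temperature time series if such a variable exists
# ("T_28" is a commonly used EPIC temperature code, but the name is an assumption).
if "T_28" in ds:
    temp = ds["T_28"].squeeze()
    print(temp.isel(time=slice(0, 5)).values)
```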
British English Phonetic Dataset
Introduction
This dataset is an extension of Common Voice, from which 6 subsets were selected (Common Voice Corpus 1, Common Voice Corpus 2, Common Voice Corpus 3, Common Voice Corpus 4, Common Voice Corpus 18.0, Common Voice Corpus 19.0). All data containing the England accent from these 6 subsets were extracted and phonetically annotated accordingly.
Description
Key fields explanation:
sentence: The English sentence… See the full description on the dataset page: https://huggingface.co/datasets/zdm-code/england-phoneme-dataset.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dive into the future of education with the Deep Learning Tutor Dataset β a pioneering resource designed to empower the creation of sophisticated, adaptive AI tutors. This dataset is meticulously curated to facilitate the fine-tuning of advanced large language models like GPT-4o, enabling them to internalize specialized pedagogical conversation patterns and expert teaching methodologies.
This collection represents a significant step towards developing intelligent educational systems that can truly adapt to individual student needs, provide nuanced feedback, and foster deeper understanding. By leveraging the power of deep learning and state-of-the-art LLMs, this dataset paves the way for a new generation of personalized learning experiences.
The primary utility of this dataset is to fine-tune a powerful LLM like GPT-4o, imbuing it with the specific conversational and pedagogical skills required for adaptive tutoring.
Prerequisites:
* An OpenAI account with API access.
* Familiarity with the OpenAI Platform and fine-tuning concepts.
Step 1: Download the Dataset
Download the educational_conversation_data.jsonl file from this Kaggle dataset.
Step 2: Initiate GPT-4o Fine-tuning
This process will train GPT-4o to emulate the expert teaching methodologies embedded within the dataset.
1. Upload Data: Navigate to the "Fine-tuning" section in your OpenAI Platform. Upload the educational_conversation_data.jsonl file.
2. Create Fine-tuning Job:
* Base Model: gpt-4o (or gpt-4o-mini for more cost-effective experimentation).
* Epochs: 3 (A common starting point; adjust based on dataset size and desired performance).
* Learning Rate Multiplier: 2 (A good initial value; can be tuned).
* Batch Size: 1 (Often effective for pedagogical data, but can be adjusted).
* Note: These parameters are recommendations. Experimentation may be required to achieve optimal results for your specific application.
3. Start Job: Initiate the fine-tuning process. Once complete, you will receive a new custom model ID, representing your fine-tuned pedagogical AI.
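If you prefer doing this from code rather than the platform UI, a minimal sketch with the OpenAI Python SDK (v1.x) is shown below. The hyperparameter values mirror the recommendations above; the exact model snapshot name is an assumption and should be replaced with one available to your account.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Upload the training file for fine-tuning.
training_file = client.files.create(
    file=open("educational_conversation_data.jsonl", "rb"),
    purpose="fine-tune",
)

# Create the fine-tuning job with the suggested hyperparameters.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # snapshot name is an assumption
    hyperparameters={
        "n_epochs": 3,
        "learning_rate_multiplier": 2,
        "batch_size": 1,
    },
)
print("fine-tuning job started:", job.id)
```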
Step 3: Integrate Your Fine-tuned Model
The fine-tuned model ID can now be used with OpenAI's API to power your adaptive AI tutor (a minimal call sketch follows below). You can integrate it into:
* A custom chat interface.
* An existing educational platform.
* A research prototype for conversational AI in education.
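Once the job finishes, calling the resulting model looks like any other chat completion; the model ID below is a made-up example of the ft: format, and the prompts are illustrative only.

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:your-org::abc12345",  # hypothetical fine-tuned model ID
    messages=[
        {"role": "system", "content": "You are an adaptive tutor."},
        {"role": "user", "content": "Can you explain gradient descent with an analogy?"},
    ],
)
print(response.choices[0].message.content)
```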
educational_conversation_data.jsonl: The core dataset containing the specialized pedagogical conversation patterns and expert teaching methodologies, formatted for OpenAI fine-tuning.
README.md: (Optional, but good practice) A brief overview of the dataset and usage.
The data set combines the best available roads data by country into a global roads coverage, using the UN Spatial Data Infrastructure Transport (UNSDI-T) version 2 as a common data model. The purpose is to provide an open access, well documented global data set of roads between settlements using a consistent data model (UNSDI-T v.2) which is, to the extent possible, topologically integrated.
Dataset Summary
The Global Roads Open Access Data Set, Version 1 (gROADSv1) was developed under the auspices of the CODATA Global Roads Data Development Task Group. The data set combines the best available roads data by country into a global roads coverage, using the UN Spatial Data Infrastructure Transport (UNSDI-T) version 2 as a common data model. All country road networks have been joined topologically at the borders, and many countries have been edited for internal topology. Source data for each country are provided in the documentation, and users are encouraged to refer to the readme file for use constraints that apply to a small number of countries. Because the data are compiled from multiple sources, the date range for road network representations ranges from the 1980s to 2010 depending on the country (most countries have no confirmed date), and spatial accuracy varies. The baseline global data set was compiled by the Information Technology Outreach Services (ITOS) of the University of Georgia. Updated data for 27 countries and 6 smaller geographic entities were assembled by Columbia University's Center for International Earth Science Information Network (CIESIN), with a focus largely on developing countries with the poorest data coverage.
Documentation for the Global Roads Open Access Data Set, Version 1 (gROADSv1)
Recommended Citation
Center for International Earth Science Information Network - CIESIN - Columbia University, and Information Technology Outreach Services - ITOS - University of Georgia. 2013. Global Roads Open Access Data Set, Version 1 (gROADSv1). Palisades, NY: NASA Socioeconomic Data and Applications Center (SEDAC). http://dx.doi.org/10.7927/H4VD6WCT. Accessed DAY MONTH YEAR.
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
License information was derived automatically
This data collection contains Hydrodynamic Model output data produced by the Sydney Harbour Hydrodynamic Model.
The Sydney Harbour (real-time) model collates observations from the Bureau of Meteorology, Macquarie University, Sydney Ports Authority and the Manly Hydraulics Laboratory offshore buoy. The Sydney Harbour Model is contained within the Sydney Harbour Observatory (SHO) system.
The Sydney Harbour Hydrodynamic Model divides the Harbour water into a number of boxes or voxels. Each voxel is less than 60m x 60m x 1m in depth. In narrow parts of the Harbour, or in shallower regions, the voxels are smaller. Layers are numbered - so the sea floor is number 1 and the surface is number 24.
The model is driven by the conditions on the boundaries. It uses rainfall rates at 13 sites in the Sydney catchment, the wind speed, tide height, the solar radiation and astronomical tides. Every hour the display is refreshed.
The model utilizes the following environmental data inputs:
The hydrodynamic modeling system models the following environmental variables:
This dataset is available in Network Common Data Form - Climate and Forecast (NetCDF-CF) format.
The dataset was generated from a set of Excel spreadsheets extracted from an Information and Communication Technology Services (ICTS) administrative database on student applications to the University of Cape Town (UCT). The data in this second part of the series contain information on applications to UCT made between January 2015 and September 2019.
In the original form received by DataFirst, the data were ill-suited to research purposes. The series represents an attempt at cleaning and organizing the data into a more tractable format.
Individuals, applications
All applications to study at the University of Cape Town
Administrative records data
Other [oth]
In order to lessen computation times, the main applications file was split by year; this part contains the years 2014-2019. Note, however, that the other 3 files released with the application file (which can be merged into it for additional detail) did not need to be split. As such, the four files can be used to produce a series for 2014-2019 and are labelled as such, even though the person, secondary schooling and tertiary education files all span a longer time period.
Here is additional information about the files:
Further information on the processing of the original data files is summarised in a document entitled "Notes on preparing the UCT Student Admissions Data" accompanying the data.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The DermaEvolve dataset is a comprehensive collection of skin lesion images, sourced from publicly available datasets and extended with additional rare diseases. This dataset aims to aid in the development and evaluation of machine learning models for dermatological diagnosis.
The dataset is primarily derived from:
- HAM10000 (Kaggle link) - A collection of dermatoscopic images with various skin lesion types.
- ISIC Archive (Kaggle link) - A dataset of skin cancer images categorized into multiple classes.
- Dermnet NZ (https://dermnetnz.org/) - Used to source additional rare diseases for dataset extension.
- Google Database - Images
The dataset includes images of the following skin conditions:
To enhance diversity, the following rare skin conditions were added from Dermnet NZ:
- Elastosis Perforans Serpiginosa
- Lentigo Maligna
- Nevus Sebaceus
- Blue Naevus
Figure: Original dataset distribution.
Special thanks to the authors of the original datasets:
- HAM10000 - Tschandl P, Rosendahl C, Kittler H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions.
- ISIC Archive - International Skin Imaging Collaboration (ISIC), a repository for dermatology imaging.
- Dermnet NZ - A valuable resource for dermatological images.
This dataset can be used for:
- Training deep learning models for skin lesion classification (see the sketch after this list).
- Research on dermatological image analysis.
- Development of computer-aided diagnostic tools.
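As a hedged illustration of the first use case, here is a short PyTorch/torchvision sketch. It assumes the images have been organized into one sub-folder per class; the actual archive layout of this dataset may differ, and the folder path is a placeholder.

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

# Assumed layout: dermaevolve/train/<class_name>/*.jpg (adjust to the real archive structure).
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
train_ds = datasets.ImageFolder("dermaevolve/train", transform=transform)
train_dl = DataLoader(train_ds, batch_size=32, shuffle=True)

# A standard baseline: adapt a pretrained ResNet head to the lesion classes.
model = models.resnet18(weights="IMAGENET1K_V1")
model.fc = torch.nn.Linear(model.fc.in_features, len(train_ds.classes))
print(f"{len(train_ds)} images across {len(train_ds.classes)} classes")
```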
Please cite the original datasets if you use this resource in your work.
Check out the github repository for the streamlit application that focuses on skin disease prediction --> https://github.com/LokeshBhaskarNR/DermaEvolve---An-Advanced-Skin-Disease-Predictor.git
Streamlit Application Link : https://dermaevolve.streamlit.app/
Kindly check out my notebooks for the processed models and code -->
Check out my NoteBooks on multiple models trained on this dataset :
Voice conversion (VC) is a technique to transform a speaker identity included in a source speech waveform into a different one while preserving linguistic information of the source speech waveform.
In 2016, we launched the Voice Conversion Challenge (VCC) 2016 [1][2] at Interspeech 2016. The objective of the 2016 challenge was to better understand different VC techniques built on a freely available common dataset with a common goal, and to share views about unsolved problems and challenges faced by current VC techniques. The VCC 2016 focused on the most basic VC task, that is, the construction of VC models that automatically transform the voice identity of a source speaker into that of a target speaker using a parallel clean training database where source and target speakers read out the same set of utterances in a professional recording studio. 17 research groups participated in the 2016 challenge. The challenge was successful and it established a new standard evaluation methodology and protocols for benchmarking the performance of VC systems.
In 2018, we launched the second edition of VCC, the VCC 2018 [3]. In the second edition, we revised three aspects of the challenge. First, we reduced the amount of speech data used for the construction of participants' VC systems by half. This was based on feedback from participants in the previous challenge and is also essential for practical applications. Second, we introduced a more challenging task, referred to as the Spoke task, in addition to a task similar to the 1st edition, which we call the Hub task. In the Spoke task, participants need to build their VC systems using a non-parallel database in which source and target speakers read out different sets of utterances. We then evaluated both parallel and non-parallel voice conversion systems via the same large-scale crowdsourced listening test. Third, we also attempted to bridge the gap between the ASV and VC communities. Since new VC systems developed for the VCC 2018 may be strong candidates for enhancing the ASVspoof 2015 database, we also assessed the spoofing performance of the VC systems based on anti-spoofing scores.
In 2020, we launched the third edition of VCC, the VCC 2020 [4][5]. In this third edition, we constructed and distributed a new database for two tasks, intra-lingual semi-parallel and cross-lingual VC. The dataset for intra-lingual VC consists of a smaller parallel corpus and a larger nonparallel corpus, where both of them are of the same language. The dataset for cross-lingual VC consists of a corpus of the source speakers speaking in the source language and another corpus of the target speakers speaking in the target language. As a more challenging task than the previous ones, we focused on cross-lingual VC, in which the speaker identity is transformed between two speakers uttering different languages, which requires handling completely nonparallel training over different languages.
As for the listening test, we subcontracted the crowd-sourced perceptual evaluation with English and Japanese listeners to Lionbridge Technologies Inc. and Koto Ltd., respectively. Given the extremely large costs required for the perceptual evaluation, we selected only 5 utterances (E30001, E30002, E30003, E30004, E30005) from each speaker of each team. To evaluate the speaker similarity of the cross-lingual task, we used audio in both the English language and in the target speaker's L2 language as reference. For each source-target speaker pair, we selected three English recordings and two L2 language recordings as the natural reference for the converted five utterances.
This data repository includes the audio files used for the crowd-sourced perceptual evaluation and raw listening test scores.
[1] Tomoki Toda, Ling-Hui Chen, Daisuke Saito, Fernando Villavicencio, Mirjam Wester, Zhizheng Wu, Junichi Yamagishi, "The Voice Conversion Challenge 2016", Proc. Interspeech 2016, San Francisco.
[2] Mirjam Wester, Zhizheng Wu, Junichi Yamagishi, "Analysis of the Voice Conversion Challenge 2016 Evaluation Results", Proc. Interspeech 2016.
[3] Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki Toda, Daisuke Saito, Fernando Villavicencio, Tomi Kinnunen, Zhenhua Ling, "The Voice Conversion Challenge 2018: Promoting Development of Parallel and Nonparallel Methods", Proc Speaker Odyssey 2018, June 2018.
[4] Yi Zhao, Wen-Chin Huang, Xiaohai Tian, Junichi Yamagishi, Rohan Kumar Das, Tomi Kinnunen, Zhenhua Ling, and Tomoki Toda. "Voice conversion challenge 2020: Intra-lingual semi-parallel and cross-lingual voice conversion" Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, 80-98, DOI: 10.21437/VCC_BC.2020-14.
[5] Rohan Kumar Das, Tomi Kinnunen, Wen-Chin Huang, Zhenhua Ling, Junichi Yamagishi, Yi Zhao, Xiaohai Tian, and Tomoki Toda. "Predictions of subjective ratings and spoofing assessments of voice conversion challenge 2020 submissions." Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, 99-120, DOI: 10.21437/VCC_BC.2020-15.
This data release consists of Network Common Data Form (NetCDF) data sets of daily total-precipitation and minimum and maximum air temperatures for the time period from January 1, 1895 to December 31, 1915. These data sets are based on individual station data obtained for 153 National Oceanic and Atmospheric Administration (NOAA) weather stations in Florida and parts of Georgia, Alabama, and South Carolina (available at http://www.ncdc.noaa.gov/cdo-web/results). Weather station data were used to produce a total of 23,007 daily raster surfaces (7,669 daily raster surfaces for each of the 3 data sets) using a thin-plate-spline method of interpolation. The geographic extent of the weather station data coincides with the geographic extent of the Floridan aquifer system, with the exception of a small portion of southeast Mississippi where the Floridan aquifer system is saline and was not used.
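The interpolation step described above can be reproduced in spirit with SciPy's RBFInterpolator; the sketch below uses synthetic station coordinates and values, not the actual NOAA station data, and the grid and units are made up for illustration.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

# Synthetic "station" data: 153 random (x, y) locations with daily precipitation values.
rng = np.random.default_rng(0)
stations = rng.uniform(0, 500, size=(153, 2))       # km easting/northing (made up)
precip = rng.gamma(shape=2.0, scale=3.0, size=153)  # mm/day (made up)

# Thin-plate-spline interpolator fitted to the station observations.
tps = RBFInterpolator(stations, precip, kernel="thin_plate_spline")

# Evaluate on a regular grid to produce one daily raster surface.
gx, gy = np.meshgrid(np.linspace(0, 500, 200), np.linspace(0, 500, 200))
grid_points = np.column_stack([gx.ravel(), gy.ravel()])
surface = tps(grid_points).reshape(gx.shape)
print(surface.shape, surface.min(), surface.max())
```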
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains a collection of the most popular songs on Spotify, along with various attributes that can be used for music analysis and recommendation systems. It includes audio features, lyrical details, and general metadata about each track, making it an excellent resource for machine learning, data science, and music analytics projects.
Each song in the dataset includes the following features:
π§ Audio Features (Extracted from Spotify API):
π Lyrics-Based Features:
πΆ General Song Information:
This dataset is ideal for:
Data collected using the Spotify API and other sources. If you use this dataset, consider crediting it in your projects!