6 datasets found
  1. Course_summaries_dataset

    • huggingface.co
    Updated Apr 6, 2023
    Cite
    recapper (2023). Course_summaries_dataset [Dataset]. https://huggingface.co/datasets/recapper/Course_summaries_dataset
    Dataset updated
    Apr 6, 2023
    Dataset authored and provided by
    recapper
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    The dataset consists of transcripts from a range of YouTube videos, from fastai and FSDL (Full Stack Deep Learning) lessons to miscellaneous instructional videos. In total it contains 600 YouTube chapter markers and 25,000 lesson transcript entries. The dataset can be used for NLP tasks such as summarization and topic segmentation. Some of the models trained with this dataset are referenced in the GitHub repository for Full Stack Deep Learning 2022 projects.
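
    As a rough illustration of how this dataset could feed a summarization or topic-segmentation experiment, the sketch below loads it with the Hugging Face `datasets` library. Only the repository ID comes from the citation above; the split and column names are assumptions to check against the dataset card.

    ```python
    # Minimal sketch: load the corpus and inspect one record.
    # Repo ID taken from the listing; split and column names are assumptions.
    from datasets import load_dataset

    ds = load_dataset("recapper/Course_summaries_dataset")

    print(ds)                      # shows the available splits and columns
    first_split = next(iter(ds.values()))
    print(first_split[0])          # one record, e.g. transcript plus chapter summary
    ```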

  2. Segments textuels consolidés - Consolidated Textual Segments - Hérelles Project

    • entrepot.recherche.data.gouv.fr
    pdf, tsv, zip
    Updated Mar 18, 2024
    + more versions
    Cite
    Margaux Holveck; Maksim Koptelov; Mathieu Roche; Maguelonne Teisseire (2024). Segments textuels consolidés - Consolidated Textual Segments - Hérelles Project [Dataset]. http://doi.org/10.57745/XIVJ65
    Explore at:
    Available download formats: zip (148129), pdf (655744), tsv (682), zip (8577), pdf (622295), zip (166810)
    Dataset updated
    Mar 18, 2024
    Dataset provided by
    Recherche Data Gouv
    Authors
    Margaux Holveck; Maksim Koptelov; Mathieu Roche; Maguelonne Teisseire
    License

    Etalab Open License 2.0 (https://spdx.org/licenses/etalab-2.0.html)

    Description

    One of the objectives of the Hérelles project is to discover new mechanisms to facilitate the labeling (or semantic annotation) of clusters extracted from time series of satellite images. To achieve this, a proposed solution is to associate textual elements of interest (relevant to the study's theme and to the spatiotemporal scope of the time series) with the satellite data. This dataset is a consolidated version of the "Hérelles Textual Segments" dataset. It includes a thematically collected and manually annotated corpus, as well as the code and the results of an automatic extraction method for textual elements of interest. It comprises the following elements:

    - The file "Corpus_Expert_Links" presents the thematic corpus used, with links to its constituent documents. These documents were chosen for their richness in rules and constraints regarding land use.
    - The file "Lisez_Moi_Consolidated_Version" is the consolidated version of the initial annotation protocol, providing definitions of the terms used (segments, rules, etc.).
    - The file "Read_Me_Consolidated_Version" is the English version of the "Lisez_Moi" file.
    - The compressed folder "Corpus_Manual_Annotation_Consolidated_Version" contains the manually annotated versions of the documents of interest in txt format.
    - The compressed folder "Corpus_Extracted_Segments_Consolidated_Version" contains the consolidated results of the automatic segmentation process applied to the documents of interest, along with labels for the four classes (Verifiable, Non-verifiable, Informative, and Not pertinent).
    - The compressed folder "LUPAN_code" contains the code for corpus construction and preliminary evaluation of LUPAN: extraction of text from the PDF documents, construction of segments from the text documents, preparation of the data for evaluation, and evaluation experiments using a state-of-the-art method (CamemBERT).

    Our corpus is available in the Hugging Face library and can be loaded directly in Python: https://huggingface.co/datasets/Herelles/lupan. In addition, we fine-tuned a model on top of CamemBERT using LUPAN, which is also available on Hugging Face: https://huggingface.co/Herelles/camembert-base-lupan. Finally, we developed a demo that demonstrates the capabilities of our corpus and this model: https://huggingface.co/spaces/Herelles/segments-lupan
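
    A minimal sketch of the Python loading path mentioned above, combined with classification using the fine-tuned CamemBERT model. The two repository IDs are taken from the description; the split name, column name, and the exact label strings returned by the model are assumptions to verify on the Hugging Face pages.

    ```python
    # Sketch: load the LUPAN corpus and classify one segment into the four
    # classes (Verifiable, Non-verifiable, Informative, Not pertinent).
    # Repo IDs come from the description; split/column/label names are assumptions.
    from datasets import load_dataset
    from transformers import pipeline

    lupan = load_dataset("Herelles/lupan")
    classifier = pipeline("text-classification", model="Herelles/camembert-base-lupan")

    example = lupan["train"][0]          # "train" split assumed
    print(example)
    print(classifier(example["text"]))   # "text" column assumed
    ```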

  3. ytseg

    • huggingface.co
    Updated Feb 28, 2024
    Cite
    Fabian Retkowski (2024). ytseg [Dataset]. http://doi.org/10.57967/hf/1824
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 28, 2024
    Authors
    Fabian Retkowski
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    From Text Segmentation to Smart Chaptering: A Novel Benchmark for Structuring Video Transcriptions

    We present YTSeg, a topically and structurally diverse benchmark for the text segmentation task based on YouTube transcriptions. The dataset comprises 19,299 videos from 393 channels, amounting to 6,533 content hours. The topics are wide-ranging, covering domains such as science, lifestyle, politics, health, economy, and technology. The videos are from various types of content formats… See the full description on the dataset page: https://huggingface.co/datasets/retkowski/ytseg.
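
    A brief, hedged example of loading the benchmark for a segmentation experiment: the repository ID comes from the URL above, but the split names and field layout (e.g. transcript text versus segment boundaries) are assumptions to confirm on the dataset card.

    ```python
    # Sketch: load YTSeg and look at one transcript record.
    # Repo ID taken from the listing URL; splits and columns are assumptions.
    from datasets import load_dataset

    ytseg = load_dataset("retkowski/ytseg")

    print(ytseg)               # expected: train/validation/test style splits
    sample = next(iter(ytseg.values()))[0]
    print(sample.keys())       # e.g. transcript text and chapter/segment annotations
    ```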

  4. OAIZIB-CM: Dataset from the CartiMorph Project

    • zenodo.org
    zip
    Updated Feb 27, 2025
    Cite
    Yongcheng Yao (2025). OAIZIB-CM: Dataset from the CartiMorph Project [Dataset]. http://doi.org/10.5281/zenodo.14934086
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 27, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Yongcheng Yao
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Time period covered
    Feb 27, 2025
    Description

    This is the official release of the OAIZIB-CM dataset.

    For convenient dataset download in Python, please refer to the Hugging Face release of the same dataset:
    https://huggingface.co/datasets/YongchengYAO/OAIZIB-CM
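
    Since the release ships as zip archives, one hedged way to fetch the Hugging Face mirror mentioned above is `huggingface_hub.snapshot_download`. The repository ID comes from the URL; the local directory name is an arbitrary choice.

    ```python
    # Sketch: download the Hugging Face release of OAIZIB-CM.
    # Repo ID taken from the URL above; local_dir is an arbitrary choice.
    from huggingface_hub import snapshot_download

    path = snapshot_download(
        repo_id="YongchengYAO/OAIZIB-CM",
        repo_type="dataset",
        local_dir="OAIZIB-CM",   # any local folder works
    )
    print(path)                  # folder containing the downloaded dataset files
    ```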

  5. gigaspeech

    • huggingface.co
    • opendatalab.com
    Updated Aug 30, 2022
    + more versions
    Cite
    SpeechColab (2022). gigaspeech [Dataset]. http://doi.org/10.57967/hf/6261
    Dataset updated
    Aug 30, 2022
    Dataset authored and provided by
    SpeechColab
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    GigaSpeech is an evolving, multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training, and 40,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 40,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous speaking styles, and a variety of topics, such as arts, science, sports, etc. A new forced alignment and segmentation pipeline is proposed to create sentence segments suitable for speech recognition training, and to filter out segments with low-quality transcription. For system training, GigaSpeech provides five subsets of different sizes, 10h, 250h, 1000h, 2500h, and 10000h. For our 10,000-hour XL training subset, we cap the word error rate at 4% during the filtering/validation stage, and for all our other smaller training subsets, we cap it at 0%. The DEV and TEST evaluation sets, on the other hand, are re-processed by professional human transcribers to ensure high transcription quality.
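
    The training subsets described above (10h up to 10,000h) correspond to named configurations on the Hugging Face Hub; the sketch below is a hedged example of loading one of them with `datasets`. The repository ID `speechcolab/gigaspeech` and the configuration name `xs` are assumptions (not stated in this listing), and access to the corpus is typically gated, so accepting the dataset terms and logging in with a Hugging Face token may be required.

    ```python
    # Sketch: load a small GigaSpeech subset for ASR experiments.
    # Repo ID, config name, and field names are assumptions; access may require
    # accepting the dataset's terms and authenticating with a Hugging Face token.
    from datasets import load_dataset

    gs = load_dataset("speechcolab/gigaspeech", "xs")  # "xs" assumed to be the ~10h subset

    sample = gs["train"][0]
    print(sample["text"])                    # transcription (field name assumed)
    print(sample["audio"]["sampling_rate"])  # audio field name assumed
    ```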

  6. ChatGPT-Research-Abstracts

    • huggingface.co
    • opendatalab.com
    Cite
    Nicolai Thorer Sivesind. ChatGPT-Research-Abstracts [Dataset]. https://huggingface.co/datasets/NicolaiSivesind/ChatGPT-Research-Abstracts
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    Nicolai Thorer Sivesind
    License

    https://choosealicense.com/licenses/cc/

    Description

    ChatGPT-Research-Abstracts

    This dataset was created in relation to a bachelor thesis written by Nicolai Thorer Sivesind and Andreas Bentzen Winje. It contains human-produced and machine-generated text samples of scientific research abstracts. A reformatted version for text classification is available in the dataset collection Human-vs-Machine. In that collection, all samples are split into separate data points for real and generated text, and labeled either 0 (human-produced) or 1… See the full description on the dataset page: https://huggingface.co/datasets/NicolaiSivesind/ChatGPT-Research-Abstracts.
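
    As a hedged sketch of how the human versus machine abstracts could be loaded for a text-classification experiment: the repository ID below comes from the URL in the description, while the split and column names (e.g. separate fields for the real and generated abstracts) are assumptions to check against the dataset card.

    ```python
    # Sketch: load the abstracts and inspect one record.
    # Repo ID taken from the description URL; splits and columns are assumptions.
    from datasets import load_dataset

    ds = load_dataset("NicolaiSivesind/ChatGPT-Research-Abstracts")

    print(ds)                            # available splits
    record = next(iter(ds.values()))[0]
    print(record.keys())                 # e.g. human-written vs. generated abstract fields
    ```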

