42 datasets found
  1. 📺 10,000+ Popular TV Shows Dataset (TMDB)

    • kaggle.com
    zip
    Updated Sep 20, 2025
    Cite
    Ritesh Swami (2025). 📺 10,000+ Popular TV Shows Dataset (TMDB) [Dataset]. https://www.kaggle.com/datasets/riteshswami08/10000-popular-tv-shows-dataset-tmdb
    Explore at:
    zip (2228816 bytes)
    Dataset updated
    Sep 20, 2025
    Authors
    Ritesh Swami
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    10k Popular TV Shows — Global TV show metadata (1944–2025)

    Subtitle: Episode & show metadata: titles, original titles & languages, genre IDs, origin country, popularity & vote metrics, poster/backdrop paths — ideal for NLP, recommender systems, and entertainment analytics.

    🙏🙏🙏 If you found this dataset helpful, please consider upvoting! ❤️❤️❤️

    Short description:
    This dataset contains 10,000 TV-show records with metadata including titles, original names & languages, genre IDs, origin country, popularity, vote metrics, overviews, poster/backdrop paths and first air dates. Great for building recommenders, NLP experiments, temporal analysis, and media analytics.

    Source / Provenance: Collected from TMDb API on 2025-08-19.

    License: CC BY-SA 4.0 (attribution — share-alike).

    What’s inside

    • Rows × Columns: 10,000 × 14
    • Date range (first_air_date): 1944-01-20 → 2025-10-26 (includes future-dated entries).
    • Columns: adult, backdrop_path, genre_ids, id, origin_country, original_language, original_name, overview, popularity, poster_path, first_air_date, name, vote_average, vote_count

    Columns (definition table)

    | Column | Type | Missing % | Description | Example |
    |---|---|---|---|---|
    | adult | int / bool | 0.00% | Adult-content flag (0 = no, 1 = yes). | 0 |
    | backdrop_path | string | 6.83% | Relative path to the backdrop image (nullable). | /gR7GmFJ0lCFqtv8h2yDfzX3ojx0.jpg |
    | genre_ids | list-like string | 0.00% | Genre IDs stored as a list-like string (e.g. "[18, 35]"). Parse with ast.literal_eval to get a list of ints. | "[18, 35]" |
    | id | int | 0.00% | Canonical integer id for the show; use as primary key when merging. | 1399 |
    | origin_country | list-like string | 0.00% | Country code(s) stored as a list-like string (e.g. "['US']"). Parse before grouping. | "['US']" |
    | original_language | string | 0.00% | ISO language code of the original show (e.g. en, ja). | en |
    | original_name | string | 0.00% | Original title in the original language. | Breaking Bad |
    | overview | string | 9.27% | Short synopsis/summary of the show (nullable). Great for NLP. | A high school chemistry teacher... |
    | popularity | float | 0.00% | Numeric popularity score (float). Useful as a proxy for demand. | 123.456 |
    | poster_path | string | 2.47% | Relative path to the poster image (nullable). | /abc123poster.jpg |
    | first_air_date | date (string) | 0.34% | Release / first air date. Convert to datetime; note some future dates (~2025). | 2013-01-20 |
    | name | string | 0.00% | Common/English title (not guaranteed unique). | Game of Thrones |
    | vote_average | float | 0.00% | Mean user rating (float). Consider filtering by vote_count. | 8.6 |
    | vote_count | int | 0.00% | Number of votes (int). Use to filter reliable ratings (e.g., >= 10). | 12534 |

    Key notes & quirks

    - genre_ids and origin_country are stored as list-like strings (e.g. "[18, 35]" or "['US']"). Parse them with ast.literal_eval or a JSON-aware parser before treating them as lists.
    - overview is missing for ~**9.3%** of rows — still suitable for most NLP tasks but expect some nulls.
    - backdrop_path missing ~**6.8%**; poster_path missing ~**2.5%**.
    - Some first_air_date values are future dates — verify before time-series modeling.
    - name is not guaranteed unique (remakes, different regions, duplicates). Use id as primary identifier.
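
    A minimal parsing sketch for these quirks (the file name tv_shows.csv is a placeholder; the column names are those listed above):

    import ast
    import pandas as pd

    # Placeholder file name; substitute the CSV shipped with the dataset.
    df = pd.read_csv("tv_shows.csv")

    # Parse the list-like string columns into real Python lists.
    for col in ["genre_ids", "origin_country"]:
        df[col] = df[col].apply(ast.literal_eval)

    # Convert first_air_date to datetime and flag entries dated after the collection date.
    df["first_air_date"] = pd.to_datetime(df["first_air_date"], errors="coerce")
    future_shows = df[df["first_air_date"] > pd.Timestamp("2025-08-19")]

    # Keep only shows with enough votes for a reliable rating.
    reliable = df[df["vote_count"] >= 10]

    # Top origin countries (requires the parsed lists).
    top_countries = df.explode("origin_country")["origin_country"].value_counts().head()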

    Small sample (first 3 rows)

    This quick preview shows structure and typical values:

    | id | name | original_name | lang | country | first_air_date | genre_ids | popularity | vote_avg | vote_cnt |
    |---|---|---|---|---|---|---|---|---|---|
    | 119051 | Wednesday | Wednesday | en | ['US'] | 2022-11-23 | [10765, 9648, 35] | 318.781 | 8.392 | 9781 |
    | 194766 | The Summer I Turned Pretty | The Summer I Turned Pretty | en | ['US'] | 2022-06-17 | [18] | 266.293 | 8.173 | 956 |
    | 157239 | Alien: Earth | Alien: Earth | en | ['US'] | 2021-04-08 | [10765, 18] | 229.496 | 7.708 | 427 |

    Quick EDA highlights

    • Top origin countries: US, JP, KR, CN, GB.
    • Top original languages: en, ja, zh, ko, es.
    • Overviews available for ~90.7% of rows — suitable for embeddings, topic modeling, summarization.
    • Popularity & vote metrics included (use vote_count to filter reliable ratings).

    Suggested uses / example projects

    • Content-based recommender using genre_ids + overview embeddings.
    • NLP: topic modeling, clustering, summarization, or fine-tuning on overview.
    • Popularity/rating prediction from metadata (genres, country, language, overview length).
    • Temporal analysis: show production/popularity trends by decade & country.
    • Image projects if you ma...
  2. Multilingual Healthcare Text Dataset (Hi, En, Pu)

    • kaggle.com
    zip
    Updated Feb 13, 2025
    Cite
    Kajol Bagga (2025). Multilingual Healthcare Text Dataset (Hi, En, Pu) [Dataset]. https://www.kaggle.com/datasets/kajolagga/multilingual-healthcare-text-dataset-hi-en-pu
    Explore at:
    zip (421647 bytes)
    Dataset updated
    Feb 13, 2025
    Authors
    Kajol Bagga
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains three healthcare datasets in Hindi and Punjabi, translated from English. The datasets cover medical diagnoses, disease names, and related healthcare information. The data has been carefully cleaned and formatted to ensure accuracy and usability for various applications, including machine learning, NLP, and healthcare analysis.

    • Diagnosis: Description of the medical condition or disease.
    • Symptoms: List of symptoms associated with the diagnosis.
    • Treatment: Common treatments or recommended procedures.
    • Severity: Severity level of the disease (e.g., mild, moderate, severe).
    • Risk Factors: Known risk factors associated with the condition.
    • Language: Specifies the language of the dataset (Hindi, Punjabi, or English).

    The purpose of these datasets is to facilitate research and development in regional language processing, especially in the healthcare sector.

    Column Descriptions

    Original Data Columns:

    • patient_id – Unique identifier for each patient.
    • age – Age of the patient.
    • gender – Gender of the patient (e.g., Male/Female/Other).
    • Diagnosis – The diagnosed medical condition or disease.
    • Remarks – Additional notes or comments from the doctor.
    • doctor_id – Unique identifier for the doctor treating the patient.
    • Patient History – Medical history of the patient, including previous conditions.
    • age_group – Categorized age group (e.g., Child, Adult, Senior).
    • gender_numeric – Numeric encoding for gender (e.g., 0 = Female, 1 = Male).
    • symptoms – List of symptoms reported by the patient.
    • treatment – Recommended treatment or medication.
    • timespan – Duration of the illness or treatment period.
    • Diagnosis Category – General category of the diagnosis (e.g., Cardiovascular, Neurological).

    Pseudonymized Data Columns: These columns replace personally identifiable information with anonymized versions for privacy compliance:

    • Pseudonymized_patient_id – An anonymized patient identifier.
    • Pseudonymized_age – Anonymized age value.
    • Pseudonymized_gender – Anonymized gender field.
    • Pseudonymized_Diagnosis – Diagnosis field with anonymized identifiers.
    • Pseudonymized_Remarks – Anonymized doctor notes.
    • Pseudonymized_doctor_id – Anonymized doctor identifier.
    • Pseudonymized_Patient History – Anonymized version of patient history.
    • Pseudonymized_age_group – Anonymized version of age groups.
    • Pseudonymized_gender_numeric – Anonymized numeric encoding of gender.
    • Pseudonymized_symptoms – Anonymized symptom descriptions.
    • Pseudonymized_treatment – Anonymized treatment descriptions.
    • Pseudonymized_timespan – Anonymized illness/treatment duration.
    • Pseudonymized_Diagnosis Category – Anonymized category of diagnosis.
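
    A minimal loading sketch (the file name healthcare_multilingual.csv is a placeholder; adjust it to the files in the archive):

    import pandas as pd

    # Placeholder file name; substitute a file from the downloaded archive.
    df = pd.read_csv("healthcare_multilingual.csv")

    # Separate the pseudonymized columns from the original ones described above.
    pseudo_cols = [c for c in df.columns if c.startswith("Pseudonymized_")]
    original_cols = [c for c in df.columns if c not in pseudo_cols]

    # Work with the privacy-preserving view, e.g. Hindi records only.
    pseudo_df = df[pseudo_cols]
    hindi_df = df[df["Language"] == "Hindi"] if "Language" in df.columns else df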

  3. Difficulty and Time Perceptions of Preparatory Activities for Quitting...

    • data.4tu.nl
    zip
    Cite
    Nele Albers; Mark A. Neerincx; Willem-Paul Brinkman, Difficulty and Time Perceptions of Preparatory Activities for Quitting Smoking: Dataset [Dataset]. http://doi.org/10.4121/5198f299-9c7a-40f8-8206-c18df93ee2a0.v1
    Explore at:
    zip
    Dataset provided by
    4TU.ResearchData
    Authors
    Nele Albers; Mark A. Neerincx; Willem-Paul Brinkman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Sep 6, 2022 - Nov 16, 2022
    Description

    This dataset contains the data on 144 daily smokers each rating 44 preparatory activities for quitting smoking (e.g., envisioning one's desired future self after quitting smoking, tracking one's smoking behavior, learning about progressive muscle relaxation) on their perceived ease/difficulty and required completion time. Since becoming more physically active can make it easier to quit smoking, some activities were also about becoming more physically active (e.g., tracking one's physical activity behavior, learning about what physical activity is recommended, envisioning one's desired future self after becoming more physically active). Moreover, participants provided a free-text response on what makes some activities more difficult than others.


    Study

    The data was gathered during a study on the online crowdsourcing platform Prolific between 6 September and 16 November 2022. The Human Research Ethics Committee of Delft University of Technology granted ethical approval for the research (Letter of Approval number: 2338).

    In this study, daily smokers who were contemplating or preparing to quit smoking first filled in a prescreening questionnaire and were then invited to a repertory grid study if they passed the prescreening. In the repertory grid study, participants were asked to divide sets of 3 preparatory activities for quitting smoking into two subgroups. Afterward, they rated all preparatory activities on the perceived ease of doing them and the perceived required time to do them. Participants also provided a free-text response on what makes some activities more difficult than others.

    The study was pre-registered in the Open Science Framework (OSF): https://osf.io/cax6f. This pre-registration describes the study setup, measures, etc. Note that this dataset contains only part of the collected data: the data related to studying the perceived difficulty of preparatory activities.

    The file "Preparatory_Activity_Formulations.xlsx" contains the formulations of the 44 preparatory activities used in this study.


    Data

    This dataset contains three types of data:

    - Data from participants' Prolific profiles. This includes, for example, the age, gender, weekly exercise amount, and smoking frequency.

    - Data from a prescreening questionnaire. This includes, for example, the stage of change for quitting smoking and whether people previously tried to quit smoking.

    - Data from the repertory grid study. This includes the ratings of the 44 activities on ease and required time as well as the free-text responses on what makes some activities more difficult than others.

    For each data file there is a file that explains each data column. For example, the file "prolific_profile_data_explanation.xlsx" contains the column explanations for the data gathered from participants' Prolific profiles.

    Each data file contains a column called "rand_id" that can be used to link the data from the data files.
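
    A minimal linking sketch (the data file names below are placeholders; only the *_explanation.xlsx naming pattern is stated above):

    import pandas as pd

    # Placeholder file names for the three types of data files described above.
    profiles = pd.read_csv("prolific_profile_data.csv")
    prescreening = pd.read_csv("prescreening_data.csv")
    repertory_grid = pd.read_csv("repertory_grid_data.csv")

    # rand_id links records across the data files.
    merged = (
        repertory_grid
        .merge(profiles, on="rand_id", how="left")
        .merge(prescreening, on="rand_id", how="left")
    )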


    In the case of questions, please contact Nele Albers (n.albers@tudelft.nl) or Willem-Paul Brinkman (w.p.brinkman@tudelft.nl).

  4. Collections (from American Folklife Center)

    • zenodo.org
    csv
    Updated Nov 13, 2024
    Cite
    Patrick Egan; Patrick Egan (2024). Collections (from American Folklife Center) [Dataset]. http://doi.org/10.5281/zenodo.14140570
    Explore at:
    csv
    Dataset updated
    Nov 13, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Patrick Egan; Patrick Egan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2019
    Description

    Dataset originally created 03/01/2019. UPDATE: Packaged on 04/18/2019. UPDATE: Edited README on 04/18/2019.

    I. About this Data Set This data set is a snapshot of work that is ongoing as a collaboration between Kluge Fellow in Digital Studies, Patrick Egan and an intern at the Library of Congress in the American Folklife Center. It contains a combination of metadata from various collections that contain audio recordings of Irish traditional music. The development of this dataset is iterative, and it integrates visualizations that follow the key principles of trust and approachability. The project, entitled, “Connections In Sound” invites you to use and re-use this data.

    The text available in the Items dataset is generated from multiple collections of audio material that were discovered at the American Folklife Center. Each instance of a performance was listed and “sets” or medleys of tunes or songs were split into distinct instances in order to allow machines to read each title separately (whilst still noting that they were part of a group of tunes). The work of the intern was then reviewed before publication, and cross-referenced with the tune index at www.irishtune.info. The Items dataset consists of just over 1000 rows, with new data being added daily in a separate file.

    The collections dataset contains at least 37 rows of collections that were located by a reference librarian at the American Folklife Center. This search was complemented by searches of the collections by the scholar both on the internet at https://catalog.loc.gov and by using card catalogs.

    Updates to these datasets will be announced and published as the project progresses.

    II. What’s included? This data set includes:

    The Items Dataset – a .CSV containing Media Note, OriginalFormat, On Website, Collection Ref, Missing In Duplication, Collection, Outside Link, Performer, Solo/multiple, Sub-item, type of tune, Tune, Position, Location, State, Date, Notes/Composer, Potential Linked Data, Instrument, Additional Notes, Tune Cleanup. This .CSV is the direct export of the Items Google Spreadsheet

    III. How Was It Created? These data were created by a Kluge Fellow in Digital Studies and an intern on this program over the course of three months. By listening, transcribing, reviewing, and tagging audio recordings, these scholars improve access and connect sounds in the American Folklife Collections by focusing on Irish traditional music. Once transcribed and tagged, information in these datasets is reviewed before publication.

    IV. Data Set Field Descriptions


    a) Collections dataset field descriptions

    ItemId – this is the identifier for the collection that was found at the AFC
    Viewed – if the collection has been viewed, or accessed in any way by the researchers.
    On LOC – whether or not there are audio recordings of this collection available on the Library of Congress website.
    On Other Website – if any of the recordings in this collection are available elsewhere on the internet
    Original Format – the format that was used during the creation of the recordings that were found within each collection
    Search – this indicates the type of search that was performed in order to locate recordings and collections within the AFC
    Collection – the official title for the collection as noted on the Library of Congress website
    State – The primary state where recordings from the collection were located
    Other States – The secondary states where recordings from the collection were located
    Era / Date – The decade or year associated with each collection
    Call Number – This is the official reference number that is used to locate the collections, both in the urls used on the Library website, and in the reference search for catalog cards (catalog cards can be searched at this address: https://memory.loc.gov/diglib/ihas/html/afccards/afccards-home.html)
    Finding Aid Online? – Whether or not a finding aid is available for this collection on the internet

    b) Items dataset field descriptions

    id – the specific identification of the instance of a tune, song or dance within the dataset
    Media Note – Any information that is included with the original format, such as identification, name of physical item, additional metadata written on the physical item
    Original Format – The physical format that was used when recording each specific performance. Note: this field is used in order to calculate the number of physical items that were created in each collection such as 32 wax cylinders.
    On Website? – Whether or not each instance of a performance is available on the Library of Congress website
    Collection Ref – The official reference number of the collection
    Missing In Duplication – This column marks if parts of some recordings had been made available on other websites, but not all of the recordings were included in duplication (see recordings from Philadelphia Céilí Group on Villanova University website)
    Collection – The official title of the collection given by the American Folklife Center
    Outside Link – If recordings are available on other websites externally
    Performer – The name of the contributor(s)
    Solo/multiple – This field is used to calculate the amount of solo performers vs group performers in each collection
    Sub-item – In some cases, physical recordings contained extra details, the sub-item column was used to denote these details
    Type of item – This column describes each individual item type, as noted by performers and collectors
    Item – The item title, as noted by performers and collectors. If an item was not described, it was entered as “unidentified”
    Position – The position on the recording (in some cases during playback, audio cassette player counter markers were used)
    Location – Local address of the recording
    State – The state where the recording was made
    Date – The date that the recording was made
    Notes/Composer – The stated composer or source of the item recorded
    Potential Linked Data – If items may be linked to other recordings or data, this column was used to provide examples of potential relationships between them
    Instrument – The instrument(s) that was used during the performance
    Additional Notes – Notes about the process of capturing, transcribing and tagging recordings (for researcher and intern collaboration purposes)
    Tune Cleanup – This column was used to tidy each item so that it could be read by machines, but also so that spelling mistakes from the Item column could be corrected, and as an aid to preserving iterations of the editing process

    V. Rights statement The text in this data set was created by the researcher and intern and can be used in many different ways under creative commons with attribution. All contributions to Connections In Sound are released into the public domain as they are created. Anyone is free to use and re-use this data set in any way they want, provided reference is given to the creators of these datasets.

    VI. Creator and Contributor Information

    Creator: Connections In Sound

    Contributors: Library of Congress Labs

    VII. Contact Information Please direct all questions and comments to Patrick Egan via www.twitter.com/drpatrickegan or via his website at www.patrickegan.org. You can also get in touch with the Library of Congress Labs team via LC-Labs@loc.gov.

  5. Alpaca Cleaned

    • kaggle.com
    zip
    Updated Nov 26, 2023
    Cite
    The Devastator (2023). Alpaca Cleaned [Dataset]. https://www.kaggle.com/datasets/thedevastator/alpaca-language-instruction-training/code
    Explore at:
    zip (14548320 bytes)
    Dataset updated
    Nov 26, 2023
    Authors
    The Devastator
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Alpaca Cleaned

    Improving Pretrained Language Model Understanding

    By Huggingface Hub [source]

    About this dataset

    Alpaca is the perfect dataset for fine-tuning your language models to better understand and follow instructions, capable of taking you beyond standard Natural Language Processing (NLP) abilities! This curated, cleaned dataset provides you with over 52,000 expertly crafted instructions and demonstrations generated by OpenAI's text-davinci-003 engine - all in English (BCP-47 en). Improve the quality of your language models with fields such as instruction, output, and input, which have been designed to enhance every aspect of their comprehension. The data here has gone through rigorous cleaning to ensure there are no errors or biases present, allowing you to trust that this data will result in improved performance for any language model that uses it! Get ready to see what Alpaca can do for your NLP needs.

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    This dataset provides a unique and valuable resource for anyone who wishes to create, develop and train language models. Alpaca provides users with 52,000 instruction-demonstration pairs generated by OpenAI's text-davinci-003 engine.

    The data included in this dataset is formatted into 3 columns: “instruction”, “output” and “input.” All the data is written in English (BCP-47 en).

    To make the most out of this dataset it is recommended to:

    • Familiarize yourself with the instructions in the instruction column, as these provide guidance on how to use the other two columns: input and output.

    • Once comfortable with the instruction column, move on to exploring the instruction, output and input triplets included in this cleaned version of Alpaca.

    • Read through many examples, paying attention to areas where you feel more clarification could be added or further improved for a better understanding of language models; bear in mind that these examples have already been cleaned of the errors and biases found in the original dataset.

    • Get inspired! As mentioned earlier, more than 52k sets are provided, giving you plenty of flexibility for varying training strategies or unique approaches when creating your own language model.

    • Finally, while not essential, it may be helpful to be familiar with OpenAI's text-davinci engine and to experiment with different parameters/options depending on the type of outcomes you wish to achieve.
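
    A minimal prompt-assembly sketch for the three columns (the prompt template below is an assumption, not part of the dataset):

    import pandas as pd

    # train.csv columns: instruction, output, input
    df = pd.read_csv("train.csv")

    def build_prompt(row):
        """Combine instruction and optional input into one training prompt."""
        if isinstance(row["input"], str) and row["input"].strip():
            return f"Instruction: {row['instruction']}\nInput: {row['input']}\nResponse:"
        return f"Instruction: {row['instruction']}\nResponse:"

    df["prompt"] = df.apply(build_prompt, axis=1)
    df["completion"] = df["output"]
    print(df[["prompt", "completion"]].head())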

    Research Ideas

    • Developing natural language processing (NLP) tasks that aim to better automate and interpret instructions given by humans.
    • Training machine learning models of robotic agents to be able to understand natural language commands, as well as understand the correct action that needs to be taken in response.
    • Creating a system that can generate personalized instructions and feedback in real time based on language models, catering specifically to each individual user's preferences or needs

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv

    | Column name | Description |
    |:------------|:------------|
    | instruction | This column contains the instructions for the language model. (Text) |
    | output | This column contains the expected output from the language model. (Text) |
    | input | This column contains the input given to the language model. (Text) |

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. If you use this dataset in your research, please credit Huggingface Hub.

  6. Modeling Word Importance in Conversational Transcripts from the Perspective...

    • zenodo.org
    bin, csv
    Updated Jun 25, 2022
    Cite
    JOHN HITACHI; JOHN HITACHI (2022). Modeling Word Importance in Conversational Transcripts from the Perspective of Deaf and Hard of Hearing Viewers [Dataset]. http://doi.org/10.5281/zenodo.6728363
    Explore at:
    csv, bin
    Dataset updated
    Jun 25, 2022
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    JOHN HITACHI; JOHN HITACHI
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    1. MaskedPOSAugmentedData.csv

    This file contains word embeddings of 6659 tokens augmented with POS tagging and a word importance score. Each token's embedding is a one-dimensional vector of length 768; augmenting it with the POS tag makes the feature vector 769-dimensional, so the total feature matrix is 6659 by 769. The final column of the dataset is the word importance score.


    2. MaskedSentenceWithImportanceTag.csv

    This file contains the sentences newly generated using a standard masking technique, together with their corresponding token-wise importance scores.

    3. Data cleaning procedure

    We cleaned the “switchboard corpus” used for this particular study in several steps. Here are the steps we followed:

    – Convert all letters to lower case

    – Remove punctuation

    – Convert numbers and cardinal values to their text representation

    – Apply lemmatization to eliminate multiple versions of the same word token

    4. Annotation instruction

    If researchers intend to produce additional masked text to further increase the size of the dataset, we recommend using the method described in the paper.

    However, if someone wants to manually annotate the words or tokens in a dataset, it is important to remember that annotators need to score each word based on its relative importance within a sentence or the information available around that text. It may not be appropriate to allow annotators to read the whole document first and then conduct annotation. Using multiple annotators is also recommended; otherwise inter-rater agreement may not be meaningful.

    5. Test and training data splitting

    As described in the paper, during our experiment we retained 10% of the data as test data and used the remaining 90% to train the models. For proper replication, we recommend reading our paper thoroughly.

    6. We understand that the dataset size is relatively small for training and testing a reliable model. It is important to remember that data annotation with this particular user group can be challenging.

    N.B.: While augmenting the new feature within the dataset, a portion of the data was excluded during the data curation phase. For validation, we replicated the previous models so that we could measure how they perform with this newly formed dataset.
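
    A minimal split sketch (it assumes the 769 feature columns come first and the word-importance score is the final column, as described above):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    data = pd.read_csv("MaskedPOSAugmentedData.csv")

    # The final column holds the word-importance score; the rest are features.
    X = data.iloc[:, :-1]
    y = data.iloc[:, -1]

    # 90% training / 10% test, matching the split described above.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.1, random_state=42
    )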

  7. sql-create-context

    • huggingface.co
    • opendatalab.com
    • +1more
    Updated Apr 21, 2023
    + more versions
    Cite
    brianm (2023). sql-create-context [Dataset]. https://huggingface.co/datasets/b-mc2/sql-create-context
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more about it at mlcommons.org/croissant.
    Dataset updated
    Apr 21, 2023
    Authors
    brianm
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    This dataset builds from WikiSQL and Spider. There are 78,577 examples of natural language queries, SQL CREATE TABLE statements, and SQL Query answering the question using the CREATE statement as context. This dataset was built with text-to-sql LLMs in mind, intending to prevent hallucination of column and table names often seen when trained on text-to-sql datasets. The CREATE TABLE statement can often be copy and pasted from different DBMS and provides table names, column… See the full description on the dataset page: https://huggingface.co/datasets/b-mc2/sql-create-context.
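
    A minimal loading sketch with the Hugging Face datasets library (the field names question, context, and answer are assumptions based on the description above):

    from datasets import load_dataset

    ds = load_dataset("b-mc2/sql-create-context", split="train")

    example = ds[0]
    # Assumed fields: the natural-language question, the CREATE TABLE context,
    # and the SQL query that answers the question.
    print(example["question"])
    print(example["context"])
    print(example["answer"])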

  8. LScDC Word-Category RIG Matrix

    • figshare.le.ac.uk
    pdf
    Updated Apr 28, 2020
    Cite
    Neslihan Suzen (2020). LScDC Word-Category RIG Matrix [Dataset]. http://doi.org/10.25392/leicester.data.12133431.v2
    Explore at:
    pdf
    Dataset updated
    Apr 28, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LScDC Word-Category RIG Matrix

    April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

    Getting Started

    This file describes the Word-Category RIG Matrix for the Leicester Scientific Corpus (LSC) [1], the procedure to build the matrix, and introduces the Leicester Scientific Thesaurus (LScT) with its construction process. The Word-Category RIG Matrix is a 103,998 by 252 matrix, where rows correspond to words of the Leicester Scientific Dictionary-Core (LScDC) [2] and columns correspond to 252 Web of Science (WoS) categories [3, 4, 5]. Each entry in the matrix corresponds to a pair (category, word). Its value shows the Relative Information Gain (RIG) on the belonging of a text from the LSC to the category from observing the word in this text. The CSV file of the Word-Category RIG Matrix in the published archive includes two additional columns: the sum of RIGs in categories and the maximum of RIGs over categories (the last two columns of the matrix). So the file ‘Word-Category RIG Matrix.csv’ contains a total of 254 columns.

    This matrix is created to be used in future research on quantifying meaning in scientific texts, under the assumption that words have scientifically specific meanings in subject categories and that this meaning can be estimated by information gains from word to categories.

    LScT (Leicester Scientific Thesaurus) is a scientific thesaurus of English. The thesaurus includes a list of 5,000 words from the LScDC. We order the words of the LScDC by the sum of their RIGs in categories; that is, words are arranged by their informativeness in the scientific corpus LSC. Meaningfulness of words is therefore evaluated by their average informativeness in the categories. We decided to include the most informative 5,000 words in the scientific thesaurus.

    Words as a Vector of Frequencies in WoS Categories

    Each word of the LScDC is represented as a vector of frequencies in WoS categories. Given the collection of LSC texts, each entry of the vector is the number of texts containing the word in the corresponding category. It is noteworthy that texts in a corpus do not necessarily belong to a single category, as they are likely to correspond to multidisciplinary studies, specifically in a corpus of scientific texts. In other words, categories may not be exclusive. There are 252 WoS categories, and a text can be assigned to at least 1 and at most 6 categories in the LSC. Using a binary calculation of frequencies, we record the presence of a word in a category. We create a vector of frequencies for each word, where dimensions are categories in the corpus. The collection of vectors, with all words and categories in the entire corpus, can be shown in a table, where each entry corresponds to a pair (word, category). This table is built for the LScDC with 252 WoS categories and presented in the published archive with this file. The value of each entry in the table shows how many times a word of the LScDC appears in a WoS category. The occurrence of a word in a category is determined by counting the number of LSC texts containing the word in that category.
    Words as a Vector of Relative Information Gains Extracted for Categories

    In this section, we introduce our approach to representing a word as a vector of relative information gains for categories, under the assumption that the meaning of a word can be quantified by the information it gains for categories. For each category, a function is defined on texts that takes the value 1 if the text belongs to the category, and 0 otherwise. For each word, a function is defined on texts that takes the value 1 if the word belongs to the text, and 0 otherwise. Consider the LSC as a probabilistic sample space (the space of equally probable elementary outcomes). For these Boolean random variables, the joint probability distribution, the entropy and the information gains are defined. The information gain about the category from the word is the amount of information on the belonging of a text from the LSC to the category from observing the word in the text [6]. We used the Relative Information Gain (RIG), a normalised measure of the Information Gain, which makes information gains comparable across categories. The calculations of entropy, Information Gains and Relative Information Gains can be found in the README file in the published archive.

    Given a word, we created a vector where each component corresponds to a category; each word is thus represented as a vector of relative information gains, whose dimension is the number of categories. The set of vectors forms the Word-Category RIG Matrix, in which each column corresponds to a category, each row corresponds to a word, and each component is the relative information gain from the word to the category. A row vector represents the corresponding word as a vector of RIGs in categories; a column vector represents the RIGs of all words in an individual category. For an arbitrary category, words can be ordered by their RIGs from the most informative to the least informative for that category. Words can also be ordered by two global criteria: the sum and the maximum of RIGs in categories; the top n words in such a list can be considered the most informative words in the scientific texts. For a given word, the sum and maximum of RIGs are calculated from the Word-Category RIG Matrix. RIGs for each word of the LScDC in the 252 categories are calculated and the word vectors are formed; we then form the Word-Category RIG Matrix for the LSC. For each word, the sum (S) and maximum (M) of RIGs in categories are calculated and appended at the end of the matrix (its last two columns).

    Leicester Scientific Thesaurus (LScT)

    The Leicester Scientific Thesaurus (LScT) is a list of 5,000 words from the LScDC [2]. Words of the LScDC are sorted in descending order by the sum (S) of RIGs in categories, and the top 5,000 words are selected for inclusion in the LScT. We consider these 5,000 words the most meaningful words in the scientific corpus: meaningfulness is evaluated by a word's average informativeness in the categories, and the list of these words is considered a ‘thesaurus’ for science. The LScT with the sum values can be found as a CSV file in the published archive.
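
    As a companion to this description, here is a minimal sketch of the RIG computation for a single (word, category) pair; it assumes (this is not stated explicitly above) that the information gain is normalised by the entropy of the category variable:

    import numpy as np

    def entropy(p):
        """Shannon entropy (in bits) of a Bernoulli variable with success probability p."""
        if p in (0.0, 1.0):
            return 0.0
        return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

    def relative_information_gain(word_in_text, text_in_category):
        """RIG of a category from a word, given Boolean indicator arrays over the corpus texts."""
        w = np.asarray(word_in_text, dtype=bool)
        c = np.asarray(text_in_category, dtype=bool)
        h_c = entropy(c.mean())
        if h_c == 0.0:
            return 0.0
        # Conditional entropy H(category | word), summed over the two word outcomes.
        h_c_given_w = 0.0
        for value in (True, False):
            mask = (w == value)
            if mask.any():
                h_c_given_w += mask.mean() * entropy(c[mask].mean())
        return (h_c - h_c_given_w) / h_c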
    The published archive contains the following files:

    1) Word_Category_RIG_Matrix.csv: A 103,998 by 254 matrix where columns are the 252 WoS categories plus the sum (S) and the maximum (M) of RIGs in categories (last two columns), and rows are words of the LScDC. Each entry in the first 252 columns is the RIG from the word to the category. Words are ordered as in the LScDC.
    2) Word_Category_Frequency_Matrix.csv: A 103,998 by 252 matrix where columns are the 252 WoS categories and rows are words of the LScDC. Each entry is the number of texts containing the word in the corresponding category. Words are ordered as in the LScDC.
    3) LScT.csv: List of words of the LScT with sum (S) values.
    4) Text_No_in_Cat.csv: The number of texts in categories.
    5) Categories_in_Documents.csv: List of WoS categories for each document of the LSC.
    6) README.txt: Description of the Word-Category RIG Matrix, the Word-Category Frequency Matrix and the LScT, and the procedures used to form them.
    7) README.pdf: Same as 6, in PDF format.

    References

    [1] Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
    [2] Suzen, Neslihan (2019): LScDC (Leicester Scientific Dictionary-Core). figshare. Dataset. https://doi.org/10.25392/leicester.data.9896579.v3
    [3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
    [4] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
    [5] Suzen, N., Mirkes, E. M., & Gorban, A. N. (2019). LScDC - new large scientific dictionary. arXiv preprint arXiv:1912.06858.
    [6] Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379-423.

  9. The OpenCare semantic social network data

    • kaggle.com
    zip
    Updated Jan 29, 2023
    Cite
    The Devastator (2023). The OpenCare semantic social network data [Dataset]. https://www.kaggle.com/datasets/thedevastator/semantic-opencare-network-analysis
    Explore at:
    zip (3227029 bytes)
    Dataset updated
    Jan 29, 2023
    Authors
    The Devastator
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Semantic OpenCare Network Analysis

    Exploring Interaction Patterns and Relationship Dynamics

    By [source]

    About this dataset

    The OpenCare dataset is a unique compilation of social network data providing an in-depth look into conversations, interactions, behaviors, and relationships. It includes posts and comments from the OpenCare platform, which allows users to discuss and collaborate on global health issues. With this data set you can explore annotations and coding tags used by participants to describe their discussions, as well as analyze post replies, likes, and other user engagement metrics. OpenCare provides a comprehensive look at how international stakeholders interact with each other on issues related to global health, allowing policy makers, researchers, academics and even everyday citizens to gain insight on the larger implications of healthcare decisions made in different countries around the world. Join us in exploring this dataset today – there’s infinite potential when it comes to uncovering trends that can improve patient care worldwide!

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    • Understand What's in the Dataset - Before you start making use of the data, it's important to understand what types of information are included in each column. Columns such as text contain conversation posts or comments, while others include annotations or coding tags associated with particular posts or comments. It's also important to learn which columns contain unique identifiers like uri, so you have an easier time working with your data afterwards.

    • Clean & Analyze Your Data - After exploring what is included in each column, it's time to start cleaning the data. Depending on the type of analysis you plan on doing, you can choose which fields (columns) need to be cleaned up first, such as participants' names or post/comment contents. Once these pieces are cleaned up, dive into some exploratory analytics to better understand the conversations held on the OpenCare platform - for example, look at relationships between participants through the replies and likes exchanged between different users involved in conversations.

    • Visualize Your Results - Once your analysis is complete, it's time to visualize your results so others can make sense of them easily, whether using style sheets developed within a platform like Tableau or other visualization tools like Chart.js, depending on the complexity of the results. Visualizing the content shared by users who encountered issues, and their attempts to collaborate on crisis resolution, will give an overall idea of the community-driven concepts being discussed on the OpenCare platform.

    Research Ideas

    • Analyzing the spread of health information among different user groups to identify potential target audiences for campaigns or initiatives related to public health.
    • Examining relationships between users in different regions and countries based on their interactions and collaborations on openCare platform.
    • Visualizing the network of user relations and interactions, in order to better understand how users are connected, or have a shared interest within a particular topic related to global health issues

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: annotations.csv

    | Column name | Description |
    |:------------|:------------|
    | version | Version of the data being used. (Integer) |
    | text | Text associated with posts or comments. (String) |
    ...

  10. Kita facilities Hamburg

    • data.europa.eu
    csv, gml, html, oaf +3
    + more versions
    Cite
    Behörde für Arbeit, Gesundheit, Soziales, Familie und Integration, Kita facilities Hamburg [Dataset]. https://data.europa.eu/data/datasets/c1ac42b2-c104-45b8-91f9-da14c3c88a1f?locale=en
    Explore at:
    wfs (2281), csv (241381), pdf (49383), csv (679780), html (126173), gml (2953249), oaf (5136), html (405944), html (295449), wfs (28031), wms (16512)
    Dataset authored and provided by
    Behörde für Arbeit, Gesundheit, Soziales, Familie und Integration
    License

    Data licence Germany – Attribution – Version 2.0: https://www.govdata.de/dl-de/by-2-0
    License information was derived automatically

    Area covered
    Hamburg
    Description

    Two files are generated daily:

    1. Kita_Einrichtung.csv (Kita facilities)
    2. Kita_Einrichtung_Leistung.csv (Kita facility services)

    The file Kita_Einrichtung.csv provides exactly one line of information for each daycare centre and is sorted in ascending order by its first column, KITAEinrichtung_EinrNr.

    The file Kita_Einrichtung_Leistung.csv contains all services offered by the daycare centres and is also sorted in ascending order by the first column, KITAEinrichtung_EinrNr.

    Format: The files are encoded in UTF-8. The end of each line is marked with Carriage Return + Line Feed. Columns are separated by the caret character ^.

    Data structure: The first line of each file contains sep=^ so that Microsoft Excel recognises the separator; if Excel is not used, this line should be ignored. Note that Excel treats all CSV files as ANSI-encoded and does not automatically convert from UTF-8, so umlauts, for example, are not displayed correctly. For importing the data into Excel, the use of Excel's text import wizard is therefore explicitly recommended, since all settings required for an error-free import can be made there.

    The second line of the files contains the column headings. From the third line to the end of the file, the exported records of the daycare database follow.
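
    A minimal reading sketch for these files (pandas is an assumption here; the key points are the caret separator and skipping the Excel hint line):

    import pandas as pd

    # skiprows=[0] drops the "sep=^" hint line; the next line holds the column headings.
    facilities = pd.read_csv("Kita_Einrichtung.csv", sep="^", skiprows=[0], encoding="utf-8")
    services = pd.read_csv("Kita_Einrichtung_Leistung.csv", sep="^", skiprows=[0], encoding="utf-8")

    # Join services to their facility via the shared key column.
    merged = services.merge(facilities, on="KITAEinrichtung_EinrNr", how="left")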

    Column definition of the Kita_Einrichtung.csv file: 15 columns.

    Column definition of the Kita_Einrichtung_Leistung.csv file: 2 columns.

  11. Cleaned & Analyzed Phone Dataset from Smartprix

    • kaggle.com
    zip
    Updated Nov 15, 2025
    Cite
    Allen Close (2025). Cleaned & Analyzed Phone Dataset from Smartprix [Dataset]. https://www.kaggle.com/datasets/allenclose/cleaned-and-analyzed-phone-dataset-from-smartprix
    Explore at:
    zip (1405920 bytes)
    Dataset updated
    Nov 15, 2025
    Authors
    Allen Close
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This is a professionally cleaned and structured version of the raw Smartprix mobile phones dataset. While the original dataset presents real-world data challenges, this version provides analysis-ready data with proper formatting, data types, and calculated fields - perfect for immediate insights and machine learning projects.

    ✨ What Makes This Dataset Special?

    Comprehensive Data Cleaning

    • Proper Data Types: All numeric fields converted from text to appropriate types
    • Standardized Formatting: Consistent text formatting across all fields
    • Extracted Values: Numbers extracted from text fields (weight, battery, display size)
    • Boolean Conversions: Clear Yes/No values instead of bits
    • Date Parsing: Structured release date information with year and month
    • Calculated Fields: Added derived metrics like original prices and screen ratios

    What's Included

    The dataset covers comprehensive mobile specifications:

    General Information - Brand, Model, OS, Release Date, SIM Type - Full Name and Pricing Information

    Design & Display - Physical dimensions (height, width, thickness) - Weight in grams - Available colors and build materials - Display specifications (size, resolution, refresh rate, PPI) - Screen-to-body ratio

    Camera System - Rear and front camera megapixels - Camera features and capabilities - Video recording specifications

    Performance & Storage - Processor details - RAM (in GB) - Internal storage capacity - Expandable storage support

    Battery & Charging - Battery capacity (mAh) - Fast charging support and wattage - Wireless charging capabilities

    Connectivity - 5G/4G/3G/2G support - Bluetooth, Wi-Fi specifications - USB type and GPS capabilities

    Pricing & Market - Current market price - Original launch price - Price drop amount and percentage

    🎯 Perfect For

    Data Analysis Projects

    • Market Trend Analysis: Identify patterns in pricing, features, and specifications
    • Brand Comparison: Compare offerings across different manufacturers
    • Feature Correlation: Discover relationships between specs and pricing
    • Consumer Insights: Understand what features drive mobile phone value

    Machine Learning Applications

    • Price Prediction Models: Predict phone prices based on specifications
    • Classification Tasks: Categorize phones by segment (budget, mid-range, flagship)
    • Recommendation Systems: Build phone recommendation engines
    • Feature Importance: Identify which specs matter most to pricing

    Business Intelligence

    • Competitive Analysis: Benchmark brands and models
    • Market Positioning: Understand product positioning strategies
    • Value Analysis: Identify best value propositions in the market

    📊 Data Quality

    This cleaned dataset features:

    • No mixed data types: Each column has consistent, appropriate types
    • Numeric fields ready for analysis: All prices, sizes, and capacities are properly formatted
    • Boolean clarity: Clear Yes/No values for binary features
    • Standardized text: Consistent formatting across all text fields
    • Calculated metrics: Added fields like original price and price drop percentage

    🔄 Data Updates

    The source data is updated daily from Smartprix. This cleaned version represents a snapshot with professional data preparation applied. For the latest raw data, refer to the original dataset.

    💡 Getting Started

    import pandas as pd
    
    # Load the cleaned dataset
    df = pd.read_csv('cleaned_mobiles.csv')
    
    # Ready for immediate analysis!
    # Example: Top 10 brands by average price
    top_brands = df.groupby('Brand')['Current_Price_Numeric'].mean().sort_values(ascending=False).head(10)
    print(top_brands)
    
    # Example: 5G phones under $500
    affordable_5g = df[(df['Has_5G'] == 'Yes') & (df['Current_Price_Numeric'] < 500)]
    

    🏆 Key Improvements Over Raw Dataset

    | Aspect | Raw Dataset | This Cleaned Dataset |
    |---|---|---|
    | Data Types | All text/mixed | Proper numeric, boolean, text |
    | Numeric Values | Embedded in text | Extracted and converted |
    | Boolean Fields | Bit values | Human-readable Yes/No |
    | Price Information | Current only | Current + Original + Drop % |
    | RAM/Storage | Mixed units (MB/GB/TB) | Standardized to GB |
    | Analysis Readiness | Requires extensive cleaning | Ready for immediate use |

    📝 Column Guide

    Key Numeric Columns

    • Current_Price_Numeric: Current market price (numeric)
    • Original_Price_Calculated: Launch price (numeric)
    • Price_Drop_Percentage: Discount from launch (%)
    • Weight_Numeric: Weight in grams
    • Battery_Capacity_Numeric: Battery in mAh
    • RAM_GB: RAM in gigabytes
    • Internal_Memory_GB: Storage in gigabytes

    Key Boolean Columns (Yes/No)

    • Has_5G, Has_4G, Has_3G, Has_2G
    • Has_NFC, Has_IR_Blaster
    • `Fast_Char...
  12. Wikipedia Biographies Text Generation Dataset

    • kaggle.com
    zip
    Updated Dec 3, 2023
    Cite
    The Devastator (2023). Wikipedia Biographies Text Generation Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/wikipedia-biographies-text-generation-dataset/code
    Explore at:
    zip (269983242 bytes)
    Dataset updated
    Dec 3, 2023
    Authors
    The Devastator
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Wikipedia Biographies Text Generation Dataset

    Wikipedia Biographies: Infobox and First Paragraphs Texts

    By wiki_bio (From Huggingface) [source]

    About this dataset

    The dataset contains several key columns: input_text and target_text. The input_text column includes the infobox and first paragraph of a Wikipedia biography, providing essential information about the individual's background, accomplishments, and notable features. The target_text column consists of the complete biography text extracted from the corresponding Wikipedia page.

    In order to facilitate model training and validation, the dataset is divided into three main files: train.csv, val.csv, and test.csv. The train.csv file contains pairs of input text and target text for model training. It serves as a fundamental resource to develop accurate language generation models by providing abundant examples for learning to generate coherent biographical texts.

    The val.csv file provides further validation data consisting of additional Wikipedia biographies with their corresponding infoboxes and first paragraphs. This subset allows researchers to evaluate their trained models' performance on unseen examples during development or fine-tuning stages.

    Finally, the test.csv file offers a separate set of input texts paired with corresponding target texts for generating complete biographies using pre-trained models or newly developed algorithms. The purpose of this file is to benchmark system performance on unseen data in order to assess generalization capabilities.

    This extended description aims to provide an informative overview of the dataset structure, its intended use cases in natural language processing research tasks such as text generation or summarization. Researchers can leverage this comprehensive collection to advance various applications in automatic biography writing systems or content generation tasks that require coherent textual output based on provided partial information extracted from an infobox or initial paragraph sources from online encyclopedias like Wikipedia

    How to use the dataset

    • Overview:

      • This dataset consists of biographical information from Wikipedia pages, specifically the infobox and the first paragraph of each biography.
      • The dataset is provided in three separate files: train.csv, val.csv, and test.csv.
      • Each file contains pairs of input text and target text.
    • File Descriptions:

      • train.csv: This file is used for training purposes. It includes pairs of input text (infobox and first paragraph) and target text (complete biography).
      • val.csv: Validation purposes can be fulfilled using this file. It contains a collection of biographies with infobox and first paragraph texts.
      • test.csv: This file can be used to generate complete biographies based on the given input texts.
    • Column Information:

      a) For train.csv:

      • input_text: Input text column containing the infobox and first paragraph of a Wikipedia biography.
      • target_text: Target text column containing the complete biography text for each entry.

      b) For val.csv:

      • input_text: Infobox and first paragraph texts are included in this column.
      • target_text: Complete biography texts are present in this column.

      c) For test.csv: The columns follow the pattern mentioned previously, i.e., input_text followed by target_text.

    • Usage Guidelines:

    • Training Model or Algorithm Development: If you are working on training a model or developing an algorithm for generating complete biographies from given inputs, it is recommended to use train.csv as your primary dataset.

    • Model Validation or Evaluation: To validate or evaluate your trained model, you can use val.csv as an independent dataset. This dataset contains biographies that have been withheld from the training data.

    • Generating Biographies with Trained Models: To generate complete biographies using your trained model, you can make use of test.csv. This dataset provides input texts for which you need to generate the corresponding target texts.

    • Additional Information and Tips:

    • The input text in this dataset includes both an infobox (a structured section containing key-value pairs) and the first paragraph of a Wikipedia biography.

    • The target text is the complete biography for each entry.

    • While working with this dataset, make sure to preprocess and clean the input and target texts as needed (a minimal loading sketch follows below).
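
    A minimal loading sketch in Python, assuming the three CSV files sit in the working directory and expose the input_text / target_text columns described above:

    ```python
    import pandas as pd

    # Load the three splits described above (paths are assumptions; adjust as needed).
    train = pd.read_csv("train.csv")
    val = pd.read_csv("val.csv")
    test = pd.read_csv("test.csv")

    # Sanity-check the expected columns and peek at a few rows.
    print(train.columns.tolist())  # expected: ['input_text', 'target_text']
    print(train[["input_text", "target_text"]].head(3))
    print(f"rows: train={len(train)}, val={len(val)}, test={len(test)}")
    ```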

    Research Ideas

    • Text Generation: The dataset can be used to train language models to generate complete Wikipedia biographies given only the infobox and first paragraph ...
  13. The LAMBADA Dataset for Word Prediction

    • kaggle.com
    zip
    Updated Jun 27, 2024
    Aman Chauhan (2024). The LAMBADA Dataset for Word Prediction [Dataset]. https://www.kaggle.com/datasets/whenamancodes/the-lambada-dataset-for-word-prediction
    Explore at:
    zip(345323257 bytes)Available download formats
    Dataset updated
    Jun 27, 2024
    Authors
    Aman Chauhan
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context:

    We introduce LAMBADA, a dataset to evaluate the capabilities of computational models for text understanding by means of a word prediction task. LAMBADA is a collection of narrative passages sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole passage, but not if they only see the last sentence preceding the target word. To succeed on LAMBADA, computational models cannot simply rely on local context, but must be able to keep track of information in the broader discourse. We show that LAMBADA exemplifies a wide range of linguistic phenomena, and that none of several state-of-the-art language models reaches accuracy above 1% on this novel benchmark. We thus propose LAMBADA as a challenging test set, meant to encourage the development of new models capable of genuine understanding of broad context in natural language text.

    The LAMBADA paper can be found here.

    Research Ideas

    • Evaluating the performance of language models: The LAMBADA dataset can be used to assess the capabilities and limitations of different computational models in understanding and predicting text. By using the dataset, researchers can compare and benchmark their models' word prediction accuracy and contextual understanding.
    • Developing better natural language processing (NLP) algorithms: The dataset can offer valuable insights for improving NLP algorithms and techniques for tasks such as text comprehension, information extraction, summarization, and question answering. Researchers can analyze patterns within the dataset to identify areas where existing algorithms fall short or need enhancement.
    • Training language generation models: With the LAMBADA dataset, developers can train language generation models (e.g., chatbots or virtual assistants) to provide more accurate and contextually appropriate responses in natural language conversations. By exposing these models to a wide range of text samples from different domains, they can learn to generate coherent and relevant predictions in various conversational contexts.

    How to use the dataset

    A Guide to Evaluating Text Understanding and Word Prediction Models

    What is the LAMBADA dataset? The LAMBADA dataset is designed specifically for assessing contextual understanding of language models through word prediction. It consists of sentences or passages of text with corresponding domains that provide context for the word prediction tasks. The dataset comprises three main files: validation.csv, train.csv, and test.csv.

    Familiarize yourself with the columns: a) 'text' column: This column contains sentences or passages from various domains that are used for word prediction tasks. b) 'domain' column: This categorical column indicates the specific domain or topic associated with each text sample.

    Understanding file purposes: a) validation.csv: The primary purpose of this file is to evaluate computational models by testing their word prediction abilities on unseen data samples in different domains. b) train.csv: Utilize this file as training data while evaluating computational models' abilities in both text comprehension and accurate word prediction. c) test.csv: This file enables you to assess your model's performance based on its ability to accurately predict words within provided contexts.

    Effective utilization tips: a) Preprocessing: Before using any machine learning model on this dataset, it is essential to preprocess the data by removing noise such as punctuation marks and special characters while preserving critical textual information. b) Feature Engineering: Explore additional ways like extracting n-grams or employing advanced embedding techniques (e.g., Word2Vec, BERT) to enhance model performance. c) Model Selection: Experiment with various machine learning algorithms, such as LSTM or Transformer-based models, to identify the best approach for word prediction tasks within text understanding.
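
    As a rough illustration of the word-prediction setup described above, the sketch below derives (context, target word) pairs from the text column; the file and column names follow this description, and the whitespace tokenization is a simplifying assumption:

    ```python
    import pandas as pd

    # Assumed layout: train.csv with 'text' and 'domain' columns, as described above.
    df = pd.read_csv("train.csv")

    def split_last_word(passage: str):
        """Split a passage into (context, final word) with simple whitespace tokenization."""
        tokens = passage.strip().split()
        return " ".join(tokens[:-1]), tokens[-1]

    # Keep only passages with at least two tokens so a context/target split exists.
    texts = df["text"].dropna().astype(str)
    texts = texts[texts.str.split().str.len() >= 2]

    pairs = texts.map(split_last_word)
    pairs_df = pd.DataFrame(pairs.tolist(), columns=["context", "target_word"])

    print(pairs_df.head(3))
    print(df["domain"].value_counts().head())
    ```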

    LAMBADA DATASET:

    This archive contains the LAMBADA dataset (Language Modeling Broadened to Account for Discourse Aspects) described in D. Paperno, G. Kruszewski, A. Lazaridou, Q. Pham, R. Bernardi, S. Pezzelle, M. Baroni, G. Boleda and R. Fernandez. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. Proceedings of ACL 2016 (54th Annual Meeting of the Association for Computational Linguistics), East Stroudsburg PA: ACL, pages 1525-1534. The source data come from the Book Corpus, which consists of unpublished novels (see Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba and S. Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. ICCV 2015, pages 19-27).

    You will find 5 files besides this readme in the archive:

    1. lambada_development_plain_text.txt The development data include 4,86...
  14. Tableau Dummy Dataset for Practice

    • kaggle.com
    Updated Aug 21, 2025
    Piush Dave (2025). Tableau Dummy Dataset for Practice [Dataset]. https://www.kaggle.com/datasets/piyushdave/tableau-dummy-dataset-for-practice
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 21, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Piush Dave
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Domain-Specific Dataset and Visualization Guide

    This package contains 20 realistic datasets in CSV format across different industries, along with 20 text files suggesting visualization ideas. Each dataset includes about 300 rows of synthetic but domain-appropriate data. They are designed for data analysis, visualization practice, machine learning projects, and dashboard building.

    What’s inside

    • 20 CSV files, one for each domain:

      1. Education
      2. E-Commerce
      3. Healthcare
      4. Finance
      5. Retail
      6. Social Media
      7. Manufacturing
      8. Sports
      9. Transport
      10. Hospitality
      11. Telecom
      12. Banking
      13. Real Estate
      14. Gaming
      15. Agriculture
      16. Automobile
      17. Energy
      18. Insurance
      19. Government
      20. Entertainment

    • 20 TXT files, each listing 10 relevant graphing options for the dataset.

    • MASTER_INDEX.csv, which summarizes all domains with their column names.

    Use cases

    • Practice data cleaning, exploration, and visualization in Excel, Tableau, Power BI, or Python.
    • Build dashboards for specific industries.
    • Train beginner-level machine learning models such as classification and regression.
    • Use in classroom teaching or workshops as ready-made datasets.

    Example

    • Education dataset has columns like StudentName, Class, Subject, Marks, AttendancePercent. Suggested graphs: bar chart of average marks by subject, scatter plot of marks vs attendance percent, line chart of attendance over time (see the sketch after this list).

    • E-Commerce dataset has columns like OrderDate, Product, Category, Price, Quantity, Total. Suggested graphs: line chart of revenue trend, bar chart of revenue by category, pie chart of payment mode share.
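
    A minimal plotting sketch for the Education example above (the filename Education.csv is an assumption; column names follow the description):

    ```python
    import pandas as pd
    import matplotlib.pyplot as plt

    # Assumed filename for the Education domain CSV; adjust to the file shipped in the package.
    edu = pd.read_csv("Education.csv")

    # One of the suggested graphs: bar chart of average marks by subject.
    avg_marks = edu.groupby("Subject")["Marks"].mean().sort_values()
    avg_marks.plot(kind="bar", title="Average marks by subject")
    plt.ylabel("Average marks")
    plt.tight_layout()
    plt.show()
    ```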

  15. LAMBADA Word Prediction

    • kaggle.com
    zip
    Updated Dec 2, 2023
    The Devastator (2023). LAMBADA Word Prediction [Dataset]. https://www.kaggle.com/datasets/thedevastator/lambada-word-prediction-dataset
    Explore at:
    zip(341313852 bytes)Available download formats
    Dataset updated
    Dec 2, 2023
    Authors
    The Devastator
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    LAMBADA Word Prediction

    Evaluating text understanding through word prediction

    By lambada (From Huggingface) [source]

    About this dataset

    The LAMBADA dataset, also known as LAMBADA: Evaluating Computational Models for Text Understanding, serves as a valuable resource for assessing and evaluating the language understanding and word prediction abilities of computational models. This dataset is specifically designed to test the contextual understanding of these models by providing various text samples and their corresponding domains, thus providing necessary context for effective word prediction tasks.

    Comprised of three main files namely validation.csv, train.csv, and test.csv, this dataset offers a comprehensive range of data for training, validation, and testing purposes. Each file contains a collection of sentences or passages of text that serve as input for the word prediction tasks. Additionally, the domain column in each file indicates the specific domain or topic associated with the text sample. This inclusion allows computational models to be evaluated within relevant contexts and ensures accurate assessment of their performance in word prediction tasks related to specific domains.

    The validation.csv file can be utilized to evaluate computational models' predictive abilities during development stages. It provides both textual samples and corresponding domain information required for assessing model performance accurately.

    On the other hand, train.csv consists of training data that enables thorough exploration and improvement in computational models' textual understanding capabilities over time. By incorporating different sentence structures from diverse domains along with their respective domain labels into this training set, researchers gain invaluable insights into effectively enhancing model predictions within various contexts.

    Lastly, test.csv offers an essential evaluation tool by presenting an independent set of text samples alongside appropriate domain labels solely intended to assess model performance against previously unseen data examples. The aim is to rigorously evaluate how well these computational models predict words within different textual contexts spanning various domains.

    Overall, LAMBADA addresses an essential aspect of natural language processing by offering a benchmarking opportunity through a carefully curated dataset: comprehensive records of text passages, each paired with a domain label that reflects its topic or subject matter.

    How to use the dataset

    Subtitle: A Guide to Evaluating Text Understanding and Word Prediction Models

    Introduction:

    • What is the LAMBADA dataset? The LAMBADA dataset is designed specifically for assessing contextual understanding of language models through word prediction. It consists of sentences or passages of text with corresponding domains that provide context for the word prediction tasks. The dataset comprises three main files: validation.csv, train.csv, and test.csv.

    • Familiarize yourself with the columns: a) 'text' column: This column contains sentences or passages from various domains that are used for word prediction tasks. b) 'domain' column: This categorical column indicates the specific domain or topic associated with each text sample.

    • Understanding file purposes: a) validation.csv: The primary purpose of this file is to evaluate computational models by testing their word prediction abilities on unseen data samples in different domains. b) train.csv: Utilize this file as training data while evaluating computational models' abilities in both text comprehension and accurate word prediction. c) test.csv: This file enables you to assess your model's performance based on its ability to accurately predict words within provided contexts.

    • Effective utilization tips: a) Preprocessing: Before using any machine learning model on this dataset, it is essential to preprocess the data by removing noise such as punctuation marks and special characters while preserving critical textual information. b) Feature Engineering: Explore additional ways like extracting n-grams or employing advanced embedding techniques (e.g., Word2Vec, BERT) to enhance model performance. c) Model Selection: Experiment with various machine learning algorithms, such as LSTM or Transformer-based models, to identify the best approach for word prediction tasks within text understanding.


    Research Ideas

    • Evaluating the performance of language models: The LAMBADA dataset can be used to assess the capabilities and limitations of different computational models in understan...
  16. Pure gender bias detection (Male vs Female)

    • kaggle.com
    zip
    Updated Oct 15, 2025
    Krishna GSVV (2025). Pure gender bias detection (Male vs Female) [Dataset]. https://www.kaggle.com/datasets/krishnagsvv/pure-gender-bias-detection-male-vs-female
    Explore at:
    zip(102799 bytes)Available download formats
    Dataset updated
    Oct 15, 2025
    Authors
    Krishna GSVV
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Equilens Gender Bias

    • Purpose: This corpus was generated by the EquiLens Corpus Generator to enable controlled, reproducible experiments testing how language models respond when only the name varies across prompts. Each row is a single prompt where profession, trait, and template are fixed while the name varies (Male vs Female).
    • **Scope:** ~1,680 prompts for gender bias across multiple professions, competence/social trait categories, and four template variants.
    • **Intended use:** Model-response collection, parsing/cleaning experiments, statistical testing for demographic differences, visualisation, and reproducible research.
    • Sources & provenance: Names, professions, and trait lists are curated and combined deterministically by the project's JSON config (word_lists.json). The generator and metadata are included in the repository for reproducibility.
    • License: MIT

    Column descriptions

    • comparison_type — Audit category (e.g., gender_bias)
    • name — First name used in the prompt (Male or Female)
    • name_category — Name group label (Male / Female)
    • profession — Profession used in the prompt (engineer, nurse, doctor, etc.)
    • trait — Trait word inserted into the template (analytical, caring, etc.)
    • trait_category — Trait class (Competence / Social)
    • template_id — Template variant id (0–3)
    • full_prompt_text — Final full prompt text presented to the model

    Quick reproducibility & validation (PowerShell)

    ```powershell
    # from the dataset folder
    Test-Path .\corpus\audit_corpus_gender_bias.csv
    Get-Content .\corpus\audit_corpus_gender_bias.csv | Measure-Object -Line

    # Create venv and install deps
    python -m venv .venv
    .venv\Scripts\Activate.ps1
    pip install pandas tqdm
    ```

    Quick start: load and basic stats (Python)

    ```python
    import pandas as pd

    df = pd.read_csv("corpus/audit_corpus_gender_bias.csv")

    # counts per category
    print(df['name_category'].value_counts())

    # sample prompts
    print(df.sample(5)['full_prompt_text'].to_list())
    ```

    Recommended evaluation workflow (high level)

    1. Use this CSV to generate model responses for each prompt (consistent model settings).
    2. Clean & parse outputs into numeric/label format as appropriate (use structured prompting where possible).
    3. Aggregate responses grouped by name_category (Male vs Female) while holding profession/trait/template constant.
    4. Compute descriptive stats per group (mean, median, sd) and per stratum (profession × trait_category).
    5. Run statistical tests and effect-size estimates:
       - Permutation test or Mann-Whitney U (non-parametric)
       - Bootstrap confidence intervals for medians/means
       - Cohen’s d or Cliff’s delta for effect size
    6. Correct for multiple comparisons (Benjamini–Hochberg) when testing many strata.
    7. Visualise with violin + boxplots and difference plots with CIs.

    Suggested quantitative metrics
    - Mean/median differences (Male − Female)
    - Bootstrap 95% CI on difference
    - Cohen’s d or Cliff’s delta
    - p-values from permutation test / Mann-Whitney U
    - Proportion of model outputs that deviate from the expected neutral baseline (for categorical outputs)
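
    A minimal sketch of the core tests above, assuming you have already collected and parsed model responses into a dataframe with a numeric score column (that dataframe is not part of this CSV; the filename below is hypothetical):

    ```python
    import numpy as np
    import pandas as pd

    # Hypothetical parsed-responses file produced by steps 1-2 of the workflow above;
    # expected columns: name_category, profession, trait_category, score.
    responses = pd.read_csv("responses_scored.csv")

    def permutation_test(a, b, n_permutations=10_000, seed=0):
        """Two-sided permutation test on the difference in means (a - b)."""
        rng = np.random.default_rng(seed)
        pooled = np.concatenate([a, b])
        observed = a.mean() - b.mean()
        hits = 0
        for _ in range(n_permutations):
            rng.shuffle(pooled)
            diff = pooled[: len(a)].mean() - pooled[len(a):].mean()
            if abs(diff) >= abs(observed):
                hits += 1
        return observed, (hits + 1) / (n_permutations + 1)

    def cohens_d(a, b):
        """Cohen's d with a pooled standard deviation."""
        pooled_sd = np.sqrt(((len(a) - 1) * a.std(ddof=1) ** 2 +
                             (len(b) - 1) * b.std(ddof=1) ** 2) / (len(a) + len(b) - 2))
        return (a.mean() - b.mean()) / pooled_sd

    # One test per stratum (profession x trait_category), as in steps 4-5 above.
    for (prof, cat), g in responses.groupby(["profession", "trait_category"]):
        male = g.loc[g["name_category"] == "Male", "score"].to_numpy()
        female = g.loc[g["name_category"] == "Female", "score"].to_numpy()
        diff, p = permutation_test(male, female)
        print(f"{prof}/{cat}: diff={diff:.3f}, p={p:.4f}, d={cohens_d(male, female):.3f}")
    ```

    Benjamini–Hochberg correction can then be applied to the resulting p-values across strata, as recommended in step 6.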

    Suggested visualizations
    - Grouped violin plots (by profession) split by name_category
    - Difference-in-means bar with bootstrap CI per profession
    - Heatmap of effect sizes (profession × trait_category)
    - Distribution overlay of raw responses
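
    A matching plotting sketch (seaborn + matplotlib), again assuming the hypothetical parsed-responses dataframe described above:

    ```python
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Hypothetical parsed-responses file; expected columns: profession, name_category, score.
    responses = pd.read_csv("responses_scored.csv")

    # Grouped violin plot by profession, split by name category (Male vs Female).
    plt.figure(figsize=(10, 5))
    sns.violinplot(data=responses, x="profession", y="score", hue="name_category", split=True)
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.show()
    ```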

    Recommended analysis notebooks/kernels to provide on Kaggle
    - 01_data_load_and_summary.ipynb — load CSV, sanity checks, counts
    - 02_model_response_collection.ipynb — how to call a model endpoint safely (placeholders)
    - 03_cleaning_and_parsing.ipynb — parsing rules and robustness tests
    - 04_statistical_tests.ipynb — permutation tests, bootstrap CI, effect sizes
    - 05_visualizations.ipynb — plots and interpretation

    Security & best practices
    - Never commit API keys in notebooks. Use environment variables and secrets built into Kaggle.
    - Keep model calls rate-limited and log failures; use retry/backoff.
    - Use fixed random seeds for reproducibility where sampling occurs.

    Limitations & caveats (must show on dataset page)
    - Cultural and name recognition: names may suggest different demographics across regions; results are context-sensitive.
    - Only Male vs Female: the dataset intentionally isolates binary gender categories; extend carefully for broader demographic categories.
    - Controlled prompts reduce ecological validity — real interactions may be longer and noisier.
    - Parsing risk: models sometimes add explanatory text; structured prompting or requesting a JSON response is recommended.

    How this dataset differs from academic prototypes

    This corpus is deterministic and template-driven to ensure strict control over confounds (only the name varies). Use it when you require reproducibility and controlled comparisons rather than open-ended, real-world prompts.

    Suggested Kaggle tags and categor...

  17. AI4Math: Mathematical QA Dataset

    • kaggle.com
    zip
    Updated Nov 26, 2023
    The Devastator (2023). AI4Math: Mathematical QA Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/ai4math-mathematical-qa-dataset
    Explore at:
    zip(1206195420 bytes)Available download formats
    Dataset updated
    Nov 26, 2023
    Authors
    The Devastator
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    AI4Math: Mathematical QA Dataset

    9,800 Questions from IQ Tests, FunctionQA & PaperQA

    By Huggingface Hub [source]

    About this dataset

    AI4Math is an invaluable resource for researchers developing tools for mathematical question-answering. With a total of 9,800 questions from IQ tests, FunctionQA tasks, and PaperQA presentations, the dataset provides a comprehensive collection of questions with valuable annotations: the text of each question, related images along with a decoded version of each image, selectable answer choices where relevant, and predetermined answer types plus precision settings and metadata that can provide additional insight into certain cases. Researchers can use the dataset to target different areas of mathematical question-answering, from IQ tests to natural-language-based function computation, while assessing progress against the recorded answers.


    How to use the dataset

    • Before you get started with this dataset, it is important to familiarize yourself with the columns: question, image, decoded_image, choices, unit, precision​ , question_type​ , answer_type​ , metadata ​and query.
    • It is advisable that you read through the data dictionary provided in order to understand which columns are given in the dataset and what type of data each column contains (for example 'question' for text questions). ​
    • Once you understand what information is contained within each column, start exploring. A visual exploration tool such as Tableau or Dataiku DSS can surface trends across different fields before any in-depth analysis or machine learning processing.
    • Consider using a text-analysis tool such as the Google Natural Language API or Word2Vec embeddings to look for relationships between words used in questions and answers; this can provide more insight into the data and suggest ideas for future research.
    • Keep track of dataset versions when performing tasks on any large dataset, so that mistakes can be reverted without accidentally losing completed analyses or models.
    • After exploring the data, move on to the actual machine learning processing; depending on the task, supervised or unsupervised algorithms, neural networks, and other approaches may apply, and trying multiple solutions is worthwhile since some techniques work better than others for a given problem.
    • After running experiments, record the results and the metrics obtained along the way, for training as well as prediction.
    • Evaluate models after every cycle to confirm that their performance is stable; a consistent improvement trend is often a more reliable indicator of a valid model than a single accuracy number.
    • Once satisfied with the results, monitor performance continuously over time to check that everything still works correctly.
    • To stay up to date with new developments in the technologies you use, consider subscribing to the mailing lists of the relevant software products and companies.
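
    A small exploration sketch in pandas; the filename data.csv is a placeholder, while the column names come from the list above:

    ```python
    import pandas as pd

    # Placeholder path; point this at the actual file in the download.
    df = pd.read_csv("data.csv")

    # Confirm the documented columns are present.
    expected = ["question", "image", "decoded_image", "choices", "unit",
                "precision", "question_type", "answer_type", "metadata", "query"]
    print("missing columns:", [c for c in expected if c not in df.columns])

    # Distribution of question and answer types.
    print(df["question_type"].value_counts())
    print(df["answer_type"].value_counts())
    ```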

    Research Ideas

    • Using the metadata and question columns to develop algorithms that automatically generate questions for certain topics as defined by the user.
    • Utilizing the image column to create a computer vision model for predicting and classifying similar images.
    • Analyzing the content in both the choices and answer_type columns for extracting underlying patterns in IQ Tests, FunctionQA tasks, and PaperQA presentations

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy,...

  18. German Question-Answer Context Dataset

    • kaggle.com
    zip
    Updated Dec 5, 2023
    The Devastator (2023). German Question-Answer Context Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/german-question-answer-context-dataset/code
    Explore at:
    zip(2618460 bytes)Available download formats
    Dataset updated
    Dec 5, 2023
    Authors
    The Devastator
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    German Question-Answer Context Dataset

    German Q&A Context Dataset

    By germanquad (From Huggingface) [source]

    About this dataset

    The dataset provided is a comprehensive collection of German question-answer pairs with their corresponding context. It has been specifically compiled for the purpose of enhancing and facilitating natural language processing (NLP) tasks in the German language. The dataset includes two main files: train.csv and test.csv.

    The train.csv file contains a substantial amount of data, consisting of numerous entries that comprise various contexts along with their corresponding questions and answers in German. The contextual information may range from paragraphs to concise sentences, providing a well-rounded representation of different scenarios.

    Similarly, the test.csv file also contains a significant number of question-answer pairs in German along with their respective contexts. This file can be utilized for model evaluation and testing purposes, ensuring the robustness and accuracy of NLP models developed using this dataset.

    Both train.csv and test.csv provide valuable resources for training machine learning models in order to improve question-answering systems or any other NLP application specific to the German language. The inclusion of multiple context fields enhances diversity within the dataset and enables more thorough analysis by accounting for varying linguistic structures.

    Ultimate objectives behind creating this rich dataset involve fostering advancements in machine learning techniques applied to natural language understanding in German. Researchers, developers, and enthusiasts working on NLP tasks can leverage this extensive collection to explore state-of-the-art methodologies or develop novel approaches focused on understanding complex questions within given contextual frameworks accurately.

    How to use the dataset

    • Understanding the Dataset Structure: The dataset consists of two files - train.csv and test.csv. Both files contain question-answer pairs along with their corresponding context.

    • Columns: Each file has multiple columns that provide important information about the data:

      • context: This column contains the context in which the question is being asked. It can be a paragraph, a sentence, or any other relevant information.

      • answers: This column contains the answer(s) to the given question in the corresponding context. The answers could be single or multiple.

    • Exploring and Analyzing Data: Before diving into any analysis or modeling tasks, it's recommended to explore and analyze the dataset thoroughly:

      • Load both train.csv and test.csv files into your preferred programming environment (Python/R).

      • Check for missing values (NaN) or any inconsistencies in data.

      • Analyze statistical properties of different columns such as count, mean, standard deviation etc., to understand variations within your dataset.

    • Preprocessing Text Data: Since this dataset contains text data (questions, answers), preprocessing steps might be required before further analysis (a minimal sketch follows after this list).

      • Process text by removing punctuation marks, special characters and converting all words to lowercase for better consistency.

      • Tokenize text data by splitting sentences into individual words/tokens using libraries like NLTK or SpaCy.

      • Remove stop words (commonly occurring irrelevant words like 'the', 'is', etc.) from your text using available stop word lists.

    • Building Models: Once you have preprocessed your data appropriately, you can proceed with building models using a variety of techniques based on your goals and requirements. Some common approaches include:

      • Building question-answering systems using machine learning algorithms like Natural Language Processing (NLP) or transformers.

      • Utilizing pre-trained language models such as BERT, GPT, etc., for more accurate predictions.

      • Implementing deep learning architectures like LSTM or CNN for better contextual understanding.

    • Model Evaluation: After training your models, evaluate their performance by utilizing appropriate evaluation metrics and techniques.

    • Iterative Process: Most often, the process of building effective question-answering
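
    A minimal preprocessing sketch for the steps above; the context and answers columns come from this description, while the normalization rules (lowercasing, stripping punctuation, whitespace tokenization) are simplifying assumptions:

    ```python
    import re
    import pandas as pd

    train = pd.read_csv("train.csv")
    test = pd.read_csv("test.csv")

    # Step 3: quick checks for missing values.
    print(train.isna().sum())

    def normalize(text: str) -> str:
        """Lowercase and strip punctuation/special characters, keeping German umlauts."""
        text = text.lower()
        return re.sub(r"[^a-zäöüß0-9\s]", " ", text)

    # Step 4: normalize and tokenize the context column (plain whitespace split).
    train["context_tokens"] = train["context"].astype(str).map(normalize).str.split()
    print(train["context_tokens"].head(2))
    ```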

    Research Ideas

    • Language understanding and translation: This dataset can be used to train models for German language understanding and translation tasks. By providing context, question, and answer pairs, the models can learn to understand the meaning of sentences in German and ...
  19. Suicidal Behaviors and Attempts

    • kaggle.com
    zip
    Updated Jan 24, 2023
    The Devastator (2023). Suicidal Behaviors and Attempts [Dataset]. https://www.kaggle.com/datasets/thedevastator/suicidal-behaviors-and-attempts/code
    Explore at:
    zip(1359352 bytes)Available download formats
    Dataset updated
    Jan 24, 2023
    Authors
    The Devastator
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Suicidal Behaviors and Attempts

    Lethality, Risk Factors, and Statuses

    By [source]


    How to use the dataset

    Before you begin your analysis, it is important that you are familiar with the dataset:

    • The columns include: User (a unique identifier for each user), Post (the text of each post), and Label (indicates whether a post is associated with suicidal behavior or not).
    • Each row in this dataset provides detailed information regarding one user’s post on Reddit related to suicide.

    Now that you have an understanding of what’s included in this set, let's dive into working with it! First off, we recommend exploring within Jupyter Notebook given its ease of use and interactive nature - just open up a new notebook in Kaggle Notebooks. Here are some helpful tips:

    • Explore Data Types : Take some time getting familiarized with what type of data is found in each column by using various commands such as .dtypes or .info(). Knowing which type each column holds will make it easier when filtering columns later on. You could also explore any missing values using .isnull().sum() command which provides a good indication if any preprocessing such as filling missing values needs to take place prior to analysis.

    • Analyze Labels & Posts: Get a better understanding of the labels attached to posts using the value_counts() command, which summarizes the proportions between the two classes so that more informed decisions can be made during the analysis/modeling stages. Real-world problems often require analyzing the different labels involved before proceeding further, so take your time here. For example, posts can be grouped by label via df.groupby('Label').

    • Visualize your Results: Visualization makes findings easier to interpret. Try matplotlib (e.g., line or scatter plots) or seaborn (e.g., sns.heatmap), or use Tableau externally once data preparation has been completed in the notebook. Python libraries such as scikit-learn and NumPy can then be used for modeling techniques (machine learning algorithms) and numerical computations (e.g., linear algebra) respectively.
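
    A short sketch of the tips above, using the file and column names listed in the Columns section further down:

    ```python
    import pandas as pd

    df = pd.read_csv("500_Reddit_users_posts_labels.csv")

    # Column types and missing values.
    print(df.dtypes)
    print(df.isnull().sum())

    # Label proportions and post counts per label.
    print(df["Label"].value_counts(normalize=True))
    print(df.groupby("Label")["Post"].count())
    ```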

    Research Ideas

    • Analyzing which risk factors associated with suicidal behavior are most prevalent in certain demographic groups, such as gender and age.
    • Examining the potential outcomes of different methods of self-harm and understanding their lethality levels to create more effective prevention and response strategies.
    • Creating predictive models for mental health workers to use when assessing individuals at risk of suicide so they can identify individuals who may need immediate intervention or follow up care

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: 500_Reddit_users_posts_labels.csv

    | Column name | Description |
    |:------------|:-------------------------------------------------------------------------------------|
    | User        | Unique identifier for each user. (String)                                             |
    | Post        | Text of the post. (String)                                                            |
    | Label       | Label indicating whether the post is related to suicidal behavior or not. (Boolean)   |

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

  20. CCDV Arxiv Summarization Dataset

    • kaggle.com
    zip
    Updated Dec 5, 2023
    The Devastator (2023). CCDV Arxiv Summarization Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/ccdv-arxiv-summarization-dataset
    Explore at:
    zip(2219742528 bytes)Available download formats
    Dataset updated
    Dec 5, 2023
    Authors
    The Devastator
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    CCDV Arxiv Summarization Dataset

    Arxiv Summarization Dataset for CCDV

    By ccdv (From Huggingface) [source]

    About this dataset

    The validation.csv file contains a set of articles along with their respective abstracts that can be used for validating the performance of summarization models. This subset allows researchers to fine-tune their models and measure how well they can summarize scientific texts.

    The train.csv file serves as the primary training data for building summarization models. It consists of numerous articles extracted from the Arxiv database, paired with their corresponding abstracts. By utilizing this file, researchers can develop and train various machine learning algorithms to generate accurate summaries of scientific papers.

    Lastly, the test.csv file provides a separate set of articles with accompanying abstracts specifically intended for evaluating the performance and effectiveness of summarization models developed using this dataset. Researchers can utilize this test set to conduct rigorous evaluations and benchmark different approaches in automatic document summarization.

    With columns labeled as article and abstract, each corresponding to multiple repetitions in order to allow detailed analysis or multiple variations if required by users (e.g., different proposed summaries), this dataset provides significant flexibility in developing robust models for summarizing complex scientific documents.

    How to use the dataset

    • Introduction:

    • File Description:

    • validation.csv: This file contains articles and their respective abstracts that can be used for validation purposes.

    • train.csv: The purpose of this file is to provide training data for summarizing scientific articles.

    • test.csv: This file includes a set of articles and their corresponding abstracts that can be used to evaluate the performance of summarization models.

    • Dataset Structure: Each file contains two column types, article and abstract; these names are repeated across multiple columns in the raw files to allow several variants of a paper's text or summary per row.

    • Usage Examples: This dataset can be utilized in various ways:

    a) Training Models: You can use the train.csv file to train your own model for summarizing scientific articles from the Arxiv database. The article column provides the full text of each scientific paper, while the abstract column contains its summary.

    b) Validation: The validation.csv file allows you to validate your trained models by comparing their generated summaries with the provided reference summaries in order to assess their performance.

    c) Evaluation: Utilize the test.csv file as a benchmark for evaluating different summarization models. Generate summaries using your selected model and compare them with reference summaries.

    • Evaluating Performance: To measure how well your summarization model performs on this dataset, you can use metrics like ROUGE (Recall-Oriented Understudy for Gisting Evaluation). ROUGE measures overlap between generated summaries and reference summaries based on n-gram co-occurrence statistics (see the sketch after this list).

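    A hedged evaluation sketch: one common option is the rouge-score package (pip install rouge-score); generate_summary below is a placeholder baseline, not the intended model:

    ```python
    import pandas as pd
    from rouge_score import rouge_scorer

    test = pd.read_csv("test.csv")
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

    def generate_summary(article: str) -> str:
        """Placeholder: replace with your trained summarization model."""
        return " ".join(str(article).split()[:50])  # naive lead-50-words baseline

    rouge_l = []
    for _, row in test.head(100).iterrows():  # small slice for a quick check
        scores = scorer.score(row["abstract"], generate_summary(row["article"]))
        rouge_l.append(scores["rougeL"].fmeasure)

    print(f"mean ROUGE-L F1 over sample: {sum(rouge_l) / len(rouge_l):.3f}")
    ```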

    Research Ideas

    • Summarizing scientific articles: This dataset can be used to train and evaluate summarization models for the task of generating concise summaries of scientific articles from the Arxiv database. Researchers can utilize this dataset to develop novel techniques and approaches for automatic summarization in the scientific domain.
    • Information retrieval: The dataset can be used to enhance search engines or information retrieval systems by providing concise summaries along with the full text of scientific articles. This would enable users to quickly grasp key information without having to read the entire article, improving accessibility and efficiency.
    • Text generation research: Researchers interested in natural language processing and text generation can use this dataset as a benchmark for developing new models and algorithms that generate coherent, informative, and concise summaries from lengthy scientific texts. The dataset provides a diverse range of articles across various domains, allowing researchers to explore different challenges in summary generation

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    **License: [CC0 1.0 Universal (CC0 1.0) - Public Domain...

📺 10,000+ Popular TV Shows Dataset (TMDB)

10K popular TV shows from TMDB with metadata for analysis, ML & recommendations


Additional notes and quirks:
- backdrop_path missing ~**6.8%**; poster_path missing ~**2.5%**.
- Some first_air_date values are future dates — verify before time-series modeling.
- name is not guaranteed unique (remakes, different regions, duplicates). Use id as primary identifier.

Small sample (first 3 rows)

This quick preview shows structure and typical values:

| id | name | original_name | lang | country | first_air_date | genre_ids | popularity | vote_avg | vote_cnt |
|---|---|---|---|---|---|---|---|---|---|
| 119051 | Wednesday | Wednesday | en | ['US'] | 2022-11-23 | [10765, 9648, 35] | 318.781 | 8.392 | 9781 |
| 194766 | The Summer I Turned Pretty | The Summer I Turned Pretty | en | ['US'] | 2022-06-17 | [18] | 266.293 | 8.173 | 956 |
| 157239 | Alien: Earth | Alien: Earth | en | ['US'] | 2021-04-08 | [10765, 18] | 229.496 | 7.708 | 427 |

Quick EDA highlights

  • Top origin countries: US, JP, KR, CN, GB.
  • Top original languages: en, ja, zh, ko, es.
  • Overviews available for ~90.7% of rows — suitable for embeddings, topic modeling, summarization.
  • Popularity & vote metrics included (use vote_count to filter reliable ratings).

Suggested uses / example projects

  • Content-based recommender using genre_ids + overview embeddings (see the sketch at the end of this list).
  • NLP: topic modeling, clustering, summarization, or fine-tuning on overview.
  • Popularity/rating prediction from metadata (genres, country, language, overview length).
  • Temporal analysis: show production/popularity trends by decade & country.
  • Image projects if you ma...
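
A minimal content-based sketch for the first suggested use above; the filename tv_shows.csv is an assumption, and TF-IDF over overview stands in for richer embeddings:

```python
import ast
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Filename is an assumption; adjust to the CSV shipped with the dataset.
shows = pd.read_csv("tv_shows.csv")

# genre_ids is stored as a list-like string; parse it into real lists.
shows["genre_ids"] = shows["genre_ids"].apply(ast.literal_eval)

# TF-IDF vectors of the overview text (missing overviews become empty strings).
tfidf = TfidfVectorizer(stop_words="english", max_features=20_000)
overview_matrix = tfidf.fit_transform(shows["overview"].fillna(""))

# Top-5 shows most similar to the first row, by overview cosine similarity.
sims = cosine_similarity(overview_matrix[0], overview_matrix).ravel()
top = sims.argsort()[::-1][1:6]
print(shows.iloc[top][["id", "name"]])
```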