14 datasets found
  1. Data-cleaning through Pandas

    • kaggle.com
    Updated Feb 18, 2023
    Cite
    Muhammad Altaf Khan (2023). Data-cleaning through Pandas [Dataset]. https://www.kaggle.com/datasets/altafk/datacleaning-through-pandas
    Dataset updated
    Feb 18, 2023
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Muhammad Altaf Khan
    Description

    Dataset

    This dataset was created by Muhammad Altaf Khan

    Contents

  2. Reddit r/AskScience Flair Dataset

    • data.mendeley.com
    Updated May 23, 2022
    Cite
    Sumit Mishra (2022). Reddit r/AskScience Flair Dataset [Dataset]. http://doi.org/10.17632/k9r2d9z999.3
    Dataset updated
    May 23, 2022
    Authors
    Sumit Mishra
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Reddit is a social news, content rating and discussion website, and one of the most popular sites on the internet. Reddit has 52 million daily active users and approximately 430 million users who use it once a month. Reddit is divided into subreddits; this dataset uses the r/AskScience subreddit.

    The dataset is extracted from the subreddit /r/AskScience on Reddit. The data was collected between 01-01-2016 and 20-05-2022. It contains 612,668 data points and 25 columns. It includes information about the questions asked on the subreddit, the description of the submission, the flair of the question, NSFW or SFW status, the year of the submission, and more. The data was extracted using Python and Pushshift's API, with a little cleaning done using NumPy and pandas (see the descriptions of individual columns below).

    The dataset contains the following columns and descriptions:

    • author - Redditor Name
    • author_fullname - Redditor Full name
    • contest_mode - Contest mode [implement obscured scores and randomized sorting].
    • created_utc - Time the submission was created, represented in Unix Time.
    • domain - Domain of submission.
    • edited - If the post is edited or not.
    • full_link - Link of the post on the subreddit.
    • id - ID of the submission.
    • is_self - Whether or not the submission is a self post (text-only).
    • link_flair_css_class - CSS Class used to identify the flair.
    • link_flair_text - Flair on the post or The link flair’s text content.
    • locked - Whether or not the submission has been locked.
    • num_comments - The number of comments on the submission.
    • over_18 - Whether or not the submission has been marked as NSFW.
    • permalink - A permalink for the submission.
    • retrieved_on - time ingested.
    • score - The number of upvotes for the submission.
    • description - Description of the Submission.
    • spoiler - Whether or not the submission has been marked as a spoiler.
    • stickied - Whether or not the submission is stickied.
    • thumbnail - Thumbnail of Submission.
    • question - Question Asked in the Submission.
    • url - The URL the submission links to, or the permalink if a self post.
    • year - Year of the Submission.
    • banned - Banned by the moderator or not.

    This dataset can be used for flair prediction, NSFW classification, and various text mining/NLP tasks. Exploratory data analysis can also be used to surface insights, trends, and patterns over the years.
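
As a rough illustration of how the created_utc column (Unix time) might be turned into a calendar year before exploring flair trends, here is a minimal pandas sketch. The sample rows and the names `posts`, `created_at`, `year_from_utc`, and `flair_counts` are made up for illustration; only the column names created_utc, link_flair_text, and over_18 come from the description above.

```python
import pandas as pd

# Hypothetical sample rows; the real file is described as having 25 columns.
posts = pd.DataFrame(
    {
        "created_utc": [1451606400, 1653004800],
        "link_flair_text": ["Physics", "Biology"],
        "over_18": [False, False],
    }
)

# Interpret the integers as seconds since the Unix epoch (UTC).
posts["created_at"] = pd.to_datetime(posts["created_utc"], unit="s", utc=True)
posts["year_from_utc"] = posts["created_at"].dt.year

# Count submissions per year and flair, a possible first step of an EDA.
flair_counts = posts.groupby(["year_from_utc", "link_flair_text"]).size()
```

The two sample timestamps correspond to the first and last collection dates quoted in the description.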

  3. Klib library python

    • kaggle.com
    Updated Jan 11, 2021
    Cite
    Sripaad Srinivasan (2021). Klib library python [Dataset]. https://www.kaggle.com/sripaadsrinivasan/klib-library-python/discussion
    Dataset updated
    Jan 11, 2021
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Sripaad Srinivasan
    Description

    The klib library lets us quickly visualize missing data, perform data cleaning, and plot data distributions, correlations, and categorical column values. klib is a Python library for importing, cleaning, analyzing and preprocessing data. Explanations of key functionality can be found on Medium / TowardsDataScience, in the examples section, or on YouTube (Data Professor).

    Original Github repo


    Usage

    !pip install klib
    
    import klib
    import pandas as pd
    
    df = pd.DataFrame(data)
    
    # klib.describe functions for visualizing datasets
    klib.cat_plot(df) # returns a visualization of the number and frequency of categorical features
    klib.corr_mat(df) # returns a color-encoded correlation matrix
    klib.corr_plot(df) # returns a color-encoded heatmap, ideal for correlations
    klib.dist_plot(df) # returns a distribution plot for every numeric feature
    klib.missingval_plot(df) # returns a figure containing information about missing values
    

    Examples

    Take a look at this starter notebook.

    Further examples, as well as applications of the functions can be found here.

    Contributing

    Pull requests and ideas, especially for further functions are welcome. For major changes or feedback, please open an issue first to discuss what you would like to change. Take a look at this Github repo.

    License

    MIT

  4. IMDb Top 4070: Explore the Cinema Data

    • kaggle.com
    Updated Aug 15, 2023
    Cite
    K.T.S. Prabhu (2023). IMDb Top 4070: Explore the Cinema Data [Dataset]. https://www.kaggle.com/datasets/ktsprabhu/imdb-top-4070-explore-the-cinema-data
    Dataset updated
    Aug 15, 2023
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    K.T.S. Prabhu
    Description

    Description: Dive into the world of exceptional cinema with our meticulously curated dataset, "IMDb's Gems Unveiled." This dataset is a result of an extensive data collection effort based on two critical criteria: IMDb ratings exceeding 7 and a substantial number of votes, surpassing 10,000. The outcome? A treasure trove of 4070 movies meticulously selected from IMDb's vast repository.

    What sets this dataset apart is its richness and diversity. With more than 20 data points meticulously gathered for each movie, this collection offers a comprehensive insight into each cinematic masterpiece. Our data collection process leveraged the power of Selenium and Pandas modules, ensuring accuracy and reliability.

    Cleaning this vast dataset was a meticulous task, combining both Excel and Python for optimum precision. Analysis is powered by Pandas, Matplotlib, and NLTK, making it possible to uncover hidden patterns, trends, and themes within the realm of cinema.

    Note: The data was collected as of April 2023. Future versions of this analysis will include a movie recommendation system. Please do connect for any queries. All Love, No Hate.

  5. Global Cinematic Database

    • opendatabay.com
    Updated Jul 6, 2025
    Cite
    Datasimple (2025). Global Cinematic Database [Dataset]. https://www.opendatabay.com/data/ai-ml/f31e49e3-2a4e-46c0-9b56-725ebb64af48
    Dataset updated
    Jul 6, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Entertainment & Media Consumption
    Description

    This dataset features data on over 10,000 films from TMDB, gathered using the TMDB API. It encompasses details such as film identifiers, titles, release dates, average votes, vote counts, overviews, and popularity metrics. The dataset may contain null values where information was not available from the TMDB database. It is particularly useful for new analysts looking to practise handling missing data and for developing film recommendation systems.

    Columns

    • id: Unique identifier for the film.
    • title: The name of the film.
    • overview: A brief summary or synopsis of the film.
    • release_date: The original release date of the film.
    • popularity: A numerical score indicating the film's popularity.
    • vote_average: The average vote score received by the film.
    • vote_count: The total number of votes cast for the film.

    Distribution

    The dataset contains information on over 10,000 films. The data is typically available in CSV format, structured as a pandas DataFrame. It includes unique identifiers for nearly 10,000 films. Release dates span from 17th April 1902 to 7th September 2022. Popularity scores vary widely, with the majority falling into the lower ranges but some reaching high values. Vote counts also show a broad distribution, and average vote scores range from approximately 5.00 to 8.70. Some fields within the dataset may contain null values.

    Usage

    This dataset is ideal for:

    • Developing and testing film recommendation systems.
    • Practising data cleaning and handling of missing values, particularly beneficial for new data analysts.
    • Exploratory data analysis of film trends and audience reception.
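
For the missing-value practice mentioned above, a minimal pandas sketch might look like the following. The sample rows are invented, and the imputation policy (empty string for text, median for numbers) is only one possible choice, not something the dataset prescribes. The column names are taken from the Columns list.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows using the documented column names.
films = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "title": ["A", "B", "C"],
        "overview": ["A summary.", None, "Another summary."],
        "release_date": ["1902-04-17", None, "2022-09-07"],
        "vote_average": [7.5, np.nan, 6.5],
    }
)

# Count nulls per column to see where data is missing.
null_counts = films.isna().sum()

# Parse dates; missing or unparseable values become NaT.
films["release_date"] = pd.to_datetime(films["release_date"], errors="coerce")

# Assumed policy: empty string for missing text, median for missing numbers.
films["overview"] = films["overview"].fillna("")
films["vote_average"] = films["vote_average"].fillna(films["vote_average"].median())
```

Whether to impute, drop, or flag missing rows depends on the downstream task; a recommendation model, for example, may prefer dropping films without an overview.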

    Coverage

    The dataset's coverage is global. It includes films released between 17th April 1902 and 7th September 2022. No specific demographic scope is noted; coverage is based on films available through the TMDB API.

    License

    CC0

    Who Can Use It

    • Data Analysts: Especially those new to data analysis, to gain experience with data manipulation and missing value imputation.
    • Machine Learning Engineers: For building and evaluating film recommendation algorithms.
    • Researchers: Studying film industry trends, audience preferences, and cinematic history.
    • Developers: Creating applications that require film metadata.

    Dataset Name Suggestions

    • TMDB Movies Data
    • Film Insights Collection
    • Global Cinematic Database
    • Movie Popularity and Ratings
    • Open Film Dataset

    Attributes

    Original Data Source: TMDB MOVIES DATASET

  6. A Replication Dataset for Fundamental Frequency Estimation

    • live.european-language-grid.eu
    • data.niaid.nih.gov
    json
    Updated Oct 19, 2023
    Cite
    (2023). A Replication Dataset for Fundamental Frequency Estimation [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7808
    Available download formats: json
    Dataset updated
    Oct 19, 2023
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Part of the dissertation Pitch of Voiced Speech in the Short-Time Fourier Transform: Algorithms, Ground Truths, and Evaluation Methods. © 2020, Bastian Bechtold. All rights reserved.

    Estimating the fundamental frequency of speech remains an active area of research, with varied applications in speech recognition, speaker identification, and speech compression. A vast number of algorithms for estimating this quantity have been proposed over the years, and a number of speech and noise corpora have been developed for evaluating their performance. The present dataset contains estimated fundamental frequency tracks of 25 algorithms, six speech corpora, two noise corpora, at nine signal-to-noise ratios between -20 and 20 dB SNR, as well as an additional evaluation of synthetic harmonic tone complexes in white noise.

    The dataset also contains pre-calculated performance measures both novel and traditional, in reference to each speech corpus’ ground truth, the algorithms’ own clean-speech estimate, and our own consensus truth. It can thus serve as the basis for a comparison study, or to replicate existing studies from a larger dataset, or as a reference for developing new fundamental frequency estimation algorithms. All source code and data is available to download, and entirely reproducible, albeit requiring about one year of processor time.

    Included Code and Data

    ground truth data.zip is a JBOF dataset of fundamental frequency estimates and ground truths of all speech files in the following corpora:

    • CMU-ARCTIC (consensus truth) [1]
    • FDA (corpus truth and consensus truth) [2]
    • KEELE (corpus truth and consensus truth) [3]
    • MOCHA-TIMIT (consensus truth) [4]
    • PTDB-TUG (corpus truth and consensus truth) [5]
    • TIMIT (consensus truth) [6]

    noisy speech data.zip is a JBOF dataset of fundamental frequency estimates of speech files mixed with noise from the following corpora:

    • NOISEX [7]
    • QUT-NOISE [8]

    synthetic speech data.zip is a JBOF dataset of fundamental frequency estimates of synthetic harmonic tone complexes in white noise.

    noisy_speech.pkl and synthetic_speech.pkl are pickled Pandas dataframes of performance metrics derived from the above data for the following list of fundamental frequency estimation algorithms: AUTOC [9], AMDF [10], BANA [11], CEP [12], CREPE [13], DIO [14], DNN [15], KALDI [16], MAPSMBSC [17], NLS [18], PEFAC [19], PRAAT [20], RAPT [21], SACC [22], SAFE [23], SHR [24], SIFT [25], SRH [26], STRAIGHT [27], SWIPE [28], YAAPT [29], YIN [30]

    noisy speech evaluation.py and synthetic speech evaluation.py are Python programs to calculate the above Pandas dataframes from the above JBOF datasets. They calculate the following performance measures:

    • Gross Pitch Error (GPE), the percentage of pitches where the estimated pitch deviates from the true pitch by more than 20%.
    • Fine Pitch Error (FPE), the mean error of grossly correct estimates.
    • High/Low Octave Pitch Error (OPE), the percentage of pitches that are GPEs and happen to be at an integer multiple of the true pitch.
    • Gross Remaining Error (GRE), the percentage of pitches that are GPEs but not OPEs.
    • Fine Remaining Bias (FRB), the median error of GREs.
    • True Positive Rate (TPR), the percentage of true positive voicing estimates.
    • False Positive Rate (FPR), the percentage of false positive voicing estimates.
    • False Negative Rate (FNR), the percentage of false negative voicing estimates.
    • F₁, the harmonic mean of precision and recall of the voicing decision.
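
To make the first two measures concrete, the following NumPy sketch computes a GPE and an FPE for a handful of frames. It is not the dataset's evaluation code. The function name `gross_and_fine_pitch_error` is hypothetical, and it assumes that all frames are voiced, that deviation is measured relative to the true pitch, and that the "mean error" of FPE is a mean absolute relative error.

```python
import numpy as np

def gross_and_fine_pitch_error(true_f0, est_f0, threshold=0.2):
    """Return (GPE in percent, FPE as mean absolute relative error)."""
    true_f0 = np.asarray(true_f0, dtype=float)
    est_f0 = np.asarray(est_f0, dtype=float)
    # Relative deviation of each estimate from the true pitch.
    rel_err = (est_f0 - true_f0) / true_f0
    # A gross error deviates by more than the threshold (20% by default).
    gross = np.abs(rel_err) > threshold
    gpe = 100.0 * gross.mean()
    # FPE is taken over the remaining, grossly correct estimates only.
    fine = np.abs(rel_err[~gross])
    fpe = fine.mean() if fine.size else float("nan")
    return gpe, fpe

# Hypothetical pitch tracks in Hz; the second estimate is an octave error.
true_f0 = [100.0, 200.0, 150.0, 120.0]
est_f0 = [110.0, 400.0, 150.0, 108.0]
gpe, fpe = gross_and_fine_pitch_error(true_f0, est_f0)
```

In this toy input one of four frames is a gross error, and that frame sits at twice the true pitch, so it would also count toward the octave error measure.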

    Pipfile is a pipenv-compatible pipfile for installing all prerequisites necessary for running the above Python programs.

    The Python programs take about an hour to compute on a fast 2019 computer, and require at least 32 GB of memory.

    References:

    [1] John Kominek and Alan W Black. CMU ARCTIC database for speech synthesis, 2003.
    [2] Paul C Bagshaw, Steven Hiller, and Mervyn A Jack. Enhanced Pitch Tracking and the Processing of F0 Contours for Computer Aided Intonation Teaching. In EUROSPEECH, 1993.
    [3] F Plante, Georg F Meyer, and William A Ainsworth. A Pitch Extraction Reference Database. In Fourth European Conference on Speech Communication and Technology, pages 837–840, Madrid, Spain, 1995.
    [4] Alan Wrench. MOCHA MultiCHannel Articulatory database: English, November 1999.
    [5] Gregor Pirker, Michael Wohlmayr, Stefan Petrik, and Franz Pernkopf. A Pitch Tracking Corpus with Evaluation on Multipitch Tracking Scenario. page 4, 2011.
    [6] John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, David S. Pallett, Nancy L. Dahlgren, and Victor Zue. TIMIT Acoustic-Phonetic Continuous Speech Corpus, 1993.
    [7] Andrew Varga and Herman J.M. Steeneken. Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication, 12(3):247–251, July 1993.
    [8] David B. Dean, Sridha Sridharan, Robert J. Vogt, and Michael W. Mason. The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms. Proceedings of Interspeech 2010, 2010.
    [9] Man Mohan Sondhi. New methods of pitch extraction. Audio and Electroacoustics, IEEE Transactions on, 16(2):262–266, 1968.
    [10] Myron J. Ross, Harry L. Shaffer, Asaf Cohen, Richard Freudberg, and Harold J. Manley. Average magnitude difference function pitch extractor. Acoustics, Speech and Signal Processing, IEEE Transactions on, 22(5):353–362, 1974.
    [11] Na Yang, He Ba, Weiyang Cai, Ilker Demirkol, and Wendi Heinzelman. BaNa: A Noise Resilient Fundamental Frequency Detection Algorithm for Speech and Music. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12):1833–1848, December 2014.
    [12] Michael Noll. Cepstrum Pitch Determination. The Journal of the Acoustical Society of America, 41(2):293–309, 1967.
    [13] Jong Wook Kim, Justin Salamon, Peter Li, and Juan Pablo Bello. CREPE: A Convolutional Representation for Pitch Estimation. arXiv:1802.06182 [cs, eess, stat], February 2018. arXiv: 1802.06182.
    [14] Masanori Morise, Fumiya Yokomori, and Kenji Ozawa. WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications. IEICE Transactions on Information and Systems, E99.D(7):1877–1884, 2016.
    [15] Kun Han and DeLiang Wang. Neural Network Based Pitch Tracking in Very Noisy Speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12):2158–2168, December 2014.
    [16] Pegah Ghahremani, Bagher BabaAli, Daniel Povey, Korbinian Riedhammer, Jan Trmal, and Sanjeev Khudanpur. A pitch extraction algorithm tuned for automatic speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 2494–2498. IEEE, 2014.
    [17] Lee Ngee Tan and Abeer Alwan. Multi-band summary correlogram-based pitch detection for noisy speech. Speech Communication, 55(7-8):841–856, September 2013.
    [18] Jesper Kjær Nielsen, Tobias Lindstrøm Jensen, Jesper Rindom Jensen, Mads Græsbøll Christensen, and Søren Holdt Jensen. Fast fundamental frequency estimation: Making a statistically efficient estimator computationally efficient. Signal Processing, 135:188–197, June 2017.
    [19] Sira Gonzalez and Mike Brookes. PEFAC - A Pitch Estimation Algorithm Robust to High Levels of Noise. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(2):518–530, February 2014.
    [20] Paul Boersma. Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. In Proceedings of the institute of phonetic sciences, volume 17, page 97–110. Amsterdam, 1993.
    [21] David Talkin. A robust algorithm for pitch tracking (RAPT). Speech coding and synthesis, 495:518, 1995.
    [22] Byung Suk Lee and Daniel PW Ellis. Noise robust pitch tracking by subband autocorrelation classification. In Interspeech, pages 707–710, 2012.
    [23] Wei Chu and Abeer Alwan. SAFE: a statistical algorithm for F0 estimation for both clean and noisy speech. In INTERSPEECH, pages 2590–2593, 2010.
    [24] Xuejing Sun. Pitch determination and voice quality analysis using subharmonic-to-harmonic ratio. In Acoustics, Speech, and Signal Processing (ICASSP), 2002 IEEE International Conference on, volume 1, page I–333. IEEE, 2002.
    [25] Markel. The SIFT algorithm for fundamental frequency estimation. IEEE Transactions on Audio and Electroacoustics, 20(5):367–377, December 1972.
    [26] Thomas Drugman and Abeer Alwan. Joint Robust Voicing Detection and Pitch Estimation Based on Residual Harmonics. In Interspeech, page 1973–1976, 2011.
    [27] Hideki Kawahara, Masanori Morise, Toru Takahashi, Ryuichi Nisimura, Toshio Irino, and Hideki Banno. TANDEM-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 3933–3936. IEEE, 2008.
    [28] Arturo Camacho. SWIPE: A sawtooth waveform inspired pitch estimator for speech and music. PhD thesis, University of Florida, 2007.
    [29] Kavita Kasi and Stephen A. Zahorian. Yet Another Algorithm for Pitch Tracking. In IEEE International Conference on Acoustics Speech and Signal Processing, pages I–361–I–364, Orlando, FL, USA, May 2002. IEEE.
    [30] Alain de Cheveigné and Hideki Kawahara. YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4):1917, 2002.

  7. Hotspots of Extinction: Country-Level Data on Threatened Vertebrates,...

    • dataverse.openforestdata.pl
    tsv
    Updated May 11, 2025
    Cite
    (2025). Hotspots of Extinction: Country-Level Data on Threatened Vertebrates, Invertebrates, and Plants [Dataset]. http://doi.org/10.48370/OFD/XSYP7R
    Available download formats: tsv(11419), tsv(10834), tsv(1404776), tsv(11701)
    Dataset updated
    May 11, 2025
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset provides annual records of threatened species from 2004 to 2023, focusing on the 25 countries most impacted by biodiversity loss. For direct download of datasets. The data is organized into three categories (Vertebrates, Invertebrates, and Plants) and sourced from UNdata and the IUCN Red List. Each entry includes the country name, year, species count, and biodiversity group. It is designed to support research, education, and public engagement on global conservation priorities.

    Source and Collection Timeline

    • Original Data Range: 2004–2023
    • Cleaned and Extracted: November 2024
    • Primary Sources: UNdata, IUCN Red List (via UN Statistics Division)

    Data Processing Summary

    • Data Cleaning: Removed incomplete entries and excluded non-country-level data (e.g., continents or regions).
    • Grouping: Categorized into Vertebrates, Invertebrates, and Plants.
    • Top 25 Filter: Selected the top 25 countries per year and per category to improve visual clarity.
    • File Generation: Created three structured CSVs using Python (Pandas).

    Data Format

    File Type: CSV (.csv)

    Columns Include:

    • Country – Name of the country
    • Year – Range from 2004 to 2023
    • Value – Number of threatened species
    • Group – Vertebrates, Invertebrates, or Plants
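
The "Top 25 Filter" step could be reproduced along these lines. This is a sketch under assumptions, not the authors' script: the sample rows and the helper name `top_n_per_year_and_group` are hypothetical, and only the column names Country, Year, Value, and Group come from the description.

```python
import pandas as pd

# Hypothetical rows with the four documented columns.
records = pd.DataFrame(
    {
        "Country": ["A", "B", "C", "A", "B", "C"],
        "Year": [2004, 2004, 2004, 2004, 2004, 2004],
        "Value": [10, 30, 20, 5, 7, 9],
        "Group": ["Plants"] * 3 + ["Vertebrates"] * 3,
    }
)

def top_n_per_year_and_group(df, n=25):
    """Keep the n rows with the largest Value within each (Year, Group)."""
    ranked = df.sort_values("Value", ascending=False)
    return ranked.groupby(["Year", "Group"]).head(n).reset_index(drop=True)

# n=2 keeps the toy example readable; the description uses n=25.
top2 = top_n_per_year_and_group(records, n=2)
```

Sorting once and then taking `head(n)` per group keeps ties in their sorted order; a stricter tie-breaking rule would need an extra sort key.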

  8. amazon-products

    • huggingface.co
    Cite
    CK, amazon-products [Dataset]. https://huggingface.co/datasets/ckandemir/amazon-products
    Authors
    CK
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Creation and Processing Overview

    This dataset underwent a comprehensive process of loading, cleaning, processing, and preparing, incorporating a range of data manipulation and NLP techniques to optimize its utility for machine learning models, particularly in natural language processing.

      Data Loading and Initial Cleaning
    

    Source: Loaded from the Hugging Face dataset repository bprateek/amazon_product_description. Conversion to Pandas DataFrame: For ease of data… See the full description on the dataset page: https://huggingface.co/datasets/ckandemir/amazon-products.

  9. US Means of Transportation to Work Census Data

    • kaggle.com
    Updated Feb 23, 2022
    Cite
    Sagar G (2022). US Means of Transportation to Work Census Data [Dataset]. https://www.kaggle.com/goswamisagard/american-census-survey-b08301-cleaned-csv-data/discussion
    Dataset updated
    Feb 23, 2022
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Sagar G
    Area covered
    United States
    Description

    The US Census Bureau conducts the American Community Survey (ACS) as 1-year and 5-year surveys that record various demographics, and provides public access to the results through APIs. I called the APIs from a Python environment using the requests library, then cleaned and organized the data into a usable format.

    Data Ingestion and Cleaning:

    ACS Subject data [2011-2019] was accessed using Python through the following API link: https://api.census.gov/data/2011/acs/acs1?get=group(B08301)&for=county:* The data was obtained in JSON format by calling the above API, then imported as a Python Pandas DataFrame. The 84 variables returned comprise 21 Estimate values for various metrics, 21 corresponding Margin of Error values, and the respective Annotation values for the Estimate and Margin of Error values. This data then went through various cleaning steps in Python, where excess variables were removed and the columns were renamed. Web scraping was used to extract the variables' names and replace the codes in the column names of the raw data.

    The above step was carried out for multiple ACS/ACS-1 datasets spanning 2011-2019, which were then merged into a single Pandas DataFrame. The columns were rearranged, and the "NAME" column was split into two columns, namely 'StateName' and 'CountyName'. Counties for which no data was available were also removed from the DataFrame. Once the DataFrame was ready, it was split into two new DataFrames, one for State data and one for County data, and exported in '.csv' format.
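
The JSON-to-DataFrame and NAME-splitting steps described above might look roughly like this. It is a sketch, not the author's notebook: the helper name `census_rows_to_frame` and the sample values are hypothetical, and it assumes the response is a list of rows whose first row is the header and that county-level NAME values have the form "County, State".

```python
import pandas as pd

def census_rows_to_frame(rows):
    """Build a DataFrame from a Census-style response (first row is the header)."""
    header, *data = rows
    frame = pd.DataFrame(data, columns=header)
    # NAME is assumed to look like "Autauga County, Alabama".
    frame[["CountyName", "StateName"]] = frame["NAME"].str.split(", ", n=1, expand=True)
    return frame.drop(columns="NAME")

# Hypothetical, shortened response with made-up values; a live call would be
# something like requests.get(url).json() against the API link quoted above.
sample = [
    ["NAME", "B08301_001E", "state", "county"],
    ["Autauga County, Alabama", "24000", "01", "001"],
    ["Kent County, Delaware", "80000", "10", "001"],
]
counties = census_rows_to_frame(sample)

# The API returns values as strings, so numeric columns need converting.
counties["B08301_001E"] = pd.to_numeric(counties["B08301_001E"])
```

Frames built this way for each year could then be combined with `pd.concat` after adding a year column, which matches the merge step in the description.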

    Data Source:

    More information about the source of Data can be found at the URL below: US Census Bureau. (n.d.). About: Census Bureau API. Retrieved from Census.gov https://www.census.gov/data/developers/about.html

    Final Word:

    I hope this data helps you to create something beautiful, and awesome. I will be posting a lot more databases shortly, if I get more time from assignments, submissions, and Semester Projects 🧙🏼‍♂️. Good Luck.

  10. aesthetics-wiki

    • huggingface.co
    Updated Apr 3, 2025
    Cite
    Nina Rhone (2025). aesthetics-wiki [Dataset]. https://huggingface.co/datasets/ninar12/aesthetics-wiki
    Dataset updated
    Apr 3, 2025
    Authors
    Nina Rhone
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Introduction

    This dataset is a web-scraped version of aesthetics-wiki. There are 1022 aesthetics captured.

      Columns + dtype
    

    • title: str
    • description: str (raw representation, including because it could help in structuring data)
    • keywords_spacy: str (['NOUN', 'ADJ', 'VERB', 'NUM', 'PROPN'] keywords extracted from description with POS from Spacy library); removed weird characters, numbers, spaces, stopwords

      Cleaning
    

    Standard Pandas cleaning

    Cleaned the data by… See the full description on the dataset page: https://huggingface.co/datasets/ninar12/aesthetics-wiki.

  11. mt-bench-eval-critique

    • huggingface.co
    Updated Apr 8, 2024
    Cite
    distilabel-internal-testing (2024). mt-bench-eval-critique [Dataset]. https://huggingface.co/datasets/distilabel-internal-testing/mt-bench-eval-critique
    Dataset updated
    Apr 8, 2024
    Dataset authored and provided by
    distilabel-internal-testing
    Description

    Description

    This dataset is used to check criticon prompts/responses while testing. It contains instructions/responses from mt_bench_eval, as extracted from: https://github.com/kaistAI/prometheus/blob/main/evaluation/benchmark/data/mt_bench_eval.json The dataset was obtained by cleaning the data with:

    import re
    import pandas as pd
    from datasets import Dataset

    df = pd.read_json("mt_bench_eval.json", lines=True)

    ds = Dataset.from_pandas(df, preserve_index=False)… See the full description on the dataset page: https://huggingface.co/datasets/distilabel-internal-testing/mt-bench-eval-critique.

  12. Goodreads Best 21st Century Book List

    • kaggle.com
    Updated Apr 11, 2024
    Cite
    Prakash Mahatra (2024). Goodreads Best 21st Century Book List [Dataset]. https://www.kaggle.com/datasets/prakashmahatra/goodreads-best-21st-century-book-list/versions/1
    Dataset updated
    Apr 11, 2024
    Dataset provided by
    Kaggle
    Authors
    Prakash Mahatra
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset contains data obtained from Goodreads, a popular website for book lovers, to gain insights into the best books of the 21st century. The data was scraped from the Best Books of the 21st Century list on Goodreads using the Beautiful Soup and Requests libraries in Python. After obtaining the data, cleaning and exploratory data analysis (EDA) were performed using Pandas, Plotly, Seaborn, and Matplotlib.

    The dataset contains top books of the 21st century, spanning from the 2000s to the present day. The data is scraped from a popular book website, Goodreads. Some notable books in the dataset include the Harry Potter series, A Thousand Splendid Suns, The Kite Runner, and The Fault in Our Stars.

    The dataset consists of a total of 84,033 books and comprises 15 columns.

  13. Population and Population Density Dataset.

    • kaggle.com
    Updated Jul 28, 2021
    Cite
    Zoraiz Azeem (2021). Population and Population Density Dataset. [Dataset]. https://www.kaggle.com/zoraizazeem/population-and-population-density-dataset
    Dataset updated
    Jul 28, 2021
    Dataset provided by
    Kaggle
    Authors
    Zoraiz Azeem
    Description

    Content

    This dataset contains population and population density data from the World Bank. The World Bank has accurate data from the year 1950, and this data set contains projections from the year 2021 onwards (see my notebook for more). This dataset also contains the female and male population splits.

    Acknowledgements

    Thanks to the World Bank: https://data.worldbank.org/indicator/SP.POP.TOTL

    Inspiration

    This is a very simple data set aimed at users who want to get involved with cleaning and visualising data in python/pandas. See my code for inspiration.

  14. RTEM Hackaton API and Data Science Tutorials

    • kaggle.com
    Updated Apr 22, 2022
    Cite
    Pony Biam (2022). RTEM Hackaton API and Data Science Tutorials [Dataset]. https://www.kaggle.com/datasets/ponybiam/onboard-api-intro
    Dataset updated
    Apr 22, 2022
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Pony Biam
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    RTEM Hackathon Tutorials

    This data set and associated notebooks are meant to give you a head start in accessing the RTEM Hackathon by showing some examples of data extraction, processing, cleaning, and visualisation. The data available on this Kaggle page is only a selected part of the whole data set, extracted for the tutorials. A series of video tutorials is associated with this dataset and notebooks and can be found on the Onboard YouTube channel.

    Part 1 - Onboard API and Onboard API Wrapper Introduction

    An introduction to the API usage and how to retrieve data from it. This notebook is outlined in several YouTube videos that discuss:

    • how to get started with your account and get oriented to the Kaggle environment,
    • get acquainted with the Onboard API,
    • and start using the Onboard API wrapper to extract and explore data.

    Part 2 - Meta-data and Point Exploration Demo

    How to query data points meta-data, process them and visually explore them. This notebook is outlined in several YouTube videos that discuss:

    • how to get started exploring building metadata/points,
    • select/merge point lists and export as CSV
    • and visualize and explore the point lists

    Part 3 - Time-series Data Extraction and Exploration Demo

    How to query time-series from data points, process and visually explore them. This notebook is outlined in several YouTube videos that discuss:

    • how to load and filter time-series data from sensors
    • resample and transform time-series data
    • and create heat maps and boxplots of data for exploration
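
The filtering and resampling steps listed for Part 3 can be illustrated with plain pandas. This sketch does not use the Onboard API wrapper. The sensor name `zone_temp`, the 15-minute sampling interval, and the 0 to 50 plausibility range are all assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical sensor readings at 15-minute intervals.
index = pd.date_range("2022-01-01", periods=8, freq="15min")
readings = pd.DataFrame(
    {"zone_temp": [20.0, 20.5, 21.0, 21.5, 22.0, 22.5, 23.0, 23.5]},
    index=index,
)

# Filter: keep only values inside an assumed plausible range.
filtered = readings[readings["zone_temp"].between(0, 50)]

# Resample: downsample the 15-minute series to hourly means.
hourly = filtered.resample("60min").mean()
```

The hourly frame is a convenient input for the heat maps and boxplots mentioned above, for example after pivoting by hour of day.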

    Part 4 - Example of starting point for analysis for RTEM and possible directions of analysis

    A quick example of a starting point towards the analysis of the data for some sort of solution and reference to a paper that might help get an overview of the possible directions your team can go in. This notebook is outlined in several YouTube videos that discuss:

    • overview of use cases and judging criteria
    • an example of a real-world hypothesis
    • further development of that simple example

    More information about the data and competition can be found on the RTEM Hackathon website.
