27 datasets found
  1. Data_Cleaning_in_Pandas_Shruthi_TR.ipynb

    • kaggle.com
    Updated Oct 19, 2023
    Cite
    Shruthi T R (2023). Data_Cleaning_in_Pandas_Shruthi_TR.ipynb [Dataset]. https://www.kaggle.com/datasets/shruthirt/data-cleaning-in-pandas-shruthi-tr-ipynb
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 19, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Shruthi T R
    Description

    Dataset

    This dataset was created by Shruthi T R


  2. A Replication Dataset for Fundamental Frequency Estimation

    • live.european-language-grid.eu
    • data.niaid.nih.gov
    • +1 more
    json
    Updated Oct 19, 2023
    Cite
    Bastian Bechtold (2023). A Replication Dataset for Fundamental Frequency Estimation [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7808
    Explore at:
    Available download formats: json
    Dataset updated
    Oct 19, 2023
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Part of the dissertation Pitch of Voiced Speech in the Short-Time Fourier Transform: Algorithms, Ground Truths, and Evaluation Methods. © 2020, Bastian Bechtold. All rights reserved.

    Estimating the fundamental frequency of speech remains an active area of research, with varied applications in speech recognition, speaker identification, and speech compression. A vast number of algorithms for estimating this quantity have been proposed over the years, and a number of speech and noise corpora have been developed for evaluating their performance. The present dataset contains estimated fundamental frequency tracks of 25 algorithms on six speech corpora and two noise corpora, at nine signal-to-noise ratios between -20 and 20 dB SNR, as well as an additional evaluation on synthetic harmonic tone complexes in white noise.

    The dataset also contains pre-calculated performance measures, both novel and traditional, in reference to each speech corpus' ground truth, the algorithms' own clean-speech estimate, and our own consensus truth. It can thus serve as the basis for a comparison study, to replicate existing studies from a larger dataset, or as a reference for developing new fundamental frequency estimation algorithms. All source code and data are available to download and entirely reproducible, albeit requiring about one year of processor time.

    Included Code and Data

    ground truth data.zip is a JBOF dataset of fundamental frequency estimates and ground truths of all speech files in the following corpora:

    CMU-ARCTIC (consensus truth) [1]
    FDA (corpus truth and consensus truth) [2]
    KEELE (corpus truth and consensus truth) [3]
    MOCHA-TIMIT (consensus truth) [4]
    PTDB-TUG (corpus truth and consensus truth) [5]
    TIMIT (consensus truth) [6]

    noisy speech data.zip is a JBOF dataset of fundamental frequency estimates of speech files mixed with noise from the following corpora:
    NOISEX [7]
    QUT-NOISE [8]

    synthetic speech data.zip is a JBOF dataset of fundamental frequency estimates of synthetic harmonic tone complexes in white noise.

    noisy_speech.pkl and synthetic_speech.pkl are pickled Pandas dataframes of performance metrics derived from the above data for the following fundamental frequency estimation algorithms:
    AUTOC [9]
    AMDF [10]
    BANA [11]
    CEP [12]
    CREPE [13]
    DIO [14]
    DNN [15]
    KALDI [16]
    MAPS, MBSC [17]
    NLS [18]
    PEFAC [19]
    PRAAT [20]
    RAPT [21]
    SACC [22]
    SAFE [23]
    SHR [24]
    SIFT [25]
    SRH [26]
    STRAIGHT [27]
    SWIPE [28]
    YAAPT [29]
    YIN [30]

    noisy speech evaluation.py and synthetic speech evaluation.py are Python programs to calculate the above Pandas dataframes from the above JBOF datasets. They calculate the following performance measures:
    Gross Pitch Error (GPE), the percentage of pitches where the estimated pitch deviates from the true pitch by more than 20%.
    Fine Pitch Error (FPE), the mean error of grossly correct estimates.
    High/Low Octave Pitch Error (OPE), the percentage of pitches that are GPEs and happen to fall at an integer multiple of the true pitch.
    Gross Remaining Error (GRE), the percentage of pitches that are GPEs but not OPEs.
    Fine Remaining Bias (FRB), the median error of GREs.
    True Positive Rate (TPR), the percentage of true positive voicing estimates.
    False Positive Rate (FPR), the percentage of false positive voicing estimates.
    False Negative Rate (FNR), the percentage of false negative voicing estimates.
    F₁, the harmonic mean of precision and recall of the voicing decision.
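    As a rough illustration of the first two measures, here is a minimal NumPy sketch of GPE and FPE on paired per-frame pitch tracks (the function name and arrays are hypothetical; this is not the published evaluation code):

    import numpy as np

    def gpe_fpe(true_f0, est_f0, tolerance=0.2):
        # true_f0, est_f0: per-frame pitch in Hz, 0 where unvoiced.
        voiced = (true_f0 > 0) & (est_f0 > 0)
        rel_err = np.abs(est_f0[voiced] - true_f0[voiced]) / true_f0[voiced]
        gross = rel_err > tolerance  # estimate deviates from truth by more than 20%
        gpe = 100.0 * gross.mean()  # percentage of voiced frames with gross errors
        fpe = 100.0 * rel_err[~gross].mean()  # mean error of grossly correct estimates
        return gpe, fpe

    gpe, fpe = gpe_fpe(np.array([100.0, 200.0, 150.0]), np.array([101.0, 405.0, 149.0]))
    # The 405 Hz frame is a gross (octave-like) error; the other two count toward FPE.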

    Pipfile is a pipenv-compatible pipfile for installing all prerequisites necessary for running the above Python programs.

    The Python programs take about an hour to run on a fast 2019 computer and require at least 32 GB of memory.

    References:

    [1] John Kominek and Alan W Black. CMU ARCTIC database for speech synthesis, 2003.
    [2] Paul C Bagshaw, Steven Hiller, and Mervyn A Jack. Enhanced Pitch Tracking and the Processing of F0 Contours for Computer Aided Intonation Teaching. In EUROSPEECH, 1993.
    [3] F Plante, Georg F Meyer, and William A Ainsworth. A Pitch Extraction Reference Database. In Fourth European Conference on Speech Communication and Technology, pages 837–840, Madrid, Spain, 1995.
    [4] Alan Wrench. MOCHA MultiCHannel Articulatory database: English, November 1999.
    [5] Gregor Pirker, Michael Wohlmayr, Stefan Petrik, and Franz Pernkopf. A Pitch Tracking Corpus with Evaluation on Multipitch Tracking Scenario. page 4, 2011.
    [6] John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, David S. Pallett, Nancy L. Dahlgren, and Victor Zue. TIMIT Acoustic-Phonetic Continuous Speech Corpus, 1993.
    [7] Andrew Varga and Herman J.M. Steeneken. Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication, 12(3):247–251, July 1993.
    [8] David B. Dean, Sridha Sridharan, Robert J. Vogt, and Michael W. Mason. The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms. Proceedings of Interspeech 2010, 2010.
    [9] Man Mohan Sondhi. New methods of pitch extraction. IEEE Transactions on Audio and Electroacoustics, 16(2):262–266, 1968.
    [10] Myron J. Ross, Harry L. Shaffer, Asaf Cohen, Richard Freudberg, and Harold J. Manley. Average magnitude difference function pitch extractor. IEEE Transactions on Acoustics, Speech and Signal Processing, 22(5):353–362, 1974.
    [11] Na Yang, He Ba, Weiyang Cai, Ilker Demirkol, and Wendi Heinzelman. BaNa: A Noise Resilient Fundamental Frequency Detection Algorithm for Speech and Music. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12):1833–1848, December 2014.
    [12] Michael Noll. Cepstrum Pitch Determination. The Journal of the Acoustical Society of America, 41(2):293–309, 1967.
    [13] Jong Wook Kim, Justin Salamon, Peter Li, and Juan Pablo Bello. CREPE: A Convolutional Representation for Pitch Estimation. arXiv:1802.06182 [cs, eess, stat], February 2018.
    [14] Masanori Morise, Fumiya Yokomori, and Kenji Ozawa. WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications. IEICE Transactions on Information and Systems, E99.D(7):1877–1884, 2016.
    [15] Kun Han and DeLiang Wang. Neural Network Based Pitch Tracking in Very Noisy Speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12):2158–2168, December 2014.
    [16] Pegah Ghahremani, Bagher BabaAli, Daniel Povey, Korbinian Riedhammer, Jan Trmal, and Sanjeev Khudanpur. A pitch extraction algorithm tuned for automatic speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 2494–2498. IEEE, 2014.
    [17] Lee Ngee Tan and Abeer Alwan. Multi-band summary correlogram-based pitch detection for noisy speech. Speech Communication, 55(7-8):841–856, September 2013.
    [18] Jesper Kjær Nielsen, Tobias Lindstrøm Jensen, Jesper Rindom Jensen, Mads Græsbøll Christensen, and Søren Holdt Jensen. Fast fundamental frequency estimation: Making a statistically efficient estimator computationally efficient. Signal Processing, 135:188–197, June 2017.
    [19] Sira Gonzalez and Mike Brookes. PEFAC - A Pitch Estimation Algorithm Robust to High Levels of Noise. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(2):518–530, February 2014.
    [20] Paul Boersma. Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. In Proceedings of the Institute of Phonetic Sciences, volume 17, pages 97–110, Amsterdam, 1993.
    [21] David Talkin. A robust algorithm for pitch tracking (RAPT). Speech Coding and Synthesis, 495:518, 1995.
    [22] Byung Suk Lee and Daniel P. W. Ellis. Noise robust pitch tracking by subband autocorrelation classification. In Interspeech, pages 707–710, 2012.
    [23] Wei Chu and Abeer Alwan. SAFE: a statistical algorithm for F0 estimation for both clean and noisy speech. In INTERSPEECH, pages 2590–2593, 2010.
    [24] Xuejing Sun. Pitch determination and voice quality analysis using subharmonic-to-harmonic ratio. In Acoustics, Speech, and Signal Processing (ICASSP), 2002 IEEE International Conference on, volume 1, page I-333. IEEE, 2002.
    [25] Markel. The SIFT algorithm for fundamental frequency estimation. IEEE Transactions on Audio and Electroacoustics, 20(5):367–377, December 1972.
    [26] Thomas Drugman and Abeer Alwan. Joint Robust Voicing Detection and Pitch Estimation Based on Residual Harmonics. In Interspeech, pages 1973–1976, 2011.
    [27] Hideki Kawahara, Masanori Morise, Toru Takahashi, Ryuichi Nisimura, Toshio Irino, and Hideki Banno. TANDEM-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 3933–3936. IEEE, 2008.
    [28] Arturo Camacho. SWIPE: A sawtooth waveform inspired pitch estimator for speech and music. PhD thesis, University of Florida, 2007.
    [29] Kavita Kasi and Stephen A. Zahorian. Yet Another Algorithm for Pitch Tracking. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages I-361–I-364, Orlando, FL, USA, May 2002. IEEE.
    [30] Alain de Cheveigné and Hideki Kawahara. YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4):1917, 2002.

  3. Enhancing UNCDF Operations: Power BI Dashboard Development and Data Mapping

    • figshare.com
    Updated Jan 6, 2025
    Cite
    Maryam Binti Haji Abdul Halim (2025). Enhancing UNCDF Operations: Power BI Dashboard Development and Data Mapping [Dataset]. http://doi.org/10.6084/m9.figshare.28147451.v1
    Explore at:
    Dataset updated
    Jan 6, 2025
    Dataset provided by
    figshare
    Authors
    Maryam Binti Haji Abdul Halim
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This project focuses on data mapping, integration, and analysis to support the development and enhancement of six UNCDF operational applications: OrgTraveler, Comms Central, Internal Support Hub, Partnership 360, SmartHR, and TimeTrack. These apps streamline workflows for travel claims, internal support, partnership management, and time tracking within UNCDF.

    Key Features and Tools:
    Data Mapping for Salesforce CRM Migration: Structured and mapped data flows to ensure compatibility and seamless migration to Salesforce CRM.
    Python for Data Cleaning and Transformation: Used pandas, numpy, and APIs to clean, preprocess, and transform raw datasets into standardized formats.
    Power BI Dashboards: Designed interactive dashboards to visualize workflows and monitor performance metrics for decision-making.
    Collaboration Across Platforms: Used Google Colab for code collaboration and Microsoft Excel for data validation and analysis.
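    As a hedged sketch of the pandas/numpy cleaning-and-standardization step named above (the file name, columns, and field mapping are hypothetical placeholders, not the project's actual scripts):

    import numpy as np
    import pandas as pd

    raw = pd.read_csv("travel_claims_export.csv")  # hypothetical app export

    # Standardize column names to match a Salesforce CRM field mapping.
    field_map = {"Claim Ref": "claim_id", "Staff Name": "employee", "Amt": "amount_usd"}
    df = raw.rename(columns=field_map)[list(field_map.values())]

    # Clean and normalize values before migration.
    df["employee"] = df["employee"].str.strip().str.title()
    df["amount_usd"] = pd.to_numeric(df["amount_usd"], errors="coerce").replace(0, np.nan)
    df = df.dropna(subset=["claim_id", "amount_usd"])

    df.to_csv("travel_claims_standardized.csv", index=False)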

  4. S&P 500 Companies Analysis Project

    • kaggle.com
    Updated Apr 6, 2025
    Cite
    anshadkaggle (2025). S&P 500 Companies Analysis Project [Dataset]. https://www.kaggle.com/datasets/anshadkaggle/s-and-p-500-companies-analysis-project
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 6, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    anshadkaggle
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This project focuses on analyzing the S&P 500 companies using data analysis tools like Python (Pandas), SQL, and Power BI. The goal is to extract insights related to sectors, industries, locations, and more, and visualize them using dashboards.

    Included Files:

    sp500_cleaned.csv – Cleaned dataset used for analysis

    sp500_analysis.ipynb – Jupyter Notebook (Python + SQL code)

    dashboard_screenshot.png – Screenshot of Power BI dashboard

    README.md – Summary of the project and key takeaways

    This project demonstrates practical data cleaning, querying, and visualization skills.
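    As a hedged illustration of the sector/industry analysis described above, a short pandas sketch against the included sp500_cleaned.csv (the column names are assumptions):

    import pandas as pd

    df = pd.read_csv("sp500_cleaned.csv")  # assumed columns: Symbol, Sector, Industry

    # Company counts by sector, the kind of breakdown behind the dashboard visuals.
    print(df["Sector"].value_counts())

    # Top three industries within each sector.
    top_industries = (
        df.groupby(["Sector", "Industry"]).size()
          .groupby(level=0, group_keys=False).nlargest(3)
    )
    print(top_industries)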

  5. aesthetics-wiki

    • huggingface.co
    Updated Apr 3, 2025
    Cite
    Nina Rhone (2025). aesthetics-wiki [Dataset]. https://huggingface.co/datasets/ninar12/aesthetics-wiki
    Explore at:
    Dataset updated
    Apr 3, 2025
    Authors
    Nina Rhone
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Introduction

    This dataset is a web-scraped version of aesthetics-wiki. There are 1022 aesthetics captured.

      Columns + dtype
    

    title: str
    description: str (raw representation, included because it could help in structuring the data)
    keywords_spacy: str (keywords with POS tags in ['NOUN', 'ADJ', 'VERB', 'NUM', 'PROPN'] extracted from the description with the spaCy library; weird characters, numbers, extra spaces, and stopwords removed)

      Cleaning
    

    Standard Pandas cleaning

    Cleaned the data by… See the full description on the dataset page: https://huggingface.co/datasets/ninar12/aesthetics-wiki.
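    A minimal sketch of the keywords_spacy-style extraction described above (the model name and exact cleaning steps are assumptions, not the dataset author's script):

    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumed model
    KEEP = {"NOUN", "ADJ", "VERB", "NUM", "PROPN"}

    def extract_keywords(description: str) -> str:
        doc = nlp(description)
        return " ".join(
            token.text.lower() for token in doc
            if token.pos_ in KEEP and not token.is_stop
            and not token.is_punct and not token.is_space
        )

    print(extract_keywords("Dark academia is a literary aesthetic centred on classic literature."))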

  6. Amazon Sales Data Analysis Project1

    • kaggle.com
    Updated Jan 22, 2024
    Cite
    GOKUL (2024). Amazon Sales Data Analysis Project1 [Dataset]. https://www.kaggle.com/datasets/gokulvino/amazon-sales-data-analysis-project1
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 22, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    GOKUL
    Description

    Problem Statement: Sales management has gained importance to meet increasing competition and the need for improved methods of distribution to reduce cost and increase profits. Sales management today is the most important function in a commercial and business enterprise. We need to extract all the Amazon sales datasets, transform them using data cleaning and preprocessing, and finally load them for analysis. We need to visualize sales trends month-wise, year-wise, and year-month-wise. Moreover, we need to find key metrics and factors and show meaningful relationships between attributes.

    Approach The main goal of the project is to find key metrics and factors and then show meaningful relationships between them based on different features available in the dataset.

    Data Collection: Imported data from the various datasets available in the project using the Pandas library.

    Data Cleaning: Removed missing values and created new features as per insights.

    Data Preprocessing: Modified the structure of the data to make it more understandable, suitable, and convenient for statistical analysis.

    Data Analysis: Analyzed the dataset using Pandas, NumPy, Matplotlib, and Seaborn.

    Data Visualization: Plotted graphs to get insights about dependent and independent variables. Also used Tableau and Power BI for data visualization.
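    A minimal extract-transform-load sketch of the steps above (the file name and column names are hypothetical):

    import pandas as pd

    # Extract: load the raw sales data.
    df = pd.read_csv("amazon_sales.csv", parse_dates=["Order Date"])

    # Transform: data cleaning and preprocessing as described above.
    df = df.dropna(subset=["Amount"]).drop_duplicates()
    df["year"] = df["Order Date"].dt.year
    df["month"] = df["Order Date"].dt.month

    # Load: month-wise and year-wise sales trends, ready for visualization.
    monthly = df.groupby(["year", "month"])["Amount"].sum()
    monthly.to_csv("sales_trend_year_month.csv")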

  7. Hotspots of Extinction: Country-Level Data on Threatened Vertebrates, Invertebrates, and Plants

    • dataverse.openforestdata.pl
    tsv
    Updated May 11, 2025
    Cite
    (2025). Hotspots of Extinction: Country-Level Data on Threatened Vertebrates, Invertebrates, and Plants [Dataset]. http://doi.org/10.48370/OFD/XSYP7R
    Explore at:
    Available download formats: tsv(11419), tsv(10834), tsv(1404776), tsv(11701)
    Dataset updated
    May 11, 2025
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset provides annual records of threatened species from 2004 to 2023, focusing on the 25 countries most impacted by biodiversity loss. The data is organized into three categories (Vertebrates, Invertebrates, and Plants) and sourced from UNdata and the IUCN Red List. Each entry includes the country name, year, species count, and biodiversity group. It is designed to support research, education, and public engagement on global conservation priorities.

    Source and Collection Timeline:
    Original Data Range: 2004–2023
    Cleaned and Extracted: November 2024
    Primary Sources: UNdata, IUCN Red List (via UN Statistics Division)

    Data Processing Summary:
    Data Cleaning: Removed incomplete entries and excluded non-country-level data (e.g., continents or regions).
    Grouping: Categorized into Vertebrates, Invertebrates, and Plants.
    Top 25 Filter: Selected the top 25 countries per year and per category to improve visual clarity.
    File Generation: Created three structured CSVs using Python (Pandas); a sketch follows below.

    Data Format:
    File Type: CSV (.csv)
    Columns: Country – Name of the country; Year – Range from 2004 to 2023; Value – Number of threatened species; Group – Vertebrates, Invertebrates, or Plants
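    A minimal sketch of the grouping and top-25 filter described above (the input file name and region list are assumptions; the columns follow the published schema):

    import pandas as pd

    df = pd.read_csv("threatened_species_raw.csv")  # columns: Country, Year, Value, Group

    # Remove incomplete entries and non-country rows (e.g., continents or regions).
    regions = {"Africa", "Asia", "Europe", "Oceania", "Americas", "World"}
    df = df.dropna(subset=["Country", "Year", "Value"])
    df = df[~df["Country"].isin(regions)]

    # Keep the top 25 countries per year and per category.
    top25 = (
        df.sort_values("Value", ascending=False)
          .groupby(["Group", "Year"])
          .head(25)
    )

    # One structured CSV per biodiversity group.
    for group, frame in top25.groupby("Group"):
        frame.to_csv(f"threatened_{group.lower()}.csv", index=False)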

  8. mt-bench-eval-critique

    • huggingface.co
    Updated Apr 8, 2024
    Cite
    distilabel-internal-testing (2024). mt-bench-eval-critique [Dataset]. https://huggingface.co/datasets/distilabel-internal-testing/mt-bench-eval-critique
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 8, 2024
    Dataset authored and provided by
    distilabel-internal-testing
    Description

    This dataset is used to check criticon prompts/responses while testing. It contains instructions/responses from mt_bench_eval, as extracted from https://github.com/kaistAI/prometheus/blob/main/evaluation/benchmark/data/mt_bench_eval.json. The dataset was obtained by cleaning the data with:

    import re
    import pandas as pd
    from datasets import Dataset

    df = pd.read_json("mt_bench_eval.json", lines=True)
    ds = Dataset.from_pandas(df, preserve_index=False)

    … See the full description on the dataset page: https://huggingface.co/datasets/distilabel-internal-testing/mt-bench-eval-critique.

  9. Extirpated species in Berlin, dates of last detections, habitats, and number of Berlin’s inhabitants

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Jul 9, 2024
    Cite
    Silvia Keinath (2024). Extirpated species in Berlin, dates of last detections, habitats, and number of Berlin’s inhabitants [Dataset]. http://doi.org/10.5061/dryad.n5tb2rc4k
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 9, 2024
    Dataset provided by
    Museum für Naturkunde
    Authors
    Silvia Keinath
    License

    CC0 1.0 Universal, https://spdx.org/licenses/CC0-1.0.html

    Area covered
    Berlin
    Description

    Species loss is highly scale-dependent, following the species-area relationship. We analysed spatio-temporal patterns of species’ extirpation on a multitaxonomic level using Berlin, the capital city of Germany. Berlin is one of the largest cities in Europe and has experienced a strong urbanisation trend since the late 19th century. We expected species’ extirpation to be exceptionally high due to the long history of urbanisation. Analysing regional Red Lists of Threatened Plants, Animals, and Fungi of Berlin (covering 9498 species), we found that 16% of species were extirpated, a rate 5.9 times higher than at the German scale and 47.1 times higher than at the European scale. Species’ extirpation in Berlin is comparable to that of another German city with a similarly broad taxonomic coverage, but much higher than in regional areas with less human impact. The documentation of species’ extirpation started in the 18th century and is well documented for the 19th and 20th centuries. We found an average annual extirpation of 3.6 species in the 19th century and 9.6 species in the 20th century; the 21st century, despite the much shorter time period, has already documented as many extirpated species as the 19th century. Our results showed that species’ extirpation is higher at small than at large spatial scales and might be negatively influenced by urbanisation, with different effects on different taxonomic groups and habitats. Over time, we found that species’ extirpation is highest during periods of high human alteration and is negatively affected by the number of people living in the city. However, there is still a lack of data to decouple the size of the area from the human impact of urbanisation. Nevertheless, cities might be suitable systems for studying species’ extirpation processes due to their small scale and human impact.

    Methods

    Data extraction: To determine the proportion of extirpated species for Germany, we manually summarised the numbers of species classified in category 0 ('extinct or extirpated') and calculated the percentage relative to the total number of species listed in the Red Lists of Threatened Species for Germany, taken from the website of the Red List Centre of Germany (Rote Liste Zentrum, 2024a). For Berlin, we used the 37 current Red Lists of Threatened Plants, Animals, and Fungi from the city-state of Berlin, covering the years 2004 to 2023, taken from the official capital city portal of the Berlin Senate Department for Mobility, Transport, Climate Protection and Environment (SenMVKU, 2024a; see overview of Berlin Red Lists used in Table 1). We extracted all species listed as extinct/extirpated, i.e. classified in category 0, and additionally, if available, the date of the last record of the species in Berlin. The Red List of macrofungi of the order Boletales by Schmidt (2017) was not included in our study, as this Red List has only been compiled once in the frame of a pilot project and therefore lacks category 0 ('extinct or extirpated').

    We used Python, version 3.7.9 (Van Rossum and Drake, 2009), the Python libraries Pandas (McKinney et al., 2010) and Camelot-py, version 0.11.0 (Vinayak Meta, 2023), in Jupyter Lab, version 4.0.6 (Project Jupyter, 2016) notebooks. In the first step, we created a metadata table of the Red Lists of Berlin to keep track of the extraction process, maintain the source reference links, and store summarised data from each Red List PDF file. At the extraction of each file, a data row was added to the metadata table, which was updated throughout the rest of the process. In the second step, we identified the page range for extraction for each Red List file. The extraction mechanism for each Red List file depended on the printed table layout: we extracted tables with lined rows using the Lattice parsing method (Camelot-py, 2024a) and tables with alternating-coloured rows using the Stream method (Camelot-py, 2024b); a sketch follows at the end of this description. To check the consistency of extraction, we used the Camelot-py accuracy report along with the Pandas data frame shape property (Pandas, 2024). After initial data cleaning for consistent column counts and missing data, we filtered the data for species in category 0 only. We collated the data frames and exported them as a CSV file. In a further step, we proofread whether the filtered data tallied with the summary tables given in each Red List. Finally, we cleaned each Red List table to contain the species, the current hazard level (category 0), the date of the species’ last detection in Berlin, and the reference (codes and data available at: Github, 2023). When no date of last detection was given for a species, we contacted the authors of the respective Red Lists and/or used former Red Lists to find information on species’ last detections (Burger et al., 1998; Saure et al., 1998; 1999; Braasch et al., 2000; Saure, 2000).

    Determination of the recording time windows of the Berlin Red Lists: We determined the time windows the Berlin Red Lists look back on from their methodologies. If the information was missing in the current Red Lists, we consulted the previous version (see all detailed time windows of the earliest assessments with references in Table B2 in Appendix B).

    Data classification: For the analyses of the percentage of species in the different hazard levels, we used the German Red List categories as described in detail by Saure and Schwarz (2005) and Ludwig et al. (2009). These are: prewarning list, endangered (category 3), highly endangered (category 2), threatened by extinction or extirpation (category 1), and extinct or extirpated (category 0). To determine the number of indigenous unthreatened species in each Red List, we subtracted the number of species in the five categories and the number of non-indigenous species (neobiota) from the total number of species in each Red List.

    For further analyses, we pooled the taxonomic groups of the 37 Red Lists into more broadly defined taxonomic groups: plants, lichens, fungi, algae, mammals, birds, amphibians, reptiles, fish and lampreys, molluscs, and arthropods (see categorisation in Table 1). We categorised slime fungi (Myxomycetes including Ceratiomyxomycetes) as 'fungi', even though they are more closely related to animals, because slime fungi are traditionally studied by mycologists (Schmidt and Täglich, 2023). We classified 'lichens' in a separate category, rather than under 'fungi', as they are a symbiotic community of fungi and algae (Krause et al., 2017). For analyses of the percentage of extirpated species of each pooled taxonomic group, we set the number of extirpated species in relation to the sum of the number of unthreatened species, species in the prewarning list, and species in categories one to three.

    We further categorised the extirpated species according to the habitats in which they occurred: terrestrial species as 'terrestrial' and aquatic species as 'aquatic'. Amphibians and dragonflies have life stages in both terrestrial and aquatic habitats and were categorised as 'terrestrial/aquatic'. We also categorised plants and mosses as 'terrestrial/aquatic' if they depend on wetlands (see all habitat categories for each species in Table C1 in Appendix C).

    The available data on the species’ last detection in Berlin ranged from a specific year, over a period of time, up to a century. If a year of last detection was given with the qualifier 'around' or 'circa', we used the given year for temporal classification. If a year of last detection was given with the qualifier 'before' or 'after', we assumed that the nearest year of last detection was given and categorised the species in the respective century; in this case, we used the species for temporal analyses by centuries only, not across years. If only a timeframe was given as the date of last detection, we used the respective species for temporal analyses between centuries only. We further classified all extirpated species by the century in which they were last detected: 17th century (1601-1700); 18th century (1701-1800); 19th century (1801-1900); 20th century (1901-2000); 21st century (2001-now) (see all data on species’ last detection in Table C1 in Appendix C).

    For analyses of the effects of the number of inhabitants on species’ extirpation in Berlin, we used species that were extirpated between 1920 and 2012, because Berlin was expanded to 'Groß-Berlin' in 1920 (Buesch and Haus, 1987), roughly corresponding to the city’s current area. We included the number of Berlin’s inhabitants for every year a species was last detected (Statistische Jahrbücher der Stadt Berlin, 1920, 1924-1998, 2000; see all data on the number of inhabitants for each year of species’ last detection in Table C1 in Appendix C).

    Materials and Methods from Keinath et al. (2024): 'High levels of species’ extirpation in an urban environment – A case study from Berlin, Germany, covering 1700-2023'.
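    A minimal sketch of the two Camelot-py extraction paths named above (the PDF name, page range, and column layout are hypothetical placeholders, not the published pipeline):

    import camelot
    import pandas as pd

    # Lattice for tables with ruled rows; use flavor="stream" for alternating-coloured rows.
    tables = camelot.read_pdf("berlin_red_list.pdf", pages="10-15", flavor="lattice")

    frames = []
    for table in tables:
        print(table.parsing_report)  # Camelot accuracy report, used for consistency checks
        print(table.df.shape)  # Pandas data frame shape property as a second check
        frames.append(table.df)

    red_list = pd.concat(frames, ignore_index=True)
    red_list.columns = ["species", "category", "last_detection"]  # hypothetical layout
    extirpated = red_list[red_list["category"].astype(str).str.strip() == "0"]
    extirpated.to_csv("category0_species.csv", index=False)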

  10. Nairobi Motorcycle Transit Comparison Dataset: Fuel vs. Electric Vehicle Performance Tracking (2023)

    • scholardata.sun.ac.za
    • data.mendeley.com
    Updated Mar 8, 2025
    + more versions
    Cite
    Martin Kitetu; Alois Mbutura; Halloran Stratford; MJ Booysen (2025). Nairobi Motorcycle Transit Comparison Dataset: Fuel vs. Electric Vehicle Performance Tracking (2023) [Dataset]. http://doi.org/10.25413/sun.28554200.v1
    Explore at:
    Dataset updated
    Mar 8, 2025
    Dataset provided by
    SUNScholarData
    Authors
    Martin Kitetu; Alois Mbutura; Halloran Stratford; MJ Booysen
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Nairobi
    Description

    This dataset contains GPS tracking data and performance metrics for motorcycle taxis (boda bodas) in Nairobi, Kenya, comparing traditional internal combustion engine (ICE) motorcycles with electric motorcycles. The study was conducted in two phases:

    Baseline Phase: 118 ICE motorcycles tracked over 14 days (2023-11-13 to 2023-11-26)
    Transition Phase: 108 ICE motorcycles (control) and 9 electric motorcycles (treatment) tracked over 12 days (2023-12-10 to 2023-12-21)

    The dataset is organised into two main categories:
    Trip Data: Individual trip-level records containing timing, distance, duration, location, and speed metrics
    Daily Data: Daily aggregated summaries containing usage metrics, economic data, and energy consumption

    This dataset enables comparative analysis of electric vs. ICE motorcycle performance, economic modelling of transportation costs, environmental impact assessment, urban mobility pattern analysis, and energy efficiency studies in emerging markets.

    Institutions: EED Advisory, Clean Air Taskforce, Stellenbosch University

    Steps to reproduce:

    Raw Data Collection:
    GPS tracking devices installed on motorcycles, collecting location data at 10-second intervals
    Rider-reported information on revenue, maintenance costs, and fuel/electricity usage

    Processing Steps:
    GPS data cleaning: Filtered invalid coordinates, removed duplicates, interpolated missing points
    Trip identification: Defined by >1 minute stationary periods or ignition cycles (see the sketch after this description)
    Trip metrics calculation: Distance, duration, idle time, average/max speeds
    Daily data aggregation: Summed by user_id and date with self-reported economic data
    Validation: Cross-checked with rider logs and known routes
    Anonymisation: Removed start and end coordinates for the first and last trips of each day to protect rider privacy and home locations

    Technical Information:
    Geographic coverage: Nairobi, Kenya
    Time period: November-December 2023
    Time zone: UTC+3 (East Africa Time)
    Currency: Kenyan Shillings (KES)
    Data format: CSV files
    Software used: Python 3.8 (pandas, numpy, geopy)
    Notes: Some location data points are intentionally missing to protect rider privacy. Self-reported economic and energy consumption data has some missing values where riders did not report.

    Categories: Motorcycle, Transportation in Africa, Electric Vehicles
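    A minimal sketch of the trip-identification rule above (column names and thresholds are assumptions; the actual pipeline is not published here):

    import pandas as pd

    # Hypothetical 10-second GPS samples with columns user_id, timestamp, speed_kph.
    gps = pd.read_csv("gps_samples.csv", parse_dates=["timestamp"])
    gps = gps.sort_values(["user_id", "timestamp"])

    STATIONARY_KPH = 1.0  # assumed speed threshold for "stationary"
    MIN_STOP = pd.Timedelta("1min")  # a stop longer than one minute splits trips

    moving = gps[gps["speed_kph"] > STATIONARY_KPH].copy()

    def label_trips(df):
        # New trip whenever the gap between consecutive moving fixes exceeds MIN_STOP.
        df["trip_id"] = (df["timestamp"].diff() > MIN_STOP).cumsum()
        return df

    moving = moving.groupby("user_id", group_keys=False).apply(label_trips)

    # Per-trip metrics: duration and average/max speed (distance would use geopy).
    trips = moving.groupby(["user_id", "trip_id"]).agg(
        start=("timestamp", "min"), end=("timestamp", "max"),
        avg_kph=("speed_kph", "mean"), max_kph=("speed_kph", "max"),
    )
    trips["duration"] = trips["end"] - trips["start"]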

  11. amazon-products

    • huggingface.co
    Cite
    CK, amazon-products [Dataset]. https://huggingface.co/datasets/ckandemir/amazon-products
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    CK
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Creation and Processing Overview

    This dataset underwent a comprehensive process of loading, cleaning, processing, and preparation, incorporating a range of data manipulation and NLP techniques to optimize its utility for machine learning models, particularly in natural language processing.

      Data Loading and Initial Cleaning
    

    Source: Loaded from the Hugging Face dataset repository bprateek/amazon_product_description. Conversion to Pandas DataFrame: For ease of data… See the full description on the dataset page: https://huggingface.co/datasets/ckandemir/amazon-products.

  12. Roller Coaster Accidents

    • kaggle.com
    Updated Jun 13, 2021
    Cite
    steven (2021). Roller Coaster Accidents [Dataset]. https://www.kaggle.com/stevenlasch/roller-coaster-accidents/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 13, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    steven
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    I wanted to analyze a dataset that consisted of roller coaster accidents, and I saw that there weren't any on Kaggle at the time of uploading this. So, I went online and found one particular dataset.

    The Data

    I found a dataset at https://ridesdatabase.org/saferparks/data/ and began cleaning and analyzing it. Keep in mind that the included dataset is the cleaned version, not the original!

    This file is a dataset that contains information about theme park accidents. It has 24 columns:

    • acc_id: Integer. Unique ID for each accident
    • acc_date: datetime64[ns]. This column is originally a String, but I change it to datetime in the code for easy access to years.
    • acc_state: String. The U.S. State abbreviation that the accident occurred in.
    • acc_city: String. The U.S. city that the accident happened in.
    • fix_port: String. Determines whether the ride is fixed (F) or portable (P).
    • source: String. Source of the accident information.
    • bus_type: String. The place in the park that the accident occurred.
    • industry_sector: String. Groups devices according to the general category within the amusement business.
    • device_category: String. Groups devices within industry sectors that cover a wide range, e.g., coasters, spinning rides, etc.
    • device_type: String. Type of ride or device involved in the accident.
    • tradename_or_generic: String. Particular make/model, where known, or indicates the generic type of ride or device.
    • manufacturer: String. The manufacturer of the faulty ride.
    • num_injured: Integer. Number of people injured in the accident.
    • age_youngest: Float. Age of the youngest victim.
    • gender: String. Gender of the person injured.
    • acc_desc: String. Short description of the accident.
    • injury_desc: String. Short description of the severity of the injuries.
    • report: String. A link to the accident.
    • category: String. What kind of injuries did those who were injured suffer?
    • mechanical: Boolean. Was it a mechanical malfunction? See Notes #1
    • op_error: Boolean. Was it an error with the operation of the machine? See Notes #1
    • employee: Boolean. Was it an employee error? See Notes #1
    • notes: String. Other notes about the accident.
    • year: Integer. Pulls the year from the acc_date column.

    Notes

    1. When totaling the values in mechanical, op_error, or employee columns, there is no need to convert to Integer, since pandas will take their representative values—0 or 1—into account, e.g., data['mechanical'].sum() will return 935 even though the column is of Boolean type.

    2. The notebook that I included under the 'Code' tab imports the cleaned dataset which is why I omitted a data cleaning section in the notebook. If you were to import the data from the website I provided at the top of this page, you will have to clean the data on your own.
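    As a small illustration of notes 1 and 2 (the file name below is hypothetical), booleans sum directly, and the year can be pulled from acc_date once it is parsed:

    import pandas as pd

    data = pd.read_csv("saferparks_cleaned.csv")  # hypothetical file name

    # Note 1: pandas treats True as 1 and False as 0, so no conversion is needed.
    print(data["mechanical"].sum())

    # acc_date is read as strings; parse it to datetime for easy access to years.
    data["acc_date"] = pd.to_datetime(data["acc_date"])
    data["year"] = data["acc_date"].dt.year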

  13. Global Biodiversity at Risk: Top 25 Countries by Threatened Species (2004–2023)

    • dataverse.harvard.edu
    Updated May 7, 2025
    Cite
    Ramkumar Yaragarla (2025). Global Biodiversity at Risk: Top 25 Countries by Threatened Species (2004–2023) [Dataset]. http://doi.org/10.7910/DVN/ZX4WLC
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 7, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Ramkumar Yaragarla
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset presents annual counts of threatened species from 2004 to 2023 across the top 25 countries most affected by biodiversity loss. Access the full dataset at I Hug Trees – Data Analytics. Data is categorized into three groups (Vertebrates, Invertebrates, and Plants) and compiled from UNdata and IUCN Red List sources. It includes country names, years, species counts, and biodiversity groupings, and is intended for use in research, education, and public awareness around global conservation priorities.

    Processing Summary:
    Data Cleaning: Removed duplicates, corrected inconsistencies, and excluded continent-level entries.
    Categorization: Grouped species data into three broad classes.
    Ranking: Selected top 25 countries per category, per year, for visual clarity.
    Output: Generated three structured CSV files (one per group) using Python's Pandas library.

    Data Structure:
    File Format: CSV (.csv)
    Columns: Country – Country name; Year – From 2004 to 2023; Value – Number of threatened species; Group – Vertebrates, Invertebrates, or Plants

    Scope and Limitations:
    Focus: National-level trends; no sub-national or habitat-specific granularity
    Last Updated: April 2025
    Future Enhancements: May include finer taxonomic resolution and updated threat metrics

    Intended Uses:
    Educational tools and biodiversity curriculum support
    Conservation awareness campaigns and visual storytelling
    Policy dashboards and SDG 15 (Life on Land) tracking
    Research into biodiversity trends and hotspot identification

    Tools & Technologies:
    Python (Pandas): Data filtering, aggregation, and file creation
    Chart.js: Rendering of interactive bar charts
    HTML iFrames: Seamless embedding on the I Hug Trees website

  14. FAIR Dataset for Disease Prediction in Healthcare Applications

    • test.researchdata.tuwien.ac.at
    bin, csv, json, png
    Updated Apr 14, 2025
    Cite
    Sufyan Yousaf (2025). FAIR Dataset for Disease Prediction in Healthcare Applications [Dataset]. http://doi.org/10.70124/5n77a-dnf02
    Explore at:
    Available download formats: csv, json, bin, png
    Dataset updated
    Apr 14, 2025
    Dataset provided by
    TU Wien
    Authors
    Sufyan Yousaf
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    Context and Methodology

    • Research Domain/Project:
      This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validating, and testing.

    • Purpose of the Dataset:
      The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.

    • Dataset Creation:
      Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).
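    A minimal sketch of the cleaning-and-splitting step described above, assuming scikit-learn and hypothetical file and column names (the published splits are already provided; this only illustrates the technique):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("patients.csv").dropna()  # hypothetical source; or impute instead

    X, y = df.drop(columns=["label"]), df["label"]

    # Assumed 70/15/15 split into training, validation, and test sets.
    X_train, X_tmp, y_train, y_tmp = train_test_split(
        X, y, test_size=0.30, random_state=42, stratify=y)
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=0.50, random_state=42, stratify=y_tmp)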

    Technical Details

    • Structure of the Dataset:
      The dataset consists of several files organized into folders by data type:

      • Training Data: Contains the training dataset used to train the machine learning model.

      • Validation Data: Used for hyperparameter tuning and model selection.

      • Test Data: Reserved for final model evaluation.

      Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.

    • Software Requirements:
      To open and work with this dataset, you need an environment such as VS Code or Jupyter Notebook, with tools like:

      • Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)

    Further Details

    • Reusability:
      Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.

    • Limitations:
      The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.

  15. The S&M-HSTPM2d5 dataset: High Spatial-Temporal Resolution PM 2.5 Measures in Multiple Cities Sensed by Static & Mobile Devices

    • data.niaid.nih.gov
    Updated Sep 25, 2020
    Cite
    Eng, Kent X. (2020). The S&M-HSTPM2d5 dataset: High Spatial-Temporal Resolution PM 2.5 Measures in Multiple Cities Sensed by Static & Mobile Devices [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4028129
    Explore at:
    Dataset updated
    Sep 25, 2020
    Dataset provided by
    Chen, Xinlei
    Eng, Kent X.
    Noh, Hae Young
    Zhang, Lin
    Liu, Jingxiao
    Liu, Xinyu
    Zhang, Pei
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This S&M-HSTPM2d5 dataset contains high spatial- and temporal-resolution particulate (PM2.5) measurements, with corresponding timestamps and GPS locations of mobile and static devices, in three Chinese cities: Foshan, Cangzhou, and Tianjin. Different numbers of static and mobile devices were set up in each city. The sampling interval was one minute in Cangzhou and three seconds in Foshan and Tianjin. For specific details of the setup, please refer to the Device_Setup_Description.txt file in this repository and the data descriptor paper.

    After the data collection process, a data cleaning process was performed to remove and adjust abnormal and drifting data. The script of the data cleaning algorithm is provided in this repository. The data cleaning algorithm only adjusts or removes individual data points; removal of an entire device's data was done after the data cleaning algorithm, using empirical judgment and graphic visualization. For specific details of the data cleaning process, please refer to the script (Data_cleaning_algorithm.ipynb) in this repository and the data descriptor paper.

    The dataset in this repository is the processed version. The raw dataset and removed devices are not included in this repository.

    The data is stored as CSV files. Each CSV file, named by its device ID, contains the data collected by the corresponding device. Each CSV file has three types of data: a timestamp in China Standard Time (GMT+8), a geographic location as latitude and longitude, and the PM2.5 concentration in micrograms per cubic meter. The CSV files are stored in either the Static or the Mobile folder, according to the device type, and the Static and Mobile folders are stored in the corresponding city's folder.

    To access the dataset, any programming language that can read CSV files is appropriate. Users can also open the CSV files directly. The get_dataset.ipynb file in this repository also provides an option for accessing the dataset. To execute the ipynb files, Jupyter Notebook with Python 3 is required. The following Python libraries are also required:

    get_dataset.ipynb: 1. os library 2. pandas library

    Data_cleaning_algorithm.ipynb: 1. os library 2. pandas library 3. datetime library 4. math library

    Instructions for installing the libraries above can be found online. After installing Jupyter Notebook with Python 3 and the required libraries, users can open the ipynb files with Jupyter Notebook and follow the instructions inside.
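    As an alternative to get_dataset.ipynb, a minimal sketch of walking the folder layout described above (the city name is one of the three provided; everything else is an assumption):

    import os
    import pandas as pd

    # Layout described above: <city>/<Static|Mobile>/<device_id>.csv
    frames = []
    for device_type in ("Static", "Mobile"):
        folder = os.path.join("Foshan", device_type)
        for name in os.listdir(folder):
            df = pd.read_csv(os.path.join(folder, name))
            df["device_id"] = os.path.splitext(name)[0]
            df["device_type"] = device_type
            frames.append(df)

    foshan = pd.concat(frames, ignore_index=True)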

    For questions or suggestions, please e-mail Xinlei Chen.

  16. Analysis of references in the IPCC AR6 WG2 Report of 2022

    • data.niaid.nih.gov
    Updated Mar 11, 2022
    + more versions
    Cite
    Cameron Neylon (2022). Analysis of references in the IPCC AR6 WG2 Report of 2022 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6327206
    Explore at:
    Dataset updated
    Mar 11, 2022
    Dataset provided by
    Bianca Kramer
    Cameron Neylon
    License

    Public Domain, https://creativecommons.org/licenses/publicdomain/

    Description

    This repository contains data on 17,419 DOIs cited in the IPCC Working Group 2 contribution to the Sixth Assessment Report, and the code to link them to the dataset built at the Curtin Open Knowledge Initiative (COKI).

    References were extracted from the report's PDFs (downloaded 2022-03-01) via Scholarcy and exported as RIS and BibTeX files. DOI strings were identified in the RIS files by pattern matching and saved as a CSV file. The list of DOIs for each chapter and cross-chapter paper was processed using a custom Python script to generate a pandas DataFrame, which was saved as a CSV file and uploaded to Google BigQuery.
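    A Python sketch of that DOI pattern-matching step (the repository's own step is implemented in preprocessing.R; the regular expression and file name here are assumptions):

    import re
    import pandas as pd

    # Crossref's recommended pattern covers the vast majority of modern DOIs.
    DOI_RE = re.compile(r"10\.\d{4,9}/[-._;()/:a-zA-Z0-9]+")

    dois = set()
    with open("chapter02.ris", encoding="utf-8") as handle:  # hypothetical RIS export
        for line in handle:
            dois.update(m.group(0).rstrip(".,;") for m in DOI_RE.finditer(line))

    pd.DataFrame(sorted(dois), columns=["doi"]).to_csv("dois.csv", index=False)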

    We used the main object table of the Academic Observatory, which combines information from Crossref, Unpaywall, Microsoft Academic, Open Citations, the Research Organization Registry, and Geonames, to enrich the DOIs with bibliographic information, affiliations, and open access status. A custom query was used to join and format the data, and the resulting table was visualised in a Google Data Studio dashboard.

    This version of the repository also includes the set of DOIs from references in the IPCC Working Group 1 contribution to the Sixth Assessment Report as extracted by Alexis-Michel Mugabushaka and shared on Zenodo: https://doi.org/10.5281/zenodo.5475442 (CC-BY)

    A brief descriptive analysis was provided as a blogpost on the COKI website.

    The repository contains the following content:

    Data:

    data/scholarcy/RIS/ - extracted references as RIS files

    data/scholarcy/BibTeX/ - extracted references as BibTeX files

    IPCC_AR6_WGII_dois.csv - list of DOIs

    data/10.5281_zenodo.5475442/ - references from IPCC AR6 WG1 report

    Processing:

    preprocessing.R - preprocessing steps for identifying and cleaning DOIs

    process.py - Python script for transforming data and linking to COKI data through Google BigQuery

    Outcomes:

    Dataset on BigQuery - requires a google account for access and bigquery account for querying

    Data Studio Dashboard - interactive analysis of the generated data

    Zotero library of references extracted via Scholarcy

    PDF version of blogpost

    Note on licenses: Data are made available under CC0 (with the exception of WG1 reference data, which have been shared under CC-BY 4.0). Code is made available under Apache License 2.0.

  17. Surveys of Data Professionals (Alex the Analyst)

    • kaggle.com
    Updated Nov 27, 2023
    Cite
    Stewie (2023). Surveys of Data Professionals (Alex the Analyst) [Dataset]. https://www.kaggle.com/datasets/alexenderjunior/surveys-of-data-professionals-alex-the-analyst
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 27, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Stewie
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    [Dataset Name] - About This Dataset

    Overview

    This dataset is used in a data cleaning project based on the raw data from Alex the Analyst's Power BI tutorial series. The original dataset can be found here.

    Context

    The dataset is employed in a mini project that involves cleaning and preparing data for analysis. It is part of a series of exercises aimed at enhancing skills in data cleaning using Pandas.

    Content

    The dataset contains information related to [provide a brief description of the data, e.g., sales, customer information, etc.]. The columns cover various aspects such as [list key columns and their meanings].

    Acknowledgements

    The original dataset is sourced from Alex the Analyst's Power BI tutorial series. Special thanks to [provide credit or acknowledgment] for making the dataset available.

    Citation

    If you use this dataset in your work, please cite it as follows:

    How to Use

    1. Download the dataset from this link.
    2. Explore the Jupyter Notebook in the associated repository for insights into the data cleaning process.

    Feel free to reach out for any additional information or clarification. Happy analyzing!

  18. Diwali_Sales_Dataset

    • kaggle.com
    Updated Aug 30, 2024
    Cite
    BharathiD8 (2024). Diwali_Sales_Dataset [Dataset]. https://www.kaggle.com/datasets/bharathid8/diwali-sales-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 30, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    BharathiD8
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Project Overview

    Objective: Analyze Diwali sales data to uncover trends, customer behavior, and sales performance during the festive season.
    Tools Used: Python, Pandas, NumPy, Matplotlib, Seaborn

    Data Collection and Preparation

    Dataset: A dataset containing sales data for Diwali, including details like product categories, customer demographics, sales amounts, discounts, etc.

    • Data Cleaning: Handle missing values, remove duplicates, and correct any inconsistencies in the data.

    • Feature Engineering: Create new features if necessary, such as total sales per customer, average discount per sale, etc.

    Exploratory Data Analysis (EDA)

    Descriptive Statistics: Calculate basic statistics (mean, median, mode) to get a sense of the data distribution.

    Visualizations:
    Sales Trends: Plot sales over time to see how they varied during the Diwali season.
    Top-Selling Products: Identify the products or categories with the highest sales.
    Customer Demographics: Analyze sales by age, gender, and location to understand customer behavior.
    Discount Impact: Evaluate how different discount levels affected sales volume.
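    A minimal pandas sketch of the EDA steps above (the file name and column names are assumptions):

    import matplotlib.pyplot as plt
    import pandas as pd

    df = pd.read_csv("diwali_sales.csv")  # hypothetical file and columns

    df = df.drop_duplicates().dropna(subset=["Amount"])  # basic cleaning

    print(df["Amount"].describe())  # descriptive statistics

    # Top-selling product categories.
    top = df.groupby("Product_Category")["Amount"].sum().nlargest(10)
    top.plot(kind="bar", title="Top categories by sales")
    plt.tight_layout()
    plt.show()

    # Sales by customer demographics.
    print(df.groupby(["Gender", "Age Group"])["Amount"].sum())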

    Key Findings

    Customer Behavior: Insights on which customer segments contributed the most to sales.
    Sales Performance: Which products or categories had the highest sales, and during which days of Diwali sales peaked.
    Discount Effectiveness: The impact of discounts on sales and whether higher discounts led to significantly higher sales or not.

    Conclusion

    Summarize the key insights derived from the EDA. Discuss any patterns or trends that were unexpected or particularly interesting. Provide recommendations for future sales strategies based on the findings.

  19. A dataset of 5 million city trees from 63 US cities: species, location, nativity status, health, and more

    • data.niaid.nih.gov
    • datadryad.org
    • +1more
    zip
    Updated Aug 31, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dakota McCoy; Benjamin Goulet-Scott; Weilin Meng; Bulent Atahan; Hana Kiros; Misako Nishino; John Kartesz (2022). A dataset of 5 million city trees from 63 US cities: species, location, nativity status, health, and more. [Dataset]. http://doi.org/10.5061/dryad.2jm63xsrf
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 31, 2022
    Dataset provided by
    Cornell University
    The Biota of North America Program (BONAP)
    Harvard University
    Worcester Polytechnic Institute
    Stanford University
    Authors
    Dakota McCoy; Benjamin Goulet-Scott; Weilin Meng; Bulent Atahan; Hana Kiros; Misako Nishino; John Kartesz
    License

    CC0 1.0 Universal, https://spdx.org/licenses/CC0-1.0.html

    Area covered
    United States
    Description

    Sustainable cities depend on urban forests. City trees, a pillar of urban forests, improve our health, clean the air, store CO2, and cool local temperatures. Comparatively less is known about urban forests as ecosystems, particularly their spatial composition, nativity statuses, biodiversity, and tree health. Here, we assembled and standardized a new dataset of N=5,660,237 trees from 63 of the largest US cities. The data comes from tree inventories conducted at the level of cities and/or neighborhoods. Each data sheet includes detailed information on tree location, species, nativity status (whether a tree species is naturally occurring or introduced), health, size, whether it is in a park or urban area, and more (comprising 28 standardized columns per datasheet). This dataset could be analyzed in combination with citizen-science datasets on bird, insect, or plant biodiversity; social and demographic data; or data on the physical environment. Urban forests offer a rare opportunity to intentionally design biodiverse, heterogeneous, rich ecosystems.

    Methods

    See the eLife manuscript for full details. Below, we provide a summary of how the dataset was collected and processed.

    Data Acquisition

    We limited our search to the 150 largest cities in the USA (by census population). To acquire raw data on street tree communities, we used a search protocol on both Google and Google Datasets Search (https://datasetsearch.research.google.com/). We first searched the city name plus each of the following: street trees, city trees, tree inventory, urban forest, and urban canopy (all combinations totaled 20 searches per city, 10 each in Google and Google Datasets Search). We then read the first page of Google results and the top 20 results from Google Datasets Search. If the same named city in the wrong state appeared in the results, we redid the 20 searches adding the state name. If no data were found, we contacted a relevant state official via email or phone with an inquiry about their street tree inventory. Datasheets were received and transformed to .csv format (if they were not already in that format). We received data on street trees from 64 cities. One city, El Paso, had data only in summary format and was therefore excluded from analyses.

    Data Cleaning
    All code used is in the zipped folder Data S5 in the eLife publication. Before cleaning the data, we ensured that all reported trees for each city were located within the greater metropolitan area of the city (for certain inventories, many suburbs were reported, some within the greater metropolitan area, others not).

    First, we renamed all columns in the received .csv sheets, referring to the metadata and according to our standardized definitions (Table S4). To harmonize tree health and condition data across different cities, we inspected metadata from the tree inventories and converted all numeric scores to a descriptive scale including "excellent", "good", "fair", "poor", "dead", and "dead/dying". Some cities included only three points on this scale (e.g., "good", "poor", "dead/dying") while others included five (e.g., "excellent", "good", "fair", "poor", "dead").

    Second, we used pandas in Python (W. McKinney & Others, 2011) to correct typos, non-ASCII characters, variable spellings, date formats, units (we converted all units to metric), address issues, and common name formats. In some cases, units were not specified for tree diameter at breast height (DBH) and tree height; we determined the units based on typical sizes for trees of a particular species. Wherever diameter was reported, we assumed it was DBH. We standardized health and condition data across cities, preserving the highest granularity available for each city. For our analysis, we converted this variable to a binary (see section Condition and Health). We created a column called "location_type" to label whether a given tree was growing in the built environment or in green space. All of the changes we made, and decision points, are preserved in Data S9.

    Third, we checked the scientific names reported using gnr_resolve in the R library taxize (Chamberlain & Szöcs, 2013), with the option Best_match_only set to TRUE (Data S9). Through an iterative process, we manually checked the results and corrected typos in the scientific names until all names were either a perfect match (n=1771 species) or a partial match with a threshold greater than 0.75 (n=453 species). BGS manually reviewed all partial matches to ensure that they were the correct species name, and then we programmatically corrected these partial matches (for example, Magnolia grandifolia, which is not a species name of a known tree, was corrected to Magnolia grandiflora, and Pheonix canariensus was corrected to its proper spelling of Phoenix canariensis). Because many of these tree inventories were crowd-sourced or generated in part through citizen science, such typos and misspellings are to be expected.

    Fourth, some tree inventories reported species by common names only, so we converted common names to scientific names. We generated a lookup table by summarizing all pairings of common and scientific names in the inventories for which both were reported. We manually reviewed the common-to-scientific-name pairings, confirming that all were correct, and then programmatically assigned scientific names to all common names (Data S9).

    Fifth, we assigned native status to each tree through reference to the Biota of North America Project (Kartesz, 2018), which has collected data on all native and non-native species occurrences throughout the US states. Specifically, we determined whether each tree species in a given city was native to that state, not native to that state, or whether we did not have enough information to determine nativity (for cases where only the genus was known).

    Sixth, some cities reported only the street address but not latitude and longitude. For these cities, we used the OpenCageGeocoder (https://opencagedata.com/) to convert addresses to latitude and longitude coordinates (Data S9). OpenCageGeocoder leverages open data and is used by many academic institutions (see https://opencagedata.com/solutions/academia).

    Seventh, we trimmed each city dataset to include only the standardized columns we identified in Table S4. After each stage of data cleaning, we performed manual spot checking to identify any issues.
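    To illustrate the flavor of these cleaning steps, here is a minimal pandas sketch of condition standardization and unit conversion. The column names, the score mapping, and the sample values are illustrative assumptions, not the authors' actual code (which is in Data S5):

        import pandas as pd

        # Hypothetical raw inventory rows; real column names vary by city (Table S4).
        raw = pd.DataFrame({
            "Condition": [1, 2, 3, 2, 1],           # a city-specific numeric score
            "DBH_in": [10.0, 24.5, 3.2, 15.0, 8.7], # diameter at breast height, inches
        })

        # Map the city's numeric condition scores onto the shared descriptive scale;
        # the mapping below is an assumption -- each city's metadata defines its own.
        condition_scale = {1: "good", 2: "poor", 3: "dead/dying"}
        raw["condition_std"] = raw["Condition"].map(condition_scale)

        # Convert imperial units to metric (inches -> centimeters).
        raw["dbh_cm"] = raw["DBH_in"] * 2.54

        print(raw[["condition_std", "dbh_cm"]])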

  20. Household Energy Consumption

    • kaggle.com
    Updated Apr 5, 2025
    Cite
    Samx_sam (2025). Household Energy Consumption [Dataset]. https://www.kaggle.com/datasets/samxsam/household-energy-consumption
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 5, 2025
    Dataset provided by
    Kaggle
    Authors
    Samx_sam
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🏡 Household Energy Consumption - April 2025 (90,000 Records)

    📌 Overview

    This dataset presents detailed daily energy consumption records from households over the month of April 2025. With 90,000 rows and features such as temperature, household size, air conditioning usage, and peak-hour consumption, it is well suited to time-series analysis, machine learning, and sustainability research.

    Column Name | Data Type Category | Description
    Household_ID | Categorical (Nominal) | Unique identifier for each household
    Date | Datetime | The date of the energy usage record
    Energy_Consumption_kWh | Numerical (Continuous) | Total energy consumed by the household in kWh
    Household_Size | Numerical (Discrete) | Number of individuals living in the household
    Avg_Temperature_C | Numerical (Continuous) | Average daily temperature in degrees Celsius
    Has_AC | Categorical (Binary) | Indicates if the household has air conditioning (Yes/No)
    Peak_Hours_Usage_kWh | Numerical (Continuous) | Energy consumed during peak hours in kWh

    📂 Dataset Summary

    • Rows: 90,000
    • Time Range: April 1, 2025 – April 30, 2025
    • Data Granularity: Daily per household
    • Location: Simulated global coverage
    • Format: CSV (Comma-Separated Values); a loading sketch follows below
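
    Given the schema above, a minimal pandas loading sketch might look like the following (the filename matches the one referenced in the next section; the path is an assumption):

        import pandas as pd

        # Load the CSV; parsing Date up front makes time-series work painless.
        df = pd.read_csv(
            "household_energy_consumption_2025.csv",
            parse_dates=["Date"],
        )

        # Quick sanity checks against the documented schema.
        print(df.shape)                      # expected: (90000, 7)
        print(df.dtypes)
        print(df["Has_AC"].value_counts())   # expected: Yes/No counts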

    📚 Libraries Used for Working with household_energy_consumption_2025.csv

    🔍 1. Data Manipulation & Analysis

    Library | Purpose
    pandas | Reading, cleaning, and transforming tabular data
    numpy | Numerical operations, working with arrays

    📊 2. Data Visualization

    Library | Purpose
    matplotlib | Creating static plots (line, bar, histograms, etc.)
    seaborn | Statistical visualizations, heatmaps, boxplots, etc.
    plotly | Interactive charts (time series, pie, bar, scatter, etc.)

    📈 3. Machine Learning / Modeling

    Library | Purpose
    scikit-learn | Preprocessing, regression, classification, clustering
    xgboost / lightgbm | Gradient boosting models for better accuracy

    🧹 4. Data Preprocessing

    Library | Purpose
    sklearn.preprocessing | Encoding categorical features, scaling, normalization
    datetime / pandas | Date-time conversion and manipulation

    🧪 5. Model Evaluation

    Library | Purpose
    sklearn.metrics | Accuracy, MAE, RMSE, R² score, confusion matrix, etc.

    ✅ These libraries provide a complete toolkit for performing data analysis, modeling, and visualization tasks efficiently.
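
    As a small end-to-end illustration of that toolkit, the sketch below encodes the binary feature, derives a calendar feature, and fits a regressor with held-out evaluation. The feature choices and model settings are assumptions for demonstration, not a recommended pipeline:

        import pandas as pd
        from sklearn.model_selection import train_test_split
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.metrics import mean_absolute_error, r2_score

        df = pd.read_csv("household_energy_consumption_2025.csv", parse_dates=["Date"])

        # Encode the Yes/No column as 0/1 and add a simple calendar feature.
        df["Has_AC"] = (df["Has_AC"] == "Yes").astype(int)
        df["day_of_week"] = df["Date"].dt.dayofweek

        features = ["Household_Size", "Avg_Temperature_C", "Has_AC", "day_of_week"]
        X_train, X_test, y_train, y_test = train_test_split(
            df[features], df["Energy_Consumption_kWh"],
            test_size=0.2, random_state=42,
        )

        model = RandomForestRegressor(n_estimators=100, random_state=42)
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        print("MAE:", mean_absolute_error(y_test, pred))
        print("R²:", r2_score(y_test, pred))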

    📈 Potential Use Cases

    This dataset is ideal for a wide variety of analytics and machine learning projects:

    🔮 Forecasting & Time Series Analysis

    • Predict future household energy consumption based on previous trends and weather conditions.
    • Identify seasonal and daily consumption patterns.

    💡 Energy Efficiency Analysis

    • Analyze differences in energy consumption between households with and without air conditioning (see the groupby sketch below).
    • Compare energy usage efficiency across varying household sizes.
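
    A possible starting point for the first comparison, assuming the column names from the schema above:

        import pandas as pd

        df = pd.read_csv("household_energy_consumption_2025.csv", parse_dates=["Date"])

        # Mean daily consumption, split by AC ownership and household size.
        summary = (
            df.groupby(["Has_AC", "Household_Size"])["Energy_Consumption_kWh"]
              .mean()
              .unstack("Has_AC")
        )
        print(summary)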

    🌡️ Climate Impact Studies

    • Investigate how temperature affects electricity usage across households.
    • Model the potential impact of climate change on residential energy demand.

    🔌 Peak Load Management

    • Build models to predict and manage energy demand during peak hours (a simple peak-share starting point is sketched below).
    • Support research on smart grid technologies and dynamic pricing.
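
    One simple way in, before any modeling: compute each household's average share of consumption that falls in peak hours, then rank households, e.g. as candidates for dynamic pricing. Column names follow the schema above:

        import pandas as pd

        df = pd.read_csv("household_energy_consumption_2025.csv", parse_dates=["Date"])

        # Fraction of each day's consumption that falls in peak hours.
        df["peak_share"] = df["Peak_Hours_Usage_kWh"] / df["Energy_Consumption_kWh"]

        # Average per household, heaviest peak-hour users first.
        by_household = df.groupby("Household_ID")["peak_share"].mean()
        print(by_household.sort_values(ascending=False).head(10))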

    🧠 Machine Learning Projects

    • Supervised learning (regression/classification) to predict energy consumption.
    • Clustering households by usage patterns for targeted energy programs (see the KMeans sketch below).
    • Anomaly detection in energy usage for fault detection.
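
    A sketch of the clustering idea, using scikit-learn's KMeans on per-household profiles; the number of clusters and the profile features are illustrative assumptions:

        import pandas as pd
        from sklearn.preprocessing import StandardScaler
        from sklearn.cluster import KMeans

        df = pd.read_csv("household_energy_consumption_2025.csv", parse_dates=["Date"])

        # Collapse the daily records into one usage profile per household.
        profile = df.groupby("Household_ID").agg(
            mean_kwh=("Energy_Consumption_kWh", "mean"),
            mean_peak=("Peak_Hours_Usage_kWh", "mean"),
            size=("Household_Size", "first"),
        )

        # Scale features so no single column dominates the distance metric.
        X = StandardScaler().fit_transform(profile)
        profile["cluster"] = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
        print(profile["cluster"].value_counts())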

    🛠️ Example Starter Projects

    • Time-series forecasting using Facebook Prophet or ARIMA (a Prophet sketch follows below)
    • Regression modeling using XGBoost or LightGBM
    • Classification of AC vs. non-AC household behavior
    • Energy-saving recommendation systems
    • Heatmaps of temperature vs. energy usage
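
    For the first starter project, a minimal Prophet sketch on the aggregate daily series might look like this; note that prophet is installed separately (pip install prophet), and with only one month of data the forecast is illustrative at best:

        import pandas as pd
        from prophet import Prophet

        df = pd.read_csv("household_energy_consumption_2025.csv", parse_dates=["Date"])

        # Aggregate to one daily total across all households, in Prophet's ds/y format.
        daily = (
            df.groupby("Date")["Energy_Consumption_kWh"].sum()
              .reset_index()
              .rename(columns={"Date": "ds", "Energy_Consumption_kWh": "y"})
        )

        model = Prophet()
        model.fit(daily)
        future = model.make_future_dataframe(periods=7)  # forecast one week ahead
        forecast = model.predict(future)
        print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail(7))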