27 datasets found
  1. Data_Cleaning_in_Pandas_Shruthi_TR.ipynb

    • kaggle.com
    Updated Oct 19, 2023
    Cite
    Shruthi T R (2023). Data_Cleaning_in_Pandas_Shruthi_TR.ipynb [Dataset]. https://www.kaggle.com/datasets/shruthirt/data-cleaning-in-pandas-shruthi-tr-ipynb
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 19, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Shruthi T R
    Description

    Dataset

    This dataset was created by Shruthi T R


  2. A Replication Dataset for Fundamental Frequency Estimation

    • live.european-language-grid.eu
    • data.niaid.nih.gov
    • +1 more
    json
    Updated Oct 19, 2023
    Cite
    Bastian Bechtold (2023). A Replication Dataset for Fundamental Frequency Estimation [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7808
    Explore at:
    Available download formats: json
    Dataset updated
    Oct 19, 2023
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Part of the dissertation Pitch of Voiced Speech in the Short-Time Fourier Transform: Algorithms, Ground Truths, and Evaluation Methods. © 2020, Bastian Bechtold. All rights reserved.

    Estimating the fundamental frequency of speech remains an active area of research, with varied applications in speech recognition, speaker identification, and speech compression. A vast number of algorithms for estimating this quantity have been proposed over the years, and a number of speech and noise corpora have been developed for evaluating their performance. The present dataset contains estimated fundamental frequency tracks of 25 algorithms on six speech corpora and two noise corpora, at nine signal-to-noise ratios between -20 and 20 dB SNR, as well as an additional evaluation on synthetic harmonic tone complexes in white noise.

    The dataset also contains pre-calculated performance measures, both novel and traditional, in reference to each speech corpus' ground truth, the algorithms' own clean-speech estimate, and our own consensus truth. It can thus serve as the basis for a comparison study, to replicate existing studies from a larger dataset, or as a reference for developing new fundamental frequency estimation algorithms. All source code and data are available to download and entirely reproducible, albeit requiring about one year of processor time.

    Included Code and Data

    ground truth data.zip is a JBOF dataset of fundamental frequency estimates and ground truths of all speech files in the following corpora:

    CMU-ARCTIC (consensus truth) [1]
    FDA (corpus truth and consensus truth) [2]
    KEELE (corpus truth and consensus truth) [3]
    MOCHA-TIMIT (consensus truth) [4]
    PTDB-TUG (corpus truth and consensus truth) [5]
    TIMIT (consensus truth) [6]

    noisy speech data.zip is a JBOF dataset of fundamental frequency estimates of speech files mixed with noise from the following corpora:
    NOISEX [7]
    QUT-NOISE [8]

    synthetic speech data.zip is a JBOF dataset of fundamental frequency estimates of synthetic harmonic tone complexes in white noise.

    noisy_speech.pkl and synthetic_speech.pkl are pickled Pandas dataframes of performance metrics derived from the above data for the following fundamental frequency estimation algorithms:
    AUTOC [9]
    AMDF [10]
    BANA [11]
    CEP [12]
    CREPE [13]
    DIO [14]
    DNN [15]
    KALDI [16]
    MAPS, MBSC [17]
    NLS [18]
    PEFAC [19]
    PRAAT [20]
    RAPT [21]
    SACC [22]
    SAFE [23]
    SHR [24]
    SIFT [25]
    SRH [26]
    STRAIGHT [27]
    SWIPE [28]
    YAAPT [29]
    YIN [30]

    noisy speech evaluation.py and synthetic speech evaluation.py are Python programs to calculate the above Pandas dataframes from the above JBOF datasets. They calculate the following performance measures:
    Gross Pitch Error (GPE), the percentage of pitches where the estimated pitch deviates from the true pitch by more than 20%.
    Fine Pitch Error (FPE), the mean error of grossly correct estimates.
    High/Low Octave Pitch Error (OPE), the percentage of pitches that are GPEs and happen to fall at an integer multiple of the true pitch.
    Gross Remaining Error (GRE), the percentage of pitches that are GPEs but not OPEs.
    Fine Remaining Bias (FRB), the median error of GREs.
    True Positive Rate (TPR), the percentage of true positive voicing estimates.
    False Positive Rate (FPR), the percentage of false positive voicing estimates.
    False Negative Rate (FNR), the percentage of false negative voicing estimates.
    F₁, the harmonic mean of precision and recall of the voicing decision.
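    As a rough illustration of the first two measures, here is a minimal NumPy sketch of GPE and FPE on paired per-frame pitch tracks (the function name and arrays are hypothetical; this is not the published evaluation code):

    import numpy as np

    def gpe_fpe(true_f0, est_f0, tolerance=0.2):
        # true_f0, est_f0: per-frame pitch in Hz, 0 where unvoiced.
        voiced = (true_f0 > 0) & (est_f0 > 0)
        rel_err = np.abs(est_f0[voiced] - true_f0[voiced]) / true_f0[voiced]
        gross = rel_err > tolerance  # estimate deviates from truth by more than 20%
        gpe = 100.0 * gross.mean()  # percentage of voiced frames with gross errors
        fpe = 100.0 * rel_err[~gross].mean()  # mean error of grossly correct estimates
        return gpe, fpe

    gpe, fpe = gpe_fpe(np.array([100.0, 200.0, 150.0]), np.array([101.0, 405.0, 149.0]))
    # The 405 Hz frame is a gross (octave-like) error; the other two count toward FPE.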

    Pipfile is a pipenv-compatible pipfile for installing all prerequisites necessary for running the above Python programs.

    The Python programs take about an hour to run on a fast 2019 computer and require at least 32 GB of memory.

    References:

    [1] John Kominek and Alan W Black. CMU ARCTIC database for speech synthesis, 2003.
    [2] Paul C Bagshaw, Steven Hiller, and Mervyn A Jack. Enhanced Pitch Tracking and the Processing of F0 Contours for Computer Aided Intonation Teaching. In EUROSPEECH, 1993.
    [3] F Plante, Georg F Meyer, and William A Ainsworth. A Pitch Extraction Reference Database. In Fourth European Conference on Speech Communication and Technology, pages 837–840, Madrid, Spain, 1995.
    [4] Alan Wrench. MOCHA MultiCHannel Articulatory database: English, November 1999.
    [5] Gregor Pirker, Michael Wohlmayr, Stefan Petrik, and Franz Pernkopf. A Pitch Tracking Corpus with Evaluation on Multipitch Tracking Scenario. page 4, 2011.
    [6] John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, David S. Pallett, Nancy L. Dahlgren, and Victor Zue. TIMIT Acoustic-Phonetic Continuous Speech Corpus, 1993.
    [7] Andrew Varga and Herman J.M. Steeneken. Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication, 12(3):247–251, July 1993.
    [8] David B. Dean, Sridha Sridharan, Robert J. Vogt, and Michael W. Mason. The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms. Proceedings of Interspeech 2010, 2010.
    [9] Man Mohan Sondhi. New methods of pitch extraction. IEEE Transactions on Audio and Electroacoustics, 16(2):262–266, 1968.
    [10] Myron J. Ross, Harry L. Shaffer, Asaf Cohen, Richard Freudberg, and Harold J. Manley. Average magnitude difference function pitch extractor. IEEE Transactions on Acoustics, Speech and Signal Processing, 22(5):353–362, 1974.
    [11] Na Yang, He Ba, Weiyang Cai, Ilker Demirkol, and Wendi Heinzelman. BaNa: A Noise Resilient Fundamental Frequency Detection Algorithm for Speech and Music. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12):1833–1848, December 2014.
    [12] Michael Noll. Cepstrum Pitch Determination. The Journal of the Acoustical Society of America, 41(2):293–309, 1967.
    [13] Jong Wook Kim, Justin Salamon, Peter Li, and Juan Pablo Bello. CREPE: A Convolutional Representation for Pitch Estimation. arXiv:1802.06182 [cs, eess, stat], February 2018.
    [14] Masanori Morise, Fumiya Yokomori, and Kenji Ozawa. WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications. IEICE Transactions on Information and Systems, E99.D(7):1877–1884, 2016.
    [15] Kun Han and DeLiang Wang. Neural Network Based Pitch Tracking in Very Noisy Speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12):2158–2168, December 2014.
    [16] Pegah Ghahremani, Bagher BabaAli, Daniel Povey, Korbinian Riedhammer, Jan Trmal, and Sanjeev Khudanpur. A pitch extraction algorithm tuned for automatic speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 2494–2498. IEEE, 2014.
    [17] Lee Ngee Tan and Abeer Alwan. Multi-band summary correlogram-based pitch detection for noisy speech. Speech Communication, 55(7-8):841–856, September 2013.
    [18] Jesper Kjær Nielsen, Tobias Lindstrøm Jensen, Jesper Rindom Jensen, Mads Græsbøll Christensen, and Søren Holdt Jensen. Fast fundamental frequency estimation: Making a statistically efficient estimator computationally efficient. Signal Processing, 135:188–197, June 2017.
    [19] Sira Gonzalez and Mike Brookes. PEFAC - A Pitch Estimation Algorithm Robust to High Levels of Noise. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(2):518–530, February 2014.
    [20] Paul Boersma. Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. In Proceedings of the Institute of Phonetic Sciences, volume 17, pages 97–110, Amsterdam, 1993.
    [21] David Talkin. A robust algorithm for pitch tracking (RAPT). Speech Coding and Synthesis, 495:518, 1995.
    [22] Byung Suk Lee and Daniel P. W. Ellis. Noise robust pitch tracking by subband autocorrelation classification. In Interspeech, pages 707–710, 2012.
    [23] Wei Chu and Abeer Alwan. SAFE: a statistical algorithm for F0 estimation for both clean and noisy speech. In INTERSPEECH, pages 2590–2593, 2010.
    [24] Xuejing Sun. Pitch determination and voice quality analysis using subharmonic-to-harmonic ratio. In Acoustics, Speech, and Signal Processing (ICASSP), 2002 IEEE International Conference on, volume 1, page I-333. IEEE, 2002.
    [25] Markel. The SIFT algorithm for fundamental frequency estimation. IEEE Transactions on Audio and Electroacoustics, 20(5):367–377, December 1972.
    [26] Thomas Drugman and Abeer Alwan. Joint Robust Voicing Detection and Pitch Estimation Based on Residual Harmonics. In Interspeech, pages 1973–1976, 2011.
    [27] Hideki Kawahara, Masanori Morise, Toru Takahashi, Ryuichi Nisimura, Toshio Irino, and Hideki Banno. TANDEM-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 3933–3936. IEEE, 2008.
    [28] Arturo Camacho. SWIPE: A sawtooth waveform inspired pitch estimator for speech and music. PhD thesis, University of Florida, 2007.
    [29] Kavita Kasi and Stephen A. Zahorian. Yet Another Algorithm for Pitch Tracking. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages I-361–I-364, Orlando, FL, USA, May 2002. IEEE.
    [30] Alain de Cheveigné and Hideki Kawahara. YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4):1917, 2002.

  3. Enhancing UNCDF Operations: Power BI Dashboard Development and Data Mapping

    • figshare.com
    Updated Jan 6, 2025
    Cite
    Maryam Binti Haji Abdul Halim (2025). Enhancing UNCDF Operations: Power BI Dashboard Development and Data Mapping [Dataset]. http://doi.org/10.6084/m9.figshare.28147451.v1
    Explore at:
    Dataset updated
    Jan 6, 2025
    Dataset provided by
    figshare
    Authors
    Maryam Binti Haji Abdul Halim
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This project focuses on data mapping, integration, and analysis to support the development and enhancement of six UNCDF operational applications: OrgTraveler, Comms Central, Internal Support Hub, Partnership 360, SmartHR, and TimeTrack. These apps streamline workflows for travel claims, internal support, partnership management, and time tracking within UNCDF.

    Key Features and Tools:
    Data Mapping for Salesforce CRM Migration: Structured and mapped data flows to ensure compatibility and seamless migration to Salesforce CRM.
    Python for Data Cleaning and Transformation: Used pandas, numpy, and APIs to clean, preprocess, and transform raw datasets into standardized formats.
    Power BI Dashboards: Designed interactive dashboards to visualize workflows and monitor performance metrics for decision-making.
    Collaboration Across Platforms: Used Google Colab for code collaboration and Microsoft Excel for data validation and analysis.
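    As a hedged sketch of the pandas/numpy cleaning-and-standardization step named above (the file name, columns, and field mapping are hypothetical placeholders, not the project's actual scripts):

    import numpy as np
    import pandas as pd

    raw = pd.read_csv("travel_claims_export.csv")  # hypothetical app export

    # Standardize column names to match a Salesforce CRM field mapping.
    field_map = {"Claim Ref": "claim_id", "Staff Name": "employee", "Amt": "amount_usd"}
    df = raw.rename(columns=field_map)[list(field_map.values())]

    # Clean and normalize values before migration.
    df["employee"] = df["employee"].str.strip().str.title()
    df["amount_usd"] = pd.to_numeric(df["amount_usd"], errors="coerce").replace(0, np.nan)
    df = df.dropna(subset=["claim_id", "amount_usd"])

    df.to_csv("travel_claims_standardized.csv", index=False)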

  4. S&P 500 Companies Analysis Project

    • kaggle.com
    Updated Apr 6, 2025
    Cite
    anshadkaggle (2025). S&P 500 Companies Analysis Project [Dataset]. https://www.kaggle.com/datasets/anshadkaggle/s-and-p-500-companies-analysis-project
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 6, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    anshadkaggle
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This project focuses on analyzing the S&P 500 companies using data analysis tools like Python (Pandas), SQL, and Power BI. The goal is to extract insights related to sectors, industries, locations, and more, and visualize them using dashboards.

    Included Files:

    sp500_cleaned.csv – Cleaned dataset used for analysis

    sp500_analysis.ipynb – Jupyter Notebook (Python + SQL code)

    dashboard_screenshot.png – Screenshot of Power BI dashboard

    README.md – Summary of the project and key takeaways

    This project demonstrates practical data cleaning, querying, and visualization skills.
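    As a hedged illustration of the sector/industry analysis described above, a short pandas sketch against the included sp500_cleaned.csv (the column names are assumptions):

    import pandas as pd

    df = pd.read_csv("sp500_cleaned.csv")  # assumed columns: Symbol, Sector, Industry

    # Company counts by sector, the kind of breakdown behind the dashboard visuals.
    print(df["Sector"].value_counts())

    # Top three industries within each sector.
    top_industries = (
        df.groupby(["Sector", "Industry"]).size()
          .groupby(level=0, group_keys=False).nlargest(3)
    )
    print(top_industries)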

  5. aesthetics-wiki

    • huggingface.co
    Updated Apr 3, 2025
    Cite
    Nina Rhone (2025). aesthetics-wiki [Dataset]. https://huggingface.co/datasets/ninar12/aesthetics-wiki
    Explore at:
    Dataset updated
    Apr 3, 2025
    Authors
    Nina Rhone
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Introduction

    This dataset is a web-scraped version of aesthetics-wiki. There are 1022 aesthetics captured.

      Columns + dtype
    

    title: str
    description: str (raw representation, included because it could help in structuring the data)
    keywords_spacy: str (keywords with POS tags in ['NOUN', 'ADJ', 'VERB', 'NUM', 'PROPN'] extracted from the description with the spaCy library; weird characters, numbers, extra spaces, and stopwords removed)

      Cleaning
    

    Standard Pandas cleaning

    Cleaned the data by… See the full description on the dataset page: https://huggingface.co/datasets/ninar12/aesthetics-wiki.
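    A minimal sketch of the keywords_spacy-style extraction described above (the model name and exact cleaning steps are assumptions, not the dataset author's script):

    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumed model
    KEEP = {"NOUN", "ADJ", "VERB", "NUM", "PROPN"}

    def extract_keywords(description: str) -> str:
        doc = nlp(description)
        return " ".join(
            token.text.lower() for token in doc
            if token.pos_ in KEEP and not token.is_stop
            and not token.is_punct and not token.is_space
        )

    print(extract_keywords("Dark academia is a literary aesthetic centred on classic literature."))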

  6. Amazon Sales Data Analysis Project1

    • kaggle.com
    Updated Jan 22, 2024
    Cite
    GOKUL (2024). Amazon Sales Data Analysis Project1 [Dataset]. https://www.kaggle.com/datasets/gokulvino/amazon-sales-data-analysis-project1
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 22, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    GOKUL
    Description

    Problem Statement: Sales management has gained importance to meet increasing competition and the need for improved methods of distribution to reduce cost and increase profits. Sales management today is the most important function in a commercial and business enterprise. We need to extract all the Amazon sales datasets, transform them using data cleaning and preprocessing, and finally load them for analysis. We need to visualize sales trends month-wise, year-wise, and year-month-wise. Moreover, we need to find key metrics and factors and show meaningful relationships between attributes.

    Approach The main goal of the project is to find key metrics and factors and then show meaningful relationships between them based on different features available in the dataset.

    Data Collection: Imported data from the various datasets available in the project using the Pandas library.

    Data Cleaning: Removed missing values and created new features as per insights.

    Data Preprocessing: Modified the structure of the data to make it more understandable, suitable, and convenient for statistical analysis.

    Data Analysis: Analyzed the dataset using Pandas, NumPy, Matplotlib, and Seaborn.

    Data Visualization: Plotted graphs to get insights about dependent and independent variables. Also used Tableau and Power BI for data visualization.
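    A minimal extract-transform-load sketch of the steps above (the file name and column names are hypothetical):

    import pandas as pd

    # Extract: load the raw sales data.
    df = pd.read_csv("amazon_sales.csv", parse_dates=["Order Date"])

    # Transform: data cleaning and preprocessing as described above.
    df = df.dropna(subset=["Amount"]).drop_duplicates()
    df["year"] = df["Order Date"].dt.year
    df["month"] = df["Order Date"].dt.month

    # Load: month-wise and year-wise sales trends, ready for visualization.
    monthly = df.groupby(["year", "month"])["Amount"].sum()
    monthly.to_csv("sales_trend_year_month.csv")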

  7. Hotspots of Extinction: Country-Level Data on Threatened Vertebrates, Invertebrates, and Plants

    • dataverse.openforestdata.pl
    tsv
    Updated May 11, 2025
    Cite
    (2025). Hotspots of Extinction: Country-Level Data on Threatened Vertebrates, Invertebrates, and Plants [Dataset]. http://doi.org/10.48370/OFD/XSYP7R
    Explore at:
    Available download formats: tsv(11419), tsv(10834), tsv(1404776), tsv(11701)
    Dataset updated
    May 11, 2025
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset provides annual records of threatened species from 2004 to 2023, focusing on the 25 countries most impacted by biodiversity loss. The data is organized into three categories (Vertebrates, Invertebrates, and Plants) and sourced from UNdata and the IUCN Red List. Each entry includes the country name, year, species count, and biodiversity group. It is designed to support research, education, and public engagement on global conservation priorities.

    Source and Collection Timeline:
    Original Data Range: 2004–2023
    Cleaned and Extracted: November 2024
    Primary Sources: UNdata, IUCN Red List (via UN Statistics Division)

    Data Processing Summary:
    Data Cleaning: Removed incomplete entries and excluded non-country-level data (e.g., continents or regions).
    Grouping: Categorized into Vertebrates, Invertebrates, and Plants.
    Top 25 Filter: Selected the top 25 countries per year and per category to improve visual clarity.
    File Generation: Created three structured CSVs using Python (Pandas); a sketch follows below.

    Data Format:
    File Type: CSV (.csv)
    Columns: Country – Name of the country; Year – Range from 2004 to 2023; Value – Number of threatened species; Group – Vertebrates, Invertebrates, or Plants
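    A minimal sketch of the grouping and top-25 filter described above (the input file name and region list are assumptions; the columns follow the published schema):

    import pandas as pd

    df = pd.read_csv("threatened_species_raw.csv")  # columns: Country, Year, Value, Group

    # Remove incomplete entries and non-country rows (e.g., continents or regions).
    regions = {"Africa", "Asia", "Europe", "Oceania", "Americas", "World"}
    df = df.dropna(subset=["Country", "Year", "Value"])
    df = df[~df["Country"].isin(regions)]

    # Keep the top 25 countries per year and per category.
    top25 = (
        df.sort_values("Value", ascending=False)
          .groupby(["Group", "Year"])
          .head(25)
    )

    # One structured CSV per biodiversity group.
    for group, frame in top25.groupby("Group"):
        frame.to_csv(f"threatened_{group.lower()}.csv", index=False)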

  8. mt-bench-eval-critique

    • huggingface.co
    Updated Apr 8, 2024
    Cite
    distilabel-internal-testing (2024). mt-bench-eval-critique [Dataset]. https://huggingface.co/datasets/distilabel-internal-testing/mt-bench-eval-critique
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 8, 2024
    Dataset authored and provided by
    distilabel-internal-testing
    Description

    This dataset is used to check criticon prompts/responses while testing. It contains instructions/responses from mt_bench_eval, as extracted from https://github.com/kaistAI/prometheus/blob/main/evaluation/benchmark/data/mt_bench_eval.json. The dataset was obtained by cleaning the data with:

    import re
    import pandas as pd
    from datasets import Dataset

    df = pd.read_json("mt_bench_eval.json", lines=True)
    ds = Dataset.from_pandas(df, preserve_index=False)

    … See the full description on the dataset page: https://huggingface.co/datasets/distilabel-internal-testing/mt-bench-eval-critique.

  9. Extirpated species in Berlin, dates of last detections, habitats, and number of Berlin’s inhabitants

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Jul 9, 2024
    Cite
    Silvia Keinath (2024). Extirpated species in Berlin, dates of last detections, habitats, and number of Berlin’s inhabitants [Dataset]. http://doi.org/10.5061/dryad.n5tb2rc4k
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 9, 2024
    Dataset provided by
    Museum für Naturkunde
    Authors
    Silvia Keinath
    License

    CC0 1.0 Universal, https://spdx.org/licenses/CC0-1.0.html

    Area covered
    Berlin
    Description

    Species loss is highly scale-dependent, following the species-area relationship. We analysed spatio-temporal patterns of species’ extirpation on a multitaxonomic level using Berlin, the capital city of Germany. Berlin is one of the largest cities in Europe and has experienced a strong urbanisation trend since the late 19th century. We expected species’ extirpation to be exceptionally high due to the long history of urbanisation. Analysing regional Red Lists of Threatened Plants, Animals, and Fungi of Berlin (covering 9498 species), we found that 16% of species were extirpated, a rate 5.9 times higher than at the German scale and 47.1 times higher than at the European scale. Species’ extirpation in Berlin is comparable to that of another German city with a similarly broad taxonomic coverage, but much higher than in regional areas with less human impact. The documentation of species’ extirpation started in the 18th century and is well documented for the 19th and 20th centuries. We found an average annual extirpation of 3.6 species in the 19th century and 9.6 species in the 20th century; the 21st century, despite the much shorter time period, has already documented as many extirpated species as the 19th century. Our results showed that species’ extirpation is higher at small than at large spatial scales and might be negatively influenced by urbanisation, with different effects on different taxonomic groups and habitats. Over time, we found that species’ extirpation is highest during periods of high human alteration and is negatively affected by the number of people living in the city. However, there is still a lack of data to decouple the size of the area from the human impact of urbanisation. Nevertheless, cities might be suitable systems for studying species’ extirpation processes due to their small scale and human impact.

    Methods

    Data extraction: To determine the proportion of extirpated species for Germany, we manually summarised the numbers of species classified in category 0 ('extinct or extirpated') and calculated the percentage relative to the total number of species listed in the Red Lists of Threatened Species for Germany, taken from the website of the Red List Centre of Germany (Rote Liste Zentrum, 2024a). For Berlin, we used the 37 current Red Lists of Threatened Plants, Animals, and Fungi from the city-state of Berlin, covering the years 2004 to 2023, taken from the official capital city portal of the Berlin Senate Department for Mobility, Transport, Climate Protection and Environment (SenMVKU, 2024a; see overview of Berlin Red Lists used in Table 1). We extracted all species listed as extinct/extirpated, i.e. classified in category 0, and additionally, if available, the date of the last record of the species in Berlin. The Red List of macrofungi of the order Boletales by Schmidt (2017) was not included in our study, as this Red List has only been compiled once in the frame of a pilot project and therefore lacks category 0 ('extinct or extirpated').

    We used Python, version 3.7.9 (Van Rossum and Drake, 2009), the Python libraries Pandas (McKinney et al., 2010) and Camelot-py, version 0.11.0 (Vinayak Meta, 2023), in Jupyter Lab, version 4.0.6 (Project Jupyter, 2016) notebooks. In the first step, we created a metadata table of the Red Lists of Berlin to keep track of the extraction process, maintain the source reference links, and store summarised data from each Red List PDF file. At the extraction of each file, a data row was added to the metadata table, which was updated throughout the rest of the process. In the second step, we identified the page range for extraction for each Red List file. The extraction mechanism for each Red List file depended on the printed table layout: we extracted tables with lined rows using the Lattice parsing method (Camelot-py, 2024a) and tables with alternating-coloured rows using the Stream method (Camelot-py, 2024b); a sketch follows at the end of this description. To check the consistency of extraction, we used the Camelot-py accuracy report along with the Pandas data frame shape property (Pandas, 2024). After initial data cleaning for consistent column counts and missing data, we filtered the data for species in category 0 only. We collated the data frames and exported them as a CSV file. In a further step, we proofread whether the filtered data tallied with the summary tables given in each Red List. Finally, we cleaned each Red List table to contain the species, the current hazard level (category 0), the date of the species’ last detection in Berlin, and the reference (codes and data available at: Github, 2023). When no date of last detection was given for a species, we contacted the authors of the respective Red Lists and/or used former Red Lists to find information on species’ last detections (Burger et al., 1998; Saure et al., 1998; 1999; Braasch et al., 2000; Saure, 2000).

    Determination of the recording time windows of the Berlin Red Lists: We determined the time windows the Berlin Red Lists look back on from their methodologies. If the information was missing in the current Red Lists, we consulted the previous version (see all detailed time windows of the earliest assessments with references in Table B2 in Appendix B).

    Data classification: For the analyses of the percentage of species in the different hazard levels, we used the German Red List categories as described in detail by Saure and Schwarz (2005) and Ludwig et al. (2009). These are: prewarning list, endangered (category 3), highly endangered (category 2), threatened by extinction or extirpation (category 1), and extinct or extirpated (category 0). To determine the number of indigenous unthreatened species in each Red List, we subtracted the number of species in the five categories and the number of non-indigenous species (neobiota) from the total number of species in each Red List.

    For further analyses, we pooled the taxonomic groups of the 37 Red Lists into more broadly defined taxonomic groups: plants, lichens, fungi, algae, mammals, birds, amphibians, reptiles, fish and lampreys, molluscs, and arthropods (see categorisation in Table 1). We categorised slime fungi (Myxomycetes including Ceratiomyxomycetes) as 'fungi', even though they are more closely related to animals, because slime fungi are traditionally studied by mycologists (Schmidt and Täglich, 2023). We classified 'lichens' in a separate category, rather than under 'fungi', as they are a symbiotic community of fungi and algae (Krause et al., 2017). For analyses of the percentage of extirpated species of each pooled taxonomic group, we set the number of extirpated species in relation to the sum of the number of unthreatened species, species in the prewarning list, and species in categories one to three.

    We further categorised the extirpated species according to the habitats in which they occurred: terrestrial species as 'terrestrial' and aquatic species as 'aquatic'. Amphibians and dragonflies have life stages in both terrestrial and aquatic habitats and were categorised as 'terrestrial/aquatic'. We also categorised plants and mosses as 'terrestrial/aquatic' if they depend on wetlands (see all habitat categories for each species in Table C1 in Appendix C).

    The available data on the species’ last detection in Berlin ranged from a specific year, over a period of time, up to a century. If a year of last detection was given with the qualifier 'around' or 'circa', we used the given year for temporal classification. If a year of last detection was given with the qualifier 'before' or 'after', we assumed that the nearest year of last detection was given and categorised the species in the respective century; in this case, we used the species for temporal analyses by centuries only, not across years. If only a timeframe was given as the date of last detection, we used the respective species for temporal analyses between centuries only. We further classified all extirpated species by the century in which they were last detected: 17th century (1601-1700); 18th century (1701-1800); 19th century (1801-1900); 20th century (1901-2000); 21st century (2001-now) (see all data on species’ last detection in Table C1 in Appendix C).

    For analyses of the effects of the number of inhabitants on species’ extirpation in Berlin, we used species that were extirpated between 1920 and 2012, because Berlin was expanded to 'Groß-Berlin' in 1920 (Buesch and Haus, 1987), roughly corresponding to the city’s current area. We included the number of Berlin’s inhabitants for every year a species was last detected (Statistische Jahrbücher der Stadt Berlin, 1920, 1924-1998, 2000; see all data on the number of inhabitants for each year of species’ last detection in Table C1 in Appendix C).

    Materials and Methods from Keinath et al. (2024): 'High levels of species’ extirpation in an urban environment – A case study from Berlin, Germany, covering 1700-2023'.
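    A minimal sketch of the two Camelot-py extraction paths named above (the PDF name, page range, and column layout are hypothetical placeholders, not the published pipeline):

    import camelot
    import pandas as pd

    # Lattice for tables with ruled rows; use flavor="stream" for alternating-coloured rows.
    tables = camelot.read_pdf("berlin_red_list.pdf", pages="10-15", flavor="lattice")

    frames = []
    for table in tables:
        print(table.parsing_report)  # Camelot accuracy report, used for consistency checks
        print(table.df.shape)  # Pandas data frame shape property as a second check
        frames.append(table.df)

    red_list = pd.concat(frames, ignore_index=True)
    red_list.columns = ["species", "category", "last_detection"]  # hypothetical layout
    extirpated = red_list[red_list["category"].astype(str).str.strip() == "0"]
    extirpated.to_csv("category0_species.csv", index=False)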

  10. Nairobi Motorcycle Transit Comparison Dataset: Fuel vs. Electric Vehicle Performance Tracking (2023)

    • scholardata.sun.ac.za
    • data.mendeley.com
    Updated Mar 8, 2025
    + more versions
    Cite
    Martin Kitetu; Alois Mbutura; Halloran Stratford; MJ Booysen (2025). Nairobi Motorcycle Transit Comparison Dataset: Fuel vs. Electric Vehicle Performance Tracking (2023) [Dataset]. http://doi.org/10.25413/sun.28554200.v1
    Explore at:
    Dataset updated
    Mar 8, 2025
    Dataset provided by
    SUNScholarData
    Authors
    Martin Kitetu; Alois Mbutura; Halloran Stratford; MJ Booysen
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Nairobi
    Description

    This dataset contains GPS tracking data and performance metrics for motorcycle taxis (boda bodas) in Nairobi, Kenya, comparing traditional internal combustion engine (ICE) motorcycles with electric motorcycles. The study was conducted in two phases:

    Baseline Phase: 118 ICE motorcycles tracked over 14 days (2023-11-13 to 2023-11-26)
    Transition Phase: 108 ICE motorcycles (control) and 9 electric motorcycles (treatment) tracked over 12 days (2023-12-10 to 2023-12-21)

    The dataset is organised into two main categories:
    Trip Data: Individual trip-level records containing timing, distance, duration, location, and speed metrics
    Daily Data: Daily aggregated summaries containing usage metrics, economic data, and energy consumption

    This dataset enables comparative analysis of electric vs. ICE motorcycle performance, economic modelling of transportation costs, environmental impact assessment, urban mobility pattern analysis, and energy efficiency studies in emerging markets.

    Institutions: EED Advisory, Clean Air Taskforce, Stellenbosch University

    Steps to reproduce:

    Raw Data Collection:
    GPS tracking devices installed on motorcycles, collecting location data at 10-second intervals
    Rider-reported information on revenue, maintenance costs, and fuel/electricity usage

    Processing Steps:
    GPS data cleaning: Filtered invalid coordinates, removed duplicates, interpolated missing points
    Trip identification: Defined by >1 minute stationary periods or ignition cycles (see the sketch after this description)
    Trip metrics calculation: Distance, duration, idle time, average/max speeds
    Daily data aggregation: Summed by user_id and date with self-reported economic data
    Validation: Cross-checked with rider logs and known routes
    Anonymisation: Removed start and end coordinates for the first and last trips of each day to protect rider privacy and home locations

    Technical Information:
    Geographic coverage: Nairobi, Kenya
    Time period: November-December 2023
    Time zone: UTC+3 (East Africa Time)
    Currency: Kenyan Shillings (KES)
    Data format: CSV files
    Software used: Python 3.8 (pandas, numpy, geopy)
    Notes: Some location data points are intentionally missing to protect rider privacy. Self-reported economic and energy consumption data has some missing values where riders did not report.

    Categories: Motorcycle, Transportation in Africa, Electric Vehicles
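    A minimal sketch of the trip-identification rule above (column names and thresholds are assumptions; the actual pipeline is not published here):

    import pandas as pd

    # Hypothetical 10-second GPS samples with columns user_id, timestamp, speed_kph.
    gps = pd.read_csv("gps_samples.csv", parse_dates=["timestamp"])
    gps = gps.sort_values(["user_id", "timestamp"])

    STATIONARY_KPH = 1.0  # assumed speed threshold for "stationary"
    MIN_STOP = pd.Timedelta("1min")  # a stop longer than one minute splits trips

    moving = gps[gps["speed_kph"] > STATIONARY_KPH].copy()

    def label_trips(df):
        # New trip whenever the gap between consecutive moving fixes exceeds MIN_STOP.
        df["trip_id"] = (df["timestamp"].diff() > MIN_STOP).cumsum()
        return df

    moving = moving.groupby("user_id", group_keys=False).apply(label_trips)

    # Per-trip metrics: duration and average/max speed (distance would use geopy).
    trips = moving.groupby(["user_id", "trip_id"]).agg(
        start=("timestamp", "min"), end=("timestamp", "max"),
        avg_kph=("speed_kph", "mean"), max_kph=("speed_kph", "max"),
    )
    trips["duration"] = trips["end"] - trips["start"]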

  11. amazon-products

    • huggingface.co
    Cite
    CK, amazon-products [Dataset]. https://huggingface.co/datasets/ckandemir/amazon-products
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    CK
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Creation and Processing Overview

    This dataset underwent a comprehensive process of loading, cleaning, processing, and preparation, incorporating a range of data manipulation and NLP techniques to optimize its utility for machine learning models, particularly in natural language processing.

      Data Loading and Initial Cleaning
    

    Source: Loaded from the Hugging Face dataset repository bprateek/amazon_product_description. Conversion to Pandas DataFrame: For ease of data… See the full description on the dataset page: https://huggingface.co/datasets/ckandemir/amazon-products.

  12. Roller Coaster Accidents

    • kaggle.com
    Updated Jun 13, 2021
    Cite
    steven (2021). Roller Coaster Accidents [Dataset]. https://www.kaggle.com/stevenlasch/roller-coaster-accidents/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 13, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    steven
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    I wanted to analyze a dataset that consisted of roller coaster accidents, and I saw that there weren't any on Kaggle at the time of uploading this. So, I went online and found one particular dataset.

    The Data

    I found a dataset at https://ridesdatabase.org/saferparks/data/ and began cleaning and analyzing it. Keep in mind that the included dataset is the cleaned version, not the original!

    This file is a dataset that contains information about theme park accidents. It has 24 columns:

    • acc_id: Integer. Unique ID for each accident
    • acc_date: datetime64[ns]. This column is originally a String, but I change it to datetime in the code for easy access to years.
    • acc_state: String. The U.S. State abbreviation that the accident occurred in.
    • acc_city: String. The U.S. city that the accident happened in.
    • fix_port: String. Determines whether the ride is fixed (F) or portable (P).
    • source: String. Source of the accident information.
    • bus_type: String. The place in the park that the accident occurred.
    • industry_sector: String. Groups devices according to the general category within the amusement business.
    • device_category: String. Groups devices within industry sectors that cover a wide range, e.g., coasters, spinning rides, etc.
    • device_type: String. Type of ride or device involved in the accident.
    • tradename_or_generic: String. Particular make/model, where known, or indicates the generic type of ride or device.
    • manufacturer: String. The manufacturer of the faulty ride.
    • num_injured: Integer. Number of people injured in the accident.
    • age_youngest: Float. Age of the youngest victim.
    • gender: String. Gender of the person injured.
    • acc_desc: String. Short description of the accident.
    • injury_desc: String. Short description of the severity of the injuries.
    • report: String. A link to the accident.
    • category: String. What kind of injuries did those who were injured suffer?
    • mechanical: Boolean. Was it a mechanical malfunction? See Notes #1
    • op_error: Boolean. Was it an error with the operation of the machine? See Notes #1
    • employee: Boolean. Was it an employee error? See Notes #1
    • notes: String. Other notes about the accident.
    • year: Integer. Pulls the year from the acc_date column.

    Notes

    1. When totaling the values in mechanical, op_error, or employee columns, there is no need to convert to Integer, since pandas will take their representative values—0 or 1—into account, e.g., data['mechanical'].sum() will return 935 even though the column is of Boolean type.

    2. The notebook that I included under the 'Code' tab imports the cleaned dataset which is why I omitted a data cleaning section in the notebook. If you were to import the data from the website I provided at the top of this page, you will have to clean the data on your own.
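    As a small illustration of notes 1 and 2 (the file name below is hypothetical), booleans sum directly, and the year can be pulled from acc_date once it is parsed:

    import pandas as pd

    data = pd.read_csv("saferparks_cleaned.csv")  # hypothetical file name

    # Note 1: pandas treats True as 1 and False as 0, so no conversion is needed.
    print(data["mechanical"].sum())

    # acc_date is read as strings; parse it to datetime for easy access to years.
    data["acc_date"] = pd.to_datetime(data["acc_date"])
    data["year"] = data["acc_date"].dt.year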

  13. Global Biodiversity at Risk: Top 25 Countries by Threatened Species (2004–2023)

    • dataverse.harvard.edu
    Updated May 7, 2025
    Cite
    Ramkumar Yaragarla (2025). Global Biodiversity at Risk: Top 25 Countries by Threatened Species (2004–2023) [Dataset]. http://doi.org/10.7910/DVN/ZX4WLC
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 7, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Ramkumar Yaragarla
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset presents annual counts of threatened species from 2004 to 2023 across the top 25 countries most affected by biodiversity loss. Access the full dataset at I Hug Trees – Data Analytics. Data is categorized into three groups (Vertebrates, Invertebrates, and Plants) and compiled from UNdata and IUCN Red List sources. It includes country names, years, species counts, and biodiversity groupings, and is intended for use in research, education, and public awareness around global conservation priorities.

    Processing Summary:
    Data Cleaning: Removed duplicates, corrected inconsistencies, and excluded continent-level entries.
    Categorization: Grouped species data into three broad classes.
    Ranking: Selected top 25 countries per category, per year, for visual clarity.
    Output: Generated three structured CSV files (one per group) using Python's Pandas library.

    Data Structure:
    File Format: CSV (.csv)
    Columns: Country – Country name; Year – From 2004 to 2023; Value – Number of threatened species; Group – Vertebrates, Invertebrates, or Plants

    Scope and Limitations:
    Focus: National-level trends; no sub-national or habitat-specific granularity
    Last Updated: April 2025
    Future Enhancements: May include finer taxonomic resolution and updated threat metrics

    Intended Uses:
    Educational tools and biodiversity curriculum support
    Conservation awareness campaigns and visual storytelling
    Policy dashboards and SDG 15 (Life on Land) tracking
    Research into biodiversity trends and hotspot identification

    Tools & Technologies:
    Python (Pandas): Data filtering, aggregation, and file creation
    Chart.js: Rendering of interactive bar charts
    HTML iFrames: Seamless embedding on the I Hug Trees website

  14. FAIR Dataset for Disease Prediction in Healthcare Applications

    • test.researchdata.tuwien.ac.at
    bin, csv, json, png
    Updated Apr 14, 2025
    Cite
    Sufyan Yousaf (2025). FAIR Dataset for Disease Prediction in Healthcare Applications [Dataset]. http://doi.org/10.70124/5n77a-dnf02
    Explore at:
    Available download formats: csv, json, bin, png
    Dataset updated
    Apr 14, 2025
    Dataset provided by
    TU Wien
    Authors
    Sufyan Yousaf
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    Context and Methodology

    • Research Domain/Project:
      This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validating, and testing.

    • Purpose of the Dataset:
      The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.

    • Dataset Creation:
      Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).
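    A minimal sketch of the cleaning-and-splitting step described above, assuming scikit-learn and hypothetical file and column names (the published splits are already provided; this only illustrates the technique):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("patients.csv").dropna()  # hypothetical source; or impute instead

    X, y = df.drop(columns=["label"]), df["label"]

    # Assumed 70/15/15 split into training, validation, and test sets.
    X_train, X_tmp, y_train, y_tmp = train_test_split(
        X, y, test_size=0.30, random_state=42, stratify=y)
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=0.50, random_state=42, stratify=y_tmp)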

    Technical Details

    • Structure of the Dataset:
      The dataset consists of several files organized into folders by data type:

      • Training Data: Contains the training dataset used to train the machine learning model.

      • Validation Data: Used for hyperparameter tuning and model selection.

      • Test Data: Reserved for final model evaluation.

      Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.

    • Software Requirements:
      To open and work with this dataset, you need an environment such as VS Code or Jupyter Notebook, with tools like:

      • Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)

    Further Details

    • Reusability:
      Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.

    • Limitations:
      The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.

  15. The S&M-HSTPM2d5 dataset: High Spatial-Temporal Resolution PM 2.5 Measures in Multiple Cities Sensed by Static & Mobile Devices

    • data.niaid.nih.gov
    Updated Sep 25, 2020
    Cite
    Eng, Kent X. (2020). The S&M-HSTPM2d5 dataset: High Spatial-Temporal Resolution PM 2.5 Measures in Multiple Cities Sensed by Static & Mobile Devices [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4028129
    Explore at:
    Dataset updated
    Sep 25, 2020
    Dataset provided by
    Chen, Xinlei
    Eng, Kent X.
    Noh, Hae Young
    Zhang, Lin
    Liu, Jingxiao
    Liu, Xinyu
    Zhang, Pei
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This S&M-HSTPM2d5 dataset contains high spatial- and temporal-resolution particulate (PM2.5) measurements, with corresponding timestamps and GPS locations of mobile and static devices, in three Chinese cities: Foshan, Cangzhou, and Tianjin. Different numbers of static and mobile devices were set up in each city. The sampling interval was one minute in Cangzhou and three seconds in Foshan and Tianjin. For specific details of the setup, please refer to the Device_Setup_Description.txt file in this repository and the data descriptor paper.

    After the data collection process, a data cleaning process was performed to remove and adjust abnormal and drifting data. The script of the data cleaning algorithm is provided in this repository. The data cleaning algorithm only adjusts or removes individual data points; removal of an entire device's data was done after the data cleaning algorithm, using empirical judgment and graphic visualization. For specific details of the data cleaning process, please refer to the script (Data_cleaning_algorithm.ipynb) in this repository and the data descriptor paper.

    The dataset in this repository is the processed version. The raw dataset and removed devices are not included in this repository.

    The data is stored as CSV files. Each CSV file, named by its device ID, contains the data collected by the corresponding device. Each CSV file has three types of data: a timestamp in China Standard Time (GMT+8), a geographic location as latitude and longitude, and the PM2.5 concentration in micrograms per cubic meter. The CSV files are stored in either the Static or the Mobile folder, according to the device type, and the Static and Mobile folders are stored in the corresponding city's folder.

    To access the dataset, any programming language that can read CSV files is appropriate. Users can also open the CSV files directly. The get_dataset.ipynb file in this repository also provides an option for accessing the dataset. To execute the ipynb files, Jupyter Notebook with Python 3 is required. The following Python libraries are also required:

    get_dataset.ipynb: 1. os library 2. pandas library

    Data_cleaning_algorithm.ipynb: 1. os library 2. pandas library 3. datetime library 4. math library

    Instructions for installing the libraries above can be found online. After installing Jupyter Notebook with Python 3 and the required libraries, users can open the ipynb files with Jupyter Notebook and follow the instructions inside.
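    As an alternative to get_dataset.ipynb, a minimal sketch of walking the folder layout described above (the city name is one of the three provided; everything else is an assumption):

    import os
    import pandas as pd

    # Layout described above: <city>/<Static|Mobile>/<device_id>.csv
    frames = []
    for device_type in ("Static", "Mobile"):
        folder = os.path.join("Foshan", device_type)
        for name in os.listdir(folder):
            df = pd.read_csv(os.path.join(folder, name))
            df["device_id"] = os.path.splitext(name)[0]
            df["device_type"] = device_type
            frames.append(df)

    foshan = pd.concat(frames, ignore_index=True)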

    For questions or suggestions, please e-mail Xinlei Chen.

  16. Analysis of references in the IPCC AR6 WG2 Report of 2022

    • data.niaid.nih.gov
    Updated Mar 11, 2022
    + more versions
    Cite
    Cameron Neylon (2022). Analysis of references in the IPCC AR6 WG2 Report of 2022 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6327206
    Explore at:
    Dataset updated
    Mar 11, 2022
    Dataset provided by
    Bianca Kramer
    Cameron Neylon
    License

    Public Domain, https://creativecommons.org/licenses/publicdomain/

    Description

    This repository contains data on 17,419 DOIs cited in the IPCC Working Group 2 contribution to the Sixth Assessment Report, and the code to link them to the dataset built at the Curtin Open Knowledge Initiative (COKI).

    References were extracted from the report's PDFs (downloaded 2022-03-01) via Scholarcy and exported as RIS and BibTeX files. DOI strings were identified in the RIS files by pattern matching and saved as a CSV file. The list of DOIs for each chapter and cross-chapter paper was processed using a custom Python script to generate a pandas DataFrame, which was saved as a CSV file and uploaded to Google BigQuery.
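    A Python sketch of that DOI pattern-matching step (the repository's own step is implemented in preprocessing.R; the regular expression and file name here are assumptions):

    import re
    import pandas as pd

    # Crossref's recommended pattern covers the vast majority of modern DOIs.
    DOI_RE = re.compile(r"10\.\d{4,9}/[-._;()/:a-zA-Z0-9]+")

    dois = set()
    with open("chapter02.ris", encoding="utf-8") as handle:  # hypothetical RIS export
        for line in handle:
            dois.update(m.group(0).rstrip(".,;") for m in DOI_RE.finditer(line))

    pd.DataFrame(sorted(dois), columns=["doi"]).to_csv("dois.csv", index=False)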

    We used the main object table of the Academic Observatory, which combines information from Crossref, Unpaywall, Microsoft Academic, Open Citations, the Research Organization Registry, and Geonames, to enrich the DOIs with bibliographic information, affiliations, and open access status. A custom query was used to join and format the data, and the resulting table was visualised in a Google Data Studio dashboard.

    This version of the repository also includes the set of DOIs from references in the IPCC Working Group 1 contribution to the Sixth Assessment Report as extracted by Alexis-Michel Mugabushaka and shared on Zenodo: https://doi.org/10.5281/zenodo.5475442 (CC-BY)

    A brief descriptive analysis was provided as a blogpost on the COKI website.

    The repository contains the following content:

    Data:

    data/scholarcy/RIS/ - extracted references as RIS files

    data/scholarcy/BibTeX/ - extracted references as BibTeX files

    IPCC_AR6_WGII_dois.csv - list of DOIs

    data/10.5281_zenodo.5475442/ - references from IPCC AR6 WG1 report

    Processing:

    preprocessing.R - preprocessing steps for identifying and cleaning DOIs

    process.py - Python script for transforming data and linking to COKI data through Google BigQuery

    Outcomes:

    Dataset on BigQuery - requires a google account for access and bigquery account for querying

    Data Studio Dashboard - interactive analysis of the generated data

    Zotero library of references extracted via Scholarcy

    PDF version of blogpost

    Note on licenses: Data are made available under CC0 (with the exception of WG1 reference data, which have been shared under CC-BY 4.0). Code is made available under Apache License 2.0.

  17. Surveys of Data Professionals (Alex the Analyst)

    • kaggle.com
    Updated Nov 27, 2023
    Cite
    Stewie (2023). Surveys of Data Professionals (Alex the Analyst) [Dataset]. https://www.kaggle.com/datasets/alexenderjunior/surveys-of-data-professionals-alex-the-analyst
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 27, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Stewie
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    [Dataset Name] - About This Dataset

    Overview

    This dataset is used in a data cleaning project based on the raw data from Alex the Analyst's Power BI tutorial series. The original dataset can be found here.

    Context

    The dataset is employed in a mini project that involves cleaning and preparing data for analysis. It is part of a series of exercises aimed at enhancing skills in data cleaning using Pandas.

    Content

    The dataset contains information related to [provide a brief description of the data, e.g., sales, customer information, etc.]. The columns cover various aspects such as [list key columns and their meanings].

    Acknowledgements

    The original dataset is sourced from Alex the Analyst's Power BI tutorial series. Special thanks to [provide credit or acknowledgment] for making the dataset available.

    Citation

    If you use this dataset in your work, please cite it as follows:

    How to Use

    1. Download the dataset from this link.
    2. Explore the Jupyter Notebook in the associated repository for insights into the data cleaning process.

    Feel free to reach out for any additional information or clarification. Happy analyzing!

  18. Diwali_Sales_Dataset

    • kaggle.com
    Updated Aug 30, 2024
    Cite
    BharathiD8 (2024). Diwali_Sales_Dataset [Dataset]. https://www.kaggle.com/datasets/bharathid8/diwali-sales-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 30, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    BharathiD8
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Project Overview

    Objective: Analyze Diwali sales data to uncover trends, customer behavior, and sales performance during the festive season.
    Tools Used: Python, Pandas, NumPy, Matplotlib, Seaborn

    Data Collection and Preparation

    Dataset: A dataset containing sales data for Diwali, including details like product categories, customer demographics, sales amounts, discounts, etc.

    • Data Cleaning: Handle missing values, remove duplicates, and correct any inconsistencies in the data.

    • Feature Engineering: Create new features if necessary, such as total sales per customer, average discount per sale, etc.

    Exploratory Data Analysis (EDA)

    Descriptive Statistics: Calculate basic statistics (mean, median, mode) to get a sense of the data distribution.

    Visualizations:
    Sales Trends: Plot sales over time to see how they varied during the Diwali season.
    Top-Selling Products: Identify the products or categories with the highest sales.
    Customer Demographics: Analyze sales by age, gender, and location to understand customer behavior.
    Discount Impact: Evaluate how different discount levels affected sales volume.
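    A minimal pandas sketch of the EDA steps above (the file name and column names are assumptions):

    import matplotlib.pyplot as plt
    import pandas as pd

    df = pd.read_csv("diwali_sales.csv")  # hypothetical file and columns

    df = df.drop_duplicates().dropna(subset=["Amount"])  # basic cleaning

    print(df["Amount"].describe())  # descriptive statistics

    # Top-selling product categories.
    top = df.groupby("Product_Category")["Amount"].sum().nlargest(10)
    top.plot(kind="bar", title="Top categories by sales")
    plt.tight_layout()
    plt.show()

    # Sales by customer demographics.
    print(df.groupby(["Gender", "Age Group"])["Amount"].sum())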

    Key Findings

    Customer Behavior: Insights on which customer segments contributed the most to sales.
    Sales Performance: Which products or categories had the highest sales, and during which days of Diwali sales peaked.
    Discount Effectiveness: The impact of discounts on sales and whether higher discounts led to significantly higher sales or not.

    Conclusion

    Summarize the key insights derived from the EDA. Discuss any patterns or trends that were unexpected or particularly interesting. Provide recommendations for future sales strategies based on the findings.

  19. A dataset of 5 million city trees from 63 US cities: species, location, nativity status, health, and more

    • data.niaid.nih.gov
    • datadryad.org
    • +1more
    zip
    Updated Aug 31, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dakota McCoy; Benjamin Goulet-Scott; Weilin Meng; Bulent Atahan; Hana Kiros; Misako Nishino; John Kartesz (2022). A dataset of 5 million city trees from 63 US cities: species, location, nativity status, health, and more. [Dataset]. http://doi.org/10.5061/dryad.2jm63xsrf
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 31, 2022
    Dataset provided by
    Cornell University
    The Biota of North America Program (BONAP)
    Harvard University
    Worcester Polytechnic Institute
    Stanford University
    Authors
    Dakota McCoy; Benjamin Goulet-Scott; Weilin Meng; Bulent Atahan; Hana Kiros; Misako Nishino; John Kartesz
    License

    CC0 1.0 Universal, https://spdx.org/licenses/CC0-1.0.html

    Area covered
    United States
    Description

    Sustainable cities depend on urban forests. City trees, a pillar of urban forests, improve our health, clean the air, store CO2, and cool local temperatures. Comparatively less is known about urban forests as ecosystems, particularly their spatial composition, nativity statuses, biodiversity, and tree health. Here, we assembled and standardized a new dataset of N=5,660,237 trees from 63 of the largest US cities. The data comes from tree inventories conducted at the level of cities and/or neighborhoods. Each data sheet includes detailed information on tree location, species, nativity status (whether a tree species is naturally occurring or introduced), health, size, whether it is in a park or urban area, and more (comprising 28 standardized columns per datasheet). This dataset could be analyzed in combination with citizen-science datasets on bird, insect, or plant biodiversity; social and demographic data; or data on the physical environment. Urban forests offer a rare opportunity to intentionally design biodiverse, heterogeneous, rich ecosystems.

    Methods

    See the eLife manuscript for full details. Below, we provide a summary of how the dataset was collected and processed.

    Data Acquisition

    We limited our search to the 150 largest cities in the USA (by census population). To acquire raw data on street tree communities, we used a search protocol on both Google and Google Datasets Search (https://datasetsearch.research.google.com/). We first searched the city name plus each of the following: street trees, city trees, tree inventory, urban forest, and urban canopy (all combinations totaled 20 searches per city, 10 each in Google and Google Datasets Search). We then read the first page of Google results and the top 20 results from Google Datasets Search. If the same named city in the wrong state appeared in the results, we redid the 20 searches adding the state name. If no data were found, we contacted a relevant state official via email or phone with an inquiry about their street tree inventory. Datasheets were received and transformed to .csv format (if they were not already in that format). We received data on street trees from 64 cities. One city, El Paso, had data only in summary format and was therefore excluded from analyses.

    Data Cleaning
    All code used is in the zipped folder Data S5 in the eLife publication. Before cleaning the data, we ensured that all reported trees for each city were located within the greater metropolitan area of the city (for certain inventories, many suburbs were reported, some within the greater metropolitan area, others not).

    First, we renamed all columns in the received .csv sheets, referring to the metadata and according to our standardized definitions (Table S4). To harmonize tree health and condition data across different cities, we inspected metadata from the tree inventories and converted all numeric scores to a descriptive scale including "excellent", "good", "fair", "poor", "dead", and "dead/dying". Some cities included only three points on this scale (e.g., "good", "poor", "dead/dying") while others included five (e.g., "excellent", "good", "fair", "poor", "dead").

    Second, we used pandas in Python (W. McKinney & Others, 2011) to correct typos, non-ASCII characters, variable spellings, date formats, units (we converted all units to metric), address issues, and common name formats. In some cases, units were not specified for tree diameter at breast height (DBH) and tree height; we determined the units based on typical sizes for trees of a particular species. Wherever diameter was reported, we assumed it was DBH. We standardized health and condition data across cities, preserving the highest granularity available for each city. For our analysis, we converted this variable to a binary (see section Condition and Health). We created a column called "location_type" to label whether a given tree was growing in the built environment or in green space. All of the changes we made, and decision points, are preserved in Data S9.

    Third, we checked the scientific names reported using gnr_resolve in the R library taxize (Chamberlain & Szöcs, 2013), with the option Best_match_only set to TRUE (Data S9). Through an iterative process, we manually checked the results and corrected typos in the scientific names until all names were either a perfect match (n=1771 species) or a partial match with a threshold greater than 0.75 (n=453 species). BGS manually reviewed all partial matches to ensure that they were the correct species name, and then we programmatically corrected these partial matches (for example, Magnolia grandifolia, which is not a species name of a known tree, was corrected to Magnolia grandiflora, and Pheonix canariensus was corrected to its proper spelling of Phoenix canariensis). Because many of these tree inventories were crowd-sourced or generated in part through citizen science, such typos and misspellings are to be expected.

    Fourth, some tree inventories reported species by common names only, so we converted common names to scientific names. We generated a lookup table by summarizing all pairings of common and scientific names in the inventories for which both were reported. We manually reviewed the common-to-scientific-name pairings, confirming that all were correct, and then programmatically assigned scientific names to all common names (Data S9).

    Fifth, we assigned native status to each tree through reference to the Biota of North America Project (Kartesz, 2018), which has collected data on all native and non-native species occurrences throughout the US states. Specifically, we determined whether each tree species in a given city was native to that state, not native to that state, or whether we did not have enough information to determine nativity (for cases where only the genus was known).

    Sixth, some cities reported only the street address but not latitude and longitude. For these cities, we used the OpenCageGeocoder (https://opencagedata.com/) to convert addresses to latitude and longitude coordinates (Data S9). OpenCageGeocoder leverages open data and is used by many academic institutions (see https://opencagedata.com/solutions/academia).

    Seventh, we trimmed each city dataset to include only the standardized columns we identified in Table S4. After each stage of data cleaning, we performed manual spot checking to identify any issues.
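    To illustrate the flavor of these cleaning steps, here is a minimal pandas sketch of condition standardization and unit conversion. The column names, the score mapping, and the sample values are illustrative assumptions, not the authors' actual code (which is in Data S5):

        import pandas as pd

        # Hypothetical raw inventory rows; real column names vary by city (Table S4).
        raw = pd.DataFrame({
            "Condition": [1, 2, 3, 2, 1],           # a city-specific numeric score
            "DBH_in": [10.0, 24.5, 3.2, 15.0, 8.7], # diameter at breast height, inches
        })

        # Map the city's numeric condition scores onto the shared descriptive scale;
        # the mapping below is an assumption -- each city's metadata defines its own.
        condition_scale = {1: "good", 2: "poor", 3: "dead/dying"}
        raw["condition_std"] = raw["Condition"].map(condition_scale)

        # Convert imperial units to metric (inches -> centimeters).
        raw["dbh_cm"] = raw["DBH_in"] * 2.54

        print(raw[["condition_std", "dbh_cm"]])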

  20. Household Energy Consumption

    • kaggle.com
    Updated Apr 5, 2025
    Cite
    Samx_sam (2025). Household Energy Consumption [Dataset]. https://www.kaggle.com/datasets/samxsam/household-energy-consumption
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 5, 2025
    Dataset provided by
    Kaggle
    Authors
    Samx_sam
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🏡 Household Energy Consumption - April 2025 (90,000 Records)

    📌 Overview

    This dataset presents detailed daily energy consumption records from households over the month of April 2025. With 90,000 rows and features such as temperature, household size, air conditioning usage, and peak-hour consumption, it is well suited to time-series analysis, machine learning, and sustainability research.

    Column Name | Data Type Category | Description
    Household_ID | Categorical (Nominal) | Unique identifier for each household
    Date | Datetime | The date of the energy usage record
    Energy_Consumption_kWh | Numerical (Continuous) | Total energy consumed by the household in kWh
    Household_Size | Numerical (Discrete) | Number of individuals living in the household
    Avg_Temperature_C | Numerical (Continuous) | Average daily temperature in degrees Celsius
    Has_AC | Categorical (Binary) | Indicates if the household has air conditioning (Yes/No)
    Peak_Hours_Usage_kWh | Numerical (Continuous) | Energy consumed during peak hours in kWh

    📂 Dataset Summary

    • Rows: 90,000
    • Time Range: April 1, 2025 – April 30, 2025
    • Data Granularity: Daily per household
    • Location: Simulated global coverage
    • Format: CSV (Comma-Separated Values); a loading sketch follows below
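
    Given the schema above, a minimal pandas loading sketch might look like the following (the filename matches the one referenced in the next section; the path is an assumption):

        import pandas as pd

        # Load the CSV; parsing Date up front makes time-series work painless.
        df = pd.read_csv(
            "household_energy_consumption_2025.csv",
            parse_dates=["Date"],
        )

        # Quick sanity checks against the documented schema.
        print(df.shape)                      # expected: (90000, 7)
        print(df.dtypes)
        print(df["Has_AC"].value_counts())   # expected: Yes/No counts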

    📚 Libraries Used for Working with household_energy_consumption_2025.csv

    🔍 1. Data Manipulation & Analysis

    Library | Purpose
    pandas | Reading, cleaning, and transforming tabular data
    numpy | Numerical operations, working with arrays

    📊 2. Data Visualization

    Library | Purpose
    matplotlib | Creating static plots (line, bar, histograms, etc.)
    seaborn | Statistical visualizations, heatmaps, boxplots, etc.
    plotly | Interactive charts (time series, pie, bar, scatter, etc.)

    📈 3. Machine Learning / Modeling

    Library | Purpose
    scikit-learn | Preprocessing, regression, classification, clustering
    xgboost / lightgbm | Gradient boosting models for better accuracy

    🧹 4. Data Preprocessing

    Library | Purpose
    sklearn.preprocessing | Encoding categorical features, scaling, normalization
    datetime / pandas | Date-time conversion and manipulation

    🧪 5. Model Evaluation

    Library | Purpose
    sklearn.metrics | Accuracy, MAE, RMSE, R² score, confusion matrix, etc.

    ✅ These libraries provide a complete toolkit for performing data analysis, modeling, and visualization tasks efficiently.
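
    As a small end-to-end illustration of that toolkit, the sketch below encodes the binary feature, derives a calendar feature, and fits a regressor with held-out evaluation. The feature choices and model settings are assumptions for demonstration, not a recommended pipeline:

        import pandas as pd
        from sklearn.model_selection import train_test_split
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.metrics import mean_absolute_error, r2_score

        df = pd.read_csv("household_energy_consumption_2025.csv", parse_dates=["Date"])

        # Encode the Yes/No column as 0/1 and add a simple calendar feature.
        df["Has_AC"] = (df["Has_AC"] == "Yes").astype(int)
        df["day_of_week"] = df["Date"].dt.dayofweek

        features = ["Household_Size", "Avg_Temperature_C", "Has_AC", "day_of_week"]
        X_train, X_test, y_train, y_test = train_test_split(
            df[features], df["Energy_Consumption_kWh"],
            test_size=0.2, random_state=42,
        )

        model = RandomForestRegressor(n_estimators=100, random_state=42)
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        print("MAE:", mean_absolute_error(y_test, pred))
        print("R²:", r2_score(y_test, pred))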

    📈 Potential Use Cases

    This dataset is ideal for a wide variety of analytics and machine learning projects:

    🔮 Forecasting & Time Series Analysis

    • Predict future household energy consumption based on previous trends and weather conditions.
    • Identify seasonal and daily consumption patterns.

    💡 Energy Efficiency Analysis

    • Analyze differences in energy consumption between households with and without air conditioning (see the groupby sketch below).
    • Compare energy usage efficiency across varying household sizes.
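
    A possible starting point for the first comparison, assuming the column names from the schema above:

        import pandas as pd

        df = pd.read_csv("household_energy_consumption_2025.csv", parse_dates=["Date"])

        # Mean daily consumption, split by AC ownership and household size.
        summary = (
            df.groupby(["Has_AC", "Household_Size"])["Energy_Consumption_kWh"]
              .mean()
              .unstack("Has_AC")
        )
        print(summary)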

    🌡️ Climate Impact Studies

    • Investigate how temperature affects electricity usage across households.
    • Model the potential impact of climate change on residential energy demand.

    🔌 Peak Load Management

    • Build models to predict and manage energy demand during peak hours (a simple peak-share starting point is sketched below).
    • Support research on smart grid technologies and dynamic pricing.
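
    One simple way in, before any modeling: compute each household's average share of consumption that falls in peak hours, then rank households, e.g. as candidates for dynamic pricing. Column names follow the schema above:

        import pandas as pd

        df = pd.read_csv("household_energy_consumption_2025.csv", parse_dates=["Date"])

        # Fraction of each day's consumption that falls in peak hours.
        df["peak_share"] = df["Peak_Hours_Usage_kWh"] / df["Energy_Consumption_kWh"]

        # Average per household, heaviest peak-hour users first.
        by_household = df.groupby("Household_ID")["peak_share"].mean()
        print(by_household.sort_values(ascending=False).head(10))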

    🧠 Machine Learning Projects

    • Supervised learning (regression/classification) to predict energy consumption.
    • Clustering households by usage patterns for targeted energy programs (see the KMeans sketch below).
    • Anomaly detection in energy usage for fault detection.
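
    A sketch of the clustering idea, using scikit-learn's KMeans on per-household profiles; the number of clusters and the profile features are illustrative assumptions:

        import pandas as pd
        from sklearn.preprocessing import StandardScaler
        from sklearn.cluster import KMeans

        df = pd.read_csv("household_energy_consumption_2025.csv", parse_dates=["Date"])

        # Collapse the daily records into one usage profile per household.
        profile = df.groupby("Household_ID").agg(
            mean_kwh=("Energy_Consumption_kWh", "mean"),
            mean_peak=("Peak_Hours_Usage_kWh", "mean"),
            size=("Household_Size", "first"),
        )

        # Scale features so no single column dominates the distance metric.
        X = StandardScaler().fit_transform(profile)
        profile["cluster"] = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
        print(profile["cluster"].value_counts())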

    🛠️ Example Starter Projects

    • Time-series forecasting using Facebook Prophet or ARIMA (a Prophet sketch follows below)
    • Regression modeling using XGBoost or LightGBM
    • Classification of AC vs. non-AC household behavior
    • Energy-saving recommendation systems
    • Heatmaps of temperature vs. energy usage
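
    For the first starter project, a minimal Prophet sketch on the aggregate daily series might look like this; note that prophet is installed separately (pip install prophet), and with only one month of data the forecast is illustrative at best:

        import pandas as pd
        from prophet import Prophet

        df = pd.read_csv("household_energy_consumption_2025.csv", parse_dates=["Date"])

        # Aggregate to one daily total across all households, in Prophet's ds/y format.
        daily = (
            df.groupby("Date")["Energy_Consumption_kWh"].sum()
              .reset_index()
              .rename(columns={"Date": "ds", "Energy_Consumption_kWh": "y"})
        )

        model = Prophet()
        model.fit(daily)
        future = model.make_future_dataframe(periods=7)  # forecast one week ahead
        forecast = model.predict(future)
        print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail(7))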