Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Reddit is a social news, content rating, and discussion website, and one of the most popular sites on the internet. It has 52 million daily active users and approximately 430 million monthly users. Reddit is organized into subreddits; here we use the r/AskScience subreddit.
The dataset is extracted from the subreddit r/AskScience on Reddit. The data was collected between 01-01-2016 and 20-05-2022. It contains 612,668 data points and 25 columns. The dataset contains information about the questions asked on the subreddit, the description of the submission, the flair of the question, NSFW or SFW status, the year of the submission, and more (see the descriptions of individual columns below). The data was extracted using Python and Pushshift's API, with some light cleaning done using NumPy and pandas.
The dataset contains the following columns and descriptions:
author - Redditor name.
author_fullname - Redditor full name.
contest_mode - Contest mode [implement obscured scores and randomized sorting].
created_utc - Time the submission was created, represented in Unix time.
domain - Domain of the submission.
edited - Whether or not the post has been edited.
full_link - Link of the post on the subreddit.
id - ID of the submission.
is_self - Whether or not the submission is a self post (text-only).
link_flair_css_class - CSS class used to identify the flair.
link_flair_text - Flair on the post or the link flair’s text content.
locked - Whether or not the submission has been locked.
num_comments - The number of comments on the submission.
over_18 - Whether or not the submission has been marked as NSFW.
permalink - A permalink for the submission.
retrieved_on - Time ingested.
score - The number of upvotes for the submission.
description - Description of the submission.
spoiler - Whether or not the submission has been marked as a spoiler.
stickied - Whether or not the submission is stickied.
thumbnail - Thumbnail of the submission.
question - Question asked in the submission.
url - The URL the submission links to, or the permalink if a self post.
year - Year of the submission.
banned - Whether or not the submission was banned by a moderator.
This dataset can be used for flair prediction, NSFW classification, and various text mining/NLP tasks. Exploratory data analysis can also reveal trends and patterns over the years.
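As a starting point for exploratory analysis, the sketch below shows one way to turn the Unix-time created_utc column into timestamps and count submissions per flair and year. The rows are made up for illustration and the derived column names (created, year_from_utc) are assumptions; only created_utc, link_flair_text, over_18, and question come from the column list above.

```python
import pandas as pd

# Tiny made-up sample that reuses column names from the description above.
sample = pd.DataFrame({
    "created_utc": [1451606400, 1483228800, 1483315200],
    "link_flair_text": ["Physics", "Biology", "Physics"],
    "over_18": [False, False, True],
    "question": ["Why is the sky blue?", "How do vaccines work?", "What is entropy?"],
})

# Convert Unix time (seconds since the epoch) to timezone-aware UTC timestamps.
sample["created"] = pd.to_datetime(sample["created_utc"], unit="s", utc=True)

# Derive the year from the timestamp (hypothetical helper column).
sample["year_from_utc"] = sample["created"].dt.year

# Count submissions per year and flair.
flair_counts = sample.groupby(["year_from_utc", "link_flair_text"]).size()
print(flair_counts)
```

The same groupby pattern would apply to the full dataset once it is loaded into a DataFrame.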
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Part of the dissertation Pitch of Voiced Speech in the Short-Time Fourier Transform: Algorithms, Ground Truths, and Evaluation Methods. © 2020, Bastian Bechtold. All rights reserved.
Estimating the fundamental frequency of speech remains an active area of research, with varied applications in speech recognition, speaker identification, and speech compression. A vast number of algorithms for estimating this quantity have been proposed over the years, and a number of speech and noise corpora have been developed for evaluating their performance. The present dataset contains estimated fundamental frequency tracks of 25 algorithms, six speech corpora, two noise corpora, at nine signal-to-noise ratios between -20 and 20 dB SNR, as well as an additional evaluation of synthetic harmonic tone complexes in white noise.
The dataset also contains pre-calculated performance measures both novel and traditional, in reference to each speech corpus’ ground truth, the algorithms’ own clean-speech estimate, and our own consensus truth. It can thus serve as the basis for a comparison study, or to replicate existing studies from a larger dataset, or as a reference for developing new fundamental frequency estimation algorithms. All source code and data is available to download, and entirely reproducible, albeit requiring about one year of processor-time.
Included Code and Data
ground truth data.zip is a JBOF dataset of fundamental frequency estimates and ground truths of all speech files in the following corpora:
CMU-ARCTIC (consensus truth) [1]
FDA (corpus truth and consensus truth) [2]
KEELE (corpus truth and consensus truth) [3]
MOCHA-TIMIT (consensus truth) [4]
PTDB-TUG (corpus truth and consensus truth) [5]
TIMIT (consensus truth) [6]
noisy speech data.zip is a JBOF dataset of fundamental frequency estimates of speech files mixed with noise from the following corpora:
NOISEX [7]
QUT-NOISE [8]
synthetic speech data.zip is a JBOF dataset of fundamental frequency estimates of synthetic harmonic tone complexes in white noise.
noisy_speech.pkl and synthetic_speech.pkl are pickled Pandas dataframes of performance metrics derived from the above data for the following list of fundamental frequency estimation algorithms:
AUTOC [9]
AMDF [10]
BANA [11]
CEP [12]
CREPE [13]
DIO [14]
DNN [15]
KALDI [16]
MAPSMBSC [17]
NLS [18]
PEFAC [19]
PRAAT [20]
RAPT [21]
SACC [22]
SAFE [23]
SHR [24]
SIFT [25]
SRH [26]
STRAIGHT [27]
SWIPE [28]
YAAPT [29]
YIN [30]
noisy speech evaluation.py and synthetic speech evaluation.py are Python programs to calculate the above Pandas dataframes from the above JBOF datasets. They calculate the following performance measures:
Gross Pitch Error (GPE), the percentage of pitches where the estimated pitch deviates from the true pitch by more than 20%.
Fine Pitch Error (FPE), the mean error of grossly correct estimates.
High/Low Octave Pitch Error (OPE), the percentage of pitches that are GPEs and happen to be at an integer multiple of the true pitch.
Gross Remaining Error (GRE), the percentage of pitches that are GPEs but not OPEs.
Fine Remaining Bias (FRB), the median error of GREs.
True Positive Rate (TPR), the percentage of true positive voicing estimates.
False Positive Rate (FPR), the percentage of false positive voicing estimates.
False Negative Rate (FNR), the percentage of false negative voicing estimates.
F₁, the harmonic mean of precision and recall of the voicing decision.
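To make three of the measures above concrete, here is a minimal sketch of GPE, FPE, and F₁ on a toy pitch track. It is not taken from the included evaluation programs. It assumes that unvoiced frames are encoded as 0 Hz, that pitch errors are only evaluated on frames voiced in both tracks, and that FPE is the mean absolute relative error; the function names are hypothetical.

```python
import numpy as np

def pitch_errors(true_f0, est_f0, threshold=0.2):
    """Return (GPE, FPE) in percent, assuming 0 Hz marks an unvoiced frame."""
    true_f0 = np.asarray(true_f0, dtype=float)
    est_f0 = np.asarray(est_f0, dtype=float)
    # Only frames voiced in both tracks have a meaningful pitch error.
    both_voiced = (true_f0 > 0) & (est_f0 > 0)
    rel_err = (est_f0[both_voiced] - true_f0[both_voiced]) / true_f0[both_voiced]
    gross = np.abs(rel_err) > threshold          # deviates by more than 20%
    gpe = 100.0 * gross.mean()
    fpe = 100.0 * np.abs(rel_err[~gross]).mean() # error of grossly correct frames
    return gpe, fpe

def voicing_f1(true_f0, est_f0):
    """Harmonic mean of precision and recall of the voicing decision."""
    true_voiced = np.asarray(true_f0) > 0
    est_voiced = np.asarray(est_f0) > 0
    tp = np.sum(true_voiced & est_voiced)
    fp = np.sum(~true_voiced & est_voiced)
    fn = np.sum(true_voiced & ~est_voiced)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example: five frames, one octave-like gross error, two voicing errors.
truth = [100, 100, 200, 0, 150]
estimate = [102, 210, 198, 120, 0]
gpe, fpe = pitch_errors(truth, estimate)
f1 = voicing_f1(truth, estimate)
print(f"GPE={gpe:.1f}%  FPE={fpe:.2f}%  F1={f1:.2f}")
```

The sketch does not guard against empty inputs or tracks without voiced frames, which a real evaluation would need to handle.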
Pipfile is a pipenv-compatible pipfile for installing all prerequisites necessary for running the above Python programs.
The Python programs take about an hour to compute on a fast 2019 computer, and require at least 32 GB of memory.
References:
[1] John Kominek and Alan W Black. CMU ARCTIC database for speech synthesis, 2003.
[2] Paul C Bagshaw, Steven Hiller, and Mervyn A Jack. Enhanced Pitch Tracking and the Processing of F0 Contours for Computer Aided Intonation Teaching. In EUROSPEECH, 1993.
[3] F Plante, Georg F Meyer, and William A Ainsworth. A Pitch Extraction Reference Database. In Fourth European Conference on Speech Communication and Technology, pages 837–840, Madrid, Spain, 1995.
[4] Alan Wrench. MOCHA MultiCHannel Articulatory database: English, November 1999.
[5] Gregor Pirker, Michael Wohlmayr, Stefan Petrik, and Franz Pernkopf. A Pitch Tracking Corpus with Evaluation on Multipitch Tracking Scenario. page 4, 2011.
[6] John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, David S. Pallett, Nancy L. Dahlgren, and Victor Zue. TIMIT Acoustic-Phonetic Continuous Speech Corpus, 1993.
[7] Andrew Varga and Herman J.M. Steeneken. Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication, 12(3):247–251, July 1993.
[8] David B. Dean, Sridha Sridharan, Robert J. Vogt, and Michael W. Mason. The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms. Proceedings of Interspeech 2010, 2010.
[9] Man Mohan Sondhi. New methods of pitch extraction. Audio and Electroacoustics, IEEE Transactions on, 16(2):262–266, 1968.
[10] Myron J. Ross, Harry L. Shaffer, Asaf Cohen, Richard Freudberg, and Harold J. Manley. Average magnitude difference function pitch extractor. Acoustics, Speech and Signal Processing, IEEE Transactions on, 22(5):353–362, 1974.
[11] Na Yang, He Ba, Weiyang Cai, Ilker Demirkol, and Wendi Heinzelman. BaNa: A Noise Resilient Fundamental Frequency Detection Algorithm for Speech and Music. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12):1833–1848, December 2014.
[12] Michael Noll. Cepstrum Pitch Determination. The Journal of the Acoustical Society of America, 41(2):293–309, 1967.
[13] Jong Wook Kim, Justin Salamon, Peter Li, and Juan Pablo Bello. CREPE: A Convolutional Representation for Pitch Estimation. arXiv:1802.06182 [cs, eess, stat], February 2018. arXiv: 1802.06182.
[14] Masanori Morise, Fumiya Yokomori, and Kenji Ozawa. WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications. IEICE Transactions on Information and Systems, E99.D(7):1877–1884, 2016.
[15] Kun Han and DeLiang Wang. Neural Network Based Pitch Tracking in Very Noisy Speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12):2158–2168, December 2014.
[16] Pegah Ghahremani, Bagher BabaAli, Daniel Povey, Korbinian Riedhammer, Jan Trmal, and Sanjeev Khudanpur. A pitch extraction algorithm tuned for automatic speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 2494–2498. IEEE, 2014.
[17] Lee Ngee Tan and Abeer Alwan. Multi-band summary correlogram-based pitch detection for noisy speech. Speech Communication, 55(7-8):841–856, September 2013.
[18] Jesper Kjær Nielsen, Tobias Lindstrøm Jensen, Jesper Rindom Jensen, Mads Græsbøll Christensen, and Søren Holdt Jensen. Fast fundamental frequency estimation: Making a statistically efficient estimator computationally efficient. Signal Processing, 135:188–197, June 2017.
[19] Sira Gonzalez and Mike Brookes. PEFAC - A Pitch Estimation Algorithm Robust to High Levels of Noise. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(2):518–530, February 2014.
[20] Paul Boersma. Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. In Proceedings of the institute of phonetic sciences, volume 17, page 97–110. Amsterdam, 1993.
[21] David Talkin. A robust algorithm for pitch tracking (RAPT). Speech coding and synthesis, 495:518, 1995.
[22] Byung Suk Lee and Daniel PW Ellis. Noise robust pitch tracking by subband autocorrelation classification. In Interspeech, pages 707–710, 2012.
[23] Wei Chu and Abeer Alwan. SAFE: a statistical algorithm for F0 estimation for both clean and noisy speech. In INTERSPEECH, pages 2590–2593, 2010.
[24] Xuejing Sun. Pitch determination and voice quality analysis using subharmonic-to-harmonic ratio. In Acoustics, Speech, and Signal Processing (ICASSP), 2002 IEEE International Conference on, volume 1, page I–333. IEEE, 2002.
[25] Markel. The SIFT algorithm for fundamental frequency estimation. IEEE Transactions on Audio and Electroacoustics, 20(5):367–377, December 1972.
[26] Thomas Drugman and Abeer Alwan. Joint Robust Voicing Detection and Pitch Estimation Based on Residual Harmonics. In Interspeech, page 1973–1976, 2011.
[27] Hideki Kawahara, Masanori Morise, Toru Takahashi, Ryuichi Nisimura, Toshio Irino, and Hideki Banno. TANDEM-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 3933–3936. IEEE, 2008.
[28] Arturo Camacho. SWIPE: A sawtooth waveform inspired pitch estimator for speech and music. PhD thesis, University of Florida, 2007.
[29] Kavita Kasi and Stephen A. Zahorian. Yet Another Algorithm for Pitch Tracking. In IEEE International Conference on Acoustics Speech and Signal Processing, pages I–361–I–364, Orlando, FL, USA, May 2002. IEEE.
[30] Alain de Cheveigné and Hideki Kawahara. YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4):1917, 2002.
The US Census Bureau conducts the American Community Survey (ACS) 1-year and 5-year surveys, which record various demographics and are publicly accessible through APIs. I called the APIs from Python using the requests library, then cleaned and organized the data into a usable format.
ACS Subject data [2011-2019] was accessed using Python by following the below API Link:
https://api.census.gov/data/2011/acs/acs1?get=group(B08301)&for=county:*
The data was obtained in JSON format by calling the above API, then imported as a Python pandas DataFrame. The 84 variables returned comprise 21 Estimate values for various metrics, 21 corresponding Margin of Error values, and the respective Annotation values for the Estimate and Margin of Error values. The data then went through several cleaning steps in Python, in which excess variables were removed and the columns were renamed. Web scraping was used to extract the variable names and replace the codes in the column names of the raw data.
The above step was carried out for multiple ACS/ACS-1 datasets spanning 2011-2019, and the results were then merged into a single Python pandas DataFrame. The columns were rearranged, and the "NAME" column was split into two columns, namely 'StateName' and 'CountyName'. The counties for which no data was available were also removed from the DataFrame. Once the DataFrame was ready, it was separated into two new DataFrames, one for state data and one for county data, and exported in '.csv' format.
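The steps above can be sketched roughly as follows. The Census API returns JSON as a list of rows whose first row holds the column names, so the sketch uses a small hand-written payload in that shape instead of a live request. The numbers are made up, the two variable codes stand in for the full B08301 group, and the readable column names are hypothetical; the original notebook may differ in all of these details.

```python
import pandas as pd

# In the real workflow the payload would come from the API, e.g.:
#   payload = requests.get(url, timeout=30).json()
# Here a shortened, made-up payload with the same list-of-rows shape is used.
payload = [
    ["NAME", "B08301_001E", "B08301_001M", "state", "county"],
    ["Autauga County, Alabama", "24000", "1500", "01", "001"],
    ["Baldwin County, Alabama", "85000", "3200", "01", "003"],
]

# The first row is the header, the remaining rows are the data.
df = pd.DataFrame(payload[1:], columns=payload[0])

# Split "County, State" into the two columns described above.
df[["CountyName", "StateName"]] = df["NAME"].str.split(", ", n=1, expand=True)

# Values arrive as strings, so convert estimates and margins to numbers.
value_cols = ["B08301_001E", "B08301_001M"]
df[value_cols] = df[value_cols].apply(pd.to_numeric, errors="coerce")

# Replace variable codes with readable names (hypothetical mapping).
readable = {"B08301_001E": "Total_Estimate", "B08301_001M": "Total_MarginOfError"}
df = df.rename(columns=readable).drop(columns=["NAME"])
print(df)
```

Repeating this per year and concatenating the yearly frames would give the merged 2011-2019 table.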
More information about the source of Data can be found at the URL below:
US Census Bureau. (n.d.). About: Census Bureau API. Retrieved from Census.gov
https://www.census.gov/data/developers/about.html
I hope this data helps you create something beautiful and awesome. I will be posting more datasets shortly if I get time away from assignments, submissions, and semester projects. Good luck.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
As computing power grows, so does the need for data processing, which uses a lot of energy in steps like cleaning and analyzing data. This study looks at the energy and time efficiency of four common Python libraries (Pandas, Vaex, Scikit-learn, and NumPy) tested on five datasets across 21 tasks. We compared the energy use of the newest and older versions of each library. Our findings show that no single library always saves the most energy. Instead, energy use varies by task type, how often tasks are done, and the library version. In some cases, newer versions use less energy, pointing to the need for more research on making data processing more energy-efficient.
A zip file accompanying this study contains the scripts, datasets, and a README file for guidance. This setup allows for easy replication and testing of the experiments described, helping to further analyze energy efficiency across different libraries and tasks.
The klib library lets us quickly visualize missing data, perform data cleaning, and plot data distributions, correlations, and categorical column values. klib is a Python library for importing, cleaning, analyzing and preprocessing data. Explanations of key functionalities can be found on Medium / TowardsDataScience, in the examples section, or on YouTube (Data Professor).
Original Github repo
# Install klib first, e.g. "pip install klib" (or "!pip install klib" in a notebook).
import klib
import pandas as pd

# "data" is a placeholder for your own dataset (e.g. a dict of columns or a list of records).
df = pd.DataFrame(data)

# klib.describe functions for visualizing datasets
klib.cat_plot(df)  # returns a visualization of the number and frequency of categorical features
klib.corr_mat(df)  # returns a color-encoded correlation matrix
klib.corr_plot(df)  # returns a color-encoded heatmap, ideal for correlations
klib.dist_plot(df)  # returns a distribution plot for every numeric feature
klib.missingval_plot(df)  # returns a figure containing information about missing values
Take a look at this starter notebook.
Further examples, as well as applications of the functions can be found here.
Pull requests and ideas, especially for further functions are welcome. For major changes or feedback, please open an issue first to discuss what you would like to change. Take a look at this Github repo.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides annual records of threatened species from 2004 to 2023, focusing on the 25 countries most impacted by biodiversity loss. The data is organized into three categories (Vertebrates, Invertebrates, and Plants) and sourced from UNdata and the IUCN Red List. Each entry includes the country name, year, species count, and biodiversity group. It is designed to support research, education, and public engagement on global conservation priorities.
Source and Collection Timeline
Original Data Range: 2004–2023
Cleaned and Extracted: November 2024
Primary Sources: UNdata, IUCN Red List (via UN Statistics Division)
Data Processing Summary
Data Cleaning: Removed incomplete entries and excluded non-country-level data (e.g., continents or regions).
Grouping: Categorized into Vertebrates, Invertebrates, and Plants.
Top 25 Filter: Selected the top 25 countries per year and per category to improve visual clarity.
File Generation: Created three structured CSVs using Python (Pandas).
Data Format
File Type: CSV (.csv)
Columns Include:
Country – Name of the country
Year – Range from 2004 to 2023
Value – Number of threatened species
Group – Vertebrates, Invertebrates, or Plants
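The cleaning and top-25 filtering described above could look roughly like the sketch below. It uses made-up rows with the documented columns, a hypothetical set of non-country labels, and a top-2 cutoff so the effect is visible on a tiny table; the actual processing script is not part of this description.

```python
import pandas as pd

# Made-up rows using the documented columns (Country, Year, Value, Group).
records = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "World", "A", "B", "D"],
    "Year": [2004, 2004, 2004, 2004, 2004, 2005, 2005, 2005],
    "Value": [120, 300, None, 50, 9000, 130, 310, 60],
    "Group": ["Plants"] * 8,
})

NON_COUNTRIES = {"World"}  # hypothetical list of regional aggregates to exclude
TOP_N = 2                  # the dataset itself keeps the top 25

# Data cleaning: drop incomplete entries and non-country-level rows.
cleaned = records.dropna(subset=["Value"])
cleaned = cleaned[~cleaned["Country"].isin(NON_COUNTRIES)]

# Top-N filter: keep the N largest values per year and per group.
top = (
    cleaned.sort_values("Value", ascending=False)
    .groupby(["Year", "Group"], sort=False)
    .head(TOP_N)
    .sort_values(["Year", "Group", "Value"], ascending=[True, True, False])
    .reset_index(drop=True)
)
print(top)
```

Writing one CSV per group would then be a matter of filtering top by Group and calling to_csv on each part.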
Description: Dive into the world of exceptional cinema with our meticulously curated dataset, "IMDb's Gems Unveiled." This dataset is a result of an extensive data collection effort based on two critical criteria: IMDb ratings exceeding 7 and a substantial number of votes, surpassing 10,000. The outcome? A treasure trove of 4070 movies meticulously selected from IMDb's vast repository.
What sets this dataset apart is its richness and diversity. With more than 20 data points meticulously gathered for each movie, this collection offers a comprehensive insight into each cinematic masterpiece. Our data collection process leveraged the power of Selenium and Pandas modules, ensuring accuracy and reliability.
Cleaning this vast dataset was a meticulous task, combining both Excel and Python for optimum precision. Analysis is powered by Pandas, Matplotlib, and NLTK, enabling us to uncover hidden patterns, trends, and themes within the realm of cinema.
Note: The data is collected as of April 2023. Future versions of this analysis will include a movie recommendation system. Please do connect for any queries. All Love, No Hate.
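The two selection criteria stated in the description (rating above 7 and more than 10,000 votes) translate into a simple boolean filter. The sketch below uses made-up titles and hypothetical column names (title, rating, votes), since the dataset's actual column names are not listed here.

```python
import pandas as pd

# Made-up movies; column names are assumptions for illustration.
movies = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D"],
    "rating": [8.1, 7.0, 7.4, 9.0],
    "votes": [250_000, 50_000, 10_000, 12_500],
})

# Both criteria are strict: a rating of exactly 7 or exactly 10,000 votes is excluded.
selected = movies[(movies["rating"] > 7) & (movies["votes"] > 10_000)]
print(selected["title"].tolist())
```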
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This project focuses on data mapping, integration, and analysis to support the development and enhancement of six UNCDF operational applications: OrgTraveler, Comms Central, Internal Support Hub, Partnership 360, SmartHR, and TimeTrack. These apps streamline workflows for travel claims, internal support, partnership management, and time tracking within UNCDF.
Key Features and Tools:
Data Mapping for Salesforce CRM Migration: Structured and mapped data flows to ensure compatibility and seamless migration to Salesforce CRM.
Python for Data Cleaning and Transformation: Utilized pandas, numpy, and APIs to clean, preprocess, and transform raw datasets into standardized formats.
Power BI Dashboards: Designed interactive dashboards to visualize workflows and monitor performance metrics for decision-making.
Collaboration Across Platforms: Integrated Google Colab for code collaboration and Microsoft Excel for data validation and analysis.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains GPS tracking data and performance metrics for motorcycle taxis (boda bodas) in Nairobi, Kenya, comparing traditional internal combustion engine (ICE) motorcycles with electric motorcycles. The study was conducted in two phases:
Baseline Phase: 118 ICE motorcycles tracked over 14 days (2023-11-13 to 2023-11-26)
Transition Phase: 108 ICE motorcycles (control) and 9 electric motorcycles (treatment) tracked over 12 days (2023-12-10 to 2023-12-21)
The dataset is organised into two main categories:
Trip Data: Individual trip-level records containing timing, distance, duration, location, and speed metrics
Daily Data: Daily aggregated summaries containing usage metrics, economic data, and energy consumption
This dataset enables comparative analysis of electric vs. ICE motorcycle performance, economic modelling of transportation costs, environmental impact assessment, urban mobility pattern analysis, and energy efficiency studies in emerging markets.
Institutions:
EED Advisory
Clean Air Taskforce
Stellenbosch University
Steps to reproduce:
Raw Data Collection
GPS tracking devices installed on motorcycles, collecting location data at 10-second intervals
Rider-reported information on revenue, maintenance costs, and fuel/electricity usage
Processing Steps
GPS data cleaning: Filtered invalid coordinates, removed duplicates, interpolated missing points
Trip identification: Defined by >1 minute stationary periods or ignition cycles
Trip metrics calculation: Distance, duration, idle time, average/max speeds
Daily data aggregation: Summed by user_id and date with self-reported economic data
Validation: Cross-checked with rider logs and known routes
Anonymisation: Removed start and end coordinates for first and last trips of each day to protect rider privacy and home locations
Technical Information
Geographic coverage: Nairobi, Kenya
Time period: November-December 2023
Time zone: UTC+3 (East Africa Time)
Currency: Kenyan Shillings (KES)
Data format: CSV files
Software used: Python 3.8 (pandas, numpy, geopy)
Notes: Some location data points are intentionally missing to protect rider privacy. Self-reported economic and energy consumption data has some missing values where riders did not report.
Categories
Motorcycle, Transportation in Africa, Electric Vehicles
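The daily aggregation step ("summed by user_id and date") can be illustrated with a small pandas sketch. Only user_id is named in the description; the other column names (start_time, distance_km, duration_min) and all values are made up for illustration and may not match the published CSV files.

```python
import pandas as pd

# Made-up trip-level records; only "user_id" is a documented column name.
trips = pd.DataFrame({
    "user_id": ["r1", "r1", "r2", "r1"],
    "start_time": pd.to_datetime([
        "2023-11-13 08:00", "2023-11-13 17:30",
        "2023-11-13 09:15", "2023-11-14 07:45",
    ]),
    "distance_km": [4.2, 6.3, 10.0, 3.5],
    "duration_min": [15, 25, 40, 12],
})

# Derive the calendar date of each trip from its start time.
trips["date"] = trips["start_time"].dt.date

# Aggregate per rider and day: trip count plus summed distance and duration.
daily = trips.groupby(["user_id", "date"], as_index=False).agg(
    trips=("distance_km", "size"),
    distance_km=("distance_km", "sum"),
    duration_min=("duration_min", "sum"),
)
print(daily)
```

Self-reported economic data could then be joined onto this table by user_id and date.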
Overview
This repository contains ready-to-use frequency time series as well as the corresponding pre-processing scripts in python. The data covers three synchronous areas of the European power grid:
This work is part of the paper "Predictability of Power Grid Frequency"[1]. Please cite this paper when using the data and the code. For detailed documentation of the pre-processing procedure, we refer to the supplementary material of the paper.
Data sources
We downloaded the frequency recordings from publicly available repositories of three different Transmission System Operators (TSOs).
Content of the repository
A) Scripts
The python scripts run with Python 3.7 and with the packages found in "requirements.txt".
B) Data_converted and Data_cleansed
The folder "Data_converted" contains the output of "convert_data_format.py" and "Data_cleansed" contains the output of "clean_corrupted_data.py".
Use cases
We point out that this repository can be used in two different ways:
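from helper_functions import *
import numpy as np
import pandas as pd

# Load the cleansed recording as a time-indexed Series.
# .squeeze('columns') replaces read_csv(squeeze=True), which pandas 2.0 removed.
cleansed_data = pd.read_csv('/Path_to_cleansed_data/data.zip',
                            index_col=0, header=None,
                            parse_dates=[0]).squeeze('columns')

# Find all NaN-free intervals and keep the longest one.
valid_bounds, valid_sizes = true_intervals(~cleansed_data.isnull())
start, end = valid_bounds[np.argmax(valid_sizes)]
data_without_nan = cleansed_data.iloc[start:end]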
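The snippet above relies on true_intervals from the repository's helper_functions module, whose implementation is not shown here. Judging only from how its return values are used, it appears to return the bounds and sizes of contiguous runs of True values. The following is a hypothetical stand-in with that behaviour (the name true_intervals_sketch and the exclusive end index are assumptions), useful for understanding the idea without the repository at hand.

```python
import numpy as np

def true_intervals_sketch(mask):
    """Return (bounds, sizes) of contiguous True runs in a boolean array.

    Each row of bounds is (start, end) with end exclusive, so that
    values[start:end] selects exactly one run.
    """
    mask = np.asarray(mask, dtype=bool)
    # Pad with False so runs touching either edge also produce two changes.
    padded = np.concatenate(([False], mask, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    bounds = changes.reshape(-1, 2)
    sizes = bounds[:, 1] - bounds[:, 0]
    return bounds, sizes

# Toy frequency recording with two gaps (NaN marks missing samples).
values = np.array([50.00, 50.01, np.nan, 49.99, 50.00, 50.02, np.nan])
bounds, sizes = true_intervals_sketch(~np.isnan(values))

# Select the longest NaN-free stretch, as in the usage example above.
start, end = bounds[np.argmax(sizes)]
longest = values[start:end]
print(bounds.tolist(), sizes.tolist(), longest.tolist())
```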
License
We release the code in the folder "Scripts" under the MIT license [8]. In the case of Nationalgrid and Fingrid, we further release the pre-processed data in the folder "Data_converted" and "Data_cleansed" under the CC-BY 4.0 license [7]. TransnetBW originally did not publish their data under an open license. We have explicitly received the permission to publish the pre-processed version from TransnetBW. However, we cannot publish our pre-processed version under an open license due to the missing license of the original TransnetBW data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background
The Department of Rehabilitation Medicine is key to improving patients’ quality of life. Driven by chronic diseases and an aging population, there is a need to enhance the efficiency and resource allocation of outpatient facilities. This study aims to analyze the treatment preferences of outpatient rehabilitation patients by using data and a grading tool to establish predictive models. The goal is to improve patient visit efficiency and optimize resource allocation through these predictive models.
Methods
Data were collected from 38 Chinese institutions, including 4,244 patients visiting outpatient rehabilitation clinics. Data processing was conducted using Python software. The pandas library was used for data cleaning and preprocessing, involving 68 categorical and 12 continuous variables. The steps included handling missing values, data normalization, and encoding conversion. The data were divided into 80% training and 20% test sets using the Scikit-learn library to ensure model independence and prevent overfitting. Performance comparisons among XGBoost, random forest, and logistic regression were conducted using metrics, including accuracy and receiver operating characteristic (ROC) curves. The imbalanced learning library’s SMOTE technique was used to address the sample imbalance during model training. The model was optimized using a confusion matrix and feature importance analysis, and partial dependence plots (PDP) were used to analyze the key influencing factors.
Results
XGBoost achieved the highest overall accuracy of 80.21% with high precision and recall in Category 1. Random forest showed a similar overall accuracy. Logistic regression had a significantly lower accuracy, indicating difficulties with nonlinear data. The key influencing factors identified include distance to medical institutions, arrival time, length of hospital stay, and specific diseases, such as cardiovascular, pulmonary, oncological, and orthopedic conditions. The tiered diagnosis and treatment tool effectively helped doctors assess patients’ conditions and recommend suitable medical institutions based on rehabilitation grading.
Conclusion
This study confirmed that ensemble learning methods, particularly XGBoost, outperform single models in classification tasks involving complex datasets. Addressing class imbalance and enhancing feature engineering can further improve model performance. Understanding patient preferences and the factors influencing medical institution selection can guide healthcare policies to optimize resource allocation, improve service quality, and enhance patient satisfaction. Tiered diagnosis and treatment tools play a crucial role in helping doctors evaluate patient conditions and make informed recommendations for appropriate medical care.
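The evaluation setup described in the Methods (an 80/20 split followed by an accuracy comparison between models) can be sketched with scikit-learn alone. The example below uses synthetic data with a deliberately nonlinear label, compares only random forest and logistic regression, and omits XGBoost and SMOTE; the stratified split and all parameter values are assumptions, not the study's configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: the label depends on a product of two features,
# which a linear model cannot separate well.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# 80% training / 20% test split (stratification is an assumption here).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model on the training set and score it on the held-out test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

print(scores)
```

On real, imbalanced data the resampling step would be applied to the training portion only, so that the test set keeps its original class distribution.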
This dataset contains population and population density data from the World Bank. The World Bank has accurate data from the year 1950, and this dataset contains projections from the year 2021 onwards (see my notebook for more). This dataset also contains the female and male population splits.
Thanks to the World Bank: https://data.worldbank.org/indicator/SP.POP.TOTL
This is a very simple dataset aimed at users who want to get involved with cleaning and visualising data in Python/pandas. See my code for inspiration.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Research Domain/Project:
This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validating, and testing.
Purpose of the Dataset:
The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.
Dataset Creation:
Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).
Structure of the Dataset:
The dataset consists of several files organized into folders by data type:
Training Data: Contains the training dataset used to train the machine learning model.
Validation Data: Used for hyperparameter tuning and model selection.
Test Data: Reserved for final model evaluation.
Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.
Software Requirements:
To open and work with this dataset, you need an environment such as VS Code or Jupyter, with tools like:
Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)
Reusability:
Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.
Limitations:
The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
🚀 BCG Data Science Job Simulation | Forage
This notebook focuses on feature engineering techniques to enhance a dataset for churn prediction modeling. As part of the BCG Data Science Job Simulation, I transformed raw customer data into valuable features to improve predictive performance.
📊 What’s Inside?
✅ Data Cleaning: Removing irrelevant columns to reduce noise
✅ Date-Based Feature Extraction: Converting raw dates into useful insights like activation year, contract length, and renewal month
✅ New Predictive Features:
consumption_trend → Measures if a customer’s last-month usage is increasing or decreasing
total_gas_and_elec → Aggregates total energy consumption
✅ Final Processed Dataset: Ready for churn prediction modeling
📂 Dataset Used:
📌 clean_data_after_eda.csv → Original dataset after Exploratory Data Analysis (EDA)
📌 clean_data_with_new_features.csv → Final dataset after feature engineering
🛠 Technologies Used:
🔹 Python (Pandas, NumPy)
🔹 Data Preprocessing & Feature Engineering
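The date-based and aggregate features listed above might be derived along the following lines. This is a sketch on made-up rows: the input column names (date_activ, date_end, date_renewal, cons_gas, cons_elec, cons_last_month) are assumptions about the raw data, and the formula for consumption_trend (last month minus the average month) is only one plausible definition, not necessarily the one used in the notebook.

```python
import pandas as pd

# Made-up customers; input column names are assumptions for illustration.
customers = pd.DataFrame({
    "date_activ": pd.to_datetime(["2013-06-15", "2015-01-01"]),
    "date_end": pd.to_datetime(["2016-06-15", "2016-01-01"]),
    "date_renewal": pd.to_datetime(["2015-06-23", "2015-11-10"]),
    "cons_gas": [0.0, 1200.0],
    "cons_elec": [5400.0, 800.0],
    "cons_last_month": [500.0, 40.0],
})

# Date-based features: activation year, contract length, renewal month.
customers["activation_year"] = customers["date_activ"].dt.year
customers["contract_length_days"] = (
    customers["date_end"] - customers["date_activ"]
).dt.days
customers["renewal_month"] = customers["date_renewal"].dt.month

# Aggregate feature: total energy consumption.
customers["total_gas_and_elec"] = customers["cons_gas"] + customers["cons_elec"]

# One possible trend definition: last month's usage minus the average month.
# Positive values suggest increasing usage, negative values decreasing usage.
customers["consumption_trend"] = (
    customers["cons_last_month"] - customers["cons_elec"] / 12
)
print(customers.iloc[:, 6:])
```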
🌟 Why Feature Engineering? Feature engineering is one of the most critical steps in machine learning. Well-engineered features improve model accuracy and uncover deeper insights into customer behavior.
🚀 This notebook is a great reference for anyone learning data preprocessing, feature selection, and predictive modeling in Data Science!
📩 Connect with Me: 🔗 GitHub Repo: https://github.com/Pavitr-Swain/BCG-Data-Science-Job-Simulation 💼 LinkedIn: https://www.linkedin.com/in/pavitr-kumar-swain-ab708b227/
🔍 Let’s explore churn prediction insights together! 🎯
https://spdx.org/licenses/CC0-1.0.html
Species loss is highly scale-dependent, following the species-area relationship. We analysed spatio-temporal patterns of species’ extirpation on a multitaxonomic level using Berlin, the capital city of Germany. Berlin is one of the largest cities in Europe and has experienced a strong urbanisation trend since the late 19th century. We expected species’ extirpation to be exceptionally high due to the long history of urbanisation. Analysing regional Red Lists of Threatened Plants, Animals, and Fungi of Berlin (covering 9498 species), we found that 16 % of species were extirpated, a rate 5.9 times higher than at the German scale, and 47.1 times higher than at the European scale. Species’ extirpation in Berlin is comparable to that of another German city with a similarly broad taxonomic coverage, but much higher than in regional areas with less human impact. The documentation of species’ extirpation started in the 18th century and is well documented for the 19th and 20th centuries. We found an average annual extirpation of 3.6 species in the 19th century, 9.6 species in the 20th century, and the same number of extirpated species as in the 19th century were documented in the 21st century, despite the much shorter time period. Our results showed that species’ extirpation is higher at small than at large spatial scales, and might be negatively influenced by urbanisation, with different effects on different taxonomic groups and habitats. Over time, we found that species’ extirpation is highest during periods of high human alterations and is negatively affected by the number of people living in the city. However, there is still a lack of data to decouple the size of the area and the human impact of urbanisation. Nevertheless, cities might be suitable systems for studying species’ extirpation processes due to their small scale and human impact.
Methods Data extraction: To determine the proportion of extirpated species for Germany, we manually summarised the numbers of species classified in category 0 ‘extinct or extirpated’ and calculated the percentage in relation to the total number of species listed in the Red Lists of Threatened Species for Germany, taken from the website of the Red List Centre of Germany (Rote Liste Zentrum, 2024a). For Berlin, we used the 37 current Red Lists of Threatened Plants, Animals, and Fungi from the city-state of Berlin, covering the years from 2004 to 2023, taken from the official capital city portal of the Berlin Senate Department for Mobility, Transport, Climate Protection and Environment (SenMVKU, 2024a; see overview of Berlin Red Lists used in Table 1). We extracted all species that are listed as extinct/extirpated, i.e. classified in category 0, and additionally, if available, the date of the last record of the species in Berlin. The Red List of macrofungi of the order Boletales by Schmidt (2017) was not included in our study, as this Red List has only been compiled once in the frame of a pilot project and therefore lacks the category 0 ‘extinct or extirpated’. We used Python, version 3.7.9 (Van Rossum and Drake, 2009), the Python libraries Pandas (McKinney et al., 2010), and Camelot-py, version 0.11.0 (Vinayak Meta, 2023) in Jupyter Lab, version 4.0.6 (Project Jupyter, 2016) notebooks. In the first step, we created a metadata table of the Red Lists of Berlin to keep track of the extraction process, maintain the source reference links, and store summarised data from each Red List pdf file. At the extraction of each file, a data row was added to the metadata table which was updated throughout the rest of the process. In the second step, we identified the page range for extraction for each extracted Red List file. The extraction mechanism for each Red List file depended on the printed table layout. 
We extracted tables with lined rows with the Lattice parsing method (Camelot-py, 2024a), and tables with alternating-coloured rows with the Stream method (Camelot-py, 2024b). To check the consistency of the extraction, we used the Camelot-py accuracy report along with the Pandas data frame shape property (Pandas, 2024). After initial data cleaning for consistent column counts and missing data, we filtered the data for species in category 0 only. We combined the data frames and exported them as a CSV file. In a further step, we checked whether the filtered data tallied with the summary tables given in each Red List. Finally, we cleaned each Red List table to contain the species, the current hazard level (category 0), the date of the species’ last detection in Berlin, and the reference (codes and data available at: Github, 2023). When no date of last detection was given for a species, we contacted the authors of the respective Red Lists and/or used former Red Lists to find information on species’ last detections (Burger et al., 1998; Saure et al., 1998; 1999; Braasch et al., 2000; Saure, 2000). Determination of the recording time windows of the Berlin Red Lists: We determined the time windows that the Berlin Red Lists look back on from their methodologies. If the information was missing in the current Red Lists, we consulted the previous version (see all detailed time windows of the earliest assessments with references in Table B2 in Appendix B). Data classification: For the analyses of the percentage of species in the different hazard levels, we used the German Red List categories as described in detail by Saure and Schwarz (2005) and Ludwig et al. (2009). These are: Prewarning list, endangered (category 3), highly endangered (category 2), threatened by extinction or extirpation (category 1), and extinct or extirpated (category 0). 
To determine the number of indigenous unthreatened species in each Red List, we subtracted the number of species in the five categories and the number of non-indigenous species (neobiota) from the total number of species in each Red List. For further analyses, we pooled the taxonomic groups of the 37 Red Lists into more broadly defined taxonomic groups: Plants, lichens, fungi, algae, mammals, birds, amphibians, reptiles, fish and lampreys, molluscs, and arthropods (see categorisation in Table 1). We categorised slime fungi (Myxomycetes including Ceratiomyxomycetes) as ‘fungi’, even though they are more closely related to animals, because slime fungi are traditionally studied by mycologists (Schmidt and Täglich, 2023). We classified ‘lichens’ in a separate category, rather than in ‘fungi’, as they are a symbiotic community of fungi and algae (Krause et al., 2017). For analyses of the percentage of extirpated species of each pooled taxonomic group, we set the number of extirpated species in relation to the sum of the number of unthreatened species, species in the prewarning list, and species in the categories one to three. We further categorised the extirpated species according to the habitats in which they occurred. We therefore categorised terrestrial species as ‘terrestrial’ and aquatic species as ‘aquatic’. Amphibians and dragonflies have life stages in both terrestrial and aquatic habitats, and were categorised as ‘terrestrial/aquatic’. We also categorised plants and mosses as ‘terrestrial/aquatic’ if they depend on wetlands (see all habitat categories for each species in Table C1 in Appendix C). The available data on the species’ last detection in Berlin ranged from a specific year, through a period of time, up to a century. If a year of last detection was given with the auxiliary ‘around’ or ‘circa’, we used the given year for temporal classification in further analyses. 
If a year of last detection was given with the auxiliary ‘before’ or ‘after’, we assumed that the nearest year of last detection was given and categorised the species in the respective century. In this case, we used the species for temporal analyses by centuries only, not across years. If only a timeframe was given as the date of last detection, we used the respective species for temporal analyses between centuries only. We further classified all of the extirpated species by the century in which they were last detected: 17th century (1601-1700); 18th century (1701-1800); 19th century (1801-1900); 20th century (1901-2000); 21st century (2001-now) (see all data on species’ last detection in Table C1 in Appendix C). For analyses of the effects of the number of inhabitants on species’ extirpation in Berlin, we used species that were extirpated between the years 1920 and 2012, because Berlin was expanded to ‘Groß-Berlin’ in 1920 (Buesch and Haus, 1987), roughly corresponding to the city’s current area. Accordingly, we included the number of Berlin’s inhabitants for every year a species was last detected (Statistische Jahrbücher der Stadt Berlin, 1920, 1924-1998, 2000; see all data on the number of inhabitants for each year of species’ last detection in Table C1 in Appendix C). Materials and Methods from Keinath et al. (2024): 'High levels of species’ extirpation in an urban environment – A case study from Berlin, Germany, covering 1700-2023'.
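The temporal classification rules described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names and the input format (a short free-text string such as "around 1920") are assumptions made for illustration, and entries that give only a timeframe are deliberately left unhandled.

```python
import re
from typing import Optional, Tuple


def century_of_last_detection(year: int) -> Optional[str]:
    """Map a year to the century bins listed in the text (hypothetical helper)."""
    bins = [
        (1601, 1700, "17th century"),
        (1701, 1800, "18th century"),
        (1801, 1900, "19th century"),
        (1901, 2000, "20th century"),
    ]
    for start, end, label in bins:
        if start <= year <= end:
            return label
    if year >= 2001:
        return "21st century"
    return None  # earlier than the first bin


def parse_last_detection(text: str) -> Tuple[int, Optional[str], bool]:
    """Return (year, century, usable_for_yearly_analysis) for a single-year entry.

    'around'/'circa' entries keep the given year for yearly analyses;
    'before'/'after' entries are used at century level only.
    """
    years = re.findall(r"\d{4}", text)
    if len(years) != 1:
        raise ValueError("only single-year entries are handled in this sketch")
    year = int(years[0])
    qualifier = text.lower()
    yearly = not ("before" in qualifier or "after" in qualifier)
    return year, century_of_last_detection(year), yearly
```

For example, `parse_last_detection("before 1850")` assigns the species to the 19th century and flags it as unsuitable for analyses across years.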
https://creativecommons.org/publicdomain/zero/1.0/
Title: 9,565 Top-Rated Movies Dataset
Description:
This dataset offers a comprehensive collection of 9,565 of the highest-rated movies according to audience ratings on the Movie Database (TMDb). The dataset includes detailed information about each movie, such as its title, overview, release date, popularity score, average vote, and vote count. It is designed to be a valuable resource for anyone interested in exploring trends in popular cinema, analyzing factors that contribute to a movie’s success, or building recommendation engines.
Key Features:
- Title: The official title of each movie.
- Overview: A brief synopsis or description of the movie's plot.
- Release Date: The release date of the movie, formatted as YYYY-MM-DD.
- Popularity: A score indicating the current popularity of the movie on TMDb, which can be used to gauge current interest.
- Vote Average: The average rating of the movie, based on user votes.
- Vote Count: The total number of votes the movie has received.
Data Source:
The data was sourced from the TMDb API, a well-regarded platform for movie information, using the /movie/top_rated endpoint. The dataset represents a snapshot of the highest-rated movies as of the time of data collection.
Data Collection Process:
- API Access: Data was retrieved programmatically using TMDb’s API.
- Pagination Handling: Multiple API requests were made to cover all pages of top-rated movies, ensuring the dataset’s comprehensiveness.
- Data Aggregation: Collected data was aggregated into a single, unified dataset using the pandas library.
- Cleaning: Basic data cleaning was performed to remove duplicates and handle missing or malformed data entries.
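The pagination and de-duplication steps above might look roughly like the following sketch. It is an illustration under stated assumptions, not the script used to build the dataset: `fetch_page` is a hypothetical callable that would wrap one request to the /movie/top_rated endpoint for a given page, and each response is assumed to be a dictionary with `results` and `total_pages` keys and an `id` on every movie record. A stand-in `fake_fetch` replaces the network call so the example runs offline.

```python
from typing import Callable, Dict, List


def collect_all_pages(fetch_page: Callable[[int], Dict]) -> List[Dict]:
    """Request every page and return de-duplicated movie records."""
    first = fetch_page(1)
    total_pages = int(first.get("total_pages", 1))
    rows = list(first.get("results", []))
    for page in range(2, total_pages + 1):
        rows.extend(fetch_page(page).get("results", []))

    # Drop records whose id was already seen, keeping the first occurrence.
    seen_ids = set()
    unique_rows = []
    for row in rows:
        if row["id"] not in seen_ids:
            seen_ids.add(row["id"])
            unique_rows.append(row)
    return unique_rows


def fake_fetch(page: int) -> Dict:
    """Stand-in for a real API request (hypothetical data)."""
    pages = {
        1: [{"id": 1, "title": "A"}, {"id": 2, "title": "B"}],
        2: [{"id": 2, "title": "B"}, {"id": 3, "title": "C"}],
    }
    return {"page": page, "total_pages": 2, "results": pages[page]}


movies = collect_all_pages(fake_fetch)
```

The resulting list of dictionaries could then be passed to `pandas.DataFrame(movies)` for aggregation and export.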
Potential Uses:
- Trend Analysis: Analyze trends in movie ratings over time or compare ratings across different genres.
- Recommendation Systems: Build and train models to recommend movies based on user preferences.
- Sentiment Analysis: Perform text analysis on movie overviews to understand common themes and sentiments.
- Statistical Analysis: Explore the relationship between popularity, vote count, and average ratings.
Data Format: The dataset is provided in a structured tabular format (e.g., CSV), making it easy to load into data analysis tools like Python, R, or Excel.
Usage License: The dataset is shared under [appropriate license], ensuring that it can be used for educational, research, or commercial purposes, with proper attribution to the data source (TMDb).
https://creativecommons.org/licenses/publicdomain/
This repository contains data on 17,419 DOIs cited in the IPCC Working Group 2 contribution to the Sixth Assessment Report, and the code to link them to the dataset built at the Curtin Open Knowledge Initiative (COKI).
References were extracted from the report's PDFs (downloaded 2022-03-01) via Scholarcy and exported as RIS and BibTeX files. DOI strings were identified from the RIS files by pattern matching and saved as a CSV file. The list of DOIs for each chapter and cross-chapter paper was processed using a custom Python script to generate a pandas DataFrame, which was saved as a CSV file and uploaded to Google BigQuery.
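As an illustration of identifying DOI strings by pattern matching, here is a minimal Python sketch. The repository's own DOI identification lives in preprocessing.R, so this is a swapped-in illustration rather than the actual code; the regular expression, the trailing-punctuation cleanup, and the lowercasing are assumptions.

```python
import re
from typing import List

# A commonly used DOI shape: "10.", a 4-9 digit registrant code, "/", a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"<>]+", re.IGNORECASE)


def extract_dois(ris_text: str) -> List[str]:
    """Return unique, lowercased DOIs in order of first appearance."""
    found: List[str] = []
    for match in DOI_PATTERN.findall(ris_text):
        doi = match.rstrip(".,;)").lower()  # strip punctuation that trails a DOI
        if doi not in found:
            found.append(doi)
    return found


sample_ris = (
    "TY  - JOUR\n"
    "DO  - 10.1000/ABC123\n"
    "UR  - https://doi.org/10.1000/abc123.\n"
    "ER  - \n"
    "TY  - JOUR\n"
    "DO  - 10.5555/xyz-9\n"
    "ER  - \n"
)
dois = extract_dois(sample_ris)
```

Lowercasing makes duplicates easier to spot, since DOIs are case-insensitive.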
We used the main object table of the Academic Observatory, which combines information from Crossref, Unpaywall, Microsoft Academic, Open Citations, the Research Organization Registry and Geonames to enrich the DOIs with bibliographic information, affiliations, and open access status. A custom query was used to join and format the data and the resulting table was visualised in a Google DataStudio dashboard.
This version of the repository also includes the set of DOIs from references in the IPCC Working Group 1 contribution to the Sixth Assessment Report as extracted by Alexis-Michel Mugabushaka and shared on Zenodo: https://doi.org/10.5281/zenodo.5475442 (CC-BY)
A brief descriptive analysis was provided as a blogpost on the COKI website.
The repository contains the following content:
Data:
data/scholarcy/RIS/ - extracted references as RIS files
data/scholarcy/BibTeX/ - extracted references as BibTeX files
IPCC_AR6_WGII_dois.csv - list of DOIs
data/10.5281_zenodo.5475442/ - references from IPCC AR6 WG1 report
Processing:
preprocessing.R - preprocessing steps for identifying and cleaning DOIs
process.py - Python script for transforming data and linking to COKI data through Google BigQuery
Outcomes:
Dataset on BigQuery - requires a Google account for access and a BigQuery account for querying
Data Studio Dashboard - interactive analysis of the generated data
Zotero library of references extracted via Scholarcy
PDF version of blogpost
Note on licenses: Data are made available under CC0 (with the exception of WG1 reference data, which have been shared under CC-BY 4.0). Code is made available under Apache License 2.0.
💁♀️Please take a moment to carefully read through this description and metadata to better understand the dataset and its nuances before proceeding to the Suggestions and Discussions section.
This dataset provides a comprehensive collection of setlists from Taylor Swift’s official era tours, curated expertly by Spotify. The playlist, available on Spotify under the title "Taylor Swift The Eras Tour Official Setlist," encompasses a diverse range of songs that have been performed live during the tour events of this global artist. Each dataset entry corresponds to a song featured in the playlist.
Taylor Swift, a pivotal figure in both country and pop music scenes, has had a transformative impact on the music industry. Her tours are celebrated not just for their musical variety but also for their theatrical elements, narrative style, and the deep emotional connection they foster with fans worldwide. This dataset aims to provide fans and researchers an insight into the evolution of Swift's musical and performance style through her tours, capturing the essence of what makes her tour unique.
Obtaining the Data: The data was obtained directly from the Spotify Web API, specifically focusing on the setlist tracks by the artist. The Spotify API provides detailed information about tracks, artists, and albums through various endpoints.
Data Processing: To process and structure the data, Python scripts were developed using data science libraries such as pandas for data manipulation and spotipy for API interactions, specifically for Spotify data retrieval.
Workflow:
- Authentication
- API Requests
- Data Cleaning and Transformation
- Saving the Data
Note: The popularity score reflects the value recorded on the day the dataset was retrieved. The popularity score can fluctuate daily.
This dataset, derived from Spotify focusing on Taylor Swift's The Eras Tour setlist data, is intended for educational, research, and analysis purposes only. Users are urged to use this data responsibly, ethically, and within the bounds of legal stipulations.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data collection process consisted of compiling texts from three main online platforms: Project Gutenberg (https://www.gutenberg.org/cache/epub/1597/pg1597-images.html), Andersen Stories (https://www.andersenstories.com/en/andersen_fairy-tales/list), and the HCA Gilead website (http://hca.gilead.org.il/). These resources offered digitized editions of Hans Christian Andersen’s fairy tales, including multiple translations and versions. The texts were gathered using web scraping and text extraction methods. Python tools such as BeautifulSoup were utilized to extract content from HTML, while Pandas was used for organizing and cleaning the collected data. The selection criteria emphasized the relevance to Andersen’s original stories, availability in English, and the completeness of the texts. To normalize the data, HTML tags were removed, titles were made consistent, and the texts were divided into sentences or organized into structured formats based on events (Subject-Verb-Object-Temporal-Location). The datasets are freely available and legally sourced from open-domain websites that offer unrestricted access to literary works, ensuring adherence to copyright and usage regulations.
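The normalization steps (removing HTML tags and dividing text into sentences) can be sketched as follows. This is only an illustration: it uses Python's standard-library `html.parser` instead of BeautifulSoup so that it runs without extra packages, and the naive punctuation-based sentence splitter is an assumption, not the method used to build the dataset.

```python
import re
from html.parser import HTMLParser
from typing import List


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def strip_tags(html: str) -> str:
    """Remove tags and collapse runs of whitespace into single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()


def split_sentences(text: str) -> List[str]:
    """Split after '.', '!' or '?' when followed by whitespace (naive rule)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


html = "<p>A soldier came marching. He met a <b>witch</b>!</p>"
sentences = split_sentences(strip_tags(html))
```

A rule this simple will mis-split abbreviations and quoted dialogue, so a dedicated sentence tokenizer would be a reasonable replacement for real texts.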
https://spdx.org/licenses/CC0-1.0.html
Sustainable cities depend on urban forests. City trees -- a pillar of urban forests -- improve our health, clean the air, store CO2, and cool local temperatures. Comparatively less is known about urban forests as ecosystems, particularly their spatial composition, nativity statuses, biodiversity, and tree health. Here, we assembled and standardized a new dataset of N=5,660,237 trees from 63 of the largest US cities. The data comes from tree inventories conducted at the level of cities and/or neighborhoods. Each data sheet includes detailed information on tree location, species, nativity status (whether a tree species is naturally occurring or introduced), health, size, whether it is in a park or urban area, and more (comprising 28 standardized columns per datasheet). This dataset could be analyzed in combination with citizen-science datasets on bird, insect, or plant biodiversity; social and demographic data; or data on the physical environment. Urban forests offer a rare opportunity to intentionally design biodiverse, heterogeneous, rich ecosystems. Methods: See the eLife manuscript for full details. Below, we provide a summary of how the dataset was collected and processed.
Data Acquisition: We limited our search to the 150 largest cities in the USA (by census population). To acquire raw data on street tree communities, we used a search protocol on both Google and Google Datasets Search (https://datasetsearch.research.google.com/). We first searched the city name plus each of the following: street trees, city trees, tree inventory, urban forest, and urban canopy (all combinations totaled 20 searches per city, 10 each in Google and Google Datasets Search). We then read the first page of Google results and the top 20 results from Google Datasets Search. If a city of the same name in the wrong state appeared in the results, we redid the 20 searches adding the state name. If no data were found, we contacted a relevant state official via email or phone with an inquiry about their street tree inventory. Datasheets were received and transformed to .csv format (if they were not already in that format). We received data on street trees from 64 cities. One city, El Paso, had data only in summary format and was therefore excluded from analyses.
Data Cleaning: All code used is in the zipped folder Data S5 in the eLife publication. Before cleaning the data, we ensured that all reported trees for each city were located within the greater metropolitan area of the city (for certain inventories, many suburbs were reported - some within the greater metropolitan area, others not). First, we renamed all columns in the received .csv sheets, referring to the metadata and according to our standardized definitions (Table S4). To harmonize tree health and condition data across different cities, we inspected metadata from the tree inventories and converted all numeric scores to a descriptive scale including “excellent”, “good”, “fair”, “poor”, “dead”, and “dead/dying”. Some cities included only three points on this scale (e.g., “good”, “poor”, “dead/dying”) while others included five (e.g., “excellent”, “good”, “fair”, “poor”, “dead”). Second, we used pandas in Python (W. McKinney & Others, 2011) to correct typos, non-ASCII characters, variable spellings, date format, units used (we converted all units to metric), address issues, and common name format. In some cases, units were not specified for tree diameter at breast height (DBH) and tree height; we determined the units based on typical sizes for trees of a particular species. Wherever diameter was reported, we assumed it was DBH. We standardized health and condition data across cities, preserving the highest granularity available for each city. For our analysis, we converted this variable to a binary (see section Condition and Health). We created a column called “location_type” to label whether a given tree was growing in the built environment or in green space. All of the changes we made, and decision points, are preserved in Data S9. Third, we checked the scientific names reported using gnr_resolve in the R library taxize (Chamberlain & Szöcs, 2013), with the option Best_match_only set to TRUE (Data S9). 
Through an iterative process, we manually checked the results and corrected typos in the scientific names until all names were either a perfect match (n=1771 species) or partial match with threshold greater than 0.75 (n=453 species). BGS manually reviewed all partial matches to ensure that they were the correct species name, and then we programmatically corrected these partial matches (for example, Magnolia grandifolia-- which is not a species name of a known tree-- was corrected to Magnolia grandiflora, and Pheonix canariensus was corrected to its proper spelling of Phoenix canariensis). Because many of these tree inventories were crowd-sourced or generated in part through citizen science, such typos and misspellings are to be expected. Some tree inventories reported species by common names only. Therefore, our fourth step in data cleaning was to convert common names to scientific names. We generated a lookup table by summarizing all pairings of common and scientific names in the inventories for which both were reported. We manually reviewed the common to scientific name pairings, confirming that all were correct. Then we programmatically assigned scientific names to all common names (Data S9). Fifth, we assigned native status to each tree through reference to the Biota of North America Project (Kartesz, 2018), which has collected data on all native and non-native species occurrences throughout the US states. Specifically, we determined whether each tree species in a given city was native to that state, not native to that state, or that we did not have enough information to determine nativity (for cases where only the genus was known). Sixth, some cities reported only the street address but not latitude and longitude. For these cities, we used the OpenCageGeocoder (https://opencagedata.com/) to convert addresses to latitude and longitude coordinates (Data S9). 
OpenCageGeocoder leverages open data and is used by many academic institutions (see https://opencagedata.com/solutions/academia). Seventh, we trimmed each city dataset to include only the standardized columns we identified in Table S4. After each stage of data cleaning, we performed manual spot checking to identify any issues.
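The fourth cleaning step (deriving scientific names from common names via a lookup table built from inventories that report both) can be sketched in Python as below. This is an assumption-laden illustration, not the code in Data S5: the column names `common_name` and `scientific_name` are hypothetical, and the sketch simply keeps the first pairing seen for each common name, whereas the authors reviewed all pairings manually.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def build_lookup(records: List[Record]) -> Dict[str, str]:
    """Map normalized common names to scientific names where both are present."""
    lookup: Dict[str, str] = {}
    for rec in records:
        common = (rec.get("common_name") or "").strip().lower()
        scientific = (rec.get("scientific_name") or "").strip()
        if common and scientific:
            # Keep the first pairing; conflicting pairs would need manual review.
            lookup.setdefault(common, scientific)
    return lookup


def fill_scientific_names(records: List[Record], lookup: Dict[str, str]) -> List[Record]:
    """Return copies of the records with missing scientific names filled in."""
    filled = []
    for rec in records:
        rec = dict(rec)
        if not rec.get("scientific_name"):
            key = (rec.get("common_name") or "").strip().lower()
            rec["scientific_name"] = lookup.get(key)  # None if the name is unknown
        filled.append(rec)
    return filled


both_reported = [
    {"common_name": "Southern Magnolia", "scientific_name": "Magnolia grandiflora"},
]
common_only = [
    {"common_name": "southern magnolia ", "scientific_name": None},
    {"common_name": "mystery tree", "scientific_name": None},
]
result = fill_scientific_names(common_only, build_lookup(both_reported))
```

Names without a match stay empty rather than being guessed, which leaves them visible for manual spot checking.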
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Reddit is a social news, content rating and discussion website. It's one of the most popular sites on the internet. Reddit has 52 million daily active users and approximately 430 million users who use it at least once a month. Reddit is divided into subreddits; here we use the r/AskScience subreddit.
The dataset is extracted from the subreddit /r/AskScience on Reddit. The data was collected between 01-01-2016 and 20-05-2022. It contains 612,668 data points and 25 columns. The dataset contains information about the questions asked on the subreddit, the description of the submission, the flair of the question, NSFW or SFW status, the year of the submission, and more. The data was extracted using Python and Pushshift's API. Some light cleaning was done using NumPy and pandas as well (see the descriptions of individual columns below).
The dataset contains the following columns and descriptions:
- author: Redditor name.
- author_fullname: Redditor full name.
- contest_mode: Contest mode [implement obscured scores and randomized sorting].
- created_utc: Time the submission was created, represented in Unix time.
- domain: Domain of the submission.
- edited: Whether or not the post has been edited.
- full_link: Link of the post on the subreddit.
- id: ID of the submission.
- is_self: Whether or not the submission is a self post (text-only).
- link_flair_css_class: CSS class used to identify the flair.
- link_flair_text: The link flair’s text content.
- locked: Whether or not the submission has been locked.
- num_comments: The number of comments on the submission.
- over_18: Whether or not the submission has been marked as NSFW.
- permalink: A permalink for the submission.
- retrieved_on: Time ingested.
- score: The number of upvotes for the submission.
- description: Description of the submission.
- spoiler: Whether or not the submission has been marked as a spoiler.
- stickied: Whether or not the submission is stickied.
- thumbnail: Thumbnail of the submission.
- question: Question asked in the submission.
- url: The URL the submission links to, or the permalink if a self post.
- year: Year of the submission.
- banned: Whether or not the submission was banned by a moderator.
This dataset can be used for flair prediction, NSFW classification, and various text mining/NLP tasks. Exploratory data analysis can also be done to gain insights and see trends and patterns over the years.
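As a small illustration of working with the created_utc column, the sketch below converts Unix timestamps to a submission year and counts submissions per year. It is a hedged example, not the dataset author's cleaning code: the function name is hypothetical, the sample timestamps are made up, and the year is assumed to be taken in UTC.

```python
from collections import Counter
from datetime import datetime, timezone


def submission_year(created_utc: int) -> int:
    """Convert a Unix timestamp (seconds) to a calendar year in UTC."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc).year


# Hypothetical timestamps: 2015-12-31 23:59:59, 2016-01-01 00:00:00,
# and 2022-05-20 00:00:00 (all UTC).
timestamps = [1451606399, 1451606400, 1653004800]
posts_per_year = Counter(submission_year(ts) for ts in timestamps)
```

Passing an explicit time zone avoids results that depend on the local time zone of the machine running the analysis.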