Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Part of the dissertation Pitch of Voiced Speech in the Short-Time Fourier Transform: Algorithms, Ground Truths, and Evaluation Methods. © 2020, Bastian Bechtold. All rights reserved.

Estimating the fundamental frequency of speech remains an active area of research, with varied applications in speech recognition, speaker identification, and speech compression. A vast number of algorithms for estimating this quantity have been proposed over the years, and a number of speech and noise corpora have been developed for evaluating their performance. The present dataset contains estimated fundamental frequency tracks of 25 algorithms, evaluated on six speech corpora and two noise corpora at nine signal-to-noise ratios between -20 and 20 dB SNR, as well as an additional evaluation on synthetic harmonic tone complexes in white noise. The dataset also contains pre-calculated performance measures, both novel and traditional, in reference to each speech corpus’ ground truth, the algorithms’ own clean-speech estimate, and our own consensus truth. It can thus serve as the basis for a comparison study, to replicate existing studies from a larger dataset, or as a reference for developing new fundamental frequency estimation algorithms. All source code and data are available to download and are entirely reproducible, albeit requiring about one year of processor time.

Included Code and Data
ground truth data.zip is a JBOF dataset of fundamental frequency estimates and ground truths of all speech files in the following corpora:
- CMU-ARCTIC (consensus truth) [1]
- FDA (corpus truth and consensus truth) [2]
- KEELE (corpus truth and consensus truth) [3]
- MOCHA-TIMIT (consensus truth) [4]
- PTDB-TUG (corpus truth and consensus truth) [5]
- TIMIT (consensus truth) [6]
noisy speech data.zip is a JBOF dataset of fundamental frequency estimates of speech files mixed with noise from the following corpora:
- NOISEX [7]
- QUT-NOISE [8]
synthetic speech data.zip is a JBOF dataset of fundamental frequency estimates of synthetic harmonic tone complexes in white noise.

noisy_speech.pkl and synthetic_speech.pkl are pickled Pandas dataframes of performance metrics derived from the above data for the following fundamental frequency estimation algorithms:
- AUTOC [9]
- AMDF [10]
- BANA [11]
- CEP [12]
- CREPE [13]
- DIO [14]
- DNN [15]
- KALDI [16]
- MAPS
- MBSC [17]
- NLS [18]
- PEFAC [19]
- PRAAT [20]
- RAPT [21]
- SACC [22]
- SAFE [23]
- SHR [24]
- SIFT [25]
- SRH [26]
- STRAIGHT [27]
- SWIPE [28]
- YAAPT [29]
- YIN [30]
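The two pickled dataframes can be loaded directly with pandas. A minimal sketch; the exact column layout of the dataframes is not documented here, so it is inspected rather than assumed:

import pandas as pd

# Load the pre-calculated performance metrics.
noisy = pd.read_pickle("noisy_speech.pkl")
synthetic = pd.read_pickle("synthetic_speech.pkl")

# Inspect the available columns before filtering by algorithm, corpus, or SNR.
print(noisy.columns)
print(noisy.head())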
noisy speech evaluation.py and synthetic speech evaluation.py are Python programs that calculate the above Pandas dataframes from the above JBOF datasets. They calculate the following performance measures:
- Gross Pitch Error (GPE), the percentage of pitches where the estimated pitch deviates from the true pitch by more than 20%.
- Fine Pitch Error (FPE), the mean error of grossly correct estimates.
- High/Low Octave Pitch Error (OPE), the percentage of pitches that are GPEs and happen to be at an integer multiple of the true pitch.
- Gross Remaining Error (GRE), the percentage of pitches that are GPEs but not OPEs.
- Fine Remaining Bias (FRB), the median error of GREs.
- True Positive Rate (TPR), the percentage of true positive voicing estimates.
- False Positive Rate (FPR), the percentage of false positive voicing estimates.
- False Negative Rate (FNR), the percentage of false negative voicing estimates.
- F₁, the harmonic mean of precision and recall of the voicing decision.
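For orientation, the first two pitch-error definitions above can be written down compactly as follows. This is an illustrative sketch of the definitions, not the code from the evaluation scripts:

import numpy as np

def gross_and_fine_pitch_error(estimated, true, tolerance=0.2):
    # Illustrates the GPE/FPE definitions for two arrays of voiced-frame pitches (Hz).
    estimated = np.asarray(estimated, dtype=float)
    true = np.asarray(true, dtype=float)
    relative_error = (estimated - true) / true

    gross = np.abs(relative_error) > tolerance   # frames counted as Gross Pitch Errors
    gpe = 100.0 * np.mean(gross)                 # GPE: percentage of grossly wrong frames
    fpe = np.mean(relative_error[~gross])        # FPE: mean error of the grossly correct frames
    return gpe, fpe

# Example with made-up values: one octave error out of four frames.
print(gross_and_fine_pitch_error([100, 202, 150, 440], [100, 200, 150, 220]))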
Pipfile is a pipenv-compatible pipfile for installing all prerequisites necessary for running the above Python programs.
The Python programs take about an hour to compute on a fast 2019 computer, and require at least 32 GB of memory.

References:
[1] John Kominek and Alan W Black. CMU ARCTIC database for speech synthesis, 2003.
[2] Paul C Bagshaw, Steven Hiller, and Mervyn A Jack. Enhanced Pitch Tracking and the Processing of F0 Contours for Computer Aided Intonation Teaching. In EUROSPEECH, 1993.
[3] F Plante, Georg F Meyer, and William A Ainsworth. A Pitch Extraction Reference Database. In Fourth European Conference on Speech Communication and Technology, pages 837–840, Madrid, Spain, 1995.
[4] Alan Wrench. MOCHA MultiCHannel Articulatory database: English, November 1999.
[5] Gregor Pirker, Michael Wohlmayr, Stefan Petrik, and Franz Pernkopf. A Pitch Tracking Corpus with Evaluation on Multipitch Tracking Scenario. page 4, 2011.
[6] John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, David S. Pallett, Nancy L. Dahlgren, and Victor Zue. TIMIT Acoustic-Phonetic Continuous Speech Corpus, 1993.
[7] Andrew Varga and Herman J.M. Steeneken. Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication, 12(3):247–251, July 1993.
[8] David B. Dean, Sridha Sridharan, Robert J. Vogt, and Michael W. Mason. The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms. Proceedings of Interspeech 2010, 2010.
[9] Man Mohan Sondhi. New methods of pitch extraction. Audio and Electroacoustics, IEEE Transactions on, 16(2):262–266, 1968.
[10] Myron J. Ross, Harry L. Shaffer, Asaf Cohen, Richard Freudberg, and Harold J. Manley. Average magnitude difference function pitch extractor. Acoustics, Speech and Signal Processing, IEEE Transactions on, 22(5):353–362, 1974.
[11] Na Yang, He Ba, Weiyang Cai, Ilker Demirkol, and Wendi Heinzelman. BaNa: A Noise Resilient Fundamental Frequency Detection Algorithm for Speech and Music. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12):1833–1848, December 2014.
[12] Michael Noll. Cepstrum Pitch Determination. The Journal of the Acoustical Society of America, 41(2):293–309, 1967.
[13] Jong Wook Kim, Justin Salamon, Peter Li, and Juan Pablo Bello. CREPE: A Convolutional Representation for Pitch Estimation. arXiv:1802.06182 [cs, eess, stat], February 2018.
[14] Masanori Morise, Fumiya Yokomori, and Kenji Ozawa. WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications. IEICE Transactions on Information and Systems, E99.D(7):1877–1884, 2016.
[15] Kun Han and DeLiang Wang. Neural Network Based Pitch Tracking in Very Noisy Speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12):2158–2168, December 2014.
[16] Pegah Ghahremani, Bagher BabaAli, Daniel Povey, Korbinian Riedhammer, Jan Trmal, and Sanjeev Khudanpur. A pitch extraction algorithm tuned for automatic speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 2494–2498. IEEE, 2014.
[17] Lee Ngee Tan and Abeer Alwan. Multi-band summary correlogram-based pitch detection for noisy speech. Speech Communication, 55(7-8):841–856, September 2013.
[18] Jesper Kjær Nielsen, Tobias Lindstrøm Jensen, Jesper Rindom Jensen, Mads Græsbøll Christensen, and Søren Holdt Jensen. Fast fundamental frequency estimation: Making a statistically efficient estimator computationally efficient. Signal Processing, 135:188–197, June 2017.
[19] Sira Gonzalez and Mike Brookes. PEFAC - A Pitch Estimation Algorithm Robust to High Levels of Noise. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(2):518–530, February 2014.
[20] Paul Boersma. Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. In Proceedings of the Institute of Phonetic Sciences, volume 17, pages 97–110, Amsterdam, 1993.
[21] David Talkin. A robust algorithm for pitch tracking (RAPT). Speech Coding and Synthesis, 495:518, 1995.
[22] Byung Suk Lee and Daniel PW Ellis. Noise robust pitch tracking by subband autocorrelation classification. In Interspeech, pages 707–710, 2012.
[23] Wei Chu and Abeer Alwan. SAFE: A statistical algorithm for F0 estimation for both clean and noisy speech. In INTERSPEECH, pages 2590–2593, 2010.
[24] Xuejing Sun. Pitch determination and voice quality analysis using subharmonic-to-harmonic ratio. In Acoustics, Speech, and Signal Processing (ICASSP), 2002 IEEE International Conference on, volume 1, page I-333. IEEE, 2002.
[25] Markel. The SIFT algorithm for fundamental frequency estimation. IEEE Transactions on Audio and Electroacoustics, 20(5):367–377, December 1972.
[26] Thomas Drugman and Abeer Alwan. Joint Robust Voicing Detection and Pitch Estimation Based on Residual Harmonics. In Interspeech, pages 1973–1976, 2011.
[27] Hideki Kawahara, Masanori Morise, Toru Takahashi, Ryuichi Nisimura, Toshio Irino, and Hideki Banno. TANDEM-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 3933–3936. IEEE, 2008.
[28] Arturo Camacho. SWIPE: A sawtooth waveform inspired pitch estimator for speech and music. PhD thesis, University of Florida, 2007.
[29] Kavita Kasi and Stephen A. Zahorian. Yet Another Algorithm for Pitch Tracking. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages I-361–I-364, Orlando, FL, USA, May 2002. IEEE.
[30] Alain de Cheveigné and Hideki Kawahara. YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4):1917, 2002.
import pandas as pd
# Example dataset with new columns
data = [
    {
        "title": "Pandas Library",
        "about": "Pandas is a Python library for data manipulation and analysis.",
        "procedure": "Install Pandas via pip, load data into DataFrames, clean and analyze data using built-in functions.",
        "content": """ Pandas provides data structures like Series and DataFrame for handling structured data. It supports indexing, slicing, aggregation, joining, and filtering…

… See the full description on the dataset page: https://huggingface.co/datasets/vicky3241/rag.
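To work with the hosted version directly, the Hugging Face datasets library can load the repository by its id. A minimal sketch; the split name "train" is an assumption:

from datasets import load_dataset

# Load the hosted dataset by repository id (split name assumed to be "train").
rag = load_dataset("vicky3241/rag", split="train")

# Convert to a pandas DataFrame for inspection.
df = rag.to_pandas()
print(df.head())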
https://whoisdatacenter.com/terms-of-use/
Investigate historical ownership changes and registration details by initiating a reverse Whois lookup for the name CLEAN PANDA.
https://www.archivemarketresearch.com/privacy-policy
The global clean keyboard dust market is experiencing robust growth, driven by increasing digitalization and the consequent rise in computer usage across both professional and personal settings. The market, estimated at $500 million in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 8% from 2025 to 2033. This growth is fueled by several factors, including the growing awareness of hygiene and the need for maintaining clean and functional keyboards, especially in healthcare and other sensitive environments. Furthermore, the increasing prevalence of allergies and respiratory issues linked to dust accumulation on keyboards is boosting demand for effective cleaning solutions. The market's segmentation likely includes various product types like compressed air dusters, cleaning wipes, gels, and specialized keyboard cleaning kits. The competitive landscape is relatively fragmented, with a mix of established players like Lenovo and emerging brands vying for market share. Innovation in cleaning technologies and the development of eco-friendly solutions will shape the future trajectory of this market.

The market's restraints include the relatively low cost of simple cleaning methods such as using a brush or compressed air, leading some consumers to avoid specialized products. However, increasing awareness of the long-term benefits of proper keyboard hygiene, along with the development of more efficient and convenient cleaning solutions, is likely to mitigate these restraints.

The regional distribution of the market is likely to reflect global computer usage patterns, with North America, Europe, and Asia-Pacific representing significant market segments. Continued technological advancements, especially in areas like ergonomic design and self-cleaning keyboards, could lead to significant market disruption and expansion in the coming years. The market's evolution will depend on factors such as consumer preferences, technological innovation, and regulatory changes related to environmental sustainability.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This synthetic dataset is designed specifically for practicing data visualization and exploratory data analysis (EDA) using popular Python libraries like Seaborn, Matplotlib, and Pandas.
Unlike most public datasets, this one includes a diverse mix of column types:
📅 Date columns (for time series and trend plots)
🔢 Numerical columns (for histograms, boxplots, scatter plots)
🏷️ Categorical columns (for bar charts, group analysis)
Whether you are a beginner learning how to visualize data or an intermediate user testing new charting techniques, this dataset offers a versatile playground.
Feel free to:
Create EDA notebooks
Practice plotting techniques
Experiment with filtering, grouping, and aggregations

🛠️ No missing values, no data cleaning needed — just download and start exploring!
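As an illustration of the kind of quick EDA this dataset is meant for, here is a minimal sketch using Pandas, Matplotlib, and Seaborn. The generated frame and its column names (order_date, sales, region) are placeholders, not the dataset's actual schema:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical stand-in for the dataset: one date, one numerical, one categorical column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "order_date": pd.date_range("2024-01-01", periods=200, freq="D"),
    "sales": rng.normal(100, 20, 200),
    "region": rng.choice(["North", "South", "East", "West"], 200),
})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
df.set_index("order_date")["sales"].plot(ax=axes[0], title="Trend over time")     # date column
sns.histplot(df["sales"], ax=axes[1]).set_title("Distribution")                   # numerical column
sns.boxplot(x="region", y="sales", data=df, ax=axes[2]).set_title("By category")  # categorical column
plt.tight_layout()
plt.show()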
Hope you find this helpful. Looking forward to hearing from you all.
To contribute to the conservation of endangered animals, the utilization of model systems is critical to analyze the function of their gut microbiota. In this study, the results of a fecal microbial transplantation (FMT) experiment with germ-free (GF) mice receiving giant panda or horse fecal microbiota showed a clear clustering by donor microbial communities in GF mice, which was consistent with the results of blood metabolites from these mice. At the genus level, FMT re-established approximately 9% of the giant panda donor microbiota in GF mice compared to about 32% for the horse donor microbiota. In line with this, the difference between the panda donor microbiota and panda-mice microbiota at the whole-community level was significantly larger than that between the horse donor microbiota and the horse-mice microbiota. These results were consistent with source tracking analysis that found a significantly higher retention rate of the horse donor microbiota (30.9%) than the giant panda donor microbiota (4.0%) in GF mice where the microbiota remained stable after FMT. Further analyses indicated that the possible reason for the low retention rate of the panda donor microbiota in GF mice was a low relative abundance of Clostridiaceae in the panda donor microbiota. Our results indicate that the donor microbiota has a large effect on the GF mice microbiota after FMT.
https://creativecommons.org/publicdomain/zero/1.0/
Google Ads Sales Dataset for Data Analytics Campaigns (Raw & Uncleaned)

📝 Dataset Overview
This dataset contains raw, uncleaned advertising data from a simulated Google Ads campaign promoting data analytics courses and services. It closely mimics what real digital marketers and analysts would encounter when working with exported campaign data — including typos, formatting issues, missing values, and inconsistencies.
It is ideal for practicing:
Data cleaning
Exploratory Data Analysis (EDA)
Marketing analytics
Campaign performance insights
Dashboard creation using tools like Excel, Python, or Power BI
📁 Columns in the Dataset
Ad_ID: Unique ID of the ad campaign
Campaign_Name: Name of the campaign (with typos and variations)
Clicks: Number of clicks received
Impressions: Number of ad impressions
Cost: Total cost of the ad (in ₹ or $ format with missing values)
Leads: Number of leads generated
Conversions: Number of actual conversions (signups, sales, etc.)
Conversion Rate: Calculated conversion rate (Conversions ÷ Clicks)
Sale_Amount: Revenue generated from the conversions
Ad_Date: Date of the ad activity (in inconsistent formats like YYYY/MM/DD, DD-MM-YY)
Location: City where the ad was served (includes spelling/case variations)
Device: Device type (Mobile, Desktop, Tablet with mixed casing)
Keyword: Keyword that triggered the ad (with typos)
⚠️ Data Quality Issues (Intentional)
This dataset was intentionally left raw and uncleaned to reflect real-world messiness, such as:
Inconsistent date formats
Spelling errors (e.g., "analitics", "anaytics")
Duplicate rows
Mixed units and symbols in cost/revenue columns
Missing values
Irregular casing in categorical fields (e.g., "mobile", "Mobile", "MOBILE")
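A minimal cleaning sketch in Pandas along these lines; it assumes the export is a CSV named google_ads_sales.csv (the actual file name is not given here) and uses the column names listed above:

import pandas as pd

# Hypothetical file name; adjust to the actual export.
df = pd.read_csv("google_ads_sales.csv")

# Drop exact duplicate rows.
df = df.drop_duplicates()

# Parse inconsistent date formats; entries that cannot be parsed become NaT.
df["Ad_Date"] = pd.to_datetime(df["Ad_Date"], errors="coerce", dayfirst=True)

# Strip currency symbols and commas from Cost and Sale_Amount, then convert to numbers.
for col in ["Cost", "Sale_Amount"]:
    df[col] = pd.to_numeric(
        df[col].astype(str).str.replace(r"[₹$,]", "", regex=True),
        errors="coerce",
    )

# Normalise casing in categorical fields such as Device and Location.
for col in ["Device", "Location"]:
    df[col] = df[col].astype(str).str.strip().str.title()

print(df.dtypes)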
🎯 Use Cases
Data cleaning exercises in Python (Pandas), R, Excel
Data preprocessing for machine learning
Campaign performance analysis
Conversion optimization tracking
Building dashboards in Power BI, Tableau, or Looker
💡 Sample Analysis Ideas
Track campaign cost vs. return (ROI)
Analyze click-through rates (CTR) by device or location
Clean and standardize campaign names and keywords
Investigate keyword performance vs. conversions
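For instance, click-through rate by device can be computed once the raw file is loaded, again assuming the hypothetical file name and the column names listed earlier:

import pandas as pd

# Hypothetical file name, as in the cleaning sketch above.
df = pd.read_csv("google_ads_sales.csv")
df["Device"] = df["Device"].astype(str).str.strip().str.title()

# Click-through rate (CTR) per device: total clicks divided by total impressions.
per_device = df.groupby("Device")[["Clicks", "Impressions"]].sum()
ctr_by_device = per_device["Clicks"] / per_device["Impressions"]
print(ctr_by_device.sort_values(ascending=False))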
🔖 Tags Digital Marketing · Google Ads · Marketing Analytics · Data Cleaning · Pandas Practice · Business Analytics · CRM Data
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Objective
The primary objective of this study was to analyze CpG dinucleotide dynamics in coronaviruses by comparing Wuhan-Hu-1 with its closest and most distant relatives. Heatmaps were generated to visualize CpG counts and O/E ratios across intergenic regions, providing a clear depiction of conserved and divergent CpG patterns.

Methods
1. Data Collection
Source: The dataset includes CpG counts and O/E ratios for various coronaviruses, extracted from publicly available genomic sequences.
Format: Data was compiled into a CSV file containing columns for intergenic regions, CpG counts, and O/E ratios for each virus.
2. Preprocessing
Data Cleaning: Missing values (NaN), infinite values (inf, -inf), and blank entries were handled using Python's pandas library. Missing values were replaced with column means, and infinite values were capped at a large finite value (1e9).
Reshaping: The data was reshaped into matrices for CpG counts and O/E ratios using the pandas melt() and pivot() functions.
3. Distance Calculation
Euclidean Distance: Pairwise Euclidean distances were calculated between Wuhan-Hu-1 and other viruses using the scipy.spatial.distance.euclidean function. Distances were computed separately for CpG counts and O/E ratios, and the total distance was derived as the sum of both metrics.
4. Identification of Closest and Distant Relatives
The virus with the smallest total distance was identified as the closest relative. The virus with the largest total distance was identified as the most distant relative.
5. Heatmap Generation
Tools: Heatmaps were generated using Python's seaborn library (sns.heatmap) and matplotlib for visualization.
Parameters: Heatmaps were annotated with numerical values for clarity. A color gradient (coolwarm) was used to represent varying CpG counts and O/E ratios. Titles and axis labels were added to describe the comparison between Wuhan-Hu-1 and its relatives.

Results
Closest Relative: The closest relative to Wuhan-Hu-1 was identified based on the smallest Euclidean distance. Heatmaps for CpG counts and O/E ratios show high similarity in specific intergenic regions.
Most Distant Relative: The most distant relative was identified based on the largest Euclidean distance. Heatmaps reveal significant differences in CpG dynamics compared to Wuhan-Hu-1.

Tools and Libraries
Programming Language: Python 3.13
Libraries: pandas (data manipulation and cleaning), numpy (numerical operations and handling of missing/infinite values), scipy.spatial.distance (Euclidean distances), seaborn (heatmaps), matplotlib (additional visualization enhancements).
File Formats: Input: CSV files containing CpG counts and O/E ratios. Output: PNG images of heatmaps.

Files Included
CSV File: Contains the raw data of CpG counts and O/E ratios for all viruses.
Heatmap Images: Heatmaps for CpG counts and O/E ratios comparing Wuhan-Hu-1 with its closest and most distant relatives.
Python Script: Full Python code used for data processing, distance calculation, and heatmap generation.

Usage Notes
Researchers can use this dataset to further explore the evolutionary dynamics of CpG dinucleotides in coronaviruses. The Python script can be adapted to analyze other viral genomes or datasets. Heatmaps provide a visual summary of CpG dynamics, aiding in hypothesis generation and experimental design.

Acknowledgments
Special thanks to the open-source community for developing tools like pandas, numpy, seaborn, and matplotlib. This work was conducted as part of an independent research project in molecular biology and bioinformatics.

License
This dataset is shared under the CC BY 4.0 License, allowing others to share and adapt the material as long as proper attribution is given. DOI: 10.6084/m9.figshare.28736501
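As an illustration only, here is a condensed sketch of the pipeline described above, shown for CpG counts (O/E ratios would be handled the same way and the two distances summed). The file name and the column names region, virus, and cpg_count are assumptions, not the actual CSV schema:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.spatial.distance import euclidean

# Assumed long-format CSV with columns: region, virus, cpg_count (actual schema may differ).
df = pd.read_csv("cpg_counts.csv")
df = df.replace([np.inf, -np.inf], 1e9)        # cap infinite values at a large finite value
df = df.fillna(df.mean(numeric_only=True))     # replace missing values with column means

# Reshape into a region-by-virus matrix of CpG counts.
counts = df.pivot(index="region", columns="virus", values="cpg_count")

# Pairwise Euclidean distance between Wuhan-Hu-1 and every other virus.
ref = counts["Wuhan-Hu-1"]
distances = {v: euclidean(ref, counts[v]) for v in counts.columns if v != "Wuhan-Hu-1"}
closest = min(distances, key=distances.get)
farthest = max(distances, key=distances.get)

# Annotated heatmap comparing the reference with its closest and most distant relatives.
sns.heatmap(counts[["Wuhan-Hu-1", closest, farthest]], annot=True, cmap="coolwarm")
plt.title("CpG counts: Wuhan-Hu-1 vs. closest and most distant relatives")
plt.tight_layout()
plt.show()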
Overview
This repository contains ready-to-use frequency time series as well as the corresponding pre-processing scripts in Python. The data covers three synchronous areas of the European power grid.
This work is part of the paper "Predictability of Power Grid Frequency" [1]. Please cite this paper when using the data and the code. For a detailed documentation of the pre-processing procedure, we refer to the supplementary material of the paper.
Data sources
We downloaded the frequency recordings from publicly available repositories of three different Transmission System Operators (TSOs).
Content of the repository
A) Scripts
The python scripts run with Python 3.7 and with the packages found in "requirements.txt".
B) Data_converted and Data_cleansed
The folder "Data_converted" contains the output of "convert_data_format.py" and "Data_cleansed" contains the output of "clean_corrupted_data.py".
Use cases
We point out that this repository can be used in two different ways: the pre-processing scripts can be applied to the raw TSO data, or the converted and cleansed time series can be used directly. For example, the cleansed data can be loaded as follows:
from helper_functions import *   # provides true_intervals() from this repository's scripts
import numpy as np
import pandas as pd

# Load the cleansed frequency time series (zipped CSV) as a pandas Series.
cleansed_data = pd.read_csv('/Path_to_cleansed_data/data.zip',
                            index_col=0, header=None, squeeze=True,
                            parse_dates=[0])

# Find contiguous intervals without NaNs and select the longest one.
valid_bounds, valid_sizes = true_intervals(~cleansed_data.isnull())
start, end = valid_bounds[np.argmax(valid_sizes)]
data_without_nan = cleansed_data.iloc[start:end]
License
We release the code in the folder "Scripts" under the MIT license [8]. In the case of Nationalgrid and Fingrid, we further release the pre-processed data in the folder "Data_converted" and "Data_cleansed" under the CC-BY 4.0 license [7]. TransnetBW originally did not publish their data under an open license. We have explicitly received the permission to publish the pre-processed version from TransnetBW. However, we cannot publish our pre-processed version under an open license due to the missing license of the original TransnetBW data.
In 2017, 107 random sampling plots in montane forests were collected from north to south, spanning the entire Sichuan Giant Panda habitat. The sampling strategy and field site information are shown in Li et al. (2019). The elevation within the sampling plots varied significantly (from ca. 2,000 to 3,600 m a.s.l.) (Li et al., 2019). The main vegetation types in those plots were coniferous and broad-leaved mixed forests, and evergreen and deciduous broad-leaved mixed forests. Using questionnaires, we surveyed 72 local people from Minshan, Xiaoxiangling, and Qionglai in the Sichuan Giant Panda habitat in 2017. Those local villagers mainly participated in the local Giant Panda habitat conservation. The survey information included whether there was any interference in the sampling plots. In addition, we observed the plant species composition and environment in the montane forests to choos...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Ebitda Time Series for Nanjing Panda Electronics Co Ltd. Nanjing Panda Electronics Company Limited, together with its subsidiaries, engages in the smart transportation and safe city, industrial internet and intelligent manufacturing, and green and service-oriented electronic manufacturing businesses in China. The company offers automatic transmission equipment and industrial robots; railway transit auto fare collection (AFC) and AFC clearing center system equipment, building intellectualization products, and system integration; electronic manufacturing services (EMS); mobile, digital, and network communication systems and products; digital products; and property management, warehousing, and catering services. It also provides automatic instruments and battery changing systems, recyclable bag making machine and automatic filling systems, welding machines, CF automatic handling systems, plant system platforms, laminating machines, and TFT-LCD clean workshop equipment and systems; and smart cards, rail transit communication systems, video surveillance integrated platforms, rail transit communication clock systems, logistics services, and LTE/WLAN multi-service bearer schemes, as well as radio dispatchers, dispatching server software, network management systems, and voice short message and multimedia vehicular stations. In addition, the company offers digital processing and acquisition modules, microwave frequency sources, clock synchronization devices, mobile data access gateways, wireless mesh, and dual-mode base stations. Further, it provides research systems, machine and substrate assemblies, automatic mounting, conventional injections, spray painting, molds and dies, information management, quality assurance systems, and special polymer materials; and single mode and push receivers, digital TV set top boxes, digital TV monitor and vehicular equipment, satellite flat antennas, and descramblers. Nanjing Panda Electronics Company Limited was founded in 1936 and is headquartered in Nanjing, the People's Republic of China.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Explore our public data on competitions, datasets, kernels (code / notebooks) and more. Meta Kaggle may not be the Rosetta Stone of data science, but we do think there's a lot to learn (and plenty of fun to be had) from this collection of rich data about Kaggle’s community and activity.
Strategizing to become a Competitions Grandmaster? Wondering who, where, and what goes into a winning team? Choosing evaluation metrics for your next data science project? The kernels published using this data can help. We also hope they'll spark some lively Kaggler conversations and be a useful resource for the larger data science community.
This dataset is made available as CSV files through Kaggle Kernels. It contains tables on public activity from Competitions, Datasets, Kernels, Discussions, and more. The tables are updated daily.
Please note: This data is not a complete dump of our database. Rows, columns, and tables have been filtered out and transformed.
In August 2023, we released Meta Kaggle for Code, a companion to Meta Kaggle containing public, Apache 2.0 licensed notebook data. View the dataset and instructions for how to join it with Meta Kaggle here
We also updated the license on Meta Kaggle from CC-BY-NC-SA to Apache 2.0.
Notes on the data and the helper scripts referenced in the original description:
- The UserId column in the ForumMessages table has values that do not exist in the Users table.
- The Total columns are not exact counts; for example, the DatasetCount is not the total number of datasets with the Tag according to the DatasetTags table.
- The database tables are created with the db_abd_create_tables.sql script and cleaned with the clean_data.py script, which performs a series of cleaning steps (involving NULL values) for each table.
- Foreign keys are added with the add_foreign_keys.sql script, and the Total columns in the database tables are updated by running the update_totals.sql script.
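The first quirk above can be checked with pandas. A sketch under assumptions: the file names Users.csv and ForumMessages.csv and the column names Id and UserId follow the description above and may differ from the actual Meta Kaggle schema:

import pandas as pd

# Load the two tables from the Meta Kaggle CSV files (file and column names assumed).
users = pd.read_csv("Users.csv")
messages = pd.read_csv("ForumMessages.csv")

# Count forum messages whose UserId does not appear in the Users table.
orphaned = ~messages["UserId"].isin(users["Id"])
print(f"{orphaned.sum()} of {len(messages)} messages reference a missing user")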
This dataset is about Netflix Movies & TV Shows. It has 12 columns with some null values. The analysis uses the Pandas, plotly.express, and datetime libraries. I divided the analysis process into several parts for step-wise analysis and to answer trending questions on social media about Bollywood actors and actresses.

Missing data can be represented in several ways, for example as null or missing values. I used some of the standard data-analysis methods to clean the missing values.

I also used string methods on columns such as 'cast' and 'listed_in' to extract data.

Converting an object-typed column with the to_datetime function gives datetime objects, from which various parts of the date, such as year, month, and day, can be extracted.
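As a small illustration of those steps; the file name and the date_added column are assumptions, while 'cast' and 'listed_in' are named in the description above:

import pandas as pd

# Hypothetical file name; 'date_added' is an assumed column name.
df = pd.read_csv("netflix_titles.csv")

# Convert the object-typed date column to datetime, then pull out year and month.
df["date_added"] = pd.to_datetime(df["date_added"], errors="coerce")
df["year_added"] = df["date_added"].dt.year
df["month_added"] = df["date_added"].dt.month_name()

# String methods: split the comma-separated 'cast' and 'listed_in' columns into lists.
df["cast"] = df["cast"].fillna("").str.split(", ")
df["listed_in"] = df["listed_in"].str.split(", ")

# Example: count titles added per month.
print(df.groupby("month_added").size())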
Here are several eye-catching questions that the analysis addresses:
- Show all Movies & TV Shows released by month
- Count all unique rating types and find which rating appears most often
- All movies with Salman, Shah Rukh, and Akshay Kumar
- Find the Movies & Series with the maximum running time
- Year-on-year shows added on Netflix, by type
- All of Akshay Kumar's comedy movies, Shah Rukh's movies with Kajol, and Salman-Akshay movies
- Which director has made the most TV Shows
- Actors and actresses who have appeared in the most movies
- Which genres have the most Movies and TV Shows
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is generated synthetically to create tables with the following characteristics:
Empty cell percentage in the range [0, 30]
There is a clear separator between rows and columns (Structured).
15 <= num rows <= 30, 7 <= num columns <= 15 (Long)
Load the dataset
import io

import pandas as pd
from PIL import Image

def bytes_to_image(image_bytes: bytes) -> Image.Image:
    # Decode the stored raw image bytes into a PIL image.
    return Image.open(io.BytesIO(image_bytes))

def parse_annotations(annotations: str) -> pd.DataFrame:
    …

See the full description on the dataset page: https://huggingface.co/datasets/nanonets/long_dense_structured_table.
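To fetch the data itself, the Hugging Face datasets library can load the repository by id. A minimal sketch; the split name "train" and the record layout are assumptions:

from datasets import load_dataset

# Load the table-image dataset by repository id (split name assumed to be "train").
ds = load_dataset("nanonets/long_dense_structured_table", split="train")
print(ds)        # shows the available columns and number of rows

# One record; its image bytes and annotation string (column names depend on the
# actual schema) can then be decoded with the helper functions sketched above.
sample = ds[0]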
https://creativecommons.org/publicdomain/zero/1.0/
The repository contains data on party strength for each state, as shown on each state's corresponding party strength Wikipedia page (for example, Virginia).
Each state's Wikipedia page has a table with a detailed summary of its governing and representative bodies, but there is no dataset that collates these entries. I scraped each state's Wikipedia table and collated the entries into a single dataset. The data are stored in state_party_strength.csv and state_party_strength_cleaned.csv. The code that generated these files can be found in the corresponding Python notebooks.
The data contain information from 1980 on each state's: 1. governor and party 2. state house and senate composition 3. state representative composition in congress 4. electoral votes
Data in the clean version has been cleaned and processed substantially. Namely:
- all columns now contain homogeneous data within the column
- names and Wiki-citations have been removed
- only the party counts and party identification have been left
The notebook that created this file is here.
The data contained herein have not been altered from their Wikipedia tables except in two instances:
- Forced column names to be in accord across states
- Any needed data modifications (i.e. concatenated string columns) to retain information when combining columns
Please note that the right encoding for the dataset is "ISO-8859-1", not 'utf-8', though in future versions I will try to fix that to make it more accessible.
This means that you will likely have to perform further data wrangling prior to doing any substantive analysis. The notebook that has been used to create this data file is located here
The raw scraped data can be found in the pickle. This file contains a Python dictionary where each key is a US state name and each element is the raw scraped table in Pandas DataFrame format.
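A short loading sketch based on the notes above; the pickle file name is a placeholder, since it is not given here:

import pickle
import pandas as pd

# The cleaned CSV must be read with the ISO-8859-1 encoding noted above.
clean = pd.read_csv("state_party_strength_cleaned.csv", encoding="ISO-8859-1")

# The raw scraped tables: a dict mapping each US state name to a pandas DataFrame.
with open("state_party_strength_raw.pkl", "rb") as f:   # placeholder file name
    raw_tables = pickle.load(f)

print(clean.head())
print(raw_tables["Virginia"].head())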
Hope it proves useful to you in analyzing and using political patterns at the state level in the US for political and policy research.
This dataset contains 6 months of customer online orders. The data is simple but messy and unorganized. It is aimed at beginner and intermediate level users who want to improve their skills in Pandas, Matplotlib, and Seaborn.
The dataset contains columns like: crawl_timestamp, product_name, product_category_tree, retail_price, discounted_price, brand.
The main focus is to clean the dataset and make it organized using pandas.
I wouldn't be here without the help of data.world. Thank You.
I have some questions for this dataset:
1. What was the best month for sales? How much was earned that month?
2. What time should we display advertisements to maximize the likelihood of purchases?
3. Which category sold the most in that six-month period?
4. What were the top 10 products sold in that six-month period?
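A starting point for the first question, using the columns listed above; the file name and the use of crawl_timestamp as a stand-in for the order date are assumptions:

import pandas as pd

# Hypothetical file name; crawl_timestamp is used as a stand-in for the order date.
orders = pd.read_csv("customer_orders.csv")
orders["crawl_timestamp"] = pd.to_datetime(orders["crawl_timestamp"], errors="coerce")
orders["month"] = orders["crawl_timestamp"].dt.to_period("M")

# Revenue per month based on the discounted price actually paid.
monthly_sales = orders.groupby("month")["discounted_price"].sum().sort_values(ascending=False)
print(monthly_sales.head(1))   # best month and its total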