Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Part of the dissertation Pitch of Voiced Speech in the Short-Time Fourier Transform: Algorithms, Ground Truths, and Evaluation Methods.
© 2020, Bastian Bechtold. All rights reserved.
Estimating the fundamental frequency of speech remains an active area of research, with varied applications in speech recognition, speaker identification, and speech compression. A vast number of algorithms for estimating this quantity have been proposed over the years, and a number of speech and noise corpora have been developed for evaluating their performance. The present dataset contains estimated fundamental frequency tracks of 25 algorithms, six speech corpora, two noise corpora, at nine signal-to-noise ratios between -20 and 20 dB SNR, as well as an additional evaluation of synthetic harmonic tone complexes in white noise.
The dataset also contains pre-calculated performance measures both novel and traditional, in reference to each speech corpus’ ground truth, the algorithms’ own clean-speech estimate, and our own consensus truth. It can thus serve as the basis for a comparison study, or to replicate existing studies from a larger dataset, or as a reference for developing new fundamental frequency estimation algorithms. All source code and data is available to download, and entirely reproducible, albeit requiring about one year of processor-time.
Included Code and Data
ground truth data.zip
is a JBOF dataset of fundamental frequency estimates and ground truths of all speech files in the following corpora:
noisy speech data.zip
is a JBOF dataset of fundamental frequency estimates of speech files mixed with noise from the following corpora:
synthetic speech data.zip
is a JBOF dataset of fundamental frequency estimates of synthetic harmonic tone complexes in white noise.
noisy_speech.pkl
and synthetic_speech.pkl
are pickled Pandas dataframes of performance metrics derived from the above data for the following list of fundamental frequency estimation algorithms:
noisy speech evaluation.py
and synthetic speech evaluation.py
are Python programs to calculate the above Pandas dataframes from the above JBOF datasets. They calculate the following performance measures:
Pipfile
is a pipenv-compatible Pipfile for installing all prerequisites necessary for running the above Python programs.
The Python programs take about an hour to compute on a fast 2019 computer, and require at least 32 GB of memory.
References:
The US Census Bureau conducts the American Community Survey (ACS) 1-year and 5-year surveys, which record various demographics, and provides public access to the results through APIs. I called the APIs from Python using the requests library, then cleaned and organized the data into a usable format.
ACS Subject data [2011-2019] was accessed using Python by following the below API Link:
https://api.census.gov/data/2011/acs/acs1?get=group(B08301)&for=county:*
The data was obtained in JSON format by calling the above API and then imported into a Python Pandas DataFrame. The 84 variables returned comprise 21 Estimate values for various metrics, the 21 corresponding Margin of Error values, and the Annotation values for each Estimate and Margin of Error. The data then went through several cleaning steps in Python, in which excess variables were removed and the columns were renamed. Web scraping was used to extract the variables' names and replace the codes in the raw column names.
The above step was carried out for multiple ACS/ACS-1 datasets spanning 2011-2019, and the results were merged into a single Python Pandas DataFrame. The columns were rearranged, and the "NAME" column was split into two columns, 'StateName' and 'CountyName'. Counties for which no data was available were removed from the DataFrame. Once the DataFrame was ready, it was split into two new DataFrames, one for State data and one for County data, and exported in '.csv' format.
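The workflow described above can be sketched roughly as follows. This is a minimal illustration, not the author's actual script: the function names `fetch_group` and `acs_json_to_frame` and the added `Year` column are assumptions made for the example. It relies on the Census API returning a JSON array of rows whose first row holds the column names, and on county-level `NAME` values of the form "County, State".

```python
import pandas as pd


def fetch_group(year, group="B08301"):
    """Download one ACS 1-year table for all counties (needs network access)."""
    import requests  # imported lazily so the parsing helper below works offline

    url = (f"https://api.census.gov/data/{year}/acs/acs1"
           f"?get=group({group})&for=county:*")
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    return response.json()


def acs_json_to_frame(rows, year):
    """Turn the API's list-of-lists payload into a tidy DataFrame."""
    header, *records = rows  # the first row carries the column names
    df = pd.DataFrame(records, columns=header)
    # County-level NAME values look like "Baldwin County, Alabama"
    parts = df["NAME"].str.split(", ", n=1, expand=True)
    df["CountyName"] = parts[0]
    df["StateName"] = parts[1]
    df["Year"] = year  # hypothetical helper column for merging several years
    return df.drop(columns="NAME")
```

Frames for several years could then be combined with `pd.concat([acs_json_to_frame(fetch_group(y), y) for y in range(2011, 2020)])`, assuming the table is published for every year in that range.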
More information about the source of Data can be found at the URL below:
US Census Bureau. (n.d.). About: Census Bureau API. Retrieved from Census.gov
https://www.census.gov/data/developers/about.html
I hope this data helps you to create something beautiful, and awesome. I will be posting a lot more databases shortly, if I get more time from assignments, submissions, and Semester Projects 🧙🏼♂️. Good Luck.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This project focuses on data mapping, integration, and analysis to support the development and enhancement of six UNCDF operational applications: OrgTraveler, Comms Central, Internal Support Hub, Partnership 360, SmartHR, and TimeTrack. These apps streamline workflows for travel claims, internal support, partnership management, and time tracking within UNCDF.
Key Features and Tools:
Data Mapping for Salesforce CRM Migration: Structured and mapped data flows to ensure compatibility and seamless migration to Salesforce CRM.
Python for Data Cleaning and Transformation: Utilized pandas, numpy, and APIs to clean, preprocess, and transform raw datasets into standardized formats.
Power BI Dashboards: Designed interactive dashboards to visualize workflows and monitor performance metrics for decision-making.
Collaboration Across Platforms: Integrated Google Colab for code collaboration and Microsoft Excel for data validation and analysis.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials
Background
This dataset contains data from monotonic and cyclic loading experiments on structural metallic materials. The materials are primarily structural steels and one iron-based shape memory alloy is also included. Summary files are included that provide an overview of the database and data from the individual experiments is also included.
The files included in the database are outlined below and the format of the files is briefly described. Additional information regarding the formatting can be found through the post-processing library (https://github.com/ahartloper/rlmtp/tree/master/protocols).
Usage
Included Files
File Format: Downsampled Data
These are the "LP_
These data files can be easily loaded using the pandas library in Python through:
import pandas
data = pandas.read_csv(data_file, index_col=0)
The data is formatted so it can be used directly in RESSPyLab (https://github.com/AlbanoCastroSousa/RESSPyLab). Note that the column names "e_true" and "Sigma_true" were kept for backwards compatibility reasons with RESSPyLab.
File Format: Unreduced Data
These are the "LP_
The data can be loaded and used similarly to the downsampled data.
File Format: Overall_Summary
The overall summary file provides data on all the test specimens in the database. The columns include:
File Format: Summarized_Mechanical_Props_Campaign
Meant to be loaded in Python as a pandas DataFrame with multi-indexing, e.g.,
import pandas as pd
# date and version are the suffix strings in the summary file's name
tab1 = pd.read_csv('Summarized_Mechanical_Props_Campaign_' + date + version + '.csv',
index_col=[0, 1, 2, 3], skipinitialspace=True, header=[0, 1],
keep_default_na=False, na_values='')
Caveats
https://creativecommons.org/publicdomain/zero/1.0/
This data was obtained from the Maricopa County Assessor under the search "Fast Food". The query has approximately 1342 results, with only 1000 returned due to MCA Data Policies.
Because some Subdivision Name values contain unescaped commas that prevented Pandas from aligning the columns properly, I performed some manual cleaning in LibreOffice.
Aside from a handful of Null values, the data is fairly clean and requires little from Pandas.
Here are the sums and percentage of NULLS in the dataframe.
Interestingly, there are 17 NULLS that do not have any physical addresses. This amounts to 1.7% of values for the Address, City, and Zip columns, and the missing values all fall in the same rows.
I have looked into a couple of these on the Maricopa County Assessor's GIS Portal, and they do not appear to have any assigned physical addresses. This is a good avenue of exploration for EDA. Possibly an error that could be corrected, or some obscure legal reason, but interesting nonetheless.
Additionally, there are 391 NULLS in Subdivision Name, accounting for 39.1%. This is a feature that I am interested in exploring to determine if there are any predominant groups. It could also generate a list of Entities that can be searched later to see if the dataset can be enriched beyond its initial 1,000 record limit.
There are 348 NULLS in the MCR column. This is the definition according to the MCA Glossary:
MCR (MARICOPA COUNTY RECORDER NUMBER)
Often associated with recorded plat maps.
This seems to be an uninteresting nominal value, so I will drop this column.
While Property Type and Rental have no NULLS, 100% of those values are Fast Food Restaurant and N (for No), so they offer no useful information and will be dropped.
I will leave the S/T/R column. Although it also seems to hold uninteresting nominal values, I am curious whether there are predominant groups, and since it has no NULLS, it might be useful for further data enrichment.
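The null counts and the column pruning discussed above could be reproduced along these lines. This is only a sketch: the helper names `null_summary` and `drop_constant_columns` are made up for the example, and dropping MCR would still be a separate, manual decision.

```python
import pandas as pd


def null_summary(df):
    """Count NULLS per column and express them as a percentage of all rows."""
    counts = df.isnull().sum()
    percent = (counts / len(df) * 100).round(1)
    return pd.DataFrame({"nulls": counts, "percent": percent})


def drop_constant_columns(df):
    """Drop columns that hold a single value in every row (e.g. Property Type)."""
    varying = df.nunique(dropna=False) > 1
    return df.loc[:, varying]
```

Calling `null_summary(df).sort_values("nulls", ascending=False)` lists the columns with the most missing values first.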
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains GPS tracking data and performance metrics for motorcycle taxis (boda bodas) in Nairobi, Kenya, comparing traditional internal combustion engine (ICE) motorcycles with electric motorcycles. The study was conducted in two phases:
Baseline Phase: 118 ICE motorcycles tracked over 14 days (2023-11-13 to 2023-11-26)
Transition Phase: 108 ICE motorcycles (control) and 9 electric motorcycles (treatment) tracked over 12 days (2023-12-10 to 2023-12-21)
The dataset is organised into two main categories:
Trip Data: Individual trip-level records containing timing, distance, duration, location, and speed metrics
Daily Data: Daily aggregated summaries containing usage metrics, economic data, and energy consumption
This dataset enables comparative analysis of electric vs. ICE motorcycle performance, economic modelling of transportation costs, environmental impact assessment, urban mobility pattern analysis, and energy efficiency studies in emerging markets.
Institutions:
EED Advisory
Clean Air Taskforce
Stellenbosch University
Steps to reproduce:
Raw Data Collection
GPS tracking devices installed on motorcycles, collecting location data at 10-second intervals
Rider-reported information on revenue, maintenance costs, and fuel/electricity usage
Processing Steps
GPS data cleaning: Filtered invalid coordinates, removed duplicates, interpolated missing points
Trip identification: Defined by >1 minute stationary periods or ignition cycles
Trip metrics calculation: Distance, duration, idle time, average/max speeds
Daily data aggregation: Summed by user_id and date with self-reported economic data
Validation: Cross-checked with rider logs and known routes
Anonymisation: Removed start and end coordinates for first and last trips of each day to protect rider privacy and home locations
Technical Information
Geographic coverage: Nairobi, Kenya
Time period: November-December 2023
Time zone: UTC+3 (East Africa Time)
Currency: Kenyan Shillings (KES)
Data format: CSV files
Software used: Python 3.8 (pandas, numpy, geopy)
Notes: Some location data points are intentionally missing to protect rider privacy. Self-reported economic and energy consumption data has some missing values where riders did not report.
Categories
Motorcycle, Transportation in Africa, Electric Vehicles
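The daily aggregation step (trips summed by user_id and date) might look roughly like the sketch below. Only `user_id` is named in the description; the other column names (`start_time`, `distance_km`, `duration_min`) are hypothetical placeholders and would need to be matched to the actual CSV headers.

```python
import pandas as pd


def daily_summary(trips):
    """Aggregate trip-level rows into one row per rider and calendar day."""
    trips = trips.copy()
    # Derive the calendar date from the (hypothetical) trip start timestamp
    trips["date"] = pd.to_datetime(trips["start_time"]).dt.date
    return trips.groupby(["user_id", "date"], as_index=False).agg(
        trips=("distance_km", "size"),       # number of trips that day
        distance_km=("distance_km", "sum"),  # total distance travelled
        duration_min=("duration_min", "sum"),
    )
```

The self-reported economic data could then be joined onto this result with `DataFrame.merge` on `user_id` and `date`.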
Overview
This repository contains ready-to-use frequency time series as well as the corresponding pre-processing scripts in Python. The data covers three synchronous areas of the European power grid:
This work is part of the paper "Predictability of Power Grid Frequency"[1]. Please cite this paper when using the data and the code. For detailed documentation of the pre-processing procedure, we refer to the supplementary material of the paper.
Data sources
We downloaded the frequency recordings from publicly available repositories of three different Transmission System Operators (TSOs).
Content of the repository
A) Scripts
The Python scripts run with Python 3.7 and with the packages found in "requirements.txt".
B) Data_converted and Data_cleansed
The folder "Data_converted" contains the output of "convert_data_format.py" and "Data_cleansed" contains the output of "clean_corrupted_data.py".
Use cases
We point out that this repository can be used in two different ways:
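from helper_functions import *
import numpy as np
import pandas as pd
# Load the cleansed series; .squeeze("columns") replaces the squeeze=True
# keyword, which was removed from read_csv in pandas 2.0
cleansed_data = pd.read_csv('/Path_to_cleansed_data/data.zip',
index_col=0, header=None,
parse_dates=[0]).squeeze("columns")
# Find all NaN-free intervals and keep the longest one
valid_bounds, valid_sizes = true_intervals(~cleansed_data.isnull())
start, end = valid_bounds[np.argmax(valid_sizes)]
data_without_nan = cleansed_data.iloc[start:end]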
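The snippet above relies on `true_intervals` from the repository's `helper_functions`, whose implementation is not shown here. For readers without the repository at hand, the same idea (extracting the longest NaN-free stretch of a series) can be sketched independently. The function `longest_valid_segment` below is a hypothetical stand-in, not the repository's code.

```python
import numpy as np
import pandas as pd


def longest_valid_segment(series):
    """Return the longest contiguous stretch of the series without NaN."""
    valid = series.notna().to_numpy()
    # Pad with False so that every run of valid samples has two edges
    padded = np.concatenate(([False], valid, [False]))
    edges = np.flatnonzero(np.diff(padded.astype(int)))
    starts, ends = edges[::2], edges[1::2]  # ends are exclusive
    if len(starts) == 0:
        return series.iloc[0:0]  # no valid samples at all
    longest = np.argmax(ends - starts)
    return series.iloc[starts[longest]:ends[longest]]
```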
License
We release the code in the folder "Scripts" under the MIT license [8]. In the case of Nationalgrid and Fingrid, we further release the pre-processed data in the folder "Data_converted" and "Data_cleansed" under the CC-BY 4.0 license [7]. TransnetBW originally did not publish their data under an open license. We have explicitly received the permission to publish the pre-processed version from TransnetBW. However, we cannot publish our pre-processed version under an open license due to the missing license of the original TransnetBW data.
Netflix Dataset Exploration and Visualization
This project involves an in-depth analysis of the Netflix dataset to uncover key trends and patterns in the streaming platform’s content offerings. Using Python libraries such as Pandas, NumPy, and Matplotlib, this notebook visualizes and interprets critical insights from the data.
Objectives:
Analyze the distribution of content types (Movies vs. TV Shows)
Identify the most prolific countries producing Netflix content
Study the ratings and duration of shows
Handle missing values using techniques like interpolation, forward-fill, and custom replacements
Enhance readability with bar charts, horizontal plots, and annotated visuals
Key Visualizations:
Bar charts for type distribution and country-wise contributions
Handling missing data in rating, duration, and date_added
Annotated plots showing values for clarity
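As a rough illustration of the missing-value handling listed above, the sketch below applies a custom replacement, a forward-fill, and an interpolation to the `rating`, `date_added`, and `duration` columns. The helper name `clean_titles`, the placeholder label "Not Rated", and the derived `duration_value` column are assumptions for the example, not the notebook's actual code. Interpolating across unrelated titles is shown only to demonstrate the technique.

```python
import pandas as pd


def clean_titles(df):
    """Fill gaps in rating, date_added, and duration on a copy of the data."""
    out = df.copy()
    # Custom replacement: label missing ratings explicitly
    out["rating"] = out["rating"].fillna("Not Rated")
    # Forward-fill: reuse the previous row's date when date_added is missing
    out["date_added"] = pd.to_datetime(out["date_added"]).ffill()
    # Interpolation: take the leading number of e.g. "90 min", then fill gaps
    numbers = out["duration"].str.extract(r"(\d+)", expand=False).astype(float)
    out["duration_value"] = numbers.interpolate()
    return out
```

The type distribution for a bar chart can then be obtained with `df["type"].value_counts()`, assuming the column is named `type`.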
Tools Used:
Python 3
Pandas for data wrangling
Matplotlib for visualizations
Jupyter Notebook for hands-on analysis
Outcome: This project provides a clear view of Netflix's content library, helping data enthusiasts and beginners understand how to process, clean, and visualize real-world datasets effectively.
Feel free to fork, adapt, and extend the work.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for Image Impeccable
Dataset Description
This data was produced by ThinkOnward for the Image Impeccable Challenge, using a synthetic seismic dataset generator called Synthoseis.
Created by: Mike McIntire and Jesse Pisel
License: CC 4.0
Uses
How to generate a dataset
This dataset is provided as paired noisy and clean seismic volumes. Follow the following step to load the data to numpy volumes import pandas as pd import numpy as… See the full description on the dataset page: https://huggingface.co/datasets/thinkonward/image-impeccable.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset is an excerpt of the validation dataset used in:
Ruiz-Arias JA, Gueymard CA. Review and performance benchmarking of 1-min solar irradiance components separation methods: The critical role of dynamically-constrained sky conditions. Submitted for publication to Renewable and Sustainable Energy Reviews.
and it is ready to use in the Python package splitting_models developed during that research. See the documentation in the Python package for usage details. Below, there is a detailed description of the dataset.
The data is in a single parquet file that contains 1-min time series of solar geometry, clear-sky solar irradiance simulations, solar irradiance observations and CAELUS sky types for 5 BSRN sites, one per primary Köppen-Geiger climate, namely: Minamitorishima (mnm), JP, for equatorial climate; Alice Springs (asp), AU, for dry climate; Carpentras (car), FR, for temperate climate; Bondville (bon), US, for continental climate; and Sonnblick (son), AT, for cold/polar/snow climate. It includes one calendar year per site. The BSRN data is publicly available. See download instructions in https://bsrn.awi.de/data.
The specific variables included in the dataset are:
The dataset can be easily loaded in a Python Pandas DataFrame as follows:
import pandas as pd
data = pd.read_parquet(
The dataframe has a multi-index with two levels: times_utc and site. The former are the UTC timestamps at the center of each 1-min interval. The latter is each site's label.
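Given the two-level index described above, the rows of one site can be pulled out with a cross-section on the `site` level. The sketch below is a minimal example: the helper name `select_site` is made up, and the site labels are the ones listed in the description (e.g. "car" for Carpentras).

```python
import pandas as pd


def select_site(data, site):
    """Return one site's rows, indexed by times_utc only."""
    return data.xs(site, level="site")
```

For example, `select_site(data, "car")` would return the Carpentras time series with a plain timestamp index, which is convenient for resampling or plotting.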
Open Government Licence 3.0 http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
If you want to give feedback on this dataset, or wish to request it in another form (e.g. CSV), please fill out this survey here. We are a not-for-profit research organisation keen to see how others use our open models and tools, so all feedback is appreciated! It's a short form that takes 5 minutes to complete.
Important Note: Before downloading this dataset, please read the License and Software Attribution section at the bottom.
This dataset aligns with the work published in Centre for Net Zero's report "Hitting the Target". In this work, we simulate a range of interventions to model the situations in which we believe the UK will meet its target of 600,000 heat pump installations per year by 2028. For full modelling assumptions and findings, read our report on our website.
The code for running our simulation is open source here.
This dataset contains over 9 million households that have been address matched between Energy Performance Certificates (EPC) data and Price Paid Data (PPD). The code for our address matching is here. Since these datasets are released under the Open Government Licence (OGL), this dataset is too. We model specific columns from various datasets, as set out in the methodology section of our report, to simplify and clean up this dataset for academic use. License information is also available in the appendix of our report above.
The EPC data loaders can be found here (the data is here) and the rest of the schemas and data download locations can be found here.
Note that this dataset is not regularly maintained or updated. It is correct as of January 2022. The data was curated and tested using dbt via this Github repository and would be simple to rerun on the latest data.
The schema / data dictionary for this data can be found here.
Our recommended way of loading this data is in Python. After downloading all "parts" of the dataset to a folder, you can run:
import pandas as pd
data = pd.read_parquet("path/to/data/folder/")
Licenses and software attribution:
For EPC, PPD and UK House Price Index data:
For the EPC data, we are permitted to republish this provided we mention that all researchers who download this dataset must follow these copyright restrictions. We do not explicitly release any Royal Mail address data; instead, we use these fields to generate a pseudonymised "address_cluster_id" which reflects a unique combination of the address lines and postcodes, as well as other metadata. Under ICO and GDPR guidelines, this still counts as personal data, but we have taken measures to pseudonymise it as far as possible to fulfil our obligations as a data processor. You must read this carefully before downloading the data, and ensure that you are using it for the research purposes determined by this copyright notice.
Contains HM Land Registry data © Crown copyright and database right 2021. This data is licensed under the Open Government Licence v3.0.
Contains OS data © Crown copyright and database right 2022.
Contains Office for National Statistics data licensed under the Open Government Licence v.3.0.
The OGL v3.0 license states that we are free to:
copy, publish, distribute and transmit the Information;
adapt the Information;
exploit the Information commercially and non-commercially for example, by combining it with other Information, or by including it in your own product or application.
However we must (where we do any of the above):
acknowledge the source of the Information in your product or application by including or linking to any attribution statement specified by the Information Provider(s) and, where possible, provide a link to this licence;
You can see more information here.
For XOServe Off Gas Postcodes:
This dataset has been released openly for all uses here.
For the address matching:
GNU Parallel: O. Tange (2018): GNU Parallel 2018, March 2018, https://doi.org/10.5281/zenodo.1146014
This dataset comes from the context of real-world data science work and shows how data analysts and data scientists work.
The dataset consists of four columns Year, Level_1(Ethnic group/gender), Level_2(Age group), and population
I would sincerely like to thank GeoIQ for sharing this dataset with me along with the tasks. Having only a basic knowledge of Pandas, NumPy, and other Python data science libraries is not enough. How you execute tasks and how you preprocess the data before making any prediction are very important. Most of the datasets on Kaggle are clean and well arranged, but this dataset taught me how real-world data science and analysis work. Every data science beginner should work on this dataset and try to execute the tasks. It will give them good exposure to the real data science world.
This dataset contains 6 months of customer online orders. The data is simple but messy and unorganized. It is for beginner and intermediate users who want to improve their skills in Pandas, matplotlib, and seaborn.
The dataset contains columns such as: crawl_timestamp, product_name, product_category_tree, retail_price, discounted_price, brand.
The main focus is to clean the dataset and organize it using pandas.
I wouldn't be here without the help of data.world. Thank You.
I have some questions for this Dataset:
1. What was the best month for sales? How much was earned that month?
2. What time should we display advertisements to maximize the likelihood of purchases?
3. Which category sold most in that six month period?
4. Top 10 products sold most in that six month period?
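The first question could be approached roughly as follows. This sketch makes two assumptions that should be checked against the data: that `crawl_timestamp` can stand in for the order time, and that `discounted_price` is the amount earned per row. The helper name `best_month` is hypothetical.

```python
import pandas as pd


def best_month(orders):
    """Return the month with the highest summed discounted_price and that sum."""
    timestamps = pd.to_datetime(orders["crawl_timestamp"])
    months = timestamps.dt.to_period("M")
    monthly = orders["discounted_price"].groupby(months).sum()
    return monthly.idxmax(), monthly.max()
```

The remaining questions follow the same pattern: group by `timestamps.dt.hour` for advertisement timing, or use `value_counts()` on the category and product columns for the top sellers.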