Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Example DataFrame (Teeny-Tiny Castle)
This dataset is part of a tutorial tied to the Teeny-Tiny Castle, an open-source repository containing educational tools for AI Ethics and Safety research.
How to Use
```python
from datasets import load_dataset

dataset = load_dataset("AiresPucrs/example-data-frame", split="train")
```
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
See the package documentation website at dataset.dataobservatory.eu. Report bugs and suggestions on GitHub: https://github.com/dataobservatory-eu/dataset/issues

The primary aim of dataset is to build well-documented data.frames, tibbles or data.tables that follow the W3C Data Cube Vocabulary based on the statistical SDMX data cube model. Such standard R objects (data.frame, data.table, tibble, or well-structured lists like JSON) become highly interoperable and can be placed into relational databases, semantic web applications, archives, and repositories. They follow the FAIR principles: they are findable, accessible, interoperable and reusable.

Our datasets:

- Contain Dublin Core or DataCite (or both) metadata that makes them findable and more easily accessible via online libraries. See the vignette article Datasets With FAIR Metadata.
- Have dimensions that can be easily and unambiguously reduced to triples for RDF applications; they can be easily serialized to, or synchronized with, semantic web applications. See the vignette article From dataset To RDF.
- Contain processing metadata that greatly enhances the reproducibility of the results and the reviewability of the contents of the dataset, including metadata defined by the DDI Alliance, which is particularly helpful for not-yet-processed data.
- Follow the datacube model of the Statistical Data and Metadata eXchange, therefore allowing easy refreshing with new data from the source of the analytical work; this is particularly useful for datasets containing results of statistical operations in R.
- Export correctly with FAIR metadata to the most used file formats, and publish straightforwardly to open science repositories with correct bibliographical and use metadata. See Export And Publish a dataset.
- Are relatively lightweight in dependencies and work easily with data.frame, tibble or data.table R objects.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Dataframe Detection is a dataset for object detection tasks - it contains Student Responses On Exams annotations for 1,052 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The archive contains two datasets that have been used to empirically evaluate MAT-Builder, a system to generate multiple aspect trajectories.
The first one is located in the "rome" folder and contains 26395 trajectories from 3181 individuals. The trajectories move over the city of Rome and were collected from OpenStreetMap. The folder also contains auxiliary datasets: the set of POIs within the province of Rome's boundaries, downloaded from OpenStreetMap (see the "poi" subfolder); historical weather information, downloaded from Meteostat (https://meteostat.net/it/) (see the "weather" subfolder); and a synthetically generated dataset of social media posts from the individuals (see the "tweets" subfolder). All the datasets are pandas DataFrames, except for the POI dataset, which is a geopandas DataFrame. All the datasets have been stored in the parquet format.
The second one is located in the "geolife" folder and contains the GeoLife dataset, with 17621 trajectories from 178 users. The timestamps of the trajectory samples have been adjusted from the GMT to the GMT+8 timezone. As in the former dataset's case, this folder also contains a dataset of POIs, a dataset of historical weather information, and a synthetically generated dataset of social media posts.
For more information on the MAT-Builder project (i.e., published papers, how to use the datasets, how the information within the datasets is structured, and so on), see the MAT-Builder GitHub page: https://github.com/chiarap2/MAT_Builder.
This dataset was created by Anton Kostin
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the two semantically enriched trajectory datasets introduced in the CIKM Resource Paper "A Semantically Enriched Mobility Dataset with Contextual and Social Dimensions", by Chiara Pugliese (CNR-IIT), Francesco Lettich (CNR-ISTI), Guido Rocchietti (CNR-ISTI), Chiara Renso (CNR-ISTI), and Fabio Pinelli (IMT Lucca, CNR-ISTI).
The two datasets were generated with an open source pipeline based on the Jupyter notebooks published in the GitHub repository behind our resource paper, and our MAT-Builder system. Overall, our pipeline first generates the files that we provide in the [paris|nyc]_input_matbuilder.zip archives; the files are then passed as input to the MAT-Builder system, which ultimately generates the two semantically enriched trajectory datasets for Paris and New York City, both in tabular and RDF formats. For more details on the input and output data, please see the sections below.
The [paris|nyc]_input_matbuilder.zip archives contain the data sources we used with the MAT-Builder system to semantically enrich raw preprocessed trajectories. More specifically, the archives contain the following files:
The [paris|nyc]_output_tabular.zip zip archives contain the output files generated by MAT-Builder that express the semantically enriched Paris and New York City datasets in tabular format. More specifically, they contain the following files:
There is then a second set of columns which represents the characteristics of the POI that has been associated with a stop. The relevant ones are:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Paper: The Balance-Scale Task Revisited: A Comparison of Statistical Models for Rule-Based and Information-Integration Theories of Proportional Reasoning. Abe Hofman, Ingmar Visser, Brenda Jansen & Han van der Maas; 15-2-2015.

The "dataBS.Rdata" file includes four dataframes based on two different datasets: a paper-and-pencil dataset collected by Jansen & van der Maas (1997), and an online dataset collected with the Math Garden. Description of the four dataframes:

1) student_info_pp: Student information of the paper-and-pencil dataset
- id = student id
- age = student age

2) student_info_mg: Student information of the Math Garden dataset
- id = student id
- age = student age
- new = student has not played the task before data collection started
- practise = number of items made by students before the data collection started

3) responses_pp: Response information of the paper-and-pencil dataset in long format

4) responses_mg: Response information of the Math Garden dataset in long format
- id = student id
- it = item id
- item_type = item type as defined in paper
- product_difference = difference between the product of weights and distance on each side of the fulcrum
- weight_difference = difference between the weights on each side of the fulcrum
- distance_difference = difference between the distance of the weights on each side of the fulcrum
- resp = response; left, balance, right
- cor = 0 incorrect; 1 correct
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was constructed to compare the performance of various neural network architectures learning the flow maps of Hamiltonian systems. It was created for the paper: A Generalized Framework of Neural Networks for Hamiltonian Systems.
The dataset consists of trajectory data from three different Hamiltonian systems. Namely, the single pendulum, double pendulum and 3-body problem. The data was generated using numerical integrators. For the single pendulum, the symplectic Euler method with a step size of 0.01 was used. The data of the double pendulum was also computed by the symplectic Euler method, however, with an adaptive step size. The trajectories of the 3-body problem were calculated by the arbitrarily high-precision code Brutus.
For each Hamiltonian system, there is one file containing the entire trajectory information (*_all_runs.h5.1). In these files, the states along all trajectories are recorded with a step size of 0.01. These files are composed of several Pandas DataFrames. One DataFrame per trajectory, called "run0", "run1", ... and finally one large DataFrame in which all the trajectories are combined, called "all_runs". Additionally, one Pandas Series called "constants" is contained in these files, in which several parameters of the data are listed.
Also, there is a second file per Hamiltonian system in which the data is prepared as features and labels ready for neural networks to be trained (*_training.h5.1). Similar to the first type of files, they contain a Series called "constants". The features and labels are then separated into 6 DataFrames called "features", "labels", "val_features", "val_labels", "test_features" and "test_labels". The data is split into 80% training data, 10% validation data and 10% test data.
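The 80/10/10 partition described above can be sketched with plain pandas slicing; the column names below are invented stand-ins for the trajectory states, not the actual schema of the files:

```python
import numpy as np
import pandas as pd

# Stand-in data: 1000 rows of generic state variables.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(1000, 4)),
                    columns=["q1", "q2", "p1", "p2"])

# 80% training, 10% validation, 10% test, as in the dataset description.
n = len(data)
n_train = int(0.8 * n)
n_val = int(0.1 * n)

features = data.iloc[:n_train]
val_features = data.iloc[n_train:n_train + n_val]
test_features = data.iloc[n_train + n_val:]

print(len(features), len(val_features), len(test_features))  # 800 100 100
```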
The code used to train various neural network architectures on this data can be found on GitHub at: https://github.com/AELITTEN/GHNN.
Already trained neural networks can be found on GitHub at: https://github.com/AELITTEN/NeuralNets_GHNN.
|  | Single pendulum | Double pendulum | 3-body problem |
| --- | --- | --- | --- |
| Number of trajectories | 500 | 2000 | 5000 |
| Final time in all_runs | T (one period of the pendulum) | 10 | 10 |
| Final time in training data | 0.25*T | 5 | 5 |
| Step size in training data | 0.1 | 0.1 | 0.5 |
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
We performed CODEX (co-detection by indexing) multiplexed imaging on 24 sections of the human intestine from 3 donors (B004, B005, B006) using a panel of 47 oligonucleotide-barcoded antibodies. We also performed CODEX imaging on both human tonsil and Barrett's esophagus (BE) using a panel of 57 oligonucleotide-barcoded antibodies. Subsequently, images underwent standard CODEX image processing (tile stitching, drift compensation, cycle concatenation, background subtraction, deconvolution, and determination of the best focal plane), single-cell segmentation, and column marker z-normalization by tissue. The output of this process was dataframes of 870,000 cells and 220,000 cells, respectively, with fluorescence values quantified for each marker. Methods: see README file.
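Per-tissue z-normalization of marker columns, as described above, maps naturally onto a pandas groupby-transform. The sketch below is illustrative only; the marker names and values are invented stand-ins:

```python
import pandas as pd

# Tiny stand-in for the single-cell fluorescence table.
cells = pd.DataFrame({
    "tissue": ["B004", "B004", "B005", "B005"],
    "CD3":    [10.0, 14.0, 3.0, 5.0],
    "CD45":   [2.0, 6.0, 8.0, 4.0],
})

markers = ["CD3", "CD45"]

# z-normalize each marker column within each tissue.
z = cells.groupby("tissue")[markers].transform(
    lambda col: (col - col.mean()) / col.std()
)
normalized = pd.concat([cells[["tissue"]], z], axis=1)
print(normalized)
```

With two cells per tissue this yields symmetric z-scores of about ±0.707 per marker (pandas uses the sample standard deviation, ddof=1).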
Libraries Import:
- Importing necessary libraries such as pandas, seaborn, matplotlib, scikit-learn's KMeans, and warnings.

Data Loading and Exploration:
- Reading a dataset named "Mall_Customers.csv" into a pandas DataFrame (df).
- Displaying the first few rows of the dataset using df.head().
- Conducting univariate analysis by calculating descriptive statistics with df.describe().

Univariate Analysis:
- Visualizing the distribution of the 'Annual Income (k$)' column using sns.distplot.
- Looping through selected columns ('Age', 'Annual Income (k$)', 'Spending Score (1-100)') and plotting individual distribution plots.

Bivariate Analysis:
- Creating a scatter plot for 'Annual Income (k$)' vs 'Spending Score (1-100)' using sns.scatterplot.
- Generating a pair plot for selected columns with gender differentiation using sns.pairplot.

Gender-Based Analysis:
- Grouping the data by 'Gender' and calculating the mean for selected columns.
- Computing the correlation matrix for the grouped data and visualizing it using a heatmap.

Univariate Clustering:
- Applying KMeans clustering with 3 clusters based on 'Annual Income (k$)' and adding the 'Income Cluster' column to the DataFrame.
- Plotting the elbow method to determine the optimal number of clusters.

Bivariate Clustering:
- Applying KMeans clustering with 5 clusters based on 'Annual Income (k$)' and 'Spending Score (1-100)' and adding the 'Spending and Income Cluster' column.
- Plotting the elbow method for bivariate clustering and visualizing the cluster centers on a scatter plot.
- Displaying a normalized cross-tabulation between 'Spending and Income Cluster' and 'Gender'.

Multivariate Clustering:
- Performing multivariate clustering by creating dummy variables, scaling selected columns, and applying KMeans clustering.
- Plotting the elbow method for multivariate clustering.

Result Saving:
- Saving the modified DataFrame with cluster information to a CSV file named "Result.csv".
- Saving the multivariate clustering plot as an image file ("Multivariate_figure.png").
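The elbow-method and bivariate-clustering steps above can be sketched as follows. This is a minimal illustration on synthetic stand-in data rather than "Mall_Customers.csv" (so 2 clusters are used instead of the 5 in the original workflow), assuming scikit-learn is installed:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the customer data: two well-separated groups.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "Annual Income (k$)": np.concatenate(
        [rng.normal(30, 3, 50), rng.normal(80, 3, 50)]),
    "Spending Score (1-100)": np.concatenate(
        [rng.normal(20, 3, 50), rng.normal(75, 3, 50)]),
})

X = df[["Annual Income (k$)", "Spending Score (1-100)"]]

# Elbow method: inertia for k = 1..6 (normally plotted against k).
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 7)]

# Fit the chosen model and attach the cluster labels as a new column.
df["Spending and Income Cluster"] = (
    KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
)
print(df["Spending and Income Cluster"].value_counts())
```

Inertia drops sharply up to the "elbow" (here at k=2) and flattens afterwards, which is how the plot guides the choice of k.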
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here you can find raw data and information about each of the 34 datasets generated by the mulset algorithm and used for further analysis in SIMON.
Each dataset is stored in a separate folder which contains 4 files:
json_info: This file contains the number of features (with their names) and the number of subjects available for the dataset
data_testing: data frame with the data used to test the trained model
data_training: data frame with the data used to train the models
results: direct, unfiltered data from the database
Files are written in the feather format. Here is an example of the data structure for each file in the repository.
File was compressed using 7-Zip available at https://www.7-zip.org/.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
PandasPlotBench
PandasPlotBench is a benchmark to assess the capability of models in writing the code for visualizations given the description of the Pandas DataFrame. 🛠️ Task. Given the plotting task and the description of a Pandas DataFrame, write the code to build a plot. The dataset is based on the MatPlotLib gallery. The paper can be found on arXiv: https://arxiv.org/abs/2412.02764v1. To score your model on this dataset, you can use our GitHub repository. 📩 If you have… See the full description on the dataset page: https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench.
This dataset was created by KingOfDayDream
It contains the following files:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this upload we share processed crop type datasets from both France and Kenya. These datasets can be helpful for testing and comparing various domain adaptation methods. The datasets are processed, used, and described in this paper: https://doi.org/10.1016/j.rse.2021.112488 (arXiv version: https://arxiv.org/pdf/2109.01246.pdf).
In summary, each point in the uploaded datasets corresponds to a particular location. The label is the crop type grown at that location in 2017. The 70 processed features are based on Sentinel-2 satellite measurements at that location in 2017. The points in the France dataset come from 11 different departments (regions) in Occitanie, France, and the points in the Kenya dataset come from 3 different regions in Western Province, Kenya. Within each dataset there are notable shifts in the distribution of the labels and in the distribution of the features between regions. Therefore, these datasets can be helpful for testing and comparing methods that are designed to address such distributional shifts.
More details on the dataset and processing steps can be found in Kluger et al. (2021). Many of the processing steps were taken to deal with Sentinel-2 measurements that were corrupted by cloud cover. For users interested in the raw multi-spectral time series data and dealing with cloud cover issues on their own (rather than using the 70 processed features provided here), the raw dataset from Kenya can be found in Yeh et al. (2021), and the raw dataset from France can be made available upon request from the authors of this Zenodo upload.
All of the data uploaded here can be found in "CropTypeDatasetProcessed.RData". We also post the dataframes and tables within that .RData file as separate .csv files for users who do not have R. The contents of each R object (or .csv file) are described in the file "Metadata.rtf".
Preferred Citation:
-Kluger, D.M., Wang, S., Lobell, D.B., 2021. Two shifts for crop mapping: Leveraging aggregate crop statistics to improve satellite-based maps in new regions. Remote Sens. Environ. 262, 112488. https://doi.org/10.1016/j.rse.2021.112488.
-URL to this Zenodo post https://zenodo.org/record/6376160
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains pandas DataFrames that represent filtered versions of CMS Open Data (in the form of ROOT files) available on the CERN OpenData Portal. This dataset specifically contains data from a DYToMuMu process (Drell-Yan process resulting in two Muons in the final state), which is a simulated process created during the 2012 LHC run. A total of 121 (99 for real collision data) relevant variables are contained in the filtered pandas DataFrames that can be found here. A list of variables can be found below, for a full explanation of them, please refer to the following paper (PLACEHOLDER, REFERENCE PAPER HERE): nEvent, runNum, lumisection, evtNum; nMuon, vecMuon_PT, vecMuon_Eta, vecMuon_Phi, vecMuon_PTErr, vecMuon_Q, vecMuon_StaPt, vecMuon_StaEta, vecMuon_StaPhi, vecMuon_TrkIso03, vecMuon_EcalIso03, vecMuon_HcalIso03; nVertex, vecVertex_nTracksfit, vecVertex_ndof, vecVertex_Chi2, vecVertex_X, vecVertex_Y, vecVertex_Z; nEle, vecEle_PT, vecEle_Eta, vecEle_Phi, vecEle_Q, vecEle_TrkIso03, vecEle_EcalIso03, vecEle_HcalIso03, vecEle_D0, vecEle_Dz; nTau, vecTau_PT, vecTau_Eta, vecTau_Phi, vecTau_Q, vecTau_RawIso3Hits, vecTau_RawIsoMVA3oldDMwoLT, vecTau_RawIsoMVA3oldDMwLT, vecTau_RawIsoMVA3newDMwoLT, vecTau_RawIsoMVA3newDMwLT; nPhoton, vecPhoton_PT, vecPhoton_Eta, vecPhoton_Phi, vecPhoton_Hovere, vecPhoton_Sthovere, vecPhoton_HasPixelSeed, vecPhoton_IsConv, vecPhoton_PassElectronVeto; nMctruth, vecMctruth_PT, vecMctruth_Eta, vecMctruth_Phi, vecMctruth_Id_1, vecMctruth_Id_2, vecMctruth_X_1, vecMctruth_X_2, vecMctruth_PdgId, vecMctruth_Status, vecMctruth_Y, vecMctruth_Mass, vecMctruth_Mothers.first, vecMctruth_Mothers.second; nJets, vecJet_PT, vecJet_Eta, vecJet_Phi, vecJet_D0, vecJet_Dz, vecJet_nCharged, vecJet_nNeutrals, vecJet_nParticles, vecJet_Beta, vecJet_BetaStar, vecJet_dR2Mean, vecJet_Q, vecJet_Mass, vecJet_Area, vecJet_Energy, vecJet_chEmEnergy, vecJet_neuEmEnergy, vecJet_chHadEnergy, vecJet_neuHadEnergy, vecJet_ID, vecJet_Num, vecJet_mcFlavor, vecJet_GenPT, 
vecJet_GenEta, vecJet_GenPhi, vecJet_GenMass, vecJet_flavorMatchPT, vecJet_JEC, vecJet_MatchIdx; nPF, vecPF_PT, vecPF_Eta, vecPF_Phi, vecPF_Mass, vecPF_E, vecPF_Q, vecPF_PfType, vecPF_EcalE, vecPF_HcalE, vecPF_ndof, vecPF_Chi2, vecPF_pvId, vecPF_X, vecPF_Y, vecPF_Z, vecPF_JetNum; fMET_PT, fMET_Eta, fMET_Phi; HLT_Mu17_Mu8, HLT_Mu24, HLT_MET120_v, HLT_Ele27, HLT_HT350. For the datasets containing data from real collisions at the LHC, the following variables are NOT contained: nMctruth, vecMctruth_PT, vecMctruth_Eta, vecMctruth_Phi, vecMctruth_Id_1, vecMctruth_Id_2, vecMctruth_X_1, vecMctruth_X_2, vecMctruth_PdgId, vecMctruth_Status, vecMctruth_Y, vecMctruth_Mass, vecMctruth_Mothers.first, vecMctruth_Mothers.second; vecJet_mcFlavor, vecJet_GenPT, vecJet_GenEta, vecJet_GenPhi, vecJet_GenMass, vecJet_flavorMatchPT, vecJet_JEC, vecJet_MatchIdx
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Part of the dissertation Pitch of Voiced Speech in the Short-Time Fourier Transform: Algorithms, Ground Truths, and Evaluation Methods. © 2020, Bastian Bechtold. All rights reserved.

Estimating the fundamental frequency of speech remains an active area of research, with varied applications in speech recognition, speaker identification, and speech compression. A vast number of algorithms for estimating this quantity have been proposed over the years, and a number of speech and noise corpora have been developed for evaluating their performance. The present dataset contains estimated fundamental frequency tracks of 25 algorithms, six speech corpora, two noise corpora, at nine signal-to-noise ratios between -20 and 20 dB SNR, as well as an additional evaluation of synthetic harmonic tone complexes in white noise.

The dataset also contains pre-calculated performance measures, both novel and traditional, in reference to each speech corpus's ground truth, the algorithms' own clean-speech estimate, and our own consensus truth. It can thus serve as the basis for a comparison study, to replicate existing studies from a larger dataset, or as a reference for developing new fundamental frequency estimation algorithms. All source code and data are available to download, and entirely reproducible, albeit requiring about one year of processor-time.

Included Code and Data
ground truth data.zip is a JBOF dataset of fundamental frequency estimates and ground truths of all speech files in the following corpora:
- CMU-ARCTIC (consensus truth) [1]
- FDA (corpus truth and consensus truth) [2]
- KEELE (corpus truth and consensus truth) [3]
- MOCHA-TIMIT (consensus truth) [4]
- PTDB-TUG (corpus truth and consensus truth) [5]
- TIMIT (consensus truth) [6]
noisy speech data.zip is a JBOF dataset of fundamental frequency estimates of speech files mixed with noise from the following corpora:
- NOISEX [7]
- QUT-NOISE [8]
synthetic speech data.zip is a JBOF dataset of fundamental frequency estimates of synthetic harmonic tone complexes in white noise.

noisy_speech.pkl and synthetic_speech.pkl are pickled Pandas dataframes of performance metrics derived from the above data for the following list of fundamental frequency estimation algorithms:
- AUTOC [9]
- AMDF [10]
- BANA [11]
- CEP [12]
- CREPE [13]
- DIO [14]
- DNN [15]
- KALDI [16]
- MAPSMBSC [17]
- NLS [18]
- PEFAC [19]
- PRAAT [20]
- RAPT [21]
- SACC [22]
- SAFE [23]
- SHR [24]
- SIFT [25]
- SRH [26]
- STRAIGHT [27]
- SWIPE [28]
- YAAPT [29]
- YIN [30]
noisy speech evaluation.py and synthetic speech evaluation.py are Python programs to calculate the above Pandas dataframes from the above JBOF datasets. They calculate the following performance measures:
- Gross Pitch Error (GPE), the percentage of pitches where the estimated pitch deviates from the true pitch by more than 20%.
- Fine Pitch Error (FPE), the mean error of grossly correct estimates.
- High/Low Octave Pitch Error (OPE), the percentage of pitches that are GPEs and happen to be at an integer multiple of the true pitch.
- Gross Remaining Error (GRE), the percentage of pitches that are GPEs but not OPEs.
- Fine Remaining Bias (FRB), the median error of GREs.
- True Positive Rate (TPR), the percentage of true positive voicing estimates.
- False Positive Rate (FPR), the percentage of false positive voicing estimates.
- False Negative Rate (FNR), the percentage of false negative voicing estimates.
- F₁, the harmonic mean of precision and recall of the voicing decision.
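As an illustration of the first measure, Gross Pitch Error can be computed directly from its definition (the function below is a minimal sketch written for this description, not code from the evaluation scripts; unvoiced frames are marked here with a true pitch of 0):

```python
import numpy as np

def gross_pitch_error(true_f0, est_f0):
    """Percentage of voiced frames where the estimate deviates
    from the true pitch by more than 20%."""
    true_f0 = np.asarray(true_f0, dtype=float)
    est_f0 = np.asarray(est_f0, dtype=float)
    voiced = true_f0 > 0  # only evaluate voiced frames
    gross = np.abs(est_f0[voiced] - true_f0[voiced]) > 0.2 * true_f0[voiced]
    return 100.0 * gross.mean()

# One of the three voiced frames (200 -> 300 Hz) deviates by more than 20%:
print(gross_pitch_error([100, 200, 150, 0], [105, 300, 145, 0]))
```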
Pipfile is a pipenv-compatible pipfile for installing all prerequisites necessary for running the above Python programs.
The Python programs take about an hour to compute on a fast 2019 computer, and require at least 32 GB of memory.

References:
1. John Kominek and Alan W Black. CMU ARCTIC database for speech synthesis, 2003.
2. Paul C Bagshaw, Steven Hiller, and Mervyn A Jack. Enhanced Pitch Tracking and the Processing of F0 Contours for Computer Aided Intonation Teaching. In EUROSPEECH, 1993.
3. F Plante, Georg F Meyer, and William A Ainsworth. A Pitch Extraction Reference Database. In Fourth European Conference on Speech Communication and Technology, pages 837–840, Madrid, Spain, 1995.
4. Alan Wrench. MOCHA MultiCHannel Articulatory database: English, November 1999.
5. Gregor Pirker, Michael Wohlmayr, Stefan Petrik, and Franz Pernkopf. A Pitch Tracking Corpus with Evaluation on Multipitch Tracking Scenario. page 4, 2011.
6. John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, David S. Pallett, Nancy L. Dahlgren, and Victor Zue. TIMIT Acoustic-Phonetic Continuous Speech Corpus, 1993.
7. Andrew Varga and Herman J.M. Steeneken. Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication, 12(3):247–251, July 1993.
8. David B. Dean, Sridha Sridharan, Robert J. Vogt, and Michael W. Mason. The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms. Proceedings of Interspeech 2010, 2010.
9. Man Mohan Sondhi. New methods of pitch extraction. Audio and Electroacoustics, IEEE Transactions on, 16(2):262–266, 1968.
10. Myron J. Ross, Harry L. Shaffer, Asaf Cohen, Richard Freudberg, and Harold J. Manley. Average magnitude difference function pitch extractor. Acoustics, Speech and Signal Processing, IEEE Transactions on, 22(5):353–362, 1974.
11. Na Yang, He Ba, Weiyang Cai, Ilker Demirkol, and Wendi Heinzelman. BaNa: A Noise Resilient Fundamental Frequency Detection Algorithm for Speech and Music. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12):1833–1848, December 2014.
12. Michael Noll. Cepstrum Pitch Determination. The Journal of the Acoustical Society of America, 41(2):293–309, 1967.
13. Jong Wook Kim, Justin Salamon, Peter Li, and Juan Pablo Bello. CREPE: A Convolutional Representation for Pitch Estimation. arXiv:1802.06182 [cs, eess, stat], February 2018.
14. Masanori Morise, Fumiya Yokomori, and Kenji Ozawa. WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications. IEICE Transactions on Information and Systems, E99.D(7):1877–1884, 2016.
15. Kun Han and DeLiang Wang. Neural Network Based Pitch Tracking in Very Noisy Speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12):2158–2168, December 2014.
16. Pegah Ghahremani, Bagher BabaAli, Daniel Povey, Korbinian Riedhammer, Jan Trmal, and Sanjeev Khudanpur. A pitch extraction algorithm tuned for automatic speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 2494–2498. IEEE, 2014.
17. Lee Ngee Tan and Abeer Alwan. Multi-band summary correlogram-based pitch detection for noisy speech. Speech Communication, 55(7-8):841–856, September 2013.
18. Jesper Kjær Nielsen, Tobias Lindstrøm Jensen, Jesper Rindom Jensen, Mads Græsbøll Christensen, and Søren Holdt Jensen. Fast fundamental frequency estimation: Making a statistically efficient estimator computationally efficient. Signal Processing, 135:188–197, June 2017.
19. Sira Gonzalez and Mike Brookes. PEFAC - A Pitch Estimation Algorithm Robust to High Levels of Noise. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(2):518–530, February 2014.
20. Paul Boersma. Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. In Proceedings of the institute of phonetic sciences, volume 17, pages 97–110. Amsterdam, 1993.
21. David Talkin. A robust algorithm for pitch tracking (RAPT). Speech coding and synthesis, 495:518, 1995.
22. Byung Suk Lee and Daniel PW Ellis. Noise robust pitch tracking by subband autocorrelation classification. In Interspeech, pages 707–710, 2012.
23. Wei Chu and Abeer Alwan. SAFE: a statistical algorithm for F0 estimation for both clean and noisy speech. In INTERSPEECH, pages 2590–2593, 2010.
24. Xuejing Sun. Pitch determination and voice quality analysis using subharmonic-to-harmonic ratio. In Acoustics, Speech, and Signal Processing (ICASSP), 2002 IEEE International Conference on, volume 1, page I-333. IEEE, 2002.
25. Markel. The SIFT algorithm for fundamental frequency estimation. IEEE Transactions on Audio and Electroacoustics, 20(5):367–377, December 1972.
26. Thomas Drugman and Abeer Alwan. Joint Robust Voicing Detection and Pitch Estimation Based on Residual Harmonics. In Interspeech, pages 1973–1976, 2011.
27. Hideki Kawahara, Masanori Morise, Toru Takahashi, Ryuichi Nisimura, Toshio Irino, and Hideki Banno. TANDEM-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 3933–3936. IEEE, 2008.
28. Arturo Camacho. SWIPE: A sawtooth waveform inspired pitch estimator for speech and music. PhD thesis, University of Florida, 2007.
29. Kavita Kasi and Stephen A. Zahorian. Yet Another Algorithm for Pitch Tracking. In IEEE International Conference on Acoustics Speech and Signal Processing, pages I-361–I-364, Orlando, FL, USA, May 2002. IEEE.
30. Alain de Cheveigné and Hideki Kawahara. YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4):1917, 2002.
Dataset Details
This is a Bemba-to-English dataset for the machine translation task. This dataset is a customized version of FLORES-200. It includes parallel sentences between Bemba and English.
Preprocessing Notes
Drop some unused columns like URL, domain, topic, has_image, has_hyperlink. Merge the Bemba and English DataFrames on the ID column. Rename the columns from sentence_bem to text_bem and from sentence_en to text_en. Convert the dataframe into a DatasetDict.… See the full description on the dataset page: https://huggingface.co/datasets/kreasof-ai/flores200-eng-bem.
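The merge-and-rename steps above can be sketched with pandas on tiny stand-in frames (the sentence strings are placeholders, not actual corpus text):

```python
import pandas as pd

# Placeholder stand-ins for the per-language sentence tables.
bem = pd.DataFrame({"ID": [1, 2], "sentence_bem": ["<bemba 1>", "<bemba 2>"]})
eng = pd.DataFrame({"ID": [1, 2], "sentence_en": ["<english 1>", "<english 2>"]})

# Merge the Bemba and English DataFrames on the ID column,
# then rename the sentence columns.
merged = bem.merge(eng, on="ID").rename(
    columns={"sentence_bem": "text_bem", "sentence_en": "text_en"}
)
print(list(merged.columns))  # ['ID', 'text_bem', 'text_en']
```

The final conversion to a DatasetDict would go through `datasets.Dataset.from_pandas(merged)`.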
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the code and datasets used in the data analysis for "Fracture toughness of mixed-mode anticracks in highly porous materials". The analysis is implemented in Python, using Jupyter Notebooks.
The repository contains the following files:

- main.ipynb: Jupyter notebook with the main data analysis workflow.
- energy.py: Methods for the calculation of energy release rates.
- regression.py: Methods for the regression analyses.
- visualization.py: Methods for generating visualizations.
- df_mmft.pkl: Pickled DataFrame with experimental data gathered in the present work.
- df_legacy.pkl: Pickled DataFrame with literature data.

Required packages: pandas, matplotlib, numpy, scipy, tqdm, uncertainties, weac. Install all prerequisites with pip install -r requirements.txt, then open the main.ipynb notebook in Jupyter Notebook or JupyterLab.

The analysis uses the two datasets df_mmft.pkl and df_legacy.pkl, which contain experimental measurements and corresponding parameters. Below are the descriptions for each column in these DataFrames:

df_mmft.pkl

- exp_id: Unique identifier for each experiment.
- datestring: Date of the experiment as a string.
- datetime: Timestamp of the experiment.
- bunker: Field site of the experiment. Bunker IDs 1 and 2 correspond to field sites A and B, respectively.
- slope_incl: Inclination of the slope in degrees.
- h_sledge_top: Distance from sample top surface to the sled in mm.
- h_wl_top: Distance from sample top surface to weak layer in mm.
- h_wl_notch: Distance from the notch root to the weak layer in mm.
- rc_right: Critical cut length in mm, measured on the front side of the sample.
- rc_left: Critical cut length in mm, measured on the back side of the sample.
- rc: Mean of rc_right and rc_left.
- densities: List of density measurements in kg/m^3 for each distinct slab layer of each sample.
- densities_mean: Daily mean of densities.
- layers: 2D array with layer density (kg/m^3) and layer thickness (mm) pairs for each distinct slab layer.
- layers_mean: Daily mean of layers.
- surface_lineload: Surface line load of added surface weights in N/mm.
- wl_thickness: Weak-layer thickness in mm.
- notes: Additional notes regarding the experiment or observations.
- L: Length of the slab–weak-layer assembly in mm.

df_legacy.pkl

- #: Record number.
- rc: Critical cut length in mm.
- slope_incl: Inclination of the slope in degrees.
- h: Slab height in mm.
- density: Mean slab density in kg/m^3.
- L: Length of the slab–weak-layer assembly in mm.
- collapse_height: Weak-layer height reduction through collapse.
- layers_mean: 2D array with layer density (kg/m^3) and layer thickness (mm) pairs for each distinct slab layer.
- wl_thickness: Weak-layer thickness in mm.
- surface_lineload: Surface line load from added weights in N/mm.

For more detailed information on the datasets, refer to the paper or the documentation provided within the Jupyter notebook.
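Since the experiment tables are pickled DataFrames, they load with pandas.read_pickle. The sketch below round-trips a tiny stand-in frame using two of the documented columns (rc_right, rc_left; the values are invented) and computes rc as their mean, as in the column description above:

```python
import os
import tempfile

import pandas as pd

# Invented stand-in for a slice of df_mmft.pkl (cut lengths in mm).
df = pd.DataFrame({"rc_right": [240.0, 310.0], "rc_left": [250.0, 305.0]})
df["rc"] = df[["rc_right", "rc_left"]].mean(axis=1)  # rc = mean of both sides

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "df_mmft.pkl")
    df.to_pickle(path)
    df_mmft = pd.read_pickle(path)  # how the real file would be loaded

print(df_mmft["rc"].tolist())  # [245.0, 307.5]
```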