32 datasets found

h
warvan-ml-dataset
huggingface.co
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
warvan, warvan-ml-dataset [Dataset]. https://huggingface.co/datasets/warvan/warvan-ml-dataset
Explore at:
Authors
warvan
Description
Dataset Name

This dataset contains structured data for machine learning and analysis purposes.

Contents

data/sample.csv: Sample dataset file. data/train.csv: Training dataset. data/test.csv: Testing dataset. scripts/preprocess.py: Script for preprocessing the dataset. scripts/analyze.py: Script for data analysis.

Usage

Load the dataset using Pandas: import pandas as pd df = pd.read_csv('data/sample.csv')

Run preprocessing: python scripts/preprocess.py… See the full description on the dataset page: https://huggingface.co/datasets/warvan/warvan-ml-dataset.
An Empirical Study on Energy Usage Patterns of Different Variants of Data...
figshare.com
zip
Updated Nov 5, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Princy Chauhan (2024). An Empirical Study on Energy Usage Patterns of Different Variants of Data Processing Libraries [Dataset]. http://doi.org/10.6084/m9.figshare.27611421.v1
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.27611421.v1
Dataset updated
Nov 5, 2024
Dataset provided by
Figsharehttp://figshare.com/
Authors
Princy Chauhan
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
As computing power grows, so does the need for data processing, which uses a lot of energy in steps like cleaning and analyzing data. This study looks at the energy and time efficiency of four common Python libraries—Pandas, Vaex, Scikit-learn, and NumPy—tested on five datasets across 21 tasks. We compared the energy use of the newest and older versions of each library. Our findings show that no single library always saves the most energy. Instead, energy use varies by task type, how often tasks are done, and the library version. In some cases, newer versions use less energy, pointing to the need for more research on making data processing more energy-efficient.A zip file accompanying this study contains the scripts, datasets, and a README file for guidance. This setup allows for easy replication and testing of the experiments described, helping to further analyze energy efficiency across different libraries and tasks.
BCG Data Science Simulation
kaggle.com
Updated Feb 12, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
PAVITR KUMAR SWAIN (2025). BCG Data Science Simulation [Dataset]. https://www.kaggle.com/datasets/pavitrkumar/bcg-data-science-simulation
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 12, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
PAVITR KUMAR SWAIN
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
** Feature Engineering for Churn Prediction**

🚀**# BCG Data Science Job Simulation | Forage** This notebook focuses on feature engineering techniques to enhance a dataset for churn prediction modeling. As part of the BCG Data Science Job Simulation, I transformed raw customer data into valuable features to improve predictive performance.

📊 What’s Inside? ✅ Data Cleaning: Removing irrelevant columns to reduce noise ✅ Date-Based Feature Extraction: Converting raw dates into useful insights like activation year, contract length, and renewal month ✅ New Predictive Features:

consumption_trend → Measures if a customer’s last-month usage is increasing or decreasing total_gas_and_elec → Aggregates total energy consumption ✅ Final Processed Dataset: Ready for churn prediction modeling

📂Dataset Used: 📌 clean_data_after_eda.csv → Original dataset after Exploratory Data Analysis (EDA) 📌 clean_data_with_new_features.csv → Final dataset after feature engineering

🛠 Technologies Used: 🔹 Python (Pandas, NumPy) 🔹 Data Preprocessing & Feature Engineering

🌟 Why Feature Engineering? Feature engineering is one of the most critical steps in machine learning. Well-engineered features improve model accuracy and uncover deeper insights into customer behavior.

🚀 This notebook is a great reference for anyone learning data preprocessing, feature selection, and predictive modeling in Data Science!

📩 Connect with Me: 🔗 GitHub Repo: https://github.com/Pavitr-Swain/BCG-Data-Science-Job-Simulation 💼 LinkedIn: https://www.linkedin.com/in/pavitr-kumar-swain-ab708b227/

🔍 Let’s explore churn prediction insights together! 🎯
m
Reddit r/AskScience Flair Dataset
data.mendeley.com
Updated May 23, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sumit Mishra (2022). Reddit r/AskScience Flair Dataset [Dataset]. http://doi.org/10.17632/k9r2d9z999.3
Explore at:
Unique identifier
https://doi.org/10.17632/k9r2d9z999.3
Dataset updated
May 23, 2022
Authors
Sumit Mishra
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Reddit is a social news, content rating and discussion website. It's one of the most popular sites on the internet. Reddit has 52 million daily active users and approximately 430 million users who use it once a month. Reddit has different subreddits and here We'll use the r/AskScience Subreddit.

The dataset is extracted from the subreddit /r/AskScience from Reddit. The data was collected between 01-01-2016 and 20-05-2022. It contains 612,668 Datapoints and 25 Columns. The database contains a number of information about the questions asked on the subreddit, the description of the submission, the flair of the question, NSFW or SFW status, the year of the submission, and more. The data is extracted using python and Pushshift's API. A little bit of cleaning is done using NumPy and pandas as well. (see the descriptions of individual columns below).

The dataset contains the following columns and descriptions: author - Redditor Name author_fullname - Redditor Full name contest_mode - Contest mode [implement obscured scores and randomized sorting]. created_utc - Time the submission was created, represented in Unix Time. domain - Domain of submission. edited - If the post is edited or not. full_link - Link of the post on the subreddit. id - ID of the submission. is_self - Whether or not the submission is a self post (text-only). link_flair_css_class - CSS Class used to identify the flair. link_flair_text - Flair on the post or The link flair’s text content. locked - Whether or not the submission has been locked. num_comments - The number of comments on the submission. over_18 - Whether or not the submission has been marked as NSFW. permalink - A permalink for the submission. retrieved_on - time ingested. score - The number of upvotes for the submission. description - Description of the Submission. spoiler - Whether or not the submission has been marked as a spoiler. stickied - Whether or not the submission is stickied. thumbnail - Thumbnail of Submission. question - Question Asked in the Submission. url - The URL the submission links to, or the permalink if a self post. year - Year of the Submission. banned - Banned by the moderator or not.

This dataset can be used for Flair Prediction, NSFW Classification, and different Text Mining/NLP tasks. Exploratory Data Analysis can also be done to get the insights and see the trend and patterns over the years.
t
FAIR Dataset for Disease Prediction in Healthcare Applications
test.researchdata.tuwien.ac.at
bin, csv, json, png
Updated Apr 14, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf (2025). FAIR Dataset for Disease Prediction in Healthcare Applications [Dataset]. http://doi.org/10.70124/5n77a-dnf02
Explore at:
csv, json, bin, pngAvailable download formats
Unique identifier
https://doi.org/10.70124/5n77a-dnf02
Dataset updated
Apr 14, 2025
Dataset provided by
TU Wien
Authors
Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset Description

Context and Methodology

Research Domain/Project:
This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validating, and testing.

Purpose of the Dataset:
The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.

Dataset Creation:
Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).

Technical Details

Structure of the Dataset:
The dataset consists of several files organized into folders by data type:

Training Data: Contains the training dataset used to train the machine learning model.

Validation Data: Used for hyperparameter tuning and model selection.

Test Data: Reserved for final model evaluation.

Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.

Software Requirements:
To open and work with this dataset, you need VS Code or Jupyter, which could include tools like:

Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)

Further Details

Reusability:
Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.

Limitations:
The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.
F-DATA: A Fugaku Workload Dataset for Job-centric Predictive Modelling in...
zenodo.org
data.niaid.nih.gov
bin, csv
Updated Jun 10, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Francesco Antici; Francesco Antici; Andrea Bartolini; Andrea Bartolini; Jens Domke; Jens Domke; Zeynep Kiziltan; Zeynep Kiziltan; Keiji Yamamoto; Keiji Yamamoto (2024). F-DATA: A Fugaku Workload Dataset for Job-centric Predictive Modelling in HPC Systems [Dataset]. http://doi.org/10.5281/zenodo.11467483
Explore at:
bin, csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.11467483
Dataset updated
Jun 10, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Francesco Antici; Francesco Antici; Andrea Bartolini; Andrea Bartolini; Jens Domke; Jens Domke; Zeynep Kiziltan; Zeynep Kiziltan; Keiji Yamamoto; Keiji Yamamoto
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
F-DATA is a novel workload dataset containing the data of around 24 million jobs executed on Supercomputer Fugaku, over the three years of public system usage (March 2021-April 2024). Each job data contains an extensive set of features, such as exit code, duration, power consumption and performance metrics (e.g. #flops, memory bandwidth, operational intensity and memory/compute bound label), which allows for a multitude of job characteristics prediction. The full list of features can be found in the file feature_list.csv.

The sensitive data appears both in anonymized and encoded versions. The encoding is based on a Natural Language Processing model and retains sensitive but useful job information for prediction purposes, without violating data privacy. The scripts used to generate the dataset are available in the F-DATA GitHub repository, along with a series of plots and instruction on how to load the data.

F-DATA is composed of 38 files, with each YY_MM.parquet file containing the data of the jobs submitted in the month MM of the year YY.

The files of F-DATA are saved as .parquet files. It is possible to load such files as dataframes by leveraging the pandas APIs, after installing pyarrow (pip install pyarrow). A single file can be read with the following Python instrcutions:

# Importing pandas library

import pandas as pd

# Read the 21_01.parquet file in a dataframe format

df = pd.read_parquet("21_01.parquet")

df.head()
r
Data from: Dataset with condition monitoring vibration data annotated with...
researchdata.se
Updated Jun 17, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Karl Löwenmark; Fredrik Sandin; Marcus Liwicki; Stephan Schnabel (2025). Dataset with condition monitoring vibration data annotated with technical language, from paper machine industries in northern Sweden [Dataset]. http://doi.org/10.5878/hxc0-bd07
Explore at:
(200308), (124)Available download formats
Unique identifier
https://doi.org/10.5878/hxc0-bd07
Dataset updated
Jun 17, 2025
Dataset provided by
Luleå University of Technology
Authors
Karl Löwenmark; Fredrik Sandin; Marcus Liwicki; Stephan Schnabel
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Area covered
Sweden
Description
Labelled industry datasets are one of the most valuable assets in prognostics and health management (PHM) research. However, creating labelled industry datasets is both difficult and expensive, making publicly available industry datasets rare at best, in particular labelled datasets. Recent studies have showcased that industry annotations can be used to train artificial intelligence models directly on industry data ( https://doi.org/10.36001/ijphm.2022.v13i2.3137 , https://doi.org/10.36001/phmconf.2023.v15i1.3507 ), but while many industry datasets also contain text descriptions or logbooks in the form of annotations and maintenance work orders, few, if any, are publicly available. Therefore, we release a dataset consisting with annotated signal data from two large (80mx10mx10m) paper machines, from a Kraftliner production company in northern Sweden. The data consists of 21 090 pairs of signals and annotations from one year of production. The annotations are written in Swedish, by on-site Swedish experts, and the signals consist primarily of accelerometer vibration measurements from the two machines. The dataset is structured as a Pandas dataframe and serialized as a pickle (.pkl) file and a JSON (.json) file. The first column (‘id’) is the ID of the samples; the second column (‘Spectra’) are the fast Fourier transform and envelope-transformed vibration signals; the third column (‘Notes’) are the associated annotations, mapped so that each annotation is associated with all signals from ten days before the annotation date, up to the annotation date; and finally the fourth column (‘Embeddings’) are pre-computed embeddings using Swedish SentenceBERT. Each row corresponds to a vibration measurement sample, though there is no distinction in this data between which sensor or machine part each measurement is from.
A Replication Dataset for Fundamental Frequency Estimation
zenodo.org
live.european-language-grid.eu
+1more
bin
Updated Apr 24, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bastian Bechtold; Bastian Bechtold (2025). A Replication Dataset for Fundamental Frequency Estimation [Dataset]. http://doi.org/10.5281/zenodo.3904389
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.3904389
Dataset updated
Apr 24, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Bastian Bechtold; Bastian Bechtold
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Part of the dissertation Pitch of Voiced Speech in the Short-Time Fourier Transform: Algorithms, Ground Truths, and Evaluation Methods.
© 2020, Bastian Bechtold. All rights reserved.

Estimating the fundamental frequency of speech remains an active area of research, with varied applications in speech recognition, speaker identification, and speech compression. A vast number of algorithms for estimatimating this quantity have been proposed over the years, and a number of speech and noise corpora have been developed for evaluating their performance. The present dataset contains estimated fundamental frequency tracks of 25 algorithms, six speech corpora, two noise corpora, at nine signal-to-noise ratios between -20 and 20 dB SNR, as well as an additional evaluation of synthetic harmonic tone complexes in white noise.

The dataset also contains pre-calculated performance measures both novel and traditional, in reference to each speech corpus’ ground truth, the algorithms’ own clean-speech estimate, and our own consensus truth. It can thus serve as the basis for a comparison study, or to replicate existing studies from a larger dataset, or as a reference for developing new fundamental frequency estimation algorithms. All source code and data is available to download, and entirely reproducible, albeit requiring about one year of processor-time.

Included Code and Data

ground truth data.zip is a JBOF dataset of fundamental frequency estimates and ground truths of all speech files in the following corpora:

CMU-ARCTIC (consensus truth) [1]

FDA (corpus truth and consensus truth) [2]

KEELE (corpus truth and consensus truth) [3]

MOCHA-TIMIT (consensus truth) [4]

PTDB-TUG (corpus truth and consensus truth) [5]

TIMIT (consensus truth) [6]

noisy speech data.zip is a JBOF datasets of fundamental frequency estimates of speech files mixed with noise from the following corpora:

NOISEX [7]

QUT-NOISE [8]

synthetic speech data.zip is a JBOF dataset of fundamental frequency estimates of synthetic harmonic tone complexes in white noise.

noisy_speech.pkl and synthetic_speech.pkl are pickled Pandas dataframes of performance metrics derived from the above data for the following list of fundamental frequency estimation algorithms:

AUTOC [9]

AMDF [10]

BANA [11]

CEP [12]

CREPE [13]

DIO [14]

DNN [15]

KALDI [16]

MAPS

MBSC [17]

NLS [18]

PEFAC [19]

PRAAT [20]

RAPT [21]

SACC [22]

SAFE [23]

SHR [24]

SIFT [25]

SRH [26]

STRAIGHT [27]

SWIPE [28]

YAAPT [29]

YIN [30]

noisy speech evaluation.py and synthetic speech evaluation.py are Python programs to calculate the above Pandas dataframes from the above JBOF datasets. They calculate the following performance measures:

Gross Pitch Error (GPE), the percentage of pitches where the estimated pitch deviates from the true pitch by more than 20%.

Fine Pitch Error (FPE), the mean error of grossly correct estimates.

High/Low Octave Pitch Error (OPE), the percentage pitches that are GPEs and happens to be at an integer multiple of the true pitch.

Gross Remaining Error (GRE), the percentage of pitches that are GPEs but not OPEs.

Fine Remaining Bias (FRB), the median error of GREs.

True Positive Rate (TPR), the percentage of true positive voicing estimates.

False Positive Rate (FPR), the percentage of false positive voicing estimates.

False Negative Rate (FNR), the percentage of false negative voicing estimates.

F₁, the harmonic mean of precision and recall of the voicing decision.

Pipfile is a pipenv-compatible pipfile for installing all prerequisites necessary for running the above Python programs.

The Python programs take about an hour to compute on a fast 2019 computer, and require at least 32 Gb of memory.

References:

John Kominek and Alan W Black. CMU ARCTIC database for speech synthesis, 2003.

Paul C Bagshaw, Steven Hiller, and Mervyn A Jack. Enhanced Pitch Tracking and the Processing of F0 Contours for Computer Aided Intonation Teaching. In EUROSPEECH, 1993.

F Plante, Georg F Meyer, and William A Ainsworth. A Pitch Extraction Reference Database. In Fourth European Conference on Speech Communication and Technology, pages 837–840, Madrid, Spain, 1995.

Alan Wrench. MOCHA MultiCHannel Articulatory database: English, November 1999.

Gregor Pirker, Michael Wohlmayr, Stefan Petrik, and Franz Pernkopf. A Pitch Tracking Corpus with Evaluation on Multipitch Tracking Scenario. page 4, 2011.

John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, David S. Pallett, Nancy L. Dahlgren, and Victor Zue. TIMIT Acoustic-Phonetic Continuous Speech Corpus, 1993.

Andrew Varga and Herman J.M. Steeneken. Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recog- nition systems. Speech Communication, 12(3):247–251, July 1993.

David B. Dean, Sridha Sridharan, Robert J. Vogt, and Michael W. Mason. The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms. Proceedings of Interspeech 2010, 2010.

Man Mohan Sondhi. New methods of pitch extraction. Audio and Electroacoustics, IEEE Transactions on, 16(2):262—266, 1968.

Myron J. Ross, Harry L. Shaffer, Asaf Cohen, Richard Freudberg, and Harold J. Manley. Average magnitude difference function pitch extractor. Acoustics, Speech and Signal Processing, IEEE Transactions on, 22(5):353—362, 1974.

Na Yang, He Ba, Weiyang Cai, Ilker Demirkol, and Wendi Heinzelman. BaNa: A Noise Resilient Fundamental Frequency Detection Algorithm for Speech and Music. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12):1833–1848, December 2014.

Michael Noll. Cepstrum Pitch Determination. The Journal of the Acoustical Society of America, 41(2):293–309, 1967.

Jong Wook Kim, Justin Salamon, Peter Li, and Juan Pablo Bello. CREPE: A Convolutional Representation for Pitch Estimation. arXiv:1802.06182 [cs, eess, stat], February 2018. arXiv: 1802.06182.

Masanori Morise, Fumiya Yokomori, and Kenji Ozawa. WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications. IEICE Transactions on Information and Systems, E99.D(7):1877–1884, 2016.

Kun Han and DeLiang Wang. Neural Network Based Pitch Tracking in Very Noisy Speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12):2158–2168, Decem- ber 2014.

Pegah Ghahremani, Bagher BabaAli, Daniel Povey, Korbinian Riedhammer, Jan Trmal, and Sanjeev Khudanpur. A pitch extraction algorithm tuned for automatic speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 2494–2498. IEEE, 2014.

Lee Ngee Tan and Abeer Alwan. Multi-band summary correlogram-based pitch detection for noisy speech. Speech Communication, 55(7-8):841–856, September 2013.

Jesper Kjær Nielsen, Tobias Lindstrøm Jensen, Jesper Rindom Jensen, Mads Græsbøll Christensen, and Søren Holdt Jensen. Fast fundamental frequency estimation: Making a statistically
m
COVID-19 Scholarly Production Dataset
data.mendeley.com
Updated Jul 7, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gisliany Alves (2020). COVID-19 Scholarly Production Dataset [Dataset]. http://doi.org/10.17632/kx7wwc8dzp.5
Explore at:
Unique identifier
https://doi.org/10.17632/kx7wwc8dzp.5
Dataset updated
Jul 7, 2020
Authors
Gisliany Alves
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
COVID-2019 has been recognized as a global threat, and several studies are being conducted in order to contribute to the fight and prevention of this pandemic. This work presents a scholarly production dataset focused on COVID-19, providing an overview of scientific research activities, making it possible to identify countries, scientists and research groups most active in this task force to combat the coronavirus disease. The dataset is composed of 40,212 records of articles' metadata collected from Scopus, PubMed, arXiv and bioRxiv databases from January 2019 to July 2020. Those data were extracted by using the techniques of Python Web Scraping and preprocessed with Pandas Data Wrangling.
Z
Dataset for paper "Mitigating the effect of errors in source parameters on...
data.niaid.nih.gov
zenodo.org
Updated Sep 28, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nicholas Rawlinson (2022). Dataset for paper "Mitigating the effect of errors in source parameters on seismic (waveform) inversion" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6969601
Explore at:
Dataset updated
Sep 28, 2022
Dataset provided by
Nicholas Rawlinson
Nienke Blom
Phil-Simon Hardalupas
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset corresponding to the journal article "Mitigating the effect of errors in source parameters on seismic (waveform) inversion" by Blom, Hardalupas and Rawlinson, accepted for publication in Geophysical Journal International. In this paper, we demonstrate the effect or errors in source parameters on seismic tomography, with a particular focus on (full) waveform tomography. We study effect both on forward modelling (i.e. comparing waveforms and measurements resulting from a perturbed vs. unperturbed source) and on seismic inversion (i.e. using a source which contains an (erroneous) perturbation to invert for Earth structure. These data were obtained using Salvus, a state-of-the-art (though proprietary) 3-D solver that can be used for wave propagation simulations (Afanasiev et al., GJI 2018).

This dataset contains:

The entire Salvus project. This project was prepared using Salvus version 0.11.x and 0.12.2 and should be fully compatible with the latter.

A number of Jupyter notebooks used to create all the figures, set up the project and do the data processing.

A number of Python scripts that are used in above notebooks.

two conda environment .yml files: one with the complete environment as used to produce this dataset, and one with the environment as supplied by Mondaic (the Salvus developers), on top of which I installed basemap and cartopy.

An overview of the inversion configurations used for each inversion experiment and the name of hte corresponding figures: inversion_runs_overview.ods / .csv .

Datasets corresponding to the different figures.

One dataset for Figure 1, showing the effect of a source perturbation in a real-world setting, as previously used by Blom et al., Solid Earth 2020

One dataset for Figure 2, showing how different methodologies and assumptions can lead to significantly different source parameters, notably including systematic shifts. This dataset was kindly supplied by Tim Craig (Craig, 2019).

A number of datasets (stored as pickled Pandas dataframes) derived from the Salvus project. We have computed:

travel-time arrival predictions from every source to all stations (df_stations...pkl)

misfits for different metrics for both P-wave centered and S-wave centered windows for all components on all stations, comparing every time waveforms from a reference source against waveforms from a perturbed source (df_misfits_cc.28s.pkl)

addition of synthetic waveforms for different (perturbed) moment tenors. All waveforms are stored in HDF5 (.h5) files of the ASDF (adaptable seismic data format) type

How to use this dataset:

To set up the conda environment:

make sure you have anaconda/miniconda

make sure you have access to Salvus functionality. This is not absolutely necessary, but most of the functionality within this dataset relies on salvus. You can do the analyses and create the figures without, but you'll have to hack around in the scripts to build workarounds.

Set up Salvus / create a conda environment. This is best done following the instructions on the Mondaic website. Check the changelog for breaking changes, in that case download an older salvus version.

Additionally in your conda env, install basemap and cartopy:

conda-env create -n salvus_0_12 -f environment.yml conda install -c conda-forge basemap conda install -c conda-forge cartopy

Install LASIF (https://github.com/dirkphilip/LASIF_2.0) and test. The project uses some lasif functionality.

To recreate the figures: This is extremely straightforward. Every figure has a corresponding Jupyter Notebook. Suffices to run the notebook in its entirety.

Figure 1: separate notebook, Fig1_event_98.py

Figure 2: separate notebook, Fig2_TimCraig_Andes_analysis.py

Figures 3-7: Figures_perturbation_study.py

Figures 8-10: Figures_toy_inversions.py

To recreate the dataframes in DATA: This can be done using the example notebook Create_perturbed_thrust_data_by_MT_addition.py and Misfits_moment_tensor_components.M66_M12.py . The same can easily be extended to the position shift and other perturbations you might want to investigate.

To recreate the complete Salvus project: This can be done using:

the notebook Prepare_project_Phil_28s_absb_M66.py (setting up project and running simulations)

the notebooks Moment_tensor_perturbations.py and Moment_tensor_perturbation_for_NS_thrust.py

For the inversions: using the notebook Inversion_SS_dip.M66.28s.py as an example. See the overview table inversion_runs_overview.ods (or .csv) as to naming conventions.

References:

Michael Afanasiev, Christian Boehm, Martin van Driel, Lion Krischer, Max Rietmann, Dave A May, Matthew G Knepley, Andreas Fichtner, Modular and flexible spectral-element waveform modelling in two and three dimensions, Geophysical Journal International, Volume 216, Issue 3, March 2019, Pages 1675–1692, https://doi.org/10.1093/gji/ggy469

Nienke Blom, Alexey Gokhberg, and Andreas Fichtner, Seismic waveform tomography of the central and eastern Mediterranean upper mantle, Solid Earth, Volume 11, Issue 2, 2020, Pages 669–690, 2020, https://doi.org/10.5194/se-11-669-2020

Tim J. Craig, Accurate depth determination for moderate-magnitude earthquakes using global teleseismic data. Journal of Geophysical Research: Solid Earth, 124, 2019, Pages 1759– 1780. https://doi.org/10.1029/2018JB016902
Data from: ESAT: Environmental Source Apportionment Toolkit Python package
catalog.data.gov
s.cnmilf.com
Updated Nov 29, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. EPA Office of Research and Development (ORD) (2024). ESAT: Environmental Source Apportionment Toolkit Python package [Dataset]. https://catalog.data.gov/dataset/esat-environmental-source-apportionment-toolkit-python-package
Explore at:
Dataset updated
Nov 29, 2024
Dataset provided by
United States Environmental Protection Agencyhttp://www.epa.gov/
Description
The Environmental Source Apportionment Toolkit (ESAT) is an open-source software package that provides API and CLI functionality to create source apportionment workflows specifically targeting environmental datasets. Source apportionment in environment science is the process of mathematically estimating the profiles and contributions of multiple sources in some dataset, and in the case of ESAT, while considering data uncertainty. There are many potential use cases for source apportionment in environmental science research, such as in the fields of air quality, water quality and potentially many others. The ESAT toolkit is written in Python and Rust, and uses common packages such as numpy, scipy and pandas for data processing. The source apportionment algorithms provided in ESAT include two variants of non-negative matrix factorization (NMF), both of which have been written in Rust and contained within the python package. A collection of data processing and visualization features are included for data and model analytics. The ESAT package includes a synthetic data generator and comparison tools to evaluate ESAT model outputs.
Fracture toughness of mixed-mode anticracks in highly porous materials...
zenodo.org
bin, text/x-python +1
Updated Sep 2, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Valentin Adam; Valentin Adam; Bastian Bergfeld; Bastian Bergfeld; Philipp Weißgraeber; Philipp Weißgraeber; Alec van Herwijnen; Alec van Herwijnen; Philipp L. Rosendahl; Philipp L. Rosendahl (2024). Fracture toughness of mixed-mode anticracks in highly porous materials dataset and data processing [Dataset]. http://doi.org/10.5281/zenodo.11443644
Explore at:
text/x-python, txt, binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.11443644
Dataset updated
Sep 2, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Valentin Adam; Valentin Adam; Bastian Bergfeld; Bastian Bergfeld; Philipp Weißgraeber; Philipp Weißgraeber; Alec van Herwijnen; Alec van Herwijnen; Philipp L. Rosendahl; Philipp L. Rosendahl
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains the code and datasets used in the data analysis for "Fracture toughness of mixed-mode anticracks in highly porous materials". The analysis is implemented in Python, using Jupyter Notebooks.

Contents

main.ipynb: Jupyter notebook with the main data analysis workflow.

energy.py: Methods for the calculation of energy release rates.

regression.py: Methods for the regression analyses.

visualization.py: Methods for generating visualizations.

df_mmft.pkl: Pickled DataFrame with experimental data gathered in the present work.

df_legacy.pkl: Pickled DataFrame with literature data.

Prerequisites

To run the scripts and notebooks, you need:

Python 3.12 or higher

Jupyter Notebook or JupyterLab

Libraries: pandas, matplotlib, numpy, scipy, tqdm, uncertainties, weac

Setup

Download the zip file or clone this repository to your local machine.

Ensure that Python and Jupyter are installed.

Install required Python libraries using pip install -r requirements.txt.

Running the Analysis

Open the main.ipynb notebook in Jupyter Notebook or JupyterLab.

Execute the cells in sequence to reproduce the analysis.

Data Description

The data included in this repository is encapsulated in two pickled DataFrame files, df_mmft.pkl and df_legacy.pkl, which contain experimental measurements and corresponding parameters. Below are the descriptions for each column in these DataFrames:

df_mmft.pkl

Includes data such as experiment identifiers, datetime, and physical measurements like slope inclination and critical cut lengths.

exp_id: Unique identifier for each experiment.

datestring: Date of the experiment as a string.

datetime: Timestamp of the experiment.

bunker: Field site of the experiment. Bunker IDs 1 and 2 correspond to field sites A and B, respectively.

slope_incl: Inclination of the slope in degrees.

h_sledge_top: Distance from sample top surface to the sled in mm.

h_wl_top: Distance from sample top surface to weak layer in mm.

h_wl_notch: Distance from the notch root to the weak layer in mm.

rc_right: Critical cut length in mm, measured on the front side of the sample.

rc_left: Critical cut length in mm, measured on the back side of the sample.

rc: Mean of rc_right and rc_left.

densities: List of density measurements in kg/m^3 for each distinct slab layer of each sample.

densities_mean: Daily mean of densities.

layers: 2D array with layer density (kg/m^3) and layer thickness (mm) pairs for each distinct slab layer.

layers_mean: Daily mean of layers.

surface_lineload: Surface line load of added surface weights in N/mm.

wl_thickness: Weak-layer thickness in mm.

notes: Additional notes regarding the experiment or observations.

L: Length of the slab–weak-layer assembly in mm.

df_legacy.pkl

Contains robustness data such as radii of curvature, slope inclination, and various geometrical measurements.

#: Record number.

rc: Critical cut length in mm.

slope_incl: Inclination of the slope in degrees.

h: Slab height in mm.

density: Mean slab density in kg/m^3.

L: Lenght of the slab–weak-layer assembly in mm.

collapse_height: Weak-layer height reduction through collapse.

layers_mean: 2D array with layer density (kg/m^3) and layer thickness (mm) pairs for each distinct slab layer.

wl_thickness: Weak-layer thickness in mm.

surface_lineload: Surface line load from added weights in N/mm.

For more detailed information on the datasets, refer to the paper or the documentation provided within the Jupyter notebook.

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

You are free to:

Share — copy and redistribute the material in any medium or format

Adapt — remix, transform, and build upon the material for any purpose, even commercially.

Under the following terms:

Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

Citation

Please cite the following paper if you use this analysis or the accompanying datasets:

Adam, V., Bergfeld, B., Weißgraeber, P. van Herwijnen, A., Rosendahl, P.L., Fracture toughness of mixed-mode anticracks in highly porous materials. Nature Communincations 15, 7379 (2024). https://doi.org/10.1038/s41467-024-51491-7
O
Time series
data.open-power-system-data.org
csv, sqlite, xlsx
Updated Oct 6, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jonathan Muehlenpfordt (2020). Time series [Dataset]. http://doi.org/10.25832/time_series/2020-10-06
Explore at:
csv, sqlite, xlsxAvailable download formats
Unique identifier
https://doi.org/10.25832/time_series/2020-10-06
Dataset updated
Oct 6, 2020
Dataset provided by
Open Power System Data
Authors
Jonathan Muehlenpfordt
Time period covered
Jan 1, 2015 - Oct 1, 2020
Variables measured
utc_timestamp, DE_wind_profile, DE_solar_profile, DE_wind_capacity, DK_wind_capacity, SE_wind_capacity, CH_solar_capacity, DE_solar_capacity, DK_solar_capacity, AT_price_day_ahead, and 290 more
Description
Load, wind and solar, prices in hourly resolution. This data package contains different kinds of timeseries data relevant for power system modelling, namely electricity prices, electricity consumption (load) as well as wind and solar power generation and capacities. The data is aggregated either by country, control area or bidding zone. Geographical coverage includes the EU and some neighbouring countries. All variables are provided in hourly resolution. Where original data is available in higher resolution (half-hourly or quarter-hourly), it is provided in separate files. This package version only contains data provided by TSOs and power exchanges via ENTSO-E Transparency, covering the period 2015-mid 2020. See previous versions for historical data from a broader range of sources. All data processing is conducted in Python/pandas and has been documented in the Jupyter notebooks linked below.
Taylor Swift | The Eras Tour Official Setlist Data
kaggle.com
Updated May 13, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
yuka_with_data (2024). Taylor Swift | The Eras Tour Official Setlist Data [Dataset]. https://www.kaggle.com/datasets/yukawithdata/taylor-swift-the-eras-tour-official-setlist-data
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 13, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
yuka_with_data
Description
💁‍♀️Please take a moment to carefully read through this description and metadata to better understand the dataset and its nuances before proceeding to the Suggestions and Discussions section.

Dataset Description:

This dataset provides a comprehensive collection of setlists from Taylor Swift’s official era tours, curated expertly by Spotify. The playlist, available on Spotify under the title "Taylor Swift The Eras Tour Official Setlist," encompasses a diverse range of songs that have been performed live during the tour events of this global artist. Each dataset entry corresponds to a song featured in the playlist.

Taylor Swift, a pivotal figure in both country and pop music scenes, has had a transformative impact on the music industry. Her tours are celebrated not just for their musical variety but also for their theatrical elements, narrative style, and the deep emotional connection they foster with fans worldwide. This dataset aims to provide fans and researchers an insight into the evolution of Swift's musical and performance style through her tours, capturing the essence of what makes her tour unique.

Data Collection and Processing:

Obtaining the Data: The data was obtained directly from the Spotify Web API, specifically focusing on the setlist tracks by the artist. The Spotify API provides detailed information about tracks, artists, and albums through various endpoints.

Data Processing: To process and structure the data, Python scripts were developed using data science libraries such as pandas for data manipulation and spotipy for API interactions, specifically for Spotify data retrieval.

Workflow:

Authentication API Requests Data Cleaning and Transformation Saving the Data

Attribute Descriptions:

artist_name: the name of the artist (Taylor Swift)

track_name: the title of the track

is_explicit: Indicates whether the track contains explicit content

album_release_date: The date when the track was released

genres: A list of genres associated with Beyoncé

danceability: A measure from 0.0 to 1.0 indicating how suitable a track is for - dancing based on a combination of musical elements

valence: A measure from 0.0 to 1.0 indicating the musical positiveness conveyed by a track

energy: A measure from 0.0 to 1.0 representing a perceptual measure of intensity and activity

loudness: The overall loudness of a track in decibels (dB)

acousticness: A measure from 0.0 to 1.0 whether the track is acoustic

instrumentalness: Predicts whether a track contains no vocals

liveness: Detects the presence of an audience in the recordings speechiness: Detects the presence of spoken words in a track

key: The key the track is in. Integers map to pitches using standard Pitch Class notation

tempo: The overall estimated tempo of a track in beats per minute (BPM)

mode: Modality of the track

duration_ms: The length of the track in milliseconds

time_signature: An estimated overall time signature of a track

popularity: A score between 0 and 100, with 100 being the most popular

Note: Popularity score reflects the score recorded on the day that retrieves this dataset. The popularity score could fluctuate daily.

Potential Applications:

Predictive Analytics: Researchers might use this dataset to predict future setlist choices for tours based on album success, song popularity, and fan feedback.

Disclaimer and Responsible Use:

This dataset, derived from Spotify focusing on Taylor Swift's The Eras Tour setlist data, is intended for educational, research, and analysis purposes only. Users are urged to use this data responsibly, ethically, and within the bounds of legal stipulations.

Compliance with Terms of Service: Users should adhere to Spotify's Terms of Service and Developer Policies when utilizing this dataset.

Copyright Notice: The dataset presents music track information including names and artist details for analytical purposes and does not convey any rights to the music itself. Users must ensure that their use does not infringe on the copyright holders' rights. Any analysis, distribution, or derivative work should respect the intellectual property rights of all involved parties and comply with applicable laws.

No Warranty Disclaimer: The dataset is provided "as is," without warranty, and the creator disclaims any legal liability for its use by others.

Ethical Use: Users are encouraged to consider the ethical implications of their analyses and the potential impact on artists and the broader community.

Data Accuracy and Timeliness: The dataset reflects a snapshot in time and may not represent the most current information available. Users are encouraged to verify the data's accuracy and timeliness.

Source Verification: For the most accurate and up-to-date information, users are encouraged to refer directly to Spotify's official website.

Independence Declaration: ...
Z
Data from: Redocking the PDB
data.niaid.nih.gov
Updated Dec 6, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Flachsenberg, Florian (2023). Redocking the PDB [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7579501
Explore at:
Dataset updated
Dec 6, 2023
Dataset provided by
Gutermuth, Torben
Flachsenberg, Florian
Ehrt, Christiane
Rarey, Matthias
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains supplementary data to the journal article 'Redocking the PDB' by Flachsenberg et al. (https://doi.org/10.1021/acs.jcim.3c01573)[1]. In this paper, we described two datasets: The PDBScan22 dataset with a large set of 322,051 macromolecule–ligand binding sites generally suitable for redocking and the PDBScan22-HQ dataset with 21,355 binding sites passing different structure quality filters. These datasets were further characterized by calculating properties of the ligand (e.g., molecular weight), properties of the binding site (e.g., volume), and structure quality descriptors (e.g., crystal structure resolution). Additionally, we performed redocking experiments with our novel JAMDA structure preparation and docking workflow[1] and with AutoDock Vina[2,3]. Details for all these experiments and the dataset composition can be found in the journal article[1]. Here, we provide all the datasets, i.e., the PDBScan22 and PDBScan22-HQ datasets as well as the docking results and the additionally calculated properties (for the ligand, the binding sites, and structure quality descriptors). Furthermore, we give a detailed description of their content (i.e., the data types and a description of the column values). All datasets consist of CSV files with the actual data and associated metadata JSON files describing their content. The CSV/JSON files are compliant with the CSV on the web standard (https://csvw.org/). General hints

All docking experiment results consist of two CSV files, one with general information about the docking run (e.g., was it successful?) and one with individual pose results (i.e., score and RMSD to the crystal structure). All files (except for the docking pose tables) can be indexed uniquely by the column tuple '(pdb, name)' containing the PDB code of the complex (e.g., 1gm8) and the name ligand (in the format '_', e.g., 'SOX_B_1559'). All files (except for the docking pose tables) have exactly the same number of rows as the dataset they were calculated on (e.g., PDBScan22 or PDBScan22-HQ). However, some CSV files may have missing values (see also the JSON metadata files) in some or even all columns (except for 'pdb' and 'name'). The docking pose tables also contain the 'pdb' and 'name' columns. However, these alone are not unique but only together with the 'rank' column (i.e., there might be multiple poses for each docking run or none). Example usage Using the pandas library (https://pandas.pydata.org/) in Python, we can calculate the number of protein-ligand complexes in the PDBScan22-HQ dataset with a top-ranked pose RMSD to the crystal structure ≤ 2.0 Å in the JAMDA redocking experiment and a molecular weight between 100 Da and 200 Da:

import pandas as pd df = pd.read_csv('PDBScan22-HQ.csv') df_poses = pd.read_csv('PDBScan22-HQ_JAMDA_NL_NR_poses.csv') df_properties = pd.read_csv('PDBScan22_ligand_properties.csv') merged = df.merge(df_properties, how='left', on=['pdb', 'name']) merged = merged[(merged['MW'] >= 100) & (merged['MW'] <= 200)].merge(df_poses[df_poses['rank'] == 1], how='left', on=['pdb', 'name']) nof_successful_top_ranked = (merged['rmsd_ai'] <= 2.0).sum() nof_no_top_ranked = merged['rmsd_ai'].isna().sum() Datasets

PDBScan22.csv: This is the PDBScan22 dataset[1]. This dataset was derived from the PDB4. It contains macromolecule–ligand binding sites (defined by PDB code and ligand identifier) that can be read by the NAOMI library[5,6] and pass basic consistency filters. PDBScan22-HQ.csv: This is the PDBScan22-HQ dataset[1]. It contains macromolecule–ligand binding sites from the PDBScan22 dataset that pass certain structure quality filters described in our publication[1]. PDBScan22-HQ-ADV-Success.csv: This is a subset of the PDBScan22-HQ dataset without 336 binding sites where AutoDock Vina[2,3] fails. PDBScan22-HQ-Macrocycles.csv: This is a subset of the PDBScan22-HQ dataset without 336 binding sites where AutoDock Vina[2,3] fails and only contains molecules with macrocycles with at least ten atoms. Properties for PDBScan22

PDBScan22_ligand_properties.csv: Conformation-independent properties of all ligand molecules in the PDBScan22 dataset. Properties were calculated using an in-house tool developed with the NAOMI library[5,6]. PDBScan22_StructureProfiler_quality_descriptors.csv: Structure quality descriptors for the binding sites in the PDBScan22 dataset calculated using the StructureProfiler tool[7]. PDBScan22_basic_complex_properties.csv: Simple properties of the binding sites in the PDBScan22 dataset. Properties were calculated using an in-house tool developed with the NAOMI library[5,6]. Properties for PDBScan22-HQ

PDBScan22-HQ_DoGSite3_pocket_descriptors.csv: Binding site descriptors calculated for the binding sites in the PDBScan22-HQ dataset using the DoGSite3 tool[8]. PDBScan22-HQ_molecule_types.csv: Assignment of ligands in the PDBScan22-HQ dataset (without 336 binding sites where AutoDock Vina fails) to different molecular classes (i.e., drug-like, fragment-like oligosaccharide, oligopeptide, cofactor, macrocyclic). A detailed description of the assignment can be found in our publication[1]. Docking results on PDBScan22

PDBScan22_JAMDA_NL_NR.csv: Docking results of JAMDA[1] on the PDBScan22 dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22_JAMDA_NL_NR_poses.csv'. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled. PDBScan22_JAMDA_NL_NR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA[1] on the PDBScan22 dataset. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled. Docking results on PDBScan22-HQ

PDBScan22-HQ_JAMDA_NL_NR.csv: Docking results of JAMDA[1] on the PDBScan22-HQ dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22-HQ_JAMDA_NL_NR_poses.csv'. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled. PDBScan22-HQ_JAMDA_NL_NR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA[1] on the PDBScan22-HQ dataset. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled. PDBScan22-HQ_JAMDA_NL_WR.csv: Docking results of JAMDA[1] on the PDBScan22-HQ dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22-HQ_JAMDA_NL_WR_poses.csv'. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was enabled. PDBScan22-HQ_JAMDA_NL_WR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA[1] on the PDBScan22-HQ dataset. For this experiment, the ligand was not considered during preprocessing of the binding site and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was enabled. PDBScan22-HQ_JAMDA_NW_NR.csv: Docking results of JAMDA[1] on the PDBScan22-HQ dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22-HQ_JAMDA_NW_NR_poses.csv'. For this experiment, the ligand was not considered during preprocessing of the binding site, all water molecules were removed from the binding site during preprocessing, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled. PDBScan22-HQ_JAMDA_NW_NR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA[1] on the PDBScan22-HQ dataset. For this experiment, the ligand was not considered during preprocessing of the binding site, all water molecules were removed from the binding site during preprocessing, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled. PDBScan22-HQ_JAMDA_NW_WR.csv: Docking results of JAMDA[1] on the PDBScan22-HQ dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22-HQ_JAMDA_NW_WR_poses.csv'. For this experiment, the ligand was not considered during preprocessing of the binding site, all water molecules were removed from the binding site during preprocessing, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was enabled. PDBScan22-HQ_JAMDA_NW_WR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA[1] on the PDBScan22-HQ dataset. For this experiment, the ligand was not considered during preprocessing of the binding site, all water molecules were removed from the binding site during preprocessing, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was enabled. PDBScan22-HQ_JAMDA_WL_NR.csv: Docking results of JAMDA[1] on the PDBScan22-HQ dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22-HQ_JAMDA_WL_NR_poses.csv'. For this experiment, the ligand was considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled. PDBScan22-HQ_JAMDA_WL_NR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA[1] on the PDBScan22-HQ dataset. For this experiment, the ligand was considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand
H
Enhancing Stock Market Forecasting with Machine Learning A PineScript-Driven...
dataverse.harvard.edu
Updated Nov 19, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gautam Narla (2024). Enhancing Stock Market Forecasting with Machine Learning A PineScript-Driven Approach [Dataset]. http://doi.org/10.7910/DVN/HF0PFX
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/HF0PFX
Dataset updated
Nov 19, 2024
Dataset provided by
Harvard Dataverse
Authors
Gautam Narla
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This study investigates the application of machine learning (ML) models in stock market forecasting, with a focus on their integration using PineScript, a domain-specific language for algorithmic trading. Leveraging diverse datasets, including historical stock prices and market sentiment data, we developed and tested various ML models such as neural networks, decision trees, and linear regression. Rigorous backtesting over multiple timeframes and market conditions allowed us to evaluate their predictive accuracy and financial performance. The neural network model demonstrated the highest accuracy, achieving a 75% success rate, significantly outperforming traditional models. Additionally, trading strategies derived from these ML models yielded a return on investment (ROI) of up to 12%, compared to an 8% benchmark index ROI. These findings underscore the transformative potential of ML in refining trading strategies, providing critical insights for financial analysts, investors, and developers. The study draws on insights from 15 peer-reviewed articles, financial datasets, and industry reports, establishing a robust foundation for future exploration of ML-driven financial forecasting. Tools and Technologies Used †PineScript PineScript, a scripting language integrated within the TradingView platform, was the primary tool used to develop and implement the machine learning models. Its robust features allowed for custom indicator creation, strategy backtesting, and real-time market data analysis. †Python Python was utilized for data preprocessing, model training, and performance evaluation. Key libraries included: Pandas
f
Metaverse Gait Authentication Dataset (MGAD)
figshare.com
csv
Updated Feb 11, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
sandeep ravikanti (2025). Metaverse Gait Authentication Dataset (MGAD) [Dataset]. http://doi.org/10.6084/m9.figshare.28387664.v1
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.28387664.v1
Dataset updated
Feb 11, 2025
Dataset provided by
figshare
Authors
sandeep ravikanti
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset OverviewThe Metaverse Gait Authentication Dataset (MGAD) is a large-scale dataset for gait-based biometric authentication in virtual environments. It consists of gait data from 5,000 simulated users, generated using Unity 3D and processed using OpenPose and MediaPipe. This dataset is ideal for researchers working on biometric authentication, gait analysis, and AI-driven identity verification systems.2. Data Structure & FormatFile Format: CSVNumber of Samples: 5,000 usersNumber of Features: 16 gait-based featuresColumns: Each row represents a user with corresponding gait feature valuesSize: Approximately (mention size in MB/GB after upload)3. Feature DescriptionsThe dataset includes 16 extracted gait features:Stride Length (m): Average distance covered in one gait cycle.Step Frequency (steps/min): Number of steps taken per minute.Stance Phase Duration (s): Stance phase in a gait cycle.Swing Phase Duration (s): Duration of the swing phase in a gait cycle.Double Support Phase Duration (s): Time both feet are in contact with the ground.Step Length (m): Distance between consecutive foot placements.Cadence Variability (%): Variability in step rate.Hip Joint Angle (°): Maximum angle variation in the hip joint.Knee Joint Angle (°): Maximum flexion-extension knee angle.Ankle Joint Angle (°): Angle variation at the ankle joint.Avg. Vertical GRF (N): Average vertical ground reaction force.Avg. Anterior-Posterior GRF (N): Ground reaction force in the forward-backward direction.Avg. Medial-Lateral GRF (N): Ground reaction force in the side-to-side direction.Avg. COP Excursion (mm): Center of pressure movement during stance phase.Foot Clearance during Swing Phase (mm): Minimum height of the foot during the swing phase.Gait Symmetry Index (%): Measure of symmetry between left and right gait cycles.4. How to Use the DatasetLoad the dataset in Python using Pandas:Use the features for machine learning models in biometric authentication.Apply preprocessing techniques like normalization and feature scaling.Train and evaluate deep learning or ensemble models for gait recognition.5. Citation & LicenseIf you use this dataset, please cite it as follows:Sandeep Ravikanti, "Metaverse Gait Authentication Dataset (MGAD)," IEEE DataPort, 2025. DOI: https://dx.doi.org/10.21227/rvh5-88426. Contact InformationFor inquiries or collaborations, please contact: bitsrmit2023@gmail.com
O
Household Data
data.open-power-system-data.org
csv, sqlite, xlsx
Updated Apr 15, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Adrian Minde (2020). Household Data [Dataset]. https://data.open-power-system-data.org/household_data/
Explore at:
xlsx, csv, sqliteAvailable download formats
Dataset updated
Apr 15, 2020
Dataset provided by
Open Power System Data
Authors
Adrian Minde
Time period covered
Dec 11, 2014 - May 1, 2019
Variables measured
interpolated, utc_timestamp, cet_cest_timestamp, DE_KN_industrial2_pv, DE_KN_industrial3_ev, DE_KN_residential1_pv, DE_KN_residential3_pv, DE_KN_residential4_ev, DE_KN_residential4_pv, DE_KN_residential6_pv, and 61 more
Description
Detailed household load and solar generation in minutely to hourly resolution. This data package contains measured time series data for several small businesses and residential households relevant for household- or low-voltage-level power system modeling. The data includes solar power generation as well as electricity consumption (load) in a resolution up to single device consumption. The starting point for the time series, as well as data quality, varies between households, with gaps spanning from a few minutes to entire days. All measurement devices provided cumulative energy consumption/generation over time. Hence overall energy consumption/generation is retained, in case of data gaps due to communication problems. Measurements were conducted 1-minute intervals, with all data made available in an interpolated, uniform and regular time interval. All data gaps are either interpolated linearly, or filled with data of prior days. Additionally, data in 15 and 60-minute resolution is provided for compatibility with other time series data. Data processing is conducted in Jupyter Notebooks/Python/pandas.
Z
Data and Processing from "Carbon-centric dynamics of Earth's marine...
data.niaid.nih.gov
zenodo.org
Updated Oct 6, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Fennel, Katja (2024). Data and Processing from "Carbon-centric dynamics of Earth's marine phytoplankton" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10949681
Explore at:
Dataset updated
Oct 6, 2024
Dataset provided by
Fennel, Katja
Stoer, Adam
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Earth
Description
Brief Summary:

This documentation is for associated data and code for:

A. Stoer, K. Fennel, Carbon-centric dynamics of Earth's marine phytoplankton. Proceedings of the National Academy of Sciences (2024).

To cite this software and data, please use:

A. Stoer, K. Fennel, Data and processing from "Carbon-centric dynamics of Earth's marine phytoplankton". Zenodo. https://doi.org/10.5281/zenodo.10949682. Deposited 1 October 2024.

List of folders and subfolders and what they contain:

raw data: Contains raw data used in the analysis. This folder does not contain the satellite imagery, which will need to be downloaded from the NASA Ocean Color website (https://oceancolor.gsfc.nasa.gov/).

bgc-argo float data (subfolder): Includes Argo data from its original source or put into a similar Argo format

global region data (subfolder): Includes data used to subset the Argo profiles into each 10deg lat region and basin.

graff et al 2015 data (subfolder): Include the data digitized from Graff et al.'s Fig. 2.

processed data: data processing by this study (Stoer and Fennel, 2024)

processed bgc-argo data (subfolder): A binned processed file is present for each Argo float used in the analysis. Note these files include those describe in Table S1 (these are later processed in "3_stock_bloom_calc.py")

processed satellite data (subfolder): includes a 10-deg latitude averaged for each satellite image processed (called "chl_sat_df_merged.csv"). This is later used to calculate a satellite chlorophyll-a climatology in "3_stock_bloom_calc.py".

processed chla-irrad data (subfolder): includes the quality-controlled light diffuse attenuation data coupled with the chlorophyll-a fluorescence data to calculate slope factor corrections (the file is called "processed chla-irrad data.csv").

processed topography data (subfolder): includes smoothed topography data (file named "ETOPO_2022_v1_60s_N90W180_surface_mod.tiff").

software:

0_ftp_argo_data_download.py: This program downloads the Argo data from the Global Data Assembly Center's FTP. Running this program will provide new Argo float profiles. However, there will be new floats and profiles present if downloaded. This will not match the historical record of Argo floats used in this analysis but could be useful for replicating this analysis when more data becomes available. The historical record of BGC-Argo floats are present in "/raw data/bgc-argo float data/" path. If you wish to downloaded other float data, see Gordon et al. (2020), Hamilton and Leidos (2017) and the data from the misclab website (https://misclab.umeoce.maine.edu/floats/).

1_argo_data_processing.py: This program quality-controls and bins the biogeochemical data into a consistent format. This includes corrections and checks, like the spike/noise test or the non-photochemical quenching correction.

2_sat_data_processing.py: this program processes the satellite data downloaded from the NASA Ocean Color website.

3_stock_bloom_calc.py: this is the main program used to described the results of the study. The program takes the processed Argo data and groups it into regions and calculates slope factors, phytoplankton carbon & chlorophyll-a, global stocks, and bloom metrics.

4_stock_calc_longhurst_province.py: This program repeats the global stocks calculations performed in "3_stock_bloom_calc.py" but bases the grouping on Longhurst Biogeochemical Provinces.

How to Replicate this Analysis:

Each program should be run in the order listed above. Path names where the data files have been downloaded will need to be updated in the code.

To use the exact same Sprof files as used in the paper, skip running "0_ftp_argo_data_download.py" and start with "1_argo_data_processing.py" instead. Use the float data from the folder "bgc-argo float data". The program "0_ftp_argo_data_download.py" downloads the latest data from Argo database, so it is useful for updating the analysis. The program "1_argo_data_processing.py" may also be skipped to save time and the processed BGC-Argo float data may be used instead (see folder named "processed bgc-argo data").

Similarly, the program "2_sat_data_processing.py" may also be skipped, which otherwise can take multiple hours to process. The raw data is available from the NASA Ocean Color website (https://oceancolor.gsfc.nasa.gov/). The processed data from "2_sat_data_processing.py" is available so this step may be skipped to save time as well.

The program "3_stock_bloom_calc.py" will require running "ocean_toolbox.py" (see below) in another tab. The portion of the program that involves QC for the irradiance profiles has been commented out to save processing time, and the pre-processed data used in the study has been linked instead (see folder "processed light data"). Similarly, pre-processed topography data is present in this repository. The original Earth Topography data can be accessed at https://www.ncei.noaa.gov/products/etopo-global-relief-model.

A version of "3_stock_bloom_calc.py" using Longhurst provinces is available for exploring alternative groupings and their effects on stock calculations. See the program named "4_stock_calc_longhurst_province.py". You will need to download the Longhurst biogeochemical provinces from https://www.marineregions.org/.

To explore the effects of different slope factors, averaging methods, bbp spectral slopes, etc, the user will likely want to make changes to "3_stock_bloom_calc.py". Please do not hesitate to contact the correponding author (Adam Stoer) for guidance or questions.

ocean_toolbox.py:

import statsmodels.formula.api as smfimport osimport matplotlib.pyplot as pltimport numpy as npfrom uncertainties import unumpy as unpfrom scipy import stats

def file_grab(root,find,start): #grabs files by file extensions and location filelst = [] for subdir, dirs, files in os.walk(root): for file in files: filepath = subdir + os.sep + file if filepath.endswith(find): if filepath.startswith(start): filelst.append(filepath) return filelst

def sep_bbp(data, name_z, name_chla, name_bbp): ''' data: Pandas Dataframe containing the profile data name_z: name of the depth variable in data name_chla: name of the chlorophyll-a variable in data name_bbp: name of the particle backscattering variable in data returns: the data variable with particle backscattering partitioned into phytoplankton (bbpphy) and non-algal particle components (bbpnap). ''' #name_chla = 'chla' #name_z = 'depth' #name_bbp = 'bbp470' dcm = data[data.loc[:,name_chla]==data.loc[:,name_chla].max()][name_z].values[0] # Find depth of deep chla maximum part_prof = data[(data.loc[:,name_bbp]=1), name_z].min() # Find depth where bbp NAP and bbp intersect data.loc[data[name_z]>=z_lim, 'bbp_back'] = data.loc[data[name_z]>=z_lim, name_bbp].tolist() data.loc[data[name_z]z_lim),'bbpphy'] = 0 # Subtract bbp NAP from bbp for bbp from phytoplankton

return data['bbpphy'], z_lim

def bbp_to_cphy(bbp_data, sf): ''' data: Pandas Dataframe containing the profile data name_bbp: name of the particulate backscattering variable in data name_bbp_err: name of particulate backscattering error variable in data returns: the data variable with particle backscattering [/m] converted into phytoplankton carbon [mg/m^3]. ''' cphy_data = bbp_data.mul(sf)

return cphy_data
Z
TrafficDator Madrid
data.niaid.nih.gov
Updated Apr 6, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ilarri, Sergio (2024). TrafficDator Madrid [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10435153
Explore at:
Dataset updated
Apr 6, 2024
Dataset provided by
Gómez, Iván
Ilarri, Sergio
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Area covered
Madrid
Description
Data Origin: This dataset was generated using information from the Community of Madrid, including traffic data collected by multiple sensors located throughout the city, as well as work calendar and meteorological data, all provided by the Community.

Data Type: The data consists of traffic measurements in Madrid from June 1, 2022, to September 30, 2023. Each record includes information on the date, time, location (longitude and latitude), traffic intensity, and associated road and weather conditions (e.g., whether it is a working day, holiday, information on wind, temperature, precipitation, etc.).

Technical Details:

Data Preprocessing: We utilized advanced techniques for cleaning and normalizing traffic data collected from sensors across Madrid. This included handling outliers and missing values to ensure data quality.

Geospatial Analysis: We used GeoPandas and OSMnx to map traffic data points onto Madrid's road network. This process involved processing spatial attributes such as street lanes and speed limits to add context to the traffic data.

Meteorological Data Integration: We incorporated Madrid's weather data, including temperature, precipitation, and wind speed. Understanding the impact of weather conditions on traffic patterns was crucial in this step.

Traffic Data Clustering: We implemented K-Means clustering to identify patterns in traffic data. This approach facilitated the selection of representative sensors from each cluster, focusing on the most relevant data points.

Calendar Integration: We combined the traffic data with the work calendar to distinguish between different types of days. This provided insights into traffic variations on working days and holidays.

Comprehensive Analysis Approach: The analysis was conducted using Python libraries such as Pandas, NumPy, scikit-learn, and Shapely. It covered data from the years 2022 and 2023, focusing on the unique characteristics of the Madrid traffic dataset.

Data Structure: Each row of the dataset represents an individual measurement from a traffic sensor, including:

id: Unique sensor identifier.

date: Date and time of the measurement.

longitude and latitude: Geographical coordinates of the sensor.

day type: Information about the day being a working day, holiday, or festive Sunday.

intensity: Measured traffic intensity.

Additional data like wind, temperature, precipitation, etc.

Purpose of the Dataset: This dataset is useful for traffic analysis, urban mobility studies, infrastructure planning, and research related to traffic behavior under different environmental and temporal conditions.

Acknowledgment and Funding:

This dataset was obtained as part of the R&D project PID2020-113037RB-I00, funded by MCIN/AEI/10.13039/501100011033.

In addition to the NEAT-AMBIENCE project, support from the Department of Science, University, and Knowledge Society of the Government of Aragon (Government of Aragon: group reference T64_23R, COSMOS research group) is also acknowledged.

For academic and research purposes, please reference this dataset using its DOI for proper attribution and tracking.

Facebook

Twitter

Click to copy link

Link copied

Cite

warvan, warvan-ml-dataset [Dataset]. https://huggingface.co/datasets/warvan/warvan-ml-dataset

warvan-ml-dataset

warvan/warvan-ml-dataset

Explore at:

Authors

warvan

Description

Dataset Name

This dataset contains structured data for machine learning and analysis purposes.

  Contents

data/sample.csv: Sample dataset file. data/train.csv: Training dataset. data/test.csv: Testing dataset. scripts/preprocess.py: Script for preprocessing the dataset. scripts/analyze.py: Script for data analysis.

  Usage

Load the dataset using Pandas: import pandas as pd df = pd.read_csv('data/sample.csv')

Run preprocessing: python scripts/preprocess.py… See the full description on the dataset page: https://huggingface.co/datasets/warvan/warvan-ml-dataset.

Clear search

Close search

Google apps

Main menu

warvan-ml-dataset

An Empirical Study on Energy Usage Patterns of Different Variants of Data...

BCG Data Science Simulation

** Feature Engineering for Churn Prediction**

Reddit r/AskScience Flair Dataset

FAIR Dataset for Disease Prediction in Healthcare Applications

Dataset Description

Context and Methodology

Technical Details

Further Details

F-DATA: A Fugaku Workload Dataset for Job-centric Predictive Modelling in...

Data from: Dataset with condition monitoring vibration data annotated with...

A Replication Dataset for Fundamental Frequency Estimation

COVID-19 Scholarly Production Dataset

Dataset for paper "Mitigating the effect of errors in source parameters on...

Data from: ESAT: Environmental Source Apportionment Toolkit Python package

Fracture toughness of mixed-mode anticracks in highly porous materials...

Contents

Prerequisites

Setup

Running the Analysis

Data Description

df_mmft.pkl

df_legacy.pkl

License

Citation

Time series

Taylor Swift | The Eras Tour Official Setlist Data

Dataset Description:

Data Collection and Processing:

Attribute Descriptions:

Potential Applications:

Disclaimer and Responsible Use:

Data from: Redocking the PDB

Enhancing Stock Market Forecasting with Machine Learning A PineScript-Driven...

Metaverse Gait Authentication Dataset (MGAD)

Household Data

Data and Processing from "Carbon-centric dynamics of Earth's marine...

TrafficDator Madrid

warvan-ml-dataset

warvan/warvan-ml-dataset

Feature Engineering for Churn Prediction

`df_mmft.pkl`

`df_legacy.pkl`