Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Unsupervised exploratory data analysis (EDA) is often the first step in understanding complex data sets. While summary statistics are among the most efficient and convenient tools for exploring and describing sets of data, they are often overlooked in EDA. In this paper, we show multiple case studies that compare the performance, including clustering, of a series of summary statistics in EDA. The summary statistics considered here are pattern recognition entropy (PRE), the mean, standard deviation (STD), 1-norm, range, sum of squares (SSQ), and X4, which are compared with principal component analysis (PCA), multivariate curve resolution (MCR), and/or cluster analysis. PRE and the other summary statistics are direct methods for analyzing data; they are not factor-based approaches. To quantify the performance of summary statistics, we use the concept of the “critical pair,” which is employed in chromatography. The data analyzed here come from different analytical methods. Hyperspectral images, including one of a biological material, are also analyzed. In general, PRE outperforms the other summary statistics, especially in image analysis, although a suite of summary statistics is useful in exploring complex data sets. While PRE results were generally comparable to those from PCA and MCR, PRE is easier to apply. For example, there is no need to determine the number of factors that describe a data set. Finally, we introduce the concept of divided spectrum-PRE (DS-PRE) as a new EDA method. DS-PRE increases the discrimination power of PRE. We also show that DS-PRE can be used to provide the inputs for the k-nearest neighbor (kNN) algorithm. We recommend PRE and DS-PRE as rapid new tools for unsupervised EDA.
https://www.usa.gov/government-works
This measure determines the perceived value added by the University Centers (UCs) to their clients. EDA funds UCs to provide technical assistance and specialized services (for example, feasibility studies, marketing research, economic analysis, environmental services, and technology transfer) to local officials and communities. This assistance improves the community’s capacity to plan and manage successful development projects. UCs develop client profiles and report findings to EDA, which evaluates the performance of each center once every 3 years and verifies the data. “Taking action as a result of the assistance facilitated” means to implement an aspect of the technical assistance provided by the UC in one of several areas: economic development initiatives and training session development; linkages to crucial resources; economic development planning; project management; community investment package development; geographic information system services; strategic partnering to public or private sector entities; increased organizational capacity; feasibility plans; marketing studies; technology transfer; new company, product, or patent development; and other services.
https://creativecommons.org/publicdomain/zero/1.0/
We all love movies! I remember watching my first movie with my family when I was 5 and, 3 years later, I still love movies. But have you ever wondered how some people rate movies as good or bad, awesome or meh? That's right: different people have different perspectives on how much they like or dislike a movie. To help us select from the plethora of movie options out there, the IMDB platform provides honest reviews by the people, for the people.
Long story short, this assignment will take you through different aspects of how a movie is reviewed by different people from across the globe based on its star cast, genre, story length, and many other aspects.
So here is what you need to do! A few points: 1. Download the dataset and the dictionary that will help you learn the different columns in the dataset. 2. Start exploring the data by performing EDA (wiki what EDA is, if you are a dummy like I was initially). 3. Come back to this notebook to check what I did while exploring the data, and then follow the subtasks and checkpoints!
Simple, isn't it? Do complete the exercise and let me know in the comments whether you found it helpful. There's always scope for improvement, so tell me what more could have been added to this notebook! Hope you'll have a good time exploring the data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Research Domain:
The dataset is part of a project focused on retail sales forecasting. Specifically, it is designed to predict daily sales for Rossmann, a chain of over 3,000 drug stores operating across seven European countries. The project falls under the broader domain of time series analysis and machine learning applications for business optimization. The goal is to apply machine learning techniques to forecast future sales based on historical data, which includes factors like promotions, competition, holidays, and seasonal trends.
Purpose:
The primary purpose of this dataset is to help Rossmann store managers predict daily sales for up to six weeks in advance. By making accurate sales predictions, Rossmann can improve inventory management, staffing decisions, and promotional strategies. This dataset serves as a training set for machine learning models aimed at reducing forecasting errors and supporting decision-making processes across the company’s large network of stores.
How the Dataset Was Created:
The dataset was compiled from several sources, including historical sales data from Rossmann stores, promotional calendars, holiday schedules, and external factors such as competition. The data is split into multiple features, such as the store's location, promotion details, whether the store was open or closed, and weather information. The dataset is publicly available on platforms like Kaggle and was initially created for the Kaggle Rossmann Store Sales competition. The data is made accessible via an API for further analysis and modeling, and it is structured to help machine learning models predict future sales based on various input variables.
Dataset Structure:
The dataset consists of three main files, each with its specific role:
Train:
This file contains the historical sales data, which is used to train machine learning models. It includes daily sales information for each store, as well as various features that could influence the sales (e.g., promotions, holidays, store type, etc.).
https://handle.test.datacite.org/10.82556/yb6j-jw41
PID: b1c59499-9c6e-42c2-af8f-840181e809db
Test2:
The test dataset mirrors the structure of train.csv but does not include the actual sales values (i.e., the target variable). This file is used for making predictions with the trained machine learning models and for evaluating the accuracy of those predictions when the true sales data is unknown.
https://handle.test.datacite.org/10.82556/jerg-4b84
PID: 7cbb845c-21dd-4b60-b990-afa8754a0dd9
Store:
This file provides metadata about each store, including information such as the store’s location, type, and assortment level. This data is essential for understanding the context in which the sales data is gathered.
https://handle.test.datacite.org/10.82556/nqeg-gy34
PID: 9627ec46-4ee6-4969-b14a-bda555fe34db
Id: A unique identifier for each (Store, Date) combination within the test set.
Store: A unique identifier for each store.
Sales: The daily turnover (target variable) for each store on a specific day (this is what you are predicting).
Customers: The number of customers visiting the store on a given day.
Open: An indicator of whether the store was open (1 = open, 0 = closed).
StateHoliday: Indicates if the day is a state holiday, with values like:
'a' = public holiday,
'b' = Easter holiday,
'c' = Christmas,
'0' = no holiday.
SchoolHoliday: Indicates whether the store is affected by school closures (1 = yes, 0 = no).
StoreType: Differentiates between four types of stores: 'a', 'b', 'c', 'd'.
Assortment: Describes the level of product assortment in the store:
'a' = basic,
'b' = extra,
'c' = extended.
CompetitionDistance: Distance (in meters) to the nearest competitor store.
CompetitionOpenSince[Month/Year]: The month and year when the nearest competitor store opened.
Promo: Indicates whether the store is running a promotion on a particular day (1 = yes, 0 = no).
Promo2: Indicates whether the store is participating in Promo2, a continuing promotion for some stores (1 = participating, 0 = not participating).
Promo2Since[Year/Week]: The year and calendar week when the store started participating in Promo2.
PromoInterval: Describes the months when Promo2 is active, e.g., "Feb,May,Aug,Nov" means the promotion starts in February, May, August, and November.
To work with this dataset, you will need to have specific software installed, including:
DBRepo Authorization: This is required to access the datasets via the DBRepo API. You may need to authenticate with an API key or login credentials to retrieve the datasets.
Python Libraries: Key libraries for working with the dataset include pandas for data manipulation, numpy for numerical operations, matplotlib and seaborn for data visualization, and scikit-learn for machine learning algorithms.
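As a quick illustration of how these libraries come together, here is a minimal sketch for loading and merging the files with pandas. The filenames (train.csv, store.csv) and local paths are assumptions and may differ depending on how you retrieve the data (e.g., via the DBRepo API).

```python
import pandas as pd

# Load the historical sales data and the store metadata
# (filenames are assumptions; adjust to however you retrieved the files).
train = pd.read_csv("train.csv", parse_dates=["Date"], low_memory=False)
store = pd.read_csv("store.csv")

# Attach store-level metadata (StoreType, Assortment, CompetitionDistance, ...) to each daily record.
df = train.merge(store, on="Store", how="left")

# A few quick EDA checks: overall shape, open/closed split, and average sales by store type.
print(df.shape)
print(df["Open"].value_counts())
print(df.groupby("StoreType")["Sales"].mean())
```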
Several additional resources are available for working with the dataset:
Presentation:
A presentation summarizing the exploratory data analysis (EDA), feature engineering process, and key insights from the analysis is provided. This presentation also includes visualizations that help in understanding the dataset’s trends and relationships.
Jupyter Notebook:
A Jupyter notebook, titled Retail_Sales_Prediction_Capstone_Project.ipynb, is provided, which details the entire machine learning pipeline, from data loading and cleaning to model training and evaluation.
Model Evaluation Results:
The project includes a detailed evaluation of various machine learning models, including their performance metrics like training and testing scores, Mean Absolute Percentage Error (MAPE), and Root Mean Squared Error (RMSE). This allows for a comparison of model effectiveness in forecasting sales.
Trained Models (.pkl files):
The models trained during the project are saved as .pkl files. These files contain the trained machine learning models (e.g., Random Forest, Linear Regression, etc.) that can be loaded and used to make predictions without retraining the models from scratch.
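A minimal sketch of how such a saved model might be loaded and used for prediction is shown below. The .pkl filename, the feature columns, and the test filename are placeholders, not the project's actual artifacts.

```python
import pandas as pd
import joblib  # pickle also works for loading plain .pkl files

# Load one of the saved models (the filename is illustrative; use the actual .pkl from the project).
model = joblib.load("random_forest_model.pkl")

# Build a feature matrix from the test data with the same columns used during training.
test = pd.read_csv("test.csv", parse_dates=["Date"])
feature_cols = ["Store", "Promo", "SchoolHoliday"]  # placeholder subset; match the notebook's features
predictions = model.predict(test[feature_cols])

# Write predictions in the sample_submission.csv format (Id, Sales).
pd.DataFrame({"Id": test["Id"], "Sales": predictions}).to_csv("submission.csv", index=False)
```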
sample_submission.csv:
This file is a sample submission file that demonstrates the format of predictions expected when using the trained model. The sample_submission.csv contains predictions made on the test dataset using the trained Random Forest model. It provides an example of how the output should be structured for submission.
These resources provide a comprehensive guide to implementing and analyzing the sales forecasting model, helping you understand the data, methods, and results in greater detail.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains data collected from patients suffering from cancer-related pain. The features extracted from clinical data (including typical cancer phenomena such as breakthrough pain) and the biosignal acquisitions contributed to the definition of a multidimensional dataset. This unique database can be useful for the characterization of the patient’s pain experience from a qualitative and quantitative perspective. We implemented measurable biosignal-related indicators of the individual’s pain response and of the overall Autonomic Nervous System (ANS) functioning. The most peculiar features extracted from EDA and ECG signals can be adopted to investigate the status and complex functioning of the ANS through the study of sympatho-vagal activations. Specifically, while EDA is mainly related to sympathetic activation, the Heart Rate Variability (HRV), which can be derived from ECG recordings, is strictly related to the interplay between sympathetic and parasympathetic functioning.
As far as the EDA signal is concerned, two types of analyses have been performed: (i) the Trough-To-Peak (TTP), or min-max, analysis, aimed at measuring the difference between the Skin Conductance (SC) at the peak of a response and its previous minimum within pre-established time windows; (ii) the Continuous Decomposition Analysis (CDA), aimed at decomposing the SC data into continuous signals of tonic (basic level of conductance) and phasic (short-duration changes in the SC) activity. Before applying the TTP analysis or the CDA, the signal was filtered by means of a fifth-order Butterworth low-pass filter with a cutoff frequency of 1 Hz and downsampled to 10 Hz to reduce the computational burden of the analysis. The application of TTP and CDA allowed the detection and measurement of SC Responses (SCRs), and the following parameters have been calculated for both the TTP and CDA methodologies:
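The filtering and downsampling step described above could look like the following sketch. The original sampling rate and the exact filtering routine used by the authors are not specified here, so both are assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt, decimate

def preprocess_eda(sc_signal, fs):
    """Low-pass filter and downsample a skin conductance signal as described above."""
    # Fifth-order Butterworth low-pass filter with a 1 Hz cutoff (zero-phase filtering assumed).
    b, a = butter(N=5, Wn=1.0, btype="low", fs=fs)
    filtered = filtfilt(b, a, sc_signal)
    # Downsample by an integer factor towards 10 Hz (use resampling if fs is not a multiple of 10).
    factor = int(fs // 10)
    return decimate(filtered, factor) if factor > 1 else filtered

# Example with a synthetic 64 Hz recording (the dataset's actual sampling rate may differ).
fs = 64
sc = np.random.randn(fs * 60) * 0.05 + 2.0
sc_10hz = preprocess_eda(sc, fs)
```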
Concerning the ECG, the RR series of interbeat intervals (i.e., the time between successive R waves of the QRS complex on the ECG waveform) has been computed to extract time-domain parameters of the HRV. The R peak detection was carried out by adopting the Pan–Tompkins algorithm for QRS detection and R peak identification. The corresponding RR series of interbeat intervals were derived as the difference between successive R peaks.
The ECG-derived RR time series was then filtered by means of a recursive procedure to remove the intervals differing most from the mean of the surrounding RR intervals. Then, both the Time-Domain Analysis (TDA) and Frequency-Domain Analysis (FDA) of the HRV have been carried out to extract the main features characterizing the variability of the heart rhythm. Time-domain parameters are obtained from statistical analysis of the intervals between heart beats and are used to describe how much variability in the heartbeats is present at various time scales.
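The time-domain analysis can be illustrated with a small function computing a few commonly used HRV parameters (SDNN, RMSSD, pNN50). This is a generic sketch and not necessarily the exact parameter set extracted for this dataset.

```python
import numpy as np

def hrv_time_domain(rr_ms):
    """Common time-domain HRV parameters computed from an RR-interval series in milliseconds."""
    rr = np.asarray(rr_ms, dtype=float)
    diff = np.diff(rr)
    return {
        "mean_rr": rr.mean(),                       # mean RR interval
        "sdnn": rr.std(ddof=1),                     # standard deviation of RR intervals
        "rmssd": np.sqrt(np.mean(diff ** 2)),       # root mean square of successive differences
        "pnn50": 100 * np.mean(np.abs(diff) > 50),  # % of successive differences > 50 ms
    }

print(hrv_time_domain([812, 790, 805, 830, 818, 795, 801]))
```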
The parameters computed through the TDA include the following:
Frequency-domain parameters reflect the distribution of spectral power across different frequency bands and are used to assess specific components of HRV (e.g., the thermoregulation control loop, baroreflex control loop, and respiration control loop, which are regulated by both sympathetic and vagal nerves of the ANS).
The FDA parameters have been computed by adopting Welch's Fourier periodogram method, which is based on the Discrete Fourier Transform (DFT) and allows the RR series to be expressed in the discrete frequency domain. Welch's method is well suited to the non-stationarity of the RR series: it divides the signal into segments of constant length, applies the Fast Fourier Transform (FFT) to each segment individually, and averages the resulting periodograms. The periodogram is essentially an estimate of the power spectral density of a time series.
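A hedged sketch of the Welch-based spectral estimation is given below. The resampling rate and segment length are assumptions, since the exact settings used for the dataset are not stated; interpolating the RR series onto a uniform grid before applying Welch's method is a common, but not the only, choice.

```python
import numpy as np
from scipy.interpolate import interp1d
from scipy.signal import welch

def hrv_welch_psd(rr_ms, fs_resample=4.0):
    """Estimate the HRV power spectrum of an RR series (ms) with Welch's periodogram."""
    rr = np.asarray(rr_ms, dtype=float)
    t = np.cumsum(rr) / 1000.0                              # beat times in seconds
    t_uniform = np.arange(t[0], t[-1], 1.0 / fs_resample)   # uniform time grid
    rr_uniform = interp1d(t, rr, kind="cubic")(t_uniform)
    # Welch's method: split into overlapping segments, FFT each, average the periodograms.
    freqs, psd = welch(rr_uniform - rr_uniform.mean(),
                       fs=fs_resample,
                       nperseg=min(256, len(rr_uniform)))
    return freqs, psd

# Example on a synthetic RR series.
freqs, psd = hrv_welch_psd(np.random.normal(800, 40, size=300))
```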
The FDA parameters include the following:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Project Documentation: Cucumber Disease Detection
Introduction: A machine learning model for the automatic detection of diseases in cucumber plants is to be developed as part of the "Cucumber Disease Detection" project. This research is crucial because it tackles the issue of early disease identification in agriculture, which can increase crop yield and cut down on financial losses. To train and test the model, we use a dataset of pictures of cucumber plants.
Importance: Early disease diagnosis helps minimize crop losses, stop the spread of diseases, and better allocate resources in farming. Agriculture is a real-world application of this concept.
Goals and Objectives: Develop a machine learning model to classify cucumber plant images into healthy and diseased categories. Achieve a high level of accuracy in disease detection. Provide a tool for farmers to detect diseases early and take appropriate action.
Data Collection: Images from agricultural areas were gathered using cameras and smartphones.
Data Preprocessing: Data cleaning to remove irrelevant or corrupted images. Handling missing values, if any, in the dataset. Removing outliers that may negatively impact model training. Data augmentation techniques applied to increase dataset diversity.
Exploratory Data Analysis (EDA): The dataset was examined using visuals such as scatter plots and histograms and was inspected for patterns, trends, and correlations. EDA made it easier to understand the distribution of photos of healthy and diseased plants.
Methodology:
Machine Learning Algorithms: Convolutional Neural Networks (CNNs) were chosen for image classification due to their effectiveness in handling image data. Transfer learning using pre-trained models such as ResNet or MobileNet may be considered.
Train-Test Split: The dataset was split into training and testing sets with a suitable ratio. Cross-validation may be used to assess model performance robustly.
Model Development: The CNN model's architecture consists of layers, units, and activation functions. Hyperparameters, including the learning rate, batch size, and optimizer, were chosen on the basis of experimentation. To avoid overfitting, regularization methods such as dropout and L2 regularization were used.
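To make the description concrete, here is an illustrative Keras CNN with dropout and L2 regularization. The input size, layer widths, and hyperparameter values are placeholders and not the project's actual configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

# Illustrative CNN for binary healthy/diseased classification of 128x128 RGB leaf images.
model = models.Sequential([
    layers.Input(shape=(128, 128, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu", kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.5),                    # dropout to mitigate overfitting
    layers.Dense(1, activation="sigmoid"),  # healthy vs. diseased
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```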
Model Training: During training, the model was fed the prepared dataset over a number of epochs, and the loss function was minimized using an optimization method. To ensure convergence, early stopping and model checkpoints were used.
Model Evaluation:
Evaluation Metrics: Accuracy, precision, recall, F1-score, and the confusion matrix were used to assess model performance. Results were computed for both the training and test datasets.
Performance Discussion: The model's performance was analyzed in the context of disease detection in cucumber plants, and the strengths and weaknesses of the model were identified.
Results and Discussion: Key project findings include model performance and disease detection precision, a comparison of the models employed (showing the benefits and drawbacks of each), and the challenges faced throughout the project along with the methods used to solve them.
Conclusion: A recap of the project's key learnings, highlighting the project's importance to early disease detection in agriculture. Future enhancements and potential research directions are suggested.
References
Libraries: Pillow, Roboflow, YOLO, Sklearn, matplotlib
Datasets: https://data.mendeley.com/datasets/y6d3z6f8z9/1
Code Repository: https://universe.roboflow.com/hakuna-matata/cdd-g8a6g
Rafiur Rahman Rafit EWU 2018-3-60-111
Please note: Use ds633.1 to access the RDA-maintained ERA5 monthly mean data; see ERA5 Reanalysis (Monthly Mean 0.25 Degree Latitude-Longitude Grid), RDA dataset ds633.1. This dataset is no longer being updated, and web access has been removed.
After many years of research and technical preparation, the production of a new ECMWF climate reanalysis to replace ERA-Interim is in progress. ERA5 is the fifth generation of ECMWF atmospheric reanalyses of the global climate, which started with the FGGE reanalyses produced in the 1980s, followed by ERA-15, ERA-40 and most recently ERA-Interim. ERA5 will cover the period January 1950 to near real time, though the first segment of data to be released will span the period 2010-2016.
ERA5 is produced using high-resolution forecasts (HRES) at 31 kilometer resolution (one fourth the spatial resolution of the operational model) and a 62 kilometer resolution ten member 4D-Var ensemble of data assimilation (EDA) in CY41r2 of ECMWF's Integrated Forecast System (IFS) with 137 hybrid sigma-pressure (model) levels in the vertical, up to a top level of 0.01 hPa. Atmospheric data on these levels are interpolated to 37 pressure levels (the same levels as in ERA-Interim). Surface or single level data are also available, containing 2D parameters such as precipitation, 2 meter temperature, top of atmosphere radiation and vertical integrals over the entire atmosphere. The IFS is coupled to a soil model, the parameters of which are also designated as surface parameters, and an ocean wave model. Generally, the data is available at an hourly frequency and consists of analyses and short (18 hour) forecasts, initialized twice daily from analyses at 06 and 18 UTC. Most analyses parameters are also available from the forecasts. There are a number of forecast parameters, e.g. mean rates and accumulations, that are not available from the analyses. Together, the hourly analysis and twice daily forecast parameters form the basis of the monthly means (and monthly diurnal means) found in this dataset.
Improvements to ERA5, compared to ERA-Interim, include use of HadISST.2, reprocessed ECMWF climate data records (CDR), and implementation of RTTOV11 radiative transfer. Variational bias corrections have not only been applied to satellite radiances, but also ozone retrievals, aircraft observations, surface pressure, and radiosonde profiles.
NCAR's Data Support Section (DSS) is performing and supplying a grid transformed version of ERA5, in which variables originally represented as spectral coefficients or archived on a reduced Gaussian grid are transformed to a regular 1280 longitude by 640 latitude N320 Gaussian grid. In addition, DSS is also computing horizontal winds (u-component, v-component) from spectral vorticity and divergence where these are available. Finally, the data is reprocessed into single parameter time series.
Please note: As of November 2017, DSS is also producing a CF 1.6 compliant netCDF-4/HDF5 version of ERA5 for CISL RDA at NCAR. The netCDF-4/HDF5 version is the de facto RDA ERA5 online data format. The GRIB1 data format is only available via NCAR's High Performance Storage System (HPSS). We encourage users to evaluate the netCDF-4/HDF5 version for their work, and to use the currently existing GRIB1 files as a reference and basis of comparison. To ease this transition, there is a one-to-one correspondence between the netCDF-4/HDF5 and GRIB1 files, with as much GRIB1 metadata as possible incorporated into the attributes of the netCDF-4/HDF5 counterpart.
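As an example of working with the netCDF-4/HDF5 files, a minimal xarray sketch is shown below. The filename is a placeholder, and the variable and dimension names depend on the specific ERA5 product file you download.

```python
import xarray as xr

# Open one of the RDA netCDF-4/HDF5 ERA5 files (the filename is a placeholder).
ds = xr.open_dataset("era5_example.nc")

# Inspect variables, coordinates, and the GRIB1 metadata carried over into netCDF attributes.
print(ds)

# Quick summary for the first data variable in the file (variable names vary by product).
var = list(ds.data_vars)[0]
print(var, ds[var].mean().values, ds[var].attrs)
```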
https://creativecommons.org/publicdomain/zero/1.0/
This data contains salaries of University of Vermont (UVM) faculty from 2009 to 2021. We present two datasets. The second dataset is richer because it contains information on faculty departments/colleges; however, it contains fewer rows due to how we chose to join this data.
1. salaries_without_dept.csv contains all of the data we extracted from the PDFs. The four columns are: Year, Faculty Name, Primary Job Title, and Base Pay. There are 47,479 rows.
2. salaries_final.csv contains the same columns as [1], but also joins with data about the faculty's "Department" and "College" (for a total of six columns). There are only 14,470 rows in this dataset because we removed rows for which we could not identify the Department/College of the faculty.
All data is publicly available on the University of Vermont website. I downloaded all PDFs from https://www.uvm.edu/oir/faculty-and-staff. Then I used a Python package (Camelot) to parse the tabular PDFs and used regex matching to ensure data was correctly parsed. I performed some initial cleaning (removed dollar signs from monetary values, etc.). At this stage, I saved the data to salaries_without_dept.csv.
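The PDF-parsing step might look roughly like the sketch below (using camelot-py). The PDF filename and the column layout are assumptions, and the real pipeline also included the regex checks described above.

```python
import camelot
import pandas as pd

# Parse the salary tables from one of the UVM base-pay PDFs (filename is a placeholder).
tables = camelot.read_pdf("uvm_base_pay_2021.pdf", pages="all")

# Stack the per-page tables and apply light cleaning similar to what is described above.
df = pd.concat([t.df for t in tables], ignore_index=True)
df.columns = ["Faculty Name", "Primary Job Title", "Base Pay"]  # assumed column layout
df["Base Pay"] = pd.to_numeric(df["Base Pay"].str.replace(r"[$,]", "", regex=True),
                               errors="coerce")
df["Year"] = 2021
df.to_csv("salaries_without_dept_2021.csv", index=False)
```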
I also wanted to know what department and college each faculty member belonged to. I used http://catalogue.uvm.edu/undergraduate/faculty/fulltime (plus Python's lxml package to parse the HTML) to determine "Department" and then manually built an encoding to map "Department" to "College". Note that this link provides faculty information for 2020, thus after joining we end up only with faculty that are still employed as of 2020 (this should be taken into consideration). Secondly, this link does not include UVM administration (and possibly some other personnel), so they are not present in this dataset. Thirdly, there were several different ways names were reported (sometimes even the same person has their name reported differently in different years). We tried joining first on LastName+FirstName and then on LastName+FirstInitial but did not bother using middle names. To handle ambiguity, we removed duplicates (e.g. we removed Martin, Jacob and Martin, Jacob William as they were not distinguishable by our criteria). The joined data is available in salaries_final.csv.
Note: perhaps "College" was not the best naming, since faculty of UVM Libraries and other miscellaneous fields are included.
The column definitions are self-explanatory, but the "College" abbreviation meanings are unclear to a non-UVM-affiliate. We've included data_dictionary.csv to explain what each "College" abbreviation means. You can use this dictionary to filter out miscellaneous "colleges" (e.g. UVM Libraries) and only include colleges within the undergraduate program (e.g. filter out College of Medicine).
Despite there only being a few (six) columns, I think this is quite a rich dataset and could also be paired with other UVM data or combined with data from other universities. This dataset is mainly for data analytics and exploratory data analysis (EDA), but perhaps it could also be used for forecasting (however, there are only 12 time values, so you'd probably want to make use of "College" or "Primary Job Title"). Interesting EDA questions could be:
1. "Are the faculty in arts & humanities departments being paid less?" This news article -- UVM to eliminate 23 programs in the College of Arts and Sciences -- suggests so. Give a quantitative answer.
2. "Are lecturers declining in quantity and pay?" This news article -- ‘I’m going to miss this:’ Three cut lecturers reflect on time at UVM -- suggests so. Give a quantitative answer.
3. "How does the College of Medicine compare to the undergraduate colleges in terms of number of faculty and pay?" See data_dictionary.csv for which colleges are in the undergraduate program.
4. "How long does it take for a faculty member to become a full professor?" Yes, this is also answerable from the data because Primary Job Title updates when a faculty member is promoted.
I do not plan to maintain this dataset. If I get the chance, I may update it with future year salaries.
This publication corresponds to the Common Data Model (CDM) specification of the Baseline Use Case proposed in T.5.2 (WP5) in the BY-COVID project on “SARS-CoV-2 Vaccine(s) effectiveness in preventing SARS-CoV-2 infection.”
Research Question: “How effective have the SARS-CoV-2 vaccination programmes been in preventing SARS-CoV-2 infections?”
Intervention (exposure): COVID-19 vaccine(s)
Outcome: SARS-CoV-2 infection
Subgroup analysis: Vaccination schedule (type of vaccine)
Study Design: An observational retrospective longitudinal study to assess the effectiveness of the SARS-CoV-2 vaccine in preventing SARS-CoV-2 infections using routinely collected social, health and care data from several countries. A causal model was established using Directed Acyclic Graphs (DAGs) to map domain knowledge, theories and assumptions about the causal relationship between exposure and outcome. The DAG developed for the research question of interest is shown below.
Cohort definition: All people eligible to be vaccinated (from 5 to 115 years old, included) or with, at least, one dose of a SARS-CoV-2 vaccine (any of the available brands), having or not a previous SARS-CoV-2 infection.
Inclusion criteria: All people vaccinated with at least one dose of the COVID-19 vaccine (any available brands) in an area of residence. Any person eligible to be vaccinated (from 5 to 115 years old, included) with a positive diagnosis (irrespective of the type of test) for SARS-CoV-2 infection (COVID-19) during the period of study.
Exclusion criteria: People not eligible for the vaccine (from 0 to 4 years old, included).
Study period: From the date of the first documented SARS-CoV-2 infection in each country to the most recent date in which data is available at the time of analysis. Roughly from 01-03-2020 to 30-06-2022, depending on the country.
Files included in this publication:
Causal model (responding to the research question):
- SARS-CoV-2 vaccine effectiveness causal model v.1.0.0 (HTML) - Interactive report showcasing the structural causal model (DAG) to answer the research question
- SARS-CoV-2 vaccine effectiveness causal model v.1.0.0 (QMD) - Quarto RMarkdown script to produce the structural causal model
Common data model specification (following the causal model):
- SARS-CoV-2 vaccine effectiveness data model specification (XLSX) - Human-readable version (Excel)
- SARS-CoV-2 vaccine effectiveness data model specification dataspice (HTML) - Human-readable version (interactive report)
- SARS-CoV-2 vaccine effectiveness data model specification dataspice (JSON) - Machine-readable version
Synthetic dataset (complying with the common data model specifications):
- SARS-CoV-2 vaccine effectiveness synthetic dataset (CSV) [UTF-8, pipe | separated, N~650,000 registries]
- SARS-CoV-2 vaccine effectiveness synthetic dataset EDA (HTML) - Interactive report of the exploratory data analysis (EDA) of the synthetic dataset
- SARS-CoV-2 vaccine effectiveness synthetic dataset EDA (JSON) - Machine-readable version of the exploratory data analysis (EDA) of the synthetic dataset
- SARS-CoV-2 vaccine effectiveness synthetic dataset generation script (IPYNB) - Jupyter notebook with Python scripting and commenting to generate the synthetic dataset
Baseline Use Case: SARS-CoV-2 vaccine effectiveness assessment - Common Data Model Specification v.1.1.0 change log:
- Updated the causal model to eliminate the consideration of 'vaccination_schedule_cd' as a mediator
- Adjusted the study period to be consistent with the Study Protocol
- Updated 'sex_cd' as a required variable
- Added 'chronic_liver_disease_bl' as a comorbidity at the individual level
- Updated 'socecon_lvl_cd' at the area level as a recommended variable
- Added crosswalks for the definition of 'chronic_liver_disease_bl' in a separate sheet
- Updated the 'vaccination_schedule_cd' reference to the 'Vaccine' node in the updated DAG
- Updated the description of the 'confirmed_case_dt' and 'previous_infection_dt' variables to clarify the definition and the need for a single registry per person
The scripts (software) accompanying the data model specification are offered "as-is" without warranty, disclaiming liability for damages resulting from using them. The software is released under the CC-BY-4.0 licence, which permits you to use the content for almost any purpose (but does not grant you any trademark permissions), so long as you note the license and give credit.
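Reading the synthetic dataset with pandas is straightforward; a minimal sketch is shown below, with the filename as a placeholder.

```python
import pandas as pd

# Load the synthetic dataset (pipe-separated, UTF-8); the filename is a placeholder.
synthetic = pd.read_csv("vaccine_effectiveness_synthetic_dataset.csv", sep="|", encoding="utf-8")

# Quick checks against the common data model specification.
print(synthetic.shape)             # roughly 650,000 registries expected
print(synthetic.columns.tolist())  # variable names should match the data model spec
print(synthetic.dtypes)
```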
The dataset contains a total of 25,161 rows, each row representing the stock market data for a specific company on a given date. The information collected through web scraping from www.nasdaq.com includes the stock prices and trading volumes for the companies listed, such as Apple, Starbucks, Microsoft, Cisco Systems, Qualcomm, Meta, Amazon.com, Tesla, Advanced Micro Devices, and Netflix.
Data Analysis Tasks:
1) Exploratory Data Analysis (EDA): Analyze the distribution of stock prices and volumes for each company over time. Visualize trends, seasonality, and patterns in the stock market data using line charts, bar plots, and heatmaps.
2) Correlation Analysis: Investigate the correlations between the closing prices of different companies to identify potential relationships. Calculate correlation coefficients and visualize correlation matrices.
3) Top Performers Identification: Identify the top-performing companies based on their stock price growth and trading volumes over a specific time period.
4) Market Sentiment Analysis: Perform sentiment analysis using Natural Language Processing (NLP) techniques on news headlines related to each company. Determine whether positive or negative news impacts the stock prices and volumes.
5) Volatility Analysis: Calculate the volatility of each company's stock prices using metrics like Standard Deviation or Bollinger Bands. Analyze how volatile stocks are in comparison to others.
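As one way to approach task 2 (correlation analysis), here is a hedged sketch; the CSV filename and the column names ("Date", "Company", "Close") are assumptions about the dataset's schema.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation of daily returns between companies (filename and column names are assumed).
df = pd.read_csv("stock_market_data.csv", parse_dates=["Date"])
closes = df.pivot_table(index="Date", columns="Company", values="Close")

corr = closes.pct_change().corr()   # correlate daily returns rather than raw prices
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation of daily returns")
plt.tight_layout()
plt.show()
```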
Machine Learning Tasks:
1) Stock Price Prediction: Use time-series forecasting models like ARIMA, SARIMA, or Prophet to predict future stock prices for a particular company. Evaluate the models' performance using metrics like Mean Squared Error (MSE) or Root Mean Squared Error (RMSE).
2) Classification of Stock Movements: Create a binary classification model to predict whether a stock will rise or fall on the next trading day. Utilize features like historical price changes, volumes, and technical indicators for the predictions. Implement classifiers such as Logistic Regression, Random Forest, or Support Vector Machines (SVM).
3) Clustering Analysis: Cluster companies based on their historical stock performance using unsupervised learning algorithms like K-means clustering. Explore if companies with similar stock price patterns belong to specific industry sectors.
4) Anomaly Detection: Detect anomalies in stock prices or trading volumes that deviate significantly from the historical trends. Use techniques like Isolation Forest or One-Class SVM for anomaly detection.
5) Reinforcement Learning for Portfolio Optimization: Formulate the stock market data as a reinforcement learning problem to optimize a portfolio's performance. Apply algorithms like Q-Learning or Deep Q-Networks (DQN) to learn the optimal trading strategy.
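For task 1 (stock price prediction), a minimal ARIMA sketch with statsmodels might look like this; again, the filename, column names, and the (5, 1, 0) order are illustrative assumptions.

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error

# Univariate ARIMA forecast for one company's closing price (schema and order are assumed).
df = pd.read_csv("stock_market_data.csv", parse_dates=["Date"])
series = (df[df["Company"] == "Apple"]
          .set_index("Date")["Close"]
          .asfreq("B")   # business-day frequency; forward-fill gaps from non-trading days
          .ffill())

train, test = series[:-30], series[-30:]
model = ARIMA(train, order=(5, 1, 0)).fit()
forecast = model.forecast(steps=len(test))

rmse = mean_squared_error(test, forecast) ** 0.5
print(f"RMSE over the last 30 business days: {rmse:.2f}")
```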
The dataset provided on Kaggle, titled "Stock Market Stars: Historical Data of Top 10 Companies," is intended for learning purposes only. The data has been gathered from public sources, specifically from web scraping www.nasdaq.com, and is presented in good faith to facilitate educational and research endeavors related to stock market analysis and data science.
It is essential to acknowledge that while we have taken reasonable measures to ensure the accuracy and reliability of the data, we do not guarantee its completeness or correctness. The information provided in this dataset may contain errors, inaccuracies, or omissions. Users are advised to use this dataset at their own risk and are responsible for verifying the data's integrity for their specific applications.
This dataset is not intended for any commercial or legal use, and any reliance on the data for financial or investment decisions is not recommended. We disclaim any responsibility or liability for any damages, losses, or consequences arising from the use of this dataset.
By accessing and utilizing this dataset on Kaggle, you agree to abide by these terms and conditions and understand that it is solely intended for educational and research purposes.
Please note that the dataset's contents, including the stock market data and company names, are subject to copyright and other proprietary rights of the respective sources. Users are advised to adhere to all applicable laws and regulations related to data usage, intellectual property, and any other relevant legal obligations.
In summary, this dataset is provided "as is" for learning purposes, without any warranties or guarantees, and users should exercise due diligence and judgment when using the data for any purpose.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The ability to predict earthquakes is invaluable, especially in high-risk seismic zones, yet precise predictions remain elusive. One potential reason is the limited integration of statistical approaches in earthquake research. In this study, I employed exploratory data analysis (EDA), a data-driven parametric statistical method, to investigate seismic records from Japan, using data provided by the Japan Meteorological Agency. The intervals between earthquakes closely followed an exponential distribution, defined by a single parameter, λ, representing event frequency. In contrast to the conventional Gutenberg-Richter law, earthquake magnitudes conformed to a normal distribution, characterised by two parameters: µ (mean) and σ (scale). After establishing these distributions and their parameters, significant shifts became evident. Before the 2011 Tohoku Pacific Ocean earthquake, notable changes emerged: earthquake intervals initially shortened, possibly reflecting energy destabilisation around asperities at plate boundaries, followed by an increase in the magnitude scale. These patterns suggest tectonic plates are heterogeneous, with varying boundary rigidity. Additionally, moving averages of magnitude exhibited substantial fluctuations, reaching unusually high levels. Identifying such anomalies accurately required understanding baseline distributions under normal conditions. Broadening the use of EDA across diverse seismic datasets could improve prediction accuracy. This study underscores the importance of statistical methodologies in seismic research and provides critical insights for enhancing seismic risk assessment.
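To illustrate the parameter-estimation step described in the abstract, here is a minimal sketch that fits an exponential distribution to inter-event times and a normal distribution to magnitudes. It runs on synthetic data, since the JMA catalogue itself is not included here.

```python
import numpy as np
from scipy import stats

# Synthetic data standing in for a seismic catalogue:
# inter-event times modelled as exponential (rate lambda), magnitudes as normal (mu, sigma).
rng = np.random.default_rng(0)
intervals_h = rng.exponential(scale=6.0, size=2000)      # hours between events (synthetic)
magnitudes = rng.normal(loc=4.2, scale=0.6, size=2000)   # synthetic magnitudes

loc, scale = stats.expon.fit(intervals_h, floc=0)
lam = 1.0 / scale                                        # events per hour
mu, sigma = stats.norm.fit(magnitudes)

print(f"lambda = {lam:.3f} events/hour, mu = {mu:.2f}, sigma = {sigma:.2f}")
```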
This dataset is a merged dataset created from the data provided in the competition "Store Sales - Time Series Forecasting". The other datasets that were provided there apart from train and test (for example holidays_events, oil, stores, etc.) could not be used in the final prediction. In my understanding, EDA on the merged dataset gives a clearer picture of the other factors that might also affect the final prediction of grocery sales. Therefore, I created this merged dataset and posted it here for further analysis.
##### Data Description
Data Field Information (This is a copy of the description as provided in the actual dataset)
Train.csv
- id: store id
- date: date of the sale
- store_nbr: identifies the store at which the products are sold.
- family: identifies the type of product sold.
- sales: gives the total sales for a product family at a particular store at a given date. Fractional values are possible since products can be sold in fractional units (1.5 kg of cheese, for instance, as opposed to 1 bag of chips).
- onpromotion: gives the total number of items in a product family that were being promoted at a store on a given date.
- Store metadata, including city, state, type, and cluster. cluster is a grouping of similar stores.
- Holidays and Events, with metadata. NOTE: Pay special attention to the transferred column. A holiday that is transferred officially falls on that calendar day but was moved to another date by the government. A transferred day is more like a normal day than a holiday. To find the day that it was celebrated, look for the corresponding row where the type is Transfer. For example, the holiday Independencia de Guayaquil was transferred from 2012-10-09 to 2012-10-12, which means it was celebrated on 2012-10-12. Days that are type Bridge are extra days that are added to a holiday (e.g., to extend the break across a long weekend). These are frequently made up by the type Work Day, which is a day not normally scheduled for work (e.g., Saturday) that is meant to pay back the Bridge. Additional holidays are days added to a regular calendar holiday, for example, as typically happens around Christmas (making Christmas Eve a holiday).
- dcoilwtico: Daily oil price. Includes values during both the train and test data timeframes. (Ecuador is an oil-dependent country and its economic health is highly vulnerable to shocks in oil prices.)
Note: There is a transaction column in the training dataset which displays the sales transactions on that particular date.
Test.csv - The test data, having the same features as the training data. You will predict the target sales for the dates in this file. The dates in the test data are for the 15 days after the last date in the training data.
Note: There is no transaction column in the test dataset as there was in the training dataset. Therefore, while building the model, you might exclude this column and use it only for EDA.
submission.csv - A sample submission file in the correct format.
https://creativecommons.org/publicdomain/zero/1.0/
Don't ask me where this data comes from; the answer is I don't know!
Credit score cards are a common risk control method in the financial industry. They use personal information and data submitted by credit card applicants to predict the probability of future defaults and credit card borrowings, so the bank can decide whether to issue a credit card to the applicant. Credit scores objectively quantify the magnitude of risk.
Generally speaking, credit score cards are based on historical data, so when large economic fluctuations occur, past models may lose their original predictive power. The logistic model is a common method for credit scoring because it is suitable for binary classification tasks and can calculate the coefficients of each feature. To facilitate understanding and operation, the score card multiplies the logistic regression coefficients by a certain value (such as 100) and rounds the result.
At present, with the development of machine learning algorithms, more predictive methods such as Boosting, Random Forest, and Support Vector Machines have been introduced into credit card scoring. However, these methods often lack transparency, and it may be difficult to provide customers and regulators with a reason for rejection or acceptance.
Build a machine learning model to predict whether an applicant is a 'good' or 'bad' client. Unlike other tasks, the definition of 'good' or 'bad' is not given; you should use a technique such as vintage analysis to construct your label. The unbalanced-data problem is also a major challenge in this task.
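As a hedged starting point for the modelling task, the sketch below trains a logistic regression with class weighting to address the imbalance. The label column is assumed to have been constructed already (e.g., via vintage analysis), and the filename and column names are placeholders.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# "applications_with_labels.csv" and the "bad_client" label column are hypothetical;
# build them yourself by merging the tables and constructing a label as described above.
data = pd.read_csv("applications_with_labels.csv")
X = pd.get_dummies(data.drop(columns=["ID", "bad_client"]), drop_first=True)
y = data["bad_client"]

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# class_weight="balanced" is one simple way to address the class imbalance mentioned above.
clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```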
There are two tables that can be merged by ID:
application_record.csv

| Feature name | Explanation | Remarks |
|---|---|---|
| ID | Client number | |
| CODE_GENDER | Gender | |
| FLAG_OWN_CAR | Is there a car | |
| FLAG_OWN_REALTY | Is there a property | |
| CNT_CHILDREN | Number of children | |
| AMT_INCOME_TOTAL | Annual income | |
| NAME_INCOME_TYPE | Income category | |
| NAME_EDUCATION_TYPE | Education level | |
| NAME_FAMILY_STATUS | Marit... | |
After many years of research and technical preparation, the production of a new ECMWF climate reanalysis to replace ERA-Interim is in progress. ERA5 is the fifth generation of ECMWF atmospheric reanalyses of the global climate, which started with the FGGE reanalyses produced in the 1980s, followed by ERA-15, ERA-40 and most recently ERA-Interim. ERA5 will cover the period January 1950 to near real time. ERA5 is produced using high-resolution forecasts (HRES) at 31 kilometer resolution (one fourth the spatial resolution of the operational model) and a 62 kilometer resolution ten member 4D-Var ensemble of data assimilation (EDA) in CY41r2 of ECMWF's Integrated Forecast System (IFS) with 137 hybrid sigma-pressure (model) levels in the vertical, up to a top level of 0.01 hPa. Atmospheric data on these levels are interpolated to 37 pressure levels (the same levels as in ERA-Interim). Surface or single level data are also available, containing 2D parameters such as precipitation, 2 meter temperature, top of atmosphere radiation and vertical integrals over the entire atmosphere. The IFS is coupled to a soil model, the parameters of which are also designated as surface parameters, and an ocean wave model. Generally, the data is available at an hourly frequency and consists of analyses and short (12 hour) forecasts, initialized twice daily from analyses at 06 and 18 UTC. Most analyses parameters are also available from the forecasts. There are a number of forecast parameters, for example mean rates and accumulations, that are not available from the analyses. Improvements to ERA5, compared to ERA-Interim, include use of HadISST.2, reprocessed ECMWF climate data records (CDR), and implementation of RTTOV11 radiative transfer. Variational bias corrections have not only been applied to satellite radiances, but also ozone retrievals, aircraft observations, surface pressure, and radiosonde profiles. Please note: DECS is producing a CF 1.6 compliant netCDF-4/HDF5 version of ERA5...
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
In this Article, the organolithiums ((−)-sparteine)LitBu, [(ABCO)LitBu]2 (2), and (ABCO)2(LiiPr)4 are investigated by means of experimental and theoretical charge density determination to elucidate the nature of the Li–C and Li–N bonds. Furthermore, the valence shell charge concentrations (VSCCs) in the nonbonding region of the deprotonated Cα-atom provide some insight into the localization of the carbanionic lone pair. Analysis of the electron density (ρ(rBCP)), the Laplacian (∇2ρ(rBCP)), and the energy decomposition analysis (EDA) confirmed that the Li–C/N bonds exhibit astonishingly similar characteristics and reveal an increasingly polar contact with decreasing aggregate size. This explains former observations on the incorporation of halide salts in organolithium reagents. Furthermore, it could be shown that the bonding properties of the iPr group are similar to those of the tBu substituent. The accuracy of fit to all previously determined properties of organolithiums is remarkable.