Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Unsupervised exploratory data analysis (EDA) is often the first step in understanding complex data sets. While summary statistics are among the most efficient and convenient tools for exploring and describing sets of data, they are often overlooked in EDA. In this paper, we show multiple case studies that compare the performance, including clustering, of a series of summary statistics in EDA. The summary statistics considered here are pattern recognition entropy (PRE), the mean, standard deviation (STD), 1-norm, range, sum of squares (SSQ), and X4, which are compared with principal component analysis (PCA), multivariate curve resolution (MCR), and/or cluster analysis. PRE and the other summary statistics are direct methods for analyzing data; they are not factor-based approaches. To quantify the performance of summary statistics, we use the concept of the “critical pair,” which is employed in chromatography. The data analyzed here come from different analytical methods. Hyperspectral images, including one of a biological material, are also analyzed. In general, PRE outperforms the other summary statistics, especially in image analysis, although a suite of summary statistics is useful in exploring complex data sets. While PRE results were generally comparable to those from PCA and MCR, PRE is easier to apply. For example, there is no need to determine the number of factors that describe a data set. Finally, we introduce the concept of divided spectrum-PRE (DS-PRE) as a new EDA method. DS-PRE increases the discrimination power of PRE. We also show that DS-PRE can be used to provide the inputs for the k-nearest neighbor (kNN) algorithm. We recommend PRE and DS-PRE as rapid new tools for unsupervised EDA.
https://www.usa.gov/government-works
This measure determines the perceived value added by the University Centers (UCs) to their clients. EDA funds UCs to provide technical assistance and specialized services (for example, feasibility studies, marketing research, economic analysis, environmental services, and technology transfer) to local officials and communities. This assistance improves the community’s capacity to plan and manage successful development projects. UCs develop client profiles and report findings to EDA, which evaluates the performance of each center once every 3 years and verifies the data. “Taking action as a result of the assistance facilitated” means to implement an aspect of the technical assistance provided by the UC in one of several areas: economic development initiatives and training session development; linkages to crucial resources; economic development planning; project management; community investment package development; geographic information system services; strategic partnering to public or private sector entities; increased organizational capacity; feasibility plans; marketing studies; technology transfer; new company, product, or patent development; and other services.
https://creativecommons.org/publicdomain/zero/1.0/
We all love movies! I remember watching my first movie with my family when I was 5 and, 3 years later, I still love movies. But have you ever wondered how some people rate movies as good or bad, awesome or meh? That's right: different people have different perspectives on how much they like or dislike a movie. To help us select from the plethora of movie options out there, the IMDB platform provides honest reviews by the people, for the people.
Long story short, this assignment will take you through different aspects of how a movie is reviewed by different people from across the globe based on its star cast, genre, story length, and many other aspects.
So here is what you need to do! A few points: 1. Download the dataset and the dictionary that will help you learn the different columns in the dataset. 2. Start exploring the data by performing EDA (wiki what EDA is, if you are a dummy like I was initially). 3. Come back to this notebook to check what I did while exploring the data, and then follow the subtasks and checkpoints!
Simple, isn't it? Do complete the exercise and let me know in the comments whether you found it helpful. There's always scope for improvement, so tell me what more could have been added to this notebook! Hope you'll have a good time exploring the data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Research Domain:
The dataset is part of a project focused on retail sales forecasting. Specifically, it is designed to predict daily sales for Rossmann, a chain of over 3,000 drug stores operating across seven European countries. The project falls under the broader domain of time series analysis and machine learning applications for business optimization. The goal is to apply machine learning techniques to forecast future sales based on historical data, which includes factors like promotions, competition, holidays, and seasonal trends.
Purpose:
The primary purpose of this dataset is to help Rossmann store managers predict daily sales for up to six weeks in advance. By making accurate sales predictions, Rossmann can improve inventory management, staffing decisions, and promotional strategies. This dataset serves as a training set for machine learning models aimed at reducing forecasting errors and supporting decision-making processes across the company’s large network of stores.
How the Dataset Was Created:
The dataset was compiled from several sources, including historical sales data from Rossmann stores, promotional calendars, holiday schedules, and external factors such as competition. The data is split into multiple features, such as the store's location, promotion details, whether the store was open or closed, and weather information. The dataset is publicly available on platforms like Kaggle and was initially created for the Kaggle Rossmann Store Sales competition. The data is made accessible via an API for further analysis and modeling, and it is structured to help machine learning models predict future sales based on various input variables.
Dataset Structure:
The dataset consists of three main files, each with its specific role:
Train:
This file contains the historical sales data, which is used to train machine learning models. It includes daily sales information for each store, as well as various features that could influence the sales (e.g., promotions, holidays, store type, etc.).
https://handle.test.datacite.org/10.82556/yb6j-jw41
PID: b1c59499-9c6e-42c2-af8f-840181e809db
Test2:
The test dataset mirrors the structure of train.csv but does not include the actual sales values (i.e., the target variable). This file is used for making predictions with the trained machine learning models and for evaluating the accuracy of those predictions when the true sales data is unknown.
https://handle.test.datacite.org/10.82556/jerg-4b84
PID: 7cbb845c-21dd-4b60-b990-afa8754a0dd9
Store:
This file provides metadata about each store, including information such as the store’s location, type, and assortment level. This data is essential for understanding the context in which the sales data is gathered.
https://handle.test.datacite.org/10.82556/nqeg-gy34
PID: 9627ec46-4ee6-4969-b14a-bda555fe34db
Id: A unique identifier for each (Store, Date) combination within the test set.
Store: A unique identifier for each store.
Sales: The daily turnover (target variable) for each store on a specific day (this is what you are predicting).
Customers: The number of customers visiting the store on a given day.
Open: An indicator of whether the store was open (1 = open, 0 = closed).
StateHoliday: Indicates if the day is a state holiday, with values like:
'a' = public holiday,
'b' = Easter holiday,
'c' = Christmas,
'0' = no holiday.
SchoolHoliday: Indicates whether the store is affected by school closures (1 = yes, 0 = no).
StoreType: Differentiates between four types of stores: 'a', 'b', 'c', 'd'.
Assortment: Describes the level of product assortment in the store:
'a' = basic,
'b' = extra,
'c' = extended.
CompetitionDistance: Distance (in meters) to the nearest competitor store.
CompetitionOpenSince[Month/Year]: The month and year when the nearest competitor store opened.
Promo: Indicates whether the store is running a promotion on a particular day (1 = yes, 0 = no).
Promo2: Indicates whether the store is participating in Promo2, a continuing promotion for some stores (1 = participating, 0 = not participating).
Promo2Since[Year/Week]: The year and calendar week when the store started participating in Promo2.
PromoInterval: Describes the months when Promo2 is active, e.g., "Feb,May,Aug,Nov" means the promotion starts in February, May, August, and November.
To work with this dataset, you will need to have specific software installed, including:
DBRepo Authorization: This is required to access the datasets via the DBRepo API. You may need to authenticate with an API key or login credentials to retrieve the datasets.
Python Libraries: Key libraries for working with the dataset include pandas for data manipulation, numpy for numerical operations, matplotlib and seaborn for data visualization, and scikit-learn for machine learning algorithms.
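As a quick illustration of how these libraries come together, here is a minimal sketch for loading and merging the files with pandas. The filenames (train.csv, store.csv) and local paths are assumptions and may differ depending on how you retrieve the data (e.g., via the DBRepo API).

```python
import pandas as pd

# Load the historical sales data and the store metadata
# (filenames are assumptions; adjust to however you retrieved the files).
train = pd.read_csv("train.csv", parse_dates=["Date"], low_memory=False)
store = pd.read_csv("store.csv")

# Attach store-level metadata (StoreType, Assortment, CompetitionDistance, ...) to each daily record.
df = train.merge(store, on="Store", how="left")

# A few quick EDA checks: overall shape, open/closed split, and average sales by store type.
print(df.shape)
print(df["Open"].value_counts())
print(df.groupby("StoreType")["Sales"].mean())
```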
Several additional resources are available for working with the dataset:
Presentation:
A presentation summarizing the exploratory data analysis (EDA), feature engineering process, and key insights from the analysis is provided. This presentation also includes visualizations that help in understanding the dataset’s trends and relationships.
Jupyter Notebook:
A Jupyter notebook, titled Retail_Sales_Prediction_Capstone_Project.ipynb, is provided, which details the entire machine learning pipeline, from data loading and cleaning to model training and evaluation.
Model Evaluation Results:
The project includes a detailed evaluation of various machine learning models, including their performance metrics like training and testing scores, Mean Absolute Percentage Error (MAPE), and Root Mean Squared Error (RMSE). This allows for a comparison of model effectiveness in forecasting sales.
Trained Models (.pkl files):
The models trained during the project are saved as .pkl files. These files contain the trained machine learning models (e.g., Random Forest, Linear Regression, etc.) that can be loaded and used to make predictions without retraining the models from scratch.
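A minimal sketch of how such a saved model might be loaded and used for prediction is shown below. The .pkl filename, the feature columns, and the test filename are placeholders, not the project's actual artifacts.

```python
import pandas as pd
import joblib  # pickle also works for loading plain .pkl files

# Load one of the saved models (the filename is illustrative; use the actual .pkl from the project).
model = joblib.load("random_forest_model.pkl")

# Build a feature matrix from the test data with the same columns used during training.
test = pd.read_csv("test.csv", parse_dates=["Date"])
feature_cols = ["Store", "Promo", "SchoolHoliday"]  # placeholder subset; match the notebook's features
predictions = model.predict(test[feature_cols])

# Write predictions in the sample_submission.csv format (Id, Sales).
pd.DataFrame({"Id": test["Id"], "Sales": predictions}).to_csv("submission.csv", index=False)
```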
sample_submission.csv:
This file is a sample submission file that demonstrates the format of predictions expected when using the trained model. The sample_submission.csv contains predictions made on the test dataset using the trained Random Forest model. It provides an example of how the output should be structured for submission.
These resources provide a comprehensive guide to implementing and analyzing the sales forecasting model, helping you understand the data, methods, and results in greater detail.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains data collected from patients suffering from cancer-related pain. The features extracted from clinical data (including typical cancer phenomena such as breakthrough pain) and the biosignal acquisitions contributed to the definition of a multidimensional dataset. This unique database can be useful for the characterization of the patient’s pain experience from a qualitative and quantitative perspective. We implemented measurable biosignal-related indicators of the individual’s pain response and of the overall Autonomic Nervous System (ANS) functioning. The most peculiar features extracted from EDA and ECG signals can be adopted to investigate the status and complex functioning of the ANS through the study of sympatho-vagal activations. Specifically, while EDA is mainly related to sympathetic activation, the Heart Rate Variability (HRV), which can be derived from ECG recordings, is strictly related to the interplay between sympathetic and parasympathetic functioning.
As far as the EDA signal is concerned, two types of analyses have been performed: (i) the Trough-To-Peak (TTP), or min-max, analysis, aimed at measuring the difference between the Skin Conductance (SC) at the peak of a response and its previous minimum within pre-established time windows; (ii) the Continuous Decomposition Analysis (CDA), aimed at decomposing the SC data into continuous signals of tonic (basic level of conductance) and phasic (short-duration changes in the SC) activity. Before applying the TTP analysis or the CDA, the signal was filtered by means of a fifth-order Butterworth low-pass filter with a cutoff frequency of 1 Hz and downsampled to 10 Hz to reduce the computational burden of the analysis. The application of TTP and CDA allowed the detection and measurement of SC Responses (SCRs), and the following parameters have been calculated for both the TTP and CDA methodologies:
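The filtering and downsampling step described above could look like the following sketch. The original sampling rate and the exact filtering routine used by the authors are not specified here, so both are assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt, decimate

def preprocess_eda(sc_signal, fs):
    """Low-pass filter and downsample a skin conductance signal as described above."""
    # Fifth-order Butterworth low-pass filter with a 1 Hz cutoff (zero-phase filtering assumed).
    b, a = butter(N=5, Wn=1.0, btype="low", fs=fs)
    filtered = filtfilt(b, a, sc_signal)
    # Downsample by an integer factor towards 10 Hz (use resampling if fs is not a multiple of 10).
    factor = int(fs // 10)
    return decimate(filtered, factor) if factor > 1 else filtered

# Example with a synthetic 64 Hz recording (the dataset's actual sampling rate may differ).
fs = 64
sc = np.random.randn(fs * 60) * 0.05 + 2.0
sc_10hz = preprocess_eda(sc, fs)
```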
Concerning the ECG, the RR series of interbeat intervals (i.e., the time between successive R waves of the QRS complex on the ECG waveform) has been computed to extract time-domain parameters of the HRV. The R peak detection was carried out by adopting the Pan–Tompkins algorithm for QRS detection and R peak identification. The corresponding RR series of interbeat intervals were derived as the difference between successive R peaks.
The ECG-derived RR time series was then filtered by means of a recursive procedure to remove the intervals differing most from the mean of the surrounding RR intervals. Then, both the Time-Domain Analysis (TDA) and Frequency-Domain Analysis (FDA) of the HRV have been carried out to extract the main features characterizing the variability of the heart rhythm. Time-domain parameters are obtained from statistical analysis of the intervals between heart beats and are used to describe how much variability in the heartbeats is present at various time scales.
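The time-domain analysis can be illustrated with a small function computing a few commonly used HRV parameters (SDNN, RMSSD, pNN50). This is a generic sketch and not necessarily the exact parameter set extracted for this dataset.

```python
import numpy as np

def hrv_time_domain(rr_ms):
    """Common time-domain HRV parameters computed from an RR-interval series in milliseconds."""
    rr = np.asarray(rr_ms, dtype=float)
    diff = np.diff(rr)
    return {
        "mean_rr": rr.mean(),                       # mean RR interval
        "sdnn": rr.std(ddof=1),                     # standard deviation of RR intervals
        "rmssd": np.sqrt(np.mean(diff ** 2)),       # root mean square of successive differences
        "pnn50": 100 * np.mean(np.abs(diff) > 50),  # % of successive differences > 50 ms
    }

print(hrv_time_domain([812, 790, 805, 830, 818, 795, 801]))
```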
The parameters computed through the TDA include the following:
Frequency-domain parameters reflect the distribution of spectral power across different frequency bands and are used to assess specific components of HRV (e.g., the thermoregulation control loop, baroreflex control loop, and respiration control loop, which are regulated by both sympathetic and vagal nerves of the ANS).
The FDA parameters have been computed by adopting Welch's Fourier periodogram method, which is based on the Discrete Fourier Transform (DFT) and allows the RR series to be expressed in the discrete frequency domain. Welch's method is well suited to the non-stationarity of the RR series: it divides the signal into segments of constant length, applies the Fast Fourier Transform (FFT) to each segment individually, and averages the resulting periodograms. The periodogram is essentially an estimate of the power spectral density of a time series.
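A hedged sketch of the Welch-based spectral estimation is given below. The resampling rate and segment length are assumptions, since the exact settings used for the dataset are not stated; interpolating the RR series onto a uniform grid before applying Welch's method is a common, but not the only, choice.

```python
import numpy as np
from scipy.interpolate import interp1d
from scipy.signal import welch

def hrv_welch_psd(rr_ms, fs_resample=4.0):
    """Estimate the HRV power spectrum of an RR series (ms) with Welch's periodogram."""
    rr = np.asarray(rr_ms, dtype=float)
    t = np.cumsum(rr) / 1000.0                              # beat times in seconds
    t_uniform = np.arange(t[0], t[-1], 1.0 / fs_resample)   # uniform time grid
    rr_uniform = interp1d(t, rr, kind="cubic")(t_uniform)
    # Welch's method: split into overlapping segments, FFT each, average the periodograms.
    freqs, psd = welch(rr_uniform - rr_uniform.mean(),
                       fs=fs_resample,
                       nperseg=min(256, len(rr_uniform)))
    return freqs, psd

# Example on a synthetic RR series.
freqs, psd = hrv_welch_psd(np.random.normal(800, 40, size=300))
```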
The FDA parameters include the following:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Project Documentation: Cucumber Disease Detection
Introduction: A machine learning model for the automatic detection of diseases in cucumber plants is to be developed as part of the "Cucumber Disease Detection" project. This research is crucial because it tackles the issue of early disease identification in agriculture, which can increase crop yield and cut down on financial losses. To train and test the model, we use a dataset of pictures of cucumber plants.
Importance: Early disease diagnosis helps minimize crop losses, stop the spread of diseases, and better allocate resources in farming. Agriculture is a real-world application of this concept.
Goals and Objectives: Develop a machine learning model to classify cucumber plant images into healthy and diseased categories. Achieve a high level of accuracy in disease detection. Provide a tool for farmers to detect diseases early and take appropriate action.
Data Collection: Images from agricultural areas were gathered using cameras and smartphones.
Data Preprocessing: Data cleaning to remove irrelevant or corrupted images. Handling missing values, if any, in the dataset. Removing outliers that may negatively impact model training. Data augmentation techniques applied to increase dataset diversity.
Exploratory Data Analysis (EDA): The dataset was examined using visuals such as scatter plots and histograms and was inspected for patterns, trends, and correlations. EDA made it easier to understand the distribution of photos of healthy and diseased plants.
Methodology:
Machine Learning Algorithms: Convolutional Neural Networks (CNNs) were chosen for image classification due to their effectiveness in handling image data. Transfer learning using pre-trained models such as ResNet or MobileNet may be considered.
Train-Test Split: The dataset was split into training and testing sets with a suitable ratio. Cross-validation may be used to assess model performance robustly.
Model Development: The CNN model's architecture consists of layers, units, and activation functions. Hyperparameters, including the learning rate, batch size, and optimizer, were chosen on the basis of experimentation. To avoid overfitting, regularization methods such as dropout and L2 regularization were used.
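To make the description concrete, here is an illustrative Keras CNN with dropout and L2 regularization. The input size, layer widths, and hyperparameter values are placeholders and not the project's actual configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

# Illustrative CNN for binary healthy/diseased classification of 128x128 RGB leaf images.
model = models.Sequential([
    layers.Input(shape=(128, 128, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu", kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.5),                    # dropout to mitigate overfitting
    layers.Dense(1, activation="sigmoid"),  # healthy vs. diseased
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```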
Model Training: During training, the model was fed the prepared dataset over a number of epochs, and the loss function was minimized using an optimization method. To ensure convergence, early stopping and model checkpoints were used.
Model Evaluation:
Evaluation Metrics: Accuracy, precision, recall, F1-score, and the confusion matrix were used to assess model performance. Results were computed for both the training and test datasets.
Performance Discussion: The model's performance was analyzed in the context of disease detection in cucumber plants, and the strengths and weaknesses of the model were identified.
Results and Discussion: Key project findings include model performance and disease detection precision, a comparison of the models employed (showing the benefits and drawbacks of each), and the challenges faced throughout the project along with the methods used to solve them.
Conclusion: A recap of the project's key learnings, highlighting the project's importance to early disease detection in agriculture. Future enhancements and potential research directions are suggested.
References
Libraries: Pillow, Roboflow, YOLO, Sklearn, matplotlib
Datasets: https://data.mendeley.com/datasets/y6d3z6f8z9/1
Code Repository: https://universe.roboflow.com/hakuna-matata/cdd-g8a6g
Rafiur Rahman Rafit EWU 2018-3-60-111
Please note: Use ds633.1 to access the RDA-maintained ERA5 monthly mean data; see ERA5 Reanalysis (Monthly Mean 0.25 Degree Latitude-Longitude Grid), RDA dataset ds633.1. This dataset is no longer being updated, and web access has been removed.
After many years of research and technical preparation, the production of a new ECMWF climate reanalysis to replace ERA-Interim is in progress. ERA5 is the fifth generation of ECMWF atmospheric reanalyses of the global climate, which started with the FGGE reanalyses produced in the 1980s, followed by ERA-15, ERA-40 and most recently ERA-Interim. ERA5 will cover the period January 1950 to near real time, though the first segment of data to be released will span the period 2010-2016.
ERA5 is produced using high-resolution forecasts (HRES) at 31 kilometer resolution (one fourth the spatial resolution of the operational model) and a 62 kilometer resolution ten member 4D-Var ensemble of data assimilation (EDA) in CY41r2 of ECMWF's Integrated Forecast System (IFS) with 137 hybrid sigma-pressure (model) levels in the vertical, up to a top level of 0.01 hPa. Atmospheric data on these levels are interpolated to 37 pressure levels (the same levels as in ERA-Interim). Surface or single level data are also available, containing 2D parameters such as precipitation, 2 meter temperature, top of atmosphere radiation and vertical integrals over the entire atmosphere. The IFS is coupled to a soil model, the parameters of which are also designated as surface parameters, and an ocean wave model. Generally, the data is available at an hourly frequency and consists of analyses and short (18 hour) forecasts, initialized twice daily from analyses at 06 and 18 UTC. Most analyses parameters are also available from the forecasts. There are a number of forecast parameters, e.g. mean rates and accumulations, that are not available from the analyses. Together, the hourly analysis and twice daily forecast parameters form the basis of the monthly means (and monthly diurnal means) found in this dataset.
Improvements to ERA5, compared to ERA-Interim, include use of HadISST.2, reprocessed ECMWF climate data records (CDR), and implementation of RTTOV11 radiative transfer. Variational bias corrections have not only been applied to satellite radiances, but also ozone retrievals, aircraft observations, surface pressure, and radiosonde profiles.
NCAR's Data Support Section (DSS) is performing and supplying a grid transformed version of ERA5, in which variables originally represented as spectral coefficients or archived on a reduced Gaussian grid are transformed to a regular 1280 longitude by 640 latitude N320 Gaussian grid. In addition, DSS is also computing horizontal winds (u-component, v-component) from spectral vorticity and divergence where these are available. Finally, the data is reprocessed into single parameter time series.
Please note: As of November 2017, DSS is also producing a CF 1.6 compliant netCDF-4/HDF5 version of ERA5 for CISL RDA at NCAR. The netCDF-4/HDF5 version is the de facto RDA ERA5 online data format. The GRIB1 data format is only available via NCAR's High Performance Storage System (HPSS). We encourage users to evaluate the netCDF-4/HDF5 version for their work, and to use the currently existing GRIB1 files as a reference and basis of comparison. To ease this transition, there is a one-to-one correspondence between the netCDF-4/HDF5 and GRIB1 files, with as much GRIB1 metadata as possible incorporated into the attributes of the netCDF-4/HDF5 counterpart.
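As an example of working with the netCDF-4/HDF5 files, a minimal xarray sketch is shown below. The filename is a placeholder, and the variable and dimension names depend on the specific ERA5 product file you download.

```python
import xarray as xr

# Open one of the RDA netCDF-4/HDF5 ERA5 files (the filename is a placeholder).
ds = xr.open_dataset("era5_example.nc")

# Inspect variables, coordinates, and the GRIB1 metadata carried over into netCDF attributes.
print(ds)

# Quick summary for the first data variable in the file (variable names vary by product).
var = list(ds.data_vars)[0]
print(var, ds[var].mean().values, ds[var].attrs)
```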
https://creativecommons.org/publicdomain/zero/1.0/
This data contains salaries of University of Vermont (UVM) faculty from 2009 to 2021. We present two datasets. The second dataset is richer because it contains information on faculty departments/colleges; however, it contains fewer rows due to how we chose to join this data.
1. salaries_without_dept.csv contains all of the data we extracted from the PDFs. The four columns are: Year, Faculty Name, Primary Job Title, and Base Pay. There are 47,479 rows.
2. salaries_final.csv contains the same columns as [1], but also joins with data about the faculty's "Department" and "College" (for a total of six columns). There are only 14,470 rows in this dataset because we removed rows for which we could not identify the Department/College of the faculty.
All data is publicly available on the University of Vermont website. I downloaded all PDFs from https://www.uvm.edu/oir/faculty-and-staff. Then I used a Python package (Camelot) to parse the tabular PDFs and used regex matching to ensure data was correctly parsed. I performed some initial cleaning (removed dollar signs from monetary values, etc.). At this stage, I saved the data to salaries_without_dept.csv.
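The PDF-parsing step might look roughly like the sketch below (using camelot-py). The PDF filename and the column layout are assumptions, and the real pipeline also included the regex checks described above.

```python
import camelot
import pandas as pd

# Parse the salary tables from one of the UVM base-pay PDFs (filename is a placeholder).
tables = camelot.read_pdf("uvm_base_pay_2021.pdf", pages="all")

# Stack the per-page tables and apply light cleaning similar to what is described above.
df = pd.concat([t.df for t in tables], ignore_index=True)
df.columns = ["Faculty Name", "Primary Job Title", "Base Pay"]  # assumed column layout
df["Base Pay"] = pd.to_numeric(df["Base Pay"].str.replace(r"[$,]", "", regex=True),
                               errors="coerce")
df["Year"] = 2021
df.to_csv("salaries_without_dept_2021.csv", index=False)
```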
I also wanted to know what department and college each faculty member belonged to. I used http://catalogue.uvm.edu/undergraduate/faculty/fulltime (plus Python's lxml package to parse the HTML) to determine "Department" and then manually built an encoding to map "Department" to "College". Note that this link provides faculty information for 2020, thus after joining we end up only with faculty that are still employed as of 2020 (this should be taken into consideration). Secondly, this link does not include UVM administration (and possibly some other personnel), so they are not present in this dataset. Thirdly, there were several different ways names were reported (sometimes even the same person has their name reported differently in different years). We tried joining first on LastName+FirstName and then on LastName+FirstInitial but did not bother using middle names. To handle ambiguity, we removed duplicates (e.g. we removed Martin, Jacob and Martin, Jacob William as they were not distinguishable by our criteria). The joined data is available in salaries_final.csv.
Note: perhaps "College" was not the best naming, since faculty of UVM Libraries and other miscellaneous fields are included.
The column definitions are self-explanatory, but the "College" abbreviation meanings are unclear to a non-UVM-affiliate. We've included data_dictionary.csv to explain what each "College" abbreviation means. You can use this dictionary to filter out miscellaneous "colleges" (e.g. UVM Libraries) and only include colleges within the undergraduate program (e.g. filter out College of Medicine).
Despite there only being a few (six) columns, I think this is quite a rich dataset and could also be paired with other UVM data or combined with data from other universities. This dataset is mainly for data analytics and exploratory data analysis (EDA), but perhaps it could also be used for forecasting (however, there are only 12 time values, so you'd probably want to make use of "College" or "Primary Job Title"). Interesting EDA questions could be:
1. "Are the faculty in arts & humanities departments being paid less?" This news article -- UVM to eliminate 23 programs in the College of Arts and Sciences -- suggests so. Give a quantitative answer.
2. "Are lecturers declining in quantity and pay?" This news article -- ‘I’m going to miss this:’ Three cut lecturers reflect on time at UVM -- suggests so. Give a quantitative answer.
3. "How does the College of Medicine compare to the undergraduate colleges in terms of number of faculty and pay?" See data_dictionary.csv for which colleges are in the undergraduate program.
4. "How long does it take for a faculty member to become a full professor?" Yes, this is also answerable from the data because Primary Job Title updates when a faculty member is promoted.
I do not plan to maintain this dataset. If I get the chance, I may update it with future year salaries.
This publication corresponds to the Common Data Model (CDM) specification of the Baseline Use Case proposed in T.5.2 (WP5) in the BY-COVID project on “SARS-CoV-2 Vaccine(s) effectiveness in preventing SARS-CoV-2 infection.”
Research Question: “How effective have the SARS-CoV-2 vaccination programmes been in preventing SARS-CoV-2 infections?”
Intervention (exposure): COVID-19 vaccine(s)
Outcome: SARS-CoV-2 infection
Subgroup analysis: Vaccination schedule (type of vaccine)
Study Design: An observational retrospective longitudinal study to assess the effectiveness of the SARS-CoV-2 vaccine in preventing SARS-CoV-2 infections using routinely collected social, health and care data from several countries. A causal model was established using Directed Acyclic Graphs (DAGs) to map domain knowledge, theories and assumptions about the causal relationship between exposure and outcome. The DAG developed for the research question of interest is shown below.
Cohort definition: All people eligible to be vaccinated (from 5 to 115 years old, included) or with, at least, one dose of a SARS-CoV-2 vaccine (any of the available brands), having or not a previous SARS-CoV-2 infection.
Inclusion criteria: All people vaccinated with at least one dose of the COVID-19 vaccine (any available brands) in an area of residence. Any person eligible to be vaccinated (from 5 to 115 years old, included) with a positive diagnosis (irrespective of the type of test) for SARS-CoV-2 infection (COVID-19) during the period of study.
Exclusion criteria: People not eligible for the vaccine (from 0 to 4 years old, included).
Study period: From the date of the first documented SARS-CoV-2 infection in each country to the most recent date in which data is available at the time of analysis. Roughly from 01-03-2020 to 30-06-2022, depending on the country.
Files included in this publication:
Causal model (responding to the research question):
- SARS-CoV-2 vaccine effectiveness causal model v.1.0.0 (HTML) - Interactive report showcasing the structural causal model (DAG) to answer the research question
- SARS-CoV-2 vaccine effectiveness causal model v.1.0.0 (QMD) - Quarto RMarkdown script to produce the structural causal model
Common data model specification (following the causal model):
- SARS-CoV-2 vaccine effectiveness data model specification (XLSX) - Human-readable version (Excel)
- SARS-CoV-2 vaccine effectiveness data model specification dataspice (HTML) - Human-readable version (interactive report)
- SARS-CoV-2 vaccine effectiveness data model specification dataspice (JSON) - Machine-readable version
Synthetic dataset (complying with the common data model specifications):
- SARS-CoV-2 vaccine effectiveness synthetic dataset (CSV) [UTF-8, pipe | separated, N~650,000 registries]
- SARS-CoV-2 vaccine effectiveness synthetic dataset EDA (HTML) - Interactive report of the exploratory data analysis (EDA) of the synthetic dataset
- SARS-CoV-2 vaccine effectiveness synthetic dataset EDA (JSON) - Machine-readable version of the exploratory data analysis (EDA) of the synthetic dataset
- SARS-CoV-2 vaccine effectiveness synthetic dataset generation script (IPYNB) - Jupyter notebook with Python scripting and commenting to generate the synthetic dataset
Baseline Use Case: SARS-CoV-2 vaccine effectiveness assessment - Common Data Model Specification v.1.1.0 change log:
- Updated the causal model to eliminate the consideration of 'vaccination_schedule_cd' as a mediator
- Adjusted the study period to be consistent with the Study Protocol
- Updated 'sex_cd' as a required variable
- Added 'chronic_liver_disease_bl' as a comorbidity at the individual level
- Updated 'socecon_lvl_cd' at the area level as a recommended variable
- Added crosswalks for the definition of 'chronic_liver_disease_bl' in a separate sheet
- Updated the 'vaccination_schedule_cd' reference to the 'Vaccine' node in the updated DAG
- Updated the description of the 'confirmed_case_dt' and 'previous_infection_dt' variables to clarify the definition and the need for a single registry per person
The scripts (software) accompanying the data model specification are offered "as-is" without warranty, disclaiming liability for damages resulting from using them. The software is released under the CC-BY-4.0 licence, which permits you to use the content for almost any purpose (but does not grant you any trademark permissions), so long as you note the license and give credit.
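Reading the synthetic dataset with pandas is straightforward; a minimal sketch is shown below, with the filename as a placeholder.

```python
import pandas as pd

# Load the synthetic dataset (pipe-separated, UTF-8); the filename is a placeholder.
synthetic = pd.read_csv("vaccine_effectiveness_synthetic_dataset.csv", sep="|", encoding="utf-8")

# Quick checks against the common data model specification.
print(synthetic.shape)             # roughly 650,000 registries expected
print(synthetic.columns.tolist())  # variable names should match the data model spec
print(synthetic.dtypes)
```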
The dataset contains a total of 25,161 rows, each row representing the stock market data for a specific company on a given date. The information collected through web scraping from www.nasdaq.com includes the stock prices and trading volumes for the companies listed, such as Apple, Starbucks, Microsoft, Cisco Systems, Qualcomm, Meta, Amazon.com, Tesla, Advanced Micro Devices, and Netflix.
Data Analysis Tasks:
1) Exploratory Data Analysis (EDA): Analyze the distribution of stock prices and volumes for each company over time. Visualize trends, seasonality, and patterns in the stock market data using line charts, bar plots, and heatmaps.
2) Correlation Analysis: Investigate the correlations between the closing prices of different companies to identify potential relationships. Calculate correlation coefficients and visualize correlation matrices.
3) Top Performers Identification: Identify the top-performing companies based on their stock price growth and trading volumes over a specific time period.
4) Market Sentiment Analysis: Perform sentiment analysis using Natural Language Processing (NLP) techniques on news headlines related to each company. Determine whether positive or negative news impacts the stock prices and volumes.
5) Volatility Analysis: Calculate the volatility of each company's stock prices using metrics like Standard Deviation or Bollinger Bands. Analyze how volatile stocks are in comparison to others.
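As one way to approach task 2 (correlation analysis), here is a hedged sketch; the CSV filename and the column names ("Date", "Company", "Close") are assumptions about the dataset's schema.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation of daily returns between companies (filename and column names are assumed).
df = pd.read_csv("stock_market_data.csv", parse_dates=["Date"])
closes = df.pivot_table(index="Date", columns="Company", values="Close")

corr = closes.pct_change().corr()   # correlate daily returns rather than raw prices
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation of daily returns")
plt.tight_layout()
plt.show()
```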
Machine Learning Tasks:
1) Stock Price Prediction: Use time-series forecasting models like ARIMA, SARIMA, or Prophet to predict future stock prices for a particular company. Evaluate the models' performance using metrics like Mean Squared Error (MSE) or Root Mean Squared Error (RMSE).
2) Classification of Stock Movements: Create a binary classification model to predict whether a stock will rise or fall on the next trading day. Utilize features like historical price changes, volumes, and technical indicators for the predictions. Implement classifiers such as Logistic Regression, Random Forest, or Support Vector Machines (SVM).
3) Clustering Analysis: Cluster companies based on their historical stock performance using unsupervised learning algorithms like K-means clustering. Explore if companies with similar stock price patterns belong to specific industry sectors.
4) Anomaly Detection: Detect anomalies in stock prices or trading volumes that deviate significantly from the historical trends. Use techniques like Isolation Forest or One-Class SVM for anomaly detection.
5) Reinforcement Learning for Portfolio Optimization: Formulate the stock market data as a reinforcement learning problem to optimize a portfolio's performance. Apply algorithms like Q-Learning or Deep Q-Networks (DQN) to learn the optimal trading strategy.
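For task 1 (stock price prediction), a minimal ARIMA sketch with statsmodels might look like this; again, the filename, column names, and the (5, 1, 0) order are illustrative assumptions.

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error

# Univariate ARIMA forecast for one company's closing price (schema and order are assumed).
df = pd.read_csv("stock_market_data.csv", parse_dates=["Date"])
series = (df[df["Company"] == "Apple"]
          .set_index("Date")["Close"]
          .asfreq("B")   # business-day frequency; forward-fill gaps from non-trading days
          .ffill())

train, test = series[:-30], series[-30:]
model = ARIMA(train, order=(5, 1, 0)).fit()
forecast = model.forecast(steps=len(test))

rmse = mean_squared_error(test, forecast) ** 0.5
print(f"RMSE over the last 30 business days: {rmse:.2f}")
```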
The dataset provided on Kaggle, titled "Stock Market Stars: Historical Data of Top 10 Companies," is intended for learning purposes only. The data has been gathered from public sources, specifically from web scraping www.nasdaq.com, and is presented in good faith to facilitate educational and research endeavors related to stock market analysis and data science.
It is essential to acknowledge that while we have taken reasonable measures to ensure the accuracy and reliability of the data, we do not guarantee its completeness or correctness. The information provided in this dataset may contain errors, inaccuracies, or omissions. Users are advised to use this dataset at their own risk and are responsible for verifying the data's integrity for their specific applications.
This dataset is not intended for any commercial or legal use, and any reliance on the data for financial or investment decisions is not recommended. We disclaim any responsibility or liability for any damages, losses, or consequences arising from the use of this dataset.
By accessing and utilizing this dataset on Kaggle, you agree to abide by these terms and conditions and understand that it is solely intended for educational and research purposes.
Please note that the dataset's contents, including the stock market data and company names, are subject to copyright and other proprietary rights of the respective sources. Users are advised to adhere to all applicable laws and regulations related to data usage, intellectual property, and any other relevant legal obligations.
In summary, this dataset is provided "as is" for learning purposes, without any warranties or guarantees, and users should exercise due diligence and judgment when using the data for any purpose.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The ability to predict earthquakes is invaluable, especially in high-risk seismic zones, yet precise predictions remain elusive. One potential reason is the limited integration of statistical approaches in earthquake research. In this study, I employed exploratory data analysis (EDA), a data-driven parametric statistical method, to investigate seismic records from Japan, using data provided by the Japan Meteorological Agency. The intervals between earthquakes closely followed an exponential distribution, defined by a single parameter, λ, representing event frequency. In contrast to the conventional Gutenberg-Richter law, earthquake magnitudes conformed to a normal distribution, characterised by two parameters: µ (mean) and σ (scale). After establishing these distributions and their parameters, significant shifts became evident. Before the 2011 Tohoku Pacific Ocean earthquake, notable changes emerged: earthquake intervals initially shortened, possibly reflecting energy destabilisation around asperities at plate boundaries, followed by an increase in the magnitude scale. These patterns suggest tectonic plates are heterogeneous, with varying boundary rigidity. Additionally, moving averages of magnitude exhibited substantial fluctuations, reaching unusually high levels. Identifying such anomalies accurately required understanding baseline distributions under normal conditions. Broadening the use of EDA across diverse seismic datasets could improve prediction accuracy. This study underscores the importance of statistical methodologies in seismic research and provides critical insights for enhancing seismic risk assessment.
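To illustrate the parameter-estimation step described in the abstract, here is a minimal sketch that fits an exponential distribution to inter-event times and a normal distribution to magnitudes. It runs on synthetic data, since the JMA catalogue itself is not included here.

```python
import numpy as np
from scipy import stats

# Synthetic data standing in for a seismic catalogue:
# inter-event times modelled as exponential (rate lambda), magnitudes as normal (mu, sigma).
rng = np.random.default_rng(0)
intervals_h = rng.exponential(scale=6.0, size=2000)      # hours between events (synthetic)
magnitudes = rng.normal(loc=4.2, scale=0.6, size=2000)   # synthetic magnitudes

loc, scale = stats.expon.fit(intervals_h, floc=0)
lam = 1.0 / scale                                        # events per hour
mu, sigma = stats.norm.fit(magnitudes)

print(f"lambda = {lam:.3f} events/hour, mu = {mu:.2f}, sigma = {sigma:.2f}")
```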
This dataset is a merged dataset created from the data provided in the competition "Store Sales - Time Series Forecasting". The other datasets that were provided there apart from train and test (for example holidays_events, oil, stores, etc.) could not be used in the final prediction. In my understanding, EDA on the merged dataset gives a clearer picture of the other factors that might also affect the final prediction of grocery sales. Therefore, I created this merged dataset and posted it here for further analysis.
##### Data Description
Data Field Information (This is a copy of the description as provided in the actual dataset)
Train.csv
- id: store id
- date: date of the sale
- store_nbr: identifies the store at which the products are sold.
- family: identifies the type of product sold.
- sales: gives the total sales for a product family at a particular store at a given date. Fractional values are possible since products can be sold in fractional units (1.5 kg of cheese, for instance, as opposed to 1 bag of chips).
- onpromotion: gives the total number of items in a product family that were being promoted at a store on a given date.
- Store metadata, including city, state, type, and cluster. cluster is a grouping of similar stores.
- Holidays and Events, with metadata. NOTE: Pay special attention to the transferred column. A holiday that is transferred officially falls on that calendar day but was moved to another date by the government. A transferred day is more like a normal day than a holiday. To find the day that it was celebrated, look for the corresponding row where the type is Transfer. For example, the holiday Independencia de Guayaquil was transferred from 2012-10-09 to 2012-10-12, which means it was celebrated on 2012-10-12. Days that are type Bridge are extra days that are added to a holiday (e.g., to extend the break across a long weekend). These are frequently made up by the type Work Day, which is a day not normally scheduled for work (e.g., Saturday) that is meant to pay back the Bridge. Additional holidays are days added to a regular calendar holiday, for example, as typically happens around Christmas (making Christmas Eve a holiday).
- dcoilwtico: Daily oil price. Includes values during both the train and test data timeframes. (Ecuador is an oil-dependent country and its economic health is highly vulnerable to shocks in oil prices.)
Note: There is a transaction column in the training dataset which displays the sales transactions on that particular date.
Test.csv - The test data, having the same features as the training data. You will predict the target sales for the dates in this file. The dates in the test data are for the 15 days after the last date in the training data.
Note: There is no transaction column in the test dataset as there was in the training dataset. Therefore, while building the model, you might exclude this column and use it only for EDA.
submission.csv - A sample submission file in the correct format.
https://creativecommons.org/publicdomain/zero/1.0/
Don't ask me where this data comes from; the answer is I don't know!
Credit score cards are a common risk control method in the financial industry. They use personal information and data submitted by credit card applicants to predict the probability of future defaults and credit card borrowings, so the bank can decide whether to issue a credit card to the applicant. Credit scores objectively quantify the magnitude of risk.
Generally speaking, credit score cards are based on historical data, so when large economic fluctuations occur, past models may lose their original predictive power. The logistic model is a common method for credit scoring because it is suitable for binary classification tasks and can calculate the coefficients of each feature. To facilitate understanding and operation, the score card multiplies the logistic regression coefficients by a certain value (such as 100) and rounds the result.
At present, with the development of machine learning algorithms, more predictive methods such as Boosting, Random Forest, and Support Vector Machines have been introduced into credit card scoring. However, these methods often lack transparency, and it may be difficult to provide customers and regulators with a reason for rejection or acceptance.
Build a machine learning model to predict whether an applicant is a 'good' or 'bad' client. Unlike other tasks, the definition of 'good' or 'bad' is not given; you should use a technique such as vintage analysis to construct your label. The unbalanced-data problem is also a major challenge in this task.
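As a hedged starting point for the modelling task, the sketch below trains a logistic regression with class weighting to address the imbalance. The label column is assumed to have been constructed already (e.g., via vintage analysis), and the filename and column names are placeholders.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# "applications_with_labels.csv" and the "bad_client" label column are hypothetical;
# build them yourself by merging the tables and constructing a label as described above.
data = pd.read_csv("applications_with_labels.csv")
X = pd.get_dummies(data.drop(columns=["ID", "bad_client"]), drop_first=True)
y = data["bad_client"]

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# class_weight="balanced" is one simple way to address the class imbalance mentioned above.
clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```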
There are two tables that can be merged by ID:
application_record.csv

| Feature name | Explanation | Remarks |
|---|---|---|
| ID | Client number | |
| CODE_GENDER | Gender | |
| FLAG_OWN_CAR | Is there a car | |
| FLAG_OWN_REALTY | Is there a property | |
| CNT_CHILDREN | Number of children | |
| AMT_INCOME_TOTAL | Annual income | |
| NAME_INCOME_TYPE | Income category | |
| NAME_EDUCATION_TYPE | Education level | |
| NAME_FAMILY_STATUS | Marit... | |
After many years of research and technical preparation, the production of a new ECMWF climate reanalysis to replace ERA-Interim is in progress. ERA5 is the fifth generation of ECMWF atmospheric reanalyses of the global climate, which started with the FGGE reanalyses produced in the 1980s, followed by ERA-15, ERA-40 and most recently ERA-Interim. ERA5 will cover the period January 1950 to near real time. ERA5 is produced using high-resolution forecasts (HRES) at 31 kilometer resolution (one fourth the spatial resolution of the operational model) and a 62 kilometer resolution ten member 4D-Var ensemble of data assimilation (EDA) in CY41r2 of ECMWF's Integrated Forecast System (IFS) with 137 hybrid sigma-pressure (model) levels in the vertical, up to a top level of 0.01 hPa. Atmospheric data on these levels are interpolated to 37 pressure levels (the same levels as in ERA-Interim). Surface or single level data are also available, containing 2D parameters such as precipitation, 2 meter temperature, top of atmosphere radiation and vertical integrals over the entire atmosphere. The IFS is coupled to a soil model, the parameters of which are also designated as surface parameters, and an ocean wave model. Generally, the data is available at an hourly frequency and consists of analyses and short (12 hour) forecasts, initialized twice daily from analyses at 06 and 18 UTC. Most analyses parameters are also available from the forecasts. There are a number of forecast parameters, for example mean rates and accumulations, that are not available from the analyses. Improvements to ERA5, compared to ERA-Interim, include use of HadISST.2, reprocessed ECMWF climate data records (CDR), and implementation of RTTOV11 radiative transfer. Variational bias corrections have not only been applied to satellite radiances, but also ozone retrievals, aircraft observations, surface pressure, and radiosonde profiles. Please note: DECS is producing a CF 1.6 compliant netCDF-4/HDF5 version of ERA5...
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
In this Article, the organolithiums ((−)-sparteine)LitBu, [(ABCO)LitBu]2 (2), and (ABCO)2(LiiPr)4 are investigated by means of experimental and theoretical charge density determination to elucidate the nature of the Li–C and Li–N bonds. Furthermore, the valence shell charge concentrations (VSCCs) in the nonbonding region of the deprotonated Cα-atom provide some insight into the localization of the carbanionic lone pair. Analysis of the electron density (ρ(rBCP)), the Laplacian (∇2ρ(rBCP)), and the energy decomposition analysis (EDA) confirmed that the Li–C/N bonds exhibit astonishingly similar characteristics and reveal an increasingly polar contact with decreasing aggregate size. This explains former observations on the incorporation of halide salts in organolithium reagents. Furthermore, it could be shown that the bonding properties of the iPr group are similar to those of the tBu substituent. The accuracy of fit to all previously determined properties of organolithiums is remarkable.