22 datasets found
  1. WIDEa: a Web Interface for big Data exploration, management and analysis

    • entrepot.recherche.data.gouv.fr
    Updated Sep 12, 2021
    Cite
    Philippe Santenoise (2021). WIDEa: a Web Interface for big Data exploration, management and analysis [Dataset]. http://doi.org/10.15454/AGU4QE
    Explore at:
    Dataset updated
    Sep 12, 2021
    Dataset provided by
    Recherche Data Gouv
    Authors
    Philippe Santenoise
    License

    https://entrepot.recherche.data.gouv.fr/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.15454/AGU4QE

    Description

    WIDEa is R-based software aiming to provide users with a range of functionalities to explore, manage, clean and analyse "big" environmental and (in/ex situ) experimental data. These functionalities are the following:

    1. Loading/reading different data types: basic (called normal), temporal, and infrared spectra of the mid/near region (called IR), with frequency (wavenumber) used as unit (in cm-1);
    2. Interactive data visualization from a multitude of graph representations: 2D/3D scatter plot, box plot, histogram, bar plot, correlation matrix;
    3. Manipulation of variables: concatenation of qualitative variables, transformation of quantitative variables by generic functions in R;
    4. Application of mathematical/statistical methods;
    5. Creation/management of data considered atypical (named flag data);
    6. Study of normal distribution model results for different strategies: calibration (checking assumptions on residuals) and validation (comparison between measured and fitted values). The model form can be more or less complex: mixed effects, main/interaction effects, weighted residuals.

  2. DEEPEN Global Standardized Categorical Exploration Datasets for Magmatic...

    • catalog.data.gov
    • data.openei.org
    Updated Jan 20, 2025
    Cite
    National Renewable Energy Laboratory (2025). DEEPEN Global Standardized Categorical Exploration Datasets for Magmatic Plays [Dataset]. https://catalog.data.gov/dataset/deepen-global-standardized-categorical-exploration-datasets-for-magmatic-plays-f1ecf
    Explore at:
    Dataset updated
    Jan 20, 2025
    Dataset provided by
    National Renewable Energy Laboratory
    Description

    DEEPEN stands for DE-risking Exploration of geothermal Plays in magmatic ENvironments. As part of the development of the DEEPEN 3D play fairway analysis (PFA) methodology for magmatic plays (conventional hydrothermal, superhot EGS, and supercritical), weights needed to be developed for use in the weighted sum of the different favorability index models produced from geoscientific exploration datasets. This was done using two different approaches: one based on expert opinions, and one based on statistical learning. This GDR submission includes the datasets used to produce the statistical learning-based weights. While expert opinions allow us to include more nuanced information in the weights, they are subject to human bias. Data-centric or statistical approaches help to overcome these potential biases by focusing on, and drawing conclusions from, the data alone. The drawback is that a dataset is needed to apply these types of approaches. Therefore, we attempted to build comprehensive standardized datasets mapping anomalies in each exploration dataset to each component of each play. The data were gathered through a literature review focused on magmatic hydrothermal plays, along with well-characterized areas where superhot or supercritical conditions are thought to exist. Datasets were assembled for all three play types, but the hydrothermal dataset is the least complete due to its relatively low priority. For each known or assumed resource, the dataset states which anomaly in each exploration dataset is associated with each component of the system. The data are only semi-quantitative: values are either high, medium, or low relative to background levels. In addition, the dataset has significant gaps, as not every possible exploration dataset has been collected and analyzed at every known or suspected geothermal resource area in the context of all possible play types.

    The following training sites were used to assemble this dataset:

    - Conventional magmatic hydrothermal: Akutan (from AK PFA), Oregon Cascades PFA, Glass Buttes OR, Mauna Kea (from HI PFA), Lanai (from HI PFA), Mt St Helens Shear Zone (from WA PFA), Wind River Valley (from WA PFA), Mount Baker (from WA PFA).
    - Superhot EGS: Newberry (EGS demonstration project), Coso (EGS demonstration project), Geysers (EGS demonstration project), Eastern Snake River Plain (EGS demonstration project), Utah FORGE, Larderello, Kakkonda, Taupo Volcanic Zone, Acoculco, Krafla.
    - Supercritical: Coso, Geysers, Salton Sea, Larderello, Los Humeros, Taupo Volcanic Zone, Krafla, Reykjanes, Hengill.

    Disclaimer: Treat the supercritical fluid anomalies with skepticism. They are based on assumptions, owing to the general lack of confirmed supercritical fluid encounters and samples at the included sites at the time the dataset was assembled. The main assumption was that the supercritical fluid in a given geothermal system shares properties with the hydrothermal fluid, which may not be the case in reality.

    Once the datasets were assembled, principal component analysis (PCA) was applied to each. PCA is an unsupervised statistical learning technique, meaning that labels are not required on the data, that summarizes the directions of variance in the data. This approach was chosen because our labels are not certain, i.e., we do not know with 100% confidence that superhot resources exist at all the assumed positive areas. We also do not have data for any known non-geothermal areas, meaning that it would be challenging to apply a supervised learning technique. To generate weights from the PCA, an analysis of the PCA loading values was conducted. PCA loading values represent how much a feature contributes to each principal component, and therefore to the overall variance in the data.
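    The loading-based weighting described above can be sketched as follows. This is not the submission's actual procedure: the anomaly table, the exploration-dataset names, and the specific loading-to-weight rule below are illustrative assumptions for a semi-quantitative matrix coded as low/medium/high = 0/1/2.

    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # Hypothetical semi-quantitative anomaly table: rows are training sites,
    # columns are exploration datasets, values code low/medium/high as 0/1/2.
    data = pd.DataFrame(
        {
            "resistivity": [2, 2, 1, 0, 2],
            "seismicity": [1, 2, 2, 1, 0],
            "heat_flow": [2, 1, 2, 2, 1],
            "gas_geochem": [0, 1, 2, 1, 2],
        },
        index=["site_a", "site_b", "site_c", "site_d", "site_e"],
    )

    X = StandardScaler().fit_transform(data)   # standardize each exploration dataset
    pca = PCA().fit(X)

    # Loadings: how much each feature contributes to each principal component.
    loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

    # One simple way to turn loadings into weights: sum absolute loadings,
    # weighted by each component's explained variance ratio, then normalize.
    raw = (np.abs(loadings) * pca.explained_variance_ratio_).sum(axis=1)
    weights = pd.Series(raw / raw.sum(), index=data.columns)
    print(weights.sort_values(ascending=False))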

  3. Shopping Mall

    • kaggle.com
    zip
    Updated Dec 15, 2023
    Cite
    Anshul Pachauri (2023). Shopping Mall [Dataset]. https://www.kaggle.com/datasets/anshulpachauri/shopping-mall
    Explore at:
    Available download formats: zip (22852 bytes)
    Dataset updated
    Dec 15, 2023
    Authors
    Anshul Pachauri
    Description

    Libraries Import:

    Importing necessary libraries such as pandas, seaborn, matplotlib, scikit-learn's KMeans, and warnings.

    Data Loading and Exploration:

    Reading a dataset named "Mall_Customers.csv" into a pandas DataFrame (df). Displaying the first few rows of the dataset using df.head(). Conducting univariate analysis by calculating descriptive statistics with df.describe().

    Univariate Analysis:

    Visualizing the distribution of the 'Annual Income (k$)' column using sns.distplot. Looping through selected columns ('Age', 'Annual Income (k$)', 'Spending Score (1-100)') and plotting individual distribution plots.

    Bivariate Analysis:

    Creating a scatter plot for 'Annual Income (k$)' vs 'Spending Score (1-100)' using sns.scatterplot. Generating a pair plot for selected columns with gender differentiation using sns.pairplot.

    Gender-Based Analysis:

    Grouping the data by 'Gender' and calculating the mean for selected columns. Computing the correlation matrix for the grouped data and visualizing it using a heatmap.

    Univariate Clustering:

    Applying KMeans clustering with 3 clusters based on 'Annual Income (k$)' and adding the 'Income Cluster' column to the DataFrame. Plotting the elbow method to determine the optimal number of clusters.

    Bivariate Clustering:

    Applying KMeans clustering with 5 clusters based on 'Annual Income (k$)' and 'Spending Score (1-100)' and adding the 'Spending and Income Cluster' column. Plotting the elbow method for bivariate clustering and visualizing the cluster centers on a scatter plot. Displaying a normalized cross-tabulation between 'Spending and Income Cluster' and 'Gender'.

    Multivariate Clustering:

    Performing multivariate clustering by creating dummy variables, scaling selected columns, and applying KMeans clustering. Plotting the elbow method for multivariate clustering.

    Result Saving:

    Saving the modified DataFrame with cluster information to a CSV file named "Result.csv". Saving the multivariate clustering plot as an image file ("Multivariate_figure.png").
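    A condensed sketch of the elbow method and the bivariate clustering step is shown below. It assumes the "Mall_Customers.csv" column names quoted above and scikit-learn; it illustrates the described workflow rather than reproducing the notebook.

    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    df = pd.read_csv("Mall_Customers.csv")  # column names as quoted in the description above

    # Elbow method on the two bivariate features
    features = df[["Annual Income (k$)", "Spending Score (1-100)"]]
    inertia = []
    for k in range(1, 11):
        km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(features)
        inertia.append(km.inertia_)
    plt.plot(range(1, 11), inertia, marker="o")
    plt.xlabel("Number of clusters")
    plt.ylabel("Inertia")
    plt.title("Elbow method")
    plt.show()

    # Final bivariate clustering with five clusters, as in the summary above
    km5 = KMeans(n_clusters=5, n_init=10, random_state=42)
    df["Spending and Income Cluster"] = km5.fit_predict(features)
    df.to_csv("Result.csv", index=False)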

  4. Insurance_claims

    • kaggle.com
    • data.mendeley.com
    zip
    Updated Oct 19, 2025
    Cite
    Miannotti (2025). Insurance_claims [Dataset]. https://www.kaggle.com/datasets/mian91218/insurance-claims
    Explore at:
    Available download formats: zip (68984 bytes)
    Dataset updated
    Oct 19, 2025
    Authors
    Miannotti
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    AQQAD, ABDELRAHIM (2023), “insurance_claims ”, Mendeley Data, V2, doi: 10.17632/992mh7dk9y.2

    https://data.mendeley.com/datasets/992mh7dk9y/2

    Latest version Version 2 Published: 22 Aug 2023 DOI: 10.17632/992mh7dk9y.2

    Data Acquisition: - Obtain the dataset titled "Insurance_claims" from the following Mendeley repository: https://data.mendeley.com/drafts/992mh7dk9y - Download and store the dataset locally for easy access during subsequent steps.

    Data Loading & Initial Exploration: - Use Python's Pandas library to load the dataset into a DataFrame. Python code used:

    import pandas as pd

    # Load the dataset file
    insurance_df = pd.read_csv('insurance_claims.csv')

    - Inspect the initial rows, data types, and summary statistics to get an understanding of the dataset's structure.

    Data Cleaning & Pre-processing: - Handle missing values, if any. Strategies may include imputation or deletion based on the nature of the missing data. - Identify and handle outliers. In this research, particularly, outliers in the 'umbrella_limit' column were addressed. - Normalize or standardize features if necessary.

    Exploratory Data Analysis (EDA): - Utilize visualization libraries such as Matplotlib and Seaborn in Python for graphical exploration. - Examine distributions, correlations, and patterns in the data, especially between features and the target variable 'fraud_reported'. - Identify features that exhibit distinct patterns for fraudulent and non-fraudulent claims.

    Feature Engineering & Selection: - Create or transform existing features to improve model performance. - Use techniques like Recursive Feature Elimination (RFECV) to identify and retain only the most informative features.

    Modeling: - Split the dataset into training and test sets to ensure the model's generalizability. - Implement machine learning algorithms such as Support Vector Machine, RandomForest, and Voting Classifier using libraries like Scikit-learn. - Handle class imbalance issues using methods like Synthetic Minority Over-sampling Technique (SMOTE).

    Model Evaluation: - Evaluate the performance of each model using metrics like precision, recall, F1-score, ROC-AUC score, and confusion matrix. - Fine-tune the models based on the results. Hyperparameter tuning can be performed using techniques like Grid Search or Random Search.
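    A minimal sketch of the Modeling and Model Evaluation steps is shown below, assuming the insurance_df frame loaded earlier and a binary 'fraud_reported' target coded 'Y'/'N' (an assumption about that column's encoding). RFECV, the SVM, and the Voting Classifier used in the full study are omitted for brevity.

    import pandas as pd
    from imblearn.over_sampling import SMOTE
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report, roc_auc_score
    from sklearn.model_selection import train_test_split

    # insurance_df comes from the loading snippet above.
    insurance_df = insurance_df.dropna(axis=1, how="all")      # drop any fully empty columns
    X = pd.get_dummies(insurance_df.drop(columns=["fraud_reported"]))
    y = (insurance_df["fraud_reported"] == "Y").astype(int)    # assumed 'Y'/'N' encoding

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )

    # Oversample the minority (fraud) class in the training split only
    X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

    clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_res, y_res)
    print(classification_report(y_test, clf.predict(X_test)))
    print("ROC-AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))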

    Model Interpretation: - Use methods like SHAP (SHapley Additive exPlanations) or LIME (Local Interpretable Model-agnostic Explanations) to interpret and understand the predictions made by the model.

    Deployment & Prediction: - Utilize the best-performing model to make predictions on unseen data. - If the intention is to deploy the model in a real-world scenario, convert the trained model into a format suitable for deployment (e.g., using libraries like joblib or pickle).

    Software & Tools: - Programming Language: Python (run in Google Colab) - Libraries: Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn, Imbalanced-learn, LIME, and SHAP. - Environment: Jupyter Notebook or any Python IDE.

  5. DEEPEN Global Standardized Categorical Exploration Datasets for Magmatic...

    • osti.gov
    Updated Jun 30, 2023
    Cite
    Caliandro, Nils; King, Rachel; Taverna, Nicole (2023). DEEPEN Global Standardized Categorical Exploration Datasets for Magmatic Plays [Dataset]. https://www.osti.gov/dataexplorer/biblio/dataset/1995526-deepen-global-standardized-categorical-exploration-datasets-magmatic-plays
    Explore at:
    Dataset updated
    Jun 30, 2023
    Dataset provided by
    United States Department of Energy (http://energy.gov/)
    Authors
    Caliandro, Nils; King, Rachel; Taverna, Nicole
    Description

    DEEPEN stands for DE-risking Exploration of geothermal Plays in magmatic ENvironments. As part of the development of the DEEPEN 3D play fairway analysis (PFA) methodology for magmatic plays (conventional hydrothermal, superhot EGS, and supercritical), weights needed to be developed for use in the weighted sum of the different favorability index models produced from geoscientific exploration datasets. This was done using two different approaches: one based on expert opinions, and one based on statistical learning. This GDR submission includes the datasets used to produce the statistical learning-based weights. While expert opinions allow us to include more nuanced information in the weights, they are subject to human bias. Data-centric or statistical approaches help to overcome these potential biases by focusing on, and drawing conclusions from, the data alone. The drawback is that a dataset is needed to apply these types of approaches. Therefore, we attempted to build comprehensive standardized datasets mapping anomalies in each exploration dataset to each component of each play. The data were gathered through a literature review focused on magmatic hydrothermal plays, along with well-characterized areas where superhot or supercritical conditions are thought to exist. Datasets were assembled for all three play types, but the hydrothermal dataset is the least complete due to its relatively low priority. For each known or assumed resource, the dataset states which anomaly in each exploration dataset is associated with each component of the system. The data are only semi-quantitative: values are either high, medium, or low relative to background levels. In addition, the dataset has significant gaps, as not every possible exploration dataset has been collected and analyzed at every known or suspected geothermal resource area in the context of all possible play types.

    The following training sites were used to assemble this dataset:

    - Conventional magmatic hydrothermal: Akutan (from AK PFA), Oregon Cascades PFA, Glass Buttes OR, Mauna Kea (from HI PFA), Lanai (from HI PFA), Mt St Helens Shear Zone (from WA PFA), Wind River Valley (from WA PFA), Mount Baker (from WA PFA).
    - Superhot EGS: Newberry (EGS demonstration project), Coso (EGS demonstration project), Geysers (EGS demonstration project), Eastern Snake River Plain (EGS demonstration project), Utah FORGE, Larderello, Kakkonda, Taupo Volcanic Zone, Acoculco, Krafla.
    - Supercritical: Coso, Geysers, Salton Sea, Larderello, Los Humeros, Taupo Volcanic Zone, Krafla, Reykjanes, Hengill.

    Disclaimer: Treat the supercritical fluid anomalies with skepticism. They are based on assumptions, owing to the general lack of confirmed supercritical fluid encounters and samples at the included sites at the time the dataset was assembled. The main assumption was that the supercritical fluid in a given geothermal system shares properties with the hydrothermal fluid, which may not be the case in reality.

    Once the datasets were assembled, principal component analysis (PCA) was applied to each. PCA is an unsupervised statistical learning technique, meaning that labels are not required on the data, that summarizes the directions of variance in the data. This approach was chosen because our labels are not certain, i.e., we do not know with 100% confidence that superhot resources exist at all the assumed positive areas. We also do not have data for any known non-geothermal areas, meaning that it would be challenging to apply a supervised learning technique. To generate weights from the PCA, an analysis of the PCA loading values was conducted. PCA loading values represent how much a feature contributes to each principal component, and therefore to the overall variance in the data.

  6. Data from: The Gravity Loading Countermeasure Skinsuit

    • data.nasa.gov
    application/rdfxml +5
    Updated Jun 26, 2018
    Cite
    (2018). The Gravity Loading Countermeasure Skinsuit [Dataset]. https://data.nasa.gov/dataset/The-Gravity-Loading-Countermeasure-Skinsuit/9b48-p3n8
    Explore at:
    Available download formats: application/rssxml, application/rdfxml, json, csv, xml, tsv
    Dataset updated
    Jun 26, 2018
    License

    U.S. Government Works (https://www.usa.gov/government-works)
    License information was derived automatically

    Description

    Astronauts lose considerable bone mass during long duration spaceflight. These losses are one of the major concerns for proposed exploration class missions to the Moon, Mars, and Near Earth Objects due to the increase in fracture risk associated with reduced bone strength. These losses are seen even with the intervention of current exercise countermeasures. Although it is possible that the newest exercise machine, the Advanced Resistive Exercise Device (ARED), will be more effective in preventing bone and muscle losses, its size may be prohibitive in bringing it on interplanetary missions. In order for astronauts to be able to perform successful exploration tasks, they need to arrive at their destinations healthy and capable of doing work. Because of this, new countermeasures will need to be developed to prevent musculoskeletal deconditioning.

    The goal of this proposal is to meet this need by producing a wearable countermeasure suit. The suit's primary goal will be to impose static loading, similar to that produced by gravity, on the user. In addition, incorporating a dynamic loading component, such as a form of vibration, may enhance the effectiveness of the suit. Finally, integrating the suit with existing countermeasures could serve to improve the overall efficacy of the entire countermeasure program. While astronauts currently exercise for around 2 hours, this suit could be worn for a longer period of time, including while the astronauts are performing other tasks. The suits will also be lightweight and easily packable, which is a major consideration for space exploration missions.

    The research objectives of this proposal are as follows:

    - To produce a comprehensive model of suit-body interactions to aid in suit design
    - To investigate the integration of the suit with existing countermeasures
    - To investigate forms of dynamic loading, and their effects on subject comfort and performance
    - To build and characterize prototype countermeasure suits

    The model from aim 1 will be created using Matlab and body modeling software, and will be used to inform suit design. It will be used to compute the effects of integrating the suit with existing countermeasures on overall suit characteristics. Dynamic loading mechanisms will be evaluated on their loading qualities and influence on subject comfort and performance. After the suits are constructed, their loading and comfort traits will be characterized.

    The Gravity Loading Countermeasure Skinsuit will reduce the musculoskeletal deconditioning seen during long duration spaceflight, allowing for a more robust and effective exploration program. The technologies developed in this proposal will also have applications in the medical field, for treating bed rest patients and in healing musculoskeletal injuries.

  7. Data from: Smart metering and energy access programs: an approach to energy...

    • esango.cput.ac.za
    Updated May 31, 2023
    Cite
    Bennour Bacar (2023). Smart metering and energy access programs: an approach to energy poverty reduction in sub-Saharan Africa [Dataset]. http://doi.org/10.25381/cput.22264042.v1
    Explore at:
    Dataset updated
    May 31, 2023
    Dataset provided by
    Cape Peninsula University of Technology
    Authors
    Bennour Bacar
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Area covered
    Sub-Saharan Africa
    Description

    Ethical clearance reference number: refer to the uploaded document Ethics Certificate.pdf.

    General (0)

    0 - Built diagrams and figures.pdf: diagrams and figures used for the thesis

    Analysis of country data (1)

    0 - Country selection.xlsx: In this analysis the sub-Saharan country (Niger) is selected based on the kWh per capita data obtained from sources such as the United Nations and the World Bank. Other data used from these sources includes household size and electricity access. Some household data was projected using linear regression. Sample sizes VS error margins were also analyzed for the selection of a smaller area within the country.

    Smart metering experiment (2)

    The figures (PNG, JPG, PDF) include:

        - The experiment components and assembly
        - The use of device (meter and modem) software tools to program and analyse data
        - Phasor and meter detail
        - Extracted reports and graphs from the MDMS
    

    The datasets (CSV, XLSX) include:

        - Energy load profile and register data recorded by the smart meter and collected by both meter configuration and MDM applications.
        - Data collected also includes events, alarm and QoS data.
    

    Data applicability to SEAP (3)

    3 - Energy data and SEAP.pdf: as part of the Smart Metering VS SEAP framework analysis, a comparison between SEAP's data requirements, the applicable energy data to those requirements, the benefits, and the calculation of indicators where applicable.

    3 - SEAP indicators.xlsx: as part of the Smart Metering VS SEAP framework analysis, the applicable calculation of indicators for SEAP's data requirements.

    Load prediction by machine learning (4)

    The coding (IPYNB, PY, HTML, ZIP) shows the preparation and exploration of the energy data to train the machine learning model. The datasets (CSV, XLSX), sequentially named, are part of the process of extracting, transforming and loading the data into a machine learning algorithm, identifying the best regression model based on metrics, and predicting the data.
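    As a rough illustration of the model-selection step described above (not the archived notebooks themselves), a cross-validated comparison of candidate regressors might look like the sketch below; the file and column names are hypothetical.

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVR

    # Hypothetical file/column names; the actual data are the sequentially named CSVs above.
    data = pd.read_csv("load_profile.csv")
    X = data.drop(columns=["load_kwh"])
    y = data["load_kwh"]

    candidates = {
        "linear": LinearRegression(),
        "random_forest": RandomForestRegressor(random_state=42),
        "svr": SVR(),
    }
    for name, model in candidates.items():
        scores = cross_val_score(model, X, y, cv=5, scoring="r2")
        print(f"{name}: mean R^2 = {scores.mean():.3f}")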

    HRES analysis and optimization (5)

    The figures (PNG, JPG, PDF) include:

        - Household load, based on the energy data from the smart metering experiment and the machine learning exercise
        - Pre-defined/synthetic load, provided by the software when no external data (household load) is available, and
        - The HRES designed
        - Application-generated reports with the results of the analysis, for both best case HRES and fully renewable scenarios.
    

    The datasets (XLSX) include the 12-month input load for the simulation, and the input/output analysis and calculations.

    5 - Gorou_Niger_20220529_v3.homer: software (Homer Pro) file with the simulated HRES.

    Conferences (6)

    6 - IEEE_MISTA_2022_paper_51.pdf: paper (research in progress) presented at the IEEE MISTA 2022 conference, held in March 2022, and published in the respective proceeding, 6 - IEEE_MISTA_2022_proceeding.pdf.

    6 - ITAS_2023.pdf: paper (final research) recently presented at the ITAS 2023 conference in Doha, Qatar, in March 2023.

    6 - Smart Energy Seminar 2023.pptx: PowerPoint slide version of the paper, recently presented at the Smart Energy Seminar held at CPUT in March 2023.

  8. Marine Loading Arms Market Analysis, Size, and Forecast 2025-2029: North...

    • technavio.com
    pdf
    Updated Jun 7, 2025
    Cite
    Technavio (2025). Marine Loading Arms Market Analysis, Size, and Forecast 2025-2029: North America (US and Canada), Europe (France, Germany, Italy, and UK), APAC (China, India, Japan, and South Korea), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/marine-loading-arms-market-industry-analysis
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 7, 2025
    Dataset provided by
    TechNavio
    Authors
    Technavio
    License

    https://www.technavio.com/content/privacy-notice

    Time period covered
    2025 - 2029
    Area covered
    Germany, United States, Canada, France, United Kingdom
    Description


    Marine Loading Arms Market Size 2025-2029

    The marine loading arms market is projected to increase by USD 82 million at a CAGR of 4.1% from 2024 to 2029. New oil and gas exploration policies will drive the market.

    Major Market Trends & Insights

    APAC dominated the market and is expected to account for 43% of the market's growth during the forecast period.
    By Application - Crude oil segment was valued at USD 130.10 million in 2023
    By Type - Manual marine loading arms segment accounted for the largest market revenue share in 2023
    

    Market Size & Forecast

    Market Opportunities: USD 35.06 million
    Market Future Opportunities: USD 82.00 million
    CAGR from 2024 to 2029: 4.1%
    

    Market Summary

    The market is a critical component of the global oil and gas industry, facilitating the transfer of petroleum products from vessels to onshore storage and distribution facilities. A single data point illustrates the market's significance: it was valued at USD 3.5 billion in 2020. Advancements in technology have significantly influenced the market's evolution. The advent of motion-recognizing marine loading arms has streamlined the loading process, enhancing efficiency and safety. However, the high cost associated with these advanced systems presents a challenge for market growth. Despite this hurdle, the market continues to adapt and innovate.
    Manufacturers are exploring cost-effective solutions, such as modular and lightweight designs, to make marine loading arms more accessible to a wider range of customers. Furthermore, the integration of automation and remote monitoring systems is expected to drive market expansion. The market's future direction lies in enhancing operational efficiency, ensuring safety, and reducing environmental impact. As the industry navigates the complexities of new exploration policies and evolving market dynamics, marine loading arms will remain a vital link in the oil and gas supply chain.
    

    What will be the Size of the Marine Loading Arms Market during the forecast period?


    How is the Marine Loading Arms Market Segmented?

    The marine loading arms industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.

    Application
      Crude oil
      LG
      IG
    Type
      Manual marine loading arms
      Hydraulic marine loading arms
    Method
      Top loading
      Bottom loading
    Material
      Carbon steel
      Stainless steel
      Aluminum
      Others
    Geography
      North America
        US
        Canada
      Europe
        France
        Germany
        Italy
        UK
      APAC
        China
        India
        Japan
        South Korea
      Rest of World (ROW)

    By Application Insights

    The crude oil segment is estimated to witness significant growth during the forecast period.

    The market is undergoing continuous evolution, with the crude oil segment experiencing significant growth due to the increasing demand for efficient and safe transfer systems in the oil and gas industry. Marine loading arms play a crucial role in the loading and unloading of crude oil from tankers to storage facilities or pipelines, ensuring minimal spillage and enhanced safety. The global trade of crude oil, particularly from regions like the Middle East and Asia-Pacific, has necessitated the adoption of advanced marine loading arms. These innovations include material selection criteria that prioritize vapor recovery systems, automated loading systems, and seal technology.


    The Crude oil segment was valued at USD 130.10 million in 2019 and showed a gradual increase during the forecast period.

    Rotary joints, product compatibility testing, high-pressure loading arms, safety interlocks systems, and operational efficiency metrics are becoming increasingly important. Furthermore, leak detection systems, environmental impact assessment, balancing systems, corrosion protection methods, emergency shutdown systems, and vessel compatibility are all critical considerations. The integration of subsea and offshore loading arms, remote control systems, fluid transfer optimization, cryogenic loading arms, throughput optimization, flexible joints, hydraulic power units, and loading arm integrity is also transforming the market.


    Regional Analysis

    APAC is estimated to contribute 43% to the growth of the global market during the forecast period. Technavio's analysts have elaborately explained the regional trends and drivers that shape the market during the forecast period.


    The market in APAC is experiencing notable growth due to the expanding application sectors, including crude oil imports, refined products exports, and the IGs in

  9. Data underlying chapter 4 of the PhD dissertation: Multi-fidelity...

    • data.4tu.nl
    zip
    Updated Nov 26, 2024
    Cite
    Nikoleta Dimitra Charisi; Emile Defer; Hans Hopman; Austin Kana (2024). Data underlying chapter 4 of the PhD dissertation: Multi-fidelity probabilistic design framework for early-stage design of novel vessels [Dataset]. http://doi.org/10.4121/fc643c31-5428-48dc-bcf3-c8a24d49331a.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 26, 2024
    Dataset provided by
    4TU.ResearchData
    Authors
    Nikoleta Dimitra Charisi; Emile Defer; Hans Hopman; Austin Kana
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the code and data supporting the results presented in Chapter 4 of the dissertation "Multi-Fidelity Probabilistic Design Framework for Early-Stage Design of Novel Vessels" and the paper "Multi-fidelity design framework to support early-stage design exploration of the AXE frigates: the vertical bending moment case". The research explores the potential of harnessing multi-fidelity models for early-stage predictions of wave-induced loads, with a specific focus on wave-induced vertical bending moments. The assessed models include both linear and nonlinear Gaussian processes and compositional kernels. The case study focuses on the early-stage exploration of the AXE frigates. Multi-fidelity models were constructed using both frequency- and time-domain methods to evaluate the vertical bending moments experienced by the hull.


    The data include: (1) the parametric model developed in Rhino and Grasshopper used to generate the hull mesh, (2) the simulation data, (3) the data associated with the analyzed cases, and (4) the Python scripts, which can be found in this GitLab repository. The analysis solvers used to calculate the vertical bending moments are not included in this repository.
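    As a loose illustration only: one common way to combine two fidelity levels is to feed a low-fidelity surrogate's prediction to the high-fidelity Gaussian process as an extra input. The sketch below uses synthetic functions and scikit-learn; it is not the dissertation's framework (which applies linear and nonlinear multi-fidelity GPs with compositional kernels to frequency- and time-domain load simulations).

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, ConstantKernel

    rng = np.random.default_rng(0)

    # Synthetic stand-ins: a cheap low-fidelity response and a scarce high-fidelity one.
    def low_fidelity(x):
        return np.sin(8 * x)

    def high_fidelity(x):
        return 1.2 * np.sin(8 * x) + 0.3 * x

    X_lo = rng.uniform(0, 1, 60).reshape(-1, 1)   # many cheap samples
    X_hi = rng.uniform(0, 1, 8).reshape(-1, 1)    # few expensive samples

    gp_lo = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(), normalize_y=True)
    gp_lo.fit(X_lo, low_fidelity(X_lo).ravel())

    # Augment the high-fidelity inputs with the low-fidelity prediction (simple multi-fidelity scheme).
    X_hi_aug = np.hstack([X_hi, gp_lo.predict(X_hi).reshape(-1, 1)])
    gp_hi = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(), normalize_y=True)
    gp_hi.fit(X_hi_aug, high_fidelity(X_hi).ravel())

    # Predict at new points by chaining the two surrogates.
    X_new = np.linspace(0, 1, 5).reshape(-1, 1)
    X_new_aug = np.hstack([X_new, gp_lo.predict(X_new).reshape(-1, 1)])
    mean, std = gp_hi.predict(X_new_aug, return_std=True)
    print(np.c_[X_new.ravel(), mean, std])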

  10. CreditCardFraudDetection

    • kaggle.com
    zip
    Updated Aug 24, 2024
    Cite
    Rashmita Chauhan (2024). CreditCardFraudDetection [Dataset]. https://www.kaggle.com/datasets/rashmitachauhan/creditcardfrauddetection
    Explore at:
    Available download formats: zip (32546 bytes)
    Dataset updated
    Aug 24, 2024
    Authors
    Rashmita Chauhan
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This end-to-end machine learning pipeline addresses the problem of credit card fraud detection. The pipeline consists of several key steps:

    - Problem Statement: The notebook begins by outlining the importance of detecting credit card fraud for financial institutions and customers. It frames the task as a binary classification problem.
    - Data Loading and Exploration: The pipeline uses a credit card transaction dataset, likely from Kaggle. It explores the data structure, checks for missing values, and examines the class distribution of fraudulent vs. non-fraudulent transactions.
    - Data Preprocessing: The features are separated from the target variable. The data is split into training and testing sets. Feature scaling is applied using StandardScaler. SMOTE (Synthetic Minority Over-sampling Technique) is used to address class imbalance in the training data.
    - Model Training: Two models are trained: Logistic Regression and Random Forest Classifier. Both models are fitted on the resampled training data.
    - Model Evaluation: The models are evaluated using various metrics including accuracy, precision, recall, F1-score, and ROC AUC score. Confusion matrices are plotted for both models to visualize their performance. For the Random Forest model, feature importance is calculated and visualized.
    - Discussion and Recommendations: The notebook concludes with a discussion on the strengths and limitations of the approach. It provides business recommendations based on the model results. The importance of model explainability is addressed, suggesting the use of SHAP values for more detailed interpretations.

    This pipeline demonstrates a comprehensive approach to fraud detection, from data preprocessing to model evaluation and business recommendations. It addresses common challenges in fraud detection such as class imbalance and the need for interpretable results.
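    A compressed sketch of the described preprocessing, resampling, and evaluation steps follows. The file name and the 'Class' target column are assumptions about a Kaggle-style transaction table, and only the Logistic Regression branch is shown.

    import pandas as pd
    from imblearn.over_sampling import SMOTE
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix, roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # Hypothetical file/column names; the notebook's dataset uses a binary fraud label.
    df = pd.read_csv("creditcard.csv")
    X, y = df.drop(columns=["Class"]), df["Class"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0
    )

    scaler = StandardScaler().fit(X_train)
    X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

    # Rebalance only the training split, then fit one of the two models.
    X_res, y_res = SMOTE(random_state=0).fit_resample(X_train_s, y_train)
    model = LogisticRegression(max_iter=1000).fit(X_res, y_res)

    print(confusion_matrix(y_test, model.predict(X_test_s)))
    print("ROC-AUC:", roc_auc_score(y_test, model.predict_proba(X_test_s)[:, 1]))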

  11. ner dataset and code of paper "Accelerating the Exploration of Information...

    • figshare.com
    application/x-rar
    Updated Apr 2, 2024
    Cite
    anonymous anonymous (2024). ner dataset and code of paper "Accelerating the Exploration of Information in Chinese Geological Texts Using Pretrained Model and Self Attention" [Dataset]. http://doi.org/10.6084/m9.figshare.25416583.v3
    Explore at:
    Available download formats: application/x-rar
    Dataset updated
    Apr 2, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    anonymous anonymous
    License

    https://www.gnu.org/licenses/gpl-3.0.html

    Description

    The datasets and code used in this study are publicly available.

    Datasets: The datasets used in this study have been divided into training, testing, and validation sets.

    Code: This code repository includes Python scripts that replicate the experimental setup described in the paper. The code is organized into the following modules:

    - Data Preprocessing: This module contains code for loading, cleaning, and transforming the datasets.
    - Model Training: This module includes code for training various named entity recognition models using pre-trained language models. The code also includes implementations of the ablation experiments and data augmentation techniques described in the paper.
    - Evaluation: This module contains code for evaluating the performance of the trained models on the held-out data sets.

    The data and code are provided to facilitate reproducibility and further research on named entity recognition in the Chinese language.

    Clarification: None of the authors are affiliated with Tsinghua University.
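    For orientation, the sketch below shows how a pretrained Chinese model can be loaded for token classification with the transformers library. The base checkpoint, label set, and example sentence are placeholders rather than the paper's configuration, and the classification head is randomly initialized until fine-tuned on the provided training set.

    import torch
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    # Hypothetical base model and label set; the paper's actual checkpoints live in its repository.
    model_name = "bert-base-chinese"
    labels = ["O", "B-MINERAL", "I-MINERAL", "B-STRAT", "I-STRAT"]

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(labels))

    text = "区内出露二叠系灰岩，发育金矿化带。"   # example geological sentence
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits          # shape: (1, seq_len, num_labels)

    # Predictions are placeholders until the head is fine-tuned on the NER training set.
    predictions = logits.argmax(dim=-1)[0].tolist()
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    for token, label_id in zip(tokens, predictions):
        print(token, labels[label_id])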

  12. SP500_data

    • kaggle.com
    zip
    Updated May 28, 2023
    Cite
    Franco Dicosola (2023). SP500_data [Dataset]. https://www.kaggle.com/datasets/francod/s-and-p-500-data
    Explore at:
    Available download formats: zip (39005 bytes)
    Dataset updated
    May 28, 2023
    Authors
    Franco Dicosola
    Description

    Project Documentation: Predicting S&P 500 Price

    Problem Statement: The goal of this project is to develop a machine learning model that can predict the future price of the S&P 500 index based on historical data and relevant features. By accurately predicting the price movements, we aim to assist investors and financial professionals in making informed decisions and managing their portfolios effectively.

    Dataset Description: The dataset used for this project contains historical data of the S&P 500 index, along with several other features such as dividends, earnings, consumer price index (CPI), interest rates, and more. The dataset spans a certain time period and includes daily values of these variables.

    Steps Taken:

    1. Data Preparation and Exploration:
       - Loaded the dataset and performed initial exploration.
       - Checked for missing values and handled them if any.
       - Explored the statistical summary and distributions of the variables.
       - Conducted correlation analysis to identify potential features for prediction.

    2. Data Visualization and Analysis:
       - Plotted time series graphs to visualize the S&P 500 index and other variables over time.
       - Examined the trends, seasonality, and residual behavior of the time series using decomposition techniques.
       - Analyzed the relationships between the S&P 500 index and other features using scatter plots and correlation matrices.

    3. Feature Engineering and Selection:
       - Selected relevant features based on correlation analysis and domain knowledge.
       - Explored feature importance using tree-based models and selected informative features.
       - Prepared the final feature set for model training.

    4. Model Training and Evaluation:
       - Split the dataset into training and testing sets.
       - Selected a regression model (Linear Regression) for price prediction.
       - Trained the model using the training set.
       - Evaluated the model's performance using mean squared error (MSE) and R-squared (R^2) metrics on both training and testing sets.

    5. Prediction and Interpretation:
       - Obtained predictions for future S&P 500 prices using the trained model.
       - Interpreted the predicted prices in the context of the current market conditions and the percentage change from the current price.

    Limitations and Future Improvements:
       - The predictive performance of the model is based on the available features and historical data, and it may not capture all the complexities and factors influencing the S&P 500 index.
       - The model's accuracy and reliability are subject to the quality and representativeness of the training data.
       - The model assumes that the historical patterns and relationships observed in the data will continue in the future, which may not always hold true.
       - Future improvements could include incorporating additional relevant features, exploring different regression algorithms, and considering more sophisticated techniques such as time series forecasting models.
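    Steps 4 and 5 can be sketched roughly as follows; the file name and column names are assumptions about the dataset's headers, not its actual schema.

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error, r2_score
    from sklearn.model_selection import train_test_split

    # Hypothetical file/column names for the S&P 500 table described above.
    df = pd.read_csv("sp500_data.csv").dropna()
    features = ["Dividend", "Earnings", "Consumer Price Index", "Long Interest Rate"]
    X, y = df[features], df["SP500"]

    # A chronological split is usually preferable for time series; a simple holdout is shown here.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

    model = LinearRegression().fit(X_train, y_train)
    pred = model.predict(X_test)
    print("MSE:", mean_squared_error(y_test, pred))
    print("R^2:", r2_score(y_test, pred))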

  13. From Expansion to Elimination, DATA

    • data.mendeley.com
    Updated Oct 7, 2025
    Cite
    Devan Wiley (2025). From Expansion to Elimination, DATA [Dataset]. http://doi.org/10.17632/5v54mctvxs.1
    Explore at:
    Dataset updated
    Oct 7, 2025
    Authors
    Devan Wiley
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This project performs a Bayesian hierarchical analysis to investigate the factors influencing energy cost burden across different ZIP codes and years. Using panel data from multiple Excel files spanning several years (2012-2022), the project aims to model the relationship between energy cost burden and various predictors including tax_returns, uptake (presumably related to program participation or energy efficiency measures), and percent_white.

    The core of the analysis involves:

    - Data Loading and Preprocessing: Combining data from multiple years, handling missing values, and standardizing predictor variables.
    - Hierarchical Modeling: Building a Bayesian hierarchical model using PyMC that accounts for variation across both ZIP codes and years through the use of random effects.
    - Inference: Performing inference using both variational inference (ADVI) and Markov Chain Monte Carlo (MCMC) methods, specifically the No-U-Turn Sampler (NUTS), to estimate the posterior distributions of the model parameters.
    - Diagnostics and Comparison: Analyzing the convergence diagnostics (R-hat, ESS, divergences) for the MCMC samples and comparing the results obtained from ADVI and NUTS to understand the reliability of the different inference methods for this model and dataset.
    - Exploratory Analysis: Including steps for basic data exploration such as summary statistics, correlation analysis, and time trends of key variables.

    The project highlights the importance of using robust MCMC methods like NUTS for complex models, especially when simpler approximations like ADVI might yield conflicting conclusions, and includes steps to improve sampler performance and assess convergence.
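    A minimal sketch of such a hierarchical model in PyMC is shown below; the file name, column names, and priors are illustrative assumptions, not the project's actual specification.

    import pandas as pd
    import pymc as pm

    # Hypothetical panel table with the variables named in the description.
    df = pd.read_csv("energy_burden_panel.csv")
    zip_idx, zips = pd.factorize(df["zip_code"])
    year_idx, years = pd.factorize(df["year"])
    predictors = ["tax_returns", "uptake", "percent_white"]
    X = df[predictors].apply(lambda c: (c - c.mean()) / c.std()).to_numpy()  # standardize

    coords = {"zip": zips, "year": years, "predictor": predictors}
    with pm.Model(coords=coords) as model:
        beta = pm.Normal("beta", 0.0, 1.0, dims="predictor")
        sigma_zip = pm.HalfNormal("sigma_zip", 1.0)
        sigma_year = pm.HalfNormal("sigma_year", 1.0)
        u_zip = pm.Normal("u_zip", 0.0, sigma_zip, dims="zip")       # ZIP-level random effects
        u_year = pm.Normal("u_year", 0.0, sigma_year, dims="year")   # year-level random effects
        sigma = pm.HalfNormal("sigma", 1.0)

        mu = pm.math.dot(X, beta) + u_zip[zip_idx] + u_year[year_idx]
        pm.Normal("burden", mu, sigma, observed=df["energy_cost_burden"].to_numpy())

        approx = pm.fit(n=30_000, method="advi")   # variational inference (ADVI)
        idata = pm.sample(1000, tune=1000)         # MCMC with the default NUTS sampler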

  14. Data from: Effect of Solvents on Proline Modified at the Secondary Sphere: A...

    • acs.figshare.com
    xlsx
    Updated Jun 1, 2023
    Cite
    Danilo M. Lustosa; Shahar Barkai; Ido Domb; Anat Milo (2023). Effect of Solvents on Proline Modified at the Secondary Sphere: A Multivariate Exploration [Dataset]. http://doi.org/10.1021/acs.joc.1c02778.s001
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    ACS Publications
    Authors
    Danilo M. Lustosa; Shahar Barkai; Ido Domb; Anat Milo
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The critical influence of solvent effects on proline-catalyzed aldol reactions has been extensively described. Herein, we apply multivariate regression strategies to probe the influence of different solvents on an aldol reaction catalyzed by proline modified at its secondary sphere with boronic acids. In this system, both in situ binding of the boronic acid to proline and the outcome of the aldol reaction are impacted by the solvent-controlled microenvironment. Thus, with the aim of uncovering mechanistic insight and an ancillary aim of identifying methodological improvements, we designed a set of experiments, spanning 15 boronic acids in five different solvents. Based on hypothesized intermediates or interactions that could be responsible for the selectivity in these reactions, we proposed several structural configurations for the library of boronic acids. Subsequently, we compared the statistical models correlating the outcome of the reaction in different solvents with molecular descriptors produced for each of these proposed configurations. The models allude to the importance of different interactions in controlling selectivity in each of the studied solvents. As a proof-of-concept for the practicality of our approach, the models in chloroform ultimately led to lowering the ketone loading to only two equivalents while retaining excellent yield and enantio- and diastereo-selectivity.

  15. QLKNN11D training set

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 8, 2023
    Cite
    Karel Lucas van de Plassche; Jonathan Citrin (2023). QLKNN11D training set [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8011147
    Explore at:
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    DIFFER
    Authors
    Karel Lucas van de Plassche; Jonathan Citrin
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    QLKNN11D training set

    This dataset contains a large-scale run of ~1 billion flux calculations of the quasilinear gyrokinetic transport model QuaLiKiz. QuaLiKiz is applied in numerous tokamak integrated modelling suites, and is openly available at https://gitlab.com/qualikiz-group/QuaLiKiz/. This dataset was generated with the 'QLKNN11D-hyper' tag of QuaLiKiz, equivalent to 2.8.1 apart from the negative magnetic shear filter being disabled. See https://gitlab.com/qualikiz-group/QuaLiKiz/-/tags/QLKNN11D-hyper for the in-repository tag.

    The dataset is appropriate for the training of learned surrogates of QuaLiKiz, e.g. with neural networks. See https://doi.org/10.1063/1.5134126 for a Physics of Plasmas publication illustrating the development of a learned surrogate (QLKNN10D-hyper) of an older version of QuaLiKiz (2.4.0) with a 300 million point 10D dataset. The paper is also available on arXiv https://arxiv.org/abs/1911.05617 and the older dataset on Zenodo https://doi.org/10.5281/zenodo.3497066. For an application example, see Van Mulders et al 2021 https://doi.org/10.1088/1741-4326/ac0d12, where QLKNN10D-hyper was applied for ITER hybrid scenario optimization. For any learned surrogates developed for QLKNN11D, the effective addition of the alphaMHD input dimension through rescaling the input magnetic shear (s) by s = s - alpha_MHD/2, as carried out in Van Mulders et al., is recommended.

    Related repositories:

    General QuaLiKiz documentation https://qualikiz.com

    QuaLiKiz/QLKNN input/output variables naming scheme https://qualikiz.com/QuaLiKiz/Input-and-output-variables

    Training, plotting, filtering, and auxiliary tools https://gitlab.com/Karel-van-de-Plassche/QLKNN-develop

    QuaLiKiz related tools https://gitlab.com/qualikiz-group/QuaLiKiz-pythontools

    FORTRAN QLKNN implementation with wrapper for Python and MATLAB https://gitlab.com/qualikiz-group/QLKNN-fortran

    Weights and biases of 'hyperrectangle style' QLKNN https://gitlab.com/qualikiz-group/qlknn-hype

    Data exploration

    The data is provided in 43 netCDF files. We advise opening single datasets using xarray, or multiple datasets out-of-core using dask. For reference, we give below the load times and sizes of a single variable that depends only on the scan size dimension dimx. This was tested single-core on an Intel Xeon 8160 CPU at 2.1 GHz with 192 GB of DDR4 RAM. Note that during loading, more memory is needed than the final number.

    Timing of dataset loading

    Amount of datasets    Final in-RAM memory (GiB)    Loading time single var (M:SS)
    1                     10.3                         0:09
    5                     43.9                         1:00
    10                    63.2                         2:01
    16                    98.0                         3:25
    17                    Out Of Memory                x:xx

    Full dataset

    The full dataset of QuaLiKiz in-and-output data is available on request. Note that this is 2.2 TiB of netCDF files!
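    A minimal sketch of opening part of the dataset with xarray, as advised above; the file name, chunked dimension label, and variable name are assumptions (the actual naming scheme is documented at the links above), and the chunks argument requires dask.

    import xarray as xr

    # Open one netCDF file lazily; dask chunks avoid loading everything into RAM at once.
    ds = xr.open_dataset("QLKNN11D_part_000.nc", chunks={"dimx": 1_000_000})  # illustrative name
    print(ds)

    # Pull a single variable that depends only on the scan dimension dimx, as timed above.
    efe = ds["efe_GB"].compute()   # variable name is an assumption based on QuaLiKiz naming
    print(float(efe.mean()))

    # Multiple files can be combined out of core:
    # ds_all = xr.open_mfdataset("QLKNN11D_part_*.nc", combine="by_coords", parallel=True)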

  16. 💊 FDA Drug Dataset- 1000+ Drug Entries🧪

    • kaggle.com
    zip
    Updated Oct 24, 2023
    Cite
    Shiv_D24Coder (2023). 💊 FDA Drug Dataset- 1000+ Drug Entries🧪 [Dataset]. https://www.kaggle.com/datasets/shivd24coder/fda-drug-dataset-1000-drug-entries/versions/1
    Explore at:
    Available download formats: zip (556233 bytes)
    Dataset updated
    Oct 24, 2023
    Authors
    Shiv_D24Coder
    License

    https://www.usa.gov/government-works/

    Description

    Key Features

    Column Name                      Description
    country                          The country where the product is located.
    city                             The city where the product is located.
    address_1                        The first line of the product's address.
    reason_for_recall                The reason for the product recall.
    address_2                        The second line of the product's address.
    product_quantity                 The quantity of the product being recalled.
    code_info                        Product-specific code or information.
    center_classification_date       The date of classification by the center.
    distribution_pattern             The distribution pattern of the product.
    state                            The state where the product is located.
    product_description              A description of the product.
    report_date                      The date when the recall report was filed.
    classification                   The classification of the recall (e.g., Class I, Class II, Class III).
    openfda                          OpenFDA data related to the product.
    recalling_firm                   The firm or company initiating the recall.
    recall_number                    The unique identifier for the recall.
    initial_firm_notification        The method of initial notification to the firm.
    product_type                     The type of product (e.g., Food, Drug).
    event_id                         The event identifier.
    termination_date                 The date when the recall was terminated (if applicable).
    more_code_info                   Additional code or information.
    recall_initiation_date           The date when the recall was initiated.
    postal_code                      The postal code of the product location.
    voluntary_mandated               Whether the recall is voluntary or mandated by authorities.
    status                           The current status of the recall (e.g., Ongoing, Terminated).

    How to use this dataset

    1. Data Access: Retrieve the dataset from the provided source or API to access FDA records of product recalls in the United States.

    2. Data Exploration: Thoroughly explore the dataset by loading it into your preferred data analysis tool. Familiarize yourself with the columns and their meanings.

    3. Filter and Sort: Tailor your analysis by filtering and sorting the data as per your research needs. For example, filter by "product_type" or sort by "report_date" for specific insights (see the sketch after this list).

    4. Recall Analysis: Examine the "reason_for_recall" column to understand the reasons behind product recalls. This is crucial for assessing common issues in recalled products.

    5. Visualization: Create visualizations, such as graphs and charts, to convey your findings effectively. These can help in identifying trends and patterns in the recall data.
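    A small pandas sketch of steps 3 and 4; the file name and the filter values are assumptions, not guaranteed labels in the dataset.

    import pandas as pd

    # Hypothetical file name; the columns follow the table above.
    recalls = pd.read_csv("fda_drug_recalls.csv")

    # Step 3: filter by product type and classification, then sort by report date.
    class1_drugs = recalls[
        (recalls["product_type"] == "Drugs") & (recalls["classification"] == "Class I")
    ].sort_values("report_date", ascending=False)

    # Step 4: the most common recall reasons among those records.
    print(class1_drugs["reason_for_recall"].value_counts().head(10))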

    If you find this dataset useful, give it an upvote – it's a small gesture that goes a long way! Thanks for your support. 😄

  17. Preprocessed Text-Based Emotion Dataset

    • kaggle.com
    zip
    Updated Oct 13, 2025
    Cite
    Gamer_Quant (2025). Preprocessed Text-Based Emotion Dataset [Dataset]. https://www.kaggle.com/datasets/lopure/original-preprocessed-dataset-for-emotion-text
    Explore at:
    Available download formats: zip (32503698 bytes)
    Dataset updated
    Oct 13, 2025
    Authors
    Gamer_Quant
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    TEXT PREPROCESSING PIPELINE:

    DATASET INFORMATION:
    - Total samples: 16,000
    - Input shape (X): (16000, 19)
    - Labels shape (y): (16000,)

    PREPROCESSING STEPS COMPLETED:
    1. Data Loading & Exploration
    2. Text Cleaning & Standardization (URLs, HTML, contractions)
    3. Punctuation Removal
    4. Lowercasing
    5. Tokenization
    6. Stopword Removal
    7. Part-of-Speech (POS) Tagging
    8. Lemmatization (using POS tags for accuracy)
    9. Vocabulary Building
    10. Sequence Padding (uniform length)
    11. Embedding Matrix Creation (300-dimensional, trainable)

    VOCABULARY & EMBEDDINGS:
    - Vocabulary size: 12,308 words
    - Embedding dimensions: 300
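    A few of the listed steps can be sketched with NLTK and Keras utilities as below. This is a simplified stand-in (no contraction expansion or POS-aware lemmatization), not the pipeline in the linked repository.

    import re

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences
    from tensorflow.keras.preprocessing.text import Tokenizer

    nltk.download("stopwords")
    nltk.download("wordnet")
    nltk.download("omw-1.4")

    texts = ["I can't believe how happy I am today!!", "Feeling so alone and tired..."]
    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()

    def clean(text):
        text = re.sub(r"https?://\S+|<[^>]+>", " ", text)   # strip URLs and HTML tags
        text = re.sub(r"[^a-zA-Z\s]", " ", text).lower()     # strip punctuation, lowercase
        tokens = [t for t in text.split() if t not in stop_words]
        return " ".join(lemmatizer.lemmatize(t) for t in tokens)

    cleaned = [clean(t) for t in texts]

    tokenizer = Tokenizer()              # vocabulary building
    tokenizer.fit_on_texts(cleaned)
    X = pad_sequences(tokenizer.texts_to_sequences(cleaned), maxlen=19, padding="post")
    print(X.shape)                       # (n_samples, 19), matching the padded length above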

    For the full pipeline, refer to: https://github.com/Lokesh-102214/Mood-Text

    You can now use these files to train your CNN-LSTM emotion recognition model.

  18. Iris Flower Visualization using Python

    • kaggle.com
    zip
    Updated Oct 24, 2023
    Cite
    Harsh Kashyap (2023). Iris Flower Visualization using Python [Dataset]. https://www.kaggle.com/datasets/imharshkashyap/iris-flower-visualization-using-python
    Explore at:
    Available download formats: zip (1307 bytes)
    Dataset updated
    Oct 24, 2023
    Authors
    Harsh Kashyap
    Description

    The "Iris Flower Visualization using Python" project is a data science project that focuses on exploring and visualizing the famous Iris flower dataset. The Iris dataset is a well-known dataset in the field of machine learning and data science, containing measurements of four features (sepal length, sepal width, petal length, and petal width) for three different species of Iris flowers (Setosa, Versicolor, and Virginica).

    In this project, Python is used as the primary programming language along with popular libraries such as pandas, matplotlib, seaborn, and plotly. The project aims to provide a comprehensive visual analysis of the Iris dataset, allowing users to gain insights into the relationships between the different features and the distinct characteristics of each Iris species.

    The project begins by loading the Iris dataset into a pandas DataFrame, followed by data preprocessing and cleaning if necessary. Various visualization techniques are then applied to showcase the dataset's characteristics and patterns. The project includes the following visualizations:

    1. Scatter Plot: Visualizes the relationship between two features, such as sepal length and sepal width, using points on a 2D plane. Different species are represented by different colors or markers, allowing for easy differentiation.

    2. Pair Plot: Displays pairwise relationships between all features in the dataset. This matrix of scatter plots provides a quick overview of the relationships and distributions of the features.

    3. Andrews Curves: Represents each sample as a curve, with the shape of the curve representing the corresponding Iris species. This visualization technique allows for the identification of distinct patterns and separability between species.

    4. Parallel Coordinates: Plots each feature on a separate vertical axis and connects the values for each data sample using lines. This visualization technique helps in understanding the relative importance and range of each feature for different species.

    5. 3D Scatter Plot: Creates a 3D plot with three features represented on the x, y, and z axes. This visualization allows for a more comprehensive understanding of the relationships between multiple features simultaneously.
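    The five visualizations can be reproduced roughly as follows; seaborn's bundled copy of the Iris data (fetched over the network by load_dataset) is used here in place of the Kaggle CSV.

    import matplotlib.pyplot as plt
    import seaborn as sns
    from pandas.plotting import andrews_curves, parallel_coordinates

    iris = sns.load_dataset("iris")   # the Kaggle CSV could be loaded with pd.read_csv instead

    sns.scatterplot(data=iris, x="sepal_length", y="sepal_width", hue="species")  # 1. scatter plot
    plt.show()

    sns.pairplot(iris, hue="species")                                             # 2. pair plot
    plt.show()

    andrews_curves(iris, "species")                                               # 3. Andrews curves
    plt.show()

    parallel_coordinates(iris, "species")                                         # 4. parallel coordinates
    plt.show()

    # 5. 3D scatter plot
    ax = plt.figure().add_subplot(projection="3d")
    for name, group in iris.groupby("species"):
        ax.scatter(group["sepal_length"], group["sepal_width"], group["petal_length"], label=name)
    ax.legend()
    plt.show()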

    Throughout the project, appropriate labels, titles, and color schemes are used to enhance the visualizations' interpretability. The interactive nature of some visualizations, such as the 3D Scatter Plot, allows users to rotate and zoom in on the plot for a more detailed examination.

    The "Iris Flower Visualization using Python" project serves as an excellent example of how data visualization techniques can be applied to gain insights and understand the characteristics of a dataset. It provides a foundation for further analysis and exploration of the Iris dataset or similar datasets in the field of data science and machine learning.

  19. Part of data.

    • plos.figshare.com
    zip
    Updated Jan 30, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Taochang Li; Ang Li; Limin Hou (2025). Part of data. [Dataset]. http://doi.org/10.1371/journal.pone.0318094.s001
    Explore at:
    zip. Available download formats
    Dataset updated
    Jan 30, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Taochang Li; Ang Li; Limin Hou
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    To address the susceptibility of conventional vector control systems for permanent magnet synchronous motors (PMSMs) to motor parameter variations and load disturbances, a novel control method combining an improved Grasshopper Optimization Algorithm (GOA) with a variable universe fuzzy Proportional-Integral (PI) controller is proposed, building upon standard fuzzy PI control. First, the diversity of the population and the global exploration capability of the algorithm are enhanced through the integration of the Cauchy mutation strategy and uniform distribution strategy. Subsequently, the fusion of Cauchy mutation and opposition-based learning, along with modifications to the optimal position, further improves the algorithm’s ability to escape local optima. The improved GOA is then employed to optimize the contraction-expansion factor of the variable universe fuzzy PI controller, achieving enhanced control performance for PMSMs. Additionally, to address the high torque and current ripple issues commonly associated with traditional PI controllers in the current loop, Model Predictive Control (MPC) is adopted to further improve control performance. Finally, experimental results validate the effectiveness of the proposed control scheme, demonstrating precise motor speed control, rapid and stable current tracking, as well as improved system robustness.
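    The dataset itself accompanies the paper, but as a loose, generic illustration of two of the ingredients named above (Cauchy mutation and opposition-based learning) applied to a candidate solution in a population-based optimizer, a NumPy sketch might look like the following; it is not the authors' implementation, and the bounds, scale, and objective are placeholders.

    # Generic sketch (not the paper's code): Cauchy mutation and opposition-based
    # learning applied to a candidate solution x within bounds [lb, ub].
    import numpy as np

    rng = np.random.default_rng(0)

    def cauchy_mutation(x, scale=0.1):
        # Heavy-tailed perturbation: occasional large jumps help escape local optima.
        return x + scale * rng.standard_cauchy(size=x.shape)

    def opposition(x, lb, ub):
        # Opposition-based learning: mirror the candidate within the search bounds.
        return lb + ub - x

    def fitness(x):
        return float(np.sum(x ** 2))  # placeholder objective (minimization)

    lb, ub = np.full(4, -5.0), np.full(4, 5.0)
    x = rng.uniform(lb, ub)

    # Keep whichever of the original, mutated, or opposite candidate scores best.
    candidates = [x, np.clip(cauchy_mutation(x), lb, ub), opposition(x, lb, ub)]
    best = min(candidates, key=fitness)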

  20. 5_Year_French_Wealth_Analysis_CLEANED_for_ML_NN

    • kaggle.com
    zip
    Updated Oct 6, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Santiago PATINO SERNA (2023). 5_Year_French_Wealth_Analysis_CLEANED_for_ML_NN [Dataset]. https://www.kaggle.com/datasets/santiagopatioserna/5-year-french-wealth-analysis-cleaned-for-ml-nn
    Explore at:
    zip (22880 bytes). Available download formats
    Dataset updated
    Oct 6, 2023
    Authors
    Santiago PATINO SERNA
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    French
    Description

    🚀 Welcome to Godi.AI's Predictive Analysis of IFI Tax for French Cities

    Hello, business innovator! 🌎 I'm Santiago PATINO SERNA, CEO and data scientist at Godi.AI. This notebook presents the data reading and processing steps (for a detailed explanation, please refer to the companion notebook on GitHub). We delve into feature engineering, followed by the development of six distinct models, each with automatically optimized hyperparameters. We then evaluate every model's performance using cross-validation. With the results in hand, generative AI steps in to assess the models: acting as an expert data scientist, it determines the optimal model for deployment and explains its selection in depth.

    📊 Notebook Overview

    Notebook on Github

    Predictive Analysis of IFI Tax for French Cities

    • Data Loading: Initializing necessary libraries and importing the IFI tax dataset.
    • Feature Engineering: Transforming raw data into informative features to enhance model performance.
    • Model Development: Creating six unique models, each with automated hyperparameter optimization.
    • Performance Evaluation: Assessing each model's predictive capability using cross-validation techniques (a rough illustrative sketch of these steps follows below).
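    As a rough sketch of the model development and performance evaluation steps just listed (not the Godi.AI notebook itself), a cross-validated comparison with automated hyperparameter search might look like this, where the data, candidate models, and grids are placeholders:

    # Illustrative sketch: compare several regressors with automated hyperparameter
    # search and cross-validated scoring (placeholder data stands in for the IFI features).
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import GridSearchCV, KFold

    rng = np.random.default_rng(0)
    X, y = rng.random((200, 5)), rng.random(200)   # placeholder features and target

    candidates = {
        "ridge": (Ridge(), {"alpha": [0.1, 1.0, 10.0]}),
        "random_forest": (RandomForestRegressor(random_state=0), {"n_estimators": [100, 300]}),
    }

    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    for name, (model, grid) in candidates.items():
        # Hyperparameter optimization with cross-validated scoring for each candidate.
        search = GridSearchCV(model, grid, cv=cv, scoring="neg_mean_squared_error")
        search.fit(X, y)
        print(name, -search.best_score_, search.best_params_)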

    Generative AI Model Analysis

    Delve into:
    • Comparative analysis of all models' performance.
    • Generative AI's role as an expert data scientist in selecting the optimal model.
    • A detailed explanation of the chosen model and its implications for deployment.

    📸 Key Visual Insights

    Figure 1: Pair plot of the raw and feature-engineered variables (https://drive.google.com/uc?export=view&id=1AYVkEcBwaNRV7SYe3b4NVvB4GZAxTamA). Variables have been cleaned of outliers and normalized.

    Figure 2: KDE of the target variable Y (https://drive.google.com/uc?export=view&id=1AYZxrO84QWsuCw3qU2kv9Z0kxclMNCQS). Y is the number of taxpayers multiplied by the average tax for each city, normalized and with outliers removed.

    Figure 3: Results of the different tested models (https://drive.google.com/uc?export=view&id=1AiKaGMoQ7V1PF4QgQZXDHBL_vGGMxe0y). Numerical comparison of various metrics across all models, used to select the optimal one.

    Figure 4: AI analysis to select the best model (https://drive.google.com/uc?export=view&id=1Ahiym3e69XDaTOqpktqRGYhHawW5v_ks). The generative AI's assessment of the previous chart.

    Figure 5: Results of the XGBoost model (https://drive.google.com/uc?export=view&id=1AbNa3WRHvWRkZo_czA6L5Rl8ZQkPLjKh). Results of the model selected by the generative AI as the best.

    Figure 6: Predictions of all models (https://drive.google.com/uc?export=view&id=1AlAF2HVU3I0YDkyI6WdrGF3mdJP6M2vq). Graphical results of all models, compared with the best one selected by the AI.

    🤖 About Godi.AI

    Godi.AI is the startup reshaping how businesses experience AI. We focus on speed, ROI, and guiding businesses on their digitalization journey. Dive into our standout apps, or explore tailored solutions with our Godi.AI Freelancer method.

    Special Note: For businesses seeking an even more customized experience, reach out to me directly. As a proud Polytechnique Paris graduate, I am here to turn your data into insightful, actionable decisions.

    📩 Get Started: Ready to embark on this transformative journey with Godi.AI? Reach out to us!

    📔 Explore the Notebook: For a detailed dive, explore the Notebook on GitHub.
