Newsle led the global machine learning industry in 2021 with a market share of ***** percent, followed by TensorFlow and Torch. The source indicates that machine learning software is used to apply artificial intelligence (AI), giving systems the ability to automatically or "artificially" learn and improve from experience without being explicitly programmed to do so.
In 2021, improving customer experience was the top artificial intelligence and machine learning use case, cited by ** percent of respondents. The deployment of machine learning and artificial intelligence can advance a variety of business processes.
According to a survey conducted among healthcare providers in the United States in April 2021, ** percent of respondents reported that the artificial intelligence (AI)/machine learning efforts in their hospital or health system were in the pilot stage with rollout yet to be decided, while a further ** percent said they were early-stage initiatives.
On May 21st, 2021, we held the webinar "Covid-19 and AI: unexpected challenges and lessons". This short note presents its highlights.
According to a recent survey, 56 percent of respondents reported experiencing issues with security and auditability requirements when deploying machine learning and artificial intelligence in 2021. Auditability is the degree to which a transaction can be traced from the originator to the approver and final disposition.
In 2021, the AI and machine learning medical device market was valued at around *** billion U.S. dollars globally. By 2032, the market was forecast to increase to a value of **** billion U.S. dollars.
The dataset "California STD Statistics (2001-2021).csv" contains information about reported cases of sexually transmitted diseases (STDs) (chlamydia, gonorrhea, and early syphilis, which includes primary, secondary, and early latent syphilis) across different counties in the United States from the year 2001 to 2021. The data includes details on the number of cases, population estimates, and calculated rates of infection. It is segmented by disease type, county, year, and sex, providing a comprehensive overview of STD prevalence and trends over a 20-year period.
Column Descriptions
Disease: The type of sexually transmitted disease (e.g., Chlamydia, Gonorrhea).
County: The name of the county where the data was collected.
Year: The year when the data was recorded.
Sex: The sex of the population (Female, Male, Total).
Cases: The number of reported cases of the disease.
Population: The estimated population of the county for the given year and sex.
Rate: The rate of infection per 100,000 people (see the sketch after this list).
Lower 95% CI: The lower bound of the 95% confidence interval for the rate.
Upper 95% CI: The upper bound of the 95% confidence interval for the rate.
Annotation Code: Additional annotation codes that are sparsely populated.
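To make the Rate column concrete, it is Cases divided by Population, scaled to 100,000. A minimal MATLAB sketch (purely illustrative and not part of the dataset; the file name is as above, and readtable's parsing of the header row is assumed) recomputes and checks it:

T = readtable('California STD Statistics (2001-2021).csv');
rate = T.Cases ./ T.Population * 1e5;      % cases per 100,000 people (Rate formula)
max(abs(rate - T.Rate), [], 'omitnan')     % largest deviation from the published Rate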
Acknowledgement: All rights reserved by CalHHS.
Usage: CalHHS Open Data Portal Terms of Use.
License: CalHHS reserves all rights to this data under its terms of use.
The terms of use and the original dataset are available at the links below:
https://data.chhs.ca.gov/pages/terms
https://data.chhs.ca.gov/dataset/stds-in-california-by-disease-county-year-and-sex
LAST MODIFIED: June 4, 2024.
https://www.fnfresearch.com/privacy-policy
[197+ Pages Report] The global AI in HIV/AIDS market size & share is expected to reach revenue of USD 400.7 million by 2026, growing at a CAGR of 8.9% during the projected period. AI has been transforming the landscape of technology breakthroughs, with its impact felt across several sectors.
Overview
This is the data archive for the paper "Copula-based synthetic data augmentation for machine-learning emulators". It contains the paper's model outputs (see the results folder) and the Singularity image for (optionally) re-running the experiments.
For the Python tool used to generate synthetic data, please refer to Synthia.
Requirements
Although PBS is not a strict requirement, it is needed to run the helper scripts included in this repository. Please note that, depending on your specific system settings and resource availability, you may need to modify the PBS parameters at the top of the submit scripts stored in the hpc directory (e.g. #PBS -lwalltime=72:00:00).
Usage
To reproduce the results from the experiments described in the paper, first fit all copula models to the reduced NWP-SAF dataset with:
qsub hpc/fit.sh
Then, to generate synthetic data, run all machine learning model configurations, and compute the relevant statistics, use:
qsub hpc/stats.sh
qsub hpc/ml_control.sh
qsub hpc/ml_synth.sh
Finally, to plot all artifacts included in the paper, use:
qsub hpc/plot.sh
Licence
Code released under MIT license. Data from the reduced NWP-SAF dataset released under CC BY 4.0.
According to the survey, ** percent of machine learning, data science, and artificial intelligence developers work with unstructured text data, making it the most popular type of data among developers. Tabular data is the second most popular type, used by ** percent.
https://creativecommons.org/publicdomain/zero/1.0/
Data scraped via https://www.convertcsv.com/html-table-to-csv.htm and converted to .csv by me. Original reddit post: https://www.reddit.com/r/MachineLearning/comments/qzjuvk/discussion_neurips_2021_finally_accepted/
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This database studies the performance inconsistency of biomass HHV models based on ultimate analysis. The research null hypothesis is consistency in the rank of a biomass HHV model. Fifteen biomass models are trained and tested on four datasets. In each dataset, the rank invariability of these 15 models indicates performance consistency.
The database includes the datasets and source code used to analyze the performance consistency of the biomass HHV models. The datasets are stored in tabular form in an Excel workbook. The source code implements the biomass HHV machine learning models through MATLAB object-oriented programming (OOP). These machine learning models consist of eight regression models, four supervised learning models, and three neural networks.
An Excel workbook, "BiomassDataSetUltimate.xlsx," collects the research datasets in six worksheets. The first worksheet, "Ultimate," contains 908 HHV data points from 20 pieces of literature. The worksheet column names indicate the elements of the ultimate analysis on a % dry basis. The HHV column refers to the higher heating value in MJ/kg. The next worksheet, "Full Residuals," stores the residuals from model testing based on 20-fold cross-validation. The article (Kijkarncharoensin & Innet, 2021) verifies performance consistency through these residuals. The remaining worksheets present the literature datasets used to train and test model performance.
A file named "SourceCodeUltimate.rar" collects the MATLAB machine learning models implemented in the article. The folders in this file reflect the class structure of the machine learning models. These classes extend the original MATLAB Statistics and Machine Learning Toolbox to support, e.g., k-fold cross-validation. The MATLAB script "runStudyUltimate.m" is the article's main program for analyzing the performance consistency of the biomass HHV models through ultimate analysis. The script loads the datasets from the Excel workbook and automatically fits the biomass models through the OOP classes.
The first section of the MATLAB script generates the most accurate model by optimizing each model's hyperparameters. The first run takes a few hours to train the machine learning models through this trial-and-error process. The trained models can be saved in a MATLAB .mat file and loaded back into the MATLAB workspace. The remaining script, separated by section breaks, performs the residual analysis to inspect performance consistency. In addition, a 3D scatter plot of the biomass data and box plots of the prediction residuals are produced. Finally, interpretations of these results are examined in the author's article.
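As a rough illustration of the cross-validation workflow described above (a sketch only, not the article's OOP classes; the elemental column names C, H, N, S, O are assumed, not confirmed by the workbook):

T = readtable('BiomassDataSetUltimate.xlsx', 'Sheet', 'Ultimate');
X = T{:, {'C','H','N','S','O'}};   % ultimate analysis on a % dry basis (assumed names)
y = T.HHV;                         % higher heating value [MJ/kg]
cvp = cvpartition(height(T), 'KFold', 20);
res = nan(height(T), 1);
for i = 1:cvp.NumTestSets
    mdl = fitlm(X(training(cvp, i), :), y(training(cvp, i)));   % one regression model
    res(test(cvp, i)) = y(test(cvp, i)) - predict(mdl, X(test(cvp, i), :));
end
boxplot(res)                       % residual spread, cf. the article's box plots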
Reference: Kijkarncharoensin, A., & Innet, S. (2022). Performance inconsistency of the Biomass Higher Heating Value (HHV) Models derived from Ultimate Analysis [Manuscript in preparation]. University of the Thai Chamber of Commerce.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Our most comprehensive database of AI models, containing over 800 models that are state of the art, highly cited, or otherwise historically notable. It tracks key factors driving machine learning progress and includes over 300 training compute estimates.
https://www.fnfresearch.com/privacy-policy
[219+ Pages Report] The global artificial intelligence market size & share is projected to reach a value of USD 299.64 billion by 2026, growing at a CAGR of 35.6% during 2021-2026.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This database studies the performance inconsistency of biomass HHV models based on proximate analysis. The research null hypothesis is consistency in the rank of a biomass HHV model. Fifteen biomass models are trained and tested on four datasets. In each dataset, the rank invariability of these 15 models indicates performance consistency.
The database includes the datasets and source code used to analyze the performance consistency of the biomass HHV models. The datasets are stored in tabular form in an Excel workbook. The source code implements the biomass HHV machine learning models through MATLAB object-oriented programming (OOP). These models consist of eight regression models, four supervised learning models, and three neural networks.
An Excel workbook, "BiomassDataSetProximate.xlsx," collects the research datasets in six worksheets. The first worksheet, "Proximate," contains 803 HHV data points from 17 pieces of literature. The worksheet column names indicate the elements of the proximate analysis on a % dry basis. The HHV column refers to the higher heating value in MJ/kg. The next worksheet, "Full Residuals," stores the residuals from model testing based on 20-fold cross-validation. The article verifies performance consistency through these residuals. The remaining worksheets present the literature datasets used to train and test model performance.
A file named "SourceCodeProximate.rar" collects the MATLAB machine learning models implemented in the article. The folders in this file reflect the class structure of the machine learning models. These classes extend the original MATLAB Statistics and Machine Learning Toolbox to support, e.g., k-fold cross-validation. The MATLAB script "runStudyProximate.m" is the article's main program (Kijkarncharoensin & Innet, 2021) for analyzing the performance consistency of the biomass HHV models through proximate analysis. The script loads the datasets from the Excel workbook and automatically fits the biomass models through the OOP classes.
The first section of the MATLAB script generates the most accurate model by optimizing each model's hyperparameters. The first run takes a few hours to train the machine learning models through this trial-and-error process. The trained models can be saved in a MATLAB .mat file and loaded back into the MATLAB workspace. The remaining script, separated by section breaks, performs the residual analysis to inspect performance consistency. In addition, a 3D scatter plot of the biomass data and box plots of the prediction residuals are produced. Finally, interpretations of these results are examined in the author's article.
Reference: Kijkarncharoensin, A., & Innet, S. (2021). Performance inconsistency of the Biomass Higher Heating Value (HHV) Models derived from Proximate Analysis [Manuscript in preparation]. University of the Thai Chamber of Commerce.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The publication of tourism statistics often does not keep up with highly dynamic tourism demand trends, which is especially critical during crises. Alternative data sources such as digital traces and web searches represent an important means to potentially fill this gap, since they are generally timely and available at a detailed spatial scale. In this study we explore the potential of human mobility data from the Google Community Mobility Reports to nowcast the number of monthly nights spent at sub-national scale across 11 European countries in 2020, 2021, and the first half of 2022. Using a machine learning implementation, we find that this novel data source can predict tourism demand with high accuracy, and we compare its potential in the tourism domain to web search and mobile phone data. This result paves the way for more frequent and timely production of tourism statistics by researchers and statistical entities, and for their use in tourism monitoring and management, although privacy and surveillance concerns still hinder an actual data innovation transition.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The first column shows the available countries with ISO 3166-1 alpha-2 country codes (https://www.iso.org/iso-3166-country-codes.html, last accessed 16 May 2022).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset consists of simulated and observed salinity/temperature data which were used in the manuscript "A method for assessment of the general circulation model quality using k-means clustering algorithm" submitted to Geoscientific Model Development.
The model simulation dataset is from long-term 3D circulation model simulation (Maljutenko and Raudsepp 2014, 2019). The observations are from the "Baltic Sea - Eutrophication and Acidity aggregated datasets 1902/2017 v2018" SMHI (2018).
The files are in simple comma-separated table format without headers.
The Dout-t_z_lat_lon_Smod_Sobs_Tmod_Tobs.csv file contains columns with the following variables [units]:
Time [MATLAB datenum units], vertical coordinate [m], latitude [°N], longitude [°E], model salinity [g/kg], observed salinity [g/kg], model temperature [°C], observed temperature [°C].
The Dout-t_z_lat_lon_dS_dT_K1_K2_K3_K4_K5_K6_K7_K8_K9.csv file contains columns with the following variables [units]:
The first 4 columns are the same as in the previous file, followed by salinity error [g/kg] and temperature error [°C]; the remaining columns (K1–K9) are integers indicating the cluster to which each error pair is assigned.
do_clust_valid_DataFig.m is a MATLAB script which reads the two csv files (and, optionally, the mask file Model_mask.mat), performs the clustering analysis, and creates the plots used in the manuscript. The script is organized into %% blocks which can be executed separately (Ctrl+Enter by default).
The k-means function is used from the MATLAB Statistics and Machine Learning Toolbox.
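As a rough sketch of this clustering step (the column layout is taken from the file description above; k = 5 and the Replicates setting are illustrative choices, not the manuscript's):

D = readmatrix('Dout-t_z_lat_lon_dS_dT_K1_K2_K3_K4_K5_K6_K7_K8_K9.csv');
dS = D(:, 5);                                      % salinity error [g/kg]
dT = D(:, 6);                                      % temperature error [°C]
[idx, C] = kmeans([dS dT], 5, 'Replicates', 10);   % Statistics and ML Toolbox
gscatter(dS, dT, idx)                              % error pairs colored by cluster
xlabel('\DeltaS [g/kg]'); ylabel('\DeltaT [°C]')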
Additional software used in do_clust_valid_DataFig.m:
Author's auxiliary formatting scripts in script/:
datetick_cst.m
do_fitfig.m
do_skipticks.m
do_skipticks_y.m
Colormaps are generated using cbrewer.m (Charles, 2021).
Moving average smoothing is performed using nanmoving_average.m (Aguilera, 2021).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data on groundwater nitrate and related variables in the North China Plain (NCP). The data include nitrate concentrations of groundwater collected from more than 4,000 sites (wells) in the NCP from 2005 to 2021. The groundwater samples were collected in 2005–2021, in May (before the rainy season) and October (after the rainy season) of each year for every site. During sampling, basic information about well location, groundwater depth, farmland planting pattern, and soil type was collected. Sampling wells were divided into three types according to depth: shallow (0–30 m), medium (30–100 m), and deep (> 100 m). The planting patterns mainly involved intensive croplands, grain crops, vegetable crops, and orchards. Soil types for each sampling site were obtained from the China soil database (http://vdb3.soil.csdb.cn/). The socio-economic and agricultural information for the study areas (taking the districts of municipalities and prefecture-level cities of provinces as basic units) was acquired from the China Statistical Yearbook (http://www.stats.gov.cn/sj/ndsj/). The data include agricultural planting area, grain crop area, vegetable planting area, orchard planting area, and total facility agricultural area; fertilizer amount, nitrogen fertilizer amount, and nitrogen fertilizer amount per unit area; total output value of agriculture, forestry, animal husbandry, and fishery, along with agricultural, forestry, animal husbandry, and fishery output values; Gross Domestic Product (GDP) and per capita GDP; and total population and rural population.
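For illustration, the depth classification described above maps directly onto MATLAB's discretize (the variable name and values here are hypothetical, not from the dataset):

depth = [12 45 150];   % groundwater depth [m]; hypothetical example values
class = discretize(depth, [0 30 100 Inf], 'categorical', {'shallow','medium','deep'})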
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides evidence supporting the hypothesis that institutional shorting, ETF outflows, whale wallet movements, and media sentiment drive Bitcoin’s volatility and price manipulation. Central to this dataset is the Decker Sentiment-Short Interest Model (DSSIM)—an original equation developed by Nicolin Decker to quantify the relationship between market sentiment and institutional short interest. By combining sentiment scores from Natural Language Processing (NLP) and short positioning data, DSSIM offers a flexible framework for analyzing volatility in Bitcoin and other assets.
The dataset spans January 2021 to December 2024, capturing daily market activity and key price events. Each file aligns with DSSIM’s variables, enabling replication and further analysis of the findings in the doctoral-level thesis The Economic Bomb: A Strategic Financial Warfare Tactic.
Key Components: BTC_Price_Data.csv: Daily BTC/USD closing prices from Binance, Coinbase, and Bitstamp, serving as the baseline for volatility and return calculations.
ETF_Holdings_Over_Time_Thesis.csv: Daily BTC holdings of ETFs (Grayscale, BlackRock, and Fidelity), illustrating cumulative outflows and their liquidity impact.
ETF_Outflows_Price_Impact_Data.csv: Correlates ETF outflows with BTC volatility, highlighting timing and magnitude.
Institutional_Shorting_Data.csv: Daily BTC short positions from Binance, BitMEX, Bybit, and OKX, serving as input for DSSIM’s short interest variable.
Whale_Wallet_Movements.csv: Tracks large BTC wallet movements, revealing sell-offs preceding price crashes and influencing DSSIM’s residual noise component.
Market_Liquidity_Data.csv: Daily BTC trading volume, order book depth, and liquidity ratios, validating DSSIM’s predictive capabilities.
Media_Sentiment_Scores.csv: Daily sentiment from Twitter, Reddit, Google News, and YouTube, forming DSSIM’s sentiment variable.
Monte_Carlo_Simulation_Results.csv: Simulates 1,000 BTC price paths to assess potential volatility under market stress.
VAR_Model_Data.csv: Analyzes ETF outflows’ delayed impact on BTC returns using vector autoregression.
Volatility_Clustering_Data.csv: Tracks daily BTC returns and 30-day rolling volatility, confirming persistent volatility after institutional actions.
GARCH_Model_Data.csv: Models BTC volatility using GARCH, validating volatility clustering during market shocks.
The dataset includes adjustments for major market events, such as the May 2021 Flash Crash, June 2022 Liquidation Crisis, and March 2023 Banking Crisis, ensuring realistic volatility patterns aligned with DSSIM’s modeling of sentiment shifts and institutional shorting.
Researchers can use DSSIM’s structure and data to explore similar dynamics in other cryptocurrencies, equities, commodities, and forex markets, advancing financial analysis and predictive modeling.
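As a small, purely illustrative MATLAB sketch of the rolling-volatility input described above (the 'Date' and 'Close' column names in BTC_Price_Data.csv are assumed, not confirmed by the file list):

T = readtable('BTC_Price_Data.csv');
r = diff(log(T.Close));                            % daily log returns
vol30 = movstd(r, [29 0], 'omitnan') * sqrt(365);  % trailing 30-day window, annualized
plot(T.Date(2:end), vol30)
ylabel('30-day rolling volatility (annualized)')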
Access the full dataset: https://drive.google.com/drive/folders/1pnwqBTMF_QSJoC5QcNAPSQpVtOST2n8c?usp=drive_link