License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Due to the increasing use of technology-enhanced educational assessment, data mining methods have been explored for analysing the process data in log files from such assessments. However, most studies have been limited to one data mining technique under one specific scenario. The current study demonstrates the use of four frequently used supervised techniques (Classification and Regression Trees (CART), gradient boosting, random forest, and support vector machines (SVM)) and two unsupervised methods (self-organizing maps (SOM) and k-means), all fitted to a single assessment dataset. The USA sample (N = 426) from the 2012 Program for International Student Assessment (PISA) responses to problem-solving items is used to demonstrate the methods. After feature generation and feature selection, classifier development procedures are implemented with each of the illustrated techniques. Results show satisfactory classification accuracy for all the techniques. Suggestions for selecting a classifier are presented based on the research questions and on the interpretability and simplicity of the classifiers. Interpretations of the results from both the supervised and unsupervised learning methods are provided.
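As an illustration, the supervised pipeline described above can be sketched with scikit-learn (which has no SOM implementation, so only k-means is shown for the unsupervised side); the features and labels below are synthetic stand-ins, not the actual PISA process-data features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier          # CART
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.cluster import KMeans

# Synthetic stand-in for features extracted from assessment log files.
X, y = make_classification(n_samples=426, n_features=10,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

supervised = {
    "CART": DecisionTreeClassifier(random_state=0),
    "Gradient boosting": GradientBoostingClassifier(random_state=0),
    "Random forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(kernel="rbf", random_state=0),
}
for name, clf in supervised.items():
    acc = clf.fit(X_tr, y_tr).score(X_te, y_te)  # classification accuracy
    print(f"{name}: {acc:.2f}")

# Unsupervised side: k-means cluster assignments (SOM omitted).
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

Comparing the held-out accuracies across the four fitted classifiers mirrors the study's side-by-side evaluation, while the cluster labels can be cross-tabulated against known groups.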
This chapter presents theoretical and practical aspects associated with the implementation of a combined model-based/data-driven approach for failure prognostics based on particle filtering algorithms, in which the current estimate of the state PDF is used to determine the operating condition of the system and predict the progression of a fault indicator, given a dynamic state model and a set of process measurements. In this approach, the task of estimating the current value of the fault indicator, as well as other important changing parameters in the environment, involves two basic steps: a prediction step, based on the process model, and an update step, which incorporates the new measurement into the a priori state estimate. This framework allows the probability of failure at future time instants (the RUL PDF) to be estimated in real time, providing information about time-to-failure (TTF) expectations, statistical confidence intervals, and long-term predictions, using for this purpose empirical knowledge about critical conditions for the system (also referred to as hazard zones). This information is of paramount significance for improving system reliability and the cost-effective operation of critical assets, as shown in a case study where feedback correction strategies (based on uncertainty measures) were implemented to lengthen the RUL of a rotorcraft transmission system with propagating fatigue cracks on a critical component. Although the feedback loop is implemented using simple linear relationships, it provides quick insight into the manner in which the system reacts to changes in its input signals, in terms of its predicted RUL. The method is able to handle non-Gaussian PDFs, since it includes concepts such as nonlinear state estimation and confidence intervals in its formulation.
Real data from a fault-seeded test showed that the proposed framework was able to anticipate modifications to the system input that lengthen its RUL. Results of this test indicate that the method successfully suggested the correction that the system required. In this sense, future work will focus on the development and testing of similar strategies using different input-output uncertainty metrics.
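The predict/update cycle described in the chapter can be sketched with a minimal bootstrap particle filter; the random-walk state model, noise levels, and constant fault indicator below are illustrative assumptions, not the chapter's rotorcraft model:

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_step(particles, weights, measurement,
                         process_std=0.05, meas_std=0.2):
    """One predict/update cycle of a bootstrap particle filter."""
    # Prediction step: propagate each particle through the state model
    # (here an illustrative random walk on the fault indicator).
    particles = particles + rng.normal(0.0, process_std, size=particles.shape)
    # Update step: reweight by the Gaussian likelihood of the measurement.
    weights = weights * np.exp(-0.5 * ((measurement - particles) / meas_std) ** 2)
    weights /= weights.sum()
    # Resample to avoid weight degeneracy; weights return to uniform.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))

# Track a fault indicator (true value 1.0) from noisy measurements.
true_fault = 1.0
particles = rng.normal(0.0, 1.0, 1000)   # initial state PDF
weights = np.full(1000, 1.0 / 1000)
for _ in range(50):
    z = true_fault + rng.normal(0.0, 0.2)
    particles, weights = particle_filter_step(particles, weights, z)

estimate = np.mean(particles)  # posterior mean of the fault indicator
```

In a prognostic setting, the particle cloud would then be propagated forward without measurements until it crosses the hazard zone, yielding the RUL PDF.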
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
This dataset includes all experimental data used for the PhD thesis of Cong Liu, entitled "Software Data Analytics: Architectural Model Discovery and Design Pattern Detection". These data were generated by instrumenting both synthetic and real-life software systems, and are formatted according to the IEEE XES standard. See http://www.xes-standard.org/ and https://www.win.tue.nl/ieeetfpm/lib/exe/fetch.php?media=shared:downloads:2017-06-22-xes-software-event-v5-2.pdf for more details.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
As an important technique for data pre-processing, outlier detection plays a crucial role in various real applications and has gained substantial attention, especially in medical fields. Despite the importance of outlier detection, many existing methods are vulnerable to the distribution of outliers and require prior knowledge, such as the outlier proportion. To address this problem to some extent, this article proposes an adaptive mini-minimum spanning tree-based outlier detection (MMOD) method, which utilizes a novel distance measure by scaling the Euclidean distance. For datasets containing different densities and taking on different shapes, our method can identify outliers without prior knowledge of outlier percentages. The results on both real-world medical data corpora and intuitive synthetic datasets demonstrate the effectiveness of the proposed method compared to state-of-the-art methods.
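The general idea of MST-based outlier detection can be sketched as follows; note this is a plain-Euclidean simplification for illustration, not MMOD's scaled distance measure or mini-MST construction:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def mst_outliers(X, cut_factor=2.0, min_cluster=3):
    """Flag points left in small components after cutting long MST edges.

    Simplified sketch: MMOD itself scales the Euclidean distance and
    builds mini-MSTs, which removes the need for a fixed outlier ratio.
    """
    dist = squareform(pdist(X))
    mst = minimum_spanning_tree(dist).toarray()
    edges = mst[mst > 0]
    # Cut edges much longer than the mean MST edge length.
    mst[mst > cut_factor * edges.mean()] = 0
    n_comp, labels = connected_components((mst + mst.T) > 0, directed=False)
    sizes = np.bincount(labels, minlength=n_comp)
    return sizes[labels] < min_cluster   # True marks an outlier

# Dense cluster plus one far-away point.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (30, 2)), [[5.0, 5.0]]])
flags = mst_outliers(X)
```

No outlier proportion is supplied: the isolated point ends up in its own tiny component and is flagged, while the dense cluster survives intact.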
Data Science Platform Market Size 2025-2029
The data science platform market is projected to grow by USD 763.9 million, at a CAGR of 40.2%, from 2024 to 2029. Integration of AI and ML technologies with data science platforms will drive the market.
Major Market Trends & Insights
North America dominated the market and is expected to account for 48% of growth during the forecast period.
By Deployment - On-premises segment was valued at USD 38.70 million in 2023
By Component - Platform segment accounted for the largest market revenue share in 2023
Market Size & Forecast
Market Opportunities: USD 1.00 million
Market Future Opportunities: USD 763.90 million
CAGR : 40.2%
North America: Largest market in 2023
Market Summary
The market represents a dynamic and continually evolving landscape, underpinned by advancements in core technologies and applications. Key technologies, such as machine learning and artificial intelligence, are increasingly integrated into data science platforms to enhance predictive analytics and automate data processing. Additionally, the emergence of containerization and microservices in data science platforms enables greater flexibility and scalability. However, the market also faces challenges, including data privacy and security risks, which necessitate robust compliance with regulations.
According to recent estimates, the market is expected to account for over 30% of the overall big data analytics market by 2025, underscoring its growing importance in the data-driven business landscape.
What will be the Size of the Data Science Platform Market during the forecast period?
How is the Data Science Platform Market Segmented and what are the key trends of market segmentation?
The data science platform industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.
Deployment
On-premises
Cloud
Component
Platform
Services
End-user
BFSI
Retail and e-commerce
Manufacturing
Media and entertainment
Others
Sector
Large enterprises
SMEs
Application
Data Preparation
Data Visualization
Machine Learning
Predictive Analytics
Data Governance
Others
Geography
North America
US
Canada
Europe
France
Germany
UK
Middle East and Africa
UAE
APAC
China
India
Japan
South America
Brazil
Rest of World (ROW)
By Deployment Insights
The on-premises segment is estimated to witness significant growth during the forecast period.
In this dynamic and evolving market, big data processing is a key focus, enabling advanced model accuracy metrics through various data mining methods. Distributed computing and algorithm optimization are integral components, ensuring efficient handling of large datasets. Data governance policies are crucial for managing data security protocols and ensuring data lineage tracking. Software development kits, model versioning, and anomaly detection systems facilitate seamless development, deployment, and monitoring of predictive modeling techniques, including machine learning algorithms, regression analysis, and statistical modeling. Real-time data streaming and parallelized algorithms enable real-time insights, while predictive modeling techniques and machine learning algorithms drive business intelligence and decision-making.
Cloud computing infrastructure, data visualization tools, high-performance computing, and database management systems support scalable data solutions and efficient data warehousing. ETL processes and data integration pipelines ensure data quality assessment and feature engineering techniques. Clustering techniques and natural language processing are essential for advanced data analysis. The market is witnessing significant growth, with adoption increasing by 18.7% in the past year, and industry experts anticipate a further expansion of 21.6% in the upcoming period. Companies across various sectors are recognizing the potential of data science platforms, leading to a surge in demand for scalable, secure, and efficient solutions.
API integration services and deep learning frameworks are gaining traction, offering advanced capabilities and seamless integration with existing systems. Data security protocols and model explainability methods are becoming increasingly important, ensuring transparency and trust in data-driven decision-making. The market is expected to continue unfolding, with ongoing advancements in technology and evolving business needs shaping its future trajectory.
The On-premises segment was valued at USD 38.70 million in 2019.
License: Database Contents License (DbCL) v1.0, http://opendatacommons.org/licenses/dbcl/1.0/
Citation Request: This dataset is public available for research. The details are described in [Cortez et al., 2009]. Please include this citation if you plan to use this database:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib
Title: Wine Quality
Sources Created by: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) @ 2009
Past Usage:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
In the above reference, two datasets were created, using red and white wine samples. The inputs include objective tests (e.g. pH values) and the output is based on sensory data (the median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). Several data mining methods were applied to model these datasets under a regression approach. The support vector machine model achieved the best results. Several metrics were computed: MAD, a confusion matrix for a fixed error tolerance (T), etc. The relative importances of the input variables were also plotted (as measured by a sensitivity analysis procedure).
Relevant Information:
The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure whether all input variables are relevant, so it could be interesting to test feature selection methods.
Number of Instances: red wine - 1599; white wine - 4898.
Number of Attributes: 11 + output attribute
Note: several of the attributes may be correlated, thus it makes sense to apply some sort of feature selection.
Attribute information:
For more information, read [Cortez et al., 2009].
Input variables (based on physicochemical tests):
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
Missing Attribute Values: None
Description of attributes:
1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add 'freshness' and flavor to wines
4 - residual sugar: the amount of sugar remaining after fermentation stops; it's rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet
5 - chlorides: the amount of salt in the wine
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11 - alcohol: the percent alcohol content of the wine
Output variable (based on sensory data): 12 - quality (score between 0 and 10)
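A minimal sketch of the regression setup described above. The real UCI files (e.g. winequality-red.csv) are semicolon-separated CSVs; a synthetic frame with the same column layout stands in here so the snippet is self-contained:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# The real files are semicolon-separated:
# df = pd.read_csv("winequality-red.csv", sep=";")
# Synthetic stand-in with the same column layout for illustration.
rng = np.random.default_rng(0)
cols = ["fixed acidity", "volatile acidity", "citric acid", "residual sugar",
        "chlorides", "free sulfur dioxide", "total sulfur dioxide",
        "density", "pH", "sulphates", "alcohol"]
df = pd.DataFrame(rng.normal(size=(200, 11)), columns=cols)
df["quality"] = (5 + df["alcohol"] - df["volatile acidity"]
                 + rng.normal(0, 0.5, 200)).round().clip(0, 10)

X, y = df[cols], df["quality"]
# SVM under a regression approach, as in [Cortez et al., 2009];
# mean absolute error stands in for the MAD metric reported there.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
mad = -cross_val_score(model, X, y, cv=5,
                       scoring="neg_mean_absolute_error").mean()
```

The same pipeline applied to the real file reproduces the paper's regression framing; treating `quality` as a categorical target instead gives the classification variant mentioned above.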
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The full text of this article can be freely accessed on the publisher's website.
This dataset contains the key elements used in the paper Collective Intelligence Architecture for IoT Using Federated Process Mining, which range from complex event processing to process mining applied over multiple datasets. The information included is organized into the following sections:
1.- CEPApp.siddhi: It contains the rules and configurations used for pattern detection and real-time event processing.
2.- ProcessStorage.sol: Smart contract code used in the case study, implemented in Solidity on the Polygon blockchain platform.
3.- Datasets Used ({adlinterweave_dataset, adlmr_dataset, twor_dataset}.zip): Three datasets used in the study, each with events that have been processed using the CEP engine. The datasets are divided according to the rooms of the house:
_room.csv: CSV file with the data related to the interactions of the room stay.
_bathroom.csv: CSV file with the data related to the interactions of the bathroom stay.
_other.csv: CSV file with the data related to the interactions of the rest of the rooms.
4.- CEP Engine Processing Results ({cepresult_adlinterweave, cepresult_adlmr, cepresult_twor}.json): Output generated by the Siddhi CEP engine, stored in JSON format. The data is categorized into different files based on the type of detected activity:
_room.json: Contains the events related to the stay in the room.
_bathroom.json: Contains the events related to the bathing stay.
_other.json: Contains the events related to the rest of the rooms.
5.- Federated Event Logs ({xesresult_adlinterweave, xesresult_adlmr, xesresult_twor}.xes): Federated event logs in XES format, standard in process mining. Contains event traces obtained after the execution of the Event Log Integrator.
6.- Process Mining Results: Models generated from the processed event logs:
Process Trees ({procestree_adlinterweave, procestree_adlmr, procestree_twor}.svg): Structured representation of the detected workflows.
Petri Nets ({petrinet_adlinterweave, petrinet_adlmr, petrinet_twor}.svg): Mathematical model of the discovered processes, useful for compliance analysis and simulations.
Disco Results ({disco_adlinterweave, disco_adlmr, disco_twor}.pdf): Process models discovered with the Disco tool.
ProM Results ({prom_adlinterweave, prom_adlmr, prom_twor}.pdf): Models generated with ProM tool.
License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
This dataset contains a collection of over 2,000 company documents, categorized into four main types: invoices, inventory reports, purchase orders, and shipping orders. Each document is provided in PDF format, accompanied by a CSV file that includes the text extracted from these documents, their respective labels, and the word count of each document. This dataset is ideal for various natural language processing (NLP) tasks, including text classification, information extraction, and document clustering.
PDF Documents: The dataset includes 2,677 PDF files, each representing a unique company document. These documents are derived from the Northwind dataset, which is commonly used for demonstrating database functionalities.
The document types are:
Here are a few example entries from the CSV file:
This dataset can be used for:
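A minimal text-classification sketch of the kind this dataset supports; the toy documents and the label names below are assumptions for illustration, not the dataset's actual contents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the extracted document text; in practice these
# would come from the dataset's CSV of extracted text and labels.
texts = [
    "invoice total amount due payment terms net 30",
    "inventory report stock level warehouse units on hand",
    "purchase order quantity ordered supplier delivery date",
    "shipping order carrier tracking number destination address",
] * 10
labels = ["invoice", "inventory_report",
          "purchase_order", "shipping_order"] * 10

# TF-IDF features plus a linear classifier: a common baseline for
# four-way document-type classification.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)
pred = clf.predict(["invoice payment due net 30"])[0]
```

Swapping the toy lists for the CSV's extracted-text and label columns turns this into a working baseline for the classification task the description mentions.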
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Although bacterial population behavior has been investigated in a variety of foods over the past 40 years, it is difficult to obtain the desired information from the mere juxtaposition of experimental data. We predicted changes in the number of bacteria and visualized the effects of pH, aw, and temperature using a data mining approach. Population growth and inactivation data on eight pathogenic and food-spoilage bacteria under 5,025 environmental conditions were obtained from the ComBase database (www.combase.cc), covering 15 food categories and temperatures ranging from 0°C to 25°C. The eXtreme gradient boosting (XGBoost) tree was used to predict population behavior. The root mean square error between the observed and predicted values was 1.23 log CFU/g. The data mining model extracted the growth inhibition of the investigated bacteria with respect to aw, temperature, and pH using SHapley Additive exPlanations (SHAP) values. A data mining approach provides information concerning bacterial population behavior and how food ecosystems affect bacterial growth and inactivation.
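The study's pipeline (XGBoost plus SHAP values) can be approximated with scikit-learn's gradient boosting and permutation importance as dependency-light stand-ins; the records below are synthetic, not ComBase data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

# Synthetic stand-in for ComBase records: temperature, pH, and aw
# predicting the log-change in population (log CFU/g).
rng = np.random.default_rng(0)
n = 500
temp = rng.uniform(0, 25, n)      # °C, matching the study's range
ph = rng.uniform(3, 8, n)
aw = rng.uniform(0.85, 1.0, n)
X = np.column_stack([temp, ph, aw])
# Toy model: growth increases with temperature and aw, ignores pH.
y = 0.2 * temp + 10 * (aw - 0.9) + rng.normal(0, 0.5, n)

model = GradientBoostingRegressor(random_state=0).fit(X, y)
rmse = np.sqrt(np.mean((model.predict(X) - y) ** 2))

# Permutation importance stands in for SHAP attribution here;
# feature order: [temperature, pH, aw].
imp = permutation_importance(model, X, y, random_state=0).importances_mean
```

On the synthetic data, temperature dominates the importances and pH contributes nothing, mirroring how the SHAP analysis in the study separates influential from inert environmental factors.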
The COVID-19 Open Research Dataset is “a free resource of over 29,000 scholarly articles, including over 13,000 with full text, about COVID-19 and the coronavirus family of viruses for use by the global research community.”
in-the-news: On March 16, 2020, the White House issued a “call to action to the tech community” regarding the dataset, asking experts “to develop new text and data mining techniques that can help the science community answer high-priority scientific questions related to COVID-19.”
Included in this dataset:
Commercial use subset (includes PMC content) -- 9000 papers, 186 MB
Non-commercial use subset (includes PMC content) -- 1973 papers, 36 MB
PMC custom license subset -- 1426 papers, 19 MB
bioRxiv/medRxiv subset (pre-prints that are not peer reviewed) -- 803 papers, 13 MB
Each paper is represented as a single JSON object. The schema is available here.
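A minimal sketch of reading one per-paper JSON object; the field names follow the published CORD-19 schema ("paper_id", "metadata", "body_text"), demonstrated on an inline toy record rather than a real file:

```python
import json

# Toy record in the shape of a CORD-19 per-paper JSON object.
record = json.loads("""
{
  "paper_id": "abc123",
  "metadata": {"title": "A toy coronavirus paper"},
  "body_text": [{"text": "Full-text paragraph one."},
                {"text": "Full-text paragraph two."}]
}
""")

title = record["metadata"]["title"]
full_text = " ".join(p["text"] for p in record["body_text"])
```

For the real corpus, the same two lookups would run inside a loop over the JSON files in each subset directory, with `json.load` on each open file.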
We also provide a comprehensive metadata file of 29,000 coronavirus and COVID-19 research articles with links to PubMed, Microsoft Academic and the WHO COVID-19 database of publications (includes articles without open access full text):
Metadata file (readme) -- 47 MB
Source: https://pages.semanticscholar.org/coronavirus-research
Updated: Weekly
License: https://data.world/kgarrett/covid-19-open-research-dataset/workspace/file?filename=COVID.DATA.LIC.AGMT.pdf
This data is for practicing data analysis 🤝🎉
License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
Data Information: WISDM (Wireless Sensor Data Mining) smartphone-based sensor data, collected from 36 different users performing six different activities.
Number of examples: 1,098,207
Number of attributes: 6
Missing attribute values: None
Data processing:
1. Replace the nanoseconds with seconds in the timestamp column, and remove the user column, because each user performs the same actions.
2. Use the sliding window method to transform the data into sequences, then split each label into training and testing sets, ensuring an 8:2 ratio for each label across the training and testing sets.
3. Shuffle the order of the labels in both the training and testing sets and interleave them, to prevent two sequences with the same label from lining up consecutively.
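Step 2's sliding-window segmentation can be sketched as follows; the window width and step are illustrative assumptions, since the description above does not fix them:

```python
import numpy as np

def sliding_windows(samples, labels, width=80, step=40):
    """Segment a signal into fixed-width windows (step < width -> overlap).

    Each window takes the label of its last sample. The width/step
    values here are assumptions, not the dataset's documented settings.
    """
    xs, ys = [], []
    for start in range(0, len(samples) - width + 1, step):
        xs.append(samples[start:start + width])
        ys.append(labels[start + width - 1])
    return np.array(xs), np.array(ys)

# Toy tri-axial accelerometer stream: 200 samples of one activity.
rng = np.random.default_rng(0)
stream = rng.normal(size=(200, 3))   # x, y, z acceleration
acts = np.full(200, 5)               # label 5 = Walking
X, y = sliding_windows(stream, acts)

# 8:2 train/test split, as in step 2 (per label in the full pipeline).
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
```

On the real data this would run per activity label, so each label keeps the 8:2 ratio in both sets before the shuffle-and-interleave step.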
Activity:
0 = Downstairs 100,427 (9.1%)
1 = Jogging 342,177 (31.2%)
2 = Sitting 59,939 (5.5%)
3 = Standing 48,395 (4.4%)
4 = Upstairs 122,869 (11.2%)
5 = Walking 424,400 (38.6%)
Resource:
The dataset was collected by the WISDM Lab [https://www.cis.fordham.edu/wisdm/dataset.php]
Jeffrey W. Lockhart, Gary M. Weiss, Jack C. Xue, Shaun T. Gallagher, Andrew B. Grosner, and Tony T. Pulickal (2011). "Design Considerations for the WISDM Smart Phone-Based Sensor Mining Architecture," Proceedings of the Fifth International Workshop on Knowledge Discovery from Sensor Data (at KDD-11), San Diego, CA. [https://www.cis.fordham.edu/wisdm/includes/files/Lockhart-Design-SensorKDD11.pdf]
These data accompany the 2018 manuscript published in PLOS One titled "Mapping the yearly extent of surface coal mining in Central Appalachia using Landsat and Google Earth Engine". In this manuscript, researchers used the Google Earth Engine platform and freely accessible Landsat imagery to create a yearly dataset (1985 through 2015) of surface coal mining in the Appalachian region of the United States of America. This specific dataset is a GeoTIFF file depicting when an area was most recently mined, over the period 1985 through 2015. The raster values give the year in which mining was most recently detected by the paper's processing model. A value of "1984" indicates an area that was most recently mined at some point prior to 1985. These pre-1985 mining data are derived from a prior study; see https://skytruth.org/wp/wp-content/uploads/2017/03/SkyTruth-MTR-methodology.pdf for more information. This dataset does not indicate for how long an area was a mine or when mining began in a given area.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Bacterial small RNAs (sRNAs) play a vital role in pathogenesis by enabling rapid, efficient networks of gene attenuation during infection. In recent decades, there has been a surge in the number of proposed and biochemically confirmed sRNAs in both Gram-positive and Gram-negative pathogens. However, limited homology, network complexity, and the condition specificity of sRNAs have stunted complete characterization of the activity and regulation of these RNA regulators. To streamline the discovery of sRNA expression and post-transcriptional activity, we propose an integrative in vivo data-mining approach that couples DNA protein occupancy, RNA-seq, and RNA accessibility data with motif identification and target prediction algorithms. We benchmark the approach against a subset of well-characterized E. coli sRNAs for which a degree of in vivo transcriptional regulation and post-transcriptional activity has been previously reported, finding support for known regulation in a large proportion of this sRNA set. We showcase the ability of our method to expand understanding of the sRNA RseX, a known envelope-stress-linked sRNA whose cellular role has been elusive due to a lack of detected native expression. Using the presented approach, we identify a small set of putative RseX regulators and targets for experimental investigation. These findings have allowed us to confirm native RseX expression under conditions that eliminate H-NS repression, as well as to uncover a post-transcriptional role of RseX in fimbrial regulation. Beyond RseX, we uncover 163 putative regulatory DNA-binding protein sites, corresponding to the regulation of 62 sRNAs, that could lead to new understanding of sRNA transcriptional regulation. For 32 sRNAs, we also propose a subset of top targets, filtered by engagement of regions that exhibit binding-site accessibility behavior in vivo.
We broadly anticipate that the proposed approach will be useful for sRNA-reliant network characterization in bacteria. Such investigations under pathogenesis-relevant environmental conditions will enable us to deduce complex rapid-regulation schemes that support infection.
Source: https://archive.ics.uci.edu/ml/datasets/forest+fires
Citation Request: This dataset is public available for research. The details are described in [Cortez and Morais, 2007]. Please include this citation if you plan to use this database:
P. Cortez and A. Morais. A Data Mining Approach to Predict Forest Fires using Meteorological Data. In J. Neves, M. F. Santos and J. Machado Eds., New Trends in Artificial Intelligence, Proceedings of the 13th EPIA 2007 - Portuguese Conference on Artificial Intelligence, December, Guimaraes, Portugal, pp. 512-523, 2007. APPIA, ISBN-13 978-989-95618-0-9. Available at: http://www.dsi.uminho.pt/~pcortez/fires.pdf
Title: Forest Fires
Sources Created by: Paulo Cortez and Aníbal Morais (Univ. Minho) @ 2007
Past Usage:
P. Cortez and A. Morais. A Data Mining Approach to Predict Forest Fires using Meteorological Data. In Proceedings of the 13th EPIA 2007 - Portuguese Conference on Artificial Intelligence, December, 2007. (http://www.dsi.uminho.pt/~pcortez/fires.pdf)
In the above reference, the output "area" was first transformed with a ln(x+1) function. Then, several data mining methods were applied. After fitting the models, the outputs were post-processed with the inverse of the ln(x+1) transform. Four different input setups were used. The experiments were conducted using 10-fold cross-validation, repeated over 30 runs. Two regression metrics were measured: MAD and RMSE. A Gaussian support vector machine (SVM) fed with only four direct weather conditions (temp, RH, wind and rain) obtained the best MAD value: 12.71 +- 0.01 (mean and 95% confidence interval using a Student's t-distribution). The best RMSE was attained by the naive mean predictor. An analysis of the regression error characteristic (REC) curve shows that the SVM model predicts more examples within a lower admitted error. In effect, the SVM model predicts small fires, which are the majority, better.
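The ln(x+1) target transform around an SVM described above can be sketched with scikit-learn's TransformedTargetRegressor; the weather inputs and burned-area output below are synthetic stand-ins for the real data:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the four direct weather inputs
# (temp, RH, wind, rain) and the skewed burned "area" output.
rng = np.random.default_rng(0)
n = 517
X = np.column_stack([rng.uniform(0, 35, n),    # temp
                     rng.uniform(15, 100, n),  # RH
                     rng.uniform(0, 10, n),    # wind
                     rng.uniform(0, 6, n)])    # rain
area = np.expm1(rng.normal(0.01 * X[:, 0], 1.0, n)).clip(min=0)

# ln(x+1) transform on the output, inverted after prediction,
# matching the setup described above.
svm = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    func=np.log1p, inverse_func=np.expm1)
mad = -cross_val_score(svm, X, area, cv=10,
                       scoring="neg_mean_absolute_error").mean()
```

Repeating the 10-fold run 30 times with different shuffles, as in the paper, would give the mean-and-confidence-interval form of the reported MAD.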
Relevant Information:
This is a very difficult regression task. It can be used to test regression methods. It could also be used to test outlier detection methods, since it is not clear how many outliers there are. Note that the number of examples of fires with a large burned area is very small.
Number of Instances: 517
Number of Attributes: 12 + output attribute
Note: several of the attributes may be correlated, thus it makes sense to apply some sort of feature selection.
Attribute information:
For more information, read [Cortez and Morais, 2007].
Missing Attribute Values: None
License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
The Multi-aspect Reviews dataset primarily encompasses beer review data from RateBeer and BeerAdvocate, with a focus on multiple rated dimensions providing a comprehensive insight into sensory aspects such as taste, look, feel, and smell. This dataset facilitates the analysis of different facets of reviews, thus aiding in a deeper understanding of user preferences and product characteristics.
Basic Statistics (RateBeer):
- Number of users: 40,213
- Number of items: 110,419
- Number of ratings/reviews: 2,855,232
- Timespan: Apr 2000 - Nov 2011
Metadata:
- Reviews: Textual reviews provided by users.
- Aspect-specific ratings: Ratings on taste, look, feel, smell, and overall impression.
- Product Category: Categories of beer products.
- ABV (Alcohol By Volume): Indicates the alcohol content in the beer.
Examples:
- RateBeer Example
```json
{
  "beer/name": "John Harvards Simcoe IPA",
  "beer/beerId": "63836",
  "beer/brewerId": "8481",
  "beer/ABV": "5.4",
  "beer/style": "India Pale Ale (IPA)",
  "review/appearance": "4/5",
  "review/aroma": "6/10",
  "review/palate": "3/5",
  "review/taste": "6/10",
  "review/overall": "13/20",
  "review/time": "1157587200",
  "review/profileName": "hopdog",
  "review/text": "On tap at the Springfield, PA location. Poured a deep and cloudy orange (almost a copper) color with a small sized off white head. Aromas or oranges and all around citric. Tastes of oranges, light caramel and a very light grapefruit finish. I too would not believe the 80+ IBUs - I found this one to have a very light bitterness with a medium sweetness to it. Light lacing left on the glass."
}
```
Download Links: - BeerAdvocate Data - RateBeer Data - Sentences with aspect labels (annotator 1) - Sentences with aspect labels (annotator 2)
Citations:
- Learning attitudes and attributes from multi-aspect reviews. Julian McAuley, Jure Leskovec, Dan Jurafsky. International Conference on Data Mining (ICDM), 2012.
- From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. Julian McAuley, Jure Leskovec. WWW, 2013.
Use Cases:
1. Aspect-Based Sentiment Analysis (ABSA): analyzing sentiments on different aspects of beers such as taste, look, feel, and smell to gain deeper insights into user preferences and opinions.
2. Recommendation Systems: developing personalized recommendation systems that consider multiple aspects of user preferences.
3. Product Development: utilizing feedback on various aspects to improve the product.
4. Consumer Behavior Analysis: studying how different aspects influence consumer choice and satisfaction.
5. Competitor Analysis: comparing ratings on different aspects with competitors to identify strengths and weaknesses.
6. Trend Analysis: identifying trends in consumer preferences over time across different aspects.
7. Marketing Strategies: formulating marketing strategies based on insights drawn from aspect-based reviews.
8. Natural Language Processing (NLP): developing and enhancing NLP models to understand and categorize multi-aspect reviews.
9. Learning User Expertise Evolution: studying how user expertise evolves through reviews and ratings over time.
10. Training Machine Learning Models: training supervised learning models to predict aspect-based ratings from review text.
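Note that the aspect ratings in each record arrive as string fractions on different scales ("4/5", "6/10", "13/20"), so most analyses start by normalizing them to a common range. A minimal sketch, using the example record above:

```python
# Normalize RateBeer's string fractions ("4/5", "6/10", "13/20")
# to a common 0-1 scale so the aspects become comparable.
def normalize(rating: str) -> float:
    num, den = rating.split("/")
    return float(num) / float(den)

record = {
    "review/appearance": "4/5",
    "review/aroma": "6/10",
    "review/palate": "3/5",
    "review/taste": "6/10",
    "review/overall": "13/20",
}
# Strip the "review/" prefix and keep the normalized score per aspect.
scores = {k.split("/")[1]: normalize(v) for k, v in record.items()}
```

After this step, `scores["appearance"]` is 0.8 and `scores["overall"]` is 0.65, which can feed directly into aspect-based sentiment or rating-prediction models.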
This dataset is extremely valuable for researchers, marketers, product developers, and machine learning practitioners looking to delve into multi-dimensional review analysis and understand user-product interaction on a granular level.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Public health-related decision-making on policies aimed at controlling the COVID-19 pandemic outbreak depends on complex epidemiological models that are compelled to be robust and use all relevant available data. This data article provides a new combined worldwide COVID-19 dataset obtained from official data sources, with corrections for systematic measurement errors and a dedicated dashboard for online data visualization and summary. The dataset adds new measures and attributes to the standard attributes of official data sources, such as daily mortality and fatality rates. We used comparative statistical analysis to evaluate the measurement errors of COVID-19 official data collections from the Chinese Center for Disease Control and Prevention (Chinese CDC), the World Health Organization (WHO) and the European Centre for Disease Prevention and Control (ECDC). The data were collected by using text mining techniques and reviewing PDF reports, metadata, and reference data. The combined dataset includes complete spatial data such as country area, international country number, Alpha-2 code, Alpha-3 code, latitude, longitude, and additional attributes such as population. The improved dataset benefits from major corrections to the referenced datasets and official reports, such as adjustments to the reporting dates, which suffered from a one- to two-day lag, removal of negative values, detection of unreasonable changes in historical data in new reports, and corrections of systematic measurement errors, which have been increasing as the pandemic outbreak spreads and more countries contribute data to the official repositories. Additionally, the root mean square error of attributes in the paired comparison of datasets was used to identify the main data problems. The data for China is presented separately and in more detail, and it has been extracted from the attached reports available on the main page of the Chinese CDC website.
This dataset is a comprehensive and reliable source of worldwide COVID-19 data that can be used in epidemiological models assessing the magnitude and timeline for confirmed cases, long-term predictions of deaths or hospital utilization, the effects of quarantine, stay-at-home orders and other social distancing measures, the pandemic’s turning point or in economic and social impact analysis, helping to inform national and local authorities on how to implement an adaptive response approach to re-opening the economy, re-open schools, alleviate business and social distancing restrictions, design economic programs or allow sports events to resume.
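The paired-comparison idea mentioned above, using root mean square error to flag disagreements between official sources, can be sketched as follows. The daily counts below are invented for illustration; the real inputs would be the date-aligned series for the same attribute from, say, WHO and ECDC.

```python
# RMSE between the same attribute (e.g. daily confirmed cases)
# as reported by two official sources, aligned by date.
import math

def rmse(a, b):
    assert len(a) == len(b), "series must be aligned to the same dates"
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

who  = [120, 135, 150, 180]   # hypothetical daily counts, source A
ecdc = [118, 135, 155, 179]   # hypothetical daily counts, source B
error = rmse(who, ecdc)
```

A large RMSE between two sources on the same attribute points at reporting-date lags, sign errors, or retroactive revisions, which is exactly what the corrections described above target.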
Automatic Identification And Data Capture Market Size 2024-2028
The automatic identification and data capture market is forecast to grow by USD 21.52 billion, at a CAGR of 8.1%, from 2023 to 2028. Increasing applications of RFID will drive the automatic identification and data capture market.
Market Insights
North America dominated the market and is expected to account for 47% of global market growth during 2024-2028.
By Product - RFID products segment was valued at USD 18.41 billion in 2022
Market Size & Forecast
Market Opportunities: USD 79.34 million
Market Future Opportunities 2023: USD 21,520.40 million
CAGR from 2023 to 2028: 8.1%
Market Summary
The Automatic Identification and Data Capture (AIDC) market encompasses technologies and solutions that enable businesses to capture and process data in real time. This market is driven by the increasing adoption of RFID technology, which offers benefits such as improved supply chain visibility, inventory management, and operational efficiency. The growing popularity of smart factories, where automation and data-driven processes are integral, further fuels the demand for AIDC solutions. However, the market also faces challenges, including security concerns. With the increasing use of AIDC technologies, there is a growing need to ensure data privacy and security. This has led to the development of advanced encryption techniques and access control mechanisms to mitigate potential risks. A real-world business scenario illustrating the importance of AIDC is in the retail industry. Retailers use AIDC technologies such as RFID tags and barcode scanners to manage inventory levels, track stock movements, and optimize supply chain operations. By automating data capture processes, retailers can reduce manual errors, improve order fulfillment accuracy, and enhance the overall customer experience. Despite the challenges, the AIDC market continues to grow, driven by the need for real-time data processing and automation across various industries.
What will be the size of the Automatic Identification And Data Capture Market during the forecast period?
Get Key Insights on Market Forecast (PDF) Request Free SampleThe Automatic Identification and Data Capture (AIDC) market continues to evolve, driven by advancements in technology and increasing business demands. AIDC solutions, including barcode scanners, RFID systems, and OCR technology, enable organizations to streamline processes, enhance data accuracy, and improve operational efficiency. According to recent research, the use of RFID technology in the retail sector has surged by 25% over the past five years, underpinning its significance in inventory management and supply chain optimization. Moreover, the integration of AIDC technologies with cloud computing services and data visualization dashboards offers real-time data access and analysis, empowering businesses to make informed decisions. For instance, a manufacturing firm can leverage RFID data to monitor production lines, optimize workflows, and ensure compliance with industry regulations. AIDC systems are also instrumental in enhancing data security and privacy, with advanced encryption protocols and access control features ensuring data integrity and confidentiality. By adopting AIDC technologies, organizations can not only improve their operational efficiency but also gain a competitive edge in their respective industries.
Unpacking the Automatic Identification And Data Capture Market Landscape
The market encompasses technologies such as RFID tag identification, data stream management, and data mining techniques. These solutions enable businesses to efficiently process and analyze vast amounts of data from various sources, leading to significant improvements in data quality metrics and workflow optimization strategies. For instance, RFID implementation can result in a 30% increase in inventory accuracy, while data mining techniques can uncover hidden patterns and trends, driving ROI improvement and compliance alignment. Real-time data processing, facilitated by technologies like document understanding AI and image recognition algorithms, ensures swift decision-making and error reduction. Data capture pipelines and database management systems provide a solid foundation for data aggregation and analysis, while semantic web technologies and natural language processing enhance information retrieval and understanding. By integrating sensor data and applying machine vision systems, businesses can achieve high-throughput imaging and object detection, further enhancing their data processing capabilities.
Key Market Drivers Fueling Growth
The significant expansion of RFID (Radio-Frequency Identification) technology applications is the primary market growth catalyst.
Africa is a continent that covers 6% of the Earth's surface and 20% of its land area. Including its islands, it spans 30,415,873 km², making it the third largest continent in the world if the Americas are counted as a single continent. With more than 1.3 billion inhabitants, Africa is the second most populous continent after Asia, representing 17.2% of the world population in 2020.
Africa abounds in very varied energy sources, distributed across distinct zones: abundant fossil fuels (gas in North Africa, oil in the Gulf of Guinea, and coal in southern Africa), hydraulic basins in Central Africa, uranium deposits, solar radiation in Sahelian countries, and geothermal capacity in East Africa. Despite this, it has been prey to conflicts (socio-political, political, social, civil wars, government mismanagement, etc.) since its countries gained independence, and a land fiercely coveted by powerful countries and large multinational corporations.
The data were acquired from the ACLED (Armed Conflict Location & Event Data) project. ACLED reports information on the type, agents, location, date, and other characteristics of political violence events, demonstrations, and selected politically relevant non-violent events. ACLED also tracks a range of violent and non-violent actions by political agents, including governments, rebels, militias, identity groups, political parties, external actors, rioters, protesters, and civilians. The Africa Conflict 1997-2020 dataset is one of the databases of the ACLED project.
For details, see acleddata.com: the ACLED Codebook and the User Quick Guide.
Thanks to “Armed Conflict Location & Event Data Project (ACLED); https://www.acleddata.com.”
Can you understand how conflicts evolved in Africa from 1997 to 2020, and what link there is between the energy resources of certain regions of Africa and conflicts? (Put your geopolitics, geo-economics, and geo-energy skills into practice.)
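A first exploratory cut at the question above, counting events per year and summing fatalities per country, might look like the following sketch. The four rows are synthetic stand-ins for the real 1997-2020 file, and while the column names follow the ACLED Codebook (event_date, event_type, country, fatalities), you should check them against your actual export.

```python
# Exploratory aggregation over a tiny synthetic stand-in for the
# ACLED Africa 1997-2020 data. Column names assumed from the Codebook.
import pandas as pd

events = pd.DataFrame({
    "event_date": ["1997-03-01", "2011-06-15", "2020-01-09", "2020-07-22"],
    "event_type": ["Battles", "Riots", "Battles", "Protests"],
    "country":    ["Angola", "Egypt", "Libya", "Nigeria"],
    "fatalities": [12, 3, 25, 0],
})
events["year"] = pd.to_datetime(events["event_date"]).dt.year

per_year = events.groupby("year").size()  # conflict event counts over time
deadliest = events.groupby("country")["fatalities"].sum().idxmax()
```

Joining aggregates like these against a table of regional energy resources (oil, gas, coal, uranium) is one way to start probing the resource-conflict link the prompt asks about.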
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Due to the increasing use of technology-enhanced educational assessment, data mining methods have been explored to analyse process data in log files from such assessments. However, most studies were limited to one data mining technique under one specific scenario. The current study demonstrates the usage of four frequently used supervised techniques, namely Classification and Regression Trees (CART), gradient boosting, random forest, and support vector machines (SVM), and two unsupervised methods, self-organizing maps (SOM) and k-means, fitted to a single assessment dataset. The USA sample (N = 426) from the 2012 Program for International Student Assessment (PISA), responding to problem-solving items, is extracted to demonstrate the methods. After concrete feature generation and feature selection, classifier development procedures are implemented using the illustrated techniques. Results show satisfactory classification accuracy for all the techniques. Suggestions for the selection of classifiers are presented based on the research questions, the interpretability, and the simplicity of the classifiers. Interpretations of the results from both supervised and unsupervised learning methods are provided.
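The comparison setup described in this abstract can be sketched as below. This is a minimal illustration on synthetic features, not the study's actual pipeline: the real work engineers features from PISA 2012 log files (N = 426), which are not reproduced here, and the SOM step would need a separate package (e.g. minisom), so only k-means is shown for the unsupervised side.

```python
# Sketch: compare the four supervised classifiers named above, plus
# k-means, on synthetic stand-in features (426 samples to match N).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier            # CART
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.svm import SVC                                # SVM
from sklearn.cluster import KMeans                         # unsupervised

X, y = make_classification(n_samples=426, n_features=10, random_state=0)

supervised = {
    "CART": DecisionTreeClassifier(random_state=0),
    "Gradient boosting": GradientBoostingClassifier(random_state=0),
    "Random forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
}
# Mean 5-fold cross-validated accuracy per classifier.
scores = {name: cross_val_score(clf, X, y, cv=5).mean()
          for name, clf in supervised.items()}

# Unsupervised view: cluster the same features into two groups.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

Comparing the accuracy table against the cluster assignments mirrors the study's contrast between classifier performance and the structure the unsupervised methods recover.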