License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
An image classification algorithm based on adaptive feature weight updating is proposed to address the low classification accuracy of current single-feature classification algorithms and simple multi-feature fusion algorithms. The MapReduce parallel programming model on the Hadoop platform is used to perform an adaptive fusion of hue, local binary pattern (LBP) and scale-invariant feature transform (SIFT) features extracted from images and to derive an optimal combination of weights. The support vector machine (SVM) classifier is then trained in parallel to obtain the optimal SVM classification model, which is subsequently tested. The Pascal VOC 2012, Caltech 256 and SUN databases are used to build a massive image library. Speedup, classification accuracy and training time are measured in the experiments, and the results show that the system's speedup grows approximately linearly in a cluster environment. Considering hardware cost, time, performance and accuracy, the algorithm outperforms mainstream classification algorithms such as the power mean SVM and the convolutional neural network (CNN). As both the number and the variety of images increase, the classification accuracy exceeds 95%. When the number of images reaches 80,000, the training time of the proposed algorithm is only 1/5 that of traditional single-node algorithms. These results demonstrate the effectiveness of the algorithm and provide a basis for the effective analysis and processing of image big data.
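To give a concrete picture of the fusion step, the following is a minimal single-machine sketch, not the paper's implementation: it assumes precomputed hue, LBP and SIFT feature matrices (illustrative names) and uses a plain grid search over weight combinations with a scikit-learn SVM and cross-validation as a stand-in for the adaptive weight-updating and MapReduce-parallel training described above.

```python
# Minimal single-machine sketch of weighted multi-feature fusion with an SVM.
# X_hue, X_lbp, X_sift and y are assumed, precomputed inputs; the grid search
# below is a simplified stand-in for the adaptive weight update, and the real
# system trains in parallel via MapReduce on Hadoop.
import itertools
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import normalize
from sklearn.svm import SVC


def fuse(features, weights):
    # Concatenate L2-normalized feature blocks, each scaled by its weight.
    return np.hstack([w * normalize(f) for f, w in zip(features, weights)])


def best_weights(features, y, steps=10):
    # Score every weight triple (w1, w2, w3) on a grid where the weights sum
    # to 1 and keep the best cross-validated combination.
    best_score, best_w = -1.0, None
    for i, j in itertools.product(range(steps + 1), repeat=2):
        k = steps - i - j
        if k < 0:
            continue
        w = (i / steps, j / steps, k / steps)
        score = cross_val_score(SVC(kernel="linear"), fuse(features, w), y, cv=3).mean()
        if score > best_score:
            best_score, best_w = score, w
    return best_score, best_w


# Toy usage with random stand-in data (hue / LBP / SIFT bag-of-words vectors).
rng = np.random.default_rng(0)
X_hue, X_lbp, X_sift = rng.random((150, 32)), rng.random((150, 59)), rng.random((150, 128))
y = rng.integers(0, 5, size=150)
print(best_weights([X_hue, X_lbp, X_sift], y))
```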
License: U.S. Government Works, https://www.usa.gov/government-works
License information was derived automatically
Anomaly detection is the process of identifying items, events or observations that do not conform to an expected pattern in a dataset or time series. Current and future missions and our research communities challenge us to rapidly identify features and anomalies in complex and voluminous observations to further science and improve decision support. Given this data-intensive reality, we propose to develop an anomaly detection system, called OceanXtremes, powered by an intelligent, elastic Cloud-based analytic service backend that enables execution of domain-specific, multi-scale anomaly and feature detection algorithms across the entire archive of ocean science datasets.

A parallel analytics engine will be developed as the key computational and data-mining core of OceanXtremes' backend processing. This analytic engine will demonstrate three new technology ideas to provide rapid turnaround on climatology computation and anomaly detection:
1. An adaptation of the Hadoop/MapReduce framework for parallel data mining of science datasets, typically large 3- or 4-dimensional arrays packaged in NetCDF and HDF.
2. An algorithm profiling service to efficiently and cost-effectively scale up hybrid Cloud computing resources based on the needs of scheduled jobs (CPU, memory, network, and bursting from a private Cloud computing cluster to a public cloud provider such as Amazon Cloud services).
3. An extension to industry-standard search solutions (OpenSearch and faceted search) to support shared discovery and exploration of ocean phenomena and anomalies, along with unexpected correlations between key measured variables.

We will use a hybrid Cloud compute cluster (private Eucalyptus on-premise at JPL with bursting to Amazon Web Services) as the operational backend. The key idea is that the parallel data-mining operations will run 'near' the ocean data archives (a local 'network' hop) so that we can efficiently access the thousands of (say, daily) files making up a three-decade time series, and then cache key variables and pre-computed climatologies in a high-performance parallel database.

OceanXtremes will be equipped with both web portal and web service interfaces for users and applications/systems to register and retrieve oceanographic anomaly data. By leveraging technology such as Datacasting (Bingham et al., 2007), users can also subscribe to anomaly or 'event' types of interest and have newly computed anomaly metrics and other information delivered to them via metadata feeds packaged in the standard Rich Site Summary (RSS) format. Upon receiving new feed entries, users can examine the metrics and download the relevant variables, by simply clicking a link, to begin further analysis of the event.

The OceanXtremes web portal will allow users to define their own anomaly or feature types; continuous backend processing will then be scheduled to populate each new user-defined anomaly type by executing the chosen data-mining algorithm (e.g., differences from climatology or gradients above a specified threshold). Metadata on the identified anomalies will be cataloged, including temporal and geospatial profiles, key physical metrics, related observational artifacts and other relevant metadata, to facilitate discovery, extraction and visualization. Products created by the anomaly detection algorithms will be made explorable and subsettable using Webification (Huang et al., 2014) and OPeNDAP (http://opendap.org) technologies.
Using this platform, scientists can efficiently search for anomalies or ocean phenomena, compute data metrics for events or over time series of ocean variables, and find and access all of the data relevant to their study (and then download only that data).
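To make the simplest detector mentioned above concrete, here is a toy NumPy sketch of "difference from climatology" anomaly flagging on an in-memory gridded variable. The array layout (time, lat, lon), the synthetic data and the 2-sigma threshold are assumptions for illustration only; the actual system would run the equivalent computation with Hadoop/MapReduce over NetCDF/HDF archives.

```python
# Toy sketch of "difference from climatology" anomaly detection on a gridded
# ocean variable (e.g., sea surface temperature). All values are synthetic
# and the layout/threshold are illustrative, not OceanXtremes' configuration.
import numpy as np

rng = np.random.default_rng(1)
n_days = 10 * 365                                   # ten years of daily fields
time_month = (np.arange(n_days) // 30) % 12         # crude month index per day
sst = (15
       + 10 * np.sin(2 * np.pi * time_month[:, None, None] / 12)  # seasonal cycle
       + rng.normal(0, 1, size=(n_days, 18, 36)))                 # noise on an 18x36 grid

# 1. Monthly climatology: mean and spread of each calendar month, per grid cell.
climatology = np.stack([sst[time_month == m].mean(axis=0) for m in range(12)])
spread = np.stack([sst[time_month == m].std(axis=0) for m in range(12)])

# 2. Anomaly = observation minus the climatological mean for its month.
anomaly = sst - climatology[time_month]

# 3. Flag grid cells whose anomaly exceeds a threshold (here 2 sigma).
events = np.abs(anomaly) > 2 * spread[time_month]
print("flagged cells:", int(events.sum()), "of", events.size)
```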
The Webis Gmane Email Corpus 2019 is a dataset of more than 153 million parsed and segmented emails crawled between February and May 2019 from gmane.io, covering more than 20 years of public mailing lists. The dataset has been published as a resource at ACL 2020.
The dataset comes as a set of Gzip-compressed files containing line-based JSON in the Elasticsearch bulk format. Each data record consists of two lines:
{"index": {"_id": "
The first line is the Elasticsearch index action with a document UUID; the second is the actual parsed email with a (reduced and anonymized) set of headers, the detected language, the original Gmane group name and the predicted content segments as character spans. The Gzip files can be split every 1,000 records (line pairs) for parallel processing in, e.g., Hadoop.
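As an illustration of how the two-line bulk records can be consumed, the sketch below streams one Gzip-compressed file with the Python standard library and yields (UUID, email) pairs; the file name is a placeholder, and no assumption is made about the email record's field names.

```python
# Stream a corpus file in Elasticsearch bulk format: records come as pairs of
# lines (index action, then the parsed email). "corpus-part.jsonl.gz" is a
# placeholder, not an actual file name from the dataset.
import gzip
import json
from itertools import islice


def iter_records(path):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        while True:
            action = f.readline()
            doc = f.readline()
            if not doc:          # end of file reached
                break
            yield json.loads(action)["index"]["_id"], json.loads(doc)


# Print the UUIDs of the first three emails.
for uuid, email in islice(iter_records("corpus-part.jsonl.gz"), 3):
    print(uuid)
```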
Available email headers are:
Available segment classes are:
Find more information about the dataset and the segmentation model at webis.de.
If you are using this resource in your work, please cite it as:
@InProceedings{stein:2020o,
author = {Janek Bevendorff and Khalid Al-Khatib and Martin Potthast and Benno Stein},
booktitle = {58th Annual Meeting of the Association for Computational Linguistics (ACL 2020)},
month = jul,
publisher = {Association for Computational Linguistics},
site = {Seattle, USA},
title = {{Crawling and Preprocessing Mailing Lists At Scale for Dialog Analysis}},
year = 2020
}