3 datasets found
  1. Improved support vector machine classification algorithm based on adaptive feature weight updating in the Hadoop cluster environment

    • plos.figshare.com
    Updated May 31, 2023
    Cite
    Jianfang Cao; Min Wang; Yanfei Li; Qi Zhang (2023). Improved support vector machine classification algorithm based on adaptive feature weight updating in the Hadoop cluster environment [Dataset]. http://doi.org/10.1371/journal.pone.0215136
    Available download formats: doc
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Jianfang Cao; Min Wang; Yanfei Li; Qi Zhang
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    An image classification algorithm based on adaptive feature weight updating is proposed to address the low classification accuracy of the current single-feature classification algorithms and simple multifeature fusion algorithms. The MapReduce parallel programming model on the Hadoop platform is used to perform an adaptive fusion of hue, local binary pattern (LBP) and scale-invariant feature transform (SIFT) features extracted from images to derive optimal combinations of weights. The support vector machine (SVM) classifier is then used to perform parallel training to obtain the optimal SVM classification model, which is then tested. The Pascal VOC 2012, Caltech 256 and SUN databases are adopted to build a massive image library. The speedup, classification accuracy and training time are tested in the experiment, and the results show that a linear growth tendency is present in the speedup of the system in a cluster environment. In consideration of the hardware costs, time, performance and accuracy, the algorithm is superior to mainstream classification algorithms, such as the power mean SVM and convolutional neural network (CNN). As the number and types of images both increase, the classification accuracy rate exceeds 95%. When the number of images reaches 80,000, the training time of the proposed algorithm is only 1/5 that of traditional single-node architecture algorithms. This result reflects the effectiveness of the algorithm, which provides a basis for the effective analysis and processing of image big data.
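
    For illustration only, a minimal single-node sketch of the weighted multi-feature fusion step in Python with scikit-learn. The published algorithm learns the weights adaptively and runs fusion and SVM training as MapReduce jobs on Hadoop, so the fixed weights, random stand-in descriptors and dimensions below are placeholders, not the authors' implementation.

    # Minimal sketch: fuse normalized per-image descriptors with one weight per
    # feature type, then train an SVM on the fused vectors.
    import numpy as np
    from sklearn.preprocessing import normalize
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    n_images = 200

    # Stand-ins for per-image hue, LBP and SIFT (e.g. bag-of-words) descriptors.
    hue_feats = rng.random((n_images, 32))
    lbp_feats = rng.random((n_images, 59))
    sift_feats = rng.random((n_images, 128))
    labels = rng.integers(0, 2, n_images)

    # Fixed example weights; the paper's algorithm updates these adaptively.
    weights = {"hue": 0.3, "lbp": 0.3, "sift": 0.4}
    fused = np.hstack([
        weights["hue"] * normalize(hue_feats),
        weights["lbp"] * normalize(lbp_feats),
        weights["sift"] * normalize(sift_feats),
    ])

    clf = SVC(kernel="rbf").fit(fused, labels)
    print("training accuracy:", clf.score(fused, labels))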

  2. OceanXtremes: Oceanographic Data-Intensive Anomaly Detection and Analysis Portal

    • data.amerigeoss.org
    • data.wu.ac.at
    Updated Jul 25, 2019
    Cite
    United States[old] (2019). OceanXtremes: Oceanographic Data-Intensive Anomaly Detection and Analysis Portal [Dataset]. https://data.amerigeoss.org/pl/dataset/0f24d562-556c-4895-955a-74fec4cc9993
    Available download formats: html
    Dataset updated
    Jul 25, 2019
    Dataset provided by
    United States[old]
    License

    U.S. Government Works, https://www.usa.gov/government-works
    License information was derived automatically

    Description

    Anomaly detection is a process of identifying items, events or observations that do not conform to an expected pattern in a dataset or time series. Current and future missions and our research communities challenge us to rapidly identify features and anomalies in complex and voluminous observations to further science and improve decision support. Given this data-intensive reality, we propose to develop an anomaly detection system, called OceanXtremes, powered by an intelligent, elastic Cloud-based analytic service backend that enables execution of domain-specific, multi-scale anomaly and feature detection algorithms across the entire archive of ocean science datasets.

    A parallel analytics engine will be developed as the key computational and data-mining core of OceanXtremes' backend processing. This analytic engine will demonstrate three new technology ideas to provide rapid turnaround on climatology computation and anomaly detection:

    1. An adaptation of the Hadoop/MapReduce framework for parallel data mining of science datasets, typically large 3- or 4-dimensional arrays packaged in NetCDF and HDF.
    2. An algorithm profiling service to efficiently and cost-effectively scale up hybrid Cloud computing resources based on the needs of scheduled jobs (CPU, memory, network, and bursting from a private Cloud computing cluster to a public cloud provider such as Amazon Cloud services).
    3. An extension to industry-standard search solutions (OpenSearch and faceted search) to support shared discovery and exploration of ocean phenomena and anomalies, along with unexpected correlations between key measured variables.

    We will use a hybrid Cloud compute cluster (private Eucalyptus on-premise at JPL with bursting to Amazon Web Services) as the operational backend. The key idea is that the parallel data-mining operations will be run 'near' the ocean data archives (a local 'network' hop) so that we can efficiently access the thousands of (say, daily) files making up a three-decade time series, and then cache key variables and pre-computed climatologies in a high-performance parallel database.

    OceanXtremes will be equipped with both web portal and web service interfaces for users and applications/systems to register and retrieve oceanographic anomaly data. By leveraging technology such as Datacasting (Bingham et al., 2007), users can also subscribe to anomaly or 'event' types of interest and have newly computed anomaly metrics and other information delivered to them by metadata feeds packaged in standard Rich Site Summary (RSS) format. Upon receiving new feed entries, users can examine the metrics and download relevant variables, by simply clicking on a link, to begin further analysis of the event.

    The OceanXtremes web portal will allow users to define their own anomaly or feature types, for which continuous backend processing will be scheduled to populate the new user-defined anomaly type by executing the chosen data-mining algorithm (e.g., differences from climatology or gradients above a specified threshold). Metadata on the identified anomalies will be cataloged, including temporal and geospatial profiles, key physical metrics, related observational artifacts and other relevant metadata, to facilitate discovery, extraction and visualization. Products created by the anomaly detection algorithm will be made explorable and subsettable using Webification (Huang et al., 2014) and OPeNDAP (http://opendap.org) technologies.

    Using this platform, scientists can efficiently search for anomalies or ocean phenomena, compute data metrics for events or over time series of ocean variables, and efficiently find and access all of the data relevant to their study (and then download only that data).
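
    For illustration only, a minimal sketch of the "differences from climatology" metric mentioned above, using xarray on a single hypothetical NetCDF file; OceanXtremes proposes to run this kind of computation in parallel across the full archive, and the file name, variable name and threshold here are assumptions.

    # Minimal sketch: monthly climatology and per-time-step anomalies from one
    # 3-D (time, lat, lon) NetCDF file, then a simple threshold flag.
    import xarray as xr

    ds = xr.open_dataset("sst_daily.nc")        # hypothetical file name
    sst = ds["sst"]                             # hypothetical variable name

    climatology = sst.groupby("time.month").mean("time")
    anomalies = sst.groupby("time.month") - climatology

    # Flag grid cells exceeding an illustrative threshold above climatology.
    events = anomalies.where(anomalies > 2.0)
    print(events.count().item(), "anomalous cell-time samples")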

  3. Webis Gmane Email Corpus 2019

    • zenodo.org
    Updated Jun 4, 2020
    Cite
    Janek Bevendorff; Khalid Al-Khatib; Martin Potthast; Benno Stein (2020). Webis Gmane Email Corpus 2019 [Dataset]. http://doi.org/10.5281/zenodo.3766985
    Dataset updated
    Jun 4, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Janek Bevendorff; Khalid Al-Khatib; Martin Potthast; Benno Stein
    Description

    The Webis Gmane Email Corpus 2019 is a dataset of more than 153 million parsed and segmented emails crawled between February and May 2019 from gmane.io covering more than 20 years of public mailing lists. The dataset has been published as a resource at ACL 2020.

    The dataset comes as a set of Gzip-compressed files containing line-based JSON in the Elasticsearch bulk format. Each data record consists of two lines:

    {"index": {"_id": "

    The first line is the Elasticsearch index action with a document UUID; the second is the actual parsed email with a (reduced and anonymized) set of headers, the detected language, the original Gmane group name, and the predicted content segments as character spans. The Gzip files are splittable every 1,000 records (line pairs) for parallel processing in, e.g., Hadoop.
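
    For illustration only, a minimal sketch of reading one of these Gzip-compressed bulk files locally in Python, consuming records as pairs of lines; the file name is a placeholder and only the index-action structure shown above is assumed.

    # Minimal sketch: iterate over (index action, email document) line pairs.
    import gzip
    import json

    with gzip.open("gmane-part-00000.jsonl.gz", "rt", encoding="utf-8") as fh:
        while True:
            action_line = fh.readline()
            doc_line = fh.readline()
            if not doc_line:
                break
            doc_id = json.loads(action_line)["index"]["_id"]
            email = json.loads(doc_line)
            # 'email' holds the parsed message: headers, detected language,
            # Gmane group name and predicted segment spans.
            print(doc_id, len(doc_line), "bytes")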

    Available email headers are:

    • message_id
    • date (yyyy-MM-dd HH:mm:ssZZ)
    • subject
    • from
    • to
    • cc
    • in_reply_to
    • references
    • list_id

    Available segment classes are:

    • paragraph
    • closing
    • inline_headers
    • log_data
    • mua_signature
    • patch
    • personal_signature
    • quotation
    • quotation_marker
    • raw_code
    • salutation
    • section_heading
    • tabular
    • technical
    • visual_separator

    Find more information about the dataset and the segmentation model at webis.de.

    If you are using this resource in your work, please cite it as:

    @InProceedings{stein:2020o,
      author =    {Janek Bevendorff and Khalid Al-Khatib and Martin Potthast and Benno Stein},
      booktitle = {58th Annual Meeting of the Association for Computational Linguistics (ACL 2020)},
      month =     jul,
      publisher = {Association for Computational Linguistics},
      site =      {Seattle, USA},
      title =     {{Crawling and Preprocessing Mailing Lists At Scale for Dialog Analysis}},
      year =      2020
    }
    

