License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
An image classification algorithm based on adaptive feature weight updating is proposed to address the low classification accuracy of current single-feature classification algorithms and simple multi-feature fusion algorithms. The MapReduce parallel programming model on the Hadoop platform is used to perform an adaptive fusion of hue, local binary pattern (LBP) and scale-invariant feature transform (SIFT) features extracted from images and to derive an optimal combination of weights. The support vector machine (SVM) classifier is then trained in parallel to obtain the optimal SVM classification model, which is subsequently tested. The Pascal VOC 2012, Caltech 256 and SUN databases are used to build a massive image library. Speedup, classification accuracy and training time are measured in the experiments, and the results show that the system's speedup grows approximately linearly in a cluster environment. Considering hardware cost, time, performance and accuracy, the algorithm outperforms mainstream classification algorithms such as the power mean SVM and the convolutional neural network (CNN). As both the number and the variety of images increase, the classification accuracy exceeds 95%. When the number of images reaches 80,000, the training time of the proposed algorithm is only 1/5 that of traditional single-node algorithms. These results demonstrate the effectiveness of the algorithm and provide a basis for the effective analysis and processing of image big data.
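To give a concrete picture of the fusion step, the following is a minimal single-machine sketch, not the paper's implementation: it assumes precomputed hue, LBP and SIFT feature matrices (illustrative names) and uses a plain grid search over weight combinations with a scikit-learn SVM and cross-validation as a stand-in for the adaptive weight-updating and MapReduce-parallel training described above.

```python
# Minimal single-machine sketch of weighted multi-feature fusion with an SVM.
# X_hue, X_lbp, X_sift and y are assumed, precomputed inputs; the grid search
# below is a simplified stand-in for the adaptive weight update, and the real
# system trains in parallel via MapReduce on Hadoop.
import itertools
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import normalize
from sklearn.svm import SVC


def fuse(features, weights):
    # Concatenate L2-normalized feature blocks, each scaled by its weight.
    return np.hstack([w * normalize(f) for f, w in zip(features, weights)])


def best_weights(features, y, steps=10):
    # Score every weight triple (w1, w2, w3) on a grid where the weights sum
    # to 1 and keep the best cross-validated combination.
    best_score, best_w = -1.0, None
    for i, j in itertools.product(range(steps + 1), repeat=2):
        k = steps - i - j
        if k < 0:
            continue
        w = (i / steps, j / steps, k / steps)
        score = cross_val_score(SVC(kernel="linear"), fuse(features, w), y, cv=3).mean()
        if score > best_score:
            best_score, best_w = score, w
    return best_score, best_w


# Toy usage with random stand-in data (hue / LBP / SIFT bag-of-words vectors).
rng = np.random.default_rng(0)
X_hue, X_lbp, X_sift = rng.random((150, 32)), rng.random((150, 59)), rng.random((150, 128))
y = rng.integers(0, 5, size=150)
print(best_weights([X_hue, X_lbp, X_sift], y))
```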
License: U.S. Government Works, https://www.usa.gov/government-works
License information was derived automatically
Anomaly detection is the process of identifying items, events or observations that do not conform to an expected pattern in a dataset or time series. Current and future missions and our research communities challenge us to rapidly identify features and anomalies in complex and voluminous observations to further science and improve decision support. Given this data-intensive reality, we propose to develop an anomaly detection system, called OceanXtremes, powered by an intelligent, elastic Cloud-based analytic service backend that enables execution of domain-specific, multi-scale anomaly and feature detection algorithms across the entire archive of ocean science datasets.

A parallel analytics engine will be developed as the key computational and data-mining core of OceanXtremes' backend processing. This analytic engine will demonstrate three new technology ideas to provide rapid turnaround on climatology computation and anomaly detection:
1. An adaptation of the Hadoop/MapReduce framework for parallel data mining of science datasets, typically large 3- or 4-dimensional arrays packaged in NetCDF and HDF.
2. An algorithm profiling service to efficiently and cost-effectively scale up hybrid Cloud computing resources based on the needs of scheduled jobs (CPU, memory, network, and bursting from a private Cloud computing cluster to a public cloud provider such as Amazon Cloud services).
3. An extension to industry-standard search solutions (OpenSearch and faceted search) to support shared discovery and exploration of ocean phenomena and anomalies, along with unexpected correlations between key measured variables.

We will use a hybrid Cloud compute cluster (private Eucalyptus on-premise at JPL with bursting to Amazon Web Services) as the operational backend. The key idea is that the parallel data-mining operations will run 'near' the ocean data archives (a local 'network' hop) so that we can efficiently access the thousands of (say, daily) files making up a three-decade time series, and then cache key variables and pre-computed climatologies in a high-performance parallel database.

OceanXtremes will be equipped with both web portal and web service interfaces for users and applications/systems to register and retrieve oceanographic anomaly data. By leveraging technology such as Datacasting (Bingham et al., 2007), users can also subscribe to anomaly or 'event' types of interest and have newly computed anomaly metrics and other information delivered to them via metadata feeds packaged in the standard Rich Site Summary (RSS) format. Upon receiving new feed entries, users can examine the metrics and download the relevant variables, by simply clicking a link, to begin further analysis of the event.

The OceanXtremes web portal will allow users to define their own anomaly or feature types; continuous backend processing will then be scheduled to populate each new user-defined anomaly type by executing the chosen data-mining algorithm (e.g., differences from climatology or gradients above a specified threshold). Metadata on the identified anomalies will be cataloged, including temporal and geospatial profiles, key physical metrics, related observational artifacts and other relevant metadata, to facilitate discovery, extraction and visualization. Products created by the anomaly detection algorithms will be made explorable and subsettable using Webification (Huang et al., 2014) and OPeNDAP (http://opendap.org) technologies.
Using this platform, scientists can efficiently search for anomalies or ocean phenomena, compute data metrics for events or over time series of ocean variables, and find and access all of the data relevant to their study (and then download only that data).
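To make the simplest detector mentioned above concrete, here is a toy NumPy sketch of "difference from climatology" anomaly flagging on an in-memory gridded variable. The array layout (time, lat, lon), the synthetic data and the 2-sigma threshold are assumptions for illustration only; the actual system would run the equivalent computation with Hadoop/MapReduce over NetCDF/HDF archives.

```python
# Toy sketch of "difference from climatology" anomaly detection on a gridded
# ocean variable (e.g., sea surface temperature). All values are synthetic
# and the layout/threshold are illustrative, not OceanXtremes' configuration.
import numpy as np

rng = np.random.default_rng(1)
n_days = 10 * 365                                   # ten years of daily fields
time_month = (np.arange(n_days) // 30) % 12         # crude month index per day
sst = (15
       + 10 * np.sin(2 * np.pi * time_month[:, None, None] / 12)  # seasonal cycle
       + rng.normal(0, 1, size=(n_days, 18, 36)))                 # noise on an 18x36 grid

# 1. Monthly climatology: mean and spread of each calendar month, per grid cell.
climatology = np.stack([sst[time_month == m].mean(axis=0) for m in range(12)])
spread = np.stack([sst[time_month == m].std(axis=0) for m in range(12)])

# 2. Anomaly = observation minus the climatological mean for its month.
anomaly = sst - climatology[time_month]

# 3. Flag grid cells whose anomaly exceeds a threshold (here 2 sigma).
events = np.abs(anomaly) > 2 * spread[time_month]
print("flagged cells:", int(events.sum()), "of", events.size)
```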
The Webis Gmane Email Corpus 2019 is a dataset of more than 153 million parsed and segmented emails crawled between February and May 2019 from gmane.io, covering more than 20 years of public mailing lists. The dataset has been published as a resource at ACL 2020.
The dataset comes as a set of Gzip-compressed files containing line-based JSON in the Elasticsearch bulk format. Each data record consists of two lines:
{"index": {"_id": "
The first line is the Elasticsearch index action with a document UUID; the second is the actual parsed email with a (reduced and anonymized) set of headers, the detected language, the original Gmane group name and the predicted content segments as character spans. The Gzip files can be split every 1,000 records (line pairs) for parallel processing in, e.g., Hadoop.
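As an illustration of how the two-line bulk records can be consumed, the sketch below streams one Gzip-compressed file with the Python standard library and yields (UUID, email) pairs; the file name is a placeholder, and no assumption is made about the email record's field names.

```python
# Stream a corpus file in Elasticsearch bulk format: records come as pairs of
# lines (index action, then the parsed email). "corpus-part.jsonl.gz" is a
# placeholder, not an actual file name from the dataset.
import gzip
import json
from itertools import islice


def iter_records(path):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        while True:
            action = f.readline()
            doc = f.readline()
            if not doc:          # end of file reached
                break
            yield json.loads(action)["index"]["_id"], json.loads(doc)


# Print the UUIDs of the first three emails.
for uuid, email in islice(iter_records("corpus-part.jsonl.gz"), 3):
    print(uuid)
```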
Available email headers are:
Available segment classes are:
Find more information about the dataset and the segmentation model at webis.de.
If you are using this resource in your work, please cite it as:
@InProceedings{stein:2020o,
author = {Janek Bevendorff and Khalid Al-Khatib and Martin Potthast and Benno Stein},
booktitle = {58th Annual Meeting of the Association for Computational Linguistics (ACL 2020)},
month = jul,
publisher = {Association for Computational Linguistics},
site = {Seattle, USA},
title = {{Crawling and Preprocessing Mailing Lists At Scale for Dialog Analysis}},
year = 2020
}