Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Recently, big data and its applications have grown sharply in various fields such as IoT, bioinformatics, e-commerce, and social media. The huge volume of data imposes enormous challenges on the architecture, infrastructure, and computing capacity of IT systems, so large-scale, robust computing systems are a compelling need of the scientific and industrial communities. Since one of the characteristics of big data is value, data should be published so that analysts can extract useful patterns from it. However, data publishing may lead to the disclosure of individuals’ private information. Among modern parallel computing platforms, Apache Spark is a fast, in-memory computing framework for large-scale data processing that provides high scalability by introducing resilient distributed datasets (RDDs). In terms of performance, its in-memory computations make it up to 100 times faster than Hadoop. Apache Spark is therefore one of the essential frameworks for implementing distributed methods for privacy-preserving big data publishing (PPBDP). This paper uses the RDD programming model of Apache Spark to propose an efficient parallel implementation of a new computing model for big data anonymization. The computing model consists of three phases of in-memory computation that address the runtime, scalability, and performance of large-scale data anonymization. It supports partition-based data clustering algorithms that preserve the ℓ-diversity privacy model using transformations and actions on RDDs. Accordingly, the authors investigate a Spark-based implementation that preserves the ℓ-diversity privacy model with two designed distance functions, City block and Pearson. The results of the paper provide a comprehensive guideline enabling researchers to apply Apache Spark in their own research.
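The abstract mentions transformations and actions on RDDs together with two designed distance functions, but includes no code. As a minimal, hedged sketch (in Scala, Spark's native language; the function names, toy records, and centroid below are illustrative assumptions, not the paper's implementation), this shows how City block (Manhattan) and Pearson-based distances could be evaluated over an RDD:

```scala
import org.apache.spark.sql.SparkSession

object DistanceSketch {
  // City block (Manhattan) distance between two numeric quasi-identifier vectors.
  def cityBlock(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => math.abs(x - y) }.sum

  // Pearson-based distance: 1 - r, so perfectly correlated vectors have distance 0.
  // (Assumes non-constant vectors; a constant vector yields a zero denominator.)
  def pearson(a: Array[Double], b: Array[Double]): Double = {
    val n = a.length
    val (mA, mB) = (a.sum / n, b.sum / n)
    val cov = a.zip(b).map { case (x, y) => (x - mA) * (y - mB) }.sum
    val sdA = math.sqrt(a.map(x => (x - mA) * (x - mA)).sum)
    val sdB = math.sqrt(b.map(y => (y - mB) * (y - mB)).sum)
    1.0 - cov / (sdA * sdB)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("distance-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Toy numeric quasi-identifier vectors (e.g., age, income); purely illustrative.
    val records = sc.parallelize(Seq(
      Array(34.0, 50000.0), Array(36.0, 52000.0), Array(60.0, 90000.0)
    ))
    val centroid = Array(35.0, 51000.0) // an arbitrary cluster centre for the sketch

    // Transformation (map) computes each record's distance; action (collect) returns it.
    records.map(r => cityBlock(r, centroid)).collect().foreach(println)

    spark.stop()
  }
}
```

A partition-based clustering pass would repeat such distance evaluations while grouping records, then generalize each cluster so that its sensitive attribute remains sufficiently diverse (at least ℓ well-represented values).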
This dataset was created by Sureya Subramanian
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
We will construct SciSpark, a scalable system for interactive model evaluation and for the rapid development of climate metrics and analyses. SciSpark directly leverages the Apache Spark technology and its notion of Resilient Distributed Datasets (RDDs). An RDD represents an immutable data set that can be reused across multi-stage operations, partitioned across multiple machines, and automatically reconstructed if a partition is lost. The RDD notion directly enables the reuse of array data across multi-stage operations, and it ensures data can be replicated, distributed, and easily reconstructed in different storage tiers, e.g., memory for fast interactivity, SSDs for near-real-time availability, and I/O-oriented spinning disk for later operations. RDDs also allow Spark's performance to degrade gracefully when there is not sufficient memory available to the system. It may seem surprising to consider an in-memory solution for massive datasets; however, a recent study found that at Facebook 96% of active jobs could have their entire data inputs in memory at the same time. In addition, Spark has been shown to be 100x faster in memory and 10x faster on disk than Apache Hadoop, the de facto industry platform for Big Data. Hadoop scales well, and there are emerging examples of its use in NASA climate projects (e.g., Teng et al. and Schnase et al.), but as is being discovered in these projects, Hadoop is most suited for batch processing and long-running operations. SciSpark contributes a Scientific RDD that corresponds to a multi-dimensional array representing a scientific measurement subset by space or by time. Scientific RDDs can be created in a handful of ways: (1) directly loading HDF and NetCDF data into the Hadoop Distributed File System (HDFS); (2) creating a partition or split function that divides up a multi-dimensional array by space or time; (3) taking the results of a regridding operation or a climate metrics computation; or (4) telling SciSpark to cache an existing Scientific RDD (sRDD), keeping it cached in memory for data reuse between stages. Scientific RDDs will form the basis for a variety of advanced and interactive climate analyses, starting by default in memory, and then being cached and replicated to disk when not directly needed. SciSpark will also use the Shark interactive SQL technology, which allows structured query language (SQL) to be used to store and retrieve RDDs, and will use Apache Mesos to be a good tenant in cloud environments, interoperating with other data system frameworks (e.g., HDFS, iRODS, SciDB, etc.).
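To make the sRDD ideas above concrete, here is a minimal Spark sketch in Scala. It is not SciSpark's actual API: the TimeSlice type and the synthetic grids are assumptions for illustration. It shows an RDD split by time, cached with a memory-and-disk storage level so performance degrades gracefully, and reused across operations:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object SRddSketch {
  // One time step of a gridded variable; a stand-in for an HDF/NetCDF subset.
  case class TimeSlice(t: Int, grid: Array[Array[Double]])

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("srdd-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Synthesize 10 time steps of a 4x4 grid (replace with a real HDF/NetCDF loader).
    val slices = (0 until 10).map(t => TimeSlice(t, Array.fill(4, 4)(t.toDouble)))

    // "Split by time": one record per time step, partitioned across workers.
    val srdd = sc.parallelize(slices, numSlices = 5)

    // Keep it in memory, spilling partitions to disk when memory runs short.
    srdd.persist(StorageLevel.MEMORY_AND_DISK)

    // Reuse the cached RDD across multi-stage operations, e.g., per-step spatial means.
    val means = srdd.map(s => (s.t, s.grid.flatten.sum / s.grid.map(_.length).sum)).collect()
    means.foreach { case (t, m) => println(s"t=$t mean=$m") }

    spark.stop()
  }
}
```

Losing a partition here would trigger Spark's lineage-based recomputation rather than a job failure, which is the reconstruction property the paragraph above relies on.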
One of the key components of SciSpark is interactive sRDD visualization; to accomplish this, SciSpark delivers a user interface built around the Data-Driven Documents (D3) framework. D3 is an immersive, JavaScript-based technology that exploits the underlying Document Object Model (DOM) structure of the web to create histograms, cartographic displays, and inspections of climate variables and statistics.
SciSpark is evaluated using several topical iterative scientific algorithms inspired by the NASA RCMES project, including machine-learning (ML) based clustering of temperature PDFs and other quantities over North America, and graph-based algorithms for searching for Mesoscale Convective Complexes in West Africa.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
pone.0285212.t005 - A distributed computing model for big data anonymization in the networks
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
F1-measure criterion on the anonymized Poker Hand dataset for ℓ = 4, k′ = 3, and different values of k.
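For context, the F1-measure named in this caption is the standard harmonic mean of precision and recall (a general definition, not specific to this dataset):

```latex
F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
```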