This dataset was created by Deepansh Saxena.
182 simulated datasets (the first set contains small datasets and the second set contains large datasets) with different cluster compositions, i.e., different numbers of clusters and separation values, generated using the clusterGeneration package in R. Each set consists of 91 datasets in comma-separated values (CSV) format (182 CSV files in total), with 3-15 clusters and separation values from 0.1 to 0.7. Separation values can range between (-0.999, 0.999), where a higher separation value indicates a cluster structure with more separable clusters. The size of the dataset, the number of clusters, and the separation value of the clusters are encoded in the file name, size_X_n_Y_sepval_Z.csv:
- X = size of the dataset
- Y = number of clusters in the dataset
- Z = separation value of the clusters in the dataset
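The naming scheme can be decoded with a small helper; the example file name below is illustrative, not necessarily one present in the archive:

```python
import re

def parse_simulation_filename(filename):
    """Extract size, cluster count, and separation value from a
    size_X_n_Y_sepval_Z.csv file name (scheme described above)."""
    m = re.match(r"size_(\d+)_n_(\d+)_sepval_([0-9.]+)\.csv$", filename)
    if m is None:
        raise ValueError(f"unexpected file name: {filename}")
    return {
        "size": int(m.group(1)),
        "n_clusters": int(m.group(2)),
        "sepval": float(m.group(3)),
    }

print(parse_simulation_filename("size_500_n_7_sepval_0.4.csv"))
```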
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The synthetic data set has 600 points forming 20 clusters of 30 points each in 2 dimensions. The offset between a given point and its true center in each dimension is Rand[0.02, 0.04] * G, where G is a random Gaussian number.
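The generation rule can be reproduced with a short sketch. Only the offset rule is specified above, so the cluster centers below are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
n_clusters, points_per_cluster, dims = 20, 30, 2

# Hypothetical cluster centers; the original centers are not specified here.
centers = rng.uniform(0.0, 1.0, size=(n_clusters, dims))

points = []
for c in centers:
    # Per-dimension offset: Rand[0.02, 0.04] * G, with G a standard Gaussian draw.
    scale = rng.uniform(0.02, 0.04, size=(points_per_cluster, dims))
    g = rng.standard_normal((points_per_cluster, dims))
    points.append(c + scale * g)

data = np.vstack(points)
print(data.shape)  # (600, 2)
```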
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These data sets were originally created for the following publications:
M. E. Houle, H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek: Can Shared-Neighbor Distances Defeat the Curse of Dimensionality? In Proceedings of the 22nd International Conference on Scientific and Statistical Database Management (SSDBM), Heidelberg, Germany, 2010.
H.-P. Kriegel, E. Schubert, A. Zimek: Evaluation of Multiple Clustering Solutions. In 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings, held in conjunction with ECML PKDD 2011, Athens, Greece, 2011.
The outlier data set versions were introduced in:
E. Schubert, R. Wojdanowski, A. Zimek, H.-P. Kriegel: On Evaluation of Outlier Rankings and Outlier Scores. In Proceedings of the 12th SIAM International Conference on Data Mining (SDM), Anaheim, CA, 2012.
They are derived from the original image data available at https://aloi.science.uva.nl/
The image acquisition process is documented in the original ALOI work: J. M. Geusebroek, G. J. Burghouts, and A. W. M. Smeulders: The Amsterdam Library of Object Images, Int. J. Comput. Vision, 61(1), 103-112, January 2005.
Additional information is available at: https://elki-project.github.io/datasets/multi_view
The following views are currently available:
| Feature type | Description | Files |
|---|---|---|
| Object number | Sparse 1000-dimensional vectors that give the true object assignment | objs.arff.gz |
| RGB color histograms | Standard RGB color histograms (uniform binning) | aloi-8d.csv.gz, aloi-27d.csv.gz, aloi-64d.csv.gz, aloi-125d.csv.gz, aloi-216d.csv.gz, aloi-343d.csv.gz, aloi-512d.csv.gz, aloi-729d.csv.gz, aloi-1000d.csv.gz |
| HSV color histograms | Standard HSV/HSB color histograms in various binnings | aloi-hsb-2x2x2.csv.gz, aloi-hsb-3x3x3.csv.gz, aloi-hsb-4x4x4.csv.gz, aloi-hsb-5x5x5.csv.gz, aloi-hsb-6x6x6.csv.gz, aloi-hsb-7x7x7.csv.gz, aloi-hsb-7x2x2.csv.gz, aloi-hsb-7x3x3.csv.gz, aloi-hsb-14x3x3.csv.gz, aloi-hsb-8x4x4.csv.gz, aloi-hsb-9x5x5.csv.gz, aloi-hsb-13x4x4.csv.gz, aloi-hsb-14x5x5.csv.gz, aloi-hsb-10x6x6.csv.gz, aloi-hsb-14x6x6.csv.gz |
| Color similarity | Average similarity to 77 reference colors (not histograms): 18 colors x 2 saturations x 2 brightnesses + 5 grey values (incl. white, black) | aloi-colorsim77.arff.gz (feature subsets are meaningful here, as these features are computed independently of each other) |
| Haralick features | First 13 Haralick features (radius 1 pixel) | aloi-haralick-1.csv.gz |
| Front to back | Vectors representing front faces vs. back faces of individual objects | front.arff.gz |
| Basic light | Vectors indicating basic light situations | light.arff.gz |
| Manual annotations | Manually annotated groups of semantically related objects, such as cups | manual1.arff.gz |
Outlier Detection Versions
Additionally, we generated a number of subsets for outlier detection:
| Feature type | Description | Files |
|---|---|---|
| RGB histograms | Downsampled to 100,000 objects (553 outliers) | aloi-27d-100000-max10-tot553.csv.gz, aloi-64d-100000-max10-tot553.csv.gz |
| RGB histograms | Downsampled to 75,000 objects (717 outliers) | aloi-27d-75000-max4-tot717.csv.gz, aloi-64d-75000-max4-tot717.csv.gz |
| RGB histograms | Downsampled to 50,000 objects (1,508 outliers) | aloi-27d-50000-max5-tot1508.csv.gz, aloi-64d-50000-max5-tot1508.csv.gz |
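The outlier file names encode the dimensionality, sample count, and outlier total, which can be recovered programmatically. The interpretation of the max<k> component as a per-class cap used during downsampling is an assumption:

```python
import re

def parse_outlier_filename(name):
    """Decode the aloi-<dims>d-<size>-max<k>-tot<outliers>.csv.gz
    pattern used by the outlier subsets listed above."""
    m = re.match(r"aloi-(\d+)d-(\d+)-max(\d+)-tot(\d+)\.csv\.gz$", name)
    if m is None:
        raise ValueError(name)
    dims, size, max_per_class, total_outliers = map(int, m.groups())
    # max_per_class: assumed cap on objects per downsampled class.
    return {"dims": dims, "size": size,
            "max_per_class": max_per_class, "total_outliers": total_outliers}

print(parse_outlier_filename("aloi-27d-75000-max4-tot717.csv.gz"))
```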
Summary
This dataset (ml-25m) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 25,000,095 ratings and 1,093,360 tag applications across 62,423 movies. These data were created by 162,541 users between January 9, 1995 and November 21, 2019. The dataset was generated on November 21, 2019.
Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.
The data are contained in the files genome-scores.csv, genome-tags.csv, links.csv, movies.csv, ratings.csv, and tags.csv. More details about the contents and use of all these files follow.
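As a sketch of how the files fit together (not from the dataset documentation; the column names follow the customary MovieLens layout and should be verified against the README included in the download):

```python
import pandas as pd

# Tiny in-memory stand-ins for ratings.csv and movies.csv; the column
# names (userId, movieId, rating, timestamp / movieId, title, genres)
# are the usual MovieLens layout, assumed here rather than quoted.
ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 20, 10],
    "rating": [4.0, 3.5, 5.0],
    "timestamp": [964982703, 964981247, 964982224],
})
movies = pd.DataFrame({
    "movieId": [10, 20],
    "title": ["GoldenEye (1995)", "Money Train (1995)"],
    "genres": ["Action|Adventure|Thriller", "Action|Comedy|Crime|Drama"],
})

# Average rating per movie, joined back to the titles.
avg = ratings.groupby("movieId")["rating"].mean().reset_index()
summary = movies.merge(avg, on="movieId")
print(summary[["title", "rating"]])
```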
This and other GroupLens data sets are publicly available for download at
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SDOstreamclust Evaluation Tests conducted for the paper: Stream Clustering Robust to Concept Drift

Context and methodology

SDOstreamclust is a stream clustering algorithm able to process data incrementally or per batches. It combines the earlier SDOstream (anomaly detection in data streams) and SDOclust (static clustering). SDOstreamclust retains the characteristics of SDO algorithms: lightweight, intuitive, self-adjusting, resistant to noise, capable of identifying non-convex clusters, and built on robust parameters and interpretable models. Moreover, it shows excellent adaptation to concept drift.

In this repository, SDOstreamclust is evaluated with 165 datasets (both synthetic and real) and compared with CluStream, DBstream, DenStream, and StreamKMeans. This repository is framed within research on the following domains: algorithm evaluation, stream clustering, unsupervised learning, machine learning, data mining, and streaming data analysis. The datasets and algorithms can be used for experiment replication and for further evaluation and comparison.

Docker

A Docker version is also available at: https://hub.docker.com/r/fiv5/sdostreamclust

Technical details

Experiments are conducted in Python v3.8.14. The file and folder structure is as follows:
- [algorithms] contains a script with functions related to algorithm configurations.
- [data] contains datasets in ARFF format.
- [results] contains CSV files with the algorithms' performances, obtained by running the "run.sh" script (as shown in the paper).
- "dependencies.sh" lists and installs the Python dependencies.
- "pysdoclust-stream-main.zip" contains the SDOstreamclust Python package.
- "README.md" gives details and instructions for using this repository.
- "run.sh" runs the complete experiments.
- "run_comp.py" runs experiments specified by arguments.
- "TSindex.py" implements functions for the Temporal Silhouette index.
Note: if codes in SDOstreamclust are modified, SWIG (v4.2.1) wrappers have to be rebuilt and SDOstreamclust consequently reinstalled with pip.
About the MNAD Dataset
The MNAD corpus is a collection of more than 1 million Moroccan news articles written in modern Arabic. These news articles were gathered from 11 prominent electronic news sources. The dataset is made available to the academic community for research purposes, such as data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), and other non-commercial activities.
Dataset Fields
- Title: the title of the article
- Body: the body of the article
- Category: the category of the article
- Source: the electronic newspaper source of the article
About Version 1 of the Dataset (MNAD.v1)
Version 1 of the dataset comprises 418,563 articles classified into 19 categories. The data was collected from well-known electronic news sources, namely Akhbarona.ma, Hespress.ma, Hibapress.com, and Le360.com. The articles were stored in four separate CSV files, each corresponding to the news website source. Each CSV file contains three fields: Title, Body, and Category of the news article.
The dataset is rich in Arabic vocabulary, with approximately 906,125 unique words. It has been utilized as a benchmark in the research paper: "A Moroccan News Articles Dataset (MNAD) For Arabic Text Categorization". In 2021 International Conference on Decision Aid Sciences and Application (DASA).
This dataset is available for download from the following sources: - Kaggle Datasets : MNADv1 - Huggingface Datasets: MNADv1
About Version 2 of the Dataset (MNAD.v2)
Version 2 of the MNAD dataset includes an additional 653,901 articles, bringing the total number of articles to over 1 million (1,069,489), classified into the same 19 categories as in version 1. The new documents were collected from seven additional prominent Moroccan news websites, namely al3omk.com, medi1news.com, alayam24.com, anfaspress.com, alyaoum24.com, barlamane.com, and SnrtNews.com.
The newly collected articles have been merged with the articles from the previous version into a single CSV file named MNADv2.csv. This file includes an additional column called "Source" to indicate the source of each news article.
Furthermore, MNAD.v2 incorporates improved pre-processing techniques and data cleaning methods. These enhancements involve removing duplicates, eliminating multiple spaces, discarding rows with NaN values, replacing new lines with " ", excluding very long and very short articles, and removing non-Arabic articles. These additions and improvements aim to enhance the usability and value of the MNAD dataset for researchers and practitioners in the field of Arabic text analysis.
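A rough sketch of the described cleaning steps on a toy frame; the exact length cut-offs and column handling are assumptions, not taken from the paper:

```python
import pandas as pd

def clean_articles(df):
    """Sketch of the cleaning described above: drop duplicates and NaN
    rows, collapse repeated whitespace, and discard very short or very
    long articles. Thresholds below are assumed, not from the paper."""
    df = df.drop_duplicates(subset=["Title", "Body"])
    df = df.dropna(subset=["Title", "Body", "Category"]).copy()
    df["Body"] = df["Body"].str.replace(r"\s+", " ", regex=True).str.strip()
    lengths = df["Body"].str.split().str.len()
    # Assumed length bounds (in words) for "very short" / "very long".
    return df[(lengths >= 5) & (lengths <= 5000)].reset_index(drop=True)

sample = pd.DataFrame({
    "Title": ["a", "a", "b", "c"],
    "Body": ["w1  w2   w3 w4 w5", "w1  w2   w3 w4 w5", None, "short"],
    "Category": ["sport", "sport", "politics", "economy"],
})
print(clean_articles(sample))
```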
This dataset is available for download from the following sources: - Kaggle Datasets : MNADv2 - Huggingface Datasets: MNADv2
Citation
If you use our data, please cite the following paper:
@inproceedings{MNAD2021,
  author    = {Mourad Jbene and Smail Tigani and Rachid Saadane and Abdellah Chehri},
  title     = {A Moroccan News Articles Dataset ({MNAD}) For Arabic Text Categorization},
  year      = {2021},
  publisher = {{IEEE}},
  booktitle = {2021 International Conference on Decision Aid Sciences and Application ({DASA})},
  doi       = {10.1109/dasa53625.2021.9682402},
  url       = {https://doi.org/10.1109/dasa53625.2021.9682402},
}
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MIBI-TOF data for lymph node dataset reported in Liu et al., Robust phenotyping of highly multiplexed tissue imaging data using pixel-level clustering
1. mibi_single_channel_tifs.zip: Single-channel MIBI-TOF images
Folders are labeled according to the field-of-view (FOV) number. Each folder contains single-channel TIFFs for each marker in the panel. Images are 1024x1024 pixels, 500 um. See paper for details.
2. segmentation.zip: Segmentation output of MIBI-TOF images
Cell segmentation was performed using Mesmer (Greenwald NF, Nature Biotechnology 2021). Output of Mesmer that delineates the single cells in each of the images is included.
3. source_data.zip: Source data files for figures
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MusicOSet is an open and enhanced dataset of musical elements (artists, songs, and albums) based on musical popularity classification. It provides a directly accessible collection of data suitable for numerous tasks in music data mining (e.g., data visualization, classification, clustering, similarity search, MIR, HSS, and so forth). To create MusicOSet, the potential information sources were divided into three main categories: music popularity sources, metadata sources, and acoustic and lyrical feature sources. Data from all three categories were initially collected between January and May 2019; the data were then updated and enhanced in June 2019.
The attractive features of MusicOSet include:
| Data | # Records |
|:-----------------:|:---------:|
| Songs | 20,405 |
| Artists | 11,518 |
| Albums | 26,522 |
| Lyrics | 19,664 |
| Acoustic Features | 20,405 |
| Genres | 1,561 |
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a part of BenchStab, a command-line tool for querying and benchmarking web-based protein stability predictors. We created the dataset to independently evaluate 18 structure-enabled and 4 sequence-based predictors of stability change upon mutation. We suggest that this dataset should be excluded from the training and validation of future stability predictors.
The dataset consists of single-point mutations and their experimentally determined ΔΔG from FireProtDB, using only records with both a ΔΔG measurement and a PDB accession code available. We eliminated all records similar to the data used in the training set of any of the predictors considered in BenchStab, using UniRef50 clusters. This resulted in 289 records for 36 proteins, of which 28% display a stabilizing effect (negative value of ΔΔG; see DDG distribution.png for the exact distribution). We further confirmed, by employing SCOP fold-based structure clustering, that the folds of 25 of our proteins were not present in the training sets.
The file dataset.csv contains specifications of the mutations (including the chain) and the ground-truth ΔΔG reported in the literature, alongside accession codes from FireProtDB (experiment ID), UniProt, and the Protein Data Bank, and UniRef50 cluster IDs. The file benchstab_input.csv contains the same data in the input format of the BenchStab tool.
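The stabilizing fraction mentioned above (negative ΔΔG) can be recomputed in a few lines; the frame below is a toy stand-in, and the actual ΔΔG column name in dataset.csv may differ:

```python
import pandas as pd

# Toy stand-in for dataset.csv; only a ddG column matters here, and
# its real name in the release may differ.
df = pd.DataFrame({"ddG": [-1.2, 0.4, 2.1, -0.3, 0.0, 1.5]})

# Stabilizing mutations are those with a negative ddG, as defined above.
stabilizing = (df["ddG"] < 0).mean()
print(f"{stabilizing:.0%} stabilizing")
```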
For more statistics and details about the dataset, please read the supplement of the paper or get in touch with us.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains more than 21M hierarchical relationships about ≈10M topics extracted from the Freebase knowledge base. The topics span the various categories of Freebase, including Science & Technology, Arts & Entertainment, Sports, Society, Products & Services, Transportation, Time & Space, Special Interests, and Commons. The relationships describe the hierarchies of topics in terms of Types, Domains, and Categories. For example, ‘Albert Einstein’ can be found as a topic that is a sub-class of ‘Person’, belonging to the ‘People’ domain and the ‘Society’ category. Another entity named ‘Albert Einstein’ can also be found as a sub-class of ‘Book’, belonging to the ‘Books’ domain and the ‘Arts & Entertainment’ category. The dataset is published in JSON and CSV formats; sample files are provided to help explore how the dataset is structured. The dataset is believed to be useful for studying the inter-related connections among topics in different domains of knowledge. The first author may be contacted at mahmoud.elbattah@nuigalway.ie for more information. The following paper may kindly be cited in case of using the dataset: Mahmoud Elbattah, Mohamed Roushdy, Mostafa Aref, Abdel-Badeeh M. Salem. “Large-Scale Entity Clustering Using Graph-Based Structural Similarity within Knowledge Graphs”, Big Data Analytics: Tools, Technology for Effective Planning, CRC Press. https://www.researchgate.net/publication/321716589_Large-Scale_Entity_Clustering_Based_on_Structural_Similarity_within_Knowledge_Graphs
Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
The NUIG_EyeGaze01 (labelled eye gaze dataset) is a rich and diverse gaze dataset, built from eye gaze data collected under a wide range of operating conditions on three user platforms (desktop, laptop, tablet). Gaze data were collected under one condition at a time.
The dataset includes gaze (fixation) data collected under 17 different head poses, 4 user distances, 6 platform poses, and 3 display screen sizes and resolutions. Each gaze data file is labelled with the operating condition under which it was collected and has the name format USERNUMBER_CONDITION_PLATFORM.CSV
CONDITION:
- RP: roll plus, in degrees
- PP: pitch plus, in degrees
- YP: yaw plus, in degrees
- RM: roll minus, in degrees
- PM: pitch minus, in degrees
- YM: yaw minus, in degrees
- 50, 60, 70, 80: user distances

PLATFORM:
- desk: Desktop
- lap: Laptop
- tab: Tablet
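A file name following this scheme can be split programmatically; the example names below are hypothetical illustrations, not actual files from the dataset:

```python
import re

def parse_gaze_filename(name):
    """Split a USERNUMBER_CONDITION_PLATFORM.CSV name into its parts,
    following the naming scheme described above."""
    m = re.match(r"(\d+)_([A-Za-z0-9]+)_(desk|lap|tab)\.(?:csv|CSV)$", name)
    if m is None:
        raise ValueError(name)
    user, condition, platform = m.groups()
    platform_names = {"desk": "Desktop", "lap": "Laptop", "tab": "Tablet"}
    return {"user": int(user), "condition": condition,
            "platform": platform_names[platform]}

print(parse_gaze_filename("07_RP20_desk.csv"))
```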
Desktop display: 22 inch, 1680 x 1050 pixels. Laptop display: 14 inch, 1366 x 768 pixels. Tablet display: 10.1 inch, 1920 x 800 pixels.
Eye tracker accuracy: 0.5 degrees (for neutral head and tracker position)
The dataset has 3 folders, “Desktop”, “Laptop”, and “Tablet”, containing gaze data from the respective platforms. The Desktop folder has 2 sub-folders, user_distance and head_pose, holding data for different user distances and head poses (neutral, roll, pitch, yaw) measured with the desktop setup. The Tablet folder has 2 sub-folders, user_distance and tablet_pose, holding data for different user distances and tablet+tracker poses (neutral, roll, pitch, yaw) measured with the tablet setup. The Laptop folder has one sub-folder, user_distance, with data for different user distances measured with the laptop setup.
All data files are in CSV format. Each file contains the following data header fields:
("TIM REL","GTX", "GTY","XRAW", "YRAW","GT Xmm", "GT Ymm","Xmm", "Ymm","YAW GT", "YAW DATA","PITCH GT", "PITCH DATA","GAZE GT","GAZE ANG", "DIFF GZ", "AOI_IND","AOI_X","AOI_Y","MEAN_ERR","STD ERR")
The meanings of the header fields are as follows:
- TIM REL: relative time stamp for each gaze data point (measured during data collection)
- "GTX", "GTY": ground truth x, y positions in pixels
- "XRAW", "YRAW": raw gaze data x, y coordinates in pixels
- "GT Xmm", "GT Ymm": ground truth x, y positions in mm
- "Xmm", "Ymm": gaze x, y positions in mm
- "YAW GT", "YAW DATA": ground truth and estimated yaw angles
- "PITCH GT", "PITCH DATA": ground truth and estimated pitch angles
- "GAZE GT", "GAZE ANG": ground truth and estimated gaze angles
- "DIFF GZ": gaze angular accuracy
- "AOI_IND", "AOI_X", "AOI_Y": index of the stimuli locations and their x, y coordinates
- "MEAN_ERR", "STD ERR": mean and standard deviation of error at the stimuli locations
For more details on the purpose of this dataset and the data collection method, please consult the paper by the authors of this dataset:
Anuradha Kar, Peter Corcoran: Performance Evaluation Strategies for Eye Gaze Estimation Systems with Quantitative Metrics and Visualizations. Sensors 18(9): 3151 (2018)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains datasets for the manuscript "Evaluating scalable supervised learning for synthesize-on-demand chemical libraries":
- enamine_top_10000.csv.gz: contains compound information from Enamine as well as RF model scores, chemical feature representations, and clustering results.
- cdd_training_data.tar.gz: contains 441,900 rows.
- training_folds.tar.gz: training folds merged for convenience; contains 427,300 compounds.
- ams_order_results.csv.gz
- master_df.csv.gz
If you use these datasets in a publication, please cite:
Moayad Alnammi, Shengchao Liu, Spencer S. Ericksen, Gene E. Ananiev, Andrew F. Voter, Song Guo, James L. Keck, F. Michael Hoffmann, Scott A. Wildman, Anthony Gitter. Evaluating scalable supervised learning for synthesize-on-demand chemical libraries. 2021.
See PubChem AID 1272365 and the associated publications for the original PriA-SSB screening data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These files contain the 1-minute resolution dataset (“labeled_sunside_data.csv”) and 15 minute or longer region list (“
We ask that, if you use any part of the dataset, you cite Toy-Edens et al., "Classifying 8 Years of MMS Dayside Plasma Regions via Unsupervised Machine Learning" (submitted for review and publication to the Journal of Geophysical Research: Space Research, January 2024; DOI to be created upon acceptance).
This work was funded by grant 2225463 from the NSF GEM program.
The following tables detail the contents of the described files:
labeled_sunside_data.csv description

| Column Name | Description |
|---|---|
| Epoch | Epoch in datetime |
| probe | MMS probe name |
| ratio_max_width | Ratio of the width of the most prominent ion spectra peak (in number of energy channels) to the max number of energy channels. See paper for more information |
| ratio_high_low | Ratio of the mean of the log intensity of high energies in the ion spectra to the mean of the log intensity of low energies in the ion spectra. See paper for more information |
| norm_Btot | Magnitude of the total magnetic field normalized to 50 nT. See paper for more information |
| small_energy_mean | The denominator in ratio_high_low |
| large_energy_mean | The numerator in ratio_high_low |
| temp_total | Total temperature from the DIS moments. See paper for more information |
| r_gse_x | x position of the spacecraft in GSE |
| r_gse_y | y position of the spacecraft in GSE |
| r_gse_z | z position of the spacecraft in GSE |
| r_gsm_x | x position of the spacecraft in GSM |
| r_gsm_y | y position of the spacecraft in GSM |
| r_gsm_z | z position of the spacecraft in GSM |
| mlat | Magnetic latitude of the spacecraft |
| mlt | Magnetic local time of the spacecraft |
| raw_named_label | Raw cluster-assigned plasma region label (allowed values: magnetosheath, magnetosphere, solar wind, ion foreshock) |
| modified_named_label | Cleansed cluster-assigned plasma region label (use these unless you have a specific reason to use the raw labels). See paper for more information |
| transition_name | Transition names (e.g., quasi-perpendicular bow shock, magnetopause). See paper for more information |
Region list description

| Column Name | Description |
|---|---|
| start | Starting Epoch in datetime |
| stop | Stopping Epoch in datetime |
| probe | MMS probe name |
| region | Cleansed cluster name associated with the 1-minute resolution "modified_named_label" |
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Objective: Daily COVID-19 data reported by the World Health Organization (WHO) may provide the basis for political ad hoc decisions, including travel restrictions. Data reported by countries, however, are heterogeneous, and metrics to evaluate their quality are scarce. In this work, we analyzed COVID-19 case counts provided by WHO and developed tools to evaluate country-specific reporting behaviors.

Methods: In this retrospective cross-sectional study, COVID-19 data reported daily to WHO from 3rd January 2020 until 14th June 2021 were analyzed. We proposed the concepts of binary reporting rate and relative reporting behavior and performed descriptive analyses for all countries with these metrics. We developed a score to evaluate the consistency of incidence and binary reporting rates. Further, we performed spectral clustering of the binary reporting rate and relative reporting behavior to identify salient patterns in these metrics.

Results: Our final analysis included 222 countries and regions. Reporting scores varied between -0.17, indicating discrepancies between incidence and binary reporting rate, and 1.0, suggesting high consistency of these two metrics. The median reporting score for all countries was 0.71 (IQR 0.55 to 0.87). Descriptive analyses of the binary reporting rate and relative reporting behavior showed constant reporting with a slight "weekend effect" for most countries, while spectral clustering demonstrated that some countries had even more complex reporting patterns.

Conclusion: The majority of countries reported COVID-19 cases when they did have cases to report. The identification of a slight "weekend effect" suggests that COVID-19 case counts reported in the middle of the week may represent the best data basis for political ad hoc decisions. A few countries, however, showed unusual or highly irregular reporting that might require more careful interpretation.
Our score system and cluster analyses might be applied by epidemiologists advising policymakers to consider country-specific reporting behaviors in political ad hoc decisions.

Methods

Data collection: COVID-19 data were downloaded from WHO. Using a public repository, we added the countries' full names to the WHO data set, using the two-letter abbreviations for each country to merge both data sets. The provided COVID-19 data cover January 2020 until June 2021. We uploaded the final data set used for the analyses of this paper.

Data processing: We processed data using a Jupyter Notebook with a Python kernel and publicly available external libraries. This upload contains the required Jupyter Notebook (reporting_behavior.ipynb) with all analyses and some additional work, a README, and the conda environment yml (env.yml).
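The spectral-clustering step can be illustrated with a self-contained sketch on synthetic binary reporting rates. This is not the notebook's code; the country groups, probabilities, and parameters below are invented:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(1)
n_days = 120

# Synthetic binary reporting rates for two stylized country groups:
# near-daily reporters vs. reporters with a strong weekend effect.
daily = rng.random((10, n_days)) < 0.95
weekday_prob = np.where(np.arange(n_days) % 7 < 5, 0.9, 0.1)
weekend_effect = rng.random((10, n_days)) < weekday_prob
X = np.vstack([daily, weekend_effect]).astype(float)

# Cluster the reporting-rate vectors with a nearest-neighbor affinity graph.
labels = SpectralClustering(n_clusters=2, random_state=0,
                            affinity="nearest_neighbors",
                            n_neighbors=5).fit_predict(X)
print(labels)
```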
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This document describes two datasets collected at Tampere University facilities with samples taken from a Wi-Fi network interface for experiments with indoor positioning based on Wi-Fi fingerprinting.
To reference this dataset, please use
E.S. Lohan et al., "Additional TAU datasets for Wi-Fi fingerprinting-based positioning", DOI 10.5281/zenodo.3819917
Additional reference using these datasets
Torres-Sospedra, J.; Quezada-Gaibor, D.; Mendoza-Silva, G. M.; Nurmi, J.; Koucheryavy, Y.; Huerta, J.: New Cluster Selection and Fine-grained Search for k-Means Clustering and Wi-Fi Fingerprinting. Proceedings of the Tenth International Conference on Localization and GNSS (ICL-GNSS), 2020.
Dataset format
Two independent datasets are provided; they are in different folders, namely “Database_Building01” and “Database_Building02”. Each dataset includes two sets of samples:
radio map – a set of Wi-Fi samples collected at a grid of points (reference points);
evaluation – a set of Wi-Fi samples randomly collected in the evaluation area.
Two files are provided for each set: one with the RSS vectors and one with the coordinates. The names of the radio map files start with “rm_”; the names of the evaluation files start with “eval_”. For instance, for the radio map they are:
rm_crd.csv: holds the coordinates (x, y) and the floor identifier (z) where the samples were collected;
rm_rss.csv: holds the measured RSSI values from each of the Access Points (APs) detected in each sample.
All files are in the same format: CSV (comma-separated values) plain text (UTF-8).
Coordinates: Each sample is associated with a pair of coordinates in a 2D Euclidean reference system. The origin of the reference system was chosen arbitrarily for convenience. The units are meters; therefore, distances between points can be easily calculated. Moreover, the floor identifier is included to enable 3D positioning.
RSSI values: The RSSI values are provided as read from the Wi-Fi network interface through the Android API. In each sample, a value of +100 was assigned to each AP not detected during the measurement. No information is provided about the MAC addresses of the APs. However, the same column order is used for all samples, meaning that the values in each column are all associated with the same AP.
Both datasets are independent, and none of the provided files includes an identifier for each sample. The values in the two provided files are associated by line number: the coordinates and RSSI values on the same line, in each file, refer to the same sample.
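Because the rm_*.csv files are paired by line number and use +100 for undetected APs, a loader typically masks that marker before computing statistics. A minimal numpy sketch with made-up values:

```python
import numpy as np

# Toy stand-ins for one rm_crd.csv / rm_rss.csv pair; rows are matched
# by line number, as in the real files.
coords = np.array([[1.0, 2.0, 1.0],    # x, y, floor (z)
                   [3.0, 4.0, 2.0]])
rss = np.array([[-60.0, 100.0, -75.0],  # +100 marks an AP not detected
                [100.0, -55.0, -80.0]])

# Replace the +100 "not detected" marker with NaN before averaging.
rss_clean = np.where(rss == 100.0, np.nan, rss)
mean_rss = np.nanmean(rss_clean, axis=1)
print(mean_rss)
```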
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
How does Facebook always seem to know what the next funny video should be to sustain your attention on the platform? Facebook has not asked you whether you like videos of cats doing something funny: it just seems to know. In fact, Facebook learns from your behavior on the platform (e.g., how long you have engaged with similar movies, which posts you have previously liked or commented on, etc.). As a result, Facebook is able to sustain its users' attention for a long time. The typical mHealth app, on the other hand, suffers from rapidly collapsing user engagement levels. To sustain engagement, mHealth apps nowadays employ all sorts of intervention strategies. Of course, it would be powerful to know, as Facebook does, which strategy should be presented to which individual to sustain their engagement. A first step toward this could be to cluster similar users (and then derive intervention strategies from there). This dataset was collected through a single mHealth app over 8 different mHealth campaigns (i.e., scientific studies). Using this dataset, one could derive clusters from app user event data. One approach could differentiate between two phases: a process-mining phase and a clustering phase. In the process-mining phase, one derives from the dataset the processes (i.e., sequences of app actions) that users undertake. In the clustering phase, one clusters similar users based on the processes they engaged in (i.e., users that perform similar sequences of app actions).
List of files
- 0-list-of-variables.pdf: includes an overview of the different variables within the dataset.
- 1-description-of-endpoints.pdf: includes a description of the unique endpoints that appear in the dataset.
- 2-requests.csv: includes the dataset with the actual app user event data.
- 2-requests-by-session.csv: includes the dataset with the actual app user event data plus a session variable, to differentiate between user requests made in the same session.
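As a sketch of the two-phase idea, the per-user action sequences below are hypothetical (a process-mining step would normally derive them from the request data), encoded as bigram counts and clustered:

```python
from collections import Counter

from sklearn.cluster import KMeans
from sklearn.feature_extraction import DictVectorizer

# Hypothetical per-user action sequences; not part of the dataset.
sequences = {
    "u1": ["open", "log_mood", "read_tip", "close"],
    "u2": ["open", "log_mood", "read_tip", "close"],
    "u3": ["open", "close"],
    "u4": ["open", "close"],
}

# Encode each sequence as bigram (action-pair) counts, then cluster users.
def bigrams(seq):
    return Counter(zip(seq, seq[1:]))

users = list(sequences)
features = [{f"{a}->{b}": c for (a, b), c in bigrams(sequences[u]).items()}
            for u in users]
X = DictVectorizer(sparse=False).fit_transform(features)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(dict(zip(users, labels)))
```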
1. Declines of large vertebrates in tropical forests may reduce dispersal of tree species that rely on them, and the resulting undispersed seedlings might suffer increased distance- and density-dependent mortality. Consequently, extirpation of large vertebrates may alter the composition and spatial structure of plant communities and impair ecosystem functions like carbon storage.
2. We analysed spatial patterns of tree recruitment within six forest plots along a defaunation gradient in western Amazonia. We divided recruits into two size cohorts (“saplings”, ≥1 m tall and <1 cm diameter at breast height [dbh], and “juveniles”, 1-2 cm dbh) and examined the spatial organization of conspecific recruits within each cohort (within-cohort) and around conspecific reproductive-sized trees (between-cohort). We used replicated spatial point pattern analysis to quantify relationships between recruit clustering and cohort, defaunation intensity, each tree species’ reliance on hunted dispersers an...
This data release contains the input-data files and R scripts associated with the analysis presented in [citation of manuscript]. The spatial extent of the data is the contiguous U.S. The input-data files include one comma separated value (csv) file of county-level data, and one csv file of city-level data. The county-level csv (“county_data.csv”) contains data for 3,109 counties. This data includes two measures of water use, descriptive information about each county, three grouping variables (climate region, urban class, and economic dependency), and contains 18 explanatory variables: proportion of population growth from 2000-2010, fraction of withdrawals from surface water, average daily water yield, mean annual maximum temperature from 1970-2010, 2005-2010 maximum temperature departure from the 40-year maximum, mean annual precipitation from 1970-2010, 2005-2010 mean precipitation departure from the 40-year mean, Gini income disparity index, percent of county population with at least some college education, Cook Partisan Voting Index, housing density, median household income, average number of people per household, median age of structures, percent of renters, percent of single family homes, percent apartments, and a numeric version of urban class. The city-level csv (city_data.csv) contains data for 83 cities. This data includes descriptive information for each city, water-use measures, one grouping variable (climate region), and 6 explanatory variables: type of water bill (increasing block rate, decreasing block rate, or uniform), average price of water bill, number of requirement-oriented water conservation policies, number of rebate-oriented water conservation policies, aridity index, and regional price parity. The R scripts construct fixed-effects and Bayesian Hierarchical regression models. The primary difference between these models relates to how they handle possible clustering in the observations that define unique water-use settings. 
Fixed-effects models address possible clustering in one of two ways. In a "fully pooled" fixed-effects model, any clustering by group is ignored, and a single, fixed estimate of the coefficient for each covariate is developed using all of the observations. Conversely, in an unpooled fixed-effects model, separate coefficient estimates are developed using only the observations in each group. A hierarchical model provides a compromise between these two extremes. Hierarchical models extend single-level regression to data with a nested structure, whereby the model parameters vary at different levels in the model, including a lower level that describes the actual data and an upper level that influences the values taken by parameters in the lower level. The county-level models were compared using the Watanabe-Akaike information criterion (WAIC), which is derived from the log pointwise predictive density of the models and can be shown to approximate out-of-sample predictive performance. All script files are intended to be used with R statistical software (R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org) and Stan probabilistic modeling software (Stan Development Team. 2017. RStan: the R interface to Stan. R package version 2.16.2. http://mc-stan.org).
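The pooled/unpooled distinction can be illustrated with a minimal sketch. (Python with NumPy here purely for illustration; the data release's actual models are written in R and Stan, and the toy data below are invented, not drawn from the release.)

```python
import numpy as np

def ols(X, y):
    """Ordinary least squares fit via numpy's least-squares solver."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Tiny synthetic example: two groups sharing an intercept but with
# different true slopes (no noise, so the per-group fits are exact).
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=40)
group = np.repeat([0, 1], 20)
y = np.where(group == 0, 1.0 + 2.0 * x, 1.0 + 5.0 * x)

X = np.column_stack([np.ones_like(x), x])

# Fully pooled: one coefficient vector estimated from all observations,
# ignoring group membership entirely.
beta_pooled = ols(X, y)

# Unpooled: a separate coefficient vector per group, each estimated
# using only that group's rows.
beta_by_group = {g: ols(X[group == g], y[group == g]) for g in (0, 1)}
```

The pooled slope lands between the two group slopes; a hierarchical model would instead shrink each group's estimate toward a shared upper-level mean rather than forcing either extreme.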
https://spdx.org/licenses/CC0-1.0.html
Interannual variation, especially weather, is an often-cited reason for restoration “failures”; yet its importance is difficult to isolate experimentally across broad spatiotemporal extents because weather and site characteristics are correlated. In the analysis associated with this dataset, we examined post-fire treatments within sagebrush-steppe ecosystems to ask: 1) Is weather following seeding efforts a primary reason why restoration outcomes depart from predictions? and 2) Does the management relevance of weather differ across space and with time since treatment? This dataset integrates remotely sensed estimates of sagebrush (Artemisia spp.) cover from the RCMAP product (https://www.mrlc.gov/data-services-page), areas that received post-fire seeding, identified using the Land Treatment Digital Library (LTDL; https://ltdl.wr.usgs.gov/), and GridMet surface meteorological data (https://www.climatologylab.org/gridmet.html) to describe the impacts of weather on sagebrush recovery following restoration treatments. Methods We identified observations from the LTDL in which at least one Artemisia species had been seeded following fire, within the extent covered by RCMAP (NLCD back-in-time sagebrush cover), that burned between 1980 and 2005 and were subsequently seeded. We then removed all areas that burned or were seeded multiple times between 1980 and 2015. We then selected all RCMAP pixels that overlapped these burned, seeded areas and extracted sagebrush cover for all years of the record for each pixel.
Data were processed in chunks because of the large number of pixels included in the analysis. To reduce data dimensions and redundancy, we next clustered the pixel data using the spatially constrained multivariate clustering algorithm in ArcGIS. The number of clusters was set to 1/1000 of the initial number of pixels, and the spatial constraint was set to contiguity edges only. The analysis fields (the data attributes on which the algorithm decided cluster membership) were elevation, TWI, heatload, Level 3 ecoregion (coded as a dummy variable), and slope. If the algorithm failed with the initial number of clusters, the number of clusters was increased by 10% until the algorithm would run. We allowed spatially non-contiguous clusters in cases where the algorithm was not solvable with contiguous clusters only. Post-processing of the clusters included checking that the relative standard error for elevation was less than 20% within each cluster and screening for multiple fires being combined into one cluster; when multiple fires were initially combined into one cluster, they were separated into different clusters. We also assessed whether dividing the data into chunks significantly influenced the clustering process: comparisons of the data chunks suggested that each chunk had a similar distribution of relative standard deviations for elevation, slope, heatload, and TWI among the clusters contained within it. In R, using the extract function in the raster package (Hijmans & van Etten, 2012), we extracted sagebrush cover for each year following fire, along with the following GridMet variables, using the center point of each RCMAP pixel as the extraction point: daily precipitation, minimum temperature, and maximum temperature for February-April in the first four years after fire; 30-year climate means; and monthly SPEI for the two years before and the four years after fire (calculated with the SPEI package in R).
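The rule of growing the cluster count by 10% until the tool succeeds is a simple retry loop. A sketch in Python for illustration only: `run_clustering` below is a hypothetical stand-in for the ArcGIS tool, and the failure mode is modeled as a raised exception.

```python
import math

def choose_cluster_count(n_pixels, run_clustering, max_tries=50):
    """Start at 1/1000 of the pixel count and grow the target by 10%
    until the clustering call succeeds, mirroring the retry rule in the
    text. `run_clustering(k)` is a hypothetical stand-in for the ArcGIS
    clustering tool and is expected to raise RuntimeError on failure."""
    k = max(1, n_pixels // 1000)
    for _ in range(max_tries):
        try:
            return k, run_clustering(k)
        except RuntimeError:  # hypothetical "could not solve" failure
            k = math.ceil(k * 1.1)
    raise RuntimeError("clustering never succeeded within max_tries")
```

For example, with 10,000 pixels the loop starts at k = 10 and steps through 11, 13, 15, ... until the (stand-in) tool accepts the count.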
We extracted additional covariates for each pixel: elevation, TWI, heatload, Level 3 ecoregion, and slope. We then described the mean characteristics of each cluster for each of these variables. For each climate variable, we calculated each year's deviation from the long-term (30-year) mean, computed as the mean minus that year's observation. This process resulted in the dataset entitled “longtermsage.csv”. For the autoregressive model (Question 3 in the associated manuscript), we formatted the data to allow statistical modeling of annual changes in sagebrush cover, producing a second dataset entitled “growthannualsage.csv”. Specific variable names are described in the README file.
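The deviation defined above (long-term mean minus a given year's observation) is simple arithmetic; a minimal sketch follows. (Python/NumPy here for illustration; the original processing used R, and the numbers below are invented.)

```python
import numpy as np

# Hypothetical 30-year mean and four post-fire annual observations
# for one climate variable (e.g., February-April precipitation, mm).
longterm_mean = 120.0
annual_obs = np.array([100.0, 135.0, 120.0, 90.0])

# Deviation as defined in the text: long-term mean minus the year's value.
# A positive deviation means that year fell below the 30-year mean.
deviation = longterm_mean - annual_obs
```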