23 datasets found
  1. NSL-KDD dataset

    • impactcybertrust.org
    Updated Jan 1, 2009
    Cite
    External Data Source (2009). NSL-KDD dataset [Dataset]. http://doi.org/10.23721/100/1478792
    Explore at:
    Dataset updated
    Jan 1, 2009
    Authors
    External Data Source
    Time period covered
    Jan 1, 2009
    Description

    NSL-KDD is a data set suggested to solve some of the inherent problems of the KDD'99 data set. Although this new version of the KDD data set still suffers from some of the problems discussed by McHugh and may not be a perfect representative of existing real networks, we believe that, given the lack of public data sets for network-based IDSs, it can still serve as an effective benchmark data set to help researchers compare different intrusion detection methods.

    Furthermore, the number of records in the NSL-KDD train and test sets is reasonable. This makes it affordable to run experiments on the complete set without the need to randomly select a small portion. Consequently, evaluation results of different research works will be consistent and comparable.

    Data files

    KDDTrain+.ARFF: The full NSL-KDD train set with binary labels in ARFF format
    KDDTrain+.TXT: The full NSL-KDD train set including attack-type labels and difficulty level in CSV format
    KDDTrain+_20Percent.ARFF: A 20% subset of the KDDTrain+.arff file
    KDDTrain+_20Percent.TXT: A 20% subset of the KDDTrain+.txt file
    KDDTest+.ARFF: The full NSL-KDD test set with binary labels in ARFF format
    KDDTest+.TXT: The full NSL-KDD test set including attack-type labels and difficulty level in CSV format
    KDDTest-21.ARFF: A subset of the KDDTest+.arff file which does not include records with difficulty level of 21 out of 21
    KDDTest-21.TXT: A subset of the KDDTest+.txt file which does not include records with difficulty level of 21 out of 21
    Contact: cic@unb.ca.
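
    A minimal loading sketch in Python (pandas assumed): the .TXT splits are plain CSV without a header row, so the placeholder column names below are illustrative and should be checked against the ARFF header.

        import pandas as pd

        # 41 feature columns followed by the attack-type label and the difficulty level,
        # per the file descriptions above; real feature names are listed in the ARFF header.
        columns = [f"f{i}" for i in range(41)] + ["label", "difficulty"]

        train = pd.read_csv("KDDTrain+.TXT", header=None, names=columns)
        test = pd.read_csv("KDDTest+.TXT", header=None, names=columns)

        # Collapse the attack-type labels into a binary normal/attack target.
        train["is_attack"] = (train["label"] != "normal").astype(int)
        print(train["is_attack"].value_counts())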

  2. Data from: Automatic composition of descriptive music: A case study of the...

    • figshare.com
    txt
    Updated May 31, 2023
    Cite
    Lucía Martín-Gómez (2023). Automatic composition of descriptive music: A case study of the relationship between image and sound [Dataset]. http://doi.org/10.6084/m9.figshare.6682998.v1
    Explore at:
    txt (available download formats)
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Lucía Martín-Gómez
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    FANTASIA

    This repository contains the data related to image descriptors and sound associated with a selection of frames of the films Fantasia and Fantasia 2000 produced by Disney.

    About

    This repository contains the data used in the article "Automatic composition of descriptive music: A case study of the relationship between image and sound", published in the 6th International Workshop on Computational Creativity, Concept Invention, and General Intelligence (C3GI). The data structure is explained in detail in the article.

    Abstract

    Human beings establish relationships with the environment mainly through sight and hearing. This work focuses on the concept of descriptive music, which makes use of sound resources to narrate a story. The film Fantasia, produced by Walt Disney, was used in the case study. One of its musical pieces is analyzed in order to obtain the relationship between image and music. This connection is subsequently used to create a descriptive musical composition from a new video. Naive Bayes, Support Vector Machine and Random Forest are the three classifiers studied for the model induction process. After an analysis of their performance, it was concluded that Random Forest provided the best solution; the produced musical composition had a considerably high descriptive quality.

    Data

    Nutcracker_data.arff: Image descriptors and the most important sound of each frame from the fragment "The Nutcracker Suite" in the film Fantasia, stored in ARFF format.
    Firebird_data.arff: Image descriptors of each frame from the fragment "The Firebird" in the film Fantasia 2000, stored in ARFF format.
    Firebird_midi_prediction.csv: Frame number of the fragment "The Firebird" in the film Fantasia 2000 and the sound predicted by the system, encoded in MIDI and stored in CSV format.
    Firebird_prediction.mp3: Audio file synthesizing the prediction data for the fragment "The Firebird" of the film Fantasia 2000.

    License

    Data is available under the MIT License. To make use of the data, the article must be cited.
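
    A minimal sketch for inspecting the ARFF data in Python, assuming the file holds only numeric and nominal attributes (which is what scipy.io.arff supports; string attributes would need a different loader):

        from scipy.io import arff
        import pandas as pd

        # Image descriptors plus the associated sound for each frame of "The Nutcracker Suite".
        data, meta = arff.loadarff("Nutcracker_data.arff")
        df = pd.DataFrame(data)

        print(meta)        # attribute names and types declared in the ARFF header
        print(df.head())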

  3. Data from: COVID-19 and media dataset: Mining textual data according periods...

    • dataverse.cirad.fr
    application/x-gzip +1
    Updated Dec 21, 2020
    Cite
    Mathieu Roche; Mathieu Roche (2020). COVID-19 and media dataset: Mining textual data according periods and countries (UK, Spain, France) [Dataset]. http://doi.org/10.18167/DVN1/ZUA8MF
    Explore at:
    application/x-gzip (511157), application/x-gzip (97349), text/x-perl-script (4982), application/x-gzip (93110), application/x-gzip (23765310), application/x-gzip (107669) (available download formats)
    Dataset updated
    Dec 21, 2020
    Authors
    Mathieu Roche; Mathieu Roche
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    France, United Kingdom, Spain
    Dataset funded by
    ANR (#DigitAg)
    Horizon 2020 - European Commission - (MOOD project)
    Description

    These datasets contain a set of news articles in English, French and Spanish extracted from Medisys (i.e. advanced search) according to the following criteria: (1) keywords (at least one of): COVID-19, ncov2019, cov2019, coronavirus; (2) keywords (all words): masque (French), mask (English), máscara (Spanish); (3) periods: March 2020, May 2020, July 2020; (4) countries: UK (English), Spain (Spanish), France (French). A corpus per country has been manually collected (copy/paste) from Medisys. For each country, 100 snippets per period (the 1st, 10th, 15th and 20th of each month) are built.

    The datasets are composed of: (1) a corpus preprocessed for the BioTex tool - https://gitlab.irstea.fr/jacques.fize/biotex_python (.txt) [~900 texts]; (2) the same corpus preprocessed for the Weka tool - https://www.cs.waikato.ac.nz/ml/weka/ (.arff); (3) terms extracted with BioTex according to spatio-temporal criteria (*.csv) [~9,000 terms]. Other corpora can be collected with this same method. The Perl code used to preprocess the textual data for the terminology extraction (BioTex) and classification (Weka) tasks is available.

    A new version of this dataset (December 2020) includes additional data: Python preprocessing and BioTex code [Execution_BioTex.tgz], and terms extracted with different ranking measures (i.e. C-Value, F-TFIDF-C_M) and methods (i.e. extraction of words and multi-word terms) with the online version of BioTex [Terminology_with_BioTex_online_dec2020.tgz].

  4. VPN and Non-VPN Application Traffic (CIC-VPN2016)

    • kaggle.com
    Updated Aug 2, 2025
    Cite
    Krish Agarwal (2025). VPN and Non-VPN Application Traffic (CIC-VPN2016) [Dataset]. https://www.kaggle.com/datasets/noobbcoder2/vpn-and-non-vpn-application-traffic-cic-vpn2016
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 2, 2025
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Krish Agarwal
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Context

    This dataset is a consolidated and cleaned CSV version of the ISCX VPN-nonVPN 2016 dataset from the Canadian Institute for Cybersecurity (CIC) at the University of New Brunswick. The original dataset was created to characterize and identify different types of network traffic, which is crucial for network management, Quality of Service (QoS) optimization, and cybersecurity.

    This single CSV file combines the multiple .arff files from the original dataset, making it easier to use for data analysis and machine learning projects in Python.

    Content

    The dataset contains network flow features extracted from packet captures (PCAPs). Each row represents a single network flow and has been labeled with the specific application type and whether it was routed through a VPN.

    Features (X): Include over 20 time-related flow features like duration, flowBytesPerSecond, flowPktsPerSecond, min_active, max_idle, etc. These features describe the timing, duration, and volume of the data flows.

    Target (y): The target column, traffic_type, is a multi-class label describing the application and connection type (e.g., VPN-CHAT, NonVPN-STREAMING, VPN-Browse).

    Potential Uses & Inspiration 🚀

    Multi-Class Classification: Can you build a model to accurately identify the specific application generating the traffic?

    Binary Classification: Can you distinguish between VPN and Non-VPN traffic, regardless of the application?

    Resource Allocation: Predict which types of traffic (e.g., Streaming) require more bandwidth, helping to build smarter network management tools.

    Federated Learning: This dataset is ideal for simulating a Federated Learning environment where data from different "users" (applications) is used to train a central model without sharing raw data.
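
    A minimal sketch of the binary VPN vs. non-VPN task (the CSV file name is hypothetical; the traffic_type target and the numeric flow features are taken from the description above):

        import pandas as pd
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import train_test_split

        df = pd.read_csv("cic_vpn2016_flows.csv")   # hypothetical file name

        # Derive a binary label from the multi-class traffic_type (e.g. "VPN-CHAT").
        y = df["traffic_type"].str.startswith("VPN").astype(int)
        X = df.drop(columns=["traffic_type"]).select_dtypes("number")

        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
        clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
        print("Held-out accuracy:", clf.score(X_test, y_test))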

  5. NSL-KDD

    • huggingface.co
    Updated Jul 31, 2023
    Cite
    Mireu Lab (2023). NSL-KDD [Dataset]. https://huggingface.co/datasets/Mireu-Lab/NSL-KDD
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jul 31, 2023
    Authors
    Mireu Lab
    License

    GNU General Public License v3.0: https://choosealicense.com/licenses/gpl-3.0/

    Description

    NSL-KDD

    This data set was produced by converting the ARFF files provided at the link above into CSV. The numeric data has been converted to and stored as float64. If you want the original files, they are organized in the Original directory of the repository.

      Labels
    

    The label of the data set is as follows.

    #  Column         Non-Null Count   Dtype
    0  duration       151165 non-null  int64
    1  protocol_type  151165 non-null  object
    2  service        151165 non-null  …

    See the full description on the dataset page: https://huggingface.co/datasets/Mireu-Lab/NSL-KDD.
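
    A minimal sketch using the Hugging Face datasets library, assuming the Hub repository exposes the converted CSV splits in a form load_dataset() can read directly (check the dataset page for the exact split names):

        from datasets import load_dataset

        ds = load_dataset("Mireu-Lab/NSL-KDD")
        print(ds)                         # available splits and column names

        train = ds["train"].to_pandas()   # column layout as listed above
        print(train[["duration", "protocol_type", "service"]].head())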

  6. ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of...

    • data.niaid.nih.gov
    • elki-project.github.io
    • +2more
    Updated May 2, 2024
    Cite
    Zimek, Arthur (2024). ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of Object Images (ALOI) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6355683
    Explore at:
    Dataset updated
    May 2, 2024
    Dataset provided by
    Zimek, Arthur
    Schubert, Erich
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These data sets were originally created for the following publications:

    M. E. Houle, H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek Can Shared-Neighbor Distances Defeat the Curse of Dimensionality? In Proceedings of the 22nd International Conference on Scientific and Statistical Database Management (SSDBM), Heidelberg, Germany, 2010.

    H.-P. Kriegel, E. Schubert, A. Zimek Evaluation of Multiple Clustering Solutions In 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings Held in Conjunction with ECML PKDD 2011, Athens, Greece, 2011.

    The outlier data set versions were introduced in:

    E. Schubert, R. Wojdanowski, A. Zimek, H.-P. Kriegel On Evaluation of Outlier Rankings and Outlier Scores In Proceedings of the 12th SIAM International Conference on Data Mining (SDM), Anaheim, CA, 2012.

    They are derived from the original image data available at https://aloi.science.uva.nl/

    The image acquisition process is documented in the original ALOI work: J. M. Geusebroek, G. J. Burghouts, and A. W. M. Smeulders, The Amsterdam library of object images, Int. J. Comput. Vision, 61(1), 103-112, January, 2005

    Additional information is available at: https://elki-project.github.io/datasets/multi_view

    The following views are currently available:

    Object number: sparse 1000-dimensional vectors that give the true object assignment.
    Files: objs.arff.gz

    RGB color histograms: standard RGB color histograms (uniform binning).
    Files: aloi-8d.csv.gz, aloi-27d.csv.gz, aloi-64d.csv.gz, aloi-125d.csv.gz, aloi-216d.csv.gz, aloi-343d.csv.gz, aloi-512d.csv.gz, aloi-729d.csv.gz, aloi-1000d.csv.gz

    HSV color histograms: standard HSV/HSB color histograms in various binnings.
    Files: aloi-hsb-2x2x2.csv.gz, aloi-hsb-3x3x3.csv.gz, aloi-hsb-4x4x4.csv.gz, aloi-hsb-5x5x5.csv.gz, aloi-hsb-6x6x6.csv.gz, aloi-hsb-7x7x7.csv.gz, aloi-hsb-7x2x2.csv.gz, aloi-hsb-7x3x3.csv.gz, aloi-hsb-14x3x3.csv.gz, aloi-hsb-8x4x4.csv.gz, aloi-hsb-9x5x5.csv.gz, aloi-hsb-13x4x4.csv.gz, aloi-hsb-14x5x5.csv.gz, aloi-hsb-10x6x6.csv.gz, aloi-hsb-14x6x6.csv.gz

    Color similarity: average similarity to 77 reference colors (not histograms); 18 colors x 2 sat x 2 bri + 5 grey values (incl. white, black).
    Files: aloi-colorsim77.arff.gz (feature subsets are meaningful here, as these features are computed independently of each other)

    Haralick features: first 13 Haralick features (radius 1 pixel).
    Files: aloi-haralick-1.csv.gz

    Front to back: vectors representing front faces vs. back faces of individual objects.
    Files: front.arff.gz

    Basic light: vectors indicating basic light situations.
    Files: light.arff.gz

    Manual annotations: manually annotated object groups of semantically related objects such as cups.
    Files: manual1.arff.gz

    Outlier Detection Versions

    Additionally, we generated a number of subsets for outlier detection:

    RGB histograms, downsampled to 100,000 objects (553 outliers).
    Files: aloi-27d-100000-max10-tot553.csv.gz, aloi-64d-100000-max10-tot553.csv.gz

    RGB histograms, downsampled to 75,000 objects (717 outliers).
    Files: aloi-27d-75000-max4-tot717.csv.gz, aloi-64d-75000-max4-tot717.csv.gz

    RGB histograms, downsampled to 50,000 objects (1,508 outliers).
    Files: aloi-27d-50000-max5-tot1508.csv.gz, aloi-64d-50000-max5-tot1508.csv.gz
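
    A minimal sketch for inspecting one of the gzipped CSV views in Python; the delimiter and any label/ID columns are assumptions to be checked against the files themselves (see the ELKI page linked above):

        import pandas as pd

        df = pd.read_csv(
            "aloi-27d.csv.gz",   # one of the RGB color histogram views listed above
            sep=None,            # let pandas sniff the delimiter
            engine="python",
            compression="gzip",
            comment="#",         # skip possible comment lines
            header=None,
        )
        print(df.shape)
        print(df.head())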
    
  7. HTRU2

    • figshare.com
    zip
    Updated Apr 1, 2016
    Cite
    Robert Lyon (2016). HTRU2 [Dataset]. http://doi.org/10.6084/m9.figshare.3080389.v1
    Explore at:
    zip (available download formats)
    Dataset updated
    Apr 1, 2016
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Robert Lyon
    License

    GNU General Public License v3.0: https://www.gnu.org/licenses/gpl-3.0.html

    Description
    1. Overview

    HTRU2 is a data set which describes a sample of pulsar candidates collected during the High Time Resolution Universe Survey (South) [1]. Pulsars are a rare type of neutron star that produce radio emission detectable here on Earth. They are of considerable scientific interest as probes of space-time, the interstellar medium, and states of matter (see [2] for more uses).

    As pulsars rotate, their emission beam sweeps across the sky and, when this crosses our line of sight, produces a detectable pattern of broadband radio emission. As pulsars rotate rapidly, this pattern repeats periodically. Thus pulsar search involves looking for periodic radio signals with large radio telescopes. Each pulsar produces a slightly different emission pattern, which varies slightly with each rotation (see [2] for an introduction to pulsar astrophysics to find out why). Thus a potential signal detection, known as a 'candidate', is averaged over many rotations of the pulsar, as determined by the length of an observation. In the absence of additional info, each candidate could potentially describe a real pulsar. However, in practice almost all detections are caused by radio frequency interference (RFI) and noise, making legitimate signals hard to find.

    Machine learning tools are now being used to automatically label pulsar candidates to facilitate rapid analysis. Classification systems in particular are being widely adopted (see [4,5,6,7,8,9]), which treat the candidate data sets as binary classification problems. Here the legitimate pulsar examples are a minority positive class and spurious examples the majority negative class. At present multi-class labels are unavailable, given the costs associated with data annotation.

    The data set shared here contains 16,259 spurious examples caused by RFI/noise, and 1,639 real pulsar examples. These examples have all been checked by human annotators. Each candidate is described by 8 continuous variables. The first four are simple statistics obtained from the integrated pulse profile (folded profile). This is an array of continuous variables that describe a longitude-resolved version of the signal that has been averaged in both time and frequency (see [3] for more details). The remaining four variables are similarly obtained from the DM-SNR curve (again see [3] for more details). These are summarised below:

    1. Mean of the integrated profile.
    2. Standard deviation of the integrated profile.
    3. Excess kurtosis of the integrated profile.
    4. Skewness of the integrated profile.
    5. Mean of the DM-SNR curve.
    6. Standard deviation of the DM-SNR curve.
    7. Excess kurtosis of the DM-SNR curve.
    8. Skewness of the DM-SNR curve.

    HTRU 2 Summary: 17,898 total examples; 1,639 positive examples; 16,259 negative examples.

    The data is presented in two formats: CSV and ARFF (used by the WEKA data mining tool). Candidates are stored in both files in separate rows. Each row lists the variables first, and the class label is the final entry. The class labels used are 0 (negative) and 1 (positive). Please note that the data contains no positional information or other astronomical details. It is simply feature data extracted from candidate files using the PulsarFeatureLab tool (see [10]).

    2. Citing our work

    If you use the dataset in your work, please cite us using the DOI of the dataset and the paper: R. J. Lyon, B. W. Stappers, S. Cooper, J. M. Brooke, J. D. Knowles, "Fifty Years of Pulsar Candidate Selection: From simple filters to a new principled real-time classification approach", MNRAS, 2016.

    3. Acknowledgements

    This data was obtained with the support of grant EP/I028099/1 for the University of Manchester Centre for Doctoral Training in Computer Science, from the UK Engineering and Physical Sciences Research Council (EPSRC). The raw observational data was collected by the High Time Resolution Universe Collaboration using the Parkes Observatory, funded by the Commonwealth of Australia and managed by the CSIRO.

    4. References

    [1] M. J. Keith et al., "The High Time Resolution Universe Pulsar Survey - I. System Configuration and Initial Discoveries", 2010, Monthly Notices of the Royal Astronomical Society, vol. 409, pp. 619-627. DOI: 10.1111/j.1365-2966.2010.17325.x
    [2] D. R. Lorimer and M. Kramer, "Handbook of Pulsar Astronomy", Cambridge University Press, 2005.
    [3] R. J. Lyon, "Why Are Pulsars Hard To Find?", PhD Thesis, University of Manchester, 2015.
    [4] R. J. Lyon et al., "Fifty Years of Pulsar Candidate Selection: From simple filters to a new principled real-time classification approach", Monthly Notices of the Royal Astronomical Society, submitted.
    [5] R. P. Eatough et al., "Selection of radio pulsar candidates using artificial neural networks", Monthly Notices of the Royal Astronomical Society, vol. 407, no. 4, pp. 2443-2450, 2010.
    [6] S. D. Bates et al., "The high time resolution universe pulsar survey VI. An artificial neural network and timing of 75 pulsars", Monthly Notices of the Royal Astronomical Society, vol. 427, no. 2, pp. 1052-1065, 2012.
    [7] D. Thornton, "The High Time Resolution Radio Sky", PhD thesis, University of Manchester, Jodrell Bank Centre for Astrophysics, School of Physics and Astronomy, 2013.
    [8] K. J. Lee et al., "PEACE: pulsar evaluation algorithm for candidate extraction - a software package for post-analysis processing of pulsar survey candidates", Monthly Notices of the Royal Astronomical Society, vol. 433, no. 1, pp. 688-694, 2013.
    [9] V. Morello et al., "SPINN: a straightforward machine learning solution to the pulsar candidate selection problem", Monthly Notices of the Royal Astronomical Society, vol. 443, no. 2, pp. 1651-1662, 2014.
    [10] R. J. Lyon, "PulsarFeatureLab", 2015, https://dx.doi.org/10.6084/m9.figshare.1536472.v1.
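
    A minimal sketch for the CSV version (features first, class label last, as described above); the file name and the absence of a header row are assumptions to be checked against the archive:

        import pandas as pd
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split

        df = pd.read_csv("HTRU_2.csv", header=None)   # assumed file name
        X, y = df.iloc[:, :-1], df.iloc[:, -1]        # 8 features, 0/1 class label

        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, stratify=y, random_state=0)
        clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        print("Held-out accuracy:", clf.score(X_test, y_test))
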
  8. Phishing Dataset UCI ML CSV

    • kaggle.com
    Updated Sep 27, 2020
    Cite
    Satish Yadav (2020). Phishing Dataset UCI ML CSV [Dataset]. https://www.kaggle.com/datasets/isatish/phishing-dataset-uci-ml-csv/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Sep 27, 2020
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Satish Yadav
    Description

    Context

    This dataset is taken from the UCI Phishing Dataset, originally distributed in ARFF format and converted here into CSV. It can be used to train and validate phishing detection machine learning projects.
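
    A minimal sketch of the kind of ARFF-to-CSV conversion this dataset reflects, assuming the original UCI ARFF uses only numeric/nominal attributes (which is what scipy.io.arff supports); the file names are illustrative:

        from scipy.io import arff
        import pandas as pd

        data, meta = arff.loadarff("Training Dataset.arff")   # assumed name of the UCI ARFF
        df = pd.DataFrame(data)

        # Nominal attributes come back as bytes; decode them before writing CSV.
        for col in df.select_dtypes("object"):
            df[col] = df[col].str.decode("utf-8")

        df.to_csv("phishing.csv", index=False)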

  9. Data from: SoilKsatDB: global compilation of soil saturated hydraulic...

    • data.niaid.nih.gov
    • repository.soilwise-he.eu
    • +1more
    Updated Jul 19, 2024
    + more versions
    Cite
    Hengl, Tomislav (2024). SoilKsatDB: global compilation of soil saturated hydraulic conductivity measurements for geoscience applications [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3752721
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Hengl, Tomislav
    Lehmann, Peter
    Or, Dani
    Surya, Gupta
    Bonetti, Sara
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A total of 13,258 Ksat measurements from 1,908 sites were assembled from the published literature and other sources, standardized, and quality-checked in order to obtain a global database of soil saturated hydraulic conductivity (SoilKsatDB). The SoilKsatDB covers most global regions, with the highest data density from North America, followed by Europe, Asia, South America, Africa, and Australia. In addition to Ksat, other soil variables such as soil texture (11,584 measurements), bulk density (11,262 measurements), soil organic carbon (9,787 measurements), field capacity (7,382) and wilting point (7,411) are also included in the data set.

    To cite this dataset please use:

    Gupta, S., Hengl, T., Lehmann, P., Bonetti, S., and Or, D.: SoilKsatDB: global soil saturated hydraulic conductivity measurements for geoscience applications, Earth Syst. Sci. Data Discuss., https://doi.org/10.5194/essd-2020-149, in review, 2021.

    Examples of using the SoilKsatDB to generate global maps of Ksat can be found in:

    Gupta, S., Hengl, T., Lehmann, P., Bonetti, S., Papritz, A. and Or, D. (2021): Global prediction of soil saturated hydraulic conductivity using random forest in a Covariate-based Geo Transfer Functions (CoGTF) framework. accepted for publication in Journal of Advances in Modeling Earth Systems (JAMES).

    Importing and binding steps are described in detail here. To report an issue or bug please use this link. Ksat data tutorial explaining how to access and use data is available here.

    In the following, we introduce two different file packages: one for the soil saturated hydraulic conductivity ("sol_ksat") and another collecting additional soil hydraulic properties ("sol_hydro"), which will be extended in the near future. Note that the package "sol_hydro" is not related to the publication listed above (Gupta et al., 2021a).

    Description of the files:

    The datasets in this repository include:

    sol_ksat.pnts_horizons.***: provides a global compilation of Ksat values and the information described in Table 2 in Gupta et al., (2020). This data is provided in three different data formats.

    sol_ksat.pnts_horizons.arff,

    sol_ksat.pnts_horizons.csv.gz,

    sol_ksat.pnts_horizons.rds,

    sol_ksat.pnts_metadata_cl_pedo.csv: provides meta-information with Ksat methods and information of estimated soil pedologic unit and climatic region for each Ksat sample.

    sol_ksat.points_horizons_rm.rds: all Ksat values overlaid on climatic, topographic, and vegetation-based remote sensing layers, with the corresponding values extracted. These datasets can be used for future spatial modeling.

    In addition to the Ksat points, the following files are included for readers interested in this topic.

    sol_hydro.pnts_horizons.***: provides water retention curve values and other soil hydraulic properties. This data is provided in three different data formats.

    sol_hydro.pnts_horizons.arff,

    sol_hydro.pnts_horizons.csv.gz,

    sol_hydro.pnts_horizons.rds,

    sol_hydro.pnts_horizons_rm.rds: all soil hydraulic values overlaid on climatic, topographic, and vegetation-based remote sensing layers, with the corresponding values extracted. These datasets can be used for future spatial modeling.

    SoilKsatDB is available in CSV, ARFF and RDS formats. The ARFF files were prepared using the farff package for R. ARFF (Attribute-Relation File Format) files are like CSV files, with a little added meta information in a header and standardized NA values. Column codes are based on the National Cooperative Soil Survey (NCSS) Soil Characterization Database naming convention (see "README.pdf" for an explanation of the codes).

    The SoilKsatDB is a compilation of numerous existing datasets, of which the most significant are the SWIG data set (Rahmati et al., 2018), UNSODA (Leij et al., 1996), and HYBRAS (Ottoni et al., 2018). A full list of data sources for the Ksat data is available in Gupta et al. (2021) and in the Readme.pdf.
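
    A minimal sketch for reading the gzipped CSV distribution with pandas; column codes follow the NCSS naming convention (see README.pdf), so inspect the header rather than assuming particular column names:

        import pandas as pd

        ksat = pd.read_csv("sol_ksat.pnts_horizons.csv.gz", compression="gzip", low_memory=False)
        print(ksat.shape)
        print(list(ksat.columns[:20]))   # first few NCSS-style column codes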

  10. Dataset for The Reverse Problem of Keystroke Dynamics: Guessing Typed Text...

    • data.mendeley.com
    • ieee-dataport.org
    • +1more
    Updated Apr 22, 2021
    + more versions
    Cite
    Nahuel González (2021). Dataset for The Reverse Problem of Keystroke Dynamics: Guessing Typed Text with Keystroke Timings [Dataset]. http://doi.org/10.17632/94dwkbxf2d.1
    Explore at:
    Dataset updated
    Apr 22, 2021
    Authors
    Nahuel González
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset used in the article "The Reverse Problem of Keystroke Dynamics: Guessing Typed Text with Keystroke Timings". The source data contains CSV files with dataset results summaries, false positives lists, the evaluated sentences, and their keystroke timings. The results data contains training and evaluation ARFF files for each user and sentence with the calculated Manhattan and Euclidean distances, the R metric, and the directionality index for each challenge instance. The source data comes from three free-text keystroke dynamics datasets used in previous studies by the authors (LSIA) and two other unrelated groups (KM, and PROSODY, subdivided into GAY, GUN, and REVIEW). Two different languages are represented: Spanish in LSIA and English in KM and PROSODY.

    The original dataset KM was used to compare anomaly-detection algorithms for keystroke dynamics in the article "Comparing anomaly-detection algorithms for keystroke dynamics" by Killourhy, K.S. and Maxion, R.A. The original dataset PROSODY was used to find cues of deceptive intent by analyzing variations in typing patterns in the article "Keystroke patterns as prosody in digital writings: A case study with deceptive reviews and essays" by Banerjee, R., Feng, S., Kang, J.S., and Choi, Y.

    We proposed a method to find, using only flight times (keydown/keydown), whether a medium-sized candidate list of possible texts includes the one to which the timings belong. Neither the text length nor the candidate text list was restricted, and previous samples of the timing parameters for the candidates were not required to train the model. The method was evaluated using three datasets collected by non-mutually-collaborating sets of authors in different environments. False acceptance and false rejection rates were found to remain below or very near 1% when user data was available for training. The former increased between two- and three-fold when the models were trained with data from other users, while the latter jumped to around 15%. These error rates are competitive against current methods for text recovery based on keystroke timings, and show that the method can be used effectively even without user-specific samples for training, by resorting to general population data.

  11. Relevance and Redundancy ranking: Code and Supplementary material

    • springernature.figshare.com
    pdf
    Updated May 31, 2023
    Cite
    Arvind Kumar Shekar; Tom Bocklisch; Patricia Iglesias Sanchez; Christoph Nikolas Straehle; Emmanuel Mueller (2023). Relevance and Redundancy ranking: Code and Supplementary material [Dataset]. http://doi.org/10.6084/m9.figshare.5418706.v1
    Explore at:
    pdf (available download formats)
    Dataset updated
    May 31, 2023
    Dataset provided by
    figshare
    Authors
    Arvind Kumar Shekar; Tom Bocklisch; Patricia Iglesias Sanchez; Christoph Nikolas Straehle; Emmanuel Mueller
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the code for Relevance and Redundancy ranking (RaR), an efficient filter-based feature ranking framework for evaluating relevance based on multi-feature interactions and redundancy on mixed datasets. Source code is in .scala and .sbt format and metadata in .xml, all of which can be accessed and edited in standard, openly accessible text editing software. Diagrams are in the openly accessible .png format.

    Supplementary_2.pdf: contains the results of experiments on multiple classifiers, along with parameter settings and a description of how KLD converges to mutual information based on its symmetricity.
    dataGenerator.zip: synthetic data generator inspired by the NIPS Workshop on Variable and Feature Selection (2001), http://www.clopinet.com/isabelle/Projects/NIPS2001/
    rar-mfs-master.zip: Relevance and Redundancy framework containing an overview diagram, example datasets, source code and metadata. Details on installing and running are provided below.

    Background. Feature ranking is beneficial to gain knowledge and to identify the relevant features from a high-dimensional dataset. However, in several datasets, a few features by themselves might have small correlation with the target classes, but by combining these features with some other features, they can be strongly correlated with the target. This means that multiple features exhibit interactions among themselves. It is necessary to rank the features based on these interactions for better analysis and classifier performance. However, evaluating these interactions on large datasets is computationally challenging. Furthermore, datasets often have features with redundant information. Using such redundant features hinders both efficiency and generalization capability of the classifier. The major challenge is to efficiently rank the features based on relevance and redundancy on mixed datasets. In the related publication, we propose a filter-based framework based on Relevance and Redundancy (RaR); RaR computes a single score that quantifies the feature relevance by considering interactions between features and redundancy. The top ranked features of RaR are characterized by maximum relevance and non-redundancy. The evaluation on synthetic and real world datasets demonstrates that our approach outperforms several state-of-the-art feature selection techniques.

    # Relevance and Redundancy Framework (rar-mfs)

    rar-mfs is an algorithm for feature selection and can be employed to select features from labelled data sets. The Relevance and Redundancy Framework (RaR), which is the theory behind the implementation, is a novel feature selection algorithm that
    - works on large data sets (polynomial runtime),
    - can handle differently typed features (e.g. nominal features and continuous features), and
    - handles multivariate correlations.

    ## Installation

    The tool is written in Scala and uses the Weka framework to load and handle data sets. You can either run it independently, providing the data as an .arff or .csv file, or include the algorithm as a (maven / ivy) dependency in your project. As an example data set we use heart-c.

    ### Project dependency

    The project is published to Maven Central. To depend on the project use:
    - maven: groupId de.hpi.kddm, artifactId rar-mfs_2.11, version 1.0.2
    - sbt: libraryDependencies += "de.hpi.kddm" %% "rar-mfs" % "1.0.2"

    To run the algorithm use:

        import java.io.File
        import de.hpi.kddm.rar._
        // ...
        val dataSet = de.hpi.kddm.rar.Runner.loadCSVDataSet(
          new File("heart-c.csv"), isNormalized = false, "")
        val algorithm = new RaRSearch(
          HicsContrastPramsFA(
            numIterations = config.samples,
            maxRetries = 1,
            alphaFixed = config.alpha,
            maxInstances = 1000),
          RaRParamsFixed(
            k = 5,
            numberOfMonteCarlosFixed = 5000,
            parallelismFactor = 4))
        algorithm.selectFeatures(dataSet)

    ### Command line tool

    EITHER download the prebuilt binary, which requires only an installation of a recent Java version (>= 6):
    1. download the prebuilt jar from the releases tab (latest)
    2. run java -jar rar-mfs-1.0.2.jar --help

    Example usage of the prebuilt jar:

        rar-mfs > java -jar rar-mfs-1.0.2.jar arff --samples 100 --subsetSize 5 --nonorm heart-c.arff
        Feature Ranking:
          1 - age (12)
          2 - sex (8)
          3 - cp (11)
          ...

    OR build the repository on your own:
    1. make sure sbt is installed
    2. clone the repository
    3. run sbt run

    Simple example using sbt directly after cloning the repository:

        rar-mfs > sbt "run arff --samples 100 --subsetSize 5 --nonorm heart-c.arff"
        Feature Ranking:
          1 - age (12)
          2 - sex (8)
          3 - cp (11)
          ...

    ### [Optional]

    To speed up the algorithm, consider using a fast solver such as Gurobi (http://www.gurobi.com/). Install the solver and put the provided gurobi.jar into the Java classpath.

    ## Algorithm

    ### Idea

    Abstract overview of the different steps of the proposed feature selection algorithm: https://github.com/tmbo/rar-mfs/blob/master/docu/images/algorithm_overview.png

    The Relevance and Redundancy ranking framework (RaR) is a method able to handle large-scale data sets and data sets with mixed features. Instead of directly selecting a subset, a feature ranking gives a more detailed overview of the relevance of the features. The method consists of a multistep approach where we
    1. repeatedly sample subsets from the whole feature space and examine their relevance and redundancy: exploration of the search space to gather more and more knowledge about the relevance and redundancy of features,
    2. deduce scores for features based on the scores of the subsets, and
    3. create the best possible ranking given the sampled insights.

    ### Parameters

    | Parameter | Default value | Description |
    | --------- | ------------- | ----------- |
    | m - contrast iterations | 100 | Number of different slices to evaluate while comparing marginal and conditional probabilities |
    | alpha - subspace slice size | 0.01 | Percentage of all instances to use as part of a slice which is used to compare distributions |
    | n - sampling iterations | 1000 | Number of different subsets to select in the sampling phase |
    | k - sample set size | 5 | Maximum size of the subsets to be selected in the sampling phase |

  12. Data from: Machine Learning Models and New Computational Tool for the...

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +1more
    Updated Jun 22, 2022
    Cite
    Martinez-Rios (2022). Machine Learning Models and New Computational Tool for the Discovery of Insect Repellents that Interfere with Olfaction [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6677764
    Explore at:
    Dataset updated
    Jun 22, 2022
    Dataset provided by
    Martinez-Rios
    Marrero-Ponce
    Garcia-Jacas
    Pulgar-Sánchez
    Hernández-Lambraño
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SI1_Supporting Information file (docx) brings together detailed information on the outstanding models obtained for each dataset analyzed in this study, such as statistical and training parameters and outliers. It also contains the responses, in spikes/s, of the mosquito Culex quinquefasciatus to the 50 IRs, as well as a full table of up-to-date studies related to QSAR and insect repellency.

    SI2_EXP1_50IRs from Liu et al (2013) SDF file presents the structures of each of the 50 IRs analyzed.

    SI3_EXP2_Datasets gathers the four datasets as SDF files from Oliferenko et al. (2013), Gaudin et al. (2008), Omolo et al. (2004), and Paluch et al. (2009) used for the repellency modeling in EXP2.

    SI4_EXP3_Prospective analysis provides Malaria Box Library (400 compounds) as an SDF file, which were analyzed in our virtual screening to prospect potential virtual hits.

    SI5_QuBiLS-MIDAS MDs lists contain three TXT lists of 3D molecular descriptors used in QuBiLS-MIDAS to describe the molecules used in the present study.

    SI6_EXP1_Sensillar Modeling comprises two subfolders: Classification and Regression models for each of the six sensilla. Models built to predict the physiological interaction experimentally obtained from Liu et al. (2013). All of the models are implemented in the software SiLiS-PAPACS. Every single folder compiles a DOCX file with the detailed description of the model, an XLSX file with the output obtained from the training in Weka 3.9.4, an ARFF, and CSV files with the MDs for each molecule, and the SDF of the study dataset.

    SI7_EXP2_Repellency Modeling encompasses the four datasets in the study: Oliferenko et al. (2013), Gaudin et al. (2008), Omolo et al. (2004), and Paluch et al. (2009). Inside the subfolders, there are three models per type of MDs (duplex, triple, generic, and mix) selected that best predict each dataset. As well as the SI6 folder, each model includes six files: DOCX, XLSX, ARFF, CSV, and an SDF.

    SI8_Virtual Hits includes the cluster analysis results and physico-chemical properties of new IR virtual leads.

  13. Detecting Machine-obfuscated Plagiarism

    • deepblue.lib.umich.edu
    Updated Oct 8, 2020
    Cite
    Foltynek, Tomas; Ruas, Terry; Scharpf, Philipp; Meuschke, Norman; Schubotz, Moritz; Grosky, William; Gipp, Bela (2020). Detecting Machine-obfuscated Plagiarism [Dataset]. http://doi.org/10.7302/bewj-qx93
    Explore at:
    Dataset updated
    Oct 8, 2020
    Dataset provided by
    Deep Blue Data
    Authors
    Foltynek, Tomas; Ruas, Terry; Scharpf, Philipp; Meuschke, Norman; Schubotz, Moritz; Grosky, William; Gipp, Bela
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data set is comprised of multiple folders.

    The corpus folder contains raw text used for training and testing in two splits, "document" and "paragraph". The spun documents and paragraphs are generated using the SpinBot tool (https://spinbot.com/API). The paragraph split is generated by only selecting paragraphs with 3 or more sentences in the document split. Each folder is divided into mg (i.e., machine generated through SpinBot) and og (i.e., original generated) files.

    The human judgement folder contains the human evaluation between original and spun documents (sample). It also contains the answers (keys) and survey results.

    The models folder contains the machine learning classifier models for each word embedding technique used (only for document split training). The models were exported using pickle (Python 3.6). The grid search for hyperparameter adjustments is described in the paper.

    The vector folders (train and test) contain the average of all word vectors for each document and paragraph. Each line has the number of dimensions of the word embedding technique used (see paper for more details) followed by its respective class (i.e., label mg or og). Each file belongs to one class, either "mg" or "og". The values are comma-separated (.csv). The extension is .arff but the files can be read as normal .txt files.
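
    A minimal sketch for reading one of the vector files, assuming they are plain comma-separated lines as described above (embedding dimensions followed by the mg/og label); the path is hypothetical:

        import pandas as pd

        vectors = pd.read_csv(
            "vectors/train/glove_og.arff",   # hypothetical path; .arff here is plain CSV text
            header=None,
            comment="@",                     # defensively skip ARFF-style header lines if present
        )
        X = vectors.iloc[:, :-1].to_numpy(dtype=float)   # averaged word-vector dimensions
        y = vectors.iloc[:, -1]                          # class label, "mg" or "og"
        print(X.shape, y.unique())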

  14. SLAC Dataset

    • zenodo.org
    zip
    Updated Mar 2, 2021
    Cite
    Cory McKay; Cory McKay (2021). SLAC Dataset [Dataset]. http://doi.org/10.5281/zenodo.4571050
    Explore at:
    zip (available download formats)
    Dataset updated
    Mar 2, 2021
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Cory McKay; Cory McKay
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This distribution includes details of the SLAC multimodal music dataset as well as features extracted from it. This dataset is intended to facilitate research comparing relative musical influences of four different musical modalities: symbolic, lyrical, audio and cultural. SLAC was assembled by independently collecting, for each of its component musical pieces, a symbolic MIDI encoding, a lyrical text transcription, an audio MP3 recording and cultural information mined from the internet. It is important to emphasize the independence of how each of these components were collected; for example, the MIDI and MP3 encodings of each piece were collected entirely separately, and neither was generated from the other.

    Features have been extracted from each of the musical pieces in SLAC using the jMIR (http://jmir.sourceforge.net) feature extractor corresponding to each of the modalities: jSymbolic for symbolic, jLyrics for lyrics, jAudio for audio and jWebMiner2 for mining cultural data from search engines and Last.fm (https://www.last.fm).

    SLAC is quite small, consisting of only 250 pieces. This is due to the difficulty of finding matching information in all four modalities independently. Although this limited size does pose certain limitations, the dataset is nonetheless the largest (and only) known dataset including all four independently collected modalities.

    The dataset is divided into ten genres, with 25 pieces belonging to each genre: Modern Blues, Traditional Blues, Baroque, Romantic, Bop, Swing, Hardcore Rap, Pop Rap, Alternative Rock and Metal. These can be collapsed into a 5-genre taxonomy, with 50 pieces per genre: Blues, Classical, Jazz, Rap and Rock. This facilitates experiments with both coarser and finer classes.

    SLAC was published at the ISMIR 2010 conference, and was itself an expansion of the SAC dataset (published at the ISMIR 2008 conference), which is identical except that it excludes the lyrics and lyrical features found in SLAC. Both ISMIR papers are included in this distribution.

    Due to copyright limitations, this distribution does not include the actual music or lyrics of the pieces comprising SLAC. It does, however, include details of the contents of the dataset as well as features extracted from each of its modalities using the jMIR software. These include the original features extracted for the 2010 ISMIR paper, as well as an updated set of symbolic features extracted in 2021 using the newer jSymbolic 2.2 feature extractor (published at ISMIR 2018). These jSymbolic 2.2 features include both the full MIDI feature set and a “conservative” feature set meant to limit potential biases due to encoding practice. Feature values are distributed as CSV files, Weka ARFF (https://www.cs.waikato.ac.nz/ml/weka/) files and ACE XML (http://jmir.sourceforge.net) files.

  15. GPJATK DATASET – Calibrated and synchronized multi-view video and motion...

    • zenodo.org
    bin, pdf
    Updated Apr 10, 2025
    Cite
    Bogdan Kwolek; Bogdan Kwolek; Agnieszka Michalczuk; Agnieszka Michalczuk; Tomasz Krzeszowski; Tomasz Krzeszowski; Adam Świtoński; Adam Świtoński; Henryk Josiński; Henryk Josiński; Konrad Wojciechowski; Konrad Wojciechowski (2025). GPJATK DATASET – Calibrated and synchronized multi-view video and motion capture dataset for evaluation of gait recognition [Dataset]. http://doi.org/10.1007/s11042-019-07945-y
    Explore at:
    pdf, bin (available download formats)
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Bogdan Kwolek; Bogdan Kwolek; Agnieszka Michalczuk; Agnieszka Michalczuk; Tomasz Krzeszowski; Tomasz Krzeszowski; Adam Świtoński; Adam Świtoński; Henryk Josiński; Henryk Josiński; Konrad Wojciechowski; Konrad Wojciechowski
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    =======================
    Summary
    =======================
    GPJATK DATASET – MULTI-VIEW VIDEO AND MOTION CAPTURE DATASET
    The GPJATK dataset has been designed for research on vision-based 3D gait recognition. It can also be used for evaluation of the multi-view (where gallery gaits from multiple views are combined to recognize probe gait on a single view) and the cross-view (where probe gait and gallery gait are recorded from two different views) gait recognition algorithms. In addition to problems related to gait recognition, the dataset can also be used for research on algorithms for human motion tracking and articulated pose estimation. The GPJATK dataset is available only for scientific use.
    All documents and papers that use the dataset must acknowledge the use of the dataset by including a citation of the following paper:
    B. Kwolek, A. Michalczuk, T. Krzeszowski, A. Switonski, H. Josinski, and K. Wojciechowski, „Calibrated and synchronized multi-view video and motion capture dataset for evaluation of gait recognition,” Multimedia Tools and Applications, vol. 78, iss. 22, p. 32437–32465, 2019, doi:10.1007/s11042-019-07945-y

    =======================
    Data description
    =======================
    The GPJATK dataset contains data captured by 10 mocap cameras and four calibrated and synchronized video cameras. The 3D gait dataset consists of 166 data sequences, that present the gait of 32 people (10 women and 22 men). In 128 data sequences, each of the individuals was dressed in his/her own clothes, in 24 data sequences, 6 of the performers (person #26-#31) changed clothes, and in 14 data sequences, 7 of the performers attending in the recordings had a backpack on his/her back. Each sequence consists of four videos with RGB images with a resolution of 960×540, which were recorded by synchronized and calibrated cameras with 25 frames per second, together with the corresponding MoCap data. The mocap data were registered at 100 Hz by a Vicon system consisting of 10 MX-T40 cameras.
    During the recording session, the actor has been requested to walk on the scene of size 6.5 m × 4.2 m along a line joining the cameras C2 and C4 as well as along the diagonal of the scene. In a single recording session, every performer walked from right to left, then from left to right, and afterward on the diagonal from upper-right to bottom-left and from bottom-left to upper-right corner of the scene. Some performers were also asked to attend additional recording sessions, i.e. after changing into another garment, and putting on a backpack.

    =======================
    Dataset structure
    =======================
    * Gait_Data - data for gait recognition containing 32 subjects. The data was obtained using both marker-less and marker-based motion capture systems.
    * Markerless motion tracking algorithm - dataset obtained using a markerless motion tracking algorithm
    * MoCap - dataset obtained using the Vicon motion capture system
    Each dataset contains:
    * Arff - motion data after smoothing, normalization, and MPCA in Weka ARFF format
    * AsfAmc - motion data saved in Acclaim ASF/AMC format
    * Csv - motion data saved in CSV format. Each row contains data for one frame and each column represents a different attribute. Units for angle attributes are degrees and units for distances are millimeters (see the loading sketch after this list).
    * Mat - Matlab .mat files
    * Sequences - 166 video sequences with 32 subjects. Each sequence consists of 4 video streams and MoCap data. Video is recorded with a frequency of 25 Hz, and MoCap data is recorded at 100 Hz. Both systems are synchronized.
    Each sequence contains:
    * Background - sequences with a background in AVI format
    * Calibration - camera calibration data (Tsai model)
    * Edges - images with detected edges
    * Videos - sequences in AVI format
    * MoCap - data from motion capture system in formats: c3d and Acclaim ASF/AMC
    * Silhouettes - images with person silhouettes
    * Matlab_scripts - Matlab scripts for generating .arff files
    These scripts require:
    * Tensor Toolbox
    * Matlab Toolbox for Multilinear Principal Component Analysis (MPCA) by Haiping LU (https://www.mathworks.com/matlabcentral/fileexchange/26168-multilinear-principal-component-analysis--mpca-?s_tid=prof_contriblnk)
    * ListOfSequences.txt - file with information about video sequences (start frames, frames numbers, offsets)
    * ActorsData.txt - file with information about recorded persons (age, gender, height, width)
    * GPJATK_Release_Agreement.pdf - GPJATK dataset release agreement which must be accepted to use the database
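
    A minimal sketch for inspecting one of the CSV motion-data files with pandas (the path is hypothetical; one row per frame, one column per attribute, as described in the list above):

        import pandas as pd

        # Hypothetical path inside the Gait_Data/MoCap/Csv directory; whether the files
        # carry a header row should be checked before relying on the column labels.
        frames = pd.read_csv("Gait_Data/MoCap/Csv/subject01_walk01.csv")
        print(frames.shape)          # rows = frames, columns = attributes
        print(frames.columns[:10])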

    =======================
    Project participants
    =======================
    Konrad Wojciechowski (Polish-Japanese Academy of Information Technology)
    Bogdan Kwolek

    =======================
    Acknowledgements
    =======================
    The recordings were made in the years 2012-2014 in the Human Motion Lab (Research and Development Center of the Polish-Japanese Academy of Information Technology) in Bytom as part of the projects: 1) „System with a library of modules for advanced analysis and an interactive synthesis of human motion” co-financed by the European Regional Development Fund under the Innovative Economy Operational Programme – Priority Axis 1; 2) OR00002111 financed by the National Centre for Research and Development (NCBiR).

    =======================
    Privacy statement
    =======================
    Data of human subjects is provided in coded form (without personal identifying information and with blurred faces to prevent identification).

    =======================
    Further information
    =======================
    For any questions, comments or other issues please contact Tomasz Krzeszowski

  16. Data from: Identifying Machine-Paraphrased Plagiarism

    • zenodo.org
    • opendatalab.com
    • +3more
    zip
    Updated Sep 26, 2022
    Cite
    Jan Philip Wahle; Terry Ruas; Terry Ruas; Tomas Foltynek; Tomas Foltynek; Norman Meuschke; Norman Meuschke; Bela Gipp; Bela Gipp; Jan Philip Wahle (2022). Identifying Machine-Paraphrased Plagiarism [Dataset]. http://doi.org/10.5281/zenodo.3608000
    Explore at:
    zip (available download formats)
    Dataset updated
    Sep 26, 2022
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Jan Philip Wahle; Terry Ruas; Terry Ruas; Tomas Foltynek; Tomas Foltynek; Norman Meuschke; Norman Meuschke; Bela Gipp; Bela Gipp; Jan Philip Wahle
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    README.txt

    Title: Identifying Machine-Paraphrased Plagiarism
    Authors: Jan Philip Wahle, Terry Ruas, Tomas Foltynek, Norman Meuschke, and Bela Gipp
    contact email: wahle@gipplab.org; ruas@gipplab.org;
    Venue: iConference
    Year: 2022
    ================================================================
    Dataset Description:

    Training:
    200,767 paragraphs (98,282 original, 102,485 paraphrased) extracted from 8,024 Wikipedia (English) articles (4,012 original, 4,012 paraphrased using the SpinBot API).

    Testing:
    SpinBot:
    arXiv - Original - 20,966; Spun - 20,867
    Theses - Original - 5,226; Spun - 3,463
    Wikipedia - Original - 39,241; Spun - 40,729

    SpinnerChief-4W:
    arXiv - Original - 20,966; Spun - 21,671
    Theses - Original - 2,379; Spun - 2,941
    Wikipedia - Original - 39,241; Spun - 39,618

    SpinnerChief-2W:
    arXiv - Original - 20,966; Spun - 21,719
    Theses - Original - 2,379; Spun - 2,941
    Wikipedia - Original - 39,241; Spun - 39,697

    ================================================================
    Dataset Structure:

    [human_evaluation] folder: human evaluation to identify human-generated text and machine-paraphrased text. It contains the files (original and spun) as for the answer-key for the survey performed with human subjects (all data is anonymous for privacy reasons).

    NNNNN.txt - whole document from which an extract was taken for human evaluation
    key.txt.zip - information about each case (ORIG/SPUN)
    results.xlsx - raw results downloaded from the survey tool (the extracts which humans judged are in the first line)
    results-corrected.xlsx - at the very beginning, there was a mistake in one question (wrong extract). These results were excluded.


    [automated_evaluation]: contains all files used for the automated evaluation considering [spinbot] (https://spinbot.com/API) and [spinnerchief] (http://developer.spinnerchief.com/API_Document.aspx).

    • Each paraphrase tool folder contains:
    • [corpus] and [vectors] sub-folders.
    • For [spinnerchief], two variations are included, with a 4-word-changing ratio (default) and a 2-word-changing ratio.

    [vectors] sub-folder contains the average of all word vectors for each paragraph. Each line has the number of dimensions of the word embedding technique used (see paper for more details) followed by its respective class (i.e., label mg or og). Each file belongs to one class, either "mg" or "og". The values are comma-separated (.csv). The extension is .arff but the files can be read as normal .txt files.

    • The word embedding technique used is described in the file name with the following structure:

    - d2v - doc2vec
    - google - word2vec
    - fasttextnw - fastText without subwording
    - fasttextsw - fastText with subwording
    - glove - Glove

    Details for each technique used can be found in the paper.

    arxivp - arXiv paragraph split
    thesisp - Theses paragraph split
    wikip - Wikipedia paragraph split (wikipedia_paragraph_vector_train contains the vectors used for training and follows the same wikip structure)

    Details for each technique used can be found in the paper referenced at the start of this README file.

    [corpus] sub-folder: contains the raw text (no pre-processing) used for training and testing at the paragraph level.

    • The spun paragraphs used for training are generated only with the SpinBot tool; for testing, both SpinBot and SpinnerChief are used.
    • The paragraph split is generated by selecting paragraphs with 3 or more sentences from the original documents. Each folder is divided into mg (i.e., machine-generated through SpinBot or SpinnerChief) and og (i.e., original) files. The document split is not available since our experiments only use the paragraph level.
    • Machine learning models: SVM, Naive Bayes, and Logistic Regression. The grid search used for hyperparameter tuning is described in the paper (see the illustrative sketch after this list).
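    For illustration only, the sketch below shows one way such a classifier could be trained on the averaged vectors with scikit-learn; the file names, the choice of SVC, and the hyperparameter grid are assumptions for the sketch, not the exact configuration reported in the paper.

        # Illustrative sketch, not the authors' exact setup; file names and the
        # hyperparameter grid are assumptions (see the paper for the real search space).
        import numpy as np
        from sklearn.model_selection import GridSearchCV
        from sklearn.svm import SVC

        def load_vector_file(path):
            rows = [l.strip().split(",") for l in open(path)
                    if l.strip() and not l.startswith("@")]
            X = np.array([[float(v) for v in r[:-1]] for r in rows])
            y = np.array([r[-1] for r in rows])
            return X, y

        # Hypothetical training files, one per class (og = original, mg = machine-generated).
        X_og, y_og = load_vector_file("spinbot/vectors/wikipedia_paragraph_vector_train_og.arff")
        X_mg, y_mg = load_vector_file("spinbot/vectors/wikipedia_paragraph_vector_train_mg.arff")
        X = np.vstack([X_og, X_mg])
        y = np.concatenate([y_og, y_mg])

        grid = GridSearchCV(
            SVC(),
            param_grid={"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},  # illustrative grid
            cv=5,
            scoring="f1_macro",
        )
        grid.fit(X, y)
        print(grid.best_params_, grid.best_score_)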

    @incollection{WahleRFM22,
    title = {Identifying {{Machine-Paraphrased Plagiarism}}},
    booktitle = {Information for a {{Better World}}: {{Shaping}} the {{Global Future}}},
    author = {Wahle, Jan Philip and Ruas, Terry and Folt{\'y}nek, Tom{\'a}{\v s} and Meuschke, Norman and Gipp, Bela},
    editor = {Smits, Malte},
    year = {2022},
    volume = {13192},
    pages = {393--413},
    publisher = {{Springer International Publishing}},
    address = {{Cham}},
    doi = {10.1007/978-3-030-96957-8_34},
    isbn = {978-3-030-96956-1 978-3-030-96957-8},
    }

    For our previous publication, which used only SpinBot and Wikipedia articles for the document and paragraph splits, the corresponding dataset is hosted in DeepBlue.


  17. m

    Dataset for Towards Liveness Detection in Keystroke Dynamics: Revealing...

    • data.mendeley.com
    • ieee-dataport.org
    Updated May 19, 2021
    + more versions
    Cite
    Nahuel González (2021). Dataset for Towards Liveness Detection in Keystroke Dynamics: Revealing Synthetic Forgeries [Dataset]. http://doi.org/10.17632/xvg5j5z29p.1
    Explore at:
    Dataset updated
    May 19, 2021
    Authors
    Nahuel González
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset used in the article "The Reverse Problem of Keystroke Dynamics: Guessing Typed Text with Keystroke Timings". It includes CSV files with dataset result summaries, the evaluated sentences, detailed results, and scores. The results data contains training and evaluation ARFF files for each user, with features of synthetic and legitimate samples as described in the article. The source data comes from three free-text keystroke dynamics datasets used in previous studies by the authors (LSIA) and two other unrelated groups (KM and PROSODY, the latter subdivided into GAY, GUN, and REVIEW). Two different languages are represented: Spanish in LSIA and English in KM and PROSODY.

    The original dataset KM was used to compare anomaly-detection algorithms for keystroke dynamics in the article "Comparing anomaly-detection algorithms for keystroke dynamics" by Killourhy, K.S. and Maxion, R.A. The original dataset PROSODY was used to find cues of deceptive intent by analyzing variations in typing patterns in the article "Keystroke patterns as prosody in digital writings: A case study with deceptive reviews and essays" by Banerjee, R., Feng, S., Kang, J.S., and Choi, Y.

    We introduce two strategies using higher order contexts and empirical distributions to generate artificial samples of keystroke timings, together with a liveness detection system for keystroke dynamics that leverages them as adversaries. To aid with this objective, a new derived feature based on the inverse function of the smoothed empirical cumulative distributions is presented. One of the proposed attacking strategies outperforms other methods previously evaluated in the literature by a large margin, doubling and sometimes tripling their false acceptance rates, to around 15%, when data of the targeted user is available. If only general population data is available to an attacker, the liveness detection system achieves false acceptance and false rejection rates between 1% and 2%, consistently, over three datasets.
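    To make the derived feature more concrete, the following is a minimal sketch of a smoothed empirical cumulative distribution and its inverse for keystroke timings; it only illustrates the general idea on synthetic timing data and is not the authors' exact formulation.

        # Illustration of the general idea, not the authors' exact method.
        import numpy as np

        rng = np.random.default_rng(0)
        hold_times = rng.gamma(shape=2.0, scale=50.0, size=500)    # synthetic hold times (ms)

        xs = np.sort(hold_times)
        cdf = (np.arange(1, xs.size + 1) - 0.5) / xs.size          # mid-point ECDF (simple smoothing)

        def ecdf(t):
            """Smoothed empirical CDF evaluated by linear interpolation."""
            return np.interp(t, xs, cdf, left=0.0, right=1.0)

        def ecdf_inverse(p):
            """Inverse of the smoothed ECDF (empirical quantile function)."""
            return np.interp(p, cdf, xs)

        print(round(ecdf(100.0), 3), round(ecdf_inverse(0.5), 1))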

  18. D

    Copy of "Recognising user actions during cooking task (Cooking task dataset)...

    • data.elaine.uni-rostock.de
    csv, pdf, txt
    Updated Oct 30, 2019
    Cite
    ELAINE INF (2019). Copy of "Recognising user actions during cooking task (Cooking task dataset) – IMU Data" [Dataset]. https://data.elaine.uni-rostock.de/ca/dataset/0339a25f-a606-46f6-9c97-81fea4f69ef3
    Explore at:
    Available download formats: csv(9126955), csv(10055152), csv(11749660), csv(8783817), csv(6212179), pdf(1105863), txt(2029), csv(8834418), csv(10011740)
    Dataset updated
    Oct 30, 2019
    Dataset provided by
    ELAINE INF
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains a copy of the original dataset "Recognising user actions during cooking task (Cooking task dataset) – IMU Data" published by Frank Krüger, Albert Hein, Kristina Yordanova, and Thomas Kirste at https://doi.org/10.18453/rosdok_id00000154. The only change is that the ARFF files were converted into CSV using the WEKA open-source machine learning software (version 3.8.3, https://www.cs.waikato.ac.nz/ml/weka/).
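    The same conversion can also be scripted instead of using the WEKA tooling; the snippet below is a sketch with scipy and pandas, using placeholder file names.

        # Sketch of an ARFF-to-CSV conversion in Python; file names are placeholders.
        import pandas as pd
        from scipy.io import arff

        data, meta = arff.loadarff("input.arff")
        df = pd.DataFrame(data)

        # Nominal attributes are loaded as bytes; decode them to plain strings.
        for col in df.columns:
            if df[col].dtype == object:
                df[col] = df[col].str.decode("utf-8")

        df.to_csv("output.csv", index=False)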

  19. e

    SDOstreamclust: Stream Clustering Robust to Concept Drift - Evaluation Tests...

    • b2find.eudat.eu
    Updated Sep 17, 2024
    + more versions
    Cite
    (2024). SDOstreamclust: Stream Clustering Robust to Concept Drift - Evaluation Tests - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/7e9eb5b9-f166-567e-a521-f3b3be884bf2
    Explore at:
    Dataset updated
    Sep 17, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SDOstreamclust Evaluation Tests conducted for the paper "Stream Clustering Robust to Concept Drift".

    Context and methodology

    SDOstreamclust is a stream clustering algorithm able to process data incrementally or per batches. It is a combination of the previous SDOstream (anomaly detection in data streams) and SDOclust (static clustering). SDOstreamclust holds the characteristics of SDO algorithms: lightweight, intuitive, self-adjusting, resistant to noise, capable of identifying non-convex clusters, and constructed upon robust parameters and interpretable models. Moreover, it shows excellent adaptation to concept drift.

    In this repository, SDOstreamclust is evaluated with 165 datasets (both synthetic and real) and compared with CluStream, DBstream, DenStream, and StreamKMeans. This repository is framed within research on the following domains: algorithm evaluation, stream clustering, unsupervised learning, machine learning, data mining, and streaming data analysis. Datasets and algorithms can be used for experiment replication and for further evaluation and comparison.

    Docker

    A Docker version is also available at https://hub.docker.com/r/fiv5/sdostreamclust

    Technical details

    Experiments are conducted in Python v3.8.14. The file and folder structure is as follows:

    [algorithms] contains a script with functions related to algorithm configurations.
    [data] contains datasets in ARFF format.
    [results] contains CSV files with algorithms' performances obtained from running the "run.sh" script (as shown in the paper).
    "dependencies.sh" lists and installs Python dependencies.
    "pysdoclust-stream-main.zip" contains the SDOstreamclust Python package.
    "README.md" shows details and instructions to use this repository.
    "run.sh" runs the complete experiments.
    "run_comp.py" runs experiments specified by arguments.
    "TSindex.py" implements functions for the Temporal Silhouette index.

    Note: if the code in SDOstreamclust is modified, the SWIG (v4.2.1) wrappers have to be rebuilt and SDOstreamclust reinstalled with pip.
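    As a rough illustration of per-batch stream processing (not the SDOstreamclust API itself, which ships in the pysdoclust-stream package), the sketch below feeds an ARFF dataset to an incremental clusterer in fixed-size batches; the file name, the batch size, and the use of scikit-learn's MiniBatchKMeans as a stand-in are assumptions.

        # Stand-in sketch of per-batch stream clustering; MiniBatchKMeans substitutes
        # for SDOstreamclust, and the file name and batch size are assumptions.
        import numpy as np
        import pandas as pd
        from scipy.io import arff
        from sklearn.cluster import MiniBatchKMeans

        data, _ = arff.loadarff("data/example_stream.arff")        # hypothetical ARFF stream
        X = pd.DataFrame(data).select_dtypes(include=[np.number]).to_numpy()

        model = MiniBatchKMeans(n_clusters=5, random_state=0)
        batch_size = 500
        for start in range(0, len(X), batch_size):
            batch = X[start:start + batch_size]
            model.partial_fit(batch)                               # incremental update per batch
            labels = model.predict(batch)                          # cluster labels for this batch
        print("processed", len(X), "points in", -(-len(X) // batch_size), "batches")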

  20. V2X Security Threats for Cluser-based Evaluation

    • zenodo.org
    application/gzip
    Updated Oct 14, 2021
    Cite
    Fábio Gonçalves; Alexandre Santos; Joaquim Macedo (2021). V2X Security Threats for Cluser-based Evaluation [Dataset]. http://doi.org/10.5281/zenodo.5567417
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Oct 14, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Fábio Gonçalves; Alexandre Santos; Joaquim Macedo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These datasets are obtained from the datasets in OID. Those datasets were filtered to obtain only the data collected by the vehicles composing the clusters, as shown in the paper "Evaluation of VANET Datasets in context of an Intrusion Detection System", published at the 29th International Conference on Software, Telecommunications and Computer Networks (SoftCOM 2021).

    The data from the multiple maps were then grouped to compose the training dataset. The test dataset is similar to the one in the original map 7 datasets. Additionally, the datasets were converted into the ARFF format used by Weka; conversion to CSV can easily be done by deleting the header, as sketched below.
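    A minimal sketch of that header-stripping conversion, with placeholder file names, could look like this:

        # Strip the ARFF header (everything up to and including the @data line)
        # and keep the comma-separated data rows as CSV. File names are placeholders.
        with open("train.arff") as src, open("train.csv", "w") as dst:
            in_data = False
            for line in src:
                if in_data and line.strip():
                    dst.write(line)                     # data rows are already comma-separated
                elif line.strip().lower().startswith("@data"):
                    in_data = True                      # header ends here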

