100+ datasets found
  1. Data from: ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction

    • zenodo.org
    • explore.openaire.eu
• +1 more
    zip
    Updated Jan 27, 2022
    Cite
Hossein Keshavarz; Meiyappan Nagappan (2022). ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction [Dataset]. http://doi.org/10.5281/zenodo.5907847
Available download formats: zip
    Dataset updated
    Jan 27, 2022
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
Hossein Keshavarz; Meiyappan Nagappan
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction

This archive contains the ApacheJIT dataset presented in the paper "ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction", as well as the replication package. The paper was submitted to the MSR 2022 Data Showcase Track.

The datasets are available under the directory dataset. There are four datasets in this directory; a minimal loading sketch follows the list below.

1. apachejit_total.csv: This file contains the entire dataset. Commits are specified by their identifiers, and a set of commit metrics explained in the paper is provided as features. The column buggy specifies whether or not the commit introduced a bug into the system.
2. apachejit_train.csv: This file is a subset of the entire dataset. It provides a balanced set that we recommend for models that are sensitive to class imbalance. This set is obtained from the first 14 years of data (2003 to 2016).
3. apachejit_test_large.csv: This file is a subset of the entire dataset containing the commits from the last 3 years of data. This set is not balanced, in order to represent a real-life scenario of JIT model evaluation where the model is trained on historical data and applied to future data without any modification.
4. apachejit_test_small.csv: This file is a subset of the test file above. Since the large test file has more than 30,000 commits, we also provide a smaller test set which is still unbalanced and drawn from the last 3 years of data.
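As a quick orientation, the snippet below sketches how the train/test splits described above could be loaded and evaluated with a simple classifier. It is a minimal sketch: the label column buggy comes from the description above, but the library choice, model, and the idea of selecting all other numeric columns as features are assumptions, not the paper's setup.

# Minimal sketch (assumptions: pandas/scikit-learn available, CSVs under ./dataset,
# a 'buggy' label column as described above; other column names are not assumed).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

train = pd.read_csv("dataset/apachejit_train.csv")
test = pd.read_csv("dataset/apachejit_test_large.csv")

# Use every numeric column except the label as a feature.
features = [c for c in train.columns
            if c != "buggy" and pd.api.types.is_numeric_dtype(train[c])]

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(train[features], train["buggy"])

# The large test split is intentionally imbalanced, so prefer a
# threshold-free metric such as ROC AUC over plain accuracy.
scores = clf.predict_proba(test[features])[:, 1]
print("ROC AUC:", roc_auc_score(test["buggy"], scores))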

In addition to the dataset, we also provide the scripts we used to build it. These scripts are written in Python 3.8; therefore, Python 3.8 or above is required. To set up the environment, we have provided a list of required packages in the file requirements.txt. Additionally, one filtering step requires GumTree [1]. For Java, GumTree requires Java 11; for other languages, external tools are needed. An installation guide and more details can be found in the GumTree repository (see References).

The scripts consist of Python scripts under the directory src and Python notebooks under the directory notebooks. The Python scripts are mainly responsible for conducting GitHub searches via the GitHub search API and collecting commits through the PyDriller package [2]. The notebooks link the fixed issue reports with their corresponding fixing commits and apply some filtering steps. The bug-inducing candidates are then filtered again using the gumtree.py script, which utilizes the GumTree package. Finally, the remaining bug-inducing candidates are combined with the clean commits in the dataset_construction notebook to form the entire dataset.

More specifically, git_token handles the GitHub API token that is necessary for requests to the GitHub API. The collector script performs the GitHub search. Tracing changed lines and git annotate is done in gitminer using PyDriller. Finally, gumtree applies four filtering steps (number of lines, number of files, language, and change significance).
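For readers unfamiliar with PyDriller, the fragment below sketches the kind of commit traversal that a script like gitminer performs. It is an illustrative sketch only (assuming PyDriller 2.x and a locally cloned repository path), not the project's actual code.

# Illustrative sketch of PyDriller-style commit mining (assumes PyDriller 2.x).
from pydriller import Repository

for commit in Repository("path/to/cloned/apache/project").traverse_commits():
    # Per-commit size metrics similar in spirit to JIT defect-prediction features.
    added = sum(m.added_lines for m in commit.modified_files)
    deleted = sum(m.deleted_lines for m in commit.modified_files)
    print(commit.hash, len(commit.modified_files), added, deleted)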

    References:

    1. GumTree

    • https://github.com/GumTreeDiff/gumtree

• Jean-Rémy Falleri, Floréal Morandat, Xavier Blanc, Matias Martinez, and Martin Monperrus. 2014. Fine-grained and accurate source code differencing. In ACM/IEEE International Conference on Automated Software Engineering, ASE '14, Vasteras, Sweden, September 15-19, 2014, 313-324.

    2. PyDriller

    • https://pydriller.readthedocs.io/en/latest/

• Davide Spadini, Maurício Aniche, and Alberto Bacchelli. 2018. PyDriller: Python Framework for Mining Software Repositories. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Lake Buena Vista, FL, USA) (ESEC/FSE 2018). Association for Computing Machinery, New York, NY, USA, 908-911.

2. LSC (Leicester Scientific Corpus)

    • figshare.le.ac.uk
    Updated Apr 15, 2020
    Cite
    Neslihan Suzen (2020). LSC (Leicester Scientific Corpus) [Dataset]. http://doi.org/10.25392/leicester.data.9449639.v2
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

    The LSC (Leicester Scientific Corpus)

April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

The data are extracted from the Web of Science [1]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.

[Version 2] A further cleaning is applied in Data Processing for LSC Abstracts in Version 1*. Details of the cleaning procedure are explained in Step 6.
* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v1

Getting Started

This text provides information on the LSC (Leicester Scientific Corpus) and the pre-processing steps applied to abstracts, and describes the structure of the files that organise the corpus. The corpus was created to be used in future work on the quantification of the meaning of research texts and to be made available for use in Natural Language Processing projects.

LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [1]. The corpus contains only documents in English. Each document in the corpus contains the following parts:

1. Authors: the list of authors of the paper
2. Title: the title of the paper
3. Abstract: the abstract of the paper
4. Categories: one or more categories from the list of categories [2]. The full list of categories is presented in the file 'List_of_Categories.txt'.
5. Research Areas: one or more research areas from the list of research areas [3]. The full list of research areas is presented in the file 'List_of_Research_Areas.txt'.
6. Total Times Cited: the number of times the paper was cited by other items from all databases within the Web of Science platform [4]
7. Times Cited in Core Collection: the total number of times the paper was cited by other papers within the WoS Core Collection [4]

The corpus was collected online in July 2018 and contains the number of citations from publication date to July 2018. We describe a document as the collection of information (about a paper) listed above. The total number of documents in LSC is 1,673,350.

Data Processing

Step 1: Downloading the Data Online
The dataset was collected manually by exporting documents as tab-delimited files online. All documents are available online.

Step 2: Importing the Dataset to R
The LSC was collected as TXT files. All documents are extracted into R.

Step 3: Cleaning the Data of Documents with Empty Abstracts or without Categories
As our research is based on the analysis of abstracts and categories, all documents with empty abstracts and all documents without categories were removed.

Step 4: Identification and Correction of Concatenated Words in Abstracts
Medicine-related publications in particular use 'structured abstracts'. Such abstracts are divided into sections with distinct headings such as introduction, aim, objective, method, result, conclusion, etc. The tool used for extracting abstracts concatenates the section headings with the first word of the section; for instance, we observe words such as ConclusionHigher and ConclusionsRT. The detection and identification of such words is done by sampling medicine-related publications with human intervention. Detected concatenated words are split into two words; for instance, the word 'ConclusionHigher' is split into 'Conclusion' and 'Higher'. The section headings in such abstracts are listed below:

Background, Method(s), Design, Theoretical, Measurement(s), Location, Aim(s), Methodology, Process, Abstract, Population, Approach, Objective(s), Purpose(s), Subject(s), Introduction, Implication(s), Patient(s), Procedure(s), Hypothesis, Measure(s), Setting(s), Limitation(s), Discussion, Conclusion(s), Result(s), Finding(s), Material(s), Rationale(s), Implications for health and nursing policy

Step 5: Extracting (Sub-setting) the Data Based on Lengths of Abstracts
After correction, the lengths of the abstracts are calculated. 'Length' indicates the total number of words in the text, calculated by the same rule as Microsoft Word's 'word count' [5]. According to the APA style manual [6], an abstract should contain between 150 and 250 words. In LSC, we decided to limit the length of abstracts to between 30 and 500 words in order to study documents with abstracts of typical length and to avoid the effect of length on the analysis.
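The following fragment illustrates how the Step 4 heading-splitting and the Step 5 length filter could be reproduced. It is a sketch under stated assumptions (a plain-Python re-implementation with only a small subset of the heading list), not the authors' original R code.

# Sketch of Steps 4-5 (assumption: Python re-implementation; the original
# pipeline was written in R, and only a few headings are shown here).
import re

HEADINGS = ["Background", "Methods", "Results", "Conclusion", "Conclusions",
            "Objective", "Introduction", "Discussion"]  # subset of the full list
# Split a heading that was concatenated with the following capitalised word,
# e.g. "ConclusionHigher" -> "Conclusion Higher".
pattern = re.compile(r"\b(" + "|".join(HEADINGS) + r")(?=[A-Z][a-z])")

def clean_abstract(text: str) -> str:
    return pattern.sub(r"\1 ", text)

def keep_by_length(text: str, lo: int = 30, hi: int = 500) -> bool:
    # Step 5: keep abstracts whose word count falls in the 30-500 range.
    return lo <= len(text.split()) <= hi

example = "ConclusionHigher doses were associated with better outcomes."
print(clean_abstract(example))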

Step 6: [Version 2] Cleaning Copyright Notices, Permission Policies, Journal Names and Conference Names from LSC Abstracts in Version 1
Conferences and journals can add a footer with a copyright notice, permission policy, journal name, licence, author's rights or conference name below the text of an abstract. The tool used for extracting and processing abstracts from the WoS database attaches such footers to the text. For example, casual observation shows that copyright notices such as 'Published by Elsevier Ltd.' appear in many texts. To avoid abnormal appearances of words in further analysis, such as bias in frequency calculations, we performed a cleaning procedure on such sentences and phrases in the abstracts of LSC Version 1. We removed copyright notices, names of conferences, names of journals, authors' rights, licences and permission policies identified by sampling of abstracts.

Step 7: [Version 2] Re-extracting (Sub-setting) the Data Based on Lengths of Abstracts
The cleaning procedure described in the previous step left some abstracts with fewer words than our minimum length criterion (30 words); 474 texts were removed.

Step 8: Saving the Dataset in CSV Format
Documents are saved into 34 CSV files. In the CSV files, the information is organised with one record per line, and the abstract, title, list of authors, list of categories, list of research areas, and times cited are recorded in fields.

To access the LSC for research purposes, please email ns433@le.ac.uk.

References
[1] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[2] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
[3] Research Areas in WoS. Available: https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html
[4] Times Cited in WoS Core Collection. (15 July). Available: https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US
[5] Word Count. Available: https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3
[6] A. P. Association, Publication Manual. American Psychological Association, Washington, DC, 1983.

  3. Data from: Automatic composition of descriptive music: A case study of the...

    • figshare.com
    txt
    Updated May 31, 2023
    Cite
    Lucía Martín-Gómez (2023). Automatic composition of descriptive music: A case study of the relationship between image and sound [Dataset]. http://doi.org/10.6084/m9.figshare.6682998.v1
Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
Figshare (http://figshare.com/)
    Authors
    Lucía Martín-Gómez
    License

MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

FANTASIA

This repository contains the data related to image descriptors and sound associated with a selection of frames of the films Fantasia and Fantasia 2000, produced by Disney.

About
This repository contains the data used in the article "Automatic composition of descriptive music: A case study of the relationship between image and sound", published in the 6th International Workshop on Computational Creativity, Concept Invention, and General Intelligence (C3GI). The data structure is explained in detail in the article.

Abstract
Human beings establish relationships with the environment mainly through sight and hearing. This work focuses on the concept of descriptive music, which makes use of sound resources to narrate a story. The film Fantasia, produced by Walt Disney, was used in the case study. One of its musical pieces is analyzed in order to obtain the relationship between image and music. This connection is subsequently used to create a descriptive musical composition from a new video. Naive Bayes, Support Vector Machine and Random Forest are the three classifiers studied for the model induction process. After an analysis of their performance, it was concluded that Random Forest provided the best solution; the produced musical composition had a considerably high descriptive quality.

Data
• Nutcracker_data.arff: image descriptors and the most important sound of each frame from the fragment "The Nutcracker Suite" in the film Fantasia. Data stored in ARFF format.
• Firebird_data.arff: image descriptors of each frame from the fragment "The Firebird" in the film Fantasia 2000. Data stored in ARFF format.
• Firebird_midi_prediction.csv: frame number of the fragment "The Firebird" in the film Fantasia 2000 and the sound predicted by the system, encoded in MIDI. Data stored in CSV format.
• Firebird_prediction.mp3: audio file with the synthesis of the prediction data for the fragment "The Firebird" of the film Fantasia 2000.

License
Data is available under the MIT License. To make use of the data, the article must be cited.
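As a hedged illustration of how the ARFF files described above could be consumed in Python (the article itself does not prescribe a toolchain), the sketch below loads Nutcracker_data.arff and fits a Random Forest, the classifier the study found to perform best. Column names are not assumed; treating the last attribute as the target is an assumption.

# Sketch only: assumes scipy, pandas and scikit-learn; treats the final ARFF
# attribute as the class/sound label, which may differ from the real schema.
from scipy.io import arff
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

data, meta = arff.loadarff("Nutcracker_data.arff")
df = pd.DataFrame(data)

X = df.iloc[:, :-1]                       # image descriptors
y = df.iloc[:, -1].astype(str)            # most important sound per frame

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict(X.head()))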

4. DataCo SMART SUPPLY CHAIN FOR BIG DATA ANALYSIS

    • data.mendeley.com
    • narcis.nl
    Updated Mar 12, 2019
    + more versions
    Cite
    Fabian Constante (2019). DataCo SMART SUPPLY CHAIN FOR BIG DATA ANALYSIS [Dataset]. http://doi.org/10.17632/8gx2fvg2k6.3
    Dataset updated
    Mar 12, 2019
    Authors
    Fabian Constante
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

A dataset of supply chains used by the company DataCo Global was used for the analysis. The supply chain dataset allows the use of machine learning algorithms and R software. Areas of important registered activities: Provisioning, Production, Sales, and Commercial Distribution. It also allows the correlation of structured data with unstructured data for knowledge generation.

Data types: structured data in DataCoSupplyChainDataset.csv; unstructured data in tokenized_access_logs.csv (clickstream).

Types of products: Clothing, Sports, and Electronic Supplies.

Additionally, another file, DescriptionDataCoSupplyChain.csv, contains the description of each of the variables of DataCoSupplyChainDataset.csv.
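A minimal sketch of loading the structured table together with its variable descriptions follows; the file names come from the description above, while the encoding is an assumption that may need adjusting.

# Sketch: load the structured data and the variable-description file.
# encoding="latin-1" is an assumption; adjust if the files are UTF-8.
import pandas as pd

orders = pd.read_csv("DataCoSupplyChainDataset.csv", encoding="latin-1")
variables = pd.read_csv("DescriptionDataCoSupplyChain.csv", encoding="latin-1")

print(orders.shape)
print(variables.head())   # one row per variable with its description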

5. Financial News dataset for text mining

    • data.niaid.nih.gov
    • explore.openaire.eu
    Updated Oct 23, 2021
    Cite
    turenne nicolas (2021). Financial News dataset for text mining [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5569112
    Dataset updated
    Oct 23, 2021
    Dataset authored and provided by
    turenne nicolas
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Please cite this dataset as:

    Nicolas Turenne, Ziwei Chen, Guitao Fan, Jianlong Li, Yiwen Li, Siyuan Wang, Jiaqi Zhou (2021) Mining an English-Chinese parallel Corpus of Financial News, BNU HKBU UIC, technical report

The dataset comes from the Financial Times news website (https://www.ft.com/).

News articles are written in both Chinese and English.

FTIE.zip contains each document as an individual file.

FT-en-zh.rar contains all documents in one file.

Below is a sample document from the dataset, defined by the following fields and syntax:

    id;time;english_title;chinese_title;integer;english_body;chinese_body

    1021892;2008-09-10T00:00:00Z;FLAW IN TWIN TOWERS REVEALED;科学家发现纽约双子塔倒塌的根本原因;1;Scientists have discovered the fundamental reason the Twin Towers collapsed on September 11 2001. The steel used in the buildings softened fatally at 500?C – far below its melting point – as a result of a magnetic change in the metal. @ The finding, announced at the BA Festival of Science in Liverpool yesterday, should lead to a new generation of steels capable of retaining strength at much higher temperatures.;科学家发现了纽约世贸双子大厦(Twin Towers)在2001年9月11日倒塌的根本原因。由于磁性变化,大厦使用的钢在500摄氏度——远远低于其熔点——时变软,从而产生致命后果。 @ 这一发现在昨日利物浦举行的BA科学节(BA Festival of Science)上公布。这应会推动能够在更高温度下保持强度的新一代钢铁的问世。
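A hedged sketch of reading these records in Python follows. It assumes one record per line and that the first five fields never contain semicolons (bodies with embedded semicolons would need a more careful parser); the input file name is illustrative, not taken from the archive.

# Sketch: parse the semicolon-delimited records described above.
# Assumption: the first five fields never contain ';'.
FIELDS = ["id", "time", "english_title", "chinese_title",
          "integer", "english_body", "chinese_body"]

def parse_line(line: str) -> dict:
    parts = line.rstrip("\n").split(";", len(FIELDS) - 1)
    return dict(zip(FIELDS, parts))

with open("FT-en-zh.txt", encoding="utf-8") as fh:   # hypothetical file name
    for line in fh:
        doc = parse_line(line)
        print(doc["id"], doc["english_title"])
        break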

    The dataset contains 60,473 bilingual documents.

The time range is from 2007 to 2020.

This dataset has been used for parallel bilingual news mining in the finance domain.

6. LScD (Leicester Scientific Dictionary)

    • figshare.le.ac.uk
    docx
    Updated Apr 15, 2020
    + more versions
    Cite
    Neslihan Suzen (2020). LScD (Leicester Scientific Dictionary) [Dataset]. http://doi.org/10.25392/leicester.data.9746900.v3
Available download formats: docx
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

LScD (Leicester Scientific Dictionary)

April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

[Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus), Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in the description of Version 2 below; we did not repeat the explanation. After the pre-processing steps, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are the same as those described for LScD Version 2 below.
* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2

[Version 2] Getting Started
This document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). The dictionary is created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from LSC and instructions for using the code are available in [2]. The code can also be used for lists of texts from other sources; amendments to the code may be required.

LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains a title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018. The total number of documents in LSC is 1,673,824.

LScD is an ordered list of words from the texts of abstracts in LSC. The dictionary stores 974,238 unique words and is sorted by the number of documents containing each word, in descending order. All words in the LScD are in stemmed form. The LScD contains the following information:
1. Unique words in abstracts
2. Number of documents containing each word
3. Number of appearances of a word in the entire corpus

Processing the LSC

Step 1. Downloading the LSC Online: Use of the LSC is subject to acceptance of a request for the link by email. To access the LSC for research purposes, please email ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.

Step 2. Importing the Corpus to R: The full R code for processing the corpus can be found on GitHub [2]. All following steps can be applied to an arbitrary list of texts from any source with changes of parameters. The structure of the corpus, such as file format and the names (and positions) of fields, should be taken into account when applying our code. The organisation of the CSV files of LSC is described in the README file for LSC [1].

Step 3. Extracting Abstracts and Saving Metadata: Metadata, which include all fields in a document except the abstract, and the abstract field are separated. Metadata are then saved as MetaData.R. The fields of the metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.

Step 4. Text Pre-processing Steps on the Collection of Abstracts: In this section, we present our approaches to pre-processing the abstracts of the LSC.
1. Removing punctuation and special characters: This is the process of substituting all non-alphanumeric characters with a space. We did not substitute the character "-" in this step, because we need to keep words like "z-score", "non-payment" and "pre-processing" in order not to lose the actual meaning of such words. A process of uniting prefixes with words is performed in later steps of pre-processing.
2. Lowercasing the text data: Lowercasing is performed to avoid treating words like "Corpus", "corpus" and "CORPUS" differently. The entire collection of texts is converted to lowercase.
3. Uniting prefixes of words: Words containing prefixes joined with the character "-" are united as one word. The list of prefixes united for this research is given in the file "list_of_prefixes.csv". Most of the prefixes were extracted from [4]. We also added commonly used prefixes: 'e', 'extra', 'per', 'self' and 'ultra'.
4. Substitution of words: Some words joined with "-" in the abstracts of the LSC require an additional substitution step to avoid losing the meaning of the word before removing the character "-". Some examples of such words are "z-test", "well-known" and "chi-square". These words have been substituted by "ztest", "wellknown" and "chisquare". Identification of such words is done by sampling abstracts from LSC. The full list of such words and the decisions taken for substitution are presented in the file "list_of_substitution.csv".
5. Removing the character "-": All remaining "-" characters are replaced by a space.
6. Removing numbers: All digits that are not included in a word are replaced by a space. All words that contain digits and letters are kept, because alphanumeric tokens such as chemical formulas might be important for our analysis. Some examples are "co2", "h2o" and "21st".
7. Stemming: Stemming is the process of converting inflected words into their word stem. This step unites several forms of words with similar meaning into one form and also saves memory space and time [5]. All words in the LScD are stemmed to their word stem.
8. Stop word removal: Stop words are words that are extremely common but provide little value in a language. Some common stop words in English are 'I', 'the', 'a', etc. We used the 'tm' package in R to remove stop words [6]. There are 174 English stop words listed in the package.

Step 5. Writing the LScD into CSV Format: There are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written to the file "LScD.csv".

The Organisation of the LScD
The total number of words in the file "LScD.csv" is 974,238. Each field is described below:
Word: Contains the unique words from the corpus. All words are in lowercase and in their stemmed forms. The field is sorted by the number of documents that contain the word, in descending order.
Number of Documents Containing the Word: A binary calculation is used: if a word exists in an abstract, there is a count of 1; if the word exists more than once in a document, the count is still 1. The total number of documents containing the word is counted as the sum of 1s over the entire corpus.
Number of Appearances in Corpus: Contains the number of times a word occurs in the corpus when the corpus is considered as one large document.

Instructions for R Code
LScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as an RData file and in CSV format. Outputs of the code are:
Metadata File: Includes all fields in a document except the abstract. Fields are List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.
File of Abstracts: Contains all abstracts after the pre-processing steps defined in Step 4.
DTM: The Document Term Matrix constructed from the LSC [6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.
LScD: An ordered list of words from LSC as defined in the previous section.

The code can be used as follows:
1. Download the folder 'LSC', 'list_of_prefixes.csv' and 'list_of_substitution.csv'
2. Open the LScD_Creation.R script
3. Change the parameters in the script: replace them with the full path of the directory with the source files and the full path of the directory to write output files to
4. Run the full code.

References
[1] N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1
[2] N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION
[3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[4] A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.
[5] C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved Porter's stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.
[6] I. Feinerer, "Introduction to the tm Package: Text Mining in R," available online: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.
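To make the LScD fields concrete, here is a small Python sketch (an illustration under assumptions, not the authors' R pipeline) that computes, for a toy list of pre-processed abstracts, the two counts stored in LScD.csv: the number of documents containing each word and the total number of appearances in the corpus.

# Sketch: compute document frequency and corpus frequency for each word,
# mirroring the two LScD columns described above (toy data, not the real LSC).
from collections import Counter

abstracts = [
    "corpus linguistics studies corpus data",
    "machine learning on text corpus",
]

doc_freq, corpus_freq = Counter(), Counter()
for text in abstracts:
    words = text.split()
    corpus_freq.update(words)          # every occurrence counts
    doc_freq.update(set(words))        # at most one count per document

for word, df in doc_freq.most_common():
    print(word, df, corpus_freq[word])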

  7. Android malware dataset for machine learning 2

    • figshare.com
    txt
    Updated May 30, 2023
    + more versions
    Cite
    Suleiman Yerima (2023). Android malware dataset for machine learning 2 [Dataset]. http://doi.org/10.6084/m9.figshare.5854653.v1
Available download formats: txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    figshare
    Authors
    Suleiman Yerima
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Dataset consisting of feature vectors of 215 attributes extracted from 15,036 applications (5,560 malware apps from the Drebin project and 9,476 benign apps). The dataset has been used to develop and evaluate a multilevel classifier fusion approach for Android malware detection, published in the IEEE Transactions on Cybernetics paper 'DroidFusion: A Novel Multilevel Classifier Fusion Approach for Android Malware Detection'. The supporting file contains a further description of the feature vectors/attributes obtained via static code analysis of the Android apps.
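The description does not fix a file layout, so the snippet below is only a hedged sketch of training a baseline classifier on such a feature-vector file, assuming a CSV export with one row per app and a binary label column named class; both the file name and the column name are hypothetical.

# Sketch under assumptions: CSV with the 215 feature columns plus a binary
# 'class' label (malware/benign); the released file may be formatted differently.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("android_malware_features.csv")   # hypothetical file name
X, y = df.drop(columns=["class"]), df["class"]

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, X, y, cv=5, scoring="f1_macro").mean())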

8. UCR Time Series Classification Archive Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated May 17, 2023
    Cite
    Hoang Anh Dau; Anthony Bagnall; Kaveh Kamgar; Chin-Chia Michael Yeh; Yan Zhu; Shaghayegh Gharghabi; Chotirat Ann Ratanamahatana; Eamonn Keogh (2023). UCR Time Series Classification Archive Dataset [Dataset]. https://paperswithcode.com/dataset/ucr-time-series-classification-archive
    Dataset updated
    May 17, 2023
    Authors
    Hoang Anh Dau; Anthony Bagnall; Kaveh Kamgar; Chin-Chia Michael Yeh; Yan Zhu; Shaghayegh Gharghabi; Chotirat Ann Ratanamahatana; Eamonn Keogh
    Description

The UCR Time Series Archive, introduced in 2002, has become an important resource in the time series data mining community, with at least one thousand published papers making use of at least one data set from the archive. The original incarnation of the archive had sixteen data sets, but since that time it has gone through periodic expansions. The last expansion took place in the summer of 2015, when the archive grew from 45 to 85 data sets. This paper introduces, and will focus on, the new expansion from 85 to 128 data sets. Beyond expanding this valuable resource, the paper offers pragmatic advice to anyone who may wish to evaluate a new algorithm on the archive. Finally, it makes a novel and yet actionable claim: of the hundreds of papers that show an improvement over the standard baseline (1-nearest-neighbor classification), a large fraction may be misattributing the reasons for their improvement. Moreover, they may have been able to achieve the same improvement with a much simpler modification, requiring just a single line of code.
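For context, the 1-nearest-neighbor baseline mentioned above is straightforward to reproduce. The sketch below assumes the common UCR file layout (tab-separated values with the class label in the first column); the dataset file names are illustrative placeholders.

# 1-NN Euclidean baseline on a UCR-style dataset (assumed layout: TSV with the
# class label in column 0; replace the file names with a real archive entry).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def load_ucr(path):
    data = np.loadtxt(path, delimiter="\t")
    return data[:, 1:], data[:, 0]

X_train, y_train = load_ucr("SomeDataset_TRAIN.tsv")
X_test, y_test = load_ucr("SomeDataset_TEST.tsv")

clf = KNeighborsClassifier(n_neighbors=1, metric="euclidean").fit(X_train, y_train)
print("error rate:", 1.0 - clf.score(X_test, y_test))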

9. SA Mineral and/or Opal Exploration Licence Applications

    • data.gov.au
    • researchdata.edu.au
• +1 more
    zip
    Updated Apr 13, 2022
    + more versions
    Cite
    Bioregional Assessment Program (2022). SA Mineral and/or Opal Exploration Licence Applications [Dataset]. https://data.gov.au/data/dataset/064b4ce1-cbf9-4cd3-ad2e-4d1a677c70b8
Available download formats: zip (435509)
    Dataset updated
    Apr 13, 2022
    Dataset authored and provided by
    Bioregional Assessment Program
    License
    Description

    Abstract

    This dataset and its metadata statement were supplied to the Bioregional Assessment Programme by a third party and are presented here as originally supplied.

    Location of all current mineral exploration licences issued under the Mining Act 1971. Exploration licences provide exclusive tenure rights to explore for mineral resources for up to a maximum of 5 years. Comment is sought on applications for exploration licences from numerous sources before granting. Exploration programs are subject to strict environmental and heritage conditions. Exploitation of identified resources must be made under separate mineral production leases.

    Purpose

    Purpose:

    The dataset was developed to record information necessary for the administration of the Mining Act.

    Use:

    Used to supply government, industry and the general public with an up-to-date status and extent of mineral and/or opal exploration licence application activities throughout the state.

    Use limitation:

    The data should not be used at a scale larger than 1:50 000.

    Dataset History

    Lineage:

    Source data history: Exploration Licence application boundaries were sourced from the official Mining Register licence application documents. Licence application boundaries are legally defined to follow lines of latitude and longitude. The register has existed since 1930.

    Processing steps: Coordinates entered by keyboard from licence application documents. Linework cleaned to remove duplicate arcs. Data adjusted for accurate state border and coastline. Where appropriate cadastral parcels removed from licence application polygons. Associated attribute data also captured from licence application documents.

    Dataset Citation

    SA Department of Primary Industries and Resources (2014) SA Mineral and/or Opal Exploration Licence Applications. Bioregional Assessment Source Dataset. Viewed 12 December 2018, http://data.bioregionalassessments.gov.au/dataset/064b4ce1-cbf9-4cd3-ad2e-4d1a677c70b8.

10. NASICON-type solid electrolyte materials named entity recognition dataset

    • scidb.cn
    Updated Apr 27, 2023
    Cite
    Liu Yue; Liu Dahui; Yang Zhengwei; Shi Siqi (2023). NASICON-type solid electrolyte materials named entity recognition dataset [Dataset]. http://doi.org/10.57760/sciencedb.j00213.00001
Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Apr 27, 2023
    Dataset provided by
    Science Data Bank
    Authors
    Liu Yue; Liu Dahui; Yang Zhengwei; Shi Siqi
    Description

1. Framework overview. This paper proposes a pipeline to construct high-quality datasets for text mining in materials science. Firstly, we utilise a traceable automatic literature acquisition scheme to ensure the traceability of textual data. Then, a data processing method driven by downstream tasks is performed to generate high-quality pre-annotated corpora conditioned on the characteristics of materials texts. On this basis, we define a general annotation scheme derived from the materials science tetrahedron to complete high-quality annotation. Finally, a conditional data augmentation model incorporating materials domain knowledge (cDA-DK) is constructed to augment the data quantity.

2. Dataset information. The experimental datasets used in this paper include the Matscholar dataset publicly published by Weston et al. (DOI: 10.1021/acs.jcim.9b00470) and the NASICON entity recognition dataset constructed by ourselves. Herein, we mainly introduce the details of the NASICON entity recognition dataset.

2.1 Data collection and preprocessing. Firstly, 55 materials science publications related to the NASICON system are collected through Crystallographic Information Files (CIF), which contain a wealth of structure-activity relationship information. Note that materials science literature is mostly stored as portable document format (PDF), with content arranged in columns and mixed with tables, images, and formulas, which significantly compromises the readability of the text sequence. To tackle this issue, we employ the text parser PDFMiner (a Python toolkit) to standardise, segment, and parse the original documents, thereby converting PDF literature into plain text. In this process, the entire textual information of the literature, encompassing title, author, abstract, keywords, institution, publisher, and publication year, is retained and stored as a unified TXT document. Subsequently, we apply rules based on Python regular expressions to remove redundant information, such as garbled characters and line breaks caused by figures, tables, and formulas. This results in a cleaner text corpus, enhancing its readability and enabling more efficient data analysis. Note that special symbols may also appear as garbled characters, but we refrain from directly deleting them, as they may contain valuable information such as chemical units. Therefore, we converted all such symbols to a special token
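The PDF-to-text step described above can be illustrated with pdfminer.six, which provides the PDFMiner toolkit mentioned in the description. The regular expressions shown are assumed examples of the kind of cleanup rule applied; the authors' actual rules and file names are not given in this summary.

# Sketch of the PDF parsing and regex cleanup described above.
# Assumes pdfminer.six is installed; the cleanup rules here are illustrative only.
import re
from pdfminer.high_level import extract_text

raw = extract_text("nasicon_paper.pdf")          # hypothetical input file

# Example cleanup: re-join hyphenated line breaks, flatten remaining line
# breaks, and drop control characters while keeping other symbols (e.g. units).
text = re.sub(r"-\n(?=\w)", "", raw)
text = re.sub(r"[ \t]*\n[ \t]*", " ", text)
text = re.sub(r"[\x00-\x08\x0b-\x1f]", "", text)

with open("nasicon_paper.txt", "w", encoding="utf-8") as fh:
    fh.write(text)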

11. Seoul Bike Trip duration prediction Dataset

    • paperswithcode.com
    • data.mendeley.com
    Updated Oct 31, 2020
    Cite
    Sathishkumar V E; Jangwoo Park; Yongyun Cho (2020). Seoul Bike Trip duration prediction Dataset [Dataset]. https://paperswithcode.com/dataset/seoul-bike-trip-duration-prediction
    Dataset updated
    Oct 31, 2020
    Authors
    Sathishkumar V E; Jangwoo Park; Yongyun Cho
    Area covered
    Seoul
    Description

Trip duration is the most fundamental measure in all modes of transportation. Hence, it is crucial to predict trip time precisely for the advancement of Intelligent Transport Systems (ITS) and traveller information systems. In order to predict trip duration, data mining techniques are employed in this paper to predict the trip duration of rental bikes in the Seoul Bike sharing system. The prediction is carried out with a combination of Seoul Bike data and weather data. The data used include trip duration, trip distance, pickup and dropoff latitude and longitude, temperature, precipitation, wind speed, humidity, solar radiation, snowfall, ground temperature and 1-hour average dust concentration. Feature engineering is done to extract additional features from the data. Four statistical models are used to predict the trip duration: (a) linear regression, (b) gradient boosting machines, (c) k nearest neighbor and (d) Random Forest (RF). Four performance metrics (root mean squared error, coefficient of variance, mean absolute error and median absolute error) are used to determine the efficiency of the models. In comparison with the other models, the optimal model, RF, can explain 93% of the variance in the testing set and 98% (R2) in the training set. The outcome proves that RF is effective for the prediction of trip duration.
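A hedged sketch of the Random Forest setup and the four evaluation metrics listed above follows; the file name and the duration column name are assumptions, not taken from the released files, and the coefficient-of-variance definition shown is one common convention.

# Sketch: Random Forest regression with the four reported metrics.
# Assumptions: a CSV with a numeric 'duration' target plus numeric features.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, median_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

df = pd.read_csv("seoul_bike_trips.csv")          # hypothetical file name
X, y = df.drop(columns=["duration"]), df["duration"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
pred = rf.predict(X_te)

rmse = mean_squared_error(y_te, pred) ** 0.5
print("RMSE:", rmse)
print("CV (RMSE / mean, one common definition):", rmse / y_te.mean())
print("MAE:", mean_absolute_error(y_te, pred))
print("MedAE:", median_absolute_error(y_te, pred))
print("R2 (test):", rf.score(X_te, y_te))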

  12. Indexed Data Set From Molisan Regional Seismic Network Events

    • zenodo.org
    csv, txt, zip
    Updated Jan 21, 2020
    + more versions
    Cite
Giovanni De Gasperis; Christian Del Pinto (2020). Indexed Data Set From Molisan Regional Seismic Network Events [Dataset]. http://doi.org/10.5281/zenodo.163767
Available download formats: zip, csv, txt
    Dataset updated
    Jan 21, 2020
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
Giovanni De Gasperis; Christian Del Pinto
    License

Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Abstract:

After the earthquake that occurred in Molise (Central Italy) on 31 October 2002 (Ml 5.4, 29 people dead), the local Servizio Regionale per la Protezione Civile, to ensure a better analysis of local seismic data, promoted the design of the Regional Seismic Network (RMSM) through a convention with the Istituto Nazionale di Geofisica e Vulcanologia (INGV) and funded its implementation. The 5 stations of RMSM operated from 2007 to 2013, collecting a large amount of seismic data and making an important contribution to the study of seismic sources present in the region and the surrounding territory. This work reports on the dataset containing all triggers collected by RMSM from July 2007 to March 2009, including actual seismic events; among them, all earthquake events recorded in coincidence with the Rete Sismica Nazionale Centralizzata (RSNC) of INGV have been marked with S and P arrival timestamps. Every trigger has been associated with a spectrogram defined over a recorded time vs. frequency domain.
The dataset has been fully indexed with respect to the recorded spectra: a list of all records, a list of earthquakes, and a list of multiple-earthquake records.
The main aim of this structured dataset is to be used for further analysis with data mining and machine learning techniques on image patterns associated with the waveforms.
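Since the dataset pairs each trigger with a time-frequency spectrogram, a minimal sketch of producing such an image from a waveform is given below. It is SciPy-based and purely illustrative; the sampling rate, window parameters and placeholder trace are assumptions, not details of the RMSM processing.

# Sketch: compute a spectrogram image from a seismic waveform array.
# Assumptions: a 1-D NumPy waveform and a 100 Hz sampling rate (illustrative).
import numpy as np
from scipy.signal import spectrogram

fs = 100.0                                   # assumed sampling rate in Hz
waveform = np.random.randn(60 * int(fs))     # placeholder 60 s trace

freqs, times, Sxx = spectrogram(waveform, fs=fs, nperseg=256, noverlap=128)
print(Sxx.shape)   # (frequency bins, time bins), ready to save as an image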

  13. Data Science Platform Market Analysis, Size, and Forecast 2025-2029: North...

    • technavio.com
    Updated Feb 15, 2025
    Cite
    Technavio (2025). Data Science Platform Market Analysis, Size, and Forecast 2025-2029: North America (US and Canada), Europe (France, Germany, UK), APAC (China, India, Japan), South America (Brazil), and Middle East and Africa (UAE) [Dataset]. https://www.technavio.com/report/data-science-platform-market-industry-analysis
    Dataset updated
    Feb 15, 2025
    Dataset provided by
    TechNavio
    Authors
    Technavio
    Time period covered
    2021 - 2025
    Area covered
    Global, United States
    Description


    Data Science Platform Market Size 2025-2029

    The data science platform market size is forecast to increase by USD 763.9 million, at a CAGR of 40.2% between 2024 and 2029.

    The market is experiencing significant growth, driven by the increasing integration of Artificial Intelligence (AI) and Machine Learning (ML) technologies. This fusion enables organizations to derive deeper insights from their data, fueling business innovation and decision-making. Another trend shaping the market is the emergence of containerization and microservices in data science platforms. This approach offers enhanced flexibility, scalability, and efficiency, making it an attractive choice for businesses seeking to streamline their data science operations. However, the market also faces challenges. Data privacy and security remain critical concerns, with the increasing volume and complexity of data posing significant risks. Ensuring robust data security and privacy measures is essential for companies to maintain customer trust and comply with regulatory requirements. Additionally, managing the complexity of data science platforms and ensuring seamless integration with existing systems can be a daunting task, requiring significant investment in resources and expertise. Companies must navigate these challenges effectively to capitalize on the market's opportunities and stay competitive in the rapidly evolving data landscape.

    What will be the Size of the Data Science Platform Market during the forecast period?

    Explore in-depth regional segment analysis with market size data - historical 2019-2023 and forecasts 2025-2029 - in the full report.
The market continues to evolve, driven by the increasing demand for advanced analytics and artificial intelligence solutions across various sectors. Real-time analytics and classification models are at the forefront of this evolution, with API integrations enabling seamless implementation. Deep learning and model deployment are crucial components, powering applications such as fraud detection and customer segmentation. Data science platforms provide essential tools for data cleaning and data transformation, ensuring data integrity for big data analytics. Feature engineering and data visualization facilitate model training and evaluation, while data security and data governance ensure data privacy and compliance. Machine learning algorithms, including regression models and clustering models, are integral to predictive modeling and anomaly detection. Statistical analysis and time series analysis provide valuable insights, while ETL processes streamline data integration. Cloud computing enables scalability and cost savings, while risk management and algorithm selection optimize model performance. Natural language processing and sentiment analysis offer new opportunities for data storytelling and computer vision. Supply chain optimization and recommendation engines are among the latest applications of data science platforms, demonstrating their versatility and continuous value proposition. Data mining and data warehousing provide the foundation for these advanced analytics capabilities.

    How is this Data Science Platform Industry segmented?

The data science platform industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023, for the following segments.
Deployment: On-premises, Cloud
Component: Platform, Services
End-user: BFSI, Retail and e-commerce, Manufacturing, Media and entertainment, Others
Sector: Large enterprises, SMEs
Application: Data Preparation, Data Visualization, Machine Learning, Predictive Analytics, Data Governance, Others
Geography: North America (US, Canada), Europe (France, Germany, UK), Middle East and Africa (UAE), APAC (China, India, Japan), South America (Brazil), Rest of World (ROW)

    By Deployment Insights

The on-premises segment is estimated to witness significant growth during the forecast period. In this dynamic market, businesses increasingly adopt solutions to gain real-time insights from their data, enabling them to make informed decisions. Classification models and deep learning algorithms are integral parts of these platforms, providing capabilities for fraud detection, customer segmentation, and predictive modeling. API integrations facilitate seamless data exchange between systems, while data security measures ensure the protection of valuable business information. Big data analytics and feature engineering are essential for deriving meaningful insights from vast datasets. Data transformation, data mining, and statistical analysis are crucial processes in data preparation and discovery. Machine learning models, including regression and clustering, are employed for model training and evaluation. Time series analysis and natural language processing are valuable tools for understanding trends and customer sentiment.

14. OceanXtremes: Oceanographic Data-Intensive Anomaly Detection and Analysis...

    • data.amerigeoss.org
    • data.wu.ac.at
    html
    Updated Jul 25, 2019
    Cite
    United States[old] (2019). OceanXtremes: Oceanographic Data-Intensive Anomaly Detection and Analysis Portal [Dataset]. https://data.amerigeoss.org/pl/dataset/0f24d562-556c-4895-955a-74fec4cc9993
Available download formats: html
    Dataset updated
    Jul 25, 2019
    Dataset provided by
    United States[old]
    License

U.S. Government Works, https://www.usa.gov/government-works
    License information was derived automatically

    Description

Anomaly detection is a process of identifying items, events or observations which do not conform to an expected pattern in a dataset or time series. Current and future missions and our research communities challenge us to rapidly identify features and anomalies in complex and voluminous observations to further science and improve decision support. Given this data-intensive reality, we propose to develop an anomaly detection system, called OceanXtremes, powered by an intelligent, elastic Cloud-based analytic service backend that enables execution of domain-specific, multi-scale anomaly and feature detection algorithms across the entire archive of ocean science datasets.

A parallel analytics engine will be developed as the key computational and data-mining core of OceanXtremes' backend processing. This analytic engine will demonstrate three new technology ideas to provide rapid turnaround on climatology computation and anomaly detection:
1. An adaptation of the Hadoop/MapReduce framework for parallel data mining of science datasets, typically large 3- or 4-dimensional arrays packaged in NetCDF and HDF.
2. An algorithm profiling service to efficiently and cost-effectively scale up hybrid Cloud computing resources based on the needs of scheduled jobs (CPU, memory, network, and bursting from a private Cloud computing cluster to a public cloud provider like Amazon Cloud services).
3. An extension to industry-standard search solutions (OpenSearch and faceted search) to provide support for shared discovery and exploration of ocean phenomena and anomalies, along with unexpected correlations between key measured variables.

We will use a hybrid Cloud compute cluster (private Eucalyptus on-premise at JPL with bursting to Amazon Web Services) as the operational backend. The key idea is that the parallel data-mining operations will be run 'near' the ocean data archives (a local 'network' hop) so that we can efficiently access the thousands of (say, daily) files making up a three-decade time series, and then cache key variables and pre-computed climatologies in a high-performance parallel database.

OceanXtremes will be equipped with both web portal and web service interfaces for users and applications/systems to register and retrieve oceanographic anomaly data. By leveraging technology such as Datacasting (Bingham, et al., 2007), users can also subscribe to anomaly or 'event' types of their interest and have newly computed anomaly metrics and other information delivered to them by metadata feeds packaged in standard Rich Site Summary (RSS) format. Upon receiving new feed entries, users can examine the metrics and download relevant variables, by simply clicking on a link, to begin further analyzing the event.

The OceanXtremes web portal will allow users to define their own anomaly or feature types, for which continuous backend processing will be scheduled to populate the new user-defined anomaly type by executing the chosen data mining algorithm (i.e. differences from climatology or gradients above a specified threshold). Metadata on the identified anomalies will be catalogued, including temporal and geospatial profiles, key physical metrics, related observational artifacts and other relevant metadata to facilitate discovery, extraction, and visualization. Products created by the anomaly detection algorithm will be made explorable and subsettable using Webification (Huang, et al., 2014) and OPeNDAP (http://opendap.org) technologies.
Using this platform scientists can efficiently search for anomalies or ocean phenomena, compute data metrics for events or over time series of ocean variables, and efficiently find and access all of the data relevant to their study (and then download only that data).
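To ground the "differences from climatology" idea mentioned above, here is a small xarray sketch. It is an illustration under assumptions (a NetCDF file with a daily sea-surface-temperature variable named sst and a 2 K threshold), not OceanXtremes code.

# Sketch: anomaly = observation minus monthly climatology, for a NetCDF variable.
# Assumptions: xarray + netCDF4 installed; a daily 'sst' variable with a 'time' axis.
import xarray as xr

ds = xr.open_dataset("sst_daily.nc")                      # hypothetical file
climatology = ds["sst"].groupby("time.month").mean("time")
anomaly = ds["sst"].groupby("time.month") - climatology   # deviation from climatology

# Flag grid cells more than 2 K away from their monthly mean.
extreme = abs(anomaly) > 2.0
print(extreme.sum().item(), "anomalous samples")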

15. Data from: Spotify Playlists Dataset

    • explore.openaire.eu
    • data.niaid.nih.gov
• +1 more
    Updated Mar 15, 2019
    Cite
    Martin Pichl; Eva Zangerle (2019). Spotify Playlists Dataset [Dataset]. http://doi.org/10.5281/zenodo.2594557
    Dataset updated
    Mar 15, 2019
    Authors
    Martin Pichl; Eva Zangerle
    Description

This dataset is based on the subset of users in the #nowplaying dataset who publish their #nowplaying tweets via Spotify. In principle, the dataset holds users, their playlists and the tracks contained in these playlists.

The csv file holding the dataset contains the following columns: "user_id", "artistname", "trackname", "playlistname", where
• user_id is a hash of the user's Spotify user name,
• artistname is the name of the artist,
• trackname is the title of the track, and
• playlistname is the name of the playlist that contains this track.

The separator used is ",", each entry is enclosed by double quotes, and the escape character used is .

A description of the generation of the dataset and the dataset itself can be found in the following paper: Pichl, Martin; Zangerle, Eva; Specht, Günther: "Towards a Context-Aware Music Recommendation Approach: What is Hidden in the Playlist Name?" in 15th IEEE International Conference on Data Mining Workshops (ICDM 2015), pp. 1360-1365, IEEE, Atlantic City, 2015.
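A hedged sketch of loading this file with pandas follows. The file name is hypothetical, and because the escape character is not rendered in the description above, the backslash used here is an assumption.

# Sketch: read the playlists CSV as described (quoted fields, comma separator).
# escapechar="\\" is an assumption; drop or change it if the file differs.
import pandas as pd

df = pd.read_csv(
    "spotify_dataset.csv",                 # hypothetical file name
    sep=",",
    quotechar='"',
    escapechar="\\",
    names=["user_id", "artistname", "trackname", "playlistname"],
    header=0,
)
print(df.groupby("playlistname")["trackname"].count().head())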

16. Meta-analysis results for 245 genes using 16 TC expression datasets.

    • figshare.com
    xlsx
    Updated Feb 10, 2025
    Cite
    Fanyong Kong; Boxuan Han; Zhen Wu; Jiaming Chen; Xixi Shen; Qian Shi; Lizhen Hou; Jugao Fang; Meng Lian (2025). Meta-analysis results for 245 genes using 16 TC expression datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0318747.s002
Available download formats: xlsx
    Dataset updated
    Feb 10, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Fanyong Kong; Boxuan Han; Zhen Wu; Jiaming Chen; Xixi Shen; Qian Shi; Lizhen Hou; Jugao Fang; Meng Lian
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Meta-analysis results for 245 genes using 16 TC expression datasets.

17. References supporting the influence of SS on seven genes that show...

    • plos.figshare.com
    xlsx
    Updated Feb 10, 2025
    Cite
Fanyong Kong; Boxuan Han; Zhen Wu; Jiaming Chen; Xixi Shen; Qian Shi; Lizhen Hou; Jugao Fang; Meng Lian (2025). References supporting the influence of SS on seven genes that show significance in meta-analysis using 16 TC expression datasets [Dataset]. http://doi.org/10.1371/journal.pone.0318747.s003
Available download formats: xlsx
    Dataset updated
    Feb 10, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Fanyong Kong; Boxuan Han; Zhen Wu; Jiaming Chen; Xixi Shen; Qian Shi; Lizhen Hou; Jugao Fang; Meng Lian
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

References supporting the influence of SS on seven genes that show significance in meta-analysis using 16 TC expression datasets.

  18. SyROCCo dataset

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Jun 25, 2024
    Cite
Zheng Fang; Miguel Arana-Catania; Felix-Anselm van Lier; Juliana Outes Velarde; Harry Bregazzi; Mara Airoldi; Eleanor Carter; Rob Procter (2024). SyROCCo dataset [Dataset]. http://doi.org/10.5281/zenodo.12204304
Available download formats: csv
    Dataset updated
    Jun 25, 2024
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
Zheng Fang; Miguel Arana-Catania; Felix-Anselm van Lier; Juliana Outes Velarde; Harry Bregazzi; Mara Airoldi; Eleanor Carter; Rob Procter
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

The peer-reviewed publication for this dataset has been published in Data & Policy and can be accessed at https://arxiv.org/abs/2406.16527. Please cite this publication when using the dataset.

    This dataset has been produced as a result of the “Systematic Review of Outcomes Contracts using Machine Learning” (SyROCCo) project. The goal of the project was to apply machine learning techniques to a systematic review process of outcomes-based contracting (OBC). The purpose of the systematic review was to gather and curate, for the first time, all of the existing evidence on OBC. We aimed to map the current state of the evidence, synthesise key findings from across the published studies, and provide accessible insights to our policymaker and practitioner audiences.

    OBC is a model for the provision of public services wherein a service provider receives payment, in-part or in-full, only upon the achievement of pre-agreed outcomes.

    The data used to conduct the review consists of 1,952 individual studies of OBC. They include peer reviewed journal articles, book chapters, doctoral dissertations, and assorted ‘grey literature’ - that is, reports and evaluations produced outside of traditional academic publications. Those studies were manually filtered by experts on the topic from an initial search of over 11,000 results.

    The full text of the articles was obtained from their PDF versions and preprocessed. This involved normalising the text format and removing acknowledgements and bibliographic references.

    The corpus was then connected to the INDIGO Impact Bond Dataset. Projects and organisations mentioned in that dataset were searched for in the articles’ corpus to link the two datasets.

    Other types of information identified in the texts were: 1) financial mechanisms (the type of outcomes-based instrument), detected using a list of terms related to those financial mechanisms based on prior discussions with a policy advisory group (Picker et al., 2021); 2) references to the 17 Sustainable Development Goals (SDGs) defined by the United Nations General Assembly in the 2030 Agenda; and 3) country names mentioned in each article, together with the income levels of those countries according to the World Bank’s World Classification of Income Levels 2022 (a small term-matching sketch follows).
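    As an illustration of this term-matching step, the sketch below counts whole-phrase mentions from a small term list in an article's text. The term lists shown are examples only, not the project's curated lists.

    ```python
    import re
    from collections import Counter

    # Illustrative term lists -- not the project's actual curated lists.
    FINANCIAL_MECHANISM_TERMS = ["social impact bond", "payment by results", "outcomes fund"]
    COUNTRY_TERMS = ["United Kingdom", "India", "Colombia"]

    def count_mentions(text, terms):
        """Case-insensitive whole-phrase counts of each term in the text."""
        counts = Counter()
        for term in terms:
            counts[term] = len(re.findall(r"\b" + re.escape(term) + r"\b", text, flags=re.IGNORECASE))
        return {term: n for term, n in counts.items() if n > 0}

    text = "A social impact bond and two payment by results contracts were launched in the United Kingdom."
    print(count_mentions(text, FINANCIAL_MECHANISM_TERMS))  # {'social impact bond': 1, 'payment by results': 1}
    print(count_mentions(text, COUNTRY_TERMS))              # {'United Kingdom': 1}
    ```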

    Three machine learning techniques were applied to the corpus:

    • Policy area identification. A query-driven topic model (QDTM) (Fang et al., 2021) was used to determine the probability of an article belonging to different policy areas (health, education, homelessness, criminal justice, employment and training, child and family welfare, and agriculture and environment), using the full text of the article as input. The QDTM is a semi-supervised machine learning algorithm that allows users to specify their prior knowledge in the form of simple queries, as words or phrases, and returns query-related topics.

    • Named Entity Recognition. Three named entity recognition models were applied: the “en_core_web_lg” and “en_core_web_trf” models from the Python package ‘spaCy’ and the “ner-ontonotes-large” English model from ‘Flair’. “en_core_web_trf” is based on the RoBERTa-base transformer model, and ‘Flair’ uses a bi-LSTM character-based model. All models were trained on the “OntoNotes 5” data source (Marcus et al., 2011) and are able to identify geographical locations, organisation names, and laws and regulations. An ensemble method was adopted, treating entities that appear simultaneously in the results of any two models as correct entities (a minimal sketch of this ensemble follows the list).

    • Semantic text similarity. We calculated a similarity score between articles. The 10,000 most frequently mentioned words were first extracted from all the articles’ titles and abstracts, and the text vectorization technique TF*IDF was applied to convert each article’s abstract into an importance-score vector over these words. Using these numerical vectors, the cosine similarity between different articles was calculated (see the similarity sketch after this list).
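    For illustration, the two sketches below show how the entity-recognition ensemble and the similarity computation could be reproduced. They are minimal examples, not the project's pipeline: the model names come from the description above, while the helper functions, the span handling, and the preprocessing details are assumptions.

    ```python
    # Sketch of the "agree in at least two of three models" NER ensemble.
    # Assumes spaCy (with en_core_web_lg and en_core_web_trf installed,
    # the latter via spacy-transformers) and Flair are available.
    from collections import Counter

    import spacy
    from flair.data import Sentence
    from flair.models import SequenceTagger

    nlp_lg = spacy.load("en_core_web_lg")
    nlp_trf = spacy.load("en_core_web_trf")
    flair_tagger = SequenceTagger.load("ner-ontonotes-large")

    def spacy_entities(nlp, text):
        return {(ent.text, ent.label_) for ent in nlp(text).ents}

    def flair_entities(text):
        sentence = Sentence(text)
        flair_tagger.predict(sentence)
        return {(span.text, span.get_label("ner").value) for span in sentence.get_spans("ner")}

    def ensemble_entities(text):
        """Keep entities returned by at least two of the three models."""
        runs = [spacy_entities(nlp_lg, text), spacy_entities(nlp_trf, text), flair_entities(text)]
        counts = Counter(entity for run in runs for entity in run)
        return {entity for entity, n in counts.items() if n >= 2}
    ```

    The similarity step can be sketched with scikit-learn; the 10,000-word vocabulary and abstract-level vectors follow the description, while the toy inputs are placeholders.

    ```python
    # Sketch of the TF*IDF + cosine-similarity step.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    titles = ["Outcomes-based contracting in health", "Social impact bonds and homelessness"]
    abstracts = [
        "This study examines outcomes-based contracts in community health services.",
        "We evaluate a social impact bond targeting homelessness outcomes.",
    ]

    # Vocabulary: the 10,000 most frequent words across titles and abstracts.
    vectorizer = TfidfVectorizer(max_features=10000)
    vectorizer.fit(titles + abstracts)

    # Each abstract becomes an importance-score vector over that vocabulary.
    X = vectorizer.transform(abstracts)

    # Pairwise cosine similarity between articles.
    similarity = cosine_similarity(X)
    ```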

    The SyROCCo Dataset includes references to the 1,952 studies of OBC mentioned above, together with the results of the processing steps and techniques described earlier. Each entry of the dataset contains the following information (a short usage sketch follows the field list).

    The basic information for each document consists of its title, abstract, authors, published year, DOI, and Article ID:

    • Title: Title of the document.

    • Abstract: Text of the abstract.

    • Authors: Authors of a study.

    • Published Years: Publication year of the study.

    • DOI: DOI link of a study.

    • Article ID: ID of the document selected during the screening process.

    The probability of a study belonging to each policy area:

    • policy_sector_health: The probability that a study belongs to the policy sector “health”.

    • policy_sector_education: The probability that a study belongs to the policy sector “education”.

    • policy_sector_homelessness: The probability that a study belongs to the policy sector “homelessness”.

    • policy_sector_criminal: The probability that a study belongs to the policy sector “criminal”.

    • policy_sector_employment: The probability that a study belongs to the policy sector “employment”.

    • policy_sector_child: The probability that a study belongs to the policy sector “child”.

    • policy_sector_environment: The probability that a study belongs to the policy sector “environment”.

    Other types of information such as financial mechanisms, Sustainable Development Goals, and different types of named entities:

    • financial_mechanisms: Financial mechanisms mentioned in a study.

    • top_financial_mechanisms: The financial mechanisms mentioned in a study are listed in descending order according to the number of times they are mentioned, and include the corresponding context of the mentions.

    • top_sgds: Sustainable Development Goals mentioned in a study are listed in descending order according to the number of times they are mentioned, and include the corresponding context of the mentions.

    • top_countries: Country names mentioned in a study are listed in descending order according to the number of times they are mentioned, and include the corresponding context of the mentions. This entry is also used to determine the income level of the mentioned countries.

    • top_Project: Indigo projects mentioned in a study are listed in descending order according to the number of times they are mentioned, and include the corresponding context of the mentions.

    • top_GPE: Geographical locations mentioned in a study are listed in descending order according to the number of times they are mentioned, and include the corresponding context of the mentions.

    • top_LAW: Relevant laws and regulations mentioned in a study are listed in descending order according to the number of times they are mentioned, and include the corresponding context of the mentions.

    • top_ORG: Organisations mentioned in a study are listed in descending order according to the number of times they are mentioned, and include the corresponding context of the mentions.
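    As a usage illustration (not part of the dataset itself), the sketch below loads the CSV with pandas and ranks studies by their health policy-area probability. The file name is an assumption, and the column spellings in the released file may differ slightly from the field names listed above.

    ```python
    import pandas as pd

    # File name is assumed; use the CSV distributed with the Zenodo record.
    df = pd.read_csv("syrocco_dataset.csv")

    # Columns documented above: Title, DOI, policy_sector_* probabilities, top_* fields.
    top_health = (
        df[["Title", "DOI", "policy_sector_health"]]
        .sort_values("policy_sector_health", ascending=False)
        .head(10)
    )
    print(top_health.to_string(index=False))
    ```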

  19. UCI datasets

    • ieee-dataport.org
    Updated May 14, 2025
    Cite
    Yuan Sun (2025). UCI datasets [Dataset]. https://ieee-dataport.org/documents/uci-datasets
    Explore at:
    Dataset updated
    May 14, 2025
    Authors
    Yuan Sun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    biology

  20. Amazon dataset for ERS-REFMMF

    • figshare.com
    txt
    Updated Feb 1, 2024
    Cite
    Teng Chang (2024). Amazon dataset for ERS-REFMMF [Dataset]. http://doi.org/10.6084/m9.figshare.25126313.v1
    Explore at:
    txt
    Available download formats
    Dataset updated
    Feb 1, 2024
    Dataset provided by
    figshare
    Authors
    Teng Chang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Recommender systems based on matrix factorization act as black-box models and are unable to explain the recommended items. After adding the neighborhood algorithm, the explainability is measured by the user's neighborhood recommendation, but the subjective explicit preference of the target user is ignored. To better combine the latent factors from matrix factorization and the target user's explicit preferences, an explainable recommender system based on reconstructed explanatory factors and multi-modal matrix factorization (ERS-REFMMF) is proposed. ERS-REFMMF is a two-layer model, and the underlying model decomposes the multi-modal scoring matrix to get the rich latent features of the user and the item based on the method of Funk-SVD, in which the multi-modal scoring matrix consists of the original matrix and the preference features and sentiment scores exhibited by users in the reviews corresponding to the ratings. The set of candidate items is obtained based on the latent features, and the explainability is reconstructed based on the subjective preference of the target user and the real recognition level of the neighbors. The upper layer is the multi-objective high-performance recommendation stage, in which the candidate set is optimized by a multi-objective evolutionary algorithm to bring the user a final recommendation list that is accurate, recallable, diverse, and interpretable, in which the accuracy and recall are represented by F1-measure. Experimental results on three real datasets from Amazon show that the proposed model is competitive compared to existing recommendation methods in both stages.
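    To make the factorization step concrete, below is a minimal Funk-SVD sketch in plain NumPy: it learns user and item latent factors from the observed entries of a rating matrix by stochastic gradient descent. This illustrates the general technique the description refers to and is not the authors' ERS-REFMMF implementation; the toy matrix and hyperparameters are placeholders.

    ```python
    import numpy as np

    def funk_svd(R, n_factors=20, n_epochs=50, lr=0.005, reg=0.02):
        """Factorize R (NaN = unobserved) into user/item latent factors via SGD."""
        n_users, n_items = R.shape
        rng = np.random.default_rng(0)
        P = rng.normal(scale=0.1, size=(n_users, n_factors))  # user factors
        Q = rng.normal(scale=0.1, size=(n_items, n_factors))  # item factors
        observed = np.argwhere(~np.isnan(R))
        for _ in range(n_epochs):
            rng.shuffle(observed)
            for u, i in observed:
                err = R[u, i] - P[u] @ Q[i]
                pu = P[u].copy()
                P[u] += lr * (err * Q[i] - reg * P[u])
                Q[i] += lr * (err * pu - reg * Q[i])
        return P, Q

    # Toy example: 4 users x 5 items, NaN marks missing ratings.
    R = np.array([[5, 3, np.nan, 1, np.nan],
                  [4, np.nan, np.nan, 1, 2],
                  [1, 1, np.nan, 5, np.nan],
                  [np.nan, 1, 5, 4, np.nan]], dtype=float)
    P, Q = funk_svd(R)
    predicted = P @ Q.T  # dense score matrix used to rank candidate items
    ```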

