24 datasets found
  1. Results of the 3rd Intl. Competition on Software Testing (Test-Comp 2021)

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Feb 7, 2021
    Cite
    Beyer, Dirk (2021). Results of the 3rd Intl. Competition on Software Testing (Test-Comp 2021) [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_4459469
    Explore at:
    Dataset updated
    Feb 7, 2021
    Dataset provided by
    LMU Munich, Germany
    Authors
    Beyer, Dirk
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Competition Results

    This file describes the contents of an archive of the 3rd Competition on Software Testing (Test-Comp 2021). https://test-comp.sosy-lab.org/2021/

    The competition was run by Dirk Beyer, LMU Munich, Germany. More information is available in the following article: Dirk Beyer. Status Report on Software Testing: Test-Comp 2021. In Proceedings of the 24th International Conference on Fundamental Approaches to Software Engineering (FASE 2021, Luxembourg, March 27 - April 1), 2021. Springer.

    Copyright (C) Dirk Beyer https://www.sosy-lab.org/people/beyer/

    SPDX-License-Identifier: CC-BY-4.0 https://spdx.org/licenses/CC-BY-4.0.html

    To browse the competition results with a web browser, there are two options:

    start a local web server using php -S localhost:8000 in order to view the data in this archive, or

    browse https://test-comp.sosy-lab.org/2021/results/ in order to view the data on the Test-Comp web page.

    Contents

    index.html: directs to the overview web page

    LICENSE.txt: specifies the license

    README.txt: this file

    results-validated/: results of validation runs

    results-verified/: results of test-generation runs and aggregated results

    The folder results-validated/ contains the results from validation runs:

    *.xml.bz2: XML results from BenchExec

    *.logfiles.zip: output from tools

*.json.gz: mapping from file names to SHA-256 hashes of the file content

    The folder results-verified/ contains the results from test-generation runs and aggregated results:

    index.html: overview web page with rankings and score table

    design.css: HTML style definitions

    *.xml.bz2: XML results from BenchExec

    *.merged.xml.bz2: XML results from BenchExec, status adjusted according to the validation results

    *.logfiles.zip: output from tools

*.json.gz: mapping from file names to SHA-256 hashes of the file content

    *.xml.bz2.table.html: HTML views on the detailed results data as generated by BenchExec’s table generator

    *.All.table.html: HTML views of the full benchmark set (all categories) for each tool

    META_*.table.html: HTML views of the benchmark set for each meta category for each tool, and over all tools

    *.table.html: HTML views of the benchmark set for each category over all tools

    iZeCa0gaey.html: HTML views per tool

    quantilePlot-*: score-based quantile plots as visualization of the results

    quantilePlotShow.gp: example Gnuplot script to generate a plot

    score*: accumulated score results in various formats

The hashes of the file contents (in the *.json.gz files) are useful for (a minimal verification sketch follows this list)

    validating the exact contents of a file and

    accessing the files from the witness store.
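For the first use case, a minimal verification sketch in Python (the file names here are illustrative, and the JSON structure is assumed to be a flat name-to-hash mapping):

    import gzip, hashlib, json

    # Illustrative names: substitute the actual *.json.gz mapping file and a
    # file extracted from the corresponding *.logfiles.zip archive.
    with gzip.open("results.json.gz", "rt") as f:
        name_to_hash = json.load(f)  # file name -> SHA-256 hex digest

    path = "some/logfile.txt"
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    print("match" if name_to_hash.get(path) == digest else "mismatch")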

    Other Archives

Overview of the archives from Test-Comp 2021 that are available at Zenodo:

    https://doi.org/10.5281/zenodo.4459466 Witness store (containing the generated test suites)

    https://doi.org/10.5281/zenodo.4459470 Results (XML result files, log files, file mappings, HTML tables)

    https://doi.org/10.5281/zenodo.4459132 Test tasks, version testcomp21

    https://doi.org/10.5281/zenodo.4317433 BenchExec, version 3.6

All benchmarks were executed for Test-Comp 2021 (https://test-comp.sosy-lab.org/2021/) by Dirk Beyer, LMU Munich, based on the following components:

    https://gitlab.com/sosy-lab/test-comp/archives-2021 testcomp21-0-gdacd4bf

    https://gitlab.com/sosy-lab/software/sv-benchmarks testcomp21-0-gefea738258

    https://gitlab.com/sosy-lab/software/benchexec 3.6-0-gb278ebbb

    https://gitlab.com/sosy-lab/benchmarking/competition-scripts testcomp21-0-g8339740

    https://gitlab.com/sosy-lab/test-comp/bench-defs testcomp21-0-g9d532c9

    Contact

    Feel free to contact me in case of questions: https://www.sosy-lab.org/people/beyer/

  2. Results of the 2nd International Competition on Software Testing (Test-Comp...

    • nde-dev.biothings.io
    • data-staging.niaid.nih.gov
    Updated Mar 9, 2021
    Cite
    Beyer, Dirk (2021). Results of the 2nd International Competition on Software Testing (Test-Comp 2020) [Dataset]. https://nde-dev.biothings.io/resources?id=zenodo_3678263
    Explore at:
    Dataset updated
    Mar 9, 2021
    Dataset authored and provided by
    Beyer, Dirk
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This archive contains the results of the 2nd Competition on Software Testing (Test-Comp 2020) https://test-comp.sosy-lab.org/2020/

    The competition was run by Dirk Beyer, LMU Munich, Germany. More information is available in the following article: Dirk Beyer. Second Competition on Software Testing: Test-Comp 2020. In Proceedings of the 23rd International Conference on Fundamental Approaches to Software Engineering (FASE 2020, Dublin, April 28-30), 2020. Springer. https://doi.org/10.1007/978-3-030-45234-6_25

    Copyright (C) Dirk Beyer https://www.sosy-lab.org/people/beyer/

    SPDX-License-Identifier: CC-BY-4.0 https://spdx.org/licenses/CC-BY-4.0.html

To browse the competition results with a web browser, there are two options:

• start a local web server using php -S localhost:8000 in order to view the data in this archive, or
• browse https://test-comp.sosy-lab.org/2020/results/ in order to view the data on the Test-Comp web page.

    Contents:

• index.html: directs to the overview web page
• LICENSE.txt: specifies the license
• README.txt: this file
• results-validated/: results of coverage-validation runs
• results-verified/: results of test-generation runs and aggregated results

    The folder results-validated/ contains the results from coverage-validation runs:

• *.xml.bz2: XML results from BenchExec
• *.logfiles.zip: output from tools
• *.json.gz: mapping from file names to SHA-256 hashes of the file content

    The folder results-verified/ contains the results from test-generation runs and aggregated results:

• index.html: overview web page with rankings and score table
• design.css: HTML style definitions
• *.xml.bz2: XML results from BenchExec
• *.merged.xml.bz2: XML results from BenchExec, status adjusted according to the validation results
• *.logfiles.zip: output from tools
• *.json.gz: mapping from file names to SHA-256 hashes of the file content
• *.xml.bz2.table.html: HTML views on the detailed results data as generated by BenchExec's table generator
• *.All.table.html: HTML views of the full benchmark set (all categories) for each tool
• META_*.table.html: HTML views of the benchmark set for each meta category for each tool, and over all tools
• *.table.html: HTML views of the benchmark set for each category over all tools
• iZeCa0gaey.html: HTML views per tool

• quantilePlot-*: score-based quantile plots as visualization of the results
• quantilePlotShow.gp: example Gnuplot script to generate a plot
• score*: accumulated score results in various formats

The hashes of the file contents (in the *.json.gz files) are useful for
• validating the exact contents of a file and
• accessing the files from the witness store.

Overview of the archives from Test-Comp 2020 that are available at Zenodo:

• https://doi.org/10.5281/zenodo.3678275: Witness store (containing the generated test suites)
• https://doi.org/10.5281/zenodo.3678264: Results (XML result files, log files, file mappings, HTML tables)
• https://doi.org/10.5281/zenodo.3678250: Test tasks, version testcomp20
• https://doi.org/10.5281/zenodo.3574420: BenchExec, version 2.5.1

All benchmarks were executed for Test-Comp 2020 (https://test-comp.sosy-lab.org/2020/) by Dirk Beyer, LMU Munich, based on the following components:

• git@github.com:sosy-lab/sv-benchmarks.git testcomp20-0-gd6cd3e5dd4
• git@gitlab.com:sosy-lab/test-comp/bench-defs.git testcomp19-84-gac76836
• git@github.com:sosy-lab/benchexec.git 2.5.1-0-gffad635

    Feel free to contact me in case of questions: https://www.sosy-lab.org/people/beyer/

  3. Playing Cards Images - Object Detection Dataset

    • kaggle.com
    zip
    Updated Oct 9, 2020
    Cite
    Gunjan Haldar (2020). Playing Cards Images - Object Detection Dataset [Dataset]. https://www.kaggle.com/gunhcolab/object-detection-dataset-standard-52card-deck
    Explore at:
Available download formats: zip (37,946,914 bytes)
    Dataset updated
    Oct 9, 2020
    Authors
    Gunjan Haldar
    Description

    Context

This project is our venture to create an object detection dataset from scratch 😄

    Content

The rows contain the various entries and their respective attributes.

Features: Filename, Width, Height, Class, Xmin, Ymin, Xmax, Ymax (one bounding box per row).

Targets: [ace, jack, king, queen, two, three, four, five, six, seven, eight, nine, ten] of [diamonds, spades, clubs, hearts] (52 targets in total).
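A minimal loading sketch for such annotations (the csv file name is hypothetical; the column set is as listed above):

    import pandas as pd

    # Hypothetical annotation file name; the archive's actual csv may differ.
    df = pd.read_csv("train.csv")  # Filename, Width, Height, Class, Xmin, Ymin, Xmax, Ymax

    # Derive bounding-box sizes as a quick sanity check on the annotations.
    df["box_w"] = df["Xmax"] - df["Xmin"]
    df["box_h"] = df["Ymax"] - df["Ymin"]
    print(df.groupby("Class")[["box_w", "box_h"]].mean())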

    Acknowledgements

We used a deep file search and downloaded the required set from Google. After that, the data was debugged, the unimportant images were eliminated, and the noise was reduced. Images of the same size were kept in folders, arranged class-wise. For the next step, we chose LabelImg for the labelling and annotation. Once the labelling was complete, the corresponding .xml files were generated for the data.

    Inspiration

  4. IUST-PDFCorpus

    • live.european-language-grid.eu
    • zenodo.org
    pdf
    Updated May 8, 2024
    Cite
    (2024). IUST-PDFCorpus [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7737
    Explore at:
Available download formats: pdf
    Dataset updated
    May 8, 2024
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

About

IUST-PDFCorpus is a large set of various PDF files, aimed at building and manipulating new PDF files to test, debug, and improve the quality of real-world PDF readers such as Adobe Acrobat Reader, Foxit Reader, Nitro Reader, and MuPDF. IUST-PDFCorpus contains 6,141 complete PDF files of various sizes and contents. The corpus includes 507,299 PDF data objects and 151,132 PDF streams extracted from the set of complete files. Data objects are in textual format while streams are in binary format; together they make up PDF files. In addition, we attach the code coverage of each PDF file when it is used as test data in testing MuPDF. The coverage info is available in both binary and XML formats.

PDF data objects are organized into three categories. The first category contains all objects in the corpus; each file in this category holds all PDF objects extracted from one PDF file without any preprocessing. The second category is a dataset made by merging all files in the first category with some preprocessing; this dataset is split into train, test, and validation sets, which is useful for machine learning tasks. The third category is the same as the second but smaller, for use in the development stage of different algorithms.

IUST-PDFCorpus is collected from various sources, including the Mozilla PDF.js open test corpus, some PDFs used in AFL as initial seeds, and PDFs gathered from existing e-books, software documents, and the public web in different languages. We first introduced IUST-PDFCorpus in our paper “Format-aware learn&fuzz: deep test data generation for efficient fuzzing”, where we used it to build an intelligent file format fuzzer called IUST-DeepFuzz. For the time being, we are gathering other file formats to automate testing of related applications.

Citing IUST-PDFCorpus

If IUST-PDFCorpus is used in your work in any form, please cite the relevant paper: https://arxiv.org/abs/1812.09961v2

  5. Results of the 1st International Competition on Software Testing (Test-Comp...

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated May 27, 2020
    Cite
    Beyer, Dirk (2020). Results of the 1st International Competition on Software Testing (Test-Comp 2019) [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_3856660
    Explore at:
    Dataset updated
    May 27, 2020
    Dataset provided by
    LMU Munich, Germany
    Authors
    Beyer, Dirk
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This file describes the contents of an archive of the 1st Competition on Software Testing (Test-Comp 2019) https://test-comp.sosy-lab.org/2019/

    The competition was run by Dirk Beyer, LMU Munich, Germany. More information is available in the following article: Dirk Beyer. First International Competition on Software Testing: Test-Comp 2019. International Journal on Software Tools for Technology Transfer, 2020.

    Copyright (C) Dirk Beyer https://www.sosy-lab.org/people/beyer/

    SPDX-License-Identifier: CC-BY-4.0 https://spdx.org/licenses/CC-BY-4.0.html

To browse the competition results with a web browser, there are two options:

• start a local web server using php -S localhost:8000 in order to view the data in this archive, or
• browse https://test-comp.sosy-lab.org/2019/results/ in order to view the data on the Test-Comp web page.

    Contents:

• index.html: directs to the overview web page
• LICENSE.txt: specifies the license
• README.txt: this file
• results-validated/: results of validation runs
• results-verified/: results of test-generation runs and aggregated results

    The folder results-validated/ contains the results from validation runs:

• *.xml.bz2: XML results from BenchExec
• *.logfiles.zip: output from tools
• *.json.gz: mapping from file names to SHA-256 hashes of the file content

    The folder results-verified/ contains the results from test-generation runs and aggregated results:

• index.html: overview web page with rankings and score table
• design.css: HTML style definitions
• *.xml.bz2: XML results from BenchExec
• *.merged.xml.bz2: XML results from BenchExec, status adjusted according to the validation results
• *.logfiles.zip: output from tools
• *.json.gz: mapping from file names to SHA-256 hashes of the file content
• *.xml.bz2.table.html: HTML views on the detailed results data as generated by BenchExec's table generator
• *.All.table.html: HTML views of the full benchmark set (all categories) for each tool
• META_*.table.html: HTML views of the benchmark set for each meta category for each tool, and over all tools
• *.table.html: HTML views of the benchmark set for each category over all tools
• iZeCa0gaey.html: HTML views per tool

• quantilePlot-*: score-based quantile plots as visualization of the results
• quantilePlotShow.gp: example Gnuplot script to generate a plot
• score*: accumulated score results in various formats

The hashes of the file contents (in the *.json.gz files) are useful for
• validating the exact contents of a file and
• accessing the files from the witness store.

Overview of the archives from Test-Comp 2019 that are available at Zenodo:

• https://doi.org/10.5281/zenodo.3856669: Witness store (containing the generated test suites)
• https://doi.org/10.5281/zenodo.3856661: Results (XML result files, log files, file mappings, HTML tables)
• https://doi.org/10.5281/zenodo.3856478: Test tasks, version testcomp19
• https://doi.org/10.5281/zenodo.2561835: BenchExec, version 1.18

All benchmarks were executed for Test-Comp 2019 (https://test-comp.sosy-lab.org/2019/) by Dirk Beyer, LMU Munich, based on the following components:

• git@github.com:sosy-lab/sv-benchmarks.git testcomp19-0-g6a770a9c1
• git@gitlab.com:sosy-lab/test-comp/bench-defs.git testcomp19-0-g1677027
• git@github.com:sosy-lab/benchexec.git 1.18-0-gff72868

    Feel free to contact me in case of questions: https://www.sosy-lab.org/people/beyer/

  6. Fuel Economy Label and CAFE Data Inventory

    • catalog.data.gov
    • data.amerigeoss.org
    Updated Jul 12, 2021
    Cite
    U.S. EPA Office of Air and Radiation (OAR) - Office of Transportation and Air Quality (OTAQ) (2021). Fuel Economy Label and CAFE Data Inventory [Dataset]. https://catalog.data.gov/dataset/fuel-economy-label-and-cafe-data-inventory
    Explore at:
    Dataset updated
    Jul 12, 2021
    Dataset provided by
United States Environmental Protection Agency, http://www.epa.gov/
    Description

The Fuel Economy Label and CAFE Data asset contains measured summary fuel economy estimates and test data for light-duty vehicle manufacturers, by model, for certification as required under the Energy Policy and Conservation Act of 1975 (EPCA) and the Energy Independence and Security Act of 2007 (EISA), which mandate the collection of vehicle fuel economy estimates for the creation of Fuel Economy Labels and for the calculation of Corporate Average Fuel Economy (CAFE). Manufacturers submit data on an annual basis, or as needed to document vehicle model changes.

The EPA performs targeted fuel economy confirmatory tests on approximately 15% of vehicles submitted for validation. Confirmatory data on a vehicle is associated with its corresponding submission data to verify the accuracy of manufacturer submissions beyond standard business rules. Submitted data comes in XML format or as documents, with the majority of submissions being sent in XML, and includes descriptive information on the vehicle itself, fuel economy information, and the manufacturer's testing approach. This data may contain confidential business information (CBI), such as estimated sales or other data elements indicated by the submitter as confidential. CBI data is not publicly available; however, within the EPA the data can be accessed under the restrictions of the Office of Transportation and Air Quality (OTAQ) CBI policy [RCS Link]. Datasets are segmented by vehicle model/manufacturer and/or year with corresponding fuel economy, test, and certification data. Data assets are stored in EPA's Verify system.

Coverage began in 1974, with early records being primarily paper documents which did not go through the same level of validation as the primarily digital submissions that started in 2008. Early data is available to the public digitally starting from 1978, but more complete digital certification data is available starting in 2008. Fuel economy submission data prior to 2006 was calculated using an older formula; however, mechanisms exist to make this data comparable to current results.

Fuel Economy Label and CAFE Data submission documents with metadata, certificate, and summary decision information are utilized and made publicly available through the EPA/DOE Fuel Economy Guide website (https://www.fueleconomy.gov/) as well as EPA's SmartWay program website (https://www.epa.gov/smartway/) and Green Vehicle Guide website (http://ofmpub.epa.gov/greenvehicles/Index.do;jsessionid=3F4QPhhYDYJxv1L3YLYxqh6J2CwL0GkxSSJTl2xgMTYPBKYS00vw!788633877) after it has been quality assured. Where summary data appears inaccurate, OTAQ returns the entries for review to their originator.

  7. Data from: Datasets used to train the Generative Adversarial Networks used...

    • opendata.cern.ch
    Updated 2021
    Cite
    ATLAS collaboration (2021). Datasets used to train the Generative Adversarial Networks used in ATLFast3 [Dataset]. http://doi.org/10.7483/OPENDATA.ATLAS.UXKX.TXBN
    Explore at:
    Dataset updated
    2021
    Dataset provided by
    CERN Open Data Portal
    Authors
    ATLAS collaboration
    Description

Three datasets are available, each consisting of 15 csv files. Each file contains the voxelised shower information obtained from single particles produced at the front of the calorimeter in the |η| range 0.2-0.25, simulated in the ATLAS detector. Two datasets contain photon events with different statistics; the larger sample has about 10 times the number of events of the other. The third dataset contains pions. The pion dataset and the lower-statistics photon dataset were used to train the corresponding two GANs presented in the AtlFast3 paper SIMU-2018-04.

The information in each file is a table; the rows correspond to the events and the columns to the voxels. The voxelisation procedure is described in the AtlFast3 paper linked above and in the dedicated PUB note ATL-SOFT-PUB-2020-006. In summary, the detailed energy deposits produced by ATLAS were converted from x,y,z coordinates to local cylindrical coordinates defined around the particle 3-momentum at the entrance of the calorimeter. The energy deposits in each layer were then grouped into voxels, and for each voxel the energy was stored in the csv file. For each particle, there are 15 files corresponding to the 15 energy points used to train the GAN. The name of each csv file encodes both the particle and the energy of the sample used to create the file.
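A minimal loading sketch for one such table (the file name is illustrative; actual names encode the particle type and one of the 15 energy points):

    import pandas as pd

    # Illustrative file name; see the naming convention described above.
    showers = pd.read_csv("photons_E65536.csv")

    # Rows are events, columns are voxels; summing over voxels gives the
    # total deposited energy per event.
    print(showers.shape)
    print(showers.sum(axis=1).head())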

    The size of the voxels is described in the binning.xml file. Software tools to read the XML file and manipulate the spatial information of voxels are provided in the FastCaloGAN repository.

    Updated on February 10th 2022. A new dataset photons_samples_highStat.tgz was added to this record and the binning.xml file was updated accordingly.

    Updated on April 18th 2023. A new dataset pions_samples_highStat.tgz was added to this record.

  8. UnmixDB: A Dataset for DJ-Mix Information Retrieval

    • zenodo.org
    • data.niaid.nih.gov
    pdf, txt, zip
    Updated Aug 2, 2024
    Cite
    Diemo Schwarz; Dominique Fourer; Diemo Schwarz; Dominique Fourer (2024). UnmixDB: A Dataset for DJ-Mix Information Retrieval [Dataset]. http://doi.org/10.5281/zenodo.1422385
    Explore at:
Available download formats: zip, txt, pdf
    Dataset updated
    Aug 2, 2024
    Dataset provided by
Zenodo, http://zenodo.org/
    Authors
    Diemo Schwarz; Dominique Fourer; Diemo Schwarz; Dominique Fourer
    License

Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    A collection of automatically generated DJ mixes with ground truth, based on creative-commons-licensed freely available and redistributable electronic dance tracks.

    In order to evaluate the DJ mix analysis and reverse engineering methods, we created a dataset of excerpts of open licensed dance tracks and automatically generated mixes based on these.

    Each mix is based on a playlist that mixes 3 track excerpts beat-synchronously, such that the middle track is embedded in a realistic context of beat-aligned linear cross fading to the other tracks.
    The first track's BPM is used as the seed tempo onto which the other tracks are adapted.

    Each playlist of 3 tracks is mixed 12 times with combinations of 4 variants of effects and 3 variants of time scaling using the treatments of the sox open source command-line program [http://sox.sourceforge.net].

Each track excerpt contains about 20s of the beginning and 20s of the end of the source track. However, the exact choice is made taking into account the metric structure of the track. The cue-in region, where the fade-in will happen, is placed on the second beat marker starting a new measure, and lasts for 4 measures. The cue-out region ends with the 2nd-to-last measure marker. We ensure at least 20s for the beginning and end parts. The cut points where they are spliced together are again placed on the start of a measure, such that no artefacts due to beat discontinuity are introduced.

    The UnmixDB dataset contains the ground truth for the source tracks and mixes in ASCII label format with tab-separated columns starttime, endtime, label.
    For each mix, the start, end, and cue points of the constituent tracks are given, along with their BPM and speed factors.
    We use the convention that the label starts with a number indicating which of the 3 source tracks the label refers to.
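A minimal parsing sketch for such a label file (the file name is illustrative; the columns are the tab-separated starttime, endtime, label described above):

    # The leading digit of each label indicates which of the 3 source
    # tracks it refers to; the file name is illustrative.
    with open("mix01.lab") as f:
        for line in f:
            start, end, label = line.rstrip("\n").split("\t")
            track = int(label[0])  # 1, 2, or 3
            print(float(start), float(end), track, label)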

    The song excerpts are accompanied by their cue region and tempo information in .txt files in table format.

Additionally, we provide the .beat.xml files containing the beat tracking results for the full tracks, available from Sonnleitner et al. (2016).

Our DJ mix dataset is based on the curatorial work of Sonnleitner et al. (ISMIR 2016), who collected Creative-Commons-licensed source tracks of 10 free dance music mixes from Mixotic. We used their collected tracks to produce our track excerpts, but regenerated artificial mixes with perfectly accurate ground truth.

    The code used to create the dataset from the above is published at https://github.com/Ircam-RnD/unmixdb-creation, such that other researchers can create test data from other track collections or in other variants.

  9. Charaters for detection classification

    • kaggle.com
    zip
    Updated Oct 28, 2020
    Cite
    Ivan Kalinchuk (2020). Charaters for detection classification [Dataset]. https://www.kaggle.com/ivankalinchuk/charaters-for-detection-classification
    Explore at:
Available download formats: zip (251,469,594 bytes)
    Dataset updated
    Oct 28, 2020
    Authors
    Ivan Kalinchuk
    License

Public Domain Dedication (CC0 1.0), https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Prehistory

While learning the concepts of object detection and classification, and especially one of the most successful architectures for it, YOLOv3, I caught the idea of creating my own dataset for an object detection problem. Meet my very first dataset for it: the following characters, for both classification and detection: abcdefghijklmnopqrstuwvxyzABCDEFGHIJKLMNOPQRSTUWVXYZ0123456789+@

    Content

There are 3 folders: test, train and real_world_images. Let's have a look at the test and train folders. Both of them contain an images folder and a csv file named after the main folder (train.csv or test.csv). The folder real_world_images contains random images of random sizes from random places on the Internet.

    Acknowledgements

Huge thanks to all kagglers: doing data science is a brand new magic business for humans, so you and I are doing a very good job!

    Inspiration

OCR (optical character recognition) is still a hard problem for machine learning algorithms, as it is quite difficult to say which symbol of several alphabets is in an image. For example, you've taken a picture of a signboard in Germany, and there is the word 'Bad'. This means bath in English, but what if you have a database which contains English, German, Spanish, Chinese, etc. vocabularies? Will you just skip identical letters and keep only one per alphabet? I've come to an idea, though it doesn't seem to be a final solution: yes, we have to drop identical-looking letters and keep each in only one dictionary. For example, we are free to drop a, b, c, d, e, f, etc. from German, Spanish, Portuguese, etc. and keep them only in the English vocabulary for classification. But when we get a classified word (several letters), we need to say which language the word came from. As a fast solution, we just check for some unique letters: for example, if a word contains some so-called English letters and two from German, we can safely say that this word is German. Searching for a word in huge dictionaries (vocabularies) is not a good solution, as it takes a lot of time and there is no guarantee that the word is in your dictionaries.

  10. Data from: Synthetic Smart Card Data for the Analysis of Temporal and...

    • nde-dev.biothings.io
    Updated Jan 24, 2020
    Cite
    Paul Bouman (2020). Synthetic Smart Card Data for the Analysis of Temporal and Spatial Patterns [Dataset]. https://nde-dev.biothings.io/resources?id=zenodo_776718
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset authored and provided by
    Paul Bouman
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

This is a synthetic smart card data set that can be used to test pattern detection methods for the extraction of temporal and spatial data. The data set is tab-separated and based on a stylized travel pattern description for the city of Utrecht in the Netherlands; it was developed and used in Chapter 6 of the PhD thesis of Paul Bouman. A minimal loading sketch follows the file list below.

    This dataset contains the following files:

journeys.tsv: the actual data set of synthetic smart card data

utrecht.xml: the activity pattern definition that was used to randomly generate the synthetic smart card data

validate.ref: a file derived from the activity pattern definition that can be used for validation purposes. It specifies which activity types occur at each location in the smart card data set.
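A minimal loading sketch, assuming only what is stated above (tab-separated values with a header row):

    import csv

    # journeys.tsv is tab-separated; column names are taken from the
    # file's own header rather than assumed here.
    with open("journeys.tsv", newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        for i, row in enumerate(reader):
            print(row)  # one synthetic smart card record per row
            if i == 4:
                break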

  11. A Golden Set of Problem, Solution, Advantages Senteces of the Patents

    • data.mendeley.com
    Updated Aug 11, 2022
    Cite
    Vito Giordano (2022). A Golden Set of Problem, Solution, Advantages Senteces of the Patents [Dataset]. http://doi.org/10.17632/kpxdzkgs3j.1
    Explore at:
    Dataset updated
    Aug 11, 2022
    Authors
    Vito Giordano
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

This record contains two different datasets:

(1) The Golden Set is a dataset of sentences tagged as (A) technical problem; (B) solution to the problem; and (C) advantageous effect of the invention. The dataset is based on a selectively extracted collection from the United States Patent and Trademark Office (USPTO) curated by Chikkamath, R., Parmar, V. R., Hewel, C., & Endres, M. (2021). Patent Sentiment Analysis to Highlight Patent Paragraphs. arXiv preprint arXiv:2111.09741. The full text of a patent is composed of three main parts: abstract, claims, and description. The five IP offices (IP5) formalize a common application format to standardize the written style of the patent description. The common application format also includes the sections related to the concepts of our interest, i.e., technical problems, solutions to the problem, and advantageous effects of the invention. USPTO provides the full text of patents to the public for advancing the state of the art in innovation. The full text of a patent is saved in a nested eXtensible Markup Language (XML) file. The XML format makes it possible to separate the patent text into the abstract, claims, and description parts, and to distinguish the different sections of the description established by the IP5 common application format, i.e., the background information, the summary, the embodiment, the description of the drawings, the technical fields, and other sections. Chikkamath et al. (2021) use the USPTO data, and in particular the XML files of the patent full text, to create a new dataset containing a collection of patent texts (150,000 samples) referring to (A) technical problems; (B) solutions to the problem; and (C) advantageous effects of the invention. We use this data for building our golden set.

(2) The test data is a database of 400 random patent grants and patent applications downloaded from USPTO. We use this data for evaluating transformer-based language models developed for extracting problems, solutions, and advantages on a real use case in an open-ended domain.

  12. FTICR-MS Data from Multi-continent River Water and Sediment and from Coastal...

    • osti.gov
    • knb.ecoinformatics.org
    Updated Dec 31, 2021
    Cite
    U.S. DOE > Office of Science > Early Career Research Program (2021). FTICR-MS Data from Multi-continent River Water and Sediment and from Coastal River Fresh and Saline Sediment Associated with: “Dissolved Organic Matter Functional Trait Relationships are Conserved Across Rivers” [Dataset]. http://doi.org/10.15485/1824222
    Explore at:
    Dataset updated
    Dec 31, 2021
    Dataset provided by
Office of Science, http://www.er.doe.gov/
    Department of Energy Biological and Environmental Research Program
    Environmental System Science Data Infrastructure for a Virtual Ecosystem
    Early Career Research Program: Watershed Perturbation-Response Traits Derived Through Ecological Theory
    Description

This data package is associated with the publication “Dissolved Organic Matter Functional Trait Relationships are Conserved Across Rivers” submitted to PNAS (Stegen et al., 2023). The study aims to understand the large-scale spatial structure of dissolved organic matter (DOM) thermodynamic traits and inter-trait relationships by investigating (1) river water and sediments collected along 97 rivers spanning 3 continents and (2) coastal sediment collected from fresh and saline locations in Pacific and Gulf/Atlantic rivers. Sediment extracts and water samples were analyzed using ultrahigh resolution Fourier transform ion cyclotron resonance mass spectrometry (FTICR-MS).

This dataset is comprised of three folders: (1) Coastal, (2) WHONDRS_S19S, and (3) Data_Dictionaries. Coastal contains (1) a subfolder with processed FTICR-MS data as csv files and sample collection metadata, (2) a subfolder with R scripts used to process the data and create associated figures, (3) a subfolder with the raw, unprocessed FTICR-MS data as .xml files, and (4) a readme file with more information about the dataset and instructions for using Formularity (https://omics.pnl.gov/software/formularity). WHONDRS_S19S contains (1) a csv file with processed FTICR data, (2) a csv with sample collection metadata, (3) a csv with sample geospatial data, (4) a csv with simulated lambda model outputs, (5) a subfolder with R scripts used to process the data and create associated figures, and (6) a readme file with more information regarding WHONDRS raw FTICR data and processing scripts. Data_Dictionaries contains data dictionaries for each csv file in the data package.

The 97 global river corridors were part of a WHONDRS (https://whondrs.pnnl.gov) study. The raw, unprocessed FTICR-MS data with additional data can be found at doi:10.15485/1729719 for sediments and doi:10.15485/1603775 for water. This data package contains the processed data used in the associated manuscript. The coastal data has not been previously published, and this data package contains both the raw and processed data. Version 3 of this data package, published February 2023, includes updates to the title of the manuscript, additional data and a data dictionary, and updated scripts linked to new analysis.

  13. PESO: Prostate Epithelium Segmentation on H&E-stained prostatectomy whole...

    • wouterbulten.nl
    • data.niaid.nih.gov
    Updated Jul 29, 2019
    Cite
    Wouter Bulten; Geert Litjens (2019). PESO: Prostate Epithelium Segmentation on H&E-stained prostatectomy whole slide images [Dataset]. http://doi.org/10.1038/s41598-018-37257-4
    Explore at:
    Dataset updated
    Jul 29, 2019
    Dataset provided by
    Computational Pathology Group, Radboudumc
    Authors
    Wouter Bulten; Geert Litjens
    License

Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

A large set of whole-slide images (WSI) of prostatectomy specimens with various grades of prostate cancer (PCa). More information can be found in the corresponding paper: https://doi.org/10.1038/s41598-018-37257-4

The WSIs in this dataset can be viewed using the open-source software ASAP or OpenSlide. Due to the large size of the complete dataset, the data has been split up into multiple archives.

    The data from the training set:

    • peso_training_masks.zip: Training masks (N=62) that have been used to train the main network of our paper. These masks are generated by a trained U-Net on the corresponding IHC slides.
    • peso_training_masks_corrected.zip: A subset of the color deconvolution masks (N=25) on which manual annotations have been made. Within these regions, stain and other artifacts have been removed.
    • peso_training_colordeconvolution.zip: Mask files (N=62) containing the P63&CK8/18 channel of the color deconvolution operation. These masks mark all regions that are stained by either P63 or CK8/18 in the IHC version of the slides.
• peso_training_wsi_{1-6}.zip: Zip files containing the whole slide images of the training set (N=62). Each archive contains 10 slides, except the last, which contains 12. These images are exported at a pixel resolution of 0.48 µm/pixel.

    The data from the test set:

    • peso_testset_regions.zip: Collection of annotation XML files with outlines of the test regions. These can be used to view the test regions in more detail using ASAP.
    • peso_testset_png.zip: Export of the test set regions in PNG format (2500x2500 pixels per region).
    • peso_testset_png_padded.zip: Export of the test regions in PNG format padded with a 500 pixel wide border (3500x3500 pixels per region). Useful for segmenting pixels at the border of the regions.
• peso_testset_mapping.csv: A csv file mapping files from the test set (numbered 1-160) to regions in the xml files. The csv file also contains the label (benign or cancer) for each region; a loading sketch follows this list.
• peso_testset_wsi_{1-4}.zip: Zip files containing the whole slide images of the test set (N=40). Each archive contains 10 slides of the test set. These images are exported at a pixel resolution of 0.48 µm/pixel.
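A minimal loading sketch for the mapping file (column names are not assumed; they come from the csv header):

    import csv

    # peso_testset_mapping.csv links test-set regions (numbered 1-160) to
    # the annotation XML files and carries a benign/cancer label per region.
    with open("peso_testset_mapping.csv", newline="") as f:
        for row in csv.DictReader(f):
            print(row)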

    This study was financed by a grant from the Dutch Cancer Society (KWF), grant number KUN 2015-7970.

    If you make use of this dataset please cite both the dataset itself and the corresponding paper: https://doi.org/10.1038/s41598-018-37257-4

  14. ATM: Black-box Test Case Minimization based on Test Code Similarity and...

    • data.niaid.nih.gov
    Updated Mar 25, 2023
    Cite
    Rongqi Pan; Taher A. Ghaleb; Lionel Briand (2023). ATM: Black-box Test Case Minimization based on Test Code Similarity and Evolutionary Search – Replication Package [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7455765
    Explore at:
    Dataset updated
    Mar 25, 2023
    Dataset provided by
    University of Ottawa
    Authors
    Rongqi Pan; Taher A. Ghaleb; Lionel Briand
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the replication package associated with the paper "ATM: Black-box Test Case Minimization based on Test Code Similarity and Evolutionary Search" accepted at the 45th IEEE/ACM International Conference on Software Engineering (ICSE 2023) – Technical Track. Cite this paper using the following:

@inproceedings{pan2023atm,
  title={ATM: Black-box Test Case Minimization based on Test Code Similarity and Evolutionary Search},
  author={Pan, Rongqi and Ghaleb, Taher A. and Briand, Lionel},
  booktitle={Proceedings of the 45th IEEE/ACM International Conference on Software Engineering},
  year={2023},
  pages={1--12}
}

Replication Package Contents: The replication package contains all the necessary data and code required to reproduce the results reported in the paper. We also provide the results for other minimization budgets, and detailed FDR, execution time, and statistical test results. In addition, we provide the data and code required to reproduce the results of the baseline techniques: FAST-R and random minimization.

    Data: We provide in the Data directory the data used in our experiments, which is based on 16 projects from Defects4J, whose characteristics can be found in Data/subject_projects.csv.

    Code: We provide in the Code directory the code and scripts (Java, Python, and Bash) required to run the experiments and reproduce the results.

Results: We provide in the Results directory the results for each technique independently, and also a summary of all results together for comparison purposes.

ATM - Code to AST transformation: The source code for this step is in the Code/ATM/CodeToAST directory.

Requirements:
* Eclipse IDE (we used 2021-12)
* The libraries (the .jar files in the Code/ATM/CodeToAST/lib directory)

Input: All zipped data files should be unzipped before running each step.
* Data/test_suites/all_test_cases.zip → Data/test_suites/all_test_cases
* Data/test_suites/changed_test_cases.zip → Data/test_suites/changed_test_cases
* Data/test_suites/relevant_test_cases.zip → Data/test_suites/relevant_test_cases

Output:
* Data/ATM/ASTs/all_test_cases
* Data/ATM/ASTs/changed_test_cases

Running the experiment: To generate ASTs for all test cases in the project test suites, the Code/ATM/CodeToAST/src/CodeToAST.java file should be compiled and run using the Eclipse IDE, including all the required .jar files in the Code/ATM/CodeToAST/lib directory as part of the classpath. A bash script is provided along with a pre-generated .jar file in the Code/ATM/CodeToAST/bin directory to run this step, as follows:

cd Code/ATM/CodeToAST
bash transform_code_to_ast.sh

Each test file in the Data/test_suites/all_test_cases and Data/test_suites/changed_test_cases directories is parsed to generate a corresponding AST for each test case method (saved in XML format in Data/ATM/ASTs/all_test_cases and Data/ATM/ASTs/changed_test_cases for each project version).

ATM - Similarity Measurement: The source code for this step is in the Code/ATM/Similarity directory.

Requirements:
* Eclipse IDE (we used 2021-12)
* The libraries (the .jar files in the Code/ATM/Similarity/lib directory)

Input:
* Data/test_suites/all_test_cases
* Data/test_suites/changed_test_cases

Output:
* Data/ATM/similarity_measurements

    Running the experiment: To measure the similarity between each pair of test cases, the Code/ATM/Similarity/src/SimilarityMeasurement.java file should be compiled and run using the Eclipse IDE by including all the required .jar files in the Code/ATM/Similarity/lib directory as part of the classpath. A bash script is provided along with a pre-generated .jar file in the Code/ATM/Similarity/bin directory to run this step, as follows:

cd Code/ATM/Similarity
bash measure_similarity.sh

    ASTs of each project in the Data/ATM/ASTs/all_test_cases and Data/ATM/ASTs/changed_test_cases directories are parsed to create pairs of ASTs containing one test case from the Data/ATM/ASTs/all_test_cases directory with another test case from the Data/ATM/ASTs/changed_test_cases directory (redundant pairs are discarded). Then, all similarity measurements are saved in the Data/ATM/similarity_measurements.zip file.
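A sketch in Python of the pairing step described above (the actual implementation is the Java code in Code/ATM/Similarity; the per-version directory layout here is an assumption):

    import os

    # Hypothetical per-version directories; adjust to the actual layout.
    all_dir = "Data/ATM/ASTs/all_test_cases/Chart-1"
    changed_dir = "Data/ATM/ASTs/changed_test_cases/Chart-1"

    all_cases = set(os.listdir(all_dir))
    changed_cases = set(os.listdir(changed_dir))

    # Pair every changed test case with every test case, discarding
    # redundant (self or symmetric-duplicate) pairs via canonical ordering.
    pairs = {tuple(sorted((a, c)))
             for c in changed_cases for a in all_cases if a != c}
    print(len(pairs), "unique AST pairs to compare")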

    Search-based Minimization Algorithms: The source code for this step is in the Code/ATM/Search directory.

Requirements: To run this step, Python 3 is required (we used Python 3.10). Also, the libraries in the Code/ATM/Search/requirements.txt file should be installed, as follows:

cd Code/ATM/Search
pip install -r requirements.txt

Input:
* Data/ATM/similarity_measurements

Output:
* Results/ATM/minimization_results

    Running the experiment: To minimize the test suites in our dataset, the following bash script should be executed:

    bash minimize.sh

All similarity measurements are parsed for each version of the projects, independently. Each version is run 10 times using three minimization budgets (25%, 50%, and 75%). The Genetic Algorithm (GA) is run using four similarity measures, namely top-down, bottom-up, combined, and tree edit distance. NSGA-II is run using two combinations of similarity measures: top-down & bottom-up, and combined & tree edit distance. The minimization results are generated in the Results/ATM/minimization_results directory.

    Evaluate results: To evaluate and summarize the minimization results, run the following:

cd Code/ATM/Evaluation
bash evaluate.sh

This will generate summarized FDR and execution time results (per-project and per-version) for each minimization budget, which can all be found in Results/ATM. In this replication package, we provide the final, merged FDR and execution time results.

Running the FAST-R experiments: ATM was compared to FAST-R, a state-of-the-art baseline comprising a set of test case minimization techniques (FAST++, FAST-CS, FAST-pw, and FAST-all), which we adapted to our data and experimental setup.

    Requirements: To run this step, Python 3.7 is required. Also, the libraries in the Code/FAST-R/requirements.txt file should be installed, as follows:

cd Code/FAST-R
pip install -r requirements.txt

Input:
* Data/FAST-R/test_methods
* Data/FAST-R/test_classes

Output:
* Results/FAST-R/test_methods/FDR_and_Exec_Time_Results_[budget]%_budget.csv
* Results/FAST-R/test_classes/FDR_and_Exec_Time_Results_[budget]%_budget.csv

    To run FAST-R experiments, the following bash script should be executed:

bash fast_r.sh test_methods # method level
bash fast_r.sh test_classes # class level

    Results are generated in .csv files for each budget. For example, for the 50% budget, results are saved in FDR_and_Exec_Time_Results_50%_budget.csv in the Results/FAST-R/test_methods and Results/FAST-R/test_classes directories.

Running the random minimization experiments: ATM was also compared to random minimization as a standard baseline.

    Requirements: To run this step, Python 3 is required (we used Python 3.10). Also, the libraries in the Code/RandomMinimization/requirements.txt file should be installed, as follows:

cd Code/RandomMinimization
pip install -r requirements.txt

    Input: N/A

Output:
* Results/RandomMinimization/FDR_and_Exec_Time_Results_[budget]%_budget.csv

    To run the random selection experiments, the following bash script should be executed:

    bash random_minimization.sh

    Results are generated in .csv files for each budget. For example, for the 50% budget, results are saved in FDR_and_Exec_Time_Results_50%_budget.csv in the Results/RandomMinimization directory.

  15. Annotated Patent Identification Datasets

    • kaggle.com
    zip
    Updated Jul 31, 2025
    Cite
    Mustafa Sofean (2025). Annotated Patent Identification Datasets [Dataset]. https://www.kaggle.com/datasets/mustafasofean/annotated-patent-dataset-for-plasma-physics
    Explore at:
Available download formats: zip (1,512,922 bytes)
    Dataset updated
    Jul 31, 2025
    Authors
    Mustafa Sofean
    Description

Training Dataset Summary:

Source: Extracted from EPO, WIPO, and US Patent and Trademark Office databases.

    Initial Filtering: Used patent classification code "H05H" (Plasma Technology) to collect 46,193 candidate patents.

    Refinement: Removed duplicates by selecting only one member per patent family, resulting in 24,513 unique patents.

Preprocessing: Removed unwanted tokens (e.g., numbers, special characters, HTML/XML tags); a cleaning sketch follows this summary.

    Additional data: 275 non-relevant patents using the code "A61K0035-16" (related to blood plasma).

    Final Size: 24,788 documents used to create the training set.
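A sketch of such token cleaning (the exact rules here are an assumption for illustration, not the authors' code):

    import re

    def clean(text: str) -> str:
        # Strip HTML/XML tags, then drop numbers and special characters.
        text = re.sub(r"<[^>]+>", " ", text)
        text = re.sub(r"[^A-Za-z\s]", " ", text)
        return re.sub(r"\s+", " ", text).strip()

    print(clean("<p>Plasma torch, 3 kW, 50 Hz!</p>"))  # -> "Plasma torch kW Hz"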

Test Dataset Summary:

Source: Randomly extracted from the Canadian patent database.

    Size: Started with 5,000 candidate patents.

    Focus Section: Only used the technical field section of each patent.

    Annotation: Conducted by a data scientist in collaboration with plasma experts.

    Labeled Samples: Final set contains 1,295 annotated patents:

    551 related to plasma physics

    744 unrelated to plasma physics

Test dataset for the information security domain:
We randomly selected 1,000 documents that were not part of the training dataset. Of these, 941 documents were manually annotated with two labels: "INFOSEC" and "NON-INFOSEC".

    Purpose: Serves as an independent benchmark to evaluate model performance.

  16. AgileOERP Dataset v1

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Jan 21, 2020
    Cite
    randa elamin; rasha osman; randa elamin; rasha osman (2020). AgileOERP Dataset v1 [Dataset]. http://doi.org/10.5281/zenodo.1472688
    Explore at:
Available download formats: bin
    Dataset updated
    Jan 21, 2020
    Dataset provided by
Zenodo, http://zenodo.org/
    Authors
    randa elamin; rasha osman; randa elamin; rasha osman
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

This dataset was collected from a commercial management tool used to customize an open source ERP applying an agile methodology. It contains 350 user story (US), 1,323 task (TS), and 198 developer test (DT) artifacts, in addition to an answer set of trace links manually generated by developers, which relate the user story artifacts to task artifacts (1,304 links) and the task artifacts to developer test artifacts (65 links). Artifacts and trace links are prepared in XML format to test the methodology presented in “Implementing Traceability Repositories as Graph Databases for Software Quality Improvement”, 10.1109/QRS.2018.00040. The data set contains an XML file for each artifact type, such as US.xml; ERPrelations.xml is the answer set file; AgileModel.xml describes the defined model; and AgileTraceabilityRule.xml includes all rules applied for trace link types.

  17. Dataset of ICPR 2020 Competition on Text Block Segmentation on a NewsEye...

    • data.niaid.nih.gov
    Updated Jun 14, 2021
    Cite
    Michael, Johannes; Weidemann, Max; Laasch, Bastian; Labahn, Roger (2021). Dataset of ICPR 2020 Competition on Text Block Segmentation on a NewsEye Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4943581
    Explore at:
    Dataset updated
    Jun 14, 2021
    Dataset provided by
    Institute of Mathematics, CITlab, University of Rostock
    Authors
    Michael, Johannes; Weidemann, Max; Laasch, Bastian; Labahn, Roger
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

This is the data for the ICPR 2020 paper “ICPR 2020 Competition on Text Block Segmentation on a NewsEye Dataset”.

The data is taken from the NewsEye project and consists of historical newspaper pages (partially binarized) ranging from the 19th to the 20th century, provided by the Austrian National Library, i.e., especially newspapers in the German language. The newspapers made available for this competition comprise the titles "Arbeiter Zeitung", "Illustrierte Kronen Zeitung", "Innsbrucker Nachrichten" and "Neue Freie Presse".

    The data is split into two tracks. A simple track with newspaper pages only with continuous text (40 pages training data, 10 pages test data) and a complex track with pages including additional tables, images or advertisements (40 pages training data, 10 pages test data).

    The training data (simple_pages_train.zip, complex_pages_train.zip) contains a set of scanned pages. Furthermore, for every image we provide the coordinates of the baselines, the corresponding text of the lines and the text regions marking the text blocks in the well-established PAGE XML format. Additionally, baselines lying within the same block have a unique ID in the so-called "custom tag".

Please note that a text block captures a whole paragraph and the block outlines enclose the text very closely. Headlines are marked separately and blocks do not span columns. Furthermore, images can be ignored since they (usually) do not contain baselines, and occurring tables and framed advertisements are handled as single text blocks.

The following sketches a snippet of a PAGE XML file where the baseline with ID "tl_223" forms a block together with all other lines carrying the block ID "a7" (a minimal illustration following PAGE XML conventions; coordinates are elided):

    <TextLine id="tl_223" custom="readingOrder {index:0;} structure {id:a7; type:article;}">
      <Coords points="..."/>
      <Baseline points="..."/>
    </TextLine>

The type description "article" in the custom tag is a result of the NewsEye project. In the context of this competition, an article simply means a text block.

    The test data comes in two versions. One with (simple_pages_test_gt.zip, complex_pages_test_gt.zip) and one without (simple_pages_test.zip, complex_pages_test.zip) the corresponding ground truth. Ground Truth means, in our context, the ideal of a system's output generated by humans.

For each sample in the test data there is an image of the scanned newspaper page with its corresponding PAGE XML file. In the case without ground truth, the PAGE XML files contain the baselines (without any block IDs), the text, and only a single text region surrounding the whole page. The single region should be ignored but is necessary because the PAGE XML format requires that every line is assigned to a region. In the case with ground truth, the PAGE XML files again contain the text regions marking the text blocks, and the corresponding baselines again carry the same block IDs. The passwords for extracting the ground truth test data are "icpr2020!tb_simple" for the simple track and "icpr2020!tb_complex" for the complex track.

  18. ag_news_subset

    • tensorflow.org
    Updated Dec 6, 2022
    Cite
    (2022). ag_news_subset [Dataset]. http://identifiers.org/arxiv:1509.01626
    Explore at:
    Dataset updated
    Dec 6, 2022
    Description

AG is a collection of more than 1 million news articles. News articles have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine which has been running since July 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), XML, data compression, data streaming, and any other non-commercial activity. For more information, please refer to the link http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html .

    The AG's news topic classification dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the dataset above. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).

    The AG's news topic classification dataset is constructed by choosing 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples. The total number of training samples is 120,000 and testing 7,600.

    To use this dataset:

    import tensorflow_datasets as tfds

    # Load the training split and print the first few examples.
    ds = tfds.load('ag_news_subset', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.
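
    For classification experiments, the test split and an (input, label) view can be loaded as well. The following sketch assumes the supervised keys and split sizes described above ('description' and 'label'; 120,000/7,600 examples) and verifies them via the dataset info object:

    import tensorflow_datasets as tfds

    # Load train/test as (text, label) pairs and inspect the class metadata.
    (train_ds, test_ds), ds_info = tfds.load(
        'ag_news_subset',
        split=['train', 'test'],
        as_supervised=True,
        with_info=True,
    )
    print(ds_info.features['label'].num_classes)  # 4 topic classes
    print(ds_info.splits['train'].num_examples)   # expected: 120000
    print(ds_info.splits['test'].num_examples)    # expected: 7600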

  19. Z

    Implementing Traceability Repositories as Graph Databases for Software...

    • data.niaid.nih.gov
    Updated Aug 19, 2020
    Cite
    Elamin, Randa; Osman, Rasha (2020). Implementing Traceability Repositories as Graph Databases for Software Quality Improvement: Datasets used to test our methodology that is presented in the paper 10.1109/QRS.2018.00040 [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_1467616
    Explore at:
    Dataset updated
    Aug 19, 2020
    Authors
    Elamin, Randa; Osman, Rasha
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The first dataset is the Event Based Traceability for Managing Evolutionary Change (EBT) dataset, a public dataset provided by CoEST; the original artifacts and trace links are represented in XML and text format. From the EBT dataset, we selected the 41 requirement and 25 test-case artifacts, in addition to the answer set of 51 trace links relating the requirements to the test cases. Artifacts and trace links are prepared in XML format. The dataset contains an XML file for each artifact type (e.g., RQ.xml), EBTrelations.xml as the answer-set file, TradModel.xml describing the defined model, and TradTraceabilityRule.xml containing the rules applied for trace-link types.

    The second dataset, AgileOERP, was collected from a commercial management tool used to customize an open-source ERP with an agile methodology. It contains 350 user story (US), 1323 task (TS), and 198 developer test (DT) artifacts, in addition to an answer set of trace links generated manually by developers: 1304 links relating user story artifacts to task artifacts, and 65 links relating task artifacts to developer test artifacts. Artifacts and trace links are prepared in XML format. The dataset contains an XML file for each artifact type (e.g., US.xml), ERPrelations.xml as the answer-set file, AgileModel.xml describing the defined model, and AgileTraceabilityRule.xml containing all rules applied for trace-link types.

    The last dataset is based on the Aqualush irrigation system, which is used as a case study in "C. Fox, Introduction to Software Engineering Design: Processes, Principles and Patterns with UML2. Addison-Wesley, 2006". The trace links are generated and provided, in HTML format, in "E. Ben Charrada, D. Caspar, C. Jeanneret, and M. Glinz, Towards a Benchmark for Traceability, Joint EVOL and IWPSE 2011, pp. 21-30". For our work, we selected the software requirements specification (396 SRS), user-level requirements (48 ULR), use case (74 UC), detailed design (85 DD), and software architecture (15 SArch) artifacts, in addition to the answer set of trace links relating the SRS to the other artifacts (4038) and the DD artifacts to the other artifacts (1719). Artifacts and trace links are prepared in XML format. The dataset contains an XML file for each artifact type (e.g., SRS.xml), AqualushRelations.xml as the answer-set file, TradModel.xml describing the defined model, and TradTraceabilityRule.xml containing the rules applied for trace-link types.
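
    To illustrate how such artifact and answer-set files can be combined, the sketch below loads artifacts and trace links into an in-memory directed graph; the element and attribute names (artifact, link, source, target) are assumptions made for illustration, not the datasets' actual XML schemas.

    import xml.etree.ElementTree as ET
    import networkx as nx

    def load_trace_graph(artifact_files, relations_file):
        """Build a graph with artifacts as nodes and trace links as edges.

        Assumes <artifact id="..." type="..."/> and <link source="..."
        target="..."/> elements; adapt the tag names to the real schemas.
        """
        g = nx.DiGraph()
        for path in artifact_files:
            for art in ET.parse(path).getroot().iter("artifact"):
                g.add_node(art.get("id"), kind=art.get("type"))
        for rel in ET.parse(relations_file).getroot().iter("link"):
            g.add_edge(rel.get("source"), rel.get("target"))
        return g

    # Hypothetical usage: g = load_trace_graph(["RQ.xml"], "EBTrelations.xml");
    # recovered trace links can then be checked against g.edges as the answer set.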

  20. Z

    Workplans generated during AWOPS tests

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 7, 2022
    Cite
    Messi, Leonardo; Carbonari, Alessandro (2022). Workplans generated during AWOPS tests [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6546629
    Explore at:
    Dataset updated
    Jun 7, 2022
    Dataset provided by
    Università Politecnica delle Marche
    Authors
    Messi, Leonardo; Carbonari, Alessandro
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset concerns workplans generated during the validation of the Automated Work Planning Services (AWOPS). The dataset includes:

    9 XML files containing workplans automatically generated by AWOPS for renovation-work activities on the following days:

    March 7th 2022 (day 0);

    March 14th 2022 (day 2);

    March 17th 2022 (day 5);

    March 21st 2022 (day 9);

    March 24th 2022 (day 12);

    March 28th 2022 (day 16);

    March 31st 2022 (day 19);

    April 4th 2022 (day 23);

    April 7th 2022 (day 26).
