The APPS dataset consists of problems collected from different open-access coding websites such as Codeforces, Kattis, and more. The APPS benchmark attempts to mirror how human programmers are evaluated by posing coding problems in unrestricted natural language and evaluating the correctness of solutions. The problems range in difficulty from introductory to collegiate competition level and measure coding ability as well as problem-solving.
The Automated Programming Progress Standard, abbreviated APPS, consists of 10,000 coding problems in total, with 131,836 test cases for checking solutions and 232,444 ground-truth solutions written by humans. Problems can be complicated, as the average length of a problem is 293.2 words. The data are split evenly into training and test sets, with 5,000 problems each. In the test set, every problem has multiple test cases, and the average number of test cases is 21.2. Each test case is specifically designed for the corresponding problem, enabling us to rigorously evaluate program functionality.
MIT License: https://opensource.org/licenses/MIT
APPS is a benchmark for Python code generation. It includes 10,000 problems, ranging from simple one-line solutions to substantial algorithmic challenges. For more details, please refer to the paper: https://arxiv.org/pdf/2105.09938.pdf.
MIT License: https://opensource.org/licenses/MIT
APPS Dataset
Dataset Description
APPS is a benchmark for code generation with 10,000 problems. It can be used to evaluate the ability of language models to generate code from natural language specifications. You can also find the APPS metric on the Hub at codeparrot/apps_metric.
Languages
The dataset contains questions in English and code solutions in Python.
Dataset Structure
from datasets import load_dataset
load_dataset("codeparrot/apps")
… See the full description on the dataset page: https://huggingface.co/datasets/AuroraH456/apps-small.
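As a minimal sketch of how the loading call above can be used, the snippet below pulls the test split and inspects one problem. The field names ("question", "input_output", "solutions") and their JSON encoding follow the codeparrot/apps dataset card and should be checked against the dataset page if they differ.

# Minimal sketch, assuming the split and field names from the codeparrot/apps card;
# loading options may vary with your version of the datasets library.
import json
from datasets import load_dataset

apps_test = load_dataset("codeparrot/apps", split="test")
example = apps_test[0]
print(example["question"][:300])                                    # natural-language problem statement
tests = json.loads(example["input_output"]) if example["input_output"] else {}
print(len(tests.get("inputs", [])), "test cases")                   # inputs/outputs used to check solutions
solutions = json.loads(example["solutions"]) if example["solutions"] else []
print(len(solutions), "ground-truth solutions")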
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
The dating app industry grew out of the dating websites that were prominent in the early 2010s, with Match, Plenty of Fish and Zoosk leading the way with similarly designed services for mobile. This...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
AndroidWorld is an environment for building and benchmarking autonomous computer control agents.
It runs on a live Android emulator and contains a highly reproducible benchmark of 116 hand-crafted tasks across 20 apps, which are dynamically instantiated with randomly-generated parameters to create millions of unique task variations.
In addition to the built-in tasks, AndroidWorld also supports the popular web benchmark MiniWoB++ from Liu et al.
Key features of AndroidWorld include:
📝 116 diverse tasks across 20 real-world apps
🎲 Dynamic task instantiation for millions of unique variations
🏆 Durable reward signals for reliable evaluation
🌐 Open environment with access to millions of Android apps and websites
💾 Lightweight footprint (2 GB memory, 8 GB disk)
🔧 Extensible design to easily add new tasks and benchmarks
🖥️ Integration with MiniWoB++ web-based tasks
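The dynamic task instantiation described above can be pictured with the purely illustrative Python sketch below; the class and parameter names are hypothetical and are not AndroidWorld's actual API.

# Illustrative only: hypothetical names, not the AndroidWorld interface.
import random
import string

class ContactTaskTemplate:
    """A parameterized task whose every instantiation gets fresh random parameters."""
    goal_template = "Add a contact named {name} with phone number {phone}."

    def instantiate(self, seed=None):
        rng = random.Random(seed)
        name = "".join(rng.choices(string.ascii_lowercase, k=6)).title()
        phone = "".join(rng.choices(string.digits, k=10))
        # The generated parameters double as the ground truth for a durable reward check.
        return {"goal": self.goal_template.format(name=name, phone=phone),
                "expected_contact": {"name": name, "phone": phone}}

template = ContactTaskTemplate()
print(template.instantiate(seed=0)["goal"])
print(template.instantiate(seed=1)["goal"])  # a different variation of the same task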
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
In recent times, one of the most impactful applications of the growing capabilities of Large Language Models (LLMs) has been their use in Retrieval-Augmented Generation (RAG) systems. RAG applications are inherently more robust against LLM hallucinations and provide source traceability, which holds critical importance in the scientific reading and writing process. However, validating such systems is essential due to the stringent systematic requirements of the scientific domain. Existing benchmark datasets are limited in the scope of research areas they cover, often focusing on the natural sciences, which restricts their applicability and validation across other scientific fields.
To address this gap, we present a closed-question answering (QA) dataset for benchmarking scientific RAG applications. This dataset spans 34 research topics across 10 distinct areas of study. It includes 108 manually curated question-answer pairs, each annotated with answer type, difficulty level, and a gold reference along with a link to the source paper. Further details on each of these attributes can be found in the accompanying README.md file.
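As a hedged sketch of how such a closed-QA set can drive a benchmark run, the snippet below scores a RAG system by exact match; the file name and column names are assumptions, so consult the dataset's README.md for the real schema.

# Hedged sketch: "qa_pairs.csv" and the column names "question"/"answer" are assumptions;
# the dataset's README.md documents the actual file layout and annotation attributes.
import csv

def exact_match(prediction: str, gold: str) -> bool:
    return prediction.strip().lower() == gold.strip().lower()

def evaluate(qa_path: str, rag_system) -> float:
    with open(qa_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    hits = sum(exact_match(rag_system(row["question"]), row["answer"]) for row in rows)
    return hits / len(rows)

# Usage (hypothetical): accuracy = evaluate("qa_pairs.csv", my_rag_pipeline)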
Please cite the following publication when using the dataset: TBD
The publication is available at: TBD
A preprint version of the publication is available at: TBD
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Object recognition predominantly still relies on many high-quality training examples per object category. In contrast, learning new objects from only a few examples could enable many impactful applications from robotics to user personalization. Most few-shot learning research, however, has been driven by benchmark datasets that lack the high variation that these applications will face when deployed in the real world. To close this gap, we present the ORBIT dataset, grounded in a real-world application of teachable object recognizers for people who are blind/low vision. We provide a full, unfiltered dataset of 4,733 videos of 588 objects recorded by 97 people who are blind/low-vision on their mobile phones, and a benchmark dataset of 3,822 videos of 486 objects collected by 77 collectors. The code for loading the dataset, computing all benchmark metrics, and running the baseline models is available at https://github.com/microsoft/ORBIT-Dataset.
This version comprises several zip files:
- train, validation, test: benchmark dataset, organised by collector, with raw videos split into static individual frames in jpg format at 30FPS
- other: data not in the benchmark set, organised by collector, with raw videos split into static individual frames in jpg format at 30FPS (please note that the train, validation, test, and other files make up the unfiltered dataset)
- *_224: as for the benchmark, but static individual frames are scaled down to 224 pixels
- *_unfiltered_videos: full unfiltered dataset, organised by collector, in mp4 format
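As an illustration of working with the per-collector frame folders, the sketch below counts frames per collector; the exact directory layout is an assumption, and the official loaders live in the ORBIT-Dataset repository.

# Hedged sketch: assumes extracted frames sit under <split>/<collector>/.../<frame>.jpg;
# see https://github.com/microsoft/ORBIT-Dataset for the official data-loading code.
from collections import defaultdict
from pathlib import Path

def frames_per_collector(split_dir: str) -> dict:
    counts = defaultdict(int)
    root = Path(split_dir)
    for frame in root.rglob("*.jpg"):
        collector = frame.relative_to(root).parts[0]   # first folder level = collector
        counts[collector] += 1
    return dict(counts)

# Usage (hypothetical path): print(frames_per_collector("orbit_benchmark/train"))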
Eliciting-Contexts/applications-benchmark-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
Key App Engagement Rate Statistics
Session Length by App Category
Monthly Sessions by App Category
Daily Active Users by App Category
Hours Spent by App Category
Hours Spent on Apps by Country
Most Popular...
The National Flood Hazard Layer (NFHL) data incorporates all Digital Flood Insurance Rate Map (DFIRM) databases published by FEMA, and any Letters of Map Revision (LOMRs) that have been issued against those databases since their publication date. The DFIRM Database is the digital, geospatial version of the flood hazard information shown on the published paper Flood Insurance Rate Maps (FIRMs). The primary risk classifications used are the 1-percent-annual-chance flood event, the 0.2-percent-annual-chance flood event, and areas of minimal flood risk. The NFHL data are derived from Flood Insurance Studies (FISs), previously published Flood Insurance Rate Maps (FIRMs), flood hazard analyses performed in support of the FISs and FIRMs, and new mapping data where available. The FISs and FIRMs are published by the Federal Emergency Management Agency (FEMA). The specifications for the horizontal control of DFIRM data are consistent with those required for mapping at a scale of 1:12,000. The NFHL data contain layers in the Standard DFIRM datasets except for S_Label_Pt and S_Label_Ld. The NFHL is available as State or US Territory data sets. Each State or Territory data set consists of all DFIRMs and corresponding LOMRs available on the publication date of the data set.
MIT License: https://opensource.org/licenses/MIT
The benchmark data contains name, type, material, coordinates, elevations, and vertical order. All benchmarks were conventionally leveled through in accordance with the procedures set up in the Brevard County Vertical Control Manual (October 2012). The elevations of the benchmarks are based on the North American Vertical Datum of 1988 (NAVD88). The horizontal coordinates are from a handheld GPS unit and are for reference purposes only.
This is a benchmark of data loss bugs for Android apps: a public collection of 110 data loss faults that we systematically collected to facilitate research and experimentation with these problems. The benchmark is available on GitLab and includes the faulty apps, the fixed apps (when available), the test cases to automatically reproduce the problems, and additional information that may help researchers in their tasks.
Brand performance data collected from AI search platforms for the query "mHealth app retention benchmarks".
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
To address the aforementioned limitations, this paper presents ParamScope, a static analysis tool for cryptographic API misuse detection. ParamScope first obtains high-quality Intermediate Representation (IR) and comprehensive coverage of cryptographic API calls through fine-grained static analysis. It then performs assignment-driven program slicing and lightweight IR simulation to reconstruct the complete propagation and assignment chain of parameter values. This approach enables effective analysis of value assignments that can only be determined at runtime, which are often missed by existing static analysis, while also addressing the coverage limitations inherent in dynamic approaches. We evaluated ParamScope by comparing it with leading static and dynamic tools, including CryptoGuard, CrySL, and RvSec, using four cryptographic misuse benchmarks and a dataset of 327 Google Play applications. The results show that ParamScope outperforms the other tools, achieving an accuracy of 96.22% and an F1-score of 96.85%. In real-world experiments, ParamScope identifies 27% more misuse cases than the best-performing tools, while maintaining a comparable analysis time.
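For reference, accuracy and F1 follow their standard confusion-matrix definitions; the counts in the snippet below are placeholders, not numbers reported for ParamScope.

# Standard metric definitions; the TP/FP/FN/TN counts are placeholders,
# not values taken from the ParamScope evaluation.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

tp, fp, fn, tn = 90, 5, 3, 2   # placeholder confusion-matrix counts
print(f"accuracy={accuracy(tp, tn, fp, fn):.4f}, F1={f1_score(tp, fp, fn):.4f}")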
This map displays National Geodetic Survey (NGS) classifications of geodetic control stations for the Pennsylvania area with PennDOT county and municipal boundaries.
NOAA Charting and Geodesy: https://www.noaa.gov/charting
NOAA Survey Map: https://noaa.maps.arcgis.com/apps/webappviewer/index.html?id=190385f9aadb4cf1b0dd8759893032db
PennDOT GIS Hub: GIS Hub (arcgis.com)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Application-level monitoring frameworks, such as Kieker, provide insight into the inner workings and the dynamic behavior of software systems. However, depending on the number of monitoring probes used, these frameworks may introduce significant runtime overhead. Consequently, planning the instrumentation of continuously operating software systems requires detailed knowledge of the performance impact of each monitoring probe.
In this paper, we present our benchmark engineering approach to quantify the monitoring overhead caused by each probe under controlled and repeatable conditions. Our developed MooBench benchmark provides a basis for performance evaluations and comparisons of application-level monitoring frameworks. To evaluate its capabilities, we employ our benchmark to conduct a performance comparison of all available Kieker releases from version 0.91 to the current release 1.8.
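MooBench and Kieker are Java tools; purely as an illustration of the idea of quantifying per-probe overhead under controlled, repeated executions, a minimal Python sketch might look as follows.

# Illustration only (MooBench itself is a Java benchmark): compare the mean cost of a
# monitored operation with and without a simple monitoring probe wrapped around it.
import time

def monitored_operation():
    return sum(i * i for i in range(1000))        # stand-in workload

def with_probe(fn):
    def wrapper():
        start = time.perf_counter_ns()
        result = fn()
        _record = (fn.__name__, time.perf_counter_ns() - start)  # stand-in for writing a monitoring record
        return result
    return wrapper

def mean_ns_per_call(fn, runs=100_000):
    start = time.perf_counter_ns()
    for _ in range(runs):
        fn()
    return (time.perf_counter_ns() - start) / runs

baseline = mean_ns_per_call(monitored_operation)
instrumented = mean_ns_per_call(with_probe(monitored_operation))
print(f"baseline {baseline:.0f} ns/call, instrumented {instrumented:.0f} ns/call, "
      f"overhead {instrumented - baseline:.0f} ns/call")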
This dataset supplements the paper and contains the raw experimental data as well as several generated diagrams for each experiment.
Neural networks are potentially valuable for many of the challenges associated with MRS data. The purpose of this manuscript is to describe the AGNOSTIC dataset, which contains 259,200 synthetic 1H MRS examples for training and testing neural networks. AGNOSTIC was created using 270 basis sets that were simulated across 18 field strengths and 15 echo times. The synthetic examples were produced to resemble in vivo brain data with combinations of metabolite, macromolecule, and residual water signals, and noise. To demonstrate the utility, we apply AGNOSTIC to train two Convolutional Neural Networks (CNNs) to address out-of-voxel (OOV) echoes. A Detection Network was trained to identify the point-wise presence of OOV echoes, providing proof of concept for real-time detection. A Prediction Network was trained to reconstruct OOV echoes, allowing subtraction during post-processing. Complex OOV signals were mixed into 85% of the synthetic examples to train the two separate CNNs for detection and prediction.
All of the parameters (i.e., amplitudes, relaxation decays, etc.) are included in each NumPy zipped archive file, and the archives can be opened using Python and NumPy.
AGNOSTIC: Adaptable Generalized Neural-Network Open-source Spectroscopy Training dataset of Individual Components
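Since each example ships as a NumPy zipped archive, a minimal loading sketch looks like the following; the file name is a placeholder and the stored keys should be inspected rather than assumed.

# Hedged sketch: "agnostic_example.npz" is a placeholder name; data.files lists whatever
# arrays and parameters a given AGNOSTIC archive actually contains.
import numpy as np

data = np.load("agnostic_example.npz", allow_pickle=True)
print(data.files)                                    # names of the stored arrays/parameters
for key in data.files:
    value = data[key]
    print(key, getattr(value, "shape", type(value)))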
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Context and Aim
Deep learning in Earth Observation requires large image archives with highly reliable labels for model training and testing. However, a preferable quality standard for forest applications in Europe has not yet been determined. The TreeSatAI consortium investigated numerous sources for annotated datasets as an alternative to manually labeled training datasets.
We found the federal forest inventory of Lower Saxony, Germany represents an unseen treasure of annotated samples for training data generation. The respective 20-cm Color-infrared (CIR) imagery, which is used for forestry management through visual interpretation, constitutes an excellent baseline for deep learning tasks such as image segmentation and classification.
Description
The data archive is highly suitable for benchmarking as it represents the real-world data situation of many German forest management services. On the one hand, it has a high number of samples which are supported by the high-resolution aerial imagery. On the other hand, this data archive presents challenges, including class label imbalances between the different forest stand types.
The TreeSatAI Benchmark Archive contains:
50,381 image triplets (aerial, Sentinel-1, Sentinel-2)
synchronized time steps and locations
all original spectral bands/polarizations from the sensors
20 species classes (single labels)
12 age classes (single labels)
15 genus classes (multi labels)
60 m and 200 m patches
fixed split for train (90%) and test (10%) data
additional single labels such as English species name, genus, forest stand type, foliage type, land cover
The geoTIFF and GeoJSON files are readable in any GIS software, such as QGIS. For further information, we refer to the PDF document in the archive and publications in the reference section.
Version history
v1.0.2 - Minor bug fix multi label JSON file
v1.0.1 - Minor bug fixes in multi label JSON file and description file
v1.0.0 - First release
Citation
Ahlswede, S., Schulz, C., Gava, C., Helber, P., Bischke, B., Förster, M., Arias, F., Hees, J., Demir, B., and Kleinschmit, B.: TreeSatAI Benchmark Archive: a multi-sensor, multi-label dataset for tree species classification in remote sensing, Earth Syst. Sci. Data, 15, 681–695, https://doi.org/10.5194/essd-15-681-2023, 2023.
GitHub
Full code examples and pre-trained models from the dataset article (Ahlswede et al. 2022) using the TreeSatAI Benchmark Archive are published on the GitLab and GitHub repositories of the Remote Sensing Image Analysis (RSiM) Group (https://git.tu-berlin.de/rsim/treesat_benchmark) and the Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI) (https://github.com/DFKI/treesatai_benchmark). Code examples for the sampling strategy can be made available by Christian Schulz via email request.
Folder structure
We refer to the proposed folder structure in the PDF file.
Folder “aerial” contains the aerial imagery patches derived from summertime orthophotos of the years 2011 to 2020. Patches are available in 60 x 60 m (304 x 304 pixels). Band order is near-infrared, red, green, and blue. Spatial resolution is 20 cm.
Folder “s1” contains the Sentinel-1 imagery patches derived from summertime mosaics of the years 2015 to 2020. Patches are available in 60 x 60 m (6 x 6 pixels) and 200 x 200 m (20 x 20 pixels). Band order is VV, VH, and VV/VH ratio. Spatial resolution is 10 m.
Folder “s2” contains the Sentinel-2 imagery patches derived from summertime mosaics of the years 2015 to 2020. Patches are available in 60 x 60 m (6 x 6 pixels) and 200 x 200 m (20 x 20 pixels). Band order is B02, B03, B04, B08, B05, B06, B07, B8A, B11, B12, B01, and B09. Spatial resolution is 10 m.
The folder “labels” contains a JSON string which was used for multi-labeling of the training patches. Code example of an image sample with respective proportions of 94% for Abies and 6% for Larix is: "Abies_alba_3_834_WEFL_NLF.tif": [["Abies", 0.93771], ["Larix", 0.06229]]
The two files “test_filesnames.lst” and “train_filenames.lst” define the filenames used for train (90%) and test (10%) split. We refer to this fixed split for better reproducibility and comparability.
The folder “geojson” contains GeoJSON files with all the samples chosen for training patch generation (point, 60 m bounding box, 200 m bounding box).
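Putting the folder notes above together, a small sketch for reading one patch, its multi-label entry, and the fixed split list might look as follows; rasterio is just one common GeoTIFF reader, and the file paths below are placeholders that should be adapted to the structure described in the PDF.

# Hedged sketch: paths are placeholders; band order and pixel sizes follow the folder
# notes above, and the 0.07 label threshold is an arbitrary example, not a prescribed value.
import json
import rasterio

with rasterio.open("s2/60m/Abies_alba_3_834_WEFL_NLF.tif") as src:   # placeholder path
    patch = src.read()                        # array of shape (bands, rows, cols)
print(patch.shape)                            # e.g., (12, 6, 6) for a 60 m Sentinel-2 patch

with open("labels/multi_labels.json") as f:   # placeholder name for the multi-label JSON
    labels = json.load(f)                     # {"<patch>.tif": [["Genus", fraction], ...]}

with open("train_filenames.lst") as f:
    train_files = [line.strip() for line in f if line.strip()]

sample = "Abies_alba_3_834_WEFL_NLF.tif"
present = [genus for genus, fraction in labels[sample] if fraction >= 0.07]
print(sample in train_files)                  # whether this patch is in the fixed train split
print(present)                                # e.g., ['Abies'] with the example proportions above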
CAUTION: As we could not upload the aerial patches as a single zip file on Zenodo, you need to download the 20 single species files (aerial_60m_…zip) separately. Then, unzip them into a folder named “aerial” with a subfolder named “60m”. This structure is recommended for better reproducibility and comparability to the experimental results of Ahlswede et al. (2022).
Join the archive
Model training, benchmarking, algorithm development… many applications are possible! Feel free to add samples from other regions in Europe or even worldwide. Additional remote sensing data from Lidar, UAVs or aerial imagery from different time steps are very welcome. This helps the research community in the development of better deep learning and machine learning models for forest applications. If you have questions or want to share code, results, or publications using the archive, feel free to contact the authors.
Project description
This work was part of the project TreeSatAI (Artificial Intelligence with Satellite data and Multi-Source Geodata for Monitoring of Trees at Infrastructures, Nature Conservation Sites and Forests). Its overall aim is the development of AI methods for the monitoring of forests and woody features on a local, regional and global scale. Based on freely available geodata from different sources (e.g., remote sensing, administration maps, and social media), prototypes will be developed for the deep learning-based extraction and classification of tree- and tree stand features. These prototypes deal with real cases from the monitoring of managed forests, nature conservation and infrastructures. The development of the resulting services by three enterprises (liveEO, Vision Impulse and LUP Potsdam) will be supported by three research institutes (German Research Center for Artificial Intelligence, TUB Remote Sensing Image Analysis Group, TUB Geoinformation in Environmental Planning Lab).
Project publications
Ahlswede, S., Schulz, C., Gava, C., Helber, P., Bischke, B., Förster, M., Arias, F., Hees, J., Demir, B., and Kleinschmit, B.: TreeSatAI Benchmark Archive: a multi-sensor, multi-label dataset for tree species classification in remote sensing, Earth System Science Data, 15, 681–695, https://doi.org/10.5194/essd-15-681-2023, 2023.
Schulz, C., Förster, M., Vulova, S. V., Rocha, A. D., and Kleinschmit, B.: Spectral-temporal traits in Sentinel-1 C-band SAR and Sentinel-2 multispectral remote sensing time series for 61 tree species in Central Europe. Remote Sensing of Environment, 307, 114162, https://doi.org/10.1016/j.rse.2024.114162, 2024.
Conference contributions
Ahlswede, S., Madam, N.T., Schulz, C., Kleinschmit, B., and Demir, B.: Weakly Supervised Semantic Segmentation of Remote Sensing Images for Tree Species Classification Based on Explanation Methods, IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, https://doi.org/10.48550/arXiv.2201.07495, 2022.
Schulz, C., Förster, M., Vulova, S., Gränzig, T., and Kleinschmit, B.: Exploring the temporal fingerprints of mid-European forest types from Sentinel-1 RVI and Sentinel-2 NDVI time series, IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, https://doi.org/10.1109/IGARSS46834.2022.9884173, 2022.
Schulz, C., Förster, M., Vulova, S., and Kleinschmit, B.: The temporal fingerprints of common European forest types from SAR and optical remote sensing data, AGU Fall Meeting, New Orleans, USA, 2021.
Kleinschmit, B., Förster, M., Schulz, C., Arias, F., Demir, B., Ahlswede, S., Aksoy, A.K., Ha Minh, T., Hees, J., Gava, C., Helber, P., Bischke, B., Habelitz, P., Frick, A., Klinke, R., Gey, S., Seidel, D., Przywarra, S., Zondag, R., and Odermatt B.: Artificial Intelligence with Satellite data and Multi-Source Geodata for Monitoring of Trees and Forests, Living Planet Symposium, Bonn, Germany, 2022.
Schulz, C., Förster, M., Vulova, S., Gränzig, T., and Kleinschmit, B.: Exploring the temporal fingerprints of sixteen mid-European forest types from Sentinel-1 and Sentinel-2 time series, ForestSAT, Berlin, Germany, 2022.
Stable benchmark dataset. 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users.
This dataset has three files: ratings.csv, movies.csv, and tags.csv.
ratings.csv: movies have been rated by 138,493 users on a scale of 1 to 5. This file contains the columns 'userId', 'movieId', 'rating', and 'timestamp'.
tags.csv: this file contains the columns 'userId', 'movieId', and 'tag'.
I got this data from MovieLens for a mini project. This is the link to the original dataset
You have got a ton of data. You can use it to make fun decisions, like which is the best movie series of all time, or to create a completely new story out of the data that you have.
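As a small sketch of that kind of exploration, the snippet below ranks movies by mean rating using the columns described above; it assumes movies.csv has 'movieId' and 'title' columns, and the 1,000-rating cutoff is an arbitrary choice to filter out rarely rated titles.

# Sketch using the columns described above; assumes movies.csv provides 'movieId' and
# 'title', and the 1,000-rating cutoff is an arbitrary choice, not part of the dataset.
import pandas as pd

ratings = pd.read_csv("ratings.csv")
movies = pd.read_csv("movies.csv")

stats = ratings.groupby("movieId")["rating"].agg(["mean", "count"])
top10 = (stats[stats["count"] >= 1000]
         .sort_values("mean", ascending=False)
         .head(10)
         .join(movies.set_index("movieId")["title"]))
print(top10)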