This is a benchmark model for wastewater treatment using an activated sludge process. The activated sludge process is a means of treating both municipal and industrial wastewater: a multi-chamber reactor unit uses highly concentrated microorganisms to degrade organics and remove nutrients from wastewater, producing high-quality effluent. This model provides pollutant concentrations, mass balances, electricity requirements, and treatment costs. The model will be continuously updated based on the latest data.
This dataset was created by Anthony Goldbloom
DTBM is a benchmark dataset for Digital Twins that reflects the characteristics of such systems and looks into the scaling challenges of different knowledge graph technologies.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Results of the benchmark run; see the attached JSON for details.
Pretrained models from four image-to-image translation algorithms (ACL-GAN, Council-GAN, CycleGAN, and U-GAT-IT) on three benchmarking datasets: Selfie2Anime, CelebA_gender, and CelebA_glasses.
We trained the models to provide benchmarks for the algorithm we detailed in the paper "UVCGAN: UNet Vision Transformer Cycle-consistent GAN for Unpaired Image-to-Image Translation."
We only trained a model when a pretrained model was provided by the benchmarked algorithm.
This data set contains the model outputs of different hydrology models calibrated using the same forcing data (Maurer) and the same calibration period for the CAMELS data set. The models are SAC-SMA, VIC, HBV, FUSE, and mHM. All of these models were calibrated for each basin separately; for VIC and mHM, regionally calibrated model outputs are also available. All models were calibrated over the period 1 October 1999 to 30 September 2008 and validated over the period 1 October 1989 to 30 September 1999.
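As a small, hedged illustration of how these outputs line up with the stated periods, the sketch below slices a daily simulated-streamflow series into the calibration and validation windows. The file name and column names are hypothetical placeholders, not part of this dataset's actual layout.

```python
# Minimal sketch: split a daily simulated-discharge series into the calibration
# and validation windows stated above. File and column names ("date", "qsim")
# are hypothetical and must be adapted to the actual files in this dataset.
import pandas as pd

df = pd.read_csv("basin_model_output.csv", parse_dates=["date"]).set_index("date")

calibration = df.loc["1999-10-01":"2008-09-30", "qsim"]  # calibration period
validation = df.loc["1989-10-01":"1999-09-30", "qsim"]   # validation period

print(len(calibration), "calibration days,", len(validation), "validation days")
```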
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This entry accompanies the paper "A spatial model for situated multiagent systems that optimizes neighborhood search", in which we present a new model for implementing a spatially explicit environment that supports constant-time sensory (neighborhood search) and locomotion functions for situated multiagent systems.
Three linear time-varying system benchmarks implemented in MATLAB: (1) a time-varying version of the Oberwolfach steel cooling benchmark; (2) a one-dimensional heat equation with a moving point heat source; (3) a linearized Burgers equation. Model 1 comes in the same five resolutions as the original time-invariant version; the other two are freely scalable.
Reference: N. Lang, J. Saak, and T. Stykel, Balanced truncation model reduction for linear time-varying systems, Math. Comput. Model. Dyn. Syst., 22 (2016), pp. 267–281. doi:10.1080/13873954.2016.1198386.
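For readers unfamiliar with how such benchmarks are posed, the sketch below sets up a small analogue of model 2, the one-dimensional heat equation with a moving point heat source, as a linear time-varying system x'(t) = A x(t) + b(t) u(t). It is not the reference MATLAB implementation; the grid size, diffusivity, source trajectory, and time stepping are all illustrative assumptions.

```python
# Minimal sketch (not the reference MATLAB code): finite-difference
# semi-discretization of a 1D heat equation with a moving point heat source,
# written as a linear time-varying system x'(t) = A x(t) + b(t) u(t).
import numpy as np

n = 100                       # interior grid points (assumed)
h = 1.0 / (n + 1)             # grid spacing on the unit interval
alpha = 1.0                   # thermal diffusivity (assumed)

# Time-invariant diffusion operator with Dirichlet boundaries.
A = (alpha / h**2) * (np.diag(-2.0 * np.ones(n))
                      + np.diag(np.ones(n - 1), 1)
                      + np.diag(np.ones(n - 1), -1))

def b(t, period=1.0):
    """Time-varying input vector: unit load at the node nearest the assumed
    source position s(t), which sweeps back and forth over the interval."""
    s = 0.5 + 0.4 * np.sin(2.0 * np.pi * t / period)
    vec = np.zeros(n)
    vec[int(round(s / h)) - 1] = 1.0 / h   # discrete point source
    return vec

# Explicit Euler time stepping, purely to show the LTV structure.
dt, x = 1e-5, np.zeros(n)
for k in range(1000):
    x = x + dt * (A @ x + b(k * dt) * 1.0)  # u(t) = 1: constant heating power
```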
Comparison of models by the average of coding benchmarks in the Artificial Analysis Intelligence Index (LiveCodeBench & SciCode).
RewardBench is a benchmark designed to evaluate the capabilities and safety of reward models, including those trained with Direct Preference Optimization (DPO). It serves as the first evaluation tool for reward models and provides valuable insights into their performance and reliability¹.
Here are the key components of RewardBench:
Common Inference Code: The repository includes common inference code for various reward models, such as Starling, PairRM, OpenAssistant, and more. These models can be evaluated using the provided tools¹.
Dataset and Evaluation: The RewardBench dataset consists of prompt-win-lose trios spanning chat, reasoning, and safety scenarios. It allows benchmarking reward models on challenging, structured, and out-of-distribution queries. The goal is to enhance scientific understanding of reward models and their behavior². A minimal sketch of this chosen-versus-rejected scoring appears after the installation notes below.
Scripts for Evaluation:
scripts/run_rm.py: used to evaluate individual reward models.
scripts/run_dpo.py: used to evaluate direct preference optimization (DPO) models.
scripts/train_rm.py: a basic reward model training script built on TRL (Transformer Reinforcement Learning)¹.
Installation and Usage:
Install PyTorch on your system.
Install the required dependencies (pip install -e .).
Set the environment variable HF_TOKEN with your token.
To contribute your model to the leaderboard, open an issue on Hugging Face with the model name.
For local model evaluation, follow the instructions in the repository¹.
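As a concrete illustration of the prompt-win-lose evaluation described under "Dataset and Evaluation" above, the following minimal sketch scores a reward model on such trios: a trio counts as correct when the model assigns the chosen completion a higher reward than the rejected one. This is not the repository's run_rm.py pipeline; the reward model named below is only an example (one of the OpenAssistant models mentioned earlier), and the single trio is hypothetical.

```python
# Minimal sketch: score "chosen" vs "rejected" completions with a reward model
# and report accuracy. Not RewardBench's official code; model choice and the
# example trio are assumptions for illustration.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "OpenAssistant/reward-model-deberta-v3-large-v2"  # example RM (assumed)
tok = AutoTokenizer.from_pretrained(model_name)
rm = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

def score(prompt: str, completion: str) -> float:
    """Scalar reward for a prompt/completion pair."""
    inputs = tok(prompt, completion, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return rm(**inputs).logits[0, 0].item()

trios = [  # hypothetical prompt-win-lose trios
    {"prompt": "What is 2 + 2?", "chosen": "2 + 2 = 4.", "rejected": "2 + 2 = 5."},
]
accuracy = sum(score(t["prompt"], t["chosen"]) > score(t["prompt"], t["rejected"])
               for t in trios) / len(trios)
print(f"accuracy: {accuracy:.2f}")
```

For actual evaluations, use scripts/run_rm.py or scripts/run_dpo.py as listed above; RewardBench aggregates accuracies of this kind per category into its leaderboard scores.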
Remember that RewardBench provides a standardized way to assess reward models, ensuring transparency and comparability across different approaches.
(1) GitHub - allenai/reward-bench: RewardBench: the first evaluation tool .... https://github.com/allenai/reward-bench. (2) RewardBench: Evaluating Reward Models for Language Modeling. https://arxiv.org/abs/2403.13787. (3) RewardBench: Evaluating Reward Models for Language Modeling. https://paperswithcode.com/paper/rewardbench-evaluating-reward-models-for.
In a 2025 performance comparison on Chinese-language benchmarks, DeepSeek's AI model DeepSeek-R1 outperformed all other representative models except DeepSeek's own V3 model. The DeepSeek models performed best on the mathematics and Chinese-language benchmarks and weakest on coding.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
Collection and documentation of the benchmark Petri net models used for the evaluation of the B-I-Sat algorithm.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
A building performance simulation benchmark model for nearly zero-energy dwellings in Brussels. The study reports an inventory and field survey conducted on a terraced house renovated after 2010. An analysis of energy consumption (electricity and natural gas) and a walkthrough survey were conducted. A building performance simulation model was created in EnergyPlus to benchmark the average energy consumption and building characteristics. The estimate's validity was further checked against public statistics and verified through model calibration and utility-bill comparison. The benchmark has an average energy use intensity of 29 kWh/m2/year and represents terraced single-family houses after renovation.
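For clarity on the headline figure, energy use intensity is the annual delivered energy (here electricity plus natural gas) divided by the floor area. The numbers in the short sketch below are hypothetical and chosen only to reproduce the 29 kWh/m2/year value; they are not the surveyed dwelling's actual consumption.

```python
# Hypothetical worked example of an energy use intensity (EUI) calculation.
annual_electricity_kwh = 2900.0   # assumed metered electricity per year
annual_gas_kwh = 1450.0           # assumed natural gas per year, in kWh
heated_floor_area_m2 = 150.0      # assumed heated floor area

eui = (annual_electricity_kwh + annual_gas_kwh) / heated_floor_area_m2
print(f"EUI = {eui:.1f} kWh/m2/year")   # 29.0 for these assumed inputs
```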
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
ether0-benchmark
QA benchmark (test set) for the ether0 reasoning language model: https://huggingface.co/futurehouse/ether0. This benchmark is built from commonly used tasks, such as reaction prediction in USPTO/ORD, molecular captioning from PubChem, and predicting GHS classification. It is distinct from other benchmarks in that every answer is a molecule. It is balanced so that each task has about 25 questions, a reasonable amount for frontier model evaluations. The tasks generally follow… See the full description on the dataset page: https://huggingface.co/datasets/futurehouse/ether0-benchmark.
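A minimal loading sketch, assuming the standard Hugging Face datasets workflow; the split name "test" follows from the description above but should be verified on the dataset page.

```python
# Load the ether0 benchmark with the `datasets` library (split name assumed).
from datasets import load_dataset

ds = load_dataset("futurehouse/ether0-benchmark", split="test")
print(ds)      # column names and number of rows
print(ds[0])   # one record; per the description, every answer is a molecule
```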
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
Arabic Broad Benchmark (ABB)
The Arabic Broad Benchmark is a unique dataset and advanced benchmark created by SILMA.AI to assess the performance of Large Language Models in Arabic. ABB consists of 470 high-quality, human-validated questions sampled from 64 Arabic benchmarking datasets, evaluating 22 categories and skills. The advanced benchmarking script uses the dataset to evaluate models or APIs with a mix of 20+ manual rules and LLM-as-judge variations customized… See the full description on the dataset page: https://huggingface.co/datasets/silma-ai/arabic-broad-benchmark.
Large Multimodal Models (LMMs) such as GPT-4V and LLaVA have shown remarkable capabilities in visual reasoning with common image styles. However, their robustness against diverse style shifts, crucial for practical applications, remains largely unexplored. In this paper, we propose a new benchmark, BenchLMM, to assess the robustness of LMMs against three different styles: artistic image style, imaging sensor style, and application style, where each style has five sub-styles. Using BenchLMM, we comprehensively evaluate state-of-the-art LMMs and reveal that: 1) LMMs generally suffer performance degradation when working with other styles; 2) an LMM performing better than another model on the common style does not guarantee its superior performance on other styles; 3) LMMs' reasoning capability can be enhanced by prompting LMMs to predict the style first, based on which we propose a versatile and training-free method for improving LMMs; and 4) an intelligent LMM is expected to interpret the causes of its errors when facing stylistic variations. We hope that our benchmark and analysis can shed new light on developing more intelligent and versatile LMMs.
GNU General Public License v3.0: https://www.gnu.org/licenses/gpl-3.0.html
This document describes the different benchmark codes to reproduce the figures from the Solid Earth article: Root, B., Sebera, J., Szwillus, W., Thieulot, C., Martinec, Z., and Fullea, J., Benchmark forward gravity schemes: the gravity field of a realistic lithosphere model WINTERC-G, Solid Earth Discussions, 2021, 1–36, doi:10.5194/se-2021-145.
Three different benchmarks are discussed: shell test 2 (equal thickness, laterally varying density); shell test 3 (laterally varying density interface, CRUST1.0 Moho); and the WINTERC-grav benchmark full layered model. The MATLAB codes and data files are presented in the database.
This benchmark aims to provide tools to evaluate 3D interest point detection algorithms with respect to human-generated ground truth. Using a web-based subjective experiment, human subjects marked 3D interest points on a set of 3D models. The models are organized in two datasets: Dataset A and Dataset B. Dataset A consists of 24 models which were hand-marked by 23 human subjects. Dataset B is larger, with 43 models, and it contains all the models in Dataset A; 16 human subjects marked all the models in this larger set. Some of the models are standard models that are widely used in 3D shape research, and they have been used as test objects by researchers working on the best-view problem. We have compared six 3D interest point detection algorithms. The interest points detected on the 3D models of the dataset can be downloaded from the link below; please refer to the README for details on the download.
Mesh saliency [Lee et al. 2005]: interest points by mesh saliency
Salient points [Castellani et al. 2008]: interest points by salient points
3D-Harris [Sipiran and Bustos, 2010]: interest points by 3D-Harris
3D-SIFT [Godil and Wagan, 2011]: interest points by 3D-SIFT (note that some models in the dataset are not watertight, hence their volumetric representations could not be generated; the 3D-SIFT algorithm therefore could not detect interest points for those models)
Scale-dependent corners [Novatnack and Nishino, 2007]: interest points by SD corners
HKS-based interest points [Sun et al. 2009]: interest points by the HKS method
Please cite the paper: Helin Dutagaci, Chun Pan Cheung, Afzal Godil, "Evaluation of 3D interest point detection techniques via human-generated ground truth", The Visual Computer, 2012.
References:
[Lee et al. 2005] Lee, C.H., Varshney, A., Jacobs, D.W.: Mesh saliency. In: ACM SIGGRAPH 2005, pp. 659–666 (2005)
[Castellani et al. 2008] Castellani, U., Cristani, M., Fantoni, S., Murino, V.: Sparse points matching by combining 3D mesh saliency with statistical descriptors. Comput. Graph. Forum 27(2), 643–652 (2008)
[Sipiran and Bustos, 2010] Sipiran, I., Bustos, B.: A robust 3D interest points detector based on Harris operator. In: Eurographics 2010 Workshop on 3D Object Retrieval (3DOR'10), pp. 7–14 (2010)
[Godil and Wagan, 2011] Godil, A., Wagan, A.I.: Salient local 3D features for 3D shape retrieval. In: 3D Image Processing (3DIP) and Applications II, SPIE (2011)
[Novatnack and Nishino, 2007] Novatnack, J., Nishino, K.: Scale-dependent 3D geometric features. In: ICCV, pp. 1–8 (2007)
[Sun et al. 2009] Sun, J., Ovsjanikov, M., Guibas, L.: A concise and provably informative multi-scale signature based on heat diffusion. In: Eurographics Symposium on Geometry Processing (SGP), pp. 1383–1392 (2009)
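To make the evaluation setting concrete, the sketch below shows one simple way of comparing algorithm-detected interest points against human-marked ground truth: greedily matching detections to ground-truth points within a distance tolerance and counting false negatives and false positives. This is an illustrative stand-in, not the benchmark's official error measures; the point arrays and the tolerance are assumptions.

```python
# Illustrative sketch (not the benchmark's official evaluation code): match
# detected interest points to ground-truth points within a tolerance and
# report false-negative / false-positive counts.
import numpy as np

def match_points(detected, ground_truth, tol):
    """Greedy one-to-one matching of detected points to ground-truth points
    lying within Euclidean distance `tol`."""
    unmatched_gt = list(range(len(ground_truth)))
    matches = 0
    for p in detected:
        if not unmatched_gt:
            break
        d = np.linalg.norm(ground_truth[unmatched_gt] - p, axis=1)
        j = int(np.argmin(d))
        if d[j] <= tol:
            matches += 1
            unmatched_gt.pop(j)
    false_negatives = len(ground_truth) - matches
    false_positives = len(detected) - matches
    return matches, false_negatives, false_positives

detected = np.random.rand(50, 3)       # hypothetical detector output (x, y, z)
ground_truth = np.random.rand(30, 3)   # hypothetical human-marked points
print(match_points(detected, ground_truth, tol=0.05))
```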
Comparison of models by the average of math benchmarks in the Artificial Analysis Intelligence Index (AIME 2024 & Math-500).
In a benchmark comparison, DeepSeek's Janus-Pro-7B model outperforms similar models on the GenEval benchmark and scores comparable results on DPG-Bench. The company developed a large language model and a text-to-image model that achieve results similar to those of industry leaders such as DALL-E and Stable Diffusion.