Nemotron-3-8B-Base-4k Model Overview
License
The use of this model is governed by the NVIDIA AI Foundation Models Community License Agreement.
Description
Nemotron-3-8B-Base-4k is a large language foundation model for enterprises to build custom LLMs. This foundation model has 8 billion parameters and supports a context length of 4,096 tokens. Nemotron-3-8B-Base-4k is part of Nemotron-3, a family of enterprise-ready generative text models compatible with the NVIDIA NeMo Framework. For other models in this collection, see the collections page.
NVIDIA NeMo is an end-to-end, cloud-native platform to build, customize, and deploy generative AI models anywhere. It includes training and inferencing frameworks, guardrailing toolkits, data curation tools, and pretrained models, offering enterprises an easy, cost-effective, and fast way to adopt generative AI. To get access to NeMo Framework, please sign up at this link.
References
Announcement Blog
Model Architecture
Architecture Type: Transformer
Network Architecture: Generative Pre-Trained Transformer (GPT-3)
Software Integration
Runtime Engine(s): NVIDIA AI Enterprise
Toolkit: NeMo Framework
To get access to NeMo Framework, please sign up at this link. See the NeMo inference container documentation for details on how to set up and deploy an inference server with NeMo.
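For context, the snippet below sketches how an inference server might be brought up with the NeMo export and deploy modules before running the sample query. It is only a sketch: class names follow the NeMo Framework inference documentation, but exact argument names vary between NeMo releases, and the checkpoint paths and export settings are placeholders.
from nemo.export import TensorRTLLM
from nemo.deploy import DeployPyTriton

# Export the .nemo checkpoint to a TensorRT-LLM engine (paths and model_type are assumptions).
exporter = TensorRTLLM(model_dir="/opt/checkpoints/nemotron-3-8b-trt")
exporter.export(nemo_checkpoint_path="/opt/checkpoints/Nemotron-3-8B-Base-4k.nemo", model_type="gptnext")

# Serve the exported model on a Triton endpoint that NemoQuery can reach at localhost:8000.
nm = DeployPyTriton(model=exporter, triton_model_name="Nemotron-3-8B-4K", port=8000)
nm.deploy()
nm.serve()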
Sample Inference Code:
from nemo.deploy import NemoQuery
nq = NemoQuery(url="localhost:8000", model_name="Nemotron-3-8B-4K")
output = nq.query_llm(prompts=["The meaning of life is"], max_output_token=200, top_k=1, top_p=0.0, temperature=0.1)
print(output)
Supported Hardware:
H100
A100 80GB, A100 40GB
Model Version(s)
Nemotron-3-8B-base-4k-BF16-1
Dataset & Training
The model uses a learning rate of 3e-4 with a warm-up period of 500M tokens and a cosine learning rate annealing schedule for 95% of the total training tokens. The decay stops at a minimum learning rate of 3e-5. The model is trained with a sequence length of 4096 and uses FlashAttention’s Multi-Head Attention implementation. 1,024 A100s were used for 19 days to train the model.
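As an illustration of the schedule described above, here is a minimal Python sketch. The constants come from this card; the linear warm-up shape and the exact decay boundary are assumptions.
import math

PEAK_LR, MIN_LR = 3e-4, 3e-5
WARMUP_TOKENS = 500e6            # warm-up period of 500M tokens
TOTAL_TOKENS = 3.8e12            # total training tokens reported for this model
DECAY_TOKENS = 0.95 * TOTAL_TOKENS  # cosine annealing covers 95% of the training tokens

def learning_rate(tokens_seen: float) -> float:
    if tokens_seen < WARMUP_TOKENS:
        # Assumed linear warm-up from 0 to the peak learning rate.
        return PEAK_LR * tokens_seen / WARMUP_TOKENS
    if tokens_seen >= DECAY_TOKENS:
        # Decay stops at the minimum learning rate.
        return MIN_LR
    # Cosine annealing from the peak down to the minimum learning rate.
    progress = (tokens_seen - WARMUP_TOKENS) / (DECAY_TOKENS - WARMUP_TOKENS)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))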
NVIDIA models are trained on a diverse set of public and proprietary datasets. This model was trained on a dataset containing 3.8 trillion tokens of text. The dataset contains 53 different human languages (including English, German, Russian, Spanish, French, Japanese, Chinese, Italian, and Dutch) and 37 programming languages. The model also uses the training subsets of downstream academic benchmarks from sources like FLANv2, P3, and NaturalInstructions v2. NVIDIA is committed to the responsible development of large language models and conducts reviews of all datasets included in training.
Evaluation
Task            Num-shot            Score
MMLU*           5                   54.4
WinoGrande      0                   70.9
Hellaswag       0                   76.4
ARC Easy        0                   72.9
TyDiQA-GoldP**  1                   49.2
Lambada         0                   70.6
WebQS           0                   22.9
PiQA            0                   80.4
GSM8K           8-shot w/ maj@8     39.4
** The languages used are Arabic, Bangla, Finnish, Indonesian, Korean, Russian and Swahili.
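The GSM8K row above is scored with maj@8, i.e., majority voting over eight sampled solutions per problem. As a rough illustration of the scoring rule only (not NVIDIA's evaluation harness, and the answers below are hypothetical):
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    # Take the most common extracted final answer across the eight samples.
    return Counter(answers).most_common(1)[0][0]

sampled_answers = ["72", "72", "68", "72", "72", "75", "72", "68"]  # hypothetical extractions
assert majority_vote(sampled_answers) == "72"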
Intended use
This is a completion model. For best performance, users are encouraged to customize it with the NeMo Framework suite of customization tools, including Parameter-Efficient Fine-Tuning (p-tuning, adapters, LoRA) and SFT/RLHF. For chat use cases, please consider the Nemotron-3-8B chat variants.
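As a conceptual illustration of one of the PEFT techniques listed above, the sketch below shows the core LoRA idea: a trainable low-rank update added to a frozen weight matrix. This is generic PyTorch, not the NeMo Framework API, and the rank and scaling values are arbitrary.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights stay frozen
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)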
Ethical use
Technology can have a profound impact on people and the world, and NVIDIA is committed to enabling trust and transparency in AI development. NVIDIA encourages users to adopt principles of AI ethics and trustworthiness to guide their business decisions by following the guidelines in the NVIDIA AI Foundation Models Community License Agreement.
Limitations
The model was trained on data that contains toxic language and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses, especially when given toxic prompts.
The model may generate answers that are inaccurate, omit key information, or include irrelevant or redundant text, and it may produce socially unacceptable or undesirable text even if the prompt itself does not include anything explicitly offensive.
This deep learning model is used to transform incorrect and non-standard addresses into standardized addresses. Address standardization is the process of formatting and correcting addresses in accordance with global standards. A standardized address includes all the required address elements (i.e., street number, apartment number, street name, city, state, and postal code) in the format used by the postal service.
An address can be considered non-standard because of incomplete details (a missing street name or ZIP code), invalid information (an incorrect address), incorrect information (typos, misspellings, badly formatted abbreviations), or inaccurate information (a wrong house number or street name). These errors make it difficult to locate a destination. A standardized address is not guaranteed to be valid; standardization simply converts an address into the correct format. This deep learning model is trained on the address dataset provided by openaddresses.io and can be used to standardize addresses from 10 different countries.
Using the model
Follow the guide to use the model. Before using this model, ensure that the supported deep learning libraries are installed. For more details, check Deep Learning Libraries Installer for ArcGIS.
Fine-tuning the model
This model can be fine-tuned using the Train Deep Learning Model tool. Follow the guide to fine-tune this model.
Input
Text (non-standard address) on which address standardization will be performed.
Output
Text (standard address)
Supported countries
This model supports addresses from the following countries:
AT – Austria
AU – Australia
CA – Canada
CH – Switzerland
DK – Denmark
ES – Spain
FR – France
LU – Luxembourg
SI – Slovenia
US – United States
Model architecture
This model uses the T5-base architecture implemented in Hugging Face Transformers.
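To illustrate the sequence-to-sequence interface such a model exposes, here is a minimal sketch with Hugging Face Transformers. The checkpoint name below is the generic t5-base, not the fine-tuned ArcGIS model (which is distributed as a deep learning package rather than on the Hugging Face Hub), so it only demonstrates the interface, not actual address standardization.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# A fine-tuned checkpoint would map a non-standard address string to its standardized form.
inputs = tokenizer("123 main strt apt 4 bostn ma", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))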
Accuracy metrics
This model has an accuracy of 90.18 percent.
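The card does not state how this accuracy is computed. A common choice for a sequence-to-sequence standardizer, sketched here as an assumption, is sequence-level exact match between the predicted and reference addresses.
def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    # Compare case- and whitespace-normalized full address strings.
    matches = sum(
        " ".join(p.lower().split()) == " ".join(r.lower().split())
        for p, r in zip(predictions, references)
    )
    return matches / len(references)

print(exact_match_accuracy(["123 Main St Apt 4, Boston, MA 02118"],
                           ["123 Main St Apt 4, Boston, MA 02118"]))  # 1.0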
Training data
The model has been trained on openly licensed data from openaddresses.io.
Sample results
Here are a few results from the model.
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Dataset Card for Docmatix
Dataset description
Docmatix is part of the Idefics3 release (stay tuned). It is a massive dataset for Document Visual Question Answering that was used for the fine-tuning of the vision-language model Idefics3.
Load the dataset
To load the dataset, install the datasets library with pip install datasets. Then:
from datasets import load_dataset
ds = load_dataset("HuggingFaceM4/Docmatix")
If you want the dataset to link to the pdf files… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceM4/Docmatix.
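Because Docmatix is large, it can help to stream it rather than download everything up front. This uses the standard streaming mode of the datasets library; the split name is an assumption, and the fields are whatever the dataset card documents.
from datasets import load_dataset

# Stream examples instead of downloading the full dataset.
ds = load_dataset("HuggingFaceM4/Docmatix", streaming=True, split="train")
sample = next(iter(ds))
print(sample.keys())  # inspect the available image and QA fields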
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Gretel Synthetic Domain-Specific Documents Dataset (English)
This dataset is a synthetically generated collection of documents enriched with Personally Identifiable Information (PII) and Protected Health Information (PHI) entities spanning multiple domains. Created using Gretel Navigator with mistral-nemo-2407 as the backend model, it is specifically designed for fine-tuning GLiNER models. The dataset contains document passages featuring PII/PHI entities from a wide range of… See the full description on the dataset page: https://huggingface.co/datasets/gretelai/gretel-pii-masking-en-v1.
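A minimal loading sketch with the datasets library; the split name is an assumption, and the column names should be inspected to see which document and entity-annotation fields the release provides.
from datasets import load_dataset

ds = load_dataset("gretelai/gretel-pii-masking-en-v1", split="train")
print(ds.column_names)  # inspect the available PII/PHI annotation fields
print(ds[0])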
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
📦 Dhivehi Synthetic Document Layout + Textline Dataset
This dataset contains synthetically generated image-document pairs with detailed layout annotations and ground-truth Dhivehi text extractions. It is designed for document layout analysis, visual document understanding, OCR fine-tuning, and related tasks specifically for Dhivehi script. Note: the images in this version are compressed; a raw version is also available. 📁 Repository: Hugging Face Datasets
📋 Dataset Summary
Total Examples: ~58… See the full description on the dataset page: https://huggingface.co/datasets/alakxender/od-syn-page-annotations-com.