5 datasets found
  1. nemotron-3-8b-base-4k

    • kaggle.com
    zip
    Updated Aug 31, 2024
    Cite
    Serhii Kharchuk (2024). nemotron-3-8b-base-4k [Dataset]. https://www.kaggle.com/datasets/serhiikharchuk/nemotron-3-8b-base-4k
    Explore at:
    Available download formats: zip (13688476176 bytes)
    Dataset updated
    Aug 31, 2024
    Authors
    Serhii Kharchuk
    Description

    Nemotron-3-8B-Base-4k Model Overview

    License

    The use of this model is governed by the NVIDIA AI Foundation Models Community License Agreement.

    Description

    Nemotron-3-8B-Base-4k is a large language foundation model for enterprises to build custom LLMs. This foundation model has 8 billion parameters, and supports a context length of 4,096 tokens. Nemotron-3-8B-Base-4k is part of Nemotron-3, which is a family of enterprise ready generative text models compatible with NVIDIA NeMo Framework. For other models in this collection, see the collections page.

    NVIDIA NeMo is an end-to-end, cloud-native platform to build, customize, and deploy generative AI models anywhere. It includes training and inferencing frameworks, guardrailing toolkits, data curation tools, and pretrained models, offering enterprises an easy, cost-effective, and fast way to adopt generative AI. To get access to NeMo Framework, please sign up at this link.

    References

    Announcement Blog

    Model Architecture

    Architecture Type: Transformer

    Network Architecture: Generative Pre-Trained Transformer (GPT-3)

    Software Integration

    Runtime Engine(s): NVIDIA AI Enterprise

    Toolkit: NeMo Framework

    To get access to NeMo Framework, please sign up at this link. See the NeMo inference container documentation for details on how to set up and deploy an inference server with NeMo.

    Sample Inference Code:

    from nemo.deploy import NemoQuery

    # In this case, we run inference on the same machine
    nq = NemoQuery(url="localhost:8000", model_name="Nemotron-3-8B-4K")

    output = nq.query_llm(prompts=["The meaning of life is"], max_output_token=200, top_k=1, top_p=0.0, temperature=0.1)
    print(output)

    Supported Hardware:

    H100
    A100 80GB, A100 40GB
    

    Model Version(s)

    Nemotron-3-8B-base-4k-BF16-1

    Dataset & Training

    The model uses a learning rate of 3e-4 with a warm-up period of 500M tokens and a cosine learning rate annealing schedule for 95% of the total training tokens. The decay stops at a minimum learning rate of 3e-5. The model is trained with a sequence length of 4096 and uses FlashAttention’s Multi-Head Attention implementation. 1,024 A100s were used for 19 days to train the model.
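    As a rough illustration of the schedule described above, the sketch below expresses the learning rate as a function of tokens seen, using the stated numbers (3e-4 peak, 500M-token warm-up, cosine decay over 95% of training tokens, 3e-5 floor). It is a reconstruction for illustration only, not NVIDIA's training code, and how the final 5% of tokens is handled is an assumption.

    import math

    def nemotron_lr(tokens_seen: float,
                    total_tokens: float = 3.8e12,
                    peak_lr: float = 3e-4,
                    min_lr: float = 3e-5,
                    warmup_tokens: float = 500e6,
                    decay_fraction: float = 0.95) -> float:
        """Illustrative schedule: linear warm-up, then cosine decay to min_lr.

        Constants come from the description above; the behaviour after the
        cosine decay ends (last 5% of tokens held at min_lr) is an assumption.
        """
        decay_end = decay_fraction * total_tokens
        if tokens_seen < warmup_tokens:
            return peak_lr * tokens_seen / warmup_tokens
        if tokens_seen >= decay_end:
            return min_lr
        progress = (tokens_seen - warmup_tokens) / (decay_end - warmup_tokens)
        return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))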

    NVIDIA models are trained on a diverse set of public and proprietary datasets. This model was trained on a dataset containing 3.8 trillion tokens of text. The dataset contains 53 different human languages (including English, German, Russian, Spanish, French, Japanese, Chinese, Italian, and Dutch) and 37 programming languages. The model also uses the training subsets of downstream academic benchmarks from sources like FLANv2, P3, and NaturalInstructions v2. NVIDIA is committed to the responsible development of large language models and conducts reviews of all datasets included in training.

    Evaluation

    Task              Num-shot           Score
    MMLU*             5                  54.4
    WinoGrande        0                  70.9
    Hellaswag         0                  76.4
    ARC Easy          0                  72.9
    TyDiQA-GoldP**    1                  49.2
    Lambada           0                  70.6
    WebQS             0                  22.9
    PiQA              0                  80.4
    GSM8K             8-shot w/ maj@8    39.4

    * The calculation of MMLU follows the original implementation. See Hugging Face’s explanation of different implementations of MMLU.

    ** The languages used are Arabic, Bangla, Finnish, Indonesian, Korean, Russian and Swahili.

    Intended use

    This is a completion model. For best performance, users are encouraged to customize it using the NeMo Framework suite of customization tools, including Parameter-Efficient Fine-Tuning (P-tuning, Adapters, LoRA) and SFT/RLHF. For chat use cases, please consider using the Nemotron-3-8B chat variants.
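    To make the LoRA option mentioned above concrete, here is a minimal, framework-agnostic sketch of a low-rank adapter wrapped around a frozen linear layer. It illustrates the general technique only; it is not the NeMo Framework customization API, and all names in it are illustrative.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Minimal LoRA sketch: y = W x + (alpha / r) * B A x, with W frozen."""

        def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad_(False)          # freeze the pretrained projection
            self.lora_a = nn.Linear(base.in_features, r, bias=False)
            self.lora_b = nn.Linear(r, base.out_features, bias=False)
            nn.init.zeros_(self.lora_b.weight)   # adapter starts as a no-op
            self.scaling = alpha / r

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

    # Only the small A/B matrices are trained; the base weights stay frozen.
    layer = LoRALinear(nn.Linear(4096, 4096))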

    Ethical use

    Technology can have a profound impact on people and the world, and NVIDIA is committed to enabling trust and transparency in AI development. NVIDIA encourages users to adopt principles of AI ethics and trustworthiness to guide their business decisions by following the guidelines in the NVIDIA AI Foundation Models Community License Agreement.

    Limitations

    The model was trained on data that contains toxic language and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses, especially when prompted with toxic prompts.
    The model may generate answers that are inaccurate, omit key information, or include irrelevant or redundant text, and it may produce socially unacceptable or undesirable text even if the prompt itself does not include anything explicitly offensive.
    
  2. Address Standardization

    • sdiinnovation-geoplatform.hub.arcgis.com
    • arc-gis-hub-home-arcgishub.hub.arcgis.com
    Updated Jul 26, 2022
    Cite
    Esri (2022). Address Standardization [Dataset]. https://sdiinnovation-geoplatform.hub.arcgis.com/content/6c8e054fbdde4564b3b416eacaed3539
    Explore at:
    Dataset updated
    Jul 26, 2022
    Dataset authored and provided by
    Esri (http://esri.com/)
    Description

    This deep learning model is used to transform incorrect and non-standard addresses into standardized addresses. Address standardization is the process of formatting and correcting addresses in accordance with global standards. A standardized address includes all the required address elements (i.e., street number, apartment number, street name, city, state, and postal code) and is used by the standard postal service.

      An address can be termed non-standard because of incomplete details (a missing street name or zip code), invalid information (an incorrect address), incorrect information (typos, misspellings, formatting of abbreviations), or inaccurate information (a wrong house number or street name). These errors make it difficult to locate a destination. A standardized address does not guarantee that the address is valid; standardization simply converts the address into the correct format. This deep learning model is trained on an address dataset provided by openaddresses.io and can be used to standardize addresses from 10 different countries.
    
    
    
      Using the model
    
    
          Follow the guide to use the model. Before using this model, ensure that the supported deep learning libraries are installed. For more details, check Deep Learning Libraries Installer for ArcGIS.
    
    
    
        Fine-tuning the model
        This model can be fine-tuned using the Train Deep Learning Model tool. Follow the guide to fine-tune this model.

        Input
        Text (non-standard address) on which address standardization will be performed.
    
        Output
        Text (standard address)
    
        Supported countries
        This model supports addresses from the following countries:
    
          AT – Austria
          AU – Australia
          CA – Canada
          CH – Switzerland
          DK – Denmark
          ES – Spain
          FR – France
          LU – Luxembourg
          SI – Slovenia
          US – United States
    
        Model architecture
        This model uses the T5-base architecture implemented in Hugging Face Transformers.
        Accuracy metrics
        This model has an accuracy of 90.18 percent.
    
        Training data
        The model has been trained on openly licensed data from openaddresses.io.

        Sample results
        Here are a few results from the model.
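    Since the card states the model uses the T5-base architecture implemented in Hugging Face Transformers, the sketch below shows how a generic T5 sequence-to-sequence standardizer would be invoked with that library. The checkpoint path and example address are placeholders, not Esri's published interface; the supported workflow remains the ArcGIS guides referenced above.

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    # Placeholder checkpoint: substitute the actual fine-tuned T5-base weights.
    checkpoint = "path/to/address-standardization-t5-base"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

    # Non-standard address in, standardized address out.
    inputs = tokenizer("380 newyork str redlnds calif 92373", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))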
    
  3. Docmatix

    • huggingface.co
    Updated Jul 19, 2024
    Cite
    HuggingFaceM4 (2024). Docmatix [Dataset]. https://huggingface.co/datasets/HuggingFaceM4/Docmatix
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 19, 2024
    Dataset authored and provided by
    HuggingFaceM4
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Card for Docmatix

      Dataset description
    

    Docmatix is part of the Idefics3 release (stay tuned). It is a massive dataset for Document Visual Question Answering that was used for the fine-tuning of the vision-language model Idefics3.

      Load the dataset
    

    To load the dataset, install the datasets library with pip install datasets. Then:

    from datasets import load_dataset

    ds = load_dataset("HuggingFaceM4/Docmatix")

    If you want the dataset to link to the pdf files… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceM4/Docmatix.
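    Because Docmatix is a massive dataset, one way to inspect a few examples without downloading everything is the datasets library's streaming mode, sketched below. This is a generic sketch: the split name and the assumption that the default configuration loads without being named are not stated above and should be checked against the dataset page.

    from datasets import load_dataset

    # Stream instead of downloading the full dataset to disk.
    ds = load_dataset("HuggingFaceM4/Docmatix", split="train", streaming=True)

    for example in ds.take(3):
        # Column names are not listed above; print the keys to see the schema.
        print(sorted(example.keys()))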

  4. gretel-pii-masking-en-v1

    • huggingface.co
    Cite
    Gretel.ai, gretel-pii-masking-en-v1 [Dataset]. https://huggingface.co/datasets/gretelai/gretel-pii-masking-en-v1
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset provided by
    Gretel.ai
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Gretel Synthetic Domain-Specific Documents Dataset (English)

    This dataset is a synthetically generated collection of documents enriched with Personally Identifiable Information (PII) and Protected Health Information (PHI) entities spanning multiple domains. Created using Gretel Navigator with mistral-nemo-2407 as the backend model, it is specifically designed for fine-tuning GLiNER models. The dataset contains document passages featuring PII/PHI entities from a wide range of… See the full description on the dataset page: https://huggingface.co/datasets/gretelai/gretel-pii-masking-en-v1.

  5. od-syn-page-annotations-com

    • huggingface.co
    Updated Jun 10, 2025
    + more versions
    Cite
    rusputin (2025). od-syn-page-annotations-com [Dataset]. https://huggingface.co/datasets/alakxender/od-syn-page-annotations-com
    Explore at:
    Dataset updated
    Jun 10, 2025
    Authors
    rusputin
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    📦 Dhivehi Synthetic Document Layout + Textline Dataset

    This dataset contains synthetically generated image-document pairs with detailed layout annotations and ground-truth Dhivehi text extractions. It is designed for document layout analysis, visual document understanding, OCR fine-tuning, and related tasks specifically for the Dhivehi script. Note: images in this version are compressed; a raw version is also available. 📁 Repository: Hugging Face Datasets

      📋 Dataset Summary
    

    Total Examples: ~58… See the full description on the dataset page: https://huggingface.co/datasets/alakxender/od-syn-page-annotations-com.

