11 datasets found
  1. snorkel-curated-instruction-tuning

    • huggingface.co
    Updated Jul 24, 2023
    Cite
    Snorkel AI (2023). snorkel-curated-instruction-tuning [Dataset]. https://huggingface.co/datasets/snorkelai/snorkel-curated-instruction-tuning
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 24, 2023
    Dataset authored and provided by
    Snorkel AI
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Please check out our Blog Post - How we built a better GenAI with programmatic data development for more details!

      Summary
    

    snorkel-curated-instruction-tuning is a curated dataset that consists of high-quality instruction-response pairs. These pairs were programmatically filtered with weak supervision from open-source datasets Databricks Dolly-15k, Open Assistant, and Helpful Instructions. To enhance the dataset, we also programmatically classified each instruction based on the… See the full description on the dataset page: https://huggingface.co/datasets/snorkelai/snorkel-curated-instruction-tuning.
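    A minimal loading sketch (assuming the Hugging Face datasets library is installed; this is an illustration, not part of the dataset card itself):

    from datasets import load_dataset

    # Load the curated instruction-response pairs from the Hugging Face Hub.
    ds = load_dataset("snorkelai/snorkel-curated-instruction-tuning")
    print(ds)  # available splits, row counts, and column names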

  2. Alpaca GPT-4

    • kaggle.com
    • opendatalab.com
    • +1more
    Updated Nov 24, 2023
    Cite
    The Devastator (2023). Alpaca GPT-4 [Dataset]. https://www.kaggle.com/datasets/thedevastator/gpt-4-instruction-following-dataset/versions/2
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 24, 2023
    Dataset provided by
    Kaggle
    Authors
    The Devastator
    License

    CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    Alpaca GPT-4

    High-Performance NLP for Instruction-Following Reasoning

    By Huggingface Hub [source]

    About this dataset

    This dataset consists of 52K instruction-following data generated by GPT-4 in English using the same prompts as in Alpaca. This data has been crafted specifically to help researchers break ground and explore new strategies for natural language processing, with a special focus on instruction-following reasoning.

    What makes this dataset useful is the variety of ways it supports experimentation with instruction-following models: from refining specific components, such as predicting outputs or analyzing long textual conversations, to training and evaluating end-to-end approaches. It lets researchers iterate on experiments quickly, making it a valuable resource for anyone working to push the boundaries of instruction-following and logical reasoning techniques.


    How to use the dataset

    This dataset is an invaluable resource for researching artificial intelligence approaches to logical reasoning problems. This dataset consists of 52K instruction-following samples generated by GPT-4 in English using the same prompts as in Alpaca. Here are some tips on how to make the most out of this dataset:

    • The columns in this dataset provide the essential data for evaluating models on instruction-following tasks: instruction, input, output, and text. To use the data effectively, researchers should be familiar with each column and its purpose:
      a) The 'instruction' column provides a statement that an AI model must interpret in order to complete a task correctly;
      b) The 'input' column contains pre-generated data that helps an AI model make sense of the instruction;
      c) The 'output' column indicates the result that should be returned once the AI model has interpreted the instruction correctly; and
      d) The 'text' column is the full text generated by GPT-4, which gives deeper insight into how the output arises from the instruction and input.

      Note: researchers should pay attention to all four columns when working with this dataset, as the four components work together.

      For better results, consider fine-tuning existing models so they are better suited to instruction-following tasks, using these four columns as guidance. It would also be useful if the dataset shipped with corresponding hyperparameters, so users could fine-tune more quickly without losing accuracy or other relevant metrics.

      Additionally, readers should review the surrounding context closely and consider which model type best suits their use case before attempting any evaluation, since some models may give more accurate results but take longer to process, or vice versa (a short loading sketch follows these tips).
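    A minimal sketch of inspecting these four columns (assuming the train.csv file listed under "Columns" below has been downloaded locally and that pandas is installed):

    import pandas as pd

    # Load the instruction-following data (assumes train.csv from this
    # dataset has been downloaded to the working directory).
    df = pd.read_csv("train.csv")
    print(df.columns.tolist())

    # Peek at one instruction-response pair across the four columns.
    row = df.iloc[0]
    for col in ["instruction", "input", "output", "text"]:
        if col in df.columns:
            print(f"--- {col} ---")
            print(str(row[col])[:200])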

    Research Ideas

    • Training intelligent conversational agents with instruction-following reasoning capabilities.
    • Developing more complex and powerful instructions processing models driven by natural language understanding and reasoning algorithms.
    • Establishing an online platform that helps academic, business, or other organizations build auto-grading systems to evaluate instruction-following skills at scale and at relatively low cost.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv | Colu...

  3. stack-exchange-preferences

    • huggingface.co
    • opendatalab.com
    Cite
    Hugging Face H4, stack-exchange-preferences [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    Hugging Face H4
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for H4 Stack Exchange Preferences Dataset

      Dataset Summary
    

    This dataset contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training. Importantly, the questions have been filtered to fit the following criteria for preference models (following closely from Askell et al. 2021): have >=2 answers. This data could also be used for instruction fine-tuning and language model training. The questions are grouped with… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences.
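    A minimal sketch of inspecting the record structure (assuming the Hugging Face datasets library; streaming avoids downloading the full Stack Exchange dump up front):

    from datasets import load_dataset

    # Stream the preference data so the full dump is not downloaded up front.
    ds = load_dataset(
        "HuggingFaceH4/stack-exchange-preferences",
        split="train",
        streaming=True,
    )

    # Each question is grouped with its candidate answers for preference-model
    # training; inspect the fields of the first record.
    first = next(iter(ds))
    print(list(first.keys()))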

  4. AceMath-Instruct-Training-Data

    • huggingface.co
    Updated Aug 30, 2025
    Cite
    NVIDIA (2025). AceMath-Instruct-Training-Data [Dataset]. https://huggingface.co/datasets/nvidia/AceMath-Instruct-Training-Data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 30, 2025
    Dataset provided by
    Nvidia (http://nvidia.com/)
    Authors
    NVIDIA
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    website | paper

      AceMath-Instruct Training Data Card
    

    We release all the datasets to train AceMath-1.5B/7B/72B-Instruct models. These models are built upon the Qwen2.5-Math-Base models through a multi-stage supervised fine-tuning (SFT) process. The fine-tuning begins with general-purpose SFT data (general_sft_stage1.parquet and general_sft_stage2.parquet) and is followed by math-specific SFT data (math_sft.parquet). In our experiments, fine-tuning the Qwen2.5-Math-Base models using… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/AceMath-Instruct-Training-Data.
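    As a minimal sketch of loading one of the SFT stages named above (assuming the Hugging Face datasets library and that the parquet files sit at the top level of the repository, as the file names suggest):

    from datasets import load_dataset

    # Load only the math-specific SFT stage; the general-purpose stages
    # (general_sft_stage1.parquet, general_sft_stage2.parquet) can be loaded
    # the same way by swapping the file name.
    math_sft = load_dataset(
        "nvidia/AceMath-Instruct-Training-Data",
        data_files="math_sft.parquet",
        split="train",
    )
    print(math_sft)  # row count and column names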

  5. Code and data for "ChromoGen: Diffusion model predicts single-cell chromatin conformations"

    • zenodo.org
    application/gzip, bin
    Updated Dec 5, 2024
    Cite
    Greg Schuette; Zhuohan Lao; Bin Zhang (2024). Code and data for "ChromoGen: Diffusion model predicts single-cell chromatin conformations" [Dataset]. http://doi.org/10.5281/zenodo.14218666
    Explore at:
    Available download formats: application/gzip, bin
    Dataset updated
    Dec 5, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Greg Schuette; Zhuohan Lao; Bin Zhang
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Dec 4, 2024
    Description

    This dataset includes all code and data required to reproduce the results of:

    Greg Schuette, Zhuohan Lao, and Bin Zhang. ChromoGen: Diffusion model predicts single-cell chromatin conformations, 16 July 2024, PREPRINT (Version 1) available at Research Square [https://doi.org/10.21203/rs.3.rs-4630850/v1]

    File descriptions:

    1. chromogen_code.tar.gz contains all code and, as of its upload date, is identical to the corresponding GitHub repo. Note that:
      1. Some or all of the code inside chromogen_code.tar.gz/ChromoGen/recreate_results/train/EPCOT/, chromogen_code.tar.gz/ChromoGen/recreate_results/generate_data/EPCOT, and chromogen_code.tar.gz/ChromoGen/src/model/Embedder was adapted from that provided in the original EPCOT paper, Zhang et al. (2023).
      2. chromogen_code.tar.gz/ChromoGen/recreate_results/create_figures/Figure_4/domain_boundary_support/PostAnalysisTools.py was adopted from Bintu et al. (2018); our only change was translating the code from Python 2 to Python 3.
      3. Several of the Jupyter Notebooks within chromogen_code.tar.gz/ChromoGen/recreate_results/create_figures/ visualize Hi-C and DNase-seq data from Rao et al. (2014) and The ENCODE Project Consortium (2012), respectively, though this dataset excludes the experimental data itself. See chromogen_code.tar.gz/README.md for instructions on obtaining the data.
      4. Dip-C data from Tan et al. (2018) are visualized throughout these notebooks, as well. This dataset excludes the raw Dip-C data, though it does include a post-processed version of the data (see bullets 4-5).
      5. The files within chromogen_code.tar.gz/ChromoGen/recreate_results/generate_data/conformations/MDHomopolymer were originally used for Schuette et al. (2023), though we first make those scripts available here (the first author of both works created these files).
    2. epcot_final.pt contains the fine-tuned EPCOT parameters. Note that the pre-trained parameters -- not included in this dataset -- came from Zhang et al. (2023) and were used as the starting point for our fine-tuning optimization of these parameters.
    3. chromogen.pt contains the complete set of ChromoGen model parameters, including both the relevant fine-tuned EPCOT parameters and all diffusion model parameters.
    4. conformations.tar.gz contains all conformations analyzed in the manuscript, including the Dip-C conformations formatted in an HDF5 file, all ChromoGen-inferred conformations, and the MD-generated MD homopolymer conformations. Descriptively named subdirectories organize the data. Note that:
      1. conformations.tar.gz/conformations/MDHomopolymer/DUMP_FILE.dcd is from Schuette et al. (2023), though it is first made available here.
      2. conformations.tar.gz/conformations/DipC/processed_data.h5 represents our post-processed version of the 3D genome structures predicted by Dip-C in Tan et al. (2018).
    5. outside_data.tar.gz contains two subdirectories:
      1. inputs contains our post-processed genome assembly file. Its sole content, hg19.h5, is a post-processed version of the FASTA-formatted hg19 human genome alignment created by Church et al. (2011), which we downloaded from the UCSC genome browser (Kent et al. (2002) and Nassar et al. (2023)). This dataset does NOT include the FASTA file itself.
      2. training_data contains the Dip-C conformations post-processed by our pipeline. This is a duplicated version of the file described in bullet 4.2.
    6. embeddings.tar.gz contains the sequence embeddings created by our fine-tuned EPCOT model for each region included in the diffusion model's training set. This is really only needed during training.

    chromogen_code.tar.gz/ChromoGen/README.md and the README.md file on our GitHub repo (identical at the time of this dataset's publication) explain the content of each file in greater detail. They also explain how to use the code to reproduce our results or to make your own structure predictions.

    You can download and organize all the files in this dataset as intended by running the following in bash:
    # Download the code and expand the tarball whose contents define the
    # larger file structure of the repository this dataset is archiving.
    wget https://zenodo.org/records/14218666/files/chromogen_code.tar.gz
    tar -xvzf chromogen_code.tar.gz
    rm chromogen_code.tar.gz

    # Enter the top-level directory of the repo, create the subdirectories
    # that'll contain the data, and cd to it
    cd ChromoGen
    mkdir -p recreate_results/downloaded_data/models
    cd recreate_results/downloaded_data

    # Download all the data in the proper locations
    wget https://zenodo.org/records/14218666/files/conformations.tar.gz &
    wget https://zenodo.org/records/14218666/files/embeddings.tar.gz &
    wget https://zenodo.org/records/14218666/files/outside_data.tar.gz &
    cd models
    wget https://zenodo.org/records/14218666/files/chromogen.pt &
    wget https://zenodo.org/records/14218666/files/epcot_final.pt &
    cd ..
    wait

    # Untar the three tarballs
    tar -xvzf conformations.tar.gz &
    tar -xvzf embeddings.tar.gz &
    tar -xvzf outside_data.tar.gz &
    wait

    # Remove the now-unneeded tarballs
    rm conformations.tar.gz embeddings.tar.gz outside_data.tar.gz

  6. Data from: CESM2 83-level simulations

    • rda.ucar.edu
    • data.ucar.edu
    + more versions
    Cite
    CESM2 83-level simulations [Dataset]. https://rda.ucar.edu/lookfordata/datasets/?nb=y&b=topic&v=Atmosphere
    Explore at:
    Description

    In the next generation of the Community Atmosphere Model (CAM7), the model top will be raised and the vertical resolution will be increased. The model top will be approximately 80 km (compared to 40 km in older generations), and the grid spacing in the free troposphere and lower stratosphere will be reduced to about 500 m, compared to around 1137 m in older generations. In addition, extra levels will be added to the boundary layer and the lowest model level will be lowered. Overall, this "mid-top" version of CAM7 will have 93 levels. However, many other changes will also be present in CAM7, such as physics updates and the new spectral element dynamical core, making it challenging to identify the role of this enhanced resolution in changes between CAM7 and CAM6.

    This dataset consists of a suite of simulations that use CAM6 physics and the finite volume dynamical core, but with CAM7's grid except for the changes to the levels in the boundary layer, i.e., an 83-level model. The boundary layer levels are not changed because once those are changed, some additional tuning of the physical parameterizations is needed, precluding a clean comparison and identification of the impact of vertical resolution. Only minimal changes to CAM6 physics have been applied to these simulations: the non-orographic gravity wave drag scheme was turned on, the upper boundary condition was changed such that any remaining gravity wave momentum flux is deposited at the model lid, and some minor tuning of the gravity wave drag settings was performed to optimize the behavior of the QBO. These simulations can, therefore, be compared with existing CAM6 simulations to identify the impacts of raising the model lid and changing the resolution of the free troposphere and lower stratosphere. These simulations have an internally generated QBO and a relatively good representation of the stratospheric polar vortices, and can be used to explore climate variability and change in the presence of those features.

  7. Data from: Validating marker-less pose estimation with 3D x-ray radiography

    • data.niaid.nih.gov
    • search.dataone.org
    • +1more
    zip
    Updated May 12, 2022
    Cite
    Dalton Moore; Jeffrey Walker; Jason MacLean; Nicholas Hatsopoulos (2022). Validating marker-less pose estimation with 3D x-ray radiography [Dataset]. http://doi.org/10.5061/dryad.d7wm37q2z
    Explore at:
    Available download formats: zip
    Dataset updated
    May 12, 2022
    Dataset provided by
    University of Chicago
    Authors
    Dalton Moore; Jeffrey Walker; Jason MacLean; Nicholas Hatsopoulos
    License

    CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)

    Description

    These data were generated to evaluate the accuracy of DeepLabCut (DLC), a deep learning marker-less motion capture approach, by comparing it to a 3D x-ray video radiography system that tracks markers placed under the skin (XROMM). We recorded behavioral data simultaneously with XROMM and RGB video as marmosets foraged and reconstructed three-dimensional kinematics in a common coordinate system. We used XMALab to track 11 XROMM markers, and we used the toolkit Anipose to filter and triangulate DLC trajectories of 11 corresponding markers on the forelimb and torso. We performed a parameter sweep of relevant Anipose and post-processing parameters to characterize their effect on tracking quality. We compared the median error of DLC+Anipose to human labeling performance and placed this error in the context of the animal's range of motion.
    Methods

    Subjects

    These experiments were conducted with two common marmosets (Callithrix jacchus) (an 8-year old, 356g male and a 7-year old, 418g female). All methods were approved by the Institutional Animal Care and Use Committee of the University of Chicago.

    Data Collection

    The two marmosets were placed together in a 1m x 1m x 1m cage with a modular foraging apparatus attached to the top of the cage, as previously described by Walker et al. (2020). The marmosets were allowed to forage voluntarily throughout recording sessions that lasted 1-2 hours. Recordings of individual trials were triggered manually with a foot pedal by the experimenters when the marmosets appeared ready to initiate a reach. The manual trigger initiated synchronized video collection by the XROMM system (Brainerd et al., 2010) and two visible light cameras, each described in further detail below. We retained all trials that captured right-handed reaches. Marmoset TY produced four useful reaching events containing 5 total reaches and marmoset PT produced 13 reaching events containing 17 reaches.

    XROMM

    Bi-planar X-ray sources and image intensifiers (90kV, 25mA at 200 fps) were used to track the 3D position of radiopaque tantalum beads (0.5-1 mm, Bal-tec) placed subcutaneously in the arm, hand, and torso. Details of bead implants can be found in Walker et al. (2020), in which the authors also report estimating XROMM marker tracking precision of 0.06 mm based on the standard deviation of inter-marker distances during a recording of a calibration specimen. Marker locations were chosen to approximate the recommendations given by the International Society of Biomechanics for defining coordinate systems of the upper limb and torso in humans (Wu et al., 2005). These recommendations were adapted to the marmoset and constrained by surgical considerations. Positions of 13 beads were tracked using a semi-automated process in XMALab (Knorlein et al., 2016) following the procedure described there and in the XMALab User Guide (https://bitbucket.org/xromm/xmalab/wiki/Home). Two beads implanted in the anterior torso were ignored for comparison with DLC because corresponding positions on the skin were occluded in nearly every frame captured by visible light cameras.

    DeepLabCut

    Two high-speed cameras (FLIR Blackfly S, 200 fps, 1440x1080 resolution) were used to record video for analysis by DLC. The cameras were positioned to optimize visibility of the right upper limb during reaching behavior in the foraging apparatus and to minimize occlusions, while avoiding the path between the X-ray sources and image intensifiers (Fig. 1A). The cameras were triggered to record continuous images between the onset and offset of the manual XROMM trigger, with series of images later converted to video for DLC processing. All videos were brightened using the OpenCV algorithm for contrast limited adaptive histogram equalization (CLAHE) prior to labeling (a brief sketch of this step appears after this subsection). We labeled 11 body parts in DLC: two labels on the torso and three on each of the upper arm, forearm, and hand (Fig. 1B). Locations of each label were chosen to be as close as possible to the approximate location of XROMM beads, although concessions had to be made to ensure the location was not occluded consistently in the recordings. We used DLC 2.2 with in-house modifications to produce epipolar lines in image frames that were matched between the two cameras (Fig. 1C), which significantly improved human labeling accuracy by correcting gross errors and fine-tuning minor errors.

    We did not train a network on labels produced without the aid of epipolar lines and therefore cannot evaluate 3D error reduction using epipolar lines. However, we note that labels applied without epipolar lines on the torso were grossly inaccurate: these labels were adjusted by an average of 63 pixels and 57 pixels in camera-1 and camera-2, respectively, after implementation. The other nine labels were adjusted by an average of <1 pixel in camera-1 and 11 pixels in camera-2. This modification has been added as a command line feature in the DLC package (a guide for using epipolar lines can be found at https://deeplabcut.github.io/DeepLabCut/docs/HelperFunctions.html). Aside from this and related changes to the standard DLC process, we followed the steps outlined in Nath et al. (2019). In the first labeling iteration we extracted 100 total frames (50/camera) across the four events for marmoset TY and 254 frames (127/camera) across seven of the 13 events for marmoset PT, which produced a labeled dataset of 354 frames. These were chosen manually to avoid wasting time labeling frames before and after reaching bouts during which much of the marmoset forelimb was entirely occluded in the second camera angle. An additional 202 frames (101/camera) were extracted using the DLC toolbox with outliers identified by the 'jump' algorithm and frame selection by k-means clustering. We chose the number of frames to extract for each video based on visual inspection of labeling quality and chose the start and stop parameters to extract useful frames that captured reaching bouts. In all cases, frame numbers of extracted frames were matched between cameras to enable the use of epipolar lines. This refinement step resulted in an error reduction of 0.046 cm and percent frames tracked increase of 14.7% after analysis with the chosen Anipose parameters. The final dataset consisted of 278 human-labeled timepoints from 15 of the 17 events and 10,253 timepoints from all 17 events labeled by the network only. We used the default resnet-50 architecture for our networks with default image augmentation. We trained 3 shuffles of the first labeling iteration with a 0.95 training set fraction and used the first shuffle for the label refinement discussed above. We trained 15 total networks after one round of label refinement: three shuffles each with training fractions of 0.3, 0.5, 0.7, 0.85, and 0.95. Each network was trained for 300,000 iterations starting from the default initial weights. We evaluated each network every 10,000 iterations and selected the snapshot that produced the minimum test error across all labels for further analysis. We chose the network to use in subsequent analyses by finding the smallest training set size that reached the threshold of human labeling error (discussed next). We then chose the median-performing network of the three shuffles at this training set size for all further analysis.
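    As a brief, illustrative sketch of the CLAHE brightening step mentioned above (assuming OpenCV's Python bindings; the file names and parameter values here are placeholders, not taken from the dataset's own code):

    import cv2

    # Read one extracted video frame (placeholder file name), convert to the
    # LAB color space, and apply contrast limited adaptive histogram
    # equalization (CLAHE) to the lightness channel only.
    frame = cv2.imread("frame_0001.png")
    lab = cv2.cvtColor(frame, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)

    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    l_eq = clahe.apply(l)

    # Merge the equalized lightness channel back and convert to BGR.
    brightened = cv2.cvtColor(cv2.merge((l_eq, a, b)), cv2.COLOR_LAB2BGR)
    cv2.imwrite("frame_0001_clahe.png", brightened)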
    Human Labeling Error

    We selected 134 frames (67/camera) across three events from the same marmoset and session to be relabeled by the original, experienced human labeler and by a second, less experienced labeler. We used the error between the new and original labels to evaluate whether the networks reached asymptotic performance, defined by the experienced human labeling error.

    Calibration

    A custom calibration device was built to allow for calibration in both recording domains (Knorlein et al. 2016; instruction manual for small lego cube is located in the XMALab BitBucket). The device was constructed to contain a three-dimensional grid of steel beads within the structure and a two-dimensional grid of white circles on one face of the cube. Calibration of x-ray images was computed in XMALab and calibration of visible light images was computed with custom code using OpenCV. This integrated calibration device, along with the PCA-based alignment procedure described below, ensures that DLC and XROMM tracked trajectories in a common 3D coordinate system. DLC videos were accurately calibrated, with 0.42 pixels and 0.40 pixels of intrinsic calibration error for camera-1 and camera-2, respectively, and 0.63 pixels of stereo reprojection error. XROMM calibration was similarly accurate, with average intrinsic calibration error equal to 0.81 pixels and 1.38 pixels for the two cameras.

    Trajectory processing with Anipose

    We used Anipose to analyze videos, filter in 2D, triangulate 3D position from 2D trajectories, and apply 3D filters (see Karashchuk et al., 2021 for details). For 2D-filtering, we chose to apply a Viterbi filter followed by an autoencoder filter because the authors demonstrate this to be the most accurate combination of 2D filters. For triangulation and 3D filtering, we enabled optimization during triangulation and enabled spatial constraints for each set of three points on the hand, forearm, and upper arm, and for the pair of points on the torso. We identified six Anipose parameters and one post-processing parameter that may affect the final accuracy of DLC+Anipose tracking and ran a parameter sweep to find the optimal combination. In 2D filtering, we varied the number of bad points that could be back-filled into the Viterbi filter ("n-back") and the offset threshold beyond which a label was considered to have jumped from the filter. We varied four parameters in 3D processing, including the weight applied to spatial constraints ("scale_length") and a smoothing factor ("scale_smooth"), the reprojection error threshold used during triangulation optimization, and the score threshold used as a cutoff for

  8. Sujet-Finance-Instruct-177k

    • huggingface.co
    Updated Jul 28, 2024
    Cite
    Sujet AI (2024). Sujet-Finance-Instruct-177k [Dataset]. https://huggingface.co/datasets/sujet-ai/Sujet-Finance-Instruct-177k
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 28, 2024
    Dataset authored and provided by
    Sujet AI
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Sujet Finance Dataset Overview

    The Sujet Finance dataset is a comprehensive collection designed for fine-tuning large language models (LLMs) on specialized tasks in the financial sector. It amalgamates data from 18 distinct datasets hosted on Hugging Face, resulting in a rich repository of 177,597 entries. These entries span seven key financial LLM tasks, making Sujet Finance a versatile tool for developing and enhancing financial applications of AI.… See the full description on the dataset page: https://huggingface.co/datasets/sujet-ai/Sujet-Finance-Instruct-177k.

  9. Data from: INCLUDE: A Large Scale Dataset for Indian Sign Language Recognition

    • live.european-language-grid.eu
    • data.niaid.nih.gov
    mov
    Updated May 9, 2022
    Cite
    (2022). INCLUDE: A Large Scale Dataset for Indian Sign Language Recognition [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/7631
    Explore at:
    Available download formats: mov
    Dataset updated
    May 9, 2022
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract: Indian Sign Language (ISL) is a complete language with its own grammar, syntax, vocabulary and several unique linguistic attributes. It is used by over 5 million deaf people in India. Currently, there is no publicly available dataset on ISL to evaluate Sign Language Recognition (SLR) approaches. In this work, we present the Indian Lexicon Sign Language Dataset - INCLUDE - an ISL dataset that contains 0.27 million frames across 4,287 videos over 263 word signs from 15 different word categories. INCLUDE is recorded with the help of experienced signers to provide close resemblance to natural conditions. A subset of 50 word signs is chosen across word categories to define INCLUDE-50 for rapid evaluation of SLR methods with hyperparameter tuning. The best performing model achieves an accuracy of 94.5% on the INCLUDE-50 dataset and 85.6% on the INCLUDE dataset. Download Instructions: For ease of access, we have prepared a shell script to download all the parts of the dataset and extract them to form the complete INCLUDE dataset. You can find the script here: http://bit.ly/include_dl

  10. Seconds to Output 500 Tokens, including reasoning model 'thinking' time by Model

    • artificialanalysis.ai
    Updated Dec 30, 2023
    + more versions
    Cite
    Artificial Analysis (2023). Seconds to Output 500 Tokens, including reasoning model 'thinking' time by Model [Dataset]. https://artificialanalysis.ai/
    Explore at:
    Dataset updated
    Dec 30, 2023
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of seconds to output 500 tokens (including reasoning model 'thinking' time) by model; lower is better.

  11. hh-rlhf

    • huggingface.co
    Updated Dec 9, 2022
    + more versions
    Cite
    Anthropic (2022). hh-rlhf [Dataset]. https://huggingface.co/datasets/Anthropic/hh-rlhf
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 9, 2022
    Dataset authored and provided by
    Anthropic (https://anthropic.com/)
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Card for HH-RLHF

      Dataset Summary
    

    This repository provides access to two different kinds of data:

    Human preference data about helpfulness and harmlessness from Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. These data are meant to train preference (or reward) models for subsequent RLHF training. These data are not meant for supervised training of dialogue agents. Training dialogue agents on these data is likely to lead… See the full description on the dataset page: https://huggingface.co/datasets/Anthropic/hh-rlhf.

