Dataset Card for example-preference-dataset
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/sdiazlor/example-preference-dataset/raw/main/pipeline.yaml"
or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/distilabel-internal-testing/example-generate-preference-dataset.
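For quick inspection, the generated preference data can presumably be loaded with the Hugging Face datasets library (split and column names depend on the pipeline configuration):

```python
# Hedged sketch: inspecting the generated preference data with the Hugging Face
# `datasets` library; the split name is an assumption.
from datasets import load_dataset

ds = load_dataset("distilabel-internal-testing/example-generate-preference-dataset", split="train")
print(ds.column_names)   # columns produced by the distilabel pipeline
print(ds[0])             # one generated preference example
```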
Dataset Card for my-dataset-generate
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/Bipul8765/my-dataset-generate/raw/main/pipeline.yaml"
or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/Bipul8765/my-dataset-generate.
The total amount of data created, captured, copied, and consumed globally is forecast to increase rapidly, reaching 149 zettabytes in 2024. Over the five years to 2028, global data creation is projected to grow to more than 394 zettabytes. In 2020, the amount of data created and replicated reached a new high; growth was higher than previously expected because of increased demand during the COVID-19 pandemic, as more people worked and learned from home and used home entertainment options more often.

Storage capacity also growing

Only a small percentage of this newly created data is kept: just two percent of the data produced and consumed in 2020 was saved and retained into 2021. In line with the strong growth of the data volume, the installed base of storage capacity is forecast to increase at a compound annual growth rate of 19.2 percent over the forecast period from 2020 to 2025. In 2020, the installed base of storage capacity reached 6.7 zettabytes.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle, used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.
By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.
Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.
The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!
While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.
The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
The files are organized into a two-level directory structure. Each top-level folder contains up to 1 million files; e.g., folder 123 contains all versions from 123,000,000 to 123,999,999. Each subfolder contains up to 1 thousand files; e.g., 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
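As a small illustration (not from the dataset documentation), the layout above implies a straightforward mapping from a KernelVersions id to a relative path; the file extension depends on the notebook language and is assumed here.

```python
# Hedged sketch: map a KernelVersions id to its expected relative path in
# Meta Kaggle Code, following the two-level layout described above
# (top folder = millions, subfolder = thousands). Extension is an assumption.

def kernel_version_path(version_id: int, extension: str = "ipynb") -> str:
    top = version_id // 1_000_000        # e.g. 123 for ids 123,000,000-123,999,999
    sub = (version_id // 1_000) % 1_000  # e.g. 456 for ids 123,456,000-123,456,999
    return f"{top}/{sub}/{version_id}.{extension}"

print(kernel_version_path(123_456_789))  # -> '123/456/123456789.ipynb'
```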
The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays
We love feedback! Let us know in the Discussion tab.
Happy Kaggling!
ccPDB (Compilation and Creation of datasets from PDB) is designed to serve the scientific community working in the field of functional or structural annotation of proteins. This database of datasets is based on the Protein Data Bank (PDB), from which all datasets were derived. ccPDB has four modules: i) compilation of datasets, ii) creation of datasets, iii) web services, and iv) important links.
* Compilation of Datasets: Datasets at ccPDB fall into two categories: i) datasets collected from the literature and ii) datasets compiled from PDB. We are in the process of collecting PDB datasets from the literature and maintaining them at ccPDB, and we also invite the community to suggest datasets. In addition, we generate datasets from PDB using commonly used standard protocols, such as non-redundant chains and structures solved at high resolution.
* Creation of Datasets: This module was developed for creating customized datasets, where a user can create a dataset from PDB using his/her own conditions. It will be useful for users who wish to create a new dataset suited to their requirements. The module has six steps, which are described on the help page.
* Web Services: We integrated the following web services into ccPDB: i) the Analyze PDB ID service allows users to submit their PDB entry to around 40 servers from a single point, ii) BLAST search allows users to run a BLAST search of their protein against PDB, iii) the structural information service annotates a protein structure from its PDB ID, iv) Search in PDB helps users search for structures in PDB, v) the Generate patterns service generates different types of patterns required for machine learning techniques, and vi) Download useful information allows users to download various types of information for a given set of proteins (PDB IDs).
* Important Links: One of the major objectives of this web site is to provide links to web servers related to the functional annotation of proteins. In the first phase we have collected and compiled these links in different categories; in the future, an attempt will be made to collect as many links as possible.
https://spdx.org/licenses/CC0-1.0.html
Social networks are tied to population dynamics; interactions are driven by population density and demographic structure, while social relationships can be key determinants of survival and reproductive success. However, difficulties integrating models used in demography and network analysis have limited research at this interface. We introduce the R package genNetDem for simulating integrated network-demographic datasets. It can be used to create longitudinal social networks and/or capture-recapture datasets with known properties. It incorporates the ability to generate populations and their social networks, generate grouping events using these networks, simulate social network effects on individual survival, and flexibly sample these longitudinal datasets of social associations. By generating co-capture data with known statistical relationships it provides functionality for methodological research. We demonstrate its use with case studies testing how imputation and sampling design influence the success of adding network traits to conventional Cormack-Jolly-Seber (CJS) models. We show that incorporating social network effects in CJS models generates qualitatively accurate results, but with downward-biased parameter estimates when network position influences survival. Biases are greater when fewer interactions are sampled or fewer individuals are observed in each interaction. While our results indicate the potential of incorporating social effects within demographic models, they show that imputing missing network measures alone is insufficient to accurately estimate social effects on survival, pointing to the importance of incorporating network imputation approaches. genNetDem provides a flexible tool to aid these methodological advancements and help researchers test other sampling considerations in social network studies.

Methods
The dataset and code stored here are for Case Studies 1 and 2 in the paper. Datasets were generated using simulations in R. Here we provide 1) the R code used for the simulations; 2) the simulation outputs (as .RDS files); and 3) the R code to analyse simulation outputs and generate the tables and figures in the paper.
Data sets used to prepare illustrative figures for the overview article "Multiscale Modeling of Background Ozone".

Overview
The CMAQ model output datasets used to create illustrative figures for this overview article were generated by scientists in EPA/ORD/CEMM and EPA/OAR/OAQPS. The EPA/ORD/CEMM-generated dataset consisted of hourly CMAQ output from two simulations. The first simulation was performed for July 1 – 31 over a 12 km modeling domain covering the Western U.S. The simulation was configured with the Integrated Source Apportionment Method (ISAM) to estimate the contributions from 9 source categories to modeled ozone. ISAM source contributions for July 17 – 31 averaged over all grid cells located in Colorado were used to generate the illustrative pie chart in the overview article. The second simulation was performed for October 1, 2013 – August 31, 2014 over a 108 km modeling domain covering the northern hemisphere. This simulation was also configured with ISAM to estimate the contributions from non-US anthropogenic sources, natural sources, stratospheric ozone, and other sources to ozone concentrations. Ozone ISAM results from this simulation were extracted along a boundary curtain of the 12 km modeling domain specified over the Western U.S. for the time period January 1, 2014 – July 31, 2014 and used to generate the illustrative time-height cross-sections in the overview article. The EPA/OAR/OAQPS-generated dataset consisted of hourly gridded CMAQ output for surface ozone concentrations for the year 2016. The CMAQ simulations were performed over the northern hemisphere at a horizontal resolution of 108 km. NO2 and O3 data for July 2016 were extracted from these simulations to generate the vertically-integrated column densities shown in the illustrative comparison to satellite-derived column densities.

CMAQ Model Data
The data from the CMAQ model simulations used in this research effort are very large (several terabytes) and cannot be uploaded to ScienceHub due to size restrictions. The model simulations are stored on the /asm archival system accessible through the atmos high-performance computing (HPC) system. Due to data management policies, files on /asm are subject to expiry depending on the template of the project. Files not requested for extension after the expiry date are deleted permanently from the system. The format of the files used in this analysis and listed below is ioapi/netcdf. Documentation of this format, including definitions of the geographical projection attributes contained in the file headers, is available at https://www.cmascenter.org/ioapi/. Documentation on the CMAQ model, including a description of the output file format and output model species, can be found in the CMAQ documentation on the CMAQ GitHub site at https://github.com/USEPA/CMAQ.

This dataset is associated with the following publication: Hogrefe, C., B. Henderson, G. Tonnesen, R. Mathur, and R. Matichuk. Multiscale Modeling of Background Ozone: Research Needs to Inform and Improve Air Quality Management. EM Magazine. Air and Waste Management Association, Pittsburgh, PA, USA, 1-6, (2020).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Hybrid LCA database generated using ecoinvent and EXIOBASE, i.e., new direct inputs (coming from EXIOBASE) that were deemed missing (e.g., services) are added to each process of the original ecoinvent database. Each process of the resulting hybrid database is thus not (or at least less) truncated, and the calculated life cycle emissions/impacts should therefore be closer to reality.
For license reasons, only the added inputs for each process of ecoinvent are provided (and not all the inputs).
Why are there two versions for hybrid-ecoinvent3.5?
One of the versions corresponds to ecoinvent hybridized with the normal version of EXIOBASE, and the other to ecoinvent hybridized with a capital-endogenized version of EXIOBASE.
What does capital endogenization do?
It matches capital goods formation to the value chains of the products where they are required. In more LCA terms, EXIOBASE in its normal version does not allocate capital use to value chains; it is as if ecoinvent processes had no inputs of buildings, etc., in their unit process inventories. For more detail on this, refer to (Södersten et al., 2019) or (Miller et al., 2019).
So which version do I use?
Using the version "with capitals" gives a more comprehensive coverage. Using the "without capitals" version means that if a process of ecoinvent misses inputs of capital goods (e.g., a process does not include the company laptops of the employees), it won't be added. It comes with its fair share of assumptions and uncertainties however.
Why is it only available for hybrid-ecoinvent3.5?
The work used for capital endogenization is not available for exiobase3.8.1.
How do I use the dataset?
First, to use it, you will need both the corresponding ecoinvent [cut-off] and EXIOBASE [product x product] versions. For the reference year of EXIOBASE to-be-used, take 2011 if using the hybrid-ecoinvent3.5 and 2019 for hybrid-ecoinvent3.6 and 3.7.1.
In the four datasets of this package, only added inputs are given (i.e. inputs from EXIOBASE added to ecoinvent processes). Ecoinvent and EXIOBASE processes/sectors are not included, for copyright issues. You thus need both ecoinvent and EXIOBASE to calculate life cycle emissions/impacts.
Module to get ecoinvent in a Python format: https://github.com/majeau-bettez/ecospold2matrix (make sure to take the most up-to-date branch)
Module to get EXIOBASE in a Python format: https://github.com/konstantinstadler/pymrio (can also be installed with pip)
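As a rough sketch (the archive name is a placeholder, and extension names can differ between EXIOBASE releases), loading EXIOBASE with pymrio and pulling out the matrices used below might look like this:

```python
# Hedged sketch: parse an EXIOBASE 3 product-by-product release with pymrio
# and expose the matrices referenced in the hybrid calculation below.
import pymrio

exio = pymrio.parse_exiobase3(path="IOT_2011_pxp.zip")  # 2011 matches hybrid-ecoinvent3.5
exio.calc_all()                                          # derives A, L, S, multipliers, ...

A_io = exio.A            # EXIOBASE technology matrix (IO convention, inputs only)
S_io = exio.satellite.S  # environmental extensions per unit of output
```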
If you want to use the "with capitals" version of the hybrid database, you also need to use the capital endogenized version of EXIOBASE, available here: https://zenodo.org/record/3874309. Choose the pxp version of the year you plan to study (which should match with the year of the EXIOBASE version). You then need to normalize the capital matrix (i.e., divide by the total output x of EXIOBASE). Then, you simply add the normalized capital matrix (K) to the technology matrix (A) of EXIOBASE (see equation below).
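A minimal sketch of that normalization step, assuming K, A and the total output vector x are already loaded as numpy arrays aligned by sector:

```python
# Hedged sketch: normalize the endogenized capital matrix K column-wise by the
# EXIOBASE total output x, then add it to the technology matrix A
# (as in the equation below). K, A and x are assumed to be aligned numpy arrays.
import numpy as np

def normalize_capital_matrix(K, x):
    x_safe = np.where(x == 0, 1.0, x)  # guard against empty sectors
    return K / x_safe                  # divides each column j of K by x[j]

# A_io_with_K = A_io + normalize_capital_matrix(K_raw, x_io)
```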
Once you have all the data needed, you just need to apply a slightly modified version of the Leontief equation:
\begin{equation}
\textbf{q}^{hyb} =
\begin{bmatrix} \textbf{C}^{lca}\cdot\textbf{S}^{lca} & \textbf{C}^{io}\cdot\textbf{S}^{io} \end{bmatrix}
\cdot
\left( \textbf{I} - \begin{bmatrix} \textbf{A}^{lca} & \textbf{C}^{d} \\ \textbf{C}^{u} & \textbf{A}^{io}+\textbf{K}^{io} \end{bmatrix} \right)^{-1}
\cdot
\begin{bmatrix} \textbf{y}^{lca} \\ 0 \end{bmatrix}
\end{equation}
q^hyb gives the hybridized impacts, i.e., the impacts of each process including the impacts generated by its new inputs.
C^lca and C^io are the respective characterization matrices for ecoinvent and EXIOBASE.
S^lca and S^io are the respective environmental extension matrices (elementary flows, in LCA terms) for ecoinvent and EXIOBASE.
I is the identity matrix.
A^lca and A^io are the respective technology matrices for ecoinvent and EXIOBASE (the ones loaded with ecospold2matrix and pymrio).
K^io is the capital matrix. If you do not use the endogenized version, do not include this matrix in the calculation.
C^u (the upstream cut-offs) is the matrix provided in this dataset.
C^d (the downstream cut-offs) is simply a matrix of zeros in the case of this application.
Finally, you define your final demand (the functional unit or set of functional units, in LCA terms) as y^lca.
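Putting the pieces together, a hedged numpy sketch of this calculation (with the matrices assumed to be already loaded and aligned as dense arrays) could look like:

```python
# Hedged sketch of the hybrid calculation above. Variable names mirror the
# equation; shapes: n ecoinvent processes, m EXIOBASE sectors.
# Cu has shape (m, n): inputs from EXIOBASE sectors into ecoinvent processes.
import numpy as np

def hybrid_impacts(C_lca, S_lca, C_io, S_io, A_lca, A_io, Cu, y_lca, K_io=None):
    n, m = A_lca.shape[0], A_io.shape[0]
    Cd = np.zeros((n, m))                       # downstream cut-offs are zero here
    A_io_full = A_io + K_io if K_io is not None else A_io
    # Block technology matrix of the hybrid system.
    A_hyb = np.block([[A_lca, Cd],
                      [Cu,    A_io_full]])
    # Characterized environmental extensions, stacked for both systems.
    CS = np.hstack([C_lca @ S_lca, C_io @ S_io])
    # Final demand only on the ecoinvent side.
    y = np.concatenate([y_lca, np.zeros(m)])
    x = np.linalg.solve(np.eye(n + m) - A_hyb, y)  # (I - A)^-1 y without an explicit inverse
    return CS @ x
```

For real-sized matrices, which are large and sparse, a sparse solver (e.g. scipy.sparse.linalg.spsolve) is preferable to the dense solve shown here.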
Can I use it with different versions/reference years of EXIOBASE?
Technically speaking, yes it will work, because the temporal aspect does not intervene in the determination of the hybrid database presented here. However, keep in mind that there might be some inconsistencies. For example, you would need to multiply each of the inputs of the datasets by a factor to account for inflation. Prices of ecoinvent (which were used to compile the hybrid databases, for all versions presented here) are defined in €2005.
What are the weird sequences of numbers in the columns?
Ecoinvent processes are identified by unique identifiers (uuids); their metadata (i.e., name, location, price, etc.) can be retrieved using the appropriate metadata files in each dataset package.
Why is the equation (I-A)^-1 and not A^-1 like in LCA?
IO and LCA share the same computational background. In LCA, however, the convention is to represent both outputs and inputs in the technology matrix; that's why there is a diagonal of 1s (the outputs, i.e., functional units) and negative values elsewhere (the inputs). In IO, the technology matrix does not include outputs and registers only inputs, as positive values. In the end it is just a difference of convention: if we call T the technology matrix of LCA and A the technology matrix of IO, we have T = I - A. When you load ecoinvent using ecospold2matrix, the resulting version of ecoinvent will already be in the IO convention, so you won't have to bother with it.
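A two-by-two toy example (made-up numbers) showing that the two conventions are equivalent:

```python
# Tiny illustration of the LCA vs IO convention noted above (made-up numbers).
import numpy as np

# LCA-style technology matrix: 1s on the diagonal (outputs), negative inputs elsewhere.
T_lca = np.array([[ 1.0, -0.2],
                  [-0.5,  1.0]])

A_io = np.eye(2) - T_lca          # T = I - A  <=>  A = I - T
y = np.array([1.0, 0.0])          # functional unit / final demand

x_lca = np.linalg.solve(T_lca, y)             # LCA: T x = y
x_io  = np.linalg.solve(np.eye(2) - A_io, y)  # IO: (I - A) x = y
assert np.allclose(x_lca, x_io)               # same scaling vector either way
```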
Pymrio does not provide a characterization matrix for EXIOBASE, what do I do?
You can find an up-to-date characterization matrix (with Impact World+) for environmental extensions of EXIOBASE here: https://zenodo.org/record/3890339
If you want to match characterization across both EXIOBASE and ecoinvent (which you should do), here you can find a characterization matrix with Impact World+ for ecoinvent: https://zenodo.org/record/3890367
It's too complicated...
The custom software that was used to develop these datasets already deals with some of the steps described. Go check it out: https://github.com/MaximeAgez/pylcaio. You can also generate your own hybrid version of ecoinvent using this software (you can play with parameters such as the correction for double counting, the inflation rate, the price data to be used, etc.). As of pylcaio v2.1, the resulting hybrid database (generated directly by pylcaio) can be exported to and manipulated in brightway2.
Where can I get more information?
The whole methodology is detailed in (Agez et al., 2021).
https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html
Replication pack, FSE2018 submission #164
------------------------------------------
**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: A Case Study of the PyPI Ecosystem

**Note:** link to data artifacts is already included in the paper. Link to the code will be included in the Camera Ready version as well.

Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files described below
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset. This dataset only includes stats aggregated by the ecosystem (PyPI)
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages themselves, which take around 2TB.
- **build_model.r, helpers.r** - R files to process the survival data (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, `common.cache/survival_data.pypi_2008_2017-12_6.csv` in **dataset_full_Jan_2018.tgz**)
- **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
- LICENSE - text of GPL v3, under which this dataset is published
- INSTALL.md - replication guide (~2 pages)
Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)

Depending on detalization level (see Step 2 for more details):

- up to 2Tb of disk space (see Step 2 detalization levels)
- at least 16Gb of RAM (64 preferable)
- a few hours to a few months of processing time

Step 1 - software
-----------------

- unpack **ghd-0.1.0.zip**, or clone from gitlab:

      git clone https://gitlab.com/user2589/ghd.git
      git checkout 0.1.0

  `cd` into the extracted folder. All commands below assume it as the current directory.
- copy `settings.py` into the extracted folder. Edit the file:
  * set `DATASET_PATH` to some newly created folder path
  * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS`
- install docker. For Ubuntu Linux, the command is `sudo apt-get install docker-compose`
- install libarchive and headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
  Without this dependency, you might get an error on the next step, but it's safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt`
- disable all APIs except GitHub (Bitbucket and Gitlab support were not yet implemented when this study was in progress): edit `scraper/__init__.py`, comment out everything except GitHub support in `PROVIDERS`.

Step 2 - obtaining the dataset
------------------------------

The ultimate goal of this step is to get the output of the Python function `common.utils.survival_data()` and save it into a CSV file:

    # copy and paste into a Python console
    from common import utils
    survival_data = utils.survival_data('pypi', '2008', smoothing=6)
    survival_data.to_csv('survival_data.csv')

Since full replication will take several months, here are some ways to speed up the process:

#### Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv`, go to the next step

#### Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table. The whole process will take 15..30 minutes.

- create a folder `
The product data are six statistics that were estimated for the chemical concentration of lithium in the soil C horizon of the conterminous United States. The estimates are made at 9998 locations that are uniformly distributed across the conterminous United States. The six statistics are the mean for the isometric log-ratio transform of the concentrations, the equivalent mean for the concentrations, the standard deviation for the isometric log-ratio transform of the concentrations, the probability of exceeding a concentration of 55 milligrams per kilogram, the 0.95 quantile for the isometric log-ratio transform of the concentrations, and the equivalent 0.95 quantile for the concentrations. Each statistic may be used to generate a statistical map that shows an attribute of the distribution of lithium concentration.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Clinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution to preserve privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.
Objective: This study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI’s GPT-4o using zero-shot prompting, and to evaluate the fidelity of LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.
Methods: In Phase 1, GPT-4o was prompted to generate a dataset with qualitative descriptions of 13 clinical parameters. The resultant data was assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics of the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.
Results: In Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The dataset was plausible in range and correctly calculated body mass index for all case files based on the respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB revealed that Phase 2 data achieved significant fidelity. Phase 2 data demonstrated statistical similarity in 12/13 (92.31%) parameters, whereby no statistically significant differences were observed in 6/6 (100.0%) categorical/binary and 6/7 (85.71%) continuous parameters. Overlap of 95% CIs was observed in 6/7 (85.71%) continuous parameters.
Conclusion: Zero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets, which can replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.
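As an illustrative sketch only (not the study's code), the Phase 2 fidelity checks described above could be reproduced along these lines; the arrays, sample sizes, and counts below are placeholders:

```python
# Hedged sketch of the fidelity checks described above (two-sample t-test,
# two-sample proportion test, 95% CI overlap). All data are placeholders.
import numpy as np
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(0)
synthetic_age = rng.normal(58, 12, 6166)   # placeholder LLM-generated continuous parameter
real_age = rng.normal(59, 13, 6388)        # placeholder VitalDB parameter

# Continuous parameter: two-sample t-test and 95% CI overlap of the means.
t, p = stats.ttest_ind(synthetic_age, real_age, equal_var=False)
ci_syn = stats.t.interval(0.95, len(synthetic_age) - 1,
                          loc=synthetic_age.mean(), scale=stats.sem(synthetic_age))
ci_real = stats.t.interval(0.95, len(real_age) - 1,
                           loc=real_age.mean(), scale=stats.sem(real_age))
overlap = ci_syn[0] <= ci_real[1] and ci_real[0] <= ci_syn[1]
print(f"t-test p={p:.3f}, 95% CI overlap: {overlap}")

# Binary parameter: two-sample proportion z-test (placeholder counts).
counts = np.array([2900, 3100])
nobs = np.array([6166, 6388])
z, p_prop = proportions_ztest(counts, nobs)
print(f"proportion test p={p_prop:.3f}")
```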
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
## Overview
Build is a dataset for object detection tasks - it contains Build annotations for 623 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [MIT license](https://opensource.org/licenses/MIT).
https://www.futurebeeai.com/data-license-agreement
Welcome to the Portuguese Chain of Thought prompt-response dataset, a meticulously curated collection containing 3000 comprehensive prompt and response pairs. This dataset is an invaluable resource for training Language Models (LMs) to generate well-reasoned answers and minimize inaccuracies. Its primary utility lies in enhancing LLMs' reasoning skills for solving arithmetic, common sense, symbolic reasoning, and complex problems.

Dataset Content: This COT dataset comprises a diverse set of instructions and questions paired with corresponding answers and rationales in the Portuguese language. These prompts and completions cover a broad range of topics and questions, including mathematical concepts, common sense reasoning, complex problem-solving, scientific inquiries, puzzles, and more. Each prompt is meticulously accompanied by a response and rationale, providing essential information and insights to enhance the language model training process. These prompts, completions, and rationales were manually curated by native Portuguese people, drawing references from various sources, including open-source datasets, news articles, websites, and other reliable references. Our chain-of-thought prompt-completion dataset includes various prompt types, such as instructional prompts, continuations, and in-context learning (zero-shot, few-shot) prompts. Additionally, the dataset contains prompts and completions enriched with various forms of rich text, such as lists, tables, code snippets, JSON, and more, with proper markdown format.

Prompt Diversity: To ensure a wide-ranging dataset, we have included prompts from a plethora of topics related to mathematics, common sense reasoning, and symbolic reasoning. These topics encompass arithmetic, percentages, ratios, geometry, analogies, spatial reasoning, temporal reasoning, logic puzzles, patterns, and sequences, among others. These prompts vary in complexity, spanning easy, medium, and hard levels. Various question types are included, such as multiple-choice, direct queries, and true/false assessments.

Response Formats: To accommodate diverse learning experiences, our dataset incorporates different types of answers depending on the prompt and provides step-by-step rationales. The detailed rationale aids the language model in building a reasoning process for complex questions. These responses encompass text strings, numerical values, and date and time formats, enhancing the language model's ability to generate reliable, coherent, and contextually appropriate answers.

Data Format and Annotation Details: This fully labeled Portuguese Chain of Thought Prompt Completion Dataset is available in JSON and CSV formats. It includes annotation details such as a unique ID, prompt, prompt type, prompt complexity, prompt category, domain, response, rationale, response type, and rich text presence.

Quality and Accuracy: Our dataset upholds the highest standards of quality and accuracy. Each prompt undergoes meticulous validation, and the corresponding responses and rationales are thoroughly verified. We prioritize inclusivity, ensuring that the dataset incorporates prompts and completions representing diverse perspectives and writing styles, maintaining an unbiased and discrimination-free stance. The Portuguese version is grammatically accurate without any spelling or grammatical errors. No copyrighted, toxic, or harmful content is used during the construction of this dataset.
Continuous Updates and Customization: The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Ongoing efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to gather custom chain of thought prompt completion data tailored to specific needs, providing flexibility and customization options.

License: The dataset, created by FutureBeeAI, is now available for commercial use. Researchers, data scientists, and developers can leverage this fully labeled and ready-to-deploy Portuguese Chain of Thought Prompt Completion Dataset to enhance the rationale and accurate response generation capabilities of their generative AI models and explore new approaches to NLP tasks.
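As a purely illustrative sketch (the file name and column names below are assumptions, not the documented schema), the CSV release described above could be explored with pandas like this:

```python
# Hedged sketch: load the CSV release and filter by prompt complexity.
# Column names are illustrative assumptions; check the shipped annotation schema.
import pandas as pd

df = pd.read_csv("portuguese_cot_prompt_completion.csv")   # placeholder file name

# Example: keep only hard arithmetic prompts.
hard_math = df[(df["prompt_complexity"] == "hard") & (df["prompt_category"] == "arithmetic")]

for _, row in hard_math.head(3).iterrows():
    print(row["prompt"])
    print(row["rationale"])   # step-by-step reasoning
    print(row["response"])    # final answer
```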
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The SPIN covid19 RMRIO dataset is a time series of MRIO tables covering the years 2016 to 2026 on a yearly basis. The dataset covers 163 sectors in 155 countries.
This repository includes data for years from 2016 to 2019 (hist scenario) and the corresponding labels.
Data for years 2020 to 2026 are stored in the corresponding repositories:
Tables are generated using the SPIN method, based on the RMRIO tables for the year 2015, GDP, imports and exports data from the International Financial Statistics (IFS) and the World Economic Outlooks (WEO) of October 2019 and April 2021.
From 2020 to 2026, the dataset includes two diverging scenarios. The covid scenario is in line with April 2021 WEO's data and includes the macroeconomic effects of Covid 19. The counterfactual scenario is in line with October 2019 WEO's data and simulates the global economy without Covid 19. Tables from 2016 to 2019 are labelled as hist.
The Projections folder includes the generated tables for years from 2016 to 2019 (hist scenario) and the corresponding labels.
The Sources folder contains the data records from the IFS and WEO databases. The Method data contains the data files used to generate the tables with the SPIN method and the following Python scripts:
All tables are labelled in 2015 US$ and valued in basic prices.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
NSText2SQL Dataset (Reformatted for Fine Tuned Generative Models)
This is the exact same dataset as NSText2SQL: https://huggingface.co/datasets/NumbersStation/NSText2SQL, but with the data reformatted to allow direct use for fine-tuning generative models. The original license and credits for the original dataset remain in place. Specifically, the changes from standard NSText2SQL are:
Removed non-English questions
Removed all rows with more than one input table, simplifying the… See the full description on the dataset page: https://huggingface.co/datasets/tjaffri/NSText2SQL-generate.
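A minimal, hedged sketch of loading this dataset with the Hugging Face datasets library (the split name is an assumption):

```python
# Hedged sketch: load the reformatted split for fine-tuning and inspect a row.
from datasets import load_dataset

ds = load_dataset("tjaffri/NSText2SQL-generate", split="train")
print(ds)     # number of rows and column names
print(ds[0])  # one prompt/completion example
```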
This dataset contains all data used in this study, including site ID, latitude, longitude, watershed land cover, water chemistry, and carbon and nitrogen stable isotope ratios of periphyton, invertebrate functional feeding groups, and five most frequently observed invertebrate families. Also included is a list of all invertebrates collected in this study along with their functional feeding group and stable isotope ratios. This dataset is associated with the following publication: Smucker, N., A. Kuhn, C. Cruz-Quinones, J. Serbst, and J. Lake. Stable isotopes of algae and macroinvertebrates in streams respond to watershed urbanization, inform management goals, and indicate food web relationships. ECOLOGICAL INDICATORS. Elsevier Science Ltd, New York, NY, USA, 90: 295-304, (2018).
This dataset provides all data used to generate the figures and tables in the article entitled "Particulate matter and black carbon optical properties and emission factors from prescribed fires in the southeastern United States" published in the Journal of Geophysical Research: Atmospheres. This dataset is associated with the following publication: Holder , A., G. Hagler , J. Aurell, M. Hays , and B. Gullett. Particulate matter and black carbon optical properties and emission factors from prescribed fires in the southeastern United States. JOURNAL OF GEOPHYSICAL RESEARCH-ATMOSPHERES. American Geophysical Union, Washington, DC, USA, 121(7): 3465-3483, (2016).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
All the gene expression datasets are published on GEO (Gene Expression Omnibus). All the datasets are based on the Affymetrix Mouse Genome 430 2.0 chip (GEO platform: GPL1226).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The performance of a defect prediction model trained on balanced versus imbalanced datasets has a big impact on the discovery of future defects. Current resampling techniques only address the imbalance in datasets without taking into consideration the redundancy and noise inherent to imbalanced datasets. To address the imbalance issue, we propose Kernel Crossover Oversampling (KCO), an oversampling technique based on kernel analysis and crossover interpolation. Specifically, the proposed technique aims to generate balanced datasets by increasing data diversity in order to reduce redundancy and noise. KCO first represents multidimensional features as two-dimensional features by employing Kernel Principal Component Analysis (KPCA). KCO then divides the plotted data distribution by deploying spectral clustering to select the best region for interpolation. Lastly, KCO generates the new defect data by interpolating different data templates within the selected data clusters. According to the prediction evaluation conducted, KCO consistently produced F-scores ranging from 21% to 63% across six datasets, on average. The experimental results presented in this study show that KCO provides more effective prediction performance than other baseline techniques, and that it consistently achieves higher F-scores in both within-project and cross-project predictions.
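A minimal sketch of the KCO idea as described above (KPCA projection, spectral clustering, crossover interpolation within clusters) using scikit-learn; this is an illustrative reading of the technique, not the authors' implementation:

```python
# Illustrative sketch of the KCO steps described above: project the minority
# (defective) class with Kernel PCA, cluster the 2-D projection with spectral
# clustering, then create synthetic samples by interpolating pairs drawn from
# the same cluster. Not the authors' implementation.
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.cluster import SpectralClustering

def kco_oversample(X_minority, n_new, n_clusters=3, random_state=0):
    rng = np.random.default_rng(random_state)
    # 1) Kernel PCA: represent multidimensional features in two dimensions.
    Z = KernelPCA(n_components=2, kernel="rbf").fit_transform(X_minority)
    # 2) Spectral clustering on the 2-D representation to pick interpolation regions.
    labels = SpectralClustering(n_clusters=n_clusters, random_state=random_state,
                                assign_labels="discretize").fit_predict(Z)
    # 3) Crossover interpolation between two templates from the same cluster.
    synthetic = []
    for _ in range(n_new):
        c = rng.choice(labels)
        idx = np.flatnonzero(labels == c)
        a, b = X_minority[rng.choice(idx)], X_minority[rng.choice(idx)]
        lam = rng.random()
        synthetic.append(lam * a + (1 - lam) * b)
    return np.vstack(synthetic)

# Example: generate 60 synthetic defective samples from 40 real ones.
X_min = np.random.default_rng(1).normal(size=(40, 10))
X_new = kco_oversample(X_min, n_new=60)
```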
http://www.opendefinition.org/licenses/cc-by-sa
This dataset presents the IoT network traffic generated by connected objects. In order to understand and characterise the legitimate behaviour of network traffic, a platform was created to generate IoT traffic under realistic conditions. This platform contains different IoT devices: voice assistants, smart cameras, connected printers, connected light bulbs, motion sensors, etc. A set of interactions with these objects was then performed to generate real traffic. This data is used to identify anomalies and intrusions using machine learning algorithms and to improve existing detection models. Our dataset is available in two formats, PCAP and CSV, and was created as part of the EU CEF Variot project https://variot.eu. To download the data in PCAP format and for more information, our database is available on this web portal: https://www.variot.telecom-sudparis.eu/
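As a hedged example of the intended use (the file name and feature columns are placeholders), an unsupervised anomaly detector could be trained on the CSV export like this:

```python
# Hedged sketch: train a simple anomaly detector on the CSV export of this
# traffic dataset. File name and columns are placeholders; check the released
# CSV schema before use.
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.read_csv("iot_traffic.csv")                      # placeholder path
features = df.select_dtypes("number").fillna(0)          # numeric flow/packet features only

model = IsolationForest(contamination=0.01, random_state=0).fit(features)
df["anomaly_score"] = model.decision_function(features)  # lower = more anomalous
print(df.sort_values("anomaly_score").head())
```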