Dataset Card for "Llama-2-SQL-and-Code-Dataset"
This dataset is intended to provide LLaMA 2 with improved coding and instruction-following capabilities, with a specific focus on SQL generation. The dataset is in Alpaca Instruct format. Be sure to provide both the instruction and the input in the prompt to the model, along with any surrounding prompt text you would like to add. In the train split, ignore the table column. The eval split provides example tables so that the… See the full description on the dataset page: https://huggingface.co/datasets/ChrisHayduk/Llama-2-SQL-and-Code-Dataset.
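For concreteness, here is a minimal sketch of assembling a prompt from an Alpaca-format example. The template below is the commonly used Alpaca layout, assumed rather than confirmed by this card, and the column names `instruction` and `input` follow the Alpaca convention:

```python
# A minimal sketch, assuming the standard Alpaca Instruct fields
# ("instruction", "input"); adapt to the actual column names in the dataset.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

def build_prompt(example: dict) -> str:
    """Assemble a single prompt from an Alpaca-format example."""
    return ALPACA_TEMPLATE.format(
        instruction=example["instruction"],
        input=example["input"],
    )

prompt = build_prompt({
    "instruction": "Write a SQL query that returns all rows from the users table.",
    "input": "Table: users(id, name, email)",
})
print(prompt)
```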
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Project Title: Add title here
Project Team: Add contact information for research project team members
Summary: Provide a descriptive summary of the nature of your research project and its aims/focal research questions.
Relevant publications/outputs: When available, add links to the related publications/outputs from this data.
Data availability statement: If your data is not linked on figshare directly, provide links to where it is being hosted here (e.g., Open Science Framework, GitHub, etc.). If your data is not going to be made publicly available, please provide details here as to the conditions under which interested individuals could gain access to the data and how to go about doing so.
Data collection details: 1. When was your data collected? 2. How were your participants sampled/recruited?
Sample information: How many and who are your participants? Demographic summaries are helpful additions to this section.
Research Project Materials: What materials are necessary to fully reproduce the contents of your dataset? Include a list of all relevant materials (e.g., surveys, interview questions) with a brief description of what is included in each file that should be uploaded alongside your datasets.
List of relevant datafile(s): If your project produces data that cannot be contained in a single file, list the names of each of the files here with a brief description of what parts of your research project each file is related to.
Data codebook: What is in each column of your dataset? Provide variable names as they are encoded in your data files, the verbatim question associated with each response, response options, and details of any post-collection coding that has been done on the raw responses (and whether that is encoded in a separate column).
Examples available at: https://www.thearda.com/data-archive?fid=PEWMU17 https://www.thearda.com/data-archive?fid=RELLAND14
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
People Data Set 2 is a dataset for object detection tasks - it contains People STRs annotations for 5,243 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
All results of the primary interrupted time-series analyses evaluating targeted and total border closures that met the following criteria: 1) at least seven days of data are available before and after the intervention point, 2) for multiple-intervention time series, at least seven days have passed since the last intervention point, and 3) for multiple sequential targeted border closures, the second (or third) intervention indicates an increase of at least 20% in the share of the world's population being targeted by the new border closures.
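As an illustration only (not code from the study), the first two inclusion criteria could be checked along these lines; the function and field names here are hypothetical:

```python
# A hypothetical sketch of inclusion criteria 1 and 2; not from the study.
from datetime import date

def meets_window_criteria(series_start: date, series_end: date,
                          interventions: list[date]) -> bool:
    """Require >= 7 days of data around each intervention and >= 7 days
    between consecutive interventions."""
    interventions = sorted(interventions)
    for i, t in enumerate(interventions):
        if (t - series_start).days < 7 or (series_end - t).days < 7:
            return False  # criterion 1: seven days before and after
        if i > 0 and (t - interventions[i - 1]).days < 7:
            return False  # criterion 2: seven days since the last intervention
    return True

print(meets_window_criteria(date(2020, 1, 1), date(2020, 3, 1),
                            [date(2020, 1, 20), date(2020, 2, 10)]))
```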
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The data presented here consist of three parts:

Dataset 1: In this set, we extracted 327,322 faces from our entire collection of 3,389 issues and automatically classified each face as male or female. We present these data as a single table with columns identifying the date, issue, page number, the coordinates identifying the position of the face on the page, and the classification (male or female). The coordinates identifying the position of the face on the page are based on the size and resolution of the pages found in the “Time Vault”.

Dataset 2: Dataset 2 consists of 8,789 classified faces from 100 selected issues. Human labor was used to identify and extract 3,299 face images from 39 issues, which were later classified by another set of workers. This selection of 39 issues contains one issue per decade spanned by the archive plus one issue per year between 1961 and 1991, and the extracted face images were used to train the face extraction algorithm. The remaining 5,490 faces from 61 issues were extracted via machine learning before being classified by human coders. These 61 issues were chosen to complement the first selection of 39 issues: one issue per year for all years in the archive excluding those between 1961 and 1991. Thus, Dataset 2 contains fully labelled faces from at least one issue per year.

Dataset 3: In the interest of transparency, Dataset 3 consists of the raw data collected to create Dataset 2 and comprises two tables. Before explaining these tables, we first briefly describe our data collection and verification procedures, which have been fully described elsewhere. A custom AMT interface was used to enable human workers to classify faces according to the categories in Table 4. Each worker was given a randomly selected batch of 25 pages, each with a clearly highlighted face to be categorized; three of these pages were verification pages with known features, used for quality control. Each face was labeled by two distinct human coders, chosen at random so that the pairing of coders varied with the image. A proficiency rating was calculated for each coder by considering all images they annotated and computing the average number of labels that matched those identified by the image's other coder. The tables in Dataset 2 were created by resolving inconsistencies between the two image coders in favor of the labels from the coder with the higher proficiency rating. Prior to calculating the proficiency score, all faces that were tagged as having 'Poor' or 'Error' image quality by either of the two coders were eliminated. Due to technical bugs when the AMT interface was first implemented, a small number of images were only labeled once; these were also eliminated from Datasets 2 and 3. In Dataset 3, we present the raw annotations from each coder who tagged each face, along with demographic data for each coder. Dataset 3 consists of two tables: the raw data from each of the two sets of coders, and the demographic information for each of the coders.
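The proficiency-based resolution described above could be sketched roughly as follows; this is an illustrative reconstruction, not the authors' code, and all names are hypothetical:

```python
# A hypothetical sketch of the proficiency-based label resolution;
# not the authors' implementation.
from collections import defaultdict

# annotations: image_id -> {coder_id: tuple_of_labels}, two coders per image
def coder_proficiency(annotations: dict) -> dict:
    """Average fraction of a coder's labels matching the other coder's."""
    totals, matches = defaultdict(int), defaultdict(int)
    for coders in annotations.values():
        (c1, l1), (c2, l2) = coders.items()
        agree = sum(a == b for a, b in zip(l1, l2))
        for coder, labels in ((c1, l1), (c2, l2)):
            totals[coder] += len(labels)
            matches[coder] += agree
    return {c: matches[c] / totals[c] for c in totals}

def resolve(annotations: dict) -> dict:
    """Keep each image's labels from the more proficient of its two coders."""
    prof = coder_proficiency(annotations)
    return {
        img: max(coders.items(), key=lambda kv: prof[kv[0]])[1]
        for img, coders in annotations.items()
    }
```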
Dataset Card for "dataset2"
More Information needed
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This dataset is about: Gridded data sets II. Please consult parent dataset @ https://doi.org/10.1594/PANGAEA.733691 for more information.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by dataset268
Released under MIT
The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.
Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger vocabulary and retains the original case, punctuation, and numbers, all of which are removed in PTB. As it is composed of full articles, the dataset is well suited to models that can take advantage of long-term dependencies.
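For reference, a minimal sketch of loading WikiText-2 with the Hugging Face datasets library, assuming the public "wikitext" dataset and its "wikitext-2-raw-v1" configuration are available on the Hub:

```python
# A minimal sketch, assuming the "wikitext" dataset and its
# "wikitext-2-raw-v1" configuration on the Hugging Face Hub.
from datasets import load_dataset

wikitext2 = load_dataset("wikitext", "wikitext-2-raw-v1")
print(wikitext2)                       # train / validation / test splits
print(wikitext2["train"][10]["text"])  # one raw line of article text
```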
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Search_2 dataset consists of (1) a set of 44 high-resolution digital color images of different complex natural scenes, (2) the ground truth corresponding to each of these scenes, and (3) the results of psychophysical experiments on each of these images. The images in the Search_2 dataset are a subset of a larger set that was used in a visual search and detection experiment. Each scene (image) contains a single military vehicle that serves as a search target. A report describes the images in detail and presents the corresponding ground truth and observer data. The image dataset, an Excel file with the ground truth and observer data, and a copy of this report are included in the dataset. The complete dataset can be used to validate (1) digital metrics that compute the visual distinctness (contrast, conspicuity, or saliency) of targets in complex scenes, and (2) models of human visual search and detection.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Note: Find data at source. BDG2 is an open data set made up of 3,053 energy meters from 1,636 buildings. The time range of the time-series data is two full years (2016 and 2017), and the frequency is hourly measurements of electricity, heating and cooling water, steam, and irrigation meters.
nachors/dataset2 dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Industrial CT is used for industrial applications. Computational fluid dynamics assists in the flow simulation of fluid channels.
141: Areas with vegetation within or partly embraced by urban fabric. This class is assigned to urban greenery, which usually has a recreational or ornamental character and is usually accessible to the public.
🚢 Stanford Human Preferences Dataset v2 (SHP-2)
Summary
SHP-2 is a dataset of 4.8M collective human preferences over responses to questions/instructions in 129 different subject areas, from cooking to legal advice. It is an extended version of the original 385K SHP dataset. The preferences are meant to reflect the helpfulness of one response over another, and are intended to be used for training RLHF reward models and NLG evaluation models (e.g., SteamSHP). Each example… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/SHP-2.
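A minimal sketch for loading the dataset with the Hugging Face datasets library, assuming the repository ID from the dataset page above resolves directly; the split and field names are not confirmed here:

```python
# A minimal sketch, assuming the Hub repository ID from the dataset page.
from datasets import load_dataset

shp2 = load_dataset("stanfordnlp/SHP-2")
print(shp2)              # available splits and their sizes
print(shp2["train"][0])  # inspect one preference example's fields
```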
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains data collected during the study "Towards High-Value Datasets determination for data-driven development: a systematic literature review", conducted by Anastasija Nikiforova (University of Tartu), Nina Rizun and Magdalena Ciesielska (Gdańsk University of Technology), Charalampos Alexopoulos (University of the Aegean), and Andrea Miletič (University of Zagreb). It is being made public both to act as supplementary data for the paper (a pre-print is available in Open Access at https://arxiv.org/abs/2305.10234) and so that other researchers can use these data in their own work.
The protocol is intended for the systematic literature review (SLR) on the topic of high-value datasets, with the aim of gathering information on how the topic of high-value datasets (HVD) and their determination has been reflected in the literature over the years and what has been found by these studies to date, including the indicators used, the stakeholders involved, data-related aspects, and frameworks. The data in this dataset were collected as the result of the SLR over Scopus, Web of Science, and the Digital Government Research Library (DGRL) in 2023.
Methodology
To understand how HVD determination has been reflected in the literature over the years and what has been found by these studies to date, all relevant literature covering this topic was studied. To this end, the SLR was carried out by searching the digital libraries covered by Scopus, Web of Science (WoS), and the Digital Government Research Library (DGRL).
These databases were queried for the keywords ("open data" OR "open government data") AND ("high-value data*" OR "high value data*"), applied to the article title, keywords, and abstract to limit the results to papers in which these objects were primary research objects rather than merely mentioned in the body, e.g., as future work. After deduplication, 11 unique articles were found and further checked for relevance. As a result, a total of 9 articles were examined in depth. Each study was independently examined by at least two authors.
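For illustration, the keyword filter above would look roughly like this when expressed with Scopus's TITLE-ABS-KEY field code; the authors' exact query string is not given here:

```python
# A hypothetical rendering of the search query using Scopus's
# TITLE-ABS-KEY field code; not the authors' exact query string.
SCOPUS_QUERY = (
    'TITLE-ABS-KEY(("open data" OR "open government data") '
    'AND ("high-value data*" OR "high value data*"))'
)
print(SCOPUS_QUERY)
```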
To attain the objective of our study, we developed the protocol, where the information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design- related information, (3) quality-related information, (4) HVD determination-related information.
Test procedure: Each study was independently examined by at least two authors; after an in-depth examination of the full text of the article, the structured protocol was filled in for each study. The structure of the protocol is available in the supplementary files (see Protocol_HVD_SLR.odt, Protocol_HVD_SLR.docx). The data collected for each study by two researchers were then synthesized into one final version by a third researcher.
Description of the data in this data set
Protocol_HVD_SLR provides the structure of the protocol. Spreadsheet #1 provides the filled protocol for the relevant studies. Spreadsheet #2 provides the list of results after the search over the three indexing databases, i.e., before filtering out irrelevant studies.
The information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design- related information, (3) quality-related information, (4) HVD determination-related information
Descriptive information
1) Article number - a study number, corresponding to the study number assigned in an Excel worksheet
2) Complete reference - the complete source information to refer to the study
3) Year of publication - the year in which the study was published
4) Journal article / conference paper / book chapter - the type of the paper: {journal article, conference paper, book chapter}
5) DOI / Website - a link to the website where the study can be found
6) Number of citations - the number of citations of the article in Google Scholar, Scopus, Web of Science
7) Availability in OA - availability of the article in Open Access
8) Keywords - keywords of the paper as indicated by the authors
9) Relevance for this study - what is the relevance level of the article for this study? {high / medium / low}
Approach- and research design-related information
10) Objective / RQ - the research objective / aim and the established research questions
11) Research method (including unit of analysis) - the methods used to collect data, including the unit of analysis (country, organisation, specific unit that has been analysed, e.g., the number of use cases, scope of the SLR, etc.)
12) Contributions - the contributions of the study
13) Method - whether the study uses a qualitative, quantitative, or mixed-methods approach
14) Availability of the underlying research data - whether there is a reference to publicly available underlying research data, e.g., transcriptions of interviews or collected data, or an explanation of why these data are not shared
15) Period under investigation - the period (or moment) in which the study was conducted
16) Use of theory / theoretical concepts / approaches - does the study mention any theory / theoretical concepts / approaches? If any theory is mentioned, how is it used in the study?
Quality- and relevance-related information
17) Quality concerns - whether there are any quality concerns (e.g., limited information about the research methods used)
18) Primary research object - is the HVD a primary research object in the study? (primary - the paper is focused on HVD determination; secondary - mentioned but not studied (e.g., as part of discussion, future work, etc.))
HVD determination-related information
19) HVD definition and type of value - how is HVD (or any equivalent term) defined in the article?
20) HVD indicators - what are the indicators to identify HVD? How were they identified? (components & relationships, "input -> output")
21) A framework for HVD determination - is there a framework presented for HVD identification? What components does it consist of, and what are the relationships between these components? (detailed description)
22) Stakeholders and their roles - what stakeholders or actors does HVD determination involve? What are their roles?
23) Data - what data do HVD cover?
24) Level (if relevant) - what is the level of the HVD determination covered in the article? (e.g., city, regional, national, international)
Format of the files: .xls, .csv (for the first spreadsheet only), .odt, .docx
Licenses or restrictions: CC-BY
For more info, see README.txt
This dataset was created by Roxana
Overview: 142: Areas used for sports, leisure, and recreation purposes.

Traceability (lineage): This dataset was produced with a machine learning framework with several input datasets, specified in detail in Witjes et al., 2022 (in review, preprint available at https://doi.org/10.21203/rs.3.rs-561383/v3).

Scientific methodology: The single-class probability layers were generated with a spatiotemporal ensemble machine learning framework detailed in Witjes et al., 2022 (in review, preprint available at https://doi.org/10.21203/rs.3.rs-561383/v3). The single-class uncertainty layers were calculated by taking the standard deviation of the three single-class probabilities predicted by the three components of the ensemble. The HCL (hard class) layer represents the class with the highest probability as predicted by the ensemble.

Usability: The HCL layers have a decreasing average accuracy (weighted F1-score) at each subsequent level of the CLC hierarchy: 0.83 at level 1 (5 classes), 0.63 at level 2 (14 classes), and 0.49 at level 3 (43 classes). This means that the hard-class maps are more reliable when classes are aggregated to a higher level of the hierarchy (e.g., 'Discontinuous Urban Fabric' and 'Continuous Urban Fabric' to 'Urban Fabric'). Some single-class probabilities may more closely represent actual patterns for classes that were overshadowed by unequal sample point distributions. Users are encouraged to set their own thresholds when postprocessing these datasets to optimize the accuracy for their specific use case.

Uncertainty quantification: Uncertainty is quantified by taking the standard deviation of the probabilities predicted by the three components of the spatiotemporal ensemble model.

Data validation approaches: The LULC classification was validated through spatial 5-fold cross-validation, as detailed in the accompanying publication.

Completeness: The dataset has chunks of empty predictions in regions with complex coastlines (e.g., the Zeeland province in the Netherlands and the Mar da Palha bay area in Portugal). These are artifacts that will be avoided in subsequent versions of the LULC product.

Consistency: The accuracy of the predictions was compared per year and per 30 km × 30 km tile across Europe to derive temporal and spatial consistency by calculating the standard deviation. The standard deviation of the annual weighted F1-score was 0.135, while the standard deviation of the weighted F1-score per tile was 0.150. This means the dataset is more consistent through time than through space: predictions are notably less accurate along the Mediterranean coast. The accompanying publication contains additional information and visualisations.

Positional accuracy: The raster layers have a resolution of 30 m, identical to that of the Landsat data cube used as input features for the machine learning framework that predicted them.

Temporal accuracy: The dataset contains prediction and uncertainty layers for each year between 2000 and 2019.

Thematic accuracy: The maps reproduce the Corine Land Cover classification system, a hierarchical legend that consists of 5 classes at the highest level, 14 classes at the second level, and 44 classes at the third level. Class 523 (Oceans) was omitted due to computational constraints.
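The per-pixel uncertainty and hard-class derivation described above amount to a standard deviation and an argmax over the ensemble members' class probabilities. A minimal illustrative sketch follows; the array shapes are hypothetical, and this is not the authors' code:

```python
# An illustrative sketch of the ensemble-derived layers; array shapes
# are hypothetical and this is not the authors' implementation.
import numpy as np

# probs: (n_members=3, n_classes=43, height, width) class probabilities,
# here filled with random Dirichlet draws as stand-in data.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(43), size=(3, 4, 4)).transpose(0, 3, 1, 2)

mean_probs = probs.mean(axis=0)          # per-class probability layer
uncertainty = probs.std(axis=0)          # per-class uncertainty layer
hard_class = mean_probs.argmax(axis=0)   # HCL: most probable class per pixel

print(uncertainty.shape, hard_class.shape)
```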
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
People 2 is a dataset for object detection tasks - it contains People Y9S6 annotations for 479 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset package includes four datasets.