42 datasets found

MetaMath QA
kaggle.com
zip
Updated Nov 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The Devastator (2023). MetaMath QA [Dataset]. https://www.kaggle.com/datasets/thedevastator/metamathqa-performance-with-mistral-7b
Explore at:
zip(78629842 bytes)Available download formats
Dataset updated
Nov 23, 2023
Authors
The Devastator
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
MetaMath QA

Mathematical Questions for Large Language Models

By Huggingface Hub [source]

About this dataset

This dataset contains meta-mathematics questions and answers collected from the Mistral-7B question-answering system. The responses, types, and queries are all provided in order to help boost the performance of MetaMathQA while maintaining high accuracy. With its well-structured design, this dataset provides users with an efficient way to investigate various aspects of question answering models and further understand how they function. Whether you are a professional or beginner, this dataset is sure to offer invaluable insights into the development of more powerful QA systems!

More Datasets

For more datasets, click here.

Featured Notebooks

🚨 Your notebook can be here! 🚨!

How to use the dataset

Data Dictionary

The MetaMathQA dataset contains three columns: response, type, and query. - Response: the response to the query given by the question answering system. (String) - Type: the type of query provided as input to the system. (String) - Query:the question posed to the system for which a response is required. (String)

Preparing data for analysis

It’s important that before you dive into analysis, you first familiarize yourself with what kind data values are present in each column and also check if any preprocessing needs to be done on them such as removing unwanted characters or filling in missing values etc., so that it can be used without any issue while training or testing your model further down in your process flow.

##### Training Models using Mistral 7B

Mistral 7B is an open source framework designed for building machine learning models quickly and easily from tabular (csv) datasets such as those found in this dataset 'MetaMathQA ' . After collecting and preprocessing your dataset accordingly Mistral 7B provides with support for various Machine Learning algorithms like Support Vector Machines (SVM), Logistic Regression , Decision trees etc , allowing one to select from various popular libraries these offered algorithms with powerful overall hyperparameter optimization techniques so soon after selecting algorithm configuration its good practice that one use GridSearchCV & RandomSearchCV methods further tune both optimizations during model building stages . Post selection process one can then go ahead validate performances of selected models through metrics like accuracy score , F1 Metric , Precision Score & Recall Scores .

##### Testing phosphors :

After successful completion building phase right way would be robustly testing phosphors on different evaluation metrics mentioned above Model infusion stage helps here immediately make predictions based on earlier trained model OK auto back new test cases presented by domain experts could hey run quality assurance check again base score metrics mentioned above know asses confidence value post execution HHO updating baseline scores running experiments better preferred methodology AI workflows because Core advantage finally being have relevancy inexactness induced errors altogether impact low

Research Ideas

Generating natural language processing (NLP) models to better identify patterns and connections between questions, answers, and types.

Developing understandings on the efficiency of certain language features in producing successful question-answering results for different types of queries.

Optimizing search algorithms that surface relevant answer results based on types of queries

Acknowledgements

If you use this dataset in your research, please credit the original authors. Data Source

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: train.csv | Column name | Description | |:--------------|:------------------------------------| | response | The response to the query. (String) | | type | The type of query. (String) |

Acknowledgements

If you use this dataset in your research, please credit the original authors. If you use this dataset in your research, please credit Huggingface Hub.
LC-QuAD 2.0 (Question & Answering)
kaggle.com
zip
Updated Dec 2, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The Devastator (2022). LC-QuAD 2.0 (Question & Answering) [Dataset]. https://www.kaggle.com/datasets/thedevastator/unlock-smarter-querying-with-lc-quad-2-0
Explore at:
zip(3004134 bytes)Available download formats
Dataset updated
Dec 2, 2022
Authors
The Devastator
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
LC-QuAD 2.0 (Question & Answering)

30,000 pairs of question and its corresponding SPARQL query

By Huggingface Hub [source]

About this dataset

LC-QuAD 2.0 is a breakthrough dataset designed to advance the state of intelligent querying towards unprecedented heights. By providing a collection of 30,000 different pairs of questions and their respective SPARQL queries each, it presents an enormous opportunity for every person looking to unlock the power of knowledge with smart querying techniques.

These questions have been carefully devised such that they relate to the latest version of Wikidata and DBpedia, granting tech-savvy individuals an access key to an information repository far beyond what was once thought imaginable. The dataset found under this union is nothing short of amazing - consisting not just of Natural Language Questions but also their solutions in the form of a SPARQL query. With LC-QuAD 2.0, you have at your fingertips more than thirty thousand answers ready for any query you can think up! Unlocking knowledge has never been easier!

More Datasets

For more datasets, click here.

Featured Notebooks

🚨 Your notebook can be here! 🚨!

How to use the dataset

Using the LC-QuAD 2.0 dataset can be a great way to power up your intelligent systems with smarter querying. Whether you want to build a question-answering system or create new knowledge graphs and search systems, utilizing this dataset can certainly be helpful. Here is a guide on how to use this dataset:

Understand the structure of the data: The LC-QuAD 2.0 consists of 30,000 different pairs of questions and their corresponding SPARQL queries in two files – train (used for training an intelligent system) and test (used for testing an intelligent system). The columns present in each pair are NNQT_question (Natural Language Question), subgraph (Subgraph information for the question), sparql_dbpedia18 (SPARQL query for DBpedia 18), template (Templates from which SPARQL query was generated).

Read up on SPARQL: Before you start using this dataset, it is important that you read up more on what SPARQL means and how it works as SPAQL will be used frequently when browsing through this data set. This will make the understanding of the content easier and quicker!

Start exploring!: After doing some research about SPARQL, now it’s time to explore! You can start by looking at each pair in detail - read through its natural language question, subgraph information and try understanding its relation with its corresponding sparql queries from both DBpedia 18 or try running these sparql queries yourself against Wikidata or DBPedia platform to see where they lead you eventually! In case any query has multiple results having different variances with respect to answers range , then look inside entity definitions contained within words \ phrases / synonyms reflected by natural language parsing services API's like AIKATsetu etc., before writing authoritative answer modules/endpoints forming partinmonly sustainable pipeline architecture using such prepared & refined datasets like LC-QUAD !

Use your own data: Once you have familiarized yourself sufficiently with the available pairs & understand their relevance , consider creating your own data set by adding more complex questions along associated unique attributes which shall give great insights . If not done already evaluate if population enrichment techniques should be applied suiting specific domain's needs your bot purports - either just features selection criterion wise or entire classifier selection algorithm wise as otherwise global extracted vectors may decide either selectively for reducing overfitting/generalization penalty in

Research Ideas

Incorporating the LC-QUAD 2.0 dataset into Intelligent systems such as Chatbots, Question Answering Systems, and Document Summarization programs to allow them to retrieve the required information by transforming natural language questions into SPARQL queries.

Utilizing this dataset in Semantic Scholar Search Engines and Academic Digital Libraries which can use natural language queries instead of keywords in order to perform more sophisticated searches and provide more accurate results for researchers in diverse areas.

Applying this dataset for building Knowledge Graphs that can store entities along with their attributes, categories and relations thereby allowing better understanding of complex relationships between entities or data and further advancing development of AI agents that are able to answer specific questions or provide personalized recommendations in various contexts or tasks

Acknowledgements

&g...
aol-data.tar.bz2
figshare.com
bz2
Updated Oct 23, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Graham Cormode (2017). aol-data.tar.bz2 [Dataset]. http://doi.org/10.6084/m9.figshare.5527231.v1
Explore at:
bz2Available download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.5527231.v1
Dataset updated
Oct 23, 2017
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Graham Cormode
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
AOL search data anonymized and released by AOL Research in 2006."500k User Session Collection----------------------------------------------This collection is distributed for NON-COMMERCIAL RESEARCH USE ONLY. Any application of this collection for commercial purposes is STRICTLY PROHIBITED.Brief description:This collection consists of ~20M web queries collected from ~650k users over three months.The data is sorted by anonymous user ID and sequentially arranged. The goal of this collection is to provide real query log data that is based on real users. It could be used for personalization, query reformulation or other types of search research. The data set includes {AnonID, Query, QueryTime, ItemRank, ClickURL}. AnonID - an anonymous user ID number. Query - the query issued by the user, case shifted with most punctuation removed. QueryTime - the time at which the query was submitted for search. ItemRank - if the user clicked on a search result, the rank of the item on which they clicked is listed. ClickURL - if the user clicked on a search result, the domain portion of the URL in the clicked result is listed.Each line in the data represents one of two types of events: 1. A query that was NOT followed by the user clicking on a result item. 2. A click through on an item in the result list returned from a query.In the first case (query only) there is data in only the first three columns/fields -- namely AnonID, Query, and QueryTime (see above). In the second case (click through), there is data in all five columns. For click through events, the query that preceded the click through is included. Note that if a user clicked on more than one result in the list returned from a single query, there will be TWO lines in the data to represent the two events. Also note that if the user requested the next "page" or results for some query, this appears as a subsequent identical query with a later time stamp.CAVEAT EMPTOR -- SEXUALLY EXPLICIT DATA! Please be aware that these queries are not filtered to remove any content. Pornography is prevalent on the Web and unfiltered search engine logs contain queries by users who are looking for pornographic material. There are queries in this collection that use SEXUALLY EXPLICIT LANGUAGE. This collection of data is intended for use by mature adults who are not easily offended by the use of pornographic search terms. If you are offended by sexually explicit language you should not read through this data. Also be aware that in some states it may be illegal to expose a minor to this data. Please understand that the data represents REAL WORLD USERS, un-edited and randomly sampled, and that AOL is not the author of this data.Basic Collection StatisticsDates: 01 March, 2006 - 31 May, 2006Normalized queries: 36,389,567 lines of data 21,011,340 instances of new queries (w/ or w/o click-through) 7,887,022 requests for "next page" of results 19,442,629 user click-through events 16,946,938 queries w/o user click-through 10,154,742 unique (normalized) queries 657,426 unique user ID'sPlease reference the following publication when using this collection:G. Pass, A. Chowdhury, C. Torgeson, "A Picture of Search" The First International Conference on Scalable Information Systems, Hong Kong, June, 2006.Copyright (2006) AOL"
h
astro-llms-full-query-data
huggingface.co
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Center for Language and Speech Processing @ JHU, astro-llms-full-query-data [Dataset]. https://huggingface.co/datasets/jhu-clsp/astro-llms-full-query-data
Explore at:
Dataset authored and provided by
Center for Language and Speech Processing @ JHU
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
AstroLLMs Full Query Dataset

This dataset includes all of the data collected in a four-week deployment of a Large Language Model-powered Slack chatbot trained on astrophysics papers. Astronomers were invited to interact with the chatbot, ask questions, and leave feedback. This data includes 368 question-answer pairs, including feedback, reactions, and labeling.

Dataset Structure

The columns of this dataset are thread_ts (unique time stamp of the query), channel_id… See the full description on the dataset page: https://huggingface.co/datasets/jhu-clsp/astro-llms-full-query-data.
Z
Data from: SARA - A Collection of Sensitivity-Aware Relevance Assessments
data.niaid.nih.gov
data-staging.niaid.nih.gov
+1more
Updated Jun 6, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
McKechnie, Jack; McDonald, Graham (2023). SARA - A Collection of Sensitivity-Aware Relevance Assessments [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8006819
Explore at:
Dataset updated
Jun 6, 2023
Dataset provided by
University of Glasgow
Authors
McKechnie, Jack; McDonald, Graham
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
SARA - A Collection of Sensitivity-Aware Relevance Assessments

Presented here is a collection of Sensitivity-Aware Relevance Assessments for the UC Berkely labelled subset of the Enron Email Collection. The Hearst [1] labelled version of the Enron Email Collection is a subset of the CMU collection that contains 1702 emails that were annotated as part of a class project at UC Berkley. Students in the Natural Language Processing course were tasked with annotating the emails as relevant or not relevant to 53 different categories. Therefore, the labelled version of the Enron email collection provides a rich taxonomy of labels which can be used for multiple definitions of sensitivity such as the Purely Personal and Personal but in a Professional Context. The categories that the emails are labelled for can be seen in Table 1. The files for the labelled version of the Enron Email Collection are available from the UC Berkely website.

We deploy a topic modelling approach to identify topical themes in the labelled Enron collection that serve as a basis for our information needs which are in turn used to gather queries and relevance assessments, the notebook for which is available here. Two separate crowdsourcing tasks are carried out in the development of SARA. Firstly, query formulations are crowdsourced to represent the information needs and, secondly, relevance assessments are crowdsourced for a pooled set of documents from the labelled Enron collection for each of the information needs.

The SARA Collection of Sensitivity-Aware Relevance Assessments is available through the popular ir_datasets library. More information can be found on the ir_datasets GitHub and website.

Information Needs

To create our set of sensitivity-aware relevance assessments for the labelled Enron email collection, we first identify a set of topical subjects that reflect the contents of the emails in the collection. We use a topic modelling approach to identify the information needs. When identifying topics to be used as information needs, we are interested in identifying general themes that relate to the topics of discussion that might likely be covered in the contents (i.e., the body) of the emails in the collection. The topics are chosen to be broad enough to be able to reasonably expect that there would be relevant documents in the collection, and not so specific that it would require specialist knowledge to make a judgement of relevance on the subject. Subsequently, we manually construct short passages of text to serve as descriptions of the information needs that are to be searched for in the collection by the crowdworkers. The information needs that the crowdworkers are available in the information_needs.tsv file.

Queries

In order to collect relevance assessments for pairs of emails and information needs, different query formulations are first needed to generate pools of documents. Query formulations for each topic are collected from crowdworkers from the Prolific crowdwork platform. Ten information needs are shown to each crowdworker and they are asked to provide a query formulation that they would use to get relevant documents to satisfy the information need they are presented with. Three queries for each of the fifty information needs are released. The resulting queries are available in the repeated_queries.tsv file.

Relevance Assesments

Crowdworkers are shown an information need and an email and asked to rate the document as being either Highly Relevant, Partially Relevant, or Not Relevant to the information need. Each information need/email pair is judged by three crowdworkers and a majority vote is used to generate a ground truth label. Since each information need / email pair is judged by three crowdworkers and there are three possible labels, it is possible for each of the labels to be selected by one crowdworker. In practice, this only happened for 134 pairs. In such cases, ties are broken by having one of the authors read the document and make an additional judgement. In order to ensure that sensitive documents definitely have relevance labels they were also judged by one of the authors for each of the information needs. The relevance assessments are available in the repeated_qrels.txt file. The relevance assessments are in the format 'query iteration document relevancy'. The iteration column is used for IR_Datasets and can be safely ignored and the document name is the filename used in the labelled Enron collection.

Table 1

1) Coarse genre 2) Included/forwarded information 3) Primary topics (If coarse genre 1.1 is selected) 4) Emotional tone (If not neutral) 1.1 Company Business, Strategy, etc. (See 3) 2.1 Includes new text in addition to forwarded material 3.1 Regulations and regulators (includes price caps) 4.1 Jubilation 1.2 Purely Personal 2.2 Forwarded email(s) including replies 3.2 Internal projects -- progress and strategy 4.2 Hope / anticipation 1.3 Personal but in professional context (e.g., it was good working with you) 2.3 Business letter(s) / document(s) 3.3 Company image -- current 4.3 Humor 1.4 Logistic Arrangements (meeting scheduling, technical support, etc.) 2.4 News article(s) 3.4 Company image -- changing / influencing 4.4 Camaraderie 1.5 Employment arrangements (job seeking, hiring, recommendations, etc.) 2.5 Government / academic report(s) 3.5 Political influence / contributions / contacts 4.5 Admiration 1.6 Document editing/checking (collaboration) 2.6 Government action(s) (such as results of a hearing, etc.) 3.6 California energy crisis / California politics 4.6 Gratitude 1.7 Empty message (due to missing attachment) 2.7 Press release(s) 3.7 Internal company policy 4.7 Friendship / affection 1.8 Empty message 2.8 Legal documents (complaints, lawsuits, advice) 3.8 Internal company operations 4.8 Sympathy / support 2.9 Pointers to url(s) 3.9 Alliances / partnerships 4.9 Sarcasm 2.10 Newsletters 3.10 Legal advice 4.10 Secrecy / confidentiality 2.11 Jokes, humor (related to business) 3.11 Talking points 4.11 Worry / anxiety 2.12 Jokes, humor (unrelated to business) 3.12 Meeting minutes 4.12 Concern 2.13 Attachment(s) (assumed missing) 3.13 Trip reports 4.13 Competitiveness / aggressiveness 4.14 Triumph / gloating 4.15 Pride 4.16 Anger / agitation 4.17 Sadness / despair 4.18 Shame 4.19 Dislike / scorn

The Sensitivity-Aware Relevance Assessments dataset is held under an Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) licence which allows for it to be adapted, transformed and built upon.

Questions and comments are welcomed via email.

References

[1] Marti A Hearst. 2005. Teaching applied natural language processing: Triumphs and tribulations. In Proc. of Workshop on Effective Tools and Methodologies for Teaching NLP and CL.
WikiSQL (Questions and SQL Queries)
kaggle.com
zip
Updated Nov 25, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The Devastator (2022). WikiSQL (Questions and SQL Queries) [Dataset]. https://www.kaggle.com/datasets/thedevastator/dataset-for-developing-natural-language-interfac
Explore at:
zip(21491264 bytes)Available download formats
Dataset updated
Nov 25, 2022
Authors
The Devastator
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
WikiSQL (Questions and SQL Queries)

80654 hand-annotated questions and SQL queries on 24241 Wikipedia tables

By Huggingface Hub [source]

About this dataset

A large crowd-sourced dataset for developing natural language interfaces for relational databases. WikiSQL is a dataset of 80654 hand-annotated examples of questions and SQL queries distributed across 24241 tables from Wikipedia.

More Datasets

For more datasets, click here.

Featured Notebooks

🚨 Your notebook can be here! 🚨!

How to use the dataset

This dataset can be used to develop natural language interfaces for relational databases. The data fields are the same among all splits, and the file contains information on the phase, question, table, and SQL for each interface

Research Ideas

This dataset can be used to develop natural language interfaces for relational databases.

This dataset can be used to develop a knowledge base of common SQL queries.

This dataset can be used to generate a training set for a neural network that translates natural language into SQL queries

Acknowledgements

If you use this dataset in your research, please credit the original authors.

Data Source

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: validation.csv | Column name | Description | |:--------------|:---------------------------------------------------------| | phase | The phase of the data collection. (String) | | question | The question asked by the user. (String) | | table | The table containing the data for the question. (String) | | sql | The SQL query corresponding to the question. (String) |

File: train.csv | Column name | Description | |:--------------|:---------------------------------------------------------| | phase | The phase of the data collection. (String) | | question | The question asked by the user. (String) | | table | The table containing the data for the question. (String) | | sql | The SQL query corresponding to the question. (String) |

File: test.csv | Column name | Description | |:--------------|:---------------------------------------------------------| | phase | The phase of the data collection. (String) | | question | The question asked by the user. (String) | | table | The table containing the data for the question. (String) | | sql | The SQL query corresponding to the question. (String) |

Acknowledgements

If you use this dataset in your research, please credit the original authors. If you use this dataset in your research, please credit Huggingface Hub.
TREC 2022 Deep Learning test collection
catalog.data.gov
gimi9.com
+1more
Updated May 9, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Institute of Standards and Technology (2023). TREC 2022 Deep Learning test collection [Dataset]. https://catalog.data.gov/dataset/trec-2022-deep-learning-test-collection
Explore at:
Dataset updated
May 9, 2023
Dataset provided by
National Institute of Standards and Technologyhttp://www.nist.gov/
Description
This is a test collection for passage and document retrieval, produced in the TREC 2023 Deep Learning track. The Deep Learning Track studies information retrieval in a large training data regime. This is the case where the number of training queries with at least one positive label is at least in the tens of thousands, if not hundreds of thousands or more. This corresponds to real-world scenarios such as training based on click logs and training based on labels from shallow pools (such as the pooling in the TREC Million Query Track or the evaluation of search engines based on early precision).Certain machine learning based methods, such as methods based on deep learning are known to require very large datasets for training. Lack of such large scale datasets has been a limitation for developing such methods for common information retrieval tasks, such as document ranking. The Deep Learning Track organized in the previous years aimed at providing large scale datasets to TREC, and create a focused research effort with a rigorous blind evaluation of ranker for the passage ranking and document ranking tasks.Similar to the previous years, one of the main goals of the track in 2022 is to study what methods work best when a large amount of training data is available. For example, do the same methods that work on small data also work on large data? How much do methods improve when given more training data? What external data and models can be brought in to bear in this scenario, and how useful is it to combine full supervision with other forms of supervision?The collection contains 12 million web pages, 138 million passages from those web pages, search queries, and relevance judgments for the queries.
Cyber-Physical System power Consumption
zenodo.org
bin, csv, zip
Updated Nov 26, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gabriel Iuhasz; Gabriel Iuhasz; Teodor-Florin Fortis; Teodor-Florin Fortis (2024). Cyber-Physical System power Consumption [Dataset]. http://doi.org/10.5281/zenodo.14215756
Explore at:
bin, csv, zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.14215756
Dataset updated
Nov 26, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Gabriel Iuhasz; Gabriel Iuhasz; Teodor-Florin Fortis; Teodor-Florin Fortis
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Files

This dataset is comprised of 5 CSV files contained in the data.zip archive. Each one represents a production machine from which various sensor data has been collected. The average cadence for collection was 5 measurements per second. The monitored devices where used for hydroforming.

The collection period covered the period from 2023-06-01 until 2023-08-05.

Data

These files represent a complete data dump from the data available in the time-series database, InfluxDB, used for collection. Because of this some columns have no semantic value for detecting production cycles or any other analytics.

Each file contains a total of 14 columns. Some of the columns are artefacts of the query used to extract the data from InfluxDB and can be discarded. These columns are: results, table _start, _stop

results - An artefact of the InfluxDB query, signifies postprocessing of results in this dataset. It is "mean".

table - An artefact of the InfluxDB query, can be discarded.

_start and _stop - Refers to ingestion related data, used in monitoring ingestion.

_field - An artefact of the InfluxDB query, specifying what field to use for the query.

_measurement - An artefact of the InfluxDB query, specifying what measurement to use for the query. Contains the same information as device_id.

host - An artefact of the InfluxDB query, the unique name of the host used for the InfluxDB sink in Kubernetes.

kafka_topic - Name of the Kafka topic used for collection.

Pertinent columns are:

_time - Denotes the time at which a particular event has been measured, it is used as index when creating a dataframe.

_time.1 - Duplicate of _time for sanity check and ease of analysis when _time is set as index

_value - Represents the value measured by each sensor type.

device_id - Unique identifier of the manufacturing device, should be the same as the file name, i.e. B827EB8D8E0C.

ingestion_time - Timestamp when the data has been collected and ingested by influxDB.

sid - Unique sensor ID; the power measurements can be found at sid 1.

Annotations

There are two additional files which contain annotation data:

scamp_devices.csv - Contains mapping information between the dataset device ID (defined in column "DeviceIDMonitoring") and the ground truth file ID (defined in column "DeviceID")

scamp_report_3m.csv - Contains the ground truth, which can be used for validation of cycle detection and analysis methods. The columns are as follows:

ReportID - Internal unique ID created during data collection. It can be discarded.

JobID - Internal Scheduling Job unique ID.

DeviceID - The unique ID of the devices used for manufacturing needs to be mapped using the scamp_device.csv data.

StartTime - Start time of operations

EndTime - End time of operations

ProductID - Unique identifier of the product being manufactured.

CycleTime - Average length of cycle in seconds, added manually by operators. It can be unreliable.

QuantityProduced - Number of products manufactured during the timeframe given by StartTime and EndTime.

QuantityScrap - Number of scraped/malformed products in the given timeframe. These are part of the QuantityProduced, not in addition to it.

IntreruptionMinuted - Minutes of production halt.

scamp_patterns.csv - Contains the start and end timestamp for selected example production cycles. These where chosen based on expert users.

Jupyter Notebook

We have provided a sample Jupyter notebook (verify_data.ipynb), which gives examples of how the dataset can be loaded and visualised as well as examples of how the sample patterns and ground truth can be addressed and visualised.

Note

The Jupyter Notebook contains an example of how the data can be loaded and visualised. Please note that both data should be filtered based on sid; the power measurements are collected by sid 1. See Notebook for example.
d
Johns Hopkins COVID-19 Case Tracker
data.world
kaggle.com
csv, zip
Updated Dec 3, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The Associated Press (2025). Johns Hopkins COVID-19 Case Tracker [Dataset]. https://data.world/associatedpress/johns-hopkins-coronavirus-case-tracker
Explore at:
zip, csvAvailable download formats
Dataset updated
Dec 3, 2025
Authors
The Associated Press
Time period covered
Jan 22, 2020 - Mar 9, 2023
Area covered
Description
Updates

Notice of data discontinuation: Since the start of the pandemic, AP has reported case and death counts from data provided by Johns Hopkins University. Johns Hopkins University has announced that they will stop their daily data collection efforts after March 10. As Johns Hopkins stops providing data, the AP will also stop collecting daily numbers for COVID cases and deaths. The HHS and CDC now collect and visualize key metrics for the pandemic. AP advises using those resources when reporting on the pandemic going forward.

CDC Weekly case and death counts (national and state level)

CDC County level cases and deaths

HHS New hospital admissions

CDC NowCast COVID variant proportions (national and regional level)

April 9, 2020

The population estimate data for New York County, NY has been updated to include all five New York City counties (Kings County, Queens County, Bronx County, Richmond County and New York County). This has been done to match the Johns Hopkins COVID-19 data, which aggregates counts for the five New York City counties to New York County.

April 20, 2020

Johns Hopkins death totals in the US now include confirmed and probable deaths in accordance with CDC guidelines as of April 14. One significant result of this change was an increase of more than 3,700 deaths in the New York City count. This change will likely result in increases for death counts elsewhere as well. The AP does not alter the Johns Hopkins source data, so probable deaths are included in this dataset as well.

April 29, 2020

The AP is now providing timeseries data for counts of COVID-19 cases and deaths. The raw counts are provided here unaltered, along with a population column with Census ACS-5 estimates and calculated daily case and death rates per 100,000 people. Please read the updated caveats section for more information.

September 1st, 2020

Johns Hopkins is now providing counts for the five New York City counties individually.

February 12, 2021

The Ohio Department of Health recently announced that as many as 4,000 COVID-19 deaths may have been underreported through the state’s reporting system, and that the "daily reported death counts will be high for a two to three-day period."

Because deaths data will be anomalous for consecutive days, we have chosen to freeze Ohio's rolling average for daily deaths at the last valid measure until Johns Hopkins is able to back-distribute the data. The raw daily death counts, as reported by Johns Hopkins and including the backlogged death data, will still be present in the new_deaths column.

February 16, 2021

- Johns Hopkins has reconciled Ohio's historical deaths data with the state.

Overview

The AP is using data collected by the Johns Hopkins University Center for Systems Science and Engineering as our source for outbreak caseloads and death counts for the United States and globally.

The Hopkins data is available at the county level in the United States. The AP has paired this data with population figures and county rural/urban designations, and has calculated caseload and death rates per 100,000 people. Be aware that caseloads may reflect the availability of tests -- and the ability to turn around test results quickly -- rather than actual disease spread or true infection rates.

This data is from the Hopkins dashboard that is updated regularly throughout the day. Like all organizations dealing with data, Hopkins is constantly refining and cleaning up their feed, so there may be brief moments where data does not appear correctly. At this link, you’ll find the Hopkins daily data reports, and a clean version of their feed.

The AP is updating this dataset hourly at 45 minutes past the hour.

To learn more about AP's data journalism capabilities for publishers, corporations and financial institutions, go here or email kromano@ap.org.

Queries

Use AP's queries to filter the data or to join to other datasets we've made available to help cover the coronavirus pandemic

Filter cases by state here

Rank states by their status as current hotspots. Calculates the 7-day rolling average of new cases per capita in each state: https://data.world/associatedpress/johns-hopkins-coronavirus-case-tracker/workspace/query?queryid=481e82a4-1b2f-41c2-9ea1-d91aa4b3b1ac

Find recent hotspots within your state by running a query to calculate the 7-day rolling average of new cases by capita in each county: https://data.world/associatedpress/johns-hopkins-coronavirus-case-tracker/workspace/query?queryid=b566f1db-3231-40fe-8099-311909b7b687&showTemplatePreview=true

Join county-level case data to an earlier dataset released by AP on local hospital capacity here. To find out more about the hospital capacity dataset, see the full details.

Pull the 100 counties with the highest per-capita confirmed cases here

Rank all the counties by the highest per-capita rate of new cases in the past 7 days here. Be aware that because this ranks per-capita caseloads, very small counties may rise to the very top, so take into account raw caseload figures as well.

Interactive

The AP has designed an interactive map to track COVID-19 cases reported by Johns Hopkins.

@(https://datawrapper.dwcdn.net/nRyaf/15/)

Interactive Embed Code

<iframe title="USA counties (2018) choropleth map Mapping COVID-19 cases by county" aria-describedby="" id="datawrapper-chart-nRyaf" src="https://datawrapper.dwcdn.net/nRyaf/10/" scrolling="no" frameborder="0" style="width: 0; min-width: 100% !important;" height="400"></iframe><script type="text/javascript">(function() {'use strict';window.addEventListener('message', function(event) {if (typeof event.data['datawrapper-height'] !== 'undefined') {for (var chartId in event.data['datawrapper-height']) {var iframe = document.getElementById('datawrapper-chart-' + chartId) || document.querySelector("iframe[src*='" + chartId + "']");if (!iframe) {continue;}iframe.style.height = event.data['datawrapper-height'][chartId] + 'px';}}});})();</script>

Caveats

This data represents the number of cases and deaths reported by each state and has been collected by Johns Hopkins from a number of sources cited on their website.

In some cases, deaths or cases of people who've crossed state lines -- either to receive treatment or because they became sick and couldn't return home while traveling -- are reported in a state they aren't currently in, because of state reporting rules.

In some states, there are a number of cases not assigned to a specific county -- for those cases, the county name is "unassigned to a single county"

This data should be credited to Johns Hopkins University's COVID-19 tracking project. The AP is simply making it available here for ease of use for reporters and members.

Caseloads may reflect the availability of tests -- and the ability to turn around test results quickly -- rather than actual disease spread or true infection rates.

Population estimates at the county level are drawn from 2014-18 5-year estimates from the American Community Survey.

The Urban/Rural classification scheme is from the Center for Disease Control and Preventions's National Center for Health Statistics. It puts each county into one of six categories -- from Large Central Metro to Non-Core -- according to population and other characteristics. More details about the classifications can be found here.

Johns Hopkins timeseries data - Johns Hopkins pulls data regularly to update their dashboard. Once a day, around 8pm EDT, Johns Hopkins adds the counts for all areas they cover to the timeseries file. These counts are snapshots of the latest cumulative counts provided by the source on that day. This can lead to inconsistencies if a source updates their historical data for accuracy, either increasing or decreasing the latest cumulative count. - Johns Hopkins periodically edits their historical timeseries data for accuracy. They provide a file documenting all errors in their timeseries files that they have identified and fixed here

Attribution

This data should be credited to Johns Hopkins University COVID-19 tracking project
h
AOL-500k-User-Session-Collection
huggingface.co
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Max Isom, AOL-500k-User-Session-Collection [Dataset]. https://huggingface.co/datasets/max-chroma/AOL-500k-User-Session-Collection
Explore at:
Authors
Max Isom
Description
500k User Session Collection

This collection is distributed for NON-COMMERCIAL RESEARCH USE ONLY. Any application of this collection for commercial purposes is STRICTLY PROHIBITED. Brief description: This collection consists of ~20M web queries collected from ~650k users over three months. The data is sorted by anonymous user ID and sequentially arranged. The goal of this collection is to provide real query log data that is based on real users. It could be used for personalization… See the full description on the dataset page: https://huggingface.co/datasets/max-chroma/AOL-500k-User-Session-Collection.
h
answersumm
huggingface.co
Updated Sep 3, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alexander Fabbri (2022). answersumm [Dataset]. https://huggingface.co/datasets/alexfabbri/answersumm
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 3, 2022
Authors
Alexander Fabbri
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Dataset Card for answersumm

Dataset Summary

The AnswerSumm dataset is an English-language dataset of questions and answers collected from a StackExchange data dump. The dataset was created to support the task of query-focused answer summarization with an emphasis on multi-perspective answers. The dataset consists of over 4200 such question-answer threads annotated by professional linguists and includes over 8700 summaries. We decompose the task into several annotation… See the full description on the dataset page: https://huggingface.co/datasets/alexfabbri/answersumm.
Air Markets Program Data (AMPD)
data.wu.ac.at
data.amerigeoss.org
csv
Updated Jan 1, 2014
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Environmental Protection Agency (2014). Air Markets Program Data (AMPD) [Dataset]. https://data.wu.ac.at/schema/data_gov/OWNkYjNiZjUtZmIxZC00MmM3LWJmMjctMDViOTA3NjA2OWIx
Explore at:
csvAvailable download formats
Dataset updated
Jan 1, 2014
Dataset provided by
United States Environmental Protection Agencyhttp://www.epa.gov/
Area covered
e8886aabdfc0d065a2972b24a4c4c39d6972254b
Description
The Air Markets Program Data tool allows users to search EPA data to answer scientific, general, policy, and regulatory questions about industry emissions. Air Markets Program Data (AMPD) is a web-based application that allows users easy access to both current and historical data collected as part of EPA's emissions trading programs. This site allows you to create and view reports and to download emissions data for further analysis. AMPD provides a query tool so users can create custom queries of industry source emissions data, allowance data, compliance data, and facility attributes. In addition, AMPD provides interactive maps, charts, reports, and pre-packaged datasets. AMPD does not require any additional software, plug-ins, or security controls and can be accessed using a standard web browser.
Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter...
zenodo.org
bz2
Updated Mar 15, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
João Felipe; João Felipe; Leonardo; Leonardo; Vanessa; Vanessa; Juliana; Juliana (2021). Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks [Dataset]. http://doi.org/10.5281/zenodo.2592524
Explore at:
bz2Available download formats
Unique identifier
https://doi.org/10.5281/zenodo.2592524
Dataset updated
Mar 15, 2021
Dataset provided by
Zenodohttp://zenodo.org/
Authors
João Felipe; João Felipe; Leonardo; Leonardo; Vanessa; Vanessa; Juliana; Juliana
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourage poor coding practices and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.

Paper: https://2019.msrconf.org/event/msr-2019-papers-a-large-scale-study-about-quality-and-reproducibility-of-jupyter-notebooks

This repository contains two files:

dump.tar.bz2

jupyter_reproducibility.tar.bz2

The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.

The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:

analyses: this folder has all the notebooks we use to analyze the data in the PostgreSQL database.

archaeology: this folder has all the scripts we use to query, download, and extract data from GitHub notebooks.

paper: empty. The notebook analyses/N12.To.Paper.ipynb moves data to it

In the remaining of this text, we give instructions for reproducing the analyses, by using the data provided in the dump and reproducing the collection, by collecting data from GitHub again.

Reproducing the Analysis

This section shows how to load the data in the database and run the analyses notebooks. In the analysis, we used the following environment:

Ubuntu 18.04.1 LTS
PostgreSQL 10.6
Conda 4.5.11
Python 3.7.2
PdfCrop 2012/11/02 v1.38

First, download dump.tar.bz2 and extract it:

tar -xjf dump.tar.bz2

It extracts the file db2019-03-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:

psql jupyter < db2019-03-13.dump

It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTTION:

export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";

Download and extract jupyter_reproducibility.tar.bz2:

tar -xjf jupyter_reproducibility.tar.bz2

Create a conda environment with Python 3.7:

conda create -n analyses python=3.7 conda activate analyses

Go to the analyses folder and install all the dependencies of the requirements.txt

cd jupyter_reproducibility/analyses pip install -r requirements.txt

For reproducing the analyses, run jupyter on this folder:

jupyter notebook

Execute the notebooks on this order:

Index.ipynb

N0.Repository.ipynb

N1.Skip.Notebook.ipynb

N2.Notebook.ipynb

N3.Cell.ipynb

N4.Features.ipynb

N5.Modules.ipynb

N6.AST.ipynb

N7.Name.ipynb

N8.Execution.ipynb

N9.Cell.Execution.Order.ipynb

N10.Markdown.ipynb

N11.Repository.With.Notebook.Restriction.ipynb

N12.To.Paper.ipynb

Reproducing or Expanding the Collection

The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.

Requirements

This time, we have extra requirements:

All the analysis requirements
lbzip2 2.5
gcc 7.3.0
Github account
Gmail account

Environment

First, set the following environment variables:

export JUP_MACHINE="db"; # machine identifier export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories export JUP_LOGS_DIR="/home/jupyter/logs"; # log files export JUP_COMPRESSION="lbzip2"; # compression program export JUP_VERBOSE="5"; # verbose level export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlchemy connection export JUP_GITHUB_USERNAME="github_username"; # your github username export JUP_GITHUB_PASSWORD="github_password"; # your github password export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB) export JUP_FIRST_DATE="2013-01-01"; # initial date to query github export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address export JUP_EMAIL_TO="target@email.com"; # email that receives notifications export JUP_OAUTH_FILE="~/oauth2_creds.json" # oauth2 auhentication file export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it in blank export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it in blank export JUP_WITH_EXECUTION="1"; # run execute python notebooks export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependnecies export JUP_EXECUTION_MODE="-1"; # run following the execution order export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir export JUP_NOTEBOOK_TIMEOUT="300"; # timeout the extraction # Frequenci of log report export JUP_ASTROID_FREQUENCY="5"; export JUP_IPYTHON_FREQUENCY="5"; export JUP_NOTEBOOKS_FREQUENCY="5"; export JUP_REQUIREMENT_FREQUENCY="5"; export JUP_CRAWLER_FREQUENCY="1"; export JUP_CLONE_FREQUENCY="1"; export JUP_COMPRESS_FREQUENCY="5"; export JUP_DB_IP="localhost"; # postgres database IP

Then, configure the file ~/oauth2_creds.json, according to yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf

Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the directories. The second one should umount it. You can leave the scripts in blank, but it is not advisable, as the reproducibility study runs arbitrary code on your machine and you may lose your data.

Scripts

Download and extract jupyter_reproducibility.tar.bz2:

tar -xjf jupyter_reproducibility.tar.bz2

Install 5 conda environments and 5 anaconda environments, for each python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (Note that it is a local package that has not been published to pypi. Make sure to use the -e option):

Conda 2.7

conda create -n raw27 python=2.7 -y conda activate raw27 pip install --upgrade pip pip install pipenv pip install -e jupyter_reproducibility/archaeology

Anaconda 2.7

conda create -n py27 python=2.7 anaconda -y conda activate py27 pip install --upgrade pip pip install pipenv pip install -e jupyter_reproducibility/archaeology

Conda 3.4

It requires a manual jupyter and pathlib2 installation due to some incompatibilities found on the default installation.

conda create -n raw34 python=3.4 -y conda activate raw34 conda install jupyter -c conda-forge -y conda uninstall jupyter -y pip install --upgrade pip pip install jupyter pip install pipenv pip install -e jupyter_reproducibility/archaeology pip install pathlib2

Anaconda 3.4

conda create -n py34 python=3.4 anaconda -y conda activate py34 pip install --upgrade pip pip install pipenv pip install -e jupyter_reproducibility/archaeology

Conda 3.5

conda create -n raw35 python=3.5 -y conda activate raw35 pip install --upgrade pip pip install pipenv pip install -e jupyter_reproducibility/archaeology

Anaconda 3.5

It requires the manual installation of other anaconda packages.

conda create -n py35 python=3.5 anaconda -y conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator conda activate py35 pip install --upgrade pip pip install pipenv pip install -e jupyter_reproducibility/archaeology

Conda 3.6

conda create -n raw36 python=3.6 -y conda activate raw36 pip install --upgrade pip pip install pipenv pip install -e jupyter_reproducibility/archaeology

Anaconda 3.6

conda create -n py36 python=3.6 anaconda -y conda activate py36 conda install -y anaconda-navigator jupyterlab_server navigator-updater pip install --upgrade pip pip install pipenv pip install -e jupyter_reproducibility/archaeology

Conda 3.7

<code
g
Service Stations in Münster | gimi9.com
gimi9.com
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Service Stations in Münster | gimi9.com [Dataset]. https://gimi9.com/dataset/eu_26fd36e8-5e24-45a3-b579-6c4a08e7dfda/
Explore at:
Description
This dataset contains a list of petrol station locations in the Münster area. The data is queried directly from the OpenStreetMaps database. Please note the following information about the data: The query does not exactly cover the city area, but a rectangular frame around the city centre. The database of OpenStreetMaps may not be complete. These data are collected by volunteers and usually have a very good quality, but this is not official information from the city administration. The license terms of OpenStreetMap Germany apply: http://www.openstreetmap.de/faq.html#lizenz OpenStreetMap is a project founded in 2004 with the aim of creating a free world map. Volunteers from many countries work on the further development of the software as well as the collection and processing of geodata. Data is collected about roads, railways, rivers, forests, houses and everything else that is commonly seen on maps. This record links directly to the current data query at overpass-api.de. If you find incorrect or missing information in this record, you can log in to OpenStreetMaps and help improve the database yourself. You can complete or correct the data, similar to Wikipedia. How to do this, find out under: https://www.openstreetmap.de/faq.html#wie_mitmachen Furthermore, you can customise this data query yourself and extend the query result to the entire Münsterland or beyond, because the OpenStreetMap database contains Germany-wide data. You can find an English guide for the query language to formulate your data query at the following address: https://wiki.openstreetmap.org/wiki/Overpass_API/Language_Guide
Scientific Data Provenance in R: RDataTracker and DDG Explorer
search.dataone.org
portal.edirepository.org
Updated Aug 25, 2014
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Barbara Lerner; Emery Boose; Aaron Ellison; Leon Osterweil (2014). Scientific Data Provenance in R: RDataTracker and DDG Explorer [Dataset]. https://search.dataone.org/view/knb-lter-hfr.91.17
Explore at:
Dataset updated
Aug 25, 2014
Dataset provided by
Long Term Ecological Research Networkhttp://www.lternet.edu/
Authors
Barbara Lerner; Emery Boose; Aaron Ellison; Leon Osterweil
Description
Scientific data provenance is the information required to document the history of an item of data, including how it was created and how it was transformed. Data provenance has great potential to improve the transparency, reliability, and reproducibility of scientific results. However it has been little used to date by domain scientists because most systems that collect provenance require scientists to learn specialized software tools and jargon. This project is developing tools that allow scientists to collect, visualize, and query provenance directly from the R statistical language. The first tool (RDataTracker) is a library of R functions that can be downloaded and installed as an R package. RDataTracker allows the scientist to collect data provenance during an R console session or while executing an R script. The resulting provenance is stored on the scientist's computer as a DDG (data derivation graph) file. The second tool (DDG Explorer) is a stand-alone Java program that can be downloaded and run to visualize, store, and query DDGs. The third tool is an R script (DDGCheckpoint.R) may be used with RDataTRacker to create and restore checkpoints that store the R environment and user files. Documentation for all tools is included with the RDataTracker package or may be downloaded separately.
n
Repository Analytics and Metrics Portal (RAMP) 2018 data
data.niaid.nih.gov
dataone.org
+1more
zip
Updated Jul 27, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jonathan Wheeler; Kenning Arlitsch (2021). Repository Analytics and Metrics Portal (RAMP) 2018 data [Dataset]. http://doi.org/10.5061/dryad.ffbg79cvp
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.ffbg79cvp
Dataset updated
Jul 27, 2021
Dataset provided by
Montana State University
University of New Mexico
Authors
Jonathan Wheeler; Kenning Arlitsch
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
The Repository Analytics and Metrics Portal (RAMP) is a web service that aggregates use and performance use data of institutional repositories. The data are a subset of data from RAMP, the Repository Analytics and Metrics Portal (http://rampanalytics.org), consisting of data from all participating repositories for the calendar year 2018. For a description of the data collection, processing, and output methods, please see the "methods" section below. Note that the RAMP data model changed in August, 2018 and two sets of documentation are provided to describe data collection and processing before and after the change.

Methods

RAMP Data Documentation – January 1, 2017 through August 18, 2018

Data Collection

RAMP data were downloaded for participating IR from Google Search Console (GSC) via the Search Console API. The data consist of aggregated information about IR pages which appeared in search result pages (SERP) within Google properties (including web search and Google Scholar).

Data from January 1, 2017 through August 18, 2018 were downloaded in one dataset per participating IR. The following fields were downloaded for each URL, with one row per URL:

url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property. impressions: The number of times the URL appears within the SERP. clicks: The number of clicks on a URL which took users to a page outside of the SERP. clickThrough: Calculated as the number of clicks divided by the number of impressions. position: The position of the URL within the SERP. country: The country from which the corresponding search originated. device: The device used for the search. date: The date of the search.

Following data processing describe below, on ingest into RAMP an additional field, citableContent, is added to the page level data.

Note that no personally identifiable information is downloaded by RAMP. Google does not make such information available.

More information about click-through rates, impressions, and position is available from Google's Search Console API documentation: https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query and https://support.google.com/webmasters/answer/7042828?hl=en

Data Processing

Upon download from GSC, data are processed to identify URLs that point to citable content. Citable content is defined within RAMP as any URL which points to any type of non-HTML content file (PDF, CSV, etc.). As part of the daily download of statistics from Google Search Console (GSC), URLs are analyzed to determine whether they point to HTML pages or actual content files. URLs that point to content files are flagged as "citable content." In addition to the fields downloaded from GSC described above, following this brief analysis one more field, citableContent, is added to the data which records whether each URL in the GSC data points to citable content. Possible values for the citableContent field are "Yes" and "No."

Processed data are then saved in a series of Elasticsearch indices. From January 1, 2017, through August 18, 2018, RAMP stored data in one index per participating IR.

About Citable Content Downloads

Data visualizations and aggregations in RAMP dashboards present information about citable content downloads, or CCD. As a measure of use of institutional repository content, CCD represent click activity on IR content that may correspond to research use.

CCD information is summary data calculated on the fly within the RAMP web application. As noted above, data provided by GSC include whether and how many times a URL was clicked by users. Within RAMP, a "click" is counted as a potential download, so a CCD is calculated as the sum of clicks on pages/URLs that are determined to point to citable content (as defined above).

For any specified date range, the steps to calculate CCD are:

Filter data to only include rows where "citableContent" is set to "Yes." Sum the value of the "clicks" field on these rows.

Output to CSV

Published RAMP data are exported from the production Elasticsearch instance and converted to CSV format. The CSV data consist of one "row" for each page or URL from a specific IR which appeared in search result pages (SERP) within Google properties as described above.

The data in these CSV files include the following fields:

url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property. impressions: The number of times the URL appears within the SERP. clicks: The number of clicks on a URL which took users to a page outside of the SERP. clickThrough: Calculated as the number of clicks divided by the number of impressions. position: The position of the URL within the SERP. country: The country from which the corresponding search originated. device: The device used for the search. date: The date of the search. citableContent: Whether or not the URL points to a content file (ending with pdf, csv, etc.) rather than HTML wrapper pages. Possible values are Yes or No. index: The Elasticsearch index corresponding to page click data for a single IR. repository_id: This is a human readable alias for the index and identifies the participating repository corresponding to each row. As RAMP has undergone platform and version migrations over time, index names as defined for the index field have not remained consistent. That is, a single participating repository may have multiple corresponding Elasticsearch index names over time. The repository_id is a canonical identifier that has been added to the data to provide an identifier that can be used to reference a single participating repository across all datasets. Filtering and aggregation for individual repositories or groups of repositories should be done using this field.

Filenames for files containing these data follow the format 2018-01_RAMP_all.csv. Using this example, the file 2018-01_RAMP_all.csv contains all data for all RAMP participating IR for the month of January, 2018.

Data Collection from August 19, 2018 Onward

RAMP data are downloaded for participating IR from Google Search Console (GSC) via the Search Console API. The data consist of aggregated information about IR pages which appeared in search result pages (SERP) within Google properties (including web search and Google Scholar).

Data are downloaded in two sets per participating IR. The first set includes page level statistics about URLs pointing to IR pages and content files. The following fields are downloaded for each URL, with one row per URL:

url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property. impressions: The number of times the URL appears within the SERP. clicks: The number of clicks on a URL which took users to a page outside of the SERP. clickThrough: Calculated as the number of clicks divided by the number of impressions. position: The position of the URL within the SERP. date: The date of the search.

Following data processing describe below, on ingest into RAMP a additional field, citableContent, is added to the page level data.

The second set includes similar information, but instead of being aggregated at the page level, the data are grouped based on the country from which the user submitted the corresponding search, and the type of device used. The following fields are downloaded for combination of country and device, with one row per country/device combination:

country: The country from which the corresponding search originated. device: The device used for the search. impressions: The number of times the URL appears within the SERP. clicks: The number of clicks on a URL which took users to a page outside of the SERP. clickThrough: Calculated as the number of clicks divided by the number of impressions. position: The position of the URL within the SERP. date: The date of the search.

Note that no personally identifiable information is downloaded by RAMP. Google does not make such information available.

More information about click-through rates, impressions, and position is available from Google's Search Console API documentation: https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query and https://support.google.com/webmasters/answer/7042828?hl=en

Data Processing

Upon download from GSC, the page level data described above are processed to identify URLs that point to citable content. Citable content is defined within RAMP as any URL which points to any type of non-HTML content file (PDF, CSV, etc.). As part of the daily download of page level statistics from Google Search Console (GSC), URLs are analyzed to determine whether they point to HTML pages or actual content files. URLs that point to content files are flagged as "citable content." In addition to the fields downloaded from GSC described above, following this brief analysis one more field, citableContent, is added to the page level data which records whether each page/URL in the GSC data points to citable content. Possible values for the citableContent field are "Yes" and "No."

The data aggregated by the search country of origin and device type do not include URLs. No additional processing is done on these data. Harvested data are passed directly into Elasticsearch.

Processed data are then saved in a series of Elasticsearch indices. Currently, RAMP stores data in two indices per participating IR. One index includes the page level data, the second index includes the country of origin and device type data.

About Citable Content Downloads

Data visualizations and aggregations in RAMP dashboards present information about citable content downloads, or CCD. As a measure of use of institutional repository
n
FLOWRepository
neuinfo.org
dknet.org
Updated Jan 29, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2022). FLOWRepository [Dataset]. http://identifiers.org/RRID:SCR_013779/resolver?q=&i=rrid
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_013779 https://identifiers.org/RRID:SCR_013779/resolver?q=&i=rrid
Dataset updated
Jan 29, 2022
Description
A database of flow cytometry experiments where users can query and download data collected and annotated according to the MIFlowCyt data standard.
Z
Data from: SQL Injection Attack Netflow
data.niaid.nih.gov
portalcienciaytecnologia.jcyl.es
+3more
Updated Sep 28, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ignacio Crespo; Adrián Campazas (2022). SQL Injection Attack Netflow [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6907251
Explore at:
Dataset updated
Sep 28, 2022
Authors
Ignacio Crespo; Adrián Campazas
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Introduction

This datasets have SQL injection attacks (SLQIA) as malicious Netflow data. The attacks carried out are SQL injection for Union Query and Blind SQL injection. To perform the attacks, the SQLMAP tool has been used.

NetFlow traffic has generated using DOROTHEA (DOcker-based fRamework fOr gaTHering nEtflow trAffic). NetFlow is a network protocol developed by Cisco for the collection and monitoring of network traffic flow data generated. A flow is defined as a unidirectional sequence of packets with some common properties that pass through a network device.

Datasets

The firts dataset was colleted to train the detection models (D1) and other collected using different attacks than those used in training to test the models and ensure their generalization (D2).

The datasets contain both benign and malicious traffic. All collected datasets are balanced.

The version of NetFlow used to build the datasets is 5.

Dataset Aim Samples Benign-malicious traffic ratio D1 Training 400,003 50% D2 Test 57,239 50%

Infrastructure and implementation

Two sets of flow data were collected with DOROTHEA. DOROTHEA is a Docker-based framework for NetFlow data collection. It allows you to build interconnected virtual networks to generate and collect flow data using the NetFlow protocol. In DOROTHEA, network traffic packets are sent to a NetFlow generator that has a sensor ipt_netflow installed. The sensor consists of a module for the Linux kernel using Iptables, which processes the packets and converts them to NetFlow flows.

DOROTHEA is configured to use Netflow V5 and export the flow after it is inactive for 15 seconds or after the flow is active for 1800 seconds (30 minutes)

Benign traffic generation nodes simulate network traffic generated by real users, performing tasks such as searching in web browsers, sending emails, or establishing Secure Shell (SSH) connections. Such tasks run as Python scripts. Users may customize them or even incorporate their own. The network traffic is managed by a gateway that performs two main tasks. On the one hand, it routes packets to the Internet. On the other hand, it sends it to a NetFlow data generation node (this process is carried out similarly to packets received from the Internet).

The malicious traffic collected (SQLI attacks) was performed using SQLMAP. SQLMAP is a penetration tool used to automate the process of detecting and exploiting SQL injection vulnerabilities.

The attacks were executed on 16 nodes and launch SQLMAP with the parameters of the following table.

Parameters Description '--banner','--current-user','--current-db','--hostname','--is-dba','--users','--passwords','--privileges','--roles','--dbs','--tables','--columns','--schema','--count','--dump','--comments', --schema' Enumerate users, password hashes, privileges, roles, databases, tables and columns --level=5 Increase the probability of a false positive identification --risk=3 Increase the probability of extracting data --random-agent Select the User-Agent randomly --batch Never ask for user input, use the default behavior --answers="follow=Y" Predefined answers to yes

Every node executed SQLIA on 200 victim nodes. The victim nodes had deployed a web form vulnerable to Union-type injection attacks, which was connected to the MYSQL or SQLServer database engines (50% of the victim nodes deployed MySQL and the other 50% deployed SQLServer).

The web service was accessible from ports 443 and 80, which are the ports typically used to deploy web services. The IP address space was 182.168.1.1/24 for the benign and malicious traffic-generating nodes. For victim nodes, the address space was 126.52.30.0/24. The malicious traffic in the test sets was collected under different conditions. For D1, SQLIA was performed using Union attacks on the MySQL and SQLServer databases.

However, for D2, BlindSQL SQLIAs were performed against the web form connected to a PostgreSQL database. The IP address spaces of the networks were also different from those of D1. In D2, the IP address space was 152.148.48.1/24 for benign and malicious traffic generating nodes and 140.30.20.1/24 for victim nodes.

To run the MySQL server we ran MariaDB version 10.4.12. Microsoft SQL Server 2017 Express and PostgreSQL version 13 were used.
a
Decoding Home Values: The Power of Education vs. Race, Ethnicity, and Gender...
chi-phi-nmcdc.opendata.arcgis.com
Updated Jul 25, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
New Mexico Community Data Collaborative (2023). Decoding Home Values: The Power of Education vs. Race, Ethnicity, and Gender [Dataset]. https://chi-phi-nmcdc.opendata.arcgis.com/datasets/decoding-home-values-the-power-of-education-vs-race-ethnicity-and-gender
Explore at:
Dataset updated
Jul 25, 2023
Dataset authored and provided by
New Mexico Community Data Collaborative
Description
A detailed explanation of how this dataset was put together, including data sources and methodologies, follows below.Please see the "Terms of Use" section below for the Data DictionaryDATA ACQUISITION AND CLEANING PROCESSThis dataset was built from 5 separate datasets queried during the months of April and May 2023 from the Census Microdata System (link below):https://data.census.gov/mdat/#/All datasets include information on Property Value (VALP) by: Educational Attainment (SCHL), Gender (SEX), a specified race or ethnicity (RAC or HISP), and are grouped by Public Use Microdata Areas (PUMAS). PUMAS are geographic areas created by the Census bureau; they are weighted by land area and population to facilitate data analysis. Data also Included totals for the state of New Mexico, so 19 total geographies are represented. Datasets were downloaded separately by race and ethnicity because this was the only way to obtain the VALP, SCHL, and SEX variables intersectionally with race or ethnicity data. Datasets were downloaded separately by race and ethnicity because this was the only way to obtain the VALP, SCHL, and SEX variables intersectionally with race or ethnicity data. Cleaning each dataset started with recoding the SCHL and HISP variables - details on recoding can be found below.After recoding, each dataset was transposed so that PUMAS were rows and SCHL, VALP, SEX, and Race or Ethnicity variables were the columns.Median values were calculated in every case that recoding was necessary. As a result, all Property Values in this dataset reflect median values.At times the ACS data downloaded with zeros instead of the 'null' values in initial query results. The VALP variable also included a "-1" variable to reflect N/A values (details in variable notes). Both zeros and "-1" values were removed before calculating median values, both to keep the data true to the original query and to generate accurate median values.Recoding the SCHL variable resulted in 5 rows for each PUMA, reflecting the different levels of educational attainment in each region. Columns grouped variables by race or ethnicity and gender. Cell values were property values.All 5 datasets were joined after recoding and cleaning the data. Original datasets all include 95 rows with 5 separate Educational Attainment variables for each PUMA, including New Mexico State totals.Because 1 row was needed for each PUMA in order to map this data, the data was split by Educational Attainment (SCHL), resulting in 110 columns reflecting median property values for each race or ethnicity by gender and level of educational attainment.A short, unique 2 to 5 letter alias was created for each PUMA area in anticipation of needing a unique identifier to join the data with. GIS AND MAPPING PROCESSA PUMA shapefile was downloaded from the ACS site. The Shapefile can be downloaded here: https://tigerweb.geo.census.gov/arcgis/rest/services/TIGERweb/PUMA_TAD_TAZ_UGA_ZCTA/MapServerThe DBF from the PUMA shapefile was exported to Excel; this shapefile data included needed geographic information for mapping such as: GEOID, PUMACE. The UIDs created for each PUMA were added to the shapefile data; the PUMA shapfile data and ACS data were then joined on UID in JMP.The data table was joined to the shapefile in ARC GiIS, based on PUMA region (specifically GEOID text).The resulting shapefile was exported as a GDB (geodatabase) in order to keep 'Null' values in the data. GDBs are capable of including a rule allowing null values where shapefiles are not. This GDB was uploaded to NMCDCs Arc Gis platform. SYSTEMS USEDMS Excel was used for data cleaning, recoding, and deriving values. Recoding was done directly in the Microdata system when possible - but because the system is was in beta at the time of use some features were not functional at times.JMP was used to transpose, join, and split data. ARC GIS Desktop was used to create the shapefile uploaded to NMCDC's online platform. VARIABLE AND RECODING NOTESTIMEFRAME: Data was queried for the 5 year period of 2015 to 2019 because ACS changed its definiton for and methods of collecting data on race and ethinicity in 2020. The change resulted in greater aggregation and les granular data on variables from 2020 onward.Note: All Race Data reflects that respondants identified as the specified race alone or in combination with one or more other races.VARIABLE:ACS VARIABLE DEFINITIONACS VARIABLE NOTESDETAILS OR URL FOR RAW DATA DOWNLOADRACBLKBlack or African American ACS Query: RACBLK, SCHL, SEX, VALP 2019 5yrRACAIANAmerican Indian and Alaska Native ACS Query: RACAIAN, SCHL, SEX, VALP 2019 5yrRACASNAsian ACS Query: RACASN, SCHL, SEX, VALP 2019 5yrRACWHTWhite ACS Query: RACWHT, SCHL, SEX, VALP 2019 5yrHISPHispanic Origin ACS Query: HISP ORG, SCHL, SEX, VALP 2019 5yrHISP RECODE: 24 original separate variablesThe Hispanic Origin (HISP) variable originally included 24 subcategories reflecting Mexican, Central American, South American, and Caribbean Latino, and Spanish identities from each Latin American counry. 7 recoded VariablesThese 24 variables were recoded (grouped) into 7 simpler categories for data analysis: Not Spanish/Hispanic/Latino, Mexican, Caribbean Latino, Central American, South American, Spaniard, All other Spanish/Hispanic/Latino Female. Not Spanish/Hispanic/Latino was not really used in the final dataset as the race datasets provided that information.SCHLEducational Attainment25 original separate variablesThe Educational Attainment (SCHL) variable originally included 25 subcategories reflecting the education levels of adults (over 18) surveyed by the ACS. These include: Kindergarten, Grades 1 through 12 separately, 12th grade with no diploma, Highschool Diploma, GED or credential, less than 1 year of college, more than 1 year of college with no degree, Associate's Degree, Bachelor's Degree, Master's Degree, Professional Degree, and Doctorate Degree.SCHL RECODE: 5 recoded variablesThese 25 variables were recoded (grouped) into 5 simpler categories for data analysis: No High School Diploma, High School Diploma or GED, Some College, Bachelor's Degree, and Advanced or Professional DegreeSEXGender2 variables1 - Male, 2 - FemaleVALPProperty Value1 variableValues were rounded and top-coded by ACS for anonymity. The "-1" variable is defined as N/A (GQ/ Vacant lots except 'for sale only' and 'sold, not occupied' / not owned or being bought.) This variable reflects the median value of property owned by individuals of each race, ethnicity, gender, and educational attainment category.PUMAPublic Use Microdata Area18 PUMAsPUMAs in New Mexico can be viewed here:https://nmcdc.maps.arcgis.com/apps/mapviewer/index.html?webmap=d9fed35f558948ea9051efe9aa529eafData includes 19 total regions: 18 Pumas and NM State TotalsNOTES AND RESOURCESThe following resources and documentation were used to navigate the ACS PUMS system and to answer questions about variables:Census Microdata API User Guide:https://www.census.gov/data/developers/guidance/microdata-api-user-guide.Additional_Concepts.html#list-tab-1433961450Accessing PUMS Data:https://www.census.gov/programs-surveys/acs/microdata/access.htmlHow to use PUMS on data.census.govhttps://www.census.gov/programs-surveys/acs/microdata/mdat.html2019 PUMS Documentation:https://www.census.gov/programs-surveys/acs/microdata/documentation.2019.html#list-tab-13709392012014 to 2018 ACS PUMS Data Dictionary:https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2014-2018.pdf2019 PUMS Tiger/Line Shapefileshttps://www.census.gov/cgi-bin/geo/shapefiles/index.php?year=2019&layergroup=Public+Use+Microdata+Areas Note 1: NMCDC attemepted to contact analysts with the ACS system to clarify questions about variables, but did not receive a timely response. Documentation was then consulted.Note 2: All relevant documentation was reviewed and seems to imply that all survey questions were answered by adults, age 18 or over. Youth who have inherited property could potentially be reflected in this data.Dataset and feature service created in May 2023 by Renee Haley, Data Specialist, NMCDC.
H
Replication Data for: Advancing Privacy Research: A Novel Realistic1...
dataverse.harvard.edu
search.dataone.org
Updated Jul 19, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
AbdElRahman ElSaid (2023). Replication Data for: Advancing Privacy Research: A Novel Realistic1 Persona-Based Datase [Dataset]. http://doi.org/10.7910/DVN/GOHBTR
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/GOHBTR
Dataset updated
Jul 19, 2023
Dataset provided by
Harvard Dataverse
Authors
AbdElRahman ElSaid
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
We introduce a unique approach to privacy research by creating a virtual persona that mimics human web-searching behaviors. The persona's activities, categorized into 'morning', 'afternoon', and 'evening', were automated using the Selenium WebDriver, enabling the persona to conduct searches as a real user would. The resulting dataset comprises 1,537 records, each representing a unique search query. Each record contains the first two pages of a query result, including the query keywords and a list of the first 2 pages of the query result. The study offers a fresh perspective on the study of privacy and personalization in online environments. The potential for reusing this dataset is significant, as it can be applied to studies on privacy, data collection, and search engine personalization, and it can be used to develop and test algorithms and models that aim to protect user privacy.

Facebook

Twitter

Click to copy link

Link copied

Cite

The Devastator (2023). MetaMath QA [Dataset]. https://www.kaggle.com/datasets/thedevastator/metamathqa-performance-with-mistral-7b

MetaMath QA

Mathematical Questions for Large Language Models

Explore at:

2 scholarly articles cite this dataset (View in Google Scholar)

zip(78629842 bytes)Available download formats

Dataset updated

Nov 23, 2023

Authors

The Devastator

License

https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

Description

MetaMath QA

Mathematical Questions for Large Language Models

By Huggingface Hub [source]

About this dataset

This dataset contains meta-mathematics questions and answers collected from the Mistral-7B question-answering system. The responses, types, and queries are all provided in order to help boost the performance of MetaMathQA while maintaining high accuracy. With its well-structured design, this dataset provides users with an efficient way to investigate various aspects of question answering models and further understand how they function. Whether you are a professional or beginner, this dataset is sure to offer invaluable insights into the development of more powerful QA systems!

More Datasets

For more datasets, click here.

Featured Notebooks

🚨 Your notebook can be here! 🚨!

How to use the dataset

Data Dictionary

The MetaMathQA dataset contains three columns: response, type, and query. - Response: the response to the query given by the question answering system. (String) - Type: the type of query provided as input to the system. (String) - Query:the question posed to the system for which a response is required. (String)

Preparing data for analysis

It’s important that before you dive into analysis, you first familiarize yourself with what kind data values are present in each column and also check if any preprocessing needs to be done on them such as removing unwanted characters or filling in missing values etc., so that it can be used without any issue while training or testing your model further down in your process flow.

##### Training Models using Mistral 7B

Mistral 7B is an open source framework designed for building machine learning models quickly and easily from tabular (csv) datasets such as those found in this dataset 'MetaMathQA ' . After collecting and preprocessing your dataset accordingly Mistral 7B provides with support for various Machine Learning algorithms like Support Vector Machines (SVM), Logistic Regression , Decision trees etc , allowing one to select from various popular libraries these offered algorithms with powerful overall hyperparameter optimization techniques so soon after selecting algorithm configuration its good practice that one use GridSearchCV & RandomSearchCV methods further tune both optimizations during model building stages . Post selection process one can then go ahead validate performances of selected models through metrics like accuracy score , F1 Metric , Precision Score & Recall Scores .

##### Testing phosphors :

After successful completion building phase right way would be robustly testing phosphors on different evaluation metrics mentioned above Model infusion stage helps here immediately make predictions based on earlier trained model OK auto back new test cases presented by domain experts could hey run quality assurance check again base score metrics mentioned above know asses confidence value post execution HHO updating baseline scores running experiments better preferred methodology AI workflows because Core advantage finally being have relevancy inexactness induced errors altogether impact low

Research Ideas

Generating natural language processing (NLP) models to better identify patterns and connections between questions, answers, and types.

Developing understandings on the efficiency of certain language features in producing successful question-answering results for different types of queries.

Optimizing search algorithms that surface relevant answer results based on types of queries

Acknowledgements

If you use this dataset in your research, please credit the original authors. Data Source

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: train.csv | Column name | Description | |:--------------|:------------------------------------| | response | The response to the query. (String) | | type | The type of query. (String) |

Acknowledgements

If you use this dataset in your research, please credit the original authors. If you use this dataset in your research, please credit Huggingface Hub.

Clear search

Close search

Google apps

Main menu

MetaMath QA

MetaMath QA

Mathematical Questions for Large Language Models

About this dataset

More Datasets

Featured Notebooks

How to use the dataset

Data Dictionary

Preparing data for analysis

Research Ideas

Acknowledgements

License

Columns

Acknowledgements

LC-QuAD 2.0 (Question & Answering)

LC-QuAD 2.0 (Question & Answering)

30,000 pairs of question and its corresponding SPARQL query

About this dataset

More Datasets

Featured Notebooks

How to use the dataset

Research Ideas

Acknowledgements

aol-data.tar.bz2

astro-llms-full-query-data

Data from: SARA - A Collection of Sensitivity-Aware Relevance Assessments

WikiSQL (Questions and SQL Queries)

WikiSQL (Questions and SQL Queries)

80654 hand-annotated questions and SQL queries on 24241 Wikipedia tables

About this dataset

More Datasets

Featured Notebooks

How to use the dataset

Research Ideas

Acknowledgements

License

Columns

Acknowledgements

TREC 2022 Deep Learning test collection

Cyber-Physical System power Consumption

Files

Data

Annotations

Jupyter Notebook

Note

Johns Hopkins COVID-19 Case Tracker

Updates

- Johns Hopkins has reconciled Ohio's historical deaths data with the state.

Overview

Queries

Interactive

Interactive Embed Code

Caveats

Attribution

AOL-500k-User-Session-Collection

answersumm

Air Markets Program Data (AMPD)

Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter...

Service Stations in Münster | gimi9.com

Scientific Data Provenance in R: RDataTracker and DDG Explorer

Repository Analytics and Metrics Portal (RAMP) 2018 data

FLOWRepository

Data from: SQL Injection Attack Netflow

Decoding Home Values: The Power of Education vs. Race, Ethnicity, and Gender...

Replication Data for: Advancing Privacy Research: A Novel Realistic1...

MetaMath QA

Mathematical Questions for Large Language Models

MetaMath QA

Mathematical Questions for Large Language Models

About this dataset

More Datasets

Featured Notebooks

How to use the dataset

Data Dictionary

Preparing data for analysis

Research Ideas

Acknowledgements

License

Columns

Acknowledgements