42 datasets found
  1. MetaMath QA

    • kaggle.com
    zip
    Updated Nov 23, 2023
    Cite
    The Devastator (2023). MetaMath QA [Dataset]. https://www.kaggle.com/datasets/thedevastator/metamathqa-performance-with-mistral-7b
    Available download formats: zip (78,629,842 bytes)
    Dataset updated
    Nov 23, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    MetaMath QA

    Mathematical Questions for Large Language Models

    By Huggingface Hub [source]

    About this dataset

    This dataset contains meta-mathematics questions and answers collected from the Mistral-7B question-answering system. The responses, types, and queries are all provided in order to help boost the performance of MetaMathQA while maintaining high accuracy. With its well-structured design, this dataset provides users with an efficient way to investigate various aspects of question answering models and further understand how they function. Whether you are a professional or beginner, this dataset is sure to offer invaluable insights into the development of more powerful QA systems!


    How to use the dataset

    Data Dictionary

    The MetaMathQA dataset contains three columns:

    • response - the response to the query given by the question answering system. (String)
    • type - the type of query provided as input to the system. (String)
    • query - the question posed to the system for which a response is required. (String)

    Preparing data for analysis

    Before diving into analysis, familiarize yourself with the kinds of values present in each column and check whether any preprocessing is needed, such as removing unwanted characters or filling in missing values, so that the data can be used without issue when training or testing your model further down your process flow.
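    As a minimal sketch of these checks (assuming the split is stored as train.csv with the three columns from the data dictionary), the following loads the file with pandas and inspects it for missing values:

    import pandas as pd

    # File name taken from the Columns section below; adjust the path as needed.
    df = pd.read_csv("train.csv")

    # Confirm the expected columns are present.
    print(df.columns.tolist())        # expected: ['response', 'type', 'query'] or similar

    # Basic pre-analysis checks: missing values and query-type distribution.
    print(df.isna().sum())
    print(df["type"].value_counts())

    # Simple cleanup: drop rows with missing text and strip stray whitespace.
    df = df.dropna(subset=["response", "query"])
    df["query"] = df["query"].str.strip()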

    Training Models using Mistral 7B

    Mistral 7B is an open-source large language model rather than a tabular machine-learning framework, but because this dataset ships as a plain CSV file it can also be explored with classical machine-learning tooling. After collecting and preprocessing the data accordingly, you can train and compare standard algorithms such as Support Vector Machines (SVM), logistic regression, or decision trees, and tune their hyperparameters with GridSearchCV and RandomizedSearchCV during the model-building stage. After the selection process, validate the performance of the chosen models with metrics such as accuracy, F1 score, precision, and recall.

    Testing models

    After a successful building phase, test the selected models robustly on the evaluation metrics mentioned above. The trained model can then make predictions on new test cases provided by domain experts; running quality-assurance checks against the same baseline metrics gives a confidence value for each run, and updating the baseline scores as further experiments are run is the preferred methodology for AI workflows, because it keeps the impact of irrelevant or inexact predictions low.
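    The sketch below illustrates one such classical workflow under stated assumptions: it treats the type column as a label to be predicted from the query text, tunes a scikit-learn pipeline with GridSearchCV, and reports the metrics mentioned above. The file name and column names come from the data dictionary; everything else (feature extraction, model choice, parameter grid) is illustrative rather than the dataset authors' own method.

    import pandas as pd
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report

    # Load the CSV described in the data dictionary and drop incomplete rows.
    df = pd.read_csv("train.csv").dropna(subset=["query", "type"])
    X_train, X_test, y_train, y_test = train_test_split(
        df["query"], df["type"], test_size=0.2, random_state=42)

    # TF-IDF features feeding a logistic regression classifier.
    pipe = Pipeline([("tfidf", TfidfVectorizer()),
                     ("clf", LogisticRegression(max_iter=1000))])

    # Small, illustrative hyperparameter grid tuned with cross-validation.
    grid = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=3, scoring="f1_macro")
    grid.fit(X_train, y_train)

    # Accuracy, precision, recall and F1 on the held-out split.
    print(classification_report(y_test, grid.predict(X_test)))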

    Research Ideas

    • Generating natural language processing (NLP) models to better identify patterns and connections between questions, answers, and types.
    • Developing understandings on the efficiency of certain language features in producing successful question-answering results for different types of queries.
    • Optimizing search algorithms that surface relevant answer results based on types of queries

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv

    | Column name | Description                          |
    |:------------|:-------------------------------------|
    | response    | The response to the query. (String)  |
    | type        | The type of query. (String)          |

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Huggingface Hub.

  2. LC-QuAD 2.0 (Question & Answering)

    • kaggle.com
    zip
    Updated Dec 2, 2022
    Cite
    The Devastator (2022). LC-QuAD 2.0 (Question & Answering) [Dataset]. https://www.kaggle.com/datasets/thedevastator/unlock-smarter-querying-with-lc-quad-2-0
    Available download formats: zip (3,004,134 bytes)
    Dataset updated
    Dec 2, 2022
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    LC-QuAD 2.0 (Question & Answering)

    30,000 pairs of question and its corresponding SPARQL query

    By Huggingface Hub [source]

    About this dataset

    LC-QuAD 2.0 is a breakthrough dataset designed to advance the state of intelligent querying. By providing a collection of 30,000 different pairs of questions and their respective SPARQL queries, it presents an enormous opportunity for anyone looking to unlock the power of knowledge with smart querying techniques.

    These questions have been carefully devised such that they relate to the latest version of Wikidata and DBpedia, granting tech-savvy individuals an access key to an information repository far beyond what was once thought imaginable. The dataset found under this union is nothing short of amazing - consisting not just of Natural Language Questions but also their solutions in the form of a SPARQL query. With LC-QuAD 2.0, you have at your fingertips more than thirty thousand answers ready for any query you can think up! Unlocking knowledge has never been easier!


    How to use the dataset

    Using the LC-QuAD 2.0 dataset can be a great way to power up your intelligent systems with smarter querying. Whether you want to build a question-answering system or create new knowledge graphs and search systems, utilizing this dataset can certainly be helpful. Here is a guide on how to use this dataset:

    • Understand the structure of the data: The LC-QuAD 2.0 consists of 30,000 different pairs of questions and their corresponding SPARQL queries in two files – train (used for training an intelligent system) and test (used for testing an intelligent system). The columns present in each pair are NNQT_question (Natural Language Question), subgraph (Subgraph information for the question), sparql_dbpedia18 (SPARQL query for DBpedia 18), template (Templates from which SPARQL query was generated).

    • Read up on SPARQL: Before you start using this dataset, it is important that you read up on what SPARQL is and how it works, as SPARQL will be used frequently when browsing through this data set. This will make understanding the content easier and quicker!

    • Start exploring!: After doing some research on SPARQL, it's time to explore. Look at each pair in detail: read its natural-language question and subgraph information, and try to understand how they relate to the corresponding SPARQL query. You can also run these SPARQL queries yourself against the Wikidata or DBpedia endpoints to see what they return (see the sketch after this list). If a query returns multiple results with differing answer ranges, inspect the entity definitions behind the words and phrases of the question before writing authoritative answer modules or endpoints on top of a prepared and refined dataset like LC-QuAD.

    • Use your own data: Once you have familiarized yourself sufficiently with the available pairs and understand their relevance, consider creating your own data set by adding more complex questions along with associated unique attributes, which can give great insights. Also evaluate whether data-enrichment techniques suited to your specific domain should be applied, either at the level of feature selection or of the classifier-selection algorithm, so that globally extracted vectors do not increase the overfitting/generalization penalty.
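    The sketch below illustrates the "run these SPARQL queries yourself" step under stated assumptions: it uses the SPARQLWrapper Python package (installed separately, e.g. pip install sparqlwrapper) against the public Wikidata endpoint, and the query shown is illustrative only; any SPARQL string taken from the dataset could be substituted.

    from SPARQLWrapper import SPARQLWrapper, JSON

    endpoint = SPARQLWrapper("https://query.wikidata.org/sparql")
    endpoint.setReturnFormat(JSON)

    # Any query from the dataset's SPARQL column could be pasted here instead.
    endpoint.setQuery("""
        SELECT ?item ?itemLabel WHERE {
          ?item wdt:P31 wd:Q5 .                       # instances of "human"
          SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
        } LIMIT 5
    """)

    for row in endpoint.query().convert()["results"]["bindings"]:
        print(row["itemLabel"]["value"])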

    Research Ideas

    • Incorporating the LC-QUAD 2.0 dataset into Intelligent systems such as Chatbots, Question Answering Systems, and Document Summarization programs to allow them to retrieve the required information by transforming natural language questions into SPARQL queries.
    • Utilizing this dataset in Semantic Scholar Search Engines and Academic Digital Libraries which can use natural language queries instead of keywords in order to perform more sophisticated searches and provide more accurate results for researchers in diverse areas.
    • Applying this dataset for building Knowledge Graphs that can store entities along with their attributes, categories and relations thereby allowing better understanding of complex relationships between entities or data and further advancing development of AI agents that are able to answer specific questions or provide personalized recommendations in various contexts or tasks

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

  3. aol-data.tar.bz2

    • figshare.com
    bz2
    Updated Oct 23, 2017
    Cite
    Graham Cormode (2017). aol-data.tar.bz2 [Dataset]. http://doi.org/10.6084/m9.figshare.5527231.v1
    Available download formats: bz2
    Dataset updated
    Oct 23, 2017
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Graham Cormode
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    AOL search data anonymized and released by AOL Research in 2006.

    500k User Session Collection

    This collection is distributed for NON-COMMERCIAL RESEARCH USE ONLY. Any application of this collection for commercial purposes is STRICTLY PROHIBITED.

    Brief description: This collection consists of ~20M web queries collected from ~650k users over three months. The data is sorted by anonymous user ID and sequentially arranged. The goal of this collection is to provide real query log data that is based on real users. It could be used for personalization, query reformulation, or other types of search research.

    The data set includes {AnonID, Query, QueryTime, ItemRank, ClickURL}:

    • AnonID - an anonymous user ID number.
    • Query - the query issued by the user, case shifted with most punctuation removed.
    • QueryTime - the time at which the query was submitted for search.
    • ItemRank - if the user clicked on a search result, the rank of the item on which they clicked is listed.
    • ClickURL - if the user clicked on a search result, the domain portion of the URL in the clicked result is listed.

    Each line in the data represents one of two types of events: 1. A query that was NOT followed by the user clicking on a result item. 2. A click-through on an item in the result list returned from a query. In the first case (query only) there is data in only the first three columns/fields, namely AnonID, Query, and QueryTime (see above). In the second case (click-through), there is data in all five columns, and the query that preceded the click-through is included. Note that if a user clicked on more than one result in the list returned from a single query, there will be TWO lines in the data to represent the two events. Also note that if the user requested the next "page" of results for some query, this appears as a subsequent identical query with a later time stamp.

    CAVEAT EMPTOR -- SEXUALLY EXPLICIT DATA! Please be aware that these queries are not filtered to remove any content. Pornography is prevalent on the Web and unfiltered search engine logs contain queries by users who are looking for pornographic material. There are queries in this collection that use SEXUALLY EXPLICIT LANGUAGE. This collection of data is intended for use by mature adults who are not easily offended by the use of pornographic search terms. If you are offended by sexually explicit language you should not read through this data. Also be aware that in some states it may be illegal to expose a minor to this data. Please understand that the data represents REAL WORLD USERS, un-edited and randomly sampled, and that AOL is not the author of this data.

    Basic Collection Statistics

    Dates: 01 March, 2006 - 31 May, 2006

    Normalized queries:
    • 36,389,567 lines of data
    • 21,011,340 instances of new queries (w/ or w/o click-through)
    • 7,887,022 requests for "next page" of results
    • 19,442,629 user click-through events
    • 16,946,938 queries w/o user click-through
    • 10,154,742 unique (normalized) queries
    • 657,426 unique user IDs

    Please reference the following publication when using this collection: G. Pass, A. Chowdhury, C. Torgeson, "A Picture of Search", The First International Conference on Scalable Information Systems, Hong Kong, June 2006.

    Copyright (2006) AOL
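    A minimal sketch for reading one of the extracted log files with pandas, assuming the tab-separated five-field layout described above (the file name is illustrative):

    import pandas as pd

    log = pd.read_csv("user-ct-test-collection-01.txt", sep="\t", header=0,
                      names=["AnonID", "Query", "QueryTime", "ItemRank", "ClickURL"],
                      parse_dates=["QueryTime"])

    # Query-only events have empty ItemRank/ClickURL; click-through events have both.
    clicks = log.dropna(subset=["ClickURL"])
    queries_only = log[log["ClickURL"].isna()]
    print(len(log), "lines:", len(clicks), "click-throughs,", len(queries_only), "query-only")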

  4. astro-llms-full-query-data

    • huggingface.co
    Cite
    Center for Language and Speech Processing @ JHU, astro-llms-full-query-data [Dataset]. https://huggingface.co/datasets/jhu-clsp/astro-llms-full-query-data
    Dataset authored and provided by
    Center for Language and Speech Processing @ JHU
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    AstroLLMs Full Query Dataset

    This dataset includes all of the data collected in a four-week deployment of a Large Language Model-powered Slack chatbot trained on astrophysics papers. Astronomers were invited to interact with the chatbot, ask questions, and leave feedback. This data includes 368 question-answer pairs, including feedback, reactions, and labeling.

      Dataset Structure
    

    The columns of this dataset are thread_ts (unique time stamp of the query), channel_id… See the full description on the dataset page: https://huggingface.co/datasets/jhu-clsp/astro-llms-full-query-data.

  5. Data from: SARA - A Collection of Sensitivity-Aware Relevance Assessments

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    • +1more
    Updated Jun 6, 2023
    Cite
    McKechnie, Jack; McDonald, Graham (2023). SARA - A Collection of Sensitivity-Aware Relevance Assessments [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8006819
    Dataset updated
    Jun 6, 2023
    Dataset provided by
    University of Glasgow
    Authors
    McKechnie, Jack; McDonald, Graham
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SARA - A Collection of Sensitivity-Aware Relevance Assessments

    Presented here is a collection of Sensitivity-Aware Relevance Assessments for the UC Berkeley labelled subset of the Enron Email Collection. The Hearst [1] labelled version of the Enron Email Collection is a subset of the CMU collection that contains 1702 emails that were annotated as part of a class project at UC Berkeley. Students in the Natural Language Processing course were tasked with annotating the emails as relevant or not relevant to 53 different categories. Therefore, the labelled version of the Enron email collection provides a rich taxonomy of labels which can be used for multiple definitions of sensitivity, such as Purely Personal and Personal but in a Professional Context. The categories that the emails are labelled for can be seen in Table 1. The files for the labelled version of the Enron Email Collection are available from the UC Berkeley website.

    We deploy a topic modelling approach to identify topical themes in the labelled Enron collection that serve as a basis for our information needs which are in turn used to gather queries and relevance assessments, the notebook for which is available here. Two separate crowdsourcing tasks are carried out in the development of SARA. Firstly, query formulations are crowdsourced to represent the information needs and, secondly, relevance assessments are crowdsourced for a pooled set of documents from the labelled Enron collection for each of the information needs.

    The SARA Collection of Sensitivity-Aware Relevance Assessments is available through the popular ir_datasets library. More information can be found on the ir_datasets GitHub and website.

    Information Needs

    To create our set of sensitivity-aware relevance assessments for the labelled Enron email collection, we first identify a set of topical subjects that reflect the contents of the emails in the collection. We use a topic modelling approach to identify the information needs. When identifying topics to be used as information needs, we are interested in identifying general themes that relate to the topics of discussion that might likely be covered in the contents (i.e., the body) of the emails in the collection. The topics are chosen to be broad enough that one can reasonably expect relevant documents to exist in the collection, and not so specific that specialist knowledge would be required to make a judgement of relevance on the subject. Subsequently, we manually construct short passages of text to serve as descriptions of the information needs that are to be searched for in the collection by the crowdworkers. The information needs that are shown to the crowdworkers are available in the information_needs.tsv file.

    Queries

    In order to collect relevance assessments for pairs of emails and information needs, different query formulations are first needed to generate pools of documents. Query formulations for each topic are collected from crowdworkers from the Prolific crowdwork platform. Ten information needs are shown to each crowdworker and they are asked to provide a query formulation that they would use to get relevant documents to satisfy the information need they are presented with. Three queries for each of the fifty information needs are released. The resulting queries are available in the repeated_queries.tsv file.

    Relevance Assessments

    Crowdworkers are shown an information need and an email and asked to rate the document as being either Highly Relevant, Partially Relevant, or Not Relevant to the information need. Each information need/email pair is judged by three crowdworkers and a majority vote is used to generate a ground truth label. Since each information need / email pair is judged by three crowdworkers and there are three possible labels, it is possible for each of the labels to be selected by one crowdworker. In practice, this only happened for 134 pairs. In such cases, ties are broken by having one of the authors read the document and make an additional judgement. In order to ensure that sensitive documents definitely have relevance labels they were also judged by one of the authors for each of the information needs. The relevance assessments are available in the repeated_qrels.txt file. The relevance assessments are in the format 'query iteration document relevancy'. The iteration column is used for IR_Datasets and can be safely ignored and the document name is the filename used in the labelled Enron collection.
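    A minimal sketch for reading the relevance assessments directly, assuming the whitespace-separated 'query iteration document relevancy' layout described above (the documented route via the ir_datasets library works as well):

    import pandas as pd

    qrels = pd.read_csv("repeated_qrels.txt", sep=r"\s+",
                        names=["query", "iteration", "document", "relevancy"])

    # The iteration column can be safely ignored (see above).
    print(qrels["relevancy"].value_counts())             # distribution of judgement grades
    print(qrels.groupby("query")["document"].count())    # judged documents per information need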

    Table 1

    | 1) Coarse genre | 2) Included/forwarded information | 3) Primary topics (if coarse genre 1.1 is selected) | 4) Emotional tone (if not neutral) |
    |:---|:---|:---|:---|
    | 1.1 Company Business, Strategy, etc. (See 3) | 2.1 Includes new text in addition to forwarded material | 3.1 Regulations and regulators (includes price caps) | 4.1 Jubilation |
    | 1.2 Purely Personal | 2.2 Forwarded email(s) including replies | 3.2 Internal projects -- progress and strategy | 4.2 Hope / anticipation |
    | 1.3 Personal but in professional context (e.g., it was good working with you) | 2.3 Business letter(s) / document(s) | 3.3 Company image -- current | 4.3 Humor |
    | 1.4 Logistic Arrangements (meeting scheduling, technical support, etc.) | 2.4 News article(s) | 3.4 Company image -- changing / influencing | 4.4 Camaraderie |
    | 1.5 Employment arrangements (job seeking, hiring, recommendations, etc.) | 2.5 Government / academic report(s) | 3.5 Political influence / contributions / contacts | 4.5 Admiration |
    | 1.6 Document editing/checking (collaboration) | 2.6 Government action(s) (such as results of a hearing, etc.) | 3.6 California energy crisis / California politics | 4.6 Gratitude |
    | 1.7 Empty message (due to missing attachment) | 2.7 Press release(s) | 3.7 Internal company policy | 4.7 Friendship / affection |
    | 1.8 Empty message | 2.8 Legal documents (complaints, lawsuits, advice) | 3.8 Internal company operations | 4.8 Sympathy / support |
    | | 2.9 Pointers to url(s) | 3.9 Alliances / partnerships | 4.9 Sarcasm |
    | | 2.10 Newsletters | 3.10 Legal advice | 4.10 Secrecy / confidentiality |
    | | 2.11 Jokes, humor (related to business) | 3.11 Talking points | 4.11 Worry / anxiety |
    | | 2.12 Jokes, humor (unrelated to business) | 3.12 Meeting minutes | 4.12 Concern |
    | | 2.13 Attachment(s) (assumed missing) | 3.13 Trip reports | 4.13 Competitiveness / aggressiveness |
    | | | | 4.14 Triumph / gloating |
    | | | | 4.15 Pride |
    | | | | 4.16 Anger / agitation |
    | | | | 4.17 Sadness / despair |
    | | | | 4.18 Shame |
    | | | | 4.19 Dislike / scorn |
    

    The Sensitivity-Aware Relevance Assessments dataset is held under an Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) licence which allows for it to be adapted, transformed and built upon.

    Questions and comments are welcomed via email.

    References

    [1] Marti A Hearst. 2005. Teaching applied natural language processing: Triumphs and tribulations. In Proc. of Workshop on Effective Tools and Methodologies for Teaching NLP and CL.

  6. WikiSQL (Questions and SQL Queries)

    • kaggle.com
    zip
    Updated Nov 25, 2022
    Cite
    The Devastator (2022). WikiSQL (Questions and SQL Queries) [Dataset]. https://www.kaggle.com/datasets/thedevastator/dataset-for-developing-natural-language-interfac
    Available download formats: zip (21,491,264 bytes)
    Dataset updated
    Nov 25, 2022
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    WikiSQL (Questions and SQL Queries)

    80654 hand-annotated questions and SQL queries on 24241 Wikipedia tables

    By Huggingface Hub [source]

    About this dataset

    A large crowd-sourced dataset for developing natural language interfaces for relational databases. WikiSQL is a dataset of 80654 hand-annotated examples of questions and SQL queries distributed across 24241 tables from Wikipedia.


    How to use the dataset

    This dataset can be used to develop natural language interfaces for relational databases. The data fields are the same across all splits, and each file contains the phase, question, table, and SQL query for every example.

    Research Ideas

    • This dataset can be used to develop natural language interfaces for relational databases.
    • This dataset can be used to develop a knowledge base of common SQL queries.
    • This dataset can be used to generate a training set for a neural network that translates natural language into SQL queries

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: validation.csv

    | Column name | Description |
    |:------------|:------------|
    | phase       | The phase of the data collection. (String) |
    | question    | The question asked by the user. (String) |
    | table       | The table containing the data for the question. (String) |
    | sql         | The SQL query corresponding to the question. (String) |

    File: train.csv

    | Column name | Description |
    |:------------|:------------|
    | phase       | The phase of the data collection. (String) |
    | question    | The question asked by the user. (String) |
    | table       | The table containing the data for the question. (String) |
    | sql         | The SQL query corresponding to the question. (String) |

    File: test.csv

    | Column name | Description |
    |:------------|:------------|
    | phase       | The phase of the data collection. (String) |
    | question    | The question asked by the user. (String) |
    | table       | The table containing the data for the question. (String) |
    | sql         | The SQL query corresponding to the question. (String) |
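    A minimal sketch for loading the three splits with pandas, assuming the CSV files described above sit in the working directory:

    import pandas as pd

    # Load each split and report its shape and columns.
    splits = {name: pd.read_csv(f"{name}.csv") for name in ["train", "validation", "test"]}
    for name, df in splits.items():
        print(name, df.shape, list(df.columns))

    # Pair each natural-language question with its annotated SQL query.
    train = splits["train"]
    for question, sql in zip(train["question"].head(3), train["sql"].head(3)):
        print(question, "->", sql)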

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Huggingface Hub.

  7. TREC 2022 Deep Learning test collection

    • catalog.data.gov
    • gimi9.com
    • +1more
    Updated May 9, 2023
    Cite
    National Institute of Standards and Technology (2023). TREC 2022 Deep Learning test collection [Dataset]. https://catalog.data.gov/dataset/trec-2022-deep-learning-test-collection
    Dataset updated
    May 9, 2023
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    This is a test collection for passage and document retrieval, produced in the TREC 2022 Deep Learning track.

    The Deep Learning Track studies information retrieval in a large-training-data regime. This is the case where the number of training queries with at least one positive label is at least in the tens of thousands, if not hundreds of thousands or more. This corresponds to real-world scenarios such as training based on click logs and training based on labels from shallow pools (such as the pooling in the TREC Million Query Track or the evaluation of search engines based on early precision).

    Certain machine learning based methods, such as methods based on deep learning, are known to require very large datasets for training. Lack of such large-scale datasets has been a limitation for developing such methods for common information retrieval tasks, such as document ranking. The Deep Learning Track organized in the previous years aimed at providing large-scale datasets to TREC, and created a focused research effort with a rigorous blind evaluation of rankers for the passage ranking and document ranking tasks.

    Similar to the previous years, one of the main goals of the track in 2022 is to study what methods work best when a large amount of training data is available. For example, do the same methods that work on small data also work on large data? How much do methods improve when given more training data? What external data and models can be brought to bear in this scenario, and how useful is it to combine full supervision with other forms of supervision?

    The collection contains 12 million web pages, 138 million passages from those web pages, search queries, and relevance judgments for the queries.

  8. Cyber-Physical System power Consumption

    • zenodo.org
    bin, csv, zip
    Updated Nov 26, 2024
    Cite
    Gabriel Iuhasz; Gabriel Iuhasz; Teodor-Florin Fortis; Teodor-Florin Fortis (2024). Cyber-Physical System power Consumption [Dataset]. http://doi.org/10.5281/zenodo.14215756
    Available download formats: bin, csv, zip
    Dataset updated
    Nov 26, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gabriel Iuhasz; Gabriel Iuhasz; Teodor-Florin Fortis; Teodor-Florin Fortis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Files

    This dataset comprises 5 CSV files contained in the data.zip archive. Each one represents a production machine from which various sensor data has been collected. The average collection cadence was 5 measurements per second. The monitored devices were used for hydroforming.

    The collection period ran from 2023-06-01 until 2023-08-05.

    Data

    These files represent a complete data dump of the data available in the time-series database, InfluxDB, used for collection. Because of this, some columns have no semantic value for detecting production cycles or for any other analytics.

    Each file contains a total of 14 columns. Some of the columns are artefacts of the query used to extract the data from InfluxDB and can be discarded. These columns are: results, table, _start, _stop.

    • results - An artefact of the InfluxDB query, signifies postprocessing of results in this dataset. It is "mean".
    • table - An artefact of the InfluxDB query, can be discarded.
    • _start and _stop - Refer to ingestion-related data, used in monitoring ingestion.
    • _field - An artefact of the InfluxDB query, specifying what field to use for the query.
    • _measurement - An artefact of the InfluxDB query, specifying what measurement to use for the query. Contains the same information as device_id.
    • host - An artefact of the InfluxDB query, the unique name of the host used for the InfluxDB sink in Kubernetes.
    • kafka_topic - Name of the Kafka topic used for collection.

    Pertinent columns are:

    • _time - Denotes the time at which a particular event has been measured, it is used as index when creating a dataframe.
    • _time.1 - Duplicate of _time for sanity check and ease of analysis when _time is set as index
    • _value - Represents the value measured by each sensor type.
    • device_id - Unique identifier of the manufacturing device, should be the same as the file name, i.e. B827EB8D8E0C.
    • ingestion_time - Timestamp when the data has been collected and ingested by influxDB.
    • sid - Unique sensor ID; the power measurements can be found at sid 1.

    Annotations

    There are three additional files which contain annotation data:

    • scamp_devices.csv - Contains mapping information between the dataset device ID (defined in column "DeviceIDMonitoring") and the ground-truth file ID (defined in column "DeviceID").
    • scamp_report_3m.csv - Contains the ground truth, which can be used for validation of cycle detection and analysis methods. The columns are as follows:
      • ReportID - Internal unique ID created during data collection. It can be discarded.
      • JobID - Internal scheduling job unique ID.
      • DeviceID - The unique ID of the device used for manufacturing; it needs to be mapped using the scamp_devices.csv data.
      • StartTime - Start time of operations.
      • EndTime - End time of operations.
      • ProductID - Unique identifier of the product being manufactured.
      • CycleTime - Average length of a cycle in seconds, added manually by operators. It can be unreliable.
      • QuantityProduced - Number of products manufactured during the timeframe given by StartTime and EndTime.
      • QuantityScrap - Number of scrapped/malformed products in the given timeframe. These are part of QuantityProduced, not in addition to it.
      • IntreruptionMinuted - Minutes of production halt.
    • scamp_patterns.csv - Contains the start and end timestamps for selected example production cycles. These were chosen by expert users.

    Jupyter Notebook

    We have provided a sample Jupyter notebook (verify_data.ipynb), which gives examples of how the dataset can be loaded and visualised as well as examples of how the sample patterns and ground truth can be addressed and visualised.

    Note

    The Jupyter Notebook contains an example of how the data can be loaded and visualised. Please note that the data should be filtered based on sid; the power measurements are collected under sid 1. See the notebook for an example.
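    A minimal sketch of that loading-and-filtering step, assuming one of the device CSV files (the file name reuses the example device ID from the column descriptions) and the pertinent columns listed above:

    import pandas as pd

    # Load one device file and parse the timestamp column.
    df = pd.read_csv("B827EB8D8E0C.csv", parse_dates=["_time"])

    # Keep the pertinent columns and drop the InfluxDB query artefacts.
    df = df[["_time", "_value", "device_id", "ingestion_time", "sid"]]

    # Power measurements are collected under sensor ID 1.
    power = df[df["sid"] == 1].set_index("_time").sort_index()
    print(power["_value"].describe())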

  9. Johns Hopkins COVID-19 Case Tracker

    • data.world
    • kaggle.com
    csv, zip
    Updated Dec 3, 2025
    Cite
    The Associated Press (2025). Johns Hopkins COVID-19 Case Tracker [Dataset]. https://data.world/associatedpress/johns-hopkins-coronavirus-case-tracker
    Available download formats: zip, csv
    Dataset updated
    Dec 3, 2025
    Authors
    The Associated Press
    Time period covered
    Jan 22, 2020 - Mar 9, 2023
    Description

    Updates

    • Notice of data discontinuation: Since the start of the pandemic, AP has reported case and death counts from data provided by Johns Hopkins University. Johns Hopkins University has announced that they will stop their daily data collection efforts after March 10. As Johns Hopkins stops providing data, the AP will also stop collecting daily numbers for COVID cases and deaths. The HHS and CDC now collect and visualize key metrics for the pandemic. AP advises using those resources when reporting on the pandemic going forward.

    • April 9, 2020

      • The population estimate data for New York County, NY has been updated to include all five New York City counties (Kings County, Queens County, Bronx County, Richmond County and New York County). This has been done to match the Johns Hopkins COVID-19 data, which aggregates counts for the five New York City counties to New York County.
    • April 20, 2020

      • Johns Hopkins death totals in the US now include confirmed and probable deaths in accordance with CDC guidelines as of April 14. One significant result of this change was an increase of more than 3,700 deaths in the New York City count. This change will likely result in increases for death counts elsewhere as well. The AP does not alter the Johns Hopkins source data, so probable deaths are included in this dataset as well.
    • April 29, 2020

      • The AP is now providing timeseries data for counts of COVID-19 cases and deaths. The raw counts are provided here unaltered, along with a population column with Census ACS-5 estimates and calculated daily case and death rates per 100,000 people. Please read the updated caveats section for more information.
    • September 1st, 2020

      • Johns Hopkins is now providing counts for the five New York City counties individually.
    • February 12, 2021

      • The Ohio Department of Health recently announced that as many as 4,000 COVID-19 deaths may have been underreported through the state’s reporting system, and that the "daily reported death counts will be high for a two to three-day period."
      • Because deaths data will be anomalous for consecutive days, we have chosen to freeze Ohio's rolling average for daily deaths at the last valid measure until Johns Hopkins is able to back-distribute the data. The raw daily death counts, as reported by Johns Hopkins and including the backlogged death data, will still be present in the new_deaths column.
    • February 16, 2021

      • Johns Hopkins has reconciled Ohio's historical deaths data with the state.

    Overview

    The AP is using data collected by the Johns Hopkins University Center for Systems Science and Engineering as our source for outbreak caseloads and death counts for the United States and globally.

    The Hopkins data is available at the county level in the United States. The AP has paired this data with population figures and county rural/urban designations, and has calculated caseload and death rates per 100,000 people. Be aware that caseloads may reflect the availability of tests -- and the ability to turn around test results quickly -- rather than actual disease spread or true infection rates.
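    The per-100,000 rates mentioned above follow directly from the raw counts and the population column; a minimal sketch with made-up numbers (the column names are illustrative, not necessarily the exact headers in the AP files):

    import pandas as pd

    counties = pd.DataFrame({
        "county": ["A", "B"],
        "cumulative_cases": [1200, 450],
        "cumulative_deaths": [30, 12],
        "population": [250000, 80000],
    })
    # Rate per 100,000 people = count / population * 100,000.
    counties["cases_per_100k"] = counties["cumulative_cases"] / counties["population"] * 100_000
    counties["deaths_per_100k"] = counties["cumulative_deaths"] / counties["population"] * 100_000
    print(counties)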

    This data is from the Hopkins dashboard that is updated regularly throughout the day. Like all organizations dealing with data, Hopkins is constantly refining and cleaning up their feed, so there may be brief moments where data does not appear correctly. At this link, you’ll find the Hopkins daily data reports, and a clean version of their feed.

    The AP is updating this dataset hourly at 45 minutes past the hour.

    To learn more about AP's data journalism capabilities for publishers, corporations and financial institutions, go here or email kromano@ap.org.

    Queries

    Use AP's queries to filter the data or to join to other datasets we've made available to help cover the coronavirus pandemic

    Interactive

    The AP has designed an interactive map to track COVID-19 cases reported by Johns Hopkins.

    https://datawrapper.dwcdn.net/nRyaf/15/

    Interactive Embed Code

    <iframe title="USA counties (2018) choropleth map Mapping COVID-19 cases by county" aria-describedby="" id="datawrapper-chart-nRyaf" src="https://datawrapper.dwcdn.net/nRyaf/10/" scrolling="no" frameborder="0" style="width: 0; min-width: 100% !important;" height="400"></iframe><script type="text/javascript">(function() {'use strict';window.addEventListener('message', function(event) {if (typeof event.data['datawrapper-height'] !== 'undefined') {for (var chartId in event.data['datawrapper-height']) {var iframe = document.getElementById('datawrapper-chart-' + chartId) || document.querySelector("iframe[src*='" + chartId + "']");if (!iframe) {continue;}iframe.style.height = event.data['datawrapper-height'][chartId] + 'px';}}});})();</script>
    

    Caveats

    • This data represents the number of cases and deaths reported by each state and has been collected by Johns Hopkins from a number of sources cited on their website.
    • In some cases, deaths or cases of people who've crossed state lines -- either to receive treatment or because they became sick and couldn't return home while traveling -- are reported in a state they aren't currently in, because of state reporting rules.
    • In some states, there are a number of cases not assigned to a specific county -- for those cases, the county name is "unassigned to a single county"
    • This data should be credited to Johns Hopkins University's COVID-19 tracking project. The AP is simply making it available here for ease of use for reporters and members.
    • Caseloads may reflect the availability of tests -- and the ability to turn around test results quickly -- rather than actual disease spread or true infection rates.
    • Population estimates at the county level are drawn from 2014-18 5-year estimates from the American Community Survey.
    • The Urban/Rural classification scheme is from the Centers for Disease Control and Prevention's National Center for Health Statistics. It puts each county into one of six categories -- from Large Central Metro to Non-Core -- according to population and other characteristics. More details about the classifications can be found here.

    Johns Hopkins timeseries data:

    • Johns Hopkins pulls data regularly to update their dashboard. Once a day, around 8pm EDT, Johns Hopkins adds the counts for all areas they cover to the timeseries file. These counts are snapshots of the latest cumulative counts provided by the source on that day. This can lead to inconsistencies if a source updates their historical data for accuracy, either increasing or decreasing the latest cumulative count.
    • Johns Hopkins periodically edits their historical timeseries data for accuracy. They provide a file documenting all errors in their timeseries files that they have identified and fixed here.

    Attribution

    This data should be credited to Johns Hopkins University COVID-19 tracking project

  10. AOL-500k-User-Session-Collection

    • huggingface.co
    Cite
    Max Isom, AOL-500k-User-Session-Collection [Dataset]. https://huggingface.co/datasets/max-chroma/AOL-500k-User-Session-Collection
    Authors
    Max Isom
    Description

    500k User Session Collection

    This collection is distributed for NON-COMMERCIAL RESEARCH USE ONLY. Any application of this collection for commercial purposes is STRICTLY PROHIBITED. Brief description: This collection consists of ~20M web queries collected from ~650k users over three months. The data is sorted by anonymous user ID and sequentially arranged. The goal of this collection is to provide real query log data that is based on real users. It could be used for personalization… See the full description on the dataset page: https://huggingface.co/datasets/max-chroma/AOL-500k-User-Session-Collection.

  11. answersumm

    • huggingface.co
    Updated Sep 3, 2022
    Cite
    Alexander Fabbri (2022). answersumm [Dataset]. https://huggingface.co/datasets/alexfabbri/answersumm
    Available download formats: Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Sep 3, 2022
    Authors
    Alexander Fabbri
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for answersumm

      Dataset Summary
    

    The AnswerSumm dataset is an English-language dataset of questions and answers collected from a StackExchange data dump. The dataset was created to support the task of query-focused answer summarization with an emphasis on multi-perspective answers. The dataset consists of over 4200 such question-answer threads annotated by professional linguists and includes over 8700 summaries. We decompose the task into several annotation… See the full description on the dataset page: https://huggingface.co/datasets/alexfabbri/answersumm.

  12. Air Markets Program Data (AMPD)

    • data.wu.ac.at
    • data.amerigeoss.org
    csv
    Updated Jan 1, 2014
    Cite
    U.S. Environmental Protection Agency (2014). Air Markets Program Data (AMPD) [Dataset]. https://data.wu.ac.at/schema/data_gov/OWNkYjNiZjUtZmIxZC00MmM3LWJmMjctMDViOTA3NjA2OWIx
    Available download formats: csv
    Dataset updated
    Jan 1, 2014
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    The Air Markets Program Data tool allows users to search EPA data to answer scientific, general, policy, and regulatory questions about industry emissions. Air Markets Program Data (AMPD) is a web-based application that allows users easy access to both current and historical data collected as part of EPA's emissions trading programs. This site allows you to create and view reports and to download emissions data for further analysis. AMPD provides a query tool so users can create custom queries of industry source emissions data, allowance data, compliance data, and facility attributes. In addition, AMPD provides interactive maps, charts, reports, and pre-packaged datasets. AMPD does not require any additional software, plug-ins, or security controls and can be accessed using a standard web browser.

  13. Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter...

    • zenodo.org
    bz2
    Updated Mar 15, 2021
    + more versions
    Cite
    João Felipe; João Felipe; Leonardo; Leonardo; Vanessa; Vanessa; Juliana; Juliana (2021). Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks [Dataset]. http://doi.org/10.5281/zenodo.2592524
    Available download formats: bz2
    Dataset updated
    Mar 15, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    João Felipe; João Felipe; Leonardo; Leonardo; Vanessa; Vanessa; Juliana; Juliana
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices, and produces results that can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.

    Paper: https://2019.msrconf.org/event/msr-2019-papers-a-large-scale-study-about-quality-and-reproducibility-of-jupyter-notebooks

    This repository contains two files:

    • dump.tar.bz2
    • jupyter_reproducibility.tar.bz2

    The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.

    The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:

    • analyses: this folder has all the notebooks we use to analyze the data in the PostgreSQL database.
    • archaeology: this folder has all the scripts we use to query, download, and extract data from GitHub notebooks.
    • paper: empty. The notebook analyses/N12.To.Paper.ipynb moves data to it

    In the remainder of this text, we give instructions for reproducing the analyses, by using the data provided in the dump, and for reproducing the collection, by collecting data from GitHub again.

    Reproducing the Analysis

    This section shows how to load the data in the database and run the analyses notebooks. In the analysis, we used the following environment:

    Ubuntu 18.04.1 LTS
    PostgreSQL 10.6
    Conda 4.5.11
    Python 3.7.2
    PdfCrop 2012/11/02 v1.38

    First, download dump.tar.bz2 and extract it:

    tar -xjf dump.tar.bz2

    It extracts the file db2019-03-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:

    psql jupyter < db2019-03-13.dump

    It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTION:

    export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";

    Download and extract jupyter_reproducibility.tar.bz2:

    tar -xjf jupyter_reproducibility.tar.bz2

    Create a conda environment with Python 3.7:

    conda create -n analyses python=3.7
    conda activate analyses

    Go to the analyses folder and install all the dependencies from requirements.txt:

    cd jupyter_reproducibility/analyses
    pip install -r requirements.txt

    For reproducing the analyses, run jupyter on this folder:

    jupyter notebook

    Execute the notebooks on this order:

    • Index.ipynb
    • N0.Repository.ipynb
    • N1.Skip.Notebook.ipynb
    • N2.Notebook.ipynb
    • N3.Cell.ipynb
    • N4.Features.ipynb
    • N5.Modules.ipynb
    • N6.AST.ipynb
    • N7.Name.ipynb
    • N8.Execution.ipynb
    • N9.Cell.Execution.Order.ipynb
    • N10.Markdown.ipynb
    • N11.Repository.With.Notebook.Restriction.ipynb
    • N12.To.Paper.ipynb

    Reproducing or Expanding the Collection

    The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.

    Requirements

    This time, we have extra requirements:

    All the analysis requirements
    lbzip2 2.5
    gcc 7.3.0
    Github account
    Gmail account

    Environment

    First, set the following environment variables:

    export JUP_MACHINE="db"; # machine identifier
    export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
    export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
    export JUP_COMPRESSION="lbzip2"; # compression program
    export JUP_VERBOSE="5"; # verbose level
    export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlchemy connection
    export JUP_GITHUB_USERNAME="github_username"; # your github username
    export JUP_GITHUB_PASSWORD="github_password"; # your github password
    export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
    export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
    export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
    export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
    export JUP_OAUTH_FILE="~/oauth2_creds.json" # oauth2 authentication file
    export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it blank
    export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it blank
    export JUP_WITH_EXECUTION="1"; # run execute python notebooks
    export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependencies
    export JUP_EXECUTION_MODE="-1"; # run following the execution order
    export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
    export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
    export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
    export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
    export JUP_NOTEBOOK_TIMEOUT="300"; # timeout the extraction
    
    
    # Frequency of log reports
    export JUP_ASTROID_FREQUENCY="5";
    export JUP_IPYTHON_FREQUENCY="5";
    export JUP_NOTEBOOKS_FREQUENCY="5";
    export JUP_REQUIREMENT_FREQUENCY="5";
    export JUP_CRAWLER_FREQUENCY="1";
    export JUP_CLONE_FREQUENCY="1";
    export JUP_COMPRESS_FREQUENCY="5";
    
    export JUP_DB_IP="localhost"; # postgres database IP

    Then, configure the file ~/oauth2_creds.json, according to yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf

    Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the directories; the second one should umount it. You can leave the scripts blank, but that is not advisable, as the reproducibility study runs arbitrary code on your machine and you may lose your data.

    Scripts

    Download and extract jupyter_reproducibility.tar.bz2:

    tar -xjf jupyter_reproducibility.tar.bz2

    Install 5 conda environments and 5 anaconda environments, one pair for each Python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (note that it is a local package that has not been published to PyPI; make sure to use the -e option):

    Conda 2.7

    conda create -n raw27 python=2.7 -y
    conda activate raw27
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 2.7

    conda create -n py27 python=2.7 anaconda -y
    conda activate py27
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology
    

    Conda 3.4

    It requires a manual jupyter and pathlib2 installation due to some incompatibilities found on the default installation.

    conda create -n raw34 python=3.4 -y
    conda activate raw34
    conda install jupyter -c conda-forge -y
    conda uninstall jupyter -y
    pip install --upgrade pip
    pip install jupyter
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology
    pip install pathlib2

    Anaconda 3.4

    conda create -n py34 python=3.4 anaconda -y
    conda activate py34
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.5

    conda create -n raw35 python=3.5 -y
    conda activate raw35
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 3.5

    It requires the manual installation of other anaconda packages.

    conda create -n py35 python=3.5 anaconda -y
    conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
    conda activate py35
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.6

    conda create -n raw36 python=3.6 -y
    conda activate raw36
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 3.6

    conda create -n py36 python=3.6 anaconda -y
    conda activate py36
    conda install -y anaconda-navigator jupyterlab_server navigator-updater
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.7


  14. Service Stations in Münster | gimi9.com

    • gimi9.com
    Cite
    Service Stations in Münster | gimi9.com [Dataset]. https://gimi9.com/dataset/eu_26fd36e8-5e24-45a3-b579-6c4a08e7dfda/
    Description

    This dataset contains a list of petrol station locations in the Münster area. The data is queried directly from the OpenStreetMap database. Please note the following information about the data:

    • The query does not exactly cover the city area, but a rectangular frame around the city centre.
    • The OpenStreetMap database may not be complete. These data are collected by volunteers and usually have very good quality, but this is not official information from the city administration.
    • The license terms of OpenStreetMap Germany apply: http://www.openstreetmap.de/faq.html#lizenz

    OpenStreetMap is a project founded in 2004 with the aim of creating a free world map. Volunteers from many countries work on the further development of the software as well as the collection and processing of geodata. Data is collected about roads, railways, rivers, forests, houses and everything else that is commonly seen on maps.

    This record links directly to the current data query at overpass-api.de. If you find incorrect or missing information in this record, you can log in to OpenStreetMap and help improve the database yourself. You can complete or correct the data, similar to Wikipedia. How to do this is explained at: https://www.openstreetmap.de/faq.html#wie_mitmachen

    Furthermore, you can customise this data query yourself and extend the query result to the entire Münsterland or beyond, because the OpenStreetMap database contains Germany-wide data. An English guide to the query language used to formulate such data queries is available at: https://wiki.openstreetmap.org/wiki/Overpass_API/Language_Guide
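    A minimal sketch of re-running such a query from Python, assuming the public overpass-api.de interpreter endpoint; the bounding box is only an approximate rectangle around Münster and does not reproduce the exact frame used for this dataset:

    import requests

    # Overpass QL: all nodes tagged amenity=fuel inside an approximate bounding box.
    query = """
    [out:json][timeout:25];
    node["amenity"="fuel"](51.90,7.52,52.01,7.74);
    out body;
    """
    resp = requests.post("https://overpass-api.de/api/interpreter", data={"data": query})
    resp.raise_for_status()

    # Print the first few stations with their coordinates.
    for element in resp.json()["elements"][:5]:
        tags = element.get("tags", {})
        print(tags.get("name", "unnamed"), element["lat"], element["lon"])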

  15. Scientific Data Provenance in R: RDataTracker and DDG Explorer

    • search.dataone.org
    • portal.edirepository.org
    Updated Aug 25, 2014
    Cite
    Barbara Lerner; Emery Boose; Aaron Ellison; Leon Osterweil (2014). Scientific Data Provenance in R: RDataTracker and DDG Explorer [Dataset]. https://search.dataone.org/view/knb-lter-hfr.91.17
    Explore at:
    Dataset updated
    Aug 25, 2014
    Dataset provided by
    Long Term Ecological Research Network (http://www.lternet.edu/)
    Authors
    Barbara Lerner; Emery Boose; Aaron Ellison; Leon Osterweil
    Description

    Scientific data provenance is the information required to document the history of an item of data, including how it was created and how it was transformed. Data provenance has great potential to improve the transparency, reliability, and reproducibility of scientific results. However, it has been little used to date by domain scientists because most systems that collect provenance require scientists to learn specialized software tools and jargon. This project is developing tools that allow scientists to collect, visualize, and query provenance directly from the R statistical language. The first tool (RDataTracker) is a library of R functions that can be downloaded and installed as an R package. RDataTracker allows the scientist to collect data provenance during an R console session or while executing an R script. The resulting provenance is stored on the scientist's computer as a DDG (data derivation graph) file. The second tool (DDG Explorer) is a stand-alone Java program that can be downloaded and run to visualize, store, and query DDGs. The third tool is an R script (DDGCheckpoint.R) that may be used with RDataTracker to create and restore checkpoints that store the R environment and user files. Documentation for all tools is included with the RDataTracker package or may be downloaded separately.

  16. Repository Analytics and Metrics Portal (RAMP) 2018 data

    • data.niaid.nih.gov
    • dataone.org
    zip
    Updated Jul 27, 2021
    Cite
    Jonathan Wheeler; Kenning Arlitsch (2021). Repository Analytics and Metrics Portal (RAMP) 2018 data [Dataset]. http://doi.org/10.5061/dryad.ffbg79cvp
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 27, 2021
    Dataset provided by
    Montana State University
    University of New Mexico
    Authors
    Jonathan Wheeler; Kenning Arlitsch
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    The Repository Analytics and Metrics Portal (RAMP) is a web service that aggregates use and performance data of institutional repositories. The data are a subset of data from RAMP (http://rampanalytics.org), consisting of data from all participating repositories for the calendar year 2018. For a description of the data collection, processing, and output methods, please see the "methods" section below. Note that the RAMP data model changed in August 2018, and two sets of documentation are provided to describe data collection and processing before and after the change.

    Methods

    RAMP Data Documentation – January 1, 2017 through August 18, 2018

    Data Collection

    RAMP data were downloaded for participating IR from Google Search Console (GSC) via the Search Console API. The data consist of aggregated information about IR pages which appeared in search result pages (SERP) within Google properties (including web search and Google Scholar).

    Data from January 1, 2017 through August 18, 2018 were downloaded in one dataset per participating IR. The following fields were downloaded for each URL, with one row per URL:

    url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
    impressions: The number of times the URL appears within the SERP.
    clicks: The number of clicks on a URL which took users to a page outside of the SERP.
    clickThrough: Calculated as the number of clicks divided by the number of impressions.
    position: The position of the URL within the SERP.
    country: The country from which the corresponding search originated.
    device: The device used for the search.
    date: The date of the search.
    

    Following the data processing described below, on ingest into RAMP an additional field, citableContent, is added to the page level data.

    Note that no personally identifiable information is downloaded by RAMP. Google does not make such information available.

    More information about click-through rates, impressions, and position is available from Google's Search Console API documentation: https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query and https://support.google.com/webmasters/answer/7042828?hl=en

    Data Processing

    Upon download from GSC, data are processed to identify URLs that point to citable content. Citable content is defined within RAMP as any URL which points to any type of non-HTML content file (PDF, CSV, etc.). As part of the daily download of statistics from Google Search Console (GSC), URLs are analyzed to determine whether they point to HTML pages or actual content files. URLs that point to content files are flagged as "citable content." In addition to the fields downloaded from GSC described above, following this brief analysis one more field, citableContent, is added to the data which records whether each URL in the GSC data points to citable content. Possible values for the citableContent field are "Yes" and "No."
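
    A minimal sketch of the flagging rule described above, assuming a simple extension-based heuristic (the extension list is illustrative; RAMP's actual classification logic is not reproduced here):

    from urllib.parse import urlparse

    # Illustrative set of non-HTML content-file extensions.
    CONTENT_EXTENSIONS = {".pdf", ".csv", ".doc", ".docx", ".xls", ".xlsx", ".zip", ".txt"}

    def is_citable_content(url: str) -> str:
        """Return "Yes" if the URL appears to point to a content file, else "No"."""
        path = urlparse(url).path.lower()
        return "Yes" if any(path.endswith(ext) for ext in CONTENT_EXTENSIONS) else "No"

    print(is_citable_content("https://repo.example.edu/bitstream/1234/thesis.pdf"))  # Yes
    print(is_citable_content("https://repo.example.edu/handle/1234"))                # No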

    Processed data are then saved in a series of Elasticsearch indices. From January 1, 2017, through August 18, 2018, RAMP stored data in one index per participating IR.

    About Citable Content Downloads

    Data visualizations and aggregations in RAMP dashboards present information about citable content downloads, or CCD. As a measure of use of institutional repository content, CCD represent click activity on IR content that may correspond to research use.

    CCD information is summary data calculated on the fly within the RAMP web application. As noted above, data provided by GSC include whether and how many times a URL was clicked by users. Within RAMP, a "click" is counted as a potential download, so a CCD is calculated as the sum of clicks on pages/URLs that are determined to point to citable content (as defined above).

    For any specified date range, the steps to calculate CCD are:

    Filter data to only include rows where "citableContent" is set to "Yes."
    Sum the value of the "clicks" field on these rows.
    

    Output to CSV

    Published RAMP data are exported from the production Elasticsearch instance and converted to CSV format. The CSV data consist of one "row" for each page or URL from a specific IR which appeared in search result pages (SERP) within Google properties as described above.

    The data in these CSV files include the following fields:

    url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
    impressions: The number of times the URL appears within the SERP.
    clicks: The number of clicks on a URL which took users to a page outside of the SERP.
    clickThrough: Calculated as the number of clicks divided by the number of impressions.
    position: The position of the URL within the SERP.
    country: The country from which the corresponding search originated.
    device: The device used for the search.
    date: The date of the search.
    citableContent: Whether or not the URL points to a content file (ending with pdf, csv, etc.) rather than HTML wrapper pages. Possible values are Yes or No.
    index: The Elasticsearch index corresponding to page click data for a single IR.
    repository_id: This is a human readable alias for the index and identifies the participating repository corresponding to each row. As RAMP has undergone platform and version migrations over time, index names as defined for the index field have not remained consistent. That is, a single participating repository may have multiple corresponding Elasticsearch index names over time. The repository_id is a canonical identifier that has been added to the data to provide an identifier that can be used to reference a single participating repository across all datasets. Filtering and aggregation for individual repositories or groups of repositories should be done using this field.
    

    Filenames for files containing these data follow the format 2018-01_RAMP_all.csv. Using this example, the file 2018-01_RAMP_all.csv contains all data for all RAMP participating IR for the month of January, 2018.
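
    As a usage illustration, the CCD steps above can be applied to one of these published monthly files; the following is a minimal pandas sketch, assuming 2018-01_RAMP_all.csv has been downloaded locally:

    import pandas as pd

    # Load one month of published page-level data (local path assumed).
    ramp = pd.read_csv("2018-01_RAMP_all.csv")

    # Citable Content Downloads (CCD): keep rows flagged as citable content,
    # then sum clicks; grouping by repository_id gives per-repository CCD.
    citable = ramp[ramp["citableContent"] == "Yes"]
    ccd_total = citable["clicks"].sum()
    ccd_by_repo = citable.groupby("repository_id")["clicks"].sum().sort_values(ascending=False)

    print(f"Total CCD for January 2018: {ccd_total}")
    print(ccd_by_repo.head())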

    Data Collection from August 19, 2018 Onward

    RAMP data are downloaded for participating IR from Google Search Console (GSC) via the Search Console API. The data consist of aggregated information about IR pages which appeared in search result pages (SERP) within Google properties (including web search and Google Scholar).

    Data are downloaded in two sets per participating IR. The first set includes page level statistics about URLs pointing to IR pages and content files. The following fields are downloaded for each URL, with one row per URL:

    url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
    impressions: The number of times the URL appears within the SERP.
    clicks: The number of clicks on a URL which took users to a page outside of the SERP.
    clickThrough: Calculated as the number of clicks divided by the number of impressions.
    position: The position of the URL within the SERP.
    date: The date of the search.
    

    Following the data processing described below, on ingest into RAMP an additional field, citableContent, is added to the page level data.

    The second set includes similar information, but instead of being aggregated at the page level, the data are grouped based on the country from which the user submitted the corresponding search, and the type of device used. The following fields are downloaded for each combination of country and device, with one row per country/device combination:

    country: The country from which the corresponding search originated.
    device: The device used for the search.
    impressions: The number of times the URL appears within the SERP.
    clicks: The number of clicks on a URL which took users to a page outside of the SERP.
    clickThrough: Calculated as the number of clicks divided by the number of impressions.
    position: The position of the URL within the SERP.
    date: The date of the search.
    

    Note that no personally identifiable information is downloaded by RAMP. Google does not make such information available.

    More information about click-through rates, impressions, and position is available from Google's Search Console API documentation: https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query and https://support.google.com/webmasters/answer/7042828?hl=en

    Data Processing

    Upon download from GSC, the page level data described above are processed to identify URLs that point to citable content. Citable content is defined within RAMP as any URL which points to any type of non-HTML content file (PDF, CSV, etc.). As part of the daily download of page level statistics from Google Search Console (GSC), URLs are analyzed to determine whether they point to HTML pages or actual content files. URLs that point to content files are flagged as "citable content." In addition to the fields downloaded from GSC described above, following this brief analysis one more field, citableContent, is added to the page level data which records whether each page/URL in the GSC data points to citable content. Possible values for the citableContent field are "Yes" and "No."

    The data aggregated by the search country of origin and device type do not include URLs. No additional processing is done on these data. Harvested data are passed directly into Elasticsearch.

    Processed data are then saved in a series of Elasticsearch indices. Currently, RAMP stores data in two indices per participating IR. One index includes the page level data, the second index includes the country of origin and device type data.

    About Citable Content Downloads

    Data visualizations and aggregations in RAMP dashboards present information about citable content downloads, or CCD. As a measure of use of institutional repository content, CCD represent click activity on IR content that may correspond to research use.

  17. FLOWRepository

    • neuinfo.org
    • dknet.org
    Updated Jan 29, 2022
    Cite
    (2022). FLOWRepository [Dataset]. http://identifiers.org/RRID:SCR_013779/resolver?q=&i=rrid
    Explore at:
    Dataset updated
    Jan 29, 2022
    Description

    A database of flow cytometry experiments where users can query and download data collected and annotated according to the MIFlowCyt data standard.

  18. Data from: SQL Injection Attack Netflow

    • data.niaid.nih.gov
    • portalcienciaytecnologia.jcyl.es
    Updated Sep 28, 2022
    Cite
    Ignacio Crespo; Adrián Campazas (2022). SQL Injection Attack Netflow [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6907251
    Explore at:
    Dataset updated
    Sep 28, 2022
    Authors
    Ignacio Crespo; Adrián Campazas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    This dataset contains SQL injection attacks (SQLIA) as malicious NetFlow data. The attacks carried out are SQL injection for Union Query and Blind SQL injection. To perform the attacks, the SQLMAP tool has been used.

    The NetFlow traffic has been generated using DOROTHEA (DOcker-based fRamework fOr gaTHering nEtflow trAffic). NetFlow is a network protocol developed by Cisco for collecting and monitoring network traffic flow data. A flow is defined as a unidirectional sequence of packets with some common properties that pass through a network device.

    Datasets

    The first dataset (D1) was collected to train the detection models; the second (D2) was collected using different attacks than those used in training, in order to test the models and ensure their generalization.

    The datasets contain both benign and malicious traffic. All collected datasets are balanced.

    The version of NetFlow used to build the datasets is 5.

        Dataset   Aim        Samples    Benign-malicious traffic ratio
        D1        Training   400,003    50%
        D2        Test       57,239     50%
    

    Infrastructure and implementation

    Two sets of flow data were collected with DOROTHEA. DOROTHEA is a Docker-based framework for NetFlow data collection. It allows you to build interconnected virtual networks to generate and collect flow data using the NetFlow protocol. In DOROTHEA, network traffic packets are sent to a NetFlow generator that has a sensor ipt_netflow installed. The sensor consists of a module for the Linux kernel using Iptables, which processes the packets and converts them to NetFlow flows.

    DOROTHEA is configured to use NetFlow v5 and to export a flow after it has been inactive for 15 seconds or after it has been active for 1800 seconds (30 minutes).

    Benign traffic generation nodes simulate network traffic generated by real users, performing tasks such as searching in web browsers, sending emails, or establishing Secure Shell (SSH) connections. Such tasks run as Python scripts; users may customize them or even incorporate their own. The network traffic is managed by a gateway that performs two main tasks: on the one hand, it routes packets to the Internet; on the other hand, it sends them to a NetFlow data generation node (packets received from the Internet are handled in the same way).

    The malicious traffic collected (SQLI attacks) was performed using SQLMAP. SQLMAP is a penetration tool used to automate the process of detecting and exploiting SQL injection vulnerabilities.

    The attacks were executed from 16 nodes, each launching SQLMAP with the parameters listed below (an illustrative invocation assembling these options is sketched further down).

    '--banner', '--current-user', '--current-db', '--hostname', '--is-dba', '--users', '--passwords', '--privileges', '--roles', '--dbs', '--tables', '--columns', '--schema', '--count', '--dump', '--comments': Enumerate users, password hashes, privileges, roles, databases, tables and columns
    --level=5: Increase the probability of a false positive identification
    --risk=3: Increase the probability of extracting data
    --random-agent: Select the User-Agent randomly
    --batch: Never ask for user input, use the default behavior
    --answers="follow=Y": Predefined answers to yes
    

    Every node executed SQLIA against 200 victim nodes. The victim nodes had deployed a web form vulnerable to Union-type injection attacks, which was connected to either the MySQL or the SQLServer database engine (50% of the victim nodes deployed MySQL and the other 50% deployed SQLServer).
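
    As an illustration only (not the authors' exact harness), the options listed above can be assembled into a single SQLMAP command line; the target URL below is hypothetical and sqlmap is assumed to be installed separately:

    import shlex

    # Hypothetical vulnerable endpoint on one of the victim nodes (126.52.30.0/24 address space).
    target = "http://126.52.30.10/form.php?id=1"

    # Enumeration flags taken from the parameter list above.
    enumeration = ["--banner", "--current-user", "--current-db", "--hostname", "--is-dba",
                   "--users", "--passwords", "--privileges", "--roles", "--dbs",
                   "--tables", "--columns", "--schema", "--count", "--dump", "--comments"]

    cmd = (["sqlmap", "-u", target, "--level=5", "--risk=3",
            "--random-agent", "--batch", "--answers=follow=Y"] + enumeration)

    # Print the assembled command; running it requires sqlmap on the PATH.
    print(shlex.join(cmd))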

    The web service was accessible from ports 443 and 80, which are the ports typically used to deploy web services. The IP address space was 182.168.1.1/24 for the benign and malicious traffic-generating nodes. For victim nodes, the address space was 126.52.30.0/24. The malicious traffic in the test sets was collected under different conditions. For D1, SQLIA was performed using Union attacks on the MySQL and SQLServer databases.

    However, for D2, Blind SQL injection attacks were performed against the web form connected to a PostgreSQL database. The IP address spaces of the networks were also different from those of D1. In D2, the IP address space was 152.148.48.1/24 for benign and malicious traffic generating nodes and 140.30.20.1/24 for victim nodes.

    To run the MySQL server we ran MariaDB version 10.4.12. Microsoft SQL Server 2017 Express and PostgreSQL version 13 were used.

  19. Decoding Home Values: The Power of Education vs. Race, Ethnicity, and Gender

    • chi-phi-nmcdc.opendata.arcgis.com
    Updated Jul 25, 2023
    Cite
    New Mexico Community Data Collaborative (2023). Decoding Home Values: The Power of Education vs. Race, Ethnicity, and Gender [Dataset]. https://chi-phi-nmcdc.opendata.arcgis.com/datasets/decoding-home-values-the-power-of-education-vs-race-ethnicity-and-gender
    Explore at:
    Dataset updated
    Jul 25, 2023
    Dataset authored and provided by
    New Mexico Community Data Collaborative
    Description

    A detailed explanation of how this dataset was put together, including data sources and methodologies, follows below. Please see the "Terms of Use" section below for the Data Dictionary.

    DATA ACQUISITION AND CLEANING PROCESS

    This dataset was built from 5 separate datasets queried during April and May 2023 from the Census Microdata System: https://data.census.gov/mdat/#/ All datasets include information on Property Value (VALP) by Educational Attainment (SCHL), Gender (SEX), and a specified race or ethnicity (RAC or HISP), and are grouped by Public Use Microdata Areas (PUMAs). PUMAs are geographic areas created by the Census Bureau; they are weighted by land area and population to facilitate data analysis. Data also included totals for the state of New Mexico, so 19 total geographies are represented. Datasets were downloaded separately by race and ethnicity because this was the only way to obtain the VALP, SCHL, and SEX variables intersectionally with race or ethnicity data.

    Cleaning each dataset started with recoding the SCHL and HISP variables; details on recoding can be found below. After recoding, each dataset was transposed so that PUMAs were rows and the SCHL, VALP, SEX, and race or ethnicity variables were the columns. Median values were calculated in every case where recoding was necessary; as a result, all property values in this dataset reflect median values. At times the ACS data downloaded with zeros instead of the 'null' values shown in initial query results, and the VALP variable also included a "-1" value to reflect N/A (details in the variable notes). Both zeros and "-1" values were removed before calculating median values, both to keep the data true to the original query and to generate accurate medians.

    Recoding the SCHL variable resulted in 5 rows for each PUMA, reflecting the different levels of educational attainment in each region. Columns grouped variables by race or ethnicity and gender; cell values were property values. All 5 datasets were joined after recoding and cleaning; the original datasets each include 95 rows, with 5 separate educational attainment values for each PUMA, including the New Mexico state totals. Because 1 row was needed for each PUMA in order to map the data, the data was split by Educational Attainment (SCHL), resulting in 110 columns reflecting median property values for each race or ethnicity by gender and level of educational attainment. A short, unique 2 to 5 letter alias was created for each PUMA area in anticipation of needing a unique identifier to join the data with.

    GIS AND MAPPING PROCESS

    A PUMA shapefile was downloaded from the ACS site: https://tigerweb.geo.census.gov/arcgis/rest/services/TIGERweb/PUMA_TAD_TAZ_UGA_ZCTA/MapServer The DBF from the PUMA shapefile was exported to Excel; this shapefile data included the geographic information needed for mapping, such as GEOID and PUMACE. The UIDs created for each PUMA were added to the shapefile data, and the PUMA shapefile data and ACS data were then joined on UID in JMP. The data table was joined to the shapefile in ArcGIS, based on PUMA region (specifically GEOID text). The resulting shapefile was exported as a GDB (geodatabase) in order to keep 'Null' values in the data; GDBs can include a rule allowing null values where shapefiles cannot. This GDB was uploaded to NMCDC's ArcGIS platform.

    SYSTEMS USED

    MS Excel was used for data cleaning, recoding, and deriving values. Recoding was done directly in the Microdata system when possible, but because the system was in beta at the time of use, some features were not always functional. JMP was used to transpose, join, and split data. ArcGIS Desktop was used to create the shapefile uploaded to NMCDC's online platform.

    VARIABLE AND RECODING NOTES

    TIMEFRAME: Data was queried for the 5-year period 2015 to 2019 because the ACS changed its definition of, and methods of collecting, data on race and ethnicity in 2020. The change resulted in greater aggregation and less granular data on these variables from 2020 onward. Note: all race data reflects that respondents identified as the specified race alone or in combination with one or more other races.

    RACBLK: Black or African American. ACS Query: RACBLK, SCHL, SEX, VALP 2019 5yr
    RACAIAN: American Indian and Alaska Native. ACS Query: RACAIAN, SCHL, SEX, VALP 2019 5yr
    RACASN: Asian. ACS Query: RACASN, SCHL, SEX, VALP 2019 5yr
    RACWHT: White. ACS Query: RACWHT, SCHL, SEX, VALP 2019 5yr
    HISP: Hispanic Origin. ACS Query: HISP ORG, SCHL, SEX, VALP 2019 5yr
    HISP RECODE: The Hispanic Origin (HISP) variable originally included 24 subcategories reflecting Mexican, Central American, South American, Caribbean Latino, and Spanish identities from each Latin American country. These 24 variables were recoded (grouped) into 7 simpler categories for data analysis: Not Spanish/Hispanic/Latino, Mexican, Caribbean Latino, Central American, South American, Spaniard, and All Other Spanish/Hispanic/Latino. The Not Spanish/Hispanic/Latino category was not really used in the final dataset, as the race datasets provided that information.
    SCHL: Educational Attainment. The SCHL variable originally included 25 subcategories reflecting the education levels of adults (over 18) surveyed by the ACS: Kindergarten, Grades 1 through 12 separately, 12th grade with no diploma, High School Diploma, GED or credential, less than 1 year of college, more than 1 year of college with no degree, Associate's Degree, Bachelor's Degree, Master's Degree, Professional Degree, and Doctorate Degree.
    SCHL RECODE: These 25 variables were recoded (grouped) into 5 simpler categories for data analysis: No High School Diploma, High School Diploma or GED, Some College, Bachelor's Degree, and Advanced or Professional Degree.
    SEX: Gender. 2 values: 1 - Male, 2 - Female.
    VALP: Property Value. Values were rounded and top-coded by ACS for anonymity. The "-1" value is defined as N/A (GQ / vacant lots except 'for sale only' and 'sold, not occupied' / not owned or being bought). This variable reflects the median value of property owned by individuals of each race, ethnicity, gender, and educational attainment category.
    PUMA: Public Use Microdata Area. 18 PUMAs; PUMAs in New Mexico can be viewed here: https://nmcdc.maps.arcgis.com/apps/mapviewer/index.html?webmap=d9fed35f558948ea9051efe9aa529eaf Data includes 19 total regions: 18 PUMAs and the NM state totals.

    NOTES AND RESOURCES

    The following resources and documentation were used to navigate the ACS PUMS system and to answer questions about variables:
    Census Microdata API User Guide: https://www.census.gov/data/developers/guidance/microdata-api-user-guide.Additional_Concepts.html#list-tab-1433961450
    Accessing PUMS Data: https://www.census.gov/programs-surveys/acs/microdata/access.html
    How to use PUMS on data.census.gov: https://www.census.gov/programs-surveys/acs/microdata/mdat.html
    2019 PUMS Documentation: https://www.census.gov/programs-surveys/acs/microdata/documentation.2019.html#list-tab-1370939201
    2014 to 2018 ACS PUMS Data Dictionary: https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2014-2018.pdf
    2019 PUMS TIGER/Line Shapefiles: https://www.census.gov/cgi-bin/geo/shapefiles/index.php?year=2019&layergroup=Public+Use+Microdata+Areas

    Note 1: NMCDC attempted to contact analysts with the ACS system to clarify questions about variables, but did not receive a timely response. Documentation was then consulted. Note 2: All relevant documentation was reviewed and seems to imply that all survey questions were answered by adults, age 18 or over; youth who have inherited property could potentially be reflected in this data.

    Dataset and feature service created in May 2023 by Renee Haley, Data Specialist, NMCDC.
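
    A minimal pandas sketch of the cleaning rule described above (dropping the zero and "-1" placeholder property values before taking medians); the file name and the SCHL_RECODE column are hypothetical stand-ins for one of the downloaded PUMS extracts:

    import pandas as pd

    # Hypothetical extract from the Census Microdata System with VALP, SCHL, SEX and a PUMA column.
    pums = pd.read_csv("pums_extract_racblk.csv")

    # Drop the placeholder property values: 0 (spurious nulls) and -1 (N/A), per the notes above.
    pums = pums[~pums["VALP"].isin([0, -1])]

    # Median property value by PUMA, recoded educational attainment, and gender.
    medians = pums.groupby(["PUMA", "SCHL_RECODE", "SEX"])["VALP"].median()
    print(medians.head())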

  20. Replication Data for: Advancing Privacy Research: A Novel Realistic Persona-Based Dataset

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Jul 19, 2023
    Cite
    AbdElRahman ElSaid (2023). Replication Data for: Advancing Privacy Research: A Novel Realistic Persona-Based Dataset [Dataset]. http://doi.org/10.7910/DVN/GOHBTR
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 19, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    AbdElRahman ElSaid
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    We introduce a unique approach to privacy research by creating a virtual persona that mimics human web-searching behaviors. The persona's activities, categorized into 'morning', 'afternoon', and 'evening', were automated using the Selenium WebDriver, enabling the persona to conduct searches as a real user would. The resulting dataset comprises 1,537 records, each representing a unique search query; each record contains the query keywords and the first two pages of the query result. The study offers a fresh perspective on privacy and personalization in online environments. The potential for reusing this dataset is significant: it can be applied to studies on privacy, data collection, and search engine personalization, and it can be used to develop and test algorithms and models that aim to protect user privacy.
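
    The authors' automation code is not part of this record; the following is only a rough sketch of the kind of Selenium WebDriver routine described, with a placeholder search site, element locator, and query:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys

    # One automated "persona" search; the target site and locator are placeholders.
    driver = webdriver.Chrome()
    try:
        driver.get("https://www.example-search.com")
        box = driver.find_element(By.NAME, "q")
        box.send_keys("morning commute traffic news")
        box.send_keys(Keys.RETURN)
        # The dataset records the query keywords plus the first two result pages;
        # here we simply capture the first page's HTML as an illustration.
        first_page_html = driver.page_source
    finally:
        driver.quit()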

MetaMath QA

Mathematical Questions for Large Language Models

2 scholarly articles cite this dataset (View in Google Scholar)

Research Ideas

  • Generating natural language processing (NLP) models to better identify patterns and connections between questions, answers, and types.
  • Developing understandings on the efficiency of certain language features in producing successful question-answering results for different types of queries.
  • Optimizing search algorithms that surface relevant answer results based on types of queries

Acknowledgements

If you use this dataset in your research, please credit the original authors and Huggingface Hub. Data Source

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: train.csv

Column name   Description
response      The response to the query. (String)
type          The type of query. (String)
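
A minimal sketch of loading the file with pandas, assuming train.csv has been downloaded locally from this Kaggle dataset:

import pandas as pd

# Load the MetaMathQA training split (local path assumed).
train = pd.read_csv("train.csv")

print(train.columns.tolist())        # column names, e.g. ['response', 'type', 'query']
print(train["type"].value_counts())  # distribution of query types
print(train.head(1))                 # inspect the first record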

