Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This dataset contains meta-mathematics questions and answers collected from the Mistral-7B question-answering system. The responses, types, and queries are all provided in order to help boost the performance of MetaMathQA while maintaining high accuracy. With its well-structured design, this dataset provides users with an efficient way to investigate various aspects of question answering models and further understand how they function. Whether you are a professional or beginner, this dataset is sure to offer invaluable insights into the development of more powerful QA systems!
For more datasets, click here.
- 🚨 Your notebook can be here! 🚨!
Data Dictionary
The MetaMathQA dataset contains three columns: response, type, and query. - Response: the response to the query given by the question answering system. (String) - Type: the type of query provided as input to the system. (String) - Query:the question posed to the system for which a response is required. (String)
Preparing data for analysis
It’s important that before you dive into analysis, you first familiarize yourself with what kind data values are present in each column and also check if any preprocessing needs to be done on them such as removing unwanted characters or filling in missing values etc., so that it can be used without any issue while training or testing your model further down in your process flow.
##### Training Models using Mistral 7B
Mistral 7B is an open source framework designed for building machine learning models quickly and easily from tabular (csv) datasets such as those found in this dataset 'MetaMathQA ' . After collecting and preprocessing your dataset accordingly Mistral 7B provides with support for various Machine Learning algorithms like Support Vector Machines (SVM), Logistic Regression , Decision trees etc , allowing one to select from various popular libraries these offered algorithms with powerful overall hyperparameter optimization techniques so soon after selecting algorithm configuration its good practice that one use GridSearchCV & RandomSearchCV methods further tune both optimizations during model building stages . Post selection process one can then go ahead validate performances of selected models through metrics like accuracy score , F1 Metric , Precision Score & Recall Scores .
##### Testing phosphors :
After successful completion building phase right way would be robustly testing phosphors on different evaluation metrics mentioned above Model infusion stage helps here immediately make predictions based on earlier trained model OK auto back new test cases presented by domain experts could hey run quality assurance check again base score metrics mentioned above know asses confidence value post execution HHO updating baseline scores running experiments better preferred methodology AI workflows because Core advantage finally being have relevancy inexactness induced errors altogether impact low
- Generating natural language processing (NLP) models to better identify patterns and connections between questions, answers, and types.
- Developing understandings on the efficiency of certain language features in producing successful question-answering results for different types of queries.
- Optimizing search algorithms that surface relevant answer results based on types of queries
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv | Column name | Description | |:--------------|:------------------------------------| | response | The response to the query. (String) | | type | The type of query. (String) |
If you use this dataset in your research, please credit the original authors. If you use this dataset in your research, please credit Huggingface Hub.
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
LC-QuAD 2.0 is a breakthrough dataset designed to advance the state of intelligent querying towards unprecedented heights. By providing a collection of 30,000 different pairs of questions and their respective SPARQL queries each, it presents an enormous opportunity for every person looking to unlock the power of knowledge with smart querying techniques.
These questions have been carefully devised such that they relate to the latest version of Wikidata and DBpedia, granting tech-savvy individuals an access key to an information repository far beyond what was once thought imaginable. The dataset found under this union is nothing short of amazing - consisting not just of Natural Language Questions but also their solutions in the form of a SPARQL query. With LC-QuAD 2.0, you have at your fingertips more than thirty thousand answers ready for any query you can think up! Unlocking knowledge has never been easier!
For more datasets, click here.
- 🚨 Your notebook can be here! 🚨!
Using the LC-QuAD 2.0 dataset can be a great way to power up your intelligent systems with smarter querying. Whether you want to build a question-answering system or create new knowledge graphs and search systems, utilizing this dataset can certainly be helpful. Here is a guide on how to use this dataset:
Understand the structure of the data: The LC-QuAD 2.0 consists of 30,000 different pairs of questions and their corresponding SPARQL queries in two files – train (used for training an intelligent system) and test (used for testing an intelligent system). The columns present in each pair are NNQT_question (Natural Language Question), subgraph (Subgraph information for the question), sparql_dbpedia18 (SPARQL query for DBpedia 18), template (Templates from which SPARQL query was generated).
Read up on SPARQL: Before you start using this dataset, it is important that you read up more on what SPARQL means and how it works as SPAQL will be used frequently when browsing through this data set. This will make the understanding of the content easier and quicker!
Start exploring!: After doing some research about SPARQL, now it’s time to explore! You can start by looking at each pair in detail - read through its natural language question, subgraph information and try understanding its relation with its corresponding sparql queries from both DBpedia 18 or try running these sparql queries yourself against Wikidata or DBPedia platform to see where they lead you eventually! In case any query has multiple results having different variances with respect to answers range , then look inside entity definitions contained within words \ phrases / synonyms reflected by natural language parsing services API's like AIKATsetu etc., before writing authoritative answer modules/endpoints forming partinmonly sustainable pipeline architecture using such prepared & refined datasets like LC-QUAD !
Use your own data: Once you have familiarized yourself sufficiently with the available pairs & understand their relevance , consider creating your own data set by adding more complex questions along associated unique attributes which shall give great insights . If not done already evaluate if population enrichment techniques should be applied suiting specific domain's needs your bot purports - either just features selection criterion wise or entire classifier selection algorithm wise as otherwise global extracted vectors may decide either selectively for reducing overfitting/generalization penalty in
- Incorporating the LC-QUAD 2.0 dataset into Intelligent systems such as Chatbots, Question Answering Systems, and Document Summarization programs to allow them to retrieve the required information by transforming natural language questions into SPARQL queries.
- Utilizing this dataset in Semantic Scholar Search Engines and Academic Digital Libraries which can use natural language queries instead of keywords in order to perform more sophisticated searches and provide more accurate results for researchers in diverse areas.
- Applying this dataset for building Knowledge Graphs that can store entities along with their attributes, categories and relations thereby allowing better understanding of complex relationships between entities or data and further advancing development of AI agents that are able to answer specific questions or provide personalized recommendations in various contexts or tasks
&g...
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
AOL search data anonymized and released by AOL Research in 2006."500k User Session Collection----------------------------------------------This collection is distributed for NON-COMMERCIAL RESEARCH USE ONLY. Any application of this collection for commercial purposes is STRICTLY PROHIBITED.Brief description:This collection consists of ~20M web queries collected from ~650k users over three months.The data is sorted by anonymous user ID and sequentially arranged. The goal of this collection is to provide real query log data that is based on real users. It could be used for personalization, query reformulation or other types of search research. The data set includes {AnonID, Query, QueryTime, ItemRank, ClickURL}. AnonID - an anonymous user ID number. Query - the query issued by the user, case shifted with most punctuation removed. QueryTime - the time at which the query was submitted for search. ItemRank - if the user clicked on a search result, the rank of the item on which they clicked is listed. ClickURL - if the user clicked on a search result, the domain portion of the URL in the clicked result is listed.Each line in the data represents one of two types of events: 1. A query that was NOT followed by the user clicking on a result item. 2. A click through on an item in the result list returned from a query.In the first case (query only) there is data in only the first three columns/fields -- namely AnonID, Query, and QueryTime (see above). In the second case (click through), there is data in all five columns. For click through events, the query that preceded the click through is included. Note that if a user clicked on more than one result in the list returned from a single query, there will be TWO lines in the data to represent the two events. Also note that if the user requested the next "page" or results for some query, this appears as a subsequent identical query with a later time stamp.CAVEAT EMPTOR -- SEXUALLY EXPLICIT DATA! Please be aware that these queries are not filtered to remove any content. Pornography is prevalent on the Web and unfiltered search engine logs contain queries by users who are looking for pornographic material. There are queries in this collection that use SEXUALLY EXPLICIT LANGUAGE. This collection of data is intended for use by mature adults who are not easily offended by the use of pornographic search terms. If you are offended by sexually explicit language you should not read through this data. Also be aware that in some states it may be illegal to expose a minor to this data. Please understand that the data represents REAL WORLD USERS, un-edited and randomly sampled, and that AOL is not the author of this data.Basic Collection StatisticsDates: 01 March, 2006 - 31 May, 2006Normalized queries: 36,389,567 lines of data 21,011,340 instances of new queries (w/ or w/o click-through) 7,887,022 requests for "next page" of results 19,442,629 user click-through events 16,946,938 queries w/o user click-through 10,154,742 unique (normalized) queries 657,426 unique user ID'sPlease reference the following publication when using this collection:G. Pass, A. Chowdhury, C. Torgeson, "A Picture of Search" The First International Conference on Scalable Information Systems, Hong Kong, June, 2006.Copyright (2006) AOL"
Facebook
TwitterMIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
AstroLLMs Full Query Dataset
This dataset includes all of the data collected in a four-week deployment of a Large Language Model-powered Slack chatbot trained on astrophysics papers. Astronomers were invited to interact with the chatbot, ask questions, and leave feedback. This data includes 368 question-answer pairs, including feedback, reactions, and labeling.
Dataset Structure
The columns of this dataset are thread_ts (unique time stamp of the query), channel_id… See the full description on the dataset page: https://huggingface.co/datasets/jhu-clsp/astro-llms-full-query-data.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SARA - A Collection of Sensitivity-Aware Relevance Assessments
Presented here is a collection of Sensitivity-Aware Relevance Assessments for the UC Berkely labelled subset of the Enron Email Collection. The Hearst [1] labelled version of the Enron Email Collection is a subset of the CMU collection that contains 1702 emails that were annotated as part of a class project at UC Berkley. Students in the Natural Language Processing course were tasked with annotating the emails as relevant or not relevant to 53 different categories. Therefore, the labelled version of the Enron email collection provides a rich taxonomy of labels which can be used for multiple definitions of sensitivity such as the Purely Personal and Personal but in a Professional Context. The categories that the emails are labelled for can be seen in Table 1. The files for the labelled version of the Enron Email Collection are available from the UC Berkely website.
We deploy a topic modelling approach to identify topical themes in the labelled Enron collection that serve as a basis for our information needs which are in turn used to gather queries and relevance assessments, the notebook for which is available here. Two separate crowdsourcing tasks are carried out in the development of SARA. Firstly, query formulations are crowdsourced to represent the information needs and, secondly, relevance assessments are crowdsourced for a pooled set of documents from the labelled Enron collection for each of the information needs.
The SARA Collection of Sensitivity-Aware Relevance Assessments is available through the popular ir_datasets library. More information can be found on the ir_datasets GitHub and website.
Information Needs
To create our set of sensitivity-aware relevance assessments for the labelled Enron email collection, we first identify a set of topical subjects that reflect the contents of the emails in the collection. We use a topic modelling approach to identify the information needs. When identifying topics to be used as information needs, we are interested in identifying general themes that relate to the topics of discussion that might likely be covered in the contents (i.e., the body) of the emails in the collection. The topics are chosen to be broad enough to be able to reasonably expect that there would be relevant documents in the collection, and not so specific that it would require specialist knowledge to make a judgement of relevance on the subject. Subsequently, we manually construct short passages of text to serve as descriptions of the information needs that are to be searched for in the collection by the crowdworkers. The information needs that the crowdworkers are available in the information_needs.tsv file.
Queries
In order to collect relevance assessments for pairs of emails and information needs, different query formulations are first needed to generate pools of documents. Query formulations for each topic are collected from crowdworkers from the Prolific crowdwork platform. Ten information needs are shown to each crowdworker and they are asked to provide a query formulation that they would use to get relevant documents to satisfy the information need they are presented with. Three queries for each of the fifty information needs are released. The resulting queries are available in the repeated_queries.tsv file.
Relevance Assesments
Crowdworkers are shown an information need and an email and asked to rate the document as being either Highly Relevant, Partially Relevant, or Not Relevant to the information need. Each information need/email pair is judged by three crowdworkers and a majority vote is used to generate a ground truth label. Since each information need / email pair is judged by three crowdworkers and there are three possible labels, it is possible for each of the labels to be selected by one crowdworker. In practice, this only happened for 134 pairs. In such cases, ties are broken by having one of the authors read the document and make an additional judgement. In order to ensure that sensitive documents definitely have relevance labels they were also judged by one of the authors for each of the information needs. The relevance assessments are available in the repeated_qrels.txt file. The relevance assessments are in the format 'query iteration document relevancy'. The iteration column is used for IR_Datasets and can be safely ignored and the document name is the filename used in the labelled Enron collection.
Table 1
1) Coarse genre
2) Included/forwarded information
3) Primary topics (If coarse genre 1.1 is selected)
4) Emotional tone (If not neutral)
1.1 Company Business, Strategy, etc. (See 3)
2.1 Includes new text in addition to forwarded material
3.1 Regulations and regulators (includes price caps)
4.1 Jubilation
1.2 Purely Personal
2.2 Forwarded email(s) including replies
3.2 Internal projects -- progress and strategy
4.2 Hope / anticipation
1.3 Personal but in professional context (e.g., it was good working with you)
2.3 Business letter(s) / document(s)
3.3 Company image -- current
4.3 Humor
1.4 Logistic Arrangements (meeting scheduling, technical support, etc.)
2.4 News article(s)
3.4 Company image -- changing / influencing
4.4 Camaraderie
1.5 Employment arrangements (job seeking, hiring, recommendations, etc.)
2.5 Government / academic report(s)
3.5 Political influence / contributions / contacts
4.5 Admiration
1.6 Document editing/checking (collaboration)
2.6 Government action(s) (such as results of a hearing, etc.)
3.6 California energy crisis / California politics
4.6 Gratitude
1.7 Empty message (due to missing attachment)
2.7 Press release(s)
3.7 Internal company policy
4.7 Friendship / affection
1.8 Empty message
2.8 Legal documents (complaints, lawsuits, advice)
3.8 Internal company operations
4.8 Sympathy / support
2.9 Pointers to url(s)
3.9 Alliances / partnerships
4.9 Sarcasm
2.10 Newsletters
3.10 Legal advice
4.10 Secrecy / confidentiality
2.11 Jokes, humor (related to business)
3.11 Talking points
4.11 Worry / anxiety
2.12 Jokes, humor (unrelated to business)
3.12 Meeting minutes
4.12 Concern
2.13 Attachment(s) (assumed missing)
3.13 Trip reports
4.13 Competitiveness / aggressiveness
4.14 Triumph / gloating
4.15 Pride
4.16 Anger / agitation
4.17 Sadness / despair
4.18 Shame
4.19 Dislike / scorn
The Sensitivity-Aware Relevance Assessments dataset is held under an Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) licence which allows for it to be adapted, transformed and built upon.
Questions and comments are welcomed via email.
References
[1] Marti A Hearst. 2005. Teaching applied natural language processing: Triumphs and tribulations. In Proc. of Workshop on Effective Tools and Methodologies for Teaching NLP and CL.
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
A large crowd-sourced dataset for developing natural language interfaces for relational databases. WikiSQL is a dataset of 80654 hand-annotated examples of questions and SQL queries distributed across 24241 tables from Wikipedia.
For more datasets, click here.
- 🚨 Your notebook can be here! 🚨!
This dataset can be used to develop natural language interfaces for relational databases. The data fields are the same among all splits, and the file contains information on the phase, question, table, and SQL for each interface
- This dataset can be used to develop natural language interfaces for relational databases.
- This dataset can be used to develop a knowledge base of common SQL queries.
- This dataset can be used to generate a training set for a neural network that translates natural language into SQL queries
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: validation.csv | Column name | Description | |:--------------|:---------------------------------------------------------| | phase | The phase of the data collection. (String) | | question | The question asked by the user. (String) | | table | The table containing the data for the question. (String) | | sql | The SQL query corresponding to the question. (String) |
File: train.csv | Column name | Description | |:--------------|:---------------------------------------------------------| | phase | The phase of the data collection. (String) | | question | The question asked by the user. (String) | | table | The table containing the data for the question. (String) | | sql | The SQL query corresponding to the question. (String) |
File: test.csv | Column name | Description | |:--------------|:---------------------------------------------------------| | phase | The phase of the data collection. (String) | | question | The question asked by the user. (String) | | table | The table containing the data for the question. (String) | | sql | The SQL query corresponding to the question. (String) |
If you use this dataset in your research, please credit the original authors. If you use this dataset in your research, please credit Huggingface Hub.
Facebook
TwitterThis is a test collection for passage and document retrieval, produced in the TREC 2023 Deep Learning track. The Deep Learning Track studies information retrieval in a large training data regime. This is the case where the number of training queries with at least one positive label is at least in the tens of thousands, if not hundreds of thousands or more. This corresponds to real-world scenarios such as training based on click logs and training based on labels from shallow pools (such as the pooling in the TREC Million Query Track or the evaluation of search engines based on early precision).Certain machine learning based methods, such as methods based on deep learning are known to require very large datasets for training. Lack of such large scale datasets has been a limitation for developing such methods for common information retrieval tasks, such as document ranking. The Deep Learning Track organized in the previous years aimed at providing large scale datasets to TREC, and create a focused research effort with a rigorous blind evaluation of ranker for the passage ranking and document ranking tasks.Similar to the previous years, one of the main goals of the track in 2022 is to study what methods work best when a large amount of training data is available. For example, do the same methods that work on small data also work on large data? How much do methods improve when given more training data? What external data and models can be brought in to bear in this scenario, and how useful is it to combine full supervision with other forms of supervision?The collection contains 12 million web pages, 138 million passages from those web pages, search queries, and relevance judgments for the queries.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is comprised of 5 CSV files contained in the data.zip archive. Each one represents a production machine from which various sensor data has been collected. The average cadence for collection was 5 measurements per second. The monitored devices where used for hydroforming.
The collection period covered the period from 2023-06-01 until 2023-08-05.
These files represent a complete data dump from the data available in the time-series database, InfluxDB, used for collection. Because of this some columns have no semantic value for detecting production cycles or any other analytics.
Each file contains a total of 14 columns. Some of the columns are artefacts of the query used to extract the data from InfluxDB and can be discarded. These columns are: results, table _start, _stop
Pertinent columns are:
There are two additional files which contain annotation data:
We have provided a sample Jupyter notebook (verify_data.ipynb), which gives examples of how the dataset can be loaded and visualised as well as examples of how the sample patterns and ground truth can be addressed and visualised.
The Jupyter Notebook contains an example of how the data can be loaded and visualised. Please note that both data should be filtered based on sid; the power measurements are collected by sid 1. See Notebook for example.
Facebook
TwitterNotice of data discontinuation: Since the start of the pandemic, AP has reported case and death counts from data provided by Johns Hopkins University. Johns Hopkins University has announced that they will stop their daily data collection efforts after March 10. As Johns Hopkins stops providing data, the AP will also stop collecting daily numbers for COVID cases and deaths. The HHS and CDC now collect and visualize key metrics for the pandemic. AP advises using those resources when reporting on the pandemic going forward.
April 9, 2020
April 20, 2020
April 29, 2020
September 1st, 2020
February 12, 2021
new_deaths column.February 16, 2021
The AP is using data collected by the Johns Hopkins University Center for Systems Science and Engineering as our source for outbreak caseloads and death counts for the United States and globally.
The Hopkins data is available at the county level in the United States. The AP has paired this data with population figures and county rural/urban designations, and has calculated caseload and death rates per 100,000 people. Be aware that caseloads may reflect the availability of tests -- and the ability to turn around test results quickly -- rather than actual disease spread or true infection rates.
This data is from the Hopkins dashboard that is updated regularly throughout the day. Like all organizations dealing with data, Hopkins is constantly refining and cleaning up their feed, so there may be brief moments where data does not appear correctly. At this link, you’ll find the Hopkins daily data reports, and a clean version of their feed.
The AP is updating this dataset hourly at 45 minutes past the hour.
To learn more about AP's data journalism capabilities for publishers, corporations and financial institutions, go here or email kromano@ap.org.
Use AP's queries to filter the data or to join to other datasets we've made available to help cover the coronavirus pandemic
Filter cases by state here
Rank states by their status as current hotspots. Calculates the 7-day rolling average of new cases per capita in each state: https://data.world/associatedpress/johns-hopkins-coronavirus-case-tracker/workspace/query?queryid=481e82a4-1b2f-41c2-9ea1-d91aa4b3b1ac
Find recent hotspots within your state by running a query to calculate the 7-day rolling average of new cases by capita in each county: https://data.world/associatedpress/johns-hopkins-coronavirus-case-tracker/workspace/query?queryid=b566f1db-3231-40fe-8099-311909b7b687&showTemplatePreview=true
Join county-level case data to an earlier dataset released by AP on local hospital capacity here. To find out more about the hospital capacity dataset, see the full details.
Pull the 100 counties with the highest per-capita confirmed cases here
Rank all the counties by the highest per-capita rate of new cases in the past 7 days here. Be aware that because this ranks per-capita caseloads, very small counties may rise to the very top, so take into account raw caseload figures as well.
The AP has designed an interactive map to track COVID-19 cases reported by Johns Hopkins.
@(https://datawrapper.dwcdn.net/nRyaf/15/)
<iframe title="USA counties (2018) choropleth map Mapping COVID-19 cases by county" aria-describedby="" id="datawrapper-chart-nRyaf" src="https://datawrapper.dwcdn.net/nRyaf/10/" scrolling="no" frameborder="0" style="width: 0; min-width: 100% !important;" height="400"></iframe><script type="text/javascript">(function() {'use strict';window.addEventListener('message', function(event) {if (typeof event.data['datawrapper-height'] !== 'undefined') {for (var chartId in event.data['datawrapper-height']) {var iframe = document.getElementById('datawrapper-chart-' + chartId) || document.querySelector("iframe[src*='" + chartId + "']");if (!iframe) {continue;}iframe.style.height = event.data['datawrapper-height'][chartId] + 'px';}}});})();</script>
Johns Hopkins timeseries data - Johns Hopkins pulls data regularly to update their dashboard. Once a day, around 8pm EDT, Johns Hopkins adds the counts for all areas they cover to the timeseries file. These counts are snapshots of the latest cumulative counts provided by the source on that day. This can lead to inconsistencies if a source updates their historical data for accuracy, either increasing or decreasing the latest cumulative count. - Johns Hopkins periodically edits their historical timeseries data for accuracy. They provide a file documenting all errors in their timeseries files that they have identified and fixed here
This data should be credited to Johns Hopkins University COVID-19 tracking project
Facebook
Twitter500k User Session Collection
This collection is distributed for NON-COMMERCIAL RESEARCH USE ONLY. Any application of this collection for commercial purposes is STRICTLY PROHIBITED. Brief description: This collection consists of ~20M web queries collected from ~650k users over three months. The data is sorted by anonymous user ID and sequentially arranged. The goal of this collection is to provide real query log data that is based on real users. It could be used for personalization… See the full description on the dataset page: https://huggingface.co/datasets/max-chroma/AOL-500k-User-Session-Collection.
Facebook
TwitterAttribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for answersumm
Dataset Summary
The AnswerSumm dataset is an English-language dataset of questions and answers collected from a StackExchange data dump. The dataset was created to support the task of query-focused answer summarization with an emphasis on multi-perspective answers. The dataset consists of over 4200 such question-answer threads annotated by professional linguists and includes over 8700 summaries. We decompose the task into several annotation… See the full description on the dataset page: https://huggingface.co/datasets/alexfabbri/answersumm.
Facebook
TwitterThe Air Markets Program Data tool allows users to search EPA data to answer scientific, general, policy, and regulatory questions about industry emissions. Air Markets Program Data (AMPD) is a web-based application that allows users easy access to both current and historical data collected as part of EPA's emissions trading programs. This site allows you to create and view reports and to download emissions data for further analysis. AMPD provides a query tool so users can create custom queries of industry source emissions data, allowance data, compliance data, and facility attributes. In addition, AMPD provides interactive maps, charts, reports, and pre-packaged datasets. AMPD does not require any additional software, plug-ins, or security controls and can be accessed using a standard web browser.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourage poor coding practices and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.
This repository contains two files:
The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.
The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:
In the remaining of this text, we give instructions for reproducing the analyses, by using the data provided in the dump and reproducing the collection, by collecting data from GitHub again.
Reproducing the Analysis
This section shows how to load the data in the database and run the analyses notebooks. In the analysis, we used the following environment:
Ubuntu 18.04.1 LTS
PostgreSQL 10.6
Conda 4.5.11
Python 3.7.2
PdfCrop 2012/11/02 v1.38
First, download dump.tar.bz2 and extract it:
tar -xjf dump.tar.bz2
It extracts the file db2019-03-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:
psql jupyter < db2019-03-13.dump
It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTTION:
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Create a conda environment with Python 3.7:
conda create -n analyses python=3.7
conda activate analyses
Go to the analyses folder and install all the dependencies of the requirements.txt
cd jupyter_reproducibility/analyses
pip install -r requirements.txt
For reproducing the analyses, run jupyter on this folder:
jupyter notebook
Execute the notebooks on this order:
Reproducing or Expanding the Collection
The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.
Requirements
This time, we have extra requirements:
All the analysis requirements
lbzip2 2.5
gcc 7.3.0
Github account
Gmail account
Environment
First, set the following environment variables:
export JUP_MACHINE="db"; # machine identifier
export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
export JUP_COMPRESSION="lbzip2"; # compression program
export JUP_VERBOSE="5"; # verbose level
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlchemy connection
export JUP_GITHUB_USERNAME="github_username"; # your github username
export JUP_GITHUB_PASSWORD="github_password"; # your github password
export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
export JUP_OAUTH_FILE="~/oauth2_creds.json" # oauth2 auhentication file
export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it in blank
export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it in blank
export JUP_WITH_EXECUTION="1"; # run execute python notebooks
export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependnecies
export JUP_EXECUTION_MODE="-1"; # run following the execution order
export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
export JUP_NOTEBOOK_TIMEOUT="300"; # timeout the extraction
# Frequenci of log report
export JUP_ASTROID_FREQUENCY="5";
export JUP_IPYTHON_FREQUENCY="5";
export JUP_NOTEBOOKS_FREQUENCY="5";
export JUP_REQUIREMENT_FREQUENCY="5";
export JUP_CRAWLER_FREQUENCY="1";
export JUP_CLONE_FREQUENCY="1";
export JUP_COMPRESS_FREQUENCY="5";
export JUP_DB_IP="localhost"; # postgres database IP
Then, configure the file ~/oauth2_creds.json, according to yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf
Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the directories. The second one should umount it. You can leave the scripts in blank, but it is not advisable, as the reproducibility study runs arbitrary code on your machine and you may lose your data.
Scripts
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Install 5 conda environments and 5 anaconda environments, for each python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (Note that it is a local package that has not been published to pypi. Make sure to use the -e option):
Conda 2.7
conda create -n raw27 python=2.7 -y
conda activate raw27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 2.7
conda create -n py27 python=2.7 anaconda -y
conda activate py27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.4
It requires a manual jupyter and pathlib2 installation due to some incompatibilities found on the default installation.
conda create -n raw34 python=3.4 -y
conda activate raw34
conda install jupyter -c conda-forge -y
conda uninstall jupyter -y
pip install --upgrade pip
pip install jupyter
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
pip install pathlib2
Anaconda 3.4
conda create -n py34 python=3.4 anaconda -y
conda activate py34
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.5
conda create -n raw35 python=3.5 -y
conda activate raw35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.5
It requires the manual installation of other anaconda packages.
conda create -n py35 python=3.5 anaconda -y
conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
conda activate py35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.6
conda create -n raw36 python=3.6 -y
conda activate raw36
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.6
conda create -n py36 python=3.6 anaconda -y
conda activate py36
conda install -y anaconda-navigator jupyterlab_server navigator-updater
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.7
<code
Facebook
TwitterThis dataset contains a list of petrol station locations in the Münster area. The data is queried directly from the OpenStreetMaps database. Please note the following information about the data: The query does not exactly cover the city area, but a rectangular frame around the city centre. The database of OpenStreetMaps may not be complete. These data are collected by volunteers and usually have a very good quality, but this is not official information from the city administration. The license terms of OpenStreetMap Germany apply: http://www.openstreetmap.de/faq.html#lizenz OpenStreetMap is a project founded in 2004 with the aim of creating a free world map. Volunteers from many countries work on the further development of the software as well as the collection and processing of geodata. Data is collected about roads, railways, rivers, forests, houses and everything else that is commonly seen on maps. This record links directly to the current data query at overpass-api.de. If you find incorrect or missing information in this record, you can log in to OpenStreetMaps and help improve the database yourself. You can complete or correct the data, similar to Wikipedia. How to do this, find out under: https://www.openstreetmap.de/faq.html#wie_mitmachen Furthermore, you can customise this data query yourself and extend the query result to the entire Münsterland or beyond, because the OpenStreetMap database contains Germany-wide data. You can find an English guide for the query language to formulate your data query at the following address: https://wiki.openstreetmap.org/wiki/Overpass_API/Language_Guide
Facebook
TwitterScientific data provenance is the information required to document the history of an item of data, including how it was created and how it was transformed. Data provenance has great potential to improve the transparency, reliability, and reproducibility of scientific results. However it has been little used to date by domain scientists because most systems that collect provenance require scientists to learn specialized software tools and jargon. This project is developing tools that allow scientists to collect, visualize, and query provenance directly from the R statistical language. The first tool (RDataTracker) is a library of R functions that can be downloaded and installed as an R package. RDataTracker allows the scientist to collect data provenance during an R console session or while executing an R script. The resulting provenance is stored on the scientist's computer as a DDG (data derivation graph) file. The second tool (DDG Explorer) is a stand-alone Java program that can be downloaded and run to visualize, store, and query DDGs. The third tool is an R script (DDGCheckpoint.R) may be used with RDataTRacker to create and restore checkpoints that store the R environment and user files. Documentation for all tools is included with the RDataTracker package or may be downloaded separately.
Facebook
Twitterhttps://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
The Repository Analytics and Metrics Portal (RAMP) is a web service that aggregates use and performance use data of institutional repositories. The data are a subset of data from RAMP, the Repository Analytics and Metrics Portal (http://rampanalytics.org), consisting of data from all participating repositories for the calendar year 2018. For a description of the data collection, processing, and output methods, please see the "methods" section below. Note that the RAMP data model changed in August, 2018 and two sets of documentation are provided to describe data collection and processing before and after the change.
Methods
RAMP Data Documentation – January 1, 2017 through August 18, 2018
Data Collection
RAMP data were downloaded for participating IR from Google Search Console (GSC) via the Search Console API. The data consist of aggregated information about IR pages which appeared in search result pages (SERP) within Google properties (including web search and Google Scholar).
Data from January 1, 2017 through August 18, 2018 were downloaded in one dataset per participating IR. The following fields were downloaded for each URL, with one row per URL:
url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
country: The country from which the corresponding search originated.
device: The device used for the search.
date: The date of the search.
Following data processing describe below, on ingest into RAMP an additional field, citableContent, is added to the page level data.
Note that no personally identifiable information is downloaded by RAMP. Google does not make such information available.
More information about click-through rates, impressions, and position is available from Google's Search Console API documentation: https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query and https://support.google.com/webmasters/answer/7042828?hl=en
Data Processing
Upon download from GSC, data are processed to identify URLs that point to citable content. Citable content is defined within RAMP as any URL which points to any type of non-HTML content file (PDF, CSV, etc.). As part of the daily download of statistics from Google Search Console (GSC), URLs are analyzed to determine whether they point to HTML pages or actual content files. URLs that point to content files are flagged as "citable content." In addition to the fields downloaded from GSC described above, following this brief analysis one more field, citableContent, is added to the data which records whether each URL in the GSC data points to citable content. Possible values for the citableContent field are "Yes" and "No."
Processed data are then saved in a series of Elasticsearch indices. From January 1, 2017, through August 18, 2018, RAMP stored data in one index per participating IR.
About Citable Content Downloads
Data visualizations and aggregations in RAMP dashboards present information about citable content downloads, or CCD. As a measure of use of institutional repository content, CCD represent click activity on IR content that may correspond to research use.
CCD information is summary data calculated on the fly within the RAMP web application. As noted above, data provided by GSC include whether and how many times a URL was clicked by users. Within RAMP, a "click" is counted as a potential download, so a CCD is calculated as the sum of clicks on pages/URLs that are determined to point to citable content (as defined above).
For any specified date range, the steps to calculate CCD are:
Filter data to only include rows where "citableContent" is set to "Yes."
Sum the value of the "clicks" field on these rows.
Output to CSV
Published RAMP data are exported from the production Elasticsearch instance and converted to CSV format. The CSV data consist of one "row" for each page or URL from a specific IR which appeared in search result pages (SERP) within Google properties as described above.
The data in these CSV files include the following fields:
url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
country: The country from which the corresponding search originated.
device: The device used for the search.
date: The date of the search.
citableContent: Whether or not the URL points to a content file (ending with pdf, csv, etc.) rather than HTML wrapper pages. Possible values are Yes or No.
index: The Elasticsearch index corresponding to page click data for a single IR.
repository_id: This is a human readable alias for the index and identifies the participating repository corresponding to each row. As RAMP has undergone platform and version migrations over time, index names as defined for the index field have not remained consistent. That is, a single participating repository may have multiple corresponding Elasticsearch index names over time. The repository_id is a canonical identifier that has been added to the data to provide an identifier that can be used to reference a single participating repository across all datasets. Filtering and aggregation for individual repositories or groups of repositories should be done using this field.
Filenames for files containing these data follow the format 2018-01_RAMP_all.csv. Using this example, the file 2018-01_RAMP_all.csv contains all data for all RAMP participating IR for the month of January, 2018.
Data Collection from August 19, 2018 Onward
RAMP data are downloaded for participating IR from Google Search Console (GSC) via the Search Console API. The data consist of aggregated information about IR pages which appeared in search result pages (SERP) within Google properties (including web search and Google Scholar).
Data are downloaded in two sets per participating IR. The first set includes page level statistics about URLs pointing to IR pages and content files. The following fields are downloaded for each URL, with one row per URL:
url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
date: The date of the search.
Following data processing describe below, on ingest into RAMP a additional field, citableContent, is added to the page level data.
The second set includes similar information, but instead of being aggregated at the page level, the data are grouped based on the country from which the user submitted the corresponding search, and the type of device used. The following fields are downloaded for combination of country and device, with one row per country/device combination:
country: The country from which the corresponding search originated.
device: The device used for the search.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
date: The date of the search.
Note that no personally identifiable information is downloaded by RAMP. Google does not make such information available.
More information about click-through rates, impressions, and position is available from Google's Search Console API documentation: https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query and https://support.google.com/webmasters/answer/7042828?hl=en
Data Processing
Upon download from GSC, the page level data described above are processed to identify URLs that point to citable content. Citable content is defined within RAMP as any URL which points to any type of non-HTML content file (PDF, CSV, etc.). As part of the daily download of page level statistics from Google Search Console (GSC), URLs are analyzed to determine whether they point to HTML pages or actual content files. URLs that point to content files are flagged as "citable content." In addition to the fields downloaded from GSC described above, following this brief analysis one more field, citableContent, is added to the page level data which records whether each page/URL in the GSC data points to citable content. Possible values for the citableContent field are "Yes" and "No."
The data aggregated by the search country of origin and device type do not include URLs. No additional processing is done on these data. Harvested data are passed directly into Elasticsearch.
Processed data are then saved in a series of Elasticsearch indices. Currently, RAMP stores data in two indices per participating IR. One index includes the page level data, the second index includes the country of origin and device type data.
About Citable Content Downloads
Data visualizations and aggregations in RAMP dashboards present information about citable content downloads, or CCD. As a measure of use of institutional repository
Facebook
TwitterA database of flow cytometry experiments where users can query and download data collected and annotated according to the MIFlowCyt data standard.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
This datasets have SQL injection attacks (SLQIA) as malicious Netflow data. The attacks carried out are SQL injection for Union Query and Blind SQL injection. To perform the attacks, the SQLMAP tool has been used.
NetFlow traffic has generated using DOROTHEA (DOcker-based fRamework fOr gaTHering nEtflow trAffic). NetFlow is a network protocol developed by Cisco for the collection and monitoring of network traffic flow data generated. A flow is defined as a unidirectional sequence of packets with some common properties that pass through a network device.
Datasets
The firts dataset was colleted to train the detection models (D1) and other collected using different attacks than those used in training to test the models and ensure their generalization (D2).
The datasets contain both benign and malicious traffic. All collected datasets are balanced.
The version of NetFlow used to build the datasets is 5.
Dataset
Aim
Samples
Benign-malicious
traffic ratio
D1
Training
400,003
50%
D2
Test
57,239
50%
Infrastructure and implementation
Two sets of flow data were collected with DOROTHEA. DOROTHEA is a Docker-based framework for NetFlow data collection. It allows you to build interconnected virtual networks to generate and collect flow data using the NetFlow protocol. In DOROTHEA, network traffic packets are sent to a NetFlow generator that has a sensor ipt_netflow installed. The sensor consists of a module for the Linux kernel using Iptables, which processes the packets and converts them to NetFlow flows.
DOROTHEA is configured to use Netflow V5 and export the flow after it is inactive for 15 seconds or after the flow is active for 1800 seconds (30 minutes)
Benign traffic generation nodes simulate network traffic generated by real users, performing tasks such as searching in web browsers, sending emails, or establishing Secure Shell (SSH) connections. Such tasks run as Python scripts. Users may customize them or even incorporate their own. The network traffic is managed by a gateway that performs two main tasks. On the one hand, it routes packets to the Internet. On the other hand, it sends it to a NetFlow data generation node (this process is carried out similarly to packets received from the Internet).
The malicious traffic collected (SQLI attacks) was performed using SQLMAP. SQLMAP is a penetration tool used to automate the process of detecting and exploiting SQL injection vulnerabilities.
The attacks were executed on 16 nodes and launch SQLMAP with the parameters of the following table.
Parameters
Description
'--banner','--current-user','--current-db','--hostname','--is-dba','--users','--passwords','--privileges','--roles','--dbs','--tables','--columns','--schema','--count','--dump','--comments', --schema'
Enumerate users, password hashes, privileges, roles, databases, tables and columns
--level=5
Increase the probability of a false positive identification
--risk=3
Increase the probability of extracting data
--random-agent
Select the User-Agent randomly
--batch
Never ask for user input, use the default behavior
--answers="follow=Y"
Predefined answers to yes
Every node executed SQLIA on 200 victim nodes. The victim nodes had deployed a web form vulnerable to Union-type injection attacks, which was connected to the MYSQL or SQLServer database engines (50% of the victim nodes deployed MySQL and the other 50% deployed SQLServer).
The web service was accessible from ports 443 and 80, which are the ports typically used to deploy web services. The IP address space was 182.168.1.1/24 for the benign and malicious traffic-generating nodes. For victim nodes, the address space was 126.52.30.0/24. The malicious traffic in the test sets was collected under different conditions. For D1, SQLIA was performed using Union attacks on the MySQL and SQLServer databases.
However, for D2, BlindSQL SQLIAs were performed against the web form connected to a PostgreSQL database. The IP address spaces of the networks were also different from those of D1. In D2, the IP address space was 152.148.48.1/24 for benign and malicious traffic generating nodes and 140.30.20.1/24 for victim nodes.
To run the MySQL server we ran MariaDB version 10.4.12. Microsoft SQL Server 2017 Express and PostgreSQL version 13 were used.
Facebook
TwitterA detailed explanation of how this dataset was put together, including data sources and methodologies, follows below.Please see the "Terms of Use" section below for the Data DictionaryDATA ACQUISITION AND CLEANING PROCESSThis dataset was built from 5 separate datasets queried during the months of April and May 2023 from the Census Microdata System (link below):https://data.census.gov/mdat/#/All datasets include information on Property Value (VALP) by: Educational Attainment (SCHL), Gender (SEX), a specified race or ethnicity (RAC or HISP), and are grouped by Public Use Microdata Areas (PUMAS). PUMAS are geographic areas created by the Census bureau; they are weighted by land area and population to facilitate data analysis. Data also Included totals for the state of New Mexico, so 19 total geographies are represented. Datasets were downloaded separately by race and ethnicity because this was the only way to obtain the VALP, SCHL, and SEX variables intersectionally with race or ethnicity data. Datasets were downloaded separately by race and ethnicity because this was the only way to obtain the VALP, SCHL, and SEX variables intersectionally with race or ethnicity data. Cleaning each dataset started with recoding the SCHL and HISP variables - details on recoding can be found below.After recoding, each dataset was transposed so that PUMAS were rows and SCHL, VALP, SEX, and Race or Ethnicity variables were the columns.Median values were calculated in every case that recoding was necessary. As a result, all Property Values in this dataset reflect median values.At times the ACS data downloaded with zeros instead of the 'null' values in initial query results. The VALP variable also included a "-1" variable to reflect N/A values (details in variable notes). Both zeros and "-1" values were removed before calculating median values, both to keep the data true to the original query and to generate accurate median values.Recoding the SCHL variable resulted in 5 rows for each PUMA, reflecting the different levels of educational attainment in each region. Columns grouped variables by race or ethnicity and gender. Cell values were property values.All 5 datasets were joined after recoding and cleaning the data. Original datasets all include 95 rows with 5 separate Educational Attainment variables for each PUMA, including New Mexico State totals.Because 1 row was needed for each PUMA in order to map this data, the data was split by Educational Attainment (SCHL), resulting in 110 columns reflecting median property values for each race or ethnicity by gender and level of educational attainment.A short, unique 2 to 5 letter alias was created for each PUMA area in anticipation of needing a unique identifier to join the data with. GIS AND MAPPING PROCESSA PUMA shapefile was downloaded from the ACS site. The Shapefile can be downloaded here: https://tigerweb.geo.census.gov/arcgis/rest/services/TIGERweb/PUMA_TAD_TAZ_UGA_ZCTA/MapServerThe DBF from the PUMA shapefile was exported to Excel; this shapefile data included needed geographic information for mapping such as: GEOID, PUMACE. The UIDs created for each PUMA were added to the shapefile data; the PUMA shapfile data and ACS data were then joined on UID in JMP.The data table was joined to the shapefile in ARC GiIS, based on PUMA region (specifically GEOID text).The resulting shapefile was exported as a GDB (geodatabase) in order to keep 'Null' values in the data. GDBs are capable of including a rule allowing null values where shapefiles are not. This GDB was uploaded to NMCDCs Arc Gis platform. SYSTEMS USEDMS Excel was used for data cleaning, recoding, and deriving values. Recoding was done directly in the Microdata system when possible - but because the system is was in beta at the time of use some features were not functional at times.JMP was used to transpose, join, and split data. ARC GIS Desktop was used to create the shapefile uploaded to NMCDC's online platform. VARIABLE AND RECODING NOTESTIMEFRAME: Data was queried for the 5 year period of 2015 to 2019 because ACS changed its definiton for and methods of collecting data on race and ethinicity in 2020. The change resulted in greater aggregation and les granular data on variables from 2020 onward.Note: All Race Data reflects that respondants identified as the specified race alone or in combination with one or more other races.VARIABLE:ACS VARIABLE DEFINITIONACS VARIABLE NOTESDETAILS OR URL FOR RAW DATA DOWNLOADRACBLKBlack or African American ACS Query: RACBLK, SCHL, SEX, VALP 2019 5yrRACAIANAmerican Indian and Alaska Native ACS Query: RACAIAN, SCHL, SEX, VALP 2019 5yrRACASNAsian ACS Query: RACASN, SCHL, SEX, VALP 2019 5yrRACWHTWhite ACS Query: RACWHT, SCHL, SEX, VALP 2019 5yrHISPHispanic Origin ACS Query: HISP ORG, SCHL, SEX, VALP 2019 5yrHISP RECODE: 24 original separate variablesThe Hispanic Origin (HISP) variable originally included 24 subcategories reflecting Mexican, Central American, South American, and Caribbean Latino, and Spanish identities from each Latin American counry. 7 recoded VariablesThese 24 variables were recoded (grouped) into 7 simpler categories for data analysis: Not Spanish/Hispanic/Latino, Mexican, Caribbean Latino, Central American, South American, Spaniard, All other Spanish/Hispanic/Latino Female. Not Spanish/Hispanic/Latino was not really used in the final dataset as the race datasets provided that information.SCHLEducational Attainment25 original separate variablesThe Educational Attainment (SCHL) variable originally included 25 subcategories reflecting the education levels of adults (over 18) surveyed by the ACS. These include: Kindergarten, Grades 1 through 12 separately, 12th grade with no diploma, Highschool Diploma, GED or credential, less than 1 year of college, more than 1 year of college with no degree, Associate's Degree, Bachelor's Degree, Master's Degree, Professional Degree, and Doctorate Degree.SCHL RECODE: 5 recoded variablesThese 25 variables were recoded (grouped) into 5 simpler categories for data analysis: No High School Diploma, High School Diploma or GED, Some College, Bachelor's Degree, and Advanced or Professional DegreeSEXGender2 variables1 - Male, 2 - FemaleVALPProperty Value1 variableValues were rounded and top-coded by ACS for anonymity. The "-1" variable is defined as N/A (GQ/ Vacant lots except 'for sale only' and 'sold, not occupied' / not owned or being bought.) This variable reflects the median value of property owned by individuals of each race, ethnicity, gender, and educational attainment category.PUMAPublic Use Microdata Area18 PUMAsPUMAs in New Mexico can be viewed here:https://nmcdc.maps.arcgis.com/apps/mapviewer/index.html?webmap=d9fed35f558948ea9051efe9aa529eafData includes 19 total regions: 18 Pumas and NM State TotalsNOTES AND RESOURCESThe following resources and documentation were used to navigate the ACS PUMS system and to answer questions about variables:Census Microdata API User Guide:https://www.census.gov/data/developers/guidance/microdata-api-user-guide.Additional_Concepts.html#list-tab-1433961450Accessing PUMS Data:https://www.census.gov/programs-surveys/acs/microdata/access.htmlHow to use PUMS on data.census.govhttps://www.census.gov/programs-surveys/acs/microdata/mdat.html2019 PUMS Documentation:https://www.census.gov/programs-surveys/acs/microdata/documentation.2019.html#list-tab-13709392012014 to 2018 ACS PUMS Data Dictionary:https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2014-2018.pdf2019 PUMS Tiger/Line Shapefileshttps://www.census.gov/cgi-bin/geo/shapefiles/index.php?year=2019&layergroup=Public+Use+Microdata+Areas Note 1: NMCDC attemepted to contact analysts with the ACS system to clarify questions about variables, but did not receive a timely response. Documentation was then consulted.Note 2: All relevant documentation was reviewed and seems to imply that all survey questions were answered by adults, age 18 or over. Youth who have inherited property could potentially be reflected in this data.Dataset and feature service created in May 2023 by Renee Haley, Data Specialist, NMCDC.
Facebook
TwitterCC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We introduce a unique approach to privacy research by creating a virtual persona that mimics human web-searching behaviors. The persona's activities, categorized into 'morning', 'afternoon', and 'evening', were automated using the Selenium WebDriver, enabling the persona to conduct searches as a real user would. The resulting dataset comprises 1,537 records, each representing a unique search query. Each record contains the first two pages of a query result, including the query keywords and a list of the first 2 pages of the query result. The study offers a fresh perspective on the study of privacy and personalization in online environments. The potential for reusing this dataset is significant, as it can be applied to studies on privacy, data collection, and search engine personalization, and it can be used to develop and test algorithms and models that aim to protect user privacy.
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This dataset contains meta-mathematics questions and answers collected from the Mistral-7B question-answering system. The responses, types, and queries are all provided in order to help boost the performance of MetaMathQA while maintaining high accuracy. With its well-structured design, this dataset provides users with an efficient way to investigate various aspects of question answering models and further understand how they function. Whether you are a professional or beginner, this dataset is sure to offer invaluable insights into the development of more powerful QA systems!
For more datasets, click here.
- 🚨 Your notebook can be here! 🚨!
Data Dictionary
The MetaMathQA dataset contains three columns: response, type, and query. - Response: the response to the query given by the question answering system. (String) - Type: the type of query provided as input to the system. (String) - Query:the question posed to the system for which a response is required. (String)
Preparing data for analysis
It’s important that before you dive into analysis, you first familiarize yourself with what kind data values are present in each column and also check if any preprocessing needs to be done on them such as removing unwanted characters or filling in missing values etc., so that it can be used without any issue while training or testing your model further down in your process flow.
##### Training Models using Mistral 7B
Mistral 7B is an open source framework designed for building machine learning models quickly and easily from tabular (csv) datasets such as those found in this dataset 'MetaMathQA ' . After collecting and preprocessing your dataset accordingly Mistral 7B provides with support for various Machine Learning algorithms like Support Vector Machines (SVM), Logistic Regression , Decision trees etc , allowing one to select from various popular libraries these offered algorithms with powerful overall hyperparameter optimization techniques so soon after selecting algorithm configuration its good practice that one use GridSearchCV & RandomSearchCV methods further tune both optimizations during model building stages . Post selection process one can then go ahead validate performances of selected models through metrics like accuracy score , F1 Metric , Precision Score & Recall Scores .
##### Testing phosphors :
After successful completion building phase right way would be robustly testing phosphors on different evaluation metrics mentioned above Model infusion stage helps here immediately make predictions based on earlier trained model OK auto back new test cases presented by domain experts could hey run quality assurance check again base score metrics mentioned above know asses confidence value post execution HHO updating baseline scores running experiments better preferred methodology AI workflows because Core advantage finally being have relevancy inexactness induced errors altogether impact low
- Generating natural language processing (NLP) models to better identify patterns and connections between questions, answers, and types.
- Developing understandings on the efficiency of certain language features in producing successful question-answering results for different types of queries.
- Optimizing search algorithms that surface relevant answer results based on types of queries
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv | Column name | Description | |:--------------|:------------------------------------| | response | The response to the query. (String) | | type | The type of query. (String) |
If you use this dataset in your research, please credit the original authors. If you use this dataset in your research, please credit Huggingface Hub.