Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Machine Learning in general, and Deep Learning in particular, requires amounts of data that are often not available in some technical domains. The manual inspection of machine tool components, as well as manual end-of-line checks of products, are labour-intensive tasks that companies often want to automate. To automate these classification processes and to develop reliable and robust Machine-Learning-based classification and wear-prognostics models, real-world datasets are needed to train and test models on.

The dataset contains 1104 three-channel images with 394 image annotations for the surface damage type "pitting". The annotations, made with the annotation tool labelme, are available in JSON format and are hence convertible to VOC and COCO format. All images come from two BSD types. The dataset available for download is divided into three folders: data with all images as JPEG, label with all annotations, and saved_model with a baseline model. The authors also provide a Python script to divide the data and labels into three different split types: train_test_split, which splits the images into the same train and test split the authors used for the baseline model; wear_dev_split, which creates all 27 wear developments; and type_split, which splits the data into the occurring BSD types.

One of the two BSD types is represented by 69 images in 55 different image sizes. All images of this BSD type come in either a clean or a soiled condition. The other BSD type is shown on 325 images in two image sizes. Since all images of this type were taken continuously over time, the degree of soiling evolves. As mentioned above, the dataset also contains 27 pitting development sequences of 69 images each.

Instruction dataset split

The authors of this dataset provide three different dataset splits. To get a split, run the Python script split_dataset.py.

Script inputs:
- split-type (mandatory)
- output directory (mandatory)

Split types:
- train_test_split: splits the dataset into train and test data (80%/20%)
- wear_dev_split: splits the dataset into 27 wear developments
- type_split: splits the dataset into the different BSD types

Example:
C:\Users\Desktop>python split_dataset.py --split_type=train_test_split --output_dir=BSD_split_folder
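Since the annotations are standard labelme JSON files, they can be inspected directly with Python. The sketch below is a minimal, hypothetical example; the file name is a placeholder, and the key layout should be verified against the files in the label folder.
```
# Hedged sketch: inspecting one of the labelme JSON annotations described above.
# The file name is a placeholder; "shapes"/"label"/"points" follow the standard
# labelme format, but verify against the files in the label folder.
import json

with open("label/example_image.json") as f:     # hypothetical file name
    annotation = json.load(f)

for shape in annotation.get("shapes", []):
    label = shape["label"]                       # e.g. "pitting"
    points = shape["points"]                     # polygon vertices [[x, y], ...]
    print(label, len(points), "polygon points")
```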
These Python datasets contain the results presented in the above paper regarding the variability in DJF temperature trends over North America due to sampling of internal variability. Two types of files are available.

The netCDF file contains samples from the synthetic ensemble of DJF temperatures over North America from 1966-2015. The synthetic ensemble is centered on the observed trend. Recentering the ensemble on the ensemble-mean trend from the NCAR CESM1 LENS will create the Observational Large Ensemble, in which each sample can be viewed as a temperature history that could have occurred given various samplings of internal variability. The synthetic ensemble can also be recentered on any other estimate of the forced response to climate change. While the dataset covers both land and ocean, it has only been validated over land.

The second type of file, provided as Python datasets (.npz), contains the results presented in the McKinnon et al. (2017) reference. In particular, it contains the 50-year trends, for both the observations and the NCAR CESM1 Large Ensemble, that actually occurred and that could have occurred given a different sampling of internal variability. The bootstrap results can be compared to the true spread across the NCAR CESM1 Large Ensemble for validation, as was done in the manuscript. Each of these files is named based on the observational dataset, variable, time span, and spatial domain. They contain:

- BETA: the empirical OLS trend
- BOOTSAMPLES: the OLS trends estimated after bootstrapping
- INTERANNUALVAR: the interannual variance in the data after modeling and removing the forced trend
- empiricalAR1: the empirical AR(1) coefficient estimated from the residuals around the forced trend

The first dimension of all variables is 42, which is a stack of the ensemble-mean behavior (index 0), the forty members of the NCAR Large Ensemble (indices 1 to 40), and the observations (last index, -1). The second dimension is spatial; see latlon.npz for the latitude and longitude vectors. The third dimension, when present, is the bootstrap samples; we have saved 1000 bootstrap samples.
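A minimal sketch of opening one of the .npz files with NumPy follows; the file name is a placeholder, and the key names inside latlon.npz should be checked with the .files attribute.
```
# Hedged sketch: opening one of the .npz result files with NumPy.
# The file name is a placeholder; the array keys below are those listed above.
import numpy as np

results = np.load("example_results.npz")        # hypothetical file name
print(results.files)                            # e.g. ['BETA', 'BOOTSAMPLES', ...]

beta = results["BETA"]                          # expected (42, n_space): ensemble mean,
                                                # 40 LENS members, observations
boot = results["BOOTSAMPLES"]                   # expected (42, n_space, 1000) bootstrap trends
print(beta.shape, boot.shape)

latlon = np.load("latlon.npz")
print(latlon.files)                             # latitude / longitude vectors
```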
CodeSyntax is a large-scale dataset of programs annotated with the syntactic relationships in their corresponding abstract syntax trees. It contains 18,701 code samples annotated with 1,342,050 relation edges in 43 relation types for Python, and 13,711 code samples annotated with 864,411 relation edges in 39 relation types for Java. It is designed to evaluate the performance of language models on code syntax understanding.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset to run the Example.py script of the Valparaíso Stacking Analysis Tool (VSAT-2D). The Valparaíso Stacking Analysis Tool (VSAT-2D) provides a series of tools for selecting, stacking, and analyzing moment-0 intensity maps from interferometric datasets. It is intended for stacking samples of moment-0 maps extracted from interferometric observations of large extragalactic catalogs: subsamples of galaxies can be selected by their available properties (e.g. redshift, stellar mass, star formation rate), and diverse composite spectra (e.g. median, average, weighted average, histogram) can be generated. However, VSAT-2D can also be used on smaller datasets containing any type of astronomical object.
VSAT-2D can be downloaded from the github repository link.
Project CodeNet is a large-scale dataset with approximately 14 million code samples, each of which is an intended solution to one of 4000 coding problems. The code samples are written in over 50 programming languages (although the dominant languages are C++, C, Python, and Java) and they are annotated with a rich set of information, such as code size, memory footprint, CPU run time, and status, which indicates acceptance or error types. The dataset is accompanied by a repository, where we provide a set of tools to aggregate code samples based on user criteria and to transform code samples into token sequences, simplified parse trees, and other code graphs. A detailed discussion of Project CodeNet is available in this paper.
The rich annotation of Project CodeNet enables research in code search, code completion, code-code translation, and a myriad of other use cases. We also extracted several benchmarks in Python, Java and C++ to drive innovation in deep learning and machine learning models in code classification and code similarity.
Citation

@inproceedings{puri2021codenet,
  author = {Ruchir Puri and David Kung and Geert Janssen and Wei Zhang and Giacomo Domeniconi and Vladimir Zolotov and Julian Dolby and Jie Chen and Mihir Choudhury and Lindsey Decker and Veronika Thost and Luca Buratti and Saurabh Pujar and Ulrich Finkler},
  title = {Project CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks},
  year = {2021},
}
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Generation of multiple true-false questions
This project provides a natural language pipeline that processes German textbook sections as input and generates Multiple True-False Questions using GPT-2.
Assessments are an important part of the learning cycle and enable the development and promotion of competencies. However, the manual creation of assessments is very time-consuming. Therefore, the number of tasks in learning systems is often limited. In this repository, we provide an algorithm that can automatically generate an arbitrary number of German True False statements from a textbook using the GPT-2 model. The algorithm was evaluated with a selection of textbook chapters from four academic disciplines (see `data` folder) and rated by individual domain experts. One-third of the generated MTF Questions are suitable for learning. The algorithm provides instructors with an easier way to create assessments on chapters of textbooks to test factual knowledge.
As a type of Multiple-Choice question, Multiple True False (MTF) Questions are, among other question types, a simple and efficient way to objectively test factual knowledge. The learner is challenged to distinguish between true and false statements. MTF questions can be presented differently, e.g. by locating a true statement from a series of false statements, identifying false statements among a list of true statements, or separately evaluating each statement as either true or false. Learners must evaluate each statement individually because a question stem can contain both incorrect and correct statements. Thus, MTF Questions as a machine-gradable format have the potential to identify learners’ misconceptions and knowledge gaps.
Example MTF question:
Check the correct statements:
[ ] All trees have green leaves.
[ ] Trees grow towards the sky.
[ ] Leaves can fall from a tree.
Features
- generation of false statements
- automatic selection of true statements
- selection of an arbitrary similarity for true and false statements as well as the number of false statements
- generating false statements by adding or deleting negations, as well as by using a German GPT-2
Setup
Installation
1. Create a new environment: `conda create -n mtfenv python=3.9`
2. Activate the environment: `conda activate mtfenv`
3. Install dependencies using anaconda:
```
conda install -y -c conda-forge pdfplumber
conda install -y -c conda-forge nltk
conda install -y -c conda-forge pypdf2
conda install -y -c conda-forge pylatexenc
conda install -y -c conda-forge packaging
conda install -y -c conda-forge transformers
conda install -y -c conda-forge essential_generators
conda install -y -c conda-forge xlsxwriter
```
4. Download the spaCy model: `python3.9 -m spacy download de_core_news_lg`
Getting started
After installation, you can execute the bash script `bash run.sh` in the terminal to compile MTF questions for the provided textbook chapters.
To create MTF questions for your own texts use the following command:
`python3 main.py --answers 1 --similarity 0.66 --input ./`
The parameter `answers` indicates how many false answers should be generated.
By configuring the parameter `similarity` you can determine what portion of a sentence should remain the same. The remaining portion will be extracted and used to generate a false part of the sentence.
History and roadmap
* Outlook third iteration: Automatic augmentation of text chapters with generated questions
* Second iteration: Generation of multiple true-false questions with improved text summarizer and German GPT2 sentence generator
* First iteration: Generation of multiple true false questions in the Bachelor thesis of Mirjam Wiemeler
Publications, citations, license
Publications
Citation of the Dataset
The source code and data are maintained at GitHub: https://github.com/D2L2/multiple-true-false-question-generation
Contact
License Distributed under the MIT License. See [LICENSE.txt](https://gitlab.pi6.fernuni-hagen.de/la-diva/adaptive-assessment/generationofmultipletruefalsequestions/-/blob/master/LICENSE.txt) for more information.
Acknowledgments This research was supported by CATALPA - Center of Advanced Technology for Assisted Learning and Predictive Analytics of the FernUniversität in Hagen, Germany.
This project was carried out as part of research in the CATALPA project [LA DIVA](https://www.fernuni-hagen.de/forschung/schwerpunkte/catalpa/forschung/projekte/la-diva.shtml)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset to run the Example.py script of the Valparaíso Stacking Analysis Tool (VSAT-3D). The Valparaíso Stacking Analysis Tool (VSAT-3D) provides a series of tools for selecting, stacking, and analyzing 3D spectra. It is intended for stacking samples of datacubes extracted from interferometric observations of large extragalactic catalogs: subsamples of galaxies can be selected by their available properties (e.g. redshift, stellar mass, star formation rate), and diverse composite spectra (e.g. median, average, weighted average, histogram) can be generated. However, VSAT-3D can also be used on smaller datasets containing any type of astronomical object.
VSAT-3D can be downloaded from the github repository link.
CC0 1.0 (Public Domain Dedication) https://creativecommons.org/publicdomain/zero/1.0/
The Semantic Drone Dataset focuses on semantic understanding of urban scenes for increasing the safety of autonomous drone flight and landing procedures. The imagery depicts more than 20 houses from nadir (bird's eye) view acquired at an altitude of 5 to 30 meters above the ground. A high-resolution camera was used to acquire images at a size of 6000x4000px (24Mpx). The training set contains 400 publicly available images and the test set is made up of 200 private images.
This dataset is taken from https://www.kaggle.com/awsaf49/semantic-drone-dataset. We removed and added files and information as needed for our research purposes. We created TIFF files with a resolution of 1200x800 pixels and 24 channels, where each channel represents one class preprocessed from the PNG label files. We reduced the resolution and compressed the TIFF files with the tifffile Python library (a minimal loading sketch is shown after the file listing below).
If you have any problems with the modified TIFF dataset, you can contact nunenuh@gmail.com or gaungalif@gmail.com.
This dataset is a copy of the original dataset (link below), with some improvements to the semantic data and classes: semantic data is available in PNG and TIFF format at a smaller size as needed.
The images are labelled densely using polygons and contain the following 24 classes:
unlabeled paved-area dirt grass gravel water rocks pool vegetation roof wall window door fence fence-pole person dog car bicycle tree bald-tree ar-marker obstacle conflicting
- images/
- labels/png/
- labels/tiff/
- class_to_idx.json
- classes.csv
- classes.json
- idx_to_class.json
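As referenced above, the 24-channel label TIFFs can be read with the tifffile library. The sketch below is a minimal, hypothetical example; the file name and channel ordering are assumptions, and class_to_idx.json shipped with the dataset gives the real class-to-channel mapping.
```
# Hedged sketch: reading one of the 24-channel label TIFFs described above.
# File names and the exact channel ordering are assumptions; check
# class_to_idx.json in the dataset for the real class-to-channel mapping.
import json
import numpy as np
import tifffile

label = tifffile.imread("labels/tiff/example.tif")   # hypothetical file name
print(label.shape)                                   # expected (24, 800, 1200) or (800, 1200, 24)

# Collapse one-hot channels into a single class-index mask.
if label.shape[0] == 24:
    mask = np.argmax(label, axis=0)
else:
    mask = np.argmax(label, axis=-1)

with open("class_to_idx.json") as f:                 # mapping shipped with the dataset
    class_to_idx = json.load(f)
print(mask.shape, len(class_to_idx))
```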
aerial@icg.tugraz.at
If you use this dataset in your research, please cite the following URL: www.dronedataset.icg.tugraz.at
The Drone Dataset is made freely available to academic and non-academic entities for non-commercial purposes such as academic research, teaching, scientific publications, or personal experimentation. Permission is granted to use the data given that you agree:
That the dataset comes "AS IS", without express or implied warranty. Although every effort has been made to ensure accuracy, we (Graz University of Technology) do not accept any responsibility for errors or omissions. That you include a reference to the Semantic Drone Dataset in any work that makes use of the dataset. For research papers or other media link to the Semantic Drone Dataset webpage.
That you do not distribute this dataset or modified versions. It is permissible to distribute derivative works in as far as they are abstract representations of this dataset (such as models trained on it or additional annotations that do not directly include any of our data) and do not allow to recover the dataset or something similar in character. That you may not use the dataset or any derivative work for commercial purposes as, for example, licensing or selling the data, or using the data with a purpose to procure a commercial gain. That all rights not expressly granted to you are reserved by us (Graz University of Technology).
Two datasets were created, using red and white wine samples. The inputs include objective tests (e.g. PH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). Several data mining methods were applied to model these datasets under a regression approach. The support vector machine model achieved the best results. Several metrics were computed: MAD, confusion matrix for a fixed error tolerance (T), etc. Also, we plot the relative importances of the input variables (as measured by a sensitivity analysis procedure).
The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
Number of Instances: red wine - 1599; white wine - 4898
Input variables (based on physicochemical tests):
Output variable (based on sensory data):
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('wine_quality', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more information on tensorflow_datasets.
This USGS data release is intended to provide a base layer of information on likely stream crossings throughout the United States. The geopackage provides likely crossings of infrastructure and streams and provides observed information that helps validate modeled crossings and build knowledge about associated conditions through time (e.g. crossing type, crossing condition). Stream crossings were developed by intersecting the 2020 United States Census Bureau Topologically Integrated Geographic Encoding and Referencing (TIGER) U.S. road lines with the National Hydrography Dataset High Resolution flowlines. The current version of this data release specifically focuses on road stream crossings (i.e. TIGER2020 Roads) but is designed to support additions of other crossing types that may be included in future iterations (e.g. rail). In total, 6,608,268 crossings are included in the dataset, and 496,564 observations from the U.S. Department of Transportation, Federal Highway Administration's 2019 National Bridge Inventory (NBI) are included to help identify crossing types of bridges and culverts. This data release also contains Python code that documents the methods of data development.
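As a minimal, hypothetical sketch (the file name is an assumption, not part of the release), the GeoPackage can be explored with GeoPandas:
```
# Hedged sketch: reading the stream-crossings GeoPackage with GeoPandas.
# The file name is an assumption; list the layers first to find the actual
# crossing layer shipped with the data release.
import fiona
import geopandas as gpd

gpkg = "stream_crossings.gpkg"          # hypothetical file name
print(fiona.listlayers(gpkg))           # inspect available layers

crossings = gpd.read_file(gpkg, layer=fiona.listlayers(gpkg)[0])
print(len(crossings), crossings.crs)
print(crossings.head())
```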
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset to run the example script of the Valparaíso Stacking Analysis Tool (VSAT-1D). The Valparaíso Stacking Analysis Tool (VSAT) provides a series of tools for selecting, stacking, and analyzing 1D spectra. It is intended for stacking samples of spectra belonging to large extragalactic catalogs: subsamples of galaxies can be selected by their available properties (e.g. redshift, stellar mass, star formation rate), and diverse composite spectra (e.g. median, average, weighted average, histogram) can be generated. However, VSAT can also be used on smaller datasets containing any type of astronomical object.
VSAT can be downloaded from the github repository link.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract

The dataset was derived by the Bioregional Assessment Programme without the use of source datasets. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.

This dataset contains the computer code and templates used to create the Hunter groundwater model. Broadly speaking, there are two types of files: those in templates_and_inputs, which are template files used by the code, and everything else, which is the computer code itself. An example of a type of file in templates_and_inputs are the uaXXXX.txt files, which describe the parameters used in uncertainty analysis XXXX. Much of the computer code is in the form of Python scripts, most of which are run by either preprocess.py or postprocess.py (using subprocess.call). Each of the Python scripts employs optparse, and so is largely self-documenting. Each of the Python scripts also requires an index file as an input: an XML file that contains all metadata associated with the model-building process, so that the scripts can discover where the raw data needed to build the model is located. The HUN GW Model v01 contains the index file (index.xml) used to build the Hunter groundwater model. Finally, the "code" directory contains a snapshot of the MOOSE C++ code used to run the model.

Dataset History

Computer code and templates were written by hand.

Dataset Citation

Bioregional Assessment Programme (2016) HUN GW Model code v01. Bioregional Assessment Source Dataset. Viewed 13 March 2019, http://data.bioregionalassessments.gov.au/dataset/e54a1246-0076-4799-9ecf-6d673cf5b1da.
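To illustrate the self-documenting optparse pattern described above, here is a minimal, hypothetical sketch (this is not the actual HUN script code; the option names are placeholders):
```
# Illustrative sketch only (not the actual HUN scripts): the optparse pattern
# described above, where each script documents its own options and takes the
# index XML file as a required input.
from optparse import OptionParser

parser = OptionParser(usage="usage: %prog [options]",
                      description="Example of the self-documenting script pattern.")
parser.add_option("-i", "--index", dest="index",
                  help="path to the index.xml metadata file")
(options, args) = parser.parse_args()

if options.index is None:
    parser.error("an index file is required (see --help)")
print("Using index file:", options.index)
```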
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
IntelliGraphs is a collection of datasets for benchmarking Knowledge Graph Generation models. It consists of three synthetic datasets (syn-paths, syn-tipr, syn-types) and two real-world datasets (wd-movies, wd-articles). There is also a Python package available that loads these datasets and verifies new graphs using the semantics pre-defined for each dataset. It can also be used as a testbed for developing new generative models.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The vegetation type dataset is derived from the 1:1,000,000 Atlas of the Vegetation of China. This atlas is another summarizing achievement of vegetation-ecology researchers in China over the past 40 years, following the publication of monographs such as Vegetation of China, and it is a basic map of the country's natural resources and natural conditions. It was prepared by more than 250 experts from 53 institutions, including relevant institutes of the Chinese Academy of Sciences, relevant ministries and departments of provinces and districts, and institutions of higher learning, and was officially published by the Science Press for domestic and international distribution.

The Bayesian dynamic linear model proposed by Liu et al. (2019, https://doi.org/10.1038/s41558-019-0583-9) was used to calculate the time-varying measure of resilience. We modified the parameters of the code to better suit the Loess Plateau and Qinba Mountains in China. From the resilience results, we can obtain early warning signals for forests.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This upload contains datasets and pre-trained models used for the paper Neural Code Search Revisited: Enhancing Code Snippet Retrieval through Natural Language Intent. The code for easily loading these datasets and models will be made available here: http://github.com/nokia/codesearch
Datasets

There are three types of datasets:
snippet collections (code snippets + natural language descriptions): so-ds-feb20, staqc-py-cleaned, conala-curated
code search evaluation data (queries linked to relevant snippets of one of the snippet collections): so-ds-feb20-{valid|test}, staqc-py-raw-{valid|test}, conala-curated-0.5-test
training data (datasets used to train code retrieval models): so-duplicates-pacs-train, so-python-question-titles-feb20
The staqc-py-cleaned snippet collection, and the conala-curated datasets were derived from existing corpora:
staqc-py-cleaned was derived from the Python StaQC snippet collection. See https://github.com/LittleYUYU/StackOverflow-Question-Code-Dataset, LICENSE.
conala-curated was derived from the conala corpus. See https://conala-corpus.github.io/ , LICENSE
The other datasets were mined directly from a recent Stack Overflow dump (https://archive.org/details/stackexchange, LICENSE).
Pre-trained models

Each model can embed queries and (annotated) code snippets in the same space. The models are released under a BSD 3-Clause License.
ncs-embedder-so-ds-feb20
ncs-embedder-staqc-py
tnbow-embedder-so-ds-feb20
use-embedder-pacs
ensemble-embedder-pacs
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
UWB Ranging and Localization Dataset for "High-Accuracy Ranging and Localization with Ultra-Wideband Communication for Energy-Constrained Devices"

This dataset accompanies the paper "High-Accuracy Ranging and Localization with Ultra-Wideband Communication for Energy-Constrained Devices" by L. Flueratoru, S. Wehrli, M. Magno, S. Lohan, and D. Niculescu, accepted for publication in the IEEE Internet of Things Journal. Please refer to the paper for more information about analyzing the data. If you find this dataset useful, please consider citing our paper in your work.

This dataset is split into two parts: "ranging" and "localization." Both parts contain measurements acquired with 3db Access and Decawave MDEK1001 UWB devices. In the "3db" and "decawave" datasets, recordings with the same name were acquired at the exact same locations with the two types of devices. The "3db" ranging dataset additionally contains measurements acquired in various LOS and NLOS scenarios. In the directory "images" you can find photos of some of the setups. The "ranging" and "localization" directories both contain a "data" directory, which holds the datasets, and a "code" directory with Python scripts that show how to read and analyze the data.

The 3db Access ranging recordings contain the following data:
- True distance
- Measured distance
- Channel on which the measurements were acquired (can be 6.5, 7, or 7.5 GHz)
- Time of arrival as identified by the chipset
- Channel impulse response (CIR)
- Line-of-sight (LOS)/non-line-of-sight (NLOS) scenario (encoded as 0 and 1, respectively)
- If NLOS, the type of NLOS obstruction and its thickness

The Decawave ranging recordings contain the following data:
- True distance
- Measured distance
- Line-of-sight (LOS)/non-line-of-sight (NLOS) scenario (encoded as 0 and 1, respectively)
- If NLOS, the type of NLOS obstruction and its thickness

The MDEK kit operates only on the 6.5 GHz channel and cannot output the CIR without further code modifications, which is why this data is not available for the Decawave dataset.

The localization dataset includes the following data:
- True location as measured by an HTC Vive system
- Estimated location using a Gauss-Newton trilateration algorithm (please refer to the paper for more details)
- Distance measurements between each anchor and the tag
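As a hypothetical sketch of working with the ranging recordings (the on-disk format and column names below are assumptions; the authors' own scripts in the "code" directory are authoritative), the ranging error could be summarized with pandas as follows:
```
# Hypothetical sketch: computing ranging-error statistics from one recording.
# The file path and column names are assumptions; consult the "code" directory
# shipped with the dataset for the authors' own loading scripts.
import pandas as pd

df = pd.read_csv("ranging/data/3db/example_recording.csv")  # hypothetical path
error = df["measured_distance"] - df["true_distance"]       # hypothetical columns

print("mean error [m]:", error.mean())
print("std of error [m]:", error.std())
# Compare LOS vs NLOS conditions (0 = LOS, 1 = NLOS per the description above)
print(error.groupby(df["nlos"]).describe())
```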
The shared task of CoNLL-2003 concerns language-independent named entity recognition and concentrates on four types of named entities: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('conll2003', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more information on tensorflow_datasets.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘COVID-19 dataset in Japan’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/lisphilar/covid19-dataset-in-japan on 28 January 2022.
--- Dataset description provided by original source is as follows ---
This is a COVID-19 dataset in Japan. It does not include the cases on the Diamond Princess cruise ship (Yokohama city, Kanagawa prefecture) or the Costa Atlantica cruise ship (Nagasaki city, Nagasaki prefecture).
- Total number of cases in Japan
- The number of vaccinated people (New/experimental)
- The number of cases at prefecture level
- Metadata of each prefecture
Note: Lisphilar (author) uploads the same files to https://github.com/lisphilar/covid19-sir/tree/master/data
This dataset can be retrieved with CovsirPhy (Python library).
pip install covsirphy --upgrade
import covsirphy as cs
data_loader = cs.DataLoader()
japan_data = data_loader.japan()
# The number of cases (Total/each province)
clean_df = japan_data.cleaned()
# Metadata
meta_df = japan_data.meta()
Please refer to CovsirPhy Documentation: Japan-specific dataset.
Note: Before analysing the data, please refer to Kaggle notebook: EDA of Japan dataset and COVID-19: Government/JHU data in Japan. The detailed explanation of the build process is discussed in Steps to build the dataset in Japan. If you find errors or have any questions, feel free to create a discussion topic.
covid_jpn_total.csv
Cumulative number of cases:
- PCR-tested / PCR-tested and positive
- with symptoms (to 08May2020) / without symptoms (to 08May2020) / unknown (to 08May2020)
- discharged
- fatal
The number of cases:
- requiring hospitalization (from 09May2020)
- hospitalized with mild symptoms (to 08May2020) / severe symptoms / unknown (to 08May2020)
- requiring hospitalization, but waiting in hotels or at home (to 08May2020)
In the primary source, some variables were removed on 09May2020; their values are NA in this dataset from 09May2020 onward.
The data were collected manually from the Ministry of Health, Labour and Welfare HP:
厚生労働省 HP (in Japanese)
Ministry of Health, Labour and Welfare HP (in English)
The number of vaccinated people:
- Vaccinated_1st: the number of persons vaccinated for the first time on the date
- Vaccinated_2nd: the number of persons vaccinated with the second dose on the date
- Vaccinated_3rd: the number of persons vaccinated with the third dose on the date
Data sources for vaccination:
- To 09Apr2021: 厚生労働省 HP 新型コロナワクチンの接種実績 (in Japanese), 首相官邸 新型コロナワクチンについて
- From 10Apr2021: Twitter: 首相官邸(新型コロナワクチン情報)
covid_jpn_prefecture.csv
Cumulative number of cases:
- PCR-tested / PCR-tested and positive
- discharged
- fatal
The number of cases:
- requiring hospitalization (from 09May2020)
- hospitalized with severe symptoms (from 09May2020)
Using a PDF-to-Excel converter, the data were collected manually from the Ministry of Health, Labour and Welfare HP:
厚生労働省 HP (in Japanese)
Ministry of Health, Labour and Welfare HP (in English)
Note: covid_jpn_prefecture.groupby("Date").sum() does not match covid_jpn_total. When you analyse total data for Japan, please use the covid_jpn_total data.
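A minimal pandas sketch illustrating the note above (column names other than Date are not guaranteed here; check the CSV headers in the dataset):
```
# Hedged sketch: loading the two CSV files with pandas to illustrate the note
# above. Only the "Date" column is assumed from the description; verify the
# remaining headers in the files themselves.
import pandas as pd

total_df = pd.read_csv("covid_jpn_total.csv", parse_dates=["Date"])
pref_df = pd.read_csv("covid_jpn_prefecture.csv", parse_dates=["Date"])

# Summing prefecture-level records per date does NOT reproduce covid_jpn_total,
# so use covid_jpn_total directly for country-level analysis.
pref_sum = pref_df.groupby("Date").sum(numeric_only=True)
print(pref_sum.head())
print(total_df.head())
```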
covid_jpn_metadata.csv
- Population (Total, Male, Female): 厚生労働省 厚生統計要覧(2017年度)第1-5表
- Area (Total, Habitable): Wikipedia 都道府県の面積一覧 (2015)
- Hospital_bed: primary data from 厚生労働省 感染症指定医療機関の指定状況(平成31年4月1日現在), 厚生労働省 第二種感染症指定医療機関の指定状況(平成31年4月1日現在), 厚生労働省 医療施設動態調査(令和2年1月末概数), 厚生労働省 感染症指定医療機関について; secondary data from COVID-19 Japan 都道府県別 感染症病床数
- Clinic_bed: primary data from 医療施設動態調査(令和2年1月末概数)
- Location: data from LinkData 都道府県庁所在地 (Public Domain) (secondary data)
Admin
To create this dataset, edited and transformed data from the following sources were used.
厚生労働省 Ministry of Health, Labour and Welfare, Japan:
厚生労働省 HP (in Japanese)
Ministry of Health, Labour and Welfare HP (in English)
厚生労働省 HP 利用規約・リンク・著作権等 CC BY 4.0 (in Japanese)
国土交通省 Ministry of Land, Infrastructure, Transport and Tourism, Japan:
国土交通省 HP (in Japanese)
国土交通省 HP (in English)
国土交通省 HP 利用規約・リンク・著作権等 CC BY 4.0 (in Japanese)
Code for Japan / COVID-19 Japan:
Code for Japan COVID-19 Japan Dashboard (CC BY 4.0)
COVID-19 Japan 都道府県別 感染症病床数 (CC BY)
Wikipedia: Wikipedia
LinkData: LinkData (Public Domain)
Kindly cite this dataset under the CC BY 4.0 license as follows:
- Hirokazu Takaya (2020-2022), COVID-19 dataset in Japan, GitHub repository, https://github.com/lisphilar/covid19-sir/data/japan, or
- Hirokazu Takaya (2020-2022), COVID-19 dataset in Japan, Kaggle Dataset, https://www.kaggle.com/lisphilar/covid19-dataset-in-japan
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The main objective of this study is to describe the process of collecting data extracted from Twitter (X) during the Brazilian presidential elections in 2022, encompassing the post-election period and the attack on the buildings of the executive, legislative, and judiciary branches in January 2023. The data collection took one year. Additionally, the study provides an overview of the general characteristics of the dataset created from 282 million tweets, named "The Interfaces Twitter Elections Dataset" (ITED-Br), the third most extensive dataset of tweets collected for political purposes. The process of collecting and creating the database went through three major stages, subdivided into several processes: (1) a preliminary analysis of the platform and its operation; (2) contextual analysis, creation of the conceptual model, and definition of keywords; and (3) implementation of the data collection strategy. Python algorithms were developed for each primary collection type. A "token farm" algorithm was employed to iterate over the available API keys. While Twitter is generally a publicly accessible platform and fits big-data standards, extracting valuable information is not trivial due to the volume, speed, and heterogeneity of the data. This study concludes that acquiring informational value requires expertise not only in sociopolitical areas but also in computational and informational studies, highlighting the interdisciplinary nature of such research.
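As an illustration only (not the authors' implementation; the key names and request function are placeholders), the "token farm" idea of cycling over several API keys could look like this:
```
# Illustrative sketch only: cycling through several API keys so collection can
# continue when one key hits its rate limit. The request function is a placeholder.
from itertools import cycle
import time

API_KEYS = ["KEY_1", "KEY_2", "KEY_3"]          # hypothetical credentials
key_pool = cycle(API_KEYS)

def collect(query, request_fn, max_retries=10):
    """Try the same request with successive keys until one succeeds."""
    for _ in range(max_retries):
        key = next(key_pool)
        try:
            return request_fn(query, api_key=key)
        except RuntimeError:                    # stand-in for a rate-limit error
            time.sleep(1)                       # back off, then rotate keys
    raise RuntimeError("all keys exhausted")
```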
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically