This dataset was created by Himanshu Bhardwaj
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LScD (Leicester Scientific Dictionary)
April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com)
Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes

[Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus), Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in the description of Version 2 below; we do not repeat the explanation. After the pre-processing steps, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are also the same as those described for LScD Version 2 below.

* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2

[Version 2] Getting Started
This document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). The dictionary is created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from LSC and instructions for using the code are available in [2]. The code can also be used for lists of texts from other sources; amendments to the code may be required.

LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018. The total number of documents in LSC is 1,673,824.

LScD is an ordered list of words from the texts of abstracts in LSC. The dictionary stores 974,238 unique words and is sorted by the number of documents containing each word, in descending order. All words in the LScD are in stemmed form. The LScD contains the following information:
1. Unique words in abstracts
2. Number of documents containing each word
3. Number of appearances of a word in the entire corpus

Processing the LSC
Step 1. Downloading the LSC Online: Use of the LSC is subject to acceptance of a request for the link by email. To access the LSC for research purposes, please email ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.
Step 2. Importing the Corpus to R: The full R code for processing the corpus can be found on GitHub [2]. All following steps can be applied to an arbitrary list of texts from any source with changes of parameters. The structure of the corpus, such as file format and the names (and positions) of fields, should be taken into account when applying our code. The organisation of the CSV files of LSC is described in the README file for LSC [1].
Step 3. Extracting Abstracts and Saving Metadata: The metadata (all fields in a document excluding the abstract) and the abstract field are separated. Metadata are then saved as MetaData.R. Fields of metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.
Step 4. Text Pre-processing Steps on the Collection of Abstracts: In this section, we present our approach to pre-processing the abstracts of the LSC.
1. Removing punctuation and special characters: This is the process of substituting all non-alphanumeric characters by space. We did not substitute the character "-" in this step, because we need to keep words like "z-score", "non-payment" and "pre-processing" in order not to lose the actual meaning of such words. A process of uniting prefixes with words is performed in later steps of pre-processing.
2. Lowercasing the text data: Lowercasing is performed to avoid treating words like "Corpus", "corpus" and "CORPUS" differently. The entire collection of texts is converted to lowercase.
3. Uniting prefixes of words: Words containing prefixes joined with the character "-" are united into a single word. The prefixes united for this research are listed in the file "list_of_prefixes.csv". Most of the prefixes are extracted from [4]. We also added commonly used prefixes: 'e', 'extra', 'per', 'self' and 'ultra'.
4. Substitution of words: Some words joined with "-" in the abstracts of the LSC require an additional process of substitution to avoid losing the meaning of the word before removing the character "-". Examples of such words are "z-test", "well-known" and "chi-square"; these have been substituted by "ztest", "wellknown" and "chisquare". Identification of such words is done by sampling abstracts from LSC. The full list of such words and the decisions taken for substitution are presented in the file "list_of_substitution.csv".
5. Removing the character "-": All remaining "-" characters are replaced by space.
6. Removing numbers: All digits which are not included in a word are replaced by space. All words that contain both digits and letters are kept because alphanumeric terms such as chemical formulae might be important for our analysis. Examples are "co2", "h2o" and "21st".
7. Stemming: Stemming is the process of converting inflected words into their word stem. This step unites several forms of words with similar meaning into one form and also saves memory space and time [5]. All words in the LScD are stemmed to their word stem.
8. Stop word removal: Stop words are words that are extremely common but provide little value in a language. Some common stop words in English are 'I', 'the', 'a', etc. We used the 'tm' package in R to remove stop words [6]. There are 174 English stop words listed in the package.
Step 5. Writing the LScD into CSV Format: There are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written to the file "LScD.csv".

The Organisation of the LScD
The total number of words in the file "LScD.csv" is 974,238. Each field is described below:
Word: Contains the unique words from the corpus. All words are in lowercase and in their stem form. The field is sorted by the number of documents that contain each word, in descending order.
Number of Documents Containing the Word: A binary count is used: if a word exists in an abstract, it counts as 1; if the word exists more than once in a document, the count is still 1. The total number of documents containing the word is the sum of these 1s over the entire corpus.
Number of Appearances in Corpus: How many times a word occurs in the corpus when the corpus is considered as one large document.

Instructions for R Code
LScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as an RData file and in CSV format. Outputs of the code are:
Metadata File: Includes all fields in a document excluding abstracts. Fields are List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.
File of Abstracts: Contains all abstracts after the pre-processing steps defined in Step 4.
DTM: The Document Term Matrix constructed from the LSC [6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.
LScD: An ordered list of words from LSC as defined in the previous section.
The code can be used as follows:
1. Download the folder 'LSC', 'list_of_prefixes.csv' and 'list_of_substitution.csv'
2. Open the LScD_Creation.R script
3. Change the parameters in the script: replace them with the full path of the directory with source files and the full path of the directory to write output files
4. Run the full code.

References
[1] N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1
[2] N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION
[3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[4] A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.
[5] C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved Porter's stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.
[6] I. Feinerer, "Introduction to the tm Package: Text Mining in R," available online: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.
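For readers who do not use R, here is a minimal Python sketch approximating the Step 4 pre-processing pipeline described above. It is not the authors' implementation (the original uses the R 'tm' package together with "list_of_prefixes.csv" and "list_of_substitution.csv"); the prefix list, substitution table and stop-word list below are illustrative stand-ins, the step order is simplified, and the stemmer is NLTK's Porter stemmer rather than the one used by the authors.

import re
from collections import Counter
from nltk.stem import PorterStemmer  # requires: pip install nltk

stemmer = PorterStemmer()
PREFIXES = {"e", "extra", "per", "self", "ultra"}           # sample prefixes; the real list is in list_of_prefixes.csv
SUBSTITUTIONS = {"z-test": "ztest", "well-known": "wellknown", "chi-square": "chisquare"}  # sample of list_of_substitution.csv
STOP_WORDS = {"i", "the", "a", "an", "and", "of", "in"}     # illustrative subset of the 174 tm stop words

def preprocess(abstract: str) -> list[str]:
    text = abstract.lower()                                              # step 2: lowercasing
    for src, dst in SUBSTITUTIONS.items():                               # step 4: substitute hyphenated words
        text = text.replace(src, dst)
    text = re.sub(r"[^a-z0-9\s-]", " ", text)                            # step 1: remove punctuation except "-"
    text = re.sub(r"\b(" + "|".join(PREFIXES) + r")-(\w)", r"\1\2", text)  # step 3: unite prefixes
    text = text.replace("-", " ")                                        # step 5: drop remaining hyphens
    text = re.sub(r"\b\d+\b", " ", text)                                 # step 6: remove standalone numbers
    tokens = [stemmer.stem(t) for t in text.split()]                     # step 7: stemming
    return [t for t in tokens if t not in STOP_WORDS]                    # step 8: stop word removal

# Dictionary construction: document frequency and total appearances per stem.
abstracts = ["A well-known z-test example with co2 data.", "Pre-processing of the corpus ..."]
doc_freq, total_freq = Counter(), Counter()
for abs_text in abstracts:
    tokens = preprocess(abs_text)
    total_freq.update(tokens)
    doc_freq.update(set(tokens))   # binary per-document count, as described above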
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book subjects and is filtered where the book is Longman illustrated dictionary of computing science : computers and their application. It has 10 columns such as authors, average publication date, book publishers, book subject, and books. The data is ordered by earliest publication date (descending).
Trusted Research Environments (TREs) enable analysis of sensitive data under strict security assertions that protect the data with technical, organizational, and legal measures from (accidentally) being leaked outside the facility. While many TREs exist in Europe, little information is publicly available on their architecture, descriptions of their building blocks, and their slight technical variations. To shine light on these problems, we give an overview of existing, publicly described TREs and a bibliography linking to the system descriptions. We further analyze their technical characteristics, especially their commonalities and variations, and provide insight into their data type characteristics and availability. Our literature study shows that 47 TREs worldwide provide access to sensitive data, of which two-thirds provide data themselves, predominantly via secure remote access. Statistical offices make available the majority of available sensitive data records included in this study.
Methodology
We performed a literature study covering 47 TREs worldwide using scholarly databases (Scopus, Web of Science, IEEE Xplore, Science Direct), a computer science library (dblp.org), Google, and grey literature, focusing on retrieving the following source material: peer-reviewed articles where available, TRE websites, and TRE metadata catalogs. The goal of this literature study is to discover existing TREs and analyze their characteristics and data availability, giving an overview of available infrastructure for sensitive data research, as many European initiatives have been emerging in recent months.
Technical details
This dataset consists of five comma-separated values (.csv) files describing our inventory:
countries.csv: Table of countries with columns id (number), name (text) and code (text, in ISO 3166-A3 encoding, optional)
tres.csv: Table of TREs with columns id (number), name (text), countryid (number, referring to column id of table countries), structureddata (bool, optional), datalevel (one of [1=de-identified, 2=pseudonymized, 3=anonymized], optional), outputcontrol (bool, optional), inceptionyear (date, optional), records (number, optional), datatype (one of [1=claims, 2=linked records], optional), statistics_office (bool), size (number, optional), source (text, optional), comment (text, optional)
access.csv: Table of access modes of TREs with columns id (number), suf (bool, optional), physical_visit (bool, optional), external_physical_visit (bool, optional), remote_visit (bool, optional)
inclusion.csv: Table of TREs included in the literature study with columns id (number), included (bool), exclusion reason (one of [peer review, environment, duplicate], optional), comment (text, optional)
major_fields.csv: Table of data categorization into the major research fields with columns id (number), life_sciences (bool, optional), physical_sciences (bool, optional), arts_and_humanities (bool, optional), social_sciences (bool, optional)
Additionally, a MariaDB (10.5 or higher) schema definition .sql file is needed, properly modelling the schema for the database:
schema.sql: Schema definition file to create the tables and views used in the analysis.
The analysis was done through a Jupyter Notebook which can be found in our source code repository: https://gitlab.tuwien.ac.at/martin.weise/tres/-/blob/master/analysis.ipynb
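A minimal Python/pandas sketch of how the five CSV files could be combined for exploration. Column names are taken from the description above; the file paths, boolean encodings, and join logic are assumptions, and the published analysis itself is the Jupyter notebook linked above.

import pandas as pd

# Load the inventory tables described above (paths are assumed).
countries = pd.read_csv("countries.csv")   # id, name, code
tres = pd.read_csv("tres.csv")             # id, name, countryid, structureddata, datalevel, outputcontrol, ...
access = pd.read_csv("access.csv")         # id, suf, physical_visit, external_physical_visit, remote_visit
inclusion = pd.read_csv("inclusion.csv")   # id, included, exclusion reason, comment

# Keep only TREs included in the literature study, attach country names and access modes.
included_ids = inclusion.loc[inclusion["included"] == True, "id"]
df = (tres[tres["id"].isin(included_ids)]
      .merge(countries, left_on="countryid", right_on="id", suffixes=("", "_country"))
      .merge(access, on="id", how="left"))

# Example aggregate: number of included TREs per country offering secure remote access.
print(df[df["remote_visit"] == True].groupby("name_country")["name"].count())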
Webpage: https://ogb.stanford.edu/docs/graphprop/#ogbg-code
# Load the ogbg-code graph property prediction dataset with PyTorch Geometric data loaders.
from torch_geometric.data import DataLoader
from ogb.graphproppred import PygGraphPropPredDataset

# 'root' points to the directory where the dataset is stored (here, a Kaggle input folder).
dataset = PygGraphPropPredDataset(name = 'ogbg-code', root = '/kaggle/input')

batch_size = 32
# Use the official project split provided by OGB (see "Dataset splitting" below).
split_idx = dataset.get_idx_split()
train_loader = DataLoader(dataset[split_idx['train']], batch_size = batch_size, shuffle = True)
valid_loader = DataLoader(dataset[split_idx['valid']], batch_size = batch_size, shuffle = False)
test_loader = DataLoader(dataset[split_idx['test']], batch_size = batch_size, shuffle = False)
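As a quick sanity check (not part of the original snippet), one can pull a single mini-batch from a loader and inspect it; the attributes below are the standard PyTorch Geometric Batch fields and are assumed to be present for this dataset.

# Inspect one mini-batch of ASTs (each batch is a torch_geometric Batch object).
batch = next(iter(train_loader))
print(batch.num_graphs)        # number of method ASTs in the batch (here, up to 32)
print(batch.x.shape)           # node feature matrix of the batched graphs
print(batch.edge_index.shape)  # AST edges in COO format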
Graph: The ogbg-code dataset is a collection of Abstract Syntax Trees (ASTs) obtained from approximately 450 thousand Python method definitions. Methods are extracted from a total of 13,587 different repositories across the most popular projects on GitHub. The collection of Python methods originates from GitHub CodeSearchNet, a collection of datasets and benchmarks for machine-learning-based code retrieval. In ogbg-code, the dataset authors contribute an additional feature extraction step, which includes: AST edges, AST nodes, and tokenized method name. Altogether, ogbg-code allows you to capture source code with its underlying graph structure, beyond its token sequence representation.
Prediction task: The task is to predict the sub-tokens forming the method name, given the Python method body represented by its AST and associated node features. This task is often referred to as “code summarization”, because the model is trained to find a succinct and precise description (i.e., the method name chosen by the developer) for a complete logical unit (i.e., the method body). Code summarization is a representative task in the field of machine learning for code, not only for its straightforward adoption in developer tools, but also because it is a proxy measure for assessing how well a model captures code semantics [1]. Following [2,3], the dataset authors use an F1 score to evaluate predicted sub-tokens against ground-truth sub-tokens.
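For illustration, here is a small sketch of how sub-token F1 can be computed for a single prediction against the ground-truth method name. It mirrors the metric description above; the official computation is done by the OGB Evaluator, and details such as duplicate handling may differ.

def subtoken_f1(pred_tokens, true_tokens):
    # Precision/recall over predicted vs. ground-truth sub-tokens of the method name.
    pred, true = set(pred_tokens), set(true_tokens)
    if not pred or not true:
        return 0.0
    tp = len(pred & true)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(true)
    return 2 * precision * recall / (precision + recall)

# e.g. a model predicting ["get", "user", "id"] for the true name ["get", "user", "name"]
print(subtoken_f1(["get", "user", "id"], ["get", "user", "name"]))  # 0.666...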
Dataset splitting: The dataset authors adopt a project split [4], where the ASTs for the train set are obtained from GitHub projects that do not appear in the validation and test sets. This split respects the practical scenario of training a model on a large collection of source code (obtained, for instance, from the popular GitHub projects), and then using it to predict method names on a separate code base. The project split stress-tests the model’s ability to capture code’s semantics, and avoids a model that trivially memorizes the idiosyncrasies of training projects (such as the naming conventions and the coding style of a specific developer) to achieve a high test score.
| Package | #Graphs | #Nodes per Graph | #Edges per Graph | Split Type | Task Type | Metric |
|---|---|---|---|---|---|---|
| ogb>=1.2.0 | 452,741 | 125.2 | 124.2 | Project | Sub-token prediction | F1 score |
Website: https://ogb.stanford.edu
The Open Graph Benchmark (OGB) [5] is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can be evaluated using the OGB Evaluator in a unified manner.
[1] Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. A survey of machine learning for big code and naturalness. ACM Computing Surveys, 51(4):1–37, 2018.
[2] Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. code2seq: Generating sequences from structured representations of code. arXiv preprint arXiv:1808.01400, 2018.
[3] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. code2vec: Learning distributed representations of code. Proceedings of the ACM on Programming Languages, 3(POPL):1–29, 2019.
[4] Miltiadis Allamanis. The adverse effects of code duplication in machine learning models of code. Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, pp. 143–153, 2019.
[5] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems, pp. 22118–22133, 2020.
I am NOT the author of this dataset. It was downloaded from its official website. I assume no responsibility or liability for the content in this dataset. Any questions, problems or issues, please contact the original authors at their website or their GitHub repo.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Batik is one of the Indonesian cultural heritages, with noble cultural and philosophical meaning in every motif. This article introduces a dataset from Yogyakarta, Indonesia: "Batik Nitik 960". The dataset was captured from a piece of fabric consisting of 60 Nitik motifs. The dataset was provided by Paguyuban Pecinta Batik Indonesia (PPBI) Sekar Jagad Yogyakarta from the collection of Winotosasto Batik, and the data were taken in the APIPS Gallery. The dataset is divided into 60 categories with a total of 960 images, and each class has 16 photos. The images were captured with a Sony Alpha a6400, lighting was provided by a Godox SK II 400, and the data were saved in JPG format. Each category has four motifs and is augmented using rotations of 90, 180, and 270 degrees. Each class has a philosophical meaning which describes the history of the motif. This dataset allows the training and validation of machine learning models to classify, retrieve, or generate new batik motifs, for example using a generative adversarial network. To our knowledge, this is the first publicly available Batik Nitik dataset with philosophical meaning. The data are provided by a collaboration of PPBI Sekar Jagad Yogyakarta, Universitas Muhammadiyah Malang, and Universitas Gadjah Mada. We hope this "Batik Nitik 960" dataset can support batik research, and we are open to research collaboration.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset for Analysing Model-Driven Engineering Research
Model-driven engineering (MDE), like any other recent discipline, is continuously evolving as new topics, techniques and application areas emerge. The research interests of the community also evolve due to the progressive challenges posed by the industry as it begins to embrace MDE, helping to drive the discipline forward. With this dataset, we analysed the evolution of the MDE discipline, as evidenced by its associated scholarly data, to gain insights into its evolution over the past years, the research landscape of the domain, the topics being researched, its main contributors and application areas, as well as the main trends in the field. We take advantage of Natural Language Processing and Machine Learning techniques to extract information from the MDE research literature in order to conduct a data-driven analysis of the main features of the domain in general, and of the MoDELS conference in particular.
List of Files
Model-Driven Engineering Ontology (mdeo.ttl)
The Model-Driven Engineering Ontology (MDEO) is a taxonomy of research areas focusing on the Model-Driven Engineering field. It includes 91 concepts arranged in a three-layer mono-hierarchic structure with 573 relationships.
At the top level, there are nine broad concepts, including "Model Foundations", "Model Quality", and "Modeling Languages".
MDEO has been formalized as an OWL ontology following Semantic Web standards. Its data model builds on SKOS (Simple Knowledge Organization System) and includes five semantic relationships:
Model-Driven Engineering Dataset (dataset.json)
This file contains all the papers we gathered to perform this analysis. First, we extracted a dump of MAG via Microsoft Azure Storage and processed it using our local Big Data infrastructure.
Next, we selected all papers from journals and conferences having a one-hundred-percent focus on the field:
Then, we also included papers having the following chunks of text in their title or abstract: "domain specific model", "domain specific modeling", "domain specific modelling", "metamodel", "metamodelling", "metamodeling", "model analysis", "model debugging", "model difference", "model differencing", "model evolution", "model execution", "model maintenance", "model merge", "model quality", "model migration", "model synchronization", "model transformation", "model transformations", "model versioning", "model views", "model viewpoint", "model viewpoints", "model weaving", "model testing", "multiview model", "multiview modeling", "multiview modelling", "OCL", "software model", "system model", "SysML", "UML", "view consistency", "viewpoint consistency", "view integration", "viewpoint integration". As a result, we gathered 43,700 papers.
The dataset is a JSON-formatted file containing a dictionary whose keys are paper identifiers and whose values are dictionaries describing papers according to several features. Here is an instance of a paper available in the dataset:
{
"2116587399": {
"citationcount": 32,
"confname": "models 2012",
"references": [197998272, 2013363798, 1523334793, 1577544661, 2109445551, 2120437191, 2087918852, 1533999404, 2054150958, 2122246939, 2399834472, 2013840728, 2026586559, 2155708393, 2026049208, 2974365732],
"year": "2012-01-01",
"topics": ["theoretical computer science", "modeling language", "semantics", "domain specific modeling", "domain model", "programming language", "computer science", "unified modeling language", "domain knowledge", "metamodeling", "abstract syntax"],
"papertitle": "creating visual domain specific modeling languages from end user demonstration",
"confseries": "MoDELS",
"language": ["", "en"],
"abstract": "Domain-Specific Modeling Languages (DSMLs) have received recent interest due to their conciseness and rich expressiveness for modeling a specific domain. However, DSML adoption has several challenges because development of a new DSML requires both domain knowledge and language development expertise (e.g., defining abstract/concrete syntax and specifying semantics). Abstract syntax is generally defined in the form of a metamodel, with semantics associated to the metamodel. Thus, designing a metamodel is a core DSML development activity. Furthermore, DSMLs are often developed incrementally by iterating across complex language development tasks. An iterative and incremental approach is often preferred because the approach encourages end-user involvement to assist with verifying the DSML correctness and feedback on new requirements. However, if there is no tool support, iterative and incremental DSML development can be mundane and error-prone work. To resolve issues related to DSML development, we introduce a new approach to create DSMLs from a set of domain model examples provided by an end-user. The approach focuses on (1) the identification of concrete syntax, (2) inducing abstract syntax in the form of a metamodel, and (3) inferring static semantics from a set of domain model examples. In order to generate a DSML from user-supplied examples, our approach uses graph theory and metamodel design patterns.",
"conferenceseriesid": 1191550517,
"confplace": "Innsbruck/AUSTRIA",
"urls": ["http://yadda.icm.edu.pl/yadda/element/bwmeta1.element.ieee-000006226010", "http://gray.cs.ua.edu/pubs/mise-2012.pdf", "http://ieeexplore.ieee.org/document/6226010/", "https://ieeexplore.ieee.org/document/6226010/"],
"confseriesname": "Model Driven Engineering Languages and Systems",
"id": 2116587399,
"doi": "10.1109/MISE.2012.6226010",
"authors": [{
"country": "United States",
"affiliation": "University of Alabama, Tuscaloosa",
"name": "eugene syriani",
"id": 578966534,
"gridid": "grid.411015.0",
"affiliationid": 17301866,
"order": 3
}, {
"country": "United States",
"affiliation": "University of Alabama, Tuscaloosa",
"name": "jeff gray",
"id": 2155833130,
"gridid": "grid.411015.0",
"affiliationid": 17301866,
"order": 2
}, {
"country": "United States",
"affiliation": "University of Alabama, Tuscaloosa",
"name": "hyun cho",
"id": 2505758318,
"gridid": "grid.411015.0",
"affiliationid": 17301866,
"order": 1
}],
"mbse_syntactic_topics": ["domain-specific modeling language", "concrete syntax", "metamodel", "modeling language"],
"mbse_annotated": true,
"mbse_semantic_topics": ["modeling language", "metamodel", "concrete syntax"],
"mbse_enhanced_topics": ["concrete syntax", "metamodel", "domain-specific modeling language", "modeling language", "model representation", "syntax", "language definition", "model foundations"]
}
}
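A minimal Python sketch for loading dataset.json and iterating over papers using the fields shown in the example above (the file path is an assumption).

import json

with open("dataset.json", encoding="utf-8") as f:
    papers = json.load(f)   # dict: paper id (string) -> paper record

# Example: list MoDELS papers with their publication year and a few enhanced topics.
for paper_id, paper in papers.items():
    if paper.get("confseries") == "MoDELS":
        year = paper.get("year", "")[:4]
        topics = paper.get("mbse_enhanced_topics", [])
        print(paper_id, year, paper.get("papertitle"), topics[:3])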
Spreadsheet describing the MoDELS conference (MBSE@analysis_on_models.xlsx)
This is the dataset describing the MoDELS conference throughout time. This spreadsheet consists of 25 tabs, which can be categorised according to five main categories: i) metrics, ii) publications, iii) NORM-publications, iv) citations, and v) NORM-citations. The publication and citation tabs report the absolute values, whereas the NORM-publication and NORM-citations report the normalised values. Each of these categories consists of 5 different tabs, each representing a class of entities: i) organizations, ii) topics, iii) authors, iv) countries, and v) conferences.
Here is the full list of tabs with their description:
Open Data Commons Attribution License (ODC-By) v1.0 https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
We provide an academic graph based on a snapshot of the Microsoft Academic Graph from 26.05.2021. The Microsoft Academic Graph (MAG) is a large-scale dataset containing information about scientific publication records, their citation relations, as well as authors, affiliations, journals, conferences and fields of study. We acknowledge the Microsoft Academic Graph using the URI https://aka.ms/msracad. For more information regarding schema and the entities present in the original dataset please refer to: MAG schema.
MAG for Heterogeneous Graph Learning We use a recent version of MAG from May 2021 and extract all relevant entities to build a graph that can be directly used for heterogeneous graph learning (node classification, link prediction, etc.). The graph contains all English papers, published after 1900, that have been cited at least 5 times per year since the time of publishing. For fairness, we set a constant citation bound of 100 for papers published before 2000. We further include two smaller subgraphs, one containing computer science papers and one containing medicine papers.
Nodes and features We define the following nodes:
paper with mag_id, graph_id, normalized title, year of publication, citations, and a 128-dimension title embedding built using word2vec. No. of papers: 5,091,690 (all), 1,014,769 (medicine), 367,576 (computer science);
author with mag_id, graph_id, normalized name, citations. No. of authors: 6,363,201 (all), 1,797,980 (medicine), 557,078 (computer science);
field with mag_id, graph_id, level, citations, where level denotes the hierarchical level of the field and 0 is the highest level (e.g. computer science). No. of fields: 199,457 (all), 83,970 (medicine), 45,454 (computer science);
affiliation with mag_id, graph_id, citations. No. of affiliations: 19,421 (all), 12,103 (medicine), 10,139 (computer science);
venue with mag_id, graph_id, citations, and type denoting whether conference or journal. No. of venues: 24,608 (all), 8,514 (medicine), 9,893 (computer science).
Edges We define the following edges:
author is_affiliated_with affiliation. No. of author-affiliation edges: 8,292,253 (all), 2,265,728 (medicine), 665,931 (computer science);
author is_first/last/other paper. No. of author-paper edges: 24,907,473 (all), 5,081,752 (medicine), 1,269,485 (computer science);
paper has_citation_to paper. No. of citation edges: 142,684,074 (all), 16,808,837 (medicine), 4,152,804 (computer science);
paper conference/journal_published_at venue. No. of paper-venue edges: 5,091,690 (all), 1,014,769 (medicine), 367,576 (computer science);
paper has_field_L0/L1/L2/L3/L4 field. No. of paper-field edges: 47,531,366 (all), 9,403,708 (medicine), 3,341,395 (computer science);
field is_in field. No. of field-field edges: 339,036 (all), 138,304 (medicine), 83,245 (computer science).
We further include a reverse edge for each edge type defined above that is denoted with the prefix rev_ and can be removed based on the downstream task.
Data structure The nodes and their respective features are provided as separate .tsv files where each feature represents a column. The edges are provided as a pickled python dictionary with schema:
{target_type: {source_type: {edge_type: {target_id: {source_id: {time } } } } } }
We provide three compressed ZIP archives, one for each subgraph (all, medicine, computer science); the files for the complete graph are split into 500 MB chunks. Each archive contains the separate node features and the edge dictionary.
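A hedged Python sketch of how the node tables and the pickled edge dictionary could be read, following the schema given above; the file names, and the concrete edge-type keys used in the traversal example, are assumptions based on the description.

import pickle
import pandas as pd

# Node features: one .tsv per node type, each feature is a column (file names assumed).
papers = pd.read_csv("paper.tsv", sep="\t")
authors = pd.read_csv("author.tsv", sep="\t")

# Edges: nested dictionary with schema
# {target_type: {source_type: {edge_type: {target_id: {source_id: {time}}}}}}
with open("edges.pkl", "rb") as f:   # file name assumed
    edges = pickle.load(f)

# Example traversal: count author->paper edges of type "is_first" (type names assumed).
count = 0
for target_id, sources in edges.get("paper", {}).get("author", {}).get("is_first", {}).items():
    count += len(sources)
print(count)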
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This fileset provides supporting data and corpora for the empirical study described in:
Rafael S. Gonçalves and Mark A. Musen. The variable quality of metadata about biological samples used in biomedical experiments. Scientific Data, in press (2019).
Description of files
Analysis spreadsheet files:
- ncbi-biosample-metadata-study.xlsx contains data to support the analysis of the quality of metadata in the NCBI BioSample.
- ebi-biosamples-metadata-study.xlsx contains data to support the analysis of the quality of metadata in the EBI BioSamples.
Validation data files:
- ncbi-biosample-validation-data.tar.gz is an archive containing the validation data for the analysis of the entire NCBI BioSample dataset.
- ncbi-biosample-packaged-validation-data.tar.gz is an archive containing the validation data for the analysis of the subset of metadata records in the NCBI BioSample that use a BioSample package definition.
- ebi-ncbi-shared-records-validation-data.tar.gz is an archive containing the validation data for the analysis of the set of metadata records that exist both in EBI BioSamples and NCBI BioSample.
Corpus files:
- ebi-biosamples-corpus.xml.gz corresponds to the EBI BioSamples corpus.
- ncbi-biosample-corpus.xml.gz corresponds to the NCBI BioSample corpus.
- ncbi-biosample-packaged-records-corpus.tar.gz corresponds to the NCBI BioSample metadata records that declare a package definition.
- ebi-ncbi-shared-records-corpus.tar.gz corresponds to the corpus of metadata records that exist both in NCBI BioSample and EBI BioSamples.
Thinking Machines Data Science is releasing TM Open Buildings, a dataset of manually-drawn building outlines covering 12 Philippine cities with detailed annotations on building and roof attributes as seen over satellite imagery. We contribute the buildings to OpenStreetMap and also make them available for download on Kaggle. This is made possible with support from the Lacuna Fund.
The team has consulted HOTOSM Asia Pacific and community architects from the Philippine Action for Community-led Shelter Initiatives (PACSII) to validate our attributes and to ensure that our contributions are documented properly. We also looked at street-level views to check tags whenever available. We will take into consideration feedback from local mappers, as local knowledge always takes precedence, and will always provide changeset comments that are in compliance with OSM guidelines.
You may view more details of our process on our wiki page. Kindly use our GitHub Issues tab to file any specific concerns about the dataset.
This TM Open Buildings dataset is made available by Thinking Machines under the Open Database License (ODbL). Any rights in individual contents of the database are licensed under the Database Contents License.
We define the buildings we mapped, as well as the attributes included, in the table below. Please refer to our wiki page for more details.
| Building Type |Subtype | Definition | Mapped Attributes |
|----------------|--------|----------------------------------------------------------------------------------------|---------------------------------------------------------|
| Settlement | Single | Residential houses that are individually distinct from surrounding structures | Roof material, Roof layout, Is within gated community? |
| | Dense | Tight clusters of small residential houses that do not have distinguishable boundaries | - |
| Non-settlement | | Commercial, industrial, or institutional buildings | Building height |
The dataset covers selected 250m x 250m tiles in 12 Philippine cities, namely Dagupan City, Palayan City, City of Navotas, City of Mandaluyong, City of Muntinlupa, Legazpi City, Tacloban City, Iloilo City, Mandaue City, Cagayan de Oro City, Davao City, and Zamboanga City. The tiles are chosen to focus on residential areas that lie on a wide variety of terrains (urban, coastal, riparian, agricultural, etc.). All settlements and non-settlements within each tile are drawn manually. Data on the locations of the tiles are given in the following table.
The following table contains the definitions of the attributes and how they are tagged in OSM.
| Attribute | Type | Characteristics | OSM Key and Tag |
|---|---|---|---|
| Roof Material | Natural/Galvanized Iron (GI)/Mixed | Looks rusty when old, silver/gray when new, lines and patches are usually evident. | roof:material = metal_sheet |
| | Metal/Tiled | Whole roof is usually one solid color, tiled roofs have texture. | roof:material = roof_tiles |
| | Concrete | Flat, usually has raised white edges, no visible roof "folds", may be smooth or have objects on top. | roof:material = concrete |
| Roof Layout | Single Layer Basic | No complex architecture. Plain flat or rectangular roof. At most 4 faces are visible. 1 single layer visible. | roof:shape = gabled |
| | Single Layer Intricate | Complex shapes on rooftop and multiple vertices, more than 4 faces visible, but still 1 single layer. | roof:shape = hip-and-gabled |
| | Multilayer | One roof on top of another. A shadow separating the rooftops is seen. | roof:levels = 2 |
| Within gated community | | Uniform roof and lot sizes, structured street layout, "themed" street names, development name given in address | residential = gated |
| Is a dense settlement | | Densely packed small houses with overlapping rooftops. Rooftop materials are mostly natural/light, mixed, galvanized iron (GI). Narrow one-lane streets, or no visible streets between houses. | residential = irregular_settlement |
| Building height | Low | 1-5 storeys | note = "This building has 1-5 storeys" |
| | Medium | 6-15 storeys | note = "This building has 6-15 storeys" |
| | High | >15 storeys | note = "This building has more than 15 storeys" |
We used Mapbox Satellite Streets imagery as of August-September 2023 to trace the building outlines and deduce roof and building attributes. This imagery is a combination of multiple global satellite imagery sources from commercial providers, NASA, and USGS, which have different capture dates. To check if the outlines in the dataset are the most updated ones in an area of interest, you may compare with another imagery source with a known capture date.
TM Open Buildings is a dataset of building footprint outlines of settlement and non-settlements with annotated physical characteristics as seen on satellite imagery (i.e. building and rooftop attributes).
The data was created with funding from the Lacuna Fund as part of the datasets developed under Project CCHAIn (Climate Change, Health, and Artificial Intelligence) which aims to address the knowledge gap on health impacts from climate change for informal settlements in the Philippines. For example, overlaying this data with various hazard information such as flooding, landslides, and fault lines can bring much more granular and targeted insights to disaster risk reduction research and response.
The data was developed from August-October 2023. To support the nature of the OSM platform, we welcome users to actively participate in the continuous updating and improvement of the data based on local knowledge.
You may download in Kaggle or view the tiles in OpenStreetMap.
Thinking Machines collaborated with data annotators to draw and annotate satellite imagery in select areas around the country, in line with the project’s geographical focus. These annotations were then quality checked, post-processed, and conflated by TM with any existing OSM tags before uploading on the OSM platform.
We used EPSG: 4326 to draw the outlines.
You can use the outlines in combination with the basemap imagery and the tile bounding boxes provided in this table to create annotated tiles that can be used to train a computer vision model. Such a model can detect buildings and/or assign roof attributes in other areas we have not yet covered.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Azerbaijani Sign Language Dataset (AzSLD) is a comprehensive, large dataset designed to facilitate the development and evaluation of machine learning models for the recognition and translation of Azerbaijani Sign Language (AzSL).
AzSLD is the first publicly available dataset focused on Azerbaijani Sign Language. It contributes to the global effort to improve accessibility for the deaf and hard-of-hearing community in Azerbaijan. The dataset aims to bridge the gap between technology and accessibility by providing high-quality data for researchers, developers, and practitioners working on sign language recognition or translation systems.
The data collection costs are covered by the "Strengthening Data Analytics Research and Training Capacity through Establishment of dual Master of Science in Computer Science and Master of Science in Data Analytics (MSCS/DA) degree program at ADA University" project, funded by BP and the Ministry of Education of the Republic of Azerbaijan.
AzSLD is organized into three primary components:
This component contains video sequences of complete sentences in AzSL. It is designed to capture the fluidity and contextual nature of sign language, providing data for more complex language modeling tasks. It includes over 60 hours of high-definition video recordings, annotated with timestamped glosses for 500 distinct classes, enabling precise analysis and robust model training. Ground truth annotations of sentences for each class were added in a separate file. The videos were performed by 18 to 25 different signers, with a slight imbalance among them.
2. AzSLD_Words
This component comprises a collection of short video samples representing frequently used words in Azerbaijani Sign Language. It is divided into two subsets:
Folder names indicate the ground truth labels for the ease of word-level model evaluation.
This component includes over 14,000 video and image samples of letters of the Azerbaijani alphabet. Each sign is captured from multiple angles to ensure comprehensive coverage of dactylology in AzSL. This component is ideal for tasks involving letter recognition and the integration of fingerspelling into broader sign language recognition systems.
The dataset includes 10,104 synchronized video recordings from two camera angles to capture both frontal and side views of hand and body movements, ensuring that the subtle nuances of sign language are well-represented.
The dataset features recordings from a diverse group of native AzSL signers, encompassing variations in age, gender, and signing style. This diversity is crucial for training models that are robust to variations in signing.
Each video is annotated with comprehensive metadata, including the sign’s label (dactyl, word, or sentence), signer ID, and timestamped glosses for sentence-level signs.
The dataset comprises RGB videos in high-definition (HD) resolution at 35 frames per second, accompanied by JSON files containing annotations and metadata. The data is systematically organized into folders by category for ease of navigation.
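Since the data is described as organized into folders by category with JSON annotation files, a hedged Python sketch for walking that structure is shown below; the root folder name, video container format, per-video annotation file naming, and JSON keys ("label", "signer_id") are all assumptions for illustration.

import json
from pathlib import Path

root = Path("AzSLD")  # assumed root folder of the extracted dataset
for category_dir in sorted(p for p in root.iterdir() if p.is_dir()):
    for video_path in category_dir.glob("*.mp4"):          # video container format assumed
        meta_path = video_path.with_suffix(".json")        # per-video annotation file assumed
        if meta_path.exists():
            meta = json.loads(meta_path.read_text(encoding="utf-8"))
            print(category_dir.name, video_path.name, meta.get("label"), meta.get("signer_id"))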
Ethical Transparency
All participants provided informed consent for collecting, publishing, and using the data, ensuring compliance with ethical research standards.
Accessibility
The AzSLD is available under Creative Commons Attribution 4.0 International with free access for academic research through Zenodo.
Citation: When using AzSLD in your research, please cite the following paper:
Alishzade, N., Hasanov, J. (2025). AzSLD: Azerbaijani sign language dataset for fingerspelling, word, and sentence translation with baseline software, Data in Brief, Volume 58, 2025, 111230, ISSN 2352-3409, https://url.au.m.mimecastprotect.com/s/szU6C2xMQziEvMn1kFBi9S5WqA6?domain=doi.org" href="https://url.au.m.mimecastprotect.com/s/szU6C2xMQziEvMn1kFBi9S5WqA6?domain=doi.org" target="_blank" rel="noopener noreferrer">https://doi.org/10.1016/j.dib.2024.111230.
The preprint is available at: https://arxiv.org/abs/2411.12865
Contact:
For questions, feedback, or contributions, please contact the project team at: slr.project.ada@gmail.com
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Synthetic EEG data generated by the ‘bai’ model based on real data.
Total Features/Columns: 1140
License: Apache 2.0
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Original dataset description and JSON file structure: https://www.lvisdataset.org/dataset Best practices: https://www.lvisdataset.org/bestpractices
LVIS is based on the COCO 2017 dataset, which you can find here: https://www.kaggle.com/datasets/awsaf49/coco-2017-dataset Image data is the same as COCO 2017; the difference is in the annotations.
LVIS has annotations for instance segmentation in a format similar to COCO. The annotations are stored using JSON. The LVIS API can be used to access and manipulate annotations.
Each image now comes with two additional fields:
not_exhaustive_category_ids: List of category ids which don't have all of their instances marked exhaustively.
neg_category_ids: List of category ids which were verified as not present in the image.
coco_url: Image URL. The last two path elements identify the split in the COCO dataset and the file name (e.g., http://images.cocodataset.org/train2017/000000391895.jpg). This information can be used to load the correct image from your downloaded copy of the COCO dataset.
Categories: LVIS categories are loosely based on WordNet synsets.
synset: Provides a unique string identifier for each category, loosely based on WordNet synsets.
synonyms: List of object names that belong to the same synset.
def: The meaning of the synset. Most of the meanings are derived from WordNet.
image_count: Number of images in which the category is annotated.
instance_count: Number of annotated instances of the category.
frequency: We divide the categories into three buckets based on image_count in the train set.
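A minimal Python sketch reading an LVIS annotation JSON directly (without the LVIS API) and accessing the fields listed above; the annotation file name follows the usual LVIS naming but is an assumption here.

import json

with open("lvis_v1_train.json") as f:   # file name assumed
    lvis = json.load(f)

# Per-image fields described above.
img = lvis["images"][0]
print(img["coco_url"], img["not_exhaustive_category_ids"], img["neg_category_ids"])

# Per-category fields described above.
cat = lvis["categories"][0]
print(cat["synset"], cat["synonyms"], cat["frequency"], cat["image_count"], cat["instance_count"])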
Information on more than 180,000 Terrorist Attacks
The Global Terrorism Database (GTD) is an open-source database including information on terrorist attacks around the world from 1970 through 2017. The GTD includes systematic data on domestic as well as international terrorist incidents that have occurred during this time period and now includes more than 180,000 attacks. The database is maintained by researchers at the National Consortium for the Study of Terrorism and Responses to Terrorism (START), headquartered at the University of Maryland. [More Information][1]
Geography: Worldwide
Time period: 1970-2017, except 1993
Unit of analysis: Attack
Variables: >100 variables on location, tactics, perpetrators, targets, and outcomes
Sources: Unclassified media articles (Note: Please interpret changes over time with caution. Global patterns are driven by diverse trends in particular regions, and data collection is influenced by fluctuations in access to media coverage over both time and place.)
Definition of terrorism:
"The threatened or actual use of illegal force and violence by a non-state actor to attain a political, economic, religious, or social goal through fear, coercion, or intimidation."
See the [GTD Codebook][2] for important details on data collection methodology, definitions, and coding schema.
The Global Terrorism Database is funded through START, by the US Department of State (Contract Number: SAQMMA12M1292) and the US Department of Homeland Security Science and Technology Directorate’s Office of University Programs (Award Number 2012-ST-061-CS0001, CSTAB 3.1). The coding decisions and classifications contained in the database are determined independently by START researchers and should not be interpreted as necessarily representing the official views or policies of the United States Government.
[GTD Team][3]
The GTD has been leveraged extensively in [scholarly publications][4], [reports][5], and [media articles][6]. [Putting Terrorism in Context: Lessons from the Global Terrorism Database][7], by GTD principal investigators LaFree, Dugan, and Miller investigates patterns of terrorism and provides perspective on the challenges of data collection and analysis. The GTD's data collection manager, Michael Jensen, discusses important [Benefits and Drawbacks of Methodological Advancements in Data Collection and Coding][8].
Use of the data signifies your agreement to the following [terms and conditions][9].
END USER LICENSE AGREEMENT WITH UNIVERSITY OF MARYLAND
IMPORTANT – THIS IS A LEGAL AGREEMENT BETWEEN YOU ("You") AND THE UNIVERSITY OF MARYLAND, a public agency and instrumentality of the State of Maryland, by and through the National Consortium for the Study of Terrorism and Responses to Terrorism (“START,” “US,” “WE” or “University”). PLEASE READ THIS END USER LICENSE AGREEMENT (“EULA”) BEFORE ACCESSING THE Global Terrorism Database (“GTD”). THE TERMS OF THIS EULA GOVERN YOUR ACCESS TO AND USE OF THE GTD WEBSITE, THE DATA, THE CODEBOOK, AND ANY AUXILIARY MATERIALS. BY ACCESSING THE GTD, YOU SIGNIFY THAT YOU HAVE READ, UNDERSTAND, ACCEPT, AND AGREE TO ABIDE BY THESE TERMS AND CONDITIONS. IF YOU DO NOT ACCEPT THE TERMS OF THIS EULA, DO NOT ACCESS THE GTD.
TERMS AND CONDITIONS
GTD means Global Terrorism Database data and the online user interface (www.start.umd.edu/gtd) produced and maintained by the National Consortium for the Study of Terrorism and Responses to Terrorism (START). This includes the data and codebook, any auxiliary materials present, and the user interface by which the data are presented.
LICENSE GRANT. University hereby grants You a revocable, non-exclusive, non-transferable right and license to access the GTD and use the data, the codebook, and any auxiliary materials solely for non-commercial research and analysis.
RESTRICTIONS. You agree to NOT: a. publicly post or display the data, the codebook, or any auxiliary materials without express written permission by University of Maryland (this excludes publication of analysis or visualization of the data for non-commercial purposes); b. sell, license, sublicense, or otherwise distribute the data, the codebook, or any auxiliary materials to third parties for cash or other considerations; c. modify, hide, delete or interfere with any notices that are included on the GTD or the codebook, or any auxiliary materials; d. use the GTD to draw conclusions about the official legal status or criminal record of an individual, or the status of a criminal or civil investigation; e. interfere with or disrupt the GTD website or servers and networks connected to the GTD website; or f. use robots, spiders, crawlers, automated devices and similar technologies to screen-scrape the site or to engage in data aggregation or indexing of the data, the codebook, or any auxiliary materials other than in accordance with the site’s robots.txt file.
YOUR RESPONSIBILITIES: a. All information sourced from the GTD should be acknowledged and cited as follows: "National Consortium for the Study of Terrorism and Responses to Terrorism (START), University of Maryland. (2018). The Global Terrorism Database (GTD) [Data file]. Retrieved from https://www.start.umd.edu/gtd" b. You agree to acknowledge any copyrightable materials with a copyright notice “Copyright University of Maryland 2018.” c. Any modifications You make to the GTD for published analysis must be clearly documented and must not misrepresent analytical decisions made by START. d. You agree to seek out an additional agreement in order to use the GTD, the data, the codebook or auxiliary materials for commercial purposes, or to create commercial product or services based on the GTD, the data, the codebook or auxiliary materials.
INTELLECTUAL PROPERTY. The University owns all rights, title, and interest in the GTD, the data and codebook, and all auxiliary materials. This EULA does not grant You any rights, title, or interests in the GTD or the data, the codebook, user interface, or any auxiliary materials other than those expressly granted to you under this EULA.
DISCLAIMER AND LIMITATION ON LIABILITY. a. THE GTD, THE CODEBOOK, USER INTERFACE, OR ANY AUXILIARY MATERIALS ARE MADE AVAILABLE ON AN "AS IS" BASIS. UNIVERSITY DISCLAIMS ANY AND ALL REPRESENTATIONS AND WARRANTIES – WHETHER EXPRESS OR IMPLIED, ORAL OR WRITTEN, IN FACT OR ARISING BY OPERATION OF LAW – WITH RESPECT TO THE GTD, THE CODEBOOK, AND ANY AUXILIARY MATERIALS INCLUDING, BUT NOT LIMITED TO, WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, FITNESS FOR A PARTICULAR PURPOSE, AND NONINFRINGEMENT OF THE INTELLECTUAL PROPERTY OR PROPRIETARY RIGHTS OF ANY THIRD PARTY. UNIVERSITY MAKES NO REPRESENTATION OR WARRANTY THAT THE GTD, THE CODEBOOK, ANY AUXILIARY MATERIALS, OR USER INTERFACE WILL OPERATE ERROR FREE OR IN AN UNINTERRUPTED FASHION. b. In no event will the University be liable to You for any incidental, special, punitive, exemplary or consequential damages of any kind, including lost profits or business interruption, even if advised of the possibility of such claims or demands, whether in contract, tort, or otherwise, arising in connection with Your access to and use of the GTD, the codebook, user interface, or any auxiliary materials or other dealings. This limitation upon damages and claims is intended to apply without regard to whether other provisions of this EULA have been breached or proven ineffective. In no event will University’s total liability for the breach or nonperformance of this EULA exceed the fees paid to University within the current billing cycle. c. Every reasonable effort has been made to check sources and verify facts in the GTD; however, START cannot guarantee that accounts reported in the open literature are complete and accurate. START shall not be held liable for any loss or damage caused by errors or omissions or resulting from any use, misuse, or alteration of GTD data by the USER. The USER should not infer any additional actions or results beyond what is presented in a GTD entry and specifically, the USER should not infer an individual referenced in the GTD was charged, tried, or convicted of terrorism or any other criminal offense. If new documentation about an event becomes available, START may modify the data as necessary and appropriate. d. University is under no obligation to update the GTD, the codebook, user interface, or any auxiliary materials.
INDEMNITY. You hereby agree to defend, indemnify, and hold harmless the University and its employees, agents, directors, and officers from and against any and all claims, proceedings, damages, injuries, liabilities, losses, costs, and expenses (including reasonable attorneys’ fees and litigation expenses) relating to or arising out of Your use of the GTD, the codebook, or any auxiliary materials or Your breach of any term in this EULA.
TERM AND TERMINATION a. This EULA and your right to access the GTD website and use the data, the codebook, and any auxiliary materials will take effect when you access the GTD. b. University reserves the right, at any time and without prior notice, to modify, discontinue or suspend, temporarily or permanently, Your access to the GTD website (or any part thereof) without liability to You.
MISCELLANEOUS a. The University may modify this EULA at any time. Check the GTD website for modifications. b. No term of this Agreement can be waived except by the written consent of the party waiving compliance. c. If any provision of this EULA is determined by a court of competent jurisdiction to be void, invalid, or otherwise unenforceable, such determination shall not affect the remaining provisions of this Agreement. d. This Agreement does not create a joint venture, partnership, employment, or agency relationship between the Parties. e. There are no third
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains the predicted topic(s) for (almost) every Wikipedia article across languages. It is missing articles without any valid outlinks -- i.e. links to other Wikipedia articles. This current version is based on the December 2020 Wikipedia dumps (data as of 1 January 2021), but earlier/future versions may be for other snapshots as indicated by the filename.
The data is bzip-compressed and each row is tab-delimited and contains the following metadata, followed by the predicted probability (rounded to three decimal places to reduce filesize) that each of these topics applies to the article: https://www.mediawiki.org/wiki/ORES/Articletopic#Taxonomy
* wiki_db: which Wikipedia language edition the article belongs to -- e.g., enwiki == English Wikipedia
* qid: if the article has a Wikidata item, what ID is it -- e.g., the article for Douglas Adams is Q42 (https://www.wikidata.org/wiki/Q42)
* pid: the page ID of the article -- e.g., the article for Douglas Adams in English Wikipedia is 8091 (en.wikipedia.org/wiki/?curid=8091)
* num_outlinks: the number of Wikipedia links in the article that were used by the model to make its prediction -- this is after removing links to non-article namespaces (e.g., categories, templates), articles without Wikidata IDs (very few), and interwiki links -- i.e. only retaining links to namespace-0 articles in the same wiki that have associated Wikidata IDs. This is mainly provided to give a sense of how much data the prediction is based upon.
For more information, see the model description page on Meta: https://meta.wikimedia.org/wiki/Research:Language-Agnostic_Topic_Classification/Outlink_model_performance
Additionally, a 1% sample file is provided for easier exploration. The sampling was done by Wikidata ID, so if, e.g., Q800612 (Canfranc International railway station) was sampled in, then all 16 language versions of the article would be included. It includes 201,196 Wikidata IDs which led to 340,290 articles.
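A minimal Python sketch for streaming the bzip-compressed, tab-delimited file described above. The file name is an assumption, as is the presence of a header row; the first four columns are the metadata fields listed, followed by one probability column per topic.

import bz2
import csv

with bz2.open("article_topics.tsv.bz2", mode="rt", encoding="utf-8") as f:   # file name assumed
    reader = csv.reader(f, delimiter="\t")
    header = next(reader)                  # assumed: wiki_db, qid, pid, num_outlinks, then one column per topic
    for row in reader:
        wiki_db, qid, pid, num_outlinks = row[:4]
        topic_probs = dict(zip(header[4:], map(float, row[4:])))
        # e.g. keep topics predicted with probability >= 0.5
        predicted = [t for t, p in topic_probs.items() if p >= 0.5]
        break   # demo: only inspect the first row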
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Type of data: 720 x 720 px images of colored flowers. Data format: JPEG
Dataset contents: Original images of different varieties of colored flowers in Bangladesh from single flower and bulk-flower perspectives.
Number of classes: Thirteen colored flower varieties - (1) Chandramallika, (2) Cosmos Phul, (3) Gada, (4) Golap, (5) Jaba, (6) Kagoj Phul, (7) Noyontara, (8) Radhachura, (9) Rangan, (10) Salvia, (11) Sandhyamani, (12) Surjomukhi, and (13) Zinnia.
Total number of images in the dataset: 7,993.
Distribution of instances: (1) Chandramallika = 620 images in total. Single = 306, Bulk = 314. (2) Cosmos Phul = 620 images in total. Single = 307, Bulk = 313. (3) Gada = 617 images in total. Single = 304, Bulk = 313. (4) Golap = 605 images in total. Single = 302, Bulk = 303. (5) Jaba = 604 images in total. Single = 300, Bulk = 304. (6) Kagoj Phul = 612 images in total. Single = 301, Bulk = 311. (7) Noyontara = 609 images in total. Single = 303, Bulk = 306. (8) Radhachura = 617 images in total. Single = 309, Bulk = 308. (9) Rangan = 606 images in total. Single = 305, Bulk = 301. (10) Salvia = 634 images in total. Single = 313, Bulk = 321. (11) Sandhyamani = 615 images in total. Single = 305, Bulk = 310. (12) Surjomukhi = 621 images in total. Single = 310, Bulk = 311. (13) Zinnia = 613 images in total. Single = 307, Bulk = 306.
Dataset size: The total size of the dataset is 2.79 GB and the compressed ZIP file size is 2.71 GB.
Data acquisition process: Images of colored flowers are captured using a high-definition smartphone camera from different angles and two perspectives: single-flower and bulk-flower.
Data source location: Plant nurseries, local gardens, and flower shops located in different areas of Dhaka and Gazipur districts of Bangladesh.
Where applicable: Training and evaluating machine learning and deep learning models to distinguish colored flower varieties in Bangladesh, to support automated identification and classification systems for various colored flowers. Such systems can be utilized in computer vision, botanical research, floral biodiversity monitoring, agriculture and horticulture, environmental conservation, AI-based flower recognition, educational resources, the food industry, pollination and ecology research, and aesthetic and design applications.
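If the extracted images are arranged in one folder per class (an assumption about the archive layout, not stated above), a standard PyTorch ImageFolder pipeline could be used to train a classifier over the thirteen varieties; a hedged sketch, with the root path and image size handling as illustrative choices:

import torch
from torchvision import datasets, transforms

# 720 x 720 px JPEG images, resized for a typical CNN input (layout root/<class name>/<image>.jpg assumed).
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder("bd_colored_flowers/", transform=transform)  # path assumed
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)
print(dataset.classes)   # expected: the 13 variety names, e.g. Chandramallika, Cosmos Phul, ...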
MMLU (Massive Multitask Language Understanding) is a new benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This makes the benchmark more challenging and more similar to how we evaluate humans. The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects makes the benchmark ideal for identifying a model’s blind spots.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
These three artificial datasets are for mining erasable itemsets. The definition of an erasable itemset is given in the following reference papers. Note that the three datasets all include 200 different items, but we did not give a profit value for each item. Users can generate profit values as they require, with a normal or random distribution.
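Since profit values are left to the user, here is a small sketch generating them for the 200 items with NumPy, using either a normal or a uniform (random) distribution as suggested above; the specific means, ranges, and item-id mapping are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(seed=42)
n_items = 200

# Option 1: normally distributed profits (mean and standard deviation are arbitrary).
profit_normal = rng.normal(loc=100.0, scale=20.0, size=n_items).clip(min=1.0)

# Option 2: uniformly random profits in an arbitrary range.
profit_uniform = rng.uniform(low=1.0, high=200.0, size=n_items)

# Map item ids 1..200 to a profit value.
profit_table = {item_id: float(p) for item_id, p in enumerate(profit_normal, start=1)}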
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
How to cite
Please cite the original dictionary and the dataset and code in this repository as follows:
Kähler, Hans. (1987). Enggano-Deutsches Wörterbuch (Veröffentlichungen des Seminars für Indonesische und Südseesprachen der Universität Hamburg 14). Berlin; Hamburg: Dietrich Reimer Verlag. https://search.worldcat.org/title/18191699
Rajeg, Gede Primahadi Wijaya; Pramartha, Cokorda Rai Adi; Sarasvananda, Ida Bagus Gede; Widiatmika, Putu Wahyu; Segara, Ida Bagus Made Ari; Pita, Yul Fulgensia Rusman; et al. (2024). Retro-digitised Enggano-German dictionary derived from Kähler's (1987) "Enggano-Deutsches Wörterbuch". University of Oxford. Dataset. https://doi.org/10.25446/oxford.28057742
Overview
This is a hand-digitised Enggano-German dictionary derived from Hans Kähler's (1987) "Enggano-Deutsches Wörterbuch". We crowdsourced the digitisation process by transcribing the dictionary's content into an online database system; the system was set up by Cokorda Pramartha and I B. G. Sarasvananda in collaboration with the first author. The database is exported into a .csv file to be further processed computationally and manually, for example fixing typos and incorrect mapping of entry elements, providing the English and Indonesian translations, and standardising the orthography.
A pre-release can be accessed here. The minor update in the current version adds a description of the column names for the tabular data of the digitised dictionary. The dictionary is stored as a table in three file types: .rds (the R data format), .csv, and .tsv.
Aspects to be worked out for the future development of the dataset can be accessed here.
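A minimal sketch for loading the tabular dictionary in Python from the .tsv export described above (the file name is an assumption; R users can read the .rds file with readRDS).

import pandas as pd

# Load the digitised dictionary table (file name assumed from the repository description).
df = pd.read_csv("enggano_german_dictionary.tsv", sep="\t")
print(df.columns.tolist())   # column names are documented in the current release
print(df.head())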
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains the definition of the structure used in the preparation of the publication:
Jedlińska A., Pisarski D., Mikułowski G., Błachowski B., Jankowski Ł., Semi-Active Control of a Shear Building based on Reinforcement Learning: Robustness to measurement noise and model error, FedCSIS 2023, 18th Conference on Computer Science and Intelligence Systems, 2023-09-17/09-20, Warsaw (PL), pp. 1001-1004, 2023. https://doi.org/10.15439/2023F8946
This research has been supported by the National Science Centre, Poland, under grant agreement 2020/39/B/ST8/02615.
The structure definition files are in txt/CSV format and were exported using the Wolfram Mathematica environment. The model error and measurement noise files are in JSON format and were generated using the Python programming language.