41 datasets found
  1. CompanyKG Dataset V2.0: A Large-Scale Heterogeneous Graph for Company Similarity Quantification

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, bin +1
    Updated Jun 4, 2024
    Cite
    Lele Cao; Vilhelm von Ehrenheim; Mark Granroth-Wilding; Richard Anselmo Stahl; Drew McCornack; Armin Catovic; Dhiana Deva Cavacanti Rocha (2024). CompanyKG Dataset V2.0: A Large-Scale Heterogeneous Graph for Company Similarity Quantification [Dataset]. http://doi.org/10.5281/zenodo.11391315
    Explore at:
    Available download formats: application/gzip, bin, txt
    Dataset updated
    Jun 4, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lele Cao; Vilhelm von Ehrenheim; Mark Granroth-Wilding; Richard Anselmo Stahl; Drew McCornack; Armin Catovic; Dhiana Deva Cavacanti Rocha
    Time period covered
    May 29, 2024
    Description

    CompanyKG is a heterogeneous graph consisting of 1,169,931 nodes and 50,815,503 undirected edges, with each node representing a real-world company and each edge signifying a relationship between the connected pair of companies.

    Edges: We model 15 different inter-company relations as undirected edges, each corresponding to a unique edge type. These edge types capture various forms of similarity between connected company pairs. With each edge of a given type we associate a real-valued weight that approximates the similarity level of that type. It is important to note that the constructed edges are not an exhaustive list of all possible edges, owing to incomplete information. Consequently, the distribution of edges across individual relation/edge types is sparse and occasionally skewed, which poses additional challenges for downstream learning tasks. Please refer to our paper for detailed definitions of the edge types and weight calculations.

    Nodes: The graph includes all companies connected by edges defined previously. Each node represents a company and is associated with a descriptive text, such as "Klarna is a fintech company that provides support for direct and post-purchase payments ...". To comply with privacy and confidentiality requirements, we encoded the text into numerical embeddings using four different pre-trained text embedding models: mSBERT (multilingual Sentence BERT), ADA2, SimCSE (fine-tuned on the raw company descriptions) and PAUSE.

    Evaluation Tasks. The primary goal of CompanyKG is to develop algorithms and models for quantifying the similarity between pairs of companies. In order to evaluate the effectiveness of these methods, we have carefully curated three evaluation tasks:

    • Similarity Prediction (SP). To assess the accuracy of pairwise company similarity, we constructed the SP evaluation set comprising 3,219 pairs of companies, each labeled either positive (similar, denoted by "1") or negative (dissimilar, denoted by "0"). Of these pairs, 1,522 are positive and 1,697 are negative. (A minimal scoring sketch follows this list.)
    • Competitor Retrieval (CR). Each sample contains one target company and one of its direct competitors. The set covers 76 distinct target companies, each with 5.3 annotated competitors on average. For a given target company A with N direct competitors in the CR evaluation set, we expect a competent method to retrieve all N competitors when searching for companies similar to A.
    • Similarity Ranking (SR) is designed to assess the ability of any method to rank two candidate companies (numbered 0 and 1) based on their similarity to a query company. Paid human annotators, with backgrounds in engineering, science, and investment, were tasked with determining which candidate company is more similar to the query company. This resulted in an evaluation set comprising 1,856 rigorously labeled ranking questions, of which 20% (368 samples) were retained as a validation set for model development.
    • Edge Prediction (EP) evaluates a model's ability to predict future or missing relationships between companies, providing forward-looking insights for investment professionals. The EP dataset, derived (and sampled) from new edges collected between April 6, 2023, and May 25, 2024, includes 40,000 samples, with edges not present in the pre-existing CompanyKG (a snapshot up until April 5, 2023).
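
    To make the Similarity Prediction setup concrete, the sketch below scores company pairs with cosine similarity over node embeddings and checks agreement with the 0/1 labels. The array names, the random placeholder embeddings, and the fixed 0.5 threshold are illustrative assumptions, not part of the dataset's actual API:

    ```
    import numpy as np

    def cosine_similarity(a, b):
        # Row-wise cosine similarity between two matrices of embeddings.
        a = a / np.linalg.norm(a, axis=1, keepdims=True)
        b = b / np.linalg.norm(b, axis=1, keepdims=True)
        return np.sum(a * b, axis=1)

    # Placeholder embeddings for the two companies in each SP pair; in
    # practice these would come from one of the four provided embeddings.
    rng = np.random.default_rng(0)
    emb_a = rng.normal(size=(3219, 64))
    emb_b = rng.normal(size=(3219, 64))
    labels = rng.integers(0, 2, size=3219)  # placeholder 0/1 labels

    scores = cosine_similarity(emb_a, emb_b)
    pred = (scores > 0.5).astype(int)  # fixed threshold; AUC avoids this choice
    print(f"accuracy: {(pred == labels).mean():.3f}")
    ```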

    Background and Motivation

    In the investment industry, it is often essential to identify similar companies for a variety of purposes, such as market/competitor mapping and Mergers & Acquisitions (M&A). Identifying comparable companies is a critical task, as it can inform investment decisions, help identify potential synergies, and reveal areas for growth and improvement. The accurate quantification of inter-company similarity, also referred to as company similarity quantification, is the cornerstone to successfully executing such tasks. However, company similarity quantification is often a challenging and time-consuming process, given the vast amount of data available on each company, and the complex and diversified relationships among them.

    While there is no universally agreed definition of company similarity, researchers and practitioners in the private equity (PE) industry have adopted various criteria to measure it, typically reflecting the companies' operations and relationships. These criteria can span one or more dimensions such as industry sectors, employee profiles, keywords/tags, customer reviews, financial performance, co-appearance in news, and so on. Investment professionals usually begin with a small number of companies of interest (a.k.a. seed companies) and require an algorithmic approach to expand their search to a larger list of companies for potential investment.

    In recent years, transformer-based Language Models (LMs) have become the preferred method for encoding textual company descriptions into vector-space embeddings. Companies similar to the seed companies can then be retrieved in the embedding space using distance metrics such as cosine similarity. The rapid advancements in Large LMs (LLMs), such as GPT-3/4 and LLaMA, have significantly enhanced the performance of general-purpose conversational models. Such models, ChatGPT among them, can be employed to answer questions related to similar-company discovery and quantification in a Q&A format.

    However, graphs remain the most natural choice for representing and learning diverse company relations, owing to their ability to model complex relationships between a large number of entities. By representing companies as nodes and their relationships as edges, we can form a Knowledge Graph (KG). Such a KG allows us to efficiently capture and analyze the network structure of the business landscape. Moreover, KG-based approaches let us leverage powerful tools from network science, graph theory, and graph-based machine learning, such as Graph Neural Networks (GNNs), to extract insights and patterns that facilitate similar-company analysis. While various company datasets (mostly commercial/proprietary and non-relational) and graph datasets (mostly for single link/node/graph-level predictions) are available, there is a scarcity of datasets and benchmarks that combine both into a large-scale KG dataset expressing rich pairwise company relations.

    Source Code and Tutorial:
    https://github.com/llcresearch/CompanyKG2

    Paper: to be published

  2. Film Circulation dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 12, 2024
    Cite
    Samoilova, Evgenia (Zhenya) (2024). Film Circulation dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7887671
    Explore at:
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Loist, Skadi
    Samoilova, Evgenia (Zhenya)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

    A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies (an open access journal aiming to enhance data transparency and reusability) and will be available from https://necsus-ejms.org/ and https://mediarep.org

    Please cite this when using the dataset.

    Detailed description of the dataset:

    1 Film Dataset: Festival Programs

    The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

    The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

    The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For ease of analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival at which the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization by length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information on whether festival run information is available through the IMDb data.
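
    As a quick orientation to the long/wide relationship described above, here is a minimal pandas sketch; the ID column name film_id is a hypothetical placeholder, so consult the codebook for the actual variable names:

    ```
    import pandas as pd

    # Load both festival-program tables (file names as published).
    long_df = pd.read_csv("1_film-dataset_festival-program_long.csv")
    wide_df = pd.read_csv("1_film-dataset_festival-program_wide.csv")

    id_col = "film_id"  # hypothetical name for the unique film ID column

    # In the long table one film can occupy several rows (one per sample
    # festival); in the wide table each film appears exactly once.
    appearances = long_df.groupby(id_col).size()
    print("films sampled at more than one festival:", (appearances > 1).sum())
    assert wide_df[id_col].is_unique
    ```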

    2 Survey Dataset

    The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

    The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

    The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.

    3 IMDb & Scripts

    The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

    The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

    The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

    The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

    The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

    The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to a crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

    The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

    The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

    The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, TV, DVD/Blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

    The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

    The dataset includes 8 text files containing the scripts for web scraping. They were written in R 3.6.3 for Windows.

    The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.

    The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then, if no matches are found, falling back to an alternative title and a basic search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of suggested records on the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach using two methods, “cosine” and “osa”: cosine similarity matches titles with a high degree of similarity, while the OSA algorithm matches titles that may have typos or minor variations.
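
    The matching itself is implemented in the R scripts; purely to illustrate the idea, here is a small Python sketch that scores title pairs with a normalized similarity (difflib's ratio as a stand-in for the “cosine” and “osa” string distances used by the scripts) and keeps candidates above a cutoff:

    ```
    from difflib import SequenceMatcher

    def title_similarity(a: str, b: str) -> float:
        # Normalized similarity in [0, 1]; 1.0 is an exact match.
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    core_title = "The Example Film"  # hypothetical film title
    candidates = ["The Example Film", "An Example Film", "Unrelated Title"]

    # Keep IMDb suggestions whose title similarity clears a cutoff,
    # mirroring the scripts' combination of exact and fuzzy matching.
    CUTOFF = 0.8
    matches = [(c, title_similarity(core_title, c)) for c in candidates]
    print([(c, round(s, 2)) for c, s in matches if s >= CUTOFF])
    ```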

    The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of five categories: a) 100% match (perfect match on title, year, and director); b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible duplicates in the dataset and flags them for a manual check.

    The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

    The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. It does so for the first 100 films only, as a check that everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.

    The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

    The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

    The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.

    4 Festival Library Dataset

    The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

    The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables (such as location, festival name, and festival categories), units of measurement, data sources, coding, and information on missing data.

    The csv file “4_festival-library_dataset_imdb-and-survey” contains data on all unique festivals collected from both IMDb and survey sources. This dataset appears in wide format; all information for each festival is listed in one row.

  3. Consumer Airfare Report: Table 1a - All U.S. Airport Pair Markets

    • catalog.data.gov
    • data.virginia.gov
    • +2more
    Updated Jan 27, 2025
    Cite
    Office of the Secretary of Transportation (2025). Consumer Airfare Report: Table 1a - All U.S. Airport Pair Markets [Dataset]. https://catalog.data.gov/dataset/consumer-airfare-report-table-1a-all-u-s-airport-pair-markets
    Explore at:
    Dataset updated
    Jan 27, 2025
    Dataset provided by
    Office of the Secretary of Transportation
    Area covered
    United States
    Description

    Available only on the web, this table provides information for airport pair markets rather than city pair markets. It lists only airport markets where the origin or destination airport is one that has other commercial airports in the same city; Midway (MDW) and O'Hare (ORD) are examples. All records are aggregated as directionless markets: the combination of Airport_1 and Airport_2 defines the airport pair market, and all traffic traveling in both directions is added together. https://www.transportation.gov/policy/aviation-policy/competition-data-analysis/research-reports
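
    A minimal pandas sketch of the directionless aggregation described above: the two airport codes are sorted within each record so that both directions collapse into one market key before summing. The passengers column is a hypothetical stand-in for the actual traffic measure:

    ```
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "Airport_1":  ["MDW", "ORD"],
        "Airport_2":  ["ORD", "MDW"],
        "passengers": [100, 150],   # hypothetical traffic measure
    })

    # Sort the two codes within each row so both directions share one key.
    codes = np.sort(df[["Airport_1", "Airport_2"]].to_numpy(), axis=1)
    df[["Airport_1", "Airport_2"]] = codes

    market = df.groupby(["Airport_1", "Airport_2"], as_index=False)["passengers"].sum()
    print(market)  # a single MDW/ORD row totalling 250
    ```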

  4. University SET data, with faculty and courses characteristics

    • openicpsr.org
    Updated Sep 12, 2021
    Cite
    Under blind review in refereed journal (2021). University SET data, with faculty and courses characteristics [Dataset]. http://doi.org/10.3886/E149801V1
    Explore at:
    Dataset updated
    Sep 12, 2021
    Authors
    Under blind review in refereed journal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This paper explores a unique dataset of all the SET ratings provided by students of one university in Poland at the end of the winter semester of the 2020/2021 academic year. The SET questionnaire used by this university is presented in Appendix 1. The dataset is unique for several reasons. It covers all SET surveys filled in by students in all fields and levels of study offered by the university. In the period analysed, the university operated entirely online amid the Covid-19 pandemic. While the expected learning outcomes formally were not changed, the online mode of study could have affected the grading policy and could have implications for some of the studied SET biases. This Covid-19 effect is captured by econometric models and discussed in the paper.

    The average SET scores were matched with the characteristics of the teacher (degree, seniority, gender, and SET scores in the past six semesters); the course characteristics (time of day, day of the week, course type, course breadth, class duration, and class size); the attributes of the SET survey responses (the percentage of students providing SET feedback); and the grades of the course (mean, standard deviation, and percentage failed). Data on course grades are also available for the previous six semesters. This rich dataset allows many of the biases reported in the literature to be tested for and new hypotheses to be formulated, as presented in the introduction section.

    The unit of observation, or single row in the data set, is identified by three parameters: teacher unique id (j), course unique id (k) and the question number in the SET questionnaire (n ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9}). It means that for each pair (j,k) we have nine rows, one for each SET survey question, or sometimes fewer when students did not answer one of the SET questions at all. For example, the dependent variable SET_score_avg(j,k,n) for the triplet (j = John Smith, k = Calculus, n = 2) is calculated as the average of all Likert-scale answers to question no. 2 in the SET survey distributed to all students that took the Calculus course taught by John Smith. The data set has 8,015 such observations or rows.

    The full list of variables or columns in the data set included in the analysis is presented in the attached file section. Their description refers to the triplet (teacher id = j, course id = k, question number = n). When the last value of the triplet (n) is dropped, the variable takes the same values for all n ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9}.

    Two attachments:
    • Word file with variable descriptions
    • RData file with the data set (for the R language)

    Appendix 1. The SET questionnaire used for this paper.

    Evaluation survey of the teaching staff of [university name]. Please complete the following evaluation form, which aims to assess the lecturer's performance. Only one answer should be indicated for each question. The answers are coded in the following way: 5 - I strongly agree; 4 - I agree; 3 - Neutral; 2 - I don't agree; 1 - I strongly don't agree.

    Questions (each answered on the 1-5 scale above):
    1. I learnt a lot during the course.
    2. I think that the knowledge acquired during the course is very useful.
    3. The professor used activities to make the class more engaging.
    4. If it was possible, I would enroll for the course conducted by this lecturer again.
    5. The classes started on time.
    6. The lecturer always used time efficiently.
    7. The lecturer delivered the class content in an understandable and efficient way.
    8. The lecturer was available when we had doubts.
    9. The lecturer treated all students equally regardless of their race, background and ethnicity.
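
    To illustrate the unit of observation, the sketch below derives SET_score_avg(j,k,n) from hypothetical raw Likert answers; the raw-answer table and its column names are illustrative only, since the published file already contains the averaged scores:

    ```
    import pandas as pd

    # Hypothetical raw Likert answers (1-5) from individual students.
    raw = pd.DataFrame({
        "teacher_id": ["John Smith"] * 4,
        "course_id":  ["Calculus"] * 4,
        "question_n": [2, 2, 2, 5],
        "answer":     [5, 4, 4, 3],
    })

    # One row per (teacher, course, question): the unit of observation.
    set_score_avg = (raw.groupby(["teacher_id", "course_id", "question_n"],
                                 as_index=False)["answer"]
                        .mean()
                        .rename(columns={"answer": "SET_score_avg"}))
    print(set_score_avg)
    ```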

  5. LScDC Word-Category RIG Matrix

    • figshare.le.ac.uk
    pdf
    Updated Apr 28, 2020
    Cite
    LScDC Word-Category RIG Matrix [Dataset]. https://figshare.le.ac.uk/articles/dataset/LScDC_Word-Category_RIG_Matrix/12133431
    Explore at:
    Available download formats: pdf
    Dataset updated
    Apr 28, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LScDC Word-Category RIG Matrix
    April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com)
    Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes

    Getting Started

    This file describes the Word-Category RIG Matrix for the Leicester Scientific Corpus (LSC) [1] and the procedure to build the matrix, and introduces the Leicester Scientific Thesaurus (LScT) with its construction process. The Word-Category RIG Matrix is a 103,998 by 252 matrix, where rows correspond to words of the Leicester Scientific Dictionary-Core (LScDC) [2] and columns correspond to 252 Web of Science (WoS) categories [3, 4, 5]. Each entry in the matrix corresponds to a pair (category, word); its value shows the Relative Information Gain (RIG) on the belonging of a text from the LSC to the category from observing the word in this text. The CSV file of the Word-Category RIG Matrix in the published archive has two additional columns: the sum of RIGs in categories and the maximum of RIGs over categories (the last two columns of the matrix). So the file ‘Word-Category RIG Matrix.csv’ contains a total of 254 columns.

    This matrix is created for future research on quantifying meaning in scientific texts, under the assumption that words have scientifically specific meanings in subject categories and that this meaning can be estimated by information gains from word to categories.

    LScT (Leicester Scientific Thesaurus) is a scientific thesaurus of English. The thesaurus includes a list of 5,000 words from the LScDC. We order the words of the LScDC by the sum of their RIGs in categories; that is, words are arranged by their informativeness in the scientific corpus LSC, so the meaningfulness of a word is evaluated by its average informativeness in the categories. We decided to include the 5,000 most informative words in the scientific thesaurus.

    Words as a Vector of Frequencies in WoS Categories

    Each word of the LScDC is represented as a vector of frequencies in WoS categories. Given the collection of LSC texts, each entry of the vector is the number of texts containing the word in the corresponding category. It is noteworthy that texts in a corpus do not necessarily belong to a single category, as they may correspond to multidisciplinary studies, especially in a corpus of scientific texts; in other words, categories may not be exclusive. There are 252 WoS categories, and a text can be assigned to at least 1 and at most 6 categories in the LSC. Using a binary calculation of frequencies, we record the presence of a word in a category. We create a vector of frequencies for each word, where the dimensions are the categories in the corpus. The collection of vectors, over all words and categories in the entire corpus, can be shown as a table in which each entry corresponds to a pair (word, category). This table is built for the LScDC with 252 WoS categories and is presented in the published archive with this file. The value of each entry shows how many times a word of the LScDC appears in a WoS category; the occurrence of a word in a category is determined by counting the number of LSC texts containing the word in that category.

    Words as a Vector of Relative Information Gains Extracted for Categories

    In this section, we introduce our approach to representing a word as a vector of relative information gains for categories, under the assumption that the meaning of a word can be quantified by the information it provides about categories. For each category, a function is defined on texts that takes the value 1 if the text belongs to the category, and 0 otherwise. For each word, a function is defined on texts that takes the value 1 if the word belongs to the text, and 0 otherwise. Consider the LSC as a probabilistic sample space (the space of equally probable elementary outcomes). For these Boolean random variables, the joint probability distribution, the entropy and the information gains are defined.

    The information gain about the category from the word is the amount of information on the belonging of a text from the LSC to the category gained from observing the word in the text [6]. We used the Relative Information Gain (RIG), a normalised measure of the information gain, which makes information gains comparable across categories. The calculations of entropy, information gains and relative information gains can be found in the README file in the published archive.

    Given a word, we create a vector in which each component corresponds to a category, so each word is represented as a vector of relative information gains; the dimension of the vector is the number of categories. The set of vectors forms the Word-Category RIG Matrix, in which each column corresponds to a category, each row corresponds to a word, and each component is the relative information gain from the word to the category. In the matrix, a row vector represents the corresponding word as a vector of RIGs in categories, and a column vector represents the RIGs of all words in an individual category. If we choose an arbitrary category, words can be ordered by their RIGs from the most informative to the least informative for that category. Words can also be ordered by two global criteria: the sum and the maximum of RIGs over categories; the top n words in such a list can be considered the most informative words in scientific texts. For a given word, the sum and maximum of RIGs are calculated from the Word-Category RIG Matrix.

    RIGs for each word of the LScDC in the 252 categories are calculated and the word vectors formed; we then form the Word-Category RIG Matrix for the LSC. For each word, the sum (S) and maximum (M) of RIGs in categories are calculated and appended as the last two columns of the matrix. The Word-Category RIG Matrix for the LScDC with 252 categories, together with these two columns, can be found in the database.

    Leicester Scientific Thesaurus (LScT)

    The Leicester Scientific Thesaurus (LScT) is a list of 5,000 words from the LScDC [2]. Words of the LScDC are sorted in descending order by the sum (S) of RIGs in categories, and the top 5,000 words are selected for the LScT. We consider these 5,000 words the most meaningful words in the scientific corpus: the meaningfulness of a word is evaluated by its average informativeness in the categories, and the resulting list is treated as a ‘thesaurus’ for science. The LScT with the sum values can be found as a CSV file in the published archive.

    The published archive contains the following files:
    1) Word_Category_RIG_Matrix.csv: A 103,998 by 254 matrix, where the columns are the 252 WoS categories plus the sum (S) and the maximum (M) of RIGs in categories (last two columns), and the rows are words of the LScDC. Each entry in the first 252 columns is the RIG from the word to the category. Words are ordered as in the LScDC.
    2) Word_Category_Frequency_Matrix.csv: A 103,998 by 252 matrix, where the columns are the 252 WoS categories and the rows are words of the LScDC. Each entry is the number of texts containing the word in the corresponding category. Words are ordered as in the LScDC.
    3) LScT.csv: List of words of the LScT with their sum (S) values.
    4) Text_No_in_Cat.csv: The number of texts in each category.
    5) Categories_in_Documents.csv: List of WoS categories for each document of the LSC.
    6) README.txt: Description of the Word-Category RIG Matrix, the Word-Category Frequency Matrix and the LScT, and the procedures for forming them.
    7) README.pdf: Same as 6), in PDF format.

    References
    [1] Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
    [2] Suzen, Neslihan (2019): LScDC (Leicester Scientific Dictionary-Core). figshare. Dataset. https://doi.org/10.25392/leicester.data.9896579.v3
    [3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
    [4] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
    [5] Suzen, N., Mirkes, E. M., & Gorban, A. N. (2019). LScDC - new large scientific dictionary. arXiv preprint arXiv:1912.06858.
    [6] Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379-423.
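
    As a worked illustration of the RIG definition above, the sketch below computes the relative information gain for one (category, word) pair from two Boolean indicators over the corpus. It follows the standard entropy and information-gain formulas described in the text, not the archive's exact code:

    ```
    import numpy as np

    def entropy(p):
        # Shannon entropy of a Bernoulli variable with P(1) = p.
        if p in (0.0, 1.0):
            return 0.0
        return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

    def relative_information_gain(in_category, contains_word):
        # Boolean indicators over all texts (equally probable outcomes).
        in_category = np.asarray(in_category, dtype=bool)
        contains_word = np.asarray(contains_word, dtype=bool)
        h_c = entropy(in_category.mean())
        # Conditional entropy H(C | W), splitting the corpus on word presence.
        h_c_given_w = 0.0
        for w in (True, False):
            mask = contains_word == w
            if mask.any():
                h_c_given_w += mask.mean() * entropy(in_category[mask].mean())
        ig = h_c - h_c_given_w
        return ig / h_c if h_c > 0 else 0.0

    # Toy corpus of 8 texts: the word is strongly associated with the category.
    in_cat = [1, 1, 1, 1, 0, 0, 0, 0]
    has_word = [1, 1, 1, 0, 0, 0, 0, 0]
    print(relative_information_gain(in_cat, has_word))  # ~0.55
    ```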

  6. Dataset of paper "GNN for Deep Full Event Interpretation and hierarchical reconstruction of heavy-hadron decays in proton-proton collisions"

    • zenodo.org
    bin
    Updated Apr 18, 2023
    Cite
    Julián García Pardiñas; Marta Calvi; Jonas Eschle; Andrea Mauri; Simone Meloni; Martina Mozzanica; Nicola Serra (2023). Dataset of paper "GNN for Deep Full Event Interpretation and hierarchical reconstruction of heavy-hadron decays in proton-proton collisions" [Dataset]. http://doi.org/10.5281/zenodo.7799170
    Explore at:
    Available download formats: bin
    Dataset updated
    Apr 18, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Julián García Pardiñas; Marta Calvi; Jonas Eschle; Andrea Mauri; Simone Meloni; Martina Mozzanica; Nicola Serra
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    DFEI dataset

    The full description can also be found in README.md.

    The dataset was used in the paper “GNN for Deep Full Event Interpretation and hierarchical reconstruction of heavy-hadron decays in proton-proton collisions”. The project describes a full event interpretation at the LHCb experiment, situated at the Large Hadron Collider at CERN, Geneva. An “event” consists of detector responses that are converted to tracks; each track represents a particle.

    The aim of the algorithm is to make sense of the tracks and bundle together tracks coming from the same origin, as well as interpreting their decay hierarchy.

    Generated events

    The events in this dataset are based on simulation generated with PYTHIA8 and EvtGen, in which the particle-collision conditions expected for the LHC Run 3 are replicated as shown in the table.

    LHCb period          | Num. vis. pp collisions | Num. tracks | Num. b hadrons | Num. c hadrons
    Runs 3-4 (Upgrade I) | ∼ 5                     | ∼ 150       | ≪ 1            | ∼ 1

    Additionally, an approximate emulation of the LHCb detection and reconstruction effects is applied, as described in the paper in the appendix “Simulation”. In the generated dataset, each event is required to contain at least one b-hadron, which is subsequently allowed to decay freely through any of the standard decay modes present in PYTHIA8. On average, 40% of those events contain more than one b-hadron decay, with a maximum b-hadron decay multiplicity of five. Only charged stable particles that have been produced inside the LHCb geometrical acceptance and in the Vertex Locator region (as defined in the paper) are included in the datasets.

    Datasets

    The datasets are divided into three categories:

    Training and testing

    The file Dataset_InclusiveHb_Training.root contains the training dataset (40,000 events) and the test dataset (10,000 events) of inclusive decays.

    Evaluation

    The inclusive dataset Dataset_InclusiveHb_Evaluation.root contains the evaluation events (50,000).

    Exclusive decays

    In addition to this inclusive dataset, several other smaller samples (of a few thousand events each) have also been generated, requiring that all the events in each sample contain a specific (exclusive) type of b-hadron decay. The specific modes have been chosen to be representative of the most common classes of decay topologies of physics interest for LHCb. These samples contain only events in which all the particles originating from each of the considered exclusive decays have been produced inside the LHCb geometrical acceptance and in the Vertex Locator region.

    The datasets contained are:

    • Dataset_Bd_DD.root
    • Dataset_Bd_Kpi.root
    • Dataset_Bd_Kstmumu.root
    • Dataset_Bs_Dspi.root
    • Dataset_Bs_Jpsiphi.root
    • Dataset_Bu_KKpi.root
    • Dataset_Lb_Lcpi.root

    More information on them can be found in the paper.

    Loading the data

    The dataset is saved in the binary ROOT format with a key-array mapping. It can be loaded using the uproot Python library to convert it to a pandas DataFrame or similar.

    An example snippet is given here:

    import uproot

    # The file stores two trees: "Particles" and "Relations".
    # treename = "Particles"
    treename = "Relations"

    with uproot.open('/path/to/file.root') as file:
        df = file[treename].arrays(
            # optionally restrict to a subset of branches, e.g.
            # ['EventNumber', 'FromSamePV_true'],
            library='pd')  # 'pd' returns a pandas DataFrame

    The returned file behaves like a mapping that contains two different data holders. They are accessible via the keys Relations and Particles, which contain, respectively, the relations between the particles and the particles themselves.

    Regarding the Relations, only edges connecting two different particles are contained in the dataset. The edges are treated as non-directional, so a single edge is stored for each pair of particles.
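
    Assuming a Relations DataFrame loaded as in the snippet above, a per-event undirected graph can be rebuilt directly from the two key columns; the tiny stand-in table below only illustrates the shape of the data:

    ```
    import networkx as nx
    import pandas as pd

    # Stand-in for the "Relations" DataFrame loaded with uproot above.
    df = pd.DataFrame({
        "EventNumber":       [1, 1, 1, 2],
        "FirstParticleKey":  [2, 3, 3, 5],
        "SecondParticleKey": [1, 1, 2, 4],
    })

    # One undirected graph per event; each pair appears once because edges
    # are stored non-directionally with FirstParticleKey > SecondParticleKey.
    graphs = {}
    for event, edges in df.groupby("EventNumber"):
        g = nx.Graph()
        g.add_edges_from(zip(edges["FirstParticleKey"], edges["SecondParticleKey"]))
        graphs[event] = g

    print(graphs[1].number_of_edges())  # 3
    ```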

    Variables

    The relevant features used in the GNN are described in the following. A cartesian right-handed coordinate system is used, with the z axis pointing along the beamline, the x axis being parallel to the horizontal and the y axis being vertically oriented. When specified in the name of the variables, the suffix “_true” refers to ground-truth information, and the suffix “_reco” refers to the output of the emulated LHCb reconstruction.

    • General:

      • EventNumber: unique number to identify the event that the entry belongs to.
    • Node variables:

      • ParticleKey: unique number to identify each particle in a given event.

      • Identity (ID): numerical code identifying the type of particle, following the Monte Carlo Particle Numbering Scheme.

      • FromPrimaryBeautyHadron: boolean variable indicating whether the particle was produced in a beauty hadron decay.

      • Transverse momentum (pT): component of the three-momentum transverse to the beamline, i.e. the x and y component combined.

      • Impact parameter with respect to the associated primary vertex (IP): distance of closest approach between the particle trajectory and its associated primary vertex (proton-proton collision point), defined as the one with the smallest IP for the given particle amongst all the primary vertices in the event.

      • Pseudorapidity (η): spatial coordinate describing the angle of a particle relative to the beam axis, computed as η = arctanh(pz/∥p⃗∥).

      • Charge (q): for the stable particles under consideration, the charge can take the value 1 or -1.

      • Ox, Oy, Oz: cartesian coordinates of the origin point of the particle.

      • px, py, pz: cartesian coordinates of the three-momentum.

      • PVx, PVy, PVz: cartesian coordinates of the position of the associated primary vertex.

    • Edge variables:

      • FirstParticleKey: ParticleKey of one of the two particles connected by the edge.

      • SecondParticleKey: ParticleKey of the other particle, verifying FirstParticleKey > SecondParticleKey.

      • FromSamePrimaryBeautyHadron: boolean variable indicating whether the two particles originate from the same beauty hadron decay.

      • Opening angle (θ): angle between the three-momentum directions of the two particles.

      • Momentum-transverse distance (d ⊥ P⃗): distance between the origin point of the two particles defined on a plane which is transverse to the combined three momentum of the two particles.

      • Distance along the beam axis (Δz): difference between the z-coordinate of the origin points of the two particles.

      • FromSamePV: boolean variable indicating whether the two particles share the same associated primary vertex.

      • Order of the “topological” Lowest Common Ancestor (TopoLCAOrder): variable that can take the values 0, 1, 2 or 3, as explained in the paper.

      • Identity of the “topological” Lowest Common Ancestor (TopoLCAID): numerical code identifying the particle type of the ancestor, following the Monte Carlo Particle Numbering Scheme.
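
    As a quick cross-check of the kinematic definitions above, pT and η can be recomputed from the px, py, pz components; the numeric values here are placeholders:

    ```
    import numpy as np

    # Placeholder momentum components; in practice these are the px, py, pz
    # node variables described above.
    px = np.array([0.3, 1.2])
    py = np.array([0.4, 0.5])
    pz = np.array([5.0, 20.0])

    pT = np.hypot(px, py)                   # transverse momentum
    p_mag = np.sqrt(px**2 + py**2 + pz**2)  # |p|
    eta = np.arctanh(pz / p_mag)            # pseudorapidity, arctanh(pz/|p|)
    print(pT, eta)
    ```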

  7. Opal Trips - Light Rail

    • opendata.transport.nsw.gov.au
    • developer.transport.nsw.gov.au
    Updated Jan 11, 2017
    Cite
    opendata.transport.nsw.gov.au (2017). Opal Trips - Light Rail [Dataset]. https://opendata.transport.nsw.gov.au/dataset/opal-trips-light-rail
    Explore at:
    Dataset updated
    Jan 11, 2017
    Dataset provided by
    Transport for NSW (http://www.transport.nsw.gov.au/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains official Light Rail utilisation figures. Opal tap-on/tap-off data (representing an individual entering and exiting a Light Rail station) is aggregated to a total monthly figure representing the estimated number of trips.

    Starting July 1, 2024, the methodology for calculating trip numbers for individual lines and operators will change to more accurately reflect the services our passengers use within the transport network. This new approach will apply to trains, metros, light rail, and ferries, and will soon be extended to buses. Aggregations between line, agency, and mode levels will no longer be valid, as a passenger may use multiple lines on a single trip. Trip numbers at the line, operator, or mode level should be used as reported, without further combination. The dataset includes reports based on both the new and old methodologies, with a transition to the new method taking place over the coming months. As a result of this change, caution should be exercised when analysing longer trends that use both datasets. More information on NRT ROAM can be accessed here.

    Caution:
    • School student travel using concessional Opal cards is included. However, it may be underrepresented due to inconsistent tap-on/tap-off behaviour by students at light rail stations.
    • Magnetic Stripe Ticketing (MST - paper tickets) data was also available in July 2016. MST patronage data for July is available here.
    • Opal data may be subject to minor revision for the two months following upload.
    • Data is static at a point in time and may not match other reports that are real time.
    • All non-Opal travel is excluded, for example transport concession entitlement cards, integrated ticketing for major events, and fare non-compliance.
    • An Opal trip is defined as a tap-on/tap-off pair (including where only a single tap-on or tap-off is recorded).
    • A significant portion of the Light Rail line was closed during January 2017 and January 2018, resulting in a lower number of trips in both months.

    Please note: the data includes Newcastle Light Rail.

  8. NASA Acronyms in Public Abstracts

    • data.nasa.gov
    application/rdfxml +5
    Updated Dec 9, 2021
    Cite
    NASA Office of Chief Information Officer (2021). NASA Acronyms in Public Abstracts [Dataset]. https://data.nasa.gov/Raw-Data/NASA-Acronyms-in-Public-Abstracts/byqb-7uyn
    Explore at:
    Available download formats: xml, csv, application/rdfxml, tsv, json, application/rssxml
    Dataset updated
    Dec 9, 2021
    Dataset provided by
    NASA (http://nasa.gov/)
    Authors
    NASA Office of Chief Information Officer
    License

    https://www.usa.gov/government-works

    Description

    NASA Acronyms in Public Abstracts

    Dataset Description

    This dataset was created as a data source for machine-learning models used to disambiguate acronyms with multiple definitions. The dataset's files cover 406,005 abstracts, from which 484 acronyms with multiple definitions and multiple examples of use in different abstracts were extracted.

    This was found to be a suitable dataset for training disambiguation models that use the context of the surrounding sentences to predict the correct meaning of the acronym. The prototype machine-learning models created from this dataset have not been released.

    The NASA Science Technology and Information Program (https://www.sti.nasa.gov/) provided the NASA Office of the Chief Information Officer Transformation and Data Division Data Analytics team with a large JSONL file of public abstracts from NASA-authored papers and reports. These can be found in results_merged.jsonl. The documents were exported in late 2018 and processed in 2019; they should not be assumed to be an exhaustive or complete set of all public NASA abstracts. Please contact https://www.sti.nasa.gov/ if you want a full and up-to-date data dump. This dataset was processed for a specific purpose at a specific point in time.

    JSONL is used as the format instead of JSON because individual lines can be read quickly and easily, without loading and inspecting a single large JSON dictionary for each metadata instance.

    This dataset could be used for various purposes, including lists of acronyms, lists of acronym definitions, and natural language processing models that disambiguate the meanings of acronyms with more than one definition. Anthony Buonomo, Jack Steilberg, and Justin Gosses contributed to preparing this dataset as part of an intern project.

    Individual File Descriptions

    README.md:

    • This file itself; it contains a description of the individual files.

    results_merged.jsonl:

    • Holds the abstracts and associated abstract metadata in a JSONL format where each metadata object is a separate line. There are 406,005 lines (abstracts) in the JSONL file.

    • The keys for each object include:

      • 'contributor.originator',
      • 'creator',
      • 'date.available',
      • 'date.issued',
      • 'description',
      • 'format',
      • 'identifier',
      • 'identifier.casi_id',
      • 'language',
      • 'relation.requires',
      • 'rights',
      • 'rights.accessRights',
      • 'subject',
      • 'subject.NASATerms',
      • 'title',
      • 'type'

    test_records.jsonl:

    • A file similar to results_merged.jsonl, but it includes only 102 metadata instances (lines), which makes it much easier to work with when testing.

    processed_acronyms.jsonl:

    • Each line in this file is an acronym found to have more than one definition. There are 484 acronyms with multiple definitions suitable for model building. Each line contains information on the acronym, its definitions, and where they were found in the corpus (the corpus being the file results_merged.jsonl).
    • The keys include:

      • "acronym"
      • "definition"
      • "corpus_positions"
      • "freq"
      • "ac_freq"
      • "mult_defs"
      • "group_ids"
    formatted_acronyms.jsonl:

    • This file contains approximately 92,000 extracted words that might be acronyms, their definitions if found, and their positions within the corpus. Many do not have extracted definitions, and it should be noted that not all of them are acronyms: a relatively broad definition was used to generate this file.
    • Each acronym instance is on a separate line and has the following keys:

      • "acronym"
      • "definition"
      • "corpus_positions"
      • "freq"
      • "ac_freq"

    acronyms.jsonl:

    • Each line in this JSONL file maps back to the line containing the metadata for an abstract in results_merged.jsonl.

    • Each object on each line consists of key:value pairs of acronym and detected definition. If a definition is not detected, it is left as an empty string "".

    EXAMPLE CODE TO LOAD JSONL FILES IN PYTHON

    Because JSONL is a little different from JSON, here is some example code for loading a file:

    ```
    import json

    with open('results_merged.jsonl', 'r') as json_file:
        json_list = list(json_file)

    for json_str in json_list:
        result = json.loads(json_str)
        print(f"result: {result}")
        print(isinstance(result, dict))
    ```

  9. ActiveHuman Part 1

    • data.niaid.nih.gov
    • zenodo.org
    Updated Nov 14, 2023
    Cite
    Charalampos Georgiadis (2023). ActiveHuman Part 1 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8359765
    Explore at:
    Dataset updated
    Nov 14, 2023
    Dataset authored and provided by
    Charalampos Georgiadis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is Part 1/2 of the ActiveHuman dataset! Part 2 can be found here.

    Dataset Description

    ActiveHuman was generated using Unity's Perception package. It consists of 175,428 RGB images and their semantic segmentation counterparts, taken in different environments, lighting conditions, camera distances and angles. In total, the dataset contains images for 8 environments, 33 humans, 4 lighting conditions, 7 camera distances (1m-4m) and 36 camera angles (0-360 at 10-degree intervals). The dataset does not include images at every single combination of available camera distances and angles, since for some values the camera would collide with another object or go outside the confines of an environment; as a result, some combinations of camera distances and angles do not exist in the dataset. Alongside each image, 2D bounding box, 3D bounding box and keypoint ground truth annotations are also generated via the use of Labelers and are stored as a JSON-based dataset. These Labelers are scripts responsible for capturing ground truth annotations for each captured image or frame. Keypoint annotations follow the COCO format defined by the COCO keypoint annotation template offered in the Perception package.

    Folder configuration

    The dataset consists of 3 folders:

    • JSON Data: Contains all the generated JSON files.
    • RGB Images: Contains the generated RGB images.
    • Semantic Segmentation Images: Contains the generated semantic segmentation images.

    Essential Terminology

    • Annotation: Recorded data describing a single capture.
    • Capture: One completed rendering process of a Unity sensor which stores the rendered result to data files (e.g., PNG, JPG).
    • Ego: Object or person to which a collection of sensors is attached (e.g., if a drone has a camera attached to it, the drone is the ego and the camera is the sensor).
    • Ego coordinate system: Coordinates with respect to the ego.
    • Global coordinate system: Coordinates with respect to the global origin in Unity.
    • Sensor: Device that captures the dataset (in this instance, a camera).
    • Sensor coordinate system: Coordinates with respect to the sensor.
    • Sequence: Time-ordered series of captures. This is very useful for video capture, where the time-order relationship of two captures is vital.
    • UUID: Universal Unique Identifier. A unique hexadecimal identifier that can represent an individual instance of a capture, ego, sensor, annotation, labeled object or keypoint, or keypoint template.

    Dataset Data

    The dataset includes 4 types of JSON annotation files:

    annotation_definitions.json: Contains annotation definitions for all of the active Labelers of the simulation, stored in an array. Each entry consists of a collection of key-value pairs which describe a particular type of annotation and contain information about how its data should be mapped back to labels or objects in the scene. Each entry contains the following key-value pairs:

    • id: Integer identifier of the annotation's definition.
    • name: Annotation name (e.g., keypoints, bounding box, bounding box 3D, semantic segmentation).
    • description: Description of the annotation's specifications.
    • format: Format of the file containing the annotation specifications (e.g., json, PNG).
    • spec: Format-specific specifications for the annotation values generated by each Labeler.

    Most Labelers generate different annotation specifications in the spec key-value pair:

    BoundingBox2DLabeler/BoundingBox3DLabeler:

    • label_id: Integer identifier of a label.
    • label_name: String identifier of a label.

    KeypointLabeler:

    • template_id: Keypoint template UUID.
    • template_name: Name of the keypoint template.
    • key_points: Array containing all the joints defined by the keypoint template. This array includes the key-value pairs:
      • label: Joint label.
      • index: Joint index.
      • color: RGBA values of the keypoint.
      • color_code: Hex color code of the keypoint.
    • skeleton: Array containing all the skeleton connections defined by the keypoint template. Each skeleton connection defines a connection between two different joints. This array includes the key-value pairs:
      • label1: Label of the first joint.
      • label2: Label of the second joint.
      • joint1: Index of the first joint.
      • joint2: Index of the second joint.
      • color: RGBA values of the connection.
      • color_code: Hex color code of the connection.

    SemanticSegmentationLabeler:

    • label_name: String identifier of a label.
    • pixel_value: RGBA values of the label.
    • color_code: Hex color code of the label.

    captures_xyz.json: Each of these files contains an array of ground truth annotations generated by each active Labeler for each capture separately, as well as extra metadata that describe the state of each active sensor that is present in the scene. Each array entry in the contains the following key-value pairs:

    id: UUID of the capture. sequence_id: UUID of the sequence. step: Index of the capture within a sequence. timestamp: Timestamp (in ms) since the beginning of a sequence. sensor: Properties of the sensor. This entry contains a collection with the following key-value pairs:

    sensor_id: Sensor UUID. ego_id: Ego UUID. modality: Modality of the sensor (e.g., camera, radar). translation: 3D vector that describes the sensor's position (in meters) with respect to the global coordinate system. rotation: Quaternion variable that describes the sensor's orientation with respect to the ego coordinate system. camera_intrinsic: matrix containing (if it exists) the camera's intrinsic calibration. projection: Projection type used by the camera (e.g., orthographic, perspective). ego: Attributes of the ego. This entry contains a collection with the following key-value pairs:

    ego_id: Ego UUID. translation: 3D vector that describes the ego's position (in meters) with respect to the global coordinate system. rotation: Quaternion variable containing the ego's orientation. velocity: 3D vector containing the ego's velocity (in meters per second). acceleration: 3D vector containing the ego's acceleration (in ). format: Format of the file captured by the sensor (e.g., PNG, JPG). annotations: Key-value pair collections, one for each active Labeler. These key-value pairs are as follows:

    id: Annotation UUID . annotation_definition: Integer identifier of the annotation's definition. filename: Name of the file generated by the Labeler. This entry is only present for Labelers that generate an image. values: List of key-value pairs containing annotation data for the current Labeler.

    Each Labeler generates different annotation specifications in the values key-value pair:

    BoundingBox2DLabeler:

    label_id: Integer identifier of a label. label_name: String identifier of a label. instance_id: UUID of one instance of an object. Each object with the same label that is visible on the same capture has different instance_id values. x: Position of the 2D bounding box on the X axis. y: Position of the 2D bounding box position on the Y axis. width: Width of the 2D bounding box. height: Height of the 2D bounding box. BoundingBox3DLabeler:

    • label_id: Integer identifier of a label.
    • label_name: String identifier of a label.
    • instance_id: UUID of one instance of an object. Each object with the same label that is visible on the same capture has a different instance_id value.
    • translation: 3D vector containing the location of the center of the 3D bounding box with respect to the sensor coordinate system (in meters).
    • size: 3D vector containing the size of the 3D bounding box (in meters).
    • rotation: Quaternion variable containing the orientation of the 3D bounding box.
    • velocity: 3D vector containing the velocity of the 3D bounding box (in meters per second).
    • acceleration: 3D vector containing the acceleration of the 3D bounding box (in meters per second squared).

    KeypointLabeler:

    • label_id: Integer identifier of a label.
    • instance_id: UUID of one instance of a joint. Keypoints with the same joint label that are visible on the same capture have different instance_id values.
    • template_id: UUID of the keypoint template.
    • pose: Pose label for that particular capture.
    • keypoints: Array containing the properties of each keypoint. Each keypoint that exists in the keypoint template file is one element of the array. Each entry contains the following:

    • index: Index of the keypoint in the keypoint template file.
    • x: Pixel coordinate of the keypoint on the X axis.
    • y: Pixel coordinate of the keypoint on the Y axis.
    • state: State of the keypoint.

    The SemanticSegmentationLabeler does not contain a values list.

    egos.json: Contains collections of key-value pairs for each ego. These include:

    • id: UUID of the ego.
    • description: Description of the ego.

    sensors.json: Contains collections of key-value pairs for all sensors of the simulation. These include:

    • id: UUID of the sensor.
    • ego_id: UUID of the ego on which the sensor is attached.
    • modality: Modality of the sensor (e.g., camera, radar, sonar).
    • description: Description of the sensor (e.g., camera, radar).
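
    The schema above maps directly onto ordinary JSON parsing. As a rough illustrative sketch (the file name is hypothetical, the top level is assumed to be the array of capture entries described above, and 2D boxes are recognized here simply by their field names), the following Python snippet prints every 2D bounding box in one captures file:

    import json

    # Hypothetical file name following the captures_xyz.json convention above.
    with open("captures_000.json") as f:
        captures = json.load(f)

    for capture in captures:
        for annotation in capture.get("annotations", []):
            # Labelers that produce per-instance data expose it in "values".
            for value in annotation.get("values", []):
                if {"x", "y", "width", "height"} <= set(value):
                    print(capture["id"], value["label_name"], value["x"],
                          value["y"], value["width"], value["height"])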

    Image names

    The RGB and semantic segmentation images share the same naming convention; the semantic segmentation images additionally begin their filenames with the string Semantic_. Each RGB image is named "e_h_l_d_r.jpg", where (a short parsing sketch follows the list):

    • e denotes the id of the environment.
    • h denotes the id of the person.
    • l denotes the id of the lighting condition.
    • d denotes the camera distance at which the image was captured.
    • r denotes the camera angle at which the image was captured.
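
    A filename in this convention can be unpacked by splitting on underscores; a minimal sketch (the example filename is hypothetical, and the individual ids are assumed to contain no underscores themselves):

    from pathlib import Path

    def parse_name(path):
        # Split an e_h_l_d_r filename into its five components.
        stem = Path(path).stem
        if stem.startswith("Semantic_"):  # segmentation images carry a prefix
            stem = stem[len("Semantic_"):]
        e, h, l, d, r = stem.split("_")
        return {"environment": e, "person": h, "lighting": l,
                "distance": d, "angle": r}

    print(parse_name("3_12_2_5_40.jpg"))  # hypothetical example filename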

  10. hind_encorp

    • huggingface.co
    • paperswithcode.com
    Cite
    Pavel Rychlý (2014). hind_encorp [Dataset]. https://huggingface.co/datasets/pary/hind_encorp
    Explore at:
    Dataset updated
    Mar 22, 2014
    Authors
    Pavel Rychlý
    License

    Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0), https://creativecommons.org/licenses/by-nc-sa/3.0/
    License information was derived automatically

    Description

    HindEnCorp parallel texts (sentence-aligned) come from the following sources: Tides, which contains 50K sentence pairs taken mainly from news articles. This dataset was originally collected for the DARPA-TIDES surprise-language contest in 2002, later refined at IIIT Hyderabad and provided for the NLP Tools Contest at ICON 2008 (Venkatapathy, 2008).

    Commentaries by Daniel Pipes contain 322 articles in English written by a journalist Daniel Pipes and translated into Hindi.

    EMILLE. This corpus (Baker et al., 2002) consists of three components: monolingual, parallel and annotated corpora. There are fourteen monolingual subcorpora, including both written and (for some languages) spoken data for fourteen South Asian languages. The EMILLE monolingual corpora contain in total 92,799,000 words (including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu). The parallel corpus consists of 200,000 words of text in English and its accompanying translations into Hindi and other languages.

    Smaller datasets as collected by Bojar et al. (2010) include the corpus used at ACL 2005 (a subcorpus of EMILLE), a corpus of named entities from Wikipedia (crawled in 2009), and an agriculture-domain parallel corpus. For the current release, we are extending the parallel corpus using these sources: Intercorp (Čermák and Rosen, 2012) is a large multilingual parallel corpus of 32 languages including Hindi. The central language used for alignment is Czech. Intercorp's core texts amount to 202 million words. These core texts are most suitable for us because their sentence alignment is manually checked and therefore very reliable. They cover predominantly short stories and novels. There are seven Hindi texts in Intercorp. Unfortunately, an English translation is available for only three of them; the other four are aligned only with Czech texts. The Hindi subcorpus of Intercorp contains 118,000 words in Hindi.

    TED talks, held in various languages (primarily English), are equipped with transcripts, and these are translated into 102 languages. There are 179 talks for which a Hindi translation is available.

    The Indic multi-parallel corpus (Birch et al., 2011; Post et al., 2012) is a corpus of texts from Wikipedia translated from the respective Indian language into English by non-expert translators hired over Mechanical Turk. The quality is thus somewhat mixed in many respects, ranging from typesetting, punctuation and capitalization over spelling and word choice to sentence structure. Some control could in principle be obtained from the fact that every input sentence was translated four times. We used the 2012 release of the corpus.

    Launchpad.net is a software collaboration platform that hosts many open-source projects and also facilitates collaborative localization of the hosted tools. We downloaded all revisions of all the hosted projects and extracted the localization (.po) files.

    Other smaller datasets. This time, we added Wikipedia entities as crawled in 2013 (including any morphological variants of a named entity that appears on the Hindi variant of the Wikipedia page) and words, word examples and quotes from the Shabdkosh online dictionary.
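
    Since the corpus is hosted on the Hugging Face Hub, it can presumably be loaded with the standard datasets library; a minimal sketch (the repository id is taken from the citation above, and the split name is an assumption):

    from datasets import load_dataset

    # Repository id as given in the citation above.
    ds = load_dataset("pary/hind_encorp", split="train")
    print(ds[0])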

  11. Datasets from an interlaboratory comparison to characterize a multi-modal...

    • s.cnmilf.com
    • datasets.ai
    Cite
    National Institute of Standards and Technology (2022). Datasets from an interlaboratory comparison to characterize a multi-modal polydisperse sub-micrometer bead dispersion [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/datasets-from-an-interlaboratory-comparison-to-characterize-a-multi-modal-polydisperse-sub-0a6ef
    Explore at:
    Dataset updated
    Jul 29, 2022
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    These four data files contain datasets from an interlaboratory comparison that characterized a polydisperse five-population bead dispersion in water. A more detailed version of this description is available in the ReadMe file (PdP-ILC_datasets_ReadMe_v1.txt), which also includes definitions of abbreviations used in the data files. Paired samples were evaluated, so the datasets are organized as pairs associated with a randomly assigned laboratory number. The datasets are organized in the files by instrument type: PTA (particle tracking analysis), RMM (resonant mass measurement), ESZ (electrical sensing zone), and OTH (other techniques not covered in the three largest groups, including holographic particle characterization, laser diffraction, flow imaging, and flow cytometry). In the OTH group, the specific instrument type for each dataset is noted. Each instrument type (PTA, RMM, ESZ, OTH) has a dedicated file. Included in the data files for each dataset are: (1) the cumulative particle number concentration (PNC, 1/mL); (2) the concentration distribution density (CDD, 1/(mL·nm)) based upon five bins centered at each particle population peak diameter; (3) the CDD in higher-resolution, varied-width bins. The lower-diameter bin edge (µm) is given for (2) and (3). Additionally, the PTA, RMM, and ESZ files each contain unweighted mean cumulative particle number concentrations and concentration distribution densities calculated from all datasets reporting values. The associated standard deviations and standard errors of the mean are also given. In the OTH file, the means and standard deviations were calculated using only data from one of the sub-groups (holographic particle characterization) that had n = 3 paired datasets. Where necessary, datasets not using the common bin resolutions are noted (PTA, OTH groups). The data contained here are presented and discussed in a manuscript to be submitted to the Journal of Pharmaceutical Sciences and presented as part of that scientific record.
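
    For orientation, the summary statistics named above follow the usual definitions; a minimal sketch with toy numbers (not values from the data files):

    import numpy as np

    # Toy cumulative PNC values (1/mL) from n hypothetical labs, not real data.
    pnc = np.array([2.1e7, 1.8e7, 2.4e7])

    mean = pnc.mean()
    sd = pnc.std(ddof=1)          # sample standard deviation
    sem = sd / np.sqrt(pnc.size)  # standard error of the mean
    print(mean, sd, sem)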

  12. Catalan LMF Apertium Dictionary - Dataset - B2FIND

    • b2find.dkrz.de
    Cite
    (2024). Catalan LMF Apertium Dictionary - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/dcf0f052-8b62-5d81-a340-68c57841f552
    Explore at:
    Dataset updated
    Mar 5, 2024
    Description

    This is the LMF version of the Catalan Apertium dictionary. Monolingual dictionaries for Spanish, Catalan, Galician and Basque have been generated from the Apertium expanded lexicons of the es-ca (for both Spanish and Catalan), es-gl (for Galician) and eu-es (for Basque) pairs. Apertium is a free/open-source machine translation platform, initially aimed at related-language pairs but recently expanded to deal with more divergent language pairs (such as English-Catalan). The platform provides a language-independent machine translation engine, tools to manage the linguistic data necessary to build a machine translation system for a given language pair, and linguistic data for a growing number of language pairs.

  13. Data sample for Mercury simulator

    • zenodo.org
    Cite
    Gerald Gurtner; Luis Delgado; Graham Tanner; Tatjana Bolic (2023). Data sample for Mercury simulator [Dataset]. http://doi.org/10.5281/zenodo.10246302
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 1, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gerald Gurtner; Luis Delgado; Graham Tanner; Tatjana Bolic
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Nov 27, 2023
    Description

    Sample data to run Mercury

    Mercury is available at: https://github.com/UoW-ATM/Mercury

    This file contains a sample of input data for the open-source air mobility simulator Mercury. For Mercury release v3.0, please use v3 of this dataset instead (https://doi.org/10.5281/zenodo.10222526).

    The dataset is structured as follows:

    • input: Input folder for Mercury
      • input/scenario=-1: Folder containing scenario -1 with about 1000 flights anonymised.
      • input/scenario=-1/scenario_config.toml: Configuration file for the scenario
      • input/scenario=-1/data: data provided is organised as follows:
        • ac_performance: aircraft performance dataset. Requires BADA files (not provided; the BADA files need to be structured inside the provided folders, see the Mercury Readme).
        • airlines: static information on airlines used in the scenario
        • airports: static information on airports, including capacity declarations and minimum turnaround time. Two subfolders included: taxi (with taxi-in and taxi-out times) and curfew (with curfew times)
        • costs: data required to define cost functions
        • delay: delay parameters (non-ATFM)
        • eaman: definition of EAMAN in scenario (scope)
        • flight_plans: information on the flight plans, contains:
          • crco: not provided as not needed to run Mercury (used for FP generation)
          • en_route_wind: not provided as not needed to run Mercury (used for FP generation)
          • flight_plans_pool: pool of flight plans for o-d ac type triplets
          • flight_uncertainty: distributions to model uncertainty on the realisation of the flight plans
          • routes: routes available between o-d pairs (used only for HMI and for FP generation (not needed to run Mercury))
          • trajectories: trajectories available from o-d ac type triplets (used only for HMI and for FP generation (not needed to run Mercury))
          • network_manager: ATFM probabilities, distributions and definition
        • pax: passenger itineraries for flights provided
        • scenario: static information on scenario (to be deprecated in subsequent updates)
        • schedules: flight schedules provided (note these will determine which airports, flight plans, etc. are provided in the dataset)
      • input/scenario=-1/case_studies: folder containing the definition of case studies
        • case_study=0: default case study with the configuration file (case_study_config.toml)
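
    The TOML configuration files referenced above can be inspected with Python's standard library; a minimal sketch (requires Python 3.11+ for tomllib; the path follows the listing above):

    import tomllib  # Python 3.11+

    # Path taken from the directory listing above.
    with open("input/scenario=-1/scenario_config.toml", "rb") as f:
        config = tomllib.load(f)

    # Print the top-level sections without assuming any particular keys.
    print(list(config))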

  14. Investigating the temperature evolution of EuTiO3 local structure by means...

    • b2find.dkrz.de
    Cite
    (2024). Investigating the temperature evolution of EuTiO3 local structure by means of Pair Distribution Function Analysis - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/ac512b59-d115-538c-b84d-bf7f7ca4ae9d
    Explore at:
    Dataset updated
    Nov 12, 2024
    Description

    Magnetoelectric EuTiO3 undergoes a long-range cubic-to-tetragonal phase transition at TC = 235 K. However, calorimetric measurements suggested a higher transition temperature, TA = 282 K. Recently our group demonstrated that tetragonal nanodomains exist also at T > TC. The proposed experiment aims to map the local and mesoscopic structure of EuTiO3 from below TC to above TA by means of PDF analysis at the D4 instrument. Neutron diffraction should supply a more accurate description of the Eu local/average structure with respect to X-rays, since the Ti and O positions are more accurately determined. The accurate description of the temperature evolution of the local structure will allow us to deepen our comprehension of the phase transition mechanism of this interesting magnetoelectric material. Due to the huge absorption coefficient of Eu, this experiment is only possible at the D4 instrument using lambda = 0.7-0.8 Å and a double-walled cylindrical sample holder. In fact, the energy dependence of the Eu absorption cross section makes it unfeasible at spallation sources.

  15. analphipy: A python package to analyze pair-potential metrics.

    • catalog.data.gov
    • s.cnmilf.com
    Cite
    National Institute of Standards and Technology (2022). analphipy: A python package to analyze pair-potential metrics. [Dataset]. https://catalog.data.gov/dataset/analphipy-a-python-package-to-analyze-pair-potential-metrics-1b9f4
    Explore at:
    Dataset updated
    Oct 15, 2022
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    analphipy is a Python package to calculate metrics for classical models of pair potentials. It provides a simple and extendable API for creating pair potentials. Several routines to calculate metrics are included in the package. The main features of analphipy are: 1) pre-defined spherically symmetric potentials; 2) a simple interface for extending to user-defined pair potentials; 3) routines to calculate Noro-Frenkel effective parameters; and 4) routines to calculate the Jensen-Shannon divergence.
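
    To illustrate the kind of metric involved, without assuming analphipy's exact API, the sketch below computes the second virial coefficient B2(T) for a Lennard-Jones potential in reduced units, the quantity underlying Noro-Frenkel effective parameters:

    import numpy as np
    from scipy.integrate import quad

    def lj(r, eps=1.0, sig=1.0):
        # Lennard-Jones pair potential in reduced units.
        return 4.0 * eps * ((sig / r) ** 12 - (sig / r) ** 6)

    def b2(temp, rmax=25.0):
        # B2(T) = -2*pi * integral of (exp(-u(r)/kT) - 1) r^2 dr over r >= 0,
        # truncated at rmax where the integrand is negligible.
        integrand = lambda r: (np.exp(-lj(r) / temp) - 1.0) * r * r
        val, _ = quad(integrand, 1e-8, rmax, points=[0.9, 1.1, 2.0], limit=200)
        return -2.0 * np.pi * val

    print(b2(1.5))  # negative below the LJ Boyle temperature (~3.4)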

  16. Large Scale International Boundaries

    • catalog.data.gov
    • geodata.state.gov
    Cite
    U.S. Department of State (Point of Contact) (2025). Large Scale International Boundaries [Dataset]. https://catalog.data.gov/dataset/large-scale-international-boundaries
    Explore at:
    Dataset updated
    Feb 28, 2025
    Dataset provided by
    United States Department of State (http://state.gov/)
    Description

    Overview

    The Office of the Geographer and Global Issues at the U.S. Department of State produces the Large Scale International Boundaries (LSIB) dataset. The current edition is version 11.4 (published 24 February 2025). The 11.4 release contains updated boundary lines and data refinements designed to extend the functionality of the dataset. These data and generalized derivatives are the only international boundary lines approved for U.S. Government use. The contents of this dataset reflect U.S. Government policy on international boundary alignment, political recognition, and dispute status. They do not necessarily reflect de facto limits of control.

    National Geospatial Data Asset

    This dataset is a National Geospatial Data Asset (NGDAID 194) managed by the Department of State. It is a part of the International Boundaries Theme created by the Federal Geographic Data Committee.

    Dataset Source Details

    Sources for these data include treaties, relevant maps, and data from boundary commissions, as well as national mapping agencies. Where available and applicable, the dataset incorporates information from courts, tribunals, and international arbitrations. The research and recovery process includes analysis of satellite imagery and elevation data. Due to the limitations of source materials and processing techniques, most lines are within 100 meters of their true position on the ground.

    Cartographic Visualization

    The LSIB is a geospatial dataset that, when used for cartographic purposes, requires additional styling. The LSIB download package contains example style files for commonly used software applications. The attribute table also contains embedded information to guide the cartographic representation. Additional discussion of these considerations can be found in the Use of Core Attributes in Cartographic Visualization section below. Additional cartographic information pertaining to the depiction and description of international boundaries or areas of special sovereignty can be found in Guidance Bulletins published by the Office of the Geographer and Global Issues: https://hiu.state.gov/data/cartographic_guidance_bulletins/

    Contact

    Direct inquiries to internationalboundaries@state.gov. Direct download: https://data.geodata.state.gov/LSIB.zip

    Attribute Structure

    The dataset uses the following attributes, divided into two categories:

    ATTRIBUTE NAME | ATTRIBUTE STATUS
    CC1 | Core
    CC1_GENC3 | Extension
    CC1_WPID | Extension
    COUNTRY1 | Core
    CC2 | Core
    CC2_GENC3 | Extension
    CC2_WPID | Extension
    COUNTRY2 | Core
    RANK | Core
    LABEL | Core
    STATUS | Core
    NOTES | Core
    LSIB_ID | Extension
    ANTECIDS | Extension
    PREVIDS | Extension
    PARENTID | Extension
    PARENTSEG | Extension

    These attributes have external data sources that update separately from the LSIB:

    ATTRIBUTE NAME | EXTERNAL SOURCE
    CC1 | GENC
    CC1_GENC3 | GENC
    CC1_WPID | World Polygons
    COUNTRY1 | DoS Lists
    CC2 | GENC
    CC2_GENC3 | GENC
    CC2_WPID | World Polygons
    COUNTRY2 | DoS Lists
    LSIB_ID | BASE
    ANTECIDS | BASE
    PREVIDS | BASE
    PARENTID | BASE
    PARENTSEG | BASE

    The core attributes listed above describe the boundary lines contained within the LSIB dataset. Removal of core attributes from the dataset will change the meaning of the lines. An attribute status of "Extension" represents a field containing data interoperability information. Other attributes not listed above include "FID", "Shape_length" and "Shape." These are components of the shapefile format and do not form an intrinsic part of the LSIB.

    Core Attributes

    The eight core attributes listed above contain unique information which, when combined with the line geometry, comprise the LSIB dataset. These Core Attributes are further divided into Country Code and Name Fields and Descriptive Fields.

    Country Code and Country Name Fields

    The "CC1" and "CC2" fields are machine-readable fields that contain political entity codes. These are two-character codes derived from the Geopolitical Entities, Names, and Codes Standard (GENC), Edition 3 Update 18. The "CC1_GENC3" and "CC2_GENC3" fields contain the corresponding three-character GENC codes and are extension attributes discussed below. The codes "Q2" or "QX2" denote a line in the LSIB representing a boundary associated with areas not contained within the GENC standard. The "COUNTRY1" and "COUNTRY2" fields contain the names of corresponding political entities. These fields contain names approved by the U.S. Board on Geographic Names (BGN) as incorporated in the "Independent States in the World" and "Dependencies and Areas of Special Sovereignty" lists maintained by the Department of State. To ensure maximum compatibility, names are presented without diacritics and certain names are rendered using common cartographic abbreviations. Names for lines associated with the code "Q2" are descriptive and not necessarily BGN-approved. Names rendered in all CAPITAL LETTERS denote independent states. Names rendered in normal text represent dependencies, areas of special sovereignty, or are otherwise presented for the convenience of the user.

    Descriptive Fields

    The following text fields are a part of the core attributes of the LSIB dataset and do not update from external sources. They provide additional information about each of the lines and are as follows:

    ATTRIBUTE NAME | CONTAINS NULLS
    RANK | No
    STATUS | No
    LABEL | Yes
    NOTES | Yes

    Neither the "RANK" nor "STATUS" fields contain null values; the "LABEL" and "NOTES" fields do. The "RANK" field is a numeric expression of the "STATUS" field. Combined with the line geometry, these fields encode the views of the United States Government on the political status of the boundary line. A value of "1" in the "RANK" field corresponds to an "International Boundary" value in the "STATUS" field. Values of "2" and "3" correspond to "Other Line of International Separation" and "Special Line," respectively. The "LABEL" field contains required text to describe the line segment on all finished cartographic products, including but not limited to print and interactive maps. The "NOTES" field contains an explanation of special circumstances modifying the lines. This information can pertain to the origins of the boundary lines, limitations regarding the purpose of the lines, or the original source of the line.

    Use of Core Attributes in Cartographic Visualization

    Several of the Core Attributes provide information required for the proper cartographic representation of the LSIB dataset. The cartographic usage of the LSIB requires a visual differentiation between the three categories of boundary lines. Specifically, this differentiation must be between:

    • International Boundaries (Rank 1);
    • Other Lines of International Separation (Rank 2); and
    • Special Lines (Rank 3).

    Rank 1 lines must be the most visually prominent. Rank 2 lines must be less visually prominent than Rank 1 lines. Rank 3 lines must be shown in a manner visually subordinate to Ranks 1 and 2. Where scale permits, Rank 2 and 3 lines must be labeled in accordance with the "Label" field. Data marked with a Rank 2 or 3 designation does not necessarily correspond to a disputed boundary. Please consult the style files in the download package for examples of this depiction. The requirement to incorporate the contents of the "LABEL" field on cartographic products is scale dependent. If a label is legible at the scale of a given static product, a proper use of this dataset would encourage the application of that label. Using the contents of the "COUNTRY1" and "COUNTRY2" fields in the generation of a line segment label is not required. The "STATUS" field contains the preferred description for the three LSIB line types when they are incorporated into a map legend but is otherwise not to be used for labeling. Use of the "CC1," "CC1_GENC3," "CC2," "CC2_GENC3," "RANK," or "NOTES" fields for cartographic labeling purposes is prohibited.

    Extension Attributes

    Certain elements of the attributes within the LSIB dataset extend data functionality to make the data more interoperable or to provide clearer linkages to other datasets. The fields "CC1_GENC3" and "CC2_GENC3" contain the corresponding three-character GENC code to the "CC1" and "CC2" attributes. The code "QX2" is the three-character counterpart of the code "Q2," which denotes a line in the LSIB representing a boundary associated with a geographic area not contained within the GENC standard. To allow for linkage between individual lines in the LSIB and World Polygons dataset, the "CC1_WPID" and "CC2_WPID" fields contain a Universally Unique Identifier (UUID), version 4, which provides a stable description of each geographic entity in a boundary pair relationship. Each UUID corresponds to a geographic entity listed in the World Polygons dataset. These fields allow for linkage between individual lines in the LSIB and the overall World Polygons dataset. Five additional fields in the LSIB expand on the UUID concept and either describe features that have changed across space and time or indicate relationships between previous versions of the feature. The "LSIB_ID" attribute is a UUID value that defines a specific instance of a feature. Any change to the feature in a lineset requires a new "LSIB_ID." The "ANTECIDS," or antecedent ID, is a UUID that references line geometries from which a given line is descended in time. It is used when there is a feature that is entirely new, not when there is a new version of a previous feature. This is generally used to reference countries that have dissolved. The "PREVIDS," or Previous ID, is a UUID field that contains old versions of a line. This is an additive field that houses all Previous IDs. A new version of a feature is defined by any change to the feature, either line geometry or attribute, but it is still conceptually the same feature. The "PARENTID" field
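
    For programmatic use, the attribute scheme above can be exercised with geopandas; a minimal sketch (assuming geopandas is installed, that LSIB.zip from the direct-download link above contains the shapefile, and that RANK is stored as a number):

    import geopandas as gpd

    # geopandas can read zipped shapefiles directly by path.
    lsib = gpd.read_file("LSIB.zip")

    # RANK 1 corresponds to "International Boundary" in the STATUS field.
    intl = lsib[lsib["RANK"] == 1]
    print(len(intl), "Rank 1 international boundary segments")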

  17. Pothole Mix

    • data.mendeley.com
    • paperswithcode.com
    Cite
    Andrea Ranieri (2022). Pothole Mix [Dataset]. http://doi.org/10.17632/kfth5g2xk3.2
    Explore at:
    Dataset updated
    May 27, 2022
    Authors
    Andrea Ranieri
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0), https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    This dataset for the semantic segmentation of potholes and cracks on the road surface was assembled from 5 other datasets already publicly available, plus a very small addition of segmented images on our part. To speed up the labeling operations, we started working with depth cameras to try to automate, to some extent, this extremely time-consuming phase.

    The main dataset is composed of 4340 (image, mask) pairs at different resolutions, divided into training/validation/test sets in a 3340/496/504 split (77/11/12 percent). This is the dataset used in the SHREC2022 competition, and it is the dataset that allowed us to train the neural networks for semantic segmentation capable of producing the images and videos that you have probably already seen.

    Alongside the main dataset we also release a set of RGB-D videos consisting of 797 RGB clips and as many clips with their disparity maps, captured with the excellent OAK-D cameras we won for being finalists at the OpenCV AI Competition 2021. In an effort to achieve (semi-)automatic labeling for these clips, we postprocessed the disparity maps using classic CV algorithms and managed to obtain 359 binary mask clips. Obviously these masks are not perfect (they cannot be by definition, otherwise the problem of automatic road damage detection would not exist), but nonetheless we believe they are an excellent starting point to create, for example, new data augmentations (creating potholes on "intact road images" belonging to other standard road datasets) or to be used as textures in the creation of 3D scenes from which to extract large amounts of images/masks for the training of neural networks. You can have a preview of what you will find in these clips by watching this video showing the overlay of images and binary masks: http://deeplearning.ge.imati.cnr.it/genova-5G/video/pothole-mix-videos/pothole-mix-rgb-d-overlay-videos-concat.html
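
    The disparity-to-mask step can be approximated with standard OpenCV operations; an illustrative sketch only (Otsu thresholding plus a morphological opening; the authors' exact pipeline is not specified, and the filenames are hypothetical):

    import cv2
    import numpy as np

    # Load a disparity map as 8-bit grayscale (hypothetical filename).
    disp = cv2.imread("disparity.png", cv2.IMREAD_GRAYSCALE)

    # Otsu picks a global threshold; opening removes small speckle noise.
    _, mask = cv2.threshold(disp, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    cv2.imwrite("mask.png", mask)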

    Please take a look at the readme file inside the main dataset zipfile to have some more details about the single sub-datasets and their sources.

  18. paws_wiki

    • tensorflow.org
    Cite
    (2022). paws_wiki [Dataset]. https://www.tensorflow.org/datasets/catalog/paws_wiki
    Explore at:
    Dataset updated
    Dec 15, 2022
    Description

    Existing paraphrase identification datasets lack sentence pairs that have high lexical overlap without being paraphrases. Models trained on such data fail to distinguish pairs like "flights from New York to Florida" and "flights from Florida to New York". This dataset contains 108,463 human-labeled and 656k noisily labeled pairs that highlight the importance of modeling structure, context, and word order information for the problem of paraphrase identification.

    For further details, see the accompanying paper: PAWS: Paraphrase Adversaries from Word Scrambling at https://arxiv.org/abs/1904.01130

    This corpus contains pairs generated from Wikipedia pages using both word-swapping and back-translation methods. All pairs have human judgements on both paraphrasing and fluency, and they are split into Train/Dev/Test sections.

    All files are in the tsv format with four columns:

    1. id: A unique id for each pair.
    2. sentence1: The first sentence.
    3. sentence2: The second sentence.
    4. (noisy_)label: (Noisy) label for each pair.

    Each label has two possible values: 0 indicates the pair has different meaning, while 1 indicates the pair is a paraphrase.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('paws_wiki', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.

  19. File group profile dataset

    • figshare.com
    Cite
    Kresimir Duretec; Christoph Becker (2016). File group profile dataset [Dataset]. http://doi.org/10.6084/m9.figshare.1297321.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jan 19, 2016
    Dataset provided by
    figshare
    Authors
    Kresimir Duretec; Christoph Becker
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains file group profiles for specific combinations of file formats and their properties. Combinations are defined by the Analysis Groups dataset (X). Currently three different sets are available. PDF_filtered covers PDF file format versions; years covered are 1996-2010. HTML_filtered covers HTML and XHTML file format versions; years covered are 1996-2010. Distiller_filtered covers 9 major releases of Adobe Distiller; years covered are 1996-2010. This dataset is an intermediary result of the analysis workflow (https://github.com/datascience/FormatAnalysis). It was created by filtering and conflict reduction of the Format profile dataset from the UK web archive (DOI: 10.5259/ukwa.ds.2/fmt/1). The first two columns cover combinations of property-value pairs for different properties (MIME type and software for Distiller_filtered; MIME type and version for HTML_filtered and PDF_filtered). The last two columns give the year and the number of files with the defined combination of properties that were present in the source dataset (UK web archive format profile) in that year.

  20. Pairwise Multi-Class Document Classification for Semantic Relations between...

    • live.european-language-grid.eu
    Cite
    (2024). Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles (Dataset) [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/18317
    Explore at:
    Available download formats: csv
    Dataset updated
    Apr 15, 2024
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Many digital libraries recommend literature to their users considering the similarity between a query document and their repository. However, they often fail to distinguish which relationship makes two documents alike. In this paper, we model the problem of finding the relationship between two documents as a pairwise document classification task. To find the semantic relation between documents, we apply a series of techniques, such as GloVe, Paragraph-Vectors, BERT, and XLNet under different configurations (e.g., sequence length, vector concatenation scheme), including a Siamese architecture for the Transformer-based systems. We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations. Our results show vanilla BERT as the best performing system with an F1-score of 0.93, which we manually examine to better understand its applicability to other domains. Our findings suggest that classifying semantic relations between documents is a solvable task and motivates the development of recommender systems based on the evaluated techniques. The discussions in this paper serve as first steps in the exploration of documents through SPARQL-like queries such that one could find documents that are similar in one aspect but dissimilar in another.

    Additional information can be found on GitHub.

    The following data is supplemental to the experiments described in our research paper. The data consists of:

    • Datasets (articles, class labels, cross-validation splits)
    • Pretrained models (Transformers, GloVe, Doc2vec)
    • Model output (prediction) for the best performing models

    This package consists of the Dataset part.

    Dataset

    The Wikipedia article corpus is available in enwiki-20191101-pages-articles.weighted.10k.jsonl.bz2. The original data have been downloaded as an XML dump, and the corresponding articles were extracted as plain text with gensim.scripts.segment_wiki. The archive contains only articles that are available in training or test data.

    The actual dataset is provided as used in the stratified k-fold evaluation (k=4) in train_testdata_4folds.tar.gz.

    ├── 1
    │  ├── test.csv
    │  └── train.csv
    ├── 2
    │  ├── test.csv
    │  └── train.csv
    ├── 3
    │  ├── test.csv
    │  └── train.csv
    └── 4
     ├── test.csv
     └── train.csv

    4 directories, 8 files
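
    A minimal sketch of loading one fold (paths follow the tree above; the CSV column layout is documented in the project's GitHub repository, so only shapes are printed here):

    import pandas as pd

    # Load fold 1 as extracted from train_testdata_4folds.tar.gz.
    train = pd.read_csv("1/train.csv")
    test = pd.read_csv("1/test.csv")
    print(train.shape, test.shape)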

CompanyKG Dataset V2.0: A Large-Scale Heterogeneous Graph for Company Similarity Quantification


In addition to the Similarity Prediction (SP) and Competitor Retrieval (CR) tasks summarized at the head of this entry, CompanyKG V2.0 defines two further evaluation tasks:
  • Similarity Ranking (SR) is designed to assess the ability of any method to rank candidate companies (numbered 0 and 1) based on their similarity to a query company. Paid human annotators, with backgrounds in engineering, science, and investment, were tasked with determining which candidate company is more similar to the query company. It resulted in an evaluation set comprising 1,856 rigorously labeled ranking questions. We retained 20% (368 samples) of this set as a validation set for model development.
  • Edge Prediction (EP) evaluates a model's ability to predict future or missing relationships between companies, providing forward-looking insights for investment professionals. The EP dataset, derived (and sampled) from new edges collected between April 6, 2023, and May 25, 2024, includes 40,000 samples, with edges not present in the pre-existing CompanyKG (a snapshot up until April 5, 2023).

Background and Motivation

In the investment industry, it is often essential to identify similar companies for a variety of purposes, such as market/competitor mapping and Mergers & Acquisitions (M&A). Identifying comparable companies is a critical task, as it can inform investment decisions, help identify potential synergies, and reveal areas for growth and improvement. The accurate quantification of inter-company similarity, also referred to as company similarity quantification, is the cornerstone to successfully executing such tasks. However, company similarity quantification is often a challenging and time-consuming process, given the vast amount of data available on each company, and the complex and diversified relationships among them.

While there is no universally agreed definition of company similarity, researchers and practitioners in the PE industry have adopted various criteria to measure similarity, typically reflecting the companies' operations and relationships. These criteria can embody one or more dimensions such as industry sectors, employee profiles, keywords/tags, customer reviews, financial performance, co-appearance in news, and so on. Investment professionals usually begin with a limited number of companies of interest (a.k.a. seed companies) and require an algorithmic approach to expand their search to a larger list of companies for potential investment.

In recent years, transformer-based Language Models (LMs) have become the preferred method for encoding textual company descriptions into vector-space embeddings. Companies similar to the seed companies can then be retrieved in the embedding space using distance metrics such as cosine similarity. The rapid advancements in Large LMs (LLMs), such as GPT-3/4 and LLaMA, have significantly enhanced the performance of general-purpose conversational models. These models, such as ChatGPT, can be employed to answer questions related to similar company discovery and quantification in a Q&A format.
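
A toy sketch of this embedding-space retrieval (random vectors stand in for the mSBERT/ADA2/SimCSE/PAUSE embeddings shipped with the dataset):

    import numpy as np

    # Toy stand-ins for company description embeddings.
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(1000, 64))
    seed = emb[0]  # a seed company of interest

    # Cosine similarity of every company against the seed, then the top 5.
    cos = emb @ seed / (np.linalg.norm(emb, axis=1) * np.linalg.norm(seed))
    top5 = np.argsort(-cos)[1:6]  # index 0 is the seed itself
    print(top5)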

However, a graph is still the most natural choice for representing and learning diverse company relations due to its ability to model complex relationships between a large number of entities. By representing companies as nodes and their relationships as edges, we can form a Knowledge Graph (KG). Utilizing this KG allows us to efficiently capture and analyze the network structure of the business landscape. Moreover, KG-based approaches allow us to leverage powerful tools from network science, graph theory, and graph-based machine learning, such as Graph Neural Networks (GNNs), to extract insights and patterns that facilitate similar-company analysis. While there are various company datasets (mostly commercial/proprietary and non-relational) and graph datasets available (mostly for single link/node/graph-level predictions), there is a scarcity of datasets and benchmarks that combine both to create a large-scale KG dataset expressing rich pairwise company relations.

Source Code and Tutorial:
https://github.com/llcresearch/CompanyKG2

Paper: to be published
