72 datasets found
  1. LScD (Leicester Scientific Dictionary)

    • figshare.le.ac.uk
    docx
    Updated Apr 15, 2020
    + more versions
    Cite
    Neslihan Suzen (2020). LScD (Leicester Scientific Dictionary) [Dataset]. http://doi.org/10.25392/leicester.data.9746900.v3
    Explore at:
    docx; available download formats
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

    LScD (Leicester Scientific Dictionary), April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

    [Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus), Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in the description of Version 2 below; we do not repeat the explanation here. After pre-processing, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are also the same as those described for LScD Version 2 below.

    * Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
    ** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2

    [Version 2] Getting Started

    This document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). The dictionary was created for future work on the quantification of the meaning of research texts. R code for producing the dictionary from the LSC, with usage instructions, is available in [2]. The code can also be used for lists of texts from other sources; amendments to the code may be required.

    LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains a title, a list of authors, a list of categories, a list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018. The total number of documents in the LSC is 1,673,824.

    LScD is an ordered list of words from the texts of the abstracts in the LSC. The dictionary stores 974,238 unique words and is sorted by the number of documents containing each word, in descending order. All words in the LScD are in stemmed form. The LScD contains the following information:
    1. Unique words in abstracts
    2. Number of documents containing each word
    3. Number of appearances of a word in the entire corpus

    Processing the LSC

    Step 1. Downloading the LSC: use of the LSC is subject to acceptance of a request for the link by email. To access the LSC for research purposes, please email ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.

    Step 2. Importing the Corpus to R: the full R code for processing the corpus can be found on GitHub [2]. All of the following steps can be applied to an arbitrary list of texts from any source with changes of parameters. The structure of the corpus, such as the file format and the names (and positions) of fields, should be taken into account when applying the code. The organisation of the CSV files of the LSC is described in the README file for the LSC [1].

    Step 3. Extracting Abstracts and Saving Metadata: metadata (all fields in a document excluding the abstract) and the abstract field are separated. Metadata are then saved as MetaData.R. The fields of metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.

    Step 4. Text Pre-processing Steps on the Collection of Abstracts: this section presents our approach to pre-processing the abstracts of the LSC.
    1. Removing punctuation and special characters: all non-alphanumeric characters are substituted by a space. We did not substitute the character "-" in this step, because we need to keep words like "z-score", "non-payment" and "pre-processing" in order not to lose their actual meaning. Uniting prefixes with words is performed in a later step of pre-processing.
    2. Lowercasing the text data: lowercasing is performed to avoid treating words like "Corpus", "corpus" and "CORPUS" differently. The entire collection of texts is converted to lowercase.
    3. Uniting prefixes of words: words containing prefixes joined with the character "-" are united into a single word. The prefixes united for this research are listed in the file "list_of_prefixes.csv". Most of the prefixes are extracted from [4]; we also added the commonly used prefixes 'e', 'extra', 'per', 'self' and 'ultra'.
    4. Substitution of words: some words joined with "-" in the abstracts of the LSC require an additional substitution step to avoid losing their meaning before the character "-" is removed. Examples of such words are "z-test", "well-known" and "chi-square", which are substituted by "ztest", "wellknown" and "chisquare". Such words were identified by sampling abstracts from the LSC. The full list of such words and the substitution decisions are presented in the file "list_of_substitution.csv".
    5. Removing the character "-": all remaining "-" characters are replaced by a space.
    6. Removing numbers: all digits that are not part of a word are replaced by a space. All words that contain both digits and letters are kept, because alphanumeric tokens such as chemical formulas might be important for our analysis. Examples are "co2", "h2o" and "21st".
    7. Stemming: stemming is the process of converting inflected words into their word stem. This unites several forms of words with similar meaning into one form and also saves memory space and time [5]. All words in the LScD are stemmed to their word stem.
    8. Stop word removal: stop words are extremely common words that provide little value in a language; common English stop words are 'I', 'the', 'a', etc. We used the 'tm' package in R to remove stop words [6]; there are 174 English stop words listed in the package.

    Step 5. Writing the LScD into CSV Format: there are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written to the file "LScD.csv".

    The Organisation of the LScD

    The total number of words in the file "LScD.csv" is 974,238. Each field is described below:
    Word: unique words from the corpus, in lowercase and in their stem forms. The field is sorted by the number of documents that contain the word, in descending order.
    Number of Documents Containing the Word: a binary count is used: if a word exists in an abstract, it counts as 1; if the word occurs more than once in a document, the count is still 1. The total number of documents containing the word is the sum of these 1s over the entire corpus.
    Number of Appearances in Corpus: how many times the word occurs in the corpus when the corpus is considered as one large document.

    Instructions for R Code

    LScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as an RData file and in CSV format. The outputs are:
    Metadata File: includes all fields in a document excluding abstracts. Fields are List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.
    File of Abstracts: contains all abstracts after the pre-processing steps defined in Step 4.
    DTM: the Document Term Matrix constructed from the LSC [6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.
    LScD: an ordered list of words from the LSC as defined in the previous section.

    To use the code:
    1. Download the folder 'LSC', 'list_of_prefixes.csv' and 'list_of_substitution.csv'.
    2. Open the LScD_Creation.R script.
    3. Change the parameters in the script: replace them with the full path of the directory with the source files and the full path of the directory to write the output files.
    4. Run the full code.

    References
    [1] N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1
    [2] N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION
    [3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
    [4] A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.
    [5] C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved Porter's stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.
    [6] I. Feinerer, "Introduction to the tm Package: Text Mining in R," available online: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.
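    For illustration, here is a minimal Python sketch of the Step 4 pipeline described above. The published implementation is the R script LScD_Creation.R [2]; the prefix list, substitution list and stop-word set below are tiny stand-ins for the real files, and NLTK's PorterStemmer stands in for the stemmer used.

```python
# Minimal, illustrative sketch of the Step 4 pre-processing pipeline.
# The published implementation is the R script LScD_Creation.R [2];
# PREFIXES, SUBSTITUTIONS and STOP_WORDS are tiny samples of the real
# list_of_prefixes.csv, list_of_substitution.csv and tm stop-word list.
import re
from collections import Counter
from nltk.stem import PorterStemmer

PREFIXES = ["extra", "self", "ultra", "per", "pre", "non", "e"]
SUBSTITUTIONS = {"z-test": "ztest", "well-known": "wellknown",
                 "chi-square": "chisquare"}
STOP_WORDS = {"i", "the", "a", "an", "and", "of", "in", "to", "is"}
stemmer = PorterStemmer()

def preprocess(abstract: str) -> list[str]:
    text = re.sub(r"[^\w\s-]", " ", abstract)               # 1. drop punctuation, keep "-"
    text = text.lower()                                     # 2. lowercase
    text = re.sub(r"\b(" + "|".join(PREFIXES) + r")-(\w)",  # 3. unite prefixes:
                  r"\1\2", text)                            #    pre-processing -> preprocessing
    for old, new in SUBSTITUTIONS.items():                  # 4. substitute listed words
        text = text.replace(old, new)
    text = text.replace("-", " ")                           # 5. remove remaining hyphens
    text = re.sub(r"\b\d+\b", " ", text)                    # 6. drop standalone numbers; "co2", "21st" survive
    tokens = [stemmer.stem(t) for t in text.split()]        # 7. stem
    return [t for t in tokens if t not in STOP_WORDS]       # 8. remove stop words

# LScD counts: document frequency (binary per abstract) and total appearances.
doc_freq, total_freq = Counter(), Counter()
for abstract in ["Pre-processing of z-test scores and CO2 data from 2014 papers."]:
    tokens = preprocess(abstract)
    total_freq.update(tokens)
    doc_freq.update(set(tokens))                            # each word counted once per document
```

    Run over all 1,673,824 abstracts and sorted by doc_freq in descending order, this would reproduce the two LScD count fields described above, up to the stand-in word lists.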

  2. BHL Optical Character Recognition (OCR) - Full Text Export (new)

    • smithsonian.figshare.com
    • figshare.com
    bin
    Updated Mar 10, 2025
    Cite
    Joel Richard; Jacqueline Dearborn (2025). BHL Optical Character Recognition (OCR) - Full Text Export (new) [Dataset]. http://doi.org/10.25573/data.21422193.v22
    Explore at:
    bin; available download formats
    Dataset updated
    Mar 10, 2025
    Dataset provided by
    Smithsonian Libraries and Archives
    Authors
    Joel Richard; Jacqueline Dearborn
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The dataset contains a full export of the 60+ million pages of OCR content in the Biodiversity Heritage Library, for items hosted by BHL. For contextual information and key definitions about this dataset see the Biodiversity Heritage Library Open Data Collection and the data dictionary below.

    Data Dictionary: s.si.edu/bhlocrtxt
    Release Date: the 17th of each month
    Frequency: Monthly
    bureauCode: 452:11
    Access Level: public

  3. Player Experience in Video Game Character Analysis: A Study of Female Characters

    • data.niaid.nih.gov
    Updated Jun 30, 2024
    Cite
    de Guzman, Wendell; Chavez, John Xavier (2024). Player Experience in Video Game Character Analysis: A Study of Female Characters [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11641622
    Explore at:
    Dataset updated
    Jun 30, 2024
    Dataset provided by
    Mapúa University
    Authors
    de Guzman, Wendell; Chavez, John Xavier
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    This dataset is part of the study titled "Player Experience in Video Game Character Analysis: A Study of Female Characters", conducted at Mapúa University. The research aims to integrate player experience into an existing framework for video game character analysis.

    Content

    The dataset includes:

    A partial transcript of 5 semi-structured interviews with the key informants. Originally, 8 interviews were conducted, but the audio/video recordings for 3 interviews were lost and thus their transcripts are not available.

    Significant codes presented in tabulated form.

    Data Collection Method

    Data were collected through in-depth interviews conducted via Facebook Messenger and Discord from March to April 2024. Participants were various video game players from different backgrounds and age groups, ranging from 20 to 40 years old. Due to technical issues, the recordings of 3 interviews were lost, resulting in only 5 available transcripts.

    Data Processing and Analysis

    The 5 available interviews were transcribed verbatim. Data were analyzed using thematic analysis, involving initial coding, theme development, and refinement.

    Usage data

    The dataset is organized into several sections within a single Word document (.docx). The Word document has headings for navigation and a definition of terms.

    Limitations

    The dataset includes only 5 of the 8 interview transcripts due to technical difficulties encountered after the interviews were recorded. This may impact the comprehensiveness of the findings.

    Contextual Reference

    The manuscript associated with this dataset heavily references the works "A Structural Model for Player-Characters as Semiotic Constructs" (DOI: https://doi.org/10.26503/TODIGRA.V2I2.37) and "Object, me, symbiote, other: A social typology of player-avatar relationships" (DOI: https://doi.org/10.5210/FM.V20I2.5433), which explore the foundational frameworks for video game character analysis.

    For any further information or clarifications, please contact wbdg2000@gmail.com

  4. Namesakes

    • figshare.com
    json
    Updated Nov 20, 2021
    Cite
    Oleg Vasilyev; Aysu Altun; Nidhi Vyas; Vedant Dharnidharka; Erika Lampert; John Bohannon (2021). Namesakes [Dataset]. http://doi.org/10.6084/m9.figshare.17009105.v1
    Explore at:
    json; available download formats
    Dataset updated
    Nov 20, 2021
    Dataset provided by
    figshare (http://figshare.com/)
    Authors
    Oleg Vasilyev; Aysu Altun; Nidhi Vyas; Vedant Dharnidharka; Erika Lampert; John Bohannon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    Motivation: creating a challenging dataset for testing Named-Entity Linking.

    The Namesakes dataset consists of three closely related datasets: Entities, News and Backlinks. Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names. The News were collected as random news text chunks, containing mentions that either belong to the Entities dataset or can be easily confused with them. Backlinks were obtained from Wikipedia dump data with the intention of having mentions linked to the entities of the Entities dataset. The Entities and News are human-labeled, resolving the mentions of the entities.

    Methods

    Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names: the most popular people names, the most popular locations, and organizations with name ambiguity. In each Entities text chunk, the named entities whose names are similar to the chunk's Wikipedia page name are labeled. For labeling, these entities were suggested to human annotators (odetta.ai) to tag as "Same" (same as the page entity) or "Other". The labeling was done by 6 experienced annotators who passed a preliminary trial task. The only accepted tags are those assigned in agreement by not less than 5 annotators and then passed through reconciliation with an experienced reconciliator.

    The News were collected as random news text chunks, containing mentions which either belong to the Entities dataset or can be easily confused with them. In each News text chunk one mention was selected for labeling, and 3-10 Wikipedia pages from Entities were suggested as the labels for an annotator to choose from. The labeling was done by 3 experienced annotators (odetta.ai), after the annotators passed a preliminary trial task. The results were reconciled by an experienced reconciliator. All the labeling was done using Lighttag (lighttag.io).

    Backlinks were obtained from Wikipedia dump data (dumps.wikimedia.org/enwiki/20210701) with intention to have mentions linked to the entities of the Entity dataset. The backlinks were filtered to leave only mentions in a good quality text; each text was cut 1000 characters after the last mention.

    Usage Notes

    Entities: File: Namesakes_entities.jsonl. The Entities dataset consists of 4148 Wikipedia text chunks containing human-tagged mentions of entities. Each mention is tagged either as "Same" (the mention is of this Wikipedia page entity) or "Other" (the mention is of some other entity that merely has the same or a similar name). The Entities dataset is a jsonl list; each item is a dictionary with the following keys and values:

    • 'pagename': page name of the Wikipedia page.
    • 'pageid': page id of the Wikipedia page.
    • 'title': title of the Wikipedia page.
    • 'url': URL of the Wikipedia page.
    • 'text': the text chunk from the Wikipedia page.
    • 'entities': list of the mentions in the page text; each mention is represented by a dictionary with the keys:
      • 'text': the mention as a string from the page text.
      • 'start': start character position of the mention in the text.
      • 'end': end (one-past-last) character position of the mention in the text.
      • 'tag': annotation tag given as a string, either 'Same' or 'Other'.

    News: File: Namesakes_news.jsonl. The News dataset consists of 1000 news text chunks, each one with a single annotated entity mention. The annotation either points to the corresponding entity from the Entities dataset (if the mention is of that entity), or indicates that the mentioned entity does not belong to the Entities dataset. The News dataset is a jsonl list; each item is a dictionary with the following keys and values:

    • 'id_text': id of the sample.
    • 'text': the text chunk.
    • 'urls': list of URLs of Wikipedia entities suggested to labelers for identification of the entity mentioned in the text.
    • 'entity': a dictionary describing the annotated entity mention in the text:
      • 'text': the mention as a string found by an NER model in the text.
      • 'start': start character position of the mention in the text.
      • 'end': end (one-past-last) character position of the mention in the text.
      • 'tag': this key exists only if the mentioned entity is annotated as belonging to the Entities dataset; if so, the value is a dictionary identifying the Wikipedia page assigned by annotators to the mentioned entity:
        • 'pageid': Wikipedia page id.
        • 'pagetitle': page title.
        • 'url': page URL.

    Backlinks dataset: The Backlinks dataset consists of two parts: dictionary Entity-to-Backlinks and Backlinks documents. The dictionary points to backlinks for each entity of the Entity dataset (if any backlinks exist for the entity). The Backlinks documents are the backlinks Wikipedia text chunks with identified mentions of the entities from the Entities dataset.

    Each mention is identified by surrounding double square brackets, e.g. "Muir built a small cabin along [[Yosemite Creek]].". However, if the mention differs from the exact entity name, the double square brackets wrap both the exact name and, separated by '|', the mention string to the right, for example: "Muir also spent time with photographer [[Carleton E. Watkins | Carleton Watkins]] and studied his photographs of Yosemite.".

    The Entity-to-Backlinks is a jsonl with 1527 items. File: Namesakes_backlinks_entities.jsonl. Each item is a tuple of:

    • Entity name.
    • Entity Wikipedia page id.
    • Backlinks ids: a list of pageids of backlink documents.

    The Backlinks documents is a jsonl with 26903 items. File: Namesakes_backlinks_texts.jsonl. Each item is a dictionary:

    • 'pageid': id of the Wikipedia page.
    • 'title': title of the Wikipedia page.
    • 'content': text chunk from the Wikipedia page, with all mentions in double brackets; the text is cut 1000 characters after the last mention, and the cut is denoted as '...[CUT]'.
    • 'mentions': list of the mentions from the text, for convenience. Each mention is a tuple of: entity name; entity Wikipedia page id; sorted list of all character indexes at which the mention occurrences start in the text.
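    A minimal Python sketch for reading the jsonl files and parsing the double-bracket mention markup described above (the key names follow the description; the file paths are assumed to be in the working directory):

```python
# Minimal sketch: load the Namesakes jsonl files and parse the
# [[Entity]] / [[Entity | mention]] markup in the Backlinks texts.
# Keys follow the dataset description above; paths are assumptions.
import json
import re

def read_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

entities = read_jsonl("Namesakes_entities.jsonl")
tags = [(m["text"], m["tag"]) for page in entities for m in page["entities"]]

MENTION = re.compile(r"\[\[([^\]|]+?)(?:\s*\|\s*([^\]]+?))?\]\]")

def parse_mentions(content: str):
    """Return (entity name, surface mention) pairs from a Backlinks text."""
    return [(m.group(1).strip(), (m.group(2) or m.group(1)).strip())
            for m in MENTION.finditer(content)]

for doc in read_jsonl("Namesakes_backlinks_texts.jsonl")[:3]:
    print(doc["title"], parse_mentions(doc["content"]))
```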

  5. Dataset of books called ABC Japanese-English dictionary : an entirely new method of classification of the Chinese-Japanese characters

    • workwithdata.com
    Updated Apr 17, 2025
    Cite
    Work With Data (2025). Dataset of books called ABC Japanese-English dictionary : an entirely new method of classification of the Chinese-Japanese characters [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=ABC+Japanese-English+dictionary+%3A+an+entirely+new+method+of+classification+of+the+Chinese-Japanese+characters
    Explore at:
    Dataset updated
    Apr 17, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about books. It has 1 row and is filtered where the book is ABC Japanese-English dictionary : an entirely new method of classification of the Chinese-Japanese characters. It features 7 columns including author, publication date, language, and book publisher.

  6. Game of Thrones mortality and survival dataset

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    zip
    Updated May 30, 2023
    Cite
    Reidar Lystad; Benjamin Brown (2023). Game of Thrones mortality and survival dataset [Dataset]. http://doi.org/10.6084/m9.figshare.8259680.v1
    Explore at:
    zip; available download formats
    Dataset updated
    May 30, 2023
    Dataset provided by
    figshare
    Authors
    Reidar Lystad; Benjamin Brown
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset includes data from Game of Thrones Seasons 1–8. The dataset comprises two separate datasets and an accompanying data dictionary. The character dataset contains 359 observations (i.e. characters) and 35 variables, including information about sociodemographics, exposures, and mortality. The episode dataset contains 73 observations (i.e. episodes) and 8 variables, including information about episode running time.

    An earlier version of the dataset, which included data from Game of Thrones Seasons 1–7 only, was used in the following original research article: Lystad RP, Brown BT. "Death is certain, the time is not": mortality and survival in Game of Thrones. Injury Epidemiology 2018;5:44.

  7. Data from: Definition of character for medical education based on expert opinions in Korea

    • dataverse.harvard.edu
    Updated Nov 10, 2021
    Cite
    Yera Hur (2021). Definition of character for medical education based on expert opinions in Korea [Dataset]. http://doi.org/10.7910/DVN/S5JLIB
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 10, 2021
    Dataset provided by
    Harvard Dataverse
    Authors
    Yera Hur
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    South Korea
    Description

    A single questionnaire with 3 major questions on character was distributed to medical education experts in Korea via e-mail. The questions were: “How would you define the ‘character’ that is required from a good doctor in the era of the fourth industrial revolution?”, “What are the issues of character education in current medical education (if any?)”, and “If you agree that there are any issue(s) of character education in current medical education, what possible solutions do you suggest?” The survey was distributed twice. In the first round of the survey, 145 e-mails were sent, and the response rate was 23.4% (34 responses). In the second round of survey distribution, 29 additional responses were gathered. Thus, responses from 63 medical education experts from 30 medical schools or colleges and 19 non-medical education experts were used in the final analysis.

  8. DDD

    • huggingface.co
    Updated Nov 30, 2023
    Cite
    Iconic Interactive (2023). DDD [Dataset]. https://huggingface.co/datasets/IconicAI/DDD
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 30, 2023
    Dataset authored and provided by
    Iconic Interactive
    Description

    Deep Dungeons and Dragons

    A dataset of long-form multi-turn and multi-character collaborative RPG stories, complete with associated character cards. The dataset comprises 56,000 turns across 1544 stories following 9771 characters: a total of 50M Llama tokens. Each turn is a multi-paragraph continuation of a story from the perspective of a defined character, including both dialogue and prose. This dataset is a cleaned and reformatted version of Deep Dungeons and Dragons… See the full description on the dataset page: https://huggingface.co/datasets/IconicAI/DDD.
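    A minimal sketch for loading the dataset with the Hugging Face `datasets` library; the split and column names are not stated above, so the code simply inspects whatever is there:

```python
# Minimal sketch: load DDD from the Hugging Face Hub.
# Split and column names are assumptions to be inspected, not documented facts.
from datasets import load_dataset

ddd = load_dataset("IconicAI/DDD")          # repo id from the citation above
print(ddd)                                  # shows available splits and columns
split = next(iter(ddd.values()))            # first split, whatever it is named
print({k: str(v)[:80] for k, v in split[0].items()})  # peek at one record
```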

  9. Greatest Comic Book Characters

    • kaggle.com
    zip
    Updated Oct 26, 2022
    Cite
    Aman Chauhan (2022). Greatest Comic Book Characters [Dataset]. https://www.kaggle.com/datasets/whenamancodes/greatest-comic-book-characters
    Explore at:
    zip (610385 bytes); available download formats
    Dataset updated
    Oct 26, 2022
    Authors
    Aman Chauhan
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This folder contains data behind the story Comic Books Are Still Made By Men, For Men And About Men.

    The data comes from Marvel Wikia and DC Wikia. Characters were scraped on August 24. Appearance counts were scraped on September 2. The month and year of the first issue each character appeared in was pulled on October 6.

    Data Dictionary

    Column: Definition
    page_id: The unique identifier for that character's page within the wikia
    name: The name of the character
    urlslug: The unique url within the wikia that takes you to the character
    ID: The identity status of the character (Secret Identity, Public Identity, [on Marvel only: No Dual Identity])
    ALIGN: If the character is Good, Bad or Neutral
    EYE: Eye color of the character
    HAIR: Hair color of the character
    SEX: Sex of the character (e.g. Male, Female, etc.)
    GSM: If the character is a gender or sexual minority (e.g. homosexual characters, bisexual characters)
    ALIVE: If the character is alive or deceased
    APPEARANCES: The number of appearances of the character in comic books (as of Sep. 2, 2014; the number will become increasingly out of date as time goes on)
    FIRST APPEARANCE: The month and year of the character's first appearance in a comic book, if available
    YEAR: The year of the character's first appearance in a comic book, if available
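    A minimal pandas sketch for exploring these columns (the CSV file names inside the Kaggle archive are assumptions; adjust them to the actual contents of the zip):

```python
# Minimal sketch over the data dictionary above. The CSV name is an
# assumption about the Kaggle archive contents, not a documented file.
import pandas as pd

marvel = pd.read_csv("marvel-wikia-data.csv")           # assumed file name
print(marvel["SEX"].value_counts(normalize=True))       # e.g. share of female characters
top = marvel.sort_values("APPEARANCES", ascending=False)
print(top[["name", "APPEARANCES", "ALIGN"]].head(10))   # most frequently appearing characters
```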
  10. OLHWD

    • huggingface.co
    Cite
    Mission R, OLHWD [Dataset]. https://huggingface.co/datasets/Immortalman12/OLHWD
    Explore at:
    Authors
    Mission R
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Data of online Chinese handwriting.

    • all_datas.npy: the handwritten text-line data (from CASIA-OLHWDB 2.0-2.2)
    • datas.npy: the handwritten single-character data (from CASIA-OLHWDB 1.0-1.2)
    • mydict.npy: all of the character types in the single-character dataset
    • dictionary.npy: all of the character types in the text-line dataset
    • test_datas.npy: the handwritten single-character data for testing (from the ICDAR 2013 competition database)
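    A minimal NumPy sketch for inspecting the files (allow_pickle=True is an assumption, needed only if the arrays store Python objects rather than plain numeric arrays):

```python
# Minimal sketch: inspect the .npy files listed above with NumPy.
# allow_pickle=True is an assumption about how the arrays were saved.
import numpy as np

chars = np.load("datas.npy", allow_pickle=True)        # single-character samples
char_types = np.load("mydict.npy", allow_pickle=True)  # character inventory
print(type(chars), getattr(chars, "shape", None))
print("character types:", len(char_types))
```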

  11. Dataset for Dense sampling of taxa and characters improves phylogenetic resolution among deltocephaline leafhoppers (Hemiptera: Cicadellidae: Deltocephalinae)

    • databank.illinois.edu
    • aws-databank-alb.library.illinois.edu
    Updated Aug 29, 2022
    + more versions
    Cite
    Yanghui Cao; Christopher H. Dietrich; James N. Zahniser; Dmitry A. Dmitriev (2022). Dataset for Dense sampling of taxa and characters improves phylogenetic resolution among deltocephaline leafhoppers (Hemiptera: Cicadellidae: Deltocephalinae) [Dataset]. http://doi.org/10.13012/B2IDB-8842653_V2
    Explore at:
    Dataset updated
    Aug 29, 2022
    Authors
    Yanghui Cao; Christopher H. Dietrich; James N. Zahniser; Dmitry A. Dmitriev
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Dataset funded by
    U.S. National Science Foundation (NSF)
    Description

    The following files were used to reconstruct the phylogeny of the leafhopper subfamily Deltocephalinae, using IQ-TREE v1.6.12 and ASTRAL v4.10.5.

    1) taxon_sampling.csv: contains the sequencing ids (1st column) and the taxonomic information (2nd column) of each sample. Sequencing ids were used in the alignment files and partition files.
    2) concatenated_nt.phy: concatenated nucleotide alignment used for the maximum likelihood analysis of Deltocephalinae with IQ-TREE v1.6.12. The file lists the sequences of 163,365 nucleotide positions from 429 genes in 730 samples. Hyphens represent gaps.
    3) concatenated_nt_partition.nex: the partitions for the concatenated nucleotide alignment. The file partitions the 163,365 nucleotide characters into 429 character sets and defines the best substitution model for each character set.
    4) concatenated_aa.phy: concatenated amino acid alignment used for the maximum likelihood analysis of Deltocephalinae with IQ-TREE v1.6.12. The file gives the sequences of 53,969 amino acids from 429 genes in 730 samples. Hyphens represent gaps.
    5) concatenated_aa_partition.nex: the partitions for the concatenated amino acid alignment. The file partitions the 53,969 characters into 429 character sets and defines the best substitution model for each character set.
    6) concatenated_nt_106taxa.phy: a reduced concatenated nucleotide alignment representing 107 samples x 86 genes, used to estimate the divergence times of Deltocephalinae with MCMCTree in PAML v4.9. The file lists the sequences of 79,239 nucleotide positions from 86 genes in 107 samples. Hyphens represent gaps.
    7) concatenated_nt_106taxa_partition.nex: the partitions for the nucleotide alignment concatenated_nt_106taxa.phy. The file partitions the 79,239 nucleotide characters into 86 character sets and defines the best substitution model for each character set.
    8) individual_gene_alignment.zip: contains 429 FAS files, one for each of the partitioned nucleotide character sets in the concatenated_nt_partition.nex file. Hyphens represent gaps. These files were used to construct gene trees with IQ-TREE v1.6.12, followed by multispecies coalescent analysis with ASTRAL v4.10.5.
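    A minimal Biopython sketch for reading the concatenated alignment (the "phylip-relaxed" format name is an assumption for .phy files whose sequencing ids exceed the strict 10-character PHYLIP limit; use "phylip" otherwise):

```python
# Minimal sketch: read the concatenated alignment with Biopython.
# "phylip-relaxed" is an assumption about the .phy flavour used here.
from Bio import AlignIO

aln = AlignIO.read("concatenated_nt.phy", "phylip-relaxed")
print(len(aln), "samples x", aln.get_alignment_length(), "positions")
print(aln[0].id, str(aln[0].seq)[:60], "...")  # first sample, first 60 columns
```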

  12. HSIP Law Enforcement Locations in New Mexico

    • catalog.data.gov
    • gstore.unm.edu
    Updated Dec 2, 2020
    + more versions
    Cite
    (Point of Contact) (2020). HSIP Law Enforcement Locations in New Mexico [Dataset]. https://catalog.data.gov/dataset/hsip-law-enforcement-locations-in-new-mexico
    Explore at:
    Dataset updated
    Dec 2, 2020
    Dataset provided by
    (Point of Contact)
    Area covered
    New Mexico
    Description

    Law Enforcement Locations: any location where sworn officers of a law enforcement agency are regularly based or stationed. Law enforcement agencies "are publicly funded and employ at least one full-time or part-time sworn officer with general arrest powers". This is the definition used by the US Department of Justice - Bureau of Justice Statistics (DOJ-BJS) for their Law Enforcement Management and Administrative Statistics (LEMAS) survey. Although LEMAS only includes non-federal agencies, this dataset includes locations for federal, state, local, and special-jurisdiction law enforcement agencies. Law enforcement agencies include, but are not limited to, municipal police, county sheriffs, state police, school police, park police, railroad police, federal law enforcement agencies, departments within non-law-enforcement federal agencies charged with law enforcement (e.g., US Postal Inspectors), and cross-jurisdictional authorities (e.g., Port Authority Police).

    In general, the requirements and training for becoming a sworn law enforcement officer are set by each state. Law enforcement agencies themselves are not chartered or licensed by their state. County, city, and other government authorities within each state are usually empowered by state law to set up or disband law enforcement agencies. Generally, sworn law enforcement officers must report which agency employs them to the state. Although TGS's intention is to include only locations associated with agencies that meet the above definition, TGS has discovered a few locations associated with agencies that are not publicly funded. TGS deleted these locations as it became aware of them, but some may still exist in this dataset. Personal homes, administrative offices, and temporary locations are intended to be excluded from this dataset; however, some personal homes are included because the New Mexico Mounted Police work out of their homes.

    TGS has made a concerted effort to include all local police; county sheriffs; state police and/or highway patrol; Bureau of Indian Affairs; Bureau of Land Management; Bureau of Reclamation; U.S. Park Police; Bureau of Alcohol, Tobacco, Firearms, and Explosives; U.S. Marshals Service; U.S. Fish and Wildlife Service; National Park Service; U.S. Immigration and Customs Enforcement; and U.S. Customs and Border Protection. This dataset is comprised completely of license-free data. FBI entities are intended to be excluded from this dataset, but a few may be included.

    The Law Enforcement dataset and the Correctional Institutions dataset were merged into one working file. TGS processed them as one file and then separated them for delivery purposes. With the merge of the Law Enforcement and the Correctional Institutions datasets, the NAICS Codes & Descriptions were assigned based on the facility's main function, which was determined by the entity's name, facility type, web research, and state-supplied data. In instances where the entity's primary function is both law enforcement and corrections, the NAICS Codes and Descriptions are assigned based on the dataset in which the record is located (i.e., a facility that serves as both a Sheriff's Office and as a jail is designated as [NAICSDESCR]="SHERIFFS' OFFICES (EXCEPT COURT FUNCTIONS ONLY)" in the Law Enforcement layer and as [NAICSDESCR]="JAILS (EXCEPT PRIVATE OPERATION OF)" in the Correctional Institutions layer).

    Records with "-DOD" appended to the end of the [NAME] value are located on a military base, as defined by the Defense Installation Spatial Data Infrastructure (DISDI) military installations and military range boundaries. "#" and "*" characters were automatically removed from standard fields that TGS populated. Double spaces were replaced by single spaces in these same fields. Text fields in this dataset have been set to all upper case to facilitate consistent database engine search results. All diacritics (e.g., the German umlaut or the Spanish tilde) have been replaced with their closest equivalent English character to facilitate use with database systems that may not support diacritics. The currentness of this dataset is indicated by the [CONTDATE] field. Based on the values in this field, the oldest record dates from 08/14/2006 and the newest record dates from 10/23/2009.
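    The field normalization described above (stripping "#" and "*", collapsing double spaces, upper-casing, and folding diacritics to their closest English characters) can be reproduced with a short sketch; this is an illustration, not TGS's actual tooling:

```python
# Illustrative sketch of the field normalization described above:
# strip "#" and "*", collapse double spaces, upper-case, and fold
# diacritics (e.g. umlaut, tilde) to their closest ASCII equivalents.
import re
import unicodedata

def normalize_field(value: str) -> str:
    value = value.replace("#", "").replace("*", "")
    value = re.sub(r" {2,}", " ", value)
    value = unicodedata.normalize("NFKD", value)             # split base char + accent
    value = value.encode("ascii", "ignore").decode("ascii")  # drop the accents
    return value.upper().strip()

print(normalize_field("Peña  Blanca* Police Dept#"))  # -> PENA BLANCA POLICE DEPT
```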

  13. Simple download service (Atom) of the dataset: Linear fishing lot in Saône-et-Loire

    • data.europa.eu
    unknown
    Cite
    Simple download service (Atom) of the dataset: Linear fishing lot in Saône-et-Loire [Dataset]. https://data.europa.eu/data/datasets/fr-120066022-srv-265262b3-4d39-43a2-b1f4-e325bdce83d5
    Explore at:
    unknown; available download formats
    Description

    Linear lot of the river public domain allocated to an AAPPMA. The fishing lots are defined by a Prefectural Decree (Article R435-16 et seq. of the Environmental Code). Description of the data (attribute name: definition; type/length; unit of measurement or constraints):

    • ID_LOT_PECHE_L: geomap identifier; Character (5)
    • CODE_HYDRO_COURS_EAU: hydrographic code of the river; Character (8)
    • ID_LOT_PUBLIC: fishing lot identifier; Character (6)
    • NOM_COURS_EAU: name of the watercourse; Character (50)
    • location of the fishing lot on the watercourse; Character (250)
    • X_LIM_AMONT: X coordinate (L93) of the upstream limit of the fishing lot
    • Y_LIM_AMONT: Y coordinate (L93) of the upstream limit of the fishing lot
    • X_LIM_AVAL: X coordinate (L93) of the downstream limit of the fishing lot
    • Y_LIM_AVAL: Y coordinate (L93) of the downstream limit of the fishing lot
    • length of the fishing lot
    • ASSO_PECHE: fishing association that manages the lot; Character (50)
    • COM_ASSO_PECHE: commune where the fishing association is located; Character (35)
    • NB_LICENCE_AMATEUR: number of amateur fishing licences; Character (20); a number or "authorised"/"unauthorised"
    • NB_LICENCE_PRO: number of professional fishing licences; Character (20); a number or "authorised"/"unauthorised"
    • CARPE_NUIT: night carp fishing authorised ("YES") or not authorised ("NO"); Character (3)
    • DATE_MISE_A_JOUR: date of the last data update

  14. Star Wars social network

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Apr 25, 2023
    Cite
    Evelina Gabasova; Evelina Gabasova (2023). Star Wars social network [Dataset]. http://doi.org/10.5281/zenodo.1411479
    Explore at:
    zip; available download formats
    Dataset updated
    Apr 25, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Evelina Gabasova; Evelina Gabasova
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    Star Wars social network

    This dataset contains the social network of Star Wars characters extracted from movie scripts. In short, two characters are connected if they speak together within the same scene. The data contain characters and links from episodes I to VII.

    How the data were created is described in my blog posts. The associated code is available in the main GitHub repository evelinag/StarWars-social-network.

    Contents of the files are the following:

    • starwars-episode-N-interactions.json contains the social network extracted from Episode N, where the links between characters are defined by the times the characters speak within the same scene.

    • starwars-episode-N-mentions.json contains the social network extracted from Episode N, where the links between characters are defined by the times the characters are mentioned within the same scene.

    • starwars-episode-N-interactions-allCharacters.json is the interactions network with R2-D2 and Chewbacca added in, using data from the mentions network.

    • starwars-full-... contain the corresponding social networks for the whole set of 6 episodes.

    Description of networks

    The json files representing the networks contain the following information:

    Nodes

    The nodes contain the following fields:

    • name: Name of the character
    • value: Number of scenes the character appeared in
    • colour: Colour in the visualization

    Links

    Links represent connections between characters. The link information corresponds to:

    • source: zero-based index of the character at one end of the link; the order of nodes is the order in which they are listed in the "nodes" element
    • target: zero-based index of the character at the other end of the link.
    • value: number of scenes where the "source character" and "target character" of the link appeared together. Please note that the network is undirected: which character represents the source and which the target is arbitrary; they correspond only to the two ends of the link.
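    A minimal networkx sketch for loading one of the interaction networks (the file name follows the pattern above with N = 4; the keys are as documented):

```python
# Minimal sketch: build one interaction network with networkx.
# File name follows the starwars-episode-N-interactions.json pattern (N = 4).
import json
import networkx as nx

with open("starwars-episode-4-interactions.json", encoding="utf-8") as f:
    data = json.load(f)

G = nx.Graph()
for i, node in enumerate(data["nodes"]):                  # node order defines link indices
    G.add_node(i, name=node["name"], scenes=node["value"], colour=node["colour"])
for link in data["links"]:
    G.add_edge(link["source"], link["target"], weight=link["value"])

# Characters ranked by total shared scenes (weighted degree).
top = sorted(G.degree(weight="weight"), key=lambda kv: -kv[1])[:5]
print([(G.nodes[i]["name"], w) for i, w in top])
```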
  15. HSIP Correctional Institutions in New Mexico

    • catalog.data.gov
    • s.cnmilf.com
    • +1more
    Updated Dec 2, 2020
    Cite
    (Point of Contact) (2020). HSIP Correctional Institutions in New Mexico [Dataset]. https://catalog.data.gov/dataset/hsip-correctional-institutions-in-new-mexico
    Explore at:
    Dataset updated
    Dec 2, 2020
    Dataset provided by
    (Point of Contact)
    Area covered
    New Mexico
    Description

    Jails and Prisons (Correctional Institutions). The Jails and Prisons sub-layer is part of the Emergency Law Enforcement Sector and the Critical Infrastructure Category. A Jail or Prison consists of any facility or location where individuals are regularly and lawfully detained against their will. This includes Federal and State prisons, local jails, and juvenile detention facilities, as well as law enforcement temporary holding facilities. Work camps, including camps operated seasonally, are included if they otherwise meet the definition.

    A Federal Prison is a facility operated by the Federal Bureau of Prisons for the incarceration of individuals. A State Prison is a facility operated by a state, commonwealth, or territory of the US for the incarceration of individuals for a term usually longer than 1 year. A Juvenile Detention Facility is a facility for the incarceration of those who have not yet reached the age of majority (usually 18 years). A Local Jail is a locally administered facility that holds inmates beyond arraignment (usually 72 hours) and is staffed by municipal or county employees. A temporary holding facility, sometimes referred to as a "police lock up" or "drunk tank", is a facility used to detain people prior to arraignment. Locations that are administrative offices only are excluded from the dataset. This definition of Jails is consistent with that used by the Department of Justice (DOJ) in their "National Jail Census", with the exception of "temporary holding facilities", which the DOJ excludes. Locations which function primarily as law enforcement offices are included in this dataset if they have holding cells.

    If the facility is enclosed with a fence, wall, or structure with a gate around the buildings only, the locations were depicted as "on entity" at the center of the facility. If the facility's buildings are not enclosed, the locations were depicted as "on entity" on the main building or "block face" on the correct street segment. Personal homes, administrative offices, and temporary locations are intended to be excluded from this dataset. TGS has made a concerted effort to include all correctional institutions. This dataset includes non-license-restricted data from the following federal agencies: Bureau of Indian Affairs; Bureau of Reclamation; U.S. Park Police; Federal Bureau of Prisons; Bureau of Alcohol, Tobacco, Firearms and Explosives; U.S. Marshals Service; U.S. Fish and Wildlife Service; National Park Service; U.S. Immigration and Customs Enforcement; and U.S. Customs and Border Protection. This dataset is comprised completely of license free data.

    The Law Enforcement dataset and the Correctional Institutions dataset were merged into one working file. TGS processed them as one file and then separated them for delivery purposes. With the merge of the Law Enforcement and the Correctional Institutions datasets, NAICS Codes & Descriptions were assigned based on the facility's main function, which was determined by the entity's name, facility type, web research, and state supplied data. In instances where the entity's primary function is both law enforcement and corrections, the NAICS Codes and Descriptions are assigned based on the dataset in which the record is located (i.e., a facility that serves as both a Sheriff's Office and as a jail is designated as [NAICSDESCR]="SHERIFFS' OFFICES (EXCEPT COURT FUNCTIONS ONLY)" in the Law Enforcement layer and as [NAICSDESCR]="JAILS (EXCEPT PRIVATE OPERATION OF)" in the Correctional Institutions layer).

    Records with "-DOD" appended to the end of the [NAME] value are located on a military base, as defined by the Defense Installation Spatial Data Infrastructure (DISDI) military installations and military range boundaries. "#" and "*" characters were automatically removed from standard fields that TGS populated. Double spaces were replaced by single spaces in these same fields. Text fields in this dataset have been set to all upper case to facilitate consistent database engine search results. All diacritics (e.g., the German umlaut or the Spanish tilde) have been replaced with their closest equivalent English character to facilitate use with database systems that may not support diacritics. The currentness of this dataset is indicated by the [CONTDATE] field. Based on the values in this field, the oldest record dates from 12/27/2004 and the newest record dates from 09/08/2009.

  16. Landscape Character Assessment - Dataset - data.gov.uk

    • ckan.publishing.service.gov.uk
    Updated Jan 5, 2016
    + more versions
    Cite
    ckan.publishing.service.gov.uk (2016). Landscape Character Assessment - Dataset - data.gov.uk [Dataset]. https://ckan.publishing.service.gov.uk/dataset/landscape-character-assessment2
    Explore at:
    Dataset updated
    Jan 5, 2016
    Dataset provided by
    CKAN (https://ckan.org/)
    Description

    Landscape Character Assessment - Landscape Character Type defines the boundaries of landscape parcels which have a distinctive combination of common landscape features. They provide context to conservation matters and planning policy. By accessing this data you will have been deemed to have accepted the terms and conditions of the Public Sector End User Licence - INSPIRE.

  17. Sodium Monitoring Dataset

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    • +1more
    Updated Apr 21, 2025
    + more versions
    Cite
    Agricultural Research Service (2025). Sodium Monitoring Dataset [Dataset]. https://catalog.data.gov/dataset/sodium-monitoring-dataset-72256
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service (https://www.ars.usda.gov/)
    Description

    The Agricultural Research Service of the US Department of Agriculture (USDA), in collaboration with other government agencies, has a program to track changes in the sodium content of commercially processed and restaurant foods. This monitoring program includes these activities:

    • Tracking sodium levels of ~125 popular foods, called "Sentinel Foods," by periodically sampling them at stores and restaurants around the country, followed by laboratory analyses.
    • Tracking levels of "related" nutrients that could change when manufacturers reformulate their foods to reduce sodium; these related nutrients are potassium, total and saturated fat, total dietary fiber, and total sugar.
    • Sharing the results of these monitoring activities with the public, periodically in the Sodium Monitoring Dataset and the USDA National Nutrient Database for Standard Reference, and once every two years in the Food and Nutrient Database for Dietary Studies.

    The Sodium Monitoring Dataset is downloadable in Excel spreadsheet format. Resources in this dataset:

    • Resource Title: Data Dictionary. File Name: SodiumMonitoringDataset_datadictionary.csv. Resource Description: defines variables, descriptions, data types, character lengths, etc. for each of the spreadsheets in the Excel data file: Sentinel Foods - Baseline; Priority-2 Foods - Baseline; Sentinel Foods - Monitoring; Priority-2 Foods - Monitoring.
    • Resource Title: Sodium Monitoring Dataset (MS Excel download). File Name: SodiumMonitoringDatasetUpdatedJuly2616.xlsx. Resource Description: Microsoft Excel workbook with sheets: Sentinel Foods - Baseline; Priority-2 Foods - Baseline; Sentinel Foods - Monitoring; Priority Foods - Monitoring.

  18. Emergency Medical Service Stations

    • wifire-data.sdsc.edu
    • gis-calema.opendata.arcgis.com
    csv, esri rest +4
    Updated May 22, 2019
    Cite
    CA Governor's Office of Emergency Services (2019). Emergency Medical Service Stations [Dataset]. https://wifire-data.sdsc.edu/dataset/emergency-medical-service-stations
    Explore at:
    geojson, zip, csv, kml, html, esri rest; available download formats
    Dataset updated
    May 22, 2019
    Dataset provided by
    CA Governor's Office of Emergency Services
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description
    The dataset represents Emergency Medical Services (EMS) locations in the United States and its territories. EMS Stations are part of the Fire Stations / EMS Stations HSIP Freedom sub-layer, which in turn is part of the Emergency Services and Continuity of Government Sector, which is itself a part of the Critical Infrastructure Category. The EMS stations dataset consists of any location where emergency medical service (EMS) personnel are stationed or based out of, or where equipment that such personnel use in carrying out their jobs is stored for ready use. Ambulance services are included even if they only provide transportation services, but not if they are located at, and operated by, a hospital. If an independent ambulance service or EMS provider happens to be collocated with a hospital, it will be included in this dataset. The dataset includes both private and governmental entities. A concerted effort was made to include all emergency medical service locations in the United States and its territories. This dataset is comprised completely of license free data. Records with "-DOD" appended to the end of the [NAME] value are located on a military base, as defined by the Defense Installation Spatial Data Infrastructure (DISDI) military installations and military range boundaries. At the request of NGA, text fields in this dataset have been set to all upper case to facilitate consistent database engine search results. At the request of NGA, all diacritics (e.g., the German umlaut or the Spanish tilde) have been replaced with their closest equivalent English character to facilitate use with database systems that may not support diacritics. The currentness of this dataset is indicated by the [CONTDATE] field. Based upon this field, the oldest record dates from 12/29/2004 and the newest record dates from 01/11/2010.

    This dataset represents EMS stations: any location where emergency medical service (EMS) personnel are stationed or based out of, or where equipment that such personnel use in carrying out their jobs is stored for ready use. Homeland Security Use Cases (use cases describe how the data may be used and help to define and clarify requirements):

    1. An assessment of whether or not the total emergency medical services capability in a given area is adequate.
    2. A list of resources to draw upon by surrounding areas when local resources have temporarily been overwhelmed by a disaster; route analysis can determine those entities that are able to respond the quickest.
    3. A resource for Emergency Management planning purposes.
    4. A resource for catastrophe response to aid in the retrieval of equipment by outside responders in order to deal with the disaster.
    5. A resource for situational awareness planning and response for Federal Government events.


  19. Definitions of degen1 coding and of character sets.

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Jerome C. Regier; Andreas Zwick (2023). Definitions of degen1 coding and of character sets. [Dataset]. http://doi.org/10.1371/journal.pone.0023408.t001
    Explore at:
    xls; available download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Jerome C. Regier; Andreas Zwick
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Definitions of degen1 coding and of character sets.

  20. Data from: Public Health Departments

    • arc-gis-hub-home-arcgishub.hub.arcgis.com
    • nconemap.gov
    • +3more
    Updated Jan 17, 2018
    Cite
    CA Governor's Office of Emergency Services (2018). Public Health Departments [Dataset]. https://arc-gis-hub-home-arcgishub.hub.arcgis.com/maps/29c3979a34ba4d509582a0e2adf82fd3
    Explore at:
    Dataset updated
    Jan 17, 2018
    Dataset authored and provided by
    CA Governor's Office of Emergency Services
    Area covered
    Description

    State and Local Public Health Departments in the United States. Governmental public health departments are responsible for creating and maintaining conditions that keep people healthy. A local health department may be locally governed, part of a region or district, an office or administrative unit of the state health department, or a hybrid of these. Furthermore, each community has a unique "public health system" comprising individuals and public and private entities that are engaged in activities that affect the public's health. (Excerpted from the Operational Definition of a functional local health department, National Association of County and City Health Officials, November 2005.) Please reference http://www.naccho.org/topics/infrastructure/accreditation/upload/OperationalDefinitionBrochure-2.pdf for more information.

    Facilities involved in direct patient care are intended to be excluded from this dataset; however, some of the entities represented in this dataset serve as both administrative and clinical locations. This dataset includes only the headquarters of Public Health Departments, not their satellite offices. Some health departments encompass multiple counties; therefore, not every county will be represented by an individual record. Also, some areas will appear to be over-represented depending on the structure of the health departments in that particular region. Town health officers are included in Vermont and boards of health are included in Massachusetts; both of these types of entities are elected or appointed to a term of office during which they make and enforce policies and regulations related to the protection of public health. Visiting nurses are represented in this dataset if they are contracted through the local government to fulfill the duties and responsibilities of the local health organization. Since many town health officers in Vermont work out of their personal homes, TechniGraphics represented these entities at the town hall; this is denoted in the [DIRECTIONS] field. Effort was made by TechniGraphics to verify whether or not each health department tracks statistics on communicable diseases.

    Records with "-DOD" appended to the end of the [NAME] value are located on a military base, as defined by the Defense Installation Spatial Data Infrastructure (DISDI) military installations and military range boundaries. "#" and "*" characters were automatically removed from standard HSIP fields populated by TechniGraphics. Double spaces were replaced by single spaces in these same fields. At the request of NGA, text fields in this dataset have been set to all upper case to facilitate consistent database engine search results. At the request of NGA, all diacritics (e.g., the German umlaut or the Spanish tilde) have been replaced with their closest equivalent English character to facilitate use with database systems that may not support diacritics. The currentness of this dataset is indicated by the [CONTDATE] field. Based on this field, the oldest record dates from 11/18/2009 and the newest record dates from 01/08/2010.

Neslihan Suzen (2020). LScD (Leicester Scientific Dictionary) [Dataset]. http://doi.org/10.25392/leicester.data.9746900.v3

LScD (Leicester Scientific Dictionary)

Explore at:
Available download formats: docx
Dataset updated
Apr 15, 2020
Dataset provided by
University of Leicester
Authors
Neslihan Suzen
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Area covered
Leicester
Description

LScD (Leicester Scientific Dictionary)
April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

[Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus), Version 2*. All pre-processing steps applied to build this version of the dictionary are the same as in Version 2** and can be found in the description of Version 2 below; the explanation is not repeated here. After pre-processing, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are likewise the same as those described for LScD Version 2 below.

* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2

[Version 2] Getting Started
This document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). The dictionary is created for use in future work on quantifying the meaning of research texts. R code for producing the dictionary from the LSC, with instructions for its use, is available in [2]. The code can also be applied to lists of texts from other sources, although amendments may be required.

LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains the title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. It was collected in July 2018 and records the number of citations from publication date to July 2018. The total number of documents in LSC is 1,673,824.

LScD is an ordered list of words from the texts of abstracts in LSC. The dictionary stores 974,238 unique words and is sorted by the number of documents containing each word, in descending order. All words in the LScD are in stemmed form. The LScD contains the following information:
1. Unique words in abstracts
2. Number of documents containing each word
3. Number of appearances of a word in the entire corpus

Processing the LSC
Step 1. Downloading the LSC: Use of the LSC is subject to acceptance of a link request by email. To access the LSC for research purposes, please email ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.
Step 2. Importing the Corpus to R: The full R code for processing the corpus can be found on GitHub [2]. All following steps can be applied to an arbitrary list of texts from any source, with changes of parameters. The structure of the corpus, such as the file format and the names (and positions) of fields, should be taken into account when applying the code. The organisation of the CSV files of LSC is described in the README file for LSC [1].
Step 3. Extracting Abstracts and Saving Metadata: The metadata, which comprise all fields in a document except the abstract, are separated from the abstracts and saved as MetaData.R. A condensed sketch of Steps 2 and 3 follows.
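The sketch below is an illustration only, not the authors' script (the actual code is in the GitHub repository cited in [2]). It assumes the LSC ships as a directory of CSV files with an "Abstract" column; the real file layout and field names are given in the LSC README [1].

    files <- list.files("path/to/LSC", pattern = "\\.csv$", full.names = TRUE)
    lsc   <- do.call(rbind, lapply(files, read.csv, stringsAsFactors = FALSE))

    abstracts <- lsc$Abstract                             # texts for Step 4
    metadata  <- lsc[, setdiff(names(lsc), "Abstract")]   # all remaining fields
    save(metadata, file = "MetaData.R")                   # saved as in Step 3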
Fields of the metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.

Step 4. Text Pre-processing on the Collection of Abstracts: This section describes the approaches used to pre-process the abstracts of the LSC (a condensed R sketch follows the list).
1. Removing punctuation and special characters: All non-alphanumeric characters are substituted by a space. The character "-" is not substituted in this step, so that words like "z-score", "non-payment" and "pre-processing" keep their actual meaning; uniting prefixes with words is performed in a later step.
2. Lowercasing the text data: The entire collection of texts is converted to lowercase, to avoid treating words like "Corpus", "corpus" and "CORPUS" differently.
3. Uniting prefixes of words: Words containing prefixes joined with the character "-" are united into a single word. The list of prefixes used in this research is given in the file "list_of_prefixes.csv"; most are extracted from [4], with the commonly used prefixes 'e', 'extra', 'per', 'self' and 'ultra' added.
4. Substitution of words: Some words joined with "-" require an additional substitution step so that their meaning is not lost when the character "-" is removed. Examples are "z-test", "well-known" and "chi-square", which are substituted by "ztest", "wellknown" and "chisquare". Such words were identified by sampling abstracts from the LSC; the full list and the substitution decisions are given in the file "list_of_substitution.csv".
5. Removing the character "-": All remaining "-" characters are replaced by a space.
6. Removing numbers: All digits not included in a word are replaced by a space. Words containing both digits and letters are kept, because alphanumeric tokens such as chemical formulae may be important for the analysis; examples are "co2", "h2o" and "21st".
7. Stemming: Stemming converts inflected words into their word stem, uniting several forms of a word with similar meaning into one form and saving memory and time [5]. All words in the LScD are stemmed.
8. Stop word removal: Stop words are extremely common words that provide little value in a language, such as 'I', 'the' and 'a'. The 'tm' package in R is used to remove them [6]; the package lists 174 English stop words.
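A minimal sketch of the Step 4 pipeline is given below, using base R with the 'SnowballC' and 'tm' packages. The prefix and substitution rules shown are tiny illustrative stand-ins for the full "list_of_prefixes.csv" and "list_of_substitution.csv" files, and the output shown is approximate.

    library(tm)         # stopwords()
    library(SnowballC)  # wordStem()

    preprocess <- function(txt) {
      txt <- gsub("[^[:alnum:]-]", " ", txt)               # 1. specials -> space, keep "-"
      txt <- tolower(txt)                                  # 2. lowercase
      txt <- gsub("\\b(pre|non|self|ultra)-", "\\1", txt)  # 3. unite sample prefixes
      txt <- gsub("\\bz-test\\b", "ztest", txt)            # 4. sample substitution
      txt <- gsub("-", " ", txt)                           # 5. remaining hyphens -> space
      txt <- gsub("\\b[[:digit:]]+\\b", " ", txt)          # 6. standalone numbers -> space
      words <- unlist(strsplit(txt, "\\s+"))
      words <- words[words != ""]
      words <- wordStem(words, language = "english")       # 7. Porter stemming
      words <- words[!(words %in% stopwords("en"))]        # 8. drop tm's stop words
      paste(words, collapse = " ")
    }

    preprocess("Pre-processing the Z-test results for 21 samples (CO2, H2O)")
    # roughly: "preprocess ztest result sampl co2 h2o"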
Step 5. Writing the LScD into CSV Format: There are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written to the file "LScD.csv".

The Organisation of the LScD
The total number of words in the file "LScD.csv" is 974,238. Each field is described below:

Word: Unique words from the corpus, in lowercase and in stemmed form. The field is sorted by the number of documents containing the word, in descending order.

Number of Documents Containing the Word: A binary count is used: if a word exists in an abstract, it counts as 1 for that document; if the word occurs more than once in the same document, the count is still 1. The total number of documents containing the word is the sum of these 1s over the entire corpus.

Number of Appearances in Corpus: How many times the word occurs in the corpus when the corpus is treated as one large document.

Instructions for R Code
LScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as an RData file and in CSV format. The outputs are:

Metadata File: All fields in a document excluding abstracts: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.
File of Abstracts: All abstracts after the pre-processing steps defined in Step 4.
DTM: The Document Term Matrix constructed from the LSC [6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.
LScD: An ordered list of words from the LSC as defined in the previous section.

To use the code:
1. Download the folder 'LSC' and the files 'list_of_prefixes.csv' and 'list_of_substitution.csv'.
2. Open the LScD_Creation.R script.
3. Change the parameters in the script: set the full path of the directory with the source files and the full path of the directory for the output files.
4. Run the full code.

References
[1] N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1
[2] N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION
[3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[4] A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.
[5] C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved Porter's stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.
[6] I. Feinerer, "Introduction to the tm Package: Text Mining in R," 2013. Available: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
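The two dictionary counts described above can be read directly off a sparse document-term matrix. The sketch below uses the 'tm' package (with its 'slam' dependency); `clean` stands for a character vector of pre-processed abstracts, as produced by the Step 4 sketch, and the column headers of the output data frame are illustrative rather than the exact headers of "LScD.csv".

    library(tm)

    corpus <- VCorpus(VectorSource(clean))
    dtm <- DocumentTermMatrix(corpus,
                              control = list(wordLengths = c(1, Inf)))  # keep all word lengths

    # The DTM is stored in sparse triplet form: dtm$j holds the term index of
    # each non-zero entry, so tabulating it gives the binary per-document count.
    doc_freq  <- tabulate(dtm$j, nbins = ncol(dtm))  # documents containing the word
    total_cnt <- slam::col_sums(dtm)                 # appearances in the whole corpus

    lscd <- data.frame(Word = colnames(dtm),
                       Docs_Containing_Word = doc_freq,
                       Appearances_in_Corpus = as.integer(total_cnt))
    lscd <- lscd[order(-lscd$Docs_Containing_Word), ]  # descending by document count
    write.csv(lscd, "LScD.csv", row.names = FALSE)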
