72 datasets found
  1. LScD (Leicester Scientific Dictionary)

    • figshare.le.ac.uk
    docx
    Updated Apr 15, 2020
    + more versions
    Cite
    Neslihan Suzen (2020). LScD (Leicester Scientific Dictionary) [Dataset]. http://doi.org/10.25392/leicester.data.9746900.v3
    Explore at:
    docx; available download formats
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

    LScD (Leicester Scientific Dictionary), April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

    [Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus), Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in the description of Version 2 below; we do not repeat the explanation here. After pre-processing, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are also the same as those described for LScD Version 2 below.

    * Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
    ** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2

    [Version 2] Getting Started

    This document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). The dictionary was created for future work on the quantification of the meaning of research texts. R code for producing the dictionary from the LSC, with usage instructions, is available in [2]. The code can also be used for lists of texts from other sources; amendments to the code may be required.

    LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains a title, a list of authors, a list of categories, a list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018. The total number of documents in the LSC is 1,673,824.

    LScD is an ordered list of words from the texts of the abstracts in the LSC. The dictionary stores 974,238 unique words and is sorted by the number of documents containing each word, in descending order. All words in the LScD are in stemmed form. The LScD contains the following information:
    1. Unique words in abstracts
    2. Number of documents containing each word
    3. Number of appearances of a word in the entire corpus

    Processing the LSC

    Step 1. Downloading the LSC: use of the LSC is subject to acceptance of a request for the link by email. To access the LSC for research purposes, please email ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.

    Step 2. Importing the Corpus to R: the full R code for processing the corpus can be found on GitHub [2]. All of the following steps can be applied to an arbitrary list of texts from any source with changes of parameters. The structure of the corpus, such as the file format and the names (and positions) of fields, should be taken into account when applying the code. The organisation of the CSV files of the LSC is described in the README file for the LSC [1].

    Step 3. Extracting Abstracts and Saving Metadata: metadata (all fields in a document excluding the abstract) and the abstract field are separated. Metadata are then saved as MetaData.R. The fields of metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.

    Step 4. Text Pre-processing Steps on the Collection of Abstracts: this section presents our approach to pre-processing the abstracts of the LSC.
    1. Removing punctuation and special characters: all non-alphanumeric characters are substituted by a space. We did not substitute the character "-" in this step, because we need to keep words like "z-score", "non-payment" and "pre-processing" in order not to lose their actual meaning. Uniting prefixes with words is performed in a later step of pre-processing.
    2. Lowercasing the text data: lowercasing is performed to avoid treating words like "Corpus", "corpus" and "CORPUS" differently. The entire collection of texts is converted to lowercase.
    3. Uniting prefixes of words: words containing prefixes joined with the character "-" are united into a single word. The prefixes united for this research are listed in the file "list_of_prefixes.csv". Most of the prefixes are extracted from [4]; we also added the commonly used prefixes 'e', 'extra', 'per', 'self' and 'ultra'.
    4. Substitution of words: some words joined with "-" in the abstracts of the LSC require an additional substitution step to avoid losing their meaning before the character "-" is removed. Examples of such words are "z-test", "well-known" and "chi-square", which are substituted by "ztest", "wellknown" and "chisquare". Such words were identified by sampling abstracts from the LSC. The full list of such words and the substitution decisions are presented in the file "list_of_substitution.csv".
    5. Removing the character "-": all remaining "-" characters are replaced by a space.
    6. Removing numbers: all digits that are not part of a word are replaced by a space. All words that contain both digits and letters are kept, because alphanumeric tokens such as chemical formulas might be important for our analysis. Examples are "co2", "h2o" and "21st".
    7. Stemming: stemming is the process of converting inflected words into their word stem. This unites several forms of words with similar meaning into one form and also saves memory space and time [5]. All words in the LScD are stemmed to their word stem.
    8. Stop word removal: stop words are extremely common words that provide little value in a language; common English stop words are 'I', 'the', 'a', etc. We used the 'tm' package in R to remove stop words [6]; there are 174 English stop words listed in the package.

    Step 5. Writing the LScD into CSV Format: there are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written to the file "LScD.csv".

    The Organisation of the LScD

    The total number of words in the file "LScD.csv" is 974,238. Each field is described below:
    Word: unique words from the corpus, in lowercase and in their stem forms. The field is sorted by the number of documents that contain the word, in descending order.
    Number of Documents Containing the Word: a binary count is used: if a word exists in an abstract, it counts as 1; if the word occurs more than once in a document, the count is still 1. The total number of documents containing the word is the sum of these 1s over the entire corpus.
    Number of Appearances in Corpus: how many times the word occurs in the corpus when the corpus is considered as one large document.

    Instructions for R Code

    LScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as an RData file and in CSV format. The outputs are:
    Metadata File: includes all fields in a document excluding abstracts. Fields are List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.
    File of Abstracts: contains all abstracts after the pre-processing steps defined in Step 4.
    DTM: the Document Term Matrix constructed from the LSC [6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.
    LScD: an ordered list of words from the LSC as defined in the previous section.

    To use the code:
    1. Download the folder 'LSC', 'list_of_prefixes.csv' and 'list_of_substitution.csv'.
    2. Open the LScD_Creation.R script.
    3. Change the parameters in the script: replace them with the full path of the directory with the source files and the full path of the directory to write the output files.
    4. Run the full code.

    References
    [1] N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1
    [2] N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION
    [3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
    [4] A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.
    [5] C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved Porter's stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.
    [6] I. Feinerer, "Introduction to the tm Package: Text Mining in R," available online: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.
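    For illustration, here is a minimal Python sketch of the Step 4 pipeline described above. The published implementation is the R script LScD_Creation.R [2]; the prefix list, substitution list and stop-word set below are tiny stand-ins for the real files, and NLTK's PorterStemmer stands in for the stemmer used.

```python
# Minimal, illustrative sketch of the Step 4 pre-processing pipeline.
# The published implementation is the R script LScD_Creation.R [2];
# PREFIXES, SUBSTITUTIONS and STOP_WORDS are tiny samples of the real
# list_of_prefixes.csv, list_of_substitution.csv and tm stop-word list.
import re
from collections import Counter
from nltk.stem import PorterStemmer

PREFIXES = ["extra", "self", "ultra", "per", "pre", "non", "e"]
SUBSTITUTIONS = {"z-test": "ztest", "well-known": "wellknown",
                 "chi-square": "chisquare"}
STOP_WORDS = {"i", "the", "a", "an", "and", "of", "in", "to", "is"}
stemmer = PorterStemmer()

def preprocess(abstract: str) -> list[str]:
    text = re.sub(r"[^\w\s-]", " ", abstract)               # 1. drop punctuation, keep "-"
    text = text.lower()                                     # 2. lowercase
    text = re.sub(r"\b(" + "|".join(PREFIXES) + r")-(\w)",  # 3. unite prefixes:
                  r"\1\2", text)                            #    pre-processing -> preprocessing
    for old, new in SUBSTITUTIONS.items():                  # 4. substitute listed words
        text = text.replace(old, new)
    text = text.replace("-", " ")                           # 5. remove remaining hyphens
    text = re.sub(r"\b\d+\b", " ", text)                    # 6. drop standalone numbers; "co2", "21st" survive
    tokens = [stemmer.stem(t) for t in text.split()]        # 7. stem
    return [t for t in tokens if t not in STOP_WORDS]       # 8. remove stop words

# LScD counts: document frequency (binary per abstract) and total appearances.
doc_freq, total_freq = Counter(), Counter()
for abstract in ["Pre-processing of z-test scores and CO2 data from 2014 papers."]:
    tokens = preprocess(abstract)
    total_freq.update(tokens)
    doc_freq.update(set(tokens))                            # each word counted once per document
```

    Run over all 1,673,824 abstracts and sorted by doc_freq in descending order, this would reproduce the two LScD count fields described above, up to the stand-in word lists.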

  2. BHL Optical Character Recognition (OCR) - Full Text Export (new)

    • smithsonian.figshare.com
    • figshare.com
    bin
    Updated Mar 10, 2025
    Cite
    Joel Richard; Jacqueline Dearborn (2025). BHL Optical Character Recognition (OCR) - Full Text Export (new) [Dataset]. http://doi.org/10.25573/data.21422193.v22
    Explore at:
    bin; available download formats
    Dataset updated
    Mar 10, 2025
    Dataset provided by
    Smithsonian Libraries and Archives
    Authors
    Joel Richard; Jacqueline Dearborn
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The dataset contains a full export of the 60+ million pages of OCR content in the Biodiversity Heritage Library, for items hosted by BHL. For contextual information and key definitions about this dataset see the Biodiversity Heritage Library Open Data Collection and the data dictionary below.

    Data Dictionary: s.si.edu/bhlocrtxt
    Release Date: the 17th of each month
    Frequency: Monthly
    bureauCode: 452:11
    Access Level: public

  3. Player Experience in Video Game Character Analysis: A Study of Female Characters

    • data.niaid.nih.gov
    Updated Jun 30, 2024
    Cite
    de Guzman, Wendell; Chavez, John Xavier (2024). Player Experience in Video Game Character Analysis: A Study of Female Characters [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11641622
    Explore at:
    Dataset updated
    Jun 30, 2024
    Dataset provided by
    Mapúa University
    Authors
    de Guzman, Wendell; Chavez, John Xavier
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    This dataset is part of the study titled "Player Experience in Video Game Character Analysis: A Study of Female Characters", conducted at Mapúa University. The research aims to integrate player experience into an existing framework for video game character analysis.

    Content

    The dataset includes:

    A partial transcript of 5 semi-structured interviews with the key informants. Originally, 8 interviews were conducted, but the audio/video recordings for 3 interviews were lost and thus their transcripts are not available.

    Significant codes presented in tabulated form.

    Data Collection Method

    Data were collected through in-depth interviews conducted via Facebook Messenger and Discord from March to April 2024. Participants were various video game players from different backgrounds and age groups, ranging from 20 to 40 years old. Due to technical issues, the recordings of 3 interviews were lost, resulting in only 5 available transcripts.

    Data Processing and Analysis

    The 5 available interviews were transcribed verbatim. Data were analyzed using thematic analysis, involving initial coding, theme development, and refinement.

    Usage data

    The dataset is organized into several sections within a single Word document (.docx). The Word document has headings for navigation and a definition of terms.

    Limitations

    The dataset includes only 5 of the 8 interview transcripts due to technical difficulties encountered after the interviews were recorded. This may impact the comprehensiveness of the findings.

    Contextual Reference

    The manuscript associated with this dataset heavily references the works "A Structural Model for Player-Characters as Semiotic Constructs" (DOI: https://doi.org/10.26503/TODIGRA.V2I2.37) and "Object, me, symbiote, other: A social typology of player-avatar relationships" (DOI: https://doi.org/10.5210/FM.V20I2.5433), which explore the foundational frameworks for video game character analysis.

    For any further information or clarifications, please contact wbdg2000@gmail.com

  4. Namesakes

    • figshare.com
    json
    Updated Nov 20, 2021
    Cite
    Oleg Vasilyev; Aysu Altun; Nidhi Vyas; Vedant Dharnidharka; Erika Lampert; John Bohannon (2021). Namesakes [Dataset]. http://doi.org/10.6084/m9.figshare.17009105.v1
    Explore at:
    json; available download formats
    Dataset updated
    Nov 20, 2021
    Dataset provided by
    figshare (http://figshare.com/)
    Authors
    Oleg Vasilyev; Aysu Altun; Nidhi Vyas; Vedant Dharnidharka; Erika Lampert; John Bohannon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    Motivation: creating a challenging dataset for testing Named-Entity Linking.

    The Namesakes dataset consists of three closely related datasets: Entities, News and Backlinks. Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names. The News were collected as random news text chunks, containing mentions that either belong to the Entities dataset or can be easily confused with them. Backlinks were obtained from Wikipedia dump data with the intention of having mentions linked to the entities of the Entities dataset. The Entities and News are human-labeled, resolving the mentions of the entities.

    Methods

    Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names: the most popular people names, the most popular locations, and organizations with name ambiguity. In each Entities text chunk, the named entities whose names are similar to the chunk's Wikipedia page name are labeled. For labeling, these entities were suggested to human annotators (odetta.ai) to tag as "Same" (same as the page entity) or "Other". The labeling was done by 6 experienced annotators who passed a preliminary trial task. The only accepted tags are those assigned in agreement by not less than 5 annotators and then passed through reconciliation with an experienced reconciliator.

    The News were collected as random news text chunks, containing mentions which either belong to the Entities dataset or can be easily confused with them. In each News text chunk one mention was selected for labeling, and 3-10 Wikipedia pages from Entities were suggested as the labels for an annotator to choose from. The labeling was done by 3 experienced annotators (odetta.ai), after the annotators passed a preliminary trial task. The results were reconciled by an experienced reconciliator. All the labeling was done using Lighttag (lighttag.io).

    Backlinks were obtained from Wikipedia dump data (dumps.wikimedia.org/enwiki/20210701) with intention to have mentions linked to the entities of the Entity dataset. The backlinks were filtered to leave only mentions in a good quality text; each text was cut 1000 characters after the last mention.

    Usage Notes

    Entities: File: Namesakes_entities.jsonl. The Entities dataset consists of 4148 Wikipedia text chunks containing human-tagged mentions of entities. Each mention is tagged either as "Same" (the mention is of this Wikipedia page entity) or "Other" (the mention is of some other entity that merely has the same or a similar name). The Entities dataset is a jsonl list; each item is a dictionary with the following keys and values:

    • 'pagename': page name of the Wikipedia page.
    • 'pageid': page id of the Wikipedia page.
    • 'title': title of the Wikipedia page.
    • 'url': URL of the Wikipedia page.
    • 'text': the text chunk from the Wikipedia page.
    • 'entities': list of the mentions in the page text; each mention is represented by a dictionary with the keys:
      • 'text': the mention as a string from the page text.
      • 'start': start character position of the mention in the text.
      • 'end': end (one-past-last) character position of the mention in the text.
      • 'tag': annotation tag given as a string, either 'Same' or 'Other'.

    News: File: Namesakes_news.jsonl. The News dataset consists of 1000 news text chunks, each one with a single annotated entity mention. The annotation either points to the corresponding entity from the Entities dataset (if the mention is of that entity), or indicates that the mentioned entity does not belong to the Entities dataset. The News dataset is a jsonl list; each item is a dictionary with the following keys and values:

    • 'id_text': id of the sample.
    • 'text': the text chunk.
    • 'urls': list of URLs of Wikipedia entities suggested to labelers for identification of the entity mentioned in the text.
    • 'entity': a dictionary describing the annotated entity mention in the text:
      • 'text': the mention as a string found by an NER model in the text.
      • 'start': start character position of the mention in the text.
      • 'end': end (one-past-last) character position of the mention in the text.
      • 'tag': this key exists only if the mentioned entity is annotated as belonging to the Entities dataset; if so, the value is a dictionary identifying the Wikipedia page assigned by annotators to the mentioned entity:
        • 'pageid': Wikipedia page id.
        • 'pagetitle': page title.
        • 'url': page URL.

    Backlinks dataset: The Backlinks dataset consists of two parts: dictionary Entity-to-Backlinks and Backlinks documents. The dictionary points to backlinks for each entity of the Entity dataset (if any backlinks exist for the entity). The Backlinks documents are the backlinks Wikipedia text chunks with identified mentions of the entities from the Entities dataset.

    Each mention is identified by surrounding double square brackets, e.g. "Muir built a small cabin along [[Yosemite Creek]].". However, if the mention differs from the exact entity name, the double square brackets wrap both the exact name and, separated by '|', the mention string to the right, for example: "Muir also spent time with photographer [[Carleton E. Watkins | Carleton Watkins]] and studied his photographs of Yosemite.".

    The Entity-to-Backlinks is a jsonl with 1527 items. File: Namesakes_backlinks_entities.jsonl. Each item is a tuple of:

    • Entity name.
    • Entity Wikipedia page id.
    • Backlinks ids: a list of pageids of backlink documents.

    The Backlinks documents is a jsonl with 26903 items. File: Namesakes_backlinks_texts.jsonl. Each item is a dictionary:

    • 'pageid': id of the Wikipedia page.
    • 'title': title of the Wikipedia page.
    • 'content': text chunk from the Wikipedia page, with all mentions in double brackets; the text is cut 1000 characters after the last mention, and the cut is denoted as '...[CUT]'.
    • 'mentions': list of the mentions from the text, for convenience. Each mention is a tuple of: entity name; entity Wikipedia page id; sorted list of all character indexes at which the mention occurrences start in the text.
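    A minimal Python sketch for reading the jsonl files and parsing the double-bracket mention markup described above (the key names follow the description; the file paths are assumed to be in the working directory):

```python
# Minimal sketch: load the Namesakes jsonl files and parse the
# [[Entity]] / [[Entity | mention]] markup in the Backlinks texts.
# Keys follow the dataset description above; paths are assumptions.
import json
import re

def read_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

entities = read_jsonl("Namesakes_entities.jsonl")
tags = [(m["text"], m["tag"]) for page in entities for m in page["entities"]]

MENTION = re.compile(r"\[\[([^\]|]+?)(?:\s*\|\s*([^\]]+?))?\]\]")

def parse_mentions(content: str):
    """Return (entity name, surface mention) pairs from a Backlinks text."""
    return [(m.group(1).strip(), (m.group(2) or m.group(1)).strip())
            for m in MENTION.finditer(content)]

for doc in read_jsonl("Namesakes_backlinks_texts.jsonl")[:3]:
    print(doc["title"], parse_mentions(doc["content"]))
```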

  5. Dataset of books called ABC Japanese-English dictionary : an entirely new method of classification of the Chinese-Japanese characters

    • workwithdata.com
    Updated Apr 17, 2025
    Cite
    Work With Data (2025). Dataset of books called ABC Japanese-English dictionary : an entirely new method of classification of the Chinese-Japanese characters [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=ABC+Japanese-English+dictionary+%3A+an+entirely+new+method+of+classification+of+the+Chinese-Japanese+characters
    Explore at:
    Dataset updated
    Apr 17, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about books. It has 1 row and is filtered where the book is ABC Japanese-English dictionary : an entirely new method of classification of the Chinese-Japanese characters. It features 7 columns including author, publication date, language, and book publisher.

  6. Game of Thrones mortality and survival dataset

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    zip
    Updated May 30, 2023
    Cite
    Reidar Lystad; Benjamin Brown (2023). Game of Thrones mortality and survival dataset [Dataset]. http://doi.org/10.6084/m9.figshare.8259680.v1
    Explore at:
    zip; available download formats
    Dataset updated
    May 30, 2023
    Dataset provided by
    figshare
    Authors
    Reidar Lystad; Benjamin Brown
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset includes data from Game of Thrones Seasons 1–8. The dataset comprises two separate datasets and an accompanying data dictionary. The character dataset contains 359 observations (i.e. characters) and 35 variables, including information about sociodemographics, exposures, and mortality. The episode dataset contains 73 observations (i.e. episodes) and 8 variables, including information about episode running time.

    An earlier version of the dataset, which included data from Game of Thrones Seasons 1–7 only, was used in the following original research article: Lystad RP, Brown BT. "Death is certain, the time is not": mortality and survival in Game of Thrones. Injury Epidemiology 2018;5:44.

  7. Data from: Definition of character for medical education based on expert opinions in Korea

    • dataverse.harvard.edu
    Updated Nov 10, 2021
    Cite
    Yera Hur (2021). Definition of character for medical education based on expert opinions in Korea [Dataset]. http://doi.org/10.7910/DVN/S5JLIB
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 10, 2021
    Dataset provided by
    Harvard Dataverse
    Authors
    Yera Hur
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    South Korea
    Description

    A single questionnaire with 3 major questions on character was distributed to medical education experts in Korea via e-mail. The questions were: “How would you define the ‘character’ that is required from a good doctor in the era of the fourth industrial revolution?”, “What are the issues of character education in current medical education (if any?)”, and “If you agree that there are any issue(s) of character education in current medical education, what possible solutions do you suggest?” The survey was distributed twice. In the first round of the survey, 145 e-mails were sent, and the response rate was 23.4% (34 responses). In the second round of survey distribution, 29 additional responses were gathered. Thus, responses from 63 medical education experts from 30 medical schools or colleges and 19 non-medical education experts were used in the final analysis.

  8. DDD

    • huggingface.co
    Updated Nov 30, 2023
    Cite
    Iconic Interactive (2023). DDD [Dataset]. https://huggingface.co/datasets/IconicAI/DDD
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 30, 2023
    Dataset authored and provided by
    Iconic Interactive
    Description

    Deep Dungeons and Dragons

    A dataset of long-form multi-turn and multi-character collaborative RPG stories, complete with associated character cards. The dataset comprises 56,000 turns across 1544 stories following 9771 characters: a total of 50M Llama tokens. Each turn is a multi-paragraph continuation of a story from the perspective of a defined character, including both dialogue and prose. This dataset is a cleaned and reformatted version of Deep Dungeons and Dragons… See the full description on the dataset page: https://huggingface.co/datasets/IconicAI/DDD.
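    A minimal sketch for loading the dataset with the Hugging Face `datasets` library; the split and column names are not stated above, so the code simply inspects whatever is there:

```python
# Minimal sketch: load DDD from the Hugging Face Hub.
# Split and column names are assumptions to be inspected, not documented facts.
from datasets import load_dataset

ddd = load_dataset("IconicAI/DDD")          # repo id from the citation above
print(ddd)                                  # shows available splits and columns
split = next(iter(ddd.values()))            # first split, whatever it is named
print({k: str(v)[:80] for k, v in split[0].items()})  # peek at one record
```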

  9. Greatest Comic Book Characters

    • kaggle.com
    zip
    Updated Oct 26, 2022
    Cite
    Aman Chauhan (2022). Greatest Comic Book Characters [Dataset]. https://www.kaggle.com/datasets/whenamancodes/greatest-comic-book-characters
    Explore at:
    zip (610385 bytes); available download formats
    Dataset updated
    Oct 26, 2022
    Authors
    Aman Chauhan
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This folder contains data behind the story Comic Books Are Still Made By Men, For Men And About Men.

    The data comes from Marvel Wikia and DC Wikia. Characters were scraped on August 24. Appearance counts were scraped on September 2. The month and year of the first issue each character appeared in was pulled on October 6.

    Data Dictionary

    Column: Definition
    page_id: The unique identifier for that character's page within the wikia
    name: The name of the character
    urlslug: The unique url within the wikia that takes you to the character
    ID: The identity status of the character (Secret Identity, Public Identity, [on Marvel only: No Dual Identity])
    ALIGN: If the character is Good, Bad or Neutral
    EYE: Eye color of the character
    HAIR: Hair color of the character
    SEX: Sex of the character (e.g. Male, Female, etc.)
    GSM: If the character is a gender or sexual minority (e.g. homosexual characters, bisexual characters)
    ALIVE: If the character is alive or deceased
    APPEARANCES: The number of appearances of the character in comic books (as of Sep. 2, 2014; the number will become increasingly out of date as time goes on)
    FIRST APPEARANCE: The month and year of the character's first appearance in a comic book, if available
    YEAR: The year of the character's first appearance in a comic book, if available
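    A minimal pandas sketch for exploring these columns (the CSV file names inside the Kaggle archive are assumptions; adjust them to the actual contents of the zip):

```python
# Minimal sketch over the data dictionary above. The CSV name is an
# assumption about the Kaggle archive contents, not a documented file.
import pandas as pd

marvel = pd.read_csv("marvel-wikia-data.csv")           # assumed file name
print(marvel["SEX"].value_counts(normalize=True))       # e.g. share of female characters
top = marvel.sort_values("APPEARANCES", ascending=False)
print(top[["name", "APPEARANCES", "ALIGN"]].head(10))   # most frequently appearing characters
```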
  10. OLHWD

    • huggingface.co
    Cite
    Mission R, OLHWD [Dataset]. https://huggingface.co/datasets/Immortalman12/OLHWD
    Explore at:
    Authors
    Mission R
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Data of online Chinese handwriting.

    • all_datas.npy: the handwritten text-line data (from CASIA-OLHWDB 2.0-2.2)
    • datas.npy: the handwritten single-character data (from CASIA-OLHWDB 1.0-1.2)
    • mydict.npy: all of the character types in the single-character dataset
    • dictionary.npy: all of the character types in the text-line dataset
    • test_datas.npy: the handwritten single-character data for testing (from the ICDAR 2013 competition database)
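    A minimal NumPy sketch for inspecting the files (allow_pickle=True is an assumption, needed only if the arrays store Python objects rather than plain numeric arrays):

```python
# Minimal sketch: inspect the .npy files listed above with NumPy.
# allow_pickle=True is an assumption about how the arrays were saved.
import numpy as np

chars = np.load("datas.npy", allow_pickle=True)        # single-character samples
char_types = np.load("mydict.npy", allow_pickle=True)  # character inventory
print(type(chars), getattr(chars, "shape", None))
print("character types:", len(char_types))
```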

  11. Dataset for Dense sampling of taxa and characters improves phylogenetic resolution among deltocephaline leafhoppers (Hemiptera: Cicadellidae: Deltocephalinae)

    • databank.illinois.edu
    • aws-databank-alb.library.illinois.edu
    Updated Aug 29, 2022
    + more versions
    Cite
    Yanghui Cao; Christopher H. Dietrich; James N. Zahniser; Dmitry A. Dmitriev (2022). Dataset for Dense sampling of taxa and characters improves phylogenetic resolution among deltocephaline leafhoppers (Hemiptera: Cicadellidae: Deltocephalinae) [Dataset]. http://doi.org/10.13012/B2IDB-8842653_V2
    Explore at:
    Dataset updated
    Aug 29, 2022
    Authors
    Yanghui Cao; Christopher H. Dietrich; James N. Zahniser; Dmitry A. Dmitriev
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Dataset funded by
    U.S. National Science Foundation (NSF)
    Description

    The following files were used to reconstruct the phylogeny of the leafhopper subfamily Deltocephalinae, using IQ-TREE v1.6.12 and ASTRAL v4.10.5.

    1) taxon_sampling.csv: contains the sequencing ids (1st column) and the taxonomic information (2nd column) of each sample. Sequencing ids were used in the alignment files and partition files.
    2) concatenated_nt.phy: concatenated nucleotide alignment used for the maximum likelihood analysis of Deltocephalinae with IQ-TREE v1.6.12. The file lists the sequences of 163,365 nucleotide positions from 429 genes in 730 samples. Hyphens represent gaps.
    3) concatenated_nt_partition.nex: the partitions for the concatenated nucleotide alignment. The file partitions the 163,365 nucleotide characters into 429 character sets and defines the best substitution model for each character set.
    4) concatenated_aa.phy: concatenated amino acid alignment used for the maximum likelihood analysis of Deltocephalinae with IQ-TREE v1.6.12. The file gives the sequences of 53,969 amino acids from 429 genes in 730 samples. Hyphens represent gaps.
    5) concatenated_aa_partition.nex: the partitions for the concatenated amino acid alignment. The file partitions the 53,969 characters into 429 character sets and defines the best substitution model for each character set.
    6) concatenated_nt_106taxa.phy: a reduced concatenated nucleotide alignment representing 107 samples x 86 genes, used to estimate the divergence times of Deltocephalinae with MCMCTree in PAML v4.9. The file lists the sequences of 79,239 nucleotide positions from 86 genes in 107 samples. Hyphens represent gaps.
    7) concatenated_nt_106taxa_partition.nex: the partitions for the nucleotide alignment concatenated_nt_106taxa.phy. The file partitions the 79,239 nucleotide characters into 86 character sets and defines the best substitution model for each character set.
    8) individual_gene_alignment.zip: contains 429 FAS files, one for each of the partitioned nucleotide character sets in the concatenated_nt_partition.nex file. Hyphens represent gaps. These files were used to construct gene trees with IQ-TREE v1.6.12, followed by multispecies coalescent analysis with ASTRAL v4.10.5.
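    A minimal Biopython sketch for reading the concatenated alignment (the "phylip-relaxed" format name is an assumption for .phy files whose sequencing ids exceed the strict 10-character PHYLIP limit; use "phylip" otherwise):

```python
# Minimal sketch: read the concatenated alignment with Biopython.
# "phylip-relaxed" is an assumption about the .phy flavour used here.
from Bio import AlignIO

aln = AlignIO.read("concatenated_nt.phy", "phylip-relaxed")
print(len(aln), "samples x", aln.get_alignment_length(), "positions")
print(aln[0].id, str(aln[0].seq)[:60], "...")  # first sample, first 60 columns
```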

  12. HSIP Law Enforcement Locations in New Mexico

    • catalog.data.gov
    • gstore.unm.edu
    Updated Dec 2, 2020
    + more versions
    Cite
    (Point of Contact) (2020). HSIP Law Enforcement Locations in New Mexico [Dataset]. https://catalog.data.gov/dataset/hsip-law-enforcement-locations-in-new-mexico
    Explore at:
    Dataset updated
    Dec 2, 2020
    Dataset provided by
    (Point of Contact)
    Area covered
    New Mexico
    Description

    Law Enforcement Locations: any location where sworn officers of a law enforcement agency are regularly based or stationed. Law enforcement agencies "are publicly funded and employ at least one full-time or part-time sworn officer with general arrest powers". This is the definition used by the US Department of Justice - Bureau of Justice Statistics (DOJ-BJS) for their Law Enforcement Management and Administrative Statistics (LEMAS) survey. Although LEMAS only includes non-federal agencies, this dataset includes locations for federal, state, local, and special-jurisdiction law enforcement agencies. Law enforcement agencies include, but are not limited to, municipal police, county sheriffs, state police, school police, park police, railroad police, federal law enforcement agencies, departments within non-law-enforcement federal agencies charged with law enforcement (e.g., US Postal Inspectors), and cross-jurisdictional authorities (e.g., Port Authority Police).

    In general, the requirements and training for becoming a sworn law enforcement officer are set by each state. Law enforcement agencies themselves are not chartered or licensed by their state. County, city, and other government authorities within each state are usually empowered by state law to set up or disband law enforcement agencies. Generally, sworn law enforcement officers must report which agency employs them to the state. Although TGS's intention is to include only locations associated with agencies that meet the above definition, TGS has discovered a few locations associated with agencies that are not publicly funded. TGS deleted these locations as it became aware of them, but some may still exist in this dataset. Personal homes, administrative offices, and temporary locations are intended to be excluded from this dataset; however, some personal homes are included because the New Mexico Mounted Police work out of their homes.

    TGS has made a concerted effort to include all local police; county sheriffs; state police and/or highway patrol; Bureau of Indian Affairs; Bureau of Land Management; Bureau of Reclamation; U.S. Park Police; Bureau of Alcohol, Tobacco, Firearms, and Explosives; U.S. Marshals Service; U.S. Fish and Wildlife Service; National Park Service; U.S. Immigration and Customs Enforcement; and U.S. Customs and Border Protection. This dataset is comprised completely of license-free data. FBI entities are intended to be excluded from this dataset, but a few may be included.

    The Law Enforcement dataset and the Correctional Institutions dataset were merged into one working file. TGS processed them as one file and then separated them for delivery purposes. With the merge of the Law Enforcement and the Correctional Institutions datasets, the NAICS Codes & Descriptions were assigned based on the facility's main function, which was determined by the entity's name, facility type, web research, and state-supplied data. In instances where the entity's primary function is both law enforcement and corrections, the NAICS Codes and Descriptions are assigned based on the dataset in which the record is located (i.e., a facility that serves as both a Sheriff's Office and as a jail is designated as [NAICSDESCR]="SHERIFFS' OFFICES (EXCEPT COURT FUNCTIONS ONLY)" in the Law Enforcement layer and as [NAICSDESCR]="JAILS (EXCEPT PRIVATE OPERATION OF)" in the Correctional Institutions layer).

    Records with "-DOD" appended to the end of the [NAME] value are located on a military base, as defined by the Defense Installation Spatial Data Infrastructure (DISDI) military installations and military range boundaries. "#" and "*" characters were automatically removed from standard fields that TGS populated. Double spaces were replaced by single spaces in these same fields. Text fields in this dataset have been set to all upper case to facilitate consistent database engine search results. All diacritics (e.g., the German umlaut or the Spanish tilde) have been replaced with their closest equivalent English character to facilitate use with database systems that may not support diacritics. The currentness of this dataset is indicated by the [CONTDATE] field. Based on the values in this field, the oldest record dates from 08/14/2006 and the newest record dates from 10/23/2009.
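    The field normalization described above (stripping "#" and "*", collapsing double spaces, upper-casing, and folding diacritics to their closest English characters) can be reproduced with a short sketch; this is an illustration, not TGS's actual tooling:

```python
# Illustrative sketch of the field normalization described above:
# strip "#" and "*", collapse double spaces, upper-case, and fold
# diacritics (e.g. umlaut, tilde) to their closest ASCII equivalents.
import re
import unicodedata

def normalize_field(value: str) -> str:
    value = value.replace("#", "").replace("*", "")
    value = re.sub(r" {2,}", " ", value)
    value = unicodedata.normalize("NFKD", value)             # split base char + accent
    value = value.encode("ascii", "ignore").decode("ascii")  # drop the accents
    return value.upper().strip()

print(normalize_field("Peña  Blanca* Police Dept#"))  # -> PENA BLANCA POLICE DEPT
```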

  13. Simple download service (Atom) of the dataset: Linear fishing lot in Saône-et-Loire

    • data.europa.eu
    unknown
    Cite
    Simple download service (Atom) of the dataset: Linear fishing lot in Saône-et-Loire [Dataset]. https://data.europa.eu/data/datasets/fr-120066022-srv-265262b3-4d39-43a2-b1f4-e325bdce83d5
    Explore at:
    unknown; available download formats
    Description

    Linear lot of the river public domain allocated to an AAPPMA. The fishing lots are defined by a Prefectural Decree (Article R435-16 et seq. of the Environmental Code). Description of the data (attribute name: definition; type/length; unit of measurement or constraints):

    • ID_LOT_PECHE_L: geomap identifier; Character (5)
    • CODE_HYDRO_COURS_EAU: hydrographic code of the river; Character (8)
    • ID_LOT_PUBLIC: fishing lot identifier; Character (6)
    • NOM_COURS_EAU: name of the watercourse; Character (50)
    • location of the fishing lot on the watercourse; Character (250)
    • X_LIM_AMONT: X coordinate (L93) of the upstream limit of the fishing lot
    • Y_LIM_AMONT: Y coordinate (L93) of the upstream limit of the fishing lot
    • X_LIM_AVAL: X coordinate (L93) of the downstream limit of the fishing lot
    • Y_LIM_AVAL: Y coordinate (L93) of the downstream limit of the fishing lot
    • length of the fishing lot
    • ASSO_PECHE: fishing association that manages the lot; Character (50)
    • COM_ASSO_PECHE: commune where the fishing association is located; Character (35)
    • NB_LICENCE_AMATEUR: number of amateur fishing licences; Character (20); a number or "authorised"/"unauthorised"
    • NB_LICENCE_PRO: number of professional fishing licences; Character (20); a number or "authorised"/"unauthorised"
    • CARPE_NUIT: night carp fishing authorised ("YES") or not authorised ("NO"); Character (3)
    • DATE_MISE_A_JOUR: date of the last data update

  14. Star Wars social network

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Apr 25, 2023
    Cite
    Evelina Gabasova; Evelina Gabasova (2023). Star Wars social network [Dataset]. http://doi.org/10.5281/zenodo.1411479
    Explore at:
    zip; available download formats
    Dataset updated
    Apr 25, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Evelina Gabasova; Evelina Gabasova
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    Star Wars social network

    This dataset contains the social network of Star Wars characters extracted from movie scripts. In short, two characters are connected if they speak together within the same scene. The data contain characters and links from episodes I to VII.

    How the data were created is described in my blog posts. The associated code is available in the main GitHub repository evelinag/StarWars-social-network.

    Contents of the files are the following:

    • starwars-episode-N-interactions.json contains the social network extracted from Episode N, where the links between characters are defined by the times the characters speak within the same scene.

    • starwars-episode-N-mentions.json contains the social network extracted from Episode N, where the links between characters are defined by the times the characters are mentioned within the same scene.

    • starwars-episode-N-interactions-allCharacters.json is the interactions network with R2-D2 and Chewbacca added in, using data from the mentions network.

    • starwars-full-... contain the corresponding social networks for the whole set of 6 episodes.

    Description of networks

    The json files representing the networks contain the following information:

    Nodes

    The nodes contain the following fields:

    • name: Name of the character
    • value: Number of scenes the character appeared in
    • colour: Colour in the visualization

    Links

    Links represent connections between characters. The link information corresponds to:

    • source: zero-based index of the character at one end of the link; the order of nodes is the order in which they are listed in the "nodes" element
    • target: zero-based index of the character at the other end of the link.
    • value: number of scenes where the "source character" and "target character" of the link appeared together. Please note that the network is undirected: which character represents the source and which the target is arbitrary; they correspond only to the two ends of the link.
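    A minimal networkx sketch for loading one of the interaction networks (the file name follows the pattern above with N = 4; the keys are as documented):

```python
# Minimal sketch: build one interaction network with networkx.
# File name follows the starwars-episode-N-interactions.json pattern (N = 4).
import json
import networkx as nx

with open("starwars-episode-4-interactions.json", encoding="utf-8") as f:
    data = json.load(f)

G = nx.Graph()
for i, node in enumerate(data["nodes"]):                  # node order defines link indices
    G.add_node(i, name=node["name"], scenes=node["value"], colour=node["colour"])
for link in data["links"]:
    G.add_edge(link["source"], link["target"], weight=link["value"])

# Characters ranked by total shared scenes (weighted degree).
top = sorted(G.degree(weight="weight"), key=lambda kv: -kv[1])[:5]
print([(G.nodes[i]["name"], w) for i, w in top])
```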
  15. HSIP Correctional Institutions in New Mexico

    • catalog.data.gov
    • s.cnmilf.com
    • +1more
    Updated Dec 2, 2020
    Cite
    (Point of Contact) (2020). HSIP Correctional Institutions in New Mexico [Dataset]. https://catalog.data.gov/dataset/hsip-correctional-institutions-in-new-mexico
    Explore at:
    Dataset updated
    Dec 2, 2020
    Dataset provided by
    (Point of Contact)
    Area covered
    New Mexico
    Description

    Jails and Prisons (Correctional Institutions). The Jails and Prisons sub-layer is part of the Emergency Law Enforcement Sector and the Critical Infrastructure Category. A Jail or Prison consists of any facility or location where individuals are regularly and lawfully detained against their will. This includes Federal and State prisons, local jails, and juvenile detention facilities, as well as law enforcement temporary holding facilities. Work camps, including camps operated seasonally, are included if they otherwise meet the definition.

    A Federal Prison is a facility operated by the Federal Bureau of Prisons for the incarceration of individuals. A State Prison is a facility operated by a state, commonwealth, or territory of the US for the incarceration of individuals for a term usually longer than 1 year. A Juvenile Detention Facility is a facility for the incarceration of those who have not yet reached the age of majority (usually 18 years). A Local Jail is a locally administered facility that holds inmates beyond arraignment (usually 72 hours) and is staffed by municipal or county employees. A temporary holding facility, sometimes referred to as a "police lock up" or "drunk tank", is a facility used to detain people prior to arraignment. Locations that are administrative offices only are excluded from the dataset. This definition of Jails is consistent with that used by the Department of Justice (DOJ) in their "National Jail Census", with the exception of "temporary holding facilities", which the DOJ excludes. Locations which function primarily as law enforcement offices are included in this dataset if they have holding cells.

    If the facility is enclosed with a fence, wall, or structure with a gate around the buildings only, the locations were depicted as "on entity" at the center of the facility. If the facility's buildings are not enclosed, the locations were depicted as "on entity" on the main building or "block face" on the correct street segment. Personal homes, administrative offices, and temporary locations are intended to be excluded from this dataset. TGS has made a concerted effort to include all correctional institutions. This dataset includes non-license-restricted data from the following federal agencies: Bureau of Indian Affairs; Bureau of Reclamation; U.S. Park Police; Federal Bureau of Prisons; Bureau of Alcohol, Tobacco, Firearms and Explosives; U.S. Marshals Service; U.S. Fish and Wildlife Service; National Park Service; U.S. Immigration and Customs Enforcement; and U.S. Customs and Border Protection. This dataset is comprised completely of license free data.

    The Law Enforcement dataset and the Correctional Institutions dataset were merged into one working file. TGS processed them as one file and then separated them for delivery purposes. With the merge of the Law Enforcement and the Correctional Institutions datasets, NAICS Codes & Descriptions were assigned based on the facility's main function, which was determined by the entity's name, facility type, web research, and state supplied data. In instances where the entity's primary function is both law enforcement and corrections, the NAICS Codes and Descriptions are assigned based on the dataset in which the record is located (i.e., a facility that serves as both a Sheriff's Office and as a jail is designated as [NAICSDESCR]="SHERIFFS' OFFICES (EXCEPT COURT FUNCTIONS ONLY)" in the Law Enforcement layer and as [NAICSDESCR]="JAILS (EXCEPT PRIVATE OPERATION OF)" in the Correctional Institutions layer).

    Records with "-DOD" appended to the end of the [NAME] value are located on a military base, as defined by the Defense Installation Spatial Data Infrastructure (DISDI) military installations and military range boundaries. "#" and "*" characters were automatically removed from standard fields that TGS populated. Double spaces were replaced by single spaces in these same fields. Text fields in this dataset have been set to all upper case to facilitate consistent database engine search results. All diacritics (e.g., the German umlaut or the Spanish tilde) have been replaced with their closest equivalent English character to facilitate use with database systems that may not support diacritics. The currentness of this dataset is indicated by the [CONTDATE] field. Based on the values in this field, the oldest record dates from 12/27/2004 and the newest record dates from 09/08/2009.

  16. Landscape Character Assessment - Dataset - data.gov.uk

    • ckan.publishing.service.gov.uk
    Updated Jan 5, 2016
    + more versions
    Cite
    ckan.publishing.service.gov.uk (2016). Landscape Character Assessment - Dataset - data.gov.uk [Dataset]. https://ckan.publishing.service.gov.uk/dataset/landscape-character-assessment2
    Explore at:
    Dataset updated
    Jan 5, 2016
    Dataset provided by
    CKAN (https://ckan.org/)
    Description

    Landscape Character Assessment - Landscape Character Type defines the boundaries of landscape parcels which have a distinctive combination of common landscape features. They provide context to conservation matters and planning policy. By accessing this data you will have been deemed to have accepted the terms and conditions of the Public Sector End User Licence - INSPIRE.

  17. Sodium Monitoring Dataset

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    • +1more
    Updated Apr 21, 2025
    + more versions
    Cite
    Agricultural Research Service (2025). Sodium Monitoring Dataset [Dataset]. https://catalog.data.gov/dataset/sodium-monitoring-dataset-72256
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service (https://www.ars.usda.gov/)
    Description

    The Agricultural Research Service of the US Department of Agriculture (USDA), in collaboration with other government agencies, has a program to track changes in the sodium content of commercially processed and restaurant foods. This monitoring program includes these activities:

    • Tracking sodium levels of ~125 popular foods, called "Sentinel Foods," by periodically sampling them at stores and restaurants around the country, followed by laboratory analyses.
    • Tracking levels of "related" nutrients that could change when manufacturers reformulate their foods to reduce sodium; these related nutrients are potassium, total and saturated fat, total dietary fiber, and total sugar.
    • Sharing the results of these monitoring activities with the public, periodically in the Sodium Monitoring Dataset and the USDA National Nutrient Database for Standard Reference, and once every two years in the Food and Nutrient Database for Dietary Studies.

    The Sodium Monitoring Dataset is downloadable in Excel spreadsheet format. Resources in this dataset:

    • Resource Title: Data Dictionary. File Name: SodiumMonitoringDataset_datadictionary.csv. Resource Description: defines variables, descriptions, data types, character lengths, etc. for each of the spreadsheets in the Excel data file: Sentinel Foods - Baseline; Priority-2 Foods - Baseline; Sentinel Foods - Monitoring; Priority-2 Foods - Monitoring.
    • Resource Title: Sodium Monitoring Dataset (MS Excel download). File Name: SodiumMonitoringDatasetUpdatedJuly2616.xlsx. Resource Description: Microsoft Excel workbook with sheets: Sentinel Foods - Baseline; Priority-2 Foods - Baseline; Sentinel Foods - Monitoring; Priority Foods - Monitoring.

  18. Emergency Medical Service Stations

    • wifire-data.sdsc.edu
    • gis-calema.opendata.arcgis.com
    csv, esri rest +4
    Updated May 22, 2019
    Cite
    CA Governor's Office of Emergency Services (2019). Emergency Medical Service Stations [Dataset]. https://wifire-data.sdsc.edu/dataset/emergency-medical-service-stations
    Explore at:
    geojson, zip, csv, kml, html, esri rest; available download formats
    Dataset updated
    May 22, 2019
    Dataset provided by
    CA Governor's Office of Emergency Services
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description
    The dataset represents Emergency Medical Services (EMS) locations in the United States and its territories. EMS Stations are part of the Fire Stations / EMS Stations HSIP Freedom sub-layer, which in turn is part of the Emergency Services and Continuity of Government Sector, which is itself a part of the Critical Infrastructure Category. The EMS stations dataset consists of any location where emergency medical service (EMS) personnel are stationed or based out of, or where equipment that such personnel use in carrying out their jobs is stored for ready use. Ambulance services are included even if they only provide transportation services, but not if they are located at, and operated by, a hospital. If an independent ambulance service or EMS provider happens to be collocated with a hospital, it will be included in this dataset. The dataset includes both private and governmental entities. A concerted effort was made to include all emergency medical service locations in the United States and its territories. This dataset is comprised completely of license free data. Records with "-DOD" appended to the end of the [NAME] value are located on a military base, as defined by the Defense Installation Spatial Data Infrastructure (DISDI) military installations and military range boundaries. At the request of NGA, text fields in this dataset have been set to all upper case to facilitate consistent database engine search results. At the request of NGA, all diacritics (e.g., the German umlaut or the Spanish tilde) have been replaced with their closest equivalent English character to facilitate use with database systems that may not support diacritics. The currentness of this dataset is indicated by the [CONTDATE] field. Based upon this field, the oldest record dates from 12/29/2004 and the newest record dates from 01/11/2010.

    This dataset represents EMS stations: any location where emergency medical service (EMS) personnel are stationed or based out of, or where equipment that such personnel use in carrying out their jobs is stored for ready use. Homeland Security Use Cases (use cases describe how the data may be used and help to define and clarify requirements):

    1. An assessment of whether or not the total emergency medical services capability in a given area is adequate.
    2. A list of resources to draw upon by surrounding areas when local resources have temporarily been overwhelmed by a disaster; route analysis can determine those entities that are able to respond the quickest.
    3. A resource for Emergency Management planning purposes.
    4. A resource for catastrophe response to aid in the retrieval of equipment by outside responders in order to deal with the disaster.
    5. A resource for situational awareness planning and response for Federal Government events.


  19. Definitions of degen1 coding and of character sets.

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Jerome C. Regier; Andreas Zwick (2023). Definitions of degen1 coding and of character sets. [Dataset]. http://doi.org/10.1371/journal.pone.0023408.t001
    Explore at:
    xls; available download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Jerome C. Regier; Andreas Zwick
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Definitions of degen1 coding and of character sets.

  20. Data from: Public Health Departments

    • arc-gis-hub-home-arcgishub.hub.arcgis.com
    • nconemap.gov
    • +3more
    Updated Jan 17, 2018
    Cite
    CA Governor's Office of Emergency Services (2018). Public Health Departments [Dataset]. https://arc-gis-hub-home-arcgishub.hub.arcgis.com/maps/29c3979a34ba4d509582a0e2adf82fd3
    Explore at:
    Dataset updated
    Jan 17, 2018
    Dataset authored and provided by
    CA Governor's Office of Emergency Services
    Area covered
    Description

    State and Local Public Health Departments in the United States. Governmental public health departments are responsible for creating and maintaining conditions that keep people healthy. A local health department may be locally governed, part of a region or district, an office or administrative unit of the state health department, or a hybrid of these. Furthermore, each community has a unique "public health system" comprising individuals and public and private entities that are engaged in activities that affect the public's health. (Excerpted from the Operational Definition of a functional local health department, National Association of County and City Health Officials, November 2005.) Please reference http://www.naccho.org/topics/infrastructure/accreditation/upload/OperationalDefinitionBrochure-2.pdf for more information.

    Facilities involved in direct patient care are intended to be excluded from this dataset; however, some of the entities represented in this dataset serve as both administrative and clinical locations. This dataset includes only the headquarters of Public Health Departments, not their satellite offices. Some health departments encompass multiple counties; therefore, not every county will be represented by an individual record. Also, some areas will appear to be over-represented depending on the structure of the health departments in that particular region. Town health officers are included in Vermont and boards of health are included in Massachusetts; both of these types of entities are elected or appointed to a term of office during which they make and enforce policies and regulations related to the protection of public health. Visiting nurses are represented in this dataset if they are contracted through the local government to fulfill the duties and responsibilities of the local health organization. Since many town health officers in Vermont work out of their personal homes, TechniGraphics represented these entities at the town hall; this is denoted in the [DIRECTIONS] field. Effort was made by TechniGraphics to verify whether or not each health department tracks statistics on communicable diseases.

    Records with "-DOD" appended to the end of the [NAME] value are located on a military base, as defined by the Defense Installation Spatial Data Infrastructure (DISDI) military installations and military range boundaries. "#" and "*" characters were automatically removed from standard HSIP fields populated by TechniGraphics. Double spaces were replaced by single spaces in these same fields. At the request of NGA, text fields in this dataset have been set to all upper case to facilitate consistent database engine search results. At the request of NGA, all diacritics (e.g., the German umlaut or the Spanish tilde) have been replaced with their closest equivalent English character to facilitate use with database systems that may not support diacritics. The currentness of this dataset is indicated by the [CONTDATE] field. Based on this field, the oldest record dates from 11/18/2009 and the newest record dates from 01/08/2010.

Neslihan Suzen (2020). LScD (Leicester Scientific Dictionary) [Dataset]. http://doi.org/10.25392/leicester.data.9746900.v3

LScD (Leicester Scientific Dictionary)

Explore at:
Available download formats: docx
Dataset updated
Apr 15, 2020
Dataset provided by
University of Leicester
Authors
Neslihan Suzen
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Area covered
Leicester
Description

LScD (Leicester Scientific Dictionary)
April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

[Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus), Version 2*. All pre-processing steps applied to build this version of the dictionary are the same as in Version 2** and can be found in the description of Version 2 below; the explanation is not repeated here. After pre-processing, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are likewise the same as those described for LScD Version 2 below.

* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2

[Version 2] Getting Started
This document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). The dictionary is created for use in future work on quantifying the meaning of research texts. R code for producing the dictionary from the LSC, with instructions for its use, is available in [2]. The code can also be applied to lists of texts from other sources, although amendments may be required.

LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains the title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. It was collected in July 2018 and records the number of citations from publication date to July 2018. The total number of documents in LSC is 1,673,824.

LScD is an ordered list of words from the texts of abstracts in LSC. The dictionary stores 974,238 unique words and is sorted by the number of documents containing each word, in descending order. All words in the LScD are in stemmed form. The LScD contains the following information:
1. Unique words in abstracts
2. Number of documents containing each word
3. Number of appearances of a word in the entire corpus

Processing the LSC
Step 1. Downloading the LSC: Use of the LSC is subject to acceptance of a link request by email. To access the LSC for research purposes, please email ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.
Step 2. Importing the Corpus to R: The full R code for processing the corpus can be found on GitHub [2]. All following steps can be applied to an arbitrary list of texts from any source, with changes of parameters. The structure of the corpus, such as the file format and the names (and positions) of fields, should be taken into account when applying the code. The organisation of the CSV files of LSC is described in the README file for LSC [1].
Step 3. Extracting Abstracts and Saving Metadata: The metadata, which comprise all fields in a document except the abstract, are separated from the abstracts and saved as MetaData.R. A condensed sketch of Steps 2 and 3 follows.
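The sketch below is an illustration only, not the authors' script (the actual code is in the GitHub repository cited in [2]). It assumes the LSC ships as a directory of CSV files with an "Abstract" column; the real file layout and field names are given in the LSC README [1].

    files <- list.files("path/to/LSC", pattern = "\\.csv$", full.names = TRUE)
    lsc   <- do.call(rbind, lapply(files, read.csv, stringsAsFactors = FALSE))

    abstracts <- lsc$Abstract                             # texts for Step 4
    metadata  <- lsc[, setdiff(names(lsc), "Abstract")]   # all remaining fields
    save(metadata, file = "MetaData.R")                   # saved as in Step 3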
Fields of the metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.

Step 4. Text Pre-processing on the Collection of Abstracts: This section describes the approaches used to pre-process the abstracts of the LSC (a condensed R sketch follows the list).
1. Removing punctuation and special characters: All non-alphanumeric characters are substituted by a space. The character "-" is not substituted in this step, so that words like "z-score", "non-payment" and "pre-processing" keep their actual meaning; uniting prefixes with words is performed in a later step.
2. Lowercasing the text data: The entire collection of texts is converted to lowercase, to avoid treating words like "Corpus", "corpus" and "CORPUS" differently.
3. Uniting prefixes of words: Words containing prefixes joined with the character "-" are united into a single word. The list of prefixes used in this research is given in the file "list_of_prefixes.csv"; most are extracted from [4], with the commonly used prefixes 'e', 'extra', 'per', 'self' and 'ultra' added.
4. Substitution of words: Some words joined with "-" require an additional substitution step so that their meaning is not lost when the character "-" is removed. Examples are "z-test", "well-known" and "chi-square", which are substituted by "ztest", "wellknown" and "chisquare". Such words were identified by sampling abstracts from the LSC; the full list and the substitution decisions are given in the file "list_of_substitution.csv".
5. Removing the character "-": All remaining "-" characters are replaced by a space.
6. Removing numbers: All digits not included in a word are replaced by a space. Words containing both digits and letters are kept, because alphanumeric tokens such as chemical formulae may be important for the analysis; examples are "co2", "h2o" and "21st".
7. Stemming: Stemming converts inflected words into their word stem, uniting several forms of a word with similar meaning into one form and saving memory and time [5]. All words in the LScD are stemmed.
8. Stop word removal: Stop words are extremely common words that provide little value in a language, such as 'I', 'the' and 'a'. The 'tm' package in R is used to remove them [6]; the package lists 174 English stop words.
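A minimal sketch of the Step 4 pipeline is given below, using base R with the 'SnowballC' and 'tm' packages. The prefix and substitution rules shown are tiny illustrative stand-ins for the full "list_of_prefixes.csv" and "list_of_substitution.csv" files, and the output shown is approximate.

    library(tm)         # stopwords()
    library(SnowballC)  # wordStem()

    preprocess <- function(txt) {
      txt <- gsub("[^[:alnum:]-]", " ", txt)               # 1. specials -> space, keep "-"
      txt <- tolower(txt)                                  # 2. lowercase
      txt <- gsub("\\b(pre|non|self|ultra)-", "\\1", txt)  # 3. unite sample prefixes
      txt <- gsub("\\bz-test\\b", "ztest", txt)            # 4. sample substitution
      txt <- gsub("-", " ", txt)                           # 5. remaining hyphens -> space
      txt <- gsub("\\b[[:digit:]]+\\b", " ", txt)          # 6. standalone numbers -> space
      words <- unlist(strsplit(txt, "\\s+"))
      words <- words[words != ""]
      words <- wordStem(words, language = "english")       # 7. Porter stemming
      words <- words[!(words %in% stopwords("en"))]        # 8. drop tm's stop words
      paste(words, collapse = " ")
    }

    preprocess("Pre-processing the Z-test results for 21 samples (CO2, H2O)")
    # roughly: "preprocess ztest result sampl co2 h2o"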
Step 5. Writing the LScD into CSV Format: There are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written to the file "LScD.csv".

The Organisation of the LScD
The total number of words in the file "LScD.csv" is 974,238. Each field is described below:

Word: Unique words from the corpus, in lowercase and in stemmed form. The field is sorted by the number of documents containing the word, in descending order.

Number of Documents Containing the Word: A binary count is used: if a word exists in an abstract, it counts as 1 for that document; if the word occurs more than once in the same document, the count is still 1. The total number of documents containing the word is the sum of these 1s over the entire corpus.

Number of Appearances in Corpus: How many times the word occurs in the corpus when the corpus is treated as one large document.

Instructions for R Code
LScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as an RData file and in CSV format. The outputs are:

Metadata File: All fields in a document excluding abstracts: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.
File of Abstracts: All abstracts after the pre-processing steps defined in Step 4.
DTM: The Document Term Matrix constructed from the LSC [6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.
LScD: An ordered list of words from the LSC as defined in the previous section.

To use the code:
1. Download the folder 'LSC' and the files 'list_of_prefixes.csv' and 'list_of_substitution.csv'.
2. Open the LScD_Creation.R script.
3. Change the parameters in the script: set the full path of the directory with the source files and the full path of the directory for the output files.
4. Run the full code.

References
[1] N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1
[2] N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION
[3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[4] A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.
[5] C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved Porter's stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.
[6] I. Feinerer, "Introduction to the tm Package: Text Mining in R," 2013. Available: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
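The two dictionary counts described above can be read directly off a sparse document-term matrix. The sketch below uses the 'tm' package (with its 'slam' dependency); `clean` stands for a character vector of pre-processed abstracts, as produced by the Step 4 sketch, and the column headers of the output data frame are illustrative rather than the exact headers of "LScD.csv".

    library(tm)

    corpus <- VCorpus(VectorSource(clean))
    dtm <- DocumentTermMatrix(corpus,
                              control = list(wordLengths = c(1, Inf)))  # keep all word lengths

    # The DTM is stored in sparse triplet form: dtm$j holds the term index of
    # each non-zero entry, so tabulating it gives the binary per-document count.
    doc_freq  <- tabulate(dtm$j, nbins = ncol(dtm))  # documents containing the word
    total_cnt <- slam::col_sums(dtm)                 # appearances in the whole corpus

    lscd <- data.frame(Word = colnames(dtm),
                       Docs_Containing_Word = doc_freq,
                       Appearances_in_Corpus = as.integer(total_cnt))
    lscd <- lscd[order(-lscd$Docs_Containing_Word), ]  # descending by document count
    write.csv(lscd, "LScD.csv", row.names = FALSE)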
