99 datasets found
  1. Building Footprints Data Dictionary

    • data-lakecountyil.opendata.arcgis.com
    • datasets.ai
    • +4more
    Updated Oct 10, 2017
    Cite
    Lake County Illinois GIS (2017). Building Footprints Data Dictionary [Dataset]. https://data-lakecountyil.opendata.arcgis.com/documents/afb5f879894a4993bc9b45998267d94d
    Explore at:
    Dataset updated
    Oct 10, 2017
    Dataset authored and provided by
    Lake County Illinois GIS
    License

    https://www.arcgis.com/sharing/rest/content/items/89679671cfa64832ac2399a0ef52e414/data

    Area covered
    Description

    An in-depth description of the Building Footprint GIS data layer outlining terms of use, update frequency, attribute explanations, and more.

  2. Monash Helix Health Data Dictionary Word Template

    • bridges.monash.edu
    • researchdata.edu.au
    docx
    Updated May 29, 2025
    Cite
    Dianne Brown; Arul Earnest; Mark Lucas; Robin Thompson; Chris Macmanus; Jessica Lockery; Simone Spark (2025). Monash Helix Health Data Dictionary Word Template [Dataset]. http://doi.org/10.26180/29178593.v1
    Explore at:
    docx
    Available download formats
    Dataset updated
    May 29, 2025
    Dataset provided by
    Monash University
    Authors
    Dianne Brown; Arul Earnest; Mark Lucas; Robin Thompson; Chris Macmanus; Jessica Lockery; Simone Spark
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    As part of the Monash University (Helix) Health Research Data Governance strategy, a working group was established in 2019 to develop a data dictionary template for use in health research. This is the Word document that can be output from the Excel version of the template (which is the master). It contains all the metadata (characteristics) that should be included in a health research data dictionary, in a standardised format. Instructions for use are contained in the PDF.

  3. Data Dictionary Assistant Market Research Report 2033

    • researchintelo.com
    csv, pdf, pptx
    Updated Oct 2, 2025
    Cite
    Research Intelo (2025). Data Dictionary Assistant Market Research Report 2033 [Dataset]. https://researchintelo.com/report/data-dictionary-assistant-market
    Explore at:
    pdf, csv, pptx
    Available download formats
    Dataset updated
    Oct 2, 2025
    Dataset authored and provided by
    Research Intelo
    License

    https://researchintelo.com/privacy-and-policy

    Time period covered
    2024 - 2033
    Area covered
    Global
    Description

    Data Dictionary Assistant Market Outlook



    According to our latest research, the Global Data Dictionary Assistant market size was valued at $1.2 billion in 2024 and is projected to reach $4.8 billion by 2033, expanding at a robust CAGR of 16.8% during the forecast period from 2025 to 2033. The primary factor fueling this remarkable growth is the increasing emphasis on data governance and regulatory compliance across industries worldwide. As organizations continue to harness vast volumes of data, the demand for automated, intelligent solutions like Data Dictionary Assistants to streamline metadata management and ensure data quality has surged, making them indispensable for modern enterprises. This market is also being propelled by the rapid adoption of cloud-based solutions and the proliferation of digital transformation initiatives, which have significantly expanded the scope and utility of data dictionary tools.



    Regional Outlook



    North America holds the largest share of the Global Data Dictionary Assistant market, accounting for approximately 38% of the total market value in 2024. The region's dominance can be attributed to its mature technology landscape, early adoption of advanced data management solutions, and stringent regulatory frameworks such as GDPR and CCPA that require robust data governance. Major U.S. and Canadian enterprises, particularly in sectors like BFSI, healthcare, and IT, are leading adopters of Data Dictionary Assistant solutions, leveraging them to automate metadata management and enhance compliance. The presence of leading technology vendors and a highly skilled workforce further cements North America’s leadership in the global market. Additionally, ongoing investments in artificial intelligence and automation are expected to sustain the region’s market dominance through the forecast period.



    Asia Pacific is projected to be the fastest-growing region, with a forecasted CAGR of 19.5% from 2025 to 2033. This rapid expansion is driven by accelerating digital transformation across emerging economies such as China, India, and Southeast Asian countries. The surge in cloud adoption, burgeoning e-commerce sectors, and increasing investments in IT infrastructure are significant contributors to regional growth. Governments in Asia Pacific are also implementing policies to enhance data privacy and security, which is compelling organizations to adopt advanced data governance solutions like Data Dictionary Assistants. Furthermore, the rise of local technology startups and increased foreign direct investment in digital infrastructure are catalyzing the adoption of these tools, positioning Asia Pacific as a key growth engine for the global market.



    Emerging economies in Latin America and the Middle East & Africa are witnessing gradual but steady adoption of Data Dictionary Assistant solutions. These regions face unique challenges, including limited access to skilled IT professionals, budget constraints, and varying levels of regulatory maturity. However, localized demand for data quality management and compliance, particularly in sectors such as government, BFSI, and telecommunications, is driving incremental growth. Policy reforms aimed at digitalization and the introduction of data protection regulations are expected to stimulate further adoption. Nevertheless, the pace of market expansion in these regions is somewhat tempered by infrastructure limitations and organizational resistance to change, making targeted education and capacity-building initiatives essential for unlocking their full potential.



    Report Scope





    Attributes | Details
    Report Title | Data Dictionary Assistant Market Research Report 2033
    By Component | Software, Services
    By Deployment Mode | On-Premises, Cloud
    By Organization Size | Small and Medium Enterprises, Large Enterprises
    By Application
  4. Data from: Development of Data Dictionary for neonatal intensive care unit:...

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Dec 27, 2020
    Cite
    Harpreet Singh; Ravneet Kaur; Satish Saluja; Su Cho; Avneet Kaur; Ashish Pandey; Shubham Gupta; Ritu Das; Praveen Kumar; Jonathan Palma; Gautam Yadav; Yao Sun (2020). Development of Data Dictionary for neonatal intensive care unit: advancement towards a better critical care unit [Dataset]. http://doi.org/10.5061/dryad.zkh18936f
    Explore at:
    zip
    Available download formats
    Dataset updated
    Dec 27, 2020
    Dataset provided by
    CHIL
    Indraprastha Institute of Information Technology Delhi
    Post Graduate Institute of Medical Education and Research
    Sir Ganga Ram Hospital
    UCSF Benioff Children's Hospital
    KLKH
    Apollo Cradle For Women & Children
    Ewha Womans University
    Lucile Packard Children's Hospital
    Authors
    Harpreet Singh; Ravneet Kaur; Satish Saluja; Su Cho; Avneet Kaur; Ashish Pandey; Shubham Gupta; Ritu Das; Praveen Kumar; Jonathan Palma; Gautam Yadav; Yao Sun
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Background: Critical care units (CCUs), with their wide use of monitoring devices, generate massive data. To utilize the valuable information from these devices, data are collected and stored using systems such as the Clinical Information System (CIS) and the Laboratory Information Management System (LIMS). These systems are proprietary in nature, allow limited access to their databases and have vendor-specific clinical implementations. In this study we focus on developing an open-source web-based meta-data repository for the CCU representing a patient's stay with relevant details.

    Methods: After developing the web-based open-source repository, we analyzed prospective data from two sites over four months for data quality dimensions (completeness, timeliness, validity, accuracy and consistency), morbidity and clinical outcomes. We used a regression model to highlight the significance of practice variations linked with various quality indicators.

    Results: A data dictionary (DD) with 1555 fields (89.6% categorical and 11.4% text fields) is presented to cover the clinical workflow of a CCU. The overall quality of 1795 patient-days of data with respect to standard quality dimensions is 87%. The data exhibit 82% completeness, 97% accuracy, 91% timeliness and 94% validity in terms of representing CCU processes, but score only 67% in terms of consistency. Furthermore, quality indicators and practice variations are strongly correlated (p-value < 0.05).

    Conclusion: This study documents DD for standardized data collection in CCU. This provides robust data and insights for audit purposes and pathways for CCU to target practice improvements leading to specific quality improvements.
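The completeness dimension reported above is, in essence, the fraction of fields that actually hold a value. A minimal sketch of that calculation, using hypothetical record and field names (not the actual 1555-field DD):

```python
# Hypothetical NICU records; None marks a field that was never filled in.
records = [
    {"birth_weight_g": 1250, "gestation_weeks": 30, "apgar_5min": 7},
    {"birth_weight_g": 980,  "gestation_weeks": None, "apgar_5min": 8},
    {"birth_weight_g": None, "gestation_weeks": 28, "apgar_5min": None},
]

def completeness(records):
    """Fraction of fields, across all records, that actually hold a value."""
    total = sum(len(r) for r in records)
    filled = sum(1 for r in records for v in r.values() if v is not None)
    return filled / total

score = completeness(records)  # 6 of 9 fields are filled
```

The other dimensions (accuracy, timeliness, validity, consistency) would each need their own rules, which the study defines against CCU processes.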

  5. Data for creating Interactive Dictionary

    • kaggle.com
    zip
    Updated Nov 16, 2018
    Cite
    Dhrumil Patel (2018). Data for creating Interactive Dictionary [Dataset]. https://www.kaggle.com/borrkk/dictionary
    Explore at:
    zip (1458641 bytes)
    Available download formats
    Dataset updated
    Nov 16, 2018
    Authors
    Dhrumil Patel
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Dhrumil Patel

    Released under CC0: Public Domain


  6. TSS Summarized Results Data Dictionary

    • s.cnmilf.com
    • catalog.data.gov
    Updated May 6, 2025
    Cite
    Office of Government-wide Policy (2025). TSS Summarized Results Data Dictionary [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/tss-summarized-results-data-dictionary
    Explore at:
    Dataset updated
    May 6, 2025
    Dataset provided by
    Office of Government-wide Policy
    Description

    A data dictionary for the TSS Summarized Results reports at the building and individual levels.

  7. The Semantic Data Dictionary – An Approach for Describing and Annotating Data

    • scidb.cn
    Updated Oct 17, 2020
    Cite
    Sabbir M. Rashid; James P. McCusker; Paulo Pinheiro; Marcello P. Bax; Henrique Santos; Jeanette A. Stingone; Amar K. Das; Deborah L. McGuinness (2020). The Semantic Data Dictionary – An Approach for Describing and Annotating Data [Dataset]. http://doi.org/10.11922/sciencedb.j00104.00060
    Explore at:
    Croissant
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 17, 2020
    Dataset provided by
    Science Data Bank
    Authors
    Sabbir M. Rashid; James P. McCusker; Paulo Pinheiro; Marcello P. Bax; Henrique Santos; Jeanette A. Stingone; Amar K. Das; Deborah L. McGuinness
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    17 tables and two figures of this paper.
    Table 1: a subset of explicit entries identified in NHANES demographics data.
    Table 2: a subset of implicit entries identified in NHANES demographics data.
    Table 3: a subset of NHANES demographic Codebook entries.
    Table 4: a subset of explicit entries identified in SEER.
    Table 5: a subset of the Dictionary Mapping for the MIMIC-III Admission table.
    Table 6: a high-level comparison of semantic data dictionaries, traditional data dictionaries, approaches involving mapping languages, and general data integration tools.
    Table A1: namespace prefixes and IRIs for relevant ontologies.
    Table B1: infosheet specification.
    Table B2: infosheet metadata supplement.
    Table B3: dictionary mapping specification.
    Table B4: codebook specification.
    Table B5: timeline specification.
    Table B6: properties specification.
    Table C1: NHANES demographics infosheet.
    Table C2: NHANES demographic implicit entries.
    Table C3: NHANES demographic explicit entries.
    Table C4: expanded NHANES demographic Codebook entries.
    Figure 1: a conceptual diagram of the Dictionary Mapping, which allows for a representation model that aligns with existing scientific ontologies. The Dictionary Mapping is used to create a semantic representation of data columns. Each box, along with the "Relation" label, corresponds to a column in the Dictionary Mapping table. Blue rounded boxes correspond to columns that contain resource URIs, while white boxes refer to entities that are generated on a per-row/column basis. The actual cell value in concrete columns is, if there is no Codebook for the column, mapped to the "has value" object of the column object, which is generally either an attribute or an entity.
    Figure 2: (a) a conceptual diagram of the Codebook, which can be used to assign ontology classes to categorical concepts. Unlike other mapping approaches, the use of the Codebook allows for the annotation of cell values, rather than just columns. (b) A conceptual diagram of the Timeline, which can be used to represent complex time-associated concepts, such as time intervals.
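The Codebook's key idea, annotating individual cell values rather than whole columns, can be illustrated with a toy lookup. The column name, codes and labels below are hypothetical, not drawn from the paper's tables:

```python
# Toy Codebook in the spirit of the Semantic Data Dictionary: categorical codes
# in a given column are mapped to concepts (all names here are illustrative).
codebook = {
    "GENDER_CODE": {1: "Male", 2: "Female"},
}

def annotate(column, value):
    """Return the concept behind a coded cell value, or the raw value when the
    column has no Codebook entry (such columns are annotated per-column only)."""
    return codebook.get(column, {}).get(value, value)
```

In the actual approach the right-hand sides would be ontology class IRIs, so that a cell holding the code 1 is interpreted as an instance of the mapped class rather than as an opaque number.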

  8. Superstore

    • kaggle.com
    zip
    Updated Oct 3, 2022
    Cite
    Ibrahim Elsayed (2022). Superstore [Dataset]. https://www.kaggle.com/datasets/ibrahimelsayed182/superstore
    Explore at:
    zip (167457 bytes)
    Available download formats
    Dataset updated
    Oct 3, 2022
    Authors
    Ibrahim Elsayed
    Description

    Context

    A superstore in the USA; the data contain about 10,000 rows.

    Data Dictionary

    Attribute | Definition | Example
    Ship Mode |  | Second Class
    Segment | Segment category | Consumer
    Country |  | United States
    City |  | Los Angeles
    State |  | California
    Postal Code |  | 90032
    Region |  | West
    Category | Category of product | Technology
    Sub-Category |  | Phones
    Sales | Number of sales | 114.9
    Quantity |  | 3
    Discount |  | 0.45
    Profit |  | 14.1694
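A minimal sketch of taking insights from rows shaped like this data dictionary, using only the standard library. The sample rows are made up, apart from the Technology/Phones row, which echoes the examples in the table:

```python
import csv
import io
from collections import defaultdict

# Two illustrative rows with the columns of the data dictionary above.
sample = """Ship Mode,Segment,Country,City,State,Postal Code,Region,Category,Sub-Category,Sales,Quantity,Discount,Profit
Second Class,Consumer,United States,Los Angeles,California,90032,West,Technology,Phones,114.9,3,0.45,14.1694
Standard Class,Corporate,United States,Seattle,Washington,98103,West,Furniture,Chairs,731.94,3,0.0,219.582
"""

# Aggregate profit per product category.
profit_by_category = defaultdict(float)
for row in csv.DictReader(io.StringIO(sample)):
    profit_by_category[row["Category"]] += float(row["Profit"])
```

With the real Kaggle file, the `io.StringIO(sample)` stand-in would be replaced by an open file handle.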

    Acknowledgements

    All thanks to The Sparks Foundation for making this data set.

    Inspiration

    Get the data and try to extract insights. Good luck ❤️

    Don't forget to Upvote😊🥰

  9. LNWB Ch03 Data Processes - data management plan

    • search.dataone.org
    • hydroshare.org
    Updated Dec 5, 2021
    + more versions
    Cite
    Christina Bandaragoda; Bracken Capen; Joanne Greenberg; Mary Dumas; Peter Gill (2021). LNWB Ch03 Data Processes - data management plan [Dataset]. https://search.dataone.org/view/sha256%3Aa7eac4a8f4655389d5169cbe06562ea14e88859d2c4b19a633a0610ca07a329f
    Explore at:
    Dataset updated
    Dec 5, 2021
    Dataset provided by
    Hydroshare
    Authors
    Christina Bandaragoda; Bracken Capen; Joanne Greenberg; Mary Dumas; Peter Gill
    Description

    Overview: The Lower Nooksack Water Budget Project involved assembling a wide range of existing data related to WRIA 1 and specifically the Lower Nooksack Subbasin, updating existing data sets and generating new data sets. This Data Management Plan provides an overview of the data sets, formats and collaboration environment that was used to develop the project. Use of a plan during development of the technical work products provided a forum for the data development and management to be conducted with transparent methods and processes. At project completion, the Data Management Plan provides an accessible archive of the data resources used and supporting information on the data storage, intended access, sharing and re-use guidelines.

    One goal of the Lower Nooksack Water Budget project is to make this “usable technical information” as accessible as possible across technical, policy and general public users. The project data, analyses and documents will be made available through the WRIA 1 Watershed Management Project website http://wria1project.org. This information is intended for use by the WRIA 1 Joint Board and partners working to achieve the adopted goals and priorities of the WRIA 1 Watershed Management Plan.

    Model outputs for the Lower Nooksack Water Budget are summarized by sub-watersheds (drainages) and point locations (nodes). In general, due to changes in land use over time and changes to available streamflow and climate data, the water budget for any watershed needs to be updated periodically. Further detailed information about data sources is provided in review packets developed for specific technical components including climate, streamflow and groundwater level, soils and land cover, and water use.

    Purpose: This project involves assembling a wide range of existing data related to WRIA 1 and specifically the Lower Nooksack Subbasin, updating existing data sets and generating new data sets. Data will be used as input to various hydrologic, climatic and geomorphic components of the Topnet-Water Management (WM) model, but will also be available to support other modeling efforts in WRIA 1. Much of the data used as input to the Topnet model is publicly available and maintained by others (e.g., USGS DEMs and streamflow data, SSURGO soils data, University of Washington gridded meteorological data). Pre-processing is performed to convert these existing data into a format that can be used as input to the Topnet model. Post-processing of Topnet model ASCII-text file outputs is subsequently combined with spatial data to generate GIS data that can be used to create maps and illustrations of the spatial distribution of water information. Other products generated during this project include documentation of methods, input by the WRIA 1 Joint Board Staff Team during review and comment periods, and communication tools developed for public engagement and public comment on the project.

    In order to maintain an organized system of developing and distributing data, Lower Nooksack Water Budget project collaborators should be familiar with the standards for data management described in this document, and with the following issues related to generating and distributing data:
    1. Standards for metadata and data formats
    2. Plans for short-term storage and data management (i.e., file formats, local storage and back-up procedures, security)
    3. Legal and ethical issues (i.e., intellectual property, confidentiality of study participants)
    4. Access policies and provisions (i.e., how the data will be made available to others, any restrictions needed)
    5. Provisions for long-term archiving and preservation (i.e., establishment of a new data archive or utilization of an existing archive)
    6. Assigned data management responsibilities (i.e., persons responsible for ensuring data management and monitoring compliance with the Data Management Plan)

    This resource is a subset of the LNWB Ch03 Data Processes Collection Resource.

  10. Data dictionary for the ACTORDS 20-year follow-up study

    • auckland.figshare.com
    csv
    Updated Oct 16, 2025
    Cite
    Robyn May (2025). Data dictionary for the ACTORDS 20-year follow-up study [Dataset]. http://doi.org/10.17608/k6.auckland.28732205.v1
    Explore at:
    csv
    Available download formats
    Dataset updated
    Oct 16, 2025
    Dataset provided by
    The University of Auckland
    Authors
    Robyn May
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Metadata (data dictionary) and statistical analysis plan (including outcome definitions for the data dictionary) for the ACTORDS 20-year follow-up study. The DOI for the primary study publication is https://doi.org/10.1371/journal.pmed.1004618.

    Data and associated documentation for participants who have consented to future re-use of their data are available to other users under the data sharing arrangements provided by the University of Auckland's Human Health Research Services (HHRS) platform (https://research-hub.auckland.ac.nz/subhub/human-health-research-services-platform). The data dictionary and metadata are published on the University of Auckland's data repository Figshare, which allocates a DOI and thus makes these details searchable and available indefinitely. Researchers are able to use this information and the provided contact address (dataservices@auckland.ac.nz) to request a de-identified dataset through the HHRS Data Access Committee. Data will be shared with researchers who provide a methodologically sound proposal and have appropriate ethical approval, where necessary, to achieve the research aims in the approved proposal. Data requestors are required to sign a Data Access Agreement that includes commitments to use the data only for the specified proposal, not to attempt to identify any individual participant, to store and use the data securely, and to destroy or return the data after completion of the project. The HHRS platform reserves the right to charge a fee to cover the costs of making data available, if needed, for data requests that require additional work to prepare.

  11. Migration Chain: Data Dictionary and Open Data Manual

    • ckan.mobidatalab.eu
    Updated Jul 13, 2023
    Cite
    OverheidNl (2023). Migration Chain: Data Dictionary and Open Data Manual [Dataset]. https://ckan.mobidatalab.eu/eu/dataset/immigratie-handleiding-open-data
    Explore at:
    http://publications.europa.eu/resource/authority/file-type/pdf, http://publications.europa.eu/resource/authority/file-type/zip, http://publications.europa.eu/resource/authority/file-type/ppsx
    Available download formats
    Dataset updated
    Jul 13, 2023
    Dataset provided by
    OverheidNl
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Since 2013, the Dutch Migration Chain has had a chain-wide data dictionary, the Data Dictionary Migration Chain (GMK). The Migration Chain consists of the following organisations:
    - Central Agency for the Reception of Asylum Seekers
    - Correctional Institutions Agency, Ministry of Justice and Security
    - Repatriation and Departure Service, Ministry of Justice and Security
    - Directorate-General for Migration, Ministry of Justice and Security
    - Immigration and Naturalization Service, Ministry of Justice and Security
    - International Organization for Migration
    - Royal Netherlands Marechaussee
    - Ministry of Foreign Affairs
    - National Police
    - Council of State
    - Council for the Judiciary
    - Netherlands Council for Refugees
    - Seaport Police

    Data Dictionary Migration Chain
    One of the principles in the basic starting architecture of the migration chain is that there is no difference of opinion about the meaning of the information that can be extracted from an integrated customer view. A uniform conceptual framework goes further than a glossary of the most important concepts: each shared data element can be related to a concept in the conceptual framework, and in the description of the concepts the relations to each other are named. Chain parties have aligned their own conceptual frameworks with the uniform conceptual framework in the migration chain. The GMK is an overview of the common terminology used within the migration chain. This promotes a correct interpretation of the information exchanged within, or reported on, the processes of the migration chain. A correct interpretation of information prevents miscommunication, mistakes and errors. For users in the migration chain, the GMK is available on the non-public Rijksweb (gmk.vk.rijksweb.nl). In the context of openness and transparency, it has been decided to make the description of concepts and management information from the GMK accessible as open data. This means that the data are available via Data.overheid.nl and reusable by everyone. By making the data transparent, the Ministry also hopes that publications by and about work in the migration chain, such as the State of Migration, can be better explained and contextualised.

    Manual
    A manual for using the open datasets of the migration chain in Excel.

  12. LScD (Leicester Scientific Dictionary)

    • figshare.le.ac.uk
    docx
    Updated Apr 15, 2020
    + more versions
    Cite
    Neslihan Suzen (2020). LScD (Leicester Scientific Dictionary) [Dataset]. http://doi.org/10.25392/leicester.data.9746900.v3
    Explore at:
    docx
    Available download formats
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

    LScD (Leicester Scientific Dictionary)April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk/suzenneslihan@hotmail.com)Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes[Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus) - Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in description of Version 2 below. We did not repeat the explanation. After pre-processing steps, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are also same as described as for LScD Version 2 below.* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2[Version 2] Getting StartedThis document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). This dictionary is created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from LSC and instructions for usage of the code are available in [2]. The code can be also used for list of texts from other sources, amendments to the code may be required.LSC is a collection of abstracts of articles and proceeding papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018. 
The total number of documents in LSC is 1,673,824. LScD is an ordered list of words from the texts of abstracts in LSC. The dictionary stores 974,238 unique words and is sorted by the number of documents containing each word, in descending order. All words in the LScD are in stemmed form. The LScD contains the following information:

1. Unique words in abstracts
2. Number of documents containing each word
3. Number of appearances of each word in the entire corpus

Processing the LSC

Step 1. Downloading the LSC online: Use of the LSC is subject to acceptance of a request for the link by email. To access the LSC for research purposes, please email ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.

Step 2. Importing the corpus into R: The full R code for processing the corpus can be found on GitHub [2]. All of the following steps can be applied to an arbitrary list of texts from any source with changes of parameters. The structure of the corpus, such as the file format and the names (and positions) of fields, should be taken into account when applying our code. The organisation of the CSV files of LSC is described in the README file for LSC [1].

Step 3. Extracting abstracts and saving metadata: Metadata, which include all fields in a document except the abstract, are separated from the abstracts and saved as MetaData.R. The fields of the metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.

Step 4. Text pre-processing steps on the collection of abstracts: In this section, we present our approaches to pre-processing the abstracts of the LSC.

1. Removing punctuation and special characters: All non-alphanumeric characters are substituted with spaces. We did not substitute the character “-” in this step, because we need to keep words like “z-score”, “non-payment” and “pre-processing” in order not to lose their actual meaning. Uniting prefixes with words is performed in a later pre-processing step.
2. Lowercasing the text data: Lowercasing is performed to avoid treating the same word, such as “Corpus”, “corpus” and “CORPUS”, as different words. The entire collection of texts is converted to lowercase.
3. Uniting prefixes of words: Words containing prefixes joined with the character “-” are united into a single word. The prefixes united for this research are listed in the file “list_of_prefixes.csv”. Most of the prefixes are extracted from [4]. We also added the commonly used prefixes ‘e’, ‘extra’, ‘per’, ‘self’ and ‘ultra’.
4. Substitution of words: Some words joined with “-” in the abstracts of the LSC require an additional substitution step to avoid losing their meaning before the character “-” is removed. Examples are “z-test”, “well-known” and “chi-square”, which are substituted with “ztest”, “wellknown” and “chisquare”. Such words were identified by sampling abstracts from the LSC. The full list of such words and the substitution decisions taken are presented in the file “list_of_substitution.csv”.
5. Removing the character “-”: All remaining instances of the character “-” are replaced by spaces.
6. Removing numbers: All digits that are not part of a word are replaced by spaces. Words containing both digits and letters are kept, because alphanumeric tokens such as chemical formulae might be important for our analysis. Examples are “co2”, “h2o” and “21st”.
7. Stemming: Stemming is the process of converting inflected words into their word stem. This step unites several forms of words with similar meaning into one form, and also saves memory space and processing time [5]. All words in the LScD are stemmed to their word stem.
8. Stop word removal: Stop words are words that are extremely common but provide little value in a language. Common English stop words include ‘I’, ‘the’ and ‘a’. We used the ‘tm’ package in R to remove stop words [6]; the package lists 174 English stop words.

Step 5. Writing the LScD into CSV format: There are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written to the file “LScD.csv”.

The Organisation of the LScD

The total number of words in the file “LScD.csv” is 974,238. Each field is described below:

Word: The unique words from the corpus, in lowercase and stemmed form. The field is sorted by the number of documents containing each word, in descending order.

Number of Documents Containing the Word: A binary count is used: if a word exists in an abstract, it counts as 1; if the word occurs more than once in a document, the count is still 1. The total number of documents containing the word is the sum of these 1s over the entire corpus.

Number of Appearances in Corpus: How many times the word occurs in the corpus when the corpus is considered as one large document.

Instructions for R Code

LScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as an RData file and in CSV format. The outputs are:

Metadata File: Includes all fields in a document except the abstract. Fields are List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.

File of Abstracts: Contains all abstracts after the pre-processing steps defined in Step 4.

DTM: The Document Term Matrix constructed from the LSC [6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.

LScD: An ordered list of words from LSC as defined in the previous section.

The code can be used as follows:

1. Download the folder ‘LSC’, ‘list_of_prefixes.csv’ and ‘list_of_substitution.csv’
2. Open the LScD_Creation.R script
3. Change the parameters in the script: replace them with the full path of the directory containing the source files and the full path of the directory for the output files
4. Run the full code.

References

[1] N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1
[2] N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION
[3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[4] A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.
[5] C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved Porter's stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.
[6] I. Feinerer, "Introduction to the tm Package: Text Mining in R," available online: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.
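The pre-processing and counting steps above can be sketched in miniature as follows. This is an illustrative Python sketch, not the authors' R pipeline: the prefix, substitution and stop-word lists are abbreviated stand-ins for the real files, and stemming is omitted for brevity.

```python
import re
from collections import Counter

# Hypothetical toy abstracts; the real LSC has 1,673,824 documents.
abstracts = [
    "The Z-score and pre-processing of CO2 data.",
    "Pre-processing improves z-score models.",
]

PREFIXES = {"pre", "non"}                # stand-in for list_of_prefixes.csv
SUBSTITUTIONS = {"z-score": "zscore"}    # stand-in for list_of_substitution.csv
STOP_WORDS = {"the", "and", "of", "a"}   # stand-in for tm's 174-word list

def preprocess(text):
    text = text.lower()                             # lowercasing
    for old, new in SUBSTITUTIONS.items():          # word substitution
        text = text.replace(old, new)
    text = re.sub(r"[^a-z0-9\s-]", " ", text)       # punctuation (keep '-' for now)
    text = re.sub(r"\b(" + "|".join(PREFIXES) + r")-", r"\1", text)  # unite prefixes
    text = text.replace("-", " ")                   # remove remaining '-'
    text = re.sub(r"\b\d+\b", " ", text)            # standalone digits only; 'co2' survives
    return [t for t in text.split() if t not in STOP_WORDS]

# Document frequency (binary per document) and total corpus frequency
doc_freq, corpus_freq = Counter(), Counter()
for doc in abstracts:
    tokens = preprocess(doc)
    corpus_freq.update(tokens)       # counts every occurrence
    doc_freq.update(set(tokens))     # counts each word at most once per document

print(doc_freq["preprocessing"])  # -> 2
```

Sorting `doc_freq` in descending order would then give the ordering used for the “Word” field of “LScD.csv”.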

  13. NYSERDA Building footprints

    • data-oswegogis.hub.arcgis.com
    Updated Nov 30, 2021
    Cite
    Oswego County GIS (2021). NYSERDA Building footprints [Dataset]. https://data-oswegogis.hub.arcgis.com/datasets/nyserda-building-footprints
    Explore at:
    Dataset updated
    Nov 30, 2021
    Dataset authored and provided by
    Oswego County GIS
    Area covered
    Description

    NYSERDA Building footprints were created as part of the New York State Flood Impact Decision Support Systems; more information on this program can be found at https://fidss.ciesin.columbia.edu/home. Footprints vary in age from county to county. The data dictionary with field descriptions can be found at https://fidss.ciesin.columbia.edu/fidss_files/documents/data-dictionary.xlsx. Oswego County data is primarily sourced from Oswego County's 2009 building footprint project (contracted with Pictometry/Eagleview), Microsoft Building Footprints, and NYS flyover/LiDAR data.

  14. Meta data and supporting documentation

    • catalog.data.gov
    • s.cnmilf.com
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Meta data and supporting documentation [Dataset]. https://catalog.data.gov/dataset/meta-data-and-supporting-documentation
    Explore at:
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency: http://www.epa.gov/
    Description

    We include a description of the data sets in the metadata, as well as sample code and results from a simulated data set. This dataset is not publicly accessible because EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: the R code is available online at https://github.com/warrenjl/SpGPCW.

    Format: The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining the confidentiality of any actual pregnant women.

    Availability: Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics and requires an appropriate data use agreement.

    Description: These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures in each week by subtracting off the median exposure amount in a given week and dividing by the interquartile range (IQR), as in the actual application to the true NC birth records data. The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given. This further protects the identifiability of the spatial locations used in the analysis.

    File format: R workspace file.

    Metadata (including data dictionary):
    • y: Vector of binary responses (1: preterm birth, 0: control)
    • x: Matrix of covariates; one row for each simulated individual
    • z: Matrix of standardized pollution exposures
    • n: Number of simulated individuals
    • m: Number of exposure time periods (e.g., weeks of pregnancy)
    • p: Number of columns in the covariate design matrix
    • alpha_true: Vector of “true” critical window locations/magnitudes (i.e., the ground truth that we want to estimate)

    This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, Oxford, UK, 1-30, (2019).
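The week-by-week standardization described above (subtract each week's median exposure, then divide by that week's IQR) can be sketched as follows. This is a minimal NumPy illustration with hypothetical values, not the authors' code.

```python
import numpy as np

# Hypothetical exposure matrix: rows = individuals, columns = pregnancy weeks
z_raw = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0],
                  [4.0, 40.0]])

def standardize_by_week(z):
    """Subtract each week's (column's) median and divide by that week's IQR."""
    med = np.median(z, axis=0)
    q75, q25 = np.percentile(z, [75, 25], axis=0)
    return (z - med) / (q75 - q25)

z = standardize_by_week(z_raw)  # each column now has median 0, IQR 1
```

After this transformation the per-week medians and IQRs need not be shipped with the data, which is what protects the spatial identifiability noted above.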

  15. Replication Data for: Creating and Comparing Dictionary, Word Embedding, and...

    • dataverse.harvard.edu
    Updated May 5, 2022
    Cite
    Tobias Widmann; Maximilian Wich (2022). Replication Data for: Creating and Comparing Dictionary, Word Embedding, and Transformer-based Models to Measure Discrete Emotions in German Political Text [Dataset]. http://doi.org/10.7910/DVN/C9SAIX
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    May 5, 2022
    Dataset provided by
    Harvard Dataverse
    Authors
    Tobias Widmann; Maximilian Wich
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Previous research on emotional language relied heavily on off-the-shelf sentiment dictionaries that focus on negative and positive tone. These dictionaries are often tailored to non-political domains and use bag-of-words approaches which come with a series of disadvantages. This paper creates, validates, and compares the performance of (1) a novel emotional dictionary specifically for political text, (2) locally trained word embedding models combined with simple neural-network classifiers and (3) transformer-based models which overcome limitations of the dictionary approach. All tools can measure emotional appeals associated with eight discrete emotions. The different approaches are validated on different sets of crowd-coded sentences. Encouragingly, the results highlight the strengths of novel transformer-based models, which come with easily available pre-trained language models. Furthermore, all customized approaches outperform widely used off-the-shelf dictionaries in measuring emotional language in German political discourse. This replication directory contains code and data necessary to reproduce all models, figures, and tables included in "Creating and Comparing Dictionary, Word Embedding, and Transformer-based Models to Measure Discrete Emotions in German Political Text" as well as its supplemental online appendix.

  16. Messy Spreadsheet Example for Instruction

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +1more
    Updated Jun 28, 2024
    Cite
    Curty, Renata Gonçalves (2024). Messy Spreadsheet Example for Instruction [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_12586562
    Explore at:
    Dataset updated
    Jun 28, 2024
    Dataset provided by
    University of California, Santa Barbara
    Authors
    Curty, Renata Gonçalves
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A disorganized toy spreadsheet used for teaching good data organization. Learners are tasked with identifying as many errors as possible before creating a data dictionary and reconstructing the spreadsheet according to best practices.

  17. Variable data dictionary.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Jan 16, 2024
    Cite
    Kumar, Ashwani; Balakrishnan, Vijayakumar; Guérin, Philippe J.; Walker, Martin; Halder, Julia B.; Raja, Jeyapal Dinesh; Uddin, Azhar; Brack, Matthew; Srividya, Adinarayanan; Freitas, Luzia T.; Rahi, Manju; Singh-Phulgenda, Sauman; Basáñez, Maria-Gloria; Khan, Mashroor Ahmad; Harriss, Eli (2024). Variable data dictionary. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001377771
    Explore at:
    Dataset updated
    Jan 16, 2024
    Authors
    Kumar, Ashwani; Balakrishnan, Vijayakumar; Guérin, Philippe J.; Walker, Martin; Halder, Julia B.; Raja, Jeyapal Dinesh; Uddin, Azhar; Brack, Matthew; Srividya, Adinarayanan; Freitas, Luzia T.; Rahi, Manju; Singh-Phulgenda, Sauman; Basáñez, Maria-Gloria; Khan, Mashroor Ahmad; Harriss, Eli
    Description

    Background: Lymphatic filariasis (LF) is a neglected tropical disease (NTD) targeted by the World Health Organization for elimination as a public health problem (EPHP). Since 2000, more than 9 billion treatments of antifilarial medicines have been distributed through mass drug administration (MDA) programmes in 72 endemic countries, and 17 countries have reached EPHP. Yet in 2021, nearly 900 million people still required MDA with combinations of albendazole, diethylcarbamazine and/or ivermectin. Despite the reliance on these drugs, gaps remain in the understanding of variation in responses to treatment. As demonstrated for other infectious diseases, some urgent questions could be addressed by conducting individual participant data (IPD) meta-analyses. Here, we present the results of a systematic literature review to estimate the abundance of IPD on pre- and post-intervention indicators of infection and/or morbidity and to assess the feasibility of building a global data repository.

    Methodology: We searched literature published between 1st January 2000 and 5th May 2023 in 15 databases to identify prospective studies assessing LF treatment and/or morbidity management and disease prevention (MMDP) approaches. We considered only studies in which individual participants were diagnosed with LF infection or disease and were followed up on at least one occasion after receiving an intervention/treatment.

    Principal findings: We identified 138 eligible studies from 23 countries, which followed up an estimated 29,842 participants after intervention. We estimate 14,800 (49.6%) IPD on pre- and post-intervention infection indicators, including microfilaraemia, circulating filarial antigen and/or ultrasound indicators measured before and after intervention using 8 drugs administered in various combinations. We identified 33 studies on MMDP, estimating 6,102 (20.4%) IPD on pre- and post-intervention clinical morbidity indicators only. A further 8,940 IPD cover a mixture of infection and morbidity outcomes measured with other diagnostics, from participants followed for adverse event outcomes only or recruited after initial intervention.

    Conclusions: The LF treatment study landscape is heterogeneous, but the abundance of studies and related IPD suggests that establishing a global data repository to facilitate IPD meta-analyses would be feasible and useful for addressing unresolved questions on variation in treatment outcomes across geographies, demographics and underrepresented groups. New studies using more standardized approaches should be initiated to address the scarcity and inconsistency of data on morbidity management.

  18. Replication Data for: Introducing an Interpretable Deep Learning Approach to...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 8, 2023
    Cite
    Häffner, Sonja; Hofer, Martin; Nagl, Maximilian; Walterskirchen, Julian (2023). Replication Data for: Introducing an Interpretable Deep Learning Approach to Domain-Specific Dictionary Creation: A Use Case for Conflict Prediction [Dataset]. http://doi.org/10.7910/DVN/Y5INRM
    Explore at:
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Häffner, Sonja; Hofer, Martin; Nagl, Maximilian; Walterskirchen, Julian
    Description

    Recent advancements in natural language processing (NLP) methods have significantly improved their performance. However, more complex NLP models are more difficult to interpret and computationally expensive. Therefore, we propose an approach to dictionary creation that carefully balances the trade-off between complexity and interpretability. This approach combines a deep neural network architecture with techniques to improve model explainability to automatically build a domain-specific dictionary. As an illustrative use case of our approach, we create an objective dictionary that can infer conflict intensity from text data. We train the neural networks on a corpus of conflict reports and match them with conflict event data. This corpus consists of over 14,000 expert-written International Crisis Group (ICG) CrisisWatch reports between 2003 and 2021. Sensitivity analysis is used to extract the weighted words from the neural network to build the dictionary. In order to evaluate our approach, we compare our results to state-of-the-art deep learning language models, text-scaling methods, as well as standard, non-specialized, and conflict event dictionary approaches. We are able to show that our approach outperforms other approaches while retaining interpretability.

  19. Dictionary of English Words and Definitions

    • kaggle.com
    zip
    Updated Sep 22, 2024
    + more versions
    Cite
    AnthonyTherrien (2024). Dictionary of English Words and Definitions [Dataset]. https://www.kaggle.com/datasets/anthonytherrien/dictionary-of-english-words-and-definitions
    Explore at:
    zip (6401928 bytes). Available download formats
    Dataset updated
    Sep 22, 2024
    Authors
    AnthonyTherrien
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Overview

    This dataset consists of 42,052 English words and their corresponding definitions. It is a comprehensive collection of words ranging from common terms to more obscure vocabulary. The dataset is ideal for Natural Language Processing (NLP) tasks, educational tools, and various language-related applications.

    Key Features:

    • Words: A diverse set of English words, including both rare and frequently used terms.
    • Definitions: Each word is accompanied by a detailed definition that explains its meaning and contextual usage.

    Total Number of Words: 42,052

    Applications

    This dataset is well-suited for a range of use cases, including:

    • Natural Language Processing (NLP): Enhance text understanding models by providing contextual meaning and word associations.
    • Vocabulary Building: Create educational tools or games that help users expand their vocabulary.
    • Lexical Studies: Perform academic research on word usage, trends, and lexical semantics.
    • Dictionary and Thesaurus Development: Serve as a resource for building dictionary or thesaurus applications, where users can search for words and definitions.

    Data Structure

    • Word: The column containing the English word.
    • Definition: The column providing a comprehensive definition of the word.
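A minimal example of reading data in this Word/Definition layout into a lookup table. The two sample rows are invented for illustration and are not taken from the dataset.

```python
import csv
import io

# Hypothetical two-row sample in the dataset's Word/Definition column layout
sample = io.StringIO(
    "Word,Definition\n"
    "serendipity,The occurrence of events by chance in a happy or beneficial way\n"
    "laconic,Using very few words\n"
)

# Build a word -> definition lookup, as a dictionary app or NLP pipeline might
lookup = {row["Word"]: row["Definition"] for row in csv.DictReader(sample)}

print(lookup["laconic"])  # -> Using very few words
```

The same pattern scales to the full 42,052-row file by replacing the `StringIO` sample with an open file handle.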

    Potential Use Cases

    • Language Learning: This dataset can be used to develop applications or tools aimed at enhancing vocabulary acquisition for language learners.
    • NLP Model Training: Useful for tasks such as word embeddings, definition generation, and contextual learning.
    • Research: Analyze word patterns, rare vocabulary, and trends in the English language.


  20. Data from: THEORETICAL AND PRACTICAL ASPECTS OF USING COMPUTER TECHNOLOGIES...

    • zenodo.org
    Updated Oct 9, 2024
    Cite
    Alisherova Shahnoza Asqarovna; Alisherova Shahnoza Asqarovna (2024). THEORETICAL AND PRACTICAL ASPECTS OF USING COMPUTER TECHNOLOGIES AND PROGRAMS WHEN CREATING THEMATIC DICTIONARIES IN ENGLISH AND UZBEK [Dataset]. http://doi.org/10.5281/zenodo.13905968
    Explore at:
    Dataset updated
    Oct 9, 2024
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Alisherova Shahnoza Asqarovna; Alisherova Shahnoza Asqarovna
    Description

    This article explores the theoretical and practical aspects of utilizing computer technologies and programs in the creation of thematic dictionaries, specifically focusing on English and Uzbek. The study examines the advantages, limitations, and methodological considerations of applying computer-aided techniques for collecting, analyzing, and organizing linguistic data in these two languages. The analysis highlights the unique challenges and opportunities presented by the linguistic characteristics of each language, as well as the evolving landscape of dictionary creation in the digital age.

