100+ datasets found
  1. PLACES and 500 Cities: Data Dictionary

    • catalog.data.gov
    • data.virginia.gov
    • +3 more
    Updated Feb 3, 2025
    Cite
    Centers for Disease Control and Prevention (2025). PLACES and 500 Cities: Data Dictionary [Dataset]. https://catalog.data.gov/dataset/places-and-500-cities-data-dictionary-f68b6
    Dataset provided by
    Centers for Disease Control and Prevention (http://www.cdc.gov/)
    Description

    This dataset provides a data dictionary for PLACES and 500 Cities releases. For each measure, the data dictionary provides the measure ID, the measure's full and short names, the measure category ID and name, the year of BRFSS data used to generate the estimate by release year, and the frequency with which BRFSS collects data about the measure.

  2. The Semantic Data Dictionary – An Approach for Describing and Annotating Data

    • scidb.cn
    Updated Oct 17, 2020
    Cite
    Sabbir M. Rashid; James P. McCusker; Paulo Pinheiro; Marcello P. Bax; Henrique Santos; Jeanette A. Stingone; Amar K. Das; Deborah L. McGuinness (2020). The Semantic Data Dictionary – An Approach for Describing and Annotating Data [Dataset]. http://doi.org/10.11922/sciencedb.j00104.00060
    Formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset provided by
    Science Data Bank
    Authors
    Sabbir M. Rashid; James P. McCusker; Paulo Pinheiro; Marcello P. Bax; Henrique Santos; Jeanette A. Stingone; Amar K. Das; Deborah L. McGuinness
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains 17 tables and two figures from the paper. Table 1 is a subset of explicit entries identified in NHANES demographics data. Table 2 is a subset of implicit entries identified in NHANES demographics data. Table 3 is a subset of NHANES demographic Codebook entries. Table 4 presents a subset of explicit entries identified in SEER. Table 5 is a subset of the Dictionary Mapping for the MIMIC-III Admission table. Table 6 shows a high-level comparison of semantic data dictionaries, traditional data dictionaries, approaches involving mapping languages, and general data integration tools. Table A1 shows namespace prefixes and IRIs for relevant ontologies. Table B1 shows the infosheet specification. Table B2 shows the infosheet metadata supplement. Table B3 shows the dictionary mapping specification. Table B4 is the codebook specification. Table B5 is the timeline specification. Table B6 is the properties specification. Table C1 shows the NHANES demographics infosheet. Table C2 shows NHANES demographic implicit entries. Table C3 shows NHANES demographic explicit entries. Table C4 presents expanded NHANES demographic Codebook entries. Figure 1 is a conceptual diagram of the Dictionary Mapping, which allows for a representation model that aligns with existing scientific ontologies. The Dictionary Mapping is used to create a semantic representation of data columns. Each box, along with the “Relation” label, corresponds to a column in the Dictionary Mapping table. Blue rounded boxes correspond to columns that contain resource URIs, while white boxes refer to entities that are generated on a per-row/column basis. If there is no Codebook for the column, the actual cell value in concrete columns is mapped to the “has value” object of the column object, which is generally either an attribute or an entity. Figure 2 presents (a) a conceptual diagram of the Codebook, which can be used to assign ontology classes to categorical concepts. Unlike other mapping approaches, the use of the Codebook allows for the annotation of cell values, rather than just columns. (b) A conceptual diagram of the Timeline, which can be used to represent complex time-associated concepts, such as time intervals.

  3. Data for creating Interactive Dictionary

    • kaggle.com
    zip
    Updated Nov 16, 2018
    Cite
    Dhrumil Patel (2018). Data for creating Interactive Dictionary [Dataset]. https://www.kaggle.com/borrkk/dictionary
    Formats: zip (1458641 bytes)
    Authors
    Dhrumil Patel
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Dhrumil Patel

    Released under CC0: Public Domain


  4. Meta data and supporting documentation

    • catalog.data.gov
    • s.cnmilf.com
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Meta data and supporting documentation [Dataset]. https://catalog.data.gov/dataset/meta-data-and-supporting-documentation
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    We include a description of the data sets in the meta-data as well as sample code and results from a simulated data set. This dataset is not publicly accessible because EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. The R code is available online at https://github.com/warrenjl/SpGPCW.

    Abstract: The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.

    Availability: Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This also allows the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics and requires an appropriate data use agreement.

    Permissions: These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given. This further protects the identifiability of the spatial locations used in the analysis.

    File format: R workspace file.

    Metadata (including data dictionary):
    • y: Vector of binary responses (1: preterm birth, 0: control)
    • x: Matrix of covariates; one row for each simulated individual
    • z: Matrix of standardized pollution exposures
    • n: Number of simulated individuals
    • m: Number of exposure time periods (e.g., weeks of pregnancy)
    • p: Number of columns in the covariate design matrix
    • alpha_true: Vector of “true” critical window locations/magnitudes (i.e., the ground truth that we want to estimate)

    This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, Oxford, UK, 1-30, (2019).
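
    To make the documented layout concrete, here is a minimal numpy sketch that builds a toy dataset with the same shape conventions (y, x, z, n, m, p, alpha_true); the sizes, the logistic link, and the window location are illustrative assumptions, not the EPA simulation itself.

    ```python
    import numpy as np

    # Hypothetical sizes; the real simulated dataset ships as an R workspace.
    rng = np.random.default_rng(0)
    n, m, p = 500, 40, 3                   # individuals, exposure weeks, covariates

    x = rng.normal(size=(n, p))            # covariate design matrix
    z = rng.normal(size=(n, m))            # standardized weekly exposures (IQR-scaled in the real data)
    alpha_true = np.zeros(m)               # "true" critical window effects
    alpha_true[10:15] = 0.3                # an assumed critical window in weeks 10-14

    beta = rng.normal(size=p)
    logits = x @ beta + z @ alpha_true
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))   # 1: preterm birth, 0: control

    assert x.shape == (n, p) and z.shape == (n, m) and y.shape == (n,)
    ```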

  5. Data from: Development of Data Dictionary for neonatal intensive care unit: advancement towards a better critical care unit

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +1 more
    zip
    Updated Dec 27, 2020
    Cite
    Harpreet Singh; Ravneet Kaur; Satish Saluja; Su Cho; Avneet Kaur; Ashish Pandey; Shubham Gupta; Ritu Das; Praveen Kumar; Jonathan Palma; Gautam Yadav; Yao Sun (2020). Development of Data Dictionary for neonatal intensive care unit: advancement towards a better critical care unit [Dataset]. http://doi.org/10.5061/dryad.zkh18936f
    Formats: zip
    Dataset provided by
    CHIL
    Indraprastha Institute of Information Technology Delhi
    Post Graduate Institute of Medical Education and Research
    Sir Ganga Ram Hospital
    UCSF Benioff Children's Hospital
    KLKH
    Apollo Cradle For Women & Children
    Ewha Womans University
    Lucile Packard Children's Hospital
    Authors
    Harpreet Singh; Ravneet Kaur; Satish Saluja; Su Cho; Avneet Kaur; Ashish Pandey; Shubham Gupta; Ritu Das; Praveen Kumar; Jonathan Palma; Gautam Yadav; Yao Sun
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Background: Critical care units (CCUs), with wide use of various monitoring devices, generate massive data. To utilize the valuable information from these devices, data are collected and stored using systems like the Clinical Information System (CIS), Laboratory Information Management System (LIMS), etc. These systems are proprietary in nature, allow limited access to their databases and have vendor-specific clinical implementations. In this study we focus on developing an open source web-based meta-data repository for CCUs representing a patient's stay with relevant details.

    Methods: After developing the web-based open source repository, we analyzed prospective data from two sites over four months for data quality dimensions (completeness, timeliness, validity, accuracy and consistency), morbidity and clinical outcomes. We used a regression model to highlight the significance of practice variations linked with various quality indicators.

    Results: A data dictionary (DD) with 1555 fields (89.6% categorical and 11.4% text fields) is presented to cover the clinical workflow of a CCU. The overall quality of 1795 patient-days of data with respect to standard quality dimensions is 87%. The data exhibit 82% completeness, 97% accuracy, 91% timeliness and 94% validity in terms of representing CCU processes. The data score only 67% in terms of consistency. Furthermore, quality indicators and practice variations are strongly correlated (p-value < 0.05).

    Conclusion: This study documents a DD for standardized data collection in CCUs. This provides robust data and insights for audit purposes, and pathways for CCUs to target practice improvements leading to specific quality improvements.

  6. LNWB Ch03 Data Processes - data management plan

    • search.dataone.org
    • hydroshare.org
    Updated Dec 5, 2021
    + more versions
    Cite
    Christina Bandaragoda; Bracken Capen; Joanne Greenberg; Mary Dumas; Peter Gill (2021). LNWB Ch03 Data Processes - data management plan [Dataset]. https://search.dataone.org/view/sha256%3Aa7eac4a8f4655389d5169cbe06562ea14e88859d2c4b19a633a0610ca07a329f
    Dataset provided by
    Hydroshare
    Authors
    Christina Bandaragoda; Bracken Capen; Joanne Greenberg; Mary Dumas; Peter Gill
    Description

    Overview: The Lower Nooksack Water Budget Project involved assembling a wide range of existing data related to WRIA 1 and specifically the Lower Nooksack Subbasin, updating existing data sets and generating new data sets. This Data Management Plan provides an overview of the data sets, formats and collaboration environment that was used to develop the project. Use of a plan during development of the technical work products provided a forum for the data development and management to be conducted with transparent methods and processes. At project completion, the Data Management Plan provides an accessible archive of the data resources used and supporting information on the data storage, intended access, sharing and re-use guidelines.

    One goal of the Lower Nooksack Water Budget project is to make this “usable technical information” as accessible as possible across technical, policy and general public users. The project data, analyses and documents will be made available through the WRIA 1 Watershed Management Project website http://wria1project.org. This information is intended for use by the WRIA 1 Joint Board and partners working to achieve the adopted goals and priorities of the WRIA 1 Watershed Management Plan.

    Model outputs for the Lower Nooksack Water Budget are summarized by sub-watersheds (drainages) and point locations (nodes). In general, due to changes in land use over time and changes to available streamflow and climate data, the water budget for any watershed needs to be updated periodically. Further detailed information about data sources is provided in review packets developed for specific technical components including climate, streamflow and groundwater level, soils and land cover, and water use.

    Purpose: This project involves assembling a wide range of existing data related to WRIA 1 and specifically the Lower Nooksack Subbasin, updating existing data sets and generating new data sets. Data will be used as input to various hydrologic, climatic and geomorphic components of the Topnet-Water Management (WM) model, but will also be available to support other modeling efforts in WRIA 1. Much of the data used as input to the Topnet model is publicly available and maintained by others (e.g., USGS DEMs and streamflow data, SSURGO soils data, University of Washington gridded meteorological data). Pre-processing is performed to convert these existing data into a format that can be used as input to the Topnet model. Post-processing subsequently combines Topnet model ASCII-text file outputs with spatial data to generate GIS data that can be used to create maps and illustrations of the spatial distribution of water information. Other products generated during this project include documentation of methods, input by WRIA 1 Joint Board Staff Team during review and comment periods, and communication tools developed for public engagement and public comment on the project.

    In order to maintain an organized system of developing and distributing data, Lower Nooksack Water Budget project collaborators should be familiar with standards for data management described in this document, and the following issues related to generating and distributing data:
    1. Standards for metadata and data formats
    2. Plans for short-term storage and data management (i.e., file formats, local storage and back-up procedures and security)
    3. Legal and ethical issues (i.e., intellectual property, confidentiality of study participants)
    4. Access policies and provisions (i.e., how the data will be made available to others, any restrictions needed)
    5. Provisions for long-term archiving and preservation (i.e., establishment of a new data archive or utilization of an existing archive)
    6. Assigned data management responsibilities (i.e., persons responsible for ensuring data management, monitoring compliance with the Data Management Plan)

    This resource is a subset of the LNWB Ch03 Data Processes Collection Resource.

  7. LScD (Leicester Scientific Dictionary)

    • figshare.le.ac.uk
    docx
    Updated Apr 15, 2020
    + more versions
    Cite
    Neslihan Suzen (2020). LScD (Leicester Scientific Dictionary) [Dataset]. http://doi.org/10.25392/leicester.data.9746900.v3
    Formats: docx
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

    LScD (Leicester Scientific Dictionary), April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

    [Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus), Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in the description of Version 2 below; we do not repeat the explanation. After the pre-processing steps, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are the same as those described for LScD Version 2 below.
    * Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
    ** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2

    [Version 2] Getting Started: This document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). This dictionary is created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from LSC and instructions for usage of the code are available in [2]. The code can also be used for lists of texts from other sources; amendments to the code may be required.

    LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains a title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018. The total number of documents in LSC is 1,673,824.

    LScD is an ordered list of words from the texts of abstracts in LSC. The dictionary stores 974,238 unique words and is sorted by the number of documents containing each word, in descending order. All words in the LScD are in stemmed form. The LScD contains the following information:
    1. Unique words in abstracts
    2. Number of documents containing each word
    3. Number of appearances of a word in the entire corpus

    Processing the LSC:
    Step 1. Downloading the LSC online: Use of the LSC is subject to acceptance of a request for the link by email. To access the LSC for research purposes, please email ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.
    Step 2. Importing the corpus to R: The full R code for processing the corpus can be found on GitHub [2]. All following steps can be applied to an arbitrary list of texts from any source with changes of parameters. The structure of the corpus, such as file format and the names (and positions) of fields, should be taken into account when applying our code. The organisation of the CSV files of LSC is described in the README file for LSC [1].
    Step 3. Extracting abstracts and saving metadata: Metadata, which include all fields in a document except the abstract, are separated from the abstracts and saved as MetaData.R. Fields of metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.
    Step 4. Text pre-processing steps on the collection of abstracts: In this section, we present our approaches to pre-processing the abstracts of the LSC.
    1. Removing punctuation and special characters: This is the process of substituting all non-alphanumeric characters by spaces. We did not substitute the character “-” in this step, because we need to keep words like “z-score”, “non-payment” and “pre-processing” in order not to lose the actual meaning of such words. A process of uniting prefixes with words is performed in later steps of pre-processing.
    2. Lowercasing the text data: Lowercasing is performed to avoid treating words like “Corpus”, “corpus” and “CORPUS” differently. The entire collection of texts is converted to lowercase.
    3. Uniting prefixes of words: Words containing prefixes joined with the character “-” are united as one word. The list of prefixes united for this research is given in the file “list_of_prefixes.csv”. Most of the prefixes were extracted from [4]. We also added commonly used prefixes: ‘e’, ‘extra’, ‘per’, ‘self’ and ‘ultra’.
    4. Substitution of words: Some words joined with “-” in the abstracts of the LSC require an additional process of substitution to avoid losing the meaning of the word before removing the character “-”. Some examples of such words are “z-test”, “well-known” and “chi-square”. These words have been substituted by “ztest”, “wellknown” and “chisquare”. Identification of such words was done by sampling abstracts from LSC. The full list of such words and the decisions taken for substitution are presented in the file “list_of_substitution.csv”.
    5. Removing the character “-”: All remaining “-” characters are replaced by spaces.
    6. Removing numbers: All digits which are not included in a word are replaced by spaces. All words that contain both digits and letters are kept, because alphanumeric strings such as chemical formulas might be important for our analysis. Some examples are “co2”, “h2o” and “21st”.
    7. Stemming: Stemming is the process of converting inflected words into their word stem. This step unites several forms of words with similar meaning into one form, and also saves memory space and time [5]. All words in the LScD are stemmed to their word stem.
    8. Stop word removal: Stop words are words that are extremely common but provide little value in a language. Some common stop words in English are ‘I’, ‘the’, ‘a’, etc. We used the ‘tm’ package in R to remove stop words [6]. There are 174 English stop words listed in the package.
    Step 5. Writing the LScD into CSV format: There are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written to the file “LScD.csv”.

    The organisation of the LScD: The total number of words in the file “LScD.csv” is 974,238. Each field is described below:
    Word: Contains the unique words from the corpus. All words are in lowercase, stemmed form. The field is sorted by the number of documents that contain the word, in descending order.
    Number of Documents Containing the Word: A binary count is used: if a word exists in an abstract, it counts as 1; if the word exists more than once in a document, the count is still 1. The total number of documents containing the word is the sum of these 1s over the entire corpus.
    Number of Appearances in Corpus: How many times a word occurs in the corpus when the corpus is considered as one large document.

    Instructions for R code: LScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as an RData file and in CSV format. Outputs of the code are:
    Metadata file: Includes all fields in a document except the abstract. Fields are List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.
    File of abstracts: Contains all abstracts after the pre-processing steps defined in Step 4.
    DTM: The Document Term Matrix constructed from the LSC [6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.
    LScD: An ordered list of words from LSC as defined in the previous section.

    The code can be used as follows:
    1. Download the folder ‘LSC’, ‘list_of_prefixes.csv’ and ‘list_of_substitution.csv’
    2. Open the LScD_Creation.R script
    3. Change the parameters in the script: replace with the full path of the directory with source files and the full path of the directory to write output files
    4. Run the full code.

    References
    [1] N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1
    [2] N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION
    [3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
    [4] A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.
    [5] C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved Porter’s stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.
    [6] I. Feinerer, "Introduction to the tm Package: Text Mining in R," available online: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.
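
    The published pipeline is implemented in R (LScD_Creation.R); below is a rough Python sketch of the same eight pre-processing steps, with tiny stand-in lists in place of “list_of_prefixes.csv” and “list_of_substitution.csv”, a crude suffix-stripper in place of a Porter stemmer, and an abbreviated stop-word list (the tm package lists 174).

    ```python
    import re
    from collections import Counter

    # Hypothetical stand-ins for the files shipped with the dictionary.
    PREFIXES = {"pre", "non", "self", "e", "extra", "per", "ultra"}
    SUBSTITUTIONS = {"z-test": "ztest", "well-known": "wellknown", "chi-square": "chisquare"}
    STOP_WORDS = {"i", "the", "a", "an", "and", "of", "in", "to", "is"}

    def simple_stem(word: str) -> str:
        # Crude suffix stripping as a stand-in for a real Porter stemmer.
        for suffix in ("ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def preprocess(abstract: str) -> list[str]:
        text = re.sub(r"[^\w\s-]", " ", abstract)             # 1. punctuation/specials (keep "-")
        text = text.lower()                                    # 2. lowercase
        text = re.sub(r"\b(" + "|".join(PREFIXES) + r")-(\w+)", r"\1\2", text)  # 3. unite prefixes
        for joined, merged in SUBSTITUTIONS.items():           # 4. substitutions (z-test -> ztest)
            text = text.replace(joined, merged)
        text = text.replace("-", " ")                          # 5. remaining "-"
        tokens = [t for t in text.split() if not t.isdigit()]  # 6. bare numbers (keep co2, 21st)
        tokens = [simple_stem(t) for t in tokens]              # 7. stemming
        return [t for t in tokens if t not in STOP_WORDS]      # 8. stop words

    abstracts = ["The well-known z-test and pre-processing of CO2 data ..."]
    doc_freq = Counter()
    for doc in abstracts:
        doc_freq.update(set(preprocess(doc)))                  # binary per-document count
    print(doc_freq.most_common(5))
    ```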

  8. Migration Chain: Data Dictionary and Open Data Manual

    • ckan.mobidatalab.eu
    Updated Jul 13, 2023
    Cite
    OverheidNl (2023). Migration Chain: Data Dictionary and Open Data Manual [Dataset]. https://ckan.mobidatalab.eu/eu/dataset/immigratie-handleiding-open-data
    Formats: pdf, zip, ppsx
    Dataset provided by
    OverheidNl
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Since 2013, the Dutch Migration Chain has had a chain-wide data dictionary, the Data Dictionary Migration Chain (GMK). The Migration Chain consists of the following organisations:
    - Central Agency for the Reception of Asylum Seekers
    - Correctional Institutions Agency, Ministry of Justice and Security
    - Repatriation and Departure Service, Ministry of Justice and Security
    - Directorate-General for Migration, Ministry of Justice and Security
    - Immigration and Naturalization Service, Ministry of Justice and Security
    - International Organization for Migration
    - Royal Netherlands Marechaussee
    - Ministry of Foreign Affairs
    - National Police
    - Council of State
    - Council for the Judiciary
    - Netherlands Council for Refugees
    - Seaport Police

    Data Dictionary Migration Chain: One of the principles in the basic starting architecture of the migration chain is that there is no difference of opinion about the meaning of the information that can be extracted from an integrated customer view. A uniform conceptual framework goes further than a glossary of the most important concepts: each shared data element can be related to a concept in the conceptual framework, and in the description of the concepts the relations between them are named. Chain parties have aligned their own conceptual frameworks with the uniform conceptual framework in the migration chain. The GMK is an overview of the common terminology used within the migration chain. This promotes a correct interpretation of the information exchanged within, or reported on, the processes of the migration chain. A correct interpretation of information prevents miscommunication, mistakes and errors. For users in the migration chain, the GMK is available on the non-public Rijksweb (gmk.vk.rijksweb.nl). In the context of openness and transparency, it has been decided to make the description of concepts and management information from the GMK accessible as open data. This means that the data are available via Data.overheid.nl and reusable by everyone. By making the data transparent, the Ministry also hopes that publications by and about work in the migration chain, such as the State of Migration, can be better explained and contextualised.

    Manual: A manual for using the open datasets of the migration chain in Excel.

  9. Plan ID Crosswalk PUF

    • datahub.hhs.gov
    • healthdata.gov
    csv, xlsx, xml
    Updated Oct 8, 2021
    + more versions
    Cite
    Data.Healthcare.gov (2021). Plan ID Crosswalk PUF [Dataset]. https://datahub.hhs.gov/CMS/Plan-ID-Crosswalk-PUF/pxe9-f4nd
    Formats: xml, csv, xlsx
    Dataset provided by
    HealthCare.gov (https://www.healthcare.gov/)
    Description

    The Plan ID Crosswalk PUF (CW-PUF) is one of the seven files that make up the Marketplace PUF. The purpose of the CW-PUF is to map QHPs and SADPs offered through the Marketplaces in 2014 to plans that will be offered through the Marketplaces in 2015. These data either originate from the Plan Crosswalk template (i.e., template field), an Excel-based form used by issuers to describe their plans in the QHP application process, or were generated by CCIIO for use in data processing (i.e., system-generated). This data dictionary describes the variables contained in the CW-PUF. Each record relates to a mapping between a plan offered in 2014 and a plan offered in 2015 at the county or county-zip-code level.
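
    As a minimal sketch of how such a crosswalk can be applied, the pandas snippet below maps 2014 plan IDs to their 2015 successors at the county level; the column names and plan IDs are hypothetical, and the real variable names are defined in this data dictionary.

    ```python
    import pandas as pd

    # Hypothetical column names and IDs; consult the CW-PUF data dictionary for the real ones.
    crosswalk = pd.DataFrame({
        "plan_id_2014": ["11111AA0010001", "11111AA0010002"],
        "plan_id_2015": ["11111AA0020001", "11111AA0020001"],
        "fips_county":  ["37063", "37063"],
    })

    # Map each 2014 QHP/SADP to its 2015 successor at the county level.
    mapping = crosswalk.set_index(["plan_id_2014", "fips_county"])["plan_id_2015"]
    print(mapping.loc[("11111AA0010001", "37063")])
    ```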

  10. Freight Analysis Framework (FAF5) Network Nodes

    • catalog.data.gov
    • geodata.bts.gov
    • +2 more
    Updated Jul 17, 2025
    + more versions
    Cite
    Bureau of Transportation Statistics (BTS) (Point of Contact) (2025). Freight Analysis Framework (FAF5) Network Nodes [Dataset]. https://catalog.data.gov/dataset/freight-analysis-framework-faf5-network-nodes1
    Dataset provided by
    Bureau of Transportation Statistics (http://www.rita.dot.gov/bts)
    Description

    The Freight Analysis Framework (FAF5) Network Nodes dataset was created from 2017 base-year data and was published on April 11, 2022 by the Bureau of Transportation Statistics (BTS); it is part of the U.S. Department of Transportation (USDOT)/Bureau of Transportation Statistics (BTS) National Transportation Atlas Database (NTAD). The FAF (Version 5) Network Nodes dataset contains 348,498 node features. All node features are topologically connected to permit network pathbuilding and vehicle assignment using a variety of assignment algorithms. The FAF Node and FAF Link datasets can be used together to create a network. The link features in the FAF Network dataset include all roads represented in prior FAF networks, and all roads in the National Highway System (NHS) and the National Highway Freight Network (NHFN) that are currently open to traffic. Other included links provide connections between intersecting routes, and to select intermodal facilities and all U.S. counties. The network consists of over 588,000 miles of equivalent road mileage. The dataset covers the 48 contiguous states plus the District of Columbia, Alaska, and Hawaii. A data dictionary, or other source of attribute information, is accessible at https://doi.org/10.21949/1528011
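
    A small networkx sketch of the node/link pattern the description implies; the node IDs, mileages, and attribute names below are made up for illustration, and the real attribute names come from the linked data dictionary.

    ```python
    import networkx as nx

    # Hypothetical node/link records standing in for the FAF5 node and link tables.
    nodes = [(1, {"state": "VA"}), (2, {"state": "VA"}), (3, {"state": "NC"})]
    links = [(1, 2, 12.5), (2, 3, 30.0), (1, 3, 55.0)]   # (from_node, to_node, miles)

    G = nx.Graph()
    G.add_nodes_from(nodes)
    for u, v, miles in links:
        G.add_edge(u, v, weight=miles)

    # Topological connectivity permits pathbuilding, e.g. a shortest-mileage path.
    path = nx.shortest_path(G, source=1, target=3, weight="weight")
    print(path, nx.path_weight(G, path, weight="weight"))
    ```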

  11. Replication Data for: Introducing an Interpretable Deep Learning Approach to Domain-Specific Dictionary Creation: A Use Case for Conflict Prediction

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 8, 2023
    Cite
    Häffner, Sonja; Hofer, Martin; Nagl, Maximilian; Walterskirchen, Julian (2023). Replication Data for: Introducing an Interpretable Deep Learning Approach to Domain-Specific Dictionary Creation: A Use Case for Conflict Prediction [Dataset]. http://doi.org/10.7910/DVN/Y5INRM
    Dataset provided by
    Harvard Dataverse
    Authors
    Häffner, Sonja; Hofer, Martin; Nagl, Maximilian; Walterskirchen, Julian
    Description

    Recent advancements in natural language processing (NLP) methods have significantly improved their performance. However, more complex NLP models are more difficult to interpret and computationally expensive. Therefore, we propose an approach to dictionary creation that carefully balances the trade-off between complexity and interpretability. This approach combines a deep neural network architecture with techniques to improve model explainability to automatically build a domain-specific dictionary. As an illustrative use case of our approach, we create an objective dictionary that can infer conflict intensity from text data. We train the neural networks on a corpus of conflict reports and match them with conflict event data. This corpus consists of over 14,000 expert-written International Crisis Group (ICG) CrisisWatch reports between 2003 and 2021. Sensitivity analysis is used to extract the weighted words from the neural network to build the dictionary. In order to evaluate our approach, we compare our results to state-of-the-art deep learning language models, text-scaling methods, as well as standard, non-specialized, and conflict event dictionary approaches. We are able to show that our approach outperforms other approaches while retaining interpretability.
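
    As a rough illustration of sensitivity-based dictionary extraction, the sketch below scores each word by how much removing it changes a model's predicted conflict intensity; the linear "model", vocabulary, and weights are toy stand-ins, not the paper's deep network or its exact sensitivity procedure.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    vocab = ["ceasefire", "shelling", "talks", "casualties", "election"]
    W = rng.normal(size=len(vocab))          # hypothetical learned weights

    def predict_intensity(x: np.ndarray) -> float:
        return float(W @ x)                  # toy stand-in for the trained network

    doc = np.array([1, 1, 0, 1, 0], dtype=float)     # bag-of-words for one report
    base = predict_intensity(doc)

    dictionary = {}
    for i, word in enumerate(vocab):
        if doc[i]:
            occluded = doc.copy()
            occluded[i] = 0.0                # remove the word and re-score
            dictionary[word] = base - predict_intensity(occluded)

    # Words sorted by the magnitude of their effect form the weighted dictionary.
    print(sorted(dictionary.items(), key=lambda kv: -abs(kv[1])))
    ```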

  12. Bangla Financial lexicon Sentiment dictionary

    • kaggle.com
    zip
    Updated Jul 30, 2023
    Cite
    Md. Ashraful Islam (2023). Bangla Financial lexicon Sentiment dictionary [Dataset]. https://www.kaggle.com/datasets/mdashrafulislam1998/bangla-financial-lexicon-data-dictionary/data
    Formats: zip (60849 bytes)
    Authors
    Md. Ashraful Islam
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Welcome to the Bangla Financial Lexicon Data Dictionary project!

    The financial lexicon data dictionary is a list of words used to calculate the sentiment of financial news articles. Bangla words were collected from an online Bangla dictionary API and manually categorized into six weighted groups. To accurately determine the sentiment of sentences, a lexicon data dictionary is required. This project's lexicon data dictionary contains only Bangla words, including words with positive sentiment and words with negative sentiment.

    This dataset was a crucial part of our research published in the journal paper titled "Stock Market Prediction of Bangladesh Using Multivariate Long Short-Term Memory with Sentiment Identification." The paper can be accessed and cited at http://doi.org/10.11591/ijece.v13i5.pp5696-5706.

    Understanding the Categories:

    Bull words: This word collection is called bull words because, from a financial standpoint, they are considered to have positive connotations. These words are typically associated with upward market trends, increasing stock prices, and overall economic growth. In this sense, bull words are viewed as desirable and are often used by financial analysts and investors to convey optimism about the state of the economy.

    Bear words: The bear word list is the opposite of the positive sentiment words in financial sentiment analysis. For the purpose of evaluating the sentiment around business news, every phrase on this list is regarded as carrying negative sentiment. Bear word lists typically consist of words that are associated with downward trends in the stock market, such as recession, inflation, unemployment, and bankruptcy.

    Negative words: The negative word list has words like “না”, “নয়”, and “নেই”, which can make a whole sentence negative in the Bangla language. These negative words can have a significant impact on the overall sentiment of a sentence, even if the other words in the sentence are positive. The negative word list is a crucial tool for sentiment analysis in the Bangla language.

    Coordinating conjunction words (Co con.): In the Bangla language, conjunctions like “কিন্তু”, “আদপে”, “এবং”, “অথবা” play an important role in sentence making. They should have their own weighted effect value in sentiment analysis; assigning weighted effect values to conjunctions in the Bangla language results in more accurate sentiment analysis.

    Subordinating conjunctions (Sub con.): Another list of conjunctions, with words like "অধিকন্ত", "এমনকি", "বিশেষত". These conjunctions are often used to indicate a shift in tone or emphasis in a sentence and can play a significant role in shaping the overall sentiment. By assigning weighted effect values to these conjunctions, financial analysts can further refine their sentiment analysis, providing even more accurate insights into the sentiment of financial news and information.

    Adjectives and adverbs (Adj.): We listed some adjectives and adverbs like "সবচাইতে", "অধিক", "সর্বাধিক", as they intensify the sentence sentiment more than other simple words. We categorized them into three weighted categories: high, medium, and low. Words with high weight have the greatest impact, words with medium weight have a moderate impact, and words with low weight have the least impact.
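
    A minimal scoring sketch of how such weighted groups could combine; the weights and the bull/bear example words are assumptions, while the negators and intensifiers are taken from the category descriptions above.

    ```python
    # Assumed weights and placeholder bull/bear tokens; the real six-category
    # weights ship with the dataset.
    BULL = {"লাভ": 1.0}                           # hypothetical positive (bull) word
    BEAR = {"মন্দা": -1.0}                        # hypothetical negative (bear) word
    NEGATORS = {"না", "নয়", "নেই"}                # from the negative word list above
    INTENSIFIERS = {"সর্বাধিক": 1.5, "অধিক": 1.2}  # assumed high/medium adjective weights

    def sentence_score(tokens: list[str]) -> float:
        score, scale = 0.0, 1.0
        for t in tokens:
            scale *= INTENSIFIERS.get(t, 1.0)     # adjectives/adverbs amplify
            score += BULL.get(t, 0.0) + BEAR.get(t, 0.0)
        if any(t in NEGATORS for t in tokens):    # a negator flips the sentence
            score = -score
        return score * scale

    print(sentence_score(["অধিক", "লাভ"]))        # intensified positive
    print(sentence_score(["লাভ", "নেই"]))         # negated positive
    ```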

  13. The Online Plain Text English Dictionary (OPTED)

    • kaggle.com
    zip
    Updated Sep 28, 2021
    Cite
    DFY Data (2021). The Online Plain Text English Dictionary (OPTED) [Dataset]. https://www.kaggle.com/datasets/dfydata/the-online-plain-text-english-dictionary-opted/discussion
    Formats: zip (5072627 bytes)
    Authors
    DFY Data
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    It took a while to find a version of Webster's 1913 dictionary that I could parse and create a CSV file from. This one is from OPTED, and you can see the license info on their page.

    Content

    This is the full OPTED version of a Public Domain dictionary based on the Webster's Unabridged Dictionary, 1913 edition. The CSV file contains all entries, along with the character count for each word, the Part of Speech, and the Definition.
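
    A short pandas sketch for loading the CSV; the file name and column names below are assumptions, so check the actual header after download.

    ```python
    import pandas as pd

    # Hypothetical file/column names; the dataset documents each entry with the
    # word, its character count, part of speech, and definition.
    df = pd.read_csv("opted_websters_1913.csv",
                     names=["word", "char_count", "pos", "definition"], header=0)
    print(df.head())
    print(df.groupby("pos")["word"].count().sort_values(ascending=False).head())
    ```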

    Acknowledgements

    OPTED and Project Gutenberg

  14. Dictionary of English Words and Definitions

    • kaggle.com
    zip
    Updated Sep 22, 2024
    + more versions
    Cite
    AnthonyTherrien (2024). Dictionary of English Words and Definitions [Dataset]. https://www.kaggle.com/datasets/anthonytherrien/dictionary-of-english-words-and-definitions
    Formats: zip (6401928 bytes)
    Authors
    AnthonyTherrien
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Overview

    This dataset consists of 42,052 English words and their corresponding definitions. It is a comprehensive collection of words ranging from common terms to more obscure vocabulary. The dataset is ideal for Natural Language Processing (NLP) tasks, educational tools, and various language-related applications.

    Key Features:

    • Words: A diverse set of English words, including both rare and frequently used terms.
    • Definitions: Each word is accompanied by a detailed definition that explains its meaning and contextual usage.

    Total Number of Words: 42,052

    Applications

    This dataset is well-suited for a range of use cases, including:

    • Natural Language Processing (NLP): Enhance text understanding models by providing contextual meaning and word associations.
    • Vocabulary Building: Create educational tools or games that help users expand their vocabulary.
    • Lexical Studies: Perform academic research on word usage, trends, and lexical semantics.
    • Dictionary and Thesaurus Development: Serve as a resource for building dictionary or thesaurus applications, where users can search for words and definitions.

    Data Structure

    • Word: The column containing the English word.
    • Definition: The column providing a comprehensive definition of the word.

    Potential Use Cases

    • Language Learning: This dataset can be used to develop applications or tools aimed at enhancing vocabulary acquisition for language learners.
    • NLP Model Training: Useful for tasks such as word embeddings, definition generation, and contextual learning.
    • Research: Analyze word patterns, rare vocabulary, and trends in the English language.


  15. GlobalPhone Spanish (Latin American) Pronunciation Dictionary

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Nov 25, 2014
    + more versions
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2014). GlobalPhone Spanish (Latin American) Pronunciation Dictionary [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0360/
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Area covered
    Latin America
    Description

    The GlobalPhone pronunciation dictionaries, created within the framework of the multilingual speech and language corpus GlobalPhone, were developed in collaboration with the Karlsruhe Institute of Technology (KIT). The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The pronunciation dictionaries are currently available in 18 languages: Arabic (29230 entries/27059 words), Bulgarian (20193 entries), Croatian (23497 entries/20628 words), Czech (33049 entries/32942 words), French (36837 entries/20710 words), German (48979 entries/46035 words), Hausa (42662 entries/42079 words), Japanese (18094 entries), Polish (36484 entries), Portuguese (Brazilian) (54146 entries/54130 words), Russian (28818 entries/27667 words), Spanish (Latin American) (43264 entries/33960 words), Swedish (about 25000 entries), Turkish (31330 entries/31087 words), Vietnamese (38504 entries/29974 words), Chinese-Mandarin (73388 pronunciations), Korean (3500 syllables), and Thai (a small set with 12,420 pronunciation entries for 12,420 different words, without pronunciation variants, and a larger set with 25,570 pronunciation entries for 22,462 different word units, including 3,108 entries of up to four pronunciation variants).

    1) Dictionary encoding: The pronunciation dictionary entries consist of full word forms and are either given in the original script of that language, mostly in UTF-8 encoding (Bulgarian, Croatian, Czech, French, Polish, Russian, Spanish, Thai), corresponding to the trl-files of the GlobalPhone transcriptions, or in Romanized script (Arabic, German, Hausa, Japanese, Korean, Mandarin, Portuguese, Swedish, Turkish, Vietnamese), corresponding to the rmn-files of the GlobalPhone transcriptions. In the latter case the documentation mostly provides a mapping from the Romanized to the original script.

    2) Dictionary phone set: The phone sets for each language were derived individually from the literature following best practices for automatic speech processing. Each phone set is explained and described in the documentation using the international standards of the International Phonetic Alphabet (IPA). For most languages a mapping to the language-independent GlobalPhone naming conventions (indicated by “M_”) is provided for the purpose of data sharing across languages to build multilingual acoustic models.

    3) Dictionary generation: Whenever the grapheme-to-phoneme relationship allowed, the dictionaries were created semi-automatically in a rule-based fashion using a set of grapheme-to-phoneme mapping rules. The number of rules highly depends on the language. After the automatic creation process, all dictionaries were manually cross-checked by native speakers, correcting potential errors of the automatic pronunciation generation process. Most of the dictionaries have been applied to large vocabulary speech recognition. In many cases the GlobalPhone dictionaries were compared to straightforward grapheme-based speech recognition and to alternative sources, such as Wiktionary, and usually demonstrated to be superior in terms of quality, coverage, and accuracy.

  16. Data from: Data in online version of the ‘Dictionary of Medieval Latin from British Sources’ (DMLBS)

    • ora.ox.ac.uk
    Updated Jan 1, 2016
    Cite
    Ashdowne, R (2016). Data in online version of the ‘Dictionary of Medieval Latin from British Sources’ (DMLBS) [Dataset]. https://ora.ox.ac.uk/objects/uuid:b9608816-7ede-4f52-a2a4-e7bc8abd7788
    Dataset provided by
    University of Oxford
    Authors
    Ashdowne, R
    License

    https://ora.ox.ac.uk/terms_of_use

    Area covered
    Britain
    Description

    The DMLBS is distinctive not only for the breadth of its coverage but also for the fact that it is wholly based on original research, i.e. on a fresh reading of medieval Latin texts for this specific purpose, where possible in the best available source, whether that be original manuscripts or modern critical editions. (The method is that used by other major dictionaries, such as the monumental Oxford English Dictionary and, for Latin, the Oxford Latin Dictionary and the Thesaurus Linguae Latinae.) In the nearly 50 years of drafting the Dictionary, different editorial practices and conventions have inevitably created a text that varies significantly from the earliest fascicules to the final ones while remaining recognizably the same underlying work. Many of these variations have been the result of conscious decisions, others simply the result of the Dictionary being the work of many people over many years.

    Work on digitizing the Dictionary began in earnest in 2009, with a move from a traditional print-based workflow to an electronic XML-based workflow, first for material already drafted on paper slips but not yet keyed as electronic data, and then subsequently with the introduction of full ab initio electronic drafting.

    However, even then the majority of the dictionary's content still existed only in print — in the thirteen fascicules (more than 2,500 three-column pages containing nearly 65,000 entries) published since 1965. Once the new workflow for the remaining material to be published was fully established within the project, work began on digitizing earlier fascicules; this work was undertaken by a specialist outside contractor, which captured these printed pages and tagged the material in accordance with the Dictionary schema. The captured material was then evaluated and corrected within the project. Plans for the project itself developing and hosting an online platform for the full dataset were discontinued in 2014 due to lack of technical support and funding, but partnerships have been established to ensure that online publication is achieved.

    Technical Overview:

    The DMLBS is held in XML according to customized XSD schemas. All data is held in unicode encoding.

    Data structure: At the heart of the DMLBS XML workflow sit the data schemas which describe and are used to constrain the structure of the data. The DMLBS uses XSD schemas. The Dictionary data is represented essentially in the form in which it has been published in print. In addition to the schema for the Dictionary text, there is a further schema for the Dictionary's complex bibliography, which is also held in XML form. The schemas in use were custom-built for the DMLBS in order to match the project’s very specific needs, ensuring that the drafted or captured text always complies with the long-standing structures and conventions of the printed dictionary by requiring, allowing or prohibiting as necessary. (Although the use of TEI encoding was seriously considered, it was clear from initial exploration that the level of customization and optimization required to bring the TEI in line with the practical production needs of the dictionary was too great to be feasible.)

    Data encoding and entry: The encoding chosen for all DMLBS data is Unicode. In addition to the Roman alphabet, with the full range of diacritics (including the macron and breve to mark vowel length), the Dictionary regularly uses Anglo-Saxon letters (such as thorn, wynn, and yogh) and polytonic Greek, along with assorted other letters and symbols. The ‘Dictionary of Medieval Latin from British Sources’ (DMLBS) was prepared by a project team of specialist researchers as a research project of the British Academy, overseen by a committee appointed by the Academy to direct its work. Initially based in London at the Public Record Office, the editorial team moved to Oxford in the early 1980s and since the late 1990s has formed part of the Faculty of Classics at Oxford University. The main aim of the DMLBS project has been to create a successor to the previous standard dictionary of medieval Latin, the Glossarium ... mediae et infimae Latinitatis, first compiled in the seventeenth century by the French scholar, Du Cange (Charles du Fresne), and a history of the project is available at http://www.dmlbs.ox.ac.uk/about-us/history-of-the-project and in Richard Ashdowne ‘Dictionary of Medieval Latin from British Sources’, British Academy Review 24 (2014), 46–53. The project has been supported financially by major research grants from the Arts & Humanities Research Council, the Packard Humanities Institute, and the OUP John Fell Research Fund, and by a small annual grant from the British Academy. It also received institutional support from the British Academy and the University of Oxford.

  17. GlobalPhone Japanese Pronunciation Dictionary

    • live.european-language-grid.eu
    • catalogue.elra.info
    audio format
    Updated Nov 24, 2014
    + more versions
    Cite
    (2014). GlobalPhone Japanese Pronunciation Dictionary [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/2372
    Formats: audio
    License

    http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    The GlobalPhone pronunciation dictionaries, created within the framework of the multilingual speech and language corpus GlobalPhone, were developed in collaboration with the Karlsruhe Institute of Technology (KIT).

    The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The pronunciation dictionaries are currently available in 18 languages: Arabic (29230 entries/27059 words), Bulgarian (20193 entries), Croatian (23497 entries/20628 words), Czech (33049 entries/32942 words), French (36837 entries/20710 words), German (48979 entries/46035 words), Hausa (42662 entries/42079 words), Japanese (18094 entries), Polish (36484 entries), Portuguese (Brazilian) (54146 entries/54130 words), Russian (28818 entries/27667 words), Spanish (Latin American) (43264 entries/33960 words), Swedish (about 25000 entries), Turkish (31330 entries/31087 words), Vietnamese (38504 entries/29974 words), Chinese-Mandarin (73388 pronunciations), Korean (3500 syllables), and Thai (a small set with 12,420 pronunciation entries for 12,420 different words, without pronunciation variants, and a larger set with 25,570 pronunciation entries for 22,462 different word units, including 3,108 entries of up to four pronunciation variants).

    1) Dictionary Encoding: The pronunciation dictionary entries consist of full word forms and are either given in the original script of that language, mostly in UTF-8 encoding (Bulgarian, Croatian, Czech, French, Polish, Russian, Spanish, Thai) corresponding to the trl-files of the GlobalPhone transcriptions or in Romanized script (Arabic, German, Hausa, Japanese, Korean, Mandarin, Portuguese, Swedish, Turkish, Vietnamese) corresponding to the rmn-files of the GlobalPhone transcriptions, respectively. In the latter case the documentation mostly provides a mapping from the Romanized to the original script.

    2) Dictionary Phone set: The phone sets for each language were derived individually from the literature following best practices for automatic speech processing. Each phone set is explained and described in the documentation using the international standards of the International Phonetic Alphabet (IPA). For most languages a mapping to the language independent GlobalPhone naming conventions (indicated by “M_”) is provided for the purpose of data sharing across languages to build multilingual acoustic models.

    3) Dictionary Generation: Whenever the grapheme-to-phoneme relationship allowed, the dictionaries were created semi-automatically in a rule-based fashion using a set of grapheme-to-phoneme mapping rules. The number of rules highly depends on the language. After the automatic creation process, all dictionaries were manually cross-checked by native speakers, correcting potential errors of the automatic pronunciation generation process. Most of the dictionaries have been applied to large vocabulary speech recognition. In many cases the GlobalPhone dictionaries were compared to straight-forward grapheme-based speech recognition and to alternative sources, such as Wiktionary and usually demonstrated to be superior in terms of quality, coverage, and accuracy.

    4) Format: The format of the dictionaries is the same across languages and is straightforward. Each line consists of one word form and its pronunciation, separated by a blank. The pronunciation is a concatenation of phone symbols separated by blanks. Both words and their pronunciations are given in tcl-script list format, i.e. enclosed in “{}”, since phones can carry tags indicating the tone and length of a vowel, or the word-boundary tag “WB”, which marks the boundary of a dictionary unit. The WB tag can, for example, be included as a standard question in the decision-tree questions for capturing crossword models in context-dependent modeling. Pronunciation variants are indicated by (
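
    The sentence above is cut off in the source. As a rough sketch of reading this format, assuming a tcl-style line such as {hello(2)} {{h WB} e l {o WB}} with variants numbered in parentheses (both the example entry and the variant-numbering convention are assumptions, not taken from the documentation):

        import re

        # Sketch under assumptions: word and pronunciation are tcl-style brace
        # lists separated by a blank; tagged phones look like {a WB}; variants
        # are assumed to be numbered like word(2).
        LINE = r"^\{([^}]+)\}\s+\{(.*)\}\s*$"

        def parse_entry(line):
            m = re.match(LINE, line)
            if not m:
                raise ValueError("unexpected format: " + line)
            word_field, pron_field = m.groups()
            vm = re.match(r"(.+?)\((\d+)\)$", word_field)
            word, variant = (vm.group(1), int(vm.group(2))) if vm else (word_field, 1)
            tokens = re.findall(r"\{([^}]+)\}|(\S+)", pron_field)
            phones = [tagged or bare for tagged, bare in tokens]
            return word, variant, phones

        print(parse_entry("{hello(2)} {{h WB} e l {o WB}}"))
        # -> ('hello', 2, ['h WB', 'e', 'l', 'o WB'])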

    5) Documentation: The pronunciation dictionaries for each language are complemented by documentation that describes the format of the dictionary, the phone set including its mapping to the International Phonetic Alphabet (IPA), and the frequency distribution of the phones in the dictionary. Most of the pronunciation dictionaries have been successfully applied to large-vocabulary speech recognition, and references to publications are given where available.

  18. Data from: The Great Ape Dictionary video database

    • data.europa.eu
    • zenodo.org
    unknown
    Updated Oct 30, 2021
    + more versions
    Cite
    Zenodo (2021). The Great Ape Dictionary video database [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-5600472/embed
    Explore at:
    unknown (20757). Available download formats
    Dataset updated
    Oct 30, 2021
    Dataset authored and provided by
    Zenodohttp://zenodo.org/
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We study the behaviour and cognition of wild apes and other species (elephants, corvids, dogs). Our video archive is called the Great Ape Dictionary; you can find out more at www.greatapedictionary.com, or about our lab group at www.wildminds.ac.uk. We consider these videos a data ark that we would like to make as accessible as possible. While we are unable to make the original video files open access at the present time, you can search this database to explore what is available and then request access for collaborations of different kinds by contacting us directly or through our website. We label all videos in the Great Ape Dictionary video archive with basic metadata on the location, date, duration, individuals present, and behaviour present. Version 1.0.0 contains current data from the Budongo East African chimpanzee population (n=13,806 videos). These datasets are updated regularly, and new data will be incorporated here with versioning. Alongside the database there is a second read.me file containing the ethograms used for each variable coded and a short summary of other datasets in preparation for subsequent versions. If you are interested in these data, please contact us. Please note that not all variables are labelled for all videos; the detailed ethogram categories are only available for a subset of the data. All videos are labelled with up to 5 contexts (at least one, rarely 5). If you are interested in finding a good example video for a particular behaviour, search for 'Library' = Y; this flag indicates that the clip contains a very clear example of the behaviour, as in the sketch below.
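
    A rough sketch of that search, assuming the metadata is exported as a CSV with a column literally named "Library" (both the file name and the column name are hypothetical):

        import pandas as pd

        # Sketch under assumptions: the metadata export is a CSV and the
        # flag column is named "Library"; both names are hypothetical.
        videos = pd.read_csv("great_ape_dictionary_v1.0.0.csv")

        # Clear example clips of a behaviour are flagged with Library = Y.
        library_clips = videos[videos["Library"] == "Y"]
        print(len(library_clips), "library-quality clips")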

  19. Dataset: Screening Causal Assessment of Brook Trout Occurrence and Road Runoff

    • s.cnmilf.com
    • catalog.data.gov
    Updated Apr 25, 2025
    Cite
    U.S. EPA Office of Research and Development (ORD) (2025). Dataset: Screening Causal Assessment of Brook Trout Occurrence and Road Runoff 20250218 [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/dataset-screening-causal-assessment-of-brook-trout-occurrence-and-road-runoff-20250218
    Explore at:
    Dataset updated
    Apr 25, 2025
    Dataset provided by
    United States Environmental Protection Agencyhttp://www.epa.gov/
    Description

    Pedigree of all data and processing included in the manuscript. Open the zip file, then access the pedigree folder for a file describing all other folders, links, and the data dictionary. Items:

    NOTES: Description of work and other worksheets.
    Pedigree: Summary of source files used to create figures and tables.
    DataFiles: Data files used in the R code for creating the figures and tables.
    DataDictionary: Data file titles in all data files.
    Data: Data file uploaded to Science Hub.
    Output: Files generated from R scripts.
    Plot: Plots generated from R scripts and other software.
    R_Scripts: Clean R scripts used to analyze the data and generate figures and tables.
    Result: Tables generated from R scripts.
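
    A minimal sketch of inspecting such an archive from Python; the archive name is hypothetical, and whether each item above is a folder or a worksheet may differ from this assumption.

        import zipfile

        # Sketch: list the archive's top-level items, then locate the pedigree
        # file that describes the rest. The archive name is hypothetical;
        # adjust to the actual download.
        with zipfile.ZipFile("brook_trout_road_runoff.zip") as zf:
            top_level = sorted({name.split("/")[0] for name in zf.namelist()})
            print(top_level)  # expect items like Pedigree, DataFiles, Output, Plot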

  20. Airbnb Open Data

    • kaggle.com
    zip
    Updated Aug 1, 2022
    Cite
    Arian Azmoudeh (2022). Airbnb Open Data [Dataset]. https://www.kaggle.com/datasets/arianazmoudeh/airbnbopendata/code
    Explore at:
    zip (10964528 bytes). Available download formats
    Dataset updated
    Aug 1, 2022
    Authors
    Arian Azmoudeh
    License

    Open Database License (ODbL) v1.0https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    New York City Airbnb Data Cleaning

    Airbnb, Inc. is an American company that operates an online marketplace for lodging, primarily homestays for vacation rentals, and tourism activities. Based in San Francisco, California, the platform is accessible via website and mobile app. Airbnb does not own any of the listed properties; instead, it profits by collecting a commission from each booking. The company was founded in 2008. Airbnb is a shortened version of its original name, AirBedandBreakfast.com.

    About Dataset

    Context

    Since 2008, guests and hosts have used Airbnb to travel in a more unique, personalized way. As part of the Airbnb Inside initiative, this dataset describes the listing activity of homestays in New York City.

    Content

    The following Airbnb activity is included in this New York dataset:

    Listings, including full descriptions and average review score
    Reviews, including unique id for each reviewer and detailed comments
    Calendar, including listing id and the price and availability for that day

    Data Dictionary

    Data dictionaries are used to provide detailed information about the contents of a dataset or database, such as the names of measured variables, their data types or formats, and text descriptions. A data dictionary provides a concise guide to understanding and using the data: https://docs.google.com/spreadsheets/d/1b_dvmyhb_kAJhUmv81rAxl4KcXn0Pymz
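
    As an informal illustration of generating such a skeleton for a tabular file (a minimal sketch; the file name is hypothetical, and descriptions are filled in by hand, as in the linked spreadsheet):

        import pandas as pd

        # Minimal sketch: derive a data dictionary skeleton from a tabular
        # file. The file name is hypothetical.
        df = pd.read_csv("airbnb_open_data.csv")

        data_dictionary = pd.DataFrame({
            "variable": df.columns,
            "dtype": [str(t) for t in df.dtypes],
            "n_missing": df.isna().sum().to_numpy(),
            "description": "",  # to be written by the dataset maintainer
        })
        print(data_dictionary.head())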

    Inspiration

    Learn Data Cleaning

    Data Cleaning Challenge

    Data Cleaning Practice for beginners

    Handling missing values

    Handling Outliers

    Handling inconsistent data (see the sketch after this list)

    Data Visualization

    Data analysis

    What can we learn about different hosts and areas?

    What can we learn from predictions? (e.g., locations, prices, reviews, etc.)

    Which hosts are the busiest and why?
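
    As a starting point, here is a minimal cleaning sketch covering the three tasks above; the column names ("price", "reviews_per_month", "neighbourhood") are hypothetical and will need adjusting to the actual file.

        import pandas as pd

        # Minimal cleaning sketch for the three tasks above. Column names
        # ("price", "reviews_per_month", "neighbourhood") are hypothetical.
        df = pd.read_csv("airbnb_open_data.csv")

        # Missing values: drop rows missing the target, fill the rest.
        df = df.dropna(subset=["price"])
        df["reviews_per_month"] = df["reviews_per_month"].fillna(0)

        # Outliers: clip price to the 1st-99th percentile range.
        lo, hi = df["price"].quantile([0.01, 0.99])
        df["price"] = df["price"].clip(lo, hi)

        # Inconsistent data: normalise free-text categories.
        df["neighbourhood"] = df["neighbourhood"].str.strip().str.title()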

    Acknowledgment

    This dataset is based on Airbnb Inside, but I added new columns and introduced many data inconsistency issues to create a new dataset for practicing data cleaning. The original source can be found here: http://insideairbnb.com/explore/

    Arian Azmoudeh

    @arianazmoudeh

    https://www.linkedin.com/in/arianazmoudeh/

    I hope you enjoy it.
