This dataset tracks the updates made on the dataset "Fee-for-Service Provider Data Dictionary" as a repository for previous versions of the data and metadata.
The included dataset contains 10,000 synthetic Veteran patient records generated by Synthea. The scope of the data includes over 500 clinical concepts across 90 disease modules, as well as additional social determinants of health (SDoH) data elements that are not traditionally tracked in electronic health records. Each synthetic patient conceptually represents one Veteran in the existing US population; each Veteran has a name, sociodemographic profile, a series of documented clinical encounters and diagnoses, as well as associated cost and payer data. To learn more about Synthea, please visit the Synthea wiki at https://github.com/synthetichealth/synthea/wiki. To find a description of how this dataset is organized by data type, please visit the Synthea CSV File Data Dictionary at https://github.com/synthetichealth/synthea/wiki/CSV-File-Data-Dictionary.
https://spdx.org/licenses/CC0-1.0.html
Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.
Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.
Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.
Conclusion: We demonstrate the feasibility of the facile eLAB workflow. EHR data are successfully transformed and bulk-loaded/imported into a REDCap-based national registry to enable real-world data analysis and interoperability.
Methods eLAB Development and Source Code (R statistical software):
eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).
eLAB reformats EHR data abstracted for an identified population of patients (e.g., a medical record number (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs or names; eLAB converts these to MCCPR-assigned record identification numbers (record_id) before import, de-identifying the data.
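The MRN-to-record_id conversion can be illustrated with a minimal sketch; the crosswalk file name and column names below are assumptions for demonstration, not eLAB defaults.

```r
library(dplyr)
library(readr)

# Hypothetical crosswalk maintained locally under the IRB-approved protocol:
# one row per patient, with the MRN and the registry-assigned record_id.
crosswalk <- read_csv("mrn_record_id_crosswalk.csv", col_types = "cc")

labs_raw <- read_csv("ehr_lab_pull.csv")       # bulk lab pull keyed by MRN

labs_deid <- labs_raw %>%
  inner_join(crosswalk, by = "MRN") %>%        # attach the registry record_id
  select(-MRN)                                 # drop the identifier before import
```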
Functions were written to remap EHR bulk lab data pulls/queries from several sources, including Clarity/Crystal reports or an institutional enterprise data warehouse (EDW) such as the Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R-markdown (https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.
The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).
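For illustration, a minimal mock of the untidy input described above might look like the sketch below; the exact column names expected by `ehr_format()` are assumptions here and should be confirmed against the mock dataset in the eLAB repository.

```r
# Mock untidy input: one row per collection, with a full lab panel packed
# into a single cell of the 'Lab Results' column (column names are assumptions).
dt <- data.frame(
  `Patient Name (MRN)` = "DOE,JANE (0001234)",
  `Collection Date`    = "2021-03-01",
  `Collection Time`    = "08:15",
  `Lab Results`        = "Sodium 140 mmol/L; Potassium 4.1 mmol/L; Creatinine 0.9 mg/dL",
  check.names = FALSE
)

# Single-line command described in the text; reshapes the packed panel into tidy rows.
dt_tidy <- ehr_format(dt)
```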
Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.
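The key-value remapping can be sketched as a join against a small lookup table. The potassium subtype strings below come from the example in the text; the DD codes, units, file name, and input column names are placeholders, not the actual eLAB lookup table.

```r
library(dplyr)

# Illustrative slice of a lab lookup table: raw EHR subtype -> DD code and expected unit.
lab_lookup <- tibble::tribble(
  ~ehr_lab_name,                ~dd_code,    ~dd_unit,
  "Potassium",                  "potassium", "mmol/L",
  "Potassium-External",         "potassium", "mmol/L",
  "Potassium(POC)",             "potassium", "mmol/L",
  "Potassium,whole-bld",        "potassium", "mmol/L",
  "Potassium-Level-External",   "potassium", "mmol/L",
  "Potassium,venous",           "potassium", "mmol/L",
  "Potassium-whole-bld/plasma", "potassium", "mmol/L"
)

# Hypothetical de-identified lab pull with columns lab_name, result, result_unit.
labs <- readr::read_csv("labs_deidentified.csv")

# Keep only labs defined in the lookup, relabel them with the DD code, and
# drop results whose units do not match the unit pre-defined by the DD.
labs_remapped <- labs %>%
  inner_join(lab_lookup, by = c("lab_name" = "ehr_lab_name")) %>%
  filter(result_unit == dd_unit)
```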
Data Dictionary (DD)
EHR clinical laboratory data are captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and its associated lab unit with a set field/variable name. It also restricts/defines the type of data allowed for entry in each data field, such as a string or a numeric value. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines such as eLAB are designed to remap/clean and reformat data/units using key-value lookup tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contain the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field codes, formats, and relationships in the database are uniform across sites, allowing simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and the different site csv files are simply combined.
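Because every site exports data coded against the same DD, aggregation reduces to stacking identically structured files. A minimal sketch, with placeholder directory and file names:

```r
library(dplyr)
library(purrr)
library(readr)

# Each site exports a REDCap-formatted CSV coded against the same DD, so the
# files share column names and value codes and can simply be row-bound.
site_files <- list.files("site_exports", pattern = "\\.csv$", full.names = TRUE)

mcc_registry <- site_files %>%
  map(read_csv, col_types = cols(.default = col_character())) %>%
  bind_rows()
```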
Study Cohort
This study was approved by the MGB IRB. A search of the EHR was performed to identify patients diagnosed with MCC between 1975 and 2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016 and 2019 (N=176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.
Statistical Analysis
OS is defined as the time from the date of MCC diagnosis to the date of death. Data were censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazards modeling was performed for all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.
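A hedged sketch of the univariable screen described above, using the survival package already listed among eLAB's dependencies; the analysis-frame column names and lab predictors are placeholders:

```r
library(survival)

# Hypothetical analysis frame: one row per patient with baseline labs,
# follow-up time in months (os_months) and death indicator (os_event).
baseline <- readr::read_csv("baseline_labs.csv")

lab_vars <- c("sodium", "potassium", "creatinine")   # placeholder lab predictors

# Fit one Cox proportional hazards model per lab predictor (univariable screen).
univariable_fits <- lapply(lab_vars, function(v) {
  coxph(as.formula(paste("Surv(os_months, os_event) ~", v)), data = baseline)
})

# Exploratory hazard ratios and p-values; no multiplicity correction applied.
summaries <- lapply(univariable_fits, summary)
```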
These data are the results of a systematic review that investigated how data standards and reporting formats are documented on the version control platform GitHub. Our systematic review identified 32 data standards in earth science, environmental science, and ecology that use GitHub for version control of data standard documents. In our analysis, we characterized the documents and content within each of the 32 GitHub repositories to identify common practices for groups that version control their documents on GitHub. In this data package, there are 8 CSV files that contain data that we characterized from each repository, according to the location within the repository. For example, in 'readme_pages.csv' we characterize the content that appears across the 32 GitHub repositories included in our systematic review. Each of the 8 CSV files has an associated data dictionary file (names appended with '_dd.csv'), in which we describe each content category within the CSV files. There is one file-level metadata file (flmd.csv) that provides a description of each file within the data package.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains a preprocessed version of the publicly available MNIST handwritten digit dataset, formatted for use in the research paper "A fast dictionary-learning-based classification scheme using undercomplete dictionaries". The data have been converted into vector form and sorted into .mat files by class label, ranging from 0 to 9. The files are split into training and testing sets: X_train contains the vectorized images and Y_train the corresponding labels, while X_test and Y_test contain the images and labels for the testing set.
**Contents:**
X_train_vector_sort_MNIST
Y_train_MNIST
X_test_vector_MNIST
Y_test_MNIST
**Usage:**
The dataset is intended for direct use with the code available at: https://github.com/saeedmohseni97/fast-udl-classification
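A minimal sketch of loading these files in R with the R.matlab package (assuming the .mat files are not saved in the v7.3/HDF5 format, which readMat does not handle); the file and variable names are taken from the description above and should be checked against the download:

```r
library(R.matlab)

# Load vectorized training images and labels (file names as listed in Contents).
train_x <- readMat("X_train_vector_sort_MNIST.mat")
train_y <- readMat("Y_train_MNIST.mat")

# readMat() returns a named list; inspect the names to find the stored arrays.
str(train_x)
str(train_y)
```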
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LScD (Leicester Scientific Dictionary)
April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

[Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus), Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in the description of Version 2 below; the explanation is not repeated here. After pre-processing, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are the same as those described for LScD Version 2 below.
* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2

[Version 2] Getting Started
This document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). This dictionary is created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from LSC and instructions for usage of the code are available in [2]. The code can also be used for lists of texts from other sources; amendments to the code may be required.

LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains a title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018. The total number of documents in LSC is 1,673,824.

LScD is an ordered list of words from the texts of abstracts in LSC. The dictionary stores 974,238 unique words and is sorted by the number of documents containing the word in descending order. All words in the LScD are in their stemmed form. The LScD contains the following information:
1. Unique words in abstracts
2. Number of documents containing each word
3. Number of appearances of a word in the entire corpus

Processing the LSC
Step 1. Downloading the LSC Online: Use of the LSC is subject to acceptance of a request for the link by email. To access the LSC for research purposes, please email ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.
Step 2. Importing the Corpus to R: The full R code for processing the corpus can be found on GitHub [2]. All following steps can be applied to an arbitrary list of texts from any source with changes of parameters. The structure of the corpus, such as file format and the names (and positions) of fields, should be taken into account when applying our code. The organisation of the CSV files of LSC is described in the README file for LSC [1].
Step 3. Extracting Abstracts and Saving Metadata: Metadata, which include all fields in a document except the abstract, and the abstract field are separated. Metadata are then saved as MetaData.R.
Fields of metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.
Step 4. Text Pre-processing Steps on the Collection of Abstracts: In this section, we present our approaches to pre-processing the abstracts of the LSC.
1. Removing punctuation and special characters: This is the process of substituting all non-alphanumeric characters with spaces. We did not substitute the character “-” in this step, because we need to keep words like “z-score”, “non-payment” and “pre-processing” in order not to lose the actual meaning of such words. A process of uniting prefixes with words is performed in later steps of pre-processing.
2. Lowercasing the text data: Lowercasing is performed to avoid treating words like “Corpus”, “corpus” and “CORPUS” differently. The entire collection of texts is converted to lowercase.
3. Uniting prefixes of words: Words containing prefixes joined with the character “-” are united into a single word. The prefixes united for this research are listed in the file “list_of_prefixes.csv”. Most of the prefixes are extracted from [4]. We also added commonly used prefixes: ‘e’, ‘extra’, ‘per’, ‘self’ and ‘ultra’.
4. Substitution of words: Some words joined with “-” in the abstracts of the LSC require an additional substitution step to avoid losing the meaning of the word before removing the character “-”. Examples of such words are “z-test”, “well-known” and “chi-square”; these have been substituted with “ztest”, “wellknown” and “chisquare”. Identification of such words was done by sampling abstracts from the LSC. The full list of such words and the decisions taken for substitution are presented in the file “list_of_substitution.csv”.
5. Removing the character “-”: All remaining “-” characters are replaced by spaces.
6. Removing numbers: All digits that are not part of a word are replaced by spaces. All words that contain both digits and letters are kept, because alphanumeric strings such as chemical formulae might be important for our analysis. Examples are “co2”, “h2o” and “21st”.
7. Stemming: Stemming is the process of converting inflected words into their word stem. This step unites several forms of words with similar meaning into one form and also saves memory space and time [5]. All words in the LScD are stemmed to their word stem.
8. Stop-word removal: Stop words are words that are extremely common but provide little value in a language. Some common stop words in English are ‘I’, ‘the’, ‘a’, etc. We used the ‘tm’ package in R to remove stop words [6]; there are 174 English stop words listed in the package.
Step 5. Writing the LScD into CSV Format: There are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written to the file “LScD.csv”.

The Organisation of the LScD
The total number of words in the file “LScD.csv” is 974,238. Each field is described below:
Word: Contains the unique words from the corpus. All words are in lowercase and in their stemmed forms. The field is sorted by the number of documents that contain the word, in descending order.
Number of Documents Containing the Word: A binary calculation is used here: if a word exists in an abstract, it is counted as 1; if the word occurs more than once in a document, the count is still 1.
The total number of documents containing the word is the sum of these 1s over the entire corpus.
Number of Appearances in Corpus: How many times a word occurs in the corpus when the corpus is considered as one large document.

Instructions for R Code
LScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as an RData file and in CSV format. Outputs of the code are:
Metadata File: Includes all fields in a document except the abstract. Fields are List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.
File of Abstracts: Contains all abstracts after the pre-processing steps defined in Step 4.
DTM: The Document Term Matrix constructed from the LSC [6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.
LScD: An ordered list of words from the LSC as defined in the previous section.
The code can be used as follows:
1. Download the folder ‘LSC’, ‘list_of_prefixes.csv’ and ‘list_of_substitution.csv’.
2. Open the LScD_Creation.R script.
3. Change the parameters in the script: replace them with the full path of the directory containing the source files and the full path of the directory to which output files will be written.
4. Run the full code.

References
[1] N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1
[2] N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION
[3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[4] A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.
[5] C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved porter's stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.
[6] I. Feinerer, "Introduction to the tm Package: Text Mining in R," available online: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.
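A compressed sketch of the core tm-based steps (lowercasing, punctuation and number removal, stop-word removal, stemming, DTM construction). The prefix-uniting and substitution steps, which rely on list_of_prefixes.csv and list_of_substitution.csv, are omitted here for brevity, and unlike the LScD pipeline this sketch also strips digits inside words such as "co2".

```r
library(tm)
library(SnowballC)  # stemming backend for stemDocument()

abstracts <- readLines("abstracts.txt")   # placeholder source of abstract texts

corpus <- VCorpus(VectorSource(abstracts))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation, preserve_intra_word_dashes = TRUE)  # keep "z-score" etc.
corpus <- tm_map(corpus, removeNumbers)                 # note: also removes digits within words
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)

dtm <- DocumentTermMatrix(corpus)   # word counts per document, as in the LScD outputs
```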
This dataset tracks the updates made on the dataset "PLACES and 500 Cities: Data Dictionary" as a repository for previous versions of the data and metadata.
Attribution 2.5 (CC BY 2.5) https://creativecommons.org/licenses/by/2.5/
License information was derived automatically
This file contains data dictionaries for the following datasets within LMIP (http://lmip.gov.au/):
Summary Data
Employment by Industry
Employment by Industry Time Series
Employment Projections by Industry
Employment by Occupation
Unemployment Rate, Participation Rate & Employment Rate Time Series for States/Territories
Unemployment Duration
Population by Age Group
Population by Age Group Time Series
Population by Labour Force Status
The data contains 101,702 entries. All words and pronunciations are produced by Japanese linguists. It can be used in the research and development of Japanese ASR technology.
This dataset tracks the updates made on the dataset "Fee-for-Service Provider Data Dictionary" as a repository for previous versions of the data and metadata.
https://spdx.org/licenses/CC0-1.0.html
Several software suites support access to the STEDT database. These are written in Perl and PHP, and present different capabilities and dimensions of this linguistic data. This object is a compressed archive of the SVN code repository for the project as of January 5, 2015. The active repository is now on GitHub at https://github.com/stedt-project/sss.
Software
Model simulations were conducted using WRF version 3.8.1 (available at https://github.com/NCAR/WRFV3) and CMAQ version 5.2.1 (available at https://github.com/USEPA/CMAQ). The meteorological and concentration fields created using these models are too large to archive on ScienceHub, approximately 1 TB, and are archived on EPA’s high performance computing archival system (ASM) at /asm/MOD3APP/pcc/02.NOAH.v.CLM.v.PX/.
Figures
Figures 1 – 6 and Figure 8: Created using NCAR Command Language (NCL) scripts (https://www.ncl.ucar.edu/get_started.shtml). NCL code can be downloaded from the NCAR website (https://www.ncl.ucar.edu/Download/) at no cost. The data used for these figures are archived on EPA’s ASM system and are available upon request.
Figures 7, 8b-c, 8e-f, 8h-i, and 9 were created using the AMET utility developed by U.S. EPA/ORD. AMET can be freely downloaded and used at https://github.com/USEPA/AMET. The modeled data paired in space and time provided in this archive can be used to recreate these figures.
The data contained in the compressed zip files are organized in comma delimited files with descriptive headers or space delimited files that match tabular data in the manuscript. The data dictionary provides additional information about the files and their contents.
This dataset is associated with the following publication: Campbell, P., J. Bash, and T. Spero. Updates to the Noah Land Surface Model in WRF‐CMAQ to Improve Simulated Meteorology, Air Quality, and Deposition. Journal of Advances in Modeling Earth Systems. John Wiley & Sons, Inc., Hoboken, NJ, USA, 11(1): 231-256, (2019).
detectors
is an R data package containing predictions from various GPT detectors. The data is based on the paper:
GPT Detectors Are Biased Against Non-Native English Writers. Weixin Liang, Mert Yuksekgonul, Yining Mao, Eric Wu, James Zou. Patterns (Cell Press).
The study authors carried out a series of experiments passing a number of essays to different GPT detection models. Juxtaposing detector predictions for papers written by native and non-native English writers, the authors argue that GPT detectors disproportionately classify real writing from non-native English writers as AI-generated.
Data source: https://github.com/simonpcouch/detectors/
detectors.csv
variable | class | description |
---|---|---|
kind | character | Whether the essay was written by a "Human" or "AI". |
.pred_AI | double | The class probability from the GPT detector that the inputted text was written by AI. |
.pred_class | character | The uncalibrated class prediction, encoded as if_else(.pred_AI > .5, "AI", "Human") |
detector | character | The name of the detector used to generate the predictions. |
native | character | For essays written by humans, whether the essay was written by a native English writer or not. These categorizations are coarse; values of "Yes" may actually be written by people who do not write with English natively. NA indicates that the text was not written by a human. |
name | character | A label for the experiment that the predictions were generated from. |
model | character | For essays that were written by AI, the name of the model that generated the essay. |
document_id | double | A unique identifier for the supplied essay. Some essays were supplied to multiple detectors. Note that some essays are AI-revised derivatives of others. |
prompt | character | For essays that were written by AI, a descriptor for the form of "prompt engineering" passed to the model. |
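A brief sketch of working with detectors.csv in R, reproducing the .pred_class encoding noted in the dictionary above and comparing detector calls for native and non-native human writers (column names as documented above; the file path is assumed):

```r
library(dplyr)
library(readr)

detectors <- read_csv("detectors.csv")

# Reconstruct the uncalibrated class prediction exactly as described above.
detectors <- detectors %>%
  mutate(pred_class_check = if_else(.pred_AI > 0.5, "AI", "Human"))

# Share of human-written essays flagged as AI, split by native-English status.
detectors %>%
  filter(kind == "Human") %>%
  group_by(detector, native) %>%
  summarise(flagged_ai = mean(pred_class_check == "AI"), .groups = "drop")
```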
Around 80 thousand Bangla words are openly available to all Bangla language researchers. This list is a combination of the following two datasets:
https://github.com/MinhasKamal/BengaliDictionary https://www.kaggle.com/datasets/rafsun/bengali-words
A user dictionary for the morphological analysis engine MeCab (http://taku910.github.io/mecab/), built from J-GLOBAL science and technology terms that have been linked to the Japan Chemical Substance Dictionary (Nikkaji), an organic compound dictionary database prepared by the Japan Science and Technology Agency. The dictionary items are based on the IPA dictionary. The csv file is encoded in Shift-JIS and the dic file is encoded in UTF-8.
The dataset is a synthetic cohort for use in the VHA Innovation Ecosystem and precisionFDA COVID-19 Risk Factor Modeling Challenge. The dataset was generated using Synthea, a tool created by MITRE to generate synthetic electronic health records (EHRs) from curated care maps and publicly available statistics. This dataset represents 147,451 patients developed using the COVID-19 module. The dataset format conforms to the CSV file outputs. Below are links to all relevant information.
PrecisionFDA Challenge: https://precision.fda.gov/challenges/11 Synthea homepage: https://synthetichealth.github.io/synthea/ Synthea GitHub repository: https://github.com/synthetichealth/synthea Synthea COVID-19 Module publication: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7531559/ CSV File Format Data Dictionary: https://github.com/synthetichealth/synthea/wiki/CSV-File-Data-Dictionary
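As a minimal illustration, the Synthea CSV output can be read and joined in R. The file names and columns below (patients.csv with Id, conditions.csv with PATIENT) follow the CSV File Data Dictionary linked above, but should be checked against the actual download.

```r
library(dplyr)
library(readr)

# Paths assume the Synthea csv output directory has been extracted locally.
patients   <- read_csv("csv/patients.csv")
conditions <- read_csv("csv/conditions.csv")

# Count recorded conditions per synthetic patient and attach patient attributes.
condition_counts <- conditions %>%
  count(PATIENT, name = "n_conditions") %>%
  left_join(patients, by = c("PATIENT" = "Id"))
```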
The MBTA GTFS Pre-rating Recap collection contains text files that describe a generalized MBTA schedule for a specific season. While the MBTA posts all previously published GTFS files, this collection makes it easier to find the schedule that the MBTA ran for the majority of a season instead of having to identify the "correct" one from the GTFS archive. We recommend using these files instead of the current GTFS when doing historical analyses.
Data dictionary: https://github.com/mbta/gtfs-documentation/blob/master/reference/gtfs.md
To view all previously published GTFS files, please refer to the link below: https://github.com/mbta/gtfs-documentation/blob/master/reference/gtfs-archive.md
MassDOT/MBTA shall not be held liable for any errors in this data. This includes errors of omission, commission, errors concerning the content of the data, and relative and positional accuracy of the data. This data cannot be construed to be a legal document. Primary sources from which this data was compiled must be consulted for verification of information contained in this data.
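GTFS feeds are plain CSV text files with names fixed by the GTFS specification (e.g. routes.txt, trips.txt), so a seasonal recap can be loaded directly. A minimal R sketch, assuming the feed zip has been extracted to a local gtfs/ directory:

```r
library(dplyr)
library(readr)

# Standard GTFS tables; file names come from the GTFS specification.
routes <- read_csv("gtfs/routes.txt")
trips  <- read_csv("gtfs/trips.txt")

# Number of scheduled trips per route in this seasonal recap.
trips %>%
  count(route_id, name = "n_trips") %>%
  left_join(routes, by = "route_id") %>%
  arrange(desc(n_trips))
```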
Attention to the relations between visual features modulates hippocampal representations. Moreover, hippocampal damage impairs discrimination of spatial relations. We explore a mechanism by which this might occur: modulation by the acetylcholine system. Acetylcholine enhances afferent input to the hippocampus and suppresses recurrent connections within it. This biases hippocampal processing toward environmental input, and should improve externally-oriented, hippocampally mediated attention and perception. We examined cholinergic modulation on an attention task that recruits the hippocampus. On each trial, participants viewed two images (rooms with paintings). On “similar room” trials, they judged whether the rooms had the same spatial layout from a different perspective. On “similar art” trials, they judged whether the paintings could have been painted by the same artist. On “identical” trials, participants simply had to detect identical paintings or rooms. We predicted that cholinergic...
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The MBTA GTFS Post-rating Recap collection contains text files that describe the MBTA schedule for a specific season, including any planned changes in service that became known after the rating began (weekend shuttle buses, a change to a reduced service schedule due to a snow day, etc.).
Data dictionary: https://github.com/mbta/gtfs-documentation/blob/master/reference/gtfs.md
To view all previously published GTFS files, please refer to the link below: https://github.com/mbta/gtfs-documentation/blob/master/reference/gtfs-archive.md
MassDOT/MBTA shall not be held liable for any errors in this data. This includes errors of omission, commission, errors concerning the content of the data, and relative and positional accuracy of the data. This data cannot be construed to be a legal document. Primary sources from which this data was compiled must be consulted for verification of information contained in this data.
This dataset tracks the updates made on the dataset "Fee-for-Service Provider Data Dictionary" as a repository for previous versions of the data and metadata.