9 datasets found
  1. LScD (Leicester Scientific Dictionary)

    • figshare.le.ac.uk
    docx
    Updated Apr 15, 2020
    + more versions
    Cite
    Neslihan Suzen (2020). LScD (Leicester Scientific Dictionary) [Dataset]. http://doi.org/10.25392/leicester.data.9746900.v3
    Available download formats: docx
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

    LScD (Leicester Scientific Dictionary)

    April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

    [Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus), Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in the description of Version 2 below; we do not repeat the explanation. After pre-processing, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are also the same as described for LScD Version 2 below.

    * Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
    ** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2

    [Version 2] Getting Started

    This document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). The dictionary is created for future work on the quantification of the meaning of research texts. R code for producing the dictionary from the LSC, with usage instructions, is available in [2]. The code can also be used for lists of texts from other sources; amendments to the code may be required.

    LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains the title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018. The total number of documents in LSC is 1,673,824.

    LScD is an ordered list of words from the texts of abstracts in LSC. The dictionary stores 974,238 unique words and is sorted by the number of documents containing each word, in descending order. All words in the LScD are in stemmed form. The LScD contains the following information:

    1. Unique words in abstracts
    2. Number of documents containing each word
    3. Number of appearances of a word in the entire corpus

    Processing the LSC

    Step 1. Downloading the LSC Online: Use of the LSC is subject to acceptance of a link request by email. To access the LSC for research purposes, please email ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.

    Step 2. Importing the Corpus to R: The full R code for processing the corpus can be found on GitHub [2]. All following steps can be applied to an arbitrary list of texts from any source, with changes of parameters. The structure of the corpus, such as the file format and the names (and positions) of fields, should be taken into account when applying the code. The organisation of the CSV files of LSC is described in the README file for LSC [1].

    Step 3. Extracting Abstracts and Saving Metadata: Metadata (all fields in a document excluding abstracts) and the abstract field are separated. Metadata are then saved as MetaData.R. The metadata fields are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.

    Step 4. Text Pre-processing Steps on the Collection of Abstracts: This section presents our approach to pre-processing the abstracts of the LSC (an illustrative R sketch of these steps appears after the references below).

    1. Removing punctuation and special characters: All non-alphanumeric characters are substituted by a space. The character "-" is not substituted in this step, because words like "z-score", "non-payment" and "pre-processing" must be kept so as not to lose their actual meaning. Prefixes are united with words in later pre-processing steps.
    2. Lowercasing the text data: Lowercasing is performed to avoid treating words like "Corpus", "corpus" and "CORPUS" differently. The entire collection of texts is converted to lowercase.
    3. Uniting prefixes of words: Words containing prefixes joined with the character "-" are united into one word. The prefixes united for this research are listed in the file "list_of_prefixes.csv". Most of the prefixes are extracted from [4]. We also added commonly used prefixes: 'e', 'extra', 'per', 'self' and 'ultra'.
    4. Substitution of words: Some words joined with "-" in the abstracts of the LSC require an additional substitution step to avoid losing their meaning when the character "-" is removed. Examples are "z-test", "well-known" and "chi-square", which are substituted by "ztest", "wellknown" and "chisquare". Such words are identified by sampling abstracts from the LSC. The full list of such words and the substitution decisions are presented in the file "list_of_substitution.csv".
    5. Removing the character "-": All remaining "-" characters are replaced by a space.
    6. Removing numbers: All digits not included in a word are replaced by a space. Words that contain both digits and letters are kept, because alphanumeric tokens such as chemical formulae might be important for the analysis; examples are "co2", "h2o" and "21st".
    7. Stemming: Stemming converts inflected words to their word stem. This unites several forms of words with similar meaning into one form and also saves memory and time [5]. All words in the LScD are stemmed.
    8. Stop word removal: Stop words are extremely common words that provide little value in a language, such as 'I', 'the' and 'a'. We used the 'tm' package in R to remove stop words [6]; the package lists 174 English stop words.

    Step 5. Writing the LScD into CSV Format: There are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written to the file "LScD.csv".

    The Organisation of the LScD

    The total number of words in the file "LScD.csv" is 974,238. Each field is described below:

    Word: Unique words from the corpus, in lowercase and stemmed form. The field is sorted by the number of documents containing each word, in descending order.
    Number of Documents Containing the Word: A binary count is used: if a word exists in an abstract, it counts as 1; if it exists more than once in a document, the count is still 1. The total number of documents containing the word is the sum of these 1s over the entire corpus.
    Number of Appearances in Corpus: How many times a word occurs in the corpus when the corpus is considered as one large document.

    Instructions for R Code

    LScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as an RData file and in CSV format. The outputs are:

    Metadata File: All fields in a document excluding abstracts: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.
    File of Abstracts: All abstracts after the pre-processing steps defined in Step 4.
    DTM: The Document Term Matrix constructed from the LSC [6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.
    LScD: An ordered list of words from LSC as defined in the previous section.

    To use the code:

    1. Download the folder 'LSC', 'list_of_prefixes.csv' and 'list_of_substitution.csv'.
    2. Open the LScD_Creation.R script.
    3. Change the parameters in the script: replace them with the full path of the directory with source files and the full path of the directory to write output files.
    4. Run the full code.

    References

    [1] N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1
    [2] N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION
    [3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
    [4] A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.
    [5] C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved porter's stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.
    [6] I. Feinerer, "Introduction to the tm Package: Text Mining in R," available online: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.
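    The Step 4 pipeline above maps naturally onto a few lines of R. The following is a minimal sketch under stated assumptions, not the authors' LScD_Creation.R: only one example substitution is shown, prefix uniting is omitted, and stop words are removed before stemming so they match the 'tm' stop list.

    # Minimal sketch of the Step 4 pre-processing (illustrative, not the authors' script).
    library(tm)          # provides stopwords()
    library(SnowballC)   # provides wordStem()

    preprocess_abstract <- function(text) {
      text <- tolower(text)                           # 4.2: lowercase
      text <- gsub("[^a-z0-9-]", " ", text)           # 4.1: drop punctuation, keep "-"
      text <- gsub("well-known", "wellknown", text)   # 4.4: one example substitution
      text <- gsub("-", " ", text)                    # 4.5: remove remaining "-"
      text <- gsub("\\b[0-9]+\\b", " ", text)         # 4.6: standalone digits only; "co2" survives
      words <- unlist(strsplit(text, "\\s+"))
      words <- words[words != "" & !(words %in% stopwords("english"))]  # 4.8: stop words
      wordStem(words, language = "english")           # 4.7: stem to word stems
    }

    preprocess_abstract("The well-known z-score was computed for CO2 data in 2014.")
    # yields stemmed tokens along the lines of "wellknown", "comput", "co2"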

  2. AP Research Data: Term Limits and their Relationship with Economic and...

    • figshare.com
    xlsx
    Updated Apr 30, 2024
    Cite
    Soo Ho Hong (2024). AP Research Data: Term Limits and their Relationship with Economic and Governmental Indicators.xlsx [Dataset]. http://doi.org/10.6084/m9.figshare.25720812.v1
    Available download formats: xlsx
    Dataset updated
    Apr 30, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Soo Ho Hong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset for the AP Research project on Term Limits and their Relationship with Economic and Governmental Indicators. The project used correlational analysis to compare de facto and de jure term limits with various economic and developmental variables.
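    For readers unfamiliar with the method, a correlational analysis of this shape takes only a few lines of R. This is an illustrative sketch only; the column names and values below are invented, not taken from the actual spreadsheet:

    # Hypothetical example of correlating a term-limit measure with an
    # economic indicator; values are made up for illustration.
    df <- data.frame(
      de_jure_term_limit = c(4, 5, 8, 10, 12),
      gdp_per_capita     = c(12000, 18000, 9000, 15000, 7000)
    )
    cor(df$de_jure_term_limit, df$gdp_per_capita, method = "pearson")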

  3. LSC (Leicester Scientific Corpus)

    • figshare.le.ac.uk
    Updated Apr 15, 2020
    Cite
    Neslihan Suzen (2020). LSC (Leicester Scientific Corpus) [Dataset]. http://doi.org/10.25392/leicester.data.9449639.v2
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

    The LSC (Leicester Scientific Corpus)

    April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

    The data are extracted from the Web of Science [1]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.

    [Version 2] Further cleaning is applied to the Version 1* abstracts during data processing; details of the cleaning procedure are explained in Step 6.

    * Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v1

    Getting Started

    This text provides information on the LSC (Leicester Scientific Corpus) and the pre-processing steps applied to abstracts, and describes the structure of the files that organise the corpus. The corpus is created for future work on the quantification of the meaning of research texts, and to be made available for Natural Language Processing projects.

    LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [1]. The corpus contains only documents in English. Each document in the corpus contains the following parts:

    1. Authors: The list of authors of the paper
    2. Title: The title of the paper
    3. Abstract: The abstract of the paper
    4. Categories: One or more categories from the list of categories [2]. The full list of categories is presented in the file 'List_of_Categories.txt'.
    5. Research Areas: One or more research areas from the list of research areas [3]. The full list of research areas is presented in the file 'List_of_Research_Areas.txt'.
    6. Total Times Cited: The number of times the paper was cited by other items from all databases within the Web of Science platform [4]
    7. Times Cited in Core Collection: The total number of times the paper was cited by other papers within the WoS Core Collection [4]

    The corpus was collected online in July 2018 and contains the number of citations from publication date to July 2018. We describe a document as the collection of information (about a paper) listed above. The total number of documents in LSC is 1,673,350.

    Data Processing

    Step 1: Downloading the Data Online. The dataset is collected manually by exporting documents as tab-delimited files online. All documents are available online.

    Step 2: Importing the Dataset to R. The LSC was collected as TXT files. All documents are extracted into R.

    Step 3: Cleaning the Data of Documents with an Empty Abstract or without a Category. As our research is based on the analysis of abstracts and categories, all documents with empty abstracts or without categories are removed.

    Step 4: Identification and Correction of Concatenated Words in Abstracts. Medicine-related publications in particular use 'structured abstracts', which are divided into sections with distinct headings such as introduction, aim, objective, method, result, conclusion etc. The tool used for extracting abstracts concatenates section headings with the first word of the section, producing words such as ConclusionHigher and ConclusionsRT. Such words are detected by sampling medicine-related publications with human intervention, and each concatenated word is split into two words; for instance, 'ConclusionHigher' is split into 'Conclusion' and 'Higher' (an illustrative R sketch of this correction appears after the references below). The section headings in such abstracts are: Background, Method(s), Design, Theoretical, Measurement(s), Location, Aim(s), Methodology, Process, Abstract, Population, Approach, Objective(s), Purpose(s), Subject(s), Introduction, Implication(s), Patient(s), Procedure(s), Hypothesis, Measure(s), Setting(s), Limitation(s), Discussion, Conclusion(s), Result(s), Finding(s), Material(s), Rationale(s), Implications for health and nursing policy.

    Step 5: Extracting (Sub-setting) the Data Based on Lengths of Abstracts. After correction, the lengths of abstracts are calculated. 'Length' is the total number of words in the text, calculated by the same rule as the Microsoft Word 'word count' [5]. According to the APA style manual [6], an abstract should contain between 150 and 250 words. In LSC, we limited the length of abstracts to between 30 and 500 words, in order to study documents with abstracts of typical length and to avoid the effect of length on the analysis.

    Step 6: [Version 2] Cleaning Copyright Notices, Permission Policies, Journal Names and Conference Names from LSC Abstracts in Version 1. Conferences and journals can append a footer with a copyright notice, permission policy, journal name, licence, author's rights or conference name below the text of an abstract. The tool used for extracting and processing abstracts from the WoS database attaches such footers to the text; for example, casual observation shows that copyright notices such as 'Published by Elsevier Ltd.' appear in many texts. To avoid abnormal appearances of words in further analysis, such as bias in frequency calculations, we performed a cleaning procedure on such sentences and phrases in the abstracts of LSC Version 1, removing copyright notices, names of conferences, names of journals, authors' rights, licences and permission policies identified by sampling abstracts.

    Step 7: [Version 2] Re-extracting (Sub-setting) the Data Based on Lengths of Abstracts. The cleaning procedure described in the previous step led to some abstracts falling below our minimum length criterion (30 words); 474 texts were removed.

    Step 8: Saving the Dataset into CSV Format. Documents are saved into 34 CSV files. In the CSV files, the information is organised with one record on each line, and the abstract, title, list of authors, list of categories, list of research areas, and times cited are recorded in fields.

    To access the LSC for research purposes, please email ns433@le.ac.uk.

    References

    [1] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
    [2] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
    [3] Research Areas in WoS. Available: https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html
    [4] Times Cited in WoS Core Collection. (15 July). Available: https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US
    [5] Word Count. Available: https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3
    [6] A. P. Association, Publication Manual. American Psychological Association, Washington, DC, 1983.
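    The Step 4 correction can be illustrated with a short R sketch. This is not the authors' code: the heading vector below is a small assumed subset of the full list given in Step 4, and the pattern only handles a heading fused to a capitalised word (all-caps continuations such as 'ConclusionsRT' would need a looser pattern).

    # Sketch of splitting concatenated section headings (Step 4).
    headings <- c("Conclusions", "Conclusion", "Results", "Methods", "Background")
    split_headings <- function(text) {
      # Insert a space between a known heading and a following Capitalised word.
      pattern <- paste0("\\b(", paste(headings, collapse = "|"), ")([A-Z][a-z])")
      gsub(pattern, "\\1 \\2", text)
    }
    split_headings("ConclusionHigher doses were associated with better outcomes.")
    # "Conclusion Higher doses were associated with better outcomes."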

  4. Data from: Long-Term Agroecosystem Research in the Central Mississippi River...

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    • +1more
    Updated Jun 5, 2025
    + more versions
    Cite
    Agricultural Research Service (2025). Data from: Long-Term Agroecosystem Research in the Central Mississippi River Basin: Goodwater Creek Experimental Watershed and Regional Herbicide Water Quality Data [Dataset]. https://catalog.data.gov/dataset/data-from-long-term-agroecosystem-research-in-the-central-mississippi-river-basin-goodwate-a5df5
    Dataset updated
    Jun 5, 2025
    Dataset provided by
    Agricultural Research Service
    Area covered
    Mississippi River System, Mississippi River
    Description

    The GCEW herbicide data were collected from 1991-2010 and are documented at plot, field, and watershed scales. Atrazine concentrations in Goodwater Creek Experimental Watershed (GCEW) were shown to be among the highest of any watershed in the United States, based on comparisons using the national Watershed Regressions for Pesticides (WARP) model and on direct comparison with the 112 watersheds used in the development of WARP. This 20-yr-long effort was augmented with a spatially broad effort within the Central Mississippi River Basin encompassing 12 related claypan watersheds in the Salt River Basin, two cave streams on the fringe of the Central Claypan Areas in the Bonne Femme watershed, and 95 streams in northern Missouri and southern Iowa. The research effort on herbicide transport has highlighted the importance of restrictive soil layers with smectitic mineralogy to the risk of transport vulnerability: near-surface soil features, such as claypans and argillic horizons, result in greater herbicide transport than soils with high saturated hydraulic conductivities and low smectitic clay content.

    The data set contains concentration, load, and daily discharge data for Devils Icebox Cave and Hunters Cave from 1999 to 2002, in Microsoft Excel 2010 format:

    Sheet 1 (Cave Streams Metadata): supporting information on the length of record, site locations, parameters measured, parameter units, and method detection limits; describes the meaning of zero and blank cells; and briefly describes unit area load computations.
    Sheet 2 (Devils Icebox Concentration Data): concentration data from all samples collected from 1999 to 2002 at the Devils Icebox site for 12 analytes and two computed nutrient parameters.
    Sheet 3 (Devils Icebox SS Conc Data): 15-minute suspended sediment (SS) concentrations estimated from turbidity sensor data for the Devils Icebox site.
    Sheet 4 (Devils Icebox Load & Discharge Data): daily data for discharge, load, and unit area loads for the Devils Icebox site.
    Sheet 5 (Hunters Cave Concentration Data): concentration data from all samples collected from 1999 to 2002 at the Hunters Cave site for 12 analytes and two computed nutrient parameters.
    Sheet 6 (Hunters Cave SS Conc Data): 15-minute SS concentrations estimated from turbidity sensor data for the Hunters Cave site.
    Sheet 7 (Hunters Cave Load & Discharge Data): daily data for discharge, load, and unit area loads for the Hunters Cave site.

    [Note: To support automated data access and processing, each worksheet has been extracted as a separate, machine-readable CSV file; see the Data Dictionary for descriptions of variables and their concentration units.]

    Resources in this dataset:

    Resource Title: README - Metadata. File Name: LTAR_GCEW_herbicidewater_qual.xlsx. Description: Defines Water Quality and Sediment Load/Discharge parameters, abbreviations, time-frames, and units as rendered in the Excel file. For additional information, including site information, method detection limits, and methods citations, see the Metadata tab. For definitions used in the machine-readable CSV files, see the Data Dictionary.
    Resource Title: Excel data spreadsheet. File Name: c3.jeq2013.12.0516.ds1_.xlsx. Description: Multi-page data spreadsheet containing data as well as metadata from this study. A direct download is available at: https://dl.sciencesocieties.org/publications/datasets/jeq/C3.JEQ2013.12.0516.ds1/download
    Resource Title: Devils Icebox Concentration Data. File Name: DevilsIceboxConcData.csv. Description: Concentrations of herbicides, metabolites, and nutrients (extracted from the Excel tab into machine-readable CSV data).
    Resource Title: Devils Icebox Load and Discharge Data. File Name: DevilsIceboxLoad&Discharge.csv. Description: Discharge and unit area loads for herbicides, metabolites, and suspended sediments (extracted from the Excel tab as machine-readable CSV data).
    Resource Title: Devils Icebox Suspended Sediment Concentration Data. File Name: DevilsIceboxSSConcData.csv. Description: Suspended sediment concentration data (extracted from the Excel tab as machine-readable CSV data).
    Resource Title: Hunters Cave Load and Discharge Data. File Name: HuntersCaveLoad&Discharge.csv. Description: Discharge and unit area loads for herbicides, metabolites, and suspended sediments (extracted from the Excel tab as machine-readable CSV data).
    Resource Title: Hunters Cave Suspended Sediment Concentration Data. File Name: HuntersCaveSSConc.csv. Description: Suspended sediment concentration data (extracted from the Excel tab as machine-readable CSV data).
    Resource Title: Data Dictionary for machine-readable CSV files. File Name: LTAR_GCEW_herbicidewater_qual.csv. Description: Defines Water Quality and Sediment Load/Discharge parameters, abbreviations, time-frames, and units as implemented in the extracted machine-readable CSV files.
    Resource Title: Hunters Cave Concentration Data. File Name: HuntersCaveConcData.csv. Description: Concentrations of herbicides, metabolites, and nutrients (extracted from the Excel tab into machine-readable CSV data).
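    Because each worksheet has been extracted as a machine-readable CSV, the data can be loaded directly, for example in R. A minimal sketch, assuming the files sit in the working directory; the commented column name is hypothetical, so consult the Data Dictionary for the actual variable names and units:

    # Load one extracted CSV and inspect its columns.
    conc <- read.csv("DevilsIceboxConcData.csv", stringsAsFactors = FALSE)
    str(conc)                  # lists the actual analyte columns and types
    # summary(conc$atrazine)   # hypothetical column name; replace per Data Dictionary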

  5. National Post-acute and Long-term Care Study Adult Day Participant File

    • data.virginia.gov
    • data.cdc.gov
    html
    Updated Apr 21, 2025
    + more versions
    Cite
    Centers for Disease Control and Prevention (2025). National Post-acute and Long-term Care Study Adult Day Participant File [Dataset]. https://data.virginia.gov/dataset/national-post-acute-and-long-term-care-study-adult-day-participant-file
    Available download formats: html
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Centers for Disease Control and Prevention (http://www.cdc.gov/)
    Description

    The main goals of the National Post-acute and Long-term Care Study (NPALS) are to: (1) Estimate the supply of paid, regulated long-term care services providers; (2) Estimate key policy-relevant characteristics and practices of these providers; (3) Estimate the number of long-term care services users; (4) Estimate key policy-relevant characteristics of long-term care services users; (5) Produce national and state estimates where feasible within confidentiality and reliability standards; (6) Compare across provider sectors; and (7) Monitor trends over time.

    NPALS used a two-stage probability-based sample design. In the first stage, a stratified random sample of providers was selected among adult day service centers (ADSCs); in the second stage, current services users (participants in ADSCs) were randomly selected.

    The provider questionnaire included survey items on provider characteristics such as ownership, size, services offered, selected practices, and staffing; questions about aggregate user characteristics (age and race) were included. The services user datasets include user demographics, health conditions, limitations with activities of daily living, number of prescription medications, adverse events, and services used. This is the services user or participant level data file.

  6. Daily Minimum and Maximum Temperature and Precipitation for Long Term...

    • data.ucar.edu
    • rda-web-prod.ucar.edu
    • +1more
    netcdf
    Updated Aug 4, 2024
    Cite
    Climate and Global Dynamics Division, National Center for Atmospheric Research, University Corporation for Atmospheric Research (2024). Daily Minimum and Maximum Temperature and Precipitation for Long Term Stations from the U.S. COOP Data [Dataset]. https://data.ucar.edu/dataset/daily-minimum-and-maximum-temperature-and-precipitation-for-long-term-stations-from-the-u-s-coo
    Available download formats: netcdf
    Dataset updated
    Aug 4, 2024
    Dataset provided by
    Research Data Archive at the National Center for Atmospheric Research, Computational and Information Systems Laboratory
    Authors
    Climate and Global Dynamics Division, National Center for Atmospheric Research, University Corporation for Atmospheric Research
    Time period covered
    Jan 1, 1950 - Dec 31, 2004
    Area covered
    United States
    Description

    Daily minimum and maximum temperature and precipitation have been extracted for long-term stations from the U.S. COOP station network. The data were extracted for the period 1950 through 2004 for all stations with a low percentage of missing data (generally less than 5%, analyzed independently by variable for each station). Data are available for stations in 46 states.
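    The netCDF files can be read with the ncdf4 package in R. A minimal sketch under stated assumptions: the file and variable names below are guesses, so print the file header first to discover the real ones.

    # Open one file and list its contents before extracting variables.
    library(ncdf4)
    nc <- nc_open("coop_daily_station.nc")   # hypothetical file name
    print(nc)                                # shows the actual variable names
    tmin <- ncvar_get(nc, "tmin")            # assumed names for daily minimum,
    tmax <- ncvar_get(nc, "tmax")            # maximum temperature and
    prcp <- ncvar_get(nc, "prcp")            # precipitation
    nc_close(nc)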

  7. National Post-acute and Long-term Care Study Adult Day Participant File

    • healthdata.gov
    application/rdfxml +5
    Updated Mar 18, 2025
    Cite
    data.cdc.gov (2025). National Post-acute and Long-term Care Study Adult Day Participant File [Dataset]. https://healthdata.gov/CDC/National-Post-acute-and-Long-term-Care-Study-Adult/gvwc-iqjg
    Available download formats: tsv, application/rdfxml, csv, json, xml, application/rssxml
    Dataset updated
    Mar 18, 2025
    Dataset provided by
    data.cdc.gov
    Description

    The main goals of the National Post-acute and Long-term Care Study (NPALS) are to: (1) Estimate the supply of paid, regulated long-term care services providers; (2) Estimate key policy-relevant characteristics and practices of these providers; (3) Estimate the number of long-term care services users; (4) Estimate key policy-relevant characteristics of long-term care services users; (5) Produce national and state estimates where feasible within confidentiality and reliability standards; (6) Compare across provider sectors; and (7) Monitor trends over time.

    NPALS used a two-stage probability-based sample design. In the first stage, a stratified random sample of providers was selected among adult day service centers (ADSCs); in the second stage, current services users (participants in ADSCs) were randomly selected.

    The provider questionnaire included survey items on provider characteristics such as ownership, size, services offered, selected practices, and staffing; questions about aggregate user characteristics (age and race) were included. The services user datasets include user demographics, health conditions, limitations with activities of daily living, number of prescription medications, adverse events, and services used. This is the services user or participant level data file.

  8. Data from: Long Term Ecological Research (LTER) Florida Coastal Everglades...

    • agdatacommons.nal.usda.gov
    • geodata.nal.usda.gov
    bin
    Updated Nov 30, 2023
    + more versions
    Cite
    Evelyn Gaiser (2023). Long Term Ecological Research (LTER) Florida Coastal Everglades (FCE) Core Research Data Table of Contents (DTOC) [Dataset]. https://agdatacommons.nal.usda.gov/articles/dataset/Long_Term_Ecological_Research_LTER_Florida_Coastal_Everglades_FCE_Core_Research_Data_Table_of_Contents_DTOC_/24665235
    Available download formats: bin
    Dataset updated
    Nov 30, 2023
    Dataset provided by
    Florida International University
    Authors
    Evelyn Gaiser
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset links to the Long Term Ecological Research (LTER) Florida Coastal Everglades (FCE) Core Research Data Table of Contents (DTOC). The DTOC contains links to 173 individual datasets, which may be queried from the DTOC page. FCE Core Research Data are long-term data sets that address FCE LTER objectives and hypotheses and that are supported primarily by LTER funds. All data are provided with accompanying metadata, which includes details about the data (how, when and by whom a particular set of data was collected) and information regarding the data format. The FCE practice of dataset versioning was discontinued as of March 2013: all long-term data will have new data appended to the file, and the accompanying metadata will be updated. FCE data may be freely downloaded with as few restrictions as possible. Consultation or collaboration with the original investigators is strongly encouraged. Please keep the dataset originator informed of any plans to use the dataset, and include the dataset's proper citation and Digital Object Identifier (DOI), found under 'How to cite these data' on the dataset's summary table.

    Resources in this dataset:

    Resource Title: GeoData catalog record. File Name: Web Page, url: https://geodata.nal.usda.gov/geonetwork/srv/eng/catalog.search#/metadata/FlCoastalEverglades_eaa_2015_March_19_1527

  9. Project - eAtlas extension: Data management for environmental research (NESP...

    • researchdata.edu.au
    Updated 2024
    Cite
    Australian Institute of Marine Science (AIMS); Lawrey E; Lawrey E (2024). Project - eAtlas extension: Data management for environmental research (NESP TWQ 5.15) [Dataset]. https://researchdata.edu.au/project-eatlas-extension-twq-515/1882770
    Dataset updated
    2024
    Dataset provided by
    Australian Institute of Marine Science (http://www.aims.gov.au/)
    Australian Ocean Data Network
    Authors
    Australian Institute of Marine Science (AIMS); Lawrey E; Lawrey E
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    The eAtlas is a web delivery platform for environmental research data that focuses on data management and data visualisation. As part of the National Environmental Science Program (NESP) the eAtlas was responsible for coordinating the publication of data generated by all research projects in the NESP Tropical Water Quality (TWQ) hub. The focus of the eAtlas was to:

    Actively engage projects on data management issues.
    Provide in-depth review of final datasets to ensure quality data publications suitable for future reuse.
    Provide permanent hosting and publication of the hub datasets and metadata.
    Develop and host visualisations of spatial datasets for users to quickly assess the suitability of the data for their research, and for environmental managers to view without specialist tools.
    Provide a web platform for creating project-centric websites that highlight stories based around research project data.

    The data management under the NESP TWQ hub was more successful than in previous research programs that the eAtlas has been associated with over the last 12 years: a greater percentage of data products from research projects were captured and published to a high standard. As of 7 June 2021, 94 datasets had been published from the NESP TWQ hub, significantly more than the 49 datasets from the previous National Environmental Research Program Tropical Ecosystem (NERP TE) program in 2011-2014 and the 14 datasets from the Marine and Tropical Science Research Facility (MTSRF) in 2008-2010.

    As part of the project's final reporting, a comparison was made between the size of the NESP TWQ metadata records, measured by the word count of the title, abstract and lineage, and the size of records for similar environmental datasets in other data repositories, including the AODN, CSIRO, the NESP MB hub and JCU. The aim of this analysis was to determine which aspects of the data management workflow used on NESP TWQ projects contributed to the level of detail in the metadata records. The spreadsheet associated with the word count analysis is available for download. More detail on the methods is available in the NESP TWQ 5.15 final report (awaiting publication on the https://nesptropical.edu.au/ website).
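    The word-count measure itself is straightforward to reproduce. A minimal sketch in R, assuming one row per metadata record with title, abstract and lineage as character columns; the example records below are invented, not drawn from the project spreadsheet:

    # Word count of title + abstract + lineage per metadata record.
    count_words <- function(x) lengths(strsplit(trimws(x), "\\s+"))
    records <- data.frame(
      repository = c("NESP TWQ", "AODN"),
      title      = c("Seagrass cover 2016", "Sea surface temperature"),
      abstract   = c("Annual survey of seagrass meadows.", "Gridded SST product."),
      lineage    = c("Collected by divers along transects.", "Derived from satellite."),
      stringsAsFactors = FALSE
    )
    records$size <- count_words(records$title) +
      count_words(records$abstract) + count_words(records$lineage)
    aggregate(size ~ repository, data = records, FUN = mean)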
