100+ datasets found

Example of how to manually extract incubation bouts from interactive plots...
figshare.com
txt
Updated Jan 22, 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Martin Bulla (2016). Example of how to manually extract incubation bouts from interactive plots of raw data - R-CODE and DATA [Dataset]. http://doi.org/10.6084/m9.figshare.2066784.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.2066784.v1
Dataset updated
Jan 22, 2016
Dataset provided by
Figsharehttp://figshare.com/
Authors
Martin Bulla
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
{# General information# The script runs with R (Version 3.1.1; 2014-07-10) and packages plyr (Version 1.8.1), XLConnect (Version 0.2-9), utilsMPIO (Version 0.0.25), sp (Version 1.0-15), rgdal (Version 0.8-16), tools (Version 3.1.1) and lattice (Version 0.20-29)# --------------------------------------------------------------------------------------------------------# Questions can be directed to: Martin Bulla (bulla.mar@gmail.com)# -------------------------------------------------------------------------------------------------------- # Data collection and how the individual variables were derived is described in: #Steiger, S.S., et al., When the sun never sets: diverse activity rhythms under continuous daylight in free-living arctic-breeding birds. Proceedings of the Royal Society B: Biological Sciences, 2013. 280(1764): p. 20131016-20131016. # Dale, J., et al., The effects of life history and sexual selection on male and female plumage colouration. Nature, 2015. # Data are available as Rdata file # Missing values are NA. # --------------------------------------------------------------------------------------------------------# For better readability the subsections of the script can be collapsed # --------------------------------------------------------------------------------------------------------}{# Description of the method # 1 - data are visualized in an interactive actogram with time of day on x-axis and one panel for each day of data # 2 - red rectangle indicates the active field, clicking with the mouse in that field on the depicted light signal generates a data point that is automatically (via custom made function) saved in the csv file. For this data extraction I recommend, to click always on the bottom line of the red rectangle, as there is always data available due to a dummy variable ("lin") that creates continuous data at the bottom of the active panel. The data are captured only if greenish vertical bar appears and if new line of data appears in R console). # 3 - to extract incubation bouts, first click in the new plot has to be start of incubation, then next click depict end of incubation and the click on the same stop start of the incubation for the other sex. If the end and start of incubation are at different times, the data will be still extracted, but the sex, logger and bird_ID will be wrong. These need to be changed manually in the csv file. Similarly, the first bout for a given plot will be always assigned to male (if no data are present in the csv file) or based on previous data. Hence, whenever a data from a new plot are extracted, at a first mouse click it is worth checking whether the sex, logger and bird_ID information is correct and if not adjust it manually. # 4 - if all information from one day (panel) is extracted, right-click on the plot and choose "stop". This will activate the following day (panel) for extraction. # 5 - If you wish to end extraction before going through all the rectangles, just press "escape". }{# Annotations of data-files from turnstone_2009_Barrow_nest-t401_transmitter.RData dfr-- contains raw data on signal strength from radio tag attached to the rump of female and male, and information about when the birds where captured and incubation stage of the nest1. who: identifies whether the recording refers to female, male, capture or start of hatching2. datetime_: date and time of each recording3. logger: unique identity of the radio tag 4. signal_: signal strength of the radio tag5. sex: sex of the bird (f = female, m = male)6. nest: unique identity of the nest7. day: datetime_ variable truncated to year-month-day format8. time: time of day in hours9. datetime_utc: date and time of each recording, but in UTC time10. cols: colors assigned to "who"--------------------------------------------------------------------------------------------------------m-- contains metadata for a given nest1. sp: identifies species (RUTU = Ruddy turnstone)2. nest: unique identity of the nest3. year_: year of observation4. IDfemale: unique identity of the female5. IDmale: unique identity of the male6. lat: latitude coordinate of the nest7. lon: longitude coordinate of the nest8. hatch_start: date and time when the hatching of the eggs started 9. scinam: scientific name of the species10. breeding_site: unique identity of the breeding site (barr = Barrow, Alaska)11. logger: type of device used to record incubation (IT - radio tag)12. sampling: mean incubation sampling interval in seconds--------------------------------------------------------------------------------------------------------s-- contains metadata for the incubating parents1. year_: year of capture2. species: identifies species (RUTU = Ruddy turnstone)3. author: identifies the author who measured the bird4. nest: unique identity of the nest5. caught_date_time: date and time when the bird was captured6. recapture: was the bird capture before? (0 - no, 1 - yes)7. sex: sex of the bird (f = female, m = male)8. bird_ID: unique identity of the bird9. logger: unique identity of the radio tag --------------------------------------------------------------------------------------------------------}
f
R and Python libraries for the standardization of data extraction and...
datasetcatalog.nlm.nih.gov
figshare.com
Updated May 8, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zwiggelaar, Reyer; Spick, Matt; Harrison, Charlie; Suchak, Tulsi; Aliu, Anietie E.; Geifman, Nophar (2025). R and Python libraries for the standardization of data extraction and analysis from NHANES. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0002102076
Explore at:
Dataset updated
May 8, 2025
Authors
Zwiggelaar, Reyer; Spick, Matt; Harrison, Charlie; Suchak, Tulsi; Aliu, Anietie E.; Geifman, Nophar
Description
R and Python libraries for the standardization of data extraction and analysis from NHANES.
Primary reporting of studies.
plos.figshare.com
xlsx
Updated Nov 26, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Wolfgang Emanuel Zurrer; Amelia Elaine Cannon; Ewoud Ewing; David Brüschweiler; Julia Bugajska; Bernard Friedrich Hild; Marianna Rosso; Daniel Salo Reich; Benjamin Victor Ineichen (2024). Primary reporting of studies. [Dataset]. http://doi.org/10.1371/journal.pone.0311358.s002
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0311358.s002
Dataset updated
Nov 26, 2024
Dataset provided by
PLOShttp://plos.org/
Authors
Wolfgang Emanuel Zurrer; Amelia Elaine Cannon; Ewoud Ewing; David Brüschweiler; Julia Bugajska; Bernard Friedrich Hild; Marianna Rosso; Daniel Salo Reich; Benjamin Victor Ineichen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Background and methodsSystematic reviews, i.e., research summaries that address focused questions in a structured and reproducible manner, are a cornerstone of evidence-based medicine and research. However, certain steps in systematic reviews, such as data extraction, are labour-intensive, which hampers their feasibility, especially with the rapidly expanding body of biomedical literature. To bridge this gap, we aimed to develop a data mining tool in the R programming environment to automate data extraction from neuroscience in vivo publications. The function was trained on a literature corpus (n = 45 publications) of animal motor neuron disease studies and tested in two validation corpora (motor neuron diseases, n = 31 publications; multiple sclerosis, n = 244 publications).ResultsOur data mining tool, STEED (STructured Extraction of Experimental Data), successfully extracted key experimental parameters such as animal models and species, as well as risk of bias items like randomization or blinding, from in vivo studies. Sensitivity and specificity were over 85% and 80%, respectively, for most items in both validation corpora. Accuracy and F1-score were above 90% and 0.9 for most items in the validation corpora, respectively. Time savings were above 99%.ConclusionsOur text mining tool, STEED, can extract key experimental parameters and risk of bias items from the neuroscience in vivo literature. This enables the tool’s deployment for probing a field in a research improvement context or replacing one human reader during data extraction, resulting in substantial time savings and contributing towards the automation of systematic reviews.
f
Data extraction.
datasetcatalog.nlm.nih.gov
Updated May 15, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ezechi, Oliver Chukwujekwu; Olagunju, Mobolaji Timothy; Lawal, Ishak Kayode; Abodunrin, Olunike Rebecca; Eleje, George Uchenna; Ferife, Vivienne; Ola, Oluwabukola Mary; Lukwa, Akim Tafadzwa; Akinsolu, Folahanmi Tomiwa (2025). Data extraction. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0002081623
Explore at:
Dataset updated
May 15, 2025
Authors
Ezechi, Oliver Chukwujekwu; Olagunju, Mobolaji Timothy; Lawal, Ishak Kayode; Abodunrin, Olunike Rebecca; Eleje, George Uchenna; Ferife, Vivienne; Ola, Oluwabukola Mary; Lukwa, Akim Tafadzwa; Akinsolu, Folahanmi Tomiwa
Description
HPV self-sampling has the potential to improve early detection of cervical cancer among women living with HIV (WLHIV), but its acceptability varies, creating implementation challenges, especially in sub-Saharan Africa. This study aims to assess the acceptability of HPV self-sampling among WLHIV. We searched PubMed, Web of Science, CINAHL, Academic Medical Ultimate, Cochrane databases, and Google Scholar. The review protocol was registered with PROSPERO (CRD42022299781). Inclusion criteria were based on population, intervention, comparison, and outcome. Statistical analysis was done with R Studio version 4.3.2, and data abstraction was performed in Microsoft Excel. The analysis included 14 studies on the acceptability of HPV self-sampling among WLHIV. The overall acceptability rate was 73%. The pooled data showed that 94% felt comfortable with self-sampling, 72% found it easy to use, 10% reported pain, 14% felt embarrassed, and 41% felt confident about the process. The study found that a majority of WLHIV accepted HPV self-sampling, a higher rate than in the general female population. Many participants had concerns about the method’s efficacy. This indicates that while WLHIV generally views self-sampling positively, additional education and support are needed to improve their confidence in its accuracy and reliability.
Friends - R Package Dataset
kaggle.com
zip
Updated Nov 11, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lucas Yukio Imafuko (2024). Friends - R Package Dataset [Dataset]. https://www.kaggle.com/datasets/lucasyukioimafuko/friends-r-package-dataset
Explore at:
zip(2018791 bytes)Available download formats
Dataset updated
Nov 11, 2024
Authors
Lucas Yukio Imafuko
Description
The whole data and source can be found at https://emilhvitfeldt.github.io/friends/

"The goal of friends to provide the complete script transcription of the Friends sitcom. The data originates from the Character Mining repository which includes references to scientific explorations using this data. This package simply provides the data in tibble format instead of json files."

Content

friends.csv - Contains the scenes and lines for each character, including season and episodes.

friends_emotions.csv - Contains sentiments for each scene - for the first four seasons only.

friends_info.csv - Contains information regarding each episode, such as imdb_rating, views, episode title and directors.

Uses

Text mining, sentiment analysis and word statistics.

Data visualizations.
f
Data extraction form.
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Jul 3, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jen, Rachel; Moore, Linn E.; Rowe, Brian H.; MacLean, Joanna E.; Campbell, Sandra M.; Laratta, Cheryl R.; Pendharkar, Sachin R. (2023). Data extraction form. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001060281
Explore at:
Dataset updated
Jul 3, 2023
Authors
Jen, Rachel; Moore, Linn E.; Rowe, Brian H.; MacLean, Joanna E.; Campbell, Sandra M.; Laratta, Cheryl R.; Pendharkar, Sachin R.
Description
BackgroundLong-term noninvasive positive airway pressure (PAP) treatment is effective treatment for sleep-related breathing disorders and chronic hypercarbic respiratory failure secondary to chronic obstructive pulmonary disease (COPD). PAP treatment may be delivered as continuous positive airway pressure or noninvasive ventilation. Success in initiating PAP treatment and barriers to its use in adult patients with COPD are largely unknown. This systematic review aims to identify the acceptance of and adherence to PAP treatment prescribed for long-term use in adult patients with COPD and to summarize variables associated with these measures.MethodsSeven online electronic databases will be searched by an experienced medical librarian to identify records containing the concepts “obstructive airways disease” and “noninvasive positive airway pressure” and “acceptance” or “adherence”. Randomized and non-randomized studies of interventions will be included. Citation lists from relevant articles will be reviewed, and experts will be contacted regarding unpublished studies. Abstracts from key conferences between 2018–2023 and Google Scholar search results will be reviewed for inclusion. Titles, abstracts and full texts will be reviewed independently for inclusion by two reviewers. Data extraction will be completed by one author using a pre-established form and primary outcomes confirmed by a second author. Methodological quality will be evaluated. If sufficient data are available for meta-analysis, a pooled summary statistic for the primary outcome will be calculated using a random-effects generic inverse-variance meta-analysis, weighted proportion or weighted medians-based approach. Subgroup analysis will explore clinically meaningful sources of heterogeneity. Variables that are associated with acceptance and adherence will be described.DiscussionLong-term PAP treatment is a complex intervention prescribed to patients with COPD for several indications. Synthesis of the evidence on success with PAP treatment and variables associated with acceptance or adherence will inform program and policy development for supporting patients with COPD who are prescribed this therapy.Trial registrationSystematic review registration: This protocol was registered with the International Prospective Register of Systematic Reviews (PROSPERO) on July 13, 2021 (registration number CRD42021259262), with revisions submitted on April 17, 2023.
f
Table of rcprd functions.
plos.figshare.com
xls
Updated Aug 19, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alexander Pate; Rosa Parisi; Evangelos Kontopantelis; Matthew Sperrin (2025). Table of rcprd functions. [Dataset]. http://doi.org/10.1371/journal.pone.0327229.t001
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0327229.t001
Dataset updated
Aug 19, 2025
Dataset provided by
PLOS ONE
Authors
Alexander Pate; Rosa Parisi; Evangelos Kontopantelis; Matthew Sperrin
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Clinical Practice Research Datalink (CPRD) is a large and widely used resource of electronic health records from the UK, linking primary care data to hospital data, death registration data, cancer registry data, deprivation data and mental health services data. Extraction and management of CPRD data is a computationally demanding process and requires a significant amount of work, in particular when using R. The rcprd package simplifies the process of extracting and processing CPRD data in order to build datasets ready for statistical analysis. Raw CPRD data is provided in thousands of.txt files, making querying this data cumbersome and inefficient. rcprd saves the relevant information into an SQLite database stored on the hard drive which can then be queried efficiently to extract required information about individuals. rcprd follows a four-stage process: 1) Definition of a cohort, 2) Read in medical/prescription data and save into an SQLite database, 3) Query this SQLite database for specific codes and tests to create variables for each individual in the cohort, 4) Combine extracted variables into a dataset ready for statistical analysis. Functions are available to extract common variable types (e.g., history of a condition, or time until an event occurs, relative to an index date), and more general functions for database queries, allowing users to define their own variables for extraction. The entire process can be done from within R, with no knowledge of SQL required. This manuscript showcases the functionality of rcprd by running through an example using simulated CPRD Aurum data. rcprd will reduce the duplication of time and effort among those using CPRD data for research, allowing more time to be focused on other aspects of research projects.
f
Data Extraction File.
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Apr 15, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dhaydel, Abdalrahman S.; Allan, Leen M.; Saeed, Anwaar; Abu Hantash, Noor R.; Qasem, Issam M.; Younis, Osama M.; Alghwyeen, Wasfi F. (2025). Data Extraction File. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0002094208
Explore at:
Dataset updated
Apr 15, 2025
Authors
Dhaydel, Abdalrahman S.; Allan, Leen M.; Saeed, Anwaar; Abu Hantash, Noor R.; Qasem, Issam M.; Younis, Osama M.; Alghwyeen, Wasfi F.
Description
BackgroundAngiopoietin-like protein 4 (ANGPTL4) plays a crucial role in processes such as angiogenesis, inflammation, and metabolism. Despite numerous studies suggesting its involvement in cancer, a definitive role remains unclear. We introduce the first comprehensive meta-analysis and pan-cancer bioinformatics study on ANGPTL4, aiming to unravel its implications across various cancer types.MethodsModerate-to high-quality observational studies were retrieved from PubMed, Scopus, and Embase. A meta-analysis was conducted using the R package “meta.” Survival analysis was performed using GEPIA2 and TIMER2.0. Immune infiltration, mutational burden, and drug resistance analyses was done via GSCAlite. Co-expression and gene set enrichment analyses (GSEA) were carried out using cBioportal and enrichr, respectively.ResultsIncreased ANGPTL4 expression was linked to worse tumor grade (OR = 1.51, P = 0.023), stage (OR = 2.42, P < 0.001), lymph node metastasis (OR = 1.76, P = 0.012), vascular invasion (OR = 2.16, P = 0.01), and lymphatic invasion (OR = 2.20, P < 0.001). Furthermore, ANGPTL4 expression was linked to worse OS (HR = 1.40, 95% CI: 1.29,1.50, P = 0.0001). Single gene level analysis revealed that ANGPTL4 upregulated epithelial-to-mesenchymal transition (EMT) in 23 different cancers. Immune infiltration varied between cancer types, but increased infiltration of cancer-associated fibroblasts was observed in most cancers. Mutation analysis revealed increased alterations in TP53 and CDKN2A in cohorts with ANGPTL4 alterations. GSEA of co-expressed genes revealed involvement in hypoxia, EMT, VEGF-A complex, TGF-B pathways, and extracellular matrix organization.ConclusionsANGPTL4 plays a significant role in tumor progression via its positive regulation of EMT and angiogenesis, while possibly harboring a TGF-B dependent role in systemic metastasis. Therefore, ANGPTL4 is a suitable target for future drug development.
w
Dataset of books called An introduction to data analysis in R : hands-on...
workwithdata.com
Updated Apr 17, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Work With Data (2025). Dataset of books called An introduction to data analysis in R : hands-on coding, data mining, visualization and statistics from scratch [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=An+introduction+to+data+analysis+in+R+%3A+hands-on+coding%2C+data+mining%2C+visualization+and+statistics+from+scratch
Explore at:
Dataset updated
Apr 17, 2025
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is about books. It has 1 row and is filtered where the book is An introduction to data analysis in R : hands-on coding, data mining, visualization and statistics from scratch. It features 7 columns including author, publication date, language, and book publisher.
l
LSC (Leicester Scientific Corpus)
figshare.le.ac.uk
Updated Apr 15, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Neslihan Suzen (2020). LSC (Leicester Scientific Corpus) [Dataset]. http://doi.org/10.25392/leicester.data.9449639.v1
Explore at:
Unique identifier
https://doi.org/10.25392/leicester.data.9449639.v1
Dataset updated
Apr 15, 2020
Dataset provided by
University of Leicester
Authors
Neslihan Suzen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Leicester
Description
The LSC (Leicester Scientific Corpus)August 2019 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk) Supervised by Prof Alexander Gorban and Dr Evgeny MirkesThe data is extracted from the Web of Science® [1] You may not copy or distribute this data in whole or in part without the written consent of Clarivate Analytics.Getting StartedThis text provides background information on the LSC (Leicester Scientific Corpus) and pre-processing steps on abstracts, and describes the structure of files to organise the corpus. This corpus is created to be used in future work on the quantification of the sense of research texts. One of the goal of publishing the data is to make it available for further analysis and use in Natural Language Processing projects.LSC is a collection of abstracts of articles and proceeding papers published in 2014, and indexed by the Web of Science (WoS) database [1]. Each document contains title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English.The corpus was collected in July 2018 online and contains the number of citations from publication date to July 2018.Each document in the corpus contains the following parts:1. Authors: The list of authors of the paper2. Title: The title of the paper3. Abstract: The abstract of the paper4. Categories: One or more category from the list of categories [2]. Full list of categories is presented in file ‘List_of _Categories.txt’.5. Research Areas: One or more research area from the list of research areas [3]. Full list of research areas is presented in file ‘List_of_Research_Areas.txt’.6. Total Times cited: The number of times the paper was cited by other items from all databases within Web of Science platform [4]7. Times cited in Core Collection: The total number of times the paper was cited by other papers within the WoS Core Collection [4]We describe a document as the collection of information (about a paper) listed above. The total number of documents in LSC is 1,673,824.All documents in LSC have nonempty abstract, title, categories, research areas and times cited in WoS databases. There are 119 documents with empty authors list, we did not exclude these documents.Data ProcessingThis section describes all steps in order for the LSC to be collected, clean and available to researchers. Processing the data consists of six main steps:Step 1: Downloading of the Data OnlineThis is the step of collecting the dataset online. This is done manually by exporting documents as Tab-delimitated files. All downloaded documents are available online.Step 2: Importing the Dataset to RThis is the process of converting the collection to RData format for processing the data. The LSC was collected as TXT files. All documents are extracted to R.Step 3: Cleaning the Data from Documents with Empty Abstract or without CategoryNot all papers have abstract and categories in the collection. As our research is based on the analysis of abstracts and categories, preliminary detecting and removing inaccurate documents were performed. All documents with empty abstracts and documents without categories are removed.Step 4: Identification and Correction of Concatenate Words in AbstractsTraditionally, abstracts are written in a format of executive summary with one paragraph of continuous writing, which is known as ‘unstructured abstract’. However, especially medicine-related publications use ‘structured abstracts’. Such type of abstracts are divided into sections with distinct headings such as introduction, aim, objective, method, result, conclusion etc.Used tool for extracting abstracts leads concatenate words of section headings with the first word of the section. As a result, some of structured abstracts in the LSC require additional process of correction to split such concatenate words. For instance, we observe words such as ConclusionHigher and ConclusionsRT etc. in the corpus. The detection and identification of concatenate words cannot be totally automated. Human intervention is needed in the identification of possible headings of sections. We note that we only consider concatenate words in headings of sections as it is not possible to detect all concatenate words without deep knowledge of research areas. Identification of such words is done by sampling of medicine-related publications. The section headings in such abstracts are listed in the List 1.List 1 Headings of sections identified in structured abstractsBackground Method(s) DesignTheoretical Measurement(s) LocationAim(s) Methodology ProcessAbstract Population ApproachObjective(s) Purpose(s) Subject(s)Introduction Implication(s) Patient(s)Procedure(s) Hypothesis Measure(s)Setting(s) Limitation(s) DiscussionConclusion(s) Result(s) Finding(s)Material (s) Rationale(s)Implications for health and nursing policyAll words including headings in the List 1 are detected in entire corpus, and then words are split into two words. For instance, the word ‘ConclusionHigher’ is split into ‘Conclusion’ and ‘Higher’.Step 5: Extracting (Sub-setting) the Data Based on Lengths of AbstractsAfter correction of concatenate words is completed, the lengths of abstracts are calculated. ‘Length’ indicates the totalnumber of words in the text, calculated by the same rule as for Microsoft Word ‘word count’ [5].According to APA style manual [6], an abstract should contain between 150 to 250 words. However, word limits vary from journal to journal. For instance, Journal of Vascular Surgery recommends that ‘Clinical and basic research studies must include a structured abstract of 400 words or less’[7].In LSC, the length of abstracts varies from 1 to 3805. We decided to limit length of abstracts from 30 to 500 words in order to study documents with abstracts of typical length ranges and to avoid the effect of the length to the analysis. Documents containing less than 30 and more than 500 words in abstracts are removed.Step 6: Saving the Dataset into CSV FormatCorrected and extracted documents are saved into 36 CSV files. The structure of files are described in the following section.The Structure of Fields in CSV FilesIn CSV files, the information is organised with one record on each line and parts of abstract, title, list of authors, list of categories, list of research areas, and times cited is recorded in separated fields.To access the LSC for research purposes, please email to ns433@le.ac.uk.References[1]Web of Science. (15 July). Available: https://apps.webofknowledge.com/[2]WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html[3]Research Areas in WoS. Available: https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html[4]Times Cited in WoS Core Collection. (15 July). Available: https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US[5]Word Count. Available: https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3[6]A. P. Association, Publication manual. American Psychological Association Washington, DC, 1983.[7]P. Gloviczki and P. F. Lawrence, "Information for authors," Journal of Vascular Surgery, vol. 65, no. 1, pp. A16-A22, 2017.
l
LScDC Word-Category RIG Matrix
figshare.le.ac.uk
pdf
Updated Apr 28, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Neslihan Suzen (2020). LScDC Word-Category RIG Matrix [Dataset]. http://doi.org/10.25392/leicester.data.12133431.v2
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.25392/leicester.data.12133431.v2
Dataset updated
Apr 28, 2020
Dataset provided by
University of Leicester
Authors
Neslihan Suzen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
LScDC Word-Category RIG MatrixApril 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com)Supervised by Prof Alexander Gorban and Dr Evgeny MirkesGetting StartedThis file describes the Word-Category RIG Matrix for theLeicester Scientific Corpus (LSC) [1], the procedure to build the matrix and introduces the Leicester Scientific Thesaurus (LScT) with the construction process. The Word-Category RIG Matrix is a 103,998 by 252 matrix, where rows correspond to words of Leicester Scientific Dictionary-Core (LScDC) [2] and columns correspond to 252 Web of Science (WoS) categories [3, 4, 5]. Each entry in the matrix corresponds to a pair (category,word). Its value for the pair shows the Relative Information Gain (RIG) on the belonging of a text from the LSC to the category from observing the word in this text. The CSV file of Word-Category RIG Matrix in the published archive is presented with two additional columns of the sum of RIGs in categories and the maximum of RIGs over categories (last two columns of the matrix). So, the file ‘Word-Category RIG Matrix.csv’ contains a total of 254 columns.This matrix is created to be used in future research on quantifying of meaning in scientific texts under the assumption that words have scientifically specific meanings in subject categories and the meaning can be estimated by information gains from word to categories. LScT (Leicester Scientific Thesaurus) is a scientific thesaurus of English. The thesaurus includes a list of 5,000 words from the LScDC. We consider ordering the words of LScDC by the sum of their RIGs in categories. That is, words are arranged in their informativeness in the scientific corpus LSC. Therefore, meaningfulness of words evaluated by words’ average informativeness in the categories. We have decided to include the most informative 5,000 words in the scientific thesaurus. Words as a Vector of Frequencies in WoS CategoriesEach word of the LScDC is represented as a vector of frequencies in WoS categories. Given the collection of the LSC texts, each entry of the vector consists of the number of texts containing the word in the corresponding category.It is noteworthy that texts in a corpus do not necessarily belong to a single category, as they are likely to correspond to multidisciplinary studies, specifically in a corpus of scientific texts. In other words, categories may not be exclusive. There are 252 WoS categories and a text can be assigned to at least 1 and at most 6 categories in the LSC. Using the binary calculation of frequencies, we introduce the presence of a word in a category. We create a vector of frequencies for each word, where dimensions are categories in the corpus.The collection of vectors, with all words and categories in the entire corpus, can be shown in a table, where each entry corresponds to a pair (word,category). This table is build for the LScDC with 252 WoS categories and presented in published archive with this file. The value of each entry in the table shows how many times a word of LScDC appears in a WoS category. The occurrence of a word in a category is determined by counting the number of the LSC texts containing the word in a category. Words as a Vector of Relative Information Gains Extracted for CategoriesIn this section, we introduce our approach to representation of a word as a vector of relative information gains for categories under the assumption that meaning of a word can be quantified by their information gained for categories.For each category, a function is defined on texts that takes the value 1, if the text belongs to the category, and 0 otherwise. For each word, a function is defined on texts that takes the value 1 if the word belongs to the text, and 0 otherwise. Consider LSC as a probabilistic sample space (the space of equally probable elementary outcomes). For the Boolean random variables, the joint probability distribution, the entropy and information gains are defined.The information gain about the category from the word is the amount of information on the belonging of a text from the LSC to the category from observing the word in the text [6]. We used the Relative Information Gain (RIG) providing a normalised measure of the Information Gain. This provides the ability of comparing information gains for different categories. The calculations of entropy, Information Gains and Relative Information Gains can be found in the README file in the archive published. Given a word, we created a vector where each component of the vector corresponds to a category. Therefore, each word is represented as a vector of relative information gains. It is obvious that the dimension of vector for each word is the number of categories. The set of vectors is used to form the Word-Category RIG Matrix, in which each column corresponds to a category, each row corresponds to a word and each component is the relative information gain from the word to the category. In Word-Category RIG Matrix, a row vector represents the corresponding word as a vector of RIGs in categories. We note that in the matrix, a column vector represents RIGs of all words in an individual category. If we choose an arbitrary category, words can be ordered by their RIGs from the most informative to the least informative for the category. As well as ordering words in each category, words can be ordered by two criteria: sum and maximum of RIGs in categories. The top n words in this list can be considered as the most informative words in the scientific texts. For a given word, the sum and maximum of RIGs are calculated from the Word-Category RIG Matrix.RIGs for each word of LScDC in 252 categories are calculated and vectors of words are formed. We then form the Word-Category RIG Matrix for the LSC. For each word, the sum (S) and maximum (M) of RIGs in categories are calculated and added at the end of the matrix (last two columns of the matrix). The Word-Category RIG Matrix for the LScDC with 252 categories, the sum of RIGs in categories and the maximum of RIGs over categories can be found in the database.Leicester Scientific Thesaurus (LScT)Leicester Scientific Thesaurus (LScT) is a list of 5,000 words form the LScDC [2]. Words of LScDC are sorted in descending order by the sum (S) of RIGs in categories and the top 5,000 words are selected to be included in the LScT. We consider these 5,000 words as the most meaningful words in the scientific corpus. In other words, meaningfulness of words evaluated by words’ average informativeness in the categories and the list of these words are considered as a ‘thesaurus’ for science. The LScT with value of sum can be found as CSV file with the published archive. Published archive contains following files:1) Word_Category_RIG_Matrix.csv: A 103,998 by 254 matrix where columns are 252 WoS categories, the sum (S) and the maximum (M) of RIGs in categories (last two columns of the matrix), and rows are words of LScDC. Each entry in the first 252 columns is RIG from the word to the category. Words are ordered as in the LScDC.2) Word_Category_Frequency_Matrix.csv: A 103,998 by 252 matrix where columns are 252 WoS categories and rows are words of LScDC. Each entry of the matrix is the number of texts containing the word in the corresponding category. Words are ordered as in the LScDC.3) LScT.csv: List of words of LScT with sum (S) values. 4) Text_No_in_Cat.csv: The number of texts in categories. 5) Categories_in_Documents.csv: List of WoS categories for each document of the LSC.6) README.txt: Description of Word-Category RIG Matrix, Word-Category Frequency Matrix and LScT and forming procedures.7) README.pdf (same as 6 in PDF format)References[1] Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2[2] Suzen, Neslihan (2019): LScDC (Leicester Scientific Dictionary-Core). figshare. Dataset. https://doi.org/10.25392/leicester.data.9896579.v3[3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/[4] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html [5] Suzen, N., Mirkes, E. M., & Gorban, A. N. (2019). LScDC-new large scientific dictionary. arXiv preprint arXiv:1912.06858. [6] Shannon, C. E. (1948). A mathematical theory of communication. Bell system technical journal, 27(3), 379-423.
Feature Extraction
kaggle.com
zip
Updated Sep 4, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jason (2019). Feature Extraction [Dataset]. https://www.kaggle.com/jclchan/feature-extraction
Explore at:
zip(11300375 bytes)Available download formats
Dataset updated
Sep 4, 2019
Authors
Jason
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
The datasets are derived from eye fundus images provided in Kaggle's 'APTOS 2019 Blindness Detection' competition. The competition involves classification of eye fundus images into 5 levels of severity in diabetic retinopathy.

Unlike most participants who used deep learning approach to this classification problem, here we tried using Fractal Dimensions and Persistent Homology (one of the major tools in Topological Data Analysis, TDA) in extracting features from images, as inputs to simpler ML algorithms like SVM. It shows some promising results with this approach.

There are three files in this dataset:

Process_Images.html - R scripts for extracting Fractal Dimensions and Persistent Homology features from images.

train_features.RDS and test_features.RDS - the output RDS (R dataset files) for training and testing images for the above Kaggle competition.

Columns in train_features.RDS & test_features.RDS:

id_code - image id

diagnosis - severity of diabetic retinopathy on a scale of 0 to 4: 0=No DR; 1=Mild; 2=Moderate; 3=Severe; 4=Proliferative DR; Artificially set to be 0 for test_features.RDS

n - number of persistent homology components detected from the image

fd1 to fd21 - proportion of sliding windows having a specific fractal dimensions: fd1 = proportion of windows having FD=2; fd2=proportion of windows having FD in (2, 2.05];... fd21=proportion of windows having FD in (2.95,3.00]

l1_2 to l1_499 - silhouette (p=0.1, dim=1) at various time steps.
m
Reddit r/AskScience Flair Dataset
data.mendeley.com
Updated May 23, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sumit Mishra (2022). Reddit r/AskScience Flair Dataset [Dataset]. http://doi.org/10.17632/k9r2d9z999.3
Explore at:
Unique identifier
https://doi.org/10.17632/k9r2d9z999.3
Dataset updated
May 23, 2022
Authors
Sumit Mishra
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Reddit is a social news, content rating and discussion website. It's one of the most popular sites on the internet. Reddit has 52 million daily active users and approximately 430 million users who use it once a month. Reddit has different subreddits and here We'll use the r/AskScience Subreddit.

The dataset is extracted from the subreddit /r/AskScience from Reddit. The data was collected between 01-01-2016 and 20-05-2022. It contains 612,668 Datapoints and 25 Columns. The database contains a number of information about the questions asked on the subreddit, the description of the submission, the flair of the question, NSFW or SFW status, the year of the submission, and more. The data is extracted using python and Pushshift's API. A little bit of cleaning is done using NumPy and pandas as well. (see the descriptions of individual columns below).

The dataset contains the following columns and descriptions: author - Redditor Name author_fullname - Redditor Full name contest_mode - Contest mode [implement obscured scores and randomized sorting]. created_utc - Time the submission was created, represented in Unix Time. domain - Domain of submission. edited - If the post is edited or not. full_link - Link of the post on the subreddit. id - ID of the submission. is_self - Whether or not the submission is a self post (text-only). link_flair_css_class - CSS Class used to identify the flair. link_flair_text - Flair on the post or The link flair’s text content. locked - Whether or not the submission has been locked. num_comments - The number of comments on the submission. over_18 - Whether or not the submission has been marked as NSFW. permalink - A permalink for the submission. retrieved_on - time ingested. score - The number of upvotes for the submission. description - Description of the Submission. spoiler - Whether or not the submission has been marked as a spoiler. stickied - Whether or not the submission is stickied. thumbnail - Thumbnail of Submission. question - Question Asked in the Submission. url - The URL the submission links to, or the permalink if a self post. year - Year of the Submission. banned - Banned by the moderator or not.

This dataset can be used for Flair Prediction, NSFW Classification, and different Text Mining/NLP tasks. Exploratory Data Analysis can also be done to get the insights and see the trend and patterns over the years.
E
Ecuador Registered Activity Level Index: INA-R: MQ: EN: Extraction of Crude...
ceicdata.com
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
CEICdata.com, Ecuador Registered Activity Level Index: INA-R: MQ: EN: Extraction of Crude Petroleum & Natural Gas [Dataset]. https://www.ceicdata.com/en/ecuador/registered-activity-level-index-inar/registered-activity-level-index-inar-mq-en-extraction-of-crude-petroleum--natural-gas
Explore at:
Dataset provided by
CEICdata.com
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Mar 1, 2017 - Feb 1, 2018
Area covered
Ecuador
Variables measured
Economic Activity
Description
Ecuador Registered Activity Level Index: INA-R: MQ: EN: Extraction of Crude Petroleum & Natural Gas data was reported at 110.749 2002=100 in Aug 2018. This records a decrease from the previous number of 117.635 2002=100 for Jul 2018. Ecuador Registered Activity Level Index: INA-R: MQ: EN: Extraction of Crude Petroleum & Natural Gas data is updated monthly, averaging 98.034 2002=100 from Jan 2003 (Median) to Aug 2018, with 188 observations. The data reached an all-time high of 449.515 2002=100 in Apr 2010 and a record low of 13.501 2002=100 in Feb 2013. Ecuador Registered Activity Level Index: INA-R: MQ: EN: Extraction of Crude Petroleum & Natural Gas data remains active status in CEIC and is reported by National Institute of Statistics and Census. The data is categorized under Global Database’s Ecuador – Table EC.A025: Registered Activity Level Index: INA-R.
t
Extraction of the Structure Functions and R=Sigma-L/Sigma-T from Deep...
service.tib.eu
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Extraction of the Structure Functions and R=Sigma-L/Sigma-T from Deep Inelastic e p and e d Cross-Sections - Vdataset - LDM in NFDI4Energy [Dataset]. https://service.tib.eu/ldm_nfdi4energy/ldmservice/dataset/inspirehep_8098270e-a432-4c93-8e3a-de9f30be0025
Explore at:
Description
DATA USES SLAC 8 GEV SPECTROMETER,USING ANGLES OF 15,18,19,26,34 DEG. RANGE OF VARIABLES PLAB 4.5 TO 18.0 GEV , NU 2.1 TO 13.4 GEV , Q2 1.0 TO 16.0 GEV2 , X 0.1 TO 0.8 , W 1.88 TO 4.84 GEV. ONLY DATA ON R=SIGL/SIGT AND THE STRUCTURE FUNCTIONS IN THIS RECORD. THESE DATA ARE FROM EXPERIMENT E 87 AT SLAC. THE RANGES OF VARIABLES ARE E LAB 4.5 TO 18.0 GEV THETA LAB 6,10,15,18,19,26,34 DEGREES NU 0.301 TO 16.804 GEV Q2 0.145 TO 21.78 GEV2 X 0.022 TO 0.965 W 1.104 TO 5.563 GEV THE DATA HAS BEEN PROCESSED BY SLAC RADIATIVE CORRECTIONS AND WAS OBTAINED THROUGH G.C.FOX AT CALTECH.
l
LScDC (Leicester Scientific Dictionary-Core)
figshare.le.ac.uk
docx
Updated Apr 15, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Neslihan Suzen (2020). LScDC (Leicester Scientific Dictionary-Core) [Dataset]. http://doi.org/10.25392/leicester.data.9896579.v3
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.25392/leicester.data.9896579.v3
Dataset updated
Apr 15, 2020
Dataset provided by
University of Leicester
Authors
Neslihan Suzen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Leicester
Description
The LScDC (Leicester Scientific Dictionary-Core Dictionary)April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk/suzenneslihan@hotmail.com)Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes[Version 3] The third version of LScDC (Leicester Scientific Dictionary-Core) is formed using the updated LScD (Leicester Scientific Dictionary) - Version 3*. All steps applied to build the new version of core dictionary are the same as in Version 2** and can be found in description of Version 2 below. We did not repeat the explanation. The files provided with this description are also same as described as for LScDC Version 2. The numbers of words in the 3rd versions of LScD and LScDC are summarized below. # of wordsLScD (v3) 972,060LScDC (v3) 103,998 * Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v3 ** Suzen, Neslihan (2019): LScDC (Leicester Scientific Dictionary-Core). figshare. Dataset. https://doi.org/10.25392/leicester.data.9896579.v2[Version 2] Getting StartedThis file describes a sorted and cleaned list of words from LScD (Leicester Scientific Dictionary), explains steps for sub-setting the LScD and basic statistics of words in the LSC (Leicester Scientific Corpus), to be found in [1, 2]. The LScDC (Leicester Scientific Dictionary-Core) is a list of words ordered by the number of documents containing the words, and is available in the CSV file published. There are 104,223 unique words (lemmas) in the LScDC. This dictionary is created to be used in future work on the quantification of the sense of research texts. The objective of sub-setting the LScD is to discard words which appear too rarely in the corpus. In text mining algorithms, usage of enormous number of text data brings the challenge to the performance and the accuracy of data mining applications. The performance and the accuracy of models are heavily depend on the type of words (such as stop words and content words) and the number of words in the corpus. Rare occurrence of words in a collection is not useful in discriminating texts in large corpora as rare words are likely to be non-informative signals (or noise) and redundant in the collection of texts. The selection of relevant words also holds out the possibility of more effective and faster operation of text mining algorithms.To build the LScDC, we decided the following process on LScD: removing words that appear in no more than 10 documents (
Data from: A dataset to model Levantine landcover and land-use change...
zenodo.org
data.niaid.nih.gov
+1more
zip
Updated Dec 16, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Michael Kempf; Michael Kempf (2023). A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19 [Dataset]. http://doi.org/10.5281/zenodo.10396148
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.10396148
Dataset updated
Dec 16, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Michael Kempf; Michael Kempf
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Dec 16, 2023
Area covered
Levant
Description
Overview

This dataset is the repository for the following paper submitted to Data in Brief:

Kempf, M. A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19. Data in Brief (submitted: December 2023).

The Data in Brief article contains the supplement information and is the related data paper to:

Kempf, M. Climate change, the Arab Spring, and COVID-19 - Impacts on landcover transformations in the Levant. Journal of Arid Environments (revision submitted: December 2023).

Description/abstract

The Levant region is highly vulnerable to climate change, experiencing prolonged heat waves that have led to societal crises and population displacement. Since 2010, the area has been marked by socio-political turmoil, including the Syrian civil war and currently the escalation of the so-called Israeli-Palestinian Conflict, which strained neighbouring countries like Jordan due to the influx of Syrian refugees and increases population vulnerability to governmental decision-making. Jordan, in particular, has seen rapid population growth and significant changes in land-use and infrastructure, leading to over-exploitation of the landscape through irrigation and construction. This dataset uses climate data, satellite imagery, and land cover information to illustrate the substantial increase in construction activity and highlights the intricate relationship between climate change predictions and current socio-political developments in the Levant.

Folder structure

The main folder after download contains all data, in which the following subfolders are stored are stored as zipped files:

“code” stores the above described 9 code chunks to read, extract, process, analyse, and visualize the data.

“MODIS_merged” contains the 16-days, 250 m resolution NDVI imagery merged from three tiles (h20v05, h21v05, h21v06) and cropped to the study area, n=510, covering January 2001 to December 2022 and including January and February 2023.

“mask” contains a single shapefile, which is the merged product of administrative boundaries, including Jordan, Lebanon, Israel, Syria, and Palestine (“MERGED_LEVANT.shp”).

“yield_productivity” contains .csv files of yield information for all countries listed above.

“population” contains two files with the same name but different format. The .csv file is for processing and plotting in R. The .ods file is for enhanced visualization of population dynamics in the Levant (Socio_cultural_political_development_database_FAO2023.ods).

“GLDAS” stores the raw data of the NASA Global Land Data Assimilation System datasets that can be read, extracted (variable name), and processed using code “8_GLDAS_read_extract_trend” from the respective folder. One folder contains data from 1975-2022 and a second the additional January and February 2023 data.

“built_up” contains the landcover and built-up change data from 1975 to 2022. This folder is subdivided into two subfolder which contain the raw data and the already processed data. “raw_data” contains the unprocessed datasets and “derived_data” stores the cropped built_up datasets at 5 year intervals, e.g., “Levant_built_up_1975.tif”.

Code structure

1_MODIS_NDVI_hdf_file_extraction.R

This is the first code chunk that refers to the extraction of MODIS data from .hdf file format. The following packages must be installed and the raw data must be downloaded using a simple mass downloader, e.g., from google chrome. Packages: terra. Download MODIS data from after registration from: https://lpdaac.usgs.gov/products/mod13q1v061/ or https://search.earthdata.nasa.gov/search (MODIS/Terra Vegetation Indices 16-Day L3 Global 250m SIN Grid V061, last accessed, 09th of October 2023). The code reads a list of files, extracts the NDVI, and saves each file to a single .tif-file with the indication “NDVI”. Because the study area is quite large, we have to load three different (spatially) time series and merge them later. Note that the time series are temporally consistent.

2_MERGE_MODIS_tiles.R

In this code, we load and merge the three different stacks to produce large and consistent time series of NDVI imagery across the study area. We further use the package gtools to load the files in (1, 2, 3, 4, 5, 6, etc.). Here, we have three stacks from which we merge the first two (stack 1, stack 2) and store them. We then merge this stack with stack 3. We produce single files named NDVI_final_*consecutivenumber*.tif. Before saving the final output of single merged files, create a folder called “merged” and set the working directory to this folder, e.g., setwd("your directory_MODIS/merged").

3_CROP_MODIS_merged_tiles.R

Now we want to crop the derived MODIS tiles to our study area. We are using a mask, which is provided as .shp file in the repository, named "MERGED_LEVANT.shp". We load the merged .tif files and crop the stack with the vector. Saving to individual files, we name them “NDVI_merged_clip_*consecutivenumber*.tif. We now produced single cropped NDVI time series data from MODIS.
The repository provides the already clipped and merged NDVI datasets.

4_TREND_analysis_NDVI.R

Now, we want to perform trend analysis from the derived data. The data we load is tricky as it contains 16-days return period across a year for the period of 22 years. Growing season sums contain MAM (March-May), JJA (June-August), and SON (September-November). December is represented as a single file, which means that the period DJF (December-February) is represented by 5 images instead of 6. For the last DJF period (December 2022), the data from January and February 2023 can be added. The code selects the respective images from the stack, depending on which period is under consideration. From these stacks, individual annually resolved growing season sums are generated and the slope is calculated. We can then extract the p-values of the trend and characterize all values with high confidence level (0.05). Using the ggplot2 package and the melt function from reshape2 package, we can create a plot of the reclassified NDVI trends together with a local smoother (LOESS) of value 0.3.
To increase comparability and understand the amplitude of the trends, z-scores were calculated and plotted, which show the deviation of the values from the mean. This has been done for the NDVI values as well as the GLDAS climate variables as a normalization technique.

5_BUILT_UP_change_raster.R

Let us look at the landcover changes now. We are working with the terra package and get raster data from here: https://ghsl.jrc.ec.europa.eu/download.php?ds=bu (last accessed 03. March 2023, 100 m resolution, global coverage). Here, one can download the temporal coverage that is aimed for and reclassify it using the code after cropping to the individual study area. Here, I summed up different raster to characterize the built-up change in continuous values between 1975 and 2022.

6_POPULATION_numbers_plot.R

For this plot, one needs to load the .csv-file “Socio_cultural_political_development_database_FAO2023.csv” from the repository. The ggplot script provided produces the desired plot with all countries under consideration.

7_YIELD_plot.R

In this section, we are using the country productivity from the supplement in the repository “yield_productivity” (e.g., "Jordan_yield.csv". Each of the single country yield datasets is plotted in a ggplot and combined using the patchwork package in R.

8_GLDAS_read_extract_trend

The last code provides the basis for the trend analysis of the climate variables used in the paper. The raw data can be accessed https://disc.gsfc.nasa.gov/datasets?keywords=GLDAS%20Noah%20Land%20Surface%20Model%20L4%20monthly&page=1 (last accessed 9th of October 2023). The raw data comes in .nc file format and various variables can be extracted using the [“^a variable name”] command from the spatraster collection. Each time you run the code, this variable name must be adjusted to meet the requirements for the variables (see this link for abbreviations: https://disc.gsfc.nasa.gov/datasets/GLDAS_CLSM025_D_2.0/summary, last accessed 09th of October 2023; or the respective code chunk when reading a .nc file with the ncdf4 package in R) or run print(nc) from the code or use names(the spatraster collection).
Choosing one variable, the code uses the MERGED_LEVANT.shp mask from the repository to crop and mask the data to the outline of the study area.
From the processed data, trend analysis are conducted and z-scores were calculated following the code described above. However, annual trends require the frequency of the time series analysis to be set to value = 12. Regarding, e.g., rainfall, which is measured as annual sums and not means, the chunk r.sum=r.sum/12 has to be removed or set to r.sum=r.sum/1 to avoid calculating annual mean values (see other variables). Seasonal subset can be calculated as described in the code. Here, 3-month subsets were chosen for growing seasons, e.g. March-May (MAM), June-July (JJA), September-November (SON), and DJF (December-February, including Jan/Feb of the consecutive year).
From the data, mean values of 48 consecutive years are calculated and trend analysis are performed as describe above. In the same way, p-values are extracted and 95 % confidence level values are marked with dots on the raster plot. This analysis can be performed with a much longer time series, other variables, ad different spatial extent across the globe due to the availability of the GLDAS variables.
f
Data extraction table.
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Feb 24, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Pan, Hanwen; Huang, Zhuo; Liao, Qiwei; Lan, Yuxiang; Wu, Peiyi; Guo, Yixin (2025). Data extraction table. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001321564
Explore at:
Dataset updated
Feb 24, 2025
Authors
Pan, Hanwen; Huang, Zhuo; Liao, Qiwei; Lan, Yuxiang; Wu, Peiyi; Guo, Yixin
Description
BackgroundThe newly approved biologics and Janus kinase inhibitors (JAKi) for pediatric atopic dermatitis (AD) offer additional options for clinical treatment. However, the efficacy and safety differences compared to the first approved biologic, dupilumab, remain unclear. Therefore, a network meta-analysis was conducted to evaluate these differences and identify potentially superior agents.MethodsThis systematic review was PROSPERO-registered (CRD42024583658). Randomized controlled trials involving pediatric patients (<18 years old) published in PubMed, Embase, Web of Science, and the Cochrane Library up to October 27, 2024 were searched and screened. RevMan software was utilized for quality assessment, and meta-analysis was performed using R version 4.4.1. Efficacy measures included the Investigator’s Global Assessment (IGA), the Numeric Rating Scale for Itch (NRS), and the Eczema Area and Severity Index (EASI). The results of these measures were expressed as odds ratios (OR), while treatment rankings of different interventions were determined using the P-score.ResultThis study included 11 trials involving 7 agents and 2,352 pediatric patients. The results indicated that dupilumab (300 mg) showed better outcomes than placebo in IGA-0/1 (OR = 4.68, 95% CI: 2.53–8.63), NRS-4 (OR = 6.75, 95% CI: 3.85–11.86), and all EASI outcomes. Tralokinumab may be the most effective option for alleviating pruritus (P-score for NRS-4, 0.8447). Upadacitinib (30 mg) performed best in IGA-0/1 (P-score, 0.9414), EASI-90 (P-score, 0.9926), and EASI-75 (P-score, 0.9707). Dupilumab (300 mg) had a higher risk of nasopharyngitis compared to placebo (OR = 2.15, 95%CI: 1.04–4.43). Compared to both placebo and dupilumab (300 mg), adverse event rates were higher with upadacitinib (15 mg and 30 mg), and upper respiratory tract infection risk was elevated with baricitinib (2 mg and 4 mg) and tralokinumab (300 mg).ConclusionThe efficacy of dupilumab for pediatric AD remains substantial, while other agents including upadacitinib, delgocitinib, and tralokinumab also present certain advantages. Future clinical trials may necessitate further evaluation of safety concerns.
p
AZTI, Marine Research
pigma.org
sextant.ifremer.fr
www:link
Updated Sep 10, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ifremer (2024). AZTI, Marine Research [Dataset]. https://www.pigma.org/geonetwork/srv/api/records/38ccabab-9d66-4470-bda8-3482629066ac
Explore at:
www:linkAvailable download formats
Dataset updated
Sep 10, 2024
Dataset provided by
Ifremer
Area covered
Description
This is an short tutorial to show how to download OBIS/GBIF occurrence data for multiple species. The code has been written to be used in H2020 Mission Atlantic (No 862428) Project Task 3.4.

The tutorial has been developed by Mireia Valle (github profile: MireiaValle, email: mvalle@azti.es) based on original code for sourcing OBIS and GBIF from Guillem Chust (email: gchust@azti.es) and some adaptations from Eduardo Ramirez.

Affiliation: AZTI, Marine Research, Basque Research and Technology Alliance (BRTA). Txatxarramendi ugartea z/g, 48395 Sukarrieta - Bizkaia, Spain
E
Ecuador Registered Activity Level Index: INA-R: MQ: EN: Service Activities...
ceicdata.com
Updated Jan 15, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
CEICdata.com (2025). Ecuador Registered Activity Level Index: INA-R: MQ: EN: Service Activities Incidental to Oil & Gas Extraction Excluding Surveying [Dataset]. https://www.ceicdata.com/en/ecuador/registered-activity-level-index-inar/registered-activity-level-index-inar-mq-en-service-activities-incidental-to-oil--gas-extraction-excluding-surveying
Explore at:
Dataset updated
Jan 15, 2025
Dataset provided by
CEICdata.com
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Mar 1, 2017 - Feb 1, 2018
Area covered
Ecuador
Variables measured
Economic Activity
Description
Ecuador Registered Activity Level Index: INA-R: MQ: EN: Service Activities Incidental to Oil & Gas Extraction Excluding Surveying data was reported at 77.203 2002=100 in Apr 2018. This records a decrease from the previous number of 121.817 2002=100 for Mar 2018. Ecuador Registered Activity Level Index: INA-R: MQ: EN: Service Activities Incidental to Oil & Gas Extraction Excluding Surveying data is updated monthly, averaging 96.591 2002=100 from Jan 2003 (Median) to Apr 2018, with 184 observations. The data reached an all-time high of 283.131 2002=100 in Apr 2006 and a record low of 36.909 2002=100 in May 2008. Ecuador Registered Activity Level Index: INA-R: MQ: EN: Service Activities Incidental to Oil & Gas Extraction Excluding Surveying data remains active status in CEIC and is reported by National Institute of Statistics and Census. The data is categorized under Global Database’s Ecuador – Table EC.A025: Registered Activity Level Index: INA-R.

Facebook

Twitter

Click to copy link

Link copied

Cite

Martin Bulla (2016). Example of how to manually extract incubation bouts from interactive plots of raw data - R-CODE and DATA [Dataset]. http://doi.org/10.6084/m9.figshare.2066784.v1

Example of how to manually extract incubation bouts from interactive plots of raw data - R-CODE and DATA

Explore at:

txtAvailable download formats

Unique identifier

https://doi.org/10.6084/m9.figshare.2066784.v1

Dataset updated

Jan 22, 2016

Dataset provided by

Figsharehttp://figshare.com/

Authors

Martin Bulla

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

{# General information# The script runs with R (Version 3.1.1; 2014-07-10) and packages plyr (Version 1.8.1), XLConnect (Version 0.2-9), utilsMPIO (Version 0.0.25), sp (Version 1.0-15), rgdal (Version 0.8-16), tools (Version 3.1.1) and lattice (Version 0.20-29)# --------------------------------------------------------------------------------------------------------# Questions can be directed to: Martin Bulla (bulla.mar@gmail.com)# -------------------------------------------------------------------------------------------------------- # Data collection and how the individual variables were derived is described in: #Steiger, S.S., et al., When the sun never sets: diverse activity rhythms under continuous daylight in free-living arctic-breeding birds. Proceedings of the Royal Society B: Biological Sciences, 2013. 280(1764): p. 20131016-20131016. # Dale, J., et al., The effects of life history and sexual selection on male and female plumage colouration. Nature, 2015. # Data are available as Rdata file # Missing values are NA. # --------------------------------------------------------------------------------------------------------# For better readability the subsections of the script can be collapsed # --------------------------------------------------------------------------------------------------------}{# Description of the method # 1 - data are visualized in an interactive actogram with time of day on x-axis and one panel for each day of data # 2 - red rectangle indicates the active field, clicking with the mouse in that field on the depicted light signal generates a data point that is automatically (via custom made function) saved in the csv file. For this data extraction I recommend, to click always on the bottom line of the red rectangle, as there is always data available due to a dummy variable ("lin") that creates continuous data at the bottom of the active panel. The data are captured only if greenish vertical bar appears and if new line of data appears in R console). # 3 - to extract incubation bouts, first click in the new plot has to be start of incubation, then next click depict end of incubation and the click on the same stop start of the incubation for the other sex. If the end and start of incubation are at different times, the data will be still extracted, but the sex, logger and bird_ID will be wrong. These need to be changed manually in the csv file. Similarly, the first bout for a given plot will be always assigned to male (if no data are present in the csv file) or based on previous data. Hence, whenever a data from a new plot are extracted, at a first mouse click it is worth checking whether the sex, logger and bird_ID information is correct and if not adjust it manually. # 4 - if all information from one day (panel) is extracted, right-click on the plot and choose "stop". This will activate the following day (panel) for extraction. # 5 - If you wish to end extraction before going through all the rectangles, just press "escape". }{# Annotations of data-files from turnstone_2009_Barrow_nest-t401_transmitter.RData dfr-- contains raw data on signal strength from radio tag attached to the rump of female and male, and information about when the birds where captured and incubation stage of the nest1. who: identifies whether the recording refers to female, male, capture or start of hatching2. datetime_: date and time of each recording3. logger: unique identity of the radio tag 4. signal_: signal strength of the radio tag5. sex: sex of the bird (f = female, m = male)6. nest: unique identity of the nest7. day: datetime_ variable truncated to year-month-day format8. time: time of day in hours9. datetime_utc: date and time of each recording, but in UTC time10. cols: colors assigned to "who"--------------------------------------------------------------------------------------------------------m-- contains metadata for a given nest1. sp: identifies species (RUTU = Ruddy turnstone)2. nest: unique identity of the nest3. year_: year of observation4. IDfemale: unique identity of the female5. IDmale: unique identity of the male6. lat: latitude coordinate of the nest7. lon: longitude coordinate of the nest8. hatch_start: date and time when the hatching of the eggs started 9. scinam: scientific name of the species10. breeding_site: unique identity of the breeding site (barr = Barrow, Alaska)11. logger: type of device used to record incubation (IT - radio tag)12. sampling: mean incubation sampling interval in seconds--------------------------------------------------------------------------------------------------------s-- contains metadata for the incubating parents1. year_: year of capture2. species: identifies species (RUTU = Ruddy turnstone)3. author: identifies the author who measured the bird4. nest: unique identity of the nest5. caught_date_time: date and time when the bird was captured6. recapture: was the bird capture before? (0 - no, 1 - yes)7. sex: sex of the bird (f = female, m = male)8. bird_ID: unique identity of the bird9. logger: unique identity of the radio tag --------------------------------------------------------------------------------------------------------}

Clear search

Close search

Google apps

Main menu

Example of how to manually extract incubation bouts from interactive plots...

R and Python libraries for the standardization of data extraction and...

Primary reporting of studies.

Data extraction.

Friends - R Package Dataset

Content

Uses

Data extraction form.

Table of rcprd functions.

Data Extraction File.

Dataset of books called An introduction to data analysis in R : hands-on...

LSC (Leicester Scientific Corpus)

LScDC Word-Category RIG Matrix

Feature Extraction

Reddit r/AskScience Flair Dataset

Ecuador Registered Activity Level Index: INA-R: MQ: EN: Extraction of Crude...

Extraction of the Structure Functions and R=Sigma-L/Sigma-T from Deep...

LScDC (Leicester Scientific Dictionary-Core)

Data from: A dataset to model Levantine landcover and land-use change...

Data extraction table.

AZTI, Marine Research

Ecuador Registered Activity Level Index: INA-R: MQ: EN: Service Activities...

Example of how to manually extract incubation bouts from interactive plots of raw data - R-CODE and DATA