22 datasets found

Data from: Optimized SMRT-UMI protocol produces highly accurate sequence...
data.niaid.nih.gov
zenodo.org
zip
Updated Dec 7, 2023
Cite
Dylan Westfall; Mullins James (2023). Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies [Dataset]. http://doi.org/10.5061/dryad.w3r2280w0
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.w3r2280w0
Dataset updated
Dec 7, 2023
Dataset provided by
HIV Vaccine Trials Network (http://www.hvtn.org/)
National Institute of Allergy and Infectious Diseases (http://www.niaid.nih.gov/)
HIV Prevention Trials Network (http://www.hptn.org/)
PEPFAR
Authors
Dylan Westfall; Mullins James
License
https://spdx.org/licenses/CC0-1.0.html
Description
Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing, which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR. The use of UMIs allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing, producing a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies.

Methods

This serves as an overview of the analysis performed on PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al., "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies".

Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005.

For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub.

The demultiplexed read collections from the chunked_demux pipeline, or the CCS read files from datasets which were not indexed (M1567, M004, M005), were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer). The pipeline uses it to further demultiplex the reads and to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub.

The consensus.tar.gz and tagged.tar.gz files were moved from the sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz).
Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each has an .Rmd file with the same name inside which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results.

Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, two fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the Python script compare_seqs.py, which identifies any matched sequences that are different between sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program.

To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed and, for each family with discordant sUMI and dUMI sequences, the reads from the UMI1_keeping directory were aligned using Geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Geneious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment and specifically the UMI2 sequences, the site of the discordance and its cause were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper.

Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined, and used as input to create a single dataset containing all UMI information from all samples. This file, dUMI_df.csv, was used as input for Figures.Rmd.

Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for each figure. These were copied into Prism software to create the final figures for the paper.
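The comparison step described above (finding matched sequences that differ between the sUMI and dUMI collections) can be sketched as follows. This is a minimal illustration, not the authors' compare_seqs.py: it assumes that matched records share the same FASTA header ID in both files, and the function names and example records are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping record ID to sequence."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the record ID is the first word of the header line.
            current_id = line[1:].split()[0]
            records[current_id] = ""
        elif current_id is not None:
            records[current_id] += line.upper()
    return records


def find_discordant(sumi_fasta, dumi_fasta):
    """Return sorted IDs found in both collections whose sequences differ."""
    sumi = parse_fasta(sumi_fasta)
    dumi = parse_fasta(dumi_fasta)
    shared_ids = sumi.keys() & dumi.keys()
    return sorted(i for i in shared_ids if sumi[i] != dumi[i])


# Hypothetical example records, for illustration only.
example_sumi = ">fam1\nACGT\n>fam2\nAC\nGT\n>fam3\nTTTT\n"
example_dumi = ">fam1\nACGT\n>fam2\nACGA\n>fam4\nCCCC\n"
print(find_discordant(example_sumi, example_dumi))
```

Records that appear in only one collection are ignored here; whether the real script reports them is not stated in the description.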
NYC STEW-MAP Staten Island organizations' website hyperlink webscrape
catalog.data.gov
s.cnmilf.com
Updated Nov 21, 2022
Cite
U.S. EPA Office of Research and Development (ORD) (2022). NYC STEW-MAP Staten Island organizations' website hyperlink webscrape [Dataset]. https://catalog.data.gov/dataset/nyc-stew-map-staten-island-organizations-website-hyperlink-webscrape
Dataset updated
Nov 21, 2022
Dataset provided by
United States Environmental Protection Agency (http://www.epa.gov/)
Area covered
Staten Island, New York
Description
The data represent web-scraping of hyperlinks from a selection of environmental stewardship organizations that were identified in the 2017 NYC Stewardship Mapping and Assessment Project (STEW-MAP) (USDA 2017). There are two data sets: 1) the original scrape containing all hyperlinks within the websites and associated attribute values (see "README" file); 2) a cleaned and reduced dataset formatted for network analysis.

For dataset 1: Organizations were selected from the 2017 NYC Stewardship Mapping and Assessment Project (STEW-MAP) (USDA 2017), a publicly available, spatial data set about environmental stewardship organizations working in New York City, USA (N = 719). To create a smaller and more manageable sample to analyze, all organizations that intersected (i.e., worked entirely within or overlapped) the NYC borough of Staten Island were selected for a geographically bounded sample. Only organizations with working websites that the web scraper could access were retained for the study (n = 78). The websites were scraped between 09 and 17 June 2020 to a maximum search depth of ten using the snaWeb package (version 1.0.1, Stockton 2020) in the R computational language environment (R Core Team 2020).

For dataset 2: The complete scrape results were cleaned, reduced, and formatted as a standard edge-array (node1, node2, edge attribute) for network analysis. See "README" file for further details.

References:
R Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/. Version 4.0.3.
Stockton, T. (2020). snaWeb Package: An R package for finding and building social networks for a website, version 1.0.1.
USDA Forest Service. (2017). Stewardship Mapping and Assessment Project (STEW-MAP). New York City Data Set. Available online at https://www.nrs.fs.fed.us/STEW-MAP/data/.

This dataset is associated with the following publication: Sayles, J., R. Furey, and M. Ten Brink. How deep to dig: effects of web-scraping search depth on hyperlink network analysis of environmental stewardship organizations. Applied Network Science. Springer Nature, New York, NY, 7: 36, (2022).
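To illustrate the edge-array layout (node1, node2, edge attribute) mentioned for dataset 2, here is a small sketch that turns such triples into a directed adjacency map and counts outgoing links per node. The domain names and the meaning of the attribute (shown here as a number) are hypothetical; the dataset's README defines the actual columns.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build a directed adjacency map from (node1, node2, attribute) triples."""
    adjacency = defaultdict(dict)
    for node1, node2, attribute in edges:
        # A repeated (node1, node2) pair overwrites the earlier attribute.
        adjacency[node1][node2] = attribute
    return dict(adjacency)


def out_degree(adjacency):
    """Count the distinct targets each source node links to."""
    return {node: len(targets) for node, targets in adjacency.items()}


# Hypothetical edge array: source site, target site, edge attribute.
example_edges = [
    ("org-a.example", "org-b.example", 1),
    ("org-a.example", "org-c.example", 2),
    ("org-b.example", "org-c.example", 1),
]
example_adjacency = build_adjacency(example_edges)
print(out_degree(example_adjacency))
```

Nodes that only receive links (org-c.example above) do not appear as keys; a fuller analysis would normally hand the edge array to a network library instead.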
Cheltenham Crime Data
kaggle.com
zip
Updated Jul 8, 2018
Cite
Mike Chirico (2018). Cheltenham Crime Data [Dataset]. https://www.kaggle.com/mchirico/chtpd
Explore at:
Available download formats: zip (1084459 bytes)
Dataset updated
Jul 8, 2018
Authors
Mike Chirico
License
http://opendatacommons.org/licenses/dbcl/1.0/
Description
Cheltenham PA, Crime Data

Cheltenham is a home rule township bordering North Philadelphia in Montgomery County. It has a population of about 37,000 people. You can find out more about Cheltenham on Wikipedia.

Cheltenham's Facebook groups contain postings on crime and other events in the community.

Getting Started

Reading Data is a simple Python script for getting started.

If you prefer to use R, there is an example Kernel here.

Proximity to Philadelphia

This township borders on Philadelphia, which may or may not influence crime in the community. For Philadelphia crime patterns, see the Philadelphia Crime Dataset.

Reference

Data was obtained from socrata.com
Data from: Critical Search: A procedure for guided reading in large-scale...
dataverse.harvard.edu
Updated Jan 4, 2019
Cite
Jo Guldi (2019). Critical Search: A procedure for guided reading in large-scale textual corpora [Dataset]. http://doi.org/10.7910/DVN/BJNAPD
Explore at:
Croissant (a format for machine-learning datasets; learn more about this at mlcommons.org/croissant)
Unique identifier
https://doi.org/10.7910/DVN/BJNAPD
Dataset updated
Jan 4, 2019
Dataset provided by
Harvard Dataverse
Authors
Jo Guldi
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
This dataset contains full-scale visualizations as well as original data and code (in R and Python) to reproduce the figures and tables for "Critical Search." The data includes full-text data for the Hansard debates, and the code employs keyword search, topic modeling, and KL measurement.

Dataset of psychophysiological data from children with learning difficulties...

openneuro.org

Updated May 26, 2025

Cite

César E. Corona-González; Claudia Rebeca De Stefano-Ramos; Juan Pablo Rosado-Aíza; David I. Ibarra-Zarate; Fabiola R. Gómez-Velázquez; Luz María Alonso-Valerdi (2025). Dataset of psychophysiological data from children with learning difficulties who strengthen reading and math skills through assistive technology [Dataset]. http://doi.org/10.18112/openneuro.ds006260.v1.0.0

Unique identifier

https://doi.org/10.18112/openneuro.ds006260.v1.0.0

Dataset updated

May 26, 2025

Dataset provided by

OpenNeuro (https://openneuro.org/)

Authors

César E. Corona-González; Claudia Rebeca De Stefano-Ramos; Juan Pablo Rosado-Aíza; David I. Ibarra-Zarate; Fabiola R. Gómez-Velázquez; Luz María Alonso-Valerdi

License

CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically

Description

README

Authors

César E. Corona-González, Claudia Rebeca De Stefano-Ramos, Juan Pablo Rosado-Aíza, Fabiola R Gómez-Velázquez, David I. Ibarra-Zarate, Luz María Alonso-Valerdi

Contact person

César E. Corona-González

https://orcid.org/0000-0002-7680-2953

a00833959@tec.mx

Project name

Psychophysiological data from Mexican children with learning difficulties who strengthen reading and math skills by assistive technology

Year that the project ran

2023

Brief overview of the tasks in the experiment

The current dataset consists of psychometric and electrophysiological data from children with reading or math learning difficulties. These data were collected to evaluate improvements in reading or math skills resulting from using an online learning method called Smartick.

The psychometric evaluations of children with reading difficulties comprised spelling tests, in which 1) orthographic and 2) phonological errors were counted; 3) reading speed, expressed in words read per minute; and 4) reading comprehension, assessed with multiple-choice questions. The last two parameters were determined according to the standards of the Ministry of Public Education (Secretaría de Educación Pública in Spanish) in Mexico. The assessments for the math difficulties group comprised 1) an assessment of general mathematical knowledge, 2) the percentage of hits, and 3) the reaction time in an arithmetic task. Selective attention and intelligence quotient (IQ) were also evaluated.

Then, individuals underwent an EEG experimental paradigm where two conditions were recorded: 1) a 3-minute eyes-open resting state and 2) performing either reading or mathematical activities. EEG recordings from the reading experiment consisted of reading a text aloud and then answering questions about the text. Alternatively, EEG recordings from the math experiment involved the solution of two blocks with 20 arithmetic operations (addition and subtraction). Subsequently, each child was randomly subcategorized as 1) the experimental group, who were asked to engage with Smartick for three months, and 2) the control group, who were not involved with the intervention. Once the 3-month period was over, every child was reassessed as described before.

Description of the contents of the dataset

The dataset contains a total of 76 subjects (sub-), where two study groups were assessed: 1) reading difficulties (R) and 2) math difficulties (M). Each individual was then assigned to either the experimental subgroup (e), in which children committed to engaging with Smartick, or the control subgroup (c), in which they received no intervention.

Every subject was followed up on for three months. During this period, each subject underwent two EEG sessions, representing the PRE-intervention (ses-1) and the POST-intervention (ses-2).

The EEG recordings from the reading difficulties group consisted of a resting-state condition (run-1) and a condition in which children performed active reading and reading comprehension activities (run-2). EEG data from the math difficulties group were collected in a resting-state condition (run-1) and while solving two blocks of 20 arithmetic operations (run-2 and run-3). All EEG files were stored in .set format. The nomenclature and description of the filenames are shown below:

Nomenclature	Description
sub-	Subject
M	Math group
R	Reading group
c	Control subgroup
e	Experimental subgroup
ses-1	PRE-intervention
ses-2	POST-intervention
run-1	EEG for baseline
run-2	EEG for reading activity, or the first block of math
run-3	EEG for the second block of math

Example: the file sub-Rc11_ses-1_task-SmartickDataset_run-2_eeg.set refers to:
- The 11th subject from the reading difficulties group, control subgroup (sub-Rc11).
- EEG recording from the PRE-intervention (ses-1) while performing the reading activity (run-2).
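The naming scheme in the table above can be decoded programmatically. The sketch below is an illustration based only on the nomenclature listed here; the regular expression and the function name parse_eeg_filename are assumptions, not part of the dataset.

```python
import re

# Pattern built from the nomenclature table: group (M/R), subgroup (c/e),
# subject number, session (1/2), task label, and run (1-3).
FILENAME_PATTERN = re.compile(
    r"^sub-(?P<group>[MR])(?P<subgroup>[ce])(?P<number>\d+)"
    r"_ses-(?P<session>[12])"
    r"_task-(?P<task>[A-Za-z0-9]+)"
    r"_run-(?P<run>[123])_eeg\.set$"
)

GROUPS = {"M": "math", "R": "reading"}
SUBGROUPS = {"c": "control", "e": "experimental"}
SESSIONS = {"1": "PRE-intervention", "2": "POST-intervention"}


def parse_eeg_filename(filename):
    """Decode an EEG filename into its labelled parts; raise ValueError if it does not match."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Unrecognized EEG filename: {filename}")
    return {
        "group": GROUPS[match["group"]],
        "subgroup": SUBGROUPS[match["subgroup"]],
        "subject_number": int(match["number"]),
        "session": SESSIONS[match["session"]],
        "task": match["task"],
        "run": int(match["run"]),
    }


print(parse_eeg_filename("sub-Rc11_ses-1_task-SmartickDataset_run-2_eeg.set"))
```

The pattern accepts run-3 for both groups even though the table lists it only for the math group; a stricter check could reject that combination.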

Independent variables

Study groups:
- Reading difficulties
  - Control: children did not follow any intervention
  - Experimental: Children used the reading program of Smartick for 3 months
- Math difficulties
  - Control: children did not follow any intervention
  - Experimental: Children used the math program of Smartick for 3 months
Condition:
- PRE-intervention: first psychological and electroencephalographic evaluation
- POST-intervention: second psychological and electroencephalographic evaluation

Dependent variables

Psychometric data from the reading difficulties group:
- Orthographic_ERR: number of orthographic errors.
- Phonological_ERR: number of phonological errors.
- Selective_Attention: score from the selective attention test.
- Reading_Speed: reading speed in words per minute.
- Comprehension: score on a reading comprehension task.
- GROUP: C for the control group, E for the experimental group.
- GENDER: M for male, F for female.
- AGE: age at the beginning of the study.
- IQ: intelligence quotient.
Psychometric data from the math difficulties group:
- WRAT4: score from the WRAT-4 test.
- hits: hits during the EEG acquisition [%].
- RT: reaction time during the EEG acquisition [s].
- Selective_Attention: score from the selective attention test.
- GROUP: C for the control group, E for the experimental group.
- GENDER: M for male, F for female.
- AGE: age at the beginning of the study.
- IQ: intelligence quotient.

Psychometric data can be found in the 01_Psychometric_Data.xlsx file.

Engagement percentage within Smartick (only for experimental group)
- These values represent the engagement percentage through Smartick.
- Students were asked to get involved with the online method for learning for 3 months, 5 days a week.
- Values greater than 100% denote participants who regularly logged in more than 5 days a week.

Engagement percentages can be found in the 05_SessionEngagement.xlsx file.
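The README does not give the formula behind the engagement percentage. One plausible reading, offered purely as an assumption, is the number of days a child logged in divided by the number of days requested (5 per week over the intervention period). The function name and the 12-week figure below are hypothetical.

```python
def engagement_percentage(days_logged, weeks, days_per_week=5):
    """Return logged days as a percentage of the requested days (assumed formula)."""
    expected_days = weeks * days_per_week
    if expected_days <= 0:
        raise ValueError("weeks and days_per_week must be positive")
    return 100.0 * days_logged / expected_days


# Assumption: roughly 12 weeks stand in for the three-month period.
print(engagement_percentage(days_logged=66, weeks=12))
```

Under this reading, a child who logs in on more than 5 days in most weeks ends up above 100%, which matches the note above.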

Methods

Subjects

Seventy-six Mexican children between 7 and 13 years old were enrolled in this study.

Information about the recruitment procedure

The sample was recruited through non-profit foundations that support learning and foster care programs.

Apparatus

g.USBamp RESEARCH amplifier

Initial setup

Explain the task to the participant.
Sign informed consent.
Set up electrodes.

Task details

The stimuli nested folder contains all stimuli employed in the EEG experiments.

Level 1
- Math: Images used in the math experiment.
- Reading: Images used in the reading experiment.

Level 2
- Math
  - POST_Operations: arithmetic operations from the POST-intervention.
  - PRE_Operations: arithmetic operations from the PRE-intervention.
- Reading
  - POST_Reading1: text 1 and text-related comprehension questions from the POST-intervention.
  - POST_Reading2: text 2 and text-related comprehension questions from the POST-intervention.
  - POST_Reading3: text 3 and text-related comprehension questions from the POST-intervention.
  - PRE_Reading1: text 1 and text-related comprehension questions from the PRE-intervention.
  - PRE_Reading2: text 2 and text-related comprehension questions from the PRE-intervention.
  - PRE_Reading3: text 3 and text-related comprehension questions from the PRE-intervention.

Level 3
- Math
  - Operation01.jpg to Operation20.jpg: arithmetical operations solved during the first block of the math

96 wells fluorescence reading and R code statistic for analysis
zenodo.org
bin, csv, doc, pdf
Updated Aug 2, 2024
Cite
JVD Molino; JVD Molino (2024). 96 wells fluorescence reading and R code statistic for analysis [Dataset]. http://doi.org/10.5281/zenodo.1119285
Explore at:
Available download formats: doc, csv, pdf, bin
Unique identifier
https://doi.org/10.5281/zenodo.1119285
Dataset updated
Aug 2, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
JVD Molino; JVD Molino
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Overview

The data points in this dataset were obtained as follows. To assess the secretion efficiency of the constructs, 96 colonies from the selection plates were evaluated using the workflow presented in Figure Workflow. We picked transformed colonies and cultured them in 400 μL TAP medium for 7 days in deep-well plates (Corning Axygen®, No.: PDW500CS, Thermo Fisher Scientific Inc., Waltham, MA), covered with Breathe-Easy® (Sigma-Aldrich®). Cultivation was performed on a rotary shaker, set to 150 rpm, under constant illumination (50 μmol photons/m²s). Then 100 μL of sample was transferred to a clear-bottom 96-well plate (Corning Costar, Tewksbury, MA, USA) and fluorescence was measured using an Infinite® M200 PRO plate reader (Tecan, Männedorf, Switzerland). Fluorescence was measured at excitation 575/9 nm and emission 608/20 nm. Supernatant samples were obtained by spinning the deep-well plates at 3000 × g for 10 min and transferring 100 μL from each well to a clear-bottom 96-well plate (Corning Costar, Tewksbury, MA, USA), followed by fluorescence measurement. To compare the constructs, R version 3.3.3 was used to perform one-way ANOVA (with Tukey's test); the significance level for hypothesis tests was set at 0.05. Graphs were generated in RStudio v1.0.136. The code is deposited herein.
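For readers unfamiliar with the statistic, the following sketch computes the F statistic of a one-way ANOVA by hand. It is not the deposited R code; the function name and the example fluorescence values are hypothetical, and Tukey's post-hoc test and the p-value lookup are left out.

```python
def one_way_anova_f(groups):
    """Return (F statistic, df_between, df_within) for a list of numeric groups."""
    all_values = [value for group in groups for value in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Variation of individual values around their own group mean.
    ss_within = sum(
        (value - mean) ** 2
        for group, mean in zip(groups, group_means)
        for value in group
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_statistic = (ss_between / df_between) / (ss_within / df_within)
    return f_statistic, df_between, df_within


# Hypothetical fluorescence readings for three constructs.
example_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
print(one_way_anova_f(example_groups))
```

The resulting F value would then be compared against the F distribution with the two returned degrees of freedom at the 0.05 level used in the study.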

Info

ANOVA_Turkey_Sub.R -> code for ANOVA analysis in R statistic 3.3.3

barplot_R.R -> code to generate bar plot in R statistic 3.3.3

boxplotv2.R -> code to generate boxplot in R statistic 3.3.3

pRFU_+_bk.csv -> relative supernatant mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cell of Chlamydomonas reinhardtii

sup_+_bl.csv -> supernatant mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cell of Chlamydomonas reinhardtii

sup_raw.csv -> supernatant mCherry fluorescence dataset of 96 colonies for each construct.

who_+_bl2.csv -> whole culture mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cell of Chlamydomonas reinhardtii

who_raw.csv -> whole culture mCherry fluorescence dataset of 96 colonies for each construct.

who_+_Chlo.csv -> whole culture chlorophyll fluorescence dataset of 96 colonies for each construct.

Anova_Output_Summary_Guide.pdf -> explains the content of the ANOVA files

ANOVA_pRFU_+_bk.doc -> ANOVA of relative supernatant mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cell of Chlamydomonas reinhardtii

ANOVA_sup_+_bk.doc -> ANOVA of supernatant mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cell of Chlamydomonas reinhardtii

ANOVA_who_+_bk.doc -> ANOVA of whole culture mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cell of Chlamydomonas reinhardtii

ANOVA_Chlo.doc -> ANOVA of whole culture chlorophyll fluorescence of all constructs, plus average and standard deviation values.

Consider citing our work.

Molino JVD, de Carvalho JCM, Mayfield SP (2018) Comparison of secretory signal peptides for heterologous protein expression in microalgae: Expanding the secretion portfolio for Chlamydomonas reinhardtii. PLoS ONE 13(2): e0192433. https://doi.org/10.1371/journal.pone.0192433
Trends in Reading and Language Arts Proficiency (2011-2022):...
publicschoolreview.com
Updated Mar 7, 2025
Cite
Public School Review (2025). Trends in Reading and Language Arts Proficiency (2011-2022): Ferguson-Florissant R-II School District vs. Missouri [Dataset]. https://www.publicschoolreview.com/missouri/ferguson-florissant-r-ii-school-district/2912010-school-district
Dataset updated
Mar 7, 2025
Dataset authored and provided by
Public School Review
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Area covered
Ferguson-Florissant R-II School District, Missouri
Description
This dataset tracks annual reading and language arts proficiency from 2011 to 2022 for Ferguson-Florissant R-II School District vs. Missouri.
Newcastle polysomnography and accelerometer data
data.niaid.nih.gov
zenodo.org
Updated Jan 24, 2020
Cite
Vincent van Hees; Sarah Charman; Kirstie Anderson (2020). Newcastle polysomnography and accelerometer data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1160409
Dataset updated
Jan 24, 2020
Dataset provided by
Netherlands eScience Center, Amsterdam
Regional Sleep Service, Freeman Hospital, Newcastle
Newcastle University, Newcastle
Authors
Vincent van Hees; Sarah Charman; Kirstie Anderson
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Newcastle PSG+Accelerometer study 2015

This data set contains 55 .bin files, 28 .txt files, and one .csv file, which were collected in Newcastle upon Tyne (UK) to evaluate an accelerometer-based algorithm for sleep classification. The data come from a single-night polysomnography recording in 28 sleep clinic patients. A description of the experimental protocol can be found in this open access PLoS ONE paper from 2015: https://doi.org/10.1371/journal.pone.0142533.

Polysomnography

Sleep scores derived from polysomnography are stored in the .txt files. Each file represents a time series (one night) of one participant. The resolution of the scoring is 30 seconds. Participants are numbered. The participant number is included in the file names as “mecsleep01_...”. pariticpants_info.csv is a dictionary of participant number, diagnosis, age, and sex.

Accelerometer data

Accelerometer data from the brand GENEActiv (https://www.activinsights.com) are stored in .bin files. Two accelerometers were used per participant, one on each wrist (left and right). The right-wrist recording from participant 10 is missing, hence the total of 55 .bin files. The tri-axial (three-axis) accelerometers were configured to record at 85.7 Hertz. The accelerometer data can be read with the R package GENEAread https://cran.r-project.org/web/packages/GENEAread/index.html. Additional information on the accelerometer can be found on the manufacturer's product website: https://www.activinsights.com/resources-support/geneactiv/downloads-software/, including a description of the binary file structure on page 27 of this (pdf) file: https://49wvycy00mv416l561vrj345-wpengine.netdna-ssl.com/wp-content/uploads/2014/03/geneactiv_instruction_manual_v1.2.pdf. The participant number and the body side on which the accelerometer is worn are included in the file names as “MECSLEEP01_left wrist...”.
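The figures quoted above (30-second sleep-score epochs, 85.7 Hz sampling, 28 participants with two wrists each and one missing file) fit together as shown in this small sketch. It only illustrates the arithmetic; the function names are hypothetical, and it assumes that the accelerometer and polysomnography recordings start at the same instant, which the description does not state.

```python
import math

SAMPLE_RATE_HZ = 85.7  # accelerometer sampling rate from the description
EPOCH_SECONDS = 30     # polysomnography scoring resolution from the description


def samples_per_epoch(sample_rate_hz=SAMPLE_RATE_HZ, epoch_seconds=EPOCH_SECONDS):
    """Number of accelerometer samples covered by one scoring epoch."""
    return round(sample_rate_hz * epoch_seconds)


def epoch_of_sample(sample_index, sample_rate_hz=SAMPLE_RATE_HZ,
                    epoch_seconds=EPOCH_SECONDS):
    """Zero-based epoch that a sample falls into, assuming aligned start times."""
    elapsed_seconds = sample_index / sample_rate_hz
    return math.floor(elapsed_seconds / epoch_seconds)


def expected_bin_files(participants=28, wrists=2, missing=1):
    """File count implied by the study design minus missing recordings."""
    return participants * wrists - missing


print(samples_per_epoch(), epoch_of_sample(3000), expected_bin_files())
```

In practice the published analysis relies on the R packages GGIR and GENEAread for reading and aligning the data, so this is only a reading aid.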

Participant information

The .csv file as included in this dataset contains a dictionary of the participant numbers, sleep disorder diagnosis, participant age at the time of measurement, and sex.

Example processing

The code we used ourselves to process this data can be found in this GitHub repository: https://github.com/wadpac/psg-ncl-acc-spt-detection-eval. Note that we use R package GGIR: https://cran.r-project.org/web/packages/GGIR/, which calls R package GENEAread for reading the binary data.
Trends in Reading and Language Arts Proficiency (2011-2022): Oak Grove...
publicschoolreview.com
Cite
Public School Review, Trends in Reading and Language Arts Proficiency (2011-2022): Oak Grove Middle School vs. Missouri vs. Oak Grove R-VI School District [Dataset]. https://www.publicschoolreview.com/oak-grove-middle-school-profile/64075
Dataset authored and provided by
Public School Review
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Area covered
Oak Grove R-VI School District, Missouri
Description
This dataset tracks annual reading and language arts proficiency from 2011 to 2022 for Oak Grove Middle School vs. Missouri and Oak Grove R-VI School District.
Code for dealing with data format CARIBIC_NAmes_v02
edmond.mpdl.mpg.de
edmond.mpg.de
txt, type/x-r-syntax
Updated Mar 3, 2022
Cite
Walter, David; Walter, David (2022). Code for dealing with data format CARIBIC_NAmes_v02 [Dataset]. http://doi.org/10.17617/3.WDVSU7
Explore at:
Available download formats: type/x-r-syntax (75684), txt (76894), txt (128015), txt (132902)
Unique identifier
https://doi.org/10.17617/3.WDVSU7
Dataset updated
Mar 3, 2022
Dataset provided by
Edmond
Authors
Walter, David; Walter, David
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
R- and Igor-Code for reading and writing data files of format "CARIBIC_NAmes_v02". See "https://doi.org/10.17617/3.10" for the file format description. That file format has been used predominantly within projects CARIBIC and ATTO, for example in "https://doi.org/10.17617/3.3r". The code files of this dataset can be used with software R ("r-project.org") or Igor Pro ("https://www.wavemetrics.com/").
Data from: Benzoxazinoids in roots and shoots of cereal rye (Secale cereale)...
agdatacommons.nal.usda.gov
s.cnmilf.com
application/csv
Updated Nov 21, 2025
Cite
Clifford P. Rice; Briana A. Otte; Matthew Kramer; Harry H. Schomberg; Steven B. Mirsky; Katherine L. Tully (2025). Data from: Benzoxazinoids in roots and shoots of cereal rye (Secale cereale) and their fates in soil after cover crop termination [Dataset]. http://doi.org/10.15482/USDA.ADC/1526330
Explore at:
Available download formats: application/csv
Unique identifier
https://doi.org/10.15482/USDA.ADC/1526330
Dataset updated
Nov 21, 2025
Dataset provided by
Ag Data Commons
Authors
Clifford P. Rice; Briana A. Otte; Matthew Kramer; Harry H. Schomberg; Steven B. Mirsky; Katherine L. Tully
License
U.S. Government Works (https://www.usa.gov/government-works)
License information was derived automatically
Description
Cover crops provide many agroecosystem services, including weed suppression, which is partially exerted through release of allelopathic benzoxazinoid (BX) compounds. This research characterizes (1) changes in concentrations of BX compounds in shoots, roots, and soil at three growth stages (GS) of cereal rye (Secale cereale L.), and (2) their degradation over time following termination. Concentrations of shoot dominant BX compounds, DIBOA-glc and DIBOA, were least at GS 83 (boot). The root dominant BX compound, HMBOA-glc, concentration was least at GS 54 (elongation). Rhizosphere soil BX concentrations were 1000 times smaller than in root tissues. Dominant compounds in soil were HMBOA-glc and HMBOA. Concentrations of BX compounds were similar for soil near root crowns and between-rows. Soil BX concentrations following cereal rye termination declined exponentially over time in three of four treatments: incorporated shoots (S) and roots (R), no-till S+R (cereal rye rolled flat), and no-till R (shoots removed), but not in no-till S. On the day following cereal rye termination, soil concentrations of HMBOA-glc and HMBOA in these three treatments increased above initial concentrations. Concentrations of these two compounds decreased the fastest while DIBOA-glc declined the slowest (half-life of 4 d in no-till S+R soil). Placement of shoots on the surface of an area where cereal rye had not grown (no-till S) did not increase soil concentrations of BX compounds. The short duration and complex dynamics of BX compounds in soil prior to and following termination illustrate the limited window for enhancing weed suppression by cereal rye allelochemicals; valuable information for programs breeding for enhanced weed suppression. In addition to the data analyzed for this article, we also include the R code. Resources in this dataset:Resource Title: BX data following termination. 
File Name: FinalBXsForMatt-20200908.csv
Resource Description: For each sample, gives the time, depth, location, and plot treatment, and then the compound concentrations. This is the principal data set analyzed with the R code (anal2-cleaned.r); see that code for use.

Resource Title: BX compounds from 3rd sampling time before termination.
File Name: soil2-20201123.csv
Resource Description: These data are for comparison with the post-termination data. They were taken at the 3rd sampling time (pre-termination), a day prior to termination. Each sample is identified with a treatment, date, and plot location, in addition to the BX concentrations. See R code (anal2-cleaned.r) for how this file is used.

Resource Title: Soil location (within row versus between row) values of BX compounds.
File Name: s2b.csv
Resource Description: Each row gives the average BX compound for each soil location (within row versus between row) for the second sample for each plot. These data are combined with bx3 (the data set read in from the file "FinalBXsForMatt-20200908.csv"). See R code (anal2-cleaned.r) for use.

Resource Title: R code for analysis of the decay (post-termination) BX data.
File Name: anal2-cleaned.r
Resource Description: This is the R code used to analyze the termination data. It also creates and writes out some data subsets (used for analysis and plots) that are later read in.
Resource Software Recommended: R version 3.6.3, url: https://www.R-project.org/

Resource Title: Tissue BX compounds.
File Name: tissues20210728b.csv
Resource Description: Data file holding results from a tissue analysis for BX compounds, in ug, from shoots and roots, and at various sampling times. Read into the R file anal1-cleaned.r, where it is used in a statistical analysis and to create figures.

Resource Title: BX compounds from soil with a live rye cover crop.
File Name: soil2-20201214.csv
Resource Description: BX compounds (in ng/g dry wt), by treatment, sampling time, date, and plot ID. These data are read into the R program, anal1-cleaned.r, for analysis and to create figures. These are soil samples taken from locations with a live rye plant cover crop.

Resource Title: R code for BX analyses of soil under rye and plant tissues.
File Name: anal1-cleaned.r
Resource Description: R code for analysis of the soil BX compounds under a live rye cover crop at different growing stages, and for the analysis of tissue BX compounds. In addition to statistical analyses, code in this file creates figures and some statistical output that is used to create a file that is later read in for figure creation (s2-CLD20220730-Stage.csv).
Resource Software Recommended: R version 3.6.3, url: https://www.R-project.org/

Resource Title: Description of data files for anal2-cleaned.r.
File Name: readme2.txt
Resource Description: Describes the input files used in the R code in anal2-cleaned.r, including descriptions and formats for each field. The file also describes some output (results) files that were uploaded to this site. This is a plain ASCII text file.

Resource Title: Estimates produced by anal2-cleaned.r from statistical modeling.
File Name: Estimates20201110.csv
Resource Description: Estimates produced by anal2-cleaned.r from statistical modeling (see readme2.txt).

Resource Title: Summary statistics from anal2-cleaned.r.
File Name: CV20210412.csv
Resource Description: Summary statistics from anal2-cleaned.r, used for plots.

Resource Title: Data summaries (same as CV20210412.csv), rescaled.
File Name: RESCALE-20210412.csv
Resource Description: Same as "CV20210412.csv" except that the log of the data has been rescaled to a minimum of at least zero and a maximum of one; see readme2.txt.

Resource Title: Statistical summaries for different stages.
File Name: s2-CLD20220730-Stage.csv
Resource Description: Statistical summaries used for creating a figure (not used in paper), used in anal1-cleaned.r; data for soil BX under living rye.

Resource Title: Description of data files for anal1-cleaned.r.
File Name: readme1.txt
Resource Description: Contains general descriptions of data imported into anal1-cleaned.r, and a description of each field. Also contains some descriptions of files output by anal1-cleaned.r, used to create tables or figures.
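The description above reports that soil BX concentrations declined exponentially and quotes a half-life. The following is a minimal sketch of how a half-life follows from such a decline, assuming first-order decay C(t) = C0 * exp(-k t) and a least-squares fit on log-concentration. The function name and the synthetic values are illustrative assumptions; they are not taken from anal2-cleaned.r or from the data files.

```python
import math

def half_life_from_series(days, concentrations):
    """Estimate the decay rate k and the half-life from a log-linear fit."""
    logs = [math.log(c) for c in concentrations]
    n = len(days)
    mean_t = sum(days) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in days)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(days, logs))
    k = -(sxy / sxx)               # slope of log C against t equals -k
    return k, math.log(2) / k      # half-life = ln(2) / k

# Synthetic series (not measured data) built with a 4-day half-life
days = [0, 1, 2, 4, 8]
conc = [100.0 * 0.5 ** (t / 4.0) for t in days]
k, t_half = half_life_from_series(days, conc)
```

With real measurements the fit will not be exact, and the published analysis may use a different model form or estimation method.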
Brazil Imports: NCM: fob: Discs F/Laser Read.Syst.Which May Be...
ceicdata.com
Updated Dec 15, 2022
CEICdata.com (2022). Brazil Imports: NCM: fob: Discs F/Laser Read.Syst.Which May Be Record.Once(Cd-R) [Dataset]. https://www.ceicdata.com/en/brazil/hs85-electrical-machinery-and-equipment-and-parts-thereof-others-imports-value/imports-ncm-fob-discs-flaser-readsystwhich-may-be-recordoncecdr
Dataset updated
Dec 15, 2022
Dataset provided by
CEICdata.com
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Feb 1, 2006 - Jan 1, 2007
Area covered
Brazil
Variables measured
Merchandise Trade
Description
Brazil Imports: NCM: fob: Discs F/Laser Read.Syst.Which May Be Record.Once(Cd-R) data was reported at 1.129 USD mn in Jan 2007. This is a decrease from the previous figure of 7.819 USD mn for Dec 2006. The data is updated monthly, with a median of 1.129 USD mn from Jan 2004 to Jan 2007 across 37 observations. The data reached an all-time high of 8.012 USD mn in Oct 2006 and a record low of 0.102 USD mn in Jan 2004. The series remains in active status in CEIC and is reported by the Special Secretariat for Foreign Trade and International Affairs. The data is categorized under Brazil Premium Database’s Foreign Trade – Table BR.NCM: HS85: Electrical Machinery and Equipment and Parts Thereof; Others: Imports: Value.
Yifan Gao, Lidong Bing, Wang Chen, Michael R. Lyu, Irwin King (2025)....
service.tib.eu
Updated Jan 3, 2025
(2025). Yifan Gao, Lidong Bing, Wang Chen, Michael R. Lyu, Irwin King (2025). Dataset: DQG dataset for reading comprehension. https://doi.org/10.57702/d3icbchw [Dataset]. https://service.tib.eu/ldmservice/dataset/dqg-dataset-for-reading-comprehension
Dataset updated
Jan 3, 2025
Description
The dataset for Difficulty-controllable Question Generation (DQG) for reading comprehension, prepared by the authors.
Data from: A dataset to model Levantine landcover and land-use change...
zenodo.org
data.niaid.nih.gov
+1more
zip
Updated Dec 16, 2023
Michael Kempf (2023). A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19 [Dataset]. http://doi.org/10.5281/zenodo.10396148
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.10396148
Dataset updated
Dec 16, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Michael Kempf
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Dec 16, 2023
Area covered
Levant
Description
Overview

This dataset is the repository for the following paper submitted to Data in Brief:

Kempf, M. A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19. Data in Brief (submitted: December 2023).

The Data in Brief article contains the supplement information and is the related data paper to:

Kempf, M. Climate change, the Arab Spring, and COVID-19 - Impacts on landcover transformations in the Levant. Journal of Arid Environments (revision submitted: December 2023).

Description/abstract

The Levant region is highly vulnerable to climate change, experiencing prolonged heat waves that have led to societal crises and population displacement. Since 2010, the area has been marked by socio-political turmoil, including the Syrian civil war and currently the escalation of the so-called Israeli-Palestinian Conflict, which strained neighbouring countries like Jordan due to the influx of Syrian refugees and increases population vulnerability to governmental decision-making. Jordan, in particular, has seen rapid population growth and significant changes in land-use and infrastructure, leading to over-exploitation of the landscape through irrigation and construction. This dataset uses climate data, satellite imagery, and land cover information to illustrate the substantial increase in construction activity and highlights the intricate relationship between climate change predictions and current socio-political developments in the Levant.

Folder structure

After download, the main folder contains all data. The following subfolders are stored as zipped files:

“code” stores the above described 9 code chunks to read, extract, process, analyse, and visualize the data.

“MODIS_merged” contains the 16-day, 250 m resolution NDVI imagery merged from three tiles (h20v05, h21v05, h21v06) and cropped to the study area (n=510), covering January 2001 to December 2022 and including January and February 2023.

“mask” contains a single shapefile, which is the merged product of administrative boundaries, including Jordan, Lebanon, Israel, Syria, and Palestine (“MERGED_LEVANT.shp”).

“yield_productivity” contains .csv files of yield information for all countries listed above.

“population” contains two files with the same name but different format. The .csv file is for processing and plotting in R. The .ods file is for enhanced visualization of population dynamics in the Levant (Socio_cultural_political_development_database_FAO2023.ods).

“GLDAS” stores the raw data of the NASA Global Land Data Assimilation System datasets that can be read, extracted (variable name), and processed using code “8_GLDAS_read_extract_trend” from the respective folder. One folder contains data from 1975-2022 and a second the additional January and February 2023 data.

“built_up” contains the landcover and built-up change data from 1975 to 2022. This folder is subdivided into two subfolders, which contain the raw data and the already processed data. “raw_data” contains the unprocessed datasets and “derived_data” stores the cropped built_up datasets at 5-year intervals, e.g., “Levant_built_up_1975.tif”.

Code structure

1_MODIS_NDVI_hdf_file_extraction.R

This first code chunk extracts MODIS data from the .hdf file format. The terra package must be installed, and the raw data must be downloaded using a simple mass downloader, e.g., from Google Chrome. After registration, download the MODIS data from https://lpdaac.usgs.gov/products/mod13q1v061/ or https://search.earthdata.nasa.gov/search (MODIS/Terra Vegetation Indices 16-Day L3 Global 250m SIN Grid V061, last accessed 09th of October 2023). The code reads a list of files, extracts the NDVI, and saves each file to a single .tif file with the indication “NDVI”. Because the study area is quite large, we have to load three spatially different time series and merge them later. Note that the time series are temporally consistent.

2_MERGE_MODIS_tiles.R

In this code, we load and merge the three different stacks to produce large and consistent time series of NDVI imagery across the study area. We further use the package gtools to load the files in numerical order (1, 2, 3, 4, 5, 6, etc.). Here, we have three stacks, from which we merge the first two (stack 1, stack 2) and store them. We then merge this stack with stack 3. We produce single files named NDVI_final_*consecutivenumber*.tif. Before saving the final output of single merged files, create a folder called “merged” and set the working directory to this folder, e.g., setwd("your directory_MODIS/merged").
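Loading numbered files in numerical order matters because a plain alphabetical sort places file 10 before file 2, which would scramble the time series. The sketch below illustrates the idea in Python, assuming file names that follow the NDVI_final_*consecutivenumber*.tif pattern described above. The helper name natural_key is hypothetical, and this is not the R code from the repository.

```python
import re

def natural_key(name):
    """Split a file name into text and integer chunks so that 2 sorts before 10."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]

files = ["NDVI_final_10.tif", "NDVI_final_2.tif", "NDVI_final_1.tif"]

alphabetical = sorted(files)              # file 10 lands before file 2
ordered = sorted(files, key=natural_key)  # 1, 2, 10
```

The key assumes all names share the same text/number layout; mixing differently structured names could make the chunk lists incomparable.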

3_CROP_MODIS_merged_tiles.R

Now we want to crop the derived MODIS tiles to our study area. We are using a mask, which is provided as a .shp file in the repository, named "MERGED_LEVANT.shp". We load the merged .tif files and crop the stack with the vector. When saving to individual files, we name them “NDVI_merged_clip_*consecutivenumber*.tif”. We have now produced single cropped NDVI time series data from MODIS.
The repository provides the already clipped and merged NDVI datasets.

4_TREND_analysis_NDVI.R

Now, we want to perform trend analysis on the derived data. The data we load is tricky, as it contains a 16-day return period across each year for a period of 22 years. Growing season sums cover MAM (March-May), JJA (June-August), and SON (September-November). December is represented as a single file, which means that the period DJF (December-February) is represented by 5 images instead of 6. For the last DJF period (December 2022), the data from January and February 2023 can be added. The code selects the respective images from the stack, depending on which period is under consideration. From these stacks, individual annually resolved growing season sums are generated and the slope is calculated. We can then extract the p-values of the trend and mark all values that are significant at the 0.05 level. Using the ggplot2 package and the melt function from the reshape2 package, we can create a plot of the reclassified NDVI trends together with a local smoother (LOESS) of value 0.3.
To increase comparability and understand the amplitude of the trends, z-scores were calculated and plotted, which show the deviation of the values from the mean. This has been done for the NDVI values as well as the GLDAS climate variables as a normalization technique.
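The two quantities named above, the slope of the annual seasonal sums and the z-scores, can be sketched for a single pixel's time series as follows. This is an illustration only: it assumes an ordinary least-squares slope against the year index and z-scores based on the sample standard deviation, whereas the repository script may use a different trend estimator. The input values are hypothetical, and the p-value extraction is not shown.

```python
import math

def linear_trend(values):
    """Least-squares slope of the values against their index (one step per year)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    return sxy / sxx

def z_scores(values):
    """Deviation of each value from the series mean, in sample standard deviations."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

# Hypothetical growing-season NDVI sums for seven consecutive years
season_sums = [3.1, 3.0, 3.4, 3.3, 3.6, 3.5, 3.9]
slope = linear_trend(season_sums)
z = z_scores(season_sums)
```

Because z-scores are unitless, the same normalization lets NDVI and climate variables be compared on one scale.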

5_BUILT_UP_change_raster.R

Let us look at the landcover changes now. We are working with the terra package and get raster data from here: https://ghsl.jrc.ec.europa.eu/download.php?ds=bu (last accessed 03. March 2023, 100 m resolution, global coverage). Here, one can download the desired temporal coverage and reclassify it using the code after cropping to the individual study area. Here, I summed up the different rasters to characterize the built-up change in continuous values between 1975 and 2022.

6_POPULATION_numbers_plot.R

For this plot, one needs to load the .csv-file “Socio_cultural_political_development_database_FAO2023.csv” from the repository. The ggplot script provided produces the desired plot with all countries under consideration.

7_YIELD_plot.R

In this section, we are using the country productivity data from the supplement in the repository folder “yield_productivity” (e.g., "Jordan_yield.csv"). Each of the single country yield datasets is plotted with ggplot, and the plots are combined using the patchwork package in R.

8_GLDAS_read_extract_trend

The last code provides the basis for the trend analysis of the climate variables used in the paper. The raw data can be accessed at https://disc.gsfc.nasa.gov/datasets?keywords=GLDAS%20Noah%20Land%20Surface%20Model%20L4%20monthly&page=1 (last accessed 9th of October 2023). The raw data comes in .nc file format, and various variables can be extracted using the [“^a variable name”] command from the spatraster collection. Each time you run the code, this variable name must be adjusted to the variable of interest. For the abbreviations, see https://disc.gsfc.nasa.gov/datasets/GLDAS_CLSM025_D_2.0/summary (last accessed 09th of October 2023) or the respective code chunk when reading a .nc file with the ncdf4 package in R; alternatively, run print(nc) from the code or use names(the spatraster collection).
Choosing one variable, the code uses the MERGED_LEVANT.shp mask from the repository to crop and mask the data to the outline of the study area.
From the processed data, trend analyses are conducted and z-scores are calculated following the code described above. However, annual trends require the frequency of the time series analysis to be set to value = 12. For variables such as rainfall, which is measured as annual sums and not means, the chunk r.sum=r.sum/12 has to be removed or set to r.sum=r.sum/1 to avoid calculating annual mean values (see other variables). Seasonal subsets can be calculated as described in the code. Here, 3-month subsets were chosen for growing seasons, e.g. March-May (MAM), June-August (JJA), September-November (SON), and December-February (DJF, including Jan/Feb of the consecutive year).
From the data, mean values of 48 consecutive years are calculated and trend analyses are performed as described above. In the same way, p-values are extracted and values at the 95 % confidence level are marked with dots on the raster plot. This analysis can be performed with a much longer time series, other variables, and different spatial extents across the globe due to the availability of the GLDAS variables.
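The distinction between annual sums (e.g. rainfall) and annual means (most other variables), and the way a DJF subset reaches into the following year, can be sketched on a flat list of monthly values. This is a simplified Python illustration under the assumption that the series starts in January and has 12 values per year. The function names are hypothetical and do not correspond to the R code in 8_GLDAS_read_extract_trend.

```python
def annual_aggregate(monthly, as_sum):
    """Collapse monthly values (Jan..Dec, repeated per year) into annual sums or means."""
    years = [monthly[i:i + 12] for i in range(0, len(monthly), 12)]
    # Dividing by 12 gives the annual mean; skipping the division keeps the annual sum
    return [sum(y) if as_sum else sum(y) / 12 for y in years]

def djf_values(monthly, year_index):
    """December of the given year plus January and February of the following year."""
    start = year_index * 12 + 11
    return monthly[start:start + 3]

# Two years of placeholder monthly values: 1..12 for year 0, 13..24 for year 1
monthly = list(range(1, 25))
annual_sums = annual_aggregate(monthly, as_sum=True)
annual_means = annual_aggregate(monthly, as_sum=False)
first_djf = djf_values(monthly, 0)
```

For the final year of a series, djf_values returns fewer than three values unless the following January and February are appended, which mirrors why the extra 2023 months are provided.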

EEA Air Quality In-Situ Measurement Station Data

zenodo.org

bin, zip

Updated Dec 5, 2024

Johannes Heisig (2024). EEA Air Quality In-Situ Measurement Station Data [Dataset]. http://doi.org/10.5281/zenodo.13220430

Available download formats: zip, bin

Unique identifier

https://doi.org/10.5281/zenodo.13220430

Dataset updated

Dec 5, 2024

Dataset provided by

Zenodo (http://zenodo.org/)

Authors

Johannes Heisig

License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Introduction

This dataset is a value-added product based on 'Up-to-date air quality station measurements', administered by the European Environment Agency (EEA) and collected by its member states. The original hourly measurement data (NO2, SO2, O3, PM10, PM2.5 in µg/m³) was reshaped, gapfilled and aggregated to different temporal resolutions, making it ready to use in time series analysis or spatial interpolation tasks.

Reproducible code for accessing and processing this data and notebooks for demonstration can be found on Github.

Accessing and pre-processing hourly data

Hourly data was retrieved through the API of the EEA Air Quality Download Service. Measurements (single files per station and pollutant) were joined to create a single time series per station with observations for multiple pollutants. As PM2.5 data is sparse but correlates well with PM10, gapfilling was performed according to methods described in Horálek et al., 2023¹. Validity and verification flags from the original data were passed on for quality filtering. Reproducible computational notebooks using the R programming language are available for the data access and the gapfilling procedure.

Temporal aggregates

Data was aggregated to three coarser temporal resolutions: day, month, and year. Coverage (ratio of non-missing values) was calculated for each pollutant and temporal increment. A threshold of 75% was applied to generate reliable aggregates. All pollutants were aggregated by their arithmetic mean. Additionally, two pollutants were aggregated using a percentile method, which has been shown to be more appropriate for mapping applications. PM10 was summarized using the 90.41th percentile. Daily O3 was further summarized as the maximum of the 8-hour running mean. Based thereon, monthly and annual O3 was aggregated using the 93.15th percentile of the daily maxima. For more details refer to the reproducible computational notebook on temporal aggregation.
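As a rough illustration of the coverage rule, the sketch below aggregates one increment (for example, 24 hourly values into a daily mean) and withholds the result when too few values are present. It assumes that an aggregate is kept when coverage is at least 75% and that missing values are represented as None. The function name is hypothetical, and the dataset's notebooks remain the reference for the actual procedure.

```python
def aggregate_with_coverage(values, threshold=0.75):
    """Return (mean, coverage); the mean is None when coverage is below the threshold."""
    present = [v for v in values if v is not None]
    coverage = len(present) / len(values)   # ratio of non-missing values
    if coverage < threshold:
        return None, coverage
    return sum(present) / len(present), coverage

# Hypothetical day with 18 of 24 hourly values present (coverage 0.75)
day_ok = [10.0] * 18 + [None] * 6
# Hypothetical day with 17 of 24 hourly values present (coverage below 0.75)
day_sparse = [10.0] * 17 + [None] * 7

mean_ok, cov_ok = aggregate_with_coverage(day_ok)
mean_sparse, cov_sparse = aggregate_with_coverage(day_sparse)
```

The coverage value is returned alongside the mean so that it can be stored in columns such as cov.day_, cov.month_, and cov.year_.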

Data columns

column	hourly	daily	monthly	annual	description
Air.Quality.Station.EoI.Code	x	x	x	x	Unique station ID
Countrycode	x	x	x	x	Two-letter ISO country code
Start	x				Start time of (hourly) measurement period
	x	x	x	x	One of NO2; SO2; O3; O3_max8h_93.15; PM10; PM10_90.41; PM2.5 in µg/m³
Validity_	x				Validity flag of the respective pollutant
Verification_	x				Verification flag of the respective pollutant
filled_PM2.5	x				Flag indicating if PM2.5 value is measured or supplemented through gapfilling (boolean)
year		x	x	x	Year (2015-2023)
cov.year_		x		x	Data coverage throughout the year (0-1)
month		x	x		Month (1-12)
cov.month_		x	x		Data coverage throughout the month (0-1)
doy		x			Day of year (0-366)
cov.day_		x			Data coverage throughout the day (0-1)

Station meta data

To avoid redundant information and optimize file size, some relevant meta data is not stored in the air quality data tables, but rather separately (in a file named "EEA_stations_meta_table.parquet"). This includes type and area of measurement stations, as well as their coordinates.

column	description
Air.Quality.Station.EoI.Code	Unique station ID (required for join)
Countrycode	Two-letter ISO country code
Station.Type	One of "background", "industrial", or "traffic"
Station.Area	One of "urban", "suburban", "rural", "rural-nearcity", "rural-regional", "rural-remote"
Longitude & Latitude	Geographic coordinates of the station

Parquet file format

This dataset is shipped in Parquet files. Hourly and aggregated data are distributed in four individual datasets. Daily and hourly data are partitioned by `Countrycode` (one file per country) to enable reading smaller subsets. Monthly and annual data files are small (< 20 MB) and stored in a single file each. Parquet is a relatively new and very memory-efficient format that differs from traditional tabular file formats (e.g. CSV) in that it is binary and cannot be opened and displayed by common tabular software (e.g. MS Excel, LibreOffice). Users instead have to use an Apache Arrow implementation, for example in Python, R, C++, or another language. Reading the data there is straightforward (see the code samples below).

R code:

# required libraries
library(arrow)
library(dplyr)

# read air quality and meta data
aq = read_parquet("airquality.no2.o3.so2.pm10.pm2p5_4.annual_pnt_20150101_20231231_eu_epsg.3035_v20240718.parquet")
meta = read_parquet("EEA_stations_meta_table.parquet")

# join the two for further analysis
aq_meta = inner_join(aq, meta, by = join_by(Air.Quality.Station.EoI.Code))

Python code:

# required libraries
import pandas as pd

# read air quality and meta data
aq = pd.read_parquet("airquality.no2.o3.so2.pm10.pm2p5_4.annual_pnt_20150101_20231231_eu_epsg.3035_v20240718.parquet")
meta = pd.read_parquet("EEA_stations_meta_table.parquet")

# join the two for further analysis
aq_meta = aq.merge(meta,on = ["Air.Quality.Station.EoI.Code", "Countrycode"])

Y. Lan, B. J. Theobald, R. Harvey (2024). Dataset: View independent computer...
service.tib.eu
resodate.org
Updated Dec 16, 2024
(2024). Y. Lan, B. J. Theobald, R. Harvey (2024). Dataset: View independent computer lip-reading. https://doi.org/10.57702/u92i5cay [Dataset]. https://service.tib.eu/ldmservice/dataset/view-independent-computer-lip-reading
Dataset updated
Dec 16, 2024
Description
View independent computer lip-reading.
Data from: Growth dynamics of bottlenose dolphins (Tursiops truncatus) at...
zenodo.org
zip
Updated Dec 27, 2024
Leah Crowe; Matthew Schofield; Stephen Dawson; William Rayment (2024). Growth dynamics of bottlenose dolphins (Tursiops truncatus) at the southern extreme of their range [Dataset]. http://doi.org/10.5281/zenodo.14563512
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.14563512
Dataset updated
Dec 27, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Leah Crowe; Matthew Schofield; Stephen Dawson; William Rayment
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Dec 2024
Description
Version release associated with GitHub repository: https://github.com/leahcrowe-otago/FBD_measurements/tree/main

Workflow:

Identify when in video dolphin surfaces, ID that dolphin

Screen grab the best image of the dolphin above that shows rostrum to tail at/near the surface

used VLC player to go frame by frame

Add screen grab filename to the time and ID from #1

Run "./scripts/Altitudes for snapshots" for each day of droning

this adds a check to make sure GPS of the LiDAR and the metadata from the drone match up

finds altitude for each snapshot

matches up to IDs and outputs file to enter measurements

identifies if altimeter reading meets criteria per Dickson et al. 2020 ("issues" column)

Populate the "altperimage_*" file with measurements (.csv files from Whalength are later merged in "./scripts/whalength_file_merge.R")

filter by issue == "N"

hide columns C, G, H, and J to AC

Run each resulting image through Whalength

I1P Video stills, -1.5 cm offset

Run "measurements_demo.Rmd" to wrangle data, incorporate life history info ("demo"graphy data), and create some supplementary figures

calls "whalength_file_merge.R"

merges all Whalength outputs

creates Fig. S4, calibration measurements

Supplementary Figures S5

data output: see data repo for metadata details

individual i measurement data at each capture occasion j for model

individual life history data

Model run in the "./scripts/stan/runstan_allo_mv_t0.R" file: 'run' the 'stan' code of 'allo'metric measurement data in a 'm'ulti'v'ariate model with fixed age at length zero (t0)

data formatting for model

calls the Stan model

main model: "./scripts/stan/vb_mod_all0_t0.stan"

for the supplementary sex/pod effect model: "./scripts/stan/sex_pod_effects.R" runs the model specified in "./scripts/stan/vb_mod_all0_t0_sexpod.stan"

initial values for von Bertalanffy model (init_vb)

fit the model (fit_vb)

save results

"./scripts/Results.R"

read data

summarise results

create table S2

create Figs. 2–4, S6, S7

"Supplementary.qmd" creates the suppmat file

creates supplementary tables

Section 1.3 created from './scripts/Results_age_adjusted.R'

Fig. S1 map created by calling "./scripts/ms_map.R"

Data: "./data/Measurements/Data for review"

Metadata-description.pdf: describes the data in the following files:

ij_1.csv

ij_2.csv

ij_3.csv

ij_ID.csv
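The workflow above fits a von Bertalanffy growth model with a fixed age at length zero (t0). For orientation, the textbook form of that curve is L(t) = L_inf * (1 - exp(-k * (t - t0))), sketched below for a single measurement. This is an assumption-laden illustration: the Stan model in the repository is multivariate and may be parameterized differently, and the parameter values here are hypothetical, not estimates from the dataset.

```python
import math

def von_bertalanffy(age, l_inf, k, t0=0.0):
    """Expected length at a given age under the standard von Bertalanffy curve."""
    return l_inf * (1.0 - math.exp(-k * (age - t0)))

# Hypothetical parameters: asymptotic length 3.0 m, growth rate 0.2 per year, t0 fixed at 0
ages = [0, 1, 5, 20, 40]
lengths = [von_bertalanffy(a, l_inf=3.0, k=0.2) for a in ages]
```

Fixing t0 removes one parameter from the fit, so only the asymptotic length and the growth rate need to be estimated for each measurement.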
Trends in Reading and Language Arts Proficiency (2011-2022): Seneca R-VII...
publicschoolreview.com
+ more versions
Public School Review, Trends in Reading and Language Arts Proficiency (2011-2022): Seneca R-VII School District vs. Missouri [Dataset]. https://www.publicschoolreview.com/missouri/seneca-r-vii-school-district/2927900-school-district
Dataset authored and provided by
Public School Review
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Seneca R-VII School District
Description
This dataset tracks annual reading and language arts proficiency from 2011 to 2022 for Seneca R-VII School District vs. Missouri
Trends in Reading and Language Arts Proficiency (2011-2022): Ozark R-VI...
publicschoolreview.com
+ more versions
Public School Review, Trends in Reading and Language Arts Proficiency (2011-2022): Ozark R-VI School District vs. Missouri [Dataset]. https://www.publicschoolreview.com/missouri/ozark-r-vi-school-district/2923430-school-district
Dataset authored and provided by
Public School Review
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Ozark R-VI School District
Description
This dataset tracks annual reading and language arts proficiency from 2011 to 2022 for Ozark R-VI School District vs. Missouri
Trends in Reading and Language Arts Proficiency (2011-2022): Daniel Young...
publicschoolreview.com
Updated Oct 26, 2025
Public School Review (2025). Trends in Reading and Language Arts Proficiency (2011-2022): Daniel Young Elementary School vs. Missouri vs. Blue Springs R-IV School District [Dataset]. https://www.publicschoolreview.com/daniel-young-elementary-school-profile
Dataset updated
Oct 26, 2025
Dataset authored and provided by
Public School Review
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Blue Springs R-IV School District
Description
This dataset tracks annual reading and language arts proficiency from 2011 to 2022 for Daniel Young Elementary School vs. Missouri and Blue Springs R-IV School District

Dylan Westfall; Mullins James (2023). Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies [Dataset]. http://doi.org/10.5061/dryad.w3r2280w0

Data from: Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies

Available download formats: zip

Unique identifier

https://doi.org/10.5061/dryad.w3r2280w0

Dataset updated

Dec 7, 2023

Dataset provided by

HIV Vaccine Trials Network (http://www.hvtn.org/)
National Institute of Allergy and Infectious Diseases (http://www.niaid.nih.gov/)
HIV Prevention Trials Network (http://www.hptn.org/)
PEPFAR

Authors

Dylan Westfall; Mullins James

License

https://spdx.org/licenses/CC0-1.0.html

Description

Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR and the use of UMI allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies. Methods This serves as an overview of the analysis performed on PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al. 
"Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies" Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005 For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub. The demultiplexed read collections from the chunked_demux pipeline or CCS read files from datasets which were not indexed (M1567, M004, M005) were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer) and further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub. The consensus.tar.gz and tagged.tar.gz files were moved from sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). 
Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each contains an .Rmd file of the same name, which is used to collect, summarize, and analyze the data. All of this code was written and executed in RStudio to track notes and summarize results.

Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI sequences and one with all dUMI sequences. Using these as input, two data tables were created that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, two fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the Python script compare_seqs.py, which identifies any matched sequences that differ between the sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program.

To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed and, for each family with discordant sUMI and dUMI sequences, the reads from the UMI1_keeping directory were aligned using Geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Geneious. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. The uncompressed tagged folder was then removed to save space.
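The comparison of matched sUMI and dUMI sequences can be illustrated with a minimal sketch. This is not the actual compare_seqs.py; it assumes that matching records share the same FASTA header in both collections, and the function names `read_fasta` and `discordant_ids` are hypothetical.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a dict mapping record name -> sequence."""
    records = {}
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Store the previous record before starting a new one.
            if name is not None:
                records[name] = "".join(chunks)
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks = []
        elif name is not None:
            chunks.append(line.upper())
    if name is not None:
        records[name] = "".join(chunks)
    return records


def discordant_ids(sumi, dumi):
    """Return sorted names present in both collections whose sequences differ."""
    shared = sumi.keys() & dumi.keys()
    return sorted(name for name in shared if sumi[name] != dumi[name])


if __name__ == "__main__":
    # Toy records only (assumed layout: same header in both files).
    sumi = read_fasta(">fam1\nACGT\n>fam2\nGGCC\n".splitlines())
    dumi = read_fasta(">fam1\nACGT\n>fam2\nGGCA\n".splitlines())
    print(discordant_ids(sumi, dumi))
```

An exact string comparison is used here for simplicity. Whether the real script normalizes case, trims primers, or tolerates gaps is not stated in the description.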
By examining the alignment, and specifically the UMI2 sequences, the site of the discordance and its cause were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper.

Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, and these files were combined to create a single dataset containing all UMI information from all samples. This file, dUMI_df.csv, was used as input for Figures.Rmd.

Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for each figure. These were copied into Prism software to create the final figures for the paper.
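The step of pulling every dUMI_ranked.csv out of the tagged archives and combining them into one table could look roughly like the following Python sketch. The original analysis did this in R within the .Rmd file. The internal archive layout, the use of the parent directory as a sample label, and the added `source_archive` and `source_dir` columns are all assumptions made for illustration.

```python
import csv
import io
import tarfile
from pathlib import PurePosixPath


def combine_ranked_tables(archive_paths, member_suffix="dUMI_ranked.csv"):
    """Collect rows from every matching CSV inside the given .tar.gz archives."""
    rows = []
    for archive_path in archive_paths:
        with tarfile.open(archive_path, "r:gz") as archive:
            for member in archive.getmembers():
                path = PurePosixPath(member.name)
                if not member.isfile() or not path.name.endswith(member_suffix):
                    continue
                handle = archive.extractfile(member)
                text = io.TextIOWrapper(handle, encoding="utf-8", newline="")
                for row in csv.DictReader(text):
                    # Record provenance (hypothetical columns, not in the source data).
                    row["source_archive"] = str(archive_path)
                    row["source_dir"] = path.parent.name
                    rows.append(row)
    return rows


def write_table(rows, out_path):
    """Write the combined rows to a CSV, using the union of all column names."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Reading members directly from the archive avoids leaving a large uncompressed tagged folder on disk, which the description notes was removed to save space.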


